* [RFC][PATCH v5 00/14] sched: packing tasks
@ 2013-10-18 11:52 Vincent Guittot
  2013-10-18 11:52 ` [RFC][PATCH v5 01/14] sched: add a new arch_sd_local_flags for sched_domain init Vincent Guittot
                   ` (13 more replies)
  0 siblings, 14 replies; 101+ messages in thread
From: Vincent Guittot @ 2013-10-18 11:52 UTC (permalink / raw)
  To: linux-kernel, peterz, mingo, pjt, Morten.Rasmussen, cmetcalf,
	tony.luck, alex.shi, preeti, linaro-kernel
  Cc: rjw, paulmck, corbet, tglx, len.brown, arjan, amit.kucheria,
	l.majewski, Vincent Guittot

This is the 5th version of the patchset previously named "packing small tasks".
"small" has been dropped from the name because the patchset no longer targets
only small tasks.

This patchset takes advantage of the new per-task load tracking that is
available in the scheduler to pack tasks onto a minimal number of
CPUs/clusters/cores. The packing mechanism takes the power gating topology of
the CPUs into account in order to minimize the number of power domains that
need to be powered on simultaneously.

Most of the code has been put in the fair.c file but it can easily be moved
elsewhere. This patchset tries to solve one part of the larger
energy-efficient scheduling problem and should be merged with other
proposals that solve other parts, like the power scheduler proposed by Morten.

The packing is done in 3 steps:

The 1st step creates a topology of the power gating of the CPUs that helps the
scheduler choose which CPUs will handle the current activity. This topology is
described thanks to a new flag, SD_SHARE_POWERDOMAIN, which indicates whether
the groups of CPUs of a scheduling domain share their power state. In order to
be efficient, a group of CPUs that share their power state will be used (or
not) simultaneously. By default, this flag is set at all sched_domain levels in
order to keep the current behavior of the scheduler unchanged.
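As an example of the intended use, on a dual-cluster system whose clusters can
be power gated independently but whose cores cannot, the flag would stay set at
the core (MC) level and be cleared at the cluster level (see the buddy tables
in patch 03/14).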

The 2nd step evaluates the current activity of the system and creates a list of
CPUs for handling it. The activity threshold above which another CPU is added
to the list is set to 80% by default but is configurable through the
sched_packing_level knob. The activity level and the involvement of a CPU in
the packing effort are evaluated during the periodic load balance, similarly to
cpu_power. Then, the default load balancing behavior is used to balance tasks
between this reduced list of CPUs.
As the current activity doesn't take a newly created task into account, an
unused CPU can still be selected at the task's first wakeup, until the activity
figures are updated.
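
To give a rough worked example of the default 80% figure, based on the
threshold computation introduced in patch 05/14 (sd_pack_threshold =
(100 * 1024) / sched_packing_level): with the default level of 80, the measured
activity is scaled by 1280/1024 (i.e. multiplied by 1.25) before being compared
with the available capacity of the packing CPUs, so a single CPU of capacity
1024 is considered sufficient as long as the system activity stays below about
819 (80% of its capacity), and the next group of CPUs is added above that.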

The 3rd step occurs when the scheduler selects a target CPU for a newly
awakened task. The current wakeup latency of idle CPUs is used to select the
one in the shallowest C-state. In some situations where the task load is small
compared to this latency, the newly awakened task can even stay on the current
CPU. Since load is the main metric of the scheduler, the wakeup latency is
transposed into an equivalent load so that the current load balance mechanism,
which is based on load comparison, is kept unchanged. A shared structure has
been created to exchange information between the scheduler and cpuidle (or any
other framework that needs to share information). The wakeup latency is the
only field for the moment but it could be extended with additional useful
information like the target load or the expected sleep duration of a CPU.
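
For example, with the conversion used in patch 13/14 (idle load =
(wakeup latency in us * 21) >> 10), an idle CPU whose current wakeup latency is
1000us is seen as carrying a load of about 20 on the scheduler's 1024-based
load scale, so only a task whose load contribution is at least that value
justifies waking it up.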

The patchset is based on v3.12-rc2 and is available in the git tree:
git://git.linaro.org/people/vingu/kernel.git
branch sched-packing-small-tasks-v5

If you want to test the patchset, you must first enable
CONFIG_SCHED_PACKING_TASKS. Then, you also need to provide an
arch_sd_local_flags implementation that clears the SD_SHARE_POWERDOMAIN flag at
the appropriate level for your architecture. This has already been done for the
ARM architecture in the patchset.
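
As a minimal sketch (not part of the series, and assuming a platform where the
cores of a cluster share a power domain while the clusters can be power gated
independently), such an implementation could look like:

	/*
	 * level carries the SD_SHARE_* flags of the topology level being
	 * built (see the SD_*_INIT callers in patch 01/14); cpu is unused
	 * in this simple case.
	 */
	int arch_sd_local_flags(int level, int cpu)
	{
		/* cores inside a cluster cannot be power gated separately */
		if (level & SD_SHARE_PKG_RESOURCES)
			return SD_SHARE_POWERDOMAIN;

		/* clusters are independent power domains */
		return 0;
	}

The ARM patch (02/14, not quoted here) follows the same idea but derives the
power domain boundaries from DT information.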

The figures below show the latency of cyclictest with and without the patchset
on an ARM platform running a v3.11 kernel. The test has been run 10 times on
each kernel.
#cyclictest -t 3 -q -e 1000000 -l 3000 -i 1800 -d 100
                 average (us)  stdev (us)
v3.11            381.5         79.86
v3.11 + patches  173.83        13.62

Change since V4:
 - v4 posting: https://lkml.org/lkml/2013/4/25/396
 - Keep only the aggressive packing mode.
 - Add a finer-grained power domain description mechanism that includes a DT
   description
 - Add a structure to share information with other frameworks
 - Use the current wakeup latency of an idle CPU when selecting the target idle
   CPU
 - The whole task packing mechanism can be disabled with a single config option

Change since V3:
 - v3 posting: https://lkml.org/lkml/2013/3/22/183
 - Take into account comments on the previous version.
 - Add an aggressive packing mode and a knob to select between the various modes

Change since V2:
 - v2 posting: https://lkml.org/lkml/2012/12/12/164
 - Migrate only a task that wakes up
 - Change the light tasks threshold to 20%
 - Change the loaded CPU threshold so as not to pull tasks if the current number
   of running tasks is zero but the load average is already greater than 50%
 - Fix the algorithm for selecting the buddy CPU.

Change since V1:
 - v1 posting: https://lkml.org/lkml/2012/10/7/19
Patch 2/6
 - Change the flag name which was not clear. The new name is
   SD_SHARE_POWERDOMAIN.
 - Create an architecture dependent function to tune the sched_domain flags
Patch 3/6
 - Fix issues in the algorithm that looks for the best buddy CPU
 - Use pr_debug instead of pr_info
 - Fix for uniprocessor
Patch 4/6
 - Remove the use of usage_avg_sum which has not been merged
Patch 5/6
 - Change the way the coherency of runnable_avg_sum and runnable_avg_period is
   ensured
Patch 6/6
 - Use the arch dependent function to set/clear SD_SHARE_POWERDOMAIN for ARM
   platform

Vincent Guittot (14):
  sched: add a new arch_sd_local_flags for sched_domain init
  ARM: sched: clear SD_SHARE_POWERDOMAIN
  sched: define pack buddy CPUs
  sched: do load balance only with packing cpus
  sched: add a packing level knob
  sched: create a new field with available capacity
  sched: get CPU's activity statistic
  sched: move load idx selection in find_idlest_group
  sched: update the packing cpu list
  sched: init this_load to max in find_idlest_group
  sched: add a SCHED_PACKING_TASKS config
  sched: create a statistic structure
  sched: differantiate idle cpu
  cpuidle: set the current wake up latency

 arch/arm/include/asm/topology.h  |    4 +
 arch/arm/kernel/topology.c       |   50 ++++-
 arch/ia64/include/asm/topology.h |    3 +-
 arch/tile/include/asm/topology.h |    3 +-
 drivers/cpuidle/cpuidle.c        |   11 ++
 include/linux/sched.h            |   13 +-
 include/linux/sched/sysctl.h     |    9 +
 include/linux/topology.h         |   11 +-
 init/Kconfig                     |   11 ++
 kernel/sched/core.c              |   11 +-
 kernel/sched/fair.c              |  395 ++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h             |    8 +-
 kernel/sysctl.c                  |   17 ++
 13 files changed, 521 insertions(+), 25 deletions(-)

-- 
1.7.9.5


* [RFC][PATCH v5 01/14] sched: add a new arch_sd_local_flags for sched_domain init
  2013-10-18 11:52 [RFC][PATCH v5 00/14] sched: packing tasks Vincent Guittot
@ 2013-10-18 11:52 ` Vincent Guittot
  2013-11-05 14:06   ` Peter Zijlstra
  2013-10-18 11:52 ` [RFC][PATCH v5 03/14] sched: define pack buddy CPUs Vincent Guittot
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 101+ messages in thread
From: Vincent Guittot @ 2013-10-18 11:52 UTC (permalink / raw)
  To: linux-kernel, peterz, mingo, pjt, Morten.Rasmussen, cmetcalf,
	tony.luck, alex.shi, preeti, linaro-kernel
  Cc: rjw, paulmck, corbet, tglx, len.brown, arjan, amit.kucheria,
	l.majewski, Vincent Guittot

The function arch_sd_local_flags is used to set flags in sched_domains
according to the platform architecture. A new flag, SD_SHARE_POWERDOMAIN, is
also created to reflect whether the groups of CPUs of a sched_domain level can
reach different power states independently or not. As an example, the flag
should be cleared at CPU level if groups of cores can be power gated
independently. This information is used to decide if it's worth packing some
tasks in a group of CPUs in order to power gate the other groups instead of
spreading the tasks. The default behavior of the scheduler is to spread tasks
across CPUs and groups of CPUs, so the flag is set in all sched_domains.

The cpu parameter of arch_sd_local_flags can be used by the architecture to
fine-tune the sched_domain flags. As an example, the SD_SHARE_POWERDOMAIN flag
can be set differently for groups of CPUs according to DT information.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 arch/ia64/include/asm/topology.h |    3 ++-
 arch/tile/include/asm/topology.h |    3 ++-
 include/linux/sched.h            |    1 +
 include/linux/topology.h         |   11 ++++++++---
 kernel/sched/core.c              |   10 ++++++++--
 5 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h
index a2496e4..4844896 100644
--- a/arch/ia64/include/asm/topology.h
+++ b/arch/ia64/include/asm/topology.h
@@ -46,7 +46,7 @@
 
 void build_cpu_to_node_map(void);
 
-#define SD_CPU_INIT (struct sched_domain) {		\
+#define SD_CPU_INIT(cpu) (struct sched_domain) {	\
 	.parent			= NULL,			\
 	.child			= NULL,			\
 	.groups			= NULL,			\
@@ -65,6 +65,7 @@ void build_cpu_to_node_map(void);
 				| SD_BALANCE_EXEC	\
 				| SD_BALANCE_FORK	\
 				| SD_WAKE_AFFINE,	\
+				| arch_sd_local_flags(0, cpu)\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
 	.nr_balance_failed	= 0,			\
diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h
index d15c0d8..6f7a97d 100644
--- a/arch/tile/include/asm/topology.h
+++ b/arch/tile/include/asm/topology.h
@@ -51,7 +51,7 @@ static inline const struct cpumask *cpumask_of_node(int node)
  */
 
 /* sched_domains SD_CPU_INIT for TILE architecture */
-#define SD_CPU_INIT (struct sched_domain) {				\
+#define SD_CPU_INIT(cpu) (struct sched_domain) {			\
 	.min_interval		= 4,					\
 	.max_interval		= 128,					\
 	.busy_factor		= 64,					\
@@ -71,6 +71,7 @@ static inline const struct cpumask *cpumask_of_node(int node)
 				| 0*SD_WAKE_AFFINE			\
 				| 0*SD_SHARE_CPUPOWER			\
 				| 0*SD_SHARE_PKG_RESOURCES		\
+				| arch_sd_local_flags(0, cpu)		\
 				| 0*SD_SERIALIZE			\
 				,					\
 	.last_balance		= jiffies,				\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6682da3..2004cdb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -763,6 +763,7 @@ enum cpu_idle_type {
 #define SD_BALANCE_WAKE		0x0010  /* Balance on wakeup */
 #define SD_WAKE_AFFINE		0x0020	/* Wake task to waking CPU */
 #define SD_SHARE_CPUPOWER	0x0080	/* Domain members share cpu power */
+#define SD_SHARE_POWERDOMAIN	0x0100	/* Domain members share power domain */
 #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
 #define SD_SERIALIZE		0x0400	/* Only a single load balancing instance */
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
diff --git a/include/linux/topology.h b/include/linux/topology.h
index d3cf0d6..f3cd3c2 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -85,7 +85,7 @@ int arch_update_cpu_topology(void);
 #define ARCH_HAS_SCHED_WAKE_IDLE
 /* Common values for SMT siblings */
 #ifndef SD_SIBLING_INIT
-#define SD_SIBLING_INIT (struct sched_domain) {				\
+#define SD_SIBLING_INIT(cpu) (struct sched_domain) {			\
 	.min_interval		= 1,					\
 	.max_interval		= 2,					\
 	.busy_factor		= 64,					\
@@ -99,6 +99,8 @@ int arch_update_cpu_topology(void);
 				| 1*SD_WAKE_AFFINE			\
 				| 1*SD_SHARE_CPUPOWER			\
 				| 1*SD_SHARE_PKG_RESOURCES		\
+				| arch_sd_local_flags(SD_SHARE_CPUPOWER|\
+					SD_SHARE_PKG_RESOURCES, cpu)	\
 				| 0*SD_SERIALIZE			\
 				| 0*SD_PREFER_SIBLING			\
 				| arch_sd_sibling_asym_packing()	\
@@ -113,7 +115,7 @@ int arch_update_cpu_topology(void);
 #ifdef CONFIG_SCHED_MC
 /* Common values for MC siblings. for now mostly derived from SD_CPU_INIT */
 #ifndef SD_MC_INIT
-#define SD_MC_INIT (struct sched_domain) {				\
+#define SD_MC_INIT(cpu) (struct sched_domain) {				\
 	.min_interval		= 1,					\
 	.max_interval		= 4,					\
 	.busy_factor		= 64,					\
@@ -131,6 +133,8 @@ int arch_update_cpu_topology(void);
 				| 1*SD_WAKE_AFFINE			\
 				| 0*SD_SHARE_CPUPOWER			\
 				| 1*SD_SHARE_PKG_RESOURCES		\
+				| arch_sd_local_flags(			\
+					SD_SHARE_PKG_RESOURCES, cpu)	\
 				| 0*SD_SERIALIZE			\
 				,					\
 	.last_balance		= jiffies,				\
@@ -141,7 +145,7 @@ int arch_update_cpu_topology(void);
 
 /* Common values for CPUs */
 #ifndef SD_CPU_INIT
-#define SD_CPU_INIT (struct sched_domain) {				\
+#define SD_CPU_INIT(cpu) (struct sched_domain) {			\
 	.min_interval		= 1,					\
 	.max_interval		= 4,					\
 	.busy_factor		= 64,					\
@@ -161,6 +165,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_WAKE_AFFINE			\
 				| 0*SD_SHARE_CPUPOWER			\
 				| 0*SD_SHARE_PKG_RESOURCES		\
+				| arch_sd_local_flags(0, cpu)		\
 				| 0*SD_SERIALIZE			\
 				| 1*SD_PREFER_SIBLING			\
 				,					\
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5ac63c9..735e964 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5460,6 +5460,11 @@ int __weak arch_sd_sibling_asym_packing(void)
        return 0*SD_ASYM_PACKING;
 }
 
+int __weak arch_sd_local_flags(int level, int cpu)
+{
+	return 1*SD_SHARE_POWERDOMAIN;
+}
+
 /*
  * Initializers for schedule domains
  * Non-inlined to reduce accumulated stack pressure in build_sched_domains()
@@ -5473,10 +5478,10 @@ int __weak arch_sd_sibling_asym_packing(void)
 
 #define SD_INIT_FUNC(type)						\
 static noinline struct sched_domain *					\
-sd_init_##type(struct sched_domain_topology_level *tl, int cpu) 	\
+sd_init_##type(struct sched_domain_topology_level *tl, int cpu)		\
 {									\
 	struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);	\
-	*sd = SD_##type##_INIT;						\
+	*sd = SD_##type##_INIT(cpu);					\
 	SD_INIT_NAME(sd, type);						\
 	sd->private = &tl->data;					\
 	return sd;							\
@@ -5652,6 +5657,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
 					| 0*SD_WAKE_AFFINE
 					| 0*SD_SHARE_CPUPOWER
 					| 0*SD_SHARE_PKG_RESOURCES
+					| 1*SD_SHARE_POWERDOMAIN
 					| 1*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
 					| sd_local_flags(level)
-- 
1.7.9.5



* [RFC][PATCH v5 03/14] sched: define pack buddy CPUs
  2013-10-18 11:52 [RFC][PATCH v5 00/14] sched: packing tasks Vincent Guittot
  2013-10-18 11:52 ` [RFC][PATCH v5 01/14] sched: add a new arch_sd_local_flags for sched_domain init Vincent Guittot
@ 2013-10-18 11:52 ` Vincent Guittot
  2013-10-18 11:52 ` [RFC][PATCH v5 04/14] sched: do load balance only with packing cpus Vincent Guittot
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 101+ messages in thread
From: Vincent Guittot @ 2013-10-18 11:52 UTC (permalink / raw)
  To: linux-kernel, peterz, mingo, pjt, Morten.Rasmussen, cmetcalf,
	tony.luck, alex.shi, preeti, linaro-kernel
  Cc: rjw, paulmck, corbet, tglx, len.brown, arjan, amit.kucheria,
	l.majewski, Vincent Guittot

During the creation of the sched_domains, we define a pack buddy CPU for each
CPU when one is available. We want to pack at all levels where a group of CPUs
can be power gated independently from the others.
On a system that can't power gate a group of CPUs independently, the flag is
set at all sched_domain levels and the buddy is set to -1. This is the default
behavior for all architectures.

On a dual-cluster / dual-core system which can power gate each core and
cluster independently, the buddy configuration will be:

      | Cluster 0   | Cluster 1   |
      | CPU0 | CPU1 | CPU2 | CPU3 |
-----------------------------------
buddy | CPU0 | CPU0 | CPU0 | CPU2 |

If the cores in a cluster can't be power gated independently, the buddy
configuration becomes:

      | Cluster 0   | Cluster 1   |
      | CPU0 | CPU1 | CPU2 | CPU3 |
-----------------------------------
buddy | CPU0 | CPU1 | CPU0 | CPU0 |

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/core.c  |    1 +
 kernel/sched/fair.c  |   70 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |    5 ++++
 3 files changed, 76 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 735e964..0bf5f4d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5184,6 +5184,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 	rcu_assign_pointer(rq->sd, sd);
 	destroy_sched_domains(tmp, cpu);
 
+	update_packing_domain(cpu);
 	update_top_cache_domain(cpu);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 11cd136..5547831 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -178,6 +178,76 @@ void sched_init_granularity(void)
 	update_sysctl();
 }
 
+#ifdef CONFIG_SMP
+#ifdef CONFIG_SCHED_PACKING_TASKS
+/*
+ * Save the id of the optimal CPU that should be used to pack small tasks
+ * The value -1 is used when no buddy has been found
+ */
+DEFINE_PER_CPU(int, sd_pack_buddy);
+
+/*
+ * Look for the best buddy CPU that can be used to pack small tasks
+ * We make the assumption that it doesn't wort to pack on CPU that share the
+ * same powerline. We look for the 1st sched_domain without the
+ * SD_SHARE_POWERDOMAIN flag. Then we look for the sched_group with the lowest
+ * power per core based on the assumption that their power efficiency is
+ * better
+ */
+void update_packing_domain(int cpu)
+{
+	struct sched_domain *sd;
+	int id = -1;
+
+	sd = highest_flag_domain(cpu, SD_SHARE_POWERDOMAIN);
+	if (!sd)
+		sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
+	else
+		sd = sd->parent;
+
+	while (sd && (sd->flags & SD_LOAD_BALANCE)
+		&& !(sd->flags & SD_SHARE_POWERDOMAIN)) {
+		struct sched_group *sg = sd->groups;
+		struct sched_group *pack = sg;
+		struct sched_group *tmp;
+
+		/*
+		 * The sched_domain of a CPU points on the local sched_group
+		 * and this CPU of this local group is a good candidate
+		 */
+		id = cpu;
+
+		/* loop the sched groups to find the best one */
+		for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
+			if (tmp->sgp->power * pack->group_weight >
+					pack->sgp->power * tmp->group_weight)
+				continue;
+
+			if ((tmp->sgp->power * pack->group_weight ==
+					pack->sgp->power * tmp->group_weight)
+			 && (cpumask_first(sched_group_cpus(tmp)) >= id))
+				continue;
+
+			/* we have found a better group */
+			pack = tmp;
+
+			/* Take the 1st CPU of the new group */
+			id = cpumask_first(sched_group_cpus(pack));
+		}
+
+		/* Look for another CPU than itself */
+		if (id != cpu)
+			break;
+
+		sd = sd->parent;
+	}
+
+	pr_debug("CPU%d packing on CPU%d\n", cpu, id);
+	per_cpu(sd_pack_buddy, cpu) = id;
+}
+#endif /* CONFIG_SCHED_PACKING_TASKS */
+#endif /* CONFIG_SMP */
+
 #if BITS_PER_LONG == 32
 # define WMULT_CONST	(~0UL)
 #else
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b3c5653..22e3f1d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1022,6 +1022,11 @@ extern void update_group_power(struct sched_domain *sd, int cpu);
 
 extern void trigger_load_balance(struct rq *rq, int cpu);
 extern void idle_balance(int this_cpu, struct rq *this_rq);
+#ifdef CONFIG_SCHED_PACKING_TASKS
+extern void update_packing_domain(int cpu);
+#else
+static inline void update_packing_domain(int cpu) {};
+#endif
 
 extern void idle_enter_fair(struct rq *this_rq);
 extern void idle_exit_fair(struct rq *this_rq);
-- 
1.7.9.5



* [RFC][PATCH v5 04/14] sched: do load balance only with packing cpus
  2013-10-18 11:52 [RFC][PATCH v5 00/14] sched: packing tasks Vincent Guittot
  2013-10-18 11:52 ` [RFC][PATCH v5 01/14] sched: add a new arch_sd_local_flags for sched_domain init Vincent Guittot
  2013-10-18 11:52 ` [RFC][PATCH v5 03/14] sched: define pack buddy CPUs Vincent Guittot
@ 2013-10-18 11:52 ` Vincent Guittot
  2013-10-18 11:52 ` [RFC][PATCH v5 05/14] sched: add a packing level knob Vincent Guittot
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 101+ messages in thread
From: Vincent Guittot @ 2013-10-18 11:52 UTC (permalink / raw)
  To: linux-kernel, peterz, mingo, pjt, Morten.Rasmussen, cmetcalf,
	tony.luck, alex.shi, preeti, linaro-kernel
  Cc: rjw, paulmck, corbet, tglx, len.brown, arjan, amit.kucheria,
	l.majewski, Vincent Guittot

Tasks will be scheduled only on the CPUs that participate in the packing
effort. A CPU participates in the packing effort when it is its own buddy.

For the ILB, look for an idle CPU close to the packing CPUs whenever possible.
The goal is to prevent the wake up of a CPU that doesn't share the power
domain of the pack buddy CPU.
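
For instance, on the dual-cluster example of patch 03/14 where CPU0 is the pack
buddy, find_new_ilb() below first looks for an idle CPU in the buddy's own
cluster (the buddy's sched_domains without SD_SHARE_CPUPOWER, walked bottom-up)
and only falls back to any idle CPU if none is found there, so an idle load
balance doesn't needlessly wake up the other cluster.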

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c |   80 ++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 76 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5547831..7149f38 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -186,6 +186,17 @@ void sched_init_granularity(void)
  */
 DEFINE_PER_CPU(int, sd_pack_buddy);
 
+static inline bool is_packing_cpu(int cpu)
+{
+	int my_buddy = per_cpu(sd_pack_buddy, cpu);
+	return (my_buddy == -1) || (cpu == my_buddy);
+}
+
+static inline int get_buddy(int cpu)
+{
+	return per_cpu(sd_pack_buddy, cpu);
+}
+
 /*
  * Look for the best buddy CPU that can be used to pack small tasks
  * We make the assumption that it doesn't wort to pack on CPU that share the
@@ -245,6 +256,32 @@ void update_packing_domain(int cpu)
 	pr_debug("CPU%d packing on CPU%d\n", cpu, id);
 	per_cpu(sd_pack_buddy, cpu) = id;
 }
+
+static int check_nohz_packing(int cpu)
+{
+	if (!is_packing_cpu(cpu))
+		return true;
+
+	return false;
+}
+#else /* CONFIG_SCHED_PACKING_TASKS */
+
+static inline bool is_packing_cpu(int cpu)
+{
+	return 1;
+}
+
+static inline int get_buddy(int cpu)
+{
+	return -1;
+}
+
+static inline int check_nohz_packing(int cpu)
+{
+	return false;
+}
+
+
 #endif /* CONFIG_SCHED_PACKING_TASKS */
 #endif /* CONFIG_SMP */
 
@@ -3370,7 +3407,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 
 	do {
 		unsigned long load, avg_load;
-		int local_group;
+		int local_group, packing_cpus = 0;
 		int i;
 
 		/* Skip over this group if it has no CPUs allowed */
@@ -3392,8 +3429,14 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 				load = target_load(i, load_idx);
 
 			avg_load += load;
+
+			if (is_packing_cpu(i))
+				packing_cpus = 1;
 		}
 
+		if (!packing_cpus)
+			continue;
+
 		/* Adjust by relative CPU power of the group */
 		avg_load = (avg_load * SCHED_POWER_SCALE) / group->sgp->power;
 
@@ -3448,7 +3491,8 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	/*
 	 * If the prevous cpu is cache affine and idle, don't be stupid.
 	 */
-	if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
+	if (i != target && cpus_share_cache(i, target) && idle_cpu(i)
+			&& is_packing_cpu(i))
 		return i;
 
 	/*
@@ -3463,7 +3507,8 @@ static int select_idle_sibling(struct task_struct *p, int target)
 				goto next;
 
 			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (i == target || !idle_cpu(i))
+				if (i == target || !idle_cpu(i)
+						|| !is_packing_cpu(i))
 					goto next;
 			}
 
@@ -3528,9 +3573,13 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 	}
 
 	if (affine_sd) {
-		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
+		if (cpu != prev_cpu && (wake_affine(affine_sd, p, sync)
+					|| !is_packing_cpu(prev_cpu)))
 			prev_cpu = cpu;
 
+		if (!is_packing_cpu(prev_cpu))
+			prev_cpu =  get_buddy(prev_cpu);
+
 		new_cpu = select_idle_sibling(p, prev_cpu);
 		goto unlock;
 	}
@@ -5593,7 +5642,26 @@ static struct {
 
 static inline int find_new_ilb(int call_cpu)
 {
+	struct sched_domain *sd;
 	int ilb = cpumask_first(nohz.idle_cpus_mask);
+	int buddy = get_buddy(call_cpu);
+
+	/*
+	 * If we have a pack buddy CPU, we try to run load balance on a CPU
+	 * that is close to the buddy.
+	 */
+	if (buddy != -1) {
+		for_each_domain(buddy, sd) {
+			if (sd->flags & SD_SHARE_CPUPOWER)
+				continue;
+
+			ilb = cpumask_first_and(sched_domain_span(sd),
+					nohz.idle_cpus_mask);
+
+			if (ilb < nr_cpu_ids)
+				break;
+		}
+	}
 
 	if (ilb < nr_cpu_ids && idle_cpu(ilb))
 		return ilb;
@@ -5874,6 +5942,10 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
 	if (rq->nr_running >= 2)
 		goto need_kick;
 
+	/* This cpu doesn't contribute to packing effort */
+	if (check_nohz_packing(cpu))
+		goto need_kick;
+
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
 		struct sched_group *sg = sd->groups;
-- 
1.7.9.5



* [RFC][PATCH v5 05/14] sched: add a packing level knob
  2013-10-18 11:52 [RFC][PATCH v5 00/14] sched: packing tasks Vincent Guittot
                   ` (2 preceding siblings ...)
  2013-10-18 11:52 ` [RFC][PATCH v5 04/14] sched: do load balance only with packing cpus Vincent Guittot
@ 2013-10-18 11:52 ` Vincent Guittot
  2013-11-12 10:32   ` Peter Zijlstra
  2013-10-18 11:52 ` [RFC][PATCH v5 06/14] sched: create a new field with available capacity Vincent Guittot
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 101+ messages in thread
From: Vincent Guittot @ 2013-10-18 11:52 UTC (permalink / raw)
  To: linux-kernel, peterz, mingo, pjt, Morten.Rasmussen, cmetcalf,
	tony.luck, alex.shi, preeti, linaro-kernel
  Cc: rjw, paulmck, corbet, tglx, len.brown, arjan, amit.kucheria,
	l.majewski, Vincent Guittot

The knob is used to set an average load threshold that will be used to trigger
the inclusion/removal of CPUs in the packing effort list.
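
The entry is added to kern_table, so it should show up as
/proc/sys/kernel/sched_packing_level and accept values in the [0:100] range,
e.g.:

#echo 60 > /proc/sys/kernel/sched_packing_level

With a value of 60, another group of CPUs is added to the packing list once the
activity of the current packing CPUs exceeds roughly 60% of their available
capacity.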

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched/sysctl.h |    9 +++++++++
 kernel/sched/fair.c          |   26 ++++++++++++++++++++++++++
 kernel/sysctl.c              |   17 +++++++++++++++++
 3 files changed, 52 insertions(+)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index bf8086b..f41afa5 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -44,6 +44,14 @@ enum sched_tunable_scaling {
 };
 extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 
+#ifdef CONFIG_SCHED_PACKING_TASKS
+extern int  __read_mostly sysctl_sched_packing_level;
+
+int sched_proc_update_packing(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos);
+#endif
+
 extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
@@ -61,6 +69,7 @@ extern unsigned int sysctl_sched_shares_window;
 int sched_proc_update_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *length,
 		loff_t *ppos);
+
 #endif
 #ifdef CONFIG_SCHED_DEBUG
 static inline unsigned int get_sysctl_timer_migration(void)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7149f38..5568980 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -186,6 +186,32 @@ void sched_init_granularity(void)
  */
 DEFINE_PER_CPU(int, sd_pack_buddy);
 
+/*
+ * The packing level of the scheduler
+ *
+ * This level define the activity % above which we should add another CPU to
+ * participate to the packing effort of the tasks
+ */
+#define DEFAULT_PACKING_LEVEL 80
+int __read_mostly sysctl_sched_packing_level = DEFAULT_PACKING_LEVEL;
+
+unsigned int sd_pack_threshold = (100 * 1024) / DEFAULT_PACKING_LEVEL;
+
+
+int sched_proc_update_packing(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos)
+{
+	int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+	if (ret || !write)
+		return ret;
+
+	if (sysctl_sched_packing_level)
+		sd_pack_threshold = (100 * 1024) / sysctl_sched_packing_level;
+
+	return 0;
+}
+
 static inline bool is_packing_cpu(int cpu)
 {
 	int my_buddy = per_cpu(sd_pack_buddy, cpu);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index b2f06f3..77383fc 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -255,11 +255,17 @@ static struct ctl_table sysctl_base_table[] = {
 	{ }
 };
 
+#ifdef CONFIG_SCHED_PACKING_TASKS
+static int min_sched_packing_level;
+static int max_sched_packing_level = 100;
+#endif /* CONFIG_SMP */
+
 #ifdef CONFIG_SCHED_DEBUG
 static int min_sched_granularity_ns = 100000;		/* 100 usecs */
 static int max_sched_granularity_ns = NSEC_PER_SEC;	/* 1 second */
 static int min_wakeup_granularity_ns;			/* 0 usecs */
 static int max_wakeup_granularity_ns = NSEC_PER_SEC;	/* 1 second */
+
 #ifdef CONFIG_SMP
 static int min_sched_tunable_scaling = SCHED_TUNABLESCALING_NONE;
 static int max_sched_tunable_scaling = SCHED_TUNABLESCALING_END-1;
@@ -279,6 +285,17 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+#ifdef CONFIG_SCHED_PACKING_TASKS
+	{
+		.procname	= "sched_packing_level",
+		.data		= &sysctl_sched_packing_level,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= sched_proc_update_packing,
+		.extra1		= &min_sched_packing_level,
+		.extra2		= &max_sched_packing_level,
+	},
+#endif
 #ifdef CONFIG_SCHED_DEBUG
 	{
 		.procname	= "sched_min_granularity_ns",
-- 
1.7.9.5



* [RFC][PATCH v5 06/14] sched: create a new field with available capacity
  2013-10-18 11:52 [RFC][PATCH v5 00/14] sched: packing tasks Vincent Guittot
                   ` (3 preceding siblings ...)
  2013-10-18 11:52 ` [RFC][PATCH v5 05/14] sched: add a packing level knob Vincent Guittot
@ 2013-10-18 11:52 ` Vincent Guittot
  2013-11-12 10:34   ` Peter Zijlstra
  2013-10-18 11:52 ` [RFC][PATCH v5 07/14] sched: get CPU's activity statistic Vincent Guittot
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 101+ messages in thread
From: Vincent Guittot @ 2013-10-18 11:52 UTC (permalink / raw)
  To: linux-kernel, peterz, mingo, pjt, Morten.Rasmussen, cmetcalf,
	tony.luck, alex.shi, preeti, linaro-kernel
  Cc: rjw, paulmck, corbet, tglx, len.brown, arjan, amit.kucheria,
	l.majewski, Vincent Guittot

This new field, power_available, reflects the available capacity of a CPU,
unlike cpu_power which reflects the current capacity.
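
In this patch both fields still get the same value; the difference appears in
patch 09/14, where the cpu_power of a CPU that doesn't participate in the
packing effort is forced to the minimum value of 1 while power_available keeps
the computed capacity, so the packing code can still reason about what that CPU
could provide if it were used.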

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c  |   14 +++++++++++---
 kernel/sched/sched.h |    3 ++-
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5568980..db9b871 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4584,15 +4584,19 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
 	if (!power)
 		power = 1;
 
+	cpu_rq(cpu)->cpu_available = power;
+	sdg->sgp->power_available = power;
+
 	cpu_rq(cpu)->cpu_power = power;
 	sdg->sgp->power = power;
+
 }
 
 void update_group_power(struct sched_domain *sd, int cpu)
 {
 	struct sched_domain *child = sd->child;
 	struct sched_group *group, *sdg = sd->groups;
-	unsigned long power;
+	unsigned long power, available;
 	unsigned long interval;
 
 	interval = msecs_to_jiffies(sd->balance_interval);
@@ -4604,7 +4608,7 @@ void update_group_power(struct sched_domain *sd, int cpu)
 		return;
 	}
 
-	power = 0;
+	power = available = 0;
 
 	if (child->flags & SD_OVERLAP) {
 		/*
@@ -4614,6 +4618,8 @@ void update_group_power(struct sched_domain *sd, int cpu)
 
 		for_each_cpu(cpu, sched_group_cpus(sdg))
 			power += power_of(cpu);
+			available += available_of(cpu);
+
 	} else  {
 		/*
 		 * !SD_OVERLAP domains can assume that child groups
@@ -4623,11 +4629,13 @@ void update_group_power(struct sched_domain *sd, int cpu)
 		group = child->groups;
 		do {
 			power += group->sgp->power;
+			available += group->sgp->power_available;
 			group = group->next;
 		} while (group != child->groups);
 	}
 
-	sdg->sgp->power_orig = sdg->sgp->power = power;
+	sdg->sgp->power_orig = sdg->sgp->power_available = available;
+	sdg->sgp->power = power;
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 22e3f1d..d5a4ec0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -459,6 +459,7 @@ struct rq {
 	struct sched_domain *sd;
 
 	unsigned long cpu_power;
+	unsigned long cpu_available;
 
 	unsigned char idle_balance;
 	/* For active balancing */
@@ -603,7 +604,7 @@ struct sched_group_power {
 	 * CPU power of this group, SCHED_LOAD_SCALE being max power for a
 	 * single CPU.
 	 */
-	unsigned int power, power_orig;
+	unsigned int power, power_orig, power_available;
 	unsigned long next_update;
 	/*
 	 * Number of busy cpus in this group.
-- 
1.7.9.5



* [RFC][PATCH v5 07/14] sched: get CPU's activity statistic
  2013-10-18 11:52 [RFC][PATCH v5 00/14] sched: packing tasks Vincent Guittot
                   ` (4 preceding siblings ...)
  2013-10-18 11:52 ` [RFC][PATCH v5 06/14] sched: create a new field with available capacity Vincent Guittot
@ 2013-10-18 11:52 ` Vincent Guittot
  2013-11-12 10:36   ` Peter Zijlstra
  2013-11-12 10:41   ` Peter Zijlstra
  2013-10-18 11:52 ` [RFC][PATCH v5 08/14] sched: move load idx selection in find_idlest_group Vincent Guittot
                   ` (7 subsequent siblings)
  13 siblings, 2 replies; 101+ messages in thread
From: Vincent Guittot @ 2013-10-18 11:52 UTC (permalink / raw)
  To: linux-kernel, peterz, mingo, pjt, Morten.Rasmussen, cmetcalf,
	tony.luck, alex.shi, preeti, linaro-kernel
  Cc: rjw, paulmck, corbet, tglx, len.brown, arjan, amit.kucheria,
	l.majewski, Vincent Guittot

Monitor the activity level of each group of each sched_domain level. The
activity is the amount of cpu_power that is currently used on a CPU. We use
the runnable_avg_sum and _period to evaluate this activity level. In the
special case where the CPU is fully loaded by more than 1 task, the activity
level is set above the cpu_power in order to reflect the overload of the CPU.
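
As a worked example of get_cpu_activity() below: a CPU whose runnable_avg_sum
is 512 over a runnable_avg_period of 1024 and whose available capacity is 1024
reports an activity of 512 (i.e. 50% busy), while the same CPU fully loaded
(sum == period) with two runnable tasks reports 1025 (available capacity + 1),
which marks the overload.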

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c |   26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index db9b871..7e26f65 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -179,6 +179,11 @@ void sched_init_granularity(void)
 }
 
 #ifdef CONFIG_SMP
+static unsigned long available_of(int cpu)
+{
+	return cpu_rq(cpu)->cpu_available;
+}
+
 #ifdef CONFIG_SCHED_PACKING_TASKS
 /*
  * Save the id of the optimal CPU that should be used to pack small tasks
@@ -3549,6 +3554,22 @@ done:
 	return target;
 }
 
+static int get_cpu_activity(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	u32 sum = rq->avg.runnable_avg_sum;
+	u32 period = rq->avg.runnable_avg_period;
+
+	sum = min(sum, period);
+
+	if (sum == period) {
+		u32 overload = rq->nr_running > 1 ? 1 : 0;
+		return available_of(cpu) + overload;
+	}
+
+	return (sum * available_of(cpu)) / period;
+}
+
 /*
  * sched_balance_self: balance the current task (running on cpu) in domains
  * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
@@ -4430,6 +4451,7 @@ struct sg_lb_stats {
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long load_per_task;
 	unsigned long group_power;
+	unsigned long group_activity; /* Total activity of the group */
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
 	unsigned int group_capacity;
 	unsigned int idle_cpus;
@@ -4446,6 +4468,7 @@ struct sd_lb_stats {
 	struct sched_group *busiest;	/* Busiest group in this sd */
 	struct sched_group *local;	/* Local group in this sd */
 	unsigned long total_load;	/* Total load of all groups in sd */
+	unsigned long total_activity;  /* Total activity of all groups in sd */
 	unsigned long total_pwr;	/* Total power of all groups in sd */
 	unsigned long avg_load;	/* Average load across all groups in sd */
 
@@ -4465,6 +4488,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 		.busiest = NULL,
 		.local = NULL,
 		.total_load = 0UL,
+		.total_activity = 0UL,
 		.total_pwr = 0UL,
 		.busiest_stat = {
 			.avg_load = 0UL,
@@ -4771,6 +4795,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		}
 
 		sgs->group_load += load;
+		sgs->group_activity += get_cpu_activity(i);
 		sgs->sum_nr_running += nr_running;
 		sgs->sum_weighted_load += weighted_cpuload(i);
 		if (idle_cpu(i))
@@ -4894,6 +4919,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 
 		/* Now, start updating sd_lb_stats */
 		sds->total_load += sgs->group_load;
+		sds->total_activity += sgs->group_activity;
 		sds->total_pwr += sgs->group_power;
 
 		if (!local_group && update_sd_pick_busiest(env, sds, sg, sgs)) {
-- 
1.7.9.5



* [RFC][PATCH v5 08/14] sched: move load idx selection in find_idlest_group
  2013-10-18 11:52 [RFC][PATCH v5 00/14] sched: packing tasks Vincent Guittot
                   ` (5 preceding siblings ...)
  2013-10-18 11:52 ` [RFC][PATCH v5 07/14] sched: get CPU's activity statistic Vincent Guittot
@ 2013-10-18 11:52 ` Vincent Guittot
  2013-11-12 10:49   ` Peter Zijlstra
  2013-11-27 14:10   ` [tip:sched/core] sched/fair: Move " tip-bot for Vincent Guittot
  2013-10-18 11:52 ` [RFC][PATCH v5 09/14] sched: update the packing cpu list Vincent Guittot
                   ` (6 subsequent siblings)
  13 siblings, 2 replies; 101+ messages in thread
From: Vincent Guittot @ 2013-10-18 11:52 UTC (permalink / raw)
  To: linux-kernel, peterz, mingo, pjt, Morten.Rasmussen, cmetcalf,
	tony.luck, alex.shi, preeti, linaro-kernel
  Cc: rjw, paulmck, corbet, tglx, len.brown, arjan, amit.kucheria,
	l.majewski, Vincent Guittot

load_idx is used in find_idlest_group but initialized in select_task_rq_fair
even when it is not used. The load_idx initialization is moved into
find_idlest_group and sd_flag replaces it in the function's arguments.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7e26f65..c258c38 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3430,12 +3430,16 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
  */
 static struct sched_group *
 find_idlest_group(struct sched_domain *sd, struct task_struct *p,
-		  int this_cpu, int load_idx)
+		  int this_cpu, int sd_flag)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
 	unsigned long min_load = ULONG_MAX, this_load = 0;
+	int load_idx = sd->forkexec_idx;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
 
+	if (sd_flag & SD_BALANCE_WAKE)
+		load_idx = sd->wake_idx;
+
 	do {
 		unsigned long load, avg_load;
 		int local_group, packing_cpus = 0;
@@ -3632,7 +3636,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 	}
 
 	while (sd) {
-		int load_idx = sd->forkexec_idx;
 		struct sched_group *group;
 		int weight;
 
@@ -3641,10 +3644,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 			continue;
 		}
 
-		if (sd_flag & SD_BALANCE_WAKE)
-			load_idx = sd->wake_idx;
-
-		group = find_idlest_group(sd, p, cpu, load_idx);
+		group = find_idlest_group(sd, p, cpu, sd_flag);
 		if (!group) {
 			sd = sd->child;
 			continue;
-- 
1.7.9.5



* [RFC][PATCH v5 09/14] sched: update the packing cpu list
  2013-10-18 11:52 [RFC][PATCH v5 00/14] sched: packing tasks Vincent Guittot
                   ` (6 preceding siblings ...)
  2013-10-18 11:52 ` [RFC][PATCH v5 08/14] sched: move load idx selection in find_idlest_group Vincent Guittot
@ 2013-10-18 11:52 ` Vincent Guittot
  2013-10-18 11:52 ` [RFC][PATCH v5 10/14] sched: init this_load to max in find_idlest_group Vincent Guittot
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 101+ messages in thread
From: Vincent Guittot @ 2013-10-18 11:52 UTC (permalink / raw)
  To: linux-kernel, peterz, mingo, pjt, Morten.Rasmussen, cmetcalf,
	tony.luck, alex.shi, preeti, linaro-kernel
  Cc: rjw, paulmck, corbet, tglx, len.brown, arjan, amit.kucheria,
	l.majewski, Vincent Guittot

Use the activity statistics to update the list of CPUs that should be used
to handle the current system activity.

The cpu_power is updated for CPUs that don't participate in the packing
effort. We consider that their cpu_power is allocated to idleness, in the same
way it could be allocated to rt. So the cpu_power that remains available for
cfs is set to the minimum value (i.e. 1).

The cpu_power is used for a task that wakes up, because a waking up task is
already taken into account in the current activity, whereas we use
power_available for a fork and an exec, because the task is not yet part of
the current activity.

In order to quickly find the packing starting point, we save information that
will be used to start directly with the right sched_group at the right
sched_domain level, instead of running the complete update_packing_domain
algorithm each time we need to use the packing cpu list.

The sd_power_leader defines the leader of a group of CPUs that can't be
power gated independently. As soon as this CPU is used, all the CPUs in the
same group will be used, based on the fact that it isn't worth keeping some
cores idle if they can't be power gated while one core in the group is
running.
The sd_pack_group and sd_pack_domain are used to quickly check whether a power
leader should be used in the packing effort.
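
To illustrate with the dual-cluster example of patch 03/14 where the cores of a
cluster cannot be power gated independently: CPU0 and CPU2 are the power
leaders of their clusters. CPU0 and CPU1 are their own buddies and thus always
take part in the packing effort (gating one of them alone would not save a
power domain), while CPU2 and CPU3 initially point to CPU0 and are left out.
Once the scaled activity no longer fits in cluster 0, update_packing_buddy()
run on the leader CPU2 makes it its own buddy, and CPU3, following its leader,
joins the packing effort as well.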

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c |  162 ++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 149 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c258c38..f9b03c1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -185,11 +185,20 @@ static unsigned long available_of(int cpu)
 }
 
 #ifdef CONFIG_SCHED_PACKING_TASKS
+struct sd_pack {
+	int  my_buddy; /* cpu on which tasks should be packed */
+	int my_leader; /* cpu which leads the packing state of a group */
+	struct sched_domain *domain; /* domain at which the check is done */
+	struct sched_group *group; /* starting group for checking */
+};
+
 /*
- * Save the id of the optimal CPU that should be used to pack small tasks
- * The value -1 is used when no buddy has been found
+ * Save per_cpu information about the optimal CPUs that should be used to pack
+ * tasks.
  */
-DEFINE_PER_CPU(int, sd_pack_buddy);
+DEFINE_PER_CPU(struct sd_pack, sd_pack_buddy)  = {
+	.my_buddy = -1,
+};
 
 /*
  * The packing level of the scheduler
@@ -202,6 +211,15 @@ int __read_mostly sysctl_sched_packing_level = DEFAULT_PACKING_LEVEL;
 
 unsigned int sd_pack_threshold = (100 * 1024) / DEFAULT_PACKING_LEVEL;
 
+static inline int get_buddy(int cpu)
+{
+	return per_cpu(sd_pack_buddy, cpu).my_buddy;
+}
+
+static inline int get_leader(int cpu)
+{
+	return per_cpu(sd_pack_buddy, cpu).my_leader;
+}
 
 int sched_proc_update_packing(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
@@ -219,13 +237,19 @@ int sched_proc_update_packing(struct ctl_table *table, int write,
 
 static inline bool is_packing_cpu(int cpu)
 {
-	int my_buddy = per_cpu(sd_pack_buddy, cpu);
+	int my_buddy = get_buddy(cpu);
 	return (my_buddy == -1) || (cpu == my_buddy);
 }
 
-static inline int get_buddy(int cpu)
+static inline bool is_leader_cpu(int cpu, struct sched_domain *sd)
 {
-	return per_cpu(sd_pack_buddy, cpu);
+	if (sd != per_cpu(sd_pack_buddy, cpu).domain)
+		return 0;
+
+	if (cpu != get_leader(cpu))
+		return 0;
+
+	return 1;
 }
 
 /*
@@ -239,7 +263,9 @@ static inline int get_buddy(int cpu)
 void update_packing_domain(int cpu)
 {
 	struct sched_domain *sd;
-	int id = -1;
+	struct sched_group *target = NULL;
+	struct sd_pack *pack = &per_cpu(sd_pack_buddy, cpu);
+	int id = cpu, pcpu = cpu;
 
 	sd = highest_flag_domain(cpu, SD_SHARE_POWERDOMAIN);
 	if (!sd)
@@ -247,6 +273,12 @@ void update_packing_domain(int cpu)
 	else
 		sd = sd->parent;
 
+	if (sd) {
+		pcpu = cpumask_first(sched_group_cpus(sd->groups));
+		if (pcpu != cpu)
+			goto end;
+	}
+
 	while (sd && (sd->flags & SD_LOAD_BALANCE)
 		&& !(sd->flags & SD_SHARE_POWERDOMAIN)) {
 		struct sched_group *sg = sd->groups;
@@ -258,15 +290,16 @@ void update_packing_domain(int cpu)
 		 * and this CPU of this local group is a good candidate
 		 */
 		id = cpu;
+		target = pack;
 
 		/* loop the sched groups to find the best one */
 		for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
-			if (tmp->sgp->power * pack->group_weight >
-					pack->sgp->power * tmp->group_weight)
+			if (tmp->sgp->power_available * pack->group_weight >
+				pack->sgp->power_available * tmp->group_weight)
 				continue;
 
-			if ((tmp->sgp->power * pack->group_weight ==
-					pack->sgp->power * tmp->group_weight)
+			if ((tmp->sgp->power_available * pack->group_weight ==
+				pack->sgp->power_available * tmp->group_weight)
 			 && (cpumask_first(sched_group_cpus(tmp)) >= id))
 				continue;
 
@@ -275,6 +308,7 @@ void update_packing_domain(int cpu)
 
 			/* Take the 1st CPU of the new group */
 			id = cpumask_first(sched_group_cpus(pack));
+			target = pack;
 		}
 
 		/* Look for another CPU than itself */
@@ -284,15 +318,75 @@ void update_packing_domain(int cpu)
 		sd = sd->parent;
 	}
 
+end:
 	pr_debug("CPU%d packing on CPU%d\n", cpu, id);
-	per_cpu(sd_pack_buddy, cpu) = id;
+
+	pack->my_leader = pcpu;
+	pack->my_buddy = id;
+	pack->domain = sd;
+	pack->group = target;
 }
 
+
+void update_packing_buddy(int cpu, int activity)
+{
+	struct sched_group *tmp;
+	int id = cpu, pcpu = get_leader(cpu);
+
+	/* Get the state of 1st CPU of the power group */
+	if (!is_packing_cpu(pcpu))
+		id = get_buddy(pcpu);
+
+	if (cpu != pcpu)
+		goto end;
+
+	/* Set the activity level */
+	if (sysctl_sched_packing_level == 0)
+		activity = INT_MAX;
+	else
+		activity = (activity * sd_pack_threshold) / 1024;
+
+	tmp = per_cpu(sd_pack_buddy, cpu).group;
+	id = cpumask_first(sched_group_cpus(tmp));
+
+	/* Take the best group at this sd level to pack activity */
+	for (; activity > 0; tmp = tmp->next) {
+		int next;
+		if (tmp->sgp->power_available > activity) {
+			next = cpumask_first(sched_group_cpus(tmp));
+			while ((activity > 0) && (id < nr_cpu_ids)) {
+				activity -= available_of(id);
+				id = next;
+				if (pcpu == id) {
+					activity = 0;
+					id = cpu;
+				} else
+					next = cpumask_next(id,
+							sched_group_cpus(tmp));
+			}
+		} else if (cpumask_test_cpu(cpu, sched_group_cpus(tmp))) {
+			id = cpu;
+			activity = 0;
+		} else {
+			activity -= tmp->sgp->power_available;
+		}
+	}
+
+end:
+	per_cpu(sd_pack_buddy, cpu).my_buddy = id;
+}
+
+static int get_cpu_activity(int cpu);
+
 static int check_nohz_packing(int cpu)
 {
 	if (!is_packing_cpu(cpu))
 		return true;
 
+	if ((get_cpu_activity(cpu) * 100) >=
+			(available_of(cpu) * sysctl_sched_packing_level))
+		return true;
+
 	return false;
 }
 #else /* CONFIG_SCHED_PACKING_TASKS */
@@ -302,6 +396,11 @@ static inline bool is_packing_cpu(int cpu)
 	return 1;
 }
 
+static inline bool is_leader_cpu(int cpu, struct sched_domain *sd)
+{
+	return 1;
+}
+
 static inline int get_buddy(int cpu)
 {
 	return -1;
@@ -3443,6 +3542,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 	do {
 		unsigned long load, avg_load;
 		int local_group, packing_cpus = 0;
+		unsigned int power;
 		int i;
 
 		/* Skip over this group if it has no CPUs allowed */
@@ -3472,8 +3572,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		if (!packing_cpus)
 			continue;
 
+		if (sd_flag & SD_BALANCE_WAKE)
+			power = group->sgp->power;
+		else
+			power = group->sgp->power_available;
+
 		/* Adjust by relative CPU power of the group */
-		avg_load = (avg_load * SCHED_POWER_SCALE) / group->sgp->power;
+		avg_load = (avg_load * SCHED_POWER_SCALE) / power;
 
 		if (local_group) {
 			this_load = avg_load;
@@ -4611,6 +4716,9 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
 	cpu_rq(cpu)->cpu_available = power;
 	sdg->sgp->power_available = power;
 
+	if (!is_packing_cpu(cpu))
+		power = 1;
+
 	cpu_rq(cpu)->cpu_power = power;
 	sdg->sgp->power = power;
 
@@ -4931,6 +5039,25 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 	} while (sg != env->sd->groups);
 }
 
+#ifdef CONFIG_SCHED_PACKING_TASKS
+static void update_sd_lb_packing(int cpu, struct sd_lb_stats *sds,
+		struct sched_domain *sd)
+{
+	/* Update the list of packing CPU */
+	if (sd == per_cpu(sd_pack_buddy, cpu).domain)
+		update_packing_buddy(cpu, sds->total_activity);
+
+	/* This CPU doesn't act for agressive packing */
+	if (!is_packing_cpu(cpu))
+		sds->busiest = NULL;
+}
+
+#else /* CONFIG_SCHED_PACKING_TASKS */
+static void update_sd_lb_packing(int cpu, struct sd_lb_stats *sds,
+		struct sched_domain *sd) {}
+
+#endif /* CONFIG_SCHED_PACKING_TASKS */
+
 /**
  * check_asym_packing - Check to see if the group is packed into the
  *			sched doman.
@@ -5153,6 +5280,11 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
 
+	/*
+	 * Update the involvement of the CPU in the packing effort
+	 */
+	update_sd_lb_packing(env->dst_cpu, &sds, env->sd);
+
 	if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
 	    check_asym_packing(env, &sds))
 		return sds.busiest;
@@ -5312,6 +5444,10 @@ static int should_we_balance(struct lb_env *env)
 	if (env->idle == CPU_NEWLY_IDLE)
 		return 1;
 
+	/* Leader CPU must be used to update packing CPUs list */
+	if (is_leader_cpu(env->dst_cpu, env->sd))
+		return 1;
+
 	sg_cpus = sched_group_cpus(sg);
 	sg_mask = sched_group_mask(sg);
 	/* Try to find first idle cpu */
-- 
1.7.9.5



* [RFC][PATCH v5 10/14] sched: init this_load to max in find_idlest_group
  2013-10-18 11:52 [RFC][PATCH v5 00/14] sched: packing tasks Vincent Guittot
                   ` (7 preceding siblings ...)
  2013-10-18 11:52 ` [RFC][PATCH v5 09/14] sched: update the packing cpu list Vincent Guittot
@ 2013-10-18 11:52 ` Vincent Guittot
  2013-10-18 11:52 ` [RFC][PATCH v5 11/14] sched: add a SCHED_PACKING_TASKS config Vincent Guittot
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 101+ messages in thread
From: Vincent Guittot @ 2013-10-18 11:52 UTC (permalink / raw)
  To: linux-kernel, peterz, mingo, pjt, Morten.Rasmussen, cmetcalf,
	tony.luck, alex.shi, preeti, linaro-kernel
  Cc: rjw, paulmck, corbet, tglx, len.brown, arjan, amit.kucheria,
	l.majewski, Vincent Guittot

Init this_load to the max value instead of 0 in find_idlest_group.
If the local group is skipped because it doesn't have allowed CPUs, this_load
stays at 0, no idlest group will be returned and the selected CPU will be one
that is not allowed (and will then be replaced in select_fallback_rq by a
random one). With the default value set to max, we will use the idlest group
even if we skip the local_group.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f9b03c1..2d9f782 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3532,7 +3532,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		  int this_cpu, int sd_flag)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
-	unsigned long min_load = ULONG_MAX, this_load = 0;
+	unsigned long min_load = ULONG_MAX, this_load = ULONG_MAX;
 	int load_idx = sd->forkexec_idx;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
 
-- 
1.7.9.5



* [RFC][PATCH v5 11/14] sched: add a SCHED_PACKING_TASKS config
  2013-10-18 11:52 [RFC][PATCH v5 00/14] sched: packing tasks Vincent Guittot
                   ` (8 preceding siblings ...)
  2013-10-18 11:52 ` [RFC][PATCH v5 10/14] sched: init this_load to max in find_idlest_group Vincent Guittot
@ 2013-10-18 11:52 ` Vincent Guittot
  2013-10-18 11:52 ` [RFC][PATCH v5 12/14] sched: create a statistic structure Vincent Guittot
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 101+ messages in thread
From: Vincent Guittot @ 2013-10-18 11:52 UTC (permalink / raw)
  To: linux-kernel, peterz, mingo, pjt, Morten.Rasmussen, cmetcalf,
	tony.luck, alex.shi, preeti, linaro-kernel
  Cc: rjw, paulmck, corbet, tglx, len.brown, arjan, amit.kucheria,
	l.majewski, Vincent Guittot

The SCHED_PACKING_TASKS config option is used to enable the task packing
mechanism.
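
With the Kconfig entry added below, the feature is enabled by selecting it in
the kernel configuration (it depends on CONFIG_SMP and CONFIG_SCHED_MC):

CONFIG_SCHED_PACKING_TASKS=y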

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 init/Kconfig |   11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index 3ecd8a1..4d2b5db 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1178,6 +1178,17 @@ config SCHED_AUTOGROUP
 	  desktop applications.  Task group autogeneration is currently based
 	  upon task session.
 
+config SCHED_PACKING_TASKS
+	bool "Automatic tasks packing"
+	depends on SMP && SCHED_MC
+	default n
+	help
+	  This option enable the packing mode of the scheduler.
+	  This mode ensures that the minimal number of CPUs will
+	  be used to handle the activty of the system. The CPUs
+	  are selected to minimized the number of power domain
+	  that must be kept on
+
 config MM_OWNER
 	bool
 
-- 
1.7.9.5



* [RFC][PATCH v5 12/14] sched: create a statistic structure
  2013-10-18 11:52 [RFC][PATCH v5 00/14] sched: packing tasks Vincent Guittot
                   ` (9 preceding siblings ...)
  2013-10-18 11:52 ` [RFC][PATCH v5 11/14] sched: add a SCHED_PACKING_TASKS config Vincent Guittot
@ 2013-10-18 11:52 ` Vincent Guittot
  2013-10-18 11:52 ` [RFC][PATCH v5 13/14] sched: differantiate idle cpu Vincent Guittot
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 101+ messages in thread
From: Vincent Guittot @ 2013-10-18 11:52 UTC (permalink / raw)
  To: linux-kernel, peterz, mingo, pjt, Morten.Rasmussen, cmetcalf,
	tony.luck, alex.shi, preeti, linaro-kernel
  Cc: rjw, paulmck, corbet, tglx, len.brown, arjan, amit.kucheria,
	l.majewski, Vincent Guittot

Create a statistic structure that will be used to share information with
other frameworks like cpuidle and cpufreq. This structure only contains the
current wake up latency of a core for now but could be extended with other
useful information.
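
A minimal sketch of how another framework could publish its value through this
structure (patch 14/14 does this for cpuidle; its code is not quoted here, so
the helper name below is only illustrative and the latency is assumed to be
expressed in microseconds, as expected by the scheduler side in patch 13/14):

	#include <linux/sched.h>
	#include <linux/percpu.h>
	#include <linux/atomic.h>

	static void publish_wake_latency(int cpu, unsigned int latency_us)
	{
		/* lock-free update; the scheduler reads it with atomic_read() */
		atomic_set(&per_cpu(sched_stat, cpu).wake_latency, latency_us);
	}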

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h |   12 +++++++++++-
 kernel/sched/fair.c   |    5 +++++
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2004cdb..d676aa2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2,7 +2,7 @@
 #define _LINUX_SCHED_H
 
 #include <uapi/linux/sched.h>
-
+#include <linux/atomic.h>
 
 struct sched_param {
 	int sched_priority;
@@ -63,6 +63,16 @@ struct fs_struct;
 struct perf_event_context;
 struct blk_plug;
 
+/* This structure is used to share information and statistics with other
+ * frameworks. It only shares wake up latency fro the moment but should be
+ * extended with other usefull informations
+ */
+struct sched_pm {
+	atomic_t  wake_latency; /* time to wake up the cpu */
+};
+
+DECLARE_PER_CPU(struct sched_pm, sched_stat);
+
 /*
  * List of flags we want to share for kernel threads,
  * if only because they are not used by them anyway.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2d9f782..ad8b99a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -178,6 +178,11 @@ void sched_init_granularity(void)
 	update_sysctl();
 }
 
+/* Save per_cpu information that will be shared with other frameworks */
+DEFINE_PER_CPU(struct sched_pm, sched_stat) = {
+	.wake_latency = ATOMIC_INIT(0)
+};
+
 #ifdef CONFIG_SMP
 static unsigned long available_of(int cpu)
 {
-- 
1.7.9.5



* [RFC][PATCH v5 13/14] sched: differantiate idle cpu
  2013-10-18 11:52 [RFC][PATCH v5 00/14] sched: packing tasks Vincent Guittot
                   ` (10 preceding siblings ...)
  2013-10-18 11:52 ` [RFC][PATCH v5 12/14] sched: create a statistic structure Vincent Guittot
@ 2013-10-18 11:52 ` Vincent Guittot
  2013-10-18 11:52 ` [RFC][PATCH v5 14/14] cpuidle: set the current wake up latency Vincent Guittot
  2013-11-11 11:33 ` [RFC][PATCH v5 00/14] sched: packing tasks Catalin Marinas
  13 siblings, 0 replies; 101+ messages in thread
From: Vincent Guittot @ 2013-10-18 11:52 UTC (permalink / raw)
  To: linux-kernel, peterz, mingo, pjt, Morten.Rasmussen, cmetcalf,
	tony.luck, alex.shi, preeti, linaro-kernel
  Cc: rjw, paulmck, corbet, tglx, len.brown, arjan, amit.kucheria,
	l.majewski, Vincent Guittot

The cost of waking up a core varies according to its current idle state.
This includes C-states and intermediate states when some synchronization
between cores is required to reach a deep C-state.
Waking up a CPU that is in a deep C-state to run a short task is not
efficient from both a power and a performance point of view. We should take
into account the wake-up latency of an idle CPU when the scheduler looks for
the best CPU to use for a waking task.
The wake-up latency of a CPU is converted into a load that can be directly
compared with the task load and the load of other CPUs.
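
For instance, with the transform used below ((latency_us * 21) >> 10) and
some made-up exit latencies, the resulting idle loads would be:

	(  100 * 21) >> 10 =   2	/* ~100us shallow state      */
	( 1500 * 21) >> 10 =  30	/* ~1.5ms intermediate state */
	(10000 * 21) >> 10 = 205	/* ~10ms deep/cluster state  */

An idle CPU is then skipped during the wake-up placement when this value is
larger than the waking task's load_avg_contrib.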

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c |   22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ad8b99a..4863dad 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -394,6 +394,20 @@ static int check_nohz_packing(int cpu)
 
 	return false;
 }
+
+int sched_get_idle_load(int cpu)
+{
+	struct sched_pm *stat = &per_cpu(sched_stat, cpu);
+	int latency = atomic_read(&(stat->wake_latency));
+	/*
+	 * Transform the current wakeup latency (us) into an idle load that
+	 * will be compared with the task load to decide whether it's worth
+	 * waking up the cpu. The current formula is quite simple but gives
+	 * a good approximation in the range [0:10ms].
+	 */
+	return (latency * 21) >> 10;
+}
+
 #else /* CONFIG_SCHED_PACKING_TASKS */
 
 static inline bool is_packing_cpu(int cpu)
@@ -416,6 +430,10 @@ static inline int check_nohz_packing(int cpu)
 	return false;
 }
 
+static inline int sched_get_idle_load(int cpu)
+{
+	return 0;
+}
 
 #endif /* CONFIG_SCHED_PACKING_TASKS */
 #endif /* CONFIG_SMP */
@@ -3207,6 +3225,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 /* Used instead of source_load when we know the type == 0 */
 static unsigned long weighted_cpuload(const int cpu)
 {
+	if (idle_cpu(cpu))
+		return sched_get_idle_load(cpu);
 	return cpu_rq(cpu)->cfs.runnable_load_avg;
 }
 
@@ -3655,6 +3675,8 @@ static int select_idle_sibling(struct task_struct *p, int target)
 				if (i == target || !idle_cpu(i)
 						|| !is_packing_cpu(i))
 					goto next;
+				if (weighted_cpuload(i) > p->se.avg.load_avg_contrib)
+					goto next;
 			}
 
 			target = cpumask_first_and(sched_group_cpus(sg),
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 101+ messages in thread

* [RFC][PATCH v5 14/14] cpuidle: set the current wake up latency
  2013-10-18 11:52 [RFC][PATCH v5 00/14] sched: packing tasks Vincent Guittot
                   ` (11 preceding siblings ...)
  2013-10-18 11:52 ` [RFC][PATCH v5 13/14] sched: differantiate idle cpu Vincent Guittot
@ 2013-10-18 11:52 ` Vincent Guittot
  2013-11-11 11:33 ` [RFC][PATCH v5 00/14] sched: packing tasks Catalin Marinas
  13 siblings, 0 replies; 101+ messages in thread
From: Vincent Guittot @ 2013-10-18 11:52 UTC (permalink / raw)
  To: linux-kernel, peterz, mingo, pjt, Morten.Rasmussen, cmetcalf,
	tony.luck, alex.shi, preeti, linaro-kernel
  Cc: rjw, paulmck, corbet, tglx, len.brown, arjan, amit.kucheria,
	l.majewski, Vincent Guittot

Save the current wake-up latency of a core. This latency is not always
the latency of a defined c-state; it can also be an intermediate value
when a core is ready to shut down (timer migration, cache flush ...) but
waits for the last core of the cluster to finalize the cluster power down.
This latter use case is not managed by the current version of the patch because
it implies that the cpuidle drivers set the wake-up latency instead of the
cpuidle core.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

Conflicts:

	drivers/cpuidle/cpuidle.c
---
 drivers/cpuidle/cpuidle.c |   11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index d75040d..7b6553b 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -19,6 +19,7 @@
 #include <linux/ktime.h>
 #include <linux/hrtimer.h>
 #include <linux/module.h>
+#include <linux/sched.h>
 #include <trace/events/power.h>
 
 #include "cpuidle.h"
@@ -42,6 +43,12 @@ void disable_cpuidle(void)
 	off = 1;
 }
 
+static void cpuidle_set_current_state(int cpu, int latency)
+{
+	struct sched_pm *stat = &per_cpu(sched_stat, cpu);
+
+	atomic_set(&(stat->wake_latency), latency);
+}
 /**
  * cpuidle_play_dead - cpu off-lining
  *
@@ -79,12 +86,16 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
 	ktime_t time_start, time_end;
 	s64 diff;
 
+	cpuidle_set_current_state(dev->cpu, target_state->exit_latency);
+
 	time_start = ktime_get();
 
 	entered_state = target_state->enter(dev, drv, index);
 
 	time_end = ktime_get();
 
+	cpuidle_set_current_state(dev->cpu, 0);
+
 	local_irq_enable();
 
 	diff = ktime_to_us(ktime_sub(time_end, time_start));
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH v5 01/14] sched: add a new arch_sd_local_flags for sched_domain init
  2013-10-18 11:52 ` [RFC][PATCH v5 01/14] sched: add a new arch_sd_local_flags for sched_domain init Vincent Guittot
@ 2013-11-05 14:06   ` Peter Zijlstra
  2013-11-05 14:57     ` Vincent Guittot
  0 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2013-11-05 14:06 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, mingo, pjt, Morten.Rasmussen, cmetcalf, tony.luck,
	alex.shi, preeti, linaro-kernel, rjw, paulmck, corbet, tglx,
	len.brown, arjan, amit.kucheria, l.majewski, james.hogan,
	schwidefsky, heiko.carstens

On Fri, Oct 18, 2013 at 01:52:15PM +0200, Vincent Guittot wrote:
> The function arch_sd_local_flags is used to set flags in sched_domains
> according to the platform architecture. A new flag SD_SHARE_POWERDOMAIN is
> also created to reflect whether groups of CPUs in a sched_domain level can or
> not reach different power state. As an example, the flag should be cleared at
> CPU level if groups of cores can be power gated independently. This information
> is used to decide if it's worth packing some tasks in a group of CPUs in order
> to power gate the other groups instead of spreading the tasks. The default
> behavior of the scheduler is to spread tasks across CPUs and groups of CPUs so
> the flag is set into all sched_domains.
> 
> The cpu parameter of arch_sd_local_flags can be used by architecture to fine
> tune the scheduler domain flags. As an example SD_SHARE_POWERDOMAIN flag can be
> set differently for groups of CPUs according to DT information
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

Not your fault, but you're making a bigger mess of the arch topology
interface.

How about we start with something like the below -- compile tested only.

And then try and lift default_topology so that an arch can override it?

---
 arch/ia64/include/asm/topology.h  |  24 -----
 arch/metag/include/asm/topology.h |  27 ------
 arch/s390/include/asm/topology.h  |   2 -
 arch/tile/include/asm/topology.h  |  33 -------
 include/linux/topology.h          | 115 -----------------------
 kernel/sched/core.c               | 188 +++++++++++++++++++++++---------------
 6 files changed, 114 insertions(+), 275 deletions(-)

diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h
index a2496e449b75..20d12fa7e0cd 100644
--- a/arch/ia64/include/asm/topology.h
+++ b/arch/ia64/include/asm/topology.h
@@ -46,30 +46,6 @@
 
 void build_cpu_to_node_map(void);
 
-#define SD_CPU_INIT (struct sched_domain) {		\
-	.parent			= NULL,			\
-	.child			= NULL,			\
-	.groups			= NULL,			\
-	.min_interval		= 1,			\
-	.max_interval		= 4,			\
-	.busy_factor		= 64,			\
-	.imbalance_pct		= 125,			\
-	.cache_nice_tries	= 2,			\
-	.busy_idx		= 2,			\
-	.idle_idx		= 1,			\
-	.newidle_idx		= 0,			\
-	.wake_idx		= 0,			\
-	.forkexec_idx		= 0,			\
-	.flags			= SD_LOAD_BALANCE	\
-				| SD_BALANCE_NEWIDLE	\
-				| SD_BALANCE_EXEC	\
-				| SD_BALANCE_FORK	\
-				| SD_WAKE_AFFINE,	\
-	.last_balance		= jiffies,		\
-	.balance_interval	= 1,			\
-	.nr_balance_failed	= 0,			\
-}
-
 #endif /* CONFIG_NUMA */
 
 #ifdef CONFIG_SMP
diff --git a/arch/metag/include/asm/topology.h b/arch/metag/include/asm/topology.h
index 8e9c0b3b9691..e95f874ded1b 100644
--- a/arch/metag/include/asm/topology.h
+++ b/arch/metag/include/asm/topology.h
@@ -3,33 +3,6 @@
 
 #ifdef CONFIG_NUMA
 
-/* sched_domains SD_NODE_INIT for Meta machines */
-#define SD_NODE_INIT (struct sched_domain) {		\
-	.parent			= NULL,			\
-	.child			= NULL,			\
-	.groups			= NULL,			\
-	.min_interval		= 8,			\
-	.max_interval		= 32,			\
-	.busy_factor		= 32,			\
-	.imbalance_pct		= 125,			\
-	.cache_nice_tries	= 2,			\
-	.busy_idx		= 3,			\
-	.idle_idx		= 2,			\
-	.newidle_idx		= 0,			\
-	.wake_idx		= 0,			\
-	.forkexec_idx		= 0,			\
-	.flags			= SD_LOAD_BALANCE	\
-				| SD_BALANCE_FORK	\
-				| SD_BALANCE_EXEC	\
-				| SD_BALANCE_NEWIDLE	\
-				| SD_SERIALIZE,		\
-	.last_balance		= jiffies,		\
-	.balance_interval	= 1,			\
-	.nr_balance_failed	= 0,			\
-	.max_newidle_lb_cost	= 0,			\
-	.next_decay_max_lb_cost	= jiffies,		\
-}
-
 #define cpu_to_node(cpu)	((void)(cpu), 0)
 #define parent_node(node)	((void)(node), 0)
 
diff --git a/arch/s390/include/asm/topology.h b/arch/s390/include/asm/topology.h
index 05425b18c0aa..07763bdb408d 100644
--- a/arch/s390/include/asm/topology.h
+++ b/arch/s390/include/asm/topology.h
@@ -64,8 +64,6 @@ static inline void s390_init_cpu_topology(void)
 };
 #endif
 
-#define SD_BOOK_INIT	SD_CPU_INIT
-
 #include <asm-generic/topology.h>
 
 #endif /* _ASM_S390_TOPOLOGY_H */
diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h
index d15c0d8d550f..938311844233 100644
--- a/arch/tile/include/asm/topology.h
+++ b/arch/tile/include/asm/topology.h
@@ -44,39 +44,6 @@ static inline const struct cpumask *cpumask_of_node(int node)
 /* For now, use numa node -1 for global allocation. */
 #define pcibus_to_node(bus)		((void)(bus), -1)
 
-/*
- * TILE architecture has many cores integrated in one processor, so we need
- * setup bigger balance_interval for both CPU/NODE scheduling domains to
- * reduce process scheduling costs.
- */
-
-/* sched_domains SD_CPU_INIT for TILE architecture */
-#define SD_CPU_INIT (struct sched_domain) {				\
-	.min_interval		= 4,					\
-	.max_interval		= 128,					\
-	.busy_factor		= 64,					\
-	.imbalance_pct		= 125,					\
-	.cache_nice_tries	= 1,					\
-	.busy_idx		= 2,					\
-	.idle_idx		= 1,					\
-	.newidle_idx		= 0,					\
-	.wake_idx		= 0,					\
-	.forkexec_idx		= 0,					\
-									\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 1*SD_BALANCE_NEWIDLE			\
-				| 1*SD_BALANCE_EXEC			\
-				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 0*SD_WAKE_AFFINE			\
-				| 0*SD_SHARE_CPUPOWER			\
-				| 0*SD_SHARE_PKG_RESOURCES		\
-				| 0*SD_SERIALIZE			\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 32,					\
-}
-
 /* By definition, we create nodes based on online memory. */
 #define node_has_online_mem(nid) 1
 
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 12ae6ce997d6..02a397aa150e 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -66,121 +66,6 @@ int arch_update_cpu_topology(void);
 #define PENALTY_FOR_NODE_WITH_CPUS	(1)
 #endif
 
-/*
- * Below are the 3 major initializers used in building sched_domains:
- * SD_SIBLING_INIT, for SMT domains
- * SD_CPU_INIT, for SMP domains
- *
- * Any architecture that cares to do any tuning to these values should do so
- * by defining their own arch-specific initializer in include/asm/topology.h.
- * A definition there will automagically override these default initializers
- * and allow arch-specific performance tuning of sched_domains.
- * (Only non-zero and non-null fields need be specified.)
- */
-
-#ifdef CONFIG_SCHED_SMT
-/* MCD - Do we really need this?  It is always on if CONFIG_SCHED_SMT is,
- * so can't we drop this in favor of CONFIG_SCHED_SMT?
- */
-#define ARCH_HAS_SCHED_WAKE_IDLE
-/* Common values for SMT siblings */
-#ifndef SD_SIBLING_INIT
-#define SD_SIBLING_INIT (struct sched_domain) {				\
-	.min_interval		= 1,					\
-	.max_interval		= 2,					\
-	.busy_factor		= 64,					\
-	.imbalance_pct		= 110,					\
-									\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 1*SD_BALANCE_NEWIDLE			\
-				| 1*SD_BALANCE_EXEC			\
-				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 1*SD_WAKE_AFFINE			\
-				| 1*SD_SHARE_CPUPOWER			\
-				| 1*SD_SHARE_PKG_RESOURCES		\
-				| 0*SD_SERIALIZE			\
-				| 0*SD_PREFER_SIBLING			\
-				| arch_sd_sibling_asym_packing()	\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 1,					\
-	.smt_gain		= 1178,	/* 15% */			\
-	.max_newidle_lb_cost	= 0,					\
-	.next_decay_max_lb_cost	= jiffies,				\
-}
-#endif
-#endif /* CONFIG_SCHED_SMT */
-
-#ifdef CONFIG_SCHED_MC
-/* Common values for MC siblings. for now mostly derived from SD_CPU_INIT */
-#ifndef SD_MC_INIT
-#define SD_MC_INIT (struct sched_domain) {				\
-	.min_interval		= 1,					\
-	.max_interval		= 4,					\
-	.busy_factor		= 64,					\
-	.imbalance_pct		= 125,					\
-	.cache_nice_tries	= 1,					\
-	.busy_idx		= 2,					\
-	.wake_idx		= 0,					\
-	.forkexec_idx		= 0,					\
-									\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 1*SD_BALANCE_NEWIDLE			\
-				| 1*SD_BALANCE_EXEC			\
-				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 1*SD_WAKE_AFFINE			\
-				| 0*SD_SHARE_CPUPOWER			\
-				| 1*SD_SHARE_PKG_RESOURCES		\
-				| 0*SD_SERIALIZE			\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 1,					\
-	.max_newidle_lb_cost	= 0,					\
-	.next_decay_max_lb_cost	= jiffies,				\
-}
-#endif
-#endif /* CONFIG_SCHED_MC */
-
-/* Common values for CPUs */
-#ifndef SD_CPU_INIT
-#define SD_CPU_INIT (struct sched_domain) {				\
-	.min_interval		= 1,					\
-	.max_interval		= 4,					\
-	.busy_factor		= 64,					\
-	.imbalance_pct		= 125,					\
-	.cache_nice_tries	= 1,					\
-	.busy_idx		= 2,					\
-	.idle_idx		= 1,					\
-	.newidle_idx		= 0,					\
-	.wake_idx		= 0,					\
-	.forkexec_idx		= 0,					\
-									\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 1*SD_BALANCE_NEWIDLE			\
-				| 1*SD_BALANCE_EXEC			\
-				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 1*SD_WAKE_AFFINE			\
-				| 0*SD_SHARE_CPUPOWER			\
-				| 0*SD_SHARE_PKG_RESOURCES		\
-				| 0*SD_SERIALIZE			\
-				| 1*SD_PREFER_SIBLING			\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 1,					\
-	.max_newidle_lb_cost	= 0,					\
-	.next_decay_max_lb_cost	= jiffies,				\
-}
-#endif
-
-#ifdef CONFIG_SCHED_BOOK
-#ifndef SD_BOOK_INIT
-#error Please define an appropriate SD_BOOK_INIT in include/asm/topology.h!!!
-#endif
-#endif /* CONFIG_SCHED_BOOK */
-
 #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
 DECLARE_PER_CPU(int, numa_node);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 450a34b2a637..b4e0b1c97a96 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5379,14 +5379,13 @@ enum s_alloc {
 
 struct sched_domain_topology_level;
 
-typedef struct sched_domain *(*sched_domain_init_f)(struct sched_domain_topology_level *tl, int cpu);
 typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
 
 #define SDTL_OVERLAP	0x01
 
 struct sched_domain_topology_level {
-	sched_domain_init_f init;
 	sched_domain_mask_f mask;
+	int		    sd_flags;
 	int		    flags;
 	int		    numa_level;
 	struct sd_data      data;
@@ -5625,28 +5624,6 @@ int __weak arch_sd_sibling_asym_packing(void)
 # define SD_INIT_NAME(sd, type)		do { } while (0)
 #endif
 
-#define SD_INIT_FUNC(type)						\
-static noinline struct sched_domain *					\
-sd_init_##type(struct sched_domain_topology_level *tl, int cpu) 	\
-{									\
-	struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);	\
-	*sd = SD_##type##_INIT;						\
-	SD_INIT_NAME(sd, type);						\
-	sd->private = &tl->data;					\
-	return sd;							\
-}
-
-SD_INIT_FUNC(CPU)
-#ifdef CONFIG_SCHED_SMT
- SD_INIT_FUNC(SIBLING)
-#endif
-#ifdef CONFIG_SCHED_MC
- SD_INIT_FUNC(MC)
-#endif
-#ifdef CONFIG_SCHED_BOOK
- SD_INIT_FUNC(BOOK)
-#endif
-
 static int default_relax_domain_level = -1;
 int sched_domain_level_max;
 
@@ -5741,89 +5718,152 @@ static const struct cpumask *cpu_smt_mask(int cpu)
 }
 #endif
 
-/*
- * Topology list, bottom-up.
- */
-static struct sched_domain_topology_level default_topology[] = {
-#ifdef CONFIG_SCHED_SMT
-	{ sd_init_SIBLING, cpu_smt_mask, },
-#endif
-#ifdef CONFIG_SCHED_MC
-	{ sd_init_MC, cpu_coregroup_mask, },
-#endif
-#ifdef CONFIG_SCHED_BOOK
-	{ sd_init_BOOK, cpu_book_mask, },
-#endif
-	{ sd_init_CPU, cpu_cpu_mask, },
-	{ NULL, },
-};
-
-static struct sched_domain_topology_level *sched_domain_topology = default_topology;
-
-#define for_each_sd_topology(tl)			\
-	for (tl = sched_domain_topology; tl->init; tl++)
+static int sched_domains_curr_level;
 
 #ifdef CONFIG_NUMA
-
 static int sched_domains_numa_levels;
 static int *sched_domains_numa_distance;
 static struct cpumask ***sched_domains_numa_masks;
-static int sched_domains_curr_level;
-
-static inline int sd_local_flags(int level)
-{
-	if (sched_domains_numa_distance[level] > RECLAIM_DISTANCE)
-		return 0;
+#endif
 
-	return SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE;
-}
+/*
+ * SD_flags allowed in topology descriptions.
+ *
+ * SD_SHARE_CPUPOWER      - describes SMT topologies
+ * SD_SHARE_PKG_RESOURCES - describes shared caches
+ * SD_NUMA                - describes NUMA topologies
+ *
+ * Odd one out:
+ * SD_ASYM_PACKING        - describes SMT quirks
+ */
+#define TOPOLOGY_SD_FLAGS		\
+	(SD_SHARE_CPUPOWER |		\
+	 SD_SHARE_PKG_RESOURCES |	\
+	 SD_NUMA |			\
+	 SD_ASYM_PACKING)
 
 static struct sched_domain *
-sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
+sd_init(struct sched_domain_topology_level *tl, int cpu)
 {
 	struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
-	int level = tl->numa_level;
-	int sd_weight = cpumask_weight(
-			sched_domains_numa_masks[level][cpu_to_node(cpu)]);
+	int sd_weight;
+
+	/*
+	 * Ugly hack to pass state to sd_numa_mask()...
+	 */
+	sched_domains_curr_level = tl->numa_level;
+
+	sd_weight = cpumask_weight(tl->mask(cpu));
+
+	if (WARN_ONCE(tl->sd_flags & ~TOPOLOGY_SD_FLAGS,
+			"wrong sd_flags in topology description\n"))
+		tl->sd_flags &= ~TOPOLOGY_SD_FLAGS;
 
 	*sd = (struct sched_domain){
 		.min_interval		= sd_weight,
 		.max_interval		= 2*sd_weight,
 		.busy_factor		= 32,
 		.imbalance_pct		= 125,
-		.cache_nice_tries	= 2,
-		.busy_idx		= 3,
-		.idle_idx		= 2,
+
+		.cache_nice_tries	= 0,
+		.busy_idx		= 0,
+		.idle_idx		= 0,
 		.newidle_idx		= 0,
 		.wake_idx		= 0,
 		.forkexec_idx		= 0,
 
 		.flags			= 1*SD_LOAD_BALANCE
 					| 1*SD_BALANCE_NEWIDLE
-					| 0*SD_BALANCE_EXEC
-					| 0*SD_BALANCE_FORK
+					| 1*SD_BALANCE_EXEC
+					| 1*SD_BALANCE_FORK
 					| 0*SD_BALANCE_WAKE
-					| 0*SD_WAKE_AFFINE
+					| 1*SD_WAKE_AFFINE
 					| 0*SD_SHARE_CPUPOWER
 					| 0*SD_SHARE_PKG_RESOURCES
-					| 1*SD_SERIALIZE
+					| 0*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
-					| 1*SD_NUMA
-					| sd_local_flags(level)
+					| 0*SD_NUMA
+					| tl->sd_flags
 					,
+
 		.last_balance		= jiffies,
 		.balance_interval	= sd_weight,
+		.smt_gain		= 0,
+		.max_newidle_lb_cost	= 0,
+		.next_decay_max_lb_cost	= jiffies,
 	};
-	SD_INIT_NAME(sd, NUMA);
-	sd->private = &tl->data;
 
 	/*
-	 * Ugly hack to pass state to sd_numa_mask()...
+	 * Convert topological properties into behaviour.
 	 */
-	sched_domains_curr_level = tl->numa_level;
+
+	if (sd->flags & SD_SHARE_CPUPOWER) {
+		sd->imbalance_pct = 110;
+		sd->smt_gain = 1178; /* ~15% */
+
+		/*
+		 * XXX hoist into arch topology descriptions.
+		 */
+		sd->flags |= arch_sd_sibling_asym_packing();
+
+		SD_INIT_NAME(sd, SMT);
+	} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
+		sd->imbalance_pct = 117;
+		sd->cache_nice_tries = 1;
+		sd->busy_idx = 2;
+
+		SD_INIT_NAME(sd, MC);
+#ifdef CONFIG_NUMA
+	} else if (sd->flags & SD_NUMA) {
+		sd->cache_nice_tries = 2;
+		sd->busy_idx = 3;
+		sd->idle_idx = 2;
+
+		sd->flags |= SD_SERIALIZE;
+		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
+			sd->flags &= ~(SD_BALANCE_EXEC |
+				       SD_BALANCE_FORK |
+				       SD_WAKE_AFFINE);
+		}
+
+		SD_INIT_NAME(sd, NUMA);
+#endif
+	} else {
+		sd->flags |= SD_PREFER_SIBLING;
+		sd->cache_nice_tries = 1;
+		sd->busy_idx = 2;
+		sd->idle_idx = 1;
+
+		SD_INIT_NAME(sd, DIE);
+	}
+
+	sd->private = &tl->data;
 
 	return sd;
 }
+/*
+ * Topology list, bottom-up.
+ */
+static struct sched_domain_topology_level default_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES },
+#endif
+#ifdef CONFIG_SCHED_MC
+	{ cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES },
+#endif
+#ifdef CONFIG_SCHED_BOOK
+	{ cpu_book_mask, },
+#endif
+	{ cpu_cpu_mask, },
+	{ NULL, },
+};
+
+static struct sched_domain_topology_level *sched_domain_topology = default_topology;
+
+#define for_each_sd_topology(tl)			\
+	for (tl = sched_domain_topology; tl->mask; tl++)
+
+#ifdef CONFIG_NUMA
 
 static const struct cpumask *sd_numa_mask(int cpu)
 {
@@ -5976,7 +6016,7 @@ static void sched_init_numa(void)
 	/*
 	 * Copy the default topology bits..
 	 */
-	for (i = 0; default_topology[i].init; i++)
+	for (i = 0; default_topology[i].mask; i++)
 		tl[i] = default_topology[i];
 
 	/*
@@ -5984,8 +6024,8 @@ static void sched_init_numa(void)
 	 */
 	for (j = 0; j < level; i++, j++) {
 		tl[i] = (struct sched_domain_topology_level){
-			.init = sd_numa_init,
 			.mask = sd_numa_mask,
+			.sd_flags = SD_NUMA,
 			.flags = SDTL_OVERLAP,
 			.numa_level = j,
 		};
@@ -6145,7 +6185,7 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
 		const struct cpumask *cpu_map, struct sched_domain_attr *attr,
 		struct sched_domain *child, int cpu)
 {
-	struct sched_domain *sd = tl->init(tl, cpu);
+	struct sched_domain *sd = sd_init(tl, cpu);
 	if (!sd)
 		return child;
 

^ permalink raw reply related	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH v5 01/14] sched: add a new arch_sd_local_flags for sched_domain init
  2013-11-05 14:06   ` Peter Zijlstra
@ 2013-11-05 14:57     ` Vincent Guittot
  2013-11-05 22:27       ` Peter Zijlstra
  0 siblings, 1 reply; 101+ messages in thread
From: Vincent Guittot @ 2013-11-05 14:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Paul Turner, Morten Rasmussen,
	cmetcalf, tony.luck, Alex Shi, Preeti U Murthy, linaro-kernel,
	Rafael J. Wysocki, Paul McKenney, Jonathan Corbet,
	Thomas Gleixner, Len Brown, Arjan van de Ven, Amit Kucheria,
	Lukasz Majewski, james.hogan, schwidefsky, heiko.carstens

On 5 November 2013 15:06, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Oct 18, 2013 at 01:52:15PM +0200, Vincent Guittot wrote:
>> The function arch_sd_local_flags is used to set flags in sched_domains
>> according to the platform architecture. A new flag SD_SHARE_POWERDOMAIN is
>> also created to reflect whether groups of CPUs in a sched_domain level can or
>> not reach different power state. As an example, the flag should be cleared at
>> CPU level if groups of cores can be power gated independently. This information
>> is used to decide if it's worth packing some tasks in a group of CPUs in order
>> to power gate the other groups instead of spreading the tasks. The default
>> behavior of the scheduler is to spread tasks across CPUs and groups of CPUs so
>> the flag is set into all sched_domains.
>>
>> The cpu parameter of arch_sd_local_flags can be used by architecture to fine
>> tune the scheduler domain flags. As an example SD_SHARE_POWERDOMAIN flag can be
>> set differently for groups of CPUs according to DT information
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> Not your fault, but you're making a bigger mess of the arch topology
> interface.
>
> How about we start with something like the below -- compile tested only.
>
> And then try and lift default_topology so that an arch can override it?

Your proposal looks fine for me. It's clearly better to move in one
place the configuration of sched_domain fields. Have you already got
an idea about how to let architecture override the topology?

My primary need comes from the fact that the topology configuration is
not the same for all cores

Vincent

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH v5 01/14] sched: add a new arch_sd_local_flags for sched_domain init
  2013-11-05 14:57     ` Vincent Guittot
@ 2013-11-05 22:27       ` Peter Zijlstra
  2013-11-06 10:10         ` Vincent Guittot
                           ` (2 more replies)
  0 siblings, 3 replies; 101+ messages in thread
From: Peter Zijlstra @ 2013-11-05 22:27 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Ingo Molnar, Paul Turner, Morten Rasmussen,
	cmetcalf, tony.luck, Alex Shi, Preeti U Murthy, linaro-kernel,
	Rafael J. Wysocki, Paul McKenney, Jonathan Corbet,
	Thomas Gleixner, Len Brown, Arjan van de Ven, Amit Kucheria,
	Lukasz Majewski, james.hogan, schwidefsky, heiko.carstens

On Tue, Nov 05, 2013 at 03:57:23PM +0100, Vincent Guittot wrote:
> Your proposal looks fine for me. It's clearly better to move in one
> place the configuration of sched_domain fields. Have you already got
> an idea about how to let architecture override the topology?

Maybe something like the below -- completely untested (my s390 compiler
is on a machine that's currently powered off).

> My primary need comes from the fact that the topology configuration is
> not the same for all cores

Do expand.. the various cpu masks used in the topology list are per cpu,
is that sufficient room to wriggle or do you need more?

---
--- a/arch/s390/kernel/smp.c
+++ b/arch/s390/kernel/smp.c
@@ -1070,3 +1070,23 @@ static int __init s390_smp_init(void)
 	return 0;
 }
 subsys_initcall(s390_smp_init);
+
+static struct sched_domain_topology_level s390_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES },
+#endif
+#ifdef CONFIG_SCHED_MC
+	{ cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES },
+#endif
+#ifdef CONFIG_SCHED_BOOK
+	{ cpu_book_mask, },
+#endif
+	{ cpu_cpu_mask, },
+	{ NULL, },
+};
+
+static int __init s390_sched_topology(void)
+{
+	sched_domain_topology = s390_topology;
+}
+early_initcall(s390_sched_topology);
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -889,6 +889,20 @@ void free_sched_domains(cpumask_var_t do
 
 bool cpus_share_cache(int this_cpu, int that_cpu);
 
+typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
+
+#define SDTL_OVERLAP	0x01
+
+struct sched_domain_topology_level {
+	sched_domain_mask_f mask;
+	int		    sd_flags;
+	int		    flags;
+	int		    numa_level;
+	struct sd_data      data;
+};
+
+extern struct sched_domain_topology_level *sched_domain_topology;
+
 #else /* CONFIG_SMP */
 
 struct sched_domain_attr;
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5377,20 +5377,6 @@ enum s_alloc {
 	sa_none,
 };
 
-struct sched_domain_topology_level;
-
-typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
-
-#define SDTL_OVERLAP	0x01
-
-struct sched_domain_topology_level {
-	sched_domain_mask_f mask;
-	int		    sd_flags;
-	int		    flags;
-	int		    numa_level;
-	struct sd_data      data;
-};
-
 /*
  * Build an iteration mask that can exclude certain CPUs from the upwards
  * domain traversal.
@@ -5841,6 +5827,7 @@ sd_init(struct sched_domain_topology_lev
 
 	return sd;
 }
+
 /*
  * Topology list, bottom-up.
  */
@@ -5851,14 +5838,11 @@ static struct sched_domain_topology_leve
 #ifdef CONFIG_SCHED_MC
 	{ cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES },
 #endif
-#ifdef CONFIG_SCHED_BOOK
-	{ cpu_book_mask, },
-#endif
 	{ cpu_cpu_mask, },
 	{ NULL, },
 };
 
-static struct sched_domain_topology_level *sched_domain_topology = default_topology;
+struct sched_domain_topology_level *sched_domain_topology = default_topology;
 
 #define for_each_sd_topology(tl)			\
 	for (tl = sched_domain_topology; tl->mask; tl++)

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH v5 01/14] sched: add a new arch_sd_local_flags for sched_domain init
  2013-11-05 22:27       ` Peter Zijlstra
@ 2013-11-06 10:10         ` Vincent Guittot
  2013-11-06 13:53         ` Martin Schwidefsky
  2013-12-18 13:13         ` [RFC] sched: CPU topology try Vincent Guittot
  2 siblings, 0 replies; 101+ messages in thread
From: Vincent Guittot @ 2013-11-06 10:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Paul Turner, Morten Rasmussen,
	cmetcalf, tony.luck, Alex Shi, Preeti U Murthy, linaro-kernel,
	Rafael J. Wysocki, Paul McKenney, Jonathan Corbet,
	Thomas Gleixner, Len Brown, Arjan van de Ven, Amit Kucheria,
	Lukasz Majewski, james.hogan, schwidefsky, heiko.carstens

On 5 November 2013 23:27, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Nov 05, 2013 at 03:57:23PM +0100, Vincent Guittot wrote:
>> Your proposal looks fine for me. It's clearly better to move in one
>> place the configuration of sched_domain fields. Have you already got
>> an idea about how to let architecture override the topology?
>
> Maybe something like the below -- completely untested (my s390 compiler
> is on a machine that's currently powered off).
>
>> My primary need comes from the fact that the topology configuration is
>> not the same for all cores
>
> Do expand.. the various cpu masks used in the topology list are per cpu,
> is that sufficient room to wriggle or do you need more?

My current implementation sets a flag in each level (SMT, MC and CPU)
to describe the power gating capabilities of the groups of cpus, but
the capabilities can differ within the same level; I mean that one
group of cpus in the system can power gate at MC level whereas another
group of CPUs can only power gate at CPU level. With the current
implementation I can't make that distinction, so I have added the cpu
parameter when setting the flags.
The other solution is to add new topology levels with cpu masks that
describe the power dependency with other CPUs (currently the power
gating, but we could have more levels for frequency dependency as an
example). In this case the current implementation is enough and the
main difficulty will be where to insert these new levels relative to
the current ones.

A typical example, with one cluster that can power gate at core level
whereas the other cluster can only power gate at cluster level, will give
the following domain topology:

If we set a flag in the current topology levels, we should have
something like below:

CPU0:
domain 0: span 0-1 level: SMT flags: SD_SHARE_CPUPOWER |
SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
    groups: 0 1
domain 1: span 0-7 level: MC flags: SD_SHARE_PKG_RESOURCES
    groups: 0-1 2-3 4-5 6-7
domain 2: span 0-15 level: CPU flags:
    groups: 0-7 8-15

CPU8
domain 0: span 8-9 level: SMT flags: SD_SHARE_CPUPOWER |
SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
    groups: 8 9
domain 1: span 8-15 level: MC flags: SD_SHARE_PKG_RESOURCES |
SD_SHARE_POWERDOMAIN
    groups: 8-9 10-11 12-13 14-15
domain 2: span 0-15 level CPU flags:
    groups: 8-15 0-7

If we create new levels, we could have something like below:

CPU0
domain 0: span 0-1 level: SMT flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES
    groups: 0 1
domain 1: span 0-7 level: MC flags: SD_SHARE_PKG_RESOURCES
    groups: 0-1 2-3 4-5 6-7
domain 2: span 0-15 level PWR flags  SD_NOT_SHARE_POWERDOMAIN
    groups: 0-1 2-3 4-5 6-7 8-15
domain 3: span 0-15 level: CPU flags:
    groups: 0-7 8-15

CPU8
domain 0: span 8-9 level: SMT flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES
    groups: 8 9
domain 1: span 8-15 level: MC flags: SD_SHARE_PKG_RESOURCES
    groups: 8-9 10-11 12-13 14-15
domain 2: span 0-15 level PWR flags  SD_NOT_SHARE_POWERDOMAIN
    groups: 0-1 2-3 4-5 6-7 8-15
domain 3: span 0-15 level CPU flags:
    groups: 8-15 0-7
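
As a rough sketch of the second option (illustrative only;
cpu_powerdomain_mask is a hypothetical per-cpu helper returning the CPUs
that must be powered up and down together), the topology table could look
like:

static struct sched_domain_topology_level asym_topology[] = {
#ifdef CONFIG_SCHED_SMT
	{ cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES },
#endif
#ifdef CONFIG_SCHED_MC
	{ cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES },
#endif
	/* hypothetical PWR level: spans the CPUs sharing a power domain */
	{ cpu_powerdomain_mask, },
	{ cpu_cpu_mask, },
	{ NULL, },
};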

Vincent



* Re: [RFC][PATCH v5 01/14] sched: add a new arch_sd_local_flags for sched_domain init
  2013-11-05 22:27       ` Peter Zijlstra
  2013-11-06 10:10         ` Vincent Guittot
@ 2013-11-06 13:53         ` Martin Schwidefsky
  2013-11-06 14:08           ` Peter Zijlstra
  2013-12-18 13:13         ` [RFC] sched: CPU topology try Vincent Guittot
  2 siblings, 1 reply; 101+ messages in thread
From: Martin Schwidefsky @ 2013-11-06 13:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, linux-kernel, Ingo Molnar, Paul Turner,
	Morten Rasmussen, cmetcalf, tony.luck, Alex Shi, Preeti U Murthy,
	linaro-kernel, Rafael J. Wysocki, Paul McKenney, Jonathan Corbet,
	Thomas Gleixner, Len Brown, Arjan van de Ven, Amit Kucheria,
	Lukasz Majewski, james.hogan, heiko.carstens

On Tue, 5 Nov 2013 23:27:52 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Nov 05, 2013 at 03:57:23PM +0100, Vincent Guittot wrote:
> > Your proposal looks fine for me. It's clearly better to move in one
> > place the configuration of sched_domain fields. Have you already got
> > an idea about how to let architecture override the topology?
> 
> Maybe something like the below -- completely untested (my s390 compiler
> is on a machine that's currently powered off).

In principle I do not see a reason why this should not work, but there
are a few more things to take care of. E.g. struct sd_data is defined
in kernel/sched/core.c, and cpu_cpu_mask as well. These need to be moved
to a header where arch/s390/kernel/smp.c can pick them up.

I do have the feeling that sched_domain_topology should be left
where it is, or do we really want to expose more of the scheduler
internals?

> > My primary need comes from the fact that the topology configuration is
> > not the same for all cores
> 
> Do expand.. the various cpu masks used in the topology list are per cpu,
> is that sufficient room to wriggle or do you need more?
> 
> ---
> --- a/arch/s390/kernel/smp.c
> +++ b/arch/s390/kernel/smp.c
> @@ -1070,3 +1070,23 @@ static int __init s390_smp_init(void)
>  	return 0;
>  }
>  subsys_initcall(s390_smp_init);
> +
> +static struct sched_domain_topology_level s390_topology[] = {
> +#ifdef CONFIG_SCHED_SMT
> +	{ cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES },
> +#endif
> +#ifdef CONFIG_SCHED_MC
> +	{ cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES },
> +#endif
> +#ifdef CONFIG_SCHED_BOOK
> +	{ cpu_book_mask, },
> +#endif
> +	{ cpu_cpu_mask, },
> +	{ NULL, },
> +};
> +
> +static int __init s390_sched_topology(void)
> +{
> +	sched_domain_topology = s390_topology;
> +}
> +early_initcall(s390_sched_topology);
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -889,6 +889,20 @@ void free_sched_domains(cpumask_var_t do
> 
>  bool cpus_share_cache(int this_cpu, int that_cpu);
> 
> +typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
> +
> +#define SDTL_OVERLAP	0x01
> +
> +struct sched_domain_topology_level {
> +	sched_domain_mask_f mask;
> +	int		    sd_flags;
> +	int		    flags;
> +	int		    numa_level;
> +	struct sd_data      data;
> +};
> +
> +extern struct sched_domain_topology_level *sched_domain_topology;
> +
>  #else /* CONFIG_SMP */
> 
>  struct sched_domain_attr;
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5377,20 +5377,6 @@ enum s_alloc {
>  	sa_none,
>  };
> 
> -struct sched_domain_topology_level;
> -
> -typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
> -
> -#define SDTL_OVERLAP	0x01
> -
> -struct sched_domain_topology_level {
> -	sched_domain_mask_f mask;
> -	int		    sd_flags;
> -	int		    flags;
> -	int		    numa_level;
> -	struct sd_data      data;
> -};
> -
>  /*
>   * Build an iteration mask that can exclude certain CPUs from the upwards
>   * domain traversal.
> @@ -5841,6 +5827,7 @@ sd_init(struct sched_domain_topology_lev
> 
>  	return sd;
>  }
> +
>  /*
>   * Topology list, bottom-up.
>   */
> @@ -5851,14 +5838,11 @@ static struct sched_domain_topology_leve
>  #ifdef CONFIG_SCHED_MC
>  	{ cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES },
>  #endif
> -#ifdef CONFIG_SCHED_BOOK
> -	{ cpu_book_mask, },
> -#endif
>  	{ cpu_cpu_mask, },
>  	{ NULL, },
>  };
> 
> -static struct sched_domain_topology_level *sched_domain_topology = default_topology;
> +struct sched_domain_topology_level *sched_domain_topology = default_topology;
> 
>  #define for_each_sd_topology(tl)			\
>  	for (tl = sched_domain_topology; tl->mask; tl++)
> 


-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



* Re: [RFC][PATCH v5 01/14] sched: add a new arch_sd_local_flags for sched_domain init
  2013-11-06 13:53         ` Martin Schwidefsky
@ 2013-11-06 14:08           ` Peter Zijlstra
  2013-11-12 17:43             ` Dietmar Eggemann
  0 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2013-11-06 14:08 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Vincent Guittot, linux-kernel, Ingo Molnar, Paul Turner,
	Morten Rasmussen, cmetcalf, tony.luck, Alex Shi, Preeti U Murthy,
	linaro-kernel, Rafael J. Wysocki, Paul McKenney, Jonathan Corbet,
	Thomas Gleixner, Len Brown, Arjan van de Ven, Amit Kucheria,
	Lukasz Majewski, james.hogan, heiko.carstens

On Wed, Nov 06, 2013 at 02:53:44PM +0100, Martin Schwidefsky wrote:
> On Tue, 5 Nov 2013 23:27:52 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Tue, Nov 05, 2013 at 03:57:23PM +0100, Vincent Guittot wrote:
> > > Your proposal looks fine for me. It's clearly better to move in one
> > > place the configuration of sched_domain fields. Have you already got
> > > an idea about how to let architecture override the topology?
> > 
> > Maybe something like the below -- completely untested (my s390 compiler
> > is on a machine that's currently powered off).
> 
> In principle I do not see a reason why this should not work, but there
> are a few more things to take care of. E.g. struct sd_data is defined
> in kernel/sched/core.c, cpu_cpu_mask as well. These need to be moved
> to a header where arch/s390/kernel/smp.c can pick it up.
> 
> I do have the feeling that the sched_domain_topology should be left
> where they are, or do we really want to expose more of the scheduler
> internals?

Ah, it's a trade-off; in that previous patch I removed the entire
sched_domain initializers the archs used to 'have' to fill out. That
exposed far too much behavioural stuff the archs really shouldn't
bother with.

In return we now provide a (hopefully) simpler interface that allows
archs to communicate their topology to the scheduler -- without getting
mixed up in the behavioural aspects (too much).

Maybe s390 wasn't the best example to pick, as the book domain really
isn't that exciting. Arguably I should have taken Power7+ and the
ASYM_PACKING SMT thing.
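
For reference, a hedged sketch of what that might look like with the
proposed table (the table name is hypothetical; SD_ASYM_PACKING is the
existing flag POWER7 uses for SMT packing today):

static struct sched_domain_topology_level power7_topology[] = {
#ifdef CONFIG_SCHED_SMT
	/* pack work onto the lowest-numbered SMT threads of a core first */
	{ cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING },
#endif
	{ cpu_cpu_mask, },
	{ NULL, },
};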




* Re: [RFC][PATCH v5 00/14] sched: packing tasks
  2013-10-18 11:52 [RFC][PATCH v5 00/14] sched: packing tasks Vincent Guittot
                   ` (12 preceding siblings ...)
  2013-10-18 11:52 ` [RFC][PATCH v5 14/14] cpuidle: set the current wake up latency Vincent Guittot
@ 2013-11-11 11:33 ` Catalin Marinas
  2013-11-11 16:36   ` Peter Zijlstra
                     ` (3 more replies)
  13 siblings, 4 replies; 101+ messages in thread
From: Catalin Marinas @ 2013-11-11 11:33 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Linux Kernel Mailing List, Peter Zijlstra, Ingo Molnar,
	Paul Turner, Morten Rasmussen, Chris Metcalf, Tony Luck,
	alex.shi, Preeti U Murthy, linaro-kernel, len.brown, l.majewski,
	Jonathan Corbet, Rafael J. Wysocki, Paul McKenney,
	Arjan van de Ven, linux-pm

Hi Vincent,

(cross-posting to linux-pm as it was agreed to follow up on this list)

On 18 October 2013 12:52, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> This is the 5th version of the previously named "packing small tasks" patchset.
> "small" has been removed because the patchset doesn't only target small tasks
> anymore.
>
> This patchset takes advantage of the new per-task load tracking that is
> available in the scheduler to pack the tasks in a minimum number of
> CPU/Cluster/Core. The packing mechanism takes into account the power gating
> topology of the CPUs to minimize the number of power domains that need to be
> powered on simultaneously.

As a general comment, it's not clear how this set of patches addresses
the bigger problem of energy-aware scheduling, mainly because we
haven't yet defined _what_ we want from the scheduler, what the
scenarios and constraints are, whether we are prepared to give up some
performance (speed, latency) for power, and how much.

This packing heuristic may work for certain SoCs and workloads but,
for example, there are modern ARM SoCs where the P-state has a much
bigger effect on power and it's more energy-efficient to keep two CPUs
at a lower P-state than to pack all tasks onto one, even though they may
be gated independently. In such cases _small_ task packing (for some
definition of 'small') would be more useful than general packing, but
even this is just a heuristic that saves power for particular workloads
without fully defining/addressing the problem.
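
(A purely illustrative back-of-the-envelope example of this point, with
made-up numbers, assuming dynamic power scales roughly as V^2 * f: one
CPU at 1.0 GHz and 1.1 V costs ~1.21 units, while two CPUs at 0.5 GHz
and 0.9 V cost ~2 * 0.81 * 0.5 = 0.81 units for the same aggregate
throughput, ignoring leakage and idle costs. On such a platform
spreading beats packing.)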

I would rather start by defining the main goal and working backwards
to an algorithm. We may well find that task packing based on this
patch set is sufficient, but we may also get packing-like behaviour as
a side effect of a broader approach (better energy cost awareness). An
important aspect, even in the mobile space, is keeping the performance
as close as possible to the standard scheduler while saving a bit more
power. Just trying to reduce the number of non-idle CPUs may not meet
this requirement.


So, IMO, defining the power topology is a good starting point and I
think it's better to separate the patches from the energy saving
algorithms like packing. We need to agree on what information we have
(C-state details, coupling, power gating) and what we can/need to
expose to the scheduler. This can be revisited once we start
implementing/refining the energy awareness.

2nd step is how the _current_ scheduler could use such information
while keeping the current overall system behaviour (how much of
cpuidle we should move into the scheduler).

Question for Peter/Ingo: do you want the scheduler to decide on which
C-state a CPU should be in or we still leave this to a cpuidle
layer/driver?

My understanding from the recent discussions is that the scheduler
should decide directly on the C-state (or rather the deepest C-state
possible since we don't want to duplicate the backend logic for
synchronising CPUs going up or down). This means that the scheduler
needs to know about C-state target residency, wake-up latency (I think
we can leave coupled C-states to the backend, there is some complex
synchronisation which I wouldn't duplicate).

Alternatively (my preferred approach), we get the scheduler to predict
and pass the expected residency and latency requirements down to a
power driver and read back the actual C-states for making task
placement decisions. Some of the menu governor prediction logic could
be turned into a library and used by the scheduler. Basically what
this tries to achieve is better scheduler awareness of the current
C-states decided by a cpuidle/power driver based on the scheduler
constraints.
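
A minimal sketch of what such an interface could look like, purely
illustrative: the structure and function names are assumptions, not
existing kernel APIs.

#include <linux/types.h>
#include <linux/percpu.h>

/*
 * Hypothetical shared per-cpu structure: the scheduler writes its
 * prediction, the cpuidle/power driver reads it when picking a
 * C-state and publishes back the cost of the state it chose.
 */
struct sched_idle_hint {
	u64	expected_residency_ns;	/* scheduler's idle-time prediction */
	u64	latency_req_ns;		/* max tolerated wake-up latency */
	u64	actual_exit_latency_ns;	/* filled in by the idle driver */
};

DECLARE_PER_CPU(struct sched_idle_hint, sched_idle_hint);

/* Scheduler side: pass the prediction down before a cpu goes idle. */
void sched_set_idle_hint(int cpu, u64 residency_ns, u64 latency_ns);

/* Scheduler side: read back the wake-up cost when placing a woken task. */
u64 sched_get_wakeup_latency(int cpu);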

3rd step is optimising the scheduler for energy saving, taking into
account the information added by the previous steps and possibly
adding some more. This stage however has several sub-steps (that can
be worked on in parallel to the steps above):

a) Define use-cases, typical workloads, acceptance criteria
(performance, latency requirements).

b) Set of benchmarks simulating the scenarios above. I wouldn't bother
with linsched since a power model is never realistic enough. It's
better to run those benchmarks on real hardware and either estimate
the energy based on the C/P states or, depending on SoC, read some
sensors, energy probes. If the scheduler maintainers want to reproduce
the numbers, I'm pretty sure we can ship some boards.

c) Start defining/implementing scheduler algorithm to do optimal task placement.

d) Assess the implementation against benchmarks at (b) *and* other
typical performance benchmarks (whether it's for servers, mobile,
Android etc). At this point we'll most likely go back and refine the
previous steps.

So far we've jumped directly to (c) because we had some scenarios in
mind that needed optimising but those haven't been written down and we
don't have a clear way to assess the impact. There is more here than
simply maximising the idle time. Ideally the scheduler should have an
estimate of the overall energy cost, the cost per task, run-queue, the
energy implications of moving the tasks to another run-queue, possibly
taking the P-state into account (but not 'picking' a P-state).

Anyway, I think we need to address the first steps and think about the
algorithm once we have the bigger picture of what we try to solve.

Thanks.

-- 
Catalin


* Re: [RFC][PATCH v5 00/14] sched: packing tasks
  2013-11-11 11:33 ` [RFC][PATCH v5 00/14] sched: packing tasks Catalin Marinas
@ 2013-11-11 16:36   ` Peter Zijlstra
  2013-11-11 16:39     ` Arjan van de Ven
                       ` (2 more replies)
  2013-11-11 16:38   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  3 siblings, 3 replies; 101+ messages in thread
From: Peter Zijlstra @ 2013-11-11 16:36 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Vincent Guittot, Linux Kernel Mailing List, Ingo Molnar,
	Paul Turner, Morten Rasmussen, Chris Metcalf, Tony Luck,
	alex.shi, Preeti U Murthy, linaro-kernel, len.brown, l.majewski,
	Jonathan Corbet, Rafael J. Wysocki, Paul McKenney,
	Arjan van de Ven, linux-pm

On Mon, Nov 11, 2013 at 11:33:45AM +0000, Catalin Marinas wrote:

tl;dr :-) Still trying to wrap my head around how to do that weird
topology Vincent raised..

> Question for Peter/Ingo: do you want the scheduler to decide on which
> C-state a CPU should be in or we still leave this to a cpuidle
> layer/driver?

I think we can leave most of that in a driver, right along with how to
prod the hardware to actually get into that state.

I think the most important parts are what is now 'generic' code; stuff
that guestimates the idle-time and so forth.

I think the scheduler simply wants to say: we expect to go idle for X
ns, we want a guaranteed wakeup latency of Y ns -- go do your thing.

I think you also raised the point that we do want some feedback as to
the cost of waking up particular cores, to better decide which one to
wake. That is indeed so.


* Re: [RFC][PATCH v5 00/14] sched: packing tasks
  2013-11-11 11:33 ` [RFC][PATCH v5 00/14] sched: packing tasks Catalin Marinas
  2013-11-11 16:36   ` Peter Zijlstra
@ 2013-11-11 16:38   ` Peter Zijlstra
  2013-11-11 16:40     ` Arjan van de Ven
  2013-11-12 10:36     ` Vincent Guittot
  2013-11-11 16:54   ` Morten Rasmussen
  2013-11-12 12:35   ` Vincent Guittot
  3 siblings, 2 replies; 101+ messages in thread
From: Peter Zijlstra @ 2013-11-11 16:38 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Vincent Guittot, Linux Kernel Mailing List, Ingo Molnar,
	Paul Turner, Morten Rasmussen, Chris Metcalf, Tony Luck,
	alex.shi, Preeti U Murthy, linaro-kernel, len.brown, l.majewski,
	Jonathan Corbet, Rafael J. Wysocki, Paul McKenney,
	Arjan van de Ven, linux-pm

On Mon, Nov 11, 2013 at 11:33:45AM +0000, Catalin Marinas wrote:
> My understanding from the recent discussions is that the scheduler
> should decide directly on the C-state (or rather the deepest C-state
> possible since we don't want to duplicate the backend logic for
> synchronising CPUs going up or down). This means that the scheduler
> needs to know about C-state target residency, wake-up latency (I think
> we can leave coupled C-states to the backend, there is some complex
> synchronisation which I wouldn't duplicate).
> 
> Alternatively (my preferred approach), we get the scheduler to predict
> and pass the expected residency and latency requirements down to a
> power driver and read back the actual C-states for making task
> placement decisions. Some of the menu governor prediction logic could
> be turned into a library and used by the scheduler. Basically what
> this tries to achieve is better scheduler awareness of the current
> C-states decided by a cpuidle/power driver based on the scheduler
> constraints.

Ah yes.. so I _think_ the scheduler wants to eventually know about idle
topology constraints. But we can get there in a gradual fashion I hope.

Like the package C states on x86 -- for those to be effective the
scheduler needs to pack tasks and keep entire packages idle for as long
as possible.




* Re: [RFC][PATCH v5 00/14] sched: packing tasks
  2013-11-11 16:36   ` Peter Zijlstra
@ 2013-11-11 16:39     ` Arjan van de Ven
  2013-11-11 18:18       ` Catalin Marinas
  2013-11-12 17:40     ` Catalin Marinas
  2013-11-25 18:55     ` Daniel Lezcano
  2 siblings, 1 reply; 101+ messages in thread
From: Arjan van de Ven @ 2013-11-11 16:39 UTC (permalink / raw)
  To: Peter Zijlstra, Catalin Marinas
  Cc: Vincent Guittot, Linux Kernel Mailing List, Ingo Molnar,
	Paul Turner, Morten Rasmussen, Chris Metcalf, Tony Luck,
	alex.shi, Preeti U Murthy, linaro-kernel, len.brown, l.majewski,
	Jonathan Corbet, Rafael J. Wysocki, Paul McKenney, linux-pm


> I think the scheduler simply wants to say: we expect to go idle for X
> ns, we want a guaranteed wakeup latency of Y ns -- go do your thing.

as long as Y normally is "large" or "infinity" that is ok ;-)
(a smaller Y will increase power consumption and decrease system performance)



> I think you also raised the point in that we do want some feedback as to
> the cost of waking up particular cores to better make decisions on which
> to wake. That is indeed so.

Having a hardware driver give a preferred CPU ordering for wakeups can indeed be useful.
(I'm doubtful that changing the recommendation for each idle is going to pay off,
but the proof is in the pudding; there are certainly long-term effects where this can help.)

>



* Re: [RFC][PATCH v5 00/14] sched: packing tasks
  2013-11-11 16:38   ` Peter Zijlstra
@ 2013-11-11 16:40     ` Arjan van de Ven
  2013-11-12 10:36     ` Vincent Guittot
  1 sibling, 0 replies; 101+ messages in thread
From: Arjan van de Ven @ 2013-11-11 16:40 UTC (permalink / raw)
  To: Peter Zijlstra, Catalin Marinas
  Cc: Vincent Guittot, Linux Kernel Mailing List, Ingo Molnar,
	Paul Turner, Morten Rasmussen, Chris Metcalf, Tony Luck,
	alex.shi, Preeti U Murthy, linaro-kernel, len.brown, l.majewski,
	Jonathan Corbet, Rafael J. Wysocki, Paul McKenney, linux-pm

On 11/11/2013 8:38 AM, Peter Zijlstra wrote:
> Like the package C states on x86 -- for those to be effective the
> scheduler needs to pack tasks and keep entire packages idle for as long
> as possible.

"package" C states on x86 are not really per package... but system wide.
the name is very confusing.



* Re: [RFC][PATCH v5 00/14] sched: packing tasks
  2013-11-11 11:33 ` [RFC][PATCH v5 00/14] sched: packing tasks Catalin Marinas
  2013-11-11 16:36   ` Peter Zijlstra
  2013-11-11 16:38   ` Peter Zijlstra
@ 2013-11-11 16:54   ` Morten Rasmussen
  2013-11-11 18:31     ` Catalin Marinas
  2013-11-12 12:35   ` Vincent Guittot
  3 siblings, 1 reply; 101+ messages in thread
From: Morten Rasmussen @ 2013-11-11 16:54 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Vincent Guittot, Linux Kernel Mailing List, Peter Zijlstra,
	Ingo Molnar, Paul Turner, Chris Metcalf, Tony Luck, alex.shi,
	Preeti U Murthy, linaro-kernel, len.brown, l.majewski,
	Jonathan Corbet, Rafael J. Wysocki, Paul McKenney,
	Arjan van de Ven, linux-pm

On Mon, Nov 11, 2013 at 11:33:45AM +0000, Catalin Marinas wrote:
> Hi Vincent,
> 
> (cross-posting to linux-pm as it was agreed to follow up on this list)
> 
> On 18 October 2013 12:52, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> > This is the 5th version of the previously named "packing small tasks" patchset.
> > "small" has been removed because the patchset doesn't only target small tasks
> > anymore.
> >
> > This patchset takes advantage of the new per-task load tracking that is
> > available in the scheduler to pack the tasks in a minimum number of
> > CPU/Cluster/Core. The packing mechanism takes into account the power gating
> > topology of the CPUs to minimize the number of power domains that need to be
> > powered on simultaneously.
> 
> As a general comment, it's not clear how this set of patches address
> the bigger problem of energy aware scheduling, mainly because we
> haven't yet defined _what_ we want from the scheduler, what the
> scenarios are, constraints, are we prepared to give up some
> performance (speed, latency) for power, how much.
> 
> This packing heuristics may work for certain SoCs and workloads but,
> for example, there are modern ARM SoCs where the P-state has a much
> bigger effect on power and it's more energy-efficient to keep two CPUs
> in lower P-state than packing all tasks onto one, even though they may
> be gated independently. In such cases _small_ task packing (for some
> definition of 'small') would be more useful than general packing but
> even this is just heuristics that saves power for particular workloads
> without fully defining/addressing the problem.

When it comes to packing, I think the important things to figure out are
when to do it and how much. Those questions can only be answered when
the performance/energy trade-offs are known for the particular platform.
Packing seems to be a good idea for very small tasks, but I'm not so
sure about medium and big tasks. Packing the latter could lead to worse
performance (latency).

> 
> I would rather start by defining the main goal and working backwards
> to an algorithm. We may as well find that task packing based on this
> patch set is sufficient but we may also get packing-like behaviour as
> a side effect of a broader approach (better energy cost awareness). An
> important aspect even in the mobile space is keeping the performance
> as close as possible to the standard scheduler while saving a bit more

With the exception of big.LITTLE, where we want to outperform the
standard scheduler while saving power.

> power. Just trying to reduce the number of non-idle CPUs may not meet
> this requirement.
> 
> 
> So, IMO, defining the power topology is a good starting point and I
> think it's better to separate the patches from the energy saving
> algorithms like packing. We need to agree on what information we have
> (C-state details, coupling, power gating) and what we can/need to
> expose to the scheduler. This can be revisited once we start
> implementing/refining the energy awareness.
> 
> 2nd step is how the _current_ scheduler could use such information
> while keeping the current overall system behaviour (how much of
> cpuidle we should move into the scheduler).
> 
> Question for Peter/Ingo: do you want the scheduler to decide on which
> C-state a CPU should be in or we still leave this to a cpuidle
> layer/driver?
> 
> My understanding from the recent discussions is that the scheduler
> should decide directly on the C-state (or rather the deepest C-state
> possible since we don't want to duplicate the backend logic for
> synchronising CPUs going up or down). This means that the scheduler
> needs to know about C-state target residency, wake-up latency (I think
> we can leave coupled C-states to the backend, there is some complex
> synchronisation which I wouldn't duplicate).

It would be nice and simple to hide the complexity of the coupled
C-states, but we would lose the ability to prefer waking up cpus in a
cluster/package that already has non-idle cpus over cpus in a
cluster/package that has entered the coupled C-state. If we just know
the requested C-state of a cpu we can't tell the difference as it is
now.

> 
> Alternatively (my preferred approach), we get the scheduler to predict
> and pass the expected residency and latency requirements down to a
> power driver and read back the actual C-states for making task
> placement decisions. Some of the menu governor prediction logic could
> be turned into a library and used by the scheduler. Basically what
> this tries to achieve is better scheduler awareness of the current
> C-states decided by a cpuidle/power driver based on the scheduler
> constraints.

It might be easier to deal with the coupled C-states using this approach.

> 
> 3rd step is optimising the scheduler for energy saving, taking into
> account the information added by the previous steps and possibly
> adding some more. This stage however has several sub-steps (that can
> be worked on in parallel to the steps above):
> 
> a) Define use-cases, typical workloads, acceptance criteria
> (performance, latency requirements).
> 
> b) Set of benchmarks simulating the scenarios above. I wouldn't bother
> with linsched since a power model is never realistic enough. It's
> better to run those benchmarks on real hardware and either estimate
> the energy based on the C/P states or, depending on SoC, read some
> sensors, energy probes. If the scheduler maintainers want to reproduce
> the numbers, I'm pretty sure we can ship some boards.
> 
> c) Start defining/implementing scheduler algorithm to do optimal task placement.
> 
> d) Assess the implementation against benchmarks at (b) *and* other
> typical performance benchmarks (whether it's for servers, mobile,
> Android etc). At this point we'll most likely go back and refine the
> previous steps.
> 
> So far we've jumped directly to (c) because we had some scenarios in
> mind that needed optimising but those haven't been written down and we
> don't have a clear way to assess the impact. There is more here than
> simply maximising the idle time. Ideally the scheduler should have an
> estimate of the overall energy cost, the cost per task, run-queue, the
> energy implications of moving the tasks to another run-queue, possibly
> taking the P-state into account (but not 'picking' a P-state).

The energy cost depends strongly on the P-state. I'm not sure if we can
avoid using at least a rough estimate of the P-state or a similar
metric in the energy cost estimation.

> 
> Anyway, I think we need to address the first steps and think about the
> algorithm once we have the bigger picture of what we try to solve.

I agree that we need to have the bigger picture in mind from the
beginning to avoid introducing changes that we later change again or
revert.

Morten


* Re: [RFC][PATCH v5 00/14] sched: packing tasks
  2013-11-11 16:39     ` Arjan van de Ven
@ 2013-11-11 18:18       ` Catalin Marinas
  2013-11-11 18:20         ` Arjan van de Ven
                           ` (2 more replies)
  0 siblings, 3 replies; 101+ messages in thread
From: Catalin Marinas @ 2013-11-11 18:18 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Peter Zijlstra, Vincent Guittot, Linux Kernel Mailing List,
	Ingo Molnar, Paul Turner, Morten Rasmussen, Chris Metcalf,
	Tony Luck, alex.shi, Preeti U Murthy, linaro-kernel, len.brown,
	l.majewski, Jonathan Corbet, Rafael J. Wysocki, Paul McKenney,
	linux-pm

On Mon, Nov 11, 2013 at 04:39:45PM +0000, Arjan van de Ven wrote:
> > I think the scheduler simply wants to say: we expect to go idle for X
> > ns, we want a guaranteed wakeup latency of Y ns -- go do your thing.
> 
> as long as Y normally is "large" or "infinity" that is ok ;-)
> (a smaller Y will increase power consumption and decrease system performance)

Cpuidle already takes a latency into account via pm_qos. The scheduler
could pass this information down to the hardware driver or the cpuidle
driver could use pm_qos directly (as it's currently done in governors).

The scheduler may have its own requirements in terms of latency (e.g.
some real-time thread) and we could extend the pm_qos API with
per-thread information. But so far we don't have a way to pass such
per-thread requirements from user space (unless we assume that any
real-time thread has some fixed latency requirements). I suggest we
ignore this per-thread part until we find an actual need.
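
For reference, the existing pm_qos API already expresses a system-wide
latency constraint; a hedged example of how a driver or subsystem uses
it today (the request variable and values are arbitrary):

#include <linux/pm_qos.h>

static struct pm_qos_request example_latency_req;

/* Ask that no CPU enters an idle state with more than 20 us exit latency. */
static void example_constrain_latency(void)
{
	pm_qos_add_request(&example_latency_req, PM_QOS_CPU_DMA_LATENCY, 20);
}

/* Relax and drop the constraint when it is no longer needed. */
static void example_release_latency(void)
{
	pm_qos_update_request(&example_latency_req, PM_QOS_DEFAULT_VALUE);
	pm_qos_remove_request(&example_latency_req);
}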

> > I think you also raised the point in that we do want some feedback as to
> > the cost of waking up particular cores to better make decisions on which
> > to wake. That is indeed so.
> 
> having a hardware driver give a prefered CPU ordering for wakes can indeed be useful.
> (I'm doubtful that changing the recommendation for each idle is going to pay off,
> but proof is in the pudding; there are certainly long term effects where this can help)

The ordering is based on the actual C-state, so a simple way is to wake
up the CPU in the shallowest C-state. With asymmetric configurations
(big.LITTLE) we have different costs for the same C-state, so this would
come in handy.

Even for symmetric configuration, the cost of moving a task to a CPU
includes wake-up cost plus the run-time cost which depends on the
P-state after wake-up (that's much trickier since we can't easily
estimate the cost of a P-state and it may change once you place a task
on it).

-- 
Catalin


* Re: [RFC][PATCH v5 00/14] sched: packing tasks
  2013-11-11 18:18       ` Catalin Marinas
@ 2013-11-11 18:20         ` Arjan van de Ven
  2013-11-12 12:06         ` Morten Rasmussen
  2013-11-12 16:48         ` Arjan van de Ven
  2 siblings, 0 replies; 101+ messages in thread
From: Arjan van de Ven @ 2013-11-11 18:20 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Peter Zijlstra, Vincent Guittot, Linux Kernel Mailing List,
	Ingo Molnar, Paul Turner, Morten Rasmussen, Chris Metcalf,
	Tony Luck, alex.shi, Preeti U Murthy, linaro-kernel, len.brown,
	l.majewski, Jonathan Corbet, Rafael J. Wysocki, Paul McKenney,
	linux-pm

On 11/11/2013 10:18 AM, Catalin Marinas wrote:

>
> Even for symmetric configuration, the cost of moving a task to a CPU
> includes wake-up cost plus the run-time cost which depends on the
> P-state after wake-up (that's much trickier since we can't easily
> estimate the cost of a P-state and it may change once you place a task
> on it).

Yup, including cache refill times (assuming you picked C-states
that flushed the cache, which will be the common case... but even
if not, since you're moving a task the likelihood of cache coldness is high).



* Re: [RFC][PATCH v5 00/14] sched: packing tasks
  2013-11-11 16:54   ` Morten Rasmussen
@ 2013-11-11 18:31     ` Catalin Marinas
  2013-11-11 19:26       ` Arjan van de Ven
  0 siblings, 1 reply; 101+ messages in thread
From: Catalin Marinas @ 2013-11-11 18:31 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Vincent Guittot, Linux Kernel Mailing List, Peter Zijlstra,
	Ingo Molnar, Paul Turner, Chris Metcalf, Tony Luck, alex.shi,
	Preeti U Murthy, linaro-kernel, len.brown, l.majewski,
	Jonathan Corbet, Rafael J. Wysocki, Paul McKenney,
	Arjan van de Ven, linux-pm

On Mon, Nov 11, 2013 at 04:54:54PM +0000, Morten Rasmussen wrote:
> On Mon, Nov 11, 2013 at 11:33:45AM +0000, Catalin Marinas wrote:
> > I would rather start by defining the main goal and working backwards
> > to an algorithm. We may as well find that task packing based on this
> > patch set is sufficient but we may also get packing-like behaviour as
> > a side effect of a broader approach (better energy cost awareness). An
> > important aspect even in the mobile space is keeping the performance
> > as close as possible to the standard scheduler while saving a bit more
> 
> With the exception of big.LITTLE where we want to out-perform the
> standard scheduler while saving power.

Good point. Maybe we should start with a separate set of patches for
improving the performance on asymmetric configurations like big.LITTLE
while ignoring (deferring) the power aspect. Things like placing bigger
threads on bigger CPUs and so on (you know better what's needed here ;).

> > My understanding from the recent discussions is that the scheduler
> > should decide directly on the C-state (or rather the deepest C-state
> > possible since we don't want to duplicate the backend logic for
> > synchronising CPUs going up or down). This means that the scheduler
> > needs to know about C-state target residency, wake-up latency (I think
> > we can leave coupled C-states to the backend, there is some complex
> > synchronisation which I wouldn't duplicate).
> 
> It would be nice and simple to hide the complexity of the coupled
> C-states, but we would loose the ability to prefer waking up cpus in a
> cluster/package that already has non-idle cpus over cpus in a
> cluster/package that has entered the coupled C-state. If we just know
> the requested C-state of a cpu we can't tell the difference as it is
> now.

I agree, we can't rely on the requested C-state but the _actual_ state
and this means querying the hardware driver. Can we abstract this via
some interface which provides the cost of waking up a CPU? This could
take into account the state of the other CPUs in the cluster and the
scheduler is simply concerned with the wake-up costs.

> > Alternatively (my preferred approach), we get the scheduler to predict
> > and pass the expected residency and latency requirements down to a
> > power driver and read back the actual C-states for making task
> > placement decisions. Some of the menu governor prediction logic could
> > be turned into a library and used by the scheduler. Basically what
> > this tries to achieve is better scheduler awareness of the current
> > C-states decided by a cpuidle/power driver based on the scheduler
> > constraints.
> 
> It might be easier to deal with the couple C-states using this approach.

We already have drivers taking care of the coupled C-states, so it means
passing the information back to the scheduler in some way (actual
C-state or wake-up cost).

It would be nice if we could describe the wake-up costs statically while
considering coupled C-states, but it needs more thinking.
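
A minimal sketch of such a static description, with hypothetical
structure names and made-up numbers, just to illustrate the shape of
the data:

/*
 * Hypothetical per-C-state description a scheduler-side table could
 * consume; real values would come from the platform (DT, firmware
 * tables, ...).
 */
struct cstate_cost {
	unsigned int	exit_latency_us;	/* wake-up latency */
	unsigned int	wakeup_energy;		/* relative energy cost */
};

/* Example little-cluster table: WFI, core off, cluster off. */
static const struct cstate_cost little_cluster_costs[] = {
	{ .exit_latency_us = 1,    .wakeup_energy = 1   },
	{ .exit_latency_us = 300,  .wakeup_energy = 20  },
	{ .exit_latency_us = 1500, .wakeup_energy = 100 },
};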

-- 
Catalin


* Re: [RFC][PATCH v5 00/14] sched: packing tasks
  2013-11-11 18:31     ` Catalin Marinas
@ 2013-11-11 19:26       ` Arjan van de Ven
  2013-11-11 22:43         ` Nicolas Pitre
  2013-11-11 23:43         ` Catalin Marinas
  0 siblings, 2 replies; 101+ messages in thread
From: Arjan van de Ven @ 2013-11-11 19:26 UTC (permalink / raw)
  To: Catalin Marinas, Morten Rasmussen
  Cc: Vincent Guittot, Linux Kernel Mailing List, Peter Zijlstra,
	Ingo Molnar, Paul Turner, Chris Metcalf, Tony Luck, alex.shi,
	Preeti U Murthy, linaro-kernel, len.brown, l.majewski,
	Jonathan Corbet, Rafael J. Wysocki, Paul McKenney, linux-pm

On 11/11/2013 10:31 AM, Catalin Marinas wrote:
> On 11/11/2013 10:31 AM, Catalin Marinas wrote:
> > I agree, we can't rely on the requested C-state but the _actual_ state
> and this means querying the hardware driver. Can we abstract this via
> some interface which provides the cost of waking up a CPU? This could
> take into account the state of the other CPUs in the cluster and the
> scheduler is simply concerned with the wake-up costs.

can you even query this without actually waking up the cpu and asking ???



* Re: [RFC][PATCH v5 00/14] sched: packing tasks
  2013-11-11 19:26       ` Arjan van de Ven
@ 2013-11-11 22:43         ` Nicolas Pitre
  2013-11-11 23:43         ` Catalin Marinas
  1 sibling, 0 replies; 101+ messages in thread
From: Nicolas Pitre @ 2013-11-11 22:43 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Catalin Marinas, Morten Rasmussen, Vincent Guittot,
	Linux Kernel Mailing List, Peter Zijlstra, Ingo Molnar,
	Paul Turner, Chris Metcalf, Tony Luck, alex.shi, Preeti U Murthy,
	linaro-kernel, len.brown, l.majewski, Jonathan Corbet,
	Rafael J. Wysocki, Paul McKenney, linux-pm

On Mon, 11 Nov 2013, Arjan van de Ven wrote:

> On 11/11/2013 10:31 AM, Catalin Marinas wrote:
> > I agree, we can't rely on the requested C-state but the _actual_ state
> > and this means querying the hardware driver. Can we abstract this via
> > some interface which provides the cost of waking up a CPU? This could
> > take into account the state of the other CPUs in the cluster and the
> > scheduler is simply concerned with the wake-up costs.
> 
> can you even query this without actually waking up the cpu and asking ???

On those systems we're interested in we sure can.


Nicolas


* Re: [RFC][PATCH v5 00/14] sched: packing tasks
  2013-11-11 19:26       ` Arjan van de Ven
  2013-11-11 22:43         ` Nicolas Pitre
@ 2013-11-11 23:43         ` Catalin Marinas
  1 sibling, 0 replies; 101+ messages in thread
From: Catalin Marinas @ 2013-11-11 23:43 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Morten Rasmussen, Vincent Guittot, linux-kernel, Peter Zijlstra,
	Ingo Molnar, Paul Turner, Chris Metcalf, Tony Luck, alex.shi,
	Preeti U Murthy, linaro-kernel, len.brown, l.majewski,
	Jonathan Corbet, Rafael J. Wysocki, Paul McKenney, linux-pm

On 11 Nov 2013, at 19:26, Arjan van de Ven <arjan@linux.intel.com> wrote:
> On 11/11/2013 10:31 AM, Catalin Marinas wrote:
>> I agree, we can't rely on the requested C-state but the _actual_ state
>> and this means querying the hardware driver. Can we abstract this via
>> some interface which provides the cost of waking up a CPU? This could
>> take into account the state of the other CPUs in the cluster and the
>> scheduler is simply concerned with the wake-up costs.
> 
> can you even query this without actually waking up the cpu and asking ???

Even if you don’t have additional hardware to query the state of a CPU
without waking it up, we could have a per-CPU variable storing the
actual C-state as selected by the arch backend. This doesn’t need to
be precise; let's say 90% accuracy would probably be enough for
the scheduler.
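
A minimal sketch of that idea, assuming hypothetical names (nothing
like this exists in the tree today):

#include <linux/percpu.h>

/* Updated by the cpuidle backend when a cpu enters/leaves an idle state. */
static DEFINE_PER_CPU(int, current_cstate);

static inline void cpuidle_record_state(int cpu, int cstate)
{
	per_cpu(current_cstate, cpu) = cstate;
}

/* Cheap, possibly slightly stale read for the scheduler's task placement. */
static inline int cpuidle_peek_state(int cpu)
{
	return per_cpu(current_cstate, cpu);
}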

Catalin


* Re: [RFC][PATCH v5 05/14] sched: add a packing level knob
  2013-10-18 11:52 ` [RFC][PATCH v5 05/14] sched: add a packing level knob Vincent Guittot
@ 2013-11-12 10:32   ` Peter Zijlstra
  2013-11-12 10:44     ` Vincent Guittot
  0 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2013-11-12 10:32 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, mingo, pjt, Morten.Rasmussen, cmetcalf, tony.luck,
	alex.shi, preeti, linaro-kernel, rjw, paulmck, corbet, tglx,
	len.brown, arjan, amit.kucheria, l.majewski

On Fri, Oct 18, 2013 at 01:52:18PM +0200, Vincent Guittot wrote:
> +int sched_proc_update_packing(struct ctl_table *table, int write,
> +		void __user *buffer, size_t *lenp,
> +		loff_t *ppos)
> +{
> +	int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> +	if (ret || !write)
> +		return ret;
> +
> +	if (sysctl_sched_packing_level)
> +		sd_pack_threshold = (100 * 1024) / sysctl_sched_packing_level;
> +
> +	return 0;
> +}

> +#ifdef CONFIG_SCHED_PACKING_TASKS
> +static int min_sched_packing_level;
> +static int max_sched_packing_level = 100;
> +#endif /* CONFIG_SMP */

> +#ifdef CONFIG_SCHED_PACKING_TASKS
> +	{
> +		.procname	= "sched_packing_level",
> +		.data		= &sysctl_sched_packing_level,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= sched_proc_update_packing,
> +		.extra1		= &min_sched_packing_level,
> +		.extra2		= &max_sched_packing_level,
> +	},
> +#endif

Shouldn't min_sched_packing_level be 1? Userspace can now write 0 and
expect something; but then we don't update sd_pack_threshold so nothing
really changed.


* Re: [RFC][PATCH v5 06/14] sched: create a new field with available capacity
  2013-10-18 11:52 ` [RFC][PATCH v5 06/14] sched: create a new field with available capacity Vincent Guittot
@ 2013-11-12 10:34   ` Peter Zijlstra
  2013-11-12 11:05     ` Vincent Guittot
  0 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2013-11-12 10:34 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, mingo, pjt, Morten.Rasmussen, cmetcalf, tony.luck,
	alex.shi, preeti, linaro-kernel, rjw, paulmck, corbet, tglx,
	len.brown, arjan, amit.kucheria, l.majewski

On Fri, Oct 18, 2013 at 01:52:19PM +0200, Vincent Guittot wrote:
> This new field power_available reflects the available capacity of a CPU
> unlike the cpu_power which reflects the current capacity.

> -	sdg->sgp->power_orig = sdg->sgp->power = power;
> +	sdg->sgp->power_orig = sdg->sgp->power_available = available;
> +	sdg->sgp->power = power;

This patch leaves me confused as to power_available vs power_orig and
the Changelog doesn't really clarify anything much at all.


* Re: [RFC][PATCH v5 07/14] sched: get CPU's activity statistic
  2013-10-18 11:52 ` [RFC][PATCH v5 07/14] sched: get CPU's activity statistic Vincent Guittot
@ 2013-11-12 10:36   ` Peter Zijlstra
  2013-11-12 10:41   ` Peter Zijlstra
  1 sibling, 0 replies; 101+ messages in thread
From: Peter Zijlstra @ 2013-11-12 10:36 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, mingo, pjt, Morten.Rasmussen, cmetcalf, tony.luck,
	alex.shi, preeti, linaro-kernel, rjw, paulmck, corbet, tglx,
	len.brown, arjan, amit.kucheria, l.majewski

On Fri, Oct 18, 2013 at 01:52:20PM +0200, Vincent Guittot wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index db9b871..7e26f65 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -179,6 +179,11 @@ void sched_init_granularity(void)
>  }
>  
>  #ifdef CONFIG_SMP
> +static unsigned long available_of(int cpu)
> +{
> +	return cpu_rq(cpu)->cpu_available;
> +}
> +
>  #ifdef CONFIG_SCHED_PACKING_TASKS
>  /*
>   * Save the id of the optimal CPU that should be used to pack small tasks

This hunk wants to be in the previous patch, as I'm fairly sure I saw
this function used there.


* Re: [RFC][PATCH v5 00/14] sched: packing tasks
  2013-11-11 16:38   ` Peter Zijlstra
  2013-11-11 16:40     ` Arjan van de Ven
@ 2013-11-12 10:36     ` Vincent Guittot
  1 sibling, 0 replies; 101+ messages in thread
From: Vincent Guittot @ 2013-11-12 10:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Catalin Marinas, Linux Kernel Mailing List, Ingo Molnar,
	Paul Turner, Morten Rasmussen, Chris Metcalf, Tony Luck,
	alex.shi, Preeti U Murthy, linaro-kernel, len.brown,
	Lukasz Majewski, Jonathan Corbet, Rafael J. Wysocki,
	Paul McKenney, Arjan van de Ven, linux-pm

On 11 November 2013 17:38, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Nov 11, 2013 at 11:33:45AM +0000, Catalin Marinas wrote:
>> My understanding from the recent discussions is that the scheduler
>> should decide directly on the C-state (or rather the deepest C-state
>> possible since we don't want to duplicate the backend logic for
>> synchronising CPUs going up or down). This means that the scheduler
>> needs to know about C-state target residency, wake-up latency (I think
>> we can leave coupled C-states to the backend, there is some complex
>> synchronisation which I wouldn't duplicate).
>>
>> Alternatively (my preferred approach), we get the scheduler to predict
>> and pass the expected residency and latency requirements down to a
>> power driver and read back the actual C-states for making task
>> placement decisions. Some of the menu governor prediction logic could
>> be turned into a library and used by the scheduler. Basically what
>> this tries to achieve is better scheduler awareness of the current
>> C-states decided by a cpuidle/power driver based on the scheduler
>> constraints.
>
> Ah yes.. so I _think_ the scheduler wants to eventually know about idle
> topology constraints. But we can get there in a gradual fashion I hope.
>
> Like the package C states on x86 -- for those to be effective the
> scheduler needs to pack tasks and keep entire packages idle for as long
> as possible.

That's the purpose of patches 12, 13 and 14: to get the current wakeup
latency of a core and use it when selecting a core.

>
>


* Re: [RFC][PATCH v5 07/14] sched: get CPU's activity statistic
  2013-10-18 11:52 ` [RFC][PATCH v5 07/14] sched: get CPU's activity statistic Vincent Guittot
  2013-11-12 10:36   ` Peter Zijlstra
@ 2013-11-12 10:41   ` Peter Zijlstra
  1 sibling, 0 replies; 101+ messages in thread
From: Peter Zijlstra @ 2013-11-12 10:41 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, mingo, pjt, Morten.Rasmussen, cmetcalf, tony.luck,
	alex.shi, preeti, linaro-kernel, rjw, paulmck, corbet, tglx,
	len.brown, arjan, amit.kucheria, l.majewski

On Fri, Oct 18, 2013 at 01:52:20PM +0200, Vincent Guittot wrote:
> +static int get_cpu_activity(int cpu)
> +{
> +	struct rq *rq = cpu_rq(cpu);
> +	u32 sum = rq->avg.runnable_avg_sum;
> +	u32 period = rq->avg.runnable_avg_period;
> +
> +	sum = min(sum, period);
> +
> +	if (sum == period) {
> +		u32 overload = rq->nr_running > 1 ? 1 : 0;
> +		return available_of(cpu) + overload;
> +	}
> +
> +	return (sum * available_of(cpu)) / period;

I'm thinking this has a fair potential to overflow our u32, no?

> +}
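
A hedged sketch of one way to address that overflow, assuming a 64-bit
intermediate is acceptable on these paths (not necessarily how the
patch will be fixed):

#include <linux/math64.h>

/*
 * Hypothetical rework of get_cpu_activity() avoiding the u32 overflow
 * by widening the intermediate product to 64 bit.
 */
static int get_cpu_activity(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	u32 sum = min(rq->avg.runnable_avg_sum, rq->avg.runnable_avg_period);
	u32 period = rq->avg.runnable_avg_period;

	if (sum == period)
		return available_of(cpu) + (rq->nr_running > 1 ? 1 : 0);

	return div_u64((u64)sum * available_of(cpu), period);
}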


* Re: [RFC][PATCH v5 05/14] sched: add a packing level knob
  2013-11-12 10:32   ` Peter Zijlstra
@ 2013-11-12 10:44     ` Vincent Guittot
  2013-11-12 10:55       ` Peter Zijlstra
  0 siblings, 1 reply; 101+ messages in thread
From: Vincent Guittot @ 2013-11-12 10:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Paul Turner, Morten Rasmussen,
	cmetcalf, tony.luck, Alex Shi, Preeti U Murthy, linaro-kernel,
	Rafael J. Wysocki, Paul McKenney, Jonathan Corbet,
	Thomas Gleixner, Len Brown, Arjan van de Ven, Amit Kucheria,
	Lukasz Majewski

On 12 November 2013 11:32, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Oct 18, 2013 at 01:52:18PM +0200, Vincent Guittot wrote:
>> +int sched_proc_update_packing(struct ctl_table *table, int write,
>> +             void __user *buffer, size_t *lenp,
>> +             loff_t *ppos)
>> +{
>> +     int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
>> +     if (ret || !write)
>> +             return ret;
>> +
>> +     if (sysctl_sched_packing_level)
>> +             sd_pack_threshold = (100 * 1024) / sysctl_sched_packing_level;
>> +
>> +     return 0;
>> +}
>
>> +#ifdef CONFIG_SCHED_PACKING_TASKS
>> +static int min_sched_packing_level;
>> +static int max_sched_packing_level = 100;
>> +#endif /* CONFIG_SMP */
>
>> +#ifdef CONFIG_SCHED_PACKING_TASKS
>> +     {
>> +             .procname       = "sched_packing_level",
>> +             .data           = &sysctl_sched_packing_level,
>> +             .maxlen         = sizeof(int),
>> +             .mode           = 0644,
>> +             .proc_handler   = sched_proc_update_packing,
>> +             .extra1         = &min_sched_packing_level,
>> +             .extra2         = &max_sched_packing_level,
>> +     },
>> +#endif
>
> Shouldn't min_sched_packing_level be 1? Userspace can now write 0 and
> expect something; but then we don't update sd_pack_threshold so nothing
> really changed.

Value 0 is used to disable the packing feature, in which case the
scheduler falls back to the default behavior. This value is tested when
selecting which cpus will be used by the scheduler.


* Re: [RFC][PATCH v5 08/14] sched: move load idx selection in find_idlest_group
  2013-10-18 11:52 ` [RFC][PATCH v5 08/14] sched: move load idx selection in find_idlest_group Vincent Guittot
@ 2013-11-12 10:49   ` Peter Zijlstra
  2013-11-27 14:10   ` [tip:sched/core] sched/fair: Move " tip-bot for Vincent Guittot
  1 sibling, 0 replies; 101+ messages in thread
From: Peter Zijlstra @ 2013-11-12 10:49 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, mingo, pjt, Morten.Rasmussen, cmetcalf, tony.luck,
	alex.shi, preeti, linaro-kernel, rjw, paulmck, corbet, tglx,
	len.brown, arjan, amit.kucheria, l.majewski

On Fri, Oct 18, 2013 at 01:52:21PM +0200, Vincent Guittot wrote:
> load_idx is used in find_idlest_group but initialized in select_task_rq_fair
> even when not used. The load_idx initialisation is moved in find_idlest_group
> and the sd_flag replaces it in the function's args.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

Thanks applied!


* Re: [RFC][PATCH v5 05/14] sched: add a packing level knob
  2013-11-12 10:44     ` Vincent Guittot
@ 2013-11-12 10:55       ` Peter Zijlstra
  2013-11-12 10:57         ` Vincent Guittot
  0 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2013-11-12 10:55 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Ingo Molnar, Paul Turner, Morten Rasmussen,
	cmetcalf, tony.luck, Alex Shi, Preeti U Murthy, linaro-kernel,
	Rafael J. Wysocki, Paul McKenney, Jonathan Corbet,
	Thomas Gleixner, Len Brown, Arjan van de Ven, Amit Kucheria,
	Lukasz Majewski

On Tue, Nov 12, 2013 at 11:44:15AM +0100, Vincent Guittot wrote:
> On 12 November 2013 11:32, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, Oct 18, 2013 at 01:52:18PM +0200, Vincent Guittot wrote:
> >> +int sched_proc_update_packing(struct ctl_table *table, int write,
> >> +             void __user *buffer, size_t *lenp,
> >> +             loff_t *ppos)
> >> +{
> >> +     int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> >> +     if (ret || !write)
> >> +             return ret;
> >> +
> >> +     if (sysctl_sched_packing_level)
> >> +             sd_pack_threshold = (100 * 1024) / sysctl_sched_packing_level;
> >> +
> >> +     return 0;
> >> +}
> >
> >> +#ifdef CONFIG_SCHED_PACKING_TASKS
> >> +static int min_sched_packing_level;
> >> +static int max_sched_packing_level = 100;
> >> +#endif /* CONFIG_SMP */
> >
> >> +#ifdef CONFIG_SCHED_PACKING_TASKS
> >> +     {
> >> +             .procname       = "sched_packing_level",
> >> +             .data           = &sysctl_sched_packing_level,
> >> +             .maxlen         = sizeof(int),
> >> +             .mode           = 0644,
> >> +             .proc_handler   = sched_proc_update_packing,
> >> +             .extra1         = &min_sched_packing_level,
> >> +             .extra2         = &max_sched_packing_level,
> >> +     },
> >> +#endif
> >
> > Shouldn't min_sched_packing_level be 1? Userspace can now write 0 and
> > expect something; but then we don't update sd_pack_threshold so nothing
> > really changed.
> 
> value 0 is used to disable to packing feature and the scheduler falls
> back to default behavior. This value is tested when setting which cpus
> will be used by the scheduler.

I suspected as much, but it wasn't clear from the Changelog, the patch
or any comments. Plz as to fix.


* Re: [RFC][PATCH v5 05/14] sched: add a packing level knob
  2013-11-12 10:55       ` Peter Zijlstra
@ 2013-11-12 10:57         ` Vincent Guittot
  0 siblings, 0 replies; 101+ messages in thread
From: Vincent Guittot @ 2013-11-12 10:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Paul Turner, Morten Rasmussen,
	cmetcalf, tony.luck, Alex Shi, Preeti U Murthy, linaro-kernel,
	Rafael J. Wysocki, Paul McKenney, Jonathan Corbet,
	Thomas Gleixner, Len Brown, Arjan van de Ven, Amit Kucheria,
	Lukasz Majewski

On 12 November 2013 11:55, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Nov 12, 2013 at 11:44:15AM +0100, Vincent Guittot wrote:
>> On 12 November 2013 11:32, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Fri, Oct 18, 2013 at 01:52:18PM +0200, Vincent Guittot wrote:
>> >> +int sched_proc_update_packing(struct ctl_table *table, int write,
>> >> +             void __user *buffer, size_t *lenp,
>> >> +             loff_t *ppos)
>> >> +{
>> >> +     int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
>> >> +     if (ret || !write)
>> >> +             return ret;
>> >> +
>> >> +     if (sysctl_sched_packing_level)
>> >> +             sd_pack_threshold = (100 * 1024) / sysctl_sched_packing_level;
>> >> +
>> >> +     return 0;
>> >> +}
>> >
>> >> +#ifdef CONFIG_SCHED_PACKING_TASKS
>> >> +static int min_sched_packing_level;
>> >> +static int max_sched_packing_level = 100;
>> >> +#endif /* CONFIG_SMP */
>> >
>> >> +#ifdef CONFIG_SCHED_PACKING_TASKS
>> >> +     {
>> >> +             .procname       = "sched_packing_level",
>> >> +             .data           = &sysctl_sched_packing_level,
>> >> +             .maxlen         = sizeof(int),
>> >> +             .mode           = 0644,
>> >> +             .proc_handler   = sched_proc_update_packing,
>> >> +             .extra1         = &min_sched_packing_level,
>> >> +             .extra2         = &max_sched_packing_level,
>> >> +     },
>> >> +#endif
>> >
>> > Shouldn't min_sched_packing_level be 1? Userspace can now write 0 and
>> > expect something; but then we don't update sd_pack_threshold so nothing
>> > really changed.
>>
>> value 0 is used to disable to packing feature and the scheduler falls
>> back to default behavior. This value is tested when setting which cpus
>> will be used by the scheduler.
>
> I suspected as much, but it wasn't clear from the Changelog, the patch
> or any comments. Plz as to fix.

Ok, I will add it.


* Re: [RFC][PATCH v5 06/14] sched: create a new field with available capacity
  2013-11-12 10:34   ` Peter Zijlstra
@ 2013-11-12 11:05     ` Vincent Guittot
  0 siblings, 0 replies; 101+ messages in thread
From: Vincent Guittot @ 2013-11-12 11:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Paul Turner, Morten Rasmussen,
	cmetcalf, tony.luck, Alex Shi, Preeti U Murthy, linaro-kernel,
	Rafael J. Wysocki, Paul McKenney, Jonathan Corbet,
	Thomas Gleixner, Len Brown, Arjan van de Ven, Amit Kucheria,
	Lukasz Majewski

On 12 November 2013 11:34, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Oct 18, 2013 at 01:52:19PM +0200, Vincent Guittot wrote:
>> This new field power_available reflects the available capacity of a CPU
>> unlike the cpu_power which reflects the current capacity.
>
>> -     sdg->sgp->power_orig = sdg->sgp->power = power;
>> +     sdg->sgp->power_orig = sdg->sgp->power_available = available;
>> +     sdg->sgp->power = power;
>
> This patch leaves me confused as to power_available vs power_orig and
> the Changelog doesn't really clarify anything much at all.

OK, I will add more details in the changelog.
power_orig can only be modified for SMT purposes, otherwise it stays at 1024,
whereas power_available takes into account the modification that has
been done by the platform.

I can probably refactor that and merge power_orig and power_available.
At the moment, arch_scale_smt_power and arch_scale_freq_power are not used
simultaneously by any architecture, so I can probably move the return
value of arch_scale_freq_power into power_orig. This would even
make more sense regarding its current use.
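
To make the distinction clearer, the idea is roughly this (a hand-written
sketch, not the code from the patch):

static void sketch_update_power(struct sched_domain *sd, int cpu)
{
	struct sched_group *sdg = sd->groups;
	unsigned long available, power;

	/* capacity offered by the platform (freq/thermal capping) */
	available = arch_scale_freq_power(sd, cpu);

	/* usable capacity once rt/irq pressure is removed */
	power = (available * scale_rt_power(cpu)) >> SCHED_POWER_SHIFT;

	/* power_orig stays at 1024 unless SMT scaling applies */
	sdg->sgp->power_available = available;
	sdg->sgp->power = power;
}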

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH v5 00/14] sched: packing tasks
  2013-11-11 18:18       ` Catalin Marinas
  2013-11-11 18:20         ` Arjan van de Ven
@ 2013-11-12 12:06         ` Morten Rasmussen
  2013-11-12 16:48         ` Arjan van de Ven
  2 siblings, 0 replies; 101+ messages in thread
From: Morten Rasmussen @ 2013-11-12 12:06 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Arjan van de Ven, Peter Zijlstra, Vincent Guittot,
	Linux Kernel Mailing List, Ingo Molnar, Paul Turner,
	Chris Metcalf, Tony Luck, alex.shi, Preeti U Murthy,
	linaro-kernel, len.brown, l.majewski, Jonathan Corbet,
	Rafael J. Wysocki, Paul McKenney, linux-pm

On Mon, Nov 11, 2013 at 06:18:05PM +0000, Catalin Marinas wrote:
> On Mon, Nov 11, 2013 at 04:39:45PM +0000, Arjan van de Ven wrote:
> > having a hardware driver give a prefered CPU ordering for wakes can indeed be useful.
> > (I'm doubtful that changing the recommendation for each idle is going to pay off,
> > but proof is in the pudding; there are certainly long term effects where this can help)
> 
> The ordering is based on the actual C-state, so a simple way is to wake
> up the CPU in the shallowest C-state. With asymmetric configurations
> (big.LITTLE) we have different costs for the same C-state, so this would
> come in handy.

Asymmetric configurations add a bit of extra fun to deal with as you
don't want to pick the cpu in the shallowest C-state if it is the wrong
type of cpu for the task waking up. That goes for both big and little
cpus in big.LITTLE.

So the hardware driver would need to know which cpus are suitable
targets for the task, or we need to somehow limit the driver query to
suitable cpus, or the driver should return a list of cpus guaranteed to
include cpus of all types (big, little, whatever...).
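
For illustration, such a query could look something like this
(cpu_wakeup_latency() is a made-up hook; nothing like it exists yet):

static int find_shallowest_idle_cpu(struct task_struct *p,
				    const struct cpumask *suitable)
{
	unsigned int min_lat = UINT_MAX;
	int cpu, best = -1;

	/* only look at cpus the task is allowed and suited to run on */
	for_each_cpu_and(cpu, suitable, tsk_cpus_allowed(p)) {
		unsigned int lat = cpu_wakeup_latency(cpu);

		if (idle_cpu(cpu) && lat < min_lat) {
			min_lat = lat;
			best = cpu;
		}
	}

	return best;
}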

Morten

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH v5 00/14] sched: packing tasks
  2013-11-11 11:33 ` [RFC][PATCH v5 00/14] sched: packing tasks Catalin Marinas
                     ` (2 preceding siblings ...)
  2013-11-11 16:54   ` Morten Rasmussen
@ 2013-11-12 12:35   ` Vincent Guittot
  3 siblings, 0 replies; 101+ messages in thread
From: Vincent Guittot @ 2013-11-12 12:35 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Linux Kernel Mailing List, Peter Zijlstra, Ingo Molnar,
	Paul Turner, Morten Rasmussen, Chris Metcalf, Tony Luck,
	Preeti U Murthy, linaro-kernel, len.brown, Lukasz Majewski,
	Jonathan Corbet, Rafael J. Wysocki, Paul McKenney,
	Arjan van de Ven, linux-pm, Daniel Lezcano, Tuukka Tikkanen,
	Alex Shi

On 11 November 2013 12:33, Catalin Marinas <catalin.marinas@arm.com> wrote:
> Hi Vincent,
>
> (cross-posting to linux-pm as it was agreed to follow up on this list)
>

<snip>

>
> So, IMO, defining the power topology is a good starting point and I
> think it's better to separate the patches from the energy saving
> algorithms like packing. We need to agree on what information we have

Daniel and Tuukka, who are working on cpuidle consolidation in the
scheduler, are also interested in using similar topology information
to me. I made a single patchset only because the information was only
used here, so I can probably split the power topology description in DT
into a separate patchset.

Vincent

> (C-state details, coupling, power gating) and what we can/need to
> expose to the scheduler. This can be revisited once we start
> implementing/refining the energy awareness.
>

<snip>
>
> --
> Catalin

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH v5 00/14] sched: packing tasks
  2013-11-11 18:18       ` Catalin Marinas
  2013-11-11 18:20         ` Arjan van de Ven
  2013-11-12 12:06         ` Morten Rasmussen
@ 2013-11-12 16:48         ` Arjan van de Ven
  2013-11-12 23:14           ` Catalin Marinas
  2 siblings, 1 reply; 101+ messages in thread
From: Arjan van de Ven @ 2013-11-12 16:48 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Peter Zijlstra, Vincent Guittot, Linux Kernel Mailing List,
	Ingo Molnar, Paul Turner, Morten Rasmussen, Chris Metcalf,
	Tony Luck, alex.shi, Preeti U Murthy, linaro-kernel, len.brown,
	l.majewski, Jonathan Corbet, Rafael J. Wysocki, Paul McKenney,
	linux-pm

On 11/11/2013 10:18 AM, Catalin Marinas wrote:
> The ordering is based on the actual C-state, so a simple way is to wake
> up the CPU in the shallowest C-state. With asymmetric configurations
> (big.LITTLE) we have different costs for the same C-state, so this would
> come in handy.

btw I was considering something else; in practice CPUs will be in the deepest state..
... at which point I was going to go with some other metrics of what is best from a platform level


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH v5 00/14] sched: packing tasks
  2013-11-11 16:36   ` Peter Zijlstra
  2013-11-11 16:39     ` Arjan van de Ven
@ 2013-11-12 17:40     ` Catalin Marinas
  2013-11-25 18:55     ` Daniel Lezcano
  2 siblings, 0 replies; 101+ messages in thread
From: Catalin Marinas @ 2013-11-12 17:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Linux Kernel Mailing List, Ingo Molnar,
	Paul Turner, Morten Rasmussen, Chris Metcalf, Tony Luck,
	alex.shi, Preeti U Murthy, linaro-kernel, len.brown, l.majewski,
	Jonathan Corbet, Rafael J. Wysocki, Paul McKenney,
	Arjan van de Ven, linux-pm

On Mon, Nov 11, 2013 at 04:36:30PM +0000, Peter Zijlstra wrote:
> On Mon, Nov 11, 2013 at 11:33:45AM +0000, Catalin Marinas wrote:
> 
> tl;dr :-) Still trying to wrap my head around how to do that weird
> topology Vincent raised..

Long email, I know, but topology discussion is a good start ;).

To summarise the rest, I don't see full task packing as useful in itself
but rather as something that falls out of other decisions (like trying to
estimate the cost of task placement and refining the algorithm from
there). There are ARM SoCs where maximising idle time does not always
mean maximising the energy saving, even if the cores can be power-gated
individually (unless you have a small workload that doesn't increase the
P-state on the packing CPU).

> > Question for Peter/Ingo: do you want the scheduler to decide on which
> > C-state a CPU should be in or we still leave this to a cpuidle
> > layer/driver?
> 
> I think we can leave most of that in a driver; right along with how to
> prod the hardware to actually get into that state.
> 
> I think the most important parts are what is now 'generic' code; stuff
> that guestimates the idle-time and so forth.
> 
> I think the scheduler simply wants to say: we expect to go idle for X
> ns, we want a guaranteed wakeup latency of Y ns -- go do your thing.

Sounds good (and I think the Linaro guys started looking into this).
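
As a strawman, the request could be as small as this (names invented
here, nothing of the sort exists yet):

/* what the scheduler would hand to the idle driver */
struct sched_idle_hint {
	u64	expected_idle_ns;	/* predicted idle duration  */
	u64	latency_req_ns;		/* tolerated wakeup latency */
};

/* the idle driver picks the deepest state that satisfies both numbers */
void cpuidle_enter_hinted(int cpu, const struct sched_idle_hint *hint);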

> I think you also raised the point in that we do want some feedback as to
> the cost of waking up particular cores to better make decisions on which
> to wake. That is indeed so.

It depends on how we end up implementing the energy awareness in the
scheduler but too simple topology (just which CPUs can be power-gated)
is not that useful.

In a very simplistic and ideal world (note the 'ideal' part), we could
estimate the energy cost of a CPU for a period T:

E = sum(P(Cx) * Tx) + sum(wake-up-energy) + sum(P(y) * Ty)

  P(Cx): power in C-state x, Tx: time spent in C-state x
  wake-up-energy: the cost of waking up from the various C-states
  P(y): power while running task y (which also depends on the P-state)
  Ty: time spent running task y
  sum(Tx) + sum(Ty) = T

Assuming that we have such information and can predict (based on past
usage) what the task loads will be, together with other
performance/latency constraints, an 'ideal' scheduler would always
choose the correct C/P states and task placements for optimal energy.
However, the reality is different and even so it would be an NP problem.

But we can try to come up with some "guestimates" based on parameters
provided by the SoC (via DT or ACPI tables or just some low-level
driver/arch code). The scheduler does its best according on these
parameters at certain times (task wake-up, idle balance) while the SoC
can still tune the behaviour.

If we roughly estimate the energy cost of a run-queue and the energy
cost of individual tasks on that run-queue (based on their load and
P-state), we could estimate the cost of moving or waking the
task on another CPU (where the task's cost may change depending on
asymmetric configurations or a different P-state). We don't even need
precise energy costs, just some relative numbers so that the scheduler
can favour one CPU over another. If we ignore P-state costs and only
consider C-states and symmetric configurations, we probably get
behaviour similar to Vincent's task packing patches.
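
In deliberately over-simplified code, the comparison could be as crude
as this (all names and units are invented, and only relative numbers
matter):

/* per-cpu numbers provided by the platform, in relative units */
struct cpu_energy_param {
	unsigned int	wakeup_cost;	/* cost of leaving the current C-state */
	unsigned int	run_cost;	/* cost per unit of load at the needed P-state */
};

static unsigned long placement_cost(const struct cpu_energy_param *ep,
				    unsigned long task_load)
{
	/* wake-up cost plus the cost of running the extra load */
	return ep->wakeup_cost + ep->run_cost * task_load;
}

The scheduler would then simply favour the candidate cpu with the
lowest cost.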

The information we currently have for C-states is target residency and
exit latency. From these I think we can only infer the wake-up energy
cost, not how much we save by placing a CPU into that state. So if we
want the scheduler to decide whether to pack or spread (from an energy
cost perspective), we need additional information in the topology.

Alternatively we could have a power driver which dynamically returns
some estimates every time the scheduler asks for them, with a power
driver for each SoC (which is already the case for ARM SoCs).

-- 
Catalin

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH v5 01/14] sched: add a new arch_sd_local_flags for sched_domain init
  2013-11-06 14:08           ` Peter Zijlstra
@ 2013-11-12 17:43             ` Dietmar Eggemann
  2013-11-12 18:08               ` Peter Zijlstra
  0 siblings, 1 reply; 101+ messages in thread
From: Dietmar Eggemann @ 2013-11-12 17:43 UTC (permalink / raw)
  To: Peter Zijlstra, Martin Schwidefsky
  Cc: Vincent Guittot, linux-kernel, Ingo Molnar, Paul Turner,
	Morten Rasmussen, cmetcalf, tony.luck, Alex Shi, Preeti U Murthy,
	linaro-kernel, Rafael J. Wysocki, Paul McKenney, Jonathan Corbet,
	Thomas Gleixner, Len Brown, Arjan van de Ven, Amit Kucheria,
	Lukasz Majewski, james.hogan, heiko.carstens

On 06/11/13 14:08, Peter Zijlstra wrote:
> On Wed, Nov 06, 2013 at 02:53:44PM +0100, Martin Schwidefsky wrote:
>> On Tue, 5 Nov 2013 23:27:52 +0100
>> Peter Zijlstra <peterz@infradead.org> wrote:
>>
>>> On Tue, Nov 05, 2013 at 03:57:23PM +0100, Vincent Guittot wrote:
>>>> Your proposal looks fine for me. It's clearly better to move in one
>>>> place the configuration of sched_domain fields. Have you already got
>>>> an idea about how to let architecture override the topology?
>>>
>>> Maybe something like the below -- completely untested (my s390 compiler
>>> is on a machine that's currently powered off).
>>
>> In principle I do not see a reason why this should not work, but there
>> are a few more things to take care of. E.g. struct sd_data is defined
>> in kernel/sched/core.c, cpu_cpu_mask as well. These need to be moved
>> to a header where arch/s390/kernel/smp.c can pick it up.
>>
>> I do have the feeling that the sched_domain_topology should be left
>> where they are, or do we really want to expose more of the scheduler
>> internals?
> 
> Ah, its a trade off; in that previous patch I removed the entire
> sched_domain initializers the archs used to 'have' to fill out. That
> exposed far too much behavioural stuff the archs really shouldn't
> bother with.
> 
> In return we now provide a (hopefully) simpler interface that allows
> archs to communicate their topology to the scheduler -- without getting
> mixed up in the behavioural aspects (too much).
> 
> Maybe s390 wasn't the best example to pick, as the book domain really
> isn't that exciting. Arguably I should have taken Power7+ and the
> ASYM_PACKING SMT thing.
> 

We actually don't have to expose sched_domain_topology or any internal
scheduler data structures.

We still can get rid of the SD_XXX_INIT stuff and do the sched_domain
initialization for all levels in one function sd_init().

Moreover, we could introduce a arch specific general function replacing
arch specific functions for particular flags and levels like
arch_sd_sibling_asym_packing() or Vincent's arch_sd_local_flags().
This arch specific general function exposes the level and the
sched_domain pointer to the arch which then could fine tune sched_domain
in each individual level.

Below is a patch which bases on your idea to transform sd_numa_init()
into sd_init(). The main difference is that I don't try to distinguish
based of power management related flags inside sd_init() but rather on
the new sd level data.

Dietmar

----8<----

From 3df278ad50690a7878c9cc6b18e226805e1f4bd1 Mon Sep 17 00:00:00 2001
From: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date: Tue, 12 Nov 2013 12:37:36 +0000
Subject: [PATCH] sched: rework sched_domain setup code

This patch removes the sched_domain initializer macros
SD_[SIBLING|MC|BOOK|CPU]_INIT in core.c and in archs and replaces them
with calls to the new function sd_init().  The function sd_init
incorporates the already existing function sd_numa_init().

It introduces preprocessor constants (SD_LVL_[INV|SMT|MC|BOOK|CPU|NUMA])
and replaces 'sched_domain_init_f init' with 'int level' data member in
struct sched_domain_topology_level.

The new data member is used to distinguish the sched_domain level in
sd_init() and is also passed as an argument to the arch specific
function to tweak the sched_domain described below.

To make it still possible for archs to tweak the individual
sched_domain level, a new weak function arch_sd_customize(int level,
struct sched_domain *sd, int cpu) is introduced.
By exposing the sched_domain level and the pointer to the sched_domain
data structure, the archs can tweak individual data members, like the
min or max interval or the flags.  This function also replaces the
existing function arch_sd_sibiling_asym_packing() which is specialized
in setting the SD_ASYM_PACKING flag for the SMT sched_domain level.
The parameter cpu is currently not used but could be used in the
future to setup sched_domain structures in one sched_domain level
differently for different cpus.

Initialization of a sched_domain is done in three steps. First, at the
beginning of sd_init(), the sched_domain data members are set which
have the same value for all or at least most of the sched_domain
levels.  Second, sched_domain data members are set for each
sched_domain level individually in sd_init().  Third,
arch_sd_customize() is called in sd_init().

One exception is SD_NODE_INIT, which this patch removes from
arch/metag/include/asm/topology.h. I don't know how it has been used, so
this patch does not provide a metag-specific arch_sd_customize()
implementation.

This patch has been tested on ARM TC2 (5 CPUs, sched_domain level MC
and CPU) and compile-tested for x86_64, powerpc (chroma_defconfig) and
mips (ip27_defconfig).

It is against v3.12.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 arch/ia64/include/asm/topology.h  |   24 -----
 arch/ia64/kernel/topology.c       |    8 ++
 arch/metag/include/asm/topology.h |   25 -----
 arch/powerpc/kernel/smp.c         |    7 +-
 arch/tile/include/asm/topology.h  |   33 ------
 arch/tile/kernel/smp.c            |   12 +++
 include/linux/sched.h             |    8 +-
 include/linux/topology.h          |  109 -------------------
 kernel/sched/core.c               |  214 +++++++++++++++++++++----------------
 9 files changed, 150 insertions(+), 290 deletions(-)

diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h
index a2496e4..20d12fa 100644
--- a/arch/ia64/include/asm/topology.h
+++ b/arch/ia64/include/asm/topology.h
@@ -46,30 +46,6 @@
 
 void build_cpu_to_node_map(void);
 
-#define SD_CPU_INIT (struct sched_domain) {		\
-	.parent			= NULL,			\
-	.child			= NULL,			\
-	.groups			= NULL,			\
-	.min_interval		= 1,			\
-	.max_interval		= 4,			\
-	.busy_factor		= 64,			\
-	.imbalance_pct		= 125,			\
-	.cache_nice_tries	= 2,			\
-	.busy_idx		= 2,			\
-	.idle_idx		= 1,			\
-	.newidle_idx		= 0,			\
-	.wake_idx		= 0,			\
-	.forkexec_idx		= 0,			\
-	.flags			= SD_LOAD_BALANCE	\
-				| SD_BALANCE_NEWIDLE	\
-				| SD_BALANCE_EXEC	\
-				| SD_BALANCE_FORK	\
-				| SD_WAKE_AFFINE,	\
-	.last_balance		= jiffies,		\
-	.balance_interval	= 1,			\
-	.nr_balance_failed	= 0,			\
-}
-
 #endif /* CONFIG_NUMA */
 
 #ifdef CONFIG_SMP
diff --git a/arch/ia64/kernel/topology.c b/arch/ia64/kernel/topology.c
index ca69a5a..5dd627d 100644
--- a/arch/ia64/kernel/topology.c
+++ b/arch/ia64/kernel/topology.c
@@ -99,6 +99,14 @@ out:
 
 subsys_initcall(topology_init);
 
+void arch_sd_customize(int level, struct sched_domain *sd, int cpu)
+{
+	if (level == SD_LVL_CPU) {
+		sd->cache_nice_tries = 2;
+
+		sd->flags &= ~SD_PREFER_SIBLING;
+	}
+}
 
 /*
  * Export cpu cache information through sysfs
diff --git a/arch/metag/include/asm/topology.h b/arch/metag/include/asm/topology.h
index 23f5118..e95f874 100644
--- a/arch/metag/include/asm/topology.h
+++ b/arch/metag/include/asm/topology.h
@@ -3,31 +3,6 @@
 
 #ifdef CONFIG_NUMA
 
-/* sched_domains SD_NODE_INIT for Meta machines */
-#define SD_NODE_INIT (struct sched_domain) {		\
-	.parent			= NULL,			\
-	.child			= NULL,			\
-	.groups			= NULL,			\
-	.min_interval		= 8,			\
-	.max_interval		= 32,			\
-	.busy_factor		= 32,			\
-	.imbalance_pct		= 125,			\
-	.cache_nice_tries	= 2,			\
-	.busy_idx		= 3,			\
-	.idle_idx		= 2,			\
-	.newidle_idx		= 0,			\
-	.wake_idx		= 0,			\
-	.forkexec_idx		= 0,			\
-	.flags			= SD_LOAD_BALANCE	\
-				| SD_BALANCE_FORK	\
-				| SD_BALANCE_EXEC	\
-				| SD_BALANCE_NEWIDLE	\
-				| SD_SERIALIZE,		\
-	.last_balance		= jiffies,		\
-	.balance_interval	= 1,			\
-	.nr_balance_failed	= 0,			\
-}
-
 #define cpu_to_node(cpu)	((void)(cpu), 0)
 #define parent_node(node)	((void)(node), 0)
 
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 8e59abc..9ac5bfb 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -802,13 +802,12 @@ void __init smp_cpus_done(unsigned int max_cpus)
 
 }
 
-int arch_sd_sibling_asym_packing(void)
+void arch_sd_customize(int level, struct sched_domain *sd, int cpu)
 {
-	if (cpu_has_feature(CPU_FTR_ASYM_SMT)) {
+	if (level == SD_LVL_SMT && cpu_has_feature(CPU_FTR_ASYM_SMT)) {
 		printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n");
-		return SD_ASYM_PACKING;
+		sd->flags |= SD_ASYM_PACKING;
 	}
-	return 0;
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h
index d15c0d8..9383118 100644
--- a/arch/tile/include/asm/topology.h
+++ b/arch/tile/include/asm/topology.h
@@ -44,39 +44,6 @@ static inline const struct cpumask *cpumask_of_node(int node)
 /* For now, use numa node -1 for global allocation. */
 #define pcibus_to_node(bus)		((void)(bus), -1)
 
-/*
- * TILE architecture has many cores integrated in one processor, so we need
- * setup bigger balance_interval for both CPU/NODE scheduling domains to
- * reduce process scheduling costs.
- */
-
-/* sched_domains SD_CPU_INIT for TILE architecture */
-#define SD_CPU_INIT (struct sched_domain) {				\
-	.min_interval		= 4,					\
-	.max_interval		= 128,					\
-	.busy_factor		= 64,					\
-	.imbalance_pct		= 125,					\
-	.cache_nice_tries	= 1,					\
-	.busy_idx		= 2,					\
-	.idle_idx		= 1,					\
-	.newidle_idx		= 0,					\
-	.wake_idx		= 0,					\
-	.forkexec_idx		= 0,					\
-									\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 1*SD_BALANCE_NEWIDLE			\
-				| 1*SD_BALANCE_EXEC			\
-				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 0*SD_WAKE_AFFINE			\
-				| 0*SD_SHARE_CPUPOWER			\
-				| 0*SD_SHARE_PKG_RESOURCES		\
-				| 0*SD_SERIALIZE			\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 32,					\
-}
-
 /* By definition, we create nodes based on online memory. */
 #define node_has_online_mem(nid) 1
 
diff --git a/arch/tile/kernel/smp.c b/arch/tile/kernel/smp.c
index 01e8ab2..dfafe55 100644
--- a/arch/tile/kernel/smp.c
+++ b/arch/tile/kernel/smp.c
@@ -254,3 +254,15 @@ void smp_send_reschedule(int cpu)
 }
 
 #endif /* CHIP_HAS_IPI() */
+
+void arch_sd_customize(int level, struct sched_domain *sd, int cpu)
+{
+	if (level == SD_LVL_CPU) {
+		sd->min_interval = 4;
+		sd->max_interval = 128;
+
+		sd->flags &= ~(SD_WAKE_AFFINE | SD_PREFER_SIBLING);
+
+		sd->balance_interval = 32;
+	}
+}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e27baee..847485d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -769,7 +769,13 @@ enum cpu_idle_type {
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
 
-extern int __weak arch_sd_sibiling_asym_packing(void);
+/* sched-domain levels */
+#define SD_LVL_INV		0x00 /* invalid */
+#define SD_LVL_SMT		0x01
+#define SD_LVL_MC		0x02
+#define SD_LVL_BOOK		0x04
+#define SD_LVL_CPU		0x08
+#define SD_LVL_NUMA		0x10
 
 struct sched_domain_attr {
 	int relax_domain_level;
diff --git a/include/linux/topology.h b/include/linux/topology.h
index d3cf0d6..02a397a 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -66,115 +66,6 @@ int arch_update_cpu_topology(void);
 #define PENALTY_FOR_NODE_WITH_CPUS	(1)
 #endif
 
-/*
- * Below are the 3 major initializers used in building sched_domains:
- * SD_SIBLING_INIT, for SMT domains
- * SD_CPU_INIT, for SMP domains
- *
- * Any architecture that cares to do any tuning to these values should do so
- * by defining their own arch-specific initializer in include/asm/topology.h.
- * A definition there will automagically override these default initializers
- * and allow arch-specific performance tuning of sched_domains.
- * (Only non-zero and non-null fields need be specified.)
- */
-
-#ifdef CONFIG_SCHED_SMT
-/* MCD - Do we really need this?  It is always on if CONFIG_SCHED_SMT is,
- * so can't we drop this in favor of CONFIG_SCHED_SMT?
- */
-#define ARCH_HAS_SCHED_WAKE_IDLE
-/* Common values for SMT siblings */
-#ifndef SD_SIBLING_INIT
-#define SD_SIBLING_INIT (struct sched_domain) {				\
-	.min_interval		= 1,					\
-	.max_interval		= 2,					\
-	.busy_factor		= 64,					\
-	.imbalance_pct		= 110,					\
-									\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 1*SD_BALANCE_NEWIDLE			\
-				| 1*SD_BALANCE_EXEC			\
-				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 1*SD_WAKE_AFFINE			\
-				| 1*SD_SHARE_CPUPOWER			\
-				| 1*SD_SHARE_PKG_RESOURCES		\
-				| 0*SD_SERIALIZE			\
-				| 0*SD_PREFER_SIBLING			\
-				| arch_sd_sibling_asym_packing()	\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 1,					\
-	.smt_gain		= 1178,	/* 15% */			\
-}
-#endif
-#endif /* CONFIG_SCHED_SMT */
-
-#ifdef CONFIG_SCHED_MC
-/* Common values for MC siblings. for now mostly derived from SD_CPU_INIT */
-#ifndef SD_MC_INIT
-#define SD_MC_INIT (struct sched_domain) {				\
-	.min_interval		= 1,					\
-	.max_interval		= 4,					\
-	.busy_factor		= 64,					\
-	.imbalance_pct		= 125,					\
-	.cache_nice_tries	= 1,					\
-	.busy_idx		= 2,					\
-	.wake_idx		= 0,					\
-	.forkexec_idx		= 0,					\
-									\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 1*SD_BALANCE_NEWIDLE			\
-				| 1*SD_BALANCE_EXEC			\
-				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 1*SD_WAKE_AFFINE			\
-				| 0*SD_SHARE_CPUPOWER			\
-				| 1*SD_SHARE_PKG_RESOURCES		\
-				| 0*SD_SERIALIZE			\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 1,					\
-}
-#endif
-#endif /* CONFIG_SCHED_MC */
-
-/* Common values for CPUs */
-#ifndef SD_CPU_INIT
-#define SD_CPU_INIT (struct sched_domain) {				\
-	.min_interval		= 1,					\
-	.max_interval		= 4,					\
-	.busy_factor		= 64,					\
-	.imbalance_pct		= 125,					\
-	.cache_nice_tries	= 1,					\
-	.busy_idx		= 2,					\
-	.idle_idx		= 1,					\
-	.newidle_idx		= 0,					\
-	.wake_idx		= 0,					\
-	.forkexec_idx		= 0,					\
-									\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 1*SD_BALANCE_NEWIDLE			\
-				| 1*SD_BALANCE_EXEC			\
-				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 1*SD_WAKE_AFFINE			\
-				| 0*SD_SHARE_CPUPOWER			\
-				| 0*SD_SHARE_PKG_RESOURCES		\
-				| 0*SD_SERIALIZE			\
-				| 1*SD_PREFER_SIBLING			\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 1,					\
-}
-#endif
-
-#ifdef CONFIG_SCHED_BOOK
-#ifndef SD_BOOK_INIT
-#error Please define an appropriate SD_BOOK_INIT in include/asm/topology.h!!!
-#endif
-#endif /* CONFIG_SCHED_BOOK */
-
 #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
 DECLARE_PER_CPU(int, numa_node);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5ac63c9..53eda22 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5225,13 +5225,12 @@ enum s_alloc {
 
 struct sched_domain_topology_level;
 
-typedef struct sched_domain *(*sched_domain_init_f)(struct sched_domain_topology_level *tl, int cpu);
 typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
 
 #define SDTL_OVERLAP	0x01
 
 struct sched_domain_topology_level {
-	sched_domain_init_f init;
+	int		    level;
 	sched_domain_mask_f mask;
 	int		    flags;
 	int		    numa_level;
@@ -5455,9 +5454,8 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 	atomic_set(&sg->sgp->nr_busy_cpus, sg->group_weight);
 }
 
-int __weak arch_sd_sibling_asym_packing(void)
+void __weak arch_sd_customize(int level, struct sched_domain *sd, int cpu)
 {
-       return 0*SD_ASYM_PACKING;
 }
 
 /*
@@ -5471,28 +5469,6 @@ int __weak arch_sd_sibling_asym_packing(void)
 # define SD_INIT_NAME(sd, type)		do { } while (0)
 #endif
 
-#define SD_INIT_FUNC(type)						\
-static noinline struct sched_domain *					\
-sd_init_##type(struct sched_domain_topology_level *tl, int cpu) 	\
-{									\
-	struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);	\
-	*sd = SD_##type##_INIT;						\
-	SD_INIT_NAME(sd, type);						\
-	sd->private = &tl->data;					\
-	return sd;							\
-}
-
-SD_INIT_FUNC(CPU)
-#ifdef CONFIG_SCHED_SMT
- SD_INIT_FUNC(SIBLING)
-#endif
-#ifdef CONFIG_SCHED_MC
- SD_INIT_FUNC(MC)
-#endif
-#ifdef CONFIG_SCHED_BOOK
- SD_INIT_FUNC(BOOK)
-#endif
-
 static int default_relax_domain_level = -1;
 int sched_domain_level_max;
 
@@ -5587,89 +5563,140 @@ static const struct cpumask *cpu_smt_mask(int cpu)
 }
 #endif
 
-/*
- * Topology list, bottom-up.
- */
-static struct sched_domain_topology_level default_topology[] = {
+#ifdef CONFIG_NUMA
+static int sched_domains_numa_levels;
+static int *sched_domains_numa_distance;
+static struct cpumask ***sched_domains_numa_masks;
+static int sched_domains_curr_level;
+#endif
+
+static struct sched_domain *
+sd_init(struct sched_domain_topology_level *tl, int cpu)
+{
+	struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
+#ifdef CONFIG_NUMA
+	int sd_weight;
+#endif
+
+	*sd = (struct sched_domain) {
+		.min_interval = 1,
+		.max_interval = 4,
+		.busy_factor = 64,
+		.imbalance_pct = 125,
+
+		.flags	= 1*SD_LOAD_BALANCE
+				| 1*SD_BALANCE_NEWIDLE
+				| 1*SD_BALANCE_EXEC
+				| 1*SD_BALANCE_FORK
+				| 0*SD_BALANCE_WAKE
+				| 1*SD_WAKE_AFFINE
+				| 0*SD_SHARE_CPUPOWER
+				| 0*SD_SHARE_PKG_RESOURCES
+				| 0*SD_SERIALIZE
+				| 0*SD_PREFER_SIBLING
+				,
+
+		.last_balance = jiffies,
+		.balance_interval = 1,
+	};
+
+	switch (tl->level) {
 #ifdef CONFIG_SCHED_SMT
-	{ sd_init_SIBLING, cpu_smt_mask, },
+	case SD_LVL_SMT:
+		sd->max_interval = 2;
+		sd->imbalance_pct = 110;
+
+		sd->flags |= SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES;
+
+		sd->smt_gain = 1178; /* ~15% */
+
+		SD_INIT_NAME(sd, SMT);
+		break;
 #endif
 #ifdef CONFIG_SCHED_MC
-	{ sd_init_MC, cpu_coregroup_mask, },
+	case SD_LVL_MC:
+		sd->cache_nice_tries = 1;
+		sd->busy_idx = 2;
+
+		sd->flags |= SD_SHARE_PKG_RESOURCES;
+
+		SD_INIT_NAME(sd, MC);
+		break;
 #endif
+	case SD_LVL_CPU:
 #ifdef CONFIG_SCHED_BOOK
-	{ sd_init_BOOK, cpu_book_mask, },
+	case SD_LVL_BOOK:
 #endif
-	{ sd_init_CPU, cpu_cpu_mask, },
-	{ NULL, },
-};
+		sd->cache_nice_tries = 1;
+		sd->busy_idx = 2;
+		sd->idle_idx = 1;
 
-static struct sched_domain_topology_level *sched_domain_topology = default_topology;
-
-#define for_each_sd_topology(tl)			\
-	for (tl = sched_domain_topology; tl->init; tl++)
+		sd->flags |= SD_PREFER_SIBLING;
 
+		SD_INIT_NAME(sd, CPU);
+		break;
 #ifdef CONFIG_NUMA
+	case SD_LVL_NUMA:
+		sd_weight = cpumask_weight(sched_domains_numa_masks
+				[tl->numa_level][cpu_to_node(cpu)]);
 
-static int sched_domains_numa_levels;
-static int *sched_domains_numa_distance;
-static struct cpumask ***sched_domains_numa_masks;
-static int sched_domains_curr_level;
+		sd->min_interval = sd_weight;
+		sd->max_interval = 2*sd_weight;
+		sd->busy_factor = 32;
 
-static inline int sd_local_flags(int level)
-{
-	if (sched_domains_numa_distance[level] > RECLAIM_DISTANCE)
-		return 0;
+		sd->cache_nice_tries = 2;
+		sd->busy_idx = 3;
+		sd->idle_idx = 2;
 
-	return SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE;
-}
+		sd->flags |= SD_SERIALIZE;
 
-static struct sched_domain *
-sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
-{
-	struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
-	int level = tl->numa_level;
-	int sd_weight = cpumask_weight(
-			sched_domains_numa_masks[level][cpu_to_node(cpu)]);
-
-	*sd = (struct sched_domain){
-		.min_interval		= sd_weight,
-		.max_interval		= 2*sd_weight,
-		.busy_factor		= 32,
-		.imbalance_pct		= 125,
-		.cache_nice_tries	= 2,
-		.busy_idx		= 3,
-		.idle_idx		= 2,
-		.newidle_idx		= 0,
-		.wake_idx		= 0,
-		.forkexec_idx		= 0,
-
-		.flags			= 1*SD_LOAD_BALANCE
-					| 1*SD_BALANCE_NEWIDLE
-					| 0*SD_BALANCE_EXEC
-					| 0*SD_BALANCE_FORK
-					| 0*SD_BALANCE_WAKE
-					| 0*SD_WAKE_AFFINE
-					| 0*SD_SHARE_CPUPOWER
-					| 0*SD_SHARE_PKG_RESOURCES
-					| 1*SD_SERIALIZE
-					| 0*SD_PREFER_SIBLING
-					| sd_local_flags(level)
-					,
-		.last_balance		= jiffies,
-		.balance_interval	= sd_weight,
-	};
-	SD_INIT_NAME(sd, NUMA);
-	sd->private = &tl->data;
+		if (sched_domains_numa_distance[tl->numa_level] >
+				RECLAIM_DISTANCE)
+			sd->flags &= ~(SD_BALANCE_EXEC | SD_BALANCE_FORK |
+						   SD_WAKE_AFFINE);
 
-	/*
-	 * Ugly hack to pass state to sd_numa_mask()...
-	 */
-	sched_domains_curr_level = tl->numa_level;
+		sd->balance_interval = sd_weight;
+
+		/*
+		 * Ugly hack to pass state to sd_numa_mask()...
+		 */
+		sched_domains_curr_level = tl->numa_level;
+
+		SD_INIT_NAME(sd, NUMA);
+		break;
+#endif
+	}
 
+	arch_sd_customize(tl->level, sd, cpu);
+	sd->private = &tl->data;
 	return sd;
 }
 
+/*
+ * Topology list, bottom-up.
+ */
+static struct sched_domain_topology_level default_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+		{ SD_LVL_SMT, cpu_smt_mask },
+#endif
+#ifdef CONFIG_SCHED_MC
+		{ SD_LVL_MC, cpu_coregroup_mask },
+#endif
+#ifdef CONFIG_SCHED_BOOK
+		{ SD_LVL_BOOK, cpu_book_mask },
+#endif
+		{ SD_LVL_CPU, cpu_cpu_mask },
+		{ SD_LVL_INV, },
+};
+
+static struct sched_domain_topology_level *sched_domain_topology =
+		default_topology;
+
+#define for_each_sd_topology(tl)                       \
+		for (tl = sched_domain_topology; tl->level; tl++)
+
+#ifdef CONFIG_NUMA
+
 static const struct cpumask *sd_numa_mask(int cpu)
 {
 	return sched_domains_numa_masks[sched_domains_curr_level][cpu_to_node(cpu)];
@@ -5821,7 +5848,7 @@ static void sched_init_numa(void)
 	/*
 	 * Copy the default topology bits..
 	 */
-	for (i = 0; default_topology[i].init; i++)
+	for (i = 0; default_topology[i].level; i++)
 		tl[i] = default_topology[i];
 
 	/*
@@ -5829,7 +5856,6 @@ static void sched_init_numa(void)
 	 */
 	for (j = 0; j < level; i++, j++) {
 		tl[i] = (struct sched_domain_topology_level){
-			.init = sd_numa_init,
 			.mask = sd_numa_mask,
 			.flags = SDTL_OVERLAP,
 			.numa_level = j,
@@ -5990,7 +6016,7 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
 		const struct cpumask *cpu_map, struct sched_domain_attr *attr,
 		struct sched_domain *child, int cpu)
 {
-	struct sched_domain *sd = tl->init(tl, cpu);
+	struct sched_domain *sd = sd_init(tl, cpu);
 	if (!sd)
 		return child;
 
-- 
1.7.9.5



^ permalink raw reply related	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH v5 01/14] sched: add a new arch_sd_local_flags for sched_domain init
  2013-11-12 17:43             ` Dietmar Eggemann
@ 2013-11-12 18:08               ` Peter Zijlstra
  2013-11-13 15:47                 ` Dietmar Eggemann
  0 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2013-11-12 18:08 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Martin Schwidefsky, Vincent Guittot, linux-kernel, Ingo Molnar,
	Paul Turner, Morten Rasmussen, cmetcalf, tony.luck, Alex Shi,
	Preeti U Murthy, linaro-kernel, Rafael J. Wysocki, Paul McKenney,
	Jonathan Corbet, Thomas Gleixner, Len Brown, Arjan van de Ven,
	Amit Kucheria, Lukasz Majewski, james.hogan, heiko.carstens

On Tue, Nov 12, 2013 at 05:43:36PM +0000, Dietmar Eggemann wrote:
> This patch removes the sched_domain initializer macros
> SD_[SIBLING|MC|BOOK|CPU]_INIT in core.c and in archs and replaces them
> with calls to the new function sd_init().  The function sd_init
> incorporates the already existing function sd_numa_init().

Your patch retains far too much of the weird behavioural variations we
have, nor does it create a proper separation between topology and
behaviour.

We might indeed have to have a single arch_() function that adds
SD_flags, but please restrict the flags it can set -- never allow it to
set behavioural flags.

Furthermore, I think we want to allow the arch to override the base
topology; we've had desire to add per arch level in the past.. eg. add
an L2 level for some x86 variants.



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH v5 00/14] sched: packing tasks
  2013-11-12 16:48         ` Arjan van de Ven
@ 2013-11-12 23:14           ` Catalin Marinas
  2013-11-13 16:13             ` Arjan van de Ven
  0 siblings, 1 reply; 101+ messages in thread
From: Catalin Marinas @ 2013-11-12 23:14 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Peter Zijlstra, Vincent Guittot, linux-kernel, Ingo Molnar,
	Paul Turner, Morten Rasmussen, Chris Metcalf, Tony Luck,
	alex.shi, Preeti U Murthy, linaro-kernel, len.brown, l.majewski,
	Jonathan Corbet, Rafael J. Wysocki, Paul McKenney, linux-pm

On 12 Nov 2013, at 16:48, Arjan van de Ven <arjan@linux.intel.com> wrote:
> On 11/11/2013 10:18 AM, Catalin Marinas wrote:
>> The ordering is based on the actual C-state, so a simple way is to wake
>> up the CPU in the shallowest C-state. With asymmetric configurations
>> (big.LITTLE) we have different costs for the same C-state, so this would
>> come in handy.
> 
> btw I was considering something else; in practice CPUs will be in the deepest state..
> ... at which point I was going to go with some other metrics of what is best from a platform level

I agree, other metrics are needed.  The problem is that we currently
only have (relatively, guessed from the target residency) the cost of
transition from a C-state to a P-state (for the latter, not sure which).
But we don’t know what the power (saving) in that C-state is, nor the one
at a P-state (and vendors are reluctant to provide such information). So the
best the scheduler can do is optimise the wake-up cost and blindly assume
that a deeper C-state on one CPU is more efficient than lower P-states on two
other CPUs (or the other way around).

If we find a good use for such metrics in the scheduler, I think the
vendors would be more open to providing at least some relative (rather
than absolute) numbers.

Catalin

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH v5 01/14] sched: add a new arch_sd_local_flags for sched_domain init
  2013-11-12 18:08               ` Peter Zijlstra
@ 2013-11-13 15:47                 ` Dietmar Eggemann
  2013-11-13 16:29                   ` Peter Zijlstra
  0 siblings, 1 reply; 101+ messages in thread
From: Dietmar Eggemann @ 2013-11-13 15:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Martin Schwidefsky, Vincent Guittot, linux-kernel, Ingo Molnar,
	Paul Turner, Morten Rasmussen, cmetcalf, tony.luck, Alex Shi,
	Preeti U Murthy, linaro-kernel, Rafael J. Wysocki, Paul McKenney,
	Jonathan Corbet, Thomas Gleixner, Len Brown, Arjan van de Ven,
	Amit Kucheria, Lukasz Majewski, james.hogan, heiko.carstens

On 12/11/13 18:08, Peter Zijlstra wrote:
> On Tue, Nov 12, 2013 at 05:43:36PM +0000, Dietmar Eggemann wrote:
>> This patch removes the sched_domain initializer macros
>> SD_[SIBLING|MC|BOOK|CPU]_INIT in core.c and in archs and replaces them
>> with calls to the new function sd_init().  The function sd_init
>> incorporates the already existing function sd_numa_init().
> 
> Your patch retains far too much of the weird behavioural variations we
> have, nor does it create a proper separation between topology and
> behaviour.

Could you please explain a little bit further on the weird behavioural
variations. Are you referring to the specific SD_ flags or sd_domain levels?
I agree that this patch doesn't separate behaviour and topology and I
will consider this going forward.

> 
> We might indeed have to have a single arch_() function that adds
> SD_flags, but please restrict the flags it can set -- never allow it to
> set behavioural flags.

Understood. Simply exporting an sd_domain pointer is a no-go.

> 
> Furthermore, I think we want to allow the arch to override the base
> topology; we've had desire to add per arch level in the past.. eg. add
> an L2 level for some x86 variants.

I don't quite understand this one. Are you saying that one idea for the
topology side of things is to have an extra arch-specific sd level, which
would be the only sd_domain level that could then be overridden by the
arch?

Thanks,

-- Dietmar

> 
> 
> 



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH v5 00/14] sched: packing tasks
  2013-11-12 23:14           ` Catalin Marinas
@ 2013-11-13 16:13             ` Arjan van de Ven
  2013-11-13 16:45               ` Catalin Marinas
  0 siblings, 1 reply; 101+ messages in thread
From: Arjan van de Ven @ 2013-11-13 16:13 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Peter Zijlstra, Vincent Guittot, linux-kernel, Ingo Molnar,
	Paul Turner, Morten Rasmussen, Chris Metcalf, Tony Luck,
	alex.shi, Preeti U Murthy, linaro-kernel, len.brown, l.majewski,
	Jonathan Corbet, Rafael J. Wysocki, Paul McKenney, linux-pm

On 11/12/2013 3:14 PM, Catalin Marinas wrote:
> On 12 Nov 2013, at 16:48, Arjan van de Ven <arjan@linux.intel.com> wrote:
>> On 11/11/2013 10:18 AM, Catalin Marinas wrote:
>>> The ordering is based on the actual C-state, so a simple way is to wake
>>> up the CPU in the shallowest C-state. With asymmetric configurations
>>> (big.LITTLE) we have different costs for the same C-state, so this would
>>> come in handy.
>>
>> btw I was considering something else; in practice CPUs will be in the deepest state..
>> ... at which point I was going to go with some other metrics of what is best from a platform level
>
> I agree, other metrics are needed.  The problem is that we currently
> only have (relatively, guessed from the target residency) the cost of
> transition from a C-state to a P-state (for the latter, not sure which).
> But we don’t know what the power (saving) on that C-state is nor the one
> at a P-state (and vendors reluctant to provide such information). So the
> best the scheduler can do is optimise the wake-up cost and blindly assume
> that deeper C-state on a CPU is more efficient than lower P-states on two
> other CPUs (or the other way around).

for picking the cpu to wake on there are also low level physical kind of things
we'd want to take into account on the intel side.



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH v5 01/14] sched: add a new arch_sd_local_flags for sched_domain init
  2013-11-13 15:47                 ` Dietmar Eggemann
@ 2013-11-13 16:29                   ` Peter Zijlstra
  2013-11-14 10:49                     ` Morten Rasmussen
  0 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2013-11-13 16:29 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Martin Schwidefsky, Vincent Guittot, linux-kernel, Ingo Molnar,
	Paul Turner, Morten Rasmussen, cmetcalf, tony.luck, Alex Shi,
	Preeti U Murthy, linaro-kernel, Rafael J. Wysocki, Paul McKenney,
	Jonathan Corbet, Thomas Gleixner, Len Brown, Arjan van de Ven,
	Amit Kucheria, Lukasz Majewski, james.hogan, heiko.carstens

On Wed, Nov 13, 2013 at 03:47:16PM +0000, Dietmar Eggemann wrote:
> On 12/11/13 18:08, Peter Zijlstra wrote:
> > On Tue, Nov 12, 2013 at 05:43:36PM +0000, Dietmar Eggemann wrote:
> >> This patch removes the sched_domain initializer macros
> >> SD_[SIBLING|MC|BOOK|CPU]_INIT in core.c and in archs and replaces them
> >> with calls to the new function sd_init().  The function sd_init
> >> incorporates the already existing function sd_numa_init().
> > 
> > Your patch retains far too much of the weird behavioural variations we
> > have, nor does it create a proper separation between topology and
> > behaviour.
> 
> Could you please explain a little bit further on the weird behavioural
> variations. Are you referring to the specific SD_ flags or sd_domain levels?

+++ b/arch/ia64/kernel/topology.c
@@ -99,6 +99,14 @@ out:

 subsys_initcall(topology_init);

+void arch_sd_customize(int level, struct sched_domain *sd, int cpu)
+{
+       if (level == SD_LVL_CPU) {
+               sd->cache_nice_tries = 2;
+
+               sd->flags &= ~SD_PREFER_SIBLING;
+       }
+}

+++ b/arch/tile/kernel/smp.c
@@ -254,3 +254,15 @@ void smp_send_reschedule(int cpu)
 }

 #endif /* CHIP_HAS_IPI() */
+
+void arch_sd_customize(int level, struct sched_domain *sd, int cpu)
+{
+       if (level == SD_LVL_CPU) {
+               sd->min_interval = 4;
+               sd->max_interval = 128;
+
+               sd->flags &= ~(SD_WAKE_AFFINE | SD_PREFER_SIBLING);
+
+               sd->balance_interval = 32;
+       }
+}

Many of these differences are just bitrot / accidents, and the different
min interval for tile was already taken care of by basing the intervals
off of the domain weight.

On that, you should also not rely on these SD_LVL things; if we wanted
to inject an extra level they'd go all funny.

> I agree that this patch doesn't separate behaviour and topology and I
> will consider this going forward.

Please take the patch I did and work from there.

> > We might indeed have to have a single arch_() function that adds
> > SD_flags, but please restrict the flags it can set -- never allow it to
> > set behavioural flags.
> 
> Understood. Simply exporting an sd_domain pointer is a no-go.

I was more thinking along the lines of:

unsigned long arch_sd_flags(unsigned long sd_flags)
{
	return 0;
}

Used like:

	extra_sd_flags = arch_sd_flags(sd->sd_flags);
	if (extra_sd_flags & FOO) {
		WARN("silly bugger: %x\n", extra_sd_flags);
		extra_sd_flags &= ~FOO;
	}
	sd->sd_flags |= extra_sd_flags;

Or something.

> > 
> > Furthermore, I think we want to allow the arch to override the base
> > topology; we've had desire to add per arch level in the past.. eg. add
> > an L2 level for some x86 variants.
> 
> I don't quite understand this one. Are you saying that one idea for the
> topology side of things is to have an extra arch-specific sd level, which
> would be the only sd_domain level that could then be overridden by the
> arch?

No, allow an arch to fully override default_topology[] like I did in that
s390 case.

The case where the x86 cpu_core_map != cpu_llc_shared_map can currently
not be represented. Luckily no recent chips have had this, but there was
a time when this was a popular configuration (Intel Core2Quad).

There's been other 'fun' cases, like AMD putting two nodes in 1 package.
That's not something we can represent (not sure we need to but still).

And there's the AMD Bulldozer shared-core thing, which we currently
model the same as SMT but which could do with some tweaks - they're
distinct in that they share a 'large' L2.

Then there's the Xeon-7400 which is 'similar' to the AMD in that cores
share L2.

Anyway, all I'm saying is that even within one architecture there's
sufficient variation to allow for runtime topology creation. I'm sure
ARM has plenty weird configurations too.

And I'm not sure we need to represent all the weird variations, but in
the past we've had moments where we wanted to but could not (sanely) do
things.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH v5 00/14] sched: packing tasks
  2013-11-13 16:13             ` Arjan van de Ven
@ 2013-11-13 16:45               ` Catalin Marinas
  2013-11-13 17:56                 ` Arjan van de Ven
  0 siblings, 1 reply; 101+ messages in thread
From: Catalin Marinas @ 2013-11-13 16:45 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Peter Zijlstra, Vincent Guittot, linux-kernel, Ingo Molnar,
	Paul Turner, Morten Rasmussen, Chris Metcalf, Tony Luck,
	alex.shi, Preeti U Murthy, linaro-kernel, len.brown, l.majewski,
	Jonathan Corbet, Rafael J. Wysocki, Paul McKenney, linux-pm

On Wed, Nov 13, 2013 at 04:13:57PM +0000, Arjan van de Ven wrote:
> On 11/12/2013 3:14 PM, Catalin Marinas wrote:
> > On 12 Nov 2013, at 16:48, Arjan van de Ven <arjan@linux.intel.com> wrote:
> >> On 11/11/2013 10:18 AM, Catalin Marinas wrote:
> >>> The ordering is based on the actual C-state, so a simple way is to wake
> >>> up the CPU in the shallowest C-state. With asymmetric configurations
> >>> (big.LITTLE) we have different costs for the same C-state, so this would
> >>> come in handy.
> >>
> >> btw I was considering something else; in practice CPUs will be in the deepest state..
> >> ... at which point I was going to go with some other metrics of what is best from a platform level
> >
> > I agree, other metrics are needed.  The problem is that we currently
> > only have (relatively, guessed from the target residency) the cost of
> > transition from a C-state to a P-state (for the latter, not sure which).
> > But we don’t know what the power (saving) on that C-state is nor the one
> > at a P-state (and vendors reluctant to provide such information). So the
> > best the scheduler can do is optimise the wake-up cost and blindly assume
> > that deeper C-state on a CPU is more efficient than lower P-states on two
> > other CPUs (or the other way around).
> 
> for picking the cpu to wake on there are also low level physical kind of things
> we'd want to take into account on the intel side.

Are these static and could they be hidden behind some cost number in a
topology description? If they are dynamic, we would need arch or driver
hooks to give some cost or priority number that the scheduler can use.

-- 
Catalin

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH v5 00/14] sched: packing tasks
  2013-11-13 16:45               ` Catalin Marinas
@ 2013-11-13 17:56                 ` Arjan van de Ven
  0 siblings, 0 replies; 101+ messages in thread
From: Arjan van de Ven @ 2013-11-13 17:56 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Peter Zijlstra, Vincent Guittot, linux-kernel, Ingo Molnar,
	Paul Turner, Morten Rasmussen, Chris Metcalf, Tony Luck,
	alex.shi, Preeti U Murthy, linaro-kernel, len.brown, l.majewski,
	Jonathan Corbet, Rafael J. Wysocki, Paul McKenney, linux-pm


>> for picking the cpu to wake on there are also low level physical kind of things
>> we'd want to take into account on the intel side.
>
> Are these static and could they be hidden behind some cost number in a
> topology description? If they are dynamic, we would need arch or driver
> hooks to give some cost or priority number that the scheduler can use.

they're dynamic but slow moving (say, reevaluated once per second)

so we could have a static table that some driver updates async
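
e.g. something along these lines (all names are invented, just to show
the shape of such a table):

DEFINE_PER_CPU(unsigned int, platform_wake_cost);

static void update_wake_costs(struct work_struct *work);
static DECLARE_DELAYED_WORK(wake_cost_work, update_wake_costs);

static void update_wake_costs(struct work_struct *work)
{
	int cpu;

	/* read_platform_cost() is an assumed platform-specific hook */
	for_each_online_cpu(cpu)
		per_cpu(platform_wake_cost, cpu) = read_platform_cost(cpu);

	/* slow moving: re-evaluate roughly once per second */
	schedule_delayed_work(&wake_cost_work, HZ);
}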


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH v5 01/14] sched: add a new arch_sd_local_flags for sched_domain init
  2013-11-13 16:29                   ` Peter Zijlstra
@ 2013-11-14 10:49                     ` Morten Rasmussen
  2013-11-14 12:07                       ` Peter Zijlstra
  0 siblings, 1 reply; 101+ messages in thread
From: Morten Rasmussen @ 2013-11-14 10:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, Martin Schwidefsky, Vincent Guittot,
	linux-kernel, Ingo Molnar, Paul Turner, cmetcalf, tony.luck,
	Alex Shi, Preeti U Murthy, linaro-kernel, Rafael J. Wysocki,
	Paul McKenney, Jonathan Corbet, Thomas Gleixner, Len Brown,
	Arjan van de Ven, Amit Kucheria, Lukasz Majewski, james.hogan,
	heiko.carstens

On Wed, Nov 13, 2013 at 04:29:19PM +0000, Peter Zijlstra wrote:
> On Wed, Nov 13, 2013 at 03:47:16PM +0000, Dietmar Eggemann wrote:
> > On 12/11/13 18:08, Peter Zijlstra wrote:
> > > On Tue, Nov 12, 2013 at 05:43:36PM +0000, Dietmar Eggemann wrote:
> > >> This patch removes the sched_domain initializer macros
> > >> SD_[SIBLING|MC|BOOK|CPU]_INIT in core.c and in archs and replaces them
> > >> with calls to the new function sd_init().  The function sd_init
> > >> incorporates the already existing function sd_numa_init().
> > > 
> > > Your patch retains far too much of the weird behavioural variations we
> > > have, nor does it create a proper separation between topology and
> > > behaviour.
> > 
> > Could you please explain a little bit further on the weird behavioural
> > variations. Are you referring to the specific SD_ flags or sd_domain levels?
> 
> +++ b/arch/ia64/kernel/topology.c
> @@ -99,6 +99,14 @@ out:
> 
>  subsys_initcall(topology_init);
> 
> +void arch_sd_customize(int level, struct sched_domain *sd, int cpu)
> +{
> +       if (level == SD_LVL_CPU) {
> +               sd->cache_nice_tries = 2;
> +
> +               sd->flags &= ~SD_PREFER_SIBLING;
> +       }
> +}
> 
> +++ b/arch/tile/kernel/smp.c
> @@ -254,3 +254,15 @@ void smp_send_reschedule(int cpu)
>  }
> 
>  #endif /* CHIP_HAS_IPI() */
> +
> +void arch_sd_customize(int level, struct sched_domain *sd, int cpu)
> +{
> +       if (level == SD_LVL_CPU) {
> +               sd->min_interval = 4;
> +               sd->max_interval = 128;
> +
> +               sd->flags &= ~(SD_WAKE_AFFINE | SD_PREFER_SIBLING);
> +
> +               sd->balance_interval = 32;
> +       }
> +}
> 
> Many of these differences are just bitrot / accidents, and the different
> min interval for tile was already taken care of by basing the intervals
> off of the domain weight.
> 
> On that, you should also not rely on these SD_LVL things; if we wanted
> to inject an extra level they'd go all funny.
> 
> > I agree that this patch doesn't separate behaviour and topology and I
> > will consider this going forward.
> 
> Please take the patch I did and work from there.
> 
> > > We might indeed have to have a single arch_() function that adds
> > > SD_flags, but please restrict the flags it can set -- never allow it to
> > > set behavioural flags.
> > 
> > Understood. Simply exporting an sd_domain pointer is a no-go.
> 
> I was more thinking along the lines of:
> 
> unsigned long arch_sd_flags(unsigned long sd_flags)
> {
> 	return 0;
> }
> 
> Used like:
> 
> 	extra_sd_flags = arch_sd_flags(sd->sd_flags);
> 	if (extra_sd_flags & FOO) {
> 		WARN("silly bugger: %x\n", extra_sd_flags);
> 		extra_sd_flags &= ~FOO;
> 	}
> 	sd->sd_flags |= extra_sd_flags;
> 
> Or something.

We need a way to know which group of cpus the flag applies to. If we
don't want to pass a pointer to the sched_domain and we want to replace
the current named sched_domain levels with something more flexible, the
only solution I can think of right now is to pass a cpumask to the arch
code. Better suggestions?

If we let arch generate the topology it could set the flags as well. But
that means that an arch would have to deal with generating the topology
even if it just needs to flip a single flag in the default topology.

Another thing is if we want to put energy related information into the
sched_domain hierarchy. If we want various energy costs (P and C state)
to be represented here we would need to modify more than just flags.

One way to do that is to put the energy information into a sub-struct and
have another arch_sd_energy() call that allows the arch to populate that
struct with relevant information.
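
Something like the following, purely to illustrate the shape (none of
this exists yet):

struct sd_energy {
	unsigned int	wakeup_cost;	/* relative C-state wake-up cost */
	unsigned int	capacity_cost;	/* relative cost per unit of load */
};

/* the arch fills in energy data for the cpus spanned by 'span' */
void __weak arch_sd_energy(const struct cpumask *span,
			   struct sd_energy *energy)
{
}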

Morten

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH v5 01/14] sched: add a new arch_sd_local_flags for sched_domain init
  2013-11-14 10:49                     ` Morten Rasmussen
@ 2013-11-14 12:07                       ` Peter Zijlstra
  0 siblings, 0 replies; 101+ messages in thread
From: Peter Zijlstra @ 2013-11-14 12:07 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Dietmar Eggemann, Martin Schwidefsky, Vincent Guittot,
	linux-kernel, Ingo Molnar, Paul Turner, cmetcalf, tony.luck,
	Alex Shi, Preeti U Murthy, linaro-kernel, Rafael J. Wysocki,
	Paul McKenney, Jonathan Corbet, Thomas Gleixner, Len Brown,
	Arjan van de Ven, Amit Kucheria, Lukasz Majewski, james.hogan,
	heiko.carstens

On Thu, Nov 14, 2013 at 10:49:02AM +0000, Morten Rasmussen wrote:
> We need a way to know which group of cpus the flag applies to. If we
> don't want to pass a pointer to the sched_domain and we want to replace
> the current named sched_domain levels with something more flexible, the
> only solution I can think of right now is to pass a cpumask to the arch
> code. Better suggestions?

That might work.

> If we let arch generate the topology it could set the flags as well. But
> that means that an arch would have to deal with generating the topology
> even if it just needs to flip a single flag in the default topology.
> 
> Another thing is if we want to put energy related information into the
> sched_domain hierarchy. If we want various energy costs (P and C state)
> to be represented here we would need to modify more than just flags.
> 
> One way to do that is to put the energy information into a sub-struct and
> have another arch_sd_energy() call that allows the arch to populate that
> struct with relevant information.

Right.. I didn't think that far ahead yet :-)

So one thing that could be done is something like:

struct sd_energy arch_sd_energy(const struct sched_domain * sd);

Where you have the entire sd available, but since it's const you cannot
actually modify it and are restricted to the return value -- which we
can validate before applying.

And we can have one such a function per specific thing we want to allow
modifying.
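
Used along the same lines as the arch_sd_flags() example earlier (again
only a sketch; sd->energy and sd_energy_valid() do not exist):

	struct sd_energy e = arch_sd_energy(sd);

	if (sd_energy_valid(&e))	/* hypothetical sanity check */
		sd->energy = e;
	else
		WARN_ONCE(1, "bogus energy data from arch\n");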

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH v5 00/14] sched: packing tasks
  2013-11-11 16:36   ` Peter Zijlstra
  2013-11-11 16:39     ` Arjan van de Ven
  2013-11-12 17:40     ` Catalin Marinas
@ 2013-11-25 18:55     ` Daniel Lezcano
  2 siblings, 0 replies; 101+ messages in thread
From: Daniel Lezcano @ 2013-11-25 18:55 UTC (permalink / raw)
  To: Peter Zijlstra, Catalin Marinas
  Cc: Vincent Guittot, Linux Kernel Mailing List, Ingo Molnar,
	Paul Turner, Morten Rasmussen, Chris Metcalf, Tony Luck,
	alex.shi, Preeti U Murthy, linaro-kernel, len.brown, l.majewski,
	Jonathan Corbet, Rafael J. Wysocki, Paul McKenney,
	Arjan van de Ven, linux-pm

On 11/11/2013 05:36 PM, Peter Zijlstra wrote:
> On Mon, Nov 11, 2013 at 11:33:45AM +0000, Catalin Marinas wrote:
>
> tl;dr :-) Still trying to wrap my head around how to do that weird
> topology Vincent raised..
>
>> Question for Peter/Ingo: do you want the scheduler to decide on which
>> C-state a CPU should be in or we still leave this to a cpuidle
>> layer/driver?
>
> I think the can leave most of that in a driver; right along with how to
> prod the hardware to actually get into that state.
>
> I think the most important parts are what is now 'generic' code; stuff
> that guestimates the idle-time and so forth.
>
> I think the scheduler simply wants to say: we expect to go idle for X
> ns, we want a guaranteed wakeup latency of Y ns -- go do your thing.

Hi Peter,

IIUC, for full integration in the scheduler, we should eradicate the 
idle task and the related code tied to it, no?

> I think you also raised the point in that we do want some feedback as to
> the cost of waking up particular cores to better make decisions on which
> to wake. That is indeed so.




^ permalink raw reply	[flat|nested] 101+ messages in thread

* [tip:sched/core] sched/fair: Move load idx selection in find_idlest_group
  2013-10-18 11:52 ` [RFC][PATCH v5 08/14] sched: move load idx selection in find_idlest_group Vincent Guittot
  2013-11-12 10:49   ` Peter Zijlstra
@ 2013-11-27 14:10   ` tip-bot for Vincent Guittot
  1 sibling, 0 replies; 101+ messages in thread
From: tip-bot for Vincent Guittot @ 2013-11-27 14:10 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, peterz, vincent.guittot, tglx

Commit-ID:  c44f2a020072d75d6b0cbf9f139a09719cda9367
Gitweb:     http://git.kernel.org/tip/c44f2a020072d75d6b0cbf9f139a09719cda9367
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Fri, 18 Oct 2013 13:52:21 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 27 Nov 2013 13:50:54 +0100

sched/fair: Move load idx selection in find_idlest_group

load_idx is used in find_idlest_group but initialized in select_task_rq_fair
even when not used. The load_idx initialisation is moved into find_idlest_group
and sd_flag replaces it in the function's arguments.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: len.brown@intel.com
Cc: amit.kucheria@linaro.org
Cc: pjt@google.com
Cc: l.majewski@samsung.com
Cc: Morten.Rasmussen@arm.com
Cc: cmetcalf@tilera.com
Cc: tony.luck@intel.com
Cc: alex.shi@intel.com
Cc: preeti@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: rjw@sisk.pl
Cc: paulmck@linux.vnet.ibm.com
Cc: corbet@lwn.net
Cc: arjan@linux.intel.com
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1382097147-30088-8-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e8b652e..6cb36c7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4110,12 +4110,16 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
  */
 static struct sched_group *
 find_idlest_group(struct sched_domain *sd, struct task_struct *p,
-		  int this_cpu, int load_idx)
+		  int this_cpu, int sd_flag)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
 	unsigned long min_load = ULONG_MAX, this_load = 0;
+	int load_idx = sd->forkexec_idx;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
 
+	if (sd_flag & SD_BALANCE_WAKE)
+		load_idx = sd->wake_idx;
+
 	do {
 		unsigned long load, avg_load;
 		int local_group;
@@ -4283,7 +4287,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	}
 
 	while (sd) {
-		int load_idx = sd->forkexec_idx;
 		struct sched_group *group;
 		int weight;
 
@@ -4292,10 +4295,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			continue;
 		}
 
-		if (sd_flag & SD_BALANCE_WAKE)
-			load_idx = sd->wake_idx;
-
-		group = find_idlest_group(sd, p, cpu, load_idx);
+		group = find_idlest_group(sd, p, cpu, sd_flag);
 		if (!group) {
 			sd = sd->child;
 			continue;

^ permalink raw reply related	[flat|nested] 101+ messages in thread

* [RFC] sched: CPU topology try
  2013-11-05 22:27       ` Peter Zijlstra
  2013-11-06 10:10         ` Vincent Guittot
  2013-11-06 13:53         ` Martin Schwidefsky
@ 2013-12-18 13:13         ` Vincent Guittot
  2013-12-23 17:22           ` Dietmar Eggemann
                             ` (3 more replies)
  2 siblings, 4 replies; 101+ messages in thread
From: Vincent Guittot @ 2013-12-18 13:13 UTC (permalink / raw)
  To: peterz, linux-kernel
  Cc: mingo, pjt, Morten.Rasmussen, cmetcalf, tony.luck, alex.shi,
	preeti, linaro-kernel, rjw, paulmck, corbet, tglx, len.brown,
	arjan, amit.kucheria, james.hogan, schwidefsky, heiko.carstens,
	Dietmar.Eggemann, Vincent Guittot

This patch applies on top of the two patches [1][2] that have been proposed by
Peter for creating a new way to initialize sched_domain. It includes some minor
compilation fixes and a trial of using this new method on ARM platform.
[1] https://lkml.org/lkml/2013/11/5/239
[2] https://lkml.org/lkml/2013/11/5/449

Based on the results of these tests, my feeling about this new way to init the
sched_domain is a bit mixed.

The good point is that I have been able to create the same sched_domain
topologies as before and even more complex ones (where a subset of the cores
in a cluster share their power-gating capabilities). I have described various
topology results below.

I use a system that is made of a dual cluster of quad cores with hyperthreading
for my examples.

If one cluster (0-7) can power-gate its cores independently but not the other
cluster (8-15), we have the following topology, which is equal to what I had
previously:

CPU0:
domain 0: span 0-1 level: SMT
    flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
    groups: 0 1
  domain 1: span 0-7 level: MC
      flags: SD_SHARE_PKG_RESOURCES
      groups: 0-1 2-3 4-5 6-7
    domain 2: span 0-15 level: CPU
        flags:
        groups: 0-7 8-15

CPU8
domain 0: span 8-9 level: SMT
    flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
    groups: 8 9
  domain 1: span 8-15 level: MC
      flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
      groups: 8-9 10-11 12-13 14-15
    domain 2: span 0-15 level CPU
        flags:
        groups: 8-15 0-7

We can even describe some more complex topologies if a subset (2-7) of the
cluster can't power-gate independently:

CPU0:
domain 0: span 0-1 level: SMT
    flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
    groups: 0 1
  domain 1: span 0-7 level: MC
      flags: SD_SHARE_PKG_RESOURCES
      groups: 0-1 2-7
    domain 2: span 0-15 level: CPU
        flags:
        groups: 0-7 8-15

CPU2:
domain 0: span 2-3 level: SMT
    flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
    groups: 0 1
  domain 1: span 2-7 level: MC
      flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
      groups: 2-7 4-5 6-7
    domain 2: span 0-7 level: MC
        flags: SD_SHARE_PKG_RESOURCES
        groups: 2-7 0-1
      domain 3: span 0-15 level: CPU
          flags:
          groups: 0-7 8-15

In this case, we have an additional sched_domain MC level for this subset (2-7)
of cores, so we can trigger some load balancing in this subset before doing it
on the complete cluster (which is the last level of cache in my example).

We can add more levels that describe other dependencies/independencies, like
the frequency scaling dependency, and as a result the final sched_domain
topology will have additional levels (if they have not been removed during
the degenerate sequence).
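
As an illustration only (the helper and the flag below do not exist in the
patch; they just show the shape such an additional level could take), a
frequency-dependency level would be one more entry in the topology table:

/* hypothetical: CPUs that have to change frequency together */
static const struct cpumask *cpu_freqdomain_mask(int cpu)
{
	return &cpu_topology[cpu].core_sibling;
}

static struct sched_domain_topology_level arm_topology[] = {
	/* ... SMT / MC / power levels as in the patch below ... */
	{ cpu_freqdomain_mask, SD_SHARE_FREQDOMAIN },	/* hypothetical flag */
	{ cpu_cpu_mask, },
	{ NULL, },
};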

My concern is about the configuration of the table that is used to create the
sched_domains. Some levels are "duplicated" with different flag configurations,
which makes the table not easily readable, and we must also take care of the
order because a parent has to gather all the CPUs of its children. So we must
choose which capabilities will be a subset of the other ones. The order is
almost straightforward when we describe 1 or 2 kinds of capabilities
(package resource sharing and power sharing) but it can become complex if we
want to add more.

Regards
Vincent

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

---
 arch/arm/include/asm/topology.h |    4 ++
 arch/arm/kernel/topology.c      |   99 ++++++++++++++++++++++++++++++++++++++-
 include/linux/sched.h           |    7 +++
 kernel/sched/core.c             |   17 +++----
 4 files changed, 116 insertions(+), 11 deletions(-)

diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 58b8b84..5102847 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -5,12 +5,16 @@
 
 #include <linux/cpumask.h>
 
+#define CPU_CORE_GATE		0x1
+#define CPU_CLUSTER_GATE	0x2
+
 struct cputopo_arm {
 	int thread_id;
 	int core_id;
 	int socket_id;
 	cpumask_t thread_sibling;
 	cpumask_t core_sibling;
+	int flags;
 };
 
 extern struct cputopo_arm cpu_topology[NR_CPUS];
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 85a8737..8a2aec6 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -24,6 +24,7 @@
 
 #include <asm/cputype.h>
 #include <asm/topology.h>
+#include <asm/smp_plat.h>
 
 /*
  * cpu power scale management
@@ -79,6 +80,51 @@ unsigned long *__cpu_capacity;
 
 unsigned long middle_capacity = 1;
 
+static int __init get_dt_power_topology(struct device_node *topo)
+{
+	const u32 *reg;
+	int len, power = 0;
+	int flag = CPU_CORE_GATE;
+
+	for (; topo; topo = of_get_next_parent(topo)) {
+		reg = of_get_property(topo, "power-gate", &len);
+		if (reg && len == 4 && be32_to_cpup(reg))
+			power |= flag;
+		flag <<= 1;
+	}
+
+	return power;
+}
+
+#define for_each_subnode_with_property(dn, pn, prop_name) \
+	for (dn = of_find_node_with_property(pn, prop_name); dn; \
+	     dn = of_find_node_with_property(dn, prop_name))
+
+static void __init init_dt_power_topology(void)
+{
+	struct device_node *cn, *topo;
+
+	/* Get power domain topology information */
+	cn = of_find_node_by_path("/cpus/cpu-map");
+	if (!cn) {
+		pr_warn("Missing cpu-map node, bailing out\n");
+		return;
+	}
+
+	for_each_subnode_with_property(topo, cn, "cpu") {
+		struct device_node *cpu;
+
+		cpu = of_parse_phandle(topo, "cpu", 0);
+		if (cpu) {
+			u32 hwid;
+
+			of_property_read_u32(cpu, "reg", &hwid);
+			cpu_topology[get_logical_index(hwid)].flags = get_dt_power_topology(topo);
+
+		}
+	}
+}
+
 /*
  * Iterate all CPUs' descriptor in DT and compute the efficiency
  * (as per table_efficiency). Also calculate a middle efficiency
@@ -151,6 +197,8 @@ static void __init parse_dt_topology(void)
 		middle_capacity = ((max_capacity / 3)
 				>> (SCHED_POWER_SHIFT-1)) + 1;
 
+	/* Retrieve power topology information from DT */
+	init_dt_power_topology();
 }
 
 /*
@@ -266,6 +314,52 @@ void store_cpu_topology(unsigned int cpuid)
 		cpu_topology[cpuid].socket_id, mpidr);
 }
 
+#ifdef CONFIG_SCHED_SMT
+static const struct cpumask *cpu_smt_mask(int cpu)
+{
+	return topology_thread_cpumask(cpu);
+}
+#endif
+
+const struct cpumask *cpu_corepower_mask(int cpu)
+{
+	if (cpu_topology[cpu].flags & CPU_CORE_GATE)
+		return &cpu_topology[cpu].thread_sibling;
+	else
+		return &cpu_topology[cpu].core_sibling;
+}
+
+static const struct cpumask *cpu_cpupower_mask(int cpu)
+{
+	if (cpu_topology[cpu].flags & CPU_CLUSTER_GATE)
+		return &cpu_topology[cpu].core_sibling;
+	else
+		return cpumask_of_node(cpu_to_node(cpu));
+}
+
+static const struct cpumask *cpu_cpu_mask(int cpu)
+{
+	return cpumask_of_node(cpu_to_node(cpu));
+}
+
+static struct sched_domain_topology_level arm_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN },
+#endif
+#ifdef CONFIG_SCHED_MC
+	{ cpu_corepower_mask, SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN },
+	{ cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES},
+#endif
+	{ cpu_cpupower_mask, SD_SHARE_POWERDOMAIN },
+	{ cpu_cpu_mask, },
+	{ NULL, },
+};
+
+static int __init arm_sched_topology(void)
+{
+	sched_domain_topology = arm_topology;
+}
+
 /*
  * init_cpu_topology is called at boot when only one cpu is running
  * which prevent simultaneous write access to cpu_topology array
@@ -274,6 +368,9 @@ void __init init_cpu_topology(void)
 {
 	unsigned int cpu;
 
+	/* set scheduler topology descriptor */
+	arm_sched_topology();
+
 	/* init core mask and power*/
 	for_each_possible_cpu(cpu) {
 		struct cputopo_arm *cpu_topo = &(cpu_topology[cpu]);
@@ -283,7 +380,7 @@ void __init init_cpu_topology(void)
 		cpu_topo->socket_id = -1;
 		cpumask_clear(&cpu_topo->core_sibling);
 		cpumask_clear(&cpu_topo->thread_sibling);
-
+		cpu_topo->flags = 0;
 		set_power_scale(cpu, SCHED_POWER_SCALE);
 	}
 	smp_wmb();
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 075a325..8cbaebf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -772,6 +772,7 @@ enum cpu_idle_type {
 #define SD_BALANCE_WAKE		0x0010  /* Balance on wakeup */
 #define SD_WAKE_AFFINE		0x0020	/* Wake task to waking CPU */
 #define SD_SHARE_CPUPOWER	0x0080	/* Domain members share cpu power */
+#define SD_SHARE_POWERDOMAIN	0x0100	/* Domain members share power domain */
 #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
 #define SD_SERIALIZE		0x0400	/* Only a single load balancing instance */
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
@@ -893,6 +894,12 @@ typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
 
 #define SDTL_OVERLAP	0x01
 
+struct sd_data {
+	struct sched_domain **__percpu sd;
+	struct sched_group **__percpu sg;
+	struct sched_group_power **__percpu sgp;
+};
+
 struct sched_domain_topology_level {
 	sched_domain_mask_f mask;
 	int		    sd_flags;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 73658da..8dc2a50 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4680,7 +4680,8 @@ static int sd_degenerate(struct sched_domain *sd)
 			 SD_BALANCE_FORK |
 			 SD_BALANCE_EXEC |
 			 SD_SHARE_CPUPOWER |
-			 SD_SHARE_PKG_RESOURCES)) {
+			 SD_SHARE_PKG_RESOURCES |
+			 SD_SHARE_POWERDOMAIN)) {
 		if (sd->groups != sd->groups->next)
 			return 0;
 	}
@@ -4711,7 +4712,8 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
 				SD_BALANCE_EXEC |
 				SD_SHARE_CPUPOWER |
 				SD_SHARE_PKG_RESOURCES |
-				SD_PREFER_SIBLING);
+				SD_PREFER_SIBLING |
+				SD_SHARE_POWERDOMAIN);
 		if (nr_node_ids == 1)
 			pflags &= ~SD_SERIALIZE;
 	}
@@ -4978,12 +4980,6 @@ static const struct cpumask *cpu_cpu_mask(int cpu)
 	return cpumask_of_node(cpu_to_node(cpu));
 }
 
-struct sd_data {
-	struct sched_domain **__percpu sd;
-	struct sched_group **__percpu sg;
-	struct sched_group_power **__percpu sgp;
-};
-
 struct s_data {
 	struct sched_domain ** __percpu sd;
 	struct root_domain	*rd;
@@ -5345,7 +5341,8 @@ static struct cpumask ***sched_domains_numa_masks;
 	(SD_SHARE_CPUPOWER |		\
 	 SD_SHARE_PKG_RESOURCES |	\
 	 SD_NUMA |			\
-	 SD_ASYM_PACKING)
+	 SD_ASYM_PACKING |		\
+	 SD_SHARE_POWERDOMAIN)
 
 static struct sched_domain *
 sd_init(struct sched_domain_topology_level *tl, int cpu)
@@ -5464,7 +5461,7 @@ static struct sched_domain_topology_level default_topology[] = {
 	{ NULL, },
 };
 
-static struct sched_domain_topology_level *sched_domain_topology = default_topology;
+struct sched_domain_topology_level *sched_domain_topology = default_topology;
 
 #define for_each_sd_topology(tl)			\
 	for (tl = sched_domain_topology; tl->mask; tl++)
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2013-12-18 13:13         ` [RFC] sched: CPU topology try Vincent Guittot
@ 2013-12-23 17:22           ` Dietmar Eggemann
  2014-01-06 13:41             ` Vincent Guittot
  2014-01-06 16:28             ` Peter Zijlstra
  2014-01-01  5:00           ` Preeti U Murthy
                             ` (2 subsequent siblings)
  3 siblings, 2 replies; 101+ messages in thread
From: Dietmar Eggemann @ 2013-12-23 17:22 UTC (permalink / raw)
  To: Vincent Guittot, peterz, linux-kernel
  Cc: mingo, pjt, Morten Rasmussen, cmetcalf, tony.luck, alex.shi,
	preeti, linaro-kernel, rjw, paulmck, corbet, tglx, len.brown,
	arjan, amit.kucheria, james.hogan, schwidefsky, heiko.carstens

Hi Vincent,

On 18/12/13 14:13, Vincent Guittot wrote:
> This patch applies on top of the two patches [1][2] that have been proposed by
> Peter for creating a new way to initialize sched_domain. It includes some minor
> compilation fixes and a trial of using this new method on ARM platform.
> [1] https://lkml.org/lkml/2013/11/5/239
> [2] https://lkml.org/lkml/2013/11/5/449

I came up w/ a similar implementation proposal for an arch specific 
interface for scheduler domain set-up a couple of days ago:

[1] https://lkml.org/lkml/2013/12/13/182

I had the following requirements in mind:

1) The arch should not be able to fine tune individual scheduler 
behaviour, i.e. get rid of the arch specific SD_FOO_INIT macros.

2) Unify the set-up code for conventional and NUMA scheduler domains.

3) The arch is able to specify additional scheduler domain levels, other 
than SMT, MC, BOOK, and CPU.

4) Allow integrating the provision of additional topology-related data 
(e.g. energy information) into the scheduler.

Moreover, I think now that:

5) Something like the existing default set-up via default_topology[] is 
needed to avoid code duplication for archs not interested in (3) or (4).

I can see the following similarities w/ your implementation:

1) Move the cpu_foo_mask functions from scheduler to topology. I even 
put cpu_smt_mask() and cpu_cpu_mask() into include/linux/topology.h.

2) Use the existing func ptr sched_domain_mask_f to pass the per-cpu 
cpumask from the topology shim layer to the scheduler.

>
> Based on the results of this tests, my feeling about this new way to init the
> sched_domain is a bit mitigated.
>
> The good point is that I have been able to create the same sched_domain
> topologies than before and even more complex ones (where a subset of the cores
> in a cluster share their powergating capabilities). I have described various
> topology results below.
>
> I use a system that is made of a dual cluster of quad cores with hyperthreading
> for my examples.
>
> If one cluster (0-7) can powergate its cores independantly but not the other
> cluster (8-15) we have the following topology, which is equal to what I had
> previously:
>
> CPU0:
> domain 0: span 0-1 level: SMT
>      flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>      groups: 0 1
>    domain 1: span 0-7 level: MC
>        flags: SD_SHARE_PKG_RESOURCES
>        groups: 0-1 2-3 4-5 6-7
>      domain 2: span 0-15 level: CPU
>          flags:
>          groups: 0-7 8-15
>
> CPU8
> domain 0: span 8-9 level: SMT
>      flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>      groups: 8 9
>    domain 1: span 8-15 level: MC
>        flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>        groups: 8-9 10-11 12-13 14-15
>      domain 2: span 0-15 level CPU
>          flags:
>          groups: 8-15 0-7
>
> We can even describe some more complex topologies if a susbset (2-7) of the
> cluster can't powergate independatly:
>
> CPU0:
> domain 0: span 0-1 level: SMT
>      flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>      groups: 0 1
>    domain 1: span 0-7 level: MC
>        flags: SD_SHARE_PKG_RESOURCES
>        groups: 0-1 2-7
>      domain 2: span 0-15 level: CPU
>          flags:
>          groups: 0-7 8-15
>
> CPU2:
> domain 0: span 2-3 level: SMT
>      flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>      groups: 0 1
>    domain 1: span 2-7 level: MC
>        flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>        groups: 2-7 4-5 6-7
>      domain 2: span 0-7 level: MC
>          flags: SD_SHARE_PKG_RESOURCES
>          groups: 2-7 0-1
>        domain 3: span 0-15 level: CPU
>            flags:
>            groups: 0-7 8-15
>
> In this case, we have an aditionnal sched_domain MC level for this subset (2-7)
> of cores so we can trigger some load balance in this subset before doing that
> on the complete cluster (which is the last level of cache in my example)

I think the weakest point right now is the condition in sd_init() where 
we convert the topology flags into scheduler behaviour. We not only 
introduce a very tight coupling between topology flags and scheduler 
domain levels but also need to follow a certain order in the 
initialization. This bit needs more thinking.

>
> We can add more levels that will describe other dependency/independency like
> the frequency scaling dependency and as a result the final sched_domain
> topology will have additional levels (if they have not been removed during
> the degenerate sequence)
>
> My concern is about the configuration of the table that is used to create the
> sched_domain. Some levels are "duplicated" with different flags configuration
> which make the table not easily readable and we must also take care of the
> order  because parents have to gather all cpus of its childs. So we must
> choose which capabilities will be a subset of the other one. The order is
> almost straight forward when we describe 1 or 2 kind of capabilities
> (package ressource sharing and power sharing) but it can become complex if we
> want to add more.

I'm not sure if the idea to create a dedicated sched_domain level for 
every topology flag representing a specific functionality will scale. 
From the perspective of energy-aware scheduling we need e.g. energy 
costs (P and C states), which can only be populated towards the scheduler 
via an additional sub-struct and an additional function arch_sd_energy(), 
as depicted in Morten's email:

[2] lkml.org/lkml/2013/11/14/102

>
> Regards
> Vincent
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> ---
>   arch/arm/include/asm/topology.h |    4 ++
>   arch/arm/kernel/topology.c      |   99 ++++++++++++++++++++++++++++++++++++++-
>   include/linux/sched.h           |    7 +++
>   kernel/sched/core.c             |   17 +++----
>   4 files changed, 116 insertions(+), 11 deletions(-)
>
> diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
> index 58b8b84..5102847 100644
> --- a/arch/arm/include/asm/topology.h
> +++ b/arch/arm/include/asm/topology.h
> @@ -5,12 +5,16 @@
>
>   #include <linux/cpumask.h>
>
> +#define CPU_CORE_GATE          0x1
> +#define CPU_CLUSTER_GATE       0x2
> +
>   struct cputopo_arm {
>          int thread_id;
>          int core_id;
>          int socket_id;
>          cpumask_t thread_sibling;
>          cpumask_t core_sibling;
> +       int flags;
>   };
>
>   extern struct cputopo_arm cpu_topology[NR_CPUS];
> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
> index 85a8737..8a2aec6 100644
> --- a/arch/arm/kernel/topology.c
> +++ b/arch/arm/kernel/topology.c
> @@ -24,6 +24,7 @@
>
>   #include <asm/cputype.h>
>   #include <asm/topology.h>
> +#include <asm/smp_plat.h>
>
>   /*
>    * cpu power scale management
> @@ -79,6 +80,51 @@ unsigned long *__cpu_capacity;
>
>   unsigned long middle_capacity = 1;
>
> +static int __init get_dt_power_topology(struct device_node *topo)
> +{
> +       const u32 *reg;
> +       int len, power = 0;
> +       int flag = CPU_CORE_GATE;
> +
> +       for (; topo; topo = of_get_next_parent(topo)) {
> +               reg = of_get_property(topo, "power-gate", &len);
> +               if (reg && len == 4 && be32_to_cpup(reg))
> +                       power |= flag;
> +               flag <<= 1;
> +       }
> +
> +       return power;
> +}
> +
> +#define for_each_subnode_with_property(dn, pn, prop_name) \
> +       for (dn = of_find_node_with_property(pn, prop_name); dn; \
> +            dn = of_find_node_with_property(dn, prop_name))
> +
> +static void __init init_dt_power_topology(void)
> +{
> +       struct device_node *cn, *topo;
> +
> +       /* Get power domain topology information */
> +       cn = of_find_node_by_path("/cpus/cpu-map");
> +       if (!cn) {
> +               pr_warn("Missing cpu-map node, bailing out\n");
> +               return;
> +       }
> +
> +       for_each_subnode_with_property(topo, cn, "cpu") {
> +               struct device_node *cpu;
> +
> +               cpu = of_parse_phandle(topo, "cpu", 0);
> +               if (cpu) {
> +                       u32 hwid;
> +
> +                       of_property_read_u32(cpu, "reg", &hwid);
> +                       cpu_topology[get_logical_index(hwid)].flags = get_dt_power_topology(topo);
> +
> +               }
> +       }
> +}
> +
>   /*
>    * Iterate all CPUs' descriptor in DT and compute the efficiency
>    * (as per table_efficiency). Also calculate a middle efficiency
> @@ -151,6 +197,8 @@ static void __init parse_dt_topology(void)
>                  middle_capacity = ((max_capacity / 3)
>                                  >> (SCHED_POWER_SHIFT-1)) + 1;
>
> +       /* Retrieve power topology information from DT */
> +       init_dt_power_topology();
>   }
>
>   /*
> @@ -266,6 +314,52 @@ void store_cpu_topology(unsigned int cpuid)
>                  cpu_topology[cpuid].socket_id, mpidr);
>   }
>
> +#ifdef CONFIG_SCHED_SMT
> +static const struct cpumask *cpu_smt_mask(int cpu)
> +{
> +       return topology_thread_cpumask(cpu);
> +}
> +#endif
> +
> +const struct cpumask *cpu_corepower_mask(int cpu)
> +{
> +       if (cpu_topology[cpu].flags & CPU_CORE_GATE)
> +               return &cpu_topology[cpu].thread_sibling;
> +       else
> +               return &cpu_topology[cpu].core_sibling;
> +}
> +
> +static const struct cpumask *cpu_cpupower_mask(int cpu)
> +{
> +       if (cpu_topology[cpu].flags & CPU_CLUSTER_GATE)
> +               return &cpu_topology[cpu].core_sibling;
> +       else
> +               return cpumask_of_node(cpu_to_node(cpu));
> +}
> +
> +static const struct cpumask *cpu_cpu_mask(int cpu)
> +{
> +       return cpumask_of_node(cpu_to_node(cpu));
> +}
> +
> +static struct sched_domain_topology_level arm_topology[] = {
> +#ifdef CONFIG_SCHED_SMT
> +       { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN },
> +#endif
> +#ifdef CONFIG_SCHED_MC
> +       { cpu_corepower_mask, SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN },
> +       { cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES},
> +#endif
> +       { cpu_cpupower_mask, SD_SHARE_POWERDOMAIN },
> +       { cpu_cpu_mask, },
> +       { NULL, },
> +};
> +
> +static int __init arm_sched_topology(void)
> +{
> +       sched_domain_topology = arm_topology;

return missing

> +}
> +
>   /*
>    * init_cpu_topology is called at boot when only one cpu is running
>    * which prevent simultaneous write access to cpu_topology array
> @@ -274,6 +368,9 @@ void __init init_cpu_topology(void)
>   {
>          unsigned int cpu;
>
> +       /* set scheduler topology descriptor */
> +       arm_sched_topology();
> +
>          /* init core mask and power*/
>          for_each_possible_cpu(cpu) {
>                  struct cputopo_arm *cpu_topo = &(cpu_topology[cpu]);
> @@ -283,7 +380,7 @@ void __init init_cpu_topology(void)
>                  cpu_topo->socket_id = -1;
>                  cpumask_clear(&cpu_topo->core_sibling);
>                  cpumask_clear(&cpu_topo->thread_sibling);
> -
> +               cpu_topo->flags = 0;
>                  set_power_scale(cpu, SCHED_POWER_SCALE);
>          }
>          smp_wmb();
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 075a325..8cbaebf 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -772,6 +772,7 @@ enum cpu_idle_type {
>   #define SD_BALANCE_WAKE                0x0010  /* Balance on wakeup */
>   #define SD_WAKE_AFFINE         0x0020  /* Wake task to waking CPU */
>   #define SD_SHARE_CPUPOWER      0x0080  /* Domain members share cpu power */
> +#define SD_SHARE_POWERDOMAIN   0x0100  /* Domain members share power domain */
>   #define SD_SHARE_PKG_RESOURCES 0x0200  /* Domain members share cpu pkg resources */
>   #define SD_SERIALIZE           0x0400  /* Only a single load balancing instance */
>   #define SD_ASYM_PACKING                0x0800  /* Place busy groups earlier in the domain */
> @@ -893,6 +894,12 @@ typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
>
>   #define SDTL_OVERLAP   0x01
>
> +struct sd_data {
> +       struct sched_domain **__percpu sd;
> +       struct sched_group **__percpu sg;
> +       struct sched_group_power **__percpu sgp;
> +};
> +
>   struct sched_domain_topology_level {
>          sched_domain_mask_f mask;
>          int                 sd_flags;

By exporting struct sched_domain_topology_level and struct sd_data in 
include/linux/sched.h we're exposing a lot of internal scheduler data. 
That's why I came up w/ a new struct arch_sched_domain_info_t, which only 
contains the cpumask function pointer and the integer for the topology flags.
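
Roughly something like this (the exact layout is just a sketch of what I
described above):

typedef struct {
	sched_domain_mask_f mask;	/* per-level cpumask function */
	int sd_flags;			/* topology flags for this level */
} arch_sched_domain_info_t;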

Best Regards,

-- Dietmar

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 73658da..8dc2a50 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4680,7 +4680,8 @@ static int sd_degenerate(struct sched_domain *sd)
>                           SD_BALANCE_FORK |
>                           SD_BALANCE_EXEC |
>                           SD_SHARE_CPUPOWER |
> -                        SD_SHARE_PKG_RESOURCES)) {
> +                        SD_SHARE_PKG_RESOURCES |
> +                        SD_SHARE_POWERDOMAIN)) {
>                  if (sd->groups != sd->groups->next)
>                          return 0;
>          }
> @@ -4711,7 +4712,8 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
>                                  SD_BALANCE_EXEC |
>                                  SD_SHARE_CPUPOWER |
>                                  SD_SHARE_PKG_RESOURCES |
> -                               SD_PREFER_SIBLING);
> +                               SD_PREFER_SIBLING |
> +                               SD_SHARE_POWERDOMAIN);
>                  if (nr_node_ids == 1)
>                          pflags &= ~SD_SERIALIZE;
>          }
> @@ -4978,12 +4980,6 @@ static const struct cpumask *cpu_cpu_mask(int cpu)
>          return cpumask_of_node(cpu_to_node(cpu));
>   }
>
> -struct sd_data {
> -       struct sched_domain **__percpu sd;
> -       struct sched_group **__percpu sg;
> -       struct sched_group_power **__percpu sgp;
> -};
> -
>   struct s_data {
>          struct sched_domain ** __percpu sd;
>          struct root_domain      *rd;
> @@ -5345,7 +5341,8 @@ static struct cpumask ***sched_domains_numa_masks;
>          (SD_SHARE_CPUPOWER |            \
>           SD_SHARE_PKG_RESOURCES |       \
>           SD_NUMA |                      \
> -        SD_ASYM_PACKING)
> +        SD_ASYM_PACKING |              \
> +        SD_SHARE_POWERDOMAIN)
>
>   static struct sched_domain *
>   sd_init(struct sched_domain_topology_level *tl, int cpu)
> @@ -5464,7 +5461,7 @@ static struct sched_domain_topology_level default_topology[] = {
>          { NULL, },
>   };
>
> -static struct sched_domain_topology_level *sched_domain_topology = default_topology;
> +struct sched_domain_topology_level *sched_domain_topology = default_topology;
>
>   #define for_each_sd_topology(tl)                       \
>          for (tl = sched_domain_topology; tl->mask; tl++)
> --
> 1.7.9.5
>
>



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2013-12-18 13:13         ` [RFC] sched: CPU topology try Vincent Guittot
  2013-12-23 17:22           ` Dietmar Eggemann
@ 2014-01-01  5:00           ` Preeti U Murthy
  2014-01-06 16:33             ` Peter Zijlstra
  2014-01-07 12:40             ` Vincent Guittot
  2014-01-06 16:21           ` Peter Zijlstra
  2014-01-07  9:40           ` Preeti U Murthy
  3 siblings, 2 replies; 101+ messages in thread
From: Preeti U Murthy @ 2014-01-01  5:00 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, linux-kernel, mingo, pjt, Morten.Rasmussen, cmetcalf,
	tony.luck, alex.shi, linaro-kernel, rjw, paulmck, corbet, tglx,
	len.brown, arjan, amit.kucheria, james.hogan, schwidefsky,
	heiko.carstens, Dietmar.Eggemann

Hi Vincent,

On 12/18/2013 06:43 PM, Vincent Guittot wrote:
> This patch applies on top of the two patches [1][2] that have been proposed by
> Peter for creating a new way to initialize sched_domain. It includes some minor
> compilation fixes and a trial of using this new method on ARM platform.
> [1] https://lkml.org/lkml/2013/11/5/239
> [2] https://lkml.org/lkml/2013/11/5/449
> 
> Based on the results of this tests, my feeling about this new way to init the
> sched_domain is a bit mitigated.
> 
> The good point is that I have been able to create the same sched_domain
> topologies than before and even more complex ones (where a subset of the cores
> in a cluster share their powergating capabilities). I have described various
> topology results below.
> 
> I use a system that is made of a dual cluster of quad cores with hyperthreading
> for my examples.
> 
> If one cluster (0-7) can powergate its cores independantly but not the other
> cluster (8-15) we have the following topology, which is equal to what I had
> previously:
> 
> CPU0:
> domain 0: span 0-1 level: SMT
>     flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>     groups: 0 1
>   domain 1: span 0-7 level: MC
>       flags: SD_SHARE_PKG_RESOURCES
>       groups: 0-1 2-3 4-5 6-7
>     domain 2: span 0-15 level: CPU
>         flags:
>         groups: 0-7 8-15
> 
> CPU8
> domain 0: span 8-9 level: SMT
>     flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>     groups: 8 9
>   domain 1: span 8-15 level: MC
>       flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>       groups: 8-9 10-11 12-13 14-15
>     domain 2: span 0-15 level CPU
>         flags:
>         groups: 8-15 0-7
> 
> We can even describe some more complex topologies if a susbset (2-7) of the
> cluster can't powergate independatly:
> 
> CPU0:
> domain 0: span 0-1 level: SMT
>     flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>     groups: 0 1
>   domain 1: span 0-7 level: MC
>       flags: SD_SHARE_PKG_RESOURCES
>       groups: 0-1 2-7
>     domain 2: span 0-15 level: CPU
>         flags:
>         groups: 0-7 8-15
> 
> CPU2:
> domain 0: span 2-3 level: SMT
>     flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>     groups: 0 1
>   domain 1: span 2-7 level: MC
>       flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>       groups: 2-7 4-5 6-7
>     domain 2: span 0-7 level: MC
>         flags: SD_SHARE_PKG_RESOURCES
>         groups: 2-7 0-1
>       domain 3: span 0-15 level: CPU
>           flags:
>           groups: 0-7 8-15
> 
> In this case, we have an aditionnal sched_domain MC level for this subset (2-7)
> of cores so we can trigger some load balance in this subset before doing that
> on the complete cluster (which is the last level of cache in my example)
> 
> We can add more levels that will describe other dependency/independency like
> the frequency scaling dependency and as a result the final sched_domain
> topology will have additional levels (if they have not been removed during
> the degenerate sequence)

The design looks good to me. In my opinion, information like P-state and
C-state dependencies can be kept separate from the topology levels; it
might get too complicated unless the information is tightly coupled to
the topology.

> 
> My concern is about the configuration of the table that is used to create the
> sched_domain. Some levels are "duplicated" with different flags configuration

I do not feel this is a problem since the levels are not duplicated;
rather, they have different properties within them, which is best
represented by flags like the ones you have introduced in this patch.

> which make the table not easily readable and we must also take care of the
> order  because parents have to gather all cpus of its childs. So we must
> choose which capabilities will be a subset of the other one. The order is

The sched_domain levels which have SD_SHARE_POWERDOMAIN set are expected
to have CPUs which are a subset of the CPUs that the domain would have
included had this flag not been set. In addition, every higher domain,
irrespective of SD_SHARE_POWERDOMAIN being set, will include all
CPUs of the lower domains. As far as I can see, this patch does not change
these assumptions. Hence I am unable to imagine a scenario in which the
parent might not include all CPUs of its children domains. Do you have
such a scenario in mind which can arise due to this patch?

Thanks

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2013-12-23 17:22           ` Dietmar Eggemann
@ 2014-01-06 13:41             ` Vincent Guittot
  2014-01-06 16:31               ` Peter Zijlstra
  2014-01-06 16:28             ` Peter Zijlstra
  1 sibling, 1 reply; 101+ messages in thread
From: Vincent Guittot @ 2014-01-06 13:41 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: peterz, linux-kernel, mingo, pjt, Morten Rasmussen, cmetcalf,
	tony.luck, alex.shi, preeti, linaro-kernel, rjw, paulmck, corbet,
	tglx, len.brown, arjan, amit.kucheria, james.hogan, schwidefsky,
	heiko.carstens

On 23 December 2013 18:22, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> Hi Vincent,
>
>
> On 18/12/13 14:13, Vincent Guittot wrote:
>>
>> This patch applies on top of the two patches [1][2] that have been
>> proposed by
>> Peter for creating a new way to initialize sched_domain. It includes some
>> minor
>> compilation fixes and a trial of using this new method on ARM platform.
>> [1] https://lkml.org/lkml/2013/11/5/239
>> [2] https://lkml.org/lkml/2013/11/5/449
>
>
> I came up w/ a similar implementation proposal for an arch specific
> interface for scheduler domain set-up a couple of days ago:
>
> [1] https://lkml.org/lkml/2013/12/13/182
>
> I had the following requirements in mind:
>
> 1) The arch should not be able to fine tune individual scheduler behaviour,
> i.e. get rid of the arch specific SD_FOO_INIT macros.
>
> 2) Unify the set-up code for conventional and NUMA scheduler domains.
>
> 3) The arch is able to specify additional scheduler domain level, other than
> SMT, MC, BOOK, and CPU.
>
> 4) Allow to integrate the provision of additional topology related data
> (e.g. energy information) to the scheduler.
>
> Moreover, I think now that:
>
> 5) Something like the existing default set-up via default_topology[] is
> needed to avoid code duplication for archs not interested in (3) or (4).

Hi Dietmar,

I agree. This default array is available in Peter's patch, and my
patch overwrites the default array only if it wants to add more/new
levels.

[snip]

>>
>> CPU2:
>> domain 0: span 2-3 level: SMT
>>      flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES |
>> SD_SHARE_POWERDOMAIN
>>      groups: 0 1
>>    domain 1: span 2-7 level: MC
>>        flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>>        groups: 2-7 4-5 6-7
>>      domain 2: span 0-7 level: MC
>>          flags: SD_SHARE_PKG_RESOURCES
>>          groups: 2-7 0-1
>>        domain 3: span 0-15 level: CPU
>>            flags:
>>            groups: 0-7 8-15
>>
>> In this case, we have an aditionnal sched_domain MC level for this subset
>> (2-7)
>> of cores so we can trigger some load balance in this subset before doing
>> that
>> on the complete cluster (which is the last level of cache in my example)
>
>
> I think the weakest point right now is the condition in sd_init() where we
> convert the topology flags into scheduler behaviour. We not only introduce a
> very tight coupling between topology flags and scheduler domain level but
> also we need to follow a certain order in the initialization. This bit needs
> more thinking.

IMHO, these settings will disappear sooner or later; as an example, the
idle/busy _idx are going to be removed by Alex's patch.

>
>
>>
>> We can add more levels that will describe other dependency/independency
>> like
>> the frequency scaling dependency and as a result the final sched_domain
>> topology will have additional levels (if they have not been removed during
>> the degenerate sequence)
>>
>> My concern is about the configuration of the table that is used to create
>> the
>> sched_domain. Some levels are "duplicated" with different flags
>> configuration
>> which make the table not easily readable and we must also take care of the
>> order  because parents have to gather all cpus of its childs. So we must
>> choose which capabilities will be a subset of the other one. The order is
>> almost straight forward when we describe 1 or 2 kind of capabilities
>> (package ressource sharing and power sharing) but it can become complex if
>> we
>> want to add more.
>
>
> I'm not sure if the idea to create a dedicated sched_domain level for every
> topology flag representing a specific functionality will scale. From the

It's up to the arch to decide how many levels it wants to add, whether a
dedicated level is needed or whether it can gather several features/flags.
IMHO, having sub-structs for energy information, like what we have for
the cpu/group capacity, will not prevent us from having a first, quick
topology tree description.

> perspective of energy-aware scheduling we need e.g. energy costs (P and C
> state) which can only be populated towards the scheduler via an additional
> sub-struct and additional function arch_sd_energy() like depicted in
> Morten's email:
>
> [2] lkml.org/lkml/2013/11/14/102
>

[snip]

>> +
>> +static int __init arm_sched_topology(void)
>> +{
>> +       sched_domain_topology = arm_topology;
>
>
> return missing

good catch

Thanks

Vincent

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2013-12-18 13:13         ` [RFC] sched: CPU topology try Vincent Guittot
  2013-12-23 17:22           ` Dietmar Eggemann
  2014-01-01  5:00           ` Preeti U Murthy
@ 2014-01-06 16:21           ` Peter Zijlstra
  2014-01-07  8:22             ` Vincent Guittot
  2014-01-07  9:40           ` Preeti U Murthy
  3 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2014-01-06 16:21 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, mingo, pjt, Morten.Rasmussen, cmetcalf, tony.luck,
	alex.shi, preeti, linaro-kernel, rjw, paulmck, corbet, tglx,
	len.brown, arjan, amit.kucheria, james.hogan, schwidefsky,
	heiko.carstens, Dietmar.Eggemann

On Wed, Dec 18, 2013 at 02:13:51PM +0100, Vincent Guittot wrote:
> This patch applies on top of the two patches [1][2] that have been proposed by
> Peter for creating a new way to initialize sched_domain. It includes some minor
> compilation fixes and a trial of using this new method on ARM platform.
> [1] https://lkml.org/lkml/2013/11/5/239
> [2] https://lkml.org/lkml/2013/11/5/449
> 
> Based on the results of this tests, my feeling about this new way to init the
> sched_domain is a bit mitigated.

Yay :-)

> We can add more levels that will describe other dependency/independency like
> the frequency scaling dependency and as a result the final sched_domain
> topology will have additional levels (if they have not been removed during
> the degenerate sequence)

Yeah, this 'creative' use of degenerate domains is pretty neat ;-)

> My concern is about the configuration of the table that is used to create the
> sched_domain. Some levels are "duplicated" with different flags configuration
> which make the table not easily readable and we must also take care of the
> order  because parents have to gather all cpus of its childs. So we must
> choose which capabilities will be a subset of the other one. The order is
> almost straight forward when we describe 1 or 2 kind of capabilities
> (package ressource sharing and power sharing) but it can become complex if we
> want to add more.

I think I see what you're saying, although I hope that won't actually
happen in real hardware -- that said, people do tend to do crazy things
with these ARM chips :/

We should also try and be conservative in the topology flags we want to
add, which should further reduce the amount of pain here.

For now I do think this is a viable approach.. Yes, it's a bit cumbersome
for these asymmetric systems but it does give us enough to start
playing.

I have yet to read Morten's emails on the P and C states, will try to
have a look at those tomorrow with a hopefully fresher brain -- somehow
it's the end of the day already..

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2013-12-23 17:22           ` Dietmar Eggemann
  2014-01-06 13:41             ` Vincent Guittot
@ 2014-01-06 16:28             ` Peter Zijlstra
  2014-01-06 17:15               ` Morten Rasmussen
  1 sibling, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2014-01-06 16:28 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, linux-kernel, mingo, pjt, Morten Rasmussen,
	cmetcalf, tony.luck, alex.shi, preeti, linaro-kernel, rjw,
	paulmck, corbet, tglx, len.brown, arjan, amit.kucheria,
	james.hogan, schwidefsky, heiko.carstens

On Mon, Dec 23, 2013 at 06:22:17PM +0100, Dietmar Eggemann wrote:
> I'm not sure if the idea to create a dedicated sched_domain level for every
> topology flag representing a specific functionality will scale. From the
> perspective of energy-aware scheduling we need e.g. energy costs (P and C
> state) which can only be populated towards the scheduler via an additional
> sub-struct and additional function arch_sd_energy() like depicted in
> Morten's email:
> 
> [2] lkml.org/lkml/2013/11/14/102

That lkml.org link is actually not working for me (blank page -- maybe
lkml.org is on the blink again).

That said, I have yet to sit down and think about the P-state stuff, but
I was thinking we need some rudimentary domain support for that.

For instance, the big-little thingies seem to share their P state per
cluster, so we need a domain at that level to hang some state off of --
which we actually have in this case. But we need to ensure we do have
it -- somehow.


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-06 13:41             ` Vincent Guittot
@ 2014-01-06 16:31               ` Peter Zijlstra
  2014-01-07  8:32                 ` Vincent Guittot
  0 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2014-01-06 16:31 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Dietmar Eggemann, linux-kernel, mingo, pjt, Morten Rasmussen,
	cmetcalf, tony.luck, alex.shi, preeti, linaro-kernel, rjw,
	paulmck, corbet, tglx, len.brown, arjan, amit.kucheria,
	james.hogan, schwidefsky, heiko.carstens

On Mon, Jan 06, 2014 at 02:41:31PM +0100, Vincent Guittot wrote:
> IMHO, these settings will disappear sooner or later, as an example the
> idle/busy _idx are going to be removed by Alex's patch.

Well I'm still entirely unconvinced by them..

removing the cpu_load array makes sense, but I'm starting to doubt the
removal of the _idx things.. I think we want to retain them in some
form; it simply makes sense to look at longer-term averages when looking
at larger CPU groups.

So maybe we can express the things in log_2(group-span) or so, but we
need a working replacement for the cpu_load array. Ideally some
expression involving the blocked load.
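
Something like the below, purely as a back-of-the-envelope sketch (the
helpers exist, the expression itself is made up):

/* larger spans -> look at longer-term averages */
static int span_load_idx(struct sched_domain *sd)
{
	int idx = ilog2(cpumask_weight(sched_domain_span(sd)));

	return min(idx, CPU_LOAD_IDX_MAX - 1);
}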

It's another one of those things I need to ponder more :-)

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-01  5:00           ` Preeti U Murthy
@ 2014-01-06 16:33             ` Peter Zijlstra
  2014-01-06 16:37               ` Arjan van de Ven
  2014-01-07 12:40             ` Vincent Guittot
  1 sibling, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2014-01-06 16:33 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Vincent Guittot, linux-kernel, mingo, pjt, Morten.Rasmussen,
	cmetcalf, tony.luck, alex.shi, linaro-kernel, rjw, paulmck,
	corbet, tglx, len.brown, arjan, amit.kucheria, james.hogan,
	schwidefsky, heiko.carstens, Dietmar.Eggemann

On Wed, Jan 01, 2014 at 10:30:33AM +0530, Preeti U Murthy wrote:
> The design looks good to me. In my opinion information like P-states and
> C-states dependency can be kept separate from the topology levels, it
> might get too complicated unless the information is tightly coupled to
> the topology.

I'm not entirely convinced we can keep them separated; the moment we
have multiple CPUs sharing a P or C state, we need somewhere to manage
the shared state, and the domain tree seems like the most natural place
for this.

Now it might well be that both P and C states operate at 'natural' domains
which we already have, so it might be 'easy'.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-06 16:33             ` Peter Zijlstra
@ 2014-01-06 16:37               ` Arjan van de Ven
  2014-01-06 16:48                 ` Peter Zijlstra
  0 siblings, 1 reply; 101+ messages in thread
From: Arjan van de Ven @ 2014-01-06 16:37 UTC (permalink / raw)
  To: Peter Zijlstra, Preeti U Murthy
  Cc: Vincent Guittot, linux-kernel, mingo, pjt, Morten.Rasmussen,
	cmetcalf, tony.luck, alex.shi, linaro-kernel, rjw, paulmck,
	corbet, tglx, len.brown, amit.kucheria, james.hogan, schwidefsky,
	heiko.carstens, Dietmar.Eggemann

On 1/6/2014 8:33 AM, Peter Zijlstra wrote:
> On Wed, Jan 01, 2014 at 10:30:33AM +0530, Preeti U Murthy wrote:
>> The design looks good to me. In my opinion information like P-states and
>> C-states dependency can be kept separate from the topology levels, it
>> might get too complicated unless the information is tightly coupled to
>> the topology.
>
> I'm not entirely convinced we can keep them separated, the moment we
> have multiple CPUs sharing a P or C state we need somewhere to manage
> the shared state and the domain tree seems like the most natural place
> for this.
>
> Now it might well be both P and C states operate at 'natural' domains
> which we already have so it might be 'easy'.

more than that though.. P and C state sharing is mostly hidden from the OS
(because the OS does not have the ability to do this; e.g. there are things
that do "if THIS cpu goes idle, the OTHER cpu P state changes automatically").

that's not just on x86; the ARM guys (IIRC at least the latest Snapdragon) are going in that
direction as well.....

for those systems, the OS really should just make local decisions and let the hardware
cope with hardware grouping.
>


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-06 16:37               ` Arjan van de Ven
@ 2014-01-06 16:48                 ` Peter Zijlstra
  2014-01-06 16:54                   ` Peter Zijlstra
  0 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2014-01-06 16:48 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Preeti U Murthy, Vincent Guittot, linux-kernel, mingo, pjt,
	Morten.Rasmussen, cmetcalf, tony.luck, alex.shi, linaro-kernel,
	rjw, paulmck, corbet, tglx, len.brown, amit.kucheria,
	james.hogan, schwidefsky, heiko.carstens, Dietmar.Eggemann

On Mon, Jan 06, 2014 at 08:37:13AM -0800, Arjan van de Ven wrote:
> On 1/6/2014 8:33 AM, Peter Zijlstra wrote:
> >On Wed, Jan 01, 2014 at 10:30:33AM +0530, Preeti U Murthy wrote:
> >>The design looks good to me. In my opinion information like P-states and
> >>C-states dependency can be kept separate from the topology levels, it
> >>might get too complicated unless the information is tightly coupled to
> >>the topology.
> >
> >I'm not entirely convinced we can keep them separated, the moment we
> >have multiple CPUs sharing a P or C state we need somewhere to manage
> >the shared state and the domain tree seems like the most natural place
> >for this.
> >
> >Now it might well be both P and C states operate at 'natural' domains
> >which we already have so it might be 'easy'.
> 
> more than that though.. P and C state sharing is mostly hidden from the OS
> (because the OS does not have the ability to do this; e.g. there are things
> that do "if THIS cpu goes idle, the OTHER cpu P state changes automatic".
> 
> that's not just on x86, the ARM guys (iirc at least the latest snapdragon)  are going in that
> direction as well.....
> 
> for those systems, the OS really should just make local decisions and let the hardware
> cope with hardware grouping.

AFAICT this is a chicken-and-egg problem: the OS never did anything useful
with it, so the hardware guys are now trying to do something with it, but
this also means that if we cannot predict what the hardware will do
under certain circumstances, the OS really cannot do anything smart
anymore.

So yes, for certain hardware we'll just have to give up and not do
anything.

That said, some hardware still does allow us to do something and for
those we do need some of this.

Maybe if the OS becomes smart enough the hardware guys will give us some
control again, who knows.

So yes, I'm entirely fine saying that some chips are fucked and we can't
do anything sane with them.. Fine they get to sort things themselves.



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-06 16:48                 ` Peter Zijlstra
@ 2014-01-06 16:54                   ` Peter Zijlstra
  2014-01-06 17:13                     ` Arjan van de Ven
  0 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2014-01-06 16:54 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Preeti U Murthy, Vincent Guittot, linux-kernel, mingo, pjt,
	Morten.Rasmussen, cmetcalf, tony.luck, alex.shi, linaro-kernel,
	rjw, paulmck, corbet, tglx, len.brown, amit.kucheria,
	james.hogan, schwidefsky, heiko.carstens, Dietmar.Eggemann

On Mon, Jan 06, 2014 at 05:48:38PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 06, 2014 at 08:37:13AM -0800, Arjan van de Ven wrote:
> > On 1/6/2014 8:33 AM, Peter Zijlstra wrote:
> > >On Wed, Jan 01, 2014 at 10:30:33AM +0530, Preeti U Murthy wrote:
> > >>The design looks good to me. In my opinion information like P-states and
> > >>C-states dependency can be kept separate from the topology levels, it
> > >>might get too complicated unless the information is tightly coupled to
> > >>the topology.
> > >
> > >I'm not entirely convinced we can keep them separated, the moment we
> > >have multiple CPUs sharing a P or C state we need somewhere to manage
> > >the shared state and the domain tree seems like the most natural place
> > >for this.
> > >
> > >Now it might well be both P and C states operate at 'natural' domains
> > >which we already have so it might be 'easy'.
> > 
> > more than that though.. P and C state sharing is mostly hidden from the OS
> > (because the OS does not have the ability to do this; e.g. there are things
> > that do "if THIS cpu goes idle, the OTHER cpu P state changes automatic".
> > 
> > that's not just on x86, the ARM guys (iirc at least the latest snapdragon)  are going in that
> > direction as well.....
> > 
> > for those systems, the OS really should just make local decisions and let the hardware
> > cope with hardware grouping.
> 
> AFAICT this is a chicken-egg problem, the OS never did anything useful
> with it so the hardware guys are now trying to do something with it, but
> this also means that if we cannot predict what the hardware will do
> under certain circumstances the OS really cannot do anything smart
> anymore.
> 
> So yes, for certain hardware we'll just have to give up and not do
> anything.
> 
> That said, some hardware still does allow us to do something and for
> those we do need some of this.
> 
> Maybe if the OS becomes smart enough the hardware guys will give us some
> control again, who knows.
> 
> So yes, I'm entirely fine saying that some chips are fucked and we can't
> do anything sane with them.. Fine they get to sort things themselves.

That is, you're entirely unhelpful and I'm tempted to stop listening
to whatever you have to say on the subject.

Most of your emails are about how stuff cannot possibly work, without
saying how things can work.

The entire point of adding P and C state information to the scheduler is
so that we CAN make cross-CPU decisions, but if you're saying we shouldn't
attempt that because you can't say how the hardware will react anyway, fine,
we'll ignore Intel hardware from now on.

So bloody stop saying what cannot work and start telling us how we can make
useful cross-CPU decisions.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-06 16:54                   ` Peter Zijlstra
@ 2014-01-06 17:13                     ` Arjan van de Ven
  0 siblings, 0 replies; 101+ messages in thread
From: Arjan van de Ven @ 2014-01-06 17:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Preeti U Murthy, Vincent Guittot, linux-kernel, mingo, pjt,
	Morten.Rasmussen, cmetcalf, tony.luck, alex.shi, linaro-kernel,
	rjw, paulmck, corbet, tglx, len.brown, amit.kucheria,
	james.hogan, schwidefsky, heiko.carstens, Dietmar.Eggemann


>> AFAICT this is a chicken-egg problem, the OS never did anything useful
>> with it so the hardware guys are now trying to do something with it, but
>> this also means that if we cannot predict what the hardware will do
>> under certain circumstances the OS really cannot do anything smart
>> anymore.
>>
>> So yes, for certain hardware we'll just have to give up and not do
>> anything.
>>
>> That said, some hardware still does allow us to do something and for
>> those we do need some of this.
>>
>> Maybe if the OS becomes smart enough the hardware guys will give us some
>> control again, who knows.
>>
>> So yes, I'm entirely fine saying that some chips are fucked and we can't
>> do anything sane with them.. Fine they get to sort things themselves.
>
> That is; you're entirely unhelpful and I'm tempting to stop listening
> to whatever you have to say on the subject.
>
> Most of your emails are about how stuff cannot possibly work; without
> saying how things can work.
>
> The entire point of adding P and C state information to the scheduler is
> so that we CAN do cross cpu decisions, but if you're saying we shouldn't
> attempt because you can't say how the hardware will react anyway; fine
> we'll ignore Intel hardware from now on.

that's not what I'm trying to say.

if we as the OS want to help make such decisions, we also need to face the reality of what
that means, and see how we can get there.

let me give a simple but common example case: a 2-core system where the cores share a P state.
one task (A) is high priority/high utilization/whatever
	(e.g. causes the OS to ask for high performance from the CPU if by itself)
the other task (B), on the 2nd core, is not that high priority/utilization/etc
	(e.g. would cause the OS to ask for max power savings from the CPU if by itself)


time	core 0			core 1				what the combined probably should be
0	task A			idle				max performance
1	task A			task B				max performance
2	idle (disk IO)		task B				least power
3	task A			task B				max performance

e.g. a simple case of task A running, and task B coming in... but then task A blocks briefly,
on say disk IO or some mutex or whatever.

we as the OS will need to figure out how to get to the combined result in a way that's relatively race free,
with two common races to take care of:
  * knowing whether another core is idle at any given time is inherently racy... it may wake up or go idle the next cycle
  * in hardware modes where the OS controls everything, the P state registers tend to work in a "the last one to write on any
    core controls them all" fashion; we need to make sure we don't fight ourselves here, and assign one core to do
    this decision/communication to hardware on behalf of the whole domain (even if that assignment may move around
    when the assigned core goes idle) rather than have the various cores doing it themselves asynchronously.
    This tends to be harder than it seems if you also don't want to lose efficiency (e.g. no significant extra
    wakeups from idle and also no missed opportunities to go to "least power" in the "time 2" scenario above)


x86 and modern ARM (snapdragon at least) do this kind of coordination in hardware/a microcontroller (with an opt-in
for the OS to do it itself on x86 and likely snapdragon), which means the race conditions are not really there.
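
To make the "assign one core to talk to the hardware on behalf of the domain" idea concrete, here is a
minimal user-space C sketch (illustration only, not from any posted patch; pstate_domain, domain_commit and
friends are made-up names): a single owner commits the maximum request across the domain, so the cores never
fight over the shared register.

#include <stdatomic.h>
#include <stdio.h>

#define DOMAIN_CPUS 2

/* Models one shared-P-state domain; all names are hypothetical. */
struct pstate_domain {
	atomic_int request[DOMAIN_CPUS];  /* per-cpu desired performance level */
	atomic_int owner;                 /* cpu assigned to write the shared register */
	atomic_int hw_pstate;             /* models the shared hardware register */
};

/* Only the owner calls this, so the "last writer wins" register never sees
 * two cores fighting; it simply commits the maximum request in the domain. */
static void domain_commit(struct pstate_domain *d)
{
	int max = 0;

	for (int cpu = 0; cpu < DOMAIN_CPUS; cpu++) {
		int r = atomic_load(&d->request[cpu]);
		if (r > max)
			max = r;
	}
	atomic_store(&d->hw_pstate, max);
}

/* Called by @cpu when its local request changes (req == 0 means idle).
 * A real implementation would also notify the owner when a non-owner
 * raises its request; that part is elided for brevity. */
static void pstate_domain_request(struct pstate_domain *d, int cpu, int req)
{
	atomic_store(&d->request[cpu], req);

	/* If the current owner has gone idle, hand ownership to this cpu. */
	int owner = atomic_load(&d->owner);
	if (atomic_load(&d->request[owner]) == 0)
		atomic_compare_exchange_strong(&d->owner, &owner, cpu);

	if (atomic_load(&d->owner) == cpu)
		domain_commit(d);
}

int main(void)
{
	struct pstate_domain d = { .owner = 0 };

	pstate_domain_request(&d, 0, 10);  /* time 0: task A -> max performance */
	pstate_domain_request(&d, 1, 2);   /* time 1: task B arrives on core 1  */
	pstate_domain_request(&d, 0, 0);   /* time 2: A blocks on disk IO       */
	printf("committed P state: %d\n", atomic_load(&d.hw_pstate));  /* -> 2 */
	return 0;
}

So in the "time 2" row of the table the domain drops to the lowest state that still satisfies core 1,
without an extra wakeup of core 1 and without the two cores racing each other on the register.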



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-06 16:28             ` Peter Zijlstra
@ 2014-01-06 17:15               ` Morten Rasmussen
  2014-01-07  9:57                 ` Peter Zijlstra
  0 siblings, 1 reply; 101+ messages in thread
From: Morten Rasmussen @ 2014-01-06 17:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, Vincent Guittot, linux-kernel, mingo, pjt,
	cmetcalf, tony.luck, alex.shi, preeti, linaro-kernel, rjw,
	paulmck, corbet, tglx, len.brown, arjan, amit.kucheria,
	james.hogan, schwidefsky, heiko.carstens

On Mon, Jan 06, 2014 at 04:28:13PM +0000, Peter Zijlstra wrote:
> On Mon, Dec 23, 2013 at 06:22:17PM +0100, Dietmar Eggemann wrote:
> > I'm not sure if the idea to create a dedicated sched_domain level for every
> > topology flag representing a specific functionality will scale. From the
> > perspective of energy-aware scheduling we need e.g. energy costs (P and C
> > state) which can only be populated towards the scheduler via an additional
> > sub-struct and additional function arch_sd_energy() like depicted in
> > Morten's email:
> > 
> > [2] lkml.org/lkml/2013/11/14/102
> 
> That lkml.org link is actually not working for me (blank page -- maybe
> lkml.org is on the blink again).
> 
> That said, I yet have to sit down and think about the P state stuff, but
> I was thinking we need some rudimentary domain support for that.
> 
> For instance, the big-little thingies seem share their P state per
> cluster, so we need a domain at that level to hang some state off of --
> which we actually have in this case. But we need to ensure we do have
> it -- somehow.

Are there any examples of frequency domains not matching the span of a
sched_domain?

I would have thought that we would have a matching sched_domain to hang
the P and C state information from for most systems. If not, we could
just add it.

I don't think it is safe to assume that big-little always has cluster
P-states. It is implementation dependent. But the most obvious
alternative would be to have per-cpu P-states in which case we would
also have a matching sched_domain.
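
As a purely illustrative sketch of what "hanging the P and C state information off a matching
sched_domain" could look like (none of these names exist; they are assumptions for the example, loosely in
the spirit of the arch_sd_energy() idea quoted above):

/* Hypothetical per-domain data, attached to the sched_domain whose span
 * matches the frequency/idle domain; every name here is made up. */
struct capacity_state {
	unsigned int freq_khz;           /* operating point */
	unsigned int power_mw;           /* power drawn at that point */
};

struct sd_energy {
	unsigned int shared_pstate;      /* all CPUs in the span share one P state? */
	unsigned int nr_cap_states;
	const struct capacity_state *cap_states;
	unsigned int wakeup_latency_us;  /* worst-case exit latency for the span */
};

/* The arch (or an arch_sd_energy()-style hook) would provide one per level: */
static const struct capacity_state cluster_cap_states[] = {
	{ .freq_khz =  500000, .power_mw = 150 },
	{ .freq_khz = 1000000, .power_mw = 450 },
};

static const struct sd_energy cluster_energy = {
	.shared_pstate     = 1,
	.nr_cap_states     = 2,
	.cap_states        = cluster_cap_states,
	.wakeup_latency_us = 1500,
};

If per-cpu P states turn out to be the implementation choice, the same kind of blob would simply hang off a
lower (per-cpu-span) level instead.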

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-06 16:21           ` Peter Zijlstra
@ 2014-01-07  8:22             ` Vincent Guittot
  0 siblings, 0 replies; 101+ messages in thread
From: Vincent Guittot @ 2014-01-07  8:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Paul Turner, Morten Rasmussen,
	cmetcalf, tony.luck, Alex Shi, Preeti U Murthy, linaro-kernel,
	Rafael J. Wysocki, Paul McKenney, Jon Corbet, Thomas Gleixner,
	Len Brown, Arjan van de Ven, Amit Kucheria, james.hogan,
	schwidefsky, heiko.carstens, Dietmar Eggemann

On 6 January 2014 17:21, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Dec 18, 2013 at 02:13:51PM +0100, Vincent Guittot wrote:
>> This patch applies on top of the two patches [1][2] that have been proposed by
>> Peter for creating a new way to initialize sched_domain. It includes some minor
>> compilation fixes and a trial of using this new method on ARM platform.
>> [1] https://lkml.org/lkml/2013/11/5/239
>> [2] https://lkml.org/lkml/2013/11/5/449
>>
>> Based on the results of this tests, my feeling about this new way to init the
>> sched_domain is a bit mitigated.
>
> Yay :-)
>
>> We can add more levels that will describe other dependency/independency like
>> the frequency scaling dependency and as a result the final sched_domain
>> topology will have additional levels (if they have not been removed during
>> the degenerate sequence)
>
> Yeah, this 'creative' use of degenerate domains is pretty neat ;-)

thanks :-)

>
>> My concern is about the configuration of the table that is used to create the
>> sched_domain. Some levels are "duplicated" with different flags configuration
>> which make the table not easily readable and we must also take care of the
>> order  because parents have to gather all cpus of its childs. So we must
>> choose which capabilities will be a subset of the other one. The order is
>> almost straight forward when we describe 1 or 2 kind of capabilities
>> (package ressource sharing and power sharing) but it can become complex if we
>> want to add more.
>
> I think I see what you're saying, although I hope that won't actually
> happen in real hardware -- that said, people do tend to do crazy with
> these ARM chips :/

it should be ok for ARM chips because the cores in a cluster share the
same clock, but that doesn't mean it will not be possible in the near
future or on other archs.

>
> We should also try and be conservative in the topology flags we want to
> add, which should further reduce the amount of pain here.

yes, i see an interest in powerdomain sharing and clock sharing flags,
so it should minimize the complexity

>
> For now I do think this is a viable approach.. Yes its a bit cumbersome
> for these asymmetric systems but it does give us enough to start
> playing.

ok

Vincent
>
> I yet have to read Morton's emails on the P and C states, will try to
> have a look at those tomorrow with a hopefully fresher brain -- somehow
> its the end of the day already..

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-06 16:31               ` Peter Zijlstra
@ 2014-01-07  8:32                 ` Vincent Guittot
  2014-01-07 13:22                   ` Peter Zijlstra
  0 siblings, 1 reply; 101+ messages in thread
From: Vincent Guittot @ 2014-01-07  8:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, linux-kernel, mingo, pjt, Morten Rasmussen,
	cmetcalf, tony.luck, alex.shi, preeti, linaro-kernel, paulmck,
	corbet, tglx, len.brown, arjan, amit.kucheria, james.hogan,
	schwidefsky, heiko.carstens

On 6 January 2014 17:31, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Jan 06, 2014 at 02:41:31PM +0100, Vincent Guittot wrote:
>> IMHO, these settings will disappear sooner or later, as an example the
>> idle/busy _idx are going to be removed by Alex's patch.
>
> Well I'm still entirely unconvinced by them..
>
> removing the cpu_load array makes sense, but I'm starting to doubt the
> removal of the _idx things.. I think we want to retain them in some
> form, it simply makes sense to look at longer term averages when looking
> at larger CPU groups.
>
> So maybe we can express the things in log_2(group-span) or so, but we
> need a working replacement for the cpu_load array. Ideally some
> expression involving the blocked load.

Using the blocked load can surely give a benefit in load balancing
because it gives a view of the potential load on a core, but it still decays
at the same speed as the runnable load average, so it doesn't solve the
issue for a longer term average. One way is to have a runnable average
load with a longer time window.

>
> Its another one of those things I need to ponder more :-)

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2013-12-18 13:13         ` [RFC] sched: CPU topology try Vincent Guittot
                             ` (2 preceding siblings ...)
  2014-01-06 16:21           ` Peter Zijlstra
@ 2014-01-07  9:40           ` Preeti U Murthy
  2014-01-07  9:50             ` Peter Zijlstra
  3 siblings, 1 reply; 101+ messages in thread
From: Preeti U Murthy @ 2014-01-07  9:40 UTC (permalink / raw)
  To: Vincent Guittot, peterz
  Cc: linux-kernel, mingo, pjt, Morten.Rasmussen, cmetcalf, tony.luck,
	alex.shi, linaro-kernel, rjw, paulmck, corbet, tglx, len.brown,
	arjan, amit.kucheria, james.hogan, schwidefsky, heiko.carstens,
	Dietmar.Eggemann

Hi Vincent, Peter,

On 12/18/2013 06:43 PM, Vincent Guittot wrote:
> This patch applies on top of the two patches [1][2] that have been proposed by
> Peter for creating a new way to initialize sched_domain. It includes some minor
> compilation fixes and a trial of using this new method on ARM platform.
> [1] https://lkml.org/lkml/2013/11/5/239
> [2] https://lkml.org/lkml/2013/11/5/449
> 
> Based on the results of this tests, my feeling about this new way to init the
> sched_domain is a bit mitigated.
> 
> The good point is that I have been able to create the same sched_domain
> topologies than before and even more complex ones (where a subset of the cores
> in a cluster share their powergating capabilities). I have described various
> topology results below.
> 
> I use a system that is made of a dual cluster of quad cores with hyperthreading
> for my examples.
> 
> If one cluster (0-7) can powergate its cores independantly but not the other
> cluster (8-15) we have the following topology, which is equal to what I had
> previously:
> 
> CPU0:
> domain 0: span 0-1 level: SMT
>     flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>     groups: 0 1
>   domain 1: span 0-7 level: MC
>       flags: SD_SHARE_PKG_RESOURCES
>       groups: 0-1 2-3 4-5 6-7
>     domain 2: span 0-15 level: CPU
>         flags:
>         groups: 0-7 8-15
> 
> CPU8
> domain 0: span 8-9 level: SMT
>     flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>     groups: 8 9
>   domain 1: span 8-15 level: MC
>       flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>       groups: 8-9 10-11 12-13 14-15
>     domain 2: span 0-15 level CPU
>         flags:
>         groups: 8-15 0-7
> 
> We can even describe some more complex topologies if a susbset (2-7) of the
> cluster can't powergate independatly:
> 
> CPU0:
> domain 0: span 0-1 level: SMT
>     flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>     groups: 0 1
>   domain 1: span 0-7 level: MC
>       flags: SD_SHARE_PKG_RESOURCES
>       groups: 0-1 2-7
>     domain 2: span 0-15 level: CPU
>         flags:
>         groups: 0-7 8-15
> 
> CPU2:
> domain 0: span 2-3 level: SMT
>     flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>     groups: 0 1
>   domain 1: span 2-7 level: MC
>       flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>       groups: 2-7 4-5 6-7
>     domain 2: span 0-7 level: MC
>         flags: SD_SHARE_PKG_RESOURCES
>         groups: 2-7 0-1
>       domain 3: span 0-15 level: CPU
>           flags:
>           groups: 0-7 8-15
> 
> In this case, we have an aditionnal sched_domain MC level for this subset (2-7)
> of cores so we can trigger some load balance in this subset before doing that
> on the complete cluster (which is the last level of cache in my example)
> 
> We can add more levels that will describe other dependency/independency like
> the frequency scaling dependency and as a result the final sched_domain
> topology will have additional levels (if they have not been removed during
> the degenerate sequence)
> 
> My concern is about the configuration of the table that is used to create the
> sched_domain. Some levels are "duplicated" with different flags configuration
> which make the table not easily readable and we must also take care of the
> order  because parents have to gather all cpus of its childs. So we must
> choose which capabilities will be a subset of the other one. The order is
> almost straight forward when we describe 1 or 2 kind of capabilities
> (package ressource sharing and power sharing) but it can become complex if we
> want to add more.

What if we want to add arch-specific flags to the NUMA domain? Currently,
with Peter's patch https://lkml.org/lkml/2013/11/5/239 and this patch,
the arch can modify the sd flags of the topology levels up to just before
the NUMA domain. In sd_init_numa(), the flags for the NUMA domain get
initialized. Perhaps we need to call into the arch here to probe for
additional flags?
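
Just to make the question concrete, one possible shape of such a probe (a sketch only; arch_sd_numa_flags()
is an invented name and this is not part of any posted patch) could be:

/* Sketch only: a weak hook the arch could override, OR-ed into the flags
 * that sd_init_numa() computes for each NUMA level. */
int __weak arch_sd_numa_flags(int numa_level)
{
	return 0;	/* default: no extra flags, current behaviour unchanged */
}

/*
 * ... and inside sd_init_numa(), roughly:
 *
 *	sd->flags |= arch_sd_numa_flags(level);
 *
 * so that e.g. a core-gated machine could return SD_SHARE_POWERDOMAIN for
 * its NUMA level(s) and let the load be spread rather than consolidated.
 */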

Thanks

Regards
Preeti U Murthy
> 
> Regards
> Vincent
> 


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-07  9:40           ` Preeti U Murthy
@ 2014-01-07  9:50             ` Peter Zijlstra
  2014-01-07 10:39               ` Preeti U Murthy
  0 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2014-01-07  9:50 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Vincent Guittot, linux-kernel, mingo, pjt, Morten.Rasmussen,
	cmetcalf, tony.luck, alex.shi, linaro-kernel, rjw, paulmck,
	corbet, tglx, len.brown, arjan, amit.kucheria, james.hogan,
	schwidefsky, heiko.carstens, Dietmar.Eggemann

On Tue, Jan 07, 2014 at 03:10:21PM +0530, Preeti U Murthy wrote:
> What if we want to add arch specific flags to the NUMA domain? Currently
> with Peter's patch:https://lkml.org/lkml/2013/11/5/239 and this patch,
> the arch can modify the sd flags of the topology levels till just before
> the NUMA domain. In sd_init_numa(), the flags for the NUMA domain get
> initialized. We need to perhaps call into arch here to probe for
> additional flags?

What are you thinking of? I was hoping all NUMA details were captured in
the distance table.

It's far easier to talk of specifics in this case.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-06 17:15               ` Morten Rasmussen
@ 2014-01-07  9:57                 ` Peter Zijlstra
  0 siblings, 0 replies; 101+ messages in thread
From: Peter Zijlstra @ 2014-01-07  9:57 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Dietmar Eggemann, Vincent Guittot, linux-kernel, mingo, pjt,
	cmetcalf, tony.luck, alex.shi, preeti, linaro-kernel, rjw,
	paulmck, corbet, tglx, len.brown, arjan, amit.kucheria,
	james.hogan, schwidefsky, heiko.carstens

On Mon, Jan 06, 2014 at 05:15:30PM +0000, Morten Rasmussen wrote:
> Is there any examples of frequency domains not matching the span of a
> sched_domain?

nafaik, but I don't really know much about this anyway.

> I would have thought that we would have a matching sched_domain to hang
> the P and C state information from for most systems. If not, we could
> just add it.

This was my thought too.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-07  9:50             ` Peter Zijlstra
@ 2014-01-07 10:39               ` Preeti U Murthy
  2014-01-07 11:13                 ` Peter Zijlstra
                                   ` (2 more replies)
  0 siblings, 3 replies; 101+ messages in thread
From: Preeti U Murthy @ 2014-01-07 10:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, linux-kernel, mingo, pjt, Morten.Rasmussen,
	cmetcalf, tony.luck, alex.shi, linaro-kernel, rjw, paulmck,
	corbet, tglx, len.brown, arjan, amit.kucheria, james.hogan,
	schwidefsky, heiko.carstens, Dietmar.Eggemann

On 01/07/2014 03:20 PM, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 03:10:21PM +0530, Preeti U Murthy wrote:
>> What if we want to add arch specific flags to the NUMA domain? Currently
>> with Peter's patch:https://lkml.org/lkml/2013/11/5/239 and this patch,
>> the arch can modify the sd flags of the topology levels till just before
>> the NUMA domain. In sd_init_numa(), the flags for the NUMA domain get
>> initialized. We need to perhaps call into arch here to probe for
>> additional flags?
> 
> What are you thinking of? I was hoping all NUMA details were captured in
> the distance table.
> 
> Its far easier to talk of specifics in this case.
> 
If the processor can be core gated, then there is very little power
saving to be gained from consolidating all the load onto a
single node in a NUMA domain. Whether it is 6 cores on one node or 3 cores
each on two nodes, the power is drawn by 6 cores in all. So I was thinking
that under this circumstance we might want to set the SD_SHARE_POWERDOMAIN
flag at the NUMA domain and spread the load if it favours the workload.

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-07 10:39               ` Preeti U Murthy
@ 2014-01-07 11:13                 ` Peter Zijlstra
  2014-01-07 16:31                   ` Preeti U Murthy
  2014-01-07 11:20                 ` Morten Rasmussen
  2014-01-07 12:31                 ` Vincent Guittot
  2 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2014-01-07 11:13 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Vincent Guittot, linux-kernel, mingo, pjt, Morten.Rasmussen,
	cmetcalf, tony.luck, alex.shi, linaro-kernel, rjw, paulmck,
	corbet, tglx, len.brown, arjan, amit.kucheria, james.hogan,
	schwidefsky, heiko.carstens, Dietmar.Eggemann

On Tue, Jan 07, 2014 at 04:09:39PM +0530, Preeti U Murthy wrote:
> On 01/07/2014 03:20 PM, Peter Zijlstra wrote:
> > On Tue, Jan 07, 2014 at 03:10:21PM +0530, Preeti U Murthy wrote:
> >> What if we want to add arch specific flags to the NUMA domain? Currently
> >> with Peter's patch:https://lkml.org/lkml/2013/11/5/239 and this patch,
> >> the arch can modify the sd flags of the topology levels till just before
> >> the NUMA domain. In sd_init_numa(), the flags for the NUMA domain get
> >> initialized. We need to perhaps call into arch here to probe for
> >> additional flags?
> > 
> > What are you thinking of? I was hoping all NUMA details were captured in
> > the distance table.
> > 
> > Its far easier to talk of specifics in this case.
> > 
> If the processor can be core gated, then there is very little power
> savings that we could yield from consolidating all the load onto a
> single node in a NUMA domain. 6 cores on one node or 3 cores each on two
> nodes, the power is drawn by 6 cores in all. So I was thinking under
> this circumstance we might want to set the SD_SHARE_POWERDOMAIN flag at
> the NUMA domain and spread the load if it favours the workload.

So Intel has so far not said a lot of sensible things about power
management on their multi-socket platform.

And I've not heard anything at all from IBM on the POWER chips.

What I know from the Intel side is that package idle hardly saves
anything when compared to the DRAM power and the cost of having to do
remote memory accesses.

In other words, I'm not at all considering power aware scheduling for
NUMA systems until someone starts talking sense :-)

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-07 10:39               ` Preeti U Murthy
  2014-01-07 11:13                 ` Peter Zijlstra
@ 2014-01-07 11:20                 ` Morten Rasmussen
  2014-01-07 12:31                 ` Vincent Guittot
  2 siblings, 0 replies; 101+ messages in thread
From: Morten Rasmussen @ 2014-01-07 11:20 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Peter Zijlstra, Vincent Guittot, linux-kernel, mingo, pjt,
	cmetcalf, tony.luck, alex.shi, linaro-kernel, rjw, paulmck,
	corbet, tglx, len.brown, arjan, amit.kucheria, james.hogan,
	schwidefsky, heiko.carstens, Dietmar Eggemann

On Tue, Jan 07, 2014 at 10:39:39AM +0000, Preeti U Murthy wrote:
> On 01/07/2014 03:20 PM, Peter Zijlstra wrote:
> > On Tue, Jan 07, 2014 at 03:10:21PM +0530, Preeti U Murthy wrote:
> >> What if we want to add arch specific flags to the NUMA domain? Currently
> >> with Peter's patch:https://lkml.org/lkml/2013/11/5/239 and this patch,
> >> the arch can modify the sd flags of the topology levels till just before
> >> the NUMA domain. In sd_init_numa(), the flags for the NUMA domain get
> >> initialized. We need to perhaps call into arch here to probe for
> >> additional flags?
> > 
> > What are you thinking of? I was hoping all NUMA details were captured in
> > the distance table.
> > 
> > Its far easier to talk of specifics in this case.
> > 
> If the processor can be core gated, then there is very little power
> savings that we could yield from consolidating all the load onto a
> single node in a NUMA domain. 6 cores on one node or 3 cores each on two
> nodes, the power is drawn by 6 cores in all.

Not being a NUMA expert, I would have thought that load consolidation at
node level would nearly always save power even when cpus can be power
gated individually. The number of cpus awake is the same, but you only
need to power the caches, memory, and other node peripherals for one
node instead of two in your example. Wouldn't that save power?

Memory/cache intensive workloads might benefit from spreading at node
level though. 

Am I missing something?

Morten

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-07 10:39               ` Preeti U Murthy
  2014-01-07 11:13                 ` Peter Zijlstra
  2014-01-07 11:20                 ` Morten Rasmussen
@ 2014-01-07 12:31                 ` Vincent Guittot
  2014-01-07 16:51                   ` Preeti U Murthy
  2 siblings, 1 reply; 101+ messages in thread
From: Vincent Guittot @ 2014-01-07 12:31 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner,
	Morten Rasmussen, cmetcalf, tony.luck, Alex Shi, linaro-kernel,
	Rafael J. Wysocki, Paul McKenney, Jon Corbet, Thomas Gleixner,
	Len Brown, Arjan van de Ven, Amit Kucheria, james.hogan,
	schwidefsky, heiko.carstens, Dietmar Eggemann

On 7 January 2014 11:39, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> On 01/07/2014 03:20 PM, Peter Zijlstra wrote:
>> On Tue, Jan 07, 2014 at 03:10:21PM +0530, Preeti U Murthy wrote:
>>> What if we want to add arch specific flags to the NUMA domain? Currently
>>> with Peter's patch:https://lkml.org/lkml/2013/11/5/239 and this patch,
>>> the arch can modify the sd flags of the topology levels till just before
>>> the NUMA domain. In sd_init_numa(), the flags for the NUMA domain get
>>> initialized. We need to perhaps call into arch here to probe for
>>> additional flags?
>>
>> What are you thinking of? I was hoping all NUMA details were captured in
>> the distance table.
>>
>> Its far easier to talk of specifics in this case.
>>
> If the processor can be core gated, then there is very little power
> savings that we could yield from consolidating all the load onto a
> single node in a NUMA domain. 6 cores on one node or 3 cores each on two
> nodes, the power is drawn by 6 cores in all. So I was thinking under
> this circumstance we might want to set the SD_SHARE_POWERDOMAIN flag at
> the NUMA domain and spread the load if it favours the workload.

The policy of keeping the tasks running on cores that are close (same
node) to the memory is the most power efficient one, isn't it? So it's
probably more about where to place the memory than about where to
place the tasks?

Vincent

>
> Regards
> Preeti U Murthy
>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-01  5:00           ` Preeti U Murthy
  2014-01-06 16:33             ` Peter Zijlstra
@ 2014-01-07 12:40             ` Vincent Guittot
  1 sibling, 0 replies; 101+ messages in thread
From: Vincent Guittot @ 2014-01-07 12:40 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner,
	Morten Rasmussen, cmetcalf, tony.luck, Alex Shi, linaro-kernel,
	Rafael J. Wysocki, Paul McKenney, Jon Corbet, Thomas Gleixner,
	Len Brown, Arjan van de Ven, Amit Kucheria, james.hogan,
	schwidefsky, heiko.carstens, Dietmar Eggemann

On 1 January 2014 06:00, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> Hi Vincent,
>
> On 12/18/2013 06:43 PM, Vincent Guittot wrote:
>> This patch applies on top of the two patches [1][2] that have been proposed by
>> Peter for creating a new way to initialize sched_domain. It includes some minor
>> compilation fixes and a trial of using this new method on ARM platform.
>> [1] https://lkml.org/lkml/2013/11/5/239
>> [2] https://lkml.org/lkml/2013/11/5/449
>>
>> Based on the results of this tests, my feeling about this new way to init the
>> sched_domain is a bit mitigated.
>>
>> The good point is that I have been able to create the same sched_domain
>> topologies than before and even more complex ones (where a subset of the cores
>> in a cluster share their powergating capabilities). I have described various
>> topology results below.
>>
>> I use a system that is made of a dual cluster of quad cores with hyperthreading
>> for my examples.
>>
>> If one cluster (0-7) can powergate its cores independantly but not the other
>> cluster (8-15) we have the following topology, which is equal to what I had
>> previously:
>>
>> CPU0:
>> domain 0: span 0-1 level: SMT
>>     flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>>     groups: 0 1
>>   domain 1: span 0-7 level: MC
>>       flags: SD_SHARE_PKG_RESOURCES
>>       groups: 0-1 2-3 4-5 6-7
>>     domain 2: span 0-15 level: CPU
>>         flags:
>>         groups: 0-7 8-15
>>
>> CPU8
>> domain 0: span 8-9 level: SMT
>>     flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>>     groups: 8 9
>>   domain 1: span 8-15 level: MC
>>       flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>>       groups: 8-9 10-11 12-13 14-15
>>     domain 2: span 0-15 level CPU
>>         flags:
>>         groups: 8-15 0-7
>>
>> We can even describe some more complex topologies if a susbset (2-7) of the
>> cluster can't powergate independatly:
>>
>> CPU0:
>> domain 0: span 0-1 level: SMT
>>     flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>>     groups: 0 1
>>   domain 1: span 0-7 level: MC
>>       flags: SD_SHARE_PKG_RESOURCES
>>       groups: 0-1 2-7
>>     domain 2: span 0-15 level: CPU
>>         flags:
>>         groups: 0-7 8-15
>>
>> CPU2:
>> domain 0: span 2-3 level: SMT
>>     flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>>     groups: 0 1
>>   domain 1: span 2-7 level: MC
>>       flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>>       groups: 2-7 4-5 6-7
>>     domain 2: span 0-7 level: MC
>>         flags: SD_SHARE_PKG_RESOURCES
>>         groups: 2-7 0-1
>>       domain 3: span 0-15 level: CPU
>>           flags:
>>           groups: 0-7 8-15
>>
>> In this case, we have an aditionnal sched_domain MC level for this subset (2-7)
>> of cores so we can trigger some load balance in this subset before doing that
>> on the complete cluster (which is the last level of cache in my example)
>>
>> We can add more levels that will describe other dependency/independency like
>> the frequency scaling dependency and as a result the final sched_domain
>> topology will have additional levels (if they have not been removed during
>> the degenerate sequence)
>
> The design looks good to me. In my opinion information like P-states and
> C-states dependency can be kept separate from the topology levels, it
> might get too complicated unless the information is tightly coupled to
> the topology.
>
>>
>> My concern is about the configuration of the table that is used to create the
>> sched_domain. Some levels are "duplicated" with different flags configuration
>
> I do not feel this is a problem since the levels are not duplicated,
> rather they have different properties within them which is best
> represented by flags like you have introduced in this patch.
>
>> which make the table not easily readable and we must also take care of the
>> order  because parents have to gather all cpus of its childs. So we must
>> choose which capabilities will be a subset of the other one. The order is
>
> The sched domain levels which have SD_SHARE_POWERDOMAIN set is expected
> to have cpus which are a subset of the cpus that this domain would have
> included had this flag not been set. In addition to this every higher
> domain, irrespective of SD_SHARE_POWERDOMAIN being set, will include all
> cpus of the lower domains. As far as I see, this patch does not change
> these assumptions. Hence I am unable to imagine a scenario when the
> parent might not include all cpus of its children domain. Do you have
> such a scenario in mind which can arise due to this patch ?

My patch doesn't have this issue because i have added only 1 layer, which is
always a subset of the current cache-level topology, but if we add
another feature with another layer, we have to decide which feature
will be a subset of the other one.
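
For reference, a condensed sketch of the kind of table being discussed (loosely following the interface
proposed in [1]; the GMC level and the *_flags helpers here are placeholders, not actual code from this
patchset), showing why the power-gating level has to be ordered as a subset of the cache level:

/*
 * Each level's cpumask must be a subset of the next level's mask, so once we
 * describe both "cores gated together" (GMC) and "cores sharing package
 * resources" (MC), we must decide up front which one nests inside the other.
 */
static struct sched_domain_topology_level example_topology[] = {
	{ cpu_smt_mask,       smt_flags,         SD_INIT_NAME(SMT) },
	{ cpu_corepower_mask, powerdomain_flags, SD_INIT_NAME(GMC) },	/* subset of MC */
	{ cpu_coregroup_mask, core_flags,        SD_INIT_NAME(MC)  },
	{ cpu_cpu_mask,                          SD_INIT_NAME(CPU) },
	{ NULL, },
};

Adding, say, a clock-sharing level later means yet another such entry and yet another subset decision, which
is exactly the readability concern above.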

Vincent

>
> Thanks
>
> Regards
> Preeti U Murthy
>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-07  8:32                 ` Vincent Guittot
@ 2014-01-07 13:22                   ` Peter Zijlstra
  2014-01-07 14:10                     ` Peter Zijlstra
  2014-01-07 14:11                     ` Vincent Guittot
  0 siblings, 2 replies; 101+ messages in thread
From: Peter Zijlstra @ 2014-01-07 13:22 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Dietmar Eggemann, linux-kernel, mingo, pjt, Morten Rasmussen,
	cmetcalf, tony.luck, alex.shi, preeti, linaro-kernel, paulmck,
	corbet, tglx, len.brown, arjan, amit.kucheria, james.hogan,
	schwidefsky, heiko.carstens

On Tue, Jan 07, 2014 at 09:32:04AM +0100, Vincent Guittot wrote:
> On 6 January 2014 17:31, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Mon, Jan 06, 2014 at 02:41:31PM +0100, Vincent Guittot wrote:
> >> IMHO, these settings will disappear sooner or later, as an example the
> >> idle/busy _idx are going to be removed by Alex's patch.
> >
> > Well I'm still entirely unconvinced by them..
> >
> > removing the cpu_load array makes sense, but I'm starting to doubt the
> > removal of the _idx things.. I think we want to retain them in some
> > form, it simply makes sense to look at longer term averages when looking
> > at larger CPU groups.
> >
> > So maybe we can express the things in log_2(group-span) or so, but we
> > need a working replacement for the cpu_load array. Ideally some
> > expression involving the blocked load.
> 
> Using the blocked load can surely give benefit in the load balance
> because it gives a view of potential load on a core but it still decay
> with the same speed than runnable load average so it doesn't solve the
> issue for longer term average. One way is to have a runnable average
> load with longer time window

Ah, another way of looking at it is that the avg without blocked
component is a 'now' picture. It is the load we are concerned with right
now.

The more blocked we add the further out we look; with the obvious limit
of the entire averaging period.

So the avg that is runnable is right now, t_0; the avg that is runnable +
blocked is t_0 + p, where p is the avg period over which we expect the
blocked contribution to appear.

So something like:

  avg = runnable + p(i) * blocked; where p(i) \e [0,1]

could maybe be used to replace the cpu_load array and still represent
the concept of looking at a bigger picture for larger sets. Leaving open
the details of the map p.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-07 13:22                   ` Peter Zijlstra
@ 2014-01-07 14:10                     ` Peter Zijlstra
  2014-01-07 15:41                       ` Morten Rasmussen
  2014-01-07 14:11                     ` Vincent Guittot
  1 sibling, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2014-01-07 14:10 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Dietmar Eggemann, linux-kernel, mingo, pjt, Morten Rasmussen,
	cmetcalf, tony.luck, alex.shi, preeti, linaro-kernel, paulmck,
	corbet, tglx, len.brown, arjan, amit.kucheria, james.hogan,
	schwidefsky, heiko.carstens

On Tue, Jan 07, 2014 at 02:22:20PM +0100, Peter Zijlstra wrote:

I just realized there are two different p's in there.

> Ah, another way of looking at it is that the avg without blocked
> component is a 'now' picture. It is the load we are concerned with right
> now.
> 
> The more blocked we add the further out we look; with the obvious limit
> of the entire averaging period.
> 
> So the avg that is runnable is right now, t_0; the avg that is runnable +
> blocked is t_0 + p, where p is the avg period over which we expect the
> blocked contribution to appear.

So the above p for period, is unrelated to the below p which is a
probability function.

> So something like:
> 
>   avg = runnable + p(i) * blocked; where p(i) \e [0,1]
> 
> could maybe be used to replace the cpu_load array and still represent
> the concept of looking at a bigger picture for larger sets. Leaving open
> the details of the map p.

We probably want to assume task wakeups arrive at a constant rate over time, so p (our
probability function) should probably follow an exponential distribution.
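
A rough sketch of how that could be used to replace the cpu_load[] array (illustration only; the horizon per
domain level and the 32ms period constant are made-up numbers, and p is taken to be the exponential CDF
implied by constant-rate wakeups):

#include <math.h>
#include <stdio.h>

/*
 * Sketch of "avg = runnable + p(i) * blocked" as a cpu_load[] replacement.
 * Larger sched_domain levels look further into the future, so they fold in
 * more of the blocked load.  Assuming wakeups arrive at a constant rate,
 * p is the exponential CDF: p(t) = 1 - exp(-t / T), with T the average
 * period over which blocked contributions reappear (value made up here).
 */
#define BLOCKED_PERIOD_MS	32.0	/* hypothetical average reappearance period */

static double domain_load(double runnable_avg, double blocked_avg, int level)
{
	/* look-ahead horizon grows with the domain level: 1, 2, 4, 8... ms */
	double horizon_ms = (double)(1 << level);
	double p = 1.0 - exp(-horizon_ms / BLOCKED_PERIOD_MS);

	return runnable_avg + p * blocked_avg;
}

int main(void)
{
	/* the same rq seen from SMT (level 0) up to a NUMA-ish level 5 */
	for (int level = 0; level <= 5; level++)
		printf("level %d: load = %.1f\n",
		       level, domain_load(800.0, 400.0, level));
	return 0;
}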

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-07 13:22                   ` Peter Zijlstra
  2014-01-07 14:10                     ` Peter Zijlstra
@ 2014-01-07 14:11                     ` Vincent Guittot
  2014-01-07 15:37                       ` Morten Rasmussen
  1 sibling, 1 reply; 101+ messages in thread
From: Vincent Guittot @ 2014-01-07 14:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, linux-kernel, mingo, pjt, Morten Rasmussen,
	cmetcalf, tony.luck, alex.shi, preeti, linaro-kernel, paulmck,
	corbet, tglx, len.brown, arjan, amit.kucheria, james.hogan,
	schwidefsky, heiko.carstens

On 7 January 2014 14:22, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jan 07, 2014 at 09:32:04AM +0100, Vincent Guittot wrote:
>> On 6 January 2014 17:31, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Mon, Jan 06, 2014 at 02:41:31PM +0100, Vincent Guittot wrote:
>> >> IMHO, these settings will disappear sooner or later, as an example the
>> >> idle/busy _idx are going to be removed by Alex's patch.
>> >
>> > Well I'm still entirely unconvinced by them..
>> >
>> > removing the cpu_load array makes sense, but I'm starting to doubt the
>> > removal of the _idx things.. I think we want to retain them in some
>> > form, it simply makes sense to look at longer term averages when looking
>> > at larger CPU groups.
>> >
>> > So maybe we can express the things in log_2(group-span) or so, but we
>> > need a working replacement for the cpu_load array. Ideally some
>> > expression involving the blocked load.
>>
>> Using the blocked load can surely give benefit in the load balance
>> because it gives a view of potential load on a core but it still decay
>> with the same speed than runnable load average so it doesn't solve the
>> issue for longer term average. One way is to have a runnable average
>> load with longer time window
>
> Ah, another way of looking at it is that the avg without blocked
> component is a 'now' picture. It is the load we are concerned with right
> now.
>
> The more blocked we add the further out we look; with the obvious limit
> of the entire averaging period.
>
> So the avg that is runnable is right now, t_0; the avg that is runnable +
> blocked is t_0 + p, where p is the avg period over which we expect the
> blocked contribution to appear.
>
> So something like:
>
>   avg = runnable + p(i) * blocked; where p(i) \e [0,1]
>
> could maybe be used to replace the cpu_load array and still represent
> the concept of looking at a bigger picture for larger sets. Leaving open
> the details of the map p.

That needs to be studied more deeply, but it could be a way to get a
larger picture.

Another point is that we are using the runnable and blocked load averages,
which are the sums of the load_avg_contrib of tasks, but we are not using
the runnable_avg_sum of the cpus, which is not the 'now' picture but an
average of the past running time (without taking task weight into
account).

Vincent

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-07 14:11                     ` Vincent Guittot
@ 2014-01-07 15:37                       ` Morten Rasmussen
  2014-01-08  8:37                         ` Alex Shi
  0 siblings, 1 reply; 101+ messages in thread
From: Morten Rasmussen @ 2014-01-07 15:37 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Dietmar Eggemann, linux-kernel, mingo, pjt,
	cmetcalf, tony.luck, alex.shi, preeti, linaro-kernel, paulmck,
	corbet, tglx, len.brown, arjan, amit.kucheria, james.hogan,
	schwidefsky, heiko.carstens

On Tue, Jan 07, 2014 at 02:11:22PM +0000, Vincent Guittot wrote:
> On 7 January 2014 14:22, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, Jan 07, 2014 at 09:32:04AM +0100, Vincent Guittot wrote:
> >> On 6 January 2014 17:31, Peter Zijlstra <peterz@infradead.org> wrote:
> >> > On Mon, Jan 06, 2014 at 02:41:31PM +0100, Vincent Guittot wrote:
> >> >> IMHO, these settings will disappear sooner or later, as an example the
> >> >> idle/busy _idx are going to be removed by Alex's patch.
> >> >
> >> > Well I'm still entirely unconvinced by them..
> >> >
> >> > removing the cpu_load array makes sense, but I'm starting to doubt the
> >> > removal of the _idx things.. I think we want to retain them in some
> >> > form, it simply makes sense to look at longer term averages when looking
> >> > at larger CPU groups.
> >> >
> >> > So maybe we can express the things in log_2(group-span) or so, but we
> >> > need a working replacement for the cpu_load array. Ideally some
> >> > expression involving the blocked load.
> >>
> >> Using the blocked load can surely give benefit in the load balance
> >> because it gives a view of potential load on a core but it still decay
> >> with the same speed than runnable load average so it doesn't solve the
> >> issue for longer term average. One way is to have a runnable average
> >> load with longer time window

The blocked load discussion comes up again :)

I totally agree that blocked load would be useful, but only if we get
the priority problem sorted out. Blocked load is the sum of load_contrib
of blocked tasks, which means that a tiny high priority task can have a
massive contribution to the blocked load.

> >
> > Ah, another way of looking at it is that the avg without blocked
> > component is a 'now' picture. It is the load we are concerned with right
> > now.
> >
> > The more blocked we add the further out we look; with the obvious limit
> > of the entire averaging period.
> >
> > So the avg that is runnable is right now, t_0; the avg that is runnable +
> > blocked is t_0 + p, where p is the avg period over which we expect the
> > blocked contribution to appear.
> >
> > So something like:
> >
> >   avg = runnable + p(i) * blocked; where p(i) \e [0,1]
> >
> > could maybe be used to replace the cpu_load array and still represent
> > the concept of looking at a bigger picture for larger sets. Leaving open
> > the details of the map p.

Figuring out p is the difficult bit. AFAIK, with blocked load in its
current form we don't have any clue when a task will reappear.

> 
> That needs to be studied more deeply but that could be a way to have a
> larger picture

Agree.

> 
> Another point is that we are using runnable and blocked load average
> which are the sum of load_avg_contrib of tasks but we are not using
> the runnable_avg_sum of the cpus which is not the now picture but a
> average of the past running time (without taking into account task
> weight)

Yes. The rq runnable_avg_sum is an excellent longer term load indicator.
It can't be compared with the runnable and blocked load though. The
other alternative that I can think of is to introduce an unweighted
alternative to blocked load. That is, sum of load_contrib/priority.
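
A toy calculation (not kernel code; the 10% running fractions are made up, the weights are the standard
nice-level load weights) showing why the weighted sum misleads and what the unweighted alternative would
report instead:

#include <stdio.h>

/*
 * Two blocked tasks that each ran ~10% of the time recently.  Weighted
 * load_contrib scales that by the nice-level weight, so the nice -10 task
 * dominates the blocked load even though both tasks need the same amount of
 * CPU when they wake up.  Dividing the weight back out ("unweighted" blocked
 * load) recovers a utilization-like number.
 */
struct blocked_task {
	const char *name;
	unsigned int weight;		/* load weight: 1024 == nice 0 */
	double running_fraction;	/* recent runnable fraction, 0..1 */
};

int main(void)
{
	struct blocked_task tasks[] = {
		{ "nice -10 task", 9548, 0.10 },	/* standard weight for nice -10 */
		{ "nice   0 task", 1024, 0.10 },
	};
	double weighted = 0.0, unweighted = 0.0;

	for (int i = 0; i < 2; i++) {
		double contrib = tasks[i].weight * tasks[i].running_fraction;

		weighted   += contrib;				/* ~ load_avg_contrib */
		unweighted += contrib / tasks[i].weight;	/* contrib with priority scaled out */
		printf("%s: contrib %.0f\n", tasks[i].name, contrib);
	}
	printf("weighted blocked load:   %.0f\n", weighted);	/* ~1057, dominated by nice -10 */
	printf("unweighted blocked load: %.2f\n", unweighted);	/* 0.20 -> 20% of one cpu */
	return 0;
}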

Morten

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-07 14:10                     ` Peter Zijlstra
@ 2014-01-07 15:41                       ` Morten Rasmussen
  2014-01-07 20:49                         ` Peter Zijlstra
  0 siblings, 1 reply; 101+ messages in thread
From: Morten Rasmussen @ 2014-01-07 15:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Dietmar Eggemann, linux-kernel, mingo, pjt,
	cmetcalf, tony.luck, alex.shi, preeti, linaro-kernel, paulmck,
	corbet, tglx, len.brown, arjan, amit.kucheria, james.hogan,
	schwidefsky, heiko.carstens

On Tue, Jan 07, 2014 at 02:10:59PM +0000, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 02:22:20PM +0100, Peter Zijlstra wrote:
> 
> I just realized there's two different p's in there.
> 
> > Ah, another way of looking at it is that the avg without blocked
> > component is a 'now' picture. It is the load we are concerned with right
> > now.
> > 
> > The more blocked we add the further out we look; with the obvious limit
> > of the entire averaging period.
> > 
> > So the avg that is runnable is right now, t_0; the avg that is runnable +
> > blocked is t_0 + p, where p is the avg period over which we expect the
> > blocked contribution to appear.
> 
> So the above p for period, is unrelated to the below p which is a
> probability function.
> 
> > So something like:
> > 
> >   avg = runnable + p(i) * blocked; where p(i) \e [0,1]
> > 
> > could maybe be used to replace the cpu_load array and still represent
> > the concept of looking at a bigger picture for larger sets. Leaving open
> > the details of the map p.
> 
> We probably want to assume task wakeup is constant over time, so p (our
> probability function) should probably be an exponential distribution.

Ah, makes more sense now.

You propose that we don't actually try to keep track of which tasks
might wake up when, but just factor in more and more of the blocked load
depending on how conservative a load estimate we want?

I think that could work if we sort out the priority scaling issue that I
mentioned before.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-07 11:13                 ` Peter Zijlstra
@ 2014-01-07 16:31                   ` Preeti U Murthy
  0 siblings, 0 replies; 101+ messages in thread
From: Preeti U Murthy @ 2014-01-07 16:31 UTC (permalink / raw)
  To: Peter Zijlstra, Vincent Guittot, Morten.Rasmussen
  Cc: linux-kernel, mingo, pjt, cmetcalf, tony.luck, alex.shi,
	linaro-kernel, rafael.j.wysocki, paulmck, corbet, tglx,
	len.brown, arjan, amit.kucheria, james.hogan, schwidefsky,
	heiko.carstens, Dietmar.Eggemann

On 01/07/2014 04:43 PM, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 04:09:39PM +0530, Preeti U Murthy wrote:
>> On 01/07/2014 03:20 PM, Peter Zijlstra wrote:
>>> On Tue, Jan 07, 2014 at 03:10:21PM +0530, Preeti U Murthy wrote:
>>>> What if we want to add arch specific flags to the NUMA domain? Currently
>>>> with Peter's patch:https://lkml.org/lkml/2013/11/5/239 and this patch,
>>>> the arch can modify the sd flags of the topology levels till just before
>>>> the NUMA domain. In sd_init_numa(), the flags for the NUMA domain get
>>>> initialized. We need to perhaps call into arch here to probe for
>>>> additional flags?
>>>
>>> What are you thinking of? I was hoping all NUMA details were captured in
>>> the distance table.
>>>
>>> Its far easier to talk of specifics in this case.
>>>
>> If the processor can be core gated, then there is very little power
>> savings that we could yield from consolidating all the load onto a
>> single node in a NUMA domain. 6 cores on one node or 3 cores each on two
>> nodes, the power is drawn by 6 cores in all. So I was thinking under
>> this circumstance we might want to set the SD_SHARE_POWERDOMAIN flag at
>> the NUMA domain and spread the load if it favours the workload.
> 
> So Intel has so far not said a lot of sensible things about power
> management on their multi-socket platform.
> 
> And I've not heard anything at all from IBM on the POWER chips.
> 
> What I know from the Intel side is that packet idle hardly saves
> anything when compared to the DRAM power and the cost of having to do
> remote memory accesses.
> 
> In other words, I'm not at all considering power aware scheduling for
> NUMA systems until someone starts talking sense :-)
> 

On Power8 systems, most of the cpuidle power management is done at the
core level. This is expected to yield good power savings without
much loss of performance, since these idle states have a small exit latency
and re-initialization of the cores carries little overhead.

However, doing idle power management at the node level, although it yields
good power savings, could hurt performance because of the overhead
of re-initializing the node, which can be significant, and of
course the large exit latency of such idle states.

Therefore we would try to consolidate load to cores as much as possible,
rather than to nodes, so as to leave as many cores idle as possible. Again,
consolidation to a core needs to be to 3-4 threads in the core. With 8
threads in a core, running just one thread would hardly do justice to
the core's resources, while running the core at full throttle
would hurt performance. Hence a fine balance can be struck by
consolidating load onto a minimum number of threads.
  *Consolidating load to cores while spreading the load across nodes* would
probably help memory-intensive workloads finish faster due to less
contention on local node memory, and can get the cores to idle faster.

Thanks

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-07 12:31                 ` Vincent Guittot
@ 2014-01-07 16:51                   ` Preeti U Murthy
  0 siblings, 0 replies; 101+ messages in thread
From: Preeti U Murthy @ 2014-01-07 16:51 UTC (permalink / raw)
  To: Vincent Guittot, Peter Zijlstra, Morten Rasmussen
  Cc: linux-kernel, Ingo Molnar, Paul Turner, cmetcalf, tony.luck,
	Alex Shi, linaro-kernel, Rafael J. Wysocki, Paul McKenney,
	Jon Corbet, Thomas Gleixner, Len Brown, Arjan van de Ven,
	Amit Kucheria, james.hogan, schwidefsky, heiko.carstens,
	Dietmar Eggemann

On 01/07/2014 06:01 PM, Vincent Guittot wrote:
> On 7 January 2014 11:39, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>> On 01/07/2014 03:20 PM, Peter Zijlstra wrote:
>>> On Tue, Jan 07, 2014 at 03:10:21PM +0530, Preeti U Murthy wrote:
>>>> What if we want to add arch specific flags to the NUMA domain? Currently
>>>> with Peter's patch:https://lkml.org/lkml/2013/11/5/239 and this patch,
>>>> the arch can modify the sd flags of the topology levels till just before
>>>> the NUMA domain. In sd_init_numa(), the flags for the NUMA domain get
>>>> initialized. We need to perhaps call into arch here to probe for
>>>> additional flags?
>>>
>>> What are you thinking of? I was hoping all NUMA details were captured in
>>> the distance table.
>>>
>>> Its far easier to talk of specifics in this case.
>>>
>> If the processor can be core gated, then there is very little power
>> savings that we could yield from consolidating all the load onto a
>> single node in a NUMA domain. 6 cores on one node or 3 cores each on two
>> nodes, the power is drawn by 6 cores in all. So I was thinking under
>> this circumstance we might want to set the SD_SHARE_POWERDOMAIN flag at
>> the NUMA domain and spread the load if it favours the workload.
> 
> The policy of keeping the tasks running on cores that are close (same
> node) to the memory, is the more power efficient, isn't it ? so it's
> probably more about where to place the memory than about where to
> place the tasks ?

Yes, this is another point. One of the reasons that we try to consolidate
load to cores is that on Power8 systems most of the power management is
at the core level, and node-level cpuidle states are usually entered only
on fully idle systems because of the overhead involved in exiting these
idle states, as I mentioned earlier in this thread.

Another point against node-level idle states, which could for
instance include flushing a large shared cache, is that if we try to
consolidate the load to nodes, we must also consolidate memory pages
simultaneously. Otherwise performance will be severely hurt by
re-fetching the pages that were flushed, compared to core-level idle
management.
  Core-level idle power management could include flushing of the l2 cache,
which is still ok for performance because re-fetching the pages in
this cache has relatively low overhead, and depending on the arch the
power savings obtained could be worth that overhead.

Thanks

Regards
Preeti U Murthy
> 
> Vincent
> 
>>
>> Regards
>> Preeti U Murthy
>>
> 


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-07 15:41                       ` Morten Rasmussen
@ 2014-01-07 20:49                         ` Peter Zijlstra
  2014-01-08  8:32                           ` Alex Shi
  2014-01-08 12:35                           ` Morten Rasmussen
  0 siblings, 2 replies; 101+ messages in thread
From: Peter Zijlstra @ 2014-01-07 20:49 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Vincent Guittot, Dietmar Eggemann, linux-kernel, mingo, pjt,
	cmetcalf, tony.luck, alex.shi, preeti, linaro-kernel, paulmck,
	corbet, tglx, len.brown, arjan, amit.kucheria, james.hogan,
	schwidefsky, heiko.carstens

On Tue, Jan 07, 2014 at 03:41:54PM +0000, Morten Rasmussen wrote:
> I think that could work if we sort of the priority scaling issue that I
> mentioned before.

We talked a bit about this on IRC a month or so ago, right? My memories
from that are that your main complaint is that we don't detect the
overload scenario right.

That is; the point at which we should start caring about SMP-nice is
when all our CPUs are fully occupied, because up to that point we're
under utilized and work preservation mandates we utilize idle time.

Currently we detect overload by sg.nr_running >= sg.capacity, which can
be very misleading because while a cpu might have a task running 'now'
it might be 99% idle.

At which point I argued we should change the capacity thing anyhow. Ever
since the runnable_avg patch set I've been arguing to change that into
an actual utilization test.

So I think that if we measure overload by something like >95% utilization
on the entire group the load scaling again makes perfect sense.

Given the 3-task {A,B,C} workload where A and B are niced, they should land on a
symmetric dual-CPU system like: {A,B}+{C}, assuming they're all while(1)
loops :-).

The harder case is where all 3 tasks are of equal weight; in which case
fairness would mandate we (slowly) rotate the tasks such that they all
get 2/3 time -- we also horribly fail at this :-)
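
For illustration, a sketch of what such a utilization-based overload test could look like, compared against
the nr_running >= capacity check (standalone C model, not the kernel code; the struct and the threshold knob
are invented for the example):

#include <stdbool.h>
#include <stdio.h>

/*
 * Sketch of a utilization-based overload test for a sched_group, as an
 * alternative to "sg.nr_running >= sg.capacity".  group_util would be the
 * summed per-cpu utilization; everything here is illustrative.
 */
#define SCHED_POWER_SCALE	1024	/* full capacity of one cpu */

struct group_stats {
	unsigned int nr_cpus;
	unsigned long nr_running;
	unsigned long group_util;	/* summed utilization, SCHED_POWER_SCALE per cpu */
};

/* threshold in percent; 95 was floated above, 100 is argued for below */
static bool group_overloaded(const struct group_stats *gs, unsigned int threshold_pct)
{
	unsigned long capacity = (unsigned long)gs->nr_cpus * SCHED_POWER_SCALE;

	return gs->group_util * 100 >= capacity * threshold_pct;
}

int main(void)
{
	/* each cpu has a task running 'now', but both are ~98% idle */
	struct group_stats gs = { .nr_cpus = 2, .nr_running = 2, .group_util = 40 };

	printf("nr_running test says overloaded:   %d\n", gs.nr_running >= gs.nr_cpus);
	printf("utilization test says overloaded:  %d\n", group_overloaded(&gs, 95));
	return 0;
}

The nr_running check claims overload for a group that is nearly idle, which is exactly the misleading case
described above; the utilization test does not.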

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-07 20:49                         ` Peter Zijlstra
@ 2014-01-08  8:32                           ` Alex Shi
  2014-01-08  8:37                             ` Peter Zijlstra
  2014-01-08 12:35                           ` Morten Rasmussen
  1 sibling, 1 reply; 101+ messages in thread
From: Alex Shi @ 2014-01-08  8:32 UTC (permalink / raw)
  To: Peter Zijlstra, Morten Rasmussen
  Cc: Vincent Guittot, Dietmar Eggemann, linux-kernel, mingo, pjt,
	cmetcalf, tony.luck, preeti, linaro-kernel, paulmck, corbet,
	tglx, len.brown, arjan, amit.kucheria, james.hogan, schwidefsky,
	heiko.carstens

On 01/08/2014 04:49 AM, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 03:41:54PM +0000, Morten Rasmussen wrote:
>> I think that could work if we sort of the priority scaling issue that I
>> mentioned before.
> 
> We talked a bit about this on IRC a month or so ago, right? My memories
> from that are that your main complaint is that we don't detect the
> overload scenario right.
> 
> That is; the point at which we should start caring about SMP-nice is
> when all our CPUs are fully occupied, because up to that point we're
> under utilized and work preservation mandates we utilize idle time.
> 
> Currently we detect overload by sg.nr_running >= sg.capacity, which can
> be very misleading because while a cpu might have a task running 'now'
> it might be 99% idle.
> 
> At which point I argued we should change the capacity thing anyhow. Ever
> since the runnable_avg patch set I've been arguing to change that into
> an actual utilization test.
> 
> So I think that if we measure overload by something like >95% utilization
> on the entire group the load scaling again makes perfect sense.

In my old power-aware scheduling patchset, I had tried values from 95 to 99. But
all those values will lead to imbalance when we test while(1)-like cases:
in a group of 24 LCPUs, 24 * 5% > 1, i.e. the allowed slack adds up to more
than one full CPU across the group. So finally I used 100% as the overload
indicator, and in testing 100% works well to find overload since few
system services are involved. :)
> 
> Given the 3 task {A,B,C} workload where A and B are niced, to land on a
> symmetric dual CPU system like: {A,B}+{C}, assuming they're all while(1)
> loops :-).
> 
> The harder case is where all 3 tasks are of equal weight; in which case
> fairness would mandate we (slowly) rotate the tasks such that they all
> get 2/3 time -- we also horribly fail at this :-)
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-08  8:32                           ` Alex Shi
@ 2014-01-08  8:37                             ` Peter Zijlstra
  2014-01-08 12:52                               ` Morten Rasmussen
  0 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2014-01-08  8:37 UTC (permalink / raw)
  To: Alex Shi
  Cc: Morten Rasmussen, Vincent Guittot, Dietmar Eggemann,
	linux-kernel, mingo, pjt, cmetcalf, tony.luck, preeti,
	linaro-kernel, paulmck, corbet, tglx, len.brown, arjan,
	amit.kucheria, james.hogan, schwidefsky, heiko.carstens

On Wed, Jan 08, 2014 at 04:32:18PM +0800, Alex Shi wrote:
> In my old power-aware scheduling patchset, I tried values from 95 to 99. But
> all of them lead to imbalance when we test while(1)-like cases, e.g. in a
> group of 24 LCPUs, 24*5% > 1. So I finally used 100% as the overload
> indicator. And in testing, 100% works well for finding overload since few
> system services are involved. :)

Ah indeed, so 100% it is ;-)

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-07 15:37                       ` Morten Rasmussen
@ 2014-01-08  8:37                         ` Alex Shi
  0 siblings, 0 replies; 101+ messages in thread
From: Alex Shi @ 2014-01-08  8:37 UTC (permalink / raw)
  To: Morten Rasmussen, Vincent Guittot
  Cc: Peter Zijlstra, Dietmar Eggemann, linux-kernel, mingo, pjt,
	cmetcalf, tony.luck, preeti, linaro-kernel, paulmck, corbet,
	tglx, len.brown, arjan, amit.kucheria, james.hogan, schwidefsky,
	heiko.carstens

On 01/07/2014 11:37 PM, Morten Rasmussen wrote:
>>> > >
>>> > > So something like:
>>> > >
>>> > >   avg = runnable + p(i) * blocked; where p(i) \e [0,1]
>>> > >
>>> > > could maybe be used to replace the cpu_load array and still represent
>>> > > the concept of looking at a bigger picture for larger sets. Leaving open
>>> > > the details of the map p.
> Figuring out p is the difficult bit. AFAIK, with blocked load in its
> current form we don't have any clue when a task will reappear.

Yes, that's why we cannot find a suitable way to take the blocked load
into account in load balancing.
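
(Just to make the shape of that quoted idea concrete -- a trivial
standalone sketch where the hard part, choosing p, is deliberately left
as a knob; the name and the percentage granularity are made up:)

/* avg = runnable + p * blocked, with p expressed as a percentage */
static unsigned long blended_load(unsigned long runnable_avg,
                                  unsigned long blocked_avg,
                                  unsigned int p_pct)
{
        return runnable_avg + blocked_avg * p_pct / 100;
}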


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-07 20:49                         ` Peter Zijlstra
  2014-01-08  8:32                           ` Alex Shi
@ 2014-01-08 12:35                           ` Morten Rasmussen
  2014-01-08 12:42                             ` Peter Zijlstra
  2014-01-08 12:45                             ` Peter Zijlstra
  1 sibling, 2 replies; 101+ messages in thread
From: Morten Rasmussen @ 2014-01-08 12:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Dietmar Eggemann, linux-kernel, mingo, pjt,
	cmetcalf, tony.luck, alex.shi, preeti, linaro-kernel, paulmck,
	corbet, tglx, len.brown, arjan, amit.kucheria, james.hogan,
	schwidefsky, heiko.carstens

On Tue, Jan 07, 2014 at 08:49:51PM +0000, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 03:41:54PM +0000, Morten Rasmussen wrote:
> > I think that could work if we sort out the priority scaling issue that I
> > mentioned before.
> 
> We talked a bit about this on IRC a month or so ago, right? My memories
> from that are that your main complaint is that we don't detect the
> overload scenario right.
> 
> That is; the point at which we should start caring about SMP-nice is
> when all our CPUs are fully occupied, because up to that point we're
> under utilized and work preservation mandates we utilize idle time.

Yes. I think I stated the problem differently, but I think we're talking
about the same thing. Basically, priority-scaling in task load_contrib means
that runnable_load_avg and blocked_load_avg are poor indicators of cpu
load (available idle time). Priority scaling only makes sense when the
system is fully utilized. When that is not the case, it just gives us a
potentially very inaccurate picture of the load (available idle time).

Pretty much what you just said :-)
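
(A standalone toy version of the problem; the struct and names are made
up and the formula only mimics the flavour of the per-entity tracking,
it does not quote it. Two always-running tasks occupy a cpu equally, but
the priority-scaled number says otherwise:)

struct task_avg {
        unsigned long runnable_sum;   /* decayed time spent runnable */
        unsigned long period;         /* decayed elapsed time        */
        unsigned long weight;         /* ~1024 at nice 0, ~110 at nice +10 */
};

/* priority-scaled contribution, i.e. what runnable_load_avg sums today:
 * a niced while(1) task reports ~10x less than a nice-0 one */
static unsigned long load_contrib(const struct task_avg *a)
{
        return a->runnable_sum * a->weight / (a->period + 1);
}

/* unweighted contribution: both while(1) tasks report ~1024, which is
 * the better proxy for the idle time they actually leave behind */
static unsigned long util_contrib(const struct task_avg *a)
{
        return a->runnable_sum * 1024 / (a->period + 1);
}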

> Currently we detect overload by sg.nr_running >= sg.capacity, which can
> be very misleading because while a cpu might have a task running 'now'
> it might be 99% idle.
> 
> At which point I argued we should change the capacity thing anyhow. Ever
> since the runnable_avg patch set I've been arguing to change that into
> an actual utilization test.
> 
> So I think that if we measure overload by something like >95% utilization
> on the entire group the load scaling again makes perfect sense.

I agree that it makes more sense to change the overload test to be based
on some tracked load. How about the non-overloaded case? Load balancing
would have to be based on unweighted task loads in that case?

> 
> Given a 3-task {A,B,C} workload where A and B are niced, we'd want them
> to land on a symmetric dual-CPU system like {A,B}+{C}, assuming they're
> all while(1) loops :-).
> 
> The harder case is where all 3 tasks are of equal weight; in which case
> fairness would mandate we (slowly) rotate the tasks such that they all
> get 2/3 time -- we also horribly fail at this :-)

I have encountered that one a number of times. All the middleware noise
in Android sometimes gives that effect.

I'm not sure if the NUMA guy would like a rotating scheduler though :-)

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-08 12:35                           ` Morten Rasmussen
@ 2014-01-08 12:42                             ` Peter Zijlstra
  2014-01-08 12:45                             ` Peter Zijlstra
  1 sibling, 0 replies; 101+ messages in thread
From: Peter Zijlstra @ 2014-01-08 12:42 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Vincent Guittot, Dietmar Eggemann, linux-kernel, mingo, pjt,
	cmetcalf, tony.luck, alex.shi, preeti, linaro-kernel, paulmck,
	corbet, tglx, len.brown, arjan, amit.kucheria, james.hogan,
	schwidefsky, heiko.carstens

On Wed, Jan 08, 2014 at 12:35:34PM +0000, Morten Rasmussen wrote:
> > The harder case is where all 3 tasks are of equal weight; in which case
> > fairness would mandate we (slowly) rotate the tasks such that they all
> > get 2/3 time -- we also horribly fail at this :-)
> 
> I have encountered that one a number of times. All the middleware noise
> in Android sometimes give that effect.

You've got a typo there: s/middleware/muddleware/ :-)

> I'm not sure if the NUMA guy would like rotating scheduler though :-)

Hurmph ;-) But yes, N+1 tasks on an N cpu system is rotten; any
static solution gets 2 tasks that run at 50%, and any dynamic solution gets
the migration overhead issue.

So while the dynamic solution would indeed allow each task to (on
average) receive N/(N+1) time -- a vast improvement over the 50% thing --
it doesn't come without downsides.
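
(Putting numbers on the dual-cpu case, just restating the trade-off:)

  static placement, 2 cpus, 3 equal while(1) tasks:
      cpu0: {A}    -> A gets 100%
      cpu1: {B,C}  -> B and C get 50% each

  rotating placement (ignoring migration cost):
      each task averages N/(N+1) = 2/3 ~= 67%

so rotation buys B and C roughly 17% each (and costs A 33%), in exchange
for the periodic migrations and their cache/latency overhead.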

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-08 12:35                           ` Morten Rasmussen
  2014-01-08 12:42                             ` Peter Zijlstra
@ 2014-01-08 12:45                             ` Peter Zijlstra
  2014-01-08 13:27                               ` Morten Rasmussen
  1 sibling, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2014-01-08 12:45 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Vincent Guittot, Dietmar Eggemann, linux-kernel, mingo, pjt,
	cmetcalf, tony.luck, alex.shi, preeti, linaro-kernel, paulmck,
	corbet, tglx, len.brown, arjan, amit.kucheria, james.hogan,
	schwidefsky, heiko.carstens

On Wed, Jan 08, 2014 at 12:35:34PM +0000, Morten Rasmussen wrote:
> > Currently we detect overload by sg.nr_running >= sg.capacity, which can
> > be very misleading because while a cpu might have a task running 'now'
> > it might be 99% idle.
> > 
> > At which point I argued we should change the capacity thing anyhow. Ever
> > since the runnable_avg patch set I've been arguing to change that into
> > an actual utilization test.
> > 
> > So I think that if we measure overload by something like >95% utilization
> > on the entire group the load scaling again makes perfect sense.
> 
> I agree that it make more sense to change the overload test to be based
> on some tracked load. How about the non-overloaded case? Load balancing
> would have to be based on unweighted task loads in that case?

Yeah, until we're overloaded our goal is to minimize idle time.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-08  8:37                             ` Peter Zijlstra
@ 2014-01-08 12:52                               ` Morten Rasmussen
  2014-01-08 13:04                                 ` Peter Zijlstra
  0 siblings, 1 reply; 101+ messages in thread
From: Morten Rasmussen @ 2014-01-08 12:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alex Shi, Vincent Guittot, Dietmar Eggemann, linux-kernel, mingo,
	pjt, cmetcalf, tony.luck, preeti, linaro-kernel, paulmck, corbet,
	tglx, len.brown, arjan, amit.kucheria, james.hogan, schwidefsky,
	heiko.carstens

On Wed, Jan 08, 2014 at 08:37:16AM +0000, Peter Zijlstra wrote:
> On Wed, Jan 08, 2014 at 04:32:18PM +0800, Alex Shi wrote:
> > In my old power-aware scheduling patchset, I tried values from 95 to 99. But
> > all of them lead to imbalance when we test while(1)-like cases, e.g. in a
> > group of 24 LCPUs, 24*5% > 1. So I finally used 100% as the overload
> > indicator. And in testing, 100% works well for finding overload since few
> > system services are involved. :)
> 
> Ah indeed, so 100% it is ;-)

If I remember correctly, Alex used the rq runnable_avg_sum (in rq->avg)
for this. It is the most obvious choice, but it takes ages to reach
100%.

#define LOAD_AVG_MAX_N 345

Worst case, it takes 345 ms from when the system becomes fully utilized
after a long period of idle until the rq runnable_avg_sum reaches 100%.

An unweighted version of cfs_rq->runnable_load_avg and blocked_load_avg
wouldn't have that delay.
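
(For reference, the 345 falls out of the geometric decay behind
runnable_avg -- ~1 ms periods with a 32 ms half-life; roughly:)

  y^32    = 0.5                            (per-period decay factor)
  sum(n)  = 1024 * (1 - y^n) / (1 - y)     (n ms fully runnable from idle)
  sum(oo) ~= 1024 / (1 - y) ~= 47742       (LOAD_AVG_MAX)

  y^345 = 0.5^(345/32) ~= 0.0006, so after 345 periods the sum is within
  rounding of its maximum -- hence LOAD_AVG_MAX_N = 345 and the ~345 ms
  worst-case ramp-up above.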

Also, if we are changing the load balance behavior when all cpus are
fully utilized we may need to think about cases where the load is
hovering around the saturation threshold. But I don't think that is
important yet.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-08 12:52                               ` Morten Rasmussen
@ 2014-01-08 13:04                                 ` Peter Zijlstra
  2014-01-08 13:33                                   ` Morten Rasmussen
  0 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2014-01-08 13:04 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Alex Shi, Vincent Guittot, Dietmar Eggemann, linux-kernel, mingo,
	pjt, cmetcalf, tony.luck, preeti, linaro-kernel, paulmck, corbet,
	tglx, len.brown, arjan, amit.kucheria, james.hogan, schwidefsky,
	heiko.carstens

On Wed, Jan 08, 2014 at 12:52:28PM +0000, Morten Rasmussen wrote:

> If I remember correctly, Alex used the rq runnable_avg_sum (in rq->avg)
> for this. It is the most obvious choice, but it takes ages to reach
> 100%.
> 
> #define LOAD_AVG_MAX_N 345
> 
> Worst case, it takes 345 ms from when the system becomes fully utilized
> after a long period of idle until the rq runnable_avg_sum reaches 100%.
> 
> An unweighted version of cfs_rq->runnable_load_avg and blocked_load_avg
> wouldn't have that delay.

Right.. not sure we want to involve blocked load in the utilization
metric, but who knows, maybe that does make sense.

But yes, we need unweighted runnable_avg.

> Also, if we are changing the load balance behavior when all cpus are
> fully utilized

We already have this tipping point. See all the has_capacity bits. But
yes, it'd get more involved I suppose.

> we may need to think about cases where the load is
> hovering around the saturation threshold. But I don't think that is
> important yet.

Yah.. I'm going to wait until we have a fail case that can give us
some guidance before really pondering this though :-)

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-08 12:45                             ` Peter Zijlstra
@ 2014-01-08 13:27                               ` Morten Rasmussen
  2014-01-08 13:32                                 ` Peter Zijlstra
  0 siblings, 1 reply; 101+ messages in thread
From: Morten Rasmussen @ 2014-01-08 13:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Dietmar Eggemann, linux-kernel, mingo, pjt,
	cmetcalf, tony.luck, alex.shi, preeti, linaro-kernel, paulmck,
	corbet, tglx, len.brown, arjan, amit.kucheria, james.hogan,
	schwidefsky, heiko.carstens

On Wed, Jan 08, 2014 at 12:45:34PM +0000, Peter Zijlstra wrote:
> On Wed, Jan 08, 2014 at 12:35:34PM +0000, Morten Rasmussen wrote:
> > > Currently we detect overload by sg.nr_running >= sg.capacity, which can
> > > be very misleading because while a cpu might have a task running 'now'
> > > it might be 99% idle.
> > > 
> > > At which point I argued we should change the capacity thing anyhow. Ever
> > > since the runnable_avg patch set I've been arguing to change that into
> > > an actual utilization test.
> > > 
> > > So I think that if we measure overload by something like >95% utilization
> > > on the entire group the load scaling again makes perfect sense.
> > 
> > I agree that it make more sense to change the overload test to be based
> > on some tracked load. How about the non-overloaded case? Load balancing
> > would have to be based on unweighted task loads in that case?
> 
> Yeah, until we're overloaded our goal is to minimize idle time.

I would say, make the most of the available cpu cycles. Minimizing idle
time is not always the right thing to do when considering power
awareness.

If we know the actual load of the tasks, we may be able to consolidate
them on fewer cpus and save power by idling cpus. In that case the idle
time (total) is unchanged (unless the P-state is changed). Somewhat
similar to the video use-case running on 1, 2, and 4 cpus that I reposted
yesterday.
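
(As a toy illustration -- hypothetical names, and the 80% margin is only
an example, this is not the packing patches themselves -- a consolidation
test driven purely by utilization:)

#define SCHED_POWER_SCALE  1024   /* one fully busy cpu */
#define PACK_FILL_PCT        80   /* leave some headroom on the target */

/* can the work of two cpus be packed onto one of them? */
static int can_consolidate(unsigned long util_a, unsigned long util_b)
{
        /* the same total cycles run either way; packing only changes which
         * cpu runs them, so the other cpu (and its power domain) can idle */
        return (util_a + util_b) * 100 <= SCHED_POWER_SCALE * PACK_FILL_PCT;
}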

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-08 13:27                               ` Morten Rasmussen
@ 2014-01-08 13:32                                 ` Peter Zijlstra
  2014-01-08 13:45                                   ` Morten Rasmussen
  0 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2014-01-08 13:32 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Vincent Guittot, Dietmar Eggemann, linux-kernel, mingo, pjt,
	cmetcalf, tony.luck, alex.shi, preeti, linaro-kernel, paulmck,
	corbet, tglx, len.brown, arjan, amit.kucheria, james.hogan,
	schwidefsky, heiko.carstens

On Wed, Jan 08, 2014 at 01:27:39PM +0000, Morten Rasmussen wrote:
> On Wed, Jan 08, 2014 at 12:45:34PM +0000, Peter Zijlstra wrote:
> > On Wed, Jan 08, 2014 at 12:35:34PM +0000, Morten Rasmussen wrote:
> > > > Currently we detect overload by sg.nr_running >= sg.capacity, which can
> > > > be very misleading because while a cpu might have a task running 'now'
> > > > it might be 99% idle.
> > > > 
> > > > At which point I argued we should change the capacity thing anyhow. Ever
> > > > since the runnable_avg patch set I've been arguing to change that into
> > > > an actual utilization test.
> > > > 
> > > > So I think that if we measure overload by something like >95% utilization
> > > > on the entire group the load scaling again makes perfect sense.
> > > 
> > > I agree that it make more sense to change the overload test to be based
> > > on some tracked load. How about the non-overloaded case? Load balancing
> > > would have to be based on unweighted task loads in that case?
> > 
> > Yeah, until we're overloaded our goal is to minimize idle time.
> 
> I would say, make the most of the available cpu cycles. Minimizing idle
> time is not always the right thing to do when considering power
> awareness.
> 
> If we know the actual load of the tasks, we may be able to consolidate

I think we must start to be careful with the word 'load'; I think you
meant to say utilization.

> them on fewer cpus and save power by idling cpus. In that case the idle
> time (total) is unchanged (unless the P-state is changed). Somewhat
> similar to the video use-case running on 1, 2, and 4 cpu that I reposted
> yesterday.

But fair enough.. It's idle time when you consider CPUs to always run at
max frequency, but clearly I must stop thinking about CPUs like that :-)

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-08 13:04                                 ` Peter Zijlstra
@ 2014-01-08 13:33                                   ` Morten Rasmussen
  0 siblings, 0 replies; 101+ messages in thread
From: Morten Rasmussen @ 2014-01-08 13:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alex Shi, Vincent Guittot, Dietmar Eggemann, linux-kernel, mingo,
	pjt, cmetcalf, tony.luck, preeti, linaro-kernel, paulmck, corbet,
	tglx, len.brown, arjan, amit.kucheria, james.hogan, schwidefsky,
	heiko.carstens

On Wed, Jan 08, 2014 at 01:04:07PM +0000, Peter Zijlstra wrote:
> On Wed, Jan 08, 2014 at 12:52:28PM +0000, Morten Rasmussen wrote:
> 
> > If I remember correctly, Alex used the rq runnable_avg_sum (in rq->avg)
> > for this. It is the most obvious choice, but it takes ages to reach
> > 100%.
> > 
> > #define LOAD_AVG_MAX_N 345
> > 
> > Worst case, it takes 345 ms from when the system becomes fully utilized
> > after a long period of idle until the rq runnable_avg_sum reaches 100%.
> > 
> > An unweighted version of cfs_rq->runnable_load_avg and blocked_load_avg
> > wouldn't have that delay.
> 
> Right.. not sure we want to involve blocked load on the utilization
> metric, but who knows maybe that does make sense.
> 
> But yes, we need unweighted runnable_avg.

I'm not sure about the blocked load either.

> 
> > Also, if we are changing the load balance behavior when all cpus are
> > fully utilized
> 
> We already have this tipping point. See all the has_capacity bits. But
> yes, it'd get more involved I suppose.

I'll have a look.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC] sched: CPU topology try
  2014-01-08 13:32                                 ` Peter Zijlstra
@ 2014-01-08 13:45                                   ` Morten Rasmussen
  0 siblings, 0 replies; 101+ messages in thread
From: Morten Rasmussen @ 2014-01-08 13:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Dietmar Eggemann, linux-kernel, mingo, pjt,
	cmetcalf, tony.luck, alex.shi, preeti, linaro-kernel, paulmck,
	corbet, tglx, len.brown, arjan, amit.kucheria, james.hogan,
	schwidefsky, heiko.carstens

On Wed, Jan 08, 2014 at 01:32:57PM +0000, Peter Zijlstra wrote:
> On Wed, Jan 08, 2014 at 01:27:39PM +0000, Morten Rasmussen wrote:
> > On Wed, Jan 08, 2014 at 12:45:34PM +0000, Peter Zijlstra wrote:
> > > On Wed, Jan 08, 2014 at 12:35:34PM +0000, Morten Rasmussen wrote:
> > > > > Currently we detect overload by sg.nr_running >= sg.capacity, which can
> > > > > be very misleading because while a cpu might have a task running 'now'
> > > > > it might be 99% idle.
> > > > > 
> > > > > At which point I argued we should change the capacity thing anyhow. Ever
> > > > > since the runnable_avg patch set I've been arguing to change that into
> > > > > an actual utilization test.
> > > > > 
> > > > > So I think that if we measure overload by something like >95% utilization
> > > > > on the entire group the load scaling again makes perfect sense.
> > > > 
> > > > I agree that it make more sense to change the overload test to be based
> > > > on some tracked load. How about the non-overloaded case? Load balancing
> > > > would have to be based on unweighted task loads in that case?
> > > 
> > > Yeah, until we're overloaded our goal is to minimize idle time.
> > 
> > I would say, make the most of the available cpu cycles. Minimizing idle
> > time is not always the right thing to do when considering power
> > awareness.
> > 
> > If we know the actual load of the tasks, we may be able to consolidate
> 
> I think we must start to be careful with the word load, I think you
> meant to say utilization.

Indeed, I meant utilization.

> 
> > them on fewer cpus and save power by idling cpus. In that case the idle
> > time (total) is unchanged (unless the P-state is changed). Somewhat
> > similar to the video use-case running on 1, 2, and 4 cpu that I reposted
> > yesterday.
> 
> But fair enough.. Its idle time when you consider CPUs to always run at
> max frequency, but clearly I must stop thinking about CPUs like that :-)

Yes, it opens a whole new world of problems to be solved :-)

^ permalink raw reply	[flat|nested] 101+ messages in thread

end of thread, other threads:[~2014-01-08 13:45 UTC | newest]

Thread overview: 101+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-10-18 11:52 [RFC][PATCH v5 00/14] sched: packing tasks Vincent Guittot
2013-10-18 11:52 ` [RFC][PATCH v5 01/14] sched: add a new arch_sd_local_flags for sched_domain init Vincent Guittot
2013-11-05 14:06   ` Peter Zijlstra
2013-11-05 14:57     ` Vincent Guittot
2013-11-05 22:27       ` Peter Zijlstra
2013-11-06 10:10         ` Vincent Guittot
2013-11-06 13:53         ` Martin Schwidefsky
2013-11-06 14:08           ` Peter Zijlstra
2013-11-12 17:43             ` Dietmar Eggemann
2013-11-12 18:08               ` Peter Zijlstra
2013-11-13 15:47                 ` Dietmar Eggemann
2013-11-13 16:29                   ` Peter Zijlstra
2013-11-14 10:49                     ` Morten Rasmussen
2013-11-14 12:07                       ` Peter Zijlstra
2013-12-18 13:13         ` [RFC] sched: CPU topology try Vincent Guittot
2013-12-23 17:22           ` Dietmar Eggemann
2014-01-06 13:41             ` Vincent Guittot
2014-01-06 16:31               ` Peter Zijlstra
2014-01-07  8:32                 ` Vincent Guittot
2014-01-07 13:22                   ` Peter Zijlstra
2014-01-07 14:10                     ` Peter Zijlstra
2014-01-07 15:41                       ` Morten Rasmussen
2014-01-07 20:49                         ` Peter Zijlstra
2014-01-08  8:32                           ` Alex Shi
2014-01-08  8:37                             ` Peter Zijlstra
2014-01-08 12:52                               ` Morten Rasmussen
2014-01-08 13:04                                 ` Peter Zijlstra
2014-01-08 13:33                                   ` Morten Rasmussen
2014-01-08 12:35                           ` Morten Rasmussen
2014-01-08 12:42                             ` Peter Zijlstra
2014-01-08 12:45                             ` Peter Zijlstra
2014-01-08 13:27                               ` Morten Rasmussen
2014-01-08 13:32                                 ` Peter Zijlstra
2014-01-08 13:45                                   ` Morten Rasmussen
2014-01-07 14:11                     ` Vincent Guittot
2014-01-07 15:37                       ` Morten Rasmussen
2014-01-08  8:37                         ` Alex Shi
2014-01-06 16:28             ` Peter Zijlstra
2014-01-06 17:15               ` Morten Rasmussen
2014-01-07  9:57                 ` Peter Zijlstra
2014-01-01  5:00           ` Preeti U Murthy
2014-01-06 16:33             ` Peter Zijlstra
2014-01-06 16:37               ` Arjan van de Ven
2014-01-06 16:48                 ` Peter Zijlstra
2014-01-06 16:54                   ` Peter Zijlstra
2014-01-06 17:13                     ` Arjan van de Ven
2014-01-07 12:40             ` Vincent Guittot
2014-01-06 16:21           ` Peter Zijlstra
2014-01-07  8:22             ` Vincent Guittot
2014-01-07  9:40           ` Preeti U Murthy
2014-01-07  9:50             ` Peter Zijlstra
2014-01-07 10:39               ` Preeti U Murthy
2014-01-07 11:13                 ` Peter Zijlstra
2014-01-07 16:31                   ` Preeti U Murthy
2014-01-07 11:20                 ` Morten Rasmussen
2014-01-07 12:31                 ` Vincent Guittot
2014-01-07 16:51                   ` Preeti U Murthy
2013-10-18 11:52 ` [RFC][PATCH v5 03/14] sched: define pack buddy CPUs Vincent Guittot
2013-10-18 11:52 ` [RFC][PATCH v5 04/14] sched: do load balance only with packing cpus Vincent Guittot
2013-10-18 11:52 ` [RFC][PATCH v5 05/14] sched: add a packing level knob Vincent Guittot
2013-11-12 10:32   ` Peter Zijlstra
2013-11-12 10:44     ` Vincent Guittot
2013-11-12 10:55       ` Peter Zijlstra
2013-11-12 10:57         ` Vincent Guittot
2013-10-18 11:52 ` [RFC][PATCH v5 06/14] sched: create a new field with available capacity Vincent Guittot
2013-11-12 10:34   ` Peter Zijlstra
2013-11-12 11:05     ` Vincent Guittot
2013-10-18 11:52 ` [RFC][PATCH v5 07/14] sched: get CPU's activity statistic Vincent Guittot
2013-11-12 10:36   ` Peter Zijlstra
2013-11-12 10:41   ` Peter Zijlstra
2013-10-18 11:52 ` [RFC][PATCH v5 08/14] sched: move load idx selection in find_idlest_group Vincent Guittot
2013-11-12 10:49   ` Peter Zijlstra
2013-11-27 14:10   ` [tip:sched/core] sched/fair: Move " tip-bot for Vincent Guittot
2013-10-18 11:52 ` [RFC][PATCH v5 09/14] sched: update the packing cpu list Vincent Guittot
2013-10-18 11:52 ` [RFC][PATCH v5 10/14] sched: init this_load to max in find_idlest_group Vincent Guittot
2013-10-18 11:52 ` [RFC][PATCH v5 11/14] sched: add a SCHED_PACKING_TASKS config Vincent Guittot
2013-10-18 11:52 ` [RFC][PATCH v5 12/14] sched: create a statistic structure Vincent Guittot
2013-10-18 11:52 ` [RFC][PATCH v5 13/14] sched: differantiate idle cpu Vincent Guittot
2013-10-18 11:52 ` [RFC][PATCH v5 14/14] cpuidle: set the current wake up latency Vincent Guittot
2013-11-11 11:33 ` [RFC][PATCH v5 00/14] sched: packing tasks Catalin Marinas
2013-11-11 16:36   ` Peter Zijlstra
2013-11-11 16:39     ` Arjan van de Ven
2013-11-11 18:18       ` Catalin Marinas
2013-11-11 18:20         ` Arjan van de Ven
2013-11-12 12:06         ` Morten Rasmussen
2013-11-12 16:48         ` Arjan van de Ven
2013-11-12 23:14           ` Catalin Marinas
2013-11-13 16:13             ` Arjan van de Ven
2013-11-13 16:45               ` Catalin Marinas
2013-11-13 17:56                 ` Arjan van de Ven
2013-11-12 17:40     ` Catalin Marinas
2013-11-25 18:55     ` Daniel Lezcano
2013-11-11 16:38   ` Peter Zijlstra
2013-11-11 16:40     ` Arjan van de Ven
2013-11-12 10:36     ` Vincent Guittot
2013-11-11 16:54   ` Morten Rasmussen
2013-11-11 18:31     ` Catalin Marinas
2013-11-11 19:26       ` Arjan van de Ven
2013-11-11 22:43         ` Nicolas Pitre
2013-11-11 23:43         ` Catalin Marinas
2013-11-12 12:35   ` Vincent Guittot
