linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/3] sched: Extend sched_mc/smt_power_savings framework
@ 2009-03-03 11:51 Gautham R Shenoy
  2009-03-03 11:51 ` [PATCH 2 1/3] sched: code cleanup - sd_power_saving_flags(), sd_balance_for_mc/package_power() Gautham R Shenoy
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Gautham R Shenoy @ 2009-03-03 11:51 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan, Balbir Singh, Peter Zijlstra,
	Ingo Molnar, Suresh Siddha
  Cc: Dipankar Sarma, efault, andi, linux-kernel

Hi,

This is the second iteration of the patch series that extends the existing
sched_smt_/mc_power_savings framework to work on platforms
that have on-chip memory controllers making each of the cpu-package
a 'node'.

Original posting can be found here --> http://lkml.org/lkml/2009/2/16/221

Based on Peter's review comments, I have added additional comments at
appropriate places that help us understand the reason behind doing
certain things in the name of power-aware-load balancing.

Background
------------------------------------------------------------------
On machines with on-chip memory controller, each physical CPU
package forms a NUMA node and the CPU level sched_domain will have
only one group.  This prevents any form of power saving balance across
these nodes.  Enabling the sched_mc_power_savings tunable to work as
designed on these new single CPU NUMA node machines will help task
consolidation and save power as we did in other multi core multi
socket platforms.

Consolidation across NODES have implications of cross-node memory
access and other NUMA locality issues.  Even under such constraints
there could be scope for power savings vs performance tradeoffs and
hence making the sched_mc_powersavings work as expected on these
platform is justified.

sched_mc/smt_power_savings is still a tunable and power savings benefits
and performance would vary depending on the workload and the system
topology and hardware features.

The patch series has been tested on a 2-Socket Quad-core Dual threaded
box with kernbench as the workload, varying the number of threads.


+------------------------------------------------------------------------+
|Test: make -j4                                                          |
+-----------+----------+--------+---------+-------------+----------------+
| sched_smt | sched_mc | %Power | Time    | % Package 0 | % Package 1    |
|           |          |        |  (s)    |     idle    |   idle         |
+-----------+----------+--------+---------+-------------+----------------+
|           |          |        |         |Core0: 35.34 |Core4: 62.01    |
|           |          |        |         +-------------+----------------+
|           |          |        |         |Core1: 58.34 |Core5: 17.41    |
|     0     |    0     |  100   |  107.84 +-------------+----------------+
|           |          |        |         |Core2: 63.97 |Core6: 60.29    |
|           |          |        |         +-------------+----------------+
|           |          |        |         |Core3: 68.64 |Core7: 61.46    |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
|           |          |        |         |Core0: 34.28 |Core4: 18.26    |
|           |          |        |         +-------------+----------------+
|           |          |        |         |Core1: 99.19 |Core5: 18.54    |
|     0     |    2     |  99.89 |  109.91 +-------------+----------------+
|           |          |        |         |Core2: 99.89 |Core6: 21.54    |
|           |          |        |         +-------------+----------------+
|           |          |        |         |Core3: 99.91 |Core7: 23.21    |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
|           |          |        |         |Core0: 20.17 |Core4: 69.30    |
|           |          |        |         +-------------+----------------+
|           |          |        |         |Core1: 50.22 |Core5: 55.97    |
|     2     |    2     |  95.03 |  139.95 +-------------+----------------+
|           |          |        |         |Core2: 83.95 |Core6: 92.70    |
|           |          |        |         +-------------+----------------+
|           |          |        |         |Core3: 88.95 |Core7: 95.58    |
+-----------+----------+--------+ --------+-------------+----------------+

+------------------------------------------------------------------------+
|Test: make -j6                                                          |
+-----------+----------+--------+---------+-------------+----------------+
| sched_smt | sched_mc | %Power | Time    | % Package 0 | % Package 1    |
|           |          |        |         |     idle    |   idle         |
+-----------+----------+--------+---------+-------------+----------------+
|           |          |        |         |Core0: 25.35 |Core4: 41.07    |
|           |          |        |         +-------------+----------------+
|           |          |        |         |Core1: 43.84 |Core5: 19.95    |
|     0     |    0     |  100   |  77.67  +-------------+----------------+
|           |          |        |         |Core2: 43.23 |Core6: 42.82    |
|           |          |        |         +-------------+----------------+
|           |          |        |         |Core3: 47.66 |Core7: 45.96    |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
|           |          |        |         |Core0: 24.13 |Core4: 41.80    |
|           |          |        |         +-------------+----------------+
|           |          |        |         |Core1: 51.51 |Core5: 23.61    |
|     0     |    2     |  99.41 |  81.50  +-------------+----------------+
|           |          |        |         |Core2: 55.43 |Core6: 38.67    |
|           |          |        |         +-------------+----------------+
|           |          |        |         |Core3: 57.79 |Core7: 38.84    |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
|           |          |        |         |Core0:  7.75 |Core4: 94.45    |
|           |          |        |         +-------------+----------------+
|           |          |        |         |Core1: 19.04 |Core5: 67.42    |
|     2     |    2     |  93.32 | 100.39  +-------------+----------------+
|           |          |        |         |Core2: 28.29 |Core6: 96.90    |
|           |          |        |         +-------------+----------------+
|           |          |        |         |Core3: 66.63 |Core7: 99.86    |
+-----------+----------+--------+---------+-------------+----------------+

+------------------------------------------------------------------------+
|Test: make -j8                                                          |
+-----------+----------+--------+---------+-------------+----------------+
| sched_smt | sched_mc | %Power | Time    | % Package 0 | % Package 1    |
|           |          |        |         |     idle    |   idle         |
+-----------+----------+--------+---------+-------------+----------------+
|           |          |        |         |Core0: 18.17 |Core4: 33.38    |
|           |          |        |         +-------------+----------------+
|           |          |        |         |Core1: 34.62 |Core5: 19.58    |
|     0     |    0     |  100   |  63.82  +-------------+----------------+
|           |          |        |         |Core2: 31.99 |Core6: 32.35    |
|           |          |        |         +-------------+----------------+
|           |          |        |         |Core3: 34.59 |Core7: 29.99    |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
|           |          |        |         |Core0: 15.20 |Core4: 41.41    |
|           |          |        |         +-------------+----------------+
|           |          |        |         |Core1: 28.45 |Core5: 21.32    |
|     0     |    2     |  99.17 |  65.65  +-------------+----------------+
|           |          |        |         |Core2: 31.14 |Core6: 41.26    |
|           |          |        |         +-------------+----------------+
|           |          |        |         |Core3: 30.52 |Core7: 42.95    |
+-----------+----------+--------+---------+-------------+----------------+
+-----------+----------+--------+---------+-------------+----------------+
|           |          |        |         |Core0: 16.65 |Core4: 79.04    |
|           |          |        |         +-------------+----------------+
|           |          |        |         |Core1: 26.74 |Core5: 50.98    |
|     2     |    2     |  89.58 |  82.83  +-------------+----------------+
|           |          |        |         |Core2: 30.42 |Core6: 81.33    |
|           |          |        |         +-------------+----------------+
|           |          |        |         |Core3: 35.57 |Core7: 90.03    |
+-----------+----------+--------+---------+-------------+----------------+

---

Gautham R Shenoy (3):
      sched: Fix sd_parent_degenerate for SD_POWERSAVINGS_BALANCE.
      sched: Fix the wakeup nomination for sched_mc/smt_power_savings.
      sched: code cleanup - sd_power_saving_flags(), sd_balance_for_mc/package_power()


 include/linux/sched.h    |   66 ++++++++++++++++++-----------------
 include/linux/topology.h |    6 +--
 kernel/sched.c           |   88 +++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched_fair.c      |    2 +
 4 files changed, 119 insertions(+), 43 deletions(-)


Thoughts ?
-- 
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 2 1/3] sched: code cleanup - sd_power_saving_flags(), sd_balance_for_mc/package_power()
  2009-03-03 11:51 [PATCH v2 0/3] sched: Extend sched_mc/smt_power_savings framework Gautham R Shenoy
@ 2009-03-03 11:51 ` Gautham R Shenoy
  2009-03-03 11:51 ` [PATCH 2 2/3] sched: Fix the wakeup nomination for sched_mc/smt_power_savings Gautham R Shenoy
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Gautham R Shenoy @ 2009-03-03 11:51 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan, Balbir Singh, Peter Zijlstra,
	Ingo Molnar, Suresh Siddha
  Cc: Dipankar Sarma, efault, andi, linux-kernel, Gautham R Shenoy,
	Vaidyanathan Srinivasan

This is a code cleanup patch which combines the functions of
sd_balance_for_mc/package_power() and sd_power_saving_flags() into a single
function.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---

 include/linux/sched.h    |   65 +++++++++++++++++++++++-----------------------
 include/linux/topology.h |    6 +---
 2 files changed, 35 insertions(+), 36 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 02e16d2..362807a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -777,34 +777,45 @@ enum powersavings_balance_level {
 
 extern int sched_mc_power_savings, sched_smt_power_savings;
 
-static inline int sd_balance_for_mc_power(void)
-{
-	if (sched_smt_power_savings)
-		return SD_POWERSAVINGS_BALANCE;
-
-	return 0;
-}
-
-static inline int sd_balance_for_package_power(void)
-{
-	if (sched_mc_power_savings | sched_smt_power_savings)
-		return SD_POWERSAVINGS_BALANCE;
+enum sched_domain_level {
+	SD_LV_NONE = 0,
+	SD_LV_SIBLING,	/* Represents the THREADS domain */
+	SD_LV_MC,	/* Represents the CORES domain */
+	SD_LV_CPU,	/* Represents the PACKAGE domain */
+	SD_LV_NODE,	/* Represents the NODES domain */
+	SD_LV_ALLNODES,
+	SD_LV_MAX
+};
 
-	return 0;
-}
 
-/*
- * Optimise SD flags for power savings:
- * SD_BALANCE_NEWIDLE helps agressive task consolidation and power savings.
- * Keep default SD flags if sched_{smt,mc}_power_saving=0
+/**
+ * sd_power_savings_flags: Returns the flags specific to power-aware-load
+ * balancing for a given sched_domain level
+ *
+ * @level: The sched_domain level for which the the power-aware-load balancing
+ * flags need to be set.
+ *
+ * This function helps in setting the flags for power-aware load balancing for
+ * a given sched_domain.
+ *	- SD_POWERSAVINGS_BALANCE tells the load-balancer that power-aware
+ *	load balancing is applicable at this domain.
+ *
+ *	- SD_BALANCE_NEWIDLE helps aggressive task consolidation and
+ *	power-savings.
+ *
+ * For more information on power aware scheduling, see the comment before
+ * find_busiest_group() in kernel/sched.c
  */
 
-static inline int sd_power_saving_flags(void)
+static inline int sd_power_saving_flags(enum sched_domain_level level)
 {
-	if (sched_mc_power_savings | sched_smt_power_savings)
-		return SD_BALANCE_NEWIDLE;
+	if (level == SD_LV_MC && !sched_smt_power_savings)
+		return 0;
+	if (level == SD_LV_CPU &&
+		!(sched_mc_power_savings || sched_smt_power_savings))
+		return 0;
 
-	return 0;
+	return SD_POWERSAVINGS_BALANCE | SD_BALANCE_NEWIDLE;
 }
 
 struct sched_group {
@@ -830,16 +841,6 @@ static inline struct cpumask *sched_group_cpus(struct sched_group *sg)
 	return to_cpumask(sg->cpumask);
 }
 
-enum sched_domain_level {
-	SD_LV_NONE = 0,
-	SD_LV_SIBLING,
-	SD_LV_MC,
-	SD_LV_CPU,
-	SD_LV_NODE,
-	SD_LV_ALLNODES,
-	SD_LV_MAX
-};
-
 struct sched_domain_attr {
 	int relax_domain_level;
 };
diff --git a/include/linux/topology.h b/include/linux/topology.h
index e632d29..8097dce 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -125,8 +125,7 @@ int arch_update_cpu_topology(void);
 				| SD_WAKE_AFFINE	\
 				| SD_WAKE_BALANCE	\
 				| SD_SHARE_PKG_RESOURCES\
-				| sd_balance_for_mc_power()\
-				| sd_power_saving_flags(),\
+				| sd_power_saving_flags(SD_LV_MC),\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
 }
@@ -151,8 +150,7 @@ int arch_update_cpu_topology(void);
 				| SD_BALANCE_FORK	\
 				| SD_WAKE_AFFINE	\
 				| SD_WAKE_BALANCE	\
-				| sd_balance_for_package_power()\
-				| sd_power_saving_flags(),\
+				| sd_power_saving_flags(SD_LV_CPU),\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
 }


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2 2/3] sched: Fix the wakeup nomination for sched_mc/smt_power_savings.
  2009-03-03 11:51 [PATCH v2 0/3] sched: Extend sched_mc/smt_power_savings framework Gautham R Shenoy
  2009-03-03 11:51 ` [PATCH 2 1/3] sched: code cleanup - sd_power_saving_flags(), sd_balance_for_mc/package_power() Gautham R Shenoy
@ 2009-03-03 11:51 ` Gautham R Shenoy
  2009-03-03 12:23   ` Peter Zijlstra
  2009-03-03 11:51 ` [PATCH 2 3/3] sched: Fix sd_parent_degenerate for SD_POWERSAVINGS_BALANCE Gautham R Shenoy
  2009-03-03 12:21 ` [PATCH v2 0/3] sched: Extend sched_mc/smt_power_savings framework Peter Zijlstra
  3 siblings, 1 reply; 14+ messages in thread
From: Gautham R Shenoy @ 2009-03-03 11:51 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan, Balbir Singh, Peter Zijlstra,
	Ingo Molnar, Suresh Siddha
  Cc: Dipankar Sarma, efault, andi, linux-kernel, Gautham R Shenoy,
	Vaidyanathan Srinivasan

The existing algorithm to nominate a preferred wake up cpu would not
work on a machine which has both sched_mc_power_savings and
sched_smt_power_savings enabled. On such machines, the nomination at a lower
level would keep overwriting the nominations by it's peer-level as well as
higher level sched_domains. This would lead to the ping-ponging of the
nominated wake-up cpu, thereby preventing us from effectively consolidating
tasks.

Correct this by defining the authorized nomination sched_domain level, which
is either the highest sched_domain level containing the
SD_POWERSAVINGS_BALANCE flag or a lower level which contains the previously
nominated wake-up cpu in it's span.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---

 include/linux/sched.h |    1 +
 kernel/sched.c        |   74 +++++++++++++++++++++++++++++++++++++++++++++----
 kernel/sched_fair.c   |    2 +
 3 files changed, 70 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 362807a..7f66595 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -776,6 +776,7 @@ enum powersavings_balance_level {
 };
 
 extern int sched_mc_power_savings, sched_smt_power_savings;
+extern enum powersavings_balance_level active_power_savings_level;
 
 enum sched_domain_level {
 	SD_LV_NONE = 0,
diff --git a/kernel/sched.c b/kernel/sched.c
index 52bbf1c..7d22a96 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -515,11 +515,22 @@ struct root_domain {
 #endif
 #if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
 	/*
-	 * Preferred wake up cpu nominated by sched_mc balance that will be
-	 * used when most cpus are idle in the system indicating overall very
-	 * low system utilisation. Triggered at POWERSAVINGS_BALANCE_WAKEUP(2)
+	 * Preferred wake up cpu which is nominated by load balancer,
+	 * is the CPU on which the tasks would be woken up, which
+	 * otherwise would have woken up on an idle CPU even on a system
+	 * with low-cpu-utilization.
+	 * This is triggered at POWERSAVINGS_BALANCE_WAKEUP(2).
 	 */
 	unsigned int sched_mc_preferred_wakeup_cpu;
+	/*
+	 * authorized_nomination_level records the sched-domain level, which can
+	 * in the process of load-balancing nominate the
+	 * sched_mc_preferred_wakeup_cpu.
+	 *
+	 * This helps in serializing the nominations thereby preventing
+	 * multiple sched-domain levels overwriting each others' nominations.
+	 */
+	enum sched_domain_level authorized_nomination_level;
 #endif
 };
 
@@ -3090,6 +3101,22 @@ static int move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest,
  * find_busiest_group finds and returns the busiest CPU group within the
  * domain. It calculates and returns the amount of weighted load which
  * should be moved to restore balance via the imbalance parameter.
+ *
+ * Power-savings-balance: If the user has enabled the option to save power
+ * by means of task consolidation, then at the corresponding sched_domains,
+ * the SD_POWERSAVINGS_BALANCE flag will be set.
+ *
+ * Within such sched_domains, find_busiest_group would try to identify
+ * a sched_group which can be freed-up and it's tasks can be migrated to
+ * another group which has the capacity to accomodate the former's tasks.
+ * If such a "can-go-idle" sched_groups does exist, then the group which can
+ * accomodate it's tasks is returned as the busiest group.
+ *
+ * Furthermore, if the user opts for more aggressive power-aware load
+ * balancing, i.e when the active_power_savings_level greater or equal to
+ * POWERSAVINGS_BALANCE_WAKEUP, find_busiest_group will also nominate
+ * the preferred CPU, on which the tasks should hence forth
+ * be woken up on, instead of bothering an idle-cpu.
  */
 static struct sched_group *
 find_busiest_group(struct sched_domain *sd, int this_cpu,
@@ -3397,9 +3424,18 @@ out_balanced:
 		goto ret;
 
 	if (this == group_leader && group_leader != group_min) {
+		struct root_domain *my_rd = cpu_rq(this_cpu)->rd;
 		*imbalance = min_load_per_task;
-		if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP) {
-			cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu =
+		/*
+		 * Nominate the the preferred wakeup cpu only if this
+		 * sched_domain is authorized to do so, or if this sched_domain
+		 * contains the previously nominated cpu.
+		 */
+		if (sd->level == my_rd->authorized_nomination_level ||
+			(sd->level < my_rd->authorized_nomination_level &&
+			cpu_isset(my_rd->sched_mc_preferred_wakeup_cpu,
+					*sched_domain_span(sd)))) {
+			my_rd->sched_mc_preferred_wakeup_cpu =
 				cpumask_first(sched_group_cpus(group_leader));
 		}
 		return group_min;
@@ -3683,7 +3719,8 @@ redo:
 		    !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
 			return -1;
 
-		if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
+		if (active_power_savings_level <
+				POWERSAVINGS_BALANCE_WAKEUP)
 			return -1;
 
 		if (sd->nr_balance_failed++ < 2)
@@ -7193,6 +7230,9 @@ static void sched_domain_node_span(int node, struct cpumask *span)
 
 int sched_smt_power_savings = 0, sched_mc_power_savings = 0;
 
+/* Records the currently active power saving level. */
+enum powersavings_balance_level active_power_savings_level;
+
 /*
  * The cpus mask in sched_group and sched_domain hangs off the end.
  * FIXME: use cpumask_var_t or dynamic percpu alloc to avoid wasting space
@@ -7781,6 +7821,25 @@ static int __build_sched_domains(const struct cpumask *cpu_map,
 
 	err = 0;
 
+/* Assign the sched-domain level which can nominate preferred wake-up cpu */
+	rd->sched_mc_preferred_wakeup_cpu = UINT_MAX;
+	rd->authorized_nomination_level = SD_LV_NONE;
+
+	if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP) {
+		struct sched_domain *sd;
+		enum sched_domain_level authorized_nomination_level =
+								SD_LV_NONE;
+
+		for_each_domain(first_cpu(*cpu_map), sd) {
+			if (!(sd->flags & SD_POWERSAVINGS_BALANCE))
+				continue;
+			authorized_nomination_level = sd->level;
+		}
+
+		rd->authorized_nomination_level = authorized_nomination_level;
+	}
+
+
 free_tmpmask:
 	free_cpumask_var(tmpmask);
 free_send_covered:
@@ -8027,6 +8086,9 @@ static ssize_t sched_power_savings_store(const char *buf, size_t count, int smt)
 	else
 		sched_mc_power_savings = level;
 
+	active_power_savings_level = max(sched_smt_power_savings,
+						sched_mc_power_savings);
+
 	arch_reinit_sched_domains();
 
 	return count;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5cc1c16..bddee3e 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1042,7 +1042,7 @@ static int wake_idle(int cpu, struct task_struct *p)
 	chosen_wakeup_cpu =
 		cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;
 
-	if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP &&
+	if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP &&
 		idle_cpu(cpu) && idle_cpu(this_cpu) &&
 		p->mm && !(p->flags & PF_KTHREAD) &&
 		cpu_isset(chosen_wakeup_cpu, p->cpus_allowed))


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2 3/3] sched: Fix sd_parent_degenerate for SD_POWERSAVINGS_BALANCE.
  2009-03-03 11:51 [PATCH v2 0/3] sched: Extend sched_mc/smt_power_savings framework Gautham R Shenoy
  2009-03-03 11:51 ` [PATCH 2 1/3] sched: code cleanup - sd_power_saving_flags(), sd_balance_for_mc/package_power() Gautham R Shenoy
  2009-03-03 11:51 ` [PATCH 2 2/3] sched: Fix the wakeup nomination for sched_mc/smt_power_savings Gautham R Shenoy
@ 2009-03-03 11:51 ` Gautham R Shenoy
  2009-03-03 12:21 ` [PATCH v2 0/3] sched: Extend sched_mc/smt_power_savings framework Peter Zijlstra
  3 siblings, 0 replies; 14+ messages in thread
From: Gautham R Shenoy @ 2009-03-03 11:51 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan, Balbir Singh, Peter Zijlstra,
	Ingo Molnar, Suresh Siddha
  Cc: Dipankar Sarma, efault, andi, linux-kernel, Gautham R Shenoy,
	Vaidyanathan Srinivasan

Currently a sched_domain having a single group can be prevented from getting
degenerated if it contains a SD_POWERSAVINGS_BALANCE flag. But since it has
only one group, it won't have any scope for performing powersavings balance as
it does not have a sibling group to pull from.

Apart from not provide any powersavings, it also fails to participate
in normal load-balancing.

Fix this by allowing such a sched_domain to degenerate and pass on the
responsibility of performing the POWERSAVINGS_BALANCE to it's parent domain.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---

 kernel/sched.c |   14 ++++++++++++++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 7d22a96..765e3e9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6957,6 +6957,20 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
 				SD_SHARE_PKG_RESOURCES);
 		if (nr_node_ids == 1)
 			pflags &= ~SD_SERIALIZE;
+
+		/*
+		 * If the only flag that is preventing us from degenerating
+		 * a domain with a single group is SD_POWERSAVINGS_BALANCE
+		 * check if it can be transferred to the new parent,
+		 * and degenerate this domain. With a single
+		 * group, it anyway can't contribute to power-aware load
+		 * balancing.
+		 */
+		if (pflags & SD_POWERSAVINGS_BALANCE && parent->parent) {
+			pflags &= ~SD_POWERSAVINGS_BALANCE;
+			parent->parent->flags |=
+				sd_power_saving_flags(parent->level);
+		}
 	}
 	if (~cflags & pflags)
 		return 0;


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 0/3] sched: Extend sched_mc/smt_power_savings framework
  2009-03-03 11:51 [PATCH v2 0/3] sched: Extend sched_mc/smt_power_savings framework Gautham R Shenoy
                   ` (2 preceding siblings ...)
  2009-03-03 11:51 ` [PATCH 2 3/3] sched: Fix sd_parent_degenerate for SD_POWERSAVINGS_BALANCE Gautham R Shenoy
@ 2009-03-03 12:21 ` Peter Zijlstra
  2009-03-03 15:25   ` Vaidyanathan Srinivasan
  3 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2009-03-03 12:21 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Vaidyanathan Srinivasan, Balbir Singh, Ingo Molnar,
	Suresh Siddha, Dipankar Sarma, efault, andi, linux-kernel

On Tue, 2009-03-03 at 17:21 +0530, Gautham R Shenoy wrote:

> Background
> ------------------------------------------------------------------
> On machines with on-chip memory controller, each physical CPU
> package forms a NUMA node and the CPU level sched_domain will have
> only one group.  This prevents any form of power saving balance across
> these nodes.  Enabling the sched_mc_power_savings tunable to work as
> designed on these new single CPU NUMA node machines will help task
> consolidation and save power as we did in other multi core multi
> socket platforms.
> 
> Consolidation across NODES have implications of cross-node memory
> access and other NUMA locality issues.  Even under such constraints
> there could be scope for power savings vs performance tradeoffs and
> hence making the sched_mc_powersavings work as expected on these
> platform is justified.
> 
> sched_mc/smt_power_savings is still a tunable and power savings benefits
> and performance would vary depending on the workload and the system
> topology and hardware features.
> 
> The patch series has been tested on a 2-Socket Quad-core Dual threaded
> box with kernbench as the workload, varying the number of threads.
> 

> +------------------------------------------------------------------------+
> |Test: make -j8                                                          |
> +-----------+----------+--------+---------+-------------+----------------+
> | sched_smt | sched_mc | %Power | Time    | % Package 0 | % Package 1    |
> |           |          |        |         |     idle    |   idle         |
> +-----------+----------+--------+---------+-------------+----------------+
> |           |          |        |         |Core0: 18.17 |Core4: 33.38    |
> |           |          |        |         +-------------+----------------+
> |           |          |        |         |Core1: 34.62 |Core5: 19.58    |
> |     0     |    0     |  100   |  63.82  +-------------+----------------+
> |           |          |        |         |Core2: 31.99 |Core6: 32.35    |
> |           |          |        |         +-------------+----------------+
> |           |          |        |         |Core3: 34.59 |Core7: 29.99    |
> +-----------+----------+--------+---------+-------------+----------------+

> +-----------+----------+--------+---------+-------------+----------------+
> |           |          |        |         |Core0: 16.65 |Core4: 79.04    |
> |           |          |        |         +-------------+----------------+
> |           |          |        |         |Core1: 26.74 |Core5: 50.98    |
> |     2     |    2     |  89.58 |  82.83  +-------------+----------------+
> |           |          |        |         |Core2: 30.42 |Core6: 81.33    |
> |           |          |        |         +-------------+----------------+
> |           |          |        |         |Core3: 35.57 |Core7: 90.03    |
> +-----------+----------+--------+---------+-------------+----------------+

So while we take longer (~20s) we save about 10% in power?

It would be good to mention something about how power usage is measured.

Furthermore, do we really need those separate mc/smt power savings
settings? -- It appears to me we ought to consolidate some of that and
provide a single knob to save power.

> ---
> 
> Gautham R Shenoy (3):
>       sched: Fix sd_parent_degenerate for SD_POWERSAVINGS_BALANCE.
>       sched: Fix the wakeup nomination for sched_mc/smt_power_savings.
>       sched: code cleanup - sd_power_saving_flags(), sd_balance_for_mc/package_power()

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

A few nits on patch #2, please follow up with incremental cleanups.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2 2/3] sched: Fix the wakeup nomination for sched_mc/smt_power_savings.
  2009-03-03 11:51 ` [PATCH 2 2/3] sched: Fix the wakeup nomination for sched_mc/smt_power_savings Gautham R Shenoy
@ 2009-03-03 12:23   ` Peter Zijlstra
  2009-03-03 13:59     ` Ingo Molnar
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2009-03-03 12:23 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Vaidyanathan Srinivasan, Balbir Singh, Ingo Molnar,
	Suresh Siddha, Dipankar Sarma, efault, andi, linux-kernel

On Tue, 2009-03-03 at 17:21 +0530, Gautham R Shenoy wrote:
> +/* Assign the sched-domain level which can nominate preferred wake-up cpu */
> +       rd->sched_mc_preferred_wakeup_cpu = UINT_MAX;
> +       rd->authorized_nomination_level = SD_LV_NONE;
> +
> +       if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP) {
> +               struct sched_domain *sd;
> +               enum sched_domain_level authorized_nomination_level =
> +                                                               SD_LV_NONE;
> +
> +               for_each_domain(first_cpu(*cpu_map), sd) {
> +                       if (!(sd->flags & SD_POWERSAVINGS_BALANCE))
> +                               continue;
> +                       authorized_nomination_level = sd->level;
> +               }
> +
> +               rd->authorized_nomination_level = authorized_nomination_level;
> +       }

Very odd looking comments there, and that enum init wrapping looks
weird. Either exceed 80 chars, or write it in a second line like:

 enum sched_domain_level authorized_nomination_level;

 authorized_nomination_level = SD_LV_NONE;


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2 2/3] sched: Fix the wakeup nomination for sched_mc/smt_power_savings.
  2009-03-03 12:23   ` Peter Zijlstra
@ 2009-03-03 13:59     ` Ingo Molnar
  2009-03-03 15:45       ` Vaidyanathan Srinivasan
  0 siblings, 1 reply; 14+ messages in thread
From: Ingo Molnar @ 2009-03-03 13:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gautham R Shenoy, Vaidyanathan Srinivasan, Balbir Singh,
	Suresh Siddha, Dipankar Sarma, efault, andi, linux-kernel


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Tue, 2009-03-03 at 17:21 +0530, Gautham R Shenoy wrote:
> > +/* Assign the sched-domain level which can nominate preferred wake-up cpu */
> > +       rd->sched_mc_preferred_wakeup_cpu = UINT_MAX;
> > +       rd->authorized_nomination_level = SD_LV_NONE;
> > +
> > +       if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP) {
> > +               struct sched_domain *sd;
> > +               enum sched_domain_level authorized_nomination_level =
> > +                                                               SD_LV_NONE;
> > +
> > +               for_each_domain(first_cpu(*cpu_map), sd) {
> > +                       if (!(sd->flags & SD_POWERSAVINGS_BALANCE))
> > +                               continue;
> > +                       authorized_nomination_level = sd->level;
> > +               }
> > +
> > +               rd->authorized_nomination_level = authorized_nomination_level;
> > +       }
> 
> Very odd looking comments there, and that enum init wrapping looks
> weird. Either exceed 80 chars, or write it in a second line like:
> 
>  enum sched_domain_level authorized_nomination_level;
> 
>  authorized_nomination_level = SD_LV_NONE;

I think find_busiest_group() needs to be split up into several 
helper functions first, before we add more to it. I.e. first a 
couple of cleanup patches that factor it out into 3-4 helper 
functions plus a really easy-to-read find_busiest_group() main 
function.

Then can we do the above change - and i bet we'll win at least 
one indentation level as well so that weird line break goes 
away.

	Ingo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 0/3] sched: Extend sched_mc/smt_power_savings framework
  2009-03-03 12:21 ` [PATCH v2 0/3] sched: Extend sched_mc/smt_power_savings framework Peter Zijlstra
@ 2009-03-03 15:25   ` Vaidyanathan Srinivasan
  2009-03-03 15:28     ` Peter Zijlstra
  0 siblings, 1 reply; 14+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-03-03 15:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gautham R Shenoy, Balbir Singh, Ingo Molnar, Suresh Siddha,
	Dipankar Sarma, efault, andi, linux-kernel

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2009-03-03 13:21:57]:

> On Tue, 2009-03-03 at 17:21 +0530, Gautham R Shenoy wrote:
> 
> > Background
> > ------------------------------------------------------------------
> > On machines with on-chip memory controller, each physical CPU
> > package forms a NUMA node and the CPU level sched_domain will have
> > only one group.  This prevents any form of power saving balance across
> > these nodes.  Enabling the sched_mc_power_savings tunable to work as
> > designed on these new single CPU NUMA node machines will help task
> > consolidation and save power as we did in other multi core multi
> > socket platforms.
> > 
> > Consolidation across NODES have implications of cross-node memory
> > access and other NUMA locality issues.  Even under such constraints
> > there could be scope for power savings vs performance tradeoffs and
> > hence making the sched_mc_powersavings work as expected on these
> > platform is justified.
> > 
> > sched_mc/smt_power_savings is still a tunable and power savings benefits
> > and performance would vary depending on the workload and the system
> > topology and hardware features.
> > 
> > The patch series has been tested on a 2-Socket Quad-core Dual threaded
> > box with kernbench as the workload, varying the number of threads.
> > 
> 
> > +------------------------------------------------------------------------+
> > |Test: make -j8                                                          |
> > +-----------+----------+--------+---------+-------------+----------------+
> > | sched_smt | sched_mc | %Power | Time    | % Package 0 | % Package 1    |
> > |           |          |        |         |     idle    |   idle         |
> > +-----------+----------+--------+---------+-------------+----------------+
> > |           |          |        |         |Core0: 18.17 |Core4: 33.38    |
> > |           |          |        |         +-------------+----------------+
> > |           |          |        |         |Core1: 34.62 |Core5: 19.58    |
> > |     0     |    0     |  100   |  63.82  +-------------+----------------+
> > |           |          |        |         |Core2: 31.99 |Core6: 32.35    |
> > |           |          |        |         +-------------+----------------+
> > |           |          |        |         |Core3: 34.59 |Core7: 29.99    |
> > +-----------+----------+--------+---------+-------------+----------------+
> 
> > +-----------+----------+--------+---------+-------------+----------------+
> > |           |          |        |         |Core0: 16.65 |Core4: 79.04    |
> > |           |          |        |         +-------------+----------------+
> > |           |          |        |         |Core1: 26.74 |Core5: 50.98    |
> > |     2     |    2     |  89.58 |  82.83  +-------------+----------------+
> > |           |          |        |         |Core2: 30.42 |Core6: 81.33    |
> > |           |          |        |         +-------------+----------------+
> > |           |          |        |         |Core3: 35.57 |Core7: 90.03    |
> > +-----------+----------+--------+---------+-------------+----------------+
> 
> So while we take longer (~20s) we save about 10% in power?

Yes that is correct. Since we are consolidating on sibling threads the
performance goes down.  Also this degradation is very much workload
dependent.  If the workloads can benefit a lot from sibling threads,
then we will be able to save power with modest performance
degradation.

This tunable is mainly focusing on power savings.  If performance
improves, then it is a bonus :)
 
> It would be good to mention something about how power usage is measured.

Power usage is measured by computing the energy consumed over the
benchmark duration and then finding average power by dividing
energy/time.  The relative power consumption is for the entire system.

> Furthermore, do we really need those separate mc/smt power savings
> settings? -- It appears to me we ought to consolidate some of that and
> provide a single knob to save power.

Yes, having one sched_power_savings will definitely help.  However,
mapping the various combination of settings to a single knob that will
provide consistent behavior across workloads and system configuration
is a challenge.
 
> > ---
> > 
> > Gautham R Shenoy (3):
> >       sched: Fix sd_parent_degenerate for SD_POWERSAVINGS_BALANCE.
> >       sched: Fix the wakeup nomination for sched_mc/smt_power_savings.
> >       sched: code cleanup - sd_power_saving_flags(), sd_balance_for_mc/package_power()
> 
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> 
> A few nits on patch #2, please follow up with incremental cleanups.

Thanks for the review comments and ack.

--Vaidy


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 0/3] sched: Extend sched_mc/smt_power_savings framework
  2009-03-03 15:25   ` Vaidyanathan Srinivasan
@ 2009-03-03 15:28     ` Peter Zijlstra
  2009-03-04 11:06       ` Vaidyanathan Srinivasan
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2009-03-03 15:28 UTC (permalink / raw)
  To: svaidy
  Cc: Gautham R Shenoy, Balbir Singh, Ingo Molnar, Suresh Siddha,
	Dipankar Sarma, efault, andi, linux-kernel

On Tue, 2009-03-03 at 20:55 +0530, Vaidyanathan Srinivasan wrote:
>  
> > It would be good to mention something about how power usage is measured.
> 
> Power usage is measured by computing the energy consumed over the
> benchmark duration and then finding average power by dividing
> energy/time.  The relative power consumption is for the entire system.

Is this measured using an external (indepedent) power monitor, or
looking at system internal power usage stats?

Have you compared these two methods, if so, do they give comparable
results?

> > Furthermore, do we really need those separate mc/smt power savings
> > settings? -- It appears to me we ought to consolidate some of that and
> > provide a single knob to save power.
> 
> Yes, having one sched_power_savings will definitely help.  However,
> mapping the various combination of settings to a single knob that will
> provide consistent behavior across workloads and system configuration
> is a challenge.

Would it be an option to provide a single knob for users and have some
fine grained feature set for developing/debugging, much like we have
sched_features?


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2 2/3] sched: Fix the wakeup nomination for sched_mc/smt_power_savings.
  2009-03-03 13:59     ` Ingo Molnar
@ 2009-03-03 15:45       ` Vaidyanathan Srinivasan
  2009-03-03 15:48         ` Peter Zijlstra
  0 siblings, 1 reply; 14+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-03-03 15:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Gautham R Shenoy, Balbir Singh, Suresh Siddha,
	Dipankar Sarma, efault, andi, linux-kernel

* Ingo Molnar <mingo@elte.hu> [2009-03-03 14:59:07]:

> 
> * Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > On Tue, 2009-03-03 at 17:21 +0530, Gautham R Shenoy wrote:
> > > +/* Assign the sched-domain level which can nominate preferred wake-up cpu */
> > > +       rd->sched_mc_preferred_wakeup_cpu = UINT_MAX;
> > > +       rd->authorized_nomination_level = SD_LV_NONE;
> > > +
> > > +       if (active_power_savings_level >= POWERSAVINGS_BALANCE_WAKEUP) {
> > > +               struct sched_domain *sd;
> > > +               enum sched_domain_level authorized_nomination_level =
> > > +                                                               SD_LV_NONE;
> > > +
> > > +               for_each_domain(first_cpu(*cpu_map), sd) {
> > > +                       if (!(sd->flags & SD_POWERSAVINGS_BALANCE))
> > > +                               continue;
> > > +                       authorized_nomination_level = sd->level;
> > > +               }
> > > +
> > > +               rd->authorized_nomination_level = authorized_nomination_level;
> > > +       }
> > 
> > Very odd looking comments there, and that enum init wrapping looks
> > weird. Either exceed 80 chars, or write it in a second line like:
> > 
> >  enum sched_domain_level authorized_nomination_level;
> > 
> >  authorized_nomination_level = SD_LV_NONE;
> 
> I think find_busiest_group() needs to be split up into several 
> helper functions first, before we add more to it. I.e. first a 
> couple of cleanup patches that factor it out into 3-4 helper 
> functions plus a really easy-to-read find_busiest_group() main 
> function.

Hi Ingo,

I had earlier posted cleanup for find_busiest_group(), but Peter
mentioned that he is changing the cpu_power infrastructure and I can
rework the cleanup after that.

[RFC PATCH v1 0/5] Modular find_busiest_group()
V1: http://lkml.org/lkml/2008/9/24/201
V2: http://lkml.org/lkml/2008/10/9/176

Peter did post an RFD for his new proposal:
http://lkml.org/lkml/2008/10/14/148

I think we can revive the discussion thread and start cleaning up.

> Then can we do the above change - and i bet we'll win at least 
> one indentation level as well so that weird line break goes 
> away.

The above code is in __build_sched_domains() and I think we can
shorten the variable names and make the code look better.  Indentation
level here is not as bad as in find_busiest_group().

Thanks,
Vaidy


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2 2/3] sched: Fix the wakeup nomination for sched_mc/smt_power_savings.
  2009-03-03 15:45       ` Vaidyanathan Srinivasan
@ 2009-03-03 15:48         ` Peter Zijlstra
  2009-03-03 16:26           ` Ingo Molnar
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2009-03-03 15:48 UTC (permalink / raw)
  To: svaidy
  Cc: Ingo Molnar, Gautham R Shenoy, Balbir Singh, Suresh Siddha,
	Dipankar Sarma, efault, andi, linux-kernel

On Tue, 2009-03-03 at 21:15 +0530, Vaidyanathan Srinivasan wrote:

> I had earlier posted cleanup for find_busiest_group(), but Peter
> mentioned that he is changing the cpu_power infrastructure and I can
> rework the cleanup after that.

Right, I need to get back to pondering that problem. The whole 1/x thing
doesn't really work.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2 2/3] sched: Fix the wakeup nomination for sched_mc/smt_power_savings.
  2009-03-03 15:48         ` Peter Zijlstra
@ 2009-03-03 16:26           ` Ingo Molnar
  0 siblings, 0 replies; 14+ messages in thread
From: Ingo Molnar @ 2009-03-03 16:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: svaidy, Gautham R Shenoy, Balbir Singh, Suresh Siddha,
	Dipankar Sarma, efault, andi, linux-kernel


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Tue, 2009-03-03 at 21:15 +0530, Vaidyanathan Srinivasan wrote:
> 
> > I had earlier posted cleanup for find_busiest_group(), but 
> > Peter mentioned that he is changing the cpu_power 
> > infrastructure and I can rework the cleanup after that.
> 
> Right, I need to get back to pondering that problem. The whole 
> 1/x thing doesn't really work.

well, the cleanups are needed in any case before we add more 
changes to it - that function is a monster.

	Ingo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 0/3] sched: Extend sched_mc/smt_power_savings framework
  2009-03-03 15:28     ` Peter Zijlstra
@ 2009-03-04 11:06       ` Vaidyanathan Srinivasan
  2009-03-04 11:38         ` Gautham R Shenoy
  0 siblings, 1 reply; 14+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-03-04 11:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gautham R Shenoy, Balbir Singh, Ingo Molnar, Suresh Siddha,
	Dipankar Sarma, efault, andi, linux-kernel

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2009-03-03 16:28:30]:

> On Tue, 2009-03-03 at 20:55 +0530, Vaidyanathan Srinivasan wrote:
> >  
> > > It would be good to mention something about how power usage is measured.
> > 
> > Power usage is measured by computing the energy consumed over the
> > benchmark duration and then finding average power by dividing
> > energy/time.  The relative power consumption is for the entire system.
> 
> Is this measured using an external (indepedent) power monitor, or
> looking at system internal power usage stats?
> 
> Have you compared these two methods, if so, do they give comparable
> results?

Since we are doing relative power comparison, the absolute power value
and the method would not affect the results.

Your question is more relevant for SPECPower type of benchmarks where
the score depends on the average power consumed and accuracy or method
of measurement can affect the score.
 
> > > Furthermore, do we really need those separate mc/smt power savings
> > > settings? -- It appears to me we ought to consolidate some of that and
> > > provide a single knob to save power.
> > 
> > Yes, having one sched_power_savings will definitely help.  However,
> > mapping the various combination of settings to a single knob that will
> > provide consistent behavior across workloads and system configuration
> > is a challenge.
> 
> Would it be an option to provide a single knob for users and have some
> fine grained feature set for developing/debugging, much like we have
> sched_features?

Yes, that will be very good.  Let me explore if sched_features can
include the power saving attributes while the sched_power_savings can
be a simple and coarse control for end-users.

--Vaidy


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 0/3] sched: Extend sched_mc/smt_power_savings framework
  2009-03-04 11:06       ` Vaidyanathan Srinivasan
@ 2009-03-04 11:38         ` Gautham R Shenoy
  0 siblings, 0 replies; 14+ messages in thread
From: Gautham R Shenoy @ 2009-03-04 11:38 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: Peter Zijlstra, Balbir Singh, Ingo Molnar, Suresh Siddha,
	Dipankar Sarma, efault, andi, linux-kernel

On Wed, Mar 04, 2009 at 04:36:43PM +0530, Vaidyanathan Srinivasan wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> [2009-03-03 16:28:30]:
> 
> > On Tue, 2009-03-03 at 20:55 +0530, Vaidyanathan Srinivasan wrote:
> > >  
> > > > It would be good to mention something about how power usage is measured.
> > > 
> > > Power usage is measured by computing the energy consumed over the
> > > benchmark duration and then finding average power by dividing
> > > energy/time.  The relative power consumption is for the entire system.
> > 
> > Is this measured using an external (indepedent) power monitor, or
> > looking at system internal power usage stats?
> > 
> > Have you compared these two methods, if so, do they give comparable
> > results?
> 
> Since we are doing relative power comparison, the absolute power value
> and the method would not affect the results.
> 
> Your question is more relevant for SPECPower type of benchmarks where
> the score depends on the average power consumed and accuracy or method
> of measurement can affect the score.
> 
> > > > Furthermore, do we really need those separate mc/smt power savings
> > > > settings? -- It appears to me we ought to consolidate some of that and
> > > > provide a single knob to save power.
> > > 
> > > Yes, having one sched_power_savings will definitely help.  However,
> > > mapping the various combination of settings to a single knob that will
> > > provide consistent behavior across workloads and system configuration
> > > is a challenge.
> > 
> > Would it be an option to provide a single knob for users and have some
> > fine grained feature set for developing/debugging, much like we have
> > sched_features?
> 
> Yes, that will be very good.  Let me explore if sched_features can
> include the power saving attributes while the sched_power_savings can
> be a simple and coarse control for end-users.

Currently each of sched_smt_power_savings and sched_mc_power_savings
tunables can take the values (0, 1, 2).

However, based on the combinations of
(sched_smt_power_savings, sched_mc_power_savings) values,
the system can be in one of the 5 (and not 9) states mentioned below:
[in the increasing order of aggressiveness of the task consolidation]

0. { (0, 0) }
1. { (0, 1) }
2. { (0, 2) }
3. { (1, 0), (1, 1) }
4. { (1, 2), (2, 0), (2, 1), (2, 2)}

4 and 5 are because sched_smt_power_savings implies sched_mc_power_savings.

So, on a system that has multiple cores and multiple threads, the single
knob can take values (0, 1, 2, 3, 4).

On systems that have only multiple cores, the knob can take values
(0, 1, 2).

> --Vaidy

-- 
Thanks and Regards
gautham

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2009-03-04 11:38 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-03-03 11:51 [PATCH v2 0/3] sched: Extend sched_mc/smt_power_savings framework Gautham R Shenoy
2009-03-03 11:51 ` [PATCH 2 1/3] sched: code cleanup - sd_power_saving_flags(), sd_balance_for_mc/package_power() Gautham R Shenoy
2009-03-03 11:51 ` [PATCH 2 2/3] sched: Fix the wakeup nomination for sched_mc/smt_power_savings Gautham R Shenoy
2009-03-03 12:23   ` Peter Zijlstra
2009-03-03 13:59     ` Ingo Molnar
2009-03-03 15:45       ` Vaidyanathan Srinivasan
2009-03-03 15:48         ` Peter Zijlstra
2009-03-03 16:26           ` Ingo Molnar
2009-03-03 11:51 ` [PATCH 2 3/3] sched: Fix sd_parent_degenerate for SD_POWERSAVINGS_BALANCE Gautham R Shenoy
2009-03-03 12:21 ` [PATCH v2 0/3] sched: Extend sched_mc/smt_power_savings framework Peter Zijlstra
2009-03-03 15:25   ` Vaidyanathan Srinivasan
2009-03-03 15:28     ` Peter Zijlstra
2009-03-04 11:06       ` Vaidyanathan Srinivasan
2009-03-04 11:38         ` Gautham R Shenoy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).