linux-kernel.vger.kernel.org archive mirror
* [PATCH v7 0/8] Tunable sched_mc_power_savings=n
@ 2008-12-18 17:55 Vaidyanathan Srinivasan
  2008-12-18 17:56 ` [PATCH v7 1/8] sched: convert BALANCE_FOR_xx_POWER to inline functions Vaidyanathan Srinivasan
                   ` (9 more replies)
  0 siblings, 10 replies; 62+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-12-18 17:55 UTC (permalink / raw)
  To: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi, Peter Zijlstra
  Cc: Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky, Gregory Haskins, Pavel Machek, Andrew Morton,
	Vaidyanathan Srinivasan

Hi,

The existing power-saving load balancer CONFIG_SCHED_MC attempts to run
the system's workload on a minimum number of CPU packages and tries
to keep the rest of the CPU packages idle for longer durations.
Consolidating workloads onto fewer packages thus helps the other
packages stay idle and save power.  The current implementation is very
conservative and does not work effectively across different workloads.
The initial idea of a tunable sched_mc_power_savings=n was proposed to
enable tuning of the power-saving load balancer based on the system
configuration, workload characteristics and end-user requirements.

The power savings and performance of a given workload on an
underutilised system can be controlled by writing 0, 1 or 2 to
/sys/devices/system/cpu/sched_mc_power_savings, with 0 selecting
highest performance with least power savings, and level 2 indicating
maximum power savings even at the cost of slight performance
degradation.

Please refer to the following discussions and article for details.

[1]Making power policy just work
http://lwn.net/Articles/287924/ 

[2][RFC v1] Tunable sched_mc_power_savings=n
http://lwn.net/Articles/287882/
                                                                                                               
v2: http://lwn.net/Articles/297306/
v3: http://lkml.org/lkml/2008/11/10/260
v4: http://lkml.org/lkml/2008/11/21/47
v5: http://lkml.org/lkml/2008/12/11/178
v6: http://lwn.net/Articles/311830/

The following patch series demonstrates the basic framework for
tunable sched_mc_power_savings.

This version of the patch incorporates comments and feedback received
on the previous post from Andrew Morton.

Changes from v6:
----------------
* Convert BALANCE_FOR_xx_POWER and related macros to inline functions
  based on comments from Andrew and Ingo.
* Ran basic kernelbench test and did not see any performance variation
  due to the changes.

Changes from v5:
----------------
* Fixed the sscanf bug and the check for (p->flags & PF_KTHREAD)
* Dropped the RFC prefix to indicate that the patch is ready for
  testing and inclusion
* Patch series against 2.6.28-rc8 kernel

Changes from v4:
----------------

* Conditionally added the SD_BALANCE_NEWIDLE flag to the MC- and
  CPU-level domains to help task consolidation when sched_mc > 0.
  Removing this flag significantly reduces the number of load_balance
  calls and considerably slows down task consolidation, and in effect
  power savings.

  Ingo and Mike Galbraith removed this flag to improve performance for
  select benchmarks.  I have added the flag only when power savings
  are requested, and restored full performance mode when sched_mc=0.

* Fixed a few whitespace issues
* Patch series against 2.6.28-rc8 kernel

Changes from v3:
----------------

* Fixed the locking code with double_lock_balance() in
  active-balance-newidle.patch

* Moved sched_mc_preferred_wakeup_cpu to the root_domain structure so
  that each partitioned sched domain gets an independently nominated cpu

* More comments in active-balance-newidle.patch

* Reverted the sched MC-level and CPU-level fine tuning from
  v2.6.28-rc4 for now.  These affect consolidation since
  SD_BALANCE_NEWIDLE is removed.  I will rework the tuning in the next
  iteration to selectively enable them at sched_mc=2

* Patch series on 2.6.28-rc6 kernel  

Changes from v2:
----------------

* Fixed a locking order issue in active-balance new-idle
* Moved the wakeup biasing code to the wake_idle() function and
  preserved the wake_affine function.  The previous version would
  break wake affine in order to aggressively consolidate tasks
* Removed the sched_mc_preferred_wakeup_cpu global variable; moved it
  to doms_cur/dattr_cur and added a per_cpu pointer to the appropriate
  storage in the partitioned sched domain.  This change is needed to
  preserve functionality in case of partitioned sched domains
* Patch on 2.6.28-rc3 kernel

Results:
--------

Basic functionality of the code has not changed and the power vs
performance benefits for kernbench are similar to the ones posted
earlier.

KERNBENCH Runs: make -j4 on an x86 8-core system (dual socket,
quad-core cpu packages)

SchedMC Run Time     Package Idle    Energy  Power
0	81.68	     52.83% 54.71%  1.00x J 1.00y W
1	80.70	     36.62% 70.11%  0.95x J 0.96y W
2	74.95	     19.53% 85.92%  0.90x J 0.98y W

The results are marginally better than the previous version of the
patch series which could be within the test variation.

Please feel free to test, and let me know your comments and feedback.
I will post more experimental results with various benchmarks.

Hi Ingo,

Please review and include in sched development tree for testing.

Thanks,
Vaidy

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>

---

Gautham R Shenoy (1):
      sched: Framework for sched_mc/smt_power_savings=N

Vaidyanathan Srinivasan (7):
      sched: idle_balance() does not call load_balance_newidle()
      sched: add SD_BALANCE_NEWIDLE at MC and CPU level for sched_mc>0
      sched: activate active load balancing in new idle cpus
      sched: bias task wakeups to preferred semi-idle packages
      sched: nominate preferred wakeup cpu
      sched: favour lower logical cpu number for sched_mc balance
      sched: convert BALANCE_FOR_xx_POWER to inline functions


 include/linux/sched.h    |   57 +++++++++++++++++++++++++----
 include/linux/topology.h |    6 ++-
 kernel/sched.c           |   89 +++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched_fair.c      |   18 +++++++++
 4 files changed, 153 insertions(+), 17 deletions(-)

-- 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v7 1/8] sched: convert BALANCE_FOR_xx_POWER to inline functions
  2008-12-18 17:55 [PATCH v7 0/8] Tunable sched_mc_power_savings=n Vaidyanathan Srinivasan
@ 2008-12-18 17:56 ` Vaidyanathan Srinivasan
  2008-12-18 17:56 ` [PATCH v7 2/8] sched: Framework for sched_mc/smt_power_savings=N Vaidyanathan Srinivasan
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 62+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-12-18 17:56 UTC (permalink / raw)
  To: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi, Peter Zijlstra
  Cc: Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky, Gregory Haskins, Pavel Machek, Andrew Morton,
	Vaidyanathan Srinivasan

BALANCE_FOR_MC_POWER and similar macros defined in sched.h are
not constants: they contain condition checks and a significant
amount of code that is not suitable to be contained in a macro.
There could also be side effects on the expressions passed to
some of them, such as test_sd_parent().

This patch converts all complex macros related to power savings
balance to inline functions.
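The hazard can be seen in a small user-space sketch.  MAX_MACRO and
max_inline below are illustrative stand-ins, not code from the patch:
a macro that names its argument more than once replays the argument's
side effects, while an inline function evaluates it exactly once.

```c
#include <assert.h>

/* Classic double-evaluation hazard: 'a' appears twice in the expansion. */
#define MAX_MACRO(a, b) ((a) > (b) ? (a) : (b))

static inline int max_inline(int a, int b)
{
	return a > b ? a : b;
}

/* Each returns the final value of i after one "max(i++, 1)" call. */
int i_after_macro(void)
{
	int i = 2;
	int m = MAX_MACRO(i++, 1);	/* i++ expands and runs twice */
	(void)m;
	return i;
}

int i_after_inline(void)
{
	int i = 2;
	int m = max_inline(i++, 1);	/* i++ runs exactly once */
	(void)m;
	return i;
}
```

With the macro, i ends at 4 (incremented in the comparison and again in
the selected branch); with the inline function it ends at 3.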

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---

 include/linux/sched.h    |   33 ++++++++++++++++++++++++---------
 include/linux/topology.h |    4 ++--
 2 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 55e30d1..311122f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -764,15 +764,23 @@ enum cpu_idle_type {
 #define SD_SERIALIZE		1024	/* Only a single load balancing instance */
 #define SD_WAKE_IDLE_FAR	2048	/* Gain latency sacrificing cache hit */
 
-#define BALANCE_FOR_MC_POWER	\
-	(sched_smt_power_savings ? SD_POWERSAVINGS_BALANCE : 0)
+extern int sched_mc_power_savings, sched_smt_power_savings;
+
+static inline int sd_balance_for_mc_power(void)
+{
+	if (sched_smt_power_savings)
+		return SD_POWERSAVINGS_BALANCE;
 
-#define BALANCE_FOR_PKG_POWER	\
-	((sched_mc_power_savings || sched_smt_power_savings) ?	\
-	 SD_POWERSAVINGS_BALANCE : 0)
+	return 0;
+}
 
-#define test_sd_parent(sd, flag)	((sd->parent &&		\
-					 (sd->parent->flags & flag)) ? 1 : 0)
+static inline int sd_balance_for_package_power(void)
+{
+	if (sched_mc_power_savings | sched_smt_power_savings)
+		return SD_POWERSAVINGS_BALANCE;
+
+	return 0;
+}
 
 
 struct sched_group {
@@ -1358,6 +1366,15 @@ struct task_struct {
 	struct list_head	*scm_work_list;
 };
 
+/* Test a flag in parent sched domain */
+static inline int test_sd_parent(struct sched_domain *sd, int flag)
+{
+	if (sd->parent && (sd->parent->flags & flag))
+		return 1;
+
+	return 0;
+}
+
 /*
  * Priority of a process goes from 0..MAX_PRIO-1, valid RT
  * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
@@ -2215,8 +2232,6 @@ __trace_special(void *__tr, void *__data,
 extern long sched_setaffinity(pid_t pid, const cpumask_t *new_mask);
 extern long sched_getaffinity(pid_t pid, cpumask_t *mask);
 
-extern int sched_mc_power_savings, sched_smt_power_savings;
-
 extern void normalize_rt_tasks(void);
 
 #ifdef CONFIG_GROUP_SCHED
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 117f1b7..7f10439 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -125,7 +125,7 @@ void arch_update_cpu_topology(void);
 				| SD_WAKE_AFFINE	\
 				| SD_WAKE_BALANCE	\
 				| SD_SHARE_PKG_RESOURCES\
-				| BALANCE_FOR_MC_POWER,	\
+				| sd_balance_for_mc_power(),\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
 }
@@ -150,7 +150,7 @@ void arch_update_cpu_topology(void);
 				| SD_BALANCE_FORK	\
 				| SD_WAKE_AFFINE	\
 				| SD_WAKE_BALANCE	\
-				| BALANCE_FOR_PKG_POWER,\
+				| sd_balance_for_package_power(),\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
 }



* [PATCH v7 2/8] sched: Framework for sched_mc/smt_power_savings=N
  2008-12-18 17:55 [PATCH v7 0/8] Tunable sched_mc_power_savings=n Vaidyanathan Srinivasan
  2008-12-18 17:56 ` [PATCH v7 1/8] sched: convert BALANCE_FOR_xx_POWER to inline functions Vaidyanathan Srinivasan
@ 2008-12-18 17:56 ` Vaidyanathan Srinivasan
  2008-12-18 17:56 ` [PATCH v7 3/8] sched: favour lower logical cpu number for sched_mc balance Vaidyanathan Srinivasan
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 62+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-12-18 17:56 UTC (permalink / raw)
  To: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi, Peter Zijlstra
  Cc: Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky, Gregory Haskins, Pavel Machek, Andrew Morton,
	Vaidyanathan Srinivasan

From: Gautham R Shenoy <ego@in.ibm.com>

Currently the sched_mc/smt_power_savings variable is a boolean,
which either enables or disables topology-based power savings.
This patch extends the behaviour of the variable from boolean to
multivalued, such that, based on the value, we decide how
aggressively we want to perform power-savings balance at the
appropriate sched domain based on topology.

Variable levels of the power-saving tunable benefit the end user by
matching the required power savings vs performance trade-off to the
system configuration and workloads.

This version makes the sched_mc_power_savings global variable
take more values (0, 1, 2).  Later versions can have a single
tunable called sched_power_savings instead of
sched_{mc,smt}_power_savings.
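The store-side validation can be exercised in user space with a sketch
along these lines (power_savings_store here is a hypothetical
stand-alone mirror of the sysfs handler, not the kernel function):

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Levels mirror the series: 0 = performance, 1 = consolidate long
 * running tasks, 2 = also bias wakeups to semi-idle packages. */
enum powersavings_balance_level {
	POWERSAVINGS_BALANCE_NONE = 0,
	POWERSAVINGS_BALANCE_BASIC,
	POWERSAVINGS_BALANCE_WAKEUP,
	MAX_POWERSAVINGS_BALANCE_LEVELS
};

/* Returns 0 and stores the parsed level on success, -EINVAL otherwise. */
int power_savings_store(const char *buf, int *target)
{
	unsigned int level = 0;

	if (sscanf(buf, "%u", &level) != 1)
		return -EINVAL;
	if (level >= MAX_POWERSAVINGS_BALANCE_LEVELS)
		return -EINVAL;
	*target = level;
	return 0;
}
```

Non-numeric input fails the sscanf check, and any level at or above
MAX_POWERSAVINGS_BALANCE_LEVELS is rejected, so only 0, 1 and 2 get
through.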

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/sched.h |   11 +++++++++++
 kernel/sched.c        |   17 ++++++++++++++---
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 311122f..5fe906b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -764,6 +764,17 @@ enum cpu_idle_type {
 #define SD_SERIALIZE		1024	/* Only a single load balancing instance */
 #define SD_WAKE_IDLE_FAR	2048	/* Gain latency sacrificing cache hit */
 
+enum powersavings_balance_level {
+	POWERSAVINGS_BALANCE_NONE = 0,  /* No power saving load balance */
+	POWERSAVINGS_BALANCE_BASIC,	/* Fill one thread/core/package
+					 * first for long running threads
+					 */
+	POWERSAVINGS_BALANCE_WAKEUP,	/* Also bias task wakeups to semi-idle
+					 * cpu package for power savings
+					 */
+	MAX_POWERSAVINGS_BALANCE_LEVELS
+};
+
 extern int sched_mc_power_savings, sched_smt_power_savings;
 
 static inline int sd_balance_for_mc_power(void)
diff --git a/kernel/sched.c b/kernel/sched.c
index e4bb1dd..16897ab 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7879,14 +7879,25 @@ int arch_reinit_sched_domains(void)
 static ssize_t sched_power_savings_store(const char *buf, size_t count, int smt)
 {
 	int ret;
+	unsigned int level = 0;
 
-	if (buf[0] != '0' && buf[0] != '1')
+	if (sscanf(buf, "%u", &level) != 1)
+		return -EINVAL;
+
+	/*
+	 * level is always positive, so there is no need to check for
+	 * level < POWERSAVINGS_BALANCE_NONE, which is 0.
+	 * Open question: on a 0 or 1 byte write, does count
+	 * need to be checked as well?
+	 */
+
+	if (level >= MAX_POWERSAVINGS_BALANCE_LEVELS)
 		return -EINVAL;
 
 	if (smt)
-		sched_smt_power_savings = (buf[0] == '1');
+		sched_smt_power_savings = level;
 	else
-		sched_mc_power_savings = (buf[0] == '1');
+		sched_mc_power_savings = level;
 
 	ret = arch_reinit_sched_domains();
 



* [PATCH v7 3/8] sched: favour lower logical cpu number for sched_mc balance
  2008-12-18 17:55 [PATCH v7 0/8] Tunable sched_mc_power_savings=n Vaidyanathan Srinivasan
  2008-12-18 17:56 ` [PATCH v7 1/8] sched: convert BALANCE_FOR_xx_POWER to inline functions Vaidyanathan Srinivasan
  2008-12-18 17:56 ` [PATCH v7 2/8] sched: Framework for sched_mc/smt_power_savings=N Vaidyanathan Srinivasan
@ 2008-12-18 17:56 ` Vaidyanathan Srinivasan
  2008-12-18 17:56 ` [PATCH v7 4/8] sched: nominate preferred wakeup cpu Vaidyanathan Srinivasan
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 62+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-12-18 17:56 UTC (permalink / raw)
  To: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi, Peter Zijlstra
  Cc: Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky, Gregory Haskins, Pavel Machek, Andrew Morton,
	Vaidyanathan Srinivasan

When two groups have identical load, prefer to move load to the
lower logical cpu number rather than, as the present logic does, to
the higher logical number.

find_busiest_group() tries to look for a group_leader that has spare
capacity to take more tasks and free up an appropriate least-loaded
group.  In case of a tie, where the load is equal, the group with the
higher logical number is favoured.  This conflicts with the user-space
irqbalance daemon, which will move interrupts to the lower logical
number when system utilisation is very low.
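The tie-break this patch flips reduces to a small comparison.  The
struct and function below are stand-ins for sched_group and the
find_busiest_group() leader selection, not kernel types:

```c
#include <assert.h>

struct group_stat {
	int first_cpu;		/* first_cpu(group->cpumask) */
	int nr_running;		/* sum_nr_running over the group */
};

/* Pick the preferred group_leader between the current choice and a
 * candidate: more running tasks wins; on a tie, the lower first
 * logical cpu wins (the pre-patch logic used '>' here). */
const struct group_stat *pick_group_leader(const struct group_stat *cur,
					   const struct group_stat *cand)
{
	if (cand->nr_running > cur->nr_running)
		return cand;
	if (cand->nr_running == cur->nr_running &&
	    cand->first_cpu < cur->first_cpu)
		return cand;
	return cur;
}
```

With equal load, the group starting at cpu 0 is now preferred over the
one starting at cpu 4, matching irqbalance's bias toward low numbers.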

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 kernel/sched.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 16897ab..0b9bbbd 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3264,7 +3264,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
 		 */
 		if ((sum_nr_running < min_nr_running) ||
 		    (sum_nr_running == min_nr_running &&
-		     first_cpu(group->cpumask) <
+		     first_cpu(group->cpumask) >
 		     first_cpu(group_min->cpumask))) {
 			group_min = group;
 			min_nr_running = sum_nr_running;
@@ -3280,7 +3280,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
 		if (sum_nr_running <= group_capacity - 1) {
 			if (sum_nr_running > leader_nr_running ||
 			    (sum_nr_running == leader_nr_running &&
-			     first_cpu(group->cpumask) >
+			     first_cpu(group->cpumask) <
 			      first_cpu(group_leader->cpumask))) {
 				group_leader = group;
 				leader_nr_running = sum_nr_running;



* [PATCH v7 4/8] sched: nominate preferred wakeup cpu
  2008-12-18 17:55 [PATCH v7 0/8] Tunable sched_mc_power_savings=n Vaidyanathan Srinivasan
                   ` (2 preceding siblings ...)
  2008-12-18 17:56 ` [PATCH v7 3/8] sched: favour lower logical cpu number for sched_mc balance Vaidyanathan Srinivasan
@ 2008-12-18 17:56 ` Vaidyanathan Srinivasan
  2008-12-18 18:12   ` Balbir Singh
  2008-12-19 21:55   ` Andrew Morton
  2008-12-18 17:56 ` [PATCH v7 5/8] sched: bias task wakeups to preferred semi-idle packages Vaidyanathan Srinivasan
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 62+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-12-18 17:56 UTC (permalink / raw)
  To: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi, Peter Zijlstra
  Cc: Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky, Gregory Haskins, Pavel Machek, Andrew Morton,
	Vaidyanathan Srinivasan

When system utilisation is low and more cpus are idle,
a process waking up from sleep should prefer to
wake up an idle cpu in a semi-idle cpu package (multi-core
package) rather than one in a completely idle cpu package,
which would waste power.

Use the sched_mc balance logic in find_busiest_group() to
nominate a preferred wakeup cpu.

This information could be stored in the appropriate sched_domain,
but updating it in all copies of the sched_domain is not
practical.  Hence it is stored in the root_domain
struct, which can be accessed from each cpu's runqueue and
has exactly one copy per partitioned sched domain.
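The sharing argument can be sketched in miniature: every runqueue in a
partition points at the same root_domain, so one store during balance
is visible to all cpus in that partition.  The types below are
illustrative skeletons, not the kernel structures:

```c
#include <assert.h>

struct root_domain {
	unsigned int sched_mc_preferred_wakeup_cpu;
};

struct rq {
	struct root_domain *rd;	/* shared by all rqs in the partition */
};

#define POWERSAVINGS_BALANCE_WAKEUP 2

/* Nominate the group leader's first cpu, but only at sched_mc=2. */
void nominate_wakeup_cpu(struct rq *this_rq, int sched_mc_level,
			 unsigned int group_leader_first_cpu)
{
	if (sched_mc_level >= POWERSAVINGS_BALANCE_WAKEUP)
		this_rq->rd->sched_mc_preferred_wakeup_cpu =
			group_leader_first_cpu;
}
```

A nomination made through one runqueue is immediately seen through any
other runqueue of the same partition, and lower sched_mc levels leave
the nomination untouched.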

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---

 kernel/sched.c |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 0b9bbbd..3415fa3 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -493,6 +493,14 @@ struct root_domain {
 #ifdef CONFIG_SMP
 	struct cpupri cpupri;
 #endif
+#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
+	/*
+	 * Preferred wake up cpu nominated by sched_mc balance that will be
+	 * used when most cpus are idle in the system indicating overall very
+	 * low system utilisation. Triggered at POWERSAVINGS_BALANCE_WAKEUP(2)
+	 */
+	unsigned int sched_mc_preferred_wakeup_cpu;
+#endif
 };
 
 /*
@@ -3407,6 +3415,10 @@ out_balanced:
 
 	if (this == group_leader && group_leader != group_min) {
 		*imbalance = min_load_per_task;
+		if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP) {
+			cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu =
+					first_cpu(group_leader->cpumask);
+		}
 		return group_min;
 	}
 #endif



* [PATCH v7 5/8] sched: bias task wakeups to preferred semi-idle packages
  2008-12-18 17:55 [PATCH v7 0/8] Tunable sched_mc_power_savings=n Vaidyanathan Srinivasan
                   ` (3 preceding siblings ...)
  2008-12-18 17:56 ` [PATCH v7 4/8] sched: nominate preferred wakeup cpu Vaidyanathan Srinivasan
@ 2008-12-18 17:56 ` Vaidyanathan Srinivasan
  2008-12-18 18:11   ` Balbir Singh
  2008-12-18 17:56 ` [PATCH v7 6/8] sched: activate active load balancing in new idle cpus Vaidyanathan Srinivasan
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 62+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-12-18 17:56 UTC (permalink / raw)
  To: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi, Peter Zijlstra
  Cc: Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky, Gregory Haskins, Pavel Machek, Andrew Morton,
	Vaidyanathan Srinivasan

Preferred wakeup cpu (from a semi idle package) has been
nominated in find_busiest_group() in the previous patch.  Use
this information in sched_mc_preferred_wakeup_cpu in function
wake_idle() to bias task wakeups if the following conditions
are satisfied:
        - The present cpu that is trying to wakeup the process is
          idle and waking the target process on this cpu will
          potentially wakeup a completely idle package
        - The previous cpu on which the target process ran is
          also idle and hence selecting the previous cpu may
          wakeup a semi idle cpu package
        - The task being woken up is allowed to run in the
          nominated cpu (cpu affinity and restrictions)

Basically if both the current cpu and the previous cpu on
which the task ran are idle, select the nominated cpu from a semi-idle
cpu package for running the new task that is waking up.

Cache hotness is considered, since the actual biasing happens
in wake_idle() only if the application is cache cold.

This technique will effectively consolidate short-running, bursty
jobs in a mostly idle system.

Wakeup biasing for power savings gets automatically disabled as
system utilisation increases, because the probability
of finding both this_cpu and prev_cpu idle decreases.
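The gate in wake_idle() condenses to a single predicate.  The function
below is a hypothetical user-space reduction, with plain ints standing
in for the task and cpu state the kernel checks:

```c
#include <assert.h>

/* Bias the wakeup to the nominated cpu only when every condition
 * holds: sched_mc=2, both cpus idle, the task has an mm (not a
 * kernel thread) and its affinity allows the nominated cpu.
 * Otherwise fall back to the normal wake_idle() choice. */
int choose_wakeup_cpu(int sched_mc_level, int prev_cpu_idle,
		      int this_cpu_idle, int task_has_mm,
		      int affinity_allows, int nominated_cpu, int prev_cpu)
{
	if (sched_mc_level >= 2 && prev_cpu_idle && this_cpu_idle &&
	    task_has_mm && affinity_allows)
		return nominated_cpu;
	return prev_cpu;
}
```

Any single failed condition (lower sched_mc level, a busy cpu, a
kernel thread, a disallowed affinity) disables the bias, which is how
the mechanism self-disables as utilisation rises.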

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---

 kernel/sched_fair.c |   18 ++++++++++++++++++
 1 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 98345e4..c82386c 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1027,6 +1027,24 @@ static int wake_idle(int cpu, struct task_struct *p)
 	cpumask_t tmp;
 	struct sched_domain *sd;
 	int i;
+	unsigned int chosen_wakeup_cpu;
+	int this_cpu;
+
+	/*
+	 * At POWERSAVINGS_BALANCE_WAKEUP level, if both this_cpu and prev_cpu
+	 * are idle and this is not a kernel thread and this task's affinity
+	 * allows it to be moved to preferred cpu, then just move!
+	 */
+
+	this_cpu = smp_processor_id();
+	chosen_wakeup_cpu =
+		cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;
+
+	if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP &&
+		idle_cpu(cpu) && idle_cpu(this_cpu) &&
+		p->mm && !(p->flags & PF_KTHREAD) &&
+		cpu_isset(chosen_wakeup_cpu, p->cpus_allowed))
+		return chosen_wakeup_cpu;
 
 	/*
 	 * If it is idle, then it is the best cpu to run this task.



* [PATCH v7 6/8] sched: activate active load balancing in new idle cpus
  2008-12-18 17:55 [PATCH v7 0/8] Tunable sched_mc_power_savings=n Vaidyanathan Srinivasan
                   ` (4 preceding siblings ...)
  2008-12-18 17:56 ` [PATCH v7 5/8] sched: bias task wakeups to preferred semi-idle packages Vaidyanathan Srinivasan
@ 2008-12-18 17:56 ` Vaidyanathan Srinivasan
  2008-12-18 17:56 ` [PATCH v7 7/8] sched: add SD_BALANCE_NEWIDLE at MC and CPU level for sched_mc>0 Vaidyanathan Srinivasan
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 62+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-12-18 17:56 UTC (permalink / raw)
  To: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi, Peter Zijlstra
  Cc: Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky, Gregory Haskins, Pavel Machek, Andrew Morton,
	Vaidyanathan Srinivasan

Active load balancing is a process by which the migration thread
is woken up on the target CPU in order to pull a currently
running task from another package into this newly idle
package.

This method is already in use by normal load_balance();
this patch extends it to newly idle cpus when
sched_mc is set to POWERSAVINGS_BALANCE_WAKEUP.

This logic provides effective consolidation of short-running
daemon jobs in an almost idle system.

A side effect of this patch may be ping-ponging of tasks
if the system is moderately utilised.  The number of iterations
before triggering may need to be adjusted.
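The trigger condition reduces to two counters, sketched below as a
hypothetical helper (not a kernel function): newly idle active balance
only fires at sched_mc=2, and only after move_tasks() has come up
empty a couple of times, to limit the ping-ponging mentioned above.

```c
#include <assert.h>

/* Mirrors the gate added in this patch: returns 1 when the migration
 * thread should be kicked.  nr_balance_failed models
 * sd->nr_balance_failed and is post-incremented on every failed
 * attempt, as in the kernel code. */
int should_kick_migration_thread(int sched_mc_level, int *nr_balance_failed)
{
	if (sched_mc_level < 2)
		return 0;
	if ((*nr_balance_failed)++ < 2)
		return 0;
	return 1;
}
```

Two failed newly idle balances are tolerated; the third attempt kicks
the migration thread, and sched_mc below 2 never triggers at all.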

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---

 kernel/sched.c |   54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 54 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 3415fa3..f82a7a0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3692,10 +3692,64 @@ redo:
 	}
 
 	if (!ld_moved) {
+		int active_balance = 0;
+
 		schedstat_inc(sd, lb_failed[CPU_NEWLY_IDLE]);
 		if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
 		    !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
 			return -1;
+
+		if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
+			return -1;
+
+		if (sd->nr_balance_failed++ < 2)
+			return -1;
+
+		/*
+		 * The only task running in a non-idle cpu can be moved to this
+		 * cpu in an attempt to completely freeup the other CPU
+		 * package. The same method used to move task in load_balance()
+		 * have been extended for load_balance_newidle() to speedup
+		 * consolidation at sched_mc=POWERSAVINGS_BALANCE_WAKEUP (2)
+		 *
+		 * The package power saving logic comes from
+		 * find_busiest_group().  If there are no imbalance, then
+		 * f_b_g() will return NULL.  However when sched_mc={1,2} then
+		 * f_b_g() will select a group from which a running task may be
+		 * pulled to this cpu in order to make the other package idle.
+		 * If there is no opportunity to make a package idle and if
+		 * there are no imbalance, then f_b_g() will return NULL and no
+		 * action will be taken in load_balance_newidle().
+		 *
+		 * Under normal task pull operation due to imbalance, there
+		 * will be more than one task in the source run queue and
+		 * move_tasks() will succeed.  ld_moved will be true and this
+		 * active balance code will not be triggered.
+		 */
+
+		/* Lock busiest in correct order while this_rq is held */
+		double_lock_balance(this_rq, busiest);
+
+		/*
+		 * don't kick the migration_thread, if the curr
+		 * task on busiest cpu can't be moved to this_cpu
+		 */
+		if (!cpu_isset(this_cpu, busiest->curr->cpus_allowed)) {
+			double_unlock_balance(this_rq, busiest);
+			all_pinned = 1;
+			return ld_moved;
+		}
+
+		if (!busiest->active_balance) {
+			busiest->active_balance = 1;
+			busiest->push_cpu = this_cpu;
+			active_balance = 1;
+		}
+
+		double_unlock_balance(this_rq, busiest);
+		if (active_balance)
+			wake_up_process(busiest->migration_thread);
+
 	} else
 		sd->nr_balance_failed = 0;
 



* [PATCH v7 7/8] sched: add SD_BALANCE_NEWIDLE at MC and CPU level for sched_mc>0
  2008-12-18 17:55 [PATCH v7 0/8] Tunable sched_mc_power_savings=n Vaidyanathan Srinivasan
                   ` (5 preceding siblings ...)
  2008-12-18 17:56 ` [PATCH v7 6/8] sched: activate active load balancing in new idle cpus Vaidyanathan Srinivasan
@ 2008-12-18 17:56 ` Vaidyanathan Srinivasan
  2008-12-18 17:56 ` [PATCH v7 8/8] sched: idle_balance() does not call load_balance_newidle() Vaidyanathan Srinivasan
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 62+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-12-18 17:56 UTC (permalink / raw)
  To: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi, Peter Zijlstra
  Cc: Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky, Gregory Haskins, Pavel Machek, Andrew Morton,
	Vaidyanathan Srinivasan

Add the SD_BALANCE_NEWIDLE flag at the MC and CPU levels
when sched_mc is set.  This helps power savings and
will not affect performance when sched_mc=0.

Ingo and Mike Galbraith have optimised the SD flags by
removing SD_BALANCE_NEWIDLE at the MC and CPU levels.  This
helps performance but hurts power savings, since it
slows down task consolidation by reducing the number
of times load_balance is run.

    sched: fine-tune SD_MC_INIT
        commit 14800984706bf6936bbec5187f736e928be5c218
        Author: Mike Galbraith <efault@gmx.de>
        Date:   Fri Nov 7 15:26:50 2008 +0100

    sched: re-tune balancing -- revert
        commit 9fcd18c9e63e325dbd2b4c726623f760788d5aa8
        Author: Ingo Molnar <mingo@elte.hu>
        Date:   Wed Nov 5 16:52:08 2008 +0100

This patch selectively enables the SD_BALANCE_NEWIDLE flag
only when sched_mc is set to 1 or 2.  This helps power savings
through task consolidation and does not hurt performance at
sched_mc=0, where all power-saving optimisations are turned off.
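How the domain-init flags compose can be sketched in isolation.  The
flag values below are illustrative, not the kernel's actual bit
assignments, and mc_domain_flags is a hypothetical condensation of the
SD_MC_INIT expression:

```c
#include <assert.h>

#define SD_BALANCE_NEWIDLE	0x02	/* illustrative value */
#define SD_POWERSAVINGS_BALANCE	0x100	/* illustrative value */

/* sd_power_saving_flags(): OR in SD_BALANCE_NEWIDLE only when some
 * power-saving tunable is non-zero; sched_mc=0 keeps default flags. */
int sd_power_saving_flags(int mc, int smt)
{
	if (mc | smt)
		return SD_BALANCE_NEWIDLE;
	return 0;
}

/* MC-level flag composition: base flags, plus the power-savings
 * balance flag (sd_balance_for_mc_power() keys off smt), plus the
 * conditional newidle flag from this patch. */
int mc_domain_flags(int base_flags, int mc, int smt)
{
	return base_flags |
	       (smt ? SD_POWERSAVINGS_BALANCE : 0) |
	       sd_power_saving_flags(mc, smt);
}
```

With both tunables at 0 the domain keeps only its base flags, so the
full-performance configuration is untouched.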

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---

 include/linux/sched.h    |   13 +++++++++++++
 include/linux/topology.h |    6 ++++--
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5fe906b..07348ff 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -793,6 +793,19 @@ static inline int sd_balance_for_package_power(void)
 	return 0;
 }
 
+/*
+ * Optimise SD flags for power savings:
+ * SD_BALANCE_NEWIDLE helps aggressive task consolidation and power savings.
+ * Keep default SD flags if sched_{smt,mc}_power_savings=0
+ */
+
+static inline int sd_power_saving_flags(void)
+{
+	if (sched_mc_power_savings | sched_smt_power_savings)
+		return SD_BALANCE_NEWIDLE;
+
+	return 0;
+}
 
 struct sched_group {
 	struct sched_group *next;	/* Must be a circular list */
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 7f10439..1628b43 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -125,7 +125,8 @@ void arch_update_cpu_topology(void);
 				| SD_WAKE_AFFINE	\
 				| SD_WAKE_BALANCE	\
 				| SD_SHARE_PKG_RESOURCES\
-				| sd_balance_for_mc_power(),\
+				| sd_balance_for_mc_power()\
+				| sd_power_saving_flags(),\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
 }
@@ -150,7 +151,8 @@ void arch_update_cpu_topology(void);
 				| SD_BALANCE_FORK	\
 				| SD_WAKE_AFFINE	\
 				| SD_WAKE_BALANCE	\
-				| sd_balance_for_package_power(),\
+				| sd_balance_for_package_power()\
+				| sd_power_saving_flags(),\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
 }



* [PATCH v7 8/8] sched: idle_balance() does not call load_balance_newidle()
  2008-12-18 17:55 [PATCH v7 0/8] Tunable sched_mc_power_savings=n Vaidyanathan Srinivasan
                   ` (6 preceding siblings ...)
  2008-12-18 17:56 ` [PATCH v7 7/8] sched: add SD_BALANCE_NEWIDLE at MC and CPU level for sched_mc>0 Vaidyanathan Srinivasan
@ 2008-12-18 17:56 ` Vaidyanathan Srinivasan
  2008-12-18 18:12   ` Balbir Singh
  2008-12-18 20:19 ` [PATCH v7 0/8] Tunable sched_mc_power_savings=n Ingo Molnar
  2008-12-29 23:43 ` MinChan Kim
  9 siblings, 1 reply; 62+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-12-18 17:56 UTC (permalink / raw)
  To: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi, Peter Zijlstra
  Cc: Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky, Gregory Haskins, Pavel Machek, Andrew Morton,
	Vaidyanathan Srinivasan

load_balance_newidle() does not get called if SD_BALANCE_NEWIDLE
is set at a higher-level domain (3-CPU) but not in the lower-level
domain (2-MC).

pulled_task is initialised to -1 and checked for non-zero, which
is always true if the lowest-level sched_domain does not have the
SD_BALANCE_NEWIDLE flag set.

Trivial fix: initialise pulled_task to zero.

This patch has been queued for 2.6.29
http://lkml.org/lkml/2008/12/8/213

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---

 kernel/sched.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index f82a7a0..e6a88bf 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3773,7 +3773,7 @@ out_balanced:
 static void idle_balance(int this_cpu, struct rq *this_rq)
 {
 	struct sched_domain *sd;
-	int pulled_task = -1;
+	int pulled_task = 0;
 	unsigned long next_balance = jiffies + HZ;
 	cpumask_t tmpmask;
 


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 5/8] sched: bias task wakeups to preferred semi-idle packages
  2008-12-18 17:56 ` [PATCH v7 5/8] sched: bias task wakeups to preferred semi-idle packages Vaidyanathan Srinivasan
@ 2008-12-18 18:11   ` Balbir Singh
  0 siblings, 0 replies; 62+ messages in thread
From: Balbir Singh @ 2008-12-18 18:11 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi,
	Peter Zijlstra, Ingo Molnar, Dipankar Sarma, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky, Gregory Haskins, Pavel Machek, Andrew Morton

* Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> [2008-12-18 23:26:29]:

> Preferred wakeup cpu (from a semi idle package) has been
> nominated in find_busiest_group() in the previous patch.  Use
> this information in sched_mc_preferred_wakeup_cpu in function
> wake_idle() to bias task wakeups if the following conditions
> are satisfied:
>         - The present cpu that is trying to wakeup the process is
>           idle and waking the target process on this cpu will
>           potentially wakeup a completely idle package
>         - The previous cpu on which the target process ran is
>           also idle and hence selecting the previous cpu may
>           wakeup a semi idle cpu package
>         - The task being woken up is allowed to run in the
>           nominated cpu (cpu affinity and restrictions)
> 
> Basically, if both the current cpu and the previous cpu on
> which the task ran are idle, select the nominated cpu from a
> semi-idle cpu package for running the new task that is waking up.
> 
> Cache hotness is considered since the actual biasing happens
> in wake_idle() only if the application is cache cold.
> 
> This technique effectively consolidates short-running, bursty
> jobs in a mostly idle system.
> 
> Wakeup biasing for power savings is automatically disabled as
> system utilisation increases, because the probability of finding
> both this_cpu and prev_cpu idle decreases.
> 
> Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> ---
> 
>  kernel/sched_fair.c |   18 ++++++++++++++++++
>  1 files changed, 18 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 98345e4..c82386c 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1027,6 +1027,24 @@ static int wake_idle(int cpu, struct task_struct *p)
>  	cpumask_t tmp;
>  	struct sched_domain *sd;
>  	int i;
> +	unsigned int chosen_wakeup_cpu;
> +	int this_cpu;
> +
> +	/*
> +	 * At POWERSAVINGS_BALANCE_WAKEUP level, if both this_cpu and prev_cpu
> +	 * are idle and this is not a kernel thread and this task's affinity
> +	 * allows it to be moved to preferred cpu, then just move!
> +	 */
> +
> +	this_cpu = smp_processor_id();
> +	chosen_wakeup_cpu =
> +		cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;
> +
> +	if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP &&
> +		idle_cpu(cpu) && idle_cpu(this_cpu) &&
> +		p->mm && !(p->flags & PF_KTHREAD) &&
> +		cpu_isset(chosen_wakeup_cpu, p->cpus_allowed))
> +		return chosen_wakeup_cpu;
> 
>  	/*
>  	 * If it is idle, then it is the best cpu to run this task.

Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>

-- 
	Balbir

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu
  2008-12-18 17:56 ` [PATCH v7 4/8] sched: nominate preferred wakeup cpu Vaidyanathan Srinivasan
@ 2008-12-18 18:12   ` Balbir Singh
  2008-12-19 21:55   ` Andrew Morton
  1 sibling, 0 replies; 62+ messages in thread
From: Balbir Singh @ 2008-12-18 18:12 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi,
	Peter Zijlstra, Ingo Molnar, Dipankar Sarma, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky, Gregory Haskins, Pavel Machek, Andrew Morton

* Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> [2008-12-18 23:26:22]:

> When the system utilisation is low and more cpus are idle,
> a process waking up from sleep should prefer to wake up on an
> idle cpu in a semi-idle cpu package (multi-core package) rather
> than in a completely idle cpu package, which would waste power.
> 
> Use the sched_mc balance logic in find_busiest_group() to
> nominate a preferred wakeup cpu.
> 
> This info can be stored in an appropriate sched_domain, but
> updating this info in all copies of sched_domain is not
> practical.  Hence this information is stored in the root_domain
> struct, of which there is one copy per partitioned sched domain,
> accessible from each cpu's runqueue.
> 
> Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> ---
> 
>  kernel/sched.c |   12 ++++++++++++
>  1 files changed, 12 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 0b9bbbd..3415fa3 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -493,6 +493,14 @@ struct root_domain {
>  #ifdef CONFIG_SMP
>  	struct cpupri cpupri;
>  #endif
> +#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
> +	/*
> +	 * Preferred wake up cpu nominated by sched_mc balance that will be
> +	 * used when most cpus are idle in the system indicating overall very
> +	 * low system utilisation. Triggered at POWERSAVINGS_BALANCE_WAKEUP(2)
> +	 */
> +	unsigned int sched_mc_preferred_wakeup_cpu;
> +#endif
>  };
> 
>  /*
> @@ -3407,6 +3415,10 @@ out_balanced:
> 
>  	if (this == group_leader && group_leader != group_min) {
>  		*imbalance = min_load_per_task;
> +		if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP) {
> +			cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu =
> +					first_cpu(group_leader->cpumask);
> +		}
>  		return group_min;
>  	}
>  #endif

Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>

-- 
	Balbir

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 8/8] sched: idle_balance() does not call load_balance_newidle()
  2008-12-18 17:56 ` [PATCH v7 8/8] sched: idle_balance() does not call load_balance_newidle() Vaidyanathan Srinivasan
@ 2008-12-18 18:12   ` Balbir Singh
  2008-12-18 20:17     ` Ingo Molnar
  0 siblings, 1 reply; 62+ messages in thread
From: Balbir Singh @ 2008-12-18 18:12 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi,
	Peter Zijlstra, Ingo Molnar, Dipankar Sarma, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky, Gregory Haskins, Pavel Machek, Andrew Morton

* Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> [2008-12-18 23:26:59]:

> load_balance_newidle() does not get called if SD_BALANCE_NEWIDLE
> is set at a higher-level domain (3-CPU) but not at the lowest-level
> domain (2-MC).
> 
> pulled_task is initialised to -1 and checked for non-zero, which
> is always true if the lowest-level sched_domain does not have the
> SD_BALANCE_NEWIDLE flag set.
> 
> Trivial fix: initialise pulled_task to zero.
> 
> This patch has been queued for 2.6.29
> http://lkml.org/lkml/2008/12/8/213
> 
> Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> ---
> 
>  kernel/sched.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index f82a7a0..e6a88bf 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -3773,7 +3773,7 @@ out_balanced:
>  static void idle_balance(int this_cpu, struct rq *this_rq)
>  {
>  	struct sched_domain *sd;
> -	int pulled_task = -1;
> +	int pulled_task = 0;
>  	unsigned long next_balance = jiffies + HZ;
>  	cpumask_t tmpmask;
>

Isn't this already in Ingo's patch queue? 

-- 
	Balbir

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 8/8] sched: idle_balance() does not call load_balance_newidle()
  2008-12-18 18:12   ` Balbir Singh
@ 2008-12-18 20:17     ` Ingo Molnar
  0 siblings, 0 replies; 62+ messages in thread
From: Ingo Molnar @ 2008-12-18 20:17 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan, Linux Kernel, Suresh B Siddha,
	Venkatesh Pallipadi, Peter Zijlstra, Dipankar Sarma, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky, Gregory Haskins, Pavel Machek, Andrew Morton


* Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> >  static void idle_balance(int this_cpu, struct rq *this_rq)
> >  {
> >  	struct sched_domain *sd;
> > -	int pulled_task = -1;
> > +	int pulled_task = 0;
> >  	unsigned long next_balance = jiffies + HZ;
> >  	cpumask_t tmpmask;
> >
> 
> Isn't this already in Ingo's patch queue? 

yep - i left it out.

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2008-12-18 17:55 [PATCH v7 0/8] Tunable sched_mc_power_savings=n Vaidyanathan Srinivasan
                   ` (7 preceding siblings ...)
  2008-12-18 17:56 ` [PATCH v7 8/8] sched: idle_balance() does not call load_balance_newidle() Vaidyanathan Srinivasan
@ 2008-12-18 20:19 ` Ingo Molnar
  2008-12-18 20:31   ` Ingo Molnar
                     ` (2 more replies)
  2008-12-29 23:43 ` MinChan Kim
  9 siblings, 3 replies; 62+ messages in thread
From: Ingo Molnar @ 2008-12-18 20:19 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi,
	Peter Zijlstra, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky, Gregory Haskins, Pavel Machek, Andrew Morton


* Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:

> Hi,
> 
> The existing power saving loadbalancer CONFIG_SCHED_MC attempts to run 
> the workload in the system on minimum number of CPU packages and tries 
> to keep rest of the CPU packages idle for longer duration. Thus 
> consolidating workloads to fewer packages help other packages to be in 
> idle state and save power.  The current implementation is very 
> conservative and does not work effectively across different workloads. 
> Initial idea of tunable sched_mc_power_savings=n was proposed to enable 
> tuning of the power saving load balancer based on the system 
> configuration, workload characteristics and end user requirements.
> 
> The power savings and performance of the given workload in an under 
> utilised system can be controlled by setting values of 0, 1 or 2 to 
> /sys/devices/system/cpu/sched_mc_power_savings with 0 being highest 
> performance and least power savings and level 2 indicating maximum power 
> savings even at the cost of slight performance degradation.
> 
> Please refer to the following discussions and article for details.
> 
> [1]Making power policy just work
> http://lwn.net/Articles/287924/ 
> 
> [2][RFC v1] Tunable sched_mc_power_savings=n
> http://lwn.net/Articles/287882/
>
> v2: http://lwn.net/Articles/297306/
> v3: http://lkml.org/lkml/2008/11/10/260
> v4: http://lkml.org/lkml/2008/11/21/47
> v5: http://lkml.org/lkml/2008/12/11/178
> v6: http://lwn.net/Articles/311830/
> 
> The following series of patches demonstrates the basic framework for
> tunable sched_mc_power_savings.
> 
> This version of the patch incorporates comments and feedback received
> on the previous post from Andrew Morton.
> 
> Changes from v6:
> ----------------
> * Convert BALANCE_FOR_xx_POWER and related macros to inline functions
>   based on comments from Andrew and Ingo.
> * Ran basic kernelbench test and did not see any performance variation
>   due to the changes.
> 
> Changes from v5:
> ---------------
> * Fixed the sscanf bug and checking for (p->flags & PF_KTHREAD)
> * Dropped the RFC prefix to indicate that the patch is ready for
>   testing and inclusion  
> * Patch series against 2.6.28-rc8 kernel

thanks, applied - and i started testing them. It needed some help here and 
there to resolve conflicts with pending cpumask changes. Could you please 
double-check the merged up end result in the latest scheduler devel tree:

  http://people.redhat.com/mingo/tip.git/README

thanks,

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2008-12-18 20:19 ` [PATCH v7 0/8] Tunable sched_mc_power_savings=n Ingo Molnar
@ 2008-12-18 20:31   ` Ingo Molnar
  2008-12-19  8:29     ` Vaidyanathan Srinivasan
  2008-12-19  8:24   ` Vaidyanathan Srinivasan
  2008-12-19 13:34   ` Vaidyanathan Srinivasan
  2 siblings, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2008-12-18 20:31 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi,
	Peter Zijlstra, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky, Gregory Haskins, Pavel Machek, Andrew Morton


* Ingo Molnar <mingo@elte.hu> wrote:

> thanks, applied - and i started testing them. [...]

the sched.h bits needed the small fix below. (struct sched_domain is not 
defined on UP)

	Ingo

------------->
>From edb4c71953409c1deac1a80528ac0aa768762b33 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@elte.hu>
Date: Thu, 18 Dec 2008 21:30:23 +0100
Subject: [PATCH] sched: move test_sd_parent() to an SMP section of sched.h

Impact: build fix

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h |   18 +++++++++---------
 1 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5a933d9..e5f928a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -920,6 +920,15 @@ extern void partition_sched_domains(int ndoms_new, struct cpumask *doms_new,
 				    struct sched_domain_attr *dattr_new);
 extern int arch_reinit_sched_domains(void);
 
+/* Test a flag in parent sched domain */
+static inline int test_sd_parent(struct sched_domain *sd, int flag)
+{
+	if (sd->parent && (sd->parent->flags & flag))
+		return 1;
+
+	return 0;
+}
+
 #else /* CONFIG_SMP */
 
 struct sched_domain_attr;
@@ -1431,15 +1440,6 @@ struct task_struct {
 #endif
 };
 
-/* Test a flag in parent sched domain */
-static inline int test_sd_parent(struct sched_domain *sd, int flag)
-{
-	if (sd->parent && (sd->parent->flags & flag))
-		return 1;
-
-	return 0;
-}
-
 /*
  * Priority of a process goes from 0..MAX_PRIO-1, valid RT
  * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2008-12-18 20:19 ` [PATCH v7 0/8] Tunable sched_mc_power_savings=n Ingo Molnar
  2008-12-18 20:31   ` Ingo Molnar
@ 2008-12-19  8:24   ` Vaidyanathan Srinivasan
  2008-12-19 13:34   ` Vaidyanathan Srinivasan
  2 siblings, 0 replies; 62+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-12-19  8:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi,
	Peter Zijlstra, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky, Gregory Haskins, Pavel Machek, Andrew Morton

* Ingo Molnar <mingo@elte.hu> [2008-12-18 21:19:38]:

> 
> * Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:
> 
> > Hi,
> > 
> > The existing power saving loadbalancer CONFIG_SCHED_MC attempts to run 
> > the workload in the system on minimum number of CPU packages and tries 
> > to keep rest of the CPU packages idle for longer duration. Thus 
> > consolidating workloads to fewer packages help other packages to be in 
> > idle state and save power.  The current implementation is very 
> > conservative and does not work effectively across different workloads. 
> > Initial idea of tunable sched_mc_power_savings=n was proposed to enable 
> > tuning of the power saving load balancer based on the system 
> > configuration, workload characteristics and end user requirements.
> > 
> > The power savings and performance of the given workload in an under 
> > utilised system can be controlled by setting values of 0, 1 or 2 to 
> > /sys/devices/system/cpu/sched_mc_power_savings with 0 being highest 
> > performance and least power savings and level 2 indicating maximum power 
> > savings even at the cost of slight performance degradation.
> > 
> > Please refer to the following discussions and article for details.
> > 
> > [1]Making power policy just work
> > http://lwn.net/Articles/287924/ 
> > 
> > [2][RFC v1] Tunable sched_mc_power_savings=n
> > http://lwn.net/Articles/287882/
> >
> > v2: http://lwn.net/Articles/297306/
> > v3: http://lkml.org/lkml/2008/11/10/260
> > v4: http://lkml.org/lkml/2008/11/21/47
> > v5: http://lkml.org/lkml/2008/12/11/178
> > v6: http://lwn.net/Articles/311830/
> > 
> > The following series of patches demonstrates the basic framework for
> > tunable sched_mc_power_savings.
> > 
> > This version of the patch incorporates comments and feedback received
> > on the previous post from Andrew Morton.
> > 
> > Changes from v6:
> > ----------------
> > * Convert BALANCE_FOR_xx_POWER and related macros to inline functions
> >   based on comments from Andrew and Ingo.
> > * Ran basic kernelbench test and did not see any performance variation
> >   due to the changes.
> > 
> > Changes from v5:
> > ---------------
> > * Fixed the sscanf bug and checking for (p->flags & PF_KTHREAD)
> > * Dropped the RFC prefix to indicate that the patch is ready for
> >   testing and inclusion  
> > * Patch series against 2.6.28-rc8 kernel
> 
> thanks, applied - and i started testing them. It needed some help here and 
> there to resolve conflicts with pending cpumask changes. Could you please 
> double-check the merged up end result in the latest scheduler devel tree:
> 
>   http://people.redhat.com/mingo/tip.git/README

Thanks Ingo.  I will review and check the cpumask changes.
I will also test the tip tree and provide feedback.

I look forward to test/bug reports from others on the patch
series and will help with any fixes.

--Vaidy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2008-12-18 20:31   ` Ingo Molnar
@ 2008-12-19  8:29     ` Vaidyanathan Srinivasan
  0 siblings, 0 replies; 62+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-12-19  8:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi,
	Peter Zijlstra, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky, Gregory Haskins, Pavel Machek, Andrew Morton

* Ingo Molnar <mingo@elte.hu> [2008-12-18 21:31:40]:

> 
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > thanks, applied - and i started testing them. [...]
> 
> the sched.h bits needed the small fix below. (struct sched_domain is not 
> defined on UP)

Thanks Ingo. I will watch out for such errors next time.

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>

--Vaidy

> 
> 	Ingo
> 
> ------------->
> From edb4c71953409c1deac1a80528ac0aa768762b33 Mon Sep 17 00:00:00 2001
> From: Ingo Molnar <mingo@elte.hu>
> Date: Thu, 18 Dec 2008 21:30:23 +0100
> Subject: [PATCH] sched: move test_sd_parent() to an SMP section of sched.h
> 
> Impact: build fix
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>  include/linux/sched.h |   18 +++++++++---------
>  1 files changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 5a933d9..e5f928a 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -920,6 +920,15 @@ extern void partition_sched_domains(int ndoms_new, struct cpumask *doms_new,
>  				    struct sched_domain_attr *dattr_new);
>  extern int arch_reinit_sched_domains(void);
> 
> +/* Test a flag in parent sched domain */
> +static inline int test_sd_parent(struct sched_domain *sd, int flag)
> +{
> +	if (sd->parent && (sd->parent->flags & flag))
> +		return 1;
> +
> +	return 0;
> +}
> +
>  #else /* CONFIG_SMP */
> 
>  struct sched_domain_attr;
> @@ -1431,15 +1440,6 @@ struct task_struct {
>  #endif
>  };
> 
> -/* Test a flag in parent sched domain */
> -static inline int test_sd_parent(struct sched_domain *sd, int flag)
> -{
> -	if (sd->parent && (sd->parent->flags & flag))
> -		return 1;
> -
> -	return 0;
> -}
> -
>  /*
>   * Priority of a process goes from 0..MAX_PRIO-1, valid RT
>   * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2008-12-18 20:19 ` [PATCH v7 0/8] Tunable sched_mc_power_savings=n Ingo Molnar
  2008-12-18 20:31   ` Ingo Molnar
  2008-12-19  8:24   ` Vaidyanathan Srinivasan
@ 2008-12-19 13:34   ` Vaidyanathan Srinivasan
  2 siblings, 0 replies; 62+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-12-19 13:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi,
	Peter Zijlstra, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky, Gregory Haskins, Pavel Machek, Andrew Morton

* Ingo Molnar <mingo@elte.hu> [2008-12-18 21:19:38]:

> 
> * Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:
> 
> > Hi,
> > 
> > The existing power saving loadbalancer CONFIG_SCHED_MC attempts to run 
> > the workload in the system on minimum number of CPU packages and tries 
> > to keep rest of the CPU packages idle for longer duration. Thus 
> > consolidating workloads to fewer packages help other packages to be in 
> > idle state and save power.  The current implementation is very 
> > conservative and does not work effectively across different workloads. 
> > Initial idea of tunable sched_mc_power_savings=n was proposed to enable 
> > tuning of the power saving load balancer based on the system 
> > configuration, workload characteristics and end user requirements.
> > 
> > The power savings and performance of the given workload in an under 
> > utilised system can be controlled by setting values of 0, 1 or 2 to 
> > /sys/devices/system/cpu/sched_mc_power_savings with 0 being highest 
> > performance and least power savings and level 2 indicating maximum power 
> > savings even at the cost of slight performance degradation.
> > 
> > Please refer to the following discussions and article for details.
> > 
> > [1]Making power policy just work
> > http://lwn.net/Articles/287924/ 
> > 
> > [2][RFC v1] Tunable sched_mc_power_savings=n
> > http://lwn.net/Articles/287882/
> >
> > v2: http://lwn.net/Articles/297306/
> > v3: http://lkml.org/lkml/2008/11/10/260
> > v4: http://lkml.org/lkml/2008/11/21/47
> > v5: http://lkml.org/lkml/2008/12/11/178
> > v6: http://lwn.net/Articles/311830/
> > 
> > The following series of patches demonstrates the basic framework for
> > tunable sched_mc_power_savings.
> > 
> > This version of the patch incorporates comments and feedback received
> > on the previous post from Andrew Morton.
> > 
> > Changes from v6:
> > ----------------
> > * Convert BALANCE_FOR_xx_POWER and related macros to inline functions
> >   based on comments from Andrew and Ingo.
> > * Ran basic kernelbench test and did not see any performance variation
> >   due to the changes.
> > 
> > Changes from v5:
> > ---------------
> > * Fixed the sscanf bug and checking for (p->flags & PF_KTHREAD)
> > * Dropped the RFC prefix to indicate that the patch is ready for
> >   testing and inclusion  
> > * Patch series against 2.6.28-rc8 kernel
> 
> thanks, applied - and i started testing them. It needed some help here and 
> there to resolve conflicts with pending cpumask changes. Could you please 
> double-check the merged up end result in the latest scheduler devel tree:
> 
>   http://people.redhat.com/mingo/tip.git/README

Hi Ingo,

I reviewed and tested the tip tree.  The merge looks good.  I did not
find any other cpumask API warnings after the one you have fixed in commit
220e7f617826e0527bbc523ba859f6a4bae0bfe1

Results in tip at: 2.6.28-rc8-tip-sv-01393-gc79978a

SchedMC Run Time     Package Idle    Energy  Power
0       80.50        51.47% 54.00%  1.00x J 1.00y W
1       78.89        43.37% 62.86%  0.95x J 0.97y W
2       75.37        30.28% 76.00%  0.91x J 0.98y W
		
Thanks,
Vaidy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu
  2008-12-18 17:56 ` [PATCH v7 4/8] sched: nominate preferred wakeup cpu Vaidyanathan Srinivasan
  2008-12-18 18:12   ` Balbir Singh
@ 2008-12-19 21:55   ` Andrew Morton
  2008-12-19 22:19     ` Andrew Morton
  2008-12-20  4:36     ` Vaidyanathan Srinivasan
  1 sibling, 2 replies; 62+ messages in thread
From: Andrew Morton @ 2008-12-19 21:55 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: linux-kernel, suresh.b.siddha, venkatesh.pallipadi, a.p.zijlstra,
	mingo, dipankar, balbir, vatsa, ego, andi, davecb, tconnors,
	maxk, gregory.haskins, pavel, svaidy, Rusty Russell

On Thu, 18 Dec 2008 23:26:22 +0530
Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:

> When the system utilisation is low and more cpus are idle,
> a process waking up from sleep should prefer to wake up on an
> idle cpu in a semi-idle cpu package (multi-core package) rather
> than in a completely idle cpu package, which would waste power.
> 
> Use the sched_mc balance logic in find_busiest_group() to
> nominate a preferred wakeup cpu.
> 
> This info can be stored in an appropriate sched_domain, but
> updating this info in all copies of sched_domain is not
> practical.  Hence this information is stored in the root_domain
> struct, of which there is one copy per partitioned sched domain,
> accessible from each cpu's runqueue.
> 

kernel/sched.c: In function 'find_busiest_group':
kernel/sched.c:3403: warning: passing argument 1 of '__first_cpu' from incompatible pointer type

Due to

	first_cpu(group_leader->cpumask);

apparently because Rusty changed sched_group.cpumask into a plain old
array and nobody tests their stuff against the tree into which it is
actually integrated :(

kernel/sched.c: In function 'schedule':
kernel/sched.c:3679: warning: 'active_balance' may be used uninitialized in this function

This warning is correct - the code is buggy.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu
  2008-12-19 21:55   ` Andrew Morton
@ 2008-12-19 22:19     ` Andrew Morton
  2008-12-19 22:27       ` Ingo Molnar
  2008-12-20  4:36     ` Vaidyanathan Srinivasan
  1 sibling, 1 reply; 62+ messages in thread
From: Andrew Morton @ 2008-12-19 22:19 UTC (permalink / raw)
  To: svaidy, linux-kernel, suresh.b.siddha, venkatesh.pallipadi,
	a.p.zijlstra, mingo, dipankar, balbir, vatsa, ego, andi, davecb,
	tconnors, maxk, gregory.haskins, pavel, rusty

On Fri, 19 Dec 2008 13:55:08 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Thu, 18 Dec 2008 23:26:22 +0530
> Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:
> 
> > When the system utilisation is low and more cpus are idle,
> > a process waking up from sleep should prefer to wake up on an
> > idle cpu in a semi-idle cpu package (multi-core package) rather
> > than in a completely idle cpu package, which would waste power.
> > 
> > Use the sched_mc balance logic in find_busiest_group() to
> > nominate a preferred wakeup cpu.
> > 
> > This info can be stored in an appropriate sched_domain, but
> > updating this info in all copies of sched_domain is not
> > practical.  Hence this information is stored in the root_domain
> > struct, of which there is one copy per partitioned sched domain,
> > accessible from each cpu's runqueue.
> > 
> 
> kernel/sched.c: In function 'find_busiest_group':
> kernel/sched.c:3403: warning: passing argument 1 of '__first_cpu' from incompatible pointer type
> 
> 
> kernel/sched.c: In function 'schedule':
> kernel/sched.c:3679: warning: 'active_balance' may be used uninitialized in this function
> 

I left those unfixed and...

[   62.307607] initcall init_nfsd+0x0/0xe2 [nfsd] returned 0 after 251 usecs
[   62.399426] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
[   74.009934] divide error: 0000 [#1] SMP 
[   74.010085] last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:01:00.0/0000:02:02.0/0000:05:00.1/irq
[   74.010147] CPU 1 
[   74.010241] Modules linked in: nfsd auth_rpcgss exportfs lockd nfs_acl autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 dm_mirror dm_region_hash dm_log dm_multipath dm_mod rfkill input_polldev sbs sbshc battery ac parport_pc lp parport snd_hda_codec_realtek sg snd_hda_intel floppy snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device ide_cd_mod cdrom snd_pcm_oss snd_mixer_oss serio_raw snd_pcm button shpchp snd_timer i2c_i801 i2c_core snd soundcore snd_page_alloc pcspkr ehci_hcd ohci_hcd uhci_hcd
[   74.012821] Pid: 0, comm: swapper Not tainted 2.6.28-rc9-mm1 #1
[   74.012875] RIP: 0010:[<ffffffff802375d8>]  [<ffffffff802375d8>] tg_shares_up+0x113/0x1f1
[   74.012982] RSP: 0018:ffff88025e057d88  EFLAGS: 00010246
[   74.013038] RAX: 0000000000000000 RBX: ffff880028063900 RCX: ffff880028058dc0
[   74.013095] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000ff
[   74.013152] RBP: ffff88025e057dd8 R08: 0000000000000000 R09: 0000000000000000
[   74.013209] R10: ffff880028067640 R11: 0000000000000022 R12: ffff880028063900
[   74.013264] R13: ffffffff802374c5 R14: ffffffff8022c546 R15: ffffffff8079b4a0
[   74.013321] FS:  0000000000000000(0000) GS:ffff88025e0385c0(0000) knlGS:0000000000000000
[   74.013381] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[   74.013435] CR2: 00007fdb795af8f0 CR3: 0000000000201000 CR4: 00000000000006e0
[   74.013489] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   74.013545] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   74.013602] Process swapper (pid: 0, threadinfo ffff88025e050000, task ffff88025e03f870)
[   74.013661] Stack:
[   74.013708]  0000000000000000 ffff880028063a00 0000000000000000 0000000000000000
[   74.013893]  0000000000000008 ffffffff8079b4a0 ffff880028063900 ffffffff802374c5
[   74.013893]  ffffffff8022c546 0000000000000030 ffff88025e057e08 ffffffff8022c530
[   74.013893] Call Trace:
[   74.013893]  <IRQ> <0> [<ffffffff802374c5>] ? tg_shares_up+0x0/0x1f1
[   74.013893]  [<ffffffff8022c546>] ? tg_nop+0x0/0x8
[   74.013893]  [<ffffffff8022c530>] walk_tg_tree+0x5f/0x75
[   74.013893]  [<ffffffff8022c65c>] update_shares+0x46/0x4a
[   74.013893]  [<ffffffff80235fe4>] run_rebalance_domains+0x192/0x522
[   74.013893]  [<ffffffff802543eb>] ? getnstimeofday+0x3d/0x9e
[   74.013893]  [<ffffffff8052bd73>] ? _spin_unlock_irq+0x9/0xc
[   74.013893]  [<ffffffff8024057f>] __do_softirq+0xa3/0x164
[   74.013893]  [<ffffffff8020d0bc>] call_softirq+0x1c/0x28
[   74.013893]  [<ffffffff8020e19d>] do_softirq+0x31/0x73
[   74.013893]  [<ffffffff802402ca>] irq_exit+0x3f/0x41
[   74.013893]  [<ffffffff8021e71d>] smp_apic_timer_interrupt+0x94/0xad
[   74.013893]  [<ffffffff8020caf3>] apic_timer_interrupt+0x13/0x20
[   74.013893]  <EOI> <0>Code: 8b 47 08 4a 8b 0c 00 48 85 c9 0f 84 c3 00 00 00 49 8b 47 10 4a 8b 04 00 48 8b 80 a8 00 00 00 48 85 c0 74 13 48 0f af 45 c8 31 d2 <48> f7 75 c0 49 89 c6 48 89 c6 eb 16 48 63 55 d0 48 8b 45 c8 45 
[   74.013893] RIP  [<ffffffff802375d8>] tg_shares_up+0x113/0x1f1
[   74.013893]  RSP <ffff88025e057d88>
[   74.020022] ---[ end trace 2fc4046e394f2312 ]---
[   74.020188] Kernel panic - not syncing: Fatal exception in interrupt

config: http://userweb.kernel.org/~akpm/config-akpm2.txt

I'll try hacking some div-by-zero avoidance into update_group_shares_cpu().

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu
  2008-12-19 22:19     ` Andrew Morton
@ 2008-12-19 22:27       ` Ingo Molnar
  2008-12-19 22:31         ` Ingo Molnar
  0 siblings, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2008-12-19 22:27 UTC (permalink / raw)
  To: Andrew Morton, Yinghai Lu
  Cc: svaidy, linux-kernel, suresh.b.siddha, venkatesh.pallipadi,
	a.p.zijlstra, dipankar, balbir, vatsa, ego, andi, davecb,
	tconnors, maxk, gregory.haskins, pavel, rusty


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri, 19 Dec 2008 13:55:08 -0800
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > On Thu, 18 Dec 2008 23:26:22 +0530
> > Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:
> > 
> > > When the system utilisation is low and more cpus are idle,
> > > then the process waking up from sleep should prefer to
> > > wakeup an idle cpu from semi-idle cpu package (multi core
> > > package) rather than a completely idle cpu package which
> > > would waste power.
> > > 
> > > Use the sched_mc balance logic in find_busiest_group() to
> > > nominate a preferred wakeup cpu.
> > > 
> > > This info can be stored in an appropriate sched_domain, but
> > > updating this info in all copies of sched_domain is not
> > > practical.  Hence this information is stored in root_domain
> > > struct which is one copy per partitioned sched domain.
> > > The root_domain can be accessed from each cpu's runqueue
> > > and there is one copy per partitioned sched domain.
> > > 
> > 
> > kernel/sched.c: In function 'find_busiest_group':
> > kernel/sched.c:3403: warning: passing argument 1 of '__first_cpu' from incompatible pointer type
> > 
> > 
> > kernel/sched.c: In function 'schedule':
> > kernel/sched.c:3679: warning: 'active_balance' may be used uninitialized in this function
> > 
> 
> I left those unfixed and...
> 
> [   62.307607] initcall init_nfsd+0x0/0xe2 [nfsd] returned 0 after 251 usecs
> [   62.399426] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
> [   74.009934] divide error: 0000 [#1] SMP 
> [   74.010085] last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:01:00.0/0000:02:02.0/0000:05:00.1/irq
> [   74.010147] CPU 1 
> [   74.010241] Modules linked in: nfsd auth_rpcgss exportfs lockd nfs_acl autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 dm_mirror dm_region_hash dm_log dm_multipath dm_mod rfkill input_polldev sbs sbshc battery ac parport_pc lp parport snd_hda_codec_realtek sg snd_hda_intel floppy snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device ide_cd_mod cdrom snd_pcm_oss snd_mixer_oss serio_raw snd_pcm button shpchp snd_timer i2c_i801 i2c_core snd soundcore snd_page_alloc pcspkr ehci_hcd ohci_hcd uhci_hcd
> [   74.012821] Pid: 0, comm: swapper Not tainted 2.6.28-rc9-mm1 #1
> [   74.012875] RIP: 0010:[<ffffffff802375d8>]  [<ffffffff802375d8>] tg_shares_up+0x113/0x1f1
> [   74.012982] RSP: 0018:ffff88025e057d88  EFLAGS: 00010246
> [   74.013038] RAX: 0000000000000000 RBX: ffff880028063900 RCX: ffff880028058dc0
> [   74.013095] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000ff
> [   74.013152] RBP: ffff88025e057dd8 R08: 0000000000000000 R09: 0000000000000000
> [   74.013209] R10: ffff880028067640 R11: 0000000000000022 R12: ffff880028063900
> [   74.013264] R13: ffffffff802374c5 R14: ffffffff8022c546 R15: ffffffff8079b4a0
> [   74.013321] FS:  0000000000000000(0000) GS:ffff88025e0385c0(0000) knlGS:0000000000000000
> [   74.013381] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [   74.013435] CR2: 00007fdb795af8f0 CR3: 0000000000201000 CR4: 00000000000006e0
> [   74.013489] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   74.013545] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [   74.013602] Process swapper (pid: 0, threadinfo ffff88025e050000, task ffff88025e03f870)
> [   74.013661] Stack:
> [   74.013708]  0000000000000000 ffff880028063a00 0000000000000000 0000000000000000
> [   74.013893]  0000000000000008 ffffffff8079b4a0 ffff880028063900 ffffffff802374c5
> [   74.013893]  ffffffff8022c546 0000000000000030 ffff88025e057e08 ffffffff8022c530
> [   74.013893] Call Trace:
> [   74.013893]  <IRQ> <0> [<ffffffff802374c5>] ? tg_shares_up+0x0/0x1f1
> [   74.013893]  [<ffffffff8022c546>] ? tg_nop+0x0/0x8
> [   74.013893]  [<ffffffff8022c530>] walk_tg_tree+0x5f/0x75
> [   74.013893]  [<ffffffff8022c65c>] update_shares+0x46/0x4a
> [   74.013893]  [<ffffffff80235fe4>] run_rebalance_domains+0x192/0x522
> [   74.013893]  [<ffffffff802543eb>] ? getnstimeofday+0x3d/0x9e
> [   74.013893]  [<ffffffff8052bd73>] ? _spin_unlock_irq+0x9/0xc
> [   74.013893]  [<ffffffff8024057f>] __do_softirq+0xa3/0x164
> [   74.013893]  [<ffffffff8020d0bc>] call_softirq+0x1c/0x28
> [   74.013893]  [<ffffffff8020e19d>] do_softirq+0x31/0x73
> [   74.013893]  [<ffffffff802402ca>] irq_exit+0x3f/0x41
> [   74.013893]  [<ffffffff8021e71d>] smp_apic_timer_interrupt+0x94/0xad
> [   74.013893]  [<ffffffff8020caf3>] apic_timer_interrupt+0x13/0x20
> [   74.013893]  <EOI> <0>Code: 8b 47 08 4a 8b 0c 00 48 85 c9 0f 84 c3 00 00 00 49 8b 47 10 4a 8b 04 00 48 8b 80 a8 00 00 00 48 85 c0 74 13 48 0f af 45 c8 31 d2 <48> f7 75 c0 49 89 c6 48 89 c6 eb 16 48 63 55 d0 48 8b 45 c8 45 
> [   74.013893] RIP  [<ffffffff802375d8>] tg_shares_up+0x113/0x1f1
> [   74.013893]  RSP <ffff88025e057d88>
> [   74.020022] ---[ end trace 2fc4046e394f2312 ]---
> [   74.020188] Kernel panic - not syncing: Fatal exception in interrupt
> 
> config: http://userweb.kernel.org/~akpm/config-akpm2.txt
> 
> I'll try hacking some div-by-zero avoidance into update_group_shares_cpu().

hm, weird - Yinghai noticed this crash in -tip and we have that patch 
zapped from all -*-next branches already.

	Ingo


* Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu
  2008-12-19 22:27       ` Ingo Molnar
@ 2008-12-19 22:31         ` Ingo Molnar
  2008-12-19 22:38           ` Andrew Morton
  0 siblings, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2008-12-19 22:31 UTC (permalink / raw)
  To: Andrew Morton, Yinghai Lu
  Cc: svaidy, linux-kernel, suresh.b.siddha, venkatesh.pallipadi,
	a.p.zijlstra, dipankar, balbir, vatsa, ego, andi, davecb,
	tconnors, maxk, gregory.haskins, pavel, rusty


* Ingo Molnar <mingo@elte.hu> wrote:

> > [   74.013893]  <EOI> <0>Code: 8b 47 08 4a 8b 0c 00 48 85 c9 0f 84 c3 00 00 00 49 8b 47 10 4a 8b 04 00 48 8b 80 a8 00 00 00 48 85 c0 74 13 48 0f af 45 c8 31 d2 <48> f7 75 c0 49 89 c6 48 89 c6 eb 16 48 63 55 d0 48 8b 45 c8 45 
> > [   74.013893] RIP  [<ffffffff802375d8>] tg_shares_up+0x113/0x1f1
> > [   74.013893]  RSP <ffff88025e057d88>
> > [   74.020022] ---[ end trace 2fc4046e394f2312 ]---
> > [   74.020188] Kernel panic - not syncing: Fatal exception in interrupt
> > 
> > config: http://userweb.kernel.org/~akpm/config-akpm2.txt
> > 
> > I'll try hacking some div-by-zero avoidance into update_group_shares_cpu().
> 
> hm, weird - Yinghai noticed this crash in -tip and we have that patch 
> zapped from all -*-next branches already.

I guess you applied Ken's patch from email and had it in -mm? We have a 
new version of that patch from Ken now based on Yinghai's bugreport - only 
lightly tested for now.

	Ingo

----------->
>From d71f5a7c8bf9cd7c74159a53e522e363f2eddaf5 Mon Sep 17 00:00:00 2001
From: Ken Chen <kenchen@google.com>
Date: Fri, 19 Dec 2008 10:11:50 -0800
Subject: [PATCH] sched: fix uneven per-cpu task_group share distribution

Impact: fix group scheduling behavior

While testing the CFS scheduler on the linux-2.6-tip tree:

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

we found that a task pinned to a CPU could be starved relative to its
allocated fair share.

The per-cpu sched_entity load-share calculation in tg_shares_up /
update_group_shares_cpu distributes the task_group's total share among
all CPUs in a given sched domain.  This dilutes the task_group's
per-cpu share, because share is handed even to CPUs that carry no load
for the group; the trapped share is unconsumable and leads to
fair-share starvation on the runnable CPUs.  Peter was right that the
low-level function still needs to distinguish between a boosted share
for CPUs without any load and the actual tg share to be distributed
among the CPUs on which the tg is running.

This patch adds that boost.  The scheduler should boost at most one
tg's worth of shares in total, spread across all the idle CPUs that
carry no load for the given task_group, in order to bound the maximum
temporary boost a task_group can receive.

Signed-off-by: Ken Chen <kenchen@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |   49 ++++++++++++++++++++++++++++++-------------------
 1 files changed, 30 insertions(+), 19 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index ae5ca3f..7d07c97 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1475,24 +1475,34 @@ static void __set_se_shares(struct sched_entity *se, unsigned long shares);
  * Calculate and set the cpu's group shares.
  */
 static void
-update_group_shares_cpu(struct task_group *tg, int cpu,
+update_group_shares_cpu(struct task_group *tg, int cpu, unsigned long boost,
 			unsigned long sd_shares, unsigned long sd_rq_weight)
 {
-	unsigned long shares;
+	unsigned long shares, raw_shares;
 	unsigned long rq_weight;
 
 	if (!tg->se[cpu])
 		return;
 
 	rq_weight = tg->cfs_rq[cpu]->rq_weight;
-
-	/*
-	 *           \Sum shares * rq_weight
-	 * shares =  -----------------------
-	 *               \Sum rq_weight
-	 *
-	 */
-	shares = (sd_shares * rq_weight) / sd_rq_weight;
+	if (rq_weight && sd_rq_weight) {
+		/*
+		 *           \Sum shares * rq_weight
+		 * shares =  -----------------------
+		 *               \Sum rq_weight
+		 *
+		 */
+		raw_shares = (sd_shares * rq_weight) / sd_rq_weight;
+		shares = raw_shares;
+	} else {
+		/*
+		 * If there are currently no tasks on the cpu pretend there
+		 * is one of average load so that when a new task gets to
+		 * run here it will not get delayed by group starvation.
+		 */
+		raw_shares = 0;
+		shares = boost;
+	}
 	shares = clamp_t(unsigned long, shares, MIN_SHARES, MAX_SHARES);
 
 	if (abs(shares - tg->se[cpu]->load.weight) >
@@ -1501,7 +1511,7 @@ update_group_shares_cpu(struct task_group *tg, int cpu,
 		unsigned long flags;
 
 		spin_lock_irqsave(&rq->lock, flags);
-		tg->cfs_rq[cpu]->shares = shares;
+		tg->cfs_rq[cpu]->shares = raw_shares;
 
 		__set_se_shares(tg->se[cpu], shares);
 		spin_unlock_irqrestore(&rq->lock, flags);
@@ -1517,18 +1527,14 @@ static int tg_shares_up(struct task_group *tg, void *data)
 {
 	unsigned long weight, rq_weight = 0;
 	unsigned long shares = 0;
+	unsigned long boost;
 	struct sched_domain *sd = data;
-	int i;
+	int i, no_load_count = 0;
 
 	for_each_cpu(i, sched_domain_span(sd)) {
-		/*
-		 * If there are currently no tasks on the cpu pretend there
-		 * is one of average load so that when a new task gets to
-		 * run here it will not get delayed by group starvation.
-		 */
 		weight = tg->cfs_rq[i]->load.weight;
 		if (!weight)
-			weight = NICE_0_LOAD;
+			no_load_count++;
 
 		tg->cfs_rq[i]->rq_weight = weight;
 		rq_weight += weight;
@@ -1541,8 +1547,13 @@ static int tg_shares_up(struct task_group *tg, void *data)
 	if (!sd->parent || !(sd->parent->flags & SD_LOAD_BALANCE))
 		shares = tg->shares;
 
+	if (no_load_count)
+		boost = shares / no_load_count;
+	else
+		boost = shares / cpumask_weight(sched_domain_span(sd));
+
 	for_each_cpu(i, sched_domain_span(sd))
-		update_group_shares_cpu(tg, i, shares, rq_weight);
+		update_group_shares_cpu(tg, i, boost, shares, rq_weight);
 
 	return 0;
 }


* Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu
  2008-12-19 22:31         ` Ingo Molnar
@ 2008-12-19 22:38           ` Andrew Morton
  2008-12-19 22:54             ` Ingo Molnar
  0 siblings, 1 reply; 62+ messages in thread
From: Andrew Morton @ 2008-12-19 22:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: yinghai, svaidy, linux-kernel, suresh.b.siddha,
	venkatesh.pallipadi, a.p.zijlstra, dipankar, balbir, vatsa, ego,
	andi, davecb, tconnors, maxk, gregory.haskins, pavel, rusty

On Fri, 19 Dec 2008 23:31:30 +0100
Ingo Molnar <mingo@elte.hu> wrote:

> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > > [   74.013893]  <EOI> <0>Code: 8b 47 08 4a 8b 0c 00 48 85 c9 0f 84 c3 00 00 00 49 8b 47 10 4a 8b 04 00 48 8b 80 a8 00 00 00 48 85 c0 74 13 48 0f af 45 c8 31 d2 <48> f7 75 c0 49 89 c6 48 89 c6 eb 16 48 63 55 d0 48 8b 45 c8 45 
> > > [   74.013893] RIP  [<ffffffff802375d8>] tg_shares_up+0x113/0x1f1
> > > [   74.013893]  RSP <ffff88025e057d88>
> > > [   74.020022] ---[ end trace 2fc4046e394f2312 ]---
> > > [   74.020188] Kernel panic - not syncing: Fatal exception in interrupt
> > > 
> > > config: http://userweb.kernel.org/~akpm/config-akpm2.txt
> > > 
> > > I'll try hacking some div-by-zero avoidance into update_group_shares_cpu().
> > 
> > hm, weird - Yinghai noticed this crash in -tip and we have that patch 
> > zapped from all -*-next branches already.
> 
> I guess you applied Ken's patch from email and had it in -mm?

Nope.

> We have a 
> new version of that patch from Ken now based on Yinghai's bugreport - only 
> lightly tested for now.
> 
> 	Ingo
> 
> ----------->
> >From d71f5a7c8bf9cd7c74159a53e522e363f2eddaf5 Mon Sep 17 00:00:00 2001
> From: Ken Chen <kenchen@google.com>
> Date: Fri, 19 Dec 2008 10:11:50 -0800
> Subject: [PATCH] sched: fix uneven per-cpu task_group share distribution

linux-next has:

commit 6eed5aaacae0ced9a43a923633befd2c695d6620
Author: Ken Chen <kenchen@google.com>
Date:   Mon Dec 15 23:37:50 2008 -0800

    sched: fix uneven per-cpu task_group share distribution

so I guess we have a latency problem.


* Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu
  2008-12-19 22:38           ` Andrew Morton
@ 2008-12-19 22:54             ` Ingo Molnar
  0 siblings, 0 replies; 62+ messages in thread
From: Ingo Molnar @ 2008-12-19 22:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: yinghai, svaidy, linux-kernel, suresh.b.siddha,
	venkatesh.pallipadi, a.p.zijlstra, dipankar, balbir, vatsa, ego,
	andi, davecb, tconnors, maxk, gregory.haskins, pavel, rusty


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri, 19 Dec 2008 23:31:30 +0100
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > * Ingo Molnar <mingo@elte.hu> wrote:
> > 
> > > > [   74.013893]  <EOI> <0>Code: 8b 47 08 4a 8b 0c 00 48 85 c9 0f 84 c3 00 00 00 49 8b 47 10 4a 8b 04 00 48 8b 80 a8 00 00 00 48 85 c0 74 13 48 0f af 45 c8 31 d2 <48> f7 75 c0 49 89 c6 48 89 c6 eb 16 48 63 55 d0 48 8b 45 c8 45 
> > > > [   74.013893] RIP  [<ffffffff802375d8>] tg_shares_up+0x113/0x1f1
> > > > [   74.013893]  RSP <ffff88025e057d88>
> > > > [   74.020022] ---[ end trace 2fc4046e394f2312 ]---
> > > > [   74.020188] Kernel panic - not syncing: Fatal exception in interrupt
> > > > 
> > > > config: http://userweb.kernel.org/~akpm/config-akpm2.txt
> > > > 
> > > > I'll try hacking some div-by-zero avoidance into update_group_shares_cpu().
> > > 
> > > hm, weird - Yinghai noticed this crash in -tip and we have that patch 
> > > zapped from all -*-next branches already.
> > 
> > I guess you applied Ken's patch from email and had it in -mm?
> 
> Nope.
> 
> > We have a 
> > new version of that patch from Ken now based on Yinghai's bugreport - only 
> > lightly tested for now.
> > 
> > 	Ingo
> > 
> > ----------->
> > >From d71f5a7c8bf9cd7c74159a53e522e363f2eddaf5 Mon Sep 17 00:00:00 2001
> > From: Ken Chen <kenchen@google.com>
> > Date: Fri, 19 Dec 2008 10:11:50 -0800
> > Subject: [PATCH] sched: fix uneven per-cpu task_group share distribution
> 
> linux-next has:
> 
> commit 6eed5aaacae0ced9a43a923633befd2c695d6620
> Author: Ken Chen <kenchen@google.com>
> Date:   Mon Dec 15 23:37:50 2008 -0800
> 
>     sched: fix uneven per-cpu task_group share distribution
> 
> so I guess we have a latency problem.

ah, yes. I did this 22 hours ago:

| commit ea52f4ea5666ffed175ec41efac52b560818563d
| Author: Ingo Molnar <mingo@elte.hu>
| Date:   Fri Dec 19 02:09:20 2008 +0100
|
|     Revert "sched: fix uneven per-cpu task_group share distribution"
|    
|     This reverts commit 6eed5aaacae0ced9a43a923633befd2c695d6620.
|    
|     Causes crashes.

So should be fixed in the next linux-next iteration.

	Ingo


* Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu
  2008-12-19 21:55   ` Andrew Morton
  2008-12-19 22:19     ` Andrew Morton
@ 2008-12-20  4:36     ` Vaidyanathan Srinivasan
  2008-12-20  4:44       ` Andrew Morton
  1 sibling, 1 reply; 62+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-12-20  4:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, suresh.b.siddha, venkatesh.pallipadi, a.p.zijlstra,
	mingo, dipankar, balbir, vatsa, ego, andi, davecb, tconnors,
	maxk, gregory.haskins, pavel, Rusty Russell

* Andrew Morton <akpm@linux-foundation.org> [2008-12-19 13:55:08]:

> On Thu, 18 Dec 2008 23:26:22 +0530
> Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:
> 
> > When the system utilisation is low and more cpus are idle,
> > then the process waking up from sleep should prefer to
> > wakeup an idle cpu from semi-idle cpu package (multi core
> > package) rather than a completely idle cpu package which
> > would waste power.
> > 
> > Use the sched_mc balance logic in find_busiest_group() to
> > nominate a preferred wakeup cpu.
> > 
> > > This info can be stored in an appropriate sched_domain, but
> > updating this info in all copies of sched_domain is not
> > practical.  Hence this information is stored in root_domain
> > struct which is one copy per partitioned sched domain.
> > The root_domain can be accessed from each cpu's runqueue
> > and there is one copy per partitioned sched domain.
> > 
> 
> kernel/sched.c: In function 'find_busiest_group':
> kernel/sched.c:3403: warning: passing argument 1 of '__first_cpu' from incompatible pointer type
> 
> Due to
> 
> 	first_cpu(group_leader->cpumask);
> 
> apparently because Rusty changed sched_group.cpumask into a plain old
> array and nobody tests their stuff against the tree into which it is
> actually integrated :(

Hi Andrew,

I agree.  These are integration issues and I will test better next time.

There were two such bugs, which Ingo fixed in -tip:

commit 220e7f617826e0527bbc523ba859f6a4bae0bfe1
Author: Ingo Molnar <mingo@elte.hu>
Date:   Fri Dec 19 00:53:40 2008 +0100

    sched: fix warning in kernel/sched.c
    
    Impact: fix cpumask conversion bug
    
    this warning:
    
      kernel/sched.c: In function find_busiest_group:
      kernel/sched.c:3429: warning: passing argument 1 of __first_cpu from incompatible pointer type
    
    shows that we forgot to convert a new patch to the new cpumask APIs.
    
    Signed-off-by: Ingo Molnar <mingo@elte.hu>


commit edb4c71953409c1deac1a80528ac0aa768762b33
Author: Ingo Molnar <mingo@elte.hu>
Date:   Thu Dec 18 21:30:23 2008 +0100

    sched: move test_sd_parent() to an SMP section of sched.h
    
    Impact: build fix
    
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

I have tested the sched-tip with these fixes.

> 
> kernel/sched.c: In function 'schedule':
> kernel/sched.c:3679: warning: 'active_balance' may be used uninitialized in this function
> 
> This warning is correct - the code is buggy.

Yes this is my code bug.  I did not see the warning in sched.c.  Is
there any build option that I need to pass in order to get -Wall
effect?

Here is the fix, initialising active_balance to 0.

Thanks for the detailed review.  I will work to improve the quality of
my submission.

--Vaidy

    sched: bug fix -- initialise the active_balance variable
    
    In sched.c's load_balance_newidle(), the active_balance variable
    may be used uninitialised.
    
    Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>

diff --git a/kernel/sched.c b/kernel/sched.c
index e6a88bf..a21fe6d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3692,7 +3692,7 @@ redo:
 	}
 
 	if (!ld_moved) {
-		int active_balance;
+		int active_balance = 0;
 
 		schedstat_inc(sd, lb_failed[CPU_NEWLY_IDLE]);
 		if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&


* Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu
  2008-12-20  4:36     ` Vaidyanathan Srinivasan
@ 2008-12-20  4:44       ` Andrew Morton
  2008-12-20  7:54         ` Ingo Molnar
  2008-12-20 10:02         ` Vaidyanathan Srinivasan
  0 siblings, 2 replies; 62+ messages in thread
From: Andrew Morton @ 2008-12-20  4:44 UTC (permalink / raw)
  To: svaidy
  Cc: linux-kernel, suresh.b.siddha, venkatesh.pallipadi, a.p.zijlstra,
	mingo, dipankar, balbir, vatsa, ego, andi, davecb, tconnors,
	maxk, gregory.haskins, pavel, Rusty Russell

On Sat, 20 Dec 2008 10:06:38 +0530 Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:

> > 
> > kernel/sched.c: In function 'schedule':
> > kernel/sched.c:3679: warning: 'active_balance' may be used uninitialized in this function
> > 
> > This warning is correct - the code is buggy.
> 
> Yes this is my code bug.  I did not see the warning in sched.c.  Is
> there any build option that I need to pass in order to get -Wall
> effect?

That was just with plain old kbuild: `make allmodconfig;make'.

That warning was produced by gcc-4.0.2.  If you're using something more
recent then gcc has regressed.


* Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu
  2008-12-20  4:44       ` Andrew Morton
@ 2008-12-20  7:54         ` Ingo Molnar
  2008-12-20 10:02         ` Vaidyanathan Srinivasan
  1 sibling, 0 replies; 62+ messages in thread
From: Ingo Molnar @ 2008-12-20  7:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: svaidy, linux-kernel, suresh.b.siddha, venkatesh.pallipadi,
	a.p.zijlstra, dipankar, balbir, vatsa, ego, andi, davecb,
	tconnors, maxk, gregory.haskins, pavel, Rusty Russell


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Sat, 20 Dec 2008 10:06:38 +0530 Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:
> 
> > > 
> > > kernel/sched.c: In function 'schedule':
> > > kernel/sched.c:3679: warning: 'active_balance' may be used uninitialized in this function
> > > 
> > > This warning is correct - the code is buggy.
> > 
> > Yes this is my code bug.  I did not see the warning in sched.c.  Is
> > there any build option that I need to pass in order to get -Wall
> > effect?
> 
> That was just with plain old kbuild: `make allmodconfig;make'.
> 
> That warning was produced by gcc-4.0.2.  If you're using something more 
> recent then gcc has regressed.

hm, it didn't trigger here with gcc 4.3 - and the condition is serious. 
I've applied the fix, thanks guys!

	Ingo


* Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu
  2008-12-20  4:44       ` Andrew Morton
  2008-12-20  7:54         ` Ingo Molnar
@ 2008-12-20 10:02         ` Vaidyanathan Srinivasan
  2008-12-20 10:36           ` Vaidyanathan Srinivasan
  1 sibling, 1 reply; 62+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-12-20 10:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, suresh.b.siddha, venkatesh.pallipadi, a.p.zijlstra,
	mingo, dipankar, balbir, vatsa, ego, andi, davecb, tconnors,
	maxk, gregory.haskins, pavel, Rusty Russell

* Andrew Morton <akpm@linux-foundation.org> [2008-12-19 20:44:55]:

> On Sat, 20 Dec 2008 10:06:38 +0530 Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:
> 
> > > 
> > > kernel/sched.c: In function 'schedule':
> > > kernel/sched.c:3679: warning: 'active_balance' may be used uninitialized in this function
> > > 
> > > This warning is correct - the code is buggy.
> > 
> > Yes this is my code bug.  I did not see the warning in sched.c.  Is
> > there any build option that I need to pass in order to get -Wall
> > effect?
> 
> That was just with plain old kbuild: `make allmodconfig;make'.
> 
> That warning was produced by gcc-4.0.2.  If you're using something more
> recent then gcc has regressed.

This is an interesting problem.  I am unable to get that warning in the
following GCC versions even with a combination of the following
cmdline options: -Wall -Wextra -Wuninitialized

gcc version 4.1.2 20070502
gcc version 4.2.3

I will look for older gcc and check this out.

--Vaidy



* Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu
  2008-12-20 10:02         ` Vaidyanathan Srinivasan
@ 2008-12-20 10:36           ` Vaidyanathan Srinivasan
  2008-12-20 10:56             ` Vaidyanathan Srinivasan
  0 siblings, 1 reply; 62+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-12-20 10:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, suresh.b.siddha, venkatesh.pallipadi, a.p.zijlstra,
	mingo, dipankar, balbir, vatsa, ego, andi, davecb, tconnors,
	maxk, gregory.haskins, pavel, Rusty Russell

* Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> [2008-12-20 15:32:28]:

> * Andrew Morton <akpm@linux-foundation.org> [2008-12-19 20:44:55]:
> 
> > On Sat, 20 Dec 2008 10:06:38 +0530 Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:
> > 
> > > > 
> > > > kernel/sched.c: In function 'schedule':
> > > > kernel/sched.c:3679: warning: 'active_balance' may be used uninitialized in this function
> > > > 
> > > > This warning is correct - the code is buggy.
> > > 
> > > Yes this is my code bug.  I did not see the warning in sched.c.  Is
> > > there any build option that I need to pass in order to get -Wall
> > > effect?
> > 
> > That was just with plain old kbuild: `make allmodconfig;make'.
> > 
> > That warning was produced by gcc-4.0.2.  If you're using something more
> > recent then gcc has regressed.
> 
> This is an interesting problem.  I am unable to get that warning in the
> following GCC versions even with a combination of the following
> cmdline options: -Wall -Wextra -Wuninitialized
> 
> gcc version 4.1.2 20070502
> gcc version 4.2.3
> 
> I will look for older gcc and check this out.

I am able to get the above warning in 
gcc version 3.4.6 (Debian 3.4.6-9)
with default kbuild, no additional params.

Did not get the warning in 
gcc version 4.1.3 20080623

Hence there has been some change in GCC that prevented the
uninitialized warning.

--Vaidy


* Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu
  2008-12-20 10:36           ` Vaidyanathan Srinivasan
@ 2008-12-20 10:56             ` Vaidyanathan Srinivasan
  2008-12-21  8:46               ` Ingo Molnar
  0 siblings, 1 reply; 62+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-12-20 10:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, suresh.b.siddha, venkatesh.pallipadi, a.p.zijlstra,
	mingo, dipankar, balbir, vatsa, ego, andi, davecb, tconnors,
	maxk, gregory.haskins, pavel, Rusty Russell

* Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> [2008-12-20 16:06:02]:

> * Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> [2008-12-20 15:32:28]:
> 
> > * Andrew Morton <akpm@linux-foundation.org> [2008-12-19 20:44:55]:
> > 
> > > On Sat, 20 Dec 2008 10:06:38 +0530 Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:
> > > 
> > > > > 
> > > > > kernel/sched.c: In function 'schedule':
> > > > > kernel/sched.c:3679: warning: 'active_balance' may be used uninitialized in this function
> > > > > 
> > > > > This warning is correct - the code is buggy.
> > > > 
> > > > Yes this is my code bug.  I did not see the warning in sched.c.  Is
> > > > there any build option that I need to pass in order to get -Wall
> > > > effect?
> > > 
> > > That was just with plain old kbuild: `make allmodconfig;make'.
> > > 
> > > That warning was produced by gcc-4.0.2.  If you're using something more
> > > recent then gcc has regressed.
> > 
> > This is an interesting problem.  I am unable to get that warning in the
> > following GCC versions even with a combination of the following
> > cmdline options: -Wall -Wextra -Wuninitialized
> > 
> > gcc version 4.1.2 20070502
> > gcc version 4.2.3
> > 
> > I will look for older gcc and check this out.
> 
> I am able to get the above warning in 
> gcc version 3.4.6 (Debian 3.4.6-9)
> with default kbuild, no additional params.
> 
> Did not get the warning in 
> gcc version 4.1.3 20080623
> 
> Hence there has been some change in GCC that prevented the
> uninitialized warning.

Related GCC bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=179
Claimed to have been fixed in GCC 4.4, but I don't know if we are ready
to test kernel compile with GCC 4.4.

I have tested till gcc version 4.3.2 (Debian 4.3.2-1) and the bug
is not fixed.  I do not get the warning from sched.c.

--Vaidy


* Re: [PATCH v7 4/8] sched: nominate preferred wakeup cpu
  2008-12-20 10:56             ` Vaidyanathan Srinivasan
@ 2008-12-21  8:46               ` Ingo Molnar
  0 siblings, 0 replies; 62+ messages in thread
From: Ingo Molnar @ 2008-12-21  8:46 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: Andrew Morton, linux-kernel, suresh.b.siddha,
	venkatesh.pallipadi, a.p.zijlstra, dipankar, balbir, vatsa, ego,
	andi, davecb, tconnors, maxk, gregory.haskins, pavel,
	Rusty Russell


* Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:

> * Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> [2008-12-20 16:06:02]:
> 
> > * Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> [2008-12-20 15:32:28]:
> > 
> > > * Andrew Morton <akpm@linux-foundation.org> [2008-12-19 20:44:55]:
> > > 
> > > > On Sat, 20 Dec 2008 10:06:38 +0530 Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:
> > > > 
> > > > > > 
> > > > > > kernel/sched.c: In function 'schedule':
> > > > > > kernel/sched.c:3679: warning: 'active_balance' may be used uninitialized in this function
> > > > > > 
> > > > > > This warning is correct - the code is buggy.
> > > > > 
> > > > > Yes this is my code bug.  I did not see the warning in sched.c.  Is
> > > > > there any build option that I need to pass in order to get -Wall
> > > > > effect?
> > > > 
> > > > That was just with plain old kbuild: `make allmodconfig;make'.
> > > > 
> > > > That warning was produced by gcc-4.0.2.  If you're using something more
> > > > recent then gcc has regressed.
> > > 
> > > This is an interesting problem.  I am unable to get that warning in the
> > > following GCC versions even with a combination of the following
> > > cmdline options: -Wall -Wextra -Wuninitialized
> > > 
> > > gcc version 4.1.2 20070502
> > > gcc version 4.2.3
> > > 
> > > I will look for older gcc and check this out.
> > 
> > I am able to get the above warning in 
> > gcc version 3.4.6 (Debian 3.4.6-9)
> > with default kbuild, no additional params.
> > 
> > Did not get the warning in 
> > gcc version 4.1.3 20080623
> > 
> > Hence there has been some change in GCC that prevented the
> > uninitialized warning.
> 
> Related GCC bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=179 Claimed 
> to have been fixed in GCC 4.4, but I don't know if we are ready to test 
> kernel compile with GCC 4.4.

4.4 is still way too noisy on x86 so not really practical.

> I have tested till gcc version 4.3.2 (Debian 4.3.2-1) and the bug is not 
> fixed.  I do not get the warning from sched.c.

neither did i get that warning. But Andrew did so thankfully, so we've got 
the fix :) We'll always need variance in testing.

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2008-12-18 17:55 [PATCH v7 0/8] Tunable sched_mc_power_savings=n Vaidyanathan Srinivasan
                   ` (8 preceding siblings ...)
  2008-12-18 20:19 ` [PATCH v7 0/8] Tunable sched_mc_power_savings=n Ingo Molnar
@ 2008-12-29 23:43 ` MinChan Kim
  2008-12-30  2:48   ` Balbir Singh
  9 siblings, 1 reply; 62+ messages in thread
From: MinChan Kim @ 2008-12-29 23:43 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi,
	Peter Zijlstra, Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky, Gregory Haskins, Pavel Machek, Andrew Morton

Hi, Vaidyanathan.
It's a very late response. :(

> Results:
> --------
>
> Basic functionality of the code has not changed and the power vs
> performance benefits for kernbench are similar to the ones posted
> earlier.
>
> KERNBENCH Runs: make -j4 on a x86 8 core, dual socket quad core cpu
> package system
>
> SchedMC Run Time     Package Idle    Energy  Power
> 0       81.68        52.83% 54.71%  1.00x J 1.00y W
> 1       80.70        36.62% 70.11%  0.95x J 0.96y W
> 2       74.95        19.53% 85.92%  0.90x J 0.98y W
>
> The results are marginally better than the previous version of the
> patch series which could be within the test variation.
>
> Please feel free to test, and let me know your comments and feedback.
> I will post more experimental results with various benchmarks.

Your results are very interesting.
Level 2 is both faster and more power efficient.

What is the major contributor to the lower run time at level 2?
I think it is that cache bouncing takes less time than before.
Is that right?

I want to test SCHED_MC, but I don't know what you used to benchmark power.
How do I get the data for 'Package, Idle, Energy, Power'?

-- 
Kind regards,
MinChan Kim

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2008-12-29 23:43 ` MinChan Kim
@ 2008-12-30  2:48   ` Balbir Singh
  2008-12-30  6:21     ` Ingo Molnar
  2008-12-30 17:31     ` Vaidyanathan Srinivasan
  0 siblings, 2 replies; 62+ messages in thread
From: Balbir Singh @ 2008-12-30  2:48 UTC (permalink / raw)
  To: MinChan Kim
  Cc: Vaidyanathan Srinivasan, Linux Kernel, Suresh B Siddha,
	Venkatesh Pallipadi, Peter Zijlstra, Ingo Molnar, Dipankar Sarma,
	Vatsa, Gautham R Shenoy, Andi Kleen, David Collier-Brown,
	Tim Connors, Max Krasnyansky, Gregory Haskins, Pavel Machek,
	Andrew Morton

* MinChan Kim <minchan.kim@gmail.com> [2008-12-30 08:43:58]:

> Hi, Vaidyanathan.
> It's very late reponse. :(
> 
> > Results:
> > --------
> >
> > Basic functionality of the code has not changed and the power vs
> > performance benefits for kernbench are similar to the ones posted
> > earlier.
> >
> > KERNBENCH Runs: make -j4 on a x86 8 core, dual socket quad core cpu
> > package system
> >
> > SchedMC Run Time     Package Idle    Energy  Power
> > 0       81.68        52.83% 54.71%  1.00x J 1.00y W
> > 1       80.70        36.62% 70.11%  0.95x J 0.96y W
> > 2       74.95        19.53% 85.92%  0.90x J 0.98y W
> >
> > The results are marginally better than the previous version of the
> > patch series which could be within the test variation.
> >
> > Please feel free to test, and let me know your comments and feedback.
> > I will post more experimental results with various benchmarks.
> 
> Your result is very interesting.
> level 2 is more fast and efficient of power.
> 
> What's major contributor to use less time in level 2?
> I think it's cache bounce is less time than old.
> Is right ?
>

Yes, correct
 
> I want to test SCHED_MC but I don't know what you use to benchmark about power.
> How do I get the data about 'Package, Idle, Energy, Power'?
> 

Note, it is Package Idle (one value for each of the two packages); the
box is an x86-64 8 core, dual socket, quad core system.  It is not
Package, Idle.

For Energy and Power you need a means of measuring power like a meter.

> -- 
> Kinds regards,
> MinChan Kim
> 

-- 
	Balbir

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2008-12-30  2:48   ` Balbir Singh
@ 2008-12-30  6:21     ` Ingo Molnar
  2008-12-30  6:44       ` Balbir Singh
  2008-12-30 18:07       ` Vaidyanathan Srinivasan
  2008-12-30 17:31     ` Vaidyanathan Srinivasan
  1 sibling, 2 replies; 62+ messages in thread
From: Ingo Molnar @ 2008-12-30  6:21 UTC (permalink / raw)
  To: Balbir Singh; +Cc: linux-kernel, Peter Zijlstra, Andrew Morton


* Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> > > KERNBENCH Runs: make -j4 on a x86 8 core, dual socket quad core cpu
> > > package system
> > >
> > > SchedMC Run Time     Package Idle    Energy  Power
> > > 0       81.68        52.83% 54.71%  1.00x J 1.00y W
> > > 1       80.70        36.62% 70.11%  0.95x J 0.96y W
> > > 2       74.95        19.53% 85.92%  0.90x J 0.98y W

> > Your result is very interesting.
> > level 2 is more fast and efficient of power.
> > 
> > What's major contributor to use less time in level 2?
> > I think it's cache bounce is less time than old.
> > Is right ?
> 
> Yes, correct

yes, i too noticed that runtime improved so dramatically: +7.5% on 
kernbench is a _very_ big deal.

So i wanted to ask you to re-test whether this speedup is reproducible, 
and if yes, please check a few other workloads (for example sysbench on 
postgresql / mysqld) and send a patch that changes the 
sched_mc_power_savings default to POWERSAVINGS_BALANCE_WAKEUP (2).

Sidenote, _please_ fix your mail client to not mess up the Cc: list via:

Mail-Followup-To: MinChan Kim <minchan.kim@gmail.com>,
        Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>,
        Linux Kernel <linux-kernel@vger.kernel.org>,
        Suresh B Siddha <suresh.b.siddha@intel.com>,
        Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>,
        Peter Zijlstra <a.p.zijlstra@chello.nl>,
        Ingo Molnar <mingo@elte.hu>, Dipankar Sarma <dipankar@in.ibm.com>,
        Vatsa <vatsa@linux.vnet.ibm.com>, Gautham R Shenoy <ego@in.ibm.com>,
        Andi Kleen <andi@firstfloor.org>,
        David Collier-Brown <davecb@sun.com>,
        Tim Connors <tconnors@astro.swin.edu.au>,
        Max Krasnyansky <maxk@qualcomm.com>,
        Gregory Haskins <gregory.haskins@gmail.com>,
        Pavel Machek <pavel@suse.cz>,
        Andrew Morton <akpm@linux-foundation.org>

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2008-12-30  6:21     ` Ingo Molnar
@ 2008-12-30  6:44       ` Balbir Singh
  2008-12-30  7:20         ` Ingo Molnar
  2008-12-30 18:07       ` Vaidyanathan Srinivasan
  1 sibling, 1 reply; 62+ messages in thread
From: Balbir Singh @ 2008-12-30  6:44 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Peter Zijlstra, Andrew Morton

* Ingo Molnar <mingo@elte.hu> [2008-12-30 07:21:39]:

> 
> * Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > > > KERNBENCH Runs: make -j4 on a x86 8 core, dual socket quad core cpu
> > > > package system
> > > >
> > > > SchedMC Run Time     Package Idle    Energy  Power
> > > > 0       81.68        52.83% 54.71%  1.00x J 1.00y W
> > > > 1       80.70        36.62% 70.11%  0.95x J 0.96y W
> > > > 2       74.95        19.53% 85.92%  0.90x J 0.98y W
> 
> > > Your result is very interesting.
> > > level 2 is more fast and efficient of power.
> > > 
> > > What's major contributor to use less time in level 2?
> > > I think it's cache bounce is less time than old.
> > > Is right ?
> > 
> > Yes, correct
> 
> yes, i too noticed that runtime improved so dramatically: +7.5% on 
> kernbench is a _very_ big deal.
> 

Yes, it is.  I think one way to identify it on the x86 box would be to
use the performance counter patches; I'll see if I can get it working.

> So i wanted to ask you to re-test whether this speedup is reproducible, 
> and if yes, please check a few other workloads (for example sysbench on 
> postgresql / mysqld) and send a patch that changes the 
> sched_mc_power_savings default to POWERSAVINGS_BALANCE_WAKEUP (2).
> 

We'll do that; Vaidyanathan is on vacation and will be back in the
first week of January.  We'll report back with more tests and results.
From code review, it would seem that the benefits are going to be
workload specific, but the results will show for certain.

> Sidenote, _please_ fix your mail client to not mess up the Cc: list via:
>

Thanks, mutt has this turned on by default; I am turning it off and
hope that the Followup-To goes away.
 

-- 
	Balbir

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2008-12-30  6:44       ` Balbir Singh
@ 2008-12-30  7:20         ` Ingo Molnar
  0 siblings, 0 replies; 62+ messages in thread
From: Ingo Molnar @ 2008-12-30  7:20 UTC (permalink / raw)
  To: Balbir Singh; +Cc: linux-kernel, Peter Zijlstra, Andrew Morton


* Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * Ingo Molnar <mingo@elte.hu> [2008-12-30 07:21:39]:
> 
> > 
> > * Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> > > > > KERNBENCH Runs: make -j4 on a x86 8 core, dual socket quad core cpu
> > > > > package system
> > > > >
> > > > > SchedMC Run Time     Package Idle    Energy  Power
> > > > > 0       81.68        52.83% 54.71%  1.00x J 1.00y W
> > > > > 1       80.70        36.62% 70.11%  0.95x J 0.96y W
> > > > > 2       74.95        19.53% 85.92%  0.90x J 0.98y W
> > 
> > > > Your result is very interesting.
> > > > level 2 is more fast and efficient of power.
> > > > 
> > > > What's major contributor to use less time in level 2?
> > > > I think it's cache bounce is less time than old.
> > > > Is right ?
> > > 
> > > Yes, correct
> > 
> > yes, i too noticed that runtime improved so dramatically: +7.5% on 
> > kernbench is a _very_ big deal.
> > 
> 
> Yes, it is, I think one way to identify it on the x86 box would be to 
> use the performance counter patches, I'll see it I can get it working.

yeah, good idea, and it's easy to get perf-counters working: just boot 
tip/master on a Core2 (or later) Intel CPU [see below about how to use 
perfcounters on other CPUs], and pick up timec.c:

   http://redhat.com/~mingo/perfcounters/timec.c

then run a kernel build like this:

   $ ./timec -e -5,-4,-3,0,1,2,3,4,5 make -j16 bzImage

that's all. You should get an array of metrics like this:

 [...]
Kernel: arch/x86/boot/bzImage is ready  (#28)

 Performance counter stats for 'make':

  628315.871980  task clock ticks     (msecs)

          42330  CPU migrations       (events)
         124980  context switches     (events)
       18698292  pagefaults           (events)
  1351875946010  CPU cycles           (events)
  1121901478363  instructions         (events)
    10654788968  cache references     (events)
      633581867  cache misses         (events)
   247508675357  branches             (events)
    21567244144  branch misses        (events)

 Wall-clock time elapsed: 118348.109066 msecs

I'd guess the CPU migrations, context-switches, cycles+instructions 
counters are the most interesting ones. I.e. on a Core2 a good metric is 
probably:

   $ ./timec -e -5,-4,0,1 make -j16 bzImage

[ If you see any weirdnesses in the numbers, then you should first suspect 
  some perf-counters bug. Please report any problems to us! ]

If your systems are not Core2 based then patches to extend perfcounters 
support to those CPUs are welcome as well ;-)

NOTE: the sw counters (migrations, per task cost) are available on all 
CPUs. I.e. you can always do:

   $ ./timec -e -5,-4,-3 make -j16 bzImage

regardless of CPU type.

And note that the wall-clock and task-clock results in this case will be 
far more accurate than normal 'time make -j16 bzImage' output.

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2008-12-30  2:48   ` Balbir Singh
  2008-12-30  6:21     ` Ingo Molnar
@ 2008-12-30 17:31     ` Vaidyanathan Srinivasan
  1 sibling, 0 replies; 62+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-12-30 17:31 UTC (permalink / raw)
  To: MinChan Kim, Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi,
	Peter Zijlstra, Ingo Molnar, Dipankar Sarma, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky, Gregory Haskins, Pavel Machek, Andrew Morton

* Balbir Singh <balbir@linux.vnet.ibm.com> [2008-12-30 08:18:19]:

> * MinChan Kim <minchan.kim@gmail.com> [2008-12-30 08:43:58]:
> 
> > Hi, Vaidyanathan.
> > It's very late reponse. :(
> > 
> > > Results:
> > > --------
> > >
> > > Basic functionality of the code has not changed and the power vs
> > > performance benefits for kernbench are similar to the ones posted
> > > earlier.
> > >
> > > KERNBENCH Runs: make -j4 on a x86 8 core, dual socket quad core cpu
> > > package system
> > >
> > > SchedMC Run Time     Package Idle    Energy  Power
> > > 0       81.68        52.83% 54.71%  1.00x J 1.00y W
> > > 1       80.70        36.62% 70.11%  0.95x J 0.96y W
> > > 2       74.95        19.53% 85.92%  0.90x J 0.98y W
> > >
> > > The results are marginally better than the previous version of the
> > > patch series which could be within the test variation.
> > >
> > > Please feel free to test, and let me know your comments and feedback.
> > > I will post more experimental results with various benchmarks.
> > 
> > Your result is very interesting.
> > level 2 is more fast and efficient of power.
> > 
> > What's major contributor to use less time in level 2?
> > I think it's cache bounce is less time than old.
> > Is right ?
> >
> 
> Yes, correct
> 
> > I want to test SCHED_MC but I don't know what you use to benchmark about power.
> > How do I get the data about 'Package, Idle, Energy, Power'?
> > 
> 
> Note, it is Package Idle (for both packages), it is a x86-64 8 core,
> dual socket, quad core box. It is not Package, Idle.
> 
> For Energy and Power you need a means of measuring power like a meter.
> 

Hi MinChan,

Thank you for your interest in sched_mc power saving feature.  As
Balbir has mentioned, you will need a power measurement infrastructure
like an external power meter.

Laptops have battery discharge rate measurement, which is a good
approximation of power consumption.  But that is not helpful for
testing sched_mc, since we would need a multi-socket, multi-core
system to see the power saving benefit from the enhancements.

The 'package idle' information comes from /proc/stat by adding up the
idle times from various logical CPUs belonging to a single physical
package.  All logical CPUs belonging to a single physical package can
be identified from /proc/cpuinfo or
/sys/devices/system/cpu/cpu<n>/topology/physical_package_id

--Vaidy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2008-12-30  6:21     ` Ingo Molnar
  2008-12-30  6:44       ` Balbir Singh
@ 2008-12-30 18:07       ` Vaidyanathan Srinivasan
  2009-01-02  7:26         ` Vaidyanathan Srinivasan
  1 sibling, 1 reply; 62+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-12-30 18:07 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Balbir Singh, linux-kernel, Peter Zijlstra, Andrew Morton

* Ingo Molnar <mingo@elte.hu> [2008-12-30 07:21:39]:

> 
> * Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > > > KERNBENCH Runs: make -j4 on a x86 8 core, dual socket quad core cpu
> > > > package system
> > > >
> > > > SchedMC Run Time     Package Idle    Energy  Power
> > > > 0       81.68        52.83% 54.71%  1.00x J 1.00y W
> > > > 1       80.70        36.62% 70.11%  0.95x J 0.96y W
> > > > 2       74.95        19.53% 85.92%  0.90x J 0.98y W
> 
> > > Your result is very interesting.
> > > level 2 is more fast and efficient of power.
> > > 
> > > What's major contributor to use less time in level 2?
> > > I think it's cache bounce is less time than old.
> > > Is right ?
> > 
> > Yes, correct
> 
> yes, i too noticed that runtime improved so dramatically: +7.5% on 
> kernbench is a _very_ big deal.
> 
> So i wanted to ask you to re-test whether this speedup is reproducible, 
> and if yes, please check a few other workloads (for example sysbench on 
> postgresql / mysqld) and send a patch that changes the 
> sched_mc_power_savings default to POWERSAVINGS_BALANCE_WAKEUP (2).

The speedup for kernbench is reproducible.  I will post a more
detailed test report on kernbench soon.  The power-performance benefit
is for an under-utilised (nr_cpus/2 threads) run of kernbench, which
is ideal for demonstrating the power savings feature.  I will also try
sysbench and update the results a little later.  As Balbir mentioned,
I am on vacation and travelling.  I will post more benchmark results
as soon as possible.

--Vaidy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2008-12-30 18:07       ` Vaidyanathan Srinivasan
@ 2009-01-02  7:26         ` Vaidyanathan Srinivasan
  2009-01-02 22:16           ` Ingo Molnar
  0 siblings, 1 reply; 62+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-01-02  7:26 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Balbir Singh, linux-kernel, Peter Zijlstra, Andrew Morton

* Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> [2008-12-30 23:37:22]:

> * Ingo Molnar <mingo@elte.hu> [2008-12-30 07:21:39]:
> 
> > 
> > * Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> > > > > KERNBENCH Runs: make -j4 on a x86 8 core, dual socket quad core cpu
> > > > > package system
> > > > >
> > > > > SchedMC Run Time     Package Idle    Energy  Power
> > > > > 0       81.68        52.83% 54.71%  1.00x J 1.00y W
> > > > > 1       80.70        36.62% 70.11%  0.95x J 0.96y W
> > > > > 2       74.95        19.53% 85.92%  0.90x J 0.98y W
> > 
> > > > Your result is very interesting.
> > > > level 2 is more fast and efficient of power.
> > > > 
> > > > What's major contributor to use less time in level 2?
> > > > I think it's cache bounce is less time than old.
> > > > Is right ?
> > > 
> > > Yes, correct
> > 
> > yes, i too noticed that runtime improved so dramatically: +7.5% on 
> > kernbench is a _very_ big deal.
> > 
> > So i wanted to ask you to re-test whether this speedup is reproducible, 
> > and if yes, please check a few other workloads (for example sysbench on 
> > postgresql / mysqld) and send a patch that changes the 
> > sched_mc_power_savings default to POWERSAVINGS_BALANCE_WAKEUP (2).
> 
> The speedup for kernbench is reproducible.  I will post a more
> detailed test report on kernbench soon.  The power-performance benefit
> is for an under-utilised system (nr_cpus/2) run of kernbench which is
> very ideal to demonstrate the power savings feature.  I will also try
> sysbench and update results little later.  As Balbir mentioned, I am
> on vacation and traveling.  I will post more benchmark results as soon
> as possible.

Hi Ingo,

I ran my kernbench test setup over multiple iterations with several
thread counts.  The performance will almost max out at 8-10 threads of
kernbench since the system has total of 8 logical CPUs.

The power saving benefit is expected to be greatest at around 30-50%
system utilisation (3-5 threads).

I have posted the normalised results in various tables below.
Apparently the sched_mc=2 setting is helping in the 8-thread case as
well, due to better cache line sharing.  I will work with Balbir to
see if we can prove this assertion using performance counters.  I will
start looking at sysbench and try to get benchmark results and
power/energy data in similar configurations.

--Vaidy

All these tests were done on sched-tip at 
commit: fe235c6b2a4b9eb11909fe40249abcce67f7d45d 
uname -a is 2.6.28-rc8-tip-sv-05664-gfe235c6-dirty

Kernbench was run on the 2.6.28 source, while the previous tests used
the 2.6.25 kernel, which had far less stuff to compile in defconfig :)

System configuration is x86 quad core dual socket system 
with 8 logical CPUs

Each result is an average of 5 iterations.

Kernbench threads = 2

SchedMC Run Time     Package Idle(%)    Energy(J)  Power(W)
0	217.37	     73.30 76.93	1.00	   1.00
1	204.68	     50.86 99.91	0.95	   1.01
2	204.49	     50.69 99.93	0.95	   1.01

Kernbench threads = 4

SchedMC Run Time     Package Idle(%)    Energy(J)  Power(W)
0	109.11	     51.99 54.91	1.00	   1.00
1	109.71	     35.86 71.88	0.96	   0.96
2	102.95	     25.02 82.47	0.92	   0.97

Kernbench threads = 6

SchedMC Run Time     Package Idle(%)    Energy(J)  Power(W)
0	74.77	     35.29 36.37	1.00	   1.00
1	74.54	     28.83 40.90	0.99	   0.99
2	73.38	     21.09 47.29	0.98	   1.00

Kernbench threads = 8

SchedMC Run Time     Package Idle(%)    Energy(J)  Power(W)
0	63.18	     26.23 28.67	1.00	   1.00
1	56.89	     19.85 22.38 	0.93	   1.02
2	55.27	     17.65 19.84	0.91	   1.03


Relative measures for sched_mc=1 compared to baseline sched_mc=0

Threads=>	2	4	6	8	10	12
KernbenchTime	0.94	1.01	1.00	0.90	0.93	0.95
Energy		0.95	0.96	0.99	0.93	0.95	0.97
Power		1.01	0.96	0.99	1.02	1.02	1.01

Relative measures for sched_mc=2 compared to baseline sched_mc=0

Threads=>	2	4	6	8	10	12
KernbenchTime	0.94	0.94	0.98	0.87	0.92	0.94
Energy		0.95	0.92	0.98	0.91	0.94	0.96
Power		1.01	0.97	1.00	1.03	1.02	1.02


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2009-01-02  7:26         ` Vaidyanathan Srinivasan
@ 2009-01-02 22:16           ` Ingo Molnar
  2009-01-03  7:29             ` Mike Galbraith
  0 siblings, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2009-01-02 22:16 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan, Mike Galbraith
  Cc: Balbir Singh, linux-kernel, Peter Zijlstra, Andrew Morton


* Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:

> * Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> [2008-12-30 23:37:22]:
> 
> > * Ingo Molnar <mingo@elte.hu> [2008-12-30 07:21:39]:
> > 
> > > 
> > > * Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > 
> > > > > > KERNBENCH Runs: make -j4 on a x86 8 core, dual socket quad core cpu
> > > > > > package system
> > > > > >
> > > > > > SchedMC Run Time     Package Idle    Energy  Power
> > > > > > 0       81.68        52.83% 54.71%  1.00x J 1.00y W
> > > > > > 1       80.70        36.62% 70.11%  0.95x J 0.96y W
> > > > > > 2       74.95        19.53% 85.92%  0.90x J 0.98y W
> > > 
> > > > > Your result is very interesting.
> > > > > level 2 is more fast and efficient of power.
> > > > > 
> > > > > What's major contributor to use less time in level 2?
> > > > > I think it's cache bounce is less time than old.
> > > > > Is right ?
> > > > 
> > > > Yes, correct
> > > 
> > > yes, i too noticed that runtime improved so dramatically: +7.5% on 
> > > kernbench is a _very_ big deal.
> > > 
> > > So i wanted to ask you to re-test whether this speedup is reproducible, 
> > > and if yes, please check a few other workloads (for example sysbench on 
> > > postgresql / mysqld) and send a patch that changes the 
> > > sched_mc_power_savings default to POWERSAVINGS_BALANCE_WAKEUP (2).
> > 
> > The speedup for kernbench is reproducible.  I will post a more
> > detailed test report on kernbench soon.  The power-performance benefit
> > is for an under-utilised system (nr_cpus/2) run of kernbench which is
> > very ideal to demonstrate the power savings feature.  I will also try
> > sysbench and update results little later.  As Balbir mentioned, I am
> > on vacation and traveling.  I will post more benchmark results as soon
> > as possible.
> 
> Hi Ingo,
> 
> I ran my kernbench test setup over multiple iterations with several
> thread counts.  The performance will almost max out at 8-10 threads of
> kernbench since the system has total of 8 logical CPUs.
> 
> The power saving benefits is expected to be maximum around 30-50%
> system utilisation (3-5 threads).
> 
> I have posted the normalised results in various tables below.
> Apparently the sched_mc=2 settings is helping at 8 threads case also
> due to better cache line sharing.  I will work with Balbir to see if
> we can prove this assertion using performance counters.  I will start
> looking at sysbench and try to get benchmark results and power/energy
> data in similar configurations.
> 
> --Vaidy
> 
> All these tests were done on sched-tip at 
> commit: fe235c6b2a4b9eb11909fe40249abcce67f7d45d 
> uname -a is 2.6.28-rc8-tip-sv-05664-gfe235c6-dirty
> 
> Kernbench was run on source 2.6.28 while the previous tests used
> 2.6.25 kernel which had far less stuff to compile in defconfig :)
> 
> System configuration is x86 quad core dual socket system 
> with 8 logical CPUs
> 
> Each result is an average of 5 iterations.
> 
> Kernbench threads = 2
> 
> SchedMC Run Time     Package Idle(%)    Energy(J)  Power(W)
> 0	217.37	     73.30 76.93	1.00	   1.00
> 1	204.68	     50.86 99.91	0.95	   1.01
> 2	204.49	     50.69 99.93	0.95	   1.01
> 
> Kernbench threads = 4
> 
> SchedMC Run Time     Package Idle(%)    Energy(J)  Power(W)
> 0	109.11	     51.99 54.91	1.00	   1.00
> 1	109.71	     35.86 71.88	0.96	   0.96
> 2	102.95	     25.02 82.47	0.92	   0.97
> 
> Kernbench threads = 6
> 
> SchedMC Run Time     Package Idle(%)    Energy(J)  Power(W)
> 0	74.77	     35.29 36.37	1.00	   1.00
> 1	74.54	     28.83 40.90	0.99	   0.99
> 2	73.38	     21.09 47.29	0.98	   1.00
> 
> Kernbench threads = 8
> 
> SchedMC Run Time     Package Idle(%)    Energy(J)  Power(W)
> 0	63.18	     26.23 28.67	1.00	   1.00
> 1	56.89	     19.85 22.38 	0.93	   1.02
> 2	55.27	     17.65 19.84	0.91	   1.03

looks very promising.

> Relative measures for sched_mc=1 compared to baseline sched_mc=0
> 
> Threads=>	2	4	6	8	10	12
> KernbenchTime	0.94	1.01	1.00	0.90	0.93	0.95
> Energy		0.95	0.96	0.99	0.93	0.95	0.97
> Power		1.01	0.96	0.99	1.02	1.02	1.01
> 
> Relative measures for sched_mc=2 compared to baseline sched_mc=0
> 
> Threads=>	2	4	6	8	10	12
> KernbenchTime	0.94	0.94	0.98	0.87	0.92	0.94
> Energy		0.95	0.92	0.98	0.91	0.94	0.96
> Power		1.01	0.97	1.00	1.03	1.02	1.02

That's icing on the cake.

Mike, would you be interested in having a look at sched_mc=2 as a 
kernel-wide default - and giving it your blessing if you find it to be a 
net improvement for the various performance and interactivity tests you do?

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2009-01-02 22:16           ` Ingo Molnar
@ 2009-01-03  7:29             ` Mike Galbraith
  2009-01-03 10:16               ` Vaidyanathan Srinivasan
  0 siblings, 1 reply; 62+ messages in thread
From: Mike Galbraith @ 2009-01-03  7:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Vaidyanathan Srinivasan, Balbir Singh, linux-kernel,
	Peter Zijlstra, Andrew Morton

On Fri, 2009-01-02 at 23:16 +0100, Ingo Molnar wrote:

> Mike, would you be interesting in having a look at sched_mc=2 as a 
> kernel-wide default - and give it your blessing if you find it to be a net 
> improvement for the various performance and interactivity tests you do?

Sure.

	-Mike


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2009-01-03  7:29             ` Mike Galbraith
@ 2009-01-03 10:16               ` Vaidyanathan Srinivasan
  2009-01-03 11:22                 ` Mike Galbraith
  0 siblings, 1 reply; 62+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-01-03 10:16 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Ingo Molnar, Balbir Singh, linux-kernel, Peter Zijlstra, Andrew Morton

* Mike Galbraith <efault@gmx.de> [2009-01-03 08:29:25]:

> On Fri, 2009-01-02 at 23:16 +0100, Ingo Molnar wrote:
> 
> > Mike, would you be interesting in having a look at sched_mc=2 as a 
> > kernel-wide default - and give it your blessing if you find it to be a net 
> > improvement for the various performance and interactivity tests you do?
> 
> Sure.

Thanks Mike and Ingo.  I will be glad to help with test and benchmarks
on the platforms that I have access.

I am presently working on sysbench.

--Vaidy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2009-01-03 10:16               ` Vaidyanathan Srinivasan
@ 2009-01-03 11:22                 ` Mike Galbraith
  2009-01-04 15:00                   ` Mike Galbraith
  0 siblings, 1 reply; 62+ messages in thread
From: Mike Galbraith @ 2009-01-03 11:22 UTC (permalink / raw)
  To: svaidy
  Cc: Ingo Molnar, Balbir Singh, linux-kernel, Peter Zijlstra, Andrew Morton

On Sat, 2009-01-03 at 15:46 +0530, Vaidyanathan Srinivasan wrote:
> * Mike Galbraith <efault@gmx.de> [2009-01-03 08:29:25]:
> 
> > On Fri, 2009-01-02 at 23:16 +0100, Ingo Molnar wrote:
> > 
> > > Mike, would you be interesting in having a look at sched_mc=2 as a 
> > > kernel-wide default - and give it your blessing if you find it to be a net 
> > > improvement for the various performance and interactivity tests you do?
> > 
> > Sure.
> 
> Thanks Mike and Ingo.  I will be glad to help with test and benchmarks
> on the platforms that I have access.
> 
> I am presently working on sysbench.

The testing I can do is rather severely limited since I have only one
Q6600.  I butchered mc_capable() to use what I can though, ie see if
SD_BALANCE_NEWIDLE still harms tbench and mysql+oltp.  I think that's
about all I can do on my wee box.

	-Mike


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2009-01-03 11:22                 ` Mike Galbraith
@ 2009-01-04 15:00                   ` Mike Galbraith
  2009-01-04 18:19                     ` Vaidyanathan Srinivasan
  0 siblings, 1 reply; 62+ messages in thread
From: Mike Galbraith @ 2009-01-04 15:00 UTC (permalink / raw)
  To: svaidy
  Cc: Ingo Molnar, Balbir Singh, linux-kernel, Peter Zijlstra, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 2145 bytes --]

On Sat, 2009-01-03 at 12:22 +0100, Mike Galbraith wrote:
> On Sat, 2009-01-03 at 15:46 +0530, Vaidyanathan Srinivasan wrote:
> > * Mike Galbraith <efault@gmx.de> [2009-01-03 08:29:25]:
> > 
> > > On Fri, 2009-01-02 at 23:16 +0100, Ingo Molnar wrote:
> > > 
> > > > Mike, would you be interesting in having a look at sched_mc=2 as a 
> > > > kernel-wide default - and give it your blessing if you find it to be a net 
> > > > improvement for the various performance and interactivity tests you do?
> > > 
> > > Sure.
> > 
> > Thanks Mike and Ingo.  I will be glad to help with test and benchmarks
> > on the platforms that I have access.
> > 
> > I am presently working on sysbench.
> 
> The testing I can do is rather severely limited since I have only one
> Q6600.  I butchered mc_capable() to use what I can though, ie see if
> SD_BALANCE_NEWIDLE still harms tbench and mysql+oltp.  I think that's
> about all I can do on my wee box.

I do not see any difference for tbench, results are within jitter.  I do
for mysql+oltp, and the test results are fairly strange.

If you take a peek at the attached chart: the 2.6.26.8 data is with
scalability backports/fixes.  That's where our 29-to-be should be.
Domain tunings in this kernel are identical to 28/29 stock as well.

Note that there is no knee at 8 clients.  If I turn SD_BALANCE_NEWIDLE
on in 26+backports, peak drops a tad, and that knee re-appears, just as
before we turned the thing off.  Throughput after the knee also drops a
tad, nearly to the point where tip+sched_mc=2 now _comes up to_, and it
definitely is SD_BALANCE_NEWIDLE making the difference.  IOW, what used
to be a loser, and still is a loser in 26+fixes, is now a winner in tip
after the knee which should not be there.  Seems something else has
changed, re-introducing the knee, and cutting throughput a tad.

(The hefty difference at the very end I knew about.  SD_BALANCE_NEWIDLE
helps considerably when mysql is very thoroughly jammed up on itself)

When I look at that chart, it looks almost like SD_BALANCE_NEWIDLE is
partially offsetting some other problem.

I haven't done any interactivity testing yet.

	-Mike

[-- Attachment #2: zzz.jpg --]
[-- Type: image/jpeg, Size: 75980 bytes --]

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2009-01-04 15:00                   ` Mike Galbraith
@ 2009-01-04 18:19                     ` Vaidyanathan Srinivasan
  2009-01-04 19:52                       ` Mike Galbraith
  0 siblings, 1 reply; 62+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-01-04 18:19 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Ingo Molnar, Balbir Singh, linux-kernel, Peter Zijlstra, Andrew Morton

* Mike Galbraith <efault@gmx.de> [2009-01-04 16:00:00]:

> On Sat, 2009-01-03 at 12:22 +0100, Mike Galbraith wrote:
> > On Sat, 2009-01-03 at 15:46 +0530, Vaidyanathan Srinivasan wrote:
> > > * Mike Galbraith <efault@gmx.de> [2009-01-03 08:29:25]:
> > > 
> > > > On Fri, 2009-01-02 at 23:16 +0100, Ingo Molnar wrote:
> > > > 
> > > > > Mike, would you be interesting in having a look at sched_mc=2 as a 
> > > > > kernel-wide default - and give it your blessing if you find it to be a net 
> > > > > improvement for the various performance and interactivity tests you do?
> > > > 
> > > > Sure.
> > > 
> > > Thanks Mike and Ingo.  I will be glad to help with test and benchmarks
> > > on the platforms that I have access.
> > > 
> > > I am presently working on sysbench.
> > 
> > The testing I can do is rather severely limited since I have only one
> > Q6600.  I butchered mc_capable() to use what I can though, ie see if
> > SD_BALANCE_NEWIDLE still harms tbench and mysql+oltp.  I think that's
> > about all I can do on my wee box.
> 
> I do not see any difference for tbench, results are within jitter.  I do
> for mysql+oltp, and the test results are fairly strange.
> 
> If you take a peek at the attached chart: the 2.6.26.8 data is with
> scalability backports/fixes.  That's where our 29-to-be should be.
> Domain tunings in this kernel are identical to 28/29 stock as well.
> 
> Note that there is no knee at 8 clients.  If I turn SD_BALANCE_NEWIDLE
> on in 26+backports, peak drops a tad, and that knee re-appears, just as
> before we turned the thing off.  Throughput after the knee also drops a
> tad, nearly to the point where tip+sched_mc=2 now _comes up to_, and it
> definitely is SD_BALANCE_NEWIDLE making the difference.  IOW, what used
> to be a loser, and still is a loser in 26+fixes, is now a winner in tip
> after the knee which should not be there.  Seems something else has
> changed, re-introducing the knee, and cutting throughput a tad.
> 
> (The hefty difference at the very end I knew about.  SD_BALANCE_NEWIDLE
> helps considerably when mysql is very thoroughly jammed up on itself)
> 
> When I look at that chart, it looks almost like SD_BALANCE_NEWIDLE is
> partially offsetting some other problem.
> 
> I haven't done any interactivity testing yet.

Hi Mike,

Thanks for the detailed test report.

I am new to sysbench.  I just started a few OLTP runs with pgsql.  In
your graph you are plotting and comparing read/write-per-sec rather than
transactions-per-sec, though both parameters vary in a similar manner.
Can you please give me some background on why you use the
read/write-per-sec result for comparison?

I assume you have run the above tests on the Q6600 box, which has a single
quad-core package consisting of two dual-core CPUs.  Can you please
let me know the sched_domain tree that was built after hacking
mc_capable()?  The effect of sched_mc={1,2} depends on the
sched groups that were built and their flags.

As you have mentioned in your data, sched_mc=2 helps recover some
performance mainly because of NEWIDLE balance.  

You have mentioned clients on the x-axis of the graph; what is their
relation to the number of threads?

Please feel free to point me to any previous discussion on sysbench
where the above questions have been discussed.

Thanks,
Vaidy


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2009-01-04 18:19                     ` Vaidyanathan Srinivasan
@ 2009-01-04 19:52                       ` Mike Galbraith
  2009-01-05  3:20                         ` Vaidyanathan Srinivasan
  0 siblings, 1 reply; 62+ messages in thread
From: Mike Galbraith @ 2009-01-04 19:52 UTC (permalink / raw)
  To: svaidy
  Cc: Ingo Molnar, Balbir Singh, linux-kernel, Peter Zijlstra, Andrew Morton

On Sun, 2009-01-04 at 23:49 +0530, Vaidyanathan Srinivasan wrote:

> I am new to sysbench.  I just started few OLTP runs with pgsql. In
> your graph you are plotting and comparing read/write-per-sec and not
> transactions-per-sec.  Both the parameter vary in a similar manner.
> Can you please let me know some background on using the
> read/write-per-sec result for comparison.

It means nothing.  One result is the same as any other.  I prefer low
level read/write but someone else may prefer higher level transactions.

> I assume you have run the above tests on Q6600 box that has single
> quad core package that consist of two dual core CPUs.

Yes.  I can go down from there, but not up.  Oh darn.

>   Can you please
> let me know the sched_domain tree that was build by hacking
> mc_capable().  The effect of sched_mc={1,2} depends on the
> sched groups that was build and their flags.

Hm.  I don't know what you're asking.  I hacked mc_capable only so I
could have the tweakable handy in case there was something I didn't know
about yet (having _just_ jumped in with both feet;)

> As you have mentioned in your data, sched_mc=2 helps recover some
> performance mainly because of NEWIDLE balance.

Yes, odd.

> You have mentioned clients in the x-axis of the graph, what is their
> relation to the number of threads?

I have 4 cores, and 2 caches.  Any relationship is all about cache.

> Please feel free to point me to any previous discussion on sysbench
> where the above questions have been discussed.

I haven't seen such.

> Thanks,
> Vaidy

	Cheers,

	-Mike


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2009-01-04 19:52                       ` Mike Galbraith
@ 2009-01-05  3:20                         ` Vaidyanathan Srinivasan
  2009-01-05  4:40                           ` Mike Galbraith
  0 siblings, 1 reply; 62+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-01-05  3:20 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Ingo Molnar, Balbir Singh, linux-kernel, Peter Zijlstra, Andrew Morton

* Mike Galbraith <efault@gmx.de> [2009-01-04 20:52:49]:

> On Sun, 2009-01-04 at 23:49 +0530, Vaidyanathan Srinivasan wrote:
> 
> > I am new to sysbench.  I just started few OLTP runs with pgsql. In
> > your graph you are plotting and comparing read/write-per-sec and not
> > transactions-per-sec.  Both the parameter vary in a similar manner.
> > Can you please let me know some background on using the
> > read/write-per-sec result for comparison.
> 
> It means nothing.  One result is the same as any other.  I prefer low
> level read/write but someone else may prefer higher level transactions.
> 
> > I assume you have run the above tests on Q6600 box that has single
> > quad core package that consist of two dual core CPUs.
> 
> Yes.  I can go down from there, but not up.  Oh darn.
> 
> >   Can you please
> > let me know the sched_domain tree that was build by hacking
> > mc_capable().  The effect of sched_mc={1,2} depends on the
> > sched groups that was build and their flags.
> 
> Hm.  I don't know what you're asking.  I hacked mc_capable only so I
> could have the tweakable handy in case there was something I didn't know
> about yet (having _just_ jumped in with both feet;)

When CONFIG_SCHED_DEBUG is enabled, the sched domain tree is dumped
to the kernel log (dmesg):

CPU0 attaching sched-domain:
 domain 0: span 0,2-4 level MC
  groups: 0 2 3 4
  domain 1: span 0-7 level CPU
   groups: 0,2-4 1,5-7
CPU1 attaching sched-domain:
 domain 0: span 1,5-7 level MC
  groups: 1 5 6 7
  domain 1: span 0-7 level CPU
   groups: 1,5-7 0,2-4

This tree clearly depicts what the scheduler thinks of the system.
A correct sched domain and group setup is critical for performance and
power savings.  When you hack mc_capable(), the domain at the MC level
should have one group with two cores (sharing the L2 cache), and the CPU
level should have two groups, each with two cpus, representing the two
dual-core CPUs.

The above example is from my dual socket quad core system.

> > As you have mentioned in your data, sched_mc=2 helps recover some
> > performance mainly because of NEWIDLE balance.
> 
> Yes, odd.
> 
> > You have mentioned clients in the x-axis of the graph, what is their
> > relation to the number of threads?
> 
> I have 4 cores, and 2 caches.  Any relationship is all about cache.

I was actually asking about the number of software threads specified in
the sysbench benchmark.  You have run almost 256 clients on a 4-core box;
does that mean sysbench had 256 worker threads?

Thanks for the clarifications.

--Vaidy



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2009-01-05  3:20                         ` Vaidyanathan Srinivasan
@ 2009-01-05  4:40                           ` Mike Galbraith
  2009-01-05  6:36                             ` Mike Galbraith
  2009-01-06 14:54                             ` Vaidyanathan Srinivasan
  0 siblings, 2 replies; 62+ messages in thread
From: Mike Galbraith @ 2009-01-05  4:40 UTC (permalink / raw)
  To: svaidy
  Cc: Ingo Molnar, Balbir Singh, linux-kernel, Peter Zijlstra, Andrew Morton

On Mon, 2009-01-05 at 08:50 +0530, Vaidyanathan Srinivasan wrote: 
> When CONFIG_SCHED_DEBUG is enabled, the sched domain tree is dumped
> (dmesg)

Oh, that.  I'm dense <thwack>

[    0.476050] CPU0 attaching sched-domain:
[    0.476052]  domain 0: span 0-1 level MC
[    0.476054]   groups: 0 1
[    0.476057]   domain 1: span 0-3 level CPU
[    0.476058]    groups: 0-1 2-3
[    0.476062] CPU1 attaching sched-domain:
[    0.476064]  domain 0: span 0-1 level MC
[    0.476065]   groups: 1 0
[    0.476067]   domain 1: span 0-3 level CPU
[    0.476069]    groups: 0-1 2-3
[    0.476072] CPU2 attaching sched-domain:
[    0.476073]  domain 0: span 2-3 level MC
[    0.476075]   groups: 2 3
[    0.476077]   domain 1: span 0-3 level CPU
[    0.476078]    groups: 2-3 0-1
[    0.476081] CPU3 attaching sched-domain:
[    0.476083]  domain 0: span 2-3 level MC
[    0.476084]   groups: 3 2
[    0.476086]   domain 1: span 0-3 level CPU
[    0.476088]    groups: 2-3 0-1

2.6.26.8
[    0.524043] CPU0 attaching sched-domain:
[    0.524045]  domain 0: span 0-1
[    0.524046]   groups: 0 1
[    0.524049]   domain 1: span 0-3
[    0.524051]    groups: 0-1 2-3
[    0.524054] CPU1 attaching sched-domain:
[    0.524055]  domain 0: span 0-1
[    0.524056]   groups: 1 0
[    0.524059]   domain 1: span 0-3
[    0.524060]    groups: 0-1 2-3
[    0.524063] CPU2 attaching sched-domain:
[    0.524064]  domain 0: span 2-3
[    0.524065]   groups: 2 3
[    0.524068]   domain 1: span 0-3
[    0.524069]    groups: 2-3 0-1
[    0.524072] CPU3 attaching sched-domain:
[    0.524073]  domain 0: span 2-3
[    0.524075]   groups: 3 2
[    0.524077]   domain 1: span 0-3
[    0.524078]    groups: 2-3 0-1

> I was actually asking about software threads specified in the sysbench
> benchmark.  Your have run almost 256 clients on a 4 core box, does
> that mean sysbench had 256 worker threads?

Yes.

	-Mike


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2009-01-05  4:40                           ` Mike Galbraith
@ 2009-01-05  6:36                             ` Mike Galbraith
  2009-01-05 15:19                               ` Mike Galbraith
  2009-01-06 14:54                             ` Vaidyanathan Srinivasan
  1 sibling, 1 reply; 62+ messages in thread
From: Mike Galbraith @ 2009-01-05  6:36 UTC (permalink / raw)
  To: svaidy
  Cc: Ingo Molnar, Balbir Singh, linux-kernel, Peter Zijlstra, Andrew Morton

Virgin 28 does not have a knee, but grows one when NEWIDLE is enabled.
The throughput curve is essentially the mc=2 curve with the knee
smoothed away, and a higher peak.  The very end is considerably lower.
ie it only helps the extremely contested end, and not nearly as much.
The whole curve is roughly a point under the 26.8 curve (which could
easily be config differences etc, but the shape is right).

I'll rummage around.

	-Mike


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2009-01-05  6:36                             ` Mike Galbraith
@ 2009-01-05 15:19                               ` Mike Galbraith
  2009-01-06  9:31                                 ` Mike Galbraith
  0 siblings, 1 reply; 62+ messages in thread
From: Mike Galbraith @ 2009-01-05 15:19 UTC (permalink / raw)
  To: svaidy
  Cc: Ingo Molnar, Balbir Singh, linux-kernel, Peter Zijlstra, Andrew Morton

On Mon, 2009-01-05 at 07:37 +0100, Mike Galbraith wrote:

> I'll rummage around.

Seems to be about the only thing it could be, load balancing inflicting
injury on very sensitive mysql+oltp pairs.

	-Mike


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2009-01-05 15:19                               ` Mike Galbraith
@ 2009-01-06  9:31                                 ` Mike Galbraith
  2009-01-06 15:07                                   ` Vaidyanathan Srinivasan
  0 siblings, 1 reply; 62+ messages in thread
From: Mike Galbraith @ 2009-01-06  9:31 UTC (permalink / raw)
  To: svaidy
  Cc: Ingo Molnar, Balbir Singh, linux-kernel, Peter Zijlstra, Andrew Morton

On Mon, 2009-01-05 at 16:19 +0100, Mike Galbraith wrote:
> On Mon, 2009-01-05 at 07:37 +0100, Mike Galbraith wrote:
> 
> > I'll rummage around.
> 
> Seems to be about the only thing it could be, load balancing inflicting
> injury on very sensitive mysql+oltp pairs.

BTW, I verified this.  Reverting all load-balancing changes fully
restored mysql+oltp peak, and brought mid-range throughput to the same
level as sched_mc=2 except at the log-jam end.  (haven't looked at
vmark, though I'd expect it to be hurting a bit too, it's affinity
sensitive as well)

I expected sched_mc=2 to help an nfs mount kbuild, and it did, quite a
bit.  I first tried an nfs4 mount, but after a while, the odd ipv6 80%
idle problem came back, so I reverted to nfs3.  Full build time there
went from 4m25s to 4m2s.  A nice improvement.

I haven't noticed anything on the interactivity front.

Personally, I'd go for sched_mc=2 as default.  I value the fork/exec
load much more than sensitive benchmarks, though what hurts mysql+oltp
will certainly hurt others as well.  We have a bit of conflict between
keeping CPUs busy and affinity cost.  Something to work on.

	-Mike


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2009-01-05  4:40                           ` Mike Galbraith
  2009-01-05  6:36                             ` Mike Galbraith
@ 2009-01-06 14:54                             ` Vaidyanathan Srinivasan
  1 sibling, 0 replies; 62+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-01-06 14:54 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Ingo Molnar, Balbir Singh, linux-kernel, Peter Zijlstra, Andrew Morton

* Mike Galbraith <efault@gmx.de> [2009-01-05 05:40:16]:

> On Mon, 2009-01-05 at 08:50 +0530, Vaidyanathan Srinivasan wrote: 
> > When CONFIG_SCHED_DEBUG is enabled, the sched domain tree is dumped
> > (dmesg)
> 
> Oh, that.  I'm dense <thwack>
> 
> [    0.476050] CPU0 attaching sched-domain:
> [    0.476052]  domain 0: span 0-1 level MC
> [    0.476054]   groups: 0 1
> [    0.476057]   domain 1: span 0-3 level CPU
> [    0.476058]    groups: 0-1 2-3
> [    0.476062] CPU1 attaching sched-domain:
> [    0.476064]  domain 0: span 0-1 level MC
> [    0.476065]   groups: 1 0
> [    0.476067]   domain 1: span 0-3 level CPU
> [    0.476069]    groups: 0-1 2-3
> [    0.476072] CPU2 attaching sched-domain:
> [    0.476073]  domain 0: span 2-3 level MC
> [    0.476075]   groups: 2 3
> [    0.476077]   domain 1: span 0-3 level CPU
> [    0.476078]    groups: 2-3 0-1
> [    0.476081] CPU3 attaching sched-domain:
> [    0.476083]  domain 0: span 2-3 level MC
> [    0.476084]   groups: 3 2
> [    0.476086]   domain 1: span 0-3 level CPU
> [    0.476088]    groups: 2-3 0-1

Hi Mike,

This seems to be correct for the configuration.  I hope this would be the
same for sched_mc=1 and sched_mc=2, since you have hacked
mc_capable().

By default, all 4 cores would be one group at the CPU level at
sched_mc={1,2}, so that packages are clearly identified in the CPU-level
sched groups.

 
> 2.6.26.8
> [    0.524043] CPU0 attaching sched-domain:
> [    0.524045]  domain 0: span 0-1
> [    0.524046]   groups: 0 1
> [    0.524049]   domain 1: span 0-3
> [    0.524051]    groups: 0-1 2-3
> [    0.524054] CPU1 attaching sched-domain:
> [    0.524055]  domain 0: span 0-1
> [    0.524056]   groups: 1 0
> [    0.524059]   domain 1: span 0-3
> [    0.524060]    groups: 0-1 2-3
> [    0.524063] CPU2 attaching sched-domain:
> [    0.524064]  domain 0: span 2-3
> [    0.524065]   groups: 2 3
> [    0.524068]   domain 1: span 0-3
> [    0.524069]    groups: 2-3 0-1
> [    0.524072] CPU3 attaching sched-domain:
> [    0.524073]  domain 0: span 2-3
> [    0.524075]   groups: 3 2
> [    0.524077]   domain 1: span 0-3
> [    0.524078]    groups: 2-3 0-1
> 
> > I was actually asking about software threads specified in the sysbench
> > benchmark.  Your have run almost 256 clients on a 4 core box, does
> > that mean sysbench had 256 worker threads?
> 
> Yes.

Let me try similar experiments on my dual socket quad core system.
I was limiting the threads to 8, assuming that the system would max out
by then.

Thanks for the updates.

--Vaidy


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2009-01-06  9:31                                 ` Mike Galbraith
@ 2009-01-06 15:07                                   ` Vaidyanathan Srinivasan
  2009-01-06 17:48                                     ` Mike Galbraith
  0 siblings, 1 reply; 62+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-01-06 15:07 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Ingo Molnar, Balbir Singh, linux-kernel, Peter Zijlstra, Andrew Morton

* Mike Galbraith <efault@gmx.de> [2009-01-06 10:31:37]:

> On Mon, 2009-01-05 at 16:19 +0100, Mike Galbraith wrote:
> > On Mon, 2009-01-05 at 07:37 +0100, Mike Galbraith wrote:
> > 
> > > I'll rummage around.
> > 
> > Seems to be about the only thing it could be, load balancing inflicting
> > injury on very sensitive mysql+oltp pairs.
> 
> BTW, I verified this.  Reverting all load-balancing changes fully
> restored mysql+oltp peak, and brought mid-range throughput to the same
> level as sched_mc=2 except at the log-jam end.  (haven't looked at
> vmark, though I'd expect it to be hurting a bit too, it's affinity
> sensitive as well)
> 
> I expected sched_mc=2 to help an nfs mount kbuild, and it did, quite a
> bit.  I first tried an nfs4 mount, but after a while, the odd ipv6 80%
> idle problem came back, so I reverted to nfs3.  Full built time there
> went from 4m25s to 4m2s.  A nice improvement.
> 
> I haven't noticed anything on the interactivity front.
> 
> Personally, I'd go for sched_mc=2 as default.  I value the fork/exec
> load much more than sensitive benchmarks, though what hurts mysql+oltp
> will certainly hurt others as well.  We have a bit of conflict between
> keeping CPUs busy and affinity cost.  Something to work on.

Hi Mike,

Thanks for the detailed benchmark reports.  Glad to hear that
sched_mc=2 is helping in most scenarios.  Though we may be tempted to
make it the default, I would still like to default to zero in order to
provide baseline performance.  I would expect end users to flip the
setting to sched_mc=2 if it helps their workload in terms of
performance and/or power savings.

The fact that sched_mc=2 provide performance and/or power saving
benefits is a good justification to include the new code and tunable.

The benefits from the sched_mc=2 setting vary widely based on
workload and system configuration.  Hence, in my opinion, I would not
want to change the default to 2 at this time, until the tunable sees
more widespread use under various workloads and system configurations.

--Vaidy


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2009-01-06 15:07                                   ` Vaidyanathan Srinivasan
@ 2009-01-06 17:48                                     ` Mike Galbraith
  2009-01-06 18:45                                       ` Balbir Singh
  0 siblings, 1 reply; 62+ messages in thread
From: Mike Galbraith @ 2009-01-06 17:48 UTC (permalink / raw)
  To: svaidy
  Cc: Ingo Molnar, Balbir Singh, linux-kernel, Peter Zijlstra, Andrew Morton

On Tue, 2009-01-06 at 20:37 +0530, Vaidyanathan Srinivasan wrote:

> Hi Mike,
> 
> Thanks for the detailed benchmark reports.  Glad to hear that
> sched_mc=2 is helping in most scenarios.  Though we would be tempted to
> make it default, I would still like to default to zero in order to
> provide base line performance.  I would expect end users to flip the
> settings to sched_mc=2 if it helps their workload in terms of
> performance and/or power savings.

The mysql+oltp peak loss is there either way, but with 2, mid-range
throughput is roughly at the 28 baseline.  The high end (yawn) is better,
and the nfs kbuild performs better than baseline.

Baseline performance, at least wrt mysql+oltp doesn't seem to be an
option.  Not my call.  More testing and more testers required I suppose.

	-Mike


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2009-01-06 17:48                                     ` Mike Galbraith
@ 2009-01-06 18:45                                       ` Balbir Singh
  2009-01-07  8:59                                         ` Mike Galbraith
  0 siblings, 1 reply; 62+ messages in thread
From: Balbir Singh @ 2009-01-06 18:45 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: svaidy, Ingo Molnar, linux-kernel, Peter Zijlstra, Andrew Morton

* Mike Galbraith <efault@gmx.de> [2009-01-06 18:48:37]:

> On Tue, 2009-01-06 at 20:37 +0530, Vaidyanathan Srinivasan wrote:
> 
> > Hi Mike,
> > 
> > Thanks for the detailed benchmark reports.  Glad to hear that
> > sched_mc=2 is helping in most scenarios.  Though we would be tempted to
> > make it default, I would still like to default to zero in order to
> > provide base line performance.  I would expect end users to flip the
> > settings to sched_mc=2 if it helps their workload in terms of
> > performance and/or power savings.
> 
> The mysql+oltp peak loss is there either way, but with 2, mid range
> throughput is ~28 baseline.  High end (yawn) is better, and the nfs
> kbuild performs better than baseline.
> 
> Baseline performance, at least wrt mysql+oltp doesn't seem to be an
> option.  Not my call.  More testing and more testers required I suppose.

Yes, more testing is definitely due. I'd like to hear from people
with larger and newer boxes as well before I would be comfortable making
sched_mc=2 as default.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2009-01-06 18:45                                       ` Balbir Singh
@ 2009-01-07  8:59                                         ` Mike Galbraith
  2009-01-07 11:26                                           ` Vaidyanathan Srinivasan
  0 siblings, 1 reply; 62+ messages in thread
From: Mike Galbraith @ 2009-01-07  8:59 UTC (permalink / raw)
  To: balbir; +Cc: svaidy, Ingo Molnar, linux-kernel, Peter Zijlstra, Andrew Morton

I have a couple questions if you don't mind.

In wake_idle() there's an odd looking check for kthreads.  Why does it
matter if a kbuild task wakes kernel or userland helper thread?

I also don't see why sched_mc overrides domain tunings.  You can turn
NEWIDLE off and sched_mc remains as set, so it's a one-way override.  If
NEWIDLE is a requirement for sched_mc > 0, it seems only logical to set
sched_mc to 0 if the user explicitly disables NEWIDLE.

	-Mike


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2009-01-07  8:59                                         ` Mike Galbraith
@ 2009-01-07 11:26                                           ` Vaidyanathan Srinivasan
  2009-01-07 14:36                                             ` Mike Galbraith
  0 siblings, 1 reply; 62+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-01-07 11:26 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: balbir, Ingo Molnar, linux-kernel, Peter Zijlstra, Andrew Morton

* Mike Galbraith <efault@gmx.de> [2009-01-07 09:59:15]:

> I have a couple questions if you don't mind.
> 
> In wake_idle() there's an odd looking check for kthreads.  Why does it
> matter if a kbuild task wakes kernel or userland helper thread?

Only user space tasks/threads should be biased at wakeup and
consolidated, for the following reasons:

* Kernel threads generally perform housekeeping tasks, and their rate of
  execution and CPU utilisation are far lower than those of user tasks in
  an under-utilised system.
* Biasing at wakeup helps consolidate bursty user space tasks.  Long
  running user space tasks will be consolidated by the load balancer.
  kthreads are not bursty; they are generally very short running.

> I also don't see why sched_mc overrides domain tunings.  You can turn
> NEWIDLE off and sched_mc remains as set, so it's a one-way override.  If
> NEWIDLE is a requirement for sched_mc > 0, it seems only logical to set
> sched_mc to 0 if the user explicitly disables NEWIDLE.

SD_BALANCE_NEWIDLE is required for sched_mc=2 and power savings, while
it is not mandated for sched_mc=0.  We can remove NEWIDLE balance at
sched_mc=0 if that helps baseline performance.  Ingo had identified
that removing NEWIDLE balance helps performance in certain benchmarks.

I do not get what you mean by NEWIDLE off.  Is there
a separate user space control to independently set or reset
SD_BALANCE_NEWIDLE at a given sched_domain level?

Thanks for the detailed review.

--Vaidy


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2009-01-07 11:26                                           ` Vaidyanathan Srinivasan
@ 2009-01-07 14:36                                             ` Mike Galbraith
  2009-01-07 15:35                                               ` Vaidyanathan Srinivasan
  0 siblings, 1 reply; 62+ messages in thread
From: Mike Galbraith @ 2009-01-07 14:36 UTC (permalink / raw)
  To: svaidy; +Cc: balbir, Ingo Molnar, linux-kernel, Peter Zijlstra, Andrew Morton

On Wed, 2009-01-07 at 16:56 +0530, Vaidyanathan Srinivasan wrote:
> * Mike Galbraith <efault@gmx.de> [2009-01-07 09:59:15]:
> 
> > I have a couple questions if you don't mind.
> > 
> > In wake_idle() there's an odd looking check for kthreads.  Why does it
> > matter if a kbuild task wakes kernel or userland helper thread?
> 
> Only user space tasks/threads should be biased at wakeup and
> consolidated for the following reasons:
> 
> * kernel threads generally perform housekeeping tasks and its rate of
>   execution and cpu utilisation is far less than that of user task in
>   an under utilised system.  
> * Biasing at wakeup help consolidate bursty user space tasks. Long
>   running user space tasks will be consolidated by load balancer.
>   kthreads are not bursty, they are generally very short running.

Rummaging around in an nfs mount is bursty, and the nfs threads are part
of the burst.  No big deal, but I think it's illogical to differentiate.

> > I also don't see why sched_mc overrides domain tunings.  You can turn
> > NEWIDLE off and sched_mc remains as set, so it's a one-way override.  If
> > NEWIDLE is a requirement for sched_mc > 0, it seems only logical to set
> > sched_mc to 0 if the user explicitly disables NEWIDLE.
> 
> SD_BALANCE_NEWIDLE is required for sched_mc=2 and power savings while

That's a buglet then, because you can have sched_mc=2 and NEWIDLE off.

> not mandated for sched_mc=0.  We can remove NEWIDLE balance at
> sched_mc=0 if that helps baseline performance.  Ingo had identified
> that removing NEWIDLE balance help performance in certain benchmarks.

It used to help mysql+oltp low end throughput for one.  efbe027 seems to
have done away with that though (not that alone).

> I do not get what you have mentioned by NEWIDLE off?  Is there
> a separate user space control to independently set or reset
> SD_BALANCE_NEWIDLE at a give sched_domain level?

Yes. /proc/sys/kernel/sched_domain/cpuN/domainN/*
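For example, clearing SD_BALANCE_NEWIDLE through that interface might look
like the sketch below; the flag's bit value of 2 is taken from this era's
include/linux/sched.h and should be verified against your tree:

```shell
# Example flags value as read from
# /proc/sys/kernel/sched_domain/cpu0/domain0/flags
flags=1103
SD_BALANCE_NEWIDLE=2    # bit value assumed from include/linux/sched.h

# Writing the masked value back would disable newidle balance there:
#   echo $(( flags & ~SD_BALANCE_NEWIDLE )) > \
#       /proc/sys/kernel/sched_domain/cpu0/domain0/flags
echo $(( flags & ~SD_BALANCE_NEWIDLE ))     # → 1101
```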

> Thanks for the detailed review.

You're very welcome.

	-Mike

> --Vaidy


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2009-01-07 14:36                                             ` Mike Galbraith
@ 2009-01-07 15:35                                               ` Vaidyanathan Srinivasan
  2009-01-08  8:06                                                 ` Mike Galbraith
  0 siblings, 1 reply; 62+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-01-07 15:35 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: balbir, Ingo Molnar, linux-kernel, Peter Zijlstra, Andrew Morton

* Mike Galbraith <efault@gmx.de> [2009-01-07 15:36:25]:

> On Wed, 2009-01-07 at 16:56 +0530, Vaidyanathan Srinivasan wrote:
> > * Mike Galbraith <efault@gmx.de> [2009-01-07 09:59:15]:
> > 
> > > I have a couple questions if you don't mind.
> > > 
> > > In wake_idle() there's an odd looking check for kthreads.  Why does it
> > > matter if a kbuild task wakes kernel or userland helper thread?
> > 
> > Only user space tasks/threads should be biased at wakeup and
> > consolidated for the following reasons:
> > 
> > * kernel threads generally perform housekeeping tasks and its rate of
> >   execution and cpu utilisation is far less than that of user task in
> >   an under utilised system.  
> > * Biasing at wakeup help consolidate bursty user space tasks. Long
> >   running user space tasks will be consolidated by load balancer.
> >   kthreads are not bursty, they are generally very short running.
> 
> Rummaging around in an nfs mount is bursty, and the nfs threads are part
> of the burst.  No big deal, but I think it's illogical to differentiate.

nfs threads are a really good example.  I missed that.  I will play
around with nfs threads and try to optimise the condition.

> > > I also don't see why sched_mc overrides domain tunings.  You can turn
> > > NEWIDLE off and sched_mc remains as set, so it's a one-way override.  If
> > > NEWIDLE is a requirement for sched_mc > 0, it seems only logical to set
> > > sched_mc to 0 if the user explicitly disables NEWIDLE.
> > 
> > SD_BALANCE_NEWIDLE is required for sched_mc=2 and power savings while
> 
> That's a buglet then, because you can have sched_mc=2 and NEWIDLE off.

Tweaking the sched domain parameters affects the default power saving
heuristics anyway; protecting against the flags alone will not help.
Your point is valid, though.  I will think about this scenario and see
how we can protect the behaviour.
 
> > not mandated for sched_mc=0.  We can remove NEWIDLE balance at
> > sched_mc=0 if that helps baseline performance.  Ingo had identified
> > that removing NEWIDLE balance help performance in certain benchmarks.
> 
> It used to help mysql+oltp low end throughput for one.  efbe027 seems to
> have done away with that though (not that alone).
> 
> > I do not get what you have mentioned by NEWIDLE off?  Is there
> > a separate user space control to independently set or reset
> > SD_BALANCE_NEWIDLE at a give sched_domain level?
> 
> Yes. /proc/sys/kernel/sched_domain/cpuN/domainN/*

Changing flags through this interface is complete low-level control.
End users who want to tweak individual sched domain parameters are
either experimenting or optimising for their specific workload.

The defaults should generally work across common workloads.  Users can
further tweak these low-level settings.  We can clearly document the
relationships so that we know what to expect.
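For reference, the flags file under that path holds a bitmask of the
SD_* flags.  A minimal sketch of decoding it, using 2.6.28-era bit
values from include/linux/sched.h (the bit values and the example
number are assumptions and can differ between kernel versions):

```shell
# Decode a sched_domain flags bitmask.  Bit values are 2.6.28-era
# assumptions; on a live system read the real value with:
#   flags=$(cat /proc/sys/kernel/sched_domain/cpu0/domain0/flags)
SD_BALANCE_NEWIDLE=2
SD_POWERSAVINGS_BALANCE=256

flags=3183   # hypothetical example value

[ $(( flags & SD_BALANCE_NEWIDLE )) -ne 0 ] \
    && echo "SD_BALANCE_NEWIDLE on" || echo "SD_BALANCE_NEWIDLE off"
[ $(( flags & SD_POWERSAVINGS_BALANCE )) -ne 0 ] \
    && echo "SD_POWERSAVINGS_BALANCE on" || echo "SD_POWERSAVINGS_BALANCE off"
```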

sched_mc=2 depends on NEWIDLE, but the wakeup biasing part will be
enabled even without this flag.  The power savings will be marginally
better than sched_mc=1 without NEWIDLE, but the end user can have
that setup if they want to.

You have pointed out a scenario where users can turn off NEWIDLE and
still expect to use sched_mc=2.  The scenario is valid and the user
should be allowed to do so.

--Vaidy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2009-01-07 15:35                                               ` Vaidyanathan Srinivasan
@ 2009-01-08  8:06                                                 ` Mike Galbraith
  2009-01-08 17:46                                                   ` Vaidyanathan Srinivasan
  0 siblings, 1 reply; 62+ messages in thread
From: Mike Galbraith @ 2009-01-08  8:06 UTC (permalink / raw)
  To: svaidy; +Cc: balbir, Ingo Molnar, linux-kernel, Peter Zijlstra, Andrew Morton

On Wed, 2009-01-07 at 21:05 +0530, Vaidyanathan Srinivasan wrote:

> sched_mc=2 depends on NEWIDLE but the wakeup biasing part will be
> enabled even without this flag.  The power savings will be marginally
> better than sched_mc=1 without NEWIDLE.  But the end user can have
> that setup if they want to.

That implies to me that there should exist a flag SD_WAKEUP_BIAS or
whatnot.  All settings should be visible.  Currently, sched_mc > 0 turns
on NEWIDLE, which is visible, but 2 turns on this 'invisible to the
ultimate low-level control interface' thing.

	-Mike



* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2009-01-08  8:06                                                 ` Mike Galbraith
@ 2009-01-08 17:46                                                   ` Vaidyanathan Srinivasan
  2009-01-09  6:00                                                     ` Mike Galbraith
  0 siblings, 1 reply; 62+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-01-08 17:46 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: balbir, Ingo Molnar, linux-kernel, Peter Zijlstra, Andrew Morton

* Mike Galbraith <efault@gmx.de> [2009-01-08 09:06:48]:

> On Wed, 2009-01-07 at 21:05 +0530, Vaidyanathan Srinivasan wrote:
> 
> > sched_mc=2 depends on NEWIDLE but the wakeup biasing part will be
> > enabled even without this flag.  The power savings will be marginally
> > better than sched_mc=1 without NEWIDLE.  But the end user can have
> > that setup if they want to.
> 
> That implies to me that there should exist a flag SD_WAKEUP_BIAS or
> whatnot.  All settings should be visible.  Currently, sched_mc > 0 turns
> on NEWIDLE, which is visible, but 2 turns on this 'invisible to the
> ultimate low-level control interface' thing.

Having a feature flag for each power saving balance function is
possible but would be overkill, since independent control of features
at each sched domain may not yield useful configurations.  The present
power saving heuristic that gets enabled at sched_mc=1 is primarily
controlled by SD_POWERSAVE_BALANCE, but it has side effects and can be
rendered ineffective by changing other SD flags.

We did consider having an individual SD_ flag for each of the power
savings functions, but that quickly led to too many possible
configurations, most of them incorrect or ineffective.  In order to
improve the usability of the power saving features, we have proposed
to enhance the existing tunable with monotonically increasing power
savings vs performance tradeoffs.

I agree with the 'visibility' of the low-level interface that you have
pointed out.  It would be good to have the flags show up at the
low-level interface, but would end users like system administrators
want to know and tune these flags?  It makes sense for the current set
of flags, which depict important scheduler features, but the power
savings related flags only further qualify or emphasise existing
functions and control points in the scheduler.

In my opinion, adding flags for each powersaving feature will add
unnecessary complexity and may not be as useful to end users.
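The monotonic tunable argued for above can be pictured as a single
level implying a bundle of flag settings, rather than N independent
knobs.  A hypothetical sketch (the flag bundles below are illustrative
assumptions, not the kernel's actual mapping):

```shell
# Hypothetical mapping from one sched_mc level to a bundle of SD flag
# bits (2 = SD_BALANCE_NEWIDLE, 256 = SD_POWERSAVINGS_BALANCE in
# 2.6.28-era kernels); illustrative only, not the kernel's real logic.
sched_mc=2
case $sched_mc in
    0) flags=0 ;;                 # performance: no power-save balancing
    1) flags=$(( 256 )) ;;        # conservative power-save heuristic
    2) flags=$(( 256 | 2 )) ;;    # plus NEWIDLE balance and wakeup biasing
esac
echo "sched_mc=$sched_mc -> flags=$flags"
```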

--Vaidy



* Re: [PATCH v7 0/8] Tunable sched_mc_power_savings=n
  2009-01-08 17:46                                                   ` Vaidyanathan Srinivasan
@ 2009-01-09  6:00                                                     ` Mike Galbraith
  0 siblings, 0 replies; 62+ messages in thread
From: Mike Galbraith @ 2009-01-09  6:00 UTC (permalink / raw)
  To: svaidy; +Cc: balbir, Ingo Molnar, linux-kernel, Peter Zijlstra, Andrew Morton

On Thu, 2009-01-08 at 23:16 +0530, Vaidyanathan Srinivasan wrote:

> I agree with the 'visibility' to low level interface that you have
> pointed out.  It will be good to have flags showup at the low level
> interface, but do end users like system administrators would want to
> know and tune these flags?

IMHO there is no difference between these tunings and any other tuning;
the unwary admin can shoot himself in the foot in any number of ways.
A monotonic interface that sets sensible flags is fine, but invisible
flags are not.

We may have to agree to disagree.

	-Mike



end of thread, other threads:[~2009-01-09  6:00 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-12-18 17:55 [PATCH v7 0/8] Tunable sched_mc_power_savings=n Vaidyanathan Srinivasan
2008-12-18 17:56 ` [PATCH v7 1/8] sched: convert BALANCE_FOR_xx_POWER to inline functions Vaidyanathan Srinivasan
2008-12-18 17:56 ` [PATCH v7 2/8] sched: Framework for sched_mc/smt_power_savings=N Vaidyanathan Srinivasan
2008-12-18 17:56 ` [PATCH v7 3/8] sched: favour lower logical cpu number for sched_mc balance Vaidyanathan Srinivasan
2008-12-18 17:56 ` [PATCH v7 4/8] sched: nominate preferred wakeup cpu Vaidyanathan Srinivasan
2008-12-18 18:12   ` Balbir Singh
2008-12-19 21:55   ` Andrew Morton
2008-12-19 22:19     ` Andrew Morton
2008-12-19 22:27       ` Ingo Molnar
2008-12-19 22:31         ` Ingo Molnar
2008-12-19 22:38           ` Andrew Morton
2008-12-19 22:54             ` Ingo Molnar
2008-12-20  4:36     ` Vaidyanathan Srinivasan
2008-12-20  4:44       ` Andrew Morton
2008-12-20  7:54         ` Ingo Molnar
2008-12-20 10:02         ` Vaidyanathan Srinivasan
2008-12-20 10:36           ` Vaidyanathan Srinivasan
2008-12-20 10:56             ` Vaidyanathan Srinivasan
2008-12-21  8:46               ` Ingo Molnar
2008-12-18 17:56 ` [PATCH v7 5/8] sched: bias task wakeups to preferred semi-idle packages Vaidyanathan Srinivasan
2008-12-18 18:11   ` Balbir Singh
2008-12-18 17:56 ` [PATCH v7 6/8] sched: activate active load balancing in new idle cpus Vaidyanathan Srinivasan
2008-12-18 17:56 ` [PATCH v7 7/8] sched: add SD_BALANCE_NEWIDLE at MC and CPU level for sched_mc>0 Vaidyanathan Srinivasan
2008-12-18 17:56 ` [PATCH v7 8/8] sched: idle_balance() does not call load_balance_newidle() Vaidyanathan Srinivasan
2008-12-18 18:12   ` Balbir Singh
2008-12-18 20:17     ` Ingo Molnar
2008-12-18 20:19 ` [PATCH v7 0/8] Tunable sched_mc_power_savings=n Ingo Molnar
2008-12-18 20:31   ` Ingo Molnar
2008-12-19  8:29     ` Vaidyanathan Srinivasan
2008-12-19  8:24   ` Vaidyanathan Srinivasan
2008-12-19 13:34   ` Vaidyanathan Srinivasan
2008-12-29 23:43 ` MinChan Kim
2008-12-30  2:48   ` Balbir Singh
2008-12-30  6:21     ` Ingo Molnar
2008-12-30  6:44       ` Balbir Singh
2008-12-30  7:20         ` Ingo Molnar
2008-12-30 18:07       ` Vaidyanathan Srinivasan
2009-01-02  7:26         ` Vaidyanathan Srinivasan
2009-01-02 22:16           ` Ingo Molnar
2009-01-03  7:29             ` Mike Galbraith
2009-01-03 10:16               ` Vaidyanathan Srinivasan
2009-01-03 11:22                 ` Mike Galbraith
2009-01-04 15:00                   ` Mike Galbraith
2009-01-04 18:19                     ` Vaidyanathan Srinivasan
2009-01-04 19:52                       ` Mike Galbraith
2009-01-05  3:20                         ` Vaidyanathan Srinivasan
2009-01-05  4:40                           ` Mike Galbraith
2009-01-05  6:36                             ` Mike Galbraith
2009-01-05 15:19                               ` Mike Galbraith
2009-01-06  9:31                                 ` Mike Galbraith
2009-01-06 15:07                                   ` Vaidyanathan Srinivasan
2009-01-06 17:48                                     ` Mike Galbraith
2009-01-06 18:45                                       ` Balbir Singh
2009-01-07  8:59                                         ` Mike Galbraith
2009-01-07 11:26                                           ` Vaidyanathan Srinivasan
2009-01-07 14:36                                             ` Mike Galbraith
2009-01-07 15:35                                               ` Vaidyanathan Srinivasan
2009-01-08  8:06                                                 ` Mike Galbraith
2009-01-08 17:46                                                   ` Vaidyanathan Srinivasan
2009-01-09  6:00                                                     ` Mike Galbraith
2009-01-06 14:54                             ` Vaidyanathan Srinivasan
2008-12-30 17:31     ` Vaidyanathan Srinivasan
