* [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n
@ 2008-09-08 13:14 Vaidyanathan Srinivasan
  2008-09-08 13:16 ` [RFC PATCH v2 1/7] sched: arch_reinit_sched_domains() must destroy domains to force rebuild Vaidyanathan Srinivasan
                   ` (7 more replies)
  0 siblings, 8 replies; 22+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-09-08 13:14 UTC (permalink / raw)
  To: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi
  Cc: Ingo Molnar, Peter Zijlstra, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky, Vaidyanathan Srinivasan

Hi,

The existing power-saving load balancer, CONFIG_SCHED_MC, attempts to
run the system's workload on a minimum number of CPU packages and tries
to keep the rest of the CPU packages idle for longer durations.
Consolidating workloads onto fewer packages thus helps the other
packages stay idle and save power.  The current implementation is very
conservative and does not work effectively across different workloads.
The initial idea of a tunable sched_mc_power_savings=n was proposed to
enable tuning of the power-saving load balancer based on the system
configuration, workload characteristics and end-user requirements.

Please refer to the following discussions and article for details.

Making power policy just work
http://lwn.net/Articles/287924/ 
[RFC v1] Tunable sched_mc_power_savings=n
http://lwn.net/Articles/287882/

The following patch series demonstrates the basic framework for
tunable sched_mc_power_savings.

The power savings and performance of a given workload on an
under-utilised system can be controlled by writing 0, 1 or 2 to
/sys/devices/system/cpu/sched_mc_power_savings, where 0 selects highest
performance with least power savings and 2 selects maximum power
savings even at the cost of slight performance degradation.

enum powersavings_balance_level {
	POWERSAVINGS_BALANCE_NONE = 0,  /* No power saving load balance */
	POWERSAVINGS_BALANCE_BASIC,	/* Fill one thread/core/package 
					 * first for long running threads 
					 */ 
	POWERSAVINGS_BALANCE_WAKEUP,	/* Also bias task wakeups to semi-idle
					 * cpu package for power savings
					 */
	MAX_POWERSAVINGS_BALANCE_LEVELS
};

sched_mc_power_savings values of 0 and 1 are implemented and available
in the default kernel.  The new level 2 supports task wakeup biasing,
which helps consolidate very short-running bursty jobs on an almost
idle system.  This level of task consolidation packs most workloads
onto one cpu package and extends the tickless idle time of the unused
cpu packages.
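
In essence, at level 2 the fair-class wakeup path short-circuits to
the nominated cpu when enough of the system is idle.  A condensed
sketch of the check added by patches 5-6 of this series (variable
names as introduced later in the series):

	/*
	 * If the waker cpu and the task's previous cpu are both idle,
	 * run the waking task on the cpu nominated by the power-saving
	 * balancer instead of spreading it onto an idle package.
	 */
	if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP &&
	    idle_cpu(prev_cpu) && idle_cpu(this_cpu) && p->mm &&
	    cpu_isset(sched_mc_preferred_wakeup_cpu, p->cpus_allowed))
		return sched_mc_preferred_wakeup_cpu;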

Changes from the default kernel's power-save balance implementation:

* Nominate a wakeup cpu during the power-saving load balance operation
* Use the nominated cpu for power-efficient wakeup cpu selection
* Perform active load balance on newly idle cpus for aggressive task
  consolidation
* The series is against 2.6.27-rc5 plus two fixes for basic
  sched_mc_power_savings balance bugs in the mainline kernel (they
  have already been posted independently on LKML)

Results:
--------

sched_mc_power_savings=2 is expected to provide power and/or energy
savings when the overall system utilisation is less than ~50%.  At
higher system utilisation, on a small two-socket system, the
opportunity for power savings decreases, and hence this level may not
provide any further benefit compared to sched_mc_power_savings=1.

KERNBENCH runs: make -j4 on an x86 8-cpu system with two quad-core
cpu packages (dual socket)

SchedMC  Run Time  Pkg0 Idle  Pkg1 Idle  Energy   Power
0        76.88s    51.634%    53.809%    1.00x J  1.00y W
1        78.50s    42.020%    64.687%    0.97x J  0.95y W
2        73.31s    17.305%    87.765%    0.92x J  0.97y W

The above test run reports normalised power and energy values for a
4-job make on the 8-cpu system.  Typical system utilisation is ~50%.
The package idle percentages are the idle times of each of the two
quad-core cpu packages in the test system.  Energy and average power
over the test duration were measured and normalised to the baseline.
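
As a quick sanity check, relative energy equals relative power times
relative run time: 0.97 x (73.31/76.88) ~= 0.92 for sched_mc=2, and
0.95 x (78.50/76.88) ~= 0.97 for sched_mc=1, matching the energy
column above.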

sched_mc=0 is the baseline reference run.  In this particular system
setup, sched_mc=1 actually degraded performance by making the jobs
jump between cpus, while sched_mc=2 was able to consolidate better due
to task wakeup biasing, thereby improving performance for this
particular test.  sched_mc=2 wins with the least energy and the best
performance.  Its average power is higher than sched_mc=1 (but still
less than the baseline run) because the ondemand governor will have
increased the cpu frequency based on utilisation.  The cpu package
idle percentage gives an indication of the level of consolidation
obtained.  This info comes from /proc/stat snapshots on all cpus,
averaged over all cores in a package (after taking care of the
topology).
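
For reference, the measurement can be scripted along these lines (an
illustrative user-space sketch, not part of this series; it assumes
cpus 0-3 and 4-7 form the two packages, which must be checked against
the actual topology before use):

#include <stdio.h>
#include <unistd.h>

#define NR_CPUS		8
#define CORES_PER_PKG	4

/* record idle and total jiffies for each cpu from /proc/stat */
static void snap(unsigned long long idle[], unsigned long long total[])
{
	unsigned long long v[7];  /* user nice system idle iowait irq softirq */
	char name[16];
	int cpu, i;
	FILE *f = fopen("/proc/stat", "r");

	if (!f)
		return;
	while (fscanf(f, "%15s %llu %llu %llu %llu %llu %llu %llu%*[^\n]",
		      name, &v[0], &v[1], &v[2], &v[3], &v[4], &v[5],
		      &v[6]) == 8) {
		if (sscanf(name, "cpu%d", &cpu) != 1 || cpu >= NR_CPUS)
			continue;
		idle[cpu] = v[3] + v[4];	/* count iowait as idle */
		for (total[cpu] = 0, i = 0; i < 7; i++)
			total[cpu] += v[i];
	}
	fclose(f);
}

int main(void)
{
	unsigned long long i1[NR_CPUS] = {0}, t1[NR_CPUS] = {0};
	unsigned long long i2[NR_CPUS] = {0}, t2[NR_CPUS] = {0};
	int cpu, pkg;

	snap(i1, t1);
	sleep(10);			/* measurement interval */
	snap(i2, t2);

	for (pkg = 0; pkg < NR_CPUS / CORES_PER_PKG; pkg++) {
		unsigned long long idle = 0, total = 0;

		for (cpu = pkg * CORES_PER_PKG;
		     cpu < (pkg + 1) * CORES_PER_PKG; cpu++) {
			idle  += i2[cpu] - i1[cpu];
			total += t2[cpu] - t1[cpu];
		}
		printf("package %d idle: %.3f%%\n",
		       pkg, total ? 100.0 * idle / total : 0.0);
	}
	return 0;
}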

SPECjbb runs: 2 warehouses on the same x86 8-cpu, dual-socket
quad-core system; average system utilisation around 25%

SchedMC  SPECjbb ops  Watts
0        1.00x        1.00y
1        0.98x        0.98y
2        0.95x        0.95y

We can see a linear reduction in both performance and average power
consumed.  The sched_mc tunable can thus be used to trade performance
for average power consumed for this workload.

However, the results are not as good for 4 warehouses on the same
system, where the system utilisation is slightly above 50%:

SchedMC  SPECjbb ops  Pkg0 Idle  Pkg1 Idle  Power
0        1.00x        48.483%    51.306%    1.00z
1        0.92x        21.398%    79.095%    0.93z
2        0.84x        28.778%    93.282%    0.92z

There is a significant reduction in performance for a marginal 1%
further reduction in power at sched_mc=2.

These results illustrate the basic idea and the possibilities of the
tunable sched_mc_power_savings settings.  More work needs to be done to
tune the various heuristics for different workloads.  The power and
performance trade-offs are machine-configuration dependent, and hence
more experimentation is needed in order to get the design right.

Processor power-saving features like deep sleep states on server
processors will significantly improve power savings (proportional to
package idle time).  This technique is primarily intended for
multi-socket systems with multi-core cpus, where power (voltage)
control is at the per-socket level.

I will post more experimental data for different workloads and
benchmarks.  

Please let me know your comments and suggestions.

Thanks,
Vaidy
---

Gautham R Shenoy (2):
      sched: Framework for sched_mc/smt_power_savings=N
      sched: Fix __load_balance_iterator() for cfq with only one task.

Max Krasnyansky (1):
      sched: arch_reinit_sched_domains() must destroy domains to force rebuild

Vaidyanathan Srinivasan (4):
      sched: activate active load balancing in new idle cpus
      sched: bias task wakeups to preferred semi-idle packages
      sched: nominate preferred wakeup cpu
      sched: favour lower logical cpu number for sched_mc balance


 include/linux/cpuset.h |    2 +
 include/linux/sched.h  |   11 +++++++
 kernel/sched.c         |   79 +++++++++++++++++++++++++++++++++++++++++-------
 kernel/sched_fair.c    |   15 +++++++++
 4 files changed, 94 insertions(+), 13 deletions(-)

-- 


* [RFC PATCH v2 1/7] sched: arch_reinit_sched_domains() must destroy domains to force rebuild
  2008-09-08 13:14 [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n Vaidyanathan Srinivasan
@ 2008-09-08 13:16 ` Vaidyanathan Srinivasan
  2008-09-08 13:17 ` [RFC PATCH v2 2/7] sched: Fix __load_balance_iterator() for cfq with only one task Vaidyanathan Srinivasan
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-09-08 13:16 UTC (permalink / raw)
  To: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi
  Cc: Ingo Molnar, Peter Zijlstra, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky, Vaidyanathan Srinivasan

From: Max Krasnyansky <maxk@qualcomm.com>

What I realized recently is that calling rebuild_sched_domains() in
arch_reinit_sched_domains() by itself is not enough when cpusets are enabled.
The partition_sched_domains() code tries to avoid unnecessary domain
rebuilds and will not actually rebuild anything if the new domain masks
match the old ones.

What this means is that doing
     echo 1 > /sys/devices/system/cpu/sched_mc_power_savings
on a system with cpusets enabled will not take effect until something
changes in the cpuset setup (i.e. new sets are created or deleted).

This patch restores the correct behaviour: domains must be destroyed
and rebuilt in order to enable the MC powersaving flags.

Tested on a quad-core Core2 box with both CONFIG_CPUSETS and
!CONFIG_CPUSETS.  Also tested on a dual-core Core2 laptop.  Lockdep is
happy and things are working as expected.

Ingo, please apply.
btw, we also need to push my other cpuset patch into mainline.  We are
currently calling rebuild_sched_domains() without the cgroup lock, which
is bad.  When I made the original sched changes, the assumption was that
the cpuset patch would also go in.  I'm talking about
	"cpuset: Rework sched domains and CPU hotplug handling"
It's been ACKed by Paul and has been in -tip for a while now.

Reference LKML threads:

http://lkml.org/lkml/2008/8/29/191
http://lkml.org/lkml/2008/8/29/343

Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Cc: svaidy@linux.vnet.ibm.com
Cc: peterz@infradead.org
Cc: mingo@elte.hu
Tested-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---

 include/linux/cpuset.h |    2 +-
 kernel/sched.c         |   19 +++++++++++++------
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index e8f450c..2691926 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -160,7 +160,7 @@ static inline int current_cpuset_is_being_rebound(void)
 
 static inline void rebuild_sched_domains(void)
 {
-	partition_sched_domains(0, NULL, NULL);
+	partition_sched_domains(1, NULL, NULL);
 }
 
 #endif /* !CONFIG_CPUSETS */
diff --git a/kernel/sched.c b/kernel/sched.c
index 9a1ddb8..5a38540 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7637,24 +7637,27 @@ static int dattrs_equal(struct sched_domain_attr *cur, int idx_cur,
  * and partition_sched_domains() will fallback to the single partition
  * 'fallback_doms', it also forces the domains to be rebuilt.
  *
+ * If doms_new==NULL it will be replaced with cpu_online_map.
+ * ndoms_new==0 is a special case for destroying existing domains.
+ * It will not create the default domain.
+ *
  * Call with hotplug lock held
  */
 void partition_sched_domains(int ndoms_new, cpumask_t *doms_new,
 			     struct sched_domain_attr *dattr_new)
 {
-	int i, j;
+	int i, j, n;
 
 	mutex_lock(&sched_domains_mutex);
 
 	/* always unregister in case we don't destroy any domains */
 	unregister_sched_domain_sysctl();
 
-	if (doms_new == NULL)
-		ndoms_new = 0;
+	n = doms_new ? ndoms_new : 0;
 
 	/* Destroy deleted domains */
 	for (i = 0; i < ndoms_cur; i++) {
-		for (j = 0; j < ndoms_new; j++) {
+		for (j = 0; j < n; j++) {
 			if (cpus_equal(doms_cur[i], doms_new[j])
 			    && dattrs_equal(dattr_cur, i, dattr_new, j))
 				goto match1;
@@ -7667,7 +7670,6 @@ match1:
 
 	if (doms_new == NULL) {
 		ndoms_cur = 0;
-		ndoms_new = 1;
 		doms_new = &fallback_doms;
 		cpus_andnot(doms_new[0], cpu_online_map, cpu_isolated_map);
 		dattr_new = NULL;
@@ -7704,8 +7706,13 @@ match2:
 int arch_reinit_sched_domains(void)
 {
 	get_online_cpus();
+
+	/* Destroy domains first to force the rebuild */
+	partition_sched_domains(0, NULL, NULL);
+
 	rebuild_sched_domains();
 	put_online_cpus();
+
 	return 0;
 }
 
@@ -7789,7 +7796,7 @@ static int update_sched_domains(struct notifier_block *nfb,
 	case CPU_ONLINE_FROZEN:
 	case CPU_DEAD:
 	case CPU_DEAD_FROZEN:
-		partition_sched_domains(0, NULL, NULL);
+		partition_sched_domains(1, NULL, NULL);
 		return NOTIFY_OK;
 
 	default:



* [RFC PATCH v2 2/7] sched: Fix __load_balance_iterator() for cfq with only one task.
  2008-09-08 13:14 [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n Vaidyanathan Srinivasan
  2008-09-08 13:16 ` [RFC PATCH v2 1/7] sched: arch_reinit_sched_domains() must destroy domains to force rebuild Vaidyanathan Srinivasan
@ 2008-09-08 13:17 ` Vaidyanathan Srinivasan
  2008-09-08 13:18 ` [RFC PATCH v2 3/7] sched: Framework for sched_mc/smt_power_savings=N Vaidyanathan Srinivasan
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-09-08 13:17 UTC (permalink / raw)
  To: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi
  Cc: Ingo Molnar, Peter Zijlstra, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky, Vaidyanathan Srinivasan

From: Gautham R Shenoy <ego@in.ibm.com>

__load_balance_iterator() returns NULL when there's only one
sched_entity, which is a task.  This is caused by the following code path.


	/* Skip over entities that are not tasks */
	do {
		se = list_entry(next, struct sched_entity, group_node);
		next = next->next;
	} while (next != &cfs_rq->tasks && !entity_is_task(se));

	if (next == &cfs_rq->tasks)
		return NULL;
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      This will return NULL even when se is a task.

As a side effect, there has been a regression in sched_mc behaviour
since 2.6.25: iter_move_one_task(), when it calls
load_balance_start_fair(), would not get any tasks to move!

Fix this by checking if the last entity was a task or not.

Reference:
http://lkml.org/lkml/2008/9/5/135

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
---

 kernel/sched_fair.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index fb8994c..f1c96e3 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1451,7 +1451,7 @@ __load_balance_iterator(struct cfs_rq *cfs_rq, struct list_head *next)
 		next = next->next;
 	} while (next != &cfs_rq->tasks && !entity_is_task(se));
 
-	if (next == &cfs_rq->tasks)
+	if (next == &cfs_rq->tasks && !entity_is_task(se))
 		return NULL;
 
 	cfs_rq->balance_iterator = next;



* [RFC PATCH v2 3/7] sched: Framework for sched_mc/smt_power_savings=N
  2008-09-08 13:14 [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n Vaidyanathan Srinivasan
  2008-09-08 13:16 ` [RFC PATCH v2 1/7] sched: arch_reinit_sched_domains() must destroy domains to force rebuild Vaidyanathan Srinivasan
  2008-09-08 13:17 ` [RFC PATCH v2 2/7] sched: Fix __load_balance_iterator() for cfq with only one task Vaidyanathan Srinivasan
@ 2008-09-08 13:18 ` Vaidyanathan Srinivasan
  2008-09-08 13:20 ` [RFC PATCH v2 4/7] sched: favour lower logical cpu number for sched_mc balance Vaidyanathan Srinivasan
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-09-08 13:18 UTC (permalink / raw)
  To: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi
  Cc: Ingo Molnar, Peter Zijlstra, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky, Vaidyanathan Srinivasan

From: Gautham R Shenoy <ego@in.ibm.com>

***  RFC patch - work in progress, not for inclusion. ***

Currently the sched_mc/smt_power_savings variable is a boolean, which
either enables or disables topology-based power savings.  This patch
extends the variable from a boolean to a multivalued one, such that,
based on the value, we decide how aggressively we want to perform
topology-based power-savings balancing.

Variable levels of the power-savings tunable let the end user match
the required power savings vs performance trade-off to the system
configuration and workloads.

This initial version makes the sched_mc_power_savings global variable
take more values (0, 1, 2).

A later version is expected to add a new member, sd->powersavings_level,
to the multi-core-level sched_domain, and to turn every sd->flags check
for SD_POWERSAVINGS_BALANCE into a macro that checks powersavings_level
instead.

The power savings level setting should live in one place: either in the
sched_mc_power_savings global variable or within the appropriate
sched_domain structure.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---

 include/linux/sched.h |   11 +++++++++++
 kernel/sched.c        |   16 +++++++++++++---
 2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index cfb0d87..7c12240 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -721,6 +721,17 @@ enum cpu_idle_type {
 #define SD_SERIALIZE		1024	/* Only a single load balancing instance */
 #define SD_WAKE_IDLE_FAR	2048	/* Gain latency sacrificing cache hit */
 
+enum powersavings_balance_level {
+	POWERSAVINGS_BALANCE_NONE = 0,  /* No power saving load balance */
+	POWERSAVINGS_BALANCE_BASIC,	/* Fill one thread/core/package
+					 * first for long running threads
+					 */
+	POWERSAVINGS_BALANCE_WAKEUP,	/* Also bias task wakeups to semi-idle
+					 * cpu package for power savings
+					 */
+	MAX_POWERSAVINGS_BALANCE_LEVELS
+};
+
 #define BALANCE_FOR_MC_POWER	\
 	(sched_smt_power_savings ? SD_POWERSAVINGS_BALANCE : 0)
 
diff --git a/kernel/sched.c b/kernel/sched.c
index 5a38540..e535f98 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7719,14 +7719,24 @@ int arch_reinit_sched_domains(void)
 static ssize_t sched_power_savings_store(const char *buf, size_t count, int smt)
 {
 	int ret;
+	unsigned int level = 0;
 
-	if (buf[0] != '0' && buf[0] != '1')
+	sscanf(buf, "%u", &level);
+
+	/*
+	 * level is always positive, so there is no need to check for
+	 * level < POWERSAVINGS_BALANCE_NONE, which is 0.
+	 * TODO: what happens on a 0- or 1-byte write? Do we need to
+	 * check count as well?
+	 */
+
+	if (level >= MAX_POWERSAVINGS_BALANCE_LEVELS)
 		return -EINVAL;
 
 	if (smt)
-		sched_smt_power_savings = (buf[0] == '1');
+		sched_smt_power_savings = level;
 	else
-		sched_mc_power_savings = (buf[0] == '1');
+		sched_mc_power_savings = level;
 
 	ret = arch_reinit_sched_domains();
 



* [RFC PATCH v2 4/7] sched: favour lower logical cpu number for sched_mc balance
  2008-09-08 13:14 [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n Vaidyanathan Srinivasan
                   ` (2 preceding siblings ...)
  2008-09-08 13:18 ` [RFC PATCH v2 3/7] sched: Framework for sched_mc/smt_power_savings=N Vaidyanathan Srinivasan
@ 2008-09-08 13:20 ` Vaidyanathan Srinivasan
  2008-09-08 13:21 ` [RFC PATCH v2 5/7] sched: nominate preferred wakeup cpu Vaidyanathan Srinivasan
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-09-08 13:20 UTC (permalink / raw)
  To: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi
  Cc: Ingo Molnar, Peter Zijlstra, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky, Vaidyanathan Srinivasan

In case two groups have identical load, prefer to move load to the
lower logical cpu number rather than, as the present logic does, to
the higher one.

find_busiest_group() looks for a group_leader that has spare capacity
to take more tasks and free up an appropriate least-loaded group.  In
case there is a tie and the load is equal, the group with the higher
logical number is currently favoured.  This conflicts with the
user-space irqbalance daemon, which will move interrupts to the lower
logical number if the system utilisation is very low.

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---

 kernel/sched.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index e535f98..569fc8d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3237,7 +3237,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
 		 */
 		if ((sum_nr_running < min_nr_running) ||
 		    (sum_nr_running == min_nr_running &&
-		     first_cpu(group->cpumask) <
+		     first_cpu(group->cpumask) >
 		     first_cpu(group_min->cpumask))) {
 			group_min = group;
 			min_nr_running = sum_nr_running;
@@ -3253,7 +3253,7 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
 		if (sum_nr_running <= group_capacity - 1) {
 			if (sum_nr_running > leader_nr_running ||
 			    (sum_nr_running == leader_nr_running &&
-			     first_cpu(group->cpumask) >
+			     first_cpu(group->cpumask) <
 			      first_cpu(group_leader->cpumask))) {
 				group_leader = group;
 				leader_nr_running = sum_nr_running;



* Re: [RFC PATCH v2 5/7] sched: nominate preferred wakeup cpu
  2008-09-08 13:21 ` [RFC PATCH v2 5/7] sched: nominate preferred wakeup cpu Vaidyanathan Srinivasan
@ 2008-09-08 13:21   ` Peter Zijlstra
  2008-09-08 13:43     ` Vaidyanathan Srinivasan
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2008-09-08 13:21 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi, Ingo Molnar,
	Dipankar Sarma, Balbir Singh, Vatsa, Gautham R Shenoy,
	Andi Kleen, David Collier-Brown, Tim Connors, Max Krasnyansky

On Mon, 2008-09-08 at 18:51 +0530, Vaidyanathan Srinivasan wrote:
> When the system utilisation is low and more cpus are idle,
> the process waking up from sleep should prefer to wake up
> an idle cpu in a semi-idle cpu package (a multi-core
> package) rather than a completely idle cpu package, which
> would waste power.
> 
> Use the sched_mc balance logic in find_busiest_group() to
> nominate a preferred wakeup cpu.
> 
> This info could be stored in the appropriate sched_domain, but
> updating this info in all copies of the sched_domain is not
> practical.  For now let's try a global variable.
> 
> Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> ---
> 
>  kernel/sched.c |   10 ++++++++++
>  1 files changed, 10 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 569fc8d..4ae79f5 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -3380,6 +3380,9 @@ out_balanced:
>  
>  	if (this == group_leader && group_leader != group_min) {
>  		*imbalance = min_load_per_task;
> +		if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP)
> +			sched_mc_preferred_wakeup_cpu =
> +					first_cpu(group_leader->cpumask);
>  		return group_min;
>  	}
>  #endif
> @@ -6911,6 +6914,13 @@ static void sched_domain_node_span(int node, cpumask_t *span)
>  int sched_smt_power_savings = 0, sched_mc_power_savings = 0;
>  
>  /*
> + * Preferred wake up cpu nominated by sched_mc balance that will be used when
> + * most cpus are idle in the system indicating overall very low system
> + * utilisation. Triggered at POWERSAVINGS_BALANCE_WAKEUP (2).
> + */
> +unsigned int sched_mc_preferred_wakeup_cpu;

This cannot be a global variable; what happens when we have two disjoint
load-balance domains?

> +/*
>   * SMT sched-domains:
>   */
>  #ifdef CONFIG_SCHED_SMT
> 



* [RFC PATCH v2 5/7] sched: nominate preferred wakeup cpu
  2008-09-08 13:14 [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n Vaidyanathan Srinivasan
                   ` (3 preceding siblings ...)
  2008-09-08 13:20 ` [RFC PATCH v2 4/7] sched: favour lower logical cpu number for sched_mc balance Vaidyanathan Srinivasan
@ 2008-09-08 13:21 ` Vaidyanathan Srinivasan
  2008-09-08 13:21   ` Peter Zijlstra
  2008-09-08 13:22 ` [RFC PATCH v2 6/7] sched: bias task wakeups to preferred semi-idle packages Vaidyanathan Srinivasan
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-09-08 13:21 UTC (permalink / raw)
  To: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi
  Cc: Ingo Molnar, Peter Zijlstra, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky, Vaidyanathan Srinivasan

When the system utilisation is low and more cpus are idle,
the process waking up from sleep should prefer to wake up
an idle cpu in a semi-idle cpu package (a multi-core
package) rather than a completely idle cpu package, which
would waste power.

Use the sched_mc balance logic in find_busiest_group() to
nominate a preferred wakeup cpu.

This info could be stored in the appropriate sched_domain, but
updating this info in all copies of the sched_domain is not
practical.  For now let's try a global variable.

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---

 kernel/sched.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 569fc8d..4ae79f5 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3380,6 +3380,9 @@ out_balanced:
 
 	if (this == group_leader && group_leader != group_min) {
 		*imbalance = min_load_per_task;
+		if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP)
+			sched_mc_preferred_wakeup_cpu =
+					first_cpu(group_leader->cpumask);
 		return group_min;
 	}
 #endif
@@ -6911,6 +6914,13 @@ static void sched_domain_node_span(int node, cpumask_t *span)
 int sched_smt_power_savings = 0, sched_mc_power_savings = 0;
 
 /*
+ * Preferred wake up cpu nominated by sched_mc balance that will be used when
+ * most cpus are idle in the system indicating overall very low system
+ * utilisation. Triggered at POWERSAVINGS_BALANCE_WAKEUP (2).
+ */
+unsigned int sched_mc_preferred_wakeup_cpu;
+
+/*
  * SMT sched-domains:
  */
 #ifdef CONFIG_SCHED_SMT



* [RFC PATCH v2 6/7] sched: bias task wakeups to preferred semi-idle packages
  2008-09-08 13:14 [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n Vaidyanathan Srinivasan
                   ` (4 preceding siblings ...)
  2008-09-08 13:21 ` [RFC PATCH v2 5/7] sched: nominate preferred wakeup cpu Vaidyanathan Srinivasan
@ 2008-09-08 13:22 ` Vaidyanathan Srinivasan
  2008-09-08 13:23 ` [RFC PATCH v2 7/7] sched: activate active load balancing in new idle cpus Vaidyanathan Srinivasan
  2008-09-08 13:25 ` [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n Peter Zijlstra
  7 siblings, 0 replies; 22+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-09-08 13:22 UTC (permalink / raw)
  To: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi
  Cc: Ingo Molnar, Peter Zijlstra, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky, Vaidyanathan Srinivasan

A preferred wakeup cpu (from a semi-idle package) has been
nominated in find_busiest_group() in the previous patch.  Use
this information, held in sched_mc_preferred_wakeup_cpu, in
select_task_rq_fair() to bias task wakeups if the following
conditions are satisfied:
	- The present cpu that is trying to wake up the process is
	  idle, and waking the target process on this cpu would
	  potentially wake up a completely idle package
	- The previous cpu on which the target process ran is
	  also idle, and hence selecting the previous cpu may
	  wake up a semi-idle cpu package
	- The task being woken up is allowed to run on the
	  nominated cpu (cpu affinity and restrictions)

Basically, if both the current cpu and the previous cpu on
which the task ran are idle, select the nominated cpu from the
semi-idle cpu package to run the task that is waking up.

Cache hotness and system utilisation could also be considered
before the decision is made.  (Not currently implemented.)

This technique will effectively consolidate short-running bursty
jobs on a mostly idle system.

Wakeup biasing for power savings gets automatically disabled as
system utilisation increases, because the probability of finding
both this_cpu and prev_cpu idle decreases.

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---

 kernel/sched_fair.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index f1c96e3..1301521 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -73,6 +73,9 @@ unsigned int sysctl_sched_wakeup_granularity = 5000000UL;
 
 const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
 
+/* Preferred wakeup cpu nominated by sched_mc load balancing logic */
+extern unsigned int sched_mc_preferred_wakeup_cpu;
+
 /**************************************************************
  * CFS operations on generic schedulable entities:
  */
@@ -1232,6 +1235,16 @@ static int select_task_rq_fair(struct task_struct *p, int sync)
 		}
 	}
 
+	/*
+	 * At POWERSAVINGS_BALANCE_WAKEUP level, if both this_cpu and prev_cpu
+	 * are idle and this is not a kernel thread and this task's affinity
+	 * allows it to be moved to preferred cpu, then just move!
+	 */
+	if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP &&
+		idle_cpu(prev_cpu) && idle_cpu(this_cpu) && p->mm &&
+		cpu_isset(sched_mc_preferred_wakeup_cpu, p->cpus_allowed))
+		return sched_mc_preferred_wakeup_cpu;
+
 	if (unlikely(!cpu_isset(this_cpu, p->cpus_allowed)))
 		goto out;
 



* [RFC PATCH v2 7/7] sched: activate active load balancing in new idle cpus
  2008-09-08 13:14 [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n Vaidyanathan Srinivasan
                   ` (5 preceding siblings ...)
  2008-09-08 13:22 ` [RFC PATCH v2 6/7] sched: bias task wakeups to preferred semi-idle packages Vaidyanathan Srinivasan
@ 2008-09-08 13:23 ` Vaidyanathan Srinivasan
  2008-09-08 13:25 ` [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n Peter Zijlstra
  7 siblings, 0 replies; 22+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-09-08 13:23 UTC (permalink / raw)
  To: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi
  Cc: Ingo Molnar, Peter Zijlstra, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky, Vaidyanathan Srinivasan

Active load balancing is a process by which the migration thread
is woken up on the target CPU in order to pull a currently
running task on another package into this newly idle
package.

This method is already in use by normal load_balance(); this
patch applies it to newly idle cpus when sched_mc is set to
POWERSAVINGS_BALANCE_WAKEUP.

This logic provides effective consolidation of short-running
daemon jobs on an almost idle system.

A side effect of this patch may be ping-ponging of tasks if the
system is moderately utilised.  The number of failed balance
iterations before triggering may need adjusting.

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---

 kernel/sched.c |   30 ++++++++++++++++++++++++++++++
 1 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 4ae79f5..e47e8e8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3656,10 +3656,40 @@ redo:
 	}
 
 	if (!ld_moved) {
+		int active_balance = 0;
+		unsigned long flags;
+
 		schedstat_inc(sd, lb_failed[CPU_NEWLY_IDLE]);
 		if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
 		    !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
 			return -1;
+
+		if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
+			return -1;
+
+		if (sd->nr_balance_failed++ < 1)
+			return -1;
+
+		spin_lock_irqsave(&busiest->lock, flags);
+
+		/* don't kick the migration_thread, if the curr
+		 * task on busiest cpu can't be moved to this_cpu
+		 */
+		if (!cpu_isset(this_cpu, busiest->curr->cpus_allowed)) {
+			spin_unlock_irqrestore(&busiest->lock, flags);
+			all_pinned = 1;
+			return ld_moved;
+		}
+
+		if (!busiest->active_balance) {
+			busiest->active_balance = 1;
+			busiest->push_cpu = this_cpu;
+			active_balance = 1;
+		}
+		spin_unlock_irqrestore(&busiest->lock, flags);
+		if (active_balance)
+			wake_up_process(busiest->migration_thread);
+
 	} else
 		sd->nr_balance_failed = 0;
 



* Re: [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n
  2008-09-08 13:14 [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n Vaidyanathan Srinivasan
                   ` (6 preceding siblings ...)
  2008-09-08 13:23 ` [RFC PATCH v2 7/7] sched: activate active load balancing in new idle cpus Vaidyanathan Srinivasan
@ 2008-09-08 13:25 ` Peter Zijlstra
  2008-09-08 13:48   ` Vaidyanathan Srinivasan
  7 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2008-09-08 13:25 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi, Ingo Molnar,
	Dipankar Sarma, Balbir Singh, Vatsa, Gautham R Shenoy,
	Andi Kleen, David Collier-Brown, Tim Connors, Max Krasnyansky

May I again ask to first clean up the current power saving code before
stacking more on top of it?



* Re: [RFC PATCH v2 5/7] sched: nominate preferred wakeup cpu
  2008-09-08 13:21   ` Peter Zijlstra
@ 2008-09-08 13:43     ` Vaidyanathan Srinivasan
  0 siblings, 0 replies; 22+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-09-08 13:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi, Ingo Molnar,
	Dipankar Sarma, Balbir Singh, Vatsa, Gautham R Shenoy,
	Andi Kleen, David Collier-Brown, Tim Connors, Max Krasnyansky

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2008-09-08 15:21:31]:

> On Mon, 2008-09-08 at 18:51 +0530, Vaidyanathan Srinivasan wrote:
> > When the system utilisation is low and more cpus are idle,
> > the process waking up from sleep should prefer to wake up
> > an idle cpu in a semi-idle cpu package (a multi-core
> > package) rather than a completely idle cpu package, which
> > would waste power.
> > 
> > Use the sched_mc balance logic in find_busiest_group() to
> > nominate a preferred wakeup cpu.
> > 
> > This info could be stored in the appropriate sched_domain, but
> > updating this info in all copies of the sched_domain is not
> > practical.  For now let's try a global variable.
> > 
> > Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> > ---
> > 
> >  kernel/sched.c |   10 ++++++++++
> >  1 files changed, 10 insertions(+), 0 deletions(-)
> > 
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 569fc8d..4ae79f5 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -3380,6 +3380,9 @@ out_balanced:
> >  
> >  	if (this == group_leader && group_leader != group_min) {
> >  		*imbalance = min_load_per_task;
> > +		if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP)
> > +			sched_mc_preferred_wakeup_cpu =
> > +					first_cpu(group_leader->cpumask);
> >  		return group_min;
> >  	}
> >  #endif
> > @@ -6911,6 +6914,13 @@ static void sched_domain_node_span(int node, cpumask_t *span)
> >  int sched_smt_power_savings = 0, sched_mc_power_savings = 0;
> >  
> >  /*
> > + * Preferred wake up cpu nominated by sched_mc balance that will be used when
> > + * most cpus are idle in the system indicating overall very low system
> > + * utilisation. Triggered at POWERSAVINGS_BALANCE_WAKEUP (2).
> > + */
> > +unsigned int sched_mc_preferred_wakeup_cpu;
> 
> This cannot be a global variable; what happens when we have two disjoint
> load-balance domains?

Agreed, this is certainly a problem.  I tried adding this to the
sched_domain, but accessing the correct 'copy' of the sched_domain
that holds this variable from any cpu is not fast.

Thank you for pointing this out.  I will find an alternative
implementation.

--Vaidy


* Re: [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n
  2008-09-08 13:25 ` [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n Peter Zijlstra
@ 2008-09-08 13:48   ` Vaidyanathan Srinivasan
  2008-09-08 13:56     ` Peter Zijlstra
  2008-09-08 13:58     ` Andi Kleen
  0 siblings, 2 replies; 22+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-09-08 13:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi, Ingo Molnar,
	Dipankar Sarma, Balbir Singh, Vatsa, Gautham R Shenoy,
	Andi Kleen, David Collier-Brown, Tim Connors, Max Krasnyansky

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2008-09-08 15:25:46]:

> May I again ask to first clean up the current power saving code before
> stacking more on top of it?

:) I understand that you have asked for two things with respect to the
current power save balance code:

1. Detailed documentation
2. Cleanup the group_min and group_leader stuff in
   find_busiest_group()

Did I get the requirements correct?

--Vaidy



* Re: [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n
  2008-09-08 13:48   ` Vaidyanathan Srinivasan
@ 2008-09-08 13:56     ` Peter Zijlstra
  2008-09-09  1:20       ` Suresh Siddha
  2008-09-08 13:58     ` Andi Kleen
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2008-09-08 13:56 UTC (permalink / raw)
  To: svaidy
  Cc: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi, Ingo Molnar,
	Dipankar Sarma, Balbir Singh, Vatsa, Gautham R Shenoy,
	Andi Kleen, David Collier-Brown, Tim Connors, Max Krasnyansky

On Mon, 2008-09-08 at 19:18 +0530, Vaidyanathan Srinivasan wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> [2008-09-08 15:25:46]:
> 
> > May I again ask to first clean up the current power saving code before
> > stacking more on top of it?
> 
> :) I understand that you have asked for two things with respect to the
> current power save balance code:
> 
> 1. Detailed documentation

Preferably in the form of in-code comments and code structure; this
Documentation/* stuff always gets lost on me.

> 2. Cleanup the group_min and group_leader stuff in
>    find_busiest_group()
> 
> Did I get the requirements correct?

That would be much appreciated. 

But I also prefer to get rid of that power savings tweak in
cpu_coregroup_map(). 

But above all, readable code ;-)

find_busiest_group() is the stuff of nightmares.



* Re: [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n
  2008-09-08 13:48   ` Vaidyanathan Srinivasan
  2008-09-08 13:56     ` Peter Zijlstra
@ 2008-09-08 13:58     ` Andi Kleen
  2008-09-10 13:45       ` Vaidyanathan Srinivasan
  1 sibling, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2008-09-08 13:58 UTC (permalink / raw)
  To: Peter Zijlstra, Linux Kernel, Suresh B Siddha,
	Venkatesh Pallipadi, Ingo Molnar, Dipankar Sarma, Balbir Singh,
	Vatsa, Gautham R Shenoy, Andi Kleen, David Collier-Brown,
	Tim Connors, Max Krasnyansky

> 1. Detailed documentation

Messy code cannot really be made good with documentation.  It's
not that your patches are that messy; it's more that they make
something already overcomplicated even worse.

> 2. Cleanup the group_min and group_leader stuff in
>    find_busiest_group()

I think one issue is that there are in general too many special cases
that completely change the algorithm, especially for power saving.
Perhaps it would make sense to refactor the code a bit and then
use different high-level code paths for those?  I assume that
would make it all simpler and easier to understand.

The other alternative would be to dynamically change the domains
so that a generic graph walker without knowledge of power savings
could DTRT in all cases. But I assume that would be much harder.

-Andi




* Re: [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n
  2008-09-08 13:56     ` Peter Zijlstra
@ 2008-09-09  1:20       ` Suresh Siddha
  2008-09-09  6:18         ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Suresh Siddha @ 2008-09-09  1:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: svaidy, Linux Kernel, Siddha, Suresh B, Pallipadi, Venkatesh,
	Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky

On Mon, Sep 08, 2008 at 06:56:09AM -0700, Peter Zijlstra wrote:
> On Mon, 2008-09-08 at 19:18 +0530, Vaidyanathan Srinivasan wrote:
> > * Peter Zijlstra <a.p.zijlstra@chello.nl> [2008-09-08 15:25:46]:
> >
> > > May I again ask to first clean up the current power saving code before
> > > stacking more on top of it?
> >
> > :) I understand that you have asked for two things with respect to the
> > current power save balance code:
> >
> > 1. Detailed documentation
> 
> Preferably in the form of in-code comments and code structure; this
> Documentation/* stuff always gets lost on me.

Peter, almost every if() stmt/basic block in the power savings code has
comments around it.  And the power-savings code is only 50 lines (mostly
comments) out of the 320 lines of that function.

> But I also prefer to get rid of that power savings tweak in
> cpu_coregroup_map().

Why? Based on the power vs perf goals, we wanted to construct the
topologies differently.  The reason for the complexity is that on some
Intel cpus, while the cores share the same package, they have different
last-level caches.  So for performance we want to differentiate based on
the last-level caches, and for power we want to consolidate based on
the package information.
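
The tweak in question is small; roughly, paraphrasing the 2.6.27
arch/x86 code (illustrative only, exact details may differ):

/* maps the cpu to the sched domain representing multi-core */
cpumask_t cpu_coregroup_map(int cpu)
{
	struct cpuinfo_x86 *c = &cpu_data(cpu);

	/*
	 * For performance, return the last-level-cache shared map;
	 * for power savings, return the whole-package map.
	 */
	if (sched_mc_power_savings || sched_smt_power_savings)
		return per_cpu(cpu_core_map, cpu);
	else
		return c->llc_shared_map;
}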

> But above all, readable code ;-)
> 
> find_busiest_group() is the stuff of nightmares.

The power-savings code is a very small part of that nightmare :) The
code became complex over the years with HT, smp-nice, etc.

I haven't been following recent sched changes. I can take a look at it
and see what I can do to better organize find_busiest_group().

thanks,
suresh


* Re: [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n
  2008-09-09  1:20       ` Suresh Siddha
@ 2008-09-09  6:18         ` Peter Zijlstra
  2008-09-09  6:31           ` Nick Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2008-09-09  6:18 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: svaidy, Linux Kernel, Pallipadi, Venkatesh, Ingo Molnar,
	Dipankar Sarma, Balbir Singh, Vatsa, Gautham R Shenoy,
	Andi Kleen, David Collier-Brown, Tim Connors, Max Krasnyansky,
	Nick Piggin

On Mon, 2008-09-08 at 18:20 -0700, Suresh Siddha wrote:
> On Mon, Sep 08, 2008 at 06:56:09AM -0700, Peter Zijlstra wrote:
> > On Mon, 2008-09-08 at 19:18 +0530, Vaidyanathan Srinivasan wrote:
> > > * Peter Zijlstra <a.p.zijlstra@chello.nl> [2008-09-08 15:25:46]:
> > >
> > > > May I again ask to first clean up the current power saving code before
> > > > stacking more on top of it?
> > >
> > > :) I understand that you have asked for two things with respect to the
> > > current power save balance code:
> > >
> > > 1. Detailed documentation
> > 
> > Preferably in the form of in-code comments and code structure; this
> > Documentation/* stuff always gets lost on me.
> 
> Peter, almost every if() stmt/basic block in the power savings code has
> comments around it. And the power-savings code is only 50 lines (mostly
> comments) out of the 320 lines of that function.

Sure, and those comments only tell me what it does, not why it does it.
Also, 1/6th is still a significant amount.

> > But I also prefer to get rid of that power savings tweak in
> > cpu_coregroup_map().
> 
> Why? Based on the power vs perf goals, we wanted to construct the
> topologies differently. The reason for the complexity is that on some
> Intel cpus, while the cores share the same package, they have different
> last-level caches. So for performance we want to differentiate based on
> the last-level caches, and for power we want to consolidate based on
> the package information.

But then why not add a domain level that represents the packages, and
do the power scheduling on that? That way you don't have to change the
domain structure depending on the load-balance goals.

Also, that justification is just wrong - AMD has similar constructs in
its cpus, and god knows what other architectures do, so hiding this in
arch/x86 is wrong too.

> > But above all, readable code ;-)
> > 
> > find_busiest_group() is the stuff of nightmares.
> 
> The power-savings code is a very small part of that nightmare :) The
> code became complex over the years with HT, smp-nice, etc.

Yes, I understand, but we really need to do something about it - and
everybody with an interest in it should be looking at ways to reduce the
nightmare, not add to it.

> I haven't been following recent sched changes. I can take a look at it
> and see what I can do to better organize find_busiest_group()

You and me both then :-)

I've been looking at the history of that function - it started out quite
readable - but has, over the years, grown into a monstrosity.

One of the things I'm currently looking at is getting rid of the
small_imbalance stuff - that made sense when we balanced based on
nr_running, but now that we're weight based and willing to over-balance
a little I'm not sure that it still does.

Also, it looks to me like Christoph Lameter's fix
0a2966b48fb784e437520e400ddc94874ddbd4e8 introduced a bug. By skipping
the cpu in the group accumulation, the group average computation below
needs to be adjusted, because it uses the group power stuff, which
assumes all cpus contributed.

Then there is this whole sched_group stuff, which I intend to have a
hard look at; afaict it's unneeded and we can iterate over the
sub-domains just as well.

Finally, we should move all this stuff into sched_fair and get rid of
that iterator interface and fix up all nr_running etc. usages to refer
to cfs.nr_running and similar.

Then there is the idea Andi proposed, splitting up the performance and
power balancer into two separate functions, something that is worth
looking into imho.



* Re: [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n
  2008-09-09  6:18         ` Peter Zijlstra
@ 2008-09-09  6:31           ` Nick Piggin
  2008-09-09  6:54             ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2008-09-09  6:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Suresh Siddha, svaidy, Linux Kernel, Pallipadi, Venkatesh,
	Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky

On Tuesday 09 September 2008 16:18, Peter Zijlstra wrote:

> I've been looking at the history of that function - it started out quite
> readable - but has, over the years, grown into a monstrosity.

I agree it is terrible, and subsequent "features" weren't really properly
written or integrated into the sched domains idea.


> Then there is this whole sched_group stuff, which I intend to have a
> hard look at; afaict it's unneeded and we can iterate over the
> sub-domains just as well.

What sub-domains? The domains-minus-groups are just a graph (in existing
setup code AFAIK just a line) of cpumasks. You have to group because you
want enough control for example not to pull load from an unusually busy
CPU in one group if its load should actually be spread out over a
smaller domain (i.e. probably other CPUs within the group we're looking at).

It would be nice if you could make it simpler of course, but I just don't
understand you or maybe you thought of some other way to solve this or
why it doesn't matter...


> Finally, we should move all this stuff into sched_fair and get rid of
> that iterator interface and fix up all nr_running etc.. usages to refer
> to cfs.nr_running and similar.
>
> Then there is the idea Andi proposed, splitting up the performance and
> power balancer into two separate functions, something that is worth
> looking into imho.

That's what *I* suggested. Before it even went in. Of course there was no
attempt made at all and it went in despite my reservations, but what's new
:)


* Re: [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n
  2008-09-09  6:31           ` Nick Piggin
@ 2008-09-09  6:54             ` Peter Zijlstra
  2008-09-09  7:59               ` Nick Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2008-09-09  6:54 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Suresh Siddha, svaidy, Linux Kernel, Pallipadi, Venkatesh,
	Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky

On Tue, 2008-09-09 at 16:31 +1000, Nick Piggin wrote:
> On Tuesday 09 September 2008 16:18, Peter Zijlstra wrote:
> 
> > I've been looking at the history of that function - it started out quite
> > readable - but has, over the years, grown into a monstrosity.
> 
> I agree it is terrible, and subsequent "features" weren't really properly
> written or integrated into the sched domains idea.
> 
> 
> > Then there is this whole sched_group stuff, which I intend to have a
> > hard look at; afaict it's unneeded and we can iterate over the
> > sub-domains just as well.
> 
> What sub-domains? The domains-minus-groups are just a graph (in existing
> setup code AFAIK just a line) of cpumasks. You have to group because you
> want enough control for example not to pull load from an unusually busy
> CPU in one group if its load should actually be spread out over a
> smaller domain (i.e. probably other CPUs within the group we're looking at).
> 
> It would be nice if you could make it simpler of course, but I just don't
> understand you or maybe you thought of some other way to solve this or
> why it doesn't matter...

Right, I get the domain stuff - that's good stuff.

But, let me try and confuse you with ASCII-art ;-)

             Domain [0-7]
       group [0-3]  group [4-7]

     Domain [0-3]      
  group[0-1]  group[2-3]

Domain [0-1]
group 0 group 1

(right hand side not drawn due to lack of space etc...)

So we have this tree of domains, which is cool stuff. But then we have
these groups in there, which closely match up with the domain's child
domains.

So my idea was to ditch the groups and just iterate over the child
domains.
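
To make that concrete, here is a trimmed view of struct sched_domain
as of 2.6.27, limited to the fields relevant here (the real struct
has many more members):

struct sched_domain {
	struct sched_domain *parent;	/* larger span; NULL at the top */
	struct sched_domain *child;	/* smaller span; NULL at the bottom */
	struct sched_group *groups;	/* balancing groups within the span */
	cpumask_t span;			/* cpus covered by this domain */
};

For the [0-3] domain above, sd->groups lists [0-1] and [2-3] - the
same spans that its child domains already describe.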

> > Finally, we should move all this stuff into sched_fair and get rid of
> > that iterator interface and fix up all nr_running etc.. usages to refer
> > to cfs.nr_running and similar.
> >
> > Then there is the idea Andi proposed, splitting up the performance and
> > power balancer into two separate functions, something that is worth
> > looking into imho.
> 
> That's what *I* suggested. Before it even went in. Of course there was no
> attempt made at all and it went in despite my reservations, but what's new
> :)

Even more reason to make it happen.



* Re: [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n
  2008-09-09  6:54             ` Peter Zijlstra
@ 2008-09-09  7:59               ` Nick Piggin
  2008-09-09  8:25                 ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2008-09-09  7:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Suresh Siddha, svaidy, Linux Kernel, Pallipadi, Venkatesh,
	Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky

On Tuesday 09 September 2008 16:54, Peter Zijlstra wrote:
> On Tue, 2008-09-09 at 16:31 +1000, Nick Piggin wrote:
> > On Tuesday 09 September 2008 16:18, Peter Zijlstra wrote:
> > > I've been looking at the history of that function - it started out
> > > quite readable - but has, over the years, grown into a monstrosity.
> >
> > I agree it is terrible, and subsequent "features" weren't really properly
> > written or integrated into the sched domains idea.
> >
> > > Then there is this whole sched_group stuff, which I intend to have a
> > > hard look at; afaict it's unneeded and we can iterate over the
> > > sub-domains just as well.
> >
> > What sub-domains? The domains-minus-groups are just a graph (in existing
> > setup code AFAIK just a line) of cpumasks. You have to group because you
> > want enough control for example not to pull load from an unusually busy
> > CPU in one group if its load should actually be spread out over a
> > smaller domain (i.e. probably other CPUs within the group we're looking
> > at).
> >
> > It would be nice if you could make it simpler of course, but I just don't
> > understand you or maybe you thought of some other way to solve this or
> > why it doesn't matter...
>
> Right, I get the domain stuff - that's good stuff.
>
> But, let me try and confuse you with ASCII-art ;-)
>
>              Domain [0-7]
>        group [0-3]  group [4-7]
>
>      Domain [0-3]
>   group[0-1]  group[2-3]
>
> Domain [0-1]
> group 0 group 1
>
> (right hand side not drawn due to lack of space etc...)
>
> So we have this tree of domains, which is cool stuff. But then we have
> these groups in there, which closely match up with the domain's child
> domains.

But it's all per-cpu, so you'd have to iterate down other CPUs' child
domains, which may get dirtied by those CPUs. So you get cacheline
bounces.

You also lose flexibility (although nobody really takes full advantage
of it) of totally arbitrary topology on a per-cpu basis.


> So my idea was to ditch the groups and just iterate over the child
> domains.

I'm not saying you couldn't do it (reasonably well -- cacheline bouncing
might be a problem if you propose to traverse other CPUs' domains), but
what exactly does that gain you?


> > > Finally, we should move all this stuff into sched_fair and get rid of
> > > that iterator interface and fix up all nr_running etc.. usages to refer
> > > to cfs.nr_running and similar.
> > >
> > > Then there is the idea Andi proposed, splitting up the performance and
> > > power balancer into two separate functions, something that is worth
> > > looking into imho.
> >
> > That's what *I* suggested. Before it even went in. Of course there was no
> > attempt made at all and it went in despite my reservations, but what's
> > new
> >
> > :)
>
> Even more reason to make it happen.

Yes it would be great if it happens.


* Re: [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n
  2008-09-09  7:59               ` Nick Piggin
@ 2008-09-09  8:25                 ` Peter Zijlstra
  2008-09-09  9:03                   ` Nick Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2008-09-09  8:25 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Suresh Siddha, svaidy, Linux Kernel, Pallipadi, Venkatesh,
	Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky

On Tue, 2008-09-09 at 17:59 +1000, Nick Piggin wrote:
> On Tuesday 09 September 2008 16:54, Peter Zijlstra wrote:
> > On Tue, 2008-09-09 at 16:31 +1000, Nick Piggin wrote:
> > > On Tuesday 09 September 2008 16:18, Peter Zijlstra wrote:
> > > > I've been looking at the history of that function - it started out
> > > > quite readable - but has, over the years, grown into a monstrosity.
> > >
> > > I agree it is terrible, and subsequent "features" weren't really properly
> > > written or integrated into the sched domains idea.
> > >
> > > > Then there is this whole sched_group stuff, which I intend to have a
> > > > hard look at; afaict it's unneeded and we can iterate over the
> > > > sub-domains just as well.
> > >
> > > What sub-domains? The domains-minus-groups are just a graph (in existing
> > > setup code AFAIK just a line) of cpumasks. You have to group because you
> > > want enough control for example not to pull load from an unusually busy
> > > CPU in one group if its load should actually be spread out over a
> > > smaller domain (i.e. probably other CPUs within the group we're looking
> > > at).
> > >
> > > It would be nice if you could make it simpler of course, but I just don't
> > > understand you or maybe you thought of some other way to solve this or
> > > why it doesn't matter...
> >
> > Right, I get the domain stuff - that's good stuff.
> >
> > But, let me try and confuse you with ASCII-art ;-)
> >
> >              Domain [0-7]
> >        group [0-3]  group [4-7]
> >
> >      Domain [0-3]
> >   group[0-1]  group[2-3]
> >
> > Domain [0-1]
> > group 0 group 1
> >
> > (right hand side not drawn due to lack of space etc...)
> >
> > So we have this tree of domains, which is cool stuff. But then we have
> > these groups in there, which closely match up with the domain's child
> > domains.
> 
> But it's all per-cpu, so you'd have to iterate down other CPUs' child
> domains, which may get dirtied by their owning CPUs. So you get
> cacheline bounces.

Humm, are you saying each cpu has its own domain tree? My understanding
was that it's a global structure, e.g. given:

   domain[0-1]

domain[0] domain[1]

cpu0's parent domain is the same instance as cpu1's.

> You also lose the flexibility (although nobody really takes full
> advantage of it) of a totally arbitrary topology on a per-cpu basis.

Afaict the only flexibility you lose is that you cannot make groups
larger/smaller than the child domain - which, given that the whole
premise of the groups' existence is that the inner-group balancing
should be done by the level below, doesn't make sense anyway.
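
(For illustration, a minimal sketch of the two shapes being compared;
the group walk matches the 2.6.27-era find_busiest_group(), while the
child-domain walk is hypothetical and assumes some way of visiting one
domain per child span:)

	/* current: balancing at 'sd' walks its circular group list */
	struct sched_group *group = sd->groups;
	do {
		/* accumulate load over the cpus in group->cpumask */
		group = group->next;
	} while (group != sd->groups);

	/*
	 * proposed (hypothetical): treat each child domain as the
	 * "group", e.g. by visiting, for one cpu in each child span,
	 * that cpu's sched_domain one level below 'sd'.
	 */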

> > So my idea was to ditch the groups and just iterate over the child
> > domains.
> 
> I'm not saying you couldn't do it (reasonably well -- cacheline bouncing
> might be a problem if you propose to traverse other CPUs' domains), but
> what exactly does that gain you?

Those cacheline bounces could be mitigated by splitting sched_domain
into two parts with a cacheline-aligned dummy, keeping the rarely
modified data separate from the frequently modified data.
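
(Roughly the layout meant here -- a sketch with an illustrative subset
of fields, not the actual kernel structure:)

	struct sched_domain {
		/* read-mostly topology data, shared by all walkers */
		struct sched_domain *parent;
		struct sched_domain *child;
		cpumask_t span;
		unsigned long min_interval;	/* balance interval floor */
		unsigned long max_interval;	/* balance interval ceiling */

		/*
		 * Frequently written balancing state, pushed onto its
		 * own cacheline so updates don't bounce remote readers.
		 */
		unsigned long last_balance ____cacheline_aligned;
		unsigned int nr_balance_failed;
	};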

As to the gains - a graph walk with a single type seems more elegant to
me.



* Re: [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n
  2008-09-09  8:25                 ` Peter Zijlstra
@ 2008-09-09  9:03                   ` Nick Piggin
  0 siblings, 0 replies; 22+ messages in thread
From: Nick Piggin @ 2008-09-09  9:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Suresh Siddha, svaidy, Linux Kernel, Pallipadi, Venkatesh,
	Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, David Collier-Brown, Tim Connors,
	Max Krasnyansky

On Tuesday 09 September 2008 18:25, Peter Zijlstra wrote:
> On Tue, 2008-09-09 at 17:59 +1000, Nick Piggin wrote:

> > But it's all per-cpu, so you'd have to iterate down other CPUs' child
> > domains, which may get dirtied by their owning CPUs. So you get
> > cacheline bounces.
>
> Humm, are you saying each cpu has its own domain tree? My understanding
> was that it's a global structure, e.g. given:
>
>    domain[0-1]
>
> domain[0] domain[1]
>
> cpu0's parent domain is the same instance as cpu1's.

I haven't looked recently, but that's how I wrote it. Has that changed?
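
(For reference, the per-cpu walk roughly as it appears in
kernel/sched.c of that era -- each cpu starts from its own runqueue's
sd pointer and climbs its own chain of parent instances:)

	#define for_each_domain(cpu, __sd) \
		for (__sd = rcu_dereference(cpu_rq(cpu)->sd); __sd; \
		     __sd = __sd->parent)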


> > You also lose the flexibility (although nobody really takes full
> > advantage of it) of a totally arbitrary topology on a per-cpu basis.
>
> Afaict the only flexibility you lose is that you cannot make groups
> larger/smaller than the child domain - which, given that the whole
> premise of the groups' existence is that the inner-group balancing
> should be done by the level below, doesn't make sense anyway.

But you *also* cannot have per-cpu domain trees.


> > > So my idea was to ditch the groups and just iterate over the child
> > > domains.
> >
> > I'm not saying you couldn't do it (reasonably well -- cacheline bouncing
> > might be a problem if you propose to traverse other CPUs' domains), but
> > what exactly does that gain you?
>
> Those cacheline bounces could be mitigated by splitting sched_domain
> into two parts with a cacheline-aligned dummy, keeping the rarely
> modified data separate from the frequently modified data.

You could.


> As to the gains - a graph walk with a single type seems more elegant to
> me.

It's fundamentally two different things anyway though. I don't
see any theoretical improvement, and it definitely wouldn't
improve the practical side much if any, because I don't think
the biggest problem is the simple walks themselves but the
calculations and stuff.

If it can yield something clearly better that is impossible using
domains and groups, I could change my mind.



* Re: [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n
  2008-09-08 13:58     ` Andi Kleen
@ 2008-09-10 13:45       ` Vaidyanathan Srinivasan
  0 siblings, 0 replies; 22+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-09-10 13:45 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Linux Kernel, Suresh B Siddha,
	Venkatesh Pallipadi, Ingo Molnar, Dipankar Sarma, Balbir Singh,
	Vatsa, Gautham R Shenoy, David Collier-Brown, Tim Connors,
	Max Krasnyansky

* Andi Kleen <andi@firstfloor.org> [2008-09-08 15:58:34]:

> > 1. Detailed documentation
> 
> Messy code cannot really be made good with documentation. It's
> not that your patches are that messy, it's more that they make
> something already overcomplicated even worse.
> 
> > 2. Cleanup the group_min and group_leader stuff in
> >    find_busiest_group()
> 
> I think one issue is that there are in general too many special cases
> that completely change the algorithm, especially for power saving.
> Perhaps it would make sense to refactor the code a bit and then
> use different high-level code paths for those?  I assume that
> would make it all simpler and easier to understand.

Hi Andi,

I will try to refactor the code and see if it can look cleaner.  Power
saving balance is actually just a corner case of the general
load_balance path.  We will have to do a default load_balance for
performance each time we enter this section of the scheduler, except
when certain complex conditions are met and we see an opportunity to
save power.  When that window of opportunity to save power exists, the
code will take a different path and do things that would normally not
be done by the default load balancer logic.

I think we can split the statistics (load) computation part and the
decision part into separate functions, evaluate the power save balance
conditions first, and if there is no opportunity to save power, fall
back to the default load balancing decision, as sketched below.
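
(A hypothetical shape for such a split -- the function and structure
names below are illustrative, not existing kernel API:)

	static struct sched_group *
	find_busiest_group(struct sched_domain *sd /* , ... */)
	{
		struct sd_lb_stats sds;

		/* pure statistics pass: gather per-group load data */
		compute_sd_lb_stats(sd, &sds);

		/* power-save window open?  take the consolidation path */
		if (sched_mc_power_savings && power_save_opportunity(sd, &sds))
			return sds.group_min;

		/* otherwise recommend the default performance decision */
		return default_balance_decision(sd, &sds);
	}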

> 
> The other alternative would be to dynamically change the domains
> so that a generic graph walker without knowledge of power savings
> could DTRT in all cases. But I assume that would be much harder.

This option may not work because we will have to change the decision
only when a power saving opportunity exists.  Hence, most of the time
the power save balance code should just fall through into the default
balancer.  Splitting them this way may add duplicate code in both
paths.

--Vaidy


Thread overview: 22+ messages
2008-09-08 13:14 [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n Vaidyanathan Srinivasan
2008-09-08 13:16 ` [RFC PATCH v2 1/7] sched: arch_reinit_sched_domains() must destroy domains to force rebuild Vaidyanathan Srinivasan
2008-09-08 13:17 ` [RFC PATCH v2 2/7] sched: Fix __load_balance_iterator() for cfq with only one task Vaidyanathan Srinivasan
2008-09-08 13:18 ` [RFC PATCH v2 3/7] sched: Framework for sched_mc/smt_power_savings=N Vaidyanathan Srinivasan
2008-09-08 13:20 ` [RFC PATCH v2 4/7] sched: favour lower logical cpu number for sched_mc balance Vaidyanathan Srinivasan
2008-09-08 13:21 ` [RFC PATCH v2 5/7] sched: nominate preferred wakeup cpu Vaidyanathan Srinivasan
2008-09-08 13:21   ` Peter Zijlstra
2008-09-08 13:43     ` Vaidyanathan Srinivasan
2008-09-08 13:22 ` [RFC PATCH v2 6/7] sched: bias task wakeups to preferred semi-idle packages Vaidyanathan Srinivasan
2008-09-08 13:23 ` [RFC PATCH v2 7/7] sched: activate active load balancing in new idle cpus Vaidyanathan Srinivasan
2008-09-08 13:25 ` [RFC PATCH v2 0/7] Tunable sched_mc_power_savings=n Peter Zijlstra
2008-09-08 13:48   ` Vaidyanathan Srinivasan
2008-09-08 13:56     ` Peter Zijlstra
2008-09-09  1:20       ` Suresh Siddha
2008-09-09  6:18         ` Peter Zijlstra
2008-09-09  6:31           ` Nick Piggin
2008-09-09  6:54             ` Peter Zijlstra
2008-09-09  7:59               ` Nick Piggin
2008-09-09  8:25                 ` Peter Zijlstra
2008-09-09  9:03                   ` Nick Piggin
2008-09-08 13:58     ` Andi Kleen
2008-09-10 13:45       ` Vaidyanathan Srinivasan
