linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
@ 2009-05-13 13:11 Vaidyanathan Srinivasan
  2009-05-13 13:11 ` [RFC PATCH v2 1/2] sched: add sched_max_capacity_pct Vaidyanathan Srinivasan
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-05-13 13:11 UTC (permalink / raw)
  To: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi,
	Peter Zijlstra, Arjan van de Ven
  Cc: Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, Gregory Haskins, Mike Galbraith,
	Thomas Gleixner, Arun Bharadwaj, Vaidyanathan Srinivasan

Hi,

The idea of extending sched_mc_powersavings tunable for cpu evacuation
was discussed at http://lwn.net/Articles/330309/ 

The summary of the discussion is as follows:

* Using sched_mc=3,4,5 to evacuate 1,2,4 cores is a completely
  non-intuitive and broken interface.  Ingo wanted to see if we can
  model a global percentile tunable that would map to core throttling.

* Peter Zijlstra wanted more justifications for throttling at the core
  level.  Throttling may be a resource management problem rather than
  scheduler/load balancer

* CPU hotplug and cpuset/cgroup based cpu throttling are viable
  alternatives to this approach.  

Changes in v2:

* Created a percentage knob sched_max_capacity_pct=n
  Defaults to 100, can be set to 75 or 50 to evacuate cores

* This patch is still a hack for discussion and has many
  limitations.

v1: http://lkml.org/lkml/2009/4/26/202

Intro and parts from previous post for quick reference:
-------------------------------------------------------

Objective:
----------

* Framework to evacuate tasks from cpus in order to force the cpu
  cores to stay at idle.  Forcefully idling cores and packages can
  reduce power consumption.

* Fast response time and low OS overhead to move tasks away from
  selected cpu packages.  CPU hotplug is too heavyweight for this
  purpose

Use cases:
---------

* Ability to throttle the number of cores used in the system along
  with other power saving controls like cpufreq governors can enable
  the system to operate at a more power efficient operating point and
  still meet the design objectives.
 
* Facilitate thermal management by evacuating cores from hot cpu
  packages

Alternatives:
-------------

* CPU hotplug: Heavyweight and slow.  Involves setting up and tearing
  down data structures.  May need new fast or lightweight
  notifications.

* CPUSets: Exclusive CPU sets and partitioned sched domains involve
  rebuilding sched domains and are relatively heavyweight for the
  purpose.

The following patch is against 2.6.30-rc5 and will work only in an
under utilised system (No of tasks <= number of cores).

Test results for ebizzy 8 threads at various sched_max_capacity_pct
settings. The test platform is dual socket quad core x86 system
(pre-Nehalem).

This is an interesting characteristic of the ebizzy benchmark where
the following command line improved in performance as we evacuated
cores!  Perhaps cross-cache traffic... I will verify that next time.

ebizzy -s 4096 -t 8 -S 30

sched_mc_power_savings was set to 2 in the experiment

-----------------------------------------------------------------
sched_max_capacity_pct	No Cores	Performance	AvgPower	
			used		Records/sec	(Watts)
-----------------------------------------------------------------
100			8		1.00x		1.00y
 87			7		1.03x		0.98y
 75			6		1.06x		0.95y
 62			5		1.26x		0.91y
 50			4		1.15x		0.86y
-----------------------------------------------------------------
		
There was wide run-to-run variation with ebizzy.  The purpose of the above
data is to justify use of core evacuation for power vs performance
trade-offs.  The patch does not yet work for kernbench and other
complex workloads/benchmarks. I even tried SPECjbb and did not get the
expected CPU utilisation at various settings to reduce power
consumption.  The utilisation/power was much lower than expected.
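
One way to read the ebizzy table above is as relative performance per
watt.  The following is only a sketch that divides the two normalised
columns; the struct and function names are illustrative assumptions and
not part of the patch, and the ratios inherit the noise of the
underlying runs.

```c
#include <stdio.h>

/* One row of the ebizzy table; field names are illustrative only. */
struct sample {
	int capacity_pct;	/* sched_max_capacity_pct setting */
	int cores_used;
	double perf;		/* records/sec relative to the 100% run */
	double power;		/* average power relative to the 100% run */
};

/* Values copied from the table in the cover letter. */
const struct sample ebizzy_runs[] = {
	{ 100, 8, 1.00, 1.00 },
	{  87, 7, 1.03, 0.98 },
	{  75, 6, 1.06, 0.95 },
	{  62, 5, 1.26, 0.91 },
	{  50, 4, 1.15, 0.86 },
};

/* Relative performance per watt: (x multiplier) / (y multiplier). */
double perf_per_watt(const struct sample *s)
{
	return s->perf / s->power;
}

/* Print one line per table row with the derived ratio. */
void print_efficiency(void)
{
	size_t i;
	size_t n = sizeof(ebizzy_runs) / sizeof(ebizzy_runs[0]);

	for (i = 0; i < n; i++)
		printf("%3d%% (%d cores): %.2f\n",
		       ebizzy_runs[i].capacity_pct,
		       ebizzy_runs[i].cores_used,
		       perf_per_watt(&ebizzy_runs[i]));
}
```

Calling print_efficiency() from a small main() lists the ratio for each
setting; in this data set the 62% row has the highest ratio, ahead of
the 50% row.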

ToDo:
-----

* Identify good benchmark to demonstrate benefits of cpu evacuation

* Make the core evacuation predictable under different system load
  conditions and workload characteristics.  This is turning out to be
  a major challenge in this approach.

* Enhance framework to control which particular packages/cores will be
  evacuated, this is needed for thermal management.  The
  CPU hotplug/cpuset approach will solve this problem.

I can experiment with different benchmarks/platforms and post results
while the framework is being discussed.

Please let me know your comments and suggestions.

Thanks,
Vaidy

---

Vaidyanathan Srinivasan (2):
      sched: loadbalancer hacks for forced packing of tasks
      sched: add sched_max_capacity_pct


 kernel/sched.c |   65 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 64 insertions(+), 1 deletions(-)



* [RFC PATCH v2 1/2] sched: add sched_max_capacity_pct
  2009-05-13 13:11 [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Vaidyanathan Srinivasan
@ 2009-05-13 13:11 ` Vaidyanathan Srinivasan
  2009-05-13 13:11 ` [RFC PATCH v2 2/2] sched: loadbalancer hacks for forced packing of tasks Vaidyanathan Srinivasan
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-05-13 13:11 UTC (permalink / raw)
  To: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi,
	Peter Zijlstra, Arjan van de Ven
  Cc: Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, Gregory Haskins, Mike Galbraith,
	Thomas Gleixner, Arun Bharadwaj, Vaidyanathan Srinivasan

Add a new sysfs variable that can be used by user space
to pass the number of cores to evacuate or force idle.

/sys/devices/system/cpu/sched_max_capacity_pct defaults to 100

This is a percentage value that can be used to force idle cores.
The percentage number shall be in steps corresponding to number
of cores in the system.

On an 8-core system (dual socket quad core), each core step will
be 12.5% rounded to 12%.

Echoing 88 will use 7 cores in the system:

%	No of cores
100	8
87	7
75	6
62	5
50	4
...
...

This patch will evacuate only one package (50%) in this case.
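
The mapping between the percentage and the core count can be checked in
user space.  The sketch below copies the integer arithmetic from
sched_max_capacity_pct_store() and sched_max_capacity_pct_show() in this
patch; NR_CPUS_EXAMPLE is a hypothetical stand-in for nr_cpu_ids on the
8-core test system, and the function names are illustrative only.

```c
#include <stdio.h>

/* Hypothetical stand-in for nr_cpu_ids on the 8-core test system. */
enum { NR_CPUS_EXAMPLE = 8 };

/* Same arithmetic as sched_max_capacity_pct_store() in the patch. */
int pct_to_evacuated_cores(int capacity_pct, int nr_cpus)
{
	return (101 - capacity_pct) * nr_cpus / 100;
}

/* Same arithmetic as sched_max_capacity_pct_show() in the patch. */
int evacuated_cores_to_pct(int evacuated, int nr_cpus)
{
	return 100 - evacuated * 100 / nr_cpus;
}

/* Show what a written value turns into and what a read would report. */
void print_mapping(int capacity_pct)
{
	int evac = pct_to_evacuated_cores(capacity_pct, NR_CPUS_EXAMPLE);

	printf("write %3d -> %d cores used, reads back %d\n",
	       capacity_pct, NR_CPUS_EXAMPLE - evac,
	       evacuated_cores_to_pct(evac, NR_CPUS_EXAMPLE));
}
```

With these formulas, writing either 87 or 88 evacuates one core, and
because of integer truncation a subsequent read reports 88 (and 63
rather than 62 for three evacuated cores).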

*** This is an RFC patch for discussion ***

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---

 kernel/sched.c |   37 +++++++++++++++++++++++++++++++++++++
 1 files changed, 37 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index b902e58..f22b9f6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3291,6 +3291,9 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
 
 
 #if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
+
+int sched_evacuate_cores; /* No of forced-idle cores */
+
 /**
  * init_sd_power_savings_stats - Initialize power savings statistics for
  * the given sched_domain, during load balancing.
@@ -8604,6 +8607,37 @@ static ssize_t sched_mc_power_savings_store(struct sysdev_class *class,
 static SYSDEV_CLASS_ATTR(sched_mc_power_savings, 0644,
 			 sched_mc_power_savings_show,
 			 sched_mc_power_savings_store);
+
+static ssize_t sched_max_capacity_pct_show(struct sysdev_class *class,
+					   char *page)
+{
+	int capacity;
+	/* Convert no of cores to system capacity percentage */
+	/* FIXME: Will work only for non-threaded systems */
+	capacity = 100 - sched_evacuate_cores * 100 / nr_cpu_ids;
+	return sprintf(page, "%u\n", capacity);
+}
+static ssize_t sched_max_capacity_pct_store(struct sysdev_class *class,
+					    const char *buf, size_t count)
+{
+	int capacity;
+	if (!sscanf(buf, "%u", &capacity))
+		return -EINVAL;
+
+	if (capacity < 1 || capacity > 100)
+		return -EINVAL;
+
+	/* Convert user provided percentage into no-of-cores to evacuate */
+	/* FIXME: Will work only for non-threaded systems */
+	sched_evacuate_cores = (101 - capacity) * nr_cpu_ids / 100;
+	return count;
+}
+
+
+static SYSDEV_CLASS_ATTR(sched_max_capacity_pct, 0644,
+			 sched_max_capacity_pct_show,
+			 sched_max_capacity_pct_store);
+
 #endif
 
 #ifdef CONFIG_SCHED_SMT
@@ -8635,6 +8669,9 @@ int __init sched_create_sysfs_power_savings_entries(struct sysdev_class *cls)
 	if (!err && mc_capable())
 		err = sysfs_create_file(&cls->kset.kobj,
 					&attr_sched_mc_power_savings.attr);
+	if (!err)
+		err = sysfs_create_file(&cls->kset.kobj,
+					&attr_sched_max_capacity_pct.attr);
 #endif
 	return err;
 }



* [RFC PATCH v2 2/2] sched: loadbalancer hacks for forced packing of tasks
  2009-05-13 13:11 [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Vaidyanathan Srinivasan
  2009-05-13 13:11 ` [RFC PATCH v2 1/2] sched: add sched_max_capacity_pct Vaidyanathan Srinivasan
@ 2009-05-13 13:11 ` Vaidyanathan Srinivasan
  2009-05-13 13:14 ` [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Peter Zijlstra
  2009-05-13 14:35 ` [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Andi Kleen
  3 siblings, 0 replies; 22+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-05-13 13:11 UTC (permalink / raw)
  To: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi,
	Peter Zijlstra, Arjan van de Ven
  Cc: Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Andi Kleen, Gregory Haskins, Mike Galbraith,
	Thomas Gleixner, Arun Bharadwaj, Vaidyanathan Srinivasan

Pack more tasks in a group so as to reduce the number of CPUs
used to run the work in the system.

Just for load balancing purpose, assume the group capacity
has been increased by group_capacity_bump()

Hacks:

o Make non-idle cpus also perform powersave balance so
  that we can pull more tasks into the group
o Increase group capacity for calculation
o Increase load-balancing threshold so that even if a
  group is overloaded by group_capacity_bump(), consider
  it balanced

Basically if we want to evacuate 2 cores, the group capacity
is increased by 2 (*SCHED_LOAD_SCALE) and the power save
balancer will accommodate the tasks after selecting the group
leader.

This will not work if the system is overloaded. Even
after pulling 2 extra tasks, there could be tasks to fill the
other package.  At this point we are not yet reducing the
group capacity of the other group.
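
The raised overload threshold can be tried out in user space.  The
sketch below evaluates the expression added to find_busiest_group() with
plain C integer arithmetic, assuming a quad-core package group whose
__cpu_power is 4 * SCHED_LOAD_SCALE (an assumption for illustration).
The second function is a hypothetical reordering that multiplies before
dividing; it is not part of this patch and is shown only to compare the
two results.

```c
#include <stdio.h>

#define SCHED_LOAD_SCALE 1024UL

/* Threshold as the expression in the patch evaluates it. */
unsigned long overload_limit_as_posted(unsigned long cpu_power,
				       unsigned int bump)
{
	return (cpu_power + bump * SCHED_LOAD_SCALE) /
		cpu_power * SCHED_LOAD_SCALE;
}

/* Hypothetical variant: scale first so the fractional part survives. */
unsigned long overload_limit_scaled_first(unsigned long cpu_power,
					  unsigned int bump)
{
	return (cpu_power + bump * SCHED_LOAD_SCALE) *
		SCHED_LOAD_SCALE / cpu_power;
}

/* Print both thresholds for a given number of evacuated cores. */
void print_limits(unsigned long cpu_power, unsigned int bump)
{
	printf("bump=%u posted=%lu scaled_first=%lu\n", bump,
	       overload_limit_as_posted(cpu_power, bump),
	       overload_limit_scaled_first(cpu_power, bump));
}
```

Under the stated assumption, a bump of 2 leaves the posted threshold at
SCHED_LOAD_SCALE because the division truncates to 1 before the final
multiplication, while the reordered form yields 1.5 * SCHED_LOAD_SCALE.
Whether that matters depends on the real __cpu_power values seen at the
balancing domain.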

*** RFC patch for discussion ***

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---

 kernel/sched.c |   28 +++++++++++++++++++++++++++-
 1 files changed, 27 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index f22b9f6..186b0ec 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3234,6 +3234,7 @@ struct sd_lb_stats {
 	int group_imb; /* Is there imbalance in this sd */
 #if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
 	int power_savings_balance; /* Is powersave balance needed for this sd */
+	unsigned int group_capacity_bump; /* % increase in group capacity */
 	struct sched_group *group_min; /* Least loaded group in sd */
 	struct sched_group *group_leader; /* Group which relieves group_min */
 	unsigned long min_load_per_task; /* load_per_task in group_min */
@@ -3294,6 +3295,15 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
 
 int sched_evacuate_cores; /* No of forced-idle cores */
 
+static inline unsigned int group_capacity_bump(struct sched_domain *sd)
+{
+
+	if (sd->flags & SD_POWERSAVINGS_BALANCE)
+		return sched_evacuate_cores;
+
+	return 0;
+}
+
 /**
  * init_sd_power_savings_stats - Initialize power savings statistics for
  * the given sched_domain, during load balancing.
@@ -3309,12 +3319,14 @@ static inline void init_sd_power_savings_stats(struct sched_domain *sd,
 	 * Busy processors will not participate in power savings
 	 * balance.
 	 */
-	if (idle == CPU_NOT_IDLE || !(sd->flags & SD_POWERSAVINGS_BALANCE))
+	if ((idle == CPU_NOT_IDLE && !sched_evacuate_cores) ||
+		!(sd->flags & SD_POWERSAVINGS_BALANCE))
 		sds->power_savings_balance = 0;
 	else {
 		sds->power_savings_balance = 1;
 		sds->min_nr_running = ULONG_MAX;
 		sds->leader_nr_running = 0;
+		sds->group_capacity_bump = group_capacity_bump(sd);
 	}
 }
 
@@ -3436,6 +3448,12 @@ static inline int check_power_save_busiest_group(struct sd_lb_stats *sds,
 {
 	return 0;
 }
+
+static inline unsigned int group_capacity_bump(struct sched_domain *sd)
+{
+	return 0;
+}
+
 #endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
 
 
@@ -3568,6 +3586,8 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
 
 		if (local_group && balance && !(*balance))
 			return;
+		/* Bump up group capacity for forced packing of tasks */
+		sgs.group_capacity += sds->group_capacity_bump;
 
 		sds->total_load += sgs.group_load;
 		sds->total_pwr += group->__cpu_power;
@@ -3768,6 +3788,12 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
 	if (100 * sds.max_load <= sd->imbalance_pct * sds.this_load)
 		goto out_balanced;
 
+	/* Push the upper limits for overload */
+	if (sds.max_load <= (sds.busiest->__cpu_power +
+				sds.group_capacity_bump * SCHED_LOAD_SCALE) /
+				sds.busiest->__cpu_power * SCHED_LOAD_SCALE)
+		goto out_balanced;
+
 	sds.busiest_load_per_task /= sds.busiest_nr_running;
 	if (sds.group_imb)
 		sds.busiest_load_per_task =



* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
  2009-05-13 13:11 [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Vaidyanathan Srinivasan
  2009-05-13 13:11 ` [RFC PATCH v2 1/2] sched: add sched_max_capacity_pct Vaidyanathan Srinivasan
  2009-05-13 13:11 ` [RFC PATCH v2 2/2] sched: loadbalancer hacks for forced packing of tasks Vaidyanathan Srinivasan
@ 2009-05-13 13:14 ` Peter Zijlstra
  2009-05-13 13:42   ` [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Vaidyanathan Srinivasan
  2009-05-13 13:45   ` Balbir Singh
  2009-05-13 14:35 ` [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Andi Kleen
  3 siblings, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2009-05-13 13:14 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi,
	Arjan van de Ven, Ingo Molnar, Dipankar Sarma, Balbir Singh,
	Vatsa, Gautham R Shenoy, Andi Kleen, Gregory Haskins,
	Mike Galbraith, Thomas Gleixner, Arun Bharadwaj

On Wed, 2009-05-13 at 18:41 +0530, Vaidyanathan Srinivasan wrote:

> * Peter Zijlstra wanted more justifications for throttling at the core
>   level.  Throttling may be a resource management problem rather than
>   scheduler/load balancer

No, I mandate that it be thermal management. Any other reason and you've
got a NAK.





* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
  2009-05-13 13:14 ` [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Peter Zijlstra
@ 2009-05-13 13:42   ` Vaidyanathan Srinivasan
  2009-05-13 13:45   ` Balbir Singh
  1 sibling, 0 replies; 22+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-05-13 13:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi,
	Arjan van de Ven, Ingo Molnar, Dipankar Sarma, Balbir Singh,
	Vatsa, Gautham R Shenoy, Andi Kleen, Gregory Haskins,
	Mike Galbraith, Thomas Gleixner, Arun Bharadwaj

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2009-05-13 15:14:57]:

> On Wed, 2009-05-13 at 18:41 +0530, Vaidyanathan Srinivasan wrote:
> 
> > * Peter Zijlstra wanted more justifications for throttling at the core
> >   level.  Throttling may be a resource management problem rather than
> >   scheduler/load balancer
> 
> No, I mandate that it be thermal management. Any other reason and you've
> got a NAK.

Hi Peter,

Yes, I understand your objection.  You want throttling to be done for
the purpose of thermal management only.  The primary purpose for
throttling should be thermal management (power savings may be
a side-effect)

What I meant in the above comment was that the implementation for
throttling could be solved using resource management framework,
cpuset/cgroup rather than biasing the load balancer to avoid work on
a particular core.

I am open to ideas for a clean and easy framework for core level
throttling.

Thanks,
Vaidy




* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
  2009-05-13 13:14 ` [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Peter Zijlstra
  2009-05-13 13:42   ` [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Vaidyanathan Srinivasan
@ 2009-05-13 13:45   ` Balbir Singh
  2009-05-13 13:47     ` Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Balbir Singh @ 2009-05-13 13:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vaidyanathan Srinivasan, Linux Kernel, Suresh B Siddha,
	Venkatesh Pallipadi, Arjan van de Ven, Ingo Molnar,
	Dipankar Sarma, Vatsa, Gautham R Shenoy, Andi Kleen,
	Gregory Haskins, Mike Galbraith, Thomas Gleixner, Arun Bharadwaj

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2009-05-13 15:14:57]:

> On Wed, 2009-05-13 at 18:41 +0530, Vaidyanathan Srinivasan wrote:
> 
> > * Peter Zijlstra wanted more justifications for throttling at the core
> >   level.  Throttling may be a resource management problem rather than
> >   scheduler/load balancer
> 
> No, I mandate that it be thermal management. Any other reason and you've
> got a NAK.

We've been discussing hard limits from the resource management view
point. Bharata is working on an RFC that we should be able to share
soon

-- 
	Balbir


* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
  2009-05-13 13:45   ` Balbir Singh
@ 2009-05-13 13:47     ` Peter Zijlstra
  2009-05-13 14:42       ` [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Balbir Singh
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2009-05-13 13:47 UTC (permalink / raw)
  To: balbir
  Cc: Vaidyanathan Srinivasan, Linux Kernel, Suresh B Siddha,
	Venkatesh Pallipadi, Arjan van de Ven, Ingo Molnar,
	Dipankar Sarma, Vatsa, Gautham R Shenoy, Andi Kleen,
	Gregory Haskins, Mike Galbraith, Thomas Gleixner, Arun Bharadwaj

On Wed, 2009-05-13 at 19:15 +0530, Balbir Singh wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> [2009-05-13 15:14:57]:
> 
> > On Wed, 2009-05-13 at 18:41 +0530, Vaidyanathan Srinivasan wrote:
> > 
> > > * Peter Zijlstra wanted more justifications for throttling at the core
> > >   level.  Throttling may be a resource management problem rather than
> > >   scheduler/load balancer
> > 
> > No, I mandate that it be thermal management. Any other reason and you've
> > got a NAK.
> 
> We've been discussing hard limits from the resource management view
> point. Bharata is working on a RFC that we should be able to share
> soon

Right, and you well know I dislike them.



* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
  2009-05-13 13:11 [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Vaidyanathan Srinivasan
                   ` (2 preceding siblings ...)
  2009-05-13 13:14 ` [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Peter Zijlstra
@ 2009-05-13 14:35 ` Andi Kleen
  2009-05-13 14:36   ` Peter Zijlstra
  3 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2009-05-13 14:35 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi,
	Peter Zijlstra, Arjan van de Ven, Ingo Molnar, Dipankar Sarma,
	Balbir Singh, Vatsa, Gautham R Shenoy, Andi Kleen,
	Gregory Haskins, Mike Galbraith, Thomas Gleixner, Arun Bharadwaj

On Wed, May 13, 2009 at 06:41:00PM +0530, Vaidyanathan Srinivasan wrote:
> * Using sched_mc=3,4,5 to evacuate 1,2,4 cores is completely
>   non-intuitive and broken interface.  Ingo wanted to see if we can
>   model a global percentile tunable that would map to core throttling.

I have one request. CPU throttling is already a very well established
term in the x86 world, referring to thermal throttling when the CPU
overheats. This is implemented by ACPI and the CPU. It's always
a very bad thing that should be avoided at all costs.

You seem to use it for something else. Can you please use a different
term for that? Reusing the same word for something else is confusing.

Thanks,

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
  2009-05-13 14:35 ` [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Andi Kleen
@ 2009-05-13 14:36   ` Peter Zijlstra
  2009-05-13 14:46     ` Andi Kleen
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2009-05-13 14:36 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Vaidyanathan Srinivasan, Linux Kernel, Suresh B Siddha,
	Venkatesh Pallipadi, Arjan van de Ven, Ingo Molnar,
	Dipankar Sarma, Balbir Singh, Vatsa, Gautham R Shenoy,
	Gregory Haskins, Mike Galbraith, Thomas Gleixner, Arun Bharadwaj

On Wed, 2009-05-13 at 16:35 +0200, Andi Kleen wrote:
> On Wed, May 13, 2009 at 06:41:00PM +0530, Vaidyanathan Srinivasan wrote:
> > * Using sched_mc=3,4,5 to evacuate 1,2,4 cores is completely
> >   non-intuitive and broken interface.  Ingo wanted to see if we can
> >   model a global percentile tunable that would map to core throttling.
> 
> I have one request. CPU throttling is already a very well established
> term in the x86 world, referring to thermal throttling when the CPU
> overheats. This is implemented by ACPI and the CPU. It's always
> a very bad thing that should be avoided at all costs.

It's about avoiding that.



* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
  2009-05-13 13:47     ` Peter Zijlstra
@ 2009-05-13 14:42       ` Balbir Singh
  0 siblings, 0 replies; 22+ messages in thread
From: Balbir Singh @ 2009-05-13 14:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vaidyanathan Srinivasan, Linux Kernel, Suresh B Siddha,
	Venkatesh Pallipadi, Arjan van de Ven, Ingo Molnar,
	Dipankar Sarma, Vatsa, Gautham R Shenoy, Andi Kleen,
	Gregory Haskins, Mike Galbraith, Thomas Gleixner, Arun Bharadwaj

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2009-05-13 15:47:55]:

> On Wed, 2009-05-13 at 19:15 +0530, Balbir Singh wrote:
> > * Peter Zijlstra <a.p.zijlstra@chello.nl> [2009-05-13 15:14:57]:
> > 
> > > On Wed, 2009-05-13 at 18:41 +0530, Vaidyanathan Srinivasan wrote:
> > > 
> > > > * Peter Zijlstra wanted more justifications for throttling at the core
> > > >   level.  Throttling may be a resource management problem rather than
> > > >   scheduler/load balancer
> > > 
> > > No, I mandate that it be thermal management. Any other reason and you've
> > > got a NAK.
> > 
> > We've been discussing hard limits from the resource management view
> > point. Bharata is working on a RFC that we should be able to share
> > soon
> 
> Right, and you well know I dislike them.
> 

Yes, but we hope to convince you with use cases and examples.

-- 
	Balbir


* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
  2009-05-13 14:36   ` Peter Zijlstra
@ 2009-05-13 14:46     ` Andi Kleen
  2009-05-13 14:50       ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2009-05-13 14:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, Vaidyanathan Srinivasan, Linux Kernel,
	Suresh B Siddha, Venkatesh Pallipadi, Arjan van de Ven,
	Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Gregory Haskins, Mike Galbraith,
	Thomas Gleixner, Arun Bharadwaj

On Wed, May 13, 2009 at 04:36:42PM +0200, Peter Zijlstra wrote:
> On Wed, 2009-05-13 at 16:35 +0200, Andi Kleen wrote:
> > On Wed, May 13, 2009 at 06:41:00PM +0530, Vaidyanathan Srinivasan wrote:
> > > * Using sched_mc=3,4,5 to evacuate 1,2,4 cores is completely
> > >   non-intuitive and broken interface.  Ingo wanted to see if we can
> > >   model a global percentile tunable that would map to core throttling.
> > 
> > I have one request. CPU throttling is already a very well established
> > term in the x86 world, referring to thermal throttling when the CPU
> > overheats. This is implemented by ACPI and the CPU. It's always
> > a very bad thing that should be avoided at all costs.
> 
> It's about avoiding that.

Hmm? Can you explain please? CPU throttling should only happen when your
cooling system is broken in some way.

It's not a power saving feature, just a "don't make CPU melt" feature.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
  2009-05-13 14:46     ` Andi Kleen
@ 2009-05-13 14:50       ` Peter Zijlstra
  2009-05-13 15:01         ` Andi Kleen
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2009-05-13 14:50 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Vaidyanathan Srinivasan, Linux Kernel, Suresh B Siddha,
	Venkatesh Pallipadi, Arjan van de Ven, Ingo Molnar,
	Dipankar Sarma, Balbir Singh, Vatsa, Gautham R Shenoy,
	Gregory Haskins, Mike Galbraith, Thomas Gleixner, Arun Bharadwaj

On Wed, 2009-05-13 at 16:46 +0200, Andi Kleen wrote:
> On Wed, May 13, 2009 at 04:36:42PM +0200, Peter Zijlstra wrote:
> > On Wed, 2009-05-13 at 16:35 +0200, Andi Kleen wrote:
> > > On Wed, May 13, 2009 at 06:41:00PM +0530, Vaidyanathan Srinivasan wrote:
> > > > * Using sched_mc=3,4,5 to evacuate 1,2,4 cores is completely
> > > >   non-intuitive and broken interface.  Ingo wanted to see if we can
> > > >   model a global percentile tunable that would map to core throttling.
> > > 
> > > I have one request. CPU throttling is already a very well established
> > > term in the x86 world, referring to thermal throttling when the CPU
> > > overheats. This is implemented by ACPI and the CPU. It's always
> > > a very bad thing that should be avoided at all costs.
> > 
> > It's about avoiding that.
> 
> Hmm? Can you explain please? CPU throttling should only happen when your
> cooling system is broken in some way.
> 
> It's not a power saving feature, just a "don't make CPU melt" feature.

From what I've been told it's popular to over-commit the cooling capacity
in a rack, so that a number of servers can run at full thermal capacity
but not all.

I've also been told that hardware sucks at throttling, therefore people
want to fix the OS so as to limit the thermal capacity and avoid the
hardware throttle from kicking in, whilst still not exceeding the rack
capacity or similar nonsense.





* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
  2009-05-13 14:50       ` Peter Zijlstra
@ 2009-05-13 15:01         ` Andi Kleen
  2009-05-13 15:02           ` Peter Zijlstra
                             ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Andi Kleen @ 2009-05-13 15:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, Vaidyanathan Srinivasan, Linux Kernel,
	Suresh B Siddha, Venkatesh Pallipadi, Arjan van de Ven,
	Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Gregory Haskins, Mike Galbraith,
	Thomas Gleixner, Arun Bharadwaj

> From what I've been told it's popular to over-commit the cooling capacity
> in a rack, so that a number of servers can run at full thermal capacity
> but not all.

Yes. But in this case you don't want to use throttling, you want
to use p-states which actually save power unlike throttling.

> I've also been told that hardware sucks at throttling, 

Throttling is not really something you should use in normal 
operation, it's just an emergency measure. For that it works
quite well, but you really don't want it in normal operation.

> therefore people
> want to fix the OS so as to limit the thermal capacity and avoid the
> hardware throttle from kicking in, whilst still not exceeding the rack
> capacity or similar nonsense.

Yes that's fine and common, but you actually need to save power for this,
which throttling doesn't do.

My understanding is that this work is an extension of the existing
sched_mc_power_savings features that tries to be optionally more
aggressive to keep complete packages idle so that package level
power saving kicks in.

I'm just requesting that they don't call that throttling.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
  2009-05-13 15:01         ` Andi Kleen
@ 2009-05-13 15:02           ` Peter Zijlstra
  2009-05-13 15:10             ` Andi Kleen
  2009-05-14 15:13           ` Vaidyanathan Srinivasan
  2009-05-19 20:40           ` Pavel Machek
  2 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2009-05-13 15:02 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Vaidyanathan Srinivasan, Linux Kernel, Suresh B Siddha,
	Venkatesh Pallipadi, Arjan van de Ven, Ingo Molnar,
	Dipankar Sarma, Balbir Singh, Vatsa, Gautham R Shenoy,
	Gregory Haskins, Mike Galbraith, Thomas Gleixner, Arun Bharadwaj

On Wed, 2009-05-13 at 17:01 +0200, Andi Kleen wrote:
> > From what I've been told it's popular to over-commit the cooling capacity
> > in a rack, so that a number of servers can run at full thermal capacity
> > but not all.
> 
> Yes. But in this case you don't want to use throttling, you want
> to use p-states which actually save power unlike throttling.
> 
> > I've also been told that hardware sucks at throttling, 
> 
> Throttling is not really something you should use in normal 
> operation, it's just an emergency measure. For that it works
> quite well, but you really don't want it in normal operation.
> 
> > therefore people
> > want to fix the OS so as to limit the thermal capacity and avoid the
> > hardware throttle from kicking in, whilst still not exceeding the rack
> > capacity or similar nonsense.
> 
> Yes that's fine and common, but you actually need to save power for this,
> which throttling doesn't do.
> 
> My understanding is that this work is an extension of the existing
> sched_mc_power_savings features that tries to be optionally more
> aggressive to keep complete packages idle so that package level
> power saving kicks in.
> 
> I'm just requesting that they don't call that throttling.

Ah no, this work differs in that regard in that it actually 'generates'
idle time, instead of optimizing idle time.

Therefore it takes actual cpu time away from real work, which is
throttling. Granted, one could call it limiting or similar, but
throttling is a correct name.



* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
  2009-05-13 15:02           ` Peter Zijlstra
@ 2009-05-13 15:10             ` Andi Kleen
  2009-05-14 14:58               ` Vaidyanathan Srinivasan
  0 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2009-05-13 15:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, Vaidyanathan Srinivasan, Linux Kernel,
	Suresh B Siddha, Venkatesh Pallipadi, Arjan van de Ven,
	Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Gregory Haskins, Mike Galbraith,
	Thomas Gleixner, Arun Bharadwaj

> > Yes that's fine and common, but you actually need to save power for this,
> > which throttling doesn't do.
> > 
> > My understanding is that this work is an extension of the existing
> > sched_mc_power_savings feature that tries to be optionally more
> > aggressive to keep complete packages idle so that package-level
> > power saving kicks in.
> > 
> > I'm just requesting that they don't call that throttling.
> 
> Ah no, this work differs in that regard in that it actually 'generates'
> idle time, instead of optimizing idle time.

That is what I meant by "more aggressive to keep complete packages idle"
above.

> 
> Therefore it takes actual cpu time away from real work, which is
> throttling. Granted, one could call it limiting or similar, but
> throttling is a correct name.

That will always cause confusion with the existing, established
term.

If you really need to call it throttling use "scheduler throttling"
or something like that, but a different word would be better.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
  2009-05-13 15:10             ` Andi Kleen
@ 2009-05-14 14:58               ` Vaidyanathan Srinivasan
  2009-05-14 15:06                 ` Andi Kleen
  0 siblings, 1 reply; 22+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-05-14 14:58 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Linux Kernel, Suresh B Siddha,
	Venkatesh Pallipadi, Arjan van de Ven, Ingo Molnar,
	Dipankar Sarma, Balbir Singh, Vatsa, Gautham R Shenoy,
	Gregory Haskins, Mike Galbraith, Thomas Gleixner, Arun Bharadwaj

* Andi Kleen <andi@firstfloor.org> [2009-05-13 17:10:54]:

> > > Yes that's fine and common, but you actually need to save power for this,
> > > which throttling doesn't do.
> > > 
> > > My understanding is that this work is an extension of the existing
> > > sched_mc_power_savings feature that tries to be optionally more
> > > aggressive to keep complete packages idle so that package-level
> > > power saving kicks in.
> > > 
> > > I'm just requesting that they don't call that throttling.
> > 
> > Ah no, this work differs in that regard in that it actually 'generates'
> > idle time, instead of optimizing idle time.
> 
> That is what I meant by "more aggressive to keep complete packages idle"
> above.

Hi Andi,

There is a difference in the framework, as Peter has mentioned: we are
trying to create idle time by forcefully reducing work.  From an
end-user point of view, this can be seen as a logical extension of
sched_mc_power_savings... v1 of the RFC extends that framework.

However, Ingo suggested that the knob is not intuitive, and hence I have
tried to switch to a percentage knob, sched_max_capacity_pct.

I am interested in an easy, simple and intuitive framework to evacuate
cores which may imply forcefully reducing (throttling) work.
 
> > Therefore it takes actual cpu time away from real work, which is
> > throttling. Granted, one could call it limiting or similar, but
> > throttling is a correct name.
> 
> That will always cause confusion with the existing, established
> term.
> 
> If you really need to call it throttling use "scheduler throttling"
> or something like that, but a different word would be better.

I think 'scheduler throttling' is good so that we avoid the term 'CPU
throttling' or core throttling.  I had named this cpu evacuation or
core evacuation just to avoid confusion with hardware throttling.

--Vaidy

* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
  2009-05-14 14:58               ` Vaidyanathan Srinivasan
@ 2009-05-14 15:06                 ` Andi Kleen
  2009-05-14 15:43                   ` Vaidyanathan Srinivasan
  0 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2009-05-14 15:06 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: Andi Kleen, Peter Zijlstra, Linux Kernel, Suresh B Siddha,
	Venkatesh Pallipadi, Arjan van de Ven, Ingo Molnar,
	Dipankar Sarma, Balbir Singh, Vatsa, Gautham R Shenoy,
	Gregory Haskins, Mike Galbraith, Thomas Gleixner, Arun Bharadwaj

> I think 'scheduler throttling' is good so that we avoid the term 'CPU
> throttling' or core throttling.  I had named this cpu evacuation or
> core evacuation just to avoid confusion with hardware throttling.

Evacuation sounds good, although shouldn't it be package or 
socket evacuation?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
  2009-05-13 15:01         ` Andi Kleen
  2009-05-13 15:02           ` Peter Zijlstra
@ 2009-05-14 15:13           ` Vaidyanathan Srinivasan
  2009-05-19 20:40           ` Pavel Machek
  2 siblings, 0 replies; 22+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-05-14 15:13 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Linux Kernel, Suresh B Siddha,
	Venkatesh Pallipadi, Arjan van de Ven, Ingo Molnar,
	Dipankar Sarma, Balbir Singh, Vatsa, Gautham R Shenoy,
	Gregory Haskins, Mike Galbraith, Thomas Gleixner, Arun Bharadwaj

* Andi Kleen <andi@firstfloor.org> [2009-05-13 17:01:00]:

> > From what I've been told it's popular to over-commit the cooling capacity
> > in a rack, so that a number of servers can run at full thermal capacity
> > but not all.
> 
> Yes. But in this case you don't want to use throttling, you want
> to use p-states, which actually save power, unlike throttling.

One of the design points for the discussion is to bring C-States
into the equation.  As you have mentioned, today we can effectively use
P-States to reduce core frequency and thereby reduce average power
and heat.  With the introduction of very low power deep sleep states
in the processor, C-States can provide substantial power savings beyond
P-State based methods alone.  Forcefully idling cores will lead to
exploitation of C-States and their power-saving benefits.

As mentioned earlier, cpu throttling as it exists today should not
be used in normal operating conditions.  However, by exploiting P-States
and C-States as two control variables, the system can be made to
operate at various power (thermal) and performance points.
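
To see whether forced idling really moves cores into deeper C-States,
one can look at the per-state counters that the cpuidle subsystem
exports in sysfs.  The following is only a minimal sketch: it assumes
the usual cpuidle layout (cpu<N>/cpuidle/state<M>/{name,usage,time})
is present, and the function name and the sysfs_root parameter are
made up here for illustration.

```python
from pathlib import Path


def read_cpuidle_residency(cpu, sysfs_root="/sys/devices/system/cpu"):
    """Return {state_name: (usage_count, time_us)} for one CPU.

    Reads the cpuidle sysfs files 'name', 'usage' and 'time' under
    cpu<N>/cpuidle/state<M>/.  Returns an empty dict when the cpuidle
    directory is missing (no cpuidle driver, or not a Linux system).
    """
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpuidle"
    residency = {}
    if not base.is_dir():
        return residency
    for state_dir in sorted(base.glob("state[0-9]*")):
        name = (state_dir / "name").read_text().strip()
        usage = int((state_dir / "usage").read_text())    # times entered
        time_us = int((state_dir / "time").read_text())   # microseconds
        residency[name] = (usage, time_us)
    return residency
```

Sampling this before and after changing the tunable, and comparing the
time spent in the deepest state, would give a rough indication of
whether the evacuated cores actually stay idle.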

> 
> > I've also been told that hardware sucks at throttling, 
> 
> Throttling is not really something you should use in normal 
> operation, it's just an emergency measure. For that it works
> quite well, but you really don't want it in normal operation.
> 
> > therefore people
> > want to fix the OS so as to limit the thermal capacity and avoid the
> > hardware throttle from kicking in, whilst still not exceeding the rack
> > capacity or similar nonsense.
> 
> Yes that's fine and common, but you actually need to save power for this,
> which throttling doesn't do.

Reducing work and scheduling it smartly in the OS can save considerably
more power than throttling in hardware in order to reduce power
or heat.
 
> My understanding is that this work is an extension of the existing
> sched_mc_power_savings feature that tries to be optionally more
> aggressive to keep complete packages idle so that package-level
> power saving kicks in.

Scheduling work smartly (power efficiently) is part of the
sched_mc_power_savings framework, while this RFC/discussion is around
reducing work or forcing idle times but at a granularity of
cores/packages to provide maximum power/thermal benefits.

--Vaidy

* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
  2009-05-14 15:06                 ` Andi Kleen
@ 2009-05-14 15:43                   ` Vaidyanathan Srinivasan
  0 siblings, 0 replies; 22+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-05-14 15:43 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Linux Kernel, Suresh B Siddha,
	Venkatesh Pallipadi, Arjan van de Ven, Ingo Molnar,
	Dipankar Sarma, Balbir Singh, Vatsa, Gautham R Shenoy,
	Gregory Haskins, Mike Galbraith, Thomas Gleixner, Arun Bharadwaj

* Andi Kleen <andi@firstfloor.org> [2009-05-14 17:06:32]:

> > I think 'scheduler throttling' is good so that we avoid the term 'CPU
> > throttling' or core throttling.  I had named this cpu evacuation or
> > core evacuation just to avoid confusion with hardware throttling.
> 
> Evacuation sounds good, although shouldn't it be package or 
> socket evacuation?

Let's start with 'core evacuation' since that seems to be the lowest
granularity on a threaded system.  We certainly want the framework to
provide socket/package and node evacuation on much larger systems.

--Vaidy

* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
  2009-05-13 15:01         ` Andi Kleen
  2009-05-13 15:02           ` Peter Zijlstra
  2009-05-14 15:13           ` Vaidyanathan Srinivasan
@ 2009-05-19 20:40           ` Pavel Machek
  2009-05-22  9:14             ` Vaidyanathan Srinivasan
  2 siblings, 1 reply; 22+ messages in thread
From: Pavel Machek @ 2009-05-19 20:40 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Vaidyanathan Srinivasan, Linux Kernel,
	Suresh B Siddha, Venkatesh Pallipadi, Arjan van de Ven,
	Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
	Gautham R Shenoy, Gregory Haskins, Mike Galbraith,
	Thomas Gleixner, Arun Bharadwaj

On Wed 2009-05-13 17:01:00, Andi Kleen wrote:
> > From what I've been told it's popular to over-commit the cooling capacity
> > in a rack, so that a number of servers can run at full thermal capacity
> > but not all.
> 
> Yes. But in this case you don't want to use throttling, you want
> to use p-states, which actually save power, unlike throttling.
> 
> > I've also been told that hardware sucks at throttling, 
> 
> Throttling is not really something you should use in normal 
> operation, it's just an emergency measure. For that it works
> quite well, but you really don't want it in normal operation.
> 
> > therefore people
> > want to fix the OS so as to limit the thermal capacity and avoid the
> > hardware throttle from kicking in, whilst still not exceeding the rack
> > capacity or similar nonsense.
> 
> Yes that's fine and common, but you actually need to save power for this,
> which throttling doesn't do.

Actually throttling will lower power consumption at any given moment
(not power consumption for any given task!) and will keep your rack
from melting.

But I don't see why it is necessary to evacuate cores for this. Why
not just schedule a special task that enters C3 instead of computing?

That was what I planned to do on athlon 900 (1 core) with broken
fan...

For what you are doing, cpu hotplug seems more suitable. Can you
enhance it so that it is fast enough for you?
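
For reference, the hotplug alternative is driven from user space by
writing 0 or 1 to a CPU's 'online' file in sysfs.  A minimal sketch of
that follows; it assumes the standard cpu<N>/online control file and
root privileges on a real system, and the helper names and the
sysfs_root parameter are hypothetical.

```python
from pathlib import Path


def set_cpu_online(cpu, online, sysfs_root="/sys/devices/system/cpu"):
    """Bring one specific CPU online (True) or offline (False).

    Writes "1" or "0" to cpu<N>/online.  Raises FileNotFoundError if
    the CPU has no 'online' control file, i.e. it cannot be hotplugged.
    """
    control = Path(sysfs_root) / f"cpu{cpu}" / "online"
    if not control.exists():
        raise FileNotFoundError(f"cpu{cpu} has no hotplug control file")
    control.write_text("1" if online else "0")


def is_cpu_online(cpu, sysfs_root="/sys/devices/system/cpu"):
    """Return True if the CPU is online.

    A CPU without an 'online' file is treated as always online here;
    that is an assumption of this sketch.
    """
    control = Path(sysfs_root) / f"cpu{cpu}" / "online"
    if not control.exists():
        return True
    return control.read_text().strip() == "1"
```

Note that the caller has to name a particular CPU, so any task
affinity or cpuset that includes that CPU is affected.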

							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
  2009-05-19 20:40           ` Pavel Machek
@ 2009-05-22  9:14             ` Vaidyanathan Srinivasan
  2009-05-28 20:36               ` Pavel Machek
  0 siblings, 1 reply; 22+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-05-22  9:14 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Andi Kleen, Peter Zijlstra, Linux Kernel, Suresh B Siddha,
	Venkatesh Pallipadi, Arjan van de Ven, Ingo Molnar,
	Dipankar Sarma, Balbir Singh, Vatsa, Gautham R Shenoy,
	Gregory Haskins, Mike Galbraith, Thomas Gleixner, Arun Bharadwaj

* Pavel Machek <pavel@ucw.cz> [2009-05-19 22:40:15]:

> On Wed 2009-05-13 17:01:00, Andi Kleen wrote:
> > > From what I've been told it's popular to over-commit the cooling capacity
> > > in a rack, so that a number of servers can run at full thermal capacity
> > > but not all.
> > 
> > Yes. But in this case you don't want to use throttling, you want
> > to use p-states, which actually save power, unlike throttling.
> > 
> > > I've also been told that hardware sucks at throttling, 
> > 
> > Throttling is not really something you should use in normal 
> > operation, it's just an emergency measure. For that it works
> > quite well, but you really don't want it in normal operation.
> > 
> > > therefore people
> > > want to fix the OS so as to limit the thermal capacity and avoid the
> > > hardware throttle from kicking in, whilst still not exceeding the rack
> > > capacity or similar nonsense.
> > 
> > Yes that's fine and common, but you actually need to save power for this,
> > which throttling doesn't do.
> 
> Actually throttling will lower power consumption at any given moment
> (not power consumption for any given task!) and will keep your rack
> from melting.

Yes, we want to reduce overall power consumption.
 
> But I don't see why it is necessary to evacuate cores for this. Why
> not just schedule a special task that enters C3 instead of computing?

This is essentially what happens in the load balancer approach.  Not
scheduling on a particular core will run the scheduler's idle task,
which will transition the core to the lowest power state.  Pinning a user
space task and using a special driver to hold the core in C3 state will
break scheduling fairness.  In that case the application decides when
to give the core back to the scheduler.

> That was what I planned to do on athlon 900 (1 core) with broken
> fan...
> 
> For what you are doing, cpu hotplug seems more suitable. Can you
> enhance it so that it is fast enough for you?

Yes, the cpu hotplug framework can be used.  That is definitely an
alternative to this approach.  However, in the case of cpu hotplug, the
evacuation is directed at a particular core, which may affect user
space affinity and cpusets.  With this approach we can limit the overall
system capacity, e.g. run at most 7 cores at a time in an 8 core
system, without needing to care which particular core is
'forced to idle' at any given point in time.
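
To illustrate the arithmetic behind a percentage knob, here is a rough
sketch of how a sched_max_capacity_pct value could map to a number of
cores allowed to run.  The rounding rule (round down, but always keep
at least one core) is an assumption made for this example and is not
taken from the patch; the function names are hypothetical.

```python
def cores_allowed(total_cores, max_capacity_pct):
    """Number of cores allowed to run tasks for a given percentage cap.

    Assumed rule: round down, but never go below one core.
    """
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    if not 0 < max_capacity_pct <= 100:
        raise ValueError("max_capacity_pct must be in the range 1..100")
    return max(1, total_cores * max_capacity_pct // 100)


def cores_to_evacuate(total_cores, max_capacity_pct):
    """Number of cores that would be forced idle under the cap."""
    return total_cores - cores_allowed(total_cores, max_capacity_pct)
```

Under these assumptions, an 8 core system at 75 would keep 6 cores
busy and evacuate 2, and at 50 it would keep 4.  Which cores end up
idle is left to the load balancer; the knob only fixes the count.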

Further discussion regarding this can be found in the following
thread: http://lkml.org/lkml/2009/5/19/54

--Vaidy


* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
  2009-05-22  9:14             ` Vaidyanathan Srinivasan
@ 2009-05-28 20:36               ` Pavel Machek
  0 siblings, 0 replies; 22+ messages in thread
From: Pavel Machek @ 2009-05-28 20:36 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: Andi Kleen, Peter Zijlstra, Linux Kernel, Suresh B Siddha,
	Venkatesh Pallipadi, Arjan van de Ven, Ingo Molnar,
	Dipankar Sarma, Balbir Singh, Vatsa, Gautham R Shenoy,
	Gregory Haskins, Mike Galbraith, Thomas Gleixner, Arun Bharadwaj

Hi!

> > But I don't see why it is necessary to evacuate cores for this. Why
> > not just schedule a special task that enters C3 instead of computing?
> 
> This is essentially what happens in the load balancer approach.  Not
> scheduling on a particular core will run the scheduler's idle task,
> which will transition the core to the lowest power state.  Pinning a user
> space task and using a special driver to hold the core in C3 state will
> break scheduling fairness.  In that case the application decides when
> to give the core back to the scheduler.

Why would it break scheduling fairness? You just schedule a "realtime"
task that enters C3 instead of computing. The behaviour is very
similar to that of a "normal" realtime task. Why would it break the scheduler?

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

end of thread: 2009-05-28 20:36 UTC

Thread overview: 22+ messages
2009-05-13 13:11 [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Vaidyanathan Srinivasan
2009-05-13 13:11 ` [RFC PATCH v2 1/2] sched: add sched_max_capacity_pct Vaidyanathan Srinivasan
2009-05-13 13:11 ` [RFC PATCH v2 2/2] sched: loadbalancer hacks for forced packing of tasks Vaidyanathan Srinivasan
2009-05-13 13:14 ` [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Peter Zijlstra
2009-05-13 13:42   ` [RFC PATCH v2 0/2] Saving power by cpu evacuationsched_max_capacity_pct=n Vaidyanathan Srinivasan
2009-05-13 13:45   ` Balbir Singh
2009-05-13 13:47     ` Peter Zijlstra
2009-05-13 14:42       ` [RFC PATCH v2 0/2] Saving power by cpuevacuationsched_max_capacity_pct=n Balbir Singh
2009-05-13 14:35 ` [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Andi Kleen
2009-05-13 14:36   ` Peter Zijlstra
2009-05-13 14:46     ` Andi Kleen
2009-05-13 14:50       ` Peter Zijlstra
2009-05-13 15:01         ` Andi Kleen
2009-05-13 15:02           ` Peter Zijlstra
2009-05-13 15:10             ` Andi Kleen
2009-05-14 14:58               ` Vaidyanathan Srinivasan
2009-05-14 15:06                 ` Andi Kleen
2009-05-14 15:43                   ` Vaidyanathan Srinivasan
2009-05-14 15:13           ` Vaidyanathan Srinivasan
2009-05-19 20:40           ` Pavel Machek
2009-05-22  9:14             ` Vaidyanathan Srinivasan
2009-05-28 20:36               ` Pavel Machek
