* [RFC][PATCH 00/14] load-balancing and cpu_power -v3
@ 2009-09-03 13:21 Peter Zijlstra
  2009-09-03 13:21 ` [RFC][PATCH 01/14] sched: restore __cpu_power to a straight sum of power Peter Zijlstra
                   ` (13 more replies)
  0 siblings, 14 replies; 23+ messages in thread
From: Peter Zijlstra @ 2009-09-03 13:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Gautham R Shenoy, Andreas Herrmann, Balbir Singh,
	Peter Zijlstra

By popular demand, the latest version of my patch queue on the subject ;-)

It has the aperf/mperf bits (although I haven't got an SMT-capable machine
to test on).

It also includes a few wake_idle cleanups which seem to regress my kbuild
test, so I'll have to prod at that a bit more.

Enjoy!


* [RFC][PATCH 01/14] sched: restore __cpu_power to a straight sum of power
  2009-09-03 13:21 [RFC][PATCH 00/14] load-balancing and cpu_power -v3 Peter Zijlstra
@ 2009-09-03 13:21 ` Peter Zijlstra
  2009-09-03 13:21 ` [RFC][PATCH 02/14] sched: SD_PREFER_SIBLING Peter Zijlstra
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2009-09-03 13:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Gautham R Shenoy, Andreas Herrmann, Balbir Singh,
	Peter Zijlstra

[-- Attachment #1: sched-lb-1.patch --]
[-- Type: text/plain, Size: 2319 bytes --]

cpu_power is supposed to be a representation of the processing capacity
of the cpu, not a value to randomly tweak in order to affect
placement.

Remove the placement hacks.
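
A minimal userspace sketch of the resulting arithmetic (not kernel code;
the topology is hypothetical): SMT siblings split one SCHED_LOAD_SCALE
and every parent group is a straight sum of its children.

#include <stdio.h>

#define SCHED_LOAD_SCALE	1024UL

int main(void)
{
	/* hypothetical package: 2 cores, 2 SMT siblings per core */
	unsigned long sibling = SCHED_LOAD_SCALE / 2;	/* 512 */
	unsigned long core = 2 * sibling;		/* 1024 */
	unsigned long package = 2 * core;		/* 2048 */

	printf("sibling=%lu core=%lu package=%lu\n",
	       sibling, core, package);
	return 0;
}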

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h    |    1 +
 include/linux/topology.h |    1 +
 kernel/sched.c           |   34 ++++++++++++++++++----------------
 3 files changed, 20 insertions(+), 16 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -8468,15 +8468,13 @@ static void free_sched_groups(const stru
  * there are asymmetries in the topology. If there are asymmetries, group
  * having more cpu_power will pickup more load compared to the group having
  * less cpu_power.
- *
- * cpu_power will be a multiple of SCHED_LOAD_SCALE. This multiple represents
- * the maximum number of tasks a group can handle in the presence of other idle
- * or lightly loaded groups in the same sched domain.
  */
 static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 {
 	struct sched_domain *child;
 	struct sched_group *group;
+	long power;
+	int weight;
 
 	WARN_ON(!sd || !sd->groups);
 
@@ -8487,22 +8485,20 @@ static void init_sched_groups_power(int 
 
 	sd->groups->__cpu_power = 0;
 
-	/*
-	 * For perf policy, if the groups in child domain share resources
-	 * (for example cores sharing some portions of the cache hierarchy
-	 * or SMT), then set this domain groups cpu_power such that each group
-	 * can handle only one task, when there are other idle groups in the
-	 * same sched domain.
-	 */
-	if (!child || (!(sd->flags & SD_POWERSAVINGS_BALANCE) &&
-		       (child->flags &
-			(SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES)))) {
-		sg_inc_cpu_power(sd->groups, SCHED_LOAD_SCALE);
+	if (!child) {
+		power = SCHED_LOAD_SCALE;
+		weight = cpumask_weight(sched_domain_span(sd));
+		/*
+		 * SMT siblings share the power of a single core.
+		 */
+		if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1)
+			power /= weight;
+		sg_inc_cpu_power(sd->groups, power);
 		return;
 	}
 
 	/*
-	 * add cpu_power of each child group to this groups cpu_power
+	 * Add cpu_power of each child group to this groups cpu_power.
 	 */
 	group = child->groups;
 	do {

-- 


* [RFC][PATCH 02/14] sched: SD_PREFER_SIBLING
  2009-09-03 13:21 [RFC][PATCH 00/14] load-balancing and cpu_power -v3 Peter Zijlstra
  2009-09-03 13:21 ` [RFC][PATCH 01/14] sched: restore __cpu_power to a straight sum of power Peter Zijlstra
@ 2009-09-03 13:21 ` Peter Zijlstra
  2009-09-03 13:21 ` [RFC][PATCH 03/14] sched: update the cpu_power sum during load-balance Peter Zijlstra
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2009-09-03 13:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Gautham R Shenoy, Andreas Herrmann, Balbir Singh,
	Peter Zijlstra

[-- Attachment #1: sched-lb-2.patch --]
[-- Type: text/plain, Size: 3932 bytes --]

Do the placement thing using an SD flag (SD_PREFER_SIBLING) instead of
tweaking cpu_power.

XXX: consider degenerate bits
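
A rough standalone illustration of the intended effect, assuming a group
of two SMT siblings whose child domain carries SD_PREFER_SIBLING: the
parent clamps the group's capacity to one task, so the second task counts
as excess and gets pushed towards a sibling group.

#include <stdio.h>

int main(void)
{
	/* assumed numbers: 2 threads in the group, both busy */
	unsigned int group_capacity = 2;
	unsigned int nr_running = 2;
	int prefer_sibling = 1;		/* child has SD_PREFER_SIBLING */

	if (prefer_sibling)
		group_capacity = 1;

	printf("excess tasks to migrate: %u\n",
	       nr_running > group_capacity ? nr_running - group_capacity : 0);
	return 0;
}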

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |   29 +++++++++++++++--------------
 kernel/sched.c        |   14 +++++++++++++-
 2 files changed, 28 insertions(+), 15 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -808,18 +808,19 @@ enum cpu_idle_type {
 #define SCHED_LOAD_SCALE_FUZZ	SCHED_LOAD_SCALE
 
 #ifdef CONFIG_SMP
-#define SD_LOAD_BALANCE		1	/* Do load balancing on this domain. */
-#define SD_BALANCE_NEWIDLE	2	/* Balance when about to become idle */
-#define SD_BALANCE_EXEC		4	/* Balance on exec */
-#define SD_BALANCE_FORK		8	/* Balance on fork, clone */
-#define SD_WAKE_IDLE		16	/* Wake to idle CPU on task wakeup */
-#define SD_WAKE_AFFINE		32	/* Wake task to waking CPU */
-#define SD_WAKE_BALANCE		64	/* Perform balancing at task wakeup */
-#define SD_SHARE_CPUPOWER	128	/* Domain members share cpu power */
-#define SD_POWERSAVINGS_BALANCE	256	/* Balance for power savings */
-#define SD_SHARE_PKG_RESOURCES	512	/* Domain members share cpu pkg resources */
-#define SD_SERIALIZE		1024	/* Only a single load balancing instance */
-#define SD_WAKE_IDLE_FAR	2048	/* Gain latency sacrificing cache hit */
+#define SD_LOAD_BALANCE		0x0001	/* Do load balancing on this domain. */
+#define SD_BALANCE_NEWIDLE	0x0002	/* Balance when about to become idle */
+#define SD_BALANCE_EXEC		0x0004	/* Balance on exec */
+#define SD_BALANCE_FORK		0x0008	/* Balance on fork, clone */
+#define SD_WAKE_IDLE		0x0010	/* Wake to idle CPU on task wakeup */
+#define SD_WAKE_AFFINE		0x0020	/* Wake task to waking CPU */
+#define SD_WAKE_BALANCE		0x0040	/* Perform balancing at task wakeup */
+#define SD_SHARE_CPUPOWER	0x0080	/* Domain members share cpu power */
+#define SD_POWERSAVINGS_BALANCE	0x0100	/* Balance for power savings */
+#define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
+#define SD_SERIALIZE		0x0400	/* Only a single load balancing instance */
+#define SD_WAKE_IDLE_FAR	0x0800	/* Gain latency sacrificing cache hit */
+#define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 
 enum powersavings_balance_level {
 	POWERSAVINGS_BALANCE_NONE = 0,  /* No power saving load balance */
@@ -839,7 +840,7 @@ static inline int sd_balance_for_mc_powe
 	if (sched_smt_power_savings)
 		return SD_POWERSAVINGS_BALANCE;
 
-	return 0;
+	return SD_PREFER_SIBLING;
 }
 
 static inline int sd_balance_for_package_power(void)
@@ -847,7 +848,7 @@ static inline int sd_balance_for_package
 	if (sched_mc_power_savings | sched_smt_power_savings)
 		return SD_POWERSAVINGS_BALANCE;
 
-	return 0;
+	return SD_PREFER_SIBLING;
 }
 
 /*
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -3803,9 +3803,13 @@ static inline void update_sd_lb_stats(st
 			const struct cpumask *cpus, int *balance,
 			struct sd_lb_stats *sds)
 {
+	struct sched_domain *child = sd->child;
 	struct sched_group *group = sd->groups;
 	struct sg_lb_stats sgs;
-	int load_idx;
+	int load_idx, prefer_sibling = 0;
+
+	if (child && child->flags & SD_PREFER_SIBLING)
+		prefer_sibling = 1;
 
 	init_sd_power_savings_stats(sd, sds, idle);
 	load_idx = get_sd_load_idx(sd, idle);
@@ -3825,6 +3829,14 @@ static inline void update_sd_lb_stats(st
 		sds->total_load += sgs.group_load;
 		sds->total_pwr += group->__cpu_power;
 
+		/*
+		 * In case the child domain prefers tasks go to siblings
+		 * first, lower the group capacity to one so that we'll try
+		 * and move all the excess tasks away.
+		 */
+		if (prefer_sibling)
+			sgs.group_capacity = 1;
+
 		if (local_group) {
 			sds->this_load = sgs.avg_load;
 			sds->this = group;

-- 


* [RFC][PATCH 03/14] sched: update the cpu_power sum during load-balance
  2009-09-03 13:21 [RFC][PATCH 00/14] load-balancing and cpu_power -v3 Peter Zijlstra
  2009-09-03 13:21 ` [RFC][PATCH 01/14] sched: restore __cpu_power to a straight sum of power Peter Zijlstra
  2009-09-03 13:21 ` [RFC][PATCH 02/14] sched: SD_PREFER_SIBLING Peter Zijlstra
@ 2009-09-03 13:21 ` Peter Zijlstra
  2009-09-03 13:21 ` [RFC][PATCH 04/14] sched: add smt_gain Peter Zijlstra
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2009-09-03 13:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Gautham R Shenoy, Andreas Herrmann, Balbir Singh,
	Peter Zijlstra

[-- Attachment #1: sched-lb-3.patch --]
[-- Type: text/plain, Size: 2677 bytes --]

In order to prepare for a more dynamic cpu_power, update the group sum
while walking the sched domains during load-balance.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |   33 +++++++++++++++++++++++++++++----
 1 file changed, 29 insertions(+), 4 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -3699,6 +3699,28 @@ static inline int check_power_save_busie
 }
 #endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
 
+static void update_sched_power(struct sched_domain *sd)
+{
+	struct sched_domain *child = sd->child;
+	struct sched_group *group, *sdg = sd->groups;
+	unsigned long power = sdg->__cpu_power;
+
+	if (!child) {
+		/* compute cpu power for this cpu */
+		return;
+	}
+
+	sdg->__cpu_power = 0;
+
+	group = child->groups;
+	do {
+		sdg->__cpu_power += group->__cpu_power;
+		group = group->next;
+	} while (group != child->groups);
+
+	if (power != sdg->__cpu_power)
+		sdg->reciprocal_cpu_power = reciprocal_value(sdg->__cpu_power);
+}
 
 /**
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
@@ -3712,7 +3734,8 @@ static inline int check_power_save_busie
  * @balance: Should we balance.
  * @sgs: variable to hold the statistics for this group.
  */
-static inline void update_sg_lb_stats(struct sched_group *group, int this_cpu,
+static inline void update_sg_lb_stats(struct sched_domain *sd,
+			struct sched_group *group, int this_cpu,
 			enum cpu_idle_type idle, int load_idx, int *sd_idle,
 			int local_group, const struct cpumask *cpus,
 			int *balance, struct sg_lb_stats *sgs)
@@ -3723,8 +3746,11 @@ static inline void update_sg_lb_stats(st
 	unsigned long sum_avg_load_per_task;
 	unsigned long avg_load_per_task;
 
-	if (local_group)
+	if (local_group) {
 		balance_cpu = group_first_cpu(group);
+		if (balance_cpu == this_cpu)
+			update_sched_power(sd);
+	}
 
 	/* Tally up the load of all CPUs in the group */
 	sum_avg_load_per_task = avg_load_per_task = 0;
@@ -3828,7 +3854,7 @@ static inline void update_sd_lb_stats(st
 		local_group = cpumask_test_cpu(this_cpu,
 					       sched_group_cpus(group));
 		memset(&sgs, 0, sizeof(sgs));
-		update_sg_lb_stats(group, this_cpu, idle, load_idx, sd_idle,
+		update_sg_lb_stats(sd, group, this_cpu, idle, load_idx, sd_idle,
 				local_group, cpus, balance, &sgs);
 
 		if (local_group && balance && !(*balance))
@@ -3863,7 +3889,6 @@ static inline void update_sd_lb_stats(st
 		update_sd_power_savings_stats(group, sds, local_group, &sgs);
 		group = group->next;
 	} while (group != sd->groups);
-
 }
 
 /**

-- 


* [RFC][PATCH 04/14] sched: add smt_gain
  2009-09-03 13:21 [RFC][PATCH 00/14] load-balancing and cpu_power -v3 Peter Zijlstra
                   ` (2 preceding siblings ...)
  2009-09-03 13:21 ` [RFC][PATCH 03/14] sched: update the cpu_power sum during load-balance Peter Zijlstra
@ 2009-09-03 13:21 ` Peter Zijlstra
  2009-09-03 13:21 ` [RFC][PATCH 05/14] sched: dynamic cpu_power Peter Zijlstra
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2009-09-03 13:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Gautham R Shenoy, Andreas Herrmann, Balbir Singh,
	Peter Zijlstra

[-- Attachment #1: sched-lb-1b.patch --]
[-- Type: text/plain, Size: 1916 bytes --]

The idea is that multi-threading a core yields more work capacity than
a single thread; provide a way to express that static gain for threads.
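
To make the constant concrete, here is a standalone sketch of the
arithmetic for an assumed 2-thread core: smt_gain of 1178 is roughly
1.15 * SCHED_LOAD_SCALE, so each sibling ends up just above half a
core's power and the pair together slightly exceeds one core.

#include <stdio.h>

#define SCHED_LOAD_SHIFT	10
#define SCHED_LOAD_SCALE	(1UL << SCHED_LOAD_SHIFT)	/* 1024 */

int main(void)
{
	unsigned long power = SCHED_LOAD_SCALE;
	unsigned long smt_gain = 1178;	/* ~15% over 1024 */
	unsigned long weight = 2;	/* two SMT siblings */

	power *= smt_gain;		/* 1024 * 1178 */
	power /= weight;		/* / 2 */
	power >>= SCHED_LOAD_SHIFT;	/* / 1024 -> 589 */

	printf("per-sibling power: %lu, core total: %lu\n",
	       power, weight * power);
	return 0;
}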

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h    |    1 +
 include/linux/topology.h |    1 +
 kernel/sched.c           |    8 +++++++-
 3 files changed, 9 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -930,6 +930,7 @@ struct sched_domain {
 	unsigned int newidle_idx;
 	unsigned int wake_idx;
 	unsigned int forkexec_idx;
+	unsigned int smt_gain;
 	int flags;			/* See SD_* */
 	enum sched_domain_level level;
 
Index: linux-2.6/include/linux/topology.h
===================================================================
--- linux-2.6.orig/include/linux/topology.h
+++ linux-2.6/include/linux/topology.h
@@ -99,6 +99,7 @@ int arch_update_cpu_topology(void);
 				| SD_SHARE_CPUPOWER,	\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
+	.smt_gain		= 1178,	/* 15% */	\
 }
 #endif
 #endif /* CONFIG_SCHED_SMT */
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -8490,9 +8490,15 @@ static void init_sched_groups_power(int 
 		weight = cpumask_weight(sched_domain_span(sd));
 		/*
 		 * SMT siblings share the power of a single core.
+		 * Usually multiple threads get a better yield out of
+		 * that one core than a single thread would have,
+		 * reflect that in sd->smt_gain.
 		 */
-		if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1)
+		if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
+			power *= sd->smt_gain;
 			power /= weight;
+			power >>= SCHED_LOAD_SHIFT;
+		}
 		sg_inc_cpu_power(sd->groups, power);
 		return;
 	}

-- 


* [RFC][PATCH 05/14] sched: dynamic cpu_power
  2009-09-03 13:21 [RFC][PATCH 00/14] load-balancing and cpu_power -v3 Peter Zijlstra
                   ` (3 preceding siblings ...)
  2009-09-03 13:21 ` [RFC][PATCH 04/14] sched: add smt_gain Peter Zijlstra
@ 2009-09-03 13:21 ` Peter Zijlstra
  2009-09-03 13:21 ` [RFC][PATCH 06/14] sched: scale down cpu_power due to RT tasks Peter Zijlstra
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2009-09-03 13:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Gautham R Shenoy, Andreas Herrmann, Balbir Singh,
	Peter Zijlstra

[-- Attachment #1: sched-lb-4.patch --]
[-- Type: text/plain, Size: 1983 bytes --]

Recompute the cpu_power for each cpu during load-balance.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |   38 +++++++++++++++++++++++++++++++++++---
 1 file changed, 35 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -3691,14 +3691,46 @@ static inline int check_power_save_busie
 }
 #endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
 
-static void update_sched_power(struct sched_domain *sd)
+unsigned long __weak arch_smt_gain(struct sched_domain *sd, int cpu)
+{
+	unsigned long weight = cpumask_weight(sched_domain_span(sd));
+	unsigned long smt_gain = sd->smt_gain;
+
+	smt_gain /= weight;
+
+	return smt_gain;
+}
+
+static void update_cpu_power(struct sched_domain *sd, int cpu)
+{
+	unsigned long weight = cpumask_weight(sched_domain_span(sd));
+	unsigned long power = SCHED_LOAD_SCALE;
+	struct sched_group *sdg = sd->groups;
+	unsigned long old = sdg->__cpu_power;
+
+	/* here we could scale based on cpufreq */
+
+	if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
+		power *= arch_smt_gain(sd, cpu);
+		power >>= SCHED_LOAD_SHIFT;
+	}
+
+	/* here we could scale based on RT time */
+
+	if (power != old) {
+		sdg->__cpu_power = power;
+		sdg->reciprocal_cpu_power = reciprocal_value(power);
+	}
+}
+
+static void update_group_power(struct sched_domain *sd, int cpu)
 {
 	struct sched_domain *child = sd->child;
 	struct sched_group *group, *sdg = sd->groups;
 	unsigned long power = sdg->__cpu_power;
 
 	if (!child) {
-		/* compute cpu power for this cpu */
+		update_cpu_power(sd, cpu);
 		return;
 	}
 
@@ -3743,7 +3775,7 @@ static inline void update_sg_lb_stats(st
 	if (local_group) {
 		balance_cpu = group_first_cpu(group);
 		if (balance_cpu == this_cpu)
-			update_sched_power(sd);
+			update_group_power(sd, this_cpu);
 	}
 
 	/* Tally up the load of all CPUs in the group */

-- 


* [RFC][PATCH 06/14] sched: scale down cpu_power due to RT tasks
  2009-09-03 13:21 [RFC][PATCH 00/14] load-balancing and cpu_power -v3 Peter Zijlstra
                   ` (4 preceding siblings ...)
  2009-09-03 13:21 ` [RFC][PATCH 05/14] sched: dynamic cpu_power Peter Zijlstra
@ 2009-09-03 13:21 ` Peter Zijlstra
  2009-09-03 13:21 ` [RFC][PATCH 07/14] sched: try to deal with low capacity Peter Zijlstra
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2009-09-03 13:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Gautham R Shenoy, Andreas Herrmann, Balbir Singh,
	Peter Zijlstra

[-- Attachment #1: sched-lb-5.patch --]
[-- Type: text/plain, Size: 5311 bytes --]

Keep an average of the amount of time spent on RT tasks and use that
fraction to scale down the cpu_power available for regular tasks.
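
A standalone sketch of the scaling with assumed numbers, taking
rq->clock == rq->age_stamp for simplicity: if RT tasks ate 25% of the
averaging period, the cpu_power left for fair tasks drops to about 75%.

#include <stdio.h>

#define SCHED_LOAD_SHIFT	10
#define SCHED_LOAD_SCALE	(1ULL << SCHED_LOAD_SHIFT)

int main(void)
{
	/* sched_avg_period(): 1000ms sysctl / 2, in ns */
	unsigned long long total = 500000000ULL;
	unsigned long long rt_avg = total / 4;	/* 25% spent in RT */
	unsigned long long available = total - rt_avg;
	unsigned long long scale, power;

	total >>= SCHED_LOAD_SHIFT;
	scale = available / total;		/* ~768 out of 1024 */

	power = (SCHED_LOAD_SCALE * scale) >> SCHED_LOAD_SHIFT;
	printf("rt scale %llu/1024, cpu_power %llu\n", scale, power);
	return 0;
}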

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |    1 
 kernel/sched.c        |   64 +++++++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched_rt.c     |    6 +---
 kernel/sysctl.c       |    8 ++++++
 4 files changed, 72 insertions(+), 7 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1863,6 +1863,7 @@ extern unsigned int sysctl_sched_child_r
 extern unsigned int sysctl_sched_features;
 extern unsigned int sysctl_sched_migration_cost;
 extern unsigned int sysctl_sched_nr_migrate;
+extern unsigned int sysctl_sched_time_avg;
 extern unsigned int sysctl_timer_migration;
 
 int sched_nr_latency_handler(struct ctl_table *table, int write,
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -627,6 +627,9 @@ struct rq {
 
 	struct task_struct *migration_thread;
 	struct list_head migration_queue;
+
+	u64 rt_avg;
+	u64 age_stamp;
 #endif
 
 	/* calc_load related fields */
@@ -863,6 +866,14 @@ unsigned int sysctl_sched_shares_ratelim
 unsigned int sysctl_sched_shares_thresh = 4;
 
 /*
+ * period over which we average the RT time consumption, measured
+ * in ms.
+ *
+ * default: 1s
+ */
+const_debug unsigned int sysctl_sched_time_avg = MSEC_PER_SEC;
+
+/*
  * period over which we measure -rt task cpu usage in us.
  * default: 1s
  */
@@ -1280,12 +1291,37 @@ void wake_up_idle_cpu(int cpu)
 }
 #endif /* CONFIG_NO_HZ */
 
+static u64 sched_avg_period(void)
+{
+	return (u64)sysctl_sched_time_avg * NSEC_PER_MSEC / 2;
+}
+
+static void sched_avg_update(struct rq *rq)
+{
+	s64 period = sched_avg_period();
+
+	while ((s64)(rq->clock - rq->age_stamp) > period) {
+		rq->age_stamp += period;
+		rq->rt_avg /= 2;
+	}
+}
+
+static void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
+{
+	rq->rt_avg += rt_delta;
+	sched_avg_update(rq);
+}
+
 #else /* !CONFIG_SMP */
 static void resched_task(struct task_struct *p)
 {
 	assert_spin_locked(&task_rq(p)->lock);
 	set_tsk_need_resched(p);
 }
+
+static void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
+{
+}
 #endif /* CONFIG_SMP */
 
 #if BITS_PER_LONG == 32
@@ -3699,7 +3735,7 @@ static inline int check_power_save_busie
 }
 #endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
 
-unsigned long __weak arch_smt_gain(struct sched_domain *sd, int cpu)
+unsigned long __weak arch_scale_smt_power(struct sched_domain *sd, int cpu)
 {
 	unsigned long weight = cpumask_weight(sched_domain_span(sd));
 	unsigned long smt_gain = sd->smt_gain;
@@ -3709,6 +3745,24 @@ unsigned long __weak arch_smt_gain(struc
 	return smt_gain;
 }
 
+unsigned long scale_rt_power(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	u64 total, available;
+
+	sched_avg_update(rq);
+
+	total = sched_avg_period() + (rq->clock - rq->age_stamp);
+	available = total - rq->rt_avg;
+
+	if (unlikely((s64)total < SCHED_LOAD_SCALE))
+		total = SCHED_LOAD_SCALE;
+
+	total >>= SCHED_LOAD_SHIFT;
+
+	return div_u64(available, total);
+}
+
 static void update_cpu_power(struct sched_domain *sd, int cpu)
 {
 	unsigned long weight = cpumask_weight(sched_domain_span(sd));
@@ -3719,11 +3773,15 @@ static void update_cpu_power(struct sche
 	/* here we could scale based on cpufreq */
 
 	if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
-		power *= arch_smt_gain(sd, cpu);
+		power *= arch_scale_smt_power(sd, cpu);
 		power >>= SCHED_LOAD_SHIFT;
 	}
 
-	/* here we could scale based on RT time */
+	power *= scale_rt_power(cpu);
+	power >>= SCHED_LOAD_SHIFT;
+
+	if (!power)
+		power = 1;
 
 	if (power != old) {
 		sdg->__cpu_power = power;
Index: linux-2.6/kernel/sched_rt.c
===================================================================
--- linux-2.6.orig/kernel/sched_rt.c
+++ linux-2.6/kernel/sched_rt.c
@@ -615,6 +615,8 @@ static void update_curr_rt(struct rq *rq
 	curr->se.exec_start = rq->clock;
 	cpuacct_charge(curr, delta_exec);
 
+	sched_rt_avg_update(rq, delta_exec);
+
 	if (!rt_bandwidth_enabled())
 		return;
 
@@ -887,8 +889,6 @@ static void enqueue_task_rt(struct rq *r
 
 	if (!task_current(rq, p) && p->rt.nr_cpus_allowed > 1)
 		enqueue_pushable_task(rq, p);
-
-	inc_cpu_load(rq, p->se.load.weight);
 }
 
 static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int sleep)
@@ -899,8 +899,6 @@ static void dequeue_task_rt(struct rq *r
 	dequeue_rt_entity(rt_se);
 
 	dequeue_pushable_task(rq, p);
-
-	dec_cpu_load(rq, p->se.load.weight);
 }
 
 /*
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -332,6 +332,14 @@ static struct ctl_table kern_table[] = {
 	},
 	{
 		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "sched_time_avg",
+		.data		= &sysctl_sched_time_avg,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
 		.procname	= "timer_migration",
 		.data		= &sysctl_timer_migration,
 		.maxlen		= sizeof(unsigned int),

-- 


* [RFC][PATCH 07/14] sched: try to deal with low capacity
  2009-09-03 13:21 [RFC][PATCH 00/14] load-balancing and cpu_power -v3 Peter Zijlstra
                   ` (5 preceding siblings ...)
  2009-09-03 13:21 ` [RFC][PATCH 06/14] sched: scale down cpu_power due to RT tasks Peter Zijlstra
@ 2009-09-03 13:21 ` Peter Zijlstra
  2009-09-03 13:21 ` [RFC][PATCH 08/14] sched: remove reciprocal for cpu_power Peter Zijlstra
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2009-09-03 13:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Gautham R Shenoy, Andreas Herrmann, Balbir Singh,
	Peter Zijlstra

[-- Attachment #1: sched-lb-6.patch --]
[-- Type: text/plain, Size: 2481 bytes --]

When the capacity drops low, we want to migrate load away. Allow the
load-balancer to remove all tasks when we hit rock bottom.
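
A rough numeric illustration of the rounding change (SCHED_LOAD_SCALE
assumed to be 1024): groups whose power has been scaled below half a
nominal cpu now report capacity 0, which is what lets find_busiest_queue
pull even a single running task off them.

#include <stdio.h>

#define SCHED_LOAD_SCALE	1024UL
#define DIV_ROUND_CLOSEST(x, d)	(((x) + (d) / 2) / (d))

int main(void)
{
	/* assumed powers: full cpu, SMT sibling, sibling also busy with RT */
	unsigned long power[] = { 1024, 589, 400 };
	int i;

	for (i = 0; i < 3; i++)
		printf("power %4lu -> capacity %lu\n", power[i],
		       DIV_ROUND_CLOSEST(power[i], SCHED_LOAD_SCALE));
	return 0;
}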

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
[ego@in.ibm.com: fix to update_sd_power_savings_stats]
---
 kernel/sched.c |   35 +++++++++++++++++++++++++++++------
 1 file changed, 29 insertions(+), 6 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -3668,7 +3668,7 @@ static inline void update_sd_power_savin
 	 * capacity but still has some space to pick up some load
 	 * from other group and save more power
 	 */
-	if (sgs->sum_nr_running > sgs->group_capacity - 1)
+	if (sgs->sum_nr_running + 1 > sgs->group_capacity)
 		return;
 
 	if (sgs->sum_nr_running > sds->leader_nr_running ||
@@ -3908,8 +3908,8 @@ static inline void update_sg_lb_stats(st
 	if ((max_cpu_load - min_cpu_load) > 2*avg_load_per_task)
 		sgs->group_imb = 1;
 
-	sgs->group_capacity = group->__cpu_power / SCHED_LOAD_SCALE;
-
+	sgs->group_capacity =
+		DIV_ROUND_CLOSEST(group->__cpu_power, SCHED_LOAD_SCALE);
 }
 
 /**
@@ -3959,7 +3959,7 @@ static inline void update_sd_lb_stats(st
 		 * and move all the excess tasks away.
 		 */
 		if (prefer_sibling)
-			sgs.group_capacity = 1;
+			sgs.group_capacity = min(sgs.group_capacity, 1UL);
 
 		if (local_group) {
 			sds->this_load = sgs.avg_load;
@@ -4191,6 +4191,26 @@ ret:
 	return NULL;
 }
 
+static struct sched_group *group_of(int cpu)
+{
+	struct sched_domain *sd = rcu_dereference(cpu_rq(cpu)->sd);
+
+	if (!sd)
+		return NULL;
+
+	return sd->groups;
+}
+
+static unsigned long power_of(int cpu)
+{
+	struct sched_group *group = group_of(cpu);
+
+	if (!group)
+		return SCHED_LOAD_SCALE;
+
+	return group->__cpu_power;
+}
+
 /*
  * find_busiest_queue - find the busiest runqueue among the cpus in group.
  */
@@ -4203,15 +4223,18 @@ find_busiest_queue(struct sched_group *g
 	int i;
 
 	for_each_cpu(i, sched_group_cpus(group)) {
+		unsigned long power = power_of(i);
+		unsigned long capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
 		unsigned long wl;
 
 		if (!cpumask_test_cpu(i, cpus))
 			continue;
 
 		rq = cpu_rq(i);
-		wl = weighted_cpuload(i);
+		wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
+		wl /= power;
 
-		if (rq->nr_running == 1 && wl > imbalance)
+		if (capacity && rq->nr_running == 1 && wl > imbalance)
 			continue;
 
 		if (wl > max_load) {

-- 


* [RFC][PATCH 08/14] sched: remove reciprocal for cpu_power
  2009-09-03 13:21 [RFC][PATCH 00/14] load-balancing and cpu_power -v3 Peter Zijlstra
                   ` (6 preceding siblings ...)
  2009-09-03 13:21 ` [RFC][PATCH 07/14] sched: try to deal with low capacity Peter Zijlstra
@ 2009-09-03 13:21 ` Peter Zijlstra
  2009-09-03 13:59   ` Peter Zijlstra
  2009-09-03 13:21 ` [RFC][PATCH 09/14] x86: move APERF/MPERF into a X86_FEATURE Peter Zijlstra
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2009-09-03 13:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Gautham R Shenoy, Andreas Herrmann, Balbir Singh,
	Peter Zijlstra

[-- Attachment #1: sched-lb-7.patch --]
[-- Type: text/plain, Size: 9273 bytes --]

It's a source of fail, and now that cpu_power is dynamic it's also a
waste of time.

before:
<idle>-0   [000]   132.877936: find_busiest_group: avg_load: 0 group_load: 8241 power: 1 

after:
bash-1689  [001]   137.862151: find_busiest_group: avg_load: 10636288 group_load: 10387 power: 1
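
For reference, this is approximately what the removed helpers did (a
userspace copy of the old lib/reciprocal_div.c trick, shown here only
for illustration): dividing by a "constant" via a precomputed 2^32/d
multiplier. Once cpu_power gets recomputed on every balance, the
multiplier has to be recomputed just as often, so the plain divide wins.

#include <stdio.h>
#include <stdint.h>

static uint32_t reciprocal_value(uint32_t d)
{
	return (uint32_t)(((1ULL << 32) + d - 1) / d);
}

static uint32_t reciprocal_divide(uint32_t a, uint32_t r)
{
	return (uint32_t)(((uint64_t)a * r) >> 32);
}

int main(void)
{
	uint32_t power = 589;			/* assumed cpu_power */
	uint32_t load = 10387;			/* assumed group load */
	uint32_t r = reciprocal_value(power);	/* must redo when power changes */

	printf("plain: %u  reciprocal: %u\n", load / power,
	       reciprocal_divide(load, r));
	return 0;
}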

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
[andreas.herrmann3@amd.com: remove include]
---
 include/linux/sched.h |   10 +----
 kernel/sched.c        |  100 +++++++++++++++++---------------------------------
 2 files changed, 36 insertions(+), 74 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -120,30 +120,8 @@
  */
 #define RUNTIME_INF	((u64)~0ULL)
 
-#ifdef CONFIG_SMP
-
 static void double_rq_lock(struct rq *rq1, struct rq *rq2);
 
-/*
- * Divide a load by a sched group cpu_power : (load / sg->__cpu_power)
- * Since cpu_power is a 'constant', we can use a reciprocal divide.
- */
-static inline u32 sg_div_cpu_power(const struct sched_group *sg, u32 load)
-{
-	return reciprocal_divide(load, sg->reciprocal_cpu_power);
-}
-
-/*
- * Each time a sched group cpu_power is changed,
- * we must compute its reciprocal value
- */
-static inline void sg_inc_cpu_power(struct sched_group *sg, u32 val)
-{
-	sg->__cpu_power += val;
-	sg->reciprocal_cpu_power = reciprocal_value(sg->__cpu_power);
-}
-#endif
-
 static inline int rt_policy(int policy)
 {
 	if (unlikely(policy == SCHED_FIFO || policy == SCHED_RR))
@@ -2335,8 +2313,7 @@ find_idlest_group(struct sched_domain *s
 		}
 
 		/* Adjust by relative CPU power of the group */
-		avg_load = sg_div_cpu_power(group,
-				avg_load * SCHED_LOAD_SCALE);
+		avg_load = (avg_load * SCHED_LOAD_SCALE) / group->cpu_power;
 
 		if (local_group) {
 			this_load = avg_load;
@@ -3768,7 +3745,6 @@ static void update_cpu_power(struct sche
 	unsigned long weight = cpumask_weight(sched_domain_span(sd));
 	unsigned long power = SCHED_LOAD_SCALE;
 	struct sched_group *sdg = sd->groups;
-	unsigned long old = sdg->__cpu_power;
 
 	/* here we could scale based on cpufreq */
 
@@ -3783,33 +3759,26 @@ static void update_cpu_power(struct sche
 	if (!power)
 		power = 1;
 
-	if (power != old) {
-		sdg->__cpu_power = power;
-		sdg->reciprocal_cpu_power = reciprocal_value(power);
-	}
+	sdg->cpu_power = power;
 }
 
 static void update_group_power(struct sched_domain *sd, int cpu)
 {
 	struct sched_domain *child = sd->child;
 	struct sched_group *group, *sdg = sd->groups;
-	unsigned long power = sdg->__cpu_power;
 
 	if (!child) {
 		update_cpu_power(sd, cpu);
 		return;
 	}
 
-	sdg->__cpu_power = 0;
+	sdg->cpu_power = 0;
 
 	group = child->groups;
 	do {
-		sdg->__cpu_power += group->__cpu_power;
+		sdg->cpu_power += group->cpu_power;
 		group = group->next;
 	} while (group != child->groups);
-
-	if (power != sdg->__cpu_power)
-		sdg->reciprocal_cpu_power = reciprocal_value(sdg->__cpu_power);
 }
 
 /**
@@ -3889,8 +3858,7 @@ static inline void update_sg_lb_stats(st
 	}
 
 	/* Adjust by relative CPU power of the group */
-	sgs->avg_load = sg_div_cpu_power(group,
-			sgs->group_load * SCHED_LOAD_SCALE);
+	sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;
 
 
 	/*
@@ -3902,14 +3870,14 @@ static inline void update_sg_lb_stats(st
 	 *      normalized nr_running number somewhere that negates
 	 *      the hierarchy?
 	 */
-	avg_load_per_task = sg_div_cpu_power(group,
-			sum_avg_load_per_task * SCHED_LOAD_SCALE);
+	avg_load_per_task = (sum_avg_load_per_task * SCHED_LOAD_SCALE) /
+		group->cpu_power;
 
 	if ((max_cpu_load - min_cpu_load) > 2*avg_load_per_task)
 		sgs->group_imb = 1;
 
 	sgs->group_capacity =
-		DIV_ROUND_CLOSEST(group->__cpu_power, SCHED_LOAD_SCALE);
+		DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
 }
 
 /**
@@ -3951,7 +3919,7 @@ static inline void update_sd_lb_stats(st
 			return;
 
 		sds->total_load += sgs.group_load;
-		sds->total_pwr += group->__cpu_power;
+		sds->total_pwr += group->cpu_power;
 
 		/*
 		 * In case the child domain prefers tasks go to siblings
@@ -4016,28 +3984,28 @@ static inline void fix_small_imbalance(s
 	 * moving them.
 	 */
 
-	pwr_now += sds->busiest->__cpu_power *
+	pwr_now += sds->busiest->cpu_power *
 			min(sds->busiest_load_per_task, sds->max_load);
-	pwr_now += sds->this->__cpu_power *
+	pwr_now += sds->this->cpu_power *
 			min(sds->this_load_per_task, sds->this_load);
 	pwr_now /= SCHED_LOAD_SCALE;
 
 	/* Amount of load we'd subtract */
-	tmp = sg_div_cpu_power(sds->busiest,
-			sds->busiest_load_per_task * SCHED_LOAD_SCALE);
+	tmp = (sds->busiest_load_per_task * SCHED_LOAD_SCALE) /
+		sds->busiest->cpu_power;
 	if (sds->max_load > tmp)
-		pwr_move += sds->busiest->__cpu_power *
+		pwr_move += sds->busiest->cpu_power *
 			min(sds->busiest_load_per_task, sds->max_load - tmp);
 
 	/* Amount of load we'd add */
-	if (sds->max_load * sds->busiest->__cpu_power <
+	if (sds->max_load * sds->busiest->cpu_power <
 		sds->busiest_load_per_task * SCHED_LOAD_SCALE)
-		tmp = sg_div_cpu_power(sds->this,
-			sds->max_load * sds->busiest->__cpu_power);
+		tmp = (sds->max_load * sds->busiest->cpu_power) /
+			sds->this->cpu_power;
 	else
-		tmp = sg_div_cpu_power(sds->this,
-			sds->busiest_load_per_task * SCHED_LOAD_SCALE);
-	pwr_move += sds->this->__cpu_power *
+		tmp = (sds->busiest_load_per_task * SCHED_LOAD_SCALE) /
+			sds->this->cpu_power;
+	pwr_move += sds->this->cpu_power *
 			min(sds->this_load_per_task, sds->this_load + tmp);
 	pwr_move /= SCHED_LOAD_SCALE;
 
@@ -4072,8 +4040,8 @@ static inline void calculate_imbalance(s
 			sds->max_load - sds->busiest_load_per_task);
 
 	/* How much load to actually move to equalise the imbalance */
-	*imbalance = min(max_pull * sds->busiest->__cpu_power,
-		(sds->avg_load - sds->this_load) * sds->this->__cpu_power)
+	*imbalance = min(max_pull * sds->busiest->cpu_power,
+		(sds->avg_load - sds->this_load) * sds->this->cpu_power)
 			/ SCHED_LOAD_SCALE;
 
 	/*
@@ -4208,7 +4176,7 @@ static unsigned long power_of(int cpu)
 	if (!group)
 		return SCHED_LOAD_SCALE;
 
-	return group->__cpu_power;
+	return group->cpu_power;
 }
 
 /*
@@ -7934,7 +7902,7 @@ static int sched_domain_debug_one(struct
 			break;
 		}
 
-		if (!group->__cpu_power) {
+		if (!group->cpu_power) {
 			printk(KERN_CONT "\n");
 			printk(KERN_ERR "ERROR: domain->cpu_power not "
 					"set\n");
@@ -7958,9 +7926,9 @@ static int sched_domain_debug_one(struct
 		cpulist_scnprintf(str, sizeof(str), sched_group_cpus(group));
 
 		printk(KERN_CONT " %s", str);
-		if (group->__cpu_power != SCHED_LOAD_SCALE) {
-			printk(KERN_CONT " (__cpu_power = %d)",
-				group->__cpu_power);
+		if (group->cpu_power != SCHED_LOAD_SCALE) {
+			printk(KERN_CONT " (cpu_power = %d)",
+				group->cpu_power);
 		}
 
 		group = group->next;
@@ -8245,7 +8213,7 @@ init_sched_build_groups(const struct cpu
 			continue;
 
 		cpumask_clear(sched_group_cpus(sg));
-		sg->__cpu_power = 0;
+		sg->cpu_power = 0;
 
 		for_each_cpu(j, span) {
 			if (group_fn(j, cpu_map, NULL, tmpmask) != group)
@@ -8503,7 +8471,7 @@ static void init_numa_sched_groups_power
 				continue;
 			}
 
-			sg_inc_cpu_power(sg, sd->groups->__cpu_power);
+			sg->cpu_power += sd->groups->cpu_power;
 		}
 		sg = sg->next;
 	} while (sg != group_head);
@@ -8540,7 +8508,7 @@ static int build_numa_sched_groups(struc
 		sd->groups = sg;
 	}
 
-	sg->__cpu_power = 0;
+	sg->cpu_power = 0;
 	cpumask_copy(sched_group_cpus(sg), d->nodemask);
 	sg->next = sg;
 	cpumask_or(d->covered, d->covered, d->nodemask);
@@ -8563,7 +8531,7 @@ static int build_numa_sched_groups(struc
 			       "Can not alloc domain group for node %d\n", j);
 			return -ENOMEM;
 		}
-		sg->__cpu_power = 0;
+		sg->cpu_power = 0;
 		cpumask_copy(sched_group_cpus(sg), d->tmpmask);
 		sg->next = prev->next;
 		cpumask_or(d->covered, d->covered, d->tmpmask);
@@ -8641,7 +8609,7 @@ static void init_sched_groups_power(int 
 
 	child = sd->child;
 
-	sd->groups->__cpu_power = 0;
+	sd->groups->cpu_power = 0;
 
 	if (!child) {
 		power = SCHED_LOAD_SCALE;
@@ -8657,7 +8625,7 @@ static void init_sched_groups_power(int 
 			power /= weight;
 			power >>= SCHED_LOAD_SHIFT;
 		}
-		sg_inc_cpu_power(sd->groups, power);
+		sd->groups->cpu_power += power;
 		return;
 	}
 
@@ -8666,7 +8634,7 @@ static void init_sched_groups_power(int 
 	 */
 	group = child->groups;
 	do {
-		sg_inc_cpu_power(sd->groups, group->__cpu_power);
+		sd->groups->cpu_power += group->cpu_power;
 		group = group->next;
 	} while (group != child->groups);
 }
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -870,15 +870,9 @@ struct sched_group {
 
 	/*
 	 * CPU power of this group, SCHED_LOAD_SCALE being max power for a
-	 * single CPU. This is read only (except for setup, hotplug CPU).
-	 * Note : Never change cpu_power without recompute its reciprocal
+	 * single CPU.
 	 */
-	unsigned int __cpu_power;
-	/*
-	 * reciprocal value of cpu_power to avoid expensive divides
-	 * (see include/linux/reciprocal_div.h)
-	 */
-	u32 reciprocal_cpu_power;
+	unsigned int cpu_power;
 
 	/*
 	 * The CPUs this group covers.

-- 


* [RFC][PATCH 09/14] x86: move APERF/MPERF into a X86_FEATURE
  2009-09-03 13:21 [RFC][PATCH 00/14] load-balancing and cpu_power -v3 Peter Zijlstra
                   ` (7 preceding siblings ...)
  2009-09-03 13:21 ` [RFC][PATCH 08/14] sched: remove reciprocal for cpu_power Peter Zijlstra
@ 2009-09-03 13:21 ` Peter Zijlstra
  2009-09-03 13:21 ` [RFC][PATCH 10/14] x86: generic aperf/mperf code Peter Zijlstra
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2009-09-03 13:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Gautham R Shenoy, Andreas Herrmann, Balbir Singh,
	Peter Zijlstra, H. Peter Anvin, Venkatesh Pallipadi, Yanmin,
	Dave Jones, Len Brown, Yinghai Lu, cpufreq

[-- Attachment #1: sched-lb-8.patch --]
[-- Type: text/plain, Size: 2914 bytes --]

Move the APERFMPERF capability into an X86_FEATURE flag so that it can
be used outside of the acpi cpufreq driver.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Yanmin <yanmin_zhang@linux.intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: cpufreq@vger.kernel.org
---
 arch/x86/include/asm/cpufeature.h          |    1 +
 arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c |    9 ++-------
 arch/x86/kernel/cpu/intel.c                |    6 ++++++
 3 files changed, 9 insertions(+), 7 deletions(-)

Index: linux-2.6/arch/x86/include/asm/cpufeature.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/cpufeature.h
+++ linux-2.6/arch/x86/include/asm/cpufeature.h
@@ -95,6 +95,7 @@
 #define X86_FEATURE_NONSTOP_TSC	(3*32+24) /* TSC does not stop in C states */
 #define X86_FEATURE_CLFLUSH_MONITOR (3*32+25) /* "" clflush reqd with monitor */
 #define X86_FEATURE_EXTD_APICID	(3*32+26) /* has extended APICID (8 bits) */
+#define X86_FEATURE_APERFMPERF	(3*32+27) /* APERFMPERF */
 
 /* Intel-defined CPU features, CPUID level 0x00000001 (ecx), word 4 */
 #define X86_FEATURE_XMM3	(4*32+ 0) /* "pni" SSE-3 */
Index: linux-2.6/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
+++ linux-2.6/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
@@ -60,7 +60,6 @@ enum {
 };
 
 #define INTEL_MSR_RANGE		(0xffff)
-#define CPUID_6_ECX_APERFMPERF_CAPABILITY	(0x1)
 
 struct acpi_cpufreq_data {
 	struct acpi_processor_performance *acpi_data;
@@ -731,12 +730,8 @@ static int acpi_cpufreq_cpu_init(struct 
 	acpi_processor_notify_smm(THIS_MODULE);
 
 	/* Check for APERF/MPERF support in hardware */
-	if (c->x86_vendor == X86_VENDOR_INTEL && c->cpuid_level >= 6) {
-		unsigned int ecx;
-		ecx = cpuid_ecx(6);
-		if (ecx & CPUID_6_ECX_APERFMPERF_CAPABILITY)
-			acpi_cpufreq_driver.getavg = get_measured_perf;
-	}
+	if (cpu_has(c, X86_FEATURE_APERFMPERF))
+		acpi_cpufreq_driver.getavg = get_measured_perf;
 
 	dprintk("CPU%u - ACPI performance management activated.\n", cpu);
 	for (i = 0; i < perf->state_count; i++)
Index: linux-2.6/arch/x86/kernel/cpu/intel.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/intel.c
+++ linux-2.6/arch/x86/kernel/cpu/intel.c
@@ -350,6 +350,12 @@ static void __cpuinit init_intel(struct 
 			set_cpu_cap(c, X86_FEATURE_ARCH_PERFMON);
 	}
 
+	if (c->cpuid_level > 6) {
+		unsigned ecx = cpuid_ecx(6);
+		if (ecx & 0x01)
+			set_cpu_cap(c, X86_FEATURE_APERFMPERF);
+	}
+
 	if (cpu_has_xmm2)
 		set_cpu_cap(c, X86_FEATURE_LFENCE_RDTSC);
 	if (cpu_has_ds) {

-- 


* [RFC][PATCH 10/14] x86: generic aperf/mperf code.
  2009-09-03 13:21 [RFC][PATCH 00/14] load-balancing and cpu_power -v3 Peter Zijlstra
                   ` (8 preceding siblings ...)
  2009-09-03 13:21 ` [RFC][PATCH 09/14] x86: move APERF/MPERF into a X86_FEATURE Peter Zijlstra
@ 2009-09-03 13:21 ` Peter Zijlstra
  2009-09-04  9:19   ` Thomas Renninger
  2009-09-03 13:21 ` [RFC][PATCH 11/14] sched: provide arch_scale_freq_power Peter Zijlstra
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2009-09-03 13:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Gautham R Shenoy, Andreas Herrmann, Balbir Singh,
	Peter Zijlstra, H. Peter Anvin, Venkatesh Pallipadi, Yanmin,
	Dave Jones, Len Brown, Yinghai Lu, cpufreq

[-- Attachment #1: sched-lb-9.patch --]
[-- Type: text/plain, Size: 3966 bytes --]

Move some of the aperf/mperf code out from the cpufreq driver thingy
so that other people can enjoy it too.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Yanmin <yanmin_zhang@linux.intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: cpufreq@vger.kernel.org
---
 arch/x86/include/asm/processor.h           |   12 ++++++++
 arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c |   41 +++++++++--------------------
 2 files changed, 26 insertions(+), 27 deletions(-)

Index: linux-2.6/arch/x86/include/asm/processor.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/processor.h
+++ linux-2.6/arch/x86/include/asm/processor.h
@@ -1000,4 +1000,16 @@ extern void start_thread(struct pt_regs 
 extern int get_tsc_mode(unsigned long adr);
 extern int set_tsc_mode(unsigned int val);
 
+struct aperfmperf {
+	u64 aperf, mperf;
+};
+
+static inline void get_aperfmperf(struct aperfmperf *am)
+{
+	WARN_ON_ONCE(!boot_cpu_has(X86_FEATURE_APERFMPERF));
+
+	rdmsrl(MSR_IA32_APERF, am->aperf);
+	rdmsrl(MSR_IA32_MPERF, am->mperf);
+}
+
 #endif /* _ASM_X86_PROCESSOR_H */
Index: linux-2.6/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
+++ linux-2.6/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
@@ -243,23 +243,12 @@ static u32 get_cur_val(const struct cpum
 	return cmd.val;
 }
 
-struct perf_pair {
-	union {
-		struct {
-			u32 lo;
-			u32 hi;
-		} split;
-		u64 whole;
-	} aperf, mperf;
-};
-
 /* Called via smp_call_function_single(), on the target CPU */
 static void read_measured_perf_ctrs(void *_cur)
 {
-	struct perf_pair *cur = _cur;
+	struct aperfmperf *am = _cur;
 
-	rdmsr(MSR_IA32_APERF, cur->aperf.split.lo, cur->aperf.split.hi);
-	rdmsr(MSR_IA32_MPERF, cur->mperf.split.lo, cur->mperf.split.hi);
+	get_aperfmperf(am);
 }
 
 /*
@@ -278,19 +267,17 @@ static void read_measured_perf_ctrs(void
 static unsigned int get_measured_perf(struct cpufreq_policy *policy,
 				      unsigned int cpu)
 {
-	struct perf_pair readin, cur;
+	struct aperfmperf readin, cur;
 	unsigned int perf_percent;
 	unsigned int retval;
 
 	if (smp_call_function_single(cpu, read_measured_perf_ctrs, &readin, 1))
 		return 0;
 
-	cur.aperf.whole = readin.aperf.whole -
-				per_cpu(msr_data, cpu).saved_aperf;
-	cur.mperf.whole = readin.mperf.whole -
-				per_cpu(msr_data, cpu).saved_mperf;
-	per_cpu(msr_data, cpu).saved_aperf = readin.aperf.whole;
-	per_cpu(msr_data, cpu).saved_mperf = readin.mperf.whole;
+	cur.aperf = readin.aperf - per_cpu(msr_data, cpu).saved_aperf;
+	cur.mperf = readin.mperf - per_cpu(msr_data, cpu).saved_mperf;
+	per_cpu(msr_data, cpu).saved_aperf = readin.aperf;
+	per_cpu(msr_data, cpu).saved_mperf = readin.mperf;
 
 #ifdef __i386__
 	/*
@@ -305,8 +292,8 @@ static unsigned int get_measured_perf(st
 		h = max_t(u32, cur.aperf.split.hi, cur.mperf.split.hi);
 		shift_count = fls(h);
 
-		cur.aperf.whole >>= shift_count;
-		cur.mperf.whole >>= shift_count;
+		cur.aperf >>= shift_count;
+		cur.mperf >>= shift_count;
 	}
 
 	if (((unsigned long)(-1) / 100) < cur.aperf.split.lo) {
@@ -321,14 +308,14 @@ static unsigned int get_measured_perf(st
 		perf_percent = 0;
 
 #else
-	if (unlikely(((unsigned long)(-1) / 100) < cur.aperf.whole)) {
+	if (unlikely(((unsigned long)(-1) / 100) < cur.aperf)) {
 		int shift_count = 7;
-		cur.aperf.whole >>= shift_count;
-		cur.mperf.whole >>= shift_count;
+		cur.aperf >>= shift_count;
+		cur.mperf >>= shift_count;
 	}
 
-	if (cur.aperf.whole && cur.mperf.whole)
-		perf_percent = (cur.aperf.whole * 100) / cur.mperf.whole;
+	if (cur.aperf && cur.mperf)
+		perf_percent = (cur.aperf * 100) / cur.mperf;
 	else
 		perf_percent = 0;
 

-- 


* [RFC][PATCH 11/14] sched: provide arch_scale_freq_power
  2009-09-03 13:21 [RFC][PATCH 00/14] load-balancing and cpu_power -v3 Peter Zijlstra
                   ` (9 preceding siblings ...)
  2009-09-03 13:21 ` [RFC][PATCH 10/14] x86: generic aperf/mperf code Peter Zijlstra
@ 2009-09-03 13:21 ` Peter Zijlstra
  2009-09-03 13:21 ` [RFC][PATCH 12/14] x86: sched: provide arch implementations using aperf/mperf Peter Zijlstra
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2009-09-03 13:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Gautham R Shenoy, Andreas Herrmann, Balbir Singh,
	Peter Zijlstra

[-- Attachment #1: sched-lb-10.patch --]
[-- Type: text/plain, Size: 1665 bytes --]

Provide an arch-specific hook for cpufreq-based scaling of cpu_power.
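
For readers unfamiliar with the pattern, a minimal userspace sketch of
the __weak mechanism this relies on: the scheduler supplies the weak
default, and an architecture overrides it simply by defining a strong
symbol of the same name (as the x86 patch later in this series does).
The signatures below are simplified and purely illustrative.

#include <stdio.h>

unsigned long default_scale_freq_power(void)
{
	return 1024;		/* SCHED_LOAD_SCALE, i.e. no scaling */
}

/* generic code: weak default, may be replaced at link time */
unsigned long __attribute__((weak)) arch_scale_freq_power(void)
{
	return default_scale_freq_power();
}

/*
 * An arch would override it by providing a strong definition, e.g.:
 *
 *	unsigned long arch_scale_freq_power(void)
 *	{
 *		return measured_scale();
 *	}
 */

int main(void)
{
	printf("scale: %lu\n", arch_scale_freq_power());
	return 0;
}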

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |   21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -3712,7 +3712,18 @@ static inline int check_power_save_busie
 }
 #endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
 
-unsigned long __weak arch_scale_smt_power(struct sched_domain *sd, int cpu)
+
+unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
+{
+	return SCHED_LOAD_SCALE;
+}
+
+unsigned long __weak arch_scale_freq_power(struct sched_domain *sd, int cpu)
+{
+	return default_scale_freq_power(sd, cpu);
+}
+
+unsigned long default_scale_smt_power(struct sched_domain *sd, int cpu)
 {
 	unsigned long weight = cpumask_weight(sched_domain_span(sd));
 	unsigned long smt_gain = sd->smt_gain;
@@ -3722,6 +3733,11 @@ unsigned long __weak arch_scale_smt_powe
 	return smt_gain;
 }
 
+unsigned long __weak arch_scale_smt_power(struct sched_domain *sd, int cpu)
+{
+	return default_scale_smt_power(sd, cpu);
+}
+
 unsigned long scale_rt_power(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -3746,7 +3762,8 @@ static void update_cpu_power(struct sche
 	unsigned long power = SCHED_LOAD_SCALE;
 	struct sched_group *sdg = sd->groups;
 
-	/* here we could scale based on cpufreq */
+	power *= arch_scale_freq_power(sd, cpu);
+	power >>= SCHED_LOAD_SHIFT;
 
 	if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
 		power *= arch_scale_smt_power(sd, cpu);

-- 


* [RFC][PATCH 12/14] x86: sched: provide arch implementations using aperf/mperf
  2009-09-03 13:21 [RFC][PATCH 00/14] load-balancing and cpu_power -v3 Peter Zijlstra
                   ` (10 preceding siblings ...)
  2009-09-03 13:21 ` [RFC][PATCH 11/14] sched: provide arch_scale_freq_power Peter Zijlstra
@ 2009-09-03 13:21 ` Peter Zijlstra
  2009-09-03 13:21 ` [RFC][PATCH 13/14] sched: cleanup wake_idle power saving Peter Zijlstra
  2009-09-03 13:21 ` [RFC][PATCH 14/14] sched: cleanup wake_idle Peter Zijlstra
  13 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2009-09-03 13:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Gautham R Shenoy, Andreas Herrmann, Balbir Singh,
	Peter Zijlstra

[-- Attachment #1: sched-lb-11.patch --]
[-- Type: text/plain, Size: 3331 bytes --]

APERF/MPERF support for cpu_power.

APERF/MPERF is architecturally defined as a relative scale of work
capacity per logical cpu; this is assumed to include SMT and Turbo mode.

APERF/MPERF are specified to both reset to 0 when either counter
wraps, which is highly inconvenient, since that'll give a blip when
it happens. The manual specifies writing 0 to the counters after
each read, but that is 1) too expensive, and 2) destroys the
possibility of sharing these counters with other users, so we live
with the blip - the other existing user does too.
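
A standalone sketch of the ratio computation with assumed counter
deltas: if APERF advanced ~20% faster than MPERF over the window (e.g.
turbo), cpu_power scales up by about that much; shifting mperf down by
SCHED_LOAD_SHIFT is what turns the quotient into a
SCHED_LOAD_SCALE-relative value.

#include <stdio.h>

#define SCHED_LOAD_SHIFT	10
#define SCHED_LOAD_SCALE	(1ULL << SCHED_LOAD_SHIFT)

int main(void)
{
	/* assumed deltas since the previous sample */
	unsigned long long aperf = 1200000000ULL;	/* actual cycles */
	unsigned long long mperf = 1000000000ULL;	/* reference cycles */
	unsigned long long ratio = SCHED_LOAD_SCALE;

	mperf >>= SCHED_LOAD_SHIFT;
	if (mperf)
		ratio = aperf / mperf;

	printf("freq scale: %llu/1024\n", ratio);	/* ~1228 */
	return 0;
}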

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/kernel/cpu/Makefile |    2 -
 arch/x86/kernel/cpu/sched.c  |   58 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/sched.h        |    4 ++
 3 files changed, 63 insertions(+), 1 deletion(-)

Index: linux-2.6/arch/x86/kernel/cpu/Makefile
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/Makefile
+++ linux-2.6/arch/x86/kernel/cpu/Makefile
@@ -13,7 +13,7 @@ CFLAGS_common.o		:= $(nostackp)
 
 obj-y			:= intel_cacheinfo.o addon_cpuid_features.o
 obj-y			+= proc.o capflags.o powerflags.o common.o
-obj-y			+= vmware.o hypervisor.o
+obj-y			+= vmware.o hypervisor.o sched.o
 
 obj-$(CONFIG_X86_32)	+= bugs.o cmpxchg.o
 obj-$(CONFIG_X86_64)	+= bugs_64.o
Index: linux-2.6/arch/x86/kernel/cpu/sched.c
===================================================================
--- /dev/null
+++ linux-2.6/arch/x86/kernel/cpu/sched.c
@@ -0,0 +1,58 @@
+#include <linux/sched.h>
+#include <linux/math64.h>
+#include <linux/percpu.h>
+#include <linux/irqflags.h>
+
+#include <asm/cpufeature.h>
+#include <asm/processor.h>
+
+static DEFINE_PER_CPU(struct aperfmperf, old_aperfmperf);
+
+static unsigned long scale_aperfmperf(void)
+{
+	struct aperfmperf cur, val, *old = &__get_cpu_var(old_aperfmperf);
+	unsigned long ratio = SCHED_LOAD_SCALE;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	get_aperfmperf(&val);
+	local_irq_restore(flags);
+
+	cur = val;
+	cur.aperf -= old->aperf;
+	cur.mperf -= old->mperf;
+	*old = val;
+
+	cur.mperf >>= SCHED_LOAD_SHIFT;
+	if (cur.mperf)
+		ratio = div_u64(cur.aperf, cur.mperf);
+
+	return ratio;
+}
+
+unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
+{
+	/*
+	 * do aperf/mperf on the cpu level because it includes things
+	 * like turbo mode, which are relevant to full cores.
+	 */
+	if (boot_cpu_has(X86_FEATURE_APERFMPERF))
+		return scale_aperfmperf();
+
+	/*
+	 * maybe have something cpufreq here
+	 */
+
+	return default_scale_freq_power(sd, cpu);
+}
+
+unsigned long arch_scale_smt_power(struct sched_domain *sd, int cpu)
+{
+	/*
+	 * aperf/mperf already includes the smt gain
+	 */
+	if (boot_cpu_has(X86_FEATURE_APERFMPERF))
+		return SCHED_LOAD_SCALE;
+
+	return default_scale_smt_power(sd, cpu);
+}
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1012,6 +1012,10 @@ partition_sched_domains(int ndoms_new, s
 }
 #endif	/* !CONFIG_SMP */
 
+
+unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu);
+unsigned long default_scale_smt_power(struct sched_domain *sd, int cpu);
+
 struct io_context;			/* See blkdev.h */
 
 

-- 


* [RFC][PATCH 13/14] sched: cleanup wake_idle power saving
  2009-09-03 13:21 [RFC][PATCH 00/14] load-balancing and cpu_power -v3 Peter Zijlstra
                   ` (11 preceding siblings ...)
  2009-09-03 13:21 ` [RFC][PATCH 12/14] x86: sched: provide arch implementations using aperf/mperf Peter Zijlstra
@ 2009-09-03 13:21 ` Peter Zijlstra
  2009-09-03 13:21 ` [RFC][PATCH 14/14] sched: cleanup wake_idle Peter Zijlstra
  13 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2009-09-03 13:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Gautham R Shenoy, Andreas Herrmann, Balbir Singh,
	Peter Zijlstra

[-- Attachment #1: sched-lb-12.patch --]
[-- Type: text/plain, Size: 2825 bytes --]

Hopefully a more readable version of the same.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched_fair.c |   58 ++++++++++++++++++++++++++++++++++------------------
 1 file changed, 39 insertions(+), 19 deletions(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1062,29 +1062,49 @@ static void yield_task_fair(struct rq *r
 
 #define cpu_rd_active(cpu, rq) cpumask_test_cpu(cpu, rq->rd->online)
 
+/*
+ * At POWERSAVINGS_BALANCE_WAKEUP level, if both this_cpu and prev_cpu
+ * are idle and this is not a kernel thread and this task's affinity
+ * allows it to be moved to preferred cpu, then just move!
+ *
+ * XXX - can generate significant overload on preferred_wakeup_cpu
+ *       with plenty of idle cpus, leading to a significant loss in
+ *       throughput.
+ *
+ * Returns: <  0 - no placement decision made
+ *          >= 0 - place on cpu
+ */
+static int wake_idle_power_save(int cpu, struct task_struct *p)
+{
+	int this_cpu = smp_processor_id();
+	int wakeup_cpu;
+
+	if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
+		return -1;
+
+	if (!idle_cpu(cpu) || !idle_cpu(this_cpu))
+		return -1;
+
+	if (!p->mm || (p->flags & PF_KTHREAD))
+		return -1;
+
+	wakeup_cpu = cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;
+
+	if (!cpu_isset(wakeup_cpu, p->cpus_allowed))
+		return -1;
+
+	return wakeup_cpu;
+}
+
 static int wake_idle(int cpu, struct task_struct *p)
 {
+	struct rq *task_rq = task_rq(p);
 	struct sched_domain *sd;
 	int i;
-	unsigned int chosen_wakeup_cpu;
-	int this_cpu;
-	struct rq *task_rq = task_rq(p);
-
-	/*
-	 * At POWERSAVINGS_BALANCE_WAKEUP level, if both this_cpu and prev_cpu
-	 * are idle and this is not a kernel thread and this task's affinity
-	 * allows it to be moved to preferred cpu, then just move!
-	 */
 
-	this_cpu = smp_processor_id();
-	chosen_wakeup_cpu =
-		cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;
-
-	if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP &&
-		idle_cpu(cpu) && idle_cpu(this_cpu) &&
-		p->mm && !(p->flags & PF_KTHREAD) &&
-		cpu_isset(chosen_wakeup_cpu, p->cpus_allowed))
-		return chosen_wakeup_cpu;
+	i = wake_idle_power_save(cpu, p);
+	if (i >= 0)
+		return i;
 
 	/*
 	 * If it is idle, then it is the best cpu to run this task.
@@ -1093,7 +1113,7 @@ static int wake_idle(int cpu, struct tas
 	 * Siblings must be also busy(in most cases) as they didn't already
 	 * pickup the extra load from this cpu and hence we need not check
 	 * sibling runqueue info. This will avoid the checks and cache miss
-	 * penalities associated with that.
+	 * penalties associated with that.
 	 */
 	if (idle_cpu(cpu) || cpu_rq(cpu)->cfs.nr_running > 1)
 		return cpu;

-- 


* [RFC][PATCH 14/14] sched: cleanup wake_idle
  2009-09-03 13:21 [RFC][PATCH 00/14] load-balancing and cpu_power -v3 Peter Zijlstra
                   ` (12 preceding siblings ...)
  2009-09-03 13:21 ` [RFC][PATCH 13/14] sched: cleanup wake_idle power saving Peter Zijlstra
@ 2009-09-03 13:21 ` Peter Zijlstra
  13 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2009-09-03 13:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Gautham R Shenoy, Andreas Herrmann, Balbir Singh,
	Peter Zijlstra

[-- Attachment #1: sched-lb-13.patch --]
[-- Type: text/plain, Size: 2694 bytes --]

A more readable version, with a few differences:

 - don't check against the root domain, but instead check
   SD_LOAD_BALANCE

 - don't re-iterate the cpus already iterated on the previous SD

 - use rcu_read_lock() around the sd iteration

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched_fair.c |   45 +++++++++++++++++++++++++--------------------
 1 file changed, 25 insertions(+), 20 deletions(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1053,15 +1053,10 @@ static void yield_task_fair(struct rq *r
  * not idle and an idle cpu is available.  The span of cpus to
  * search starts with cpus closest then further out as needed,
  * so we always favor a closer, idle cpu.
- * Domains may include CPUs that are not usable for migration,
- * hence we need to mask them out (rq->rd->online)
  *
  * Returns the CPU we should wake onto.
  */
 #if defined(ARCH_HAS_SCHED_WAKE_IDLE)
-
-#define cpu_rd_active(cpu, rq) cpumask_test_cpu(cpu, rq->rd->online)
-
 /*
  * At POWERSAVINGS_BALANCE_WAKEUP level, if both this_cpu and prev_cpu
  * are idle and this is not a kernel thread and this task's affinity
@@ -1099,7 +1094,7 @@ static int wake_idle_power_save(int cpu,
 static int wake_idle(int cpu, struct task_struct *p)
 {
 	struct rq *task_rq = task_rq(p);
-	struct sched_domain *sd;
+	struct sched_domain *sd, *child = NULL;
 	int i;
 
 	i = wake_idle_power_save(cpu, p);
@@ -1118,24 +1113,34 @@ static int wake_idle(int cpu, struct tas
 	if (idle_cpu(cpu) || cpu_rq(cpu)->cfs.nr_running > 1)
 		return cpu;
 
+	rcu_read_lock();
 	for_each_domain(cpu, sd) {
-		if ((sd->flags & SD_WAKE_IDLE)
-		    || ((sd->flags & SD_WAKE_IDLE_FAR)
-			&& !task_hot(p, task_rq->clock, sd))) {
-			for_each_cpu_and(i, sched_domain_span(sd),
-					 &p->cpus_allowed) {
-				if (cpu_rd_active(i, task_rq) && idle_cpu(i)) {
-					if (i != task_cpu(p)) {
-						schedstat_inc(p,
-						       se.nr_wakeups_idle);
-					}
-					return i;
-				}
-			}
-		} else {
+		if (!(sd->flags & SD_LOAD_BALANCE))
 			break;
+
+		if (!(sd->flags & SD_WAKE_IDLE) &&
+		    (task_hot(p, task_rq->clock, sd) || !(sd->flags & SD_WAKE_IDLE_FAR)))
+			break;
+
+		for_each_cpu_and(i, sched_domain_span(sd), &p->cpus_allowed) {
+			if (child && cpumask_test_cpu(i, sched_domain_span(child)))
+				continue;
+
+			if (!idle_cpu(i))
+				continue;
+
+			if (task_cpu(p) != i)
+				schedstat_inc(p, se.nr_wakeups_idle);
+
+			cpu = i;
+			goto unlock;
 		}
+
+		child = sd;
 	}
+unlock:
+	rcu_read_unlock();
+
 	return cpu;
 }
 #else /* !ARCH_HAS_SCHED_WAKE_IDLE*/

-- 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 08/14] sched: remove reciprocal for cpu_power
  2009-09-03 13:21 ` [RFC][PATCH 08/14] sched: remove reciprocal for cpu_power Peter Zijlstra
@ 2009-09-03 13:59   ` Peter Zijlstra
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2009-09-03 13:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Gautham R Shenoy, Andreas Herrmann, Balbir Singh

On Thu, 2009-09-03 at 15:21 +0200, Peter Zijlstra wrote:
> [andreas.herrmann3@amd.com: remove include]

/me reminds himself to quilt refresh before sending email, not after.
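
(The workflow in question, as a minimal sketch; the "mail" step stands in
for however the series is normally generated and sent:

	$ quilt refresh     # fold outstanding edits into the top patch
	$ quilt mail ...    # only then generate and send the series

The refresh has to come first, or the mailed patch is stale.)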


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 10/14] x86: generic aperf/mperf code.
  2009-09-03 13:21 ` [RFC][PATCH 10/14] x86: generic aperf/mperf code Peter Zijlstra
@ 2009-09-04  9:19   ` Thomas Renninger
  2009-09-04  9:25     ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Renninger @ 2009-09-04  9:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Gautham R Shenoy, Andreas Herrmann,
	Balbir Singh, H. Peter Anvin, Venkatesh Pallipadi, Yanmin,
	Dave Jones, Len Brown, Yinghai Lu, cpufreq

On Thursday 03 September 2009 15:21:55 Peter Zijlstra wrote:
> Move some of the aperf/mperf code out from the cpufreq driver thingy
> so that other people can enjoy it too.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
> Cc: Yanmin <yanmin_zhang@linux.intel.com>
> Cc: Dave Jones <davej@redhat.com>
> Cc: Len Brown <len.brown@intel.com>
> Cc: Yinghai Lu <yhlu.kernel@gmail.com>
> Cc: cpufreq@vger.kernel.org
> ---
>  arch/x86/include/asm/processor.h           |   12 ++++++++
>  arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c |   41 +++++++++--------------------
>  2 files changed, 26 insertions(+), 27 deletions(-)
> 
> Index: linux-2.6/arch/x86/include/asm/processor.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/processor.h
> +++ linux-2.6/arch/x86/include/asm/processor.h
> @@ -1000,4 +1000,16 @@ extern void start_thread(struct pt_regs 
>  extern int get_tsc_mode(unsigned long adr);
>  extern int set_tsc_mode(unsigned int val);
>  
> +struct aperfmperf {
> +	u64 aperf, mperf;
> +};
> +
> +static inline void get_aperfmperf(struct aperfmperf *am)
> +{
> +	WARN_ON_ONCE(!boot_cpu_has(X86_FEATURE_APERFMPERF));
> +
> +	rdmsrl(MSR_IA32_APERF, am->aperf);
> +	rdmsrl(MSR_IA32_MPERF, am->mperf);
> +}
> +
>  #endif /* _ASM_X86_PROCESSOR_H */
> Index: linux-2.6/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
> +++ linux-2.6/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
> @@ -243,23 +243,12 @@ static u32 get_cur_val(const struct cpum
>  	return cmd.val;
>  }
>  
> -struct perf_pair {
> -	union {
> -		struct {
> -			u32 lo;
> -			u32 hi;
> -		} split;
> -		u64 whole;
> -	} aperf, mperf;
> -};
> -
>  /* Called via smp_call_function_single(), on the target CPU */
>  static void read_measured_perf_ctrs(void *_cur)
>  {
> -	struct perf_pair *cur = _cur;
> +	struct aperfmperf *am = _cur;
>  
> -	rdmsr(MSR_IA32_APERF, cur->aperf.split.lo, cur->aperf.split.hi);
> -	rdmsr(MSR_IA32_MPERF, cur->mperf.split.lo, cur->mperf.split.hi);
> +	get_aperfmperf(am);
>  }
>  
>  /*
> @@ -278,19 +267,17 @@ static void read_measured_perf_ctrs(void
>  static unsigned int get_measured_perf(struct cpufreq_policy *policy,
>  				      unsigned int cpu)
>  {
> -	struct perf_pair readin, cur;
> +	struct aperfmperf readin, cur;
>  	unsigned int perf_percent;
>  	unsigned int retval;
>  
>  	if (smp_call_function_single(cpu, read_measured_perf_ctrs, &readin, 1))
>  		return 0;
>  
> -	cur.aperf.whole = readin.aperf.whole -
> -				per_cpu(msr_data, cpu).saved_aperf;
> -	cur.mperf.whole = readin.mperf.whole -
> -				per_cpu(msr_data, cpu).saved_mperf;
> -	per_cpu(msr_data, cpu).saved_aperf = readin.aperf.whole;
> -	per_cpu(msr_data, cpu).saved_mperf = readin.mperf.whole;
> +	cur.aperf = readin.aperf - per_cpu(msr_data, cpu).saved_aperf;
> +	cur.mperf = readin.mperf - per_cpu(msr_data, cpu).saved_mperf;
> +	per_cpu(msr_data, cpu).saved_aperf = readin.aperf;
> +	per_cpu(msr_data, cpu).saved_mperf = readin.mperf;
>  
>  #ifdef __i386__
>  	/*
> @@ -305,8 +292,8 @@ static unsigned int get_measured_perf(st
>  		h = max_t(u32, cur.aperf.split.hi, cur.mperf.split.hi);
You still use struct perf_pair split/hi/lo members in #ifdef __i386__ 
case which you deleted above.
>  		shift_count = fls(h);
>  
> -		cur.aperf.whole >>= shift_count;
> -		cur.mperf.whole >>= shift_count;
> +		cur.aperf >>= shift_count;
> +		cur.mperf >>= shift_count;
>  	}
>  
>  	if (((unsigned long)(-1) / 100) < cur.aperf.split.lo) {
Same here, possibly still elsewhere.
Is this only x86_64 compile tested?

     Thomas

> @@ -321,14 +308,14 @@ static unsigned int get_measured_perf(st
>  		perf_percent = 0;
>  
>  #else
> -	if (unlikely(((unsigned long)(-1) / 100) < cur.aperf.whole)) {
> +	if (unlikely(((unsigned long)(-1) / 100) < cur.aperf)) {
>  		int shift_count = 7;
> -		cur.aperf.whole >>= shift_count;
> -		cur.mperf.whole >>= shift_count;
> +		cur.aperf >>= shift_count;
> +		cur.mperf >>= shift_count;
>  	}
>  
> -	if (cur.aperf.whole && cur.mperf.whole)
> -		perf_percent = (cur.aperf.whole * 100) / cur.mperf.whole;
> +	if (cur.aperf && cur.mperf)
> +		perf_percent = (cur.aperf * 100) / cur.mperf;
>  	else
>  		perf_percent = 0;
>  
> 
> -- 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe cpufreq" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 10/14] x86: generic aperf/mperf code.
  2009-09-04  9:19   ` Thomas Renninger
@ 2009-09-04  9:25     ` Peter Zijlstra
  2009-09-04  9:27       ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2009-09-04  9:25 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Ingo Molnar, linux-kernel, Gautham R Shenoy, Andreas Herrmann,
	Balbir Singh, H. Peter Anvin, Venkatesh Pallipadi, Yanmin,
	Dave Jones, Len Brown, Yinghai Lu, cpufreq

On Fri, 2009-09-04 at 11:19 +0200, Thomas Renninger wrote:
> You still use struct perf_pair split/hi/lo members in #ifdef __i386__ 
> case which you deleted above.

> >               shift_count = fls(h);
> >  
> > -             cur.aperf.whole >>= shift_count;
> > -             cur.mperf.whole >>= shift_count;
> > +             cur.aperf >>= shift_count;
> > +             cur.mperf >>= shift_count;
> >       }
> >  
> >       if (((unsigned long)(-1) / 100) < cur.aperf.split.lo) {
> Same here, possibly still elsewhere.
> Is this only x86_64 compile tested?

Of course, who still has 32bit only hardware anyway ;-)

Will fix, thanks for spotting that.
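
One possible shape for that fix, sketched for illustration only (the
helper name is made up and this is not the actual follow-up patch); it
keeps the old shift-before-divide trick but reads the new u64 fields:

	/*
	 * 32-bit friendly ratio of two u64 counter deltas: shift both
	 * down until they fit in 32 bits, then do a cheap u32 divide.
	 */
	static unsigned int aperfmperf_percent32(u64 aperf, u64 mperf)
	{
		if (aperf >> 32 || mperf >> 32) {
			u32 h = max((u32)(aperf >> 32), (u32)(mperf >> 32));
			int shift = fls(h);

			aperf >>= shift;
			mperf >>= shift;
		}

		/* keep aperf * 100 from overflowing 32 bits */
		if ((u32)aperf > (u32)-1 / 100) {
			aperf >>= 7;
			mperf >>= 7;
		}

		if (!(u32)aperf || !(u32)mperf)
			return 0;

		return ((u32)aperf * 100) / (u32)mperf;
	}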


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 10/14] x86: generic aperf/mperf code.
  2009-09-04  9:25     ` Peter Zijlstra
@ 2009-09-04  9:27       ` Peter Zijlstra
  2009-09-04  9:34         ` Thomas Renninger
  2009-09-04 14:22         ` Dave Jones
  0 siblings, 2 replies; 23+ messages in thread
From: Peter Zijlstra @ 2009-09-04  9:27 UTC (permalink / raw)
  To: Thomas Renninger
  Cc: Ingo Molnar, linux-kernel, Gautham R Shenoy, Andreas Herrmann,
	Balbir Singh, H. Peter Anvin, Venkatesh Pallipadi, Yanmin,
	Dave Jones, Len Brown, Yinghai Lu, cpufreq

On Fri, 2009-09-04 at 11:25 +0200, Peter Zijlstra wrote:
> On Fri, 2009-09-04 at 11:19 +0200, Thomas Renninger wrote:
> > You still use struct perf_pair split/hi/lo members in #ifdef __i386__ 
> > case which you deleted above.
> 
> > >               shift_count = fls(h);
> > >  
> > > -             cur.aperf.whole >>= shift_count;
> > > -             cur.mperf.whole >>= shift_count;
> > > +             cur.aperf >>= shift_count;
> > > +             cur.mperf >>= shift_count;
> > >       }
> > >  
> > >       if (((unsigned long)(-1) / 100) < cur.aperf.split.lo) {
> > Same here, possibly still elsewhere.
> > Is this only x86_64 compile tested?
> 
> Of course, who still has 32bit only hardware anyway ;-)
> 
> Will fix, thanks for spotting that.

Hrmm, on that, does it really make sense to maintain the i386 code path?

How frequently is that code called and what i386 only chips support
aperf/mperf, atom?
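
For comparison, dropping the special case entirely would leave a single
path along these lines. A sketch only, assuming div64_u64() from
<linux/math64.h> is acceptable here (it falls back to a software divide
on 32-bit kernels):

	static unsigned int aperfmperf_percent(u64 aperf, u64 mperf)
	{
		/* keep aperf * 100 from wrapping 64 bits */
		if (unlikely(aperf > ~0ULL / 100)) {
			aperf >>= 7;
			mperf >>= 7;
		}

		if (!aperf || !mperf)
			return 0;

		return (unsigned int)div64_u64(aperf * 100, mperf);
	}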

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 10/14] x86: generic aperf/mperf code.
  2009-09-04  9:27       ` Peter Zijlstra
@ 2009-09-04  9:34         ` Thomas Renninger
  2009-09-04 14:22         ` Dave Jones
  1 sibling, 0 replies; 23+ messages in thread
From: Thomas Renninger @ 2009-09-04  9:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Gautham R Shenoy, Andreas Herrmann,
	Balbir Singh, H. Peter Anvin, Venkatesh Pallipadi, Yanmin,
	Dave Jones, Len Brown, Yinghai Lu, cpufreq, Venkatesh Pallipadi

On Friday 04 September 2009 11:27:19 Peter Zijlstra wrote:
> On Fri, 2009-09-04 at 11:25 +0200, Peter Zijlstra wrote:
> > On Fri, 2009-09-04 at 11:19 +0200, Thomas Renninger wrote:
> > > You still use struct perf_pair split/hi/lo members in #ifdef __i386__ 
> > > case which you deleted above.
> > 
> > > >               shift_count = fls(h);
> > > >  
> > > > -             cur.aperf.whole >>= shift_count;
> > > > -             cur.mperf.whole >>= shift_count;
> > > > +             cur.aperf >>= shift_count;
> > > > +             cur.mperf >>= shift_count;
> > > >       }
> > > >  
> > > >       if (((unsigned long)(-1) / 100) < cur.aperf.split.lo) {
> > > Same here, possibly still elsewhere.
> > > Is this only x86_64 compile tested?
> > 
> > Of course, who still has 32bit only hardware anyway ;-)
> > 
> > Will fix, thanks for spotting that.
> 
> Hrmm, on that, does it really make sense to maintain the i386 code
> path?
> 
> How frequently is that code called and what i386 only chips support
> aperf/mperf, atom?
Venki should know best.

   Thomas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 10/14] x86: generic aperf/mperf code.
  2009-09-04  9:27       ` Peter Zijlstra
  2009-09-04  9:34         ` Thomas Renninger
@ 2009-09-04 14:22         ` Dave Jones
  2009-09-04 14:42           ` Peter Zijlstra
  1 sibling, 1 reply; 23+ messages in thread
From: Dave Jones @ 2009-09-04 14:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Renninger, Ingo Molnar, linux-kernel, Gautham R Shenoy,
	Andreas Herrmann, Balbir Singh, H. Peter Anvin,
	Venkatesh Pallipadi, Yanmin, Len Brown, Yinghai Lu, cpufreq

On Fri, Sep 04, 2009 at 11:27:19AM +0200, Peter Zijlstra wrote:
 > On Fri, 2009-09-04 at 11:25 +0200, Peter Zijlstra wrote:
 > > On Fri, 2009-09-04 at 11:19 +0200, Thomas Renninger wrote:
 > > > You still use struct perf_pair split/hi/lo members in #ifdef __i386__ 
 > > > case which you deleted above.
 > > 
 > > > >               shift_count = fls(h);
 > > > >  
 > > > > -             cur.aperf.whole >>= shift_count;
 > > > > -             cur.mperf.whole >>= shift_count;
 > > > > +             cur.aperf >>= shift_count;
 > > > > +             cur.mperf >>= shift_count;
 > > > >       }
 > > > >  
 > > > >       if (((unsigned long)(-1) / 100) < cur.aperf.split.lo) {
 > > > Same here, possibly still elsewhere.
 > > > Is this only x86_64 compile tested?
 > > 
 > > Of course, who still has 32bit only hardware anyway ;-)
 > > 
 > > Will fix, thanks for spotting that.
 > 
 > Hrmm, on that, does it really make sense to maintain the i386 code path?
 > 
 > How frequently is that code called and what i386 only chips support
 > aperf/mperf, atom?

any 64-bit cpu that supports it can have a 32bit kernel installed on it.
(and a significant number of users actually do this).

	Dave

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 10/14] x86: generic aperf/mperf code.
  2009-09-04 14:22         ` Dave Jones
@ 2009-09-04 14:42           ` Peter Zijlstra
  2009-09-04 17:45             ` H. Peter Anvin
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2009-09-04 14:42 UTC (permalink / raw)
  To: Dave Jones
  Cc: Thomas Renninger, Ingo Molnar, linux-kernel, Gautham R Shenoy,
	Andreas Herrmann, Balbir Singh, H. Peter Anvin,
	Venkatesh Pallipadi, Yanmin, Len Brown, Yinghai Lu, cpufreq

On Fri, 2009-09-04 at 10:22 -0400, Dave Jones wrote:
> On Fri, Sep 04, 2009 at 11:27:19AM +0200, Peter Zijlstra wrote:
>  > On Fri, 2009-09-04 at 11:25 +0200, Peter Zijlstra wrote:
>  > > On Fri, 2009-09-04 at 11:19 +0200, Thomas Renninger wrote:
>  > > > You still use struct perf_pair split/hi/lo members in #ifdef __i386__ 
>  > > > case which you deleted above.
>  > > 
>  > > > >               shift_count = fls(h);
>  > > > >  
>  > > > > -             cur.aperf.whole >>= shift_count;
>  > > > > -             cur.mperf.whole >>= shift_count;
>  > > > > +             cur.aperf >>= shift_count;
>  > > > > +             cur.mperf >>= shift_count;
>  > > > >       }
>  > > > >  
>  > > > >       if (((unsigned long)(-1) / 100) < cur.aperf.split.lo) {
>  > > > Same here, possibly still elsewhere.
>  > > > Is this only x86_64 compile tested?
>  > > 
>  > > Of course, who still has 32bit only hardware anyway ;-)
>  > > 
>  > > Will fix, thanks for spotting that.
>  > 
>  > Hrmm, on that, does it really make sense to maintain the i386 code path?
>  > 
>  > How frequently is that code called and what i386 only chips support
>  > aperf/mperf, atom?
> 
> any 64-bit cpu that supports it can have a 32bit kernel installed on it.
> (and a significant number of users actually do this).

1) we really should be pushing those people to run 64bit kernels

[ I'm still hoping distros will start shipping 64bit kernels and have
  the bootloader pick the 64bit one when the hardware supports lm ]

2) those cpus aren't real bad at 64bit divisions :-)


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH 10/14] x86: generic aperf/mperf code.
  2009-09-04 14:42           ` Peter Zijlstra
@ 2009-09-04 17:45             ` H. Peter Anvin
  0 siblings, 0 replies; 23+ messages in thread
From: H. Peter Anvin @ 2009-09-04 17:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Jones, Thomas Renninger, Ingo Molnar, linux-kernel,
	Gautham R Shenoy, Andreas Herrmann, Balbir Singh,
	Venkatesh Pallipadi, Yanmin, Len Brown, Yinghai Lu, cpufreq

On 09/04/2009 07:42 AM, Peter Zijlstra wrote:
>>
>> any 64-bit cpu that supports it can have a 32bit kernel installed on it.
>> (and a significant number of users actually do this).
> 
> 1) we really should be pushing those people to run 64bit kernels
> 
> [ I'm still hoping distros will start shipping 64bit kernels and have
>   the bootloader pick the 64bit one when the hardware supports lm ]
> 

FWIW, there is already support in Syslinux to do that.
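
A minimal syslinux.cfg sketch of that, assuming the ifcpu64.c32 module
does the long-mode check; labels and file names below are made up:

	DEFAULT pick
	LABEL pick
		COM32 ifcpu64.c32
		APPEND boot64 -- boot32

	LABEL boot64
		LINUX vmlinuz-64
		APPEND root=/dev/sda1 ro

	LABEL boot32
		LINUX vmlinuz-32
		APPEND root=/dev/sda1 ro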

If what distros want is a single kernel image that can boot into either
mode, that might be an interesting enhancement to wraplinux.

	-hpa


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2009-09-04 17:49 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-09-03 13:21 [RFC][PATCH 00/14] load-balancing and cpu_power -v3 Peter Zijlstra
2009-09-03 13:21 ` [RFC][PATCH 01/14] sched: restore __cpu_power to a straight sum of power Peter Zijlstra
2009-09-03 13:21 ` [RFC][PATCH 02/14] sched: SD_PREFER_SIBLING Peter Zijlstra
2009-09-03 13:21 ` [RFC][PATCH 03/14] sched: update the cpu_power sum during load-balance Peter Zijlstra
2009-09-03 13:21 ` [RFC][PATCH 04/14] sched: add smt_gain Peter Zijlstra
2009-09-03 13:21 ` [RFC][PATCH 05/14] sched: dynamic cpu_power Peter Zijlstra
2009-09-03 13:21 ` [RFC][PATCH 06/14] sched: scale down cpu_power due to RT tasks Peter Zijlstra
2009-09-03 13:21 ` [RFC][PATCH 07/14] sched: try to deal with low capacity Peter Zijlstra
2009-09-03 13:21 ` [RFC][PATCH 08/14] sched: remove reciprocal for cpu_power Peter Zijlstra
2009-09-03 13:59   ` Peter Zijlstra
2009-09-03 13:21 ` [RFC][PATCH 09/14] x86: move APERF/MPERF into a X86_FEATURE Peter Zijlstra
2009-09-03 13:21 ` [RFC][PATCH 10/14] x86: generic aperf/mperf code Peter Zijlstra
2009-09-04  9:19   ` Thomas Renninger
2009-09-04  9:25     ` Peter Zijlstra
2009-09-04  9:27       ` Peter Zijlstra
2009-09-04  9:34         ` Thomas Renninger
2009-09-04 14:22         ` Dave Jones
2009-09-04 14:42           ` Peter Zijlstra
2009-09-04 17:45             ` H. Peter Anvin
2009-09-03 13:21 ` [RFC][PATCH 11/14] sched: provide arch_scale_freq_power Peter Zijlstra
2009-09-03 13:21 ` [RFC][PATCH 12/14] x86: sched: provide arch implementations using aperf/mperf Peter Zijlstra
2009-09-03 13:21 ` [RFC][PATCH 13/14] sched: cleanup wake_idle power saving Peter Zijlstra
2009-09-03 13:21 ` [RFC][PATCH 14/14] sched: cleanup wake_idle Peter Zijlstra
