* [RESEND PATCH V5 0/8] remove cpu_load idx
@ 2014-04-16  2:43 Alex Shi
  2014-04-16  2:43 ` [PATCH V5 1/8] sched: shortcut to remove load_idx Alex Shi
                   ` (9 more replies)
  0 siblings, 10 replies; 22+ messages in thread
From: Alex Shi @ 2014-04-16  2:43 UTC (permalink / raw)
  To: mingo, peterz, morten.rasmussen, vincent.guittot, daniel.lezcano, efault
  Cc: wangyun, linux-kernel, mgorman

In the cpu_load decay usage, we mixed the long term and short term load with
the balance bias, randomly picking a big or small value according to the
balance destination or source. This mix is wrong: the balance bias should be
based on the task moving cost between cpu groups, not on random history or
instant load. History load may diverge a lot from the real load, which leads
to an incorrect bias.

Take busy_idx for example:
we mix the history load decay and the bias together. The ridiculous thing is,
when all cpu loads are continuously stable, the long and short term loads are
the same, so we lose the meaning of the bias and any minimal imbalance may
cause unnecessary task moving. To prevent that, we have to reuse the
imbalance_pct again in find_busiest_group(), but that clearly causes over
bias in normal times, and it is even worse when there is bursty load in the
system.
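
To make this double bias concrete, here is a standalone sketch of the two
layers as they combine today (plain C modeled on source_load()/target_load()
plus the imbalance_pct check in find_busiest_group(); simplified, names are
illustrative, not kernel code):

	static int should_pull(unsigned long busiest_hist, unsigned long busiest_now,
			       unsigned long local_hist, unsigned long local_now,
			       unsigned int imbalance_pct)
	{
		/* layer 1: low guess for the busiest (source), high guess for us (target) */
		unsigned long busiest = busiest_hist < busiest_now ? busiest_hist : busiest_now;
		unsigned long local = local_hist > local_now ? local_hist : local_now;

		/* layer 2: the fixed percentage bias applied on top of the biased loads */
		return 100 * busiest > imbalance_pct * local;
	}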

As to idle_idx:
though I have some concern about the correctness of its usage
(https://lkml.org/lkml/2014/3/12/247), since we are working on moving cpu
idle handling into the scheduler, the problem will be reconsidered there.
We don't need to care about it too much now.

In fact, the cpu_load decays can be replaced by the sched_avg decay, which
also decays load over time. The balance bias part can fully use the fixed
bias -- imbalance_pct, which is already used in the newly idle, wake,
forkexec and numa balancing scenarios.

This patchset removes the cpu_load idx decay, since it can be replaced by
the sched_avg feature, and leaves the imbalance_pct bias untouched.
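
For reference, the fixed bias kept by this series looks like the existing
check in find_idlest_group(); a simplified standalone form (plain C, not
kernel code):

	/*
	 * Stay with the local group unless the idlest remote group is clearly
	 * less loaded; with imbalance_pct == 125 the margin is about 12%.
	 */
	static int prefer_local_group(unsigned long this_load, unsigned long min_load,
				      unsigned int imbalance_pct)
	{
		unsigned int imbalance = 100 + (imbalance_pct - 100) / 2;

		return 100 * this_load < imbalance * min_load;
	}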

V5,
1, remove the unified bias patch and the biased_load function. Thanks for
PeterZ's comments!
2, remove get_sd_load_idx() in the 1st patch per SrikarD's suggestion.
3, remove the LB_BIAS feature; it is not needed now.

V4,
1, rebase on latest tip/master
2, replace target_load by biased_load per Morten's suggestion

V3,
1, correct the wake_affine bias. Thanks for Morten's reminder!
2, replace source_load by weighted_cpuload for a clearer function name.

V2,
1, This version does some tuning on the load bias of the target load.
2, Went further to remove the cpu_load from the rq.
3, Revert the patch 'Limit sd->*_idx range on sysctl' since it is no longer needed.

Any testing/comments are appreciated.

This patchset is rebased on the latest tip/master.
The git tree for this patchset at:
 git@github.com:alexshi/power-scheduling.git noload

[PATCH V5 1/8] sched: shortcut to remove load_idx
[PATCH V5 2/8] sched: remove rq->cpu_load[load_idx] array
[PATCH V5 3/8] sched: remove source_load and target_load
[PATCH V5 4/8] sched: remove LB_BIAS
[PATCH V5 5/8] sched: clean up cpu_load update
[PATCH V5 6/8] sched: rewrite update_cpu_load_nohz
[PATCH V5 7/8] sched: remove rq->cpu_load and rq->nr_load_updates
[PATCH V5 8/8] sched: rename update_*_cpu_load

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH V5 1/8] sched: shortcut to remove load_idx
  2014-04-16  2:43 [RESEND PATCH V5 0/8] remove cpu_load idx Alex Shi
@ 2014-04-16  2:43 ` Alex Shi
  2014-04-16  2:43 ` [PATCH V5 2/8] sched: remove rq->cpu_load[load_idx] array Alex Shi
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Alex Shi @ 2014-04-16  2:43 UTC (permalink / raw)
  To: mingo, peterz, morten.rasmussen, vincent.guittot, daniel.lezcano, efault
  Cc: wangyun, linux-kernel, mgorman

Shortcut to remove the rq->cpu_load[load_idx] effect in the scheduler.
Of the five load idx values, only busy_idx and idle_idx are non-zero;
newidle_idx, wake_idx and forkexec_idx are zero on all archs.

So changing the idx to zero here fully removes the load_idx effect.
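
For reference, with load_idx forced to 0 the existing helpers collapse to the
instantaneous weighted load; a standalone sketch (plain C, simplified from the
current source_load()/target_load(), names are illustrative, not kernel code):

	/*
	 * With type == 0 the helper returns early and never looks at the
	 * rq->cpu_load[] history, so source_load() == target_load() ==
	 * weighted_cpuload() on every cpu.
	 */
	static unsigned long source_load_sketch(unsigned long cpu_load_hist,
						unsigned long weighted_load,
						int type)
	{
		if (type == 0)
			return weighted_load;

		return cpu_load_hist < weighted_load ?
			cpu_load_hist : weighted_load;	/* min(), the old bias */
	}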

Signed-off-by: Alex Shi <alex.shi@linaro.org>
---
 kernel/sched/fair.c | 30 +-----------------------------
 1 file changed, 1 insertion(+), 29 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4f14a65..ddff32a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5507,34 +5507,6 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 	};
 }
 
-/**
- * get_sd_load_idx - Obtain the load index for a given sched domain.
- * @sd: The sched_domain whose load_idx is to be obtained.
- * @idle: The idle status of the CPU for whose sd load_idx is obtained.
- *
- * Return: The load index.
- */
-static inline int get_sd_load_idx(struct sched_domain *sd,
-					enum cpu_idle_type idle)
-{
-	int load_idx;
-
-	switch (idle) {
-	case CPU_NOT_IDLE:
-		load_idx = sd->busy_idx;
-		break;
-
-	case CPU_NEWLY_IDLE:
-		load_idx = sd->newidle_idx;
-		break;
-	default:
-		load_idx = sd->idle_idx;
-		break;
-	}
-
-	return load_idx;
-}
-
 static unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
 {
 	return SCHED_POWER_SCALE;
@@ -5920,7 +5892,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;
 
-	load_idx = get_sd_load_idx(env->sd, env->idle);
+	load_idx = 0;
 
 	do {
 		struct sg_lb_stats *sgs = &tmp_sgs;
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH V5 2/8] sched: remove rq->cpu_load[load_idx] array
  2014-04-16  2:43 [RESEND PATCH V5 0/8] remove cpu_load idx Alex Shi
  2014-04-16  2:43 ` [PATCH V5 1/8] sched: shortcut to remove load_idx Alex Shi
@ 2014-04-16  2:43 ` Alex Shi
  2014-04-16  2:43 ` [PATCH V5 3/8] sched: remove source_load and target_load Alex Shi
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Alex Shi @ 2014-04-16  2:43 UTC (permalink / raw)
  To: mingo, peterz, morten.rasmussen, vincent.guittot, daniel.lezcano, efault
  Cc: wangyun, linux-kernel, mgorman

Since the load_idx effect has been removed from load balancing, we don't
need the load_idx decays in the scheduler any more. That saves some
processing in scheduler_tick() and other places.

Signed-off-by: Alex Shi <alex.shi@linaro.org>
---
 arch/ia64/include/asm/topology.h |  5 ---
 arch/tile/include/asm/topology.h |  6 ---
 include/linux/sched.h            |  5 ---
 include/linux/topology.h         |  8 ----
 kernel/sched/core.c              | 58 +++++++------------------
 kernel/sched/debug.c             |  6 +--
 kernel/sched/fair.c              | 51 +++++++++-------------
 kernel/sched/proc.c              | 92 ++--------------------------------------
 kernel/sched/sched.h             |  3 +-
 9 files changed, 42 insertions(+), 192 deletions(-)

diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h
index 5cb55a1..e7c2188 100644
--- a/arch/ia64/include/asm/topology.h
+++ b/arch/ia64/include/asm/topology.h
@@ -55,11 +55,6 @@ void build_cpu_to_node_map(void);
 	.busy_factor		= 64,			\
 	.imbalance_pct		= 125,			\
 	.cache_nice_tries	= 2,			\
-	.busy_idx		= 2,			\
-	.idle_idx		= 1,			\
-	.newidle_idx		= 0,			\
-	.wake_idx		= 0,			\
-	.forkexec_idx		= 0,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
 				| SD_BALANCE_EXEC	\
diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h
index d15c0d8..05f6ffe 100644
--- a/arch/tile/include/asm/topology.h
+++ b/arch/tile/include/asm/topology.h
@@ -57,12 +57,6 @@ static inline const struct cpumask *cpumask_of_node(int node)
 	.busy_factor		= 64,					\
 	.imbalance_pct		= 125,					\
 	.cache_nice_tries	= 1,					\
-	.busy_idx		= 2,					\
-	.idle_idx		= 1,					\
-	.newidle_idx		= 0,					\
-	.wake_idx		= 0,					\
-	.forkexec_idx		= 0,					\
-									\
 	.flags			= 1*SD_LOAD_BALANCE			\
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 25f54c7..3b08d7b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -901,11 +901,6 @@ struct sched_domain {
 	unsigned int busy_factor;	/* less balancing by factor if busy */
 	unsigned int imbalance_pct;	/* No balance until over watermark */
 	unsigned int cache_nice_tries;	/* Leave cache hot tasks for # tries */
-	unsigned int busy_idx;
-	unsigned int idle_idx;
-	unsigned int newidle_idx;
-	unsigned int wake_idx;
-	unsigned int forkexec_idx;
 	unsigned int smt_gain;
 
 	int nohz_idle;			/* NOHZ IDLE status */
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 7062330..7e9a3e0 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -121,9 +121,6 @@ int arch_update_cpu_topology(void);
 	.busy_factor		= 64,					\
 	.imbalance_pct		= 125,					\
 	.cache_nice_tries	= 1,					\
-	.busy_idx		= 2,					\
-	.wake_idx		= 0,					\
-	.forkexec_idx		= 0,					\
 									\
 	.flags			= 1*SD_LOAD_BALANCE			\
 				| 1*SD_BALANCE_NEWIDLE			\
@@ -151,11 +148,6 @@ int arch_update_cpu_topology(void);
 	.busy_factor		= 64,					\
 	.imbalance_pct		= 125,					\
 	.cache_nice_tries	= 1,					\
-	.busy_idx		= 2,					\
-	.idle_idx		= 1,					\
-	.newidle_idx		= 0,					\
-	.wake_idx		= 0,					\
-	.forkexec_idx		= 0,					\
 									\
 	.flags			= 1*SD_LOAD_BALANCE			\
 				| 1*SD_BALANCE_NEWIDLE			\
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 268a45e..33fd59b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4816,64 +4816,45 @@ static void sd_free_ctl_entry(struct ctl_table **tablep)
 	*tablep = NULL;
 }
 
-static int min_load_idx = 0;
-static int max_load_idx = CPU_LOAD_IDX_MAX-1;
-
 static void
 set_table_entry(struct ctl_table *entry,
 		const char *procname, void *data, int maxlen,
-		umode_t mode, proc_handler *proc_handler,
-		bool load_idx)
+		umode_t mode, proc_handler *proc_handler)
 {
 	entry->procname = procname;
 	entry->data = data;
 	entry->maxlen = maxlen;
 	entry->mode = mode;
 	entry->proc_handler = proc_handler;
-
-	if (load_idx) {
-		entry->extra1 = &min_load_idx;
-		entry->extra2 = &max_load_idx;
-	}
 }
 
 static struct ctl_table *
 sd_alloc_ctl_domain_table(struct sched_domain *sd)
 {
-	struct ctl_table *table = sd_alloc_ctl_entry(14);
+	struct ctl_table *table = sd_alloc_ctl_entry(9);
 
 	if (table == NULL)
 		return NULL;
 
 	set_table_entry(&table[0], "min_interval", &sd->min_interval,
-		sizeof(long), 0644, proc_doulongvec_minmax, false);
+		sizeof(long), 0644, proc_doulongvec_minmax);
 	set_table_entry(&table[1], "max_interval", &sd->max_interval,
-		sizeof(long), 0644, proc_doulongvec_minmax, false);
-	set_table_entry(&table[2], "busy_idx", &sd->busy_idx,
-		sizeof(int), 0644, proc_dointvec_minmax, true);
-	set_table_entry(&table[3], "idle_idx", &sd->idle_idx,
-		sizeof(int), 0644, proc_dointvec_minmax, true);
-	set_table_entry(&table[4], "newidle_idx", &sd->newidle_idx,
-		sizeof(int), 0644, proc_dointvec_minmax, true);
-	set_table_entry(&table[5], "wake_idx", &sd->wake_idx,
-		sizeof(int), 0644, proc_dointvec_minmax, true);
-	set_table_entry(&table[6], "forkexec_idx", &sd->forkexec_idx,
-		sizeof(int), 0644, proc_dointvec_minmax, true);
-	set_table_entry(&table[7], "busy_factor", &sd->busy_factor,
-		sizeof(int), 0644, proc_dointvec_minmax, false);
-	set_table_entry(&table[8], "imbalance_pct", &sd->imbalance_pct,
-		sizeof(int), 0644, proc_dointvec_minmax, false);
-	set_table_entry(&table[9], "cache_nice_tries",
+		sizeof(long), 0644, proc_doulongvec_minmax);
+	set_table_entry(&table[2], "busy_factor", &sd->busy_factor,
+		sizeof(int), 0644, proc_dointvec_minmax);
+	set_table_entry(&table[3], "imbalance_pct", &sd->imbalance_pct,
+		sizeof(int), 0644, proc_dointvec_minmax);
+	set_table_entry(&table[4], "cache_nice_tries",
 		&sd->cache_nice_tries,
-		sizeof(int), 0644, proc_dointvec_minmax, false);
+		sizeof(int), 0644, proc_dointvec_minmax);
 	set_table_entry(&table[10], "flags", &sd->flags,
-		sizeof(int), 0644, proc_dointvec_minmax, false);
+		sizeof(int), 0644, proc_dointvec_minmax);
 	set_table_entry(&table[11], "max_newidle_lb_cost",
 		&sd->max_newidle_lb_cost,
-		sizeof(long), 0644, proc_doulongvec_minmax, false);
+		sizeof(long), 0644, proc_doulongvec_minmax);
 	set_table_entry(&table[12], "name", sd->name,
-		CORENAME_MAX_SIZE, 0444, proc_dostring, false);
-	/* &table[13] is terminator */
+		CORENAME_MAX_SIZE, 0444, proc_dostring);
+	/* &table[8] is terminator */
 
 	return table;
 }
@@ -5996,11 +5977,6 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
 		.busy_factor		= 32,
 		.imbalance_pct		= 125,
 		.cache_nice_tries	= 2,
-		.busy_idx		= 3,
-		.idle_idx		= 2,
-		.newidle_idx		= 0,
-		.wake_idx		= 0,
-		.forkexec_idx		= 0,
 
 		.flags			= 1*SD_LOAD_BALANCE
 					| 1*SD_BALANCE_NEWIDLE
@@ -6750,7 +6726,7 @@ DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
 
 void __init sched_init(void)
 {
-	int i, j;
+	int i;
 	unsigned long alloc_size = 0, ptr;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -6853,9 +6829,7 @@ void __init sched_init(void)
 		init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
 #endif
 
-		for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
-			rq->cpu_load[j] = 0;
-
+		rq->cpu_load = 0;
 		rq->last_load_update_tick = jiffies;
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 695f977..0e48e98 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -302,11 +302,7 @@ do {									\
 	PN(next_balance);
 	SEQ_printf(m, "  .%-30s: %ld\n", "curr->pid", (long)(task_pid_nr(rq->curr)));
 	PN(clock);
-	P(cpu_load[0]);
-	P(cpu_load[1]);
-	P(cpu_load[2]);
-	P(cpu_load[3]);
-	P(cpu_load[4]);
+	P(cpu_load);
 #undef P
 #undef PN
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ddff32a..12a35ea 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1015,8 +1015,8 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 }
 
 static unsigned long weighted_cpuload(const int cpu);
-static unsigned long source_load(int cpu, int type);
-static unsigned long target_load(int cpu, int type);
+static unsigned long source_load(int cpu);
+static unsigned long target_load(int cpu);
 static unsigned long power_of(int cpu);
 static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
 
@@ -3964,30 +3964,30 @@ static unsigned long weighted_cpuload(const int cpu)
  * We want to under-estimate the load of migration sources, to
  * balance conservatively.
  */
-static unsigned long source_load(int cpu, int type)
+static unsigned long source_load(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long total = weighted_cpuload(cpu);
 
-	if (type == 0 || !sched_feat(LB_BIAS))
+	if (!sched_feat(LB_BIAS))
 		return total;
 
-	return min(rq->cpu_load[type-1], total);
+	return min(rq->cpu_load, total);
 }
 
 /*
  * Return a high guess at the load of a migration-target cpu weighted
  * according to the scheduling class and "nice" value.
  */
-static unsigned long target_load(int cpu, int type)
+static unsigned long target_load(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long total = weighted_cpuload(cpu);
 
-	if (type == 0 || !sched_feat(LB_BIAS))
+	if (!sched_feat(LB_BIAS))
 		return total;
 
-	return max(rq->cpu_load[type-1], total);
+	return max(rq->cpu_load, total);
 }
 
 static unsigned long power_of(int cpu)
@@ -4187,7 +4187,7 @@ static int wake_wide(struct task_struct *p)
 static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 {
 	s64 this_load, load;
-	int idx, this_cpu, prev_cpu;
+	int this_cpu, prev_cpu;
 	unsigned long tl_per_task;
 	struct task_group *tg;
 	unsigned long weight;
@@ -4200,11 +4200,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 	if (wake_wide(p))
 		return 0;
 
-	idx	  = sd->wake_idx;
 	this_cpu  = smp_processor_id();
 	prev_cpu  = task_cpu(p);
-	load	  = source_load(prev_cpu, idx);
-	this_load = target_load(this_cpu, idx);
+	load	  = source_load(prev_cpu);
+	this_load = target_load(this_cpu);
 
 	/*
 	 * If sync wakeup then subtract the (maximum possible)
@@ -4260,7 +4259,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 
 	if (balanced ||
 	    (this_load <= load &&
-	     this_load + target_load(prev_cpu, idx) <= tl_per_task)) {
+	     this_load + target_load(prev_cpu) <= tl_per_task)) {
 		/*
 		 * This domain has SD_WAKE_AFFINE and
 		 * p is cache cold in this domain, and
@@ -4279,17 +4278,12 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
  * domain.
  */
 static struct sched_group *
-find_idlest_group(struct sched_domain *sd, struct task_struct *p,
-		  int this_cpu, int sd_flag)
+find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
 	unsigned long min_load = ULONG_MAX, this_load = 0;
-	int load_idx = sd->forkexec_idx;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
 
-	if (sd_flag & SD_BALANCE_WAKE)
-		load_idx = sd->wake_idx;
-
 	do {
 		unsigned long load, avg_load;
 		int local_group;
@@ -4309,9 +4303,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		for_each_cpu(i, sched_group_cpus(group)) {
 			/* Bias balancing toward cpus of our domain */
 			if (local_group)
-				load = source_load(i, load_idx);
+				load = source_load(i);
 			else
-				load = target_load(i, load_idx);
+				load = target_load(i);
 
 			avg_load += load;
 		}
@@ -4466,7 +4460,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			continue;
 		}
 
-		group = find_idlest_group(sd, p, cpu, sd_flag);
+		group = find_idlest_group(sd, p, cpu);
 		if (!group) {
 			sd = sd->child;
 			continue;
@@ -5754,12 +5748,11 @@ static inline int sg_capacity(struct lb_env *env, struct sched_group *group)
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
  * @env: The load balancing environment.
  * @group: sched_group whose statistics are to be updated.
- * @load_idx: Load index of sched_domain of this_cpu for load calc.
  * @local_group: Does group contain this_cpu.
  * @sgs: variable to hold the statistics for this group.
  */
 static inline void update_sg_lb_stats(struct lb_env *env,
-			struct sched_group *group, int load_idx,
+			struct sched_group *group,
 			int local_group, struct sg_lb_stats *sgs)
 {
 	unsigned long load;
@@ -5772,9 +5765,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 		/* Bias balancing toward cpus of our domain */
 		if (local_group)
-			load = target_load(i, load_idx);
+			load = target_load(i);
 		else
-			load = source_load(i, load_idx);
+			load = source_load(i);
 
 		sgs->group_load += load;
 		sgs->sum_nr_running += rq->nr_running;
@@ -5887,13 +5880,11 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	struct sched_domain *child = env->sd->child;
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats tmp_sgs;
-	int load_idx, prefer_sibling = 0;
+	int prefer_sibling = 0;
 
 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;
 
-	load_idx = 0;
-
 	do {
 		struct sg_lb_stats *sgs = &tmp_sgs;
 		int local_group;
@@ -5908,7 +5899,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 				update_group_power(env->sd, env->dst_cpu);
 		}
 
-		update_sg_lb_stats(env, sg, load_idx, local_group, sgs);
+		update_sg_lb_stats(env, sg, local_group, sgs);
 
 		if (local_group)
 			goto next_group;
diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
index 16f5a30..a2435c5 100644
--- a/kernel/sched/proc.c
+++ b/kernel/sched/proc.c
@@ -11,7 +11,7 @@
 unsigned long this_cpu_load(void)
 {
 	struct rq *this = this_rq();
-	return this->cpu_load[0];
+	return this->cpu_load;
 }
 
 
@@ -398,105 +398,19 @@ static void calc_load_account_active(struct rq *this_rq)
  * End of global load-average stuff
  */
 
-/*
- * The exact cpuload at various idx values, calculated at every tick would be
- * load = (2^idx - 1) / 2^idx * load + 1 / 2^idx * cur_load
- *
- * If a cpu misses updates for n-1 ticks (as it was idle) and update gets called
- * on nth tick when cpu may be busy, then we have:
- * load = ((2^idx - 1) / 2^idx)^(n-1) * load
- * load = (2^idx - 1) / 2^idx) * load + 1 / 2^idx * cur_load
- *
- * decay_load_missed() below does efficient calculation of
- * load = ((2^idx - 1) / 2^idx)^(n-1) * load
- * avoiding 0..n-1 loop doing load = ((2^idx - 1) / 2^idx) * load
- *
- * The calculation is approximated on a 128 point scale.
- * degrade_zero_ticks is the number of ticks after which load at any
- * particular idx is approximated to be zero.
- * degrade_factor is a precomputed table, a row for each load idx.
- * Each column corresponds to degradation factor for a power of two ticks,
- * based on 128 point scale.
- * Example:
- * row 2, col 3 (=12) says that the degradation at load idx 2 after
- * 8 ticks is 12/128 (which is an approximation of exact factor 3^8/4^8).
- *
- * With this power of 2 load factors, we can degrade the load n times
- * by looking at 1 bits in n and doing as many mult/shift instead of
- * n mult/shifts needed by the exact degradation.
- */
-#define DEGRADE_SHIFT		7
-static const unsigned char
-		degrade_zero_ticks[CPU_LOAD_IDX_MAX] = {0, 8, 32, 64, 128};
-static const unsigned char
-		degrade_factor[CPU_LOAD_IDX_MAX][DEGRADE_SHIFT + 1] = {
-					{0, 0, 0, 0, 0, 0, 0, 0},
-					{64, 32, 8, 0, 0, 0, 0, 0},
-					{96, 72, 40, 12, 1, 0, 0},
-					{112, 98, 75, 43, 15, 1, 0},
-					{120, 112, 98, 76, 45, 16, 2} };
 
 /*
- * Update cpu_load for any missed ticks, due to tickless idle. The backlog
- * would be when CPU is idle and so we just decay the old load without
- * adding any new load.
- */
-static unsigned long
-decay_load_missed(unsigned long load, unsigned long missed_updates, int idx)
-{
-	int j = 0;
-
-	if (!missed_updates)
-		return load;
-
-	if (missed_updates >= degrade_zero_ticks[idx])
-		return 0;
-
-	if (idx == 1)
-		return load >> missed_updates;
-
-	while (missed_updates) {
-		if (missed_updates % 2)
-			load = (load * degrade_factor[idx][j]) >> DEGRADE_SHIFT;
-
-		missed_updates >>= 1;
-		j++;
-	}
-	return load;
-}
-
-/*
- * Update rq->cpu_load[] statistics. This function is usually called every
+ * Update rq->cpu_load statistics. This function is usually called every
  * scheduler tick (TICK_NSEC). With tickless idle this will not be called
  * every tick. We fix it up based on jiffies.
  */
 static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
 			      unsigned long pending_updates)
 {
-	int i, scale;
-
 	this_rq->nr_load_updates++;
 
 	/* Update our load: */
-	this_rq->cpu_load[0] = this_load; /* Fasttrack for idx 0 */
-	for (i = 1, scale = 2; i < CPU_LOAD_IDX_MAX; i++, scale += scale) {
-		unsigned long old_load, new_load;
-
-		/* scale is effectively 1 << i now, and >> i divides by scale */
-
-		old_load = this_rq->cpu_load[i];
-		old_load = decay_load_missed(old_load, pending_updates - 1, i);
-		new_load = this_load;
-		/*
-		 * Round up the averaging division if load is increasing. This
-		 * prevents us from getting stuck on 9 if the load is 10, for
-		 * example.
-		 */
-		if (new_load > old_load)
-			new_load += scale - 1;
-
-		this_rq->cpu_load[i] = (old_load * (scale - 1) + new_load) >> i;
-	}
+	this_rq->cpu_load = this_load; /* Fasttrack for idx 0 */
 
 	sched_avg_update(this_rq);
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 456e492..1f144e8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -528,8 +528,7 @@ struct rq {
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
 #endif
-	#define CPU_LOAD_IDX_MAX 5
-	unsigned long cpu_load[CPU_LOAD_IDX_MAX];
+	unsigned long cpu_load;
 	unsigned long last_load_update_tick;
 #ifdef CONFIG_NO_HZ_COMMON
 	u64 nohz_stamp;
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH V5 3/8] sched: remove source_load and target_load
  2014-04-16  2:43 [RESEND PATCH V5 0/8] remove cpu_load idx Alex Shi
  2014-04-16  2:43 ` [PATCH V5 1/8] sched: shortcut to remove load_idx Alex Shi
  2014-04-16  2:43 ` [PATCH V5 2/8] sched: remove rq->cpu_load[load_idx] array Alex Shi
@ 2014-04-16  2:43 ` Alex Shi
  2014-04-24 14:18   ` Peter Zijlstra
  2014-04-16  2:43 ` [PATCH V5 4/8] sched: remove LB_BIAS Alex Shi
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Alex Shi @ 2014-04-16  2:43 UTC (permalink / raw)
  To: mingo, peterz, morten.rasmussen, vincent.guittot, daniel.lezcano, efault
  Cc: wangyun, linux-kernel, mgorman

We have no load_idx any more, so source/target_load always return the
same value as weighted_cpuload. So we can remove these 2 functions.

Signed-off-by: Alex Shi <alex.shi@linaro.org>
---
 kernel/sched/fair.c | 54 +++++------------------------------------------------
 1 file changed, 5 insertions(+), 49 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 12a35ea..cad2b6d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1015,8 +1015,6 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 }
 
 static unsigned long weighted_cpuload(const int cpu);
-static unsigned long source_load(int cpu);
-static unsigned long target_load(int cpu);
 static unsigned long power_of(int cpu);
 static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
 
@@ -3951,45 +3949,11 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 }
 
 #ifdef CONFIG_SMP
-/* Used instead of source_load when we know the type == 0 */
 static unsigned long weighted_cpuload(const int cpu)
 {
 	return cpu_rq(cpu)->cfs.runnable_load_avg;
 }
 
-/*
- * Return a low guess at the load of a migration-source cpu weighted
- * according to the scheduling class and "nice" value.
- *
- * We want to under-estimate the load of migration sources, to
- * balance conservatively.
- */
-static unsigned long source_load(int cpu)
-{
-	struct rq *rq = cpu_rq(cpu);
-	unsigned long total = weighted_cpuload(cpu);
-
-	if (!sched_feat(LB_BIAS))
-		return total;
-
-	return min(rq->cpu_load, total);
-}
-
-/*
- * Return a high guess at the load of a migration-target cpu weighted
- * according to the scheduling class and "nice" value.
- */
-static unsigned long target_load(int cpu)
-{
-	struct rq *rq = cpu_rq(cpu);
-	unsigned long total = weighted_cpuload(cpu);
-
-	if (!sched_feat(LB_BIAS))
-		return total;
-
-	return max(rq->cpu_load, total);
-}
-
 static unsigned long power_of(int cpu)
 {
 	return cpu_rq(cpu)->cpu_power;
@@ -4202,8 +4166,8 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 
 	this_cpu  = smp_processor_id();
 	prev_cpu  = task_cpu(p);
-	load	  = source_load(prev_cpu);
-	this_load = target_load(this_cpu);
+	load	  = weighted_cpuload(prev_cpu);
+	this_load = weighted_cpuload(this_cpu);
 
 	/*
 	 * If sync wakeup then subtract the (maximum possible)
@@ -4259,7 +4223,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 
 	if (balanced ||
 	    (this_load <= load &&
-	     this_load + target_load(prev_cpu) <= tl_per_task)) {
+	     this_load + weighted_cpuload(prev_cpu) <= tl_per_task)) {
 		/*
 		 * This domain has SD_WAKE_AFFINE and
 		 * p is cache cold in this domain, and
@@ -4301,11 +4265,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 		avg_load = 0;
 
 		for_each_cpu(i, sched_group_cpus(group)) {
-			/* Bias balancing toward cpus of our domain */
-			if (local_group)
-				load = source_load(i);
-			else
-				load = target_load(i);
+			load = weighted_cpuload(i);
 
 			avg_load += load;
 		}
@@ -5763,11 +5723,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
 		struct rq *rq = cpu_rq(i);
 
-		/* Bias balancing toward cpus of our domain */
-		if (local_group)
-			load = target_load(i);
-		else
-			load = source_load(i);
+		load = weighted_cpuload(i);
 
 		sgs->group_load += load;
 		sgs->sum_nr_running += rq->nr_running;
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH V5 4/8] sched: remove LB_BIAS
  2014-04-16  2:43 [RESEND PATCH V5 0/8] remove cpu_load idx Alex Shi
                   ` (2 preceding siblings ...)
  2014-04-16  2:43 ` [PATCH V5 3/8] sched: remove source_load and target_load Alex Shi
@ 2014-04-16  2:43 ` Alex Shi
  2014-04-16  2:43 ` [PATCH V5 5/8] sched: clean up cpu_load update Alex Shi
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Alex Shi @ 2014-04-16  2:43 UTC (permalink / raw)
  To: mingo, peterz, morten.rasmussen, vincent.guittot, daniel.lezcano, efault
  Cc: wangyun, linux-kernel, mgorman

We don't need LB_BIAS now, so remove it.

Signed-off-by: Alex Shi <alex.shi@linaro.org>
---
 Documentation/timers/NO_HZ.txt | 2 --
 kernel/sched/features.h        | 1 -
 2 files changed, 3 deletions(-)

diff --git a/Documentation/timers/NO_HZ.txt b/Documentation/timers/NO_HZ.txt
index cca122f..84dd71d 100644
--- a/Documentation/timers/NO_HZ.txt
+++ b/Documentation/timers/NO_HZ.txt
@@ -177,8 +177,6 @@ not come for free:
 	slightly differently than those for non-adaptive-tick CPUs.
 	This might in turn perturb load-balancing of real-time tasks.
 
-6.	The LB_BIAS scheduler feature is disabled by adaptive ticks.
-
 Although improvements are expected over time, adaptive ticks is quite
 useful for many types of real-time and compute-intensive applications.
 However, the drawbacks listed above mean that adaptive ticks should not
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 5716929..fe3c1cb 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -43,7 +43,6 @@ SCHED_FEAT(ARCH_POWER, true)
 
 SCHED_FEAT(HRTICK, false)
 SCHED_FEAT(DOUBLE_TICK, false)
-SCHED_FEAT(LB_BIAS, true)
 
 /*
  * Decrement CPU power based on time not spent running tasks
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH V5 5/8] sched: clean up cpu_load update
  2014-04-16  2:43 [RESEND PATCH V5 0/8] remove cpu_load idx Alex Shi
                   ` (3 preceding siblings ...)
  2014-04-16  2:43 ` [PATCH V5 4/8] sched: remove LB_BIAS Alex Shi
@ 2014-04-16  2:43 ` Alex Shi
  2014-04-16  2:43 ` [PATCH V5 6/8] sched: rewrite update_cpu_load_nohz Alex Shi
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Alex Shi @ 2014-04-16  2:43 UTC (permalink / raw)
  To: mingo, peterz, morten.rasmussen, vincent.guittot, daniel.lezcano, efault
  Cc: wangyun, linux-kernel, mgorman

Since we no longer decay rq->cpu_load, we don't need pending_updates.
But we still want to update rq->rt_avg, so keep rq->last_load_update_tick
and the __update_cpu_load() function.

Signed-off-by: Alex Shi <alex.shi@linaro.org>
---
 kernel/sched/proc.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
index a2435c5..057bb9b 100644
--- a/kernel/sched/proc.c
+++ b/kernel/sched/proc.c
@@ -404,8 +404,7 @@ static void calc_load_account_active(struct rq *this_rq)
  * scheduler tick (TICK_NSEC). With tickless idle this will not be called
  * every tick. We fix it up based on jiffies.
  */
-static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
-			      unsigned long pending_updates)
+static void __update_cpu_load(struct rq *this_rq, unsigned long this_load)
 {
 	this_rq->nr_load_updates++;
 
@@ -449,7 +448,6 @@ void update_idle_cpu_load(struct rq *this_rq)
 {
 	unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
 	unsigned long load = get_rq_runnable_load(this_rq);
-	unsigned long pending_updates;
 
 	/*
 	 * bail if there's load or we're actually up-to-date.
@@ -457,10 +455,9 @@ void update_idle_cpu_load(struct rq *this_rq)
 	if (load || curr_jiffies == this_rq->last_load_update_tick)
 		return;
 
-	pending_updates = curr_jiffies - this_rq->last_load_update_tick;
 	this_rq->last_load_update_tick = curr_jiffies;
 
-	__update_cpu_load(this_rq, load, pending_updates);
+	__update_cpu_load(this_rq, load);
 }
 
 /*
@@ -483,7 +480,7 @@ void update_cpu_load_nohz(void)
 		 * We were idle, this means load 0, the current load might be
 		 * !0 due to remote wakeups and the sort.
 		 */
-		__update_cpu_load(this_rq, 0, pending_updates);
+		__update_cpu_load(this_rq, 0);
 	}
 	raw_spin_unlock(&this_rq->lock);
 }
@@ -499,7 +496,7 @@ void update_cpu_load_active(struct rq *this_rq)
 	 * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
 	 */
 	this_rq->last_load_update_tick = jiffies;
-	__update_cpu_load(this_rq, load, 1);
+	__update_cpu_load(this_rq, load);
 
 	calc_load_account_active(this_rq);
 }
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH V5 6/8] sched: rewrite update_cpu_load_nohz
  2014-04-16  2:43 [RESEND PATCH V5 0/8] remove cpu_load idx Alex Shi
                   ` (4 preceding siblings ...)
  2014-04-16  2:43 ` [PATCH V5 5/8] sched: clean up cpu_load update Alex Shi
@ 2014-04-16  2:43 ` Alex Shi
  2014-04-16  2:43 ` [PATCH V5 7/8] sched: remove rq->cpu_load and rq->nr_load_updates Alex Shi
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Alex Shi @ 2014-04-16  2:43 UTC (permalink / raw)
  To: mingo, peterz, morten.rasmussen, vincent.guittot, daniel.lezcano, efault
  Cc: wangyun, linux-kernel, mgorman

After the change to sched_avg, the cpu load at idle exit is already decayed.
So it may be near zero when waking a task that slept for a long time, or a
full non-decayed load when waking a newly forked task. We can then use it to
reflect the cpu load and don't need to pretend it is 0.

Signed-off-by: Alex Shi <alex.shi@linaro.org>
---
 kernel/sched/proc.c | 19 ++-----------------
 1 file changed, 2 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
index 057bb9b..383c4ba 100644
--- a/kernel/sched/proc.c
+++ b/kernel/sched/proc.c
@@ -461,28 +461,13 @@ void update_idle_cpu_load(struct rq *this_rq)
 }
 
 /*
- * Called from tick_nohz_idle_exit() -- try and fix up the ticks we missed.
+ * Called from tick_nohz_idle_exit()
  */
 void update_cpu_load_nohz(void)
 {
 	struct rq *this_rq = this_rq();
-	unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
-	unsigned long pending_updates;
-
-	if (curr_jiffies == this_rq->last_load_update_tick)
-		return;
 
-	raw_spin_lock(&this_rq->lock);
-	pending_updates = curr_jiffies - this_rq->last_load_update_tick;
-	if (pending_updates) {
-		this_rq->last_load_update_tick = curr_jiffies;
-		/*
-		 * We were idle, this means load 0, the current load might be
-		 * !0 due to remote wakeups and the sort.
-		 */
-		__update_cpu_load(this_rq, 0);
-	}
-	raw_spin_unlock(&this_rq->lock);
+	update_idle_cpu_load(this_rq);
 }
 #endif /* CONFIG_NO_HZ */
 
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH V5 7/8] sched: remove rq->cpu_load and rq->nr_load_updates
  2014-04-16  2:43 [RESEND PATCH V5 0/8] remove cpu_load idx Alex Shi
                   ` (5 preceding siblings ...)
  2014-04-16  2:43 ` [PATCH V5 6/8] sched: rewrite update_cpu_load_nohz Alex Shi
@ 2014-04-16  2:43 ` Alex Shi
  2014-04-16  2:43 ` [PATCH V5 8/8] sched: rename update_*_cpu_load Alex Shi
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Alex Shi @ 2014-04-16  2:43 UTC (permalink / raw)
  To: mingo, peterz, morten.rasmussen, vincent.guittot, daniel.lezcano, efault
  Cc: wangyun, linux-kernel, mgorman

The cpu_load is a copy of rq->cfs.runnable_load_avg, which is kept up to
date anyway, so we can use the latter directly. That saves two rq
variables: cpu_load and nr_load_updates.
Then we don't need __update_cpu_load(); just keep sched_avg_update().
get_rq_runnable_load(), which was used for the cpu_load update only, is
also removed.

Signed-off-by: Alex Shi <alex.shi@linaro.org>
---
 kernel/sched/core.c  |  2 --
 kernel/sched/debug.c |  2 --
 kernel/sched/proc.c  | 55 +++++++++++++---------------------------------------
 kernel/sched/sched.h |  2 --
 4 files changed, 13 insertions(+), 48 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 33fd59b..80118b3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6828,8 +6828,6 @@ void __init sched_init(void)
 #ifdef CONFIG_RT_GROUP_SCHED
 		init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
 #endif
-
-		rq->cpu_load = 0;
 		rq->last_load_update_tick = jiffies;
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 0e48e98..a03186f 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -297,12 +297,10 @@ do {									\
 	SEQ_printf(m, "  .%-30s: %lu\n", "load",
 		   rq->load.weight);
 	P(nr_switches);
-	P(nr_load_updates);
 	P(nr_uninterruptible);
 	PN(next_balance);
 	SEQ_printf(m, "  .%-30s: %ld\n", "curr->pid", (long)(task_pid_nr(rq->curr)));
 	PN(clock);
-	P(cpu_load);
 #undef P
 #undef PN
 
diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
index 383c4ba..dd3c2d9 100644
--- a/kernel/sched/proc.c
+++ b/kernel/sched/proc.c
@@ -8,12 +8,19 @@
 
 #include "sched.h"
 
+#ifdef CONFIG_SMP
 unsigned long this_cpu_load(void)
 {
-	struct rq *this = this_rq();
-	return this->cpu_load;
+	struct rq *rq = this_rq();
+	return rq->cfs.runnable_load_avg;
 }
-
+#else
+unsigned long this_cpu_load(void)
+{
+	struct rq *rq = this_rq();
+	return rq->load.weight;
+}
+#endif
 
 /*
  * Global load-average calculations
@@ -398,34 +405,6 @@ static void calc_load_account_active(struct rq *this_rq)
  * End of global load-average stuff
  */
 
-
-/*
- * Update rq->cpu_load statistics. This function is usually called every
- * scheduler tick (TICK_NSEC). With tickless idle this will not be called
- * every tick. We fix it up based on jiffies.
- */
-static void __update_cpu_load(struct rq *this_rq, unsigned long this_load)
-{
-	this_rq->nr_load_updates++;
-
-	/* Update our load: */
-	this_rq->cpu_load = this_load; /* Fasttrack for idx 0 */
-
-	sched_avg_update(this_rq);
-}
-
-#ifdef CONFIG_SMP
-static inline unsigned long get_rq_runnable_load(struct rq *rq)
-{
-	return rq->cfs.runnable_load_avg;
-}
-#else
-static inline unsigned long get_rq_runnable_load(struct rq *rq)
-{
-	return rq->load.weight;
-}
-#endif
-
 #ifdef CONFIG_NO_HZ_COMMON
 /*
  * There is no sane way to deal with nohz on smp when using jiffies because the
@@ -447,17 +426,15 @@ static inline unsigned long get_rq_runnable_load(struct rq *rq)
 void update_idle_cpu_load(struct rq *this_rq)
 {
 	unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
-	unsigned long load = get_rq_runnable_load(this_rq);
 
 	/*
 	 * bail if there's load or we're actually up-to-date.
 	 */
-	if (load || curr_jiffies == this_rq->last_load_update_tick)
+	if (curr_jiffies == this_rq->last_load_update_tick)
 		return;
 
 	this_rq->last_load_update_tick = curr_jiffies;
-
-	__update_cpu_load(this_rq, load);
+	sched_avg_update(this_rq);
 }
 
 /*
@@ -466,7 +443,6 @@ void update_idle_cpu_load(struct rq *this_rq)
 void update_cpu_load_nohz(void)
 {
 	struct rq *this_rq = this_rq();
-
 	update_idle_cpu_load(this_rq);
 }
 #endif /* CONFIG_NO_HZ */
@@ -476,12 +452,7 @@ void update_cpu_load_nohz(void)
  */
 void update_cpu_load_active(struct rq *this_rq)
 {
-	unsigned long load = get_rq_runnable_load(this_rq);
-	/*
-	 * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
-	 */
 	this_rq->last_load_update_tick = jiffies;
-	__update_cpu_load(this_rq, load);
-
+	sched_avg_update(this_rq);
 	calc_load_account_active(this_rq);
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1f144e8..f521d8e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -528,7 +528,6 @@ struct rq {
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
 #endif
-	unsigned long cpu_load;
 	unsigned long last_load_update_tick;
 #ifdef CONFIG_NO_HZ_COMMON
 	u64 nohz_stamp;
@@ -541,7 +540,6 @@ struct rq {
 
 	/* capture load from *all* tasks on this cpu: */
 	struct load_weight load;
-	unsigned long nr_load_updates;
 	u64 nr_switches;
 
 	struct cfs_rq cfs;
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH V5 8/8] sched: rename update_*_cpu_load
  2014-04-16  2:43 [RESEND PATCH V5 0/8] remove cpu_load idx Alex Shi
                   ` (6 preceding siblings ...)
  2014-04-16  2:43 ` [PATCH V5 7/8] sched: remove rq->cpu_load and rq->nr_load_updates Alex Shi
@ 2014-04-16  2:43 ` Alex Shi
  2014-04-24 16:20 ` [RESEND PATCH V5 0/8] remove cpu_load idx Peter Zijlstra
  2014-04-29 14:52 ` Morten Rasmussen
  9 siblings, 0 replies; 22+ messages in thread
From: Alex Shi @ 2014-04-16  2:43 UTC (permalink / raw)
  To: mingo, peterz, morten.rasmussen, vincent.guittot, daniel.lezcano, efault
  Cc: wangyun, linux-kernel, mgorman

Since we no longer update cpu_load, rename the related functions:
s/update_idle_cpu_load/update_idle_rt_avg/
s/update_cpu_load_nohz/update_rt_avg_nohz/
s/update_cpu_load_active/update_avg_load_active/

No functional change.

Signed-off-by: Alex Shi <alex.shi@linaro.org>
---
 include/linux/sched.h    | 2 +-
 kernel/sched/core.c      | 2 +-
 kernel/sched/fair.c      | 2 +-
 kernel/sched/proc.c      | 8 ++++----
 kernel/sched/sched.h     | 4 ++--
 kernel/time/tick-sched.c | 2 +-
 6 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3b08d7b..1e57983 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -178,7 +178,7 @@ extern unsigned long this_cpu_load(void);
 
 
 extern void calc_global_load(unsigned long ticks);
-extern void update_cpu_load_nohz(void);
+extern void update_rt_avg_nohz(void);
 
 extern unsigned long get_parent_ip(unsigned long addr);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 80118b3..8900c8a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2427,7 +2427,7 @@ void scheduler_tick(void)
 	raw_spin_lock(&rq->lock);
 	update_rq_clock(rq);
 	curr->sched_class->task_tick(rq, curr, 0);
-	update_cpu_load_active(rq);
+	update_avg_load_active(rq);
 	raw_spin_unlock(&rq->lock);
 
 	perf_event_task_tick();
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cad2b6d..864754f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7013,7 +7013,7 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 
 		raw_spin_lock_irq(&rq->lock);
 		update_rq_clock(rq);
-		update_idle_cpu_load(rq);
+		update_idle_rt_avg(rq);
 		raw_spin_unlock_irq(&rq->lock);
 
 		rebalance_domains(rq, CPU_IDLE);
diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
index dd3c2d9..42b7706 100644
--- a/kernel/sched/proc.c
+++ b/kernel/sched/proc.c
@@ -423,7 +423,7 @@ static void calc_load_account_active(struct rq *this_rq)
  * Called from nohz_idle_balance() to update the load ratings before doing the
  * idle balance.
  */
-void update_idle_cpu_load(struct rq *this_rq)
+void update_idle_rt_avg(struct rq *this_rq)
 {
 	unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
 
@@ -440,17 +440,17 @@ void update_idle_cpu_load(struct rq *this_rq)
 /*
  * Called from tick_nohz_idle_exit()
  */
-void update_cpu_load_nohz(void)
+void update_rt_avg_nohz(void)
 {
 	struct rq *this_rq = this_rq();
-	update_idle_cpu_load(this_rq);
+	update_idle_rt_avg(this_rq);
 }
 #endif /* CONFIG_NO_HZ */
 
 /*
  * Called from scheduler_tick()
  */
-void update_cpu_load_active(struct rq *this_rq)
+void update_avg_load_active(struct rq *this_rq)
 {
 	this_rq->last_load_update_tick = jiffies;
 	sched_avg_update(this_rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f521d8e..f4fb861 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -21,7 +21,7 @@ extern unsigned long calc_load_update;
 extern atomic_long_t calc_load_tasks;
 
 extern long calc_load_fold_active(struct rq *this_rq);
-extern void update_cpu_load_active(struct rq *this_rq);
+extern void update_avg_load_active(struct rq *this_rq);
 
 /*
  * Helpers for converting nanosecond timing to jiffy resolution
@@ -1209,7 +1209,7 @@ extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
 
 unsigned long to_ratio(u64 period, u64 runtime);
 
-extern void update_idle_cpu_load(struct rq *this_rq);
+extern void update_idle_rt_avg(struct rq *this_rq);
 
 extern void init_task_runnable_average(struct task_struct *p);
 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9f8af69..b1a400a 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -866,7 +866,7 @@ static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
 {
 	/* Update jiffies first */
 	tick_do_update_jiffies64(now);
-	update_cpu_load_nohz();
+	update_rt_avg_nohz();
 
 	calc_load_exit_idle();
 	touch_softlockup_watchdog();
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH V5 3/8] sched: remove source_load and target_load
  2014-04-16  2:43 ` [PATCH V5 3/8] sched: remove source_load and target_load Alex Shi
@ 2014-04-24 14:18   ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2014-04-24 14:18 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, morten.rasmussen, vincent.guittot, daniel.lezcano, efault,
	wangyun, linux-kernel, mgorman

On Wed, Apr 16, 2014 at 10:43:24AM +0800, Alex Shi wrote:
> We have no load_idx any more, so source/target_load always return the
> same value as weighted_cpuload. So we can remove these 2 functions.

That's just not true:

> -/*
> - * Return a low guess at the load of a migration-source cpu weighted
> - * according to the scheduling class and "nice" value.
> - *
> - * We want to under-estimate the load of migration sources, to
> - * balance conservatively.
> - */
> -static unsigned long source_load(int cpu)
> -{
> -	struct rq *rq = cpu_rq(cpu);
> -	unsigned long total = weighted_cpuload(cpu);
> -
> -	if (!sched_feat(LB_BIAS))
> -		return total;
> -
> -	return min(rq->cpu_load, total);
> -}
> -
> -/*
> - * Return a high guess at the load of a migration-target cpu weighted
> - * according to the scheduling class and "nice" value.
> - */
> -static unsigned long target_load(int cpu)
> -{
> -	struct rq *rq = cpu_rq(cpu);
> -	unsigned long total = weighted_cpuload(cpu);
> -
> -	if (!sched_feat(LB_BIAS))
> -		return total;
> -
> -	return max(rq->cpu_load, total);
> -}



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND PATCH V5 0/8] remove cpu_load idx
  2014-04-16  2:43 [RESEND PATCH V5 0/8] remove cpu_load idx Alex Shi
                   ` (7 preceding siblings ...)
  2014-04-16  2:43 ` [PATCH V5 8/8] sched: rename update_*_cpu_load Alex Shi
@ 2014-04-24 16:20 ` Peter Zijlstra
  2014-04-29 15:04   ` Morten Rasmussen
  2014-05-01  0:41   ` Alex Shi
  2014-04-29 14:52 ` Morten Rasmussen
  9 siblings, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2014-04-24 16:20 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, morten.rasmussen, vincent.guittot, daniel.lezcano, efault,
	wangyun, linux-kernel, mgorman



OK, this series is a lot saner, with the exception of 3/8 and
dependents.

I do still worry a bit about losing the longer term view for the big
domains though. Sadly I don't have any really big machines.

I think the entire series is equivalent to setting LB_BIAS to false. So
I suppose we could do that for a while and if nobody reports horrible
things we could just do this.

Anybody?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND PATCH V5 0/8] remove cpu_load idx
  2014-04-16  2:43 [RESEND PATCH V5 0/8] remove cpu_load idx Alex Shi
                   ` (8 preceding siblings ...)
  2014-04-24 16:20 ` [RESEND PATCH V5 0/8] remove cpu_load idx Peter Zijlstra
@ 2014-04-29 14:52 ` Morten Rasmussen
  2014-04-30  9:24   ` Alex Shi
  9 siblings, 1 reply; 22+ messages in thread
From: Morten Rasmussen @ 2014-04-29 14:52 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, vincent.guittot, daniel.lezcano, efault, wangyun,
	linux-kernel, mgorman

On Wed, Apr 16, 2014 at 03:43:21AM +0100, Alex Shi wrote:
> In the cpu_load decay usage, we mixed the long term and short term load with
> the balance bias, randomly picking a big or small value according to the
> balance destination or source.

I disagree that it is random. min()/max() in {source,target}_load()
provides a conservative bias to the load estimate that should prevent us
from trying to pull tasks from the source cpu if its current load is
just a temporary spike. Likewise, we don't try to pull tasks to the
target cpu if the load is just a temporary drop.

> This mix is wrong: the balance bias should be based on the task moving cost
> between cpu groups, not on random history or instant load.

Your patch set actually changes everything to be based on the instant
load alone. rq->cfs.runnable_load_avg is updated instantaneously when
tasks are enqueued and dequeued, so this load expression is quite volatile.

What do you mean by "task moving cost"?

> History load may diverge a lot from the real load, which leads to an incorrect bias.
> 
> Take busy_idx for example:
> we mix the history load decay and the bias together. The ridiculous thing is,
> when all cpu loads are continuously stable, the long and short term loads are
> the same, so we lose the meaning of the bias and any minimal imbalance may
> cause unnecessary task moving. To prevent that, we have to reuse the
> imbalance_pct again in find_busiest_group(), but that clearly causes over
> bias in normal times, and it is even worse when there is bursty load in the
> system.

Isn't imbalance_pct only used once in the periodic load-balance path?

It is not clear to me what the over bias problem is. If you have a
stable situation, I would expect the long and short term load to be the
same?

> As to idle_idx:
> though I have some concern about the correctness of its usage
> (https://lkml.org/lkml/2014/3/12/247), since we are working on moving cpu
> idle handling into the scheduler, the problem will be reconsidered there.
> We don't need to care about it too much now.
> 
> In fact, the cpu_load decays can be replaced by the sched_avg decay, which
> also decays load over time. The balance bias part can fully use the fixed
> bias -- imbalance_pct, which is already used in the newly idle, wake,
> forkexec and numa balancing scenarios.

As I have said previously, I agree that cpu_load[] is somewhat broken in
its current form, but I don't see how removing it and replacing it with
the instantaneous cpu load solves the problems you point out.

The current cpu_load[] averages the cpu_load over time, while
rq->cfs.runnable_load_avg is the sum of the currently runnable tasks'
load_avg_contrib. The former provides a long term view of the cpu_load,
the latter does not. It can change radically in an instant. I'm
therefore a bit concerned about the stability of the load-balance
decisions. However, since most decisions are based on cpu_load[0]
anyway, we could try setting LB_BIAS to false as Peter suggests and see
what happens.
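
For reference, the long term view comes from the per-tick smoothing that the
series removes; a standalone sketch of that recurrence (plain C, not the
kernel code itself):

	/* cpu_load[idx] moves towards cur_load by a 1/2^idx step per tick */
	static unsigned long smooth_cpu_load(unsigned long old_load,
					     unsigned long cur_load, int idx)
	{
		unsigned long scale = 1UL << idx;

		if (cur_load > old_load)	/* round up when load is rising */
			cur_load += scale - 1;

		return (old_load * (scale - 1) + cur_load) >> idx;
	}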

Morten

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND PATCH V5 0/8] remove cpu_load idx
  2014-04-24 16:20 ` [RESEND PATCH V5 0/8] remove cpu_load idx Peter Zijlstra
@ 2014-04-29 15:04   ` Morten Rasmussen
  2014-05-09 16:30     ` Alex Shi
  2014-05-01  0:41   ` Alex Shi
  1 sibling, 1 reply; 22+ messages in thread
From: Morten Rasmussen @ 2014-04-29 15:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alex Shi, mingo, vincent.guittot, daniel.lezcano, efault,
	wangyun, linux-kernel, mgorman

On Thu, Apr 24, 2014 at 05:20:29PM +0100, Peter Zijlstra wrote:
> OK, this series is a lot saner, with the exception of 3/8 and
> dependents.
> 
> I do still worry a bit about losing the longer term view for the big
> domains though. Sadly I don't have any really big machines.
> 
> I think the entire series is equivalent to setting LB_BIAS to false. So
> I suppose we could do that for a while and if nobody reports horrible
> things we could just do this.
> 
> Anybody?

I can't say what will happen on big machines, but I think the LB_BIAS
test could be a way to see what happens. I'm not convinced that it won't
lead to more task migrations since we will use the instantaneous cpu
load (weighted_cpuload()) unfiltered.

Morten

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND PATCH V5 0/8] remove cpu_load idx
  2014-04-29 14:52 ` Morten Rasmussen
@ 2014-04-30  9:24   ` Alex Shi
  2014-05-06  8:33     ` Alex Shi
  0 siblings, 1 reply; 22+ messages in thread
From: Alex Shi @ 2014-04-30  9:24 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, peterz, vincent.guittot, daniel.lezcano, efault, wangyun,
	linux-kernel, mgorman

On 04/29/2014 10:52 PM, Morten Rasmussen wrote:
> On Wed, Apr 16, 2014 at 03:43:21AM +0100, Alex Shi wrote:
>> In the cpu_load decay usage, we mixed the long term and short term load with
>> the balance bias, randomly picking a big or small value according to the
>> balance destination or source.
> 
> I disagree that it is random. min()/max() in {source,target}_load()
> provides a conservative bias to the load estimate that should prevent us
> from trying to pull tasks from the source cpu if its current load is
> just a temporary spike. Likewise, we don't try to pull tasks to the
> target cpu if the load is just a temporary drop.

Thanks a lot for review, Morten!

A temporary load spike should be very small based on its runnable load, so
it cannot have much impact.
Here 'random' refers to the short term vs. long term load selection.

> 
>> This mix is wrong: the balance bias should be based on the task moving cost
>> between cpu groups, not on random history or instant load.
> 
> Your patch set actually changes everything to be based on the instant
> load alone. rq->cfs.runnable_load_avg is updated instantaneously when
> tasks are enqueued and dequeued, so this load expression is quite volatile.

Seems we are back to the prediction correctness argument. :)

The runnable load is not instant, since it takes the runnable time into
account. And no testing shows that taking more load history leads to a
better balance. The current cpu_load[] entries are decayed to different
degrees, and there is no good reason to keep the different decay levels
once the runnable load is involved.

> 
> What do you mean by "task moving cost"?

I mean the possible last level cache misses and the memory access latency
on a different NUMA node after a task moves to another cpu.

> 
>> History load may diverge a lot from the real load, which leads to an incorrect bias.
>>
>> Take busy_idx for example:
>> we mix the history load decay and the bias together. The ridiculous thing is,
>> when all cpu loads are continuously stable, the long and short term loads are
>> the same, so we lose the meaning of the bias and any minimal imbalance may
>> cause unnecessary task moving. To prevent that, we have to reuse the
>> imbalance_pct again in find_busiest_group(), but that clearly causes over
>> bias in normal times, and it is even worse when there is bursty load in the
>> system.
> 
> Isn't imbalance_pct only used once in the periodic load-balance path?

Yes, but we have already used the source/target_load bias, so we have
biased twice. That is the over bias.

> 
> It is not clear to me what the over bias problem is. If you have a
> stable situation, I would expect the long and short term load to be the
> same?

Yes, the long/short term load is the same, so source/target_load is the
same, and then any minimal temporary load change can cause a rebalance.
That is bad considering the relatively bigger task moving cost, so the
current code still needs imbalance_pct to prevent it from happening. Using
both the source/target load bias and the imbalance_pct bias is the over bias.
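
To make the collapse concrete, a worked example with illustrative numbers
(standalone C, not kernel code):

	#include <stdio.h>

	int main(void)
	{
		/* stable load: history == instant, so min()/max() change nothing */
		unsigned long busiest = 1025, local = 1024;

		/* without the reused imbalance_pct, a delta of 1 already "imbalances" */
		printf("raw compare imbalanced: %d\n", busiest > local);
		/* only the second, fixed 125 percent bias still damps the move */
		printf("pct-gated compare imbalanced: %d\n",
		       100 * busiest > 125 * local);
		return 0;
	}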

> 
>> As to idle_idx:
>> though I have some concern about the correctness of its usage
>> (https://lkml.org/lkml/2014/3/12/247), since we are working on moving cpu
>> idle handling into the scheduler, the problem will be reconsidered there.
>> We don't need to care about it too much now.
>>
>> In fact, the cpu_load decays can be replaced by the sched_avg decay, which
>> also decays load over time. The balance bias part can fully use the fixed
>> bias -- imbalance_pct, which is already used in the newly idle, wake,
>> forkexec and numa balancing scenarios.
> 
> As I have said previously, I agree that cpu_load[] is somewhat broken in
> its current form, but I don't see how removing it and replacing it with
> the instantaneous cpu load solves the problems you point out.

I am glad that we agree that the current cpu_load[] is inappropriate. :)

Actually, this patchset just removes it and uses a cpu load that takes the
runnable info into account, not an 'instantaneous' one.

> 
> The current cpu_load[] averages the cpu_load over time, while
> rq->cfs.runnable_load_avg is the sum of the currently runnable tasks'
> load_avg_contrib. The former provides a long term view of the cpu_load,

It doesn't. It may or may not use the long term load, depending only on
which of the two values is bigger or smaller. It pretends to consider the
long term load, but may simply drop it.

> the latter does not. It can change radically in an instant. I'm
> therefore a bit concerned about the stability of the load-balance
> decisions. However, since most decisions are based on cpu_load[0]
> anyway, we could try setting LB_BIAS to false as Peter suggests and see
> what happens.


Maybe prediction from history is reasonable for a single task, but it is
much less so for a cpu's load history, where many tasks get rebalanced. No
testing has shown the current method is helpful.

For a task's load change, the scheduler has no idea about the future except
guessing from the task's history; but for a cpu's load change, the scheduler
learns it directly from task wakeup and balancing, both of which are under
its own control and of its own doing.


I think the first patch of this series has the same effect as disabling
LB_BIAS, and previous results showed the performance is good.

Anyway, I just pushed the following patch to github; maybe Fengguang's
testing system will pick it up.

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 5716929..0bf649f 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -43,7 +43,7 @@ SCHED_FEAT(ARCH_POWER, true)

 SCHED_FEAT(HRTICK, false)
 SCHED_FEAT(DOUBLE_TICK, false)
-SCHED_FEAT(LB_BIAS, true)
+SCHED_FEAT(LB_BIAS, false)


-- 
Thanks
    Alex

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RESEND PATCH V5 0/8] remove cpu_load idx
  2014-04-24 16:20 ` [RESEND PATCH V5 0/8] remove cpu_load idx Peter Zijlstra
  2014-04-29 15:04   ` Morten Rasmussen
@ 2014-05-01  0:41   ` Alex Shi
  2014-05-06  9:54     ` Preeti Murthy
  1 sibling, 1 reply; 22+ messages in thread
From: Alex Shi @ 2014-05-01  0:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, morten.rasmussen, vincent.guittot, daniel.lezcano, efault,
	wangyun, linux-kernel, mgorman


On 4/25/14, 0:20, Peter Zijlstra wrote:
>
> OK, this series is a lot saner, with the exception of 3/8 and
> dependents.
>
> I do still worry a bit for loosing the longer term view for the big
> domains though. Sadly I don't have any really big machines.
>
> I think the entire series is equivalent to setting LB_BIAS to false. So
> I suppose we could do that for a while and if nobody reports horrible
> things we could just do this.

Very sorry for overlooking this email!

Yes, it is a good idea to try LB_BIAS set to false first.
>
> Anybody?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND PATCH V5 0/8] remove cpu_load idx
  2014-04-30  9:24   ` Alex Shi
@ 2014-05-06  8:33     ` Alex Shi
  2014-05-06 11:43       ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Alex Shi @ 2014-05-06  8:33 UTC (permalink / raw)
  To: Morten Rasmussen, Fengguang Wu
  Cc: mingo, peterz, vincent.guittot, daniel.lezcano, efault, wangyun,
	linux-kernel, mgorman


> 
> Maybe the predication is reasonable on per task history. but on a cpu
> load history, with many tasks rebalance. No testing show current method
> is helpful.
> 
> For task load change, scheduler has no idea for its future except guess
> from its history. but for cpu load change, scheduler know this from task
> wakeup and balance, which both under control and its aim.
> 
> 
> I think the first patch of this serial has the same effect of LB_LIAS
> disable. and previous result show performance is good.
> 
> Anyway, I just pushed the following patch to github, maybe fengguang's
> testing system will care this.

Fengguang,

Is there any performance change on the noload branch of
https://github.com/alexshi/power-scheduling.git ?

> 
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 5716929..0bf649f 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -43,7 +43,7 @@ SCHED_FEAT(ARCH_POWER, true)
> 
>  SCHED_FEAT(HRTICK, false)
>  SCHED_FEAT(DOUBLE_TICK, false)
> -SCHED_FEAT(LB_BIAS, true)
> +SCHED_FEAT(LB_BIAS, false)
> 
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND PATCH V5 0/8] remove cpu_load idx
  2014-05-01  0:41   ` Alex Shi
@ 2014-05-06  9:54     ` Preeti Murthy
  2014-05-06 11:39       ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Preeti Murthy @ 2014-05-06  9:54 UTC (permalink / raw)
  To: Alex Shi, Peter Zijlstra, Morten Rasmussen
  Cc: mingo, Vincent Guittot, Daniel Lezcano, Mike Galbraith, wangyun,
	LKML, Mel Gorman, Preeti U Murthy

Hi Morten, Peter, Alex,

In a similar context, I noticed that /proc/loadavg makes use of the
avenrun[] array, which keeps track of the history of the global
load average. That, however, is built from the sum of
nr_running + nr_uninterruptible per cpu. Why are we not
using the cpu_load[] array here, which also keeps track
of the history of per-cpu load, and then returning a sum of it?
Of course with this patchset this might not be possible, but
I have elaborated my point below.

Using nr_running to show the global load average would
be misleading when all load balancing is being done on the
basis of the history of cfs_rq->runnable_load_avg/cpu_load[],
right? IOW, to the best of my understanding we do not use
nr_running anywhere to directly determine cpu load in the kernel.

My idea was that the global/per-cpu load that we expose via
proc/sys interfaces should be consistent. I haven't really
looked at what /proc/schedstat, /proc/stat and top are all reading
from. But /proc/loadavg reads out the global nr_running plus
waiting tasks, which will not give an accurate picture of the
system load, especially when there are many short running
tasks.
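
For reference, the avenrun[] update is, roughly paraphrased (the exact code
lives in kernel/sched/proc.c and may differ in detail), an exponentially
damped average of the folded task counts, sampled every LOAD_FREQ:

/* a1 = a0 * e + active * (1 - e), in FIXED_1 fixed point */
static unsigned long
calc_load(unsigned long load, unsigned long exp, unsigned long active)
{
	load *= exp;
	load += active * (FIXED_1 - exp);
	return load >> FSHIFT;
}

	/* in calc_global_load(); active is the folded task count */
	avenrun[0] = calc_load(avenrun[0], EXP_1,  active);
	avenrun[1] = calc_load(avenrun[1], EXP_5,  active);
	avenrun[2] = calc_load(avenrun[2], EXP_15, active);

So it never looks at any weighted per-cpu load at all, only at task counts.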

I observed this when looking at tuned. Tuned sets the cpu_dma_latency
depending on what it reads from /proc/loadavg. This means that even a
small number of short running tasks can make this metric report a
number that makes the system look reasonably loaded. Tuned then
disables deep idle states by setting a tight pm_qos latency
requirement for the system. This is bad because it disables power
savings even on a lightly loaded system. It is just one example of how
users of /proc/loadavg can make the wrong decisions based on an
inaccurate measure of system load.

Do you think we should take another look at the avenrun[] array
and update it to reflect the right cpu load average?

Regards
Preeti U Murthy

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND PATCH V5 0/8] remove cpu_load idx
  2014-05-06  9:54     ` Preeti Murthy
@ 2014-05-06 11:39       ` Peter Zijlstra
  2014-05-08  9:10         ` Preeti U Murthy
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2014-05-06 11:39 UTC (permalink / raw)
  To: Preeti Murthy
  Cc: Alex Shi, Morten Rasmussen, mingo, Vincent Guittot,
	Daniel Lezcano, Mike Galbraith, wangyun, LKML, Mel Gorman,
	Preeti U Murthy

[-- Attachment #1: Type: text/plain, Size: 1493 bytes --]

On Tue, May 06, 2014 at 03:24:13PM +0530, Preeti Murthy wrote:
> Hi Morten, Peter, Alex,
> 
> In a similar context, I noticed that /proc/loadavg makes use of
> avenrun[] array which keeps track of the history of the global
> load average. This however makes use of the sum of
> nr_running + nr_uninterruptible per cpu. Why are we not
> using the cpu_load[] array here which also keeps track
> of the history of per-cpu load and then return a sum of it?

Entirely different kind of 'load'. Note that you cannot use
->nr_uninterruptible per-cpu, also note that sched/proc.c doesn't.
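
That is, each cpu only folds a delta into a global counter; sketched from
memory (not a verbatim copy of kernel/sched/proc.c):

static long calc_load_fold_active(struct rq *this_rq)
{
	long nr_active, delta = 0;

	/*
	 * ->nr_uninterruptible is only meaningful as a global sum; the
	 * per-cpu value can even go negative when a task sleeps on one
	 * cpu and is woken up on another.
	 */
	nr_active = this_rq->nr_running;
	nr_active += (long)this_rq->nr_uninterruptible;

	if (nr_active != this_rq->calc_load_active) {
		delta = nr_active - this_rq->calc_load_active;
		this_rq->calc_load_active = nr_active;
	}

	return delta;
}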

> Using nr_running to show the global load average would
> be misleading when entire load balancing is being done on the
> basis of the history of cfs_rq->runnable_load_avg/cpu_load[]
> right? IOW, to the best of my understanding we do not use
> nr_running anywhere to directly determine cpu load in the kernel.
> 
> My idea was that the global/per_cpu load that we reflect via
> proc/sys interfaces must be consistent. I haven't really
> looked at what /proc/schedstat, /proc/stat, top are all reading
> from. But /proc/loadavg is reading out global nr_running +
> waiting tasks when this will not give us the accurate picture
> of the system load especially when there are many short running
> tasks.

Nobody said /proc/loadavg is a sane number, but it's what it is, and since
it's a global number it's entirely unsuited for balancing -- not to
mention all the other reasons it's crap.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND PATCH V5 0/8] remove cpu_load idx
  2014-05-06  8:33     ` Alex Shi
@ 2014-05-06 11:43       ` Peter Zijlstra
  2014-05-09 16:41         ` Alex Shi
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2014-05-06 11:43 UTC (permalink / raw)
  To: Alex Shi
  Cc: Morten Rasmussen, Fengguang Wu, mingo, vincent.guittot,
	daniel.lezcano, efault, wangyun, linux-kernel, mgorman

[-- Attachment #1: Type: text/plain, Size: 1341 bytes --]

On Tue, May 06, 2014 at 04:33:38PM +0800, Alex Shi wrote:
> 
> > 
> > Maybe the predication is reasonable on per task history. but on a cpu
> > load history, with many tasks rebalance. No testing show current method
> > is helpful.
> > 
> > For task load change, scheduler has no idea for its future except guess
> > from its history. but for cpu load change, scheduler know this from task
> > wakeup and balance, which both under control and its aim.
> > 
> > 
> > I think the first patch of this serial has the same effect of LB_LIAS
> > disable. and previous result show performance is good.
> > 
> > Anyway, I just pushed the following patch to github, maybe fengguang's
> > testing system will care this.
> 
> Fengguang,
> 
> Are there any performance change on
> https://github.com/alexshi/power-scheduling.git noload repository?

You forgot to qualify that with the important bit; on _large_ systems.
Esp. non fully connected numa boxen. Also, I'm not sure Wu has workloads
that are typical of such systems -- even if he has such machines, which
I don't know either.

Enterprise distro testing has _some_ of that, but the very sad truth is
that most enterprise users lag behind at least a full release cycle. So
by the time people start using the kernel, it's so old nobody really
cares anymore :-(




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND PATCH V5 0/8] remove cpu_load idx
  2014-05-06 11:39       ` Peter Zijlstra
@ 2014-05-08  9:10         ` Preeti U Murthy
  0 siblings, 0 replies; 22+ messages in thread
From: Preeti U Murthy @ 2014-05-08  9:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Preeti Murthy, Alex Shi, Morten Rasmussen, mingo,
	Vincent Guittot, Daniel Lezcano, Mike Galbraith, wangyun, LKML,
	Mel Gorman

On 05/06/2014 05:09 PM, Peter Zijlstra wrote:
> On Tue, May 06, 2014 at 03:24:13PM +0530, Preeti Murthy wrote:
>> Hi Morten, Peter, Alex,
>>
>> In a similar context, I noticed that /proc/loadavg makes use of
>> avenrun[] array which keeps track of the history of the global
>> load average. This however makes use of the sum of
>> nr_running + nr_uninterruptible per cpu. Why are we not
>> using the cpu_load[] array here which also keeps track
>> of the history of per-cpu load and then return a sum of it?
> 
> Entirely different kind of 'load'. Note that you cannot use
> ->nr_uninterruptible per-cpu, also note that sched/proc.c doesn't.

True, I worded it wrong above. It uses the per_cpu nr_running and a
global nr_uninterruptible.

> 
>> Using nr_running to show the global load average would
>> be misleading when entire load balancing is being done on the
>> basis of the history of cfs_rq->runnable_load_avg/cpu_load[]
>> right? IOW, to the best of my understanding we do not use
>> nr_running anywhere to directly determine cpu load in the kernel.
>>
>> My idea was that the global/per_cpu load that we reflect via
>> proc/sys interfaces must be consistent. I haven't really
>> looked at what /proc/schedstat, /proc/stat, top are all reading
>> from. But /proc/loadavg is reading out global nr_running +
>> waiting tasks when this will not give us the accurate picture
>> of the system load especially when there are many short running
>> tasks.
> 
> Nobody said /proc/loadavg is a sane number, but its what it is and since
> its a global number its entirely unsuited for balancing -- not to
> mention all other reasons its crap.

I agree it's not meant for balancing. My point was that since it is
inaccurate, why don't we correct it? But if your argument is that we can
live with /proc/loadavg showing only a rough view of system load, then it
shouldn't be a problem.

Regards
Preeti U Murthy
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND PATCH V5 0/8] remove cpu_load idx
  2014-04-29 15:04   ` Morten Rasmussen
@ 2014-05-09 16:30     ` Alex Shi
  0 siblings, 0 replies; 22+ messages in thread
From: Alex Shi @ 2014-05-09 16:30 UTC (permalink / raw)
  To: Morten Rasmussen, Peter Zijlstra
  Cc: mingo, vincent.guittot, daniel.lezcano, efault, wangyun,
	linux-kernel, mgorman


On 4/29/14, 23:04, Morten Rasmussen wrote:
> On Thu, Apr 24, 2014 at 05:20:29PM +0100, Peter Zijlstra wrote:
>> OK, this series is a lot saner, with the exception of 3/8 and
>> dependents.
>>
>> I do still worry a bit for loosing the longer term view for the big
>> domains though. Sadly I don't have any really big machines.
>>
>> I think the entire series is equivalent to setting LB_BIAS to false. So
>> I suppose we could do that for a while and if nobody reports horrible
>> things we could just do this.
>>
>> Anybody?
> I can't say what will happen on big machines, but I think the LB_BIAS
> test could be a way to see what happens. I'm not convinced that it won't
> lead to more task migrations since we will use the instantaneous cpu
> load (weighted_cpuload()) unfiltered.

Reject a simple and partly tested solution because of an untested guess?
Even if the old double bias is better in a few scenarios, that doesn't
mean it is good in theory.

I don't know the result on big machines either, but the sched_domain's
imbalance_pct is designed to provide the balance bias for the different
domain levels. We should tune that instead of biasing on a somewhat
arbitrary long term load.

Before the runnable average was introduced, the cpu_load indexes made
sense for tracking history load. With the runnable average, a big
instantaneous load means the tasks have been running for a long time,
so the task history is already counted.
As for the cpu's own history load, if it came from tasks that have since
moved to other cpus, it isn't right to keep counting their load on the
old cpu.
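
As a toy userspace illustration (my own numbers, not kernel code) of how
quickly the per-entity decay, with y^32 == 1/2 per ~1ms period, forgets
old history:

#include <math.h>
#include <stdio.h>

int main(void)
{
	const double y = pow(0.5, 1.0 / 32.0);	/* y^32 == 1/2 */
	int n;

	/* fraction of a converged average contributed by the last n periods */
	for (n = 32; n <= 512; n *= 2)
		printf("last ~%3d ms: %5.1f%%\n", n, 100.0 * (1.0 - pow(y, n)));

	return 0;
}

This prints roughly 50% / 75% / 93.8% / 99.6% / ~100%: anything older than
a few hundred milliseconds barely matters, so an extra slow-decay
cpu_load[] adds little information on top.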

> Morten


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RESEND PATCH V5 0/8] remove cpu_load idx
  2014-05-06 11:43       ` Peter Zijlstra
@ 2014-05-09 16:41         ` Alex Shi
  0 siblings, 0 replies; 22+ messages in thread
From: Alex Shi @ 2014-05-09 16:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Morten Rasmussen, Fengguang Wu, mingo, vincent.guittot,
	daniel.lezcano, efault, wangyun, linux-kernel, mgorman


On 5/6/14, 19:43, Peter Zijlstra wrote:
> On Tue, May 06, 2014 at 04:33:38PM +0800, Alex Shi wrote:
>>> Maybe the predication is reasonable on per task history. but on a cpu
>>> load history, with many tasks rebalance. No testing show current method
>>> is helpful.
>>>
>>> For task load change, scheduler has no idea for its future except guess
>>> from its history. but for cpu load change, scheduler know this from task
>>> wakeup and balance, which both under control and its aim.
>>>
>>>
>>> I think the first patch of this serial has the same effect of LB_LIAS
>>> disable. and previous result show performance is good.
>>>
>>> Anyway, I just pushed the following patch to github, maybe fengguang's
>>> testing system will care this.
>> Fengguang,
>>
>> Are there any performance change on
>> https://github.com/alexshi/power-scheduling.git noload repository?
> You forgot to qualify that with the important bit; on _large_ systems.
> Esp. non fully connected numa boxen. Also, I'm not sure Wu has workloads
> that are typical of such systems -- even if he has such machines, which
> I don't know either.

Fengguang,
Why not introduce your machines and workloads to us? It is a good chance
to sell your system. :)
>
> Enterprise distro testing has _some_ of that, but the very sad truth is
> that most enterprise users lag behind at least a full release cycle. So
> by the time people start using the kernel, its so old nobody really
> cares anymore :-(
That sounds bad.
Would Red Hat like to take some action on this? You have a big impact on
the Linux community!

And what is your plan for this patch set?
It removes much of the tick processing, and performance looks good as far
as it has been tested. :)

>
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2014-05-09 16:41 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-04-16  2:43 [RESEND PATCH V5 0/8] remove cpu_load idx Alex Shi
2014-04-16  2:43 ` [PATCH V5 1/8] sched: shortcut to remove load_idx Alex Shi
2014-04-16  2:43 ` [PATCH V5 2/8] sched: remove rq->cpu_load[load_idx] array Alex Shi
2014-04-16  2:43 ` [PATCH V5 3/8] sched: remove source_load and target_load Alex Shi
2014-04-24 14:18   ` Peter Zijlstra
2014-04-16  2:43 ` [PATCH V5 4/8] sched: remove LB_BIAS Alex Shi
2014-04-16  2:43 ` [PATCH V5 5/8] sched: clean up cpu_load update Alex Shi
2014-04-16  2:43 ` [PATCH V5 6/8] sched: rewrite update_cpu_load_nohz Alex Shi
2014-04-16  2:43 ` [PATCH V5 7/8] sched: remove rq->cpu_load and rq->nr_load_updates Alex Shi
2014-04-16  2:43 ` [PATCH V5 8/8] sched: rename update_*_cpu_load Alex Shi
2014-04-24 16:20 ` [RESEND PATCH V5 0/8] remove cpu_load idx Peter Zijlstra
2014-04-29 15:04   ` Morten Rasmussen
2014-05-09 16:30     ` Alex Shi
2014-05-01  0:41   ` Alex Shi
2014-05-06  9:54     ` Preeti Murthy
2014-05-06 11:39       ` Peter Zijlstra
2014-05-08  9:10         ` Preeti U Murthy
2014-04-29 14:52 ` Morten Rasmussen
2014-04-30  9:24   ` Alex Shi
2014-05-06  8:33     ` Alex Shi
2014-05-06 11:43       ` Peter Zijlstra
2014-05-09 16:41         ` Alex Shi
