* [PATCH 0/4] sched: remove cpu_load decay
@ 2013-12-03  9:05 Alex Shi
  2013-12-03  9:05 ` [PATCH 1/4] sched: shortcut to remove load_idx Alex Shi
                   ` (5 more replies)
  0 siblings, 6 replies; 34+ messages in thread
From: Alex Shi @ 2013-12-03  9:05 UTC (permalink / raw)
  To: mingo, peterz, morten.rasmussen, vincent.guittot, daniel.lezcano,
	fweisbec, linux, tony.luck, fenghua.yu, tglx, akpm, arjan, pjt,
	fengguang.wu
  Cc: james.hogan, alex.shi, jason.low2, gregkh, hanjun.guo, linux-kernel

The rq->cpu_load[] values decay over time according to the rq's past cpu load, while the newer sched_avg decays each task's load over time. So we now have two kinds of decay for cpu load. That is redundant, and it adds overhead in scheduler_tick() and elsewhere.
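
For reference, the per-tick rq->cpu_load[] decay that this series removes
boils down to the following (a simplified restatement of
__update_cpu_load(); see the hunk removed in patch 2):

	/* for each idx > 0, with scale = 1 << idx */
	cpu_load[idx] = (cpu_load[idx] * (scale - 1) + cur_load) >> idx;
	/* e.g. busy_idx == 2:  new = 3/4 * old + 1/4 * cur_load */

while the sched_avg (per-entity load tracking) decay is maintained
separately for every task and runqueue.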

This patchset removes the cpu_load decay.

There are 5 load_idx values used for cpu_load in sched_domain: busy_idx and idle_idx are usually non-zero, while newidle_idx, wake_idx and forkexec_idx are zero on every arch. The first patch is a shortcut to remove the cpu_load decay effect; it is just a one-line change. :)

I have tested the patchset on my PandaES board (2-core ARM Cortex-A9).
hackbench thread/pipe performance increased by nearly 8% with this patchset!

	latest kernel 527d1511310a89		+ this patchset
hackbench -T -g 10 -f 40
	23.25"					21.7"
	23.16"					19.99"
	24.24"					21.53"
hackbench -p -g 10 -f 40
	26.52"					22.48"
	23.89"					24.00"
	25.65"					23.06"
hackbench -P -g 10 -f 40
	20.14"					19.37"
	19.96"					19.76"
	21.76"					21.54"

Daniel tested this patchset on his 2-socket Core2 server; hackbench showed
no clear regression or improvement there.
This patchset has also been in Fengguang's x86 testing for more than a week,
with no regression reported so far.
Fengguang, can I assume there is no regression on Intel platforms? :)

The git tree for this patchset is at:
 git@github.com:alexshi/power-scheduling.git no-load-idx 

--
Thanks
	Alex

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/4] sched: shortcut to remove load_idx
  2013-12-03  9:05 [PATCH 0/4] sched: remove cpu_load decay Alex Shi
@ 2013-12-03  9:05 ` Alex Shi
  2013-12-03  9:05 ` [PATCH 2/4] sched: remove rq->cpu_load[load_idx] array Alex Shi
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 34+ messages in thread
From: Alex Shi @ 2013-12-03  9:05 UTC (permalink / raw)
  To: mingo, peterz, morten.rasmussen, vincent.guittot, daniel.lezcano,
	fweisbec, linux, tony.luck, fenghua.yu, tglx, akpm, arjan, pjt,
	fengguang.wu
  Cc: james.hogan, alex.shi, jason.low2, gregkh, hanjun.guo, linux-kernel

Shortcut to remove the rq->cpu_load[load_idx] effect from the scheduler.
Of the five load_idx values, only busy_idx and idle_idx are non-zero;
newidle_idx, wake_idx and forkexec_idx are zero on all archs.

So changing the index to zero here fully removes the load_idx effect.
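
The reason a one-line change is enough: with load_idx == 0 both
source_load() and target_load() return the instantaneous
weighted_cpuload() and never touch the decayed rq->cpu_load[] array,
as can be seen in the pre-patch logic removed by patch 2:

	/* with type == 0 */
	if (type == 0 || !sched_feat(LB_BIAS))
		return total;	/* total == weighted_cpuload(cpu) */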

Signed-off-by: Alex Shi <alex.shi@linaro.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e8b652e..ce683aa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5633,7 +5633,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;
 
-	load_idx = get_sd_load_idx(env->sd, env->idle);
+	load_idx = 0;
 
 	do {
 		struct sg_lb_stats *sgs = &tmp_sgs;
-- 
1.8.1.2


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 2/4] sched: remove rq->cpu_load[load_idx] array
  2013-12-03  9:05 [PATCH 0/4] sched: remove cpu_load decay Alex Shi
  2013-12-03  9:05 ` [PATCH 1/4] sched: shortcut to remove load_idx Alex Shi
@ 2013-12-03  9:05 ` Alex Shi
  2013-12-03  9:05 ` [PATCH 3/4] sched: clean up cpu_load update Alex Shi
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 34+ messages in thread
From: Alex Shi @ 2013-12-03  9:05 UTC (permalink / raw)
  To: mingo, peterz, morten.rasmussen, vincent.guittot, daniel.lezcano,
	fweisbec, linux, tony.luck, fenghua.yu, tglx, akpm, arjan, pjt,
	fengguang.wu
  Cc: james.hogan, alex.shi, jason.low2, gregkh, hanjun.guo, linux-kernel

Since the load_idx effect has been removed from load balancing, we no
longer need the per-index cpu_load decay in the scheduler. This saves
some processing in scheduler_tick() and other places.

Signed-off-by: Alex Shi <alex.shi@linaro.org>
---
 arch/ia64/include/asm/topology.h  |  5 ---
 arch/metag/include/asm/topology.h |  5 ---
 arch/tile/include/asm/topology.h  |  6 ---
 include/linux/sched.h             |  5 ---
 include/linux/topology.h          |  8 ----
 kernel/sched/core.c               | 60 ++++++++-----------------
 kernel/sched/debug.c              |  6 +--
 kernel/sched/fair.c               | 79 +++++++++------------------------
 kernel/sched/proc.c               | 92 ++-------------------------------------
 kernel/sched/sched.h              |  3 +-
 10 files changed, 43 insertions(+), 226 deletions(-)

diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h
index a2496e4..54e5b17 100644
--- a/arch/ia64/include/asm/topology.h
+++ b/arch/ia64/include/asm/topology.h
@@ -55,11 +55,6 @@ void build_cpu_to_node_map(void);
 	.busy_factor		= 64,			\
 	.imbalance_pct		= 125,			\
 	.cache_nice_tries	= 2,			\
-	.busy_idx		= 2,			\
-	.idle_idx		= 1,			\
-	.newidle_idx		= 0,			\
-	.wake_idx		= 0,			\
-	.forkexec_idx		= 0,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
 				| SD_BALANCE_EXEC	\
diff --git a/arch/metag/include/asm/topology.h b/arch/metag/include/asm/topology.h
index 8e9c0b3..d1d15cd 100644
--- a/arch/metag/include/asm/topology.h
+++ b/arch/metag/include/asm/topology.h
@@ -13,11 +13,6 @@
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
 	.cache_nice_tries	= 2,			\
-	.busy_idx		= 3,			\
-	.idle_idx		= 2,			\
-	.newidle_idx		= 0,			\
-	.wake_idx		= 0,			\
-	.forkexec_idx		= 0,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_FORK	\
 				| SD_BALANCE_EXEC	\
diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h
index d15c0d8..05f6ffe 100644
--- a/arch/tile/include/asm/topology.h
+++ b/arch/tile/include/asm/topology.h
@@ -57,12 +57,6 @@ static inline const struct cpumask *cpumask_of_node(int node)
 	.busy_factor		= 64,					\
 	.imbalance_pct		= 125,					\
 	.cache_nice_tries	= 1,					\
-	.busy_idx		= 2,					\
-	.idle_idx		= 1,					\
-	.newidle_idx		= 0,					\
-	.wake_idx		= 0,					\
-	.forkexec_idx		= 0,					\
-									\
 	.flags			= 1*SD_LOAD_BALANCE			\
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7e35d4b..a23e02d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -815,11 +815,6 @@ struct sched_domain {
 	unsigned int busy_factor;	/* less balancing by factor if busy */
 	unsigned int imbalance_pct;	/* No balance until over watermark */
 	unsigned int cache_nice_tries;	/* Leave cache hot tasks for # tries */
-	unsigned int busy_idx;
-	unsigned int idle_idx;
-	unsigned int newidle_idx;
-	unsigned int wake_idx;
-	unsigned int forkexec_idx;
 	unsigned int smt_gain;
 
 	int nohz_idle;			/* NOHZ IDLE status */
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 12ae6ce..863fad3 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -121,9 +121,6 @@ int arch_update_cpu_topology(void);
 	.busy_factor		= 64,					\
 	.imbalance_pct		= 125,					\
 	.cache_nice_tries	= 1,					\
-	.busy_idx		= 2,					\
-	.wake_idx		= 0,					\
-	.forkexec_idx		= 0,					\
 									\
 	.flags			= 1*SD_LOAD_BALANCE			\
 				| 1*SD_BALANCE_NEWIDLE			\
@@ -151,11 +148,6 @@ int arch_update_cpu_topology(void);
 	.busy_factor		= 64,					\
 	.imbalance_pct		= 125,					\
 	.cache_nice_tries	= 1,					\
-	.busy_idx		= 2,					\
-	.idle_idx		= 1,					\
-	.newidle_idx		= 0,					\
-	.wake_idx		= 0,					\
-	.forkexec_idx		= 0,					\
 									\
 	.flags			= 1*SD_LOAD_BALANCE			\
 				| 1*SD_BALANCE_NEWIDLE			\
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c180860..9528f75 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4279,61 +4279,42 @@ static void sd_free_ctl_entry(struct ctl_table **tablep)
 	*tablep = NULL;
 }
 
-static int min_load_idx = 0;
-static int max_load_idx = CPU_LOAD_IDX_MAX-1;
-
 static void
 set_table_entry(struct ctl_table *entry,
 		const char *procname, void *data, int maxlen,
-		umode_t mode, proc_handler *proc_handler,
-		bool load_idx)
+		umode_t mode, proc_handler *proc_handler)
 {
 	entry->procname = procname;
 	entry->data = data;
 	entry->maxlen = maxlen;
 	entry->mode = mode;
 	entry->proc_handler = proc_handler;
-
-	if (load_idx) {
-		entry->extra1 = &min_load_idx;
-		entry->extra2 = &max_load_idx;
-	}
 }
 
 static struct ctl_table *
 sd_alloc_ctl_domain_table(struct sched_domain *sd)
 {
-	struct ctl_table *table = sd_alloc_ctl_entry(13);
+	struct ctl_table *table = sd_alloc_ctl_entry(8);
 
 	if (table == NULL)
 		return NULL;
 
 	set_table_entry(&table[0], "min_interval", &sd->min_interval,
-		sizeof(long), 0644, proc_doulongvec_minmax, false);
+		sizeof(long), 0644, proc_doulongvec_minmax);
 	set_table_entry(&table[1], "max_interval", &sd->max_interval,
-		sizeof(long), 0644, proc_doulongvec_minmax, false);
-	set_table_entry(&table[2], "busy_idx", &sd->busy_idx,
-		sizeof(int), 0644, proc_dointvec_minmax, true);
-	set_table_entry(&table[3], "idle_idx", &sd->idle_idx,
-		sizeof(int), 0644, proc_dointvec_minmax, true);
-	set_table_entry(&table[4], "newidle_idx", &sd->newidle_idx,
-		sizeof(int), 0644, proc_dointvec_minmax, true);
-	set_table_entry(&table[5], "wake_idx", &sd->wake_idx,
-		sizeof(int), 0644, proc_dointvec_minmax, true);
-	set_table_entry(&table[6], "forkexec_idx", &sd->forkexec_idx,
-		sizeof(int), 0644, proc_dointvec_minmax, true);
-	set_table_entry(&table[7], "busy_factor", &sd->busy_factor,
-		sizeof(int), 0644, proc_dointvec_minmax, false);
-	set_table_entry(&table[8], "imbalance_pct", &sd->imbalance_pct,
-		sizeof(int), 0644, proc_dointvec_minmax, false);
-	set_table_entry(&table[9], "cache_nice_tries",
+		sizeof(long), 0644, proc_doulongvec_minmax);
+	set_table_entry(&table[2], "busy_factor", &sd->busy_factor,
+		sizeof(int), 0644, proc_dointvec_minmax);
+	set_table_entry(&table[3], "imbalance_pct", &sd->imbalance_pct,
+		sizeof(int), 0644, proc_dointvec_minmax);
+	set_table_entry(&table[4], "cache_nice_tries",
 		&sd->cache_nice_tries,
-		sizeof(int), 0644, proc_dointvec_minmax, false);
-	set_table_entry(&table[10], "flags", &sd->flags,
-		sizeof(int), 0644, proc_dointvec_minmax, false);
-	set_table_entry(&table[11], "name", sd->name,
-		CORENAME_MAX_SIZE, 0444, proc_dostring, false);
-	/* &table[12] is terminator */
+		sizeof(int), 0644, proc_dointvec_minmax);
+	set_table_entry(&table[5], "flags", &sd->flags,
+		sizeof(int), 0644, proc_dointvec_minmax);
+	set_table_entry(&table[6], "name", sd->name,
+		CORENAME_MAX_SIZE, 0444, proc_dostring);
+	/* &table[7] is terminator */
 
 	return table;
 }
@@ -5425,11 +5406,6 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
 		.busy_factor		= 32,
 		.imbalance_pct		= 125,
 		.cache_nice_tries	= 2,
-		.busy_idx		= 3,
-		.idle_idx		= 2,
-		.newidle_idx		= 0,
-		.wake_idx		= 0,
-		.forkexec_idx		= 0,
 
 		.flags			= 1*SD_LOAD_BALANCE
 					| 1*SD_BALANCE_NEWIDLE
@@ -6178,7 +6154,7 @@ DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
 
 void __init sched_init(void)
 {
-	int i, j;
+	int i;
 	unsigned long alloc_size = 0, ptr;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -6279,9 +6255,7 @@ void __init sched_init(void)
 		init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
 #endif
 
-		for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
-			rq->cpu_load[j] = 0;
-
+		rq->cpu_load = 0;
 		rq->last_load_update_tick = jiffies;
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 5c34d18..675be71 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -303,11 +303,7 @@ do {									\
 	PN(next_balance);
 	SEQ_printf(m, "  .%-30s: %ld\n", "curr->pid", (long)(task_pid_nr(rq->curr)));
 	PN(clock);
-	P(cpu_load[0]);
-	P(cpu_load[1]);
-	P(cpu_load[2]);
-	P(cpu_load[3]);
-	P(cpu_load[4]);
+	P(cpu_load);
 #undef P
 #undef PN
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ce683aa..bccdd89 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -977,8 +977,8 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
 }
 
 static unsigned long weighted_cpuload(const int cpu);
-static unsigned long source_load(int cpu, int type);
-static unsigned long target_load(int cpu, int type);
+static unsigned long source_load(int cpu);
+static unsigned long target_load(int cpu);
 static unsigned long power_of(int cpu);
 static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
 
@@ -3794,30 +3794,30 @@ static unsigned long weighted_cpuload(const int cpu)
  * We want to under-estimate the load of migration sources, to
  * balance conservatively.
  */
-static unsigned long source_load(int cpu, int type)
+static unsigned long source_load(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long total = weighted_cpuload(cpu);
 
-	if (type == 0 || !sched_feat(LB_BIAS))
+	if (!sched_feat(LB_BIAS))
 		return total;
 
-	return min(rq->cpu_load[type-1], total);
+	return min(rq->cpu_load, total);
 }
 
 /*
  * Return a high guess at the load of a migration-target cpu weighted
  * according to the scheduling class and "nice" value.
  */
-static unsigned long target_load(int cpu, int type)
+static unsigned long target_load(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long total = weighted_cpuload(cpu);
 
-	if (type == 0 || !sched_feat(LB_BIAS))
+	if (!sched_feat(LB_BIAS))
 		return total;
 
-	return max(rq->cpu_load[type-1], total);
+	return max(rq->cpu_load, total);
 }
 
 static unsigned long power_of(int cpu)
@@ -4017,7 +4017,7 @@ static int wake_wide(struct task_struct *p)
 static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 {
 	s64 this_load, load;
-	int idx, this_cpu, prev_cpu;
+	int this_cpu, prev_cpu;
 	unsigned long tl_per_task;
 	struct task_group *tg;
 	unsigned long weight;
@@ -4030,11 +4030,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 	if (wake_wide(p))
 		return 0;
 
-	idx	  = sd->wake_idx;
 	this_cpu  = smp_processor_id();
 	prev_cpu  = task_cpu(p);
-	load	  = source_load(prev_cpu, idx);
-	this_load = target_load(this_cpu, idx);
+	load	  = source_load(prev_cpu);
+	this_load = target_load(this_cpu);
 
 	/*
 	 * If sync wakeup then subtract the (maximum possible)
@@ -4090,7 +4089,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 
 	if (balanced ||
 	    (this_load <= load &&
-	     this_load + target_load(prev_cpu, idx) <= tl_per_task)) {
+	     this_load + target_load(prev_cpu) <= tl_per_task)) {
 		/*
 		 * This domain has SD_WAKE_AFFINE and
 		 * p is cache cold in this domain, and
@@ -4109,8 +4108,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
  * domain.
  */
 static struct sched_group *
-find_idlest_group(struct sched_domain *sd, struct task_struct *p,
-		  int this_cpu, int load_idx)
+find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
 	unsigned long min_load = ULONG_MAX, this_load = 0;
@@ -4135,9 +4133,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		for_each_cpu(i, sched_group_cpus(group)) {
 			/* Bias balancing toward cpus of our domain */
 			if (local_group)
-				load = source_load(i, load_idx);
+				load = source_load(i);
 			else
-				load = target_load(i, load_idx);
+				load = target_load(i);
 
 			avg_load += load;
 		}
@@ -4283,7 +4281,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	}
 
 	while (sd) {
-		int load_idx = sd->forkexec_idx;
 		struct sched_group *group;
 		int weight;
 
@@ -4292,10 +4289,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			continue;
 		}
 
-		if (sd_flag & SD_BALANCE_WAKE)
-			load_idx = sd->wake_idx;
-
-		group = find_idlest_group(sd, p, cpu, load_idx);
+		group = find_idlest_group(sd, p, cpu);
 		if (!group) {
 			sd = sd->child;
 			continue;
@@ -5238,34 +5232,6 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 	};
 }
 
-/**
- * get_sd_load_idx - Obtain the load index for a given sched domain.
- * @sd: The sched_domain whose load_idx is to be obtained.
- * @idle: The idle status of the CPU for whose sd load_idx is obtained.
- *
- * Return: The load index.
- */
-static inline int get_sd_load_idx(struct sched_domain *sd,
-					enum cpu_idle_type idle)
-{
-	int load_idx;
-
-	switch (idle) {
-	case CPU_NOT_IDLE:
-		load_idx = sd->busy_idx;
-		break;
-
-	case CPU_NEWLY_IDLE:
-		load_idx = sd->newidle_idx;
-		break;
-	default:
-		load_idx = sd->idle_idx;
-		break;
-	}
-
-	return load_idx;
-}
-
 static unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
 {
 	return SCHED_POWER_SCALE;
@@ -5492,12 +5458,11 @@ static inline int sg_capacity(struct lb_env *env, struct sched_group *group)
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
  * @env: The load balancing environment.
  * @group: sched_group whose statistics are to be updated.
- * @load_idx: Load index of sched_domain of this_cpu for load calc.
  * @local_group: Does group contain this_cpu.
  * @sgs: variable to hold the statistics for this group.
  */
 static inline void update_sg_lb_stats(struct lb_env *env,
-			struct sched_group *group, int load_idx,
+			struct sched_group *group,
 			int local_group, struct sg_lb_stats *sgs)
 {
 	unsigned long nr_running;
@@ -5513,9 +5478,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 		/* Bias balancing toward cpus of our domain */
 		if (local_group)
-			load = target_load(i, load_idx);
+			load = target_load(i);
 		else
-			load = source_load(i, load_idx);
+			load = source_load(i);
 
 		sgs->group_load += load;
 		sgs->sum_nr_running += nr_running;
@@ -5628,13 +5593,11 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	struct sched_domain *child = env->sd->child;
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats tmp_sgs;
-	int load_idx, prefer_sibling = 0;
+	int prefer_sibling = 0;
 
 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;
 
-	load_idx = 0;
-
 	do {
 		struct sg_lb_stats *sgs = &tmp_sgs;
 		int local_group;
@@ -5649,7 +5612,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 				update_group_power(env->sd, env->dst_cpu);
 		}
 
-		update_sg_lb_stats(env, sg, load_idx, local_group, sgs);
+		update_sg_lb_stats(env, sg, local_group, sgs);
 
 		if (local_group)
 			goto next_group;
diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
index 16f5a30..a2435c5 100644
--- a/kernel/sched/proc.c
+++ b/kernel/sched/proc.c
@@ -11,7 +11,7 @@
 unsigned long this_cpu_load(void)
 {
 	struct rq *this = this_rq();
-	return this->cpu_load[0];
+	return this->cpu_load;
 }
 
 
@@ -398,105 +398,19 @@ static void calc_load_account_active(struct rq *this_rq)
  * End of global load-average stuff
  */
 
-/*
- * The exact cpuload at various idx values, calculated at every tick would be
- * load = (2^idx - 1) / 2^idx * load + 1 / 2^idx * cur_load
- *
- * If a cpu misses updates for n-1 ticks (as it was idle) and update gets called
- * on nth tick when cpu may be busy, then we have:
- * load = ((2^idx - 1) / 2^idx)^(n-1) * load
- * load = (2^idx - 1) / 2^idx) * load + 1 / 2^idx * cur_load
- *
- * decay_load_missed() below does efficient calculation of
- * load = ((2^idx - 1) / 2^idx)^(n-1) * load
- * avoiding 0..n-1 loop doing load = ((2^idx - 1) / 2^idx) * load
- *
- * The calculation is approximated on a 128 point scale.
- * degrade_zero_ticks is the number of ticks after which load at any
- * particular idx is approximated to be zero.
- * degrade_factor is a precomputed table, a row for each load idx.
- * Each column corresponds to degradation factor for a power of two ticks,
- * based on 128 point scale.
- * Example:
- * row 2, col 3 (=12) says that the degradation at load idx 2 after
- * 8 ticks is 12/128 (which is an approximation of exact factor 3^8/4^8).
- *
- * With this power of 2 load factors, we can degrade the load n times
- * by looking at 1 bits in n and doing as many mult/shift instead of
- * n mult/shifts needed by the exact degradation.
- */
-#define DEGRADE_SHIFT		7
-static const unsigned char
-		degrade_zero_ticks[CPU_LOAD_IDX_MAX] = {0, 8, 32, 64, 128};
-static const unsigned char
-		degrade_factor[CPU_LOAD_IDX_MAX][DEGRADE_SHIFT + 1] = {
-					{0, 0, 0, 0, 0, 0, 0, 0},
-					{64, 32, 8, 0, 0, 0, 0, 0},
-					{96, 72, 40, 12, 1, 0, 0},
-					{112, 98, 75, 43, 15, 1, 0},
-					{120, 112, 98, 76, 45, 16, 2} };
 
 /*
- * Update cpu_load for any missed ticks, due to tickless idle. The backlog
- * would be when CPU is idle and so we just decay the old load without
- * adding any new load.
- */
-static unsigned long
-decay_load_missed(unsigned long load, unsigned long missed_updates, int idx)
-{
-	int j = 0;
-
-	if (!missed_updates)
-		return load;
-
-	if (missed_updates >= degrade_zero_ticks[idx])
-		return 0;
-
-	if (idx == 1)
-		return load >> missed_updates;
-
-	while (missed_updates) {
-		if (missed_updates % 2)
-			load = (load * degrade_factor[idx][j]) >> DEGRADE_SHIFT;
-
-		missed_updates >>= 1;
-		j++;
-	}
-	return load;
-}
-
-/*
- * Update rq->cpu_load[] statistics. This function is usually called every
+ * Update rq->cpu_load statistics. This function is usually called every
  * scheduler tick (TICK_NSEC). With tickless idle this will not be called
  * every tick. We fix it up based on jiffies.
  */
 static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
 			      unsigned long pending_updates)
 {
-	int i, scale;
-
 	this_rq->nr_load_updates++;
 
 	/* Update our load: */
-	this_rq->cpu_load[0] = this_load; /* Fasttrack for idx 0 */
-	for (i = 1, scale = 2; i < CPU_LOAD_IDX_MAX; i++, scale += scale) {
-		unsigned long old_load, new_load;
-
-		/* scale is effectively 1 << i now, and >> i divides by scale */
-
-		old_load = this_rq->cpu_load[i];
-		old_load = decay_load_missed(old_load, pending_updates - 1, i);
-		new_load = this_load;
-		/*
-		 * Round up the averaging division if load is increasing. This
-		 * prevents us from getting stuck on 9 if the load is 10, for
-		 * example.
-		 */
-		if (new_load > old_load)
-			new_load += scale - 1;
-
-		this_rq->cpu_load[i] = (old_load * (scale - 1) + new_load) >> i;
-	}
+	this_rq->cpu_load = this_load; /* Fasttrack for idx 0 */
 
 	sched_avg_update(this_rq);
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 88c85b2..01f6e7a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -413,8 +413,7 @@ struct rq {
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
 #endif
-	#define CPU_LOAD_IDX_MAX 5
-	unsigned long cpu_load[CPU_LOAD_IDX_MAX];
+	unsigned long cpu_load;
 	unsigned long last_load_update_tick;
 #ifdef CONFIG_NO_HZ_COMMON
 	u64 nohz_stamp;
-- 
1.8.1.2


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 3/4] sched: clean up cpu_load update
  2013-12-03  9:05 [PATCH 0/4] sched: remove cpu_load decay Alex Shi
  2013-12-03  9:05 ` [PATCH 1/4] sched: shortcut to remove load_idx Alex Shi
  2013-12-03  9:05 ` [PATCH 2/4] sched: remove rq->cpu_load[load_idx] array Alex Shi
@ 2013-12-03  9:05 ` Alex Shi
  2013-12-03  9:05 ` [PATCH 4/4] sched: bias to target cpu load to reduce task moving Alex Shi
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 34+ messages in thread
From: Alex Shi @ 2013-12-03  9:05 UTC (permalink / raw)
  To: mingo, peterz, morten.rasmussen, vincent.guittot, daniel.lezcano,
	fweisbec, linux, tony.luck, fenghua.yu, tglx, akpm, arjan, pjt,
	fengguang.wu
  Cc: james.hogan, alex.shi, jason.low2, gregkh, hanjun.guo, linux-kernel

Since we no longer decay rq->cpu_load, we don't need pending_updates.
But we still want to update rq->rt_avg, so keep
rq->last_load_update_tick and the __update_cpu_load() function.

After removing load_idx, source_load() is equal to target_load() most
of the time; the only exception is when the source cpu is idle, where
cpu_load is forced to 0 (in update_cpu_load_nohz()). So cpu_load stays
in struct rq.
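
Concretely, the only place load balancing still reads the rq->cpu_load
snapshot after this series is the LB_BIAS clamp (mirroring the code
left by patch 2):

	source_load(cpu): min(rq->cpu_load, weighted_cpuload(cpu))  /* under-estimate the source */
	target_load(cpu): max(rq->cpu_load, weighted_cpuload(cpu))  /* over-estimate the target */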

Signed-off-by: Alex Shi <alex.shi@linaro.org>
---
 kernel/sched/proc.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
index a2435c5..057bb9b 100644
--- a/kernel/sched/proc.c
+++ b/kernel/sched/proc.c
@@ -404,8 +404,7 @@ static void calc_load_account_active(struct rq *this_rq)
  * scheduler tick (TICK_NSEC). With tickless idle this will not be called
  * every tick. We fix it up based on jiffies.
  */
-static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
-			      unsigned long pending_updates)
+static void __update_cpu_load(struct rq *this_rq, unsigned long this_load)
 {
 	this_rq->nr_load_updates++;
 
@@ -449,7 +448,6 @@ void update_idle_cpu_load(struct rq *this_rq)
 {
 	unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
 	unsigned long load = get_rq_runnable_load(this_rq);
-	unsigned long pending_updates;
 
 	/*
 	 * bail if there's load or we're actually up-to-date.
@@ -457,10 +455,9 @@ void update_idle_cpu_load(struct rq *this_rq)
 	if (load || curr_jiffies == this_rq->last_load_update_tick)
 		return;
 
-	pending_updates = curr_jiffies - this_rq->last_load_update_tick;
 	this_rq->last_load_update_tick = curr_jiffies;
 
-	__update_cpu_load(this_rq, load, pending_updates);
+	__update_cpu_load(this_rq, load);
 }
 
 /*
@@ -483,7 +480,7 @@ void update_cpu_load_nohz(void)
 		 * We were idle, this means load 0, the current load might be
 		 * !0 due to remote wakeups and the sort.
 		 */
-		__update_cpu_load(this_rq, 0, pending_updates);
+		__update_cpu_load(this_rq, 0);
 	}
 	raw_spin_unlock(&this_rq->lock);
 }
@@ -499,7 +496,7 @@ void update_cpu_load_active(struct rq *this_rq)
 	 * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
 	 */
 	this_rq->last_load_update_tick = jiffies;
-	__update_cpu_load(this_rq, load, 1);
+	__update_cpu_load(this_rq, load);
 
 	calc_load_account_active(this_rq);
 }
-- 
1.8.1.2


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 4/4] sched: bias to target cpu load to reduce task moving
  2013-12-03  9:05 [PATCH 0/4] sched: remove cpu_load decay Alex Shi
                   ` (2 preceding siblings ...)
  2013-12-03  9:05 ` [PATCH 3/4] sched: clean up cpu_load update Alex Shi
@ 2013-12-03  9:05 ` Alex Shi
  2013-12-04  9:06   ` Yuanhan Liu
  2013-12-17 14:10   ` Morten Rasmussen
  2013-12-03 10:26 ` [PATCH 0/4] sched: remove cpu_load decay Peter Zijlstra
  2013-12-13 20:03 ` Peter Zijlstra
  5 siblings, 2 replies; 34+ messages in thread
From: Alex Shi @ 2013-12-03  9:05 UTC (permalink / raw)
  To: mingo, peterz, morten.rasmussen, vincent.guittot, daniel.lezcano,
	fweisbec, linux, tony.luck, fenghua.yu, tglx, akpm, arjan, pjt,
	fengguang.wu
  Cc: james.hogan, alex.shi, jason.low2, gregkh, hanjun.guo, linux-kernel

Task migration can happen even when the target cpu load is only a bit
less than the source cpu load. To make that happen less often, bias
the target cpu load upward by sd->imbalance_pct/100.
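
As a rough example, assuming imbalance_pct == 125 for the domain:

	weighted_cpuload(target) == 1000  ->  target_load() returns 1250

so the target only looks less loaded than the source once the real gap
exceeds 25%, instead of a migration being triggered by any tiny
difference.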

This patch removes the hackbench thread regression on Daniel's
Intel Core2 server.

a5d6e63		+patch1~3		+patch1~4
hackbench -T -s 4096 -l 1000 -g 10 -f 40
27.914"         38.694"			28.587"
28.390"         38.341"			29.513"
28.048"         38.626"			28.706"

Signed-off-by: Alex Shi <alex.shi@linaro.org>
---
 kernel/sched/fair.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bccdd89..c49b7ba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -978,7 +978,7 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
 
 static unsigned long weighted_cpuload(const int cpu);
 static unsigned long source_load(int cpu);
-static unsigned long target_load(int cpu);
+static unsigned long target_load(int cpu, int imbalance_pct);
 static unsigned long power_of(int cpu);
 static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
 
@@ -3809,11 +3809,17 @@ static unsigned long source_load(int cpu)
  * Return a high guess at the load of a migration-target cpu weighted
  * according to the scheduling class and "nice" value.
  */
-static unsigned long target_load(int cpu)
+static unsigned long target_load(int cpu, int imbalance_pct)
 {
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long total = weighted_cpuload(cpu);
 
+	/*
+	 * without cpu_load decay, in most of time cpu_load is same as total
+	 * so we need to make target a bit heavier to reduce task migration
+	 */
+	total = total * imbalance_pct / 100;
+
 	if (!sched_feat(LB_BIAS))
 		return total;
 
@@ -4033,7 +4039,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 	this_cpu  = smp_processor_id();
 	prev_cpu  = task_cpu(p);
 	load	  = source_load(prev_cpu);
-	this_load = target_load(this_cpu);
+	this_load = target_load(this_cpu, 100);
 
 	/*
 	 * If sync wakeup then subtract the (maximum possible)
@@ -4089,7 +4095,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 
 	if (balanced ||
 	    (this_load <= load &&
-	     this_load + target_load(prev_cpu) <= tl_per_task)) {
+	     this_load + target_load(prev_cpu, 100) <= tl_per_task)) {
 		/*
 		 * This domain has SD_WAKE_AFFINE and
 		 * p is cache cold in this domain, and
@@ -4135,7 +4141,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 			if (local_group)
 				load = source_load(i);
 			else
-				load = target_load(i);
+				load = target_load(i, sd->imbalance_pct);
 
 			avg_load += load;
 		}
@@ -5478,7 +5484,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 		/* Bias balancing toward cpus of our domain */
 		if (local_group)
-			load = target_load(i);
+			load = target_load(i, env->sd->imbalance_pct);
 		else
 			load = source_load(i);
 
-- 
1.8.1.2


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/4] sched: remove cpu_load decay
  2013-12-03  9:05 [PATCH 0/4] sched: remove cpu_load decay Alex Shi
                   ` (3 preceding siblings ...)
  2013-12-03  9:05 ` [PATCH 4/4] sched: bias to target cpu load to reduce task moving Alex Shi
@ 2013-12-03 10:26 ` Peter Zijlstra
  2013-12-10  1:04   ` Alex Shi
  2013-12-13 20:03 ` Peter Zijlstra
  5 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2013-12-03 10:26 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, morten.rasmussen, vincent.guittot, daniel.lezcano,
	fweisbec, linux, tony.luck, fenghua.yu, tglx, akpm, arjan, pjt,
	fengguang.wu, james.hogan, jason.low2, gregkh, hanjun.guo,
	linux-kernel


Paul, can you guys have a look at this, last time around you have a
regression with this stuff, so it would be good to hear from you.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
  2013-12-03  9:05 ` [PATCH 4/4] sched: bias to target cpu load to reduce task moving Alex Shi
@ 2013-12-04  9:06   ` Yuanhan Liu
  2013-12-04 11:25     ` Alex Shi
  2013-12-17 14:10   ` Morten Rasmussen
  1 sibling, 1 reply; 34+ messages in thread
From: Yuanhan Liu @ 2013-12-04  9:06 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, morten.rasmussen, vincent.guittot, daniel.lezcano,
	fweisbec, linux, tony.luck, fenghua.yu, tglx, akpm, arjan, pjt,
	fengguang.wu, james.hogan, jason.low2, gregkh, hanjun.guo,
	linux-kernel, Huang Ying

On Tue, Dec 03, 2013 at 05:05:56PM +0800, Alex Shi wrote:
> Task migration happens when target just a bit less then source cpu load.
> To reduce such situation happens, aggravate the target cpu load with
> sd->imbalance_pct/100.
> 
> This patch removes the hackbench thread regression on Daniel's
> Intel Core2 server.
> 
> a5d6e63		+patch1~3		+patch1~4
> hackbench -T -s 4096 -l 1000 -g 10 -f 40
> 27.914"         38.694"			28.587"
> 28.390"         38.341"			29.513"
> 28.048"         38.626"			28.706"
> 
> Signed-off-by: Alex Shi <alex.shi@linaro.org>

Hi Alex,

We observed a 150% performance gain with the vm-scalability/300s-mmap-pread-seq
testcase with this patch applied. Here is a list of the changes we have seen so far:

testbox : brickland
testcase: vm-scalability/300s-mmap-pread-seq


    f1b6442c7dd12802e622      d70495ef86f397816d73  
       (parent commit)            (this commit)
------------------------  ------------------------  
             26393249.80      +150.9%  66223933.60  vm-scalability.throughput

                  225.12       -49.9%       112.75  time.elapsed_time
                36333.40       -90.7%      3392.20  vmstat.system.cs
                    2.40      +375.0%        11.40  vmstat.cpu.id
              3770081.60       -97.7%     87673.40  time.major_page_faults
              3975276.20       -97.0%    117409.60  time.voluntary_context_switches
                    3.05      +301.7%        12.24  iostat.cpu.idle
                21118.41       -70.3%      6277.19  time.system_time
                   18.40      +130.4%        42.40  vmstat.cpu.us
                   77.00       -41.3%        45.20  vmstat.cpu.sy
                47459.60       -31.3%     32592.20  vmstat.system.in
                82435.40       -12.1%     72443.60  time.involuntary_context_switches
                 5128.13       +14.0%      5848.30  time.user_time
                11656.20        -7.8%     10745.60  time.percent_of_cpu_this_job_got
           1069997484.80        +0.3% 1073679919.00 time.minor_page_faults


	--yliu
> ---
>  kernel/sched/fair.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bccdd89..c49b7ba 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -978,7 +978,7 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
>  
>  static unsigned long weighted_cpuload(const int cpu);
>  static unsigned long source_load(int cpu);
> -static unsigned long target_load(int cpu);
> +static unsigned long target_load(int cpu, int imbalance_pct);
>  static unsigned long power_of(int cpu);
>  static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
>  
> @@ -3809,11 +3809,17 @@ static unsigned long source_load(int cpu)
>   * Return a high guess at the load of a migration-target cpu weighted
>   * according to the scheduling class and "nice" value.
>   */
> -static unsigned long target_load(int cpu)
> +static unsigned long target_load(int cpu, int imbalance_pct)
>  {
>  	struct rq *rq = cpu_rq(cpu);
>  	unsigned long total = weighted_cpuload(cpu);
>  
> +	/*
> +	 * without cpu_load decay, in most of time cpu_load is same as total
> +	 * so we need to make target a bit heavier to reduce task migration
> +	 */
> +	total = total * imbalance_pct / 100;
> +
>  	if (!sched_feat(LB_BIAS))
>  		return total;
>  
> @@ -4033,7 +4039,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>  	this_cpu  = smp_processor_id();
>  	prev_cpu  = task_cpu(p);
>  	load	  = source_load(prev_cpu);
> -	this_load = target_load(this_cpu);
> +	this_load = target_load(this_cpu, 100);
>  
>  	/*
>  	 * If sync wakeup then subtract the (maximum possible)
> @@ -4089,7 +4095,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>  
>  	if (balanced ||
>  	    (this_load <= load &&
> -	     this_load + target_load(prev_cpu) <= tl_per_task)) {
> +	     this_load + target_load(prev_cpu, 100) <= tl_per_task)) {
>  		/*
>  		 * This domain has SD_WAKE_AFFINE and
>  		 * p is cache cold in this domain, and
> @@ -4135,7 +4141,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  			if (local_group)
>  				load = source_load(i);
>  			else
> -				load = target_load(i);
> +				load = target_load(i, sd->imbalance_pct);
>  
>  			avg_load += load;
>  		}
> @@ -5478,7 +5484,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  
>  		/* Bias balancing toward cpus of our domain */
>  		if (local_group)
> -			load = target_load(i);
> +			load = target_load(i, env->sd->imbalance_pct);
>  		else
>  			load = source_load(i);
>  
> -- 
> 1.8.1.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
  2013-12-04  9:06   ` Yuanhan Liu
@ 2013-12-04 11:25     ` Alex Shi
  0 siblings, 0 replies; 34+ messages in thread
From: Alex Shi @ 2013-12-04 11:25 UTC (permalink / raw)
  To: Yuanhan Liu
  Cc: mingo, peterz, morten.rasmussen, vincent.guittot, daniel.lezcano,
	fweisbec, linux, tony.luck, fenghua.yu, tglx, akpm, arjan, pjt,
	fengguang.wu, james.hogan, jason.low2, gregkh, hanjun.guo,
	linux-kernel, Huang Ying


> We obsevered 150% performance gain with vm-scalability/300s-mmap-pread-seq
> testcase with this patch applied. Here is a list of changes we got so far:
> 
> testbox : brickland

Found some explanation of brickland on the wiki:
a high-end server platform based on the Ivy Bridge-EX processor.
> testcase: vm-scalability/300s-mmap-pread-seq

https://github.com/aristeu/vm-scalability

Thanks a lot for testing! :)
Do you have data for the base upstream commit?

> 
> 
>     f1b6442c7dd12802e622      d70495ef86f397816d73  
>        (parent commit)            (this commit)
> ------------------------  ------------------------  
>              26393249.80      +150.9%  66223933.60  vm-scalability.throughput
> 
>                   225.12       -49.9%       112.75  time.elapsed_time
>                 36333.40       -90.7%      3392.20  vmstat.system.cs
>                     2.40      +375.0%        11.40  vmstat.cpu.id
>               3770081.60       -97.7%     87673.40  time.major_page_faults
>               3975276.20       -97.0%    117409.60  time.voluntary_context_switches
>                     3.05      +301.7%        12.24  iostat.cpu.idle
>                 21118.41       -70.3%      6277.19  time.system_time
>                    18.40      +130.4%        42.40  vmstat.cpu.us
>                    77.00       -41.3%        45.20  vmstat.cpu.sy
>                 47459.60       -31.3%     32592.20  vmstat.system.in
>                 82435.40       -12.1%     72443.60  time.involuntary_context_switches
>                  5128.13       +14.0%      5848.30  time.user_time
>                 11656.20        -7.8%     10745.60  time.percent_of_cpu_this_job_got
>            1069997484.80        +0.3% 1073679919.00 time.minor_page_faults
> 

-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/4] sched: remove cpu_load decay
  2013-12-03 10:26 ` [PATCH 0/4] sched: remove cpu_load decay Peter Zijlstra
@ 2013-12-10  1:04   ` Alex Shi
  2013-12-10  1:06     ` Paul Turner
  2013-12-13 19:50     ` bsegall
  0 siblings, 2 replies; 34+ messages in thread
From: Alex Shi @ 2013-12-10  1:04 UTC (permalink / raw)
  To: pjt
  Cc: Peter Zijlstra, mingo, morten.rasmussen, vincent.guittot,
	daniel.lezcano, fweisbec, linux, tony.luck, fenghua.yu, tglx,
	akpm, arjan, fengguang.wu, james.hogan, jason.low2, gregkh,
	hanjun.guo, linux-kernel

On 12/03/2013 06:26 PM, Peter Zijlstra wrote:
> 
> Paul, can you guys have a look at this, last time around you have a
> regression with this stuff, so it would be good to hear from you.
> 

Ping Paul.

-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/4] sched: remove cpu_load decay
  2013-12-10  1:04   ` Alex Shi
@ 2013-12-10  1:06     ` Paul Turner
  2013-12-13 19:50     ` bsegall
  1 sibling, 0 replies; 34+ messages in thread
From: Paul Turner @ 2013-12-10  1:06 UTC (permalink / raw)
  To: Alex Shi
  Cc: Peter Zijlstra, Ingo Molnar, Morten Rasmussen, Vincent Guittot,
	Daniel Lezcano, Frédéric Weisbecker, linux, tony.luck,
	fenghua.yu, Thomas Gleixner, Andrew Morton, Arjan van de Ven,
	Fengguang Wu, james.hogan, Jason Low, Greg Kroah-Hartman,
	hanjun.guo, LKML

On Mon, Dec 9, 2013 at 5:04 PM, Alex Shi <alex.shi@linaro.org> wrote:
> On 12/03/2013 06:26 PM, Peter Zijlstra wrote:
>>
>> Paul, can you guys have a look at this, last time around you have a
>> regression with this stuff, so it would be good to hear from you.
>>
>
> Ping Paul.
>

Ben was looking at this right before the thanksgiving break.  I'll ask
him what he has tomorrow.

> --
> Thanks
>     Alex

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/4] sched: remove cpu_load decay
  2013-12-10  1:04   ` Alex Shi
  2013-12-10  1:06     ` Paul Turner
@ 2013-12-13 19:50     ` bsegall
  2013-12-14 12:53       ` Alex Shi
  1 sibling, 1 reply; 34+ messages in thread
From: bsegall @ 2013-12-13 19:50 UTC (permalink / raw)
  To: Alex Shi
  Cc: pjt, Peter Zijlstra, mingo, morten.rasmussen, vincent.guittot,
	daniel.lezcano, fweisbec, linux, tony.luck, fenghua.yu, tglx,
	akpm, arjan, fengguang.wu, james.hogan, jason.low2, gregkh,
	hanjun.guo, linux-kernel

Alex Shi <alex.shi@linaro.org> writes:

> On 12/03/2013 06:26 PM, Peter Zijlstra wrote:
>> 
>> Paul, can you guys have a look at this, last time around you have a
>> regression with this stuff, so it would be good to hear from you.
>> 

This appears to be neutral on our search benchmark.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/4] sched: remove cpu_load decay
  2013-12-03  9:05 [PATCH 0/4] sched: remove cpu_load decay Alex Shi
                   ` (4 preceding siblings ...)
  2013-12-03 10:26 ` [PATCH 0/4] sched: remove cpu_load decay Peter Zijlstra
@ 2013-12-13 20:03 ` Peter Zijlstra
  2013-12-14 13:27   ` Alex Shi
  5 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2013-12-13 20:03 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, morten.rasmussen, vincent.guittot, daniel.lezcano,
	fweisbec, linux, tony.luck, fenghua.yu, tglx, akpm, arjan, pjt,
	fengguang.wu, james.hogan, jason.low2, gregkh, hanjun.guo,
	linux-kernel



I had a quick peek at the actual patches.

afaict we're now using weighted_cpuload() aka runnable_load_avg as the
->cpu_load. Whatever happened to also using the blocked_avg?

I totally hate patch 4; it seems like a random hack to make up for the
lack of blocked_avg.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/4] sched: remove cpu_load decay
  2013-12-13 19:50     ` bsegall
@ 2013-12-14 12:53       ` Alex Shi
  0 siblings, 0 replies; 34+ messages in thread
From: Alex Shi @ 2013-12-14 12:53 UTC (permalink / raw)
  To: bsegall
  Cc: pjt, Peter Zijlstra, mingo, morten.rasmussen, vincent.guittot,
	daniel.lezcano, fweisbec, linux, tony.luck, fenghua.yu, tglx,
	akpm, arjan, fengguang.wu, james.hogan, jason.low2, gregkh,
	hanjun.guo, linux-kernel

On 12/14/2013 03:50 AM, bsegall@google.com wrote:
> Alex Shi <alex.shi@linaro.org> writes:
> 
>> On 12/03/2013 06:26 PM, Peter Zijlstra wrote:
>>>
>>> Paul, can you guys have a look at this, last time around you have a
>>> regression with this stuff, so it would be good to hear from you.
>>>
> 
> This appears to be neutral on our search benchmark.
> 


Thanks for testing!

-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/4] sched: remove cpu_load decay
  2013-12-13 20:03 ` Peter Zijlstra
@ 2013-12-14 13:27   ` Alex Shi
  2013-12-17 14:04     ` Morten Rasmussen
  0 siblings, 1 reply; 34+ messages in thread
From: Alex Shi @ 2013-12-14 13:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, morten.rasmussen, vincent.guittot, daniel.lezcano,
	fweisbec, linux, tony.luck, fenghua.yu, tglx, akpm, arjan, pjt,
	fengguang.wu, james.hogan, jason.low2, gregkh, hanjun.guo,
	linux-kernel

On 12/14/2013 04:03 AM, Peter Zijlstra wrote:
> 
> 
> I had a quick peek at the actual patches.
> 
> afaict we're now using weighted_cpuload() aka runnable_load_avg as the
> ->cpu_load. Whatever happened to also using the blocked_avg?

When enabling sched_avg in load balancing, I didn't find any positive
test results from several attempts at using blocked_avg, just a few
regressions. :(

And since this patchset is almost a cleanup only, I didn't try
blocked_load_avg again...

> 
> I totally hate patch 4; it seems like a random hack to make up for the
> lack of blocked_avg.

Yes, this bias criterion seems a bit arbitrary. :)
But anyway, even with blocked_load_avg, we still need to consider
biasing toward the local cpu, for example in a scenario where 2 cpus
both have nearly zero blocked_load_avg.

BTW,

Paul, do you have any new ideas on how to use blocked_load_avg?

-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/4] sched: remove cpu_load decay
  2013-12-14 13:27   ` Alex Shi
@ 2013-12-17 14:04     ` Morten Rasmussen
  2013-12-17 15:37       ` Peter Zijlstra
  0 siblings, 1 reply; 34+ messages in thread
From: Morten Rasmussen @ 2013-12-17 14:04 UTC (permalink / raw)
  To: Alex Shi
  Cc: Peter Zijlstra, mingo, vincent.guittot, daniel.lezcano, fweisbec,
	linux, tony.luck, fenghua.yu, tglx, akpm, arjan, pjt,
	fengguang.wu, james.hogan, jason.low2, gregkh, hanjun.guo,
	linux-kernel

On Sat, Dec 14, 2013 at 01:27:59PM +0000, Alex Shi wrote:
> On 12/14/2013 04:03 AM, Peter Zijlstra wrote:
> > 
> > 
> > I had a quick peek at the actual patches.
> > 
> > afaict we're now using weighted_cpuload() aka runnable_load_avg as the
> > ->cpu_load. Whatever happened to also using the blocked_avg?

AFAICT, ->cpu_load is actually a snapshot value of weighted_cpuload()
that gets updated occasionally. That has been the case since b92486cb.
By removing the cpu_load indexes, {source,target}_load() are now comparing
an old snapshot of weighted_cpuload() with the current value. I don't
think that really makes sense. weighted_cpuload() may change rapidly
when tasks are enqueued or dequeued so the old snapshot doesn't have
much meaning in my opinion. Maybe I'm missing something?

Comparing cpu_load indexes with different decay rates in {source,
target}_load() sort of makes sense as it makes load-balancing decisions
more conservative.

If we can indeed remove decayed cpu_load there is more code that should
be revisited and potentially be ripped out. {source,target}_load() could
probably be reduced to weighted_cpuload(), which would change the
load-balance behaviour. However, these patches already affect
load-balancing as indicated by the fix in patch 4.

I believe we have discussed using blocked_load_avg in weighted_cpuload()
in the past. While it seems to be the right thing to include it, it
causes problems related to the priority scaling of the task loads.
If you include blocked load in weighted_cpuload() and have a tiny
(very low cpu utilization) task running at very high priority, your
weighted_cpuload() will be quite high and will force other normal
priority tasks away from the cpu, leaving it idle most of the time.
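
To put rough numbers on it (using the standard nice-to-weight table,
where nice 0 maps to a weight of 1024 and nice -20 to 88761):

	nice   0 task, 100% runnable:  load contribution ~= 1024
	nice -20 task,   2% runnable:  load contribution ~= 88761 * 0.02 ~= 1775

so an almost idle but very high priority task, even while blocked, can
make the cpu appear more loaded than a cpu that is fully busy running a
normal priority task.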

> 
> When enabling the sched_avg in load balance, I didn't find any positive
> testing result for several blocked_avg trying, just few regression. :(
> 
> And since this patchset is almost clean up only, no blocked_load_avg
> trying again...

My worry here is that I don't really understand why the current code
works when the decayed cpu_load has been removed.

> > 
> > I totally hate patch 4; it seems like a random hack to make up for the
> > lack of blocked_avg.
> 
> Yes, this bias criteria seems a bit arbitrary. :)

This is why I think {source, target}_load() and their use need to be
reconsidered.

Morten

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
  2013-12-03  9:05 ` [PATCH 4/4] sched: bias to target cpu load to reduce task moving Alex Shi
  2013-12-04  9:06   ` Yuanhan Liu
@ 2013-12-17 14:10   ` Morten Rasmussen
  2013-12-17 15:38     ` Peter Zijlstra
  1 sibling, 1 reply; 34+ messages in thread
From: Morten Rasmussen @ 2013-12-17 14:10 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, vincent.guittot, daniel.lezcano, fweisbec, linux,
	tony.luck, fenghua.yu, tglx, akpm, arjan, pjt, fengguang.wu,
	james.hogan, jason.low2, gregkh, hanjun.guo, linux-kernel

On Tue, Dec 03, 2013 at 09:05:56AM +0000, Alex Shi wrote:
> Task migration happens when target just a bit less then source cpu load.
> To reduce such situation happens, aggravate the target cpu load with
> sd->imbalance_pct/100.
> 
> This patch removes the hackbench thread regression on Daniel's
> Intel Core2 server.
> 
> a5d6e63		+patch1~3		+patch1~4
> hackbench -T -s 4096 -l 1000 -g 10 -f 40
> 27.914"         38.694"			28.587"
> 28.390"         38.341"			29.513"
> 28.048"         38.626"			28.706"
> 
> Signed-off-by: Alex Shi <alex.shi@linaro.org>
> ---
>  kernel/sched/fair.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bccdd89..c49b7ba 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -978,7 +978,7 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
>  
>  static unsigned long weighted_cpuload(const int cpu);
>  static unsigned long source_load(int cpu);
> -static unsigned long target_load(int cpu);
> +static unsigned long target_load(int cpu, int imbalance_pct);
>  static unsigned long power_of(int cpu);
>  static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
>  
> @@ -3809,11 +3809,17 @@ static unsigned long source_load(int cpu)
>   * Return a high guess at the load of a migration-target cpu weighted
>   * according to the scheduling class and "nice" value.
>   */
> -static unsigned long target_load(int cpu)
> +static unsigned long target_load(int cpu, int imbalance_pct)
>  {
>  	struct rq *rq = cpu_rq(cpu);
>  	unsigned long total = weighted_cpuload(cpu);
>  
> +	/*
> +	 * without cpu_load decay, in most of time cpu_load is same as total
> +	 * so we need to make target a bit heavier to reduce task migration
> +	 */
> +	total = total * imbalance_pct / 100;
> +
>  	if (!sched_feat(LB_BIAS))
>  		return total;
>  
> @@ -4033,7 +4039,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>  	this_cpu  = smp_processor_id();
>  	prev_cpu  = task_cpu(p);
>  	load	  = source_load(prev_cpu);
> -	this_load = target_load(this_cpu);
> +	this_load = target_load(this_cpu, 100);
>  
>  	/*
>  	 * If sync wakeup then subtract the (maximum possible)
> @@ -4089,7 +4095,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>  
>  	if (balanced ||
>  	    (this_load <= load &&
> -	     this_load + target_load(prev_cpu) <= tl_per_task)) {
> +	     this_load + target_load(prev_cpu, 100) <= tl_per_task)) {
>  		/*
>  		 * This domain has SD_WAKE_AFFINE and
>  		 * p is cache cold in this domain, and
> @@ -4135,7 +4141,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  			if (local_group)
>  				load = source_load(i);
>  			else
> -				load = target_load(i);
> +				load = target_load(i, sd->imbalance_pct);

Don't you apply imbalance_pct twice here? Later on in
find_idlest_group() you have:

	if (!idlest || 100*this_load < imbalance*min_load)
		return NULL;

where min_load comes from target_load().
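
As a rough illustration, with imbalance_pct == 125 and assuming the
imbalance factor above is still derived as
100 + (sd->imbalance_pct - 100) / 2 == 112:

	remote weighted_cpuload() == 1000
	-> min_load == 1250 after target_load()
	-> idlest is only used when 100 * this_load >= 112 * 1250,
	   i.e. this_load >= 1400

so the two margins compound to roughly 140% instead of the intended 112%.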


>  
>  			avg_load += load;
>  		}
> @@ -5478,7 +5484,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  
>  		/* Bias balancing toward cpus of our domain */
>  		if (local_group)
> -			load = target_load(i);
> +			load = target_load(i, env->sd->imbalance_pct);

You probably have the same problem here.

Morten

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/4] sched: remove cpu_load decay
  2013-12-17 14:04     ` Morten Rasmussen
@ 2013-12-17 15:37       ` Peter Zijlstra
  2013-12-17 18:12         ` Morten Rasmussen
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2013-12-17 15:37 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Alex Shi, mingo, vincent.guittot, daniel.lezcano, fweisbec,
	linux, tony.luck, fenghua.yu, tglx, akpm, arjan, pjt,
	fengguang.wu, james.hogan, jason.low2, gregkh, hanjun.guo,
	linux-kernel

On Tue, Dec 17, 2013 at 02:04:57PM +0000, Morten Rasmussen wrote:
> On Sat, Dec 14, 2013 at 01:27:59PM +0000, Alex Shi wrote:
> > On 12/14/2013 04:03 AM, Peter Zijlstra wrote:
> > > 
> > > 
> > > I had a quick peek at the actual patches.
> > > 
> > > afaict we're now using weighted_cpuload() aka runnable_load_avg as the
> > > ->cpu_load. Whatever happened to also using the blocked_avg?
> 
> AFAICT, ->cpu_load is actually a snapshot value of weigthed_cpuload()
> that gets updated occasionally. That has been the case since b92486cb.
> By removing the cpu_load indexes {source,target}_load are now comparing
> an old snapshot of weighted_cpuload() with the current value. I don't
> think that really makes sense. 

Agreed; worse, cpu_load is a very, very recent snapshot, so there's not
been much chance for it to really diverge since we last looked at it.

[ for busy load-balance, for newidle there might be since we can run
between ticks ]

> weighted_cpuload() may change rapidly
> when tasks are enqueued or dequeued so the old snapshot doesn't have
> much meaning in my opinion. Maybe I'm missing something?

Right, which is where it makes sense to also account some of the blocked
load, since that anticipates these arrivals/departures and should smooth
out the over-all load picture, which is something that sounds right for
balancing.

You don't want to really care too much about the high freq fluctuation,
but care more about the longer term load.

Or rather -- and this is where the idx thing came from, you want a
longer term view the bigger your sched_domain is. Since that balances
nicely against the cost of actually moving tasks around.

And while runnable_load_avg still includes high freq arrival/departure
events, the runnable+blocked load should have much less of that.

> Comparing cpu_load indexes with different decay rates in {source,
> target}_load() sort of make sense as it makes load-balancing decisions
> more conservative.

*nod*

> I believe we have discussed using blocked_load_avg in weighted_cpuload()
> in the past. While it seems to be the right thing to include it, it
> causes problems related to the priority scaling of the task loads.
> If you include a blocked load in the weighted_cpuload() and have tiny
> (very low cpu utilization) task running at very high priority, your
> weighted_cpuload() will be quite high and force other normal priority
> tasks away from the cpu and leaving the cpu idle most of the time.

Ah, right. Which is where we should look at balancing utilization as
well as weight.

Let me ponder this a bit more.

> > 
> > When enabling the sched_avg in load balance, I didn't find any positive
> > testing result for several blocked_avg trying, just few regression. :(
> > 
> > And since this patchset is almost clean up only, no blocked_load_avg
> > trying again...
> 
> My worry here is that I don't really understand why the current code
> works when the decayed cpu_load has been removed.

Not too much different from before I think; but it does lose the longer
term view on the bigger domains. That in turn makes it slightly more
aggressive, which can be good or bad depending on the workload (good on
high spawn loads like hackbench, bad on more gentle stuff that has a
cache footprint).

> > > I totally hate patch 4; it seems like a random hack to make up for the
> > > lack of blocked_avg.
> > 
> > Yes, this bias criteria seems a bit arbitrary. :)
> 
> This is why I think {source, target}_load() and their use need to be
> reconsidered.

Aside from that, there's something entirely wrong with 4 in that we
already have an imbalance between source and target loads, adding
another basically random imbalance pass on top of that just doesn't make
any kind of sense what so ff'ing ever.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
  2013-12-17 14:10   ` Morten Rasmussen
@ 2013-12-17 15:38     ` Peter Zijlstra
  2013-12-19 13:34       ` Alex Shi
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2013-12-17 15:38 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Alex Shi, mingo, vincent.guittot, daniel.lezcano, fweisbec,
	linux, tony.luck, fenghua.yu, tglx, akpm, arjan, pjt,
	fengguang.wu, james.hogan, jason.low2, gregkh, hanjun.guo,
	linux-kernel

On Tue, Dec 17, 2013 at 02:10:12PM +0000, Morten Rasmussen wrote:
> > @@ -4135,7 +4141,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >  			if (local_group)
> >  				load = source_load(i);
> >  			else
> > -				load = target_load(i);
> > +				load = target_load(i, sd->imbalance_pct);
> 
> Don't you apply imbalance_pct twice here? Later on in
> find_idlest_group() you have:
> 
> 	if (!idlest || 100*this_load < imbalance*min_load)
> 		return NULL;
> 
> where min_load comes from target_load().

Yes! Exactly! This doesn't make any sense.
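
Spelling it out with imbalance_pct = 125 as a typical value:
target_load() has already scaled the remote loads by 125/100, and the
check above multiplies min_load by imbalance = 100 + (125-100)/2 = 112
on top of that, so the local group's load has to exceed roughly
1.25 * 1.12 ~= 1.4 times the remote group's raw load before the idlest
remote group is chosen, rather than the single intended margin.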

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/4] sched: remove cpu_load decay
  2013-12-17 15:37       ` Peter Zijlstra
@ 2013-12-17 18:12         ` Morten Rasmussen
  2013-12-20 14:43           ` Alex Shi
  0 siblings, 1 reply; 34+ messages in thread
From: Morten Rasmussen @ 2013-12-17 18:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alex Shi, mingo, vincent.guittot, daniel.lezcano, fweisbec,
	linux, tony.luck, fenghua.yu, tglx, akpm, arjan, pjt,
	fengguang.wu, james.hogan, jason.low2, gregkh, hanjun.guo,
	linux-kernel

On Tue, Dec 17, 2013 at 03:37:23PM +0000, Peter Zijlstra wrote:
> On Tue, Dec 17, 2013 at 02:04:57PM +0000, Morten Rasmussen wrote:
> > On Sat, Dec 14, 2013 at 01:27:59PM +0000, Alex Shi wrote:
> > > On 12/14/2013 04:03 AM, Peter Zijlstra wrote:
> > > > 
> > > > 
> > > > I had a quick peek at the actual patches.
> > > > 
> > > > afaict we're now using weighted_cpuload() aka runnable_load_avg as the
> > > > ->cpu_load. Whatever happened to also using the blocked_avg?
> > 
> > AFAICT, ->cpu_load is actually a snapshot value of weigthed_cpuload()
> > that gets updated occasionally. That has been the case since b92486cb.
> > By removing the cpu_load indexes {source,target}_load are now comparing
> > an old snapshot of weighted_cpuload() with the current value. I don't
> > think that really makes sense. 
> 
> Agreed, worse cpu_load is a very very recent snapshot, so there's not
> been much chance to really diverge much between when we last looked at
> it.
> 
> [ for busy load-balance, for newidle there might be since we can run
> between ticks ]
> 
> > weighted_cpuload() may change rapidly
> > when tasks are enqueued or dequeued so the old snapshot doesn't have
> > much meaning in my opinion. Maybe I'm missing something?
> 
> Right, which is where it makes sense to also account some of the blocked
> load, since that anticipates these arrivals/departures and should smooth
> out the over-all load pictures. Which is something that sounds right for
> balancing.
> 
> You don't want to really care too much about the high freq fluctuation,
> but care more about the longer term load.
> 
> Or rather -- and this is where the idx thing came from, you want a
> longer term view the bigger your sched_domain is. Since that balances
> nicely against the cost of actually moving tasks around.

That makes sense.

> 
> And while runnable_load_avg still includes high freq arrival/departure
> events, the runnable+blocked load should have much less of that.

Agreed, we either need a smoothed version of runnable_load_avg or we
need to add the blocked load (given that we fix the priority issue).

There is actually another long-term view of the cpu load in
rq->avg.runnable_avg_sum, but I think it might be too conservative. Also
it doesn't track the weight of the tasks on the cpu, just whether the
cpu was idle or not.

> 
> > Comparing cpu_load indexes with different decay rates in {source,
> > target}_load() sort of make sense as it makes load-balancing decisions
> > more conservative.
> 
> *nod*
> 
> > I believe we have discussed using blocked_load_avg in weighted_cpuload()
> > in the past. While it seems to be the right thing to include it, it
> > causes problems related to the priority scaling of the task loads.
> > If you include a blocked load in the weighted_cpuload() and have tiny
> > (very low cpu utilization) task running at very high priority, your
> > weighted_cpuload() will be quite high and force other normal priority
> > tasks away from the cpu and leaving the cpu idle most of the time.
> 
> Ah, right. Which is where we should look at balancing utilization as
> well as weight.
> 
> Let me ponder this a bit more.

Yes. At least for Android devices this is a big deal.

Would it be too invasive to have an unweighted_cpuload() for balancing
utilization? It would require maintaining an unweighted version of
runnable_load_avg and blocked load.

Maybe you have better ideas.
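
Just to illustrate the shape of it (the *_unweighted fields below are
hypothetical -- they would have to be maintained next to the existing
averages, accumulating plain runnable/blocked time rather than
nice-scaled weight):

static unsigned long unweighted_cpuload(const int cpu)
{
	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;

	/* hypothetical fields, mirroring runnable_load_avg/blocked_load_avg
	 * but with every task counted at the same weight */
	return cfs_rq->runnable_load_avg_unweighted +
	       cfs_rq->blocked_load_avg_unweighted;
}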

> 
> > > 
> > > When enabling the sched_avg in load balance, I didn't find any positive
> > > testing result for several blocked_avg trying, just few regression. :(
> > > 
> > > And since this patchset is almost clean up only, no blocked_load_avg
> > > trying again...
> > 
> > My worry here is that I don't really understand why the current code
> > works when the decayed cpu_load has been removed.
> 
> Not too much different from before I think; but it does loose the longer
> term view on the bigger domains. That in turn makes it slightly more
> agressive, which can be good or bad depending on the workload (good on
> high spawn loads like hackbenchs, bad on more gentle stuff that has
> cache footprint).
> 
> > > > I totally hate patch 4; it seems like a random hack to make up for the
> > > > lack of blocked_avg.
> > > 
> > > Yes, this bias criteria seems a bit arbitrary. :)
> > 
> > This is why I think {source, target}_load() and their use need to be
> > reconsidered.
> 
> Aside from that, there's something entirely wrong with 4 in that we
> already have an imbalance between source and target loads, adding
> another basically random imbalance pass on top of that just doesn't make
> any kind of sense what so ff'ing ever.

Agreed.

Morten

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
  2013-12-17 15:38     ` Peter Zijlstra
@ 2013-12-19 13:34       ` Alex Shi
  2013-12-20 11:19         ` Morten Rasmussen
  0 siblings, 1 reply; 34+ messages in thread
From: Alex Shi @ 2013-12-19 13:34 UTC (permalink / raw)
  To: Peter Zijlstra, Morten Rasmussen
  Cc: mingo, vincent.guittot, daniel.lezcano, fweisbec, linux,
	tony.luck, fenghua.yu, tglx, akpm, arjan, pjt, fengguang.wu,
	james.hogan, jason.low2, gregkh, hanjun.guo, linux-kernel

On 12/17/2013 11:38 PM, Peter Zijlstra wrote:
> On Tue, Dec 17, 2013 at 02:10:12PM +0000, Morten Rasmussen wrote:
>>> @@ -4135,7 +4141,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>>>  			if (local_group)
>>>  				load = source_load(i);
>>>  			else
>>> -				load = target_load(i);
>>> +				load = target_load(i, sd->imbalance_pct);
>>
>> Don't you apply imbalance_pct twice here? Later on in
>> find_idlest_group() you have:
>>
>> 	if (!idlest || 100*this_load < imbalance*min_load)
>> 		return NULL;
>>
>> where min_load comes from target_load().
> 
> Yes! exactly! this doesn't make any sense.

Thanks a lot for review and comments!

I changed the patch to the following shape and pushed it into Fengguang's
testing system. Any testing is appreciated!

BTW, it seems lots of scheduler changes come from experience with various
scenarios/benchmarks, but I would still like to get any theoretical
comments/suggestions.

-- 
Thanks
    Alex

===

From 5cd67d975001edafe2ee820e0be5d86881a23bd6 Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@linaro.org>
Date: Sat, 23 Nov 2013 23:18:09 +0800
Subject: [PATCH 4/4] sched: bias to target cpu load to reduce task moving

Task migration happens when target just a bit less then source cpu load.
To reduce such situation happens, aggravate the target cpu load with
sd->imbalance_pct/100 in wake_affine.

In find_idlest/busiest_group, change the aggravate to local cpu only
from old group aggravation.

on my pandaboard ES.

	latest kernel 527d1511310a89		+ whole patchset
hackbench -T -g 10 -f 40
	23.25"					21.99"
	23.16"					21.20"
	24.24"					21.89"
hackbench -p -g 10 -f 40
	26.52"					21.46"
	23.89"					22.96"
	25.65"					22.73"
hackbench -P -g 10 -f 40
	20.14"					19.72"
	19.96"					19.10"
	21.76"					20.03"

Signed-off-by: Alex Shi <alex.shi@linaro.org>
---
 kernel/sched/fair.c | 35 ++++++++++++++++-------------------
 1 file changed, 16 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bccdd89..3623ba4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -978,7 +978,7 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
 
 static unsigned long weighted_cpuload(const int cpu);
 static unsigned long source_load(int cpu);
-static unsigned long target_load(int cpu);
+static unsigned long target_load(int cpu, int imbalance_pct);
 static unsigned long power_of(int cpu);
 static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
 
@@ -3809,11 +3809,17 @@ static unsigned long source_load(int cpu)
  * Return a high guess at the load of a migration-target cpu weighted
  * according to the scheduling class and "nice" value.
  */
-static unsigned long target_load(int cpu)
+static unsigned long target_load(int cpu, int imbalance_pct)
 {
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long total = weighted_cpuload(cpu);
 
+	/*
+	 * without cpu_load decay, in most of time cpu_load is same as total
+	 * so we need to make target a bit heavier to reduce task migration
+	 */
+	total = total * imbalance_pct / 100;
+
 	if (!sched_feat(LB_BIAS))
 		return total;
 
@@ -4033,7 +4039,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 	this_cpu  = smp_processor_id();
 	prev_cpu  = task_cpu(p);
 	load	  = source_load(prev_cpu);
-	this_load = target_load(this_cpu);
+	this_load = target_load(this_cpu, 100);
 
 	/*
 	 * If sync wakeup then subtract the (maximum possible)
@@ -4089,7 +4095,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 
 	if (balanced ||
 	    (this_load <= load &&
-	     this_load + target_load(prev_cpu) <= tl_per_task)) {
+	     this_load + target_load(prev_cpu, 100) <= tl_per_task)) {
 		/*
 		 * This domain has SD_WAKE_AFFINE and
 		 * p is cache cold in this domain, and
@@ -4112,7 +4118,6 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
 	unsigned long min_load = ULONG_MAX, this_load = 0;
-	int imbalance = 100 + (sd->imbalance_pct-100)/2;
 
 	do {
 		unsigned long load, avg_load;
@@ -4132,10 +4137,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 
 		for_each_cpu(i, sched_group_cpus(group)) {
 			/* Bias balancing toward cpus of our domain */
-			if (local_group)
+			if (i == this_cpu)
 				load = source_load(i);
 			else
-				load = target_load(i);
+				load = target_load(i, sd->imbalance_pct);
 
 			avg_load += load;
 		}
@@ -4151,7 +4156,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 		}
 	} while (group = group->next, group != sd->groups);
 
-	if (!idlest || 100*this_load < imbalance*min_load)
+	if (!idlest || this_load < min_load)
 		return NULL;
 	return idlest;
 }
@@ -5476,9 +5481,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 		nr_running = rq->nr_running;
 
-		/* Bias balancing toward cpus of our domain */
-		if (local_group)
-			load = target_load(i);
+		/* Bias balancing toward dst cpu */
+		if (env->dst_cpu == i)
+			load = target_load(i, env->sd->imbalance_pct);
 		else
 			load = source_load(i);
 
@@ -5918,14 +5923,6 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 		if ((local->idle_cpus < busiest->idle_cpus) &&
 		    busiest->sum_nr_running <= busiest->group_weight)
 			goto out_balanced;
-	} else {
-		/*
-		 * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
-		 * imbalance_pct to be conservative.
-		 */
-		if (100 * busiest->avg_load <=
-				env->sd->imbalance_pct * local->avg_load)
-			goto out_balanced;
 	}
 
 force_balance:
-- 
1.8.1.2


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
  2013-12-19 13:34       ` Alex Shi
@ 2013-12-20 11:19         ` Morten Rasmussen
  2013-12-20 14:45           ` Alex Shi
  2013-12-25 14:58           ` Alex Shi
  0 siblings, 2 replies; 34+ messages in thread
From: Morten Rasmussen @ 2013-12-20 11:19 UTC (permalink / raw)
  To: Alex Shi
  Cc: Peter Zijlstra, mingo, vincent.guittot, daniel.lezcano, fweisbec,
	linux, tony.luck, fenghua.yu, tglx, akpm, arjan, pjt,
	fengguang.wu, james.hogan, jason.low2, gregkh, hanjun.guo,
	linux-kernel

On Thu, Dec 19, 2013 at 01:34:08PM +0000, Alex Shi wrote:
> On 12/17/2013 11:38 PM, Peter Zijlstra wrote:
> > On Tue, Dec 17, 2013 at 02:10:12PM +0000, Morten Rasmussen wrote:
> >>> @@ -4135,7 +4141,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >>>  			if (local_group)
> >>>  				load = source_load(i);
> >>>  			else
> >>> -				load = target_load(i);
> >>> +				load = target_load(i, sd->imbalance_pct);
> >>
> >> Don't you apply imbalance_pct twice here? Later on in
> >> find_idlest_group() you have:
> >>
> >> 	if (!idlest || 100*this_load < imbalance*min_load)
> >> 		return NULL;
> >>
> >> where min_load comes from target_load().
> > 
> > Yes! exactly! this doesn't make any sense.
> 
> Thanks a lot for review and comments!
> 
> I changed the patch to following shape. and push it under Fengguang's testing 
> system monitor. Any testing are appreciated!
> 
> BTW, Seems lots of changes in scheduler come from kinds of scenarios/benchmarks
> experience. But I still like to take any theoretical comments/suggestions.
> 
> -- 
> Thanks
>     Alex
> 
> ===
> 
> From 5cd67d975001edafe2ee820e0be5d86881a23bd6 Mon Sep 17 00:00:00 2001
> From: Alex Shi <alex.shi@linaro.org>
> Date: Sat, 23 Nov 2013 23:18:09 +0800
> Subject: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
> 
> Task migration happens when target just a bit less then source cpu load.
> To reduce such situation happens, aggravate the target cpu load with
> sd->imbalance_pct/100 in wake_affine.
> 
> In find_idlest/busiest_group, change the aggravate to local cpu only
> from old group aggravation.
> 
> on my pandaboard ES.
> 
> 	latest kernel 527d1511310a89		+ whole patchset
> hackbench -T -g 10 -f 40
> 	23.25"					21.99"
> 	23.16"					21.20"
> 	24.24"					21.89"
> hackbench -p -g 10 -f 40
> 	26.52"					21.46"
> 	23.89"					22.96"
> 	25.65"					22.73"
> hackbench -P -g 10 -f 40
> 	20.14"					19.72"
> 	19.96"					19.10"
> 	21.76"					20.03"
> 
> Signed-off-by: Alex Shi <alex.shi@linaro.org>
> ---
>  kernel/sched/fair.c | 35 ++++++++++++++++-------------------
>  1 file changed, 16 insertions(+), 19 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bccdd89..3623ba4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -978,7 +978,7 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
>  
>  static unsigned long weighted_cpuload(const int cpu);
>  static unsigned long source_load(int cpu);
> -static unsigned long target_load(int cpu);
> +static unsigned long target_load(int cpu, int imbalance_pct);
>  static unsigned long power_of(int cpu);
>  static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
>  
> @@ -3809,11 +3809,17 @@ static unsigned long source_load(int cpu)
>   * Return a high guess at the load of a migration-target cpu weighted
>   * according to the scheduling class and "nice" value.
>   */
> -static unsigned long target_load(int cpu)
> +static unsigned long target_load(int cpu, int imbalance_pct)
>  {
>  	struct rq *rq = cpu_rq(cpu);
>  	unsigned long total = weighted_cpuload(cpu);
>  
> +	/*
> +	 * without cpu_load decay, in most of time cpu_load is same as total
> +	 * so we need to make target a bit heavier to reduce task migration
> +	 */
> +	total = total * imbalance_pct / 100;
> +
>  	if (!sched_feat(LB_BIAS))
>  		return total;
>  
> @@ -4033,7 +4039,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>  	this_cpu  = smp_processor_id();
>  	prev_cpu  = task_cpu(p);
>  	load	  = source_load(prev_cpu);
> -	this_load = target_load(this_cpu);
> +	this_load = target_load(this_cpu, 100);
>  
>  	/*
>  	 * If sync wakeup then subtract the (maximum possible)
> @@ -4089,7 +4095,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>  
>  	if (balanced ||
>  	    (this_load <= load &&
> -	     this_load + target_load(prev_cpu) <= tl_per_task)) {
> +	     this_load + target_load(prev_cpu, 100) <= tl_per_task)) {
>  		/*
>  		 * This domain has SD_WAKE_AFFINE and
>  		 * p is cache cold in this domain, and
> @@ -4112,7 +4118,6 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  {
>  	struct sched_group *idlest = NULL, *group = sd->groups;
>  	unsigned long min_load = ULONG_MAX, this_load = 0;
> -	int imbalance = 100 + (sd->imbalance_pct-100)/2;
>  
>  	do {
>  		unsigned long load, avg_load;
> @@ -4132,10 +4137,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  
>  		for_each_cpu(i, sched_group_cpus(group)) {
>  			/* Bias balancing toward cpus of our domain */
> -			if (local_group)
> +			if (i == this_cpu)

What is the motivation for changing the local_group load calculation?
Now all cpus in the local group, except this_cpu, will contribute more
to this_load, as their contribution is determined using target_load()
instead.

If I'm not mistaken, that will lead to more frequent load balancing as
the local_group bias has been reduced. That is the opposite of your
intentions based on your comment in target_load().

>  				load = source_load(i);
>  			else
> -				load = target_load(i);
> +				load = target_load(i, sd->imbalance_pct);

You scale by sd->imbalance_pct instead of 100+(sd->imbalance_pct-100)/2
that you removed above. sd->imbalance_pct may have been arbitrarily
chosen in the past, but changing it may affect behavior. 

>  
>  			avg_load += load;
>  		}
> @@ -4151,7 +4156,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  		}
>  	} while (group = group->next, group != sd->groups);
>  
> -	if (!idlest || 100*this_load < imbalance*min_load)
> +	if (!idlest || this_load < min_load)
>  		return NULL;
>  	return idlest;
>  }
> @@ -5476,9 +5481,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  
>  		nr_running = rq->nr_running;
>  
> -		/* Bias balancing toward cpus of our domain */
> -		if (local_group)
> -			load = target_load(i);
> +		/* Bias balancing toward dst cpu */
> +		if (env->dst_cpu == i)
> +			load = target_load(i, env->sd->imbalance_pct);

Here you do the same group load bias change as above.

>  		else
>  			load = source_load(i);
>  
> @@ -5918,14 +5923,6 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  		if ((local->idle_cpus < busiest->idle_cpus) &&
>  		    busiest->sum_nr_running <= busiest->group_weight)
>  			goto out_balanced;
> -	} else {
> -		/*
> -		 * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
> -		 * imbalance_pct to be conservative.
> -		 */
> -		if (100 * busiest->avg_load <=
> -				env->sd->imbalance_pct * local->avg_load)
> -			goto out_balanced;
>  	}
>  
>  force_balance:

As said in my previous replies to this series, I think this problem
should be solved by fixing the cause of the problem, that is, the
cpu_load calculation, instead of biasing cpu_load wherever it is used to
hide the problem.

Doing a bit of git archaeology reveals that the cpu_load code goes back
to 7897986bad8f6cd50d6149345aca7f6480f49464 and that in the original
patch x86 was using *_idx > 0 for all sched_domain levels except SMT. In
my opinion that made logical sense. If we are about to change to *_idx=0
we are removing the main idea behind that code and there needs to be a
new one. Otherwise, cpu_load doesn't make sense.
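
For reference, the idea that is being dropped is essentially the per-tick
update below (simplified; the real __update_cpu_load() also handles ticks
missed while in NO_HZ). A larger idx makes the history term dominate, so
higher indexes give a slower-moving, longer-term average:

static void update_cpu_load_decayed(struct rq *this_rq, unsigned long this_load)
{
	int i;
	unsigned long scale;

	this_rq->cpu_load[0] = this_load;	/* idx 0: no decay at all */

	for (i = 1, scale = 2; i < CPU_LOAD_IDX_MAX; i++, scale += scale) {
		unsigned long old_load = this_rq->cpu_load[i];
		unsigned long new_load = this_load;

		/* round up when increasing so the average can reach the top */
		if (new_load > old_load)
			new_load += scale - 1;

		/* larger idx => more weight on history, longer-term view */
		this_rq->cpu_load[i] = (old_load * (scale - 1) + new_load) >> i;
	}
}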

Morten

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/4] sched: remove cpu_load decay
  2013-12-17 18:12         ` Morten Rasmussen
@ 2013-12-20 14:43           ` Alex Shi
  0 siblings, 0 replies; 34+ messages in thread
From: Alex Shi @ 2013-12-20 14:43 UTC (permalink / raw)
  To: Morten Rasmussen, Peter Zijlstra
  Cc: mingo, vincent.guittot, daniel.lezcano, fweisbec, linux,
	tony.luck, fenghua.yu, tglx, akpm, arjan, pjt, fengguang.wu,
	james.hogan, jason.low2, gregkh, hanjun.guo, linux-kernel

Thanks a lot for comments from Morten & Peter! :)

On 12/18/2013 02:12 AM, Morten Rasmussen wrote:
> On Tue, Dec 17, 2013 at 03:37:23PM +0000, Peter Zijlstra wrote:
>> On Tue, Dec 17, 2013 at 02:04:57PM +0000, Morten Rasmussen wrote:
>>> On Sat, Dec 14, 2013 at 01:27:59PM +0000, Alex Shi wrote:
>>>> On 12/14/2013 04:03 AM, Peter Zijlstra wrote:
>>>>>
>>>>>
>>>>> I had a quick peek at the actual patches.
>>>>>
>>>>> afaict we're now using weighted_cpuload() aka runnable_load_avg as the
>>>>> ->cpu_load. Whatever happened to also using the blocked_avg?
>>>
>>> AFAICT, ->cpu_load is actually a snapshot value of weigthed_cpuload()
>>> that gets updated occasionally. That has been the case since b92486cb.
>>> By removing the cpu_load indexes {source,target}_load are now comparing
>>> an old snapshot of weighted_cpuload() with the current value. I don't
>>> think that really makes sense. 
>>
>> Agreed, worse cpu_load is a very very recent snapshot, so there's not
>> been much chance to really diverge much between when we last looked at
>> it.
>>
>> [ for busy load-balance, for newidle there might be since we can run
>> between ticks ]
>>
>>> weighted_cpuload() may change rapidly
>>> when tasks are enqueued or dequeued so the old snapshot doesn't have
>>> much meaning in my opinion. Maybe I'm missing something?
>>
>> Right, which is where it makes sense to also account some of the blocked
>> load, since that anticipates these arrivals/departures and should smooth
>> out the over-all load pictures. Which is something that sounds right for
>> balancing.
>>
>> You don't want to really care too much about the high freq fluctuation,
>> but care more about the longer term load.
>>
>> Or rather -- and this is where the idx thing came from, you want a
>> longer term view the bigger your sched_domain is. Since that balances
>> nicely against the cost of actually moving tasks around.

As to blocked_load_avg, it tries to give some respect to tasks that have
left the cpu. But since there are many cpus in each sched domain, the
chance of a task waking back up on the same cpu is very limited. So
forgetting that legacy and starting clean for new tasks seems a better
choice, and the previous attempts also showed this.

> 
> That makes sense.
> 
>>
>> And while runnable_load_avg still includes high freq arrival/departure
>> events, the runnable+blocked load should have much less of that.
> 
> Agreed, we either need a smooth version of runnable_load_avg or add the
> blocked load (given that we fix the priority issue).
> 
> There is actually another long-term view of the cpu load in
> rq->avg.runnable_avg_sum but I think it might be too conversative. Also
> it doesn't track the weight of the tasks on the cpu, just whether the
> cpu was idle or not.
> 
>>
>>> Comparing cpu_load indexes with different decay rates in {source,
>>> target}_load() sort of make sense as it makes load-balancing decisions
>>> more conservative.
>>
>> *nod*
>>
>>> I believe we have discussed using blocked_load_avg in weighted_cpuload()
>>> in the past. While it seems to be the right thing to include it, it
>>> causes problems related to the priority scaling of the task loads.
>>> If you include a blocked load in the weighted_cpuload() and have tiny
>>> (very low cpu utilization) task running at very high priority, your
>>> weighted_cpuload() will be quite high and force other normal priority
>>> tasks away from the cpu and leaving the cpu idle most of the time.
>>
>> Ah, right. Which is where we should look at balancing utilization as
>> well as weight.
>>
>> Let me ponder this a bit more.
> 
> Yes. At least for Android devices this is a big deal.
> 
> Would it be too invasive to have an unweighted_cpuload() for balancing
> utilization? It would require maintaining an unweighted version of
> runnable_load_avg and blocked load.
> 
> Maybe you have better ideas.
> 
>>
>>>>
>>>> When enabling the sched_avg in load balance, I didn't find any positive
>>>> testing result for several blocked_avg trying, just few regression. :(
>>>>
>>>> And since this patchset is almost clean up only, no blocked_load_avg
>>>> trying again...
>>>
>>> My worry here is that I don't really understand why the current code
>>> works when the decayed cpu_load has been removed.
>>
>> Not too much different from before I think; but it does loose the longer
>> term view on the bigger domains. That in turn makes it slightly more
>> agressive, which can be good or bad depending on the workload (good on
>> high spawn loads like hackbenchs, bad on more gentle stuff that has
>> cache footprint).

Yes, that is the point. :)
But I don't know why we need this long-term view. hackbench is a bit
worse on an old Intel Core2 machine -- even if that machine is a bit out
of date.
>>
>>>>> I totally hate patch 4; it seems like a random hack to make up for the
>>>>> lack of blocked_avg.
>>>>
>>>> Yes, this bias criteria seems a bit arbitrary. :)
>>>
>>> This is why I think {source, target}_load() and their use need to be
>>> reconsidered.
>>
>> Aside from that, there's something entirely wrong with 4 in that we
>> already have an imbalance between source and target loads, adding
>> another basically random imbalance pass on top of that just doesn't make
>> any kind of sense what so ff'ing ever.

My fault, I will reconsider more on this point.
> 
> Agreed.
> 
> Morten
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
  2013-12-20 11:19         ` Morten Rasmussen
@ 2013-12-20 14:45           ` Alex Shi
  2013-12-25 14:58           ` Alex Shi
  1 sibling, 0 replies; 34+ messages in thread
From: Alex Shi @ 2013-12-20 14:45 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, mingo, vincent.guittot, daniel.lezcano, fweisbec,
	linux, tony.luck, fenghua.yu, tglx, akpm, arjan, pjt,
	fengguang.wu, james.hogan, jason.low2, gregkh, hanjun.guo,
	linux-kernel

On 12/20/2013 07:19 PM, Morten Rasmussen wrote:
>> @@ -4132,10 +4137,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>> >  
>> >  		for_each_cpu(i, sched_group_cpus(group)) {
>> >  			/* Bias balancing toward cpus of our domain */
>> > -			if (local_group)
>> > +			if (i == this_cpu)
> What is the motivation for changing the local_group load calculation?
> Now the load contributions of all cpus in the local group, except
> this_cpu, will contribute more as their contribution (this_load) is
> determined using target_load() instead.
> 
> If I'm not mistaken, that will lead to more frequent load balancing as
> the local_group bias has been reduced. That is the opposite of your
> intentions based on your comment in target_load().

Good catch. I will reconsider this again. :)
> 
>> >  				load = source_load(i);
>> >  			else
>> > -				load = target_load(i);
>> > +				load = target_load(i, sd->imbalance_pct);
> You scale by sd->imbalance_pct instead of 100+(sd->imbalance_pct-100)/2
> that you removed above. sd->imbalance_pct may have been arbitrarily
> chosen in the past, but changing it may affect behavior. 
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
  2013-12-20 11:19         ` Morten Rasmussen
  2013-12-20 14:45           ` Alex Shi
@ 2013-12-25 14:58           ` Alex Shi
  2014-01-02 16:04             ` Morten Rasmussen
  1 sibling, 1 reply; 34+ messages in thread
From: Alex Shi @ 2013-12-25 14:58 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, mingo, vincent.guittot, daniel.lezcano, fweisbec,
	linux, tony.luck, fenghua.yu, tglx, akpm, arjan, pjt,
	fengguang.wu, james.hogan, jason.low2, gregkh, hanjun.guo,
	linux-kernel


>> From 5cd67d975001edafe2ee820e0be5d86881a23bd6 Mon Sep 17 00:00:00 2001
>> From: Alex Shi <alex.shi@linaro.org>
>> Date: Sat, 23 Nov 2013 23:18:09 +0800
>> Subject: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
>>
>> Task migration happens when target just a bit less then source cpu load.
>> To reduce such situation happens, aggravate the target cpu load with
>> sd->imbalance_pct/100 in wake_affine.
>>
>> In find_idlest/busiest_group, change the aggravate to local cpu only
>> from old group aggravation.
>>
>> on my pandaboard ES.
>>
>> 	latest kernel 527d1511310a89		+ whole patchset
>> hackbench -T -g 10 -f 40
>> 	23.25"					21.99"
>> 	23.16"					21.20"
>> 	24.24"					21.89"
>> hackbench -p -g 10 -f 40
>> 	26.52"					21.46"
>> 	23.89"					22.96"
>> 	25.65"					22.73"
>> hackbench -P -g 10 -f 40
>> 	20.14"					19.72"
>> 	19.96"					19.10"
>> 	21.76"					20.03"
>>
>> Signed-off-by: Alex Shi <alex.shi@linaro.org>
>> ---
>>  kernel/sched/fair.c | 35 ++++++++++++++++-------------------
>>  1 file changed, 16 insertions(+), 19 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index bccdd89..3623ba4 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -978,7 +978,7 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
>>  
>>  static unsigned long weighted_cpuload(const int cpu);
>>  static unsigned long source_load(int cpu);
>> -static unsigned long target_load(int cpu);
>> +static unsigned long target_load(int cpu, int imbalance_pct);
>>  static unsigned long power_of(int cpu);
>>  static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
>>  
>> @@ -3809,11 +3809,17 @@ static unsigned long source_load(int cpu)
>>   * Return a high guess at the load of a migration-target cpu weighted
>>   * according to the scheduling class and "nice" value.
>>   */
>> -static unsigned long target_load(int cpu)
>> +static unsigned long target_load(int cpu, int imbalance_pct)
>>  {
>>  	struct rq *rq = cpu_rq(cpu);
>>  	unsigned long total = weighted_cpuload(cpu);
>>  
>> +	/*
>> +	 * without cpu_load decay, in most of time cpu_load is same as total
>> +	 * so we need to make target a bit heavier to reduce task migration
>> +	 */
>> +	total = total * imbalance_pct / 100;
>> +
>>  	if (!sched_feat(LB_BIAS))
>>  		return total;
>>  
>> @@ -4033,7 +4039,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>>  	this_cpu  = smp_processor_id();
>>  	prev_cpu  = task_cpu(p);
>>  	load	  = source_load(prev_cpu);
>> -	this_load = target_load(this_cpu);
>> +	this_load = target_load(this_cpu, 100);
>>  
>>  	/*
>>  	 * If sync wakeup then subtract the (maximum possible)
>> @@ -4089,7 +4095,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>>  
>>  	if (balanced ||
>>  	    (this_load <= load &&
>> -	     this_load + target_load(prev_cpu) <= tl_per_task)) {
>> +	     this_load + target_load(prev_cpu, 100) <= tl_per_task)) {
>>  		/*
>>  		 * This domain has SD_WAKE_AFFINE and
>>  		 * p is cache cold in this domain, and
>> @@ -4112,7 +4118,6 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>>  {
>>  	struct sched_group *idlest = NULL, *group = sd->groups;
>>  	unsigned long min_load = ULONG_MAX, this_load = 0;
>> -	int imbalance = 100 + (sd->imbalance_pct-100)/2;
>>  
>>  	do {
>>  		unsigned long load, avg_load;
>> @@ -4132,10 +4137,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>>  
>>  		for_each_cpu(i, sched_group_cpus(group)) {
>>  			/* Bias balancing toward cpus of our domain */
>> -			if (local_group)
>> +			if (i == this_cpu)
> 
> What is the motivation for changing the local_group load calculation?
> Now the load contributions of all cpus in the local group, except
> this_cpu, will contribute more as their contribution (this_load) is
> determined using target_load() instead.

This code goes back to 147cbb4bbe99, written in 2005 for x86; at that
time there were only 2 cores (and I guess no HT) per cpu socket. With
the trend of ever-increasing core counts, the sched_group has become
larger and larger, so giving the whole group this bias value is becoming
nonsense. So it looks reasonable to bias this cpu only.
> 
> If I'm not mistaken, that will lead to more frequent load balancing as
> the local_group bias has been reduced. That is the opposite of your
> intentions based on your comment in target_load().
> 
>>  				load = source_load(i);
>>  			else
>> -				load = target_load(i);
>> +				load = target_load(i, sd->imbalance_pct);
> 
> You scale by sd->imbalance_pct instead of 100+(sd->imbalance_pct-100)/2
> that you removed above. sd->imbalance_pct may have been arbitrarily
> chosen in the past, but changing it may affect behavior. 

The first commit with this line:
	int imbalance = 100 + (sd->imbalance_pct-100)/2;
is 147cbb4bbe99, and it also has no explanation for this imbalance value.
AFAIK, at that time there were maybe only 2 LCPUs in one socket, so I
guess that is the reason for using half of imbalance_pct.
But sched_domain is now used widely for many kinds of platforms/cpu
types, so that value is arbitrary too.
Another reason to use it: since I scale every cpu load which is not
this_cpu, it is a bit more conservative to use the original
imbalance_pct, not half of it.
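
To put numbers on the difference, assuming the common default of
imbalance_pct = 125: the old expression gives 100 + (125-100)/2 = 112,
i.e. about a 12% bias, while using imbalance_pct directly gives a 25%
bias -- in general the margin above 100 is doubled, whatever the
domain's imbalance_pct.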
> 
>>  
>>  			avg_load += load;
>>  		}
>> @@ -4151,7 +4156,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>>  		}
>>  	} while (group = group->next, group != sd->groups);
>>  
>> -	if (!idlest || 100*this_load < imbalance*min_load)
>> +	if (!idlest || this_load < min_load)
>>  		return NULL;
>>  	return idlest;
>>  }
>> @@ -5476,9 +5481,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>>  
>>  		nr_running = rq->nr_running;
>>  
>> -		/* Bias balancing toward cpus of our domain */
>> -		if (local_group)
>> -			load = target_load(i);
>> +		/* Bias balancing toward dst cpu */
>> +		if (env->dst_cpu == i)
>> +			load = target_load(i, env->sd->imbalance_pct);
> 
> Here you do the same group load bias change as above.
> 
>>  		else
>>  			load = source_load(i);
>>  
>> @@ -5918,14 +5923,6 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>>  		if ((local->idle_cpus < busiest->idle_cpus) &&
>>  		    busiest->sum_nr_running <= busiest->group_weight)
>>  			goto out_balanced;
>> -	} else {
>> -		/*
>> -		 * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
>> -		 * imbalance_pct to be conservative.
>> -		 */
>> -		if (100 * busiest->avg_load <=
>> -				env->sd->imbalance_pct * local->avg_load)
>> -			goto out_balanced;
>>  	}
>>  
>>  force_balance:
> 
> As said my previous replies to this series, I think this problem should
> be solved by fixing the cause of the problem, that is the cpu_load
> calculation, instead of biasing the cpu_load where-ever it is used to
> hide the problem.
> 
> Doing a bit of git archaeology reveals that the cpu_load code goes back
> to 7897986bad8f6cd50d6149345aca7f6480f49464 and that in the original
> patch x86 was using *_idx > 0 for all sched_domain levels except SMT. In
> my opinion that made logical sense. If we are about to change to *_idx=0
> we are removing the main idea behind that code and there needs to be a
> new one. Otherwise, cpu_load doesn't make sense.

The commit is written in 2005. The CFS scheduler merged into kernel in
Oct 2007. it is a too old legacy for us ...
> 
> Morten
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
  2013-12-25 14:58           ` Alex Shi
@ 2014-01-02 16:04             ` Morten Rasmussen
  2014-01-06 13:35               ` Alex Shi
  0 siblings, 1 reply; 34+ messages in thread
From: Morten Rasmussen @ 2014-01-02 16:04 UTC (permalink / raw)
  To: Alex Shi
  Cc: Peter Zijlstra, mingo, vincent.guittot, daniel.lezcano, fweisbec,
	linux, tony.luck, fenghua.yu, tglx, akpm, arjan, pjt,
	fengguang.wu, james.hogan, jason.low2, gregkh, hanjun.guo,
	linux-kernel

On Wed, Dec 25, 2013 at 02:58:26PM +0000, Alex Shi wrote:
> 
> >> From 5cd67d975001edafe2ee820e0be5d86881a23bd6 Mon Sep 17 00:00:00 2001
> >> From: Alex Shi <alex.shi@linaro.org>
> >> Date: Sat, 23 Nov 2013 23:18:09 +0800
> >> Subject: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
> >>
> >> Task migration happens when target just a bit less then source cpu load.
> >> To reduce such situation happens, aggravate the target cpu load with
> >> sd->imbalance_pct/100 in wake_affine.
> >>
> >> In find_idlest/busiest_group, change the aggravate to local cpu only
> >> from old group aggravation.
> >>
> >> on my pandaboard ES.
> >>
> >> 	latest kernel 527d1511310a89		+ whole patchset
> >> hackbench -T -g 10 -f 40
> >> 	23.25"					21.99"
> >> 	23.16"					21.20"
> >> 	24.24"					21.89"
> >> hackbench -p -g 10 -f 40
> >> 	26.52"					21.46"
> >> 	23.89"					22.96"
> >> 	25.65"					22.73"
> >> hackbench -P -g 10 -f 40
> >> 	20.14"					19.72"
> >> 	19.96"					19.10"
> >> 	21.76"					20.03"
> >>
> >> Signed-off-by: Alex Shi <alex.shi@linaro.org>
> >> ---
> >>  kernel/sched/fair.c | 35 ++++++++++++++++-------------------
> >>  1 file changed, 16 insertions(+), 19 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index bccdd89..3623ba4 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -978,7 +978,7 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
> >>  
> >>  static unsigned long weighted_cpuload(const int cpu);
> >>  static unsigned long source_load(int cpu);
> >> -static unsigned long target_load(int cpu);
> >> +static unsigned long target_load(int cpu, int imbalance_pct);
> >>  static unsigned long power_of(int cpu);
> >>  static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
> >>  
> >> @@ -3809,11 +3809,17 @@ static unsigned long source_load(int cpu)
> >>   * Return a high guess at the load of a migration-target cpu weighted
> >>   * according to the scheduling class and "nice" value.
> >>   */
> >> -static unsigned long target_load(int cpu)
> >> +static unsigned long target_load(int cpu, int imbalance_pct)
> >>  {
> >>  	struct rq *rq = cpu_rq(cpu);
> >>  	unsigned long total = weighted_cpuload(cpu);
> >>  
> >> +	/*
> >> +	 * without cpu_load decay, in most of time cpu_load is same as total
> >> +	 * so we need to make target a bit heavier to reduce task migration
> >> +	 */
> >> +	total = total * imbalance_pct / 100;
> >> +
> >>  	if (!sched_feat(LB_BIAS))
> >>  		return total;
> >>  
> >> @@ -4033,7 +4039,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
> >>  	this_cpu  = smp_processor_id();
> >>  	prev_cpu  = task_cpu(p);
> >>  	load	  = source_load(prev_cpu);
> >> -	this_load = target_load(this_cpu);
> >> +	this_load = target_load(this_cpu, 100);
> >>  
> >>  	/*
> >>  	 * If sync wakeup then subtract the (maximum possible)
> >> @@ -4089,7 +4095,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
> >>  
> >>  	if (balanced ||
> >>  	    (this_load <= load &&
> >> -	     this_load + target_load(prev_cpu) <= tl_per_task)) {
> >> +	     this_load + target_load(prev_cpu, 100) <= tl_per_task)) {
> >>  		/*
> >>  		 * This domain has SD_WAKE_AFFINE and
> >>  		 * p is cache cold in this domain, and
> >> @@ -4112,7 +4118,6 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >>  {
> >>  	struct sched_group *idlest = NULL, *group = sd->groups;
> >>  	unsigned long min_load = ULONG_MAX, this_load = 0;
> >> -	int imbalance = 100 + (sd->imbalance_pct-100)/2;
> >>  
> >>  	do {
> >>  		unsigned long load, avg_load;
> >> @@ -4132,10 +4137,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >>  
> >>  		for_each_cpu(i, sched_group_cpus(group)) {
> >>  			/* Bias balancing toward cpus of our domain */
> >> -			if (local_group)
> >> +			if (i == this_cpu)
> > 
> > What is the motivation for changing the local_group load calculation?
> > Now the load contributions of all cpus in the local group, except
> > this_cpu, will contribute more as their contribution (this_load) is
> > determined using target_load() instead.
> 
> This part code 147cbb4bbe99, written in 2005 for x86, at that time, only
> 2 cores(guess no HT at that time) in cpu socket. With the cores number

NUMA support was already present. I guess that means support for systems
with significantly more than two cpus.

> increasing trend, the sched_group become large and large, to give whole
> group this bias value is becoming non-sense. So it looks reasonable to
> just bias this cpu only.

I think the question is whether you want to bias the whole group or
not at wake-up balancing. If you don't change the bias weight and only
apply it to this_cpu, you will be more prone to send the waking task
somewhere else.

> > 
> > If I'm not mistaken, that will lead to more frequent load balancing as
> > the local_group bias has been reduced. That is the opposite of your
> > intentions based on your comment in target_load().
> > 
> >>  				load = source_load(i);
> >>  			else
> >> -				load = target_load(i);
> >> +				load = target_load(i, sd->imbalance_pct);
> > 
> > You scale by sd->imbalance_pct instead of 100+(sd->imbalance_pct-100)/2
> > that you removed above. sd->imbalance_pct may have been arbitrarily
> > chosen in the past, but changing it may affect behavior. 
> 
> the first commit with this line:
> int imbalance = 100 + (sd->imbalance_pct-100)/2;
> is 147cbb4bbe99, that also has no explanation for this imbalance value.

100 + (sd->imbalance_pct-100)/2 is used in several places in sched.c before
147cbb4bbe99. It seems that it was used to encourage balancing in
certain situations such as wake-up, while sd->imbalance_pct was used
unmodified in other situations.

That seems to still be the case in v3.13-rc6. find_busiest_group() and
task_numa_compare() use sd->imbalance_pct, while task_numa_migrate(),
wake_affine(), and find_idlest_group() use 100 + (sd->imbalance_pct -
100)/2. The pattern seems to be to use the reduced imbalance_pct for
wake-up and the full imbalance_pct for balancing of runnable tasks.

AFAICT, using sd->imbalance_pct in find_idlest_group() will increase the
bias towards the local group. However, I would not be comfortable
changing it without understanding the full implications.

> AFAIK, at that time, maybe only 2 LCPU in one socket, so, guess it is
> reason to use half of imbalance_pct.

Based on the comment removed in 147cbb4bbe99 it seems that the reason is
to encourage balancing at wake-up. It doesn't seem to have any relation
to the number of cpus.

> but sched_domain is used widely for many kinds of platform/cpu type, the
> value is arbitrary too.
> Another reason to use it, I scaled any cpu load which is not this_cpu,
> so a bit conservation is to use origin imbalance_pct, not half of it.

The net result is almost the same if the groups have two cpus each.
Stronger for groups with one cpu, and weaker for groups with more than
two cpus.

> > 
> >>  
> >>  			avg_load += load;
> >>  		}
> >> @@ -4151,7 +4156,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >>  		}
> >>  	} while (group = group->next, group != sd->groups);
> >>  
> >> -	if (!idlest || 100*this_load < imbalance*min_load)
> >> +	if (!idlest || this_load < min_load)
> >>  		return NULL;
> >>  	return idlest;
> >>  }
> >> @@ -5476,9 +5481,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >>  
> >>  		nr_running = rq->nr_running;
> >>  
> >> -		/* Bias balancing toward cpus of our domain */
> >> -		if (local_group)
> >> -			load = target_load(i);
> >> +		/* Bias balancing toward dst cpu */
> >> +		if (env->dst_cpu == i)
> >> +			load = target_load(i, env->sd->imbalance_pct);
> > 
> > Here you do the same group load bias change as above.
> > 

I don't think it is right to change the bias here to only apply to the
dst_cpu. rebalance_domains()/load_balance() is balancing groups, not just
pulling tasks to dst_cpu. It must pull enough tasks to balance the
groups, even if it means temporarily overloading dst_cpu. The bias
should therefore apply to the whole group.

> >>  		else
> >>  			load = source_load(i);
> >>  
> >> @@ -5918,14 +5923,6 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> >>  		if ((local->idle_cpus < busiest->idle_cpus) &&
> >>  		    busiest->sum_nr_running <= busiest->group_weight)
> >>  			goto out_balanced;
> >> -	} else {
> >> -		/*
> >> -		 * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
> >> -		 * imbalance_pct to be conservative.
> >> -		 */
> >> -		if (100 * busiest->avg_load <=
> >> -				env->sd->imbalance_pct * local->avg_load)
> >> -			goto out_balanced;
> >>  	}
> >>  
> >>  force_balance:
> > 
> > As said my previous replies to this series, I think this problem should
> > be solved by fixing the cause of the problem, that is the cpu_load
> > calculation, instead of biasing the cpu_load where-ever it is used to
> > hide the problem.
> > 
> > Doing a bit of git archaeology reveals that the cpu_load code goes back
> > to 7897986bad8f6cd50d6149345aca7f6480f49464 and that in the original
> > patch x86 was using *_idx > 0 for all sched_domain levels except SMT. In
> > my opinion that made logical sense. If we are about to change to *_idx=0
> > we are removing the main idea behind that code and there needs to be a
> > new one. Otherwise, cpu_load doesn't make sense.
> 
> The commit is written in 2005. The CFS scheduler merged into kernel in
> Oct 2007. it is a too old legacy for us ...

And that is why I think cpu_load should be reconsidered. It doesn't make
sense to cripple the cpu_load code without fully understanding what it
is doing and then try to hide the problems it causes.

The cpu_load code might be older than CFS but in my opinion it still
makes sense. Changing cpu_load to be a snapshot of weighted_cpuload()
does not.

Morten

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
  2014-01-02 16:04             ` Morten Rasmussen
@ 2014-01-06 13:35               ` Alex Shi
  2014-01-07 12:55                 ` Morten Rasmussen
  0 siblings, 1 reply; 34+ messages in thread
From: Alex Shi @ 2014-01-06 13:35 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, mingo, vincent.guittot, daniel.lezcano, fweisbec,
	linux, tony.luck, fenghua.yu, tglx, akpm, arjan, pjt,
	fengguang.wu, james.hogan, jason.low2, gregkh, hanjun.guo,
	linux-kernel

On 01/03/2014 12:04 AM, Morten Rasmussen wrote:
> On Wed, Dec 25, 2013 at 02:58:26PM +0000, Alex Shi wrote:
>>
>>>> From 5cd67d975001edafe2ee820e0be5d86881a23bd6 Mon Sep 17 00:00:00 2001
>>>> From: Alex Shi <alex.shi@linaro.org>
>>>> Date: Sat, 23 Nov 2013 23:18:09 +0800
>>>> Subject: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
>>>>
>>>> Task migration happens when target just a bit less then source cpu load.
>>>> To reduce such situation happens, aggravate the target cpu load with
>>>> sd->imbalance_pct/100 in wake_affine.
>>>>
>>>> In find_idlest/busiest_group, change the aggravate to local cpu only
>>>> from old group aggravation.
>>>>
>>>> on my pandaboard ES.
>>>>
>>>> 	latest kernel 527d1511310a89		+ whole patchset
>>>> hackbench -T -g 10 -f 40
>>>> 	23.25"					21.99"
>>>> 	23.16"					21.20"
>>>> 	24.24"					21.89"
>>>> hackbench -p -g 10 -f 40
>>>> 	26.52"					21.46"
>>>> 	23.89"					22.96"
>>>> 	25.65"					22.73"
>>>> hackbench -P -g 10 -f 40
>>>> 	20.14"					19.72"
>>>> 	19.96"					19.10"
>>>> 	21.76"					20.03"
>>>>
>>>> Signed-off-by: Alex Shi <alex.shi@linaro.org>
>>>> ---
>>>>  kernel/sched/fair.c | 35 ++++++++++++++++-------------------
>>>>  1 file changed, 16 insertions(+), 19 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index bccdd89..3623ba4 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -978,7 +978,7 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
>>>>  
>>>>  static unsigned long weighted_cpuload(const int cpu);
>>>>  static unsigned long source_load(int cpu);
>>>> -static unsigned long target_load(int cpu);
>>>> +static unsigned long target_load(int cpu, int imbalance_pct);
>>>>  static unsigned long power_of(int cpu);
>>>>  static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
>>>>  
>>>> @@ -3809,11 +3809,17 @@ static unsigned long source_load(int cpu)
>>>>   * Return a high guess at the load of a migration-target cpu weighted
>>>>   * according to the scheduling class and "nice" value.
>>>>   */
>>>> -static unsigned long target_load(int cpu)
>>>> +static unsigned long target_load(int cpu, int imbalance_pct)
>>>>  {
>>>>  	struct rq *rq = cpu_rq(cpu);
>>>>  	unsigned long total = weighted_cpuload(cpu);
>>>>  
>>>> +	/*
>>>> +	 * without cpu_load decay, in most of time cpu_load is same as total
>>>> +	 * so we need to make target a bit heavier to reduce task migration
>>>> +	 */
>>>> +	total = total * imbalance_pct / 100;
>>>> +
>>>>  	if (!sched_feat(LB_BIAS))
>>>>  		return total;
>>>>  
>>>> @@ -4033,7 +4039,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>>>>  	this_cpu  = smp_processor_id();
>>>>  	prev_cpu  = task_cpu(p);
>>>>  	load	  = source_load(prev_cpu);
>>>> -	this_load = target_load(this_cpu);
>>>> +	this_load = target_load(this_cpu, 100);
>>>>  
>>>>  	/*
>>>>  	 * If sync wakeup then subtract the (maximum possible)
>>>> @@ -4089,7 +4095,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>>>>  
>>>>  	if (balanced ||
>>>>  	    (this_load <= load &&
>>>> -	     this_load + target_load(prev_cpu) <= tl_per_task)) {
>>>> +	     this_load + target_load(prev_cpu, 100) <= tl_per_task)) {
>>>>  		/*
>>>>  		 * This domain has SD_WAKE_AFFINE and
>>>>  		 * p is cache cold in this domain, and
>>>> @@ -4112,7 +4118,6 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>>>>  {
>>>>  	struct sched_group *idlest = NULL, *group = sd->groups;
>>>>  	unsigned long min_load = ULONG_MAX, this_load = 0;
>>>> -	int imbalance = 100 + (sd->imbalance_pct-100)/2;
>>>>  
>>>>  	do {
>>>>  		unsigned long load, avg_load;
>>>> @@ -4132,10 +4137,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>>>>  
>>>>  		for_each_cpu(i, sched_group_cpus(group)) {
>>>>  			/* Bias balancing toward cpus of our domain */
>>>> -			if (local_group)
>>>> +			if (i == this_cpu)
>>>
>>> What is the motivation for changing the local_group load calculation?
>>> Now the load contributions of all cpus in the local group, except
>>> this_cpu, will contribute more as their contribution (this_load) is
>>> determined using target_load() instead.
>>
>> This part code 147cbb4bbe99, written in 2005 for x86, at that time, only
>> 2 cores(guess no HT at that time) in cpu socket. With the cores number
> 
> NUMA support was already present. I guess that means support for systems
> with significantly more than two cpus.

Thanks a lot for the comments, Morten!
find_idlest_group() is used in select_task_rq_fair(), which is designed
to select a cpu near the old cpu when waking or forking tasks. So in the
common case the selected cpu is rarely in another NUMA node.
> 
>> increasing trend, the sched_group become large and large, to give whole
>> group this bias value is becoming non-sense. So it looks reasonable to
>> just bias this cpu only.
> 
> I think that is question whether you want to bias the whole group or
> not at wake-up balancing. If you don't change the bias weight and only
> apply it to this_cpu, you will be more prone to send the waking task
> somewhere else.

We don't need to follow the old logic exactly. I guess people will be
happier to see simpler code/logic with better benchmark
performance. :)
> 
>>>
>>> If I'm not mistaken, that will lead to more frequent load balancing as
>>> the local_group bias has been reduced. That is the opposite of your
>>> intentions based on your comment in target_load().
>>>
>>>>  				load = source_load(i);
>>>>  			else
>>>> -				load = target_load(i);
>>>> +				load = target_load(i, sd->imbalance_pct);
>>>
>>> You scale by sd->imbalance_pct instead of 100+(sd->imbalance_pct-100)/2
>>> that you removed above. sd->imbalance_pct may have been arbitrarily
>>> chosen in the past, but changing it may affect behavior. 
>>
>> the first commit with this line:
>> int imbalance = 100 + (sd->imbalance_pct-100)/2;
>> is 147cbb4bbe99, that also has no explanation for this imbalance value.
> 
> 100 + (sd->imbalance_pct-100)/2 is used several places in sched.c before
> 147cbb4bbe99. It seems that it was used to encourage balancing in
> certain situations such as wake-up, while sd->imbalance_pct was used
> unmodified in other situations.
> 
> That seems to still be the case in v3.13-rc6. find_busiest_group() and
> task_numa_compare() use sd->imbalance_pct, while task_numa_migrate(),
> wake_affine(), and find_idlest_group() use 100 + (sd->imbalance_pct -
> 100)/2. The pattern seems to be to use the reduced imbalance_pct for
> wake-up and the full imbalance_pct for balancing of runnable tasks.
> 
> AFAICT, using sd->imbalance_pct in find_idlest_group() will increase the
> bias towards the local group. However, I would not be comfortable
> changing it without understanding the full implications.
> 
>> AFAIK, at that time, maybe only 2 LCPU in one socket, so, guess it is
>> reason to use half of imbalance_pct.
> 
> Based on the comment removed in 147cbb4bbe99 it seems that the reason is
> to encourage balancing at wake-up. It doesn't seem to have any relation
> to the number of cpus.
> 
>> but sched_domain is used widely for many kinds of platform/cpu type, the
>> value is arbitrary too.
>> Another reason to use it, I scaled any cpu load which is not this_cpu,
>> so a bit conservation is to use origin imbalance_pct, not half of it.
> 
> The net result is almost the same if the groups have two cpus each.
> Stronger for groups with one cpu, and weaker for groups with more than
> two cpus.
> 
>>>
>>>>  
>>>>  			avg_load += load;
>>>>  		}
>>>> @@ -4151,7 +4156,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>>>>  		}
>>>>  	} while (group = group->next, group != sd->groups);
>>>>  
>>>> -	if (!idlest || 100*this_load < imbalance*min_load)
>>>> +	if (!idlest || this_load < min_load)
>>>>  		return NULL;
>>>>  	return idlest;
>>>>  }
>>>> @@ -5476,9 +5481,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>>>>  
>>>>  		nr_running = rq->nr_running;
>>>>  
>>>> -		/* Bias balancing toward cpus of our domain */
>>>> -		if (local_group)
>>>> -			load = target_load(i);
>>>> +		/* Bias balancing toward dst cpu */
>>>> +		if (env->dst_cpu == i)
>>>> +			load = target_load(i, env->sd->imbalance_pct);
>>>
>>> Here you do the same group load bias change as above.
>>>
> 
> I don't think it is right to change the bias here to only apply to the
> dst_cpu. rebalance_domains()/load_balance() is balancing groups not just
> pulling task to dst_cpu. It must pull enough tasks to balance the
> groups, even if it means temporarily overloading dst_cpu. The bias
> should therefore apply to the whole group.

I have a different understanding of load_balance(). If the dst_cpu gets
overloaded, another cpu will need to move some load back. That causes
more expensive task context switches.
> 
>>>>  		else
>>>>  			load = source_load(i);
>>>>  
>>>> @@ -5918,14 +5923,6 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>>>>  		if ((local->idle_cpus < busiest->idle_cpus) &&
>>>>  		    busiest->sum_nr_running <= busiest->group_weight)
>>>>  			goto out_balanced;
>>>> -	} else {
>>>> -		/*
>>>> -		 * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
>>>> -		 * imbalance_pct to be conservative.
>>>> -		 */
>>>> -		if (100 * busiest->avg_load <=
>>>> -				env->sd->imbalance_pct * local->avg_load)
>>>> -			goto out_balanced;
>>>>  	}
>>>>  
>>>>  force_balance:
>>>
>>> As said my previous replies to this series, I think this problem should
>>> be solved by fixing the cause of the problem, that is the cpu_load
>>> calculation, instead of biasing the cpu_load where-ever it is used to
>>> hide the problem.
>>>
>>> Doing a bit of git archaeology reveals that the cpu_load code goes back
>>> to 7897986bad8f6cd50d6149345aca7f6480f49464 and that in the original
>>> patch x86 was using *_idx > 0 for all sched_domain levels except SMT. In
>>> my opinion that made logical sense. If we are about to change to *_idx=0
>>> we are removing the main idea behind that code and there needs to be a
>>> new one. Otherwise, cpu_load doesn't make sense.
>>
>> The commit is written in 2005. The CFS scheduler merged into kernel in
>> Oct 2007. it is a too old legacy for us ...
> 
> And that is why I think cpu_load should be reconsidered. It doesn't make
> sense to cripple the cpu_load code without fully understanding what it
> is doing and then try to hide the problems it causes.

I understand your concern, Morten.
But if no one can explain well what this code is for, it means it has run
out of control and should be out of the kernel. On the contrary, the new
code has better performance on both arm and x86 -- I'd like to take
Fengguang's testing results as an advantage, since task migrations and
context switches are reduced. So, considering the above reasons, I don't
see how to refuse better-performing code just to keep the unclean legacy.
> 
> The cpu_load code might be older than CFS but in my opinion it still
> makes sense. Changing cpu_load to be a snapshot of weighted_cpuload()
> does not.


> 
> Morten
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
  2014-01-06 13:35               ` Alex Shi
@ 2014-01-07 12:55                 ` Morten Rasmussen
  2014-01-07 12:59                   ` Peter Zijlstra
  0 siblings, 1 reply; 34+ messages in thread
From: Morten Rasmussen @ 2014-01-07 12:55 UTC (permalink / raw)
  To: Alex Shi
  Cc: Peter Zijlstra, mingo, vincent.guittot, daniel.lezcano, fweisbec,
	linux, tony.luck, fenghua.yu, tglx, akpm, arjan, pjt,
	fengguang.wu, james.hogan, jason.low2, gregkh, hanjun.guo,
	linux-kernel

On Mon, Jan 06, 2014 at 01:35:39PM +0000, Alex Shi wrote:
> On 01/03/2014 12:04 AM, Morten Rasmussen wrote:
> > On Wed, Dec 25, 2013 at 02:58:26PM +0000, Alex Shi wrote:
> >>
> >>>> From 5cd67d975001edafe2ee820e0be5d86881a23bd6 Mon Sep 17 00:00:00 2001
> >>>> From: Alex Shi <alex.shi@linaro.org>
> >>>> Date: Sat, 23 Nov 2013 23:18:09 +0800
> >>>> Subject: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
> >>>>
> >>>> Task migration happens when target just a bit less then source cpu load.
> >>>> To reduce such situation happens, aggravate the target cpu load with
> >>>> sd->imbalance_pct/100 in wake_affine.
> >>>>
> >>>> In find_idlest/busiest_group, change the aggravate to local cpu only
> >>>> from old group aggravation.
> >>>>
> >>>> on my pandaboard ES.
> >>>>
> >>>>    latest kernel 527d1511310a89            + whole patchset
> >>>> hackbench -T -g 10 -f 40
> >>>>    23.25"                                  21.99"
> >>>>    23.16"                                  21.20"
> >>>>    24.24"                                  21.89"
> >>>> hackbench -p -g 10 -f 40
> >>>>    26.52"                                  21.46"
> >>>>    23.89"                                  22.96"
> >>>>    25.65"                                  22.73"
> >>>> hackbench -P -g 10 -f 40
> >>>>    20.14"                                  19.72"
> >>>>    19.96"                                  19.10"
> >>>>    21.76"                                  20.03"
> >>>>
> >>>> Signed-off-by: Alex Shi <alex.shi@linaro.org>
> >>>> ---
> >>>>  kernel/sched/fair.c | 35 ++++++++++++++++-------------------
> >>>>  1 file changed, 16 insertions(+), 19 deletions(-)
> >>>>
> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>> index bccdd89..3623ba4 100644
> >>>> --- a/kernel/sched/fair.c
> >>>> +++ b/kernel/sched/fair.c
> >>>> @@ -978,7 +978,7 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
> >>>>
> >>>>  static unsigned long weighted_cpuload(const int cpu);
> >>>>  static unsigned long source_load(int cpu);
> >>>> -static unsigned long target_load(int cpu);
> >>>> +static unsigned long target_load(int cpu, int imbalance_pct);
> >>>>  static unsigned long power_of(int cpu);
> >>>>  static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
> >>>>
> >>>> @@ -3809,11 +3809,17 @@ static unsigned long source_load(int cpu)
> >>>>   * Return a high guess at the load of a migration-target cpu weighted
> >>>>   * according to the scheduling class and "nice" value.
> >>>>   */
> >>>> -static unsigned long target_load(int cpu)
> >>>> +static unsigned long target_load(int cpu, int imbalance_pct)
> >>>>  {
> >>>>    struct rq *rq = cpu_rq(cpu);
> >>>>    unsigned long total = weighted_cpuload(cpu);
> >>>>
> >>>> +  /*
> >>>> +   * without cpu_load decay, in most of time cpu_load is same as total
> >>>> +   * so we need to make target a bit heavier to reduce task migration
> >>>> +   */
> >>>> +  total = total * imbalance_pct / 100;
> >>>> +
> >>>>    if (!sched_feat(LB_BIAS))
> >>>>            return total;
> >>>>
> >>>> @@ -4033,7 +4039,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
> >>>>    this_cpu  = smp_processor_id();
> >>>>    prev_cpu  = task_cpu(p);
> >>>>    load      = source_load(prev_cpu);
> >>>> -  this_load = target_load(this_cpu);
> >>>> +  this_load = target_load(this_cpu, 100);
> >>>>
> >>>>    /*
> >>>>     * If sync wakeup then subtract the (maximum possible)
> >>>> @@ -4089,7 +4095,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
> >>>>
> >>>>    if (balanced ||
> >>>>        (this_load <= load &&
> >>>> -       this_load + target_load(prev_cpu) <= tl_per_task)) {
> >>>> +       this_load + target_load(prev_cpu, 100) <= tl_per_task)) {
> >>>>            /*
> >>>>             * This domain has SD_WAKE_AFFINE and
> >>>>             * p is cache cold in this domain, and
> >>>> @@ -4112,7 +4118,6 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >>>>  {
> >>>>    struct sched_group *idlest = NULL, *group = sd->groups;
> >>>>    unsigned long min_load = ULONG_MAX, this_load = 0;
> >>>> -  int imbalance = 100 + (sd->imbalance_pct-100)/2;
> >>>>
> >>>>    do {
> >>>>            unsigned long load, avg_load;
> >>>> @@ -4132,10 +4137,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >>>>
> >>>>            for_each_cpu(i, sched_group_cpus(group)) {
> >>>>                    /* Bias balancing toward cpus of our domain */
> >>>> -                  if (local_group)
> >>>> +                  if (i == this_cpu)
> >>>
> >>> What is the motivation for changing the local_group load calculation?
> >>> Now the load contributions of all cpus in the local group, except
> >>> this_cpu, will contribute more as their contribution (this_load) is
> >>> determined using target_load() instead.
> >>
> >> This part code 147cbb4bbe99, written in 2005 for x86, at that time, only
> >> 2 cores(guess no HT at that time) in cpu socket. With the cores number
> >
> > NUMA support was already present. I guess that means support for systems
> > with significantly more than two cpus.
> 
> Thanks a lot for comments, Morten!
> the find_idlest_group used in select_task_rq_fair, that designed to
> select cpu near by old cpu whenever for wake or fork tasks. So in common
> case, the selected cpu is rarely on another NUMA.

Yes. My point was just that there was support for more than two cpus
when this code was introduced, and groups could therefore have more than
two cpus in them.

> >
> >> increasing trend, the sched_group become large and large, to give whole
> >> group this bias value is becoming non-sense. So it looks reasonable to
> >> just bias this cpu only.
> >
> > I think that is question whether you want to bias the whole group or
> > not at wake-up balancing. If you don't change the bias weight and only
> > apply it to this_cpu, you will be more prone to send the waking task
> > somewhere else.
> 
> We don't need to follow the exactly old logical. Guess people will be
> more happier to see the code/logical simple with better benchmark
> performance. :)

I'm not against changes, especially if they make things simpler or make
more sense. But IMHO, that is not the same as just removing code and
showing benchmark improvements. There must be an explanation of why the
new code works better and the resulting code needs to make sense. 

> >
> >>>
> >>> If I'm not mistaken, that will lead to more frequent load balancing as
> >>> the local_group bias has been reduced. That is the opposite of your
> >>> intentions based on your comment in target_load().
> >>>
> >>>>                            load = source_load(i);
> >>>>                    else
> >>>> -                          load = target_load(i);
> >>>> +                          load = target_load(i, sd->imbalance_pct);
> >>>
> >>> You scale by sd->imbalance_pct instead of 100+(sd->imbalance_pct-100)/2
> >>> that you removed above. sd->imbalance_pct may have been arbitrarily
> >>> chosen in the past, but changing it may affect behavior.
> >>
> >> the first commit with this line:
> >> int imbalance = 100 + (sd->imbalance_pct-100)/2;
> >> is 147cbb4bbe99, that also has no explanation for this imbalance value.
> >
> > 100 + (sd->imbalance_pct-100)/2 is used several places in sched.c before
> > 147cbb4bbe99. It seems that it was used to encourage balancing in
> > certain situations such as wake-up, while sd->imbalance_pct was used
> > unmodified in other situations.
> >
> > That seems to still be the case in v3.13-rc6. find_busiest_group() and
> > task_numa_compare() use sd->imbalance_pct, while task_numa_migrate(),
> > wake_affine(), and find_idlest_group() use 100 + (sd->imbalance_pct -
> > 100)/2. The pattern seems to be to use the reduced imbalance_pct for
> > wake-up and the full imbalance_pct for balancing of runnable tasks.
> >
> > AFAICT, using sd->imbalance_pct in find_idlest_group() will increase the
> > bias towards the local group. However, I would not be comfortable
> > changing it without understanding the full implications.
> >
> >> AFAIK, at that time, maybe only 2 LCPU in one socket, so, guess it is
> >> reason to use half of imbalance_pct.
> >
> > Based on the comment removed in 147cbb4bbe99 it seems that the reason is
> > to encourage balancing at wake-up. It doesn't seem to have any relation
> > to the number of cpus.
> >
> >> but sched_domain is used widely for many kinds of platform/cpu type, the
> >> value is arbitrary too.
> >> Another reason to use it, I scaled any cpu load which is not this_cpu,
> >> so a bit conservation is to use origin imbalance_pct, not half of it.
> >
> > The net result is almost the same if the groups have two cpus each.
> > Stronger for groups with one cpu, and weaker for groups with more than
> > two cpus.
> >
> >>>
> >>>>
> >>>>                    avg_load += load;
> >>>>            }
> >>>> @@ -4151,7 +4156,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >>>>            }
> >>>>    } while (group = group->next, group != sd->groups);
> >>>>
> >>>> -  if (!idlest || 100*this_load < imbalance*min_load)
> >>>> +  if (!idlest || this_load < min_load)
> >>>>            return NULL;
> >>>>    return idlest;
> >>>>  }
> >>>> @@ -5476,9 +5481,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >>>>
> >>>>            nr_running = rq->nr_running;
> >>>>
> >>>> -          /* Bias balancing toward cpus of our domain */
> >>>> -          if (local_group)
> >>>> -                  load = target_load(i);
> >>>> +          /* Bias balancing toward dst cpu */
> >>>> +          if (env->dst_cpu == i)
> >>>> +                  load = target_load(i, env->sd->imbalance_pct);
> >>>
> >>> Here you do the same group load bias change as above.
> >>>
> >
> > I don't think it is right to change the bias here to only apply to the
> > dst_cpu. rebalance_domains()/load_balance() is balancing groups not just
> > pulling task to dst_cpu. It must pull enough tasks to balance the
> > groups, even if it means temporarily overloading dst_cpu. The bias
> > should therefore apply to the whole group.
> 
> I have a different understanding of load_balance(). If the dst_cpu gets
> overloaded, other cpus need to move some load back. That causes more
> expensive task context switches.

My understanding is that should_we_balance() decides which cpu is
eligible for doing the load balancing for a given domain (and the
domains above). That is, only one cpu in a group is allowed to load
balance between the local group and other groups. That cpu would
therefore be responsible for pulling enough load that the groups are
balanced even if it means temporarily overloading itself. The other cpus
in the group will take care of load balancing the extra load within the
local group later.
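
In code terms, I read it roughly as the sketch below (simplified from
memory: it ignores the NEWLY_IDLE shortcut and the balance-mask details,
and the helper name is made up, so it is not the literal fair.c code):

/*
 * Simplified sketch: only one cpu of the local group -- the first idle
 * one, or the designated balance cpu if none is idle -- is allowed to
 * pull load from other groups, even if that overloads it temporarily.
 */
static int is_inter_group_balance_cpu(struct lb_env *env)
{
	struct sched_group *sg = env->sd->groups;	/* local group */
	int cpu, balance_cpu = -1;

	for_each_cpu_and(cpu, sched_group_cpus(sg), env->cpus) {
		if (idle_cpu(cpu)) {
			balance_cpu = cpu;
			break;
		}
	}

	if (balance_cpu == -1)
		balance_cpu = group_balance_cpu(sg);

	/* the other cpus only redistribute within the local group later */
	return balance_cpu == env->dst_cpu;
}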

If the balance cpu didn't overload itself it wouldn't always be able to
balance the groups, especially for large groups.

If my understanding is right, I think it is wrong to not bias the entire
target group.

> >
> >>>>            else
> >>>>                    load = source_load(i);
> >>>>
> >>>> @@ -5918,14 +5923,6 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> >>>>            if ((local->idle_cpus < busiest->idle_cpus) &&
> >>>>                busiest->sum_nr_running <= busiest->group_weight)
> >>>>                    goto out_balanced;
> >>>> -  } else {
> >>>> -          /*
> >>>> -           * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
> >>>> -           * imbalance_pct to be conservative.
> >>>> -           */
> >>>> -          if (100 * busiest->avg_load <=
> >>>> -                          env->sd->imbalance_pct * local->avg_load)
> >>>> -                  goto out_balanced;
> >>>>    }
> >>>>
> >>>>  force_balance:
> >>>
> >>> As said my previous replies to this series, I think this problem should
> >>> be solved by fixing the cause of the problem, that is the cpu_load
> >>> calculation, instead of biasing the cpu_load where-ever it is used to
> >>> hide the problem.
> >>>
> >>> Doing a bit of git archaeology reveals that the cpu_load code goes back
> >>> to 7897986bad8f6cd50d6149345aca7f6480f49464 and that in the original
> >>> patch x86 was using *_idx > 0 for all sched_domain levels except SMT. In
> >>> my opinion that made logical sense. If we are about to change to *_idx=0
> >>> we are removing the main idea behind that code and there needs to be a
> >>> new one. Otherwise, cpu_load doesn't make sense.
> >>
> >> The commit is written in 2005. The CFS scheduler merged into kernel in
> >> Oct 2007. it is a too old legacy for us ...
> >
> > And that is why I think cpu_load should be reconsidered. It doesn't make
> > sense to cripple the cpu_load code without fully understanding what it
> > is doing and then try to hide the problems it causes.
> 
> I understand your concern, Morten.
> But if no one can explain well what this code is for, it means it has run
> out of control and should be out of the kernel.

Fully agree. That is why I think all the users of cpu_load must be
considered as well. It doesn't make sense to rip out most of the cpu_load
code and change its behaviour without considering the code that uses
cpu_load.

I like the idea of cleaning up the cpu_load stuff, but it must be
replaced with something that makes sense. Otherwise, we just save the
problems for later and carry around unexplainable code until it happens.

> On the contrary, the new
> code has better performance on both arm and x86 -- I'd like to take
> Fengguang's testing results as an advantage, since task migrations and
> context switches are reduced. So, considering the above reasons, I don't
> see how to refuse better-performing code just to keep the unclean legacy.

I may have missed something, but I don't understand the reason for the
performance improvements that you are reporting. I see better numbers
for a few benchmarks, but I still don't understand why the code makes
sense after the cleanup. If we don't understand why it works, we cannot
be sure that it doesn't harm other benchmarks. There is always a chance
that we miss something but, IMHO, not having any idea to begin with
increases the chances for problems later significantly. So why not get
to the bottom of the problem of cleaning up cpu_load?

Have you done more extensive benchmarking? Have you seen any regressions
in other benchmarks?

Morten

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
  2014-01-07 12:55                 ` Morten Rasmussen
@ 2014-01-07 12:59                   ` Peter Zijlstra
  2014-01-07 13:15                     ` Peter Zijlstra
  2014-01-08 14:15                     ` Alex Shi
  0 siblings, 2 replies; 34+ messages in thread
From: Peter Zijlstra @ 2014-01-07 12:59 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Alex Shi, mingo, vincent.guittot, daniel.lezcano, fweisbec,
	linux, tony.luck, fenghua.yu, tglx, akpm, arjan, pjt,
	fengguang.wu, james.hogan, jason.low2, gregkh, hanjun.guo,
	linux-kernel

On Tue, Jan 07, 2014 at 12:55:18PM +0000, Morten Rasmussen wrote:
> My understanding is that should_we_balance() decides which cpu is
> eligible for doing the load balancing for a given domain (and the
> domains above). That is, only one cpu in a group is allowed to load
> balance between the local group and other groups. That cpu would
> therefore be reponsible for pulling enough load that the groups are
> balanced even if it means temporarily overloading itself. The other cpus
> in the group will take care of load balancing the extra load within the
> local group later.

Correct.

> I may have missed something, but I don't understand the reason for the
> performance improvements that you are reporting. I see better numbers
> for a few benchmarks, but I still don't understand why the code makes
> sense after the cleanup. If we don't understand why it works, we cannot
> be sure that it doesn't harm other benchmarks. There is always a chance
> that we miss something but, IMHO, not having any idea to begin with
> increases the chances for problems later significantly. So why not get
> to the bottom of the problem of cleaning up cpu_load?
> 
> Have you done more extensive benchmarking? Have you seen any regressions
> in other benchmarks?

I only remember hackbench numbers, and that generally fares well with a
more aggressive balancer: since it has no actual work to speak of, the
migration penalty is very low, and because there's a metric ton of tasks,
the aggressive leveling makes for more coherent 'throughput'.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
  2014-01-07 12:59                   ` Peter Zijlstra
@ 2014-01-07 13:15                     ` Peter Zijlstra
  2014-01-07 13:32                       ` Vincent Guittot
  2014-01-07 15:16                       ` Morten Rasmussen
  2014-01-08 14:15                     ` Alex Shi
  1 sibling, 2 replies; 34+ messages in thread
From: Peter Zijlstra @ 2014-01-07 13:15 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Alex Shi, mingo, vincent.guittot, daniel.lezcano, fweisbec,
	linux, tony.luck, fenghua.yu, tglx, akpm, arjan, pjt,
	fengguang.wu, james.hogan, jason.low2, gregkh, hanjun.guo,
	linux-kernel

On Tue, Jan 07, 2014 at 01:59:30PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 12:55:18PM +0000, Morten Rasmussen wrote:
> > My understanding is that should_we_balance() decides which cpu is
> > eligible for doing the load balancing for a given domain (and the
> > domains above). That is, only one cpu in a group is allowed to load
> > balance between the local group and other groups. That cpu would
> > therefore be reponsible for pulling enough load that the groups are
> > balanced even if it means temporarily overloading itself. The other cpus
> > in the group will take care of load balancing the extra load within the
> > local group later.
> 
> Correct.

On that; one of the things I wanted to (and previously did attempt but
failed) is trying to rotate this cpu. Currently it's always the first cpu
(of the group) and that gives a noticeable bias.

If we could slowly rotate the cpu that does this, that would alleviate
both the load and cost bias.

One thing I was thinking of is keeping a global counter; maybe
 'x := jiffies >> n'
might be good enough, and then using the 'x % nr_cpus_in_group'-th cpu
instead.
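
A completely untested sketch of what I mean (the shift width and the
helper name are made up, just to show the idea):

/*
 * Untested sketch: slowly rotate which cpu of the group gets to do the
 * full balance, instead of always picking the first cpu of the group.
 */
static int rotating_balance_cpu(struct sched_group *sg)
{
	unsigned long x = jiffies >> 10;	/* 'n' = 10, arbitrary */
	unsigned int nth = x % sg->group_weight;
	int cpu;

	for_each_cpu(cpu, sched_group_cpus(sg)) {
		if (!nth--)
			return cpu;
	}

	return cpumask_first(sched_group_cpus(sg));
}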

Then again, these are micro issues and not a lot of people complain
about this.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
  2014-01-07 13:15                     ` Peter Zijlstra
@ 2014-01-07 13:32                       ` Vincent Guittot
  2014-01-07 13:40                         ` Peter Zijlstra
  2014-01-07 15:16                       ` Morten Rasmussen
  1 sibling, 1 reply; 34+ messages in thread
From: Vincent Guittot @ 2014-01-07 13:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Morten Rasmussen, Alex Shi, mingo, daniel.lezcano, fweisbec,
	linux, tony.luck, fenghua.yu, tglx, akpm, arjan, pjt,
	fengguang.wu, james.hogan, jason.low2, gregkh, hanjun.guo,
	linux-kernel

On 7 January 2014 14:15, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jan 07, 2014 at 01:59:30PM +0100, Peter Zijlstra wrote:
>> On Tue, Jan 07, 2014 at 12:55:18PM +0000, Morten Rasmussen wrote:
>> > My understanding is that should_we_balance() decides which cpu is
>> > eligible for doing the load balancing for a given domain (and the
>> > domains above). That is, only one cpu in a group is allowed to load
>> > balance between the local group and other groups. That cpu would
>> > therefore be reponsible for pulling enough load that the groups are
>> > balanced even if it means temporarily overloading itself. The other cpus
>> > in the group will take care of load balancing the extra load within the
>> > local group later.
>>
>> Correct.
>
> On that; one of the things I wanted to (and previously did attempt but
> failed) is trying to rotate this cpu. Currently its always the first cpu
> (of the group) and that gives a noticeable bias.

Isn't the current policy (the 1st idle cpu has priority) a good
enough way to rotate the cpus? Do you need the rotation for the loaded
use case too?

>
> If we could slowly rotate the cpu that does this that would alleviate
> both the load and cost bias.
>
> One thing I was thinking of is keeping a global counter maybe:
>  'x := jiffies >> n'
> might be good enough and using the 'x % nr_cpus_in_group'-th cpu
> instead.
>
> Then again, these are micro issue and not a lot of people complain
> about this.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
  2014-01-07 13:32                       ` Vincent Guittot
@ 2014-01-07 13:40                         ` Peter Zijlstra
  0 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2014-01-07 13:40 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Morten Rasmussen, Alex Shi, mingo, daniel.lezcano, fweisbec,
	linux, tony.luck, fenghua.yu, tglx, akpm, arjan, pjt,
	fengguang.wu, james.hogan, jason.low2, gregkh, hanjun.guo,
	linux-kernel

On Tue, Jan 07, 2014 at 02:32:07PM +0100, Vincent Guittot wrote:
> On 7 January 2014 14:15, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, Jan 07, 2014 at 01:59:30PM +0100, Peter Zijlstra wrote:
> >> On Tue, Jan 07, 2014 at 12:55:18PM +0000, Morten Rasmussen wrote:
> >> > My understanding is that should_we_balance() decides which cpu is
> >> > eligible for doing the load balancing for a given domain (and the
> >> > domains above). That is, only one cpu in a group is allowed to load
> >> > balance between the local group and other groups. That cpu would
> >> > therefore be reponsible for pulling enough load that the groups are
> >> > balanced even if it means temporarily overloading itself. The other cpus
> >> > in the group will take care of load balancing the extra load within the
> >> > local group later.
> >>
> >> Correct.
> >
> > On that; one of the things I wanted to (and previously did attempt but
> > failed) is trying to rotate this cpu. Currently its always the first cpu
> > (of the group) and that gives a noticeable bias.
> 
> Isn't the current policy (the 1st idle cpu has priority) a good
> enough way to rotate the cpus? Do you need the rotation for the loaded
> use case too?

Yeah, it's for the fully loaded case. And like I said, there haven't been
many complaints about this.

The 'problem' is that it's always the same cpu that does the most expensive
full machine balance; and it is always that cpu that gains the extra load
to redistribute in the group. So it's penalized twice.

Like I said, a really minor issue. Just something I thought I'd mention.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
  2014-01-07 13:15                     ` Peter Zijlstra
  2014-01-07 13:32                       ` Vincent Guittot
@ 2014-01-07 15:16                       ` Morten Rasmussen
  2014-01-07 20:37                         ` Peter Zijlstra
  1 sibling, 1 reply; 34+ messages in thread
From: Morten Rasmussen @ 2014-01-07 15:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alex Shi, mingo, vincent.guittot, daniel.lezcano, fweisbec,
	linux, tony.luck, fenghua.yu, tglx, akpm, arjan, pjt,
	fengguang.wu, james.hogan, jason.low2, gregkh, hanjun.guo,
	linux-kernel

On Tue, Jan 07, 2014 at 01:15:23PM +0000, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 01:59:30PM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 07, 2014 at 12:55:18PM +0000, Morten Rasmussen wrote:
> > > My understanding is that should_we_balance() decides which cpu is
> > > eligible for doing the load balancing for a given domain (and the
> > > domains above). That is, only one cpu in a group is allowed to load
> > > balance between the local group and other groups. That cpu would
> > > therefore be reponsible for pulling enough load that the groups are
> > > balanced even if it means temporarily overloading itself. The other cpus
> > > in the group will take care of load balancing the extra load within the
> > > local group later.
> > 
> > Correct.
> 
> On that; one of the things I wanted to (and previously did attempt but
> failed) is trying to rotate this cpu. Currently its always the first cpu
> (of the group) and that gives a noticeable bias.
> 
> If we could slowly rotate the cpu that does this that would alleviate
> both the load and cost bias.

From a load perspective, wouldn't it be better to pick the least loaded
cpu in the group? It is not cheap to implement, but in theory it should
give less balancing within the group later and less unfairness until it
happens.

Rotating the cpu is probably good enough for most cases and certainly
easier to implement.

> 
> One thing I was thinking of is keeping a global counter maybe:
>  'x := jiffies >> n'
> might be good enough and using the 'x % nr_cpus_in_group'-th cpu
> instead.
> 
> Then again, these are micro issue and not a lot of people complain
> about this.

The bias continues after the first round of load balancing by the other
cpus?

Pulling everything to one cpu is not ideal from a performance point of
view. You lose some available cpu cycles until the balance settles.
However, it is not easy to do better and maintain scalability at the
same time.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
  2014-01-07 15:16                       ` Morten Rasmussen
@ 2014-01-07 20:37                         ` Peter Zijlstra
  0 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2014-01-07 20:37 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Alex Shi, mingo, vincent.guittot, daniel.lezcano, fweisbec,
	linux, tony.luck, fenghua.yu, tglx, akpm, arjan, pjt,
	fengguang.wu, james.hogan, jason.low2, gregkh, hanjun.guo,
	linux-kernel

On Tue, Jan 07, 2014 at 03:16:32PM +0000, Morten Rasmussen wrote:
> From a load perspective wouldn't it be better to pick the least loaded
> cpu in the group? It is not cheap to implement, but in theory it should
> give less balancing within the group later an less unfairness until it
> happens.

I tried that; see 04f733b4afac5dc93ae9b0a8703c60b87def491e for why it
doesn't work.

> Rotating the cpu is probably good enough for most cases and certainly
> easier to implement.

Indeed.

> The bias continues after they first round of load balance by the other
> cpus?

The cost, yes. Even when perfectly balanced, we still get to iterate the
entire machine computing s[gd]_lb_stats to find out we're good and don't
need to move tasks about.

> Pulling everything to one cpu is not ideal from a performance point of
> view. You loose some available cpu cycles until the balance settles.
> However, it is not easy to do better and maintain scalability at the
> same time.

Right, it's part of the cost we pay for scaling better. And rotating this
cost around a bit would alleviate the obvious bias.



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] sched: bias to target cpu load to reduce task moving
  2014-01-07 12:59                   ` Peter Zijlstra
  2014-01-07 13:15                     ` Peter Zijlstra
@ 2014-01-08 14:15                     ` Alex Shi
  1 sibling, 0 replies; 34+ messages in thread
From: Alex Shi @ 2014-01-08 14:15 UTC (permalink / raw)
  To: Peter Zijlstra, Morten Rasmussen
  Cc: mingo, vincent.guittot, daniel.lezcano, fweisbec, linux,
	tony.luck, fenghua.yu, tglx, akpm, arjan, pjt, fengguang.wu,
	james.hogan, jason.low2, gregkh, hanjun.guo, linux-kernel

On 01/07/2014 08:59 PM, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 12:55:18PM +0000, Morten Rasmussen wrote:
>> My understanding is that should_we_balance() decides which cpu is
>> eligible for doing the load balancing for a given domain (and the
>> domains above). That is, only one cpu in a group is allowed to load
>> balance between the local group and other groups. That cpu would
>> therefore be reponsible for pulling enough load that the groups are
>> balanced even if it means temporarily overloading itself. The other cpus
>> in the group will take care of load balancing the extra load within the
>> local group later.

Thanks to both of you for the comments and explanations! :)

I know this patch's change is arguable and my attempt isn't tuned well yet. But I believe I am going in the right direction. :) Let me explain this patch a bit more.

First, cpu_load includes the history load info, so repeatedly decaying and using the history load is kind of nonsense. And the old source/target_load pick either the history load or the current load just according to max/min, which also lacks a good explanation.
Second, we already consider the bias in source/target_load, but we still use imbalance_pct as a last check when finding the idlest/busiest group. That is also a kind of redundant job. If we account for the source/target bias, we'd better not use imbalance_pct again.
And last, imbalance_pct gets overused as the core count per cpu increases quickly. Take find_busiest_group:
Assume a domain with 2 groups, each group having 8 cpu cores.
    The target group will be biased by 8 * (imbalance_pct - 100)
				= 8 * (125 - 100) = 200.
     Since each cpu is biased by 0.25 times its load, 8 cpus together bias the group by 2 times the average cpu load between groups. That is too much. But if there are only 2 cores in a cpu group (the common case when this code was introduced), the bias is just 2 * 25 / 100 = 0.5 times the average cpu load.
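
Just to illustrate the arithmetic above, a hypothetical helper (illustrative only, not part of the patchset):

/*
 * Illustrative only: extra bias a group gets when every cpu in it is
 * scaled by imbalance_pct, in percent of one average cpu load.
 * 8 cpus at imbalance_pct = 125 -> 200, i.e. 2x one cpu load;
 * 2 cpus at imbalance_pct = 125 ->  50, i.e. 0.5x one cpu load.
 */
static unsigned int group_bias_pct(unsigned int nr_cpus, unsigned int imbalance_pct)
{
	return nr_cpus * (imbalance_pct - 100);
}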

Now this patchset removes the cpu_load array to avoid repeated history decay, reorganizes the imbalance_pct usage to avoid redundant balance bias, and reduces the bias value between cpu groups -- maybe it isn't tuned well yet. :)

> 
> Correct.
> 
>> I may have missed something, but I don't understand the reason for the
>> performance improvements that you are reporting. I see better numbers
>> for a few benchmarks, but I still don't understand why the code makes
>> sense after the cleanup. If we don't understand why it works, we cannot
>> be sure that it doesn't harm other benchmarks. There is always a chance
>> that we miss something but, IMHO, not having any idea to begin with
>> increases the chances for problems later significantly. So why not get
>> to the bottom of the problem of cleaning up cpu_load?
>>
>> Have you done more extensive benchmarking? Have you seen any regressions
>> in other benchmarks?
> 
> I only remember hackbench numbers, and that generally fares well with a
> more aggressive balancer: since it has no actual work to speak of, the
> migration penalty is very low, and because there's a metric ton of tasks,
> the aggressive leveling makes for more coherent 'throughput'.

I just tested hackbench on arm, and with more test runs plus a rebase to 3.13-rc6 the variation increased, so the benefit becomes unclear. Anyway, still no regression found in either perf-stat cpu-migration counts or real execution time.

The 0day performance testing has tested kbuild, hackbench, aim7, dbench, tbench, sysbench, netperf, etc. No regression found.

The 0day performance testing also caught a cpu migration reduction on a kvm guest.
https://lkml.org/lkml/2013/12/21/135 

And another benchmark gets a benefit with the old patchset,
as the testing results from the 0day performance testing show:

https://lkml.org/lkml/2013/12/4/102

Hi Alex,

We observed a 150% performance gain with the vm-scalability/300s-mmap-pread-seq
testcase with this patch applied. Here is a list of changes we got so far:

testbox : brickland
testcase: vm-scalability/300s-mmap-pread-seq


    f1b6442c7dd12802e622      d70495ef86f397816d73  
       (parent commit)            (this commit)
------------------------  ------------------------  
             26393249.80      +150.9%  66223933.60  vm-scalability.throughput

                  225.12       -49.9%       112.75  time.elapsed_time
                36333.40       -90.7%      3392.20  vmstat.system.cs
                    2.40      +375.0%        11.40  vmstat.cpu.id
              3770081.60       -97.7%     87673.40  time.major_page_faults
              3975276.20       -97.0%    117409.60  time.voluntary_context_switches
                    3.05      +301.7%        12.24  iostat.cpu.idle
                21118.41       -70.3%      6277.19  time.system_time
                   18.40      +130.4%        42.40  vmstat.cpu.us
                   77.00       -41.3%        45.20  vmstat.cpu.sy
                47459.60       -31.3%     32592.20  vmstat.system.in
                82435.40       -12.1%     72443.60  time.involuntary_context_switches
                 5128.13       +14.0%      5848.30  time.user_time
                11656.20        -7.8%     10745.60  time.percent_of_cpu_this_job_got
           1069997484.80        +0.3% 1073679919.00 time.minor_page_faults

Btw, the latest patchset includes more cleanup:
	git@github.com:alexshi/power-scheduling.git noload
I guess Fengguang's 0day performance testing is running on it.

-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2014-01-08 14:18 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-03  9:05 [PATCH 0/4] sched: remove cpu_load decay Alex Shi
2013-12-03  9:05 ` [PATCH 1/4] sched: shortcut to remove load_idx Alex Shi
2013-12-03  9:05 ` [PATCH 2/4] sched: remove rq->cpu_load[load_idx] array Alex Shi
2013-12-03  9:05 ` [PATCH 3/4] sched: clean up cpu_load update Alex Shi
2013-12-03  9:05 ` [PATCH 4/4] sched: bias to target cpu load to reduce task moving Alex Shi
2013-12-04  9:06   ` Yuanhan Liu
2013-12-04 11:25     ` Alex Shi
2013-12-17 14:10   ` Morten Rasmussen
2013-12-17 15:38     ` Peter Zijlstra
2013-12-19 13:34       ` Alex Shi
2013-12-20 11:19         ` Morten Rasmussen
2013-12-20 14:45           ` Alex Shi
2013-12-25 14:58           ` Alex Shi
2014-01-02 16:04             ` Morten Rasmussen
2014-01-06 13:35               ` Alex Shi
2014-01-07 12:55                 ` Morten Rasmussen
2014-01-07 12:59                   ` Peter Zijlstra
2014-01-07 13:15                     ` Peter Zijlstra
2014-01-07 13:32                       ` Vincent Guittot
2014-01-07 13:40                         ` Peter Zijlstra
2014-01-07 15:16                       ` Morten Rasmussen
2014-01-07 20:37                         ` Peter Zijlstra
2014-01-08 14:15                     ` Alex Shi
2013-12-03 10:26 ` [PATCH 0/4] sched: remove cpu_load decay Peter Zijlstra
2013-12-10  1:04   ` Alex Shi
2013-12-10  1:06     ` Paul Turner
2013-12-13 19:50     ` bsegall
2013-12-14 12:53       ` Alex Shi
2013-12-13 20:03 ` Peter Zijlstra
2013-12-14 13:27   ` Alex Shi
2013-12-17 14:04     ` Morten Rasmussen
2013-12-17 15:37       ` Peter Zijlstra
2013-12-17 18:12         ` Morten Rasmussen
2013-12-20 14:43           ` Alex Shi
