* [patch v7 0/8] sched: using runnable load avg in balance
@ 2013-05-30  7:01 Alex Shi
  2013-05-30  7:01 ` [patch v7 1/8] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking" Alex Shi
                   ` (8 more replies)
  0 siblings, 9 replies; 12+ messages in thread
From: Alex Shi @ 2013-05-30  7:01 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, bp, pjt, namhyung, efault, morten.rasmussen
  Cc: vincent.guittot, preeti, viresh.kumar, linux-kernel, alex.shi,
	mgorman, riel, wangyun, Jason Low, Changlong Xie

Thanks for the comments from Peter, Paul, Morten, Michael and Preeti.

The most important change in this version is rebasing onto the latest
tip/sched/core tree.

I tested on Intel Core2, NHM, SNB and IVB machines, with 2 and 4 sockets,
using the kbuild, aim7, dbench, tbench, hackbench, oltp and netperf
loopback benchmarks, etc.

On the SNB EP 4-socket machine, hackbench improved by about 50% and the
results became stable. On other machines, hackbench improved by about
2~10%. oltp improved by about 30% on the NHM EX box. netperf loopback
also improved on the SNB EP 4-socket box. There were no clear changes on
the other benchmarks.

Michael Wang got better pgbench performance on his box with this
patchset. https://lkml.org/lkml/2013/5/16/82

And Morten tested a previous version and measured better power consumption.
http://comments.gmane.org/gmane.linux.kernel/1463371

Changlong found that the ltp cgroup stress test gets faster on an SNB EP
machine.  https://lkml.org/lkml/2013/5/23/65
---
3.10-rc1          patch1-7         patch1-8
duration=764   duration=754   duration=750
duration=764   duration=754   duration=751
duration=763   duration=755   duration=751

duration is the test run time in seconds.
---

Jason also found that a Java server workload benefited on his 8-socket machine:
https://lkml.org/lkml/2013/5/29/673
---
When using a 3.10-rc2 tip kernel with patches 1-8, there was about a 40%
improvement in performance of the workload compared to when using the
vanilla 3.10-rc2 tip kernel with no patches. When using a 3.10-rc2 tip
kernel with just patches 1-7, the performance improvement of the
workload over the vanilla 3.10-rc2 tip kernel was about 25%.
---

We also tried to include the blocked load avg in balance, but found that
many benchmarks drop a lot in performance! So it seems that accumulating
the current blocked_load_avg into the cpu load isn't a good idea.
The blocked_load_avg is decayed in the same way as the runnable load, and
is sometimes far bigger than the runnable load; that drives tasks toward
other idle or lightly loaded cpus and causes both performance and power
issues. But if the blocked load is decayed too fast, it loses its effect.
Another issue with the blocked load is that when waking up a task, we
cannot know the blocked load proportion of that task on its cpu. So the
blocked load is meaningless in the wake affine decision.

Given the above problems, I cannot figure out a way to use
blocked_load_avg in balance now.
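
To make the first issue concrete, here is a minimal userspace sketch
(illustrative numbers only, assuming the ~32ms half-life used by
per-entity load tracking) of how a large blocked contribution decays
relative to a smaller runnable load:

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* per-entity load tracking decays contributions by y per ~1ms
	 * period, where y^32 ~= 0.5 (a ~32ms half-life) */
	const double y = pow(0.5, 1.0 / 32.0);
	const double blocked = 2048.0;	/* hypothetical heavy task that just blocked */
	const double runnable = 512.0;	/* hypothetical load that is runnable right now */
	int period;

	for (period = 0; period <= 64; period += 16)
		printf("t~%2dms  blocked=%6.1f  runnable=%6.1f\n",
		       period, blocked * pow(y, period), runnable);
	/* the blocked contribution still exceeds the runnable load for
	 * tens of milliseconds, which is what pushes tasks elsewhere */
	return 0;
}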

Anyway, since using the runnable load avg in balance brings much benefit
for performance and power, and this patchset has been reviewed for a long
time, maybe it is time to let it land in some sub-maintainer tree, like
tip or linux-next.  Any comments?

Regards
Alex
[patch v7 1/8] Revert "sched: Introduce temporary FAIR_GROUP_SCHED
[patch v7 2/8] sched: move few runnable tg variables into CONFIG_SMP
[patch v7 3/8] sched: set initial value of runnable avg for new
[patch v7 4/8] sched: fix slept time double counting in enqueue
[patch v7 5/8] sched: update cpu load after task_tick.
[patch v7 6/8] sched: compute runnable load avg in cpu_load and
[patch v7 7/8] sched: consider runnable load average in move_tasks
[patch v7 8/8] sched: remove blocked_load_avg in tg


* [patch v7 1/8] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"
  2013-05-30  7:01 [patch v7 0/8] sched: using runnable load avg in balance Alex Shi
@ 2013-05-30  7:01 ` Alex Shi
  2013-05-30  7:01 ` [patch v7 2/8] sched: move few runnable tg variables into CONFIG_SMP Alex Shi
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Alex Shi @ 2013-05-30  7:01 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, bp, pjt, namhyung, efault, morten.rasmussen
  Cc: vincent.guittot, preeti, viresh.kumar, linux-kernel, alex.shi,
	mgorman, riel, wangyun, Jason Low, Changlong Xie

Remove the CONFIG_FAIR_GROUP_SCHED guards that cover the runnable
load-tracking info, so that we can use the runnable load variables.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 include/linux/sched.h |  7 +------
 kernel/sched/core.c   |  7 +------
 kernel/sched/fair.c   | 13 ++-----------
 kernel/sched/sched.h  | 10 ++--------
 4 files changed, 6 insertions(+), 31 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 178a8d9..0019bef 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -994,12 +994,7 @@ struct sched_entity {
 	struct cfs_rq		*my_q;
 #endif
 
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
 	/* Per-entity load-tracking */
 	struct sched_avg	avg;
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 36f85be..b9e7036 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1598,12 +1598,7 @@ static void __sched_fork(struct task_struct *p)
 	p->se.vruntime			= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
 	p->se.avg.runnable_avg_period = 0;
 	p->se.avg.runnable_avg_sum = 0;
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3ee1c2e..f404468 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1128,8 +1128,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
 /*
  * We choose a half-life close to 1 scheduling period.
  * Note: The tables below are dependent on this value.
@@ -3436,12 +3435,6 @@ unlock:
 }
 
 /*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
  * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
  * cfs_rq_of(p) references at time of call are still valid and identify the
  * previous cpu.  However, the caller only guarantees p->pi_lock is held; no
@@ -3464,7 +3457,6 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
 		atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
 	}
 }
-#endif
 #endif /* CONFIG_SMP */
 
 static unsigned long
@@ -6167,9 +6159,8 @@ const struct sched_class fair_sched_class = {
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
 	.migrate_task_rq	= migrate_task_rq_fair,
-#endif
+
 	.rq_online		= rq_online_fair,
 	.rq_offline		= rq_offline_fair,
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 74ff659..d892a9f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -269,12 +269,6 @@ struct cfs_rq {
 #endif
 
 #ifdef CONFIG_SMP
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
 	/*
 	 * CFS Load tracking
 	 * Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -284,9 +278,9 @@ struct cfs_rq {
 	u64 runnable_load_avg, blocked_load_avg;
 	atomic64_t decay_counter, removed_load;
 	u64 last_decay;
-#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* These always depend on CONFIG_FAIR_GROUP_SCHED */
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
+	/* Required to track per-cpu representation of a task_group */
 	u32 tg_runnable_contrib;
 	u64 tg_load_contrib;
 #endif /* CONFIG_FAIR_GROUP_SCHED */
-- 
1.7.12



* [patch v7 2/8] sched: move few runnable tg variables into CONFIG_SMP
  2013-05-30  7:01 [patch v7 0/8] sched: using runnable load avg in balance Alex Shi
  2013-05-30  7:01 ` [patch v7 1/8] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking" Alex Shi
@ 2013-05-30  7:01 ` Alex Shi
  2013-05-30  7:01 ` [patch v7 3/8] sched: set initial value of runnable avg for new forked task Alex Shi
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Alex Shi @ 2013-05-30  7:01 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, bp, pjt, namhyung, efault, morten.rasmussen
  Cc: vincent.guittot, preeti, viresh.kumar, linux-kernel, alex.shi,
	mgorman, riel, wangyun, Jason Low, Changlong Xie

The following two variables are only used under CONFIG_SMP, so it is
better to move their definition under CONFIG_SMP too.

        atomic64_t load_avg;
        atomic_t runnable_avg;

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/sched.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d892a9f..24b1503 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -149,9 +149,11 @@ struct task_group {
 	unsigned long shares;
 
 	atomic_t load_weight;
+#ifdef	CONFIG_SMP
 	atomic64_t load_avg;
 	atomic_t runnable_avg;
 #endif
+#endif
 
 #ifdef CONFIG_RT_GROUP_SCHED
 	struct sched_rt_entity **rt_se;
-- 
1.7.12



* [patch v7 3/8] sched: set initial value of runnable avg for new forked task
  2013-05-30  7:01 [patch v7 0/8] sched: using runnable load avg in balance Alex Shi
  2013-05-30  7:01 ` [patch v7 1/8] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking" Alex Shi
  2013-05-30  7:01 ` [patch v7 2/8] sched: move few runnable tg variables into CONFIG_SMP Alex Shi
@ 2013-05-30  7:01 ` Alex Shi
  2013-05-30  7:02 ` [patch v7 4/8] sched: fix slept time double counting in enqueue entity Alex Shi
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Alex Shi @ 2013-05-30  7:01 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, bp, pjt, namhyung, efault, morten.rasmussen
  Cc: vincent.guittot, preeti, viresh.kumar, linux-kernel, alex.shi,
	mgorman, riel, wangyun, Jason Low, Changlong Xie

We need to initialize se.avg.{decay_count, load_avg_contrib} for a
newly forked task.
Otherwise the random values of those variables make a mess when the new
task is enqueued:
    enqueue_task_fair
        enqueue_entity
            enqueue_entity_load_avg

and make fork balancing imbalanced, since the load_avg_contrib is
incorrect.

Furthermore, Morten Rasmussen noticed that some tasks were not launched
immediately after being created. So Paul and Peter suggested giving new
tasks a starting runnable avg time equal to sched_slice().
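
As a worked example (hypothetical numbers, not from the patch), this is
roughly what the new initialization below does for a task whose first
sched_slice() is 3ms:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* hypothetical: sched_slice() returns 3ms = 3000000ns; the load
	 * tracking code works in ~1us (1024ns) units, hence the >> 10 */
	uint64_t slice_ns = 3000000;
	uint32_t slice = (uint32_t)(slice_ns >> 10);	/* ~2929 */

	/* setting runnable_avg_sum == runnable_avg_period means the task
	 * starts as if it had been 100% runnable for one whole slice, so
	 * __update_task_entity_contrib() gives it a full initial
	 * contribution instead of a random one */
	printf("runnable_avg_sum = runnable_avg_period = %u\n", slice);
	return 0;
}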

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/core.c  |  6 ++----
 kernel/sched/fair.c  | 23 +++++++++++++++++++++++
 kernel/sched/sched.h |  2 ++
 3 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b9e7036..6f226c2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1598,10 +1598,6 @@ static void __sched_fork(struct task_struct *p)
 	p->se.vruntime			= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
-#ifdef CONFIG_SMP
-	p->se.avg.runnable_avg_period = 0;
-	p->se.avg.runnable_avg_sum = 0;
-#endif
 #ifdef CONFIG_SCHEDSTATS
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
 #endif
@@ -1745,6 +1741,8 @@ void wake_up_new_task(struct task_struct *p)
 	set_task_cpu(p, select_task_rq(p, SD_BALANCE_FORK, 0));
 #endif
 
+	/* Give new task start runnable values */
+	set_task_runnable_avg(p);
 	rq = __task_rq_lock(p);
 	activate_task(rq, p, 0);
 	p->on_rq = 1;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f404468..1fc30b9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -680,6 +680,26 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	return calc_delta_fair(sched_slice(cfs_rq, se), se);
 }
 
+#ifdef CONFIG_SMP
+static inline void __update_task_entity_contrib(struct sched_entity *se);
+
+/* Give a new task initial runnable values to boost its load in infancy */
+void set_task_runnable_avg(struct task_struct *p)
+{
+	u32 slice;
+
+	p->se.avg.decay_count = 0;
+	slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
+	p->se.avg.runnable_avg_sum = slice;
+	p->se.avg.runnable_avg_period = slice;
+	__update_task_entity_contrib(&p->se);
+}
+#else
+void set_task_runnable_avg(struct task_struct *p)
+{
+}
+#endif
+
 /*
  * Update the current task's runtime statistics. Skip current tasks that
  * are not in our scheduling class.
@@ -1527,6 +1547,9 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 	 * We track migrations using entity decay_count <= 0, on a wake-up
 	 * migration we use a negative decay count to track the remote decays
 	 * accumulated while sleeping.
+	 *
+	 * When enqueueing a newly forked task, se->avg.decay_count == 0, so we
+	 * bypass update_entity_load_avg() and use avg.load_avg_contrib directly.
 	 */
 	if (unlikely(se->avg.decay_count <= 0)) {
 		se->avg.last_runnable_update = rq_clock_task(rq_of(cfs_rq));
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 24b1503..8bc66c6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1058,6 +1058,8 @@ extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime
 
 extern void update_idle_cpu_load(struct rq *this_rq);
 
+extern void set_task_runnable_avg(struct task_struct *p);
+
 #ifdef CONFIG_PARAVIRT
 static inline u64 steal_ticks(u64 steal)
 {
-- 
1.7.12



* [patch v7 4/8] sched: fix slept time double counting in enqueue entity
  2013-05-30  7:01 [patch v7 0/8] sched: using runnable load avg in balance Alex Shi
                   ` (2 preceding siblings ...)
  2013-05-30  7:01 ` [patch v7 3/8] sched: set initial value of runnable avg for new forked task Alex Shi
@ 2013-05-30  7:02 ` Alex Shi
  2013-05-30  7:02 ` [patch v7 5/8] sched: update cpu load after task_tick Alex Shi
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Alex Shi @ 2013-05-30  7:02 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, bp, pjt, namhyung, efault, morten.rasmussen
  Cc: vincent.guittot, preeti, viresh.kumar, linux-kernel, alex.shi,
	mgorman, riel, wangyun, Jason Low, Changlong Xie

A woken-up migrated task goes through __synchronize_entity_decay(se) in
migrate_task_rq_fair(), and then needs to set
`se->avg.last_runnable_update -= (-se->avg.decay_count) << 20'
before update_entity_load_avg(), in order to avoid the slept time being
accounted twice for se.avg.load_avg_contrib, in both
__synchronize_entity_decay() and update_entity_load_avg().

But if the sleeping task is woken up on the same cpu, it misses the
last_runnable_update adjustment before update_entity_load_avg(se, 0, 1),
so the slept time is used twice in both functions.
So we need to remove this double counting of the slept time.
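
For reference, my reading of the unit conversion behind the `<< 20' in
the hunk below (an assumption based on the load tracking period, not
stated explicitly in the patch):

#include <stdio.h>

int main(void)
{
	/* __synchronize_entity_decay() returns the number of elapsed decay
	 * periods; one period is 1024us = 1024 * 1024 ns = 1 << 20 ns,
	 * while avg.last_runnable_update is kept in nanoseconds */
	unsigned long long periods = 3;			/* hypothetical */
	unsigned long long ns = periods << 20;		/* ~3ms in ns */

	printf("%llu periods -> %llu ns\n", periods, ns);
	return 0;
}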

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1fc30b9..42c7be0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1570,7 +1570,8 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 		}
 		wakeup = 0;
 	} else {
-		__synchronize_entity_decay(se);
+		se->avg.last_runnable_update += __synchronize_entity_decay(se)
+							<< 20;
 	}
 
 	/* migrated tasks did not contribute to our blocked load */
-- 
1.7.12



* [patch v7 5/8] sched: update cpu load after task_tick.
  2013-05-30  7:01 [patch v7 0/8] sched: using runnable load avg in balance Alex Shi
                   ` (3 preceding siblings ...)
  2013-05-30  7:02 ` [patch v7 4/8] sched: fix slept time double counting in enqueue entity Alex Shi
@ 2013-05-30  7:02 ` Alex Shi
  2013-05-30  7:02 ` [patch v7 6/8] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task Alex Shi
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Alex Shi @ 2013-05-30  7:02 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, bp, pjt, namhyung, efault, morten.rasmussen
  Cc: vincent.guittot, preeti, viresh.kumar, linux-kernel, alex.shi,
	mgorman, riel, wangyun, Jason Low, Changlong Xie

To get the latest runnable info, we need to do this cpu load update
after task_tick().

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6f226c2..05176b8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2152,8 +2152,8 @@ void scheduler_tick(void)
 
 	raw_spin_lock(&rq->lock);
 	update_rq_clock(rq);
-	update_cpu_load_active(rq);
 	curr->sched_class->task_tick(rq, curr, 0);
+	update_cpu_load_active(rq);
 	raw_spin_unlock(&rq->lock);
 
 	perf_event_task_tick();
-- 
1.7.12



* [patch v7 6/8] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
  2013-05-30  7:01 [patch v7 0/8] sched: using runnable load avg in balance Alex Shi
                   ` (4 preceding siblings ...)
  2013-05-30  7:02 ` [patch v7 5/8] sched: update cpu load after task_tick Alex Shi
@ 2013-05-30  7:02 ` Alex Shi
  2013-05-30  7:02 ` [patch v7 7/8] sched: consider runnable load average in move_tasks Alex Shi
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Alex Shi @ 2013-05-30  7:02 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, bp, pjt, namhyung, efault, morten.rasmussen
  Cc: vincent.guittot, preeti, viresh.kumar, linux-kernel, alex.shi,
	mgorman, riel, wangyun, Jason Low, Changlong Xie

These are the base values used in load balance; update them with the rq
runnable load average, and then load balance will consider the runnable
load avg naturally.

We also tried to include blocked_load_avg in the cpu load used for
balancing, but that caused a 6% kbuild performance drop on every Intel
machine tested, and aim7/oltp drops on some of the 4-socket machines.
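
As an illustrative contrast (invented numbers), the old and new
quantities behind weighted_cpuload():

#include <stdio.h>

int main(void)
{
	/* invented example: three nice-0 tasks (weight 1024 each) just woke
	 * up on this cpu but have barely accumulated any runtime yet */
	unsigned long load_weight = 3 * 1024;	/* instantaneous rq->load.weight */
	unsigned long runnable_load_avg = 400;	/* decayed rq->cfs.runnable_load_avg */

	/* balancing on the decayed average avoids treating a momentary
	 * burst of wakeups as a permanently overloaded cpu */
	printf("instantaneous=%lu  averaged=%lu\n", load_weight, runnable_load_avg);
	return 0;
}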

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c |  5 +++--
 kernel/sched/proc.c | 17 +++++++++++++++--
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 42c7be0..eadd2e7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2962,7 +2962,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 /* Used instead of source_load when we know the type == 0 */
 static unsigned long weighted_cpuload(const int cpu)
 {
-	return cpu_rq(cpu)->load.weight;
+	return cpu_rq(cpu)->cfs.runnable_load_avg;
 }
 
 /*
@@ -3007,9 +3007,10 @@ static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
+	unsigned long load_avg = rq->cfs.runnable_load_avg;
 
 	if (nr_running)
-		return rq->load.weight / nr_running;
+		return load_avg / nr_running;
 
 	return 0;
 }
diff --git a/kernel/sched/proc.c b/kernel/sched/proc.c
index bb3a6a0..ce5cd48 100644
--- a/kernel/sched/proc.c
+++ b/kernel/sched/proc.c
@@ -501,6 +501,18 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
 	sched_avg_update(this_rq);
 }
 
+#ifdef CONFIG_SMP
+unsigned long get_rq_runnable_load(struct rq *rq)
+{
+	return rq->cfs.runnable_load_avg;
+}
+#else
+unsigned long get_rq_runnable_load(struct rq *rq)
+{
+	return rq->load.weight;
+}
+#endif
+
 #ifdef CONFIG_NO_HZ_COMMON
 /*
  * There is no sane way to deal with nohz on smp when using jiffies because the
@@ -522,7 +534,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
 void update_idle_cpu_load(struct rq *this_rq)
 {
 	unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
-	unsigned long load = this_rq->load.weight;
+	unsigned long load = get_rq_runnable_load(this_rq);
 	unsigned long pending_updates;
 
 	/*
@@ -568,11 +580,12 @@ void update_cpu_load_nohz(void)
  */
 void update_cpu_load_active(struct rq *this_rq)
 {
+	unsigned long load = get_rq_runnable_load(this_rq);
 	/*
 	 * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
 	 */
 	this_rq->last_load_update_tick = jiffies;
-	__update_cpu_load(this_rq, this_rq->load.weight, 1);
+	__update_cpu_load(this_rq, load, 1);
 
 	calc_load_account_active(this_rq);
 }
-- 
1.7.12



* [patch v7 7/8] sched: consider runnable load average in move_tasks
  2013-05-30  7:01 [patch v7 0/8] sched: using runnable load avg in balance Alex Shi
                   ` (5 preceding siblings ...)
  2013-05-30  7:02 ` [patch v7 6/8] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task Alex Shi
@ 2013-05-30  7:02 ` Alex Shi
  2013-05-31 10:19   ` Morten Rasmussen
  2013-05-30  7:02 ` [patch v7 8/8] sched: remove blocked_load_avg in tg Alex Shi
  2013-06-03  6:43 ` [patch v7 0/8] sched: using runnable load avg in balance Alex Shi
  8 siblings, 1 reply; 12+ messages in thread
From: Alex Shi @ 2013-05-30  7:02 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, bp, pjt, namhyung, efault, morten.rasmussen
  Cc: vincent.guittot, preeti, viresh.kumar, linux-kernel, alex.shi,
	mgorman, riel, wangyun, Jason Low, Changlong Xie

Besides using the runnable load average in the background, move_tasks()
is also a key function in load balance. We need to consider the runnable
load average in it as well, in order to get an apples-to-apples load
comparison.
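
For intuition, a minimal sketch (my own restatement of the hunks below,
not kernel code) of the hierarchical load recurrence once the
runnable-average based metrics are used:

#include <stdint.h>

/* h_load of a child group: scale the parent's h_load by this group's
 * share of the parent's runnable load; the "+ 1" guards against a
 * division by zero, as in the kernel code */
uint64_t child_h_load(uint64_t parent_h_load,
		      uint64_t se_load_avg_contrib,
		      uint64_t parent_runnable_load_avg)
{
	return parent_h_load * se_load_avg_contrib /
	       (parent_runnable_load_avg + 1);
}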

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eadd2e7..bb2470a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4178,11 +4178,11 @@ static int tg_load_down(struct task_group *tg, void *data)
 	long cpu = (long)data;
 
 	if (!tg->parent) {
-		load = cpu_rq(cpu)->load.weight;
+		load = cpu_rq(cpu)->avg.load_avg_contrib;
 	} else {
 		load = tg->parent->cfs_rq[cpu]->h_load;
-		load *= tg->se[cpu]->load.weight;
-		load /= tg->parent->cfs_rq[cpu]->load.weight + 1;
+		load *= tg->se[cpu]->avg.load_avg_contrib;
+		load /= tg->parent->cfs_rq[cpu]->runnable_load_avg + 1;
 	}
 
 	tg->cfs_rq[cpu]->h_load = load;
@@ -4210,8 +4210,8 @@ static unsigned long task_h_load(struct task_struct *p)
 	struct cfs_rq *cfs_rq = task_cfs_rq(p);
 	unsigned long load;
 
-	load = p->se.load.weight;
-	load = div_u64(load * cfs_rq->h_load, cfs_rq->load.weight + 1);
+	load = p->se.avg.load_avg_contrib;
+	load = div_u64(load * cfs_rq->h_load, cfs_rq->runnable_load_avg + 1);
 
 	return load;
 }
-- 
1.7.12



* [patch v7 8/8] sched: remove blocked_load_avg in tg
  2013-05-30  7:01 [patch v7 0/8] sched: using runnable load avg in balance Alex Shi
                   ` (6 preceding siblings ...)
  2013-05-30  7:02 ` [patch v7 7/8] sched: consider runnable load average in move_tasks Alex Shi
@ 2013-05-30  7:02 ` Alex Shi
  2013-06-03  6:43 ` [patch v7 0/8] sched: using runnable load avg in balance Alex Shi
  8 siblings, 0 replies; 12+ messages in thread
From: Alex Shi @ 2013-05-30  7:02 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, bp, pjt, namhyung, efault, morten.rasmussen
  Cc: vincent.guittot, preeti, viresh.kumar, linux-kernel, alex.shi,
	mgorman, riel, wangyun, Jason Low, Changlong Xie

The blocked_load_avg is sometimes too heavy and far bigger than the
runnable load avg, which makes the balancer take wrong decisions. So
remove it from the tg load contribution.

Changlong tested this patch and found that the ltp cgroup stress test
gets better performance: https://lkml.org/lkml/2013/5/23/65
---
3.10-rc1          patch1-7         patch1-8
duration=764   duration=754   duration=750
duration=764   duration=754   duration=751
duration=763   duration=755   duration=751

duration is the test run time in seconds.
---

And Jason also tested this patchset on his 8-socket machine:
https://lkml.org/lkml/2013/5/29/673
---
When using a 3.10-rc2 tip kernel with patches 1-8, there was about a 40%
improvement in performance of the workload compared to when using the
vanilla 3.10-rc2 tip kernel with no patches. When using a 3.10-rc2 tip
kernel with just patches 1-7, the performance improvement of the
workload over the vanilla 3.10-rc2 tip kernel was about 25%.
---

Signed-off-by: Alex Shi <alex.shi@intel.com>
Tested-by: Changlong Xie <changlongx.xie@intel.com>
Tested-by: Jason Low <jason.low2@hp.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bb2470a..163d9ce 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1358,7 +1358,7 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
 	struct task_group *tg = cfs_rq->tg;
 	s64 tg_contrib;
 
-	tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
+	tg_contrib = cfs_rq->runnable_load_avg;
 	tg_contrib -= cfs_rq->tg_load_contrib;
 
 	if (force_update || abs64(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
-- 
1.7.12



* Re: [patch v7 7/8] sched: consider runnable load average in move_tasks
  2013-05-30  7:02 ` [patch v7 7/8] sched: consider runnable load average in move_tasks Alex Shi
@ 2013-05-31 10:19   ` Morten Rasmussen
  2013-05-31 15:07     ` Alex Shi
  0 siblings, 1 reply; 12+ messages in thread
From: Morten Rasmussen @ 2013-05-31 10:19 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, tglx, akpm, bp, pjt, namhyung, efault,
	vincent.guittot, preeti, viresh.kumar, linux-kernel, mgorman,
	riel, wangyun, Jason Low, Changlong Xie

On Thu, May 30, 2013 at 08:02:03AM +0100, Alex Shi wrote:
> Besides using the runnable load average in the background, move_tasks()
> is also a key function in load balance. We need to consider the runnable
> load average in it as well, in order to get an apples-to-apples load
> comparison.
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  kernel/sched/fair.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index eadd2e7..bb2470a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4178,11 +4178,11 @@ static int tg_load_down(struct task_group *tg, void *data)
>  	long cpu = (long)data;
>  
>  	if (!tg->parent) {
> -		load = cpu_rq(cpu)->load.weight;
> +		load = cpu_rq(cpu)->avg.load_avg_contrib;
>  	} else {
>  		load = tg->parent->cfs_rq[cpu]->h_load;
> -		load *= tg->se[cpu]->load.weight;
> -		load /= tg->parent->cfs_rq[cpu]->load.weight + 1;
> +		load *= tg->se[cpu]->avg.load_avg_contrib;
> +		load /= tg->parent->cfs_rq[cpu]->runnable_load_avg + 1;

runnable_load_avg is u64, so you need to use div_u64() similar to how it
is already done in task_h_load() further down in this patch. It doesn't
build on ARM as is.

Fix:
-               load /= tg->parent->cfs_rq[cpu]->runnable_load_avg + 1;
+               load = div_u64(load,
				tg->parent->cfs_rq[cpu]->runnable_load_avg + 1);

Morten

>  	}
>  
>  	tg->cfs_rq[cpu]->h_load = load;
> @@ -4210,8 +4210,8 @@ static unsigned long task_h_load(struct task_struct *p)
>  	struct cfs_rq *cfs_rq = task_cfs_rq(p);
>  	unsigned long load;
>  
> -	load = p->se.load.weight;
> -	load = div_u64(load * cfs_rq->h_load, cfs_rq->load.weight + 1);
> +	load = p->se.avg.load_avg_contrib;
> +	load = div_u64(load * cfs_rq->h_load, cfs_rq->runnable_load_avg + 1);
>  
>  	return load;
>  }
> -- 
> 1.7.12
> 
> 



* Re: [patch v7 7/8] sched: consider runnable load average in move_tasks
  2013-05-31 10:19   ` Morten Rasmussen
@ 2013-05-31 15:07     ` Alex Shi
  0 siblings, 0 replies; 12+ messages in thread
From: Alex Shi @ 2013-05-31 15:07 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, peterz, tglx, akpm, bp, pjt, namhyung, efault,
	vincent.guittot, preeti, viresh.kumar, linux-kernel, mgorman,
	riel, wangyun, Jason Low, Changlong Xie


> 
> runnable_load_avg is u64, so you need to use div_u64() similar to how it
> is already done in task_h_load() further down in this patch. It doesn't
> build on ARM as is.
> 
> Fix:
> -               load /= tg->parent->cfs_rq[cpu]->runnable_load_avg + 1;
> +               load = div_u64(load,
> 				tg->parent->cfs_rq[cpu]->runnable_load_avg + 1);
> 
> Morten

Thanks a lot for the review!

div_u64() and do_div() force-cast the divisor to u32, so on a 64-bit
machine the divisor may become incorrect.
Since cfs_rq->runnable_load_avg is always smaller than cfs_rq.load.weight,
and load.weight is 'unsigned long', we can cast runnable_load_avg to
'unsigned long' too. Then the division works on both 64-bit and 32-bit
machines without truncating the divisor!
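
To make the truncation concern concrete, a minimal userspace sketch (the
div_u64_like() helper is only a stand-in for the kernel's div_u64(),
which takes a u32 divisor):

#include <stdint.h>
#include <stdio.h>

/* simplified stand-in for div_u64(): u64 dividend, u32 divisor */
static uint64_t div_u64_like(uint64_t dividend, uint32_t divisor)
{
	return dividend / divisor;
}

int main(void)
{
	uint64_t load = 1000000;
	uint64_t big_divisor = (1ULL << 32) + 10;	/* does not fit in u32 */

	/* passing a u64 divisor silently truncates it to 10 here */
	printf("truncated: %llu\n",
	       (unsigned long long)div_u64_like(load, (uint32_t)big_divisor));

	/* a full 64-bit division keeps the whole divisor and gives the
	 * expected (much smaller) result */
	printf("full:      %llu\n",
	       (unsigned long long)(load / big_divisor));
	return 0;
}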

So the patch is changed as follows.

BTW, Paul & Peter:
in cfs_rq, runnable_load_avg, blocked_load_avg and tg_load_contrib are all
u64, but they are similar to the 'unsigned long' load.weight. So could we
change them to 'unsigned long'?

---

From 4a17564363f6d65c9d513ad206b54ebd032d3f46 Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@intel.com>
Date: Mon, 3 Dec 2012 23:00:53 +0800
Subject: [PATCH 7/8] sched: consider runnable load average in move_tasks

Besides using the runnable load average in the background, move_tasks()
is also a key function in load balance. We need to consider the runnable
load average in it as well, in order to get an apples-to-apples load
comparison.

Morten caught a u64 division bug on ARM, thanks!

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c |   17 ++++++++++-------
 1 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eadd2e7..73e4507 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4178,11 +4178,14 @@ static int tg_load_down(struct task_group *tg, void *data)
 	long cpu = (long)data;
 
 	if (!tg->parent) {
-		load = cpu_rq(cpu)->load.weight;
+		load = cpu_rq(cpu)->avg.load_avg_contrib;
 	} else {
+		unsigned long tmp_rla;
+		tmp_rla = tg->parent->cfs_rq[cpu]->runnable_load_avg + 1;
+
 		load = tg->parent->cfs_rq[cpu]->h_load;
-		load *= tg->se[cpu]->load.weight;
-		load /= tg->parent->cfs_rq[cpu]->load.weight + 1;
+		load *= tg->se[cpu]->avg.load_avg_contrib;
+		load /= tmp_rla;
 	}
 
 	tg->cfs_rq[cpu]->h_load = load;
@@ -4208,12 +4211,12 @@ static void update_h_load(long cpu)
 static unsigned long task_h_load(struct task_struct *p)
 {
 	struct cfs_rq *cfs_rq = task_cfs_rq(p);
-	unsigned long load;
+	unsigned long load, tmp_rla;
 
-	load = p->se.load.weight;
-	load = div_u64(load * cfs_rq->h_load, cfs_rq->load.weight + 1);
+	load = p->se.avg.load_avg_contrib * cfs_rq->h_load;
+	tmp_rla = cfs_rq->runnable_load_avg + 1;
 
-	return load;
+	return load / tmp_rla;
 }
 #else
 static inline void update_blocked_averages(int cpu)
-- 
1.7.5.4



* Re: [patch v7 0/8] sched: using runnable load avg in balance
  2013-05-30  7:01 [patch v7 0/8] sched: using runnable load avg in balance Alex Shi
                   ` (7 preceding siblings ...)
  2013-05-30  7:02 ` [patch v7 8/8] sched: remove blocked_load_avg in tg Alex Shi
@ 2013-06-03  6:43 ` Alex Shi
  8 siblings, 0 replies; 12+ messages in thread
From: Alex Shi @ 2013-06-03  6:43 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, tglx, akpm, bp, pjt, namhyung, efault,
	morten.rasmussen, vincent.guittot, preeti, viresh.kumar,
	linux-kernel, mgorman, riel, wangyun, Jason Low, Changlong Xie

On 05/30/2013 03:01 PM, Alex Shi wrote:
> Anyway, since using the runnable load avg in balance brings much benefit
> for performance and power, and this patchset has been reviewed for a long
> time, maybe it is time to let it land in some sub-maintainer tree, like
> tip or linux-next.  Any comments?
> 

Peter,
What's your opinion on this patchset? Is there something missing? :)


The patchset git tree is here:
git@github.com:alexshi/power-scheduling.git runnablelb

> [patch v7 1/8] Revert "sched: Introduce temporary FAIR_GROUP_SCHED
> [patch v7 2/8] sched: move few runnable tg variables into CONFIG_SMP
> [patch v7 3/8] sched: set initial value of runnable avg for new
> [patch v7 4/8] sched: fix slept time double counting in enqueue
> [patch v7 5/8] sched: update cpu load after task_tick.

Patches 2~5 are bug fixes.
> [patch v7 6/8] sched: compute runnable load avg in cpu_load and
> [patch v7 7/8] sched: consider runnable load average in move_tasks

Only patches 6 and 7 enable the runnable load in load balance.
> [patch v7 8/8] sched: remove blocked_load_avg in tg

According to testing, the 8th patch gives a performance gain.

-- 
Thanks
    Alex

