linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 00/14] sched: entity load-tracking re-work
@ 2012-02-02  1:38 Paul Turner
  2012-02-02  1:38 ` [RFC PATCH 05/14] sched: account for blocked load waking back up Paul Turner
                   ` (18 more replies)
  0 siblings, 19 replies; 68+ messages in thread
From: Paul Turner @ 2012-02-02  1:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: Venki Pallipadi, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

Hi all,

The attached series is an RFC on implementing load tracking at the entity
instead of cfs_rq level. This results in a bottom-up load-computation in which
entities contribute to their parent's load, as opposed to the current top-down
approach where the parent averages its children.  In particular this allows us
to correctly migrate load with its accompanying entities and provides the
necessary inputs for intelligent load-balancing and power-management.

It was previously well tested and stable, but that was on v3.1-; there have
been some fairly extensive changes in the wake-up path since, so apologies if
anything was broken in the rebase.  Note also, since this is an RFC on the
approach, I have not yet de-linted the various CONFIG combinations for
introduced compiler errors.

Background:
----------
We currently track the load average at the parenting cfs_rq level. We divide
execution into a series of consecutive "windows", w_i.  Within each we track:
  \Sum load_i * t_i, where \Sum t_i = w and each load_i is a disjoint load level.

The load average is then \Sum w_j / 2^j.

This works reasonably well but there are a few problems:
1) Since decay occurs at window boundaries there are 'skews':
At a window boundary the 'foreground' time has a bias against the
time immediately preceding it (as a result of the folding division)
e.g.  __xx_|_yyyy___ vs __xx_yyyy_|___  (where '|' is a window boundary).

The accounting here is 2x + 4y/2 or 2x + 4y, depending on which side of the
window your load lands on.

2) Everything within a window 'w' is accounted equally; we only fold at
the boundaries.  This actually means you can't set 'w' large enough to
really have meaningful coverage of the sched period without throwing
decay out the window.  But then with w/2 << sched_period (currently
w/2=5ms) the average ends up having fairly drastic swings even with
stable loads.

(Note:  Before we even looked at migrating to per-entity tracking we evaluated
only folding the foreground window into the considered average once it was
"complete"; this represented a measurable improvement in stability under
predictable load.)

3) Since the load sum average is per-cfs_rq and not per-entity, when a task
entity migrates we lose its contribution to load-sum and effectively double
count it while its former sum decays.

New approach:
-------------
Instead of tracking load on a per-cfs_rq basis we do it on a per-sched_entity
basis.  The load sum for a cfs_rq is then just the sum of its children's load
averages.  The immediately obvious nice property is that when we migrate thread
A from cpu1-->cpu2, we carry its load with it; leaving the global load sum
unmodified.

The 'windows' above are replaced with more fine-grained tracking, that is (for
an entity j):

runnable_j = \Sum u_i * y^i ,  load_avg_j = runnable_j * weight_j [*]

Where: u_i is the usage in the last i'th ~ms and y is chosen such that
y^k = 1/2 for k ~= sched_period (we currently choose k=32).  This means that
load tracked 1 sched_period ago contributes about ~50% as much as current
execution.

Now, the real challenge in tracking on a per-entity basis is handling blocked
entities.  Obviously runnable entities will be updated naturally as we iterate
through them but there could be O(many) blocked entities so we can't afford to
iterate through them and update their tracking.

That our decay for a unit period is exponential introduces a particularly nice
property here:
We can separate the contributed load on a cfs_rq into blocked and runnable.
Runnable load is just the sum of all load_avg_j above, maintained by the
entities themselves as they run and self update (when they update their
load_avg they update the cumulative sum also).

Blocked load then looks like:
  load_avg_j = weight_j * \Sum u[j]_n * y^n

To account an entirely idle period we then only need to multiply by y.
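
Written out, this works because decay distributes over the sum: for blocked
entities b_1, b_2, ...

  y * blocked_load(cfs_rq) = y * (load_avg_1 + load_avg_2 + ...)
                           = y*load_avg_1 + y*load_avg_2 + ...

so one multiply per wholly-idle ~ms advances every blocked entity's decay at
once; each entity just records how many such decays it missed so that its (now
stale) contribution can be removed correctly when it wakes (see patches 04/05).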

This ends up being orders of magnitude more accurate than the current
tracking scheme, even with the old shares_window massively dilated to
better capture a stable load-state.  It's also obviously stable under
migration.

As referenced above this also allows us to potentially improve decisions within
the load-balancer, for both distribution and power-management.

Example: consider 1x80% task and 2x40% tasks on a 2-core machine.  It's
currently a bit of a gamble as to whether you get an {AB, B} or {A,
BB} split since they have equal weight (assume 1024).  With per-task
tracking we can actually consider them at their contributed weight and
see a stable ~{800,{400, 400}} load-split.  Likewise within balance_tasks we can
consider the load migrated to be that actually contributed.
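
(Working the numbers: at nice-0 weight 1024 the contributed loads are about
0.80*1024 ~= 819 for the 80% task and 0.40*1024 ~= 410 for each 40% task, so
the balancer compares ~819 against ~820 rather than 1024 against 2048.)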

Thanks,

- Paul


---

Ben Segall (1):
      sched: maintain per-rq runnable averages

Paul Turner (13):
      sched: track the runnable average on a per-task entity basis
      sched: aggregate load contributed by task entities on parenting cfs_rq
      sched: maintain the load contribution of blocked entities
      sched: account for blocked load waking back up
      sched: aggregate total task_group load
      sched: compute load contribution by a group entity
      sched: normalize tg load contributions against runnable time
      sched: maintain runnable averages across throttled periods
      sched: replace update_shares weight distribution with per-entity computation
      sched: refactor update_shares_cpu() -> update_blocked_avgs()
      sched: update_cfs_shares at period edge
      sched: make __update_entity_runnable_avg() fast
      sched: implement usage tracking


 include/linux/sched.h |   11 +
 kernel/sched/core.c   |    3 
 kernel/sched/debug.c  |   37 ++-
 kernel/sched/fair.c   |  714 ++++++++++++++++++++++++++++++++++++++-----------
 kernel/sched/sched.h  |   29 +-
 5 files changed, 611 insertions(+), 183 deletions(-)




* [RFC PATCH 06/14] sched: aggregate total task_group load
  2012-02-02  1:38 [RFC PATCH 00/14] sched: entity load-tracking re-work Paul Turner
                   ` (2 preceding siblings ...)
  2012-02-02  1:38 ` [RFC PATCH 07/14] sched: compute load contribution by a group entity Paul Turner
@ 2012-02-02  1:38 ` Paul Turner
  2012-02-17  4:41   ` Nikunj A Dadhania
  2012-02-02  1:38 ` [RFC PATCH 03/14] sched: aggregate load contributed by task entities on parenting cfs_rq Paul Turner
                   ` (14 subsequent siblings)
  18 siblings, 1 reply; 68+ messages in thread
From: Paul Turner @ 2012-02-02  1:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: Venki Pallipadi, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

Maintain a global running sum of the average load seen on each cfs_rq belonging
to each task group so that it may be used in calculating an appropriate
shares:weight distribution.

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Ben Segall <bsegall@google.com>
---
 kernel/sched/debug.c |    4 ++++
 kernel/sched/fair.c  |   17 +++++++++++++++++
 kernel/sched/sched.h |    2 ++
 3 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index a638d9d..f6227c0 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -228,6 +228,10 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			cfs_rq->runnable_load_avg);
 	SEQ_printf(m, "  .%-30s: %lld\n", "blocked_load_avg",
 			cfs_rq->blocked_load_avg);
+	SEQ_printf(m, "  .%-30s: %ld\n", "tg_load_avg",
+			atomic64_read(&cfs_rq->tg->load_avg));
+	SEQ_printf(m, "  .%-30s: %lld\n", "tg_load_contrib",
+			cfs_rq->tg_load_contrib);
 #endif
 
 	print_cfs_group_stats(m, cpu, cfs_rq->tg);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c9a8f6d..7771003 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1111,6 +1111,21 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se)
 	return se->avg.load_avg_contrib - old_contrib;
 }
 
+static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
+						 int force_update)
+{
+	struct task_group *tg = cfs_rq->tg;
+	s64 tg_contrib;
+
+	tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
+	tg_contrib -= cfs_rq->tg_load_contrib;
+
+	if (force_update || abs64(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
+		atomic64_add(tg_contrib, &tg->load_avg);
+		cfs_rq->tg_load_contrib += tg_contrib;
+	}
+}
+
 static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
 						 long load_contrib)
 {
@@ -1166,6 +1181,8 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 		atomic64_add(decays, &cfs_rq->decay_counter);
 		cfs_rq->last_decay = now;
 	}
+
+	__update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
 }
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f45b49..17f99e7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -116,6 +116,7 @@ struct task_group {
 	unsigned long shares;
 
 	atomic_t load_weight;
+	atomic64_t load_avg;
 #endif
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -272,6 +273,7 @@ struct cfs_rq {
 	unsigned long load_contribution;
 
 	u64 runnable_load_avg, blocked_load_avg;
+	u64 tg_load_contrib;
 	atomic64_t decay_counter, removed_load;
 	u64 last_decay;
 #endif /* CONFIG_SMP */




* [RFC PATCH 01/14] sched: track the runnable average on a per-task entity basis
  2012-02-02  1:38 [RFC PATCH 00/14] sched: entity load-tracking re-work Paul Turner
  2012-02-02  1:38 ` [RFC PATCH 05/14] sched: account for blocked load waking back up Paul Turner
@ 2012-02-02  1:38 ` Paul Turner
  2012-02-15 23:37   ` Peter Zijlstra
  2012-02-16 13:27   ` Peter Zijlstra
  2012-02-02  1:38 ` [RFC PATCH 07/14] sched: compute load contribution by a group entity Paul Turner
                   ` (16 subsequent siblings)
  18 siblings, 2 replies; 68+ messages in thread
From: Paul Turner @ 2012-02-02  1:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: Venki Pallipadi, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

Instead of tracking the averaged load parented by a cfs_rq, we can track
entity load directly.  The load for a given cfs_rq is then just the sum of
its children's loads.

To do this we represent the historical contribution to runnable average within each
trailing 1024us of execution as the coefficients of a geometric series.

We can express this for a given task t as:
  runnable_sum(t) = \Sum u_i * y^i ,
  load(t) = weight_t * runnable_sum(t) / (\Sum 1024 * y^i)

Where: u_i is the usage in the last i'th 1024us period (approximately 1ms)
and y is chosen such that y^k = 1/2.  We currently choose k to be 32, which
roughly translates to about a sched period.
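
To make the normalization concrete: a task that has been runnable for every
recent period has u_i = 1024 for all i, so runnable_sum(t) equals the
denominator and load(t) converges on weight_t; one runnable roughly half the
time contributes about weight_t/2.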

Signed-off-by: Paul Turner <pjt@google.com>
---
 include/linux/sched.h |    7 +++
 kernel/sched/debug.c  |    2 +
 kernel/sched/fair.c   |  114 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 123 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a5be381..91599c8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1156,6 +1156,11 @@ struct load_weight {
 	unsigned long weight, inv_weight;
 };
 
+struct sched_avg {
+	u64 runnable_avg_sum, runnable_avg_period;
+	u64 last_runnable_update;
+};
+
 #ifdef CONFIG_SCHEDSTATS
 struct sched_statistics {
 	u64			wait_start;
@@ -1215,6 +1220,8 @@ struct sched_entity {
 	struct cfs_rq		*cfs_rq;
 	/* rq "owned" by this entity/group: */
 	struct cfs_rq		*my_q;
+
+	struct sched_avg	avg;
 #endif
 };
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 09acaa1..d89db32 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -85,6 +85,8 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	P(se->statistics.wait_count);
 #endif
 	P(se->load.weight);
+	P(se->avg.runnable_avg_sum);
+	P(se->avg.runnable_avg_period);
 #undef PN
 #undef P
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8e77a6b..a570e9c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -988,6 +988,108 @@ static inline void update_entity_shares_tick(struct cfs_rq *cfs_rq)
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+/*
+ * Approximate:
+ *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
+ */
+static __always_inline u64 decay_load(u64 val, int n)
+{
+	for (;n && val;n--) {
+		val *= 4008;
+		val >>= 12;
+	}
+
+	return val;
+}
+
+/* We can represent the historical contribution to runnable average as the
+ * coefficients of a geometric series.  To do this we sub-divide our runnable
+ * history into segments of approximately 1ms (1024us); label the segment that
+ * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g.
+ *
+ * [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
+ *      p0            p1           p2
+ *     (now)       (~1ms ago)  (~2ms ago)
+ *
+ * Let u_i denote the fraction of p_i that the entity was runnable.
+ *
+ * We can then represent historical load-average using u_i as the coefficients
+ * of a geometric series.
+ *   u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
+ *  (Taking the sum over the equivalently decayed period)
+ *
+ * We choose k to be approximately the width of a scheduling period, that is:
+ *   y^32 = 0.5
+ * This means that the contribution to load ~32ms ago will be weighted
+ * approximately half as much as the contribution to load within the last ms.
+ *
+ * When a period "rolls over" and we have new u_0`, we can multiply the
+ * previous sum again by y to update:
+ *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
+ *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
+ */
+static __always_inline int __update_entity_runnable_avg(u64 now,
+							struct sched_avg *sa,
+							int runnable)
+{
+	u64 delta;
+	int delta_w, decayed = 0;
+
+	delta = now - sa->last_runnable_update;
+	if((s64)delta < 0) {
+		sa->last_runnable_update = now;
+		return 0;
+	}
+
+	/*
+	 * Use 1024us as the unit of measurement since it's a reasonable
+	 * approximation of 1ms and fast to compute.
+	 */
+	delta >>= 10;
+	if (!delta)
+		return 0;
+	sa->last_runnable_update = now;
+
+	delta_w = sa->runnable_avg_period % 1024;
+	if (delta + delta_w >= 1024) {
+		/* period roll-over */
+		decayed = 1;
+
+		delta_w = 1024 - delta_w;
+		BUG_ON(delta_w > delta);
+		do {
+			if (runnable)
+				sa->runnable_avg_sum += delta_w;
+			sa->runnable_avg_period += delta_w;
+
+			sa->runnable_avg_sum =
+				decay_load(sa->runnable_avg_sum, 1);
+			sa->runnable_avg_period =
+				decay_load(sa->runnable_avg_period, 1);
+
+			delta -= delta_w;
+			delta_w = 1024;
+		} while (delta >= 1024);
+	}
+
+	if (runnable)
+		sa->runnable_avg_sum += delta;
+	sa->runnable_avg_period += delta;
+
+	return decayed;
+}
+
+/* Update a sched_entity's runnable average */
+static inline void update_entity_load_avg(struct sched_entity *se)
+{
+	__update_entity_runnable_avg(rq_of(cfs_rq_of(se))->clock_task, &se->avg,
+				     se->on_rq);
+}
+#else
+static inline void update_entity_load_avg(struct sched_entity *se) {}
+#endif
+
 static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 #ifdef CONFIG_SCHEDSTATS
@@ -1112,6 +1214,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 */
 	update_curr(cfs_rq);
 	update_cfs_load(cfs_rq, 0);
+	update_entity_load_avg(se);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
 
@@ -1186,6 +1289,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
+	update_entity_load_avg(se);
 
 	update_stats_dequeue(cfs_rq, se);
 	if (flags & DEQUEUE_SLEEP) {
@@ -1355,6 +1459,8 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 		update_stats_wait_start(cfs_rq, prev);
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);
+		/* in !on_rq case, update occurred at dequeue */
+		update_entity_load_avg(prev);
 	}
 	cfs_rq->curr = NULL;
 }
@@ -1367,6 +1473,14 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	 */
 	update_curr(cfs_rq);
 
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+	/*
+	 * Ensure that runnable average is periodically updated.
+	 */
+	if (likely(curr->avg.last_runnable_update)) {
+		update_entity_load_avg(curr);
+	}
+#endif
 	/*
 	 * Update share accounting for long-running entities.
 	 */




* [RFC PATCH 04/14] sched: maintain the load contribution of blocked entities
  2012-02-02  1:38 [RFC PATCH 00/14] sched: entity load-tracking re-work Paul Turner
                   ` (4 preceding siblings ...)
  2012-02-02  1:38 ` [RFC PATCH 03/14] sched: aggregate load contributed by task entities on parenting cfs_rq Paul Turner
@ 2012-02-02  1:38 ` Paul Turner
  2012-02-16 12:25   ` Peter Zijlstra
  2012-02-02  1:38 ` [RFC PATCH 02/14] sched: maintain per-rq runnable averages Paul Turner
                   ` (12 subsequent siblings)
  18 siblings, 1 reply; 68+ messages in thread
From: Paul Turner @ 2012-02-02  1:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: Venki Pallipadi, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

We are currently maintaining:
  runnable_load(cfs_rq) = \Sum task_load(t)

For all running children t of cfs_rq.  While this can be naturally updated for
tasks in a runnable state (as they are scheduled), it does not account for
the load contributed by blocked task entities.

This can be solved by introducing a separate accounting for blocked load:
  blocked_load(cfs_rq) = \Sum runnable(b) * weight(b)

Obviously we do not want to iterate over all blocked entities to account for
their decay; we instead observe that:
  runnable_load(t) = \Sum p_i*y^i

and that to account for an additional idle period we only need to compute:
  y*runnable_load(t).

This means that we can compute all blocked entities at once by evaluating:
  blocked_load(cfs_rq)` = y * blocked_load(cfs_rq)

Finally we maintain a decay counter so that when a sleeping entity re-awakens
we can determine how much of its load should be removed from the blocked sum.
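
As a sketch of the bookkeeping (names as in the patch below): the cfs_rq's
decay_counter advances by one for every period of decay applied to the blocked
sum, and a sleeping entity snapshots it at dequeue.  If it has advanced by d
when the entity is next looked at, the entity's saved contribution is aged to
match before being removed:

  removed = load_avg_contrib * y^d,  d = decay_counter(now) - decay_count(se)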

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Ben Segall <bsegall@google.com>
---
 include/linux/sched.h |    2 -
 kernel/sched/core.c   |    3 +
 kernel/sched/debug.c  |    3 +
 kernel/sched/fair.c   |  119 ++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h  |    4 +-
 5 files changed, 116 insertions(+), 15 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f2999f0..70eae51 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1158,7 +1158,7 @@ struct load_weight {
 
 struct sched_avg {
 	u64 runnable_avg_sum, runnable_avg_period;
-	u64 last_runnable_update;
+	u64 last_runnable_update, decay_count;
 	unsigned long load_avg_contrib;
 };
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d7c4322..79c3e31 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1683,6 +1683,9 @@ static void __sched_fork(struct task_struct *p)
 	p->se.vruntime			= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+	p->se.avg.decay_count = 0;
+#endif
 #ifdef CONFIG_SCHEDSTATS
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
 #endif
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 5a55d26..a638d9d 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -94,6 +94,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	P(se->avg.runnable_avg_sum);
 	P(se->avg.runnable_avg_period);
 	P(se->avg.load_avg_contrib);
+	P(se->avg.decay_count);
 #undef PN
 #undef P
 }
@@ -225,6 +226,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			atomic_read(&cfs_rq->tg->load_weight));
 	SEQ_printf(m, "  .%-30s: %lld\n", "runnable_load_avg",
 			cfs_rq->runnable_load_avg);
+	SEQ_printf(m, "  .%-30s: %lld\n", "blocked_load_avg",
+			cfs_rq->blocked_load_avg);
 #endif
 
 	print_cfs_group_stats(m, cpu, cfs_rq->tg);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bcdad5d..cc4ec4b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1080,6 +1080,20 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 	return decayed;
 }
 
+/* Synchronize an entity's decay with its parenting cfs_rq. */
+static inline void __synchronize_entity_decay(struct sched_entity *se)
+{
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	u64 decays = atomic64_read(&cfs_rq->decay_counter);
+
+	decays -= se->avg.decay_count;
+	if (!decays)
+		return;
+
+	se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
+	se->avg.decay_count += decays;
+}
+
 /* Compute the current contribution to load_avg by se, return any delta */
 static long __update_entity_load_avg_contrib(struct sched_entity *se)
 {
@@ -1095,8 +1109,18 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se)
 	return se->avg.load_avg_contrib - old_contrib;
 }
 
+static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
+						 long load_contrib)
+{
+	if (likely(load_contrib < cfs_rq->blocked_load_avg))
+		cfs_rq->blocked_load_avg -= load_contrib;
+	else
+		cfs_rq->blocked_load_avg = 0;
+}
+
 /* Update a sched_entity's runnable average */
-static inline void update_entity_load_avg(struct sched_entity *se)
+static inline void update_entity_load_avg(struct sched_entity *se,
+					  int update_cfs_rq)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	long contrib_delta;
@@ -1106,8 +1130,34 @@ static inline void update_entity_load_avg(struct sched_entity *se)
 		return;
 
 	contrib_delta = __update_entity_load_avg_contrib(se);
+
+	if (!update_cfs_rq)
+		return;
+
 	if (se->on_rq)
 		cfs_rq->runnable_load_avg += contrib_delta;
+	else
+		subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
+}
+
+/*
+ * Decay the load contributed by all blocked children and account this so that
+ * their contribution may be appropriately discounted when they wake up.
+ */
+static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq)
+{
+	u64 now = rq_of(cfs_rq)->clock_task >> 20;
+	u64 decays;
+
+	decays = now - cfs_rq->last_decay;
+	if (!decays)
+		return;
+
+	cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg,
+					      decays);
+	atomic64_add(decays, &cfs_rq->decay_counter);
+
+	cfs_rq->last_decay = now;
 }
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
@@ -1117,27 +1167,47 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 
 /* Add the load generated by se into cfs_rq's child load-average */
 static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
-						  struct sched_entity *se)
+						  struct sched_entity *se,
+						  int wakeup)
 {
-	update_entity_load_avg(se);
+	__synchronize_entity_decay(se);
+
+	if (wakeup)
+		subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
+
+	update_entity_load_avg(se, 0);
 	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
+	update_cfs_rq_blocked_load(cfs_rq);
 }
 
-/* Remove se's load from this cfs_rq child load-average */
+/*
+ * Remove se's load from this cfs_rq child load-average, if the entity is
+ * transitioning to a blocked state we track its projected decay using
+ * blocked_load_avg.
+ */
 static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
-						  struct sched_entity *se)
+						  struct sched_entity *se,
+						  int sleep)
 {
-	update_entity_load_avg(se);
+	update_entity_load_avg(se, 1);
+
 	cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
+	if (sleep) {
+		cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
+		se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
+	}
 }
 
 #else
-static inline void update_entity_load_avg(struct sched_entity *se) {}
+static inline void update_entity_load_avg(struct sched_entity *se,
+				     int update_cfs_rq) {}
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
 static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
-						  struct sched_entity *se) {}
+						  struct sched_entity *se,
+						  int wakeup) {}
 static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
-						  struct sched_entity *se) {}
+						  struct sched_entity *se,
+						  int sleep) {}
 #endif
 
 static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -1264,7 +1334,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 */
 	update_curr(cfs_rq);
 	update_cfs_load(cfs_rq, 0);
-	enqueue_entity_load_avg(cfs_rq, se);
+	enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
 
@@ -1339,7 +1409,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	dequeue_entity_load_avg(cfs_rq, se);
+	dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
 
 	update_stats_dequeue(cfs_rq, se);
 	if (flags & DEQUEUE_SLEEP) {
@@ -1510,7 +1580,7 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);
 		/* in !on_rq case, update occurred at dequeue */
-		update_entity_load_avg(prev);
+		update_entity_load_avg(prev, 1);
 	}
 	cfs_rq->curr = NULL;
 }
@@ -1528,9 +1598,11 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	 * Ensure that runnable average is periodically updated.
 	 */
 	if (likely(curr->avg.last_runnable_update)) {
-		update_entity_load_avg(curr);
+		update_entity_load_avg(curr, 1);
+		update_cfs_rq_blocked_load(cfs_rq);
 	}
 #endif
+
 	/*
 	 * Update share accounting for long-running entities.
 	 */
@@ -2388,6 +2460,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
+		update_entity_load_avg(se, 1);
 	}
 
 	if (!se) {
@@ -2449,6 +2522,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
+		update_entity_load_avg(se, 1);
 	}
 
 	if (!se) {
@@ -3473,6 +3547,7 @@ static int update_shares_cpu(struct task_group *tg, int cpu)
 
 	update_rq_clock(rq);
 	update_cfs_load(cfs_rq, 1);
+	update_cfs_rq_blocked_load(cfs_rq);
 
 	/*
 	 * We need to update shares after updating tg->load_weight in
@@ -5478,6 +5553,21 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
 		place_entity(cfs_rq, se, 0);
 		se->vruntime -= cfs_rq->min_vruntime;
 	}
+
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+	/*
+	* Remove our load from contribution when we leave sched_fair
+	* and ensure we don't carry in an old decay_count if we
+	* switch back.
+	*/
+	if (p->se.avg.decay_count) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
+		__synchronize_entity_decay(&p->se);
+		subtract_blocked_load_contrib(cfs_rq,
+				p->se.avg.load_avg_contrib);
+		p->se.avg.decay_count = 0;
+	}
+#endif
 }
 
 /*
@@ -5525,6 +5615,9 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 #ifndef CONFIG_64BIT
 	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
 #endif
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+	atomic64_set(&cfs_rq->decay_counter, 1);
+#endif
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 77a3427..2c19c26 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -271,7 +271,9 @@ struct cfs_rq {
 
 	unsigned long load_contribution;
 
-	u64 runnable_load_avg;
+	u64 runnable_load_avg, blocked_load_avg;
+	atomic64_t decay_counter;
+	u64 last_decay;
 #endif /* CONFIG_SMP */
 #ifdef CONFIG_CFS_BANDWIDTH
 	int runtime_enabled;




* [RFC PATCH 05/14] sched: account for blocked load waking back up
  2012-02-02  1:38 [RFC PATCH 00/14] sched: entity load-tracking re-work Paul Turner
@ 2012-02-02  1:38 ` Paul Turner
  2012-02-16 15:57   ` Peter Zijlstra
  2012-02-16 16:02   ` Peter Zijlstra
  2012-02-02  1:38 ` [RFC PATCH 01/14] sched: track the runnable average on a per-task entity basis Paul Turner
                   ` (17 subsequent siblings)
  18 siblings, 2 replies; 68+ messages in thread
From: Paul Turner @ 2012-02-02  1:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: Venki Pallipadi, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

When a running entity blocks we migrate its tracked load to
cfs_rq->blocked_load_avg.  In the sleep case this occurs while holding
rq->lock and so is a natural transition.  Wake-ups, however, are potentially
asynchronous in the presence of migration and so special care must be taken.

We use an atomic counter to track such migrated load, taking care to match this
with the previously introduced decay counters so that we don't migrate too much
load.
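
The general shape of the mechanism, as a self-contained C sketch of the
pattern (the toy_* names are made up and C11 atomics stand in for the kernel's
atomic64_t; this is not the kernel code below): remote CPUs accumulate the
departing load into an atomic counter, and the owning CPU folds it in with an
exchange the next time it updates the cfs_rq under rq->lock.

#include <stdatomic.h>
#include <stdint.h>

struct toy_cfs_rq {
	_Atomic uint64_t removed_load;	/* written without the owner's lock */
	uint64_t blocked_load_avg;	/* only touched under the owner's lock */
};

/* waking/migrating side: may not hold the owning runqueue's lock */
static void toy_remove_load_async(struct toy_cfs_rq *cfs_rq, uint64_t contrib)
{
	atomic_fetch_add(&cfs_rq->removed_load, contrib);
}

/* owning side: called with the runqueue lock held */
static void toy_drain_removed_load(struct toy_cfs_rq *cfs_rq)
{
	uint64_t removed = atomic_exchange(&cfs_rq->removed_load, 0);

	/* clamp at zero, mirroring subtract_blocked_load_contrib() */
	if (removed < cfs_rq->blocked_load_avg)
		cfs_rq->blocked_load_avg -= removed;
	else
		cfs_rq->blocked_load_avg = 0;
}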

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Ben Segall <bsegall@google.com>
---
 include/linux/sched.h |    2 +
 kernel/sched/fair.c   |   98 +++++++++++++++++++++++++++++++++++++++++--------
 kernel/sched/sched.h  |    7 +++-
 3 files changed, 90 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 70eae51..09b8c45 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1160,6 +1160,8 @@ struct sched_avg {
 	u64 runnable_avg_sum, runnable_avg_period;
 	u64 last_runnable_update, decay_count;
 	unsigned long load_avg_contrib;
+
+	int contributes_blocked_load;
 };
 
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cc4ec4b..c9a8f6d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1081,17 +1081,19 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 }
 
 /* Synchronize an entity's decay with its parenting cfs_rq. */
-static inline void __synchronize_entity_decay(struct sched_entity *se)
+static inline u64 __synchronize_entity_decay(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 decays = atomic64_read(&cfs_rq->decay_counter);
 
 	decays -= se->avg.decay_count;
 	if (!decays)
-		return;
+		return 0;
 
 	se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
 	se->avg.decay_count += decays;
+
+	return decays;
 }
 
 /* Compute the current contribution to load_avg by se, return any delta */
@@ -1144,20 +1146,26 @@ static inline void update_entity_load_avg(struct sched_entity *se,
  * Decay the load contributed by all blocked children and account this so that
  * their contribution may be appropriately discounted when they wake up.
  */
-static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq)
+static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 {
 	u64 now = rq_of(cfs_rq)->clock_task >> 20;
 	u64 decays;
 
 	decays = now - cfs_rq->last_decay;
-	if (!decays)
+	if (!decays && !force_update)
 		return;
 
-	cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg,
-					      decays);
-	atomic64_add(decays, &cfs_rq->decay_counter);
+	if (atomic64_read(&cfs_rq->removed_load)) {
+		u64 removed_load = atomic64_xchg(&cfs_rq->removed_load, 0);
+		subtract_blocked_load_contrib(cfs_rq, removed_load);
+	}
 
-	cfs_rq->last_decay = now;
+	if (decays) {
+		cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg,
+						      decays);
+		atomic64_add(decays, &cfs_rq->decay_counter);
+		cfs_rq->last_decay = now;
+	}
 }
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
@@ -1170,14 +1178,34 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 						  struct sched_entity *se,
 						  int wakeup)
 {
-	__synchronize_entity_decay(se);
+	/* we track migrations using entity decay_count == 0 */
+	if (unlikely(se->avg.decay_count <= 0)) {
+		se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;
+		if (se->avg.decay_count) {
+			/*
+			 * In a wake-up migration we have to approximate the
+			 * time sleeping.
+			 */
+			se->avg.last_runnable_update -= (-se->avg.decay_count)
+							<< 20;
+			update_entity_load_avg(se, 0);
+		}
+		se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
+		wakeup = 0;
+	} else {
+		__synchronize_entity_decay(se);
+	}
 
-	if (wakeup)
+	/* migrated tasks did not contribute to our blocked load */
+	if (wakeup) {
+		se->avg.contributes_blocked_load = 0;
 		subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
+		update_entity_load_avg(se, 0);
+	}
 
-	update_entity_load_avg(se, 0);
 	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
-	update_cfs_rq_blocked_load(cfs_rq);
+	/* we force update consideration on load-balancer moves */
+	update_cfs_rq_blocked_load(cfs_rq, !wakeup);
 }
 
 /*
@@ -1190,14 +1218,38 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
 						  int sleep)
 {
 	update_entity_load_avg(se, 1);
+	/* we force update consideration on load-balancer moves */
+	update_cfs_rq_blocked_load(cfs_rq, !sleep);
 
 	cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
 	if (sleep) {
+		se->avg.contributes_blocked_load = 1;
 		cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
 		se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
+	} else {
+		se->avg.contributes_blocked_load = 0;
+		se->avg.decay_count = 0;
 	}
 }
 
+/*
+ * Accumulate removed load so that it can be processed when we next update
+ * owning cfs_rq under rq->lock.
+ */
+inline void remove_task_load_avg_async(struct task_struct *p)
+{
+	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+	if (!se->avg.contributes_blocked_load)
+		return;
+
+	se->avg.contributes_blocked_load = 0;
+	__synchronize_entity_decay(se);
+	se->avg.decay_count = 0;
+	atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
+}
+
 #else
 static inline void update_entity_load_avg(struct sched_entity *se,
 				     int update_cfs_rq) {}
@@ -1208,6 +1260,7 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
 						  struct sched_entity *se,
 						  int sleep) {}
+inline void remove_task_load_avg_async(struct task_struct *p) {}
 #endif
 
 static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -1599,7 +1652,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	 */
 	if (likely(curr->avg.last_runnable_update)) {
 		update_entity_load_avg(curr, 1);
-		update_cfs_rq_blocked_load(cfs_rq);
+		update_cfs_rq_blocked_load(cfs_rq, 1);
 	}
 #endif
 
@@ -3547,7 +3600,7 @@ static int update_shares_cpu(struct task_group *tg, int cpu)
 
 	update_rq_clock(rq);
 	update_cfs_load(cfs_rq, 1);
-	update_cfs_rq_blocked_load(cfs_rq);
+	update_cfs_rq_blocked_load(cfs_rq, 1);
 
 	/*
 	 * We need to update shares after updating tg->load_weight in
@@ -5617,12 +5670,14 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 #endif
 #if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
 	atomic64_set(&cfs_rq->decay_counter, 1);
+	atomic64_set(&cfs_rq->removed_load, 0);
 #endif
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static void task_move_group_fair(struct task_struct *p, int on_rq)
 {
+	struct cfs_rq *cfs_rq;
 	/*
 	 * If the task was not on the rq at the time of this cgroup movement
 	 * it must have been asleep, sleeping tasks keep their ->vruntime
@@ -5654,8 +5709,19 @@ static void task_move_group_fair(struct task_struct *p, int on_rq)
 	if (!on_rq)
 		p->se.vruntime -= cfs_rq_of(&p->se)->min_vruntime;
 	set_task_rq(p, task_cpu(p));
-	if (!on_rq)
-		p->se.vruntime += cfs_rq_of(&p->se)->min_vruntime;
+	if (!on_rq) {
+		cfs_rq = cfs_rq_of(&p->se);
+		p->se.vruntime += cfs_rq->min_vruntime;
+#ifdef CONFIG_SMP
+		/*
+		 * set_task_rq() will have removed our previous contribution,
+		 * but we must synchronize explicitly against further decay
+		 * here.
+		 */
+		p->se.avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
+		cfs_rq->blocked_load_avg += p->se.avg.load_avg_contrib;
+#endif
+	}
 }
 
 void free_fair_sched_group(struct task_group *tg)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2c19c26..9f45b49 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -272,7 +272,7 @@ struct cfs_rq {
 	unsigned long load_contribution;
 
 	u64 runnable_load_avg, blocked_load_avg;
-	atomic64_t decay_counter;
+	atomic64_t decay_counter, removed_load;
 	u64 last_decay;
 #endif /* CONFIG_SMP */
 #ifdef CONFIG_CFS_BANDWIDTH
@@ -570,6 +570,8 @@ static inline struct task_group *task_group(struct task_struct *p)
 	return autogroup_task_group(p, tg);
 }
 
+inline void remove_task_load_avg_async(struct task_struct *p);
+
 /* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
 static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
 {
@@ -578,6 +580,9 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
 #endif
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+	/* if we're migrating a sleeping task we need to remove its load */
+	remove_task_load_avg_async(p);
+
 	p->se.cfs_rq = tg->cfs_rq[cpu];
 	p->se.parent = tg->se[cpu];
 #endif




* [RFC PATCH 08/14] sched: normalize tg load contributions against runnable time
  2012-02-02  1:38 [RFC PATCH 00/14] sched: entity load-tracking re-work Paul Turner
                   ` (6 preceding siblings ...)
  2012-02-02  1:38 ` [RFC PATCH 02/14] sched: maintain per-rq runnable averages Paul Turner
@ 2012-02-02  1:38 ` Paul Turner
  2012-02-15 23:36   ` Peter Zijlstra
  2012-02-15 23:38   ` Peter Zijlstra
  2012-02-02  1:38 ` [RFC PATCH 09/14] sched: maintain runnable averages across throttled periods Paul Turner
                   ` (10 subsequent siblings)
  18 siblings, 2 replies; 68+ messages in thread
From: Paul Turner @ 2012-02-02  1:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: Venki Pallipadi, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

Entities of equal weight should receive equitable distribution of cpu time.
This is challenging in the case of a task_group's shares as execution may be
occurring on multiple cpus simultaneously.

To handle this we divide up the shares into weights proportionate with the load
on each cfs_rq.  This does not, however, account for the fact that the sum of
the parts may be less than one cpu and so we need to normalize:
  load(tg) = min(runnable_avg(tg), 1) * tg->shares
Where runnable_avg is the aggregate time in which the task_group had runnable
children.
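
For example, a group with tg->shares = 1024 whose cfs_rqs sum to 50% of one
cpu of runnable time (runnable_avg(tg) = 0.5) contributes ~512 at its parent,
the same as a single 50%-runnable task of weight 1024; a group consuming one
or more cpus of runnable time is simply capped at its full shares.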

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Ben Segall <bsegall@google.com>
---
 kernel/sched/debug.c |    4 ++++
 kernel/sched/fair.c  |   34 ++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |    2 ++
 3 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index f6227c0..8d87796 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -232,6 +232,10 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			atomic64_read(&cfs_rq->tg->load_avg));
 	SEQ_printf(m, "  .%-30s: %lld\n", "tg_load_contrib",
 			cfs_rq->tg_load_contrib);
+	SEQ_printf(m, "  .%-30s: %d\n", "tg_runnable_contrib",
+			cfs_rq->tg_runnable_contrib);
+	SEQ_printf(m, "  .%-30s: %d\n", "tg->runnable_avg",
+			atomic_read(&cfs_rq->tg->runnable_avg));
 #endif
 
 	print_cfs_group_stats(m, cpu, cfs_rq->tg);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6d8af5e..803c622 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1103,13 +1103,45 @@ static inline void __update_task_entity_contrib(struct sched_entity *se)
 					     se->avg.runnable_avg_period + 1);
 }
 
+/*
+ * Aggregate cfs_rq runnable averages into an equivalent task_group
+ * representation for computing load contributions.
+ */
+static inline void __update_tg_runnable_avg(struct sched_avg *sa,
+						  struct cfs_rq *cfs_rq)
+{
+	struct task_group *tg = cfs_rq->tg;
+	long contrib;
+
+	contrib = (sa->runnable_avg_sum << 12) / (sa->runnable_avg_period + 1);
+	contrib -= cfs_rq->tg_runnable_contrib;
+
+	if (abs(contrib) > cfs_rq->tg_runnable_contrib/64) {
+		atomic_add(contrib, &tg->runnable_avg);
+		cfs_rq->tg_runnable_contrib += contrib;
+	}
+}
+
 static inline void __update_group_entity_contrib(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = group_cfs_rq(se);
 	struct task_group *tg = cfs_rq->tg;
+	int runnable_avg;
 
 	se->avg.load_avg_contrib = (cfs_rq->tg_load_contrib * tg->shares);
 	se->avg.load_avg_contrib /= atomic64_read(&tg->load_avg) + 1;
+
+	/*
+	 * Unlike a task-entity, a group entity may be using >=1 cpu globally.
+	 * However, in the case that it's using <1 cpu we need to form a
+	 * correction term so that we contribute the same load as a task of
+	 * equal weight. (Global runnable time is taken as a fraction over 2^12.)
+	 */
+	runnable_avg = atomic_read(&tg->runnable_avg);
+	if (runnable_avg < (1<<12)) {
+		se->avg.load_avg_contrib *= runnable_avg;
+		se->avg.load_avg_contrib /= (1<<12);
+	}
 }
 
 /* Compute the current contribution to load_avg by se, return any delta */
@@ -1122,6 +1154,7 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se)
 	} else {
 		if (!se->on_rq)
 			__synchronize_entity_decay(se);
+		__update_tg_runnable_avg(&se->avg, group_cfs_rq(se));
 		__update_group_entity_contrib(se);
 	}
 
@@ -1205,6 +1238,7 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 {
 	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
+	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
 }
 
 /* Add the load generated by se into cfs_rq's child load-average */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 17f99e7..57cc227 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -117,6 +117,7 @@ struct task_group {
 
 	atomic_t load_weight;
 	atomic64_t load_avg;
+	atomic_t runnable_avg;
 #endif
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -272,6 +273,7 @@ struct cfs_rq {
 
 	unsigned long load_contribution;
 
+	u32 tg_runnable_contrib;
 	u64 runnable_load_avg, blocked_load_avg;
 	u64 tg_load_contrib;
 	atomic64_t decay_counter, removed_load;




* [RFC PATCH 07/14] sched: compute load contribution by a group entity
  2012-02-02  1:38 [RFC PATCH 00/14] sched: entity load-tracking re-work Paul Turner
  2012-02-02  1:38 ` [RFC PATCH 05/14] sched: account for blocked load waking back up Paul Turner
  2012-02-02  1:38 ` [RFC PATCH 01/14] sched: track the runnable average on a per-task entity basis Paul Turner
@ 2012-02-02  1:38 ` Paul Turner
  2012-02-02  1:38 ` [RFC PATCH 06/14] sched: aggregate total task_group load Paul Turner
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 68+ messages in thread
From: Paul Turner @ 2012-02-02  1:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: Venki Pallipadi, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

Unlike task entities, which have a fixed weight, group entities instead own a
fraction of their parenting task_group's shares as their contributed weight.

Compute this fraction so that we can correctly account hierarchies and shared
entity nodes.
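
That is, from the patch below, a group entity is accounted at approximately:

  load_avg_contrib(group_se) = tg->shares * cfs_rq->tg_load_contrib /
                               (tg->load_avg + 1)

e.g. a cpu whose cfs_rq holds a quarter of the group's total tracked load
presents roughly shares/4 of weight to its parent.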

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Ben Segall <bsegall@google.com>
---
 kernel/sched/fair.c |   29 +++++++++++++++++++++++------
 1 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7771003..6d8af5e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1096,17 +1096,34 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
 	return decays;
 }
 
+static inline void __update_task_entity_contrib(struct sched_entity *se)
+{
+	se->avg.load_avg_contrib = div64_u64(se->avg.runnable_avg_sum *
+					     se->load.weight,
+					     se->avg.runnable_avg_period + 1);
+}
+
+static inline void __update_group_entity_contrib(struct sched_entity *se)
+{
+	struct cfs_rq *cfs_rq = group_cfs_rq(se);
+	struct task_group *tg = cfs_rq->tg;
+
+	se->avg.load_avg_contrib = (cfs_rq->tg_load_contrib * tg->shares);
+	se->avg.load_avg_contrib /= atomic64_read(&tg->load_avg) + 1;
+}
+
 /* Compute the current contribution to load_avg by se, return any delta */
 static long __update_entity_load_avg_contrib(struct sched_entity *se)
 {
 	long old_contrib = se->avg.load_avg_contrib;
 
-	if (!entity_is_task(se))
-		return 0;
-
-	se->avg.load_avg_contrib = div64_u64(se->avg.runnable_avg_sum *
-					     se->load.weight,
-					     se->avg.runnable_avg_period + 1);
+	if (entity_is_task(se)) {
+		__update_task_entity_contrib(se);
+	} else {
+		if (!se->on_rq)
+			__synchronize_entity_decay(se);
+		__update_group_entity_contrib(se);
+	}
 
 	return se->avg.load_avg_contrib - old_contrib;
 }




* [RFC PATCH 02/14] sched: maintain per-rq runnable averages
  2012-02-02  1:38 [RFC PATCH 00/14] sched: entity load-tracking re-work Paul Turner
                   ` (5 preceding siblings ...)
  2012-02-02  1:38 ` [RFC PATCH 04/14] sched: maintain the load contribution of blocked entities Paul Turner
@ 2012-02-02  1:38 ` Paul Turner
  2012-02-02  1:38 ` [RFC PATCH 08/14] sched: normalize tg load contributions against runnable time Paul Turner
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 68+ messages in thread
From: Paul Turner @ 2012-02-02  1:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: Venki Pallipadi, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

From: Ben Segall <bsegall@google.com>

Since runqueues do not have a corresponding sched_entity we instead embed a
sched_avg structure directly.

Signed-off-by: Ben Segall <bsegall@google.com>
Signed-off-by: Paul Turner <pjt@google.com>
---
 kernel/sched/debug.c |   10 ++++++++--
 kernel/sched/fair.c  |   18 ++++++++++++++++--
 kernel/sched/sched.h |    2 ++
 3 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index d89db32..43e3162 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -61,14 +61,20 @@ static unsigned long nsec_low(unsigned long long nsec)
 static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group *tg)
 {
 	struct sched_entity *se = tg->se[cpu];
-	if (!se)
-		return;
 
 #define P(F) \
 	SEQ_printf(m, "  .%-30s: %lld\n", #F, (long long)F)
 #define PN(F) \
 	SEQ_printf(m, "  .%-30s: %lld.%06ld\n", #F, SPLIT_NS((long long)F))
 
+	if (!se) {
+		struct sched_avg *avg = &cpu_rq(cpu)->avg;
+		P(avg->runnable_avg_sum);
+		P(avg->runnable_avg_period);
+		return;
+	}
+
+
 	PN(se->exec_start);
 	PN(se->vruntime);
 	PN(se->sum_exec_runtime);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a570e9c..8fa199f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1086,8 +1086,14 @@ static inline void update_entity_load_avg(struct sched_entity *se)
 	__update_entity_runnable_avg(rq_of(cfs_rq_of(se))->clock_task, &se->avg,
 				     se->on_rq);
 }
+
+static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
+{
+	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
+}
 #else
 static inline void update_entity_load_avg(struct sched_entity *se) {}
+static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
 #endif
 
 static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -2340,8 +2346,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_cfs_shares(cfs_rq);
 	}
 
-	if (!se)
+	if (!se) {
+		update_rq_runnable_avg(rq, rq->nr_running);
 		inc_nr_running(rq);
+	}
 	hrtick_update(rq);
 }
 
@@ -2399,8 +2407,10 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_cfs_shares(cfs_rq);
 	}
 
-	if (!se)
+	if (!se) {
 		dec_nr_running(rq);
+		update_rq_runnable_avg(rq, 1);
+	}
 	hrtick_update(rq);
 }
 
@@ -4749,6 +4759,8 @@ void idle_balance(int this_cpu, struct rq *this_rq)
 	if (this_rq->avg_idle < sysctl_sched_migration_cost)
 		return;
 
+	update_rq_runnable_avg(this_rq, 1);
+
 	/*
 	 * Drop the rq->lock, but keep IRQ/preempt disabled.
 	 */
@@ -5328,6 +5340,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		cfs_rq = cfs_rq_of(se);
 		entity_tick(cfs_rq, se, queued);
 	}
+
+	update_rq_runnable_avg(rq, 1);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8a2c768..42b6df6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -473,6 +473,8 @@ struct rq {
 #ifdef CONFIG_SMP
 	struct llist_head wake_list;
 #endif
+
+	struct sched_avg avg;
 };
 
 static inline int cpu_of(struct rq *rq)




* [RFC PATCH 03/14] sched: aggregate load contributed by task entities on parenting cfs_rq
  2012-02-02  1:38 [RFC PATCH 00/14] sched: entity load-tracking re-work Paul Turner
                   ` (3 preceding siblings ...)
  2012-02-02  1:38 ` [RFC PATCH 06/14] sched: aggregate total task_group load Paul Turner
@ 2012-02-02  1:38 ` Paul Turner
  2012-02-02  1:38 ` [RFC PATCH 04/14] sched: maintain the load contribution of blocked entities Paul Turner
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 68+ messages in thread
From: Paul Turner @ 2012-02-02  1:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: Venki Pallipadi, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

For a given task t, we can compute its contribution to load as:
  task_load(t) = runnable_avg(t) * weight(t)

On a parenting cfs_rq we can then aggregate
  runnable_load(cfs_rq) = \Sum task_load(t), for all runnable children t

Maintain this bottom up, with task entities adding their contributed load to
the parenting cfs_rq sum.  When a task entity's load changes we add the same
delta to the maintained sum.
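
For example, a weight-1024 task that has been runnable about half of recent
time contributes ~512; if its average later rises to ~75% we add the +256
delta into the parent's runnable_load_avg rather than recomputing the whole
sum.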

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Ben Segall <bsegall@google.com>
---
 include/linux/sched.h |    1 +
 kernel/sched/debug.c  |    3 +++
 kernel/sched/fair.c   |   52 +++++++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h  |    2 ++
 4 files changed, 54 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 91599c8..f2999f0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1159,6 +1159,7 @@ struct load_weight {
 struct sched_avg {
 	u64 runnable_avg_sum, runnable_avg_period;
 	u64 last_runnable_update;
+	unsigned long load_avg_contrib;
 };
 
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 43e3162..5a55d26 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -93,6 +93,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	P(se->load.weight);
 	P(se->avg.runnable_avg_sum);
 	P(se->avg.runnable_avg_period);
+	P(se->avg.load_avg_contrib);
 #undef PN
 #undef P
 }
@@ -222,6 +223,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			cfs_rq->load_contribution);
 	SEQ_printf(m, "  .%-30s: %d\n", "load_tg",
 			atomic_read(&cfs_rq->tg->load_weight));
+	SEQ_printf(m, "  .%-30s: %lld\n", "runnable_load_avg",
+			cfs_rq->runnable_load_avg);
 #endif
 
 	print_cfs_group_stats(m, cpu, cfs_rq->tg);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8fa199f..bcdad5d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1080,20 +1080,64 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 	return decayed;
 }
 
+/* Compute the current contribution to load_avg by se, return any delta */
+static long __update_entity_load_avg_contrib(struct sched_entity *se)
+{
+	long old_contrib = se->avg.load_avg_contrib;
+
+	if (!entity_is_task(se))
+		return 0;
+
+	se->avg.load_avg_contrib = div64_u64(se->avg.runnable_avg_sum *
+					     se->load.weight,
+					     se->avg.runnable_avg_period + 1);
+
+	return se->avg.load_avg_contrib - old_contrib;
+}
+
 /* Update a sched_entity's runnable average */
 static inline void update_entity_load_avg(struct sched_entity *se)
 {
-	__update_entity_runnable_avg(rq_of(cfs_rq_of(se))->clock_task, &se->avg,
-				     se->on_rq);
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	long contrib_delta;
+
+	if(!__update_entity_runnable_avg(rq_of(cfs_rq)->clock_task, &se->avg,
+					se->on_rq))
+		return;
+
+	contrib_delta = __update_entity_load_avg_contrib(se);
+	if (se->on_rq)
+		cfs_rq->runnable_load_avg += contrib_delta;
 }
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 {
 	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
 }
+
+/* Add the load generated by se into cfs_rq's child load-average */
+static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
+						  struct sched_entity *se)
+{
+	update_entity_load_avg(se);
+	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
+}
+
+/* Remove se's load from this cfs_rq child load-average */
+static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
+						  struct sched_entity *se)
+{
+	update_entity_load_avg(se);
+	cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
+}
+
 #else
 static inline void update_entity_load_avg(struct sched_entity *se) {}
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
+static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
+						  struct sched_entity *se) {}
+static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
+						  struct sched_entity *se) {}
 #endif
 
 static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -1220,7 +1264,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 */
 	update_curr(cfs_rq);
 	update_cfs_load(cfs_rq, 0);
-	update_entity_load_avg(se);
+	enqueue_entity_load_avg(cfs_rq, se);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
 
@@ -1295,7 +1339,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	update_entity_load_avg(se);
+	dequeue_entity_load_avg(cfs_rq, se);
 
 	update_stats_dequeue(cfs_rq, se);
 	if (flags & DEQUEUE_SLEEP) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 42b6df6..77a3427 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -270,6 +270,8 @@ struct cfs_rq {
 	u64 load_stamp, load_last, load_unacc_exec_time;
 
 	unsigned long load_contribution;
+
+	u64 runnable_load_avg;
 #endif /* CONFIG_SMP */
 #ifdef CONFIG_CFS_BANDWIDTH
 	int runtime_enabled;




* [RFC PATCH 13/14] sched: make __update_entity_runnable_avg() fast
  2012-02-02  1:38 [RFC PATCH 00/14] sched: entity load-tracking re-work Paul Turner
                   ` (9 preceding siblings ...)
  2012-02-02  1:38 ` [RFC PATCH 12/14] sched: update_cfs_shares at period edge Paul Turner
@ 2012-02-02  1:38 ` Paul Turner
  2012-02-06 20:48   ` Peter Zijlstra
  2012-02-02  1:38 ` [RFC PATCH 11/14] sched: refactor update_shares_cpu() -> update_blocked_avgs() Paul Turner
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 68+ messages in thread
From: Paul Turner @ 2012-02-02  1:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: Venki Pallipadi, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

__update_entity_runnable_avg() forms the core of maintaining an entity's
runnable load average.  In this function we charge the accumulated run-time
since the last update and apply the appropriate decay.  In some cases, e.g. a
waking task, this time interval may be much larger than our period unit.

Fortunately we can exploit some properties of our series to perform decay for a
blocked update in constant time and account the contribution for a running
update in essentially-constant* time.

[*]: Any running entity should be receiving updates at the tick, which gives
us a soft limit of 1 jiffy between updates, and we can compute up to a
32-jiffy update in a single pass.
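
For illustration only, a standalone user-space sketch (not part of the patch,
and the names are made up) checking the decomposition the patch relies on,
y^n = (1/2)^(n/32) * y^(n%32), against the naive product:

/* Build with: gcc decay_demo.c -lm */
#include <math.h>
#include <stdio.h>

int main(void)
{
        double y = pow(0.5, 1.0 / 32);  /* y^32 == 1/2 */
        double val = 1024.0, naive, fast;
        int n = 100, i;                 /* e.g. ~100ms worth of decay */

        naive = val;
        for (i = 0; i < n; i++)         /* O(n): what the old loop did */
                naive *= y;

        /* O(1): n/32 halvings via a shift, then one table-style correction */
        fast = val / (double)(1u << (n / 32)) * pow(y, n % 32);

        printf("naive=%.6f fast=%.6f\n", naive, fast);
        return 0;
}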

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Ben Segall <bsegall@google.com>
---
 kernel/sched/fair.c |  100 ++++++++++++++++++++++++++++++++++++++++++---------
 1 files changed, 82 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9e2c9a4..ad524bb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -896,17 +896,77 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
 
 #if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
 /*
+ * We choose a half-life close to 1 scheduling period.
+ * Note: The tables below are dependent on this value.
+ */
+#define LOAD_AVG_PERIOD 32
+
+/* Precomputed fixed inverse multiplies for multiplication by y^n */
+const static u32 runnable_avg_yN_inv[] = {
+	0xffffffff,0xfa83b2db,0xf5257d15,0xefe4b99b,0xeac0c6e7,0xe5b906e7,
+	0xe0ccdeec,0xdbfbb797,0xd744fcca,0xd2a81d91,0xce248c15,0xc9b9bd86,
+	0xc5672a11,0xc12c4cca,0xbd08a39f,0xb8fbaf47,0xb504f333,0xb123f581,
+	0xad583eea,0xa9a15ab4,0xa5fed6a9,0xa2704303,0x9ef53260,0x9b8d39b9,
+	0x9837f051,0x94f4efa8,0x91c3d373,0x8ea4398b,0x8b95c1e3,0x88980e80,
+	0x85aac367,0x82cd8698,
+};
+
+/* Precomputed \Sum y^k { 1<=k<=n } */
+const static u32 runnable_avg_yN_sum[] = {
+	    0, 1002, 1982, 2941, 3880, 4798, 5697, 6576, 7437, 8279, 9103,
+	 9909,10698,11470,12226,12966,13690,14398,15091,15769,16433,17082,
+	17718,18340,18949,19545,20128,20698,21256,21802,22336,22859,23371,
+};
+
+/*
  * Approximate:
  *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
  */
 static __always_inline u64 decay_load(u64 val, int n)
 {
-	for (;n && val;n--) {
-		val *= 4008;
-		val >>= 12;
+	if (!n)
+		return val;
+
+	/*
+	 * As y^PERIOD = 1/2, we can combine
+	 *    y^n = 1/2^(n/PERIOD) * y^(n%PERIOD)
+	 * with a look-up table which covers y^k (k < PERIOD)
+	 *
+	 * to achieve constant time decay_load.
+	 */
+	if (unlikely(n >= LOAD_AVG_PERIOD)) {
+		val >>= n/LOAD_AVG_PERIOD;
+		n %= LOAD_AVG_PERIOD;
 	}
 
-	return val;
+	val *= runnable_avg_yN_inv[n];
+	return SRR(val, 32);
+}
+
+/*
+ * For updates fully spanning n periods, the contribution to runnable
+ * average will be: \Sum 1024*y^n
+ *
+ * We can compute this reasonably efficiently by combining:
+ *   y^PERIOD = 1/2 with precomputed \Sum 1024*y^n {for  n <PERIOD}
+ */
+static u32 __compute_runnable_contrib(int n)
+{
+	u32 contrib = 0;
+
+	if (likely(n <= LOAD_AVG_PERIOD))
+		return runnable_avg_yN_sum[n];
+
+	/* Compute \Sum y^n by combining precomputed values for y^i and \Sum y^j */
+	do {
+		contrib /= 2; /* y^LOAD_AVG_PERIOD = 1/2 */
+		contrib += runnable_avg_yN_sum[LOAD_AVG_PERIOD];
+
+		n -= LOAD_AVG_PERIOD;
+	} while (n > LOAD_AVG_PERIOD);
+
+	contrib = decay_load(contrib, n);
+	return contrib + runnable_avg_yN_sum[n];
 }
 
 /* We can represent the historical contribution to runnable average as the
@@ -940,6 +1000,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 							int runnable)
 {
 	u64 delta;
+	u32 periods, runnable_contrib;
 	int delta_w, decayed = 0;
 
 	delta = now - sa->last_runnable_update;
@@ -963,20 +1024,23 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 		decayed = 1;
 
 		delta_w = 1024 - delta_w;
-		BUG_ON(delta_w > delta);
-		do {
-			if (runnable)
-				sa->runnable_avg_sum += delta_w;
-			sa->runnable_avg_period += delta_w;
-
-			sa->runnable_avg_sum =
-				decay_load(sa->runnable_avg_sum, 1);
-			sa->runnable_avg_period =
-				decay_load(sa->runnable_avg_period, 1);
-
-			delta -= delta_w;
-			delta_w = 1024;
-		} while (delta >= 1024);
+		if (runnable)
+			sa->runnable_avg_sum += delta_w;
+		sa->runnable_avg_period += delta_w;
+
+		delta -= delta_w;
+		periods = delta / 1024;
+		delta %= 1024;
+
+		sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
+						  periods + 1);
+		sa->runnable_avg_period = decay_load(sa->runnable_avg_period,
+						  periods + 1);
+
+		runnable_contrib = __compute_runnable_contrib(periods);
+		if (runnable)
+			sa->runnable_avg_sum += runnable_contrib;
+		sa->runnable_avg_period += runnable_contrib;
 	}
 
 	if (runnable)



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [RFC PATCH 12/14] sched: update_cfs_shares at period edge
  2012-02-02  1:38 [RFC PATCH 00/14] sched: entity load-tracking re-work Paul Turner
                   ` (8 preceding siblings ...)
  2012-02-02  1:38 ` [RFC PATCH 09/14] sched: maintain runnable averages across throttled periods Paul Turner
@ 2012-02-02  1:38 ` Paul Turner
  2012-02-02  1:38 ` [RFC PATCH 13/14] sched: make __update_entity_runnable_avg() fast Paul Turner
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 68+ messages in thread
From: Paul Turner @ 2012-02-02  1:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: Venki Pallipadi, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

Now that our measurement intervals are small (~1ms) we can amortize the posting
of update_shares() to roughly once per period overflow.  This is a large cost
saving for frequently switching tasks.
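
For illustration only, a standalone toy (not the patch's actual mechanism,
which piggybacks on the existing blocked-load update) showing the effect of
amortizing an expensive update to the ~1ms period edges:

#include <stdio.h>

static unsigned long long last_edge;
static int expensive_updates;

/* called on every enqueue/dequeue/tick with a ns timestamp */
static void maybe_update_shares(unsigned long long now_ns)
{
        if ((now_ns >> 20) == (last_edge >> 20))
                return;                 /* still in the same ~1ms period */
        last_edge = now_ns;
        expensive_updates++;            /* stands in for update_cfs_shares() */
}

int main(void)
{
        unsigned long long t;
        int calls = 0;

        /* a task switching every 100us for one second of "wall time" */
        for (t = 0; t < 1000000000ULL; t += 100000) {
                maybe_update_shares(t);
                calls++;
        }

        printf("calls=%d expensive=%d\n", calls, expensive_updates);
        return 0;
}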

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Ben Segall <bsegall@google.com>
---
 kernel/sched/fair.c |   18 ++++++++++--------
 1 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b6ee7a4..9e2c9a4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1141,6 +1141,7 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 	}
 
 	__update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
+	update_cfs_shares(cfs_rq);
 }
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
@@ -1362,9 +1363,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
 	account_entity_enqueue(cfs_rq, se);
-	update_cfs_shares(cfs_rq);
+	enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
 
 	if (flags & ENQUEUE_WAKEUP) {
 		place_entity(cfs_rq, se, 0);
@@ -1437,7 +1437,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
 
 	update_stats_dequeue(cfs_rq, se);
 	if (flags & DEQUEUE_SLEEP) {
@@ -1457,8 +1456,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	if (se != cfs_rq->curr)
 		__dequeue_entity(cfs_rq, se);
-	se->on_rq = 0;
 	account_entity_dequeue(cfs_rq, se);
+	dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
 
 	/*
 	 * Normalize the entity after updating the min_vruntime because the
@@ -1472,7 +1471,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	return_cfs_rq_runtime(cfs_rq);
 
 	update_min_vruntime(cfs_rq);
-	update_cfs_shares(cfs_rq);
+	se->on_rq = 0;
 }
 
 /*
@@ -2488,8 +2487,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_cfs_shares(cfs_rq);
 		update_entity_load_avg(se, 1);
+		update_cfs_rq_blocked_load(cfs_rq, 0);
 	}
 
 	if (!se) {
@@ -2549,8 +2548,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_cfs_shares(cfs_rq);
 		update_entity_load_avg(se, 1);
+		update_cfs_rq_blocked_load(cfs_rq, 0);
 	}
 
 	if (!se) {
@@ -5836,8 +5835,11 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 		se = tg->se[i];
 		/* Propagate contribution to hierarchy */
 		raw_spin_lock_irqsave(&rq->lock, flags);
-		for_each_sched_entity(se)
+		for_each_sched_entity(se) {
 			update_cfs_shares(group_cfs_rq(se));
+			/* update contribution to parent */
+			update_entity_load_avg(se, 1);
+		}
 		raw_spin_unlock_irqrestore(&rq->lock, flags);
 	}
 



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [RFC PATCH 14/14] sched: implement usage tracking
  2012-02-02  1:38 [RFC PATCH 00/14] sched: entity load-tracking re-work Paul Turner
                   ` (11 preceding siblings ...)
  2012-02-02  1:38 ` [RFC PATCH 11/14] sched: refactor update_shares_cpu() -> update_blocked_avgs() Paul Turner
@ 2012-02-02  1:38 ` Paul Turner
  2012-02-16 13:37   ` Peter Zijlstra
                     ` (5 more replies)
  2012-02-02  1:38 ` [RFC PATCH 10/14] sched: replace update_shares weight distribution with per-entity computation Paul Turner
                   ` (5 subsequent siblings)
  18 siblings, 6 replies; 68+ messages in thread
From: Paul Turner @ 2012-02-02  1:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: Venki Pallipadi, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

With the framework for runnable tracking now fully in place, per-entity usage
tracking is a simple and low-overhead addition.
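
For illustration only, a standalone sketch (not kernel code, and it only
approximates the per-period accounting): a task that is always runnable but
on-cpu half the time ends up with a usage average of about half its runnable
average:

#include <stdio.h>

int main(void)
{
        const double y = 0.97857206208770013;   /* y^32 == 1/2 */
        double runnable = 0, usage = 0, period = 0;
        int i;

        for (i = 0; i < 64; i++) {
                int running = i & 1;            /* on cpu every other period */

                runnable = runnable * y + 1024; /* always on the rq */
                usage    = usage * y + (running ? 1024 : 0);
                period   = period * y + 1024;
        }

        printf("runnable/period=%.2f usage/period=%.2f\n",
               runnable / period, usage / period);
        return 0;
}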

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Ben Segall <bsegall@google.com>
---
 include/linux/sched.h |    1 +
 kernel/sched/debug.c  |    3 +++
 kernel/sched/fair.c   |   29 +++++++++++++++++++++++------
 kernel/sched/sched.h  |    4 ++--
 4 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 09b8c45..209185f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1159,6 +1159,7 @@ struct load_weight {
 struct sched_avg {
 	u64 runnable_avg_sum, runnable_avg_period;
 	u64 last_runnable_update, decay_count;
+	u32 usage_avg_sum;
 	unsigned long load_avg_contrib;
 
 	int contributes_blocked_load;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 0911ec6..4d39069 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -93,6 +93,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	P(se->load.weight);
 	P(se->avg.runnable_avg_sum);
 	P(se->avg.runnable_avg_period);
+	P(se->avg.usage_avg_sum);
 	P(se->avg.load_avg_contrib);
 	P(se->avg.decay_count);
 #undef PN
@@ -228,6 +229,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			cfs_rq->tg_runnable_contrib);
 	SEQ_printf(m, "  .%-30s: %d\n", "tg->runnable_avg",
 			atomic_read(&cfs_rq->tg->runnable_avg));
+	SEQ_printf(m, "  .%-30s: %d\n", "tg->usage_avg",
+			atomic_read(&cfs_rq->tg->usage_avg));
 #endif
 
 	print_cfs_group_stats(m, cpu, cfs_rq->tg);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ad524bb..222c2c9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -997,7 +997,8 @@ static u32 __compute_runnable_contrib(int n)
  */
 static __always_inline int __update_entity_runnable_avg(u64 now,
 							struct sched_avg *sa,
-							int runnable)
+							int runnable,
+							int running)
 {
 	u64 delta;
 	u32 periods, runnable_contrib;
@@ -1026,6 +1027,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 		delta_w = 1024 - delta_w;
 		if (runnable)
 			sa->runnable_avg_sum += delta_w;
+		if (running)
+			sa->usage_avg_sum += delta_w;
 		sa->runnable_avg_period += delta_w;
 
 		delta -= delta_w;
@@ -1036,15 +1039,20 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 						  periods + 1);
 		sa->runnable_avg_period = decay_load(sa->runnable_avg_period,
 						  periods + 1);
+		sa->usage_avg_sum = decay_load(sa->usage_avg_sum, periods + 1);
 
 		runnable_contrib = __compute_runnable_contrib(periods);
 		if (runnable)
 			sa->runnable_avg_sum += runnable_contrib;
+		if (running)
+			sa->usage_avg_sum += runnable_contrib;
 		sa->runnable_avg_period += runnable_contrib;
 	}
 
 	if (runnable)
 		sa->runnable_avg_sum += delta;
+	if (running)
+		sa->usage_avg_sum += delta;
 	sa->runnable_avg_period += delta;
 
 	return decayed;
@@ -1081,14 +1089,21 @@ static inline void __update_tg_runnable_avg(struct sched_avg *sa,
 						  struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
-	long contrib;
+	long contrib, usage_contrib;
 
 	contrib = (sa->runnable_avg_sum << 12) / (sa->runnable_avg_period + 1);
 	contrib -= cfs_rq->tg_runnable_contrib;
 
-	if (abs(contrib) > cfs_rq->tg_runnable_contrib/64) {
+	usage_contrib = (sa->usage_avg_sum << 12) / (sa->runnable_avg_period + 1);
+	usage_contrib -= cfs_rq->tg_usage_contrib;
+
+	if ((abs(contrib) > cfs_rq->tg_runnable_contrib/64) ||
+	    (abs(usage_contrib) > cfs_rq->tg_usage_contrib/64)) {
 		atomic_add(contrib, &tg->runnable_avg);
 		cfs_rq->tg_runnable_contrib += contrib;
+
+		atomic_add(usage_contrib, &tg->usage_avg);
+		cfs_rq->tg_usage_contrib += usage_contrib;
 	}
 }
 
@@ -1164,8 +1179,8 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	long contrib_delta;
 
-	if (!__update_entity_runnable_avg(rq_of(cfs_rq)->clock_task, &se->avg,
-					se->on_rq))
+	if (!__update_entity_runnable_avg(cfs_rq_clock_task(cfs_rq), &se->avg,
+					 se->on_rq, cfs_rq->curr == se))
 		return;
 
 	contrib_delta = __update_entity_load_avg_contrib(se);
@@ -1210,7 +1225,8 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 {
-	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
+	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable,
+				     runnable);
 	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
 }
 
@@ -1590,6 +1606,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		 */
 		update_stats_wait_end(cfs_rq, se);
 		__dequeue_entity(cfs_rq, se);
+		update_entity_load_avg(se, 1);
 	}
 
 	update_stats_curr_start(cfs_rq, se);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 77f3297..9602a47 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -117,7 +117,7 @@ struct task_group {
 
 	atomic_t load_weight;
 	atomic64_t load_avg;
-	atomic_t runnable_avg;
+	atomic_t runnable_avg, usage_avg;
 #endif
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -260,7 +260,7 @@ struct cfs_rq {
 	 */
 	unsigned long h_load;
 
-	u32 tg_runnable_contrib;
+	u32 tg_runnable_contrib, tg_usage_contrib;
 	u64 runnable_load_avg, blocked_load_avg;
 	u64 tg_load_contrib;
 	atomic64_t decay_counter, removed_load;



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [RFC PATCH 09/14] sched: maintain runnable averages across throttled periods
  2012-02-02  1:38 [RFC PATCH 00/14] sched: entity load-tracking re-work Paul Turner
                   ` (7 preceding siblings ...)
  2012-02-02  1:38 ` [RFC PATCH 08/14] sched: normalize tg load contributions against runnable time Paul Turner
@ 2012-02-02  1:38 ` Paul Turner
  2012-02-02  1:38 ` [RFC PATCH 12/14] sched: update_cfs_shares at period edge Paul Turner
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 68+ messages in thread
From: Paul Turner @ 2012-02-02  1:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: Venki Pallipadi, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

With bandwidth control, tracked entities may cease execution according to
user-specified bandwidth limits.  Charging this time as either throttled or
blocked, however, is incorrect and would falsely skew in either direction.

What we actually want is for any throttled periods to be "invisible" to
load-tracking as they are removed from the system for that interval and
contribute normally otherwise.

Do this by moderating the progression of time to omit any periods in which the
entity belonged to a throttled hierarchy.
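
For illustration only, a standalone sketch (not kernel code) mirroring the
fields this patch adds; while throttled the task clock appears frozen, and
afterwards the throttled interval is simply invisible:

#include <assert.h>
#include <stdio.h>

struct demo {
        unsigned long long clock_task;          /* rq->clock_task */
        unsigned long long thr_clock_task;      /* timestamp at throttle */
        unsigned long long thr_clock_task_time; /* total time throttled */
        int throttle_count;
};

static unsigned long long demo_clock_task(struct demo *d)
{
        if (d->throttle_count)
                return d->thr_clock_task;
        return d->clock_task - d->thr_clock_task_time;
}

int main(void)
{
        struct demo d = { .clock_task = 100 };

        /* throttle at t=100, stay throttled while 50 units pass */
        d.throttle_count = 1;
        d.thr_clock_task = d.clock_task;
        d.clock_task = 150;
        assert(demo_clock_task(&d) == 100);     /* time appears frozen */

        /* unthrottle: bank the throttled interval, run 10 more units */
        d.throttle_count = 0;
        d.thr_clock_task_time += d.clock_task - d.thr_clock_task;
        d.clock_task = 160;
        assert(demo_clock_task(&d) == 110);     /* the 50 units are omitted */

        printf("ok\n");
        return 0;
}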

Signed-off-by: Paul Turner <pjt@google.com>
---
 kernel/sched/fair.c  |   33 +++++++++++++++++++++++++++------
 kernel/sched/sched.h |    3 ++-
 2 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 803c622..71c7410 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1185,6 +1185,8 @@ static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
 		cfs_rq->blocked_load_avg = 0;
 }
 
+static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
+
 /* Update a sched_entity's runnable average */
 static inline void update_entity_load_avg(struct sched_entity *se,
 					  int update_cfs_rq)
@@ -1213,7 +1215,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
  */
 static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 {
-	u64 now = rq_of(cfs_rq)->clock_task >> 20;
+	u64 now = cfs_rq_clock_task(cfs_rq) >> 20;
 	u64 decays;
 
 	decays = now - cfs_rq->last_decay;
@@ -1820,6 +1822,15 @@ static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
 	return &tg->cfs_bandwidth;
 }
 
+/* rq->clock_task normalized against any time this cfs_rq has spent throttled */
+static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq)
+{
+	if (unlikely(cfs_rq->throttle_count))
+		return cfs_rq->throttled_clock_task;
+
+	return rq_of(cfs_rq)->clock_task - cfs_rq->throttled_clock_task_time;
+}
+
 /* returns 0 on failure to allocate runtime */
 static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
@@ -1970,6 +1981,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 		cfs_rq->load_stamp += delta;
 		cfs_rq->load_last += delta;
 
+		/* adjust cfs_rq_clock_task() */
+		cfs_rq->throttled_clock_task_time += rq->clock_task -
+					     cfs_rq->throttled_clock_task;
+
 		/* update entity weight now that we are on_rq again */
 		update_cfs_shares(cfs_rq);
 	}
@@ -1984,8 +1999,10 @@ static int tg_throttle_down(struct task_group *tg, void *data)
 	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
 
 	/* group is entering throttled state, record last load */
-	if (!cfs_rq->throttle_count)
+	if (!cfs_rq->throttle_count) {
 		update_cfs_load(cfs_rq, 0);
+		cfs_rq->throttled_clock_task = rq->clock_task;
+	}
 	cfs_rq->throttle_count++;
 
 	return 0;
@@ -2000,7 +2017,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 
 	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
 
-	/* account load preceding throttle */
+	/* freeze hierarchy runnable averages while throttled */
 	rcu_read_lock();
 	walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
 	rcu_read_unlock();
@@ -2024,7 +2041,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 		rq->nr_running -= task_delta;
 
 	cfs_rq->throttled = 1;
-	cfs_rq->throttled_timestamp = rq->clock;
+	cfs_rq->throttled_clock = rq->clock;
 	raw_spin_lock(&cfs_b->lock);
 	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
 	raw_spin_unlock(&cfs_b->lock);
@@ -2042,10 +2059,9 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 
 	cfs_rq->throttled = 0;
 	raw_spin_lock(&cfs_b->lock);
-	cfs_b->throttled_time += rq->clock - cfs_rq->throttled_timestamp;
+	cfs_b->throttled_time += rq->clock - cfs_rq->throttled_clock;
 	list_del_rcu(&cfs_rq->throttled_list);
 	raw_spin_unlock(&cfs_b->lock);
-	cfs_rq->throttled_timestamp = 0;
 
 	update_rq_clock(rq);
 	/* update hierarchical throttle state */
@@ -2445,6 +2461,11 @@ void unthrottle_offline_cfs_rqs(struct rq *rq)
 }
 
 #else /* CONFIG_CFS_BANDWIDTH */
+static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq)
+{
+	return rq_of(cfs_rq)->clock_task;
+}
+
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
 				     unsigned long delta_exec) {}
 static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 57cc227..a823ca4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -284,7 +284,8 @@ struct cfs_rq {
 	u64 runtime_expires;
 	s64 runtime_remaining;
 
-	u64 throttled_timestamp;
+	u64 throttled_clock, throttled_clock_task;
+	u64 throttled_clock_task_time;
 	int throttled, throttle_count;
 	struct list_head throttled_list;
 #endif /* CONFIG_CFS_BANDWIDTH */



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [RFC PATCH 11/14] sched: refactor update_shares_cpu() -> update_blocked_avgs()
  2012-02-02  1:38 [RFC PATCH 00/14] sched: entity load-tracking re-work Paul Turner
                   ` (10 preceding siblings ...)
  2012-02-02  1:38 ` [RFC PATCH 13/14] sched: make __update_entity_runnable_avg() fast Paul Turner
@ 2012-02-02  1:38 ` Paul Turner
  2012-02-02  1:38 ` [RFC PATCH 14/14] sched: implement usage tracking Paul Turner
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 68+ messages in thread
From: Paul Turner @ 2012-02-02  1:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: Venki Pallipadi, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

Now that running entities maintain their own load-averages, the work we must do
in update_shares() is largely restricted to the periodic decay of blocked
entities.  This allows us to be a little less pessimistic regarding our
occupancy on rq->lock and the associated rq->clock updates required.

We also no longer use the instantaneous load to calculate group-entity
load-weight.  This was previously required as a compromise since migrated load
would remain in the parenting average while it decayed; however, it leads to
over-allocation of shares.

Signed-off-by: Paul Turner <pjt@google.com>
---
 kernel/sched/fair.c |   57 ++++++++++++++++++++++++++++-----------------------
 1 files changed, 31 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 46b3f98..5b6ee7a4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -823,7 +823,7 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
 	 */
 	tg_weight = atomic64_read(&tg->load_avg);
 	tg_weight -= cfs_rq->tg_load_contrib;
-	tg_weight += cfs_rq->load.weight;
+	tg_weight += cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
 
 	return tg_weight;
 }
@@ -833,7 +833,7 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
 	long tg_weight, load, shares;
 
 	tg_weight = calc_tg_weight(tg, cfs_rq);
-	load = cfs_rq->load.weight;
+	load = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
 
 	shares = (tg->shares * load);
 	if (tg_weight)
@@ -3559,21 +3559,16 @@ out:
 /*
  * update tg->load_weight by folding this cpu's load_avg
  */
-static int update_shares_cpu(struct task_group *tg, int cpu)
+static void __update_blocked_averages_cpu(struct task_group *tg, int cpu)
 {
-	struct sched_entity *se;
-	struct cfs_rq *cfs_rq;
-	unsigned long flags;
+	struct sched_entity *se = tg->se[cpu];
+	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];
 	struct rq *rq;
 
+	/* throttled entities do not contribute to load */
+	if (throttled_hierarchy(cfs_rq))
+		return;
 
-	rq = cpu_rq(cpu);
-	se = tg->se[cpu];
-	cfs_rq = tg->cfs_rq[cpu];
-
-	raw_spin_lock_irqsave(&rq->lock, flags);
-
-	update_rq_clock(rq);
 	update_cfs_rq_blocked_load(cfs_rq, 1);
 
 	if (se) {
@@ -3586,29 +3581,39 @@ static int update_shares_cpu(struct task_group *tg, int cpu)
 		else
 			list_del_leaf_cfs_rq(cfs_rq);
 	}
-
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
-
-	return 0;
 }
 
-static void update_shares(int cpu)
+static void update_blocked_averages(int cpu)
 {
-	struct cfs_rq *cfs_rq;
 	struct rq *rq = cpu_rq(cpu);
+	struct cfs_rq *cfs_rq;
+
+	unsigned long flags;
+	int num_updates = 0;
 
 	rcu_read_lock();
+	raw_spin_lock_irqsave(&rq->lock, flags);
+	update_rq_clock(rq);
 	/*
 	 * Iterates the task_group tree in a bottom up fashion, see
 	 * list_add_leaf_cfs_rq() for details.
 	 */
 	for_each_leaf_cfs_rq(rq, cfs_rq) {
-		/* throttled entities do not contribute to load */
-		if (throttled_hierarchy(cfs_rq))
-			continue;
+		__update_blocked_averages_cpu(cfs_rq->tg, rq->cpu);
 
-		update_shares_cpu(cfs_rq->tg, cpu);
+		/*
+		 * Periodically release the lock so that a cfs_rq with many
+		 * children cannot hold it for an arbitrary period of time.
+		 */
+		if (num_updates++ % 20 == 0) {
+			raw_spin_unlock_irqrestore(&rq->lock, flags);
+			cpu_relax();
+			raw_spin_lock_irqsave(&rq->lock, flags);
+			update_rq_clock(rq);
+		}
 	}
+
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
 	rcu_read_unlock();
 }
 
@@ -3689,7 +3694,7 @@ load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
 	return max_load_move - rem_load_move;
 }
 #else
-static inline void update_shares(int cpu)
+static inline void update_blocked_averages(int cpu)
 {
 }
 
@@ -4917,7 +4922,7 @@ void idle_balance(int this_cpu, struct rq *this_rq)
 	 */
 	raw_spin_unlock(&this_rq->lock);
 
-	update_shares(this_cpu);
+	update_blocked_averages(this_cpu);
 	rcu_read_lock();
 	for_each_domain(this_cpu, sd) {
 		unsigned long interval;
@@ -5258,7 +5263,7 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)
 	int update_next_balance = 0;
 	int need_serialize;
 
-	update_shares(cpu);
+	update_blocked_averages(cpu);
 
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [RFC PATCH 10/14] sched: replace update_shares weight distribution with per-entity computation
  2012-02-02  1:38 [RFC PATCH 00/14] sched: entity load-tracking re-work Paul Turner
                   ` (12 preceding siblings ...)
  2012-02-02  1:38 ` [RFC PATCH 14/14] sched: implement usage tracking Paul Turner
@ 2012-02-02  1:38 ` Paul Turner
  2012-02-06 20:02 ` [RFC PATCH 00/14] sched: entity load-tracking re-work Peter Zijlstra
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 68+ messages in thread
From: Paul Turner @ 2012-02-02  1:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: Venki Pallipadi, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

Now that the machinery is in place to compute contributed load in a bottom-up
fashion, replace the shares distribution code within update_shares()
accordingly.

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Ben Segall <bsegall@google.com>
---
 kernel/sched/debug.c |    8 ---
 kernel/sched/fair.c  |  150 ++++++--------------------------------------------
 kernel/sched/sched.h |   13 ----
 3 files changed, 18 insertions(+), 153 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 8d87796..0911ec6 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -216,14 +216,6 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 #ifdef CONFIG_SMP
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "load_avg",
-			SPLIT_NS(cfs_rq->load_avg));
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "load_period",
-			SPLIT_NS(cfs_rq->load_period));
-	SEQ_printf(m, "  .%-30s: %ld\n", "load_contrib",
-			cfs_rq->load_contribution);
-	SEQ_printf(m, "  .%-30s: %d\n", "load_tg",
-			atomic_read(&cfs_rq->tg->load_weight));
 	SEQ_printf(m, "  .%-30s: %lld\n", "runnable_load_avg",
 			cfs_rq->runnable_load_avg);
 	SEQ_printf(m, "  .%-30s: %lld\n", "blocked_load_avg",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 71c7410..46b3f98 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -655,9 +655,6 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	return calc_delta_fair(sched_slice(cfs_rq, se), se);
 }
 
-static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update);
-static void update_cfs_shares(struct cfs_rq *cfs_rq);
-
 /*
  * Update the current task's runtime statistics. Skip current tasks that
  * are not in our scheduling class.
@@ -677,10 +674,6 @@ __update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
 
 	curr->vruntime += delta_exec_weighted;
 	update_min_vruntime(cfs_rq);
-
-#if defined CONFIG_SMP && defined CONFIG_FAIR_GROUP_SCHED
-	cfs_rq->load_unacc_exec_time += delta_exec;
-#endif
 }
 
 static void update_curr(struct cfs_rq *cfs_rq)
@@ -818,72 +811,7 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-/* we need this in update_cfs_load and load-balance functions below */
-static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 # ifdef CONFIG_SMP
-static void update_cfs_rq_load_contribution(struct cfs_rq *cfs_rq,
-					    int global_update)
-{
-	struct task_group *tg = cfs_rq->tg;
-	long load_avg;
-
-	load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
-	load_avg -= cfs_rq->load_contribution;
-
-	if (global_update || abs(load_avg) > cfs_rq->load_contribution / 8) {
-		atomic_add(load_avg, &tg->load_weight);
-		cfs_rq->load_contribution += load_avg;
-	}
-}
-
-static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
-{
-	u64 period = sysctl_sched_shares_window;
-	u64 now, delta;
-	unsigned long load = cfs_rq->load.weight;
-
-	if (cfs_rq->tg == &root_task_group || throttled_hierarchy(cfs_rq))
-		return;
-
-	now = rq_of(cfs_rq)->clock_task;
-	delta = now - cfs_rq->load_stamp;
-
-	/* truncate load history at 4 idle periods */
-	if (cfs_rq->load_stamp > cfs_rq->load_last &&
-	    now - cfs_rq->load_last > 4 * period) {
-		cfs_rq->load_period = 0;
-		cfs_rq->load_avg = 0;
-		delta = period - 1;
-	}
-
-	cfs_rq->load_stamp = now;
-	cfs_rq->load_unacc_exec_time = 0;
-	cfs_rq->load_period += delta;
-	if (load) {
-		cfs_rq->load_last = now;
-		cfs_rq->load_avg += delta * load;
-	}
-
-	/* consider updating load contribution on each fold or truncate */
-	if (global_update || cfs_rq->load_period > period
-	    || !cfs_rq->load_period)
-		update_cfs_rq_load_contribution(cfs_rq, global_update);
-
-	while (cfs_rq->load_period > period) {
-		/*
-		 * Inline assembly required to prevent the compiler
-		 * optimising this loop into a divmod call.
-		 * See __iter_div_u64_rem() for another example of this.
-		 */
-		asm("" : "+rm" (cfs_rq->load_period));
-		cfs_rq->load_period /= 2;
-		cfs_rq->load_avg /= 2;
-	}
-
-	if (!cfs_rq->curr && !cfs_rq->nr_running && !cfs_rq->load_avg)
-		list_del_leaf_cfs_rq(cfs_rq);
-}
-
 static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
 {
 	long tg_weight;
@@ -893,8 +821,8 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
 	 * to gain a more accurate current total weight. See
 	 * update_cfs_rq_load_contribution().
 	 */
-	tg_weight = atomic_read(&tg->load_weight);
-	tg_weight -= cfs_rq->load_contribution;
+	tg_weight = atomic64_read(&tg->load_avg);
+	tg_weight -= cfs_rq->tg_load_contrib;
 	tg_weight += cfs_rq->load.weight;
 
 	return tg_weight;
@@ -918,27 +846,11 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
 
 	return shares;
 }
-
-static void update_entity_shares_tick(struct cfs_rq *cfs_rq)
-{
-	if (cfs_rq->load_unacc_exec_time > sysctl_sched_shares_window) {
-		update_cfs_load(cfs_rq, 0);
-		update_cfs_shares(cfs_rq);
-	}
-}
 # else /* CONFIG_SMP */
-static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
-{
-}
-
 static inline long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
 {
 	return tg->shares;
 }
-
-static inline void update_entity_shares_tick(struct cfs_rq *cfs_rq)
-{
-}
 # endif /* CONFIG_SMP */
 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 			    unsigned long weight)
@@ -956,6 +868,8 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 		account_entity_enqueue(cfs_rq, se);
 }
 
+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
+
 static void update_cfs_shares(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg;
@@ -975,17 +889,9 @@ static void update_cfs_shares(struct cfs_rq *cfs_rq)
 	reweight_entity(cfs_rq_of(se), se, shares);
 }
 #else /* CONFIG_FAIR_GROUP_SCHED */
-static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
-{
-}
-
 static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
 {
 }
-
-static inline void update_entity_shares_tick(struct cfs_rq *cfs_rq)
-{
-}
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 #if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
@@ -1456,7 +1362,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	update_cfs_load(cfs_rq, 0);
 	enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
@@ -1553,7 +1458,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (se != cfs_rq->curr)
 		__dequeue_entity(cfs_rq, se);
 	se->on_rq = 0;
-	update_cfs_load(cfs_rq, 0);
 	account_entity_dequeue(cfs_rq, se);
 
 	/*
@@ -1726,11 +1630,6 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	}
 #endif
 
-	/*
-	 * Update share accounting for long-running entities.
-	 */
-	update_entity_shares_tick(cfs_rq);
-
 #ifdef CONFIG_SCHED_HRTICK
 	/*
 	 * queued ticks are scheduled to match the slice, so don't bother
@@ -1975,18 +1874,9 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 	cfs_rq->throttle_count--;
 #ifdef CONFIG_SMP
 	if (!cfs_rq->throttle_count) {
-		u64 delta = rq->clock_task - cfs_rq->load_stamp;
-
-		/* leaving throttled state, advance shares averaging windows */
-		cfs_rq->load_stamp += delta;
-		cfs_rq->load_last += delta;
-
 		/* adjust cfs_rq_clock_task() */
 		cfs_rq->throttled_clock_task_time += rq->clock_task -
 					     cfs_rq->throttled_clock_task;
-
-		/* update entity weight now that we are on_rq again */
-		update_cfs_shares(cfs_rq);
 	}
 #endif
 
@@ -1998,11 +1888,9 @@ static int tg_throttle_down(struct task_group *tg, void *data)
 	struct rq *rq = data;
 	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
 
-	/* group is entering throttled state, record last load */
-	if (!cfs_rq->throttle_count) {
-		update_cfs_load(cfs_rq, 0);
+	/* group is entering throttled state, stop time */
+	if (!cfs_rq->throttle_count)
 		cfs_rq->throttled_clock_task = rq->clock_task;
-	}
 	cfs_rq->throttle_count++;
 
 	return 0;
@@ -2600,7 +2488,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 		update_entity_load_avg(se, 1);
 	}
@@ -2662,7 +2549,6 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 		update_entity_load_avg(se, 1);
 	}
@@ -3675,27 +3561,31 @@ out:
  */
 static int update_shares_cpu(struct task_group *tg, int cpu)
 {
+	struct sched_entity *se;
 	struct cfs_rq *cfs_rq;
 	unsigned long flags;
 	struct rq *rq;
 
-	if (!tg->se[cpu])
-		return 0;
 
 	rq = cpu_rq(cpu);
+	se = tg->se[cpu];
 	cfs_rq = tg->cfs_rq[cpu];
 
 	raw_spin_lock_irqsave(&rq->lock, flags);
 
 	update_rq_clock(rq);
-	update_cfs_load(cfs_rq, 1);
 	update_cfs_rq_blocked_load(cfs_rq, 1);
 
-	/*
-	 * We need to update shares after updating tg->load_weight in
-	 * order to adjust the weight of groups with long running tasks.
-	 */
-	update_cfs_shares(cfs_rq);
+	if (se) {
+		/*
+		 * We can pivot on the runnable average decaying to zero for list
+		 * removal since the parent average will always be >= child.
+		 */
+		if (se->avg.runnable_avg_sum)
+			update_cfs_shares(cfs_rq);
+		else
+			list_del_leaf_cfs_rq(cfs_rq);
+	}
 
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
 
@@ -5895,10 +5785,6 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 
 	cfs_rq->tg = tg;
 	cfs_rq->rq = rq;
-#ifdef CONFIG_SMP
-	/* allow initial update_cfs_load() to truncate */
-	cfs_rq->load_stamp = 1;
-#endif
 	init_cfs_rq_runtime(cfs_rq);
 
 	tg->cfs_rq[cpu] = cfs_rq;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a823ca4..77f3297 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -260,19 +260,6 @@ struct cfs_rq {
 	 */
 	unsigned long h_load;
 
-	/*
-	 * Maintaining per-cpu shares distribution for group scheduling
-	 *
-	 * load_stamp is the last time we updated the load average
-	 * load_last is the last time we updated the load average and saw load
-	 * load_unacc_exec_time is currently unaccounted execution time
-	 */
-	u64 load_avg;
-	u64 load_period;
-	u64 load_stamp, load_last, load_unacc_exec_time;
-
-	unsigned long load_contribution;
-
 	u32 tg_runnable_contrib;
 	u64 runnable_load_avg, blocked_load_avg;
 	u64 tg_load_contrib;



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 00/14] sched: entity load-tracking re-work
  2012-02-02  1:38 [RFC PATCH 00/14] sched: entity load-tracking re-work Paul Turner
                   ` (13 preceding siblings ...)
  2012-02-02  1:38 ` [RFC PATCH 10/14] sched: replace update_shares weight distribution with per-entity computation Paul Turner
@ 2012-02-06 20:02 ` Peter Zijlstra
  2012-02-17  9:07 ` Nikunj A Dadhania
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2012-02-06 20:02 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
>                            Mailer: 
> StGit/0.15
> 
It's broken, every email has the exact same timestamp! This means emails
are sorted by receive order, which doesn't match actual patch order.

Maybe the new and improved stgit-1.6 has this fixed.

/me goes read more


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 13/14] sched: make __update_entity_runnable_avg() fast
  2012-02-02  1:38 ` [RFC PATCH 13/14] sched: make __update_entity_runnable_avg() fast Paul Turner
@ 2012-02-06 20:48   ` Peter Zijlstra
  2012-02-17 12:49     ` Paul Turner
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2012-02-06 20:48 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
> +/* Precomputed fixed inverse multiplies for multiplication by y^n */
> +const static u32 runnable_avg_yN_inv[] = {
> +       0xffffffff,0xfa83b2db,0xf5257d15,0xefe4b99b,0xeac0c6e7,0xe5b906e7,
> +       0xe0ccdeec,0xdbfbb797,0xd744fcca,0xd2a81d91,0xce248c15,0xc9b9bd86,
> +       0xc5672a11,0xc12c4cca,0xbd08a39f,0xb8fbaf47,0xb504f333,0xb123f581,
> +       0xad583eea,0xa9a15ab4,0xa5fed6a9,0xa2704303,0x9ef53260,0x9b8d39b9,
> +       0x9837f051,0x94f4efa8,0x91c3d373,0x8ea4398b,0x8b95c1e3,0x88980e80,
> +       0x85aac367,0x82cd8698,
> +};

I wonder if modern Intel isn't at the point where computing this thing
is cheaper than the cacheline miss. You can compute y^n in O(log n) time
and with n < 32 that's 5 multiplications (see fixed_power_int). Add to
that the division.
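
For reference, an untested standalone sketch of that approach, modelled on
fixed_power_int(); the names here are made up:

#include <stdint.h>
#include <stdio.h>

/* y^n in O(log n) 0.32 fixed-point multiplies */
static uint32_t fixed_power_u32(uint32_t base, unsigned int n)
{
        uint64_t result = 0xffffffffULL;        /* ~1.0 in 0.32 fixed point */

        while (n) {
                if (n & 1)
                        result = (result * base) >> 32;
                n >>= 1;
                base = ((uint64_t)base * base) >> 32;
        }
        return (uint32_t)result;
}

int main(void)
{
        uint32_t y = 0xfa83b2db;        /* ~0.5^(1/32), runnable_avg_yN_inv[1] */

        /* y^32 should land very close to 0.5, i.e. ~0x80000000 */
        printf("y^32 = %#x\n", (unsigned)fixed_power_u32(y, 32));
        return 0;
}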

Of course there are platforms, ARM?, where the reverse is likely true. Bugger
that.

> +/* Precomputed \Sum y^k { 1<=k<=n } */
> +const static u32 runnable_avg_yN_sum[] = {
> +           0, 1002, 1982, 2941, 3880, 4798, 5697, 6576, 7437, 8279, 9103,
> +        9909,10698,11470,12226,12966,13690,14398,15091,15769,16433,17082,
> +       17718,18340,18949,19545,20128,20698,21256,21802,22336,22859,23371,
> +}; 

Right, can't see a fast way to compute this..

The asymmetry in the tables annoys me though: 32 vs 33 entries, hex vs
dec :-)
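
For reference, an untested generator sketch that should reproduce both tables,
modulo the exact rounding convention originally used:

#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
        double y = pow(0.5, 1.0 / 32);  /* y^32 == 1/2 */
        uint32_t sum = 0;
        int n;

        for (n = 0; n <= 32; n++) {
                if (n < 32) {
                        /* y^n as a 0.32 fixed-point inverse multiplier */
                        uint32_t inv = n ? (uint32_t)(pow(y, n) * 4294967296.0)
                                         : 0xffffffffu;
                        printf("inv[%2d] = 0x%08x\n", n, (unsigned)inv);
                }
                /* \Sum 1024*y^k, 1 <= k <= n, accumulated from truncated terms */
                printf("sum[%2d] = %u\n", n, (unsigned)sum);
                sum += (uint32_t)(1024 * pow(y, n + 1));
        }
        return 0;
}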


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 08/14] sched: normalize tg load contributions against runnable time
  2012-02-02  1:38 ` [RFC PATCH 08/14] sched: normalize tg load contributions against runnable time Paul Turner
@ 2012-02-15 23:36   ` Peter Zijlstra
  2012-02-17 12:32     ` Paul Turner
  2012-02-17 12:34     ` Peter Zijlstra
  2012-02-15 23:38   ` Peter Zijlstra
  1 sibling, 2 replies; 68+ messages in thread
From: Peter Zijlstra @ 2012-02-15 23:36 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
> Entities of equal weight should receive equitable distribution of cpu time.
> This is challenging in the case of a task_group's shares as execution may be
> occurring on multiple cpus simultaneously.
> 
> To handle this we divide up the shares into weights proportionate with the load
> on each cfs_rq.  This does not however, account for the fact that the sum of
> the parts may be less than one cpu and so we need to normalize:
>   load(tg) = min(runnable_avg(tg), 1) * tg->shares
> Where runnable_avg is the aggregate time in which the task_group had runnable
> children.


>  static inline void __update_group_entity_contrib(struct sched_entity *se)
>  {
>         struct cfs_rq *cfs_rq = group_cfs_rq(se);
>         struct task_group *tg = cfs_rq->tg;
> +       int runnable_avg;
>  
>         se->avg.load_avg_contrib = (cfs_rq->tg_load_contrib * tg->shares);
>         se->avg.load_avg_contrib /= atomic64_read(&tg->load_avg) + 1;
> +
> +       /*
> +        * Unlike a task-entity, a group entity may be using >=1 cpu globally.
> +        * However, in the case that it's using <1 cpu we need to form a
> +        * correction term so that we contribute the same load as a task of
> +        * equal weight. (Global runnable time is taken as a fraction over 2^12.)
> +        */
> +       runnable_avg = atomic_read(&tg->runnable_avg);
> +       if (runnable_avg < (1<<12)) {
> +               se->avg.load_avg_contrib *= runnable_avg;
> +               se->avg.load_avg_contrib /= (1<<12);
> +       }
>  } 

This seems weird, and the comments don't explain anything.

Ah... you can count runnable multiple times (on each cpu); this also
means that the number you're using (when below 1) can still be utter
crap.

Neither the comment nor the changelog mentions this; it should, and it should
also mention why it doesn't matter (does it?).

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 01/14] sched: track the runnable average on a per-task entitiy basis
  2012-02-02  1:38 ` [RFC PATCH 01/14] sched: track the runnable average on a per-task entitiy basis Paul Turner
@ 2012-02-15 23:37   ` Peter Zijlstra
  2012-02-17 11:43     ` Paul Turner
  2012-02-16 13:27   ` Peter Zijlstra
  1 sibling, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2012-02-15 23:37 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
> +       /*
> +        * Use 1024us as the unit of measurement since it's a reasonable
> +        * approximation of 1ms and fast to compute.
> +        */
> +       delta >>= 10;

ns -> us?  The text talks about ms, which is slightly confusing.

> +       if (!delta)
> +               return 0;
> +       sa->last_runnable_update = now;
> +
> +       delta_w = sa->runnable_avg_period % 1024;

so delta_w is the chunk of this period we already accounted.

> +       if (delta + delta_w >= 1024) {

if delta pushes us over 1024*1024 ns (~1ms) we roll a window.

> +               /* period roll-over */
> +               decayed = 1;
> +
> +               delta_w = 1024 - delta_w;

The distance we need to reach the next window.

> +               BUG_ON(delta_w > delta); 

Somehow reading this code took forever; this suggests it needs clarification,
either through better variable names or more comments. Could also mean
I'm a moron and should get more sleep or so :-)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 08/14] sched: normalize tg load contributions against runnable time
  2012-02-02  1:38 ` [RFC PATCH 08/14] sched: normalize tg load contributions against runnable time Paul Turner
  2012-02-15 23:36   ` Peter Zijlstra
@ 2012-02-15 23:38   ` Peter Zijlstra
  1 sibling, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2012-02-15 23:38 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
> Signed-off-by: Paul Turner <pjt@google.com>
> Signed-off-by: Ben Segall <bsegall@google.com>. 

Ben being the last sob would suggest Ben being the one passing the patch
on, and yet I'm getting it from pjt. Also note the trailing dot on Ben's
line.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 04/14] sched: maintain the load contribution of blocked entities
  2012-02-02  1:38 ` [RFC PATCH 04/14] sched: maintain the load contribution of blocked entities Paul Turner
@ 2012-02-16 12:25   ` Peter Zijlstra
  2012-02-17 11:53     ` Paul Turner
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2012-02-16 12:25 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
> +static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
> +                                                long load_contrib)
> +{
> +       if (likely(load_contrib < cfs_rq->blocked_load_avg))
> +               cfs_rq->blocked_load_avg -= load_contrib;
> +       else
> +               cfs_rq->blocked_load_avg = 0;
> +}
> +
>  /* Update a sched_entity's runnable average */
> -static inline void update_entity_load_avg(struct sched_entity *se)
> +static inline void update_entity_load_avg(struct sched_entity *se,
> +                                         int update_cfs_rq)
>  {
>         struct cfs_rq *cfs_rq = cfs_rq_of(se);
>         long contrib_delta;
> @@ -1106,8 +1130,34 @@ static inline void update_entity_load_avg(struct sched_entity *se)
>                 return;
>  
>         contrib_delta = __update_entity_load_avg_contrib(se);
> +
> +       if (!update_cfs_rq)
> +               return;
> +
>         if (se->on_rq)
>                 cfs_rq->runnable_load_avg += contrib_delta;
> +       else
> +               subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
> +} 

So that last bit is add_blocked_load_contrib(), right? 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 01/14] sched: track the runnable average on a per-task entitiy basis
  2012-02-02  1:38 ` [RFC PATCH 01/14] sched: track the runnable average on a per-task entitiy basis Paul Turner
  2012-02-15 23:37   ` Peter Zijlstra
@ 2012-02-16 13:27   ` Peter Zijlstra
  2012-02-17 11:44     ` Paul Turner
  1 sibling, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2012-02-16 13:27 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
> +static __always_inline u64 decay_load(u64 val, int n)
> +{
> +       for (;n && val;n--) {
> +               val *= 4008;
> +               val >>= 12;
> +       }
> +
> +       return val;
> +}

> +                       sa->runnable_avg_sum =
> +                               decay_load(sa->runnable_avg_sum, 1);
> +                       sa->runnable_avg_period =
> +                               decay_load(sa->runnable_avg_period, 1); 

Since all you ever seem to do is:

  x = decay(x, n);

and frequently run over the line limits, it might make sense to either
introduce a CPP helper or make the first argument a pointer and ditch
the return value so we end up with something like:

  decay(&x, n);



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 14/14] sched: implement usage tracking
  2012-02-02  1:38 ` [RFC PATCH 14/14] sched: implement usage tracking Paul Turner
@ 2012-02-16 13:37   ` Peter Zijlstra
  2012-02-17 10:54     ` Paul Turner
  2012-02-16 16:58   ` Peter Zijlstra
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2012-02-16 13:37 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
>  static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
>  {
> -       __update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
> +       __update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable,
> +                                    runnable); 

It's not immediately obvious why we use @runnable for @running.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 05/14] sched: account for blocked load waking back up
  2012-02-02  1:38 ` [RFC PATCH 05/14] sched: account for blocked load waking back up Paul Turner
@ 2012-02-16 15:57   ` Peter Zijlstra
  2012-02-17 13:00     ` Paul Turner
  2012-02-16 16:02   ` Peter Zijlstra
  1 sibling, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2012-02-16 15:57 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
> +               if (se->avg.decay_count) {
> +                       /*
> +                        * In a wake-up migration we have to approximate the
> +                        * time sleeping.
> +                        */
> +                       se->avg.last_runnable_update -= (-se->avg.decay_count)
> +                                                       << 20;
> +                       update_entity_load_avg(se, 0);
> +               } 

That comment wants more why and how. I think I remember pjt telling me
about this last week, but those are already vague memories; I'll be sure
to be completely clueless in another week.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 05/14] sched: account for blocked load waking back up
  2012-02-02  1:38 ` [RFC PATCH 05/14] sched: account for blocked load waking back up Paul Turner
  2012-02-16 15:57   ` Peter Zijlstra
@ 2012-02-16 16:02   ` Peter Zijlstra
  2012-02-17 11:39     ` Paul Turner
  1 sibling, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2012-02-16 16:02 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:

>         u64 last_runnable_update, decay_count;

> +       /* we track migrations using entity decay_count == 0 */
> +       if (unlikely(se->avg.decay_count <= 0)) { 

A !sleep dequeue isn't migrate-only; also, <= 0 on an unsigned is weird.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 14/14] sched: implement usage tracking
  2012-02-02  1:38 ` [RFC PATCH 14/14] sched: implement usage tracking Paul Turner
  2012-02-16 13:37   ` Peter Zijlstra
@ 2012-02-16 16:58   ` Peter Zijlstra
  2012-02-17 11:32     ` Paul Turner
  2012-02-29 10:37   ` sched per task ARM fix Pantelis Antoniou
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2012-02-16 16:58 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
>  struct sched_avg {
>         u64 runnable_avg_sum, runnable_avg_period;
>         u64 last_runnable_update, decay_count;
> +       u32 usage_avg_sum;

4 byte hole

>         unsigned long load_avg_contrib;
>  
>         int contributes_blocked_load; 
>};


A) Running/Runnable
    - uses last_runnable_update to track
        runnable_avg_sum, usage_avg_sum and runnable_avg_period

B) Blocked
    - uses decay_count to keep track of sleeptime

C) 'Migrate'
    - uses contributes_blocked_load to check if we actually did migrate?

So there are a number of things I don't get: why have
remove_task_load_avg_async() in set_task_rq()? enqueue_entity_load_avg()
can do the __synchronize_entity_decay() thing (and actually does) just
fine, right?

Similarly, why not use last_runnable_update (as last_update) to track
both A and B since a task cannot be in both states at the same time
anyway?  If you always update the timestamp you can use it to either
track runnable/usage or decay for sleep.

That would get rid of decay_count and contributes_blocked_load, no?

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 06/14] sched: aggregate total task_group load
  2012-02-02  1:38 ` [RFC PATCH 06/14] sched: aggregate total task_group load Paul Turner
@ 2012-02-17  4:41   ` Nikunj A Dadhania
  2012-02-17 10:52     ` Paul Turner
  0 siblings, 1 reply; 68+ messages in thread
From: Nikunj A Dadhania @ 2012-02-17  4:41 UTC (permalink / raw)
  To: Paul Turner, linux-kernel
  Cc: Venki Pallipadi, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

On Wed, 01 Feb 2012 17:38:26 -0800, Paul Turner <pjt@google.com> wrote:
> +static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
> +						 int force_update)
> +{
> +	struct task_group *tg = cfs_rq->tg;
> +	s64 tg_contrib;
> +
> +	tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
> +	tg_contrib -= cfs_rq->tg_load_contrib;
> +
> +	if (force_update || abs64(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
>
Not obvious to me where this 8 is coming from?

Regards
Nikunj


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 00/14] sched: entity load-tracking re-work
  2012-02-02  1:38 [RFC PATCH 00/14] sched: entity load-tracking re-work Paul Turner
                   ` (14 preceding siblings ...)
  2012-02-06 20:02 ` [RFC PATCH 00/14] sched: entity load-tracking re-work Peter Zijlstra
@ 2012-02-17  9:07 ` Nikunj A Dadhania
  2012-02-17 10:48   ` Paul Turner
  2012-02-20 17:01 ` Peter Zijlstra
                   ` (2 subsequent siblings)
  18 siblings, 1 reply; 68+ messages in thread
From: Nikunj A Dadhania @ 2012-02-17  9:07 UTC (permalink / raw)
  To: Paul Turner, linux-kernel
  Cc: Venki Pallipadi, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

[-- Attachment #1: Type: text/plain, Size: 2372 bytes --]

On Wed, 01 Feb 2012 17:38:26 -0800, Paul Turner <pjt@google.com> wrote:
> Hi all,
> 
> The attached series is an RFC on implementing load tracking at the entity
> instead of cfs_rq level. This results in a bottom-up load-computation in which
> entities contribute to their parents load, as opposed to the current top-down
> where the parent averages its children.  In particular this allows us to
> correctly migrate load with their accompanying entities and provides the
> necessary inputs for intelligent load-balancing and power-management.
> 
> It was previously well tested and stable, but that was on v3.1-; there's been
> some fairly extensive changes in the wake-up path since so apologies if anything
> was broken in the rebase.Note also, since this is also an RFC on the approach I
> have not yet de-linted the various CONFIG combinations for introduced compiler
> errors.
> 

I gave this series a quick run, and it seems that fairness across
taskgroups is broken with it.

Test setup:
Machine: IBM xSeries with Intel(R) Xeon(R) X5570 2.93GHz CPU with 8
cores, 64GB RAM, 16 cpus.

Create 3 taskgroups: fair16, fair32 and fair48 having 16, 32 and 48
cpu-hog tasks respectively.  They have equal shares (default 1024), so
they should consume roughly the same time.

120secs run 1:
Time consumed by fair16 cgroup:  712912 Tasks: 16
Time consumed by fair32 cgroup:  650977 Tasks: 32
Time consumed by fair48 cgroup:  575681 Tasks: 48

120secs run 2:
Time consumed by fair16 cgroup:  686295 Tasks: 16
Time consumed by fair32 cgroup:  643474 Tasks: 32
Time consumed by fair48 cgroup:  611415 Tasks: 48

600secs run 1:
Time consumed by fair16 cgroup:  4109678 Tasks: 16
Time consumed by fair32 cgroup:  1743983 Tasks: 32
Time consumed by fair48 cgroup:  3759826 Tasks: 48

600secs run 2:
Time consumed by fair16 cgroup:  3893365 Tasks: 16
Time consumed by fair32 cgroup:  3028280 Tasks: 32
Time consumed by fair48 cgroup:  2692001 Tasks: 48

As you can see there is a lot of variance in the above results.

wo patches
120secs run 1:
Time consumed by fair16 cgroup:  667644 Tasks: 16
Time consumed by fair32 cgroup:  653771 Tasks: 32
Time consumed by fair48 cgroup:  624915 Tasks: 48

600secs run 1:
Time consumed by fair16 cgroup:  3278425 Tasks: 16
Time consumed by fair32 cgroup:  3140335 Tasks: 32
Time consumed by fair48 cgroup:  3198817 Tasks: 48

Regards
Nikunj


[-- Attachment #2: Fairness Script --]
[-- Type: application/x-sh, Size: 793 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 00/14] sched: entity load-tracking re-work
  2012-02-17  9:07 ` Nikunj A Dadhania
@ 2012-02-17 10:48   ` Paul Turner
  2012-02-20  9:41     ` Nikunj A Dadhania
  0 siblings, 1 reply; 68+ messages in thread
From: Paul Turner @ 2012-02-17 10:48 UTC (permalink / raw)
  To: Nikunj A Dadhania
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Peter Zijlstra, Mike Galbraith, Kamalesh Babulal, Ben Segall,
	Ingo Molnar, Vaidyanathan Srinivasan

On Fri, Feb 17, 2012 at 1:07 AM, Nikunj A Dadhania
<nikunj@linux.vnet.ibm.com> wrote:
> On Wed, 01 Feb 2012 17:38:26 -0800, Paul Turner <pjt@google.com> wrote:
>> Hi all,
>>
>> The attached series is an RFC on implementing load tracking at the entity
>> instead of cfs_rq level. This results in a bottom-up load-computation in which
>> entities contribute to their parents load, as opposed to the current top-down
>> where the parent averages its children.  In particular this allows us to
>> correctly migrate load with their accompanying entities and provides the
>> necessary inputs for intelligent load-balancing and power-management.
>>
>> It was previously well tested and stable, but that was on v3.1-; there's been
>> some fairly extensive changes in the wake-up path since so apologies if anything
>> was broken in the rebase.Note also, since this is also an RFC on the approach I
>> have not yet de-linted the various CONFIG combinations for introduced compiler
>> errors.
>>
>
> I gave a quick run to this series, and it seems the fairness across
> taskgroups is broken with this.
>
> Test setup:
> Machine : IBM xSeries with Intel(R) Xeon(R) x5570 2.93GHz CPU with 8
> core, 64GB RAM, 16 cpu.
>
> Create 3 taskgroups: fair16, fair32 and fair48 having 16, 32 and 48
> cpu-hog tasks respectively. They have equal shares(default 1024), so
> they should consume roughly the same time.
>
> 120secs run 1:
> Time consumed by fair16 cgroup:  712912 Tasks: 16
> Time consumed by fair32 cgroup:  650977 Tasks: 32
> Time consumed by fair48 cgroup:  575681 Tasks: 48
>
> 120secs run 2:
> Time consumed by fair16 cgroup:  686295 Tasks: 16
> Time consumed by fair32 cgroup:  643474 Tasks: 32
> Time consumed by fair48 cgroup:  611415 Tasks: 48
>
> 600secs run 1:
> Time consumed by fair16 cgroup:  4109678 Tasks: 16
> Time consumed by fair32 cgroup:  1743983 Tasks: 32
> Time consumed by fair48 cgroup:  3759826 Tasks: 48
>
> 600secs run 2:
> Time consumed by fair16 cgroup:  3893365 Tasks: 16
> Time consumed by fair32 cgroup:  3028280 Tasks: 32
> Time consumed by fair48 cgroup:  2692001 Tasks: 48
>


This is almost certainly a result of me twiddling with the weight in
calc_cfs_shares (using average instead of instantaneous weight) in
this version -- see patch 11/14.  While this had some nice stability
properties it was not hot for fairness so I've since reverted it
(snippet attached below).

While it's hard to guarantee it was exactly this, since I'm carrying a
few other minor edits in preparation for the next version, that
version's current results look like:

8-core:
Starting task group fair16...done
Starting task group fair32...done
Starting task group fair48...done
Waiting for the task to run for 120 secs
Interpreting the results. Please wait....
Time consumed by fair16 cgroup:  316985 Tasks: 16
Time consumed by fair32 cgroup:  320274 Tasks: 32
Time consumed by fair48 cgroup:  320811 Tasks: 48

24-core:
Starting task group fair16...done
Starting task group fair32...done
Starting task group fair48...done
Waiting for the task to run for 120 secs
Interpreting the results. Please wait....
Time consumed by fair16 cgroup:  12628615 Tasks: 96
Time consumed by fair32 cgroup:  12562859 Tasks: 192
Time consumed by fair48 cgroup:  12600364 Tasks: 288

These results are stable and consistent.

As soon as I finish working through Peter's comments I'll upload a
pre-posting so you can re-test if you'd like.  Expect a formal
(non-RFC) posting to follow that within the next week or so, with other
nits such as detangling the tracking from FAIR_GROUP_SCHED so that we
may use it more comprehensively.

Thanks,

- Paul

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -823,7 +823,7 @@ static inline long calc_tg_weight(struct
task_group *tg, struct cfs_rq *cfs_rq)
        */
       tg_weight = atomic64_read(&tg->load_avg);
       tg_weight -= cfs_rq->tg_load_contrib;
-       tg_weight += cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
+       tg_weight += cfs_rq->load.weight;


       return tg_weight;
 }
@@ -833,7 +833,7 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq,
struct task_group *tg)
       long tg_weight, load, shares;

       tg_weight = calc_tg_weight(tg, cfs_rq);
-       load = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
+       load = cfs_rq->load.weight;

> As you can see there is a lot of variance in the above results.
>
> wo patches
> 120secs run 1:
> Time consumed by fair16 cgroup:  667644 Tasks: 16
> Time consumed by fair32 cgroup:  653771 Tasks: 32
> Time consumed by fair48 cgroup:  624915 Tasks: 48
>
> 600secs run 1:
> Time consumed by fair16 cgroup:  3278425 Tasks: 16
> Time consumed by fair32 cgroup:  3140335 Tasks: 32
> Time consumed by fair48 cgroup:  3198817 Tasks: 48
>
> Regards
> Nikunj
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 06/14] sched: aggregate total task_group load
  2012-02-17  4:41   ` Nikunj A Dadhania
@ 2012-02-17 10:52     ` Paul Turner
  0 siblings, 0 replies; 68+ messages in thread
From: Paul Turner @ 2012-02-17 10:52 UTC (permalink / raw)
  To: Nikunj A Dadhania
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Peter Zijlstra, Mike Galbraith, Kamalesh Babulal, Ben Segall,
	Ingo Molnar, Vaidyanathan Srinivasan

On Thu, Feb 16, 2012 at 8:41 PM, Nikunj A Dadhania
<nikunj@linux.vnet.ibm.com> wrote:
> On Wed, 01 Feb 2012 17:38:26 -0800, Paul Turner <pjt@google.com> wrote:
>> +static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
>> +                                              int force_update)
>> +{
>> +     struct task_group *tg = cfs_rq->tg;
>> +     s64 tg_contrib;
>> +
>> +     tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
>> +     tg_contrib -= cfs_rq->tg_load_contrib;
>> +
>> +     if (force_update || abs64(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
>>
> Not obvious to me where this 8 is coming from?
>

It's arbitrary: it requires the load contrib to change by more than
1/8th -- or 12.5% -- of the contrib we're advertising globally before
we pay the cost of an update in the non-forced case.

We used the same trick in the previous shares-tracking code since we
did not have a natural limit on the update rate.  While this is not as
much of an issue in the new code, it does not seem to hurt accuracy,
and squashing essentially spurious updates is still a win.
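
As a made-up example: if we're currently advertising tg_load_contrib =
2048, the non-forced path only publishes a new value once

	abs(runnable_load_avg + blocked_load_avg - 2048) > 2048/8 = 256

i.e. once the local picture has drifted by more than ~12.5% from what
everyone else already sees.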

- Paul

> Regards
> Nikunj
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 14/14] sched: implement usage tracking
  2012-02-16 13:37   ` Peter Zijlstra
@ 2012-02-17 10:54     ` Paul Turner
  2012-02-20 16:11       ` Peter Zijlstra
  0 siblings, 1 reply; 68+ messages in thread
From: Paul Turner @ 2012-02-17 10:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

On Thu, Feb 16, 2012 at 5:37 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
>>  static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
>>  {
>> -       __update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
>> +       __update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable,
>> +                                    runnable);
>
> Its not immediately obvious why we use @runnable for @running,

Isn't it?  An rq is the root of all scheduling -- if there are any
runnable tasks then one of them had better be running when this is called.

I can add a comment to this effect.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 14/14] sched: implement usage tracking
  2012-02-16 16:58   ` Peter Zijlstra
@ 2012-02-17 11:32     ` Paul Turner
  2012-02-20 16:30       ` Peter Zijlstra
  0 siblings, 1 reply; 68+ messages in thread
From: Paul Turner @ 2012-02-17 11:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

On Thu, Feb 16, 2012 at 8:58 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
>>  struct sched_avg {
>>         u64 runnable_avg_sum, runnable_avg_period;
>>         u64 last_runnable_update, decay_count;
>> +       u32 usage_avg_sum;
>
> 4 byte hole
>
>>         unsigned long load_avg_contrib;
>>
>>         int contributes_blocked_load;
>>};
>
>
> A) Running/Runnable
>    - uses last_runnable_update to track
>        runnable_avg_sum, usage_avg_sum and runnable_avg_period

Yes, with the runnable average being runnable_avg_sum / runnable_avg_period

>
> B) Blocked
>    - uses decay_count to keep track of sleeptime

Not quite (we could track sleep-time explicitly against clock_task, after all).

We need decay_count to synchronize with the decays that entity_tick /
update_blocked_shares charge against
cfs_rq->blocked_load_sum.

Since we do not know where we slept relative to the tick, unless we
explicitly track where this occurred we don't know for a "n.y" jiffy
sleep whether the "0.y" fraction included an extra decay.  It's
important to get this right to reduce instability, since when the
entity wakes back up we need to charge an equivalent decay against
its contribution before we can remove it from the blocked_load_sum.

For the common case of a short sleep whether we charge an extra period
(if y included a jiffy edge) or not is significant to stability since
getting it wrong means we leave a little extra load in blocked_load.

Further complicating this we need to be able to do said
synchronization in the annoying cases of wake-up migration (where we
don't hold the lock on previous rq) and 32-bit machines (where
everything sucks).

>
> C) 'Migrate'
>    - uses contributes_blocked_load to check if we actually did migrate?
>

We actually use se->avg.decay_count to check whether we were migrated
(setting it appropriately at migration); contributes_blocked_load is
just a convenience variable to track whether we're a part of
cfs_rq->blocked_load_avg or not.

I have to double-check everything, but I think the sign of
se->decay_count can be used equivalently, so I'll likely just remove it
in favor of that.  (Although, to keep things readable, I'll replace it
with a similarly named helper that does the sign check.)

> So there's a number of things I don't get, why have
> remove_task_load_avg_async() in set_task_rq()? enqueue_entity_load_avg()
> can do the __synchronize_entity_decay() thing (and actually does) just
> fine, right?

Wake-up migration is the nuisance here.

In this case we:
A) Don't know where we came from in enqueue_entity_load_avg to
synchronize the decay against
B) Need to be able to remove the blocked load possessed by said waking
entity from its former cfs_rq->blocked_load_sum

There are one or two other paths (other than ttwu) that this matters
for; using set_task_rq() gets them all.

>
> Similarly, why not use last_runnable_update (as last_update) to track
> both A and B since a task cannot be in both states at the same time
> anyway. If you always update the timestamp you can use it to either
> track runnable/usage or decay for sleep.

This is pretty much exactly what we do (use it to track both A and B);
modulo the complications surrounding decay for sleep explained above
and below.

There's two things at play here:
1) An entity's runnable_avg :=  runnable_avg_sum / runnable_avg_period
2) An entity's load contribution := (runnable_avg_sum / runnable_avg_period) * entity_weight

The load contribution is what we added to blocked_load_sum.  When we
wake back up we synchronize decays against this and remove it from the
sum.  However, the decay applied to it is really just an imperfect
approximation (it depends on how our actual sleep aligned with the tick).

As soon as it's actually running again we:
- Adjust our load contrib by decay so that we can remove the right
amount from blocked_load_sum
- Charge the time since last_runnable_update as not running
- Compute a new, *accurate* load_contrib from (1) above, throwing away
the decayed previous one
- Add the new accurate calculation into the running load sum

The nice thing is that when we start this cycle again when the entity
next blocks we don't compound the losses of precision due to aligning
decay vs ticks since we get to throw away that number and do an
accurate accounting based on runnable_avg again.
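
As a rough sketch of that wake-up flow (the helper named
decays_since_block() is a placeholder here, not an exact function from
the series):

	/* waking a previously blocked entity */
	decays = decays_since_block(cfs_rq, se);  /* derived from decay_count */

	/* (a) age the stale contrib so we remove exactly what we added */
	se->avg.load_avg_contrib =
		decay_load(se->avg.load_avg_contrib, decays);
	subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);

	/* (b) account the time since last_runnable_update as not running,
	 * (c) recompute contrib = runnable_avg_sum / runnable_avg_period
	 *     * weight from scratch, and
	 * (d) add that fresh contrib into cfs_rq->runnable_load_avg
	 */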

>
> That would get rid of decay_count and contributes_blocked_load, no?

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 05/14] sched: account for blocked load waking back up
  2012-02-16 16:02   ` Peter Zijlstra
@ 2012-02-17 11:39     ` Paul Turner
  0 siblings, 0 replies; 68+ messages in thread
From: Paul Turner @ 2012-02-17 11:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

On Thu, Feb 16, 2012 at 8:02 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
>
>>         u64 last_runnable_update, decay_count;
>
>> +       /* we track migrations using entity decay_count == 0 */
>> +       if (unlikely(se->avg.decay_count <= 0)) {
>
> !sleep dequeue isn't migrate only,

Well, they mostly are these days; but in the == 0 case we're either a
load-balancer migration or someone doing a dequeue/enqueue pair on an
entity around some sort of update.

The key is that when it is a load-balancer move we'll resync
appropriately on enqueue (which we need to do).  We will essentially
sync with ourselves in the other cases, but they have no bearing on
why we do this.

> also, <= 0 on an unsigned is weird.

Yeah that should be a s64 not u64 (fixed).

>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 01/14] sched: track the runnable average on a per-task entitiy basis
  2012-02-15 23:37   ` Peter Zijlstra
@ 2012-02-17 11:43     ` Paul Turner
  0 siblings, 0 replies; 68+ messages in thread
From: Paul Turner @ 2012-02-17 11:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

On Wed, Feb 15, 2012 at 3:37 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
>> +       /*
>> +        * Use 1024us as the unit of measurement since it's a reasonable
>> +        * approximation of 1ms and fast to compute.
>> +        */
>> +       delta >>= 10;
>
> ns -> us ?, text talks about ms, slightly confusing

Yes, the comment was incorrect; I'd also noticed this on a re-read -- it's fixed.

>
>> +       if (!delta)
>> +               return 0;
>> +       sa->last_runnable_update = now;
>> +
>> +       delta_w = sa->runnable_avg_period % 1024;
>
> so delta_w is the chunk of this p we already accounted.
>
>> +       if (delta + delta_w >= 1024) {
>
> if delta pushes us over 1024*1024 ns (~1ms) we roll a window.
>
>> +               /* period roll-over */
>> +               decayed = 1;
>> +
>> +               delta_w = 1024 - delta_w;
>
> The distance we need to reach the next window.
>
>> +               BUG_ON(delta_w > delta);
>
> somehow reading this code took forever, this suggests clarification,
> either through better variable names or more comments. Could also mean
> I'm a moron and should get more sleep or so :-)

No... this bit is definitely fiddly; I got it wrong more than once
writing it down, and then got it wrong again later when
"optimizing" :-).

It deserves more comments, I will add.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 01/14] sched: track the runnable average on a per-task entitiy basis
  2012-02-16 13:27   ` Peter Zijlstra
@ 2012-02-17 11:44     ` Paul Turner
  0 siblings, 0 replies; 68+ messages in thread
From: Paul Turner @ 2012-02-17 11:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

On Thu, Feb 16, 2012 at 5:27 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
>> +static __always_inline u64 decay_load(u64 val, int n)
>> +{
>> +       for (;n && val;n--) {
>> +               val *= 4008;
>> +               val >>= 12;
>> +       }
>> +
>> +       return val;
>> +}
>
>> +                       sa->runnable_avg_sum =
>> +                               decay_load(sa->runnable_avg_sum, 1);
>> +                       sa->runnable_avg_period =
>> +                               decay_load(sa->runnable_avg_period, 1);
>
> Since all you ever seem to do is:
>
>  x = decay(x, n);
>
> and frequently run over the line limits it might make sense to either
> introduce a CPP helper or make the first argument a pointer and ditch
> the return value so we end up with something like:
>
>  decay(&x, n);

Ah-ha, good idea.  I had done some other fiddles to try and make
things fit better, but this will work.
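
Something along these lines, I assume (sketch only, exact name and form
up for grabs):

static __always_inline void decay_load_inplace(u64 *val, int n)
{
	/* same y^n fixed-point decay as above, just updating in place */
	for (; n && *val; n--) {
		*val *= 4008;	/* y^1 ~= 4008/4096 */
		*val >>= 12;
	}
}

so that the call sites collapse to:

	decay_load_inplace(&sa->runnable_avg_sum, 1);
	decay_load_inplace(&sa->runnable_avg_period, 1);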

Thanks!

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 04/14] sched: maintain the load contribution of blocked entities
  2012-02-16 12:25   ` Peter Zijlstra
@ 2012-02-17 11:53     ` Paul Turner
  0 siblings, 0 replies; 68+ messages in thread
From: Paul Turner @ 2012-02-17 11:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

On Thu, Feb 16, 2012 at 4:25 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
>> +static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
>> +                                                long load_contrib)
>> +{
>> +       if (likely(load_contrib < cfs_rq->blocked_load_avg))
>> +               cfs_rq->blocked_load_avg -= load_contrib;
>> +       else
>> +               cfs_rq->blocked_load_avg = 0;
>> +}
>> +
>>  /* Update a sched_entity's runnable average */
>> -static inline void update_entity_load_avg(struct sched_entity *se)
>> +static inline void update_entity_load_avg(struct sched_entity *se,
>> +                                         int update_cfs_rq)
>>  {
>>         struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>         long contrib_delta;
>> @@ -1106,8 +1130,34 @@ static inline void update_entity_load_avg(struct sched_entity *se)
>>                 return;
>>
>>         contrib_delta = __update_entity_load_avg_contrib(se);
>> +
>> +       if (!update_cfs_rq)
>> +               return;
>> +
>>         if (se->on_rq)
>>                 cfs_rq->runnable_load_avg += contrib_delta;
>> +       else
>> +               subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
>> +}
>
> So that last bit is add_blocked_load_contrib(), right?

Yes, although contrib_delta is signed.

I suppose this looks a little funny but there's a good explanation:
When adding or removing a contribution to blocked_load_sum we have to
be careful since rounding errors may mean the delta to our
contribution and the delta to our portion of the blocked_load_sum may
be off by a few bits.  Being low is fine since it's constantly
decaying to zero so any error term is not long for this world -- but
we do want to make sure we don't underflow in the other direction.

This means any time we remove we have to do the whole
	"if (likely(load_contrib < cfs_rq->blocked_load_avg))
		cfs_rq->blocked_load_avg -= load_contrib;
	else
		cfs_rq->blocked_load_avg = 0"
thing.

Typically we only care about doing this when removing load (e.g. a local
wake-up), since we have no error going in the opposite direction; so
we end up with subtract_blocked_load_contrib to handle that case.

Coming back to the use of it here:
Since contrib_delta is based on a freshly computed load contribution we
have to take the same care as when we are removing; but luckily we
already have subtract_blocked_load_contrib for dealing with that
everywhere else.  So we just re-use it and flip the sign.

We could call it add everywhere else (and again flip the sign) but
that's less intuitive since we really are only looking to (safely) remove in
those cases.
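
To make the sign game concrete with made-up numbers: if a blocked
entity's contribution has decayed from, say, 40 down to 35 then
contrib_delta = -5, and subtract_blocked_load_contrib(cfs_rq,
-contrib_delta) strips the 5 units it no longer represents out of
blocked_load_avg, clamping at zero if rounding has already left the
sum short.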

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 08/14] sched: normalize tg load contributions against runnable time
  2012-02-15 23:36   ` Peter Zijlstra
@ 2012-02-17 12:32     ` Paul Turner
  2012-02-20 16:10       ` Peter Zijlstra
  2012-02-17 12:34     ` Peter Zijlstra
  1 sibling, 1 reply; 68+ messages in thread
From: Paul Turner @ 2012-02-17 12:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

On Wed, Feb 15, 2012 at 3:36 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
>> Entities of equal weight should receive equitable distribution of cpu time.
>> This is challenging in the case of a task_group's shares as execution may be
>> occurring on multiple cpus simultaneously.
>>
>> To handle this we divide up the shares into weights proportionate with the load
>> on each cfs_rq.  This does not however, account for the fact that the sum of
>> the parts may be less than one cpu and so we need to normalize:
>>   load(tg) = min(runnable_avg(tg), 1) * tg->shares
>> Where runnable_avg is the aggregate time in which the task_group had runnable
>> children.
>
>
>>  static inline void __update_group_entity_contrib(struct sched_entity *se)
>>  {
>>         struct cfs_rq *cfs_rq = group_cfs_rq(se);
>>         struct task_group *tg = cfs_rq->tg;
>> +       int runnable_avg;
>>
>>         se->avg.load_avg_contrib = (cfs_rq->tg_load_contrib * tg->shares);
>>         se->avg.load_avg_contrib /= atomic64_read(&tg->load_avg) + 1;
>> +
>> +       /*
>> +        * Unlike a task-entity, a group entity may be using >=1 cpu globally.
>> +        * However, in the case that it's using <1 cpu we need to form a
>> +        * correction term so that we contribute the same load as a task of
>> +        * equal weight. (Global runnable time is taken as a fraction over 2^12.)
>> +        */
>> +       runnable_avg = atomic_read(&tg->runnable_avg);
>> +       if (runnable_avg < (1<<12)) {
>> +               se->avg.load_avg_contrib *= runnable_avg;
>> +               se->avg.load_avg_contrib /= (1<<12);
>> +       }
>>  }
>
> This seems weird, and the comments don't explain anything.
>

So consider (on cpu_i):

  R_i
 /  \
t1  A_i
      \
      t2

Where weight(t1) = shares(A) = 1024

Now, if t1 is runnable 80% of the time we'd charge 80% of its weight
to R in the load-avg, or about 820

If t2, in A, is runnable the exact same proportion of time, then it
should make the same contribution to R.

*But* A is a group entity, so its contribution is its load weight taken
as a fraction of the task_group's total load.  The catch is that if t2 is
the only task running then this is 1024/1024 * 1024 = 1024 != 820; we've
lost the fact that t2 in A was only actually runnable 80% of the time.
We need to perform the same normalization against how much it actually
runs... BUT

We can't just use A_i's runnable, since group entities are per-cpu and
thus completely wrong as t2 moves around.  We also can't easily carry
t2's contribution to A_i's runnability since there's no way to pull
the sum apart or account it into the new parenting tree efficiently.

But what we can do is aggregate an estimate of whether the group as
a whole would contribute its full shares value if placed only on one
cpu and adjust appropriately.
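
Plugging the example numbers into the snippet above (and ignoring the +1
fudge in the divisor): with A the only group, tg_load_contrib * shares /
load_avg = 1024, and with t2 runnable 80% of the time runnable_avg ~=
0.8 * 4096 = 3277, so the corrected contribution is 1024 * 3277 / 4096
~= 819, i.e. the same ~820 that t1 contributes directly.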

I'll try to paraphrase the above into a more useful comment explaining this.

> Ah,.. you can count runnable multiple times (on each cpu), this also
> means that the number you're using (when below 1) can still be utter
> crap.

So, as an estimate it has the nice property that it's always a lower
bound on the true value.

Consider the two cases for runnable contributions across a pair of cpus:

Either they are disjoint in wall time, in which case they would have
accounted for the same amount were they actually on the same cpu.
Or they overlap, in which case the overlap period would be
additional run-delay incurred on one of them.

Then, the maximum amount we understate for a given overlap is n(n+1)/2
* k, where k is the width of the overlap and n is the number of
cpus we overlapped on.

But then, n is bounded by num_cpus -- so on a small machine our error
is bounded (e.g. we can be no more than 1/3 less than true on 2
cpus).  On a larger machine this term can be larger; however, if k is
of consequential size we know we accounted at least n*k, so we're still
going to approach the ceiling of 1 very quickly, at which point it
doesn't matter that we under-estimated.  The flip side on a larger
machine is that if k is of inconsequential size then n*2 * k is still
tiny relative to the size of the machine and the accuracy of the
approximation matters less.

>
> Neither the comment nor the changelog mention this, it should, it should
> also mention why it doesn't matter (does it?).

It doesn't, and it should.  Although I'll take the liberty of shortening
it a little to something like "unfortunately we cannot compute
runnable_avg(tg) directly, however, XXX is a reasonable
approximation."

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 08/14] sched: normalize tg load contributions against runnable time
  2012-02-15 23:36   ` Peter Zijlstra
  2012-02-17 12:32     ` Paul Turner
@ 2012-02-17 12:34     ` Peter Zijlstra
  1 sibling, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2012-02-17 12:34 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

On Thu, 2012-02-16 at 00:36 +0100, Peter Zijlstra wrote:
> On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
> > Entities of equal weight should receive equitable distribution of cpu time.
> > This is challenging in the case of a task_group's shares as execution may be
> > occurring on multiple cpus simultaneously.
> > 
> > To handle this we divide up the shares into weights proportionate with the load
> > on each cfs_rq.  This does not however, account for the fact that the sum of
> > the parts may be less than one cpu and so we need to normalize:
> >   load(tg) = min(runnable_avg(tg), 1) * tg->shares
> > Where runnable_avg is the aggregate time in which the task_group had runnable
> > children.
> 
> 
> >  static inline void __update_group_entity_contrib(struct sched_entity *se)
> >  {
> >         struct cfs_rq *cfs_rq = group_cfs_rq(se);
> >         struct task_group *tg = cfs_rq->tg;
> > +       int runnable_avg;
> >  
> >         se->avg.load_avg_contrib = (cfs_rq->tg_load_contrib * tg->shares);
> >         se->avg.load_avg_contrib /= atomic64_read(&tg->load_avg) + 1;
> > +
> > +       /*
> > +        * Unlike a task-entity, a group entity may be using >=1 cpu globally.
> > +        * However, in the case that it's using <1 cpu we need to form a
> > +        * correction term so that we contribute the same load as a task of
> > +        * equal weight. (Global runnable time is taken as a fraction over 2^12.)
> > +        */
> > +       runnable_avg = atomic_read(&tg->runnable_avg);
> > +       if (runnable_avg < (1<<12)) {
> > +               se->avg.load_avg_contrib *= runnable_avg;
> > +               se->avg.load_avg_contrib /= (1<<12);
> > +       }
> >  } 
> 
> This seems weird, and the comments don't explain anything.
> 
> Ah,.. you can count runnable multiple times (on each cpu), this also
> means that the number you're using (when below 1) can still be utter
> crap.
> 
> Neither the comment nor the changelog mention this, it should, it should
> also mention why it doesn't matter (does it?).

Since we don't know when we were runnable in the window, we can take our
runnable fraction as a flat probability distribution over the entire
window.

The combined answer we're looking for is what fraction of time was any
of our cpus running.

Take p_i to be the runnable probability of cpu i, then the probability
that both cpu0 and cpu1 were runnable is pc_0,1 = p_0 * p_1, so the
probability that either was running is p_01 = p_0 + p_1 - pc_0,1.

The 3 cpu case then becomes the question of when either cpu01 or cpu2
was running, yielding the iteration: p_012 = p_01 + p_2 - pc_01,2.

p_012 = p_0 + p_1 + p_2 - (p_0 * p_1 + (p_0 + p_1 - p_0 * p_1) * p_2)

Now for small values of p our combined/corrective term is small, since
it's a product of small terms, which is smaller still; however it becomes
more dominant the nearer we get to 1.

Since it's more likely to get near to 1 the more CPUs we have, I'm not
entirely convinced we can ignore it.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 13/14] sched: make __update_entity_runnable_avg() fast
  2012-02-06 20:48   ` Peter Zijlstra
@ 2012-02-17 12:49     ` Paul Turner
  0 siblings, 0 replies; 68+ messages in thread
From: Paul Turner @ 2012-02-17 12:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

On Mon, Feb 6, 2012 at 12:48 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
>> +/* Precomputed fixed inverse multiplies for multiplication by y^n */
>> +const static u32 runnable_avg_yN_inv[] = {
>> +       0xffffffff,0xfa83b2db,0xf5257d15,0xefe4b99b,0xeac0c6e7,0xe5b906e7,
>> +       0xe0ccdeec,0xdbfbb797,0xd744fcca,0xd2a81d91,0xce248c15,0xc9b9bd86,
>> +       0xc5672a11,0xc12c4cca,0xbd08a39f,0xb8fbaf47,0xb504f333,0xb123f581,
>> +       0xad583eea,0xa9a15ab4,0xa5fed6a9,0xa2704303,0x9ef53260,0x9b8d39b9,
>> +       0x9837f051,0x94f4efa8,0x91c3d373,0x8ea4398b,0x8b95c1e3,0x88980e80,
>> +       0x85aac367,0x82cd8698,
>> +};
>
> I wonder if modern Intel isn't at the point where computing this thing
> is cheaper than the cacheline miss. You can compute y^n in O(log n) time
> and with n < 32 that's 5 multiplications (see fixed_power_int). Add to
> that the division.
>
Of course there are platforms, ARM?, where the reverse is likely true.
Bugger that.
>
>> +/* Precomputed \Sum y^k { 1<=k<=n } */
>> +const static u32 runnable_avg_yN_sum[] = {
>> +           0, 1002, 1982, 2941, 3880, 4798, 5697, 6576, 7437, 8279, 9103,
>> +        9909,10698,11470,12226,12966,13690,14398,15091,15769,16433,17082,
>> +       17718,18340,18949,19545,20128,20698,21256,21802,22336,22859,23371,
>> +};
>
> Right, can't see a fast way to compute this..
>
> The asymmetry in the tables annoys me though 32 vs 33 entries,

We could shrink yN sum to 32 entries but then
__compute_runnable_contrib() ends up a lot uglier with a bunch of
conditionals about the n=0 case.  (and if we did that the same
argument could then be made for making runnable_avg_yN_inv 31
entries.. at which point we're asymmetric again.. ahhhhh!! must go
sort my pencils..)

>hex vs dec :-)
>

So I personally think hex versus dec is somewhat reasonable in that:

\Sum y^n converges to its geometric limit a/(1-r), roughly 47k, which is
a very human-readable number.  It's easy to do the mental math on what a
given
entry in that table is going to do to your runnable_avg / runnable_sum
from just looking at it.

Conversely, for the inverse multiplies they're the completely
unintelligible ~0*y^k.  We could write them in decimal, but the table
would just be enormous.

Since the second case would be enormously ugly, and we already have
precedent for using hex for inverse multiplies (wmult), the only unified
option is then to use hex for both.  But then I personally can't
eyeball the numbers for \Sum y^n in the same fashion.  Perhaps
that is something we can lose.
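
For reference, both tables follow from the single assumption y^32 = 1/2
(the ~47k above is just the closed form 1024*y/(1-y) ~= 46.8k), and
something like the following userspace snippet -- a sketch, not kernel
code -- regenerates them modulo rounding in the last bit of the
inverses:

#include <stdio.h>
#include <math.h>	/* exp2(); build with -lm */

int main(void)
{
	unsigned long long sum = 0;
	int n;

	printf("inv[ 0] = 0x%08x\n", 0xffffffffu);
	printf("sum[ 0] = %llu\n", sum);
	for (n = 1; n <= 32; n++) {
		/* runnable_avg_yN_inv[n] ~= 2^32 * y^n = 2^(32 - n/32) */
		unsigned int inv = (unsigned int)exp2(32.0 - n / 32.0);

		/* runnable_avg_yN_sum[n] = \Sum 1024*y^k, 1 <= k <= n */
		sum += (1024ULL * inv) >> 32;
		if (n < 32)
			printf("inv[%2d] = 0x%08x\n", n, inv);
		printf("sum[%2d] = %llu\n", n, sum);
	}
	return 0;
}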

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 05/14] sched: account for blocked load waking back up
  2012-02-16 15:57   ` Peter Zijlstra
@ 2012-02-17 13:00     ` Paul Turner
  0 siblings, 0 replies; 68+ messages in thread
From: Paul Turner @ 2012-02-17 13:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

On Thu, Feb 16, 2012 at 7:57 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
>> +               if (se->avg.decay_count) {
>> +                       /*
>> +                        * In a wake-up migration we have to approximate the
>> +                        * time sleeping.
>> +                        */
>> +                       se->avg.last_runnable_update -= (-se->avg.decay_count)
>> +                                                       << 20;
>> +                       update_entity_load_avg(se, 0);
>> +               }
>
> That comment wants more why and how. I think I remember pjt telling me
> about this last week, but those are already vague memories, I'll be sure
> to be completely clueless in another week.

The full reason is short enough to practically supplant the existing
comment:  We can't synchronize clock_task between our old cpu and our
new one [we don't have the locks, and 32-bit says you can't have nice
things], so we use our carried decays (which we can grab atomically for
exactly this reason) to approximate it.
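
(That is: decay_count here counts 1024us periods and one period is
1024 * 1024 ns, hence the << 20; rewinding last_runnable_update by that
many periods charges approximately the right amount of sleep time
without ever reading the old rq's clock.)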

I'll update appropriately.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 00/14] sched: entity load-tracking re-work
  2012-02-17 10:48   ` Paul Turner
@ 2012-02-20  9:41     ` Nikunj A Dadhania
  2012-02-21  2:33       ` Paul Turner
  0 siblings, 1 reply; 68+ messages in thread
From: Nikunj A Dadhania @ 2012-02-20  9:41 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Peter Zijlstra, Mike Galbraith, Kamalesh Babulal, Ben Segall,
	Ingo Molnar, Vaidyanathan Srinivasan

On Fri, 17 Feb 2012 02:48:06 -0800, Paul Turner <pjt@google.com> wrote:
> 
> This is almost certainly a result of me twiddling with the weight in
> calc_cfs_shares (using average instead of instantaneous weight) in
> this version -- see patch 11/14.  While this had some nice stability
> properties it was not hot for fairness so I've since reverted it
> (snippet attached below).
>
For my understanding, what do you mean by stability here?

> 
> 24-core:
> Starting task group fair16...done
> Starting task group fair32...done
> Starting task group fair48...done
> Waiting for the task to run for 120 secs
> Interpreting the results. Please wait....
> Time consumed by fair16 cgroup:  12628615 Tasks: 96
> Time consumed by fair32 cgroup:  12562859 Tasks: 192
> Time consumed by fair48 cgroup:  12600364 Tasks: 288
> 
"Tasks:" should be 16,32,48? 

Regards,
Nikunj


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 08/14] sched: normalize tg load contributions against runnable time
  2012-02-17 12:32     ` Paul Turner
@ 2012-02-20 16:10       ` Peter Zijlstra
  0 siblings, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2012-02-20 16:10 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

On Fri, 2012-02-17 at 04:32 -0800, Paul Turner wrote:
> > Neither the comment nor the changelog mention this, it should, it should
> > also mention why it doesn't matter (does it?).
> 
> It doesn't and it should.  Although I'll take the liberty shortening
> it a little to something

For the in-code comment that's fine, it would be good for the changelog
to have the entire story though.

>  like "unfortunately we cannot compute
> runnable_avg(tg) directly, however, XXX is a reasonable
> approximation." 

Yeah, not easily done indeed, you could compute a corrective term if
you'd have something like the avg and variance of runnable over the
various CPUs, but those too suck to have to compute.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 14/14] sched: implement usage tracking
  2012-02-17 10:54     ` Paul Turner
@ 2012-02-20 16:11       ` Peter Zijlstra
  0 siblings, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2012-02-20 16:11 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

On Fri, 2012-02-17 at 02:54 -0800, Paul Turner wrote:
> On Thu, Feb 16, 2012 at 5:37 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
> >>  static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
> >>  {
> >> -       __update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
> >> +       __update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable,
> >> +                                    runnable);
> >
> > Its not immediately obvious why we use @runnable for @running,
> 
> Isn't it?  An rq is the root of all scheduling -- if there are any
> runnable tasks then one of them had better be running when this is called.
> 
> I can add a comment to this effect.

Ah, indeed. Yes a comment would be good ;-)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 14/14] sched: implement usage tracking
  2012-02-17 11:32     ` Paul Turner
@ 2012-02-20 16:30       ` Peter Zijlstra
  0 siblings, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2012-02-20 16:30 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

On Fri, 2012-02-17 at 03:32 -0800, Paul Turner wrote:

> Since we do not know where we slept relative to the tick, unless we
> explicitly track where this occurred we don't know for a "n.y" jiffy
> sleep whether the "0.y" fraction included an extra decay. 

but isn't that the same problem you have in
__update_entity_runnable_avg() where you use ->runnable_avg_period %
1024 ?

Also, shouldn't __synchronize_entity_decay() update
->runnable_avg_period? Surely when you wake up your window fraction
could be different than when you went to sleep?

>  It's
> important to get this right to reduce instability, since when the
> entity wakes back up we need to charge an equivalent decay against
> it's contribution before we can remove it from the blocked_load_sum.
> 
> For the common case of a short sleep whether we charge an extra period
> (if y included a jiffy edge) or not is significant to stability since
> getting it wrong means we leave a little extra load in blocked_load.

Sure..

> Further complicating this we need to be able to do said
> synchronization in the annoying cases of wake-up migration (where we
> don't hold the lock on previous rq) and 32-bit machines (where
> everything sucks).

Ah, wouldn't task_waking_fair() be a better place though?

> >
> > C) 'Migrate'
> >    - uses contributes_blocked_load to check if we actually did migrate?
> >
> 
> We actually use se->avg.decay_count to check whether we were migrated
> (setting it appropriately at migration); contributes_blocked_load is
> just a convenience variable to track whether we're a part of
> cfs_rq->blocked_load_avg or not.

If you use task_waking_fair() you can also use ENQUEUE_WAKING.

> There's a one or two other paths (than ttwu) that this matters for,
> using set_task_rq() gets them all.

What paths are those? Having a cfs hook in the generic code there isn't
pretty and suggests funny things could happen when PI gets done.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 00/14] sched: entity load-tracking re-work
  2012-02-02  1:38 [RFC PATCH 00/14] sched: entity load-tracking re-work Paul Turner
                   ` (15 preceding siblings ...)
  2012-02-17  9:07 ` Nikunj A Dadhania
@ 2012-02-20 17:01 ` Peter Zijlstra
  2012-03-12 10:39 ` Morten Rasmussen
  2012-03-12 10:57 ` Vincent Guittot
  18 siblings, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2012-02-20 17:01 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan


One more thing though, check what it all does when you run with the
'default' jiffies-based sched_clock(). Nobody sane should use that, but
then there are bound to be people who disagree with that :-)

I suspect it will all more or less work out, but someone should verify
that it actually does. It would be such a shame to have to revert this
because we broke something daft like m68k or so ;-)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 00/14] sched: entity load-tracking re-work
  2012-02-20  9:41     ` Nikunj A Dadhania
@ 2012-02-21  2:33       ` Paul Turner
  0 siblings, 0 replies; 68+ messages in thread
From: Paul Turner @ 2012-02-21  2:33 UTC (permalink / raw)
  To: Nikunj A Dadhania
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Peter Zijlstra, Mike Galbraith, Kamalesh Babulal, Ben Segall,
	Ingo Molnar, Vaidyanathan Srinivasan

On Mon, Feb 20, 2012 at 1:41 AM, Nikunj A Dadhania
<nikunj@linux.vnet.ibm.com> wrote:
> On Fri, 17 Feb 2012 02:48:06 -0800, Paul Turner <pjt@google.com> wrote:
>>
>> This is almost certainly a result of me twiddling with the weight in
>> calc_cfs_shares (using average instead of instantaneous weight) in
>> this version -- see patch 11/14.  While this had some nice stability
>> properties it was not hot for fairness so I've since reverted it
>> (snippet attached below).
>>
> For my understanding, what do you mean by stability here?

The result is stable, meaning that repeating the experiment returns the same number.

>
>>
>> 24-core:
>> Starting task group fair16...done
>> Starting task group fair32...done
>> Starting task group fair48...done
>> Waiting for the task to run for 120 secs
>> Interpreting the results. Please wait....
>> Time consumed by fair16 cgroup:  12628615 Tasks: 96
>> Time consumed by fair32 cgroup:  12562859 Tasks: 192
>> Time consumed by fair48 cgroup:  12600364 Tasks: 288
>>
> "Tasks:" should be 16,32,48?
>

Ah, I ran your script multiple times (checking for stability) above; it
must not have been killing itself properly (notice each of those numbers
is the respective per-run task count times 6).

A correct first run on a 24-core looks like:
Starting task group fair16...done
Starting task group fair32...done
Starting task group fair48...done
Waiting for the task to run for 120 secs
Interpreting the results. Please wait....
Time consumed by fair16 cgroup:  1332211 Tasks: 16
Time consumed by fair32 cgroup:  1227356 Tasks: 32
Time consumed by fair48 cgroup:  1217174 Tasks: 48

The small boost to the tasks=16 case is almost certainly tied to our
current handling of sleeper credit and entity placement -- since there
are fewer tasks than cores, whenever a task moves to a core it has not
previously been executing on, it gets a vruntime boost.

Thanks,

- Paul


> Regards,
> Nikunj
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] sched: entity load-tracking re-work - Fix for ARM
  2012-02-29 10:37   ` [PATCH 1/2] sched: entity load-tracking re-work - Fix for ARM Pantelis Antoniou
@ 2012-02-28 17:45     ` Morten Rasmussen
  2012-02-28 17:52       ` Pantelis Antoniou
  0 siblings, 1 reply; 68+ messages in thread
From: Morten Rasmussen @ 2012-02-28 17:45 UTC (permalink / raw)
  To: Pantelis Antoniou; +Cc: linux-kernel, Paul Turner

On Wed, Feb 29, 2012 at 10:37:38AM +0000, Pantelis Antoniou wrote:
> @@ -1110,9 +1110,9 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
>  	struct cfs_rq *cfs_rq = group_cfs_rq(se);
>  	struct task_group *tg = cfs_rq->tg;
>  	int runnable_avg;
> +	u64 contrib;
>  
> -	se->avg.load_avg_contrib = (cfs_rq->tg_load_contrib * tg->shares);
> -	se->avg.load_avg_contrib /= atomic64_read(&tg->load_avg) + 1;

It seems that contrib is never assigned?
Fix:
+	contrib = (cfs_rq->tg_load_contrib * tg->shares);

> +	se->avg.load_avg_contrib = div_u64(contrib, atomic64_read(&tg->load_avg) + 1);

Regards,
Morten


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] sched: entity load-tracking re-work - Fix for ARM
  2012-02-28 17:45     ` Morten Rasmussen
@ 2012-02-28 17:52       ` Pantelis Antoniou
  0 siblings, 0 replies; 68+ messages in thread
From: Pantelis Antoniou @ 2012-02-28 17:52 UTC (permalink / raw)
  To: Morten Rasmussen; +Cc: linux-kernel, Paul Turner


On Feb 28, 2012, at 7:45 PM, Morten Rasmussen wrote:

> On Wed, Feb 29, 2012 at 10:37:38AM +0000, Pantelis Antoniou wrote:
>> @@ -1110,9 +1110,9 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
>> 	struct cfs_rq *cfs_rq = group_cfs_rq(se);
>> 	struct task_group *tg = cfs_rq->tg;
>> 	int runnable_avg;
>> +	u64 contrib;
>> 
>> -	se->avg.load_avg_contrib = (cfs_rq->tg_load_contrib * tg->shares);
>> -	se->avg.load_avg_contrib /= atomic64_read(&tg->load_avg) + 1;
> 
> It seems that contrib is never assigned?
> Fix:
> +	contrib = (cfs_rq->tg_load_contrib * tg->shares);
> 
>> +	se->avg.load_avg_contrib = div_u64(contrib, atomic64_read(&tg->load_avg) + 1);
> 
> Regards,
> Morten
> 

Hmm, yeah,

2 line patch and brown paper bag time.

Thanks

-- Pantelis

^ permalink raw reply	[flat|nested] 68+ messages in thread

* sched per task ARM fix
  2012-02-02  1:38 ` [RFC PATCH 14/14] sched: implement usage tracking Paul Turner
  2012-02-16 13:37   ` Peter Zijlstra
  2012-02-16 16:58   ` Peter Zijlstra
@ 2012-02-29 10:37   ` Pantelis Antoniou
  2012-02-29 10:37   ` [PATCH 1/2] sched: entity load-tracking re-work - Fix for ARM Pantelis Antoniou
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 68+ messages in thread
From: Pantelis Antoniou @ 2012-02-29 10:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: Paul Turner


Hi Paul,

The patch series was failing to build for ARM; with these small fixes it
builds and appears to work, at least on Panda where I tested it.
It also didn't build when CGROUPs were not configured, so that is fixed as well.

Regards

-- Pantelis

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH 1/2] sched: entity load-tracking re-work - Fix for ARM
  2012-02-02  1:38 ` [RFC PATCH 14/14] sched: implement usage tracking Paul Turner
                     ` (2 preceding siblings ...)
  2012-02-29 10:37   ` sched per task ARM fix Pantelis Antoniou
@ 2012-02-29 10:37   ` Pantelis Antoniou
  2012-02-28 17:45     ` Morten Rasmussen
  2012-02-29 10:37   ` [PATCH 2/2] sched: load-tracking compile when cgroup undefined Pantelis Antoniou
  2012-03-13 16:57   ` [RFC PATCH 14/14] sched: implement usage tracking Vincent Guittot
  5 siblings, 1 reply; 68+ messages in thread
From: Pantelis Antoniou @ 2012-02-29 10:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: Paul Turner, Pantelis Antoniou

[-- Attachment #1: Type: text/plain, Size: 191 bytes --]


u64/u64 divisions caused an undefined reference to `__aeabi_uldivmod'.
Convert them to div_u64 calls.
---
 kernel/sched/fair.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-sched-entity-load-tracking-re-work-Fix-for-ARM.patch --]
[-- Type: text/x-patch; name="0001-sched-entity-load-tracking-re-work-Fix-for-ARM.patch", Size: 1265 bytes --]

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8b7f7a6..0cea5e4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1089,10 +1089,10 @@ static inline void __update_tg_runnable_avg(struct sched_avg *sa,
 	struct task_group *tg = cfs_rq->tg;
 	long contrib, usage_contrib;
 
-	contrib = (sa->runnable_avg_sum << 12) / (sa->runnable_avg_period + 1);
+	contrib = div_u64(sa->runnable_avg_sum << 12, sa->runnable_avg_period + 1);
 	contrib -= cfs_rq->tg_runnable_contrib;
 
-	usage_contrib = (sa->usage_avg_sum << 12) / (sa->runnable_avg_period + 1);
+	usage_contrib = div_u64(sa->usage_avg_sum << 12, sa->runnable_avg_period + 1);
 	usage_contrib -= cfs_rq->tg_usage_contrib;
 
 	if ((abs(contrib) > cfs_rq->tg_runnable_contrib/64) ||
@@ -1110,9 +1110,9 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
 	struct cfs_rq *cfs_rq = group_cfs_rq(se);
 	struct task_group *tg = cfs_rq->tg;
 	int runnable_avg;
+	u64 contrib;
 
-	se->avg.load_avg_contrib = (cfs_rq->tg_load_contrib * tg->shares);
-	se->avg.load_avg_contrib /= atomic64_read(&tg->load_avg) + 1;
+	se->avg.load_avg_contrib = div_u64(contrib, atomic64_read(&tg->load_avg) + 1);
 
 	/*
 	 * Unlike a task-entity, a group entity may be using >=1 cpu globally.

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 2/2] sched: load-tracking compile when cgroup undefined
  2012-02-02  1:38 ` [RFC PATCH 14/14] sched: implement usage tracking Paul Turner
                     ` (3 preceding siblings ...)
  2012-02-29 10:37   ` [PATCH 1/2] sched: entity load-tracking re-work - Fix for ARM Pantelis Antoniou
@ 2012-02-29 10:37   ` Pantelis Antoniou
  2012-03-13 16:57   ` [RFC PATCH 14/14] sched: implement usage tracking Vincent Guittot
  5 siblings, 0 replies; 68+ messages in thread
From: Pantelis Antoniou @ 2012-02-29 10:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: Paul Turner, Pantelis Antoniou

[-- Attachment #1: Type: text/plain, Size: 138 bytes --]


Fix compilation when CGROUPs are not configured.
---
 kernel/sched/fair.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0002-sched-load-tracking-compile-when-cgroup-undefined.patch --]
[-- Type: text/x-patch; name="0002-sched-load-tracking-compile-when-cgroup-undefined.patch", Size: 929 bytes --]

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0cea5e4..127fbd2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1306,7 +1306,7 @@ inline void remove_task_load_avg_async(struct task_struct *p)
 }
 
 #else
-static inline update_entity_load_avg(struct sched_entity *se,
+static inline void update_entity_load_avg(struct sched_entity *se,
 				     int update_cfs_rq) {}
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
 static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
@@ -1316,6 +1316,8 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
 						  struct sched_entity *se,
 						  int sleep) {}
 inline void remove_task_load_avg_async(struct task_struct *p) {}
+static inline void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq,
+	int force_update) {}
 #endif
 
 static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 00/14] sched: entity load-tracking re-work
  2012-02-02  1:38 [RFC PATCH 00/14] sched: entity load-tracking re-work Paul Turner
                   ` (16 preceding siblings ...)
  2012-02-20 17:01 ` Peter Zijlstra
@ 2012-03-12 10:39 ` Morten Rasmussen
  2012-03-13 16:44   ` Paul E. McKenney
  2012-03-13 17:28   ` Peter Zijlstra
  2012-03-12 10:57 ` Vincent Guittot
  18 siblings, 2 replies; 68+ messages in thread
From: Morten Rasmussen @ 2012-03-12 10:39 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Peter Zijlstra, Mike Galbraith, Kamalesh Babulal, Ben Segall,
	Ingo Molnar, Vaidyanathan Srinivasan, Paul E. McKenney,
	Robin Randhawa

On Thu, Feb 02, 2012 at 01:38:26AM +0000, Paul Turner wrote:
> As referenced above this also allows us to potentially improve decisions within
> the load-balancer, for both distribution and power-management.
>
> Example: consider 1x80% task and 2x40% tasks on a 2-core machine. It's
> currently a bit of a gamble as to whether you get an {AB, B} or {A,
> BB} split since they have equal weight (assume 1024).  With per-task
> tracking we can actually consider them at their contributed weight and
> see a stable ~{800,{400, 400}} load-split.  Likewise within balance_tasks we can
> consider the load migrated to be that actually contributed.

Hi Paul (and LKML),

As a follow up to the discussions held during the scheduler mini-summit
at the last Linaro Connect I would like to share what I (working for
ARM) have observed so far in my experiments with big.LITTLE scheduling.

I see task affinity on big.LITTLE systems as a combination of
user-space affinity (via cgroups+cpuset etc) and introspective affinity
as a result of intelligent load balancing in the scheduler. I see the
entity load tracking in this patch set as a step towards the latter. I
am very interested in better task profiling in the scheduler as this is
crucial for selecting which tasks that should go on which type of core.

I am using the patches for some very crude experiments with scheduling
on big.LITTLE to explore possibilities and learn about potential issues.
What I want to achieve is that high priority CPU-intensive tasks will
be scheduled on fast and less power-efficient big cores and background
tasks will be scheduled on power-efficient little cores. At the same
time I would also like to minimize the performance impact experienced
by the user. The following is a summary of the observations that I have
made so far. I would appreciate comments and suggestions on the best way
to go from here.

I have set up two sched_domains on a 4-core ARM system, with two cores
each representing the big and little clusters, and disabled load balancing
between them. The aim is to separate heavy and high-priority tasks from
less important tasks using the two domains. Based on load_avg_contrib
tasks will be assigned to one of the domains by select_task_rq().
However, this does not work out very well. If a task in the little
domain suddenly consumes more CPU time and never goes to sleep it will
never get the chance to migrate to the big domain. On a homogeneous
system it doesn't really matter _where_ a task goes if imbalance is
unavoidable as all cores have equal performance. For heterogeneous
systems like big.LITTLE it makes a huge difference. To mitigate this
issue I am periodically checking the currently running task on each
little core to see if a CPU-intensive task is stuck there. If there is
it will be migrated to a core in the big domain using
stop_one_cpu_nowait(), similar to the active load balance mechanism. It
is not a pretty solution, so I am open to suggestions. Furthermore, by
only checking the current task there is a chance of missing busy tasks
waiting on the runqueue, but checking the entire runqueue seems too
expensive.

My observations are based on a simple mobile workload modelling web
browsing. That is basically two threads waking up occasionally to render
a web page. Using my current setup the most CPU intensive of the two
will be scheduled on the big cluster as intended. The remaining
background threads are always on the little cluster leaving the big
cluster idle, to save power, when it is not rendering. The
task-stuck-on-little problem can most easily be observed with
CPU-intensive workloads such as the sysbench CPU workload.

I have looked at traces of both runnable time and usage time trying to
understand why you use runnable time as your load metric and not usage
time which seems more intuitive. What I see is that runnable time
depends on the total runqueue load. If you have many tasks on the
runqueue they will wait longer and therefore have higher individual
load_avg_contrib than they would if they were scheduled across more CPUs.
Usage time is also affected by the number of tasks on the runqueue as
more tasks means less CPU time. However, less usage can also just mean
that the task does not execute very often. This would make a load
contribution estimate based on usage time less accurate. Is this your
reason for choosing runnable time?

Do you have any thoughts or comments on how entity load tracking could
be applied to introspectively select tasks for appropriate CPUs in
systems like big.LITTLE?

Best regards,
Morten


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 00/14] sched: entity load-tracking re-work
  2012-02-02  1:38 [RFC PATCH 00/14] sched: entity load-tracking re-work Paul Turner
                   ` (17 preceding siblings ...)
  2012-03-12 10:39 ` Morten Rasmussen
@ 2012-03-12 10:57 ` Vincent Guittot
  2012-03-14 15:59   ` Paul Turner
  18 siblings, 1 reply; 68+ messages in thread
From: Vincent Guittot @ 2012-03-12 10:57 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Peter Zijlstra, Mike Galbraith, Kamalesh Babulal, Ben Segall,
	Ingo Molnar, Vaidyanathan Srinivasan

Hi Paul,

On 2 February 2012 02:38, Paul Turner <pjt@google.com> wrote:
> Hi all,
>
> The attached series is an RFC on implementing load tracking at the entity
> instead of cfs_rq level. This results in a bottom-up load-computation in which
> entities contribute to their parents load, as opposed to the current top-down
> where the parent averages its children.  In particular this allows us to
> correctly migrate load with their accompanying entities and provides the
> necessary inputs for intelligent load-balancing and power-management.
>
> It was previously well tested and stable, but that was on v3.1-; there's been
> some fairly extensive changes in the wake-up path since so apologies if anything
> was broken in the rebase.Note also, since this is also an RFC on the approach I
> have not yet de-linted the various CONFIG combinations for introduced compiler
> errors.
>
> Background:
> ----------
> We currently track the load average at the parenting cfs_rq level. We divide
> execution into a series of consecutive "windows, w_i".  Within each we track:
>  \Sum load_i * t_i where \Sum t_i = w and each load_i a disjoint load level.
>
> The load average is then \Sum w_j / 2^n.
>
> This this works reasonably well but there's a few problems
> 1) Since decay occurs at boundary window boundary there are 'skews':
> At a window boundary the 'foreground' time has a bias against the
> time immediately preceding it (as a result of the folding division)
> e.g.  __xx_|_yyyy___ vs __xx_yyyy_|___  (where '|' is a window boundary).
>
> The accounting here is 2x + 4y/2 or 2x + 4y, depending on which side of the
> window your load lands on.
>
> 2) Everything within a window 'w' is accounted equally, we only fold at
> the boundaries.  This actually means you can't set 'w' large enough to
> really have meaningful coverage of the sched period without throwing
> decay out the window.  But then with w/2 << sched_period (currently
> w/2=5ms) the average ends up having fairly drastic swings even with
> stable loads.
>
> (Note:  Before we even looked at migrating to per-entity tracking we evaluating
> foreground window into the considered average until it was "complete", this
> represented a measurable improvement in stability under predictable load.)
>
> 3) Since the load sum average is per-cfs_rq and not per-entity when a task
> entity migrates we lose its contribution to load-sum and effectively double
> count it while it former sum decays.
>
> New approach:
> -------------
> Instead of tracking load on a per-cfs_rq basis we do it on a per-sched_entity
> basis.  The load sum for a cfs_rq is then just the sum of its childrens' load
> averages.  The obvious immediately nice property is that when we migrate thread
> A from cpu1-->cpu2, we carry its load with it; leaving the global load sum
> unmodified.
>
> The 'windows' above are replaced with more fine-grained tracking, that is (for
> an entity j):
>
> runnable_j = u_i * y^i , load_avg_j = runnable_j * weight_j [*]
>
> Where: u_i is the usage in the last i`th ~ms and y is chosen such that
> y^~sched_period = 1/2 (we currently choose k=32).This means that load tracked 1
> sched_period ago contributes about ~50% as current execution.

We have a sched_period of 30 ms on a 16-core system but it's only 12
ms on a dual core. Do you think that it's important to keep this rule
(y^~sched_period = 1/2) regardless of the number of cores? The
sched_period also increases with a large number of running threads.

>
> Now, the real challenge in tracking at an entity basis is handling blocked
> entities.  Obviously runnable entities will be updated naturally as we iterate
> through them but there could be O(many) blocked entities so we can't afford to
> iterate through them and update their tracking.
>

I have run sysbench on my 32-bit ARM platform. The results are below:

config 1 : v3.3-rc3 and CONFIG_FAIR_GROUP_SCHED not set
config 2 : v3.3-rc3 and CONFIG_FAIR_GROUP_SCHED set
config 3 : v3.3-rc3 with your patches and  CONFIG_FAIR_GROUP_SCHED not set
config 4 : v3.3-rc3 with your patches and  CONFIG_FAIR_GROUP_SCHED set

sysbench --test=cpu --num-threads=12 --max-time=10 run
             total number of events
             test1 test2 test3
config1    336   336   338
config2    336   337   338
config3    336   338   336
config4    336   338   337

sysbench --test=threads --thread-locks=9 --num-threads=12 --max-time=10 run
             total number of events
             test1 test2 test3
config1   5218 5228 5203
config2   5204 5217 5251
config3   5221 5219 5217
config4   4859 4796 4836


Whereas we see no difference for the cpu test (all the threads are in
the run queue waiting for a core), we can see a decrease for the
threads test. I'm wondering if it's linked to the fact that I have a
32-bit machine doing 64-bit divisions. Have you seen this kind of
difference on a 64-bit system?

Regards,
Vincent

> That our decay for a unit period is exponential introduces a particularly nice
> property here:
> We can separate the contributed load on a cfs_rq into blocked and runnable.
> Runnable load is just the sum of all load_avg_j above, maintained by the
> entities themselves as they run and self update (when they update their
> load_avg they update the cumulative sum also).
>
> Blocked load then looks like:
>  load_avg_j = weight_k * \Sum u[k]_n * y^n
>
> To account an entirely idle period we then only need to multiply by y.
>
> This ends up being orders of magnitude more accurate than the current
> tracking schema, even with the old shares_window massively dilated to
> better capture a stable load-state.  It's also obviously stable under
> migration.
>
> As referenced above this also allows us to potentially improve decisions within
> the load-balancer, for both distribution and power-management.
>
> Exmaple: consider 1x80% task  and 2x40% tasks on a 2-core machine.  It's
> currently a bit of a gamble as to whether you get an {AB, B} or {A,
> BB} split since they have equal weight (assume 1024).  With per-task
> tracking we can actually consider them at their contributed weight and
> see a stable ~{800,{400, 400}} load-split.  Likewise within balance_tasks we can
> consider the load migrated to be that actually contributed.
>
> Thanks,
>
> - Paul
>
>
> ---
>
> Ben Segall (1):
>      sched: maintain per-rq runnable averages
>
> Paul Turner (13):
>      sched: track the runnable average on a per-task entitiy basis
>      sched: aggregate load contributed by task entities on parenting cfs_rq
>      sched: maintain the load contribution of blocked entities
>      sched: account for blocked load waking back up
>      sched: aggregate total task_group load
>      sched: compute load contribution by a group entity
>      sched: normalize tg load contributions against runnable time
>      sched: maintain runnable averages across throttled periods
>      sched: replace update_shares weight distribution with per-entity computation
>      sched: refactor update_shares_cpu() -> update_blocked_avgs()
>      sched: update_cfs_shares at period edge
>      sched: make __update_entity_runnable_avg() fast
>      sched: implement usage tracking
>
>
>  include/linux/sched.h |   11 +
>  kernel/sched/core.c   |    3
>  kernel/sched/debug.c  |   37 ++-
>  kernel/sched/fair.c   |  714 ++++++++++++++++++++++++++++++++++++++-----------
>  kernel/sched/sched.h  |   29 +-
>  5 files changed, 611 insertions(+), 183 deletions(-)
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 00/14] sched: entity load-tracking re-work
  2012-03-12 10:39 ` Morten Rasmussen
@ 2012-03-13 16:44   ` Paul E. McKenney
  2012-03-13 17:08     ` Anca Emanuel
  2012-03-13 17:28   ` Peter Zijlstra
  1 sibling, 1 reply; 68+ messages in thread
From: Paul E. McKenney @ 2012-03-13 16:44 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Paul Turner, linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Peter Zijlstra, Mike Galbraith, Kamalesh Babulal, Ben Segall,
	Ingo Molnar, Vaidyanathan Srinivasan, Robin Randhawa,
	linaro-sched-sig

On Mon, Mar 12, 2012 at 10:39:27AM +0000, Morten Rasmussen wrote:
> On Thu, Feb 02, 2012 at 01:38:26AM +0000, Paul Turner wrote:
> > As referenced above this also allows us to potentially improve decisions within
> > the load-balancer, for both distribution and power-management.
> >
> > Exmaple: consider 1x80% task  and 2x40% tasks on a 2-core machine. It's
> > currently a bit of a gamble as to whether you get an {AB, B} or {A,
> > BB} split since they have equal weight (assume 1024).  With per-task
> > tracking we can actually consider them at their contributed weight and
> > see a stable ~{800,{400, 400}} load-split.  Likewise within balance_tasks we can
> > consider the load migrated to be that actually contributed.
> 
> Hi Paul (and LKML),

Hello, Morten!

> As a follow up to the discussions held during the scheduler mini-summit
> at the last Linaro Connect I would like to share what I (working for
> ARM) have observed so far in my experiments with big.LITTLE scheduling.
> 
> I see task affinity on big.LITTLE systems as a combination of
> user-space affinity (via cgroups+cpuset etc) and introspective affinity
> as result of intelligent load balancing in the scheduler. I see the
> entity load tracking in this patch set as a step towards the latter. I
> am very interested in better task profiling in the scheduler as this is
> crucial for selecting which tasks that should go on which type of core.
> 
> I am using the patches for some very crude experiments with scheduling
> on big.LITTLE to explore possibilities and learn about potential issues.
> What I want to achieve is that high priority CPU-intensive tasks will
> be scheduled on fast and less power-efficient big cores and background
> tasks will be scheduled on power-efficient little cores. At the same
> time I would also like to minimize the performance impact experienced
> by the user. The following is a summary of the observation that I have
> made so far. I would appreciate comments and suggestions on the best way
> to go from here.
> 
> I have set up two sched_domains on a 4-core ARM system with two cores
> each that represents big and little clusters and disabled load balancing
> between them. The aim is to separate heavy and high priority tasks from
> less important tasks using the two domains. Based on load_avg_contrib
> tasks will be assigned to one of the domains by select_task_rq().
> However, this does not work out very well. If a task in the little
> domain suddenly consumes more CPU time and never goes to sleep it will
> never get the chance to migrate to the big domain. On a homogeneous
> system it doesn't really matter _where_ a task goes if imbalance is
> unavoidable as all cores have equal performance. For heterogeneous
> systems like big.LITTLE it makes a huge difference. To mitigate this
> issue I am periodically checking the currently running task on each
> little core to see if a CPU-intensive task is stuck there. If there is
> it will be migrated to a core in the big domain using
> stop_one_cpu_nowait() similar to the active load balance mechanism. It
> is not a pretty solution, so I am open for suggestions. Furthermore, by
> only checking the current task there is a chance of missing busy tasks
> waiting on the runqueue but checking the entire runqueue seems too
> expensive.
> 
> My observations are based on a simple mobile workload modelling web
> browsing. That is basically two threads waking up occasionally to render
> a web page. Using my current setup the most CPU intensive of the two
> will be scheduled on the big cluster as intended. The remaining
> background threads are always on the little cluster leaving the big
> cluster idle when it is not rendering to save power. The
> task-stuck-on-little problem can most easily be observed with CPU
> intensive workloads such the sysbench CPU workload.
> 
> I have looked at traces of both runnable time and usage time trying to
> understand why you use runnable time as your load metric and not usage
> time which seems more intuitive. What I see is that runnable time
> depends on the total runqueue load. If you have many tasks on the
> runqueue they will wait longer and therefore have higher individual
> load_avg_contrib than they would if the were scheduled across more CPUs.
> Usage time is also affected by the number of tasks on the runqueue as
> more tasks means less CPU time. However, less usage can also just mean
> that the task does not execute very often. This would make a load
> contribution estimate based on usage time less accurate. Is this your
> reason for choosing runnable time?

It might be a tradeoff between accuracy of scheduling and CPU cost of
scheduling, but I have to defer to Peter Z, Paul Turner, and the rest
of the scheduler guys on this one.

							Thanx, Paul

> Do you have any thoughts or comments on how entity load tracking could
> be applied to introspectively select tasks for appropriate CPUs in
> system like big.LITTLE?
> 
> Best regards,
> Morten
> 


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 14/14] sched: implement usage tracking
  2012-02-02  1:38 ` [RFC PATCH 14/14] sched: implement usage tracking Paul Turner
                     ` (4 preceding siblings ...)
  2012-02-29 10:37   ` [PATCH 2/2] sched: load-tracking compile when cgroup undefined Pantelis Antoniou
@ 2012-03-13 16:57   ` Vincent Guittot
  2012-03-14 15:01     ` Peter Zijlstra
  2012-03-14 15:44     ` Paul Turner
  5 siblings, 2 replies; 68+ messages in thread
From: Vincent Guittot @ 2012-03-13 16:57 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Peter Zijlstra, Mike Galbraith, Kamalesh Babulal, Ben Segall,
	Ingo Molnar, Vaidyanathan Srinivasan

Hi Paul,

On 2 February 2012 02:38, Paul Turner <pjt@google.com> wrote:
> With the frame-work for runnable tracking now fully in place.  Per-entity usage
> tracking is a simple and low-overhead addition.
>
> Signed-off-by: Paul Turner <pjt@google.com>
> Signed-off-by: Ben Segall <bsegall@google.com>
> ---
>  include/linux/sched.h |    1 +
>  kernel/sched/debug.c  |    3 +++
>  kernel/sched/fair.c   |   29 +++++++++++++++++++++++------
>  kernel/sched/sched.h  |    4 ++--
>  4 files changed, 29 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 09b8c45..209185f 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1159,6 +1159,7 @@ struct load_weight {
>  struct sched_avg {
>        u64 runnable_avg_sum, runnable_avg_period;
>        u64 last_runnable_update, decay_count;
> +       u32 usage_avg_sum;

Why is usage_avg_sum 32 bits whereas runnable_avg_sum and
runnable_avg_period are 64 bits long? You are doing the same
computation on these 3 variables. Only the computation needs to be done
in 64 bits; couldn't the result be saved in 32 bits?

Regards,
Vincent

>        unsigned long load_avg_contrib;
>
>        int contributes_blocked_load;
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 0911ec6..4d39069 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -93,6 +93,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
>        P(se->load.weight);
>        P(se->avg.runnable_avg_sum);
>        P(se->avg.runnable_avg_period);
> +       P(se->avg.usage_avg_sum);
>        P(se->avg.load_avg_contrib);
>        P(se->avg.decay_count);
>  #undef PN
> @@ -228,6 +229,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
>                        cfs_rq->tg_runnable_contrib);
>        SEQ_printf(m, "  .%-30s: %d\n", "tg->runnable_avg",
>                        atomic_read(&cfs_rq->tg->runnable_avg));
> +       SEQ_printf(m, "  .%-30s: %d\n", "tg->usage_avg",
> +                       atomic_read(&cfs_rq->tg->usage_avg));
>  #endif
>
>        print_cfs_group_stats(m, cpu, cfs_rq->tg);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ad524bb..222c2c9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -997,7 +997,8 @@ static u32 __compute_runnable_contrib(int n)
>  */
>  static __always_inline int __update_entity_runnable_avg(u64 now,
>                                                        struct sched_avg *sa,
> -                                                       int runnable)
> +                                                       int runnable,
> +                                                       int running)
>  {
>        u64 delta;
>        u32 periods, runnable_contrib;
> @@ -1026,6 +1027,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
>                delta_w = 1024 - delta_w;
>                if (runnable)
>                        sa->runnable_avg_sum += delta_w;
> +               if (running)
> +                       sa->usage_avg_sum += delta_w;
>                sa->runnable_avg_period += delta_w;
>
>                delta -= delta_w;
> @@ -1036,15 +1039,20 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
>                                                  periods + 1);
>                sa->runnable_avg_period = decay_load(sa->runnable_avg_period,
>                                                  periods + 1);
> +               sa->usage_avg_sum = decay_load(sa->usage_avg_sum, periods + 1);
>
>                runnable_contrib = __compute_runnable_contrib(periods);
>                if (runnable)
>                        sa->runnable_avg_sum += runnable_contrib;
> +               if (running)
> +                       sa->usage_avg_sum += runnable_contrib;
>                sa->runnable_avg_period += runnable_contrib;
>        }
>
>        if (runnable)
>                sa->runnable_avg_sum += delta;
> +       if (running)
> +               sa->usage_avg_sum += delta;
>        sa->runnable_avg_period += delta;
>
>        return decayed;
> @@ -1081,14 +1089,21 @@ static inline void __update_tg_runnable_avg(struct sched_avg *sa,
>                                                  struct cfs_rq *cfs_rq)
>  {
>        struct task_group *tg = cfs_rq->tg;
> -       long contrib;
> +       long contrib, usage_contrib;
>
>        contrib = (sa->runnable_avg_sum << 12) / (sa->runnable_avg_period + 1);
>        contrib -= cfs_rq->tg_runnable_contrib;
>
> -       if (abs(contrib) > cfs_rq->tg_runnable_contrib/64) {
> +       usage_contrib = (sa->usage_avg_sum << 12) / (sa->runnable_avg_period + 1);
> +       usage_contrib -= cfs_rq->tg_usage_contrib;
> +
> +       if ((abs(contrib) > cfs_rq->tg_runnable_contrib/64) ||
> +           (abs(usage_contrib) > cfs_rq->tg_usage_contrib/64)) {
>                atomic_add(contrib, &tg->runnable_avg);
>                cfs_rq->tg_runnable_contrib += contrib;
> +
> +               atomic_add(usage_contrib, &tg->usage_avg);
> +               cfs_rq->tg_usage_contrib += usage_contrib;
>        }
>  }
>
> @@ -1164,8 +1179,8 @@ static inline void update_entity_load_avg(struct sched_entity *se,
>        struct cfs_rq *cfs_rq = cfs_rq_of(se);
>        long contrib_delta;
>
> -       if(!__update_entity_runnable_avg(rq_of(cfs_rq)->clock_task, &se->avg,
> -                                       se->on_rq))
> +       if(!__update_entity_runnable_avg(cfs_rq_clock_task(cfs_rq), &se->avg,
> +                                        se->on_rq, cfs_rq->curr == se))
>                return;
>
>        contrib_delta = __update_entity_load_avg_contrib(se);
> @@ -1210,7 +1225,8 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
>
>  static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
>  {
> -       __update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
> +       __update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable,
> +                                    runnable);
>        __update_tg_runnable_avg(&rq->avg, &rq->cfs);
>  }
>
> @@ -1590,6 +1606,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>                 */
>                update_stats_wait_end(cfs_rq, se);
>                __dequeue_entity(cfs_rq, se);
> +               update_entity_load_avg(se, 1);
>        }
>
>        update_stats_curr_start(cfs_rq, se);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 77f3297..9602a47 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -117,7 +117,7 @@ struct task_group {
>
>        atomic_t load_weight;
>        atomic64_t load_avg;
> -       atomic_t runnable_avg;
> +       atomic_t runnable_avg, usage_avg;
>  #endif
>
>  #ifdef CONFIG_RT_GROUP_SCHED
> @@ -260,7 +260,7 @@ struct cfs_rq {
>         */
>        unsigned long h_load;
>
> -       u32 tg_runnable_contrib;
> +       u32 tg_runnable_contrib, tg_usage_contrib;
>        u64 runnable_load_avg, blocked_load_avg;
>        u64 tg_load_contrib;
>        atomic64_t decay_counter, removed_load;
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 00/14] sched: entity load-tracking re-work
  2012-03-13 16:44   ` Paul E. McKenney
@ 2012-03-13 17:08     ` Anca Emanuel
  2012-03-13 17:23       ` Paul E. McKenney
  2012-03-14  9:03       ` Amit Kucheria
  0 siblings, 2 replies; 68+ messages in thread
From: Anca Emanuel @ 2012-03-13 17:08 UTC (permalink / raw)
  To: paulmck
  Cc: Morten Rasmussen, Paul Turner, linux-kernel, Venki Pallipadi,
	Srivatsa Vaddagiri, Peter Zijlstra, Mike Galbraith,
	Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan, Robin Randhawa, linaro-sched-sig

@Paul E. McKenney: oops, was that some special pgp code?

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 00/14] sched: entity load-tracking re-work
  2012-03-13 17:08     ` Anca Emanuel
@ 2012-03-13 17:23       ` Paul E. McKenney
  2012-03-14  9:03       ` Amit Kucheria
  1 sibling, 0 replies; 68+ messages in thread
From: Paul E. McKenney @ 2012-03-13 17:23 UTC (permalink / raw)
  To: Anca Emanuel
  Cc: Morten Rasmussen, Paul Turner, linux-kernel, Venki Pallipadi,
	Srivatsa Vaddagiri, Peter Zijlstra, Mike Galbraith,
	Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan, Robin Randhawa, linaro-sched-sig

On Tue, Mar 13, 2012 at 07:08:54PM +0200, Anca Emanuel wrote:
> @Paul E. McKenney: ups, that was some special pgp code ?

Not that I am aware of.  Here is the body of the message again.

							Thanx, Paul

------------------------------------------------------------------------

On Mon, Mar 12, 2012 at 10:39:27AM +0000, Morten Rasmussen wrote:
> On Thu, Feb 02, 2012 at 01:38:26AM +0000, Paul Turner wrote:
> > As referenced above this also allows us to potentially improve decisions within
> > the load-balancer, for both distribution and power-management.
> >
> > Exmaple: consider 1x80% task  and 2x40% tasks on a 2-core machine. It's
> > currently a bit of a gamble as to whether you get an {AB, B} or {A,
> > BB} split since they have equal weight (assume 1024).  With per-task
> > tracking we can actually consider them at their contributed weight and
> > see a stable ~{800,{400, 400}} load-split.  Likewise within balance_tasks we can
> > consider the load migrated to be that actually contributed.
> 
> Hi Paul (and LKML),

Hello, Morten!

> As a follow up to the discussions held during the scheduler mini-summit
> at the last Linaro Connect I would like to share what I (working for
> ARM) have observed so far in my experiments with big.LITTLE scheduling.
> 
> I see task affinity on big.LITTLE systems as a combination of
> user-space affinity (via cgroups+cpuset etc) and introspective affinity
> as result of intelligent load balancing in the scheduler. I see the
> entity load tracking in this patch set as a step towards the latter. I
> am very interested in better task profiling in the scheduler as this is
> crucial for selecting which tasks that should go on which type of core.
> 
> I am using the patches for some very crude experiments with scheduling
> on big.LITTLE to explore possibilities and learn about potential issues.
> What I want to achieve is that high priority CPU-intensive tasks will
> be scheduled on fast and less power-efficient big cores and background
> tasks will be scheduled on power-efficient little cores. At the same
> time I would also like to minimize the performance impact experienced
> by the user. The following is a summary of the observation that I have
> made so far. I would appreciate comments and suggestions on the best way
> to go from here.
> 
> I have set up two sched_domains on a 4-core ARM system with two cores
> each that represents big and little clusters and disabled load balancing
> between them. The aim is to separate heavy and high priority tasks from
> less important tasks using the two domains. Based on load_avg_contrib
> tasks will be assigned to one of the domains by select_task_rq().
> However, this does not work out very well. If a task in the little
> domain suddenly consumes more CPU time and never goes to sleep it will
> never get the chance to migrate to the big domain. On a homogeneous
> system it doesn't really matter _where_ a task goes if imbalance is
> unavoidable as all cores have equal performance. For heterogeneous
> systems like big.LITTLE it makes a huge difference. To mitigate this
> issue I am periodically checking the currently running task on each
> little core to see if a CPU-intensive task is stuck there. If there is
> it will be migrated to a core in the big domain using
> stop_one_cpu_nowait() similar to the active load balance mechanism. It
> is not a pretty solution, so I am open for suggestions. Furthermore, by
> only checking the current task there is a chance of missing busy tasks
> waiting on the runqueue but checking the entire runqueue seems too
> expensive.
> 
> My observations are based on a simple mobile workload modelling web
> browsing. That is basically two threads waking up occasionally to render
> a web page. Using my current setup the most CPU intensive of the two
> will be scheduled on the big cluster as intended. The remaining
> background threads are always on the little cluster leaving the big
> cluster idle when it is not rendering to save power. The
> task-stuck-on-little problem can most easily be observed with CPU
> intensive workloads such the sysbench CPU workload.
> 
> I have looked at traces of both runnable time and usage time trying to
> understand why you use runnable time as your load metric and not usage
> time which seems more intuitive. What I see is that runnable time
> depends on the total runqueue load. If you have many tasks on the
> runqueue they will wait longer and therefore have higher individual
> load_avg_contrib than they would if the were scheduled across more CPUs.
> Usage time is also affected by the number of tasks on the runqueue as
> more tasks means less CPU time. However, less usage can also just mean
> that the task does not execute very often. This would make a load
> contribution estimate based on usage time less accurate. Is this your
> reason for choosing runnable time?

It might be a tradeoff between accuracy of scheduling and CPU cost of
scheduling, but I have to defer to Peter Z, Paul Turner, and the rest
of the scheduler guys on this one.

							Thanx, Paul

> Do you have any thoughts or comments on how entity load tracking could
> be applied to introspectively select tasks for appropriate CPUs in
> system like big.LITTLE?
> 
> Best regards,
> Morten
> 


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 00/14] sched: entity load-tracking re-work
  2012-03-12 10:39 ` Morten Rasmussen
  2012-03-13 16:44   ` Paul E. McKenney
@ 2012-03-13 17:28   ` Peter Zijlstra
  1 sibling, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2012-03-13 17:28 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Paul Turner, linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan, Paul E. McKenney, Robin Randhawa

On Mon, 2012-03-12 at 10:39 +0000, Morten Rasmussen wrote:
> I have looked at traces of both runnable time and usage time trying to
> understand why you use runnable time as your load metric and not usage
> time which seems more intuitive. What I see is that runnable time
> depends on the total runqueue load. If you have many tasks on the
> runqueue they will wait longer and therefore have higher individual
> load_avg_contrib than they would if the were scheduled across more CPUs.
> Usage time is also affected by the number of tasks on the runqueue as
> more tasks means less CPU time. However, less usage can also just mean
> that the task does not execute very often. This would make a load
> contribution estimate based on usage time less accurate. Is this your
> reason for choosing runnable time? 

Exactly so: you cannot ever have more than 100% usage, so no matter how
many tasks you stick on a cpu you'll never get over that 100%, and thus
usage is not a usable load metric.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 00/14] sched: entity load-tracking re-work
  2012-03-13 17:08     ` Anca Emanuel
  2012-03-13 17:23       ` Paul E. McKenney
@ 2012-03-14  9:03       ` Amit Kucheria
  2012-03-14 19:19         ` Paul E. McKenney
  1 sibling, 1 reply; 68+ messages in thread
From: Amit Kucheria @ 2012-03-14  9:03 UTC (permalink / raw)
  To: Anca Emanuel
  Cc: paulmck, Srivatsa Vaddagiri, Peter Zijlstra, linaro-sched-sig,
	Mike Galbraith, linux-kernel, Kamalesh Babulal, Ben Segall,
	Ingo Molnar, Vaidyanathan Srinivasan

On 12 Mar 13, Anca Emanuel wrote:
> @Paul E. McKenney: ups, that was some special pgp code ?
> 

It does look like gibberish in the gmail interface. Mutt seems to have
no problems.

/Amit

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 14/14] sched: implement usage tracking
  2012-03-13 16:57   ` [RFC PATCH 14/14] sched: implement usage tracking Vincent Guittot
@ 2012-03-14 15:01     ` Peter Zijlstra
  2012-03-14 15:45       ` Vincent Guittot
  2012-03-14 15:47       ` Paul Turner
  2012-03-14 15:44     ` Paul Turner
  1 sibling, 2 replies; 68+ messages in thread
From: Peter Zijlstra @ 2012-03-14 15:01 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Paul Turner, linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

On Tue, 2012-03-13 at 17:57 +0100, Vincent Guittot wrote:
> >  struct sched_avg {
> >        u64 runnable_avg_sum, runnable_avg_period;
> >        u64 last_runnable_update, decay_count;
> > +       u32 usage_avg_sum;
> 
> Why usage_avg_sum is 32bits whereas runnable_avg_sum and
> runnable_avg_period are 64bits long ? You are doing the same
> computation on these 3 variables. Only the computation need to be done
> in 64bits but the result could be saved in 32bits ? 

Since you can never use more than 100% of cpu time, usage_avg_sum is
bounded by 100% of the period, which (assuming your period is < ~4s)
should thus still fit within the ~4s that a u32 provides.

Runnable, otoh, is not bounded like that, so we cannot guarantee the
value stays within the ~4s range.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 14/14] sched: implement usage tracking
  2012-03-13 16:57   ` [RFC PATCH 14/14] sched: implement usage tracking Vincent Guittot
  2012-03-14 15:01     ` Peter Zijlstra
@ 2012-03-14 15:44     ` Paul Turner
  1 sibling, 0 replies; 68+ messages in thread
From: Paul Turner @ 2012-03-14 15:44 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Peter Zijlstra, Mike Galbraith, Kamalesh Babulal, Ben Segall,
	Ingo Molnar, Vaidyanathan Srinivasan

On Tue, Mar 13, 2012 at 9:57 AM, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
> Hi Paul,
>
> On 2 February 2012 02:38, Paul Turner <pjt@google.com> wrote:
>> With the frame-work for runnable tracking now fully in place.  Per-entity usage
>> tracking is a simple and low-overhead addition.
>>
>> Signed-off-by: Paul Turner <pjt@google.com>
>> Signed-off-by: Ben Segall <bsegall@google.com>
>> ---
>>  include/linux/sched.h |    1 +
>>  kernel/sched/debug.c  |    3 +++
>>  kernel/sched/fair.c   |   29 +++++++++++++++++++++++------
>>  kernel/sched/sched.h  |    4 ++--
>>  4 files changed, 29 insertions(+), 8 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 09b8c45..209185f 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1159,6 +1159,7 @@ struct load_weight {
>>  struct sched_avg {
>>        u64 runnable_avg_sum, runnable_avg_period;
>>        u64 last_runnable_update, decay_count;
>> +       u32 usage_avg_sum;
>
> Why usage_avg_sum is 32bits whereas runnable_avg_sum and
> runnable_avg_period are 64bits long ? You are doing the same
> computation on these 3 variables. Only the computation need to be done
> in 64bits but the result could be saved in 32bits ?

Yes, these fit in 32 bits.

Sorry I have not had a chance to post v2 yet; I've been buried with
unrelated work on context-switching and LinSched[*].  Now that those
monkeys are wrestled to the ground I'll get this moving again.

[*] LinSched on v3.3-rc7 is now up at:
http://git.kernel.org/?p=linux/kernel/git/pjt/linsched.git;a=shortlog;h=refs/heads/linsched-v3.3-alpha
I'll be sending an announcement email in a few hours.

>
> Regards,
> Vincent
>
>>        unsigned long load_avg_contrib;
>>
>>        int contributes_blocked_load;
>> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>> index 0911ec6..4d39069 100644
>> --- a/kernel/sched/debug.c
>> +++ b/kernel/sched/debug.c
>> @@ -93,6 +93,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
>>        P(se->load.weight);
>>        P(se->avg.runnable_avg_sum);
>>        P(se->avg.runnable_avg_period);
>> +       P(se->avg.usage_avg_sum);
>>        P(se->avg.load_avg_contrib);
>>        P(se->avg.decay_count);
>>  #undef PN
>> @@ -228,6 +229,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
>>                        cfs_rq->tg_runnable_contrib);
>>        SEQ_printf(m, "  .%-30s: %d\n", "tg->runnable_avg",
>>                        atomic_read(&cfs_rq->tg->runnable_avg));
>> +       SEQ_printf(m, "  .%-30s: %d\n", "tg->usage_avg",
>> +                       atomic_read(&cfs_rq->tg->usage_avg));
>>  #endif
>>
>>        print_cfs_group_stats(m, cpu, cfs_rq->tg);
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index ad524bb..222c2c9 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -997,7 +997,8 @@ static u32 __compute_runnable_contrib(int n)
>>  */
>>  static __always_inline int __update_entity_runnable_avg(u64 now,
>>                                                        struct sched_avg *sa,
>> -                                                       int runnable)
>> +                                                       int runnable,
>> +                                                       int running)
>>  {
>>        u64 delta;
>>        u32 periods, runnable_contrib;
>> @@ -1026,6 +1027,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
>>                delta_w = 1024 - delta_w;
>>                if (runnable)
>>                        sa->runnable_avg_sum += delta_w;
>> +               if (running)
>> +                       sa->usage_avg_sum += delta_w;
>>                sa->runnable_avg_period += delta_w;
>>
>>                delta -= delta_w;
>> @@ -1036,15 +1039,20 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
>>                                                  periods + 1);
>>                sa->runnable_avg_period = decay_load(sa->runnable_avg_period,
>>                                                  periods + 1);
>> +               sa->usage_avg_sum = decay_load(sa->usage_avg_sum, periods + 1);
>>
>>                runnable_contrib = __compute_runnable_contrib(periods);
>>                if (runnable)
>>                        sa->runnable_avg_sum += runnable_contrib;
>> +               if (running)
>> +                       sa->usage_avg_sum += runnable_contrib;
>>                sa->runnable_avg_period += runnable_contrib;
>>        }
>>
>>        if (runnable)
>>                sa->runnable_avg_sum += delta;
>> +       if (running)
>> +               sa->usage_avg_sum += delta;
>>        sa->runnable_avg_period += delta;
>>
>>        return decayed;
>> @@ -1081,14 +1089,21 @@ static inline void __update_tg_runnable_avg(struct sched_avg *sa,
>>                                                  struct cfs_rq *cfs_rq)
>>  {
>>        struct task_group *tg = cfs_rq->tg;
>> -       long contrib;
>> +       long contrib, usage_contrib;
>>
>>        contrib = (sa->runnable_avg_sum << 12) / (sa->runnable_avg_period + 1);
>>        contrib -= cfs_rq->tg_runnable_contrib;
>>
>> -       if (abs(contrib) > cfs_rq->tg_runnable_contrib/64) {
>> +       usage_contrib = (sa->usage_avg_sum << 12) / (sa->runnable_avg_period + 1);
>> +       usage_contrib -= cfs_rq->tg_usage_contrib;
>> +
>> +       if ((abs(contrib) > cfs_rq->tg_runnable_contrib/64) ||
>> +           (abs(usage_contrib) > cfs_rq->tg_usage_contrib/64)) {
>>                atomic_add(contrib, &tg->runnable_avg);
>>                cfs_rq->tg_runnable_contrib += contrib;
>> +
>> +               atomic_add(usage_contrib, &tg->usage_avg);
>> +               cfs_rq->tg_usage_contrib += usage_contrib;
>>        }
>>  }
>>
>> @@ -1164,8 +1179,8 @@ static inline void update_entity_load_avg(struct sched_entity *se,
>>        struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>        long contrib_delta;
>>
>> -       if(!__update_entity_runnable_avg(rq_of(cfs_rq)->clock_task, &se->avg,
>> -                                       se->on_rq))
>> +       if(!__update_entity_runnable_avg(cfs_rq_clock_task(cfs_rq), &se->avg,
>> +                                        se->on_rq, cfs_rq->curr == se))
>>                return;
>>
>>        contrib_delta = __update_entity_load_avg_contrib(se);
>> @@ -1210,7 +1225,8 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
>>
>>  static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
>>  {
>> -       __update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
>> +       __update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable,
>> +                                    runnable);
>>        __update_tg_runnable_avg(&rq->avg, &rq->cfs);
>>  }
>>
>> @@ -1590,6 +1606,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>                 */
>>                update_stats_wait_end(cfs_rq, se);
>>                __dequeue_entity(cfs_rq, se);
>> +               update_entity_load_avg(se, 1);
>>        }
>>
>>        update_stats_curr_start(cfs_rq, se);
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 77f3297..9602a47 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -117,7 +117,7 @@ struct task_group {
>>
>>        atomic_t load_weight;
>>        atomic64_t load_avg;
>> -       atomic_t runnable_avg;
>> +       atomic_t runnable_avg, usage_avg;
>>  #endif
>>
>>  #ifdef CONFIG_RT_GROUP_SCHED
>> @@ -260,7 +260,7 @@ struct cfs_rq {
>>         */
>>        unsigned long h_load;
>>
>> -       u32 tg_runnable_contrib;
>> +       u32 tg_runnable_contrib, tg_usage_contrib;
>>        u64 runnable_load_avg, blocked_load_avg;
>>        u64 tg_load_contrib;
>>        atomic64_t decay_counter, removed_load;
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 14/14] sched: implement usage tracking
  2012-03-14 15:01     ` Peter Zijlstra
@ 2012-03-14 15:45       ` Vincent Guittot
  2012-03-14 15:47       ` Paul Turner
  1 sibling, 0 replies; 68+ messages in thread
From: Vincent Guittot @ 2012-03-14 15:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul Turner, linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Vaidyanathan Srinivasan

On 14 March 2012 16:01, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> On Tue, 2012-03-13 at 17:57 +0100, Vincent Guittot wrote:
> > >  struct sched_avg {
> > >        u64 runnable_avg_sum, runnable_avg_period;
> > >        u64 last_runnable_update, decay_count;
> > > +       u32 usage_avg_sum;
> >
> > Why usage_avg_sum is 32bits whereas runnable_avg_sum and
> > runnable_avg_period are 64bits long ? You are doing the same
> > computation on these 3 variables. Only the computation need to be done
> > in 64bits but the result could be saved in 32bits ?
>
> Since you can never use more than 100% of cpu time, usage_avg_sum is
> bound to 100% of the period, which (assuming your period < ~4s) should
> thus still fit in the ~4s u32 provides.
>
> Runnable otoh is not bound by that and thus we cannot guarantee the
> value stays within the ~4s value range.

Shouldn't usage_avg_sum and runnable_avg_sum be equal for the rq,
because the running and runnable parameters are equal and we have the
same computation?

Furthermore, the geometric series should ensure that the final value
can't be larger than 1024/(1-y), which is around 47000.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 14/14] sched: implement usage tracking
  2012-03-14 15:01     ` Peter Zijlstra
  2012-03-14 15:45       ` Vincent Guittot
@ 2012-03-14 15:47       ` Paul Turner
  2012-03-15 10:52         ` Peter Zijlstra
  1 sibling, 1 reply; 68+ messages in thread
From: Paul Turner @ 2012-03-14 15:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, linux-kernel, Venki Pallipadi,
	Srivatsa Vaddagiri, Mike Galbraith, Kamalesh Babulal, Ben Segall,
	Ingo Molnar, Vaidyanathan Srinivasan

On Wed, Mar 14, 2012 at 8:01 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2012-03-13 at 17:57 +0100, Vincent Guittot wrote:
>> >  struct sched_avg {
>> >        u64 runnable_avg_sum, runnable_avg_period;
>> >        u64 last_runnable_update, decay_count;
>> > +       u32 usage_avg_sum;
>>
>> Why usage_avg_sum is 32bits whereas runnable_avg_sum and
>> runnable_avg_period are 64bits long ? You are doing the same
>> computation on these 3 variables. Only the computation need to be done
>> in 64bits but the result could be saved in 32bits ?
>
> Since you can never use more than 100% of cpu time, usage_avg_sum is
> bound to 100% of the period, which (assuming your period < ~4s) should
> thus still fit in the ~4s u32 provides.
>
> Runnable otoh is not bound by that and thus we cannot guarantee the
> value stays within the ~4s value range.

Actually, for runnable we can also make such a guarantee, since:

runnable_sum <= \Sum 1024 * k^p --> 1024/(1-k)  [geometric series, k<1]
--> ~48k for our choice of k.

(We do, however, need 64 bits for any values that accumulate sums of
loads, e.g. removed_load and *_load_sum.)
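
As a quick sanity check of that bound, here is a tiny userspace program
(plain C, not kernel code) that iterates the decayed accumulation for an
always-runnable entity and compares it against the closed form:

#include <math.h>
#include <stdio.h>

int main(void)
{
        double y = pow(0.5, 1.0 / 32.0);        /* y^32 == 1/2 */
        double sum = 0.0;
        int i;

        /* always runnable: add 1024 each period, decay the rest by y */
        for (i = 0; i < 1000; i++)
                sum = sum * y + 1024.0;

        printf("1024/(1-y) = %.0f\n", 1024.0 / (1.0 - y));
        printf("iterated   = %.0f\n", sum);
        return 0;
}

Both print roughly 48k (consistent with the ~47k figure above),
comfortably within u32 range.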

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 00/14] sched: entity load-tracking re-work
  2012-03-12 10:57 ` Vincent Guittot
@ 2012-03-14 15:59   ` Paul Turner
  2012-03-15  9:59     ` Vincent Guittot
  2012-04-25 13:07     ` Vincent Guittot
  0 siblings, 2 replies; 68+ messages in thread
From: Paul Turner @ 2012-03-14 15:59 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Peter Zijlstra, Mike Galbraith, Kamalesh Babulal, Ben Segall,
	Ingo Molnar, Vaidyanathan Srinivasan

On Mon, Mar 12, 2012 at 3:57 AM, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
> Hi Paul,
>
> On 2 February 2012 02:38, Paul Turner <pjt@google.com> wrote:
>> Hi all,
>>
>> The attached series is an RFC on implementing load tracking at the entity
>> instead of cfs_rq level. This results in a bottom-up load-computation in which
>> entities contribute to their parents load, as opposed to the current top-down
>> where the parent averages its children.  In particular this allows us to
>> correctly migrate load with their accompanying entities and provides the
>> necessary inputs for intelligent load-balancing and power-management.
>>
>> It was previously well tested and stable, but that was on v3.1-; there's been
>> some fairly extensive changes in the wake-up path since so apologies if anything
>> was broken in the rebase.Note also, since this is also an RFC on the approach I
>> have not yet de-linted the various CONFIG combinations for introduced compiler
>> errors.
>>
>> Background:
>> ----------
>> We currently track the load average at the parenting cfs_rq level. We divide
>> execution into a series of consecutive "windows, w_i".  Within each we track:
>>  \Sum load_i * t_i where \Sum t_i = w and each load_i a disjoint load level.
>>
>> The load average is then \Sum w_j / 2^n.
>>
>> This this works reasonably well but there's a few problems
>> 1) Since decay occurs at boundary window boundary there are 'skews':
>> At a window boundary the 'foreground' time has a bias against the
>> time immediately preceding it (as a result of the folding division)
>> e.g.  __xx_|_yyyy___ vs __xx_yyyy_|___  (where '|' is a window boundary).
>>
>> The accounting here is 2x + 4y/2 or 2x + 4y, depending on which side of the
>> window your load lands on.
>>
>> 2) Everything within a window 'w' is accounted equally, we only fold at
>> the boundaries.  This actually means you can't set 'w' large enough to
>> really have meaningful coverage of the sched period without throwing
>> decay out the window.  But then with w/2 << sched_period (currently
>> w/2=5ms) the average ends up having fairly drastic swings even with
>> stable loads.
>>
>> (Note:  Before we even looked at migrating to per-entity tracking we evaluating
>> foreground window into the considered average until it was "complete", this
>> represented a measurable improvement in stability under predictable load.)
>>
>> 3) Since the load sum average is per-cfs_rq and not per-entity when a task
>> entity migrates we lose its contribution to load-sum and effectively double
>> count it while it former sum decays.
>>
>> New approach:
>> -------------
>> Instead of tracking load on a per-cfs_rq basis we do it on a per-sched_entity
>> basis.  The load sum for a cfs_rq is then just the sum of its childrens' load
>> averages.  The obvious immediately nice property is that when we migrate thread
>> A from cpu1-->cpu2, we carry its load with it; leaving the global load sum
>> unmodified.
>>
>> The 'windows' above are replaced with more fine-grained tracking, that is (for
>> an entity j):
>>
>> runnable_j = u_i * y^i , load_avg_j = runnable_j * weight_j [*]
>>
>> Where: u_i is the usage in the last i`th ~ms and y is chosen such that
>> y^~sched_period = 1/2 (we currently choose k=32).This means that load tracked 1
>> sched_period ago contributes about ~50% as current execution.
>
> We have a sched_period of 30ms for a 16 cores system but it's only 12
> ms on a dual core. Do you think that it's important to keep this rule
> (y^~sched_period = 1/2) whatever the number of core is ? The
> sched_period also increases with a large number of running thread
>

Yes, both of these points are valid.

We could consider tuning it on demand; however, I'm not sure it's
required.  The only real requirement for stability here is that we
have a period that's typically larger than the scheduling quantum
(although without being so large that decay is not meaningful!).

This does not hold in the opposite direction; that is, it's not a
requirement that we not span several quanta, only that we span at
least *one*.

We ran it across a variety of topologies (both large and small) in
LinSched and did not see any immediate ill effects.  I would suggest
we play this one by ear and see if a negative case arises.

>>
>> Now, the real challenge in tracking at an entity basis is handling blocked
>> entities.  Obviously runnable entities will be updated naturally as we iterate
>> through them but there could be O(many) blocked entities so we can't afford to
>> iterate through them and update their tracking.
>>
>
> I have run sysbench of my 32bits ARM platform. The results are available below:
>
> config 1 : v3.3-rc3 and CONFIG_FAIR_GROUP_SCHED not set
> config 2 : v3.3-rc3 and CONFIG_FAIR_GROUP_SCHED set
> config 3 : v3.3-rc3 with your patches and  CONFIG_FAIR_GROUP_SCHED not set
> config 4 : v3.3-rc3 with your patches and  CONFIG_FAIR_GROUP_SCHED set
>
> sysbench --test=cpu --num-threads=12 --max-time=10 run
>             total number of events
>             test1 test2 test3
> config1    336   336   338
> config2    336   337   338
> config3    336   338   336
> config4    336   338   337
>
> sysbench --test=threads --thread-locks=9 --num-threads=12 --max-time=10 run
>             total number of events
>             test1 test2 test3
> config1   5218 5228 5203
> config2   5204 5217 5251
> config3   5221 5219 5217
> config4   4859 4796 4836
>
>
> Whereas we have no difference for the cpu test (all the threads on in
> the run queue waiting for a core), we can see a decrease for the
> threads test. I'm wondering if it's linked to the fact that I have a
> 32bits machine doing 64bits division. Have you seen such kind of
> difference on a 64bits system ?
>

So I unfortunately do not have a (non-x86) 32-bit system to
meaningfully benchmark this on.  And yes, 32-bit was my only *real*
worry, so I'm not completely shocked to see an initial regression here
:-(

With 32-bit performance in mind there are a few obvious things to
change in our runnable_sum accumulations (these can fit in 32 bits).
Once we grab that low-hanging fruit I'm happy to work with you to see
what remains.  Do you have any such systems that are externally
accessible, perchance, or would you prefer several differently tuned
patch-bombs?  :)


> Regards,
> Vincent
>
>> That our decay for a unit period is exponential introduces a particularly nice
>> property here:
>> We can separate the contributed load on a cfs_rq into blocked and runnable.
>> Runnable load is just the sum of all load_avg_j above, maintained by the
>> entities themselves as they run and self update (when they update their
>> load_avg they update the cumulative sum also).
>>
>> Blocked load then looks like:
>>  load_avg_j = weight_k * \Sum u[k]_n * y^n
>>
>> To account an entirely idle period we then only need to multiply by y.
>>
>> This ends up being orders of magnitude more accurate than the current
>> tracking schema, even with the old shares_window massively dilated to
>> better capture a stable load-state.  It's also obviously stable under
>> migration.
>>
>> As referenced above this also allows us to potentially improve decisions within
>> the load-balancer, for both distribution and power-management.
>>
>> Exmaple: consider 1x80% task  and 2x40% tasks on a 2-core machine.  It's
>> currently a bit of a gamble as to whether you get an {AB, B} or {A,
>> BB} split since they have equal weight (assume 1024).  With per-task
>> tracking we can actually consider them at their contributed weight and
>> see a stable ~{800,{400, 400}} load-split.  Likewise within balance_tasks we can
>> consider the load migrated to be that actually contributed.
>>
>> Thanks,
>>
>> - Paul
>>
>>
>> ---
>>
>> Ben Segall (1):
>>      sched: maintain per-rq runnable averages
>>
>> Paul Turner (13):
>>      sched: track the runnable average on a per-task entitiy basis
>>      sched: aggregate load contributed by task entities on parenting cfs_rq
>>      sched: maintain the load contribution of blocked entities
>>      sched: account for blocked load waking back up
>>      sched: aggregate total task_group load
>>      sched: compute load contribution by a group entity
>>      sched: normalize tg load contributions against runnable time
>>      sched: maintain runnable averages across throttled periods
>>      sched: replace update_shares weight distribution with per-entity computation
>>      sched: refactor update_shares_cpu() -> update_blocked_avgs()
>>      sched: update_cfs_shares at period edge
>>      sched: make __update_entity_runnable_avg() fast
>>      sched: implement usage tracking
>>
>>
>>  include/linux/sched.h |   11 +
>>  kernel/sched/core.c   |    3
>>  kernel/sched/debug.c  |   37 ++-
>>  kernel/sched/fair.c   |  714 ++++++++++++++++++++++++++++++++++++++-----------
>>  kernel/sched/sched.h  |   29 +-
>>  5 files changed, 611 insertions(+), 183 deletions(-)
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 00/14] sched: entity load-tracking re-work
  2012-03-14  9:03       ` Amit Kucheria
@ 2012-03-14 19:19         ` Paul E. McKenney
  0 siblings, 0 replies; 68+ messages in thread
From: Paul E. McKenney @ 2012-03-14 19:19 UTC (permalink / raw)
  To: Anca Emanuel, Srivatsa Vaddagiri, Peter Zijlstra,
	linaro-sched-sig, Mike Galbraith, linux-kernel, Kamalesh Babulal,
	Ben Segall, Ingo Molnar, Vaidyanathan Srinivasan

On Wed, Mar 14, 2012 at 11:03:27AM +0200, Amit Kucheria wrote:
> On 12 Mar 13, Anca Emanuel wrote:
> > @Paul E. McKenney: ups, that was some special pgp code ?
> > 
> 
> It does look gibberish in the gmail interface. Mutt seems to have no problems.

Perhaps this demonstrates that gmail is more intelligent (and more
judgmental) than mutt.  Me, I will stick with mutt.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 00/14] sched: entity load-tracking re-work
  2012-03-14 15:59   ` Paul Turner
@ 2012-03-15  9:59     ` Vincent Guittot
  2012-04-25 13:07     ` Vincent Guittot
  1 sibling, 0 replies; 68+ messages in thread
From: Vincent Guittot @ 2012-03-15  9:59 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Peter Zijlstra, Mike Galbraith, Kamalesh Babulal, Ben Segall,
	Ingo Molnar, Vaidyanathan Srinivasan

On 14 March 2012 16:59, Paul Turner <pjt@google.com> wrote:
>
> On Mon, Mar 12, 2012 at 3:57 AM, Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> > Hi Paul,
> >
> > On 2 February 2012 02:38, Paul Turner <pjt@google.com> wrote:
> >> Hi all,
> >>
> >> The attached series is an RFC on implementing load tracking at the entity
> >> instead of cfs_rq level. This results in a bottom-up load-computation in which
> >> entities contribute to their parents load, as opposed to the current top-down
> >> where the parent averages its children.  In particular this allows us to
> >> correctly migrate load with their accompanying entities and provides the
> >> necessary inputs for intelligent load-balancing and power-management.
> >>
> >> It was previously well tested and stable, but that was on v3.1-; there's been
> >> some fairly extensive changes in the wake-up path since so apologies if anything
> >> was broken in the rebase.Note also, since this is also an RFC on the approach I
> >> have not yet de-linted the various CONFIG combinations for introduced compiler
> >> errors.
> >>
> >> Background:
> >> ----------
> >> We currently track the load average at the parenting cfs_rq level. We divide
> >> execution into a series of consecutive "windows, w_i".  Within each we track:
> >>  \Sum load_i * t_i where \Sum t_i = w and each load_i a disjoint load level.
> >>
> >> The load average is then \Sum w_j / 2^n.
> >>
> >> This this works reasonably well but there's a few problems
> >> 1) Since decay occurs at boundary window boundary there are 'skews':
> >> At a window boundary the 'foreground' time has a bias against the
> >> time immediately preceding it (as a result of the folding division)
> >> e.g.  __xx_|_yyyy___ vs __xx_yyyy_|___  (where '|' is a window boundary).
> >>
> >> The accounting here is 2x + 4y/2 or 2x + 4y, depending on which side of the
> >> window your load lands on.
> >>
> >> 2) Everything within a window 'w' is accounted equally, we only fold at
> >> the boundaries.  This actually means you can't set 'w' large enough to
> >> really have meaningful coverage of the sched period without throwing
> >> decay out the window.  But then with w/2 << sched_period (currently
> >> w/2=5ms) the average ends up having fairly drastic swings even with
> >> stable loads.
> >>
> >> (Note:  Before we even looked at migrating to per-entity tracking we evaluating
> >> foreground window into the considered average until it was "complete", this
> >> represented a measurable improvement in stability under predictable load.)
> >>
> >> 3) Since the load sum average is per-cfs_rq and not per-entity when a task
> >> entity migrates we lose its contribution to load-sum and effectively double
> >> count it while it former sum decays.
> >>
> >> New approach:
> >> -------------
> >> Instead of tracking load on a per-cfs_rq basis we do it on a per-sched_entity
> >> basis.  The load sum for a cfs_rq is then just the sum of its childrens' load
> >> averages.  The obvious immediately nice property is that when we migrate thread
> >> A from cpu1-->cpu2, we carry its load with it; leaving the global load sum
> >> unmodified.
> >>
> >> The 'windows' above are replaced with more fine-grained tracking, that is (for
> >> an entity j):
> >>
> >> runnable_j = u_i * y^i , load_avg_j = runnable_j * weight_j [*]
> >>
> >> Where: u_i is the usage in the last i`th ~ms and y is chosen such that
> >> y^~sched_period = 1/2 (we currently choose k=32).This means that load tracked 1
> >> sched_period ago contributes about ~50% as current execution.
> >
> > We have a sched_period of 30ms for a 16-core system but it's only 12
> > ms on a dual core. Do you think that it's important to keep this rule
> > (y^~sched_period = 1/2) whatever the number of cores is? The
> > sched_period also increases with a large number of running threads.
> >
>
> Yes both of these points are valid.
>
> We could consider tuning it on demand; however, I'm not sure it's
> required.  The only real requirement for stability here is that we
> have a period that's typically larger than the scheduling quantum
> (although not so large that decay stops being meaningful!).
>
> This does not hold in the opposite direction, that is, it's not a
> requirement that we not span several quantums, only that we span at
> least *one*.
>
> We ran it across a variety of topologies (both large and small) in
> LinSched and did not see any immediate ill effects.  I would suggest
> we play this one by ear and see if a negative case arises.
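
As a rough sketch of the decay behaviour under discussion (assuming the
~1ms unit period and k=32 from the cover letter; illustrative only, not
code from the series):

#include <math.h>
#include <stdio.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);	/* y^32 = 1/2, y ~= 0.97857 */
	int ms;

	for (ms = 0; ms <= 128; ms += 32)
		printf("load from %3dms ago still contributes %5.2f%%\n",
		       ms, 100.0 * pow(y, ms));
	return 0;
}

With k fixed at 32 the half-life stays at roughly 32ms of tracked time
whatever the topology; the requirement discussed above is only that this
comfortably exceed a single scheduling quantum.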
>
> >>
> >> Now, the real challenge in tracking at an entity basis is handling blocked
> >> entities.  Obviously runnable entities will be updated naturally as we iterate
> >> through them but there could be O(many) blocked entities so we can't afford to
> >> iterate through them and update their tracking.
> >>
> >
> > I have run sysbench on my 32-bit ARM platform. The results are available below:
> >
> > config 1 : v3.3-rc3 and CONFIG_FAIR_GROUP_SCHED not set
> > config 2 : v3.3-rc3 and CONFIG_FAIR_GROUP_SCHED set
> > config 3 : v3.3-rc3 with your patches and  CONFIG_FAIR_GROUP_SCHED not set
> > config 4 : v3.3-rc3 with your patches and  CONFIG_FAIR_GROUP_SCHED set
> >
> > sysbench --test=cpu --num-threads=12 --max-time=10 run
> >             total number of events
> >             test1 test2 test3
> > config1    336   336   338
> > config2    336   337   338
> > config3    336   338   336
> > config4    336   338   337
> >
> > sysbench --test=threads --thread-locks=9 --num-threads=12 --max-time=10 run
> >             total number of events
> >             test1 test2 test3
> > config1   5218 5228 5203
> > config2   5204 5217 5251
> > config3   5221 5219 5217
> > config4   4859 4796 4836
> >
> >
> > Whereas we have no difference for the cpu test (all the threads are in
> > the run queue waiting for a core), we can see a decrease for the
> > threads test. I'm wondering if it's linked to the fact that I have a
> > 32-bit machine doing 64-bit division. Have you seen this kind of
> > difference on a 64-bit system?
> >
>
> So I unfortunately do not have a (non-x86) 32-bit system to
> meaningfully benchmark this on.  And yes, 32-bit was my only *real*
> worry so I'm not completely shocked to see an initial regression here
> :-(
>
> With 32-bit performance in mind there are a few obvious things to
> change in our runnable_sum accumulations (these can fit in 32-bit).
> Once we grab those low hanging fruit I'm happy to work with you to see
> what remains.  Do you have any such systems that are externally accessible,
> perchance, or would you prefer several differently tuned patch-bombs?
> :)

I will be pleased to help you with (non-x86) 32-bit systems.
Let me check how we can arrange such tests.

>
>
>
> > Regards,
> > Vincent
> >
> >> That our decay for a unit period is exponential introduces a particularly nice
> >> property here:
> >> We can separate the contributed load on a cfs_rq into blocked and runnable.
> >> Runnable load is just the sum of all load_avg_j above, maintained by the
> >> entities themselves as they run and self update (when they update their
> >> load_avg they update the cumulative sum also).
> >>
> >> Blocked load then looks like:
> >>  load_avg_j = weight_k * \Sum u[k]_n * y^n
> >>
> >> To account an entirely idle period we then only need to multiply by y.
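
A minimal sketch of that property (illustrative only; decay_blocked is an
invented name, not a function from the series): an idle stretch is charged
to the cached aggregate with one multiply per unit period, with no walk
over the individual blocked entities.

#include <stdio.h>

/*
 * Every per-entity sum decays by the same factor y per unit period, so an
 * idle stretch can be applied to the cfs_rq's cached blocked total directly;
 * no iteration over the blocked entities is required.
 */
static double decay_blocked(double blocked_sum, int idle_periods)
{
	const double y = 0.97857206;	/* y^32 ~= 1/2 */

	while (idle_periods-- > 0)
		blocked_sum *= y;
	return blocked_sum;
}

int main(void)
{
	/* e.g. three blocked tasks contributing ~400 each, idle for ~32ms */
	printf("%.0f\n", decay_blocked(1200.0, 32));	/* prints ~600 */
	return 0;
}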
> >>
> >> This ends up being orders of magnitude more accurate than the current
> >> tracking schema, even with the old shares_window massively dilated to
> >> better capture a stable load-state.  It's also obviously stable under
> >> migration.
> >>
> >> As referenced above this also allows us to potentially improve decisions within
> >> the load-balancer, for both distribution and power-management.
> >>
> >> Example: consider one 80% task (A) and two 40% tasks (B) on a 2-core
> >> machine.  It's currently a bit of a gamble as to whether you get an
> >> {AB, B} or {A, BB} split since they have equal weight (assume 1024).
> >> With per-task tracking we can actually consider them at their
> >> contributed weight and see a stable ~{800, {400, 400}} load-split.
> >> Likewise, within balance_tasks we can consider the load migrated to be
> >> that actually contributed.
> >>
> >> Thanks,
> >>
> >> - Paul
> >>
> >>
> >> ---
> >>
> >> Ben Segall (1):
> >>      sched: maintain per-rq runnable averages
> >>
> >> Paul Turner (13):
> >>      sched: track the runnable average on a per-task entitiy basis
> >>      sched: aggregate load contributed by task entities on parenting cfs_rq
> >>      sched: maintain the load contribution of blocked entities
> >>      sched: account for blocked load waking back up
> >>      sched: aggregate total task_group load
> >>      sched: compute load contribution by a group entity
> >>      sched: normalize tg load contributions against runnable time
> >>      sched: maintain runnable averages across throttled periods
> >>      sched: replace update_shares weight distribution with per-entity computation
> >>      sched: refactor update_shares_cpu() -> update_blocked_avgs()
> >>      sched: update_cfs_shares at period edge
> >>      sched: make __update_entity_runnable_avg() fast
> >>      sched: implement usage tracking
> >>
> >>
> >>  include/linux/sched.h |   11 +
> >>  kernel/sched/core.c   |    3
> >>  kernel/sched/debug.c  |   37 ++-
> >>  kernel/sched/fair.c   |  714 ++++++++++++++++++++++++++++++++++++++-----------
> >>  kernel/sched/sched.h  |   29 +-
> >>  5 files changed, 611 insertions(+), 183 deletions(-)
> >>
> >>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 14/14] sched: implement usage tracking
  2012-03-14 15:47       ` Paul Turner
@ 2012-03-15 10:52         ` Peter Zijlstra
  0 siblings, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2012-03-15 10:52 UTC (permalink / raw)
  To: Paul Turner
  Cc: Vincent Guittot, linux-kernel, Venki Pallipadi,
	Srivatsa Vaddagiri, Mike Galbraith, Kamalesh Babulal, Ben Segall,
	Ingo Molnar, Vaidyanathan Srinivasan

On Wed, 2012-03-14 at 08:47 -0700, Paul Turner wrote:
> On Wed, Mar 14, 2012 at 8:01 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > On Tue, 2012-03-13 at 17:57 +0100, Vincent Guittot wrote:
> >> >  struct sched_avg {
> >> >        u64 runnable_avg_sum, runnable_avg_period;
> >> >        u64 last_runnable_update, decay_count;
> >> > +       u32 usage_avg_sum;
> >>
> >> Why is usage_avg_sum 32 bits whereas runnable_avg_sum and
> >> runnable_avg_period are 64 bits long? You are doing the same
> >> computation on these 3 variables. Only the computation needs to be done
> >> in 64 bits; couldn't the result be saved in 32 bits?
> >
> > Since you can never use more than 100% of cpu time, usage_avg_sum is
> > bound to 100% of the period, which (assuming your period < ~4s) should
> > thus still fit in the ~4s u32 provides.
> >
> > Runnable otoh is not bound by that and thus we cannot guarantee the
> > value stays within the ~4s value range.
> 
> Actually for runnable we can also make such a guarantee since:
> 
> runnable_sum <= \Sum 1024 * k^p --> 1024/(1-k) [geometric series, k<1]
> --> ~48k for our choice of k.
> 
> (We do however need 64-bits on any values that accumulate sums of
> loads, e.g. removed_load and *_load_sum.)

Only for the single entry, right? For the aggregated case the runnable
count can be nr_running times your limit, and IIRC (but my brain is
> fuzzy since it's been a while since I looked at this stuff) you use the
same structure in both cases.
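
For reference, a standalone sketch of the bound in question (assuming
y^32 = 1/2 as in the cover letter; illustrative only, not kernel code):
the per-entity geometric series converges to roughly 48k, which fits
easily in a u32, while an aggregate over many runnable entities has no
such bound and is the thing that wants 64 bits.

#include <math.h>
#include <stdio.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);		/* y^32 = 1/2 */
	double per_entity = 1024.0 / (1.0 - y);		/* geometric series limit */
	double u32_max = 4294967295.0;

	printf("per-entity runnable_sum bound: ~%.0f (u32 max is %.0f)\n",
	       per_entity, u32_max);

	/* An aggregate is bounded only by nr_running times that limit, so
	 * rq-level (and weight-scaled) sums are the ones kept in 64 bits. */
	printf("~90000 runnable entities could reach ~%.0f -> past u32\n",
	       90000.0 * per_entity);
	return 0;
}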

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC PATCH 00/14] sched: entity load-tracking re-work
  2012-03-14 15:59   ` Paul Turner
  2012-03-15  9:59     ` Vincent Guittot
@ 2012-04-25 13:07     ` Vincent Guittot
  1 sibling, 0 replies; 68+ messages in thread
From: Vincent Guittot @ 2012-04-25 13:07 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Peter Zijlstra, Mike Galbraith, Kamalesh Babulal, Ben Segall,
	Ingo Molnar, Vaidyanathan Srinivasan, linaro-sched-sig

Hi Paul,

I have tested an optimization of your patch set for 32-bit ARM
processors and I can see an improvement of 1.8%-2.8% when I test it
with sysbench on a dual Cortex-A9 platform. The test is described
below:

config1: v3.4-rc3
config2: config1 + your patches
config3: config2 + use of 32-bit values for load avg tracking

sysbench --test=threads --num-threads=12 --thread-locks=9 --max-time=10

           results %config1 %config2
config1  5686
config2  5358     -5.8%
config3  5510     -3.0%     +2.8%

The patches are available in the following git tree
http://git.linaro.org/gitweb?p=people/vingu/kernel.git;a=shortlog;h=refs/heads/sched_load_tracking

While testing the patches, I have found that the results were tightly
linked to the location of the sched section in the memory mapping. I
have run the same tests but modified the location of the sched
section. The results below show the best value for each configuration.

           results %config1 %config2
config1  5895
config2  5479   -7.1%
config3  5577   -5.4%   +1.8%

I can still see an improvement with 32-bit values, even if it is smaller
than with the default mapping. Have you seen such variation when testing
your patches?

I'm going to continue to look at which variables can fit in a 32-bit
variable, but if you have optimizations that you would like to test on
a 32-bit platform, I will be pleased to test them. I'm trying to put in
place a way to automate the testing of your patches on an ARM platform
and the reporting of feedback, but it is taking a bit more time than
expected to make it ready.
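
As a sketch of the kind of 32-bit narrowing being discussed (illustrative
only, not taken from the patches above; Y_FIX and decay_once are invented
names): the per-period decay can be applied with a single 32x32->64
multiply and shift, so no 64-bit division is needed on the hot path.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical constant: y * 2^32, with y = 2^(-1/32) so that y^32 = 1/2. */
#define Y_FIX	4202935003u

/* Decay a 32-bit sum by y using one 32x32->64 multiply; no 64-bit division. */
static uint32_t decay_once(uint32_t sum)
{
	return (uint32_t)(((uint64_t)sum * Y_FIX) >> 32);
}

int main(void)
{
	uint32_t sum = 47742;	/* near the per-entity upper bound */
	int i;

	for (i = 0; i < 32; i++)
		sum = decay_once(sum);
	printf("%u\n", sum);	/* roughly halved after 32 periods */
	return 0;
}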

Regards,
Vincent


On 14 March 2012 16:59, Paul Turner <pjt@google.com> wrote:
> On Mon, Mar 12, 2012 at 3:57 AM, Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
>> Hi Paul,
>>
>> On 2 February 2012 02:38, Paul Turner <pjt@google.com> wrote:
>>> Hi all,
>>>
>>> The attached series is an RFC on implementing load tracking at the entity
>>> instead of cfs_rq level. This results in a bottom-up load-computation in which
>>> entities contribute to their parents load, as opposed to the current top-down
>>> where the parent averages its children.  In particular this allows us to
>>> correctly migrate load with their accompanying entities and provides the
>>> necessary inputs for intelligent load-balancing and power-management.
>>>
>>> It was previously well tested and stable, but that was on v3.1-; there's been
>>> some fairly extensive changes in the wake-up path since so apologies if anything
>>> was broken in the rebase.Note also, since this is also an RFC on the approach I
>>> have not yet de-linted the various CONFIG combinations for introduced compiler
>>> errors.
>>>
>>> Background:
>>> ----------
>>> We currently track the load average at the parenting cfs_rq level. We divide
>>> execution into a series of consecutive "windows, w_i".  Within each we track:
>>>  \Sum load_i * t_i where \Sum t_i = w and each load_i a disjoint load level.
>>>
>>> The load average is then \Sum w_j / 2^n.
>>>
>>> This this works reasonably well but there's a few problems
>>> 1) Since decay occurs at boundary window boundary there are 'skews':
>>> At a window boundary the 'foreground' time has a bias against the
>>> time immediately preceding it (as a result of the folding division)
>>> e.g.  __xx_|_yyyy___ vs __xx_yyyy_|___  (where '|' is a window boundary).
>>>
>>> The accounting here is 2x + 4y/2 or 2x + 4y, depending on which side of the
>>> window your load lands on.
>>>
>>> 2) Everything within a window 'w' is accounted equally, we only fold at
>>> the boundaries.  This actually means you can't set 'w' large enough to
>>> really have meaningful coverage of the sched period without throwing
>>> decay out the window.  But then with w/2 << sched_period (currently
>>> w/2=5ms) the average ends up having fairly drastic swings even with
>>> stable loads.
>>>
>>> (Note:  Before we even looked at migrating to per-entity tracking we evaluating
>>> foreground window into the considered average until it was "complete", this
>>> represented a measurable improvement in stability under predictable load.)
>>>
>>> 3) Since the load sum average is per-cfs_rq and not per-entity when a task
>>> entity migrates we lose its contribution to load-sum and effectively double
>>> count it while it former sum decays.
>>>
>>> New approach:
>>> -------------
>>> Instead of tracking load on a per-cfs_rq basis we do it on a per-sched_entity
>>> basis.  The load sum for a cfs_rq is then just the sum of its childrens' load
>>> averages.  The obvious immediately nice property is that when we migrate thread
>>> A from cpu1-->cpu2, we carry its load with it; leaving the global load sum
>>> unmodified.
>>>
>>> The 'windows' above are replaced with more fine-grained tracking, that is (for
>>> an entity j):
>>>
>>> runnable_j = u_i * y^i , load_avg_j = runnable_j * weight_j [*]
>>>
>>> Where: u_i is the usage in the last i`th ~ms and y is chosen such that
>>> y^~sched_period = 1/2 (we currently choose k=32).This means that load tracked 1
>>> sched_period ago contributes about ~50% as current execution.
>>
>> We have a sched_period of 30ms for a 16-core system but it's only 12
>> ms on a dual core. Do you think that it's important to keep this rule
>> (y^~sched_period = 1/2) whatever the number of cores is? The
>> sched_period also increases with a large number of running threads.
>>
>
> Yes both of these points are valid.
>
> We could consider tuning it on demand; however, I'm not sure it's
> required.  The only real requirement for stability here is that we
> have a period that's typically larger than the scheduling quantum
> (although not so large that decay stops being meaningful!).
>
> This does not hold in the opposite direction, that is, it's not a
> requirement that we not span several quantums, only that we span at
> least *one*.
>
> We ran it across a variety of topologies (both large and small) in
> LinSched and did not see any immediate ill effects.  I would suggest
> we play this one by ear and see if a negative case arises.
>
>>>
>>> Now, the real challenge in tracking at an entity basis is handling blocked
>>> entities.  Obviously runnable entities will be updated naturally as we iterate
>>> through them but there could be O(many) blocked entities so we can't afford to
>>> iterate through them and update their tracking.
>>>
>>
>> I have run sysbench on my 32-bit ARM platform. The results are available below:
>>
>> config 1 : v3.3-rc3 and CONFIG_FAIR_GROUP_SCHED not set
>> config 2 : v3.3-rc3 and CONFIG_FAIR_GROUP_SCHED set
>> config 3 : v3.3-rc3 with your patches and  CONFIG_FAIR_GROUP_SCHED not set
>> config 4 : v3.3-rc3 with your patches and  CONFIG_FAIR_GROUP_SCHED set
>>
>> sysbench --test=cpu --num-threads=12 --max-time=10 run
>>             total number of events
>>             test1 test2 test3
>> config1    336   336   338
>> config2    336   337   338
>> config3    336   338   336
>> config4    336   338   337
>>
>> sysbench --test=threads --thread-locks=9 --num-threads=12 --max-time=10 run
>>             total number of events
>>             test1 test2 test3
>> config1   5218 5228 5203
>> config2   5204 5217 5251
>> config3   5221 5219 5217
>> config4   4859 4796 4836
>>
>>
>> Whereas we have no difference for the cpu test (all the threads are in
>> the run queue waiting for a core), we can see a decrease for the
>> threads test. I'm wondering if it's linked to the fact that I have a
>> 32-bit machine doing 64-bit division. Have you seen this kind of
>> difference on a 64-bit system?
>>
>
> So I unfortunately do not have a (non-x86) 32-bit system to
> meaningfully benchmark this on.  And yes, 32-bit was my only *real*
> worry so I'm not completely shocked to see an initial regression here
> :-(
>
> With 32-bit performance in mind there are a few obvious things to
> change in our runnable_sum accumulations (these can fit in 32-bit).
> Once we grab those low hanging fruit I'm happy to work with you to see
> what remains.  Do you have any such systems that are externally accessible,
> perchance, or would you prefer several differently tuned patch-bombs?
> :)
>
>
>> Regards,
>> Vincent
>>
>>> That our decay for a unit period is exponential introduces a particularly nice
>>> property here:
>>> We can separate the contributed load on a cfs_rq into blocked and runnable.
>>> Runnable load is just the sum of all load_avg_j above, maintained by the
>>> entities themselves as they run and self update (when they update their
>>> load_avg they update the cumulative sum also).
>>>
>>> Blocked load then looks like:
>>>  load_avg_j = weight_k * \Sum u[k]_n * y^n
>>>
>>> To account an entirely idle period we then only need to multiply by y.
>>>
>>> This ends up being orders of magnitude more accurate than the current
>>> tracking schema, even with the old shares_window massively dilated to
>>> better capture a stable load-state.  It's also obviously stable under
>>> migration.
>>>
>>> As referenced above this also allows us to potentially improve decisions within
>>> the load-balancer, for both distribution and power-management.
>>>
>>> Example: consider one 80% task (A) and two 40% tasks (B) on a 2-core
>>> machine.  It's currently a bit of a gamble as to whether you get an
>>> {AB, B} or {A, BB} split since they have equal weight (assume 1024).
>>> With per-task tracking we can actually consider them at their
>>> contributed weight and see a stable ~{800, {400, 400}} load-split.
>>> Likewise, within balance_tasks we can consider the load migrated to be
>>> that actually contributed.
>>>
>>> Thanks,
>>>
>>> - Paul
>>>
>>>
>>> ---
>>>
>>> Ben Segall (1):
>>>      sched: maintain per-rq runnable averages
>>>
>>> Paul Turner (13):
>>>      sched: track the runnable average on a per-task entitiy basis
>>>      sched: aggregate load contributed by task entities on parenting cfs_rq
>>>      sched: maintain the load contribution of blocked entities
>>>      sched: account for blocked load waking back up
>>>      sched: aggregate total task_group load
>>>      sched: compute load contribution by a group entity
>>>      sched: normalize tg load contributions against runnable time
>>>      sched: maintain runnable averages across throttled periods
>>>      sched: replace update_shares weight distribution with per-entity computation
>>>      sched: refactor update_shares_cpu() -> update_blocked_avgs()
>>>      sched: update_cfs_shares at period edge
>>>      sched: make __update_entity_runnable_avg() fast
>>>      sched: implement usage tracking
>>>
>>>
>>>  include/linux/sched.h |   11 +
>>>  kernel/sched/core.c   |    3
>>>  kernel/sched/debug.c  |   37 ++-
>>>  kernel/sched/fair.c   |  714 ++++++++++++++++++++++++++++++++++++++-----------
>>>  kernel/sched/sched.h  |   29 +-
>>>  5 files changed, 611 insertions(+), 183 deletions(-)
>>>
>>>

^ permalink raw reply	[flat|nested] 68+ messages in thread

end of thread, other threads:[~2012-04-25 13:07 UTC | newest]

Thread overview: 68+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-02  1:38 [RFC PATCH 00/14] sched: entity load-tracking re-work Paul Turner
2012-02-02  1:38 ` [RFC PATCH 05/14] sched: account for blocked load waking back up Paul Turner
2012-02-16 15:57   ` Peter Zijlstra
2012-02-17 13:00     ` Paul Turner
2012-02-16 16:02   ` Peter Zijlstra
2012-02-17 11:39     ` Paul Turner
2012-02-02  1:38 ` [RFC PATCH 01/14] sched: track the runnable average on a per-task entitiy basis Paul Turner
2012-02-15 23:37   ` Peter Zijlstra
2012-02-17 11:43     ` Paul Turner
2012-02-16 13:27   ` Peter Zijlstra
2012-02-17 11:44     ` Paul Turner
2012-02-02  1:38 ` [RFC PATCH 07/14] sched: compute load contribution by a group entity Paul Turner
2012-02-02  1:38 ` [RFC PATCH 06/14] sched: aggregate total task_group load Paul Turner
2012-02-17  4:41   ` Nikunj A Dadhania
2012-02-17 10:52     ` Paul Turner
2012-02-02  1:38 ` [RFC PATCH 03/14] sched: aggregate load contributed by task entities on parenting cfs_rq Paul Turner
2012-02-02  1:38 ` [RFC PATCH 04/14] sched: maintain the load contribution of blocked entities Paul Turner
2012-02-16 12:25   ` Peter Zijlstra
2012-02-17 11:53     ` Paul Turner
2012-02-02  1:38 ` [RFC PATCH 02/14] sched: maintain per-rq runnable averages Paul Turner
2012-02-02  1:38 ` [RFC PATCH 08/14] sched: normalize tg load contributions against runnable time Paul Turner
2012-02-15 23:36   ` Peter Zijlstra
2012-02-17 12:32     ` Paul Turner
2012-02-20 16:10       ` Peter Zijlstra
2012-02-17 12:34     ` Peter Zijlstra
2012-02-15 23:38   ` Peter Zijlstra
2012-02-02  1:38 ` [RFC PATCH 09/14] sched: maintain runnable averages across throttled periods Paul Turner
2012-02-02  1:38 ` [RFC PATCH 12/14] sched: update_cfs_shares at period edge Paul Turner
2012-02-02  1:38 ` [RFC PATCH 13/14] sched: make __update_entity_runnable_avg() fast Paul Turner
2012-02-06 20:48   ` Peter Zijlstra
2012-02-17 12:49     ` Paul Turner
2012-02-02  1:38 ` [RFC PATCH 11/14] sched: refactor update_shares_cpu() -> update_blocked_avgs() Paul Turner
2012-02-02  1:38 ` [RFC PATCH 14/14] sched: implement usage tracking Paul Turner
2012-02-16 13:37   ` Peter Zijlstra
2012-02-17 10:54     ` Paul Turner
2012-02-20 16:11       ` Peter Zijlstra
2012-02-16 16:58   ` Peter Zijlstra
2012-02-17 11:32     ` Paul Turner
2012-02-20 16:30       ` Peter Zijlstra
2012-02-29 10:37   ` sched per task ARM fix Pantelis Antoniou
2012-02-29 10:37   ` [PATCH 1/2] sched: entity load-tracking re-work - Fix for ARM Pantelis Antoniou
2012-02-28 17:45     ` Morten Rasmussen
2012-02-28 17:52       ` Pantelis Antoniou
2012-02-29 10:37   ` [PATCH 2/2] sched: load-tracking compile when cgroup undefined Pantelis Antoniou
2012-03-13 16:57   ` [RFC PATCH 14/14] sched: implement usage tracking Vincent Guittot
2012-03-14 15:01     ` Peter Zijlstra
2012-03-14 15:45       ` Vincent Guittot
2012-03-14 15:47       ` Paul Turner
2012-03-15 10:52         ` Peter Zijlstra
2012-03-14 15:44     ` Paul Turner
2012-02-02  1:38 ` [RFC PATCH 10/14] sched: replace update_shares weight distribution with per-entity computation Paul Turner
2012-02-06 20:02 ` [RFC PATCH 00/14] sched: entity load-tracking re-work Peter Zijlstra
2012-02-17  9:07 ` Nikunj A Dadhania
2012-02-17 10:48   ` Paul Turner
2012-02-20  9:41     ` Nikunj A Dadhania
2012-02-21  2:33       ` Paul Turner
2012-02-20 17:01 ` Peter Zijlstra
2012-03-12 10:39 ` Morten Rasmussen
2012-03-13 16:44   ` Paul E. McKenney
2012-03-13 17:08     ` Anca Emanuel
2012-03-13 17:23       ` Paul E. McKenney
2012-03-14  9:03       ` Amit Kucheria
2012-03-14 19:19         ` Paul E. McKenney
2012-03-13 17:28   ` Peter Zijlstra
2012-03-12 10:57 ` Vincent Guittot
2012-03-14 15:59   ` Paul Turner
2012-03-15  9:59     ` Vincent Guittot
2012-04-25 13:07     ` Vincent Guittot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).