* [patch 00/16] sched: per-entity load-tracking
@ 2012-08-23 14:14 pjt
  2012-08-23 14:14 ` [patch 01/16] sched: track the runnable average on a per-task entity basis pjt
                   ` (17 more replies)
  0 siblings, 18 replies; 59+ messages in thread
From: pjt @ 2012-08-23 14:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Venki Pallipadi,
	Ben Segall, Mike Galbraith, Vincent Guittot, Nikunj A Dadhania,
	Morten Rasmussen, Paul E. McKenney, Namhyung Kim

Hi all,

Please find attached the latest version of the CFS load-tracking series.

It implements load-tracking on a per-sched_entity basis (currently SCHED_NORMAL,
but it could be extended to RT as well). This results in a bottom-up
load computation in which entities contribute to their parents' load, as
opposed to the current top-down approach in which the parent averages its
children.  In particular this allows us to correctly migrate load along with
the entities that generate it, and provides the necessary inputs for
intelligent load-balancing and power-management.

We've been running this internally for some time now and, modulo any gremlins
from rebasing it, I think things have been shaken out and we're approaching a
mergeable state.

Special thanks to Namhyung Kim and Peter Zijlstra for their comments on the
last round of this series.

For more background and prior discussion please review the previous posting:
https://lkml.org/lkml/2012/6/27/644

- Paul



* [patch 01/16] sched: track the runnable average on a per-task entity basis
  2012-08-23 14:14 [patch 00/16] sched: per-entity load-tracking pjt
@ 2012-08-23 14:14 ` pjt
  2012-08-24  8:20   ` Namhyung Kim
  2012-10-24  9:43   ` [tip:sched/core] sched: Track the runnable average on a per-task entity basis tip-bot for Paul Turner
  2012-08-23 14:14 ` [patch 02/16] sched: maintain per-rq runnable averages pjt
                   ` (16 subsequent siblings)
  17 siblings, 2 replies; 59+ messages in thread
From: pjt @ 2012-08-23 14:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Venki Pallipadi,
	Ben Segall, Mike Galbraith, Vincent Guittot, Nikunj A Dadhania,
	Morten Rasmussen, Paul E. McKenney, Namhyung Kim

[-- Attachment #1: sched-sched_avg.patch --]
[-- Type: text/plain, Size: 7771 bytes --]

From: Paul Turner <pjt@google.com>

Instead of tracking the average load parented by a cfs_rq, we can track
entity load directly, with the load for a given cfs_rq then being the sum of
its children's loads.

To do this we represent the historical contribution to runnable average within each
trailing 1024us of execution as the coefficients of a geometric series.

We can express this for a given task t as:
  runnable_sum(t) = \Sum u_i * y^i, runnable_avg_period(t) = \Sum 1024 * y^i
  load(t) = weight_t * runnable_sum(t) / runnable_avg_period(t)

Where u_i is the usage in the i-th most recent 1024us period (approximately
1ms) and y is chosen such that y^k = 1/2.  We currently choose k to be 32,
which roughly translates to a scheduling period (~32ms).
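
For intuition, here is a minimal user-space C sketch of this accumulation (the
4008/4096 fixed-point step approximating y with y^32 = 1/2 matches the decay
helper in the patch below; the driver loop and the half-runnable input are
illustrative assumptions, not kernel code):

  #include <stdio.h>

  /* One decay step: val * y, with y^32 ~= 1/2 (4008/4096 ~= 0.97857). */
  static unsigned int decay_one_period(unsigned int val)
  {
          return (unsigned int)(((unsigned long long)val * 4008) >> 12);
  }

  int main(void)
  {
          unsigned int runnable_sum = 0, runnable_period = 0;
          int i;

          /* Feed 100 periods in which the task ran 512us of each 1024us. */
          for (i = 0; i < 100; i++) {
                  runnable_sum = decay_one_period(runnable_sum) + 512;
                  runnable_period = decay_one_period(runnable_period) + 1024;
          }

          /* load(t) would additionally scale this fraction by the weight. */
          printf("runnable fraction ~= %u/%u = %.3f\n", runnable_sum,
                 runnable_period, (double)runnable_sum / runnable_period);
          return 0;
  }

A task runnable for half of every period converges on a fraction of ~0.5, and
contributions older than ~32ms count roughly half as much as current ones.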

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
---
 include/linux/sched.h |   13 +++++
 kernel/sched/core.c   |    5 ++
 kernel/sched/debug.c  |    4 ++
 kernel/sched/fair.c   |  128 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 150 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f3eebc1..f553da9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1139,6 +1139,16 @@ struct load_weight {
 	unsigned long weight, inv_weight;
 };
 
+struct sched_avg {
+	/*
+	 * These sums represent an infinite geometric series and so are bound
+	 * above by 1024/(1-y).  Thus we only need a u32 to store them for all
+	 * choices of y < 1-2^(-32)*1024.
+	 */
+	u32 runnable_avg_sum, runnable_avg_period;
+	u64 last_runnable_update;
+};
+
 #ifdef CONFIG_SCHEDSTATS
 struct sched_statistics {
 	u64			wait_start;
@@ -1199,6 +1209,9 @@ struct sched_entity {
 	/* rq "owned" by this entity/group: */
 	struct cfs_rq		*my_q;
 #endif
+#ifdef CONFIG_SMP
+	struct sched_avg	avg;
+#endif
 };
 
 struct sched_rt_entity {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 78d9c96..fcc3cad 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1556,6 +1556,11 @@ static void __sched_fork(struct task_struct *p)
 	p->se.vruntime			= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
+#ifdef CONFIG_SMP
+	p->se.avg.runnable_avg_period = 0;
+	p->se.avg.runnable_avg_sum = 0;
+#endif
+
 #ifdef CONFIG_SCHEDSTATS
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
 #endif
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 6f79596..61f7097 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -85,6 +85,10 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	P(se->statistics.wait_count);
 #endif
 	P(se->load.weight);
+#ifdef CONFIG_SMP
+	P(se->avg.runnable_avg_sum);
+	P(se->avg.runnable_avg_period);
+#endif
 #undef PN
 #undef P
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 01d3eda..2c53263 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -971,6 +971,125 @@ static inline void update_entity_shares_tick(struct cfs_rq *cfs_rq)
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
+#ifdef CONFIG_SMP
+/*
+ * Approximate:
+ *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
+ */
+static __always_inline u64 decay_load(u64 val, u64 n)
+{
+	for (; n && val; n--) {
+		val *= 4008;
+		val >>= 12;
+	}
+
+	return val;
+}
+
+/* We can represent the historical contribution to runnable average as the
+ * coefficients of a geometric series.  To do this we sub-divide our runnable
+ * history into segments of approximately 1ms (1024us); label the segment that
+ * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g.
+ *
+ * [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
+ *      p0            p1           p2
+ *     (now)       (~1ms ago)  (~2ms ago)
+ *
+ * Let u_i denote the fraction of p_i that the entity was runnable.
+ *
+ * We then designate the fractions u_i as our co-efficients, yielding the
+ * following representation of historical load:
+ *   u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
+ *
+ * We choose y based on the width of a reasonable scheduling period, fixing:
+ *   y^32 = 0.5
+ *
+ * This means that the contribution to load ~32ms ago (u_32) will be weighted
+ * approximately half as much as the contribution to load within the last ms
+ * (u_0).
+ *
+ * When a period "rolls over" and we have new u_0`, multiplying the previous
+ * sum again by y is sufficient to update:
+ *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
+ *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
+ */
+static __always_inline int __update_entity_runnable_avg(u64 now,
+							struct sched_avg *sa,
+							int runnable)
+{
+	u64 delta;
+	int delta_w, decayed = 0;
+
+	delta = now - sa->last_runnable_update;
+	/*
+	 * This should only happen when time goes backwards, which it
+	 * unfortunately does during sched clock init when we swap over to TSC.
+	 */
+	if ((s64)delta < 0) {
+		sa->last_runnable_update = now;
+		return 0;
+	}
+
+	/*
+	 * Use 1024ns as the unit of measurement since it's a reasonable
+	 * approximation of 1us and fast to compute.
+	 */
+	delta >>= 10;
+	if (!delta)
+		return 0;
+	sa->last_runnable_update = now;
+
+	/* delta_w is the amount already accumulated against our next period */
+	delta_w = sa->runnable_avg_period % 1024;
+	if (delta + delta_w >= 1024) {
+		/* period roll-over */
+		decayed = 1;
+
+		/*
+		 * Now that we know we're crossing a period boundary, figure
+		 * out how much from delta we need to complete the current
+		 * period and accrue it.
+		 */
+		delta_w = 1024 - delta_w;
+		BUG_ON(delta_w > delta);
+		do {
+			if (runnable)
+				sa->runnable_avg_sum += delta_w;
+			sa->runnable_avg_period += delta_w;
+
+			/*
+			 * Remainder of delta initiates a new period, roll over
+			 * the previous.
+			 */
+			sa->runnable_avg_sum =
+				decay_load(sa->runnable_avg_sum, 1);
+			sa->runnable_avg_period =
+				decay_load(sa->runnable_avg_period, 1);
+
+			delta -= delta_w;
+			/* New period is empty */
+			delta_w = 1024;
+		} while (delta >= 1024);
+	}
+
+	/* Remainder of delta accrued against u_0` */
+	if (runnable)
+		sa->runnable_avg_sum += delta;
+	sa->runnable_avg_period += delta;
+
+	return decayed;
+}
+
+/* Update a sched_entity's runnable average */
+static inline void update_entity_load_avg(struct sched_entity *se)
+{
+	__update_entity_runnable_avg(rq_of(cfs_rq_of(se))->clock_task, &se->avg,
+				     se->on_rq);
+}
+#else
+static inline void update_entity_load_avg(struct sched_entity *se) {}
+#endif
+
 static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 #ifdef CONFIG_SCHEDSTATS
@@ -1097,6 +1216,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 */
 	update_curr(cfs_rq);
 	update_cfs_load(cfs_rq, 0);
+	update_entity_load_avg(se);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
 
@@ -1171,6 +1291,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
+	update_entity_load_avg(se);
 
 	update_stats_dequeue(cfs_rq, se);
 	if (flags & DEQUEUE_SLEEP) {
@@ -1340,6 +1461,8 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 		update_stats_wait_start(cfs_rq, prev);
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);
+		/* in !on_rq case, update occurred at dequeue */
+		update_entity_load_avg(prev);
 	}
 	cfs_rq->curr = NULL;
 }
@@ -1353,6 +1476,11 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	update_curr(cfs_rq);
 
 	/*
+	 * Ensure that runnable average is periodically updated.
+	 */
+	update_entity_load_avg(curr);
+
+	/*
 	 * Update share accounting for long-running entities.
 	 */
 	update_entity_shares_tick(cfs_rq);




* [patch 02/16] sched: maintain per-rq runnable averages
  2012-08-23 14:14 [patch 00/16] sched: per-entity load-tracking pjt
  2012-08-23 14:14 ` [patch 01/16] sched: track the runnable average on a per-task entity basis pjt
@ 2012-08-23 14:14 ` pjt
  2012-10-24  9:44   ` [tip:sched/core] sched: Maintain " tip-bot for Ben Segall
  2012-10-28 10:12   ` [patch 02/16] sched: maintain " Preeti Murthy
  2012-08-23 14:14 ` [patch 03/16] sched: aggregate load contributed by task entities on parenting cfs_rq pjt
                   ` (15 subsequent siblings)
  17 siblings, 2 replies; 59+ messages in thread
From: pjt @ 2012-08-23 14:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Venki Pallipadi,
	Ben Segall, Mike Galbraith, Vincent Guittot, Nikunj A Dadhania,
	Morten Rasmussen, Paul E. McKenney, Namhyung Kim

[-- Attachment #1: sched-root_avg.patch --]
[-- Type: text/plain, Size: 3131 bytes --]

From: Ben Segall <bsegall@google.com>

Since runqueues do not have a corresponding sched_entity, we instead embed a
sched_avg structure directly into struct rq.
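
As an illustration only (simplified, non-kernel types; the ~1024us period and
the 4008/4096 decay step are the approximations described in patch 01), the
rq-level average is the same accumulator, fed with whether the runqueue had
any runnable tasks:

  struct toy_sched_avg {
          unsigned int runnable_sum, period_sum;
  };

  struct toy_rq {
          unsigned int nr_running;
          struct toy_sched_avg avg;
  };

  /* Called once per ~1ms period: the rq counts as "runnable" whenever it
   * had any runnable tasks during that period. */
  void rq_period_tick(struct toy_rq *rq)
  {
          rq->avg.runnable_sum = ((rq->avg.runnable_sum * 4008) >> 12)
                                 + (rq->nr_running ? 1024 : 0);
          rq->avg.period_sum = ((rq->avg.period_sum * 4008) >> 12) + 1024;
  }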

Signed-off-by: Ben Segall <bsegall@google.com>
Reviewed-by: Paul Turner <pjt@google.com>
---
 kernel/sched/debug.c |   10 ++++++++--
 kernel/sched/fair.c  |   18 ++++++++++++++++--
 kernel/sched/sched.h |    2 ++
 3 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 61f7097..4240abc 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -61,14 +61,20 @@ static unsigned long nsec_low(unsigned long long nsec)
 static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group *tg)
 {
 	struct sched_entity *se = tg->se[cpu];
-	if (!se)
-		return;
 
 #define P(F) \
 	SEQ_printf(m, "  .%-30s: %lld\n", #F, (long long)F)
 #define PN(F) \
 	SEQ_printf(m, "  .%-30s: %lld.%06ld\n", #F, SPLIT_NS((long long)F))
 
+	if (!se) {
+		struct sched_avg *avg = &cpu_rq(cpu)->avg;
+		P(avg->runnable_avg_sum);
+		P(avg->runnable_avg_period);
+		return;
+	}
+
+
 	PN(se->exec_start);
 	PN(se->vruntime);
 	PN(se->sum_exec_runtime);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2c53263..6eb2ce2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1086,8 +1086,14 @@ static inline void update_entity_load_avg(struct sched_entity *se)
 	__update_entity_runnable_avg(rq_of(cfs_rq_of(se))->clock_task, &se->avg,
 				     se->on_rq);
 }
+
+static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
+{
+	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
+}
 #else
 static inline void update_entity_load_avg(struct sched_entity *se) {}
+static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
 #endif
 
 static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -2339,8 +2345,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_cfs_shares(cfs_rq);
 	}
 
-	if (!se)
+	if (!se) {
+		update_rq_runnable_avg(rq, rq->nr_running);
 		inc_nr_running(rq);
+	}
 	hrtick_update(rq);
 }
 
@@ -2398,8 +2406,10 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_cfs_shares(cfs_rq);
 	}
 
-	if (!se)
+	if (!se) {
 		dec_nr_running(rq);
+		update_rq_runnable_avg(rq, 1);
+	}
 	hrtick_update(rq);
 }
 
@@ -4573,6 +4583,8 @@ void idle_balance(int this_cpu, struct rq *this_rq)
 	if (this_rq->avg_idle < sysctl_sched_migration_cost)
 		return;
 
+	update_rq_runnable_avg(this_rq, 1);
+
 	/*
 	 * Drop the rq->lock, but keep IRQ/preempt disabled.
 	 */
@@ -5071,6 +5083,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		cfs_rq = cfs_rq_of(se);
 		entity_tick(cfs_rq, se, queued);
 	}
+
+	update_rq_runnable_avg(rq, 1);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 804c2e5..eb61c75 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -467,6 +467,8 @@ struct rq {
 #ifdef CONFIG_SMP
 	struct llist_head wake_list;
 #endif
+
+	struct sched_avg avg;
 };
 
 static inline int cpu_of(struct rq *rq)




* [patch 03/16] sched: aggregate load contributed by task entities on parenting cfs_rq
  2012-08-23 14:14 [patch 00/16] sched: per-entity load-tracking pjt
  2012-08-23 14:14 ` [patch 01/16] sched: track the runnable average on a per-task entity basis pjt
  2012-08-23 14:14 ` [patch 02/16] sched: maintain per-rq runnable averages pjt
@ 2012-08-23 14:14 ` pjt
  2012-10-24  9:45   ` [tip:sched/core] sched: Aggregate " tip-bot for Paul Turner
  2012-08-23 14:14 ` [patch 04/16] sched: maintain the load contribution of blocked entities pjt
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 59+ messages in thread
From: pjt @ 2012-08-23 14:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Venki Pallipadi,
	Ben Segall, Mike Galbraith, Vincent Guittot, Nikunj A Dadhania,
	Morten Rasmussen, Paul E. McKenney, Namhyung Kim

[-- Attachment #1: sched-cfs_rq_load.patch --]
[-- Type: text/plain, Size: 5492 bytes --]

From: Paul Turner <pjt@google.com>

For a given task t, we can compute its contribution to load as:
  task_load(t) = runnable_avg(t) * weight(t)

On a parenting cfs_rq we can then aggregate
  runnable_load(cfs_rq) = \Sum task_load(t), for all runnable children t

Maintain this bottom up, with task entities adding their contributed load to
the parenting cfs_rq sum.  When a task entity's load changes we add the same
delta to the maintained sum.
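
A minimal sketch of this delta scheme with toy types (illustrative only, not
the kernel structures): recomputing an entity's contribution yields a delta,
and only that delta is applied to the parent's runnable sum.

  struct toy_entity { long load_avg_contrib; int on_rq; };
  struct toy_cfs_rq { long runnable_load_avg; };

  /* Recompute the entity's contribution; return the change. */
  long update_contrib(struct toy_entity *se, long new_contrib)
  {
          long delta = new_contrib - se->load_avg_contrib;

          se->load_avg_contrib = new_contrib;
          return delta;
  }

  /* Only runnable entities feed the parent's runnable_load_avg. */
  void update_entity(struct toy_cfs_rq *cfs_rq, struct toy_entity *se,
                     long new_contrib)
  {
          long delta = update_contrib(se, new_contrib);

          if (se->on_rq)
                  cfs_rq->runnable_load_avg += delta;
  }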

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
---
 include/linux/sched.h |    1 +
 kernel/sched/debug.c  |    3 +++
 kernel/sched/fair.c   |   51 +++++++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h  |   10 +++++++++-
 4 files changed, 60 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f553da9..943a60d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1147,6 +1147,7 @@ struct sched_avg {
 	 */
 	u32 runnable_avg_sum, runnable_avg_period;
 	u64 last_runnable_update;
+	unsigned long load_avg_contrib;
 };
 
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 4240abc..c953a89 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -94,6 +94,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 #ifdef CONFIG_SMP
 	P(se->avg.runnable_avg_sum);
 	P(se->avg.runnable_avg_period);
+	P(se->avg.load_avg_contrib);
 #endif
 #undef PN
 #undef P
@@ -224,6 +225,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			cfs_rq->load_contribution);
 	SEQ_printf(m, "  .%-30s: %d\n", "load_tg",
 			atomic_read(&cfs_rq->tg->load_weight));
+	SEQ_printf(m, "  .%-30s: %lld\n", "runnable_load_avg",
+			cfs_rq->runnable_load_avg);
 #endif
 
 	print_cfs_group_stats(m, cpu, cfs_rq->tg);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6eb2ce2..f1151f9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1080,20 +1080,63 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 	return decayed;
 }
 
+/* Compute the current contribution to load_avg by se, return any delta */
+static long __update_entity_load_avg_contrib(struct sched_entity *se)
+{
+	long old_contrib = se->avg.load_avg_contrib;
+
+	if (!entity_is_task(se))
+		return 0;
+
+	se->avg.load_avg_contrib = div64_u64(se->avg.runnable_avg_sum *
+					     se->load.weight,
+					     se->avg.runnable_avg_period + 1);
+
+	return se->avg.load_avg_contrib - old_contrib;
+}
+
 /* Update a sched_entity's runnable average */
 static inline void update_entity_load_avg(struct sched_entity *se)
 {
-	__update_entity_runnable_avg(rq_of(cfs_rq_of(se))->clock_task, &se->avg,
-				     se->on_rq);
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	long contrib_delta;
+
+	if (!__update_entity_runnable_avg(rq_of(cfs_rq)->clock_task, &se->avg,
+					  se->on_rq))
+		return;
+
+	contrib_delta = __update_entity_load_avg_contrib(se);
+	if (se->on_rq)
+		cfs_rq->runnable_load_avg += contrib_delta;
 }
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 {
 	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
 }
+
+/* Add the load generated by se into cfs_rq's child load-average */
+static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
+						  struct sched_entity *se)
+{
+	update_entity_load_avg(se);
+	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
+}
+
+/* Remove se's load from this cfs_rq child load-average */
+static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
+						  struct sched_entity *se)
+{
+	update_entity_load_avg(se);
+	cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
+}
 #else
 static inline void update_entity_load_avg(struct sched_entity *se) {}
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
+static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
+						  struct sched_entity *se) {}
+static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
+						  struct sched_entity *se) {}
 #endif
 
 static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -1222,7 +1265,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 */
 	update_curr(cfs_rq);
 	update_cfs_load(cfs_rq, 0);
-	update_entity_load_avg(se);
+	enqueue_entity_load_avg(cfs_rq, se);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
 
@@ -1297,7 +1340,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	update_entity_load_avg(se);
+	dequeue_entity_load_avg(cfs_rq, se);
 
 	update_stats_dequeue(cfs_rq, se);
 	if (flags & DEQUEUE_SLEEP) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index eb61c75..7e35ae0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -222,6 +222,15 @@ struct cfs_rq {
 	unsigned int nr_spread_over;
 #endif
 
+#ifdef CONFIG_SMP
+	/*
+	 * CFS Load tracking
+	 * Under CFS, load is tracked on a per-entity basis and aggregated up.
+	 * This allows for the description of both thread and group usage (in
+	 * the FAIR_GROUP_SCHED case).
+	 */
+	u64 runnable_load_avg;
+#endif
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	struct rq *rq;	/* cpu runqueue to which this cfs_rq is attached */
 
@@ -1221,4 +1230,3 @@ static inline u64 irq_time_read(int cpu)
 }
 #endif /* CONFIG_64BIT */
 #endif /* CONFIG_IRQ_TIME_ACCOUNTING */
-




* [patch 04/16] sched: maintain the load contribution of blocked entities
  2012-08-23 14:14 [patch 00/16] sched: per-entity load-tracking pjt
                   ` (2 preceding siblings ...)
  2012-08-23 14:14 ` [patch 03/16] sched: aggregate load contributed by task entities on parenting cfs_rq pjt
@ 2012-08-23 14:14 ` pjt
  2012-10-24  9:46   ` [tip:sched/core] sched: Maintain " tip-bot for Paul Turner
  2012-08-23 14:14 ` [patch 05/16] sched: add an rq migration call-back to sched_class pjt
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 59+ messages in thread
From: pjt @ 2012-08-23 14:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Venki Pallipadi,
	Ben Segall, Mike Galbraith, Vincent Guittot, Nikunj A Dadhania,
	Morten Rasmussen, Paul E. McKenney, Namhyung Kim

[-- Attachment #1: sched-cfs_rq_blocked_load.patch --]
[-- Type: text/plain, Size: 10835 bytes --]

From: Paul Turner <pjt@google.com>

We are currently maintaining:
  runnable_load(cfs_rq) = \Sum task_load(t)

For all runnable children t of cfs_rq.  While this can be naturally updated for
tasks in a runnable state (as they are scheduled), it does not account for
the load contributed by blocked task entities.

This can be solved by introducing a separate accounting for blocked load:
  blocked_load(cfs_rq) = \Sum runnable_avg(b) * weight(b)

Obviously we do not want to iterate over all blocked entities to account for
their decay; we instead observe that:
  runnable_load(t) = \Sum u_i*y^i

and that to account for an additional idle period we only need to compute:
  y*runnable_load(t).

This means that we can decay the contributions of all blocked entities at once
by evaluating:
  blocked_load(cfs_rq)` = y * blocked_load(cfs_rq)

Finally we maintain a decay counter so that when a sleeping entity re-awakens
we can determine how much of its load should be removed from the blocked sum.
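
The bookkeeping can be sketched with simplified types as follows (the decay
step again uses the 4008/4096 approximation of y; this is illustrative only,
not the kernel code):

  #include <stdint.h>

  struct toy_cfs_rq {
          uint64_t blocked_load_avg;
          uint64_t decay_counter;         /* total periods decayed so far */
  };

  struct toy_entity {
          uint64_t load_avg_contrib;
          uint64_t decay_count;           /* decay_counter value at sleep time */
  };

  /* val * y^n with y^32 ~= 1/2, using the 4008/4096 approximation. */
  static uint64_t decay_load(uint64_t val, uint64_t n)
  {
          while (n-- && val)
                  val = (val * 4008) >> 12;
          return val;
  }

  /* Decay every blocked entity at once: one pass per elapsed period. */
  static void decay_blocked(struct toy_cfs_rq *cfs_rq, uint64_t periods)
  {
          cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg, periods);
          cfs_rq->decay_counter += periods;
  }

  /* On wake-up, decay the entity's own (stale) contribution by the periods
   * it missed before removing it from the blocked sum. */
  static void wake_entity(struct toy_cfs_rq *cfs_rq, struct toy_entity *se)
  {
          uint64_t missed = cfs_rq->decay_counter - se->decay_count;

          se->load_avg_contrib = decay_load(se->load_avg_contrib, missed);
          if (se->load_avg_contrib < cfs_rq->blocked_load_avg)
                  cfs_rq->blocked_load_avg -= se->load_avg_contrib;
          else
                  cfs_rq->blocked_load_avg = 0;
  }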

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
---
 include/linux/sched.h |    1 
 kernel/sched/core.c   |    1 
 kernel/sched/debug.c  |    3 +
 kernel/sched/fair.c   |  124 ++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h  |    4 +-
 5 files changed, 118 insertions(+), 15 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 943a60d..7406249 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1147,6 +1147,7 @@ struct sched_avg {
 	 */
 	u32 runnable_avg_sum, runnable_avg_period;
 	u64 last_runnable_update;
+	s64 decay_count;
 	unsigned long load_avg_contrib;
 };
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fcc3cad..33e6fe1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1560,7 +1560,6 @@ static void __sched_fork(struct task_struct *p)
 	p->se.avg.runnable_avg_period = 0;
 	p->se.avg.runnable_avg_sum = 0;
 #endif
-
 #ifdef CONFIG_SCHEDSTATS
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
 #endif
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index c953a89..2d2e2b3 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -95,6 +95,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	P(se->avg.runnable_avg_sum);
 	P(se->avg.runnable_avg_period);
 	P(se->avg.load_avg_contrib);
+	P(se->avg.decay_count);
 #endif
 #undef PN
 #undef P
@@ -227,6 +228,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			atomic_read(&cfs_rq->tg->load_weight));
 	SEQ_printf(m, "  .%-30s: %lld\n", "runnable_load_avg",
 			cfs_rq->runnable_load_avg);
+	SEQ_printf(m, "  .%-30s: %lld\n", "blocked_load_avg",
+			cfs_rq->blocked_load_avg);
 #endif
 
 	print_cfs_group_stats(m, cpu, cfs_rq->tg);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f1151f9..0ce8d91 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1080,6 +1080,20 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 	return decayed;
 }
 
+/* Synchronize an entity's decay with its parenting cfs_rq.*/
+static inline void __synchronize_entity_decay(struct sched_entity *se)
+{
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	u64 decays = atomic64_read(&cfs_rq->decay_counter);
+
+	decays -= se->avg.decay_count;
+	if (!decays)
+		return;
+
+	se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
+	se->avg.decay_count = 0;
+}
+
 /* Compute the current contribution to load_avg by se, return any delta */
 static long __update_entity_load_avg_contrib(struct sched_entity *se)
 {
@@ -1095,8 +1109,18 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se)
 	return se->avg.load_avg_contrib - old_contrib;
 }
 
+static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
+						 long load_contrib)
+{
+	if (likely(load_contrib < cfs_rq->blocked_load_avg))
+		cfs_rq->blocked_load_avg -= load_contrib;
+	else
+		cfs_rq->blocked_load_avg = 0;
+}
+
 /* Update a sched_entity's runnable average */
-static inline void update_entity_load_avg(struct sched_entity *se)
+static inline void update_entity_load_avg(struct sched_entity *se,
+					  int update_cfs_rq)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	long contrib_delta;
@@ -1106,8 +1130,34 @@ static inline void update_entity_load_avg(struct sched_entity *se)
 		return;
 
 	contrib_delta = __update_entity_load_avg_contrib(se);
+
+	if (!update_cfs_rq)
+		return;
+
 	if (se->on_rq)
 		cfs_rq->runnable_load_avg += contrib_delta;
+	else
+		subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
+}
+
+/*
+ * Decay the load contributed by all blocked children and account this so that
+ * their contribution may be appropriately discounted when they wake up.
+ */
+static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq)
+{
+	u64 now = rq_of(cfs_rq)->clock_task >> 20;
+	u64 decays;
+
+	decays = now - cfs_rq->last_decay;
+	if (!decays)
+		return;
+
+	cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg,
+					      decays);
+	atomic64_add(decays, &cfs_rq->decay_counter);
+
+	cfs_rq->last_decay = now;
 }
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
@@ -1117,26 +1167,53 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 
 /* Add the load generated by se into cfs_rq's child load-average */
 static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
-						  struct sched_entity *se)
+						  struct sched_entity *se,
+						  int wakeup)
 {
-	update_entity_load_avg(se);
+	/* we track migrations using entity decay_count == 0 */
+	if (unlikely(!se->avg.decay_count)) {
+		se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;
+		wakeup = 0;
+	} else {
+		__synchronize_entity_decay(se);
+	}
+
+	if (wakeup)
+		subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
+
+	update_entity_load_avg(se, 0);
 	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
+	update_cfs_rq_blocked_load(cfs_rq);
 }
 
-/* Remove se's load from this cfs_rq child load-average */
+/*
+ * Remove se's load from this cfs_rq child load-average, if the entity is
+ * transitioning to a blocked state we track its projected decay using
+ * blocked_load_avg.
+ */
 static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
-						  struct sched_entity *se)
+						  struct sched_entity *se,
+						  int sleep)
 {
-	update_entity_load_avg(se);
+	update_entity_load_avg(se, 1);
+
 	cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
+	if (sleep) {
+		cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
+		se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
+	} /* migrations, e.g. sleep=0 leave decay_count == 0 */
 }
 #else
-static inline void update_entity_load_avg(struct sched_entity *se) {}
+static inline void update_entity_load_avg(struct sched_entity *se,
+					  int update_cfs_rq) {}
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
 static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
-						  struct sched_entity *se) {}
+					   struct sched_entity *se,
+					   int wakeup) {}
 static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
-						  struct sched_entity *se) {}
+					   struct sched_entity *se,
+					   int sleep) {}
+static inline void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq) {}
 #endif
 
 static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -1265,7 +1342,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 */
 	update_curr(cfs_rq);
 	update_cfs_load(cfs_rq, 0);
-	enqueue_entity_load_avg(cfs_rq, se);
+	enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
 
@@ -1340,7 +1417,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	dequeue_entity_load_avg(cfs_rq, se);
+	dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
 
 	update_stats_dequeue(cfs_rq, se);
 	if (flags & DEQUEUE_SLEEP) {
@@ -1511,7 +1588,7 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);
 		/* in !on_rq case, update occurred at dequeue */
-		update_entity_load_avg(prev);
+		update_entity_load_avg(prev, 1);
 	}
 	cfs_rq->curr = NULL;
 }
@@ -1527,7 +1604,8 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	/*
 	 * Ensure that runnable average is periodically updated.
 	 */
-	update_entity_load_avg(curr);
+	update_entity_load_avg(curr, 1);
+	update_cfs_rq_blocked_load(cfs_rq);
 
 	/*
 	 * Update share accounting for long-running entities.
@@ -2386,6 +2464,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
+		update_entity_load_avg(se, 1);
 	}
 
 	if (!se) {
@@ -2447,6 +2526,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
+		update_entity_load_avg(se, 1);
 	}
 
 	if (!se) {
@@ -3483,6 +3563,7 @@ static int update_shares_cpu(struct task_group *tg, int cpu)
 
 	update_rq_clock(rq);
 	update_cfs_load(cfs_rq, 1);
+	update_cfs_rq_blocked_load(cfs_rq);
 
 	/*
 	 * We need to update shares after updating tg->load_weight in
@@ -5220,6 +5301,20 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
 		place_entity(cfs_rq, se, 0);
 		se->vruntime -= cfs_rq->min_vruntime;
 	}
+
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+	/*
+	 * Remove our load contribution from the blocked sum when we leave
+	 * sched_fair, and ensure we don't carry in an old decay_count if we
+	 * switch back.
+	 */
+	if (p->se.avg.decay_count) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
+		__synchronize_entity_decay(&p->se);
+		subtract_blocked_load_contrib(cfs_rq,
+				p->se.avg.load_avg_contrib);
+	}
+#endif
 }
 
 /*
@@ -5266,6 +5361,9 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 #ifndef CONFIG_64BIT
 	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
 #endif
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+	atomic64_set(&cfs_rq->decay_counter, 1);
+#endif
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7e35ae0..1df06e9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -229,7 +229,9 @@ struct cfs_rq {
 	 * This allows for the description of both thread and group usage (in
 	 * the FAIR_GROUP_SCHED case).
 	 */
-	u64 runnable_load_avg;
+	u64 runnable_load_avg, blocked_load_avg;
+	atomic64_t decay_counter;
+	u64 last_decay;
 #endif
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	struct rq *rq;	/* cpu runqueue to which this cfs_rq is attached */




* [patch 05/16] sched: add an rq migration call-back to sched_class
  2012-08-23 14:14 [patch 00/16] sched: per-entity load-tracking pjt
                   ` (3 preceding siblings ...)
  2012-08-23 14:14 ` [patch 04/16] sched: maintain the load contribution of blocked entities pjt
@ 2012-08-23 14:14 ` pjt
  2012-10-24  9:47   ` [tip:sched/core] sched: Add " tip-bot for Paul Turner
  2012-08-23 14:14 ` [patch 06/16] sched: account for blocked load waking back up pjt
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 59+ messages in thread
From: pjt @ 2012-08-23 14:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Venki Pallipadi,
	Ben Segall, Mike Galbraith, Vincent Guittot, Nikunj A Dadhania,
	Morten Rasmussen, Paul E. McKenney, Namhyung Kim

[-- Attachment #1: sched-migrate_rq.patch --]
[-- Type: text/plain, Size: 2444 bytes --]

From: Paul Turner <pjt@google.com>

Since we are now doing bottom-up load accumulation, we need explicit
notification when a task has been re-parented so that the old hierarchy can be
updated.

Adds: migrate_task_rq(struct task_struct *p, int next_cpu)

(The alternative is to do this out of __set_task_cpu, but it was suggested that
this would be a cleaner encapsulation.)

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
---
 include/linux/sched.h |    1 +
 kernel/sched/core.c   |    2 ++
 kernel/sched/fair.c   |   12 ++++++++++++
 3 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7406249..93e27c0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1105,6 +1105,7 @@ struct sched_class {
 
 #ifdef CONFIG_SMP
 	int  (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
+	void (*migrate_task_rq)(struct task_struct *p, int next_cpu);
 
 	void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
 	void (*post_schedule) (struct rq *this_rq);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 33e6fe1..b3e2442 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -952,6 +952,8 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	trace_sched_migrate_task(p, new_cpu);
 
 	if (task_cpu(p) != new_cpu) {
+		if (p->sched_class->migrate_task_rq)
+			p->sched_class->migrate_task_rq(p, new_cpu);
 		p->se.nr_migrations++;
 		perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0);
 	}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0ce8d91..e11e24c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3028,6 +3028,17 @@ unlock:
 
 	return new_cpu;
 }
+
+/*
+ * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
+ * cfs_rq_of(p) references at time of call are still valid and identify the
+ * previous cpu.  However, the caller only guarantees p->pi_lock is held; no
+ * other assumptions, including the state of rq->lock, should be made.
+ */
+static void
+migrate_task_rq_fair(struct task_struct *p, int next_cpu)
+{
+}
 #endif /* CONFIG_SMP */
 
 static unsigned long
@@ -5591,6 +5602,7 @@ const struct sched_class fair_sched_class = {
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_fair,
+	.migrate_task_rq	= migrate_task_rq_fair,
 
 	.rq_online		= rq_online_fair,
 	.rq_offline		= rq_offline_fair,




* [patch 06/16] sched: account for blocked load waking back up
  2012-08-23 14:14 [patch 00/16] sched: per-entity load-tracking pjt
                   ` (4 preceding siblings ...)
  2012-08-23 14:14 ` [patch 05/16] sched: add an rq migration call-back to sched_class pjt
@ 2012-08-23 14:14 ` pjt
       [not found]   ` <CAM4v1pO8SPCmqJTTBHpqwrwuO7noPdskg0RSooxyPsWoE395_A@mail.gmail.com>
  2012-10-24  9:48   ` [tip:sched/core] sched: Account " tip-bot for Paul Turner
  2012-08-23 14:14 ` [patch 07/16] sched: aggregate total task_group load pjt
                   ` (11 subsequent siblings)
  17 siblings, 2 replies; 59+ messages in thread
From: pjt @ 2012-08-23 14:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Venki Pallipadi,
	Ben Segall, Mike Galbraith, Vincent Guittot, Nikunj A Dadhania,
	Morten Rasmussen, Paul E. McKenney, Namhyung Kim

[-- Attachment #1: sched-wakeup_load.patch --]
[-- Type: text/plain, Size: 8017 bytes --]

From: Paul Turner <pjt@google.com>

When a running entity blocks we migrate its tracked load to
cfs_rq->blocked_load_avg.  In the sleep case this occurs while holding
rq->lock and so is a natural transition.  Wake-ups, however, are potentially
asynchronous in the presence of migration and so special care must be taken.

We use an atomic counter to track such migrated load, taking care to match this
with the previously introduced decay counters so that we don't migrate too much
load.
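
A sketch of that hand-off using C11 atomics (the kernel uses atomic64_t under
its own locking rules; the names and types here are illustrative only):

  #include <stdatomic.h>
  #include <stdint.h>

  struct toy_cfs_rq {
          uint64_t blocked_load_avg;
          _Atomic uint64_t removed_load;  /* load removed by remote wake-ups */
  };

  /* Remote side: runs without the owning rq->lock, so only an atomic add. */
  void migrate_away(struct toy_cfs_rq *cfs_rq, uint64_t contrib)
  {
          atomic_fetch_add(&cfs_rq->removed_load, contrib);
  }

  /* Owner side: folded in at the next update under the owning rq->lock. */
  void fold_removed_load(struct toy_cfs_rq *cfs_rq)
  {
          uint64_t removed = atomic_exchange(&cfs_rq->removed_load, 0);

          if (removed < cfs_rq->blocked_load_avg)
                  cfs_rq->blocked_load_avg -= removed;
          else
                  cfs_rq->blocked_load_avg = 0;
  }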

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
---
 kernel/sched/fair.c  |   95 +++++++++++++++++++++++++++++++++++++++++---------
 kernel/sched/sched.h |    2 +
 2 files changed, 78 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e11e24c..9c1f090 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1081,17 +1081,19 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 }
 
 /* Synchronize an entity's decay with its parenting cfs_rq.*/
-static inline void __synchronize_entity_decay(struct sched_entity *se)
+static inline u64 __synchronize_entity_decay(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 decays = atomic64_read(&cfs_rq->decay_counter);
 
 	decays -= se->avg.decay_count;
 	if (!decays)
-		return;
+		return 0;
 
 	se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
 	se->avg.decay_count = 0;
+
+	return decays;
 }
 
 /* Compute the current contribution to load_avg by se, return any delta */
@@ -1144,20 +1146,26 @@ static inline void update_entity_load_avg(struct sched_entity *se,
  * Decay the load contributed by all blocked children and account this so that
  * their contribution may be appropriately discounted when they wake up.
  */
-static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq)
+static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 {
 	u64 now = rq_of(cfs_rq)->clock_task >> 20;
 	u64 decays;
 
 	decays = now - cfs_rq->last_decay;
-	if (!decays)
+	if (!decays && !force_update)
 		return;
 
-	cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg,
-					      decays);
-	atomic64_add(decays, &cfs_rq->decay_counter);
+	if (atomic64_read(&cfs_rq->removed_load)) {
+		u64 removed_load = atomic64_xchg(&cfs_rq->removed_load, 0);
+		subtract_blocked_load_contrib(cfs_rq, removed_load);
+	}
 
-	cfs_rq->last_decay = now;
+	if (decays) {
+		cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg,
+						      decays);
+		atomic64_add(decays, &cfs_rq->decay_counter);
+		cfs_rq->last_decay = now;
+	}
 }
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
@@ -1170,20 +1178,42 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 						  struct sched_entity *se,
 						  int wakeup)
 {
-	/* we track migrations using entity decay_count == 0 */
-	if (unlikely(!se->avg.decay_count)) {
+	/*
+	 * We track migrations using entity decay_count <= 0; on a wake-up
+	 * migration we use a negative decay count to track the remote decays
+	 * accumulated while sleeping.
+	 */
+	if (unlikely(se->avg.decay_count <= 0)) {
 		se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;
+		if (se->avg.decay_count) {
+			/*
+			 * In a wake-up migration we have to approximate the
+			 * time sleeping.  This is because we can't synchronize
+			 * clock_task between the two cpus, and it is not
+			 * guaranteed to be read-safe.  Instead, we can
+			 * approximate this using our carried decays, which are
+			 * explicitly atomically readable.
+			 */
+			se->avg.last_runnable_update -= (-se->avg.decay_count)
+							<< 20;
+			update_entity_load_avg(se, 0);
+			/* Indicate that we're now synchronized and on-rq */
+			se->avg.decay_count = 0;
+		}
 		wakeup = 0;
 	} else {
 		__synchronize_entity_decay(se);
 	}
 
-	if (wakeup)
+	/* migrated tasks did not contribute to our blocked load */
+	if (wakeup) {
 		subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
+		update_entity_load_avg(se, 0);
+	}
 
-	update_entity_load_avg(se, 0);
 	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
-	update_cfs_rq_blocked_load(cfs_rq);
+	/* we force update consideration on load-balancer moves */
+	update_cfs_rq_blocked_load(cfs_rq, !wakeup);
 }
 
 /*
@@ -1196,6 +1226,8 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
 						  int sleep)
 {
 	update_entity_load_avg(se, 1);
+	/* we force update consideration on load-balancer moves */
+	update_cfs_rq_blocked_load(cfs_rq, !sleep);
 
 	cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
 	if (sleep) {
@@ -1213,7 +1245,8 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
 					   struct sched_entity *se,
 					   int sleep) {}
-static inline void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq) {}
+static inline void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq,
+					      int force_update) {}
 #endif
 
 static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -1605,7 +1638,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	 * Ensure that runnable average is periodically updated.
 	 */
 	update_entity_load_avg(curr, 1);
-	update_cfs_rq_blocked_load(cfs_rq);
+	update_cfs_rq_blocked_load(cfs_rq, 1);
 
 	/*
 	 * Update share accounting for long-running entities.
@@ -3038,6 +3071,19 @@ unlock:
 static void
 migrate_task_rq_fair(struct task_struct *p, int next_cpu)
 {
+	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+	/*
+	 * Load tracking: accumulate removed load so that it can be processed
+	 * when we next update owning cfs_rq under rq->lock.  Tasks contribute
+	 * to blocked load iff they have a positive decay-count.  It can never
+	 * be negative here since on-rq tasks have decay-count == 0.
+	 */
+	if (se->avg.decay_count) {
+		se->avg.decay_count = -__synchronize_entity_decay(se);
+		atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
+	}
 }
 #endif /* CONFIG_SMP */
 
@@ -3574,7 +3620,7 @@ static int update_shares_cpu(struct task_group *tg, int cpu)
 
 	update_rq_clock(rq);
 	update_cfs_load(cfs_rq, 1);
-	update_cfs_rq_blocked_load(cfs_rq);
+	update_cfs_rq_blocked_load(cfs_rq, 1);
 
 	/*
 	 * We need to update shares after updating tg->load_weight in
@@ -5374,12 +5420,14 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 #endif
 #if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
 	atomic64_set(&cfs_rq->decay_counter, 1);
+	atomic64_set(&cfs_rq->removed_load, 0);
 #endif
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static void task_move_group_fair(struct task_struct *p, int on_rq)
 {
+	struct cfs_rq *cfs_rq;
 	/*
 	 * If the task was not on the rq at the time of this cgroup movement
 	 * it must have been asleep, sleeping tasks keep their ->vruntime
@@ -5411,8 +5459,19 @@ static void task_move_group_fair(struct task_struct *p, int on_rq)
 	if (!on_rq)
 		p->se.vruntime -= cfs_rq_of(&p->se)->min_vruntime;
 	set_task_rq(p, task_cpu(p));
-	if (!on_rq)
-		p->se.vruntime += cfs_rq_of(&p->se)->min_vruntime;
+	if (!on_rq) {
+		cfs_rq = cfs_rq_of(&p->se);
+		p->se.vruntime += cfs_rq->min_vruntime;
+#ifdef CONFIG_SMP
+		/*
+		 * migrate_task_rq_fair() will have removed our previous
+		 * contribution, but we must synchronize for ongoing future
+		 * decay.
+		 */
+		p->se.avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
+		cfs_rq->blocked_load_avg += p->se.avg.load_avg_contrib;
+#endif
+	}
 }
 
 void free_fair_sched_group(struct task_group *tg)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1df06e9..f282c62 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -230,7 +230,7 @@ struct cfs_rq {
 	 * the FAIR_GROUP_SCHED case).
 	 */
 	u64 runnable_load_avg, blocked_load_avg;
-	atomic64_t decay_counter;
+	atomic64_t decay_counter, removed_load;
 	u64 last_decay;
 #endif
 #ifdef CONFIG_FAIR_GROUP_SCHED




* [patch 07/16] sched: aggregate total task_group load
  2012-08-23 14:14 [patch 00/16] sched: per-entity load-tracking pjt
                   ` (5 preceding siblings ...)
  2012-08-23 14:14 ` [patch 06/16] sched: account for blocked load waking back up pjt
@ 2012-08-23 14:14 ` pjt
  2012-10-24  9:49   ` [tip:sched/core] sched: Aggregate " tip-bot for Paul Turner
  2012-08-23 14:14 ` [patch 08/16] sched: compute load contribution by a group entity pjt
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 59+ messages in thread
From: pjt @ 2012-08-23 14:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Venki Pallipadi,
	Ben Segall, Mike Galbraith, Vincent Guittot, Nikunj A Dadhania,
	Morten Rasmussen, Paul E. McKenney, Namhyung Kim

[-- Attachment #1: sched-compute_tg_load.patch --]
[-- Type: text/plain, Size: 2988 bytes --]

From: Paul Turner <pjt@google.com>

Maintain a global running sum of the average load seen on each cfs_rq belonging
to each task group so that it may be used in calculating an appropriate
shares:weight distribution.
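
Roughly, with toy types (illustrative only; the 1/8 drift threshold mirrors
the one used in the patch below to limit traffic on the shared counter):

  #include <stdatomic.h>
  #include <stdint.h>
  #include <stdlib.h>

  struct toy_tg {
          _Atomic int64_t load_avg;       /* global sum across all cpus */
  };

  struct toy_cfs_rq {
          struct toy_tg *tg;
          int64_t runnable_load_avg, blocked_load_avg;
          int64_t tg_load_contrib;        /* what this cfs_rq last published */
  };

  /* Publish this cfs_rq's contribution, but only when it has drifted by
   * more than 1/8th of the published value (or when forced). */
  void update_tg_load_contrib(struct toy_cfs_rq *cfs_rq, int force)
  {
          int64_t contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
          int64_t delta = contrib - cfs_rq->tg_load_contrib;

          if (force || llabs(delta) > cfs_rq->tg_load_contrib / 8) {
                  atomic_fetch_add(&cfs_rq->tg->load_avg, delta);
                  cfs_rq->tg_load_contrib += delta;
          }
  }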

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
---
 kernel/sched/debug.c |    4 ++++
 kernel/sched/fair.c  |   22 ++++++++++++++++++++++
 kernel/sched/sched.h |    4 ++++
 3 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2d2e2b3..2908923 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -230,6 +230,10 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			cfs_rq->runnable_load_avg);
 	SEQ_printf(m, "  .%-30s: %lld\n", "blocked_load_avg",
 			cfs_rq->blocked_load_avg);
+	SEQ_printf(m, "  .%-30s: %ld\n", "tg_load_avg",
+			atomic64_read(&cfs_rq->tg->load_avg));
+	SEQ_printf(m, "  .%-30s: %lld\n", "tg_load_contrib",
+			cfs_rq->tg_load_contrib);
 #endif
 
 	print_cfs_group_stats(m, cpu, cfs_rq->tg);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9c1f090..40259cc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1096,6 +1096,26 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
 	return decays;
 }
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
+						 int force_update)
+{
+	struct task_group *tg = cfs_rq->tg;
+	s64 tg_contrib;
+
+	tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
+	tg_contrib -= cfs_rq->tg_load_contrib;
+
+	if (force_update || abs64(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
+		atomic64_add(tg_contrib, &tg->load_avg);
+		cfs_rq->tg_load_contrib += tg_contrib;
+	}
+}
+#else
+static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
+						 int force_update) {}
+#endif
+
 /* Compute the current contribution to load_avg by se, return any delta */
 static long __update_entity_load_avg_contrib(struct sched_entity *se)
 {
@@ -1166,6 +1186,8 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 		atomic64_add(decays, &cfs_rq->decay_counter);
 		cfs_rq->last_decay = now;
 	}
+
+	__update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
 }
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f282c62..0c453e7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -112,6 +112,7 @@ struct task_group {
 	unsigned long shares;
 
 	atomic_t load_weight;
+	atomic64_t load_avg;
 #endif
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -232,6 +233,9 @@ struct cfs_rq {
 	u64 runnable_load_avg, blocked_load_avg;
 	atomic64_t decay_counter, removed_load;
 	u64 last_decay;
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	u64 tg_load_contrib;
+#endif
 #endif
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	struct rq *rq;	/* cpu runqueue to which this cfs_rq is attached */




* [patch 08/16] sched: compute load contribution by a group entity
  2012-08-23 14:14 [patch 00/16] sched: per-entity load-tracking pjt
                   ` (6 preceding siblings ...)
  2012-08-23 14:14 ` [patch 07/16] sched: aggregate total task_group load pjt
@ 2012-08-23 14:14 ` pjt
  2012-10-24  9:50   ` [tip:sched/core] sched: Compute " tip-bot for Paul Turner
  2012-08-23 14:14 ` [patch 09/16] sched: normalize tg load contributions against runnable time pjt
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 59+ messages in thread
From: pjt @ 2012-08-23 14:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Venki Pallipadi,
	Ben Segall, Mike Galbraith, Vincent Guittot, Nikunj A Dadhania,
	Morten Rasmussen, Paul E. McKenney, Namhyung Kim

[-- Attachment #1: sched-cfs_rq_contrib.patch --]
[-- Type: text/plain, Size: 2182 bytes --]

From: Paul Turner <pjt@google.com>

Unlike task entities, which have a fixed weight, group entities instead own a
fraction of their parenting task_group's shares as their contributed weight.

Compute this fraction so that we can correctly account for hierarchies and
shared entity nodes.
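
The two contribution formulas can be sketched as plain functions (illustrative
types only; overflow handling and fixed-point weight scaling are omitted):

  #include <stdint.h>

  /* Task entity: its runnable fraction scaled by its own weight. */
  uint64_t task_contrib(uint64_t weight, uint32_t runnable_sum, uint32_t period)
  {
          return weight * runnable_sum / (period + 1);
  }

  /* Group entity: the slice of tg->shares proportional to how much of the
   * group's total load lives on this cpu's cfs_rq. */
  uint64_t group_contrib(uint64_t tg_shares, uint64_t cfs_rq_tg_load_contrib,
                         uint64_t tg_load_avg)
  {
          return tg_shares * cfs_rq_tg_load_contrib / (tg_load_avg + 1);
  }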

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
---
 kernel/sched/fair.c |   33 +++++++++++++++++++++++++++------
 1 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40259cc..92ef5f1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1111,22 +1111,43 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
 		cfs_rq->tg_load_contrib += tg_contrib;
 	}
 }
+
+static inline void __update_group_entity_contrib(struct sched_entity *se)
+{
+	struct cfs_rq *cfs_rq = group_cfs_rq(se);
+	struct task_group *tg = cfs_rq->tg;
+	u64 contrib;
+
+	contrib = cfs_rq->tg_load_contrib * tg->shares;
+	se->avg.load_avg_contrib = div64_u64(contrib,
+					     atomic64_read(&tg->load_avg) + 1);
+}
 #else
 static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
 						 int force_update) {}
+static inline void __update_group_entity_contrib(struct sched_entity *se) {}
 #endif
 
+static inline void __update_task_entity_contrib(struct sched_entity *se)
+{
+	u32 contrib;
+
+	/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
+	contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
+	contrib /= (se->avg.runnable_avg_period + 1);
+	se->avg.load_avg_contrib = scale_load(contrib);
+}
+
 /* Compute the current contribution to load_avg by se, return any delta */
 static long __update_entity_load_avg_contrib(struct sched_entity *se)
 {
 	long old_contrib = se->avg.load_avg_contrib;
 
-	if (!entity_is_task(se))
-		return 0;
-
-	se->avg.load_avg_contrib = div64_u64(se->avg.runnable_avg_sum *
-					     se->load.weight,
-					     se->avg.runnable_avg_period + 1);
+	if (entity_is_task(se)) {
+		__update_task_entity_contrib(se);
+	} else {
+		__update_group_entity_contrib(se);
+	}
 
 	return se->avg.load_avg_contrib - old_contrib;
 }




* [patch 09/16] sched: normalize tg load contributions against runnable time
  2012-08-23 14:14 [patch 00/16] sched: per-entity load-tracking pjt
                   ` (7 preceding siblings ...)
  2012-08-23 14:14 ` [patch 08/16] sched: compute load contribution by a group entity pjt
@ 2012-08-23 14:14 ` pjt
  2012-10-24  9:51   ` [tip:sched/core] sched: Normalize " tip-bot for Paul Turner
  2012-08-23 14:14 ` [patch 10/16] sched: maintain runnable averages across throttled periods pjt
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 59+ messages in thread
From: pjt @ 2012-08-23 14:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Venki Pallipadi,
	Ben Segall, Mike Galbraith, Vincent Guittot, Nikunj A Dadhania,
	Morten Rasmussen, Paul E. McKenney, Namhyung Kim

[-- Attachment #1: sched-normalize_runnable_shares.patch --]
[-- Type: text/plain, Size: 5440 bytes --]

From: Paul Turner <pjt@google.com>

Entities of equal weight should receive equitable distribution of cpu time.
This is challenging in the case of a task_group's shares as execution may be
occurring on multiple cpus simultaneously.

To handle this we divide up the shares into weights proportionate with the load
on each cfs_rq.  This does not, however, account for the fact that the sum of
the parts may be less than one cpu and so we need to normalize:
  load(tg) = min(runnable_avg(tg), 1) * tg->shares
where runnable_avg is the aggregate fraction of time over which the task_group
had runnable children.
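
A sketch of the normalization in fixed point (illustrative only; 1024 stands
in for NICE_0_LOAD, i.e. one fully-busy cpu):

  #include <stdint.h>

  #define ONE_CPU 1024    /* stand-in for NICE_0_LOAD: one fully-busy cpu */

  /* Scale a group entity's contribution down when the group as a whole was
   * runnable for less than one cpu's worth of time. */
  uint64_t normalize_group_contrib(uint64_t contrib, uint64_t tg_runnable_avg)
  {
          if (tg_runnable_avg < ONE_CPU)
                  contrib = contrib * tg_runnable_avg / ONE_CPU;
          return contrib;
  }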

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
---
 kernel/sched/debug.c |    4 ++++
 kernel/sched/fair.c  |   56 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |    2 ++
 3 files changed, 62 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2908923..71b0ea3 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -234,6 +234,10 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			atomic64_read(&cfs_rq->tg->load_avg));
 	SEQ_printf(m, "  .%-30s: %lld\n", "tg_load_contrib",
 			cfs_rq->tg_load_contrib);
+	SEQ_printf(m, "  .%-30s: %d\n", "tg_runnable_contrib",
+			cfs_rq->tg_runnable_contrib);
+	SEQ_printf(m, "  .%-30s: %d\n", "tg->runnable_avg",
+			atomic_read(&cfs_rq->tg->runnable_avg));
 #endif
 
 	print_cfs_group_stats(m, cpu, cfs_rq->tg);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 92ef5f1..47a7998 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1112,19 +1112,73 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
 	}
 }
 
+/*
+ * Aggregate cfs_rq runnable averages into an equivalent task_group
+ * representation for computing load contributions.
+ */
+static inline void __update_tg_runnable_avg(struct sched_avg *sa,
+						  struct cfs_rq *cfs_rq)
+{
+	struct task_group *tg = cfs_rq->tg;
+	long contrib;
+
+	/* The fraction of a cpu used by this cfs_rq */
+	contrib = div_u64(sa->runnable_avg_sum << NICE_0_SHIFT,
+			  sa->runnable_avg_period + 1);
+	contrib -= cfs_rq->tg_runnable_contrib;
+
+	if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
+		atomic_add(contrib, &tg->runnable_avg);
+		cfs_rq->tg_runnable_contrib += contrib;
+	}
+}
+
 static inline void __update_group_entity_contrib(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = group_cfs_rq(se);
 	struct task_group *tg = cfs_rq->tg;
+	int runnable_avg;
+
 	u64 contrib;
 
 	contrib = cfs_rq->tg_load_contrib * tg->shares;
 	se->avg.load_avg_contrib = div64_u64(contrib,
 					     atomic64_read(&tg->load_avg) + 1);
+
+	/*
+	 * For group entities we need to compute a correction term in the case
+	 * that they are consuming <1 cpu so that we would contribute the same
+	 * load as a task of equal weight.
+	 *
+	 * Explicitly co-ordinating this measurement would be expensive, but
+	 * fortunately the sum of each cpu's contribution forms a usable
+	 * lower-bound on the true value.
+	 *
+	 * Consider the aggregate of 2 contributions.  Either they are disjoint
+	 * (and the sum represents the true value) or they overlap and we are
+	 * understating by the aggregate of their overlap.
+	 *
+	 * Extending this to N cpus, for a given overlap, the maximum amount we
+	 * understate is then n_i(n_i+1)/2 * w_i where n_i is the number of
+	 * cpus that overlap for this interval and w_i is the interval width.
+	 *
+	 * On a small machine, the first term is well-bounded which bounds the
+	 * total error since w_i is a subset of the period.  Whereas on a
+	 * larger machine, while this first term can be larger, if w_i is of
+	 * consequential size we are guaranteed to see n_i*w_i quickly converge
+	 * to our upper bound of 1-cpu.
+	 */
+	runnable_avg = atomic_read(&tg->runnable_avg);
+	if (runnable_avg < NICE_0_LOAD) {
+		se->avg.load_avg_contrib *= runnable_avg;
+		se->avg.load_avg_contrib >>= NICE_0_SHIFT;
+	}
 }
 #else
 static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
 						 int force_update) {}
+static inline void __update_tg_runnable_avg(struct sched_avg *sa,
+						  struct cfs_rq *cfs_rq) {}
 static inline void __update_group_entity_contrib(struct sched_entity *se) {}
 #endif
 
@@ -1146,6 +1200,7 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se)
 	if (entity_is_task(se)) {
 		__update_task_entity_contrib(se);
 	} else {
+		__update_tg_runnable_avg(&se->avg, group_cfs_rq(se));
 		__update_group_entity_contrib(se);
 	}
 
@@ -1214,6 +1269,7 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 {
 	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
+	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
 }
 
 /* Add the load generated by se into cfs_rq's child load-average */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0c453e7..1474bf2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -113,6 +113,7 @@ struct task_group {
 
 	atomic_t load_weight;
 	atomic64_t load_avg;
+	atomic_t runnable_avg;
 #endif
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -234,6 +235,7 @@ struct cfs_rq {
 	atomic64_t decay_counter, removed_load;
 	u64 last_decay;
 #ifdef CONFIG_FAIR_GROUP_SCHED
+	u32 tg_runnable_contrib;
 	u64 tg_load_contrib;
 #endif
 #endif



^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [patch 10/16] sched: maintain runnable averages across throttled periods
  2012-08-23 14:14 [patch 00/16] sched: per-entity load-tracking pjt
                   ` (8 preceding siblings ...)
  2012-08-23 14:14 ` [patch 09/16] sched: normalize tg load contributions against runnable time pjt
@ 2012-08-23 14:14 ` pjt
  2012-10-24  9:52   ` [tip:sched/core] sched: Maintain " tip-bot for Paul Turner
  2012-08-23 14:14 ` [patch 11/16] sched: replace update_shares weight distribution with per-entity computation pjt
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 59+ messages in thread
From: pjt @ 2012-08-23 14:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Venki Pallipadi,
	Ben Segall, Mike Galbraith, Vincent Guittot, Nikunj A Dadhania,
	Morten Rasmussen, Paul E. McKenney, Namhyung Kim

[-- Attachment #1: sched-bwc_load.patch --]
[-- Type: text/plain, Size: 5662 bytes --]

From: Paul Turner <pjt@google.com>

With bandwidth control tracked entities may cease execution according to user
specified bandwidth limits.  Charging this time as either throttled or blocked,
however, is incorrect and would falsely skew in either direction.

What we actually want is for any throttled periods to be "invisible" to
load-tracking as they are removed from the system for that interval and
contribute normally otherwise.

Do this by moderating the progression of time to omit any periods in which the
entity belonged to a throttled hierarchy.
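
To illustrate (times purely illustrative): suppose a cfs_rq is throttled from
t=10ms to t=20ms of rq->clock_task.  While throttled, its task clock (the
cfs_rq_clock_task() helper added below) stays frozen at 10ms; on unthrottle the
10ms of throttled time is folded into throttled_clock_task_time, so the
load-tracking clock simply resumes from 10ms:

  rq->clock_task:       0 ...... 10 ......... 20 ...... 30   (ms)
  throttled:                     [------------]
  cfs_rq_clock_task():  0 ...... 10 (frozen) .......... 20   (ms)

The averages therefore see a shortened, contiguous timeline rather than a long
runnable or blocked interval.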

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
---
 kernel/sched/fair.c  |   50 ++++++++++++++++++++++++++++++++++++++++----------
 kernel/sched/sched.h |    3 ++-
 2 files changed, 42 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 47a7998..e7decc9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1216,15 +1216,26 @@ static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
 		cfs_rq->blocked_load_avg = 0;
 }
 
+static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
+
 /* Update a sched_entity's runnable average */
 static inline void update_entity_load_avg(struct sched_entity *se,
 					  int update_cfs_rq)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	long contrib_delta;
+	u64 now;
 
-	if (!__update_entity_runnable_avg(rq_of(cfs_rq)->clock_task, &se->avg,
-					  se->on_rq))
+	/*
+	 * For a group entity we need to use its owned cfs_rq_clock_task() in
+	 * case it is the parent of a throttled hierarchy.
+	 */
+	if (entity_is_task(se))
+		now = cfs_rq_clock_task(cfs_rq);
+	else
+		now = cfs_rq_clock_task(group_cfs_rq(se));
+
+	if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq))
 		return;
 
 	contrib_delta = __update_entity_load_avg_contrib(se);
@@ -1244,7 +1255,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
  */
 static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 {
-	u64 now = rq_of(cfs_rq)->clock_task >> 20;
+	u64 now = cfs_rq_clock_task(cfs_rq) >> 20;
 	u64 decays;
 
 	decays = now - cfs_rq->last_decay;
@@ -1835,6 +1846,15 @@ static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
 	return &tg->cfs_bandwidth;
 }
 
+/* rq->clock_task normalized against any time this cfs_rq has spent throttled */
+static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq)
+{
+	if (unlikely(cfs_rq->throttle_count))
+		return cfs_rq->throttled_clock_task;
+
+	return rq_of(cfs_rq)->clock_task - cfs_rq->throttled_clock_task_time;
+}
+
 /* returns 0 on failure to allocate runtime */
 static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
@@ -1985,6 +2005,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 		cfs_rq->load_stamp += delta;
 		cfs_rq->load_last += delta;
 
+		/* adjust cfs_rq_clock_task() */
+		cfs_rq->throttled_clock_task_time += rq->clock_task -
+					     cfs_rq->throttled_clock_task;
+
 		/* update entity weight now that we are on_rq again */
 		update_cfs_shares(cfs_rq);
 	}
@@ -1999,8 +2023,10 @@ static int tg_throttle_down(struct task_group *tg, void *data)
 	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
 
 	/* group is entering throttled state, record last load */
-	if (!cfs_rq->throttle_count)
+	if (!cfs_rq->throttle_count) {
 		update_cfs_load(cfs_rq, 0);
+		cfs_rq->throttled_clock_task = rq->clock_task;
+	}
 	cfs_rq->throttle_count++;
 
 	return 0;
@@ -2015,7 +2041,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 
 	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
 
-	/* account load preceding throttle */
+	/* freeze hierarchy runnable averages while throttled */
 	rcu_read_lock();
 	walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
 	rcu_read_unlock();
@@ -2039,7 +2065,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 		rq->nr_running -= task_delta;
 
 	cfs_rq->throttled = 1;
-	cfs_rq->throttled_timestamp = rq->clock;
+	cfs_rq->throttled_clock = rq->clock;
 	raw_spin_lock(&cfs_b->lock);
 	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
 	raw_spin_unlock(&cfs_b->lock);
@@ -2057,10 +2083,9 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 
 	cfs_rq->throttled = 0;
 	raw_spin_lock(&cfs_b->lock);
-	cfs_b->throttled_time += rq->clock - cfs_rq->throttled_timestamp;
+	cfs_b->throttled_time += rq->clock - cfs_rq->throttled_clock;
 	list_del_rcu(&cfs_rq->throttled_list);
 	raw_spin_unlock(&cfs_b->lock);
-	cfs_rq->throttled_timestamp = 0;
 
 	update_rq_clock(rq);
 	/* update hierarchical throttle state */
@@ -2460,8 +2485,13 @@ void unthrottle_offline_cfs_rqs(struct rq *rq)
 }
 
 #else /* CONFIG_CFS_BANDWIDTH */
-static __always_inline
-void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, unsigned long delta_exec) {}
+static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq)
+{
+	return rq_of(cfs_rq)->clock_task;
+}
+
+static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+				     unsigned long delta_exec) {}
 static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1474bf2..cfb8f1b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -281,7 +281,8 @@ struct cfs_rq {
 	u64 runtime_expires;
 	s64 runtime_remaining;
 
-	u64 throttled_timestamp;
+	u64 throttled_clock, throttled_clock_task;
+	u64 throttled_clock_task_time;
 	int throttled, throttle_count;
 	struct list_head throttled_list;
 #endif /* CONFIG_CFS_BANDWIDTH */



^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [patch 11/16] sched: replace update_shares weight distribution with per-entity computation
  2012-08-23 14:14 [patch 00/16] sched: per-entity load-tracking pjt
                   ` (9 preceding siblings ...)
  2012-08-23 14:14 ` [patch 10/16] sched: maintain runnable averages across throttled periods pjt
@ 2012-08-23 14:14 ` pjt
  2012-09-24 19:44   ` "Jan H. Schönherr"
  2012-10-24  9:53   ` [tip:sched/core] sched: Replace " tip-bot for Paul Turner
  2012-08-23 14:14 ` [patch 12/16] sched: refactor update_shares_cpu() -> update_blocked_avgs() pjt
                   ` (6 subsequent siblings)
  17 siblings, 2 replies; 59+ messages in thread
From: pjt @ 2012-08-23 14:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Venki Pallipadi,
	Ben Segall, Mike Galbraith, Vincent Guittot, Nikunj A Dadhania,
	Morten Rasmussen, Paul E. McKenney, Namhyung Kim

[-- Attachment #1: sched-swap_update_shares.patch --]
[-- Type: text/plain, Size: 11652 bytes --]

From: Paul Turner <pjt@google.com>

Now that the machinery is in place to compute contributed load in a bottom-up
fashion, replace the shares distribution code within update_shares()
accordingly.
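
Schematically, the per-cpu share computation below (calc_tg_weight() feeding
calc_cfs_shares()) reduces to roughly:

  shares_cpu ~= tg->shares * cfs_rq->load.weight /
                (tg->load_avg - cfs_rq->tg_load_contrib + cfs_rq->load.weight)

i.e. the group's shares are split in proportion to each cfs_rq's contributed
load, with this cpu's possibly stale contribution swapped for its current
instantaneous weight.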

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
---
 kernel/sched/debug.c |    8 ---
 kernel/sched/fair.c  |  155 +++++++-------------------------------------------
 kernel/sched/sched.h |   36 ++++--------
 3 files changed, 34 insertions(+), 165 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 71b0ea3..2cd3c1b 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -218,14 +218,6 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 #ifdef CONFIG_SMP
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "load_avg",
-			SPLIT_NS(cfs_rq->load_avg));
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "load_period",
-			SPLIT_NS(cfs_rq->load_period));
-	SEQ_printf(m, "  .%-30s: %ld\n", "load_contrib",
-			cfs_rq->load_contribution);
-	SEQ_printf(m, "  .%-30s: %d\n", "load_tg",
-			atomic_read(&cfs_rq->tg->load_weight));
 	SEQ_printf(m, "  .%-30s: %lld\n", "runnable_load_avg",
 			cfs_rq->runnable_load_avg);
 	SEQ_printf(m, "  .%-30s: %lld\n", "blocked_load_avg",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e7decc9..3af150f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -653,9 +653,6 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	return calc_delta_fair(sched_slice(cfs_rq, se), se);
 }
 
-static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update);
-static void update_cfs_shares(struct cfs_rq *cfs_rq);
-
 /*
  * Update the current task's runtime statistics. Skip current tasks that
  * are not in our scheduling class.
@@ -675,10 +672,6 @@ __update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
 
 	curr->vruntime += delta_exec_weighted;
 	update_min_vruntime(cfs_rq);
-
-#if defined CONFIG_SMP && defined CONFIG_FAIR_GROUP_SCHED
-	cfs_rq->load_unacc_exec_time += delta_exec;
-#endif
 }
 
 static void update_curr(struct cfs_rq *cfs_rq)
@@ -801,72 +794,7 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-/* we need this in update_cfs_load and load-balance functions below */
-static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 # ifdef CONFIG_SMP
-static void update_cfs_rq_load_contribution(struct cfs_rq *cfs_rq,
-					    int global_update)
-{
-	struct task_group *tg = cfs_rq->tg;
-	long load_avg;
-
-	load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
-	load_avg -= cfs_rq->load_contribution;
-
-	if (global_update || abs(load_avg) > cfs_rq->load_contribution / 8) {
-		atomic_add(load_avg, &tg->load_weight);
-		cfs_rq->load_contribution += load_avg;
-	}
-}
-
-static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
-{
-	u64 period = sysctl_sched_shares_window;
-	u64 now, delta;
-	unsigned long load = cfs_rq->load.weight;
-
-	if (cfs_rq->tg == &root_task_group || throttled_hierarchy(cfs_rq))
-		return;
-
-	now = rq_of(cfs_rq)->clock_task;
-	delta = now - cfs_rq->load_stamp;
-
-	/* truncate load history at 4 idle periods */
-	if (cfs_rq->load_stamp > cfs_rq->load_last &&
-	    now - cfs_rq->load_last > 4 * period) {
-		cfs_rq->load_period = 0;
-		cfs_rq->load_avg = 0;
-		delta = period - 1;
-	}
-
-	cfs_rq->load_stamp = now;
-	cfs_rq->load_unacc_exec_time = 0;
-	cfs_rq->load_period += delta;
-	if (load) {
-		cfs_rq->load_last = now;
-		cfs_rq->load_avg += delta * load;
-	}
-
-	/* consider updating load contribution on each fold or truncate */
-	if (global_update || cfs_rq->load_period > period
-	    || !cfs_rq->load_period)
-		update_cfs_rq_load_contribution(cfs_rq, global_update);
-
-	while (cfs_rq->load_period > period) {
-		/*
-		 * Inline assembly required to prevent the compiler
-		 * optimising this loop into a divmod call.
-		 * See __iter_div_u64_rem() for another example of this.
-		 */
-		asm("" : "+rm" (cfs_rq->load_period));
-		cfs_rq->load_period /= 2;
-		cfs_rq->load_avg /= 2;
-	}
-
-	if (!cfs_rq->curr && !cfs_rq->nr_running && !cfs_rq->load_avg)
-		list_del_leaf_cfs_rq(cfs_rq);
-}
-
 static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
 {
 	long tg_weight;
@@ -876,8 +804,8 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
 	 * to gain a more accurate current total weight. See
 	 * update_cfs_rq_load_contribution().
 	 */
-	tg_weight = atomic_read(&tg->load_weight);
-	tg_weight -= cfs_rq->load_contribution;
+	tg_weight = atomic64_read(&tg->load_avg);
+	tg_weight -= cfs_rq->tg_load_contrib;
 	tg_weight += cfs_rq->load.weight;
 
 	return tg_weight;
@@ -901,27 +829,11 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
 
 	return shares;
 }
-
-static void update_entity_shares_tick(struct cfs_rq *cfs_rq)
-{
-	if (cfs_rq->load_unacc_exec_time > sysctl_sched_shares_window) {
-		update_cfs_load(cfs_rq, 0);
-		update_cfs_shares(cfs_rq);
-	}
-}
 # else /* CONFIG_SMP */
-static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
-{
-}
-
 static inline long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
 {
 	return tg->shares;
 }
-
-static inline void update_entity_shares_tick(struct cfs_rq *cfs_rq)
-{
-}
 # endif /* CONFIG_SMP */
 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 			    unsigned long weight)
@@ -939,6 +851,8 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 		account_entity_enqueue(cfs_rq, se);
 }
 
+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
+
 static void update_cfs_shares(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg;
@@ -958,17 +872,9 @@ static void update_cfs_shares(struct cfs_rq *cfs_rq)
 	reweight_entity(cfs_rq_of(se), se, shares);
 }
 #else /* CONFIG_FAIR_GROUP_SCHED */
-static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
-{
-}
-
 static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
 {
 }
-
-static inline void update_entity_shares_tick(struct cfs_rq *cfs_rq)
-{
-}
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 #ifdef CONFIG_SMP
@@ -1484,7 +1390,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	update_cfs_load(cfs_rq, 0);
 	enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
@@ -1581,7 +1486,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (se != cfs_rq->curr)
 		__dequeue_entity(cfs_rq, se);
 	se->on_rq = 0;
-	update_cfs_load(cfs_rq, 0);
 	account_entity_dequeue(cfs_rq, se);
 
 	/*
@@ -1750,11 +1654,6 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	update_entity_load_avg(curr, 1);
 	update_cfs_rq_blocked_load(cfs_rq, 1);
 
-	/*
-	 * Update share accounting for long-running entities.
-	 */
-	update_entity_shares_tick(cfs_rq);
-
 #ifdef CONFIG_SCHED_HRTICK
 	/*
 	 * queued ticks are scheduled to match the slice, so don't bother
@@ -1999,18 +1898,9 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 	cfs_rq->throttle_count--;
 #ifdef CONFIG_SMP
 	if (!cfs_rq->throttle_count) {
-		u64 delta = rq->clock_task - cfs_rq->load_stamp;
-
-		/* leaving throttled state, advance shares averaging windows */
-		cfs_rq->load_stamp += delta;
-		cfs_rq->load_last += delta;
-
 		/* adjust cfs_rq_clock_task() */
 		cfs_rq->throttled_clock_task_time += rq->clock_task -
 					     cfs_rq->throttled_clock_task;
-
-		/* update entity weight now that we are on_rq again */
-		update_cfs_shares(cfs_rq);
 	}
 #endif
 
@@ -2022,11 +1912,9 @@ static int tg_throttle_down(struct task_group *tg, void *data)
 	struct rq *rq = data;
 	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
 
-	/* group is entering throttled state, record last load */
-	if (!cfs_rq->throttle_count) {
-		update_cfs_load(cfs_rq, 0);
+	/* group is entering throttled state, stop time */
+	if (!cfs_rq->throttle_count)
 		cfs_rq->throttled_clock_task = rq->clock_task;
-	}
 	cfs_rq->throttle_count++;
 
 	return 0;
@@ -2624,7 +2512,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 		update_entity_load_avg(se, 1);
 	}
@@ -2686,7 +2573,6 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 		update_entity_load_avg(se, 1);
 	}
@@ -3735,27 +3621,34 @@ next:
  */
 static int update_shares_cpu(struct task_group *tg, int cpu)
 {
+	struct sched_entity *se;
 	struct cfs_rq *cfs_rq;
 	unsigned long flags;
 	struct rq *rq;
 
-	if (!tg->se[cpu])
-		return 0;
-
 	rq = cpu_rq(cpu);
+	se = tg->se[cpu];
 	cfs_rq = tg->cfs_rq[cpu];
 
 	raw_spin_lock_irqsave(&rq->lock, flags);
 
 	update_rq_clock(rq);
-	update_cfs_load(cfs_rq, 1);
 	update_cfs_rq_blocked_load(cfs_rq, 1);
 
-	/*
-	 * We need to update shares after updating tg->load_weight in
-	 * order to adjust the weight of groups with long running tasks.
-	 */
-	update_cfs_shares(cfs_rq);
+	if (se) {
+		update_entity_load_avg(se, 1);
+		/*
+		 * We can pivot on the runnable average decaying to zero for
+		 * list removal since the parent average will always be >=
+		 * child.
+		 */
+		if (se->avg.runnable_avg_sum)
+			update_cfs_shares(cfs_rq);
+		else
+			list_del_leaf_cfs_rq(cfs_rq);
+	} else {
+		update_rq_runnable_avg(rq, rq->nr_running);
+	}
 
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
 
@@ -5685,10 +5578,6 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 
 	cfs_rq->tg = tg;
 	cfs_rq->rq = rq;
-#ifdef CONFIG_SMP
-	/* allow initial update_cfs_load() to truncate */
-	cfs_rq->load_stamp = 1;
-#endif
 	init_cfs_rq_runtime(cfs_rq);
 
 	tg->cfs_rq[cpu] = cfs_rq;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cfb8f1b..89a0e38 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -234,11 +234,21 @@ struct cfs_rq {
 	u64 runnable_load_avg, blocked_load_avg;
 	atomic64_t decay_counter, removed_load;
 	u64 last_decay;
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	u32 tg_runnable_contrib;
 	u64 tg_load_contrib;
-#endif
-#endif
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
+	/*
+	 *   h_load = weight * f(tg)
+	 *
+	 * Where f(tg) is the recursive weight fraction assigned to
+	 * this group.
+	 */
+	unsigned long h_load;
+#endif /* CONFIG_SMP */
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	struct rq *rq;	/* cpu runqueue to which this cfs_rq is attached */
 
@@ -254,28 +264,6 @@ struct cfs_rq {
 	struct list_head leaf_cfs_rq_list;
 	struct task_group *tg;	/* group that "owns" this runqueue */
 
-#ifdef CONFIG_SMP
-	/*
-	 *   h_load = weight * f(tg)
-	 *
-	 * Where f(tg) is the recursive weight fraction assigned to
-	 * this group.
-	 */
-	unsigned long h_load;
-
-	/*
-	 * Maintaining per-cpu shares distribution for group scheduling
-	 *
-	 * load_stamp is the last time we updated the load average
-	 * load_last is the last time we updated the load average and saw load
-	 * load_unacc_exec_time is currently unaccounted execution time
-	 */
-	u64 load_avg;
-	u64 load_period;
-	u64 load_stamp, load_last, load_unacc_exec_time;
-
-	unsigned long load_contribution;
-#endif /* CONFIG_SMP */
 #ifdef CONFIG_CFS_BANDWIDTH
 	int runtime_enabled;
 	u64 runtime_expires;



^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [patch 12/16] sched: refactor update_shares_cpu() -> update_blocked_avgs()
  2012-08-23 14:14 [patch 00/16] sched: per-entity load-tracking pjt
                   ` (10 preceding siblings ...)
  2012-08-23 14:14 ` [patch 11/16] sched: replace update_shares weight distribution with per-entity computation pjt
@ 2012-08-23 14:14 ` pjt
  2012-10-24  9:54   ` [tip:sched/core] sched: Refactor " tip-bot for Paul Turner
  2012-08-23 14:14 ` [patch 13/16] sched: update_cfs_shares at period edge pjt
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 59+ messages in thread
From: pjt @ 2012-08-23 14:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Venki Pallipadi,
	Ben Segall, Mike Galbraith, Vincent Guittot, Nikunj A Dadhania,
	Morten Rasmussen, Paul E. McKenney, Namhyung Kim

[-- Attachment #1: sched-improve_update_shares.patch --]
[-- Type: text/plain, Size: 3317 bytes --]

From: Paul Turner <pjt@google.com>

Now that running entities maintain their own load-averages, the work we must do
in update_shares() is largely restricted to the periodic decay of blocked
entities.  This allows us to be a little less pessimistic regarding our
occupancy on rq->lock and the associated rq->clock updates required.
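
Schematically, the locking pattern changes from one critical section per
task_group:

  for_each_leaf_cfs_rq(rq, cfs_rq):
  	lock(rq->lock); update_rq_clock(rq); <update cfs_rq>; unlock(rq->lock);

to a single critical section per cpu:

  lock(rq->lock);
  update_rq_clock(rq);
  for_each_leaf_cfs_rq(rq, cfs_rq):
  	<update cfs_rq>;
  unlock(rq->lock);

trading a lock round-trip and rq->clock update per task_group for one slightly
longer hold.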

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
---
 kernel/sched/fair.c |   50 +++++++++++++++++++++++---------------------------
 1 files changed, 23 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3af150f..44cd4be 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3619,20 +3619,15 @@ next:
 /*
  * update tg->load_weight by folding this cpu's load_avg
  */
-static int update_shares_cpu(struct task_group *tg, int cpu)
+static void __update_blocked_averages_cpu(struct task_group *tg, int cpu)
 {
-	struct sched_entity *se;
-	struct cfs_rq *cfs_rq;
-	unsigned long flags;
-	struct rq *rq;
-
-	rq = cpu_rq(cpu);
-	se = tg->se[cpu];
-	cfs_rq = tg->cfs_rq[cpu];
+	struct sched_entity *se = tg->se[cpu];
+	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	/* throttled entities do not contribute to load */
+	if (throttled_hierarchy(cfs_rq))
+		return;
 
-	update_rq_clock(rq);
 	update_cfs_rq_blocked_load(cfs_rq, 1);
 
 	if (se) {
@@ -3647,32 +3642,33 @@ static int update_shares_cpu(struct task_group *tg, int cpu)
 		else
 			list_del_leaf_cfs_rq(cfs_rq);
 	} else {
+		struct rq *rq = rq_of(cfs_rq);
 		update_rq_runnable_avg(rq, rq->nr_running);
 	}
-
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
-
-	return 0;
 }
 
-static void update_shares(int cpu)
+static void update_blocked_averages(int cpu)
 {
-	struct cfs_rq *cfs_rq;
 	struct rq *rq = cpu_rq(cpu);
+	struct cfs_rq *cfs_rq;
+	unsigned long flags;
 
-	rcu_read_lock();
+	raw_spin_lock_irqsave(&rq->lock, flags);
+	update_rq_clock(rq);
 	/*
 	 * Iterates the task_group tree in a bottom up fashion, see
 	 * list_add_leaf_cfs_rq() for details.
 	 */
 	for_each_leaf_cfs_rq(rq, cfs_rq) {
-		/* throttled entities do not contribute to load */
-		if (throttled_hierarchy(cfs_rq))
-			continue;
-
-		update_shares_cpu(cfs_rq->tg, cpu);
+		/*
+		 * Note: We may want to consider periodically releasing
+		 * rq->lock around these updates so that creating many task
+		 * groups does not result in continually extending hold time.
+		 */
+		__update_blocked_averages_cpu(cfs_rq->tg, rq->cpu);
 	}
-	rcu_read_unlock();
+
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
 }
 
 /*
@@ -3724,7 +3720,7 @@ static unsigned long task_h_load(struct task_struct *p)
 	return load;
 }
 #else
-static inline void update_shares(int cpu)
+static inline void update_blocked_averages(int cpu)
 {
 }
 
@@ -4793,7 +4789,7 @@ void idle_balance(int this_cpu, struct rq *this_rq)
 	 */
 	raw_spin_unlock(&this_rq->lock);
 
-	update_shares(this_cpu);
+	update_blocked_averages(this_cpu);
 	rcu_read_lock();
 	for_each_domain(this_cpu, sd) {
 		unsigned long interval;
@@ -5053,7 +5049,7 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)
 	int update_next_balance = 0;
 	int need_serialize;
 
-	update_shares(cpu);
+	update_blocked_averages(cpu);
 
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {



^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [patch 13/16] sched: update_cfs_shares at period edge
  2012-08-23 14:14 [patch 00/16] sched: per-entity load-tracking pjt
                   ` (11 preceding siblings ...)
  2012-08-23 14:14 ` [patch 12/16] sched: refactor update_shares_cpu() -> update_blocked_avgs() pjt
@ 2012-08-23 14:14 ` pjt
  2012-09-24 19:51   ` "Jan H. Schönherr"
  2012-10-24  9:55   ` [tip:sched/core] sched: Update_cfs_shares " tip-bot for Paul Turner
  2012-08-23 14:14 ` [patch 14/16] sched: make __update_entity_runnable_avg() fast pjt
                   ` (4 subsequent siblings)
  17 siblings, 2 replies; 59+ messages in thread
From: pjt @ 2012-08-23 14:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Venki Pallipadi,
	Ben Segall, Mike Galbraith, Vincent Guittot, Nikunj A Dadhania,
	Morten Rasmussen, Paul E. McKenney, Namhyung Kim

[-- Attachment #1: sched-update_cfs_shares_at_decay.patch --]
[-- Type: text/plain, Size: 3112 bytes --]

From: Paul Turner <pjt@google.com>

Now that our measurement intervals are small (~1ms) we can amortize the posting
of update_shares() to roughly once per period overflow.  This is a large cost
saving for frequently switching tasks.
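
For a sense of scale (rates purely illustrative): a task bouncing on and off
the runqueue thousands of times per second previously triggered an
update_cfs_shares() on every enqueue and dequeue; with the update keyed to the
~1ms period roll-over it now costs at most on the order of one shares update
per ms per cfs_rq, independent of switch frequency.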

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
---
 kernel/sched/fair.c |   18 ++++++++++--------
 1 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 44cd4be..12e9ae5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1181,6 +1181,7 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 	}
 
 	__update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
+	update_cfs_shares(cfs_rq);
 }
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
@@ -1390,9 +1391,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
 	account_entity_enqueue(cfs_rq, se);
-	update_cfs_shares(cfs_rq);
+	enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
 
 	if (flags & ENQUEUE_WAKEUP) {
 		place_entity(cfs_rq, se, 0);
@@ -1465,7 +1465,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
 
 	update_stats_dequeue(cfs_rq, se);
 	if (flags & DEQUEUE_SLEEP) {
@@ -1485,8 +1484,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	if (se != cfs_rq->curr)
 		__dequeue_entity(cfs_rq, se);
-	se->on_rq = 0;
 	account_entity_dequeue(cfs_rq, se);
+	dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
 
 	/*
 	 * Normalize the entity after updating the min_vruntime because the
@@ -1500,7 +1499,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	return_cfs_rq_runtime(cfs_rq);
 
 	update_min_vruntime(cfs_rq);
-	update_cfs_shares(cfs_rq);
+	se->on_rq = 0;
 }
 
 /*
@@ -2512,8 +2511,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_cfs_shares(cfs_rq);
 		update_entity_load_avg(se, 1);
+		update_cfs_rq_blocked_load(cfs_rq, 0);
 	}
 
 	if (!se) {
@@ -2573,8 +2572,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_cfs_shares(cfs_rq);
 		update_entity_load_avg(se, 1);
+		update_cfs_rq_blocked_load(cfs_rq, 0);
 	}
 
 	if (!se) {
@@ -5620,8 +5619,11 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 		se = tg->se[i];
 		/* Propagate contribution to hierarchy */
 		raw_spin_lock_irqsave(&rq->lock, flags);
-		for_each_sched_entity(se)
+		for_each_sched_entity(se) {
 			update_cfs_shares(group_cfs_rq(se));
+			/* update contribution to parent */
+			update_entity_load_avg(se, 1);
+		}
 		raw_spin_unlock_irqrestore(&rq->lock, flags);
 	}
 



^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [patch 14/16] sched: make __update_entity_runnable_avg() fast
  2012-08-23 14:14 [patch 00/16] sched: per-entity load-tracking pjt
                   ` (12 preceding siblings ...)
  2012-08-23 14:14 ` [patch 13/16] sched: update_cfs_shares at period edge pjt
@ 2012-08-23 14:14 ` pjt
  2012-08-24  8:28   ` Namhyung Kim
  2012-10-24  9:56   ` [tip:sched/core] sched: Make " tip-bot for Paul Turner
  2012-08-23 14:14 ` [patch 15/16] sched: implement usage tracking pjt
                   ` (3 subsequent siblings)
  17 siblings, 2 replies; 59+ messages in thread
From: pjt @ 2012-08-23 14:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Venki Pallipadi,
	Ben Segall, Mike Galbraith, Vincent Guittot, Nikunj A Dadhania,
	Morten Rasmussen, Paul E. McKenney, Namhyung Kim

[-- Attachment #1: sched-fast_decay.patch --]
[-- Type: text/plain, Size: 5573 bytes --]

From: Paul Turner <pjt@google.com>

__update_entity_runnable_avg forms the core of maintaining an entity's runnable
load average.  In this function we charge the accumulated run-time since last
update and handle appropriate decay.  In some cases, e.g. a waking task, this
time interval may be much larger than our period unit.

Fortunately we can exploit some properties of our series to perform decay for a
blocked update in constant time and account the contribution for a running
update in essentially-constant* time.

[*]: Any running entity should be performing updates at the tick, which gives
us a soft limit of 1 jiffy between updates, and we can compute up to a 32-jiffy
update in a single pass.
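
The underlying identity (a sketch of the scheme below): since y^32 = 1/2,

  y^n = (1/2)^(n/32) * y^(n%32)

so decaying by n periods is a right-shift by n/32 plus one fixed-point multiply
from a 32-entry table of y^k, and the contribution of n whole runnable periods,
\Sum 1024*y^i, folds up the same way from a table of partial sums.  For example
n = 70 gives y^70 = (1/2)^2 * y^6: val >>= 2 followed by a single multiply by
the precomputed inverse for y^6.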

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
---
 kernel/sched/fair.c |  123 +++++++++++++++++++++++++++++++++++++++++----------
 1 files changed, 99 insertions(+), 24 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 12e9ae5..b249371 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -879,17 +879,90 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
 
 #ifdef CONFIG_SMP
 /*
+ * We choose a half-life close to 1 scheduling period.
+ * Note: The tables below are dependent on this value.
+ */
+#define LOAD_AVG_PERIOD 32
+#define LOAD_AVG_MAX 47765 /* maximum possible load avg */
+#define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_AVG_MAX */
+
+/* Precomputed fixed inverse multiplies for multiplication by y^n */
+static const u32 runnable_avg_yN_inv[] = {
+	0xffffffff, 0xfa83b2da, 0xf5257d14, 0xefe4b99a, 0xeac0c6e6, 0xe5b906e6,
+	0xe0ccdeeb, 0xdbfbb796, 0xd744fcc9, 0xd2a81d91, 0xce248c14, 0xc9b9bd85,
+	0xc5672a10, 0xc12c4cc9, 0xbd08a39e, 0xb8fbaf46, 0xb504f333, 0xb123f581,
+	0xad583ee9, 0xa9a15ab4, 0xa5fed6a9, 0xa2704302, 0x9ef5325f, 0x9b8d39b9,
+	0x9837f050, 0x94f4efa8, 0x91c3d373, 0x8ea4398a, 0x8b95c1e3, 0x88980e80,
+	0x85aac367, 0x82cd8698,
+};
+
+/*
+ * Precomputed \Sum y^k { 1<=k<=n }.  These are floor(true_value) to prevent
+ * over-estimates when re-combining.
+ */
+static const u32 runnable_avg_yN_sum[] = {
+	    0, 1002, 1982, 2941, 3880, 4798, 5697, 6576, 7437, 8279, 9103,
+	 9909,10698,11470,12226,12966,13690,14398,15091,15769,16433,17082,
+	17718,18340,18949,19545,20128,20698,21256,21802,22336,22859,23371,
+};
+
+/*
  * Approximate:
  *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
  */
 static __always_inline u64 decay_load(u64 val, u64 n)
 {
-	for (; n && val; n--) {
-		val *= 4008;
-		val >>= 12;
+	int local_n;
+	if (!n)
+		return val;
+	else if (unlikely(n > LOAD_AVG_PERIOD * 63))
+		return 0;
+
+	/* will be 32 bits if that's desirable */
+	local_n = n;
+
+	/*
+	 * As y^PERIOD = 1/2, we can combine
+	 *    y^n = 1/2^(n/PERIOD) * y^(n%PERIOD)
+	 * With a look-up table which covers y^n (n<PERIOD)
+	 *
+	 * To achieve constant time decay_load.
+	 */
+	if (unlikely(local_n >= LOAD_AVG_PERIOD)) {
+		val >>= local_n / LOAD_AVG_PERIOD;
+		local_n %= LOAD_AVG_PERIOD;
 	}
 
-	return val;
+	val *= runnable_avg_yN_inv[local_n];
+	return SRR(val, 32);
+}
+
+/*
+ * For updates fully spanning n periods, the contribution to runnable
+ * average will be: \Sum 1024*y^n
+ *
+ * We can compute this reasonably efficiently by combining:
+ *   y^PERIOD = 1/2 with precomputed \Sum 1024*y^n {for  n <PERIOD}
+ */
+static u32 __compute_runnable_contrib(u64 n)
+{
+	u32 contrib = 0;
+
+	if (likely(n <= LOAD_AVG_PERIOD))
+		return runnable_avg_yN_sum[n];
+	else if (unlikely(n >= LOAD_AVG_MAX_N))
+		return LOAD_AVG_MAX;
+
+	/* Compute \Sum k^n combining precomputed values for k^i, \Sum k^j */
+	do {
+		contrib /= 2; /* y^LOAD_AVG_PERIOD = 1/2 */
+		contrib += runnable_avg_yN_sum[LOAD_AVG_PERIOD];
+
+		n -= LOAD_AVG_PERIOD;
+	} while (n > LOAD_AVG_PERIOD);
+
+	contrib = decay_load(contrib, n);
+	return contrib + runnable_avg_yN_sum[n];
 }
 
 /* We can represent the historical contribution to runnable average as the
@@ -923,7 +996,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 							struct sched_avg *sa,
 							int runnable)
 {
-	u64 delta;
+	u64 delta, periods;
+	u32 runnable_contrib;
 	int delta_w, decayed = 0;
 
 	delta = now - sa->last_runnable_update;
@@ -957,25 +1031,26 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 		 * period and accrue it.
 		 */
 		delta_w = 1024 - delta_w;
-		BUG_ON(delta_w > delta);
-		do {
-			if (runnable)
-				sa->runnable_avg_sum += delta_w;
-			sa->runnable_avg_period += delta_w;
-
-			/*
-			 * Remainder of delta initiates a new period, roll over
-			 * the previous.
-			 */
-			sa->runnable_avg_sum =
-				decay_load(sa->runnable_avg_sum, 1);
-			sa->runnable_avg_period =
-				decay_load(sa->runnable_avg_period, 1);
-
-			delta -= delta_w;
-			/* New period is empty */
-			delta_w = 1024;
-		} while (delta >= 1024);
+		if (runnable)
+			sa->runnable_avg_sum += delta_w;
+		sa->runnable_avg_period += delta_w;
+
+		delta -= delta_w;
+
+		/* Figure out how many additional periods this update spans */
+		periods = delta / 1024;
+		delta %= 1024;
+
+		sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
+						  periods + 1);
+		sa->runnable_avg_period = decay_load(sa->runnable_avg_period,
+						     periods + 1);
+
+		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
+		runnable_contrib = __compute_runnable_contrib(periods);
+		if (runnable)
+			sa->runnable_avg_sum += runnable_contrib;
+		sa->runnable_avg_period += runnable_contrib;
 	}
 
 	/* Remainder of delta accrued against u_0` */



^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [patch 15/16] sched: implement usage tracking
  2012-08-23 14:14 [patch 00/16] sched: per-entity load-tracking pjt
                   ` (13 preceding siblings ...)
  2012-08-23 14:14 ` [patch 14/16] sched: make __update_entity_runnable_avg() fast pjt
@ 2012-08-23 14:14 ` pjt
  2012-10-19 12:18   ` Vincent Guittot
  2012-08-23 14:14 ` [patch 16/16] sched: introduce temporary FAIR_GROUP_SCHED dependency for load-tracking pjt
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 59+ messages in thread
From: pjt @ 2012-08-23 14:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Venki Pallipadi,
	Ben Segall, Mike Galbraith, Vincent Guittot, Nikunj A Dadhania,
	Morten Rasmussen, Paul E. McKenney, Namhyung Kim

[-- Attachment #1: sched-account_usage.patch --]
[-- Type: text/plain, Size: 5623 bytes --]

From: Paul Turner <pjt@google.com>

With the framework for runnable tracking now fully in place, per-entity usage
tracking is a simple and low-overhead addition.
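
The distinction being added (numbers purely illustrative): a task that is
runnable 80% of the time but, due to contention, actually on-cpu only 40% of
the time ends up with runnable_avg_sum/runnable_avg_period ~= 0.8 while
usage_avg_sum/runnable_avg_period ~= 0.4; the former reflects demand, the
latter the cpu time actually consumed.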

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
---
 include/linux/sched.h |    1 +
 kernel/sched/debug.c  |    3 +++
 kernel/sched/fair.c   |   33 ++++++++++++++++++++++++++++-----
 kernel/sched/sched.h  |    4 ++--
 4 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 93e27c0..2a4be1f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1150,6 +1150,7 @@ struct sched_avg {
 	u64 last_runnable_update;
 	s64 decay_count;
 	unsigned long load_avg_contrib;
+	u32 usage_avg_sum;
 };
 
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2cd3c1b..b9d54d0 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -94,6 +94,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 #ifdef CONFIG_SMP
 	P(se->avg.runnable_avg_sum);
 	P(se->avg.runnable_avg_period);
+	P(se->avg.usage_avg_sum);
 	P(se->avg.load_avg_contrib);
 	P(se->avg.decay_count);
 #endif
@@ -230,6 +231,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			cfs_rq->tg_runnable_contrib);
 	SEQ_printf(m, "  .%-30s: %d\n", "tg->runnable_avg",
 			atomic_read(&cfs_rq->tg->runnable_avg));
+	SEQ_printf(m, "  .%-30s: %d\n", "tg->usage_avg",
+			atomic_read(&cfs_rq->tg->usage_avg));
 #endif
 
 	print_cfs_group_stats(m, cpu, cfs_rq->tg);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b249371..44a9a11 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -994,7 +994,8 @@ static u32 __compute_runnable_contrib(u64 n)
  */
 static __always_inline int __update_entity_runnable_avg(u64 now,
 							struct sched_avg *sa,
-							int runnable)
+							int runnable,
+							int running)
 {
 	u64 delta, periods;
 	u32 runnable_contrib;
@@ -1033,6 +1034,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 		delta_w = 1024 - delta_w;
 		if (runnable)
 			sa->runnable_avg_sum += delta_w;
+		if (running)
+			sa->usage_avg_sum += delta_w;
 		sa->runnable_avg_period += delta_w;
 
 		delta -= delta_w;
@@ -1045,17 +1048,22 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 						  periods + 1);
 		sa->runnable_avg_period = decay_load(sa->runnable_avg_period,
 						     periods + 1);
+		sa->usage_avg_sum = decay_load(sa->usage_avg_sum, periods + 1);
 
 		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
 		runnable_contrib = __compute_runnable_contrib(periods);
 		if (runnable)
 			sa->runnable_avg_sum += runnable_contrib;
+		if (running)
+			sa->usage_avg_sum += runnable_contrib;
 		sa->runnable_avg_period += runnable_contrib;
 	}
 
 	/* Remainder of delta accrued against u_0` */
 	if (runnable)
 		sa->runnable_avg_sum += delta;
+	if (running)
+		sa->usage_avg_sum += delta;
 	sa->runnable_avg_period += delta;
 
 	return decayed;
@@ -1101,16 +1109,28 @@ static inline void __update_tg_runnable_avg(struct sched_avg *sa,
 						  struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
-	long contrib;
+	long contrib, usage_contrib;
 
 	/* The fraction of a cpu used by this cfs_rq */
 	contrib = div_u64(sa->runnable_avg_sum << NICE_0_SHIFT,
 			  sa->runnable_avg_period + 1);
 	contrib -= cfs_rq->tg_runnable_contrib;
 
-	if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
+	usage_contrib = div_u64(sa->usage_avg_sum << NICE_0_SHIFT,
+			        sa->runnable_avg_period + 1);
+	usage_contrib -= cfs_rq->tg_usage_contrib;
+
+	/*
+	 * contrib/usage at this point represent deltas, only update if they
+	 * are substantive.
+	 */
+	if ((abs(contrib) > cfs_rq->tg_runnable_contrib / 64) ||
+	    (abs(usage_contrib) > cfs_rq->tg_usage_contrib / 64)) {
 		atomic_add(contrib, &tg->runnable_avg);
 		cfs_rq->tg_runnable_contrib += contrib;
+
+		atomic_add(usage_contrib, &tg->usage_avg);
+		cfs_rq->tg_usage_contrib += usage_contrib;
 	}
 }
 
@@ -1216,7 +1236,8 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 	else
 		now = cfs_rq_clock_task(group_cfs_rq(se));
 
-	if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq))
+	if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq,
+					  cfs_rq->curr == se))
 		return;
 
 	contrib_delta = __update_entity_load_avg_contrib(se);
@@ -1261,7 +1282,8 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 {
-	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
+	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable,
+				     runnable);
 	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
 }
 
@@ -1629,6 +1651,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		 */
 		update_stats_wait_end(cfs_rq, se);
 		__dequeue_entity(cfs_rq, se);
+		update_entity_load_avg(se, 1);
 	}
 
 	update_stats_curr_start(cfs_rq, se);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 89a0e38..e14601e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -113,7 +113,7 @@ struct task_group {
 
 	atomic_t load_weight;
 	atomic64_t load_avg;
-	atomic_t runnable_avg;
+	atomic_t runnable_avg, usage_avg;
 #endif
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -236,7 +236,7 @@ struct cfs_rq {
 	u64 last_decay;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-	u32 tg_runnable_contrib;
+	u32 tg_runnable_contrib, tg_usage_contrib;
 	u64 tg_load_contrib;
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 



^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [patch 16/16] sched: introduce temporary FAIR_GROUP_SCHED dependency for load-tracking
  2012-08-23 14:14 [patch 00/16] sched: per-entity load-tracking pjt
                   ` (14 preceding siblings ...)
  2012-08-23 14:14 ` [patch 15/16] sched: implement usage tracking pjt
@ 2012-08-23 14:14 ` pjt
  2012-10-24  9:57   ` [tip:sched/core] sched: Introduce " tip-bot for Paul Turner
  2012-09-24  9:30 ` [patch 00/16] sched: per-entity load-tracking "Jan H. Schönherr"
  2012-11-26 13:08 ` Jassi Brar
  17 siblings, 1 reply; 59+ messages in thread
From: pjt @ 2012-08-23 14:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Venki Pallipadi,
	Ben Segall, Mike Galbraith, Vincent Guittot, Nikunj A Dadhania,
	Morten Rasmussen, Paul E. McKenney, Namhyung Kim

[-- Attachment #1: sched-depend_on_fair.patch --]
[-- Type: text/plain, Size: 4290 bytes --]

From: Paul Turner <pjt@google.com>

While per-entity load-tracking is generally useful beyond computing shares
distribution, e.g. for
  runnable-based load-balance (in progress), governors, power-management, etc.,

these facilities are not yet consumers of the data.  This dependency may be
trivially reverted when the information is required; until then we avoid paying
the overhead for calculations we will not use.
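
Concretely, the guard pattern used below is:

  #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
  	/* Per-entity load-tracking */
  	struct sched_avg	avg;
  #endif

so relaxing the dependency later is a mechanical reversion of these guards back
to CONFIG_SMP alone.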

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
---
 include/linux/sched.h |    8 +++++++-
 kernel/sched/core.c   |    7 ++++++-
 kernel/sched/fair.c   |   13 +++++++++++--
 kernel/sched/sched.h  |    9 ++++++++-
 4 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2a4be1f..da37380 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1213,7 +1213,13 @@ struct sched_entity {
 	/* rq "owned" by this entity/group: */
 	struct cfs_rq		*my_q;
 #endif
-#ifdef CONFIG_SMP
+/*
+ * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
+ * removed when useful for applications beyond shares distribution (e.g.
+ * load-balance).
+ */
+#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+	/* Per-entity load-tracking */
 	struct sched_avg	avg;
 #endif
 };
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b3e2442..e497f59 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1558,7 +1558,12 @@ static void __sched_fork(struct task_struct *p)
 	p->se.vruntime			= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
-#ifdef CONFIG_SMP
+/*
+ * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
+ * removed when useful for applications beyond shares distribution (e.g.
+ * load-balance).
+ */
+#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
 	p->se.avg.runnable_avg_period = 0;
 	p->se.avg.runnable_avg_sum = 0;
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 44a9a11..c46fd45 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -877,7 +877,8 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-#ifdef CONFIG_SMP
+/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
+#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
 /*
  * We choose a half-life close to 1 scheduling period.
  * Note: The tables below are dependent on this value.
@@ -3175,6 +3176,12 @@ unlock:
 }
 
 /*
+ * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
+ * removed when useful for applications beyond shares distribution (e.g.
+ * load-balance).
+ */
+#ifdef CONFIG_FAIR_GROUP_SCHED
+/*
  * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
  * cfs_rq_of(p) references at time of call are still valid and identify the
  * previous cpu.  However, the caller only guarantees p->pi_lock is held; no
@@ -3197,6 +3204,7 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
 		atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
 	}
 }
+#endif
 #endif /* CONFIG_SMP */
 
 static unsigned long
@@ -5775,8 +5783,9 @@ const struct sched_class fair_sched_class = {
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_fair,
+#ifdef CONFIG_FAIR_GROUP_SCHED
 	.migrate_task_rq	= migrate_task_rq_fair,
-
+#endif
 	.rq_online		= rq_online_fair,
 	.rq_offline		= rq_offline_fair,
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e14601e..b3e84f6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -225,6 +225,12 @@ struct cfs_rq {
 #endif
 
 #ifdef CONFIG_SMP
+/*
+ * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
+ * removed when useful for applications beyond shares distribution (e.g.
+ * load-balance).
+ */
+#ifdef CONFIG_FAIR_GROUP_SCHED
 	/*
 	 * CFS Load tracking
 	 * Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -234,7 +240,8 @@ struct cfs_rq {
 	u64 runnable_load_avg, blocked_load_avg;
 	atomic64_t decay_counter, removed_load;
 	u64 last_decay;
-
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+/* These always depend on CONFIG_FAIR_GROUP_SCHED */
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	u32 tg_runnable_contrib, tg_usage_contrib;
 	u64 tg_load_contrib;



^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [patch 01/16] sched: track the runnable average on a per-task entitiy basis
  2012-08-23 14:14 ` [patch 01/16] sched: track the runnable average on a per-task entitiy basis pjt
@ 2012-08-24  8:20   ` Namhyung Kim
  2012-08-28 22:12     ` Paul Turner
  2012-10-24  9:43   ` [tip:sched/core] sched: Track the runnable average on a per-task entity basis tip-bot for Paul Turner
  1 sibling, 1 reply; 59+ messages in thread
From: Namhyung Kim @ 2012-08-24  8:20 UTC (permalink / raw)
  To: pjt
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Venki Pallipadi, Ben Segall, Mike Galbraith, Vincent Guittot,
	Nikunj A Dadhania, Morten Rasmussen, Paul E. McKenney

Hi,

Just typos below..

On Thu, 23 Aug 2012 07:14:23 -0700, pjt wrote:
> From: Paul Turner <pjt@google.com>
>
> Instead of tracking averaging the load parented by a cfs_rq, we can track
> entity load directly.  With the load for a given cfs_rq then being the sum of
> its children.
>
> To do this we represent the historical contribution to runnable average within each
> trailing 1024us of execution as the coefficients of a geometric series.
>
> We can express this for a given task t as:
>   runnable_sum(t) = \Sum u_i * y^i, runnable_avg_period(t) = \Sum 1024 * y^i
>   load(t) = weight_t * runnable_sum(t) / runnable_avg_period(t)
>
> Where: u_i is the usage in the last i`th 1024us period (approximately 1ms) ~ms
> and y is chosen such that y^k = 1/2.  We currently choose k to be 32 which
> roughly translates to about a sched period.
>
> Signed-off-by: Paul Turner <pjt@google.com>
> Reviewed-by: Ben Segall <bsegall@google.com>
> ---
>  include/linux/sched.h |   13 +++++
>  kernel/sched/core.c   |    5 ++
>  kernel/sched/debug.c  |    4 ++
>  kernel/sched/fair.c   |  128 +++++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 150 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f3eebc1..f553da9 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1139,6 +1139,16 @@ struct load_weight {
>  	unsigned long weight, inv_weight;
>  };
>  
> +struct sched_avg {
> +	/*
> +	 * These sums represent an infinite geometric series and so are bound
> +	 * above by 1024/(1-y).  Thus we only need a u32 to store them for for all
> +	 * choices of y < 1-2^(-32)*1024.
> +	 */
> +	u32 runnable_avg_sum, runnable_avg_period;
> +	u64 last_runnable_update;
> +};
> +
>  #ifdef CONFIG_SCHEDSTATS
>  struct sched_statistics {
>  	u64			wait_start;
> @@ -1199,6 +1209,9 @@ struct sched_entity {
>  	/* rq "owned" by this entity/group: */
>  	struct cfs_rq		*my_q;
>  #endif
> +#ifdef CONFIG_SMP
> +	struct sched_avg	avg;
> +#endif
>  };
>  
>  struct sched_rt_entity {
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 78d9c96..fcc3cad 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1556,6 +1556,11 @@ static void __sched_fork(struct task_struct *p)
>  	p->se.vruntime			= 0;
>  	INIT_LIST_HEAD(&p->se.group_node);
>  
> +#ifdef CONFIG_SMP
> +	p->se.avg.runnable_avg_period = 0;
> +	p->se.avg.runnable_avg_sum = 0;
> +#endif
> +
>  #ifdef CONFIG_SCHEDSTATS
>  	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
>  #endif
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 6f79596..61f7097 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -85,6 +85,10 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
>  	P(se->statistics.wait_count);
>  #endif
>  	P(se->load.weight);
> +#ifdef CONFIG_SMP
> +	P(se->avg.runnable_avg_sum);
> +	P(se->avg.runnable_avg_period);
> +#endif
>  #undef PN
>  #undef P
>  }
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 01d3eda..2c53263 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -971,6 +971,125 @@ static inline void update_entity_shares_tick(struct cfs_rq *cfs_rq)
>  }
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>  
> +#ifdef CONFIG_SMP
> +/*
> + * Approximate:
> + *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
> + */
> +static __always_inline u64 decay_load(u64 val, u64 n)
> +{
> +	for (; n && val; n--) {
> +		val *= 4008;
> +		val >>= 12;
> +	}
> +
> +	return val;
> +}
> +
> +/* We can represent the historical contribution to runnable average as the
> + * coefficients of a geometric series.  To do this we sub-divide our runnable
> + * history into segments of approximately 1ms (1024us); label the segment that
> + * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g.
> + *
> + * [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
> + *      p0            p1           p1

Should it be                          p2 ?


> + *     (now)       (~1ms ago)  (~2ms ago)
> + *
> + * Let u_i denote the fraction of p_i that the entity was runnable.
> + *
> + * We then designate the fractions u_i as our co-efficients, yielding the
> + * following representation of historical load:
> + *   u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
> + *
> + * We choose y based on the with of a reasonably scheduling period, fixing:
> + *   y^32 = 0.5
> + *
> + * This means that the contribution to load ~32ms ago (u_32) will be weighted
> + * approximately half as much as the contribution to load within the last ms
> + * (u_0).
> + *
> + * When a period "rolls over" and we have new u_0`, multiplying the previous
> + * sum again by y is sufficient to update:
> + *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
> + *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1]

s/u_{i+1]/u_{i+1}]/

Thanks,
Namhyung


> + */
> +static __always_inline int __update_entity_runnable_avg(u64 now,
> +							struct sched_avg *sa,
> +							int runnable)
> +{
> +	u64 delta;
> +	int delta_w, decayed = 0;
> +
> +	delta = now - sa->last_runnable_update;
> +	/*
> +	 * This should only happen when time goes backwards, which it
> +	 * unfortunately does during sched clock init when we swap over to TSC.
> +	 */
> +	if ((s64)delta < 0) {
> +		sa->last_runnable_update = now;
> +		return 0;
> +	}
> +
> +	/*
> +	 * Use 1024ns as the unit of measurement since it's a reasonable
> +	 * approximation of 1us and fast to compute.
> +	 */
> +	delta >>= 10;
> +	if (!delta)
> +		return 0;
> +	sa->last_runnable_update = now;
> +
> +	/* delta_w is the amount already accumulated against our next period */
> +	delta_w = sa->runnable_avg_period % 1024;
> +	if (delta + delta_w >= 1024) {
> +		/* period roll-over */
> +		decayed = 1;
> +
> +		/*
> +		 * Now that we know we're crossing a period boundary, figure
> +		 * out how much from delta we need to complete the current
> +		 * period and accrue it.
> +		 */
> +		delta_w = 1024 - delta_w;
> +		BUG_ON(delta_w > delta);
> +		do {
> +			if (runnable)
> +				sa->runnable_avg_sum += delta_w;
> +			sa->runnable_avg_period += delta_w;
> +
> +			/*
> +			 * Remainder of delta initiates a new period, roll over
> +			 * the previous.
> +			 */
> +			sa->runnable_avg_sum =
> +				decay_load(sa->runnable_avg_sum, 1);
> +			sa->runnable_avg_period =
> +				decay_load(sa->runnable_avg_period, 1);
> +
> +			delta -= delta_w;
> +			/* New period is empty */
> +			delta_w = 1024;
> +		} while (delta >= 1024);
> +	}
> +
> +	/* Remainder of delta accrued against u_0` */
> +	if (runnable)
> +		sa->runnable_avg_sum += delta;
> +	sa->runnable_avg_period += delta;
> +
> +	return decayed;
> +}
> +
> +/* Update a sched_entity's runnable average */
> +static inline void update_entity_load_avg(struct sched_entity *se)
> +{
> +	__update_entity_runnable_avg(rq_of(cfs_rq_of(se))->clock_task, &se->avg,
> +				     se->on_rq);
> +}
> +#else
> +static inline void update_entity_load_avg(struct sched_entity *se) {}
> +#endif
> +
>  static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>  #ifdef CONFIG_SCHEDSTATS
> @@ -1097,6 +1216,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	 */
>  	update_curr(cfs_rq);
>  	update_cfs_load(cfs_rq, 0);
> +	update_entity_load_avg(se);
>  	account_entity_enqueue(cfs_rq, se);
>  	update_cfs_shares(cfs_rq);
>  
> @@ -1171,6 +1291,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	 * Update run-time statistics of the 'current'.
>  	 */
>  	update_curr(cfs_rq);
> +	update_entity_load_avg(se);
>  
>  	update_stats_dequeue(cfs_rq, se);
>  	if (flags & DEQUEUE_SLEEP) {
> @@ -1340,6 +1461,8 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
>  		update_stats_wait_start(cfs_rq, prev);
>  		/* Put 'current' back into the tree. */
>  		__enqueue_entity(cfs_rq, prev);
> +		/* in !on_rq case, update occurred at dequeue */
> +		update_entity_load_avg(prev);
>  	}
>  	cfs_rq->curr = NULL;
>  }
> @@ -1353,6 +1476,11 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>  	update_curr(cfs_rq);
>  
>  	/*
> +	 * Ensure that runnable average is periodically updated.
> +	 */
> +	update_entity_load_avg(curr);
> +
> +	/*
>  	 * Update share accounting for long-running entities.
>  	 */
>  	update_entity_shares_tick(cfs_rq);
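
As a quick sanity check on the constants in the comment above, a small
userspace program (just a sketch, not kernel code) confirms that choosing y
with y^32 = 1/2 weights u_32 at roughly half of u_0, that the 4008/4096
fixed-point multiply used by decay_load() elsewhere in this series is a close
approximation of that y, and that the \Sum 1024*y^i series is bounded by
1024/(1-y):

#include <math.h>
#include <stdio.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);	/* y^32 = 1/2, as above */

	printf("y          = %.9f\n", y);
	printf("y^32       = %.9f  (weight of u_32 relative to u_0)\n",
	       pow(y, 32));
	printf("4008/4096  = %.9f  (decay_load()'s per-period multiplier)\n",
	       4008.0 / 4096.0);
	printf("1024/(1-y) = %.1f  (bound on \\Sum 1024*y^i)\n",
	       1024.0 / (1.0 - y));
	return 0;
}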

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 14/16] sched: make __update_entity_runnable_avg() fast
  2012-08-23 14:14 ` [patch 14/16] sched: make __update_entity_runnable_avg() fast pjt
@ 2012-08-24  8:28   ` Namhyung Kim
  2012-08-28 22:18     ` Paul Turner
  2012-10-24  9:56   ` [tip:sched/core] sched: Make " tip-bot for Paul Turner
  1 sibling, 1 reply; 59+ messages in thread
From: Namhyung Kim @ 2012-08-24  8:28 UTC (permalink / raw)
  To: pjt
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Venki Pallipadi, Ben Segall, Mike Galbraith, Vincent Guittot,
	Nikunj A Dadhania, Morten Rasmussen, Paul E. McKenney

On Thu, 23 Aug 2012 07:14:36 -0700, pjt@google.com wrote:
> From: Paul Turner <pjt@google.com>
>
> __update_entity_runnable_avg forms the core of maintaining an entity's runnable
> load average.  In this function we charge the accumulated run-time since last
> update and handle appropriate decay.  In some cases, e.g. a waking task, this
> time interval may be much larger than our period unit.
>
> Fortunately we can exploit some properties of our series to perform decay for a
> blocked update in constant time and account the contribution for a running
> update in essentially-constant* time.
>
> [*]: For any running entity they should be performing updates at the tick which
> gives us a soft limit of 1 jiffy between updates, and we can compute up to a
> 32 jiffy update in a single pass.
>
> Signed-off-by: Paul Turner <pjt@google.com>
> Reviewed-by: Ben Segall <bsegall@google.com>
> ---
>  kernel/sched/fair.c |  123 +++++++++++++++++++++++++++++++++++++++++----------
>  1 files changed, 99 insertions(+), 24 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 12e9ae5..b249371 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -879,17 +879,90 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
>  
>  #ifdef CONFIG_SMP
>  /*
> + * We choose a half-life close to 1 scheduling period.
> + * Note: The tables below are dependent on this value.
> + */
> +#define LOAD_AVG_PERIOD 32
> +#define LOAD_AVG_MAX 47765 /* maximum possible load avg */
> +#define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_MAX_AVG */
> +
> +/* Precomputed fixed inverse multiplies for multiplication by y^n */
> +static const u32 runnable_avg_yN_inv[] = {
> +	0xffffffff, 0xfa83b2da, 0xf5257d14, 0xefe4b99a, 0xeac0c6e6, 0xe5b906e6,
> +	0xe0ccdeeb, 0xdbfbb796, 0xd744fcc9, 0xd2a81d91, 0xce248c14, 0xc9b9bd85,
> +	0xc5672a10, 0xc12c4cc9, 0xbd08a39e, 0xb8fbaf46, 0xb504f333, 0xb123f581,
> +	0xad583ee9, 0xa9a15ab4, 0xa5fed6a9, 0xa2704302, 0x9ef5325f, 0x9b8d39b9,
> +	0x9837f050, 0x94f4efa8, 0x91c3d373, 0x8ea4398a, 0x8b95c1e3, 0x88980e80,
> +	0x85aac367, 0x82cd8698,
> +};
> +
> +/*
> + * Precomputed \Sum y^k { 1<=k<=n }.  These are floor(true_value) to prevent
> + * over-estimates when re-combining.
> + */
> +static const u32 runnable_avg_yN_sum[] = {
> +	    0, 1002, 1982, 2941, 3880, 4798, 5697, 6576, 7437, 8279, 9103,
> +	 9909,10698,11470,12226,12966,13690,14398,15091,15769,16433,17082,
> +	17718,18340,18949,19545,20128,20698,21256,21802,22336,22859,23371,
> +};
> +
> +/*
>   * Approximate:
>   *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
>   */
>  static __always_inline u64 decay_load(u64 val, u64 n)
>  {
> -	for (; n && val; n--) {
> -		val *= 4008;
> -		val >>= 12;
> +	int local_n;
> +	if (!n)
> +		return val;
> +	else if (unlikely(n > LOAD_AVG_PERIOD * 63))
> +		return 0;
> +
> +	/* will be 32 bits if that's desirable */
> +	local_n = n;
> +
> +	/*
> +	 * As y^PERIOD = 1/2, we can combine
> +	 *    y^n = 1/2^(n/PERIOD) * k^(n%PERIOD)
> +	 * With a look-up table which covers k^n (n<PERIOD)
> +	 *
> +	 * To achieve constant time decay_load.
> +	 */
> +	if (unlikely(local_n >= LOAD_AVG_PERIOD)) {
> +		val >>= local_n / LOAD_AVG_PERIOD;
> +		n %= LOAD_AVG_PERIOD;

s/n/local_n/ ?

Thanks,
Namhyung


>  	}
>  
> -	return val;
> +	val *= runnable_avg_yN_inv[local_n];
> +	return SRR(val, 32);
> +}
> +
> +/*
> + * For updates fully spanning n periods, the contribution to runnable
> + * average will be: \Sum 1024*y^n
> + *
> + * We can compute this reasonably efficiently by combining:
> + *   y^PERIOD = 1/2 with precomputed \Sum 1024*y^n {for  n <PERIOD}
> + */
> +static u32 __compute_runnable_contrib(u64 n)
> +{
> +	u32 contrib = 0;
> +
> +	if (likely(n <= LOAD_AVG_PERIOD))
> +		return runnable_avg_yN_sum[n];
> +	else if (unlikely(n >= LOAD_AVG_MAX_N))
> +		return LOAD_AVG_MAX;
> +
> +	/* Compute \Sum k^n combining precomputed values for k^i, \Sum k^j */
> +	do {
> +		contrib /= 2; /* y^LOAD_AVG_PERIOD = 1/2 */
> +		contrib += runnable_avg_yN_sum[LOAD_AVG_PERIOD];
> +
> +		n -= LOAD_AVG_PERIOD;
> +	} while (n > LOAD_AVG_PERIOD);
> +
> +	contrib = decay_load(contrib, n);
> +	return contrib + runnable_avg_yN_sum[n];
>  }
>  
>  /* We can represent the historical contribution to runnable average as the
> @@ -923,7 +996,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
>  							struct sched_avg *sa,
>  							int runnable)
>  {
> -	u64 delta;
> +	u64 delta, periods;
> +	u32 runnable_contrib;
>  	int delta_w, decayed = 0;
>  
>  	delta = now - sa->last_runnable_update;
> @@ -957,25 +1031,26 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
>  		 * period and accrue it.
>  		 */
>  		delta_w = 1024 - delta_w;
> -		BUG_ON(delta_w > delta);
> -		do {
> -			if (runnable)
> -				sa->runnable_avg_sum += delta_w;
> -			sa->runnable_avg_period += delta_w;
> -
> -			/*
> -			 * Remainder of delta initiates a new period, roll over
> -			 * the previous.
> -			 */
> -			sa->runnable_avg_sum =
> -				decay_load(sa->runnable_avg_sum, 1);
> -			sa->runnable_avg_period =
> -				decay_load(sa->runnable_avg_period, 1);
> -
> -			delta -= delta_w;
> -			/* New period is empty */
> -			delta_w = 1024;
> -		} while (delta >= 1024);
> +		if (runnable)
> +			sa->runnable_avg_sum += delta_w;
> +		sa->runnable_avg_period += delta_w;
> +
> +		delta -= delta_w;
> +
> +		/* Figure out how many additional periods this update spans */
> +		periods = delta / 1024;
> +		delta %= 1024;
> +
> +		sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
> +						  periods + 1);
> +		sa->runnable_avg_period = decay_load(sa->runnable_avg_period,
> +						     periods + 1);
> +
> +		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
> +		runnable_contrib = __compute_runnable_contrib(periods);
> +		if (runnable)
> +			sa->runnable_avg_sum += runnable_contrib;
> +		sa->runnable_avg_period += runnable_contrib;
>  	}
>  
>  	/* Remainder of delta accrued against u_0` */
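
For anyone wanting to double-check the tables above: they appear to be
reproducible with a small userspace generator along the lines below.  This is
only a sketch, assuming runnable_avg_yN_inv[n] is floor(2^32 * y^n) clamped
to 32 bits, and runnable_avg_yN_sum[n] is floor(\Sum 1024*y^k, 1<=k<=n) with
entry 0 being 0:

/* build with e.g.: cc gen_tables.c -lm (file name is just an example) */
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define LOAD_AVG_PERIOD 32

int main(void)
{
	const double y = pow(0.5, 1.0 / LOAD_AVG_PERIOD);	/* y^32 = 0.5 */
	double sum = 0.0;
	int i;

	for (i = 0; i < LOAD_AVG_PERIOD; i++) {
		uint64_t inv = (uint64_t)(pow(y, i) * 4294967296.0);

		if (inv > 0xffffffffULL)	/* clamp the n = 0 entry */
			inv = 0xffffffffULL;
		printf("runnable_avg_yN_inv[%2d] = 0x%08llx\n",
		       i, (unsigned long long)inv);
	}

	for (i = 1; i <= LOAD_AVG_PERIOD; i++) {
		sum += 1024.0 * pow(y, i);
		/* truncation here matches the "floor(true_value)" note */
		printf("runnable_avg_yN_sum[%2d] = %u\n", i, (unsigned)sum);
	}
	return 0;
}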

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 01/16] sched: track the runnable average on a per-task entitiy basis
  2012-08-24  8:20   ` Namhyung Kim
@ 2012-08-28 22:12     ` Paul Turner
  0 siblings, 0 replies; 59+ messages in thread
From: Paul Turner @ 2012-08-28 22:12 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Venki Pallipadi, Ben Segall, Mike Galbraith, Vincent Guittot,
	Nikunj A Dadhania, Morten Rasmussen, Paul E. McKenney

On Fri, Aug 24, 2012 at 1:20 AM, Namhyung Kim <namhyung@kernel.org> wrote:
> Hi,
>
> Just typos below..
>

Applied, Thanks.

> On Thu, 23 Aug 2012 07:14:23 -0700, pjt@google.com wrote:
>> From: Paul Turner <pjt@google.com>
>>
>> Instead of tracking averaging the load parented by a cfs_rq, we can track
>> entity load directly.  With the load for a given cfs_rq then being the
>> sum of
>> its children.
>>
>> To do this we represent the historical contribution to runnable average
>> within each
>> trailing 1024us of execution as the coefficients of a geometric series.
>>
>> We can express this for a given task t as:
>>   runnable_sum(t) = \Sum u_i * y^i, runnable_avg_period(t) = \Sum 1024 *
>> y^i
>>   load(t) = weight_t * runnable_sum(t) / runnable_avg_period(t)
>>
>> Where: u_i is the usage in the last i`th 1024us period (approximately
>> 1ms) ~ms
>> and y is chosen such that y^k = 1/2.  We currently choose k to be 32
>> which
>> roughly translates to about a sched period.
>>
>> Signed-off-by: Paul Turner <pjt@google.com>
>> Reviewed-by: Ben Segall <bsegall@google.com>
>> ---
>>  include/linux/sched.h |   13 +++++
>>  kernel/sched/core.c   |    5 ++
>>  kernel/sched/debug.c  |    4 ++
>>  kernel/sched/fair.c   |  128
>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>  4 files changed, 150 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index f3eebc1..f553da9 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1139,6 +1139,16 @@ struct load_weight {
>>       unsigned long weight, inv_weight;
>>  };
>>
>> +struct sched_avg {
>> +     /*
>> +      * These sums represent an infinite geometric series and so are
>> bound
>> +      * above by 1024/(1-y).  Thus we only need a u32 to store them for
>> for all
>> +      * choices of y < 1-2^(-32)*1024.
>> +      */
>> +     u32 runnable_avg_sum, runnable_avg_period;
>> +     u64 last_runnable_update;
>> +};
>> +
>>  #ifdef CONFIG_SCHEDSTATS
>>  struct sched_statistics {
>>       u64                     wait_start;
>> @@ -1199,6 +1209,9 @@ struct sched_entity {
>>       /* rq "owned" by this entity/group: */
>>       struct cfs_rq           *my_q;
>>  #endif
>> +#ifdef CONFIG_SMP
>> +     struct sched_avg        avg;
>> +#endif
>>  };
>>
>>  struct sched_rt_entity {
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 78d9c96..fcc3cad 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1556,6 +1556,11 @@ static void __sched_fork(struct task_struct *p)
>>       p->se.vruntime                  = 0;
>>       INIT_LIST_HEAD(&p->se.group_node);
>>
>> +#ifdef CONFIG_SMP
>> +     p->se.avg.runnable_avg_period = 0;
>> +     p->se.avg.runnable_avg_sum = 0;
>> +#endif
>> +
>>  #ifdef CONFIG_SCHEDSTATS
>>       memset(&p->se.statistics, 0, sizeof(p->se.statistics));
>>  #endif
>> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>> index 6f79596..61f7097 100644
>> --- a/kernel/sched/debug.c
>> +++ b/kernel/sched/debug.c
>> @@ -85,6 +85,10 @@ static void print_cfs_group_stats(struct seq_file *m,
>> int cpu, struct task_group
>>       P(se->statistics.wait_count);
>>  #endif
>>       P(se->load.weight);
>> +#ifdef CONFIG_SMP
>> +     P(se->avg.runnable_avg_sum);
>> +     P(se->avg.runnable_avg_period);
>> +#endif
>>  #undef PN
>>  #undef P
>>  }
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 01d3eda..2c53263 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -971,6 +971,125 @@ static inline void update_entity_shares_tick(struct
>> cfs_rq *cfs_rq)
>>  }
>>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>>
>> +#ifdef CONFIG_SMP
>> +/*
>> + * Approximate:
>> + *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
>> + */
>> +static __always_inline u64 decay_load(u64 val, u64 n)
>> +{
>> +     for (; n && val; n--) {
>> +             val *= 4008;
>> +             val >>= 12;
>> +     }
>> +
>> +     return val;
>> +}
>> +
>> +/* We can represent the historical contribution to runnable average as
>> the
>> + * coefficients of a geometric series.  To do this we sub-divide our
>> runnable
>> + * history into segments of approximately 1ms (1024us); label the
>> segment that
>> + * occurred N-ms ago p_N, with p_0 corresponding to the current period,
>> e.g.
>> + *
>> + * [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
>> + *      p0            p1           p1
>
> Should it be                          p2 ?
>
>
>> + *     (now)       (~1ms ago)  (~2ms ago)
>> + *
>> + * Let u_i denote the fraction of p_i that the entity was runnable.
>> + *
>> + * We then designate the fractions u_i as our co-efficients, yielding
>> the
>> + * following representation of historical load:
>> + *   u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
>> + *
>> + * We choose y based on the with of a reasonably scheduling period,
>> fixing:
>> + *   y^32 = 0.5
>> + *
>> + * This means that the contribution to load ~32ms ago (u_32) will be
>> weighted
>> + * approximately half as much as the contribution to load within the
>> last ms
>> + * (u_0).
>> + *
>> + * When a period "rolls over" and we have new u_0`, multiplying the
>> previous
>> + * sum again by y is sufficient to update:
>> + *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
>> + *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1]
>
> s/u_{i+1]/u_{i+1}]/
>
> Thanks,
> Namhyung
>
>
>> + */
>> +static __always_inline int __update_entity_runnable_avg(u64 now,
>> +                                                     struct sched_avg
>> *sa,
>> +                                                     int runnable)
>> +{
>> +     u64 delta;
>> +     int delta_w, decayed = 0;
>> +
>> +     delta = now - sa->last_runnable_update;
>> +     /*
>> +      * This should only happen when time goes backwards, which it
>> +      * unfortunately does during sched clock init when we swap over to
>> TSC.
>> +      */
>> +     if ((s64)delta < 0) {
>> +             sa->last_runnable_update = now;
>> +             return 0;
>> +     }
>> +
>> +     /*
>> +      * Use 1024ns as the unit of measurement since it's a reasonable
>> +      * approximation of 1us and fast to compute.
>> +      */
>> +     delta >>= 10;
>> +     if (!delta)
>> +             return 0;
>> +     sa->last_runnable_update = now;
>> +
>> +     /* delta_w is the amount already accumulated against our next
>> period */
>> +     delta_w = sa->runnable_avg_period % 1024;
>> +     if (delta + delta_w >= 1024) {
>> +             /* period roll-over */
>> +             decayed = 1;
>> +
>> +             /*
>> +              * Now that we know we're crossing a period boundary,
>> figure
>> +              * out how much from delta we need to complete the current
>> +              * period and accrue it.
>> +              */
>> +             delta_w = 1024 - delta_w;
>> +             BUG_ON(delta_w > delta);
>> +             do {
>> +                     if (runnable)
>> +                             sa->runnable_avg_sum += delta_w;
>> +                     sa->runnable_avg_period += delta_w;
>> +
>> +                     /*
>> +                      * Remainder of delta initiates a new period, roll
>> over
>> +                      * the previous.
>> +                      */
>> +                     sa->runnable_avg_sum =
>> +                             decay_load(sa->runnable_avg_sum, 1);
>> +                     sa->runnable_avg_period =
>> +                             decay_load(sa->runnable_avg_period, 1);
>> +
>> +                     delta -= delta_w;
>> +                     /* New period is empty */
>> +                     delta_w = 1024;
>> +             } while (delta >= 1024);
>> +     }
>> +
>> +     /* Remainder of delta accrued against u_0` */
>> +     if (runnable)
>> +             sa->runnable_avg_sum += delta;
>> +     sa->runnable_avg_period += delta;
>> +
>> +     return decayed;
>> +}
>> +
>> +/* Update a sched_entity's runnable average */
>> +static inline void update_entity_load_avg(struct sched_entity *se)
>> +{
>> +     __update_entity_runnable_avg(rq_of(cfs_rq_of(se))->clock_task,
>> &se->avg,
>> +                                  se->on_rq);
>> +}
>> +#else
>> +static inline void update_entity_load_avg(struct sched_entity *se) {}
>> +#endif
>> +
>>  static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity
>> *se)
>>  {
>>  #ifdef CONFIG_SCHEDSTATS
>> @@ -1097,6 +1216,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct
>> sched_entity *se, int flags)
>>        */
>>       update_curr(cfs_rq);
>>       update_cfs_load(cfs_rq, 0);
>> +     update_entity_load_avg(se);
>>       account_entity_enqueue(cfs_rq, se);
>>       update_cfs_shares(cfs_rq);
>>
>> @@ -1171,6 +1291,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct
>> sched_entity *se, int flags)
>>        * Update run-time statistics of the 'current'.
>>        */
>>       update_curr(cfs_rq);
>> +     update_entity_load_avg(se);
>>
>>       update_stats_dequeue(cfs_rq, se);
>>       if (flags & DEQUEUE_SLEEP) {
>> @@ -1340,6 +1461,8 @@ static void put_prev_entity(struct cfs_rq *cfs_rq,
>> struct sched_entity *prev)
>>               update_stats_wait_start(cfs_rq, prev);
>>               /* Put 'current' back into the tree. */
>>               __enqueue_entity(cfs_rq, prev);
>> +             /* in !on_rq case, update occurred at dequeue */
>> +             update_entity_load_avg(prev);
>>       }
>>       cfs_rq->curr = NULL;
>>  }
>> @@ -1353,6 +1476,11 @@ entity_tick(struct cfs_rq *cfs_rq, struct
>> sched_entity *curr, int queued)
>>       update_curr(cfs_rq);
>>
>>       /*
>> +      * Ensure that runnable average is periodically updated.
>> +      */
>> +     update_entity_load_avg(curr);
>> +
>> +     /*
>>        * Update share accounting for long-running entities.
>>        */
>>       update_entity_shares_tick(cfs_rq);

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 14/16] sched: make __update_entity_runnable_avg() fast
  2012-08-24  8:28   ` Namhyung Kim
@ 2012-08-28 22:18     ` Paul Turner
  0 siblings, 0 replies; 59+ messages in thread
From: Paul Turner @ 2012-08-28 22:18 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Venki Pallipadi, Ben Segall, Mike Galbraith, Vincent Guittot,
	Nikunj A Dadhania, Morten Rasmussen, Paul E. McKenney

Applied, Thanks.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 06/16] sched: account for blocked load waking back up
       [not found]   ` <CAM4v1pO8SPCmqJTTBHpqwrwuO7noPdskg0RSooxyPsWoE395_A@mail.gmail.com>
@ 2012-09-04 17:29     ` Benjamin Segall
  0 siblings, 0 replies; 59+ messages in thread
From: Benjamin Segall @ 2012-09-04 17:29 UTC (permalink / raw)
  To: Preeti Murthy
  Cc: pjt, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Venki Pallipadi, Mike Galbraith, Vincent Guittot,
	Nikunj A Dadhania, Morten Rasmussen, Paul E. McKenney,
	Namhyung Kim

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=utf-8, Size: 4979 bytes --]

Preeti Murthy <preeti.lkml@gmail.com> writes:

> Hi Paul,
>
>     @@ -1170,20 +1178,42 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
>                                                       struct sched_entity *se,
>                                                       int wakeup)
>      {
>     -       /* we track migrations using entity decay_count == 0 */
>     -       if (unlikely(!se->avg.decay_count)) {
>     +       /*
>     +        * We track migrations using entity decay_count <= 0, on a wake-up
>     +        * migration we use a negative decay count to track the remote decays
>     +        * accumulated while sleeping.
>     +        */
>     +       if (unlikely(se->avg.decay_count <= 0)) {
>                     se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;
>     +               if (se->avg.decay_count) {
>     +                       /*
>     +                        * In a wake-up migration we have to approximate the
>     +                        * time sleeping.  This is because we can't synchronize
>     +                        * clock_task between the two cpus, and it is not
>     +                        * guaranteed to be read-safe.  Instead, we can
>     +                        * approximate this using our carried decays, which are
>     +                        * explicitly atomically readable.
>     +                        */
>     +                       se->avg.last_runnable_update -= (-se->avg.decay_count)
>     +                                                       << 20;
>     +                       update_entity_load_avg(se, 0);
>     +                       /* Indicate that we're now synchronized and on-rq */
>     +                       se->avg.decay_count = 0;
>     +               }
>                     wakeup = 0;
>             } else {
>                     __synchronize_entity_decay(se);
>
>  
> Should not the last_runnable_update of se get updated in __synchronize_entity_decay()?
> Because it contains the value of the runnable update before going to sleep. If not updated, when
> update_entity_load_avg() is called below during a local wakeup, it will decay the runtime load
> for the duration including the time the sched entity has slept.

If you are asking if it should be updated in the else block (local
wakeup, no migration) here, no:

* __synchronize_entity_decay will decay load_avg_contrib to match the
  decay that the cfs_rq has done, keeping those in sync, and ensuring we
  don't subtract too much when we update our current load average.
* clock_task - last_runnable_update will be the amount of time that the
  task has been blocked. update_entity_load_avg (below) and
  __update_entity_runnable_avg will account this time as non-runnable
  time into runnable_avg_sum/period, and from there onto the cfs_rq via
  __update_entity_load_avg_contrib.

Both of these are necessary, and will happen. In the case of !wakeup,
the task is being moved between groups or is migrating between cpus, and
we pretend (to the best of our ability in the case of migrating between
cpus which may have different clock_tasks) that the task has been
runnable this entire time.

In the more general case, no, it is called from migrate_task_rq_fair,
which doesn't have the necessary locks to read clock_task.
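
To make the "<< 20" in the hunk quoted above concrete: each carried decay
corresponds to one 1024us period of clock_task, and since clock_task is in
nanoseconds one period is 1024 * 1024 ns = 1 << 20 ns.  A trivial userspace
illustration (a sketch, with made-up numbers):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	int64_t carried_decays = 5;	/* i.e. -decay_count at wake-up */
	/* one decay period = 1024us = 1024 * 1024 ns = 1 << 20 ns */
	uint64_t approx_sleep_ns = (uint64_t)carried_decays << 20;

	printf("%lld carried decays ~= %llu ns (~%llu us) asleep\n",
	       (long long)carried_decays,
	       (unsigned long long)approx_sleep_ns,
	       (unsigned long long)(approx_sleep_ns / 1000));
	return 0;
}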

>
> This also means that during dequeue_entity_load_avg(), update_entity_load_avg() needs to be
> called to keep the runnable_avg_sum of the sched entity updated till
> before sleep.

Yes, this happens first thing in dequeue_entity_load_avg.
>
>             }
>    
>     -       if (wakeup)
>     +       /* migrated tasks did not contribute to our blocked load */
>     +       if (wakeup) {
>                     subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
>     +               update_entity_load_avg(se, 0);
>     +       }
>    
>     -       update_entity_load_avg(se, 0);
>             cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
>     -       update_cfs_rq_blocked_load(cfs_rq);
>     +       /* we force update consideration on load-balancer moves */
>     +       update_cfs_rq_blocked_load(cfs_rq, !wakeup);
>      }
>    
>       --
>     To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>     the body of a message to majordomo@vger.kernel.org
>     More majordomo info at  http://vger.kernel.org/majordomo-info.html
>     Please read the FAQ at  http://www.tux.org/lkml/
>
> Regards
> Preeti

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 00/16] sched: per-entity load-tracking
  2012-08-23 14:14 [patch 00/16] sched: per-entity load-tracking pjt
                   ` (15 preceding siblings ...)
  2012-08-23 14:14 ` [patch 16/16] sched: introduce temporary FAIR_GROUP_SCHED dependency for load-tracking pjt
@ 2012-09-24  9:30 ` "Jan H. Schönherr"
  2012-09-24 17:16   ` Benjamin Segall
  2012-11-26 13:08 ` Jassi Brar
  17 siblings, 1 reply; 59+ messages in thread
From: "Jan H. Schönherr" @ 2012-09-24  9:30 UTC (permalink / raw)
  To: pjt
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Venki Pallipadi, Ben Segall, Mike Galbraith, Vincent Guittot,
	Nikunj A Dadhania, Morten Rasmussen, Paul E. McKenney,
	Namhyung Kim

Hi Paul.

On 23.08.2012 16:14, pjt@google.com wrote:
> Please find attached the latest version for CFS load-tracking.

Originally, I thought this series also takes care of
the leaf-cfs-runqueue ordering issue described here:

http://lkml.org/lkml/2011/7/18/86

Now that I have had a closer look, I see that it does not take
care of it.

Is there still any reason why the leaf_cfs_rq-list must be sorted?
Or could we just get rid of the ordering requirement, now?

(That seems easier than fixing the issue, as I suspect that
__update_blocked_averages_cpu() might still punch some holes
in the hierarchy in some edge cases.)

I'd like to see that issue resolved. :)

Regards
Jan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 00/16] sched: per-entity load-tracking
  2012-09-24  9:30 ` [patch 00/16] sched: per-entity load-tracking "Jan H. Schönherr"
@ 2012-09-24 17:16   ` Benjamin Segall
  2012-10-05  9:07     ` Paul Turner
  0 siblings, 1 reply; 59+ messages in thread
From: Benjamin Segall @ 2012-09-24 17:16 UTC (permalink / raw)
  To: Jan H. Schönherr
  Cc: pjt, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Venki Pallipadi, Mike Galbraith, Vincent Guittot,
	Nikunj A Dadhania, Morten Rasmussen, Paul E. McKenney,
	Namhyung Kim

"Jan H. Schönherr" <schnhrr@cs.tu-berlin.de> writes:

> Hi Paul.
>
> Am 23.08.2012 16:14, schrieb pjt@google.com:
>> Please find attached the latest version for CFS load-tracking.
>
> Originally, I thought, this series also takes care of
> the leaf-cfs-runqueue ordering issue described here:
>
> http://lkml.org/lkml/2011/7/18/86
>
> Now, that I had a closer look, I see that it does not take
> care of it.
>
> Is there still any reason why the leaf_cfs_rq-list must be sorted?
> Or could we just get rid of the ordering requirement, now?

Ideally yes, since a parent's __update_cfs_rq_tg_load_contrib and
update_cfs_shares still depend on accurate values in
runnable_load_avg/blocked_load_avg from its children. That said, nothing
should completely fall over; it would just make load decay take longer to
propagate to the root.
>
> (That seems easier than to fix the issue, as I suspect that
> __update_blocked_averages_cpu() might still punch some holes
> in the hierarchy in some edge cases.)

Yeah, I suspect it's possible that the parent ends up with a slightly
lower runnable_avg_sum if they're both hovering around the max value
since it isn't quite continuous, and it might be the case that this
difference is large enough to require one more tick to decay to zero.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 11/16] sched: replace update_shares weight distribution with per-entity computation
  2012-08-23 14:14 ` [patch 11/16] sched: replace update_shares weight distribution with per-entity computation pjt
@ 2012-09-24 19:44   ` "Jan H. Schönherr"
  2012-09-24 20:39     ` Benjamin Segall
  2012-10-24  9:53   ` [tip:sched/core] sched: Replace " tip-bot for Paul Turner
  1 sibling, 1 reply; 59+ messages in thread
From: "Jan H. Schönherr" @ 2012-09-24 19:44 UTC (permalink / raw)
  To: pjt
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Venki Pallipadi, Ben Segall, Mike Galbraith, Vincent Guittot,
	Nikunj A Dadhania, Morten Rasmussen, Paul E. McKenney,
	Namhyung Kim

On 23.08.2012 16:14, pjt@google.com wrote:
> From: Paul Turner <pjt@google.com>
> 
> Now that the machinery in place is in place to compute contributed load in a
> bottom up fashion; replace the shares distribution code within update_shares()
> accordingly.

[snip]

>  static int update_shares_cpu(struct task_group *tg, int cpu)
>  {
> +	struct sched_entity *se;
>  	struct cfs_rq *cfs_rq;
>  	unsigned long flags;
>  	struct rq *rq;
>  
> -	if (!tg->se[cpu])
> -		return 0;
> -
>  	rq = cpu_rq(cpu);
> +	se = tg->se[cpu];
>  	cfs_rq = tg->cfs_rq[cpu];
>  
>  	raw_spin_lock_irqsave(&rq->lock, flags);
>  
>  	update_rq_clock(rq);
> -	update_cfs_load(cfs_rq, 1);
>  	update_cfs_rq_blocked_load(cfs_rq, 1);
>  
> -	/*
> -	 * We need to update shares after updating tg->load_weight in
> -	 * order to adjust the weight of groups with long running tasks.
> -	 */
> -	update_cfs_shares(cfs_rq);
> +	if (se) {
> +		update_entity_load_avg(se, 1);
> +		/*
> +		 * We can pivot on the runnable average decaying to zero for
> +		 * list removal since the parent average will always be >=
> +		 * child.
> +		 */
> +		if (se->avg.runnable_avg_sum)
> +			update_cfs_shares(cfs_rq);
> +		else
> +			list_del_leaf_cfs_rq(cfs_rq);

The blocked load, which we decay from this function, is not part of
se->avg.runnable_avg_sum. Is list removal a good idea while there might be
blocked load? We only get here because we are on that list... don't we end up
with a wrong task group load then?

Regards
Jan

> +	} else {
> +		update_rq_runnable_avg(rq, rq->nr_running);
> +	}
>  
>  	raw_spin_unlock_irqrestore(&rq->lock, flags);
>  


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 13/16] sched: update_cfs_shares at period edge
  2012-08-23 14:14 ` [patch 13/16] sched: update_cfs_shares at period edge pjt
@ 2012-09-24 19:51   ` "Jan H. Schönherr"
  2012-10-02 21:09     ` Paul Turner
  2012-10-24  9:55   ` [tip:sched/core] sched: Update_cfs_shares " tip-bot for Paul Turner
  1 sibling, 1 reply; 59+ messages in thread
From: "Jan H. Schönherr" @ 2012-09-24 19:51 UTC (permalink / raw)
  To: pjt
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Venki Pallipadi, Ben Segall, Mike Galbraith, Vincent Guittot,
	Nikunj A Dadhania, Morten Rasmussen, Paul E. McKenney,
	Namhyung Kim

On 23.08.2012 16:14, pjt@google.com wrote:
> From: Paul Turner <pjt@google.com>
> 
> Now that our measurement intervals are small (~1ms) we can amortize the posting
> of update_shares() to be about each period overflow.  This is a large cost
> saving for frequently switching tasks.

[snip]

> @@ -1181,6 +1181,7 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
>  	}
>  
>  	__update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
> +	update_cfs_shares(cfs_rq);
>  }

Here a call to update_cfs_shares() gets added. Doesn't that make the call to
update_cfs_shares() in __update_blocked_averages_cpu() superfluous?


Function pasted here for reference:

static void __update_blocked_averages_cpu(struct task_group *tg, int cpu)
{
	struct sched_entity *se = tg->se[cpu];
	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];

	/* throttled entities do not contribute to load */
	if (throttled_hierarchy(cfs_rq))
		return;

	update_cfs_rq_blocked_load(cfs_rq, 1);

	if (se) {
		update_entity_load_avg(se, 1);
		/*
		 * We can pivot on the runnable average decaying to zero for
		 * list removal since the parent average will always be >=
		 * child.
		 */
		if (se->avg.runnable_avg_sum)
			update_cfs_shares(cfs_rq);
		else
			list_del_leaf_cfs_rq(cfs_rq);
	} else {
		struct rq *rq = rq_of(cfs_rq);
		update_rq_runnable_avg(rq, rq->nr_running);
	}
}


Regards
Jan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 11/16] sched: replace update_shares weight distribution with per-entity computation
  2012-09-24 19:44   ` "Jan H. Schönherr"
@ 2012-09-24 20:39     ` Benjamin Segall
  2012-10-02 21:14       ` Paul Turner
  0 siblings, 1 reply; 59+ messages in thread
From: Benjamin Segall @ 2012-09-24 20:39 UTC (permalink / raw)
  To: Jan H. Schönherr
  Cc: pjt, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Venki Pallipadi, Mike Galbraith, Vincent Guittot,
	Nikunj A Dadhania, Morten Rasmussen, Paul E. McKenney,
	Namhyung Kim

blocked_load_avg ~= \sum_child child.runnable_avg_sum/child.runnable_avg_period * child.weight

The thought was: So if all the children have hit zero runnable_avg_sum
(or in the case of a child task, will when they wake up), then
the blocked_load_avg sum should also hit zero at the same time and we're in theory
fine.

However, child load can be significantly larger than even the maximum
value of runnable_avg_sum (and you can get a full contribution off a new
task with only one tick of runnable_avg_sum anyway...), so
runnable_avg_sum can hit zero first due to rounding. We should case on
runnable_avg_sum || blocked_load_avg.


As a side note, currently decay_load uses SRR, which means none of these
will hit zero anyway if updates occur more often than once per 32ms. I'm
not sure how we missed /that/, but fixes incoming.
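
To illustrate (assuming SRR() is the usual "add half, then shift" rounding
macro in fair.c), a quick userspace test shows why a small value never
reaches zero when it is decayed a single period at a time:

#include <stdint.h>
#include <stdio.h>

/* round-half-up shift, as SRR() in fair.c is understood here */
#define SRR(x, y) (((x) + (1ULL << ((y) - 1))) >> (y))

int main(void)
{
	uint64_t val = 10;	/* a small residual load contribution */
	int i;

	/* decay by one 1024us period each time: val *= y, with y^32 = 1/2;
	 * 0xfa83b2da is runnable_avg_yN_inv[1] from the table quoted earlier
	 */
	for (i = 0; i < 200; i++)
		val = SRR(val * 0xfa83b2daULL, 32);

	/* 10 * y ~= 9.79 rounds back up to 10, so val never decays */
	printf("after 200 single-period decays: %llu\n",
	       (unsigned long long)val);
	return 0;
}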

Thanks,
Ben


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 13/16] sched: update_cfs_shares at period edge
  2012-09-24 19:51   ` "Jan H. Schönherr"
@ 2012-10-02 21:09     ` Paul Turner
  0 siblings, 0 replies; 59+ messages in thread
From: Paul Turner @ 2012-10-02 21:09 UTC (permalink / raw)
  To: Jan H. Schönherr
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Venki Pallipadi, Ben Segall, Mike Galbraith, Vincent Guittot,
	Nikunj A Dadhania, Morten Rasmussen, Paul E. McKenney,
	Namhyung Kim

On Mon, Sep 24, 2012 at 12:51 PM, "Jan H. Schönherr"
<schnhrr@cs.tu-berlin.de> wrote:
> Am 23.08.2012 16:14, schrieb pjt@google.com:
>> From: Paul Turner <pjt@google.com>
>>
>> Now that our measurement intervals are small (~1ms) we can amortize the posting
>> of update_shares() to be about each period overflow.  This is a large cost
>> saving for frequently switching tasks.
>
> [snip]
>
>> @@ -1181,6 +1181,7 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
>>       }
>>
>>       __update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
>> +     update_cfs_shares(cfs_rq);
>>  }
>
> Here a call to update_cfs_shares() gets added. Doesn't that make the call to
> update_cfs_shares() in __update_blocked_averages_cpu() superfluous?

Yes -- updated, Thanks.

>
>
> Function pasted here for reference:
>
> static void __update_blocked_averages_cpu(struct task_group *tg, int cpu)
> {
>         struct sched_entity *se = tg->se[cpu];
>         struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];
>
>         /* throttled entities do not contribute to load */
>         if (throttled_hierarchy(cfs_rq))
>                 return;
>
>         update_cfs_rq_blocked_load(cfs_rq, 1);
>
>         if (se) {
>                 update_entity_load_avg(se, 1);
>                 /*
>                  * We can pivot on the runnable average decaying to zero for
>                  * list removal since the parent average will always be >=
>                  * child.
>                  */
>                 if (se->avg.runnable_avg_sum)
>                         update_cfs_shares(cfs_rq);
>                 else
>                         list_del_leaf_cfs_rq(cfs_rq);
>         } else {
>                 struct rq *rq = rq_of(cfs_rq);
>                 update_rq_runnable_avg(rq, rq->nr_running);
>         }
> }
>
>
> Regards
> Jan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 11/16] sched: replace update_shares weight distribution with per-entity computation
  2012-09-24 20:39     ` Benjamin Segall
@ 2012-10-02 21:14       ` Paul Turner
  0 siblings, 0 replies; 59+ messages in thread
From: Paul Turner @ 2012-10-02 21:14 UTC (permalink / raw)
  To: Benjamin Segall
  Cc: Jan H. Schönherr, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Venki Pallipadi, Mike Galbraith, Vincent Guittot,
	Nikunj A Dadhania, Morten Rasmussen, Paul E. McKenney,
	Namhyung Kim

On Mon, Sep 24, 2012 at 1:39 PM, Benjamin Segall <bsegall@google.com> wrote:
> blocked_load_avg ~= \sum_child child.runnable_avg_sum/child.runnable_avg_period * child.weight
>
> The thought was: So if all the children have hit zero runnable_avg_sum
> (or in the case of a child task, will when they wake up), then
> blocked_avg sum should also hit zero at the same and we're in theory
> fine.
>
> However, child load can be significantly larger than even the maximum
> value of runnable_avg_sum (and you can get a full contribution off a new
> task with only one tick of runnable_avg_sum anyway...), so
> runnable_avg_sum can hit zero first due to rounding. We should case on
> runnable_avg_sum || blocked_load_avg.

Clipping blocked_load_avg when runnable_avg_sum goes to zero is
sufficient.  At this point we cannot contribute to our parent anyway.
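
Roughly along these lines, then (a hypothetical sketch of such a clip, not
necessarily the exact fix that will be posted):

	/*
	 * Hypothetical fix-up: once the entity's runnable_avg_sum has fully
	 * decayed, drop any residual blocked_load_avg left behind by rounding
	 * before taking the cfs_rq off the leaf list.
	 */
	if (!se->avg.runnable_avg_sum) {
		cfs_rq->blocked_load_avg = 0;
		list_del_leaf_cfs_rq(cfs_rq);
	}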

>
>
> As a side note, currently decay_load uses SRR, which means none of these
> will hit zero anyway if updates occur more often than once per 32ms. I'm
> not sure how we missed /that/, but fixes incoming.

Egads, fixed.  We definitely used to have that; I think it got lost in
the "clean everything up, break it into a series, and make it pretty"
step.  Perhaps that explains why some of the numbers in the previous
table were a little different.


>
> Thanks,
> Ben
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 00/16] sched: per-entity load-tracking
  2012-09-24 17:16   ` Benjamin Segall
@ 2012-10-05  9:07     ` Paul Turner
  0 siblings, 0 replies; 59+ messages in thread
From: Paul Turner @ 2012-10-05  9:07 UTC (permalink / raw)
  To: Benjamin Segall
  Cc: Jan H. Schönherr, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Venki Pallipadi, Mike Galbraith, Vincent Guittot,
	Nikunj A Dadhania, Morten Rasmussen, Paul E. McKenney,
	Namhyung Kim

On Mon, Sep 24, 2012 at 10:16 AM, Benjamin Segall <bsegall@google.com> wrote:
> "Jan H. Schönherr" <schnhrr@cs.tu-berlin.de> writes:
>
>> Hi Paul.
>>
>> Am 23.08.2012 16:14, schrieb pjt@google.com:
>>> Please find attached the latest version for CFS load-tracking.
>>
>> Originally, I thought, this series also takes care of
>> the leaf-cfs-runqueue ordering issue described here:
>>
>> http://lkml.org/lkml/2011/7/18/86
>>
>> Now, that I had a closer look, I see that it does not take
>> care of it.
>>
>> Is there still any reason why the leaf_cfs_rq-list must be sorted?
>> Or could we just get rid of the ordering requirement, now?
>
> Ideally yes, since a parent's __update_cfs_rq_tg_load_contrib and
> update_cfs_shares still depend on accurate values in
> runnable_load_avg/blocked_load_avg from its children. That said, nothing
> should completely fall over, it would make load decay take longer to
> propogate to the root.
>>
>> (That seems easier than to fix the issue, as I suspect that
>> __update_blocked_averages_cpu() might still punch some holes
>> in the hierarchy in some edge cases.)
>
> Yeah, I suspect it's possible that the parent ends up with a slightly
> lower runnable_avg_sum if they're both hovering around the max value
> since it isn't quite continuous, and it might be the case that this
> difference is large enough to require one more tick to decay to zero.

OK so coming back to this.  I had a look at this last week and
realized I'd managed to pervert my original intent.

Specifically, the idea here was that, barring numerical rounding errors
around LOAD_AVG_MAX, we can guarantee a parent's runnable average is
greater than or equal to its child's, since a parent is runnable
whenever its child is runnable by definition.  Provided we fix up
possible rounding errors (e.g. with a clamp) this then guarantees
we'll always remove child nodes before parent.
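
(For concreteness, the clamp could be as simple as the following sketch when
walking the list -- illustration only, this is not in the posted series, and
"parent_se" is just a hypothetical name for the entity's parent:)

	if (se->avg.runnable_avg_sum > parent_se->avg.runnable_avg_sum)
		se->avg.runnable_avg_sum = parent_se->avg.runnable_avg_sum;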

So I did this.  Then I thought: oh dear.  When I'd previously proposed
the above as a resolution for out-of-order removal I had not tackled
the problem of correct accounting on bandwidth constrained entities.
It turns out we end up having to "stop" time to handle this
efficiently / correctly.  But this means that we can then no longer
depend on the constraint above as the sums on a sub-tree can
potentially become out of sync.

So I got back to this again tonight and spent a few hours looking at
some alternate approaches to resolve this.  There are a few
games we can play here but after all of that I now re-realize we still
won't handle an on-list grand-parent correctly when the parent/child
are not on tree; and that this is fundamentally an issue with
enqueue's ordering -- no hole punching from parent before child
removal required.

I suspect we might want to do a segment splice on enqueue after all.
Let me sleep on it.

- Paul

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 15/16] sched: implement usage tracking
  2012-08-23 14:14 ` [patch 15/16] sched: implement usage tracking pjt
@ 2012-10-19 12:18   ` Vincent Guittot
  0 siblings, 0 replies; 59+ messages in thread
From: Vincent Guittot @ 2012-10-19 12:18 UTC (permalink / raw)
  To: pjt
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Venki Pallipadi, Ben Segall, Mike Galbraith, Nikunj A Dadhania,
	Morten Rasmussen, Paul E. McKenney, Namhyung Kim

Hi Paul,

I think that you have forgotten to reset .usage_avg_sum in
__sched_fork(), as is already done for .runnable_avg_sum and
.runnable_avg_period.

And it seems that this reset is still missing in the latest version in
your git repo:
http://git.kernel.org/?p=linux/kernel/git/pjt/sched.git;a=blob;f=kernel/sched/core.c;h=df55e2ecdd2398648c7d01e318070d06b845a5b0;hb=refs/heads/load_tracking#l1535
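
Something along these lines in __sched_fork() would cover it -- an untested
sketch of the reset being suggested:

 #ifdef CONFIG_SMP
 	p->se.avg.runnable_avg_period = 0;
 	p->se.avg.runnable_avg_sum = 0;
+	p->se.avg.usage_avg_sum = 0;
 #endif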

Regards,
Vincent

On 23 August 2012 16:14,  <pjt@google.com> wrote:
> From: Paul Turner <pjt@google.com>
>
> With the frame-work for runnable tracking now fully in place.  Per-entity usage
> tracking is a simple and low-overhead addition.
>
> Signed-off-by: Paul Turner <pjt@google.com>
> Reviewed-by: Ben Segall <bsegall@google.com>
> ---
>  include/linux/sched.h |    1 +
>  kernel/sched/debug.c  |    3 +++
>  kernel/sched/fair.c   |   33 ++++++++++++++++++++++++++++-----
>  kernel/sched/sched.h  |    4 ++--
>  4 files changed, 34 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 93e27c0..2a4be1f 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1150,6 +1150,7 @@ struct sched_avg {
>         u64 last_runnable_update;
>         s64 decay_count;
>         unsigned long load_avg_contrib;
> +       u32 usage_avg_sum;
>  };
>
>  #ifdef CONFIG_SCHEDSTATS
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 2cd3c1b..b9d54d0 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -94,6 +94,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
>  #ifdef CONFIG_SMP
>         P(se->avg.runnable_avg_sum);
>         P(se->avg.runnable_avg_period);
> +       P(se->avg.usage_avg_sum);
>         P(se->avg.load_avg_contrib);
>         P(se->avg.decay_count);
>  #endif
> @@ -230,6 +231,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
>                         cfs_rq->tg_runnable_contrib);
>         SEQ_printf(m, "  .%-30s: %d\n", "tg->runnable_avg",
>                         atomic_read(&cfs_rq->tg->runnable_avg));
> +       SEQ_printf(m, "  .%-30s: %d\n", "tg->usage_avg",
> +                       atomic_read(&cfs_rq->tg->usage_avg));
>  #endif
>
>         print_cfs_group_stats(m, cpu, cfs_rq->tg);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b249371..44a9a11 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -994,7 +994,8 @@ static u32 __compute_runnable_contrib(u64 n)
>   */
>  static __always_inline int __update_entity_runnable_avg(u64 now,
>                                                         struct sched_avg *sa,
> -                                                       int runnable)
> +                                                       int runnable,
> +                                                       int running)
>  {
>         u64 delta, periods;
>         u32 runnable_contrib;
> @@ -1033,6 +1034,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
>                 delta_w = 1024 - delta_w;
>                 if (runnable)
>                         sa->runnable_avg_sum += delta_w;
> +               if (running)
> +                       sa->usage_avg_sum += delta_w;
>                 sa->runnable_avg_period += delta_w;
>
>                 delta -= delta_w;
> @@ -1045,17 +1048,22 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
>                                                   periods + 1);
>                 sa->runnable_avg_period = decay_load(sa->runnable_avg_period,
>                                                      periods + 1);
> +               sa->usage_avg_sum = decay_load(sa->usage_avg_sum, periods + 1);
>
>                 /* Efficiently calculate \sum (1..n_period) 1024*y^i */
>                 runnable_contrib = __compute_runnable_contrib(periods);
>                 if (runnable)
>                         sa->runnable_avg_sum += runnable_contrib;
> +               if (running)
> +                       sa->usage_avg_sum += runnable_contrib;
>                 sa->runnable_avg_period += runnable_contrib;
>         }
>
>         /* Remainder of delta accrued against u_0` */
>         if (runnable)
>                 sa->runnable_avg_sum += delta;
> +       if (running)
> +               sa->usage_avg_sum += delta;
>         sa->runnable_avg_period += delta;
>
>         return decayed;
> @@ -1101,16 +1109,28 @@ static inline void __update_tg_runnable_avg(struct sched_avg *sa,
>                                                   struct cfs_rq *cfs_rq)
>  {
>         struct task_group *tg = cfs_rq->tg;
> -       long contrib;
> +       long contrib, usage_contrib;
>
>         /* The fraction of a cpu used by this cfs_rq */
>         contrib = div_u64(sa->runnable_avg_sum << NICE_0_SHIFT,
>                           sa->runnable_avg_period + 1);
>         contrib -= cfs_rq->tg_runnable_contrib;
>
> -       if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
> +       usage_contrib = div_u64(sa->usage_avg_sum << NICE_0_SHIFT,
> +                               sa->runnable_avg_period + 1);
> +       usage_contrib -= cfs_rq->tg_usage_contrib;
> +
> +       /*
> +        * contrib/usage at this point represent deltas, only update if they
> +        * are substantive.
> +        */
> +       if ((abs(contrib) > cfs_rq->tg_runnable_contrib / 64) ||
> +           (abs(usage_contrib) > cfs_rq->tg_usage_contrib / 64)) {
>                 atomic_add(contrib, &tg->runnable_avg);
>                 cfs_rq->tg_runnable_contrib += contrib;
> +
> +               atomic_add(usage_contrib, &tg->usage_avg);
> +               cfs_rq->tg_usage_contrib += usage_contrib;
>         }
>  }
>
> @@ -1216,7 +1236,8 @@ static inline void update_entity_load_avg(struct sched_entity *se,
>         else
>                 now = cfs_rq_clock_task(group_cfs_rq(se));
>
> -       if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq))
> +       if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq,
> +                                         cfs_rq->curr == se))
>                 return;
>
>         contrib_delta = __update_entity_load_avg_contrib(se);
> @@ -1261,7 +1282,8 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
>
>  static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
>  {
> -       __update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
> +       __update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable,
> +                                    runnable);
>         __update_tg_runnable_avg(&rq->avg, &rq->cfs);
>  }
>
> @@ -1629,6 +1651,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>                  */
>                 update_stats_wait_end(cfs_rq, se);
>                 __dequeue_entity(cfs_rq, se);
> +               update_entity_load_avg(se, 1);
>         }
>
>         update_stats_curr_start(cfs_rq, se);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 89a0e38..e14601e 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -113,7 +113,7 @@ struct task_group {
>
>         atomic_t load_weight;
>         atomic64_t load_avg;
> -       atomic_t runnable_avg;
> +       atomic_t runnable_avg, usage_avg;
>  #endif
>
>  #ifdef CONFIG_RT_GROUP_SCHED
> @@ -236,7 +236,7 @@ struct cfs_rq {
>         u64 last_decay;
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> -       u32 tg_runnable_contrib;
> +       u32 tg_runnable_contrib, tg_usage_contrib;
>         u64 tg_load_contrib;
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>
>
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [tip:sched/core] sched: Track the runnable average on a per-task entity basis
  2012-08-23 14:14 ` [patch 01/16] sched: track the runnable average on a per-task entitiy basis pjt
  2012-08-24  8:20   ` Namhyung Kim
@ 2012-10-24  9:43   ` tip-bot for Paul Turner
  2012-10-25  3:28     ` li guang
  1 sibling, 1 reply; 59+ messages in thread
From: tip-bot for Paul Turner @ 2012-10-24  9:43 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, bsegall, hpa, mingo, a.p.zijlstra, pjt, tglx

Commit-ID:  9d85f21c94f7f7a84d0ba686c58aa6d9da58fdbb
Gitweb:     http://git.kernel.org/tip/9d85f21c94f7f7a84d0ba686c58aa6d9da58fdbb
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 4 Oct 2012 13:18:29 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 24 Oct 2012 10:27:18 +0200

sched: Track the runnable average on a per-task entity basis

Instead of tracking averaging the load parented by a cfs_rq, we can track
entity load directly. With the load for a given cfs_rq then being the sum
of its children.

To do this we represent the historical contribution to runnable average
within each trailing 1024us of execution as the coefficients of a
geometric series.

We can express this for a given task t as:

  runnable_sum(t) = \Sum u_i * y^i, runnable_avg_period(t) = \Sum 1024 * y^i
  load(t) = weight_t * runnable_sum(t) / runnable_avg_period(t)

Where: u_i is the usage in the last i`th 1024us period (approximately 1ms)
~ms and y is chosen such that y^k = 1/2.  We currently choose k to be 32 which
roughly translates to about a sched period.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.372695337@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |   13 +++++
 kernel/sched/core.c   |    5 ++
 kernel/sched/debug.c  |    4 ++
 kernel/sched/fair.c   |  129 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 151 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0dd42a0..418fc6d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1095,6 +1095,16 @@ struct load_weight {
 	unsigned long weight, inv_weight;
 };
 
+struct sched_avg {
+	/*
+	 * These sums represent an infinite geometric series and so are bound
+	 * above by 1024/(1-y).  Thus we only need a u32 to store them for for all
+	 * choices of y < 1-2^(-32)*1024.
+	 */
+	u32 runnable_avg_sum, runnable_avg_period;
+	u64 last_runnable_update;
+};
+
 #ifdef CONFIG_SCHEDSTATS
 struct sched_statistics {
 	u64			wait_start;
@@ -1155,6 +1165,9 @@ struct sched_entity {
 	/* rq "owned" by this entity/group: */
 	struct cfs_rq		*my_q;
 #endif
+#ifdef CONFIG_SMP
+	struct sched_avg	avg;
+#endif
 };
 
 struct sched_rt_entity {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..fd9d085 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1524,6 +1524,11 @@ static void __sched_fork(struct task_struct *p)
 	p->se.vruntime			= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
+#ifdef CONFIG_SMP
+	p->se.avg.runnable_avg_period = 0;
+	p->se.avg.runnable_avg_sum = 0;
+#endif
+
 #ifdef CONFIG_SCHEDSTATS
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
 #endif
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 6f79596..61f7097 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -85,6 +85,10 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	P(se->statistics.wait_count);
 #endif
 	P(se->load.weight);
+#ifdef CONFIG_SMP
+	P(se->avg.runnable_avg_sum);
+	P(se->avg.runnable_avg_period);
+#endif
 #undef PN
 #undef P
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b800a1..16d67f9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -971,6 +971,126 @@ static inline void update_entity_shares_tick(struct cfs_rq *cfs_rq)
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
+#ifdef CONFIG_SMP
+/*
+ * Approximate:
+ *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
+ */
+static __always_inline u64 decay_load(u64 val, u64 n)
+{
+	for (; n && val; n--) {
+		val *= 4008;
+		val >>= 12;
+	}
+
+	return val;
+}
+
+/*
+ * We can represent the historical contribution to runnable average as the
+ * coefficients of a geometric series.  To do this we sub-divide our runnable
+ * history into segments of approximately 1ms (1024us); label the segment that
+ * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g.
+ *
+ * [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
+ *      p0            p1           p2
+ *     (now)       (~1ms ago)  (~2ms ago)
+ *
+ * Let u_i denote the fraction of p_i that the entity was runnable.
+ *
+ * We then designate the fractions u_i as our co-efficients, yielding the
+ * following representation of historical load:
+ *   u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
+ *
+ * We choose y based on the with of a reasonably scheduling period, fixing:
+ *   y^32 = 0.5
+ *
+ * This means that the contribution to load ~32ms ago (u_32) will be weighted
+ * approximately half as much as the contribution to load within the last ms
+ * (u_0).
+ *
+ * When a period "rolls over" and we have new u_0`, multiplying the previous
+ * sum again by y is sufficient to update:
+ *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
+ *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
+ */
+static __always_inline int __update_entity_runnable_avg(u64 now,
+							struct sched_avg *sa,
+							int runnable)
+{
+	u64 delta;
+	int delta_w, decayed = 0;
+
+	delta = now - sa->last_runnable_update;
+	/*
+	 * This should only happen when time goes backwards, which it
+	 * unfortunately does during sched clock init when we swap over to TSC.
+	 */
+	if ((s64)delta < 0) {
+		sa->last_runnable_update = now;
+		return 0;
+	}
+
+	/*
+	 * Use 1024ns as the unit of measurement since it's a reasonable
+	 * approximation of 1us and fast to compute.
+	 */
+	delta >>= 10;
+	if (!delta)
+		return 0;
+	sa->last_runnable_update = now;
+
+	/* delta_w is the amount already accumulated against our next period */
+	delta_w = sa->runnable_avg_period % 1024;
+	if (delta + delta_w >= 1024) {
+		/* period roll-over */
+		decayed = 1;
+
+		/*
+		 * Now that we know we're crossing a period boundary, figure
+		 * out how much from delta we need to complete the current
+		 * period and accrue it.
+		 */
+		delta_w = 1024 - delta_w;
+		BUG_ON(delta_w > delta);
+		do {
+			if (runnable)
+				sa->runnable_avg_sum += delta_w;
+			sa->runnable_avg_period += delta_w;
+
+			/*
+			 * Remainder of delta initiates a new period, roll over
+			 * the previous.
+			 */
+			sa->runnable_avg_sum =
+				decay_load(sa->runnable_avg_sum, 1);
+			sa->runnable_avg_period =
+				decay_load(sa->runnable_avg_period, 1);
+
+			delta -= delta_w;
+			/* New period is empty */
+			delta_w = 1024;
+		} while (delta >= 1024);
+	}
+
+	/* Remainder of delta accrued against u_0` */
+	if (runnable)
+		sa->runnable_avg_sum += delta;
+	sa->runnable_avg_period += delta;
+
+	return decayed;
+}
+
+/* Update a sched_entity's runnable average */
+static inline void update_entity_load_avg(struct sched_entity *se)
+{
+	__update_entity_runnable_avg(rq_of(cfs_rq_of(se))->clock_task, &se->avg,
+				     se->on_rq);
+}
+#else
+static inline void update_entity_load_avg(struct sched_entity *se) {}
+#endif
+
 static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 #ifdef CONFIG_SCHEDSTATS
@@ -1097,6 +1217,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 */
 	update_curr(cfs_rq);
 	update_cfs_load(cfs_rq, 0);
+	update_entity_load_avg(se);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
 
@@ -1171,6 +1292,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
+	update_entity_load_avg(se);
 
 	update_stats_dequeue(cfs_rq, se);
 	if (flags & DEQUEUE_SLEEP) {
@@ -1340,6 +1462,8 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 		update_stats_wait_start(cfs_rq, prev);
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);
+		/* in !on_rq case, update occurred at dequeue */
+		update_entity_load_avg(prev);
 	}
 	cfs_rq->curr = NULL;
 }
@@ -1353,6 +1477,11 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	update_curr(cfs_rq);
 
 	/*
+	 * Ensure that runnable average is periodically updated.
+	 */
+	update_entity_load_avg(curr);
+
+	/*
 	 * Update share accounting for long-running entities.
 	 */
 	update_entity_shares_tick(cfs_rq);
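
As a rough illustration of the math in the comment above, here is a
stand-alone user-space sketch (not kernel code; the 4008/4096 constant is
taken from the patch, everything else is hypothetical) that accumulates a
runnable average over 1024us periods and decays it by y per period:

#include <stdint.h>
#include <stdio.h>

/* val * y^n with y^32 ~= 0.5, approximated by repeated 4008/4096 steps */
static uint64_t decay_load(uint64_t val, uint64_t n)
{
	for (; n && val; n--) {
		val *= 4008;
		val >>= 12;
	}
	return val;
}

int main(void)
{
	uint64_t sum = 0, period = 0;
	int i;

	/* toy entity: runnable for the first 16 of 64 one-ms periods */
	for (i = 0; i < 64; i++) {
		sum = decay_load(sum, 1) + (i < 16 ? 1024 : 0);
		period = decay_load(period, 1) + 1024;
	}
	printf("runnable fraction ~= %llu/1024\n",
	       (unsigned long long)(sum * 1024 / period));
	return 0;
}

Old runnable time is discounted by y for every elapsed period, so a
contribution roughly halves after ~32 idle periods.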

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [tip:sched/core] sched: Maintain per-rq runnable averages
  2012-08-23 14:14 ` [patch 02/16] sched: maintain per-rq runnable averages pjt
@ 2012-10-24  9:44   ` tip-bot for Ben Segall
  2012-10-28 10:12   ` [patch 02/16] sched: maintain " Preeti Murthy
  1 sibling, 0 replies; 59+ messages in thread
From: tip-bot for Ben Segall @ 2012-10-24  9:44 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, bsegall, hpa, mingo, a.p.zijlstra, pjt, tglx

Commit-ID:  18bf2805d9b30cb823d4919b42cd230f59c7ce1f
Gitweb:     http://git.kernel.org/tip/18bf2805d9b30cb823d4919b42cd230f59c7ce1f
Author:     Ben Segall <bsegall@google.com>
AuthorDate: Thu, 4 Oct 2012 12:51:20 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 24 Oct 2012 10:27:20 +0200

sched: Maintain per-rq runnable averages

Since runqueues do not have a corresponding sched_entity we instead embed a
sched_avg structure directly.

Signed-off-by: Ben Segall <bsegall@google.com>
Reviewed-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.442637130@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/debug.c |   10 ++++++++--
 kernel/sched/fair.c  |   18 ++++++++++++++++--
 kernel/sched/sched.h |    2 ++
 3 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 61f7097..4240abc 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -61,14 +61,20 @@ static unsigned long nsec_low(unsigned long long nsec)
 static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group *tg)
 {
 	struct sched_entity *se = tg->se[cpu];
-	if (!se)
-		return;
 
 #define P(F) \
 	SEQ_printf(m, "  .%-30s: %lld\n", #F, (long long)F)
 #define PN(F) \
 	SEQ_printf(m, "  .%-30s: %lld.%06ld\n", #F, SPLIT_NS((long long)F))
 
+	if (!se) {
+		struct sched_avg *avg = &cpu_rq(cpu)->avg;
+		P(avg->runnable_avg_sum);
+		P(avg->runnable_avg_period);
+		return;
+	}
+
+
 	PN(se->exec_start);
 	PN(se->vruntime);
 	PN(se->sum_exec_runtime);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 16d67f9..8c5468f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1087,8 +1087,14 @@ static inline void update_entity_load_avg(struct sched_entity *se)
 	__update_entity_runnable_avg(rq_of(cfs_rq_of(se))->clock_task, &se->avg,
 				     se->on_rq);
 }
+
+static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
+{
+	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
+}
 #else
 static inline void update_entity_load_avg(struct sched_entity *se) {}
+static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
 #endif
 
 static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -2340,8 +2346,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_cfs_shares(cfs_rq);
 	}
 
-	if (!se)
+	if (!se) {
+		update_rq_runnable_avg(rq, rq->nr_running);
 		inc_nr_running(rq);
+	}
 	hrtick_update(rq);
 }
 
@@ -2399,8 +2407,10 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_cfs_shares(cfs_rq);
 	}
 
-	if (!se)
+	if (!se) {
 		dec_nr_running(rq);
+		update_rq_runnable_avg(rq, 1);
+	}
 	hrtick_update(rq);
 }
 
@@ -4586,6 +4596,8 @@ void idle_balance(int this_cpu, struct rq *this_rq)
 	if (this_rq->avg_idle < sysctl_sched_migration_cost)
 		return;
 
+	update_rq_runnable_avg(this_rq, 1);
+
 	/*
 	 * Drop the rq->lock, but keep IRQ/preempt disabled.
 	 */
@@ -5083,6 +5095,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		cfs_rq = cfs_rq_of(se);
 		entity_tick(cfs_rq, se, queued);
 	}
+
+	update_rq_runnable_avg(rq, 1);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7a7db09..14b5719 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -467,6 +467,8 @@ struct rq {
 #ifdef CONFIG_SMP
 	struct llist_head wake_list;
 #endif
+
+	struct sched_avg avg;
 };
 
 static inline int cpu_of(struct rq *rq)
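
The change itself is mostly mechanical; the idea, in a stand-alone sketch
(all toy_* names are hypothetical and decay is omitted for brevity), is
that the runqueue owns an average of its own and feeds it "was anything
runnable?" instead of a single entity's state:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct toy_avg { uint64_t runnable_sum, period, last_update; };

struct toy_rq {
	unsigned int nr_running;
	uint64_t clock_task;	/* ns */
	struct toy_avg avg;	/* embedded: there is no backing entity */
};

static void toy_accumulate(struct toy_avg *sa, uint64_t now, bool runnable)
{
	uint64_t delta = now - sa->last_update;

	sa->last_update = now;
	if (runnable)
		sa->runnable_sum += delta;
	sa->period += delta;
}

static void toy_update_rq_avg(struct toy_rq *rq)
{
	toy_accumulate(&rq->avg, rq->clock_task, rq->nr_running > 0);
}

int main(void)
{
	struct toy_rq rq = { .nr_running = 1, .clock_task = 1000,
			     .avg = { .last_update = 1000 } };

	rq.clock_task += 5000;
	toy_update_rq_avg(&rq);		/* 5000ns with work pending */
	rq.nr_running = 0;
	rq.clock_task += 5000;
	toy_update_rq_avg(&rq);		/* 5000ns idle */
	printf("rq runnable %%: %llu\n", (unsigned long long)
	       (rq.avg.runnable_sum * 100 / rq.avg.period));	/* 50 */
	return 0;
}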

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [tip:sched/core] sched: Aggregate load contributed by task entities on parenting cfs_rq
  2012-08-23 14:14 ` [patch 03/16] sched: aggregate load contributed by task entities on parenting cfs_rq pjt
@ 2012-10-24  9:45   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 59+ messages in thread
From: tip-bot for Paul Turner @ 2012-10-24  9:45 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, bsegall, hpa, mingo, a.p.zijlstra, pjt, tglx

Commit-ID:  2dac754e10a5d41d94d2d2365c0345d4f215a266
Gitweb:     http://git.kernel.org/tip/2dac754e10a5d41d94d2d2365c0345d4f215a266
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 4 Oct 2012 13:18:30 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 24 Oct 2012 10:27:21 +0200

sched: Aggregate load contributed by task entities on parenting cfs_rq

For a given task t, we can compute its contribution to load as:

  task_load(t) = runnable_avg(t) * weight(t)

On a parenting cfs_rq we can then aggregate:

  runnable_load(cfs_rq) = \Sum task_load(t), for all runnable children t

Maintain this bottom up, with task entities adding their contributed load to
the parenting cfs_rq sum.  When a task entity's load changes we add the same
delta to the maintained sum.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.514678907@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |    1 +
 kernel/sched/debug.c  |    3 ++
 kernel/sched/fair.c   |   51 +++++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h  |   10 ++++++++-
 4 files changed, 60 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 418fc6d..81d8b1b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1103,6 +1103,7 @@ struct sched_avg {
 	 */
 	u32 runnable_avg_sum, runnable_avg_period;
 	u64 last_runnable_update;
+	unsigned long load_avg_contrib;
 };
 
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 4240abc..c953a89 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -94,6 +94,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 #ifdef CONFIG_SMP
 	P(se->avg.runnable_avg_sum);
 	P(se->avg.runnable_avg_period);
+	P(se->avg.load_avg_contrib);
 #endif
 #undef PN
 #undef P
@@ -224,6 +225,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			cfs_rq->load_contribution);
 	SEQ_printf(m, "  .%-30s: %d\n", "load_tg",
 			atomic_read(&cfs_rq->tg->load_weight));
+	SEQ_printf(m, "  .%-30s: %lld\n", "runnable_load_avg",
+			cfs_rq->runnable_load_avg);
 #endif
 
 	print_cfs_group_stats(m, cpu, cfs_rq->tg);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8c5468f..77af759 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1081,20 +1081,63 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 	return decayed;
 }
 
+/* Compute the current contribution to load_avg by se, return any delta */
+static long __update_entity_load_avg_contrib(struct sched_entity *se)
+{
+	long old_contrib = se->avg.load_avg_contrib;
+
+	if (!entity_is_task(se))
+		return 0;
+
+	se->avg.load_avg_contrib = div64_u64(se->avg.runnable_avg_sum *
+					     se->load.weight,
+					     se->avg.runnable_avg_period + 1);
+
+	return se->avg.load_avg_contrib - old_contrib;
+}
+
 /* Update a sched_entity's runnable average */
 static inline void update_entity_load_avg(struct sched_entity *se)
 {
-	__update_entity_runnable_avg(rq_of(cfs_rq_of(se))->clock_task, &se->avg,
-				     se->on_rq);
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	long contrib_delta;
+
+	if (!__update_entity_runnable_avg(rq_of(cfs_rq)->clock_task, &se->avg,
+					  se->on_rq))
+		return;
+
+	contrib_delta = __update_entity_load_avg_contrib(se);
+	if (se->on_rq)
+		cfs_rq->runnable_load_avg += contrib_delta;
 }
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 {
 	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
 }
+
+/* Add the load generated by se into cfs_rq's child load-average */
+static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
+						  struct sched_entity *se)
+{
+	update_entity_load_avg(se);
+	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
+}
+
+/* Remove se's load from this cfs_rq child load-average */
+static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
+						  struct sched_entity *se)
+{
+	update_entity_load_avg(se);
+	cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
+}
 #else
 static inline void update_entity_load_avg(struct sched_entity *se) {}
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
+static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
+						  struct sched_entity *se) {}
+static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
+						  struct sched_entity *se) {}
 #endif
 
 static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -1223,7 +1266,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 */
 	update_curr(cfs_rq);
 	update_cfs_load(cfs_rq, 0);
-	update_entity_load_avg(se);
+	enqueue_entity_load_avg(cfs_rq, se);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
 
@@ -1298,7 +1341,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	update_entity_load_avg(se);
+	dequeue_entity_load_avg(cfs_rq, se);
 
 	update_stats_dequeue(cfs_rq, se);
 	if (flags & DEQUEUE_SLEEP) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 14b5719..e653973 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -222,6 +222,15 @@ struct cfs_rq {
 	unsigned int nr_spread_over;
 #endif
 
+#ifdef CONFIG_SMP
+	/*
+	 * CFS Load tracking
+	 * Under CFS, load is tracked on a per-entity basis and aggregated up.
+	 * This allows for the description of both thread and group usage (in
+	 * the FAIR_GROUP_SCHED case).
+	 */
+	u64 runnable_load_avg;
+#endif
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	struct rq *rq;	/* cpu runqueue to which this cfs_rq is attached */
 
@@ -1214,4 +1223,3 @@ static inline u64 irq_time_read(int cpu)
 }
 #endif /* CONFIG_64BIT */
 #endif /* CONFIG_IRQ_TIME_ACCOUNTING */
-
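
For illustration, a stand-alone sketch of the delta-based aggregation
described above (hypothetical toy_* names, fixed inputs standing in for the
decayed sums):

#include <stdint.h>
#include <stdio.h>

struct toy_task {
	uint64_t weight;
	uint64_t runnable_sum, runnable_period;	/* decayed sums */
	uint64_t load_contrib;			/* last value added to parent */
};

struct toy_cfs_rq { uint64_t runnable_load_avg; };

/* task_load(t) = runnable_avg(t) * weight(t); return the delta */
static int64_t update_contrib(struct toy_task *t)
{
	int64_t old = (int64_t)t->load_contrib;

	t->load_contrib = t->runnable_sum * t->weight /
			  (t->runnable_period + 1);
	return (int64_t)t->load_contrib - old;
}

int main(void)
{
	struct toy_cfs_rq rq = { 0 };
	struct toy_task t = { .weight = 1024, .runnable_sum = 512,
			      .runnable_period = 1024 };

	rq.runnable_load_avg += update_contrib(&t);	/* enqueue-style add */
	t.runnable_sum = 768;				/* task got busier */
	rq.runnable_load_avg += update_contrib(&t);	/* apply only the delta */
	printf("runnable_load_avg = %llu\n",
	       (unsigned long long)rq.runnable_load_avg);
	return 0;
}

Keeping the last published contribution per task means the parent sum never
has to be rebuilt by iterating its children.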

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [tip:sched/core] sched: Maintain the load contribution of blocked entities
  2012-08-23 14:14 ` [patch 04/16] sched: maintain the load contribution of blocked entities pjt
@ 2012-10-24  9:46   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 59+ messages in thread
From: tip-bot for Paul Turner @ 2012-10-24  9:46 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, bsegall, hpa, mingo, a.p.zijlstra, pjt, tglx

Commit-ID:  9ee474f55664ff63111c843099d365e7ecffb56f
Gitweb:     http://git.kernel.org/tip/9ee474f55664ff63111c843099d365e7ecffb56f
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 4 Oct 2012 13:18:30 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 24 Oct 2012 10:27:22 +0200

sched: Maintain the load contribution of blocked entities

We are currently maintaining:

  runnable_load(cfs_rq) = \Sum task_load(t)

for all running children t of cfs_rq.  While this can be naturally updated for
tasks in a runnable state (as they are scheduled), this does not account for
the load contributed by blocked task entities.

This can be solved by introducing a separate accounting for blocked load:

  blocked_load(cfs_rq) = \Sum runnable(b) * weight(b)

Obviously we do not want to iterate over all blocked entities to account for
their decay; we instead observe that:

  runnable_load(t) = \Sum u_i * y^i

and that to account for an additional idle period we only need to compute:

  y*runnable_load(t).

This means that we can compute all blocked entities at once by evaluating:

  blocked_load(cfs_rq)` = y * blocked_load(cfs_rq)

Finally we maintain a decay counter so that when a sleeping entity re-awakens
we can determine how much of its load should be removed from the blocked sum.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.585389902@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |    1 +
 kernel/sched/core.c   |    1 -
 kernel/sched/debug.c  |    3 +
 kernel/sched/fair.c   |  128 ++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h  |    4 +-
 5 files changed, 122 insertions(+), 15 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 81d8b1b..b1831ac 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1103,6 +1103,7 @@ struct sched_avg {
 	 */
 	u32 runnable_avg_sum, runnable_avg_period;
 	u64 last_runnable_update;
+	s64 decay_count;
 	unsigned long load_avg_contrib;
 };
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fd9d085..00898f1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1528,7 +1528,6 @@ static void __sched_fork(struct task_struct *p)
 	p->se.avg.runnable_avg_period = 0;
 	p->se.avg.runnable_avg_sum = 0;
 #endif
-
 #ifdef CONFIG_SCHEDSTATS
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
 #endif
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index c953a89..2d2e2b3 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -95,6 +95,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	P(se->avg.runnable_avg_sum);
 	P(se->avg.runnable_avg_period);
 	P(se->avg.load_avg_contrib);
+	P(se->avg.decay_count);
 #endif
 #undef PN
 #undef P
@@ -227,6 +228,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			atomic_read(&cfs_rq->tg->load_weight));
 	SEQ_printf(m, "  .%-30s: %lld\n", "runnable_load_avg",
 			cfs_rq->runnable_load_avg);
+	SEQ_printf(m, "  .%-30s: %lld\n", "blocked_load_avg",
+			cfs_rq->blocked_load_avg);
 #endif
 
 	print_cfs_group_stats(m, cpu, cfs_rq->tg);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 77af759..8319417 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -259,6 +259,8 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
 	return grp->my_q;
 }
 
+static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq);
+
 static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	if (!cfs_rq->on_list) {
@@ -278,6 +280,8 @@ static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 		}
 
 		cfs_rq->on_list = 1;
+		/* We should have no load, but we need to update last_decay. */
+		update_cfs_rq_blocked_load(cfs_rq);
 	}
 }
 
@@ -1081,6 +1085,20 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 	return decayed;
 }
 
+/* Synchronize an entity's decay with its parenting cfs_rq.*/
+static inline void __synchronize_entity_decay(struct sched_entity *se)
+{
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	u64 decays = atomic64_read(&cfs_rq->decay_counter);
+
+	decays -= se->avg.decay_count;
+	if (!decays)
+		return;
+
+	se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
+	se->avg.decay_count = 0;
+}
+
 /* Compute the current contribution to load_avg by se, return any delta */
 static long __update_entity_load_avg_contrib(struct sched_entity *se)
 {
@@ -1096,8 +1114,18 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se)
 	return se->avg.load_avg_contrib - old_contrib;
 }
 
+static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
+						 long load_contrib)
+{
+	if (likely(load_contrib < cfs_rq->blocked_load_avg))
+		cfs_rq->blocked_load_avg -= load_contrib;
+	else
+		cfs_rq->blocked_load_avg = 0;
+}
+
 /* Update a sched_entity's runnable average */
-static inline void update_entity_load_avg(struct sched_entity *se)
+static inline void update_entity_load_avg(struct sched_entity *se,
+					  int update_cfs_rq)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	long contrib_delta;
@@ -1107,8 +1135,34 @@ static inline void update_entity_load_avg(struct sched_entity *se)
 		return;
 
 	contrib_delta = __update_entity_load_avg_contrib(se);
+
+	if (!update_cfs_rq)
+		return;
+
 	if (se->on_rq)
 		cfs_rq->runnable_load_avg += contrib_delta;
+	else
+		subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
+}
+
+/*
+ * Decay the load contributed by all blocked children and account this so that
+ * their contribution may be appropriately discounted when they wake up.
+ */
+static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq)
+{
+	u64 now = rq_of(cfs_rq)->clock_task >> 20;
+	u64 decays;
+
+	decays = now - cfs_rq->last_decay;
+	if (!decays)
+		return;
+
+	cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg,
+					      decays);
+	atomic64_add(decays, &cfs_rq->decay_counter);
+
+	cfs_rq->last_decay = now;
 }
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
@@ -1118,26 +1172,53 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 
 /* Add the load generated by se into cfs_rq's child load-average */
 static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
-						  struct sched_entity *se)
+						  struct sched_entity *se,
+						  int wakeup)
 {
-	update_entity_load_avg(se);
+	/* we track migrations using entity decay_count == 0 */
+	if (unlikely(!se->avg.decay_count)) {
+		se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;
+		wakeup = 0;
+	} else {
+		__synchronize_entity_decay(se);
+	}
+
+	if (wakeup)
+		subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
+
+	update_entity_load_avg(se, 0);
 	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
+	update_cfs_rq_blocked_load(cfs_rq);
 }
 
-/* Remove se's load from this cfs_rq child load-average */
+/*
+ * Remove se's load from this cfs_rq child load-average, if the entity is
+ * transitioning to a blocked state we track its projected decay using
+ * blocked_load_avg.
+ */
 static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
-						  struct sched_entity *se)
+						  struct sched_entity *se,
+						  int sleep)
 {
-	update_entity_load_avg(se);
+	update_entity_load_avg(se, 1);
+
 	cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
+	if (sleep) {
+		cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
+		se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
+	} /* migrations, e.g. sleep=0 leave decay_count == 0 */
 }
 #else
-static inline void update_entity_load_avg(struct sched_entity *se) {}
+static inline void update_entity_load_avg(struct sched_entity *se,
+					  int update_cfs_rq) {}
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
 static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
-						  struct sched_entity *se) {}
+					   struct sched_entity *se,
+					   int wakeup) {}
 static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
-						  struct sched_entity *se) {}
+					   struct sched_entity *se,
+					   int sleep) {}
+static inline void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq) {}
 #endif
 
 static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -1266,7 +1347,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 */
 	update_curr(cfs_rq);
 	update_cfs_load(cfs_rq, 0);
-	enqueue_entity_load_avg(cfs_rq, se);
+	enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
 
@@ -1341,7 +1422,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	dequeue_entity_load_avg(cfs_rq, se);
+	dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
 
 	update_stats_dequeue(cfs_rq, se);
 	if (flags & DEQUEUE_SLEEP) {
@@ -1512,7 +1593,7 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);
 		/* in !on_rq case, update occurred at dequeue */
-		update_entity_load_avg(prev);
+		update_entity_load_avg(prev, 1);
 	}
 	cfs_rq->curr = NULL;
 }
@@ -1528,7 +1609,8 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	/*
 	 * Ensure that runnable average is periodically updated.
 	 */
-	update_entity_load_avg(curr);
+	update_entity_load_avg(curr, 1);
+	update_cfs_rq_blocked_load(cfs_rq);
 
 	/*
 	 * Update share accounting for long-running entities.
@@ -2387,6 +2469,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
+		update_entity_load_avg(se, 1);
 	}
 
 	if (!se) {
@@ -2448,6 +2531,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
+		update_entity_load_avg(se, 1);
 	}
 
 	if (!se) {
@@ -3498,6 +3582,7 @@ static int update_shares_cpu(struct task_group *tg, int cpu)
 
 	update_rq_clock(rq);
 	update_cfs_load(cfs_rq, 1);
+	update_cfs_rq_blocked_load(cfs_rq);
 
 	/*
 	 * We need to update shares after updating tg->load_weight in
@@ -5232,6 +5317,20 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
 		place_entity(cfs_rq, se, 0);
 		se->vruntime -= cfs_rq->min_vruntime;
 	}
+
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+	/*
+	 * Remove our load from contribution when we leave sched_fair
+	 * and ensure we don't carry in an old decay_count if we
+	 * switch back.
+	 */
+	if (p->se.avg.decay_count) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
+		__synchronize_entity_decay(&p->se);
+		subtract_blocked_load_contrib(cfs_rq,
+				p->se.avg.load_avg_contrib);
+	}
+#endif
 }
 
 /*
@@ -5278,6 +5377,9 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 #ifndef CONFIG_64BIT
 	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
 #endif
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+	atomic64_set(&cfs_rq->decay_counter, 1);
+#endif
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e653973..664ff39 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -229,7 +229,9 @@ struct cfs_rq {
 	 * This allows for the description of both thread and group usage (in
 	 * the FAIR_GROUP_SCHED case).
 	 */
-	u64 runnable_load_avg;
+	u64 runnable_load_avg, blocked_load_avg;
+	atomic64_t decay_counter;
+	u64 last_decay;
 #endif
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	struct rq *rq;	/* cpu runqueue to which this cfs_rq is attached */
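
A stand-alone sketch of the blocked-load bookkeeping described above
(hypothetical toy_* names; the 4008/4096 step approximates multiplying by y,
as earlier in the series):

#include <stdint.h>
#include <stdio.h>

static uint64_t decay(uint64_t val, uint64_t n)
{
	while (n-- && val) {
		val *= 4008;	/* ~= one multiplication by y, y^32 ~= 0.5 */
		val >>= 12;
	}
	return val;
}

struct toy_cfs_rq { uint64_t blocked_load_avg, decay_counter; };
struct toy_se { uint64_t load_contrib, decay_count; };

static void block(struct toy_cfs_rq *rq, struct toy_se *se)
{
	rq->blocked_load_avg += se->load_contrib;
	se->decay_count = rq->decay_counter;	/* remember when we slept */
}

static void tick(struct toy_cfs_rq *rq, uint64_t periods)
{
	/* decay the whole blocked sum at once, no per-sleeper walk */
	rq->blocked_load_avg = decay(rq->blocked_load_avg, periods);
	rq->decay_counter += periods;
}

static void wake(struct toy_cfs_rq *rq, struct toy_se *se)
{
	uint64_t missed = rq->decay_counter - se->decay_count;

	se->load_contrib = decay(se->load_contrib, missed);
	rq->blocked_load_avg -= se->load_contrib;  /* back to runnable side */
}

int main(void)
{
	struct toy_cfs_rq rq = { .decay_counter = 1 };
	struct toy_se se = { .load_contrib = 1024 };

	block(&rq, &se);
	tick(&rq, 32);			/* ~32ms asleep: load halves */
	wake(&rq, &se);
	printf("woke with contrib %llu, blocked left %llu\n",
	       (unsigned long long)se.load_contrib,
	       (unsigned long long)rq.blocked_load_avg);
	return 0;
}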

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [tip:sched/core] sched: Add an rq migration call-back to sched_class
  2012-08-23 14:14 ` [patch 05/16] sched: add an rq migration call-back to sched_class pjt
@ 2012-10-24  9:47   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 59+ messages in thread
From: tip-bot for Paul Turner @ 2012-10-24  9:47 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, bsegall, hpa, mingo, a.p.zijlstra, pjt, tglx

Commit-ID:  0a74bef8bed18dc6889e9bc37ea1050a50c86c89
Gitweb:     http://git.kernel.org/tip/0a74bef8bed18dc6889e9bc37ea1050a50c86c89
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 4 Oct 2012 13:18:30 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 24 Oct 2012 10:27:23 +0200

sched: Add an rq migration call-back to sched_class

Since we are now doing bottom-up load accumulation, we need explicit
notification when a task has been re-parented so that the old hierarchy can be
updated.

Adds: migrate_task_rq(struct task_struct *p, int next_cpu)

(The alternative is to do this out of __set_task_cpu, but it was suggested that
this would be a cleaner encapsulation.)

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.660023400@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |    1 +
 kernel/sched/core.c   |    2 ++
 kernel/sched/fair.c   |   12 ++++++++++++
 3 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b1831ac..e483ccb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1061,6 +1061,7 @@ struct sched_class {
 
 #ifdef CONFIG_SMP
 	int  (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
+	void (*migrate_task_rq)(struct task_struct *p, int next_cpu);
 
 	void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
 	void (*post_schedule) (struct rq *this_rq);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 00898f1..f268600 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -952,6 +952,8 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	trace_sched_migrate_task(p, new_cpu);
 
 	if (task_cpu(p) != new_cpu) {
+		if (p->sched_class->migrate_task_rq)
+			p->sched_class->migrate_task_rq(p, new_cpu);
 		p->se.nr_migrations++;
 		perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0);
 	}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8319417..5e602e6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3047,6 +3047,17 @@ unlock:
 
 	return new_cpu;
 }
+
+/*
+ * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
+ * cfs_rq_of(p) references at time of call are still valid and identify the
+ * previous cpu.  However, the caller only guarantees p->pi_lock is held; no
+ * other assumptions, including the state of rq->lock, should be made.
+ */
+static void
+migrate_task_rq_fair(struct task_struct *p, int next_cpu)
+{
+}
 #endif /* CONFIG_SMP */
 
 static unsigned long
@@ -5607,6 +5618,7 @@ const struct sched_class fair_sched_class = {
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_fair,
+	.migrate_task_rq	= migrate_task_rq_fair,
 
 	.rq_online		= rq_online_fair,
 	.rq_offline		= rq_offline_fair,

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [tip:sched/core] sched: Account for blocked load waking back up
  2012-08-23 14:14 ` [patch 06/16] sched: account for blocked load waking back up pjt
       [not found]   ` <CAM4v1pO8SPCmqJTTBHpqwrwuO7noPdskg0RSooxyPsWoE395_A@mail.gmail.com>
@ 2012-10-24  9:48   ` tip-bot for Paul Turner
  1 sibling, 0 replies; 59+ messages in thread
From: tip-bot for Paul Turner @ 2012-10-24  9:48 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, bsegall, hpa, mingo, a.p.zijlstra, pjt, tglx

Commit-ID:  aff3e498844441fa71c5ee1bbc470e1dff9548d9
Gitweb:     http://git.kernel.org/tip/aff3e498844441fa71c5ee1bbc470e1dff9548d9
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 4 Oct 2012 13:18:30 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 24 Oct 2012 10:27:23 +0200

sched: Account for blocked load waking back up

When a running entity blocks, we migrate its tracked load to
cfs_rq->blocked_load_avg.  In the sleep case this occurs while holding
rq->lock and so is a natural transition.  Wake-ups however, are potentially
asynchronous in the presence of migration and so special care must be taken.

We use an atomic counter to track such migrated load, taking care to match this
with the previously introduced decay counters so that we don't migrate too much
load.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.726077467@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c  |  100 ++++++++++++++++++++++++++++++++++++++++----------
 kernel/sched/sched.h |    2 +-
 2 files changed, 81 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5e602e6..74dc29b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -259,7 +259,8 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
 	return grp->my_q;
 }
 
-static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq);
+static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq,
+				       int force_update);
 
 static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
@@ -281,7 +282,7 @@ static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 
 		cfs_rq->on_list = 1;
 		/* We should have no load, but we need to update last_decay. */
-		update_cfs_rq_blocked_load(cfs_rq);
+		update_cfs_rq_blocked_load(cfs_rq, 0);
 	}
 }
 
@@ -1086,17 +1087,19 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 }
 
 /* Synchronize an entity's decay with its parenting cfs_rq.*/
-static inline void __synchronize_entity_decay(struct sched_entity *se)
+static inline u64 __synchronize_entity_decay(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 decays = atomic64_read(&cfs_rq->decay_counter);
 
 	decays -= se->avg.decay_count;
 	if (!decays)
-		return;
+		return 0;
 
 	se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
 	se->avg.decay_count = 0;
+
+	return decays;
 }
 
 /* Compute the current contribution to load_avg by se, return any delta */
@@ -1149,20 +1152,26 @@ static inline void update_entity_load_avg(struct sched_entity *se,
  * Decay the load contributed by all blocked children and account this so that
  * their contribution may be appropriately discounted when they wake up.
  */
-static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq)
+static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 {
 	u64 now = rq_of(cfs_rq)->clock_task >> 20;
 	u64 decays;
 
 	decays = now - cfs_rq->last_decay;
-	if (!decays)
+	if (!decays && !force_update)
 		return;
 
-	cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg,
-					      decays);
-	atomic64_add(decays, &cfs_rq->decay_counter);
+	if (atomic64_read(&cfs_rq->removed_load)) {
+		u64 removed_load = atomic64_xchg(&cfs_rq->removed_load, 0);
+		subtract_blocked_load_contrib(cfs_rq, removed_load);
+	}
 
-	cfs_rq->last_decay = now;
+	if (decays) {
+		cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg,
+						      decays);
+		atomic64_add(decays, &cfs_rq->decay_counter);
+		cfs_rq->last_decay = now;
+	}
 }
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
@@ -1175,20 +1184,42 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 						  struct sched_entity *se,
 						  int wakeup)
 {
-	/* we track migrations using entity decay_count == 0 */
-	if (unlikely(!se->avg.decay_count)) {
+	/*
+	 * We track migrations using entity decay_count <= 0, on a wake-up
+	 * We track migrations using entity decay_count <= 0; on a wake-up
+	 * accumulated while sleeping.
+	 */
+	if (unlikely(se->avg.decay_count <= 0)) {
 		se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;
+		if (se->avg.decay_count) {
+			/*
+			 * In a wake-up migration we have to approximate the
+			 * time sleeping.  This is because we can't synchronize
+			 * clock_task between the two cpus, and it is not
+			 * guaranteed to be read-safe.  Instead, we can
+			 * approximate this using our carried decays, which are
+			 * explicitly atomically readable.
+			 */
+			se->avg.last_runnable_update -= (-se->avg.decay_count)
+							<< 20;
+			update_entity_load_avg(se, 0);
+			/* Indicate that we're now synchronized and on-rq */
+			se->avg.decay_count = 0;
+		}
 		wakeup = 0;
 	} else {
 		__synchronize_entity_decay(se);
 	}
 
-	if (wakeup)
+	/* migrated tasks did not contribute to our blocked load */
+	if (wakeup) {
 		subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
+		update_entity_load_avg(se, 0);
+	}
 
-	update_entity_load_avg(se, 0);
 	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
-	update_cfs_rq_blocked_load(cfs_rq);
+	/* we force update consideration on load-balancer moves */
+	update_cfs_rq_blocked_load(cfs_rq, !wakeup);
 }
 
 /*
@@ -1201,6 +1232,8 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
 						  int sleep)
 {
 	update_entity_load_avg(se, 1);
+	/* we force update consideration on load-balancer moves */
+	update_cfs_rq_blocked_load(cfs_rq, !sleep);
 
 	cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
 	if (sleep) {
@@ -1218,7 +1251,8 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
 					   struct sched_entity *se,
 					   int sleep) {}
-static inline void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq) {}
+static inline void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq,
+					      int force_update) {}
 #endif
 
 static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -1610,7 +1644,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	 * Ensure that runnable average is periodically updated.
 	 */
 	update_entity_load_avg(curr, 1);
-	update_cfs_rq_blocked_load(cfs_rq);
+	update_cfs_rq_blocked_load(cfs_rq, 1);
 
 	/*
 	 * Update share accounting for long-running entities.
@@ -3057,6 +3091,19 @@ unlock:
 static void
 migrate_task_rq_fair(struct task_struct *p, int next_cpu)
 {
+	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+	/*
+	 * Load tracking: accumulate removed load so that it can be processed
+	 * when we next update owning cfs_rq under rq->lock.  Tasks contribute
+	 * to blocked load iff they have a positive decay-count.  It can never
+	 * be negative here since on-rq tasks have decay-count == 0.
+	 */
+	if (se->avg.decay_count) {
+		se->avg.decay_count = -__synchronize_entity_decay(se);
+		atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
+	}
 }
 #endif /* CONFIG_SMP */
 
@@ -3593,7 +3640,7 @@ static int update_shares_cpu(struct task_group *tg, int cpu)
 
 	update_rq_clock(rq);
 	update_cfs_load(cfs_rq, 1);
-	update_cfs_rq_blocked_load(cfs_rq);
+	update_cfs_rq_blocked_load(cfs_rq, 1);
 
 	/*
 	 * We need to update shares after updating tg->load_weight in
@@ -5390,12 +5437,14 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 #endif
 #if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
 	atomic64_set(&cfs_rq->decay_counter, 1);
+	atomic64_set(&cfs_rq->removed_load, 0);
 #endif
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static void task_move_group_fair(struct task_struct *p, int on_rq)
 {
+	struct cfs_rq *cfs_rq;
 	/*
 	 * If the task was not on the rq at the time of this cgroup movement
 	 * it must have been asleep, sleeping tasks keep their ->vruntime
@@ -5427,8 +5476,19 @@ static void task_move_group_fair(struct task_struct *p, int on_rq)
 	if (!on_rq)
 		p->se.vruntime -= cfs_rq_of(&p->se)->min_vruntime;
 	set_task_rq(p, task_cpu(p));
-	if (!on_rq)
-		p->se.vruntime += cfs_rq_of(&p->se)->min_vruntime;
+	if (!on_rq) {
+		cfs_rq = cfs_rq_of(&p->se);
+		p->se.vruntime += cfs_rq->min_vruntime;
+#ifdef CONFIG_SMP
+		/*
+		 * migrate_task_rq_fair() will have removed our previous
+		 * contribution, but we must synchronize for ongoing future
+		 * decay.
+		 */
+		p->se.avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
+		cfs_rq->blocked_load_avg += p->se.avg.load_avg_contrib;
+#endif
+	}
 }
 
 void free_fair_sched_group(struct task_group *tg)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 664ff39..30236ab 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -230,7 +230,7 @@ struct cfs_rq {
 	 * the FAIR_GROUP_SCHED case).
 	 */
 	u64 runnable_load_avg, blocked_load_avg;
-	atomic64_t decay_counter;
+	atomic64_t decay_counter, removed_load;
 	u64 last_decay;
 #endif
 #ifdef CONFIG_FAIR_GROUP_SCHED
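
As an illustration of the removed-load handshake, here is a stand-alone
sketch (hypothetical names, C11 atomics standing in for the kernel's
atomic64_t): the migrating side only adds to an atomic, and the owning cpu
folds the value into blocked_load_avg the next time it updates under its
own rq->lock.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

struct toy_cfs_rq {
	uint64_t blocked_load_avg;
	atomic_uint_fast64_t removed_load;
};

/* called on the source cpu, without the owning cpu's lock */
static void migrate_out(struct toy_cfs_rq *rq, uint64_t contrib)
{
	atomic_fetch_add(&rq->removed_load, contrib);
}

/* called later, under the owning cpu's lock */
static void fold_removed(struct toy_cfs_rq *rq)
{
	uint64_t removed = atomic_exchange(&rq->removed_load, 0);

	rq->blocked_load_avg -= (removed < rq->blocked_load_avg) ?
				removed : rq->blocked_load_avg;
}

int main(void)
{
	struct toy_cfs_rq rq = { .blocked_load_avg = 2048 };

	migrate_out(&rq, 512);
	migrate_out(&rq, 512);
	fold_removed(&rq);
	printf("blocked_load_avg = %llu\n",
	       (unsigned long long)rq.blocked_load_avg);	/* 1024 */
	return 0;
}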

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [tip:sched/core] sched: Aggregate total task_group load
  2012-08-23 14:14 ` [patch 07/16] sched: aggregate total task_group load pjt
@ 2012-10-24  9:49   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 59+ messages in thread
From: tip-bot for Paul Turner @ 2012-10-24  9:49 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, bsegall, hpa, mingo, a.p.zijlstra, pjt, tglx

Commit-ID:  c566e8e9e44b72b53091da20e2dedefc730f2ee2
Gitweb:     http://git.kernel.org/tip/c566e8e9e44b72b53091da20e2dedefc730f2ee2
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 4 Oct 2012 13:18:30 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 24 Oct 2012 10:27:24 +0200

sched: Aggregate total task_group load

Maintain a global running sum of the average load seen on each cfs_rq belonging
to each task group so that it may be used in calculating an appropriate
shares:weight distribution.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.792901086@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/debug.c |    4 ++++
 kernel/sched/fair.c  |   22 ++++++++++++++++++++++
 kernel/sched/sched.h |    4 ++++
 3 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2d2e2b3..2908923 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -230,6 +230,10 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			cfs_rq->runnable_load_avg);
 	SEQ_printf(m, "  .%-30s: %lld\n", "blocked_load_avg",
 			cfs_rq->blocked_load_avg);
+	SEQ_printf(m, "  .%-30s: %ld\n", "tg_load_avg",
+			atomic64_read(&cfs_rq->tg->load_avg));
+	SEQ_printf(m, "  .%-30s: %lld\n", "tg_load_contrib",
+			cfs_rq->tg_load_contrib);
 #endif
 
 	print_cfs_group_stats(m, cpu, cfs_rq->tg);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 74dc29b..db78822 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1102,6 +1102,26 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
 	return decays;
 }
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
+						 int force_update)
+{
+	struct task_group *tg = cfs_rq->tg;
+	s64 tg_contrib;
+
+	tg_contrib = cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
+	tg_contrib -= cfs_rq->tg_load_contrib;
+
+	if (force_update || abs64(tg_contrib) > cfs_rq->tg_load_contrib / 8) {
+		atomic64_add(tg_contrib, &tg->load_avg);
+		cfs_rq->tg_load_contrib += tg_contrib;
+	}
+}
+#else
+static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
+						 int force_update) {}
+#endif
+
 /* Compute the current contribution to load_avg by se, return any delta */
 static long __update_entity_load_avg_contrib(struct sched_entity *se)
 {
@@ -1172,6 +1192,8 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 		atomic64_add(decays, &cfs_rq->decay_counter);
 		cfs_rq->last_decay = now;
 	}
+
+	__update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
 }
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 30236ab..924a990 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -112,6 +112,7 @@ struct task_group {
 	unsigned long shares;
 
 	atomic_t load_weight;
+	atomic64_t load_avg;
 #endif
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -232,6 +233,9 @@ struct cfs_rq {
 	u64 runnable_load_avg, blocked_load_avg;
 	atomic64_t decay_counter, removed_load;
 	u64 last_decay;
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	u64 tg_load_contrib;
+#endif
 #endif
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	struct rq *rq;	/* cpu runqueue to which this cfs_rq is attached */
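
A stand-alone sketch of the thresholded publication into the shared
task_group sum (hypothetical toy_* names; the shared counter is shown as a
plain field rather than an atomic):

#include <stdint.h>
#include <stdio.h>

struct toy_tg { int64_t load_avg; };		/* shared across cpus */
struct toy_cfs_rq {
	int64_t runnable_load_avg, blocked_load_avg;
	int64_t tg_load_contrib;		/* last value published */
};

static void publish(struct toy_tg *tg, struct toy_cfs_rq *rq, int force)
{
	int64_t contrib = rq->runnable_load_avg + rq->blocked_load_avg
			  - rq->tg_load_contrib;
	int64_t drift = contrib < 0 ? -contrib : contrib;

	/* only touch the shared counter when we drift by more than 1/8 */
	if (force || drift > rq->tg_load_contrib / 8) {
		tg->load_avg += contrib;
		rq->tg_load_contrib += contrib;
	}
}

int main(void)
{
	struct toy_tg tg = { 0 };
	struct toy_cfs_rq rq = { .runnable_load_avg = 1024 };

	publish(&tg, &rq, 0);			/* first publish: 1024 */
	rq.runnable_load_avg = 1100;		/* < 1/8 drift: skipped */
	publish(&tg, &rq, 0);
	rq.runnable_load_avg = 2048;		/* large drift: published */
	publish(&tg, &rq, 0);
	printf("tg->load_avg = %lld\n", (long long)tg.load_avg);  /* 2048 */
	return 0;
}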

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [tip:sched/core] sched: Compute load contribution by a group entity
  2012-08-23 14:14 ` [patch 08/16] sched: compute load contribution by a group entity pjt
@ 2012-10-24  9:50   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 59+ messages in thread
From: tip-bot for Paul Turner @ 2012-10-24  9:50 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, bsegall, hpa, mingo, a.p.zijlstra, pjt, tglx

Commit-ID:  8165e145ceb62fc338e099c9b12b3239c83d2f8e
Gitweb:     http://git.kernel.org/tip/8165e145ceb62fc338e099c9b12b3239c83d2f8e
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 4 Oct 2012 13:18:31 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 24 Oct 2012 10:27:25 +0200

sched: Compute load contribution by a group entity

Unlike task entities, which have a fixed weight, group entities instead own a
fraction of their parenting task_group's shares as their contributed weight.

Compute this fraction so that we can correctly account hierarchies and shared
entity nodes.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.855074415@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c |   33 +++++++++++++++++++++++++++------
 1 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index db78822..e20cb26 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1117,22 +1117,43 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
 		cfs_rq->tg_load_contrib += tg_contrib;
 	}
 }
+
+static inline void __update_group_entity_contrib(struct sched_entity *se)
+{
+	struct cfs_rq *cfs_rq = group_cfs_rq(se);
+	struct task_group *tg = cfs_rq->tg;
+	u64 contrib;
+
+	contrib = cfs_rq->tg_load_contrib * tg->shares;
+	se->avg.load_avg_contrib = div64_u64(contrib,
+					     atomic64_read(&tg->load_avg) + 1);
+}
 #else
 static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
 						 int force_update) {}
+static inline void __update_group_entity_contrib(struct sched_entity *se) {}
 #endif
 
+static inline void __update_task_entity_contrib(struct sched_entity *se)
+{
+	u32 contrib;
+
+	/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
+	contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
+	contrib /= (se->avg.runnable_avg_period + 1);
+	se->avg.load_avg_contrib = scale_load(contrib);
+}
+
 /* Compute the current contribution to load_avg by se, return any delta */
 static long __update_entity_load_avg_contrib(struct sched_entity *se)
 {
 	long old_contrib = se->avg.load_avg_contrib;
 
-	if (!entity_is_task(se))
-		return 0;
-
-	se->avg.load_avg_contrib = div64_u64(se->avg.runnable_avg_sum *
-					     se->load.weight,
-					     se->avg.runnable_avg_period + 1);
+	if (entity_is_task(se)) {
+		__update_task_entity_contrib(se);
+	} else {
+		__update_group_entity_contrib(se);
+	}
 
 	return se->avg.load_avg_contrib - old_contrib;
 }
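
For illustration, the weight split reduces to one line of arithmetic; a
stand-alone sketch with hypothetical names and made-up inputs:

#include <stdint.h>
#include <stdio.h>

static uint64_t group_entity_contrib(uint64_t tg_shares,
				     uint64_t tg_load_avg,
				     uint64_t this_rq_load_contrib)
{
	/* +1 guards the division while the group has no tracked load yet */
	return this_rq_load_contrib * tg_shares / (tg_load_avg + 1);
}

int main(void)
{
	/* a group with shares 1024 and 3072 units of load in total,
	 * 1024 of which live on this cpu's cfs_rq */
	printf("contrib = %llu\n", (unsigned long long)
	       group_entity_contrib(1024, 3072, 1024));	/* ~1/3 of 1024 */
	return 0;
}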

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [tip:sched/core] sched: Normalize tg load contributions against runnable time
  2012-08-23 14:14 ` [patch 09/16] sched: normalize tg load contributions against runnable time pjt
@ 2012-10-24  9:51   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 59+ messages in thread
From: tip-bot for Paul Turner @ 2012-10-24  9:51 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, a.p.zijlstra, pjt, tglx

Commit-ID:  bb17f65571e97a7ec0297571fb1154fbd107ad00
Gitweb:     http://git.kernel.org/tip/bb17f65571e97a7ec0297571fb1154fbd107ad00
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 4 Oct 2012 13:18:31 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 24 Oct 2012 10:27:26 +0200

sched: Normalize tg load contributions against runnable time

Entities of equal weight should receive equitable distribution of cpu time.
This is challenging in the case of a task_group's shares as execution may be
occurring on multiple cpus simultaneously.

To handle this we divide up the shares into weights proportionate with the load
on each cfs_rq.  This does not, however, account for the fact that the sum of
the parts may be less than one cpu and so we need to normalize:
  load(tg) = min(runnable_avg(tg), 1) * tg->shares
Where runnable_avg is the aggregate time in which the task_group had runnable
children.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.930124292@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/debug.c |    4 +++
 kernel/sched/fair.c  |   56 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |    2 +
 3 files changed, 62 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2908923..71b0ea3 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -234,6 +234,10 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			atomic64_read(&cfs_rq->tg->load_avg));
 	SEQ_printf(m, "  .%-30s: %lld\n", "tg_load_contrib",
 			cfs_rq->tg_load_contrib);
+	SEQ_printf(m, "  .%-30s: %d\n", "tg_runnable_contrib",
+			cfs_rq->tg_runnable_contrib);
+	SEQ_printf(m, "  .%-30s: %d\n", "tg->runnable_avg",
+			atomic_read(&cfs_rq->tg->runnable_avg));
 #endif
 
 	print_cfs_group_stats(m, cpu, cfs_rq->tg);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e20cb26..9e49722 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1118,19 +1118,73 @@ static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
 	}
 }
 
+/*
+ * Aggregate cfs_rq runnable averages into an equivalent task_group
+ * representation for computing load contributions.
+ */
+static inline void __update_tg_runnable_avg(struct sched_avg *sa,
+						  struct cfs_rq *cfs_rq)
+{
+	struct task_group *tg = cfs_rq->tg;
+	long contrib;
+
+	/* The fraction of a cpu used by this cfs_rq */
+	contrib = div_u64(sa->runnable_avg_sum << NICE_0_SHIFT,
+			  sa->runnable_avg_period + 1);
+	contrib -= cfs_rq->tg_runnable_contrib;
+
+	if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
+		atomic_add(contrib, &tg->runnable_avg);
+		cfs_rq->tg_runnable_contrib += contrib;
+	}
+}
+
 static inline void __update_group_entity_contrib(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = group_cfs_rq(se);
 	struct task_group *tg = cfs_rq->tg;
+	int runnable_avg;
+
 	u64 contrib;
 
 	contrib = cfs_rq->tg_load_contrib * tg->shares;
 	se->avg.load_avg_contrib = div64_u64(contrib,
 					     atomic64_read(&tg->load_avg) + 1);
+
+	/*
+	 * For group entities we need to compute a correction term in the case
+	 * that they are consuming <1 cpu so that we would contribute the same
+	 * load as a task of equal weight.
+	 *
+	 * Explicitly co-ordinating this measurement would be expensive, but
+	 * fortunately the sum of each cpus contribution forms a usable
+	 * lower-bound on the true value.
+	 *
+	 * Consider the aggregate of 2 contributions.  Either they are disjoint
+	 * (and the sum represents the true value) or they overlap and we are
+	 * understating by the aggregate of their overlap.
+	 *
+	 * Extending this to N cpus, for a given overlap, the maximum amount we
+	 * understate is then n_i(n_i+1)/2 * w_i where n_i is the number of
+	 * cpus that overlap for this interval and w_i is the interval width.
+	 *
+	 * On a small machine, the first term is well-bounded, which bounds the
+	 * total error since w_i is a subset of the period.  Whereas on a
+	 * larger machine, while this first term can be larger, if w_i is of
+	 * consequential size we are guaranteed to see n_i*w_i quickly converge
+	 * to our upper bound of 1-cpu.
+	 */
+	runnable_avg = atomic_read(&tg->runnable_avg);
+	if (runnable_avg < NICE_0_LOAD) {
+		se->avg.load_avg_contrib *= runnable_avg;
+		se->avg.load_avg_contrib >>= NICE_0_SHIFT;
+	}
 }
 #else
 static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
 						 int force_update) {}
+static inline void __update_tg_runnable_avg(struct sched_avg *sa,
+						  struct cfs_rq *cfs_rq) {}
 static inline void __update_group_entity_contrib(struct sched_entity *se) {}
 #endif
 
@@ -1152,6 +1206,7 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se)
 	if (entity_is_task(se)) {
 		__update_task_entity_contrib(se);
 	} else {
+		__update_tg_runnable_avg(&se->avg, group_cfs_rq(se));
 		__update_group_entity_contrib(se);
 	}
 
@@ -1220,6 +1275,7 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 {
 	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
+	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
 }
 
 /* Add the load generated by se into cfs_rq's child load-average */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 924a990..134928d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -113,6 +113,7 @@ struct task_group {
 
 	atomic_t load_weight;
 	atomic64_t load_avg;
+	atomic_t runnable_avg;
 #endif
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -234,6 +235,7 @@ struct cfs_rq {
 	atomic64_t decay_counter, removed_load;
 	u64 last_decay;
 #ifdef CONFIG_FAIR_GROUP_SCHED
+	u32 tg_runnable_contrib;
 	u64 tg_load_contrib;
 #endif
 #endif
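
A stand-alone sketch of the normalization step (hypothetical TOY_ constants
standing in for NICE_0_LOAD and NICE_0_SHIFT):

#include <stdint.h>
#include <stdio.h>

#define TOY_NICE_0_SHIFT	10
#define TOY_NICE_0_LOAD		(1UL << TOY_NICE_0_SHIFT)

static uint64_t normalize_contrib(uint64_t contrib, uint64_t tg_runnable_avg)
{
	/* load(tg) = min(runnable_avg(tg), 1) * tg->shares, in fixed point */
	if (tg_runnable_avg < TOY_NICE_0_LOAD) {
		contrib *= tg_runnable_avg;
		contrib >>= TOY_NICE_0_SHIFT;
	}
	return contrib;
}

int main(void)
{
	/* group runnable ~25% of the time: contribution is quartered */
	printf("%llu\n", (unsigned long long)normalize_contrib(1024, 256));
	/* group saturating >= 1 cpu: contribution unchanged */
	printf("%llu\n", (unsigned long long)normalize_contrib(1024, 2048));
	return 0;
}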

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [tip:sched/core] sched: Maintain runnable averages across throttled periods
  2012-08-23 14:14 ` [patch 10/16] sched: maintain runnable averages across throttled periods pjt
@ 2012-10-24  9:52   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 59+ messages in thread
From: tip-bot for Paul Turner @ 2012-10-24  9:52 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, bsegall, hpa, mingo, a.p.zijlstra, pjt, tglx

Commit-ID:  f1b17280efbd21873d1db8631117bdbccbcb39a2
Gitweb:     http://git.kernel.org/tip/f1b17280efbd21873d1db8631117bdbccbcb39a2
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 4 Oct 2012 13:18:31 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 24 Oct 2012 10:27:27 +0200

sched: Maintain runnable averages across throttled periods

With bandwidth control, tracked entities may cease execution according to
user-specified bandwidth limits.  Charging this time as either throttled or
blocked, however, is incorrect and would falsely skew in either direction.

What we actually want is for any throttled periods to be "invisible" to
load-tracking as they are removed from the system for that interval and
contribute normally otherwise.

Do this by moderating the progression of time to omit any periods in which the
entity belonged to a throttled hierarchy.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.998912151@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c  |   50 ++++++++++++++++++++++++++++++++++++++++----------
 kernel/sched/sched.h |    3 ++-
 2 files changed, 42 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9e49722..873c9f5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1222,15 +1222,26 @@ static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
 		cfs_rq->blocked_load_avg = 0;
 }
 
+static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
+
 /* Update a sched_entity's runnable average */
 static inline void update_entity_load_avg(struct sched_entity *se,
 					  int update_cfs_rq)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	long contrib_delta;
+	u64 now;
 
-	if (!__update_entity_runnable_avg(rq_of(cfs_rq)->clock_task, &se->avg,
-					  se->on_rq))
+	/*
+	 * For a group entity we need to use their owned cfs_rq_clock_task() in
+	 * case they are the parent of a throttled hierarchy.
+	 */
+	if (entity_is_task(se))
+		now = cfs_rq_clock_task(cfs_rq);
+	else
+		now = cfs_rq_clock_task(group_cfs_rq(se));
+
+	if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq))
 		return;
 
 	contrib_delta = __update_entity_load_avg_contrib(se);
@@ -1250,7 +1261,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
  */
 static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 {
-	u64 now = rq_of(cfs_rq)->clock_task >> 20;
+	u64 now = cfs_rq_clock_task(cfs_rq) >> 20;
 	u64 decays;
 
 	decays = now - cfs_rq->last_decay;
@@ -1841,6 +1852,15 @@ static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
 	return &tg->cfs_bandwidth;
 }
 
+/* rq->clock_task normalized against any time this cfs_rq has spent throttled */
+static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq)
+{
+	if (unlikely(cfs_rq->throttle_count))
+		return cfs_rq->throttled_clock_task;
+
+	return rq_of(cfs_rq)->clock_task - cfs_rq->throttled_clock_task_time;
+}
+
 /* returns 0 on failure to allocate runtime */
 static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
@@ -1991,6 +2011,10 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 		cfs_rq->load_stamp += delta;
 		cfs_rq->load_last += delta;
 
+		/* adjust cfs_rq_clock_task() */
+		cfs_rq->throttled_clock_task_time += rq->clock_task -
+					     cfs_rq->throttled_clock_task;
+
 		/* update entity weight now that we are on_rq again */
 		update_cfs_shares(cfs_rq);
 	}
@@ -2005,8 +2029,10 @@ static int tg_throttle_down(struct task_group *tg, void *data)
 	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
 
 	/* group is entering throttled state, record last load */
-	if (!cfs_rq->throttle_count)
+	if (!cfs_rq->throttle_count) {
 		update_cfs_load(cfs_rq, 0);
+		cfs_rq->throttled_clock_task = rq->clock_task;
+	}
 	cfs_rq->throttle_count++;
 
 	return 0;
@@ -2021,7 +2047,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 
 	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
 
-	/* account load preceding throttle */
+	/* freeze hierarchy runnable averages while throttled */
 	rcu_read_lock();
 	walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
 	rcu_read_unlock();
@@ -2045,7 +2071,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 		rq->nr_running -= task_delta;
 
 	cfs_rq->throttled = 1;
-	cfs_rq->throttled_timestamp = rq->clock;
+	cfs_rq->throttled_clock = rq->clock;
 	raw_spin_lock(&cfs_b->lock);
 	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
 	raw_spin_unlock(&cfs_b->lock);
@@ -2063,10 +2089,9 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 
 	cfs_rq->throttled = 0;
 	raw_spin_lock(&cfs_b->lock);
-	cfs_b->throttled_time += rq->clock - cfs_rq->throttled_timestamp;
+	cfs_b->throttled_time += rq->clock - cfs_rq->throttled_clock;
 	list_del_rcu(&cfs_rq->throttled_list);
 	raw_spin_unlock(&cfs_b->lock);
-	cfs_rq->throttled_timestamp = 0;
 
 	update_rq_clock(rq);
 	/* update hierarchical throttle state */
@@ -2466,8 +2491,13 @@ static void unthrottle_offline_cfs_rqs(struct rq *rq)
 }
 
 #else /* CONFIG_CFS_BANDWIDTH */
-static __always_inline
-void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, unsigned long delta_exec) {}
+static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq)
+{
+	return rq_of(cfs_rq)->clock_task;
+}
+
+static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
+				     unsigned long delta_exec) {}
 static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 134928d..d13bce7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -281,7 +281,8 @@ struct cfs_rq {
 	u64 runtime_expires;
 	s64 runtime_remaining;
 
-	u64 throttled_timestamp;
+	u64 throttled_clock, throttled_clock_task;
+	u64 throttled_clock_task_time;
 	int throttled, throttle_count;
 	struct list_head throttled_list;
 #endif /* CONFIG_CFS_BANDWIDTH */

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [tip:sched/core] sched: Replace update_shares weight distribution with per-entity computation
  2012-08-23 14:14 ` [patch 11/16] sched: replace update_shares weight distribution with per-entity computation pjt
  2012-09-24 19:44   ` "Jan H. Schönherr"
@ 2012-10-24  9:53   ` tip-bot for Paul Turner
  1 sibling, 0 replies; 59+ messages in thread
From: tip-bot for Paul Turner @ 2012-10-24  9:53 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, bsegall, hpa, mingo, a.p.zijlstra, pjt, tglx

Commit-ID:  82958366cfea1a50e7e90907b2d55ae29ed69974
Gitweb:     http://git.kernel.org/tip/82958366cfea1a50e7e90907b2d55ae29ed69974
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 4 Oct 2012 13:18:31 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 24 Oct 2012 10:27:28 +0200

sched: Replace update_shares weight distribution with per-entity computation

Now that the machinery is in place to compute contributed load in a
bottom-up fashion, replace the shares distribution code within update_shares()
accordingly.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141507.061208672@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/debug.c |    8 ---
 kernel/sched/fair.c  |  157 ++++++++------------------------------------------
 kernel/sched/sched.h |   36 ++++--------
 3 files changed, 36 insertions(+), 165 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 71b0ea3..2cd3c1b 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -218,14 +218,6 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 #ifdef CONFIG_SMP
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "load_avg",
-			SPLIT_NS(cfs_rq->load_avg));
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "load_period",
-			SPLIT_NS(cfs_rq->load_period));
-	SEQ_printf(m, "  .%-30s: %ld\n", "load_contrib",
-			cfs_rq->load_contribution);
-	SEQ_printf(m, "  .%-30s: %d\n", "load_tg",
-			atomic_read(&cfs_rq->tg->load_weight));
 	SEQ_printf(m, "  .%-30s: %lld\n", "runnable_load_avg",
 			cfs_rq->runnable_load_avg);
 	SEQ_printf(m, "  .%-30s: %lld\n", "blocked_load_avg",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 873c9f5..57fae95 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -658,9 +658,6 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	return calc_delta_fair(sched_slice(cfs_rq, se), se);
 }
 
-static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update);
-static void update_cfs_shares(struct cfs_rq *cfs_rq);
-
 /*
  * Update the current task's runtime statistics. Skip current tasks that
  * are not in our scheduling class.
@@ -680,10 +677,6 @@ __update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
 
 	curr->vruntime += delta_exec_weighted;
 	update_min_vruntime(cfs_rq);
-
-#if defined CONFIG_SMP && defined CONFIG_FAIR_GROUP_SCHED
-	cfs_rq->load_unacc_exec_time += delta_exec;
-#endif
 }
 
 static void update_curr(struct cfs_rq *cfs_rq)
@@ -806,72 +799,7 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-/* we need this in update_cfs_load and load-balance functions below */
-static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 # ifdef CONFIG_SMP
-static void update_cfs_rq_load_contribution(struct cfs_rq *cfs_rq,
-					    int global_update)
-{
-	struct task_group *tg = cfs_rq->tg;
-	long load_avg;
-
-	load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
-	load_avg -= cfs_rq->load_contribution;
-
-	if (global_update || abs(load_avg) > cfs_rq->load_contribution / 8) {
-		atomic_add(load_avg, &tg->load_weight);
-		cfs_rq->load_contribution += load_avg;
-	}
-}
-
-static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
-{
-	u64 period = sysctl_sched_shares_window;
-	u64 now, delta;
-	unsigned long load = cfs_rq->load.weight;
-
-	if (cfs_rq->tg == &root_task_group || throttled_hierarchy(cfs_rq))
-		return;
-
-	now = rq_of(cfs_rq)->clock_task;
-	delta = now - cfs_rq->load_stamp;
-
-	/* truncate load history at 4 idle periods */
-	if (cfs_rq->load_stamp > cfs_rq->load_last &&
-	    now - cfs_rq->load_last > 4 * period) {
-		cfs_rq->load_period = 0;
-		cfs_rq->load_avg = 0;
-		delta = period - 1;
-	}
-
-	cfs_rq->load_stamp = now;
-	cfs_rq->load_unacc_exec_time = 0;
-	cfs_rq->load_period += delta;
-	if (load) {
-		cfs_rq->load_last = now;
-		cfs_rq->load_avg += delta * load;
-	}
-
-	/* consider updating load contribution on each fold or truncate */
-	if (global_update || cfs_rq->load_period > period
-	    || !cfs_rq->load_period)
-		update_cfs_rq_load_contribution(cfs_rq, global_update);
-
-	while (cfs_rq->load_period > period) {
-		/*
-		 * Inline assembly required to prevent the compiler
-		 * optimising this loop into a divmod call.
-		 * See __iter_div_u64_rem() for another example of this.
-		 */
-		asm("" : "+rm" (cfs_rq->load_period));
-		cfs_rq->load_period /= 2;
-		cfs_rq->load_avg /= 2;
-	}
-
-	if (!cfs_rq->curr && !cfs_rq->nr_running && !cfs_rq->load_avg)
-		list_del_leaf_cfs_rq(cfs_rq);
-}
-
 static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
 {
 	long tg_weight;
@@ -881,8 +809,8 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
 	 * to gain a more accurate current total weight. See
 	 * update_cfs_rq_load_contribution().
 	 */
-	tg_weight = atomic_read(&tg->load_weight);
-	tg_weight -= cfs_rq->load_contribution;
+	tg_weight = atomic64_read(&tg->load_avg);
+	tg_weight -= cfs_rq->tg_load_contrib;
 	tg_weight += cfs_rq->load.weight;
 
 	return tg_weight;
@@ -906,27 +834,11 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
 
 	return shares;
 }
-
-static void update_entity_shares_tick(struct cfs_rq *cfs_rq)
-{
-	if (cfs_rq->load_unacc_exec_time > sysctl_sched_shares_window) {
-		update_cfs_load(cfs_rq, 0);
-		update_cfs_shares(cfs_rq);
-	}
-}
 # else /* CONFIG_SMP */
-static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
-{
-}
-
 static inline long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
 {
 	return tg->shares;
 }
-
-static inline void update_entity_shares_tick(struct cfs_rq *cfs_rq)
-{
-}
 # endif /* CONFIG_SMP */
 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 			    unsigned long weight)
@@ -944,6 +856,8 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 		account_entity_enqueue(cfs_rq, se);
 }
 
+static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
+
 static void update_cfs_shares(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg;
@@ -963,17 +877,9 @@ static void update_cfs_shares(struct cfs_rq *cfs_rq)
 	reweight_entity(cfs_rq_of(se), se, shares);
 }
 #else /* CONFIG_FAIR_GROUP_SCHED */
-static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
-{
-}
-
 static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
 {
 }
-
-static inline void update_entity_shares_tick(struct cfs_rq *cfs_rq)
-{
-}
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 #ifdef CONFIG_SMP
@@ -1490,7 +1396,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	update_cfs_load(cfs_rq, 0);
 	enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
@@ -1587,7 +1492,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (se != cfs_rq->curr)
 		__dequeue_entity(cfs_rq, se);
 	se->on_rq = 0;
-	update_cfs_load(cfs_rq, 0);
 	account_entity_dequeue(cfs_rq, se);
 
 	/*
@@ -1756,11 +1660,6 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	update_entity_load_avg(curr, 1);
 	update_cfs_rq_blocked_load(cfs_rq, 1);
 
-	/*
-	 * Update share accounting for long-running entities.
-	 */
-	update_entity_shares_tick(cfs_rq);
-
 #ifdef CONFIG_SCHED_HRTICK
 	/*
 	 * queued ticks are scheduled to match the slice, so don't bother
@@ -2005,18 +1904,9 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 	cfs_rq->throttle_count--;
 #ifdef CONFIG_SMP
 	if (!cfs_rq->throttle_count) {
-		u64 delta = rq->clock_task - cfs_rq->load_stamp;
-
-		/* leaving throttled state, advance shares averaging windows */
-		cfs_rq->load_stamp += delta;
-		cfs_rq->load_last += delta;
-
 		/* adjust cfs_rq_clock_task() */
 		cfs_rq->throttled_clock_task_time += rq->clock_task -
 					     cfs_rq->throttled_clock_task;
-
-		/* update entity weight now that we are on_rq again */
-		update_cfs_shares(cfs_rq);
 	}
 #endif
 
@@ -2028,11 +1918,9 @@ static int tg_throttle_down(struct task_group *tg, void *data)
 	struct rq *rq = data;
 	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
 
-	/* group is entering throttled state, record last load */
-	if (!cfs_rq->throttle_count) {
-		update_cfs_load(cfs_rq, 0);
+	/* group is entering throttled state, stop time */
+	if (!cfs_rq->throttle_count)
 		cfs_rq->throttled_clock_task = rq->clock_task;
-	}
 	cfs_rq->throttle_count++;
 
 	return 0;
@@ -2630,7 +2518,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 		update_entity_load_avg(se, 1);
 	}
@@ -2692,7 +2579,6 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 		update_entity_load_avg(se, 1);
 	}
@@ -3755,27 +3641,36 @@ next:
  */
 static int update_shares_cpu(struct task_group *tg, int cpu)
 {
+	struct sched_entity *se;
 	struct cfs_rq *cfs_rq;
 	unsigned long flags;
 	struct rq *rq;
 
-	if (!tg->se[cpu])
-		return 0;
-
 	rq = cpu_rq(cpu);
+	se = tg->se[cpu];
 	cfs_rq = tg->cfs_rq[cpu];
 
 	raw_spin_lock_irqsave(&rq->lock, flags);
 
 	update_rq_clock(rq);
-	update_cfs_load(cfs_rq, 1);
 	update_cfs_rq_blocked_load(cfs_rq, 1);
 
-	/*
-	 * We need to update shares after updating tg->load_weight in
-	 * order to adjust the weight of groups with long running tasks.
-	 */
-	update_cfs_shares(cfs_rq);
+	if (se) {
+		update_entity_load_avg(se, 1);
+		/*
+		 * We pivot on our runnable average having decayed to zero for
+		 * list removal.  This generally implies that all our children
+		 * have also been removed (modulo rounding error or bandwidth
+		 * control); however, such cases are rare and we can fix these
+		 * at enqueue.
+		 *
+		 * TODO: fix up out-of-order children on enqueue.
+		 */
+		if (!se->avg.runnable_avg_sum && !cfs_rq->nr_running)
+			list_del_leaf_cfs_rq(cfs_rq);
+	} else {
+		update_rq_runnable_avg(rq, rq->nr_running);
+	}
 
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
 
@@ -5702,10 +5597,6 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 
 	cfs_rq->tg = tg;
 	cfs_rq->rq = rq;
-#ifdef CONFIG_SMP
-	/* allow initial update_cfs_load() to truncate */
-	cfs_rq->load_stamp = 1;
-#endif
 	init_cfs_rq_runtime(cfs_rq);
 
 	tg->cfs_rq[cpu] = cfs_rq;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d13bce7..0a75a43 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -234,11 +234,21 @@ struct cfs_rq {
 	u64 runnable_load_avg, blocked_load_avg;
 	atomic64_t decay_counter, removed_load;
 	u64 last_decay;
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	u32 tg_runnable_contrib;
 	u64 tg_load_contrib;
-#endif
-#endif
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
+	/*
+	 *   h_load = weight * f(tg)
+	 *
+	 * Where f(tg) is the recursive weight fraction assigned to
+	 * this group.
+	 */
+	unsigned long h_load;
+#endif /* CONFIG_SMP */
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	struct rq *rq;	/* cpu runqueue to which this cfs_rq is attached */
 
@@ -254,28 +264,6 @@ struct cfs_rq {
 	struct list_head leaf_cfs_rq_list;
 	struct task_group *tg;	/* group that "owns" this runqueue */
 
-#ifdef CONFIG_SMP
-	/*
-	 *   h_load = weight * f(tg)
-	 *
-	 * Where f(tg) is the recursive weight fraction assigned to
-	 * this group.
-	 */
-	unsigned long h_load;
-
-	/*
-	 * Maintaining per-cpu shares distribution for group scheduling
-	 *
-	 * load_stamp is the last time we updated the load average
-	 * load_last is the last time we updated the load average and saw load
-	 * load_unacc_exec_time is currently unaccounted execution time
-	 */
-	u64 load_avg;
-	u64 load_period;
-	u64 load_stamp, load_last, load_unacc_exec_time;
-
-	unsigned long load_contribution;
-#endif /* CONFIG_SMP */
 #ifdef CONFIG_CFS_BANDWIDTH
 	int runtime_enabled;
 	u64 runtime_expires;

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [tip:sched/core] sched: Refactor update_shares_cpu() -> update_blocked_avgs()
  2012-08-23 14:14 ` [patch 12/16] sched: refactor update_shares_cpu() -> update_blocked_avgs() pjt
@ 2012-10-24  9:54   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 59+ messages in thread
From: tip-bot for Paul Turner @ 2012-10-24  9:54 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, bsegall, hpa, mingo, a.p.zijlstra, pjt, tglx

Commit-ID:  48a1675323fa1b7844e479ad2a4469f4558c0f79
Gitweb:     http://git.kernel.org/tip/48a1675323fa1b7844e479ad2a4469f4558c0f79
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 4 Oct 2012 13:18:31 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 24 Oct 2012 10:27:28 +0200

sched: Refactor update_shares_cpu() -> update_blocked_avgs()

Now that running entities maintain their own load-averages the work we must do
in update_shares() is largely restricted to the periodic decay of blocked
entities.  This allows us to be a little less pessimistic regarding our
occupancy on rq->lock and the associated rq->clock updates required.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141507.133999170@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c |   50 +++++++++++++++++++++++---------------------------
 1 files changed, 23 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 57fae95..dcc27d8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3639,20 +3639,15 @@ next:
 /*
  * update tg->load_weight by folding this cpu's load_avg
  */
-static int update_shares_cpu(struct task_group *tg, int cpu)
+static void __update_blocked_averages_cpu(struct task_group *tg, int cpu)
 {
-	struct sched_entity *se;
-	struct cfs_rq *cfs_rq;
-	unsigned long flags;
-	struct rq *rq;
-
-	rq = cpu_rq(cpu);
-	se = tg->se[cpu];
-	cfs_rq = tg->cfs_rq[cpu];
+	struct sched_entity *se = tg->se[cpu];
+	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];
 
-	raw_spin_lock_irqsave(&rq->lock, flags);
+	/* throttled entities do not contribute to load */
+	if (throttled_hierarchy(cfs_rq))
+		return;
 
-	update_rq_clock(rq);
 	update_cfs_rq_blocked_load(cfs_rq, 1);
 
 	if (se) {
@@ -3669,32 +3664,33 @@ static int update_shares_cpu(struct task_group *tg, int cpu)
 		if (!se->avg.runnable_avg_sum && !cfs_rq->nr_running)
 			list_del_leaf_cfs_rq(cfs_rq);
 	} else {
+		struct rq *rq = rq_of(cfs_rq);
 		update_rq_runnable_avg(rq, rq->nr_running);
 	}
-
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
-
-	return 0;
 }
 
-static void update_shares(int cpu)
+static void update_blocked_averages(int cpu)
 {
-	struct cfs_rq *cfs_rq;
 	struct rq *rq = cpu_rq(cpu);
+	struct cfs_rq *cfs_rq;
+	unsigned long flags;
 
-	rcu_read_lock();
+	raw_spin_lock_irqsave(&rq->lock, flags);
+	update_rq_clock(rq);
 	/*
 	 * Iterates the task_group tree in a bottom up fashion, see
 	 * list_add_leaf_cfs_rq() for details.
 	 */
 	for_each_leaf_cfs_rq(rq, cfs_rq) {
-		/* throttled entities do not contribute to load */
-		if (throttled_hierarchy(cfs_rq))
-			continue;
-
-		update_shares_cpu(cfs_rq->tg, cpu);
+		/*
+		 * Note: We may want to consider periodically releasing
+		 * rq->lock about these updates so that creating many task
+		 * groups does not result in continually extending hold time.
+		 */
+		__update_blocked_averages_cpu(cfs_rq->tg, rq->cpu);
 	}
-	rcu_read_unlock();
+
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
 }
 
 /*
@@ -3746,7 +3742,7 @@ static unsigned long task_h_load(struct task_struct *p)
 	return load;
 }
 #else
-static inline void update_shares(int cpu)
+static inline void update_blocked_averages(int cpu)
 {
 }
 
@@ -4813,7 +4809,7 @@ void idle_balance(int this_cpu, struct rq *this_rq)
 	 */
 	raw_spin_unlock(&this_rq->lock);
 
-	update_shares(this_cpu);
+	update_blocked_averages(this_cpu);
 	rcu_read_lock();
 	for_each_domain(this_cpu, sd) {
 		unsigned long interval;
@@ -5068,7 +5064,7 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)
 	int update_next_balance = 0;
 	int need_serialize;
 
-	update_shares(cpu);
+	update_blocked_averages(cpu);
 
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [tip:sched/core] sched: Update_cfs_shares at period edge
  2012-08-23 14:14 ` [patch 13/16] sched: update_cfs_shares at period edge pjt
  2012-09-24 19:51   ` "Jan H. Schönherr"
@ 2012-10-24  9:55   ` tip-bot for Paul Turner
  1 sibling, 0 replies; 59+ messages in thread
From: tip-bot for Paul Turner @ 2012-10-24  9:55 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, bsegall, hpa, mingo, a.p.zijlstra, pjt, tglx

Commit-ID:  f269ae0469fc882332bdfb5db15d3c1315fe2a10
Gitweb:     http://git.kernel.org/tip/f269ae0469fc882332bdfb5db15d3c1315fe2a10
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 4 Oct 2012 13:18:31 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 24 Oct 2012 10:27:29 +0200

sched: Update_cfs_shares at period edge

Now that our measurement intervals are small (~1ms) we can amortize the posting
of update_shares() to about once per period overflow.  This is a large cost
saving for frequently switching tasks.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141507.200772172@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c |   18 ++++++++++--------
 1 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dcc27d8..002a769 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1187,6 +1187,7 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 	}
 
 	__update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
+	update_cfs_shares(cfs_rq);
 }
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
@@ -1396,9 +1397,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
 	account_entity_enqueue(cfs_rq, se);
-	update_cfs_shares(cfs_rq);
+	enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
 
 	if (flags & ENQUEUE_WAKEUP) {
 		place_entity(cfs_rq, se, 0);
@@ -1471,7 +1471,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
 
 	update_stats_dequeue(cfs_rq, se);
 	if (flags & DEQUEUE_SLEEP) {
@@ -1491,8 +1490,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	if (se != cfs_rq->curr)
 		__dequeue_entity(cfs_rq, se);
-	se->on_rq = 0;
 	account_entity_dequeue(cfs_rq, se);
+	dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
 
 	/*
 	 * Normalize the entity after updating the min_vruntime because the
@@ -1506,7 +1505,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	return_cfs_rq_runtime(cfs_rq);
 
 	update_min_vruntime(cfs_rq);
-	update_cfs_shares(cfs_rq);
+	se->on_rq = 0;
 }
 
 /*
@@ -2518,8 +2517,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_cfs_shares(cfs_rq);
 		update_entity_load_avg(se, 1);
+		update_cfs_rq_blocked_load(cfs_rq, 0);
 	}
 
 	if (!se) {
@@ -2579,8 +2578,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_cfs_shares(cfs_rq);
 		update_entity_load_avg(se, 1);
+		update_cfs_rq_blocked_load(cfs_rq, 0);
 	}
 
 	if (!se) {
@@ -5639,8 +5638,11 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 		se = tg->se[i];
 		/* Propagate contribution to hierarchy */
 		raw_spin_lock_irqsave(&rq->lock, flags);
-		for_each_sched_entity(se)
+		for_each_sched_entity(se) {
 			update_cfs_shares(group_cfs_rq(se));
+			/* update contribution to parent */
+			update_entity_load_avg(se, 1);
+		}
 		raw_spin_unlock_irqrestore(&rq->lock, flags);
 	}
 

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [tip:sched/core] sched: Make __update_entity_runnable_avg() fast
  2012-08-23 14:14 ` [patch 14/16] sched: make __update_entity_runnable_avg() fast pjt
  2012-08-24  8:28   ` Namhyung Kim
@ 2012-10-24  9:56   ` tip-bot for Paul Turner
  1 sibling, 0 replies; 59+ messages in thread
From: tip-bot for Paul Turner @ 2012-10-24  9:56 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, bsegall, hpa, mingo, a.p.zijlstra, pjt, tglx

Commit-ID:  5b51f2f80b3b906ce59bd4dce6eca3c7f34cb1b9
Gitweb:     http://git.kernel.org/tip/5b51f2f80b3b906ce59bd4dce6eca3c7f34cb1b9
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 4 Oct 2012 13:18:32 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 24 Oct 2012 10:27:30 +0200

sched: Make __update_entity_runnable_avg() fast

__update_entity_runnable_avg forms the core of maintaining an entity's runnable
load average.  In this function we charge the accumulated run-time since last
update and handle appropriate decay.  In some cases, e.g. a waking task, this
time interval may be much larger than our period unit.

Fortunately we can exploit some properties of our series to perform decay for a
blocked update in constant time and account the contribution for a running
update in essentially-constant* time.

[*]: For any running entity they should be performing updates at the tick which
gives us a soft limit of 1 jiffy between updates, and we can compute up to a
32 jiffy update in a single pass.

C program to generate the magic constants in the arrays:

  #include <math.h>
  #include <stdio.h>

  #define N 32
  #define WMULT_SHIFT 32

  const long WMULT_CONST = ((1UL << N) - 1);
  double y;

  long runnable_avg_yN_inv[N];
  void calc_mult_inv() {
  	int i;
  	double yn = 0;

  	printf("inverses\n");
  	for (i = 0; i < N; i++) {
  		yn = (double)WMULT_CONST * pow(y, i);
  		runnable_avg_yN_inv[i] = yn;
  		printf("%2d: 0x%8lx\n", i, runnable_avg_yN_inv[i]);
  	}
  	printf("\n");
  }

  long mult_inv(long c, int n) {
  	return (c * runnable_avg_yN_inv[n]) >>  WMULT_SHIFT;
  }

  void calc_yn_sum(int n)
  {
  	int i;
  	double sum = 0, sum_fl = 0, diff = 0;

  	/*
  	 * We take the floored sum to ensure the sum of partial sums is never
  	 * larger than the actual sum.
  	 */
  	printf("sum y^n\n");
  	printf("   %8s  %8s %8s\n", "exact", "floor", "error");
  	for (i = 1; i <= n; i++) {
  		sum = (y * sum + y * 1024);
  		sum_fl = floor(y * sum_fl+ y * 1024);
  		printf("%2d: %8.0f  %8.0f %8.0f\n", i, sum, sum_fl,
  			sum_fl - sum);
  	}
  	printf("\n");
  }

  void calc_conv(long n) {
  	long old_n;
  	int i = -1;

  	printf("convergence (LOAD_AVG_MAX, LOAD_AVG_MAX_N)\n");
  	do {
  		old_n = n;
  		n = mult_inv(n, 1) + 1024;
  		i++;
  	} while (n != old_n);
  	printf("%d> %ld\n", i - 1, n);
  	printf("\n");
  }

  void main() {
  	y = pow(0.5, 1/(double)N);
  	calc_mult_inv();
  	calc_conv(1024);
  	calc_yn_sum(N);
  }

[ Compile with -lm ]
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141507.277808946@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c |  125 +++++++++++++++++++++++++++++++++++++++++----------
 1 files changed, 101 insertions(+), 24 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 002a769..6ecf455 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -884,17 +884,92 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
 
 #ifdef CONFIG_SMP
 /*
+ * We choose a half-life close to 1 scheduling period.
+ * Note: The tables below are dependent on this value.
+ */
+#define LOAD_AVG_PERIOD 32
+#define LOAD_AVG_MAX 47742 /* maximum possible load avg */
+#define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_MAX_AVG */
+
+/* Precomputed fixed inverse multiplies for multiplication by y^n */
+static const u32 runnable_avg_yN_inv[] = {
+	0xffffffff, 0xfa83b2da, 0xf5257d14, 0xefe4b99a, 0xeac0c6e6, 0xe5b906e6,
+	0xe0ccdeeb, 0xdbfbb796, 0xd744fcc9, 0xd2a81d91, 0xce248c14, 0xc9b9bd85,
+	0xc5672a10, 0xc12c4cc9, 0xbd08a39e, 0xb8fbaf46, 0xb504f333, 0xb123f581,
+	0xad583ee9, 0xa9a15ab4, 0xa5fed6a9, 0xa2704302, 0x9ef5325f, 0x9b8d39b9,
+	0x9837f050, 0x94f4efa8, 0x91c3d373, 0x8ea4398a, 0x8b95c1e3, 0x88980e80,
+	0x85aac367, 0x82cd8698,
+};
+
+/*
+ * Precomputed \Sum y^k { 1<=k<=n }.  These are floor(true_value) to prevent
+ * over-estimates when re-combining.
+ */
+static const u32 runnable_avg_yN_sum[] = {
+	    0, 1002, 1982, 2941, 3880, 4798, 5697, 6576, 7437, 8279, 9103,
+	 9909,10698,11470,12226,12966,13690,14398,15091,15769,16433,17082,
+	17718,18340,18949,19545,20128,20698,21256,21802,22336,22859,23371,
+};
+
+/*
  * Approximate:
  *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
  */
 static __always_inline u64 decay_load(u64 val, u64 n)
 {
-	for (; n && val; n--) {
-		val *= 4008;
-		val >>= 12;
+	unsigned int local_n;
+
+	if (!n)
+		return val;
+	else if (unlikely(n > LOAD_AVG_PERIOD * 63))
+		return 0;
+
+	/* after bounds checking we can collapse to 32-bit */
+	local_n = n;
+
+	/*
+	 * As y^PERIOD = 1/2, we can combine
+	 *    y^n = 1/2^(n/PERIOD) * k^(n%PERIOD)
+	 * With a look-up table which covers k^n (n<PERIOD)
+	 *
+	 * To achieve constant time decay_load.
+	 */
+	if (unlikely(local_n >= LOAD_AVG_PERIOD)) {
+		val >>= local_n / LOAD_AVG_PERIOD;
+		local_n %= LOAD_AVG_PERIOD;
 	}
 
-	return val;
+	val *= runnable_avg_yN_inv[local_n];
+	/* We don't use SRR here since we always want to round down. */
+	return val >> 32;
+}
+
+/*
+ * For updates fully spanning n periods, the contribution to runnable
+ * average will be: \Sum 1024*y^n
+ *
+ * We can compute this reasonably efficiently by combining:
+ *   y^PERIOD = 1/2 with precomputed \Sum 1024*y^n {for  n <PERIOD}
+ */
+static u32 __compute_runnable_contrib(u64 n)
+{
+	u32 contrib = 0;
+
+	if (likely(n <= LOAD_AVG_PERIOD))
+		return runnable_avg_yN_sum[n];
+	else if (unlikely(n >= LOAD_AVG_MAX_N))
+		return LOAD_AVG_MAX;
+
+	/* Compute \Sum k^n combining precomputed values for k^i, \Sum k^j */
+	do {
+		contrib /= 2; /* y^LOAD_AVG_PERIOD = 1/2 */
+		contrib += runnable_avg_yN_sum[LOAD_AVG_PERIOD];
+
+		n -= LOAD_AVG_PERIOD;
+	} while (n > LOAD_AVG_PERIOD);
+
+	contrib = decay_load(contrib, n);
+	return contrib + runnable_avg_yN_sum[n];
 }
 
 /*
@@ -929,7 +1004,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 							struct sched_avg *sa,
 							int runnable)
 {
-	u64 delta;
+	u64 delta, periods;
+	u32 runnable_contrib;
 	int delta_w, decayed = 0;
 
 	delta = now - sa->last_runnable_update;
@@ -963,25 +1039,26 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 		 * period and accrue it.
 		 */
 		delta_w = 1024 - delta_w;
-		BUG_ON(delta_w > delta);
-		do {
-			if (runnable)
-				sa->runnable_avg_sum += delta_w;
-			sa->runnable_avg_period += delta_w;
-
-			/*
-			 * Remainder of delta initiates a new period, roll over
-			 * the previous.
-			 */
-			sa->runnable_avg_sum =
-				decay_load(sa->runnable_avg_sum, 1);
-			sa->runnable_avg_period =
-				decay_load(sa->runnable_avg_period, 1);
-
-			delta -= delta_w;
-			/* New period is empty */
-			delta_w = 1024;
-		} while (delta >= 1024);
+		if (runnable)
+			sa->runnable_avg_sum += delta_w;
+		sa->runnable_avg_period += delta_w;
+
+		delta -= delta_w;
+
+		/* Figure out how many additional periods this update spans */
+		periods = delta / 1024;
+		delta %= 1024;
+
+		sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
+						  periods + 1);
+		sa->runnable_avg_period = decay_load(sa->runnable_avg_period,
+						     periods + 1);
+
+		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
+		runnable_contrib = __compute_runnable_contrib(periods);
+		if (runnable)
+			sa->runnable_avg_sum += runnable_contrib;
+		sa->runnable_avg_period += runnable_contrib;
 	}
 
 	/* Remainder of delta accrued against u_0` */

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [tip:sched/core] sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking
  2012-08-23 14:14 ` [patch 16/16] sched: introduce temporary FAIR_GROUP_SCHED dependency for load-tracking pjt
@ 2012-10-24  9:57   ` tip-bot for Paul Turner
  0 siblings, 0 replies; 59+ messages in thread
From: tip-bot for Paul Turner @ 2012-10-24  9:57 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, bsegall, hpa, mingo, a.p.zijlstra, pjt, tglx

Commit-ID:  f4e26b120b9de84cb627bc7361ba43cfdc51341f
Gitweb:     http://git.kernel.org/tip/f4e26b120b9de84cb627bc7361ba43cfdc51341f
Author:     Paul Turner <pjt@google.com>
AuthorDate: Thu, 4 Oct 2012 13:18:32 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 24 Oct 2012 10:27:31 +0200

sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking

While per-entity load-tracking is generally useful beyond computing shares
distribution (e.g. runnable-based load-balance, which is in progress, as well
as governors and power-management), these facilities are not yet consumers of
the data.

This dependency may be trivially reverted when the information is required;
until then, avoid paying the overhead for calculations we will not use.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141507.422162369@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |    8 +++++++-
 kernel/sched/core.c   |    7 ++++++-
 kernel/sched/fair.c   |   13 +++++++++++--
 kernel/sched/sched.h  |    9 ++++++++-
 4 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e483ccb..e1581a0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1168,7 +1168,13 @@ struct sched_entity {
 	/* rq "owned" by this entity/group: */
 	struct cfs_rq		*my_q;
 #endif
-#ifdef CONFIG_SMP
+/*
+ * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
+ * removed when useful for applications beyond shares distribution (e.g.
+ * load-balance).
+ */
+#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+	/* Per-entity load-tracking */
 	struct sched_avg	avg;
 #endif
 };
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f268600..5dae0d2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1526,7 +1526,12 @@ static void __sched_fork(struct task_struct *p)
 	p->se.vruntime			= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
-#ifdef CONFIG_SMP
+/*
+ * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
+ * removed when useful for applications beyond shares distribution (e.g.
+ * load-balance).
+ */
+#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
 	p->se.avg.runnable_avg_period = 0;
 	p->se.avg.runnable_avg_sum = 0;
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6ecf455..3e6a353 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -882,7 +882,8 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-#ifdef CONFIG_SMP
+/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
+#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
 /*
  * We choose a half-life close to 1 scheduling period.
  * Note: The tables below are dependent on this value.
@@ -3174,6 +3175,12 @@ unlock:
 }
 
 /*
+ * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
+ * removed when useful for applications beyond shares distribution (e.g.
+ * load-balance).
+ */
+#ifdef CONFIG_FAIR_GROUP_SCHED
+/*
  * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
  * cfs_rq_of(p) references at time of call are still valid and identify the
  * previous cpu.  However, the caller only guarantees p->pi_lock is held; no
@@ -3196,6 +3203,7 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
 		atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
 	}
 }
+#endif
 #endif /* CONFIG_SMP */
 
 static unsigned long
@@ -5773,8 +5781,9 @@ const struct sched_class fair_sched_class = {
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_fair,
+#ifdef CONFIG_FAIR_GROUP_SCHED
 	.migrate_task_rq	= migrate_task_rq_fair,
-
+#endif
 	.rq_online		= rq_online_fair,
 	.rq_offline		= rq_offline_fair,
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0a75a43..5eca173 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -225,6 +225,12 @@ struct cfs_rq {
 #endif
 
 #ifdef CONFIG_SMP
+/*
+ * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
+ * removed when useful for applications beyond shares distribution (e.g.
+ * load-balance).
+ */
+#ifdef CONFIG_FAIR_GROUP_SCHED
 	/*
 	 * CFS Load tracking
 	 * Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -234,7 +240,8 @@ struct cfs_rq {
 	u64 runnable_load_avg, blocked_load_avg;
 	atomic64_t decay_counter, removed_load;
 	u64 last_decay;
-
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+/* These always depend on CONFIG_FAIR_GROUP_SCHED */
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	u32 tg_runnable_contrib;
 	u64 tg_load_contrib;

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [tip:sched/core] sched: Track the runnable average on a per-task entity basis
  2012-10-24  9:43   ` [tip:sched/core] sched: Track the runnable average on a per-task entity basis tip-bot for Paul Turner
@ 2012-10-25  3:28     ` li guang
  2012-10-25 16:58       ` Benjamin Segall
  0 siblings, 1 reply; 59+ messages in thread
From: li guang @ 2012-10-25  3:28 UTC (permalink / raw)
  To: mingo, hpa, bsegall, linux-kernel, a.p.zijlstra, pjt, tglx
  Cc: linux-tip-commits

On Wed, 2012-10-24 at 02:43 -0700, tip-bot for Paul Turner wrote:
> Commit-ID:  9d85f21c94f7f7a84d0ba686c58aa6d9da58fdbb
> Gitweb:     http://git.kernel.org/tip/9d85f21c94f7f7a84d0ba686c58aa6d9da58fdbb
> Author:     Paul Turner <pjt@google.com>
> AuthorDate: Thu, 4 Oct 2012 13:18:29 +0200
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Wed, 24 Oct 2012 10:27:18 +0200
> 
> sched: Track the runnable average on a per-task entity basis
> 
> Instead of tracking averaging the load parented by a cfs_rq, we can track
> entity load directly. With the load for a given cfs_rq then being the sum
> of its children.
> 
> To do this we represent the historical contribution to runnable average
> within each trailing 1024us of execution as the coefficients of a
> geometric series.
> 
> We can express this for a given task t as:
> 
>   runnable_sum(t) = \Sum u_i * y^i, runnable_avg_period(t) = \Sum 1024 * y^i
>   load(t) = weight_t * runnable_sum(t) / runnable_avg_period(t)
> 
> Where: u_i is the usage in the last i`th 1024us period (approximately 1ms)
> ~ms and y is chosen such that y^k = 1/2.  We currently choose k to be 32 which
> roughly translates to about a sched period.
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> Reviewed-by: Ben Segall <bsegall@google.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Link: http://lkml.kernel.org/r/20120823141506.372695337@google.com
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  include/linux/sched.h |   13 +++++
>  kernel/sched/core.c   |    5 ++
>  kernel/sched/debug.c  |    4 ++
>  kernel/sched/fair.c   |  129 +++++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 151 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 0dd42a0..418fc6d 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1095,6 +1095,16 @@ struct load_weight {
>  	unsigned long weight, inv_weight;
>  };
>  
> +struct sched_avg {
> +	/*
> +	 * These sums represent an infinite geometric series and so are bound
> +	 * above by 1024/(1-y).  Thus we only need a u32 to store them for for all
> +	 * choices of y < 1-2^(-32)*1024.
> +	 */
> +	u32 runnable_avg_sum, runnable_avg_period;
> +	u64 last_runnable_update;
> +};
> +
>  #ifdef CONFIG_SCHEDSTATS
>  struct sched_statistics {
>  	u64			wait_start;
> @@ -1155,6 +1165,9 @@ struct sched_entity {
>  	/* rq "owned" by this entity/group: */
>  	struct cfs_rq		*my_q;
>  #endif
> +#ifdef CONFIG_SMP
> +	struct sched_avg	avg;
> +#endif
>  };
>  
>  struct sched_rt_entity {
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2d8927f..fd9d085 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1524,6 +1524,11 @@ static void __sched_fork(struct task_struct *p)
>  	p->se.vruntime			= 0;
>  	INIT_LIST_HEAD(&p->se.group_node);
>  
> +#ifdef CONFIG_SMP
> +	p->se.avg.runnable_avg_period = 0;
> +	p->se.avg.runnable_avg_sum = 0;
> +#endif
> +
>  #ifdef CONFIG_SCHEDSTATS
>  	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
>  #endif
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 6f79596..61f7097 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -85,6 +85,10 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
>  	P(se->statistics.wait_count);
>  #endif
>  	P(se->load.weight);
> +#ifdef CONFIG_SMP
> +	P(se->avg.runnable_avg_sum);
> +	P(se->avg.runnable_avg_period);
> +#endif
>  #undef PN
>  #undef P
>  }
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6b800a1..16d67f9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -971,6 +971,126 @@ static inline void update_entity_shares_tick(struct cfs_rq *cfs_rq)
>  }
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>  
> +#ifdef CONFIG_SMP
> +/*
> + * Approximate:
> + *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
> + */
> +static __always_inline u64 decay_load(u64 val, u64 n)
> +{
> +	for (; n && val; n--) {
> +		val *= 4008;
> +		val >>= 12;
> +	}
> +
> +	return val;
> +}
> +
> +/*
> + * We can represent the historical contribution to runnable average as the
> + * coefficients of a geometric series.  To do this we sub-divide our runnable
> + * history into segments of approximately 1ms (1024us); label the segment that
> + * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g.
> + *
> + * [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
> + *      p0            p1           p2
> + *     (now)       (~1ms ago)  (~2ms ago)
> + *
> + * Let u_i denote the fraction of p_i that the entity was runnable.
> + *
> + * We then designate the fractions u_i as our co-efficients, yielding the
> + * following representation of historical load:
> + *   u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
> + *
> + * We choose y based on the with of a reasonably scheduling period, fixing:
> + *   y^32 = 0.5
> + *
> + * This means that the contribution to load ~32ms ago (u_32) will be weighted
> + * approximately half as much as the contribution to load within the last ms
> + * (u_0).
> + *
> + * When a period "rolls over" and we have new u_0`, multiplying the previous
> + * sum again by y is sufficient to update:
> + *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
> + *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
> + */
> +static __always_inline int __update_entity_runnable_avg(u64 now,
> +							struct sched_avg *sa,
> +							int runnable)
> +{
> +	u64 delta;
> +	int delta_w, decayed = 0;
> +
> +	delta = now - sa->last_runnable_update;
> +	/*
> +	 * This should only happen when time goes backwards, which it
> +	 * unfortunately does during sched clock init when we swap over to TSC.
> +	 */
> +	if ((s64)delta < 0) {
> +		sa->last_runnable_update = now;
> +		return 0;
> +	}
> +
> +	/*
> +	 * Use 1024ns as the unit of measurement since it's a reasonable
> +	 * approximation of 1us and fast to compute.
> +	 */
> +	delta >>= 10;
> +	if (!delta)
> +		return 0;
> +	sa->last_runnable_update = now;
> +
> +	/* delta_w is the amount already accumulated against our next period */
> +	delta_w = sa->runnable_avg_period % 1024;
> +	if (delta + delta_w >= 1024) {
> +		/* period roll-over */
> +		decayed = 1;
> +
> +		/*
> +		 * Now that we know we're crossing a period boundary, figure
> +		 * out how much from delta we need to complete the current
> +		 * period and accrue it.
> +		 */
> +		delta_w = 1024 - delta_w;
> +		BUG_ON(delta_w > delta);
> +		do {
> +			if (runnable)
> +				sa->runnable_avg_sum += delta_w;
> +			sa->runnable_avg_period += delta_w;
> +
> +			/*
> +			 * Remainder of delta initiates a new period, roll over
> +			 * the previous.
> +			 */
> +			sa->runnable_avg_sum =
> +				decay_load(sa->runnable_avg_sum, 1);

Is this u0 + u1*y + u2*y^2 + u3*y^3 ...?
It seems not; this looks like u0 + u1*y + u2*y + u3*y + u4*y ...

> +			sa->runnable_avg_period =
> +				decay_load(sa->runnable_avg_period, 1);
> +
> +			delta -= delta_w;
> +			/* New period is empty */
> +			delta_w = 1024;
> +		} while (delta >= 1024);
> +	}
> +
> +	/* Remainder of delta accrued against u_0` */
> +	if (runnable)
> +		sa->runnable_avg_sum += delta;
> +	sa->runnable_avg_period += delta;
> +
> +	return decayed;
> +}
> +
> +/* Update a sched_entity's runnable average */
> +static inline void update_entity_load_avg(struct sched_entity *se)
> +{
> +	__update_entity_runnable_avg(rq_of(cfs_rq_of(se))->clock_task, &se->avg,
> +				     se->on_rq);
> +}
> +#else
> +static inline void update_entity_load_avg(struct sched_entity *se) {}
> +#endif
> +
>  static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>  #ifdef CONFIG_SCHEDSTATS
> @@ -1097,6 +1217,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	 */
>  	update_curr(cfs_rq);
>  	update_cfs_load(cfs_rq, 0);
> +	update_entity_load_avg(se);
>  	account_entity_enqueue(cfs_rq, se);
>  	update_cfs_shares(cfs_rq);
>  
> @@ -1171,6 +1292,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	 * Update run-time statistics of the 'current'.
>  	 */
>  	update_curr(cfs_rq);
> +	update_entity_load_avg(se);
>  
>  	update_stats_dequeue(cfs_rq, se);
>  	if (flags & DEQUEUE_SLEEP) {
> @@ -1340,6 +1462,8 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
>  		update_stats_wait_start(cfs_rq, prev);
>  		/* Put 'current' back into the tree. */
>  		__enqueue_entity(cfs_rq, prev);
> +		/* in !on_rq case, update occurred at dequeue */
> +		update_entity_load_avg(prev);
>  	}
>  	cfs_rq->curr = NULL;
>  }
> @@ -1353,6 +1477,11 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>  	update_curr(cfs_rq);
>  
>  	/*
> +	 * Ensure that runnable average is periodically updated.
> +	 */
> +	update_entity_load_avg(curr);
> +
> +	/*
>  	 * Update share accounting for long-running entities.
>  	 */
>  	update_entity_shares_tick(cfs_rq);
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
liguang    lig.fnst@cn.fujitsu.com
FNST linux kernel team


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [tip:sched/core] sched: Track the runnable average on a per-task entity basis
  2012-10-25  3:28     ` li guang
@ 2012-10-25 16:58       ` Benjamin Segall
  0 siblings, 0 replies; 59+ messages in thread
From: Benjamin Segall @ 2012-10-25 16:58 UTC (permalink / raw)
  To: li guang
  Cc: mingo, hpa, linux-kernel, a.p.zijlstra, pjt, tglx, linux-tip-commits

li guang <lig.fnst@cn.fujitsu.com> writes:

> On Wed, 2012-10-24 at 02:43 -0700, tip-bot for Paul Turner wrote:
>> +		do {
>> +			if (runnable)
>> +				sa->runnable_avg_sum += delta_w;
>> +			sa->runnable_avg_period += delta_w;
>> +
>> +			/*
>> +			 * Remainder of delta initiates a new period, roll over
>> +			 * the previous.
>> +			 */
>> +			sa->runnable_avg_sum =
>> +				decay_load(sa->runnable_avg_sum, 1);
>
> Is this u0+u1*y+u2*y^2+u3*y^3 ...,
> seems no, this is u0+u1*y+u2*y+u3*y+u4*y ...
>
It is cumulative, so it is u0 + y*(u1 + y*(u2 + ...)), which is u0 + u1*y + u2*y^2 + ...
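
To make the equivalence concrete, here is a small stand-alone demo (my own
illustration, not the kernel code) showing that decaying the running sum once
per rolled-over period is the same as weighting each u_i by y^i directly:

  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
  	/* u[0] is the most recent 1024us period, u[3] the oldest */
  	double u[4] = { 100, 200, 300, 400 };
  	double y = 0.5;		/* any 0 < y < 1; 0.5 keeps the numbers simple */
  	double rolled = 0, direct = 0;
  	int i;

  	/* iterative form: on each roll-over, decay the old sum then accrue u_0 */
  	for (i = 3; i >= 0; i--)
  		rolled = rolled * y + u[i];

  	/* closed form: u_0 + u_1*y + u_2*y^2 + u_3*y^3 */
  	for (i = 0; i < 4; i++)
  		direct += u[i] * pow(y, i);

  	printf("%f %f\n", rolled, direct);	/* both print 325.000000 */
  	return 0;
  }

  [ Compile with -lm ]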

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 02/16] sched: maintain per-rq runnable averages
  2012-08-23 14:14 ` [patch 02/16] sched: maintain per-rq runnable averages pjt
  2012-10-24  9:44   ` [tip:sched/core] sched: Maintain " tip-bot for Ben Segall
@ 2012-10-28 10:12   ` Preeti Murthy
  2012-10-29 17:38     ` Benjamin Segall
  1 sibling, 1 reply; 59+ messages in thread
From: Preeti Murthy @ 2012-10-28 10:12 UTC (permalink / raw)
  To: pjt, Ben Segall
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Venki Pallipadi, Mike Galbraith, Vincent Guittot,
	Nikunj A Dadhania, Morten Rasmussen, Paul E. McKenney,
	Namhyung Kim

Hi Paul, Ben,

A few queries regarding this patch:

1. What exactly is the significance of introducing the sched_avg structure
   for a runqueue? If I have understood correctly, sched_avg keeps track of
   how long a task has been active, how long it has been serviced by the
   processor, and its lifetime. How does this apply analogously to the
   runqueue?

2. Is this the right measure to overwrite rq->load.weight, given that
   rq->sched_avg does not seem to take care of task priorities? IOW, what is
   the idea behind introducing this metric for the runqueue? Why can't the
   run queue load be updated the same way as the cfs_rq load is updated,
   via cfs_rq->runnable_load_avg and cfs_rq->blocked_load_avg?

3. What is the significance of passing rq->nr_running in enqueue_task_fair
   while updating the run queue load? __update_entity_runnable_avg does not
   treat this argument any differently if it is >1.

Thank you

Regards
Preeti U Murthy

On Thu, Aug 23, 2012 at 7:44 PM,  <pjt@google.com> wrote:
> From: Ben Segall <bsegall@google.com>
>
> Since runqueues do not have a corresponding sched_entity we instead embed a
> sched_avg structure directly.
>
> Signed-off-by: Ben Segall <bsegall@google.com>
> Reviewed-by: Paul Turner <pjt@google.com>
> ---

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 02/16] sched: maintain per-rq runnable averages
  2012-10-28 10:12   ` [patch 02/16] sched: maintain " Preeti Murthy
@ 2012-10-29 17:38     ` Benjamin Segall
  2012-11-07  8:28       ` Preeti U Murthy
  0 siblings, 1 reply; 59+ messages in thread
From: Benjamin Segall @ 2012-10-29 17:38 UTC (permalink / raw)
  To: Preeti Murthy
  Cc: pjt, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Venki Pallipadi, Mike Galbraith, Vincent Guittot,
	Nikunj A Dadhania, Morten Rasmussen, Paul E. McKenney,
	Namhyung Kim

Preeti Murthy <preeti.lkml@gmail.com> writes:

> Hi Paul, Ben,
>
> A few queries regarding this patch:
>
> 1.What exactly is the significance of introducing sched_avg structure
> for a runqueue? If I have
>    understood correctly, sched_avg keeps track of how long a task has
> been active,
>    how long has it been serviced by the processor and its lifetime.How
> does this apply analogously
>    to the runqueue?

Remember that sched_avg's are not just for tasks, they're for any CFS
group entity (sched_entity), for which they track the time runnable and
the time used, which allows the system-wide per-task_group computation
of runnable and usage.

Computing these on the root has no usage in this patchset, but any
extensions of this using hierarchy-based fractional usage or runnable
time would need it, and retrofitting it afterwards would be a pain.
>
> 2.Is this a right measure to overwrite rq->load.weight because the
> rq->sched_avg does not seem to
>    take care of task priorities.IOW, what is the idea behind
> introducing this metric for the runqueue?
>    Why cant the run queue load be updated the same way as the cfs_rq
> load is updated:
>     cfs_rq->runnable_load_avg and cfs_rq->blocked_load_avg.

Loadwise you would indeed want the cfs_rq statistics, that is what they
are there for. The sched_avg numbers are only useful in computing the
parent's load (irrelevant on the root), or for extensions using raw
usage/runnable numbers.
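
As a rough sketch of the relationship (simplified, plain C, not the kernel
code): a task entity's contribution is its weight scaled by the fraction of
time it was runnable, and cfs_rq->runnable_load_avg is the sum of the
contributions of its queued children, which is why priorities are already
reflected there while rq->avg is purely a time fraction.

  /* Simplified illustration only; the kernel versions differ in detail. */
  unsigned long long entity_load_contrib(unsigned long weight,
  					 unsigned int runnable_avg_sum,
  					 unsigned int runnable_avg_period)
  {
  	/* weight scaled by the runnable fraction; +1 avoids division by zero */
  	return (unsigned long long)weight * runnable_avg_sum /
  	       (runnable_avg_period + 1);
  }
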
>
> 3.What is the significance of passing rq->nr_running in
> enqueue_task_fair while updating
>    the run queue load? Because __update_entity_runnable_avg does not
> treat this argument
>    any differently if it is >1.

That could just as well be rq->nr_running != 0, it would behave the same.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 02/16] sched: maintain per-rq runnable averages
  2012-10-29 17:38     ` Benjamin Segall
@ 2012-11-07  8:28       ` Preeti U Murthy
  0 siblings, 0 replies; 59+ messages in thread
From: Preeti U Murthy @ 2012-11-07  8:28 UTC (permalink / raw)
  To: Benjamin Segall
  Cc: Preeti Murthy, pjt, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Venki Pallipadi, Mike Galbraith, Vincent Guittot,
	Nikunj A Dadhania, Morten Rasmussen, Paul E. McKenney,
	Namhyung Kim, viresh.kumar

On 10/29/2012 11:08 PM, Benjamin Segall wrote:
> Preeti Murthy <preeti.lkml@gmail.com> writes:
> 
>> Hi Paul, Ben,
>>
>> A few queries regarding this patch:
>>
>> 1.What exactly is the significance of introducing sched_avg structure
>> for a runqueue? If I have
>>    understood correctly, sched_avg keeps track of how long a task has
>> been active,
>>    how long has it been serviced by the processor and its lifetime.How
>> does this apply analogously
>>    to the runqueue?
> 
> Remember that sched_avg's are not just for tasks, they're for any CFS
> group entity (sched_entity), for which they track the time runnable and
> the time used, which allows the system-wide per-task_group computation
> of runnable and usage.
> 
> Computing these on the root has no usage in this patchset, but any
> extensions of this using hierarchy-based fractional usage or runnable
> time would need it, and retrofitting it afterwards would be a pain.
>>
>> 2.Is this a right measure to overwrite rq->load.weight because the
>> rq->sched_avg does not seem to
>>    take care of task priorities.IOW, what is the idea behind
>> introducing this metric for the runqueue?
>>    Why cant the run queue load be updated the same way as the cfs_rq
>> load is updated:
>>     cfs_rq->runnable_load_avg and cfs_rq->blocked_load_avg.
> 
> Loadwise you would indeed want the cfs_rq statistics, that is what they
> are there for. The sched_avg numbers are only useful in computing the
> parent's load (irrelevant on the root), or for extensions using raw
> usage/runnable numbers.
>>
>> 3.What is the significance of passing rq->nr_running in
>> enqueue_task_fair while updating
>>    the run queue load? Because __update_entity_runnable_avg does not
>> treat this argument
>>    any differently if it is >1.
> 
> That could just as well be rq->nr_running != 0, it would behave the same.

Hi Ben,
After going through your suggestions, below is the patch with which I wish to
begin my effort to integrate the per-entity load-tracking metric with the
scheduler. I had posted a patchset earlier
(https://lkml.org/lkml/2012/10/25/162), but due to various drawbacks
I am redoing it along the lines of the suggestions posted in reply to it.
Please do let me know if I am using the metric in the right way. Thanks.

Regards
Preeti U Murthy

------------START OF PATCH--------------------------------------------

Since load balancing requires the runqueue load to track the load of the sched
groups, and hence of the sched domains, introduce the cfs_rq-equivalent metric
of current runnable load on the run queue as well.

The idea is something like this:

1. The entire load balancing framework hinges upon what weighted_cpuload()
has to say about the load of a sched group, which in turn adds up to the
weight of the sched domain and is ultimately used to decide whether to
load balance and to calculate the imbalance.

2. Currently weighted_cpuload() returns rq->load.weight, but it needs to use
the per-entity load-tracking metric to reflect the runqueue load, so it should
return rq->runnable_load_avg instead (see the sketch after these points).

3. This is the first step towards integrating the per-entity load-tracking
metric with the load balancer.
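
As a rough sketch of where this is headed (my illustration, not part of the
patch below, and the final form may well differ), the eventual change to
weighted_cpuload() would look something like:

  static unsigned long weighted_cpuload(const int cpu)
  {
  	/* report the tracked runnable average instead of the
  	 * instantaneous rq->load.weight */
  	return cpu_rq(cpu)->runnable_load_avg;
  }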

Signed-off-by: Preeti U Murthy
---
 kernel/sched/fair.c  |    9 ++++++++-
 kernel/sched/sched.h |    1 +
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a9cdc8f..6c89b28 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1499,8 +1499,11 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 	if (!update_cfs_rq)
 		return;

-	if (se->on_rq)
+	if (se->on_rq) {
 		cfs_rq->runnable_load_avg += contrib_delta;
+		if (!parent_entity(se))
+			rq_of(cfs_rq)->runnable_load_avg += contrib_delta;
+	}
 	else
 		subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
 }
@@ -1579,6 +1582,8 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 	}

 	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
+	if (!parent_entity(se))
+		rq_of(cfs_rq)->runnable_load_avg += se->avg.load_avg_contrib;
 	/* we force update consideration on load-balancer moves */
 	update_cfs_rq_blocked_load(cfs_rq, !wakeup);
 }
@@ -1597,6 +1602,8 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
 	update_cfs_rq_blocked_load(cfs_rq, !sleep);

 	cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
+	if (!parent_entity(se))
+		rq_of(cfs_rq)->runnable_load_avg -= se->avg.load_avg_contrib;
 	if (sleep) {
 		cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
 		se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bfd004a..3001d97 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -382,6 +382,7 @@ struct rq {
 	/* list of leaf cfs_rq on this cpu: */
 	struct list_head leaf_cfs_rq_list;
 #ifdef CONFIG_SMP
+	u64 runnable_load_avg;
 	unsigned long h_load_throttle;
 #endif /* CONFIG_SMP */
 #endif /* CONFIG_FAIR_GROUP_SCHED */




^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [patch 00/16] sched: per-entity load-tracking
  2012-08-23 14:14 [patch 00/16] sched: per-entity load-tracking pjt
                   ` (16 preceding siblings ...)
  2012-09-24  9:30 ` [patch 00/16] sched: per-entity load-tracking "Jan H. Schönherr"
@ 2012-11-26 13:08 ` Jassi Brar
  2012-12-20  7:39   ` Stephen Boyd
  17 siblings, 1 reply; 59+ messages in thread
From: Jassi Brar @ 2012-11-26 13:08 UTC (permalink / raw)
  To: pjt
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Venki Pallipadi

Hi Paul,

On Thu, Aug 23, 2012 at 7:44 PM,  <pjt@google.com> wrote:
> Hi all,
>
> Please find attached the latest version for CFS load-tracking.
>
> It implements load-tracking on a per-sched_entity (currently SCHED_NORMAL, but
> could be extended to RT as well) basis. This results in a bottom-up
> load-computation in which entities contribute to their parents' load, as
> opposed to the current top-down where the parent averages its children.  In
> particular this allows us to correctly migrate load with their accompanying
> entities and provides the necessary inputs for intelligent load-balancing and
> power-management.
>
> We've been running this internally for some time now and modulo any gremlins
> from rebasing it, I think things have been shaken out and we're touching
> mergeable state.
>
> Special thanks to Namhyung Kim and Peter Zijlstra for comments on the last
> round series.
>
> For more background and prior discussion please review the previous posting:
> https://lkml.org/lkml/2012/6/27/644
>
The patchset introduces 64-bit atomic ops, which would need
init_atomic64_lock() already called, but that is an initcall made too
late. Should we consider calling init_atomic64_lock() sooner in
start_kernel()?
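
(For context: with the generic CONFIG_GENERIC_ATOMIC64 fallback, which the
trace below shows is in use here, lib/atomic64.c protects atomic64_*() with
a small array of raw spinlocks that is only initialized from an initcall --
roughly along these lines, paraphrased rather than quoted verbatim, and
NR_LOCKS etc. may differ upstream:

	#define NR_LOCKS	16

	static union {
		raw_spinlock_t lock;
		char pad[L1_CACHE_BYTES];
	} atomic64_lock[NR_LOCKS] __cacheline_aligned_in_smp;

	static int __init init_atomic64_lock(void)
	{
		int i;

		for (i = 0; i < NR_LOCKS; ++i)
			raw_spin_lock_init(&atomic64_lock[i].lock);
		return 0;
	}
	pure_initcall(init_atomic64_lock);

so an atomic64_read() issued from scheduler_tick() before that initcall has
run takes a lock whose magic is still zero, which is exactly what
CONFIG_DEBUG_SPINLOCK reports below.)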

As an example of breakage, I see the following dump with
CONFIG_DEBUG_SPINLOCK on an OMAP based Pandaboard.
   ........
    Calibrating delay loop...
    BUG: spinlock bad magic on CPU#0, swapper/0/0
     lock: atomic64_lock+0x0/0x400, .magic: 00000000, .owner:
<none>/-1, .owner_cpu: 0
    [<c001f734>] (unwind_backtrace+0x0/0x13c) from [<c06a8f4c>]
(dump_stack+0x20/0x24)
    [<c06a8f4c>] (dump_stack+0x20/0x24) from [<c0359118>] (spin_dump+0x84/0x98)
    [<c0359118>] (spin_dump+0x84/0x98) from [<c0359158>] (spin_bug+0x2c/0x30)
    [<c0359158>] (spin_bug+0x2c/0x30) from [<c03593c8>]
(do_raw_spin_lock+0x1c0/0x200)
    [<c03593c8>] (do_raw_spin_lock+0x1c0/0x200) from [<c06acfbc>]
(_raw_spin_lock_irqsave+0x64/0x70)
    [<c06acfbc>] (_raw_spin_lock_irqsave+0x64/0x70) from [<c03675f4>]
(atomic64_read+0x30/0x54)
    [<c03675f4>] (atomic64_read+0x30/0x54) from [<c008f9dc>]
(update_cfs_rq_blocked_load+0x64/0x140)
    [<c008f9dc>] (update_cfs_rq_blocked_load+0x64/0x140) from
[<c0093c88>] (task_tick_fair+0x2a4/0x798)
    [<c0093c88>] (task_tick_fair+0x2a4/0x798) from [<c008b7e8>]
(scheduler_tick+0xd4/0x10c)
    [<c008b7e8>] (scheduler_tick+0xd4/0x10c) from [<c00644d4>]
(update_process_times+0x6c/0x7c)
    [<c00644d4>] (update_process_times+0x6c/0x7c) from [<c00a67a0>]
(tick_periodic+0x58/0xd4)
    [<c00a67a0>] (tick_periodic+0x58/0xd4) from [<c00a68dc>]
(tick_handle_periodic+0x28/0x9c)
    [<c00a68dc>] (tick_handle_periodic+0x28/0x9c) from [<c002e780>]
(omap2_gp_timer_interrupt+0x34/0x44)
    [<c002e780>] (omap2_gp_timer_interrupt+0x34/0x44) from
[<c00d004c>] (handle_irq_event_percpu+0x78/0x29c)
    [<c00d004c>] (handle_irq_event_percpu+0x78/0x29c) from
[<c00d02bc>] (handle_irq_event+0x4c/0x6c)
    [<c00d02bc>] (handle_irq_event+0x4c/0x6c) from [<c00d3490>]
(handle_fasteoi_irq+0xcc/0x1a4)
    [<c00d3490>] (handle_fasteoi_irq+0xcc/0x1a4) from [<c00cf90c>]
(generic_handle_irq+0x30/0x40)
    [<c00cf90c>] (generic_handle_irq+0x30/0x40) from [<c0016480>]
(handle_IRQ+0x5c/0xbc)
    [<c0016480>] (handle_IRQ+0x5c/0xbc) from [<c0008538>]
(gic_handle_irq+0x38/0x6c)
    [<c0008538>] (gic_handle_irq+0x38/0x6c) from [<c06add24>]
(__irq_svc+0x44/0x7c)
    Exception stack(0xc0a0ff00 to 0xc0a0ff48)
    ff00: 0000001a ffff6a00 c0a100c0 ffff6a00 00000000 00000000
c0ac0280 00000000
    ff20: c0ac030c 410fc0f0 c0a100c0 c0a0ffb4 c0a0fe80 c0a0ff48
c0a100c0 c06a5110
    ff40: 60000053 ffffffff
    [<c06add24>] (__irq_svc+0x44/0x7c) from [<c06a5110>]
(calibrate_delay+0x37c/0x528)
    [<c06a5110>] (calibrate_delay+0x37c/0x528) from [<c09977f0>]
(start_kernel+0x270/0x310)
    [<c09977f0>] (start_kernel+0x270/0x310) from [<80008078>] (0x80008078)
    1590.23 BogoMIPS (lpj=6213632)
    .....


Thanks.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 00/16] sched: per-entity load-tracking
  2012-11-26 13:08 ` Jassi Brar
@ 2012-12-20  7:39   ` Stephen Boyd
  2012-12-20  8:08     ` Jassi Brar
  0 siblings, 1 reply; 59+ messages in thread
From: Stephen Boyd @ 2012-12-20  7:39 UTC (permalink / raw)
  To: Jassi Brar
  Cc: pjt, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Venki Pallipadi

On 11/26/2012 5:08 AM, Jassi Brar wrote:
> The patchset introduces 64-bit atomic ops, which would need
> init_atomic64_lock() already called, but that is an initcall made too
> late. Should we consider calling init_atomic64_lock() sooner in
> start_kernel()?
>
> As an example of breakage, I see the following dump with
> CONFIG_DEBUG_SPINLOCK on an OMAP based Pandaboard.

I saw this post while searching lkml for similar problems. Perhaps you
can try applying my patch to see if this BUG message goes away. Thanks.

https://lkml.org/lkml/2012/12/19/302
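
(The idea there is to drop the initcall and initialize the lock array
statically, roughly along these lines -- paraphrased, not the exact hunk:

	static union {
		raw_spinlock_t lock;
		char pad[L1_CACHE_BYTES];
	} atomic64_lock[NR_LOCKS] __cacheline_aligned_in_smp = {
		[0 ... (NR_LOCKS - 1)] = {
			.lock = __RAW_SPIN_LOCK_UNLOCKED(atomic64_lock.lock),
		},
	};

so the locks are usable from the very first atomic64_*() call, no matter how
early it happens.)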

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch 00/16] sched: per-entity load-tracking
  2012-12-20  7:39   ` Stephen Boyd
@ 2012-12-20  8:08     ` Jassi Brar
  0 siblings, 0 replies; 59+ messages in thread
From: Jassi Brar @ 2012-12-20  8:08 UTC (permalink / raw)
  To: Stephen Boyd
  Cc: pjt, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Venki Pallipadi

On Thu, Dec 20, 2012 at 1:09 PM, Stephen Boyd <sboyd@codeaurora.org> wrote:
> On 11/26/2012 5:08 AM, Jassi Brar wrote:
>> The patchset introduces 64-bit atomic ops, which would need
>> init_atomic64_lock() already called, but that is an initcall made too
>> late. Should we consider calling init_atomic64_lock() sooner in
>> start_kernel()?
>>
>> As an example of breakage, I see the following dump with
>> CONFIG_DEBUG_SPINLOCK on an OMAP based Pandaboard.
>
> I saw this post while searching lkml for similar problems. Perhaps you
> can try applying my patch to see if this BUG message goes away. Thanks.
>
> https://lkml.org/lkml/2012/12/19/302
>
Yes, this should do too.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 01/16] sched: track the runnable average on a per-task entitiy basis
  2012-06-28  6:06   ` Namhyung Kim
@ 2012-07-12  0:14     ` Paul Turner
  0 siblings, 0 replies; 59+ messages in thread
From: Paul Turner @ 2012-07-12  0:14 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Vincent Guittot, Peter Zijlstra, Nikunj A Dadhania,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Paul E. McKenney, Morten Rasmussen, Vaidyanathan Srinivasan

On Wed, Jun 27, 2012 at 11:06 PM, Namhyung Kim <namhyung@kernel.org> wrote:
> Hi,
>
> Some nitpicks and questions.
>
>
> On Wed, 27 Jun 2012 19:24:14 -0700, Paul Turner wrote:
>> Instead of tracking averaging the load parented by a cfs_rq, we can track
>> entity load directly.  With the load for a given cfs_Rq then being the sum of
>
> s/cfs_Rq/cfs_rq/
>
>
>> its children.
>>
>> To do this we represent the historical contribution to runnable average within each
>> trailing 1024us of execution as the coefficients of a geometric series.
>>
>> We can express this for a given task t as:
>>   runnable_sum(t) = \Sum u_i * y^i ,
>>   load(t) = weight_t * runnable_sum(t) / (\Sum 1024 * y^i)
>>
>
> This "\Sum 1024 *y^i" is the runnable(_avg)_period, right?

Yes.

>
>
>> Where: u_i is the usage in the last i`th 1024us period (approximately 1ms) ~ms
>> and y is chosen such that y^k = 1/2.  We currently choose k to be 32 which
>> roughly translates to about a sched period.
>>
>> Signed-off-by: Paul Turner <pjt@google.com>
>> ---
>>  include/linux/sched.h |    8 +++
>>  kernel/sched/debug.c  |    4 ++
>>  kernel/sched/fair.c   |  128 +++++++++++++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 140 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 9dced2e..5bf5c79 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1136,6 +1136,11 @@ struct load_weight {
>>       unsigned long weight, inv_weight;
>>  };
>>
>> +struct sched_avg {
>> +     u32 runnable_avg_sum, runnable_avg_period;
>> +     u64 last_runnable_update;
>> +};
>> +
>>  #ifdef CONFIG_SCHEDSTATS
>>  struct sched_statistics {
>>       u64                     wait_start;
>> @@ -1196,6 +1201,9 @@ struct sched_entity {
>>       /* rq "owned" by this entity/group: */
>>       struct cfs_rq           *my_q;
>>  #endif
>> +#ifdef CONFIG_SMP
>> +     struct sched_avg        avg;
>> +#endif
>>  };
>>
>>  struct sched_rt_entity {
>> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>> index c09a4e7..cd5ef23 100644
>> --- a/kernel/sched/debug.c
>> +++ b/kernel/sched/debug.c
>> @@ -85,6 +85,10 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
>>       P(se->statistics.wait_count);
>>  #endif
>>       P(se->load.weight);
>> +#ifdef CONFIG_SMP
>> +     P(se->avg.runnable_avg_sum);
>> +     P(se->avg.runnable_avg_period);
>> +#endif
>>  #undef PN
>>  #undef P
>>  }
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 3704ad3..864a122 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -976,6 +976,125 @@ static inline void update_entity_shares_tick(struct cfs_rq *cfs_rq)
>>  }
>>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>>
>> +#ifdef CONFIG_SMP
>> +/*
>> + * Approximate:
>> + *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
>> + */
>> +static __always_inline u64 decay_load(u64 val, int n)
>> +{
>> +     for (; n && val; n--) {
>> +             val *= 4008;
>> +             val >>= 12;
>> +     }
>> +
>> +     return val;
>> +}
>> +
>> +/* We can represent the historical contribution to runnable average as the
>> + * coefficients of a geometric series.  To do this we sub-divide our runnable
>> + * history into segments of approximately 1ms (1024us); label the segment that
>> + * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g.
>> + *
>> + * [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
>> + *      p0            p1           p1
>                                       p2
>
>> + *     (now)       (~1ms ago)  (~2ms ago)
>> + *
>> + * Let u_i denote the fraction of p_i that the entity was runnable.
>> + *
>> + * We then designate the fractions u_i as our co-efficients, yielding the
>> + * following representation of historical load:
>> + *   u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
>> + *
>> + * We choose y based on the with of a reasonably scheduling period, fixing:
>                                width ?
>> + *   y^32 = 0.5
>> + *
>> + * This means that the contribution to load ~32ms ago (u_32) will be weighted
>> + * approximately half as much as the contribution to load within the last ms
>> + * (u_0).
>> + *
>> + * When a period "rolls over" and we have new u_0`, multiplying the previous
>> + * sum again by y is sufficient to update:
>> + *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
>> + *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1]
>                                                                     u_{i+1}]
>
>> + */
>> +static __always_inline int __update_entity_runnable_avg(u64 now,
>
> Is the return value used by elsewhere?
>

Yes, it's later used to trigger updates as we accumulate new usage
periods (1024us increments).
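
Roughly, paraphrasing how the later patches in the series consume it (names
may differ slightly from the final code; "now" is the rq's clock_task):

	if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq))
		return;	/* no new 1024us period completed, nothing to do */

	contrib_delta = __update_entity_load_avg_contrib(se);
	/* ... and contrib_delta is then applied to the owning cfs_rq */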

> Thanks,
> Namhyung
>
>
>> +                                                     struct sched_avg *sa,
>> +                                                     int runnable)
>> +{
>> +     u64 delta;
>> +     int delta_w, decayed = 0;
>> +
>> +     delta = now - sa->last_runnable_update;
>> +     /*
>> +      * This should only happen when time goes backwards, which it
>> +      * unfortunately does during sched clock init when we swap over to TSC.
>> +      */
>> +     if ((s64)delta < 0) {
>> +             sa->last_runnable_update = now;
>> +             return 0;
>> +     }
>> +
>> +     /*
>> +      * Use 1024ns as the unit of measurement since it's a reasonable
>> +      * approximation of 1us and fast to compute.
>> +      */
>> +     delta >>= 10;
>> +     if (!delta)
>> +             return 0;
>> +     sa->last_runnable_update = now;
>> +
>> +     /* delta_w is the amount already accumulated against our next period */
>> +     delta_w = sa->runnable_avg_period % 1024;
>> +     if (delta + delta_w >= 1024) {
>> +             /* period roll-over */
>> +             decayed = 1;
>> +
>> +             /*
>> +              * Now that we know we're crossing a period boundary, figure
>> +              * out how much from delta we need to complete the current
>> +              * period and accrue it.
>> +              */
>> +             delta_w = 1024 - delta_w;
>> +             BUG_ON(delta_w > delta);
>> +             do {
>> +                     if (runnable)
>> +                             sa->runnable_avg_sum += delta_w;
>> +                     sa->runnable_avg_period += delta_w;
>> +
>> +                     /*
>> +                      * Remainder of delta initiates a new period, roll over
>> +                      * the previous.
>> +                      */
>> +                     sa->runnable_avg_sum =
>> +                             decay_load(sa->runnable_avg_sum, 1);
>> +                     sa->runnable_avg_period =
>> +                             decay_load(sa->runnable_avg_period, 1);
>> +
>> +                     delta -= delta_w;
>> +                     /* New period is empty */
>> +                     delta_w = 1024;
>> +             } while (delta >= 1024);
>> +     }
>> +
>> +     /* Remainder of delta accrued against u_0` */
>> +     if (runnable)
>> +             sa->runnable_avg_sum += delta;
>> +     sa->runnable_avg_period += delta;
>> +
>> +     return decayed;
>> +}
>> +
>> +/* Update a sched_entity's runnable average */
>> +static inline void update_entity_load_avg(struct sched_entity *se)
>> +{
>> +     __update_entity_runnable_avg(rq_of(cfs_rq_of(se))->clock_task, &se->avg,
>> +                                  se->on_rq);
>> +}
>> +#else
>> +static inline void update_entity_load_avg(struct sched_entity *se) {}
>> +#endif
>> +
>>  static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>  {
>>  #ifdef CONFIG_SCHEDSTATS
>> @@ -1102,6 +1221,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>>        */
>>       update_curr(cfs_rq);
>>       update_cfs_load(cfs_rq, 0);
>> +     update_entity_load_avg(se);
>>       account_entity_enqueue(cfs_rq, se);
>>       update_cfs_shares(cfs_rq);
>>
>> @@ -1176,6 +1296,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>>        * Update run-time statistics of the 'current'.
>>        */
>>       update_curr(cfs_rq);
>> +     update_entity_load_avg(se);
>>
>>       update_stats_dequeue(cfs_rq, se);
>>       if (flags & DEQUEUE_SLEEP) {
>> @@ -1345,6 +1466,8 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
>>               update_stats_wait_start(cfs_rq, prev);
>>               /* Put 'current' back into the tree. */
>>               __enqueue_entity(cfs_rq, prev);
>> +             /* in !on_rq case, update occurred at dequeue */
>> +             update_entity_load_avg(prev);
>>       }
>>       cfs_rq->curr = NULL;
>>  }
>> @@ -1358,6 +1481,11 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>>       update_curr(cfs_rq);
>>
>>       /*
>> +      * Ensure that runnable average is periodically updated.
>> +      */
>> +     update_entity_load_avg(curr);
>> +
>> +     /*
>>        * Update share accounting for long-running entities.
>>        */
>>       update_entity_shares_tick(cfs_rq);

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 01/16] sched: track the runnable average on a per-task entitiy basis
  2012-07-04 15:32   ` Peter Zijlstra
@ 2012-07-12  0:12     ` Paul Turner
  0 siblings, 0 replies; 59+ messages in thread
From: Paul Turner @ 2012-07-12  0:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Vincent Guittot, Nikunj A Dadhania, Mike Galbraith,
	Kamalesh Babulal, Ben Segall, Ingo Molnar, Paul E. McKenney,
	Morten Rasmussen, Vaidyanathan Srinivasan

On Wed, Jul 4, 2012 at 8:32 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Wed, 2012-06-27 at 19:24 -0700, Paul Turner wrote:
>> Instead of tracking averaging the load parented by a cfs_rq, we can track
>> entity load directly.  With the load for a given cfs_Rq then being the sum of
>> its children.
>>
>> To do this we represent the historical contribution to runnable average within each
>> trailing 1024us of execution as the coefficients of a geometric series.
>>
>> We can express this for a given task t as:
>>   runnable_sum(t) = \Sum u_i * y^i ,
>>   load(t) = weight_t * runnable_sum(t) / (\Sum 1024 * y^i)
>>
>> Where: u_i is the usage in the last i`th 1024us period (approximately 1ms) ~ms
>> and y is chosen such that y^k = 1/2.  We currently choose k to be 32 which
>> roughly translates to about a sched period.
>>
>> Signed-off-by: Paul Turner <pjt@google.com>
>> ---
>>  include/linux/sched.h |    8 +++
>>  kernel/sched/debug.c  |    4 ++
>>  kernel/sched/fair.c   |  128 +++++++++++++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 140 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 9dced2e..5bf5c79 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1136,6 +1136,11 @@ struct load_weight {
>>         unsigned long weight, inv_weight;
>>  };
>>
>> +struct sched_avg {
>> +       u32 runnable_avg_sum, runnable_avg_period;
>> +       u64 last_runnable_update;
>> +};
>
>
> So we can use u32 because:
>
>              n         1
> lim n->inf \Sum y^i = --- = ~ 46.66804636511427012122 ; y^32 = 0.5
>              i=0      1-y
>
> So the values should never be larger than ~47k, right?

Yes -- this is made explicit later in the series.
>
> /me goes add something like that in a comment.
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 01/16] sched: track the runnable average on a per-task entitiy basis
  2012-06-28  2:24 ` [PATCH 01/16] sched: track the runnable average on a per-task entitiy basis Paul Turner
  2012-06-28  6:06   ` Namhyung Kim
@ 2012-07-04 15:32   ` Peter Zijlstra
  2012-07-12  0:12     ` Paul Turner
  1 sibling, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2012-07-04 15:32 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Vincent Guittot, Nikunj A Dadhania, Mike Galbraith,
	Kamalesh Babulal, Ben Segall, Ingo Molnar, Paul E. McKenney,
	Morten Rasmussen, Vaidyanathan Srinivasan

On Wed, 2012-06-27 at 19:24 -0700, Paul Turner wrote:
> Instead of tracking averaging the load parented by a cfs_rq, we can track
> entity load directly.  With the load for a given cfs_Rq then being the sum of
> its children.
> 
> To do this we represent the historical contribution to runnable average within each
> trailing 1024us of execution as the coefficients of a geometric series.
> 
> We can express this for a given task t as:
>   runnable_sum(t) = \Sum u_i * y^i ,
>   load(t) = weight_t * runnable_sum(t) / (\Sum 1024 * y^i)
> 
> Where: u_i is the usage in the last i`th 1024us period (approximately 1ms) ~ms
> and y is chosen such that y^k = 1/2.  We currently choose k to be 32 which
> roughly translates to about a sched period.
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> ---
>  include/linux/sched.h |    8 +++
>  kernel/sched/debug.c  |    4 ++
>  kernel/sched/fair.c   |  128 +++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 140 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 9dced2e..5bf5c79 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1136,6 +1136,11 @@ struct load_weight {
>         unsigned long weight, inv_weight;
>  };
>  
> +struct sched_avg {
> +       u32 runnable_avg_sum, runnable_avg_period;
> +       u64 last_runnable_update;
> +}; 


So we can use u32 because:

             n         1 
lim n->inf \Sum y^i = --- = ~ 46.66804636511427012122 ; y^32 = 0.5
             i=0      1-y

So the values should never be larger than ~47k, right?

/me goes add something like that in a comment.
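
For the record, scaling that by the 1024us accounting unit gives the bound
on the u32 accumulators; a quick stand-alone check (plain userspace C, just
for illustration):

	#include <stdio.h>
	#include <math.h>

	int main(void)
	{
		double y = pow(0.5, 1.0 / 32.0);	/* y^32 = 0.5 */
		double sum = 0;
		int i;

		/* \Sum 1024 * y^i, effectively to infinity */
		for (i = 0; i < 10000; i++)
			sum += 1024.0 * pow(y, i);

		printf("max runnable_avg_period ~= %.1f\n", sum);
		return 0;
	}

which prints ~47788, i.e. runnable_avg_sum/runnable_avg_period stay well
within u32.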


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 01/16] sched: track the runnable average on a per-task entitiy basis
  2012-06-28  2:24 ` [PATCH 01/16] sched: track the runnable average on a per-task entitiy basis Paul Turner
@ 2012-06-28  6:06   ` Namhyung Kim
  2012-07-12  0:14     ` Paul Turner
  2012-07-04 15:32   ` Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Namhyung Kim @ 2012-06-28  6:06 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Venki Pallipadi, Srivatsa Vaddagiri,
	Vincent Guittot, Peter Zijlstra, Nikunj A Dadhania,
	Mike Galbraith, Kamalesh Babulal, Ben Segall, Ingo Molnar,
	Paul E. McKenney, Morten Rasmussen, Vaidyanathan Srinivasan

Hi,

Some nitpicks and questions.


On Wed, 27 Jun 2012 19:24:14 -0700, Paul Turner wrote:
> Instead of tracking averaging the load parented by a cfs_rq, we can track
> entity load directly.  With the load for a given cfs_Rq then being the sum of

s/cfs_Rq/cfs_rq/


> its children.
>
> To do this we represent the historical contribution to runnable average within each
> trailing 1024us of execution as the coefficients of a geometric series.
>
> We can express this for a given task t as:
>   runnable_sum(t) = \Sum u_i * y^i ,
>   load(t) = weight_t * runnable_sum(t) / (\Sum 1024 * y^i)
>

This "\Sum 1024 *y^i" is the runnable(_avg)_period, right?


> Where: u_i is the usage in the last i`th 1024us period (approximately 1ms) ~ms
> and y is chosen such that y^k = 1/2.  We currently choose k to be 32 which
> roughly translates to about a sched period.
>
> Signed-off-by: Paul Turner <pjt@google.com>
> ---
>  include/linux/sched.h |    8 +++
>  kernel/sched/debug.c  |    4 ++
>  kernel/sched/fair.c   |  128 +++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 140 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 9dced2e..5bf5c79 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1136,6 +1136,11 @@ struct load_weight {
>  	unsigned long weight, inv_weight;
>  };
>  
> +struct sched_avg {
> +	u32 runnable_avg_sum, runnable_avg_period;
> +	u64 last_runnable_update;
> +};
> +
>  #ifdef CONFIG_SCHEDSTATS
>  struct sched_statistics {
>  	u64			wait_start;
> @@ -1196,6 +1201,9 @@ struct sched_entity {
>  	/* rq "owned" by this entity/group: */
>  	struct cfs_rq		*my_q;
>  #endif
> +#ifdef CONFIG_SMP
> +	struct sched_avg	avg;
> +#endif
>  };
>  
>  struct sched_rt_entity {
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index c09a4e7..cd5ef23 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -85,6 +85,10 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
>  	P(se->statistics.wait_count);
>  #endif
>  	P(se->load.weight);
> +#ifdef CONFIG_SMP
> +	P(se->avg.runnable_avg_sum);
> +	P(se->avg.runnable_avg_period);
> +#endif
>  #undef PN
>  #undef P
>  }
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3704ad3..864a122 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -976,6 +976,125 @@ static inline void update_entity_shares_tick(struct cfs_rq *cfs_rq)
>  }
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>  
> +#ifdef CONFIG_SMP
> +/*
> + * Approximate:
> + *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
> + */
> +static __always_inline u64 decay_load(u64 val, int n)
> +{
> +	for (; n && val; n--) {
> +		val *= 4008;
> +		val >>= 12;
> +	}
> +
> +	return val;
> +}
> +
> +/* We can represent the historical contribution to runnable average as the
> + * coefficients of a geometric series.  To do this we sub-divide our runnable
> + * history into segments of approximately 1ms (1024us); label the segment that
> + * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g.
> + *
> + * [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
> + *      p0            p1           p1
                                      p2

> + *     (now)       (~1ms ago)  (~2ms ago)
> + *
> + * Let u_i denote the fraction of p_i that the entity was runnable.
> + *
> + * We then designate the fractions u_i as our co-efficients, yielding the
> + * following representation of historical load:
> + *   u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
> + *
> + * We choose y based on the with of a reasonably scheduling period, fixing:
                               width ?
> + *   y^32 = 0.5
> + *
> + * This means that the contribution to load ~32ms ago (u_32) will be weighted
> + * approximately half as much as the contribution to load within the last ms
> + * (u_0).
> + *
> + * When a period "rolls over" and we have new u_0`, multiplying the previous
> + * sum again by y is sufficient to update:
> + *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
> + *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1]
                                                                    u_{i+1}]

> + */
> +static __always_inline int __update_entity_runnable_avg(u64 now,

Is the return value used by elsewhere?

Thanks,
Namhyung


> +							struct sched_avg *sa,
> +							int runnable)
> +{
> +	u64 delta;
> +	int delta_w, decayed = 0;
> +
> +	delta = now - sa->last_runnable_update;
> +	/*
> +	 * This should only happen when time goes backwards, which it
> +	 * unfortunately does during sched clock init when we swap over to TSC.
> +	 */
> +	if ((s64)delta < 0) {
> +		sa->last_runnable_update = now;
> +		return 0;
> +	}
> +
> +	/*
> +	 * Use 1024ns as the unit of measurement since it's a reasonable
> +	 * approximation of 1us and fast to compute.
> +	 */
> +	delta >>= 10;
> +	if (!delta)
> +		return 0;
> +	sa->last_runnable_update = now;
> +
> +	/* delta_w is the amount already accumulated against our next period */
> +	delta_w = sa->runnable_avg_period % 1024;
> +	if (delta + delta_w >= 1024) {
> +		/* period roll-over */
> +		decayed = 1;
> +
> +		/*
> +		 * Now that we know we're crossing a period boundary, figure
> +		 * out how much from delta we need to complete the current
> +		 * period and accrue it.
> +		 */
> +		delta_w = 1024 - delta_w;
> +		BUG_ON(delta_w > delta);
> +		do {
> +			if (runnable)
> +				sa->runnable_avg_sum += delta_w;
> +			sa->runnable_avg_period += delta_w;
> +
> +			/*
> +			 * Remainder of delta initiates a new period, roll over
> +			 * the previous.
> +			 */
> +			sa->runnable_avg_sum =
> +				decay_load(sa->runnable_avg_sum, 1);
> +			sa->runnable_avg_period =
> +				decay_load(sa->runnable_avg_period, 1);
> +
> +			delta -= delta_w;
> +			/* New period is empty */
> +			delta_w = 1024;
> +		} while (delta >= 1024);
> +	}
> +
> +	/* Remainder of delta accrued against u_0` */
> +	if (runnable)
> +		sa->runnable_avg_sum += delta;
> +	sa->runnable_avg_period += delta;
> +
> +	return decayed;
> +}
> +
> +/* Update a sched_entity's runnable average */
> +static inline void update_entity_load_avg(struct sched_entity *se)
> +{
> +	__update_entity_runnable_avg(rq_of(cfs_rq_of(se))->clock_task, &se->avg,
> +				     se->on_rq);
> +}
> +#else
> +static inline void update_entity_load_avg(struct sched_entity *se) {}
> +#endif
> +
>  static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>  #ifdef CONFIG_SCHEDSTATS
> @@ -1102,6 +1221,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	 */
>  	update_curr(cfs_rq);
>  	update_cfs_load(cfs_rq, 0);
> +	update_entity_load_avg(se);
>  	account_entity_enqueue(cfs_rq, se);
>  	update_cfs_shares(cfs_rq);
>  
> @@ -1176,6 +1296,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	 * Update run-time statistics of the 'current'.
>  	 */
>  	update_curr(cfs_rq);
> +	update_entity_load_avg(se);
>  
>  	update_stats_dequeue(cfs_rq, se);
>  	if (flags & DEQUEUE_SLEEP) {
> @@ -1345,6 +1466,8 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
>  		update_stats_wait_start(cfs_rq, prev);
>  		/* Put 'current' back into the tree. */
>  		__enqueue_entity(cfs_rq, prev);
> +		/* in !on_rq case, update occurred at dequeue */
> +		update_entity_load_avg(prev);
>  	}
>  	cfs_rq->curr = NULL;
>  }
> @@ -1358,6 +1481,11 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>  	update_curr(cfs_rq);
>  
>  	/*
> +	 * Ensure that runnable average is periodically updated.
> +	 */
> +	update_entity_load_avg(curr);
> +
> +	/*
>  	 * Update share accounting for long-running entities.
>  	 */
>  	update_entity_shares_tick(cfs_rq);

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH 01/16] sched: track the runnable average on a per-task entitiy basis
  2012-06-28  2:24 [PATCH 00/16] Series short description Paul Turner
@ 2012-06-28  2:24 ` Paul Turner
  2012-06-28  6:06   ` Namhyung Kim
  2012-07-04 15:32   ` Peter Zijlstra
  0 siblings, 2 replies; 59+ messages in thread
From: Paul Turner @ 2012-06-28  2:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Venki Pallipadi, Srivatsa Vaddagiri, Vincent Guittot,
	Peter Zijlstra, Nikunj A Dadhania, Mike Galbraith,
	Kamalesh Babulal, Ben Segall, Ingo Molnar, Paul E. McKenney,
	Morten Rasmussen, Vaidyanathan Srinivasan

Instead of tracking averaging the load parented by a cfs_rq, we can track
entity load directly.  With the load for a given cfs_Rq then being the sum of
its children.

To do this we represent the historical contribution to runnable average within each
trailing 1024us of execution as the coefficients of a geometric series.

We can express this for a given task t as:
  runnable_sum(t) = \Sum u_i * y^i ,
  load(t) = weight_t * runnable_sum(t) / (\Sum 1024 * y^i)

Where: u_i is the usage in the last i`th 1024us period (approximately 1ms) ~ms
and y is chosen such that y^k = 1/2.  We currently choose k to be 32 which
roughly translates to about a sched period.

Signed-off-by: Paul Turner <pjt@google.com>
---
 include/linux/sched.h |    8 +++
 kernel/sched/debug.c  |    4 ++
 kernel/sched/fair.c   |  128 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 140 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9dced2e..5bf5c79 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1136,6 +1136,11 @@ struct load_weight {
 	unsigned long weight, inv_weight;
 };
 
+struct sched_avg {
+	u32 runnable_avg_sum, runnable_avg_period;
+	u64 last_runnable_update;
+};
+
 #ifdef CONFIG_SCHEDSTATS
 struct sched_statistics {
 	u64			wait_start;
@@ -1196,6 +1201,9 @@ struct sched_entity {
 	/* rq "owned" by this entity/group: */
 	struct cfs_rq		*my_q;
 #endif
+#ifdef CONFIG_SMP
+	struct sched_avg	avg;
+#endif
 };
 
 struct sched_rt_entity {
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index c09a4e7..cd5ef23 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -85,6 +85,10 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	P(se->statistics.wait_count);
 #endif
 	P(se->load.weight);
+#ifdef CONFIG_SMP
+	P(se->avg.runnable_avg_sum);
+	P(se->avg.runnable_avg_period);
+#endif
 #undef PN
 #undef P
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3704ad3..864a122 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -976,6 +976,125 @@ static inline void update_entity_shares_tick(struct cfs_rq *cfs_rq)
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
+#ifdef CONFIG_SMP
+/*
+ * Approximate:
+ *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
+ */
+static __always_inline u64 decay_load(u64 val, int n)
+{
+	for (; n && val; n--) {
+		val *= 4008;
+		val >>= 12;
+	}
+
+	return val;
+}
+
+/* We can represent the historical contribution to runnable average as the
+ * coefficients of a geometric series.  To do this we sub-divide our runnable
+ * history into segments of approximately 1ms (1024us); label the segment that
+ * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g.
+ *
+ * [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
+ *      p0            p1           p1
+ *     (now)       (~1ms ago)  (~2ms ago)
+ *
+ * Let u_i denote the fraction of p_i that the entity was runnable.
+ *
+ * We then designate the fractions u_i as our co-efficients, yielding the
+ * following representation of historical load:
+ *   u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
+ *
+ * We choose y based on the with of a reasonably scheduling period, fixing:
+ *   y^32 = 0.5
+ *
+ * This means that the contribution to load ~32ms ago (u_32) will be weighted
+ * approximately half as much as the contribution to load within the last ms
+ * (u_0).
+ *
+ * When a period "rolls over" and we have new u_0`, multiplying the previous
+ * sum again by y is sufficient to update:
+ *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
+ *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1]
+ */
+static __always_inline int __update_entity_runnable_avg(u64 now,
+							struct sched_avg *sa,
+							int runnable)
+{
+	u64 delta;
+	int delta_w, decayed = 0;
+
+	delta = now - sa->last_runnable_update;
+	/*
+	 * This should only happen when time goes backwards, which it
+	 * unfortunately does during sched clock init when we swap over to TSC.
+	 */
+	if ((s64)delta < 0) {
+		sa->last_runnable_update = now;
+		return 0;
+	}
+
+	/*
+	 * Use 1024ns as the unit of measurement since it's a reasonable
+	 * approximation of 1us and fast to compute.
+	 */
+	delta >>= 10;
+	if (!delta)
+		return 0;
+	sa->last_runnable_update = now;
+
+	/* delta_w is the amount already accumulated against our next period */
+	delta_w = sa->runnable_avg_period % 1024;
+	if (delta + delta_w >= 1024) {
+		/* period roll-over */
+		decayed = 1;
+
+		/*
+		 * Now that we know we're crossing a period boundary, figure
+		 * out how much from delta we need to complete the current
+		 * period and accrue it.
+		 */
+		delta_w = 1024 - delta_w;
+		BUG_ON(delta_w > delta);
+		do {
+			if (runnable)
+				sa->runnable_avg_sum += delta_w;
+			sa->runnable_avg_period += delta_w;
+
+			/*
+			 * Remainder of delta initiates a new period, roll over
+			 * the previous.
+			 */
+			sa->runnable_avg_sum =
+				decay_load(sa->runnable_avg_sum, 1);
+			sa->runnable_avg_period =
+				decay_load(sa->runnable_avg_period, 1);
+
+			delta -= delta_w;
+			/* New period is empty */
+			delta_w = 1024;
+		} while (delta >= 1024);
+	}
+
+	/* Remainder of delta accrued against u_0` */
+	if (runnable)
+		sa->runnable_avg_sum += delta;
+	sa->runnable_avg_period += delta;
+
+	return decayed;
+}
+
+/* Update a sched_entity's runnable average */
+static inline void update_entity_load_avg(struct sched_entity *se)
+{
+	__update_entity_runnable_avg(rq_of(cfs_rq_of(se))->clock_task, &se->avg,
+				     se->on_rq);
+}
+#else
+static inline void update_entity_load_avg(struct sched_entity *se) {}
+#endif
+
 static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 #ifdef CONFIG_SCHEDSTATS
@@ -1102,6 +1221,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 */
 	update_curr(cfs_rq);
 	update_cfs_load(cfs_rq, 0);
+	update_entity_load_avg(se);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
 
@@ -1176,6 +1296,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
+	update_entity_load_avg(se);
 
 	update_stats_dequeue(cfs_rq, se);
 	if (flags & DEQUEUE_SLEEP) {
@@ -1345,6 +1466,8 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 		update_stats_wait_start(cfs_rq, prev);
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);
+		/* in !on_rq case, update occurred at dequeue */
+		update_entity_load_avg(prev);
 	}
 	cfs_rq->curr = NULL;
 }
@@ -1358,6 +1481,11 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	update_curr(cfs_rq);
 
 	/*
+	 * Ensure that runnable average is periodically updated.
+	 */
+	update_entity_load_avg(curr);
+
+	/*
 	 * Update share accounting for long-running entities.
 	 */
 	update_entity_shares_tick(cfs_rq);
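
As a quick stand-alone illustration of the fixed-point decay above (plain
userspace C, not part of the patch): 4008/4096 ~= 0.97852 ~= y, so applying
the step 32 times roughly halves a value, with integer truncation pulling
the result slightly below the ideal 512.

	#include <stdio.h>
	#include <stdint.h>

	static uint64_t decay_load(uint64_t val, int n)
	{
		for (; n && val; n--) {
			val *= 4008;
			val >>= 12;
		}
		return val;
	}

	int main(void)
	{
		printf("decay_load(1024, 32) = %llu\n",
		       (unsigned long long)decay_load(1024, 32));
		return 0;
	}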



^ permalink raw reply related	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2012-12-20  8:30 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-08-23 14:14 [patch 00/16] sched: per-entity load-tracking pjt
2012-08-23 14:14 ` [patch 01/16] sched: track the runnable average on a per-task entitiy basis pjt
2012-08-24  8:20   ` Namhyung Kim
2012-08-28 22:12     ` Paul Turner
2012-10-24  9:43   ` [tip:sched/core] sched: Track the runnable average on a per-task entity basis tip-bot for Paul Turner
2012-10-25  3:28     ` li guang
2012-10-25 16:58       ` Benjamin Segall
2012-08-23 14:14 ` [patch 02/16] sched: maintain per-rq runnable averages pjt
2012-10-24  9:44   ` [tip:sched/core] sched: Maintain " tip-bot for Ben Segall
2012-10-28 10:12   ` [patch 02/16] sched: maintain " Preeti Murthy
2012-10-29 17:38     ` Benjamin Segall
2012-11-07  8:28       ` Preeti U Murthy
2012-08-23 14:14 ` [patch 03/16] sched: aggregate load contributed by task entities on parenting cfs_rq pjt
2012-10-24  9:45   ` [tip:sched/core] sched: Aggregate " tip-bot for Paul Turner
2012-08-23 14:14 ` [patch 04/16] sched: maintain the load contribution of blocked entities pjt
2012-10-24  9:46   ` [tip:sched/core] sched: Maintain " tip-bot for Paul Turner
2012-08-23 14:14 ` [patch 05/16] sched: add an rq migration call-back to sched_class pjt
2012-10-24  9:47   ` [tip:sched/core] sched: Add " tip-bot for Paul Turner
2012-08-23 14:14 ` [patch 06/16] sched: account for blocked load waking back up pjt
     [not found]   ` <CAM4v1pO8SPCmqJTTBHpqwrwuO7noPdskg0RSooxyPsWoE395_A@mail.gmail.com>
2012-09-04 17:29     ` Benjamin Segall
2012-10-24  9:48   ` [tip:sched/core] sched: Account " tip-bot for Paul Turner
2012-08-23 14:14 ` [patch 07/16] sched: aggregate total task_group load pjt
2012-10-24  9:49   ` [tip:sched/core] sched: Aggregate " tip-bot for Paul Turner
2012-08-23 14:14 ` [patch 08/16] sched: compute load contribution by a group entity pjt
2012-10-24  9:50   ` [tip:sched/core] sched: Compute " tip-bot for Paul Turner
2012-08-23 14:14 ` [patch 09/16] sched: normalize tg load contributions against runnable time pjt
2012-10-24  9:51   ` [tip:sched/core] sched: Normalize " tip-bot for Paul Turner
2012-08-23 14:14 ` [patch 10/16] sched: maintain runnable averages across throttled periods pjt
2012-10-24  9:52   ` [tip:sched/core] sched: Maintain " tip-bot for Paul Turner
2012-08-23 14:14 ` [patch 11/16] sched: replace update_shares weight distribution with per-entity computation pjt
2012-09-24 19:44   ` "Jan H. Schönherr"
2012-09-24 20:39     ` Benjamin Segall
2012-10-02 21:14       ` Paul Turner
2012-10-24  9:53   ` [tip:sched/core] sched: Replace " tip-bot for Paul Turner
2012-08-23 14:14 ` [patch 12/16] sched: refactor update_shares_cpu() -> update_blocked_avgs() pjt
2012-10-24  9:54   ` [tip:sched/core] sched: Refactor " tip-bot for Paul Turner
2012-08-23 14:14 ` [patch 13/16] sched: update_cfs_shares at period edge pjt
2012-09-24 19:51   ` "Jan H. Schönherr"
2012-10-02 21:09     ` Paul Turner
2012-10-24  9:55   ` [tip:sched/core] sched: Update_cfs_shares " tip-bot for Paul Turner
2012-08-23 14:14 ` [patch 14/16] sched: make __update_entity_runnable_avg() fast pjt
2012-08-24  8:28   ` Namhyung Kim
2012-08-28 22:18     ` Paul Turner
2012-10-24  9:56   ` [tip:sched/core] sched: Make " tip-bot for Paul Turner
2012-08-23 14:14 ` [patch 15/16] sched: implement usage tracking pjt
2012-10-19 12:18   ` Vincent Guittot
2012-08-23 14:14 ` [patch 16/16] sched: introduce temporary FAIR_GROUP_SCHED dependency for load-tracking pjt
2012-10-24  9:57   ` [tip:sched/core] sched: Introduce " tip-bot for Paul Turner
2012-09-24  9:30 ` [patch 00/16] sched: per-entity load-tracking "Jan H. Schönherr"
2012-09-24 17:16   ` Benjamin Segall
2012-10-05  9:07     ` Paul Turner
2012-11-26 13:08 ` Jassi Brar
2012-12-20  7:39   ` Stephen Boyd
2012-12-20  8:08     ` Jassi Brar
  -- strict thread matches above, loose matches on Subject: below --
2012-06-28  2:24 [PATCH 00/16] Series short description Paul Turner
2012-06-28  2:24 ` [PATCH 01/16] sched: track the runnable average on a per-task entitiy basis Paul Turner
2012-06-28  6:06   ` Namhyung Kim
2012-07-12  0:14     ` Paul Turner
2012-07-04 15:32   ` Peter Zijlstra
2012-07-12  0:12     ` Paul Turner
