linux-kernel.vger.kernel.org archive mirror
* [tg_shares_up rewrite v3 00/11] reduce overhead for tg->shares distribution
@ 2010-11-12  3:24 Paul Turner
  2010-11-12  3:24 ` [tg_shares_up rewrite v3 01/11] sched: rewrite tg_shares_up Paul Turner
                   ` (10 more replies)
  0 siblings, 11 replies; 18+ messages in thread
From: Paul Turner @ 2010-11-12  3:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Paul Turner,
	Bharata B Rao, Karl Rister, Balbir Singh, David Miller

-- 

This is v3 of the tg_shares_up re-write previously posted at:
http://lkml.org/lkml/2010/10/16/17

It corrects compile errors in the !CONFIG_FAIR_GROUP_SCHED case present in v2,
posted yesterday (thanks to David Miller for catching this), as well as a race
in the destruction of a task_group.
http://lkml.org/lkml/2010/11/10/618

The key idea behind this patchset is Peter Zijlstra's: compute global loads as
a function of local averages, avoiding the cpu-cpu interactions which
previously plagued tg_shares_up.
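
Concretely, each cpu maintains a local time-weighted load average for its
cfs_rq and folds only the delta since its last fold into an atomic task-group
total; a cpu's slice of tg->shares then follows from its own weight against
that total.  A minimal userspace model of the arithmetic (the function and
constants below are illustrative stand-ins, not the kernel code):

/* shares_sketch.c -- model of the per-cpu share computation */
#include <stdio.h>

#define MIN_SHARES	2

static long compute_shares(long tg_shares, long local_weight, long total_weight)
{
	long shares = tg_shares * local_weight;

	if (total_weight)
		shares /= total_weight;

	/* clamp: never below MIN_SHARES, never above the group's shares */
	if (shares < MIN_SHARES)
		shares = MIN_SHARES;
	if (shares > tg_shares)
		shares = tg_shares;
	return shares;
}

int main(void)
{
	/* a group with shares=1024 whose local queue holds 1/4 of the group load */
	printf("%ld\n", compute_shares(1024, 512, 2048));	/* prints 256 */
	return 0;
}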

Not much 'material' has changed since v1, as results seem favourable/stable.

Changes:
v3
sched-rework-tg_shares_up.patch
- remove lint: {max,min}_sched_shares_ratelimit from sysctl.c 

sched-tg-ondemand.patch
- (bug fix) fix synchronization on leaf_cfs_rq_list removal during task_group reclaim.
- fix !CONFIG_FAIR_GROUP_SCHED build 
- fix comment on what task_group_lock protects

sched-tg-global_update_load.patch
- add missing arguments to update_cfs_load() in !CONFIG_FAIR_GROUP_SCHED case

v2 (http://lkml.org/lkml/2010/11/10/618)
sched-rework-tg_shares_up.patch
- (bug fix) use owning tg->shares value in effective_load
- remove redundant cfs_rq_of() in {enqueue,dequeue}_entity

sched-tg-add-shares_window_sysctl.patch
- update comment on sysctl interpretation

sched-tg-fix-update_cfs_load.patch
- update comment to reflect periods instead of foldings (more intuitive)

Some motivation again:
24-thread Intel box, 150 active cgroups, multiple threads/group, system load at ~90% (sample size=10s):
tip:
    2.64%  [k] tg_shares_up <!>
    0.15%  [k] __set_se_shares

patched:
    0.02%  [k] update_cfs_load
    0.01%  [k] update_cpu_load
    0.00%  [k] update_cfs_shares

Fairness coverage for the above at: http://rs5.risingnet.net/~pjt/patches/shares_data_v1.txt

Thanks,

- Paul


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [tg_shares_up rewrite v3 01/11] sched: rewrite tg_shares_up
  2010-11-12  3:24 [tg_shares_up rewrite v3 00/11] reduce overhead for tg->shares distribution Paul Turner
@ 2010-11-12  3:24 ` Paul Turner
  2010-11-12 11:10   ` Peter Zijlstra
  2010-11-12  3:24 ` [tg_shares_up rewrite v3 02/11] sched: on-demand (active) cfs_rq list Paul Turner
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 18+ messages in thread
From: Paul Turner @ 2010-11-12  3:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Paul Turner,
	Bharata B Rao, Karl Rister, Balbir Singh, David Miller

[-- Attachment #1: sched-rework-tg_shares_up.patch --]
[-- Type: text/plain, Size: 19479 bytes --]

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

By tracking a per-cpu load-avg for each cfs_rq and folding it into a
global task_group load on each tick we can rework tg_shares_up to be
strictly per-cpu.

This should improve cpu-cgroup performance for smp systems
significantly.
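
Schematically, the only cross-cpu operation left is an atomic add of how much
this cpu's average has changed since it last reported; everything else is
per-cpu state updated under the local rq->lock.  A userspace model of that
folding step (names and values are illustrative, not the kernel code):

/* fold_sketch.c -- fold a cpu's local load average into the group total */
#include <stdatomic.h>
#include <stdio.h>

struct cfs_rq_model {
	long load_avg;		/* time-weighted sum of queue weight */
	long load_period;	/* time accumulated into load_avg */
	long load_contribution;	/* what this cpu last added to the group */
};

static void fold_into_group(struct cfs_rq_model *cfs, atomic_long *tg_load_weight)
{
	long avg = cfs->load_avg / (cfs->load_period + 1);
	long delta = avg - cfs->load_contribution;

	atomic_fetch_add(tg_load_weight, delta);	/* the single shared write */
	cfs->load_contribution += delta;
}

int main(void)
{
	atomic_long tg_load_weight = 0;
	struct cfs_rq_model cfs = { 4096, 3, 0 };

	fold_into_group(&cfs, &tg_load_weight);
	printf("tg load_weight = %ld\n", atomic_load(&tg_load_weight));	/* 1024 */
	return 0;
}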

[ Paul: changed to use queueing cfs_rq ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Paul Turner <pjt@google.com>

---
 include/linux/sched.h   |    2 
 kernel/sched.c          |  173 ++++++++++++------------------------------------
 kernel/sched_debug.c    |   15 +++-
 kernel/sched_fair.c     |  164 +++++++++++++++++++++++++++++----------------
 kernel/sched_features.h |    2 
 kernel/sysctl.c         |   19 -----
 6 files changed, 162 insertions(+), 213 deletions(-)

Index: include/linux/sched.h
===================================================================
--- include/linux/sched.h.orig
+++ include/linux/sched.h
@@ -1868,8 +1868,6 @@ static inline void wake_up_idle_cpu(int 
 extern unsigned int sysctl_sched_latency;
 extern unsigned int sysctl_sched_min_granularity;
 extern unsigned int sysctl_sched_wakeup_granularity;
-extern unsigned int sysctl_sched_shares_ratelimit;
-extern unsigned int sysctl_sched_shares_thresh;
 extern unsigned int sysctl_sched_child_runs_first;
 
 enum sched_tunable_scaling {
Index: kernel/sched.c
===================================================================
--- kernel/sched.c.orig
+++ kernel/sched.c
@@ -253,6 +253,8 @@ struct task_group {
 	/* runqueue "owned" by this group on each cpu */
 	struct cfs_rq **cfs_rq;
 	unsigned long shares;
+
+	atomic_t load_weight;
 #endif
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -359,15 +361,11 @@ struct cfs_rq {
 	 */
 	unsigned long h_load;
 
-	/*
-	 * this cpu's part of tg->shares
-	 */
-	unsigned long shares;
+	u64 load_avg;
+	u64 load_period;
+	u64 load_stamp;
 
-	/*
-	 * load.weight at the time we set shares
-	 */
-	unsigned long rq_weight;
+	unsigned long load_contribution;
 #endif
 #endif
 };
@@ -790,20 +788,6 @@ late_initcall(sched_init_debug);
 const_debug unsigned int sysctl_sched_nr_migrate = 32;
 
 /*
- * ratelimit for updating the group shares.
- * default: 0.25ms
- */
-unsigned int sysctl_sched_shares_ratelimit = 250000;
-unsigned int normalized_sysctl_sched_shares_ratelimit = 250000;
-
-/*
- * Inject some fuzzyness into changing the per-cpu group shares
- * this avoids remote rq-locks at the expense of fairness.
- * default: 4
- */
-unsigned int sysctl_sched_shares_thresh = 4;
-
-/*
  * period over which we average the RT time consumption, measured
  * in ms.
  *
@@ -1352,6 +1336,12 @@ static inline void update_load_sub(struc
 	lw->inv_weight = 0;
 }
 
+static inline void update_load_set(struct load_weight *lw, unsigned long w)
+{
+	lw->weight = w;
+	lw->inv_weight = 0;
+}
+
 /*
  * To aid in avoiding the subversion of "niceness" due to uneven distribution
  * of tasks with abnormal "nice" values across CPUs the contribution that
@@ -1540,97 +1530,44 @@ static unsigned long cpu_avg_load_per_ta
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
-static __read_mostly unsigned long __percpu *update_shares_data;
-
-static void __set_se_shares(struct sched_entity *se, unsigned long shares);
-
-/*
- * Calculate and set the cpu's group shares.
- */
-static void update_group_shares_cpu(struct task_group *tg, int cpu,
-				    unsigned long sd_shares,
-				    unsigned long sd_rq_weight,
-				    unsigned long *usd_rq_weight)
-{
-	unsigned long shares, rq_weight;
-	int boost = 0;
-
-	rq_weight = usd_rq_weight[cpu];
-	if (!rq_weight) {
-		boost = 1;
-		rq_weight = NICE_0_LOAD;
-	}
-
-	/*
-	 *             \Sum_j shares_j * rq_weight_i
-	 * shares_i =  -----------------------------
-	 *                  \Sum_j rq_weight_j
-	 */
-	shares = (sd_shares * rq_weight) / sd_rq_weight;
-	shares = clamp_t(unsigned long, shares, MIN_SHARES, MAX_SHARES);
-
-	if (abs(shares - tg->se[cpu]->load.weight) >
-			sysctl_sched_shares_thresh) {
-		struct rq *rq = cpu_rq(cpu);
-		unsigned long flags;
-
-		raw_spin_lock_irqsave(&rq->lock, flags);
-		tg->cfs_rq[cpu]->rq_weight = boost ? 0 : rq_weight;
-		tg->cfs_rq[cpu]->shares = boost ? 0 : shares;
-		__set_se_shares(tg->se[cpu], shares);
-		raw_spin_unlock_irqrestore(&rq->lock, flags);
-	}
-}
+static void update_cfs_load(struct cfs_rq *cfs_rq);
+static void update_cfs_shares(struct cfs_rq *cfs_rq);
 
 /*
- * Re-compute the task group their per cpu shares over the given domain.
- * This needs to be done in a bottom-up fashion because the rq weight of a
- * parent group depends on the shares of its child groups.
+ * update tg->load_weight by folding this cpu's load_avg
  */
 static int tg_shares_up(struct task_group *tg, void *data)
 {
-	unsigned long weight, rq_weight = 0, sum_weight = 0, shares = 0;
-	unsigned long *usd_rq_weight;
-	struct sched_domain *sd = data;
+	long load_avg;
+	struct cfs_rq *cfs_rq;
 	unsigned long flags;
-	int i;
+	int cpu = (long)data;
+	struct rq *rq;
 
-	if (!tg->se[0])
+	if (!tg->se[cpu])
 		return 0;
 
-	local_irq_save(flags);
-	usd_rq_weight = per_cpu_ptr(update_shares_data, smp_processor_id());
-
-	for_each_cpu(i, sched_domain_span(sd)) {
-		weight = tg->cfs_rq[i]->load.weight;
-		usd_rq_weight[i] = weight;
-
-		rq_weight += weight;
-		/*
-		 * If there are currently no tasks on the cpu pretend there
-		 * is one of average load so that when a new task gets to
-		 * run here it will not get delayed by group starvation.
-		 */
-		if (!weight)
-			weight = NICE_0_LOAD;
+	rq = cpu_rq(cpu);
+	cfs_rq = tg->cfs_rq[cpu];
 
-		sum_weight += weight;
-		shares += tg->cfs_rq[i]->shares;
-	}
+	raw_spin_lock_irqsave(&rq->lock, flags);
 
-	if (!rq_weight)
-		rq_weight = sum_weight;
+	update_rq_clock(rq);
+	update_cfs_load(cfs_rq);
 
-	if ((!shares && rq_weight) || shares > tg->shares)
-		shares = tg->shares;
+	load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
+	load_avg -= cfs_rq->load_contribution;
 
-	if (!sd->parent || !(sd->parent->flags & SD_LOAD_BALANCE))
-		shares = tg->shares;
+	atomic_add(load_avg, &tg->load_weight);
+	cfs_rq->load_contribution += load_avg;
 
-	for_each_cpu(i, sched_domain_span(sd))
-		update_group_shares_cpu(tg, i, shares, rq_weight, usd_rq_weight);
+	/*
+	 * We need to update shares after updating tg->load_weight in
+	 * order to adjust the weight of groups with long running tasks.
+	 */
+	update_cfs_shares(cfs_rq);
 
-	local_irq_restore(flags);
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
 
 	return 0;
 }
@@ -1649,7 +1586,7 @@ static int tg_load_down(struct task_grou
 		load = cpu_rq(cpu)->load.weight;
 	} else {
 		load = tg->parent->cfs_rq[cpu]->h_load;
-		load *= tg->cfs_rq[cpu]->shares;
+		load *= tg->se[cpu]->load.weight;
 		load /= tg->parent->cfs_rq[cpu]->load.weight + 1;
 	}
 
@@ -1658,21 +1595,16 @@ static int tg_load_down(struct task_grou
 	return 0;
 }
 
-static void update_shares(struct sched_domain *sd)
+static void update_shares(long cpu)
 {
-	s64 elapsed;
-	u64 now;
-
 	if (root_task_group_empty())
 		return;
 
-	now = local_clock();
-	elapsed = now - sd->last_update;
+	/*
+	 * XXX: replace with an on-demand list
+	 */
 
-	if (elapsed >= (s64)(u64)sysctl_sched_shares_ratelimit) {
-		sd->last_update = now;
-		walk_tg_tree(tg_nop, tg_shares_up, sd);
-	}
+	walk_tg_tree(tg_nop, tg_shares_up, (void *)cpu);
 }
 
 static void update_h_load(long cpu)
@@ -1682,7 +1614,7 @@ static void update_h_load(long cpu)
 
 #else
 
-static inline void update_shares(struct sched_domain *sd)
+static inline void update_shares(int cpu)
 {
 }
 
@@ -1807,15 +1739,6 @@ static void double_rq_unlock(struct rq *
 
 #endif
 
-#ifdef CONFIG_FAIR_GROUP_SCHED
-static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
-{
-#ifdef CONFIG_SMP
-	cfs_rq->shares = shares;
-#endif
-}
-#endif
-
 static void calc_load_account_idle(struct rq *this_rq);
 static void update_sysctl(void);
 static int get_update_sysctl_factor(void);
@@ -5404,7 +5327,6 @@ static void update_sysctl(void)
 	SET_SYSCTL(sched_min_granularity);
 	SET_SYSCTL(sched_latency);
 	SET_SYSCTL(sched_wakeup_granularity);
-	SET_SYSCTL(sched_shares_ratelimit);
 #undef SET_SYSCTL
 }
 
@@ -7721,8 +7643,7 @@ static void init_tg_cfs_entry(struct tas
 		se->cfs_rq = parent->my_q;
 
 	se->my_q = cfs_rq;
-	se->load.weight = tg->shares;
-	se->load.inv_weight = 0;
+	update_load_set(&se->load, tg->shares);
 	se->parent = parent;
 }
 #endif
@@ -7815,10 +7736,6 @@ void __init sched_init(void)
 
 #endif /* CONFIG_CGROUP_SCHED */
 
-#if defined CONFIG_FAIR_GROUP_SCHED && defined CONFIG_SMP
-	update_shares_data = __alloc_percpu(nr_cpu_ids * sizeof(unsigned long),
-					    __alignof__(unsigned long));
-#endif
 	for_each_possible_cpu(i) {
 		struct rq *rq;
 
@@ -8386,8 +8303,7 @@ static void __set_se_shares(struct sched
 	if (on_rq)
 		dequeue_entity(cfs_rq, se, 0);
 
-	se->load.weight = shares;
-	se->load.inv_weight = 0;
+	update_load_set(&se->load, shares);
 
 	if (on_rq)
 		enqueue_entity(cfs_rq, se, 0);
@@ -8444,7 +8360,6 @@ int sched_group_set_shares(struct task_g
 		/*
 		 * force a rebalance
 		 */
-		cfs_rq_set_shares(tg->cfs_rq[i], 0);
 		set_se_shares(tg->se[i], shares);
 	}
 
Index: kernel/sched_debug.c
===================================================================
--- kernel/sched_debug.c.orig
+++ kernel/sched_debug.c
@@ -202,15 +202,22 @@ void print_cfs_rq(struct seq_file *m, in
 	spread0 = min_vruntime - rq0_min_vruntime;
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "spread0",
 			SPLIT_NS(spread0));
-	SEQ_printf(m, "  .%-30s: %ld\n", "nr_running", cfs_rq->nr_running);
-	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
-
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_spread_over",
 			cfs_rq->nr_spread_over);
+	SEQ_printf(m, "  .%-30s: %ld\n", "nr_running", cfs_rq->nr_running);
+	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 #ifdef CONFIG_SMP
-	SEQ_printf(m, "  .%-30s: %lu\n", "shares", cfs_rq->shares);
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "load_avg",
+			SPLIT_NS(cfs_rq->load_avg));
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "load_period",
+			SPLIT_NS(cfs_rq->load_period));
+	SEQ_printf(m, "  .%-30s: %ld\n", "load_contrib",
+			cfs_rq->load_contribution);
+	SEQ_printf(m, "  .%-30s: %d\n", "load_tg",
+			atomic_read(&tg->load_weight));
 #endif
+
 	print_cfs_group_stats(m, cpu, cfs_rq->tg);
 #endif
 }
Index: kernel/sched_fair.c
===================================================================
--- kernel/sched_fair.c.orig
+++ kernel/sched_fair.c
@@ -417,7 +417,6 @@ int sched_proc_update_handler(struct ctl
 	WRT_SYSCTL(sched_min_granularity);
 	WRT_SYSCTL(sched_latency);
 	WRT_SYSCTL(sched_wakeup_granularity);
-	WRT_SYSCTL(sched_shares_ratelimit);
 #undef WRT_SYSCTL
 
 	return 0;
@@ -633,7 +632,6 @@ account_entity_enqueue(struct cfs_rq *cf
 		list_add(&se->group_node, &cfs_rq->tasks);
 	}
 	cfs_rq->nr_running++;
-	se->on_rq = 1;
 }
 
 static void
@@ -647,9 +645,89 @@ account_entity_dequeue(struct cfs_rq *cf
 		list_del_init(&se->group_node);
 	}
 	cfs_rq->nr_running--;
-	se->on_rq = 0;
 }
 
+#if defined CONFIG_SMP && defined CONFIG_FAIR_GROUP_SCHED
+static void update_cfs_load(struct cfs_rq *cfs_rq)
+{
+	u64 period = sched_avg_period();
+	u64 now, delta;
+
+	if (!cfs_rq)
+		return;
+
+	now = rq_of(cfs_rq)->clock;
+	delta = now - cfs_rq->load_stamp;
+
+	cfs_rq->load_stamp = now;
+	cfs_rq->load_period += delta;
+	cfs_rq->load_avg += delta * cfs_rq->load.weight;
+
+	while (cfs_rq->load_period > period) {
+		/*
+		 * Inline assembly required to prevent the compiler
+		 * optimising this loop into a divmod call.
+		 * See __iter_div_u64_rem() for another example of this.
+		 */
+		asm("" : "+rm" (cfs_rq->load_period));
+		cfs_rq->load_period /= 2;
+		cfs_rq->load_avg /= 2;
+	}
+}
+
+static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
+			    unsigned long weight)
+{
+	if (se->on_rq)
+		account_entity_dequeue(cfs_rq, se);
+
+	update_load_set(&se->load, weight);
+
+	if (se->on_rq)
+		account_entity_enqueue(cfs_rq, se);
+}
+
+static void update_cfs_shares(struct cfs_rq *cfs_rq)
+{
+	struct task_group *tg;
+	struct sched_entity *se;
+	long load_weight, load, shares;
+
+	if (!cfs_rq)
+		return;
+
+	tg = cfs_rq->tg;
+	se = tg->se[cpu_of(rq_of(cfs_rq))];
+	if (!se)
+		return;
+
+	load = cfs_rq->load.weight;
+
+	load_weight = atomic_read(&tg->load_weight);
+	load_weight -= cfs_rq->load_contribution;
+	load_weight += load;
+
+	shares = (tg->shares * load);
+	if (load_weight)
+		shares /= load_weight;
+
+	if (shares < MIN_SHARES)
+		shares = MIN_SHARES;
+	if (shares > tg->shares)
+		shares = tg->shares;
+
+	reweight_entity(cfs_rq_of(se), se, shares);
+}
+#else /* CONFIG_FAIR_GROUP_SCHED */
+static inline void update_cfs_load(struct cfs_rq *cfs_rq)
+{
+}
+
+static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
+{
+}
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
 static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 #ifdef CONFIG_SCHEDSTATS
@@ -771,7 +849,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
+	update_cfs_load(cfs_rq);
 	account_entity_enqueue(cfs_rq, se);
+	update_cfs_shares(cfs_rq);
 
 	if (flags & ENQUEUE_WAKEUP) {
 		place_entity(cfs_rq, se, 0);
@@ -782,6 +862,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	check_spread(cfs_rq, se);
 	if (se != cfs_rq->curr)
 		__enqueue_entity(cfs_rq, se);
+	se->on_rq = 1;
 }
 
 static void __clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -825,8 +906,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 
 	if (se != cfs_rq->curr)
 		__dequeue_entity(cfs_rq, se);
+	se->on_rq = 0;
+	update_cfs_load(cfs_rq);
 	account_entity_dequeue(cfs_rq, se);
 	update_min_vruntime(cfs_rq);
+	update_cfs_shares(cfs_rq);
 
 	/*
 	 * Normalize the entity after updating the min_vruntime because the
@@ -1055,6 +1139,13 @@ enqueue_task_fair(struct rq *rq, struct 
 		flags = ENQUEUE_WAKEUP;
 	}
 
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		update_cfs_load(cfs_rq);
+		update_cfs_shares(cfs_rq);
+	}
+
 	hrtick_update(rq);
 }
 
@@ -1071,12 +1162,20 @@ static void dequeue_task_fair(struct rq 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
+
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight)
 			break;
 		flags |= DEQUEUE_SLEEP;
 	}
 
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		update_cfs_load(cfs_rq);
+		update_cfs_shares(cfs_rq);
+	}
+
 	hrtick_update(rq);
 }
 
@@ -1143,51 +1242,20 @@ static void task_waking_fair(struct rq *
  * Adding load to a group doesn't make a group heavier, but can cause movement
  * of group shares between cpus. Assuming the shares were perfectly aligned one
  * can calculate the shift in shares.
- *
- * The problem is that perfectly aligning the shares is rather expensive, hence
- * we try to avoid doing that too often - see update_shares(), which ratelimits
- * this change.
- *
- * We compensate this by not only taking the current delta into account, but
- * also considering the delta between when the shares were last adjusted and
- * now.
- *
- * We still saw a performance dip, some tracing learned us that between
- * cgroup:/ and cgroup:/foo balancing the number of affine wakeups increased
- * significantly. Therefore try to bias the error in direction of failing
- * the affine wakeup.
- *
  */
-static long effective_load(struct task_group *tg, int cpu,
-		long wl, long wg)
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 {
 	struct sched_entity *se = tg->se[cpu];
 
 	if (!tg->parent)
 		return wl;
 
-	/*
-	 * By not taking the decrease of shares on the other cpu into
-	 * account our error leans towards reducing the affine wakeups.
-	 */
-	if (!wl && sched_feat(ASYM_EFF_LOAD))
-		return wl;
-
 	for_each_sched_entity(se) {
 		long S, rw, s, a, b;
-		long more_w;
-
-		/*
-		 * Instead of using this increment, also add the difference
-		 * between when the shares were last updated and now.
-		 */
-		more_w = se->my_q->load.weight - se->my_q->rq_weight;
-		wl += more_w;
-		wg += more_w;
 
 		S = se->my_q->tg->shares;
-		s = se->my_q->shares;
-		rw = se->my_q->rq_weight;
+		s = se->load.weight;
+		rw = se->my_q->load.weight;
 
 		a = S*(rw + wl);
 		b = S*rw + s*wg;
@@ -1508,23 +1576,6 @@ select_task_rq_fair(struct rq *rq, struc
 			sd = tmp;
 	}
 
-#ifdef CONFIG_FAIR_GROUP_SCHED
-	if (sched_feat(LB_SHARES_UPDATE)) {
-		/*
-		 * Pick the largest domain to update shares over
-		 */
-		tmp = sd;
-		if (affine_sd && (!tmp || affine_sd->span_weight > sd->span_weight))
-			tmp = affine_sd;
-
-		if (tmp) {
-			raw_spin_unlock(&rq->lock);
-			update_shares(tmp);
-			raw_spin_lock(&rq->lock);
-		}
-	}
-#endif
-
 	if (affine_sd) {
 		if (cpu == prev_cpu || wake_affine(affine_sd, p, sync))
 			return select_idle_sibling(p, cpu);
@@ -2977,7 +3028,6 @@ static int load_balance(int this_cpu, st
 	schedstat_inc(sd, lb_count[idle]);
 
 redo:
-	update_shares(sd);
 	group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle,
 				   cpus, balance);
 
@@ -3119,8 +3169,6 @@ out_one_pinned:
 	else
 		ld_moved = 0;
 out:
-	if (ld_moved)
-		update_shares(sd);
 	return ld_moved;
 }
 
@@ -3514,6 +3562,8 @@ static void rebalance_domains(int cpu, e
 	int update_next_balance = 0;
 	int need_serialize;
 
+	update_shares(cpu);
+
 	for_each_domain(cpu, sd) {
 		if (!(sd->flags & SD_LOAD_BALANCE))
 			continue;
Index: kernel/sched_features.h
===================================================================
--- kernel/sched_features.h.orig
+++ kernel/sched_features.h
@@ -52,8 +52,6 @@ SCHED_FEAT(ARCH_POWER, 0)
 SCHED_FEAT(HRTICK, 0)
 SCHED_FEAT(DOUBLE_TICK, 0)
 SCHED_FEAT(LB_BIAS, 1)
-SCHED_FEAT(LB_SHARES_UPDATE, 1)
-SCHED_FEAT(ASYM_EFF_LOAD, 1)
 
 /*
  * Spin-wait on mutex acquisition when the mutex owner is running on
Index: kernel/sysctl.c
===================================================================
--- kernel/sysctl.c.orig
+++ kernel/sysctl.c
@@ -261,8 +261,6 @@ static int min_wakeup_granularity_ns;			
 static int max_wakeup_granularity_ns = NSEC_PER_SEC;	/* 1 second */
 static int min_sched_tunable_scaling = SCHED_TUNABLESCALING_NONE;
 static int max_sched_tunable_scaling = SCHED_TUNABLESCALING_END-1;
-static int min_sched_shares_ratelimit = 100000; /* 100 usec */
-static int max_sched_shares_ratelimit = NSEC_PER_SEC; /* 1 second */
 #endif
 
 #ifdef CONFIG_COMPACTION
@@ -307,15 +305,6 @@ static struct ctl_table kern_table[] = {
 		.extra2		= &max_wakeup_granularity_ns,
 	},
 	{
-		.procname	= "sched_shares_ratelimit",
-		.data		= &sysctl_sched_shares_ratelimit,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= sched_proc_update_handler,
-		.extra1		= &min_sched_shares_ratelimit,
-		.extra2		= &max_sched_shares_ratelimit,
-	},
-	{
 		.procname	= "sched_tunable_scaling",
 		.data		= &sysctl_sched_tunable_scaling,
 		.maxlen		= sizeof(enum sched_tunable_scaling),
@@ -325,14 +314,6 @@ static struct ctl_table kern_table[] = {
 		.extra2		= &max_sched_tunable_scaling,
 	},
 	{
-		.procname	= "sched_shares_thresh",
-		.data		= &sysctl_sched_shares_thresh,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &zero,
-	},
-	{
 		.procname	= "sched_migration_cost",
 		.data		= &sysctl_sched_migration_cost,
 		.maxlen		= sizeof(unsigned int),

-- 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [tg_shares_up rewrite v3 02/11] sched: on-demand (active) cfs_rq list
  2010-11-12  3:24 [tg_shares_up rewrite v3 00/11] reduce overhead for tg->shares distribution Paul Turner
  2010-11-12  3:24 ` [tg_shares_up rewrite v3 01/11] sched: rewrite tg_shares_up Paul Turner
@ 2010-11-12  3:24 ` Paul Turner
  2010-11-12  3:24 ` [tg_shares_up rewrite v3 03/11] sched: make tg_shares_up() walk on-demand Paul Turner
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Paul Turner @ 2010-11-12  3:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Paul Turner,
	Bharata B Rao, Karl Rister, Balbir Singh, David Miller

[-- Attachment #1: sched-tg-ondemand.patch --]
[-- Type: text/plain, Size: 13254 bytes --]

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Make certain load-balance actions scale with the number of active cgroups
instead of the total number of existing cgroups.

This makes wakeup/sleep paths more expensive, but is a win for systems
where the vast majority of existing cgroups are idle.
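
The mechanism reduces to flipping list membership on a queue's
empty<->non-empty transitions; the sketch below models that idea (in the
actual patch the removal is deferred until the queue's load average has also
decayed -- see the diff):

/* active_list_sketch.c -- track only runqueues that currently have work */
#include <stdio.h>

struct cfs_rq_model {
	int nr_running;
	int on_list;	/* member of the per-cpu leaf list? */
};

static void enqueue(struct cfs_rq_model *cfs)
{
	if (cfs->nr_running++ == 0 && !cfs->on_list) {
		cfs->on_list = 1;	/* list_add_rcu() in the real code */
		printf("joined leaf list\n");
	}
}

static void dequeue(struct cfs_rq_model *cfs)
{
	if (--cfs->nr_running == 0 && cfs->on_list) {
		cfs->on_list = 0;	/* deferred until load_avg decays in the patch */
		printf("left leaf list\n");
	}
}

int main(void)
{
	struct cfs_rq_model cfs = { 0, 0 };

	enqueue(&cfs);	/* 0 -> 1: becomes active */
	enqueue(&cfs);	/* already active: no list manipulation */
	dequeue(&cfs);
	dequeue(&cfs);	/* 1 -> 0: drops off the balance list */
	return 0;
}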

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched.c      |  105 +++++++++++++---------------------------------------
 kernel/sched_fair.c |   46 +++++++++++++++++++---
 kernel/sched_rt.c   |   24 +++++++++++
 3 files changed, 92 insertions(+), 83 deletions(-)

Index: kernel/sched.c
===================================================================
--- kernel/sched.c.orig
+++ kernel/sched.c
@@ -274,9 +274,7 @@ struct task_group {
 
 #define root_task_group init_task_group
 
-/* task_group_lock serializes add/remove of task groups and also changes to
- * a task group's cpu shares.
- */
+/* task_group_lock serializes the addition/removal of task groups */
 static DEFINE_SPINLOCK(task_group_lock);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -344,6 +342,7 @@ struct cfs_rq {
 	 * leaf_cfs_rq_list ties together list of leaf cfs_rq's in a cpu. This
 	 * list is used during load balance.
 	 */
+	int on_list;
 	struct list_head leaf_cfs_rq_list;
 	struct task_group *tg;	/* group that "owns" this runqueue */
 
@@ -1530,7 +1529,7 @@ static unsigned long cpu_avg_load_per_ta
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
-static void update_cfs_load(struct cfs_rq *cfs_rq);
+static void update_cfs_load(struct cfs_rq *cfs_rq, int lb);
 static void update_cfs_shares(struct cfs_rq *cfs_rq);
 
 /*
@@ -1553,7 +1552,7 @@ static int tg_shares_up(struct task_grou
 	raw_spin_lock_irqsave(&rq->lock, flags);
 
 	update_rq_clock(rq);
-	update_cfs_load(cfs_rq);
+	update_cfs_load(cfs_rq, 1);
 
 	load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
 	load_avg -= cfs_rq->load_contribution;
@@ -7622,15 +7621,13 @@ static void init_rt_rq(struct rt_rq *rt_
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
-				struct sched_entity *se, int cpu, int add,
+				struct sched_entity *se, int cpu,
 				struct sched_entity *parent)
 {
 	struct rq *rq = cpu_rq(cpu);
 	tg->cfs_rq[cpu] = cfs_rq;
 	init_cfs_rq(cfs_rq, rq);
 	cfs_rq->tg = tg;
-	if (add)
-		list_add(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
 
 	tg->se[cpu] = se;
 	/* se could be NULL for init_task_group */
@@ -7650,7 +7647,7 @@ static void init_tg_cfs_entry(struct tas
 
 #ifdef CONFIG_RT_GROUP_SCHED
 static void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
-		struct sched_rt_entity *rt_se, int cpu, int add,
+		struct sched_rt_entity *rt_se, int cpu,
 		struct sched_rt_entity *parent)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -7659,8 +7656,6 @@ static void init_tg_rt_entry(struct task
 	init_rt_rq(rt_rq, rq);
 	rt_rq->tg = tg;
 	rt_rq->rt_runtime = tg->rt_bandwidth.rt_runtime;
-	if (add)
-		list_add(&rt_rq->leaf_rt_rq_list, &rq->leaf_rt_rq_list);
 
 	tg->rt_se[cpu] = rt_se;
 	if (!rt_se)
@@ -7769,7 +7764,7 @@ void __init sched_init(void)
 		 * We achieve this by letting init_task_group's tasks sit
 		 * directly in rq->cfs (i.e init_task_group->se[] = NULL).
 		 */
-		init_tg_cfs_entry(&init_task_group, &rq->cfs, NULL, i, 1, NULL);
+		init_tg_cfs_entry(&init_task_group, &rq->cfs, NULL, i, NULL);
 #endif
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -7777,7 +7772,7 @@ void __init sched_init(void)
 #ifdef CONFIG_RT_GROUP_SCHED
 		INIT_LIST_HEAD(&rq->leaf_rt_rq_list);
 #ifdef CONFIG_CGROUP_SCHED
-		init_tg_rt_entry(&init_task_group, &rq->rt, NULL, i, 1, NULL);
+		init_tg_rt_entry(&init_task_group, &rq->rt, NULL, i, NULL);
 #endif
 #endif
 
@@ -8053,7 +8048,7 @@ int alloc_fair_sched_group(struct task_g
 		if (!se)
 			goto err_free_rq;
 
-		init_tg_cfs_entry(tg, cfs_rq, se, i, 0, parent->se[i]);
+		init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);
 	}
 
 	return 1;
@@ -8064,15 +8059,22 @@ int alloc_fair_sched_group(struct task_g
 	return 0;
 }
 
-static inline void register_fair_sched_group(struct task_group *tg, int cpu)
-{
-	list_add_rcu(&tg->cfs_rq[cpu]->leaf_cfs_rq_list,
-			&cpu_rq(cpu)->leaf_cfs_rq_list);
-}
-
 static inline void unregister_fair_sched_group(struct task_group *tg, int cpu)
 {
-	list_del_rcu(&tg->cfs_rq[cpu]->leaf_cfs_rq_list);
+	struct rq *rq = cpu_rq(cpu);
+	unsigned long flags;
+	int i;
+
+	/*
+	* Only empty task groups can be destroyed; so we can speculatively
+	* check on_list without danger of it being re-added.
+	*/
+	if (!tg->cfs_rq[cpu]->on_list)
+		return;
+
+	raw_spin_lock_irqsave(&rq->lock, flags);
+	list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
 }
 #else /* !CONFG_FAIR_GROUP_SCHED */
 static inline void free_fair_sched_group(struct task_group *tg)
@@ -8085,10 +8087,6 @@ int alloc_fair_sched_group(struct task_g
 	return 1;
 }
 
-static inline void register_fair_sched_group(struct task_group *tg, int cpu)
-{
-}
-
 static inline void unregister_fair_sched_group(struct task_group *tg, int cpu)
 {
 }
@@ -8143,7 +8141,7 @@ int alloc_rt_sched_group(struct task_gro
 		if (!rt_se)
 			goto err_free_rq;
 
-		init_tg_rt_entry(tg, rt_rq, rt_se, i, 0, parent->rt_se[i]);
+		init_tg_rt_entry(tg, rt_rq, rt_se, i, parent->rt_se[i]);
 	}
 
 	return 1;
@@ -8153,17 +8151,6 @@ int alloc_rt_sched_group(struct task_gro
  err:
 	return 0;
 }
-
-static inline void register_rt_sched_group(struct task_group *tg, int cpu)
-{
-	list_add_rcu(&tg->rt_rq[cpu]->leaf_rt_rq_list,
-			&cpu_rq(cpu)->leaf_rt_rq_list);
-}
-
-static inline void unregister_rt_sched_group(struct task_group *tg, int cpu)
-{
-	list_del_rcu(&tg->rt_rq[cpu]->leaf_rt_rq_list);
-}
 #else /* !CONFIG_RT_GROUP_SCHED */
 static inline void free_rt_sched_group(struct task_group *tg)
 {
@@ -8174,14 +8161,6 @@ int alloc_rt_sched_group(struct task_gro
 {
 	return 1;
 }
-
-static inline void register_rt_sched_group(struct task_group *tg, int cpu)
-{
-}
-
-static inline void unregister_rt_sched_group(struct task_group *tg, int cpu)
-{
-}
 #endif /* CONFIG_RT_GROUP_SCHED */
 
 #ifdef CONFIG_CGROUP_SCHED
@@ -8197,7 +8176,6 @@ struct task_group *sched_create_group(st
 {
 	struct task_group *tg;
 	unsigned long flags;
-	int i;
 
 	tg = kzalloc(sizeof(*tg), GFP_KERNEL);
 	if (!tg)
@@ -8210,10 +8188,6 @@ struct task_group *sched_create_group(st
 		goto err;
 
 	spin_lock_irqsave(&task_group_lock, flags);
-	for_each_possible_cpu(i) {
-		register_fair_sched_group(tg, i);
-		register_rt_sched_group(tg, i);
-	}
 	list_add_rcu(&tg->list, &task_groups);
 
 	WARN_ON(!parent); /* root should already exist */
@@ -8243,11 +8217,11 @@ void sched_destroy_group(struct task_gro
 	unsigned long flags;
 	int i;
 
-	spin_lock_irqsave(&task_group_lock, flags);
-	for_each_possible_cpu(i) {
+	/* end participation in shares distribution */
+	for_each_possible_cpu(i)
 		unregister_fair_sched_group(tg, i);
-		unregister_rt_sched_group(tg, i);
-	}
+
+	spin_lock_irqsave(&task_group_lock, flags);
 	list_del_rcu(&tg->list);
 	list_del_rcu(&tg->siblings);
 	spin_unlock_irqrestore(&task_group_lock, flags);
@@ -8325,7 +8299,6 @@ static DEFINE_MUTEX(shares_mutex);
 int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 {
 	int i;
-	unsigned long flags;
 
 	/*
 	 * We can't change the weight of the root cgroup.
@@ -8342,19 +8315,6 @@ int sched_group_set_shares(struct task_g
 	if (tg->shares == shares)
 		goto done;
 
-	spin_lock_irqsave(&task_group_lock, flags);
-	for_each_possible_cpu(i)
-		unregister_fair_sched_group(tg, i);
-	list_del_rcu(&tg->siblings);
-	spin_unlock_irqrestore(&task_group_lock, flags);
-
-	/* wait for any ongoing reference to this group to finish */
-	synchronize_sched();
-
-	/*
-	 * Now we are free to modify the group's share on each cpu
-	 * w/o tripping rebalance_share or load_balance_fair.
-	 */
 	tg->shares = shares;
 	for_each_possible_cpu(i) {
 		/*
@@ -8363,15 +8323,6 @@ int sched_group_set_shares(struct task_g
 		set_se_shares(tg->se[i], shares);
 	}
 
-	/*
-	 * Enable load balance activity on this group, by inserting it back on
-	 * each cpu's rq->leaf_cfs_rq_list.
-	 */
-	spin_lock_irqsave(&task_group_lock, flags);
-	for_each_possible_cpu(i)
-		register_fair_sched_group(tg, i);
-	list_add_rcu(&tg->siblings, &tg->parent->children);
-	spin_unlock_irqrestore(&task_group_lock, flags);
 done:
 	mutex_unlock(&shares_mutex);
 	return 0;
Index: kernel/sched_fair.c
===================================================================
--- kernel/sched_fair.c.orig
+++ kernel/sched_fair.c
@@ -143,6 +143,24 @@ static inline struct cfs_rq *cpu_cfs_rq(
 	return cfs_rq->tg->cfs_rq[this_cpu];
 }
 
+static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	if (!cfs_rq->on_list) {
+		list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
+				&rq_of(cfs_rq)->leaf_cfs_rq_list);
+
+		cfs_rq->on_list = 1;
+	}
+}
+
+static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	if (cfs_rq->on_list) {
+		list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
+		cfs_rq->on_list = 0;
+	}
+}
+
 /* Iterate thr' all leaf cfs_rq's on a runqueue */
 #define for_each_leaf_cfs_rq(rq, cfs_rq) \
 	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
@@ -246,6 +264,14 @@ static inline struct cfs_rq *cpu_cfs_rq(
 	return &cpu_rq(this_cpu)->cfs;
 }
 
+static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+{
+}
+
+static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+{
+}
+
 #define for_each_leaf_cfs_rq(rq, cfs_rq) \
 		for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
 
@@ -648,7 +674,7 @@ account_entity_dequeue(struct cfs_rq *cf
 }
 
 #if defined CONFIG_SMP && defined CONFIG_FAIR_GROUP_SCHED
-static void update_cfs_load(struct cfs_rq *cfs_rq)
+static void update_cfs_load(struct cfs_rq *cfs_rq, int lb)
 {
 	u64 period = sched_avg_period();
 	u64 now, delta;
@@ -673,6 +699,11 @@ static void update_cfs_load(struct cfs_r
 		cfs_rq->load_period /= 2;
 		cfs_rq->load_avg /= 2;
 	}
+
+	if (lb && !cfs_rq->nr_running) {
+		if (cfs_rq->load_avg < (period / 8))
+			list_del_leaf_cfs_rq(cfs_rq);
+	}
 }
 
 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
@@ -719,7 +750,7 @@ static void update_cfs_shares(struct cfs
 	reweight_entity(cfs_rq_of(se), se, shares);
 }
 #else /* CONFIG_FAIR_GROUP_SCHED */
-static inline void update_cfs_load(struct cfs_rq *cfs_rq)
+static inline void update_cfs_load(struct cfs_rq *cfs_rq, int lb)
 {
 }
 
@@ -849,7 +880,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	update_cfs_load(cfs_rq);
+	update_cfs_load(cfs_rq, 0);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
 
@@ -863,6 +894,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	if (se != cfs_rq->curr)
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
+
+	if (cfs_rq->nr_running == 1)
+		list_add_leaf_cfs_rq(cfs_rq);
 }
 
 static void __clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -907,7 +941,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 	if (se != cfs_rq->curr)
 		__dequeue_entity(cfs_rq, se);
 	se->on_rq = 0;
-	update_cfs_load(cfs_rq);
+	update_cfs_load(cfs_rq, 0);
 	account_entity_dequeue(cfs_rq, se);
 	update_min_vruntime(cfs_rq);
 	update_cfs_shares(cfs_rq);
@@ -1142,7 +1176,7 @@ enqueue_task_fair(struct rq *rq, struct 
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
-		update_cfs_load(cfs_rq);
+		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
@@ -1172,7 +1206,7 @@ static void dequeue_task_fair(struct rq 
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
-		update_cfs_load(cfs_rq);
+		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
Index: kernel/sched_rt.c
===================================================================
--- kernel/sched_rt.c.orig
+++ kernel/sched_rt.c
@@ -183,6 +183,17 @@ static inline u64 sched_rt_period(struct
 	return ktime_to_ns(rt_rq->tg->rt_bandwidth.rt_period);
 }
 
+static inline void list_add_leaf_rt_rq(struct rt_rq *rt_rq)
+{
+	list_add_rcu(&rt_rq->leaf_rt_rq_list,
+			&rq_of_rt_rq(rt_rq)->leaf_rt_rq_list);
+}
+
+static inline void list_del_leaf_rt_rq(struct rt_rq *rt_rq)
+{
+	list_del_rcu(&rt_rq->leaf_rt_rq_list);
+}
+
 #define for_each_leaf_rt_rq(rt_rq, rq) \
 	list_for_each_entry_rcu(rt_rq, &rq->leaf_rt_rq_list, leaf_rt_rq_list)
 
@@ -276,6 +287,14 @@ static inline u64 sched_rt_period(struct
 	return ktime_to_ns(def_rt_bandwidth.rt_period);
 }
 
+static inline void list_add_leaf_rt_rq(struct rt_rq *rt_rq)
+{
+}
+
+static inline void list_del_leaf_rt_rq(struct rt_rq *rt_rq)
+{
+}
+
 #define for_each_leaf_rt_rq(rt_rq, rq) \
 	for (rt_rq = &rq->rt; rt_rq; rt_rq = NULL)
 
@@ -825,6 +844,9 @@ static void __enqueue_rt_entity(struct s
 	if (group_rq && (rt_rq_throttled(group_rq) || !group_rq->rt_nr_running))
 		return;
 
+	if (!rt_rq->rt_nr_running)
+		list_add_leaf_rt_rq(rt_rq);
+
 	if (head)
 		list_add(&rt_se->run_list, queue);
 	else
@@ -844,6 +866,8 @@ static void __dequeue_rt_entity(struct s
 		__clear_bit(rt_se_prio(rt_se), array->bitmap);
 
 	dec_rt_tasks(rt_se, rt_rq);
+	if (!rt_rq->rt_nr_running)
+		list_del_leaf_rt_rq(rt_rq);
 }
 
 /*

-- 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [tg_shares_up rewrite v3 03/11] sched: make tg_shares_up() walk on-demand
  2010-11-12  3:24 [tg_shares_up rewrite v3 00/11] reduce overhead for tg->shares distribution Paul Turner
  2010-11-12  3:24 ` [tg_shares_up rewrite v3 01/11] sched: rewrite tg_shares_up Paul Turner
  2010-11-12  3:24 ` [tg_shares_up rewrite v3 02/11] sched: on-demand (active) cfs_rq list Paul Turner
@ 2010-11-12  3:24 ` Paul Turner
  2010-11-12  3:24 ` [tg_shares_up rewrite v3 04/11] sched: fix load corruption from update_cfs_shares Paul Turner
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Paul Turner @ 2010-11-12  3:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Paul Turner,
	Bharata B Rao, Karl Rister, Balbir Singh, David Miller

[-- Attachment #1: sched-tg-more-ondemand.patch --]
[-- Type: text/plain, Size: 4421 bytes --]

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Make tg_shares_up() use the active cgroup list. This means we cannot do a
strict bottom-up walk of the hierarchy, but assuming it's a very wide tree
with a small number of active groups it should be a win.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched.c      |   67 ----------------------------------------------------
 kernel/sched_fair.c |   58 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+), 67 deletions(-)

Index: kernel/sched.c
===================================================================
--- kernel/sched.c.orig
+++ kernel/sched.c
@@ -281,13 +281,6 @@ static DEFINE_SPINLOCK(task_group_lock);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
-#ifdef CONFIG_SMP
-static int root_task_group_empty(void)
-{
-	return list_empty(&root_task_group.children);
-}
-#endif
-
 # define INIT_TASK_GROUP_LOAD	NICE_0_LOAD
 
 /*
@@ -1531,48 +1524,6 @@ static unsigned long cpu_avg_load_per_ta
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
-static void update_cfs_load(struct cfs_rq *cfs_rq, int lb);
-static void update_cfs_shares(struct cfs_rq *cfs_rq);
-
-/*
- * update tg->load_weight by folding this cpu's load_avg
- */
-static int tg_shares_up(struct task_group *tg, void *data)
-{
-	long load_avg;
-	struct cfs_rq *cfs_rq;
-	unsigned long flags;
-	int cpu = (long)data;
-	struct rq *rq;
-
-	if (!tg->se[cpu])
-		return 0;
-
-	rq = cpu_rq(cpu);
-	cfs_rq = tg->cfs_rq[cpu];
-
-	raw_spin_lock_irqsave(&rq->lock, flags);
-
-	update_rq_clock(rq);
-	update_cfs_load(cfs_rq, 1);
-
-	load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
-	load_avg -= cfs_rq->load_contribution;
-
-	atomic_add(load_avg, &tg->load_weight);
-	cfs_rq->load_contribution += load_avg;
-
-	/*
-	 * We need to update shares after updating tg->load_weight in
-	 * order to adjust the weight of groups with long running tasks.
-	 */
-	update_cfs_shares(cfs_rq);
-
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
-
-	return 0;
-}
-
 /*
  * Compute the cpu's hierarchical load factor for each task group.
  * This needs to be done in a top-down fashion because the load of a child
@@ -1596,29 +1547,11 @@ static int tg_load_down(struct task_grou
 	return 0;
 }
 
-static void update_shares(long cpu)
-{
-	if (root_task_group_empty())
-		return;
-
-	/*
-	 * XXX: replace with an on-demand list
-	 */
-
-	walk_tg_tree(tg_nop, tg_shares_up, (void *)cpu);
-}
-
 static void update_h_load(long cpu)
 {
 	walk_tg_tree(tg_load_down, tg_nop, (void *)cpu);
 }
 
-#else
-
-static inline void update_shares(int cpu)
-{
-}
-
 #endif
 
 #ifdef CONFIG_PREEMPT
Index: kernel/sched_fair.c
===================================================================
--- kernel/sched_fair.c.orig
+++ kernel/sched_fair.c
@@ -2000,6 +2000,60 @@ out:
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+/*
+ * update tg->load_weight by folding this cpu's load_avg
+ */
+static int tg_shares_up(struct task_group *tg, int cpu)
+{
+	struct cfs_rq *cfs_rq;
+	unsigned long flags;
+	struct rq *rq;
+	long load_avg;
+
+	if (!tg->se[cpu])
+		return 0;
+
+	rq = cpu_rq(cpu);
+	cfs_rq = tg->cfs_rq[cpu];
+
+	raw_spin_lock_irqsave(&rq->lock, flags);
+
+	update_rq_clock(rq);
+	update_cfs_load(cfs_rq, 1);
+
+	load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
+	load_avg -= cfs_rq->load_contribution;
+	atomic_add(load_avg, &tg->load_weight);
+	cfs_rq->load_contribution += load_avg;
+
+	/*
+	 * We need to update shares after updating tg->load_weight in
+	 * order to adjust the weight of groups with long running tasks.
+	 */
+	update_cfs_shares(cfs_rq);
+
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
+
+	return 0;
+}
+
+static void update_shares(int cpu)
+{
+	struct cfs_rq *cfs_rq;
+	struct rq *rq = cpu_rq(cpu);
+
+	rcu_read_lock();
+	for_each_leaf_cfs_rq(rq, cfs_rq) {
+		struct task_group *tg = cfs_rq->tg;
+
+		do {
+			tg_shares_up(tg, cpu);
+			tg = tg->parent;
+		} while (tg);
+	}
+	rcu_read_unlock();
+}
+
 static unsigned long
 load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
 		  unsigned long max_load_move,
@@ -2047,6 +2101,10 @@ load_balance_fair(struct rq *this_rq, in
 	return max_load_move - rem_load_move;
 }
 #else
+static inline void update_shares(int cpu)
+{
+}
+
 static unsigned long
 load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
 		  unsigned long max_load_move,

-- 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [tg_shares_up rewrite v3 04/11] sched: fix load corruption from update_cfs_shares
  2010-11-12  3:24 [tg_shares_up rewrite v3 00/11] reduce overhead for tg->shares distribution Paul Turner
                   ` (2 preceding siblings ...)
  2010-11-12  3:24 ` [tg_shares_up rewrite v3 03/11] sched: make tg_shares_up() walk on-demand Paul Turner
@ 2010-11-12  3:24 ` Paul Turner
  2010-11-12  3:24 ` [tg_shares_up rewrite v3 05/11] sched: fix update_cfs_load synchronization Paul Turner
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Paul Turner @ 2010-11-12  3:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Paul Turner,
	Bharata B Rao, Karl Rister, Balbir Singh, David Miller

[-- Attachment #1: sched-tg-fix-update_cfs_shares.patch --]
[-- Type: text/plain, Size: 3223 bytes --]

As part of enqueue_entity both a new entity's weight and its contribution to the
queuing cfs_rq / rq are updated.  Since update_cfs_shares will only update the
queuing weights when the entity is on_rq (which in this case it is not yet),
there's a dependency loop here:

update_cfs_shares needs account_entity_enqueue to update cfs_rq->load.weight
account_entity_enqueue needs the updated weight for the queuing cfs_rq load[*]

Fix this and avoid spurious dequeue/enqueues by issuing update_cfs_shares as
if we had accounted the enqueue already.

This was also resulting in rq->load corruption previously.

[*]: this dependency also exists when using the group cfs_rq w/ 
update_cfs_shares as the weight of the enqueued entity changes without the
load being updated.
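
A toy model of the ordering this establishes: recompute the group's share
against the weight the queue is about to have, and only then account the new
entity (all names below are illustrative, not the kernel code):

/* enqueue_order_sketch.c -- share update pretends the enqueue already happened */
#include <stdio.h>

struct cfs_rq_model { long weight; };

static void update_shares(struct cfs_rq_model *cfs, long weight_delta)
{
	/* base the new share on load.weight as it will be after accounting */
	printf("recomputing shares against weight %ld\n",
	       cfs->weight + weight_delta);
}

static void account_enqueue(struct cfs_rq_model *cfs, long se_weight)
{
	cfs->weight += se_weight;
}

int main(void)
{
	struct cfs_rq_model cfs = { 2048 };

	update_shares(&cfs, 1024);	/* sees 3072, as if already enqueued */
	account_enqueue(&cfs, 1024);
	return 0;
}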

Signed-off-by: Paul Turner <pjt@google.com>
---
 kernel/sched_fair.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

Index: kernel/sched_fair.c
===================================================================
--- kernel/sched_fair.c.orig
+++ kernel/sched_fair.c
@@ -718,7 +718,7 @@ static void reweight_entity(struct cfs_r
 		account_entity_enqueue(cfs_rq, se);
 }
 
-static void update_cfs_shares(struct cfs_rq *cfs_rq)
+static void update_cfs_shares(struct cfs_rq *cfs_rq, long weight_delta)
 {
 	struct task_group *tg;
 	struct sched_entity *se;
@@ -732,7 +732,7 @@ static void update_cfs_shares(struct cfs
 	if (!se)
 		return;
 
-	load = cfs_rq->load.weight;
+	load = cfs_rq->load.weight + weight_delta;
 
 	load_weight = atomic_read(&tg->load_weight);
 	load_weight -= cfs_rq->load_contribution;
@@ -754,7 +754,7 @@ static inline void update_cfs_load(struc
 {
 }
 
-static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
+static inline void update_cfs_shares(struct cfs_rq *cfs_rq, long weight_delta)
 {
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
@@ -881,8 +881,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	 */
 	update_curr(cfs_rq);
 	update_cfs_load(cfs_rq, 0);
+	update_cfs_shares(cfs_rq, se->load.weight);
 	account_entity_enqueue(cfs_rq, se);
-	update_cfs_shares(cfs_rq);
 
 	if (flags & ENQUEUE_WAKEUP) {
 		place_entity(cfs_rq, se, 0);
@@ -944,7 +944,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 	update_cfs_load(cfs_rq, 0);
 	account_entity_dequeue(cfs_rq, se);
 	update_min_vruntime(cfs_rq);
-	update_cfs_shares(cfs_rq);
+	update_cfs_shares(cfs_rq, 0);
 
 	/*
 	 * Normalize the entity after updating the min_vruntime because the
@@ -1177,7 +1177,7 @@ enqueue_task_fair(struct rq *rq, struct 
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 		update_cfs_load(cfs_rq, 0);
-		update_cfs_shares(cfs_rq);
+		update_cfs_shares(cfs_rq, 0);
 	}
 
 	hrtick_update(rq);
@@ -1207,7 +1207,7 @@ static void dequeue_task_fair(struct rq 
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 		update_cfs_load(cfs_rq, 0);
-		update_cfs_shares(cfs_rq);
+		update_cfs_shares(cfs_rq, 0);
 	}
 
 	hrtick_update(rq);
@@ -2030,7 +2030,7 @@ static int tg_shares_up(struct task_grou
 	 * We need to update shares after updating tg->load_weight in
 	 * order to adjust the weight of groups with long running tasks.
 	 */
-	update_cfs_shares(cfs_rq);
+	update_cfs_shares(cfs_rq, 0);
 
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
 

-- 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [tg_shares_up rewrite v3 05/11] sched: fix update_cfs_load synchronization
  2010-11-12  3:24 [tg_shares_up rewrite v3 00/11] reduce overhead for tg->shares distribution Paul Turner
                   ` (3 preceding siblings ...)
  2010-11-12  3:24 ` [tg_shares_up rewrite v3 04/11] sched: fix load corruption from update_cfs_shares Paul Turner
@ 2010-11-12  3:24 ` Paul Turner
  2010-11-12  3:24 ` [tg_shares_up rewrite v3 06/11] sched: hierarchical order on shares update list Paul Turner
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Paul Turner @ 2010-11-12  3:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Paul Turner,
	Bharata B Rao, Karl Rister, Balbir Singh, David Miller

[-- Attachment #1: sched-tg-fix-update_cfs_load.patch --]
[-- Type: text/plain, Size: 4093 bytes --]

Using cfs_rq->nr_running is not sufficient to synchronize update_cfs_load with
the put path since nr_running accounting occurs at deactivation.

It's also not safe to make the removal decision based on load_avg as this fails
with both high periods and low shares.  Resolve this by clipping history after 4
periods without activity.

Note: the above will always occur from update_shares() since in the 
last-task-sleep-case that task will still be cfs_rq->curr when update_cfs_load is
called.
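
A userspace model of the resulting bookkeeping, with the window and timestamps
reduced to small plain numbers (illustrative only; the kernel works in ns off
rq->clock):

/* decay_sketch.c -- window folding with history clipped after 4 idle windows */
#include <stdio.h>

#define WINDOW	10	/* stands in for the shares averaging window */

struct cfs_rq_model {
	long load_avg, load_period;
	long load_stamp, load_last;	/* last update / last time load was seen */
};

static void update_load(struct cfs_rq_model *cfs, long now, long weight)
{
	long delta = now - cfs->load_stamp;

	/* idle for more than 4 windows: the old history is stale, drop it */
	if (cfs->load_stamp > cfs->load_last &&
	    now - cfs->load_last > 4 * WINDOW) {
		cfs->load_period = 0;
		cfs->load_avg = 0;
	}

	cfs->load_stamp = now;
	cfs->load_period += delta;
	if (weight) {
		cfs->load_last = now;
		cfs->load_avg += delta * weight;
	}

	/* fold: halve the accumulated history once per full window */
	while (cfs->load_period > WINDOW) {
		cfs->load_period /= 2;
		cfs->load_avg /= 2;
	}
}

int main(void)
{
	struct cfs_rq_model cfs = { 0, 0, 0, 0 };

	update_load(&cfs, 5, 1024);	/* busy */
	update_load(&cfs, 8, 0);	/* queue went idle */
	update_load(&cfs, 60, 0);	/* woken long after: history clipped */
	printf("avg=%ld period=%ld\n", cfs.load_avg, cfs.load_period);
	return 0;
}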

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched.c      |    2 +-
 kernel/sched_fair.c |   33 +++++++++++++++++++++------------
 2 files changed, 22 insertions(+), 13 deletions(-)

Index: kernel/sched_fair.c
===================================================================
--- kernel/sched_fair.c.orig
+++ kernel/sched_fair.c
@@ -674,10 +674,11 @@ account_entity_dequeue(struct cfs_rq *cf
 }
 
 #if defined CONFIG_SMP && defined CONFIG_FAIR_GROUP_SCHED
-static void update_cfs_load(struct cfs_rq *cfs_rq, int lb)
+static void update_cfs_load(struct cfs_rq *cfs_rq)
 {
 	u64 period = sched_avg_period();
 	u64 now, delta;
+	unsigned long load = cfs_rq->load.weight;
 
 	if (!cfs_rq)
 		return;
@@ -685,9 +686,19 @@ static void update_cfs_load(struct cfs_r
 	now = rq_of(cfs_rq)->clock;
 	delta = now - cfs_rq->load_stamp;
 
+	/* truncate load history at 4 idle periods */
+	if (cfs_rq->load_stamp > cfs_rq->load_last &&
+	    now - cfs_rq->load_last > 4 * period) {
+		cfs_rq->load_period = 0;
+		cfs_rq->load_avg = 0;
+	}
+
 	cfs_rq->load_stamp = now;
 	cfs_rq->load_period += delta;
-	cfs_rq->load_avg += delta * cfs_rq->load.weight;
+	if (load) {
+		cfs_rq->load_last = now;
+		cfs_rq->load_avg += delta * load;
+	}
 
 	while (cfs_rq->load_period > period) {
 		/*
@@ -700,10 +711,8 @@ static void update_cfs_load(struct cfs_r
 		cfs_rq->load_avg /= 2;
 	}
 
-	if (lb && !cfs_rq->nr_running) {
-		if (cfs_rq->load_avg < (period / 8))
-			list_del_leaf_cfs_rq(cfs_rq);
-	}
+	if (!cfs_rq->curr && !cfs_rq->nr_running && !cfs_rq->load_avg)
+		list_del_leaf_cfs_rq(cfs_rq);
 }
 
 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
@@ -750,7 +759,7 @@ static void update_cfs_shares(struct cfs
 	reweight_entity(cfs_rq_of(se), se, shares);
 }
 #else /* CONFIG_FAIR_GROUP_SCHED */
-static inline void update_cfs_load(struct cfs_rq *cfs_rq, int lb)
+static inline void update_cfs_load(struct cfs_rq *cfs_rq)
 {
 }
 
@@ -880,7 +889,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	update_cfs_load(cfs_rq, 0);
+	update_cfs_load(cfs_rq);
 	update_cfs_shares(cfs_rq, se->load.weight);
 	account_entity_enqueue(cfs_rq, se);
 
@@ -941,7 +950,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 	if (se != cfs_rq->curr)
 		__dequeue_entity(cfs_rq, se);
 	se->on_rq = 0;
-	update_cfs_load(cfs_rq, 0);
+	update_cfs_load(cfs_rq);
 	account_entity_dequeue(cfs_rq, se);
 	update_min_vruntime(cfs_rq);
 	update_cfs_shares(cfs_rq, 0);
@@ -1176,7 +1185,7 @@ enqueue_task_fair(struct rq *rq, struct 
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
-		update_cfs_load(cfs_rq, 0);
+		update_cfs_load(cfs_rq);
 		update_cfs_shares(cfs_rq, 0);
 	}
 
@@ -1206,7 +1215,7 @@ static void dequeue_task_fair(struct rq 
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
-		update_cfs_load(cfs_rq, 0);
+		update_cfs_load(cfs_rq);
 		update_cfs_shares(cfs_rq, 0);
 	}
 
@@ -2019,7 +2028,7 @@ static int tg_shares_up(struct task_grou
 	raw_spin_lock_irqsave(&rq->lock, flags);
 
 	update_rq_clock(rq);
-	update_cfs_load(cfs_rq, 1);
+	update_cfs_load(cfs_rq);
 
 	load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
 	load_avg -= cfs_rq->load_contribution;
Index: kernel/sched.c
===================================================================
--- kernel/sched.c.orig
+++ kernel/sched.c
@@ -357,7 +357,7 @@ struct cfs_rq {
 
 	u64 load_avg;
 	u64 load_period;
-	u64 load_stamp;
+	u64 load_stamp, load_last;
 
 	unsigned long load_contribution;
 #endif

-- 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [tg_shares_up rewrite v3 06/11] sched: hierarchical order on shares update list
  2010-11-12  3:24 [tg_shares_up rewrite v3 00/11] reduce overhead for tg->shares distribution Paul Turner
                   ` (4 preceding siblings ...)
  2010-11-12  3:24 ` [tg_shares_up rewrite v3 05/11] sched: fix update_cfs_load synchronization Paul Turner
@ 2010-11-12  3:24 ` Paul Turner
  2010-11-12  3:24 ` [tg_shares_up rewrite v3 07/11] sched: add sysctl_sched_shares_window Paul Turner
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Paul Turner @ 2010-11-12  3:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Paul Turner,
	Bharata B Rao, Karl Rister, Balbir Singh, David Miller

[-- Attachment #1: sched-tg-order-ondemand-list.patch --]
[-- Type: text/plain, Size: 2357 bytes --]

Avoid duplicate shares update calls by ensuring children always appear before
parents in rq->leaf_cfs_rq_list.

This allows us to do a single in-order traversal for update_shares().

Since we always enqueue in bottom-up order this reduces to 2 cases:
1) Our parent is already in the list, e.g.

   root
     \
      b
      /\
      c d* (root->b->c already enqueued)

Since d's parent is enqueued we push it to the head of the list, implicitly ahead of b.

2) Our parent does not appear in the list (or we have no parent)
In this case we enqueue to the tail of the list; if our parent is subsequently
enqueued (bottom-up) it will appear to our right by the same rule (the
resulting order is traced below).
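
Working the example through: after root->b->c is enqueued the per-cpu list
reads c, b (each added at the tail, child first because we enqueue bottom-up);
when d wakes, its parent b is already on the list, so d is pushed to the head,
giving d, c, b.  Children always precede their parents, which is what lets
update_shares() fold the hierarchy in a single forward pass.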

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched_fair.c |   26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

Index: kernel/sched_fair.c
===================================================================
--- kernel/sched_fair.c.orig
+++ kernel/sched_fair.c
@@ -146,8 +146,20 @@ static inline struct cfs_rq *cpu_cfs_rq(
 static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	if (!cfs_rq->on_list) {
-		list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
+		/*
+		 * Ensure we either appear before our parent (if already
+		 * enqueued) or force our parent to appear after us when it is
+		 * enqueued.  The fact that we always enqueue bottom-up
+		 * reduces this to two cases.
+		 */
+		if (cfs_rq->tg->parent &&
+		    cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))]->on_list) {
+			list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
+				&rq_of(cfs_rq)->leaf_cfs_rq_list);
+		} else {
+			list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
 				&rq_of(cfs_rq)->leaf_cfs_rq_list);
+		}
 
 		cfs_rq->on_list = 1;
 	}
@@ -2009,7 +2021,7 @@ out:
 /*
  * update tg->load_weight by folding this cpu's load_avg
  */
-static int tg_shares_up(struct task_group *tg, int cpu)
+static int update_shares_cpu(struct task_group *tg, int cpu)
 {
 	struct cfs_rq *cfs_rq;
 	unsigned long flags;
@@ -2049,14 +2061,8 @@ static void update_shares(int cpu)
 	struct rq *rq = cpu_rq(cpu);
 
 	rcu_read_lock();
-	for_each_leaf_cfs_rq(rq, cfs_rq) {
-		struct task_group *tg = cfs_rq->tg;
-
-		do {
-			tg_shares_up(tg, cpu);
-			tg = tg->parent;
-		} while (tg);
-	}
+	for_each_leaf_cfs_rq(rq, cfs_rq)
+		update_shares_cpu(cfs_rq->tg, cpu);
 	rcu_read_unlock();
 }
 

-- 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [tg_shares_up rewrite v3 07/11] sched: add sysctl_sched_shares_window
  2010-11-12  3:24 [tg_shares_up rewrite v3 00/11] reduce overhead for tg->shares distribution Paul Turner
                   ` (5 preceding siblings ...)
  2010-11-12  3:24 ` [tg_shares_up rewrite v3 06/11] sched: hierarchical order on shares update list Paul Turner
@ 2010-11-12  3:24 ` Paul Turner
  2010-11-12  3:24 ` [tg_shares_up rewrite v3 08/11] sched: update shares on idle_balance Paul Turner
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Paul Turner @ 2010-11-12  3:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Paul Turner,
	Bharata B Rao, Karl Rister, Balbir Singh, David Miller

[-- Attachment #1: sched-tg-add-shares_window_sysctl.patch --]
[-- Type: text/plain, Size: 2326 bytes --]

Introduce a new sysctl for the shares window and disambiguate it from
sched_time_avg.

A 10ms window appears to be a good compromise between accuracy and performance.
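
(The value is in nanoseconds, so the 10000000UL default below is the quoted
10ms; assuming the usual kern_table mapping, it becomes tunable at runtime via
/proc/sys/kernel/sched_shares_window.)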

Signed-off-by: Paul Turner <pjt@google.com>

---
 include/linux/sched.h |    1 +
 kernel/sched_fair.c   |    9 ++++++++-
 kernel/sysctl.c       |    7 +++++++
 3 files changed, 16 insertions(+), 1 deletion(-)

Index: kernel/sched_fair.c
===================================================================
--- kernel/sched_fair.c.orig
+++ kernel/sched_fair.c
@@ -89,6 +89,13 @@ unsigned int normalized_sysctl_sched_wak
 
 const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
 
+/*
+ * The exponential sliding  window over which load is averaged for shares
+ * distribution.
+ * (default: 10msec)
+ */
+unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
+
 static const struct sched_class fair_sched_class;
 
 /**************************************************************
@@ -688,7 +695,7 @@ account_entity_dequeue(struct cfs_rq *cf
 #if defined CONFIG_SMP && defined CONFIG_FAIR_GROUP_SCHED
 static void update_cfs_load(struct cfs_rq *cfs_rq)
 {
-	u64 period = sched_avg_period();
+	u64 period = sysctl_sched_shares_window;
 	u64 now, delta;
 	unsigned long load = cfs_rq->load.weight;
 
Index: include/linux/sched.h
===================================================================
--- include/linux/sched.h.orig
+++ include/linux/sched.h
@@ -1883,6 +1883,7 @@ extern unsigned int sysctl_sched_migrati
 extern unsigned int sysctl_sched_nr_migrate;
 extern unsigned int sysctl_sched_time_avg;
 extern unsigned int sysctl_timer_migration;
+extern unsigned int sysctl_sched_shares_window;
 
 int sched_proc_update_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *length,
Index: kernel/sysctl.c
===================================================================
--- kernel/sysctl.c.orig
+++ kernel/sysctl.c
@@ -337,6 +337,13 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
+		.procname	= "sched_shares_window",
+		.data		= &sysctl_sched_shares_window,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "timer_migration",
 		.data		= &sysctl_timer_migration,
 		.maxlen		= sizeof(unsigned int),

-- 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [tg_shares_up rewrite v3 08/11] sched: update shares on idle_balance
  2010-11-12  3:24 [tg_shares_up rewrite v3 00/11] reduce overhead for tg->shares distribution Paul Turner
                   ` (6 preceding siblings ...)
  2010-11-12  3:24 ` [tg_shares_up rewrite v3 07/11] sched: add sysctl_sched_shares_window Paul Turner
@ 2010-11-12  3:24 ` Paul Turner
  2010-11-12  3:24 ` [tg_shares_up rewrite v3 09/11] sched: demand based update_cfs_load() Paul Turner
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Paul Turner @ 2010-11-12  3:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Paul Turner,
	Bharata B Rao, Karl Rister, Balbir Singh, David Miller

[-- Attachment #1: sched-tg-update_shares_on_idle_balance.patch --]
[-- Type: text/plain, Size: 675 bytes --]

Since shares updates are no longer expensive and effectively local, update them
at idle_balance().  This allows us to more quickly redistribute shares to
another cpu when this cpu goes idle.

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched_fair.c |    1 +
 1 file changed, 1 insertion(+)

Index: kernel/sched_fair.c
===================================================================
--- kernel/sched_fair.c.orig
+++ kernel/sched_fair.c
@@ -3301,6 +3301,7 @@ static void idle_balance(int this_cpu, s
 	 */
 	raw_spin_unlock(&this_rq->lock);
 
+	update_shares(this_cpu);
 	for_each_domain(this_cpu, sd) {
 		unsigned long interval;
 		int balance = 1;

-- 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [tg_shares_up rewrite v3 09/11] sched: demand based update_cfs_load()
  2010-11-12  3:24 [tg_shares_up rewrite v3 00/11] reduce overhead for tg->shares distribution Paul Turner
                   ` (7 preceding siblings ...)
  2010-11-12  3:24 ` [tg_shares_up rewrite v3 08/11] sched: update shares on idle_balance Paul Turner
@ 2010-11-12  3:24 ` Paul Turner
  2010-11-12 10:53   ` Peter Zijlstra
  2010-11-12  3:24 ` [tg_shares_up rewrite v3 10/11] sched: allow update_cfs_load to update global load Paul Turner
  2010-11-12  3:24 ` [tg_shares_up rewrite v3 11/11] sched: update tg->shares after cpu.shares write Paul Turner
  10 siblings, 1 reply; 18+ messages in thread
From: Paul Turner @ 2010-11-12  3:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Paul Turner,
	Bharata B Rao, Karl Rister, Balbir Singh, David Miller

[-- Attachment #1: sched-tg-demand_based_update_cfs_load.patch --]
[-- Type: text/plain, Size: 2254 bytes --]

When the system is busy, dilation of rq->next_balance makes lb->update_shares()
insufficiently frequent for threads which don't sleep (and thus see no
dequeue/enqueue updates).  Adjust for this by issuing a demand-based update
whenever the execution time accumulated since the last update is sufficient to
wrap our averaging window.

Signed-off-by: Paul Turner <pjt@google.com>
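
A stripped-down user-space model of the trigger (the constants and the fold()
stub are stand-ins; the real work happens in update_cfs_load() and
update_cfs_shares() in the hunks below):

#include <stdio.h>

#define WINDOW_NS 10000000ULL	/* mirrors the 10ms sysctl_sched_shares_window default */

static unsigned long long unacc_exec_time;

static void fold(void)		/* stands in for update_cfs_load()/update_cfs_shares() */
{
	printf("fold after %llu ns of uninterrupted execution\n", unacc_exec_time);
	unacc_exec_time = 0;	/* update_cfs_load() clears it, as in the patch below */
}

static void account(unsigned long long delta_exec)
{
	unacc_exec_time += delta_exec;
	if (unacc_exec_time > WINDOW_NS)
		fold();
}

int main(void)
{
	int i;

	/* a thread that never sleeps still has its load folded once per window */
	for (i = 0; i < 25; i++)
		account(1000000);	/* 1ms ticks of pure execution, no dequeue/enqueue */
	return 0;
}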

---
 kernel/sched.c      |    9 ++++++++-
 kernel/sched_fair.c |   12 ++++++++++++
 2 files changed, 20 insertions(+), 1 deletion(-)

Index: kernel/sched.c
===================================================================
--- kernel/sched.c.orig
+++ kernel/sched.c
@@ -355,9 +355,16 @@ struct cfs_rq {
 	 */
 	unsigned long h_load;
 
+	/*
+	 * Maintaining per-cpu shares distribution for group scheduling
+	 *
+	 * load_stamp is the last time we updated the load average
+	 * load_last is the last time we updated the load average and saw load
+	 * load_unacc_exec_time is currently unaccounted execution time
+	 */
 	u64 load_avg;
 	u64 load_period;
-	u64 load_stamp, load_last;
+	u64 load_stamp, load_last, load_unacc_exec_time;
 
 	unsigned long load_contribution;
 #endif
Index: kernel/sched_fair.c
===================================================================
--- kernel/sched_fair.c.orig
+++ kernel/sched_fair.c
@@ -539,6 +539,9 @@ static u64 sched_vslice(struct cfs_rq *c
 	return calc_delta_fair(sched_slice(cfs_rq, se), se);
 }
 
+static void update_cfs_load(struct cfs_rq *cfs_rq);
+static void update_cfs_shares(struct cfs_rq *cfs_rq, long weight_delta);
+
 /*
  * Update the current task's runtime statistics. Skip current tasks that
  * are not in our scheduling class.
@@ -558,6 +561,14 @@ __update_curr(struct cfs_rq *cfs_rq, str
 
 	curr->vruntime += delta_exec_weighted;
 	update_min_vruntime(cfs_rq);
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	cfs_rq->load_unacc_exec_time += delta_exec;
+	if (cfs_rq->load_unacc_exec_time > sysctl_sched_shares_window) {
+		update_cfs_load(cfs_rq);
+		update_cfs_shares(cfs_rq, 0);
+	}
+#endif
 }
 
 static void update_curr(struct cfs_rq *cfs_rq)
@@ -713,6 +724,7 @@ static void update_cfs_load(struct cfs_r
 	}
 
 	cfs_rq->load_stamp = now;
+	cfs_rq->load_unacc_exec_time = 0;
 	cfs_rq->load_period += delta;
 	if (load) {
 		cfs_rq->load_last = now;

-- 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [tg_shares_up rewrite v3 10/11] sched: allow update_cfs_load to update global load
  2010-11-12  3:24 [tg_shares_up rewrite v3 00/11] reduce overhead for tg->shares distribution Paul Turner
                   ` (8 preceding siblings ...)
  2010-11-12  3:24 ` [tg_shares_up rewrite v3 09/11] sched: demand based update_cfs_load() Paul Turner
@ 2010-11-12  3:24 ` Paul Turner
  2010-11-12  3:24 ` [tg_shares_up rewrite v3 11/11] sched: update tg->shares after cpu.shares write Paul Turner
  10 siblings, 0 replies; 18+ messages in thread
From: Paul Turner @ 2010-11-12  3:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Paul Turner,
	Bharata B Rao, Karl Rister, Balbir Singh, David Miller

[-- Attachment #1: sched-tg-global_update_load.patch --]
[-- Type: text/plain, Size: 4379 bytes --]

Refactor the global load updates from update_shares_cpu() so that
update_cfs_load() can update global load when it is more than ~10% out of sync.

The new global_update parameter allows us to force an update, regardless of
the error factor, so that we can synchronize w/ update_shares().

Signed-off-by: Paul Turner <pjt@google.com>
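
The "~10%" above is implemented as a drift of more than load_contribution/8
(12.5%) in the hunk below; a rough user-space model of the new
update_cfs_rq_load_contribution() behaviour (values invented):

#include <stdio.h>
#include <stdlib.h>

static long tg_load_weight;	/* models the shared atomic tg->load_weight */
static long load_contribution;	/* models the per-cpu cfs_rq->load_contribution */

static void update_load_contribution(long load_avg, int global_update)
{
	long delta = load_avg - load_contribution;

	/* only touch the shared sum when forced or when the drift is large */
	if (global_update || labs(delta) > load_contribution / 8) {
		tg_load_weight += delta;
		load_contribution += delta;
	}
}

int main(void)
{
	update_load_contribution(1000, 1);	/* forced (update_shares): sync to 1000 */
	update_load_contribution(1050, 0);	/* 5% drift: skipped, no shared-line write */
	update_load_contribution(1200, 0);	/* 20% drift: folded into the shared sum */
	printf("tg load %ld, local contribution %ld\n", tg_load_weight, load_contribution);
	return 0;
}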

---
 kernel/sched_fair.c |   44 +++++++++++++++++++++++++++++---------------
 1 file changed, 29 insertions(+), 15 deletions(-)

Index: kernel/sched_fair.c
===================================================================
--- kernel/sched_fair.c.orig
+++ kernel/sched_fair.c
@@ -539,7 +539,7 @@ static u64 sched_vslice(struct cfs_rq *c
 	return calc_delta_fair(sched_slice(cfs_rq, se), se);
 }
 
-static void update_cfs_load(struct cfs_rq *cfs_rq);
+static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update);
 static void update_cfs_shares(struct cfs_rq *cfs_rq, long weight_delta);
 
 /*
@@ -565,7 +565,7 @@ __update_curr(struct cfs_rq *cfs_rq, str
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	cfs_rq->load_unacc_exec_time += delta_exec;
 	if (cfs_rq->load_unacc_exec_time > sysctl_sched_shares_window) {
-		update_cfs_load(cfs_rq);
+		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq, 0);
 	}
 #endif
@@ -704,7 +704,22 @@ account_entity_dequeue(struct cfs_rq *cf
 }
 
 #if defined CONFIG_SMP && defined CONFIG_FAIR_GROUP_SCHED
-static void update_cfs_load(struct cfs_rq *cfs_rq)
+static void update_cfs_rq_load_contribution(struct cfs_rq *cfs_rq,
+					    int global_update)
+{
+	struct task_group *tg = cfs_rq->tg;
+	long load_avg;
+
+	load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
+	load_avg -= cfs_rq->load_contribution;
+
+	if (global_update || abs(load_avg) > cfs_rq->load_contribution / 8) {
+		atomic_add(load_avg, &tg->load_weight);
+		cfs_rq->load_contribution += load_avg;
+	}
+}
+
+static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
 {
 	u64 period = sysctl_sched_shares_window;
 	u64 now, delta;
@@ -731,6 +746,11 @@ static void update_cfs_load(struct cfs_r
 		cfs_rq->load_avg += delta * load;
 	}
 
+	/* consider updating load contribution on each fold or truncate */
+	if (global_update || cfs_rq->load_period > period
+	    || !cfs_rq->load_period)
+		update_cfs_rq_load_contribution(cfs_rq, global_update);
+
 	while (cfs_rq->load_period > period) {
 		/*
 		 * Inline assembly required to prevent the compiler
@@ -790,7 +810,7 @@ static void update_cfs_shares(struct cfs
 	reweight_entity(cfs_rq_of(se), se, shares);
 }
 #else /* CONFIG_FAIR_GROUP_SCHED */
-static inline void update_cfs_load(struct cfs_rq *cfs_rq)
+static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
 {
 }
 
@@ -920,7 +940,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	update_cfs_load(cfs_rq);
+	update_cfs_load(cfs_rq, 0);
 	update_cfs_shares(cfs_rq, se->load.weight);
 	account_entity_enqueue(cfs_rq, se);
 
@@ -981,7 +1001,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 	if (se != cfs_rq->curr)
 		__dequeue_entity(cfs_rq, se);
 	se->on_rq = 0;
-	update_cfs_load(cfs_rq);
+	update_cfs_load(cfs_rq, 0);
 	account_entity_dequeue(cfs_rq, se);
 	update_min_vruntime(cfs_rq);
 	update_cfs_shares(cfs_rq, 0);
@@ -1216,7 +1236,7 @@ enqueue_task_fair(struct rq *rq, struct 
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
-		update_cfs_load(cfs_rq);
+		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq, 0);
 	}
 
@@ -1246,7 +1266,7 @@ static void dequeue_task_fair(struct rq 
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
-		update_cfs_load(cfs_rq);
+		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq, 0);
 	}
 
@@ -2048,7 +2068,6 @@ static int update_shares_cpu(struct task
 	struct cfs_rq *cfs_rq;
 	unsigned long flags;
 	struct rq *rq;
-	long load_avg;
 
 	if (!tg->se[cpu])
 		return 0;
@@ -2059,12 +2078,7 @@ static int update_shares_cpu(struct task
 	raw_spin_lock_irqsave(&rq->lock, flags);
 
 	update_rq_clock(rq);
-	update_cfs_load(cfs_rq);
-
-	load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
-	load_avg -= cfs_rq->load_contribution;
-	atomic_add(load_avg, &tg->load_weight);
-	cfs_rq->load_contribution += load_avg;
+	update_cfs_load(cfs_rq, 1);
 
 	/*
 	 * We need to update shares after updating tg->load_weight in

-- 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [tg_shares_up rewrite v3 11/11] sched: update tg->shares after cpu.shares write
  2010-11-12  3:24 [tg_shares_up rewrite v3 00/11] reduce overhead for tg->shares distribution Paul Turner
                   ` (9 preceding siblings ...)
  2010-11-12  3:24 ` [tg_shares_up rewrite v3 10/11] sched: allow update_cfs_load to update global load Paul Turner
@ 2010-11-12  3:24 ` Paul Turner
  10 siblings, 0 replies; 18+ messages in thread
From: Paul Turner @ 2010-11-12  3:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Paul Turner,
	Bharata B Rao, Karl Rister, Balbir Singh, David Miller

[-- Attachment #1: sched-tg-fix_group_set_shares.patch --]
[-- Type: text/plain, Size: 2279 bytes --]

Formerly sched_group_set_shares would force a rebalance by overflowing domain
share sums.  Now that per-cpu averages are maintained we can set the true value
by issuing an update_cfs_shares() following a tg->shares update.

Also initialize tg se->load to 0 for consistency since we'll now set correct
weights on enqueue.

Signed-off-by: Paul Turner <pjt@google.com>
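
For reference, the path that ends in sched_group_set_shares() is a write to
the group's cpu.shares file; a minimal illustration (the /cgroup mount point
and group name are assumptions, adjust for the local hierarchy):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/cgroup/mygroup/cpu.shares", "w");

	if (!f) {
		perror("cpu.shares");
		return 1;
	}

	/* double the default weight of 1024; the value lands in tg->shares and
	 * is then propagated up the hierarchy by update_cfs_shares() below */
	fprintf(f, "2048\n");
	fclose(f);
	return 0;
}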

---
 kernel/sched.c |   42 +++++++++++-------------------------------
 1 file changed, 11 insertions(+), 31 deletions(-)

Index: kernel/sched.c
===================================================================
--- kernel/sched.c.orig
+++ kernel/sched.c
@@ -7575,7 +7575,7 @@ static void init_tg_cfs_entry(struct tas
 		se->cfs_rq = parent->my_q;
 
 	se->my_q = cfs_rq;
-	update_load_set(&se->load, tg->shares);
+	update_load_set(&se->load, 0);
 	se->parent = parent;
 }
 #endif
@@ -8194,37 +8194,12 @@ void sched_move_task(struct task_struct 
 #endif /* CONFIG_CGROUP_SCHED */
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-static void __set_se_shares(struct sched_entity *se, unsigned long shares)
-{
-	struct cfs_rq *cfs_rq = se->cfs_rq;
-	int on_rq;
-
-	on_rq = se->on_rq;
-	if (on_rq)
-		dequeue_entity(cfs_rq, se, 0);
-
-	update_load_set(&se->load, shares);
-
-	if (on_rq)
-		enqueue_entity(cfs_rq, se, 0);
-}
-
-static void set_se_shares(struct sched_entity *se, unsigned long shares)
-{
-	struct cfs_rq *cfs_rq = se->cfs_rq;
-	struct rq *rq = cfs_rq->rq;
-	unsigned long flags;
-
-	raw_spin_lock_irqsave(&rq->lock, flags);
-	__set_se_shares(se, shares);
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
-}
-
 static DEFINE_MUTEX(shares_mutex);
 
 int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 {
 	int i;
+	unsigned long flags;
 
 	/*
 	 * We can't change the weight of the root cgroup.
@@ -8243,10 +8218,15 @@ int sched_group_set_shares(struct task_g
 
 	tg->shares = shares;
 	for_each_possible_cpu(i) {
-		/*
-		 * force a rebalance
-		 */
-		set_se_shares(tg->se[i], shares);
+		struct rq *rq = cpu_rq(i);
+		struct sched_entity *se;
+
+		se = tg->se[i];
+		/* Propagate contribution to hierarchy */
+		raw_spin_lock_irqsave(&rq->lock, flags);
+		for_each_sched_entity(se)
+			update_cfs_shares(group_cfs_rq(se), 0);
+		raw_spin_unlock_irqrestore(&rq->lock, flags);
 	}
 
 done:

-- 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [tg_shares_up rewrite v3 09/11] sched: demand based update_cfs_load()
  2010-11-12  3:24 ` [tg_shares_up rewrite v3 09/11] sched: demand based update_cfs_load() Paul Turner
@ 2010-11-12 10:53   ` Peter Zijlstra
  2010-11-12 12:12     ` Peter Zijlstra
  2010-11-13  1:01     ` Paul Turner
  0 siblings, 2 replies; 18+ messages in thread
From: Peter Zijlstra @ 2010-11-12 10:53 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Bharata B Rao,
	Karl Rister, Balbir Singh, David Miller

On Thu, 2010-11-11 at 19:24 -0800, Paul Turner wrote:
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +       cfs_rq->load_unacc_exec_time += delta_exec;
> +       if (cfs_rq->load_unacc_exec_time > sysctl_sched_shares_window)
> {
> +               update_cfs_load(cfs_rq);
> +               update_cfs_shares(cfs_rq, 0);

Why not:
+		cfs_rq->load_unacc_exec_time -= sysctl_sched_shares_window;

(although you probably want to read the sysctl value into a local
variable using ACCESS_ONCE() and use that for both cases).
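
i.e. roughly the following (user-space sketch, with ACCESS_ONCE() modelled by
a volatile cast and made-up numbers, just to show the shape):

#include <stdio.h>

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

static unsigned long long sysctl_window = 10000000ULL;	/* may be retuned at runtime */
static unsigned long long unacc_exec_time;

static void account(unsigned long long delta_exec)
{
	/* one snapshot serves both the comparison and the subtraction */
	unsigned long long window = ACCESS_ONCE(sysctl_window);

	unacc_exec_time += delta_exec;
	if (unacc_exec_time > window) {
		unacc_exec_time -= window;	/* carry the remainder instead of zeroing */
		printf("update shares, %llu ns carried over\n", unacc_exec_time);
	}
}

int main(void)
{
	int i;

	for (i = 0; i < 25; i++)
		account(1000000);
	return 0;
}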

> +       }
> +#endif
>  }
>  
>  static void update_curr(struct cfs_rq *cfs_rq)
> @@ -713,6 +724,7 @@ static void update_cfs_load(struct cfs_r
>         }
>  
>         cfs_rq->load_stamp = now;
> +       cfs_rq->load_unacc_exec_time = 0; 

and drop this one?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [tg_shares_up rewrite v3 01/11] sched: rewrite tg_shares_up
  2010-11-12  3:24 ` [tg_shares_up rewrite v3 01/11] sched: rewrite tg_shares_up Paul Turner
@ 2010-11-12 11:10   ` Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2010-11-12 11:10 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Bharata B Rao,
	Karl Rister, Balbir Singh, David Miller

On Thu, 2010-11-11 at 19:24 -0800, Paul Turner wrote:
> --- include/linux/sched.h.orig
> +++ include/linux/sched.h 

You generated -p0 patches.. that caused utter destruction with my apply
scripts.

Could you please repost as -p1 patches?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [tg_shares_up rewrite v3 09/11] sched: demand based update_cfs_load()
  2010-11-12 10:53   ` Peter Zijlstra
@ 2010-11-12 12:12     ` Peter Zijlstra
  2010-11-13  1:01     ` Paul Turner
  1 sibling, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2010-11-12 12:12 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Bharata B Rao,
	Karl Rister, Balbir Singh, David Miller

On Fri, 2010-11-12 at 11:53 +0100, Peter Zijlstra wrote:
> On Thu, 2010-11-11 at 19:24 -0800, Paul Turner wrote:
> > +#ifdef CONFIG_FAIR_GROUP_SCHED
> > +       cfs_rq->load_unacc_exec_time += delta_exec;
> > +       if (cfs_rq->load_unacc_exec_time > sysctl_sched_shares_window)
> > {
> > +               update_cfs_load(cfs_rq);
> > +               update_cfs_shares(cfs_rq, 0);
> 
> Why not:
> +		cfs_rq->load_unacc_exec_time -= sysctl_sched_shares_window;
> 
> (although you probably want to read the sysctl value into a local
> variable using ACCESS_ONCE() and use that for both cases).

Ah, I see why not, still that reset looks better here than in that other
function.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [tg_shares_up rewrite v3 09/11] sched: demand based update_cfs_load()
  2010-11-12 10:53   ` Peter Zijlstra
  2010-11-12 12:12     ` Peter Zijlstra
@ 2010-11-13  1:01     ` Paul Turner
  2010-11-13  2:42       ` Paul Turner
  1 sibling, 1 reply; 18+ messages in thread
From: Paul Turner @ 2010-11-13  1:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Bharata B Rao,
	Karl Rister, Balbir Singh, David Miller

On Fri, Nov 12, 2010 at 2:53 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Thu, 2010-11-11 at 19:24 -0800, Paul Turner wrote:
>> +#ifdef CONFIG_FAIR_GROUP_SCHED
>> +       cfs_rq->load_unacc_exec_time += delta_exec;
>> +       if (cfs_rq->load_unacc_exec_time > sysctl_sched_shares_window)
>> {
>> +               update_cfs_load(cfs_rq);
>> +               update_cfs_shares(cfs_rq, 0);
>
> Why not:
> +               cfs_rq->load_unacc_exec_time -= sysctl_sched_shares_window;
>
> (although you probably want to read the sysctl value into a local
> variable using ACCESS_ONCE() and use that for both cases).
>

I think this is hard to do in a clean fashion (without a strange arg
to update_cfs_load).

I'm also not sure it's worth synchronizing on a shares_window change,
since when we trigger an update from update_curr() is independent of when
that time gets folded into the load average anyway.
(Note: even using sysctl_sched_shares_window is already a larger window
than update_cfs_load() will normally fold at, since we fold at window/2
after the initial period; it just sets a grace period on computation
without updates in the busy case.)

>> +       }
>> +#endif
>>  }
>>
>>  static void update_curr(struct cfs_rq *cfs_rq)
>> @@ -713,6 +724,7 @@ static void update_cfs_load(struct cfs_r
>>         }
>>
>>         cfs_rq->load_stamp = now;
>> +       cfs_rq->load_unacc_exec_time = 0;
>
> and drop this one?
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [tg_shares_up rewrite v3 09/11] sched: demand based update_cfs_load()
  2010-11-13  1:01     ` Paul Turner
@ 2010-11-13  2:42       ` Paul Turner
  2010-11-13 22:45         ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Paul Turner @ 2010-11-13  2:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Bharata B Rao,
	Karl Rister, Balbir Singh, David Miller

On Fri, Nov 12, 2010 at 5:01 PM, Paul Turner <pjt@google.com> wrote:
> On Fri, Nov 12, 2010 at 2:53 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>> On Thu, 2010-11-11 at 19:24 -0800, Paul Turner wrote:
>>> +#ifdef CONFIG_FAIR_GROUP_SCHED
>>> +       cfs_rq->load_unacc_exec_time += delta_exec;
>>> +       if (cfs_rq->load_unacc_exec_time > sysctl_sched_shares_window)
>>> {
>>> +               update_cfs_load(cfs_rq);
>>> +               update_cfs_shares(cfs_rq, 0);
>>
>> Why not:
>> +               cfs_rq->load_unacc_exec_time -= sysctl_sched_shares_window;
>>
>> (although you probably want to read the sysctl value into a local
>> variable using ACCESS_ONCE() and use that for both cases).
>>
>
> I think this is hard to do in a clean fashion (without a strange arg
> to update_cfs_load).
>
> I'm also not sure it's worth synchronizing on a shares_window change,
> since when we trigger an update from update_curr() is independent of when
> that time gets folded into the load average anyway.
> (Note: even using sysctl_sched_shares_window is already a larger window
> than update_cfs_load() will normally fold at, since we fold at window/2
> after the initial period; it just sets a grace period on computation
> without updates in the busy case.)
>

Let me know if you still think this warrants any changes and I'll fix
accordingly and repost with the "-p0" stripping fixed.

>>> +       }
>>> +#endif
>>>  }
>>>
>>>  static void update_curr(struct cfs_rq *cfs_rq)
>>> @@ -713,6 +724,7 @@ static void update_cfs_load(struct cfs_r
>>>         }
>>>
>>>         cfs_rq->load_stamp = now;
>>> +       cfs_rq->load_unacc_exec_time = 0;
>>
>> and drop this one?
>>
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [tg_shares_up rewrite v3 09/11] sched: demand based update_cfs_load()
  2010-11-13  2:42       ` Paul Turner
@ 2010-11-13 22:45         ` Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2010-11-13 22:45 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Bharata B Rao,
	Karl Rister, Balbir Singh, David Miller

On Fri, 2010-11-12 at 18:42 -0800, Paul Turner wrote:
> 
> Let me know if you still think this warrants any changes and I'll fix
> accordingly and repost with the "-p0" stripping fixed.
> 
Nah, it's fine; I just thought the code odd, but it does make sense.

Please repost the patches as -p1

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2010-11-13 22:45 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-11-12  3:24 [tg_shares_up rewrite v3 00/11] reduce overhead for tg->shares distribution Paul Turner
2010-11-12  3:24 ` [tg_shares_up rewrite v3 01/11] sched: rewrite tg_shares_up Paul Turner
2010-11-12 11:10   ` Peter Zijlstra
2010-11-12  3:24 ` [tg_shares_up rewrite v3 02/11] sched: on-demand (active) cfs_rq list Paul Turner
2010-11-12  3:24 ` [tg_shares_up rewrite v3 03/11] sched: make tg_shares_up() walk on-demand Paul Turner
2010-11-12  3:24 ` [tg_shares_up rewrite v3 04/11] sched: fix load corruption from update_cfs_shares Paul Turner
2010-11-12  3:24 ` [tg_shares_up rewrite v3 05/11] sched: fix update_cfs_load synchronization Paul Turner
2010-11-12  3:24 ` [tg_shares_up rewrite v3 06/11] sched: hierarchal order on shares update list Paul Turner
2010-11-12  3:24 ` [tg_shares_up rewrite v3 07/11] sched: add sysctl_sched_shares_window Paul Turner
2010-11-12  3:24 ` [tg_shares_up rewrite v3 08/11] sched: update shares on idle_balance Paul Turner
2010-11-12  3:24 ` [tg_shares_up rewrite v3 09/11] sched: demand based update_cfs_load() Paul Turner
2010-11-12 10:53   ` Peter Zijlstra
2010-11-12 12:12     ` Peter Zijlstra
2010-11-13  1:01     ` Paul Turner
2010-11-13  2:42       ` Paul Turner
2010-11-13 22:45         ` Peter Zijlstra
2010-11-12  3:24 ` [tg_shares_up rewrite v3 10/11] sched: allow update_cfs_load to update global load Paul Turner
2010-11-12  3:24 ` [tg_shares_up rewrite v3 11/11] sched: update tg->shares after cpu.shares write Paul Turner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).