* [RFC PATCHSET v2] sched/fair: fix load balancer behavior when cgroup is in use
@ 2017-05-04 20:28 Tejun Heo
  2017-05-04 20:29 ` [PATCH 1/3] sched/fair: Peter's shares_type patch Tejun Heo
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Tejun Heo @ 2017-05-04 20:28 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Linus Torvalds, Vincent Guittot, Mike Galbraith,
	Paul Turner, Chris Mason, kernel-team

Hello,

v1 posting can be found at

  http://lkml.kernel.org/r/20170424201344.GA14169@wtj.duckdns.org

The patchset is still RFC and based on v4.11.  I used Peter's updated
calc_cfs_shares() instead of scaling manually and updated the series so
that runnable_load_avg is propagated independently from load_avg.  Due
to the way sched_entity and cfs_rq loads are calculated, this requires
an extra runnable_load_avg calculation for group sched_entities, but
the end result is cleaner and actually makes sense.

Vincent, can you please verify whether the regression that you see is
gone with this version?

Thanks.

-- 
tejun


* [PATCH 1/3] sched/fair: Peter's shares_type patch
  2017-05-04 20:28 [RFC PATCHSET v2] sched/fair: fix load balancer behavior when cgroup is in use Tejun Heo
@ 2017-05-04 20:29 ` Tejun Heo
  2017-05-05 10:40   ` Vincent Guittot
  2017-05-04 20:29 ` [PATCH 2/3] sched/fair: Add load_weight->runnable_load_{sum|avg} Tejun Heo
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2017-05-04 20:29 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Linus Torvalds, Vincent Guittot, Mike Galbraith,
	Paul Turner, Chris Mason, kernel-team

From: Peter Zijlstra <peterz@infradead.org>

This patch is a combination of

   http://lkml.kernel.org/r/20170502081905.GA4626@worktop.programming.kicks-ass.net
 + http://lkml.kernel.org/r/20170502083009.GA3377@worktop.programming.kicks-ass.net
 + build fix & use shares_avg for propagating load_avg instead of runnable

This fixes the propagation problem described in the following while
keeping the group se->avg.load_avg in line with the matching
cfs_rq->avg.load_avg.

   http://lkml.kernel.org/r/20170424201415.GB14169@wtj.duckdns.org

---
 kernel/sched/fair.c |   98 +++++++++++++++++++++++++---------------------------
 1 file changed, 48 insertions(+), 50 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2636,26 +2636,57 @@ account_entity_dequeue(struct cfs_rq *cf
 	cfs_rq->nr_running--;
 }
 
+enum shares_type {
+	shares_runnable,
+	shares_avg,
+	shares_weight,
+};
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 # ifdef CONFIG_SMP
-static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
+static long
+calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg,
+		enum shares_type shares_type)
 {
-	long tg_weight, load, shares;
+	long tg_weight, tg_shares, load, shares;
 
-	/*
-	 * This really should be: cfs_rq->avg.load_avg, but instead we use
-	 * cfs_rq->load.weight, which is its upper bound. This helps ramp up
-	 * the shares for small weight interactive tasks.
-	 */
-	load = scale_load_down(cfs_rq->load.weight);
+	tg_shares = READ_ONCE(tg->shares);
+
+	switch (shares_type) {
+	case shares_runnable:
+		/*
+		 * Instead of the correct cfs_rq->avg.load_avg we use
+		 * cfs_rq->runnable_load_avg, which does not include the
+		 * blocked load.
+		 */
+		load = cfs_rq->runnable_load_avg;
+		break;
+
+	case shares_avg:
+		load = cfs_rq->avg.load_avg;
+		break;
+
+	case shares_weight:
+		/*
+		 * Instead of the correct cfs_rq->avg.load_avg we use
+		 * cfs_rq->load.weight, which is its upper bound. This helps
+		 * ramp up the shares for small weight interactive tasks.
+		 */
+		load = scale_load_down(cfs_rq->load.weight);
+		break;
+	}
 
 	tg_weight = atomic_long_read(&tg->load_avg);
 
-	/* Ensure tg_weight >= load */
+	/*
+	 * This ensures the sum is up-to-date for this CPU, in case of the other
+	 * two approximations it biases the sum towards their value and in case
+	 * of (near) UP ensures the division ends up <= 1.
+	 */
 	tg_weight -= cfs_rq->tg_load_avg_contrib;
 	tg_weight += load;
 
-	shares = (tg->shares * load);
+	shares = (tg_shares * load);
 	if (tg_weight)
 		shares /= tg_weight;
 
@@ -2671,15 +2702,11 @@ static long calc_cfs_shares(struct cfs_r
 	 * case no task is runnable on a CPU MIN_SHARES=2 should be returned
 	 * instead of 0.
 	 */
-	if (shares < MIN_SHARES)
-		shares = MIN_SHARES;
-	if (shares > tg->shares)
-		shares = tg->shares;
-
-	return shares;
+	return clamp_t(long, shares, MIN_SHARES, tg_shares);
 }
 # else /* CONFIG_SMP */
-static inline long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
+static inline long
+calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg, enum shares_type shares_type)
 {
 	return tg->shares;
 }
@@ -2721,7 +2748,7 @@ static void update_cfs_shares(struct sch
 	if (likely(se->load.weight == tg->shares))
 		return;
 #endif
-	shares = calc_cfs_shares(cfs_rq, tg);
+	shares = calc_cfs_shares(cfs_rq, tg, shares_weight);
 
 	reweight_entity(cfs_rq_of(se), se, shares);
 }
@@ -3078,39 +3105,10 @@ static inline void
 update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
-	long delta, load = gcfs_rq->avg.load_avg;
-
-	/*
-	 * If the load of group cfs_rq is null, the load of the
-	 * sched_entity will also be null so we can skip the formula
-	 */
-	if (load) {
-		long tg_load;
-
-		/* Get tg's load and ensure tg_load > 0 */
-		tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
-
-		/* Ensure tg_load >= load and updated with current load*/
-		tg_load -= gcfs_rq->tg_load_avg_contrib;
-		tg_load += load;
-
-		/*
-		 * We need to compute a correction term in the case that the
-		 * task group is consuming more CPU than a task of equal
-		 * weight. A task with a weight equals to tg->shares will have
-		 * a load less or equal to scale_load_down(tg->shares).
-		 * Similarly, the sched_entities that represent the task group
-		 * at parent level, can't have a load higher than
-		 * scale_load_down(tg->shares). And the Sum of sched_entities'
-		 * load must be <= scale_load_down(tg->shares).
-		 */
-		if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
-			/* scale gcfs_rq's load into tg's shares*/
-			load *= scale_load_down(gcfs_rq->tg->shares);
-			load /= tg_load;
-		}
-	}
+	long load, delta;
 
+	load = scale_load_down(calc_cfs_shares(gcfs_rq, gcfs_rq->tg,
+					       shares_avg));
 	delta = load - se->avg.load_avg;
 
 	/* Nothing to update */


* [PATCH 2/3] sched/fair: Add load_weight->runnable_load_{sum|avg}
  2017-05-04 20:28 [RFC PATCHSET v2] sched/fair: fix load balancer behavior when cgroup is in use Tejun Heo
  2017-05-04 20:29 ` [PATCH 1/3] sched/fair: Peter's shares_type patch Tejun Heo
@ 2017-05-04 20:29 ` Tejun Heo
  2017-05-05 13:22   ` Dietmar Eggemann
  2017-05-04 20:30 ` [PATCH 3/3] sched/fair: Propagate runnable_load_avg independently from load_avg Tejun Heo
  2017-05-05  8:46 ` [RFC PATCHSET v2] sched/fair: fix load balancer behavior when cgroup is in use Vincent Guittot
  3 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2017-05-04 20:29 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Linus Torvalds, Vincent Guittot, Mike Galbraith,
	Paul Turner, Chris Mason, kernel-team

Currently, runnable_load_avg, which represents the portion of load avg
only from tasks which are currently active, is tracked by cfs_rq but
not by sched_entity.  We want to propagate runnable_load_avg of a
nested cfs_rq without affecting load_avg propagation.  To implement an
equivalent propagation channel, sched_entity needs to track
runnable_load_avg too.

This patch moves cfs_rq->runnable_load_{sum|avg} into struct
sched_avg, which already tracks load_{sum|avg} and is embedded in both
cfs_rq and sched_entity.

This patch only changes where runnable_load_{sum|avg} are located and
doesn't cause any actual behavior changes.  The fields are still only
used for cfs_rqs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Turner <pjt@google.com>
Cc: Chris Mason <clm@fb.com>
---
 include/linux/sched.h |    2 +
 kernel/sched/debug.c  |    2 -
 kernel/sched/fair.c   |   63 +++++++++++++++++++++++++-------------------------
 kernel/sched/sched.h  |    2 -
 4 files changed, 35 insertions(+), 34 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -314,9 +314,11 @@ struct load_weight {
 struct sched_avg {
 	u64				last_update_time;
 	u64				load_sum;
+	u64				runnable_load_sum;
 	u32				util_sum;
 	u32				period_contrib;
 	unsigned long			load_avg;
+	unsigned long			runnable_load_avg;
 	unsigned long			util_avg;
 };
 
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -516,7 +516,7 @@ void print_cfs_rq(struct seq_file *m, in
 	SEQ_printf(m, "  .%-30s: %lu\n", "load_avg",
 			cfs_rq->avg.load_avg);
 	SEQ_printf(m, "  .%-30s: %lu\n", "runnable_load_avg",
-			cfs_rq->runnable_load_avg);
+			cfs_rq->avg.runnable_load_avg);
 	SEQ_printf(m, "  .%-30s: %lu\n", "util_avg",
 			cfs_rq->avg.util_avg);
 	SEQ_printf(m, "  .%-30s: %ld\n", "removed_load_avg",
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2656,10 +2656,10 @@ calc_cfs_shares(struct cfs_rq *cfs_rq, s
 	case shares_runnable:
 		/*
 		 * Instead of the correct cfs_rq->avg.load_avg we use
-		 * cfs_rq->runnable_load_avg, which does not include the
-		 * blocked load.
+		 * cfs_rq->avg.runnable_load_avg, which does not include
+		 * the blocked load.
 		 */
-		load = cfs_rq->runnable_load_avg;
+		load = cfs_rq->avg.runnable_load_avg;
 		break;
 
 	case shares_avg:
@@ -2877,7 +2877,7 @@ static u32 __compute_runnable_contrib(u6
  */
 static __always_inline int
 __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
-		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
+		  unsigned long weight, int running, bool update_runnable)
 {
 	u64 delta, scaled_delta, periods;
 	u32 contrib;
@@ -2923,10 +2923,8 @@ __update_load_avg(u64 now, int cpu, stru
 		scaled_delta_w = cap_scale(delta_w, scale_freq);
 		if (weight) {
 			sa->load_sum += weight * scaled_delta_w;
-			if (cfs_rq) {
-				cfs_rq->runnable_load_sum +=
-						weight * scaled_delta_w;
-			}
+			if (update_runnable)
+				sa->runnable_load_sum += weight * scaled_delta_w;
 		}
 		if (running)
 			sa->util_sum += scaled_delta_w * scale_cpu;
@@ -2938,9 +2936,9 @@ __update_load_avg(u64 now, int cpu, stru
 		delta %= 1024;
 
 		sa->load_sum = decay_load(sa->load_sum, periods + 1);
-		if (cfs_rq) {
-			cfs_rq->runnable_load_sum =
-				decay_load(cfs_rq->runnable_load_sum, periods + 1);
+		if (update_runnable) {
+			sa->runnable_load_sum =
+				decay_load(sa->runnable_load_sum, periods + 1);
 		}
 		sa->util_sum = decay_load((u64)(sa->util_sum), periods + 1);
 
@@ -2949,8 +2947,8 @@ __update_load_avg(u64 now, int cpu, stru
 		contrib = cap_scale(contrib, scale_freq);
 		if (weight) {
 			sa->load_sum += weight * contrib;
-			if (cfs_rq)
-				cfs_rq->runnable_load_sum += weight * contrib;
+			if (update_runnable)
+				sa->runnable_load_sum += weight * contrib;
 		}
 		if (running)
 			sa->util_sum += contrib * scale_cpu;
@@ -2960,8 +2958,8 @@ __update_load_avg(u64 now, int cpu, stru
 	scaled_delta = cap_scale(delta, scale_freq);
 	if (weight) {
 		sa->load_sum += weight * scaled_delta;
-		if (cfs_rq)
-			cfs_rq->runnable_load_sum += weight * scaled_delta;
+		if (update_runnable)
+			sa->runnable_load_sum += weight * scaled_delta;
 	}
 	if (running)
 		sa->util_sum += scaled_delta * scale_cpu;
@@ -2970,9 +2968,9 @@ __update_load_avg(u64 now, int cpu, stru
 
 	if (decayed) {
 		sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX);
-		if (cfs_rq) {
-			cfs_rq->runnable_load_avg =
-				div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);
+		if (update_runnable) {
+			sa->runnable_load_avg =
+				div_u64(sa->runnable_load_sum, LOAD_AVG_MAX);
 		}
 		sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
 	}
@@ -3075,7 +3073,7 @@ void set_task_rq_fair(struct sched_entit
 		n_last_update_time = next->avg.last_update_time;
 #endif
 		__update_load_avg(p_last_update_time, cpu_of(rq_of(prev)),
-				  &se->avg, 0, 0, NULL);
+				  &se->avg, 0, 0, false);
 		se->avg.last_update_time = n_last_update_time;
 	}
 }
@@ -3129,8 +3127,9 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq
 	 */
 	if (se->on_rq) {
 		/* Update parent cfs_rq runnable_load_avg */
-		add_positive(&cfs_rq->runnable_load_avg, delta);
-		cfs_rq->runnable_load_sum = cfs_rq->runnable_load_avg * LOAD_AVG_MAX;
+		add_positive(&cfs_rq->avg.runnable_load_avg, delta);
+		cfs_rq->avg.runnable_load_sum =
+			cfs_rq->avg.runnable_load_avg * LOAD_AVG_MAX;
 	}
 }
 
@@ -3264,7 +3263,7 @@ update_cfs_rq_load_avg(u64 now, struct c
 	}
 
 	decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
-		scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL, cfs_rq);
+		scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL, true);
 
 #ifndef CONFIG_64BIT
 	smp_wmb();
@@ -3299,7 +3298,7 @@ static inline void update_load_avg(struc
 	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) {
 		__update_load_avg(now, cpu, &se->avg,
 			  se->on_rq * scale_load_down(se->load.weight),
-			  cfs_rq->curr == se, NULL);
+			  cfs_rq->curr == se, false);
 	}
 
 	decayed  = update_cfs_rq_load_avg(now, cfs_rq, true);
@@ -3355,8 +3354,8 @@ enqueue_entity_load_avg(struct cfs_rq *c
 {
 	struct sched_avg *sa = &se->avg;
 
-	cfs_rq->runnable_load_avg += sa->load_avg;
-	cfs_rq->runnable_load_sum += sa->load_sum;
+	cfs_rq->avg.runnable_load_avg += sa->load_avg;
+	cfs_rq->avg.runnable_load_sum += sa->load_sum;
 
 	if (!sa->last_update_time) {
 		attach_entity_load_avg(cfs_rq, se);
@@ -3368,10 +3367,12 @@ enqueue_entity_load_avg(struct cfs_rq *c
 static inline void
 dequeue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	cfs_rq->runnable_load_avg =
-		max_t(long, cfs_rq->runnable_load_avg - se->avg.load_avg, 0);
-	cfs_rq->runnable_load_sum =
-		max_t(s64,  cfs_rq->runnable_load_sum - se->avg.load_sum, 0);
+	struct sched_avg *sa = &se->avg;
+
+	cfs_rq->avg.runnable_load_avg =
+		max_t(long, cfs_rq->avg.runnable_load_avg - sa->load_avg, 0);
+	cfs_rq->avg.runnable_load_sum =
+		max_t(s64,  cfs_rq->avg.runnable_load_sum - sa->load_sum, 0);
 }
 
 #ifndef CONFIG_64BIT
@@ -3405,7 +3406,7 @@ void sync_entity_load_avg(struct sched_e
 	u64 last_update_time;
 
 	last_update_time = cfs_rq_last_update_time(cfs_rq);
-	__update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, NULL);
+	__update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, false);
 }
 
 /*
@@ -3433,7 +3434,7 @@ void remove_entity_load_avg(struct sched
 
 static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq)
 {
-	return cfs_rq->runnable_load_avg;
+	return cfs_rq->avg.runnable_load_avg;
 }
 
 static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -422,8 +422,6 @@ struct cfs_rq {
 	 * CFS load tracking
 	 */
 	struct sched_avg avg;
-	u64 runnable_load_sum;
-	unsigned long runnable_load_avg;
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	unsigned long tg_load_avg_contrib;
 	unsigned long propagate_avg;


* [PATCH 3/3] sched/fair: Propagate runnable_load_avg independently from load_avg
  2017-05-04 20:28 [RFC PATCHSET v2] sched/fair: fix load balancer behavior when cgroup is in use Tejun Heo
  2017-05-04 20:29 ` [PATCH 1/3] sched/fair: Peter's shares_type patch Tejun Heo
  2017-05-04 20:29 ` [PATCH 2/3] sched/fair: Add load_weight->runnable_load_{sum|avg} Tejun Heo
@ 2017-05-04 20:30 ` Tejun Heo
  2017-05-05 10:42   ` Vincent Guittot
  2017-05-05 16:51   ` Vincent Guittot
  2017-05-05  8:46 ` [RFC PATCHSET v2] sched/fair: fix load balancer behavior when cgroup is in use Vincent Guittot
  3 siblings, 2 replies; 20+ messages in thread
From: Tejun Heo @ 2017-05-04 20:30 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Linus Torvalds, Vincent Guittot, Mike Galbraith,
	Paul Turner, Chris Mason, kernel-team

We noticed that with cgroup CPU controller in use, the scheduling
latency gets wonky regardless of nesting level or weight
configuration.  This is easily reproducible with Chris Mason's
schbench[1].

All tests are run on a single socket, 16 cores, 32 threads machine.
While the machine is mostly idle, it isn't completely idle; there's
always some variable management load going on.  The command used is

 schbench -m 2 -t 16 -s 10000 -c 15000 -r 30

which measures scheduling latency with 32 threads each of which
repeatedly runs for 15ms and then sleeps for 10ms.  Here's a
representative result when running from the root cgroup.

 # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
 Latency percentiles (usec)
	 50.0000th: 26
	 75.0000th: 62
	 90.0000th: 74
	 95.0000th: 86
	 *99.0000th: 887
	 99.5000th: 3692
	 99.9000th: 10832
	 min=0, max=13374

The following is inside a first level CPU cgroup with the maximum
weight.

 # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
 Latency percentiles (usec)
	 50.0000th: 31
	 75.0000th: 65
	 90.0000th: 71
	 95.0000th: 91
	 *99.0000th: 7288
	 99.5000th: 10352
	 99.9000th: 12496
	 min=0, max=13023

Note the drastic increase in p99 scheduling latency.  After
investigation, it turned out that update_sd_lb_stats(), which is
used by load_balance() to pick the most loaded group, was often
picking the wrong group.  A CPU which has one schbench thread running
and another queued wouldn't report the correspondingly higher
weighted_cpuload() and would get overlooked as the target of load
balancing.

weighted_cpuload() is the root cfs_rq's runnable_load_avg which is the
sum of the load_avg of all active sched_entities.  Without cgroups or
at the root cgroup, each task's load_avg contributes directly to the
sum.  When a task wakes up or goes to sleep, the change is immediately
reflected on runnable_load_avg which in turn affects load balancing.

However, when CPU cgroup is in use, a nested cfs_rq blocks this
immediate propagation.  When a task wakes up inside a cgroup, the
nested cfs_rq's runnable_load_avg is updated but doesn't get
propagated to the parent.  It only affects the matching sched_entity's
load_avg over time which then gets propagated to the runnable_load_avg
at that level.  This makes the runnable_load_avg at the root queue
incorrectly include blocked load_avgs of tasks queued in nested
cfs_rqs causing the load balancer to make the wrong choices.

This patch fixes the bug by propagating nested cfs_rq's
runnable_load_avg independently from load_avg.  Tasks still contribute
to their cfs_rq's runnable_load_avg the same way; however, a nested
cfs_rq directly propagates the scaled runnable_load_avg to the
matching group sched_entity's avg.runnable_load_avg and keeps the se's
and parent cfs_rq's runnable_load_avg in sync.

This ensures that, for any given cfs_rq, its runnable_load_avg is the
sum of the scaled load_avgs of all and only active tasks queued on it
and its descendants.  This allows the load balancer to operate on the
same information whether there are nested cfs_rqs or not.

With the patch applied, the p99 latency from inside a cgroup is
equivalent to the root cgroup case.

 # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
 Latency percentiles (usec)
	 50.0000th: 40
	 75.0000th: 71
	 90.0000th: 89
	 95.0000th: 108
	 *99.0000th: 679
	 99.5000th: 3500
	 99.9000th: 10960
	 min=0, max=13790

[1] git://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Turner <pjt@google.com>
Cc: Chris Mason <clm@fb.com>
---
 kernel/sched/fair.c |   55 ++++++++++++++++++++++++++++++++++------------------
 1 file changed, 37 insertions(+), 18 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3098,6 +3098,30 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
 	cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * LOAD_AVG_MAX;
 }
 
+static inline void
+update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
+	long load, delta;
+
+	load = scale_load_down(calc_cfs_shares(gcfs_rq, gcfs_rq->tg,
+					       shares_runnable));
+	delta = load - se->avg.runnable_load_avg;
+
+	/* Nothing to update */
+	if (!delta)
+		return;
+
+	/* Set new sched_entity's load */
+	se->avg.runnable_load_avg = load;
+	se->avg.runnable_load_sum = se->avg.runnable_load_avg * LOAD_AVG_MAX;
+
+	/* Update parent cfs_rq load */
+	add_positive(&cfs_rq->avg.runnable_load_avg, delta);
+	cfs_rq->avg.runnable_load_sum =
+		cfs_rq->avg.runnable_load_avg * LOAD_AVG_MAX;
+}
+
 /* Take into account change of load of a child task group */
 static inline void
 update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -3120,17 +3144,6 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq
 	/* Update parent cfs_rq load */
 	add_positive(&cfs_rq->avg.load_avg, delta);
 	cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
-
-	/*
-	 * If the sched_entity is already enqueued, we also have to update the
-	 * runnable load avg.
-	 */
-	if (se->on_rq) {
-		/* Update parent cfs_rq runnable_load_avg */
-		add_positive(&cfs_rq->avg.runnable_load_avg, delta);
-		cfs_rq->avg.runnable_load_sum =
-			cfs_rq->avg.runnable_load_avg * LOAD_AVG_MAX;
-	}
 }
 
 static inline void set_tg_cfs_propagate(struct cfs_rq *cfs_rq)
@@ -3152,16 +3165,16 @@ static inline int test_and_clear_tg_cfs_
 /* Update task and its cfs_rq load average */
 static inline int propagate_entity_load_avg(struct sched_entity *se)
 {
-	struct cfs_rq *cfs_rq;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 	if (entity_is_task(se))
 		return 0;
 
+	update_tg_cfs_runnable(cfs_rq, se);
+
 	if (!test_and_clear_tg_cfs_propagate(se))
 		return 0;
 
-	cfs_rq = cfs_rq_of(se);
-
 	set_tg_cfs_propagate(cfs_rq);
 
 	update_tg_cfs_util(cfs_rq, se);
@@ -3298,7 +3311,7 @@ static inline void update_load_avg(struc
 	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) {
 		__update_load_avg(now, cpu, &se->avg,
 			  se->on_rq * scale_load_down(se->load.weight),
-			  cfs_rq->curr == se, false);
+			  cfs_rq->curr == se, !entity_is_task(se));
 	}
 
 	decayed  = update_cfs_rq_load_avg(now, cfs_rq, true);
@@ -3354,8 +3367,10 @@ enqueue_entity_load_avg(struct cfs_rq *c
 {
 	struct sched_avg *sa = &se->avg;
 
-	cfs_rq->avg.runnable_load_avg += sa->load_avg;
-	cfs_rq->avg.runnable_load_sum += sa->load_sum;
+	if (entity_is_task(se)) {
+		cfs_rq->avg.runnable_load_avg += sa->load_avg;
+		cfs_rq->avg.runnable_load_sum += sa->load_sum;
+	}
 
 	if (!sa->last_update_time) {
 		attach_entity_load_avg(cfs_rq, se);
@@ -3369,6 +3384,9 @@ dequeue_entity_load_avg(struct cfs_rq *c
 {
 	struct sched_avg *sa = &se->avg;
 
+	if (!entity_is_task(se))
+		return;
+
 	cfs_rq->avg.runnable_load_avg =
 		max_t(long, cfs_rq->avg.runnable_load_avg - sa->load_avg, 0);
 	cfs_rq->avg.runnable_load_sum =
@@ -3406,7 +3424,8 @@ void sync_entity_load_avg(struct sched_e
 	u64 last_update_time;
 
 	last_update_time = cfs_rq_last_update_time(cfs_rq);
-	__update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, false);
+	__update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)),
+			  &se->avg, 0, 0, !entity_is_task(se));
 }
 
 /*


* Re: [RFC PATCHSET v2] sched/fair: fix load balancer behavior when cgroup is in use
  2017-05-04 20:28 [RFC PATCHSET v2] sched/fair: fix load balancer behavior when cgroup is in use Tejun Heo
                   ` (2 preceding siblings ...)
  2017-05-04 20:30 ` [PATCH 3/3] sched/fair: Propagate runnable_load_avg independently from load_avg Tejun Heo
@ 2017-05-05  8:46 ` Vincent Guittot
  2017-05-05 13:28   ` Tejun Heo
  3 siblings, 1 reply; 20+ messages in thread
From: Vincent Guittot @ 2017-05-05  8:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hi Tejun,

On 4 May 2017 at 22:28, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> v1 posting can be found at
>
>   http://lkml.kernel.org/r/20170424201344.GA14169@wtj.duckdns.org
>
> The patchset is still RFC and based on v4.11.  I used Peter's updated
> calc_cfs_shares() instead of scaling manually and updated so that
> runnable_load_avg is propagated independently from load_avg.  Due to
> the way sched_entity and cfs_rq loads are calculated, this requires an
> extra runnable_load_avg calculation for group sched_entities, but the
> end result is cleaner and actually makes sense.
>
> Vincent, can you please verify whether the regression that you see is
> gone with this version?

schbench results look better with this version
Latency percentiles (usec)
50.0000th: 212
75.0000th: 292
90.0000th: 385
95.0000th: 439
*99.0000th: 671
99.5000th: 7992
99.9000th: 12176
min=0, max=14855

p99 is back to a normal value but p99.5 stays higher than mainline

I have also checked the load_avg and runnable_load_avg values and there
is something incorrect.  I will provide details on the related patch.

Regards,
Vincent
>
> Thanks.
>
> --
> tejun


* Re: [PATCH 1/3] sched/fair: Peter's shares_type patch
  2017-05-04 20:29 ` [PATCH 1/3] sched/fair: Peter's shares_type patch Tejun Heo
@ 2017-05-05 10:40   ` Vincent Guittot
  2017-05-05 15:30     ` Tejun Heo
  2017-05-05 15:41     ` Peter Zijlstra
  0 siblings, 2 replies; 20+ messages in thread
From: Vincent Guittot @ 2017-05-05 10:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On 4 May 2017 at 22:29, Tejun Heo <tj@kernel.org> wrote:
> From: Peter Zijlstra <peterz@infradead.org>
>
> This patch is combination of
>
>    http://lkml.kernel.org/r/20170502081905.GA4626@worktop.programming.kicks-ass.net
>  + http://lkml.kernel.org/r/20170502083009.GA3377@worktop.programming.kicks-ass.net
>  + build fix & use shares_avg for propagating load_avg instead of runnable
>
> This fixes the propagation problem described in the following while
> keeping group se->load_avg.avg in line with the matching
> cfs_rq->load_avg.avg.
>
>    http://lkml.kernel.org/r/20170424201415.GB14169@wtj.duckdns.org
>
> ---
>  kernel/sched/fair.c |   98 +++++++++++++++++++++++++---------------------------
>  1 file changed, 48 insertions(+), 50 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2636,26 +2636,57 @@ account_entity_dequeue(struct cfs_rq *cf
>         cfs_rq->nr_running--;
>  }
>
> +enum shares_type {
> +       shares_runnable,
> +       shares_avg,
> +       shares_weight,
> +};
> +
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  # ifdef CONFIG_SMP
> -static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
> +static long
> +calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg,
> +               enum shares_type shares_type)
>  {
> -       long tg_weight, load, shares;
> +       long tg_weight, tg_shares, load, shares;
>
> -       /*
> -        * This really should be: cfs_rq->avg.load_avg, but instead we use
> -        * cfs_rq->load.weight, which is its upper bound. This helps ramp up
> -        * the shares for small weight interactive tasks.
> -        */
> -       load = scale_load_down(cfs_rq->load.weight);
> +       tg_shares = READ_ONCE(tg->shares);
> +
> +       switch (shares_type) {
> +       case shares_runnable:
> +               /*
> +                * Instead of the correct cfs_rq->avg.load_avg we use
> +                * cfs_rq->runnable_load_avg, which does not include the
> +                * blocked load.
> +                */
> +               load = cfs_rq->runnable_load_avg;
> +               break;
> +
> +       case shares_avg:
> +               load = cfs_rq->avg.load_avg;
> +               break;
> +
> +       case shares_weight:
> +               /*
> +                * Instead of the correct cfs_rq->avg.load_avg we use
> +                * cfs_rq->load.weight, which is its upper bound. This helps
> +                * ramp up the shares for small weight interactive tasks.
> +                */
> +               load = scale_load_down(cfs_rq->load.weight);
> +               break;
> +       }
>
>         tg_weight = atomic_long_read(&tg->load_avg);
>
> -       /* Ensure tg_weight >= load */
> +       /*
> +        * This ensures the sum is up-to-date for this CPU, in case of the other
> +        * two approximations it biases the sum towards their value and in case
> +        * of (near) UP ensures the division ends up <= 1.
> +        */
>         tg_weight -= cfs_rq->tg_load_avg_contrib;
>         tg_weight += load;

The above is correct for shares_avg and shares_weight, but it's not
correct for shares_runnable, because runnable_load_avg is a subset of
load_avg.

To highlight this, I have run a single periodic task TA which runs for
3ms with a period of 7.777ms.  In case you wonder why 7.777ms, it is
just to make sure it is not synced with the tick (4ms on my platform)
and to cover more cases, but this doesn't have any impact in this
example.  TA is the only task; nothing else runs on the CPU, and the
background activity is close to nothing and doesn't affect the results
below.

When TA runs in the root domain, {root
cfs_rq,TA}->avg.{load_avg,util_avg} are in the range [390..440] and
root cfs_rq->avg.runnable_load_avg toggles between 0 when the cfs_rq is
idle and [390..440] when TA is running.

When TA runs in a dedicated cgroup, {root
cfs_rq,TA}->avg.{load_avg,util_avg} remain in the range [390..440],
but root cfs_rq->avg.runnable_load_avg is in the range [1012..1036]
and often stays in this range even while TA is sleeping.  I have
checked that I was correctly tracing all PELT metric updates and that
root cfs_rq->avg.runnable_load_avg really is not null and is not the
side effect of a missing trace.

In fact, if TA is the only task in the cgroup, tg->load_avg ==
cfs_rq->tg_load_avg_contrib.
For shares_avg, tg_weight == cfs_rq->runnable_load_avg and shares =
cfs_rq->runnable_load_avg * tg->shares / cfs_rq->runnable_load_avg =
tg->shares = 1024.
Then, because of rounding in the PELT computation, the child
cfs_rq->runnable_load_avg can sometimes be non-null (around 2 or 3)
when idle, so the group entity and root cfs_rq stay around 1024.
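
Spelling that arithmetic out as a standalone model (illustrative only,
not kernel code; the helper name is made up and the rounded numbers
1024/420/2 are just the ones from the trace above):

#include <stdio.h>

/* mirrors the shares computation from patch 1, scaled-down values */
static long calc_shares_model(long tg_shares, long tg_load_avg,
			      long tg_load_avg_contrib, long load)
{
	long tg_weight, shares;

	tg_weight = tg_load_avg - tg_load_avg_contrib + load;
	shares = tg_shares * load;
	if (tg_weight)
		shares /= tg_weight;

	if (shares < 2)			/* MIN_SHARES */
		shares = 2;
	if (shares > tg_shares)
		shares = tg_shares;
	return shares;
}

int main(void)
{
	/*
	 * TA is sleeping: the child cfs_rq's runnable load has decayed
	 * to ~2, while tg->load_avg == tg_load_avg_contrib (~420)
	 * because TA is the only task in the group.
	 */
	printf("%ld\n", calc_shares_model(1024, 420, 420, 2)); /* 1024 */
	return 0;
}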

For shares_runnable, it should be

group_entity->runnable_load_avg = cfs_rq->runnable_load_avg *
group_entity->avg.load_avg / cfs_rq->avg.load_avg

And I think that shares_avg is also wrong and should be something like:

group_entity->avg.load_avg = cfs_rq->avg.load_avg *
group_entity->load.weight / cfs_rq->load.weight

Then we have to take into account the fact that when we propagate
load, the group_entity->load.weight and cfs_rq->load.weight of the prev
and new CPUs have not been updated yet.  I still need to think a bit
more about that impact and how to compensate for it when propagating
load.
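
To make those two rules concrete, here is a rough standalone sketch
(illustrative only, not a patch; the function names are made up and
all values are the scaled-down PELT quantities):

/*
 * group entity's runnable contribution: the child cfs_rq's runnable
 * load, scaled by the share of the child's total load that this
 * group entity represents.
 */
long prop_runnable_load_avg(long gcfs_runnable_load_avg,
			    long gse_load_avg, long gcfs_load_avg)
{
	if (!gcfs_load_avg)
		return 0;
	return gcfs_runnable_load_avg * gse_load_avg / gcfs_load_avg;
}

/*
 * group entity's load_avg: the child cfs_rq's load_avg scaled by the
 * weight ratio.
 */
long prop_load_avg(long gcfs_load_avg,
		   long gse_weight, long gcfs_weight)
{
	if (!gcfs_weight)
		return 0;
	return gcfs_load_avg * gse_weight / gcfs_weight;
}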


>
> -       shares = (tg->shares * load);
> +       shares = (tg_shares * load);
>         if (tg_weight)
>                 shares /= tg_weight;
>
> @@ -2671,15 +2702,11 @@ static long calc_cfs_shares(struct cfs_r
>          * case no task is runnable on a CPU MIN_SHARES=2 should be returned
>          * instead of 0.
>          */
> -       if (shares < MIN_SHARES)
> -               shares = MIN_SHARES;
> -       if (shares > tg->shares)
> -               shares = tg->shares;
> -
> -       return shares;
> +       return clamp_t(long, shares, MIN_SHARES, tg_shares);
>  }
>  # else /* CONFIG_SMP */
> -static inline long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
> +static inline long
> +calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg, enum shares_type shares_type)
>  {
>         return tg->shares;
>  }
> @@ -2721,7 +2748,7 @@ static void update_cfs_shares(struct sch
>         if (likely(se->load.weight == tg->shares))
>                 return;
>  #endif
> -       shares = calc_cfs_shares(cfs_rq, tg);
> +       shares = calc_cfs_shares(cfs_rq, tg, shares_weight);
>
>         reweight_entity(cfs_rq_of(se), se, shares);
>  }
> @@ -3078,39 +3105,10 @@ static inline void
>  update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>         struct cfs_rq *gcfs_rq = group_cfs_rq(se);
> -       long delta, load = gcfs_rq->avg.load_avg;
> -
> -       /*
> -        * If the load of group cfs_rq is null, the load of the
> -        * sched_entity will also be null so we can skip the formula
> -        */
> -       if (load) {
> -               long tg_load;
> -
> -               /* Get tg's load and ensure tg_load > 0 */
> -               tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
> -
> -               /* Ensure tg_load >= load and updated with current load*/
> -               tg_load -= gcfs_rq->tg_load_avg_contrib;
> -               tg_load += load;
> -
> -               /*
> -                * We need to compute a correction term in the case that the
> -                * task group is consuming more CPU than a task of equal
> -                * weight. A task with a weight equals to tg->shares will have
> -                * a load less or equal to scale_load_down(tg->shares).
> -                * Similarly, the sched_entities that represent the task group
> -                * at parent level, can't have a load higher than
> -                * scale_load_down(tg->shares). And the Sum of sched_entities'
> -                * load must be <= scale_load_down(tg->shares).
> -                */
> -               if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
> -                       /* scale gcfs_rq's load into tg's shares*/
> -                       load *= scale_load_down(gcfs_rq->tg->shares);
> -                       load /= tg_load;
> -               }
> -       }
> +       long load, delta;
>
> +       load = scale_load_down(calc_cfs_shares(gcfs_rq, gcfs_rq->tg,
> +                                              shares_avg));
>         delta = load - se->avg.load_avg;
>
>         /* Nothing to update */


* Re: [PATCH 3/3] sched/fair: Propagate runnable_load_avg independently from load_avg
  2017-05-04 20:30 ` [PATCH 3/3] sched/fair: Propagate runnable_load_avg independently from load_avg Tejun Heo
@ 2017-05-05 10:42   ` Vincent Guittot
  2017-05-05 12:18     ` Vincent Guittot
  2017-05-05 16:51   ` Vincent Guittot
  1 sibling, 1 reply; 20+ messages in thread
From: Vincent Guittot @ 2017-05-05 10:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On 4 May 2017 at 22:30, Tejun Heo <tj@kernel.org> wrote:
> We noticed that with cgroup CPU controller in use, the scheduling
> latency gets wonky regardless of nesting level or weight
> configuration.  This is easily reproducible with Chris Mason's
> schbench[1].
>
> All tests are run on a single socket, 16 cores, 32 threads machine.
> While the machine is mostly idle, it isn't completey.  There's always
> some variable management load going on.  The command used is
>
>  schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>
> which measures scheduling latency with 32 threads each of which
> repeatedly runs for 15ms and then sleeps for 10ms.  Here's a
> representative result when running from the root cgroup.
>
>  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>  Latency percentiles (usec)
>          50.0000th: 26
>          75.0000th: 62
>          90.0000th: 74
>          95.0000th: 86
>          *99.0000th: 887
>          99.5000th: 3692
>          99.9000th: 10832
>          min=0, max=13374
>
> The following is inside a first level CPU cgroup with the maximum
> weight.
>
>  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>  Latency percentiles (usec)
>          50.0000th: 31
>          75.0000th: 65
>          90.0000th: 71
>          95.0000th: 91
>          *99.0000th: 7288
>          99.5000th: 10352
>          99.9000th: 12496
>          min=0, max=13023
>
> Note the drastic increase in p99 scheduling latency.  After
> investigation, it turned out that the update_sd_lb_stats(), which is
> used by load_balance() to pick the most loaded group, was often
> picking the wrong group.  A CPU which has one schbench running and
> another queued wouldn't report the correspondingly higher
> weighted_cpuload() and get looked over as the target of load
> balancing.
>
> weighted_cpuload() is the root cfs_rq's runnable_load_avg which is the
> sum of the load_avg of all active sched_entities.  Without cgroups or
> at the root cgroup, each task's load_avg contributes directly to the
> sum.  When a task wakes up or goes to sleep, the change is immediately
> reflected on runnable_load_avg which in turn affects load balancing.
>
> However, when CPU cgroup is in use, a nested cfs_rq blocks this
> immediate propagation.  When a task wakes up inside a cgroup, the
> nested cfs_rq's runnable_load_avg is updated but doesn't get
> propagated to the parent.  It only affects the matching sched_entity's
> load_avg over time which then gets propagated to the runnable_load_avg
> at that level.  This makes the runnable_load_avg at the root queue
> incorrectly include blocked load_avgs of tasks queued in nested
> cfs_rqs causing the load balancer to make the wrong choices.
>
> This patch fixes the bug by propagating nested cfs_rq's
> runnable_load_avg independently from load_avg.  Tasks still contribute
> to its cfs_rq's runnable_load_avg the same way; however, a nested
> cfs_rq directly propagates the scaled runnable_load_avg to the
> matching group sched_entity's avg.runnable_load_avg and keeps the se's
> and parent cfs_rq's runnable_load_avg in sync.
>
> This ensures that, for any given cfs_rq, its runnable_load_avg is the
> sum of the scaled load_avgs of all and only active tasks queued on it
> and its descendants.  This allows the load balancer to operate on the
> same information whether there are nested cfs_rqs or not.
>
> With the patch applied, the p99 latency from inside a cgroup is
> equivalent to the root cgroup case.
>
>  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>  Latency percentiles (usec)
>          50.0000th: 40
>          75.0000th: 71
>          90.0000th: 89
>          95.0000th: 108
>          *99.0000th: 679
>          99.5000th: 3500
>          99.9000th: 10960
>          min=0, max=13790
>
> [1] git://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Paul Turner <pjt@google.com>
> Cc: Chris Mason <clm@fb.com>
> ---
>  kernel/sched/fair.c |   55 ++++++++++++++++++++++++++++++++++------------------
>  1 file changed, 37 insertions(+), 18 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3098,6 +3098,30 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
>         cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * LOAD_AVG_MAX;
>  }
>
> +static inline void
> +update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +{
> +       struct cfs_rq *gcfs_rq = group_cfs_rq(se);
> +       long load, delta;
> +
> +       load = scale_load_down(calc_cfs_shares(gcfs_rq, gcfs_rq->tg,
> +                                              shares_runnable));
> +       delta = load - se->avg.runnable_load_avg;
> +
> +       /* Nothing to update */
> +       if (!delta)
> +               return;
> +
> +       /* Set new sched_entity's load */
> +       se->avg.runnable_load_avg = load;
> +       se->avg.runnable_load_sum = se->avg.runnable_load_avg * LOAD_AVG_MAX;
> +
> +       /* Update parent cfs_rq load */
> +       add_positive(&cfs_rq->avg.runnable_load_avg, delta);
> +       cfs_rq->avg.runnable_load_sum =
> +               cfs_rq->avg.runnable_load_avg * LOAD_AVG_MAX;
> +}
> +
>  /* Take into account change of load of a child task group */
>  static inline void
>  update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
> @@ -3120,17 +3144,6 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq
>         /* Update parent cfs_rq load */
>         add_positive(&cfs_rq->avg.load_avg, delta);
>         cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
> -
> -       /*
> -        * If the sched_entity is already enqueued, we also have to update the
> -        * runnable load avg.
> -        */
> -       if (se->on_rq) {
> -               /* Update parent cfs_rq runnable_load_avg */
> -               add_positive(&cfs_rq->avg.runnable_load_avg, delta);
> -               cfs_rq->avg.runnable_load_sum =
> -                       cfs_rq->avg.runnable_load_avg * LOAD_AVG_MAX;
> -       }
>  }
>
>  static inline void set_tg_cfs_propagate(struct cfs_rq *cfs_rq)
> @@ -3152,16 +3165,16 @@ static inline int test_and_clear_tg_cfs_
>  /* Update task and its cfs_rq load average */
>  static inline int propagate_entity_load_avg(struct sched_entity *se)
>  {
> -       struct cfs_rq *cfs_rq;
> +       struct cfs_rq *cfs_rq = cfs_rq_of(se);
>
>         if (entity_is_task(se))
>                 return 0;
>
> +       update_tg_cfs_runnable(cfs_rq, se);
> +
>         if (!test_and_clear_tg_cfs_propagate(se))
>                 return 0;
>
> -       cfs_rq = cfs_rq_of(se);
> -
>         set_tg_cfs_propagate(cfs_rq);
>
>         update_tg_cfs_util(cfs_rq, se);
> @@ -3298,7 +3311,7 @@ static inline void update_load_avg(struc
>         if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) {
>                 __update_load_avg(now, cpu, &se->avg,
>                           se->on_rq * scale_load_down(se->load.weight),
> -                         cfs_rq->curr == se, false);
> +                         cfs_rq->curr == se, !entity_is_task(se));
>         }
>
>         decayed  = update_cfs_rq_load_avg(now, cfs_rq, true);
> @@ -3354,8 +3367,10 @@ enqueue_entity_load_avg(struct cfs_rq *c
>  {
>         struct sched_avg *sa = &se->avg;
>
> -       cfs_rq->avg.runnable_load_avg += sa->load_avg;
> -       cfs_rq->avg.runnable_load_sum += sa->load_sum;
> +       if (entity_is_task(se)) {

Why don't you add the runnable_load_avg of a group_entity that is enqueued?

> +               cfs_rq->avg.runnable_load_avg += sa->load_avg;
> +               cfs_rq->avg.runnable_load_sum += sa->load_sum;
> +       }
>
>         if (!sa->last_update_time) {
>                 attach_entity_load_avg(cfs_rq, se);
> @@ -3369,6 +3384,9 @@ dequeue_entity_load_avg(struct cfs_rq *c
>  {
>         struct sched_avg *sa = &se->avg;
>
> +       if (!entity_is_task(se))
> +               return;

Same as enqueue, you have to remove the runnable_load_avg of a
group_entity that is dequeued

> +
>         cfs_rq->avg.runnable_load_avg =
>                 max_t(long, cfs_rq->avg.runnable_load_avg - sa->load_avg, 0);
>         cfs_rq->avg.runnable_load_sum =
> @@ -3406,7 +3424,8 @@ void sync_entity_load_avg(struct sched_e
>         u64 last_update_time;
>
>         last_update_time = cfs_rq_last_update_time(cfs_rq);
> -       __update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, false);
> +       __update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)),
> +                         &se->avg, 0, 0, !entity_is_task(se));
>  }
>
>  /*


* Re: [PATCH 3/3] sched/fair: Propagate runnable_load_avg independently from load_avg
  2017-05-05 10:42   ` Vincent Guittot
@ 2017-05-05 12:18     ` Vincent Guittot
  2017-05-05 13:26       ` Tejun Heo
  0 siblings, 1 reply; 20+ messages in thread
From: Vincent Guittot @ 2017-05-05 12:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On 5 May 2017 at 12:42, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> On 4 May 2017 at 22:30, Tejun Heo <tj@kernel.org> wrote:
>> We noticed that with cgroup CPU controller in use, the scheduling
>> latency gets wonky regardless of nesting level or weight
>> configuration.  This is easily reproducible with Chris Mason's
>> schbench[1].
>>
>> All tests are run on a single socket, 16 cores, 32 threads machine.
>> While the machine is mostly idle, it isn't completey.  There's always
>> some variable management load going on.  The command used is
>>
>>  schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>
>> which measures scheduling latency with 32 threads each of which
>> repeatedly runs for 15ms and then sleeps for 10ms.  Here's a
>> representative result when running from the root cgroup.
>>
>>  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>  Latency percentiles (usec)
>>          50.0000th: 26
>>          75.0000th: 62
>>          90.0000th: 74
>>          95.0000th: 86
>>          *99.0000th: 887
>>          99.5000th: 3692
>>          99.9000th: 10832
>>          min=0, max=13374
>>
>> The following is inside a first level CPU cgroup with the maximum
>> weight.
>>
>>  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>  Latency percentiles (usec)
>>          50.0000th: 31
>>          75.0000th: 65
>>          90.0000th: 71
>>          95.0000th: 91
>>          *99.0000th: 7288
>>          99.5000th: 10352
>>          99.9000th: 12496
>>          min=0, max=13023
>>
>> Note the drastic increase in p99 scheduling latency.  After
>> investigation, it turned out that the update_sd_lb_stats(), which is
>> used by load_balance() to pick the most loaded group, was often
>> picking the wrong group.  A CPU which has one schbench running and
>> another queued wouldn't report the correspondingly higher
>> weighted_cpuload() and get looked over as the target of load
>> balancing.
>>
>> weighted_cpuload() is the root cfs_rq's runnable_load_avg which is the
>> sum of the load_avg of all active sched_entities.  Without cgroups or
>> at the root cgroup, each task's load_avg contributes directly to the
>> sum.  When a task wakes up or goes to sleep, the change is immediately
>> reflected on runnable_load_avg which in turn affects load balancing.
>>
>> However, when CPU cgroup is in use, a nested cfs_rq blocks this
>> immediate propagation.  When a task wakes up inside a cgroup, the
>> nested cfs_rq's runnable_load_avg is updated but doesn't get
>> propagated to the parent.  It only affects the matching sched_entity's
>> load_avg over time which then gets propagated to the runnable_load_avg
>> at that level.  This makes the runnable_load_avg at the root queue
>> incorrectly include blocked load_avgs of tasks queued in nested
>> cfs_rqs causing the load balancer to make the wrong choices.
>>
>> This patch fixes the bug by propagating nested cfs_rq's
>> runnable_load_avg independently from load_avg.  Tasks still contribute
>> to its cfs_rq's runnable_load_avg the same way; however, a nested
>> cfs_rq directly propagates the scaled runnable_load_avg to the
>> matching group sched_entity's avg.runnable_load_avg and keeps the se's
>> and parent cfs_rq's runnable_load_avg in sync.
>>
>> This ensures that, for any given cfs_rq, its runnable_load_avg is the
>> sum of the scaled load_avgs of all and only active tasks queued on it
>> and its descendants.  This allows the load balancer to operate on the
>> same information whether there are nested cfs_rqs or not.
>>
>> With the patch applied, the p99 latency from inside a cgroup is
>> equivalent to the root cgroup case.
>>
>>  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>  Latency percentiles (usec)
>>          50.0000th: 40
>>          75.0000th: 71
>>          90.0000th: 89
>>          95.0000th: 108
>>          *99.0000th: 679
>>          99.5000th: 3500
>>          99.9000th: 10960
>>          min=0, max=13790
>>
>> [1] git://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git
>>
>> Signed-off-by: Tejun Heo <tj@kernel.org>
>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Mike Galbraith <efault@gmx.de>
>> Cc: Paul Turner <pjt@google.com>
>> Cc: Chris Mason <clm@fb.com>
>> ---
>>  kernel/sched/fair.c |   55 ++++++++++++++++++++++++++++++++++------------------
>>  1 file changed, 37 insertions(+), 18 deletions(-)
>>
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3098,6 +3098,30 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
>>         cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * LOAD_AVG_MAX;
>>  }
>>
>> +static inline void
>> +update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se)
>> +{
>> +       struct cfs_rq *gcfs_rq = group_cfs_rq(se);
>> +       long load, delta;
>> +
>> +       load = scale_load_down(calc_cfs_shares(gcfs_rq, gcfs_rq->tg,
>> +                                              shares_runnable));
>> +       delta = load - se->avg.runnable_load_avg;
>> +
>> +       /* Nothing to update */
>> +       if (!delta)
>> +               return;
>> +
>> +       /* Set new sched_entity's load */
>> +       se->avg.runnable_load_avg = load;
>> +       se->avg.runnable_load_sum = se->avg.runnable_load_avg * LOAD_AVG_MAX;
>> +
>> +       /* Update parent cfs_rq load */
>> +       add_positive(&cfs_rq->avg.runnable_load_avg, delta);
>> +       cfs_rq->avg.runnable_load_sum =
>> +               cfs_rq->avg.runnable_load_avg * LOAD_AVG_MAX;
>> +}
>> +
>>  /* Take into account change of load of a child task group */
>>  static inline void
>>  update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
>> @@ -3120,17 +3144,6 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq
>>         /* Update parent cfs_rq load */
>>         add_positive(&cfs_rq->avg.load_avg, delta);
>>         cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
>> -
>> -       /*
>> -        * If the sched_entity is already enqueued, we also have to update the
>> -        * runnable load avg.
>> -        */
>> -       if (se->on_rq) {
>> -               /* Update parent cfs_rq runnable_load_avg */
>> -               add_positive(&cfs_rq->avg.runnable_load_avg, delta);
>> -               cfs_rq->avg.runnable_load_sum =
>> -                       cfs_rq->avg.runnable_load_avg * LOAD_AVG_MAX;
>> -       }
>>  }
>>
>>  static inline void set_tg_cfs_propagate(struct cfs_rq *cfs_rq)
>> @@ -3152,16 +3165,16 @@ static inline int test_and_clear_tg_cfs_
>>  /* Update task and its cfs_rq load average */
>>  static inline int propagate_entity_load_avg(struct sched_entity *se)
>>  {
>> -       struct cfs_rq *cfs_rq;
>> +       struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>
>>         if (entity_is_task(se))
>>                 return 0;
>>
>> +       update_tg_cfs_runnable(cfs_rq, se);
>> +
>>         if (!test_and_clear_tg_cfs_propagate(se))
>>                 return 0;
>>
>> -       cfs_rq = cfs_rq_of(se);
>> -
>>         set_tg_cfs_propagate(cfs_rq);
>>
>>         update_tg_cfs_util(cfs_rq, se);
>> @@ -3298,7 +3311,7 @@ static inline void update_load_avg(struc
>>         if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) {
>>                 __update_load_avg(now, cpu, &se->avg,
>>                           se->on_rq * scale_load_down(se->load.weight),
>> -                         cfs_rq->curr == se, false);
>> +                         cfs_rq->curr == se, !entity_is_task(se));
>>         }
>>
>>         decayed  = update_cfs_rq_load_avg(now, cfs_rq, true);
>> @@ -3354,8 +3367,10 @@ enqueue_entity_load_avg(struct cfs_rq *c
>>  {
>>         struct sched_avg *sa = &se->avg;
>>
>> -       cfs_rq->avg.runnable_load_avg += sa->load_avg;
>> -       cfs_rq->avg.runnable_load_sum += sa->load_sum;
>> +       if (entity_is_task(se)) {
>
> Why don't you add the runnable_load_avg of a group_entity that is enqueued ?

OK, I forgot that you now propagate runnable_load_avg in the group
entity.  But this seems really weird and adds more exceptions to the
normal behavior of load tracking.

>
>> +               cfs_rq->avg.runnable_load_avg += sa->load_avg;
>> +               cfs_rq->avg.runnable_load_sum += sa->load_sum;
>> +       }
>>
>>         if (!sa->last_update_time) {
>>                 attach_entity_load_avg(cfs_rq, se);
>> @@ -3369,6 +3384,9 @@ dequeue_entity_load_avg(struct cfs_rq *c
>>  {
>>         struct sched_avg *sa = &se->avg;
>>
>> +       if (!entity_is_task(se))
>> +               return;
>
> Same as enqueue, you have to remove the runnable_load_avg of a
> group_entity that is dequeued
>
>> +
>>         cfs_rq->avg.runnable_load_avg =
>>                 max_t(long, cfs_rq->avg.runnable_load_avg - sa->load_avg, 0);
>>         cfs_rq->avg.runnable_load_sum =
>> @@ -3406,7 +3424,8 @@ void sync_entity_load_avg(struct sched_e
>>         u64 last_update_time;
>>
>>         last_update_time = cfs_rq_last_update_time(cfs_rq);
>> -       __update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, false);
>> +       __update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)),
>> +                         &se->avg, 0, 0, !entity_is_task(se));
>>  }
>>
>>  /*


* Re: [PATCH 2/3] sched/fair: Add load_weight->runnable_load_{sum|avg}
  2017-05-04 20:29 ` [PATCH 2/3] sched/fair: Add load_weight->runnable_load_{sum|avg} Tejun Heo
@ 2017-05-05 13:22   ` Dietmar Eggemann
  2017-05-05 13:26     ` Tejun Heo
  0 siblings, 1 reply; 20+ messages in thread
From: Dietmar Eggemann @ 2017-05-05 13:22 UTC (permalink / raw)
  To: Tejun Heo, Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Linus Torvalds, Vincent Guittot, Mike Galbraith,
	Paul Turner, Chris Mason, kernel-team

Hi Tejun,

On 04/05/17 21:29, Tejun Heo wrote:
> Currently, runnable_load_avg, which represents the portion of load avg
> only from tasks which are currently active, is tracked by cfs_rq but
> not by sched_entity.  We want to propagate runnable_load_avg of a
> nested cfs_rq without affecting load_avg propagation.  To implement an
> equivalent propagation channel, sched_entity needs to track
> runnable_load_avg too.
> 
> This patch moves cfs_rq->runnable_load_{sum|avg} into struct
> load_weight which is already used to track load_avg and shared by both
> cfs_rq and sched_entity.
> 
> This patch only changes where runnable_load_{sum|avg} are located and
> doesn't cause any actual behavior changes.  The fields are still only
> used for cfs_rqs.

This one doesn't apply cleanly on tip/sched/core.  There have been a
lot of changes in the actual PELT code.

e.g.
a481db34b9be - sched/fair: Optimize ___update_sched_avg() (2017-03-30
Yuyang Du)
0ccb977f4c80 - sched/fair: Explicitly generate __update_load_avg()
instances (2017-03-30 Peter Zijlstra)

I stitched this up locally to be able to run some tests.

[...]


* Re: [PATCH 3/3] sched/fair: Propagate runnable_load_avg independently from load_avg
  2017-05-05 12:18     ` Vincent Guittot
@ 2017-05-05 13:26       ` Tejun Heo
  0 siblings, 0 replies; 20+ messages in thread
From: Tejun Heo @ 2017-05-05 13:26 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hello,

On Fri, May 05, 2017 at 02:18:54PM +0200, Vincent Guittot wrote:
> >> @@ -3354,8 +3367,10 @@ enqueue_entity_load_avg(struct cfs_rq *c
> >>  {
> >>         struct sched_avg *sa = &se->avg;
> >>
> >> -       cfs_rq->avg.runnable_load_avg += sa->load_avg;
> >> -       cfs_rq->avg.runnable_load_sum += sa->load_sum;
> >> +       if (entity_is_task(se)) {
> >
> > Why don't you add the runnable_load_avg of a group_entity that is enqueued ?
> 
> ok, i forgot that you propagate runnable_load_avg in entity now. But
> this seems really weird and adds more  exceptions to the normal
> behavior of load tracking

It seems cleaner this way to me.  The actual runnable tracking takes
place only for tasks, on their immediate queues, and everything beyond
that is pure propagation.  The distinction is inherent as there's no
point in calculating runnable for a task's se.

Obviously, we can special-case dequeueing of nested group entities too,
but that's more code and more fragility.

Thanks.

-- 
tejun


* Re: [PATCH 2/3] sched/fair: Add load_weight->runnable_load_{sum|avg}
  2017-05-05 13:22   ` Dietmar Eggemann
@ 2017-05-05 13:26     ` Tejun Heo
  2017-05-05 13:37       ` Dietmar Eggemann
  0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2017-05-05 13:26 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Vincent Guittot, Mike Galbraith, Paul Turner, Chris Mason,
	kernel-team

On Fri, May 05, 2017 at 02:22:05PM +0100, Dietmar Eggemann wrote:
> Hi Tejun,
> 
> On 04/05/17 21:29, Tejun Heo wrote:
> > Currently, runnable_load_avg, which represents the portion of load avg
> > only from tasks which are currently active, is tracked by cfs_rq but
> > not by sched_entity.  We want to propagate runnable_load_avg of a
> > nested cfs_rq without affecting load_avg propagation.  To implement an
> > equivalent propagation channel, sched_entity needs to track
> > runnable_load_avg too.
> > 
> > This patch moves cfs_rq->runnable_load_{sum|avg} into struct
> > load_weight which is already used to track load_avg and shared by both
> > cfs_rq and sched_entity.
> > 
> > This patch only changes where runnable_load_{sum|avg} are located and
> > doesn't cause any actual behavior changes.  The fields are still only
> > used for cfs_rqs.
> 
> This one doesn't apply cleanly on tip/sched/core. There have been a lot
> of changes in the actual PELT code.

The patches are on top of v4.11 as noted in the head message.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCHSET v2] sched/fair: fix load balancer behavior when cgroup is in use
  2017-05-05  8:46 ` [RFC PATCHSET v2] sched/fair: fix load balancer behavior when cgroup is in use Vincent Guittot
@ 2017-05-05 13:28   ` Tejun Heo
  2017-05-05 13:32     ` Vincent Guittot
  0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2017-05-05 13:28 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hello, Vincent.

On Fri, May 05, 2017 at 10:46:53AM +0200, Vincent Guittot wrote:
> schbench results look better with this version
> Latency percentiles (usec)
> 50.0000th: 212
> 75.0000th: 292
> 90.0000th: 385
> 95.0000th: 439
> *99.0000th: 671
> 99.5000th: 7992
> 99.9000th: 12176
> min=0, max=14855
> 
> p99 is back to a normal value but p99.5 stays higher than mainline

By how much and is that with the weight adjustment on the cgroup?  I
can't get reliable numbers on p99.5 and beyond on my test setup.  I
ordered the hikey board and will try to replicate your setup once it
arrives.

> I have also checked load_avg and runnable_load_avg value and there is
> something incorrect. I will provide details on the related patch

Sure, will respond there.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCHSET v2] sched/fair: fix load balancer behavior when cgroup is in use
  2017-05-05 13:28   ` Tejun Heo
@ 2017-05-05 13:32     ` Vincent Guittot
  0 siblings, 0 replies; 20+ messages in thread
From: Vincent Guittot @ 2017-05-05 13:32 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On 5 May 2017 at 15:28, Tejun Heo <tj@kernel.org> wrote:
> Hello, Vincent.
>
> On Fri, May 05, 2017 at 10:46:53AM +0200, Vincent Guittot wrote:
>> schbench results look better with this version
>> Latency percentiles (usec)
>> 50.0000th: 212
>> 75.0000th: 292
>> 90.0000th: 385
>> 95.0000th: 439
>> *99.0000th: 671
>> 99.5000th: 7992
>> 99.9000th: 12176
>> min=0, max=14855
>>
>> p99 is back to a normal value but p99.5 stays higher than mainline
>
> By how much and is that with the weight adjustment on the cgroup?  I
> can't get reliable numbers on p99.5 and beyond on my test setup.  I
> ordered the hikey board and will try to replicate your setup once it
> arrives.

These are the results that I got when I tested your 1st version last week:

With v4.11-rc8, I have run the test 10 times and get consistent results:
schbench -m 2 -t 4 -s 10000 -c 15000 -r 30
Latency percentiles (usec)
50.0000th: 255
75.0000th: 350
90.0000th: 454
95.0000th: 489
*99.0000th: 539
99.5000th: 585
99.9000th: 10224
min=0, max=13567

>
>> I have also checked load_avg and runnable_load_avg value and there is
>> something incorrect. I will provide details on the related patch
>
> Sure, will respond there.
>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] sched/fair: Add load_weight->runnable_load_{sum|avg}
  2017-05-05 13:26     ` Tejun Heo
@ 2017-05-05 13:37       ` Dietmar Eggemann
  0 siblings, 0 replies; 20+ messages in thread
From: Dietmar Eggemann @ 2017-05-05 13:37 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Vincent Guittot, Mike Galbraith, Paul Turner, Chris Mason,
	kernel-team

On 05/05/17 14:26, Tejun Heo wrote:
> On Fri, May 05, 2017 at 02:22:05PM +0100, Dietmar Eggemann wrote:
>> Hi Tejun,
>>

[...]

> 
> The patches are on top of v4.11 as noted in the head message.

Sorry, haven't seen it. Will use v4.11 instead.

[...]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/3] sched/fair: Peter's shares_type patch
  2017-05-05 10:40   ` Vincent Guittot
@ 2017-05-05 15:30     ` Tejun Heo
  2017-05-10 15:09       ` Tejun Heo
  2017-05-05 15:41     ` Peter Zijlstra
  1 sibling, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2017-05-05 15:30 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hello,

On Fri, May 05, 2017 at 12:40:27PM +0200, Vincent Guittot wrote:
> In fact, if TA is the only task in the cgroup, tg->load_avg ==
> cfs_rq->tg_load_avg_contrib.
> For shares_avg, tg_weight == cfs_rq->runnable_load_avg and shares =
> cfs_rq->runnable_load_avg * tg->shares / cfs_rq->runnable_load_avg =
> tg->shares = 1024.
> Then, because of rounding in the PELT computation, the child
> cfs_rq->runnable_load_avg can sometimes be non-zero (around 2 or 3)
> when idle, so the group entity and root cfs_rq stay around 1024.

I see.  So the minor calculation errors and the "replacing local
term" adjustment combine to yield a wild result.

> For shares_runnable, it should be
> 
> group_entity->runnable_load_avg = cfs_rq->runnable_load_avg *
> group_entity->avg.load_avg / cfs_rq->avg.load_avg

Yeah, that could be one way to calculate the value while avoiding the
artifacts.  Hmmm... IIUC, replacing the local contribution with the
current one is to ensure that we at least calculate with the current
term on the local queue.  This makes sense for weight and shares but
as you pointed out it doesn't make sense to replace local base with
runnable when the base is expected to be sum of load_avgs.  How about
something like the following?

---
 kernel/sched/fair.c |  101 +++++++++++++++++++++++++---------------------------
 1 file changed, 50 insertions(+), 51 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2636,26 +2636,58 @@ account_entity_dequeue(struct cfs_rq *cf
 	cfs_rq->nr_running--;
 }
 
+enum shares_type {
+	shares_runnable,
+	shares_avg,
+	shares_weight,
+};
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 # ifdef CONFIG_SMP
-static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
+static long
+calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg,
+		enum shares_type shares_type)
 {
-	long tg_weight, load, shares;
+	long tg_weight, tg_shares, load, load_base, shares;
 
-	/*
-	 * This really should be: cfs_rq->avg.load_avg, but instead we use
-	 * cfs_rq->load.weight, which is its upper bound. This helps ramp up
-	 * the shares for small weight interactive tasks.
-	 */
-	load = scale_load_down(cfs_rq->load.weight);
+	tg_shares = READ_ONCE(tg->shares);
+
+	switch (shares_type) {
+	case shares_runnable:
+		/*
+		 * Instead of the correct cfs_rq->avg.load_avg we use
+		 * cfs_rq->runnable_load_avg, which does not include the
+		 * blocked load.
+		 */
+		load = cfs_rq->runnable_load_avg;
+		load_base = cfs_rq->avg.load_avg;
+		break;
+
+	case shares_avg:
+		load = load_base = cfs_rq->avg.load_avg;
+		break;
+
+	case shares_weight:
+		/*
+		 * Instead of the correct cfs_rq->avg.load_avg we use
+		 * cfs_rq->load.weight, which is its upper bound. This helps
+		 * ramp up the shares for small weight interactive tasks.
+		 */
+		load = load_base = scale_load_down(cfs_rq->load.weight);
+		break;
+	}
 
 	tg_weight = atomic_long_read(&tg->load_avg);
 
-	/* Ensure tg_weight >= load */
+	/*
+	 * This ensures the sum is up-to-date for this CPU, in case of the other
+	 * two approximations it biases the sum towards their value and in case
+	 * of (near) UP ensures the division ends up <= 1.
+	 */
 	tg_weight -= cfs_rq->tg_load_avg_contrib;
-	tg_weight += load;
+	tg_weight += load_base;
 
-	shares = (tg->shares * load);
+	shares = (tg_shares * load);
 	if (tg_weight)
 		shares /= tg_weight;
 
@@ -2671,15 +2703,11 @@ static long calc_cfs_shares(struct cfs_r
 	 * case no task is runnable on a CPU MIN_SHARES=2 should be returned
 	 * instead of 0.
 	 */
-	if (shares < MIN_SHARES)
-		shares = MIN_SHARES;
-	if (shares > tg->shares)
-		shares = tg->shares;
-
-	return shares;
+	return clamp_t(long, shares, MIN_SHARES, tg_shares);
 }
 # else /* CONFIG_SMP */
-static inline long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
+static inline long
+calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg, enum shares_type shares_type)
 {
 	return tg->shares;
 }
@@ -2721,7 +2749,7 @@ static void update_cfs_shares(struct sch
 	if (likely(se->load.weight == tg->shares))
 		return;
 #endif
-	shares = calc_cfs_shares(cfs_rq, tg);
+	shares = calc_cfs_shares(cfs_rq, tg, shares_weight);
 
 	reweight_entity(cfs_rq_of(se), se, shares);
 }
@@ -3078,39 +3106,10 @@ static inline void
 update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
-	long delta, load = gcfs_rq->avg.load_avg;
-
-	/*
-	 * If the load of group cfs_rq is null, the load of the
-	 * sched_entity will also be null so we can skip the formula
-	 */
-	if (load) {
-		long tg_load;
-
-		/* Get tg's load and ensure tg_load > 0 */
-		tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
-
-		/* Ensure tg_load >= load and updated with current load*/
-		tg_load -= gcfs_rq->tg_load_avg_contrib;
-		tg_load += load;
-
-		/*
-		 * We need to compute a correction term in the case that the
-		 * task group is consuming more CPU than a task of equal
-		 * weight. A task with a weight equals to tg->shares will have
-		 * a load less or equal to scale_load_down(tg->shares).
-		 * Similarly, the sched_entities that represent the task group
-		 * at parent level, can't have a load higher than
-		 * scale_load_down(tg->shares). And the Sum of sched_entities'
-		 * load must be <= scale_load_down(tg->shares).
-		 */
-		if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
-			/* scale gcfs_rq's load into tg's shares*/
-			load *= scale_load_down(gcfs_rq->tg->shares);
-			load /= tg_load;
-		}
-	}
+	long load, delta;
 
+	load = scale_load_down(calc_cfs_shares(gcfs_rq, gcfs_rq->tg,
+					       shares_avg));
 	delta = load - se->avg.load_avg;
 
 	/* Nothing to update */

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/3] sched/fair: Peter's shares_type patch
  2017-05-05 10:40   ` Vincent Guittot
  2017-05-05 15:30     ` Tejun Heo
@ 2017-05-05 15:41     ` Peter Zijlstra
  1 sibling, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2017-05-05 15:41 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Tejun Heo, Ingo Molnar, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team


Hi all,

Sorry for the overly long post; this is a semi-coherent collection
of all the notes I found on my desk and some continuations thereon.

On Fri, May 05, 2017 at 12:40:27PM +0200, Vincent Guittot wrote:

> > +enum shares_type {
> > +       shares_runnable,
> > +       shares_avg,
> > +       shares_weight,
> > +};
> > +
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> >  # ifdef CONFIG_SMP
> > +static long
> > +calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg,
> > +               enum shares_type shares_type)
> >  {
> > +       long tg_weight, tg_shares, load, shares;
> >
> > +       tg_shares = READ_ONCE(tg->shares);
> > +
> > +       switch (shares_type) {
> > +       case shares_runnable:
> > +               /*
> > +                * Instead of the correct cfs_rq->avg.load_avg we use
> > +                * cfs_rq->runnable_load_avg, which does not include the
> > +                * blocked load.
> > +                */
> > +               load = cfs_rq->runnable_load_avg;
> > +               break;
> > +
> > +       case shares_avg:
> > +               load = cfs_rq->avg.load_avg;
> > +               break;
> > +
> > +       case shares_weight:
> > +               /*
> > +                * Instead of the correct cfs_rq->avg.load_avg we use
> > +                * cfs_rq->load.weight, which is its upper bound. This helps
> > +                * ramp up the shares for small weight interactive tasks.
> > +                */
> > +               load = scale_load_down(cfs_rq->load.weight);
> > +               break;
> > +       }
> >
> >         tg_weight = atomic_long_read(&tg->load_avg);
> >
> > +       /*
> > +        * This ensures the sum is up-to-date for this CPU, in case of the other
> > +        * two approximations it biases the sum towards their value and in case
> > +        * of (near) UP ensures the division ends up <= 1.
> > +        */
> >         tg_weight -= cfs_rq->tg_load_avg_contrib;
> >         tg_weight += load;
> 
> The above is correct for shares_avg and shares_weight but it's not
> correct for shares_runnable because runnable_load_avg is a subset of
> load_avg.

It's only correct for shares_avg; shares_weight and shares_runnable are
both icky.

Remember, all this does is approximate the hierarchical proportion which
includes that global sum we all love to hate.

That is, the weight of a group entity is the proportional share of the
group weight based on the group runqueue weights. That is:

                    tg->weight * grq->load.weight
  ge->load.weight = -----------------------------		(1)
                        \Sum grq->load.weight

Now, because that sum is prohibitively expensive to compute
(been there, done that), we approximate it with this average stuff. The
average moves slower and therefore the approximation is cheaper and
more stable.

So instead of the above, we substitute:

  grq->load.weight -> grq->avg.load_avg				(2)

which yields the following:

                    tg->weight * grq->avg.load_avg
  ge->load.weight = ------------------------------		(3)
                            tg->load_avg

Because: tg->load_avg ~= \Sum grq->avg.load_avg

That is shares_avg, and it is right (given the approximation (2)).

The problem with it is that because the average is slow -- it was
designed to be exactly that of course -- this leads to transients in
boundary conditions. In specific, the case where the group was idle and
we start the one task. It takes time for our CPU's grq->avg.load_avg to
build up, yielding bad latency etc..

Now, in that special case (1) reduces to:

                    tg->weight * grq->load.weight
  ge->load.weight = ----------------------------- = tg->weight	(4)
                        grq->load.weight

That is, the sum collapses because all other CPUs are idle; the UP
scenario.

So what we do is modify our approximation (3) to approach (4) in the
(near) UP case, like:

  ge->load.weight =

             tg->weight * grq->load.weight
    ---------------------------------------------------		(5)
    tg->load_avg - grq->avg.load_avg + grq->load.weight


And that is shares_weight, and it is icky. In the (near) UP case it
approaches (4) while in the normal case it approaches, but never
quite reaches, (3). It consistently overestimates the ge->load.weight
and therefore:

  \Sum ge->load.weight >= tg->weight

hence icky!
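
To make the boundary behaviour concrete, a stand-alone toy of (3) vs
(5) in plain C (invented numbers, not kernel code):

#include <stdio.h>

/* (3): tg->weight * grq->avg.load_avg / tg->load_avg */
static long shares_avg(long tg_shares, long tg_load_avg, long grq_load_avg)
{
	return tg_load_avg ? tg_shares * grq_load_avg / tg_load_avg : tg_shares;
}

/* (5): like (3), but the local load_avg term is replaced by load.weight */
static long shares_weight(long tg_shares, long tg_load_avg,
			  long grq_load_avg, long grq_weight)
{
	long tg_weight = tg_load_avg - grq_load_avg + grq_weight;

	return tg_weight ? tg_shares * grq_weight / tg_weight : tg_shares;
}

int main(void)
{
	long tg_shares = 1024;

	/*
	 * Freshly woken task: local avg still ramping up (10), the sum
	 * dominated by stale blocked load elsewhere (500), full weight 1024.
	 */
	printf("wakeup: (3)=%ld (5)=%ld\n",
	       shares_avg(tg_shares, 500, 10),
	       shares_weight(tg_shares, 500, 10, 1024));

	/* Truly UP: the sum is just our own contribution. */
	printf("UP:     (3)=%ld (5)=%ld\n",
	       shares_avg(tg_shares, 10, 10),
	       shares_weight(tg_shares, 10, 10, 1024));

	/* Busy: load.weight > load_avg locally, so (5) overestimates (3). */
	printf("busy:   (3)=%ld (5)=%ld\n",
	       shares_avg(tg_shares, 4096, 512),
	       shares_weight(tg_shares, 4096, 512, 1024));

	return 0;
}

The wakeup row (20 vs 692) is the latency transient that (5) papers
over, the UP row collapses to tg->weight as in (4), and the busy row
(128 vs 227) is the systematic overestimate.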




Now, on to the propagate stuff.

I'm still struggling with this, so I might have gotten it completely
backwards. Bear with me.

> And I think that shares_avg is also wrong and should be something like:
>
>   group_entity->avg.load_avg =
>            cfs_rq->avg.load_avg * group_entity->load.weight / cfs_rq->load.weight

So firstly, no, shares_avg was intended to be (3) and that is right.

What you're looking for is an expression for ge->avg.load_avg; that's an
entirely different beast.



But lets take a few steps back.

The whole point of the propagation thing is that:

  ge->avg == grq->avg						(6)

_IFF_ we look at the pure running and runnable sums, because they
represent the very same entity, just viewed differently.


But since we've just added an entity to grq->avg, we must also update
ge->avg to retain our invariant (6).

Per the above update_tg_cfs_util() is trivial (and still 'wrong') and
simply copies the running sum over.

However, update_tg_cfs_load() is making my head hurt. So we have:

  ge->avg.load_avg = ge->load.weight * ge->avg.runnable_avg	(7)

And since, like util, the runnable part should be directly transferable,
the following would _appear_ to be the straightforward approach:

  grq->avg.load_avg = grq->load.weight * grq->avg.runnable_avg	(8)

And per (6) we have:

  ge->avg.runnable_avg == grq->avg.runnable_avg

Which gives (6,7,8):

                     ge->load.weight * grq->avg.load_avg
  ge->avg.load_avg = -----------------------------------	(9)
                            grq->load.weight

Except that is wrong!

Because while for entities historical weight is not important and we
really only care about our future and therefore can consider a pure
runnable sum, runqueues can NOT do this.

We specifically want runqueues to have a load_avg that includes
historical weights. Those represent the blocked load, the load we expect
to (shortly) return to us. This only works by keeping the weights as an
integral part of the sum. We therefore cannot decompose as per (8).

OK, so what then?

The code (more or less) does:

                     tg->weight * grq->avg.load_avg
  ge->avg.load_avg = ------------------------------
                             tg->load_avg

Which is basically (3), but since that was the right expression for
ge->load.weight, and per (7) that does not preserve the runnable sum,
it's an upper bound and basically completely destroys the runnable part.

[
  however, since we actually use (5), and (5) > (3) we get an effective
  ge->avg.runnable_avg < 1. But still, afaict there's no relation to
  the desired grq->avg.runnable_avg part.
]

You now propose (9), which, as I've explained, is wrong (also, trying to
implement it doesn't work: grq->load.weight will be 0, etc.).


Another way to look at things is:

  grq->avg.load_avg = \Sum se->avg.load_avg

Therefore, per (7):

  grq->avg.load_avg = \Sum se->load.weight * se->avg.runnable_avg

And the very thing we're propagating is a change in that sum (someone
joined/left). So we can easily know the runnable change, if we were to
keep track of cfs_rq->removed_runnable_avg and
cfs_rq->added_runnable_avg for the migrate condition, which would be,
per (7), the already tracked se->load_avg divided by the corresponding
se->weight.

Basically (9) but in differential form:

  d(runnable_avg) += se->avg.load_avg / se->load.weight
								(10)
  ge->avg.load_avg += ge->load.weight * d(runnable_avg)
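
In code, the differential form could look something like this
stand-alone toy (made-up types and names, not against any tree):

#include <stdio.h>

struct toy_se {
	long weight;		/* se->load.weight */
	long load_avg;		/* se->avg.load_avg */
};

struct toy_grq {
	long d_runnable_avg;	/* added minus removed since last propagation */
};

/* per (7): an se's runnable_avg is its load_avg divided by its weight */
static void toy_account_added(struct toy_grq *grq, const struct toy_se *se)
{
	if (se->weight)
		grq->d_runnable_avg += se->load_avg / se->weight;
}

static void toy_account_removed(struct toy_grq *grq, const struct toy_se *se)
{
	if (se->weight)
		grq->d_runnable_avg -= se->load_avg / se->weight;
}

/* ge->avg.load_avg += ge->load.weight * d(runnable_avg) */
static void toy_propagate(struct toy_se *ge, struct toy_grq *grq)
{
	ge->load_avg += ge->weight * grq->d_runnable_avg;
	grq->d_runnable_avg = 0;
}

int main(void)
{
	struct toy_se ge = { .weight = 512, .load_avg = 0 };
	struct toy_se task = { .weight = 1024, .load_avg = 1024 };
	struct toy_grq grq = { 0 };

	toy_account_added(&grq, &task);	/* fully runnable task migrates in */
	toy_propagate(&ge, &grq);
	printf("ge->avg.load_avg = %ld\n", ge.load_avg);	/* 512 */

	toy_account_removed(&grq, &task);	/* and migrates out again */
	toy_propagate(&ge, &grq);
	printf("ge->avg.load_avg = %ld\n", ge.load_avg);	/* 0 */
	return 0;
}

Integer division obviously butchers the fractions; this is only meant
to show where the add/remove accounting and the weight scaling sit.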




OK, on to the whole runnable thing, which we should not confuse with the
propagation thing; it has very little to do with the invariant (6).


> In fact, if TA is the only task in the cgroup,

>   tg->load_avg == cfs_rq->tg_load_avg_contrib.

> For shares_avg,

>   tg_weight == cfs_rq->runnable_load_avg , and

>   shares = cfs_rq->runnable_load_avg * tg->shares / cfs_rq->runnable_load_avg
>          = tg->shares = 1024.

That was the intent. (Also, _please_, use whitespace; it's cheap
and makes things _so_ much more readable).


  grq->runnable_load_avg = \Sum se->avg.load_avg ; where se->on_rq


So it's a subset of grq->avg.load_avg. Now, since each group entity
(can) represents multiple tasks, not all of which will be runnable at
the same time, our group entity should not contribute its entire
load_avg to its parent cfs_rq's runnable.

Here I think we can use a simple proportion:


					grq->runnable_load_avg
  ge->runnable_avg = ge->avg.load_avg * ----------------------
                                          grq->avg.load_avg


It's a local property, afaict.

And since it needs to be updated on every enqueue/dequeue, we should do
it in {en,de}queue_entity_load_avg() or thereabout.
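
Roughly, and only as a sketch of the arithmetic (names invented, zero
denominator guarded):

#include <stdio.h>

/* scale the ge's load_avg by the runnable fraction of the group rq */
long toy_ge_runnable_avg(long ge_load_avg, long grq_runnable_load_avg,
			 long grq_load_avg)
{
	if (!grq_load_avg)
		return 0;

	return ge_load_avg * grq_runnable_load_avg / grq_load_avg;
}

int main(void)
{
	/* half the group's load currently runnable -> half the ge load */
	printf("%ld\n", toy_ge_runnable_avg(512, 1024, 2048));	/* 256 */
	return 0;
}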

I've not thought much about how this interacts over time with the
regular PELT() function and whether this fraction is retained under it.

But at this point my brain is completely fried...

> For shares_runnable, it should be
> 
>   group_entity->runnable_load_avg =
>            cfs_rq->runnable_load_avg * group_entity->avg.load_avg / cfs_rq->avg.load_avg

I agree with the equation.


> > @@ -3078,39 +3105,10 @@ static inline void
> >  update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> >         struct cfs_rq *gcfs_rq = group_cfs_rq(se);
> > -       long delta, load = gcfs_rq->avg.load_avg;
> > -
> > -       /*
> > -        * If the load of group cfs_rq is null, the load of the
> > -        * sched_entity will also be null so we can skip the formula
> > -        */
> > -       if (load) {
> > -               long tg_load;
> > -
> > -               /* Get tg's load and ensure tg_load > 0 */
> > -               tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
> > -
> > -               /* Ensure tg_load >= load and updated with current load*/
> > -               tg_load -= gcfs_rq->tg_load_avg_contrib;
> > -               tg_load += load;
> > -
> > -               /*
> > -                * We need to compute a correction term in the case that the
> > -                * task group is consuming more CPU than a task of equal
> > -                * weight. A task with a weight equals to tg->shares will have
> > -                * a load less or equal to scale_load_down(tg->shares).
> > -                * Similarly, the sched_entities that represent the task group
> > -                * at parent level, can't have a load higher than
> > -                * scale_load_down(tg->shares). And the Sum of sched_entities'
> > -                * load must be <= scale_load_down(tg->shares).
> > -                */
> > -               if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
> > -                       /* scale gcfs_rq's load into tg's shares*/
> > -                       load *= scale_load_down(gcfs_rq->tg->shares);
> > -                       load /= tg_load;
> > -               }
> > -       }
> > +       long load, delta;
> >
> > +       load = scale_load_down(calc_cfs_shares(gcfs_rq, gcfs_rq->tg,
> > +                                              shares_avg));
> >         delta = load - se->avg.load_avg;
> >
> >         /* Nothing to update */

So this part... I'm fairly sure both the old and the new version are wrong.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/3] sched/fair: Propagate runnable_load_avg independently from load_avg
  2017-05-04 20:30 ` [PATCH 3/3] sched/fair: Propagate runnable_load_avg independently from load_avg Tejun Heo
  2017-05-05 10:42   ` Vincent Guittot
@ 2017-05-05 16:51   ` Vincent Guittot
  1 sibling, 0 replies; 20+ messages in thread
From: Vincent Guittot @ 2017-05-05 16:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On 4 May 2017 at 22:30, Tejun Heo <tj@kernel.org> wrote:

[snip]

>  /* Take into account change of load of a child task group */
>  static inline void
>  update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
> @@ -3120,17 +3144,6 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq
>         /* Update parent cfs_rq load */
>         add_positive(&cfs_rq->avg.load_avg, delta);
>         cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
> -
> -       /*
> -        * If the sched_entity is already enqueued, we also have to update the
> -        * runnable load avg.
> -        */
> -       if (se->on_rq) {
> -               /* Update parent cfs_rq runnable_load_avg */
> -               add_positive(&cfs_rq->avg.runnable_load_avg, delta);
> -               cfs_rq->avg.runnable_load_sum =
> -                       cfs_rq->avg.runnable_load_avg * LOAD_AVG_MAX;
> -       }
>  }
>
>  static inline void set_tg_cfs_propagate(struct cfs_rq *cfs_rq)
> @@ -3152,16 +3165,16 @@ static inline int test_and_clear_tg_cfs_
>  /* Update task and its cfs_rq load average */
>  static inline int propagate_entity_load_avg(struct sched_entity *se)
>  {
> -       struct cfs_rq *cfs_rq;
> +       struct cfs_rq *cfs_rq = cfs_rq_of(se);
>
>         if (entity_is_task(se))
>                 return 0;
>
> +       update_tg_cfs_runnable(cfs_rq, se);

I wonder if update_tg_cfs_runnable() should really be called in
propagate_entity_load_avg().
propagate_entity_load_avg() is called on every update_load_avg()
because an entity can be asynchronously detached, so we must catch
these asynchronous events each time we update the load.
But enqueue/dequeue of an entity is fully synchronous and we know when
runnable_load_avg is updated and should be propagated. So instead of
calling update_tg_cfs_runnable() at every update_load_avg(), we can
call it only in the enqueue/dequeue path and save a lot of useless calls.

> +

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/3] sched/fair: Peter's shares_type patch
  2017-05-05 15:30     ` Tejun Heo
@ 2017-05-10 15:09       ` Tejun Heo
  2017-05-10 16:07         ` Vincent Guittot
  0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2017-05-10 15:09 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hello, Vincent.

On Fri, May 05, 2017 at 11:30:31AM -0400, Tejun Heo wrote:
> > For shares_runnable, it should be
> > 
> > group_entity->runnable_load_avg = cfs_rq->runnable_load_avg *
> > group_entity->avg.load_avg / cfs_rq->avg.load_avg
> 
> Yeah, that could be one way to calculate the value while avoiding the
> artifacts.  Hmmm... IIUC, replacing the local contribution with the
> current one is to ensure that we at least calculate with the current
> term on the local queue.  This makes sense for weight and shares but
> as you pointed out it doesn't make sense to replace local base with
> runnable when the base is expected to be sum of load_avgs.  How about
> something like the following?

Vincent, have you given this patch a try?

Also, I got the hikey board but of course it can't be powered by usb,
so am waiting on the power supply now.  Any chance you can share the
system image you're using so that I can exactly replicate your
environment?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/3] sched/fair: Peter's shares_type patch
  2017-05-10 15:09       ` Tejun Heo
@ 2017-05-10 16:07         ` Vincent Guittot
  2017-05-11  6:59           ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread
From: Vincent Guittot @ 2017-05-10 16:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On 10 May 2017 at 17:09, Tejun Heo <tj@kernel.org> wrote:
> Hello, Vincent.
>
> On Fri, May 05, 2017 at 11:30:31AM -0400, Tejun Heo wrote:
>> > For shares_runnable, it should be
>> >
>> > group_entity->runnable_load_avg = cfs_rq->runnable_load_avg *
>> > group_entity->avg.load_avg / cfs_rq->avg.load_avg
>>
>> Yeah, that could be one way to calculate the value while avoiding the
>> artifacts.  Hmmm... IIUC, replacing the local contribution with the
>> current one is to ensure that we at least calculate with the current
>> term on the local queue.  This makes sense for weight and shares but
>> as you pointed out it doesn't make sense to replace local base with
>> runnable when the base is expected to be sum of load_avgs.  How about
>> something like the following?
>
> Vincent, have you given this patch a try?

No, I haven't.
My understanding of Peter's feedback is that calc_cfs_shares() is not
the place to implement the calculation of
group_entity->runnable_load_avg and group_entity->avg.load_avg when
propagating.

>
> Also, I got the hikey board but of course it can't be powered by usb,
> so am waiting on the power supply now.  Any chance you can share the
> system image you're using so that I can exactly replicate your
> environment?

Yes of course.
I will send you details in a dedicated email.

Regards,
Vincent

>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/3] sched/fair: Peter's shares_type patch
  2017-05-10 16:07         ` Vincent Guittot
@ 2017-05-11  6:59           ` Peter Zijlstra
  0 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2017-05-11  6:59 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Tejun Heo, Ingo Molnar, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On Wed, May 10, 2017 at 06:07:14PM +0200, Vincent Guittot wrote:
> On 10 May 2017 at 17:09, Tejun Heo <tj@kernel.org> wrote:
> > Hello, Vincent.
> >
> > On Fri, May 05, 2017 at 11:30:31AM -0400, Tejun Heo wrote:
> >> > For shares_runnable, it should be
> >> >
> >> > group_entity->runnable_load_avg = cfs_rq->runnable_load_avg *
> >> > group_entity->avg.load_avg / cfs_rq->avg.load_avg
> >>
> >> Yeah, that could be one way to calculate the value while avoiding the
> >> artifacts.  Hmmm... IIUC, replacing the local contribution with the
> >> current one is to ensure that we at least calculate with the current
> >> term on the local queue.  This makes sense for weight and shares but
> >> as you pointed out it doesn't make sense to replace local base with
> >> runnable when the base is expected to be sum of load_avgs.  How about
> >> something like the following?
> >
> > Vincent, have you given this patch a try?
> 
> No I haven't.
> My understand of Peter's feedback is that calc_cfs_shares should not
> be the place where to implement calculation of
> group_entity->runnable_load_avg and group_entity->avg.load_avg when
> propagating

Right, so I have a pile of patches that implement all that my longish
email outlined. I'm just chasing some strange behaviour; in particular
I'm having runnable_load_avg > load_avg, which is something that should
not happen.

It _looks_ like the add/sub cycle leaks a little and a lot of such
cycles then push runnable_load_avg out. I've not managed to pin it down.

If I don't find it, I'll send it out regardless as an RFC so that others
can 'enjoy'.


One request for Chris / Tejun: could you guys pretty please make a
reproducible benchmark? Relying on some ill-specified background noise
just doesn't work; as this thread has clearly illustrated, nobody can
reproduce your issue.

And although I think the specific issue has been fairly well explained,
it would be good to have a working benchmark to prove the point.

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2017-05-11  6:59 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-04 20:28 [RFC PATCHSET v2] sched/fair: fix load balancer behavior when cgroup is in use Tejun Heo
2017-05-04 20:29 ` [PATCH 1/3] sched/fair: Peter's shares_type patch Tejun Heo
2017-05-05 10:40   ` Vincent Guittot
2017-05-05 15:30     ` Tejun Heo
2017-05-10 15:09       ` Tejun Heo
2017-05-10 16:07         ` Vincent Guittot
2017-05-11  6:59           ` Peter Zijlstra
2017-05-05 15:41     ` Peter Zijlstra
2017-05-04 20:29 ` [PATCH 2/3] sched/fair: Add load_weight->runnable_load_{sum|avg} Tejun Heo
2017-05-05 13:22   ` Dietmar Eggemann
2017-05-05 13:26     ` Tejun Heo
2017-05-05 13:37       ` Dietmar Eggemann
2017-05-04 20:30 ` [PATCH 3/3] sched/fair: Propagate runnable_load_avg independently from load_avg Tejun Heo
2017-05-05 10:42   ` Vincent Guittot
2017-05-05 12:18     ` Vincent Guittot
2017-05-05 13:26       ` Tejun Heo
2017-05-05 16:51   ` Vincent Guittot
2017-05-05  8:46 ` [RFC PATCHSET v2] sched/fair: fix load balancer behavior when cgroup is in use Vincent Guittot
2017-05-05 13:28   ` Tejun Heo
2017-05-05 13:32     ` Vincent Guittot
