* [RFC PATCHSET] sched/fair: fix load balancer behavior when cgroup is in use
@ 2017-04-24 20:13 Tejun Heo
  2017-04-24 20:14 ` [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity Tejun Heo
                   ` (2 more replies)
  0 siblings, 3 replies; 69+ messages in thread
From: Tejun Heo @ 2017-04-24 20:13 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Linus Torvalds, Vincent Guittot, Mike Galbraith,
	Paul Turner, Chris Mason, kernel-team

Hello,

We've noticed scheduling latency spikes when cgroup is in use, even when
the machine is idle enough, the scheduling frequency is moderate, and
there's only a single level of cgroup nesting.  More details are in the
patch descriptions, but here's a schbench run from the root cgroup.

 # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
 Latency percentiles (usec)
	 50.0000th: 26
	 75.0000th: 62
	 90.0000th: 74
	 95.0000th: 86
	 *99.0000th: 887
	 99.5000th: 3692
	 99.9000th: 10832
	 min=0, max=13374

And here's one from inside a first level cgroup.

 # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
 Latency percentiles (usec)
	 50.0000th: 31
	 75.0000th: 65
	 90.0000th: 71
	 95.0000th: 91
	 *99.0000th: 7288
	 99.5000th: 10352
	 99.9000th: 12496
	 min=0, max=13023

The p99 latency spike got tracked down to runnable_load_avg not being
propagated through nested cfs_rqs, which left load_balance() operating on
out-of-sync load information.  It ended up picking the wrong CPU as the
load balance target often enough to significantly impact p99 latency.

This patchset fixes the issue by always propagating runnable_load_avg
so that, regardless of nesting, every cfs_rq's runnable_load_avg is
the sum of the scaled loads of all tasks queued below it.
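
To make the intended invariant concrete, here is a minimal userspace
sketch (plain C, not kernel code; the names and the single level of
nesting are made up for illustration) of how one task's load is scaled
into its group's contribution to the parent:

  #include <stdio.h>

  int main(void)
  {
  	long task_load = 1024;		/* full-tilt task with weight 1024 */
  	long group_weight = 1024;	/* total weight queued on the group's cfs_rq */
  	long group_shares = 512;	/* weight of the group entity in the parent */

  	/* the group's own cfs_rq sums the raw task loads */
  	long group_runnable = task_load;

  	/* the parent sees that load scaled by shares / queued weight */
  	long parent_contrib = group_runnable * group_shares / group_weight;

  	printf("group runnable_load_avg: %ld\n", group_runnable);	/* 1024 */
  	printf("contribution to parent:  %ld\n", parent_contrib);	/* 512 */
  	return 0;
  }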

As a side effect, this changes the load_avg behavior of sched_entities
associated with cfs_rq's.  It doesn't seem wrong to me and I can't think
of a better / cleaner way, but if there is, please let me know.

This patchset is on top of v4.11-rc8 and contains the following two
patches.

 sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
 sched/fair: Always propagate runnable_load_avg

diffstat follows.

 kernel/sched/fair.c |   46 +++++++++++++++++++---------------------------
 1 file changed, 19 insertions(+), 27 deletions(-)

Thanks.

-- 
tejun

* [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
  2017-04-24 20:13 [RFC PATCHSET] sched/fair: fix load balancer behavior when cgroup is in use Tejun Heo
@ 2017-04-24 20:14 ` Tejun Heo
  2017-04-24 21:33   ` [PATCH v2 " Tejun Heo
                     ` (2 more replies)
  2017-04-24 20:14 ` [PATCH 2/2] sched/fair: Always propagate runnable_load_avg Tejun Heo
  2017-04-24 21:35 ` [PATCH 3/2] sched/fair: Skip __update_load_avg() on cfs_rq sched_entities Tejun Heo
  2 siblings, 3 replies; 69+ messages in thread
From: Tejun Heo @ 2017-04-24 20:14 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Linus Torvalds, Vincent Guittot, Mike Galbraith,
	Paul Turner, Chris Mason, kernel-team

09a43ace1f98 ("sched/fair: Propagate load during synchronous
attach/detach") added immediate load propagation from cfs_rq to its
sched_entity then to the parent cfs_rq; however, what gets propagated
doesn't seem to make sense.

It repeats the tg_weight calculation done in calc_cfs_shares() but
only uses it to compensate for shares being out of date.  After that,
it sets the sched_entity's load_avg to the load_avg of the
corresponding cfs_rq.

This doesn't make sense as the cfs_rq's load_avg is some fraction of
its total weight, which the sched_entity's weight has nothing to do with.
For example, if the cfs_rq has a single task with a constant load of 1,
the cfs_rq's load_avg would be around 1.  If that cfs_rq's sched_entity
is the only active one in the parent cfs_rq and has the maximum weight,
the sched_entity's load should be around the maximum weight, but
update_tg_cfs_load() ends up overriding it to 1.

At the parent's level, the absolute value of load_avg inside a child
cfs_rq doesn't mean anything.  Only the ratio against its weight is
meaningful.

This patch changes update_tg_cfs_load() to normalize the
runnable_load_avg of the cfs_rq and then scale it to the matching
sched_entity's freshly calculated shares for propagation.  Use of
runnable_load_avg instead of load_avg is intentional and keeps the
parent's runnable_load_avg true to the sum of the scaled loads of all
tasks queued under it, which is critical for the correct operation of
the load balancer.  The next patch will depend on it.
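
For reference, the math in the hunk below can be read as the following
standalone sketch (not kernel code; calc_cfs_shares() is replaced by a
precomputed shares argument and the cfs_rq fields by plain parameters):

  #include <stdio.h>

  /*
   * Normalize the child cfs_rq's runnable_load_avg by its queued weight,
   * scale it to the group entity's shares and cap the result at shares.
   */
  static long propagated_load(long runnable, long weight, long shares)
  {
  	long load;

  	if (!weight)
  		return 0;
  	load = runnable * shares / weight;
  	return load < shares ? load : shares;
  }

  int main(void)
  {
  	/* a single weight-1 task running full-tilt in a weight-1024 group */
  	printf("%ld\n", propagated_load(1, 1, 1024));		/* -> 1024 */
  	/* a weight-1024 task running half the time in the same group */
  	printf("%ld\n", propagated_load(512, 1024, 1024));	/* -> 512 */
  	return 0;
  }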

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Turner <pjt@google.com>
---
 kernel/sched/fair.c |   46 +++++++++++++++++++---------------------------
 1 file changed, 19 insertions(+), 27 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3078,37 +3078,29 @@ static inline void
 update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
-	long delta, load = gcfs_rq->avg.load_avg;
+	long load = 0, delta;
 
 	/*
-	 * If the load of group cfs_rq is null, the load of the
-	 * sched_entity will also be null so we can skip the formula
+	 * A cfs_rq's load avg contribution to the parent should be scaled
+	 * to the sched_entity's weight.  Use freshly calculated shares
+	 * instead of @se->load.weight as the latter may not reflect
+	 * changes from the current scheduling operation.
+	 *
+	 * Note that the propagation source is runnable_load_avg instead of
+	 * load_avg.  This keeps every cfs_rq's runnable_load_avg true to
+	 * the sum of the scaled loads of all tasks queued under it, which
+	 * is important for the correct operation of the load balancer.
+	 *
+	 * This can make the sched_entity's load_avg jumpier but that
+	 * correctly reflects what would happen without cgroups if each
+	 * task's load is scaled across nesting - the load is being
+	 * averaged at the task and each cfs_rq.
 	 */
-	if (load) {
-		long tg_load;
+	if (gcfs_rq->load.weight) {
+		long shares = calc_cfs_shares(gcfs_rq, gcfs_rq->tg);
 
-		/* Get tg's load and ensure tg_load > 0 */
-		tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
-
-		/* Ensure tg_load >= load and updated with current load*/
-		tg_load -= gcfs_rq->tg_load_avg_contrib;
-		tg_load += load;
-
-		/*
-		 * We need to compute a correction term in the case that the
-		 * task group is consuming more CPU than a task of equal
-		 * weight. A task with a weight equals to tg->shares will have
-		 * a load less or equal to scale_load_down(tg->shares).
-		 * Similarly, the sched_entities that represent the task group
-		 * at parent level, can't have a load higher than
-		 * scale_load_down(tg->shares). And the Sum of sched_entities'
-		 * load must be <= scale_load_down(tg->shares).
-		 */
-		if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
-			/* scale gcfs_rq's load into tg's shares*/
-			load *= scale_load_down(gcfs_rq->tg->shares);
-			load /= tg_load;
-		}
+		load = min(gcfs_rq->runnable_load_avg *
+			   shares / gcfs_rq->load.weight, shares);
 	}
 
 	delta = load - se->avg.load_avg;

* [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-04-24 20:13 [RFC PATCHSET] sched/fair: fix load balancer behavior when cgroup is in use Tejun Heo
  2017-04-24 20:14 ` [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity Tejun Heo
@ 2017-04-24 20:14 ` Tejun Heo
  2017-04-25  8:46   ` Vincent Guittot
  2017-04-26 18:12   ` Vincent Guittot
  2017-04-24 21:35 ` [PATCH 3/2] sched/fair: Skip __update_load_avg() on cfs_rq sched_entities Tejun Heo
  2 siblings, 2 replies; 69+ messages in thread
From: Tejun Heo @ 2017-04-24 20:14 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Linus Torvalds, Vincent Guittot, Mike Galbraith,
	Paul Turner, Chris Mason, kernel-team

We noticed that with the cgroup CPU controller in use, the scheduling
latency gets wonky regardless of nesting level or weight
configuration.  This is easily reproducible with Chris Mason's
schbench[1].

All tests are run on a single-socket, 16-core, 32-thread machine.
While the machine is mostly idle, it isn't completely.  There's always
some variable management load going on.  The command used is

 schbench -m 2 -t 16 -s 10000 -c 15000 -r 30

which measures scheduling latency with 32 threads each of which
repeatedly runs for 15ms and then sleeps for 10ms.  Here's a
representative result when running from the root cgroup.

 # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
 Latency percentiles (usec)
	 50.0000th: 26
	 75.0000th: 62
	 90.0000th: 74
	 95.0000th: 86
	 *99.0000th: 887
	 99.5000th: 3692
	 99.9000th: 10832
	 min=0, max=13374

The following is inside a first level CPU cgroup with the maximum
weight.

 # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
 Latency percentiles (usec)
	 50.0000th: 31
	 75.0000th: 65
	 90.0000th: 71
	 95.0000th: 91
	 *99.0000th: 7288
	 99.5000th: 10352
	 99.9000th: 12496
	 min=0, max=13023

Note the drastic increase in p99 scheduling latency.  After
investigation, it turned out that update_sd_lb_stats(), which is
used by load_balance() to pick the most loaded group, was often
picking the wrong group.  A CPU which has one schbench running and
another queued wouldn't report the correspondingly higher
weighted_cpuload() and would get passed over as the target of load
balancing.

weighted_cpuload() is the root cfs_rq's runnable_load_avg, which is the
sum of the load_avg of all queued sched_entities.  Without cgroups or
at the root cgroup, each task's load_avg contributes directly to the
sum.  When a task wakes up or goes to sleep, the change is immediately
reflected in runnable_load_avg, which in turn affects load balancing.

However, when CPU cgroup is in use, a nesting cfs_rq blocks this
immediate reflection.  When a task wakes up inside a cgroup, the
nested cfs_rq's runnable_load_avg is updated but doesn't get
propagated to the parent.  It only affects the matching sched_entity's
load_avg over time, which then gets propagated to the runnable_load_avg
at that level.  This makes weighted_cpuload() often temporarily out of
sync, leading to suboptimal behavior of load_balance() and the increase
in scheduling latencies shown above.
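
As a toy model of the difference (plain C with made-up structures
standing in for the root and a nested cfs_rq; the real code scales the
delta by shares and goes through PELT, both left out here):

  #include <stdio.h>

  struct toy_cfs_rq {
  	long runnable_load_avg;
  };

  /* current behavior: a wakeup only bumps the nested cfs_rq; the root
   * catches up only as the group se's load_avg is re-averaged over time */
  static void enqueue_nested_only(struct toy_cfs_rq *root,
  				struct toy_cfs_rq *nested, long load)
  {
  	(void)root;
  	nested->runnable_load_avg += load;
  }

  /* with the patch: the change is pushed up to the root right away
   * (the real code scales it by shares on the way up) */
  static void enqueue_propagated(struct toy_cfs_rq *root,
  			       struct toy_cfs_rq *nested, long load)
  {
  	nested->runnable_load_avg += load;
  	root->runnable_load_avg += load;
  }

  int main(void)
  {
  	struct toy_cfs_rq root = { 0 }, nested = { 0 };

  	enqueue_nested_only(&root, &nested, 1024);
  	printf("no propagation: root=%ld nested=%ld\n",
  	       root.runnable_load_avg, nested.runnable_load_avg);

  	root.runnable_load_avg = nested.runnable_load_avg = 0;
  	enqueue_propagated(&root, &nested, 1024);
  	printf("propagated:     root=%ld nested=%ld\n",
  	       root.runnable_load_avg, nested.runnable_load_avg);
  	return 0;
  }

weighted_cpuload() reads the root value, so in the first case
load_balance() is working from a number that hasn't caught up yet.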

This patch fixes the issue by updating propagate_entity_load_avg() to
always propagate to the parent's runnable_load_avg.  Combined with the
previous patch, this keeps a cfs_rq's runnable_load_avg always equal to
the sum of the scaled loads of all tasks queued below it, removing the
artifacts from nested cfs_rqs.  The following is from inside three levels of
nesting with the patch applied.

 # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
 Latency percentiles (usec)
	 50.0000th: 40
	 75.0000th: 71
	 90.0000th: 89
	 95.0000th: 108
	 *99.0000th: 679
	 99.5000th: 3500
	 99.9000th: 10960
	 min=0, max=13790

[1] git://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Turner <pjt@google.com>
Cc: Chris Mason <clm@fb.com>
---
 kernel/sched/fair.c |   34 +++++++++++++++++++++-------------
 1 file changed, 21 insertions(+), 13 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3075,7 +3075,8 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
 
 /* Take into account change of load of a child task group */
 static inline void
-update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se,
+		   bool propagate_avg)
 {
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
 	long load = 0, delta;
@@ -3113,9 +3114,11 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq
 	se->avg.load_avg = load;
 	se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;
 
-	/* Update parent cfs_rq load */
-	add_positive(&cfs_rq->avg.load_avg, delta);
-	cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
+	if (propagate_avg) {
+		/* Update parent cfs_rq load */
+		add_positive(&cfs_rq->avg.load_avg, delta);
+		cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
+	}
 
 	/*
 	 * If the sched_entity is already enqueued, we also have to update the
@@ -3147,22 +3150,27 @@ static inline int test_and_clear_tg_cfs_
 /* Update task and its cfs_rq load average */
 static inline int propagate_entity_load_avg(struct sched_entity *se)
 {
-	struct cfs_rq *cfs_rq;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	bool propagate_avg;
 
 	if (entity_is_task(se))
 		return 0;
 
-	if (!test_and_clear_tg_cfs_propagate(se))
-		return 0;
-
-	cfs_rq = cfs_rq_of(se);
+	propagate_avg = test_and_clear_tg_cfs_propagate(se);
 
-	set_tg_cfs_propagate(cfs_rq);
+	/*
+	 * We want to keep @cfs_rq->runnable_load_avg always in sync so
+	 * that the load balancer can accurately determine the busiest CPU
+	 * regardless of cfs_rq nesting.
+	 */
+	update_tg_cfs_load(cfs_rq, se, propagate_avg);
 
-	update_tg_cfs_util(cfs_rq, se);
-	update_tg_cfs_load(cfs_rq, se);
+	if (propagate_avg) {
+		set_tg_cfs_propagate(cfs_rq);
+		update_tg_cfs_util(cfs_rq, se);
+	}
 
-	return 1;
+	return propagate_avg;
 }
 
 #else /* CONFIG_FAIR_GROUP_SCHED */

* [PATCH v2 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
  2017-04-24 20:14 ` [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity Tejun Heo
@ 2017-04-24 21:33   ` Tejun Heo
  2017-05-03 18:00     ` Peter Zijlstra
  2017-04-25  8:35   ` [PATCH " Vincent Guittot
  2017-04-26 16:14   ` Vincent Guittot
  2 siblings, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-04-24 21:33 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Linus Torvalds, Vincent Guittot, Mike Galbraith,
	Paul Turner, Chris Mason, kernel-team

09a43ace1f98 ("sched/fair: Propagate load during synchronous
attach/detach") added immediate load propagation from cfs_rq to its
sched_entity then to the parent cfs_rq; however, what gets propagated
doesn't seem to make sense.

It repeats the tg_weight calculation done in calc_cfs_shares() but
only uses it to compensate for shares being out of date.  After that,
it sets the sched_entity's load_avg to the load_avg of the
corresponding cfs_rq.

This doesn't make sense as the cfs_rq's load_avg is some fraction of
its total weight, which the sched_entity's weight has nothing to do with.
For example, if the cfs_rq has a single task with a constant load of 1,
the cfs_rq's load_avg would be around 1.  If that cfs_rq's sched_entity
is the only active one in the parent cfs_rq and has the maximum weight,
the sched_entity's load should be around the maximum weight, but
update_tg_cfs_load() ends up overriding it to 1.

At the parent's level, the absolute value of load_avg inside a child
cfs_rq doesn't mean anything.  Only the ratio against its weight is
meaningful.

This patch changes update_tg_cfs_load() to normalize the
runnable_load_avg of the cfs_rq and then scale it to the matching
sched_entity's freshly calculated shares for propagation.  Use of
runnable_load_avg instead of load_avg is intentional and keeps the
parent's runnable_load_avg true to the sum of the scaled loads of all
tasks queued under it, which is critical for the correct operation
of the load balancer.  The next patch will depend on it.

v2: Use min_t() to squash a build warning.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Turner <pjt@google.com>
---
 kernel/sched/fair.c |   47 ++++++++++++++++++++---------------------------
 1 file changed, 20 insertions(+), 27 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3078,37 +3078,30 @@ static inline void
 update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
-	long delta, load = gcfs_rq->avg.load_avg;
+	long load = 0, delta;
 
 	/*
-	 * If the load of group cfs_rq is null, the load of the
-	 * sched_entity will also be null so we can skip the formula
+	 * A cfs_rq's load avg contribution to the parent should be scaled
+	 * to the sched_entity's weight.  Use freshly calculated shares
+	 * instead of @se->load.weight as the latter may not reflect
+	 * changes from the current scheduling operation.
+	 *
+	 * Note that the propagation source is runnable_load_avg instead of
+	 * load_avg.  This keeps every cfs_rq's runnable_load_avg true to
+	 * the sum of the scaled loads of all tasks queued under it, which
+	 * is important for the correct operation of the load balancer.
+	 *
+	 * This can make the sched_entity's load_avg jumpier but that
+	 * correctly reflects what would happen without cgroups if each
+	 * task's load is scaled across nesting - the load is being
+	 * averaged at the task and each cfs_rq.
 	 */
-	if (load) {
-		long tg_load;
+	if (gcfs_rq->load.weight) {
+		long shares = calc_cfs_shares(gcfs_rq, gcfs_rq->tg);
 
-		/* Get tg's load and ensure tg_load > 0 */
-		tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
-
-		/* Ensure tg_load >= load and updated with current load*/
-		tg_load -= gcfs_rq->tg_load_avg_contrib;
-		tg_load += load;
-
-		/*
-		 * We need to compute a correction term in the case that the
-		 * task group is consuming more CPU than a task of equal
-		 * weight. A task with a weight equals to tg->shares will have
-		 * a load less or equal to scale_load_down(tg->shares).
-		 * Similarly, the sched_entities that represent the task group
-		 * at parent level, can't have a load higher than
-		 * scale_load_down(tg->shares). And the Sum of sched_entities'
-		 * load must be <= scale_load_down(tg->shares).
-		 */
-		if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
-			/* scale gcfs_rq's load into tg's shares*/
-			load *= scale_load_down(gcfs_rq->tg->shares);
-			load /= tg_load;
-		}
+		load = min_t(long, shares,
+			     gcfs_rq->runnable_load_avg *
+			     shares / gcfs_rq->load.weight);
 	}
 
 	delta = load - se->avg.load_avg;

* [PATCH 3/2] sched/fair: Skip __update_load_avg() on cfs_rq sched_entities
  2017-04-24 20:13 [RFC PATCHSET] sched/fair: fix load balancer behavior when cgroup is in use Tejun Heo
  2017-04-24 20:14 ` [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity Tejun Heo
  2017-04-24 20:14 ` [PATCH 2/2] sched/fair: Always propagate runnable_load_avg Tejun Heo
@ 2017-04-24 21:35 ` Tejun Heo
  2017-04-24 21:48   ` Peter Zijlstra
  2017-04-25 21:09   ` Tejun Heo
  2 siblings, 2 replies; 69+ messages in thread
From: Tejun Heo @ 2017-04-24 21:35 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Linus Torvalds, Vincent Guittot, Mike Galbraith,
	Paul Turner, Chris Mason, kernel-team

Now that a cfs_rq sched_entity's load_avg always gets propagated from
the associated cfs_rq, there's no point in calling __update_load_avg()
on it.  The two mechanisms compete with each other and we'd always be
using a value close to the propagated one anyway.

Skip __update_load_avg() for cfs_rq sched_entities.  Also, relocate
propagate_entity_load_avg() to signify that propagation is the
counterpart to __update_load_avg() for cfs_rq sched_entities.  This
puts the propagation before update_cfs_rq_load_avg() which shouldn't
disturb anything.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Turner <pjt@google.com>
---
Hello,

A follow-up patch.  This removes __update_load_avg() on cfs_rq se's as
the value is now constantly kept in sync from the cfs_rq.  The patch
doesn't cause any noticeable changes in tests.

Thanks.

 kernel/sched/fair.c |   16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3293,20 +3293,22 @@ static inline void update_load_avg(struc
 	u64 now = cfs_rq_clock_task(cfs_rq);
 	struct rq *rq = rq_of(cfs_rq);
 	int cpu = cpu_of(rq);
-	int decayed;
+	int decayed = 0;
 
 	/*
 	 * Track task load average for carrying it to new CPU after migrated, and
 	 * track group sched_entity load average for task_h_load calc in migration
 	 */
-	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) {
-		__update_load_avg(now, cpu, &se->avg,
-			  se->on_rq * scale_load_down(se->load.weight),
-			  cfs_rq->curr == se, NULL);
+	if (entity_is_task(se)) {
+		if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
+			__update_load_avg(now, cpu, &se->avg,
+					  se->on_rq * scale_load_down(se->load.weight),
+					  cfs_rq->curr == se, NULL);
+	} else {
+		decayed |= propagate_entity_load_avg(se);
 	}
 
-	decayed  = update_cfs_rq_load_avg(now, cfs_rq, true);
-	decayed |= propagate_entity_load_avg(se);
+	decayed |= update_cfs_rq_load_avg(now, cfs_rq, true);
 
 	if (decayed && (flags & UPDATE_TG))
 		update_tg_load_avg(cfs_rq, 0);

* Re: [PATCH 3/2] sched/fair: Skip __update_load_avg() on cfs_rq sched_entities
  2017-04-24 21:35 ` [PATCH 3/2] sched/fair: Skip __update_load_avg() on cfs_rq sched_entities Tejun Heo
@ 2017-04-24 21:48   ` Peter Zijlstra
  2017-04-24 22:54     ` Tejun Heo
  2017-04-25 21:09   ` Tejun Heo
  1 sibling, 1 reply; 69+ messages in thread
From: Peter Zijlstra @ 2017-04-24 21:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Vincent Guittot,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On Mon, Apr 24, 2017 at 02:35:28PM -0700, Tejun Heo wrote:
> -	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) {
> -		__update_load_avg(now, cpu, &se->avg,
> -			  se->on_rq * scale_load_down(se->load.weight),
> -			  cfs_rq->curr == se, NULL);
> +	if (entity_is_task(se)) {
> +		if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
> +			__update_load_avg(now, cpu, &se->avg,
> +					  se->on_rq * scale_load_down(se->load.weight),
> +					  cfs_rq->curr == se, NULL);

I've not looked at these patches yet, but you've been patching old code.
__update_load_avg() no longer exists (the conversion shouldn't be too
hard, it's mostly been a restructure/rename thing).

* Re: [PATCH 3/2] sched/fair: Skip __update_load_avg() on cfs_rq sched_entities
  2017-04-24 21:48   ` Peter Zijlstra
@ 2017-04-24 22:54     ` Tejun Heo
  0 siblings, 0 replies; 69+ messages in thread
From: Tejun Heo @ 2017-04-24 22:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Vincent Guittot,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hello, Peter.

On Mon, Apr 24, 2017 at 11:48:59PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 24, 2017 at 02:35:28PM -0700, Tejun Heo wrote:
> > -	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) {
> > -		__update_load_avg(now, cpu, &se->avg,
> > -			  se->on_rq * scale_load_down(se->load.weight),
> > -			  cfs_rq->curr == se, NULL);
> > +	if (entity_is_task(se)) {
> > +		if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
> > +			__update_load_avg(now, cpu, &se->avg,
> > +					  se->on_rq * scale_load_down(se->load.weight),
> > +					  cfs_rq->curr == se, NULL);
> 
> I've not looked at these patches yet, but you've been patching old code.
> __update_load_avg() no longer exists (the conversion shouldn't be too
> hard, its mostly been a restructure/rename thing).

Ah, sure.  The patchset still being RFC, I wanted to post the version
I was working with.  If you want the patchset refreshed now, please
let me know.

Thanks.

-- 
tejun

* Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
  2017-04-24 20:14 ` [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity Tejun Heo
  2017-04-24 21:33   ` [PATCH v2 " Tejun Heo
@ 2017-04-25  8:35   ` Vincent Guittot
  2017-04-25 18:12     ` Tejun Heo
  2017-04-26 16:14   ` Vincent Guittot
  2 siblings, 1 reply; 69+ messages in thread
From: Vincent Guittot @ 2017-04-25  8:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hi Tejun

On 24 April 2017 at 22:14, Tejun Heo <tj@kernel.org> wrote:
> 09a43ace1f98 ("sched/fair: Propagate load during synchronous
> attach/detach") added immediate load propagation from cfs_rq to its
> sched_entity then to the parent cfs_rq; however, what gets propagated
> doesn't seem to make sense.
>
> It repeats the tg_weight calculation done in calc_cfs_shares() but
> only uses it to compensate for shares being out of date.  After that,
> it sets the sched_entity's load_avg to the load_avg of the
> corresponding cfs_rq.
>
> This doesn't make sense as the cfs_rq's load_avg is some fraction of
> its total weight, which the sched_entity's weight has nothing to with.
> For example, if the cfs_rq has a single constant load 1 task the
> cfs_rq's load_avg would be around 1.  If that cfs_rq is the only
> active sched_entity in the parent cfs_rq which has the maximum weight,
> the sched_entity's load should be around the maximum weight but
> update_tg_cfs_load() ends up overriding it to 1.

I'm not sure I catch your example:
a task TA with a load_avg = 1 is the only task in a task group GB, so
the cfs_rq's load_avg = 1 too, and the group_entity of this cfs_rq has
a weight of 1024 (I use the 10-bit format for readability), which is
the total share of task group GB.

Are you saying that the group_entity load_avg should be around 1024 and not 1?

I would say it depends on TA's weight.  I assume that TA's weight is the
default value (1024) as you don't specify any value in your example.

If TA ran directly at the parent level, its sched_entity would have a
load_avg of 1, so why should the group entity's load_avg be 1024?  It
would just temporarily show the cfs_rq as more loaded than it really is,
and in the end the group entity's load_avg would go back to 1.

>
> At the parent's level, the absolute value of load_avg inside a child
> cfs_rq doesn't mean anything.  Only the ratio against its weight is
> meaningful.
>
> This patch changes update_tg_cfs_load() to normalize the
> runnable_load_avg of the cfs_rq and then scale it to the matching
> sched_entity's freshly calculated shares for propagation.  Use of
> runnable_load_avg instead of load_avg is intentional and keeps the
> parent's runnable_load_avg true to the sum of scaled loads of all
> tasks queued under it which is critical for the correction operation
> of load balancer.  The next patch will depend on it.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Paul Turner <pjt@google.com>
> ---
>  kernel/sched/fair.c |   46 +++++++++++++++++++---------------------------
>  1 file changed, 19 insertions(+), 27 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3078,37 +3078,29 @@ static inline void
>  update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>         struct cfs_rq *gcfs_rq = group_cfs_rq(se);
> -       long delta, load = gcfs_rq->avg.load_avg;
> +       long load = 0, delta;
>
>         /*
> -        * If the load of group cfs_rq is null, the load of the
> -        * sched_entity will also be null so we can skip the formula
> +        * A cfs_rq's load avg contribution to the parent should be scaled
> +        * to the sched_entity's weight.  Use freshly calculated shares
> +        * instead of @se->load.weight as the latter may not reflect
> +        * changes from the current scheduling operation.
> +        *
> +        * Note that the propagation source is runnable_load_avg instead of
> +        * load_avg.  This keeps every cfs_rq's runnable_load_avg true to
> +        * the sum of the scaled loads of all tasks queued under it, which
> +        * is important for the correct operation of the load balancer.
> +        *
> +        * This can make the sched_entity's load_avg jumpier but that
> +        * correctly reflects what would happen without cgroups if each
> +        * task's load is scaled across nesting - the load is being
> +        * averaged at the task and each cfs_rq.
>          */
> -       if (load) {
> -               long tg_load;
> +       if (gcfs_rq->load.weight) {
> +               long shares = calc_cfs_shares(gcfs_rq, gcfs_rq->tg);
>
> -               /* Get tg's load and ensure tg_load > 0 */
> -               tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
> -
> -               /* Ensure tg_load >= load and updated with current load*/
> -               tg_load -= gcfs_rq->tg_load_avg_contrib;
> -               tg_load += load;
> -
> -               /*
> -                * We need to compute a correction term in the case that the
> -                * task group is consuming more CPU than a task of equal
> -                * weight. A task with a weight equals to tg->shares will have
> -                * a load less or equal to scale_load_down(tg->shares).
> -                * Similarly, the sched_entities that represent the task group
> -                * at parent level, can't have a load higher than
> -                * scale_load_down(tg->shares). And the Sum of sched_entities'
> -                * load must be <= scale_load_down(tg->shares).
> -                */
> -               if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
> -                       /* scale gcfs_rq's load into tg's shares*/
> -                       load *= scale_load_down(gcfs_rq->tg->shares);
> -                       load /= tg_load;
> -               }
> +               load = min(gcfs_rq->runnable_load_avg *
> +                          shares / gcfs_rq->load.weight, shares);
>         }
>
>         delta = load - se->avg.load_avg;

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-04-24 20:14 ` [PATCH 2/2] sched/fair: Always propagate runnable_load_avg Tejun Heo
@ 2017-04-25  8:46   ` Vincent Guittot
  2017-04-25  9:05     ` Vincent Guittot
  2017-04-26 18:12   ` Vincent Guittot
  1 sibling, 1 reply; 69+ messages in thread
From: Vincent Guittot @ 2017-04-25  8:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On 24 April 2017 at 22:14, Tejun Heo <tj@kernel.org> wrote:
> We noticed that with cgroup CPU controller in use, the scheduling
> latency gets wonky regardless of nesting level or weight
> configuration.  This is easily reproducible with Chris Mason's
> schbench[1].
>
> All tests are run on a single socket, 16 cores, 32 threads machine.
> While the machine is mostly idle, it isn't completey.  There's always
> some variable management load going on.  The command used is
>
>  schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>
> which measures scheduling latency with 32 threads each of which
> repeatedly runs for 15ms and then sleeps for 10ms.  Here's a
> representative result when running from the root cgroup.
>
>  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>  Latency percentiles (usec)
>          50.0000th: 26
>          75.0000th: 62
>          90.0000th: 74
>          95.0000th: 86
>          *99.0000th: 887
>          99.5000th: 3692
>          99.9000th: 10832
>          min=0, max=13374
>
> The following is inside a first level CPU cgroup with the maximum
> weight.
>
>  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>  Latency percentiles (usec)
>          50.0000th: 31
>          75.0000th: 65
>          90.0000th: 71
>          95.0000th: 91
>          *99.0000th: 7288
>          99.5000th: 10352
>          99.9000th: 12496
>          min=0, max=13023
>
> Note the drastic increase in p99 scheduling latency.  After
> investigation, it turned out that the update_sd_lb_stats(), which is
> used by load_balance() to pick the most loaded group, was often
> picking the wrong group.  A CPU which has one schbench running and
> another queued wouldn't report the correspondingly higher
> weighted_cpuload() and get looked over as the target of load
> balancing.
>
> weighted_cpuload() is the root cfs_rq's runnable_load_avg which is the
> sum of the load_avg of all queued sched_entities.  Without cgroups or
> at the root cgroup, each task's load_avg contributes directly to the
> sum.  When a task wakes up or goes to sleep, the change is immediately
> reflected on runnable_load_avg which in turn affects load balancing.
>
> However, when CPU cgroup is in use, a nesting cfs_rq blocks this
> immediate reflection.  When a task wakes up inside a cgroup, the
> nested cfs_rq's runnable_load_avg is updated but doesn't get
> propagated to the parent.  It only affects the matching sched_entity's
> load_avg over time which then gets propagated to the runnable_load_avg
> at that level.  This makes weighted_cpuload() often temporarily out of
> sync leading to suboptimal behavior of load_balance() and increase in
> scheduling latencies as shown above.
>
> This patch fixes the issue by updating propagate_entity_load_avg() to
> always propagate to the parent's runnable_load_avg.  Combined with the
> previous patch, this keeps a cfs_rq's runnable_load_avg always the sum
> of the scaled loads of all tasks queued below removing the artifacts
> from nesting cfs_rqs.  The following is from inside three levels of
> nesting with the patch applied.

So you are changing the purpose of propagate_entity_load_avg(), which
aims to propagate load_avg/util_avg changes only when a task migrates,
and you also want to propagate the enqueue/dequeue into the parent
cfs_rq->runnable_load_avg.

>
>  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>  Latency percentiles (usec)
>          50.0000th: 40
>          75.0000th: 71
>          90.0000th: 89
>          95.0000th: 108
>          *99.0000th: 679
>          99.5000th: 3500
>          99.9000th: 10960
>          min=0, max=13790
>
> [1] git://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Paul Turner <pjt@google.com>
> Cc: Chris Mason <clm@fb.com>
> ---
>  kernel/sched/fair.c |   34 +++++++++++++++++++++-------------
>  1 file changed, 21 insertions(+), 13 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3075,7 +3075,8 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
>
>  /* Take into account change of load of a child task group */
>  static inline void
> -update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se,
> +                  bool propagate_avg)
>  {
>         struct cfs_rq *gcfs_rq = group_cfs_rq(se);
>         long load = 0, delta;
> @@ -3113,9 +3114,11 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq
>         se->avg.load_avg = load;
>         se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;
>
> -       /* Update parent cfs_rq load */
> -       add_positive(&cfs_rq->avg.load_avg, delta);
> -       cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
> +       if (propagate_avg) {
> +               /* Update parent cfs_rq load */
> +               add_positive(&cfs_rq->avg.load_avg, delta);
> +               cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
> +       }
>
>         /*
>          * If the sched_entity is already enqueued, we also have to update the
> @@ -3147,22 +3150,27 @@ static inline int test_and_clear_tg_cfs_
>  /* Update task and its cfs_rq load average */
>  static inline int propagate_entity_load_avg(struct sched_entity *se)
>  {
> -       struct cfs_rq *cfs_rq;
> +       struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +       bool propagate_avg;
>
>         if (entity_is_task(se))
>                 return 0;
>
> -       if (!test_and_clear_tg_cfs_propagate(se))
> -               return 0;
> -
> -       cfs_rq = cfs_rq_of(se);
> +       propagate_avg = test_and_clear_tg_cfs_propagate(se);
>
> -       set_tg_cfs_propagate(cfs_rq);
> +       /*
> +        * We want to keep @cfs_rq->runnable_load_avg always in sync so
> +        * that the load balancer can accurately determine the busiest CPU
> +        * regardless of cfs_rq nesting.
> +        */
> +       update_tg_cfs_load(cfs_rq, se, propagate_avg);
>
> -       update_tg_cfs_util(cfs_rq, se);
> -       update_tg_cfs_load(cfs_rq, se);
> +       if (propagate_avg) {
> +               set_tg_cfs_propagate(cfs_rq);
> +               update_tg_cfs_util(cfs_rq, se);
> +       }
>
> -       return 1;
> +       return propagate_avg;
>  }
>
>  #else /* CONFIG_FAIR_GROUP_SCHED */

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-04-25  8:46   ` Vincent Guittot
@ 2017-04-25  9:05     ` Vincent Guittot
  2017-04-25 12:59       ` Vincent Guittot
  0 siblings, 1 reply; 69+ messages in thread
From: Vincent Guittot @ 2017-04-25  9:05 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On 25 April 2017 at 10:46, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> On 24 April 2017 at 22:14, Tejun Heo <tj@kernel.org> wrote:
>> We noticed that with cgroup CPU controller in use, the scheduling
>> latency gets wonky regardless of nesting level or weight
>> configuration.  This is easily reproducible with Chris Mason's
>> schbench[1].
>>
>> All tests are run on a single socket, 16 cores, 32 threads machine.
>> While the machine is mostly idle, it isn't completey.  There's always
>> some variable management load going on.  The command used is
>>
>>  schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>
>> which measures scheduling latency with 32 threads each of which
>> repeatedly runs for 15ms and then sleeps for 10ms.  Here's a
>> representative result when running from the root cgroup.
>>
>>  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>  Latency percentiles (usec)
>>          50.0000th: 26
>>          75.0000th: 62
>>          90.0000th: 74
>>          95.0000th: 86
>>          *99.0000th: 887
>>          99.5000th: 3692
>>          99.9000th: 10832
>>          min=0, max=13374
>>
>> The following is inside a first level CPU cgroup with the maximum
>> weight.
>>
>>  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>  Latency percentiles (usec)
>>          50.0000th: 31
>>          75.0000th: 65
>>          90.0000th: 71
>>          95.0000th: 91
>>          *99.0000th: 7288
>>          99.5000th: 10352
>>          99.9000th: 12496
>>          min=0, max=13023
>>
>> Note the drastic increase in p99 scheduling latency.  After
>> investigation, it turned out that the update_sd_lb_stats(), which is
>> used by load_balance() to pick the most loaded group, was often
>> picking the wrong group.  A CPU which has one schbench running and
>> another queued wouldn't report the correspondingly higher
>> weighted_cpuload() and get looked over as the target of load
>> balancing.
>>
>> weighted_cpuload() is the root cfs_rq's runnable_load_avg which is the
>> sum of the load_avg of all queued sched_entities.  Without cgroups or
>> at the root cgroup, each task's load_avg contributes directly to the
>> sum.  When a task wakes up or goes to sleep, the change is immediately
>> reflected on runnable_load_avg which in turn affects load balancing.
>>
>> However, when CPU cgroup is in use, a nesting cfs_rq blocks this
>> immediate reflection.  When a task wakes up inside a cgroup, the
>> nested cfs_rq's runnable_load_avg is updated but doesn't get
>> propagated to the parent.  It only affects the matching sched_entity's
>> load_avg over time which then gets propagated to the runnable_load_avg
>> at that level.  This makes weighted_cpuload() often temporarily out of
>> sync leading to suboptimal behavior of load_balance() and increase in
>> scheduling latencies as shown above.
>>
>> This patch fixes the issue by updating propagate_entity_load_avg() to
>> always propagate to the parent's runnable_load_avg.  Combined with the
>> previous patch, this keeps a cfs_rq's runnable_load_avg always the sum
>> of the scaled loads of all tasks queued below removing the artifacts
>> from nesting cfs_rqs.  The following is from inside three levels of
>> nesting with the patch applied.
>
> So you are changing the purpose of propagate_entity_load_avg which
> aims to propagate load_avg/util_avg changes only when a task migrate
> and you also want to propagate the enqueue/dequeue in the parent
> cfs_rq->runnable_load_avg

In fact you want the sched_entity's load_avg to reflect
cfs_rq->runnable_load_avg and not cfs_rq->avg.load_avg.

>
>>
>>  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>  Latency percentiles (usec)
>>          50.0000th: 40
>>          75.0000th: 71
>>          90.0000th: 89
>>          95.0000th: 108
>>          *99.0000th: 679
>>          99.5000th: 3500
>>          99.9000th: 10960
>>          min=0, max=13790
>>
>> [1] git://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git
>>
>> Signed-off-by: Tejun Heo <tj@kernel.org>
>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Mike Galbraith <efault@gmx.de>
>> Cc: Paul Turner <pjt@google.com>
>> Cc: Chris Mason <clm@fb.com>
>> ---
>>  kernel/sched/fair.c |   34 +++++++++++++++++++++-------------
>>  1 file changed, 21 insertions(+), 13 deletions(-)
>>
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3075,7 +3075,8 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
>>
>>  /* Take into account change of load of a child task group */
>>  static inline void
>> -update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
>> +update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se,
>> +                  bool propagate_avg)
>>  {
>>         struct cfs_rq *gcfs_rq = group_cfs_rq(se);
>>         long load = 0, delta;
>> @@ -3113,9 +3114,11 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq
>>         se->avg.load_avg = load;
>>         se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;
>>
>> -       /* Update parent cfs_rq load */
>> -       add_positive(&cfs_rq->avg.load_avg, delta);
>> -       cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
>> +       if (propagate_avg) {
>> +               /* Update parent cfs_rq load */
>> +               add_positive(&cfs_rq->avg.load_avg, delta);
>> +               cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
>> +       }
>>
>>         /*
>>          * If the sched_entity is already enqueued, we also have to update the
>> @@ -3147,22 +3150,27 @@ static inline int test_and_clear_tg_cfs_
>>  /* Update task and its cfs_rq load average */
>>  static inline int propagate_entity_load_avg(struct sched_entity *se)
>>  {
>> -       struct cfs_rq *cfs_rq;
>> +       struct cfs_rq *cfs_rq = cfs_rq_of(se);
>> +       bool propagate_avg;
>>
>>         if (entity_is_task(se))
>>                 return 0;
>>
>> -       if (!test_and_clear_tg_cfs_propagate(se))
>> -               return 0;
>> -
>> -       cfs_rq = cfs_rq_of(se);
>> +       propagate_avg = test_and_clear_tg_cfs_propagate(se);
>>
>> -       set_tg_cfs_propagate(cfs_rq);
>> +       /*
>> +        * We want to keep @cfs_rq->runnable_load_avg always in sync so
>> +        * that the load balancer can accurately determine the busiest CPU
>> +        * regardless of cfs_rq nesting.
>> +        */
>> +       update_tg_cfs_load(cfs_rq, se, propagate_avg);
>>
>> -       update_tg_cfs_util(cfs_rq, se);
>> -       update_tg_cfs_load(cfs_rq, se);
>> +       if (propagate_avg) {
>> +               set_tg_cfs_propagate(cfs_rq);
>> +               update_tg_cfs_util(cfs_rq, se);
>> +       }
>>
>> -       return 1;
>> +       return propagate_avg;
>>  }
>>
>>  #else /* CONFIG_FAIR_GROUP_SCHED */

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-04-25  9:05     ` Vincent Guittot
@ 2017-04-25 12:59       ` Vincent Guittot
  2017-04-25 18:49         ` Tejun Heo
  0 siblings, 1 reply; 69+ messages in thread
From: Vincent Guittot @ 2017-04-25 12:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On 25 April 2017 at 11:05, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> On 25 April 2017 at 10:46, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>> On 24 April 2017 at 22:14, Tejun Heo <tj@kernel.org> wrote:
>>> We noticed that with cgroup CPU controller in use, the scheduling
>>> latency gets wonky regardless of nesting level or weight
>>> configuration.  This is easily reproducible with Chris Mason's
>>> schbench[1].
>>>
>>> All tests are run on a single socket, 16 cores, 32 threads machine.
>>> While the machine is mostly idle, it isn't completey.  There's always
>>> some variable management load going on.  The command used is
>>>
>>>  schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>>
>>> which measures scheduling latency with 32 threads each of which
>>> repeatedly runs for 15ms and then sleeps for 10ms.  Here's a
>>> representative result when running from the root cgroup.
>>>
>>>  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>>  Latency percentiles (usec)
>>>          50.0000th: 26
>>>          75.0000th: 62
>>>          90.0000th: 74
>>>          95.0000th: 86
>>>          *99.0000th: 887
>>>          99.5000th: 3692
>>>          99.9000th: 10832
>>>          min=0, max=13374
>>>
>>> The following is inside a first level CPU cgroup with the maximum
>>> weight.
>>>
>>>  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>>  Latency percentiles (usec)
>>>          50.0000th: 31
>>>          75.0000th: 65
>>>          90.0000th: 71
>>>          95.0000th: 91
>>>          *99.0000th: 7288
>>>          99.5000th: 10352
>>>          99.9000th: 12496
>>>          min=0, max=13023
>>>
>>> Note the drastic increase in p99 scheduling latency.  After
>>> investigation, it turned out that the update_sd_lb_stats(), which is
>>> used by load_balance() to pick the most loaded group, was often
>>> picking the wrong group.  A CPU which has one schbench running and
>>> another queued wouldn't report the correspondingly higher
>>> weighted_cpuload() and get looked over as the target of load
>>> balancing.
>>>
>>> weighted_cpuload() is the root cfs_rq's runnable_load_avg which is the
>>> sum of the load_avg of all queued sched_entities.  Without cgroups or
>>> at the root cgroup, each task's load_avg contributes directly to the
>>> sum.  When a task wakes up or goes to sleep, the change is immediately
>>> reflected on runnable_load_avg which in turn affects load balancing.
>>>
>>> However, when CPU cgroup is in use, a nesting cfs_rq blocks this
>>> immediate reflection.  When a task wakes up inside a cgroup, the
>>> nested cfs_rq's runnable_load_avg is updated but doesn't get
>>> propagated to the parent.  It only affects the matching sched_entity's
>>> load_avg over time which then gets propagated to the runnable_load_avg
>>> at that level.  This makes weighted_cpuload() often temporarily out of
>>> sync leading to suboptimal behavior of load_balance() and increase in
>>> scheduling latencies as shown above.
>>>
>>> This patch fixes the issue by updating propagate_entity_load_avg() to
>>> always propagate to the parent's runnable_load_avg.  Combined with the
>>> previous patch, this keeps a cfs_rq's runnable_load_avg always the sum
>>> of the scaled loads of all tasks queued below removing the artifacts
>>> from nesting cfs_rqs.  The following is from inside three levels of
>>> nesting with the patch applied.
>>
>> So you are changing the purpose of propagate_entity_load_avg which
>> aims to propagate load_avg/util_avg changes only when a task migrate
>> and you also want to propagate the enqueue/dequeue in the parent
>> cfs_rq->runnable_load_avg
>
> In fact you want that sched_entity load_avg reflects
> cfs_rq->runnable_load_avg and not cfs_rq->avg.load_avg

I have run a quick test with your patches and schbench on my platform.
I haven't been able to reproduce your regression, but my platform is
quite different from yours (only 8 cores without SMT).
More importantly, the parent cfs_rq->runnable_load_avg never
reaches 0 (or almost 0) when it is idle.  Instead, it still has a
runnable_load_avg (this is not due to rounding in the computation),
whereas runnable_load_avg should be 0.

Just out of curiosity, is your regression still there if you disable
SMT/hyperthreading on your platform?

Regards,
Vincent

>
>>
>>>
>>>  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>>  Latency percentiles (usec)
>>>          50.0000th: 40
>>>          75.0000th: 71
>>>          90.0000th: 89
>>>          95.0000th: 108
>>>          *99.0000th: 679
>>>          99.5000th: 3500
>>>          99.9000th: 10960
>>>          min=0, max=13790
>>>
>>> [1] git://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git
>>>
>>> Signed-off-by: Tejun Heo <tj@kernel.org>
>>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>>> Cc: Ingo Molnar <mingo@redhat.com>
>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>> Cc: Mike Galbraith <efault@gmx.de>
>>> Cc: Paul Turner <pjt@google.com>
>>> Cc: Chris Mason <clm@fb.com>
>>> ---
>>>  kernel/sched/fair.c |   34 +++++++++++++++++++++-------------
>>>  1 file changed, 21 insertions(+), 13 deletions(-)
>>>
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -3075,7 +3075,8 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
>>>
>>>  /* Take into account change of load of a child task group */
>>>  static inline void
>>> -update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>> +update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se,
>>> +                  bool propagate_avg)
>>>  {
>>>         struct cfs_rq *gcfs_rq = group_cfs_rq(se);
>>>         long load = 0, delta;
>>> @@ -3113,9 +3114,11 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq
>>>         se->avg.load_avg = load;
>>>         se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;
>>>
>>> -       /* Update parent cfs_rq load */
>>> -       add_positive(&cfs_rq->avg.load_avg, delta);
>>> -       cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
>>> +       if (propagate_avg) {
>>> +               /* Update parent cfs_rq load */
>>> +               add_positive(&cfs_rq->avg.load_avg, delta);
>>> +               cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
>>> +       }
>>>
>>>         /*
>>>          * If the sched_entity is already enqueued, we also have to update the
>>> @@ -3147,22 +3150,27 @@ static inline int test_and_clear_tg_cfs_
>>>  /* Update task and its cfs_rq load average */
>>>  static inline int propagate_entity_load_avg(struct sched_entity *se)
>>>  {
>>> -       struct cfs_rq *cfs_rq;
>>> +       struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>> +       bool propagate_avg;
>>>
>>>         if (entity_is_task(se))
>>>                 return 0;
>>>
>>> -       if (!test_and_clear_tg_cfs_propagate(se))
>>> -               return 0;
>>> -
>>> -       cfs_rq = cfs_rq_of(se);
>>> +       propagate_avg = test_and_clear_tg_cfs_propagate(se);
>>>
>>> -       set_tg_cfs_propagate(cfs_rq);
>>> +       /*
>>> +        * We want to keep @cfs_rq->runnable_load_avg always in sync so
>>> +        * that the load balancer can accurately determine the busiest CPU
>>> +        * regardless of cfs_rq nesting.
>>> +        */
>>> +       update_tg_cfs_load(cfs_rq, se, propagate_avg);
>>>
>>> -       update_tg_cfs_util(cfs_rq, se);
>>> -       update_tg_cfs_load(cfs_rq, se);
>>> +       if (propagate_avg) {
>>> +               set_tg_cfs_propagate(cfs_rq);
>>> +               update_tg_cfs_util(cfs_rq, se);
>>> +       }
>>>
>>> -       return 1;
>>> +       return propagate_avg;
>>>  }
>>>
>>>  #else /* CONFIG_FAIR_GROUP_SCHED */

* Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
  2017-04-25  8:35   ` [PATCH " Vincent Guittot
@ 2017-04-25 18:12     ` Tejun Heo
  2017-04-26 16:51       ` Vincent Guittot
  0 siblings, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-04-25 18:12 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hello,

On Tue, Apr 25, 2017 at 10:35:53AM +0200, Vincent Guittot wrote:
> not sure to catch your example:
> a task TA with a load_avg = 1 is the only task in a task group GB so
> the cfs_rq load_avg = 1 too and the group_entity of this cfs_rq has
> got a weight of 1024 (I use 10bits format for readability) which is
> the total share of task group GB

The group_entity (the sched_entity corresponding to the cfs_rq) should
behave as if it's a task which has the weight of 1024.

> Are you saying that the group_entity load_avg should be around 1024 and not 1 ?

Yes.

> I would say it depends of TA weight. I assume that TA weight is the
> default value (1024) as you don't specify any value in your example

Please consider the following configuration, where GA and GB are group
entities, and TA and TB are tasks.

	ROOT - GA (weight 1024) - TA (weight 1)
	     \ GB (weight 1   ) - TB (weight 1)

Let's say both TA and TB are running full-tilt.  Now let's take out GA
and GB.

	ROOT - TA1 (weight 1024)
	     \ TB1 (weight 1   )

GA should behave the same as TA1, and GB the same as TB1.  GA's load
should match TA1's, and GA's load as seen from ROOT's cfs_rq has nothing
to do with how much total absolute weight it has inside it.

	ROOT - GA2 (weight 1024) - TA2 (weight 1   )
	     \ GB2 (weight 1   ) - TB2 (weight 1024)

If TA2 and TB2 are constantly running, GA2's and GB2's loads in ROOT's
cfs_rq should match GA's and GB's, respectively.
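
Running these configurations through the formula from patch 1 gives
roughly the following (assuming a single CPU, every task running
full-tilt so its load_avg is about its weight, and fully converged
shares; numbers are approximate):

	GA : gcfs_rq runnable_load_avg ~= 1, load.weight = 1, shares ~= 1024
	     -> min(1 * 1024 / 1, 1024) = 1024, same as TA1
	GB : runnable_load_avg ~= 1, load.weight = 1, shares ~= 1
	     -> min(1 * 1 / 1, 1) = 1, same as TB1
	GA2: same as GA -> 1024
	GB2: runnable_load_avg ~= 1024, load.weight = 1024, shares ~= 1
	     -> min(1024 * 1 / 1024, 1) = 1, same as GB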

> If TA directly runs at parent level, its sched_entity would have a
> load_avg of 1 so why the group entity load_avg should be 1024 ? it

Because then the hierarchical weight configuration doesn't mean
anything.

> will just temporally show the cfs_rq more loaded than it is really and
> at the end the group entity load_avg will go back to 1

It's not temporary.  The weight of a group is its shares, which is its
load fraction of the configured weight of the group.  Assuming UP, if
you configure a group to the weight of 1024 and have any task running
full-tilt in it, the group will converge to the load of 1024.  The
problem is that the propagation logic is currently doing something
completely different and temporarily pushes down the load whenever it
triggers.
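
Just to illustrate the convergence claim, a toy geometric average (not
the actual PELT series; the decay factor is made up) of a constant 1024
contribution ends up at 1024:

  #include <stdio.h>

  int main(void)
  {
  	double load_avg = 0.0;
  	int i;

  	/* a constant full-tilt contribution of 1024, averaged geometrically */
  	for (i = 0; i < 500; i++)
  		load_avg += (1024.0 - load_avg) * 0.02;

  	printf("load_avg ~= %.1f\n", load_avg);	/* converges to ~1024 */
  	return 0;
  }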

Thanks.

-- 
tejun

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-04-25 12:59       ` Vincent Guittot
@ 2017-04-25 18:49         ` Tejun Heo
  2017-04-25 20:49           ` Tejun Heo
  2017-04-25 21:08           ` Tejun Heo
  0 siblings, 2 replies; 69+ messages in thread
From: Tejun Heo @ 2017-04-25 18:49 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hello,

On Tue, Apr 25, 2017 at 02:59:18PM +0200, Vincent Guittot wrote:
> >> So you are changing the purpose of propagate_entity_load_avg which
> >> aims to propagate load_avg/util_avg changes only when a task migrate
> >> and you also want to propagate the enqueue/dequeue in the parent
> >> cfs_rq->runnable_load_avg

Yeah, it now always propagates runnable_load_avg, and propagates
load_avg/util_avg too on migrations.

> > In fact you want that sched_entity load_avg reflects
> > cfs_rq->runnable_load_avg and not cfs_rq->avg.load_avg

Yes, that's how it gets changed.  The load balancer assumes that the
root's runnable_load_avg is the total sum of the loads of all currently
active tasks.  Nesting cfs_rq's shouldn't change that, and how it should
be mapped is clearly defined (scaled recursively till it reaches the
root), which is what the code calculates.  The change in
cfs_rq->avg.load_avg's behavior is to reflect that immediate
propagation, as load_avg and runnable_load_avg are tightly coupled.

While it does change a nested cfs_rq's load_avg behavior.  It sheds of
the extra layer of averaging and directly reflects the scaled load
avgs of its members, which are already time averaged.  I could have
missed something but couldn't spot anything which can break from this.
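
As a sketch of the recursive scaling I mean (toy types and field names
of my own; calc_cfs_shares() and all the fixed-point details are left
out), a queue's runnable load is scaled by its group entity's share of
the queue weight at each level on the way up:

  struct toy_cfs_rq {
          long runnable_load_avg; /* sum of queued entities' load_avg */
          long queue_weight;      /* sum of queued entities' weights */
          long entity_shares;     /* weight of this queue's group entity */
          struct toy_cfs_rq *parent;
  };

  /* what a nested queue contributes to the root's runnable_load_avg */
  long contrib_to_root(const struct toy_cfs_rq *q)
  {
          long load = q->runnable_load_avg;

          for (; q->parent; q = q->parent)
                  load = load * q->entity_shares / q->queue_weight;
          return load;
  }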

> I have run a quick test with your patches and schbench on my platform.
> I haven't been able to reproduce your regression but my platform is
> quite different from yours (only 8 cores without SMT)
> But most importantly, the parent cfs_rq->runnable_load_avg never
> reaches 0 (or almost 0) when it is idle. Instead, it still has a
> runnable_load_avg (this is not due to rounding computation) whereas
> runnable_load_avg should be 0
 
Heh, let me try that out.  Probably a silly mistake somewhere.

> Just to be curious, is your regression still there if you disable
> SMT/hyperthreading on your platform?

Will try that too.  I can't see why HT would change it because I see
single CPU queues misevaluated.  Just in case, you need to tune the
test params so that it doesn't load the machine too much and so that
there are some non-CPU-intensive workloads going on to perturb things
a bit.  Anyways, I'm gonna try disabling HT.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-04-25 18:49         ` Tejun Heo
@ 2017-04-25 20:49           ` Tejun Heo
  2017-04-25 21:15             ` Chris Mason
  2017-04-25 21:08           ` Tejun Heo
  1 sibling, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-04-25 20:49 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On Tue, Apr 25, 2017 at 11:49:41AM -0700, Tejun Heo wrote:
> Will try that too.  I can't see why HT would change it because I see
> single CPU queues misevaluated.  Just in case, you need to tune the
> test params so that it doesn't load the machine too much and that
> there are some non-CPU intensive workloads going on to perturb things
> a bit.  Anyways, I'm gonna try disabling HT.

It's finickier but after changing the duty cycle a bit, it reproduces
w/ HT off.  I think the trick is setting the number of threads to the
number of logical CPUs and tuning -s/-c so that p99 starts climbing up.
The following is from the root cgroup.

 ~/schbench -m 2 -t 8 -s 15000 -c 10000 -r 30
 Latency percentiles (usec)
	 50.0000th: 51
	 75.0000th: 62
	 90.0000th: 67
	 95.0000th: 70
	 *99.0000th: 1482
	 99.5000th: 5048
	 99.9000th: 9008
	 min=0, max=10066

And the following is from a first-level cgroup with maximum CPU
weight.

 # ~/schbench -m 2 -t 8 -s 15000 -c 10000 -r 30
 Latency percentiles (usec)
	 50.0000th: 51
	 75.0000th: 62
	 90.0000th: 71
	 95.0000th: 84
	 *99.0000th: 10064
	 99.5000th: 10064
	 99.9000th: 10064
	 min=0, max=10089

It's interesting that p99 ends up aligned on the CPU burn duration.
It looks like some threads end up waiting for full durations.

The following is with the patches applied in the same cgroup setup.

 # ~/schbench -m 2 -t 8 -s 15000 -c 10000 -r 30
 Latency percentiles (usec)
	 50.0000th: 64
	 75.0000th: 73
	 90.0000th: 102
	 95.0000th: 111
	 *99.0000th: 1954
	 99.5000th: 5432
	 99.9000th: 9520
	 min=0, max=10012

The numbers fluctuate quite a bit between runs but the pattern is
still very clear - e.g. 10ms p99 never shows up in the root cgroup or
on the patched kernel.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-04-25 18:49         ` Tejun Heo
  2017-04-25 20:49           ` Tejun Heo
@ 2017-04-25 21:08           ` Tejun Heo
  2017-04-26 10:21             ` Vincent Guittot
  1 sibling, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-04-25 21:08 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On Tue, Apr 25, 2017 at 11:49:41AM -0700, Tejun Heo wrote:
> > I have run a quick test with your patches and schbench on my platform.
> > I haven't been able to reproduce your regression but my platform is
> > quite different from yours (only 8 cores without SMT)
> > But most importantly, the parent cfs_rq->runnable_load_avg never
> > reaches 0 (or almost 0) when it is idle. Instead, it still has a
> > runnable_load_avg (this is not due to rounding computation) whereas
> > runnable_load_avg should be 0
>  
> Heh, let me try that out.  Probably a silly mistake somewhere.

This is from the follow-up patch.  I was confused.  Because we don't
propagate decays, we still should decay the runnable_load_avg;
otherwise, we end up accumulating errors in the counter.  I'll drop
the last patch.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/2] sched/fair: Skip __update_load_avg() on cfs_rq sched_entities
  2017-04-24 21:35 ` [PATCH 3/2] sched/fair: Skip __update_load_avg() on cfs_rq sched_entities Tejun Heo
  2017-04-24 21:48   ` Peter Zijlstra
@ 2017-04-25 21:09   ` Tejun Heo
  1 sibling, 0 replies; 69+ messages in thread
From: Tejun Heo @ 2017-04-25 21:09 UTC (permalink / raw)
  Cc: linux-kernel, Linus Torvalds, Vincent Guittot, Mike Galbraith,
	Paul Turner, Chris Mason, kernel-team

On Mon, Apr 24, 2017 at 02:35:28PM -0700, Tejun Heo wrote:
> Now that a cfs_rq sched_entity's load_avg always gets propagated from
> the associated cfs_rq, there's no point in calling __update_load_avg()
> on it.  The two mechanisms compete with each other and we'd be always
> using a value close to the propagated one anyway.
> 
> Skip __update_load_avg() for cfs_rq sched_entities.  Also, relocate
> propagate_entity_load_avg() to signify that propagation is the
> counterpart to __update_load_avg() for cfs_rq sched_entities.  This
> puts the propagation before update_cfs_rq_load_avg() which shouldn't
> disturb anything.

Please ignore this patch.  As we don't propagate on decays, we still
need __update_load_avg() on runnable_load_avg so that it can decay on
its own.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-04-25 20:49           ` Tejun Heo
@ 2017-04-25 21:15             ` Chris Mason
  0 siblings, 0 replies; 69+ messages in thread
From: Chris Mason @ 2017-04-25 21:15 UTC (permalink / raw)
  To: Tejun Heo, Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, kernel-team



On 04/25/2017 04:49 PM, Tejun Heo wrote:
> On Tue, Apr 25, 2017 at 11:49:41AM -0700, Tejun Heo wrote:
>> Will try that too.  I can't see why HT would change it because I see
>> single CPU queues misevaluated.  Just in case, you need to tune the
>> test params so that it doesn't load the machine too much and that
>> there are some non-CPU intensive workloads going on to perturb things
>> a bit.  Anyways, I'm gonna try disabling HT.
>
> It's finickier but after changing the duty cycle a bit, it reproduces
> w/ HT off.  I think the trick is setting the number of threads to the
> number of logical CPUs and tune -s/-c so that p99 starts climbing up.
> The following is from the root cgroup.

Since it's only measuring wakeup latency, schbench is best at exposing 
problems when the machine is just barely below saturation.  At 
saturation, everyone has to wait for the CPUs, and if we're relatively 
idle there's always a CPU to be found.

There's schbench -a to try and find this magic tipping point, but I 
haven't found a great way to automate for every kind of machine yet (sorry).

-chris

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-04-25 21:08           ` Tejun Heo
@ 2017-04-26 10:21             ` Vincent Guittot
  2017-04-27  0:30               ` Tejun Heo
  0 siblings, 1 reply; 69+ messages in thread
From: Vincent Guittot @ 2017-04-26 10:21 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On 25 April 2017 at 23:08, Tejun Heo <tj@kernel.org> wrote:
> On Tue, Apr 25, 2017 at 11:49:41AM -0700, Tejun Heo wrote:
>> > I have run a quick test with your patches and schbench on my platform.
>> > I haven't been able to reproduce your regression but my platform is
>> > quite different from yours (only 8 cores without SMT)
>> > But most importantly, the parent cfs_rq->runnable_load_avg never
>> > reaches 0 (or almost 0) when it is idle. Instead, it still has a
>> > runnable_load_avg (this is not due to rounding computation) whereas
>> > runnable_load_avg should be 0
>>
>> Heh, let me try that out.  Probably a silly mistake somewhere.
>
> This is from the follow-up patch.  I was confused.  Because we don't
> propagate decays, we still should decay the runnable_load_avg;
> otherwise, we end up accumulating errors in the counter.  I'll drop
> the last patch.

Ok, the runnable_load_avg goes back to 0 when I drop patch 3.  But I
see runnable_load_avg sometimes significantly higher than load_avg,
which is normally not possible as load_avg = runnable_load_avg +
the sleeping tasks' load_avg.

Then, I just have the opposite behavior on my platform.  I see an
increase of latency at p99 with your patches.
My platform is a hikey: 2x4-core ARM, and I have used schbench -m 2
-t 4 -s 10000 -c 15000 -r 30 so I have 1 worker thread per CPU, which
is similar to what you are doing on your platform.

With v4.11-rc8, I have run the test 10 times and get consistent results:
schbench -m 2 -t 4 -s 10000 -c 15000 -r 30
Latency percentiles (usec)
50.0000th: 255
75.0000th: 350
90.0000th: 454
95.0000th: 489
*99.0000th: 539
99.5000th: 585
99.9000th: 10224
min=0, max=13567

With your patches I see an increase of the latency for p99.  I also ran
the test 10 times, and half of the runs show a latency increase like below:
schbench$ ./schbench -m 2 -t 4 -s 10000 -c 15000 -r 30
Latency percentiles (usec)
50.0000th: 216
75.0000th: 295
90.0000th: 395
95.0000th: 444
*99.0000th: 2034
99.5000th: 5960
99.9000th: 12240
min=0, max=14744

>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
  2017-04-24 20:14 ` [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity Tejun Heo
  2017-04-24 21:33   ` [PATCH v2 " Tejun Heo
  2017-04-25  8:35   ` [PATCH " Vincent Guittot
@ 2017-04-26 16:14   ` Vincent Guittot
  2017-04-26 22:27     ` Tejun Heo
  2 siblings, 1 reply; 69+ messages in thread
From: Vincent Guittot @ 2017-04-26 16:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On 24 April 2017 at 22:14, Tejun Heo <tj@kernel.org> wrote:
> 09a43ace1f98 ("sched/fair: Propagate load during synchronous
> attach/detach") added immediate load propagation from cfs_rq to its
> sched_entity then to the parent cfs_rq; however, what gets propagated
> doesn't seem to make sense.
>
> It repeats the tg_weight calculation done in calc_cfs_shares() but
> only uses it to compensate for shares being out of date.  After that,
> it sets the sched_entity's load_avg to the load_avg of the
> corresponding cfs_rq.
>
> This doesn't make sense as the cfs_rq's load_avg is some fraction of
> its total weight, which the sched_entity's weight has nothing to do with.
> For example, if the cfs_rq has a single constant load 1 task the
> cfs_rq's load_avg would be around 1.  If that cfs_rq is the only
> active sched_entity in the parent cfs_rq which has the maximum weight,
> the sched_entity's load should be around the maximum weight but
> update_tg_cfs_load() ends up overriding it to 1.
>
> At the parent's level, the absolute value of load_avg inside a child
> cfs_rq doesn't mean anything.  Only the ratio against its weight is
> meaningful.
>
> This patch changes update_tg_cfs_load() to normalize the
> runnable_load_avg of the cfs_rq and then scale it to the matching
> sched_entity's freshly calculated shares for propagation.  Use of
> runnable_load_avg instead of load_avg is intentional and keeps the
> parent's runnable_load_avg true to the sum of scaled loads of all
> tasks queued under it which is critical for the correct operation
> of load balancer.  The next patch will depend on it.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Paul Turner <pjt@google.com>
> ---
>  kernel/sched/fair.c |   46 +++++++++++++++++++---------------------------
>  1 file changed, 19 insertions(+), 27 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3078,37 +3078,29 @@ static inline void
>  update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>         struct cfs_rq *gcfs_rq = group_cfs_rq(se);
> -       long delta, load = gcfs_rq->avg.load_avg;
> +       long load = 0, delta;
>
>         /*
> -        * If the load of group cfs_rq is null, the load of the
> -        * sched_entity will also be null so we can skip the formula
> +        * A cfs_rq's load avg contribution to the parent should be scaled
> +        * to the sched_entity's weight.  Use freshly calculated shares
> +        * instead of @se->load.weight as the latter may not reflect
> +        * changes from the current scheduling operation.
> +        *
> +        * Note that the propagation source is runnable_load_avg instead of
> +        * load_avg.  This keeps every cfs_rq's runnable_load_avg true to
> +        * the sum of the scaled loads of all tasks queued under it, which
> +        * is important for the correct operation of the load balancer.
> +        *
> +        * This can make the sched_entity's load_avg jumpier but that
> +        * correctly reflects what would happen without cgroups if each
> +        * task's load is scaled across nesting - the load is being
> +        * averaged at the task and each cfs_rq.
>          */
> -       if (load) {
> -               long tg_load;
> +       if (gcfs_rq->load.weight) {
> +               long shares = calc_cfs_shares(gcfs_rq, gcfs_rq->tg);
>
> -               /* Get tg's load and ensure tg_load > 0 */
> -               tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
> -
> -               /* Ensure tg_load >= load and updated with current load*/
> -               tg_load -= gcfs_rq->tg_load_avg_contrib;
> -               tg_load += load;
> -
> -               /*
> -                * We need to compute a correction term in the case that the
> -                * task group is consuming more CPU than a task of equal
> -                * weight. A task with a weight equals to tg->shares will have
> -                * a load less or equal to scale_load_down(tg->shares).
> -                * Similarly, the sched_entities that represent the task group
> -                * at parent level, can't have a load higher than
> -                * scale_load_down(tg->shares). And the Sum of sched_entities'
> -                * load must be <= scale_load_down(tg->shares).
> -                */
> -               if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
> -                       /* scale gcfs_rq's load into tg's shares*/
> -                       load *= scale_load_down(gcfs_rq->tg->shares);
> -                       load /= tg_load;
> -               }
> +               load = min(gcfs_rq->runnable_load_avg *
> +                          shares / gcfs_rq->load.weight, shares);

There is a unit problem above:
runnable_load_avg and shares are not in the same range, but
runnable_load_avg and scale_load_down(gcfs_rq->load.weight) are, so
you should use
gcfs_rq->runnable_load_avg * scale_load_down(shares) /
scale_load_down(gcfs_rq->load.weight).
Hopefully both scale_load_down()s cancel each other out.
But then the min should be tested against scale_load_down(shares) and
not plain shares.


>         }
>
>         delta = load - se->avg.load_avg;

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
  2017-04-25 18:12     ` Tejun Heo
@ 2017-04-26 16:51       ` Vincent Guittot
  2017-04-26 22:40         ` Tejun Heo
  2017-05-01 14:17         ` Peter Zijlstra
  0 siblings, 2 replies; 69+ messages in thread
From: Vincent Guittot @ 2017-04-26 16:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Le Tuesday 25 Apr 2017 à 11:12:19 (-0700), Tejun Heo a écrit :
> Hello,
> 
> On Tue, Apr 25, 2017 at 10:35:53AM +0200, Vincent Guittot wrote:
> > not sure to catch your example:
> > a task TA with a load_avg = 1 is the only task in a task group GB so
> > the cfs_rq load_avg = 1 too and the group_entity of this cfs_rq has
> > got a weight of 1024 (I use 10bits format for readability) which is
> > the total share of task group GB
> 
> The group_entity (the sched_entity corresponding to the cfs_rq) should
> behave as if it's a task which has the weight of 1024.
> 
> > Are you saying that the group_entity load_avg should be around 1024 and not 1 ?
> 
> Yes.
> 
> > I would say it depends of TA weight. I assume that TA weight is the
> > default value (1024) as you don't specify any value in your example
> 
> Please consider the following configuration, where GA is a group
> entity, and TA and TB are tasks.
> 
> 	ROOT - GA (weight 1024) - TA (weight 1)
> 	     \ GB (weight 1   ) - TB (weight 1)
> 
> Let's say both TA and TB are running full-tilt.  Now let's take out GA
> and GB.
> 
> 	ROOT - TA1 (weight 1024)
> 	     \ TB1 (weight 1   )
> 
> GA should behave the same as TA1 and GB TB1.  GA's load should match
> TA1's, and GA's load when seen from ROOT's cfs_rq has nothing to do
> with how much total absolute weight it has inside it.
> 
> 	ROOT - GA2 (weight 1024) - TA2 (weight 1   )
> 	     \ GB2 (weight 1   ) - TB2 (weight 1024)
> 
> If TA2 and TB2 are constantly running, GA2 and GB2's in ROOT's cfs_rq
> should match GA and GB's, respectively.

Yes I agree

> 
> > If TA directly runs at parent level, its sched_entity would have a
> > load_avg of 1 so why the group entity load_avg should be 1024 ? it
> 
> Because then the hierarchical weight configuration doesn't mean
> anything.
> 
> > will just temporally show the cfs_rq more loaded than it is really and
> > at the end the group entity load_avg will go back to 1
> 
> It's not temporary.  The weight of a group is its shares, which is its
> load fraction of the configured weight of the group.  Assuming UP, if
> you configure a group to the weight of 1024 and have any task running
> full-tilt in it, the group will converge to the load of 1024.  The
> problem is that the propagation logic is currently doing something
> completely different and temporarily push down the load whenever it
> triggers.

Ok, I see your point and agree that there is an issue when propagating
the load_avg of a task group which has tasks with a lower weight than
the share, but your proposal has an issue because it uses
runnable_load_avg instead of load_avg and this makes the propagation of
load_avg incorrect.  Something like below, which keeps using load_avg,
solves the problem:

+	if (gcfs_rq->load.weight) {
+		long shares = scale_load_down(calc_cfs_shares(gcfs_rq, gcfs_rq->tg));
+
+		load = min(gcfs_rq->avg.load_avg *
+			   shares / scale_load_down(gcfs_rq->load.weight), shares);

I have run schbench with the change above on v4.11-rc8 and the latencies are ok

Thanks
Vincent
>
> 
> Thanks.
> 
> -- 
> tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-04-24 20:14 ` [PATCH 2/2] sched/fair: Always propagate runnable_load_avg Tejun Heo
  2017-04-25  8:46   ` Vincent Guittot
@ 2017-04-26 18:12   ` Vincent Guittot
  2017-04-26 22:52     ` Tejun Heo
  1 sibling, 1 reply; 69+ messages in thread
From: Vincent Guittot @ 2017-04-26 18:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On 24 April 2017 at 22:14, Tejun Heo <tj@kernel.org> wrote:
> We noticed that with cgroup CPU controller in use, the scheduling
>
> Note the drastic increase in p99 scheduling latency.  After
> investigation, it turned out that the update_sd_lb_stats(), which is
> used by load_balance() to pick the most loaded group, was often
> picking the wrong group.  A CPU which has one schbench running and

Can the problem be on the load balance side instead?  And more
precisely in the wakeup path?
After looking at the trace, it seems that task placement happens in the
wakeup path, and if it fails to select the right idle cpu at wake up,
you will have to wait for a load balance, which is already too late.

> another queued wouldn't report the correspondingly higher

It will, as load_avg includes runnable_load_avg, so whatever load is
in runnable_load_avg will be in load_avg too.  On the contrary,
runnable_load_avg will not include the blocked tasks that are going to
wake up soon in the case of schbench.

One last thing: the load_avg of an idle CPU can stay blocked for a
while (until a load balance happens that will update the blocked load)
and can be seen as "busy" whereas it is not.  Could it be a reason for
your problem?

I have an ongoing patch that at least partly solves the problem, if
this turns out to be a reason.

> weighted_cpuload() and get looked over as the target of load
> balancing.
>
> weighted_cpuload() is the root cfs_rq's runnable_load_avg which is the
> sum of the load_avg of all queued sched_entities.  Without cgroups or
> at the root cgroup, each task's load_avg contributes directly to the
> sum.  When a task wakes up or goes to sleep, the change is immediately
> reflected on runnable_load_avg which in turn affects load balancing.
>

>  #else /* CONFIG_FAIR_GROUP_SCHED */

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
  2017-04-26 16:14   ` Vincent Guittot
@ 2017-04-26 22:27     ` Tejun Heo
  2017-04-27  8:59       ` Vincent Guittot
  0 siblings, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-04-26 22:27 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hello, Vincent.

On Wed, Apr 26, 2017 at 06:14:17PM +0200, Vincent Guittot wrote:
> > +       if (gcfs_rq->load.weight) {
> > +               long shares = calc_cfs_shares(gcfs_rq, gcfs_rq->tg);
> >
> > +               load = min(gcfs_rq->runnable_load_avg *
> > +                          shares / gcfs_rq->load.weight, shares);
> 
> There is a unit problem above:
> runnable_load_avg and shares are not in the same range but
> runnable_load_avg and  scale_load_down(gcfs_rq->load.weight) are so
> you should use
> gcfs_rq->runnable_load_avg * scale_load_down(shares) /
> scale_load_down(gcfs_rq->load.weight).

But the only difference there is that we lose accuracy in calculation;
otherwise, the end results are the same, no?

> Hopefully both  scale_load_down cancel between them
> But the min should be then tested with scale_load_down(shares) and not
> only shares

Ah, that's right.  The min should be against scaled down shares.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
  2017-04-26 16:51       ` Vincent Guittot
@ 2017-04-26 22:40         ` Tejun Heo
  2017-04-27  7:00           ` Vincent Guittot
  2017-05-01 14:17         ` Peter Zijlstra
  1 sibling, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-04-26 22:40 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hello,

On Wed, Apr 26, 2017 at 06:51:23PM +0200, Vincent Guittot wrote:
> > It's not temporary.  The weight of a group is its shares, which is its
> > load fraction of the configured weight of the group.  Assuming UP, if
> > you configure a group to the weight of 1024 and have any task running
> > full-tilt in it, the group will converge to the load of 1024.  The
> > problem is that the propagation logic is currently doing something
> > completely different and temporarily push down the load whenever it
> > triggers.
> 
> Ok, I see your point and agree that there is an issue when propagating
> load_avg of a task group which has tasks with lower weight than the share
> but your proposal has got issue because it uses runnable_load_avg instead
> of load_avg and this makes propagation of loadavg_avg incorrect, something
> like below which keeps using load_avg solve the problem
> 
> +	if (gcfs_rq->load.weight) {
> +		long shares = scale_load_down(calc_cfs_shares(gcfs_rq, gcfs_rq->tg));
> +
> +		load = min(gcfs_rq->avg.load_avg *
> +			   shares / scale_load_down(gcfs_rq->load.weight), shares);
> 
> I have run schbench with the change above on v4.11-rc8 and latency are ok

Hmm... so, I'll test this but this wouldn't solve the problem of
root's runnable_load_avg being out of sync with the approximate sum of
all task loads, which is the cause of the latencies that I'm seeing.

Are you saying that with the above change, you're not seeing the
higher latency issue that you reported in the other reply?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-04-26 18:12   ` Vincent Guittot
@ 2017-04-26 22:52     ` Tejun Heo
  2017-04-27  8:29       ` Vincent Guittot
  0 siblings, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-04-26 22:52 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hello,

On Wed, Apr 26, 2017 at 08:12:09PM +0200, Vincent Guittot wrote:
> On 24 April 2017 at 22:14, Tejun Heo <tj@kernel.org> wrote:
> Can the problem be on the load balance side instead ?  and more
> precisely in the wakeup path ?
> After looking at the trace, it seems that task placement happens at
> wake up path and if it fails to select the right idle cpu at wake up,
> you will have to wait for a load balance which is already too late

Oh, I was tracing most of the scheduler activities and the ratios of
wakeups picking idle CPUs were about the same regardless of cgroup
membership.  I can confidently say that the latency issue that I'm
seeing is from the load balancer picking the wrong busiest CPU, which
is not to say that there can't be other problems.

> > another queued wouldn't report the correspondingly higher
> 
> It will as load_avg includes the runnable_load_avg so whatever load is
> in runnable_load_avg will be in load_avg too. But at the contrary,
> runnable_load_avg will not have the blocked that is going to wake up
> soon in the case of schbench

The decaying contribution of blocked tasks doesn't affect the busiest
CPU selection.  Without cgroups, runnable_load_avg is immediately
increased and decreased as tasks enter and leave the queue; without
that, we end up with CPUs which are idle while threads queued on
different CPUs accumulate scheduling latencies.

The patch doesn't change how the busiest CPU is picked.  It already
uses runnable_load_avg.  The change that cgroup causes is that it
blocks updates to runnable_load_avg from newly scheduled or sleeping
tasks.

The issue isn't about whether runnable_load_avg or load_avg should be
used but the unexpected differences in the metrics that the load
balancer uses depending on whether cgroup is used or not.

> One last thing, the load_avg of an idle CPU can stay blocked for a
> while (until a load balance happens that will update blocked load) and
>> can be seen as "busy" whereas it is not. Could it be a reason for your
> problem ?

AFAICS, the load balancer doesn't use load_avg.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-04-26 10:21             ` Vincent Guittot
@ 2017-04-27  0:30               ` Tejun Heo
  2017-04-27  8:28                 ` Vincent Guittot
  0 siblings, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-04-27  0:30 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hello, Vincent.

On Wed, Apr 26, 2017 at 12:21:52PM +0200, Vincent Guittot wrote:
> > This is from the follow-up patch.  I was confused.  Because we don't
> > propagate decays, we still should decay the runnable_load_avg;
> > otherwise, we end up accumulating errors in the counter.  I'll drop
> > the last patch.
> 
> Ok, the runnable_load_avg goes back to 0 when I drop patch 3. But i
> see  runnable_load_avg sometimes significantly higher than load_avg
> which is normally not possible as load_avg = runnable_load_avg +
> sleeping task's load_avg

So, while load_avg would eventually converge on runnable_load_avg +
blocked load_avg given a stable enough workload for long enough,
runnable_load_avg temporarily jumping above load_avg is expected,
AFAICS.  That's the whole point of it: a sum closely tracking what's
currently on the cpu so that we can pick the cpu which has the most on
it now.  It doesn't make sense to try to pick threads off of a cpu
which is generally loaded but doesn't have much going on right now,
after all.

> Then, I just have the opposite behavior on my platform. I see a
> increase of latency at p99 with your patches.
> My platform is a hikey : 2x4 cores ARM and I have used schbench -m 2
> -t 4 -s 10000 -c 15000 -r 30 so I have 1 worker thread per CPU which
> is similar to what you are doing on your platform
>
> With v4.11-rc8. I have run 10 times the test and get consistent results
...
> *99.0000th: 539
...
> With your patches i see an increase of the latency for p99. I run 10
> *99.0000th: 2034

I see.  This is surprising given that at least the purpose of the
patch is restoring the cgroup behavior to match the !cgroup one.  I
could have totally messed it up tho.  Hmm... there are several ways
forward I guess.

* Can you please double check that the higher latencies w/ the patch
  are reliably reproducible?  The test machines that I use have
  variable management load.  They never dominate the machine but are
  enough to disturb the results, so drawing out a reliable pattern
  takes a lot of repeated runs.  I'd really appreciate it if you could
  double check that the pattern is reliable with different run
  patterns (ie. interleaved instead of 10 consecutive runs one after
  another).

* Is the board something easily obtainable?  It'd be the easiest for
  me to set up the same environment and reproduce the problem.  I
  looked up hikey boards on amazon but couldn't easily find 2x4 core
  ones.  If there's something I can easily buy, please point me to it.
  If there's something I can loan, that'd be great too.

* If not, I'll try to clean up the debug patches I have and send them
  your way to get more visibility, but given these things tend to be
  very iterative, it might take quite a few back-and-forths.

Thanks!

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
  2017-04-26 22:40         ` Tejun Heo
@ 2017-04-27  7:00           ` Vincent Guittot
  0 siblings, 0 replies; 69+ messages in thread
From: Vincent Guittot @ 2017-04-27  7:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On 27 April 2017 at 00:40, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Wed, Apr 26, 2017 at 06:51:23PM +0200, Vincent Guittot wrote:
>> > It's not temporary.  The weight of a group is its shares, which is its
>> > load fraction of the configured weight of the group.  Assuming UP, if
>> > you configure a group to the weight of 1024 and have any task running
>> > full-tilt in it, the group will converge to the load of 1024.  The
>> > problem is that the propagation logic is currently doing something
>> > completely different and temporarily push down the load whenever it
>> > triggers.
>>
>> Ok, I see your point and agree that there is an issue when propagating
>> load_avg of a task group which has tasks with lower weight than the share
>> but your proposal has got issue because it uses runnable_load_avg instead
>> of load_avg and this makes propagation of loadavg_avg incorrect, something
>> like below which keeps using load_avg solve the problem
>>
>> +     if (gcfs_rq->load.weight) {
>> +             long shares = scale_load_down(calc_cfs_shares(gcfs_rq, gcfs_rq->tg));
>> +
>> +             load = min(gcfs_rq->avg.load_avg *
>> +                        shares / scale_load_down(gcfs_rq->load.weight), shares);
>>
>> I have run schbench with the change above on v4.11-rc8 and latency are ok
>
> Hmm... so, I'll test this but this wouldn't solve the problem of
> root's runnable_load_avg being out of sync with the approximate sum of
> all task loads, which is the cause of the latencies that I'm seeing.
>
> Are you saying that with the above change, you're not seeing the
> higher latency issue that you reported in the other reply?

yes I don't have any latency regression like v4.11-rc8 with the above
change that uses load_avg but fix the propagation for of a task with a
lower weight than task group share.

>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-04-27  0:30               ` Tejun Heo
@ 2017-04-27  8:28                 ` Vincent Guittot
  2017-04-28 16:14                   ` Tejun Heo
  0 siblings, 1 reply; 69+ messages in thread
From: Vincent Guittot @ 2017-04-27  8:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On 27 April 2017 at 02:30, Tejun Heo <tj@kernel.org> wrote:
> Hello, Vincent.
>
> On Wed, Apr 26, 2017 at 12:21:52PM +0200, Vincent Guittot wrote:
>> > This is from the follow-up patch.  I was confused.  Because we don't
>> > propagate decays, we still should decay the runnable_load_avg;
>> > otherwise, we end up accumulating errors in the counter.  I'll drop
>> > the last patch.
>>
>> Ok, the runnable_load_avg goes back to 0 when I drop patch 3. But i
>> see  runnable_load_avg sometimes significantly higher than load_avg
>> which is normally not possible as load_avg = runnable_load_avg +
>> sleeping task's load_avg
>
> So, while load_avg would eventually converge on runnable_load_avg +
> blocked load_avg given stable enough workload for long enough,
> runnable_load_avg jumping above load avg temporarily is expected,

No, it's not.  Look at load_avg/runnable_load_avg at the root domain
when only tasks are involved: runnable_load_avg will never be higher
than load_avg because
    load_avg = sum of the load_avg of the tasks attached to the cfs_rq
    runnable_load_avg = sum of the load_avg of the tasks attached and
enqueued to the cfs_rq
so load_avg = runnable_load_avg + the blocked tasks' load_avg and, as
a result, runnable_load_avg is always lower than load_avg.
And with the propagate load/util_avg patchset, we can even reflect
task migration directly at the root domain, whereas before we had to
wait for util/load_avg and runnable_load_avg to converge to the new
value.
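
A minimal sketch of that invariant with made-up numbers (toy code, not
the kernel's accounting; "enqueued" stands in for the task being on
the rq):

  #include <stdio.h>

  struct toy_task { long load_avg; int enqueued; };

  int main(void)
  {
          struct toy_task tasks[] = { /* all attached to one cfs_rq */
                  { 1024, 1 },        /* running */
                  {  700, 1 },        /* runnable, waiting for the cpu */
                  {  310, 0 },        /* blocked, still attached */
          };
          long load_avg = 0, runnable_load_avg = 0;
          unsigned int i;

          for (i = 0; i < sizeof(tasks) / sizeof(tasks[0]); i++) {
                  load_avg += tasks[i].load_avg;
                  if (tasks[i].enqueued)
                          runnable_load_avg += tasks[i].load_avg;
          }

          /* prints 1724 and 2034: runnable never exceeds load_avg */
          printf("runnable_load_avg=%ld load_avg=%ld\n",
                 runnable_load_avg, load_avg);
          return 0;
  }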

Just to confirm one of my assumptions: the latency regression was
already there in previous kernel versions and is not a result of the
propagate load/util_avg patchset, is it?

> AFAICS.  That's the whole point of it, a sum closely tracking what's
> currently on the cpu so that we can pick the cpu which has the most on
> it now.  It doesn't make sense to try to pick threads off of a cpu
> which is generally loaded but doesn't have much going on right now,
> after all.

The only interest of runnable_load_avg is to be null when a cfs_rq is
idle whereas load_avg is not, not to be higher than load_avg.  The
root cause is that load_balance only looks at "load" but not at the
number of tasks currently running, and that's probably the main
problem: runnable_load_avg has been added because load_balance fails
to filter out idle groups and idle rqs.  We should rather add a new
type in group_classify to tag groups that are idle, and do the same in
find_busiest_queue.

>
>> Then, I just have the opposite behavior on my platform. I see a
>> increase of latency at p99 with your patches.
>> My platform is a hikey : 2x4 cores ARM and I have used schbench -m 2
>> -t 4 -s 10000 -c 15000 -r 30 so I have 1 worker thread per CPU which
>> is similar to what you are doing on your platform
>>
>> With v4.11-rc8. I have run 10 times the test and get consistent results
> ...
>> *99.0000th: 539
> ...
>> With your patches i see an increase of the latency for p99. I run 10
>> *99.0000th: 2034
>
> I see.  This is surprising given that at least the purpose of the
> patch is restoring cgroup behavior to match !cgroup one.  I could have
> totally messed it up tho.  Hmm... there are several ways forward I
> guess.
>
> * Can you please double check that the higher latencies w/ the patch
>   is reliably reproducible?  The test machines that I use have
>   variable management load.  They never dominate the machine but are
>   enough to disturb the results so that to drawing out a reliable
>   pattern takes a lot of repeated runs.  I'd really appreciate if you
>   could double check that the pattern is reliable with different run
>   patterns (ie. instead of 10 consecutive runs after another,
>   interleaved).

I always leave some time between 2 consecutive runs, and the 10
consecutive runs take around 2min to execute.

I have then run these 10 consecutive tests several times and the
results stay the same.

>
> > * Is the board something easily obtainable?  It'd be the easiest for
>   me to set up the same environment and reproduce the problem.  I
>   looked up hikey boards on amazon but couldn't easily find 2x4 core

It is often named hikey octo-core, but I said 2x4 cores just to point
out that there are 2 clusters, which is important for scheduler
topology and behavior.

>   ones.  If there's something I can easily buy, please point me to it.
>   If there's something I can loan, that'd be great too.

It looks like most web sites are currently out of stock

>
> * If not, I'll try to clean up the debug patches I have and send them
> >   your way to get more visibility but given these things tend to be
>   very iterative, it might take quite a few back and forth.

Yes, that could be useful.  Even a trace of the regression could be useful.

I can also push to my git tree the debug patch that I use for tracking
load metrics if you want.  It's ugly but it does the job.

Thanks

>
> Thanks!
>
> --
> tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-04-26 22:52     ` Tejun Heo
@ 2017-04-27  8:29       ` Vincent Guittot
  2017-04-28 20:33         ` Tejun Heo
  0 siblings, 1 reply; 69+ messages in thread
From: Vincent Guittot @ 2017-04-27  8:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On 27 April 2017 at 00:52, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Wed, Apr 26, 2017 at 08:12:09PM +0200, Vincent Guittot wrote:
>> On 24 April 2017 at 22:14, Tejun Heo <tj@kernel.org> wrote:
>> Can the problem be on the load balance side instead ?  and more
>> precisely in the wakeup path ?
>> After looking at the trace, it seems that task placement happens at
>> wake up path and if it fails to select the right idle cpu at wake up,
>> you will have to wait for a load balance which is already too late
>
> Oh, I was tracing most of scheduler activities and the ratios of
> wakeups picking idle CPUs were about the same regardless of cgroup
> membership.  I can confidently say that the latency issue that I'm
> seeing is from load balancer picking the wrong busiest CPU, which is
> not to say that there can be other problems.

Ok.  Is there any trace that you can share?  Your behavior seems
different from mine.

>
>> > another queued wouldn't report the correspondingly higher
>>
>> It will as load_avg includes the runnable_load_avg so whatever load is
>> in runnable_load_avg will be in load_avg too. But at the contrary,
>> runnable_load_avg will not have the blocked that is going to wake up
>> soon in the case of schbench
>
> Decaying contribution of blocked tasks don't affect the busiest CPU
> selection.  Without cgroup, runnable_load_avg is immediately increased
> and decreased as tasks enter and leave the queue and otherwise we end
> up with CPUs which are idle when there are threads queued on different
> CPUs accumulating scheduling latencies.
>
> The patch doesn't change how the busiest CPU is picked.  It already
> uses runnable_load_avg.  The change that cgroup causes is that it
> blocks updates to runnable_load_avg from newly scheduled or sleeping
> tasks.
>
> The issue isn't about whether runnable_load_avg or load_avg should be
> used but the unexpected differences in the metrics that the load

I think that's the root of the problem.  I explain my view a bit more
in the other thread.

> balancer uses depending on whether cgroup is used or not.
>
>> One last thing, the load_avg of an idle CPU can stay blocked for a
>> while (until a load balance happens that will update blocked load) and
>> can be seen as "busy" whereas it is not. Could it be a reason for your
>> problem ?
>
> AFAICS, the load balancer doesn't use load_avg.
>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
  2017-04-26 22:27     ` Tejun Heo
@ 2017-04-27  8:59       ` Vincent Guittot
  2017-04-28 17:46         ` Tejun Heo
  0 siblings, 1 reply; 69+ messages in thread
From: Vincent Guittot @ 2017-04-27  8:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On 27 April 2017 at 00:27, Tejun Heo <tj@kernel.org> wrote:
> Hello, Vincent.
>
> On Wed, Apr 26, 2017 at 06:14:17PM +0200, Vincent Guittot wrote:
>> > +       if (gcfs_rq->load.weight) {
>> > +               long shares = calc_cfs_shares(gcfs_rq, gcfs_rq->tg);
>> >
>> > +               load = min(gcfs_rq->runnable_load_avg *
>> > +                          shares / gcfs_rq->load.weight, shares);
>>
>> There is a unit problem above:
>> runnable_load_avg and shares are not in the same range but
>> runnable_load_avg and  scale_load_down(gcfs_rq->load.weight) are so
>> you should use
>> gcfs_rq->runnable_load_avg * scale_load_down(shares) /
>> scale_load_down(gcfs_rq->load.weight).
>
> But the only difference there is that we lose accuracy in calculation;
> otherwise, the end results are the same, no?

Yes, the end result is the same; it was mainly to point out the range
difference and explain why we need scale_load_down(shares) for the 2nd
argument of min.
This should also explain the warning issue you mentioned earlier.
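
To make the ranges concrete, a toy illustration (my own numbers,
assuming the 64-bit case where scale_load()/scale_load_down() shift by
SCHED_FIXEDPOINT_SHIFT, i.e. 10 bits; on 32-bit they are no-ops):

  #include <stdio.h>

  #define FIXEDPOINT_SHIFT        10
  #define scale_load(w)           ((w) << FIXEDPOINT_SHIFT)
  #define scale_load_down(w)      ((w) >> FIXEDPOINT_SHIFT)

  int main(void)
  {
          long runnable_load_avg = 512;   /* low-res, ~1024 per nice-0 task */
          long weight = scale_load(1024); /* gcfs_rq->load.weight, high-res */
          long shares = scale_load(1024); /* calc_cfs_shares() result, high-res */

          long load = runnable_load_avg * shares / weight;

          /* load=512, shares=1048576, scale_load_down(shares)=1024:
           * min(load, shares) never clamps a low-res value,
           * min(load, scale_load_down(shares)) can */
          printf("load=%ld shares=%ld scale_load_down(shares)=%ld\n",
                 load, shares, scale_load_down(shares));
          return 0;
  }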

>
>> Hopefully both  scale_load_down cancel between them
>> But the min should be then tested with scale_load_down(shares) and not
>> only shares
>
> Ah, that's right.  The min should be against scaled down shares.
>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-04-27  8:28                 ` Vincent Guittot
@ 2017-04-28 16:14                   ` Tejun Heo
  2017-05-02  6:56                     ` Vincent Guittot
  0 siblings, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-04-28 16:14 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hello, Vincent.

On Thu, Apr 27, 2017 at 10:28:01AM +0200, Vincent Guittot wrote:
> On 27 April 2017 at 02:30, Tejun Heo <tj@kernel.org> wrote:
> > Hello, Vincent.
> >
> > On Wed, Apr 26, 2017 at 12:21:52PM +0200, Vincent Guittot wrote:
> >> > This is from the follow-up patch.  I was confused.  Because we don't
> >> > propagate decays, we still should decay the runnable_load_avg;
> >> > otherwise, we end up accumulating errors in the counter.  I'll drop
> >> > the last patch.
> >>
> >> Ok, the runnable_load_avg goes back to 0 when I drop patch 3. But i
> >> see  runnable_load_avg sometimes significantly higher than load_avg
> >> which is normally not possible as load_avg = runnable_load_avg +
> >> sleeping task's load_avg
> >
> > So, while load_avg would eventually converge on runnable_load_avg +
> > blocked load_avg given stable enough workload for long enough,
> > runnable_load_avg jumping above load avg temporarily is expected,
> 
> No, it's not. Look at load_avg/runnable_avg at root domain when only
> task are involved, runnable_load_avg will never be higher than
> load_avg because
>     load_avg = /sum load_avg of tasks attached to the cfs_rq
>     runnable_load_avg = /sum load_avg of tasks attached and enqueued
> to the cfs_rq
> load_avg = runnable_load_avg + blocked tasks and as a result
> runnable_load_avg is always lower than load_avg.

We don't actually calculate that tho, but yeah, given the calculations
we do, runnable shouldn't be significantly higher than load_avg.  I do
see runnable going above load_avg by a bit fairly often, but the
margins are small.  I wonder whether you're seeing the errors
accumulating from the incorrect min() capping.

> And with the propagate load/util_avg patchset, we can even reflect
> task migration directly at root domain whereas we had to wait for the
> util/load_avg and runnable_load_avg to converge to the new value
> before
> 
> Just to confirm one of my assumption, the latency regression was
> already there in previous kernel versions and is not a result of
> propagate load/util_avg patchset, isn't it ?

Yeah, the latency problem is independent of the propagation logic.
It's really the meaning of the top level runnable_load_avg being
different w/ and w/o cgroups.

> > AFAICS.  That's the whole point of it, a sum closely tracking what's
> > currently on the cpu so that we can pick the cpu which has the most on
> > it now.  It doesn't make sense to try to pick threads off of a cpu
> > which is generally loaded but doesn't have much going on right now,
> > after all.
> 
> The only interest of runnable_load_avg is to be null when a cfs_rq is
> idle whereas load_avg is not  but not to be higher than load_avg. The
> root cause is that load_balance only looks at "load" but not number of
> task currently running and that's probably the main problem:
> runnable_load_avg has been added because load_balance fails to filter
> idle group and idle rq. We should better add a new type in
> group_classify to tag group that are  idle and the same in
> find_busiest_queue more.

I'll follow up in the other subthread but there really is a fundamental
difference in how we calculate runnable_avg w/ and w/o cgroups.
Independent of whether we can improve the load balancer further, it is
an existing bug.

> > * Can you please double check that the higher latencies w/ the patch
> >   is reliably reproducible?  The test machines that I use have
> >   variable management load.  They never dominate the machine but are
> >   enough to disturb the results so that to drawing out a reliable
> >   pattern takes a lot of repeated runs.  I'd really appreciate if you
> >   could double check that the pattern is reliable with different run
> >   patterns (ie. instead of 10 consecutive runs after another,
> >   interleaved).
> 
> I always let time between 2 consecutive run and the 10 consecutive
> runs take around 2min to execute
> 
> Then I have run several time these 10 consecutive test and results stay the same

Can you please try the patch at the end of this message?  I'm
wondering whether you're seeing the errors accumulating from the wrong
min().

> >   ones.  If there's something I can easily buy, please point me to it.
> >   If there's something I can loan, that'd be great too.
> 
> It looks like most of web site are currently out of stock

:(

> > * If not, I'll try to clean up the debug patches I have and send them
> >   your way to get more visibility but given these things tend to be
> >   very iterative, it might take quite a few back and forth.
> 
> Yes, that could be useful. Even a trace of the regression could be useful

Yeah, will clean it up a bit and post.  Will also post some traces.

And here's the updated patch.  I didn't add the suggested
scale_load_down()s.  I'll follow up on that in the other subthread.
Thanks.

 kernel/sched/fair.c |   47 ++++++++++++++++++++---------------------------
 1 file changed, 20 insertions(+), 27 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3078,37 +3078,30 @@ static inline void
 update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
-	long delta, load = gcfs_rq->avg.load_avg;
+	long load = 0, delta;
 
 	/*
-	 * If the load of group cfs_rq is null, the load of the
-	 * sched_entity will also be null so we can skip the formula
+	 * A cfs_rq's load avg contribution to the parent should be scaled
+	 * to the sched_entity's weight.  Use freshly calculated shares
+	 * instead of @se->load.weight as the latter may not reflect
+	 * changes from the current scheduling operation.
+	 *
+	 * Note that the propagation source is runnable_load_avg instead of
+	 * load_avg.  This keeps every cfs_rq's runnable_load_avg true to
+	 * the sum of the scaled loads of all tasks queued under it, which
+	 * is important for the correct operation of the load balancer.
+	 *
+	 * This can make the sched_entity's load_avg jumpier but that
+	 * correctly reflects what would happen without cgroups if each
+	 * task's load is scaled across nesting - the load is being
+	 * averaged at the task and each cfs_rq.
 	 */
-	if (load) {
-		long tg_load;
+	if (gcfs_rq->load.weight) {
+		long shares = calc_cfs_shares(gcfs_rq, gcfs_rq->tg);
 
-		/* Get tg's load and ensure tg_load > 0 */
-		tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
-
-		/* Ensure tg_load >= load and updated with current load*/
-		tg_load -= gcfs_rq->tg_load_avg_contrib;
-		tg_load += load;
-
-		/*
-		 * We need to compute a correction term in the case that the
-		 * task group is consuming more CPU than a task of equal
-		 * weight. A task with a weight equals to tg->shares will have
-		 * a load less or equal to scale_load_down(tg->shares).
-		 * Similarly, the sched_entities that represent the task group
-		 * at parent level, can't have a load higher than
-		 * scale_load_down(tg->shares). And the Sum of sched_entities'
-		 * load must be <= scale_load_down(tg->shares).
-		 */
-		if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
-			/* scale gcfs_rq's load into tg's shares*/
-			load *= scale_load_down(gcfs_rq->tg->shares);
-			load /= tg_load;
-		}
+		load = min_t(long, scale_load_down(shares),
+			     gcfs_rq->runnable_load_avg *
+			     shares / gcfs_rq->load.weight);
 	}
 
 	delta = load - se->avg.load_avg;

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
  2017-04-27  8:59       ` Vincent Guittot
@ 2017-04-28 17:46         ` Tejun Heo
  2017-05-02  7:20           ` Vincent Guittot
  0 siblings, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-04-28 17:46 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hello, Vincent.

On Thu, Apr 27, 2017 at 10:59:12AM +0200, Vincent Guittot wrote:
> > But the only difference there is that we lose accuracy in calculation;
> > otherwise, the end results are the same, no?
> 
> Yes the end result is the same, it was mainly to point out the range
> difference and explain why we need scale_load_down(shares) for the 2nd
> argument of min.
> This should also explain the warning issue you mentioned earlier

I'm not sure this makes sense.  Practically, we're doing more shifts
just to lose calculation accuracy.  Even conceptually, what we're
doing is

             C
	A * ---
             B

Where A is in a different scale while B and C are in the same.  What
you're suggesting is

             scale_down(C)
	A * ---------------
	     scale_down(B)

I can't see why this is better in any way.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-04-27  8:29       ` Vincent Guittot
@ 2017-04-28 20:33         ` Tejun Heo
  2017-04-28 20:38           ` Tejun Heo
                             ` (2 more replies)
  0 siblings, 3 replies; 69+ messages in thread
From: Tejun Heo @ 2017-04-28 20:33 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hello, Vincent.

On Thu, Apr 27, 2017 at 10:29:10AM +0200, Vincent Guittot wrote:
> On 27 April 2017 at 00:52, Tejun Heo <tj@kernel.org> wrote:
> > Hello,
> >
> > On Wed, Apr 26, 2017 at 08:12:09PM +0200, Vincent Guittot wrote:
> >> On 24 April 2017 at 22:14, Tejun Heo <tj@kernel.org> wrote:
> >> Can the problem be on the load balance side instead ?  and more
> >> precisely in the wakeup path ?
> >> After looking at the trace, it seems that task placement happens at
> >> wake up path and if it fails to select the right idle cpu at wake up,
> >> you will have to wait for a load balance which is already too late
> >
> > Oh, I was tracing most of scheduler activities and the ratios of
> > wakeups picking idle CPUs were about the same regardless of cgroup
> > membership.  I can confidently say that the latency issue that I'm
> > seeing is from load balancer picking the wrong busiest CPU, which is
> > not to say that there can be other problems.
> 
> ok. Is there any trace that you can share ? your behavior seems
> different of mine

I'm attaching the debug patch.  With your change (avg instead of
runnable_avg), the following trace shows why it's wrong.

It's dumping a case where group A has a CPU w/ two or more schbench
threads and B doesn't, but the load balancer is determining that B is
more heavily loaded.

 dbg_odd: odd: dst=28 idle=2 brk=32 lbtgt=0-31 type=2
 dbg_odd_dump: A: grp=1,17 w=2 avg=7.247 grp=8.337 sum=8.337 pertask=2.779
 dbg_odd_dump: A: gcap=1.150 gutil=1.095 run=3 idle=0 gwt=2 type=2 nocap=1
 dbg_odd_dump: A: CPU001: run=1 schb=1
 dbg_odd_dump: A: Q001-asdf: w=1.000,l=0.525,u=0.513,r=0.527 run=1 hrun=1 tgs=100.000 tgw=17.266
 dbg_odd_dump: A: Q001-asdf:  schbench(153757C):w=1.000,l=0.527,u=0.514
 dbg_odd_dump: A: Q001-/: w=5.744,l=2.522,u=0.520,r=3.067 run=1 hrun=1 tgs=1.000 tgw=0.000
 dbg_odd_dump: A: Q001-/:  asdf(C):w=5.744,l=3.017,u=0.521
 dbg_odd_dump: A: CPU017: run=2 schb=2
 dbg_odd_dump: A: Q017-asdf: w=2.000,l=0.989,u=0.966,r=0.988 run=2 hrun=2 tgs=100.000 tgw=17.266
 dbg_odd_dump: A: Q017-asdf:  schbench(153737C):w=1.000,l=0.493,u=0.482 schbench(153739):w=1.000,l=0.494,u=0.483
 dbg_odd_dump: A: Q017-/: w=10.653,l=7.888,u=0.973,r=5.270 run=1 hrun=2 tgs=1.000 tgw=0.000
 dbg_odd_dump: A: Q017-/:  asdf(C):w=10.653,l=5.269,u=0.966
 dbg_odd_dump: B: grp=14,30 w=2 avg=7.666 grp=8.819 sum=8.819 pertask=4.409
 dbg_odd_dump: B: gcap=1.150 gutil=1.116 run=2 idle=0 gwt=2 type=2 nocap=1
 dbg_odd_dump: B: CPU014: run=1 schb=1
 dbg_odd_dump: B: Q014-asdf: w=1.000,l=1.004,u=0.970,r=0.492 run=1 hrun=1 tgs=100.000 tgw=17.266
 dbg_odd_dump: B: Q014-asdf:  schbench(153760C):w=1.000,l=0.491,u=0.476
 dbg_odd_dump: B: Q014-/: w=5.605,l=11.146,u=0.970,r=5.774 run=1 hrun=1 tgs=1.000 tgw=0.000
 dbg_odd_dump: B: Q014-/:  asdf(C):w=5.605,l=5.766,u=0.970
 dbg_odd_dump: B: CPU030: run=1 schb=1
 dbg_odd_dump: B: Q030-asdf: w=1.000,l=0.538,u=0.518,r=0.558 run=1 hrun=1 tgs=100.000 tgw=17.266
 dbg_odd_dump: B: Q030-asdf:  schbench(153747C):w=1.000,l=0.537,u=0.516
 dbg_odd_dump: B: Q030-/: w=5.758,l=3.186,u=0.541,r=3.044 run=1 hrun=1 tgs=1.000 tgw=0.000
 dbg_odd_dump: B: Q030-/:  asdf(C):w=5.758,l=3.092,u=0.516

You can notice that B's pertask weight is 4.409, which is way higher
than A's 2.779, and this is because Q014-asdf's contribution to Q014-/
is twice as high as it should be.  The root queue's runnable avg should
only contain what's currently active, but because we're scaling load
avg, which includes both active and blocked load, we end up picking
group B over A.
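
To put rough numbers on that, using the CPU014 values from the dump
above (userspace sketch, everything approximate):

#include <stdio.h>

int main(void)
{
	double shares   = 5.605;	/* weight of the asdf group se on CPU014 (Q014-/) */
	double weight   = 1.000;	/* Q014-asdf cfs_rq->load.weight */
	double load     = 1.004;	/* Q014-asdf load_avg: runnable + blocked history */
	double runnable = 0.492;	/* Q014-asdf runnable_load_avg: runnable only */

	/* what ends up feeding the root queue's runnable_load_avg */
	printf("propagating load_avg:          ~%.3f\n", shares * load / weight);
	printf("propagating runnable_load_avg: ~%.3f\n", shares * runnable / weight);
	return 0;
}

That's ~5.6 vs ~2.8, which is the factor-of-two you can see in Q014-/
above and what pushes B's pertask load past A's.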

This shows up in the total number of times we pick the wrong queue and
thus latency.  I'm running the following script with the debug patch
applied.

  #!/bin/bash

  date
  cat /proc/self/cgroup

  echo 1000 > /sys/module/fair/parameters/dbg_odd_nth
  echo 0 > /sys/module/fair/parameters/dbg_odd_cnt

  ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30

  cat /sys/module/fair/parameters/dbg_odd_cnt


With your patch applied, in the root cgroup,

  Fri Apr 28 12:48:59 PDT 2017
  0::/
  Latency percentiles (usec)
	  50.0000th: 26
	  75.0000th: 63
	  90.0000th: 78
	  95.0000th: 88
	  *99.0000th: 707
	  99.5000th: 5096
	  99.9000th: 10352
	  min=0, max=13743
  577


In the /asdf cgroup,

  Fri Apr 28 13:19:53 PDT 2017
  0::/asdf
  Latency percentiles (usec)
	  50.0000th: 35
	  75.0000th: 67
	  90.0000th: 81
	  95.0000th: 98
	  *99.0000th: 2212
	  99.5000th: 4536
	  99.9000th: 11024
	  min=0, max=13026
  1708


The last line is the number of times the load balancer picked a group
w/o two or more schbench threads on a CPU over one w/.  Some number
of these are expected as there are other threads and there is some
play in all the calculations, but propagating load_avg, or not
propagating at all, significantly increases the count and latency.

> > The issue isn't about whether runnable_load_avg or load_avg should be
> > used but the unexpected differences in the metrics that the load
> 
> I think that's the root of the problem. I explain a bit more my view
> on the other thread

So, when picking the busiest group, the only thing which matters is
the queue's runnable_load_avg, which should approximate the sum of all
on-queue loads on that CPU.

If we don't propagate, or propagate load_avg instead, we're factoring
the blocked avg of descendant cgroups into the root's runnable_load_avg,
which is obviously wrong.

We can argue whether overriding a cfs_rq se's load_avg to the scaled
runnable_load_avg of the cfs_rq is the right way to go or we should
introduce a separate channel to propagate runnable_load_avg; however,
it's clear that we need to fix runnable_load_avg propagation one way
or another.

The thing with a cfs_rq se's load_avg is that it isn't really used
anywhere else AFAICS, so overriding it to the cfs_rq's
runnable_load_avg isn't the prettiest but doesn't really change
anything.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-04-28 20:33         ` Tejun Heo
@ 2017-04-28 20:38           ` Tejun Heo
  2017-05-01 15:56           ` Peter Zijlstra
  2017-05-02  7:18           ` Vincent Guittot
  2 siblings, 0 replies; 69+ messages in thread
From: Tejun Heo @ 2017-04-28 20:38 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Here's the debug patch.

The debug condition triggers when the load balancer picks a group w/o
more than one schbench thread on a CPU over one w/.

 /sys/module/fair/parameters/dbg_odd_cnt: resettable counter
 /sys/module/fair/parameters/dbg_odd_nth: dump group states on Nth
					  occurrence via trace_printk()

The load / weights are printed out so that NICE_0_LOAD is 1.000.
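
For reference, this is roughly what the weight formatting below does
(userspace rendition; it assumes the 64-bit NICE_0_LOAD of 1024 << 10,
and the avgs go through scale_load() first so they print in the same
unit):

#include <stdio.h>

#define NICE_0_LOAD	(1024UL << 10)	/* assumed 64-bit value */

int main(void)
{
	unsigned long w = 1536UL << 10;	/* a raw weight of 1.5 nice-0 tasks */

	printf("%lu.%03lu\n", w / NICE_0_LOAD,
	       (w % NICE_0_LOAD) * 1000 / NICE_0_LOAD);	/* prints 1.500 */
	return 0;
}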

Thanks.
---
 kernel/sched/fair.c |  160 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 159 insertions(+), 1 deletion(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -32,11 +32,18 @@
 #include <linux/mempolicy.h>
 #include <linux/migrate.h>
 #include <linux/task_work.h>
+#include <linux/moduleparam.h>
 
 #include <trace/events/sched.h>
 
 #include "sched.h"
 
+static unsigned long dbg_odd_nth;
+static unsigned long dbg_odd_cnt;
+
+module_param(dbg_odd_nth, ulong, 0644);
+module_param(dbg_odd_cnt, ulong, 0644);
+
 /*
  * Targeted preemption latency for CPU-bound tasks:
  *
@@ -7413,6 +7420,149 @@ static inline void update_sg_lb_stats(st
 	sgs->group_type = group_classify(group, sgs);
 }
 
+static int count_schb(struct rq *rq)
+{
+	unsigned long flags;
+	struct task_struct *p;
+	int cnt = 0;
+
+	raw_spin_lock_irqsave(&rq->lock, flags);
+
+	list_for_each_entry(p, &rq->cfs_tasks, se.group_node)
+		if (!strncmp(p->comm, "schbench", 8))
+			cnt++;
+
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
+
+	return cnt;
+}
+
+static bool sg_has_two_schb(struct sched_group *sg)
+{
+	int cpu;
+
+	for_each_cpu(cpu, sched_group_cpus(sg))
+		if (count_schb(cpu_rq(cpu)) >= 2)
+			return true;
+	return false;
+}
+
+static DEFINE_PER_CPU(char [PAGE_SIZE], odd_buf);
+
+#define lbw(x)	(int)((x) / NICE_0_LOAD), (int)(((x) % NICE_0_LOAD) * 1000 / NICE_0_LOAD)
+#define lba(x)	(int)((scale_load(x)) / NICE_0_LOAD), (int)(((scale_load(x)) % NICE_0_LOAD) * 1000 / NICE_0_LOAD)
+
+static int odd_append_se(struct sched_entity *se, const char *postfix,
+			 int cnt, char *buf, size_t size)
+{
+#define odd_append(fmt, args...)	do {				\
+	cnt += scnprintf(buf + cnt, size - cnt, fmt, ##args);		\
+	cnt = min_t(int, cnt, size);					\
+} while (0)
+
+	if (entity_is_task(se)) {
+		struct task_struct *task = task_of(se);
+		odd_append(" %s(%d%s)", task->comm, task->pid, postfix);
+	} else {
+		char nbuf[64];
+		cgroup_name(se->my_q->tg->css.cgroup, nbuf, sizeof(nbuf));
+		odd_append(" %s(%s)", nbuf, postfix);
+	}
+	odd_append(":w=%d.%03d,l=%d.%03d,u=%d.%03d",
+		   lbw(se->load.weight),
+		   lba(se->avg.load_avg),
+		   lba(se->avg.util_avg));
+
+	return cnt;
+}
+
+static void dbg_odd_dump(const char *pref,
+			 struct sched_group *sg, struct sg_lb_stats *sgs)
+{
+	int cpu;
+
+	trace_printk("%sgrp=%*pbl w=%u avg=%d.%03d grp=%d.%03d sum=%d.%03d pertask=%d.%03d\n", pref,
+		     cpumask_pr_args(sched_group_cpus(sg)), sg->group_weight,
+		     lba(sgs->avg_load), lba(sgs->group_load),
+		     lba(sgs->sum_weighted_load), lba(sgs->load_per_task));
+	trace_printk("%sgcap=%d.%03d gutil=%d.%03d run=%u idle=%u gwt=%u type=%d nocap=%d\n",
+		     pref,
+		     lba(sgs->group_capacity), lba(sgs->group_util),
+		     sgs->sum_nr_running, sgs->idle_cpus, sgs->group_weight,
+		     sgs->group_type, sgs->group_no_capacity);
+
+	for_each_cpu(cpu, sched_group_cpus(sg)) {
+		struct task_group *tg;
+		unsigned long flags;
+
+		trace_printk("%sCPU%03d: run=%u schb=%d\n", pref, cpu,
+			     cpu_rq(cpu)->nr_running, count_schb(cpu_rq(cpu)));
+
+		raw_spin_lock_irqsave(&cpu_rq(cpu)->lock, flags);
+
+		list_for_each_entry_rcu(tg, &task_groups, list) {
+			struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];
+			char qname[32] = "root";
+			int depth = 0;
+			long tg_weight = 0, tg_shares = 0;
+			struct sched_entity *se;
+			char *buf = per_cpu_ptr(odd_buf, cpu);
+			int cnt;
+
+			if (!cfs_rq->nr_running)
+				continue;
+
+			if (cfs_rq->tg) {
+				cgroup_name(cfs_rq->tg->css.cgroup, qname, sizeof(qname));
+				if (cfs_rq->tg->se[cpu])
+					depth = cfs_rq->tg->se[cpu]->depth;
+				tg_weight = atomic_long_read(&cfs_rq->tg->load_avg);
+				tg_shares = cfs_rq->tg->shares;
+			}
+
+			trace_printk("%sQ%03d-%s@%d: w=%d.%03d,l=%d.%03d,u=%d.%03d,r=%d.%03d run=%u hrun=%u tgs=%d.%03d tgw=%d.%03d\n",
+				     pref, cpu, qname, depth,
+				     lbw(cfs_rq->load.weight),
+				     lba(cfs_rq->avg.load_avg),
+				     lba(cfs_rq->avg.util_avg),
+				     lba(cfs_rq->runnable_load_avg),
+				     cfs_rq->nr_running, cfs_rq->h_nr_running,
+				     lbw(tg_shares),
+				     lba(tg_weight));
+
+			buf[0] = '\0';
+			cnt = 0;
+
+			if (cfs_rq->curr)
+				cnt = odd_append_se(cfs_rq->curr, "C", cnt, buf, PAGE_SIZE);
+
+			for (se = __pick_first_entity(cfs_rq); se;
+			     se = __pick_next_entity(se))
+				cnt = odd_append_se(se, "", cnt, buf, PAGE_SIZE);
+
+			trace_printk("%sQ%03d-%s@%d: %s\n",
+				     pref, cpu, qname, depth, buf);
+		}
+
+		raw_spin_unlock_irqrestore(&cpu_rq(cpu)->lock, flags);
+	}
+}
+
+/* a has >= 2 dts, b doesn't */
+static void dbg_odd(struct lb_env *env,
+		    struct sched_group *sga, struct sg_lb_stats *sgsa,
+		    struct sched_group *sgb, struct sg_lb_stats *sgsb)
+{
+	if (dbg_odd_nth && (dbg_odd_cnt++ % dbg_odd_nth))
+		return;
+
+	trace_printk("odd: dst=%d idle=%d brk=%u lbtgt=%*pbl type=%d\n",
+		     env->dst_cpu, env->idle, env->loop_break,
+		     cpumask_pr_args(env->cpus), env->fbq_type);
+	dbg_odd_dump("A: ", sga, sgsa);
+	dbg_odd_dump("B: ", sgb, sgsb);
+}
+
 /**
  * update_sd_pick_busiest - return 1 on busiest group
  * @env: The load balancing environment.
@@ -7432,6 +7582,8 @@ static bool update_sd_pick_busiest(struc
 				   struct sg_lb_stats *sgs)
 {
 	struct sg_lb_stats *busiest = &sds->busiest_stat;
+	bool busiest_has_two = sds->busiest && sg_has_two_schb(sds->busiest);
+	bool sg_has_two = sg_has_two_schb(sg);
 
 	if (sgs->group_type > busiest->group_type)
 		return true;
@@ -7439,8 +7591,14 @@ static bool update_sd_pick_busiest(struc
 	if (sgs->group_type < busiest->group_type)
 		return false;
 
-	if (sgs->avg_load <= busiest->avg_load)
+	if (sgs->avg_load <= busiest->avg_load) {
+		if (sg_has_two && !busiest_has_two)
+			dbg_odd(env, sg, sgs, sds->busiest, busiest);
 		return false;
+	}
+
+	if (!sg_has_two && busiest_has_two)
+		dbg_odd(env, sds->busiest, busiest, sg, sgs);
 
 	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
 		goto asym_packing;

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
  2017-04-26 16:51       ` Vincent Guittot
  2017-04-26 22:40         ` Tejun Heo
@ 2017-05-01 14:17         ` Peter Zijlstra
  2017-05-01 14:52           ` Peter Zijlstra
  2017-05-01 21:56           ` Tejun Heo
  1 sibling, 2 replies; 69+ messages in thread
From: Peter Zijlstra @ 2017-05-01 14:17 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Tejun Heo, Ingo Molnar, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hi,

sorry for being late to the party, trying to catch-up..

On Wed, Apr 26, 2017 at 06:51:23PM +0200, Vincent Guittot wrote:
> Ok, I see your point and agree that there is an issue when propagating
> load_avg of a task group which has tasks with lower weight than the share
> but your proposal has got issue because it uses runnable_load_avg instead
> of load_avg and this makes propagation of loadavg_avg incorrect, something
> like below which keeps using load_avg solve the problem
> 
> +	if (gcfs_rq->load.weight) {
> +		long shares = scale_load_down(calc_cfs_shares(gcfs_rq, gcfs_rq->tg));
> +
> +		load = min(gcfs_rq->avg.load_avg *
> +			   shares / scale_load_down(gcfs_rq->load.weight), shares);
> 


So this here does:

  ( tg->load_avg = \Sum cfs_rq->load_avg )

  load      = cfs_rq->load.weight

  tg_weight = tg->load_avg - cfs_rq->contrib + load


           tg->shares * load
  shares = -----------------
               tg_weight


                        cfs_rq->load_avg
  avg_shares = shares * ----------------
                              load

	       tg->shares * cfs_rq->load_avg
             = -----------------------------
                      tg_weight


  ( se->load.weight = shares )

  se->load_avg = min(shares, avg_shares);


So where shares (and se->load.weight) are an upper bound (due to using
cfs_rq->load.weight, see calc_cfs_shares), avg_shares is supposedly a
more accurate representation based on our PELT averages.

This looks OK; and I agree with Vincent that we should use
cfs_rq->avg.load_avg, not cfs_rq->runnable_load_avg, since tg->load_avg
is a sum of the former, not the latter.

Also, arguably calculating the above avg_shares directly (using the
second equation) might be more precise; but I doubt it makes much of a
difference.  However, since we do min(), we should at least clamp
against MIN_SHARES again.

Furthermore, it appears to me we want a different tg_weight value for
the avg_shares, something like:

  tg_weight = tg->load_avg - cfs_rq->contrib + cfs_rq->avg.load_avg

To better match with the numerator's units, otherwise it will have a
tendency to push avg_shares down further than it needs to be.
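
A made-up numerical example of that bias, plugging the same numerator
into both denominators:

#include <stdio.h>

int main(void)
{
	long tg_shares   = 1024;
	long tg_load_avg = 3000;	/* \Sum cfs_rq->load_avg over all CPUs */
	long contrib     = 900;		/* this cfs_rq's tg_load_avg_contrib */
	long weight      = 2048;	/* cfs_rq->load.weight (upper bound) */
	long load_avg    = 700;		/* cfs_rq->avg.load_avg */

	long tg_weight_w = tg_load_avg - contrib + weight;	/* weight-based */
	long tg_weight_a = tg_load_avg - contrib + load_avg;	/* avg-based */

	printf("shares                     = %ld\n", tg_shares * weight   / tg_weight_w);
	printf("avg_shares, weight denom   = %ld\n", tg_shares * load_avg / tg_weight_w);
	printf("avg_shares, load_avg denom = %ld\n", tg_shares * load_avg / tg_weight_a);
	return 0;
}

Here avg_shares comes out ~172 with the weight-based denominator vs
256 with the matching one, for the exact same numerator.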


(All assuming it actually works of course.. compile tested only)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2634,7 +2634,9 @@ account_entity_dequeue(struct cfs_rq *cf
 # ifdef CONFIG_SMP
 static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
 {
-	long tg_weight, load, shares;
+	long tg_weight, tg_shares, load, shares;
+
+	tg_shares = READ_ONCE(tg->shares);
 
 	/*
 	 * This really should be: cfs_rq->avg.load_avg, but instead we use
@@ -2665,12 +2667,7 @@ static long calc_cfs_shares(struct cfs_r
 	 * case no task is runnable on a CPU MIN_SHARES=2 should be returned
 	 * instead of 0.
 	 */
-	if (shares < MIN_SHARES)
-		shares = MIN_SHARES;
-	if (shares > tg->shares)
-		shares = tg->shares;
-
-	return shares;
+	return clamp_t(long, shares, MIN_SHARES, tg_shares);
 }
 # else /* CONFIG_SMP */
 static inline long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
@@ -3075,37 +3072,69 @@ static inline void
 update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
-	long delta, load = gcfs_rq->avg.load_avg;
+	long load = 0, delta;
 
 	/*
-	 * If the load of group cfs_rq is null, the load of the
-	 * sched_entity will also be null so we can skip the formula
+	 * If there is load in our group cfs_rq (@gcfs_rq), then update its
+	 * corresponding group se (@se) load value to reflect any change in
+	 * weights.
+	 *
+	 * Also see effective_load(), where we do something similar.
+	 *
+	 *   ( tg->load_avg = \Sum cfs_rq->load_avg )
+	 *   
+	 *   tg_weight = tg->load_avg - cfs_rq->contrib
+	 *   
+	 *   
+	 *            tg->shares * cfs_rq->weight
+	 *   shares = ---------------------------
+	 *            tg_weight  + cfs_rq->weight
+	 *   
+	 *   
+	 *		  tg->shares * cfs_rq->load_avg
+	 *   avg_shares = -----------------------------
+	 *                tg_weight  + cfs_rq->load_avg
+	 *   
+	 *   
+	 *   ( se->load.weight = shares )
+	 *   
+	 *   se->load_avg = min(shares, avg_shares);
+	 *
+	 * Where @shares is an upper bound to @avg_shares, see the comments
+	 * in calc_cfs_shares().
 	 */
-	if (load) {
-		long tg_load;
-
-		/* Get tg's load and ensure tg_load > 0 */
-		tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
-
-		/* Ensure tg_load >= load and updated with current load*/
-		tg_load -= gcfs_rq->tg_load_avg_contrib;
-		tg_load += load;
+	if (gcfs_rq->load.weight) {
+		long tg_weight1, tg_weight2, tg_shares, shares, avg_shares;
+		struct task_group *tg = gcfs_rq->tg;
 
 		/*
-		 * We need to compute a correction term in the case that the
-		 * task group is consuming more CPU than a task of equal
-		 * weight. A task with a weight equals to tg->shares will have
-		 * a load less or equal to scale_load_down(tg->shares).
-		 * Similarly, the sched_entities that represent the task group
-		 * at parent level, can't have a load higher than
-		 * scale_load_down(tg->shares). And the Sum of sched_entities'
-		 * load must be <= scale_load_down(tg->shares).
+		 * This open-codes calc_cfs_shares(), in order to ensure
+		 * we use consistent values for @shares and @avg_shares,
+		 * as well as make sure we clip the result properly.
 		 */
-		if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
-			/* scale gcfs_rq's load into tg's shares*/
-			load *= scale_load_down(gcfs_rq->tg->shares);
-			load /= tg_load;
-		}
+
+		tg_shares = READ_ONCE(tg->shares);
+
+		load = scale_load_down(gcfs_rq->load.weight);
+
+		tg_weight1 = atomic_long_read(&tg->load_avg);
+		tg_weight2 = (tg_weight1 -= cfs_rq->tg_load_avg_contrib);
+		tg_weight1 += load;
+
+		shares = tg_shares * load;
+		if (tg_weight1)
+			shares /= tg_weight1;
+
+
+		load = READ_ONCE(gcfs_rq->avg.load_avg);
+
+		tg_weight2 += load;
+
+		avg_shares = tg_shares * load;
+		if (tg_weight2)
+			avg_shares /= tg_weight2;
+
+		load = clamp_t(long, min(shares, avg_shares), MIN_SHARES, tg_shares);
 	}
 
 	delta = load - se->avg.load_avg;

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
  2017-05-01 14:17         ` Peter Zijlstra
@ 2017-05-01 14:52           ` Peter Zijlstra
  2017-05-01 21:56           ` Tejun Heo
  1 sibling, 0 replies; 69+ messages in thread
From: Peter Zijlstra @ 2017-05-01 14:52 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Tejun Heo, Ingo Molnar, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On Mon, May 01, 2017 at 04:17:33PM +0200, Peter Zijlstra wrote:

> So this here does:
> 
>   ( tg->load_avg = \Sum cfs_rq->load_avg )
> 
>   load      = cfs_rq->load.weight
> 
>   tg_weight = tg->load_avg - cfs_rq->contrib + load
> 
> 
>            tg->shares * load
>   shares = -----------------
>                tg_weight
> 
> 
>                         cfs_rq->load_avg
>   avg_shares = shares * ----------------
>                               load
> 
> 	       tg->shares * cfs_rq->load_avg
>              = -----------------------------
>                       tg_weight
> 
> 
>   ( se->load.weight = shares )
> 
>   se->load_avg = min(shares, avg_shares);

I wonder though; do we really need the min() ? Can't we simply use
avg_shares all the time?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-04-28 20:33         ` Tejun Heo
  2017-04-28 20:38           ` Tejun Heo
@ 2017-05-01 15:56           ` Peter Zijlstra
  2017-05-02 22:01             ` Tejun Heo
  2017-05-02  7:18           ` Vincent Guittot
  2 siblings, 1 reply; 69+ messages in thread
From: Peter Zijlstra @ 2017-05-01 15:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vincent Guittot, Ingo Molnar, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On Fri, Apr 28, 2017 at 04:33:47PM -0400, Tejun Heo wrote:
> I'm attaching the debug patch.  With your change (avg instead of
> runnable_avg), the following trace shows why it's wrong.

Ah, OK. So you really want runnable_avg (and I understand why), which is
rather unfortunate, since we have everything on load_avg.

So for shares, load_avg gives a more stable number. This is important
since tg->load_avg is a global number, so computing it is expensive (and
slow). Therefore more stable numbers are good.

> The thing with cfs_rq se's load_avg is that, it isn't really used
> anywhere else AFAICS, so overriding it to the cfs_rq's
> runnable_load_avg isn't prettiest but doesn't really change anything.

You mean, consistently expressing a group's se->load.weight in terms of
runnable_load_avg? See the above.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
  2017-05-01 14:17         ` Peter Zijlstra
  2017-05-01 14:52           ` Peter Zijlstra
@ 2017-05-01 21:56           ` Tejun Heo
  2017-05-02  8:19             ` Peter Zijlstra
  1 sibling, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-05-01 21:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Ingo Molnar, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hello, Peter.

On Mon, May 01, 2017 at 04:17:33PM +0200, Peter Zijlstra wrote:
> So this here does:
> 
>   ( tg->load_avg = \Sum cfs_rq->load_avg )
> 
>   load      = cfs_rq->load.weight
> 
>   tg_weight = tg->load_avg - cfs_rq->contrib + load
> 
> 
>            tg->shares * load
>   shares = -----------------
>                tg_weight
> 
> 
>                         cfs_rq->load_avg
>   avg_shares = shares * ----------------
>                               load
> 
> 	       tg->shares * cfs_rq->load_avg
>              = -----------------------------
>                       tg_weight
> 
> 
>   ( se->load.weight = shares )
> 
>   se->load_avg = min(shares, avg_shares);
> 
> 
> So where shares (and se->load.weight) are an upper bound (due to using
> cfs_rq->load.weight, see calc_cfs_shares), avg_shares is supposedly a
> more accurate representation based on our PELT averages.
> 
> This looks OK; and I agree with Vincent that we should use
> cfs_rq->avg.load_avg, not cfs_rq->runnable_load_avg, since tg->load_avg
> is a sum of the former, not the latter.

With this, we end up using a different metric for picking the busiest
queue depending on whether there are nested cfs_rq's or not.  The
root's runnable_load_avg ends up including the blocked load avgs queued
behind nested cfs_rq's because we lose the distinction between running
and blocked loads across nesting.

> Also, arguably calculating the above avg_shares directly (using the
> second equation) might be more precise; but I doubt it makes much of a
> difference, however since we do min(), we should at least clamp against
> MIN_SHARES again.
> 
> Furthermore, it appears to me we want a different tg_weight value for
> the avg_shares, something like:
> 
>   tg_weight = tg->load_avg - cfs_rq->contrib + cfs_rq->avg.load_avg
> 
> To better match with the numerator's units, otherwise it will have a
> tendency to push avg_shares down further than it needs to be.
> 
> 
> (All assuming it actually works of course.. compile tested only)

So, if changing gcfs_rq se->load_avg.avg to match the gcfs_rq's
runnable_load_avg is icky, and I can see why that would be, we can
simply introduce a separate channel of propagation so that
runnable_load_avg gets propagated independently from se->load_avg
propagation, so that, for every cfs_rq, its runnable_load_avg is the
sum of all active load_avgs queued on itself and its descendants,
which is the number we want for load balancing anyway.  I'll try to
spin a patch which does that.
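
Something like the toy model below is the invariant I'm after
(userspace sketch only, hypothetical names, and it deliberately ignores
the shares scaling discussed elsewhere in the thread):

#include <stdio.h>

/*
 * Toy model, not kernel code: each queue keeps a runnable sum and any
 * change in a child is applied as a delta to every ancestor, so the
 * root's sum always equals the sum of the runnable loads of the tasks
 * below it, no matter how deep the nesting.
 */
struct toy_cfs_rq {
	struct toy_cfs_rq *parent;
	long runnable_sum;		/* stands in for runnable_load_avg */
};

static void propagate_runnable(struct toy_cfs_rq *q, long delta)
{
	for (; q; q = q->parent)
		q->runnable_sum += delta;
}

int main(void)
{
	struct toy_cfs_rq root = { NULL, 0 };
	struct toy_cfs_rq asdf = { &root, 0 };

	propagate_runnable(&asdf, +512);	/* task wakes up in /asdf */
	propagate_runnable(&asdf, +490);	/* second task wakes up */
	propagate_runnable(&asdf, -512);	/* first task blocks */

	printf("asdf=%ld root=%ld\n", asdf.runnable_sum, root.runnable_sum);
	return 0;
}

Both sums end up at 490 - the blocked task's history no longer shows
at the root, which is the property the load balancer wants.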

I still wonder what gcfs_rq se->load_avg.avg is good for tho?  It's
nice to keep the value in line but is it actually used anywhere?  The
parent cfs_rq's values are independently calculated and, AFAICS, the
only time the value is used is to propagate into the parent's
runnable_load_sum, which has to use a different value, as explained
above.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-04-28 16:14                   ` Tejun Heo
@ 2017-05-02  6:56                     ` Vincent Guittot
  2017-05-02 20:56                       ` Tejun Heo
  0 siblings, 1 reply; 69+ messages in thread
From: Vincent Guittot @ 2017-05-02  6:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On 28 April 2017 at 18:14, Tejun Heo <tj@kernel.org> wrote:
> Hello, Vincent.
>
>>
>> The only interest of runnable_load_avg is to be null when a cfs_rq is
>> idle whereas load_avg is not  but not to be higher than load_avg. The
>> root cause is that load_balance only looks at "load" but not number of
>> task currently running and that's probably the main problem:
>> runnable_load_avg has been added because load_balance fails to filter
>> idle group and idle rq. We should better add a new type in
>> group_classify to tag group that are  idle and the same in
>> find_busiest_queue more.
>
> I'll follow up in the other subthread but there really is fundamental
> difference in how we calculate runnable_avg w/ and w/o cgroups.
> Indepndent of whether we can improve the load balancer further, it is
> an existing bug.

I'd like to weigh that a bit.
The runnable_load_avg works correctly as it is because it correctly
reflects the load of the runnable entities at the root domain.
If you start to propagate the runnable_load_avg into the load_avg of
the group entity, the load will become unstable.
runnable_load_avg has been added to fix load_balance being unable to
select the right busiest rq. So the goal is to use load_avg more and
more, not the opposite.
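
As a rough illustration of the stability point (toy decay loop, not the
kernel's PELT implementation; the constant just approximates y^32 = 0.5
and the duty cycle is made up):

#include <stdio.h>
#include <math.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);	/* ~PELT decay per 1ms period */
	double load_avg = 0, runnable = 0;

	for (int ms = 0; ms < 200; ms++) {
		int queued = (ms % 4) == 0;	/* task runnable 1ms out of 4 */

		load_avg = load_avg * y + (queued ? 1024 * (1 - y) : 0);
		/* the runnable signal only exists while the task is queued */
		runnable = queued ? load_avg : 0;

		if (ms >= 196)
			printf("ms=%d load_avg=%.0f runnable=%.0f\n",
			       ms, load_avg, runnable);
	}
	return 0;
}

load_avg settles around a quarter of the weight while the runnable
signal keeps flipping between that and zero; that flip is what would be
fed into the group entity.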

>
>> > * Can you please double check that the higher latencies w/ the patch
>> >   is reliably reproducible?  The test machines that I use have
>> >   variable management load.  They never dominate the machine but are
>> >   enough to disturb the results so that to drawing out a reliable
>> >   pattern takes a lot of repeated runs.  I'd really appreciate if you
>> >   could double check that the pattern is reliable with different run
>> >   patterns (ie. instead of 10 consecutive runs after another,
>> >   interleaved).
>>
>> I always let time between 2 consecutive run and the 10 consecutive
>> runs take around 2min to execute
>>
>> Then I have run several time these 10 consecutive test and results stay the same
>
> Can you please try the patch at the end of this message?  I'm
> wondering whether you're seeing the errors accumulating from the wrong
> min().

I still have the regression with the patch below.
The regression comes from using runnable_load_avg to propagate. As
load_avg becomes null at some point, it breaks the computation of
shares and the load_avg stays very low.

>
>> >   ones.  If there's something I can easily buy, please point me to it.
>> >   If there's something I can loan, that'd be great too.
>>
>> It looks like most of web site are currently out of stock
>
> :(
>
>> > * If not, I'll try to clean up the debug patches I have and send them
>> >   your way to get more visiblity but given these things tend to be
>> >   very iterative, it might take quite a few back and forth.
>>
>> Yes, that could be usefull. Even a trace of regression could be useful
>
> Yeah, will clean it up a bit and post.  Will also post some traces.
>
> And here's the updated patch.  I didn't add the suggested
> scale_down()'s.  I'll follow up on that in the other subthread.
> Thanks.
>
>  kernel/sched/fair.c |   47 ++++++++++++++++++++---------------------------
>  1 file changed, 20 insertions(+), 27 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3078,37 +3078,30 @@ static inline void
>  update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>         struct cfs_rq *gcfs_rq = group_cfs_rq(se);
> -       long delta, load = gcfs_rq->avg.load_avg;
> +       long load = 0, delta;
>
>         /*
> -        * If the load of group cfs_rq is null, the load of the
> -        * sched_entity will also be null so we can skip the formula
> +        * A cfs_rq's load avg contribution to the parent should be scaled
> +        * to the sched_entity's weight.  Use freshly calculated shares
> +        * instead of @se->load.weight as the latter may not reflect
> +        * changes from the current scheduling operation.
> +        *
> +        * Note that the propagation source is runnable_load_avg instead of
> +        * load_avg.  This keeps every cfs_rq's runnable_load_avg true to
> +        * the sum of the scaled loads of all tasks queued under it, which
> +        * is important for the correct operation of the load balancer.
> +        *
> +        * This can make the sched_entity's load_avg jumpier but that
> +        * correctly reflects what would happen without cgroups if each
> +        * task's load is scaled across nesting - the load is being
> +        * averaged at the task and each cfs_rq.
>          */
> -       if (load) {
> -               long tg_load;
> +       if (gcfs_rq->load.weight) {
> +               long shares = calc_cfs_shares(gcfs_rq, gcfs_rq->tg);
>
> -               /* Get tg's load and ensure tg_load > 0 */
> -               tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
> -
> -               /* Ensure tg_load >= load and updated with current load*/
> -               tg_load -= gcfs_rq->tg_load_avg_contrib;
> -               tg_load += load;
> -
> -               /*
> -                * We need to compute a correction term in the case that the
> -                * task group is consuming more CPU than a task of equal
> -                * weight. A task with a weight equals to tg->shares will have
> -                * a load less or equal to scale_load_down(tg->shares).
> -                * Similarly, the sched_entities that represent the task group
> -                * at parent level, can't have a load higher than
> -                * scale_load_down(tg->shares). And the Sum of sched_entities'
> -                * load must be <= scale_load_down(tg->shares).
> -                */
> -               if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
> -                       /* scale gcfs_rq's load into tg's shares*/
> -                       load *= scale_load_down(gcfs_rq->tg->shares);
> -                       load /= tg_load;
> -               }
> +               load = min_t(long, scale_load_down(shares),
> +                            gcfs_rq->runnable_load_avg *
> +                            shares / gcfs_rq->load.weight);
>         }
>
>         delta = load - se->avg.load_avg;

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-04-28 20:33         ` Tejun Heo
  2017-04-28 20:38           ` Tejun Heo
  2017-05-01 15:56           ` Peter Zijlstra
@ 2017-05-02  7:18           ` Vincent Guittot
  2017-05-02 13:26             ` Vincent Guittot
  2017-05-02 21:50             ` Tejun Heo
  2 siblings, 2 replies; 69+ messages in thread
From: Vincent Guittot @ 2017-05-02  7:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On 28 April 2017 at 22:33, Tejun Heo <tj@kernel.org> wrote:
> Hello, Vincent.
>
> On Thu, Apr 27, 2017 at 10:29:10AM +0200, Vincent Guittot wrote:
>> On 27 April 2017 at 00:52, Tejun Heo <tj@kernel.org> wrote:
>> > Hello,
>> >
>> > On Wed, Apr 26, 2017 at 08:12:09PM +0200, Vincent Guittot wrote:
>> >> On 24 April 2017 at 22:14, Tejun Heo <tj@kernel.org> wrote:
>> >> Can the problem be on the load balance side instead ?  and more
>> >> precisely in the wakeup path ?
>> >> After looking at the trace, it seems that task placement happens at
>> >> wake up path and if it fails to select the right idle cpu at wake up,
>> >> you will have to wait for a load balance which is alreayd too late
>> >
>> > Oh, I was tracing most of scheduler activities and the ratios of
>> > wakeups picking idle CPUs were about the same regardless of cgroup
>> > membership.  I can confidently say that the latency issue that I'm
>> > seeing is from load balancer picking the wrong busiest CPU, which is
>> > not to say that there can be other problems.
>>
>> ok. Is there any trace that you can share ? your behavior seems
>> different of mine
>
> I'm attaching the debug patch.  With your change (avg instead of
> runnable_avg), the following trace shows why it's wrong.
>
> It's dumping a case where group A has a CPU w/ more than two schbench
> threads and B doesn't, but the load balancer is determining that B is
> loaded heavier.
>
>  dbg_odd: odd: dst=28 idle=2 brk=32 lbtgt=0-31 type=2
>  dbg_odd_dump: A: grp=1,17 w=2 avg=7.247 grp=8.337 sum=8.337 pertask=2.779
>  dbg_odd_dump: A: gcap=1.150 gutil=1.095 run=3 idle=0 gwt=2 type=2 nocap=1
>  dbg_odd_dump: A: CPU001: run=1 schb=1
>  dbg_odd_dump: A: Q001-asdf: w=1.000,l=0.525,u=0.513,r=0.527 run=1 hrun=1 tgs=100.000 tgw=17.266
>  dbg_odd_dump: A: Q001-asdf:  schbench(153757C):w=1.000,l=0.527,u=0.514
>  dbg_odd_dump: A: Q001-/: w=5.744,l=2.522,u=0.520,r=3.067 run=1 hrun=1 tgs=1.000 tgw=0.000
>  dbg_odd_dump: A: Q001-/:  asdf(C):w=5.744,l=3.017,u=0.521
>  dbg_odd_dump: A: CPU017: run=2 schb=2
>  dbg_odd_dump: A: Q017-asdf: w=2.000,l=0.989,u=0.966,r=0.988 run=2 hrun=2 tgs=100.000 tgw=17.266
>  dbg_odd_dump: A: Q017-asdf:  schbench(153737C):w=1.000,l=0.493,u=0.482 schbench(153739):w=1.000,l=0.494,u=0.483
>  dbg_odd_dump: A: Q017-/: w=10.653,l=7.888,u=0.973,r=5.270 run=1 hrun=2 tgs=1.000 tgw=0.000
>  dbg_odd_dump: A: Q017-/:  asdf(C):w=10.653,l=5.269,u=0.966
>  dbg_odd_dump: B: grp=14,30 w=2 avg=7.666 grp=8.819 sum=8.819 pertask=4.409
>  dbg_odd_dump: B: gcap=1.150 gutil=1.116 run=2 idle=0 gwt=2 type=2 nocap=1
>  dbg_odd_dump: B: CPU014: run=1 schb=1
>  dbg_odd_dump: B: Q014-asdf: w=1.000,l=1.004,u=0.970,r=0.492 run=1 hrun=1 tgs=100.000 tgw=17.266
>  dbg_odd_dump: B: Q014-asdf:  schbench(153760C):w=1.000,l=0.491,u=0.476
>  dbg_odd_dump: B: Q014-/: w=5.605,l=11.146,u=0.970,r=5.774 run=1 hrun=1 tgs=1.000 tgw=0.000
>  dbg_odd_dump: B: Q014-/:  asdf(C):w=5.605,l=5.766,u=0.970
>  dbg_odd_dump: B: CPU030: run=1 schb=1
>  dbg_odd_dump: B: Q030-asdf: w=1.000,l=0.538,u=0.518,r=0.558 run=1 hrun=1 tgs=100.000 tgw=17.266
>  dbg_odd_dump: B: Q030-asdf:  schbench(153747C):w=1.000,l=0.537,u=0.516
>  dbg_odd_dump: B: Q030-/: w=5.758,l=3.186,u=0.541,r=3.044 run=1 hrun=1 tgs=1.000 tgw=0.000
>  dbg_odd_dump: B: Q030-/:  asdf(C):w=5.758,l=3.092,u=0.516
>
> You can notice that B's pertask weight is 4.409 which is way higher
> than A's 2.779, and this is from Q014-asdf's contribution to Q014-/ is
> twice as high as it should be.  The root queue's runnable avg should

Are you sure that this is because of blocked load in group A? It can
be that Q014-asdf has already had to wait before running and its load
kept increasing while runnable but not running.
IIUC your trace, group A has 2 running tasks and group B only one, but
load_balance selects B because its sgs->avg_load is higher. But this
can also happen even if the runnable_load_avg of the child cfs_rq was
propagated correctly into the group entity, because we can have a
situation where group A has only 1 task with a higher load than the 2
tasks on group B; even if blocked load is not taken into account,
load_balance will select A.

IMHO, we should rather improve the load balance selection. I'm going to
add smarter group selection in load_balance. That's something we
should have already done but it was difficult without load/util_avg
propagation. It should be doable now.

> only contain what's currently active but because we're scaling load
> avg which includes both active and blocked, we're ending up picking
> group B over A.
>
> This shows up in the total number of times we pick the wrong queue and
> thus latency.  I'm running the following script with the debug patch
> applied.
>
>   #!/bin/bash
>
>   date
>   cat /proc/self/cgroup
>
>   echo 1000 > /sys/module/fair/parameters/dbg_odd_nth
>   echo 0 > /sys/module/fair/parameters/dbg_odd_cnt
>
>   ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>
>   cat /sys/module/fair/parameters/dbg_odd_cnt
>
>
> With your patch applied, in the root cgroup,
>
>   Fri Apr 28 12:48:59 PDT 2017
>   0::/
>   Latency percentiles (usec)
>           50.0000th: 26
>           75.0000th: 63
>           90.0000th: 78
>           95.0000th: 88
>           *99.0000th: 707
>           99.5000th: 5096
>           99.9000th: 10352
>           min=0, max=13743
>   577
>
>
> In the /asdf cgroup,
>
>   Fri Apr 28 13:19:53 PDT 2017
>   0::/asdf
>   Latency percentiles (usec)
>           50.0000th: 35
>           75.0000th: 67
>           90.0000th: 81
>           95.0000th: 98
>           *99.0000th: 2212
>           99.5000th: 4536
>           99.9000th: 11024
>           min=0, max=13026
>   1708
>
>
> The last line is the number of times the load balancer picked a group
> w/o more than two schbench threads on a CPU over one w/.  Some number
> of these are expected as there are other threads and there are some
> plays in all the calculations but propgating avg or not propgating at
> all significantly increases the count and latency.
>
>> > The issue isn't about whether runnable_load_avg or load_avg should be
>> > used but the unexpected differences in the metrics that the load
>>
>> I think that's the root of the problem. I explain a bit more my view
>> on the other thread
>
> So, when picking the busiest group, the only thing which matters is
> the queue's runnable_load_avg, which should approximate the sum of all
> on-queue loads on that CPU.
>
> If we don't propagate or propagate load_avg, we're factoring in
> blocked avg of descendent cgroups into the root's runnable_load_avg
> which is obviously wrong.
>
> We can argue whether overriding a cfs_rq se's load_avg to the scaled
> runnable_load_avg of the cfs_rq is the right way to go or we should
> introduce a separate channel to propagate runnable_load_avg; however,
> it's clear that we need to fix runnable_load_avg propagation one way
> or another.

The minimum would be to not break load_avg

>
> The thing with cfs_rq se's load_avg is that, it isn't really used
> anywhere else AFAICS, so overriding it to the cfs_rq's
> runnable_load_avg isn't prettiest but doesn't really change anything.

load_avg is used for defining the share of each cfs_rq.

>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
  2017-04-28 17:46         ` Tejun Heo
@ 2017-05-02  7:20           ` Vincent Guittot
  0 siblings, 0 replies; 69+ messages in thread
From: Vincent Guittot @ 2017-05-02  7:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On 28 April 2017 at 19:46, Tejun Heo <tj@kernel.org> wrote:
> Hello, Vincent.
>
> On Thu, Apr 27, 2017 at 10:59:12AM +0200, Vincent Guittot wrote:
>> > But the only difference there is that we lose accuracy in calculation;
>> > otherwise, the end results are the same, no?
>>
>> Yes the end result is the same, it was mainly to point out the range
>> difference and explain why we need scale_load_down(shares) for the 2nd
>> argument of min.
>> This should also explain the warning issue you mentioned earlier
>
> I'm not sure this makes sense.  Practically, we're doing more shifts
> just to lose calculation accuracy.  Even conceptually, what we're
> doing is
>
>              C
>         A * ---
>              B
>
> Where A is in a different scale while B and C are in the same.  What
> you're suggesting is
>
>              scale_down(C)
>         A * ---------------
>              scale_down(B)
>
> I can't see why this is better in any way.

I'm not saying it's better, I'm saying that it might be the cause of
the build warning you mentioned because you are mixing u64 and
unsigned long.

>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
  2017-05-01 21:56           ` Tejun Heo
@ 2017-05-02  8:19             ` Peter Zijlstra
  2017-05-02  8:30               ` Peter Zijlstra
  0 siblings, 1 reply; 69+ messages in thread
From: Peter Zijlstra @ 2017-05-02  8:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vincent Guittot, Ingo Molnar, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On Mon, May 01, 2017 at 05:56:04PM -0400, Tejun Heo wrote:

> I still wonder what gcfs_rq se->load_avg.avg is good for tho?  It's
> nice to keep the value in line but is it actually used anywhere?

It's used to manage the boundary conditions of its parent cfs_rq's
sums. When we enqueue the group se to the parent group, we do:

  enqueue_entity()
    enqueue_entity_load_avg()
      // update runnable
      // add se.avg to cfs_rq.avg if 'migrated'

Now obviously group se's never actually migrate between CPUs, but they
still pass through that code. I think the 'migrate' condition also
triggers for new entities for example, they 'migrate' into the system.

Similarly for dequeue; that should trigger when group entities
get destroyed, they 'migrate' out of the system.

But I get what you're saying. They're not overly useful as such.

> So, if changing gcfs_rq se->load_avg.avg to match the gcfs_rq's
> runnable_load_avg is icky, and I can see why that would be, we can
> simply introduce a separate channel of propagation ...

Can you have a play with something like the below? I suspect
'shares_runnable' might work for you here.

It's still 'icky', but not more so than calc_cfs_shares() was to begin
with. When many CPUs are involved the global sum (tg_shares) is still
dominated by avg_load and the division will have a downward bias, but
given the purpose that might not be too bad.

---
 kernel/sched/fair.c | 62 +++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 44 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a903276fcb62..7d1fb5f421bc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2630,26 +2630,56 @@ static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p)
 	cfs_rq->nr_running--;
 }
 
+enum shares_type {
+	shares_runnable,
+	shares_avg,
+	shares_weight,
+};
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 # ifdef CONFIG_SMP
-static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
+static long
+calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg, enum shares_type)
 {
-	long tg_weight, load, shares;
+	long tg_weight, tg_shares, load, shares;
 
-	/*
-	 * This really should be: cfs_rq->avg.load_avg, but instead we use
-	 * cfs_rq->load.weight, which is its upper bound. This helps ramp up
-	 * the shares for small weight interactive tasks.
-	 */
-	load = scale_load_down(cfs_rq->load.weight);
+	tg_shares = READ_ONCE(tg->shares);
+
+	switch (shares_type) {
+	case shares_runnable:
+		/*
+		 * Instead of the correct cfs_rq->avg.load_avg we use
+		 * cfs_rq->runnable_load_avg, which does not include the
+		 * blocked load.
+		 */
+		load = cfs_rq->runnable_load_avg;
+		break;
+
+	case shares_avg:
+		load = cfs_rq->avg.load_avg;
+		break;
+
+	case shares_weight:
+		/*
+		 * Instead of the correct cfs_rq->avg.load_avg we use
+		 * cfs_rq->load.weight, which is its upper bound. This helps
+		 * ramp up the shares for small weight interactive tasks.
+		 */
+		load = scale_load_down(cfs_rq->load.weight);
+		break;
+	}
 
 	tg_weight = atomic_long_read(&tg->load_avg);
 
-	/* Ensure tg_weight >= load */
+	/*
+	 * This ensures the sum is up-to-date for this CPU, in case of the other
+	 * two approximations it biases the sum towards their value and in case
+	 * of (near) UP ensures the division ends up <= 1.
+	 */
 	tg_weight -= cfs_rq->tg_load_avg_contrib;
 	tg_weight += load;
 
-	shares = (tg->shares * load);
+	shares = (tg_shares * load);
 	if (tg_weight)
 		shares /= tg_weight;
 
@@ -2665,15 +2695,11 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
 	 * case no task is runnable on a CPU MIN_SHARES=2 should be returned
 	 * instead of 0.
 	 */
-	if (shares < MIN_SHARES)
-		shares = MIN_SHARES;
-	if (shares > tg->shares)
-		shares = tg->shares;
-
-	return shares;
+	return clamp_t(long, shares, MIN_SHARES, tg_shares);
 }
 # else /* CONFIG_SMP */
-static inline long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
+static inline long
+calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg, enum shares_type)
 {
 	return tg->shares;
 }
@@ -2715,7 +2741,7 @@ static void update_cfs_shares(struct sched_entity *se)
 	if (likely(se->load.weight == tg->shares))
 		return;
 #endif
-	shares = calc_cfs_shares(cfs_rq, tg);
+	shares = calc_cfs_shares(cfs_rq, tg, shares_weight);
 
 	reweight_entity(cfs_rq_of(se), se, shares);
 }

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
  2017-05-02  8:19             ` Peter Zijlstra
@ 2017-05-02  8:30               ` Peter Zijlstra
  2017-05-02 20:00                 ` Tejun Heo
  0 siblings, 1 reply; 69+ messages in thread
From: Peter Zijlstra @ 2017-05-02  8:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vincent Guittot, Ingo Molnar, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On Tue, May 02, 2017 at 10:19:05AM +0200, Peter Zijlstra wrote:
> Can you have a play with something like the below? I suspect
> 'shares_runnable' might work for you here.

> ---
>  kernel/sched/fair.c | 62 +++++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 44 insertions(+), 18 deletions(-)

So something like so on top I suppose...

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3101,38 +3101,7 @@ static inline void
 update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
-	long delta, load = gcfs_rq->avg.load_avg;
-
-	/*
-	 * If the load of group cfs_rq is null, the load of the
-	 * sched_entity will also be null so we can skip the formula
-	 */
-	if (load) {
-		long tg_load;
-
-		/* Get tg's load and ensure tg_load > 0 */
-		tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
-
-		/* Ensure tg_load >= load and updated with current load*/
-		tg_load -= gcfs_rq->tg_load_avg_contrib;
-		tg_load += load;
-
-		/*
-		 * We need to compute a correction term in the case that the
-		 * task group is consuming more CPU than a task of equal
-		 * weight. A task with a weight equals to tg->shares will have
-		 * a load less or equal to scale_load_down(tg->shares).
-		 * Similarly, the sched_entities that represent the task group
-		 * at parent level, can't have a load higher than
-		 * scale_load_down(tg->shares). And the Sum of sched_entities'
-		 * load must be <= scale_load_down(tg->shares).
-		 */
-		if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
-			/* scale gcfs_rq's load into tg's shares*/
-			load *= scale_load_down(gcfs_rq->tg->shares);
-			load /= tg_load;
-		}
-	}
+	long delta, load = calc_cfs_shares(gcfs_rq, gcfs_rq->tg, shares_runnable);
 
 	delta = load - se->avg.load_avg;
 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-05-02  7:18           ` Vincent Guittot
@ 2017-05-02 13:26             ` Vincent Guittot
  2017-05-02 22:37               ` Tejun Heo
  2017-05-02 21:50             ` Tejun Heo
  1 sibling, 1 reply; 69+ messages in thread
From: Vincent Guittot @ 2017-05-02 13:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hi Tejun,

On Tuesday 02 May 2017 at 09:18:53 (+0200), Vincent Guittot wrote:
> On 28 April 2017 at 22:33, Tejun Heo <tj@kernel.org> wrote:
> > Hello, Vincent.
> >
> > On Thu, Apr 27, 2017 at 10:29:10AM +0200, Vincent Guittot wrote:
> >> On 27 April 2017 at 00:52, Tejun Heo <tj@kernel.org> wrote:
> >> > Hello,
> >> >
> >> > On Wed, Apr 26, 2017 at 08:12:09PM +0200, Vincent Guittot wrote:
> >> >> On 24 April 2017 at 22:14, Tejun Heo <tj@kernel.org> wrote:
> >> >> Can the problem be on the load balance side instead ?  and more
> >> >> precisely in the wakeup path ?
> >> >> After looking at the trace, it seems that task placement happens at
> >> >> wake up path and if it fails to select the right idle cpu at wake up,
> >> >> you will have to wait for a load balance which is alreayd too late
> >> >
> >> > Oh, I was tracing most of scheduler activities and the ratios of
> >> > wakeups picking idle CPUs were about the same regardless of cgroup
> >> > membership.  I can confidently say that the latency issue that I'm
> >> > seeing is from load balancer picking the wrong busiest CPU, which is
> >> > not to say that there can be other problems.
> >>
> >> ok. Is there any trace that you can share ? your behavior seems
> >> different of mine
> >
> >

[ snip]

> > You can notice that B's pertask weight is 4.409 which is way higher
> > than A's 2.779, and this is from Q014-asdf's contribution to Q014-/ is
> > twice as high as it should be.  The root queue's runnable avg should
> 
> Are you sure that this is because of blocked load in group A ? it can
> be that Q014-asdf has already have to wait before running and its load
> still increase while runnable but not running .
> IIUC your trace, group A has 2 running tasks and group B only one but
> load_balance selects B because of its sgs->avg_load being higher. But
> this can also happen even if runnable_load_avg of child cfs_rq was
> propagated correctly in group entity because we can have situation
> where a group A has only 1 task with higher load than 2 tasks on
> groupB and even if blocked load is not taken into account, and
> load_balance will select A.
> 
> IMHO, we should better improve load balance selection. I'm going to
> add smarter group selection in load_balance. that's something we
> should have already done but it was difficult without load/util_avg
> propagation. it should be doable now

Could you test the patch for load_balance below?
If the group is not overloaded, which means that the threads get all
the runtime they want, we select the cfs_rq according to the number of
running threads instead.
---
 kernel/sched/fair.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a903276..87e3b77 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7069,7 +7069,8 @@ static unsigned long task_h_load(struct task_struct *p)
 /********** Helpers for find_busiest_group ************************/
 
 enum group_type {
-	group_other = 0,
+	group_idle = 0,
+	group_other,
 	group_imbalanced,
 	group_overloaded,
 };
@@ -7383,6 +7384,9 @@ group_type group_classify(struct sched_group *group,
 	if (sgs->group_no_capacity)
 		return group_overloaded;
 
+	if (!sgs->sum_nr_running)
+		return group_idle;
+
 	if (sg_imbalanced(group))
 		return group_imbalanced;
 
@@ -7476,8 +7480,19 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	if (sgs->group_type < busiest->group_type)
 		return false;
 
-	if (sgs->avg_load <= busiest->avg_load)
+	if (sgs->group_type == group_other) {
+		/*
+		 * The groups are not overloaded so there is enough cpu time
+		 * for all threads. In this case, take the group with the
+		 * highest number of tasks per CPU in order to improve
+		 * scheduling latency.
+		 */
+		if ((sgs->sum_nr_running * busiest->group_weight) <=
+				(busiest->sum_nr_running * sgs->group_weight))
+			return false;
+	} else if (sgs->avg_load <= busiest->avg_load) {
 		return false;
+	}
 
 	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
 		goto asym_packing;
@@ -7969,6 +7984,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		    !check_cpu_capacity(rq, env->sd))
 			continue;
 
+		if (!rq->cfs.h_nr_running)
+			continue;
+
 		/*
 		 * For the load comparisons with the other cpu's, consider
 		 * the weighted_cpuload() scaled with the cpu capacity, so
-- 
2.7.4


> 
> > only contain what's currently active but because we're scaling load
> > avg which includes both active and blocked, we're ending up picking
> > group B over A.
> >

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
  2017-05-02  8:30               ` Peter Zijlstra
@ 2017-05-02 20:00                 ` Tejun Heo
  2017-05-03  9:10                   ` Peter Zijlstra
  0 siblings, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-05-02 20:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Ingo Molnar, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hello, Peter.

Your changes need the following fix patch.  With the fix and
"sched/fair: Always propagate runnable_load_avg" applied, it seems to
work fine.  The propagated number is a bit different but I don't see
noticeable difference in behavior and the new number seems to better
represent what we need.

Thanks.
---
 kernel/sched/fair.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2645,7 +2645,8 @@ enum shares_type {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 # ifdef CONFIG_SMP
 static long
-calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg, enum shares_type)
+calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg,
+		enum shares_type shares_type)
 {
 	long tg_weight, tg_shares, load, shares;
 
@@ -2705,7 +2706,7 @@ calc_cfs_shares(struct cfs_rq *cfs_rq, s
 }
 # else /* CONFIG_SMP */
 static inline long
-calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg, enum shares_type)
+calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg, enum shares_type shares_type)
 {
 	return tg->shares;
 }
@@ -3104,7 +3105,7 @@ static inline void
 update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
-	long delta, load = calc_cfs_shares(gcfs_rq, gcfs_rq->tg, shares_runnable);
+	long delta, load = scale_load_down(calc_cfs_shares(gcfs_rq, gcfs_rq->tg, shares_runnable));
 
 	delta = load - se->avg.load_avg;
 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-05-02  6:56                     ` Vincent Guittot
@ 2017-05-02 20:56                       ` Tejun Heo
  2017-05-03  7:25                         ` Vincent Guittot
  0 siblings, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-05-02 20:56 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hello, Vincent.

On Tue, May 02, 2017 at 08:56:52AM +0200, Vincent Guittot wrote:
> On 28 April 2017 at 18:14, Tejun Heo <tj@kernel.org> wrote:
> > I'll follow up in the other subthread but there really is fundamental
> > difference in how we calculate runnable_avg w/ and w/o cgroups.
> > Indepndent of whether we can improve the load balancer further, it is
> > an existing bug.
> 
> I'd like to weight that a bit.
> The runnable_load_avg works correctly as it is because it reflects
> correctly the load of runnable entity at root domain
> If you start to propagate the runnable_load_avg on the load_avg of the
> group entity, the load will become unstable.
> runnable_load_avg has been added to fix load_balance being unable to
> select the right busiest rq. So the goal is to use more and more
> load_avg not the opposite

I have a hard time understanding what you're trying to say here.

Without cgroup, the load balancer uses the sum of load_avgs of the
running tasks on the queue.  As shown by the debug trace, the load
balancer repeatedly ends up picking the wrong CPU when cgroup is
involved because it ends up including the load_avgs of nested blocked
tasks into runnable_load_avg of root - we lose the distinction between
running and blocked load_avgs when we pass through a nested cfs_rq.

We can further improve the load balancer all we want - for example,
right now we can end up picking a CPU with one really high weight task
over another CPU with two normal weight tasks, which isn't ideal;
however, there is something obviously broken in the existing mechanism
and we want to fix that first, independent of further improvements, and
it won't be a good idea to paper over an existing problem with a
different mechanism either.

> >> I always let time between 2 consecutive run and the 10 consecutive
> >> runs take around 2min to execute
> >>
> >> Then I have run several time these 10 consecutive test and results stay the same
> >
> > Can you please try the patch at the end of this message?  I'm
> > wondering whether you're seeing the errors accumulating from the wrong
> > min().
> 
> I still have the regression with the patch below.
> The regression comes from the use runnable_load_avg to propagate. As
> load_avg becomes null at some point, it break the computation of share
> and the load_avg stay very low

That's surprising given that what the patch does is bringing the
cgroup behavior closer to !cgroup behavior.  It'd be great to be able
to reproduce the problem and trace it.  It looks like the board is
pretty standardized.  Would the following be equivalent to the one you
have?

 http://a.co/f3dD1lm

If so, I can just buy it, get your test image and repro it here and
trace why the regression is happening with the setup.  We might be
hitting something else.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-05-02  7:18           ` Vincent Guittot
  2017-05-02 13:26             ` Vincent Guittot
@ 2017-05-02 21:50             ` Tejun Heo
  2017-05-03  7:34               ` Vincent Guittot
  1 sibling, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-05-02 21:50 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hello,

On Tue, May 02, 2017 at 09:18:53AM +0200, Vincent Guittot wrote:
> >  dbg_odd: odd: dst=28 idle=2 brk=32 lbtgt=0-31 type=2
> >  dbg_odd_dump: A: grp=1,17 w=2 avg=7.247 grp=8.337 sum=8.337 pertask=2.779
> >  dbg_odd_dump: A: gcap=1.150 gutil=1.095 run=3 idle=0 gwt=2 type=2 nocap=1
> >  dbg_odd_dump: A: CPU001: run=1 schb=1
> >  dbg_odd_dump: A: Q001-asdf: w=1.000,l=0.525,u=0.513,r=0.527 run=1 hrun=1 tgs=100.000 tgw=17.266
> >  dbg_odd_dump: A: Q001-asdf:  schbench(153757C):w=1.000,l=0.527,u=0.514
> >  dbg_odd_dump: A: Q001-/: w=5.744,l=2.522,u=0.520,r=3.067 run=1 hrun=1 tgs=1.000 tgw=0.000
> >  dbg_odd_dump: A: Q001-/:  asdf(C):w=5.744,l=3.017,u=0.521
> >  dbg_odd_dump: A: CPU017: run=2 schb=2
> >  dbg_odd_dump: A: Q017-asdf: w=2.000,l=0.989,u=0.966,r=0.988 run=2 hrun=2 tgs=100.000 tgw=17.266
> >  dbg_odd_dump: A: Q017-asdf:  schbench(153737C):w=1.000,l=0.493,u=0.482 schbench(153739):w=1.000,l=0.494,u=0.483
> >  dbg_odd_dump: A: Q017-/: w=10.653,l=7.888,u=0.973,r=5.270 run=1 hrun=2 tgs=1.000 tgw=0.000
> >  dbg_odd_dump: A: Q017-/:  asdf(C):w=10.653,l=5.269,u=0.966
> >  dbg_odd_dump: B: grp=14,30 w=2 avg=7.666 grp=8.819 sum=8.819 pertask=4.409
> >  dbg_odd_dump: B: gcap=1.150 gutil=1.116 run=2 idle=0 gwt=2 type=2 nocap=1
> >  dbg_odd_dump: B: CPU014: run=1 schb=1
> >  dbg_odd_dump: B: Q014-asdf: w=1.000,l=1.004,u=0.970,r=0.492 run=1 hrun=1 tgs=100.000 tgw=17.266
> >  dbg_odd_dump: B: Q014-asdf:  schbench(153760C):w=1.000,l=0.491,u=0.476
> >  dbg_odd_dump: B: Q014-/: w=5.605,l=11.146,u=0.970,r=5.774 run=1 hrun=1 tgs=1.000 tgw=0.000
> >  dbg_odd_dump: B: Q014-/:  asdf(C):w=5.605,l=5.766,u=0.970
> >  dbg_odd_dump: B: CPU030: run=1 schb=1
> >  dbg_odd_dump: B: Q030-asdf: w=1.000,l=0.538,u=0.518,r=0.558 run=1 hrun=1 tgs=100.000 tgw=17.266
> >  dbg_odd_dump: B: Q030-asdf:  schbench(153747C):w=1.000,l=0.537,u=0.516
> >  dbg_odd_dump: B: Q030-/: w=5.758,l=3.186,u=0.541,r=3.044 run=1 hrun=1 tgs=1.000 tgw=0.000
> >  dbg_odd_dump: B: Q030-/:  asdf(C):w=5.758,l=3.092,u=0.516
> >
> > You can notice that B's pertask weight is 4.409 which is way higher
> > than A's 2.779, and this is because Q014-asdf's contribution to Q014-/ is
> > twice as high as it should be.  The root queue's runnable avg should
> 
> Are you sure that this is because of blocked load in group A? It can
> be that Q014-asdf already had to wait before running and its load
> kept increasing while runnable but not running.

This is with propagation enabled, so the only thing contributing to
the root queue's runnable_load_avg is the load being propagated from
Q014-asdf, whose load_avg is twice as high as its runnable load.  The
past history doesn't matter for load balancing, and without cgroup this
blocked load wouldn't have contributed to root's runnable load avg.  I
don't think it can get much clearer.

> IIUC your trace, group A has 2 running tasks and group B only one, but
> load_balance selects B because its sgs->avg_load is higher. But
> this can also happen even if the runnable_load_avg of the child cfs_rq
> was propagated correctly into the group entity, because we can have a
> situation where group A has only 1 task with a higher load than the 2
> tasks on group B; even if blocked load is not taken into account,
> load_balance will select A.

Yes, it can happen with tasks w/ different weights.  That's clearly
not what's happening here.  The load balancer is picking the wrong CPU
far more frequently because the root queue's runnable load avg
incorrectly includes blocked load avgs from nested cfs_rqs.

> IMHO, we should rather improve load balance selection. I'm going to
> add smarter group selection in load_balance. That's something we
> should have already done but it was difficult without load/util_avg
> propagation. It should be doable now.

That's all well and great but let's fix a bug first; otherwise, we'd
be papering over an existing issue with a new mechanism which is a bad
idea for any code base which has to last.

> > We can argue whether overriding a cfs_rq se's load_avg to the scaled
> > runnable_load_avg of the cfs_rq is the right way to go or we should
> > introduce a separate channel to propagate runnable_load_avg; however,
> > it's clear that we need to fix runnable_load_avg propagation one way
> > or another.
> 
> The minimum would be to not break load_avg

Oh yeah, this I can understand.  The proposed change is icky in that
it forces the group se->avg.load_avg to be the runnable_load_avg of the
corresponding group cfs_rq.  We *can* introduce a separate channel,
say, se->group_runnable_load_avg, which is used to propagate
runnable_load_avg; however, the thing is that we don't really use the
group se->avg.load_avg anywhere, so we might as well just override it.

I have a preliminary patch to introduce a separate field but it looks
sad too because we end up calculating the load_avg and
runnable_load_avg to propagate separately without actually using the
former value anywhere.

> > The thing with cfs_rq se's load_avg is that, it isn't really used
> > anywhere else AFAICS, so overriding it to the cfs_rq's
> > runnable_load_avg isn't prettiest but doesn't really change anything.
> 
> load_avg is used for defining the share of each cfs_rq.

Each cfs_rq calculates its load_avg independently of the weight sum.
The queued se's load_avgs don't affect the cfs_rq's load_avg in any
direct way.  The only time the value is used is for propagation during
migration; however, group se's themselves never get migrated, and
during propagation only deltas matter, so the difference between
load_avg and runnable_load_avg isn't gonna matter that much.  In
short, we don't really use the group se's load_avg in any significant way.
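
To illustrate the "only deltas matter" part, a toy sketch (made-up
names, not the kernel's): whichever value we pick as the propagation
channel, the parent only ever sees the change against what the group
se last contributed.

/* Toy sketch, made-up names; shows delta-based propagation only. */
#include <stdio.h>

struct toy_se     { long contrib; };		/* last value propagated */
struct toy_cfs_rq { long runnable_load_avg; };

static void toy_propagate(struct toy_cfs_rq *parent, struct toy_se *se,
			  long child_val)
{
	long delta = child_val - se->contrib;	/* only the change is applied */

	se->contrib += delta;
	parent->runnable_load_avg += delta;
}

int main(void)
{
	struct toy_cfs_rq root = { 0 };
	struct toy_se gse = { 0 };

	toy_propagate(&root, &gse, 700);	/* child queue's value grows */
	toy_propagate(&root, &gse, 300);	/* tasks block, value drops */
	printf("root = %ld\n", root.runnable_load_avg);	/* 300, no double count */
	return 0;
}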

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-05-01 15:56           ` Peter Zijlstra
@ 2017-05-02 22:01             ` Tejun Heo
  0 siblings, 0 replies; 69+ messages in thread
From: Tejun Heo @ 2017-05-02 22:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Ingo Molnar, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hello,

On Mon, May 01, 2017 at 05:56:13PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 28, 2017 at 04:33:47PM -0400, Tejun Heo wrote:
> > I'm attaching the debug patch.  With your change (avg instead of
> > runnable_avg), the following trace shows why it's wrong.
> 
> Ah, OK. So you really want runnable_avg (and I understand why), which is
> rather unfortunate, since we have everything on load_avg.
> 
> So for shares, load_avg gives a more stable number. This is important
> since tg->load_avg is a global number, so computing it is expensive (and
> slow). Therefore more stable numbers are good.
> 
> > The thing with cfs_rq se's load_avg is that, it isn't really used
> > anywhere else AFAICS, so overriding it to the cfs_rq's
> > runnable_load_avg isn't prettiest but doesn't really change anything.
> 
> You mean, consistently expressing a group's se->load.weight in terms of
> runnable_load_avg? See the above.

I think you got this on the other thread but for clarity:

cfs_rq->avg.load_avg is used for share calculation and we want to keep
it that way as the calculation is expensive and rather decoupled
across CPUs (the deviation can be quite a bit without the stability).
But the group *se*->avg.load_avg is a separate thing which isn't
really used anywhere except as a propagation channel from a group
cfs_rq to its parent cfs_rq.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-05-02 13:26             ` Vincent Guittot
@ 2017-05-02 22:37               ` Tejun Heo
  0 siblings, 0 replies; 69+ messages in thread
From: Tejun Heo @ 2017-05-02 22:37 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hello, Vincent.

On Tue, May 02, 2017 at 03:26:12PM +0200, Vincent Guittot wrote:
> > IMHO, we should rather improve load balance selection. I'm going to
> > add smarter group selection in load_balance. That's something we
> > should have already done but it was difficult without load/util_avg
> > propagation. It should be doable now.
> 
> Could you test the patch in load_balance below?
> If the group is not overloaded, which means that threads get all the runtime
> they want, we select the cfs_rq according to the number of running threads instead.

So, this didn't help.  I also tried w/ return true added in the else
clause but that didn't help either.

Anyways, once debugged, the idea would work for this particular test
case, and in general we should avoid picking a CPU as the busiest if it
doesn't have extra threads to give away; however, this isn't the
proper fix for the identified problem, and basing load balancing solely
on the number of tasks is far more likely to be harmful than the other
way around.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-05-02 20:56                       ` Tejun Heo
@ 2017-05-03  7:25                         ` Vincent Guittot
  2017-05-03  7:54                           ` Vincent Guittot
  0 siblings, 1 reply; 69+ messages in thread
From: Vincent Guittot @ 2017-05-03  7:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On 2 May 2017 at 22:56, Tejun Heo <tj@kernel.org> wrote:
> Hello, Vincent.
>
> On Tue, May 02, 2017 at 08:56:52AM +0200, Vincent Guittot wrote:
>> On 28 April 2017 at 18:14, Tejun Heo <tj@kernel.org> wrote:
>> > I'll follow up in the other subthread but there really is fundamental
>> > difference in how we calculate runnable_avg w/ and w/o cgroups.
>> > Independent of whether we can improve the load balancer further, it is
>> > an existing bug.
>>
>> I'd like to weigh in on that a bit.
>> The runnable_load_avg works correctly as it is because it correctly
>> reflects the load of the runnable entities at the root domain.
>> If you start to propagate the runnable_load_avg onto the load_avg of
>> the group entity, the load will become unstable.
>> runnable_load_avg has been added to fix load_balance being unable to
>> select the right busiest rq. So the goal is to use load_avg more and
>> more, not the opposite.
>
> I have a hard time understanding what you're trying to say here.
>
> Without cgroup, the load balancer uses the sum of load_avgs of the
> running tasks on the queue.  As shown by the debug trace, the load
> balancer repeatedly ends up picking the wrong CPU when cgroup is
> involved because it ends up including the load_avgs of nested blocked
> tasks into runnable_load_avg of root - we lose the distinction between
> running and blocked load_avgs when we pass through a nested cfs_rq.
>
> We can further improve the load balancer all we want, for example,
> right now, we would end up picking a CPU with one task which has a
> really high weight over another CPU with two normal weight tasks even,
> which isn't ideal; however, there is something obviously broken in the
> existing mechanism and we want to fix that first independent of
> further improvements, and it won't be a good idea to paper over an
> existing problem with a different mechanism either.

That's probably where I disagree.
IMO, runnable_load_avg is already a fix to compensate for the fact
that when PELT was rewritten, the load balancer was not updated
accordingly and was unable to make the right choice
between 2 groups: one group having only 1 task but a higher load and
another group with several tasks but a lower load (either because of
the blocked load or because of a runnable task with a far higher load
than the others). runnable_load_avg has been put back to quickly fix
the 1st case but it doesn't fix the other. There were also some issues
with propagating load updates on task migration but those have been
fixed now.
So we have 2 use cases with the exact same behavior: a group with a
higher load (runnable_load_avg or load_avg) but only 1 task, and
another one with several tasks but a lower load. Right now
runnable_load_avg can only fix the former, at the cost of breaking
load_avg behavior, whereas one fix in load_balance can fix both
without breaking load_avg behavior.
We should rather work on removing runnable_load_avg completely than
add more stuff on top of it.

>
>> >> I always leave time between 2 consecutive runs, and the 10 consecutive
>> >> runs take around 2min to execute.
>> >>
>> >> Then I have run these 10 consecutive tests several times and the results stay the same.
>> >
>> > Can you please try the patch at the end of this message?  I'm
>> > wondering whether you're seeing the errors accumulating from the wrong
>> > min().
>>
>> I still have the regression with the patch below.
>> The regression comes from the use of runnable_load_avg to propagate. As
>> load_avg becomes null at some point, it breaks the computation of shares
>> and the load_avg stays very low.
>
> That's surprising given that what the patch does is bringing the
> cgroup behavior closer to !cgroup behavior.  It'd be great to be able

But it also breaks load_avg, which is used to calculate the per-cfs_rq
share, and this change generates the regression on my board.

> to reproduce the problem and trace it.  It looks like the board is
> pretty standardized.  Would the following be equivalent to the one you
> have?
>
>  http://a.co/f3dD1lm

Yes, it's this board. I have the previous version, but it should not
change the behavior.
>
> If so, I can just buy it, get your test image and repro it here and
> trace why the regression is happening with the setup.  We might be
> hitting something else.

I have pushed my branch which includes v4.11-rc8 + your patches + some
debug trace here:
https://git.linaro.org/people/vincent.guittot/kernel.git/log/?h=test-tejun-patches

and uploaded a trace_cmd's trace that shows the regression
http://people.linaro.org/~vincent.guittot/trace-tejun-patches

As an example, we can easily see around time stamp 94.826469 sec that
cpu5 is idle while CPU2 and CPU3 share their time between several
tasks.
This doesn't happen without your patches.

>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-05-02 21:50             ` Tejun Heo
@ 2017-05-03  7:34               ` Vincent Guittot
  2017-05-03  9:37                 ` Peter Zijlstra
  0 siblings, 1 reply; 69+ messages in thread
From: Vincent Guittot @ 2017-05-03  7:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hi Tejun,

On 2 May 2017 at 23:50, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Tue, May 02, 2017 at 09:18:53AM +0200, Vincent Guittot wrote:
>> >  dbg_odd: odd: dst=28 idle=2 brk=32 lbtgt=0-31 type=2
>> >  dbg_odd_dump: A: grp=1,17 w=2 avg=7.247 grp=8.337 sum=8.337 pertask=2.779
>> >  dbg_odd_dump: A: gcap=1.150 gutil=1.095 run=3 idle=0 gwt=2 type=2 nocap=1
>> >  dbg_odd_dump: A: CPU001: run=1 schb=1
>> >  dbg_odd_dump: A: Q001-asdf: w=1.000,l=0.525,u=0.513,r=0.527 run=1 hrun=1 tgs=100.000 tgw=17.266
>> >  dbg_odd_dump: A: Q001-asdf:  schbench(153757C):w=1.000,l=0.527,u=0.514
>> >  dbg_odd_dump: A: Q001-/: w=5.744,l=2.522,u=0.520,r=3.067 run=1 hrun=1 tgs=1.000 tgw=0.000
>> >  dbg_odd_dump: A: Q001-/:  asdf(C):w=5.744,l=3.017,u=0.521
>> >  dbg_odd_dump: A: CPU017: run=2 schb=2
>> >  dbg_odd_dump: A: Q017-asdf: w=2.000,l=0.989,u=0.966,r=0.988 run=2 hrun=2 tgs=100.000 tgw=17.266
>> >  dbg_odd_dump: A: Q017-asdf:  schbench(153737C):w=1.000,l=0.493,u=0.482 schbench(153739):w=1.000,l=0.494,u=0.483
>> >  dbg_odd_dump: A: Q017-/: w=10.653,l=7.888,u=0.973,r=5.270 run=1 hrun=2 tgs=1.000 tgw=0.000
>> >  dbg_odd_dump: A: Q017-/:  asdf(C):w=10.653,l=5.269,u=0.966
>> >  dbg_odd_dump: B: grp=14,30 w=2 avg=7.666 grp=8.819 sum=8.819 pertask=4.409
>> >  dbg_odd_dump: B: gcap=1.150 gutil=1.116 run=2 idle=0 gwt=2 type=2 nocap=1
>> >  dbg_odd_dump: B: CPU014: run=1 schb=1
>> >  dbg_odd_dump: B: Q014-asdf: w=1.000,l=1.004,u=0.970,r=0.492 run=1 hrun=1 tgs=100.000 tgw=17.266
>> >  dbg_odd_dump: B: Q014-asdf:  schbench(153760C):w=1.000,l=0.491,u=0.476
>> >  dbg_odd_dump: B: Q014-/: w=5.605,l=11.146,u=0.970,r=5.774 run=1 hrun=1 tgs=1.000 tgw=0.000
>> >  dbg_odd_dump: B: Q014-/:  asdf(C):w=5.605,l=5.766,u=0.970
>> >  dbg_odd_dump: B: CPU030: run=1 schb=1
>> >  dbg_odd_dump: B: Q030-asdf: w=1.000,l=0.538,u=0.518,r=0.558 run=1 hrun=1 tgs=100.000 tgw=17.266
>> >  dbg_odd_dump: B: Q030-asdf:  schbench(153747C):w=1.000,l=0.537,u=0.516
>> >  dbg_odd_dump: B: Q030-/: w=5.758,l=3.186,u=0.541,r=3.044 run=1 hrun=1 tgs=1.000 tgw=0.000
>> >  dbg_odd_dump: B: Q030-/:  asdf(C):w=5.758,l=3.092,u=0.516
>> >
>> > You can notice that B's pertask weight is 4.409 which is way higher
>> > than A's 2.779, and this is because Q014-asdf's contribution to Q014-/ is
>> > twice as high as it should be.  The root queue's runnable avg should
>>
>> Are you sure that this is because of blocked load in group A? It can
>> be that Q014-asdf already had to wait before running and its load
>> kept increasing while runnable but not running.
>
> This is with propagation enabled, so the only thing contributing to
> the root queue's runnable_load_avg is the load being propagated from
> Q014-asdf, whose load_avg is twice as high as its runnable load.  The
> past history doesn't matter for load balancing, and without cgroup this
> blocked load wouldn't have contributed to root's runnable load avg.  I
> don't think it can get much clearer.
>
>> IIUC your trace, group A has 2 running tasks and group B only one, but
>> load_balance selects B because its sgs->avg_load is higher. But
>> this can also happen even if the runnable_load_avg of the child cfs_rq
>> was propagated correctly into the group entity, because we can have a
>> situation where group A has only 1 task with a higher load than the 2
>> tasks on group B; even if blocked load is not taken into account,
>> load_balance will select A.
>
> Yes, it can happen with tasks w/ different weights.  That's clearly
> not what's happening here.  The load balancer is picking the wrong CPU
> far more frequently because the root queue's runnable load avg
> incorrectly includes blocked load avgs from nested cfs_rqs.
>
>> IMHO, we should rather improve load balance selection. I'm going to
>> add smarter group selection in load_balance. That's something we
>> should have already done but it was difficult without load/util_avg
>> propagation. It should be doable now.
>
> That's all well and great but let's fix a bug first; otherwise, we'd
> be papering over an existing issue with a new mechanism which is a bad
> idea for any code base which has to last.

runnable_load_avg is already a kind of fix and breaking load_avg seems
worse than fixing load_balance

>
>> > We can argue whether overriding a cfs_rq se's load_avg to the scaled
>> > runnable_load_avg of the cfs_rq is the right way to go or we should
>> > introduce a separate channel to propagate runnable_load_avg; however,
>> > it's clear that we need to fix runnable_load_avg propagation one way
>> > or another.
>>
>> The minimum would be to not break load_avg
>
> Oh yeah, this I can understand.  The proposed change is icky in that
> it forces the group se->avg.load_avg to be the runnable_load_avg of the
> corresponding group cfs_rq.  We *can* introduce a separate channel,
> say, se->group_runnable_load_avg, which is used to propagate
> runnable_load_avg; however, the thing is that we don't really use the
> group se->avg.load_avg anywhere, so we might as well just override it.

We use load_avg for calculating a stable share and we want to use it
more and more.
So breaking it because it's easier doesn't seem to be the right way to go, IMHO.

>
> I have a preliminary patch to introduce a separate field but it looks
> sad too because we end up calculating the load_avg and
> runnable_load_avg to propagate separately without actually using the
> former value anywhere.
>
>> > The thing with cfs_rq se's load_avg is that, it isn't really used
>> > anywhere else AFAICS, so overriding it to the cfs_rq's
>> > runnable_load_avg isn't prettiest but doesn't really change anything.
>>
>> load_avg is used for defining the share of each cfs_rq.
>
> Each cfs_rq calculates its load_avg independently of the weight sum.
> The queued se's load_avgs don't affect the cfs_rq's load_avg in any
> direct way.  The only time the value is used is for propagation during
> migration; however, group se's themselves never get migrated, and
> during propagation only deltas matter, so the difference between
> load_avg and runnable_load_avg isn't gonna matter that much.  In
> short, we don't really use the group se's load_avg in any significant way.
>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-05-03  7:25                         ` Vincent Guittot
@ 2017-05-03  7:54                           ` Vincent Guittot
  0 siblings, 0 replies; 69+ messages in thread
From: Vincent Guittot @ 2017-05-03  7:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On 3 May 2017 at 09:25, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> On 2 May 2017 at 22:56, Tejun Heo <tj@kernel.org> wrote:
>> Hello, Vincent.
>>
>> On Tue, May 02, 2017 at 08:56:52AM +0200, Vincent Guittot wrote:
>>> On 28 April 2017 at 18:14, Tejun Heo <tj@kernel.org> wrote:
>>> > I'll follow up in the other subthread but there really is fundamental
>>> > difference in how we calculate runnable_avg w/ and w/o cgroups.
>>> > Independent of whether we can improve the load balancer further, it is
>>> > an existing bug.
>>>
>>> I'd like to weigh in on that a bit.
>>> The runnable_load_avg works correctly as it is because it correctly
>>> reflects the load of the runnable entities at the root domain.
>>> If you start to propagate the runnable_load_avg onto the load_avg of
>>> the group entity, the load will become unstable.
>>> runnable_load_avg has been added to fix load_balance being unable to
>>> select the right busiest rq. So the goal is to use load_avg more and
>>> more, not the opposite.
>>
>> I have a hard time understanding what you're trying to say here.
>>
>> Without cgroup, the load balancer uses the sum of load_avgs of the
>> running tasks on the queue.  As shown by the debug trace, the load
>> balancer repeatedly ends up picking the wrong CPU when cgroup is
>> involved because it ends up including the load_avgs of nested blocked
>> tasks into runnable_load_avg of root - we lose the distinction between
>> running and blocked load_avgs when we pass through a nested cfs_rq.
>>
>> We can further improve the load balancer all we want, for example,
>> right now, we would end up picking a CPU with one task which has a
>> really high weight over another CPU with two normal weight tasks even,
>> which isn't ideal; however, there is something obviously broken in the
>> existing mechanism and we want to fix that first independent of
>> further improvements, and it won't be a good idea to paper over an
>> existing problem with a different mechanism either.
>
> That's probably where I disagree.
> IMO, runnable_load_avg is already a fix to compensate for the fact
> that when PELT was rewritten, the load balancer was not updated
> accordingly and was unable to make the right choice
> between 2 groups: one group having only 1 task but a higher load and
> another group with several tasks but a lower load (either because of
> the blocked load or because of a runnable task with a far higher load
> than the others). runnable_load_avg has been put back to quickly fix
> the 1st case but it doesn't fix the other. There were also some issues
> with propagating load updates on task migration but those have been
> fixed now.
> So we have 2 use cases with the exact same behavior: a group with a
> higher load (runnable_load_avg or load_avg) but only 1 task, and
> another one with several tasks but a lower load. Right now
> runnable_load_avg can only fix the former, at the cost of breaking
> load_avg behavior, whereas one fix in load_balance can fix both
> without breaking load_avg behavior.
> We should rather work on removing runnable_load_avg completely than
> add more stuff on top of it.
>
>>
>>> >> I always leave time between 2 consecutive runs, and the 10 consecutive
>>> >> runs take around 2min to execute.
>>> >>
>>> >> Then I have run these 10 consecutive tests several times and the results stay the same.
>>> >
>>> > Can you please try the patch at the end of this message?  I'm
>>> > wondering whether you're seeing the errors accumulating from the wrong
>>> > min().
>>>
>>> I still have the regression with the patch below.
>>> The regression comes from the use of runnable_load_avg to propagate. As
>>> load_avg becomes null at some point, it breaks the computation of shares
>>> and the load_avg stays very low.
>>
>> That's surprising given that what the patch does is bringing the
>> cgroup behavior closer to !cgroup behavior.  It'd be great to be able
>
> But it also breaks load_avg, which is used to calculate the per-cfs_rq
> share, and this change generates the regression on my board.
>
>> to reproduce the problem and trace it.  It looks like the board is
>> pretty standardized.  Would the following be equivalent to the one you
>> have?
>>
>>  http://a.co/f3dD1lm
>
> Yes, it's this board. I have the previous version, but it should not
> change the behavior.
>>
>> If so, I can just buy it, get your test image and repro it here and
>> trace why the regression is happening with the setup.  We might be
>> hitting something else.
>
> I have pushed my branch which includes v4.11-rc8 + your patches + some
> debug trace here:
> https://git.linaro.org/people/vincent.guittot/kernel.git/log/?h=test-tejun-patches
>
> and uploaded a trace_cmd's trace that shows the regression
> http://people.linaro.org/~vincent.guittot/trace-tejun-patches

I have also uploaded the trace without your patches so you can compare both:
http://people.linaro.org/~vincent.guittot/trace-v4.8-rc11

>
> As an example, we can easily see around time stamp 94.826469 sec that
> cpu5 is idle while CPU2 and CPU3 share their time between several
> tasks.
> This doesn't happen without your patches.
>
>>
>> Thanks.
>>
>> --
>> tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
  2017-05-02 20:00                 ` Tejun Heo
@ 2017-05-03  9:10                   ` Peter Zijlstra
  0 siblings, 0 replies; 69+ messages in thread
From: Peter Zijlstra @ 2017-05-03  9:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vincent Guittot, Ingo Molnar, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On Tue, May 02, 2017 at 04:00:28PM -0400, Tejun Heo wrote:
> Hello, Peter.
> 
> Your changes need the following fix patch.

D'0h, so much for not running my patches through a compiler ;-)

> With the fix and
> "sched/fair: Always propagate runnable_load_avg" applied, it seems to
> work fine.  The propagated number is a bit different but I don't see
> noticeable difference in behavior and the new number seems to better
> represent what we need.

Right. Let me stare at that always propagate thing a bit. My instant
worry is overhead, but yeah, we need something to make this work.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-05-03  7:34               ` Vincent Guittot
@ 2017-05-03  9:37                 ` Peter Zijlstra
  2017-05-03 10:37                   ` Vincent Guittot
  0 siblings, 1 reply; 69+ messages in thread
From: Peter Zijlstra @ 2017-05-03  9:37 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Tejun Heo, Ingo Molnar, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On Wed, May 03, 2017 at 09:34:51AM +0200, Vincent Guittot wrote:

> We use load_avg for calculating a stable share and we want to use it
> more and more.  So breaking it because it's easier doesn't seem to be
> the right way to go, IMHO.

So afaict we calculate group se->load.weight (aka shares, see
calc_cfs_shares), using cfs_rq->avg.load_avg, which feeds into
tg->load_avg through cfs_rq->tg_load_avg_contrib and
cfs_rq->load.weight.

And cfs_rq->avg.load_avg is computed from cfs_rq->load.weight, which
is \Sum se->load.weight.
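
In rough sketch form (toy_* names made up; the clamping to
[MIN_SHARES, tg->shares] and the load.weight upper bound in
calc_cfs_shares() are left out), that chain boils down to something
like:

/* Toy sketch of the shares computation described above. */
#include <stdio.h>

static long toy_calc_shares(long tg_shares, long tg_load_avg,
			    long my_contrib, long my_load)
{
	/* swap this cfs_rq's stale contribution for its current load */
	long tg_weight = tg_load_avg - my_contrib + my_load;

	if (tg_weight <= 0)
		return tg_shares;

	return tg_shares * my_load / tg_weight;
}

int main(void)
{
	/* e.g. tg->shares = 1024, tg->load_avg = 3000, this cfs_rq
	 * contributed 1000 last time and currently has load 1500 */
	printf("shares = %ld\n", toy_calc_shares(1024, 3000, 1000, 1500)); /* 438 */
	return 0;
}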

OTOH group se->avg.load_avg isn't used much, which is TJ's point.

The only cases where group se->avg.load_avg are relevant to
cfs_rq->avg.load are the cases I listed yesterday, group creation and
group destruction. There we use the group se->avg.load_avg to manage the
boundary conditions.

So with the proposed change to se->avg.load_avg we can have some
(temporary) boundary effect when you destroy a lot of (previously
active) cgroups.


Of course, it could be I overlooked something, in which case, please
tell :-)


That said, I agree it would be nice to entirely get rid of runnable_avg,
but that is a much larger change and would require a lot more work. I
don't immediately see why we can't fix the thing now and then work on
removing runnable_load_avg later.

Of course, we should not regress either, I'll go read up on that part.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-05-03  9:37                 ` Peter Zijlstra
@ 2017-05-03 10:37                   ` Vincent Guittot
  2017-05-03 13:09                     ` Peter Zijlstra
  0 siblings, 1 reply; 69+ messages in thread
From: Vincent Guittot @ 2017-05-03 10:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Ingo Molnar, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On 3 May 2017 at 11:37, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, May 03, 2017 at 09:34:51AM +0200, Vincent Guittot wrote:
>
> > We use load_avg for calculating a stable share and we want to use it
> > more and more.  So breaking it because it's easier doesn't seem to be
> > the right way to go, IMHO.
>
> So afaict we calculate group se->load.weight (aka shares, see
> calc_cfs_shares), using cfs_rq->avg.load_avg, which feeds into
> tg->load_avg through cfs_rq->tg_load_avg_contrib and
> cfs_rq->load.weight.
>
> And cfs_rq->avg.load_avg is computed from cfs_rq->load.weight, which
> is \Sum se->load.weight.
>
> OTOH group se->avg.load_avg isn't used much, which is TJ's point.
>
> The only cases where group se->avg.load_avg are relevant to
> cfs_rq->avg.load are the cases I listed yesterday, group creation and
> group destruction. There we use the group se->avg.load_avg to manage the
> boundary conditions.
>
> So with the proposed change to se->avg.load_avg we can have some
> (temporary) boundary effect when you destroy a lot of (previously
> active) cgroups.
>
>
> Of course, it could be I overlooked something, in which case, please
> tell :-)

That's mainly based on the regression I see on my platform. I haven't
found the root cause of the regression, but it's there, which means that
using the group_entity's load_avg to propagate the child cfs_rq's
runnable_load_avg breaks something.

>
>
> That said, I agree it would be nice to entirely get rid of runnable_avg,
> but that is a much larger change and would require a lot more work. I
> don't immediately see why we can't fix the thing now and then work on
> removing runnable_load_avg later.

What Tejun proposes is to break the group's load_avg and make it
follow the child cfs_rq's runnable_load_avg instead of the child
cfs_rq's load_avg, so it will be difficult if not impossible to later
move load_balance onto load_avg and remove runnable_load_avg if
load_avg no longer works as expected. So keeping the group's load_avg
working correctly seems like a key point.
Then, we know that we still have wrong behavior with runnable_load_avg
when running tasks' loads are really different, so it only fixes part
of a wider problem IMO.

>
> Of course, we should not regress either, I'll go read up on that part.
>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-05-03 10:37                   ` Vincent Guittot
@ 2017-05-03 13:09                     ` Peter Zijlstra
  2017-05-03 21:49                       ` Tejun Heo
  0 siblings, 1 reply; 69+ messages in thread
From: Peter Zijlstra @ 2017-05-03 13:09 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Tejun Heo, Ingo Molnar, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On Wed, May 03, 2017 at 12:37:37PM +0200, Vincent Guittot wrote:
> On 3 May 2017 at 11:37, Peter Zijlstra <peterz@infradead.org> wrote:

> > Of course, it could be I overlooked something, in which case, please
> > tell :-)
> 
> That's mainly based on the regression I see on my platform. I haven't
> found the root cause of the regression, but it's there, which means that
> using the group_entity's load_avg to propagate the child cfs_rq's
> runnable_load_avg breaks something.

(as mentioned on IRC)

Right.. so looking through the code, (group) se->avg.load_avg is used in
effective_load() (and thereby wake_affine()) and update_cfs_rq_h_load()
(and therefore task_h_load()).

So changing it will affect those two functions, which could well lead to
your regression.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
  2017-04-24 21:33   ` [PATCH v2 " Tejun Heo
@ 2017-05-03 18:00     ` Peter Zijlstra
  2017-05-03 21:45       ` Tejun Heo
  2017-05-04 10:26       ` Vincent Guittot
  0 siblings, 2 replies; 69+ messages in thread
From: Peter Zijlstra @ 2017-05-03 18:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Vincent Guittot,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team



This is on my IVB-EP, 2 sockets, 10 cores / socket, 2 threads / core.

workload is constrained to 1 socket.

root@ivb-ep:~/bench/schbench# numactl -N 0 ./schbench -m 2 -t 10 -s 10000 -c 15000 -r 30
Latency percentiles (usec)
        50.0000th: 21
        75.0000th: 30
        90.0000th: 38
        95.0000th: 42
        *99.0000th: 49
        99.5000th: 53
        99.9000th: 15024
        min=0, max=15056


root@ivb-ep:~/bench/schbench# echo NO_FUDGE > /debug/sched_features; echo NO_FUDGE2 > /debug/sched_features ; mkdir /cgroup/ponies; echo $$ > /cgroup/ponies/tasks ; numactl -N 0 ./schbench -m 2 -t 10 -s 10000 -c 15000 -r 30 ; echo $$ > /cgroup/tasks ; rmdir /cgroup/ponies
Latency percentiles (usec)
        50.0000th: 14
        75.0000th: 19
        90.0000th: 24
        95.0000th: 28
        *99.0000th: 57
        99.5000th: 15024
        99.9000th: 15024
        min=0, max=15060


root@ivb-ep:~/bench/schbench# echo NO_FUDGE > /debug/sched_features; echo FUDGE2 > /debug/sched_features ; mkdir /cgroup/ponies; echo $$ > /cgroup/ponies/tasks ; numactl -N 0 ./schbench -m 2 -t 10 -s 10000 -c 15000 -r 30 ; echo $$ > /cgroup/tasks ; rmdir /cgroup/ponies
Latency percentiles (usec)
        50.0000th: 14
        75.0000th: 19
        90.0000th: 24
        95.0000th: 26
        *99.0000th: 38
        99.5000th: 49
        99.9000th: 9648
        min=0, max=15035


root@ivb-ep:~/bench/schbench# echo FUDGE > /debug/sched_features; echo FUDGE2 > /debug/sched_features ; mkdir /cgroup/ponies; echo $$ > /cgroup/ponies/tasks ; numactl -N 0 ./schbench -m 2 -t 10 -s 10000 -c 15000 -r 30 ; echo $$ > /cgroup/tasks ; rmdir /cgroup/ponies
Latency percentiles (usec)
        50.0000th: 14
        75.0000th: 19
        90.0000th: 24
        95.0000th: 27
        *99.0000th: 3060
        99.5000th: 7848
        99.9000th: 15024
        min=0, max=15041


root@ivb-ep:~/bench/schbench# echo 0 > /sys/module/fair/parameters/prop_type
root@ivb-ep:~/bench/schbench# echo FUDGE > /debug/sched_features; echo FUDGE2 > /debug/sched_features ; mkdir /cgroup/ponies; echo $$ > /cgroup/ponies/tasks ; numactl -N 0 ./schbench -m 2 -t 10 -s 10000 -c 15000 -r 30 ; echo $$ > /cgroup/tasks ; rmdir /cgroup/ponies
Latency percentiles (usec)
        50.0000th: 14
        75.0000th: 19
        90.0000th: 24
        95.0000th: 27
        *99.0000th: 52
        99.5000th: 4712
        99.9000th: 14640
        min=0, max=15033


Just FUDGE2 on its own seems to be the best on my system and is a change
that makes sense (and something Paul recently pointed out as well).

The implementation isn't particularly pretty or fast, but should
illustrate the idea.

Poking at the whole update_tg_cfs_load() thing only makes it worse after
that. And while I agree that that code is mind bending; it seems to work
OK-ish.

Tejun, Vincent, could you guys have a poke?

The thing is that if we assume se->avg.load_avg is correct, we should
already compute a correct cfs_rq->runnable_load_avg; we do all that
propagation right.

But because se->avg.load_avg is stuck in the 'past', since its sum is
based entirely on its old weight, things don't quite work out. If we
OTOH treat it as a runnable_sum and scale with weight, it seems to work
out fine.

Arguably sys_nice and all related crud should do the same, but nobody
really uses nice at any frequency, whereas we constantly change the
weight of our group entities.
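
As a made-up numeric example of the rescale the FUDGE2 hunk below does
in reweight_entity():

/* Made-up numbers; mirrors the new_weight/old_weight rescale below. */
#include <stdio.h>

int main(void)
{
	unsigned long old_weight = 2048, new_weight = 1024;
	unsigned long load_avg = 1800;	/* sum accrued while at old_weight */

	/* same order as the patch: multiply first, then divide */
	load_avg = load_avg * new_weight / old_weight;

	/* 900: the halved weight takes effect immediately instead of
	 * waiting for the PELT sum to decay */
	printf("rescaled load_avg = %lu\n", load_avg);
	return 0;
}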

---
 kernel/sched/fair.c     | 184 +++++++++++++++++++++++++++++++-----------------
 kernel/sched/features.h |   2 +
 2 files changed, 122 insertions(+), 64 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cd6c3f9..d6a33e6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -32,6 +32,7 @@
 #include <linux/mempolicy.h>
 #include <linux/migrate.h>
 #include <linux/task_work.h>
+#include <linux/moduleparam.h>
 
 #include <trace/events/sched.h>
 
@@ -2632,16 +2633,39 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 # ifdef CONFIG_SMP
-static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
+
+enum shares_type {
+	shares_runnable,
+	shares_avg,
+	shares_weight,
+};
+
+static long calc_cfs_shares(struct cfs_rq *cfs_rq, enum shares_type shares_type)
 {
-	long tg_weight, load, shares;
+	long tg_weight, tg_shares, load, shares;
+	struct task_group *tg = cfs_rq->tg;
 
-	/*
-	 * This really should be: cfs_rq->avg.load_avg, but instead we use
-	 * cfs_rq->load.weight, which is its upper bound. This helps ramp up
-	 * the shares for small weight interactive tasks.
-	 */
-	load = scale_load_down(cfs_rq->load.weight);
+	tg_shares = READ_ONCE(tg->shares);
+
+	switch (shares_type) {
+	case shares_runnable:
+		load = cfs_rq->runnable_load_avg;
+		break;
+
+	default:
+	case shares_avg:
+		load = cfs_rq->avg.load_avg;
+		break;
+
+	case shares_weight:
+		/*
+		 * This really should be: cfs_rq->avg.load_avg, but instead we
+		 * use cfs_rq->load.weight, which is its upper bound. This
+		 * helps ramp up the shares for small weight interactive tasks.
+		 */
+		load = scale_load_down(cfs_rq->load.weight);
+		break;
+	}
 
 	tg_weight = atomic_long_read(&tg->load_avg);
 
@@ -2665,23 +2689,33 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
 	 * case no task is runnable on a CPU MIN_SHARES=2 should be returned
 	 * instead of 0.
 	 */
-	if (shares < MIN_SHARES)
-		shares = MIN_SHARES;
-	if (shares > tg->shares)
-		shares = tg->shares;
-
-	return shares;
-}
-# else /* CONFIG_SMP */
-static inline long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
-{
-	return tg->shares;
+	return clamp_t(long, shares, MIN_SHARES, tg_shares);
 }
 # endif /* CONFIG_SMP */
 
+/*
+ * Unsigned subtract and clamp on underflow.
+ *
+ * Explicitly do a load-store to ensure the intermediate value never hits
+ * memory. This allows lockless observations without ever seeing the negative
+ * values.
+ */
+#define sub_positive(_ptr, _val) do {				\
+	typeof(_ptr) ptr = (_ptr);				\
+	typeof(*ptr) val = (_val);				\
+	typeof(*ptr) res, var = READ_ONCE(*ptr);		\
+	res = var - val;					\
+	if (res > var)						\
+		res = 0;					\
+	WRITE_ONCE(*ptr, res);					\
+} while (0)
+
 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 			    unsigned long weight)
 {
+	if (se->load.weight == weight)
+		return;
+
 	if (se->on_rq) {
 		/* commit outstanding execution time */
 		if (cfs_rq->curr == se)
@@ -2689,10 +2723,40 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 		account_entity_dequeue(cfs_rq, se);
 	}
 
+	if (sched_feat(FUDGE2)) {
+		unsigned long new_weight = max(scale_load_down(weight), 1UL);
+		unsigned long old_weight = max(scale_load_down(se->load.weight), 1UL);
+
+		sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
+		sub_positive(&cfs_rq->avg.load_sum, se->avg.load_sum);
+
+		if (se->on_rq) {
+			sub_positive(&cfs_rq->runnable_load_avg, se->avg.load_avg);
+			sub_positive(&cfs_rq->runnable_load_sum, se->avg.load_sum);
+		}
+
+		se->avg.load_avg *= new_weight;
+		se->avg.load_sum *= new_weight;
+
+		se->avg.load_avg /= old_weight;
+		se->avg.load_sum /= old_weight;
+	}
+
 	update_load_set(&se->load, weight);
 
-	if (se->on_rq)
+	if (se->on_rq) {
 		account_entity_enqueue(cfs_rq, se);
+	}
+
+	if (sched_feat(FUDGE2)) {
+		cfs_rq->avg.load_avg += se->avg.load_avg;
+		cfs_rq->avg.load_sum += se->avg.load_sum;
+
+		if (se->on_rq) {
+			cfs_rq->runnable_load_avg += se->avg.load_avg;
+			cfs_rq->runnable_load_sum += se->avg.load_sum;
+		}
+	}
 }
 
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
@@ -2700,7 +2764,6 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 static void update_cfs_shares(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = group_cfs_rq(se);
-	struct task_group *tg;
 	long shares;
 
 	if (!cfs_rq)
@@ -2709,13 +2772,14 @@ static void update_cfs_shares(struct sched_entity *se)
 	if (throttled_hierarchy(cfs_rq))
 		return;
 
-	tg = cfs_rq->tg;
-
 #ifndef CONFIG_SMP
-	if (likely(se->load.weight == tg->shares))
+	shares = READ_ONCE(cfs_rq->tg->shares);
+
+	if (likely(se->load.weight == shares))
 		return;
+#else
+	shares = calc_cfs_shares(cfs_rq, shares_weight);
 #endif
-	shares = calc_cfs_shares(cfs_rq, tg);
 
 	reweight_entity(cfs_rq_of(se), se, shares);
 }
@@ -3070,42 +3134,51 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * LOAD_AVG_MAX;
 }
 
+static int prop_type = shares_avg;
+
+module_param(prop_type, int, 0644);
+
 /* Take into account change of load of a child task group */
 static inline void
 update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
-	long delta, load = gcfs_rq->avg.load_avg;
+	long delta, load;
 
 	/*
 	 * If the load of group cfs_rq is null, the load of the
 	 * sched_entity will also be null so we can skip the formula
 	 */
-	if (load) {
-		long tg_load;
+	if (!sched_feat(FUDGE)) {
+		load = gcfs_rq->avg.load_avg;
+		if (load) {
+			long tg_load;
 
-		/* Get tg's load and ensure tg_load > 0 */
-		tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
+			/* Get tg's load and ensure tg_load > 0 */
+			tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
 
-		/* Ensure tg_load >= load and updated with current load*/
-		tg_load -= gcfs_rq->tg_load_avg_contrib;
-		tg_load += load;
+			/* Ensure tg_load >= load and updated with current load*/
+			tg_load -= gcfs_rq->tg_load_avg_contrib;
+			tg_load += load;
 
-		/*
-		 * We need to compute a correction term in the case that the
-		 * task group is consuming more CPU than a task of equal
-		 * weight. A task with a weight equals to tg->shares will have
-		 * a load less or equal to scale_load_down(tg->shares).
-		 * Similarly, the sched_entities that represent the task group
-		 * at parent level, can't have a load higher than
-		 * scale_load_down(tg->shares). And the Sum of sched_entities'
-		 * load must be <= scale_load_down(tg->shares).
-		 */
-		if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
-			/* scale gcfs_rq's load into tg's shares*/
-			load *= scale_load_down(gcfs_rq->tg->shares);
-			load /= tg_load;
+			/*
+			 * We need to compute a correction term in the case that the
+			 * task group is consuming more CPU than a task of equal
+			 * weight. A task with a weight equals to tg->shares will have
+			 * a load less or equal to scale_load_down(tg->shares).
+			 * Similarly, the sched_entities that represent the task group
+			 * at parent level, can't have a load higher than
+			 * scale_load_down(tg->shares). And the Sum of sched_entities'
+			 * load must be <= scale_load_down(tg->shares).
+			 */
+			if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
+				/* scale gcfs_rq's load into tg's shares*/
+				load *= scale_load_down(gcfs_rq->tg->shares);
+				load /= tg_load;
+			}
 		}
+	} else {
+		load = calc_cfs_shares(gcfs_rq, prop_type);
 	}
 
 	delta = load - se->avg.load_avg;
@@ -3236,23 +3309,6 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
 	}
 }
 
-/*
- * Unsigned subtract and clamp on underflow.
- *
- * Explicitly do a load-store to ensure the intermediate value never hits
- * memory. This allows lockless observations without ever seeing the negative
- * values.
- */
-#define sub_positive(_ptr, _val) do {				\
-	typeof(_ptr) ptr = (_ptr);				\
-	typeof(*ptr) val = (_val);				\
-	typeof(*ptr) res, var = READ_ONCE(*ptr);		\
-	res = var - val;					\
-	if (res > var)						\
-		res = 0;					\
-	WRITE_ONCE(*ptr, res);					\
-} while (0)
-
 /**
  * update_cfs_rq_load_avg - update the cfs_rq's load/util averages
  * @now: current time, as per cfs_rq_clock_task()
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index dc4d148..4c517b4 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -80,3 +80,5 @@ SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
 SCHED_FEAT(ATTACH_AGE_LOAD, true)
 
+SCHED_FEAT(FUDGE, true)
+SCHED_FEAT(FUDGE2, true)

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
  2017-05-03 18:00     ` Peter Zijlstra
@ 2017-05-03 21:45       ` Tejun Heo
  2017-05-04  5:51         ` Peter Zijlstra
  2017-05-04 10:26       ` Vincent Guittot
  1 sibling, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-05-03 21:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Vincent Guittot,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hello, Peter.

On Wed, May 03, 2017 at 08:00:28PM +0200, Peter Zijlstra wrote:
> Just FUDGE2 on its own seems to be the best on my system and is a change
> that makes sense (and something Paul recently pointed out as well).
> 
> The implementation isn't particularly pretty or fast, but should
> illustrate the idea.
> 
> Poking at the whole update_tg_cfs_load() thing only makes it worse after
> that. And while I agree that that code is mind bending; it seems to work
> OK-ish.
> 
> Tejun, Vincent, could you guys have a poke?

So, just preliminary testing.

FUDGE: Does cut down the number of wrong picks by about 70% and p99
       latency by about half; however, the resulting p99 is still
       worse by 5 - 10 times compared to !cgroup case.

FUDGE2: Changes things a lot (load values go wild) but only because
        it's missing scale_load_down().  After adding
        scale_load_down(), it doesn't do much.  For this to work, it
        needs to be always propagated, which btw shouldn't be
        prohibitively expensive given other operations which are
        performed at the same time.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-05-03 13:09                     ` Peter Zijlstra
@ 2017-05-03 21:49                       ` Tejun Heo
  2017-05-04  8:19                         ` Vincent Guittot
  0 siblings, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-05-03 21:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Ingo Molnar, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On Wed, May 03, 2017 at 03:09:38PM +0200, Peter Zijlstra wrote:
> On Wed, May 03, 2017 at 12:37:37PM +0200, Vincent Guittot wrote:
> > On 3 May 2017 at 11:37, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > Of course, it could be I overlooked something, in which case, please
> > > tell :-)
> > 
> > That's mainly based on the regression I see on my platform. I haven't
> > found the root cause of the regression, but it's there, which means that
> > using the group_entity's load_avg to propagate the child cfs_rq's
> > runnable_load_avg breaks something.
> 
> (as mentioned on IRC)
> 
> Right.. so looking through the code, (group) se->avg.load_avg is used in
> effective_load() (and thereby wake_affine()) and update_cfs_rq_h_load()
> (and therefore task_h_load()).
> 
> So changing it will affect those two functions, which could well lead to
> your regression.

Ah, okay, that makes sense.  I'll try to finish the patch to propagate
runnable without affecting group se->avg.load_avg.  BTW, Vincent, did
you boost the weight of the cgroup when you were testing?  If you put
schbench inside a cgroup and have some base load, it is actually
expected to show worse latency.  You need to give higher weight to the
cgroup matching the number of active threads (to be accurate, scaled
by duty cycle but shouldn't matter too much in practice).

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
  2017-05-03 21:45       ` Tejun Heo
@ 2017-05-04  5:51         ` Peter Zijlstra
  2017-05-04  6:21           ` Peter Zijlstra
  0 siblings, 1 reply; 69+ messages in thread
From: Peter Zijlstra @ 2017-05-04  5:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Vincent Guittot,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On Wed, May 03, 2017 at 05:45:46PM -0400, Tejun Heo wrote:

> FUDGE2: Changes things a lot (load values go wild) but only because
>         it's missing scale_load_down().  After adding
>         scale_load_down(), it doesn't do much.  For this to work, it
>         needs to be always propagated, which btw shouldn't be
>         prohibitively expensive given other operations which are
>         performed at the same time.

Urgh, and my numbers were so pretty :/

Maybe I need to wake up, but I'm not immediately seeing where the
scale_load_down() went missing for FUDGE2.

FUDGE does indeed appear to have one missing.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
  2017-05-04  5:51         ` Peter Zijlstra
@ 2017-05-04  6:21           ` Peter Zijlstra
  2017-05-04  9:49             ` Dietmar Eggemann
  0 siblings, 1 reply; 69+ messages in thread
From: Peter Zijlstra @ 2017-05-04  6:21 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Vincent Guittot,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On Thu, May 04, 2017 at 07:51:29AM +0200, Peter Zijlstra wrote:

> Urgh, and my numbers were so pretty :/

Just to clarify on how to run schbench, I limited to a single socket (as
that is what you have) and set -t to the number of cores in the socket
(not the number of threads).

Furthermore, my machine is _idle_, if I don't do anything, it doesn't do
_anything_.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-05-03 21:49                       ` Tejun Heo
@ 2017-05-04  8:19                         ` Vincent Guittot
  2017-05-04 17:43                           ` Tejun Heo
  0 siblings, 1 reply; 69+ messages in thread
From: Vincent Guittot @ 2017-05-04  8:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On 3 May 2017 at 23:49, Tejun Heo <tj@kernel.org> wrote:
> On Wed, May 03, 2017 at 03:09:38PM +0200, Peter Zijlstra wrote:
>> On Wed, May 03, 2017 at 12:37:37PM +0200, Vincent Guittot wrote:
>> > On 3 May 2017 at 11:37, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> > > Of course, it could be I overlooked something, in which case, please
>> > > tell :-)
>> >
>> > That's mainly based on the regression I see on my platform. I haven't
>> > found the root cause of the regression, but it's there, which means that
>> > using the group_entity's load_avg to propagate the child cfs_rq's
>> > runnable_load_avg breaks something.
>>
>> (as mentioned on IRC)
>>
>> Right.. so looking through the code, (group) se->avg.load_avg is used in
>> effective_load() (and thereby wake_affine()) and update_cfs_rq_h_load()
>> (and therefore task_h_load()).
>>
>> So changing it will affect those two functions, which could well lead to
>> your regression.
>
> Ah, okay, that makes sense.  I'll try to finish the patch to propagate
> runnable without affecting group se->avg.load_avg.  BTW, Vincent, did
> you boost the weight of the cgroup when you were testing?  If you put

I use default group weight

> schbench inside a cgroup and have some base load, it is actually
> expected to show worse latency.  You need to give higher weight to the
> cgroup matching the number of active threads (to be accurate, scaled
> by duty cycle but shouldn't matter too much in practice).

I don't have to change any cgroup weight with mainline to get good
numbers, which means that the base load, which is quite close to
null, is probably not the problem.

>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
  2017-05-04  6:21           ` Peter Zijlstra
@ 2017-05-04  9:49             ` Dietmar Eggemann
  2017-05-04 10:57               ` Peter Zijlstra
  2017-05-04 17:39               ` Tejun Heo
  0 siblings, 2 replies; 69+ messages in thread
From: Dietmar Eggemann @ 2017-05-04  9:49 UTC (permalink / raw)
  To: Peter Zijlstra, Tejun Heo
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Vincent Guittot,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On 04/05/17 07:21, Peter Zijlstra wrote:
> On Thu, May 04, 2017 at 07:51:29AM +0200, Peter Zijlstra wrote:
> 
>> Urgh, and my numbers were so pretty :/
> 
> Just to clarify on how to run schbench, I limited to a single socket (as
> that is what you have) and set -t to the number of cores in the socket
> (not the number of threads).
> 
> Furthermore, my machine is _idle_, if I don't do anything, it doesn't do
> _anything_.
>

I can't recreate this problem running 'numactl -N 0 ./schbench -m 2 -t
10 -s 10000 -c 15000 -r 30' on my E5-2690 v2 (IVB-EP, 2 sockets, 10
cores / socket, 2 threads / core)

I tried tip/sched/core comparing running in 'cpu:/' and 'cpu:/foo', and
using your patch on top with all the combinations of {NO_}FUDGE,
{NO_}FUDGE2 with prop_type=shares_avg or prop_type_runnable.

Were you able to see the issue on tip/sched/core w/o your patch on your
machine?

The workload of n 60% periodic tasks on n logical cpus always creates a
very stable task distribution for me.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
  2017-05-03 18:00     ` Peter Zijlstra
  2017-05-03 21:45       ` Tejun Heo
@ 2017-05-04 10:26       ` Vincent Guittot
  1 sibling, 0 replies; 69+ messages in thread
From: Vincent Guittot @ 2017-05-04 10:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Ingo Molnar, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

On Wednesday 03 May 2017 at 20:00:28 (+0200), Peter Zijlstra wrote:
> 
 
[snip]

> 
> Just FUDGE2 on its own seems to be the best on my system and is a change
> that makes sense (and something Paul recently pointed out as well).
> 
> The implementation isn't particularly pretty or fast, but should
> illustrate the idea.
> 
> Poking at the whole update_tg_cfs_load() thing only makes it worse after
> that. And while I agree that that code is mind bending; it seems to work
> OK-ish.
> 
> Tejun, Vincent, could you guys have a poke?

I have added the below patch on top of yours:
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3191,7 +3191,7 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
 			}
 		}
 	} else {
-		load = calc_cfs_shares(gcfs_rq, prop_type);
+		load = scale_load_down(calc_cfs_shares(gcfs_rq, prop_type));
 	}
 
 	delta = load - se->avg.load_avg;
-- 
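
(If I read the load-resolution code correctly, the point of the one-line
change above is a unit mismatch: calc_cfs_shares() hands back a value in
the high-resolution weight/shares domain, while se->avg.load_avg does not
carry that extra resolution.  Below is a tiny standalone sketch of the
mismatch; the shift value and helper names are illustrative stand-ins,
not the kernel's definitions.)

#include <stdio.h>

#define RESOLUTION_SHIFT 10	/* extra weight resolution assumed on 64-bit */

/* weights/shares carry extra resolution; *_avg values do not */
static long fake_scale_load(long w)      { return w << RESOLUTION_SHIFT; }
static long fake_scale_load_down(long w) { return w >> RESOLUTION_SHIFT; }

int main(void)
{
	long shares      = fake_scale_load(1024);	/* high-res domain */
	long se_load_avg = 1024;			/* low-res (*_avg) domain */

	printf("delta without scaling down: %ld\n", shares - se_load_avg);
	printf("delta with scaling down:    %ld\n",
	       fake_scale_load_down(shares) - se_load_avg);
	return 0;
}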

The results for each configuration are:

** Config 1**
linaro@linaro-developer:~/schbench$ cat /sys/module/fair/parameters/prop_type
1
linaro@linaro-developer:~/schbench$ sudo cat /sys/kernel/debug/sched_features
GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_DOUBLE_TICK LB_BIAS NONTASK_CAPACITY TTWU_QUEUE NO_SIS_AVG_CPU RT_PUSH_IPI NO_FORCE_SD_OVERLAP RT_RUNTIME_SHARE NO_LB_MIN ATTACH_AGE_LOAD NO_FUDGE NO_FUDGE2 

Latency percentiles (usec)
	50.0000th: 252
	75.0000th: 346
	90.0000th: 438
	95.0000th: 485
	*99.0000th: 537
	99.5000th: 581
	99.9000th: 5768
	min=0, max=14202

** Config 2 **
linaro@linaro-developer:~/schbench$ cat /sys/module/fair/parameters/prop_type
1
linaro@linaro-developer:~/schbench$ sudo cat /sys/kernel/debug/sched_features
GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_DOUBLE_TICK LB_BIAS NONTASK_CAPACITY TTWU_QUEUE NO_SIS_AVG_CPU RT_PUSH_IPI NO_FORCE_SD_OVERLAP RT_RUNTIME_SHARE NO_LB_MIN ATTACH_AGE_LOAD FUDGE NO_FUDGE2 

Latency percentiles (usec)
	50.0000th: 261
	75.0000th: 374
	90.0000th: 457
	95.0000th: 490
	*99.0000th: 533
	99.5000th: 585
	99.9000th: 9392
	min=0, max=13295

**Config 3**
linaro@linaro-developer:~/schbench$ cat /sys/module/fair/parameters/prop_type
1
linaro@linaro-developer:~/schbench$ sudo cat /sys/kernel/debug/sched_features
GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_DOUBLE_TICK LB_BIAS NONTASK_CAPACITY TTWU_QUEUE NO_SIS_AVG_CPU RT_PUSH_IPI NO_FORCE_SD_OVERLAP RT_RUNTIME_SHARE NO_LB_MIN ATTACH_AGE_LOAD NO_FUDGE FUDGE2 

Latency percentiles (usec)
	50.0000th: 233
	75.0000th: 309
	90.0000th: 456
	95.0000th: 498
	*99.0000th: 5272
	99.5000th: 8184
	99.9000th: 12368
	min=0, max=14865

I have run the test several times; a few runs were correct but most of them
were like the above.

**Config 4**
linaro@linaro-developer:~/schbench$ cat /sys/module/fair/parameters/prop_type
1
linaro@linaro-developer:~/schbench$ sudo cat /sys/kernel/debug/sched_features
GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_DOUBLE_TICK LB_BIAS NONTASK_CAPACITY TTWU_QUEUE NO_SIS_AVG_CPU RT_PUSH_IPI NO_FORCE_SD_OVERLAP RT_RUNTIME_SHARE NO_LB_MIN ATTACH_AGE_LOAD FUDGE FUDGE2 

Latency percentiles (usec)
	50.0000th: 211
	75.0000th: 290
	90.0000th: 380
	95.0000th: 451
	*99.0000th: 1778
	99.5000th: 5048
	99.9000th: 12752
	min=0, max=15090

I have run the test several times; a few runs were correct but most of them
were like the above.

** Config 5 **
linaro@linaro-developer:~/schbench$ cat /sys/module/fair/parameters/prop_type
0
linaro@linaro-developer:~/schbench$ sudo cat /sys/kernel/debug/sched_features
GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_DOUBLE_TICK LB_BIAS NONTASK_CAPACITY TTWU_QUEUE NO_SIS_AVG_CPU RT_PUSH_IPI NO_FORCE_SD_OVERLAP RT_RUNTIME_SHARE NO_LB_MIN ATTACH_AGE_LOAD FUDGE FUDGE2

Latency percentiles (usec)
	50.0000th: 216
	75.0000th: 297
	90.0000th: 430
	95.0000th: 487
	*99.0000th: 2748
	99.5000th: 7432
	99.9000th: 12912
	min=0, max=15046

I have run the test several times; a few runs were correct but most of them
were like the above.

** Config 6 **
linaro@linaro-developer:~/schbench$ cat /sys/module/fair/parameters/prop_type
0
linaro@linaro-developer:~/schbench$ sudo cat /sys/kernel/debug/sched_features
GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_DOUBLE_TICK LB_BIAS NONTASK_CAPACITY TTWU_QUEUE NO_SIS_AVG_CPU RT_PUSH_IPI NO_FORCE_SD_OVERLAP RT_RUNTIME_SHARE NO_LB_MIN ATTACH_AGE_LOAD FUDGE NO_FUDGE2 

Latency percentiles (usec)
	50.0000th: 245
	75.0000th: 339
	90.0000th: 444
	95.0000th: 492
	*99.0000th: 3036
	99.5000th: 9104
	99.9000th: 12496
	min=0, max=14098

I have run the test several times; a few runs were correct but most of them
were like the above.



Configs 1 and 2 are the two configurations which have stable and good results.

Regards,
Vincent

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
  2017-05-04  9:49             ` Dietmar Eggemann
@ 2017-05-04 10:57               ` Peter Zijlstra
  2017-05-04 17:39               ` Tejun Heo
  1 sibling, 0 replies; 69+ messages in thread
From: Peter Zijlstra @ 2017-05-04 10:57 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Tejun Heo, Ingo Molnar, linux-kernel, Linus Torvalds,
	Vincent Guittot, Mike Galbraith, Paul Turner, Chris Mason,
	kernel-team

On Thu, May 04, 2017 at 10:49:51AM +0100, Dietmar Eggemann wrote:
> On 04/05/17 07:21, Peter Zijlstra wrote:
> > On Thu, May 04, 2017 at 07:51:29AM +0200, Peter Zijlstra wrote:
> > 
> >> Urgh, and my numbers were so pretty :/
> > 
> > Just to clarify on how to run schbench, I limited to a single socket (as
> > that is what you have) and set -t to the number of cores in the socket
> > (not the number of threads).
> > 
> > Furthermore, my machine is _idle_, if I don't do anything, it doesn't do
> > _anything_.
> >
> 
> I can't recreate this problem running 'numactl -N 0 ./schbench -m 2 -t
> 10 -s 10000 -c 15000 -r 30' on my E5-2690 v2 (IVB-EP, 2 sockets, 10
> cores / socket, 2 threads / core)
> 
> I tried tip/sched/core comparing running in 'cpu:/' and 'cpu:/foo' and

I'm running tip/master (I think, possibly with the numa topology fixes
in, which should be no-op on the EP).

Also, I run Debian sysvinit, so nobody is creating cgroups I don't know about.

> using your patch on top with all the combinations of {NO_}FUDGE,
> {NO_}FUDGE2 with prop_type=shares_avg or prop_type_runnable.
> 
> Were you able to see the issue on tip/sched/core w/o your patch on your
> machine?

I see the 99.5th percentile shoot up when I run it in a cgroup.
With FUDGE2 it's all good again, like not using cgroups.



But yes, last time I played with schbench (when prodding at
select_idle_sibling) the thing was finicky too; I never quite got the same
numbers Chris did. But in the end we found something that worked
at both ends.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
  2017-05-04  9:49             ` Dietmar Eggemann
  2017-05-04 10:57               ` Peter Zijlstra
@ 2017-05-04 17:39               ` Tejun Heo
  2017-05-05 10:36                 ` Dietmar Eggemann
  1 sibling, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-05-04 17:39 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Linus Torvalds,
	Vincent Guittot, Mike Galbraith, Paul Turner, Chris Mason,
	kernel-team

Hello, Dietmar.

On Thu, May 04, 2017 at 10:49:51AM +0100, Dietmar Eggemann wrote:
> On 04/05/17 07:21, Peter Zijlstra wrote:
> > On Thu, May 04, 2017 at 07:51:29AM +0200, Peter Zijlstra wrote:
> > 
> >> Urgh, and my numbers were so pretty :/
> > 
> > Just to clarify on how to run schbench, I limited to a single socket (as
> > that is what you have) and set -t to the number of cores in the socket
> > (not the number of threads).
> > 
> > Furthermore, my machine is _idle_, if I don't do anything, it doesn't do
> > _anything_.
> >
> 
> I can't recreate this problem running 'numactl -N 0 ./schbench -m 2 -t
> 10 -s 10000 -c 15000 -r 30' on my E5-2690 v2 (IVB-EP, 2 sockets, 10
> cores / socket, 2 threads / core)
> 
> I tried tip/sched/core comparing running in 'cpu:/' and 'cpu:/foo' and
> 
> using your patch on top with all the combinations of {NO_}FUDGE,
> {NO_}FUDGE2 with prop_type=shares_avg or prop_type_runnable.
> 
> Were you able to see the issue on tip/sched/core w/o your patch on your
> machine?
> 
> The workload of n 60% periodic tasks on n logical cpus always creates a
> very stable task distribution for me.

It depends heavily on what else is going on in the system.  On the
test systems that I'm using, there's always something not-too-heavy
going on.  The pattern over time isn't too varied and the latency
results are usually stable and the grouping of results is very clear
as the difference between the load balancer working properly and not
shows up as up to an order of magnitude difference in p99 latencies.

For these differences to matter, you need to push the machine so that
it's right at the point of saturation - e.g. increase duty cycle till
p99 starts to deteriorate w/o cgroup.
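
(As a toy illustration of that knob: assuming -c is the spin time and -s
the sleep time, which matches the "60% periodic tasks" description
elsewhere in the thread, raising -c walks the duty cycle toward
saturation.  The worker and CPU counts below are made up.)

#include <stdio.h>

/* Workers spin for roughly -c usec and then sleep for -s usec, so the
 * duty cycle is about c / (c + s).  Raising -c with -s fixed pushes the
 * box toward the point where the number of runnable workers approaches
 * the number of CPUs. */
int main(void)
{
	int s = 10000;				/* -s from the runs above */
	int nr_workers = 2 * 10, nr_cpus = 20;	/* illustrative only */

	for (int c = 10000; c <= 20000; c += 2500) {
		double duty = (double)c / (c + s);
		printf("-c %5d: duty %.2f -> ~%.1f runnable of %d CPUs\n",
		       c, duty, nr_workers * duty, nr_cpus);
	}
	return 0;
}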

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-05-04  8:19                         ` Vincent Guittot
@ 2017-05-04 17:43                           ` Tejun Heo
  2017-05-04 19:02                             ` Vincent Guittot
  0 siblings, 1 reply; 69+ messages in thread
From: Tejun Heo @ 2017-05-04 17:43 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hello,

On Thu, May 04, 2017 at 10:19:46AM +0200, Vincent Guittot wrote:
> > schbench inside a cgroup and have some base load, it is actually
> > expected to show worse latency.  You need to give higher weight to the
> > cgroup matching the number of active threads (to be accurate, scaled
> > by duty cycle but shouldn't matter too much in practice).
> 
> I don't have to change any cgroup weight with mainline to get good
> numbers, which means that the base load, which is quite close to zero,
> is probably not the problem.

So, while that *could* be the case, it could also be the baseline
incorrectly favoring the nested cfs_rqs over other tasks because of
the nested runnables being inflated with blocked load avgs.  I think
it'd be a good idea to test with matching weight to put things on
even ground.
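
(To make the "inflated with blocked load avgs" point concrete, here is a
small self-contained model.  Plain C, not kernel code; the structure name
and all numbers are made up for illustration.)

#include <stdio.h>

/* Toy model of the hypothesis above: a group's load_avg still contains
 * the slowly-decaying contribution of recently blocked tasks, while
 * runnable_load_avg only counts what is queued right now.  If the load
 * balancer's view of the group is based on the former, the group looks
 * busier than it is and ends up being favored. */
struct toy_cfs_rq {
	long runnable_load_avg;		/* tasks currently queued */
	long load_avg;			/* runnable + not-yet-decayed blocked */
};

int main(void)
{
	/* example numbers: 1 runnable task, 3 recently blocked tasks whose
	 * contribution has only partially decayed */
	struct toy_cfs_rq group = {
		.runnable_load_avg = 1024,
		.load_avg          = 1024 + 3 * 700,
	};

	printf("view based on load_avg:          %ld\n", group.load_avg);
	printf("view based on runnable_load_avg: %ld\n", group.runnable_load_avg);
	return 0;
}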

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-05-04 17:43                           ` Tejun Heo
@ 2017-05-04 19:02                             ` Vincent Guittot
  2017-05-04 19:04                               ` Tejun Heo
  0 siblings, 1 reply; 69+ messages in thread
From: Vincent Guittot @ 2017-05-04 19:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hi Tejun,

On 4 May 2017 at 19:43, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Thu, May 04, 2017 at 10:19:46AM +0200, Vincent Guittot wrote:
>> > schbench inside a cgroup and have some base load, it is actually
>> > expected to show worse latency.  You need to give higher weight to the
>> > cgroup matching the number of active threads (to be accurate, scaled
>> > by duty cycle but shouldn't matter too much in practice).
>>
>> I don't have to change any cgroup weight with mainline to get good
>> numbers, which means that the base load, which is quite close to zero,
>> is probably not the problem.
>
> So, while that *could* be the case, it could also be the baseline
> incorrectly favoring the nested cfs_rqs over other tasks because of
> the nested runnables being inflated with blocked load avgs.  I think
> it'd be a good idea to test with matching weight to put things on
> even ground.

In the trace I have uploaded, you will see that regressions happen
even though there are no other runnable threads around, so it's not a
matter of background activities disturbing schbench.

Thanks
Vincent

>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
  2017-05-04 19:02                             ` Vincent Guittot
@ 2017-05-04 19:04                               ` Tejun Heo
  0 siblings, 0 replies; 69+ messages in thread
From: Tejun Heo @ 2017-05-04 19:04 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team

Hello, Vincent.

On Thu, May 04, 2017 at 09:02:39PM +0200, Vincent Guittot wrote:
> In the trace I have uploaded, you will see that regressions happen
> even though there are no other runnable threads around, so it's not a
> matter of background activities disturbing schbench.

Understood, yeah, I'm almost done with the patches to propagate
runnable w/o disturbing load_avg.  Will post them soon.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
  2017-05-04 17:39               ` Tejun Heo
@ 2017-05-05 10:36                 ` Dietmar Eggemann
  0 siblings, 0 replies; 69+ messages in thread
From: Dietmar Eggemann @ 2017-05-05 10:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Linus Torvalds,
	Vincent Guittot, Mike Galbraith, Paul Turner, Chris Mason,
	kernel-team

Hi Tejun,

On 04/05/17 18:39, Tejun Heo wrote:
> Hello, Dietmar.
> 
> On Thu, May 04, 2017 at 10:49:51AM +0100, Dietmar Eggemann wrote:
>> On 04/05/17 07:21, Peter Zijlstra wrote:
>>> On Thu, May 04, 2017 at 07:51:29AM +0200, Peter Zijlstra wrote:

[...]

>>
>> I can't recreate this problem running 'numactl -N 0 ./schbench -m 2 -t
>> 10 -s 10000 -c 15000 -r 30' on my E5-2690 v2 (IVB-EP, 2 sockets, 10
>> cores / socket, 2 threads / core)
>>
>> I tried tip/sched/core comparing running in 'cpu:/' and 'cpu:/foo' and
>>
>> using your patch on top with all the combinations of {NO_}FUDGE,
>> {NO_}FUDGE2 with prop_type=shares_avg or prop_type_runnable.
>>
>> Were you able to see the issue on tip/sched/core w/o your patch on your
>> machine?
>>
>> The workload of n 60% periodic tasks on n logical cpus always creates a
>> very stable task distribution for me.
> 
> It depends heavily on what else is going on in the system.  On the
> test systems that I'm using, there's always something not-too-heavy
> going on.  The pattern over time isn't too varied and the latency
> results are usually stable and the grouping of results is very clear
> as the difference between the load balancer working properly and not
> shows up as up to an order of magnitude difference in p99 latencies.

OK, that makes sense. You do need the light (independent of schbench)
background noise to create work for the load balancer.

I switched to my Hikey board (hot-plugged out the 2nd cluster, so 4
remaining cores with the performance governor) because we should see the
effect regardless of the topology. There is no background noise on my
Debian fs.

That's why I don't see any effect if I increase the C/S
(cputime/sleeptime) ratio when running 'schbench -m 2 -t 2 -s S -c C -r
30'. The only source of disturbance is the additional schbench threads,
which sometimes force one of the worker threads to get co-scheduled with
another worker thread.

https://drive.google.com/file/d/0B2f-ZAwV_YnmTDhWUk5ZRHdBRUU/view shows
such a case where the additional schbench thread 'schbench-2206' (green
marker line in the picture) forces the worker thread 'schbench-2209' to
wakeup migrate from cpu3 to cpu0, where it gets co-scheduled with the
worker thread 'schbench-2210' for a while.

> For these differences to matter, you need to push the machine so that
> it's right at the point of saturation - e.g. increase duty cycle till
> p99 starts to deteriorate w/o cgroup.
> 
> Thanks.
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2017-05-05 10:36 UTC | newest]

Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-24 20:13 [RFC PATCHSET] sched/fair: fix load balancer behavior when cgroup is in use Tejun Heo
2017-04-24 20:14 ` [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity Tejun Heo
2017-04-24 21:33   ` [PATCH v2 " Tejun Heo
2017-05-03 18:00     ` Peter Zijlstra
2017-05-03 21:45       ` Tejun Heo
2017-05-04  5:51         ` Peter Zijlstra
2017-05-04  6:21           ` Peter Zijlstra
2017-05-04  9:49             ` Dietmar Eggemann
2017-05-04 10:57               ` Peter Zijlstra
2017-05-04 17:39               ` Tejun Heo
2017-05-05 10:36                 ` Dietmar Eggemann
2017-05-04 10:26       ` Vincent Guittot
2017-04-25  8:35   ` [PATCH " Vincent Guittot
2017-04-25 18:12     ` Tejun Heo
2017-04-26 16:51       ` Vincent Guittot
2017-04-26 22:40         ` Tejun Heo
2017-04-27  7:00           ` Vincent Guittot
2017-05-01 14:17         ` Peter Zijlstra
2017-05-01 14:52           ` Peter Zijlstra
2017-05-01 21:56           ` Tejun Heo
2017-05-02  8:19             ` Peter Zijlstra
2017-05-02  8:30               ` Peter Zijlstra
2017-05-02 20:00                 ` Tejun Heo
2017-05-03  9:10                   ` Peter Zijlstra
2017-04-26 16:14   ` Vincent Guittot
2017-04-26 22:27     ` Tejun Heo
2017-04-27  8:59       ` Vincent Guittot
2017-04-28 17:46         ` Tejun Heo
2017-05-02  7:20           ` Vincent Guittot
2017-04-24 20:14 ` [PATCH 2/2] sched/fair: Always propagate runnable_load_avg Tejun Heo
2017-04-25  8:46   ` Vincent Guittot
2017-04-25  9:05     ` Vincent Guittot
2017-04-25 12:59       ` Vincent Guittot
2017-04-25 18:49         ` Tejun Heo
2017-04-25 20:49           ` Tejun Heo
2017-04-25 21:15             ` Chris Mason
2017-04-25 21:08           ` Tejun Heo
2017-04-26 10:21             ` Vincent Guittot
2017-04-27  0:30               ` Tejun Heo
2017-04-27  8:28                 ` Vincent Guittot
2017-04-28 16:14                   ` Tejun Heo
2017-05-02  6:56                     ` Vincent Guittot
2017-05-02 20:56                       ` Tejun Heo
2017-05-03  7:25                         ` Vincent Guittot
2017-05-03  7:54                           ` Vincent Guittot
2017-04-26 18:12   ` Vincent Guittot
2017-04-26 22:52     ` Tejun Heo
2017-04-27  8:29       ` Vincent Guittot
2017-04-28 20:33         ` Tejun Heo
2017-04-28 20:38           ` Tejun Heo
2017-05-01 15:56           ` Peter Zijlstra
2017-05-02 22:01             ` Tejun Heo
2017-05-02  7:18           ` Vincent Guittot
2017-05-02 13:26             ` Vincent Guittot
2017-05-02 22:37               ` Tejun Heo
2017-05-02 21:50             ` Tejun Heo
2017-05-03  7:34               ` Vincent Guittot
2017-05-03  9:37                 ` Peter Zijlstra
2017-05-03 10:37                   ` Vincent Guittot
2017-05-03 13:09                     ` Peter Zijlstra
2017-05-03 21:49                       ` Tejun Heo
2017-05-04  8:19                         ` Vincent Guittot
2017-05-04 17:43                           ` Tejun Heo
2017-05-04 19:02                             ` Vincent Guittot
2017-05-04 19:04                               ` Tejun Heo
2017-04-24 21:35 ` [PATCH 3/2] sched/fair: Skip __update_load_avg() on cfs_rq sched_entities Tejun Heo
2017-04-24 21:48   ` Peter Zijlstra
2017-04-24 22:54     ` Tejun Heo
2017-04-25 21:09   ` Tejun Heo
