linux-kernel.vger.kernel.org archive mirror
* [RFC] sched,cfs: flatten CPU controller runqueues
@ 2019-06-12 19:32 Rik van Riel
  2019-06-12 19:32 ` [PATCH 1/8] sched: introduce task_se_h_load helper Rik van Riel
                   ` (7 more replies)
  0 siblings, 8 replies; 24+ messages in thread
From: Rik van Riel @ 2019-06-12 19:32 UTC (permalink / raw)
  To: peterz
  Cc: mingo, linux-kernel, kernel-team, morten.rasmussen, tglx,
	dietmar.eggeman, mgorman, vincent.guittot

The current implementation of the CPU controller uses hierarchical
runqueues, where on wakeup a task is enqueued on its group's runqueue,
the group is enqueued on the runqueue of the group above it, etc.

This adds a fairly large amount of overhead for workloads that
do a lot of wakeups per second, especially given that the default systemd
hierarchy is 2 or 3 levels deep.

This patch series is an attempt at reducing that overhead, by placing
all the tasks on the same runqueue, and scaling the task priority by
the priority of the group, which is calculated periodically.
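
To illustrate the idea (not the actual implementation, which derives the
scaling from PELT load averages via task_se_h_load() later in the series),
here is a minimal standalone sketch of a task's weight being scaled by its
groups' share of the hierarchy before it is enqueued on the single per-CPU
runqueue. The two-level hierarchy and all numbers are made up:

#include <stdio.h>

int main(void)
{
	unsigned long h_weight = 1024;			/* nice-0 task weight */
	/* this group's weight vs. the total weight at each ancestor level */
	unsigned long grp_weight[]   = { 1024, 1024 };
	unsigned long level_weight[] = { 4096, 2048 };	/* siblings included */
	int level;

	for (level = 0; level < 2; level++)
		h_weight = h_weight * grp_weight[level] / level_weight[level];

	/* the flat per-CPU runqueue sees this task at roughly 1/8th weight */
	printf("effective weight: %lu\n", h_weight);	/* prints 128 */
	return 0;
}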

This patch series still has a number of TODO items:
- Clean up the code, and fix compilation without CONFIG_FAIR_GROUP_SCHED.
- Remove some more now unused code.
- Figure out a regression with schbench, where the p99 latency goes up
  before the system is fully overloaded. I suspect wakeup_preempt_entity()
  and wakeup_gran() because they now use the task_h_load instead of the
  unscaled load to figure out whether a task should be preempted.
- Reimplement CONFIG_CFS_BANDWIDTH.

Plan for the CONFIG_CFS_BANDWIDTH reimplementation:
- When a cgroup gets throttled, mark the cgroup and its children
  as throttled.
- When pick_next_entity finds a task that is on a throttled cgroup,
  stash it on the cgroup runqueue (which is not used for runnable
  tasks any more). Leave the vruntime unchanged, and adjust that
  runqueue's vruntime to be that of the left-most task.
- When a cgroup gets unthrottled, and has tasks on it, place it on
  a vruntime ordered heap separate from the main runqueue.
- Have pick_next_task_fair grab one task off that heap every time it
  is called, if the min vruntime of that heap is lower than the
  vruntime of the CPU's cfs_rq (or the CPU has no other runnable tasks).
  A toy sketch of this step follows the list.
- Place that selected task on the CPU's cfs_rq, renormalizing its
  vruntime with the GENTLE_FAIR_SLEEPERS logic. That should help
  interleave the already runnable tasks with the recently unthrottled
  group, and prevent thundering herd issues.
- If the group gets throttled again before all of its tasks have had a
  chance to run, vruntime sorting ensures that all the tasks in the
  throttled cgroup eventually get a chance to run.
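
As a rough, purely illustrative sketch of the "grab one task per pick"
step above (a sorted array stands in for the vruntime ordered heap, the
GENTLE_FAIR_SLEEPERS renormalization is omitted, and all names and
numbers are made up):

#include <stdio.h>
#include <stdlib.h>

static int cmp_vruntime(const void *a, const void *b)
{
	unsigned long long va = *(const unsigned long long *)a;
	unsigned long long vb = *(const unsigned long long *)b;

	return va < vb ? -1 : va > vb;
}

int main(void)
{
	/* stashed vruntimes of tasks from a just-unthrottled cgroup */
	unsigned long long stashed[] = { 1100, 1050, 1300 };
	unsigned long long rq_min_vruntime = 1200;
	int nr_stashed = 3, next = 0, pick;

	qsort(stashed, nr_stashed, sizeof(stashed[0]), cmp_vruntime);

	for (pick = 0; pick < 5; pick++) {
		if (next < nr_stashed && stashed[next] <= rq_min_vruntime) {
			/* release at most one stashed task per pick */
			printf("pick %d: release stashed task, vruntime %llu\n",
			       pick, stashed[next++]);
		} else {
			printf("pick %d: run an already queued task\n", pick);
			rq_min_vruntime += 100;	/* pretend time advances */
		}
	}
	return 0;
}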

This patch series applies on top of what was Linus's current tree when I
last rebased it:
2c1212de6f97 ("Merge tag 'spdx-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core")

 include/linux/sched.h |    5 
 kernel/sched/core.c   |    2 
 kernel/sched/debug.c  |   12 
 kernel/sched/fair.c   |  744 +++++++++++++++++++++-----------------------------
 kernel/sched/pelt.c   |   55 +--
 kernel/sched/pelt.h   |    2 
 kernel/sched/sched.h  |    9 
 7 files changed, 346 insertions(+), 483 deletions(-)



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/8] sched: introduce task_se_h_load helper
  2019-06-12 19:32 [RFC] sched,cfs: flatten CPU controller runqueues Rik van Riel
@ 2019-06-12 19:32 ` Rik van Riel
  2019-06-19 12:52   ` Dietmar Eggemann
  2019-06-12 19:32 ` [PATCH 2/8] sched: change /proc/sched_debug fields Rik van Riel
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Rik van Riel @ 2019-06-12 19:32 UTC (permalink / raw)
  To: peterz
  Cc: mingo, linux-kernel, kernel-team, morten.rasmussen, tglx,
	dietmar.eggeman, mgorman, vincent.guittot, Rik van Riel

Sometimes the hierarchical load of a sched_entity needs to be calculated.
Split out task_h_load into a task_se_h_load that takes a sched_entity pointer
as its argument, and a task_h_load wrapper that calls task_se_h_load.

No functional changes.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 kernel/sched/fair.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f5e528..df624f7a68e7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -706,6 +706,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 #ifdef CONFIG_SMP
 
 static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
+static unsigned long task_se_h_load(struct sched_entity *se);
 static unsigned long task_h_load(struct task_struct *p);
 static unsigned long capacity_of(int cpu);
 
@@ -7833,14 +7834,19 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 	}
 }
 
-static unsigned long task_h_load(struct task_struct *p)
+static unsigned long task_se_h_load(struct sched_entity *se)
 {
-	struct cfs_rq *cfs_rq = task_cfs_rq(p);
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 	update_cfs_rq_h_load(cfs_rq);
-	return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
+	return div64_ul(se->avg.load_avg * cfs_rq->h_load,
 			cfs_rq_load_avg(cfs_rq) + 1);
 }
+
+static unsigned long task_h_load(struct task_struct *p)
+{
+	return task_se_h_load(&p->se);
+}
 #else
 static inline void update_blocked_averages(int cpu)
 {
@@ -7865,6 +7871,11 @@ static inline void update_blocked_averages(int cpu)
 	rq_unlock_irqrestore(rq, &rf);
 }
 
+static unsigned long task_se_h_load(struct sched_entity *se)
+{
+	return se->avg.load_avg;
+}
+
 static unsigned long task_h_load(struct task_struct *p)
 {
 	return p->se.avg.load_avg;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/8] sched: change /proc/sched_debug fields
  2019-06-12 19:32 [RFC] sched,cfs: flatten CPU controller runqueues Rik van Riel
  2019-06-12 19:32 ` [PATCH 1/8] sched: introduce task_se_h_load helper Rik van Riel
@ 2019-06-12 19:32 ` Rik van Riel
  2019-06-12 19:32 ` [PATCH 3/8] sched,fair: redefine runnable_load_avg as the sum of task_h_load Rik van Riel
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Rik van Riel @ 2019-06-12 19:32 UTC (permalink / raw)
  To: peterz
  Cc: mingo, linux-kernel, kernel-team, morten.rasmussen, tglx,
	dietmar.eggeman, mgorman, vincent.guittot, Rik van Riel

Remove some fields from /proc/sched_debug that will be removed from
sched_entity in a subsequent patch, and add h_load, which comes in
very handy when debugging CPU controller weight distribution.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 kernel/sched/debug.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 678bfb9bd87f..aab4640d66c5 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -419,11 +419,9 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	}
 
 	P(se->load.weight);
-	P(se->runnable_weight);
 #ifdef CONFIG_SMP
 	P(se->avg.load_avg);
 	P(se->avg.util_avg);
-	P(se->avg.runnable_load_avg);
 #endif
 
 #undef PN_SCHEDSTAT
@@ -541,7 +539,6 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
 	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
 #ifdef CONFIG_SMP
-	SEQ_printf(m, "  .%-30s: %ld\n", "runnable_weight", cfs_rq->runnable_weight);
 	SEQ_printf(m, "  .%-30s: %lu\n", "load_avg",
 			cfs_rq->avg.load_avg);
 	SEQ_printf(m, "  .%-30s: %lu\n", "runnable_load_avg",
@@ -550,17 +547,15 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			cfs_rq->avg.util_avg);
 	SEQ_printf(m, "  .%-30s: %u\n", "util_est_enqueued",
 			cfs_rq->avg.util_est.enqueued);
-	SEQ_printf(m, "  .%-30s: %ld\n", "removed.load_avg",
-			cfs_rq->removed.load_avg);
 	SEQ_printf(m, "  .%-30s: %ld\n", "removed.util_avg",
 			cfs_rq->removed.util_avg);
-	SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_sum",
-			cfs_rq->removed.runnable_sum);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	SEQ_printf(m, "  .%-30s: %lu\n", "tg_load_avg_contrib",
 			cfs_rq->tg_load_avg_contrib);
 	SEQ_printf(m, "  .%-30s: %ld\n", "tg_load_avg",
 			atomic_long_read(&cfs_rq->tg->load_avg));
+	SEQ_printf(m, "  .%-30s: %lu\n", "h_load",
+			cfs_rq->h_load);
 #endif
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
@@ -964,10 +959,8 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		   "nr_involuntary_switches", (long long)p->nivcsw);
 
 	P(se.load.weight);
-	P(se.runnable_weight);
 #ifdef CONFIG_SMP
 	P(se.avg.load_sum);
-	P(se.avg.runnable_load_sum);
 	P(se.avg.util_sum);
 	P(se.avg.load_avg);
 	P(se.avg.runnable_load_avg);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/8] sched,fair: redefine runnable_load_avg as the sum of task_h_load
  2019-06-12 19:32 [RFC] sched,cfs: flatten CPU controller runqueues Rik van Riel
  2019-06-12 19:32 ` [PATCH 1/8] sched: introduce task_se_h_load helper Rik van Riel
  2019-06-12 19:32 ` [PATCH 2/8] sched: change /proc/sched_debug fields Rik van Riel
@ 2019-06-12 19:32 ` Rik van Riel
  2019-06-18  9:08   ` Dietmar Eggemann
  2019-06-26 14:34   ` Dietmar Eggemann
  2019-06-12 19:32 ` [PATCH 4/8] sched,fair: remove cfs rqs from leaf_cfs_rq_list bottom up Rik van Riel
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 24+ messages in thread
From: Rik van Riel @ 2019-06-12 19:32 UTC (permalink / raw)
  To: peterz
  Cc: mingo, linux-kernel, kernel-team, morten.rasmussen, tglx,
	dietmar.eggeman, mgorman, vincent.guittot, Rik van Riel

The runnable_load magic is used to quickly propagate information about
runnable tasks up the hierarchy of runqueues. When switching to a flat
runqueue, that no longer works.

Redefine the CPU cfs_rq runnable_load_avg to be the sum of task_h_loads
of the runnable tasks. This provides enough information to the load
balancer.

The runnable_load_avg of the cgroup cfs_rqs does not appear to be
used for anything, so don't bother calculating those.

This removes one of the things that the code currently traverses the
cgroup hierarchy for, and getting rid of it brings us one step closer
to a flat runqueue for the CPU controller.
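
A minimal, standalone sketch of the bookkeeping this patch introduces
(simplified stand-ins for enqueue_runnable_load_avg(),
update_runnable_load_avg() and dequeue_runnable_load_avg(); the real code
works on PELT averages under the runqueue lock, the names and numbers
below are made up):

#include <stdio.h>

struct toy_task {
	unsigned long h_load;		/* current task_se_h_load() value */
	unsigned long enqueued_h_load;	/* amount added to the CPU cfs_rq */
};

static unsigned long rq_runnable_load_avg;

static void toy_enqueue(struct toy_task *t)
{
	t->enqueued_h_load = t->h_load;
	rq_runnable_load_avg += t->enqueued_h_load;
}

static void toy_update(struct toy_task *t, unsigned long new_h_load)
{
	long delta = (long)new_h_load - (long)t->enqueued_h_load;

	/* only the delta is applied, keeping the sum exact */
	rq_runnable_load_avg += delta;
	t->h_load = t->enqueued_h_load = new_h_load;
}

static void toy_dequeue(struct toy_task *t)
{
	rq_runnable_load_avg -= t->enqueued_h_load;
}

int main(void)
{
	struct toy_task a = { .h_load = 300 }, b = { .h_load = 700 };

	toy_enqueue(&a);
	toy_enqueue(&b);
	toy_update(&b, 500);	/* group shares shifted */
	toy_dequeue(&a);

	/* the CPU runqueue is left with exactly b's task_h_load */
	printf("runnable_load_avg: %lu\n", rq_runnable_load_avg);	/* 500 */
	return 0;
}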

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 include/linux/sched.h |   3 +-
 kernel/sched/core.c   |   2 -
 kernel/sched/debug.c  |   1 +
 kernel/sched/fair.c   | 125 +++++++++++++-----------------------------
 kernel/sched/pelt.c   |  49 ++++++-----------
 kernel/sched/sched.h  |   6 --
 6 files changed, 55 insertions(+), 131 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 11837410690f..f5bb6948e40c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -391,7 +391,6 @@ struct util_est {
 struct sched_avg {
 	u64				last_update_time;
 	u64				load_sum;
-	u64				runnable_load_sum;
 	u32				util_sum;
 	u32				period_contrib;
 	unsigned long			load_avg;
@@ -439,7 +438,6 @@ struct sched_statistics {
 struct sched_entity {
 	/* For load-balancing: */
 	struct load_weight		load;
-	unsigned long			runnable_weight;
 	struct rb_node			run_node;
 	struct list_head		group_node;
 	unsigned int			on_rq;
@@ -455,6 +453,7 @@ struct sched_entity {
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	int				depth;
+	unsigned long			enqueued_h_load;
 	struct sched_entity		*parent;
 	/* rq on which this entity is (to be) queued: */
 	struct cfs_rq			*cfs_rq;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c427742a9..fbd96900f715 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -744,7 +744,6 @@ static void set_load_weight(struct task_struct *p, bool update_load)
 	if (task_has_idle_policy(p)) {
 		load->weight = scale_load(WEIGHT_IDLEPRIO);
 		load->inv_weight = WMULT_IDLEPRIO;
-		p->se.runnable_weight = load->weight;
 		return;
 	}
 
@@ -757,7 +756,6 @@ static void set_load_weight(struct task_struct *p, bool update_load)
 	} else {
 		load->weight = scale_load(sched_prio_to_weight[prio]);
 		load->inv_weight = sched_prio_to_wmult[prio];
-		p->se.runnable_weight = load->weight;
 	}
 }
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index aab4640d66c5..d06e7436d148 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -965,6 +965,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 	P(se.avg.load_avg);
 	P(se.avg.runnable_load_avg);
 	P(se.avg.util_avg);
+	P(se.enqueued_h_load);
 	P(se.avg.last_update_time);
 	P(se.avg.util_est.ewma);
 	P(se.avg.util_est.enqueued);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df624f7a68e7..aebd43d74468 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -724,9 +724,7 @@ void init_entity_runnable_average(struct sched_entity *se)
 	 * nothing has been attached to the task group yet.
 	 */
 	if (entity_is_task(se))
-		sa->runnable_load_avg = sa->load_avg = scale_load_down(se->load.weight);
-
-	se->runnable_weight = se->load.weight;
+		sa->load_avg = scale_load_down(se->load.weight);
 
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
 }
@@ -2767,20 +2765,39 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 static inline void
 enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	cfs_rq->runnable_weight += se->runnable_weight;
+	if (entity_is_task(se)) {
+		struct cfs_rq *cpu_cfs_rq = &cfs_rq->rq->cfs;
+		se->enqueued_h_load = task_se_h_load(se);
 
-	cfs_rq->avg.runnable_load_avg += se->avg.runnable_load_avg;
-	cfs_rq->avg.runnable_load_sum += se_runnable(se) * se->avg.runnable_load_sum;
+		cpu_cfs_rq->avg.runnable_load_avg += se->enqueued_h_load;
+	}
 }
 
 static inline void
 dequeue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	cfs_rq->runnable_weight -= se->runnable_weight;
+	if (entity_is_task(se)) {
+		struct cfs_rq *cpu_cfs_rq = &cfs_rq->rq->cfs;
+		sub_positive(&cpu_cfs_rq->avg.runnable_load_avg,
+					se->enqueued_h_load);
+	}
+}
+
+static inline void
+update_runnable_load_avg(struct sched_entity *se)
+{
+	struct cfs_rq *cpu_cfs_rq = &cfs_rq_of(se)->rq->cfs;
+	long new_h_load, delta;
+
+	SCHED_WARN_ON(!entity_is_task(se));
+
+	if (!se->on_rq)
+		return;
 
-	sub_positive(&cfs_rq->avg.runnable_load_avg, se->avg.runnable_load_avg);
-	sub_positive(&cfs_rq->avg.runnable_load_sum,
-		     se_runnable(se) * se->avg.runnable_load_sum);
+	new_h_load = task_se_h_load(se);
+	delta = new_h_load - se->enqueued_h_load;
+	cpu_cfs_rq->avg.runnable_load_avg += delta;
+	se->enqueued_h_load = new_h_load;
 }
 
 static inline void
@@ -2808,7 +2825,7 @@ dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
 #endif
 
 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
-			    unsigned long weight, unsigned long runnable)
+			    unsigned long weight)
 {
 	if (se->on_rq) {
 		/* commit outstanding execution time */
@@ -2819,7 +2836,6 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 	}
 	dequeue_load_avg(cfs_rq, se);
 
-	se->runnable_weight = runnable;
 	update_load_set(&se->load, weight);
 
 #ifdef CONFIG_SMP
@@ -2827,8 +2843,6 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 		u32 divider = LOAD_AVG_MAX - 1024 + se->avg.period_contrib;
 
 		se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider);
-		se->avg.runnable_load_avg =
-			div_u64(se_runnable(se) * se->avg.runnable_load_sum, divider);
 	} while (0);
 #endif
 
@@ -2846,7 +2860,7 @@ void reweight_task(struct task_struct *p, int prio)
 	struct load_weight *load = &se->load;
 	unsigned long weight = scale_load(sched_prio_to_weight[prio]);
 
-	reweight_entity(cfs_rq, se, weight, weight);
+	reweight_entity(cfs_rq, se, weight);
 	load->inv_weight = sched_prio_to_wmult[prio];
 }
 
@@ -2959,49 +2973,6 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
 	return clamp_t(long, shares, MIN_SHARES, tg_shares);
 }
 
-/*
- * This calculates the effective runnable weight for a group entity based on
- * the group entity weight calculated above.
- *
- * Because of the above approximation (2), our group entity weight is
- * an load_avg based ratio (3). This means that it includes blocked load and
- * does not represent the runnable weight.
- *
- * Approximate the group entity's runnable weight per ratio from the group
- * runqueue:
- *
- *					     grq->avg.runnable_load_avg
- *   ge->runnable_weight = ge->load.weight * -------------------------- (7)
- *						 grq->avg.load_avg
- *
- * However, analogous to above, since the avg numbers are slow, this leads to
- * transients in the from-idle case. Instead we use:
- *
- *   ge->runnable_weight = ge->load.weight *
- *
- *		max(grq->avg.runnable_load_avg, grq->runnable_weight)
- *		-----------------------------------------------------	(8)
- *		      max(grq->avg.load_avg, grq->load.weight)
- *
- * Where these max() serve both to use the 'instant' values to fix the slow
- * from-idle and avoid the /0 on to-idle, similar to (6).
- */
-static long calc_group_runnable(struct cfs_rq *cfs_rq, long shares)
-{
-	long runnable, load_avg;
-
-	load_avg = max(cfs_rq->avg.load_avg,
-		       scale_load_down(cfs_rq->load.weight));
-
-	runnable = max(cfs_rq->avg.runnable_load_avg,
-		       scale_load_down(cfs_rq->runnable_weight));
-
-	runnable *= shares;
-	if (load_avg)
-		runnable /= load_avg;
-
-	return clamp_t(long, runnable, MIN_SHARES, shares);
-}
 #endif /* CONFIG_SMP */
 
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
@@ -3013,25 +2984,24 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 static void update_cfs_group(struct sched_entity *se)
 {
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
-	long shares, runnable;
+	long shares;
 
-	if (!gcfs_rq)
+	if (!gcfs_rq) {
+		update_runnable_load_avg(se);
 		return;
+	}
 
 	if (throttled_hierarchy(gcfs_rq))
 		return;
 
 #ifndef CONFIG_SMP
-	runnable = shares = READ_ONCE(gcfs_rq->tg->shares);
-
 	if (likely(se->load.weight == shares))
 		return;
 #else
 	shares   = calc_group_shares(gcfs_rq);
-	runnable = calc_group_runnable(gcfs_rq, shares);
 #endif
 
-	reweight_entity(cfs_rq_of(se), se, shares, runnable);
+	reweight_entity(cfs_rq_of(se), se, shares);
 }
 
 #else /* CONFIG_FAIR_GROUP_SCHED */
@@ -3244,8 +3214,8 @@ static inline void
 update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
 {
 	long delta_avg, running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
-	unsigned long runnable_load_avg, load_avg;
-	u64 runnable_load_sum, load_sum = 0;
+	unsigned long load_avg;
+	u64 load_sum = 0;
 	s64 delta_sum;
 
 	if (!runnable_sum)
@@ -3293,19 +3263,6 @@ update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cf
 	se->avg.load_avg = load_avg;
 	add_positive(&cfs_rq->avg.load_avg, delta_avg);
 	add_positive(&cfs_rq->avg.load_sum, delta_sum);
-
-	runnable_load_sum = (s64)se_runnable(se) * runnable_sum;
-	runnable_load_avg = div_s64(runnable_load_sum, LOAD_AVG_MAX);
-	delta_sum = runnable_load_sum - se_weight(se) * se->avg.runnable_load_sum;
-	delta_avg = runnable_load_avg - se->avg.runnable_load_avg;
-
-	se->avg.runnable_load_sum = runnable_sum;
-	se->avg.runnable_load_avg = runnable_load_avg;
-
-	if (se->on_rq) {
-		add_positive(&cfs_rq->avg.runnable_load_avg, delta_avg);
-		add_positive(&cfs_rq->avg.runnable_load_sum, delta_sum);
-	}
 }
 
 static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum)
@@ -3400,7 +3357,7 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
 static inline int
 update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 {
-	unsigned long removed_load = 0, removed_util = 0, removed_runnable_sum = 0;
+	unsigned long removed_load = 0, removed_util = 0;
 	struct sched_avg *sa = &cfs_rq->avg;
 	int decayed = 0;
 
@@ -3411,7 +3368,6 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 		raw_spin_lock(&cfs_rq->removed.lock);
 		swap(cfs_rq->removed.util_avg, removed_util);
 		swap(cfs_rq->removed.load_avg, removed_load);
-		swap(cfs_rq->removed.runnable_sum, removed_runnable_sum);
 		cfs_rq->removed.nr = 0;
 		raw_spin_unlock(&cfs_rq->removed.lock);
 
@@ -3423,8 +3379,6 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 		sub_positive(&sa->util_avg, r);
 		sub_positive(&sa->util_sum, r * divider);
 
-		add_tg_cfs_propagate(cfs_rq, -(long)removed_runnable_sum);
-
 		decayed = 1;
 	}
 
@@ -3478,8 +3432,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 			div_u64(se->avg.load_avg * se->avg.load_sum, se_weight(se));
 	}
 
-	se->avg.runnable_load_sum = se->avg.load_sum;
-
 	enqueue_load_avg(cfs_rq, se);
 	cfs_rq->avg.util_avg += se->avg.util_avg;
 	cfs_rq->avg.util_sum += se->avg.util_sum;
@@ -7736,9 +7688,6 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	if (cfs_rq->avg.util_sum)
 		return false;
 
-	if (cfs_rq->avg.runnable_load_sum)
-		return false;
-
 	return true;
 }
 
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index befce29bd882..32dc2791a517 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -106,7 +106,7 @@ static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
  */
 static __always_inline u32
 accumulate_sum(u64 delta, struct sched_avg *sa,
-	       unsigned long load, unsigned long runnable, int running)
+	       unsigned long load, int running)
 {
 	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
 	u64 periods;
@@ -119,8 +119,6 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
 	 */
 	if (periods) {
 		sa->load_sum = decay_load(sa->load_sum, periods);
-		sa->runnable_load_sum =
-			decay_load(sa->runnable_load_sum, periods);
 		sa->util_sum = decay_load((u64)(sa->util_sum), periods);
 
 		/*
@@ -134,8 +132,6 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
 
 	if (load)
 		sa->load_sum += load * contrib;
-	if (runnable)
-		sa->runnable_load_sum += runnable * contrib;
 	if (running)
 		sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
 
@@ -172,7 +168,7 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
  */
 static __always_inline int
 ___update_load_sum(u64 now, struct sched_avg *sa,
-		  unsigned long load, unsigned long runnable, int running)
+		  unsigned long load, int running)
 {
 	u64 delta;
 
@@ -206,7 +202,7 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
 	 * update_blocked_averages()
 	 */
 	if (!load)
-		runnable = running = 0;
+		running = 0;
 
 	/*
 	 * Now we know we crossed measurement unit boundaries. The *_avg
@@ -215,14 +211,14 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
 	 * Step 1: accumulate *_sum since last_update_time. If we haven't
 	 * crossed period boundaries, finish.
 	 */
-	if (!accumulate_sum(delta, sa, load, runnable, running))
+	if (!accumulate_sum(delta, sa, load, running))
 		return 0;
 
 	return 1;
 }
 
 static __always_inline void
-___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runnable)
+___update_load_avg(struct sched_avg *sa, unsigned long load)
 {
 	u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
 
@@ -230,7 +226,6 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna
 	 * Step 2: update *_avg.
 	 */
 	sa->load_avg = div_u64(load * sa->load_sum, divider);
-	sa->runnable_load_avg =	div_u64(runnable * sa->runnable_load_sum, divider);
 	WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
 }
 
@@ -263,8 +258,8 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna
 
 int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
 {
-	if (___update_load_sum(now, &se->avg, 0, 0, 0)) {
-		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
+	if (___update_load_sum(now, &se->avg, 0, 0)) {
+		___update_load_avg(&se->avg, se_weight(se));
 		return 1;
 	}
 
@@ -273,10 +268,10 @@ int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
 
 int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if (___update_load_sum(now, &se->avg, !!se->on_rq, !!se->on_rq,
+	if (___update_load_sum(now, &se->avg, !!se->on_rq,
 				cfs_rq->curr == se)) {
 
-		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
+		___update_load_avg(&se->avg, se_weight(se));
 		cfs_se_util_change(&se->avg);
 		return 1;
 	}
@@ -288,10 +283,9 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
 {
 	if (___update_load_sum(now, &cfs_rq->avg,
 				scale_load_down(cfs_rq->load.weight),
-				scale_load_down(cfs_rq->runnable_weight),
 				cfs_rq->curr != NULL)) {
 
-		___update_load_avg(&cfs_rq->avg, 1, 1);
+		___update_load_avg(&cfs_rq->avg, 1);
 		return 1;
 	}
 
@@ -303,20 +297,18 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
  *
  *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
  *   util_sum = cpu_scale * load_sum
- *   runnable_load_sum = load_sum
  *
- *   load_avg and runnable_load_avg are not supported and meaningless.
+ *   load_avg is not supported and meaningless.
  *
  */
 
 int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
 {
 	if (___update_load_sum(now, &rq->avg_rt,
-				running,
 				running,
 				running)) {
 
-		___update_load_avg(&rq->avg_rt, 1, 1);
+		___update_load_avg(&rq->avg_rt, 1);
 		return 1;
 	}
 
@@ -328,18 +320,16 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
  *
  *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
  *   util_sum = cpu_scale * load_sum
- *   runnable_load_sum = load_sum
  *
  */
 
 int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
 {
 	if (___update_load_sum(now, &rq->avg_dl,
-				running,
 				running,
 				running)) {
 
-		___update_load_avg(&rq->avg_dl, 1, 1);
+		___update_load_avg(&rq->avg_dl, 1);
 		return 1;
 	}
 
@@ -352,7 +342,6 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
  *
  *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
  *   util_sum = cpu_scale * load_sum
- *   runnable_load_sum = load_sum
  *
  */
 
@@ -379,17 +368,11 @@ int update_irq_load_avg(struct rq *rq, u64 running)
 	 * We can safely remove running from rq->clock because
 	 * rq->clock += delta with delta >= running
 	 */
-	ret = ___update_load_sum(rq->clock - running, &rq->avg_irq,
-				0,
-				0,
-				0);
-	ret += ___update_load_sum(rq->clock, &rq->avg_irq,
-				1,
-				1,
-				1);
+	ret = ___update_load_sum(rq->clock - running, &rq->avg_irq, 0);
+	ret += ___update_load_sum(rq->clock, &rq->avg_irq, 1);
 
 	if (ret)
-		___update_load_avg(&rq->avg_irq, 1, 1);
+		___update_load_avg(&rq->avg_irq, 1);
 
 	return ret;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1ada0be..5be14cee61f9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -487,7 +487,6 @@ struct cfs_bandwidth { };
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight	load;
-	unsigned long		runnable_weight;
 	unsigned int		nr_running;
 	unsigned int		h_nr_running;
 
@@ -700,11 +699,6 @@ static inline long se_weight(struct sched_entity *se)
 	return scale_load_down(se->load.weight);
 }
 
-static inline long se_runnable(struct sched_entity *se)
-{
-	return scale_load_down(se->runnable_weight);
-}
-
 static inline bool sched_asym_prefer(int a, int b)
 {
 	return arch_asym_cpu_priority(a) > arch_asym_cpu_priority(b);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 4/8] sched,fair: remove cfs rqs from leaf_cfs_rq_list bottom up
  2019-06-12 19:32 [RFC] sched,cfs: flatten CPU controller runqueues Rik van Riel
                   ` (2 preceding siblings ...)
  2019-06-12 19:32 ` [PATCH 3/8] sched,fair: redefine runnable_load_avg as the sum of task_h_load Rik van Riel
@ 2019-06-12 19:32 ` Rik van Riel
  2019-06-12 19:32 ` [PATCH 5/8] sched,cfs: use explicit cfs_rq of parent se helper Rik van Riel
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Rik van Riel @ 2019-06-12 19:32 UTC (permalink / raw)
  To: peterz
  Cc: mingo, linux-kernel, kernel-team, morten.rasmussen, tglx,
	dietmar.eggeman, mgorman, vincent.guittot, Rik van Riel

The overhead of the CPU controller is reduced by not walking all the
sched_entities every time a task is enqueued or dequeued.

One of the things currently checked every single time is whether the cfs_rq
is on the rq->leaf_cfs_rq_list.

By removing a cfs_rq from the list only once it no longer has children on
the list, we can, once the runqueues have been flattened, stop walking the
sched_entity hierarchy as soon as we find that the bottom cfs_rq is already
on the list.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 kernel/sched/fair.c  | 17 +++++++++++++++++
 kernel/sched/sched.h |  1 +
 2 files changed, 18 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aebd43d74468..dcc521d251e3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -285,6 +285,13 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 
 	cfs_rq->on_list = 1;
 
+	/*
+	 * If the tmp_alone_branch cursor was moved, it means a child cfs_rq
+	 * is already on the list ahead of us.
+	 */
+	if (rq->tmp_alone_branch != &rq->leaf_cfs_rq_list)
+		cfs_rq->children_on_list++;
+
 	/*
 	 * Ensure we either appear before our parent (if already
 	 * enqueued) or force our parent to appear after us when it is
@@ -310,6 +317,7 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 		 * list.
 		 */
 		rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
+		cfs_rq->tg->parent->cfs_rq[cpu]->children_on_list++;
 		return true;
 	}
 
@@ -358,6 +366,11 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 		if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list)
 			rq->tmp_alone_branch = cfs_rq->leaf_cfs_rq_list.prev;
 
+		if (cfs_rq->tg->parent) {
+			int cpu = cpu_of(rq);
+			cfs_rq->tg->parent->cfs_rq[cpu]->children_on_list--;
+		}
+
 		list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
 		cfs_rq->on_list = 0;
 	}
@@ -7688,6 +7701,10 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	if (cfs_rq->avg.util_sum)
 		return false;
 
+	/* Remove decayed parents once their decayed children are gone. */
+	if (cfs_rq->children_on_list)
+		return false;
+
 	return true;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5be14cee61f9..18494b1a9bac 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -556,6 +556,7 @@ struct cfs_rq {
 	 * This list is used during load balance.
 	 */
 	int			on_list;
+	int			children_on_list;
 	struct list_head	leaf_cfs_rq_list;
 	struct task_group	*tg;	/* group that "owns" this runqueue */
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 5/8] sched,cfs: use explicit cfs_rq of parent se helper
  2019-06-12 19:32 [RFC] sched,cfs: flatten CPU controller runqueues Rik van Riel
                   ` (3 preceding siblings ...)
  2019-06-12 19:32 ` [PATCH 4/8] sched,fair: remove cfs rqs from leaf_cfs_rq_list bottom up Rik van Riel
@ 2019-06-12 19:32 ` Rik van Riel
  2019-06-20 16:23   ` Dietmar Eggemann
  2019-06-26 15:58   ` Dietmar Eggemann
  2019-06-12 19:32 ` [PATCH 6/8] sched,cfs: fix zero length timeslice calculation Rik van Riel
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 24+ messages in thread
From: Rik van Riel @ 2019-06-12 19:32 UTC (permalink / raw)
  To: peterz
  Cc: mingo, linux-kernel, kernel-team, morten.rasmussen, tglx,
	dietmar.eggeman, mgorman, vincent.guittot, Rik van Riel

Use an explicit "cfs_rq of parent sched_entity" helper in a few
strategic places, where cfs_rq_of(se) may no longer point at the
right runqueue once we flatten the hierarchical cgroup runqueues.

No functional change.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 kernel/sched/fair.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dcc521d251e3..c6ede2ecc935 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -275,6 +275,15 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
 	return grp->my_q;
 }
 
+/* runqueue owned by the parent entity */
+static inline struct cfs_rq *group_cfs_rq_of_parent(struct sched_entity *se)
+{
+	if (se->parent)
+		return group_cfs_rq(se->parent);
+
+	return &cfs_rq_of(se)->rq->cfs;
+}
+
 static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
@@ -3298,7 +3307,7 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
 
 	gcfs_rq->propagate = 0;
 
-	cfs_rq = cfs_rq_of(se);
+	cfs_rq = group_cfs_rq_of_parent(se);
 
 	add_tg_cfs_propagate(cfs_rq, gcfs_rq->prop_runnable_sum);
 
@@ -7779,7 +7788,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 
 	WRITE_ONCE(cfs_rq->h_load_next, NULL);
 	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
+		cfs_rq = group_cfs_rq_of_parent(se);
 		WRITE_ONCE(cfs_rq->h_load_next, se);
 		if (cfs_rq->last_h_load_update == now)
 			break;
@@ -7802,7 +7811,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 
 static unsigned long task_se_h_load(struct sched_entity *se)
 {
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	struct cfs_rq *cfs_rq = group_cfs_rq_of_parent(se);
 
 	update_cfs_rq_h_load(cfs_rq);
 	return div64_ul(se->avg.load_avg * cfs_rq->h_load,
@@ -10159,7 +10168,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	struct sched_entity *se = &curr->se;
 
 	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
+		cfs_rq = group_cfs_rq_of_parent(se);
 		entity_tick(cfs_rq, se, queued);
 	}
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 6/8] sched,cfs: fix zero length timeslice calculation
  2019-06-12 19:32 [RFC] sched,cfs: flatten CPU controller runqueues Rik van Riel
                   ` (4 preceding siblings ...)
  2019-06-12 19:32 ` [PATCH 5/8] sched,cfs: use explicit cfs_rq of parent se helper Rik van Riel
@ 2019-06-12 19:32 ` Rik van Riel
  2019-06-12 19:32 ` [PATCH 7/8] sched,fair: refactor enqueue/dequeue_entity Rik van Riel
  2019-06-12 19:32 ` [PATCH 8/8] sched,fair: flatten hierarchical runqueues Rik van Riel
  7 siblings, 0 replies; 24+ messages in thread
From: Rik van Riel @ 2019-06-12 19:32 UTC (permalink / raw)
  To: peterz
  Cc: mingo, linux-kernel, kernel-team, morten.rasmussen, tglx,
	dietmar.eggeman, mgorman, vincent.guittot, Rik van Riel

The way the time slice length is currently calculated, not only do high
priority tasks get longer time slices than low priority tasks, but due
to fixed point math, low priority tasks could end up with a zero length
time slice. This can lead to cache thrashing and other inefficiencies.

Simplify the logic a little bit, and cap the minimum time slice length
to sysctl_sched_min_granularity.

Tasks whose time slice is too long for their relative priority will simply
have their vruntime advance much faster than that of other tasks, so they
receive time slices less frequently.
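
For illustration only, with plain integer division standing in for
__calc_delta()'s fixed-point math and made-up weights (a nice +19 task
inside two nested minimum-shares cgroups, each level also containing two
busy nice-0 entities), the old per-level scaling in sched_slice() can
truncate all the way down to zero, while the new code clamps the result:

#include <stdio.h>

int main(void)
{
	unsigned long long slice = 24000000ULL;		/* sysctl_sched_latency, ns */
	unsigned long long min_gran = 3000000ULL;	/* sysctl_sched_min_granularity */
	/* entity weight vs. total cfs_rq weight, walking up the hierarchy */
	unsigned long long se_weight[] = { 15, 2, 2 };
	unsigned long long rq_weight[] = { 15 + 2 * 1024, 2 + 2 * 1024, 2 + 2 * 1024 };
	int level;

	for (level = 0; level < 3; level++)
		slice = slice * se_weight[level] / rq_weight[level];

	printf("old slice: %llu ns\n", slice);			/* prints 0 */
	printf("new slice: %llu ns\n",
	       slice > min_gran ? slice : min_gran);		/* clamped to 3 ms */
	return 0;
}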

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 kernel/sched/fair.c | 25 ++++++++-----------------
 1 file changed, 8 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c6ede2ecc935..35153a89d5c5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -670,22 +670,6 @@ static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
 	return delta;
 }
 
-/*
- * The idea is to set a period in which each task runs once.
- *
- * When there are too many tasks (sched_nr_latency) we have to stretch
- * this period because otherwise the slices get too small.
- *
- * p = (nr <= nl) ? l : l*nr/nl
- */
-static u64 __sched_period(unsigned long nr_running)
-{
-	if (unlikely(nr_running > sched_nr_latency))
-		return nr_running * sysctl_sched_min_granularity;
-	else
-		return sysctl_sched_latency;
-}
-
 /*
  * We calculate the wall-time slice from the period by taking a part
  * proportional to the weight.
@@ -694,7 +678,7 @@ static u64 __sched_period(unsigned long nr_running)
  */
 static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);
+	u64 slice = sysctl_sched_latency;
 
 	for_each_sched_entity(se) {
 		struct load_weight *load;
@@ -711,6 +695,13 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		}
 		slice = __calc_delta(slice, se->load.weight, load);
 	}
+
+	/*
+	 * To avoid cache thrashing, run at least sysctl_sched_min_granularity.
+	 * The vruntime of a low priority task advances faster; those tasks
+	 * will simply get time slices less frequently.
+	 */
+	slice = max_t(u64, slice, sysctl_sched_min_granularity);
 	return slice;
 }
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 7/8] sched,fair: refactor enqueue/dequeue_entity
  2019-06-12 19:32 [RFC] sched,cfs: flatten CPU controller runqueues Rik van Riel
                   ` (5 preceding siblings ...)
  2019-06-12 19:32 ` [PATCH 6/8] sched,cfs: fix zero length timeslice calculation Rik van Riel
@ 2019-06-12 19:32 ` Rik van Riel
  2019-06-12 19:32 ` [PATCH 8/8] sched,fair: flatten hierarchical runqueues Rik van Riel
  7 siblings, 0 replies; 24+ messages in thread
From: Rik van Riel @ 2019-06-12 19:32 UTC (permalink / raw)
  To: peterz
  Cc: mingo, linux-kernel, kernel-team, morten.rasmussen, tglx,
	dietmar.eggeman, mgorman, vincent.guittot, Rik van Riel

Refactor enqueue_entity, dequeue_entity, and update_load_avg in order
to split out the things that, with a flat runqueue, we still want to
happen at every level in the cgroup hierarchy from the things that only
need to happen once.

No functional changes.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 kernel/sched/fair.c | 65 +++++++++++++++++++++++++++++----------------
 1 file changed, 42 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 35153a89d5c5..c2baf3c8a879 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3481,17 +3481,17 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 #define DO_ATTACH	0x4
 
 /* Update task and its cfs_rq load average */
-static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+static inline bool update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
 	u64 now = cfs_rq_clock_pelt(cfs_rq);
-	int decayed;
+	int decayed, updated = 0;
 
 	/*
 	 * Track task load average for carrying it to new CPU after migrated, and
 	 * track group sched_entity load average for task_h_load calc in migration
 	 */
 	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
-		__update_load_avg_se(now, cfs_rq, se);
+		updated = __update_load_avg_se(now, cfs_rq, se);
 
 	decayed  = update_cfs_rq_load_avg(now, cfs_rq);
 	decayed |= propagate_entity_load_avg(se);
@@ -3510,6 +3510,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 
 	} else if (decayed && (flags & UPDATE_TG))
 		update_tg_load_avg(cfs_rq, 0);
+
+	return decayed | updated;
 }
 
 #ifndef CONFIG_64BIT
@@ -3851,6 +3853,24 @@ static inline void check_schedstat_required(void)
  * CPU and an up-to-date min_vruntime on the destination CPU.
  */
 
+static bool
+enqueue_entity_groups(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+{
+	/*
+	 * When enqueuing a sched_entity, we must:
+	 *   - Update loads to have both entity and cfs_rq synced with now.
+	 *   - Add its load to cfs_rq->runnable_avg
+	 *   - For group_entity, update its weight to reflect the new share of
+	 *     its group cfs_rq
+	 *   - Add its new weight to cfs_rq->load.weight
+	 */
+	if (!update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH))
+		return false;
+
+	update_cfs_group(se);
+	return true;
+}
+
 static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
@@ -3875,16 +3895,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (renorm && !curr)
 		se->vruntime += cfs_rq->min_vruntime;
 
-	/*
-	 * When enqueuing a sched_entity, we must:
-	 *   - Update loads to have both entity and cfs_rq synced with now.
-	 *   - Add its load to cfs_rq->runnable_avg
-	 *   - For group_entity, update its weight to reflect the new share of
-	 *     its group cfs_rq
-	 *   - Add its new weight to cfs_rq->load.weight
-	 */
-	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
-	update_cfs_group(se);
 	enqueue_runnable_load_avg(cfs_rq, se);
 	account_entity_enqueue(cfs_rq, se);
 
@@ -3951,14 +3961,9 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
 
-static void
-dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+static bool 
+dequeue_entity_groups(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
-	/*
-	 * Update run-time statistics of the 'current'.
-	 */
-	update_curr(cfs_rq);
-
 	/*
 	 * When dequeuing a sched_entity, we must:
 	 *   - Update loads to have both entity and cfs_rq synced with now.
@@ -3967,7 +3972,21 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 *   - For group entity, update its weight to reflect the new share
 	 *     of its group cfs_rq.
 	 */
-	update_load_avg(cfs_rq, se, UPDATE_TG);
+	if (!update_load_avg(cfs_rq, se, UPDATE_TG))
+		return false;
+	update_cfs_group(se);
+
+	return true;
+}
+
+static void
+dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+{
+	/*
+	 * Update run-time statistics of the 'current'.
+	 */
+	update_curr(cfs_rq);
+
 	dequeue_runnable_load_avg(cfs_rq, se);
 
 	update_stats_dequeue(cfs_rq, se, flags);
@@ -3991,8 +4010,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	/* return excess runtime on last dequeue */
 	return_cfs_rq_runtime(cfs_rq);
 
-	update_cfs_group(se);
-
 	/*
 	 * Now advance min_vruntime if @se was the entity holding it back,
 	 * except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be
@@ -5157,6 +5174,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (se->on_rq)
 			break;
 		cfs_rq = cfs_rq_of(se);
+		enqueue_entity_groups(cfs_rq, se, flags);
 		enqueue_entity(cfs_rq, se, flags);
 
 		/*
@@ -5239,6 +5257,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
+		dequeue_entity_groups(cfs_rq, se, flags);
 		dequeue_entity(cfs_rq, se, flags);
 
 		/*
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 8/8] sched,fair: flatten hierarchical runqueues
  2019-06-12 19:32 [RFC] sched,cfs: flatten CPU controller runqueues Rik van Riel
                   ` (6 preceding siblings ...)
  2019-06-12 19:32 ` [PATCH 7/8] sched,fair: refactor enqueue/dequeue_entity Rik van Riel
@ 2019-06-12 19:32 ` Rik van Riel
  2019-06-25  9:50   ` Dietmar Eggemann
  2019-06-28 10:26   ` Dietmar Eggemann
  7 siblings, 2 replies; 24+ messages in thread
From: Rik van Riel @ 2019-06-12 19:32 UTC (permalink / raw)
  To: peterz
  Cc: mingo, linux-kernel, kernel-team, morten.rasmussen, tglx,
	dietmar.eggeman, mgorman, vincent.guittot, Rik van Riel

Flatten the hierarchical runqueues into just the per CPU rq.cfs runqueue.

Iteration of the sched_entity hierarchy is rate limited to once per jiffy
per sched_entity, which is a smaller change than it seems, because load
average adjustments were already rate limited to once per jiffy before this
patch series.

This patch breaks CONFIG_CFS_BANDWIDTH. The plan for that is to park tasks
from throttled cgroups on their cgroup runqueues, and, once the cgroup gets
unthrottled, slowly wake them back up in vruntime order (using the
GENTLE_FAIR_SLEEPERS logic) to prevent thundering herd issues.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 include/linux/sched.h |   2 +
 kernel/sched/fair.c   | 478 +++++++++++++++++-------------------------
 kernel/sched/pelt.c   |   6 +-
 kernel/sched/pelt.h   |   2 +-
 kernel/sched/sched.h  |   2 +-
 5 files changed, 194 insertions(+), 296 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f5bb6948e40c..05ed40b304dc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -454,6 +454,8 @@ struct sched_entity {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	int				depth;
 	unsigned long			enqueued_h_load;
+	unsigned long			enqueued_h_weight;
+	struct load_weight		h_load;
 	struct sched_entity		*parent;
 	/* rq on which this entity is (to be) queued: */
 	struct cfs_rq			*cfs_rq;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c2baf3c8a879..29bdfbd4dc2e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -242,6 +242,9 @@ static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight
 
 
 const struct sched_class fair_sched_class;
+static unsigned long task_se_h_weight(struct sched_entity *se);
+static unsigned long task_se_h_load(struct sched_entity *se);
+static unsigned long task_h_load(struct task_struct *p);
 
 /**************************************************************
  * CFS operations on generic schedulable entities:
@@ -395,7 +398,6 @@ static inline void assert_list_leaf_cfs_rq(struct rq *rq)
 	list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list,	\
 				 leaf_cfs_rq_list)
 
-/* Do the two (enqueued) entities belong to the same group ? */
 static inline struct cfs_rq *
 is_same_group(struct sched_entity *se, struct sched_entity *pse)
 {
@@ -410,6 +412,11 @@ static inline struct sched_entity *parent_entity(struct sched_entity *se)
 	return se->parent;
 }
 
+static inline bool task_se_in_cgroup(struct sched_entity *se)
+{
+	return parent_entity(se);
+}
+
 static void
 find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 {
@@ -442,6 +449,19 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 	}
 }
 
+/* Add the cgroup cfs_rqs to the list, for update_blocked_averages */
+static void enqueue_entity_cfs_rqs(struct sched_entity *se)
+{
+	SCHED_WARN_ON(!entity_is_task(se));
+
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = group_cfs_rq_of_parent(se);
+
+		if (list_add_leaf_cfs_rq(cfs_rq))
+			break;
+	}
+}
+
 #else	/* !CONFIG_FAIR_GROUP_SCHED */
 
 static inline struct task_struct *task_of(struct sched_entity *se)
@@ -492,6 +512,11 @@ static inline struct sched_entity *parent_entity(struct sched_entity *se)
 	return NULL;
 }
 
+static inline bool task_se_in_cgroup(struct sched_entity *se)
+{
+	return false;
+}
+
 static inline void
 find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 {
@@ -664,8 +689,14 @@ int sched_proc_update_handler(struct ctl_table *table, int write,
  */
 static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
 {
-	if (unlikely(se->load.weight != NICE_0_LOAD))
+	if (task_se_in_cgroup(se)) {
+		unsigned long h_load = task_se_h_load(se);
+		if (h_load != se->h_load.weight)
+			update_load_set(&se->h_load, h_load);
+		delta = __calc_delta(delta, NICE_0_LOAD, &se->h_load);
+	} else if (unlikely(se->load.weight != NICE_0_LOAD)) {
 		delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
+	}
 
 	return delta;
 }
@@ -679,22 +710,16 @@ static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
 static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	u64 slice = sysctl_sched_latency;
+	struct load_weight *load = &cfs_rq->load;
+	struct load_weight lw;
 
-	for_each_sched_entity(se) {
-		struct load_weight *load;
-		struct load_weight lw;
+	if (unlikely(!se->on_rq)) {
+		lw = cfs_rq->load;
 
-		cfs_rq = cfs_rq_of(se);
-		load = &cfs_rq->load;
-
-		if (unlikely(!se->on_rq)) {
-			lw = cfs_rq->load;
-
-			update_load_add(&lw, se->load.weight);
-			load = &lw;
-		}
-		slice = __calc_delta(slice, se->load.weight, load);
+		update_load_add(&lw, task_se_h_load(se));
+		load = &lw;
 	}
+	slice = __calc_delta(slice, task_se_h_load(se), load);
 
 	/*
 	 * To avoid cache thrashing, run at least sysctl_sched_min_granularity.
@@ -719,8 +744,6 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 #ifdef CONFIG_SMP
 
 static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
-static unsigned long task_se_h_load(struct sched_entity *se);
-static unsigned long task_h_load(struct task_struct *p);
 static unsigned long capacity_of(int cpu);
 
 /* Give new sched_entity start runnable values to heavy its load in infant time */
@@ -2697,16 +2720,28 @@ static inline void update_scan_period(struct task_struct *p, int new_cpu)
 static void
 account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	update_load_add(&cfs_rq->load, se->load.weight);
-	if (!parent_entity(se))
+	struct rq *rq;
+
+	if (task_se_in_cgroup(se)) {
+		struct cfs_rq *cgroup_rq = group_cfs_rq_of_parent(se);
+		unsigned long h_weight;
+
+		update_load_add(&cgroup_rq->load, se->load.weight);
+		cgroup_rq->nr_running++;
+
+		/* Add the hierarchical weight to the CPU rq */
+		h_weight = task_se_h_weight(se);
+		se->enqueued_h_weight = h_weight;
+		update_load_add(&rq_of(cfs_rq)->load, h_weight);
+	} else {
+		update_load_add(&cfs_rq->load, se->load.weight);
 		update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
+	}
 #ifdef CONFIG_SMP
-	if (entity_is_task(se)) {
-		struct rq *rq = rq_of(cfs_rq);
+	rq = rq_of(cfs_rq);
 
-		account_numa_enqueue(rq, task_of(se));
-		list_add(&se->group_node, &rq->cfs_tasks);
-	}
+	account_numa_enqueue(rq, task_of(se));
+	list_add(&se->group_node, &rq->cfs_tasks);
 #endif
 	cfs_rq->nr_running++;
 }
@@ -2714,14 +2749,20 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 static void
 account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	update_load_sub(&cfs_rq->load, se->load.weight);
-	if (!parent_entity(se))
+	if (task_se_in_cgroup(se)) {
+		struct cfs_rq *cgroup_rq = group_cfs_rq_of_parent(se);
+
+		update_load_sub(&cgroup_rq->load, se->load.weight);
+		cgroup_rq->nr_running--;
+
+		update_load_sub(&rq_of(cfs_rq)->load, se->enqueued_h_weight);
+	} else {
+		update_load_sub(&cfs_rq->load, se->load.weight);
 		update_load_sub(&rq_of(cfs_rq)->load, se->load.weight);
-#ifdef CONFIG_SMP
-	if (entity_is_task(se)) {
-		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
-		list_del_init(&se->group_node);
 	}
+#ifdef CONFIG_SMP
+	account_numa_dequeue(rq_of(cfs_rq), task_of(se));
+	list_del_init(&se->group_node);
 #endif
 	cfs_rq->nr_running--;
 }
@@ -2816,6 +2857,9 @@ update_runnable_load_avg(struct sched_entity *se)
 static inline void
 enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	if (task_se_in_cgroup(se))
+		cfs_rq = group_cfs_rq_of_parent(se);
+
 	cfs_rq->avg.load_avg += se->avg.load_avg;
 	cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum;
 }
@@ -2823,6 +2867,9 @@ enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 static inline void
 dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	if (task_se_in_cgroup(se))
+		cfs_rq = group_cfs_rq_of_parent(se);
+
 	sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
 	sub_positive(&cfs_rq->avg.load_sum, se_weight(se) * se->avg.load_sum);
 }
@@ -3449,7 +3496,9 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	cfs_rq->avg.util_avg += se->avg.util_avg;
 	cfs_rq->avg.util_sum += se->avg.util_sum;
 
-	add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
+	if (task_se_in_cgroup(se))
+		add_tg_cfs_propagate(group_cfs_rq_of_parent(se),
+						se->avg.load_sum);
 
 	cfs_rq_util_change(cfs_rq, flags);
 }
@@ -3468,7 +3517,9 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
 	sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
 
-	add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
+	if (task_se_in_cgroup(se))
+		add_tg_cfs_propagate(group_cfs_rq_of_parent(se),
+						-se->avg.load_sum);
 
 	cfs_rq_util_change(cfs_rq, 0);
 }
@@ -3479,11 +3530,13 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 #define UPDATE_TG	0x1
 #define SKIP_AGE_LOAD	0x2
 #define DO_ATTACH	0x4
+#define SE_IS_CURRENT	0x8
 
 /* Update task and its cfs_rq load average */
 static inline bool update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
 	u64 now = cfs_rq_clock_pelt(cfs_rq);
+	bool curr = flags & SE_IS_CURRENT;
 	int decayed, updated = 0;
 
 	/*
@@ -3491,7 +3544,7 @@ static inline bool update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	 * track group sched_entity load average for task_h_load calc in migration
 	 */
 	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
-		updated = __update_load_avg_se(now, cfs_rq, se);
+		updated = __update_load_avg_se(now, cfs_rq, se, curr, curr);
 
 	decayed  = update_cfs_rq_load_avg(now, cfs_rq);
 	decayed |= propagate_entity_load_avg(se);
@@ -3727,6 +3780,7 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
 #define UPDATE_TG	0x0
 #define SKIP_AGE_LOAD	0x0
 #define DO_ATTACH	0x0
+#define SE_IS_CURRENT	0x0
 
 static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
 {
@@ -3908,55 +3962,20 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
 
-	if (cfs_rq->nr_running == 1) {
-		list_add_leaf_cfs_rq(cfs_rq);
-		check_enqueue_throttle(cfs_rq);
-	}
-}
-
-static void __clear_buddies_last(struct sched_entity *se)
-{
-	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
-		if (cfs_rq->last != se)
-			break;
-
-		cfs_rq->last = NULL;
-	}
-}
-
-static void __clear_buddies_next(struct sched_entity *se)
-{
-	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
-		if (cfs_rq->next != se)
-			break;
-
-		cfs_rq->next = NULL;
-	}
-}
-
-static void __clear_buddies_skip(struct sched_entity *se)
-{
-	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
-		if (cfs_rq->skip != se)
-			break;
-
-		cfs_rq->skip = NULL;
-	}
+	if (task_se_in_cgroup(se))
+		enqueue_entity_cfs_rqs(se);
 }
 
 static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	if (cfs_rq->last == se)
-		__clear_buddies_last(se);
+		cfs_rq->last = NULL;
 
 	if (cfs_rq->next == se)
-		__clear_buddies_next(se);
+		cfs_rq->next = NULL;
 
 	if (cfs_rq->skip == se)
-		__clear_buddies_skip(se);
+		cfs_rq->skip = NULL;
 }
 
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
@@ -4065,6 +4084,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	/* 'current' is not kept within the tree. */
 	if (se->on_rq) {
+		struct sched_entity *ise = se;
 		/*
 		 * Any task has to be enqueued before it get to execute on
 		 * a CPU. So account for the time it spent waiting on the
@@ -4072,7 +4092,11 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		 */
 		update_stats_wait_end(cfs_rq, se);
 		__dequeue_entity(cfs_rq, se);
-		update_load_avg(cfs_rq, se, UPDATE_TG);
+		for_each_sched_entity(ise) {
+			struct cfs_rq *group_rq = group_cfs_rq_of_parent(ise);
+			if (!update_load_avg(group_rq, ise, UPDATE_TG))
+				break;
+		}
 	}
 
 	update_stats_curr_start(cfs_rq, se);
@@ -4170,11 +4194,16 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 	check_spread(cfs_rq, prev);
 
 	if (prev->on_rq) {
+		struct sched_entity *se = prev;
 		update_stats_wait_start(cfs_rq, prev);
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);
 		/* in !on_rq case, update occurred at dequeue */
-		update_load_avg(cfs_rq, prev, 0);
+		for_each_sched_entity(se) {
+			struct cfs_rq *group_rq = group_cfs_rq_of_parent(se);
+			if (!update_load_avg(group_rq, se, SE_IS_CURRENT))
+				break;
+		}
 	}
 	cfs_rq->curr = NULL;
 }
@@ -4190,7 +4219,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	/*
 	 * Ensure that runnable average is periodically updated.
 	 */
-	update_load_avg(cfs_rq, curr, UPDATE_TG);
+	update_load_avg(cfs_rq, curr, UPDATE_TG|SE_IS_CURRENT);
 	update_cfs_group(curr);
 
 #ifdef CONFIG_SCHED_HRTICK
@@ -4209,9 +4238,6 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 			hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
 		return;
 #endif
-
-	if (cfs_rq->nr_running > 1)
-		check_preempt_tick(cfs_rq, curr);
 }
 
 
@@ -5086,7 +5112,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 
 	SCHED_WARN_ON(task_rq(p) != rq);
 
-	if (rq->cfs.h_nr_running > 1) {
+	if (rq->cfs.nr_running > 1) {
 		u64 slice = sched_slice(cfs_rq, se);
 		u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
 		s64 delta = slice - ran;
@@ -5151,7 +5177,7 @@ static inline void update_overutilized_status(struct rq *rq) { }
 static void
 enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
-	struct cfs_rq *cfs_rq;
+	struct cfs_rq *cfs_rq = &rq->cfs;
 	struct sched_entity *se = &p->se;
 
 	/*
@@ -5160,7 +5186,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	 * Let's add the task's estimated utilization to the cfs_rq's
 	 * estimated utilization, before we update schedutil.
 	 */
-	util_est_enqueue(&rq->cfs, p);
+	util_est_enqueue(cfs_rq, p);
 
 	/*
 	 * If in_iowait is set, the code below may not trigger any cpufreq
@@ -5171,37 +5197,13 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
 
 	for_each_sched_entity(se) {
-		if (se->on_rq)
+		struct cfs_rq *group_rq = group_cfs_rq_of_parent(se);
+		if (!enqueue_entity_groups(group_rq, se, flags))
 			break;
-		cfs_rq = cfs_rq_of(se);
-		enqueue_entity_groups(cfs_rq, se, flags);
-		enqueue_entity(cfs_rq, se, flags);
-
-		/*
-		 * end evaluation on encountering a throttled cfs_rq
-		 *
-		 * note: in the case of encountering a throttled cfs_rq we will
-		 * post the final h_nr_running increment below.
-		 */
-		if (cfs_rq_throttled(cfs_rq))
-			break;
-		cfs_rq->h_nr_running++;
-
-		flags = ENQUEUE_WAKEUP;
 	}
 
-	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
-		cfs_rq->h_nr_running++;
-
-		if (cfs_rq_throttled(cfs_rq))
-			break;
-
-		update_load_avg(cfs_rq, se, UPDATE_TG);
-		update_cfs_group(se);
-	}
+	enqueue_entity(cfs_rq, &p->se, flags);
 
-	if (!se) {
 		add_nr_running(rq, 1);
 		/*
 		 * Since new tasks are assigned an initial util_avg equal to
@@ -5220,23 +5222,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (flags & ENQUEUE_WAKEUP)
 			update_overutilized_status(rq);
 
-	}
-
-	if (cfs_bandwidth_used()) {
-		/*
-		 * When bandwidth control is enabled; the cfs_rq_throttled()
-		 * breaks in the above iteration can result in incomplete
-		 * leaf list maintenance, resulting in triggering the assertion
-		 * below.
-		 */
-		for_each_sched_entity(se) {
-			cfs_rq = cfs_rq_of(se);
-
-			if (list_add_leaf_cfs_rq(cfs_rq))
-				break;
-		}
-	}
-
 	assert_list_leaf_cfs_rq(rq);
 
 	hrtick_update(rq);
@@ -5251,55 +5236,21 @@ static void set_next_buddy(struct sched_entity *se);
  */
 static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
-	struct cfs_rq *cfs_rq;
+	struct cfs_rq *cfs_rq = &rq->cfs;
 	struct sched_entity *se = &p->se;
 	int task_sleep = flags & DEQUEUE_SLEEP;
 
 	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
-		dequeue_entity_groups(cfs_rq, se, flags);
-		dequeue_entity(cfs_rq, se, flags);
-
-		/*
-		 * end evaluation on encountering a throttled cfs_rq
-		 *
-		 * note: in the case of encountering a throttled cfs_rq we will
-		 * post the final h_nr_running decrement below.
-		*/
-		if (cfs_rq_throttled(cfs_rq))
+		struct cfs_rq *group_rq = group_cfs_rq_of_parent(se);
+		if (!dequeue_entity_groups(group_rq, se, flags | SE_IS_CURRENT))
 			break;
-		cfs_rq->h_nr_running--;
-
-		/* Don't dequeue parent if it has other entities besides us */
-		if (cfs_rq->load.weight) {
-			/* Avoid re-evaluating load for this entity: */
-			se = parent_entity(se);
-			/*
-			 * Bias pick_next to pick a task from this cfs_rq, as
-			 * p is sleeping when it is within its sched_slice.
-			 */
-			if (task_sleep && se && !throttled_hierarchy(cfs_rq))
-				set_next_buddy(se);
-			break;
-		}
-		flags |= DEQUEUE_SLEEP;
 	}
 
-	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
-		cfs_rq->h_nr_running--;
+	dequeue_entity(cfs_rq, &p->se, flags);
 
-		if (cfs_rq_throttled(cfs_rq))
-			break;
+	sub_nr_running(rq, 1);
 
-		update_load_avg(cfs_rq, se, UPDATE_TG);
-		update_cfs_group(se);
-	}
-
-	if (!se)
-		sub_nr_running(rq, 1);
-
-	util_est_dequeue(&rq->cfs, p, task_sleep);
+	util_est_dequeue(cfs_rq, p, task_sleep);
 	hrtick_update(rq);
 }
 
@@ -5622,7 +5573,7 @@ static unsigned long capacity_of(int cpu)
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	unsigned long nr_running = READ_ONCE(rq->cfs.h_nr_running);
+	unsigned long nr_running = READ_ONCE(rq->cfs.nr_running);
 	unsigned long load_avg = weighted_cpuload(rq);
 
 	if (nr_running)
@@ -6841,11 +6792,9 @@ static void set_last_buddy(struct sched_entity *se)
 	if (entity_is_task(se) && unlikely(task_has_idle_policy(task_of(se))))
 		return;
 
-	for_each_sched_entity(se) {
-		if (SCHED_WARN_ON(!se->on_rq))
-			return;
-		cfs_rq_of(se)->last = se;
-	}
+	if (SCHED_WARN_ON(!se->on_rq))
+		return;
+	cfs_rq_of(se)->last = se;
 }
 
 static void set_next_buddy(struct sched_entity *se)
@@ -6853,17 +6802,14 @@ static void set_next_buddy(struct sched_entity *se)
 	if (entity_is_task(se) && unlikely(task_has_idle_policy(task_of(se))))
 		return;
 
-	for_each_sched_entity(se) {
-		if (SCHED_WARN_ON(!se->on_rq))
-			return;
-		cfs_rq_of(se)->next = se;
-	}
+	if (SCHED_WARN_ON(!se->on_rq))
+		return;
+	cfs_rq_of(se)->next = se;
 }
 
 static void set_skip_buddy(struct sched_entity *se)
 {
-	for_each_sched_entity(se)
-		cfs_rq_of(se)->skip = se;
+	cfs_rq_of(se)->skip = se;
 }
 
 /*
@@ -6919,7 +6865,6 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
 		return;
 
-	find_matching_se(&se, &pse);
 	update_curr(cfs_rq_of(se));
 	BUG_ON(!pse);
 	if (wakeup_preempt_entity(se, pse) == 1) {
@@ -6960,100 +6905,18 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	struct task_struct *p;
 	int new_tasks;
 
+	put_prev_task(rq, prev);
 again:
 	if (!cfs_rq->nr_running)
 		goto idle;
 
-#ifdef CONFIG_FAIR_GROUP_SCHED
-	if (prev->sched_class != &fair_sched_class)
-		goto simple;
-
-	/*
-	 * Because of the set_next_buddy() in dequeue_task_fair() it is rather
-	 * likely that a next task is from the same cgroup as the current.
-	 *
-	 * Therefore attempt to avoid putting and setting the entire cgroup
-	 * hierarchy, only change the part that actually changes.
-	 */
-
-	do {
-		struct sched_entity *curr = cfs_rq->curr;
-
-		/*
-		 * Since we got here without doing put_prev_entity() we also
-		 * have to consider cfs_rq->curr. If it is still a runnable
-		 * entity, update_curr() will update its vruntime, otherwise
-		 * forget we've ever seen it.
-		 */
-		if (curr) {
-			if (curr->on_rq)
-				update_curr(cfs_rq);
-			else
-				curr = NULL;
-
-			/*
-			 * This call to check_cfs_rq_runtime() will do the
-			 * throttle and dequeue its entity in the parent(s).
-			 * Therefore the nr_running test will indeed
-			 * be correct.
-			 */
-			if (unlikely(check_cfs_rq_runtime(cfs_rq))) {
-				cfs_rq = &rq->cfs;
-
-				if (!cfs_rq->nr_running)
-					goto idle;
-
-				goto simple;
-			}
-		}
-
-		se = pick_next_entity(cfs_rq, curr);
-		cfs_rq = group_cfs_rq(se);
-	} while (cfs_rq);
-
-	p = task_of(se);
-
-	/*
-	 * Since we haven't yet done put_prev_entity and if the selected task
-	 * is a different task than we started out with, try and touch the
-	 * least amount of cfs_rqs.
-	 */
-	if (prev != p) {
-		struct sched_entity *pse = &prev->se;
-
-		while (!(cfs_rq = is_same_group(se, pse))) {
-			int se_depth = se->depth;
-			int pse_depth = pse->depth;
-
-			if (se_depth <= pse_depth) {
-				put_prev_entity(cfs_rq_of(pse), pse);
-				pse = parent_entity(pse);
-			}
-			if (se_depth >= pse_depth) {
-				set_next_entity(cfs_rq_of(se), se);
-				se = parent_entity(se);
-			}
-		}
-
-		put_prev_entity(cfs_rq, pse);
-		set_next_entity(cfs_rq, se);
-	}
-
-	goto done;
-simple:
-#endif
-
-	put_prev_task(rq, prev);
-
-	do {
-		se = pick_next_entity(cfs_rq, NULL);
-		set_next_entity(cfs_rq, se);
-		cfs_rq = group_cfs_rq(se);
-	} while (cfs_rq);
+	se = pick_next_entity(cfs_rq, NULL);
+	if (!se)
+		goto idle;
 
+	set_next_entity(cfs_rq, se);
 	p = task_of(se);
 
-done: __maybe_unused;
 #ifdef CONFIG_SMP
 	/*
 	 * Move the next running task to the front of
@@ -7102,10 +6965,8 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
 	struct sched_entity *se = &prev->se;
 	struct cfs_rq *cfs_rq;
 
-	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
-		put_prev_entity(cfs_rq, se);
-	}
+	cfs_rq = cfs_rq_of(se);
+	put_prev_entity(cfs_rq, se);
 }
 
 /*
@@ -7819,6 +7680,19 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 	}
 }
 
+static unsigned long task_se_h_weight(struct sched_entity *se)
+{
+	struct cfs_rq *cfs_rq;
+
+	if (!task_se_in_cgroup(se))
+		return se->load.weight;
+
+	cfs_rq = group_cfs_rq_of_parent(se);
+
+	/* Scale the task's load.weight by the h_load of the group it is in. */
+	return (cfs_rq->h_load * se->load.weight) >> SCHED_FIXEDPOINT_SHIFT;
+}
+
 static unsigned long task_se_h_load(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = group_cfs_rq_of_parent(se);
@@ -7865,6 +7739,11 @@ static unsigned long task_h_load(struct task_struct *p)
 {
 	return p->se.avg.load_avg;
 }
+
+static unsigned long task_se_h_weight(struct sched_entity *se)
+{
+	return se->load.weight;
+}
 #endif
 
 /********** Helpers for find_busiest_group ************************/
@@ -8266,7 +8145,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 		sgs->group_load += load;
 		sgs->group_util += cpu_util(i);
-		sgs->sum_nr_running += rq->cfs.h_nr_running;
+		sgs->sum_nr_running += rq->cfs.nr_running;
 
 		nr_running = rq->nr_running;
 		if (nr_running > 1)
@@ -8957,7 +8836,7 @@ voluntary_active_balance(struct lb_env *env)
 	 * available on dst_cpu.
 	 */
 	if ((env->idle != CPU_NOT_IDLE) &&
-	    (env->src_rq->cfs.h_nr_running == 1)) {
+	    (env->src_rq->cfs.nr_running == 1)) {
 		if ((check_cpu_capacity(env->src_rq, sd)) &&
 		    (capacity_of(env->src_cpu)*sd->imbalance_pct < capacity_of(env->dst_cpu)*100))
 			return 1;
@@ -9638,7 +9517,7 @@ static void nohz_balancer_kick(struct rq *rq)
 		 * capacity; kick the ILB to see if there's a better CPU to run
 		 * on.
 		 */
-		if (rq->cfs.h_nr_running >= 1 && check_cpu_capacity(rq, sd)) {
+		if (rq->cfs.nr_running >= 1 && check_cpu_capacity(rq, sd)) {
 			flags = NOHZ_KICK_MASK;
 			goto unlock;
 		}
@@ -10087,7 +9966,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
 	 * have been enqueued in the meantime. Since we're not going idle,
 	 * pretend we pulled a task.
 	 */
-	if (this_rq->cfs.h_nr_running && !pulled_task)
+	if (this_rq->cfs.nr_running && !pulled_task)
 		pulled_task = 1;
 
 	/* Move the next balance forward */
@@ -10095,7 +9974,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
 		this_rq->next_balance = next_balance;
 
 	/* Is there a task of a high priority class? */
-	if (this_rq->nr_running != this_rq->cfs.h_nr_running)
+	if (this_rq->nr_running != this_rq->cfs.nr_running)
 		pulled_task = -1;
 
 	if (pulled_task)
@@ -10182,6 +10061,10 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		entity_tick(cfs_rq, se, queued);
 	}
 
+	cfs_rq = &rq->cfs;
+	if (cfs_rq->nr_running > 1)
+		check_preempt_tick(cfs_rq, &curr->se);
+
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
 
@@ -10280,40 +10163,51 @@ static inline bool vruntime_normalized(struct task_struct *p)
  * Propagate the changes of the sched_entity across the tg tree to make it
  * visible to the root
  */
-static void propagate_entity_cfs_rq(struct sched_entity *se)
+static void propagate_entity_cfs_rq(struct sched_entity *se, bool curr)
 {
+	unsigned long flags = UPDATE_TG;
 	struct cfs_rq *cfs_rq;
 
+	if (curr)
+		flags |= SE_IS_CURRENT;
+
 	/* Start to propagate at parent */
 	se = se->parent;
 
 	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
+		cfs_rq = group_cfs_rq_of_parent(se);
 
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_load_avg(cfs_rq, se, UPDATE_TG);
+		if (!update_load_avg(cfs_rq, se, flags))
+			break;
 	}
 }
 #else
-static void propagate_entity_cfs_rq(struct sched_entity *se) { }
+static void propagate_entity_cfs_rq(struct sched_entity *se, bool curr) { }
 #endif
 
 static void detach_entity_cfs_rq(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	struct sched_entity *ise = se;
 
 	/* Catch up with the cfs_rq and remove our load when we leave */
-	update_load_avg(cfs_rq, se, 0);
+	for_each_sched_entity(ise) {
+		struct cfs_rq *group_rq = group_cfs_rq_of_parent(ise);
+		if (!update_load_avg(group_rq, ise, 0))
+			break;
+	}
 	detach_entity_load_avg(cfs_rq, se);
 	update_tg_load_avg(cfs_rq, false);
-	propagate_entity_cfs_rq(se);
+	propagate_entity_cfs_rq(se, true);
 }
 
 static void attach_entity_cfs_rq(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	struct sched_entity *ise = se;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/*
@@ -10324,10 +10218,15 @@ static void attach_entity_cfs_rq(struct sched_entity *se)
 #endif
 
 	/* Synchronize entity with its cfs_rq */
-	update_load_avg(cfs_rq, se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD);
+	for_each_sched_entity(ise) {
+		struct cfs_rq *group_rq = group_cfs_rq_of_parent(ise);
+		int flags = sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD;
+		if (!update_load_avg(group_rq, ise, flags))
+			break;
+	}
 	attach_entity_load_avg(cfs_rq, se, 0);
 	update_tg_load_avg(cfs_rq, false);
-	propagate_entity_cfs_rq(se);
+	propagate_entity_cfs_rq(se, false);
 }
 
 static void detach_task_cfs_rq(struct task_struct *p)
@@ -10388,14 +10287,11 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p)
 static void set_curr_task_fair(struct rq *rq)
 {
 	struct sched_entity *se = &rq->curr->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
-	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
-
-		set_next_entity(cfs_rq, se);
-		/* ensure bandwidth has been allocated on our new cfs_rq */
-		account_cfs_rq_runtime(cfs_rq, 0);
-	}
+	set_next_entity(cfs_rq, se);
+	/* ensure bandwidth has been allocated on our new cfs_rq */
+	account_cfs_rq_runtime(cfs_rq, 0);
 }
 
 void init_cfs_rq(struct cfs_rq *cfs_rq)
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 32dc2791a517..45d3b4979022 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -266,10 +266,10 @@ int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
 	return 0;
 }
 
-int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
+int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se, bool load, bool running)
 {
-	if (___update_load_sum(now, &se->avg, !!se->on_rq,
-				cfs_rq->curr == se)) {
+	if (___update_load_sum(now, &se->avg, (!!se->on_rq || load),
+				(cfs_rq->curr == se) || running)) {
 
 		___update_load_avg(&se->avg, se_weight(se));
 		cfs_se_util_change(&se->avg);
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 7489d5f56960..1152c4ebf314 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -2,7 +2,7 @@
 #include "sched-pelt.h"
 
 int __update_load_avg_blocked_se(u64 now, struct sched_entity *se);
-int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se);
+int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se, bool load, bool running);
 int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
 int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
 int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 18494b1a9bac..67066daf7ee9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1443,7 +1443,7 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	set_task_rq_fair(&p->se, p->se.cfs_rq, tg->cfs_rq[cpu]);
-	p->se.cfs_rq = tg->cfs_rq[cpu];
+	p->se.cfs_rq = &cpu_rq(cpu)->cfs;
 	p->se.parent = tg->se[cpu];
 #endif
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/8] sched,fair: redefine runnable_load_avg as the sum of task_h_load
  2019-06-12 19:32 ` [PATCH 3/8] sched,fair: redefine runnable_load_avg as the sum of task_h_load Rik van Riel
@ 2019-06-18  9:08   ` Dietmar Eggemann
  2019-06-26 14:34   ` Dietmar Eggemann
  1 sibling, 0 replies; 24+ messages in thread
From: Dietmar Eggemann @ 2019-06-18  9:08 UTC (permalink / raw)
  To: Rik van Riel, peterz
  Cc: mingo, linux-kernel, kernel-team, morten.rasmussen, tglx,
	dietmar.eggeman, mgorman, vincent.guittot

Hi Rik,

On 6/12/19 9:32 PM, Rik van Riel wrote:

[...]

> @@ -379,17 +368,11 @@ int update_irq_load_avg(struct rq *rq, u64 running)
>   	 * We can safely remove running from rq->clock because
>   	 * rq->clock += delta with delta >= running
>   	 */
> -	ret = ___update_load_sum(rq->clock - running, &rq->avg_irq,
> -				0,
> -				0,
> -				0);
> -	ret += ___update_load_sum(rq->clock, &rq->avg_irq,
> -				1,
> -				1,
> -				1);
> +	ret = ___update_load_sum(rq->clock - running, &rq->avg_irq, 0);
> +	ret += ___update_load_sum(rq->clock, &rq->avg_irq, 1);

The 'int running' argument in the two ___update_load_sum() calls is 
missing. Doesn't compile for me (arm64 defconfig w/ 
CONFIG_IRQ_TIME_ACCOUNTING=y).
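
Presumably the fix is just to thread the extra argument through, along
these lines (a sketch, assuming the reworked signature is now
___update_load_sum(now, sa, load, running); untested):

	ret = ___update_load_sum(rq->clock - running, &rq->avg_irq, 0, 0);
	ret += ___update_load_sum(rq->clock, &rq->avg_irq, 1, 1);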

[...]


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/8] sched: introduce task_se_h_load helper
  2019-06-12 19:32 ` [PATCH 1/8] sched: introduce task_se_h_load helper Rik van Riel
@ 2019-06-19 12:52   ` Dietmar Eggemann
  2019-06-19 13:57     ` Rik van Riel
  0 siblings, 1 reply; 24+ messages in thread
From: Dietmar Eggemann @ 2019-06-19 12:52 UTC (permalink / raw)
  To: Rik van Riel, peterz
  Cc: mingo, linux-kernel, kernel-team, morten.rasmussen, tglx,
	dietmar.eggeman, mgorman, vincent.guittot

On 6/12/19 9:32 PM, Rik van Riel wrote:
> Sometimes the hierarchical load of a sched_entity needs to be calculated.
> Split out task_h_load into a task_se_h_load that takes a sched_entity pointer
> as its argument, and a task_h_load wrapper that calls task_se_h_load.
> 
> No functional changes.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  kernel/sched/fair.c | 17 ++++++++++++++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f35930f5e528..df624f7a68e7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -706,6 +706,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  #ifdef CONFIG_SMP
>  
>  static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
> +static unsigned long task_se_h_load(struct sched_entity *se);
>  static unsigned long task_h_load(struct task_struct *p);
>  static unsigned long capacity_of(int cpu);
>  
> @@ -7833,14 +7834,19 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
>  	}
>  }
>  
> -static unsigned long task_h_load(struct task_struct *p)
> +static unsigned long task_se_h_load(struct sched_entity *se)
>  {
> -	struct cfs_rq *cfs_rq = task_cfs_rq(p);
> +	struct cfs_rq *cfs_rq = cfs_rq_of(se);
>  
>  	update_cfs_rq_h_load(cfs_rq);
> -	return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
> +	return div64_ul(se->avg.load_avg * cfs_rq->h_load,
>  			cfs_rq_load_avg(cfs_rq) + 1);
>  }

I wonder if this is necessary. I placed a BUG_ON(!entity_is_task(se))
into task_se_h_load() after I applied the whole patch-set and ran some
taskgroup related testcases. It didn't hit.

So why not use task_h_load(task_of(se)) instead?

[...]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/8] sched: introduce task_se_h_load helper
  2019-06-19 12:52   ` Dietmar Eggemann
@ 2019-06-19 13:57     ` Rik van Riel
  2019-06-19 15:18       ` Dietmar Eggemann
  0 siblings, 1 reply; 24+ messages in thread
From: Rik van Riel @ 2019-06-19 13:57 UTC (permalink / raw)
  To: Dietmar Eggemann, peterz
  Cc: mingo, linux-kernel, kernel-team, morten.rasmussen, tglx,
	dietmar.eggeman, mgorman, vincent.guittot

[-- Attachment #1: Type: text/plain, Size: 1233 bytes --]

On Wed, 2019-06-19 at 14:52 +0200, Dietmar Eggemann wrote:

> > @@ -7833,14 +7834,19 @@ static void update_cfs_rq_h_load(struct
> > cfs_rq *cfs_rq)
> >  	}
> >  }
> >  
> > -static unsigned long task_h_load(struct task_struct *p)
> > +static unsigned long task_se_h_load(struct sched_entity *se)
> >  {
> > -	struct cfs_rq *cfs_rq = task_cfs_rq(p);
> > +	struct cfs_rq *cfs_rq = cfs_rq_of(se);
> >  
> >  	update_cfs_rq_h_load(cfs_rq);
> > -	return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
> > +	return div64_ul(se->avg.load_avg * cfs_rq->h_load,
> >  			cfs_rq_load_avg(cfs_rq) + 1);
> >  }
> 
> I wonder if this is necessary. I placed a BUG_ON(!entity_is_task(se))
> into task_se_h_load() after I applied the whole patch-set and ran
> some
> taskgroup related testcases. It didn't hit.
> 
> So why not use task_h_load(task_of(se)) instead?
> 
> [...]

That would work, but task_h_load() then dereferences
task->se to get the se->avg.load_avg value.

Going from the se back to the task, only to then get the
se from the task again, seems a little unnecessary :)
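
To illustrate the round trip (rough sketch, not the literal patch
code):

	/* patch 1: task_h_load() becomes a thin wrapper, roughly: */
	static unsigned long task_h_load(struct task_struct *p)
	{
		return task_se_h_load(&p->se);
	}

	/* a caller that already has the se: */
	load = task_h_load(task_of(se));	/* se -> task -> &task->se */
	load = task_se_h_load(se);		/* uses the se directly */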

Can you explain why you think task_h_load(task_of(se))
would be better? I think I may be overlooking something.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/8] sched: introduce task_se_h_load helper
  2019-06-19 13:57     ` Rik van Riel
@ 2019-06-19 15:18       ` Dietmar Eggemann
  2019-06-19 15:55         ` Rik van Riel
  0 siblings, 1 reply; 24+ messages in thread
From: Dietmar Eggemann @ 2019-06-19 15:18 UTC (permalink / raw)
  To: Rik van Riel, peterz
  Cc: mingo, linux-kernel, kernel-team, morten.rasmussen, tglx,
	dietmar.eggeman, mgorman, vincent.guittot

On 6/19/19 3:57 PM, Rik van Riel wrote:
> On Wed, 2019-06-19 at 14:52 +0200, Dietmar Eggemann wrote:
> 
>>> @@ -7833,14 +7834,19 @@ static void update_cfs_rq_h_load(struct
>>> cfs_rq *cfs_rq)
>>>  	}
>>>  }
>>>  
>>> -static unsigned long task_h_load(struct task_struct *p)
>>> +static unsigned long task_se_h_load(struct sched_entity *se)
>>>  {
>>> -	struct cfs_rq *cfs_rq = task_cfs_rq(p);
>>> +	struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>>  
>>>  	update_cfs_rq_h_load(cfs_rq);
>>> -	return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
>>> +	return div64_ul(se->avg.load_avg * cfs_rq->h_load,
>>>  			cfs_rq_load_avg(cfs_rq) + 1);
>>>  }
>>
>> I wonder if this is necessary. I placed a BUG_ON(!entity_is_task(se))
>> into task_se_h_load() after I applied the whole patch-set and ran
>> some
>> taskgroup related testcases. It didn't hit.
>>
>> So why not use task_h_load(task_of(se)) instead?
>>
>> [...]
> 
> That would work, but task_h_load then dereferences
> task->se to get the se->avg.load_avg value.
> 
> Going back to task from the se, only to then get the
> se from the task seems a little unnecessary :)
> 
> Can you explain why you think task_h_load(task_of(se))
> would be better? I think I may be overlooking something.

Ah, OK, I just wanted to avoid having task_se_h_load() and task_h_load()
at the same time. You could replace the remaining calls to
task_h_load(p) with task_se_h_load(&p->se) in this case.

- task_load = task_h_load(p);
+ task_load = task_se_h_load(&p->se);

Not that important though right now ...


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/8] sched: introduce task_se_h_load helper
  2019-06-19 15:18       ` Dietmar Eggemann
@ 2019-06-19 15:55         ` Rik van Riel
  0 siblings, 0 replies; 24+ messages in thread
From: Rik van Riel @ 2019-06-19 15:55 UTC (permalink / raw)
  To: Dietmar Eggemann, peterz
  Cc: mingo, linux-kernel, kernel-team, morten.rasmussen, tglx,
	dietmar.eggeman, mgorman, vincent.guittot

[-- Attachment #1: Type: text/plain, Size: 1064 bytes --]

On Wed, 2019-06-19 at 17:18 +0200, Dietmar Eggemann wrote:
> On 6/19/19 3:57 PM, Rik van Riel wrote:
> 
> > That would work, but task_h_load then dereferences
> > task->se to get the se->avg.load_avg value.
> > 
> > Going back to task from the se, only to then get the
> > se from the task seems a little unnecessary :)
> > 
> > Can you explain why you think task_h_load(task_of(se))
> > would be better? I think I may be overlooking something.
> 
> Ah, OK, I just wanted to avoid having task_se_h_load() and
> task_h_load()
> at the same time. You could replace the remaining calls to
> task_h_load(p) with task_se_h_load(&p->se) in this case.
> 
> - task_load = task_h_load(p);
> + task_load = task_se_h_load(&p->se);
> 
> Not that important though right now ...

That I can do.

I might as well do that while going through the
rest of the series to merge in the bug fix that
I have for the performance regression, and the
fixes for compilation with other config options.

Thank you for the suggestion.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/8] sched,cfs: use explicit cfs_rq of parent se helper
  2019-06-12 19:32 ` [PATCH 5/8] sched,cfs: use explicit cfs_rq of parent se helper Rik van Riel
@ 2019-06-20 16:23   ` Dietmar Eggemann
  2019-06-20 16:29     ` Rik van Riel
  2019-06-26 15:58   ` Dietmar Eggemann
  1 sibling, 1 reply; 24+ messages in thread
From: Dietmar Eggemann @ 2019-06-20 16:23 UTC (permalink / raw)
  To: Rik van Riel, peterz
  Cc: mingo, linux-kernel, kernel-team, morten.rasmussen, tglx,
	dietmar.eggeman, mgorman, vincent.guittot

On 6/12/19 9:32 PM, Rik van Riel wrote:
> Use an explicit "cfs_rq of parent sched_entity" helper in a few
> strategic places, where cfs_rq_of(se) may no longer point at the
> right runqueue once we flatten the hierarchical cgroup runqueues.
> 
> No functional change.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  kernel/sched/fair.c | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index dcc521d251e3..c6ede2ecc935 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -275,6 +275,15 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
>  	return grp->my_q;
>  }
>  
> +/* runqueue owned by the parent entity */
> +static inline struct cfs_rq *group_cfs_rq_of_parent(struct sched_entity *se)
> +{
> +	if (se->parent)
> +		return group_cfs_rq(se->parent);
> +
> +	return &cfs_rq_of(se)->rq->cfs;
> +}
> +
>  static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
>  {
>  	struct rq *rq = rq_of(cfs_rq);
> @@ -3298,7 +3307,7 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
>  
>  	gcfs_rq->propagate = 0;
>  
> -	cfs_rq = cfs_rq_of(se);
> +	cfs_rq = group_cfs_rq_of_parent(se);
>  
>  	add_tg_cfs_propagate(cfs_rq, gcfs_rq->prop_runnable_sum);
>  
> @@ -7779,7 +7788,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
>  
>  	WRITE_ONCE(cfs_rq->h_load_next, NULL);
>  	for_each_sched_entity(se) {
> -		cfs_rq = cfs_rq_of(se);
> +		cfs_rq = group_cfs_rq_of_parent(se);

Why do you change this here? task_se_h_load() calls
update_cfs_rq_h_load() with cfs_rq = group_cfs_rq_of_parent(se) because
the task might not be on the cfs_rq yet.

But inside update_cfs_rq_h_load() the first se is derived from
cfs_rq->tg->se[cpu_of(rq)] so in the first for_each_sched_entity() loop
we should still start with group_cfs_rq() (se->my_q) ?

The system doesn't barf with these two WARN_ON's in.

@@ -7663,12 +7673,17 @@ static void update_cfs_rq_h_load(struct cfs_rq
*cfs_rq)
        unsigned long now = jiffies;
        unsigned long load;

+       WARN_ON(se && (se != group_cfs_rq(se)->tg->se[cpu_of(rq)]));
+
        if (cfs_rq->last_h_load_update == now)
                return;

        WRITE_ONCE(cfs_rq->h_load_next, NULL);
        for_each_sched_entity(se) {
                cfs_rq = group_cfs_rq_of_parent(se);
+
+               WARN_ON(se != group_cfs_rq(se)->tg->se[cpu_of(rq)]);
+
                WRITE_ONCE(cfs_rq->h_load_next, se);
                if (cfs_rq->last_h_load_update == now)
                        break;


[...]


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/8] sched,cfs: use explicit cfs_rq of parent se helper
  2019-06-20 16:23   ` Dietmar Eggemann
@ 2019-06-20 16:29     ` Rik van Riel
  2019-06-24 11:24       ` Dietmar Eggemann
  0 siblings, 1 reply; 24+ messages in thread
From: Rik van Riel @ 2019-06-20 16:29 UTC (permalink / raw)
  To: Dietmar Eggemann, peterz
  Cc: mingo, linux-kernel, kernel-team, morten.rasmussen, tglx,
	dietmar.eggeman, mgorman, vincent.guittot

[-- Attachment #1: Type: text/plain, Size: 3258 bytes --]

On Thu, 2019-06-20 at 18:23 +0200, Dietmar Eggemann wrote:
> On 6/12/19 9:32 PM, Rik van Riel wrote:
> > Use an explicit "cfs_rq of parent sched_entity" helper in a few
> > strategic places, where cfs_rq_of(se) may no longer point at the
> > right runqueue once we flatten the hierarchical cgroup runqueues.
> > 
> > No functional change.
> > 
> > Signed-off-by: Rik van Riel <riel@surriel.com>
> > ---
> >  kernel/sched/fair.c | 17 +++++++++++++----
> >  1 file changed, 13 insertions(+), 4 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index dcc521d251e3..c6ede2ecc935 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -275,6 +275,15 @@ static inline struct cfs_rq
> > *group_cfs_rq(struct sched_entity *grp)
> >  	return grp->my_q;
> >  }
> >  
> > +/* runqueue owned by the parent entity */
> > +static inline struct cfs_rq *group_cfs_rq_of_parent(struct
> > sched_entity *se)
> > +{
> > +	if (se->parent)
> > +		return group_cfs_rq(se->parent);
> > +
> > +	return &cfs_rq_of(se)->rq->cfs;
> > +}
> > +
> >  static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> >  {
> >  	struct rq *rq = rq_of(cfs_rq);
> > @@ -3298,7 +3307,7 @@ static inline int
> > propagate_entity_load_avg(struct sched_entity *se)
> >  
> >  	gcfs_rq->propagate = 0;
> >  
> > -	cfs_rq = cfs_rq_of(se);
> > +	cfs_rq = group_cfs_rq_of_parent(se);
> >  
> >  	add_tg_cfs_propagate(cfs_rq, gcfs_rq->prop_runnable_sum);
> >  
> > @@ -7779,7 +7788,7 @@ static void update_cfs_rq_h_load(struct
> > cfs_rq *cfs_rq)
> >  
> >  	WRITE_ONCE(cfs_rq->h_load_next, NULL);
> >  	for_each_sched_entity(se) {
> > -		cfs_rq = cfs_rq_of(se);
> > +		cfs_rq = group_cfs_rq_of_parent(se);
> 
> Why do you change this here? task_se_h_load() calls
> update_cfs_rq_h_load() with cfs_rq = group_cfs_rq_of_parent(se)
> because
> the task might not be on the cfs_rq yet.

Because patch 6 points cfs_rq_of(se) at the CPU's top level
cfs_rq for every task se ...

... but I guess since I have not changed where cfs_rq_of()
points for cgroup sched_entities, this change is not necessary
at this time, and I should be able to go without it in this
function.

> But inside update_cfs_rq_h_load() the first se is derived from
> cfs_rq->tg->se[cpu_of(rq)] so in the first for_each_sched_entity()
> loop
> we should still start with group_cfs_rq() (se->my_q) ?
> 
> The system doesn't barf with these two WARN_ON's in.
> 
> @@ -7663,12 +7673,17 @@ static void update_cfs_rq_h_load(struct
> cfs_rq
> *cfs_rq)
>         unsigned long now = jiffies;
>         unsigned long load;
> 
> +       WARN_ON(se && (se != group_cfs_rq(se)->tg->se[cpu_of(rq)]));
> +
>         if (cfs_rq->last_h_load_update == now)
>                 return;
> 
>         WRITE_ONCE(cfs_rq->h_load_next, NULL);
>         for_each_sched_entity(se) {
>                 cfs_rq = group_cfs_rq_of_parent(se);
> +
> +               WARN_ON(se != group_cfs_rq(se)->tg->se[cpu_of(rq)]);
> +
>                 WRITE_ONCE(cfs_rq->h_load_next, se);
>                 if (cfs_rq->last_h_load_update == now)
>                         break;
> 
> 
> [...]
> 
> 
-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/8] sched,cfs: use explicit cfs_rq of parent se helper
  2019-06-20 16:29     ` Rik van Riel
@ 2019-06-24 11:24       ` Dietmar Eggemann
  0 siblings, 0 replies; 24+ messages in thread
From: Dietmar Eggemann @ 2019-06-24 11:24 UTC (permalink / raw)
  To: Rik van Riel, peterz
  Cc: mingo, linux-kernel, kernel-team, morten.rasmussen, tglx,
	dietmar.eggeman, mgorman, vincent.guittot

On 6/20/19 6:29 PM, Rik van Riel wrote:
> On Thu, 2019-06-20 at 18:23 +0200, Dietmar Eggemann wrote:
>> On 6/12/19 9:32 PM, Rik van Riel wrote:

[...]

>>> @@ -7779,7 +7788,7 @@ static void update_cfs_rq_h_load(struct
>>> cfs_rq *cfs_rq)
>>>  
>>>  	WRITE_ONCE(cfs_rq->h_load_next, NULL);
>>>  	for_each_sched_entity(se) {
>>> -		cfs_rq = cfs_rq_of(se);
>>> +		cfs_rq = group_cfs_rq_of_parent(se);
>>
>> Why do you change this here? task_se_h_load() calls
>> update_cfs_rq_h_load() with cfs_rq = group_cfs_rq_of_parent(se)
>> because
>> the task might not be on the cfs_rq yet.
> 
> Because patch 6 points cfs_rq_of(se) at the CPU's top level
> cfs_rq for every task se ...
> 
> ... but I guess since I have not changed where the cfs_rq_of
> points for cgroup sched_entities, this change is not necessary
> at this time, and I should be able to go without it, in this
> function.

IMHO, since you only change set_task_rq() (p->se.cfs_rq =
&cpu_rq(cpu)->cfs instead of tg->cfs_rq[cpu] in 8/8) (used for a task)
and not init_tg_cfs_entry() (used for a taskgroup), 'cfs_rq_of(se) ==
se->parent->my_q' should still be true in update_cfs_rq_h_load().

update_cfs_rq_h_load() only deals with se's representing taskgroups. So
cfs_rq_of(se) and group_cfs_rq_of_parent(se) should deliver the same
result for these se's.

>> But inside update_cfs_rq_h_load() the first se is derived from
>> cfs_rq->tg->se[cpu_of(rq)] so in the first for_each_sched_entity()
>> loop
>> we should still start with group_cfs_rq() (se->my_q) ?

Here I was wrong. The first loop did use cfs_rq_of() and not group_cfs_rq().
[...]


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 8/8] sched,fair: flatten hierarchical runqueues
  2019-06-12 19:32 ` [PATCH 8/8] sched,fair: flatten hierarchical runqueues Rik van Riel
@ 2019-06-25  9:50   ` Dietmar Eggemann
  2019-06-25 13:51     ` Rik van Riel
  2019-06-28 10:26   ` Dietmar Eggemann
  1 sibling, 1 reply; 24+ messages in thread
From: Dietmar Eggemann @ 2019-06-25  9:50 UTC (permalink / raw)
  To: Rik van Riel, peterz
  Cc: mingo, linux-kernel, kernel-team, morten.rasmussen, tglx,
	dietmar.eggeman, mgorman, vincent.guittot

On 6/12/19 9:32 PM, Rik van Riel wrote:

[...]

> @@ -410,6 +412,11 @@ static inline struct sched_entity *parent_entity(struct sched_entity *se)
>  	return se->parent;
>  }
>  
> +static inline bool task_se_in_cgroup(struct sched_entity *se)
> +{
> +	return parent_entity(se);
> +}

IMHO, s/in_cgroup/not_in_root_tg/ reads easier. "/", i.e. the root tg is
still a cgroup, I guess. But you could use existing parent_entity(se) as
well.

[...]

> @@ -679,22 +710,16 @@ static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
>  static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>  	u64 slice = sysctl_sched_latency;
> +	struct load_weight *load = &cfs_rq->load;
> +	struct load_weight lw;
>  
> -	for_each_sched_entity(se) {
> -		struct load_weight *load;
> -		struct load_weight lw;
> +	if (unlikely(!se->on_rq)) {
> +		lw = cfs_rq->load;
>  
> -		cfs_rq = cfs_rq_of(se);
> -		load = &cfs_rq->load;
> -
> -		if (unlikely(!se->on_rq)) {
> -			lw = cfs_rq->load;
> -
> -			update_load_add(&lw, se->load.weight);
> -			load = &lw;
> -		}
> -		slice = __calc_delta(slice, se->load.weight, load);
> +		update_load_add(&lw, task_se_h_load(se));
> +		load = &lw;
>  	}
> +	slice = __calc_delta(slice, task_se_h_load(se), load);

task_se_h_load(se) and se->load.weight are off by a factor of >= 1024 on
64bit.

...
    bash pid=3250: task_se_h_load(se)=1023 se->load.weight=1048576
    sysctl_sched_latency=18000000 slice=0 old_slice=17999995
...
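
If I read __calc_delta() right, it does fact = scale_load_down(weight),
and on 64bit scale_load_down() shifts right by SCHED_FIXEDPOINT_SHIFT
(10). Working through the numbers above (my reading, so treat it as an
assumption):

    old: scale_load_down(1048576) = 1024 -> slice ~= 18000000 * 1024 / 1024 ~= 18 ms
    new: scale_load_down(1023)    = 0    -> slice = 0

i.e. task_se_h_load() returns a value in the ~1024 load_avg scale while
cfs_rq->load.weight is still in the scale_load()-ed range.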

[...]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 8/8] sched,fair: flatten hierarchical runqueues
  2019-06-25  9:50   ` Dietmar Eggemann
@ 2019-06-25 13:51     ` Rik van Riel
  0 siblings, 0 replies; 24+ messages in thread
From: Rik van Riel @ 2019-06-25 13:51 UTC (permalink / raw)
  To: Dietmar Eggemann, peterz
  Cc: mingo, linux-kernel, kernel-team, morten.rasmussen, tglx,
	Mel Gorman, vincent.guittot

[-- Attachment #1: Type: text/plain, Size: 2081 bytes --]

On Tue, 2019-06-25 at 11:50 +0200, Dietmar Eggemann wrote:
> On 6/12/19 9:32 PM, Rik van Riel wrote:
> 
> [...]
> 
> > @@ -410,6 +412,11 @@ static inline struct sched_entity
> > *parent_entity(struct sched_entity *se)
> >  	return se->parent;
> >  }
> >  
> > +static inline bool task_se_in_cgroup(struct sched_entity *se)
> > +{
> > +	return parent_entity(se);
> > +}
> 
> IMHO, s/in_cgroup/not_in_root_tg/ reads easier. "/", i.e. the root tg
> is
> still a cgroup, I guess. But you could use existing parent_entity(se)
> as
> well.

I agree my name is not the prettiest, but I am not
entirely convinced your idea is an improvement.

I'll hold out for better ideas by other reviewers :)

> > @@ -679,22 +710,16 @@ static inline u64 calc_delta_fair(u64 delta,
> > struct sched_entity *se)
> >  static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity
> > *se)
> >  {
> >  	u64 slice = sysctl_sched_latency;
> > +	struct load_weight *load = &cfs_rq->load;
> > +	struct load_weight lw;
> >  
> > -	for_each_sched_entity(se) {
> > -		struct load_weight *load;
> > -		struct load_weight lw;
> > +	if (unlikely(!se->on_rq)) {
> > +		lw = cfs_rq->load;
> >  
> > -		cfs_rq = cfs_rq_of(se);
> > -		load = &cfs_rq->load;
> > -
> > -		if (unlikely(!se->on_rq)) {
> > -			lw = cfs_rq->load;
> > -
> > -			update_load_add(&lw, se->load.weight);
> > -			load = &lw;
> > -		}
> > -		slice = __calc_delta(slice, se->load.weight, load);
> > +		update_load_add(&lw, task_se_h_load(se));
> > +		load = &lw;
> >  	}
> > +	slice = __calc_delta(slice, task_se_h_load(se), load);
> 
> task_se_h_load(se) and se->load.weight are off my factor of >= 1024
> on
> 64bit.

Oh indeed they are!

I wonder if this is the root cause of that
performance regression I have been hunting for
the past few weeks :)

Let me go test some things...

> ...
>     bash pid=3250: task_se_h_load(se)=1023 se->load.weight=1048576
>     sysctl_sched_latency=18000000 slice=0 old_slice=17999995
> ...
> 
> [...]
> 
-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/8] sched,fair: redefine runnable_load_avg as the sum of task_h_load
  2019-06-12 19:32 ` [PATCH 3/8] sched,fair: redefine runnable_load_avg as the sum of task_h_load Rik van Riel
  2019-06-18  9:08   ` Dietmar Eggemann
@ 2019-06-26 14:34   ` Dietmar Eggemann
  1 sibling, 0 replies; 24+ messages in thread
From: Dietmar Eggemann @ 2019-06-26 14:34 UTC (permalink / raw)
  To: Rik van Riel, peterz
  Cc: mingo, linux-kernel, kernel-team, morten.rasmussen, tglx,
	dietmar.eggemann, mgorman, vincent.guittot

On 6/12/19 9:32 PM, Rik van Riel wrote:
> The runnable_load magic is used to quickly propagate information about
> runnable tasks up the hierarchy of runqueues. lhen switching to a flat

Looks like some information is missing here.

> runqueue, that no longer works.
> 
> Redefine the CPU cfs_rq runnable_load_avg to be the sum of task_h_loads
> of the runnable tasks. This provides enough information to the load
> balancer.
> 
> The runnable_load_avg of the cgroup cfs_rqs does not appear to be
> used for anything, so don't bother calculating those.
> 
> This removes one of the things that the code currently traverses the
> cgroup hierarchy for, and getting rid of it brings us one step closer
> to a flat runqueue for the CPU controller.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  include/linux/sched.h |   3 +-
>  kernel/sched/core.c   |   2 -
>  kernel/sched/debug.c  |   1 +
>  kernel/sched/fair.c   | 125 +++++++++++++-----------------------------
>  kernel/sched/pelt.c   |  49 ++++++-----------
>  kernel/sched/sched.h  |   6 --
>  6 files changed, 55 insertions(+), 131 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 11837410690f..f5bb6948e40c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -391,7 +391,6 @@ struct util_est {
>  struct sched_avg {
>  	u64				last_update_time;
>  	u64				load_sum;
> -	u64				runnable_load_sum;
>  	u32				util_sum;
>  	u32				period_contrib;
>  	unsigned long			load_avg;

Could you not also remove runnable_load_avg from struct sched_avg and
put it into struct cfs_rq directly? The signal has nothing to do
with PELT anymore, so the se doesn't have to carry it. You only need it
for the root cfs_rqs, but it's at least better than still having it for
all se's as well.
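
I.e. keep the signal, but as a plain field on the cfs_rq, something
like (just a sketch of the idea, placement made up):

	struct cfs_rq {
		...
		/* Sum of task_h_load() of the runnable tasks; root cfs_rq only */
		unsigned long		runnable_load_avg;
		...
	};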

[...]

> @@ -2767,20 +2765,39 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  static inline void
>  enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -	cfs_rq->runnable_weight += se->runnable_weight;
> +	if (entity_is_task(se)) {
> +		struct cfs_rq *cpu_cfs_rq = &cfs_rq->rq->cfs;

There are a couple of comments in fair.c referring to this cfs_rq as the
root cfs_rq, rather than the cpu cfs_rq. IMHO, it's easier to read if we
stick to one name (root_cfs_rq vs. cpu_cfs_rq).

[...]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/8] sched,cfs: use explicit cfs_rq of parent se helper
  2019-06-12 19:32 ` [PATCH 5/8] sched,cfs: use explicit cfs_rq of parent se helper Rik van Riel
  2019-06-20 16:23   ` Dietmar Eggemann
@ 2019-06-26 15:58   ` Dietmar Eggemann
  2019-06-26 16:15     ` Rik van Riel
  1 sibling, 1 reply; 24+ messages in thread
From: Dietmar Eggemann @ 2019-06-26 15:58 UTC (permalink / raw)
  To: Rik van Riel, peterz
  Cc: mingo, linux-kernel, kernel-team, morten.rasmussen, tglx,
	dietmar.eggemann, mgorman, vincent.guittot

On 6/12/19 9:32 PM, Rik van Riel wrote:
> Use an explicit "cfs_rq of parent sched_entity" helper in a few
> strategic places, where cfs_rq_of(se) may no longer point at the
> right runqueue once we flatten the hierarchical cgroup runqueues.
> 
> No functional change.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  kernel/sched/fair.c | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index dcc521d251e3..c6ede2ecc935 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -275,6 +275,15 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
>  	return grp->my_q;
>  }
>  
> +/* runqueue owned by the parent entity */
> +static inline struct cfs_rq *group_cfs_rq_of_parent(struct sched_entity *se)
> +{
> +	if (se->parent)
> +		return group_cfs_rq(se->parent);
> +
> +	return &cfs_rq_of(se)->rq->cfs;

The function name and the description are not 100% correct. For tasks
running natively in the root taskgroup (not in a flattened taskgroup),
or for the se representing a first-level taskgroup (e.g. /tg1, with
se->depth = 0), it returns the root cfs_rq, or more simply se->cfs_rq.

So you could replace

    return &cfs_rq_of(se)->rq->cfs;

with

    return se->cfs_rq;

or

    return cfs_rq_of(se);
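
i.e. the helper could simply be (sketch of the simplified version):

	/* runqueue owned by the parent entity; root cfs_rq if there is none */
	static inline struct cfs_rq *group_cfs_rq_of_parent(struct sched_entity *se)
	{
		if (se->parent)
			return group_cfs_rq(se->parent);

		return cfs_rq_of(se);
	}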

I guess a crucial point to understand is that you do need both:
cfs_rq_of(se) to access the flattened world and
group_cfs_rq_of_parent(se) to access the hierarchical one.

[...]


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/8] sched,cfs: use explicit cfs_rq of parent se helper
  2019-06-26 15:58   ` Dietmar Eggemann
@ 2019-06-26 16:15     ` Rik van Riel
  0 siblings, 0 replies; 24+ messages in thread
From: Rik van Riel @ 2019-06-26 16:15 UTC (permalink / raw)
  To: Dietmar Eggemann, peterz
  Cc: mingo, linux-kernel, kernel-team, morten.rasmussen, tglx,
	mgorman, vincent.guittot

[-- Attachment #1: Type: text/plain, Size: 1611 bytes --]

On Wed, 2019-06-26 at 17:58 +0200, Dietmar Eggemann wrote:
> On 6/12/19 9:32 PM, Rik van Riel wrote:
> > Use an explicit "cfs_rq of parent sched_entity" helper in a few
> > strategic places, where cfs_rq_of(se) may no longer point at the
> > right runqueue once we flatten the hierarchical cgroup runqueues.
> > 
> > No functional change.
> > 
> > Signed-off-by: Rik van Riel <riel@surriel.com>
> > ---
> >  kernel/sched/fair.c | 17 +++++++++++++----
> >  1 file changed, 13 insertions(+), 4 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index dcc521d251e3..c6ede2ecc935 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -275,6 +275,15 @@ static inline struct cfs_rq
> > *group_cfs_rq(struct sched_entity *grp)
> >  	return grp->my_q;
> >  }
> >  
> > +/* runqueue owned by the parent entity */
> > +static inline struct cfs_rq *group_cfs_rq_of_parent(struct
> > sched_entity *se)
> > +{
> > +	if (se->parent)
> > +		return group_cfs_rq(se->parent);
> > +
> > +	return &cfs_rq_of(se)->rq->cfs;
> 
> The function name and the description is not 100% correct. For tasks
> running naturally (not in a flattened taskgroup) in the root
> taskgroup
> or for the se representing a first level taskgroup (e.g. /tg1 (with
> se->depth = 0)) it returns the root cfs_rq or easier se->cfs_rq.
> 
> So you could replace
> 
>     return &cfs_rq_of(se)->rq->cfs;
> 
> with
> 
>     return se->cfs_rq;
> 
> or
> 
>     return cfs_rq_of(se);

Good point. I will do that for the v2 series.


-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 8/8] sched,fair: flatten hierarchical runqueues
  2019-06-12 19:32 ` [PATCH 8/8] sched,fair: flatten hierarchical runqueues Rik van Riel
  2019-06-25  9:50   ` Dietmar Eggemann
@ 2019-06-28 10:26   ` Dietmar Eggemann
  2019-06-28 19:36     ` Rik van Riel
  1 sibling, 1 reply; 24+ messages in thread
From: Dietmar Eggemann @ 2019-06-28 10:26 UTC (permalink / raw)
  To: Rik van Riel, peterz
  Cc: mingo, linux-kernel, kernel-team, morten.rasmussen, tglx,
	dietmar.eggemann, mgorman, vincent.guittot

On 6/12/19 9:32 PM, Rik van Riel wrote:
> Flatten the hierarchical runqueues into just the per CPU rq.cfs runqueue.
> 
> Iteration of the sched_entity hierarchy is rate limited to once per jiffy
> per sched_entity, which is a smaller change than it seems, because load
> average adjustments were already rate limited to once per jiffy before this
> patch series.
> 
> This patch breaks CONFIG_CFS_BANDWIDTH. The plan for that is to park tasks
> from throttled cgroups onto their cgroup runqueues, and slowly (using the
> GENTLE_FAIR_SLEEPERS) wake them back up, in vruntime order, once the cgroup
> gets unthrottled, to prevent thundering herd issues.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  include/linux/sched.h |   2 +
>  kernel/sched/fair.c   | 478 +++++++++++++++++-------------------------
>  kernel/sched/pelt.c   |   6 +-
>  kernel/sched/pelt.h   |   2 +-
>  kernel/sched/sched.h  |   2 +-
>  5 files changed, 194 insertions(+), 296 deletions(-)
> 

[...]

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

[...]

> @@ -3491,7 +3544,7 @@ static inline bool update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>  	 * track group sched_entity load average for task_h_load calc in migration
>  	 */
>  	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
> -		updated = __update_load_avg_se(now, cfs_rq, se);
> +		updated = __update_load_avg_se(now, cfs_rq, se, curr, curr);

I wonder if task migration is still working correctly.

migrate_task_rq_fair(p, ...) -> remove_entity_load_avg(&p->se) would use
cfs_rq = se->cfs_rq (i.e. root cfs_rq). So load (and util) will not
propagate through the taskgroup hierarchy.

[...]


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 8/8] sched,fair: flatten hierarchical runqueues
  2019-06-28 10:26   ` Dietmar Eggemann
@ 2019-06-28 19:36     ` Rik van Riel
  0 siblings, 0 replies; 24+ messages in thread
From: Rik van Riel @ 2019-06-28 19:36 UTC (permalink / raw)
  To: Dietmar Eggemann, peterz
  Cc: mingo, linux-kernel, kernel-team, morten.rasmussen, tglx,
	mgorman, vincent.guittot

[-- Attachment #1: Type: text/plain, Size: 2111 bytes --]

On Fri, 2019-06-28 at 12:26 +0200, Dietmar Eggemann wrote:
> On 6/12/19 9:32 PM, Rik van Riel wrote:
> > Flatten the hierarchical runqueues into just the per CPU rq.cfs
> > runqueue.
> > 
> > Iteration of the sched_entity hierarchy is rate limited to once per
> > jiffy
> > per sched_entity, which is a smaller change than it seems, because
> > load
> > average adjustments were already rate limited to once per jiffy
> > before this
> > patch series.
> > 
> > This patch breaks CONFIG_CFS_BANDWIDTH. The plan for that is to
> > park tasks
> > from throttled cgroups onto their cgroup runqueues, and slowly
> > (using the
> > GENTLE_FAIR_SLEEPERS) wake them back up, in vruntime order, once
> > the cgroup
> > gets unthrottled, to prevent thundering herd issues.
> > 
> > Signed-off-by: Rik van Riel <riel@surriel.com>
> > ---
> >  include/linux/sched.h |   2 +
> >  kernel/sched/fair.c   | 478 +++++++++++++++++---------------------
> > ----
> >  kernel/sched/pelt.c   |   6 +-
> >  kernel/sched/pelt.h   |   2 +-
> >  kernel/sched/sched.h  |   2 +-
> >  5 files changed, 194 insertions(+), 296 deletions(-)
> > 
> 
> [...]
> 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> 
> [...]
> 
> > @@ -3491,7 +3544,7 @@ static inline bool update_load_avg(struct
> > cfs_rq *cfs_rq, struct sched_entity *s
> >  	 * track group sched_entity load average for task_h_load calc
> > in migration
> >  	 */
> >  	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
> > -		updated = __update_load_avg_se(now, cfs_rq, se);
> > +		updated = __update_load_avg_se(now, cfs_rq, se, curr,
> > curr);
> 
> I wonder if task migration is still working correctly.
> 
> migrate_task_rq_fair(p, ...) -> remove_entity_load_avg(&p->se) would
> use
> cfs_rq = se->cfs_rq (i.e. root cfs_rq). So load (and util) will not
> propagate through the taskgroup hierarchy.
> 
> [...]

Good point. This should be the group cfs_rq, and
then on the next tick the load change will be 
propagated up.

Let me add that change in for v2 as well.
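
Something like this, I suppose (untested sketch, assuming
remove_entity_load_avg() is the right spot):

	 static void remove_entity_load_avg(struct sched_entity *se)
	 {
	-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
	+	/* use the group cfs_rq so the removed load propagates up */
	+	struct cfs_rq *cfs_rq = group_cfs_rq_of_parent(se);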

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2019-06-28 19:36 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-12 19:32 [RFC] sched,cfs: flatten CPU controller runqueues Rik van Riel
2019-06-12 19:32 ` [PATCH 1/8] sched: introduce task_se_h_load helper Rik van Riel
2019-06-19 12:52   ` Dietmar Eggemann
2019-06-19 13:57     ` Rik van Riel
2019-06-19 15:18       ` Dietmar Eggemann
2019-06-19 15:55         ` Rik van Riel
2019-06-12 19:32 ` [PATCH 2/8] sched: change /proc/sched_debug fields Rik van Riel
2019-06-12 19:32 ` [PATCH 3/8] sched,fair: redefine runnable_load_avg as the sum of task_h_load Rik van Riel
2019-06-18  9:08   ` Dietmar Eggemann
2019-06-26 14:34   ` Dietmar Eggemann
2019-06-12 19:32 ` [PATCH 4/8] sched,fair: remove cfs rqs from leaf_cfs_rq_list bottom up Rik van Riel
2019-06-12 19:32 ` [PATCH 5/8] sched,cfs: use explicit cfs_rq of parent se helper Rik van Riel
2019-06-20 16:23   ` Dietmar Eggemann
2019-06-20 16:29     ` Rik van Riel
2019-06-24 11:24       ` Dietmar Eggemann
2019-06-26 15:58   ` Dietmar Eggemann
2019-06-26 16:15     ` Rik van Riel
2019-06-12 19:32 ` [PATCH 6/8] sched,cfs: fix zero length timeslice calculation Rik van Riel
2019-06-12 19:32 ` [PATCH 7/8] sched,fair: refactor enqueue/dequeue_entity Rik van Riel
2019-06-12 19:32 ` [PATCH 8/8] sched,fair: flatten hierarchical runqueues Rik van Riel
2019-06-25  9:50   ` Dietmar Eggemann
2019-06-25 13:51     ` Rik van Riel
2019-06-28 10:26   ` Dietmar Eggemann
2019-06-28 19:36     ` Rik van Riel
