* [PATCH RFC v4 0/15] sched,fair: flatten CPU controller runqueues
@ 2019-08-22  2:17 Rik van Riel
  2019-08-22  2:17 ` [PATCH 01/15] sched: introduce task_se_h_load helper Rik van Riel
                   ` (15 more replies)
  0 siblings, 16 replies; 47+ messages in thread
From: Rik van Riel @ 2019-08-22  2:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, pjt, dietmar.eggemann, peterz, mingo,
	morten.rasmussen, tglx, mgorman, vincent.guittot

The current implementation of the CPU controller uses hierarchical
runqueues, where on wakeup a task is enqueued on its group's runqueue,
the group is enqueued on the runqueue of the group above it, etc.

This adds a fairly large amount of overhead for workloads that
do a lot of wakeups per second, especially given that the default systemd
hierarchy is 2 or 3 levels deep.

This patch series is an attempt at reducing that overhead, by placing
all the tasks on the same runqueue, and scaling the task priority by
the priority of the group, which is calculated periodically.
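
In other words, each task ends up directly on the per-CPU runqueue, with
an effective weight that already folds in its cgroup's share of the CPU.
Roughly (a sketch of the idea; the actual code is the task_se_h_weight
helper added later in this series):

	effective_weight = (cfs_rq->h_load * se->load.weight)
				>> SCHED_FIXEDPOINT_SHIFT;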

My main TODO items for the next period of time are likely going to
be testing, testing, and testing. I hope to flush out any corner
cases I can find, make sure performance does not regress with any
workload, and hopefully improve it for some.

Other TODO items:
- More code cleanups.
- Remove some more now unused code.
- Reimplement CONFIG_CFS_BANDWIDTH.

Plan for the CONFIG_CFS_BANDWIDTH reimplementation:
- When a cgroup gets throttled, mark the cgroup and its children
  as throttled.
- When pick_next_entity finds a task that is on a throttled cgroup,
  stash it on the cgroup runqueue (which is not used for runnable
  tasks any more). Leave the vruntime unchanged, and adjust that
  runqueue's vruntime to be that of the left-most task.
- When a cgroup gets unthrottled, and has tasks on it, place it on
  a vruntime ordered heap separate from the main runqueue.
- Have pick_next_task_fair grab one task off that heap each time it is
  called, provided the min vruntime of that heap is lower than the
  vruntime of the CPU's cfs_rq (or the CPU has no other runnable tasks).
- Place that selected task on the CPU's cfs_rq, renormalizing its
  vruntime with the GENTLE_FAIR_SLEEPERS logic. That should help
  interleave the already runnable tasks with the recently unthrottled
  group, and prevent thundering herd issues.
- If the group gets throttled again before all of its tasks have had a
  chance to run, vruntime sorting ensures all the tasks in the throttled
  cgroup get a chance to run over time (a rough sketch of the pick path
  follows below).
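
A very rough sketch of the pick side of that plan, as it could be called
from pick_next_task_fair; the heap_min_cfs_rq and pick_leftmost_task
helpers and the rq->unthrottled_heap field are made up for illustration,
this is not code from the series:

	static void pick_one_unthrottled(struct rq *rq)
	{
		/* hypothetical vruntime ordered heap of unthrottled groups */
		struct cfs_rq *grp = heap_min_cfs_rq(&rq->unthrottled_heap);
		struct task_struct *p;

		if (!grp)
			return;

		/* only interleave when the stashed tasks are behind */
		if (rq->cfs.nr_running &&
		    grp->min_vruntime >= rq->cfs.min_vruntime)
			return;

		p = pick_leftmost_task(grp);

		/* renormalize like a waking sleeper (GENTLE_FAIR_SLEEPERS) */
		place_entity(&rq->cfs, &p->se, 0);
		enqueue_entity(&rq->cfs, &p->se, ENQUEUE_WAKEUP);
	}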

Changes from v3:
- replace max_h_load with another hacky idea to ramp up the
  task_se_h_weight; I believe this new idea is wrong as well, but
  it will hopefully inspire a better solution (thanks to Peter Zijlstra)
- fix the ordering inside enqueue_task_fair to get task weights set up right
  (thanks to Peter Zijlstra)
- change wakeup_preempt_entity to reduce the number of task preemptions,
  hopefully resulting in behavior closer to what people configure in sysctl
- various other small cleanups and fixes

Changes from v2:
- fixed the web server performance regression, in a way vaguely similar
  to what Josef Bacik suggested (blame me for the implementation)
- removed some code duplication so the diffstat is redder than before
- propagate sum_exec_runtime up the tree, in preparation for CFS_BANDWIDTH
- small cleanups left and right

Changes from v1:
- use task_se_h_weight instead of task_se_h_load in calc_delta_fair
  and sched_slice, this seems to improve performance a little, but
  I still have some remaining regression to chase with our web server
  workload
- implement a number of the changes suggested by Dietmar Eggemann
  (still holding out for a better name for group_cfs_rq_of_parent)

This series applies on top of 5.2.

 include/linux/sched.h |    7 
 kernel/sched/core.c   |    3 
 kernel/sched/debug.c  |   15 
 kernel/sched/fair.c   |  803 +++++++++++++++++++++-----------------------------
 kernel/sched/pelt.c   |   68 +---
 kernel/sched/pelt.h   |    2 
 kernel/sched/sched.h  |    9 
 7 files changed, 372 insertions(+), 535 deletions(-)




* [PATCH 01/15] sched: introduce task_se_h_load helper
  2019-08-22  2:17 [PATCH RFC v4 0/15] sched,fair: flatten CPU controller runqueues Rik van Riel
@ 2019-08-22  2:17 ` Rik van Riel
  2019-08-23 18:13   ` Dietmar Eggemann
  2019-08-22  2:17 ` [PATCH 02/15] sched: change /proc/sched_debug fields Rik van Riel
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2019-08-22  2:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, pjt, dietmar.eggemann, peterz, mingo,
	morten.rasmussen, tglx, mgorman, vincent.guittot, Rik van Riel,
	Josef Bacik

Sometimes the hierarchical load of a sched_entity needs to be calculated.
Rename task_h_load to task_se_h_load, and directly pass a sched_entity to
that function.

Move the function declaration up above where it will be used later.

No functional changes.

Signed-off-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 kernel/sched/fair.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f5e528..eadf9a96b3e1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -242,6 +242,7 @@ static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight
 
 
 const struct sched_class fair_sched_class;
+static unsigned long task_se_h_load(struct sched_entity *se);
 
 /**************************************************************
  * CFS operations on generic schedulable entities:
@@ -706,7 +707,6 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 #ifdef CONFIG_SMP
 
 static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
-static unsigned long task_h_load(struct task_struct *p);
 static unsigned long capacity_of(int cpu);
 
 /* Give new sched_entity start runnable values to heavy its load in infant time */
@@ -1668,7 +1668,7 @@ static void task_numa_compare(struct task_numa_env *env,
 	/*
 	 * In the overloaded case, try and keep the load balanced.
 	 */
-	load = task_h_load(env->p) - task_h_load(cur);
+	load = task_se_h_load(env->p->se) - task_se_h_load(cur->se);
 	if (!load)
 		goto assign;
 
@@ -1706,7 +1706,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 	bool maymove = false;
 	int cpu;
 
-	load = task_h_load(env->p);
+	load = task_se_h_load(env->p->se);
 	dst_load = env->dst_stats.load + load;
 	src_load = env->src_stats.load - load;
 
@@ -3389,7 +3389,7 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
  * avg. The immediate corollary is that all (fair) tasks must be attached, see
  * post_init_entity_util_avg().
  *
- * cfs_rq->avg is used for task_h_load() and update_cfs_share() for example.
+ * cfs_rq->avg is used for task_se_h_load() and update_cfs_share() for example.
  *
  * Returns true if the load decayed or we removed load.
  *
@@ -3522,7 +3522,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 
 	/*
 	 * Track task load average for carrying it to new CPU after migrated, and
-	 * track group sched_entity load average for task_h_load calc in migration
+	 * track group sched_entity load average for task_se_h_load calc in migration
 	 */
 	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
 		__update_load_avg_se(now, cfs_rq, se);
@@ -3751,7 +3751,7 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
 		return;
 	}
 
-	rq->misfit_task_load = task_h_load(p);
+	rq->misfit_task_load = task_se_h_load(&p->se);
 }
 
 #else /* CONFIG_SMP */
@@ -5739,7 +5739,7 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 	this_eff_load = target_load(this_cpu, sd->wake_idx);
 
 	if (sync) {
-		unsigned long current_load = task_h_load(current);
+		unsigned long current_load = task_se_h_load(&current->se);
 
 		if (current_load > this_eff_load)
 			return this_cpu;
@@ -5747,7 +5747,7 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 		this_eff_load -= current_load;
 	}
 
-	task_load = task_h_load(p);
+	task_load = task_se_h_load(&p->se);
 
 	this_eff_load += task_load;
 	if (sched_feat(WA_BIAS))
@@ -7600,7 +7600,7 @@ static int detach_tasks(struct lb_env *env)
 		if (!can_migrate_task(p, env))
 			goto next;
 
-		load = task_h_load(p);
+		load = task_se_h_load(&p->se);
 
 		if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
 			goto next;
@@ -7833,12 +7833,12 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 	}
 }
 
-static unsigned long task_h_load(struct task_struct *p)
+static unsigned long task_se_h_load(struct sched_entity *se)
 {
-	struct cfs_rq *cfs_rq = task_cfs_rq(p);
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 	update_cfs_rq_h_load(cfs_rq);
-	return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
+	return div64_ul(se->avg.load_avg * cfs_rq->h_load,
 			cfs_rq_load_avg(cfs_rq) + 1);
 }
 #else
@@ -7865,9 +7865,9 @@ static inline void update_blocked_averages(int cpu)
 	rq_unlock_irqrestore(rq, &rf);
 }
 
-static unsigned long task_h_load(struct task_struct *p)
+static unsigned long task_se_h_load(struct sched_entity *se)
 {
-	return p->se.avg.load_avg;
+	return se->avg.load_avg;
 }
 #endif
 
-- 
2.20.1



* [PATCH 02/15] sched: change /proc/sched_debug fields
  2019-08-22  2:17 [PATCH RFC v4 0/15] sched,fair: flatten CPU controller runqueues Rik van Riel
  2019-08-22  2:17 ` [PATCH 01/15] sched: introduce task_se_h_load helper Rik van Riel
@ 2019-08-22  2:17 ` Rik van Riel
  2019-08-22  2:17 ` [PATCH 03/15] sched,fair: redefine runnable_load_avg as the sum of task_h_load Rik van Riel
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2019-08-22  2:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, pjt, dietmar.eggemann, peterz, mingo,
	morten.rasmussen, tglx, mgorman, vincent.guittot, Rik van Riel,
	Josef Bacik

Remove some fields from /proc/sched_debug that will be removed from
sched_entity in a subsequent patch, and add h_load, which comes in
very handy for debugging CPU controller weight distribution.

Signed-off-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 kernel/sched/debug.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 14c6a8716ba1..f6beaca97a09 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -416,11 +416,9 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	}
 
 	P(se->load.weight);
-	P(se->runnable_weight);
 #ifdef CONFIG_SMP
 	P(se->avg.load_avg);
 	P(se->avg.util_avg);
-	P(se->avg.runnable_load_avg);
 #endif
 
 #undef PN_SCHEDSTAT
@@ -538,7 +536,6 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
 	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
 #ifdef CONFIG_SMP
-	SEQ_printf(m, "  .%-30s: %ld\n", "runnable_weight", cfs_rq->runnable_weight);
 	SEQ_printf(m, "  .%-30s: %lu\n", "load_avg",
 			cfs_rq->avg.load_avg);
 	SEQ_printf(m, "  .%-30s: %lu\n", "runnable_load_avg",
@@ -547,17 +544,15 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			cfs_rq->avg.util_avg);
 	SEQ_printf(m, "  .%-30s: %u\n", "util_est_enqueued",
 			cfs_rq->avg.util_est.enqueued);
-	SEQ_printf(m, "  .%-30s: %ld\n", "removed.load_avg",
-			cfs_rq->removed.load_avg);
 	SEQ_printf(m, "  .%-30s: %ld\n", "removed.util_avg",
 			cfs_rq->removed.util_avg);
-	SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_sum",
-			cfs_rq->removed.runnable_sum);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	SEQ_printf(m, "  .%-30s: %lu\n", "tg_load_avg_contrib",
 			cfs_rq->tg_load_avg_contrib);
 	SEQ_printf(m, "  .%-30s: %ld\n", "tg_load_avg",
 			atomic_long_read(&cfs_rq->tg->load_avg));
+	SEQ_printf(m, "  .%-30s: %lu\n", "h_load",
+			cfs_rq->h_load);
 #endif
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
@@ -961,10 +956,8 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		   "nr_involuntary_switches", (long long)p->nivcsw);
 
 	P(se.load.weight);
-	P(se.runnable_weight);
 #ifdef CONFIG_SMP
 	P(se.avg.load_sum);
-	P(se.avg.runnable_load_sum);
 	P(se.avg.util_sum);
 	P(se.avg.load_avg);
 	P(se.avg.runnable_load_avg);
-- 
2.20.1



* [PATCH 03/15] sched,fair: redefine runnable_load_avg as the sum of task_h_load
  2019-08-22  2:17 [PATCH RFC v4 0/15] sched,fair: flatten CPU controller runqueues Rik van Riel
  2019-08-22  2:17 ` [PATCH 01/15] sched: introduce task_se_h_load helper Rik van Riel
  2019-08-22  2:17 ` [PATCH 02/15] sched: change /proc/sched_debug fields Rik van Riel
@ 2019-08-22  2:17 ` Rik van Riel
  2019-08-28 13:50   ` Vincent Guittot
  2019-08-22  2:17 ` [PATCH 04/15] sched,fair: move runnable_load_avg to cfs_rq Rik van Riel
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2019-08-22  2:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, pjt, dietmar.eggemann, peterz, mingo,
	morten.rasmussen, tglx, mgorman, vincent.guittot, Rik van Riel

The runnable_load magic is used to quickly propagate information about
runnable tasks up the hierarchy of runqueues. The runnable_load_avg is
mostly used for the load balancing code, which only examines the value at
the root cfs_rq.

Redefine the root cfs_rq runnable_load_avg to be the sum of task_h_loads
of the runnable tasks. This works because the hierarchical runnable_load of
a task is already equal to the task_se_h_load today. This provides enough
information to the load balancer.
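
As a rough illustration with made-up numbers, task_se_h_load() computes

	se->avg.load_avg * cfs_rq->h_load / (cfs_rq_load_avg(cfs_rq) + 1)

so two always-runnable tasks (load_avg ~1024 each) in a cgroup whose
group entity contributes ~512 of load to the root cfs_rq (h_load ~512,
group cfs_rq load_avg ~2048) each contribute ~1024 * 512 / 2049 ~= 256,
and the root runnable_load_avg ends up ~512, i.e. the load the cgroup
as a whole presents to the CPU.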

The runnable_load_avg of the cgroup cfs_rqs does not appear to be
used for anything, so don't bother calculating those.

This removes one of the things that the code currently traverses the
cgroup hierarchy for, and getting rid of it brings us one step closer
to a flat runqueue for the CPU controller.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 include/linux/sched.h |   3 +-
 kernel/sched/core.c   |   2 -
 kernel/sched/debug.c  |   1 +
 kernel/sched/fair.c   | 125 +++++++++++++-----------------------------
 kernel/sched/pelt.c   |  64 ++++++---------------
 kernel/sched/sched.h  |   6 --
 6 files changed, 56 insertions(+), 145 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 11837410690f..f5bb6948e40c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -391,7 +391,6 @@ struct util_est {
 struct sched_avg {
 	u64				last_update_time;
 	u64				load_sum;
-	u64				runnable_load_sum;
 	u32				util_sum;
 	u32				period_contrib;
 	unsigned long			load_avg;
@@ -439,7 +438,6 @@ struct sched_statistics {
 struct sched_entity {
 	/* For load-balancing: */
 	struct load_weight		load;
-	unsigned long			runnable_weight;
 	struct rb_node			run_node;
 	struct list_head		group_node;
 	unsigned int			on_rq;
@@ -455,6 +453,7 @@ struct sched_entity {
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	int				depth;
+	unsigned long			enqueued_h_load;
 	struct sched_entity		*parent;
 	/* rq on which this entity is (to be) queued: */
 	struct cfs_rq			*cfs_rq;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c427742a9..fbd96900f715 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -744,7 +744,6 @@ static void set_load_weight(struct task_struct *p, bool update_load)
 	if (task_has_idle_policy(p)) {
 		load->weight = scale_load(WEIGHT_IDLEPRIO);
 		load->inv_weight = WMULT_IDLEPRIO;
-		p->se.runnable_weight = load->weight;
 		return;
 	}
 
@@ -757,7 +756,6 @@ static void set_load_weight(struct task_struct *p, bool update_load)
 	} else {
 		load->weight = scale_load(sched_prio_to_weight[prio]);
 		load->inv_weight = sched_prio_to_wmult[prio];
-		p->se.runnable_weight = load->weight;
 	}
 }
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index f6beaca97a09..cefc1b171c0b 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -962,6 +962,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 	P(se.avg.load_avg);
 	P(se.avg.runnable_load_avg);
 	P(se.avg.util_avg);
+	P(se.enqueued_h_load);
 	P(se.avg.last_update_time);
 	P(se.avg.util_est.ewma);
 	P(se.avg.util_est.enqueued);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eadf9a96b3e1..30afeda1620f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -723,9 +723,7 @@ void init_entity_runnable_average(struct sched_entity *se)
 	 * nothing has been attached to the task group yet.
 	 */
 	if (entity_is_task(se))
-		sa->runnable_load_avg = sa->load_avg = scale_load_down(se->load.weight);
-
-	se->runnable_weight = se->load.weight;
+		sa->load_avg = scale_load_down(se->load.weight);
 
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
 }
@@ -2766,20 +2764,39 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 static inline void
 enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	cfs_rq->runnable_weight += se->runnable_weight;
+	if (entity_is_task(se)) {
+		struct cfs_rq *root_cfs_rq = &cfs_rq->rq->cfs;
+		se->enqueued_h_load = task_se_h_load(se);
 
-	cfs_rq->avg.runnable_load_avg += se->avg.runnable_load_avg;
-	cfs_rq->avg.runnable_load_sum += se_runnable(se) * se->avg.runnable_load_sum;
+		root_cfs_rq->avg.runnable_load_avg += se->enqueued_h_load;
+	}
 }
 
 static inline void
 dequeue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	cfs_rq->runnable_weight -= se->runnable_weight;
+	if (entity_is_task(se)) {
+		struct cfs_rq *root_cfs_rq = &cfs_rq->rq->cfs;
+		sub_positive(&root_cfs_rq->avg.runnable_load_avg,
+					se->enqueued_h_load);
+	}
+}
+
+static inline void
+update_runnable_load_avg(struct sched_entity *se)
+{
+	struct cfs_rq *root_cfs_rq = &cfs_rq_of(se)->rq->cfs;
+	long new_h_load, delta;
 
-	sub_positive(&cfs_rq->avg.runnable_load_avg, se->avg.runnable_load_avg);
-	sub_positive(&cfs_rq->avg.runnable_load_sum,
-		     se_runnable(se) * se->avg.runnable_load_sum);
+	SCHED_WARN_ON(!entity_is_task(se));
+
+	if (!se->on_rq)
+		return;
+
+	new_h_load = task_se_h_load(se);
+	delta = new_h_load - se->enqueued_h_load;
+	root_cfs_rq->avg.runnable_load_avg += delta;
+	se->enqueued_h_load = new_h_load;
 }
 
 static inline void
@@ -2807,7 +2824,7 @@ dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
 #endif
 
 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
-			    unsigned long weight, unsigned long runnable)
+			    unsigned long weight)
 {
 	if (se->on_rq) {
 		/* commit outstanding execution time */
@@ -2818,7 +2835,6 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 	}
 	dequeue_load_avg(cfs_rq, se);
 
-	se->runnable_weight = runnable;
 	update_load_set(&se->load, weight);
 
 #ifdef CONFIG_SMP
@@ -2826,8 +2842,6 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 		u32 divider = LOAD_AVG_MAX - 1024 + se->avg.period_contrib;
 
 		se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider);
-		se->avg.runnable_load_avg =
-			div_u64(se_runnable(se) * se->avg.runnable_load_sum, divider);
 	} while (0);
 #endif
 
@@ -2845,7 +2859,7 @@ void reweight_task(struct task_struct *p, int prio)
 	struct load_weight *load = &se->load;
 	unsigned long weight = scale_load(sched_prio_to_weight[prio]);
 
-	reweight_entity(cfs_rq, se, weight, weight);
+	reweight_entity(cfs_rq, se, weight);
 	load->inv_weight = sched_prio_to_wmult[prio];
 }
 
@@ -2958,49 +2972,6 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
 	return clamp_t(long, shares, MIN_SHARES, tg_shares);
 }
 
-/*
- * This calculates the effective runnable weight for a group entity based on
- * the group entity weight calculated above.
- *
- * Because of the above approximation (2), our group entity weight is
- * an load_avg based ratio (3). This means that it includes blocked load and
- * does not represent the runnable weight.
- *
- * Approximate the group entity's runnable weight per ratio from the group
- * runqueue:
- *
- *					     grq->avg.runnable_load_avg
- *   ge->runnable_weight = ge->load.weight * -------------------------- (7)
- *						 grq->avg.load_avg
- *
- * However, analogous to above, since the avg numbers are slow, this leads to
- * transients in the from-idle case. Instead we use:
- *
- *   ge->runnable_weight = ge->load.weight *
- *
- *		max(grq->avg.runnable_load_avg, grq->runnable_weight)
- *		-----------------------------------------------------	(8)
- *		      max(grq->avg.load_avg, grq->load.weight)
- *
- * Where these max() serve both to use the 'instant' values to fix the slow
- * from-idle and avoid the /0 on to-idle, similar to (6).
- */
-static long calc_group_runnable(struct cfs_rq *cfs_rq, long shares)
-{
-	long runnable, load_avg;
-
-	load_avg = max(cfs_rq->avg.load_avg,
-		       scale_load_down(cfs_rq->load.weight));
-
-	runnable = max(cfs_rq->avg.runnable_load_avg,
-		       scale_load_down(cfs_rq->runnable_weight));
-
-	runnable *= shares;
-	if (load_avg)
-		runnable /= load_avg;
-
-	return clamp_t(long, runnable, MIN_SHARES, shares);
-}
 #endif /* CONFIG_SMP */
 
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
@@ -3012,25 +2983,26 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 static void update_cfs_group(struct sched_entity *se)
 {
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
-	long shares, runnable;
+	long shares;
 
-	if (!gcfs_rq)
+	if (!gcfs_rq) {
+		update_runnable_load_avg(se);
 		return;
+	}
 
 	if (throttled_hierarchy(gcfs_rq))
 		return;
 
 #ifndef CONFIG_SMP
-	runnable = shares = READ_ONCE(gcfs_rq->tg->shares);
+	shares = READ_ONCE(gcfs_rq->tg->shares);
 
 	if (likely(se->load.weight == shares))
 		return;
 #else
 	shares   = calc_group_shares(gcfs_rq);
-	runnable = calc_group_runnable(gcfs_rq, shares);
 #endif
 
-	reweight_entity(cfs_rq_of(se), se, shares, runnable);
+	reweight_entity(cfs_rq_of(se), se, shares);
 }
 
 #else /* CONFIG_FAIR_GROUP_SCHED */
@@ -3243,8 +3215,8 @@ static inline void
 update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
 {
 	long delta_avg, running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
-	unsigned long runnable_load_avg, load_avg;
-	u64 runnable_load_sum, load_sum = 0;
+	unsigned long load_avg;
+	u64 load_sum = 0;
 	s64 delta_sum;
 
 	if (!runnable_sum)
@@ -3292,19 +3264,6 @@ update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cf
 	se->avg.load_avg = load_avg;
 	add_positive(&cfs_rq->avg.load_avg, delta_avg);
 	add_positive(&cfs_rq->avg.load_sum, delta_sum);
-
-	runnable_load_sum = (s64)se_runnable(se) * runnable_sum;
-	runnable_load_avg = div_s64(runnable_load_sum, LOAD_AVG_MAX);
-	delta_sum = runnable_load_sum - se_weight(se) * se->avg.runnable_load_sum;
-	delta_avg = runnable_load_avg - se->avg.runnable_load_avg;
-
-	se->avg.runnable_load_sum = runnable_sum;
-	se->avg.runnable_load_avg = runnable_load_avg;
-
-	if (se->on_rq) {
-		add_positive(&cfs_rq->avg.runnable_load_avg, delta_avg);
-		add_positive(&cfs_rq->avg.runnable_load_sum, delta_sum);
-	}
 }
 
 static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum)
@@ -3399,7 +3358,7 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
 static inline int
 update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 {
-	unsigned long removed_load = 0, removed_util = 0, removed_runnable_sum = 0;
+	unsigned long removed_load = 0, removed_util = 0;
 	struct sched_avg *sa = &cfs_rq->avg;
 	int decayed = 0;
 
@@ -3410,7 +3369,6 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 		raw_spin_lock(&cfs_rq->removed.lock);
 		swap(cfs_rq->removed.util_avg, removed_util);
 		swap(cfs_rq->removed.load_avg, removed_load);
-		swap(cfs_rq->removed.runnable_sum, removed_runnable_sum);
 		cfs_rq->removed.nr = 0;
 		raw_spin_unlock(&cfs_rq->removed.lock);
 
@@ -3422,8 +3380,6 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 		sub_positive(&sa->util_avg, r);
 		sub_positive(&sa->util_sum, r * divider);
 
-		add_tg_cfs_propagate(cfs_rq, -(long)removed_runnable_sum);
-
 		decayed = 1;
 	}
 
@@ -3477,8 +3433,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 			div_u64(se->avg.load_avg * se->avg.load_sum, se_weight(se));
 	}
 
-	se->avg.runnable_load_sum = se->avg.load_sum;
-
 	enqueue_load_avg(cfs_rq, se);
 	cfs_rq->avg.util_avg += se->avg.util_avg;
 	cfs_rq->avg.util_sum += se->avg.util_sum;
@@ -7735,9 +7689,6 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	if (cfs_rq->avg.util_sum)
 		return false;
 
-	if (cfs_rq->avg.runnable_load_sum)
-		return false;
-
 	return true;
 }
 
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index befce29bd882..b3f3e8b29394 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -106,7 +106,7 @@ static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
  */
 static __always_inline u32
 accumulate_sum(u64 delta, struct sched_avg *sa,
-	       unsigned long load, unsigned long runnable, int running)
+	       unsigned long load, int running)
 {
 	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
 	u64 periods;
@@ -119,8 +119,6 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
 	 */
 	if (periods) {
 		sa->load_sum = decay_load(sa->load_sum, periods);
-		sa->runnable_load_sum =
-			decay_load(sa->runnable_load_sum, periods);
 		sa->util_sum = decay_load((u64)(sa->util_sum), periods);
 
 		/*
@@ -134,8 +132,6 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
 
 	if (load)
 		sa->load_sum += load * contrib;
-	if (runnable)
-		sa->runnable_load_sum += runnable * contrib;
 	if (running)
 		sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
 
@@ -172,7 +168,7 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
  */
 static __always_inline int
 ___update_load_sum(u64 now, struct sched_avg *sa,
-		  unsigned long load, unsigned long runnable, int running)
+		  unsigned long load, int running)
 {
 	u64 delta;
 
@@ -206,7 +202,7 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
 	 * update_blocked_averages()
 	 */
 	if (!load)
-		runnable = running = 0;
+		running = 0;
 
 	/*
 	 * Now we know we crossed measurement unit boundaries. The *_avg
@@ -215,14 +211,14 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
 	 * Step 1: accumulate *_sum since last_update_time. If we haven't
 	 * crossed period boundaries, finish.
 	 */
-	if (!accumulate_sum(delta, sa, load, runnable, running))
+	if (!accumulate_sum(delta, sa, load, running))
 		return 0;
 
 	return 1;
 }
 
 static __always_inline void
-___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runnable)
+___update_load_avg(struct sched_avg *sa, unsigned long load)
 {
 	u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
 
@@ -230,41 +226,25 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna
 	 * Step 2: update *_avg.
 	 */
 	sa->load_avg = div_u64(load * sa->load_sum, divider);
-	sa->runnable_load_avg =	div_u64(runnable * sa->runnable_load_sum, divider);
 	WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
 }
 
 /*
  * sched_entity:
  *
- *   task:
- *     se_runnable() == se_weight()
- *
- *   group: [ see update_cfs_group() ]
- *     se_weight()   = tg->weight * grq->load_avg / tg->load_avg
- *     se_runnable() = se_weight(se) * grq->runnable_load_avg / grq->load_avg
- *
  *   load_sum := runnable_sum
  *   load_avg = se_weight(se) * runnable_avg
  *
- *   runnable_load_sum := runnable_sum
- *   runnable_load_avg = se_runnable(se) * runnable_avg
- *
- * XXX collapse load_sum and runnable_load_sum
- *
  * cfq_rq:
  *
  *   load_sum = \Sum se_weight(se) * se->avg.load_sum
  *   load_avg = \Sum se->avg.load_avg
- *
- *   runnable_load_sum = \Sum se_runnable(se) * se->avg.runnable_load_sum
- *   runnable_load_avg = \Sum se->avg.runable_load_avg
  */
 
 int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
 {
-	if (___update_load_sum(now, &se->avg, 0, 0, 0)) {
-		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
+	if (___update_load_sum(now, &se->avg, 0, 0)) {
+		___update_load_avg(&se->avg, se_weight(se));
 		return 1;
 	}
 
@@ -273,10 +253,10 @@ int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
 
 int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if (___update_load_sum(now, &se->avg, !!se->on_rq, !!se->on_rq,
+	if (___update_load_sum(now, &se->avg, !!se->on_rq,
 				cfs_rq->curr == se)) {
 
-		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
+		___update_load_avg(&se->avg, se_weight(se));
 		cfs_se_util_change(&se->avg);
 		return 1;
 	}
@@ -288,10 +268,9 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
 {
 	if (___update_load_sum(now, &cfs_rq->avg,
 				scale_load_down(cfs_rq->load.weight),
-				scale_load_down(cfs_rq->runnable_weight),
 				cfs_rq->curr != NULL)) {
 
-		___update_load_avg(&cfs_rq->avg, 1, 1);
+		___update_load_avg(&cfs_rq->avg, 1);
 		return 1;
 	}
 
@@ -303,20 +282,18 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
  *
  *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
  *   util_sum = cpu_scale * load_sum
- *   runnable_load_sum = load_sum
  *
- *   load_avg and runnable_load_avg are not supported and meaningless.
+ *   load_avg is not supported and meaningless.
  *
  */
 
 int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
 {
 	if (___update_load_sum(now, &rq->avg_rt,
-				running,
 				running,
 				running)) {
 
-		___update_load_avg(&rq->avg_rt, 1, 1);
+		___update_load_avg(&rq->avg_rt, 1);
 		return 1;
 	}
 
@@ -328,18 +305,16 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
  *
  *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
  *   util_sum = cpu_scale * load_sum
- *   runnable_load_sum = load_sum
  *
  */
 
 int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
 {
 	if (___update_load_sum(now, &rq->avg_dl,
-				running,
 				running,
 				running)) {
 
-		___update_load_avg(&rq->avg_dl, 1, 1);
+		___update_load_avg(&rq->avg_dl, 1);
 		return 1;
 	}
 
@@ -352,7 +327,6 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
  *
  *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
  *   util_sum = cpu_scale * load_sum
- *   runnable_load_sum = load_sum
  *
  */
 
@@ -379,17 +353,11 @@ int update_irq_load_avg(struct rq *rq, u64 running)
 	 * We can safely remove running from rq->clock because
 	 * rq->clock += delta with delta >= running
 	 */
-	ret = ___update_load_sum(rq->clock - running, &rq->avg_irq,
-				0,
-				0,
-				0);
-	ret += ___update_load_sum(rq->clock, &rq->avg_irq,
-				1,
-				1,
-				1);
+	ret = ___update_load_sum(rq->clock - running, &rq->avg_irq, 0, 0);
+	ret += ___update_load_sum(rq->clock, &rq->avg_irq, 1, 1);
 
 	if (ret)
-		___update_load_avg(&rq->avg_irq, 1, 1);
+		___update_load_avg(&rq->avg_irq, 1);
 
 	return ret;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1ada0be..5be14cee61f9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -487,7 +487,6 @@ struct cfs_bandwidth { };
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight	load;
-	unsigned long		runnable_weight;
 	unsigned int		nr_running;
 	unsigned int		h_nr_running;
 
@@ -700,11 +699,6 @@ static inline long se_weight(struct sched_entity *se)
 	return scale_load_down(se->load.weight);
 }
 
-static inline long se_runnable(struct sched_entity *se)
-{
-	return scale_load_down(se->runnable_weight);
-}
-
 static inline bool sched_asym_prefer(int a, int b)
 {
 	return arch_asym_cpu_priority(a) > arch_asym_cpu_priority(b);
-- 
2.20.1



* [PATCH 04/15] sched,fair: move runnable_load_avg to cfs_rq
  2019-08-22  2:17 [PATCH RFC v4 0/15] sched,fair: flatten CPU controller runqueues Rik van Riel
                   ` (2 preceding siblings ...)
  2019-08-22  2:17 ` [PATCH 03/15] sched,fair: redefine runnable_load_avg as the sum of task_h_load Rik van Riel
@ 2019-08-22  2:17 ` Rik van Riel
  2019-08-22  2:17 ` [PATCH 05/15] sched,fair: remove cfs_rqs from leaf_cfs_rq_list bottom up Rik van Riel
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2019-08-22  2:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, pjt, dietmar.eggemann, peterz, mingo,
	morten.rasmussen, tglx, mgorman, vincent.guittot, Rik van Riel

Since only the root cfs_rq runnable_load_avg field is used any more,
we can move the field from struct sched_avg, which every sched_entity
has one of, directly into the struct cfs_rq, of which we have way fewer.

No functional changes.

Suggested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
---
 include/linux/sched.h | 1 -
 kernel/sched/debug.c  | 3 +--
 kernel/sched/fair.c   | 8 ++++----
 kernel/sched/sched.h  | 1 +
 4 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f5bb6948e40c..84a6cc6f5c47 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -394,7 +394,6 @@ struct sched_avg {
 	u32				util_sum;
 	u32				period_contrib;
 	unsigned long			load_avg;
-	unsigned long			runnable_load_avg;
 	unsigned long			util_avg;
 	struct util_est			util_est;
 } ____cacheline_aligned;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index cefc1b171c0b..6e7c8ff210a8 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -539,7 +539,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, "  .%-30s: %lu\n", "load_avg",
 			cfs_rq->avg.load_avg);
 	SEQ_printf(m, "  .%-30s: %lu\n", "runnable_load_avg",
-			cfs_rq->avg.runnable_load_avg);
+			cfs_rq->runnable_load_avg);
 	SEQ_printf(m, "  .%-30s: %lu\n", "util_avg",
 			cfs_rq->avg.util_avg);
 	SEQ_printf(m, "  .%-30s: %u\n", "util_est_enqueued",
@@ -960,7 +960,6 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 	P(se.avg.load_sum);
 	P(se.avg.util_sum);
 	P(se.avg.load_avg);
-	P(se.avg.runnable_load_avg);
 	P(se.avg.util_avg);
 	P(se.enqueued_h_load);
 	P(se.avg.last_update_time);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 30afeda1620f..a48d0dbfc232 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2768,7 +2768,7 @@ enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		struct cfs_rq *root_cfs_rq = &cfs_rq->rq->cfs;
 		se->enqueued_h_load = task_se_h_load(se);
 
-		root_cfs_rq->avg.runnable_load_avg += se->enqueued_h_load;
+		root_cfs_rq->runnable_load_avg += se->enqueued_h_load;
 	}
 }
 
@@ -2777,7 +2777,7 @@ dequeue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	if (entity_is_task(se)) {
 		struct cfs_rq *root_cfs_rq = &cfs_rq->rq->cfs;
-		sub_positive(&root_cfs_rq->avg.runnable_load_avg,
+		sub_positive(&root_cfs_rq->runnable_load_avg,
 					se->enqueued_h_load);
 	}
 }
@@ -2795,7 +2795,7 @@ update_runnable_load_avg(struct sched_entity *se)
 
 	new_h_load = task_se_h_load(se);
 	delta = new_h_load - se->enqueued_h_load;
-	root_cfs_rq->avg.runnable_load_avg += delta;
+	root_cfs_rq->runnable_load_avg += delta;
 	se->enqueued_h_load = new_h_load;
 }
 
@@ -3561,7 +3561,7 @@ static void remove_entity_load_avg(struct sched_entity *se)
 
 static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq)
 {
-	return cfs_rq->avg.runnable_load_avg;
+	return cfs_rq->runnable_load_avg;
 }
 
 static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5be14cee61f9..32978a8de8ce 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -516,6 +516,7 @@ struct cfs_rq {
 	 * CFS load tracking
 	 */
 	struct sched_avg	avg;
+	unsigned long		runnable_load_avg;
 #ifndef CONFIG_64BIT
 	u64			load_last_update_time_copy;
 #endif
-- 
2.20.1



* [PATCH 05/15] sched,fair: remove cfs_rqs from leaf_cfs_rq_list bottom up
  2019-08-22  2:17 [PATCH RFC v4 0/15] sched,fair: flatten CPU controller runqueues Rik van Riel
                   ` (3 preceding siblings ...)
  2019-08-22  2:17 ` [PATCH 04/15] sched,fair: move runnable_load_avg to cfs_rq Rik van Riel
@ 2019-08-22  2:17 ` Rik van Riel
  2019-08-28 14:09   ` Vincent Guittot
  2019-08-22  2:17 ` [PATCH 06/15] sched,cfs: use explicit cfs_rq of parent se helper Rik van Riel
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2019-08-22  2:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, pjt, dietmar.eggemann, peterz, mingo,
	morten.rasmussen, tglx, mgorman, vincent.guittot, Rik van Riel

The overhead of the CPU controller is reduced by not walking
all the sched_entities every time a task is enqueued or dequeued.

One of the things being checked every single time is whether the cfs_rq
is on the rq->leaf_cfs_rq_list.

By removing a cfs_rq from the list only once it no longer has children
on the list, we can avoid walking the sched_entity hierarchy whenever the
bottom cfs_rq is already on the list, once the runqueues have been flattened.

Signed-off-by: Rik van Riel <riel@surriel.com>
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 35 ++++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a48d0dbfc232..04b216234265 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -369,6 +369,39 @@ static inline void assert_list_leaf_cfs_rq(struct rq *rq)
 	SCHED_WARN_ON(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list);
 }
 
+/*
+ * Because list_add_leaf_cfs_rq always places a child cfs_rq on the list
+ * immediately before a parent cfs_rq, and cfs_rqs are removed from the list
+ * bottom-up, we only have to test whether the cfs_rq before us on the list
+ * is our child.
+ */
+static inline bool child_cfs_rq_on_list(struct cfs_rq *cfs_rq)
+{
+	struct cfs_rq *prev_cfs_rq;
+	struct list_head *prev;
+
+	prev = cfs_rq->leaf_cfs_rq_list.prev;
+	prev_cfs_rq = container_of(prev, struct cfs_rq, leaf_cfs_rq_list);
+
+	return (prev_cfs_rq->tg->parent == cfs_rq->tg);
+}
+
+/*
+ * Remove a cfs_rq from the list if it has no children on the list.
+ * The scheduler iterates over the list regularly; if conditions for
+ * removal are still true, we'll get to this cfs_rq in the future.
+ */
+static inline void list_del_leaf_cfs_rq_bottom(struct cfs_rq *cfs_rq)
+{
+	if (!cfs_rq->on_list)
+		return;
+
+	if (child_cfs_rq_on_list(cfs_rq))
+		return;
+
+	list_del_leaf_cfs_rq(cfs_rq);
+}
+
 /* Iterate thr' all leaf cfs_rq's on a runqueue */
 #define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)			\
 	list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list,	\
@@ -7723,7 +7756,7 @@ static void update_blocked_averages(int cpu)
 		 * decayed cfs_rqs linger on the list.
 		 */
 		if (cfs_rq_is_decayed(cfs_rq))
-			list_del_leaf_cfs_rq(cfs_rq);
+			list_del_leaf_cfs_rq_bottom(cfs_rq);
 
 		/* Don't need periodic decay once load/util_avg are null */
 		if (cfs_rq_has_blocked(cfs_rq))
-- 
2.20.1



* [PATCH 06/15] sched,cfs: use explicit cfs_rq of parent se helper
  2019-08-22  2:17 [PATCH RFC v4 0/15] sched,fair: flatten CPU controller runqueues Rik van Riel
                   ` (4 preceding siblings ...)
  2019-08-22  2:17 ` [PATCH 05/15] sched,fair: remove cfs_rqs from leaf_cfs_rq_list bottom up Rik van Riel
@ 2019-08-22  2:17 ` Rik van Riel
  2019-08-28 13:53   ` Vincent Guittot
  2019-08-22  2:17 ` [PATCH 07/15] sched,cfs: fix zero length timeslice calculation Rik van Riel
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2019-08-22  2:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, pjt, dietmar.eggemann, peterz, mingo,
	morten.rasmussen, tglx, mgorman, vincent.guittot, Rik van Riel

Use an explicit "cfs_rq of parent sched_entity" helper in a few
strategic places, where cfs_rq_of(se) may no longer point at the
right runqueue once we flatten the hierarchical cgroup runqueues.

No functional change.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 kernel/sched/fair.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04b216234265..31a26737a873 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -276,6 +276,15 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
 	return grp->my_q;
 }
 
+/* runqueue owned by the parent entity; the root cfs_rq for a top level se */
+static inline struct cfs_rq *group_cfs_rq_of_parent(struct sched_entity *se)
+{
+	if (se->parent)
+		return group_cfs_rq(se->parent);
+
+	return cfs_rq_of(se);
+}
+
 static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
@@ -3319,7 +3328,7 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
 
 	gcfs_rq->propagate = 0;
 
-	cfs_rq = cfs_rq_of(se);
+	cfs_rq = group_cfs_rq_of_parent(se);
 
 	add_tg_cfs_propagate(cfs_rq, gcfs_rq->prop_runnable_sum);
 
@@ -7796,7 +7805,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 
 	WRITE_ONCE(cfs_rq->h_load_next, NULL);
 	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
+		cfs_rq = group_cfs_rq_of_parent(se);
 		WRITE_ONCE(cfs_rq->h_load_next, se);
 		if (cfs_rq->last_h_load_update == now)
 			break;
@@ -7819,7 +7828,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 
 static unsigned long task_se_h_load(struct sched_entity *se)
 {
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	struct cfs_rq *cfs_rq = group_cfs_rq_of_parent(se);
 
 	update_cfs_rq_h_load(cfs_rq);
 	return div64_ul(se->avg.load_avg * cfs_rq->h_load,
@@ -10166,7 +10175,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	struct sched_entity *se = &curr->se;
 
 	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
+		cfs_rq = group_cfs_rq_of_parent(se);
 		entity_tick(cfs_rq, se, queued);
 	}
 
-- 
2.20.1



* [PATCH 07/15] sched,cfs: fix zero length timeslice calculation
  2019-08-22  2:17 [PATCH RFC v4 0/15] sched,fair: flatten CPU controller runqueues Rik van Riel
                   ` (5 preceding siblings ...)
  2019-08-22  2:17 ` [PATCH 06/15] sched,cfs: use explicit cfs_rq of parent se helper Rik van Riel
@ 2019-08-22  2:17 ` Rik van Riel
  2019-08-28 16:59   ` Vincent Guittot
  2019-08-22  2:17 ` [PATCH 08/15] sched,fair: simplify timeslice length code Rik van Riel
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2019-08-22  2:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, pjt, dietmar.eggemann, peterz, mingo,
	morten.rasmussen, tglx, mgorman, vincent.guittot, Rik van Riel

The way the time slice length is currently calculated, not only do high
priority tasks get longer time slices than low priority tasks, but due
to fixed point math, low priority tasks could end up with a zero length
time slice. This can lead to cache thrashing and other inefficiencies.

Cap the minimum time slice length to sysctl_sched_min_granularity.
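
To see how a zero length time slice can happen, simplify the calculation
(ignoring the fixed point details of __calc_delta) to

	slice = period * w / W		/* integer math, rounds down */

which truncates to 0 whenever w/W is smaller than 1/period (in
nanoseconds). With cgroup weight scaling, a low priority task in a
low-shares cgroup can end up with an effective weight of just a few
units against a runqueue load in the millions, so the division can
round all the way down to zero. With this patch the result becomes

	slice = max_t(u64, period * w / W, sysctl_sched_min_granularity);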

Tasks whose time slice ends up too long for their relative priority
will simply have their vruntime advanced much faster than that of other
tasks, and will therefore receive time slices less frequently.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 kernel/sched/fair.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 31a26737a873..8f8c85c6da9b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -732,6 +732,13 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		}
 		slice = __calc_delta(slice, se->load.weight, load);
 	}
+
+	/*
+	 * To avoid cache thrashing, run at least sysctl_sched_min_granularity.
+	 * The vruntime of a low priority task advances faster; those tasks
+	 * will simply get time slices less frequently.
+	 */
+	slice = max_t(u64, slice, sysctl_sched_min_granularity);
 	return slice;
 }
 
-- 
2.20.1



* [PATCH 08/15] sched,fair: simplify timeslice length code
  2019-08-22  2:17 [PATCH RFC v4 0/15] sched,fair: flatten CPU controller runqueues Rik van Riel
                   ` (6 preceding siblings ...)
  2019-08-22  2:17 ` [PATCH 07/15] sched,cfs: fix zero length timeslice calculation Rik van Riel
@ 2019-08-22  2:17 ` Rik van Riel
  2019-08-28 17:32   ` Vincent Guittot
  2019-08-22  2:17 ` [PATCH 09/15] sched,fair: refactor enqueue/dequeue_entity Rik van Riel
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2019-08-22  2:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, pjt, dietmar.eggemann, peterz, mingo,
	morten.rasmussen, tglx, mgorman, vincent.guittot, Rik van Riel

The idea behind __sched_period makes sense, but the results it produces do not always.

When a CPU has one high priority task and a large number of low priority
tasks, __sched_period will return a value larger than sysctl_sched_latency,
and the one high priority task may end up getting a timeslice all for itself
that is also much larger than sysctl_sched_latency.

The low priority tasks will have their time slices rounded up to
sysctl_sched_min_granularity, resulting in an even larger scheduling
latency than targeted by __sched_period.

Simplify the code by ripping out __sched_period and always taking
fractions of sysctl_sched_latency.

If a high priority task ends up getting a "too small" time slice compared
to low priority tasks, the vruntime scaling ensures that it will simply
get scheduled more frequently than low priority tasks.
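
A rough example with the stock (unscaled) defaults of 6ms
sysctl_sched_latency, 0.75ms sysctl_sched_min_granularity and
sched_nr_latency = 8: one nice -20 task (weight 88761) plus twenty
nice 0 tasks (weight 1024 each) gives

	__sched_period(21)  = 21 * 0.75ms = 15.75ms
	high prio slice    ~= 15.75ms * 88761 / 109241 ~= 12.8ms
	nice 0 slice       ~= 15.75ms *  1024 / 109241 ~= 0.15ms

so the one high priority task can run for roughly twice the targeted
latency in a single slice, while the nice 0 slices get rounded up to
the 0.75ms minimum anyway. With slices always taken as fractions of
sysctl_sched_latency, the high priority task instead gets a shorter
slice and is simply scheduled more often.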

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 kernel/sched/fair.c | 18 +-----------------
 1 file changed, 1 insertion(+), 17 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8f8c85c6da9b..74ee22c59d13 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -691,22 +691,6 @@ static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
 	return delta;
 }
 
-/*
- * The idea is to set a period in which each task runs once.
- *
- * When there are too many tasks (sched_nr_latency) we have to stretch
- * this period because otherwise the slices get too small.
- *
- * p = (nr <= nl) ? l : l*nr/nl
- */
-static u64 __sched_period(unsigned long nr_running)
-{
-	if (unlikely(nr_running > sched_nr_latency))
-		return nr_running * sysctl_sched_min_granularity;
-	else
-		return sysctl_sched_latency;
-}
-
 /*
  * We calculate the wall-time slice from the period by taking a part
  * proportional to the weight.
@@ -715,7 +699,7 @@ static u64 __sched_period(unsigned long nr_running)
  */
 static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);
+	u64 slice = sysctl_sched_latency;
 
 	for_each_sched_entity(se) {
 		struct load_weight *load;
-- 
2.20.1



* [PATCH 09/15] sched,fair: refactor enqueue/dequeue_entity
  2019-08-22  2:17 [PATCH RFC v4 0/15] sched,fair: flatten CPU controller runqueues Rik van Riel
                   ` (7 preceding siblings ...)
  2019-08-22  2:17 ` [PATCH 08/15] sched,fair: simplify timeslice length code Rik van Riel
@ 2019-08-22  2:17 ` Rik van Riel
  2019-09-03 15:38   ` Vincent Guittot
  2019-08-22  2:17 ` [PATCH 10/15] sched,fair: add helper functions for flattened runqueue Rik van Riel
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2019-08-22  2:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, pjt, dietmar.eggemann, peterz, mingo,
	morten.rasmussen, tglx, mgorman, vincent.guittot, Rik van Riel

Refactor enqueue_entity, dequeue_entity, and update_load_avg, in order
to split out the things we still want to happen at every level in the
cgroup hierarchy with a flat runqueue from the things we only need to
happen once.

No functional changes.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 kernel/sched/fair.c | 64 +++++++++++++++++++++++++++++----------------
 1 file changed, 42 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 74ee22c59d13..7b0d95f2e3a8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3502,7 +3502,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 #define DO_ATTACH	0x4
 
 /* Update task and its cfs_rq load average */
-static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+static inline bool update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
 	u64 now = cfs_rq_clock_pelt(cfs_rq);
 	int decayed;
@@ -3531,6 +3531,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 
 	} else if (decayed && (flags & UPDATE_TG))
 		update_tg_load_avg(cfs_rq, 0);
+
+	return decayed;
 }
 
 #ifndef CONFIG_64BIT
@@ -3747,9 +3749,10 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
 #define SKIP_AGE_LOAD	0x0
 #define DO_ATTACH	0x0
 
-static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
+static inline bool update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
 {
 	cfs_rq_util_change(cfs_rq, 0);
+	return false;
 }
 
 static inline void remove_entity_load_avg(struct sched_entity *se) {}
@@ -3872,6 +3875,24 @@ static inline void check_schedstat_required(void)
  * CPU and an up-to-date min_vruntime on the destination CPU.
  */
 
+static bool
+enqueue_entity_groups(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+{
+	/*
+	 * When enqueuing a sched_entity, we must:
+	 *   - Update loads to have both entity and cfs_rq synced with now.
+	 *   - Add its load to cfs_rq->runnable_avg
+	 *   - For group_entity, update its weight to reflect the new share of
+	 *     its group cfs_rq
+	 *   - Add its new weight to cfs_rq->load.weight
+	 */
+	if (!update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH))
+		return false;
+
+	update_cfs_group(se);
+	return true;
+}
+
 static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
@@ -3896,16 +3917,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (renorm && !curr)
 		se->vruntime += cfs_rq->min_vruntime;
 
-	/*
-	 * When enqueuing a sched_entity, we must:
-	 *   - Update loads to have both entity and cfs_rq synced with now.
-	 *   - Add its load to cfs_rq->runnable_avg
-	 *   - For group_entity, update its weight to reflect the new share of
-	 *     its group cfs_rq
-	 *   - Add its new weight to cfs_rq->load.weight
-	 */
-	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
-	update_cfs_group(se);
 	enqueue_runnable_load_avg(cfs_rq, se);
 	account_entity_enqueue(cfs_rq, se);
 
@@ -3972,14 +3983,9 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
 
-static void
-dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+static bool
+dequeue_entity_groups(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
-	/*
-	 * Update run-time statistics of the 'current'.
-	 */
-	update_curr(cfs_rq);
-
 	/*
 	 * When dequeuing a sched_entity, we must:
 	 *   - Update loads to have both entity and cfs_rq synced with now.
@@ -3988,7 +3994,21 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 *   - For group entity, update its weight to reflect the new share
 	 *     of its group cfs_rq.
 	 */
-	update_load_avg(cfs_rq, se, UPDATE_TG);
+	if (!update_load_avg(cfs_rq, se, UPDATE_TG))
+		return false;
+	update_cfs_group(se);
+
+	return true;
+}
+
+static void
+dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+{
+	/*
+	 * Update run-time statistics of the 'current'.
+	 */
+	update_curr(cfs_rq);
+
 	dequeue_runnable_load_avg(cfs_rq, se);
 
 	update_stats_dequeue(cfs_rq, se, flags);
@@ -4012,8 +4032,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	/* return excess runtime on last dequeue */
 	return_cfs_rq_runtime(cfs_rq);
 
-	update_cfs_group(se);
-
 	/*
 	 * Now advance min_vruntime if @se was the entity holding it back,
 	 * except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be
@@ -5178,6 +5196,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (se->on_rq)
 			break;
 		cfs_rq = cfs_rq_of(se);
+		enqueue_entity_groups(cfs_rq, se, flags);
 		enqueue_entity(cfs_rq, se, flags);
 
 		/*
@@ -5260,6 +5279,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
+		dequeue_entity_groups(cfs_rq, se, flags);
 		dequeue_entity(cfs_rq, se, flags);
 
 		/*
-- 
2.20.1



* [PATCH 10/15] sched,fair: add helper functions for flattened runqueue
  2019-08-22  2:17 [PATCH RFC v4 0/15] sched,fair: flatten CPU controller runqueues Rik van Riel
                   ` (8 preceding siblings ...)
  2019-08-22  2:17 ` [PATCH 09/15] sched,fair: refactor enqueue/dequeue_entity Rik van Riel
@ 2019-08-22  2:17 ` Rik van Riel
  2019-08-22  2:17 ` [PATCH 11/15] sched,fair: flatten hierarchical runqueues Rik van Riel
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2019-08-22  2:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, pjt, dietmar.eggemann, peterz, mingo,
	morten.rasmussen, tglx, mgorman, vincent.guittot, Rik van Riel

Add helper functions to make the flattened runqueue patch a little smaller.

The task_se_h_weight function is similar to task_se_h_load, but scales the
task weight by the group weight, without taking the task's duty cycle into
account.
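
Put differently, task_se_h_weight scales the static se->load.weight by
the cgroup's hierarchical load, while task_se_h_load (added earlier in
the series) scales se->avg.load_avg, i.e. the weight multiplied by the
task's recent duty cycle:

	task_se_h_weight(se) ~ group h_load * se->load.weight	/* weight only */
	task_se_h_load(se)   ~ group h_load * se->avg.load_avg	/* weight * duty cycle */

(the two use slightly different normalization; see the code for the
exact math)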

The task_se_in_cgroup helper is functionally identical to parent_entity,
but calling a function by that name directly would obscure what the
calling code is using it for, and make the code harder to understand.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 kernel/sched/fair.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7b0d95f2e3a8..29bfa7379dec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -243,6 +243,7 @@ static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight
 
 const struct sched_class fair_sched_class;
 static unsigned long task_se_h_load(struct sched_entity *se);
+static unsigned long task_se_h_weight(struct sched_entity *se);
 
 /**************************************************************
  * CFS operations on generic schedulable entities:
@@ -431,6 +432,12 @@ static inline struct sched_entity *parent_entity(struct sched_entity *se)
 	return se->parent;
 }
 
+/* Is this (task) sched_entity in a non-root cgroup? */
+static inline bool task_se_in_cgroup(struct sched_entity *se)
+{
+	return parent_entity(se);
+}
+
 static void
 find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 {
@@ -513,6 +520,11 @@ static inline struct sched_entity *parent_entity(struct sched_entity *se)
 	return NULL;
 }
 
+static inline bool task_se_in_cgroup(struct sched_entity *se)
+{
+	return false;
+}
+
 static inline void
 find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 {
@@ -7837,6 +7849,20 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 	}
 }
 
+static unsigned long task_se_h_weight(struct sched_entity *se)
+{
+	struct cfs_rq *cfs_rq;
+
+	if (!task_se_in_cgroup(se))
+		return se->load.weight;
+
+	cfs_rq = group_cfs_rq_of_parent(se);
+	update_cfs_rq_h_load(cfs_rq);
+
+	/* Reduce the load.weight by the h_load of the group the task is in. */
+	return (cfs_rq->h_load * se->load.weight) >> SCHED_FIXEDPOINT_SHIFT;
+}
+
 static unsigned long task_se_h_load(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = group_cfs_rq_of_parent(se);
@@ -7873,6 +7899,11 @@ static unsigned long task_se_h_load(struct sched_entity *se)
 {
 	return se->avg.load_avg;
 }
+
+static unsigned long task_se_h_weight(struct sched_entity *se)
+{
+	return se->load.weight;
+}
 #endif
 
 /********** Helpers for find_busiest_group ************************/
-- 
2.20.1



* [PATCH 11/15] sched,fair: flatten hierarchical runqueues
  2019-08-22  2:17 [PATCH RFC v4 0/15] sched,fair: flatten CPU controller runqueues Rik van Riel
                   ` (9 preceding siblings ...)
  2019-08-22  2:17 ` [PATCH 10/15] sched,fair: add helper functions for flattened runqueue Rik van Riel
@ 2019-08-22  2:17 ` Rik van Riel
  2019-08-23 18:14   ` Dietmar Eggemann
  2019-08-22  2:17 ` [PATCH 12/15] sched,fair: flatten update_curr functionality Rik van Riel
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2019-08-22  2:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, pjt, dietmar.eggemann, peterz, mingo,
	morten.rasmussen, tglx, mgorman, vincent.guittot, Rik van Riel

Flatten the hierarchical runqueues into just the per CPU rq.cfs runqueue.

Iteration of the sched_entity hierarchy is rate limited to once per jiffy
per sched_entity, which is a smaller change than it seems, because load
average adjustments were already rate limited to once per jiffy before this
patch series.

This patch breaks CONFIG_CFS_BANDWIDTH. The plan for that is to park tasks
from throttled cgroups onto their cgroup runqueues, and slowly (using the
GENTLE_FAIR_SLEEPERS) wake them back up, in vruntime order, once the cgroup
gets unthrottled, to prevent thundering herd issues.

Signed-off-by: Rik van Riel <riel@surriel.com>

Header from folded patch 'fix-attach-detach_enticy_cfs_rq.patch~':

Subject: sched,fair: fix attach/detach_entity_cfs_rq

While attach_entity_cfs_rq and detach_entity_cfs_rq should iterate over
the hierarchy, they do not need to do so twice.

Passing flags into propagate_entity_cfs_rq allows us to reuse that same
loop from other functions.

Signed-off-by: Rik van Riel <riel@surriel.com>


Header from folded patch 'enqueue-order.patch':

Subject: sched,fair: better ordering at enqueue_task_fair time

In order to get useful numbers for the task's hierarchical weight,
task priority, etc things need to be done in a certain order at task
enqueue time.

Specifically:
1) static load/weight to "local" cfs_rq
2) propagate load/weight up the tree
3) add runnable load avg to root cfs_rq

The reason is that each step depends on the things done by the
step beforehand, and we can end up with nonsense numbers if we
do not do things right.

Also, make sure that we walk all the way up the hierarchy at
enqueue_task_fair time in order to get the benefit from the ramp-up
logic in update_cfs_group.
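
In terms of the code below, the new enqueue_task_fair() path boils down to
roughly the following (bandwidth and util_est details omitted):

	/* 1) static load/weight to the "local" (group) cfs_rq */
	account_entity_enqueue(cfs_rq, se);
	/* 2) propagate load/weight up the tree */
	propagate_entity_cfs_rq(se, UPDATE_TG|DO_ATTACH);
	/* 3) apply the hierarchical weight & add runnable load to the root cfs_rq */
	enqueue_entity(cfs_rq, se, flags);
	enqueue_runnable_load_avg(cfs_rq, se);
	account_entity_enqueue_h(cfs_rq, se);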

Signed-off-by: Rik van Riel <riel@surriel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/sched.h |   2 +
 kernel/sched/fair.c   | 502 ++++++++++++++----------------------------
 kernel/sched/pelt.c   |   6 +-
 kernel/sched/pelt.h   |   2 +-
 kernel/sched/sched.h  |   2 +-
 5 files changed, 170 insertions(+), 344 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 84a6cc6f5c47..901c710363e7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -453,6 +453,8 @@ struct sched_entity {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	int				depth;
 	unsigned long			enqueued_h_load;
+	unsigned long			enqueued_h_weight;
+	struct load_weight		h_load;
 	struct sched_entity		*parent;
 	/* rq on which this entity is (to be) queued: */
 	struct cfs_rq			*cfs_rq;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 29bfa7379dec..fa8e88731821 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -470,6 +470,19 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 	}
 }
 
+/* Add the cgroup cfs_rqs to the list, for update_blocked_averages */
+static void enqueue_entity_cfs_rqs(struct sched_entity *se)
+{
+	SCHED_WARN_ON(!entity_is_task(se));
+
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = group_cfs_rq_of_parent(se);
+
+		if (list_add_leaf_cfs_rq(cfs_rq))
+			break;
+	}
+}
+
 #else	/* !CONFIG_FAIR_GROUP_SCHED */
 
 static inline struct task_struct *task_of(struct sched_entity *se)
@@ -697,8 +710,14 @@ int sched_proc_update_handler(struct ctl_table *table, int write,
  */
 static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
 {
-	if (unlikely(se->load.weight != NICE_0_LOAD))
+	if (task_se_in_cgroup(se)) {
+		unsigned long h_weight = task_se_h_weight(se);
+		if (h_weight != se->h_load.weight)
+			update_load_set(&se->h_load, h_weight);
+		delta = __calc_delta(delta, NICE_0_LOAD, &se->h_load);
+	} else if (unlikely(se->load.weight != NICE_0_LOAD)) {
 		delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
+	}
 
 	return delta;
 }
@@ -712,22 +731,16 @@ static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
 static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	u64 slice = sysctl_sched_latency;
+	struct load_weight *load = &cfs_rq->load;
+	struct load_weight lw;
 
-	for_each_sched_entity(se) {
-		struct load_weight *load;
-		struct load_weight lw;
+	if (unlikely(!se->on_rq)) {
+		lw = cfs_rq->load;
 
-		cfs_rq = cfs_rq_of(se);
-		load = &cfs_rq->load;
-
-		if (unlikely(!se->on_rq)) {
-			lw = cfs_rq->load;
-
-			update_load_add(&lw, se->load.weight);
-			load = &lw;
-		}
-		slice = __calc_delta(slice, se->load.weight, load);
+		update_load_add(&lw, task_se_h_weight(se));
+		load = &lw;
 	}
+	slice = __calc_delta(slice, task_se_h_weight(se), load);
 
 	/*
 	 * To avoid cache thrashing, run at least sysctl_sched_min_granularity.
@@ -775,6 +788,7 @@ void init_entity_runnable_average(struct sched_entity *se)
 
 static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
 static void attach_entity_cfs_rq(struct sched_entity *se);
+static void propagate_entity_cfs_rq(struct sched_entity *se, int flags);
 
 /*
  * With new tasks being created, their initial util_avgs are extrapolated
@@ -2726,18 +2740,36 @@ static inline void update_scan_period(struct task_struct *p, int new_cpu)
 #endif /* CONFIG_NUMA_BALANCING */
 
 static void
-account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
+account_entity_enqueue_h(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	update_load_add(&cfs_rq->load, se->load.weight);
-	if (!parent_entity(se))
+	if (task_se_in_cgroup(se)) {
+		/* Add the hierarchical weight to the CPU rq */
+		unsigned long h_weight = task_se_h_weight(se);
+		se->enqueued_h_weight = h_weight;
+		update_load_add(&rq_of(cfs_rq)->load, h_weight);
+	} else {
 		update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
-#ifdef CONFIG_SMP
-	if (entity_is_task(se)) {
-		struct rq *rq = rq_of(cfs_rq);
+	}
+}
 
-		account_numa_enqueue(rq, task_of(se));
-		list_add(&se->group_node, &rq->cfs_tasks);
+static void
+account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	struct rq *rq;
+
+	if (task_se_in_cgroup(se)) {
+		struct cfs_rq *cgroup_rq = group_cfs_rq_of_parent(se);
+
+		update_load_add(&cgroup_rq->load, se->load.weight);
+		cgroup_rq->nr_running++;
+	} else {
+		update_load_add(&cfs_rq->load, se->load.weight);
 	}
+#ifdef CONFIG_SMP
+	rq = rq_of(cfs_rq);
+
+	account_numa_enqueue(rq, task_of(se));
+	list_add(&se->group_node, &rq->cfs_tasks);
 #endif
 	cfs_rq->nr_running++;
 }
@@ -2745,14 +2777,20 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 static void
 account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	update_load_sub(&cfs_rq->load, se->load.weight);
-	if (!parent_entity(se))
+	if (task_se_in_cgroup(se)) {
+		struct cfs_rq *cgroup_rq = group_cfs_rq_of_parent(se);
+
+		update_load_sub(&cgroup_rq->load, se->load.weight);
+		cgroup_rq->nr_running--;
+
+		update_load_sub(&rq_of(cfs_rq)->load, se->enqueued_h_weight);
+	} else {
+		update_load_sub(&cfs_rq->load, se->load.weight);
 		update_load_sub(&rq_of(cfs_rq)->load, se->load.weight);
-#ifdef CONFIG_SMP
-	if (entity_is_task(se)) {
-		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
-		list_del_init(&se->group_node);
 	}
+#ifdef CONFIG_SMP
+	account_numa_dequeue(rq_of(cfs_rq), task_of(se));
+	list_del_init(&se->group_node);
 #endif
 	cfs_rq->nr_running--;
 }
@@ -2847,6 +2885,9 @@ update_runnable_load_avg(struct sched_entity *se)
 static inline void
 enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	if (task_se_in_cgroup(se))
+		cfs_rq = group_cfs_rq_of_parent(se);
+
 	cfs_rq->avg.load_avg += se->avg.load_avg;
 	cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum;
 }
@@ -2854,6 +2895,9 @@ enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 static inline void
 dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	if (task_se_in_cgroup(se))
+		cfs_rq = group_cfs_rq_of_parent(se);
+
 	sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
 	sub_positive(&cfs_rq->avg.load_sum, se_weight(se) * se->avg.load_sum);
 }
@@ -3031,7 +3075,6 @@ static void update_cfs_group(struct sched_entity *se)
 	long shares;
 
 	if (!gcfs_rq) {
-		update_runnable_load_avg(se);
 		return;
 	}
 
@@ -3482,7 +3525,9 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	cfs_rq->avg.util_avg += se->avg.util_avg;
 	cfs_rq->avg.util_sum += se->avg.util_sum;
 
-	add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
+	if (task_se_in_cgroup(se))
+		add_tg_cfs_propagate(group_cfs_rq_of_parent(se),
+						se->avg.load_sum);
 
 	cfs_rq_util_change(cfs_rq, flags);
 }
@@ -3501,7 +3546,9 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
 	sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
 
-	add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
+	if (task_se_in_cgroup(se))
+		add_tg_cfs_propagate(group_cfs_rq_of_parent(se),
+						-se->avg.load_sum);
 
 	cfs_rq_util_change(cfs_rq, 0);
 }
@@ -3512,11 +3559,13 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 #define UPDATE_TG	0x1
 #define SKIP_AGE_LOAD	0x2
 #define DO_ATTACH	0x4
+#define SE_IS_CURRENT	0x8
 
 /* Update task and its cfs_rq load average */
 static inline bool update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
 	u64 now = cfs_rq_clock_pelt(cfs_rq);
+	bool curr = flags & SE_IS_CURRENT;
 	int decayed;
 
 	/*
@@ -3524,7 +3573,7 @@ static inline bool update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	 * track group sched_entity load average for task_se_h_load calc in migration
 	 */
 	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
-		__update_load_avg_se(now, cfs_rq, se);
+		__update_load_avg_se(now, cfs_rq, se, curr, curr);
 
 	decayed  = update_cfs_rq_load_avg(now, cfs_rq);
 	decayed |= propagate_entity_load_avg(se);
@@ -3590,6 +3639,9 @@ static void remove_entity_load_avg(struct sched_entity *se)
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	unsigned long flags;
 
+	if (task_se_in_cgroup(se))
+		cfs_rq = group_cfs_rq_of_parent(se);
+
 	/*
 	 * tasks cannot exit without having gone through wake_up_new_task() ->
 	 * post_init_entity_util_avg() which will have added things to the
@@ -3760,6 +3812,7 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
 #define UPDATE_TG	0x0
 #define SKIP_AGE_LOAD	0x0
 #define DO_ATTACH	0x0
+#define SE_IS_CURRENT	0x0
 
 static inline bool update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
 {
@@ -3887,24 +3940,6 @@ static inline void check_schedstat_required(void)
  * CPU and an up-to-date min_vruntime on the destination CPU.
  */
 
-static bool
-enqueue_entity_groups(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
-{
-	/*
-	 * When enqueuing a sched_entity, we must:
-	 *   - Update loads to have both entity and cfs_rq synced with now.
-	 *   - Add its load to cfs_rq->runnable_avg
-	 *   - For group_entity, update its weight to reflect the new share of
-	 *     its group cfs_rq
-	 *   - Add its new weight to cfs_rq->load.weight
-	 */
-	if (!update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH))
-		return false;
-
-	update_cfs_group(se);
-	return true;
-}
-
 static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
@@ -3929,9 +3964,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (renorm && !curr)
 		se->vruntime += cfs_rq->min_vruntime;
 
-	enqueue_runnable_load_avg(cfs_rq, se);
-	account_entity_enqueue(cfs_rq, se);
-
 	if (flags & ENQUEUE_WAKEUP)
 		place_entity(cfs_rq, se, 0);
 
@@ -3942,77 +3974,24 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
 
-	if (cfs_rq->nr_running == 1) {
-		list_add_leaf_cfs_rq(cfs_rq);
-		check_enqueue_throttle(cfs_rq);
-	}
-}
-
-static void __clear_buddies_last(struct sched_entity *se)
-{
-	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
-		if (cfs_rq->last != se)
-			break;
-
-		cfs_rq->last = NULL;
-	}
-}
-
-static void __clear_buddies_next(struct sched_entity *se)
-{
-	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
-		if (cfs_rq->next != se)
-			break;
-
-		cfs_rq->next = NULL;
-	}
-}
-
-static void __clear_buddies_skip(struct sched_entity *se)
-{
-	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
-		if (cfs_rq->skip != se)
-			break;
-
-		cfs_rq->skip = NULL;
-	}
+	if (task_se_in_cgroup(se))
+		enqueue_entity_cfs_rqs(se);
 }
 
 static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	if (cfs_rq->last == se)
-		__clear_buddies_last(se);
+		cfs_rq->last = NULL;
 
 	if (cfs_rq->next == se)
-		__clear_buddies_next(se);
+		cfs_rq->next = NULL;
 
 	if (cfs_rq->skip == se)
-		__clear_buddies_skip(se);
+		cfs_rq->skip = NULL;
 }
 
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
 
-static bool
-dequeue_entity_groups(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
-{
-	/*
-	 * When dequeuing a sched_entity, we must:
-	 *   - Update loads to have both entity and cfs_rq synced with now.
-	 *   - Subtract its load from the cfs_rq->runnable_avg.
-	 *   - Subtract its previous weight from cfs_rq->load.weight.
-	 *   - For group entity, update its weight to reflect the new share
-	 *     of its group cfs_rq.
-	 */
-	if (!update_load_avg(cfs_rq, se, UPDATE_TG))
-		return false;
-	update_cfs_group(se);
-
-	return true;
-}
-
 static void
 dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
@@ -4106,7 +4085,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		 */
 		update_stats_wait_end(cfs_rq, se);
 		__dequeue_entity(cfs_rq, se);
-		update_load_avg(cfs_rq, se, UPDATE_TG);
+		propagate_entity_cfs_rq(se, UPDATE_TG);
 	}
 
 	update_stats_curr_start(cfs_rq, se);
@@ -4204,11 +4183,12 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 	check_spread(cfs_rq, prev);
 
 	if (prev->on_rq) {
+		struct sched_entity *se = prev;
 		update_stats_wait_start(cfs_rq, prev);
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);
 		/* in !on_rq case, update occurred at dequeue */
-		update_load_avg(cfs_rq, prev, 0);
+		propagate_entity_cfs_rq(se, SE_IS_CURRENT);
 	}
 	cfs_rq->curr = NULL;
 }
@@ -4224,7 +4204,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	/*
 	 * Ensure that runnable average is periodically updated.
 	 */
-	update_load_avg(cfs_rq, curr, UPDATE_TG);
+	update_load_avg(cfs_rq, curr, UPDATE_TG|SE_IS_CURRENT);
 	update_cfs_group(curr);
 
 #ifdef CONFIG_SCHED_HRTICK
@@ -4243,9 +4223,6 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 			hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
 		return;
 #endif
-
-	if (cfs_rq->nr_running > 1)
-		check_preempt_tick(cfs_rq, curr);
 }
 
 
@@ -5120,7 +5097,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 
 	SCHED_WARN_ON(task_rq(p) != rq);
 
-	if (rq->cfs.h_nr_running > 1) {
+	if (rq->cfs.nr_running > 1) {
 		u64 slice = sched_slice(cfs_rq, se);
 		u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
 		s64 delta = slice - ran;
@@ -5185,7 +5162,7 @@ static inline void update_overutilized_status(struct rq *rq) { }
 static void
 enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
-	struct cfs_rq *cfs_rq;
+	struct cfs_rq *cfs_rq = &rq->cfs;
 	struct sched_entity *se = &p->se;
 
 	/*
@@ -5194,7 +5171,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	 * Let's add the task's estimated utilization to the cfs_rq's
 	 * estimated utilization, before we update schedutil.
 	 */
-	util_est_enqueue(&rq->cfs, p);
+	util_est_enqueue(cfs_rq, p);
 
 	/*
 	 * If in_iowait is set, the code below may not trigger any cpufreq
@@ -5204,38 +5181,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (p->in_iowait)
 		cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
 
-	for_each_sched_entity(se) {
-		if (se->on_rq)
-			break;
-		cfs_rq = cfs_rq_of(se);
-		enqueue_entity_groups(cfs_rq, se, flags);
-		enqueue_entity(cfs_rq, se, flags);
-
-		/*
-		 * end evaluation on encountering a throttled cfs_rq
-		 *
-		 * note: in the case of encountering a throttled cfs_rq we will
-		 * post the final h_nr_running increment below.
-		 */
-		if (cfs_rq_throttled(cfs_rq))
-			break;
-		cfs_rq->h_nr_running++;
-
-		flags = ENQUEUE_WAKEUP;
-	}
-
-	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
-		cfs_rq->h_nr_running++;
+	/* First, add the static weight to the (group) cfs_rq. */
+	account_entity_enqueue(cfs_rq, se);
 
-		if (cfs_rq_throttled(cfs_rq))
-			break;
+	/* Then, propagate that up the hierarchy. */
+	propagate_entity_cfs_rq(se, UPDATE_TG|DO_ATTACH);
 
-		update_load_avg(cfs_rq, se, UPDATE_TG);
-		update_cfs_group(se);
-	}
+	/* Apply the calculated hierarchical weight & load. */
+	enqueue_entity(cfs_rq, se, flags);
+	enqueue_runnable_load_avg(cfs_rq, se);
+	account_entity_enqueue_h(cfs_rq, se);
 
-	if (!se) {
 		add_nr_running(rq, 1);
 		/*
 		 * Since new tasks are assigned an initial util_avg equal to
@@ -5254,23 +5210,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (flags & ENQUEUE_WAKEUP)
 			update_overutilized_status(rq);
 
-	}
-
-	if (cfs_bandwidth_used()) {
-		/*
-		 * When bandwidth control is enabled; the cfs_rq_throttled()
-		 * breaks in the above iteration can result in incomplete
-		 * leaf list maintenance, resulting in triggering the assertion
-		 * below.
-		 */
-		for_each_sched_entity(se) {
-			cfs_rq = cfs_rq_of(se);
-
-			if (list_add_leaf_cfs_rq(cfs_rq))
-				break;
-		}
-	}
-
 	assert_list_leaf_cfs_rq(rq);
 
 	hrtick_update(rq);
@@ -5285,55 +5224,17 @@ static void set_next_buddy(struct sched_entity *se);
  */
 static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
-	struct cfs_rq *cfs_rq;
+	struct cfs_rq *cfs_rq = &rq->cfs;
 	struct sched_entity *se = &p->se;
 	int task_sleep = flags & DEQUEUE_SLEEP;
 
-	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
-		dequeue_entity_groups(cfs_rq, se, flags);
-		dequeue_entity(cfs_rq, se, flags);
-
-		/*
-		 * end evaluation on encountering a throttled cfs_rq
-		 *
-		 * note: in the case of encountering a throttled cfs_rq we will
-		 * post the final h_nr_running decrement below.
-		*/
-		if (cfs_rq_throttled(cfs_rq))
-			break;
-		cfs_rq->h_nr_running--;
-
-		/* Don't dequeue parent if it has other entities besides us */
-		if (cfs_rq->load.weight) {
-			/* Avoid re-evaluating load for this entity: */
-			se = parent_entity(se);
-			/*
-			 * Bias pick_next to pick a task from this cfs_rq, as
-			 * p is sleeping when it is within its sched_slice.
-			 */
-			if (task_sleep && se && !throttled_hierarchy(cfs_rq))
-				set_next_buddy(se);
-			break;
-		}
-		flags |= DEQUEUE_SLEEP;
-	}
-
-	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
-		cfs_rq->h_nr_running--;
-
-		if (cfs_rq_throttled(cfs_rq))
-			break;
+	propagate_entity_cfs_rq(se, UPDATE_TG|SE_IS_CURRENT);
 
-		update_load_avg(cfs_rq, se, UPDATE_TG);
-		update_cfs_group(se);
-	}
+	dequeue_entity(cfs_rq, &p->se, flags);
 
-	if (!se)
-		sub_nr_running(rq, 1);
+	sub_nr_running(rq, 1);
 
-	util_est_dequeue(&rq->cfs, p, task_sleep);
+	util_est_dequeue(cfs_rq, p, task_sleep);
 	hrtick_update(rq);
 }
 
@@ -5656,7 +5557,7 @@ static unsigned long capacity_of(int cpu)
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	unsigned long nr_running = READ_ONCE(rq->cfs.h_nr_running);
+	unsigned long nr_running = READ_ONCE(rq->cfs.nr_running);
 	unsigned long load_avg = weighted_cpuload(rq);
 
 	if (nr_running)
@@ -6875,11 +6776,9 @@ static void set_last_buddy(struct sched_entity *se)
 	if (entity_is_task(se) && unlikely(task_has_idle_policy(task_of(se))))
 		return;
 
-	for_each_sched_entity(se) {
-		if (SCHED_WARN_ON(!se->on_rq))
-			return;
-		cfs_rq_of(se)->last = se;
-	}
+	if (SCHED_WARN_ON(!se->on_rq))
+		return;
+	cfs_rq_of(se)->last = se;
 }
 
 static void set_next_buddy(struct sched_entity *se)
@@ -6887,17 +6786,14 @@ static void set_next_buddy(struct sched_entity *se)
 	if (entity_is_task(se) && unlikely(task_has_idle_policy(task_of(se))))
 		return;
 
-	for_each_sched_entity(se) {
-		if (SCHED_WARN_ON(!se->on_rq))
-			return;
-		cfs_rq_of(se)->next = se;
-	}
+	if (SCHED_WARN_ON(!se->on_rq))
+		return;
+	cfs_rq_of(se)->next = se;
 }
 
 static void set_skip_buddy(struct sched_entity *se)
 {
-	for_each_sched_entity(se)
-		cfs_rq_of(se)->skip = se;
+	cfs_rq_of(se)->skip = se;
 }
 
 /*
@@ -6953,7 +6849,6 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
 		return;
 
-	find_matching_se(&se, &pse);
 	update_curr(cfs_rq_of(se));
 	BUG_ON(!pse);
 	if (wakeup_preempt_entity(se, pse) == 1) {
@@ -6994,100 +6889,18 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 	struct task_struct *p;
 	int new_tasks;
 
+	put_prev_task(rq, prev);
 again:
 	if (!cfs_rq->nr_running)
 		goto idle;
 
-#ifdef CONFIG_FAIR_GROUP_SCHED
-	if (prev->sched_class != &fair_sched_class)
-		goto simple;
-
-	/*
-	 * Because of the set_next_buddy() in dequeue_task_fair() it is rather
-	 * likely that a next task is from the same cgroup as the current.
-	 *
-	 * Therefore attempt to avoid putting and setting the entire cgroup
-	 * hierarchy, only change the part that actually changes.
-	 */
-
-	do {
-		struct sched_entity *curr = cfs_rq->curr;
-
-		/*
-		 * Since we got here without doing put_prev_entity() we also
-		 * have to consider cfs_rq->curr. If it is still a runnable
-		 * entity, update_curr() will update its vruntime, otherwise
-		 * forget we've ever seen it.
-		 */
-		if (curr) {
-			if (curr->on_rq)
-				update_curr(cfs_rq);
-			else
-				curr = NULL;
-
-			/*
-			 * This call to check_cfs_rq_runtime() will do the
-			 * throttle and dequeue its entity in the parent(s).
-			 * Therefore the nr_running test will indeed
-			 * be correct.
-			 */
-			if (unlikely(check_cfs_rq_runtime(cfs_rq))) {
-				cfs_rq = &rq->cfs;
-
-				if (!cfs_rq->nr_running)
-					goto idle;
-
-				goto simple;
-			}
-		}
-
-		se = pick_next_entity(cfs_rq, curr);
-		cfs_rq = group_cfs_rq(se);
-	} while (cfs_rq);
-
-	p = task_of(se);
-
-	/*
-	 * Since we haven't yet done put_prev_entity and if the selected task
-	 * is a different task than we started out with, try and touch the
-	 * least amount of cfs_rqs.
-	 */
-	if (prev != p) {
-		struct sched_entity *pse = &prev->se;
-
-		while (!(cfs_rq = is_same_group(se, pse))) {
-			int se_depth = se->depth;
-			int pse_depth = pse->depth;
-
-			if (se_depth <= pse_depth) {
-				put_prev_entity(cfs_rq_of(pse), pse);
-				pse = parent_entity(pse);
-			}
-			if (se_depth >= pse_depth) {
-				set_next_entity(cfs_rq_of(se), se);
-				se = parent_entity(se);
-			}
-		}
-
-		put_prev_entity(cfs_rq, pse);
-		set_next_entity(cfs_rq, se);
-	}
-
-	goto done;
-simple:
-#endif
-
-	put_prev_task(rq, prev);
-
-	do {
-		se = pick_next_entity(cfs_rq, NULL);
-		set_next_entity(cfs_rq, se);
-		cfs_rq = group_cfs_rq(se);
-	} while (cfs_rq);
+	se = pick_next_entity(cfs_rq, NULL);
+	if (!se)
+		goto idle;
 
+	set_next_entity(cfs_rq, se);
 	p = task_of(se);
 
-done: __maybe_unused;
 #ifdef CONFIG_SMP
 	/*
 	 * Move the next running task to the front of
@@ -7136,10 +6949,8 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
 	struct sched_entity *se = &prev->se;
 	struct cfs_rq *cfs_rq;
 
-	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
-		put_prev_entity(cfs_rq, se);
-	}
+	cfs_rq = cfs_rq_of(se);
+	put_prev_entity(cfs_rq, se);
 }
 
 /*
@@ -7900,6 +7711,11 @@ static unsigned long task_se_h_load(struct sched_entity *se)
 	return se->avg.load_avg;
 }
 
+static unsigned long task_se_h_weight(struct sched_entity *se)
+{
+	return se->load.weight;
+}
+
 static unsigned long task_se_h_weight(struct sched_entity *se)
 {
 	return se->load.weight;
@@ -8305,7 +8121,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 		sgs->group_load += load;
 		sgs->group_util += cpu_util(i);
-		sgs->sum_nr_running += rq->cfs.h_nr_running;
+		sgs->sum_nr_running += rq->cfs.nr_running;
 
 		nr_running = rq->nr_running;
 		if (nr_running > 1)
@@ -8996,7 +8812,7 @@ voluntary_active_balance(struct lb_env *env)
 	 * available on dst_cpu.
 	 */
 	if ((env->idle != CPU_NOT_IDLE) &&
-	    (env->src_rq->cfs.h_nr_running == 1)) {
+	    (env->src_rq->cfs.nr_running == 1)) {
 		if ((check_cpu_capacity(env->src_rq, sd)) &&
 		    (capacity_of(env->src_cpu)*sd->imbalance_pct < capacity_of(env->dst_cpu)*100))
 			return 1;
@@ -9677,7 +9493,7 @@ static void nohz_balancer_kick(struct rq *rq)
 		 * capacity; kick the ILB to see if there's a better CPU to run
 		 * on.
 		 */
-		if (rq->cfs.h_nr_running >= 1 && check_cpu_capacity(rq, sd)) {
+		if (rq->cfs.nr_running >= 1 && check_cpu_capacity(rq, sd)) {
 			flags = NOHZ_KICK_MASK;
 			goto unlock;
 		}
@@ -10126,7 +9942,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
 	 * have been enqueued in the meantime. Since we're not going idle,
 	 * pretend we pulled a task.
 	 */
-	if (this_rq->cfs.h_nr_running && !pulled_task)
+	if (this_rq->cfs.nr_running && !pulled_task)
 		pulled_task = 1;
 
 	/* Move the next balance forward */
@@ -10134,7 +9950,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
 		this_rq->next_balance = next_balance;
 
 	/* Is there a task of a high priority class? */
-	if (this_rq->nr_running != this_rq->cfs.h_nr_running)
+	if (this_rq->nr_running != this_rq->cfs.nr_running)
 		pulled_task = -1;
 
 	if (pulled_task)
@@ -10221,6 +10037,12 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		entity_tick(cfs_rq, se, queued);
 	}
 
+	update_runnable_load_avg(&curr->se);
+
+	cfs_rq = &rq->cfs;
+	if (cfs_rq->nr_running > 1)
+		check_preempt_tick(cfs_rq, &curr->se);
+
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
 
@@ -10319,40 +10141,45 @@ static inline bool vruntime_normalized(struct task_struct *p)
  * Propagate the changes of the sched_entity across the tg tree to make it
  * visible to the root
  */
-static void propagate_entity_cfs_rq(struct sched_entity *se)
+static void propagate_entity_cfs_rq(struct sched_entity *se, int flags)
 {
 	struct cfs_rq *cfs_rq;
 
-	/* Start to propagate at parent */
-	se = se->parent;
-
 	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
+		cfs_rq = group_cfs_rq_of_parent(se);
 
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 
-		update_load_avg(cfs_rq, se, UPDATE_TG);
+		/*
+		 * Walk the hierarchy while PELT says there is work to do,
+		 * or while attaching a new task, to get sane task weights.
+		 */
+		if (!update_load_avg(cfs_rq, se, flags))
+			if (!(flags & DO_ATTACH))
+				break;
+
+		update_cfs_group(se);
 	}
 }
 #else
-static void propagate_entity_cfs_rq(struct sched_entity *se) { }
+static void propagate_entity_cfs_rq(struct sched_entity *se, int flags) { }
 #endif
 
 static void detach_entity_cfs_rq(struct sched_entity *se)
 {
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	struct cfs_rq *cfs_rq = group_cfs_rq_of_parent(se);
 
-	/* Catch up with the cfs_rq and remove our load when we leave */
 	update_load_avg(cfs_rq, se, 0);
 	detach_entity_load_avg(cfs_rq, se);
 	update_tg_load_avg(cfs_rq, false);
-	propagate_entity_cfs_rq(se);
+	propagate_entity_cfs_rq(se->parent, UPDATE_TG);
 }
 
 static void attach_entity_cfs_rq(struct sched_entity *se)
 {
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	int flags = sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD;
+	struct cfs_rq *cfs_rq = group_cfs_rq_of_parent(se);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/*
@@ -10363,10 +10190,10 @@ static void attach_entity_cfs_rq(struct sched_entity *se)
 #endif
 
 	/* Synchronize entity with its cfs_rq */
-	update_load_avg(cfs_rq, se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD);
+	update_load_avg(cfs_rq, se, flags);
 	attach_entity_load_avg(cfs_rq, se, 0);
 	update_tg_load_avg(cfs_rq, false);
-	propagate_entity_cfs_rq(se);
+	propagate_entity_cfs_rq(se->parent, flags | UPDATE_TG);
 }
 
 static void detach_task_cfs_rq(struct task_struct *p)
@@ -10427,14 +10254,11 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p)
 static void set_curr_task_fair(struct rq *rq)
 {
 	struct sched_entity *se = &rq->curr->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
-	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
-
-		set_next_entity(cfs_rq, se);
-		/* ensure bandwidth has been allocated on our new cfs_rq */
-		account_cfs_rq_runtime(cfs_rq, 0);
-	}
+	set_next_entity(cfs_rq, se);
+	/* ensure bandwidth has been allocated on our new cfs_rq */
+	account_cfs_rq_runtime(cfs_rq, 0);
 }
 
 void init_cfs_rq(struct cfs_rq *cfs_rq)
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index b3f3e8b29394..8492c9792ad0 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -251,10 +251,10 @@ int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
 	return 0;
 }
 
-int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
+int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se, bool load, bool running)
 {
-	if (___update_load_sum(now, &se->avg, !!se->on_rq,
-				cfs_rq->curr == se)) {
+	if (___update_load_sum(now, &se->avg, (!!se->on_rq || load),
+				(cfs_rq->curr == se) || running)) {
 
 		___update_load_avg(&se->avg, se_weight(se));
 		cfs_se_util_change(&se->avg);
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 7489d5f56960..1152c4ebf314 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -2,7 +2,7 @@
 #include "sched-pelt.h"
 
 int __update_load_avg_blocked_se(u64 now, struct sched_entity *se);
-int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se);
+int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se, bool load, bool running);
 int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
 int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
 int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 32978a8de8ce..b05fd87cf8b5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1443,7 +1443,7 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	set_task_rq_fair(&p->se, p->se.cfs_rq, tg->cfs_rq[cpu]);
-	p->se.cfs_rq = tg->cfs_rq[cpu];
+	p->se.cfs_rq = &cpu_rq(cpu)->cfs;
 	p->se.parent = tg->se[cpu];
 #endif
 
-- 
2.20.1



* [PATCH 12/15] sched,fair: flatten update_curr functionality
  2019-08-22  2:17 [PATCH RFC v4 0/15] sched,fair: flatten CPU controller runqueues Rik van Riel
                   ` (10 preceding siblings ...)
  2019-08-22  2:17 ` [PATCH 11/15] sched,fair: flatten hierarchical runqueues Rik van Riel
@ 2019-08-22  2:17 ` Rik van Riel
  2019-08-27 10:37   ` Dietmar Eggemann
  2019-08-22  2:17 ` [PATCH 13/15] sched,fair: propagate sum_exec_runtime up the hierarchy Rik van Riel
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2019-08-22  2:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, pjt, dietmar.eggemann, peterz, mingo,
	morten.rasmussen, tglx, mgorman, vincent.guittot, Rik van Riel

Make it clear that update_curr only works on tasks any more.

There is no need for task_tick_fair to call it on every sched entity up
the hierarchy, so move the call out of entity_tick.
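
With the flattened runqueue the root cfs_rq's curr entity is always a task,
and group statistics are propagated lazily (see the next patch), so a single
update_curr(&rq->cfs) call per tick is sufficient.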

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 kernel/sched/fair.c | 24 +++++++++++-------------
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fa8e88731821..5cfa3dbeba49 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -872,10 +872,11 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
 static void update_curr(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *curr = cfs_rq->curr;
+	struct task_struct *curtask;
 	u64 now = rq_clock_task(rq_of(cfs_rq));
 	u64 delta_exec;
 
-	if (unlikely(!curr))
+	if (unlikely(!curr) || !entity_is_task(curr))
 		return;
 
 	delta_exec = now - curr->exec_start;
@@ -893,13 +894,10 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
 	update_min_vruntime(cfs_rq);
 
-	if (entity_is_task(curr)) {
-		struct task_struct *curtask = task_of(curr);
-
-		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
-		cgroup_account_cputime(curtask, delta_exec);
-		account_group_exec_runtime(curtask, delta_exec);
-	}
+	curtask = task_of(curr);
+	trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
+	cgroup_account_cputime(curtask, delta_exec);
+	account_group_exec_runtime(curtask, delta_exec);
 
 	account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
@@ -4196,11 +4194,6 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 static void
 entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 {
-	/*
-	 * Update run-time statistics of the 'current'.
-	 */
-	update_curr(cfs_rq);
-
 	/*
 	 * Ensure that runnable average is periodically updated.
 	 */
@@ -10032,6 +10025,11 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &curr->se;
 
+	/*
+	 * Update run-time statistics of the 'current'.
+	 */
+	update_curr(&rq->cfs);
+
 	for_each_sched_entity(se) {
 		cfs_rq = group_cfs_rq_of_parent(se);
 		entity_tick(cfs_rq, se, queued);
-- 
2.20.1



* [PATCH 13/15] sched,fair: propagate sum_exec_runtime up the hierarchy
  2019-08-22  2:17 [PATCH RFC v4 0/15] sched,fair: flatten CPU controller runqueues Rik van Riel
                   ` (11 preceding siblings ...)
  2019-08-22  2:17 ` [PATCH 12/15] sched,fair: flatten update_curr functionality Rik van Riel
@ 2019-08-22  2:17 ` Rik van Riel
  2019-08-28  7:51   ` Dietmar Eggemann
  2019-08-22  2:17 ` [PATCH 14/15] sched,fair: ramp up task_se_h_weight quickly Rik van Riel
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2019-08-22  2:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, pjt, dietmar.eggemann, peterz, mingo,
	morten.rasmussen, tglx, mgorman, vincent.guittot, Rik van Riel

Now that enqueue_task_fair and dequeue_task_fair no longer iterate up
the hierarchy all the time, a method to lazily propagate sum_exec_runtime
up the hierarchy is necessary.

Once a tick, propagate the newly accumulated exec_runtime up the hierarchy,
and feed it into CFS bandwidth control.
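
For example, a task that has run for 3ms since the previous propagation
hands those 3ms to its parent group's sched_entity in one step, and the
same 3ms are charged against the group's bandwidth quota at that point,
instead of the runtime being accounted at every level of the hierarchy on
every update_curr() call.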

Remove the pointless call to account_cfs_rq_runtime from update_curr,
which is always called with a root cfs_rq.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 include/linux/sched.h |  1 +
 kernel/sched/core.c   |  1 +
 kernel/sched/fair.c   | 22 ++++++++++++++++++++--
 3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 901c710363e7..bdca15b3afe7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -454,6 +454,7 @@ struct sched_entity {
 	int				depth;
 	unsigned long			enqueued_h_load;
 	unsigned long			enqueued_h_weight;
+	u64				propagated_exec_runtime;
 	struct load_weight		h_load;
 	struct sched_entity		*parent;
 	/* rq on which this entity is (to be) queued: */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbd96900f715..9915d20e84a9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2137,6 +2137,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	INIT_LIST_HEAD(&p->se.group_node);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+	p->se.propagated_exec_runtime	= 0;
 	p->se.cfs_rq			= NULL;
 #endif
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5cfa3dbeba49..d6c881c5c4d5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -898,8 +898,6 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
 	cgroup_account_cputime(curtask, delta_exec);
 	account_group_exec_runtime(curtask, delta_exec);
-
-	account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
 static void update_curr_fair(struct rq *rq)
@@ -3412,6 +3410,20 @@ static inline bool skip_blocked_update(struct sched_entity *se)
 	return true;
 }
 
+static void propagate_exec_runtime(struct cfs_rq *cfs_rq,
+				struct sched_entity *se)
+{
+	struct sched_entity *parent = se->parent;
+	u64 diff = se->sum_exec_runtime - se->propagated_exec_runtime;
+
+	if (parent) {
+		parent->sum_exec_runtime += diff;
+		account_cfs_rq_runtime(cfs_rq, diff);
+	}
+
+	se->propagated_exec_runtime = se->sum_exec_runtime;
+}
+
 #else /* CONFIG_FAIR_GROUP_SCHED */
 
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force) {}
@@ -3423,6 +3435,11 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
 
 static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) {}
 
+static void propagate_exec_runtime(struct cfs_rq *cfs_rq,
+				struct sched_entity *se)
+{
+}
+
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 /**
@@ -10157,6 +10174,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se, int flags)
 			if (!(flags & DO_ATTACH))
 				break;
 
+		propagate_exec_runtime(cfs_rq, se);
 		update_cfs_group(se);
 	}
 }
-- 
2.20.1



* [PATCH 14/15] sched,fair: ramp up task_se_h_weight quickly
  2019-08-22  2:17 [PATCH RFC v4 0/15] sched,fair: flatten CPU controller runqueues Rik van Riel
                   ` (12 preceding siblings ...)
  2019-08-22  2:17 ` [PATCH 13/15] sched,fair: propagate sum_exec_runtime up the hierarchy Rik van Riel
@ 2019-08-22  2:17 ` Rik van Riel
  2019-08-22  2:17 ` [PATCH 15/15] sched,fair: scale vdiff in wakeup_preempt_entity Rik van Riel
  2019-09-02 10:53 ` [PATCH RFC v4 0/15] sched,fair: flatten CPU controller runqueues Dietmar Eggemann
  15 siblings, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2019-08-22  2:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, pjt, dietmar.eggemann, peterz, mingo,
	morten.rasmussen, tglx, mgorman, vincent.guittot, Rik van Riel

The code in update_cfs_group / calc_group_shares has some logic to
quickly ramp up the load when a task has just started running in a
cgroup, in order to get sane values for the cgroup se->load.weight.

This code adds a similar hack to task_se_h_weight.
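
The idea: when a task has only just started running in a previously idle
cgroup, the group's h_load is still close to zero, so task_se_h_weight()
would otherwise return a near-zero weight for it; using the parent se's
load.weight (which calc_group_shares ramps up quickly) as a lower bound
gives the freshly started task a usable weight right away.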

However, THIS CODE IS WRONG, since it does not do things hierarchically.

I am wondering a few things here:
1) Should I have something similar to the logic in calc_group_shares
   in update_cfs_rq_h_load?
2) If so, should I also use that fast-ramp-up value for task_h_load,
   to prevent the load balancer from thinking it is moving zero weight
   tasks around?
3) If update_cfs_rq_h_load is the wrong place, where should I be
   calculating a hierarchical group weight value, instead?

Not-yet-signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
---
 kernel/sched/fair.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d6c881c5c4d5..3df5d60b245f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7672,6 +7672,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 
 static unsigned long task_se_h_weight(struct sched_entity *se)
 {
+	unsigned long group_load;
 	struct cfs_rq *cfs_rq;
 
 	if (!task_se_in_cgroup(se))
@@ -7680,8 +7681,12 @@ static unsigned long task_se_h_weight(struct sched_entity *se)
 	cfs_rq = group_cfs_rq_of_parent(se);
 	update_cfs_rq_h_load(cfs_rq);
 
+	/* Ramp up quickly to keep h_weight sane. */
+	group_load = max(scale_load_down(se->parent->load.weight),
+							cfs_rq->h_load);
+
 	/* Reduce the load.weight by the h_load of the group the task is in. */
-	return (cfs_rq->h_load * se->load.weight) >> SCHED_FIXEDPOINT_SHIFT;
+	return (group_load * se->load.weight) >> SCHED_FIXEDPOINT_SHIFT;
 }
 
 static unsigned long task_se_h_load(struct sched_entity *se)
-- 
2.20.1



* [PATCH 15/15] sched,fair: scale vdiff in wakeup_preempt_entity
  2019-08-22  2:17 [PATCH RFC v4 0/15] sched,fair: flatten CPU controller runqueues Rik van Riel
                   ` (13 preceding siblings ...)
  2019-08-22  2:17 ` [PATCH 14/15] sched,fair: ramp up task_se_h_weight quickly Rik van Riel
@ 2019-08-22  2:17 ` Rik van Riel
  2019-09-02 10:53 ` [PATCH RFC v4 0/15] sched,fair: flatten CPU controller runqueues Dietmar Eggemann
  15 siblings, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2019-08-22  2:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, pjt, dietmar.eggemann, peterz, mingo,
	morten.rasmussen, tglx, mgorman, vincent.guittot, Rik van Riel

When a task wakes back up after having gone to sleep, place_entity
will limit the vruntime difference between min_vruntime and the
woken up task to half of sysctl_sched_latency.

The code in wakeup_preempt_entity calculates how much vruntime a
time slice for the woken up task represents, in wakeup_gran.

It then assumes that all the vruntime used since the task went to
sleep was used by the currently running task (which has its vruntime
scaled by calc_delta_fair, as well).

However, that assumption is not necessarily true, and the vruntime
may have advanced at different rates, pushed ahead by different tasks
on the CPU. This becomes more visible when the CPU controller is enabled.

This leads to the symptom that a high priority woken up task is likely to
preempt whatever is running, even if the currently running task is of equal
or higher priority than the woken up task!

Scaling the vdiff down if the currently running task is also high priority
solves that symptom.
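
As a rough illustration, with made-up numbers: suppose the woken task's
vruntime is 1.5ms behind that of the current task, and wakeup_gran() for
the woken task works out to 1ms. Without the scaling, 1.5ms > 1ms and the
current task gets preempted. If the current task has twice the hierarchical
weight of a nice-0 task, calc_delta_fair() scales that 1.5ms down to about
0.75ms, which no longer exceeds the granularity, and the higher priority
current task keeps running.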

This is not the correct thing to do if all of the vruntime was accumulated
by the current task, or by other tasks of similar priority, and was thus
already scaled by that same priority. However, I do not have any better
ideas on how to tackle the "task X got preempted by task Y of the same
priority" issue that system administrators try to resolve by setting the
sched_wakeup_granularity sysctl variable to a value larger than half of
sysctl_sched_latency...

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 kernel/sched/fair.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3df5d60b245f..ef7629bdf41d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6774,6 +6774,7 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
 	if (vdiff <= 0)
 		return -1;
 
+	vdiff = min((u64)vdiff, calc_delta_fair(vdiff, curr));
 	gran = wakeup_gran(se);
 	if (vdiff > gran)
 		return 1;
-- 
2.20.1



* Re: [PATCH 01/15] sched: introduce task_se_h_load helper
  2019-08-22  2:17 ` [PATCH 01/15] sched: introduce task_se_h_load helper Rik van Riel
@ 2019-08-23 18:13   ` Dietmar Eggemann
  2019-08-24  0:05     ` Rik van Riel
  0 siblings, 1 reply; 47+ messages in thread
From: Dietmar Eggemann @ 2019-08-23 18:13 UTC (permalink / raw)
  To: Rik van Riel, linux-kernel
  Cc: kernel-team, pjt, peterz, mingo, morten.rasmussen, tglx, mgorman,
	vincent.guittot, Josef Bacik



On 22/08/2019 04:17, Rik van Riel wrote:
> Sometimes the hierarchical load of a sched_entity needs to be calculated.
> Rename task_h_load to task_se_h_load, and directly pass a sched_entity to
> that function.
> 
> Move the function declaration up above where it will be used later.
> 
> No functional changes.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> ---

[...]

> @@ -1668,7 +1668,7 @@ static void task_numa_compare(struct task_numa_env *env,
>  	/*
>  	 * In the overloaded case, try and keep the load balanced.
>  	 */
> -	load = task_h_load(env->p) - task_h_load(cur);
> +	load = task_se_h_load(env->p->se) - task_se_h_load(cur->se);

Shouldn't this be:

load = task_se_h_load(&env->p->se) - task_se_h_load(&cur->se);

>  	if (!load)
>  		goto assign;
>  
> @@ -1706,7 +1706,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>  	bool maymove = false;
>  	int cpu;
>  
> -	load = task_h_load(env->p);
> +	load = task_se_h_load(env->p->se);

load = task_se_h_load(&env->p->se);

Only visible with CONFIG_NUMA_BALANCING though.

[...]


* Re: [PATCH 11/15] sched,fair: flatten hierarchical runqueues
  2019-08-22  2:17 ` [PATCH 11/15] sched,fair: flatten hierarchical runqueues Rik van Riel
@ 2019-08-23 18:14   ` Dietmar Eggemann
  2019-08-24  1:16     ` Rik van Riel
  0 siblings, 1 reply; 47+ messages in thread
From: Dietmar Eggemann @ 2019-08-23 18:14 UTC (permalink / raw)
  To: Rik van Riel, linux-kernel
  Cc: kernel-team, pjt, peterz, mingo, morten.rasmussen, tglx, mgorman,
	vincent.guittot

On 22/08/2019 04:17, Rik van Riel wrote:
> Flatten the hierarchical runqueues into just the per CPU rq.cfs runqueue.
> 
> Iteration of the sched_entity hierarchy is rate limited to once per jiffy
> per sched_entity, which is a smaller change than it seems, because load
> average adjustments were already rate limited to once per jiffy before this
> patch series.
> 
> This patch breaks CONFIG_CFS_BANDWIDTH. The plan for that is to park tasks
> from throttled cgroups onto their cgroup runqueues, and slowly (using the
> GENTLE_FAIR_SLEEPERS) wake them back up, in vruntime order, once the cgroup
> gets unthrottled, to prevent thundering herd issues.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> 
> Header from folded patch 'fix-attach-detach_enticy_cfs_rq.patch~':
> 
> Subject: sched,fair: fix attach/detach_entity_cfs_rq
> 
> While attach_entity_cfs_rq and detach_entity_cfs_rq should iterate over
> the hierarchy, they do not need to do so twice.
> 
> Passing flags into propagate_entity_cfs_rq allows us to reuse that same
> loop from other functions.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> 
> 
> Header from folded patch 'enqueue-order.patch':
> 
> Subject: sched,fair: better ordering at enqueue_task_fair time
> 
> In order to get useful numbers for the task's hierarchical weight,
> task priority, etc things need to be done in a certain order at task
> enqueue time.
> 
> Specifically:
> 1) static load/weight to "local" cfs_rq
> 2) propagate load/weight up the tree
> 3) add runnable load avg to root cfs_rq
> 
> The reason is that each step depends on the things done by the
> step beforehand, and we can end up with nonsense numbers if we
> do not do things right.
> 
> Also, make sure that we walk all the way up the hierarchy at
> enqueue_task_fair time in order to get the benefit from the ramp-up
> logic in update_cfs_group.

[...]

>  /*
> @@ -6953,7 +6849,6 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>  	if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
>  		return;
>  
> -	find_matching_se(&se, &pse);
>  	update_curr(cfs_rq_of(se));
>  	BUG_ON(!pse);
>  	if (wakeup_preempt_entity(se, pse) == 1) {
> @@ -6994,100 +6889,18 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>  	struct task_struct *p;
>  	int new_tasks;
>  
> +	put_prev_task(rq, prev);
>  again:
>  	if (!cfs_rq->nr_running)
>  		goto idle;
>  
> -#ifdef CONFIG_FAIR_GROUP_SCHED
> -	if (prev->sched_class != &fair_sched_class)
> -		goto simple;
> -
> -	/*
> -	 * Because of the set_next_buddy() in dequeue_task_fair() it is rather
> -	 * likely that a next task is from the same cgroup as the current.
> -	 *
> -	 * Therefore attempt to avoid putting and setting the entire cgroup
> -	 * hierarchy, only change the part that actually changes.
> -	 */
> -
> -	do {
> -		struct sched_entity *curr = cfs_rq->curr;
> -
> -		/*
> -		 * Since we got here without doing put_prev_entity() we also
> -		 * have to consider cfs_rq->curr. If it is still a runnable
> -		 * entity, update_curr() will update its vruntime, otherwise
> -		 * forget we've ever seen it.
> -		 */
> -		if (curr) {
> -			if (curr->on_rq)
> -				update_curr(cfs_rq);
> -			else
> -				curr = NULL;
> -
> -			/*
> -			 * This call to check_cfs_rq_runtime() will do the
> -			 * throttle and dequeue its entity in the parent(s).
> -			 * Therefore the nr_running test will indeed
> -			 * be correct.
> -			 */
> -			if (unlikely(check_cfs_rq_runtime(cfs_rq))) {
> -				cfs_rq = &rq->cfs;
> -
> -				if (!cfs_rq->nr_running)
> -					goto idle;
> -
> -				goto simple;
> -			}
> -		}
> -
> -		se = pick_next_entity(cfs_rq, curr);
> -		cfs_rq = group_cfs_rq(se);
> -	} while (cfs_rq);
> -
> -	p = task_of(se);
> -
> -	/*
> -	 * Since we haven't yet done put_prev_entity and if the selected task
> -	 * is a different task than we started out with, try and touch the
> -	 * least amount of cfs_rqs.
> -	 */
> -	if (prev != p) {
> -		struct sched_entity *pse = &prev->se;
> -
> -		while (!(cfs_rq = is_same_group(se, pse))) {
> -			int se_depth = se->depth;
> -			int pse_depth = pse->depth;
> -
> -			if (se_depth <= pse_depth) {
> -				put_prev_entity(cfs_rq_of(pse), pse);
> -				pse = parent_entity(pse);
> -			}
> -			if (se_depth >= pse_depth) {
> -				set_next_entity(cfs_rq_of(se), se);
> -				se = parent_entity(se);
> -			}

Looks like with the se->depth related code gone here in
pick_next_task_fair() and the call to find_matching_se() in
check_preempt_wakeup() you could remove se->depth entirely.

[...]


* Re: [PATCH 01/15] sched: introduce task_se_h_load helper
  2019-08-23 18:13   ` Dietmar Eggemann
@ 2019-08-24  0:05     ` Rik van Riel
  0 siblings, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2019-08-24  0:05 UTC (permalink / raw)
  To: Dietmar Eggemann, linux-kernel
  Cc: kernel-team, pjt, peterz, mingo, morten.rasmussen, tglx, mgorman,
	vincent.guittot, Josef Bacik


On Fri, 2019-08-23 at 20:13 +0200, Dietmar Eggemann wrote:
> 
> 
> > @@ -1668,7 +1668,7 @@ static void task_numa_compare(struct
> > task_numa_env *env,
> >  	/*
> >  	 * In the overloaded case, try and keep the load balanced.
> >  	 */
> > -	load = task_h_load(env->p) - task_h_load(cur);
> > +	load = task_se_h_load(env->p->se) - task_se_h_load(cur->se);
> 
> Shouldn't this be:
> 
> load = task_se_h_load(&env->p->se) - task_se_h_load(&cur->se);

Yes indeed. Sorry I forgot to fix these after you
pointed them out last time.

They are now fixed in my tree.

-- 
All Rights Reversed.



* Re: [PATCH 11/15] sched,fair: flatten hierarchical runqueues
  2019-08-23 18:14   ` Dietmar Eggemann
@ 2019-08-24  1:16     ` Rik van Riel
  0 siblings, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2019-08-24  1:16 UTC (permalink / raw)
  To: Dietmar Eggemann, linux-kernel
  Cc: kernel-team, pjt, peterz, mingo, morten.rasmussen, tglx, mgorman,
	vincent.guittot


On Fri, 2019-08-23 at 20:14 +0200, Dietmar Eggemann wrote:
> 
> Looks like with the se->depth related code gone here in
> pick_next_task_fair() and the call to find_matching_se() in
> check_preempt_wakeup() you could remove se->depth entirely.
> 
> [...]

I've just done that in a separate patch in my series,
in case we need it again.  If we don't, diffstat tells
us we're getting this:

 include/linux/sched.h |  1 -
 kernel/sched/fair.c   | 50 ++------------------------------------------------
 2 files changed, 2 insertions(+), 49 deletions(-)

Thank you!

-- 
All Rights Reversed.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 12/15] sched,fair: flatten update_curr functionality
  2019-08-22  2:17 ` [PATCH 12/15] sched,fair: flatten update_curr functionality Rik van Riel
@ 2019-08-27 10:37   ` Dietmar Eggemann
  0 siblings, 0 replies; 47+ messages in thread
From: Dietmar Eggemann @ 2019-08-27 10:37 UTC (permalink / raw)
  To: Rik van Riel, linux-kernel
  Cc: kernel-team, pjt, peterz, mingo, morten.rasmussen, tglx, mgorman,
	vincent.guittot

On 22/08/2019 04:17, Rik van Riel wrote:
> Make it clear that update_curr only works on tasks any more.
> 
> There is no need for task_tick_fair to call it on every sched entity up
> the hierarchy, so move the call out of entity_tick.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  kernel/sched/fair.c | 24 +++++++++++-------------
>  1 file changed, 11 insertions(+), 13 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fa8e88731821..5cfa3dbeba49 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -872,10 +872,11 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>  static void update_curr(struct cfs_rq *cfs_rq)
>  {
>  	struct sched_entity *curr = cfs_rq->curr;
> +	struct task_struct *curtask;
>  	u64 now = rq_clock_task(rq_of(cfs_rq));
>  	u64 delta_exec;
>  
> -	if (unlikely(!curr))
> +	if (unlikely(!curr) || !entity_is_task(curr))
>  		return;

Shouldn't this be

-       if (unlikely(!curr) || !entity_is_task(curr))
+       if (unlikely(!curr))
                return;

+       BUG_ON(!entity_is_task(curr));

IMHO, cfs_rq->curr can only be a task with your changes.


* Re: [PATCH 13/15] sched,fair: propagate sum_exec_runtime up the hierarchy
  2019-08-22  2:17 ` [PATCH 13/15] sched,fair: propagate sum_exec_runtime up the hierarchy Rik van Riel
@ 2019-08-28  7:51   ` Dietmar Eggemann
  2019-08-28 13:14     ` Rik van Riel
  0 siblings, 1 reply; 47+ messages in thread
From: Dietmar Eggemann @ 2019-08-28  7:51 UTC (permalink / raw)
  To: Rik van Riel, linux-kernel
  Cc: kernel-team, pjt, peterz, mingo, morten.rasmussen, tglx, mgorman,
	vincent.guittot

On 22/08/2019 04:17, Rik van Riel wrote:
> Now that enqueue_task_fair and dequeue_task_fair no longer iterate up
> the hierarchy all the time, a method to lazily propagate sum_exec_runtime
> up the hierarchy is necessary.
> 
> Once a tick, propagate the newly accumulated exec_runtime up the hierarchy,
> and feed it into CFS bandwidth control.
> 
> Remove the pointless call to account_cfs_rq_runtime from update_curr,
> which is always called with a root cfs_rq.

But what about the call to account_cfs_rq_runtime() in
set_curr_task_fair()? Here you always call it with the root cfs_rq.
Shouldn't this be called also in a loop over all se's until !se->parent
(like in propagate_exec_runtime() further below).

> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  include/linux/sched.h |  1 +
>  kernel/sched/core.c   |  1 +
>  kernel/sched/fair.c   | 22 ++++++++++++++++++++--
>  3 files changed, 22 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 901c710363e7..bdca15b3afe7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -454,6 +454,7 @@ struct sched_entity {
>  	int				depth;
>  	unsigned long			enqueued_h_load;
>  	unsigned long			enqueued_h_weight;
> +	u64				propagated_exec_runtime;
>  	struct load_weight		h_load;
>  	struct sched_entity		*parent;
>  	/* rq on which this entity is (to be) queued: */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fbd96900f715..9915d20e84a9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2137,6 +2137,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
>  	INIT_LIST_HEAD(&p->se.group_node);
>  
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> +	p->se.propagated_exec_runtime	= 0;
>  	p->se.cfs_rq			= NULL;
>  #endif
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5cfa3dbeba49..d6c881c5c4d5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -898,8 +898,6 @@ static void update_curr(struct cfs_rq *cfs_rq)
>  	trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
>  	cgroup_account_cputime(curtask, delta_exec);
>  	account_group_exec_runtime(curtask, delta_exec);
> -
> -	account_cfs_rq_runtime(cfs_rq, delta_exec);
>  }
>  
>  static void update_curr_fair(struct rq *rq)
> @@ -3412,6 +3410,20 @@ static inline bool skip_blocked_update(struct sched_entity *se)
>  	return true;
>  }
>  
> +static void propagate_exec_runtime(struct cfs_rq *cfs_rq,
> +				struct sched_entity *se)
> +{
> +	struct sched_entity *parent = se->parent;
> +	u64 diff = se->sum_exec_runtime - se->propagated_exec_runtime;
> +
> +	if (parent) {
> +		parent->sum_exec_runtime += diff;
> +		account_cfs_rq_runtime(cfs_rq, diff);
> +	}
> +
> +	se->propagated_exec_runtime = se->sum_exec_runtime;
> +}
> +
>  #else /* CONFIG_FAIR_GROUP_SCHED */
>  
>  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force) {}
> @@ -3423,6 +3435,11 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
>  
>  static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) {}
>  
> +static inline void propagate_exec_runtime(struct cfs_rq *cfs_rq,
> +				struct sched_entity *se)
> +{
> +}
> +
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>  
>  /**
> @@ -10157,6 +10174,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se, int flags)
>  			if (!(flags & DO_ATTACH))
>  				break;
>  
> +		propagate_exec_runtime(cfs_rq, se);
>  		update_cfs_group(se);
>  	}
>  }
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 13/15] sched,fair: propagate sum_exec_runtime up the hierarchy
  2019-08-28  7:51   ` Dietmar Eggemann
@ 2019-08-28 13:14     ` Rik van Riel
  2019-08-29 17:20       ` Dietmar Eggemann
  0 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2019-08-28 13:14 UTC (permalink / raw)
  To: Dietmar Eggemann, linux-kernel
  Cc: kernel-team, pjt, peterz, mingo, morten.rasmussen, tglx, mgorman,
	vincent.guittot

[-- Attachment #1: Type: text/plain, Size: 1226 bytes --]

On Wed, 2019-08-28 at 09:51 +0200, Dietmar Eggemann wrote:
> On 22/08/2019 04:17, Rik van Riel wrote:
> > Now that enqueue_task_fair and dequeue_task_fair no longer iterate
> > up
> > the hierarchy all the time, a method to lazily propagate
> > sum_exec_runtime
> > up the hierarchy is necessary.
> > 
> > Once a tick, propagate the newly accumulated exec_runtime up the
> > hierarchy,
> > and feed it into CFS bandwidth control.
> > 
> > Remove the pointless call to account_cfs_rq_runtime from
> > update_curr,
> > which is always called with a root cfs_rq.
> 
> But what about the call to account_cfs_rq_runtime() in
> set_curr_task_fair()? Here you always call it with the root cfs_rq.
> Shouldn't this be called also in a loop over all se's until !se-
> >parent
> (like in propagate_exec_runtime() further below).

I believe that call should be only on the cgroup
cfs_rq, with account_cfs_rq_runtime figuring out
whether more runtime needs to be obtained from
further up in the hierarchy.

By default we should probably work under the assumption
that account_cfs_rq_runtime() will succeed at the current
level, and no gymnastics are required to obtain CPU time.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 03/15] sched,fair: redefine runnable_load_avg as the sum of task_h_load
  2019-08-22  2:17 ` [PATCH 03/15] sched,fair: redefine runnable_load_avg as the sum of task_h_load Rik van Riel
@ 2019-08-28 13:50   ` Vincent Guittot
  2019-08-28 14:47     ` Rik van Riel
  0 siblings, 1 reply; 47+ messages in thread
From: Vincent Guittot @ 2019-08-28 13:50 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Kernel Team, Paul Turner, Dietmar Eggemann,
	Peter Zijlstra, Ingo Molnar, Morten Rasmussen, Thomas Gleixner,
	Mel Gorman

Hi Rik,

On Thu, 22 Aug 2019 at 04:18, Rik van Riel <riel@surriel.com> wrote:
>
> The runnable_load magic is used to quickly propagate information about
> runnable tasks up the hierarchy of runqueues. The runnable_load_avg is
> mostly used for the load balancing code, which only examines the value at
> the root cfs_rq.
>
> Redefine the root cfs_rq runnable_load_avg to be the sum of task_h_loads
> of the runnable tasks. This works because the hierarchical runnable_load of
> a task is already equal to the task_se_h_load today. This provides enough
> information to the load balancer.
>
> The runnable_load_avg of the cgroup cfs_rqs does not appear to be
> used for anything, so don't bother calculating those.
>
> This removes one of the things that the code currently traverses the
> cgroup hierarchy for, and getting rid of it brings us one step closer
> to a flat runqueue for the CPU controller.

I like your proposal, but I just wanted to clarify one thing with this patch.
Although you removed the computation of runnable_load_avg of the
cgroup cfs_rq, we are still traversing the hierarchy to update the
root cfs_rq runnable_load_avg, because we are traversing the hierarchy
to compute the task_h_loads.

That being said, if we manage to remove the need to use
runnable_load_avg, we will completely skip this traversal. I have a
proposal to remove it from the load balance and wakeup paths, but I
haven't looked at the NUMA stats, which also use it.

>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  include/linux/sched.h |   3 +-
>  kernel/sched/core.c   |   2 -
>  kernel/sched/debug.c  |   1 +
>  kernel/sched/fair.c   | 125 +++++++++++++-----------------------------
>  kernel/sched/pelt.c   |  64 ++++++---------------
>  kernel/sched/sched.h  |   6 --
>  6 files changed, 56 insertions(+), 145 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 11837410690f..f5bb6948e40c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -391,7 +391,6 @@ struct util_est {
>  struct sched_avg {
>         u64                             last_update_time;
>         u64                             load_sum;
> -       u64                             runnable_load_sum;
>         u32                             util_sum;
>         u32                             period_contrib;
>         unsigned long                   load_avg;
> @@ -439,7 +438,6 @@ struct sched_statistics {
>  struct sched_entity {
>         /* For load-balancing: */
>         struct load_weight              load;
> -       unsigned long                   runnable_weight;
>         struct rb_node                  run_node;
>         struct list_head                group_node;
>         unsigned int                    on_rq;
> @@ -455,6 +453,7 @@ struct sched_entity {
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>         int                             depth;
> +       unsigned long                   enqueued_h_load;
>         struct sched_entity             *parent;
>         /* rq on which this entity is (to be) queued: */
>         struct cfs_rq                   *cfs_rq;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 874c427742a9..fbd96900f715 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -744,7 +744,6 @@ static void set_load_weight(struct task_struct *p, bool update_load)
>         if (task_has_idle_policy(p)) {
>                 load->weight = scale_load(WEIGHT_IDLEPRIO);
>                 load->inv_weight = WMULT_IDLEPRIO;
> -               p->se.runnable_weight = load->weight;
>                 return;
>         }
>
> @@ -757,7 +756,6 @@ static void set_load_weight(struct task_struct *p, bool update_load)
>         } else {
>                 load->weight = scale_load(sched_prio_to_weight[prio]);
>                 load->inv_weight = sched_prio_to_wmult[prio];
> -               p->se.runnable_weight = load->weight;
>         }
>  }
>
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index f6beaca97a09..cefc1b171c0b 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -962,6 +962,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
>         P(se.avg.load_avg);
>         P(se.avg.runnable_load_avg);
>         P(se.avg.util_avg);
> +       P(se.enqueued_h_load);
>         P(se.avg.last_update_time);
>         P(se.avg.util_est.ewma);
>         P(se.avg.util_est.enqueued);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index eadf9a96b3e1..30afeda1620f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -723,9 +723,7 @@ void init_entity_runnable_average(struct sched_entity *se)
>          * nothing has been attached to the task group yet.
>          */
>         if (entity_is_task(se))
> -               sa->runnable_load_avg = sa->load_avg = scale_load_down(se->load.weight);
> -
> -       se->runnable_weight = se->load.weight;
> +               sa->load_avg = scale_load_down(se->load.weight);
>
>         /* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
>  }
> @@ -2766,20 +2764,39 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  static inline void
>  enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -       cfs_rq->runnable_weight += se->runnable_weight;
> +       if (entity_is_task(se)) {
> +               struct cfs_rq *root_cfs_rq = &cfs_rq->rq->cfs;
> +               se->enqueued_h_load = task_se_h_load(se);
>
> -       cfs_rq->avg.runnable_load_avg += se->avg.runnable_load_avg;
> -       cfs_rq->avg.runnable_load_sum += se_runnable(se) * se->avg.runnable_load_sum;
> +               root_cfs_rq->avg.runnable_load_avg += se->enqueued_h_load;
> +       }
>  }
>
>  static inline void
>  dequeue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -       cfs_rq->runnable_weight -= se->runnable_weight;
> +       if (entity_is_task(se)) {
> +               struct cfs_rq *root_cfs_rq = &cfs_rq->rq->cfs;
> +               sub_positive(&root_cfs_rq->avg.runnable_load_avg,
> +                                       se->enqueued_h_load);
> +       }
> +}
> +
> +static inline void
> +update_runnable_load_avg(struct sched_entity *se)
> +{
> +       struct cfs_rq *root_cfs_rq = &cfs_rq_of(se)->rq->cfs;
> +       long new_h_load, delta;
>
> -       sub_positive(&cfs_rq->avg.runnable_load_avg, se->avg.runnable_load_avg);
> -       sub_positive(&cfs_rq->avg.runnable_load_sum,
> -                    se_runnable(se) * se->avg.runnable_load_sum);
> +       SCHED_WARN_ON(!entity_is_task(se));
> +
> +       if (!se->on_rq)
> +               return;
> +
> +       new_h_load = task_se_h_load(se);
> +       delta = new_h_load - se->enqueued_h_load;
> +       root_cfs_rq->avg.runnable_load_avg += delta;
> +       se->enqueued_h_load = new_h_load;
>  }
>
>  static inline void
> @@ -2807,7 +2824,7 @@ dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
>  #endif
>
>  static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> -                           unsigned long weight, unsigned long runnable)
> +                           unsigned long weight)
>  {
>         if (se->on_rq) {
>                 /* commit outstanding execution time */
> @@ -2818,7 +2835,6 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>         }
>         dequeue_load_avg(cfs_rq, se);
>
> -       se->runnable_weight = runnable;
>         update_load_set(&se->load, weight);
>
>  #ifdef CONFIG_SMP
> @@ -2826,8 +2842,6 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>                 u32 divider = LOAD_AVG_MAX - 1024 + se->avg.period_contrib;
>
>                 se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider);
> -               se->avg.runnable_load_avg =
> -                       div_u64(se_runnable(se) * se->avg.runnable_load_sum, divider);
>         } while (0);
>  #endif
>
> @@ -2845,7 +2859,7 @@ void reweight_task(struct task_struct *p, int prio)
>         struct load_weight *load = &se->load;
>         unsigned long weight = scale_load(sched_prio_to_weight[prio]);
>
> -       reweight_entity(cfs_rq, se, weight, weight);
> +       reweight_entity(cfs_rq, se, weight);
>         load->inv_weight = sched_prio_to_wmult[prio];
>  }
>
> @@ -2958,49 +2972,6 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
>         return clamp_t(long, shares, MIN_SHARES, tg_shares);
>  }
>
> -/*
> - * This calculates the effective runnable weight for a group entity based on
> - * the group entity weight calculated above.
> - *
> - * Because of the above approximation (2), our group entity weight is
> - * an load_avg based ratio (3). This means that it includes blocked load and
> - * does not represent the runnable weight.
> - *
> - * Approximate the group entity's runnable weight per ratio from the group
> - * runqueue:
> - *
> - *                                          grq->avg.runnable_load_avg
> - *   ge->runnable_weight = ge->load.weight * -------------------------- (7)
> - *                                              grq->avg.load_avg
> - *
> - * However, analogous to above, since the avg numbers are slow, this leads to
> - * transients in the from-idle case. Instead we use:
> - *
> - *   ge->runnable_weight = ge->load.weight *
> - *
> - *             max(grq->avg.runnable_load_avg, grq->runnable_weight)
> - *             -----------------------------------------------------   (8)
> - *                   max(grq->avg.load_avg, grq->load.weight)
> - *
> - * Where these max() serve both to use the 'instant' values to fix the slow
> - * from-idle and avoid the /0 on to-idle, similar to (6).
> - */
> -static long calc_group_runnable(struct cfs_rq *cfs_rq, long shares)
> -{
> -       long runnable, load_avg;
> -
> -       load_avg = max(cfs_rq->avg.load_avg,
> -                      scale_load_down(cfs_rq->load.weight));
> -
> -       runnable = max(cfs_rq->avg.runnable_load_avg,
> -                      scale_load_down(cfs_rq->runnable_weight));
> -
> -       runnable *= shares;
> -       if (load_avg)
> -               runnable /= load_avg;
> -
> -       return clamp_t(long, runnable, MIN_SHARES, shares);
> -}
>  #endif /* CONFIG_SMP */
>
>  static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
> @@ -3012,25 +2983,26 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
>  static void update_cfs_group(struct sched_entity *se)
>  {
>         struct cfs_rq *gcfs_rq = group_cfs_rq(se);
> -       long shares, runnable;
> +       long shares;
>
> -       if (!gcfs_rq)
> +       if (!gcfs_rq) {
> +               update_runnable_load_avg(se);
>                 return;
> +       }
>
>         if (throttled_hierarchy(gcfs_rq))
>                 return;
>
>  #ifndef CONFIG_SMP
> -       runnable = shares = READ_ONCE(gcfs_rq->tg->shares);
> +       shares = READ_ONCE(gcfs_rq->tg->shares);
>
>         if (likely(se->load.weight == shares))
>                 return;
>  #else
>         shares   = calc_group_shares(gcfs_rq);
> -       runnable = calc_group_runnable(gcfs_rq, shares);
>  #endif
>
> -       reweight_entity(cfs_rq_of(se), se, shares, runnable);
> +       reweight_entity(cfs_rq_of(se), se, shares);
>  }
>
>  #else /* CONFIG_FAIR_GROUP_SCHED */
> @@ -3243,8 +3215,8 @@ static inline void
>  update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
>  {
>         long delta_avg, running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
> -       unsigned long runnable_load_avg, load_avg;
> -       u64 runnable_load_sum, load_sum = 0;
> +       unsigned long load_avg;
> +       u64 load_sum = 0;
>         s64 delta_sum;
>
>         if (!runnable_sum)
> @@ -3292,19 +3264,6 @@ update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cf
>         se->avg.load_avg = load_avg;
>         add_positive(&cfs_rq->avg.load_avg, delta_avg);
>         add_positive(&cfs_rq->avg.load_sum, delta_sum);
> -
> -       runnable_load_sum = (s64)se_runnable(se) * runnable_sum;
> -       runnable_load_avg = div_s64(runnable_load_sum, LOAD_AVG_MAX);
> -       delta_sum = runnable_load_sum - se_weight(se) * se->avg.runnable_load_sum;
> -       delta_avg = runnable_load_avg - se->avg.runnable_load_avg;
> -
> -       se->avg.runnable_load_sum = runnable_sum;
> -       se->avg.runnable_load_avg = runnable_load_avg;
> -
> -       if (se->on_rq) {
> -               add_positive(&cfs_rq->avg.runnable_load_avg, delta_avg);
> -               add_positive(&cfs_rq->avg.runnable_load_sum, delta_sum);
> -       }
>  }
>
>  static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum)
> @@ -3399,7 +3358,7 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
>  static inline int
>  update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  {
> -       unsigned long removed_load = 0, removed_util = 0, removed_runnable_sum = 0;
> +       unsigned long removed_load = 0, removed_util = 0;
>         struct sched_avg *sa = &cfs_rq->avg;
>         int decayed = 0;
>
> @@ -3410,7 +3369,6 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>                 raw_spin_lock(&cfs_rq->removed.lock);
>                 swap(cfs_rq->removed.util_avg, removed_util);
>                 swap(cfs_rq->removed.load_avg, removed_load);
> -               swap(cfs_rq->removed.runnable_sum, removed_runnable_sum);
>                 cfs_rq->removed.nr = 0;
>                 raw_spin_unlock(&cfs_rq->removed.lock);
>
> @@ -3422,8 +3380,6 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>                 sub_positive(&sa->util_avg, r);
>                 sub_positive(&sa->util_sum, r * divider);
>
> -               add_tg_cfs_propagate(cfs_rq, -(long)removed_runnable_sum);
> -
>                 decayed = 1;
>         }
>
> @@ -3477,8 +3433,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>                         div_u64(se->avg.load_avg * se->avg.load_sum, se_weight(se));
>         }
>
> -       se->avg.runnable_load_sum = se->avg.load_sum;
> -
>         enqueue_load_avg(cfs_rq, se);
>         cfs_rq->avg.util_avg += se->avg.util_avg;
>         cfs_rq->avg.util_sum += se->avg.util_sum;
> @@ -7735,9 +7689,6 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>         if (cfs_rq->avg.util_sum)
>                 return false;
>
> -       if (cfs_rq->avg.runnable_load_sum)
> -               return false;
> -
>         return true;
>  }
>
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index befce29bd882..b3f3e8b29394 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -106,7 +106,7 @@ static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
>   */
>  static __always_inline u32
>  accumulate_sum(u64 delta, struct sched_avg *sa,
> -              unsigned long load, unsigned long runnable, int running)
> +              unsigned long load, int running)
>  {
>         u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
>         u64 periods;
> @@ -119,8 +119,6 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
>          */
>         if (periods) {
>                 sa->load_sum = decay_load(sa->load_sum, periods);
> -               sa->runnable_load_sum =
> -                       decay_load(sa->runnable_load_sum, periods);
>                 sa->util_sum = decay_load((u64)(sa->util_sum), periods);
>
>                 /*
> @@ -134,8 +132,6 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
>
>         if (load)
>                 sa->load_sum += load * contrib;
> -       if (runnable)
> -               sa->runnable_load_sum += runnable * contrib;
>         if (running)
>                 sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
>
> @@ -172,7 +168,7 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
>   */
>  static __always_inline int
>  ___update_load_sum(u64 now, struct sched_avg *sa,
> -                 unsigned long load, unsigned long runnable, int running)
> +                 unsigned long load, int running)
>  {
>         u64 delta;
>
> @@ -206,7 +202,7 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
>          * update_blocked_averages()
>          */
>         if (!load)
> -               runnable = running = 0;
> +               running = 0;
>
>         /*
>          * Now we know we crossed measurement unit boundaries. The *_avg
> @@ -215,14 +211,14 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
>          * Step 1: accumulate *_sum since last_update_time. If we haven't
>          * crossed period boundaries, finish.
>          */
> -       if (!accumulate_sum(delta, sa, load, runnable, running))
> +       if (!accumulate_sum(delta, sa, load, running))
>                 return 0;
>
>         return 1;
>  }
>
>  static __always_inline void
> -___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runnable)
> +___update_load_avg(struct sched_avg *sa, unsigned long load)
>  {
>         u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
>
> @@ -230,41 +226,25 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna
>          * Step 2: update *_avg.
>          */
>         sa->load_avg = div_u64(load * sa->load_sum, divider);
> -       sa->runnable_load_avg = div_u64(runnable * sa->runnable_load_sum, divider);
>         WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
>  }
>
>  /*
>   * sched_entity:
>   *
> - *   task:
> - *     se_runnable() == se_weight()
> - *
> - *   group: [ see update_cfs_group() ]
> - *     se_weight()   = tg->weight * grq->load_avg / tg->load_avg
> - *     se_runnable() = se_weight(se) * grq->runnable_load_avg / grq->load_avg
> - *
>   *   load_sum := runnable_sum
>   *   load_avg = se_weight(se) * runnable_avg
>   *
> - *   runnable_load_sum := runnable_sum
> - *   runnable_load_avg = se_runnable(se) * runnable_avg
> - *
> - * XXX collapse load_sum and runnable_load_sum
> - *
>   * cfq_rq:
>   *
>   *   load_sum = \Sum se_weight(se) * se->avg.load_sum
>   *   load_avg = \Sum se->avg.load_avg
> - *
> - *   runnable_load_sum = \Sum se_runnable(se) * se->avg.runnable_load_sum
> - *   runnable_load_avg = \Sum se->avg.runable_load_avg
>   */
>
>  int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
>  {
> -       if (___update_load_sum(now, &se->avg, 0, 0, 0)) {
> -               ___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
> +       if (___update_load_sum(now, &se->avg, 0, 0)) {
> +               ___update_load_avg(&se->avg, se_weight(se));
>                 return 1;
>         }
>
> @@ -273,10 +253,10 @@ int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
>
>  int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -       if (___update_load_sum(now, &se->avg, !!se->on_rq, !!se->on_rq,
> +       if (___update_load_sum(now, &se->avg, !!se->on_rq,
>                                 cfs_rq->curr == se)) {
>
> -               ___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
> +               ___update_load_avg(&se->avg, se_weight(se));
>                 cfs_se_util_change(&se->avg);
>                 return 1;
>         }
> @@ -288,10 +268,9 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
>  {
>         if (___update_load_sum(now, &cfs_rq->avg,
>                                 scale_load_down(cfs_rq->load.weight),
> -                               scale_load_down(cfs_rq->runnable_weight),
>                                 cfs_rq->curr != NULL)) {
>
> -               ___update_load_avg(&cfs_rq->avg, 1, 1);
> +               ___update_load_avg(&cfs_rq->avg, 1);
>                 return 1;
>         }
>
> @@ -303,20 +282,18 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
>   *
>   *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
>   *   util_sum = cpu_scale * load_sum
> - *   runnable_load_sum = load_sum
>   *
> - *   load_avg and runnable_load_avg are not supported and meaningless.
> + *   load_avg is not supported and meaningless.
>   *
>   */
>
>  int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
>  {
>         if (___update_load_sum(now, &rq->avg_rt,
> -                               running,
>                                 running,
>                                 running)) {
>
> -               ___update_load_avg(&rq->avg_rt, 1, 1);
> +               ___update_load_avg(&rq->avg_rt, 1);
>                 return 1;
>         }
>
> @@ -328,18 +305,16 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
>   *
>   *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
>   *   util_sum = cpu_scale * load_sum
> - *   runnable_load_sum = load_sum
>   *
>   */
>
>  int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
>  {
>         if (___update_load_sum(now, &rq->avg_dl,
> -                               running,
>                                 running,
>                                 running)) {
>
> -               ___update_load_avg(&rq->avg_dl, 1, 1);
> +               ___update_load_avg(&rq->avg_dl, 1);
>                 return 1;
>         }
>
> @@ -352,7 +327,6 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
>   *
>   *   util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked
>   *   util_sum = cpu_scale * load_sum
> - *   runnable_load_sum = load_sum
>   *
>   */
>
> @@ -379,17 +353,11 @@ int update_irq_load_avg(struct rq *rq, u64 running)
>          * We can safely remove running from rq->clock because
>          * rq->clock += delta with delta >= running
>          */
> -       ret = ___update_load_sum(rq->clock - running, &rq->avg_irq,
> -                               0,
> -                               0,
> -                               0);
> -       ret += ___update_load_sum(rq->clock, &rq->avg_irq,
> -                               1,
> -                               1,
> -                               1);
> +       ret = ___update_load_sum(rq->clock - running, &rq->avg_irq, 0, 0);
> +       ret += ___update_load_sum(rq->clock, &rq->avg_irq, 1, 1);
>
>         if (ret)
> -               ___update_load_avg(&rq->avg_irq, 1, 1);
> +               ___update_load_avg(&rq->avg_irq, 1);
>
>         return ret;
>  }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b52ed1ada0be..5be14cee61f9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -487,7 +487,6 @@ struct cfs_bandwidth { };
>  /* CFS-related fields in a runqueue */
>  struct cfs_rq {
>         struct load_weight      load;
> -       unsigned long           runnable_weight;
>         unsigned int            nr_running;
>         unsigned int            h_nr_running;
>
> @@ -700,11 +699,6 @@ static inline long se_weight(struct sched_entity *se)
>         return scale_load_down(se->load.weight);
>  }
>
> -static inline long se_runnable(struct sched_entity *se)
> -{
> -       return scale_load_down(se->runnable_weight);
> -}
> -
>  static inline bool sched_asym_prefer(int a, int b)
>  {
>         return arch_asym_cpu_priority(a) > arch_asym_cpu_priority(b);
> --
> 2.20.1
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 06/15] sched,cfs: use explicit cfs_rq of parent se helper
  2019-08-22  2:17 ` [PATCH 06/15] sched,cfs: use explicit cfs_rq of parent se helper Rik van Riel
@ 2019-08-28 13:53   ` Vincent Guittot
  2019-08-28 15:28     ` Rik van Riel
  0 siblings, 1 reply; 47+ messages in thread
From: Vincent Guittot @ 2019-08-28 13:53 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Kernel Team, Paul Turner, Dietmar Eggemann,
	Peter Zijlstra, Ingo Molnar, Morten Rasmussen, Thomas Gleixner,
	Mel Gorman

On Thu, 22 Aug 2019 at 04:18, Rik van Riel <riel@surriel.com> wrote:
>
> Use an explicit "cfs_rq of parent sched_entity" helper in a few
> strategic places, where cfs_rq_of(se) may no longer point at the

The only case is the sched_entity of a task, which will point to the
root cfs_rq, isn't it?

> right runqueue once we flatten the hierarchical cgroup runqueues.
>
> No functional change.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  kernel/sched/fair.c | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 04b216234265..31a26737a873 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -276,6 +276,15 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
>         return grp->my_q;
>  }
>
> +/* runqueue owned by the parent entity; the root cfs_rq for a top level se */
> +static inline struct cfs_rq *group_cfs_rq_of_parent(struct sched_entity *se)
> +{
> +       if (se->parent)
> +               return group_cfs_rq(se->parent);
> +
> +       return cfs_rq_of(se);
> +}
> +
>  static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
>  {
>         struct rq *rq = rq_of(cfs_rq);
> @@ -3319,7 +3328,7 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
>
>         gcfs_rq->propagate = 0;
>
> -       cfs_rq = cfs_rq_of(se);
> +       cfs_rq = group_cfs_rq_of_parent(se);
>
>         add_tg_cfs_propagate(cfs_rq, gcfs_rq->prop_runnable_sum);
>
> @@ -7796,7 +7805,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
>
>         WRITE_ONCE(cfs_rq->h_load_next, NULL);
>         for_each_sched_entity(se) {
> -               cfs_rq = cfs_rq_of(se);
> +               cfs_rq = group_cfs_rq_of_parent(se);
>                 WRITE_ONCE(cfs_rq->h_load_next, se);
>                 if (cfs_rq->last_h_load_update == now)
>                         break;
> @@ -7819,7 +7828,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
>
>  static unsigned long task_se_h_load(struct sched_entity *se)
>  {
> -       struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +       struct cfs_rq *cfs_rq = group_cfs_rq_of_parent(se);
>
>         update_cfs_rq_h_load(cfs_rq);
>         return div64_ul(se->avg.load_avg * cfs_rq->h_load,
> @@ -10166,7 +10175,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>         struct sched_entity *se = &curr->se;
>
>         for_each_sched_entity(se) {
> -               cfs_rq = cfs_rq_of(se);
> +               cfs_rq = group_cfs_rq_of_parent(se);
>                 entity_tick(cfs_rq, se, queued);
>         }
>
> --
> 2.20.1
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 05/15] sched,fair: remove cfs_rqs from leaf_cfs_rq_list bottom up
  2019-08-22  2:17 ` [PATCH 05/15] sched,fair: remove cfs_rqs from leaf_cfs_rq_list bottom up Rik van Riel
@ 2019-08-28 14:09   ` Vincent Guittot
  0 siblings, 0 replies; 47+ messages in thread
From: Vincent Guittot @ 2019-08-28 14:09 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Kernel Team, Paul Turner, Dietmar Eggemann,
	Peter Zijlstra, Ingo Molnar, Morten Rasmussen, Thomas Gleixner,
	Mel Gorman

On Thu, 22 Aug 2019 at 04:18, Rik van Riel <riel@surriel.com> wrote:
>
> Reducing the overhead of the CPU controller is achieved by not walking
> all the sched_entities every time a task is enqueued or dequeued.
>
> One of the things being checked every single time is whether the cfs_rq
> is on the rq->leaf_cfs_rq_list.
>
> By only removing a cfs_rq from the list once it no longer has children
> on the list, we can avoid walking the sched_entity hierarchy if the bottom
> cfs_rq is on the list, once the runqueues have been flattened.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>

Acked-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---
>  kernel/sched/fair.c | 35 ++++++++++++++++++++++++++++++++++-
>  1 file changed, 34 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a48d0dbfc232..04b216234265 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -369,6 +369,39 @@ static inline void assert_list_leaf_cfs_rq(struct rq *rq)
>         SCHED_WARN_ON(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list);
>  }
>
> +/*
> + * Because list_add_leaf_cfs_rq always places a child cfs_rq on the list
> + * immediately before a parent cfs_rq, and cfs_rqs are removed from the list
> + * bottom-up, we only have to test whether the cfs_rq before us on the list
> + * is our child.
> + */
> +static inline bool child_cfs_rq_on_list(struct cfs_rq *cfs_rq)
> +{
> +       struct cfs_rq *prev_cfs_rq;
> +       struct list_head *prev;
> +
> +       prev = cfs_rq->leaf_cfs_rq_list.prev;
> +       prev_cfs_rq = container_of(prev, struct cfs_rq, leaf_cfs_rq_list);
> +
> +       return (prev_cfs_rq->tg->parent == cfs_rq->tg);
> +}
> +
> +/*
> + * Remove a cfs_rq from the list if it has no children on the list.
> + * The scheduler iterates over the list regularly; if conditions for
> + * removal are still true, we'll get to this cfs_rq in the future.
> + */
> +static inline void list_del_leaf_cfs_rq_bottom(struct cfs_rq *cfs_rq)
> +{
> +       if (!cfs_rq->on_list)
> +               return;
> +
> +       if (child_cfs_rq_on_list(cfs_rq))
> +               return;
> +
> +       list_del_leaf_cfs_rq(cfs_rq);
> +}
> +
>  /* Iterate thr' all leaf cfs_rq's on a runqueue */
>  #define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)                     \
>         list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list,    \
> @@ -7723,7 +7756,7 @@ static void update_blocked_averages(int cpu)
>                  * decayed cfs_rqs linger on the list.
>                  */
>                 if (cfs_rq_is_decayed(cfs_rq))
> -                       list_del_leaf_cfs_rq(cfs_rq);
> +                       list_del_leaf_cfs_rq_bottom(cfs_rq);
>
>                 /* Don't need periodic decay once load/util_avg are null */
>                 if (cfs_rq_has_blocked(cfs_rq))
> --
> 2.20.1
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 03/15] sched,fair: redefine runnable_load_avg as the sum of task_h_load
  2019-08-28 13:50   ` Vincent Guittot
@ 2019-08-28 14:47     ` Rik van Riel
  2019-08-28 15:02       ` Vincent Guittot
  0 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2019-08-28 14:47 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Kernel Team, Paul Turner, Dietmar Eggemann,
	Peter Zijlstra, Ingo Molnar, Morten Rasmussen, Thomas Gleixner,
	Mel Gorman

[-- Attachment #1: Type: text/plain, Size: 1847 bytes --]

On Wed, 2019-08-28 at 15:50 +0200, Vincent Guittot wrote:
> Hi Rik,
> 
> On Thu, 22 Aug 2019 at 04:18, Rik van Riel <riel@surriel.com> wrote:
> > The runnable_load magic is used to quickly propagate information
> > about
> > runnable tasks up the hierarchy of runqueues. The runnable_load_avg
> > is
> > mostly used for the load balancing code, which only examines the
> > value at
> > the root cfs_rq.
> > 
> > Redefine the root cfs_rq runnable_load_avg to be the sum of
> > task_h_loads
> > of the runnable tasks. This works because the hierarchical
> > runnable_load of
> > a task is already equal to the task_se_h_load today. This provides
> > enough
> > information to the load balancer.
> > 
> > The runnable_load_avg of the cgroup cfs_rqs does not appear to be
> > used for anything, so don't bother calculating those.
> > 
> > This removes one of the things that the code currently traverses
> > the
> > cgroup hierarchy for, and getting rid of it brings us one step
> > closer
> > to a flat runqueue for the CPU controller.
> 
> I like your proposal but just wanted to clarify one thing with this
> patch.
> Although you removed the computation of runnable_load_avg of the
> cgroup cfs_rq, we are still traversing the hierarchy to update the
> root cfs_rq runnable_load_avg because we are traversing the hierarchy
> for computing the task_h_loads

The task_h_load hierarchy traversal in update_cfs_rq_h_load
is rate limited to once a jiffy, though.  Rate limiting the
hierarchy traversal significantly reduces overhead.
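
For reference, the rate limiting is just the jiffies check at the top
of update_cfs_rq_h_load(), roughly like this (paraphrased, with the
actual hierarchy walk left out):

	static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
	{
		unsigned long now = jiffies;

		if (cfs_rq->last_h_load_update == now)
			return;

		/*
		 * ... walk the hierarchy and refresh h_load, stamping
		 * last_h_load_update so later calls in this jiffy bail out ...
		 */
	}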

> That being said, if we manage to remove the need to use
> runnable_load_avg, we will completely skip this traversal. I have a
> proposal to remove it from the load balance and wakeup paths, but I
> haven't looked at the NUMA stats, which also use it.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 03/15] sched,fair: redefine runnable_load_avg as the sum of task_h_load
  2019-08-28 14:47     ` Rik van Riel
@ 2019-08-28 15:02       ` Vincent Guittot
  0 siblings, 0 replies; 47+ messages in thread
From: Vincent Guittot @ 2019-08-28 15:02 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Kernel Team, Paul Turner, Dietmar Eggemann,
	Peter Zijlstra, Ingo Molnar, Morten Rasmussen, Thomas Gleixner,
	Mel Gorman

On Wed, 28 Aug 2019 at 16:48, Rik van Riel <riel@surriel.com> wrote:
>
> On Wed, 2019-08-28 at 15:50 +0200, Vincent Guittot wrote:
> > Hi Rik,
> >
> > On Thu, 22 Aug 2019 at 04:18, Rik van Riel <riel@surriel.com> wrote:
> > > The runnable_load magic is used to quickly propagate information
> > > about
> > > runnable tasks up the hierarchy of runqueues. The runnable_load_avg
> > > is
> > > mostly used for the load balancing code, which only examines the
> > > value at
> > > the root cfs_rq.
> > >
> > > Redefine the root cfs_rq runnable_load_avg to be the sum of
> > > task_h_loads
> > > of the runnable tasks. This works because the hierarchical
> > > runnable_load of
> > > a task is already equal to the task_se_h_load today. This provides
> > > enough
> > > information to the load balancer.
> > >
> > > The runnable_load_avg of the cgroup cfs_rqs does not appear to be
> > > used for anything, so don't bother calculating those.
> > >
> > > This removes one of the things that the code currently traverses
> > > the
> > > cgroup hierarchy for, and getting rid of it brings us one step
> > > closer
> > > to a flat runqueue for the CPU controller.
> >
> > I like your proposal but just wanted to clarify one thing with this
> > patch.
> > Although you removed the computation of runnable_load_avg of the
> > cgroup cfs_rq, we are still traversing the hierarchy to update the
> > root cfs_rq runnable_load_avg because we are traversing the hierarchy
> > for computing the task_h_loads
>
> The task_h_load hierarchy traversal in update_cfs_rq_h_load
> is rate limited to once a jiffy, though.  Rate limiting the

Ah yes. I forgot that it was jiffies and not clock_task that is used
for limiting the update.

> hierarchy traversal significantly reduces overhead.
>
> > That being said, if we manage to remove the need to use
> > runnable_load_avg, we will completely skip this traversal. I have a
> > proposal to remove it from the load balance and wakeup paths, but I
> > haven't looked at the NUMA stats, which also use it.
>
> --
> All Rights Reversed.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 06/15] sched,cfs: use explicit cfs_rq of parent se helper
  2019-08-28 13:53   ` Vincent Guittot
@ 2019-08-28 15:28     ` Rik van Riel
  2019-08-28 16:34       ` Vincent Guittot
  0 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2019-08-28 15:28 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Kernel Team, Paul Turner, Dietmar Eggemann,
	Peter Zijlstra, Ingo Molnar, Morten Rasmussen, Thomas Gleixner,
	Mel Gorman

[-- Attachment #1: Type: text/plain, Size: 521 bytes --]

On Wed, 2019-08-28 at 15:53 +0200, Vincent Guittot wrote:
> On Thu, 22 Aug 2019 at 04:18, Rik van Riel <riel@surriel.com> wrote:
> > Use an explicit "cfs_rq of parent sched_entity" helper in a few
> > strategic places, where cfs_rq_of(se) may no longer point at the
> 
> The only case is the sched_entity of a task, which will point to the
> root cfs_rq, isn't it?

True. I am more than open to ideas to simplify this
part of the code, hopefully in some way that makes
it easier to read.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 06/15] sched,cfs: use explicit cfs_rq of parent se helper
  2019-08-28 15:28     ` Rik van Riel
@ 2019-08-28 16:34       ` Vincent Guittot
  0 siblings, 0 replies; 47+ messages in thread
From: Vincent Guittot @ 2019-08-28 16:34 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Kernel Team, Paul Turner, Dietmar Eggemann,
	Peter Zijlstra, Ingo Molnar, Morten Rasmussen, Thomas Gleixner,
	Mel Gorman

On Wed, 28 Aug 2019 at 17:28, Rik van Riel <riel@surriel.com> wrote:
>
> On Wed, 2019-08-28 at 15:53 +0200, Vincent Guittot wrote:
> > On Thu, 22 Aug 2019 at 04:18, Rik van Riel <riel@surriel.com> wrote:
> > > Use an explicit "cfs_rq of parent sched_entity" helper in a few
> > > strategic places, where cfs_rq_of(se) may no longer point at the
> >
> > The only case is the sched_entity of a task, which will point to the
> > root cfs_rq, isn't it?
>
> True. I am more than open to ideas to simplify this
> part of the code, hopefully in some way that makes
> it easier to read.

My question was mainly informative, to make sure that I didn't miss a
corner case.

>
> --
> All Rights Reversed.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 07/15] sched,cfs: fix zero length timeslice calculation
  2019-08-22  2:17 ` [PATCH 07/15] sched,cfs: fix zero length timeslice calculation Rik van Riel
@ 2019-08-28 16:59   ` Vincent Guittot
  0 siblings, 0 replies; 47+ messages in thread
From: Vincent Guittot @ 2019-08-28 16:59 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Kernel Team, Paul Turner, Dietmar Eggemann,
	Peter Zijlstra, Ingo Molnar, Morten Rasmussen, Thomas Gleixner,
	Mel Gorman

On Thu, 22 Aug 2019 at 04:18, Rik van Riel <riel@surriel.com> wrote:
>
> The way the time slice length is currently calculated, not only do high
> priority tasks get longer time slices than low priority tasks, but due
> to fixed point math, low priority tasks could end up with a zero length
> time slice. This can lead to cache thrashing and other inefficiencies.
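
I can see how the integer math could truncate to zero in principle; a
toy userspace illustration of that truncation (completely made-up
weights, not the kernel's __calc_delta()):

	#include <stdio.h>

	int main(void)
	{
		unsigned long long slice = 4000000ULL;	/* 4 ms, in nanoseconds */
		unsigned long weight = 2, load = 20000;	/* tiny se weight vs. big cfs_rq load */
		int level;

		/* two nested cgroup levels, integer division at each one */
		for (level = 0; level < 2; level++)
			slice = slice * weight / load;

		printf("slice = %llu ns\n", slice);	/* prints 0 */
		return 0;
	}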

Have you got more details on those use cases?

>
> Cap the minimum time slice length to sysctl_sched_min_granularity.
>
> Tasks that end up getting a time slice length too long for their relative
> priority will simply end up having their vruntime advanced much faster than
> other tasks, resulting in them receiving time slices less frequently.

In fact, that already happens, as we wait for the tick to preempt a
task (unless you enable sched_feat(HRTICK)), so it sounds reasonable.

>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  kernel/sched/fair.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 31a26737a873..8f8c85c6da9b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -732,6 +732,13 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
>                 }
>                 slice = __calc_delta(slice, se->load.weight, load);
>         }
> +
> +       /*
> +        * To avoid cache thrashing, run at least sysctl_sched_min_granularity.
> +        * The vruntime of a low priority task advances faster; those tasks
> +        * will simply get time slices less frequently.
> +        */
> +       slice = max_t(u64, slice, sysctl_sched_min_granularity);
>         return slice;
>  }
>
> --
> 2.20.1
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 08/15] sched,fair: simplify timeslice length code
  2019-08-22  2:17 ` [PATCH 08/15] sched,fair: simplify timeslice length code Rik van Riel
@ 2019-08-28 17:32   ` Vincent Guittot
  2019-08-28 23:18     ` Rik van Riel
  0 siblings, 1 reply; 47+ messages in thread
From: Vincent Guittot @ 2019-08-28 17:32 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Kernel Team, Paul Turner, Dietmar Eggemann,
	Peter Zijlstra, Ingo Molnar, Morten Rasmussen, Thomas Gleixner,
	Mel Gorman

On Thu, 22 Aug 2019 at 04:18, Rik van Riel <riel@surriel.com> wrote:
>
> The idea behind __sched_period makes sense, but the results do not always.
>
> When a CPU has one high priority task and a large number of low priority
> tasks, __sched_period will return a value larger than sysctl_sched_latency,
> and the one high priority task may end up getting a timeslice all for itself
> that is also much larger than sysctl_sched_latency.

Note that unless you enable sched_feat(HRTICK), the sched_slice is
mainly used to decide how fast we preempt the running task at the
tick, but a newly woken task can preempt it before that.

>
> The low priority tasks will have their time slices rounded up to
> sysctl_sched_min_granularity, resulting in an even larger scheduling
> latency than targeted by __sched_period.

Will this not break the fairness between an always running task and a
short sleeping one with these changes?

>
> Simplify the code by simply ripping out __sched_period and always taking
> fractions of sysctl_sched_latency.
>
> If a high priority task ends up getting a "too small" time slice compared
> to low priority tasks, the vruntime scaling ensures that it will simply
> get scheduled more frequently than low priority tasks.

Will you not increase the number of context switches?

>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  kernel/sched/fair.c | 18 +-----------------
>  1 file changed, 1 insertion(+), 17 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8f8c85c6da9b..74ee22c59d13 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -691,22 +691,6 @@ static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
>         return delta;
>  }
>
> -/*
> - * The idea is to set a period in which each task runs once.
> - *
> - * When there are too many tasks (sched_nr_latency) we have to stretch
> - * this period because otherwise the slices get too small.
> - *
> - * p = (nr <= nl) ? l : l*nr/nl
> - */
> -static u64 __sched_period(unsigned long nr_running)
> -{
> -       if (unlikely(nr_running > sched_nr_latency))
> -               return nr_running * sysctl_sched_min_granularity;
> -       else
> -               return sysctl_sched_latency;
> -}
> -
>  /*
>   * We calculate the wall-time slice from the period by taking a part
>   * proportional to the weight.
> @@ -715,7 +699,7 @@ static u64 __sched_period(unsigned long nr_running)
>   */
>  static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -       u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);
> +       u64 slice = sysctl_sched_latency;
>
>         for_each_sched_entity(se) {
>                 struct load_weight *load;
> --
> 2.20.1
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 08/15] sched,fair: simplify timeslice length code
  2019-08-28 17:32   ` Vincent Guittot
@ 2019-08-28 23:18     ` Rik van Riel
  2019-08-29 14:02       ` Vincent Guittot
  0 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2019-08-28 23:18 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Kernel Team, Paul Turner, Dietmar Eggemann,
	Peter Zijlstra, Ingo Molnar, Morten Rasmussen, Thomas Gleixner,
	Mel Gorman

[-- Attachment #1: Type: text/plain, Size: 2090 bytes --]

On Wed, 2019-08-28 at 19:32 +0200, Vincent Guittot wrote:
> On Thu, 22 Aug 2019 at 04:18, Rik van Riel <riel@surriel.com> wrote:
> > The idea behind __sched_period makes sense, but the results do not
> > always.
> > 
> > When a CPU has one high priority task and a large number of low
> > priority
> > tasks, __sched_period will return a value larger than
> > sysctl_sched_latency,
> > and the one high priority task may end up getting a timeslice all
> > for itself
> > that is also much larger than sysctl_sched_latency.
> 
> Note that unless you enable sched_feat(HRTICK), the sched_slice is
> mainly used to decide how fast we preempt the running task at the
> tick, but a newly woken task can preempt it before that.
> 
> > The low priority tasks will have their time slices rounded up to
> > sysctl_sched_min_granularity, resulting in an even larger
> > scheduling
> > latency than targeted by __sched_period.
> 
> Will this not break the fairness between an always running task and a
> short sleeping one with these changes?

In what way?

The vruntime for the always running task will continue
to advance the same way it always has.

> > Simplify the code by simply ripping out __sched_period and always
> > taking
> > fractions of sysctl_sched_latency.
> > 
> > If a high priority task ends up getting a "too small" time slice
> > compared
> > to low priority tasks, the vruntime scaling ensures that it will
> > simply
> > get scheduled more frequently than low priority tasks.
> 
> Will you not increase the number of context switches?

It should actually decrease the number of context
switches. If a nice +19 task gets a longer time slice
than it would today, its vruntime will be advanced by
more than sysctl_sched_latency, and it will not get
to run again until another task has caught up with its
vruntime.
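
To put a rough number on that (using the standard weight table, where
nice 0 is 1024 and nice +19 is 15), a 6 ms wall-clock slice advances
the nice +19 task's vruntime by roughly

	6 ms * 1024 / 15  ~=  410 ms

of virtual time, so other tasks get to catch up for quite a while
before it is picked again.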

That means the regular (or high) priority task that
shares the CPU with that nice +19 task might get
several time slices in a row until the nice +19 task
gets to run again.

What am I overlooking?

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 08/15] sched,fair: simplify timeslice length code
  2019-08-28 23:18     ` Rik van Riel
@ 2019-08-29 14:02       ` Vincent Guittot
  2019-08-29 16:00         ` Rik van Riel
  0 siblings, 1 reply; 47+ messages in thread
From: Vincent Guittot @ 2019-08-29 14:02 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Kernel Team, Paul Turner, Dietmar Eggemann,
	Peter Zijlstra, Ingo Molnar, Morten Rasmussen, Thomas Gleixner,
	Mel Gorman

On Thu, 29 Aug 2019 at 01:19, Rik van Riel <riel@surriel.com> wrote:
>
> On Wed, 2019-08-28 at 19:32 +0200, Vincent Guittot wrote:
> > On Thu, 22 Aug 2019 at 04:18, Rik van Riel <riel@surriel.com> wrote:
> > > The idea behind __sched_period makes sense, but the results do not
> > > always.
> > >
> > > When a CPU has one high priority task and a large number of low
> > > priority
> > > tasks, __sched_period will return a value larger than
> > > sysctl_sched_latency,
> > > and the one high priority task may end up getting a timeslice all
> > > for itself
> > > that is also much larger than sysctl_sched_latency.
> >
> > Note that unless you enable sched_feat(HRTICK), the sched_slice is
> > mainly used to decide how fast we preempt the running task at the
> > tick, but a newly woken task can preempt it before that.
> >
> > > The low priority tasks will have their time slices rounded up to
> > > sysctl_sched_min_granularity, resulting in an even larger
> > > scheduling
> > > latency than targeted by __sched_period.
> >
> > Will this not break the fairness between an always running task and a
> > short sleeping one with these changes?
>
> In what way?
>
> The vruntime for the always running task will continue
> to advance the same way it always has.

OK, so first, my brain is probably not yet fully back from vacation: I
read sysctl_sched_min_granularity instead of sysctl_sched_latency and
wrongly thought that you were setting sysctl_sched_min_granularity for
all tasks.
That being said, sched_slice is used to prevent other tasks from
preempting the running task before it gets a chance to run its ideal
time compared to the others, and before newly arriving tasks modify
everyone's ideal sched_slice. By capping that maximum value, the task
can be preempted earlier than before by a newly woken task, and may
not get the amount of running time it could have expected before the
situation changed.

>
> > > Simplify the code by simply ripping out __sched_period and always
> > > taking
> > > fractions of sysctl_sched_latency.
> > >
> > > If a high priority task ends up getting a "too small" time slice
> > > compared
> > > to low priority tasks, the vruntime scaling ensures that it will
> > > simply
> > > get scheduled more frequently than low priority tasks.
> >
> > Will you not increase the number of context switches?
>
> It should actually decrease the number of context
> switches. If a nice +19 task gets a longer time slice
> than it would today, its vruntime will be advanced by

In fact, that's already the case today: when a task is scheduled, it
runs a full jiffy even if its sched_slice is smaller than a jiffy
(unless you have enabled sched_feat(HRTICK)).

> more than sysctl_sched_latency, and it will not get
> to run again until another task has caught up with its
> vruntime.
>
> That means the regular (or high) priority task that
> shares the CPU with that nice +19 task might get
> several time slices in a row until the nice +19 task
> gets to run again.
>
> What am I overlooking?

My point is more about tasks that run several ticks in a row. Their
sched_slice will be shorter in some cases with your changes, so they
can be preempted earlier by other runnable tasks with a lower vruntime,
and there will be more context switches.

>
> --
> All Rights Reversed.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 08/15] sched,fair: simplify timeslice length code
  2019-08-29 14:02       ` Vincent Guittot
@ 2019-08-29 16:00         ` Rik van Riel
  2019-08-30  6:41           ` Vincent Guittot
  0 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2019-08-29 16:00 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Kernel Team, Paul Turner, Dietmar Eggemann,
	Peter Zijlstra, Ingo Molnar, Morten Rasmussen, Thomas Gleixner,
	Mel Gorman

[-- Attachment #1: Type: text/plain, Size: 1449 bytes --]

On Thu, 2019-08-29 at 16:02 +0200, Vincent Guittot wrote:
> On Thu, 29 Aug 2019 at 01:19, Rik van Riel <riel@surriel.com> wrote:
> 
> > What am I overlooking?
> 
> My point is more about tasks that run several ticks in a row. Their
> sched_slice will be shorter in some cases with your changes, so they
> can be preempted earlier by other runnable tasks with a lower
> vruntime, and there will be more context switches.

I can think of exactly one case where the time slice
will be shorter with my new code than with the old code,
and that is the case where:
- A CPU has nr_running > sched_nr_latency
- __sched_period returns a value larger than sysctl_sched_latency
- one of the tasks is much higher priority than the others
- that one task alone gets a timeslice larger than sysctl_sched_latency

With the new code, that high priority task will get a time
slice that is a (large) fraction of sysctl_sched_latency,
while the other (lower priority) tasks get their time slices
rounded up to sysctl_sched_min_granularity.
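
Rough numbers for that case (assuming the unscaled defaults of 6 ms
sysctl_sched_latency, 0.75 ms sysctl_sched_min_granularity and
sched_nr_latency of 8, plus the standard weight table; the real values
are scaled up with the number of CPUs): one nice -20 task (weight
88761) sharing a CPU with 15 nice +19 tasks (weight 15 each, 88986
total weight) works out to

	old: __sched_period(16) = 16 * 0.75 ms = 12 ms
	     nice -20 slice  = 12 ms * 88761 / 88986  ~= 12 ms
	new: nice -20 slice  =  6 ms * 88761 / 88986  ~=  6 ms
	     nice +19 slice  = max(6 ms * 15 / 88986, 0.75 ms) = 0.75 ms

so that one high priority task sees its slice roughly halved, while
the low priority slices get rounded up.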

When tasks get their timeslice rounded up, that will increase
the total sched period in a similar way the old code did by
returning a longer period from __sched_period.

If a CPU is faced with a large number of equal priority tasks,
both the old code and the new code would end up giving each
task a timeslice length of sysctl_sched_min_granularity.

What am I missing?

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 13/15] sched,fair: propagate sum_exec_runtime up the hierarchy
  2019-08-28 13:14     ` Rik van Riel
@ 2019-08-29 17:20       ` Dietmar Eggemann
  2019-08-29 18:06         ` Rik van Riel
  0 siblings, 1 reply; 47+ messages in thread
From: Dietmar Eggemann @ 2019-08-29 17:20 UTC (permalink / raw)
  To: Rik van Riel, linux-kernel
  Cc: kernel-team, pjt, peterz, mingo, morten.rasmussen, tglx, mgorman,
	vincent.guittot

On 28/08/2019 15:14, Rik van Riel wrote:
> On Wed, 2019-08-28 at 09:51 +0200, Dietmar Eggemann wrote:
>> On 22/08/2019 04:17, Rik van Riel wrote:
>>> Now that enqueue_task_fair and dequeue_task_fair no longer iterate
>>> up
>>> the hierarchy all the time, a method to lazily propagate
>>> sum_exec_runtime
>>> up the hierarchy is necessary.
>>>
>>> Once a tick, propagate the newly accumulated exec_runtime up the
>>> hierarchy,
>>> and feed it into CFS bandwidth control.
>>>
>>> Remove the pointless call to account_cfs_rq_runtime from
>>> update_curr,
>>> which is always called with a root cfs_rq.
>>
>> But what about the call to account_cfs_rq_runtime() in
>> set_curr_task_fair()? Here you always call it with the root cfs_rq.
>> Shouldn't this also be called in a loop over all se's until
>> !se->parent (like in propagate_exec_runtime() further below)?
> 
> I believe that call should be only on the cgroup
> cfs_rq, with account_cfs_rq_runtime figuring out
> whether more runtime needs to be obtained from
> further up in the hierarchy.

So like this?

@@ -10248,7 +10248,8 @@ static void set_curr_task_fair(struct rq *rq)

        set_next_entity(cfs_rq, se);
        /* ensure bandwidth has been allocated on our new cfs_rq */
-       account_cfs_rq_runtime(cfs_rq, 0);
+       if (task_se_in_cgroup(se))
+               account_cfs_rq_runtime(group_cfs_rq_of_parent(se), 0);
 }

I fail to understand the second part of your sentence, and how this is
related to the code in propagate_exec_runtime():

for_each_sched_entity(se) {

    propagate_exec_runtime() {

        if (parent)
            account_cfs_rq_runtime(cfs_rq, diff);
    }
}

> By default we should probably work under the assumption
> that account_cfs_rq_runtime() will succeed at the current
> level, and no gymnastics are required to obtain CPU time.

Maybe this all will become clearer when the reworked CFS Bandwidth
support is ready ;-) I see this patch as the first part of it.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 13/15] sched,fair: propagate sum_exec_runtime up the hierarchy
  2019-08-29 17:20       ` Dietmar Eggemann
@ 2019-08-29 18:06         ` Rik van Riel
  0 siblings, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2019-08-29 18:06 UTC (permalink / raw)
  To: Dietmar Eggemann, linux-kernel
  Cc: kernel-team, pjt, peterz, mingo, morten.rasmussen, tglx, mgorman,
	vincent.guittot

[-- Attachment #1: Type: text/plain, Size: 2789 bytes --]

On Thu, 2019-08-29 at 19:20 +0200, Dietmar Eggemann wrote:
> On 28/08/2019 15:14, Rik van Riel wrote:
> > On Wed, 2019-08-28 at 09:51 +0200, Dietmar Eggemann wrote:
> > > On 22/08/2019 04:17, Rik van Riel wrote:
> > > > Now that enqueue_task_fair and dequeue_task_fair no longer
> > > > iterate
> > > > up
> > > > the hierarchy all the time, a method to lazily propagate
> > > > sum_exec_runtime
> > > > up the hierarchy is necessary.
> > > > 
> > > > Once a tick, propagate the newly accumulated exec_runtime up
> > > > the
> > > > hierarchy,
> > > > and feed it into CFS bandwidth control.
> > > > 
> > > > Remove the pointless call to account_cfs_rq_runtime from
> > > > update_curr,
> > > > which is always called with a root cfs_rq.
> > > 
> > > But what about the call to account_cfs_rq_runtime() in
> > > set_curr_task_fair()? Here you always call it with the root
> > > cfs_rq.
> > > Shouldn't this also be called in a loop over all se's until
> > > !se->parent (like in propagate_exec_runtime() further below)?
> > 
> > I believe that call should be only on the cgroup
> > cfs_rq, with account_cfs_rq_runtime figuring out
> > whether more runtime needs to be obtained from
> > further up in the hierarchy.
> 
> So like this?
> 
> @@ -10248,7 +10248,8 @@ static void set_curr_task_fair(struct rq *rq)
> 
>         set_next_entity(cfs_rq, se);
>         /* ensure bandwidth has been allocated on our new cfs_rq */
> -       account_cfs_rq_runtime(cfs_rq, 0);
> +       if (task_se_in_cgroup(se))
> +               account_cfs_rq_runtime(group_cfs_rq_of_parent(se),
> 0);
>  }
> 
> I fail to understand the second part of your sentence, and how this is
> related to the code in propagate_exec_runtime():
> 
> for_each_sched_entity(se) {
> 
>     propagate_exec_runtime() {
> 
>         if (parent)
>             account_cfs_rq_runtime(cfs_rq, diff);
>     }
> }

I am not sure how that would work for distributing
runtime, since runtime would have to be distributed
downwards and on demand, no?

That seems like a very different code path than
"upwards, and periodically".

Then again, I have not worked out all the details
of reimplementing CFS bandwidth yet...

> > By default we should probably work under the assumption
> > that account_cfs_rq_runtime() will succeed at the current
> > level, and no gymnastics are required to obtain CPU time.
> 
> Maybe this all will become clearer when the reworked CFS Bandwidth
> support is ready ;-) I see this patch as the first part of it.

That is one of the reasons I have not been "fixing"
CFS bandwidth related code in the current patch series.

Having all of those changes in one location seems like
it would be best.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 08/15] sched,fair: simplify timeslice length code
  2019-08-29 16:00         ` Rik van Riel
@ 2019-08-30  6:41           ` Vincent Guittot
  2019-08-30 15:01             ` Rik van Riel
  0 siblings, 1 reply; 47+ messages in thread
From: Vincent Guittot @ 2019-08-30  6:41 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Kernel Team, Paul Turner, Dietmar Eggemann,
	Peter Zijlstra, Ingo Molnar, Morten Rasmussen, Thomas Gleixner,
	Mel Gorman

On Thu, 29 Aug 2019 at 18:00, Rik van Riel <riel@surriel.com> wrote:
>
> On Thu, 2019-08-29 at 16:02 +0200, Vincent Guittot wrote:
> > On Thu, 29 Aug 2019 at 01:19, Rik van Riel <riel@surriel.com> wrote:
> >
> > > What am I overlooking?
> >
> > My point is more about tasks that run several ticks in a row. Their
> > sched_slice will be shorter in some cases with your changes, so they
> > can be preempted earlier by other runnable tasks with a lower
> > vruntime, and there will be more context switches.
>
> I can think of exactly one case where the time slice
> will be shorter with my new code than with the old code,
> and that is the case where:
> - A CPU has nr_running > sched_nr_latency

Yes, nr_running must be higher than sched_nr_latency.

> - __sched_period returns a value larger than sysctl_sched_latency
> - one of the tasks is much higher priority than the others

It's not only one, there can be several. It depends on the number of
running tasks.

> - that one task alone gets a timeslice larger than sysctl_sched_latency
>
> With the new code, that high priority task will get a time
> slice that is a (large) fraction of sysctl_sched_latency,

yes

> while the other (lower priority) tasks get their time slices
> rounded up to sysctl_sched_min_granularity.

Yes, and if the jiffy period is higher than
sysctl_sched_min_granularity, they will get a full jiffy period.

>
> When tasks get their timeslice rounded up, that will increase
> the total sched period in a similar way the old code did by
> returning a longer period from __sched_period.

sched_slice is not a strict value and the scheduler will not schedule
out the task after the sched_slice (unless you enable HRTICK, which is
disabled by default). Instead it will wait for the next tick to change
the running task.

sched_slice is mainly used to ensure a minimum running time in a row.
With this change, the running time of the high priority task will most
probably be split into several slices instead of one.

>
> If a CPU is faced with a large number of equal priority tasks,
> both the old code and the new code would end up giving each
> task a timeslice length of sysctl_sched_min_granularity.
>
> What am I missing?
>
> --
> All Rights Reversed.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 08/15] sched,fair: simplify timeslice length code
  2019-08-30  6:41           ` Vincent Guittot
@ 2019-08-30 15:01             ` Rik van Riel
  2019-09-02  7:51               ` Vincent Guittot
  0 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2019-08-30 15:01 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Kernel Team, Paul Turner, Dietmar Eggemann,
	Peter Zijlstra, Ingo Molnar, Morten Rasmussen, Thomas Gleixner,
	Mel Gorman

[-- Attachment #1: Type: text/plain, Size: 798 bytes --]

On Fri, 2019-08-30 at 08:41 +0200, Vincent Guittot wrote:

> > When tasks get their timeslice rounded up, that will increase
> > the total sched period in a similar way the old code did by
> > returning a longer period from __sched_period.
> 
> sched_slice is not a strict value and the scheduler will not schedule
> out the task after the sched_slice (unless you enable HRTICK, which is
> disabled by default). Instead it will wait for the next tick to change
> the running task.
> 
> sched_slice is mainly used to ensure a minimum running time in a row.
> With this change, the running time of the high priority task will most
> probably be split into several slices instead of one.

I would be more than happy to drop this patch if you
prefer. Just let me know.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 08/15] sched,fair: simplify timeslice length code
  2019-08-30 15:01             ` Rik van Riel
@ 2019-09-02  7:51               ` Vincent Guittot
  2019-09-02 17:47                 ` Rik van Riel
  0 siblings, 1 reply; 47+ messages in thread
From: Vincent Guittot @ 2019-09-02  7:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Kernel Team, Paul Turner, Dietmar Eggemann,
	Peter Zijlstra, Ingo Molnar, Morten Rasmussen, Thomas Gleixner,
	Mel Gorman

On Fri, 30 Aug 2019 at 17:02, Rik van Riel <riel@surriel.com> wrote:
>
> On Fri, 2019-08-30 at 08:41 +0200, Vincent Guittot wrote:
>
> > > When tasks get their timeslice rounded up, that will increase
> > > the total sched period in a similar way the old code did by
> > > returning a longer period from __sched_period.
> >
> > sched_slice is not a strict value and the scheduler will not schedule
> > out the task after the sched_slice (unless you enable HRTICK, which is
> > disabled by default). Instead it will wait for the next tick to change
> > the running task.
> >
> > sched_slice is mainly used to ensure a minimum running time in a row.
> > With this change, the running time of the high priority task will most
> > probably be split into several slices instead of one.
>
> I would be more than happy to drop this patch if you
> prefer. Just let me know.

If I'm not wrong, this change is not mandatory for flattening the
runqueues, and because of the possible impact I would prefer that you
drop it from this series.

Regards,
Vincent

>
> --
> All Rights Reversed.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC v4 0/15] sched,fair: flatten CPU controller runqueues
  2019-08-22  2:17 [PATCH RFC v4 0/15] sched,fair: flatten CPU controller runqueues Rik van Riel
                   ` (14 preceding siblings ...)
  2019-08-22  2:17 ` [PATCH 15/15] sched,fair: scale vdiff in wakeup_preempt_entity Rik van Riel
@ 2019-09-02 10:53 ` Dietmar Eggemann
  2019-09-03  1:44   ` Rik van Riel
  15 siblings, 1 reply; 47+ messages in thread
From: Dietmar Eggemann @ 2019-09-02 10:53 UTC (permalink / raw)
  To: Rik van Riel, linux-kernel
  Cc: kernel-team, pjt, peterz, mingo, morten.rasmussen, tglx, mgorman,
	vincent.guittot

On 22/08/2019 04:17, Rik van Riel wrote:
> The current implementation of the CPU controller uses hierarchical
> runqueues, where on wakeup a task is enqueued on its group's runqueue,
> the group is enqueued on the runqueue of the group above it, etc.
> 
> This increases a fairly large amount of overhead for workloads that
> do a lot of wakeups a second, especially given that the default systemd
> hierarchy is 2 or 3 levels deep.
> 
> This patch series is an attempt at reducing that overhead, by placing
> all the tasks on the same runqueue, and scaling the task priority by
> the priority of the group, which is calculated periodically.
> 
> My main TODO items for the next period of time are likely going to
> be testing, testing, and testing. I hope to find and flush out any
> corner case I can find, and make sure performance does not regress
> with any workloads, and hopefully improves some.

I did some testing with a small & simple rt-app based test-case:

2 CPUs (rq->cpu_capacity_orig=1024), CPUfreq performance governor

2 taskgroups /tg0 and /tg1

6 CFS tasks (periodic, 8/16ms (runtime/period))

/tg0 (cpu.shares=1024) ran 4 tasks and /tg1 (cpu.shares=1024) ran 2 tasks

(arm64 defconfig with !CONFIG_NUMA_BALANCING, !CONFIG_SCHED_AUTOGROUP)

---

v5.2:

The 2 /tg1 tasks ran 8/16ms. The 4 /tg0 tasks ran 4/16ms in the
beginning and then 8/16ms after the 2 /tg1 tasks finished.

---

v5.2 + v4:

There is no runtime/period pattern visible anymore. I see a lot of extra
wakeup latency for those tasks though.

v5.2 + (v4 without 07/15, 08/15, 15/15) didn't change much.

---

I could try to reduce the stack even further (e.g. without 13/15).

IMHO it's a good idea to have a set of these small & simple test-cases
handy to verify that the base functionality is still in place. This
might be hard to achieve with benchmarks.
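
(Not the actual rt-app configuration; as a rough stand-in, each of the
periodic 8/16ms tasks behaves more or less like the loop below, except
that rt-app calibrates the busy work in CPU time rather than wall-clock
time. The tasks are then placed into /tg0 and /tg1 via the cpu
controller's cgroup interface.)

#define _GNU_SOURCE
#include <stdint.h>
#include <time.h>

static int64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
	const int64_t runtime = 8 * 1000000LL;	/* 8ms of busy work */
	const int64_t period = 16 * 1000000LL;	/* 16ms period */
	struct timespec next;

	clock_gettime(CLOCK_MONOTONIC, &next);

	for (;;) {
		int64_t start = now_ns();

		while (now_ns() - start < runtime)
			;	/* burn ~8ms (wall clock, see above) */

		/* sleep until the next absolute 16ms boundary */
		next.tv_nsec += period;
		while (next.tv_nsec >= 1000000000L) {
			next.tv_nsec -= 1000000000L;
			next.tv_sec++;
		}
		clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
	}
}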

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 08/15] sched,fair: simplify timeslice length code
  2019-09-02  7:51               ` Vincent Guittot
@ 2019-09-02 17:47                 ` Rik van Riel
  0 siblings, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2019-09-02 17:47 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Kernel Team, Paul Turner, Dietmar Eggemann,
	Peter Zijlstra, Ingo Molnar, Morten Rasmussen, Thomas Gleixner,
	Mel Gorman

[-- Attachment #1: Type: text/plain, Size: 434 bytes --]

On Mon, 2019-09-02 at 09:51 +0200, Vincent Guittot wrote:
> On Fri, 30 Aug 2019 at 17:02, Rik van Riel <riel@surriel.com> wrote:

> > I would be more than happy to drop this patch if you
> > prefer. Just let me know.
> 
> If I'm not wrong, this change is not mandatory for flattening the
> runqueues, and because of the possible impact I would prefer that you
> drop it from this series.

OK, will do.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC v4 0/15] sched,fair: flatten CPU controller runqueues
  2019-09-02 10:53 ` [PATCH RFC v4 0/15] sched,fair: flatten CPU controller runqueues Dietmar Eggemann
@ 2019-09-03  1:44   ` Rik van Riel
  0 siblings, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2019-09-03  1:44 UTC (permalink / raw)
  To: Dietmar Eggemann, linux-kernel
  Cc: kernel-team, pjt, peterz, mingo, morten.rasmussen, tglx, mgorman,
	vincent.guittot

[-- Attachment #1: Type: text/plain, Size: 1973 bytes --]

On Mon, 2019-09-02 at 12:53 +0200, Dietmar Eggemann wrote:
> On 22/08/2019 04:17, Rik van Riel wrote:

> > My main TODO items for the next period of time are likely going to
> > be testing, testing, and testing. I hope to find and flush out any
> > corner case I can find, and make sure performance does not regress
> > with any workloads, and hopefully improves some.
> 
> I did some testing with a small & simple rt-app based test-case:
> 
> 2 CPUs (rq->cpu_capacity_orig=1024), CPUfreq performance governor
> 
> 2 taskgroups /tg0 and /tg1
> 
> 6 CFS tasks (periodic, 8/16ms (runtime/period))
> 
> /tg0 (cpu.shares=1024) ran 4 tasks and /tg1 (cpu.shares=1024) ran 2
> tasks
> 
> (arm64 defconfig with !CONFIG_NUMA_BALANCING,
> !CONFIG_SCHED_AUTOGROUP)
> 
> ---
> 
> v5.2:
> 
> The 2 /tg1 tasks ran 8/16ms. The 4 /tg0 tasks ran 4/16ms in the
> beginning and then 8/16ms after the 2 /tg1 tasks finished.
> 
> ---
> 
> v5.2 + v4:
> 
> There is no runtime/period pattern visible anymore. I see a lot of
> extra
> wakeup latency for those tasks though.
> 
> v5.2 + (v4 without 07/15, 08/15, 15/15) didn't change much.

One thing to keep in mind is that with the hierarchical
CPU controller code, you are always either comparing
tg0 and tg1 (of equal priority), or tasks of equal priority,
once the load balancer has equalized the load between CPUs.

With the flat CPU controller, the preemption code is comparing
sched_entities with other sched_entities that have 2x the
priority, similar to a nice level 0 entity compared against a
nice level ~3 task.
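
A rough illustration of that 2x figure, assuming the flattened task
weights are simply each group's shares split over its runnable tasks
(ignoring load tracking details); the nice weights are the ones from
the kernel's sched_prio_to_weight[] table:

#include <stdio.h>

int main(void)
{
	int shares = 1024;
	int tg0_task = shares / 4;	/* 4 runnable tasks in /tg0 -> 256 */
	int tg1_task = shares / 2;	/* 2 runnable tasks in /tg1 -> 512 */

	/* nice 0 and nice 3 weights from sched_prio_to_weight[] */
	int nice0 = 1024, nice3 = 526;

	printf("tg1 task / tg0 task weight ratio: %.2f\n",
	       (double)tg1_task / tg0_task);		/* 2.00 */
	printf("nice 0 / nice 3 weight ratio:     %.2f\n",
	       (double)nice0 / nice3);			/* ~1.95 */
	return 0;
}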

I do not know whether the code has ever given a predictable
scheduling pattern when the CPU is fully loaded with a mix of
different priority tasks that each want a 50% duty cycle.

But maybe it has, and I need to look into that :)

Figuring out exactly what the preemption code should do might
be a good discussion for Plumbers, too.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/15] sched,fair: refactor enqueue/dequeue_entity
  2019-08-22  2:17 ` [PATCH 09/15] sched,fair: refactor enqueue/dequeue_entity Rik van Riel
@ 2019-09-03 15:38   ` Vincent Guittot
  2019-09-03 20:27     ` Rik van Riel
  0 siblings, 1 reply; 47+ messages in thread
From: Vincent Guittot @ 2019-09-03 15:38 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Kernel Team, Paul Turner, Dietmar Eggemann,
	Peter Zijlstra, Ingo Molnar, Morten Rasmussen, Thomas Gleixner,
	Mel Gorman

Hi Rik,

On Thu, 22 Aug 2019 at 04:18, Rik van Riel <riel@surriel.com> wrote:
>
> Refactor enqueue_entity, dequeue_entity, and update_load_avg, in order
> to split out the things we still want to happen at every level in the
> cgroup hierarchy with a flat runqueue from the things we only need to
> happen once.
>
> No functional changes.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  kernel/sched/fair.c | 64 +++++++++++++++++++++++++++++----------------
>  1 file changed, 42 insertions(+), 22 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 74ee22c59d13..7b0d95f2e3a8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3502,7 +3502,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>  #define DO_ATTACH      0x4
>
>  /* Update task and its cfs_rq load average */
> -static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> +static inline bool update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  {
>         u64 now = cfs_rq_clock_pelt(cfs_rq);
>         int decayed;
> @@ -3531,6 +3531,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>
>         } else if (decayed && (flags & UPDATE_TG))
>                 update_tg_load_avg(cfs_rq, 0);
> +
> +       return decayed;

This is a functional change, isn't it?
update_cfs_group() is now called only if decayed, but we can attach a
task during the enqueue when there is no decay.
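
(To illustrate: this is roughly the tail of mainline update_load_avg()
around v5.2, with the patch's "return decayed" appended; not an exact
copy of either.)

	if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
		/*
		 * Enqueue of a freshly attached/migrated task: its load is
		 * attached to the cfs_rq here, which update_cfs_group()
		 * would normally pick up, yet 'decayed' can still be false.
		 */
		attach_entity_load_avg(cfs_rq, se, SCHED_CPUFREQ_MIGRATION);
		update_tg_load_avg(cfs_rq, 0);
	} else if (decayed && (flags & UPDATE_TG))
		update_tg_load_avg(cfs_rq, 0);

	/* can be false in the attach case above, so the new
	 * enqueue_entity_groups() would skip update_cfs_group()
	 * even though the load just changed */
	return decayed;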

>
>  }
>
>  #ifndef CONFIG_64BIT
> @@ -3747,9 +3749,10 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
>  #define SKIP_AGE_LOAD  0x0
>  #define DO_ATTACH      0x0
>
> -static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
> +static inline bool update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
>  {
>         cfs_rq_util_change(cfs_rq, 0);
> +       return false;
>  }
>
>  static inline void remove_entity_load_avg(struct sched_entity *se) {}
> @@ -3872,6 +3875,24 @@ static inline void check_schedstat_required(void)
>   * CPU and an up-to-date min_vruntime on the destination CPU.
>   */
>
> +static bool
> +enqueue_entity_groups(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> +{
> +       /*
> +        * When enqueuing a sched_entity, we must:
> +        *   - Update loads to have both entity and cfs_rq synced with now.
> +        *   - Add its load to cfs_rq->runnable_avg
> +        *   - For group_entity, update its weight to reflect the new share of
> +        *     its group cfs_rq
> +        *   - Add its new weight to cfs_rq->load.weight
> +        */
> +       if (!update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH))
> +               return false;
> +
> +       update_cfs_group(se);
> +       return true;
> +}
> +
>  static void
>  enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  {
> @@ -3896,16 +3917,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>         if (renorm && !curr)
>                 se->vruntime += cfs_rq->min_vruntime;
>
> -       /*
> -        * When enqueuing a sched_entity, we must:
> -        *   - Update loads to have both entity and cfs_rq synced with now.
> -        *   - Add its load to cfs_rq->runnable_avg
> -        *   - For group_entity, update its weight to reflect the new share of
> -        *     its group cfs_rq
> -        *   - Add its new weight to cfs_rq->load.weight
> -        */
> -       update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
> -       update_cfs_group(se);
>         enqueue_runnable_load_avg(cfs_rq, se);
>         account_entity_enqueue(cfs_rq, se);
>
> @@ -3972,14 +3983,9 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
>
>  static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
>
> -static void
> -dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> +static bool
> +dequeue_entity_groups(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  {
> -       /*
> -        * Update run-time statistics of the 'current'.
> -        */
> -       update_curr(cfs_rq);
> -
>         /*
>          * When dequeuing a sched_entity, we must:
>          *   - Update loads to have both entity and cfs_rq synced with now.
> @@ -3988,7 +3994,21 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>          *   - For group entity, update its weight to reflect the new share
>          *     of its group cfs_rq.
>          */
> -       update_load_avg(cfs_rq, se, UPDATE_TG);
> +       if (!update_load_avg(cfs_rq, se, UPDATE_TG))
> +               return false;
> +       update_cfs_group(se);
> +
> +       return true;
> +}
> +
> +static void
> +dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> +{
> +       /*
> +        * Update run-time statistics of the 'current'.
> +        */
> +       update_curr(cfs_rq);
> +
>         dequeue_runnable_load_avg(cfs_rq, se);
>
>         update_stats_dequeue(cfs_rq, se, flags);
> @@ -4012,8 +4032,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>         /* return excess runtime on last dequeue */
>         return_cfs_rq_runtime(cfs_rq);
>
> -       update_cfs_group(se);
> -
>         /*
>          * Now advance min_vruntime if @se was the entity holding it back,
>          * except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be
> @@ -5178,6 +5196,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>                 if (se->on_rq)
>                         break;
>                 cfs_rq = cfs_rq_of(se);
> +               enqueue_entity_groups(cfs_rq, se, flags);
>                 enqueue_entity(cfs_rq, se, flags);
>
>                 /*
> @@ -5260,6 +5279,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>
>         for_each_sched_entity(se) {
>                 cfs_rq = cfs_rq_of(se);
> +               dequeue_entity_groups(cfs_rq, se, flags);
>                 dequeue_entity(cfs_rq, se, flags);
>
>                 /*
> --
> 2.20.1
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/15] sched,fair: refactor enqueue/dequeue_entity
  2019-09-03 15:38   ` Vincent Guittot
@ 2019-09-03 20:27     ` Rik van Riel
  2019-09-04  6:44       ` Vincent Guittot
  0 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2019-09-03 20:27 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Kernel Team, Paul Turner, Dietmar Eggemann,
	Peter Zijlstra, Ingo Molnar, Morten Rasmussen, Thomas Gleixner,
	Mel Gorman

[-- Attachment #1: Type: text/plain, Size: 1984 bytes --]

On Tue, 2019-09-03 at 17:38 +0200, Vincent Guittot wrote:
> Hi Rik,
> 
> On Thu, 22 Aug 2019 at 04:18, Rik van Riel <riel@surriel.com> wrote:
> > Refactor enqueue_entity, dequeue_entity, and update_load_avg, in
> > order
> > to split out the things we still want to happen at every level in
> > the
> > cgroup hierarchy with a flat runqueue from the things we only need
> > to
> > happen once.
> > 
> > No functional changes.
> > 
> > Signed-off-by: Rik van Riel <riel@surriel.com>
> > ---
> >  kernel/sched/fair.c | 64 +++++++++++++++++++++++++++++----------
> > ------
> >  1 file changed, 42 insertions(+), 22 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 74ee22c59d13..7b0d95f2e3a8 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3502,7 +3502,7 @@ static void detach_entity_load_avg(struct
> > cfs_rq *cfs_rq, struct sched_entity *s
> >  #define DO_ATTACH      0x4
> > 
> >  /* Update task and its cfs_rq load average */
> > -static inline void update_load_avg(struct cfs_rq *cfs_rq, struct
> > sched_entity *se, int flags)
> > +static inline bool update_load_avg(struct cfs_rq *cfs_rq, struct
> > sched_entity *se, int flags)
> >  {
> >         u64 now = cfs_rq_clock_pelt(cfs_rq);
> >         int decayed;
> > @@ -3531,6 +3531,8 @@ static inline void update_load_avg(struct
> > cfs_rq *cfs_rq, struct sched_entity *s
> > 
> >         } else if (decayed && (flags & UPDATE_TG))
> >                 update_tg_load_avg(cfs_rq, 0);
> > +
> > +       return decayed;
> 
> This is a functional change, isn't it?
> update_cfs_group() is now called only if decayed, but we can attach a
> task during the enqueue when there is no decay.

Yes, it is, and patch 11 changes the way this functional
change is done.

If you want, I can change this patch to not have the
functional change, though in the end it should not make
any difference.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 09/15] sched,fair: refactor enqueue/dequeue_entity
  2019-09-03 20:27     ` Rik van Riel
@ 2019-09-04  6:44       ` Vincent Guittot
  0 siblings, 0 replies; 47+ messages in thread
From: Vincent Guittot @ 2019-09-04  6:44 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Kernel Team, Paul Turner, Dietmar Eggemann,
	Peter Zijlstra, Ingo Molnar, Morten Rasmussen, Thomas Gleixner,
	Mel Gorman

On Tue, 3 Sep 2019 at 22:27, Rik van Riel <riel@surriel.com> wrote:
>
> On Tue, 2019-09-03 at 17:38 +0200, Vincent Guittot wrote:
> > Hi Rik,
> >
> > On Thu, 22 Aug 2019 at 04:18, Rik van Riel <riel@surriel.com> wrote:
> > > Refactor enqueue_entity, dequeue_entity, and update_load_avg, in
> > > order
> > > to split out the things we still want to happen at every level in
> > > the
> > > cgroup hierarchy with a flat runqueue from the things we only need
> > > to
> > > happen once.
> > >
> > > No functional changes.
> > >
> > > Signed-off-by: Rik van Riel <riel@surriel.com>
> > > ---
> > >  kernel/sched/fair.c | 64 +++++++++++++++++++++++++++++----------
> > > ------
> > >  1 file changed, 42 insertions(+), 22 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 74ee22c59d13..7b0d95f2e3a8 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -3502,7 +3502,7 @@ static void detach_entity_load_avg(struct
> > > cfs_rq *cfs_rq, struct sched_entity *s
> > >  #define DO_ATTACH      0x4
> > >
> > >  /* Update task and its cfs_rq load average */
> > > -static inline void update_load_avg(struct cfs_rq *cfs_rq, struct
> > > sched_entity *se, int flags)
> > > +static inline bool update_load_avg(struct cfs_rq *cfs_rq, struct
> > > sched_entity *se, int flags)
> > >  {
> > >         u64 now = cfs_rq_clock_pelt(cfs_rq);
> > >         int decayed;
> > > @@ -3531,6 +3531,8 @@ static inline void update_load_avg(struct
> > > cfs_rq *cfs_rq, struct sched_entity *s
> > >
> > >         } else if (decayed && (flags & UPDATE_TG))
> > >                 update_tg_load_avg(cfs_rq, 0);
> > > +
> > > +       return decayed;
> >
> > This is a functional change, isn't it?
> > update_cfs_group() is now called only if decayed, but we can attach a
> > task during the enqueue when there is no decay.
>
> Yes, it is, and patch 11 changes the way this functional
> change is done.
>
> If you want, I can change this patch to not have the
> functional change, though in the end it should not make
> any difference.

It's mainly that the code is not aligned with the commit message.

I will continue to review the other patches and will come back to this
one after reviewing patch 11.

Thanks
Vincent
>
> --
> All Rights Reversed.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 06/15] sched,cfs: use explicit cfs_rq of parent se helper
  2019-09-06 19:12 [PATCH RFC v5 " Rik van Riel
@ 2019-09-06 19:12 ` Rik van Riel
  0 siblings, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2019-09-06 19:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, pjt, dietmar.eggemann, peterz, mingo,
	morten.rasmussen, tglx, mgorman, vincent.guittot, Rik van Riel

Use an explicit "cfs_rq of parent sched_entity" helper in a few
strategic places, where cfs_rq_of(se) may no longer point at the
right runqueue once we flatten the hierarchical cgroup runqueues.

No functional change.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 kernel/sched/fair.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1c1593611291..dfc116cb8750 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -276,6 +276,15 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
 	return grp->my_q;
 }
 
+/* runqueue owned by the parent entity; the root cfs_rq for a top level se */
+static inline struct cfs_rq *group_cfs_rq_of_parent(struct sched_entity *se)
+{
+	if (se->parent)
+		return group_cfs_rq(se->parent);
+
+	return cfs_rq_of(se);
+}
+
 static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
@@ -3319,7 +3328,7 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
 
 	gcfs_rq->propagate = 0;
 
-	cfs_rq = cfs_rq_of(se);
+	cfs_rq = group_cfs_rq_of_parent(se);
 
 	add_tg_cfs_propagate(cfs_rq, gcfs_rq->prop_runnable_sum);
 
@@ -7796,7 +7805,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 
 	WRITE_ONCE(cfs_rq->h_load_next, NULL);
 	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
+		cfs_rq = group_cfs_rq_of_parent(se);
 		WRITE_ONCE(cfs_rq->h_load_next, se);
 		if (cfs_rq->last_h_load_update == now)
 			break;
@@ -7819,7 +7828,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 
 static unsigned long task_se_h_load(struct sched_entity *se)
 {
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	struct cfs_rq *cfs_rq = group_cfs_rq_of_parent(se);
 
 	update_cfs_rq_h_load(cfs_rq);
 	return div64_ul(se->avg.load_avg * cfs_rq->h_load,
@@ -10166,7 +10175,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	struct sched_entity *se = &curr->se;
 
 	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
+		cfs_rq = group_cfs_rq_of_parent(se);
 		entity_tick(cfs_rq, se, queued);
 	}
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2019-09-06 19:14 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-22  2:17 [PATCH RFC v4 0/15] sched,fair: flatten CPU controller runqueues Rik van Riel
2019-08-22  2:17 ` [PATCH 01/15] sched: introduce task_se_h_load helper Rik van Riel
2019-08-23 18:13   ` Dietmar Eggemann
2019-08-24  0:05     ` Rik van Riel
2019-08-22  2:17 ` [PATCH 02/15] sched: change /proc/sched_debug fields Rik van Riel
2019-08-22  2:17 ` [PATCH 03/15] sched,fair: redefine runnable_load_avg as the sum of task_h_load Rik van Riel
2019-08-28 13:50   ` Vincent Guittot
2019-08-28 14:47     ` Rik van Riel
2019-08-28 15:02       ` Vincent Guittot
2019-08-22  2:17 ` [PATCH 04/15] sched,fair: move runnable_load_avg to cfs_rq Rik van Riel
2019-08-22  2:17 ` [PATCH 05/15] sched,fair: remove cfs_rqs from leaf_cfs_rq_list bottom up Rik van Riel
2019-08-28 14:09   ` Vincent Guittot
2019-08-22  2:17 ` [PATCH 06/15] sched,cfs: use explicit cfs_rq of parent se helper Rik van Riel
2019-08-28 13:53   ` Vincent Guittot
2019-08-28 15:28     ` Rik van Riel
2019-08-28 16:34       ` Vincent Guittot
2019-08-22  2:17 ` [PATCH 07/15] sched,cfs: fix zero length timeslice calculation Rik van Riel
2019-08-28 16:59   ` Vincent Guittot
2019-08-22  2:17 ` [PATCH 08/15] sched,fair: simplify timeslice length code Rik van Riel
2019-08-28 17:32   ` Vincent Guittot
2019-08-28 23:18     ` Rik van Riel
2019-08-29 14:02       ` Vincent Guittot
2019-08-29 16:00         ` Rik van Riel
2019-08-30  6:41           ` Vincent Guittot
2019-08-30 15:01             ` Rik van Riel
2019-09-02  7:51               ` Vincent Guittot
2019-09-02 17:47                 ` Rik van Riel
2019-08-22  2:17 ` [PATCH 09/15] sched,fair: refactor enqueue/dequeue_entity Rik van Riel
2019-09-03 15:38   ` Vincent Guittot
2019-09-03 20:27     ` Rik van Riel
2019-09-04  6:44       ` Vincent Guittot
2019-08-22  2:17 ` [PATCH 10/15] sched,fair: add helper functions for flattened runqueue Rik van Riel
2019-08-22  2:17 ` [PATCH 11/15] sched,fair: flatten hierarchical runqueues Rik van Riel
2019-08-23 18:14   ` Dietmar Eggemann
2019-08-24  1:16     ` Rik van Riel
2019-08-22  2:17 ` [PATCH 12/15] sched,fair: flatten update_curr functionality Rik van Riel
2019-08-27 10:37   ` Dietmar Eggemann
2019-08-22  2:17 ` [PATCH 13/15] sched,fair: propagate sum_exec_runtime up the hierarchy Rik van Riel
2019-08-28  7:51   ` Dietmar Eggemann
2019-08-28 13:14     ` Rik van Riel
2019-08-29 17:20       ` Dietmar Eggemann
2019-08-29 18:06         ` Rik van Riel
2019-08-22  2:17 ` [PATCH 14/15] sched,fair: ramp up task_se_h_weight quickly Rik van Riel
2019-08-22  2:17 ` [PATCH 15/15] sched,fair: scale vdiff in wakeup_preempt_entity Rik van Riel
2019-09-02 10:53 ` [PATCH RFC v4 0/15] sched,fair: flatten CPU controller runqueues Dietmar Eggemann
2019-09-03  1:44   ` Rik van Riel
2019-09-06 19:12 [PATCH RFC v5 " Rik van Riel
2019-09-06 19:12 ` [PATCH 06/15] sched,cfs: use explicit cfs_rq of parent se helper Rik van Riel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).