linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr
@ 2023-05-31 11:58 Peter Zijlstra
  2023-05-31 11:58 ` [PATCH 01/15] sched/fair: Add avg_vruntime Peter Zijlstra
                   ` (15 more replies)
  0 siblings, 16 replies; 104+ messages in thread
From: Peter Zijlstra @ 2023-05-31 11:58 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault, tglx

Hi!

Latest version of the EEVDF [1] patches.

The only real changes since last time are the fix for tick-preemption [2] and a
simple safeguard for the mixed-slice heuristic.

Other than that, I've re-arranged the patches to make EEVDF come first and have
the latency-nice or slice-attribute patches on top.

Results should not be different from last time around; lots of people ran the
patches and found no major performance issues. What was found was better
latency and smaller variance (probably due to the more stable latency).

I'm hoping we can start queueing this part.

The big question is what additional interface to expose. Some people have
voiced objections to the latency-nice interface; the 'obvious' alternative
is to directly expose the slice length as a request/hint.

The very last patch implements this alternative using sched_attr::sched_runtime
but is untested.
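
For reference, a rough userspace sketch of how such a slice request might be
issued, assuming the semantics of that (untested) last patch, i.e.
sched_attr::sched_runtime carrying a slice hint in nanoseconds for a
SCHED_OTHER task:

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

/* local copy of the uapi layout; glibc does not wrap sched_setattr() */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* proposed: slice request, in ns */
	uint64_t sched_deadline;
	uint64_t sched_period;
};

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size	   = sizeof(attr);
	attr.sched_policy  = 0;			/* SCHED_OTHER */
	attr.sched_runtime = 3 * 1000 * 1000;	/* hint: 3ms slice */

	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		perror("sched_setattr");

	return 0;
}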

Diffstat for the base patches [1-11]:

 include/linux/rbtree_augmented.h |   26 +
 include/linux/sched.h            |    7 +-
 kernel/sched/core.c              |    2 +
 kernel/sched/debug.c             |   48 +-
 kernel/sched/fair.c              | 1105 ++++++++++++++++++--------------------
 kernel/sched/features.h          |   24 +-
 kernel/sched/sched.h             |   16 +-
 7 files changed, 587 insertions(+), 641 deletions(-)


[1] https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=805acf7726282721504c8f00575d91ebfd750564

[2] https://lkml.kernel.org/r/20230420150537.GC4253%40hirez.programming.kicks-ass.net


^ permalink raw reply	[flat|nested] 104+ messages in thread

* [PATCH 01/15] sched/fair: Add avg_vruntime
  2023-05-31 11:58 [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr Peter Zijlstra
@ 2023-05-31 11:58 ` Peter Zijlstra
  2023-06-02 13:51   ` Vincent Guittot
                     ` (2 more replies)
  2023-05-31 11:58 ` [PATCH 02/15] sched/fair: Remove START_DEBIT Peter Zijlstra
                   ` (14 subsequent siblings)
  15 siblings, 3 replies; 104+ messages in thread
From: Peter Zijlstra @ 2023-05-31 11:58 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault, tglx

In order to move to an eligibility based scheduling policy, a better
approximation of the ideal scheduler is needed.

Specifically, for a virtual time weighted fair queueing based scheduler the
ideal scheduler's virtual time will be the weighted average of the individual
virtual runtimes (math in the comment).

As such, compute the weighted average to approximate the ideal scheduler --
note that the approximation lies in the individual task behaviour, which
isn't strictly conformant.

Specifically, consider adding a task with a vruntime left of center; in that
case the average will move backwards in time -- something the ideal scheduler
would of course never do.
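
As a toy userspace illustration (not kernel code) of the same computation in
the relative form used by the patch below -- keying each vruntime against
min_vruntime so the products stay small:

#include <stdint.h>
#include <stdio.h>

struct entity {
	unsigned long	weight;
	uint64_t	vruntime;
};

/* V = min_vruntime + \Sum (v_i - min_vruntime) * w_i / \Sum w_i */
static uint64_t toy_avg_vruntime(const struct entity *e, int nr, uint64_t min_vruntime)
{
	int64_t avg = 0;
	long load = 0;
	int i;

	for (i = 0; i < nr; i++) {
		avg  += (int64_t)(e[i].vruntime - min_vruntime) * (long)e[i].weight;
		load += e[i].weight;
	}

	return min_vruntime + (load ? avg / load : 0);
}

int main(void)
{
	struct entity e[] = {
		{ .weight = 1024, .vruntime = 1000 },	/* nice  0 */
		{ .weight = 1024, .vruntime = 1300 },	/* nice  0 */
		{ .weight =  335, .vruntime =  400 },	/* nice +5, left of center */
	};

	printf("V = %llu\n", (unsigned long long)toy_avg_vruntime(e, 3, 400));
	return 0;
}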

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/debug.c |   32 +++++------
 kernel/sched/fair.c  |  137 +++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h |    5 +
 3 files changed, 154 insertions(+), 20 deletions(-)

--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -626,10 +626,9 @@ static void print_rq(struct seq_file *m,
 
 void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 {
-	s64 MIN_vruntime = -1, min_vruntime, max_vruntime = -1,
-		spread, rq0_min_vruntime, spread0;
+	s64 left_vruntime = -1, min_vruntime, right_vruntime = -1, spread;
+	struct sched_entity *last, *first;
 	struct rq *rq = cpu_rq(cpu);
-	struct sched_entity *last;
 	unsigned long flags;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -643,26 +642,25 @@ void print_cfs_rq(struct seq_file *m, in
 			SPLIT_NS(cfs_rq->exec_clock));
 
 	raw_spin_rq_lock_irqsave(rq, flags);
-	if (rb_first_cached(&cfs_rq->tasks_timeline))
-		MIN_vruntime = (__pick_first_entity(cfs_rq))->vruntime;
+	first = __pick_first_entity(cfs_rq);
+	if (first)
+		left_vruntime = first->vruntime;
 	last = __pick_last_entity(cfs_rq);
 	if (last)
-		max_vruntime = last->vruntime;
+		right_vruntime = last->vruntime;
 	min_vruntime = cfs_rq->min_vruntime;
-	rq0_min_vruntime = cpu_rq(0)->cfs.min_vruntime;
 	raw_spin_rq_unlock_irqrestore(rq, flags);
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "MIN_vruntime",
-			SPLIT_NS(MIN_vruntime));
+
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_vruntime",
+			SPLIT_NS(left_vruntime));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "min_vruntime",
 			SPLIT_NS(min_vruntime));
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "max_vruntime",
-			SPLIT_NS(max_vruntime));
-	spread = max_vruntime - MIN_vruntime;
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "spread",
-			SPLIT_NS(spread));
-	spread0 = min_vruntime - rq0_min_vruntime;
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "spread0",
-			SPLIT_NS(spread0));
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "avg_vruntime",
+			SPLIT_NS(avg_vruntime(cfs_rq)));
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "right_vruntime",
+			SPLIT_NS(right_vruntime));
+	spread = right_vruntime - left_vruntime;
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "spread", SPLIT_NS(spread));
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_spread_over",
 			cfs_rq->nr_spread_over);
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -601,9 +601,134 @@ static inline bool entity_before(const s
 	return (s64)(a->vruntime - b->vruntime) < 0;
 }
 
+static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	return (s64)(se->vruntime - cfs_rq->min_vruntime);
+}
+
 #define __node_2_se(node) \
 	rb_entry((node), struct sched_entity, run_node)
 
+/*
+ * Compute virtual time from the per-task service numbers:
+ *
+ * Fair schedulers conserve lag:
+ *
+ *   \Sum lag_i = 0
+ *
+ * Where lag_i is given by:
+ *
+ *   lag_i = S - s_i = w_i * (V - v_i)
+ *
+ * Where S is the ideal service time and V is its virtual time counterpart.
+ * Therefore:
+ *
+ *   \Sum lag_i = 0
+ *   \Sum w_i * (V - v_i) = 0
+ *   \Sum w_i * V - w_i * v_i = 0
+ *
+ * From which we can solve an expression for V in v_i (which we have in
+ * se->vruntime):
+ *
+ *       \Sum v_i * w_i   \Sum v_i * w_i
+ *   V = -------------- = --------------
+ *          \Sum w_i            W
+ *
+ * Specifically, this is the weighted average of all entity virtual runtimes.
+ *
+ * [[ NOTE: this is only equal to the ideal scheduler under the condition
+ *          that join/leave operations happen at lag_i = 0, otherwise the
+ *          virtual time has non-contiguous motion equivalent to:
+ *
+ *	      V +-= lag_i / W
+ *
+ *	    Also see the comment in place_entity() that deals with this. ]]
+ *
+ * However, since v_i is u64 and the multiplication could easily overflow,
+ * transform it into a relative form that uses smaller quantities:
+ *
+ * Substitute: v_i == (v_i - v0) + v0
+ *
+ *     \Sum ((v_i - v0) + v0) * w_i   \Sum (v_i - v0) * w_i
+ * V = ---------------------------- = --------------------- + v0
+ *                  W                            W
+ *
+ * Which we track using:
+ *
+ *                    v0 := cfs_rq->min_vruntime
+ * \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime
+ *              \Sum w_i := cfs_rq->avg_load
+ *
+ * Since min_vruntime is a monotonic increasing variable that closely tracks
+ * the per-task service, these deltas: (v_i - v0), will be in the order of the
+ * maximal (virtual) lag induced in the system due to quantisation.
+ *
+ * Also, we use scale_load_down() to reduce the size.
+ *
+ * As measured, the max (key * weight) value was ~44 bits for a kernel build.
+ */
+static void
+avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	unsigned long weight = scale_load_down(se->load.weight);
+	s64 key = entity_key(cfs_rq, se);
+
+	cfs_rq->avg_vruntime += key * weight;
+	cfs_rq->avg_load += weight;
+}
+
+static void
+avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	unsigned long weight = scale_load_down(se->load.weight);
+	s64 key = entity_key(cfs_rq, se);
+
+	cfs_rq->avg_vruntime -= key * weight;
+	cfs_rq->avg_load -= weight;
+}
+
+static inline
+void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
+{
+	/*
+	 * v' = v + d ==> avg_vruntime' = avg_vruntime - d*avg_load
+	 */
+	cfs_rq->avg_vruntime -= cfs_rq->avg_load * delta;
+}
+
+u64 avg_vruntime(struct cfs_rq *cfs_rq)
+{
+	struct sched_entity *curr = cfs_rq->curr;
+	s64 avg = cfs_rq->avg_vruntime;
+	long load = cfs_rq->avg_load;
+
+	if (curr && curr->on_rq) {
+		unsigned long weight = scale_load_down(curr->load.weight);
+
+		avg += entity_key(cfs_rq, curr) * weight;
+		load += weight;
+	}
+
+	if (load)
+		avg = div_s64(avg, load);
+
+	return cfs_rq->min_vruntime + avg;
+}
+
+static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
+{
+	u64 min_vruntime = cfs_rq->min_vruntime;
+	/*
+	 * open coded max_vruntime() to allow updating avg_vruntime
+	 */
+	s64 delta = (s64)(vruntime - min_vruntime);
+	if (delta > 0) {
+		avg_vruntime_update(cfs_rq, delta);
+		min_vruntime = vruntime;
+	}
+	return min_vruntime;
+}
+
 static void update_min_vruntime(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *curr = cfs_rq->curr;
@@ -629,7 +754,7 @@ static void update_min_vruntime(struct c
 
 	/* ensure we never gain time by being placed backwards. */
 	u64_u32_store(cfs_rq->min_vruntime,
-		      max_vruntime(cfs_rq->min_vruntime, vruntime));
+		      __update_min_vruntime(cfs_rq, vruntime));
 }
 
 static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
@@ -642,12 +767,14 @@ static inline bool __entity_less(struct
  */
 static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	avg_vruntime_add(cfs_rq, se);
 	rb_add_cached(&se->run_node, &cfs_rq->tasks_timeline, __entity_less);
 }
 
 static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline);
+	avg_vruntime_sub(cfs_rq, se);
 }
 
 struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
@@ -3379,6 +3506,8 @@ static void reweight_entity(struct cfs_r
 		/* commit outstanding execution time */
 		if (cfs_rq->curr == se)
 			update_curr(cfs_rq);
+		else
+			avg_vruntime_sub(cfs_rq, se);
 		update_load_sub(&cfs_rq->load, se->load.weight);
 	}
 	dequeue_load_avg(cfs_rq, se);
@@ -3394,9 +3523,11 @@ static void reweight_entity(struct cfs_r
 #endif
 
 	enqueue_load_avg(cfs_rq, se);
-	if (se->on_rq)
+	if (se->on_rq) {
 		update_load_add(&cfs_rq->load, se->load.weight);
-
+		if (cfs_rq->curr != se)
+			avg_vruntime_add(cfs_rq, se);
+	}
 }
 
 void reweight_task(struct task_struct *p, int prio)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -554,6 +554,9 @@ struct cfs_rq {
 	unsigned int		idle_nr_running;   /* SCHED_IDLE */
 	unsigned int		idle_h_nr_running; /* SCHED_IDLE */
 
+	s64			avg_vruntime;
+	u64			avg_load;
+
 	u64			exec_clock;
 	u64			min_vruntime;
 #ifdef CONFIG_SCHED_CORE
@@ -3503,4 +3506,6 @@ static inline void task_tick_mm_cid(stru
 static inline void init_sched_mm_cid(struct task_struct *t) { }
 #endif
 
+extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
+
 #endif /* _KERNEL_SCHED_SCHED_H */



^ permalink raw reply	[flat|nested] 104+ messages in thread

* [PATCH 02/15] sched/fair: Remove START_DEBIT
  2023-05-31 11:58 [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr Peter Zijlstra
  2023-05-31 11:58 ` [PATCH 01/15] sched/fair: Add avg_vruntime Peter Zijlstra
@ 2023-05-31 11:58 ` Peter Zijlstra
  2023-08-10  7:10   ` [tip: sched/core] sched/fair: Remove sched_feat(START_DEBIT) tip-bot2 for Peter Zijlstra
  2023-05-31 11:58 ` [PATCH 03/15] sched/fair: Add lag based placement Peter Zijlstra
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 104+ messages in thread
From: Peter Zijlstra @ 2023-05-31 11:58 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault, tglx

With the introduction of avg_vruntime() there is no need to use worse
approximations. Take the 0-lag point as starting point for inserting
new tasks.
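
As a quick illustration: with two equally weighted tasks at vruntime 100 and
200, the 0-lag point is avg_vruntime() = 150, so a new task of the same weight
starts at 150 -- it is owed exactly zero service, rather than being pushed a
full (debited) slice into the future.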

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c     |   21 +--------------------
 kernel/sched/features.h |    6 ------
 2 files changed, 1 insertion(+), 26 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -906,16 +906,6 @@ static u64 sched_slice(struct cfs_rq *cf
 	return slice;
 }
 
-/*
- * We calculate the vruntime slice of a to-be-inserted task.
- *
- * vs = s/w
- */
-static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-	return calc_delta_fair(sched_slice(cfs_rq, se), se);
-}
-
 #include "pelt.h"
 #ifdef CONFIG_SMP
 
@@ -4862,16 +4852,7 @@ static inline bool entity_is_long_sleepe
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
-	u64 vruntime = cfs_rq->min_vruntime;
-
-	/*
-	 * The 'current' period is already promised to the current tasks,
-	 * however the extra weight of the new task will slow them down a
-	 * little, place the new task so that it fits in the slot that
-	 * stays open at the end.
-	 */
-	if (initial && sched_feat(START_DEBIT))
-		vruntime += sched_vslice(cfs_rq, se);
+	u64 vruntime = avg_vruntime(cfs_rq);
 
 	/* sleeps up to a single latency don't count. */
 	if (!initial) {
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -7,12 +7,6 @@
 SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true)
 
 /*
- * Place new tasks ahead so that they do not starve already running
- * tasks
- */
-SCHED_FEAT(START_DEBIT, true)
-
-/*
  * Prefer to schedule the task we woke last (assuming it failed
  * wakeup-preemption), since its likely going to consume data we
  * touched, increases cache locality.



^ permalink raw reply	[flat|nested] 104+ messages in thread

* [PATCH 03/15] sched/fair: Add lag based placement
  2023-05-31 11:58 [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr Peter Zijlstra
  2023-05-31 11:58 ` [PATCH 01/15] sched/fair: Add avg_vruntime Peter Zijlstra
  2023-05-31 11:58 ` [PATCH 02/15] sched/fair: Remove START_DEBIT Peter Zijlstra
@ 2023-05-31 11:58 ` Peter Zijlstra
  2023-08-10  7:10   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
                     ` (2 more replies)
  2023-05-31 11:58 ` [PATCH 04/15] rbtree: Add rb_add_augmented_cached() helper Peter Zijlstra
                   ` (12 subsequent siblings)
  15 siblings, 3 replies; 104+ messages in thread
From: Peter Zijlstra @ 2023-05-31 11:58 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault, tglx

With the introduction of avg_vruntime, it is possible to approximate
lag (the entire purpose of introducing it in fact). Use this to do lag
based placement over sleep+wake.

Specifically, the FAIR_SLEEPERS logic places entities too far to the left and
messes up the deadline aspect of EEVDF.
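
To make the lag compensation concrete, a toy sketch with made-up numbers (W is
the weight already on the queue, w that of the entity being placed); it
reproduces the inflation step, vl_i = (W + w) * vl'_i / W, that the new
place_entity() below derives:

#include <stdio.h>

int main(void)
{
	long W = 3 * 1024;	/* weight already on the queue */
	long w = 1024;		/* weight of the entity being placed */
	long vlag = 2000;	/* virtual lag recorded at dequeue */

	/* naive v_i = V - vlag: the lag shrinks once we join the average */
	long naive = vlag - w * vlag / (W + w);			/* 1500 */

	/* compensated placement, as in the patch */
	long inflated = (W + w) * vlag / W;			/* 2666 */
	long fixed = inflated - w * inflated / (W + w);		/* back to ~vlag */

	printf("naive %ld, compensated %ld (target %ld)\n", naive, fixed, vlag);
	return 0;
}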

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h   |    3 
 kernel/sched/core.c     |    1 
 kernel/sched/fair.c     |  162 +++++++++++++++++++++++++++++++++++++-----------
 kernel/sched/features.h |    8 ++
 4 files changed, 138 insertions(+), 36 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -555,8 +555,9 @@ struct sched_entity {
 
 	u64				exec_start;
 	u64				sum_exec_runtime;
-	u64				vruntime;
 	u64				prev_sum_exec_runtime;
+	u64				vruntime;
+	s64				vlag;
 
 	u64				nr_migrations;
 
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4463,6 +4463,7 @@ static void __sched_fork(unsigned long c
 	p->se.prev_sum_exec_runtime	= 0;
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
+	p->se.vlag			= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -715,6 +715,15 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
 	return cfs_rq->min_vruntime + avg;
 }
 
+/*
+ * lag_i = S - s_i = w_i * (V - v_i)
+ */
+void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	SCHED_WARN_ON(!se->on_rq);
+	se->vlag = avg_vruntime(cfs_rq) - se->vruntime;
+}
+
 static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
 {
 	u64 min_vruntime = cfs_rq->min_vruntime;
@@ -3492,6 +3501,8 @@ dequeue_load_avg(struct cfs_rq *cfs_rq,
 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 			    unsigned long weight)
 {
+	unsigned long old_weight = se->load.weight;
+
 	if (se->on_rq) {
 		/* commit outstanding execution time */
 		if (cfs_rq->curr == se)
@@ -3504,6 +3515,14 @@ static void reweight_entity(struct cfs_r
 
 	update_load_set(&se->load, weight);
 
+	if (!se->on_rq) {
+		/*
+		 * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i),
+		 * we need to scale se->vlag when w_i changes.
+		 */
+		se->vlag = div_s64(se->vlag * old_weight, weight);
+	}
+
 #ifdef CONFIG_SMP
 	do {
 		u32 divider = get_pelt_divider(&se->avg);
@@ -4853,49 +4872,119 @@ static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
 	u64 vruntime = avg_vruntime(cfs_rq);
+	s64 lag = 0;
 
-	/* sleeps up to a single latency don't count. */
-	if (!initial) {
-		unsigned long thresh;
+	/*
+	 * Due to how V is constructed as the weighted average of entities,
+	 * adding tasks with positive lag, or removing tasks with negative lag
+	 * will move 'time' backwards, this can screw around with the lag of
+	 * other tasks.
+	 *
+	 * EEVDF: placement strategy #1 / #2
+	 */
+	if (sched_feat(PLACE_LAG) && cfs_rq->nr_running > 1) {
+		struct sched_entity *curr = cfs_rq->curr;
+		unsigned long load;
 
-		if (se_is_idle(se))
-			thresh = sysctl_sched_min_granularity;
-		else
-			thresh = sysctl_sched_latency;
+		lag = se->vlag;
 
 		/*
-		 * Halve their sleep time's effect, to allow
-		 * for a gentler effect of sleepers:
+		 * If we want to place a task and preserve lag, we have to
+		 * consider the effect of the new entity on the weighted
+		 * average and compensate for this, otherwise lag can quickly
+		 * evaporate.
+		 *
+		 * Lag is defined as:
+		 *
+		 *   lag_i = S - s_i = w_i * (V - v_i)
+		 *
+		 * To avoid the 'w_i' term all over the place, we only track
+		 * the virtual lag:
+		 *
+		 *   vl_i = V - v_i <=> v_i = V - vl_i
+		 *
+		 * And we take V to be the weighted average of all v:
+		 *
+		 *   V = (\Sum w_j*v_j) / W
+		 *
+		 * Where W is: \Sum w_j
+		 *
+		 * Then, the weighted average after adding an entity with lag
+		 * vl_i is given by:
+		 *
+		 *   V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
+		 *      = (W*V + w_i*(V - vl_i)) / (W + w_i)
+		 *      = (W*V + w_i*V - w_i*vl_i) / (W + w_i)
+		 *      = (V*(W + w_i) - w_i*vl_i) / (W + w_i)
+		 *      = V - w_i*vl_i / (W + w_i)
+		 *
+		 * And the actual lag after adding an entity with vl_i is:
+		 *
+		 *   vl'_i = V' - v_i
+		 *         = V - w_i*vl_i / (W + w_i) - (V - vl_i)
+		 *         = vl_i - w_i*vl_i / (W + w_i)
+		 *
+		 * Which is strictly less than vl_i. So in order to preserve lag
+		 * we should inflate the lag before placement such that the
+		 * effective lag after placement comes out right.
+		 *
+		 * As such, invert the above relation for vl'_i to get the vl_i
+		 * we need to use such that the lag after placement is the lag
+		 * we computed before dequeue.
+		 *
+		 *   vl'_i = vl_i - w_i*vl_i / (W + w_i)
+		 *         = ((W + w_i)*vl_i - w_i*vl_i) / (W + w_i)
+		 *
+		 *   (W + w_i)*vl'_i = (W + w_i)*vl_i - w_i*vl_i
+		 *                   = W*vl_i
+		 *
+		 *   vl_i = (W + w_i)*vl'_i / W
 		 */
-		if (sched_feat(GENTLE_FAIR_SLEEPERS))
-			thresh >>= 1;
+		load = cfs_rq->avg_load;
+		if (curr && curr->on_rq)
+			load += curr->load.weight;
 
-		vruntime -= thresh;
+		lag *= load + se->load.weight;
+		if (WARN_ON_ONCE(!load))
+			load = 1;
+		lag = div_s64(lag, load);
+
+		vruntime -= lag;
 	}
 
-	/*
-	 * Pull vruntime of the entity being placed to the base level of
-	 * cfs_rq, to prevent boosting it if placed backwards.
-	 * However, min_vruntime can advance much faster than real time, with
-	 * the extreme being when an entity with the minimal weight always runs
-	 * on the cfs_rq. If the waking entity slept for a long time, its
-	 * vruntime difference from min_vruntime may overflow s64 and their
-	 * comparison may get inversed, so ignore the entity's original
-	 * vruntime in that case.
-	 * The maximal vruntime speedup is given by the ratio of normal to
-	 * minimal weight: scale_load_down(NICE_0_LOAD) / MIN_SHARES.
-	 * When placing a migrated waking entity, its exec_start has been set
-	 * from a different rq. In order to take into account a possible
-	 * divergence between new and prev rq's clocks task because of irq and
-	 * stolen time, we take an additional margin.
-	 * So, cutting off on the sleep time of
-	 *     2^63 / scale_load_down(NICE_0_LOAD) ~ 104 days
-	 * should be safe.
-	 */
-	if (entity_is_long_sleeper(se))
-		se->vruntime = vruntime;
-	else
-		se->vruntime = max_vruntime(se->vruntime, vruntime);
+	if (sched_feat(FAIR_SLEEPERS)) {
+
+		/* sleeps up to a single latency don't count. */
+		if (!initial) {
+			unsigned long thresh;
+
+			if (se_is_idle(se))
+				thresh = sysctl_sched_min_granularity;
+			else
+				thresh = sysctl_sched_latency;
+
+			/*
+			 * Halve their sleep time's effect, to allow
+			 * for a gentler effect of sleepers:
+			 */
+			if (sched_feat(GENTLE_FAIR_SLEEPERS))
+				thresh >>= 1;
+
+			vruntime -= thresh;
+		}
+
+		/*
+		 * Pull vruntime of the entity being placed to the base level of
+		 * cfs_rq, to prevent boosting it if placed backwards.  If the entity
+		 * slept for a long time, don't even try to compare its vruntime with
+		 * the base as it may be too far off and the comparison may get
+		 * inversed due to s64 overflow.
+		 */
+		if (!entity_is_long_sleeper(se))
+			vruntime = max_vruntime(se->vruntime, vruntime);
+	}
+
+	se->vruntime = vruntime;
 }
 
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
@@ -5066,6 +5155,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 
 	clear_buddies(cfs_rq, se);
 
+	if (flags & DEQUEUE_SLEEP)
+		update_entity_lag(cfs_rq, se);
+
 	if (se != cfs_rq->curr)
 		__dequeue_entity(cfs_rq, se);
 	se->on_rq = 0;
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -1,12 +1,20 @@
 /* SPDX-License-Identifier: GPL-2.0 */
+
 /*
  * Only give sleepers 50% of their service deficit. This allows
  * them to run sooner, but does not allow tons of sleepers to
  * rip the spread apart.
  */
+SCHED_FEAT(FAIR_SLEEPERS, false)
 SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true)
 
 /*
+ * Using the avg_vruntime, do the right thing and preserve lag across
+ * sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
+ */
+SCHED_FEAT(PLACE_LAG, true)
+
+/*
  * Prefer to schedule the task we woke last (assuming it failed
  * wakeup-preemption), since its likely going to consume data we
  * touched, increases cache locality.



^ permalink raw reply	[flat|nested] 104+ messages in thread

* [PATCH 04/15] rbtree: Add rb_add_augmented_cached() helper
  2023-05-31 11:58 [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr Peter Zijlstra
                   ` (2 preceding siblings ...)
  2023-05-31 11:58 ` [PATCH 03/15] sched/fair: Add lag based placement Peter Zijlstra
@ 2023-05-31 11:58 ` Peter Zijlstra
  2023-08-10  7:10   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2023-05-31 11:58 ` [PATCH 05/15] sched/fair: Implement an EEVDF like policy Peter Zijlstra
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 104+ messages in thread
From: Peter Zijlstra @ 2023-05-31 11:58 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault, tglx

While slightly sub-optimal (updating the augmented data while going down the
tree during lookup would be faster), the augment interface does not currently
allow for that; provide a generic helper to add a node to an augmented cached
tree.
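
For illustration, a condensed usage sketch modelled on how a later patch in
this series (05/15) uses the helper for its min_deadline heap; the names here
(struct item, min_val, ...) are made up:

#include <linux/types.h>
#include <linux/rbtree_augmented.h>

struct item {
	struct rb_node	node;
	u64		key;		/* sort order */
	u64		val;		/* per-node value */
	u64		min_val;	/* min(val) over this subtree */
};

#define node_to_item(n)	rb_entry((n), struct item, node)

static inline bool item_less(struct rb_node *a, const struct rb_node *b)
{
	return node_to_item(a)->key < node_to_item(b)->key;
}

static inline void __update_min_val(struct item *it, struct rb_node *node)
{
	if (node && node_to_item(node)->min_val < it->min_val)
		it->min_val = node_to_item(node)->min_val;
}

/* min_val = min(val, left->min_val, right->min_val); true if unchanged */
static inline bool min_val_update(struct item *it, bool exit)
{
	u64 old_min_val = it->min_val;

	it->min_val = it->val;
	__update_min_val(it, it->node.rb_left);
	__update_min_val(it, it->node.rb_right);

	return it->min_val == old_min_val;
}

RB_DECLARE_CALLBACKS(static, min_val_cb, struct item, node, min_val, min_val_update);

static void item_insert(struct item *it, struct rb_root_cached *tree)
{
	it->min_val = it->val;
	rb_add_augmented_cached(&it->node, tree, item_less, &min_val_cb);
}

static void item_remove(struct item *it, struct rb_root_cached *tree)
{
	rb_erase_augmented_cached(&it->node, tree, &min_val_cb);
}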

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/rbtree_augmented.h |   26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

--- a/include/linux/rbtree_augmented.h
+++ b/include/linux/rbtree_augmented.h
@@ -60,6 +60,32 @@ rb_insert_augmented_cached(struct rb_nod
 	rb_insert_augmented(node, &root->rb_root, augment);
 }
 
+static __always_inline struct rb_node *
+rb_add_augmented_cached(struct rb_node *node, struct rb_root_cached *tree,
+			bool (*less)(struct rb_node *, const struct rb_node *),
+			const struct rb_augment_callbacks *augment)
+{
+	struct rb_node **link = &tree->rb_root.rb_node;
+	struct rb_node *parent = NULL;
+	bool leftmost = true;
+
+	while (*link) {
+		parent = *link;
+		if (less(node, parent)) {
+			link = &parent->rb_left;
+		} else {
+			link = &parent->rb_right;
+			leftmost = false;
+		}
+	}
+
+	rb_link_node(node, parent, link);
+	augment->propagate(parent, NULL); /* suboptimal */
+	rb_insert_augmented_cached(node, tree, leftmost, augment);
+
+	return leftmost ? node : NULL;
+}
+
 /*
  * Template for declaring augmented rbtree callbacks (generic case)
  *



^ permalink raw reply	[flat|nested] 104+ messages in thread

* [PATCH 05/15] sched/fair: Implement an EEVDF like policy
  2023-05-31 11:58 [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr Peter Zijlstra
                   ` (3 preceding siblings ...)
  2023-05-31 11:58 ` [PATCH 04/15] rbtree: Add rb_add_augmented_cached() helper Peter Zijlstra
@ 2023-05-31 11:58 ` Peter Zijlstra
  2023-08-10  7:10   ` [tip: sched/core] sched/fair: Implement an EEVDF-like scheduling policy tip-bot2 for Peter Zijlstra
                     ` (2 more replies)
  2023-05-31 11:58 ` [PATCH 06/15] sched: Commit to lag based placement Peter Zijlstra
                   ` (10 subsequent siblings)
  15 siblings, 3 replies; 104+ messages in thread
From: Peter Zijlstra @ 2023-05-31 11:58 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault, tglx

CFS is currently a WFQ based scheduler with only a single knob, the weight.
The addition of a second, latency oriented parameter makes something like
WF2Q or EEVDF a much better fit.

Specifically, EEVDF does EDF like scheduling in the left half of the
tree -- those entities that are owed service. Except because this is a
virtual time scheduler, the deadlines are in virtual time as well,
which is what allows over-subscription.

EEVDF has two parameters:

 - weight, or time-slope; which is mapped to nice just as before
 - request size, or slice length; which is used to compute
   the virtual deadline as: vd_i = ve_i + r_i/w_i

Basically, by setting a smaller slice, the deadline will be earlier and the
task will be more eligible and run earlier.

Tick driven preemption is driven by request/slice completion; while
wakeup preemption is driven by the deadline.

Because the tree is now effectively an interval tree, and the
selection is no longer 'leftmost', over-scheduling is less of a
problem.
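
As a quick worked example of vd_i = ve_i + r_i/w_i (toy userspace sketch; the
division is the NICE_0_LOAD scaling that converts the slice to virtual time,
so at nice-0 weight it is an identity):

#include <stdint.h>
#include <stdio.h>

#define NICE_0_LOAD	1024UL

/* vd_i = ve_i + r_i/w_i, with the slice converted to virtual time */
static uint64_t virtual_deadline(uint64_t ve, uint64_t slice_ns, unsigned long weight)
{
	return ve + slice_ns * NICE_0_LOAD / weight;
}

int main(void)
{
	uint64_t ve = 100000000ULL;	/* both eligible at 100ms (virtual) */

	uint64_t vd_a = virtual_deadline(ve, 3000000, NICE_0_LOAD);	/* 3ms request    */
	uint64_t vd_b = virtual_deadline(ve,  750000, NICE_0_LOAD);	/* 0.75ms request */

	printf("vd_a=%llu vd_b=%llu -> pick %c first\n",
	       (unsigned long long)vd_a, (unsigned long long)vd_b,
	       vd_b < vd_a ? 'B' : 'A');
	return 0;
}

With both tasks eligible at ve = 100ms, the 0.75ms request gets a virtual
deadline of 100.75ms vs 103ms and is picked first; its weight still bounds how
much total service it can claim.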

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h   |    4 
 kernel/sched/core.c     |    1 
 kernel/sched/debug.c    |    6 
 kernel/sched/fair.c     |  338 +++++++++++++++++++++++++++++++++++++++++-------
 kernel/sched/features.h |    3 
 kernel/sched/sched.h    |    4 
 6 files changed, 308 insertions(+), 48 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -550,6 +550,9 @@ struct sched_entity {
 	/* For load-balancing: */
 	struct load_weight		load;
 	struct rb_node			run_node;
+	u64				deadline;
+	u64				min_deadline;
+
 	struct list_head		group_node;
 	unsigned int			on_rq;
 
@@ -558,6 +561,7 @@ struct sched_entity {
 	u64				prev_sum_exec_runtime;
 	u64				vruntime;
 	s64				vlag;
+	u64				slice;
 
 	u64				nr_migrations;
 
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4464,6 +4464,7 @@ static void __sched_fork(unsigned long c
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
 	p->se.vlag			= 0;
+	p->se.slice			= sysctl_sched_min_granularity;
 	INIT_LIST_HEAD(&p->se.group_node);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -581,9 +581,13 @@ print_task(struct seq_file *m, struct rq
 	else
 		SEQ_printf(m, " %c", task_state_to_char(p));
 
-	SEQ_printf(m, " %15s %5d %9Ld.%06ld %9Ld %5d ",
+	SEQ_printf(m, "%15s %5d %9Ld.%06ld %c %9Ld.%06ld %9Ld.%06ld %9Ld.%06ld %9Ld %5d ",
 		p->comm, task_pid_nr(p),
 		SPLIT_NS(p->se.vruntime),
+		entity_eligible(cfs_rq_of(&p->se), &p->se) ? 'E' : 'N',
+		SPLIT_NS(p->se.deadline),
+		SPLIT_NS(p->se.slice),
+		SPLIT_NS(p->se.sum_exec_runtime),
 		(long long)(p->nvcsw + p->nivcsw),
 		p->prio);
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -47,6 +47,7 @@
 #include <linux/psi.h>
 #include <linux/ratelimit.h>
 #include <linux/task_work.h>
+#include <linux/rbtree_augmented.h>
 
 #include <asm/switch_to.h>
 
@@ -347,6 +348,16 @@ static u64 __calc_delta(u64 delta_exec,
 	return mul_u64_u32_shr(delta_exec, fact, shift);
 }
 
+/*
+ * delta /= w
+ */
+static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
+{
+	if (unlikely(se->load.weight != NICE_0_LOAD))
+		delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
+
+	return delta;
+}
 
 const struct sched_class fair_sched_class;
 
@@ -717,11 +728,62 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
 
 /*
  * lag_i = S - s_i = w_i * (V - v_i)
+ *
+ * However, since V is approximated by the weighted average of all entities it
+ * is possible -- by addition/removal/reweight to the tree -- to move V around
+ * and end up with a larger lag than we started with.
+ *
+ * Limit this to double the slice length, with a minimum of TICK_NSEC,
+ * since that is the timing granularity.
+ *
+ * EEVDF gives the following limit for a steady state system:
+ *
+ *   -r_max < lag < max(r_max, q)
+ *
+ * XXX could add max_slice to the augmented data to track this.
  */
 void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	s64 lag, limit;
+
 	SCHED_WARN_ON(!se->on_rq);
-	se->vlag = avg_vruntime(cfs_rq) - se->vruntime;
+	lag = avg_vruntime(cfs_rq) - se->vruntime;
+
+	limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
+	se->vlag = clamp(lag, -limit, limit);
+}
+
+/*
+ * Entity is eligible once it received less service than it ought to have,
+ * eg. lag >= 0.
+ *
+ * lag_i = S - s_i = w_i*(V - v_i)
+ *
+ * lag_i >= 0 -> V >= v_i
+ *
+ *     \Sum (v_i - v)*w_i
+ * V = ------------------ + v
+ *          \Sum w_i
+ *
+ * lag_i >= 0 -> \Sum (v_i - v)*w_i >= (v_i - v)*(\Sum w_i)
+ *
+ * Note: using 'avg_vruntime() > se->vruntime' is inaccurate due
+ *       to the loss in precision caused by the division.
+ */
+int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	struct sched_entity *curr = cfs_rq->curr;
+	s64 avg = cfs_rq->avg_vruntime;
+	long load = cfs_rq->avg_load;
+
+	if (curr && curr->on_rq) {
+		unsigned long weight = scale_load_down(curr->load.weight);
+
+		avg += entity_key(cfs_rq, curr) * weight;
+		load += weight;
+	}
+
+	return avg >= entity_key(cfs_rq, se) * load;
 }
 
 static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
@@ -740,8 +802,8 @@ static u64 __update_min_vruntime(struct
 
 static void update_min_vruntime(struct cfs_rq *cfs_rq)
 {
+	struct sched_entity *se = __pick_first_entity(cfs_rq);
 	struct sched_entity *curr = cfs_rq->curr;
-	struct rb_node *leftmost = rb_first_cached(&cfs_rq->tasks_timeline);
 
 	u64 vruntime = cfs_rq->min_vruntime;
 
@@ -752,9 +814,7 @@ static void update_min_vruntime(struct c
 			curr = NULL;
 	}
 
-	if (leftmost) { /* non-empty tree */
-		struct sched_entity *se = __node_2_se(leftmost);
-
+	if (se) {
 		if (!curr)
 			vruntime = se->vruntime;
 		else
@@ -771,18 +831,50 @@ static inline bool __entity_less(struct
 	return entity_before(__node_2_se(a), __node_2_se(b));
 }
 
+#define deadline_gt(field, lse, rse) ({ (s64)((lse)->field - (rse)->field) > 0; })
+
+static inline void __update_min_deadline(struct sched_entity *se, struct rb_node *node)
+{
+	if (node) {
+		struct sched_entity *rse = __node_2_se(node);
+		if (deadline_gt(min_deadline, se, rse))
+			se->min_deadline = rse->min_deadline;
+	}
+}
+
+/*
+ * se->min_deadline = min(se->deadline, left->min_deadline, right->min_deadline)
+ */
+static inline bool min_deadline_update(struct sched_entity *se, bool exit)
+{
+	u64 old_min_deadline = se->min_deadline;
+	struct rb_node *node = &se->run_node;
+
+	se->min_deadline = se->deadline;
+	__update_min_deadline(se, node->rb_right);
+	__update_min_deadline(se, node->rb_left);
+
+	return se->min_deadline == old_min_deadline;
+}
+
+RB_DECLARE_CALLBACKS(static, min_deadline_cb, struct sched_entity,
+		     run_node, min_deadline, min_deadline_update);
+
 /*
  * Enqueue an entity into the rb-tree:
  */
 static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	avg_vruntime_add(cfs_rq, se);
-	rb_add_cached(&se->run_node, &cfs_rq->tasks_timeline, __entity_less);
+	se->min_deadline = se->deadline;
+	rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
+				__entity_less, &min_deadline_cb);
 }
 
 static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline);
+	rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
+				  &min_deadline_cb);
 	avg_vruntime_sub(cfs_rq, se);
 }
 
@@ -806,6 +898,97 @@ static struct sched_entity *__pick_next_
 	return __node_2_se(next);
 }
 
+static struct sched_entity *pick_cfs(struct cfs_rq *cfs_rq, struct sched_entity *curr)
+{
+	struct sched_entity *left = __pick_first_entity(cfs_rq);
+
+	/*
+	 * If curr is set we have to see if its left of the leftmost entity
+	 * still in the tree, provided there was anything in the tree at all.
+	 */
+	if (!left || (curr && entity_before(curr, left)))
+		left = curr;
+
+	return left;
+}
+
+/*
+ * Earliest Eligible Virtual Deadline First
+ *
+ * In order to provide latency guarantees for different request sizes
+ * EEVDF selects the best runnable task from two criteria:
+ *
+ *  1) the task must be eligible (must be owed service)
+ *
+ *  2) from those tasks that meet 1), we select the one
+ *     with the earliest virtual deadline.
+ *
+ * We can do this in O(log n) time due to an augmented RB-tree. The
+ * tree keeps the entries sorted on service, but also functions as a
+ * heap based on the deadline by keeping:
+ *
+ *  se->min_deadline = min(se->deadline, se->{left,right}->min_deadline)
+ *
+ * Which allows an EDF like search on (sub)trees.
+ */
+static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
+{
+	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
+	struct sched_entity *curr = cfs_rq->curr;
+	struct sched_entity *best = NULL;
+
+	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
+		curr = NULL;
+
+	while (node) {
+		struct sched_entity *se = __node_2_se(node);
+
+		/*
+		 * If this entity is not eligible, try the left subtree.
+		 */
+		if (!entity_eligible(cfs_rq, se)) {
+			node = node->rb_left;
+			continue;
+		}
+
+		/*
+		 * If this entity has an earlier deadline than the previous
+		 * best, take this one. If it also has the earliest deadline
+		 * of its subtree, we're done.
+		 */
+		if (!best || deadline_gt(deadline, best, se)) {
+			best = se;
+			if (best->deadline == best->min_deadline)
+				break;
+		}
+
+		/*
+		 * If the earliest deadline in this subtree is in the fully
+		 * eligible left half of our space, go there.
+		 */
+		if (node->rb_left &&
+		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
+			node = node->rb_left;
+			continue;
+		}
+
+		node = node->rb_right;
+	}
+
+	if (!best || (curr && deadline_gt(deadline, best, curr)))
+		best = curr;
+
+	if (unlikely(!best)) {
+		struct sched_entity *left = __pick_first_entity(cfs_rq);
+		if (left) {
+			pr_err("EEVDF scheduling fail, picking leftmost\n");
+			return left;
+		}
+	}
+
+	return best;
+}
+
 #ifdef CONFIG_SCHED_DEBUG
 struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
 {
@@ -840,17 +1023,6 @@ int sched_update_scaling(void)
 #endif
 
 /*
- * delta /= w
- */
-static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
-{
-	if (unlikely(se->load.weight != NICE_0_LOAD))
-		delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
-
-	return delta;
-}
-
-/*
  * The idea is to set a period in which each task runs once.
  *
  * When there are too many tasks (sched_nr_latency) we have to stretch
@@ -915,6 +1087,48 @@ static u64 sched_slice(struct cfs_rq *cf
 	return slice;
 }
 
+static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
+
+/*
+ * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
+ * this is probably good enough.
+ */
+static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	if ((s64)(se->vruntime - se->deadline) < 0)
+		return;
+
+	if (sched_feat(EEVDF)) {
+		/*
+		 * For EEVDF the virtual time slope is determined by w_i (iow.
+		 * nice) while the request time r_i is determined by
+		 * sysctl_sched_min_granularity.
+		 */
+		se->slice = sysctl_sched_min_granularity;
+
+		/*
+		 * The task has consumed its request, reschedule.
+		 */
+		if (cfs_rq->nr_running > 1) {
+			resched_curr(rq_of(cfs_rq));
+			clear_buddies(cfs_rq, se);
+		}
+	} else {
+		/*
+		 * When many tasks blow up the sched_period; it is possible
+		 * that sched_slice() reports unusually large results (when
+		 * many tasks are very light for example). Therefore impose a
+		 * maximum.
+		 */
+		se->slice = min_t(u64, sched_slice(cfs_rq, se), sysctl_sched_latency);
+	}
+
+	/*
+	 * EEVDF: vd_i = ve_i + r_i / w_i
+	 */
+	se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
+}
+
 #include "pelt.h"
 #ifdef CONFIG_SMP
 
@@ -1047,6 +1261,7 @@ static void update_curr(struct cfs_rq *c
 	schedstat_add(cfs_rq->exec_clock, delta_exec);
 
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
+	update_deadline(cfs_rq, curr);
 	update_min_vruntime(cfs_rq);
 
 	if (entity_is_task(curr)) {
@@ -3521,6 +3736,14 @@ static void reweight_entity(struct cfs_r
 		 * we need to scale se->vlag when w_i changes.
 		 */
 		se->vlag = div_s64(se->vlag * old_weight, weight);
+	} else {
+		s64 deadline = se->deadline - se->vruntime;
+		/*
+		 * When the weight changes, the virtual time slope changes and
+		 * we should adjust the relative virtual deadline accordingly.
+		 */
+		deadline = div_s64(deadline * old_weight, weight);
+		se->deadline = se->vruntime + deadline;
 	}
 
 #ifdef CONFIG_SMP
@@ -4871,6 +5094,7 @@ static inline bool entity_is_long_sleepe
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
+	u64 vslice = calc_delta_fair(se->slice, se);
 	u64 vruntime = avg_vruntime(cfs_rq);
 	s64 lag = 0;
 
@@ -4942,9 +5166,9 @@ place_entity(struct cfs_rq *cfs_rq, stru
 		 */
 		load = cfs_rq->avg_load;
 		if (curr && curr->on_rq)
-			load += curr->load.weight;
+			load += scale_load_down(curr->load.weight);
 
-		lag *= load + se->load.weight;
+		lag *= load + scale_load_down(se->load.weight);
 		if (WARN_ON_ONCE(!load))
 			load = 1;
 		lag = div_s64(lag, load);
@@ -4985,6 +5209,19 @@ place_entity(struct cfs_rq *cfs_rq, stru
 	}
 
 	se->vruntime = vruntime;
+
+	/*
+	 * When joining the competition, the existing tasks will be,
+	 * on average, halfway through their slice, as such start tasks
+	 * off with half a slice to ease into the competition.
+	 */
+	if (sched_feat(PLACE_DEADLINE_INITIAL) && initial)
+		vslice /= 2;
+
+	/*
+	 * EEVDF: vd_i = ve_i + r_i/w_i
+	 */
+	se->deadline = se->vruntime + vslice;
 }
 
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
@@ -5196,19 +5433,12 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 static void
 check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 {
-	unsigned long ideal_runtime, delta_exec;
+	unsigned long delta_exec;
 	struct sched_entity *se;
 	s64 delta;
 
-	/*
-	 * When many tasks blow up the sched_period; it is possible that
-	 * sched_slice() reports unusually large results (when many tasks are
-	 * very light for example). Therefore impose a maximum.
-	 */
-	ideal_runtime = min_t(u64, sched_slice(cfs_rq, curr), sysctl_sched_latency);
-
 	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
-	if (delta_exec > ideal_runtime) {
+	if (delta_exec > curr->slice) {
 		resched_curr(rq_of(cfs_rq));
 		/*
 		 * The current task ran long enough, ensure it doesn't get
@@ -5232,7 +5462,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq
 	if (delta < 0)
 		return;
 
-	if (delta > ideal_runtime)
+	if (delta > curr->slice)
 		resched_curr(rq_of(cfs_rq));
 }
 
@@ -5287,17 +5517,20 @@ wakeup_preempt_entity(struct sched_entit
 static struct sched_entity *
 pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 {
-	struct sched_entity *left = __pick_first_entity(cfs_rq);
-	struct sched_entity *se;
+	struct sched_entity *left, *se;
 
-	/*
-	 * If curr is set we have to see if its left of the leftmost entity
-	 * still in the tree, provided there was anything in the tree at all.
-	 */
-	if (!left || (curr && entity_before(curr, left)))
-		left = curr;
+	if (sched_feat(EEVDF)) {
+		/*
+		 * Enabling NEXT_BUDDY will affect latency but not fairness.
+		 */
+		if (sched_feat(NEXT_BUDDY) &&
+		    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
+			return cfs_rq->next;
+
+		return pick_eevdf(cfs_rq);
+	}
 
-	se = left; /* ideally we run the leftmost entity */
+	se = left = pick_cfs(cfs_rq, curr);
 
 	/*
 	 * Avoid running the skip buddy, if running something else can
@@ -5390,7 +5623,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc
 		return;
 #endif
 
-	if (cfs_rq->nr_running > 1)
+	if (!sched_feat(EEVDF) && cfs_rq->nr_running > 1)
 		check_preempt_tick(cfs_rq, curr);
 }
 
@@ -6396,13 +6629,12 @@ static inline void unthrottle_offline_cf
 static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 	SCHED_WARN_ON(task_rq(p) != rq);
 
 	if (rq->cfs.h_nr_running > 1) {
-		u64 slice = sched_slice(cfs_rq, se);
 		u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+		u64 slice = se->slice;
 		s64 delta = slice - ran;
 
 		if (delta < 0) {
@@ -8122,7 +8354,19 @@ static void check_preempt_wakeup(struct
 	if (cse_is_idle != pse_is_idle)
 		return;
 
-	update_curr(cfs_rq_of(se));
+	cfs_rq = cfs_rq_of(se);
+	update_curr(cfs_rq);
+
+	if (sched_feat(EEVDF)) {
+		/*
+		 * XXX pick_eevdf(cfs_rq) != se ?
+		 */
+		if (pick_eevdf(cfs_rq) == pse)
+			goto preempt;
+
+		return;
+	}
+
 	if (wakeup_preempt_entity(se, pse) == 1) {
 		/*
 		 * Bias pick_next to pick the sched entity that is
@@ -8368,7 +8612,7 @@ static void yield_task_fair(struct rq *r
 
 	clear_buddies(cfs_rq, se);
 
-	if (curr->policy != SCHED_BATCH) {
+	if (sched_feat(EEVDF) || curr->policy != SCHED_BATCH) {
 		update_rq_clock(rq);
 		/*
 		 * Update run-time statistics of the 'current'.
@@ -8381,6 +8625,8 @@ static void yield_task_fair(struct rq *r
 		 */
 		rq_clock_skip_update(rq);
 	}
+	if (sched_feat(EEVDF))
+		se->deadline += calc_delta_fair(se->slice, se);
 
 	set_skip_buddy(se);
 }
@@ -12136,8 +12382,8 @@ static void rq_offline_fair(struct rq *r
 static inline bool
 __entity_slice_used(struct sched_entity *se, int min_nr_tasks)
 {
-	u64 slice = sched_slice(cfs_rq_of(se), se);
 	u64 rtime = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+	u64 slice = se->slice;
 
 	return (rtime * min_nr_tasks > slice);
 }
@@ -12832,7 +13078,7 @@ static unsigned int get_rr_interval_fair
 	 * idle runqueue:
 	 */
 	if (rq->cfs.load.weight)
-		rr_interval = NS_TO_JIFFIES(sched_slice(cfs_rq_of(se), se));
+		rr_interval = NS_TO_JIFFIES(se->slice);
 
 	return rr_interval;
 }
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -13,6 +13,7 @@ SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true)
  * sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
  */
 SCHED_FEAT(PLACE_LAG, true)
+SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
 
 /*
  * Prefer to schedule the task we woke last (assuming it failed
@@ -103,3 +104,5 @@ SCHED_FEAT(LATENCY_WARN, false)
 
 SCHED_FEAT(ALT_PERIOD, true)
 SCHED_FEAT(BASE_SLICE, true)
+
+SCHED_FEAT(EEVDF, true)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2481,9 +2481,10 @@ extern void check_preempt_curr(struct rq
 extern const_debug unsigned int sysctl_sched_nr_migrate;
 extern const_debug unsigned int sysctl_sched_migration_cost;
 
+extern unsigned int sysctl_sched_min_granularity;
+
 #ifdef CONFIG_SCHED_DEBUG
 extern unsigned int sysctl_sched_latency;
-extern unsigned int sysctl_sched_min_granularity;
 extern unsigned int sysctl_sched_idle_min_granularity;
 extern unsigned int sysctl_sched_wakeup_granularity;
 extern int sysctl_resched_latency_warn_ms;
@@ -3507,5 +3508,6 @@ static inline void init_sched_mm_cid(str
 #endif
 
 extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
+extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
 
 #endif /* _KERNEL_SCHED_SCHED_H */



^ permalink raw reply	[flat|nested] 104+ messages in thread

* [PATCH 06/15] sched: Commit to lag based placement
  2023-05-31 11:58 [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr Peter Zijlstra
                   ` (4 preceding siblings ...)
  2023-05-31 11:58 ` [PATCH 05/15] sched/fair: Implement an EEVDF like policy Peter Zijlstra
@ 2023-05-31 11:58 ` Peter Zijlstra
  2023-08-10  7:10   ` [tip: sched/core] sched/fair: " tip-bot2 for Peter Zijlstra
  2023-05-31 11:58 ` [PATCH 07/15] sched/smp: Use lag to simplify cross-runqueue placement Peter Zijlstra
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 104+ messages in thread
From: Peter Zijlstra @ 2023-05-31 11:58 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault, tglx

Removes the FAIR_SLEEPERS code in favour of the new LAG based
placement.

Specifically, the whole FAIR_SLEEPER thing was a very crude approximation to
make up for the lack of lag based placement, specifically the 'service owed'
part. This is important for things like 'starve' and 'hackbench'.

One side effect of FAIR_SLEEPER is that it caused 'small' unfairness:
by always ignoring up to 'thresh' of sleeptime it would produce a 50%/50%
time distribution for a 50% sleeper vs a 100% runner, while strictly speaking
this should (of course) result in a 33%/67% split (as CFS will also do if the
sleep period exceeds 'thresh').
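
To spell out the 33%/67% number: take B to be a task that runs for r and then
sleeps for r, competing against an always-runnable task A. Under strict
fairness B gets half the CPU while runnable, so it needs 2r of wall time to
consume r of service and then sleeps for r; over each 3r cycle A receives 2r
and B receives r. The old sleeper credit of up to 'thresh' effectively refunds
B's sleep on wakeup, which is what pushed the distribution back towards
50%/50%.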

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c     |   59 ------------------------------------------------
 kernel/sched/features.h |    8 ------
 2 files changed, 1 insertion(+), 66 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5068,29 +5068,6 @@ static void check_spread(struct cfs_rq *
 #endif
 }
 
-static inline bool entity_is_long_sleeper(struct sched_entity *se)
-{
-	struct cfs_rq *cfs_rq;
-	u64 sleep_time;
-
-	if (se->exec_start == 0)
-		return false;
-
-	cfs_rq = cfs_rq_of(se);
-
-	sleep_time = rq_clock_task(rq_of(cfs_rq));
-
-	/* Happen while migrating because of clock task divergence */
-	if (sleep_time <= se->exec_start)
-		return false;
-
-	sleep_time -= se->exec_start;
-	if (sleep_time > ((1ULL << 63) / scale_load_down(NICE_0_LOAD)))
-		return true;
-
-	return false;
-}
-
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
@@ -5172,43 +5149,9 @@ place_entity(struct cfs_rq *cfs_rq, stru
 		if (WARN_ON_ONCE(!load))
 			load = 1;
 		lag = div_s64(lag, load);
-
-		vruntime -= lag;
-	}
-
-	if (sched_feat(FAIR_SLEEPERS)) {
-
-		/* sleeps up to a single latency don't count. */
-		if (!initial) {
-			unsigned long thresh;
-
-			if (se_is_idle(se))
-				thresh = sysctl_sched_min_granularity;
-			else
-				thresh = sysctl_sched_latency;
-
-			/*
-			 * Halve their sleep time's effect, to allow
-			 * for a gentler effect of sleepers:
-			 */
-			if (sched_feat(GENTLE_FAIR_SLEEPERS))
-				thresh >>= 1;
-
-			vruntime -= thresh;
-		}
-
-		/*
-		 * Pull vruntime of the entity being placed to the base level of
-		 * cfs_rq, to prevent boosting it if placed backwards.  If the entity
-		 * slept for a long time, don't even try to compare its vruntime with
-		 * the base as it may be too far off and the comparison may get
-		 * inversed due to s64 overflow.
-		 */
-		if (!entity_is_long_sleeper(se))
-			vruntime = max_vruntime(se->vruntime, vruntime);
 	}
 
-	se->vruntime = vruntime;
+	se->vruntime = vruntime - lag;
 
 	/*
 	 * When joining the competition, the existing tasks will be,
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -1,14 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
 /*
- * Only give sleepers 50% of their service deficit. This allows
- * them to run sooner, but does not allow tons of sleepers to
- * rip the spread apart.
- */
-SCHED_FEAT(FAIR_SLEEPERS, false)
-SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true)
-
-/*
  * Using the avg_vruntime, do the right thing and preserve lag across
  * sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
  */



^ permalink raw reply	[flat|nested] 104+ messages in thread

* [PATCH 07/15] sched/smp: Use lag to simplify cross-runqueue placement
  2023-05-31 11:58 [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr Peter Zijlstra
                   ` (5 preceding siblings ...)
  2023-05-31 11:58 ` [PATCH 06/15] sched: Commit to lag based placement Peter Zijlstra
@ 2023-05-31 11:58 ` Peter Zijlstra
  2023-08-10  7:10   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
                     ` (3 more replies)
  2023-05-31 11:58 ` [PATCH 08/15] sched: Commit to EEVDF Peter Zijlstra
                   ` (8 subsequent siblings)
  15 siblings, 4 replies; 104+ messages in thread
From: Peter Zijlstra @ 2023-05-31 11:58 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault, tglx

Using lag is both more correct and simpler when moving between
runqueues.

Notably, min_vruntime() was invented as a cheap approximation of
avg_vruntime() for this very purpose (SMP migration). Since we now
have the real thing, use it.
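
Roughly, the migration path then becomes (a conceptual sketch, not the literal
call chain):

/*
 *   dequeue_entity(src_cfs_rq, se)
 *     update_entity_lag(src_cfs_rq, se);     se->vlag = V_src - v_i
 *
 *   ... task migrates; vruntime itself is no longer touched ...
 *
 *   enqueue_entity(dst_cfs_rq, se)
 *     place_entity(dst_cfs_rq, se, 0);       v_i = V_dst - (compensated) se->vlag
 *
 * IOW lag, not absolute vruntime, is what is carried across runqueues, which
 * is what lets the min_vruntime +/- dance and vruntime_normalized() go away.
 */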

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |  145 ++++++----------------------------------------------
 1 file changed, 19 insertions(+), 126 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5083,7 +5083,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
 	 *
 	 * EEVDF: placement strategy #1 / #2
 	 */
-	if (sched_feat(PLACE_LAG) && cfs_rq->nr_running > 1) {
+	if (sched_feat(PLACE_LAG) && cfs_rq->nr_running) {
 		struct sched_entity *curr = cfs_rq->curr;
 		unsigned long load;
 
@@ -5171,61 +5171,21 @@ static void check_enqueue_throttle(struc
 
 static inline bool cfs_bandwidth_used(void);
 
-/*
- * MIGRATION
- *
- *	dequeue
- *	  update_curr()
- *	    update_min_vruntime()
- *	  vruntime -= min_vruntime
- *
- *	enqueue
- *	  update_curr()
- *	    update_min_vruntime()
- *	  vruntime += min_vruntime
- *
- * this way the vruntime transition between RQs is done when both
- * min_vruntime are up-to-date.
- *
- * WAKEUP (remote)
- *
- *	->migrate_task_rq_fair() (p->state == TASK_WAKING)
- *	  vruntime -= min_vruntime
- *
- *	enqueue
- *	  update_curr()
- *	    update_min_vruntime()
- *	  vruntime += min_vruntime
- *
- * this way we don't have the most up-to-date min_vruntime on the originating
- * CPU and an up-to-date min_vruntime on the destination CPU.
- */
-
 static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
-	bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED);
 	bool curr = cfs_rq->curr == se;
 
 	/*
 	 * If we're the current task, we must renormalise before calling
 	 * update_curr().
 	 */
-	if (renorm && curr)
-		se->vruntime += cfs_rq->min_vruntime;
+	if (curr)
+		place_entity(cfs_rq, se, 0);
 
 	update_curr(cfs_rq);
 
 	/*
-	 * Otherwise, renormalise after, such that we're placed at the current
-	 * moment in time, instead of some random moment in the past. Being
-	 * placed in the past could significantly boost this task to the
-	 * fairness detriment of existing tasks.
-	 */
-	if (renorm && !curr)
-		se->vruntime += cfs_rq->min_vruntime;
-
-	/*
 	 * When enqueuing a sched_entity, we must:
 	 *   - Update loads to have both entity and cfs_rq synced with now.
 	 *   - For group_entity, update its runnable_weight to reflect the new
@@ -5236,11 +5196,22 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	 */
 	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
 	se_update_runnable(se);
+	/*
+	 * XXX update_load_avg() above will have attached us to the pelt sum;
+	 * but update_cfs_group() here will re-adjust the weight and have to
+	 * undo/redo all that. Seems wasteful.
+	 */
 	update_cfs_group(se);
-	account_entity_enqueue(cfs_rq, se);
 
-	if (flags & ENQUEUE_WAKEUP)
+	/*
+	 * XXX now that the entity has been re-weighted, and its lag adjusted,
+	 * we can place the entity.
+	 */
+	if (!curr)
 		place_entity(cfs_rq, se, 0);
+
+	account_entity_enqueue(cfs_rq, se);
+
 	/* Entity has migrated, no longer consider this task hot */
 	if (flags & ENQUEUE_MIGRATED)
 		se->exec_start = 0;
@@ -5335,23 +5306,12 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 
 	clear_buddies(cfs_rq, se);
 
-	if (flags & DEQUEUE_SLEEP)
-		update_entity_lag(cfs_rq, se);
-
+	update_entity_lag(cfs_rq, se);
 	if (se != cfs_rq->curr)
 		__dequeue_entity(cfs_rq, se);
 	se->on_rq = 0;
 	account_entity_dequeue(cfs_rq, se);
 
-	/*
-	 * Normalize after update_curr(); which will also have moved
-	 * min_vruntime if @se is the one holding it back. But before doing
-	 * update_min_vruntime() again, which will discount @se's position and
-	 * can move min_vruntime forward still more.
-	 */
-	if (!(flags & DEQUEUE_SLEEP))
-		se->vruntime -= cfs_rq->min_vruntime;
-
 	/* return excess runtime on last dequeue */
 	return_cfs_rq_runtime(cfs_rq);
 
@@ -8102,18 +8062,6 @@ static void migrate_task_rq_fair(struct
 {
 	struct sched_entity *se = &p->se;
 
-	/*
-	 * As blocked tasks retain absolute vruntime the migration needs to
-	 * deal with this by subtracting the old and adding the new
-	 * min_vruntime -- the latter is done by enqueue_entity() when placing
-	 * the task on the new runqueue.
-	 */
-	if (READ_ONCE(p->__state) == TASK_WAKING) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
-
-		se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
-	}
-
 	if (!task_on_rq_migrating(p)) {
 		remove_entity_load_avg(se);
 
@@ -12482,8 +12430,8 @@ static void task_tick_fair(struct rq *rq
  */
 static void task_fork_fair(struct task_struct *p)
 {
-	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se, *curr;
+	struct cfs_rq *cfs_rq;
 	struct rq *rq = this_rq();
 	struct rq_flags rf;
 
@@ -12492,22 +12440,9 @@ static void task_fork_fair(struct task_s
 
 	cfs_rq = task_cfs_rq(current);
 	curr = cfs_rq->curr;
-	if (curr) {
+	if (curr)
 		update_curr(cfs_rq);
-		se->vruntime = curr->vruntime;
-	}
 	place_entity(cfs_rq, se, 1);
-
-	if (sysctl_sched_child_runs_first && curr && entity_before(curr, se)) {
-		/*
-		 * Upon rescheduling, sched_class::put_prev_task() will place
-		 * 'current' within the tree based on its new key value.
-		 */
-		swap(curr->vruntime, se->vruntime);
-		resched_curr(rq);
-	}
-
-	se->vruntime -= cfs_rq->min_vruntime;
 	rq_unlock(rq, &rf);
 }
 
@@ -12536,34 +12471,6 @@ prio_changed_fair(struct rq *rq, struct
 		check_preempt_curr(rq, p, 0);
 }
 
-static inline bool vruntime_normalized(struct task_struct *p)
-{
-	struct sched_entity *se = &p->se;
-
-	/*
-	 * In both the TASK_ON_RQ_QUEUED and TASK_ON_RQ_MIGRATING cases,
-	 * the dequeue_entity(.flags=0) will already have normalized the
-	 * vruntime.
-	 */
-	if (p->on_rq)
-		return true;
-
-	/*
-	 * When !on_rq, vruntime of the task has usually NOT been normalized.
-	 * But there are some cases where it has already been normalized:
-	 *
-	 * - A forked child which is waiting for being woken up by
-	 *   wake_up_new_task().
-	 * - A task which has been woken up by try_to_wake_up() and
-	 *   waiting for actually being woken up by sched_ttwu_pending().
-	 */
-	if (!se->sum_exec_runtime ||
-	    (READ_ONCE(p->__state) == TASK_WAKING && p->sched_remote_wakeup))
-		return true;
-
-	return false;
-}
-
 #ifdef CONFIG_FAIR_GROUP_SCHED
 /*
  * Propagate the changes of the sched_entity across the tg tree to make it
@@ -12634,16 +12541,6 @@ static void attach_entity_cfs_rq(struct
 static void detach_task_cfs_rq(struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-
-	if (!vruntime_normalized(p)) {
-		/*
-		 * Fix up our vruntime so that the current sleep doesn't
-		 * cause 'unlimited' sleep bonus.
-		 */
-		place_entity(cfs_rq, se, 0);
-		se->vruntime -= cfs_rq->min_vruntime;
-	}
 
 	detach_entity_cfs_rq(se);
 }
@@ -12651,12 +12548,8 @@ static void detach_task_cfs_rq(struct ta
 static void attach_task_cfs_rq(struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 	attach_entity_cfs_rq(se);
-
-	if (!vruntime_normalized(p))
-		se->vruntime += cfs_rq->min_vruntime;
 }
 
 static void switched_from_fair(struct rq *rq, struct task_struct *p)



^ permalink raw reply	[flat|nested] 104+ messages in thread

* [PATCH 08/15] sched: Commit to EEVDF
  2023-05-31 11:58 [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr Peter Zijlstra
                   ` (6 preceding siblings ...)
  2023-05-31 11:58 ` [PATCH 07/15] sched/smp: Use lag to simplify cross-runqueue placement Peter Zijlstra
@ 2023-05-31 11:58 ` Peter Zijlstra
  2023-06-16 21:23   ` Joel Fernandes
  2023-08-10  7:10   ` [tip: sched/core] sched/fair: " tip-bot2 for Peter Zijlstra
  2023-05-31 11:58 ` [PATCH 09/15] sched/debug: Rename min_granularity to base_slice Peter Zijlstra
                   ` (7 subsequent siblings)
  15 siblings, 2 replies; 104+ messages in thread
From: Peter Zijlstra @ 2023-05-31 11:58 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault, tglx

EEVDF is a better-defined scheduling policy; as a result it has fewer
heuristics/tunables. There is no compelling reason to keep CFS around.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/debug.c    |    6 
 kernel/sched/fair.c     |  465 +++---------------------------------------------
 kernel/sched/features.h |   12 -
 kernel/sched/sched.h    |    5 
 4 files changed, 38 insertions(+), 450 deletions(-)

--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -347,10 +347,7 @@ static __init int sched_init_debug(void)
 	debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
 #endif
 
-	debugfs_create_u32("latency_ns", 0644, debugfs_sched, &sysctl_sched_latency);
 	debugfs_create_u32("min_granularity_ns", 0644, debugfs_sched, &sysctl_sched_min_granularity);
-	debugfs_create_u32("idle_min_granularity_ns", 0644, debugfs_sched, &sysctl_sched_idle_min_granularity);
-	debugfs_create_u32("wakeup_granularity_ns", 0644, debugfs_sched, &sysctl_sched_wakeup_granularity);
 
 	debugfs_create_u32("latency_warn_ms", 0644, debugfs_sched, &sysctl_resched_latency_warn_ms);
 	debugfs_create_u32("latency_warn_once", 0644, debugfs_sched, &sysctl_resched_latency_warn_once);
@@ -865,10 +862,7 @@ static void sched_debug_header(struct se
 	SEQ_printf(m, "  .%-40s: %Ld\n", #x, (long long)(x))
 #define PN(x) \
 	SEQ_printf(m, "  .%-40s: %Ld.%06ld\n", #x, SPLIT_NS(x))
-	PN(sysctl_sched_latency);
 	PN(sysctl_sched_min_granularity);
-	PN(sysctl_sched_idle_min_granularity);
-	PN(sysctl_sched_wakeup_granularity);
 	P(sysctl_sched_child_runs_first);
 	P(sysctl_sched_features);
 #undef PN
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -58,22 +58,6 @@
 #include "autogroup.h"
 
 /*
- * Targeted preemption latency for CPU-bound tasks:
- *
- * NOTE: this latency value is not the same as the concept of
- * 'timeslice length' - timeslices in CFS are of variable length
- * and have no persistent notion like in traditional, time-slice
- * based scheduling concepts.
- *
- * (to see the precise effective timeslice length of your workload,
- *  run vmstat and monitor the context-switches (cs) field)
- *
- * (default: 6ms * (1 + ilog(ncpus)), units: nanoseconds)
- */
-unsigned int sysctl_sched_latency			= 6000000ULL;
-static unsigned int normalized_sysctl_sched_latency	= 6000000ULL;
-
-/*
  * The initial- and re-scaling of tunables is configurable
  *
  * Options are:
@@ -95,36 +79,11 @@ unsigned int sysctl_sched_min_granularit
 static unsigned int normalized_sysctl_sched_min_granularity	= 750000ULL;
 
 /*
- * Minimal preemption granularity for CPU-bound SCHED_IDLE tasks.
- * Applies only when SCHED_IDLE tasks compete with normal tasks.
- *
- * (default: 0.75 msec)
- */
-unsigned int sysctl_sched_idle_min_granularity			= 750000ULL;
-
-/*
- * This value is kept at sysctl_sched_latency/sysctl_sched_min_granularity
- */
-static unsigned int sched_nr_latency = 8;
-
-/*
  * After fork, child runs first. If set to 0 (default) then
  * parent will (try to) run first.
  */
 unsigned int sysctl_sched_child_runs_first __read_mostly;
 
-/*
- * SCHED_OTHER wake-up granularity.
- *
- * This option delays the preemption effects of decoupled workloads
- * and reduces their over-scheduling. Synchronous workloads will still
- * have immediate wakeup/sleep latencies.
- *
- * (default: 1 msec * (1 + ilog(ncpus)), units: nanoseconds)
- */
-unsigned int sysctl_sched_wakeup_granularity			= 1000000UL;
-static unsigned int normalized_sysctl_sched_wakeup_granularity	= 1000000UL;
-
 const_debug unsigned int sysctl_sched_migration_cost	= 500000UL;
 
 int sched_thermal_decay_shift;
@@ -279,8 +238,6 @@ static void update_sysctl(void)
 #define SET_SYSCTL(name) \
 	(sysctl_##name = (factor) * normalized_sysctl_##name)
 	SET_SYSCTL(sched_min_granularity);
-	SET_SYSCTL(sched_latency);
-	SET_SYSCTL(sched_wakeup_granularity);
 #undef SET_SYSCTL
 }
 
@@ -888,30 +845,6 @@ struct sched_entity *__pick_first_entity
 	return __node_2_se(left);
 }
 
-static struct sched_entity *__pick_next_entity(struct sched_entity *se)
-{
-	struct rb_node *next = rb_next(&se->run_node);
-
-	if (!next)
-		return NULL;
-
-	return __node_2_se(next);
-}
-
-static struct sched_entity *pick_cfs(struct cfs_rq *cfs_rq, struct sched_entity *curr)
-{
-	struct sched_entity *left = __pick_first_entity(cfs_rq);
-
-	/*
-	 * If curr is set we have to see if its left of the leftmost entity
-	 * still in the tree, provided there was anything in the tree at all.
-	 */
-	if (!left || (curr && entity_before(curr, left)))
-		left = curr;
-
-	return left;
-}
-
 /*
  * Earliest Eligible Virtual Deadline First
  *
@@ -1008,85 +941,15 @@ int sched_update_scaling(void)
 {
 	unsigned int factor = get_update_sysctl_factor();
 
-	sched_nr_latency = DIV_ROUND_UP(sysctl_sched_latency,
-					sysctl_sched_min_granularity);
-
 #define WRT_SYSCTL(name) \
 	(normalized_sysctl_##name = sysctl_##name / (factor))
 	WRT_SYSCTL(sched_min_granularity);
-	WRT_SYSCTL(sched_latency);
-	WRT_SYSCTL(sched_wakeup_granularity);
 #undef WRT_SYSCTL
 
 	return 0;
 }
 #endif
 
-/*
- * The idea is to set a period in which each task runs once.
- *
- * When there are too many tasks (sched_nr_latency) we have to stretch
- * this period because otherwise the slices get too small.
- *
- * p = (nr <= nl) ? l : l*nr/nl
- */
-static u64 __sched_period(unsigned long nr_running)
-{
-	if (unlikely(nr_running > sched_nr_latency))
-		return nr_running * sysctl_sched_min_granularity;
-	else
-		return sysctl_sched_latency;
-}
-
-static bool sched_idle_cfs_rq(struct cfs_rq *cfs_rq);
-
-/*
- * We calculate the wall-time slice from the period by taking a part
- * proportional to the weight.
- *
- * s = p*P[w/rw]
- */
-static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-	unsigned int nr_running = cfs_rq->nr_running;
-	struct sched_entity *init_se = se;
-	unsigned int min_gran;
-	u64 slice;
-
-	if (sched_feat(ALT_PERIOD))
-		nr_running = rq_of(cfs_rq)->cfs.h_nr_running;
-
-	slice = __sched_period(nr_running + !se->on_rq);
-
-	for_each_sched_entity(se) {
-		struct load_weight *load;
-		struct load_weight lw;
-		struct cfs_rq *qcfs_rq;
-
-		qcfs_rq = cfs_rq_of(se);
-		load = &qcfs_rq->load;
-
-		if (unlikely(!se->on_rq)) {
-			lw = qcfs_rq->load;
-
-			update_load_add(&lw, se->load.weight);
-			load = &lw;
-		}
-		slice = __calc_delta(slice, se->load.weight, load);
-	}
-
-	if (sched_feat(BASE_SLICE)) {
-		if (se_is_idle(init_se) && !sched_idle_cfs_rq(cfs_rq))
-			min_gran = sysctl_sched_idle_min_granularity;
-		else
-			min_gran = sysctl_sched_min_granularity;
-
-		slice = max_t(u64, slice, min_gran);
-	}
-
-	return slice;
-}
-
 static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
 
 /*
@@ -1098,35 +961,25 @@ static void update_deadline(struct cfs_r
 	if ((s64)(se->vruntime - se->deadline) < 0)
 		return;
 
-	if (sched_feat(EEVDF)) {
-		/*
-		 * For EEVDF the virtual time slope is determined by w_i (iow.
-		 * nice) while the request time r_i is determined by
-		 * sysctl_sched_min_granularity.
-		 */
-		se->slice = sysctl_sched_min_granularity;
-
-		/*
-		 * The task has consumed its request, reschedule.
-		 */
-		if (cfs_rq->nr_running > 1) {
-			resched_curr(rq_of(cfs_rq));
-			clear_buddies(cfs_rq, se);
-		}
-	} else {
-		/*
-		 * When many tasks blow up the sched_period; it is possible
-		 * that sched_slice() reports unusually large results (when
-		 * many tasks are very light for example). Therefore impose a
-		 * maximum.
-		 */
-		se->slice = min_t(u64, sched_slice(cfs_rq, se), sysctl_sched_latency);
-	}
+	/*
+	 * For EEVDF the virtual time slope is determined by w_i (iow.
+	 * nice) while the request time r_i is determined by
+	 * sysctl_sched_min_granularity.
+	 */
+	se->slice = sysctl_sched_min_granularity;
 
 	/*
 	 * EEVDF: vd_i = ve_i + r_i / w_i
 	 */
 	se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
+
+	/*
+	 * The task has consumed its request, reschedule.
+	 */
+	if (cfs_rq->nr_running > 1) {
+		resched_curr(rq_of(cfs_rq));
+		clear_buddies(cfs_rq, se);
+	}
 }
 
 #include "pelt.h"
@@ -5055,19 +4908,6 @@ static inline void update_misfit_status(
 
 #endif /* CONFIG_SMP */
 
-static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-#ifdef CONFIG_SCHED_DEBUG
-	s64 d = se->vruntime - cfs_rq->min_vruntime;
-
-	if (d < 0)
-		d = -d;
-
-	if (d > 3*sysctl_sched_latency)
-		schedstat_inc(cfs_rq->nr_spread_over);
-#endif
-}
-
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
@@ -5218,7 +5058,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 
 	check_schedstat_required();
 	update_stats_enqueue_fair(cfs_rq, se, flags);
-	check_spread(cfs_rq, se);
 	if (!curr)
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
@@ -5230,17 +5069,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	}
 }
 
-static void __clear_buddies_last(struct sched_entity *se)
-{
-	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
-		if (cfs_rq->last != se)
-			break;
-
-		cfs_rq->last = NULL;
-	}
-}
-
 static void __clear_buddies_next(struct sched_entity *se)
 {
 	for_each_sched_entity(se) {
@@ -5252,27 +5080,10 @@ static void __clear_buddies_next(struct
 	}
 }
 
-static void __clear_buddies_skip(struct sched_entity *se)
-{
-	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
-		if (cfs_rq->skip != se)
-			break;
-
-		cfs_rq->skip = NULL;
-	}
-}
-
 static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if (cfs_rq->last == se)
-		__clear_buddies_last(se);
-
 	if (cfs_rq->next == se)
 		__clear_buddies_next(se);
-
-	if (cfs_rq->skip == se)
-		__clear_buddies_skip(se);
 }
 
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
@@ -5330,45 +5141,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 		update_idle_cfs_rq_clock_pelt(cfs_rq);
 }
 
-/*
- * Preempt the current task with a newly woken task if needed:
- */
-static void
-check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
-{
-	unsigned long delta_exec;
-	struct sched_entity *se;
-	s64 delta;
-
-	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
-	if (delta_exec > curr->slice) {
-		resched_curr(rq_of(cfs_rq));
-		/*
-		 * The current task ran long enough, ensure it doesn't get
-		 * re-elected due to buddy favours.
-		 */
-		clear_buddies(cfs_rq, curr);
-		return;
-	}
-
-	/*
-	 * Ensure that a task that missed wakeup preemption by a
-	 * narrow margin doesn't have to wait for a full slice.
-	 * This also mitigates buddy induced latencies under load.
-	 */
-	if (delta_exec < sysctl_sched_min_granularity)
-		return;
-
-	se = __pick_first_entity(cfs_rq);
-	delta = curr->vruntime - se->vruntime;
-
-	if (delta < 0)
-		return;
-
-	if (delta > curr->slice)
-		resched_curr(rq_of(cfs_rq));
-}
-
 static void
 set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
@@ -5407,9 +5179,6 @@ set_next_entity(struct cfs_rq *cfs_rq, s
 	se->prev_sum_exec_runtime = se->sum_exec_runtime;
 }
 
-static int
-wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
-
 /*
  * Pick the next process, keeping these things in mind, in this order:
  * 1) keep things fair between processes/task groups
@@ -5420,53 +5189,14 @@ wakeup_preempt_entity(struct sched_entit
 static struct sched_entity *
 pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 {
-	struct sched_entity *left, *se;
-
-	if (sched_feat(EEVDF)) {
-		/*
-		 * Enabling NEXT_BUDDY will affect latency but not fairness.
-		 */
-		if (sched_feat(NEXT_BUDDY) &&
-		    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
-			return cfs_rq->next;
-
-		return pick_eevdf(cfs_rq);
-	}
-
-	se = left = pick_cfs(cfs_rq, curr);
-
 	/*
-	 * Avoid running the skip buddy, if running something else can
-	 * be done without getting too unfair.
+	 * Enabling NEXT_BUDDY will affect latency but not fairness.
 	 */
-	if (cfs_rq->skip && cfs_rq->skip == se) {
-		struct sched_entity *second;
+	if (sched_feat(NEXT_BUDDY) &&
+	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
+		return cfs_rq->next;
 
-		if (se == curr) {
-			second = __pick_first_entity(cfs_rq);
-		} else {
-			second = __pick_next_entity(se);
-			if (!second || (curr && entity_before(curr, second)))
-				second = curr;
-		}
-
-		if (second && wakeup_preempt_entity(second, left) < 1)
-			se = second;
-	}
-
-	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
-		/*
-		 * Someone really wants this to run. If it's not unfair, run it.
-		 */
-		se = cfs_rq->next;
-	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
-		/*
-		 * Prefer last buddy, try to return the CPU to a preempted task.
-		 */
-		se = cfs_rq->last;
-	}
-
-	return se;
+	return pick_eevdf(cfs_rq);
 }
 
 static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
@@ -5483,8 +5213,6 @@ static void put_prev_entity(struct cfs_r
 	/* throttle cfs_rqs exceeding runtime */
 	check_cfs_rq_runtime(cfs_rq);
 
-	check_spread(cfs_rq, prev);
-
 	if (prev->on_rq) {
 		update_stats_wait_start_fair(cfs_rq, prev);
 		/* Put 'current' back into the tree. */
@@ -5525,9 +5253,6 @@ entity_tick(struct cfs_rq *cfs_rq, struc
 			hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
 		return;
 #endif
-
-	if (!sched_feat(EEVDF) && cfs_rq->nr_running > 1)
-		check_preempt_tick(cfs_rq, curr);
 }
 
 
@@ -6561,8 +6286,7 @@ static void hrtick_update(struct rq *rq)
 	if (!hrtick_enabled_fair(rq) || curr->sched_class != &fair_sched_class)
 		return;
 
-	if (cfs_rq_of(&curr->se)->nr_running < sched_nr_latency)
-		hrtick_start_fair(rq, curr);
+	hrtick_start_fair(rq, curr);
 }
 #else /* !CONFIG_SCHED_HRTICK */
 static inline void
@@ -6603,17 +6327,6 @@ static int sched_idle_rq(struct rq *rq)
 			rq->nr_running);
 }
 
-/*
- * Returns true if cfs_rq only has SCHED_IDLE entities enqueued. Note the use
- * of idle_nr_running, which does not consider idle descendants of normal
- * entities.
- */
-static bool sched_idle_cfs_rq(struct cfs_rq *cfs_rq)
-{
-	return cfs_rq->nr_running &&
-		cfs_rq->nr_running == cfs_rq->idle_nr_running;
-}
-
 #ifdef CONFIG_SMP
 static int sched_idle_cpu(int cpu)
 {
@@ -8099,66 +7812,6 @@ balance_fair(struct rq *rq, struct task_
 }
 #endif /* CONFIG_SMP */
 
-static unsigned long wakeup_gran(struct sched_entity *se)
-{
-	unsigned long gran = sysctl_sched_wakeup_granularity;
-
-	/*
-	 * Since its curr running now, convert the gran from real-time
-	 * to virtual-time in his units.
-	 *
-	 * By using 'se' instead of 'curr' we penalize light tasks, so
-	 * they get preempted easier. That is, if 'se' < 'curr' then
-	 * the resulting gran will be larger, therefore penalizing the
-	 * lighter, if otoh 'se' > 'curr' then the resulting gran will
-	 * be smaller, again penalizing the lighter task.
-	 *
-	 * This is especially important for buddies when the leftmost
-	 * task is higher priority than the buddy.
-	 */
-	return calc_delta_fair(gran, se);
-}
-
-/*
- * Should 'se' preempt 'curr'.
- *
- *             |s1
- *        |s2
- *   |s3
- *         g
- *      |<--->|c
- *
- *  w(c, s1) = -1
- *  w(c, s2) =  0
- *  w(c, s3) =  1
- *
- */
-static int
-wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
-{
-	s64 gran, vdiff = curr->vruntime - se->vruntime;
-
-	if (vdiff <= 0)
-		return -1;
-
-	gran = wakeup_gran(se);
-	if (vdiff > gran)
-		return 1;
-
-	return 0;
-}
-
-static void set_last_buddy(struct sched_entity *se)
-{
-	for_each_sched_entity(se) {
-		if (SCHED_WARN_ON(!se->on_rq))
-			return;
-		if (se_is_idle(se))
-			return;
-		cfs_rq_of(se)->last = se;
-	}
-}
-
 static void set_next_buddy(struct sched_entity *se)
 {
 	for_each_sched_entity(se) {
@@ -8170,12 +7823,6 @@ static void set_next_buddy(struct sched_
 	}
 }
 
-static void set_skip_buddy(struct sched_entity *se)
-{
-	for_each_sched_entity(se)
-		cfs_rq_of(se)->skip = se;
-}
-
 /*
  * Preempt the current task with a newly woken task if needed:
  */
@@ -8184,7 +7831,6 @@ static void check_preempt_wakeup(struct
 	struct task_struct *curr = rq->curr;
 	struct sched_entity *se = &curr->se, *pse = &p->se;
 	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
-	int scale = cfs_rq->nr_running >= sched_nr_latency;
 	int next_buddy_marked = 0;
 	int cse_is_idle, pse_is_idle;
 
@@ -8200,7 +7846,7 @@ static void check_preempt_wakeup(struct
 	if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
 		return;
 
-	if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
+	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK)) {
 		set_next_buddy(pse);
 		next_buddy_marked = 1;
 	}
@@ -8248,44 +7894,16 @@ static void check_preempt_wakeup(struct
 	cfs_rq = cfs_rq_of(se);
 	update_curr(cfs_rq);
 
-	if (sched_feat(EEVDF)) {
-		/*
-		 * XXX pick_eevdf(cfs_rq) != se ?
-		 */
-		if (pick_eevdf(cfs_rq) == pse)
-			goto preempt;
-
-		return;
-	}
-
-	if (wakeup_preempt_entity(se, pse) == 1) {
-		/*
-		 * Bias pick_next to pick the sched entity that is
-		 * triggering this preemption.
-		 */
-		if (!next_buddy_marked)
-			set_next_buddy(pse);
+	/*
+	 * XXX pick_eevdf(cfs_rq) != se ?
+	 */
+	if (pick_eevdf(cfs_rq) == pse)
 		goto preempt;
-	}
 
 	return;
 
 preempt:
 	resched_curr(rq);
-	/*
-	 * Only set the backward buddy when the current task is still
-	 * on the rq. This can happen when a wakeup gets interleaved
-	 * with schedule on the ->pre_schedule() or idle_balance()
-	 * point, either of which can * drop the rq lock.
-	 *
-	 * Also, during early boot the idle thread is in the fair class,
-	 * for obvious reasons its a bad idea to schedule back to it.
-	 */
-	if (unlikely(!se->on_rq || curr == rq->idle))
-		return;
-
-	if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
-		set_last_buddy(se);
 }
 
 #ifdef CONFIG_SMP
@@ -8486,8 +8104,6 @@ static void put_prev_task_fair(struct rq
 
 /*
  * sched_yield() is very simple
- *
- * The magic of dealing with the ->skip buddy is in pick_next_entity.
  */
 static void yield_task_fair(struct rq *rq)
 {
@@ -8503,23 +8119,19 @@ static void yield_task_fair(struct rq *r
 
 	clear_buddies(cfs_rq, se);
 
-	if (sched_feat(EEVDF) || curr->policy != SCHED_BATCH) {
-		update_rq_clock(rq);
-		/*
-		 * Update run-time statistics of the 'current'.
-		 */
-		update_curr(cfs_rq);
-		/*
-		 * Tell update_rq_clock() that we've just updated,
-		 * so we don't do microscopic update in schedule()
-		 * and double the fastpath cost.
-		 */
-		rq_clock_skip_update(rq);
-	}
-	if (sched_feat(EEVDF))
-		se->deadline += calc_delta_fair(se->slice, se);
+	update_rq_clock(rq);
+	/*
+	 * Update run-time statistics of the 'current'.
+	 */
+	update_curr(cfs_rq);
+	/*
+	 * Tell update_rq_clock() that we've just updated,
+	 * so we don't do microscopic update in schedule()
+	 * and double the fastpath cost.
+	 */
+	rq_clock_skip_update(rq);
 
-	set_skip_buddy(se);
+	se->deadline += calc_delta_fair(se->slice, se);
 }
 
 static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
@@ -8762,8 +8374,7 @@ static int task_hot(struct task_struct *
 	 * Buddy candidates are cache hot:
 	 */
 	if (sched_feat(CACHE_HOT_BUDDY) && env->dst_rq->nr_running &&
-			(&p->se == cfs_rq_of(&p->se)->next ||
-			 &p->se == cfs_rq_of(&p->se)->last))
+	    (&p->se == cfs_rq_of(&p->se)->next))
 		return 1;
 
 	if (sysctl_sched_migration_cost == -1)
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -15,13 +15,6 @@ SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
 SCHED_FEAT(NEXT_BUDDY, false)
 
 /*
- * Prefer to schedule the task that ran last (when we did
- * wake-preempt) as that likely will touch the same data, increases
- * cache locality.
- */
-SCHED_FEAT(LAST_BUDDY, true)
-
-/*
  * Consider buddies to be cache hot, decreases the likeliness of a
  * cache buddy being migrated away, increases cache locality.
  */
@@ -93,8 +86,3 @@ SCHED_FEAT(UTIL_EST, true)
 SCHED_FEAT(UTIL_EST_FASTUP, true)
 
 SCHED_FEAT(LATENCY_WARN, false)
-
-SCHED_FEAT(ALT_PERIOD, true)
-SCHED_FEAT(BASE_SLICE, true)
-
-SCHED_FEAT(EEVDF, true)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -576,8 +576,6 @@ struct cfs_rq {
 	 */
 	struct sched_entity	*curr;
 	struct sched_entity	*next;
-	struct sched_entity	*last;
-	struct sched_entity	*skip;
 
 #ifdef	CONFIG_SCHED_DEBUG
 	unsigned int		nr_spread_over;
@@ -2484,9 +2482,6 @@ extern const_debug unsigned int sysctl_s
 extern unsigned int sysctl_sched_min_granularity;
 
 #ifdef CONFIG_SCHED_DEBUG
-extern unsigned int sysctl_sched_latency;
-extern unsigned int sysctl_sched_idle_min_granularity;
-extern unsigned int sysctl_sched_wakeup_granularity;
 extern int sysctl_resched_latency_warn_ms;
 extern int sysctl_resched_latency_warn_once;
 



^ permalink raw reply	[flat|nested] 104+ messages in thread

* [PATCH 09/15] sched/debug: Rename min_granularity to base_slice
  2023-05-31 11:58 [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr Peter Zijlstra
                   ` (7 preceding siblings ...)
  2023-05-31 11:58 ` [PATCH 08/15] sched: Commit to EEVDF Peter Zijlstra
@ 2023-05-31 11:58 ` Peter Zijlstra
  2023-08-10  7:10   ` [tip: sched/core] sched/debug: Rename sysctl_sched_min_granularity to sysctl_sched_base_slice tip-bot2 for Peter Zijlstra
  2023-05-31 11:58 ` [PATCH 10/15] sched/fair: Propagate enqueue flags into place_entity() Peter Zijlstra
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 104+ messages in thread
From: Peter Zijlstra @ 2023-05-31 11:58 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault, tglx

EEVDF uses this tunable as the base request/slice -- make sure the
name reflects this.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  |    2 +-
 kernel/sched/debug.c |    4 ++--
 kernel/sched/fair.c  |   12 ++++++------
 kernel/sched/sched.h |    2 +-
 4 files changed, 10 insertions(+), 10 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4464,7 +4464,7 @@ static void __sched_fork(unsigned long c
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
 	p->se.vlag			= 0;
-	p->se.slice			= sysctl_sched_min_granularity;
+	p->se.slice			= sysctl_sched_base_slice;
 	INIT_LIST_HEAD(&p->se.group_node);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -347,7 +347,7 @@ static __init int sched_init_debug(void)
 	debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
 #endif
 
-	debugfs_create_u32("min_granularity_ns", 0644, debugfs_sched, &sysctl_sched_min_granularity);
+	debugfs_create_u32("base_slice_ns", 0644, debugfs_sched, &sysctl_sched_base_slice);
 
 	debugfs_create_u32("latency_warn_ms", 0644, debugfs_sched, &sysctl_resched_latency_warn_ms);
 	debugfs_create_u32("latency_warn_once", 0644, debugfs_sched, &sysctl_resched_latency_warn_once);
@@ -862,7 +862,7 @@ static void sched_debug_header(struct se
 	SEQ_printf(m, "  .%-40s: %Ld\n", #x, (long long)(x))
 #define PN(x) \
 	SEQ_printf(m, "  .%-40s: %Ld.%06ld\n", #x, SPLIT_NS(x))
-	PN(sysctl_sched_min_granularity);
+	PN(sysctl_sched_base_slice);
 	P(sysctl_sched_child_runs_first);
 	P(sysctl_sched_features);
 #undef PN
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -75,8 +75,8 @@ unsigned int sysctl_sched_tunable_scalin
  *
  * (default: 0.75 msec * (1 + ilog(ncpus)), units: nanoseconds)
  */
-unsigned int sysctl_sched_min_granularity			= 750000ULL;
-static unsigned int normalized_sysctl_sched_min_granularity	= 750000ULL;
+unsigned int sysctl_sched_base_slice			= 750000ULL;
+static unsigned int normalized_sysctl_sched_base_slice	= 750000ULL;
 
 /*
  * After fork, child runs first. If set to 0 (default) then
@@ -237,7 +237,7 @@ static void update_sysctl(void)
 
 #define SET_SYSCTL(name) \
 	(sysctl_##name = (factor) * normalized_sysctl_##name)
-	SET_SYSCTL(sched_min_granularity);
+	SET_SYSCTL(sched_base_slice);
 #undef SET_SYSCTL
 }
 
@@ -943,7 +943,7 @@ int sched_update_scaling(void)
 
 #define WRT_SYSCTL(name) \
 	(normalized_sysctl_##name = sysctl_##name / (factor))
-	WRT_SYSCTL(sched_min_granularity);
+	WRT_SYSCTL(sched_base_slice);
 #undef WRT_SYSCTL
 
 	return 0;
@@ -964,9 +964,9 @@ static void update_deadline(struct cfs_r
 	/*
 	 * For EEVDF the virtual time slope is determined by w_i (iow.
 	 * nice) while the request time r_i is determined by
-	 * sysctl_sched_min_granularity.
+	 * sysctl_sched_base_slice.
 	 */
-	se->slice = sysctl_sched_min_granularity;
+	se->slice = sysctl_sched_base_slice;
 
 	/*
 	 * EEVDF: vd_i = ve_i + r_i / w_i
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2479,7 +2479,7 @@ extern void check_preempt_curr(struct rq
 extern const_debug unsigned int sysctl_sched_nr_migrate;
 extern const_debug unsigned int sysctl_sched_migration_cost;
 
-extern unsigned int sysctl_sched_min_granularity;
+extern unsigned int sysctl_sched_base_slice;
 
 #ifdef CONFIG_SCHED_DEBUG
 extern int sysctl_resched_latency_warn_ms;



^ permalink raw reply	[flat|nested] 104+ messages in thread

* [PATCH 10/15] sched/fair: Propagate enqueue flags into place_entity()
  2023-05-31 11:58 [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr Peter Zijlstra
                   ` (8 preceding siblings ...)
  2023-05-31 11:58 ` [PATCH 09/15] sched/debug: Rename min_granularity to base_slice Peter Zijlstra
@ 2023-05-31 11:58 ` Peter Zijlstra
  2023-08-10  7:10   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2023-05-31 11:58 ` [PATCH 11/15] sched/eevdf: Better handle mixed slice length Peter Zijlstra
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 104+ messages in thread
From: Peter Zijlstra @ 2023-05-31 11:58 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault, tglx

This allows place_entity() to consider ENQUEUE_WAKEUP and
ENQUEUE_MIGRATED.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c  |   10 +++++-----
 kernel/sched/sched.h |    1 +
 2 files changed, 6 insertions(+), 5 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4909,7 +4909,7 @@ static inline void update_misfit_status(
 #endif /* CONFIG_SMP */
 
 static void
-place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
+place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
 	u64 vslice = calc_delta_fair(se->slice, se);
 	u64 vruntime = avg_vruntime(cfs_rq);
@@ -4998,7 +4998,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
 	 * on average, halfway through their slice, as such start tasks
 	 * off with half a slice to ease into the competition.
 	 */
-	if (sched_feat(PLACE_DEADLINE_INITIAL) && initial)
+	if (sched_feat(PLACE_DEADLINE_INITIAL) && (flags & ENQUEUE_INITIAL))
 		vslice /= 2;
 
 	/*
@@ -5021,7 +5021,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	 * update_curr().
 	 */
 	if (curr)
-		place_entity(cfs_rq, se, 0);
+		place_entity(cfs_rq, se, flags);
 
 	update_curr(cfs_rq);
 
@@ -5048,7 +5048,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	 * we can place the entity.
 	 */
 	if (!curr)
-		place_entity(cfs_rq, se, 0);
+		place_entity(cfs_rq, se, flags);
 
 	account_entity_enqueue(cfs_rq, se);
 
@@ -12053,7 +12053,7 @@ static void task_fork_fair(struct task_s
 	curr = cfs_rq->curr;
 	if (curr)
 		update_curr(cfs_rq);
-	place_entity(cfs_rq, se, 1);
+	place_entity(cfs_rq, se, ENQUEUE_INITIAL);
 	rq_unlock(rq, &rf);
 }
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2174,6 +2174,7 @@ extern const u32		sched_prio_to_wmult[40
 #else
 #define ENQUEUE_MIGRATED	0x00
 #endif
+#define ENQUEUE_INITIAL		0x80
 
 #define RETRY_TASK		((void *)-1UL)
 



^ permalink raw reply	[flat|nested] 104+ messages in thread

* [PATCH 11/15] sched/eevdf: Better handle mixed slice length
  2023-05-31 11:58 [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr Peter Zijlstra
                   ` (9 preceding siblings ...)
  2023-05-31 11:58 ` [PATCH 10/15] sched/fair: Propagate enqueue flags into place_entity() Peter Zijlstra
@ 2023-05-31 11:58 ` Peter Zijlstra
  2023-06-02 13:45   ` Vincent Guittot
  2023-06-10  6:34   ` Chen Yu
  2023-05-31 11:58 ` [RFC][PATCH 12/15] sched: Introduce latency-nice as a per-task attribute Peter Zijlstra
                   ` (4 subsequent siblings)
  15 siblings, 2 replies; 104+ messages in thread
From: Peter Zijlstra @ 2023-05-31 11:58 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault, tglx

In the case where (due to latency-nice) there are different request
sizes in the tree, the smaller requests tend to be dominated by the
larger. Also note how the EEVDF lag limits are based on r_max.

Therefore, add a heuristic that, for the mixed request size case, moves
smaller requests to placement strategy #2, which ensures they're
immediately eligible and, due to their smaller (virtual) deadline,
will cause preemption.

NOTE: this relies on update_entity_lag() to impose lag limits above
a single slice.
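
As a rough illustration of when the heuristic fires (numbers picked
arbitrarily, not taken from any benchmark): a runqueue holds one queued
nice-0 task A with a 3ms slice when a nice-0 task B with a 0.1ms slice
is woken and migrated onto it (ENQUEUE_WAKEUP|ENQUEUE_MIGRATED, so
entity_has_slept() is satisfied). At that point:

  cfs_rq->avg_slice            = 3ms   * 1024
  se->slice * cfs_rq->avg_load = 0.1ms * 1024

so the avg_slice > se->slice * avg_load test in place_entity() (see the
hunk below) holds and B's lag becomes min(0, lag + vslice) -- i.e. at
most zero; for lag deficits of up to one slice B is placed right at the
weighted average vruntime, immediately eligible, and its small slice
then gives it an early virtual deadline.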

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c     |   30 ++++++++++++++++++++++++++++++
 kernel/sched/features.h |    1 +
 kernel/sched/sched.h    |    1 +
 3 files changed, 32 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -642,6 +642,7 @@ avg_vruntime_add(struct cfs_rq *cfs_rq,
 	s64 key = entity_key(cfs_rq, se);
 
 	cfs_rq->avg_vruntime += key * weight;
+	cfs_rq->avg_slice += se->slice * weight;
 	cfs_rq->avg_load += weight;
 }
 
@@ -652,6 +653,7 @@ avg_vruntime_sub(struct cfs_rq *cfs_rq,
 	s64 key = entity_key(cfs_rq, se);
 
 	cfs_rq->avg_vruntime -= key * weight;
+	cfs_rq->avg_slice -= se->slice * weight;
 	cfs_rq->avg_load -= weight;
 }
 
@@ -4908,6 +4910,21 @@ static inline void update_misfit_status(
 
 #endif /* CONFIG_SMP */
 
+static inline bool
+entity_has_slept(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+{
+	u64 now;
+
+	if (!(flags & ENQUEUE_WAKEUP))
+		return false;
+
+	if (flags & ENQUEUE_MIGRATED)
+		return true;
+
+	now = rq_clock_task(rq_of(cfs_rq));
+	return (s64)(se->exec_start - now) >= se->slice;
+}
+
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
@@ -4930,6 +4947,19 @@ place_entity(struct cfs_rq *cfs_rq, stru
 		lag = se->vlag;
 
 		/*
+		 * For latency sensitive tasks; those that have a shorter than
+		 * average slice and do not fully consume the slice, transition
+		 * to EEVDF placement strategy #2.
+		 */
+		if (sched_feat(PLACE_FUDGE) &&
+		    (cfs_rq->avg_slice > se->slice * cfs_rq->avg_load) &&
+		    entity_has_slept(cfs_rq, se, flags)) {
+			lag += vslice;
+			if (lag > 0)
+				lag = 0;
+		}
+
+		/*
 		 * If we want to place a task and preserve lag, we have to
 		 * consider the effect of the new entity on the weighted
 		 * average and compensate for this, otherwise lag can quickly
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -5,6 +5,7 @@
  * sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
  */
 SCHED_FEAT(PLACE_LAG, true)
+SCHED_FEAT(PLACE_FUDGE, true)
 SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
 
 /*
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -555,6 +555,7 @@ struct cfs_rq {
 	unsigned int		idle_h_nr_running; /* SCHED_IDLE */
 
 	s64			avg_vruntime;
+	u64			avg_slice;
 	u64			avg_load;
 
 	u64			exec_clock;



^ permalink raw reply	[flat|nested] 104+ messages in thread

* [RFC][PATCH 12/15] sched: Introduce latency-nice as a per-task attribute
  2023-05-31 11:58 [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr Peter Zijlstra
                   ` (10 preceding siblings ...)
  2023-05-31 11:58 ` [PATCH 11/15] sched/eevdf: Better handle mixed slice length Peter Zijlstra
@ 2023-05-31 11:58 ` Peter Zijlstra
  2023-05-31 11:58 ` [RFC][PATCH 13/15] sched/fair: Implement latency-nice Peter Zijlstra
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 104+ messages in thread
From: Peter Zijlstra @ 2023-05-31 11:58 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault, tglx,
	Parth Shah

From: Parth Shah <parth@linux.ibm.com>

Latency-nice indicates the latency requirements of a task with respect
to the other tasks in the system. The value of the attribute can be
within the range [-20, 19], both inclusive, in line with task nice
values.

Just like task nice, -20 is the 'highest' priority and conveys that
this task should get minimal latency; conversely, 19 is the lowest
priority and conveys that this task will get the least consideration
and will thus receive maximal latency.
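
A minimal userspace sketch of setting the new attribute -- assuming
this patch is applied; glibc has no sched_setattr() wrapper, so the
raw syscall is used, and the struct below is a local mirror of the
VER2 layout added above:

  #define _GNU_SOURCE
  #include <stdint.h>
  #include <string.h>
  #include <stdio.h>
  #include <unistd.h>
  #include <sys/syscall.h>

  #define SCHED_FLAG_LATENCY_NICE 0x80

  struct sched_attr_v2 {          /* mirrors uapi sched_attr + new field */
          uint32_t size;
          uint32_t sched_policy;
          uint64_t sched_flags;
          int32_t  sched_nice;
          uint32_t sched_priority;
          uint64_t sched_runtime;
          uint64_t sched_deadline;
          uint64_t sched_period;
          uint32_t sched_util_min;
          uint32_t sched_util_max;
          int32_t  sched_latency_nice;  /* new in SCHED_ATTR_SIZE_VER2 (60) */
  };

  int main(void)
  {
          struct sched_attr_v2 attr;

          memset(&attr, 0, sizeof(attr));
          attr.size = sizeof(attr);       /* >= SCHED_ATTR_SIZE_VER2 */
          attr.sched_policy = 0;          /* SCHED_NORMAL; also resets nice to 0 */
          attr.sched_flags = SCHED_FLAG_LATENCY_NICE;
          attr.sched_latency_nice = -10;  /* ask for lower latency */

          if (syscall(SYS_sched_setattr, 0 /* current task */, &attr, 0))
                  perror("sched_setattr");

          return 0;
  }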

[peterz: rebase, squash]
Signed-off-by: Parth Shah <parth@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h            |    1 +
 include/uapi/linux/sched.h       |    4 +++-
 include/uapi/linux/sched/types.h |   19 +++++++++++++++++++
 init/init_task.c                 |    3 ++-
 kernel/sched/core.c              |   27 ++++++++++++++++++++++++++-
 kernel/sched/debug.c             |    1 +
 tools/include/uapi/linux/sched.h |    4 +++-
 7 files changed, 55 insertions(+), 4 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -791,6 +791,7 @@ struct task_struct {
 	int				static_prio;
 	int				normal_prio;
 	unsigned int			rt_priority;
+	int				latency_prio;
 
 	struct sched_entity		se;
 	struct sched_rt_entity		rt;
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -132,6 +132,7 @@ struct clone_args {
 #define SCHED_FLAG_KEEP_PARAMS		0x10
 #define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
 #define SCHED_FLAG_UTIL_CLAMP_MAX	0x40
+#define SCHED_FLAG_LATENCY_NICE		0x80
 
 #define SCHED_FLAG_KEEP_ALL	(SCHED_FLAG_KEEP_POLICY | \
 				 SCHED_FLAG_KEEP_PARAMS)
@@ -143,6 +144,7 @@ struct clone_args {
 			 SCHED_FLAG_RECLAIM		| \
 			 SCHED_FLAG_DL_OVERRUN		| \
 			 SCHED_FLAG_KEEP_ALL		| \
-			 SCHED_FLAG_UTIL_CLAMP)
+			 SCHED_FLAG_UTIL_CLAMP		| \
+			 SCHED_FLAG_LATENCY_NICE)
 
 #endif /* _UAPI_LINUX_SCHED_H */
--- a/include/uapi/linux/sched/types.h
+++ b/include/uapi/linux/sched/types.h
@@ -10,6 +10,7 @@ struct sched_param {
 
 #define SCHED_ATTR_SIZE_VER0	48	/* sizeof first published struct */
 #define SCHED_ATTR_SIZE_VER1	56	/* add: util_{min,max} */
+#define SCHED_ATTR_SIZE_VER2	60	/* add: latency_nice */
 
 /*
  * Extended scheduling parameters data structure.
@@ -98,6 +99,22 @@ struct sched_param {
  * scheduled on a CPU with no more capacity than the specified value.
  *
  * A task utilization boundary can be reset by setting the attribute to -1.
+ *
+ * Latency Tolerance Attributes
+ * ===========================
+ *
+ * A subset of sched_attr attributes allows to specify the relative latency
+ * requirements of a task with respect to the other tasks running/queued in the
+ * system.
+ *
+ * @ sched_latency_nice	task's latency_nice value
+ *
+ * The latency_nice of a task can have any value in a range of
+ * [MIN_LATENCY_NICE..MAX_LATENCY_NICE].
+ *
+ * A task with latency_nice with the value of LATENCY_NICE_MIN can be
+ * taken for a task requiring a lower latency as opposed to the task with
+ * higher latency_nice.
  */
 struct sched_attr {
 	__u32 size;
@@ -120,6 +137,8 @@ struct sched_attr {
 	__u32 sched_util_min;
 	__u32 sched_util_max;
 
+	/* latency requirement hints */
+	__s32 sched_latency_nice;
 };
 
 #endif /* _UAPI_LINUX_SCHED_TYPES_H */
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -78,6 +78,7 @@ struct task_struct init_task
 	.prio		= MAX_PRIO - 20,
 	.static_prio	= MAX_PRIO - 20,
 	.normal_prio	= MAX_PRIO - 20,
+	.latency_prio	= DEFAULT_PRIO,
 	.policy		= SCHED_NORMAL,
 	.cpus_ptr	= &init_task.cpus_mask,
 	.user_cpus_ptr	= NULL,
@@ -89,7 +90,7 @@ struct task_struct init_task
 		.fn = do_no_restart_syscall,
 	},
 	.se		= {
-		.group_node 	= LIST_HEAD_INIT(init_task.se.group_node),
+		.group_node	= LIST_HEAD_INIT(init_task.se.group_node),
 	},
 	.rt		= {
 		.run_list	= LIST_HEAD_INIT(init_task.rt.run_list),
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4719,6 +4719,8 @@ int sched_fork(unsigned long clone_flags
 		p->prio = p->normal_prio = p->static_prio;
 		set_load_weight(p, false);
 
+		p->latency_prio = NICE_TO_PRIO(0);
+
 		/*
 		 * We don't need the reset flag anymore after the fork. It has
 		 * fulfilled its duty:
@@ -7477,7 +7479,7 @@ static struct task_struct *find_process_
 #define SETPARAM_POLICY	-1
 
 static void __setscheduler_params(struct task_struct *p,
-		const struct sched_attr *attr)
+				  const struct sched_attr *attr)
 {
 	int policy = attr->sched_policy;
 
@@ -7501,6 +7503,13 @@ static void __setscheduler_params(struct
 	set_load_weight(p, true);
 }
 
+static void __setscheduler_latency(struct task_struct *p,
+				   const struct sched_attr *attr)
+{
+	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
+		p->latency_prio = NICE_TO_PRIO(attr->sched_latency_nice);
+}
+
 /*
  * Check the target process has a UID that matches the current process's:
  */
@@ -7641,6 +7650,13 @@ static int __sched_setscheduler(struct t
 			return retval;
 	}
 
+	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
+		if (attr->sched_latency_nice > MAX_NICE)
+			return -EINVAL;
+		if (attr->sched_latency_nice < MIN_NICE)
+			return -EINVAL;
+	}
+
 	if (pi)
 		cpuset_read_lock();
 
@@ -7675,6 +7691,9 @@ static int __sched_setscheduler(struct t
 			goto change;
 		if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
 			goto change;
+		if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE &&
+		    attr->sched_latency_nice != PRIO_TO_NICE(p->latency_prio))
+			goto change;
 
 		p->sched_reset_on_fork = reset_on_fork;
 		retval = 0;
@@ -7763,6 +7782,7 @@ static int __sched_setscheduler(struct t
 		__setscheduler_params(p, attr);
 		__setscheduler_prio(p, newprio);
 	}
+	__setscheduler_latency(p, attr);
 	__setscheduler_uclamp(p, attr);
 
 	if (queued) {
@@ -7973,6 +7993,9 @@ static int sched_copy_attr(struct sched_
 	    size < SCHED_ATTR_SIZE_VER1)
 		return -EINVAL;
 
+	if ((attr->sched_flags & SCHED_FLAG_LATENCY_NICE) &&
+	    size < SCHED_ATTR_SIZE_VER2)
+		return -EINVAL;
 	/*
 	 * XXX: Do we want to be lenient like existing syscalls; or do we want
 	 * to be strict and return an error on out-of-bounds values?
@@ -8210,6 +8233,8 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pi
 	get_params(p, &kattr);
 	kattr.sched_flags &= SCHED_FLAG_ALL;
 
+	kattr.sched_latency_nice = PRIO_TO_NICE(p->latency_prio);
+
 #ifdef CONFIG_UCLAMP_TASK
 	/*
 	 * This could race with another potential updater, but this is fine
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1085,6 +1085,7 @@ void proc_sched_show_task(struct task_st
 #endif
 	P(policy);
 	P(prio);
+	P(latency_prio);
 	if (task_has_dl_policy(p)) {
 		P(dl.runtime);
 		P(dl.deadline);
--- a/tools/include/uapi/linux/sched.h
+++ b/tools/include/uapi/linux/sched.h
@@ -132,6 +132,7 @@ struct clone_args {
 #define SCHED_FLAG_KEEP_PARAMS		0x10
 #define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
 #define SCHED_FLAG_UTIL_CLAMP_MAX	0x40
+#define SCHED_FLAG_LATENCY_NICE		0x80
 
 #define SCHED_FLAG_KEEP_ALL	(SCHED_FLAG_KEEP_POLICY | \
 				 SCHED_FLAG_KEEP_PARAMS)
@@ -143,6 +144,7 @@ struct clone_args {
 			 SCHED_FLAG_RECLAIM		| \
 			 SCHED_FLAG_DL_OVERRUN		| \
 			 SCHED_FLAG_KEEP_ALL		| \
-			 SCHED_FLAG_UTIL_CLAMP)
+			 SCHED_FLAG_UTIL_CLAMP		| \
+			 SCHED_FLAG_LATENCY_NICE)
 
 #endif /* _UAPI_LINUX_SCHED_H */



^ permalink raw reply	[flat|nested] 104+ messages in thread

* [RFC][PATCH 13/15] sched/fair: Implement latency-nice
  2023-05-31 11:58 [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr Peter Zijlstra
                   ` (11 preceding siblings ...)
  2023-05-31 11:58 ` [RFC][PATCH 12/15] sched: Introduce latency-nice as a per-task attribute Peter Zijlstra
@ 2023-05-31 11:58 ` Peter Zijlstra
  2023-06-06 14:54   ` Vincent Guittot
  2023-10-11 23:24   ` Benjamin Segall
  2023-05-31 11:58 ` [RFC][PATCH 14/15] sched/fair: Add sched group latency support Peter Zijlstra
                   ` (2 subsequent siblings)
  15 siblings, 2 replies; 104+ messages in thread
From: Peter Zijlstra @ 2023-05-31 11:58 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault, tglx

Implement latency-nice as a modulation of the EEVDF r_i parameter,
specifically apply the inverse sched_prio_to_weight[] relation on
base_slice.

Given a base slice of 3 [ms], this gives a range of:

  latency-nice  19: 3*1024 / 15    ~= 204.8 [ms]
  latency-nice -20: 3*1024 / 88761 ~= 0.034 [ms]

(which might not make sense)
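
For reference, the same formula at and around the default (same 3 [ms]
base, weights taken from sched_prio_to_weight[]):

  latency-nice   0: 3*1024 / 1024  = 3    [ms]  (unchanged)
  latency-nice  -5: 3*1024 / 3121 ~= 0.98 [ms]
  latency-nice   5: 3*1024 / 335  ~= 9.2  [ms]

That is, set_latency_fair() below computes
se->slice = (base << SCHED_FIXEDPOINT_SHIFT) / weight, with
SCHED_FIXEDPOINT_SHIFT = 10 giving the 1024 factor above.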

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/core.c  |   14 ++++++++++----
 kernel/sched/fair.c  |   22 +++++++++++++++-------
 kernel/sched/sched.h |    2 ++
 3 files changed, 27 insertions(+), 11 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1305,6 +1305,12 @@ static void set_load_weight(struct task_
 	}
 }
 
+static inline void set_latency_prio(struct task_struct *p, int prio)
+{
+	p->latency_prio = prio;
+	set_latency_fair(&p->se, prio - MAX_RT_PRIO);
+}
+
 #ifdef CONFIG_UCLAMP_TASK
 /*
  * Serializes updates of utilization clamp values
@@ -4464,9 +4470,10 @@ static void __sched_fork(unsigned long c
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
 	p->se.vlag			= 0;
-	p->se.slice			= sysctl_sched_base_slice;
 	INIT_LIST_HEAD(&p->se.group_node);
 
+	set_latency_prio(p, p->latency_prio);
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	p->se.cfs_rq			= NULL;
 #endif
@@ -4718,8 +4725,7 @@ int sched_fork(unsigned long clone_flags
 
 		p->prio = p->normal_prio = p->static_prio;
 		set_load_weight(p, false);
-
-		p->latency_prio = NICE_TO_PRIO(0);
+		set_latency_prio(p, NICE_TO_PRIO(0));
 
 		/*
 		 * We don't need the reset flag anymore after the fork. It has
@@ -7507,7 +7513,7 @@ static void __setscheduler_latency(struc
 				   const struct sched_attr *attr)
 {
 	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
-		p->latency_prio = NICE_TO_PRIO(attr->sched_latency_nice);
+		set_latency_prio(p, NICE_TO_PRIO(attr->sched_latency_nice));
 }
 
 /*
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -952,6 +952,21 @@ int sched_update_scaling(void)
 }
 #endif
 
+void set_latency_fair(struct sched_entity *se, int prio)
+{
+	u32 weight = sched_prio_to_weight[prio];
+	u64 base = sysctl_sched_base_slice;
+
+	/*
+	 * For EEVDF the virtual time slope is determined by w_i (iow.
+	 * nice) while the request time r_i is determined by
+	 * latency-nice.
+	 *
+	 * Smaller request gets better latency.
+	 */
+	se->slice = div_u64(base << SCHED_FIXEDPOINT_SHIFT, weight);
+}
+
 static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
 
 /*
@@ -964,13 +979,6 @@ static void update_deadline(struct cfs_r
 		return;
 
 	/*
-	 * For EEVDF the virtual time slope is determined by w_i (iow.
-	 * nice) while the request time r_i is determined by
-	 * sysctl_sched_base_slice.
-	 */
-	se->slice = sysctl_sched_base_slice;
-
-	/*
 	 * EEVDF: vd_i = ve_i + r_i / w_i
 	 */
 	se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2495,6 +2495,8 @@ extern unsigned int sysctl_numa_balancin
 extern unsigned int sysctl_numa_balancing_hot_threshold;
 #endif
 
+extern void set_latency_fair(struct sched_entity *se, int prio);
+
 #ifdef CONFIG_SCHED_HRTICK
 
 /*



^ permalink raw reply	[flat|nested] 104+ messages in thread

* [RFC][PATCH 14/15] sched/fair: Add sched group latency support
  2023-05-31 11:58 [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr Peter Zijlstra
                   ` (12 preceding siblings ...)
  2023-05-31 11:58 ` [RFC][PATCH 13/15] sched/fair: Implement latency-nice Peter Zijlstra
@ 2023-05-31 11:58 ` Peter Zijlstra
  2023-05-31 11:58 ` [RFC][PATCH 15/15] sched/eevdf: Use sched_attr::sched_runtime to set request/slice Peter Zijlstra
  2023-08-24  0:52 ` [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr Daniel Jordan
  15 siblings, 0 replies; 104+ messages in thread
From: Peter Zijlstra @ 2023-05-31 11:58 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault, tglx

From: Vincent Guittot <vincent.guittot@linaro.org>

A task can set its latency priority with sched_setattr(), which is then
used to set the latency offset of its sched_entity, but sched group
entities still have the default latency offset value.

Add a latency.nice field to the cpu cgroup controller to set the
latency priority of the group, similarly to sched_setattr(). The
latency priority is then used to set the offset of the sched_entities
of the group.
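
With this in place, writing a nice value into the group's
cpu.latency.nice file (e.g. "-10" into
/sys/fs/cgroup/<group>/cpu.latency.nice, assuming a cgroup v2 mount at
the usual location) is translated by sched_group_set_latency() into a
latency_prio and, through set_latency_fair(), into the slice of the
group's per-CPU sched_entities -- the group analogue of what
sched_setattr() does for a single task.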

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20230224093454.956298-7-vincent.guittot@linaro.org
---
 Documentation/admin-guide/cgroup-v2.rst |   10 ++++++++++
 kernel/sched/core.c                     |   30 ++++++++++++++++++++++++++++++
 kernel/sched/fair.c                     |   27 +++++++++++++++++++++++++++
 kernel/sched/sched.h                    |    4 ++++
 4 files changed, 71 insertions(+)

--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1121,6 +1121,16 @@ All time durations are in microseconds.
         values similar to the sched_setattr(2). This maximum utilization
         value is used to clamp the task specific maximum utilization clamp.
 
+  cpu.latency.nice
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "0".
+
+	The nice value is in the range [-20, 19].
+
+	This interface file allows reading and setting latency using the
+	same values used by sched_setattr(2). The latency_nice of a group is
+	used to limit the impact of the latency_nice of a task outside the
+	group.
 
 
 Memory
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11177,6 +11177,25 @@ static int cpu_idle_write_s64(struct cgr
 {
 	return sched_group_set_idle(css_tg(css), idle);
 }
+
+static s64 cpu_latency_nice_read_s64(struct cgroup_subsys_state *css,
+				    struct cftype *cft)
+{
+	return PRIO_TO_NICE(css_tg(css)->latency_prio);
+}
+
+static int cpu_latency_nice_write_s64(struct cgroup_subsys_state *css,
+				     struct cftype *cft, s64 nice)
+{
+	int prio;
+
+	if (nice < MIN_NICE || nice > MAX_NICE)
+		return -ERANGE;
+
+	prio = NICE_TO_PRIO(nice);
+
+	return sched_group_set_latency(css_tg(css), prio);
+}
 #endif
 
 static struct cftype cpu_legacy_files[] = {
@@ -11191,6 +11210,11 @@ static struct cftype cpu_legacy_files[]
 		.read_s64 = cpu_idle_read_s64,
 		.write_s64 = cpu_idle_write_s64,
 	},
+	{
+		.name = "latency.nice",
+		.read_s64 = cpu_latency_nice_read_s64,
+		.write_s64 = cpu_latency_nice_write_s64,
+	},
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
@@ -11408,6 +11432,12 @@ static struct cftype cpu_files[] = {
 		.read_s64 = cpu_idle_read_s64,
 		.write_s64 = cpu_idle_write_s64,
 	},
+	{
+		.name = "latency.nice",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_s64 = cpu_latency_nice_read_s64,
+		.write_s64 = cpu_latency_nice_write_s64,
+	},
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12293,6 +12293,7 @@ int alloc_fair_sched_group(struct task_g
 		goto err;
 
 	tg->shares = NICE_0_LOAD;
+	tg->latency_prio = DEFAULT_PRIO;
 
 	init_cfs_bandwidth(tg_cfs_bandwidth(tg));
 
@@ -12391,6 +12392,9 @@ void init_tg_cfs_entry(struct task_group
 	}
 
 	se->my_q = cfs_rq;
+
+	set_latency_fair(se, tg->latency_prio - MAX_RT_PRIO);
+
 	/* guarantee group entities always have weight */
 	update_load_set(&se->load, NICE_0_LOAD);
 	se->parent = parent;
@@ -12519,6 +12523,29 @@ int sched_group_set_idle(struct task_gro
 
 	mutex_unlock(&shares_mutex);
 	return 0;
+}
+
+int sched_group_set_latency(struct task_group *tg, int prio)
+{
+	int i;
+
+	if (tg == &root_task_group)
+		return -EINVAL;
+
+	mutex_lock(&shares_mutex);
+
+	if (tg->latency_prio == prio) {
+		mutex_unlock(&shares_mutex);
+		return 0;
+	}
+
+	tg->latency_prio = prio;
+
+	for_each_possible_cpu(i)
+		set_latency_fair(tg->se[i], prio - MAX_RT_PRIO);
+
+	mutex_unlock(&shares_mutex);
+	return 0;
 }
 
 #else /* CONFIG_FAIR_GROUP_SCHED */
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -378,6 +378,8 @@ struct task_group {
 
 	/* A positive value indicates that this is a SCHED_IDLE group. */
 	int			idle;
+	/* latency priority of the group. */
+	int			latency_prio;
 
 #ifdef	CONFIG_SMP
 	/*
@@ -488,6 +490,8 @@ extern int sched_group_set_shares(struct
 
 extern int sched_group_set_idle(struct task_group *tg, long idle);
 
+extern int sched_group_set_latency(struct task_group *tg, int prio);
+
 #ifdef CONFIG_SMP
 extern void set_task_rq_fair(struct sched_entity *se,
 			     struct cfs_rq *prev, struct cfs_rq *next);



^ permalink raw reply	[flat|nested] 104+ messages in thread

* [RFC][PATCH 15/15] sched/eevdf: Use sched_attr::sched_runtime to set request/slice
  2023-05-31 11:58 [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr Peter Zijlstra
                   ` (13 preceding siblings ...)
  2023-05-31 11:58 ` [RFC][PATCH 14/15] sched/fair: Add sched group latency support Peter Zijlstra
@ 2023-05-31 11:58 ` Peter Zijlstra
  2023-06-01 13:55   ` Vincent Guittot
  2023-08-24  0:52 ` [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr Daniel Jordan
  15 siblings, 1 reply; 104+ messages in thread
From: Peter Zijlstra @ 2023-05-31 11:58 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault, tglx

As an alternative to the latency-nice interface, allow applications to
directly set the request/slice using sched_attr::sched_runtime.

The implementation clamps the value to: 0.1[ms] <= slice <= 100[ms]
which is 1/10 the size of HZ=1000 and 10 times the size of HZ=100.

Applications should strive to use their periodic runtime at a high
confidence interval (95%+) as the target slice. Using a smaller slice
will introduce undue preemptions, while using a larger value will
increase latency.
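
Reusing the userspace sketch from patch 12 above (same caveats: purely
illustrative, raw syscall, locally declared struct, attr zeroed first
so sched_flags stays 0), requesting a 2ms slice for the current
SCHED_OTHER task would then only need:

  attr.size          = sizeof(attr);
  attr.sched_policy  = 0;                /* SCHED_NORMAL */
  attr.sched_runtime = 2 * 1000 * 1000;  /* 2ms request/slice, in ns */

  syscall(SYS_sched_setattr, 0, &attr, 0);

Values outside the window are clamped, e.g. asking for 0.05ms results
in a 0.1ms slice.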

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7494,10 +7494,18 @@ static void __setscheduler_params(struct
 
 	p->policy = policy;
 
-	if (dl_policy(policy))
+	if (dl_policy(policy)) {
 		__setparam_dl(p, attr);
-	else if (fair_policy(policy))
+	} else if (fair_policy(policy)) {
 		p->static_prio = NICE_TO_PRIO(attr->sched_nice);
+		if (attr->sched_runtime) {
+			p->se.slice = clamp_t(u64, attr->sched_runtime,
+					      NSEC_PER_MSEC/10,   /* HZ=1000 * 10 */
+					      NSEC_PER_MSEC*100); /* HZ=100  / 10 */
+		} else {
+			p->se.slice = sysctl_sched_base_slice;
+		}
+	}
 
 	/*
 	 * __sched_setscheduler() ensures attr->sched_priority == 0 when
@@ -7689,7 +7697,9 @@ static int __sched_setscheduler(struct t
 	 * but store a possible modification of reset_on_fork.
 	 */
 	if (unlikely(policy == p->policy)) {
-		if (fair_policy(policy) && attr->sched_nice != task_nice(p))
+		if (fair_policy(policy) &&
+		    (attr->sched_nice != task_nice(p) ||
+		     (attr->sched_runtime && attr->sched_runtime != p->se.slice)))
 			goto change;
 		if (rt_policy(policy) && attr->sched_priority != p->rt_priority)
 			goto change;
@@ -8017,12 +8027,14 @@ static int sched_copy_attr(struct sched_
 
 static void get_params(struct task_struct *p, struct sched_attr *attr)
 {
-	if (task_has_dl_policy(p))
+	if (task_has_dl_policy(p)) {
 		__getparam_dl(p, attr);
-	else if (task_has_rt_policy(p))
+	} else if (task_has_rt_policy(p)) {
 		attr->sched_priority = p->rt_priority;
-	else
+	} else {
 		attr->sched_nice = task_nice(p);
+		attr->sched_runtime = p->se.slice;
+	}
 }
 
 /**



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [RFC][PATCH 15/15] sched/eevdf: Use sched_attr::sched_runtime to set request/slice
  2023-05-31 11:58 ` [RFC][PATCH 15/15] sched/eevdf: Use sched_attr::sched_runtime to set request/slice Peter Zijlstra
@ 2023-06-01 13:55   ` Vincent Guittot
  2023-06-08 11:52     ` Peter Zijlstra
  0 siblings, 1 reply; 104+ messages in thread
From: Vincent Guittot @ 2023-06-01 13:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault, tglx

On Wed, 31 May 2023 at 14:47, Peter Zijlstra <peterz@infradead.org> wrote:
>
> As an alternative to the latency-nice interface; allow applications to
> directly set the request/slice using sched_attr::sched_runtime.
>
> The implementation clamps the value to: 0.1[ms] <= slice <= 100[ms]
> which is 1/10 the size of HZ=1000 and 10 times the size of HZ=100.

There were some discussions about the latency interface and setting a
raw time value. The problems with using a raw time value are:
- what does this raw time value mean, and how does it apply to the
scheduling latency of the task? Typically, what does setting
sched_runtime to 1ms mean? Regarding latency, users would expect
to be scheduled in less than 1ms, but this is not what will (always)
happen with a sched_slice set to 1ms, whereas with the deadline
scheduler we ensure that the task will run for sched_runtime in the
sched_period (and before sched_deadline). So this will be confusing
- more than a runtime, we want to set a scheduling latency hint which
would be more aligned with a deadline
- Then the user will complain that they set 1ms but their task is
scheduled after several (or even dozens) ms in some cases. Also, you
will probably end up with everybody setting 0.1ms and expecting 0.1ms
latency. The latency nice like the nice give an opaque weight against
others without any determinism that we can't respect
- How do you specify that you don't want to preempt others, but still
want to keep your allocated running time?

>
> Applications should strive to use their periodic runtime at a high
> confidence interval (95%+) as the target slice. Using a smaller slice
> will introduce undue preemptions, while using a larger value will
> increase latency.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/core.c |   24 ++++++++++++++++++------
>  1 file changed, 18 insertions(+), 6 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7494,10 +7494,18 @@ static void __setscheduler_params(struct
>
>         p->policy = policy;
>
> -       if (dl_policy(policy))
> +       if (dl_policy(policy)) {
>                 __setparam_dl(p, attr);
> -       else if (fair_policy(policy))
> +       } else if (fair_policy(policy)) {
>                 p->static_prio = NICE_TO_PRIO(attr->sched_nice);
> +               if (attr->sched_runtime) {
> +                       p->se.slice = clamp_t(u64, attr->sched_runtime,
> +                                             NSEC_PER_MSEC/10,   /* HZ=1000 * 10 */
> +                                             NSEC_PER_MSEC*100); /* HZ=100  / 10 */
> +               } else {
> +                       p->se.slice = sysctl_sched_base_slice;
> +               }
> +       }
>
>         /*
>          * __sched_setscheduler() ensures attr->sched_priority == 0 when
> @@ -7689,7 +7697,9 @@ static int __sched_setscheduler(struct t
>          * but store a possible modification of reset_on_fork.
>          */
>         if (unlikely(policy == p->policy)) {
> -               if (fair_policy(policy) && attr->sched_nice != task_nice(p))
> +               if (fair_policy(policy) &&
> +                   (attr->sched_nice != task_nice(p) ||
> +                    (attr->sched_runtime && attr->sched_runtime != p->se.slice)))
>                         goto change;
>                 if (rt_policy(policy) && attr->sched_priority != p->rt_priority)
>                         goto change;
> @@ -8017,12 +8027,14 @@ static int sched_copy_attr(struct sched_
>
>  static void get_params(struct task_struct *p, struct sched_attr *attr)
>  {
> -       if (task_has_dl_policy(p))
> +       if (task_has_dl_policy(p)) {
>                 __getparam_dl(p, attr);
> -       else if (task_has_rt_policy(p))
> +       } else if (task_has_rt_policy(p)) {
>                 attr->sched_priority = p->rt_priority;
> -       else
> +       } else {
>                 attr->sched_nice = task_nice(p);
> +               attr->sched_runtime = p->se.slice;
> +       }
>  }
>
>  /**
>
>

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 11/15] sched/eevdf: Better handle mixed slice length
  2023-05-31 11:58 ` [PATCH 11/15] sched/eevdf: Better handle mixed slice length Peter Zijlstra
@ 2023-06-02 13:45   ` Vincent Guittot
  2023-06-02 15:06     ` Peter Zijlstra
  2023-06-10  6:34   ` Chen Yu
  1 sibling, 1 reply; 104+ messages in thread
From: Vincent Guittot @ 2023-06-02 13:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault, tglx

On Wed, 31 May 2023 at 14:47, Peter Zijlstra <peterz@infradead.org> wrote:
>
> In the case where (due to latency-nice) there are different request
> sizes in the tree, the smaller requests tend to be dominated by the
> larger. Also note how the EEVDF lag limits are based on r_max.
>
> Therefore; add a heuristic that for the mixed request size case, moves
> smaller requests to placement strategy #2 which ensures they're
> immediately eligible and due to their smaller (virtual) deadline
> will cause preemption.
>
> NOTE: this relies on update_entity_lag() to impose lag limits above
> a single slice.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/fair.c     |   30 ++++++++++++++++++++++++++++++
>  kernel/sched/features.h |    1 +
>  kernel/sched/sched.h    |    1 +
>  3 files changed, 32 insertions(+)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -642,6 +642,7 @@ avg_vruntime_add(struct cfs_rq *cfs_rq,
>         s64 key = entity_key(cfs_rq, se);
>
>         cfs_rq->avg_vruntime += key * weight;
> +       cfs_rq->avg_slice += se->slice * weight;
>         cfs_rq->avg_load += weight;
>  }
>
> @@ -652,6 +653,7 @@ avg_vruntime_sub(struct cfs_rq *cfs_rq,
>         s64 key = entity_key(cfs_rq, se);
>
>         cfs_rq->avg_vruntime -= key * weight;
> +       cfs_rq->avg_slice -= se->slice * weight;
>         cfs_rq->avg_load -= weight;
>  }
>
> @@ -4908,6 +4910,21 @@ static inline void update_misfit_status(
>
>  #endif /* CONFIG_SMP */
>
> +static inline bool
> +entity_has_slept(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> +{
> +       u64 now;
> +
> +       if (!(flags & ENQUEUE_WAKEUP))
> +               return false;
> +
> +       if (flags & ENQUEUE_MIGRATED)
> +               return true;
> +
> +       now = rq_clock_task(rq_of(cfs_rq));
> +       return (s64)(se->exec_start - now) >= se->slice;
> +}
> +
>  static void
>  place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  {
> @@ -4930,6 +4947,19 @@ place_entity(struct cfs_rq *cfs_rq, stru
>                 lag = se->vlag;
>
>                 /*
> +                * For latency sensitive tasks; those that have a shorter than
> +                * average slice and do not fully consume the slice, transition
> +                * to EEVDF placement strategy #2.
> +                */
> +               if (sched_feat(PLACE_FUDGE) &&
> +                   (cfs_rq->avg_slice > se->slice * cfs_rq->avg_load) &&
> +                   entity_has_slept(cfs_rq, se, flags)) {
> +                       lag += vslice;
> +                       if (lag > 0)
> +                               lag = 0;

This PLACE_FUDGE does not look like a good heuristic, because it breaks
the better fair sharing of cpu bandwidth that EEVDF is supposed to
bring. Furthermore, it breaks the isolation between cpu bandwidth and
latency, because playing with latency_nice will impact your cpu
bandwidth.

> +               }
> +
> +               /*
>                  * If we want to place a task and preserve lag, we have to
>                  * consider the effect of the new entity on the weighted
>                  * average and compensate for this, otherwise lag can quickly
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -5,6 +5,7 @@
>   * sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
>   */
>  SCHED_FEAT(PLACE_LAG, true)
> +SCHED_FEAT(PLACE_FUDGE, true)
>  SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
>
>  /*
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -555,6 +555,7 @@ struct cfs_rq {
>         unsigned int            idle_h_nr_running; /* SCHED_IDLE */
>
>         s64                     avg_vruntime;
> +       u64                     avg_slice;
>         u64                     avg_load;
>
>         u64                     exec_clock;
>
>

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 01/15] sched/fair: Add avg_vruntime
  2023-05-31 11:58 ` [PATCH 01/15] sched/fair: Add avg_vruntime Peter Zijlstra
@ 2023-06-02 13:51   ` Vincent Guittot
  2023-06-02 14:27     ` Peter Zijlstra
  2023-08-10  7:10   ` [tip: sched/core] sched/fair: Add cfs_rq::avg_vruntime tip-bot2 for Peter Zijlstra
  2023-10-11  4:15   ` [PATCH 01/15] sched/fair: Add avg_vruntime Abel Wu
  2 siblings, 1 reply; 104+ messages in thread
From: Vincent Guittot @ 2023-06-02 13:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault, tglx

On Wed, 31 May 2023 at 14:47, Peter Zijlstra <peterz@infradead.org> wrote:
>
> In order to move to an eligibility based scheduling policy it is
> needed to have a better approximation of the ideal scheduler.
>
> Specifically, for a virtual time weighted fair queueing based
> scheduler the ideal scheduler will be the weighted average of the
> individual virtual runtimes (math in the comment).
>
> As such, compute the weighted average to approximate the ideal
> scheduler -- note that the approximation is in the individual task
> behaviour, which isn't strictly conformant.
>
> Specifically consider adding a task with a vruntime left of center, in
> this case the average will move backwards in time -- something the
> ideal scheduler would of course never do.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/debug.c |   32 +++++------
>  kernel/sched/fair.c  |  137 +++++++++++++++++++++++++++++++++++++++++++++++++--
>  kernel/sched/sched.h |    5 +
>  3 files changed, 154 insertions(+), 20 deletions(-)
>
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -626,10 +626,9 @@ static void print_rq(struct seq_file *m,
>
>  void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
>  {
> -       s64 MIN_vruntime = -1, min_vruntime, max_vruntime = -1,
> -               spread, rq0_min_vruntime, spread0;
> +       s64 left_vruntime = -1, min_vruntime, right_vruntime = -1, spread;
> +       struct sched_entity *last, *first;
>         struct rq *rq = cpu_rq(cpu);
> -       struct sched_entity *last;
>         unsigned long flags;
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> @@ -643,26 +642,25 @@ void print_cfs_rq(struct seq_file *m, in
>                         SPLIT_NS(cfs_rq->exec_clock));
>
>         raw_spin_rq_lock_irqsave(rq, flags);
> -       if (rb_first_cached(&cfs_rq->tasks_timeline))
> -               MIN_vruntime = (__pick_first_entity(cfs_rq))->vruntime;
> +       first = __pick_first_entity(cfs_rq);
> +       if (first)
> +               left_vruntime = first->vruntime;
>         last = __pick_last_entity(cfs_rq);
>         if (last)
> -               max_vruntime = last->vruntime;
> +               right_vruntime = last->vruntime;
>         min_vruntime = cfs_rq->min_vruntime;
> -       rq0_min_vruntime = cpu_rq(0)->cfs.min_vruntime;
>         raw_spin_rq_unlock_irqrestore(rq, flags);
> -       SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "MIN_vruntime",
> -                       SPLIT_NS(MIN_vruntime));
> +
> +       SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_vruntime",
> +                       SPLIT_NS(left_vruntime));
>         SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "min_vruntime",
>                         SPLIT_NS(min_vruntime));
> -       SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "max_vruntime",
> -                       SPLIT_NS(max_vruntime));
> -       spread = max_vruntime - MIN_vruntime;
> -       SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "spread",
> -                       SPLIT_NS(spread));
> -       spread0 = min_vruntime - rq0_min_vruntime;
> -       SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "spread0",
> -                       SPLIT_NS(spread0));
> +       SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "avg_vruntime",
> +                       SPLIT_NS(avg_vruntime(cfs_rq)));
> +       SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "right_vruntime",
> +                       SPLIT_NS(right_vruntime));
> +       spread = right_vruntime - left_vruntime;
> +       SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "spread", SPLIT_NS(spread));
>         SEQ_printf(m, "  .%-30s: %d\n", "nr_spread_over",
>                         cfs_rq->nr_spread_over);
>         SEQ_printf(m, "  .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -601,9 +601,134 @@ static inline bool entity_before(const s
>         return (s64)(a->vruntime - b->vruntime) < 0;
>  }
>
> +static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +{
> +       return (s64)(se->vruntime - cfs_rq->min_vruntime);
> +}
> +
>  #define __node_2_se(node) \
>         rb_entry((node), struct sched_entity, run_node)
>
> +/*
> + * Compute virtual time from the per-task service numbers:
> + *
> + * Fair schedulers conserve lag:
> + *
> + *   \Sum lag_i = 0
> + *
> + * Where lag_i is given by:
> + *
> + *   lag_i = S - s_i = w_i * (V - v_i)
> + *
> + * Where S is the ideal service time and V is its virtual time counterpart.
> + * Therefore:
> + *
> + *   \Sum lag_i = 0
> + *   \Sum w_i * (V - v_i) = 0
> + *   \Sum w_i * V - w_i * v_i = 0
> + *
> + * From which we can solve an expression for V in v_i (which we have in
> + * se->vruntime):
> + *
> + *       \Sum v_i * w_i   \Sum v_i * w_i
> + *   V = -------------- = --------------
> + *          \Sum w_i            W
> + *
> + * Specifically, this is the weighted average of all entity virtual runtimes.
> + *
> + * [[ NOTE: this is only equal to the ideal scheduler under the condition
> + *          that join/leave operations happen at lag_i = 0, otherwise the
> + *          virtual time has non-contiguous motion equivalent to:
> + *
> + *           V +-= lag_i / W
> + *
> + *         Also see the comment in place_entity() that deals with this. ]]
> + *
> + * However, since v_i is u64, and the multiplication could easily overflow
> + * transform it into a relative form that uses smaller quantities:
> + *
> + * Substitute: v_i == (v_i - v0) + v0
> + *
> + *     \Sum ((v_i - v0) + v0) * w_i   \Sum (v_i - v0) * w_i
> + * V = ---------------------------- = --------------------- + v0
> + *                  W                            W
> + *
> + * Which we track using:
> + *
> + *                    v0 := cfs_rq->min_vruntime
> + * \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime
> + *              \Sum w_i := cfs_rq->avg_load
> + *
> + * Since min_vruntime is a monotonic increasing variable that closely tracks
> + * the per-task service, these deltas: (v_i - v), will be in the order of the
> + * maximal (virtual) lag induced in the system due to quantisation.
> + *
> + * Also, we use scale_load_down() to reduce the size.
> + *
> + * As measured, the max (key * weight) value was ~44 bits for a kernel build.
> + */
> +static void
> +avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +{
> +       unsigned long weight = scale_load_down(se->load.weight);
> +       s64 key = entity_key(cfs_rq, se);
> +
> +       cfs_rq->avg_vruntime += key * weight;
> +       cfs_rq->avg_load += weight;

isn't cfs_rq->avg_load similar to scale_load_down(cfs_rq->load.weight)  ?

> +}
> +
> +static void
> +avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +{
> +       unsigned long weight = scale_load_down(se->load.weight);
> +       s64 key = entity_key(cfs_rq, se);
> +
> +       cfs_rq->avg_vruntime -= key * weight;
> +       cfs_rq->avg_load -= weight;
> +}
> +
> +static inline
> +void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
> +{
> +       /*
> +        * v' = v + d ==> avg_vruntime' = avg_runtime - d*avg_load
> +        */
> +       cfs_rq->avg_vruntime -= cfs_rq->avg_load * delta;
> +}
> +
> +u64 avg_vruntime(struct cfs_rq *cfs_rq)
> +{
> +       struct sched_entity *curr = cfs_rq->curr;
> +       s64 avg = cfs_rq->avg_vruntime;
> +       long load = cfs_rq->avg_load;
> +
> +       if (curr && curr->on_rq) {
> +               unsigned long weight = scale_load_down(curr->load.weight);
> +
> +               avg += entity_key(cfs_rq, curr) * weight;
> +               load += weight;
> +       }
> +
> +       if (load)
> +               avg = div_s64(avg, load);
> +
> +       return cfs_rq->min_vruntime + avg;
> +}
> +
> +static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
> +{
> +       u64 min_vruntime = cfs_rq->min_vruntime;
> +       /*
> +        * open coded max_vruntime() to allow updating avg_vruntime
> +        */
> +       s64 delta = (s64)(vruntime - min_vruntime);
> +       if (delta > 0) {
> +               avg_vruntime_update(cfs_rq, delta);
> +               min_vruntime = vruntime;
> +       }
> +       return min_vruntime;
> +}
> +
>  static void update_min_vruntime(struct cfs_rq *cfs_rq)
>  {
>         struct sched_entity *curr = cfs_rq->curr;
> @@ -629,7 +754,7 @@ static void update_min_vruntime(struct c
>
>         /* ensure we never gain time by being placed backwards. */
>         u64_u32_store(cfs_rq->min_vruntime,
> -                     max_vruntime(cfs_rq->min_vruntime, vruntime));
> +                     __update_min_vruntime(cfs_rq, vruntime));
>  }
>
>  static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
> @@ -642,12 +767,14 @@ static inline bool __entity_less(struct
>   */
>  static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> +       avg_vruntime_add(cfs_rq, se);
>         rb_add_cached(&se->run_node, &cfs_rq->tasks_timeline, __entity_less);
>  }
>
>  static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>         rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline);
> +       avg_vruntime_sub(cfs_rq, se);
>  }
>
>  struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
> @@ -3379,6 +3506,8 @@ static void reweight_entity(struct cfs_r
>                 /* commit outstanding execution time */
>                 if (cfs_rq->curr == se)
>                         update_curr(cfs_rq);
> +               else
> +                       avg_vruntime_sub(cfs_rq, se);
>                 update_load_sub(&cfs_rq->load, se->load.weight);
>         }
>         dequeue_load_avg(cfs_rq, se);
> @@ -3394,9 +3523,11 @@ static void reweight_entity(struct cfs_r
>  #endif
>
>         enqueue_load_avg(cfs_rq, se);
> -       if (se->on_rq)
> +       if (se->on_rq) {
>                 update_load_add(&cfs_rq->load, se->load.weight);
> -
> +               if (cfs_rq->curr != se)
> +                       avg_vruntime_add(cfs_rq, se);
> +       }
>  }
>
>  void reweight_task(struct task_struct *p, int prio)
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -554,6 +554,9 @@ struct cfs_rq {
>         unsigned int            idle_nr_running;   /* SCHED_IDLE */
>         unsigned int            idle_h_nr_running; /* SCHED_IDLE */
>
> +       s64                     avg_vruntime;
> +       u64                     avg_load;
> +
>         u64                     exec_clock;
>         u64                     min_vruntime;
>  #ifdef CONFIG_SCHED_CORE
> @@ -3503,4 +3506,6 @@ static inline void task_tick_mm_cid(stru
>  static inline void init_sched_mm_cid(struct task_struct *t) { }
>  #endif
>
> +extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
> +
>  #endif /* _KERNEL_SCHED_SCHED_H */
>
>

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 01/15] sched/fair: Add avg_vruntime
  2023-06-02 13:51   ` Vincent Guittot
@ 2023-06-02 14:27     ` Peter Zijlstra
  2023-06-05  7:18       ` Vincent Guittot
  0 siblings, 1 reply; 104+ messages in thread
From: Peter Zijlstra @ 2023-06-02 14:27 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault, tglx

On Fri, Jun 02, 2023 at 03:51:53PM +0200, Vincent Guittot wrote:
> On Wed, 31 May 2023 at 14:47, Peter Zijlstra <peterz@infradead.org> wrote:
> > +static void
> > +avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > +{
> > +       unsigned long weight = scale_load_down(se->load.weight);
> > +       s64 key = entity_key(cfs_rq, se);
> > +
> > +       cfs_rq->avg_vruntime += key * weight;
> > +       cfs_rq->avg_load += weight;
> 
> isn't cfs_rq->avg_load similar to scale_load_down(cfs_rq->load.weight)  ?
> 
> > +}

Similar, yes, but not quite the same in two ways:

 - it's sometimes off by one entry due to ordering of operations -- this
   is probably fixable.

 - it does the scale down after addition, whereas this does the scale
   down before addition, esp for multiple low weight entries this makes
   a significant difference.
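
To make that second point concrete, a stand-alone toy (explicitly not kernel
code: a plain '>> 10' stands in for scale_load_down(), its clamping of small
non-zero weights is ignored, and the low weights are made up):

	#include <stdio.h>

	#define SHIFT	10	/* SCHED_FIXEDPOINT_SHIFT on 64-bit */

	int main(void)
	{
		/* three low-weight entities, full-resolution weights */
		unsigned long w[] = { 700, 700, 700 };
		unsigned long sum = 0, sum_scaled = 0;

		for (int i = 0; i < 3; i++) {
			sum        += w[i];		/* like cfs_rq->load.weight */
			sum_scaled += w[i] >> SHIFT;	/* like cfs_rq->avg_load    */
		}

		printf("scaled down after addition:  %lu\n", sum >> SHIFT);	/* 2100 >> 10 == 2 */
		printf("scaled down before addition: %lu\n", sum_scaled);	/* 0 + 0 + 0   == 0 */
		return 0;
	}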


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 11/15] sched/eevdf: Better handle mixed slice length
  2023-06-02 13:45   ` Vincent Guittot
@ 2023-06-02 15:06     ` Peter Zijlstra
  0 siblings, 0 replies; 104+ messages in thread
From: Peter Zijlstra @ 2023-06-02 15:06 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault, tglx

On Fri, Jun 02, 2023 at 03:45:18PM +0200, Vincent Guittot wrote:
> On Wed, 31 May 2023 at 14:47, Peter Zijlstra <peterz@infradead.org> wrote:

> > +static inline bool
> > +entity_has_slept(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > +{
> > +       u64 now;
> > +
> > +       if (!(flags & ENQUEUE_WAKEUP))
> > +               return false;
> > +
> > +       if (flags & ENQUEUE_MIGRATED)
> > +               return true;
> > +
> > +       now = rq_clock_task(rq_of(cfs_rq));
> > +       return (s64)(se->exec_start - now) >= se->slice;
> > +}
> > +
> >  static void
> >  place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >  {
> > @@ -4930,6 +4947,19 @@ place_entity(struct cfs_rq *cfs_rq, stru
> >                 lag = se->vlag;
> >
> >                 /*
> > +                * For latency sensitive tasks; those that have a shorter than
> > +                * average slice and do not fully consume the slice, transition
> > +                * to EEVDF placement strategy #2.
> > +                */
> > +               if (sched_feat(PLACE_FUDGE) &&
> > +                   (cfs_rq->avg_slice > se->slice * cfs_rq->avg_load) &&
> > +                   entity_has_slept(cfs_rq, se, flags)) {
> > +                       lag += vslice;
> > +                       if (lag > 0)
> > +                               lag = 0;
> 
> This PLACE_FUDGE does not look like a good heuristic, because it breaks
> the better fair sharing of cpu bandwidth that EEVDF is supposed to
> bring. Furthermore, it breaks the isolation between cpu bandwidth and
> latency, because playing with latency_nice will impact your cpu
> bandwidth.

Yeah, probably :/ Even though entity_has_slept() ensures the task slept
for at least one slice, that's probably not enough to preserve the
bandwidth constraints.

The fairness analysis in the paper conveniently avoids all 'interesting'
cases, including their own placement policies.

I'll sit on this one longer and think a bit more about it.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 01/15] sched/fair: Add avg_vruntime
  2023-06-02 14:27     ` Peter Zijlstra
@ 2023-06-05  7:18       ` Vincent Guittot
  0 siblings, 0 replies; 104+ messages in thread
From: Vincent Guittot @ 2023-06-05  7:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault, tglx

On Fri, 2 Jun 2023 at 16:27, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jun 02, 2023 at 03:51:53PM +0200, Vincent Guittot wrote:
> > On Wed, 31 May 2023 at 14:47, Peter Zijlstra <peterz@infradead.org> wrote:
> > > +static void
> > > +avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > > +{
> > > +       unsigned long weight = scale_load_down(se->load.weight);
> > > +       s64 key = entity_key(cfs_rq, se);
> > > +
> > > +       cfs_rq->avg_vruntime += key * weight;
> > > +       cfs_rq->avg_load += weight;
> >
> > isn't cfs_rq->avg_load similar to scale_load_down(cfs_rq->load.weight)  ?
> >
> > > +}
>
> Similar, yes, but not quite the same in two ways:
>
>  - it's sometimes off by one entry due to ordering of operations -- this
>    is probably fixable.
>
>  - it does the scale down after addition, whereas this does the scale
>    down before addition, esp for multiple low weight entries this makes
>    a significant difference.

Ah yes, we are still using the scaled down value for computation

>

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [RFC][PATCH 13/15] sched/fair: Implement latency-nice
  2023-05-31 11:58 ` [RFC][PATCH 13/15] sched/fair: Implement latency-nice Peter Zijlstra
@ 2023-06-06 14:54   ` Vincent Guittot
  2023-06-08 10:34     ` Peter Zijlstra
  2023-10-11 23:24   ` Benjamin Segall
  1 sibling, 1 reply; 104+ messages in thread
From: Vincent Guittot @ 2023-06-06 14:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault, tglx

On Wed, 31 May 2023 at 14:47, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Implement latency-nice as a modulation of the EEVDF r_i parameter,
> specifically apply the inverse sched_prio_to_weight[] relation on
> base_slice.
>
> Given a base slice of 3 [ms], this gives a range of:
>
>   latency-nice  19: 3*1024 / 15    ~= 204.8 [ms]
>   latency-nice -20: 3*1024 / 88761 ~= 0.034 [ms]

I have reread the publication. I have a question about

Theorem 1: The lag of any active client k in a steady system is
bounded as follows,
    -r_max < lag_k(d) < max(r_max, q)

and

Corollary 2: Consider a steady system and a client k such that no
request of client k is larger than a
time quantum. Then at any time t, the lag of client k is bounded as follows:
    -q < lag_k(t) < q

q being the time quantum a task can run for,
and r_max the maximum slice of any active task

I wonder how it applies to us. What is our time quanta q ? I guess
that it's the tick because it is assumed that the algorithm evaluates
which task should run next for each q interval in order to fulfill the
fairness, IIUC. So I don't think that we can assume a q shorter than the
tick (at least with current implementation) unless we trigger some
additional interrupts

Then asking for a request shorter than the tick also means that
scheduler must enqueue a new request (on behalf of the task) during
the tick and evaluate if the task is still the one to be scheduled
now. So similarly to q, the request size r should be at least a tick
in order to reevaluate which task will run next after the end of a
request. In fact, the real limit is : r/wi >= tick/(Sum wj)

On Arm64 system, tick is 4ms long and on arm32 it raises to 10ms

We can always not follow these assumptions made in the publication but
I wonder how we can then rely on its theorems and corollaries

>
> (which might not make sense)
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
>  kernel/sched/core.c  |   14 ++++++++++----
>  kernel/sched/fair.c  |   22 +++++++++++++++-------
>  kernel/sched/sched.h |    2 ++
>  3 files changed, 27 insertions(+), 11 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1305,6 +1305,12 @@ static void set_load_weight(struct task_
>         }
>  }
>
> +static inline void set_latency_prio(struct task_struct *p, int prio)
> +{
> +       p->latency_prio = prio;
> +       set_latency_fair(&p->se, prio - MAX_RT_PRIO);
> +}
> +
>  #ifdef CONFIG_UCLAMP_TASK
>  /*
>   * Serializes updates of utilization clamp values
> @@ -4464,9 +4470,10 @@ static void __sched_fork(unsigned long c
>         p->se.nr_migrations             = 0;
>         p->se.vruntime                  = 0;
>         p->se.vlag                      = 0;
> -       p->se.slice                     = sysctl_sched_base_slice;
>         INIT_LIST_HEAD(&p->se.group_node);
>
> +       set_latency_prio(p, p->latency_prio);
> +
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>         p->se.cfs_rq                    = NULL;
>  #endif
> @@ -4718,8 +4725,7 @@ int sched_fork(unsigned long clone_flags
>
>                 p->prio = p->normal_prio = p->static_prio;
>                 set_load_weight(p, false);
> -
> -               p->latency_prio = NICE_TO_PRIO(0);
> +               set_latency_prio(p, NICE_TO_PRIO(0));
>
>                 /*
>                  * We don't need the reset flag anymore after the fork. It has
> @@ -7507,7 +7513,7 @@ static void __setscheduler_latency(struc
>                                    const struct sched_attr *attr)
>  {
>         if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
> -               p->latency_prio = NICE_TO_PRIO(attr->sched_latency_nice);
> +               set_latency_prio(p, NICE_TO_PRIO(attr->sched_latency_nice));
>  }
>
>  /*
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -952,6 +952,21 @@ int sched_update_scaling(void)
>  }
>  #endif
>
> +void set_latency_fair(struct sched_entity *se, int prio)
> +{
> +       u32 weight = sched_prio_to_weight[prio];
> +       u64 base = sysctl_sched_base_slice;
> +
> +       /*
> +        * For EEVDF the virtual time slope is determined by w_i (iow.
> +        * nice) while the request time r_i is determined by
> +        * latency-nice.
> +        *
> +        * Smaller request gets better latency.
> +        */
> +       se->slice = div_u64(base << SCHED_FIXEDPOINT_SHIFT, weight);
> +}
> +
>  static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
>
>  /*
> @@ -964,13 +979,6 @@ static void update_deadline(struct cfs_r
>                 return;
>
>         /*
> -        * For EEVDF the virtual time slope is determined by w_i (iow.
> -        * nice) while the request time r_i is determined by
> -        * sysctl_sched_base_slice.
> -        */
> -       se->slice = sysctl_sched_base_slice;
> -
> -       /*
>          * EEVDF: vd_i = ve_i + r_i / w_i
>          */
>         se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2495,6 +2495,8 @@ extern unsigned int sysctl_numa_balancin
>  extern unsigned int sysctl_numa_balancing_hot_threshold;
>  #endif
>
> +extern void set_latency_fair(struct sched_entity *se, int prio);
> +
>  #ifdef CONFIG_SCHED_HRTICK
>
>  /*
>
>

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [RFC][PATCH 13/15] sched/fair: Implement latency-nice
  2023-06-06 14:54   ` Vincent Guittot
@ 2023-06-08 10:34     ` Peter Zijlstra
  2023-06-08 12:44       ` Peter Zijlstra
  0 siblings, 1 reply; 104+ messages in thread
From: Peter Zijlstra @ 2023-06-08 10:34 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault, tglx

On Tue, Jun 06, 2023 at 04:54:13PM +0200, Vincent Guittot wrote:
> On Wed, 31 May 2023 at 14:47, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Implement latency-nice as a modulation of the EEVDF r_i parameter,
> > specifically apply the inverse sched_prio_to_weight[] relation on
> > base_slice.
> >
> > Given a base slice of 3 [ms], this gives a range of:
> >
> >   latency-nice  19: 3*1024 / 15    ~= 204.8 [ms]
> >   latency-nice -20: 3*1024 / 88761 ~= 0.034 [ms]
> 
> I have reread the publication. I have a question about
> 
> Theorem 1: The lag of any active client k in a steady system is
> bounded as follows,
>     -r_max < lag_k(d) < max(r_max, q)
> 
> and
> 
> Corollary 2: Consider a steady system and a client k such that no
> request of client k is larger than a
> time quantum. Then at any time t, the lag of client k is bounded as follows:
>     -q < lag_k(t) < q
> 
> q being the time quantum a task can run for,
> and r_max the maximum slice of any active task
> 
> I wonder how it applies to us. What is our time quanta q ?

> I guess that it's the tick because it is assumed that the algorithm
> evaluates which task should run next for each q interval in order to
> fulfill the fairness, IIUC. So I don't think that we can assume a q
> shorter than the tick (at least with current implementation) unless we
> trigger some additional interrupts

Indeed, TICK_NSEC is our unit of accounting (unless HRTICK, but I've not
looked at that, it might not DTRT -- also, I still need to rewrite that
whole thing to not be so damn expensive).

> Then asking for a request shorter than the tick also means that
> scheduler must enqueue a new request (on behalf of the task) during
> the tick and evaluate if the task is still the one to be scheduled
> now.

If there is no 'interrupt', we won't update time and the scheduler can't
do anything -- as you well know. The paper only requires (and we
slightly violate this) to push forward the deadline. See the comment
with update_deadline().

Much like pure EDF without a combined CBS.

> So similarly to q, the request size r should be at least a tick
> in order to reevaluate which task will run next after the end of a
> request. In fact, the real limit is : r/wi >= tick/(Sum wj)

> We can always not follow these assumptions made in the publication but
> I wonder how we can then rely on its theorems and corollaries

Again, I'm not entirely following, the corollaries take r_i < q into
account, that's where the max(rmax, q) term comes from.

You're right in that r_i < q does not behave 'right', but it doesn't
invalidate the results. Note that if a task overshoots, it will build up
significant negative lag (right side of the tree) and won't be eligible
for its next actual period. This 'hole' in the schedule is then used to
make up for the extra time it used previously.

The much bigger problem with those bounds is this little caveat: 'in a
steady state'. They conveniently fail to consider the impact of
leave/join operations on the whole thing -- and that's a much more
interesting case.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [RFC][PATCH 15/15] sched/eevdf: Use sched_attr::sched_runtime to set request/slice
  2023-06-01 13:55   ` Vincent Guittot
@ 2023-06-08 11:52     ` Peter Zijlstra
  0 siblings, 0 replies; 104+ messages in thread
From: Peter Zijlstra @ 2023-06-08 11:52 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault, tglx

On Thu, Jun 01, 2023 at 03:55:18PM +0200, Vincent Guittot wrote:
> On Wed, 31 May 2023 at 14:47, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > As an alternative to the latency-nice interface; allow applications to
> > directly set the request/slice using sched_attr::sched_runtime.
> >
> > The implementation clamps the value to: 0.1[ms] <= slice <= 100[ms]
> > which is 1/10 the size of HZ=1000 and 10 times the size of HZ=100.
> 
> There were some discussions about the latency interface and setting a
> raw time value. The problems with using a raw time value are:

So yeah, I'm well aware of that. And I'm not saying this is a better
interface -- just an alternative.

> - what does this raw time value mean, and how does it apply to the
> scheduling latency of the task? Typically, what does setting
> sched_runtime to 1ms mean? Regarding latency, users would expect
> to be scheduled in less than 1ms, but this is not what will (always)
> happen with a sched_slice set to 1ms, whereas with the deadline
> scheduler we ensure that the task will run for sched_runtime in the
> sched_period (and before sched_deadline). So this will be confusing

Confusing only if you don't know how to look at it; users are confused
in general and that's unfixable, nature will always invent a better
moron. The best we can do is provide enough clues for someone that does
know what he's doing.

So let me start by explaining how such an interface could be used and
how to look at it.

(and because we all love steady state things; I too shall use it)

Consider 4 equal-weight always running tasks (A,B,C,D) with a default
slice of 1ms.  The perfect schedule for this is a straight up FIFO
rotation of the 4 tasks, 1ms each for a total period of 4ms.

  ABCDABCD...
  +---+---+---+---

By keeping the tasks in the same order, we ensure the max latency is the
min latency -- consistency is king. If for one period you were to
say flip the first and last tasks in the order, your max latency takes a
hit: the task that was first will now have to wait 7ms instead of its
usual 3ms.

  ABCDDBCA...
  +---+---+---+---

So far so obvious and boring..

Now, if we were to change the slice of task D to 2ms, what happens is
that it can't run the first time, because the slice rotations are 1ms
and it needs 2ms; it needs to save up and bank the first slot, so you
get a schedule like:

  ABCABCDDABCABCDD...
  +---+---+---+---+---

And here you can see that the total period becomes 8ms (N*r_max). The
period for the 1ms tasks is still 4ms -- on average, but the period for
the 2ms task is 8ms.


A more complex example would be 3 tasks: A(w=1,r=1), B(w=1,r=1),
C(w=2,r=1) [to keep the 4ms period]:

  CCABCCAB...
  +---+---+---+---

If we change the slice of B to 2 then it becomes:

  CCACCABBCCACCABB...
  +---+---+---+---+---

So the total period is W*r_max (8ms), each task will average to a period
of W*r_i and each task will get the fair share of w_i/W time over the
total period (W*r_max per previous).
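
For anyone who wants to play with the above, a stand-alone toy model of just
the pick rule (eligibility against the weighted average vruntime, then
earliest virtual deadline), where each pick runs its whole request because
everything is always runnable. It uses doubles instead of the kernel's fixed
point and a naive tie-break, so the start-up transient differs from the trace
above, but it settles into the same ABCABCDD rotation:

	#include <stdio.h>

	struct task {
		char name;
		double weight;		/* w_i */
		double request;		/* r_i [ms] */
		double vruntime;	/* v_i */
		double deadline;	/* vd_i = v_i + r_i/w_i */
	};

	static double avg_vruntime(const struct task *t, int n)
	{
		double sum = 0.0, w = 0.0;

		for (int i = 0; i < n; i++) {
			sum += t[i].vruntime * t[i].weight;
			w   += t[i].weight;
		}
		return sum / w;			/* V = \Sum v_i*w_i / W */
	}

	static struct task *pick(struct task *t, int n)
	{
		double V = avg_vruntime(t, n);
		struct task *best = NULL;

		for (int i = 0; i < n; i++) {
			if (t[i].vruntime > V)	/* not eligible (negative lag) */
				continue;
			if (!best || t[i].deadline < best->deadline)
				best = &t[i];	/* earliest eligible virtual deadline */
		}
		return best;
	}

	int main(void)
	{
		/* the 1ms/1ms/1ms/2ms example from above */
		struct task t[] = {
			{ 'A', 1, 1 }, { 'B', 1, 1 }, { 'C', 1, 1 }, { 'D', 1, 2 },
		};
		const int n = sizeof(t) / sizeof(t[0]);
		double wall = 0;

		for (int i = 0; i < n; i++)
			t[i].deadline = t[i].vruntime + t[i].request / t[i].weight;

		while (wall < 24) {			/* 24ms of wall time */
			struct task *cur = pick(t, n);

			for (int ms = 0; ms < (int)cur->request; ms++)
				putchar(cur->name);	/* one char per ms served */

			wall          += cur->request;
			cur->vruntime += cur->request / cur->weight;
			cur->deadline  = cur->vruntime + cur->request / cur->weight;
		}
		putchar('\n');				/* ABCDDABCABCDD... */
		return 0;
	}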

> - more than a runtime, we want to set a scheduling latency hint which
> would be more aligned with a deadline

We all wants ponies ;-) But seriously if you have a real deadline, use
SCHED_DEADLINE.

> - Then the user will complain that they set 1ms but their task is
> scheduled after several (or even dozens) ms in some cases. Also, you
> will probably end up with everybody setting 0.1ms and expecting 0.1ms
> latency. The latency nice like the nice give an opaque weight against
> others without any determinism that we can't respect

Now, notably I used sched_attr::sched_runtime, not _deadline nor _period.
Runtime is how long you expect each job-execution to take (WCET and all
that) in a periodic or sporadic task model.

Given this is a best effort overcommit scheduling class, we *CANNOT*
guarantee actual latency. The best we can offer is consistency (and this
is where EEVDF is *much* better than CFS).

We cannot, and must not pretend to provide a real deadline; hence we
should really not use that term in the user interface for this.


From the above examples we can see that if you ask for 1ms slices, you
get 1ms slices spaced (on average) closer together than if you were to
ask for 2ms slices -- even though they end up with the same share of
CPU-time.

Per the previous argument, the 2ms slice task has to forgo one slot in
the first period to bank and save up for a 2ms slot in a super period.

Now, if you're not a CPU hogging bully and don't use much CPU time at
all (your music player etc.), then by setting the slice length to what it
actually takes to decode the next sample buffer, you can likely get a
smaller average period.

Conversely, if you ask for a slice significantly smaller than your job
execution time, you'll see it'll get split up in smaller chunks and
suffer preemption.

> - How do you specify that you don't want to preempt others, but still
> want to keep your allocated running time?

SCHED_BATCH is what we have for that. That actually works.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [RFC][PATCH 13/15] sched/fair: Implement latency-nice
  2023-06-08 10:34     ` Peter Zijlstra
@ 2023-06-08 12:44       ` Peter Zijlstra
  0 siblings, 0 replies; 104+ messages in thread
From: Peter Zijlstra @ 2023-06-08 12:44 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault, tglx

On Thu, Jun 08, 2023 at 12:34:58PM +0200, Peter Zijlstra wrote:
> > Then asking for a request shorter than the tick also means that
> > scheduler must enqueue a new request (on behalf of the task) during
> > the tick and evaluate if the task is still the one to be scheduled
> > now.
> 
> If there is no 'interrupt', we won't update time and the scheduler can't
> do anything -- as you well know. The paper only requires (and we
> slightly violate this) to push forward the deadline. See the comment
> with update_deadline().
> 
> Much like pure EDF without a combined CBS.
> 
> > So similarly to q, the request size r should be at least a tick
> > in order to reevaluate which task will run next after the end of a
> > request. In fact, the real limit is : r/wi >= tick/(Sum wj)
> 
> > We can always not follow these assumptions made in the publication but
> > I wonder how we can then rely on its theorems and corollaries
> 
> Again, I'm not entirely following, the corollaries take r_i < q into
> account, that's where the max(rmax, q) term comes from.
> 
> You're right in that r_i < q does not behave 'right', but it doesn't
> invalidate the results. Note that if a task overshoots, it will build up
> significant negative lag (right side of the tree) and won't be eligible
> for its next actual period. This 'hole' in the schedule is then used to
> make up for the extra time it used previously.

So notably, if your task *does* behave correctly and does not consume
the full request, then it will not build up (large) negative lag and
wakeup-preemption can make it go quickly on the next period.

This is where that FUDGE hack comes in, except I got it wrong, I think
it needs to be something like:

	if (delta / W >= vslice) {
		se->vlag += vslice;
		if (se->vlag > 0)
			se->vlag = 0;
	}

To ensure it can't gain time. It's still a gruesome hack, but at least
it shouldn't be able to game the system.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 11/15] sched/eevdf: Better handle mixed slice length
  2023-05-31 11:58 ` [PATCH 11/15] sched/eevdf: Better handle mixed slice length Peter Zijlstra
  2023-06-02 13:45   ` Vincent Guittot
@ 2023-06-10  6:34   ` Chen Yu
  2023-06-10 11:22     ` Peter Zijlstra
  1 sibling, 1 reply; 104+ messages in thread
From: Chen Yu @ 2023-06-10  6:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, youssefesmat, joel,
	efault, tglx

On 2023-05-31 at 13:58:50 +0200, Peter Zijlstra wrote:
> In the case where (due to latency-nice) there are different request
> sizes in the tree, the smaller requests tend to be dominated by the
> larger. Also note how the EEVDF lag limits are based on r_max.
> 
> Therefore; add a heuristic that for the mixed request size case, moves
> smaller requests to placement strategy #2 which ensures they're
> immediately eligible and due to their smaller (virtual) deadline
> will cause preemption.
> 
> NOTE: this relies on update_entity_lag() to impose lag limits above
> a single slice.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/fair.c     |   30 ++++++++++++++++++++++++++++++
>  kernel/sched/features.h |    1 +
>  kernel/sched/sched.h    |    1 +
>  3 files changed, 32 insertions(+)
> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -642,6 +642,7 @@ avg_vruntime_add(struct cfs_rq *cfs_rq,
>  	s64 key = entity_key(cfs_rq, se);
>  
>  	cfs_rq->avg_vruntime += key * weight;
> +	cfs_rq->avg_slice += se->slice * weight;
>  	cfs_rq->avg_load += weight;
>  }
>  
> @@ -652,6 +653,7 @@ avg_vruntime_sub(struct cfs_rq *cfs_rq,
>  	s64 key = entity_key(cfs_rq, se);
>  
>  	cfs_rq->avg_vruntime -= key * weight;
> +	cfs_rq->avg_slice -= se->slice * weight;
>  	cfs_rq->avg_load -= weight;
>  }
>  
> @@ -4908,6 +4910,21 @@ static inline void update_misfit_status(
>  
>  #endif /* CONFIG_SMP */
>  
> +static inline bool
> +entity_has_slept(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> +{
> +	u64 now;
> +
> +	if (!(flags & ENQUEUE_WAKEUP))
> +		return false;
> +
> +	if (flags & ENQUEUE_MIGRATED)
> +		return true;
> +
> +	now = rq_clock_task(rq_of(cfs_rq));
> +	return (s64)(se->exec_start - now) >= se->slice;
> +}
A minor question: should it be now - se->exec_start?
(se->exec_start - now) is always negative on a local wakeup?

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 11/15] sched/eevdf: Better handle mixed slice length
  2023-06-10  6:34   ` Chen Yu
@ 2023-06-10 11:22     ` Peter Zijlstra
  0 siblings, 0 replies; 104+ messages in thread
From: Peter Zijlstra @ 2023-06-10 11:22 UTC (permalink / raw)
  To: Chen Yu
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, youssefesmat, joel,
	efault, tglx

On Sat, Jun 10, 2023 at 02:34:04PM +0800, Chen Yu wrote:

> > +static inline bool
> > +entity_has_slept(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > +{
> > +	u64 now;
> > +
> > +	if (!(flags & ENQUEUE_WAKEUP))
> > +		return false;
> > +
> > +	if (flags & ENQUEUE_MIGRATED)
> > +		return true;
> > +
> > +	now = rq_clock_task(rq_of(cfs_rq));
> > +	return (s64)(se->exec_start - now) >= se->slice;
> > +}
> A minor question, should it be now - se->exec_start ?
> (se->exec_start - now) is always negetive on local wakeup?

Yeah, also:

https://lkml.kernel.org/r/20230608124440.GB1002251@hirez.programming.kicks-ass.net

That is, it should be something along the lines of:

	delta = now - se->exec_start
	return delta/W > vslice


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 08/15] sched: Commit to EEVDF
  2023-05-31 11:58 ` [PATCH 08/15] sched: Commit to EEVDF Peter Zijlstra
@ 2023-06-16 21:23   ` Joel Fernandes
  2023-06-22 12:01     ` Ingo Molnar
  2023-08-10  7:10   ` [tip: sched/core] sched/fair: " tip-bot2 for Peter Zijlstra
  1 sibling, 1 reply; 104+ messages in thread
From: Joel Fernandes @ 2023-06-16 21:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, efault, tglx

On Wed, May 31, 2023 at 01:58:47PM +0200, Peter Zijlstra wrote:
> EEVDF is a better defined scheduling policy, as a result it has less
> heuristics/tunables. There is no compelling reason to keep CFS around.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/debug.c    |    6 
>  kernel/sched/fair.c     |  465 +++---------------------------------------------

Whether EEVDF helps us improve our CFS latency issues or not, I do like the
merits of this diffstat alone and the lesser complexity and getting rid of
those horrible knobs is kinda nice.

For ChromeOS, we are experimenting with RT as well for our high priority
threads, and the PeterZ/Juri's patches on DL-server.

One of the issues is that, on systems with a small number of CPUs and with
overloaded CPUs, the fair scheduler tends to eat up too much from high
priority threads. That's
where RT kind of shines in our testing because it just tells those lower
priority CFS buggers to STFU.

I believe EEVDF will still have those issues (as it's still a fair scheduler
if I'm not horribly mistaken).

But hey, less CFS knobs is only a good thing! And most people say fair.c is
one of the most complex beasts in the kernel so I'm all for this..

thanks,

 - Joel


>  kernel/sched/features.h |   12 -
>  kernel/sched/sched.h    |    5 
>  4 files changed, 38 insertions(+), 450 deletions(-)
> 
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -347,10 +347,7 @@ static __init int sched_init_debug(void)
>  	debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
>  #endif
>  
> -	debugfs_create_u32("latency_ns", 0644, debugfs_sched, &sysctl_sched_latency);
>  	debugfs_create_u32("min_granularity_ns", 0644, debugfs_sched, &sysctl_sched_min_granularity);
> -	debugfs_create_u32("idle_min_granularity_ns", 0644, debugfs_sched, &sysctl_sched_idle_min_granularity);
> -	debugfs_create_u32("wakeup_granularity_ns", 0644, debugfs_sched, &sysctl_sched_wakeup_granularity);
>  
>  	debugfs_create_u32("latency_warn_ms", 0644, debugfs_sched, &sysctl_resched_latency_warn_ms);
>  	debugfs_create_u32("latency_warn_once", 0644, debugfs_sched, &sysctl_resched_latency_warn_once);
> @@ -865,10 +862,7 @@ static void sched_debug_header(struct se
>  	SEQ_printf(m, "  .%-40s: %Ld\n", #x, (long long)(x))
>  #define PN(x) \
>  	SEQ_printf(m, "  .%-40s: %Ld.%06ld\n", #x, SPLIT_NS(x))
> -	PN(sysctl_sched_latency);
>  	PN(sysctl_sched_min_granularity);
> -	PN(sysctl_sched_idle_min_granularity);
> -	PN(sysctl_sched_wakeup_granularity);
>  	P(sysctl_sched_child_runs_first);
>  	P(sysctl_sched_features);
>  #undef PN
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -58,22 +58,6 @@
>  #include "autogroup.h"
>  
>  /*
> - * Targeted preemption latency for CPU-bound tasks:
> - *
> - * NOTE: this latency value is not the same as the concept of
> - * 'timeslice length' - timeslices in CFS are of variable length
> - * and have no persistent notion like in traditional, time-slice
> - * based scheduling concepts.
> - *
> - * (to see the precise effective timeslice length of your workload,
> - *  run vmstat and monitor the context-switches (cs) field)
> - *
> - * (default: 6ms * (1 + ilog(ncpus)), units: nanoseconds)
> - */
> -unsigned int sysctl_sched_latency			= 6000000ULL;
> -static unsigned int normalized_sysctl_sched_latency	= 6000000ULL;
> -
> -/*
>   * The initial- and re-scaling of tunables is configurable
>   *
>   * Options are:
> @@ -95,36 +79,11 @@ unsigned int sysctl_sched_min_granularit
>  static unsigned int normalized_sysctl_sched_min_granularity	= 750000ULL;
>  
>  /*
> - * Minimal preemption granularity for CPU-bound SCHED_IDLE tasks.
> - * Applies only when SCHED_IDLE tasks compete with normal tasks.
> - *
> - * (default: 0.75 msec)
> - */
> -unsigned int sysctl_sched_idle_min_granularity			= 750000ULL;
> -
> -/*
> - * This value is kept at sysctl_sched_latency/sysctl_sched_min_granularity
> - */
> -static unsigned int sched_nr_latency = 8;
> -
> -/*
>   * After fork, child runs first. If set to 0 (default) then
>   * parent will (try to) run first.
>   */
>  unsigned int sysctl_sched_child_runs_first __read_mostly;
>  
> -/*
> - * SCHED_OTHER wake-up granularity.
> - *
> - * This option delays the preemption effects of decoupled workloads
> - * and reduces their over-scheduling. Synchronous workloads will still
> - * have immediate wakeup/sleep latencies.
> - *
> - * (default: 1 msec * (1 + ilog(ncpus)), units: nanoseconds)
> - */
> -unsigned int sysctl_sched_wakeup_granularity			= 1000000UL;
> -static unsigned int normalized_sysctl_sched_wakeup_granularity	= 1000000UL;
> -
>  const_debug unsigned int sysctl_sched_migration_cost	= 500000UL;
>  
>  int sched_thermal_decay_shift;
> @@ -279,8 +238,6 @@ static void update_sysctl(void)
>  #define SET_SYSCTL(name) \
>  	(sysctl_##name = (factor) * normalized_sysctl_##name)
>  	SET_SYSCTL(sched_min_granularity);
> -	SET_SYSCTL(sched_latency);
> -	SET_SYSCTL(sched_wakeup_granularity);
>  #undef SET_SYSCTL
>  }
>  
> @@ -888,30 +845,6 @@ struct sched_entity *__pick_first_entity
>  	return __node_2_se(left);
>  }
>  
> -static struct sched_entity *__pick_next_entity(struct sched_entity *se)
> -{
> -	struct rb_node *next = rb_next(&se->run_node);
> -
> -	if (!next)
> -		return NULL;
> -
> -	return __node_2_se(next);
> -}
> -
> -static struct sched_entity *pick_cfs(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> -{
> -	struct sched_entity *left = __pick_first_entity(cfs_rq);
> -
> -	/*
> -	 * If curr is set we have to see if its left of the leftmost entity
> -	 * still in the tree, provided there was anything in the tree at all.
> -	 */
> -	if (!left || (curr && entity_before(curr, left)))
> -		left = curr;
> -
> -	return left;
> -}
> -
>  /*
>   * Earliest Eligible Virtual Deadline First
>   *
> @@ -1008,85 +941,15 @@ int sched_update_scaling(void)
>  {
>  	unsigned int factor = get_update_sysctl_factor();
>  
> -	sched_nr_latency = DIV_ROUND_UP(sysctl_sched_latency,
> -					sysctl_sched_min_granularity);
> -
>  #define WRT_SYSCTL(name) \
>  	(normalized_sysctl_##name = sysctl_##name / (factor))
>  	WRT_SYSCTL(sched_min_granularity);
> -	WRT_SYSCTL(sched_latency);
> -	WRT_SYSCTL(sched_wakeup_granularity);
>  #undef WRT_SYSCTL
>  
>  	return 0;
>  }
>  #endif
>  
> -/*
> - * The idea is to set a period in which each task runs once.
> - *
> - * When there are too many tasks (sched_nr_latency) we have to stretch
> - * this period because otherwise the slices get too small.
> - *
> - * p = (nr <= nl) ? l : l*nr/nl
> - */
> -static u64 __sched_period(unsigned long nr_running)
> -{
> -	if (unlikely(nr_running > sched_nr_latency))
> -		return nr_running * sysctl_sched_min_granularity;
> -	else
> -		return sysctl_sched_latency;
> -}
> -
> -static bool sched_idle_cfs_rq(struct cfs_rq *cfs_rq);
> -
> -/*
> - * We calculate the wall-time slice from the period by taking a part
> - * proportional to the weight.
> - *
> - * s = p*P[w/rw]
> - */
> -static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
> -{
> -	unsigned int nr_running = cfs_rq->nr_running;
> -	struct sched_entity *init_se = se;
> -	unsigned int min_gran;
> -	u64 slice;
> -
> -	if (sched_feat(ALT_PERIOD))
> -		nr_running = rq_of(cfs_rq)->cfs.h_nr_running;
> -
> -	slice = __sched_period(nr_running + !se->on_rq);
> -
> -	for_each_sched_entity(se) {
> -		struct load_weight *load;
> -		struct load_weight lw;
> -		struct cfs_rq *qcfs_rq;
> -
> -		qcfs_rq = cfs_rq_of(se);
> -		load = &qcfs_rq->load;
> -
> -		if (unlikely(!se->on_rq)) {
> -			lw = qcfs_rq->load;
> -
> -			update_load_add(&lw, se->load.weight);
> -			load = &lw;
> -		}
> -		slice = __calc_delta(slice, se->load.weight, load);
> -	}
> -
> -	if (sched_feat(BASE_SLICE)) {
> -		if (se_is_idle(init_se) && !sched_idle_cfs_rq(cfs_rq))
> -			min_gran = sysctl_sched_idle_min_granularity;
> -		else
> -			min_gran = sysctl_sched_min_granularity;
> -
> -		slice = max_t(u64, slice, min_gran);
> -	}
> -
> -	return slice;
> -}
> -
>  static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
>  
>  /*
> @@ -1098,35 +961,25 @@ static void update_deadline(struct cfs_r
>  	if ((s64)(se->vruntime - se->deadline) < 0)
>  		return;
>  
> -	if (sched_feat(EEVDF)) {
> -		/*
> -		 * For EEVDF the virtual time slope is determined by w_i (iow.
> -		 * nice) while the request time r_i is determined by
> -		 * sysctl_sched_min_granularity.
> -		 */
> -		se->slice = sysctl_sched_min_granularity;
> -
> -		/*
> -		 * The task has consumed its request, reschedule.
> -		 */
> -		if (cfs_rq->nr_running > 1) {
> -			resched_curr(rq_of(cfs_rq));
> -			clear_buddies(cfs_rq, se);
> -		}
> -	} else {
> -		/*
> -		 * When many tasks blow up the sched_period; it is possible
> -		 * that sched_slice() reports unusually large results (when
> -		 * many tasks are very light for example). Therefore impose a
> -		 * maximum.
> -		 */
> -		se->slice = min_t(u64, sched_slice(cfs_rq, se), sysctl_sched_latency);
> -	}
> +	/*
> +	 * For EEVDF the virtual time slope is determined by w_i (iow.
> +	 * nice) while the request time r_i is determined by
> +	 * sysctl_sched_min_granularity.
> +	 */
> +	se->slice = sysctl_sched_min_granularity;
>  
>  	/*
>  	 * EEVDF: vd_i = ve_i + r_i / w_i
>  	 */
>  	se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
> +
> +	/*
> +	 * The task has consumed its request, reschedule.
> +	 */
> +	if (cfs_rq->nr_running > 1) {
> +		resched_curr(rq_of(cfs_rq));
> +		clear_buddies(cfs_rq, se);
> +	}
>  }
>  
>  #include "pelt.h"
> @@ -5055,19 +4908,6 @@ static inline void update_misfit_status(
>  
>  #endif /* CONFIG_SMP */
>  
> -static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
> -{
> -#ifdef CONFIG_SCHED_DEBUG
> -	s64 d = se->vruntime - cfs_rq->min_vruntime;
> -
> -	if (d < 0)
> -		d = -d;
> -
> -	if (d > 3*sysctl_sched_latency)
> -		schedstat_inc(cfs_rq->nr_spread_over);
> -#endif
> -}
> -
>  static void
>  place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
>  {
> @@ -5218,7 +5058,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
>  
>  	check_schedstat_required();
>  	update_stats_enqueue_fair(cfs_rq, se, flags);
> -	check_spread(cfs_rq, se);
>  	if (!curr)
>  		__enqueue_entity(cfs_rq, se);
>  	se->on_rq = 1;
> @@ -5230,17 +5069,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
>  	}
>  }
>  
> -static void __clear_buddies_last(struct sched_entity *se)
> -{
> -	for_each_sched_entity(se) {
> -		struct cfs_rq *cfs_rq = cfs_rq_of(se);
> -		if (cfs_rq->last != se)
> -			break;
> -
> -		cfs_rq->last = NULL;
> -	}
> -}
> -
>  static void __clear_buddies_next(struct sched_entity *se)
>  {
>  	for_each_sched_entity(se) {
> @@ -5252,27 +5080,10 @@ static void __clear_buddies_next(struct
>  	}
>  }
>  
> -static void __clear_buddies_skip(struct sched_entity *se)
> -{
> -	for_each_sched_entity(se) {
> -		struct cfs_rq *cfs_rq = cfs_rq_of(se);
> -		if (cfs_rq->skip != se)
> -			break;
> -
> -		cfs_rq->skip = NULL;
> -	}
> -}
> -
>  static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -	if (cfs_rq->last == se)
> -		__clear_buddies_last(se);
> -
>  	if (cfs_rq->next == se)
>  		__clear_buddies_next(se);
> -
> -	if (cfs_rq->skip == se)
> -		__clear_buddies_skip(se);
>  }
>  
>  static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
> @@ -5330,45 +5141,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
>  		update_idle_cfs_rq_clock_pelt(cfs_rq);
>  }
>  
> -/*
> - * Preempt the current task with a newly woken task if needed:
> - */
> -static void
> -check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> -{
> -	unsigned long delta_exec;
> -	struct sched_entity *se;
> -	s64 delta;
> -
> -	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
> -	if (delta_exec > curr->slice) {
> -		resched_curr(rq_of(cfs_rq));
> -		/*
> -		 * The current task ran long enough, ensure it doesn't get
> -		 * re-elected due to buddy favours.
> -		 */
> -		clear_buddies(cfs_rq, curr);
> -		return;
> -	}
> -
> -	/*
> -	 * Ensure that a task that missed wakeup preemption by a
> -	 * narrow margin doesn't have to wait for a full slice.
> -	 * This also mitigates buddy induced latencies under load.
> -	 */
> -	if (delta_exec < sysctl_sched_min_granularity)
> -		return;
> -
> -	se = __pick_first_entity(cfs_rq);
> -	delta = curr->vruntime - se->vruntime;
> -
> -	if (delta < 0)
> -		return;
> -
> -	if (delta > curr->slice)
> -		resched_curr(rq_of(cfs_rq));
> -}
> -
>  static void
>  set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> @@ -5407,9 +5179,6 @@ set_next_entity(struct cfs_rq *cfs_rq, s
>  	se->prev_sum_exec_runtime = se->sum_exec_runtime;
>  }
>  
> -static int
> -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
> -
>  /*
>   * Pick the next process, keeping these things in mind, in this order:
>   * 1) keep things fair between processes/task groups
> @@ -5420,53 +5189,14 @@ wakeup_preempt_entity(struct sched_entit
>  static struct sched_entity *
>  pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>  {
> -	struct sched_entity *left, *se;
> -
> -	if (sched_feat(EEVDF)) {
> -		/*
> -		 * Enabling NEXT_BUDDY will affect latency but not fairness.
> -		 */
> -		if (sched_feat(NEXT_BUDDY) &&
> -		    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
> -			return cfs_rq->next;
> -
> -		return pick_eevdf(cfs_rq);
> -	}
> -
> -	se = left = pick_cfs(cfs_rq, curr);
> -
>  	/*
> -	 * Avoid running the skip buddy, if running something else can
> -	 * be done without getting too unfair.
> +	 * Enabling NEXT_BUDDY will affect latency but not fairness.
>  	 */
> -	if (cfs_rq->skip && cfs_rq->skip == se) {
> -		struct sched_entity *second;
> +	if (sched_feat(NEXT_BUDDY) &&
> +	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
> +		return cfs_rq->next;
>  
> -		if (se == curr) {
> -			second = __pick_first_entity(cfs_rq);
> -		} else {
> -			second = __pick_next_entity(se);
> -			if (!second || (curr && entity_before(curr, second)))
> -				second = curr;
> -		}
> -
> -		if (second && wakeup_preempt_entity(second, left) < 1)
> -			se = second;
> -	}
> -
> -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
> -		/*
> -		 * Someone really wants this to run. If it's not unfair, run it.
> -		 */
> -		se = cfs_rq->next;
> -	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
> -		/*
> -		 * Prefer last buddy, try to return the CPU to a preempted task.
> -		 */
> -		se = cfs_rq->last;
> -	}
> -
> -	return se;
> +	return pick_eevdf(cfs_rq);
>  }
>  
>  static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
> @@ -5483,8 +5213,6 @@ static void put_prev_entity(struct cfs_r
>  	/* throttle cfs_rqs exceeding runtime */
>  	check_cfs_rq_runtime(cfs_rq);
>  
> -	check_spread(cfs_rq, prev);
> -
>  	if (prev->on_rq) {
>  		update_stats_wait_start_fair(cfs_rq, prev);
>  		/* Put 'current' back into the tree. */
> @@ -5525,9 +5253,6 @@ entity_tick(struct cfs_rq *cfs_rq, struc
>  			hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
>  		return;
>  #endif
> -
> -	if (!sched_feat(EEVDF) && cfs_rq->nr_running > 1)
> -		check_preempt_tick(cfs_rq, curr);
>  }
>  
>  
> @@ -6561,8 +6286,7 @@ static void hrtick_update(struct rq *rq)
>  	if (!hrtick_enabled_fair(rq) || curr->sched_class != &fair_sched_class)
>  		return;
>  
> -	if (cfs_rq_of(&curr->se)->nr_running < sched_nr_latency)
> -		hrtick_start_fair(rq, curr);
> +	hrtick_start_fair(rq, curr);
>  }
>  #else /* !CONFIG_SCHED_HRTICK */
>  static inline void
> @@ -6603,17 +6327,6 @@ static int sched_idle_rq(struct rq *rq)
>  			rq->nr_running);
>  }
>  
> -/*
> - * Returns true if cfs_rq only has SCHED_IDLE entities enqueued. Note the use
> - * of idle_nr_running, which does not consider idle descendants of normal
> - * entities.
> - */
> -static bool sched_idle_cfs_rq(struct cfs_rq *cfs_rq)
> -{
> -	return cfs_rq->nr_running &&
> -		cfs_rq->nr_running == cfs_rq->idle_nr_running;
> -}
> -
>  #ifdef CONFIG_SMP
>  static int sched_idle_cpu(int cpu)
>  {
> @@ -8099,66 +7812,6 @@ balance_fair(struct rq *rq, struct task_
>  }
>  #endif /* CONFIG_SMP */
>  
> -static unsigned long wakeup_gran(struct sched_entity *se)
> -{
> -	unsigned long gran = sysctl_sched_wakeup_granularity;
> -
> -	/*
> -	 * Since its curr running now, convert the gran from real-time
> -	 * to virtual-time in his units.
> -	 *
> -	 * By using 'se' instead of 'curr' we penalize light tasks, so
> -	 * they get preempted easier. That is, if 'se' < 'curr' then
> -	 * the resulting gran will be larger, therefore penalizing the
> -	 * lighter, if otoh 'se' > 'curr' then the resulting gran will
> -	 * be smaller, again penalizing the lighter task.
> -	 *
> -	 * This is especially important for buddies when the leftmost
> -	 * task is higher priority than the buddy.
> -	 */
> -	return calc_delta_fair(gran, se);
> -}
> -
> -/*
> - * Should 'se' preempt 'curr'.
> - *
> - *             |s1
> - *        |s2
> - *   |s3
> - *         g
> - *      |<--->|c
> - *
> - *  w(c, s1) = -1
> - *  w(c, s2) =  0
> - *  w(c, s3) =  1
> - *
> - */
> -static int
> -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
> -{
> -	s64 gran, vdiff = curr->vruntime - se->vruntime;
> -
> -	if (vdiff <= 0)
> -		return -1;
> -
> -	gran = wakeup_gran(se);
> -	if (vdiff > gran)
> -		return 1;
> -
> -	return 0;
> -}
> -
> -static void set_last_buddy(struct sched_entity *se)
> -{
> -	for_each_sched_entity(se) {
> -		if (SCHED_WARN_ON(!se->on_rq))
> -			return;
> -		if (se_is_idle(se))
> -			return;
> -		cfs_rq_of(se)->last = se;
> -	}
> -}
> -
>  static void set_next_buddy(struct sched_entity *se)
>  {
>  	for_each_sched_entity(se) {
> @@ -8170,12 +7823,6 @@ static void set_next_buddy(struct sched_
>  	}
>  }
>  
> -static void set_skip_buddy(struct sched_entity *se)
> -{
> -	for_each_sched_entity(se)
> -		cfs_rq_of(se)->skip = se;
> -}
> -
>  /*
>   * Preempt the current task with a newly woken task if needed:
>   */
> @@ -8184,7 +7831,6 @@ static void check_preempt_wakeup(struct
>  	struct task_struct *curr = rq->curr;
>  	struct sched_entity *se = &curr->se, *pse = &p->se;
>  	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
> -	int scale = cfs_rq->nr_running >= sched_nr_latency;
>  	int next_buddy_marked = 0;
>  	int cse_is_idle, pse_is_idle;
>  
> @@ -8200,7 +7846,7 @@ static void check_preempt_wakeup(struct
>  	if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
>  		return;
>  
> -	if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
> +	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK)) {
>  		set_next_buddy(pse);
>  		next_buddy_marked = 1;
>  	}
> @@ -8248,44 +7894,16 @@ static void check_preempt_wakeup(struct
>  	cfs_rq = cfs_rq_of(se);
>  	update_curr(cfs_rq);
>  
> -	if (sched_feat(EEVDF)) {
> -		/*
> -		 * XXX pick_eevdf(cfs_rq) != se ?
> -		 */
> -		if (pick_eevdf(cfs_rq) == pse)
> -			goto preempt;
> -
> -		return;
> -	}
> -
> -	if (wakeup_preempt_entity(se, pse) == 1) {
> -		/*
> -		 * Bias pick_next to pick the sched entity that is
> -		 * triggering this preemption.
> -		 */
> -		if (!next_buddy_marked)
> -			set_next_buddy(pse);
> +	/*
> +	 * XXX pick_eevdf(cfs_rq) != se ?
> +	 */
> +	if (pick_eevdf(cfs_rq) == pse)
>  		goto preempt;
> -	}
>  
>  	return;
>  
>  preempt:
>  	resched_curr(rq);
> -	/*
> -	 * Only set the backward buddy when the current task is still
> -	 * on the rq. This can happen when a wakeup gets interleaved
> -	 * with schedule on the ->pre_schedule() or idle_balance()
> -	 * point, either of which can * drop the rq lock.
> -	 *
> -	 * Also, during early boot the idle thread is in the fair class,
> -	 * for obvious reasons its a bad idea to schedule back to it.
> -	 */
> -	if (unlikely(!se->on_rq || curr == rq->idle))
> -		return;
> -
> -	if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
> -		set_last_buddy(se);
>  }
>  
>  #ifdef CONFIG_SMP
> @@ -8486,8 +8104,6 @@ static void put_prev_task_fair(struct rq
>  
>  /*
>   * sched_yield() is very simple
> - *
> - * The magic of dealing with the ->skip buddy is in pick_next_entity.
>   */
>  static void yield_task_fair(struct rq *rq)
>  {
> @@ -8503,23 +8119,19 @@ static void yield_task_fair(struct rq *r
>  
>  	clear_buddies(cfs_rq, se);
>  
> -	if (sched_feat(EEVDF) || curr->policy != SCHED_BATCH) {
> -		update_rq_clock(rq);
> -		/*
> -		 * Update run-time statistics of the 'current'.
> -		 */
> -		update_curr(cfs_rq);
> -		/*
> -		 * Tell update_rq_clock() that we've just updated,
> -		 * so we don't do microscopic update in schedule()
> -		 * and double the fastpath cost.
> -		 */
> -		rq_clock_skip_update(rq);
> -	}
> -	if (sched_feat(EEVDF))
> -		se->deadline += calc_delta_fair(se->slice, se);
> +	update_rq_clock(rq);
> +	/*
> +	 * Update run-time statistics of the 'current'.
> +	 */
> +	update_curr(cfs_rq);
> +	/*
> +	 * Tell update_rq_clock() that we've just updated,
> +	 * so we don't do microscopic update in schedule()
> +	 * and double the fastpath cost.
> +	 */
> +	rq_clock_skip_update(rq);
>  
> -	set_skip_buddy(se);
> +	se->deadline += calc_delta_fair(se->slice, se);
>  }
>  
>  static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
> @@ -8762,8 +8374,7 @@ static int task_hot(struct task_struct *
>  	 * Buddy candidates are cache hot:
>  	 */
>  	if (sched_feat(CACHE_HOT_BUDDY) && env->dst_rq->nr_running &&
> -			(&p->se == cfs_rq_of(&p->se)->next ||
> -			 &p->se == cfs_rq_of(&p->se)->last))
> +	    (&p->se == cfs_rq_of(&p->se)->next))
>  		return 1;
>  
>  	if (sysctl_sched_migration_cost == -1)
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -15,13 +15,6 @@ SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
>  SCHED_FEAT(NEXT_BUDDY, false)
>  
>  /*
> - * Prefer to schedule the task that ran last (when we did
> - * wake-preempt) as that likely will touch the same data, increases
> - * cache locality.
> - */
> -SCHED_FEAT(LAST_BUDDY, true)
> -
> -/*
>   * Consider buddies to be cache hot, decreases the likeliness of a
>   * cache buddy being migrated away, increases cache locality.
>   */
> @@ -93,8 +86,3 @@ SCHED_FEAT(UTIL_EST, true)
>  SCHED_FEAT(UTIL_EST_FASTUP, true)
>  
>  SCHED_FEAT(LATENCY_WARN, false)
> -
> -SCHED_FEAT(ALT_PERIOD, true)
> -SCHED_FEAT(BASE_SLICE, true)
> -
> -SCHED_FEAT(EEVDF, true)
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -576,8 +576,6 @@ struct cfs_rq {
>  	 */
>  	struct sched_entity	*curr;
>  	struct sched_entity	*next;
> -	struct sched_entity	*last;
> -	struct sched_entity	*skip;
>  
>  #ifdef	CONFIG_SCHED_DEBUG
>  	unsigned int		nr_spread_over;
> @@ -2484,9 +2482,6 @@ extern const_debug unsigned int sysctl_s
>  extern unsigned int sysctl_sched_min_granularity;
>  
>  #ifdef CONFIG_SCHED_DEBUG
> -extern unsigned int sysctl_sched_latency;
> -extern unsigned int sysctl_sched_idle_min_granularity;
> -extern unsigned int sysctl_sched_wakeup_granularity;
>  extern int sysctl_resched_latency_warn_ms;
>  extern int sysctl_resched_latency_warn_once;
>  
> 
> 

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 08/15] sched: Commit to EEVDF
  2023-06-16 21:23   ` Joel Fernandes
@ 2023-06-22 12:01     ` Ingo Molnar
  2023-06-22 13:11       ` Joel Fernandes
  0 siblings, 1 reply; 104+ messages in thread
From: Ingo Molnar @ 2023-06-22 12:01 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Peter Zijlstra, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, efault, tglx


* Joel Fernandes <joel@joelfernandes.org> wrote:

> On Wed, May 31, 2023 at 01:58:47PM +0200, Peter Zijlstra wrote:
> > EEVDF is a better defined scheduling policy; as a result, it has fewer
> > heuristics/tunables. There is no compelling reason to keep CFS around.
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  kernel/sched/debug.c    |    6 
> >  kernel/sched/fair.c     |  465 +++---------------------------------------------
> 
> Whether EEVDF helps us improve our CFS latency issues or not, I do like the
> merits of this diffstat alone and the lesser complexity and getting rid of
> those horrible knobs is kinda nice.

To be fair, the "removal" in this patch is in significant part an 
artifact of the patch series itself, because first EEVDF bits get added by 
three earlier patches, in parallel to CFS:

 kernel/sched/fair.c     |  137 +++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/fair.c     |  162 +++++++++++++++++++++++++++++++++++++-----------
 kernel/sched/fair.c     |  338 +++++++++++++++++++++++++++++++++++++++++-------

... and then we remove the old CFS policy code in this 'commit to EEVDF' patch:

 kernel/sched/fair.c     |  465 +++---------------------------------------------

The combined diffstat is close to 50% / 50% balanced:

 kernel/sched/fair.c              | 1105 ++++++++++++++++++--------------------

But having said that, I do agree that EEVDF as submitted by Peter is better 
defined, with fewer heuristics, which is an overall win - so no complaints 
from me!

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 08/15] sched: Commit to EEVDF
  2023-06-22 12:01     ` Ingo Molnar
@ 2023-06-22 13:11       ` Joel Fernandes
  0 siblings, 0 replies; 104+ messages in thread
From: Joel Fernandes @ 2023-06-22 13:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, efault, tglx

On 6/22/23 08:01, Ingo Molnar wrote:
> 
> * Joel Fernandes <joel@joelfernandes.org> wrote:
> 
>> On Wed, May 31, 2023 at 01:58:47PM +0200, Peter Zijlstra wrote:
>>> EEVDF is a better defined scheduling policy; as a result, it has fewer
>>> heuristics/tunables. There is no compelling reason to keep CFS around.
>>>
>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>> ---
>>>   kernel/sched/debug.c    |    6
>>>   kernel/sched/fair.c     |  465 +++---------------------------------------------
>>
>> Whether EEVDF helps us improve our CFS latency issues or not, I do like the
>> merits of this diffstat alone and the lesser complexity and getting rid of
>> those horrible knobs is kinda nice.
> 
> To be fair, the "removal" in this patch is in significant part an
> artifact of the patch series itself, because first EEVDF bits get added by
> three earlier patches, in parallel to CFS:
> 
>   kernel/sched/fair.c     |  137 +++++++++++++++++++++++++++++++++++++++++++++++++--
>   kernel/sched/fair.c     |  162 +++++++++++++++++++++++++++++++++++++-----------
>   kernel/sched/fair.c     |  338 +++++++++++++++++++++++++++++++++++++++++-------
> 
> ... and then we remove the old CFS policy code in this 'commit to EEVDF' patch:
> 
>   kernel/sched/fair.c     |  465 +++---------------------------------------------
> 
> The combined diffstat is close to 50% / 50% balanced:
> 
>   kernel/sched/fair.c              | 1105 ++++++++++++++++++--------------------
> 
> But having said that, I do agree that EEVDF as submitted by Peter is better
> defined, with fewer heuristics, which is an overall win - so no complaints
> from me!

Agreed, thank you for correcting me on the statistics.

  - Joel



^ permalink raw reply	[flat|nested] 104+ messages in thread

* [tip: sched/core] sched/fair: Propagate enqueue flags into place_entity()
  2023-05-31 11:58 ` [PATCH 10/15] sched/fair: Propagate enqueue flags into place_entity() Peter Zijlstra
@ 2023-08-10  7:10   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 104+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2023-08-10  7:10 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Ingo Molnar, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     d07f09a1f99cabbc86bc5c97d962eb8a466106b5
Gitweb:        https://git.kernel.org/tip/d07f09a1f99cabbc86bc5c97d962eb8a466106b5
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Wed, 31 May 2023 13:58:49 +02:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 19 Jul 2023 09:43:59 +02:00

sched/fair: Propagate enqueue flags into place_entity()

This allows place_entity() to consider ENQUEUE_WAKEUP and
ENQUEUE_MIGRATED.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.274010996@infradead.org
---
 kernel/sched/fair.c  | 10 +++++-----
 kernel/sched/sched.h |  1 +
 2 files changed, 6 insertions(+), 5 deletions(-)
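
In short, place_entity() now receives the whole enqueue flag mask instead
of a bare 'int initial', so the different enqueue reasons can be told
apart. A minimal sketch of how the flags compose (flag values other than
ENQUEUE_INITIAL are assumed here for illustration; this is not part of
the patch):

	#define ENQUEUE_WAKEUP		0x01	/* assumed value, illustration only */
	#define ENQUEUE_MIGRATED	0x40	/* assumed; 0x00 when !CONFIG_SMP */
	#define ENQUEUE_INITIAL		0x80	/* added in the hunk below */

	static void place_entity_sketch(int flags)
	{
		if (flags & ENQUEUE_INITIAL) {
			/* fork path: task_fork_fair() now passes ENQUEUE_INITIAL,
			 * so PLACE_DEADLINE_INITIAL can halve the first vslice. */
		}
		if (flags & (ENQUEUE_WAKEUP | ENQUEUE_MIGRATED)) {
			/* wakeup/migration placement can be distinguished from a
			 * plain re-enqueue, which the old 'int initial' could not. */
		}
	}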

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 61747a2..5c8c9f7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4909,7 +4909,7 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq) {}
 #endif /* CONFIG_SMP */
 
 static void
-place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
+place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
 	u64 vslice = calc_delta_fair(se->slice, se);
 	u64 vruntime = avg_vruntime(cfs_rq);
@@ -4998,7 +4998,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 	 * on average, halfway through their slice, as such start tasks
 	 * off with half a slice to ease into the competition.
 	 */
-	if (sched_feat(PLACE_DEADLINE_INITIAL) && initial)
+	if (sched_feat(PLACE_DEADLINE_INITIAL) && (flags & ENQUEUE_INITIAL))
 		vslice /= 2;
 
 	/*
@@ -5022,7 +5022,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * update_curr().
 	 */
 	if (curr)
-		place_entity(cfs_rq, se, 0);
+		place_entity(cfs_rq, se, flags);
 
 	update_curr(cfs_rq);
 
@@ -5049,7 +5049,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * we can place the entity.
 	 */
 	if (!curr)
-		place_entity(cfs_rq, se, 0);
+		place_entity(cfs_rq, se, flags);
 
 	account_entity_enqueue(cfs_rq, se);
 
@@ -12280,7 +12280,7 @@ static void task_fork_fair(struct task_struct *p)
 	curr = cfs_rq->curr;
 	if (curr)
 		update_curr(cfs_rq);
-	place_entity(cfs_rq, se, 1);
+	place_entity(cfs_rq, se, ENQUEUE_INITIAL);
 	rq_unlock(rq, &rf);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7ff9965..db58537 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2199,6 +2199,7 @@ extern const u32		sched_prio_to_wmult[40];
 #else
 #define ENQUEUE_MIGRATED	0x00
 #endif
+#define ENQUEUE_INITIAL		0x80
 
 #define RETRY_TASK		((void *)-1UL)
 

^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [tip: sched/core] sched/debug: Rename sysctl_sched_min_granularity to sysctl_sched_base_slice
  2023-05-31 11:58 ` [PATCH 09/15] sched/debug: Rename min_granularity to base_slice Peter Zijlstra
@ 2023-08-10  7:10   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 104+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2023-08-10  7:10 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Ingo Molnar, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     e4ec3318a17f5dcf11bc23b2d2c1da4c1c5bb507
Gitweb:        https://git.kernel.org/tip/e4ec3318a17f5dcf11bc23b2d2c1da4c1c5bb507
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Wed, 31 May 2023 13:58:48 +02:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 19 Jul 2023 09:43:59 +02:00

sched/debug: Rename sysctl_sched_min_granularity to sysctl_sched_base_slice

EEVDF uses this tunable as the base request/slice -- make sure the
name reflects this.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.205287511@infradead.org
---
 kernel/sched/core.c  |  2 +-
 kernel/sched/debug.c |  4 ++--
 kernel/sched/fair.c  | 12 ++++++------
 kernel/sched/sched.h |  2 +-
 4 files changed, 10 insertions(+), 10 deletions(-)
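
The SET_SYSCTL()/WRT_SYSCTL() hunks below only work because the macros
paste the tunable name into both the sysctl_ and normalized_sysctl_
variables, which is why the rename has to be applied consistently. A
standalone sketch of the expansion (the factor value is assumed for an
8-CPU machine, i.e. 1 + ilog2(8) = 4 with the default logarithmic
scaling):

	#include <stdio.h>

	unsigned int sysctl_sched_base_slice			= 750000U;
	static unsigned int normalized_sysctl_sched_base_slice	= 750000U;

	int main(void)
	{
		unsigned int factor = 4;	/* assumed: 1 + ilog2(ncpus), 8 CPUs */

	#define SET_SYSCTL(name) \
		(sysctl_##name = (factor) * normalized_sysctl_##name)
		SET_SYSCTL(sched_base_slice);
	#undef SET_SYSCTL

		/* the macro expanded to:
		 * sysctl_sched_base_slice = 4 * normalized_sysctl_sched_base_slice */
		printf("%u\n", sysctl_sched_base_slice);	/* prints 3000000 */
		return 0;
	}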

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e85a2fd..a5d3422 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4502,7 +4502,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
 	p->se.vlag			= 0;
-	p->se.slice			= sysctl_sched_min_granularity;
+	p->se.slice			= sysctl_sched_base_slice;
 	INIT_LIST_HEAD(&p->se.group_node);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index f8d190c..4c3d0d9 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -347,7 +347,7 @@ static __init int sched_init_debug(void)
 	debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
 #endif
 
-	debugfs_create_u32("min_granularity_ns", 0644, debugfs_sched, &sysctl_sched_min_granularity);
+	debugfs_create_u32("base_slice_ns", 0644, debugfs_sched, &sysctl_sched_base_slice);
 
 	debugfs_create_u32("latency_warn_ms", 0644, debugfs_sched, &sysctl_resched_latency_warn_ms);
 	debugfs_create_u32("latency_warn_once", 0644, debugfs_sched, &sysctl_resched_latency_warn_once);
@@ -863,7 +863,7 @@ static void sched_debug_header(struct seq_file *m)
 	SEQ_printf(m, "  .%-40s: %Ld\n", #x, (long long)(x))
 #define PN(x) \
 	SEQ_printf(m, "  .%-40s: %Ld.%06ld\n", #x, SPLIT_NS(x))
-	PN(sysctl_sched_min_granularity);
+	PN(sysctl_sched_base_slice);
 	P(sysctl_sched_child_runs_first);
 	P(sysctl_sched_features);
 #undef PN
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0605eb4..61747a2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -75,8 +75,8 @@ unsigned int sysctl_sched_tunable_scaling = SCHED_TUNABLESCALING_LOG;
  *
  * (default: 0.75 msec * (1 + ilog(ncpus)), units: nanoseconds)
  */
-unsigned int sysctl_sched_min_granularity			= 750000ULL;
-static unsigned int normalized_sysctl_sched_min_granularity	= 750000ULL;
+unsigned int sysctl_sched_base_slice			= 750000ULL;
+static unsigned int normalized_sysctl_sched_base_slice	= 750000ULL;
 
 /*
  * After fork, child runs first. If set to 0 (default) then
@@ -237,7 +237,7 @@ static void update_sysctl(void)
 
 #define SET_SYSCTL(name) \
 	(sysctl_##name = (factor) * normalized_sysctl_##name)
-	SET_SYSCTL(sched_min_granularity);
+	SET_SYSCTL(sched_base_slice);
 #undef SET_SYSCTL
 }
 
@@ -943,7 +943,7 @@ int sched_update_scaling(void)
 
 #define WRT_SYSCTL(name) \
 	(normalized_sysctl_##name = sysctl_##name / (factor))
-	WRT_SYSCTL(sched_min_granularity);
+	WRT_SYSCTL(sched_base_slice);
 #undef WRT_SYSCTL
 
 	return 0;
@@ -964,9 +964,9 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	/*
 	 * For EEVDF the virtual time slope is determined by w_i (iow.
 	 * nice) while the request time r_i is determined by
-	 * sysctl_sched_min_granularity.
+	 * sysctl_sched_base_slice.
 	 */
-	se->slice = sysctl_sched_min_granularity;
+	se->slice = sysctl_sched_base_slice;
 
 	/*
 	 * EEVDF: vd_i = ve_i + r_i / w_i
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f814bb7..7ff9965 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2503,7 +2503,7 @@ extern void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags);
 extern const_debug unsigned int sysctl_sched_nr_migrate;
 extern const_debug unsigned int sysctl_sched_migration_cost;
 
-extern unsigned int sysctl_sched_min_granularity;
+extern unsigned int sysctl_sched_base_slice;
 
 #ifdef CONFIG_SCHED_DEBUG
 extern int sysctl_resched_latency_warn_ms;

^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [tip: sched/core] sched/smp: Use lag to simplify cross-runqueue placement
  2023-05-31 11:58 ` [PATCH 07/15] sched/smp: Use lag to simplify cross-runqueue placement Peter Zijlstra
@ 2023-08-10  7:10   ` tip-bot2 for Peter Zijlstra
  2023-09-12 15:32   ` [PATCH 07/15] " Sebastian Andrzej Siewior
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 104+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2023-08-10  7:10 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Ingo Molnar, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     e8f331bcc270354a803c2127c486190d33eac441
Gitweb:        https://git.kernel.org/tip/e8f331bcc270354a803c2127c486190d33eac441
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Wed, 31 May 2023 13:58:46 +02:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 19 Jul 2023 09:43:58 +02:00

sched/smp: Use lag to simplify cross-runqueue placement

Using lag is both more correct and simpler when moving between
runqueues.

Notably, min_vruntime() was invented as a cheap approximation of

avg_vruntime() for this very purpose (SMP migration). Since we now
have the real thing; use it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.068911180@infradead.org
---
 kernel/sched/fair.c | 145 +++++--------------------------------------
 1 file changed, 19 insertions(+), 126 deletions(-)
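
The shorthand behind "use lag": with V = avg_vruntime(), v_i an entity's
vruntime and w_i its weight, the series keeps

	vlag_i = V - v_i		/* virtual lag, saved in se->vlag on dequeue */
	lag_i  = w_i * (V - v_i)	/* service owed to entity i */

so a migration becomes, roughly (field names as in the series; the real
place_entity() additionally re-scales the lag when the weight changed):

	/* dequeue on the source rq */
	se->vlag = avg_vruntime(src_cfs_rq) - se->vruntime;
	/* enqueue on the destination rq */
	se->vruntime = avg_vruntime(dst_cfs_rq) - se->vlag;

which preserves the owed service across the move and replaces the old
"vruntime -= src min_vruntime ... vruntime += dst min_vruntime" dance
removed below.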

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 58798da..57e8bc1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5083,7 +5083,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 	 *
 	 * EEVDF: placement strategy #1 / #2
 	 */
-	if (sched_feat(PLACE_LAG) && cfs_rq->nr_running > 1) {
+	if (sched_feat(PLACE_LAG) && cfs_rq->nr_running) {
 		struct sched_entity *curr = cfs_rq->curr;
 		unsigned long load;
 
@@ -5172,61 +5172,21 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
 
 static inline bool cfs_bandwidth_used(void);
 
-/*
- * MIGRATION
- *
- *	dequeue
- *	  update_curr()
- *	    update_min_vruntime()
- *	  vruntime -= min_vruntime
- *
- *	enqueue
- *	  update_curr()
- *	    update_min_vruntime()
- *	  vruntime += min_vruntime
- *
- * this way the vruntime transition between RQs is done when both
- * min_vruntime are up-to-date.
- *
- * WAKEUP (remote)
- *
- *	->migrate_task_rq_fair() (p->state == TASK_WAKING)
- *	  vruntime -= min_vruntime
- *
- *	enqueue
- *	  update_curr()
- *	    update_min_vruntime()
- *	  vruntime += min_vruntime
- *
- * this way we don't have the most up-to-date min_vruntime on the originating
- * CPU and an up-to-date min_vruntime on the destination CPU.
- */
-
 static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
-	bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED);
 	bool curr = cfs_rq->curr == se;
 
 	/*
 	 * If we're the current task, we must renormalise before calling
 	 * update_curr().
 	 */
-	if (renorm && curr)
-		se->vruntime += cfs_rq->min_vruntime;
+	if (curr)
+		place_entity(cfs_rq, se, 0);
 
 	update_curr(cfs_rq);
 
 	/*
-	 * Otherwise, renormalise after, such that we're placed at the current
-	 * moment in time, instead of some random moment in the past. Being
-	 * placed in the past could significantly boost this task to the
-	 * fairness detriment of existing tasks.
-	 */
-	if (renorm && !curr)
-		se->vruntime += cfs_rq->min_vruntime;
-
-	/*
 	 * When enqueuing a sched_entity, we must:
 	 *   - Update loads to have both entity and cfs_rq synced with now.
 	 *   - For group_entity, update its runnable_weight to reflect the new
@@ -5237,11 +5197,22 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 */
 	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
 	se_update_runnable(se);
+	/*
+	 * XXX update_load_avg() above will have attached us to the pelt sum;
+	 * but update_cfs_group() here will re-adjust the weight and have to
+	 * undo/redo all that. Seems wasteful.
+	 */
 	update_cfs_group(se);
-	account_entity_enqueue(cfs_rq, se);
 
-	if (flags & ENQUEUE_WAKEUP)
+	/*
+	 * XXX now that the entity has been re-weighted, and it's lag adjusted,
+	 * we can place the entity.
+	 */
+	if (!curr)
 		place_entity(cfs_rq, se, 0);
+
+	account_entity_enqueue(cfs_rq, se);
+
 	/* Entity has migrated, no longer consider this task hot */
 	if (flags & ENQUEUE_MIGRATED)
 		se->exec_start = 0;
@@ -5346,23 +5317,12 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	clear_buddies(cfs_rq, se);
 
-	if (flags & DEQUEUE_SLEEP)
-		update_entity_lag(cfs_rq, se);
-
+	update_entity_lag(cfs_rq, se);
 	if (se != cfs_rq->curr)
 		__dequeue_entity(cfs_rq, se);
 	se->on_rq = 0;
 	account_entity_dequeue(cfs_rq, se);
 
-	/*
-	 * Normalize after update_curr(); which will also have moved
-	 * min_vruntime if @se is the one holding it back. But before doing
-	 * update_min_vruntime() again, which will discount @se's position and
-	 * can move min_vruntime forward still more.
-	 */
-	if (!(flags & DEQUEUE_SLEEP))
-		se->vruntime -= cfs_rq->min_vruntime;
-
 	/* return excess runtime on last dequeue */
 	return_cfs_rq_runtime(cfs_rq);
 
@@ -8208,18 +8168,6 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 {
 	struct sched_entity *se = &p->se;
 
-	/*
-	 * As blocked tasks retain absolute vruntime the migration needs to
-	 * deal with this by subtracting the old and adding the new
-	 * min_vruntime -- the latter is done by enqueue_entity() when placing
-	 * the task on the new runqueue.
-	 */
-	if (READ_ONCE(p->__state) == TASK_WAKING) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
-
-		se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
-	}
-
 	if (!task_on_rq_migrating(p)) {
 		remove_entity_load_avg(se);
 
@@ -12709,8 +12657,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
  */
 static void task_fork_fair(struct task_struct *p)
 {
-	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se, *curr;
+	struct cfs_rq *cfs_rq;
 	struct rq *rq = this_rq();
 	struct rq_flags rf;
 
@@ -12719,22 +12667,9 @@ static void task_fork_fair(struct task_struct *p)
 
 	cfs_rq = task_cfs_rq(current);
 	curr = cfs_rq->curr;
-	if (curr) {
+	if (curr)
 		update_curr(cfs_rq);
-		se->vruntime = curr->vruntime;
-	}
 	place_entity(cfs_rq, se, 1);
-
-	if (sysctl_sched_child_runs_first && curr && entity_before(curr, se)) {
-		/*
-		 * Upon rescheduling, sched_class::put_prev_task() will place
-		 * 'current' within the tree based on its new key value.
-		 */
-		swap(curr->vruntime, se->vruntime);
-		resched_curr(rq);
-	}
-
-	se->vruntime -= cfs_rq->min_vruntime;
 	rq_unlock(rq, &rf);
 }
 
@@ -12763,34 +12698,6 @@ prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)
 		check_preempt_curr(rq, p, 0);
 }
 
-static inline bool vruntime_normalized(struct task_struct *p)
-{
-	struct sched_entity *se = &p->se;
-
-	/*
-	 * In both the TASK_ON_RQ_QUEUED and TASK_ON_RQ_MIGRATING cases,
-	 * the dequeue_entity(.flags=0) will already have normalized the
-	 * vruntime.
-	 */
-	if (p->on_rq)
-		return true;
-
-	/*
-	 * When !on_rq, vruntime of the task has usually NOT been normalized.
-	 * But there are some cases where it has already been normalized:
-	 *
-	 * - A forked child which is waiting for being woken up by
-	 *   wake_up_new_task().
-	 * - A task which has been woken up by try_to_wake_up() and
-	 *   waiting for actually being woken up by sched_ttwu_pending().
-	 */
-	if (!se->sum_exec_runtime ||
-	    (READ_ONCE(p->__state) == TASK_WAKING && p->sched_remote_wakeup))
-		return true;
-
-	return false;
-}
-
 #ifdef CONFIG_FAIR_GROUP_SCHED
 /*
  * Propagate the changes of the sched_entity across the tg tree to make it
@@ -12861,16 +12768,6 @@ static void attach_entity_cfs_rq(struct sched_entity *se)
 static void detach_task_cfs_rq(struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-
-	if (!vruntime_normalized(p)) {
-		/*
-		 * Fix up our vruntime so that the current sleep doesn't
-		 * cause 'unlimited' sleep bonus.
-		 */
-		place_entity(cfs_rq, se, 0);
-		se->vruntime -= cfs_rq->min_vruntime;
-	}
 
 	detach_entity_cfs_rq(se);
 }
@@ -12878,12 +12775,8 @@ static void detach_task_cfs_rq(struct task_struct *p)
 static void attach_task_cfs_rq(struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 	attach_entity_cfs_rq(se);
-
-	if (!vruntime_normalized(p))
-		se->vruntime += cfs_rq->min_vruntime;
 }
 
 static void switched_from_fair(struct rq *rq, struct task_struct *p)

^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [tip: sched/core] sched/fair: Commit to EEVDF
  2023-05-31 11:58 ` [PATCH 08/15] sched: Commit to EEVDF Peter Zijlstra
  2023-06-16 21:23   ` Joel Fernandes
@ 2023-08-10  7:10   ` tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 104+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2023-08-10  7:10 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Ingo Molnar, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     5e963f2bd4654a202a8a05aa3a86cb0300b10e6c
Gitweb:        https://git.kernel.org/tip/5e963f2bd4654a202a8a05aa3a86cb0300b10e6c
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Wed, 31 May 2023 13:58:47 +02:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 19 Jul 2023 09:43:58 +02:00

sched/fair: Commit to EEVDF

EEVDF is a better defined scheduling policy; as a result, it has fewer
heuristics/tunables. There is no compelling reason to keep CFS around.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.137187212@infradead.org
---
 kernel/sched/debug.c    |   6 +-
 kernel/sched/fair.c     | 465 +++------------------------------------
 kernel/sched/features.h |  12 +-
 kernel/sched/sched.h    |   5 +-
 4 files changed, 38 insertions(+), 450 deletions(-)
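
With the CFS period/slice machinery gone, every entity simply requests the
base slice and gets a virtual deadline vd_i = ve_i + r_i / w_i (see the
retained update_deadline() below). A worked example with the default
0.75 ms base slice (before CPU-count scaling) and weights taken from the
kernel's nice-to-weight table, numbers rounded:

	r_i = sysctl_sched_min_granularity = 0.75 ms

	nice  0 (w = 1024):  vd_i = ve_i + 0.75ms * 1024/1024  = ve_i + 0.750 ms
	nice -5 (w = 3121):  vd_i = ve_i + 0.75ms * 1024/3121 ~= ve_i + 0.246 ms
	nice +5 (w =  335):  vd_i = ve_i + 0.75ms * 1024/335  ~= ve_i + 2.293 ms

	/* calc_delta_fair() performs the 1024/w_i scaling: heavier (lower nice)
	 * tasks get earlier virtual deadlines and are therefore picked sooner,
	 * yet all of them request the same 0.75 ms of wall-clock service. */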

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 18efc6d..f8d190c 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -347,10 +347,7 @@ static __init int sched_init_debug(void)
 	debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
 #endif
 
-	debugfs_create_u32("latency_ns", 0644, debugfs_sched, &sysctl_sched_latency);
 	debugfs_create_u32("min_granularity_ns", 0644, debugfs_sched, &sysctl_sched_min_granularity);
-	debugfs_create_u32("idle_min_granularity_ns", 0644, debugfs_sched, &sysctl_sched_idle_min_granularity);
-	debugfs_create_u32("wakeup_granularity_ns", 0644, debugfs_sched, &sysctl_sched_wakeup_granularity);
 
 	debugfs_create_u32("latency_warn_ms", 0644, debugfs_sched, &sysctl_resched_latency_warn_ms);
 	debugfs_create_u32("latency_warn_once", 0644, debugfs_sched, &sysctl_resched_latency_warn_once);
@@ -866,10 +863,7 @@ static void sched_debug_header(struct seq_file *m)
 	SEQ_printf(m, "  .%-40s: %Ld\n", #x, (long long)(x))
 #define PN(x) \
 	SEQ_printf(m, "  .%-40s: %Ld.%06ld\n", #x, SPLIT_NS(x))
-	PN(sysctl_sched_latency);
 	PN(sysctl_sched_min_granularity);
-	PN(sysctl_sched_idle_min_granularity);
-	PN(sysctl_sched_wakeup_granularity);
 	P(sysctl_sched_child_runs_first);
 	P(sysctl_sched_features);
 #undef PN
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 57e8bc1..0605eb4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -58,22 +58,6 @@
 #include "autogroup.h"
 
 /*
- * Targeted preemption latency for CPU-bound tasks:
- *
- * NOTE: this latency value is not the same as the concept of
- * 'timeslice length' - timeslices in CFS are of variable length
- * and have no persistent notion like in traditional, time-slice
- * based scheduling concepts.
- *
- * (to see the precise effective timeslice length of your workload,
- *  run vmstat and monitor the context-switches (cs) field)
- *
- * (default: 6ms * (1 + ilog(ncpus)), units: nanoseconds)
- */
-unsigned int sysctl_sched_latency			= 6000000ULL;
-static unsigned int normalized_sysctl_sched_latency	= 6000000ULL;
-
-/*
  * The initial- and re-scaling of tunables is configurable
  *
  * Options are:
@@ -95,36 +79,11 @@ unsigned int sysctl_sched_min_granularity			= 750000ULL;
 static unsigned int normalized_sysctl_sched_min_granularity	= 750000ULL;
 
 /*
- * Minimal preemption granularity for CPU-bound SCHED_IDLE tasks.
- * Applies only when SCHED_IDLE tasks compete with normal tasks.
- *
- * (default: 0.75 msec)
- */
-unsigned int sysctl_sched_idle_min_granularity			= 750000ULL;
-
-/*
- * This value is kept at sysctl_sched_latency/sysctl_sched_min_granularity
- */
-static unsigned int sched_nr_latency = 8;
-
-/*
  * After fork, child runs first. If set to 0 (default) then
  * parent will (try to) run first.
  */
 unsigned int sysctl_sched_child_runs_first __read_mostly;
 
-/*
- * SCHED_OTHER wake-up granularity.
- *
- * This option delays the preemption effects of decoupled workloads
- * and reduces their over-scheduling. Synchronous workloads will still
- * have immediate wakeup/sleep latencies.
- *
- * (default: 1 msec * (1 + ilog(ncpus)), units: nanoseconds)
- */
-unsigned int sysctl_sched_wakeup_granularity			= 1000000UL;
-static unsigned int normalized_sysctl_sched_wakeup_granularity	= 1000000UL;
-
 const_debug unsigned int sysctl_sched_migration_cost	= 500000UL;
 
 int sched_thermal_decay_shift;
@@ -279,8 +238,6 @@ static void update_sysctl(void)
 #define SET_SYSCTL(name) \
 	(sysctl_##name = (factor) * normalized_sysctl_##name)
 	SET_SYSCTL(sched_min_granularity);
-	SET_SYSCTL(sched_latency);
-	SET_SYSCTL(sched_wakeup_granularity);
 #undef SET_SYSCTL
 }
 
@@ -888,30 +845,6 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
 	return __node_2_se(left);
 }
 
-static struct sched_entity *__pick_next_entity(struct sched_entity *se)
-{
-	struct rb_node *next = rb_next(&se->run_node);
-
-	if (!next)
-		return NULL;
-
-	return __node_2_se(next);
-}
-
-static struct sched_entity *pick_cfs(struct cfs_rq *cfs_rq, struct sched_entity *curr)
-{
-	struct sched_entity *left = __pick_first_entity(cfs_rq);
-
-	/*
-	 * If curr is set we have to see if its left of the leftmost entity
-	 * still in the tree, provided there was anything in the tree at all.
-	 */
-	if (!left || (curr && entity_before(curr, left)))
-		left = curr;
-
-	return left;
-}
-
 /*
  * Earliest Eligible Virtual Deadline First
  *
@@ -1008,85 +941,15 @@ int sched_update_scaling(void)
 {
 	unsigned int factor = get_update_sysctl_factor();
 
-	sched_nr_latency = DIV_ROUND_UP(sysctl_sched_latency,
-					sysctl_sched_min_granularity);
-
 #define WRT_SYSCTL(name) \
 	(normalized_sysctl_##name = sysctl_##name / (factor))
 	WRT_SYSCTL(sched_min_granularity);
-	WRT_SYSCTL(sched_latency);
-	WRT_SYSCTL(sched_wakeup_granularity);
 #undef WRT_SYSCTL
 
 	return 0;
 }
 #endif
 
-/*
- * The idea is to set a period in which each task runs once.
- *
- * When there are too many tasks (sched_nr_latency) we have to stretch
- * this period because otherwise the slices get too small.
- *
- * p = (nr <= nl) ? l : l*nr/nl
- */
-static u64 __sched_period(unsigned long nr_running)
-{
-	if (unlikely(nr_running > sched_nr_latency))
-		return nr_running * sysctl_sched_min_granularity;
-	else
-		return sysctl_sched_latency;
-}
-
-static bool sched_idle_cfs_rq(struct cfs_rq *cfs_rq);
-
-/*
- * We calculate the wall-time slice from the period by taking a part
- * proportional to the weight.
- *
- * s = p*P[w/rw]
- */
-static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-	unsigned int nr_running = cfs_rq->nr_running;
-	struct sched_entity *init_se = se;
-	unsigned int min_gran;
-	u64 slice;
-
-	if (sched_feat(ALT_PERIOD))
-		nr_running = rq_of(cfs_rq)->cfs.h_nr_running;
-
-	slice = __sched_period(nr_running + !se->on_rq);
-
-	for_each_sched_entity(se) {
-		struct load_weight *load;
-		struct load_weight lw;
-		struct cfs_rq *qcfs_rq;
-
-		qcfs_rq = cfs_rq_of(se);
-		load = &qcfs_rq->load;
-
-		if (unlikely(!se->on_rq)) {
-			lw = qcfs_rq->load;
-
-			update_load_add(&lw, se->load.weight);
-			load = &lw;
-		}
-		slice = __calc_delta(slice, se->load.weight, load);
-	}
-
-	if (sched_feat(BASE_SLICE)) {
-		if (se_is_idle(init_se) && !sched_idle_cfs_rq(cfs_rq))
-			min_gran = sysctl_sched_idle_min_granularity;
-		else
-			min_gran = sysctl_sched_min_granularity;
-
-		slice = max_t(u64, slice, min_gran);
-	}
-
-	return slice;
-}
-
 static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
 
 /*
@@ -1098,35 +961,25 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	if ((s64)(se->vruntime - se->deadline) < 0)
 		return;
 
-	if (sched_feat(EEVDF)) {
-		/*
-		 * For EEVDF the virtual time slope is determined by w_i (iow.
-		 * nice) while the request time r_i is determined by
-		 * sysctl_sched_min_granularity.
-		 */
-		se->slice = sysctl_sched_min_granularity;
-
-		/*
-		 * The task has consumed its request, reschedule.
-		 */
-		if (cfs_rq->nr_running > 1) {
-			resched_curr(rq_of(cfs_rq));
-			clear_buddies(cfs_rq, se);
-		}
-	} else {
-		/*
-		 * When many tasks blow up the sched_period; it is possible
-		 * that sched_slice() reports unusually large results (when
-		 * many tasks are very light for example). Therefore impose a
-		 * maximum.
-		 */
-		se->slice = min_t(u64, sched_slice(cfs_rq, se), sysctl_sched_latency);
-	}
+	/*
+	 * For EEVDF the virtual time slope is determined by w_i (iow.
+	 * nice) while the request time r_i is determined by
+	 * sysctl_sched_min_granularity.
+	 */
+	se->slice = sysctl_sched_min_granularity;
 
 	/*
 	 * EEVDF: vd_i = ve_i + r_i / w_i
 	 */
 	se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
+
+	/*
+	 * The task has consumed its request, reschedule.
+	 */
+	if (cfs_rq->nr_running > 1) {
+		resched_curr(rq_of(cfs_rq));
+		clear_buddies(cfs_rq, se);
+	}
 }
 
 #include "pelt.h"
@@ -5055,19 +4908,6 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq) {}
 
 #endif /* CONFIG_SMP */
 
-static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-#ifdef CONFIG_SCHED_DEBUG
-	s64 d = se->vruntime - cfs_rq->min_vruntime;
-
-	if (d < 0)
-		d = -d;
-
-	if (d > 3*sysctl_sched_latency)
-		schedstat_inc(cfs_rq->nr_spread_over);
-#endif
-}
-
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
@@ -5219,7 +5059,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	check_schedstat_required();
 	update_stats_enqueue_fair(cfs_rq, se, flags);
-	check_spread(cfs_rq, se);
 	if (!curr)
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
@@ -5241,17 +5080,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	}
 }
 
-static void __clear_buddies_last(struct sched_entity *se)
-{
-	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
-		if (cfs_rq->last != se)
-			break;
-
-		cfs_rq->last = NULL;
-	}
-}
-
 static void __clear_buddies_next(struct sched_entity *se)
 {
 	for_each_sched_entity(se) {
@@ -5263,27 +5091,10 @@ static void __clear_buddies_next(struct sched_entity *se)
 	}
 }
 
-static void __clear_buddies_skip(struct sched_entity *se)
-{
-	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
-		if (cfs_rq->skip != se)
-			break;
-
-		cfs_rq->skip = NULL;
-	}
-}
-
 static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if (cfs_rq->last == se)
-		__clear_buddies_last(se);
-
 	if (cfs_rq->next == se)
 		__clear_buddies_next(se);
-
-	if (cfs_rq->skip == se)
-		__clear_buddies_skip(se);
 }
 
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
@@ -5341,45 +5152,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		update_idle_cfs_rq_clock_pelt(cfs_rq);
 }
 
-/*
- * Preempt the current task with a newly woken task if needed:
- */
-static void
-check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
-{
-	unsigned long delta_exec;
-	struct sched_entity *se;
-	s64 delta;
-
-	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
-	if (delta_exec > curr->slice) {
-		resched_curr(rq_of(cfs_rq));
-		/*
-		 * The current task ran long enough, ensure it doesn't get
-		 * re-elected due to buddy favours.
-		 */
-		clear_buddies(cfs_rq, curr);
-		return;
-	}
-
-	/*
-	 * Ensure that a task that missed wakeup preemption by a
-	 * narrow margin doesn't have to wait for a full slice.
-	 * This also mitigates buddy induced latencies under load.
-	 */
-	if (delta_exec < sysctl_sched_min_granularity)
-		return;
-
-	se = __pick_first_entity(cfs_rq);
-	delta = curr->vruntime - se->vruntime;
-
-	if (delta < 0)
-		return;
-
-	if (delta > curr->slice)
-		resched_curr(rq_of(cfs_rq));
-}
-
 static void
 set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
@@ -5418,9 +5190,6 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	se->prev_sum_exec_runtime = se->sum_exec_runtime;
 }
 
-static int
-wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
-
 /*
  * Pick the next process, keeping these things in mind, in this order:
  * 1) keep things fair between processes/task groups
@@ -5431,53 +5200,14 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
 static struct sched_entity *
 pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 {
-	struct sched_entity *left, *se;
-
-	if (sched_feat(EEVDF)) {
-		/*
-		 * Enabling NEXT_BUDDY will affect latency but not fairness.
-		 */
-		if (sched_feat(NEXT_BUDDY) &&
-		    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
-			return cfs_rq->next;
-
-		return pick_eevdf(cfs_rq);
-	}
-
-	se = left = pick_cfs(cfs_rq, curr);
-
 	/*
-	 * Avoid running the skip buddy, if running something else can
-	 * be done without getting too unfair.
+	 * Enabling NEXT_BUDDY will affect latency but not fairness.
 	 */
-	if (cfs_rq->skip && cfs_rq->skip == se) {
-		struct sched_entity *second;
+	if (sched_feat(NEXT_BUDDY) &&
+	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
+		return cfs_rq->next;
 
-		if (se == curr) {
-			second = __pick_first_entity(cfs_rq);
-		} else {
-			second = __pick_next_entity(se);
-			if (!second || (curr && entity_before(curr, second)))
-				second = curr;
-		}
-
-		if (second && wakeup_preempt_entity(second, left) < 1)
-			se = second;
-	}
-
-	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
-		/*
-		 * Someone really wants this to run. If it's not unfair, run it.
-		 */
-		se = cfs_rq->next;
-	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
-		/*
-		 * Prefer last buddy, try to return the CPU to a preempted task.
-		 */
-		se = cfs_rq->last;
-	}
-
-	return se;
+	return pick_eevdf(cfs_rq);
 }
 
 static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
@@ -5494,8 +5224,6 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 	/* throttle cfs_rqs exceeding runtime */
 	check_cfs_rq_runtime(cfs_rq);
 
-	check_spread(cfs_rq, prev);
-
 	if (prev->on_rq) {
 		update_stats_wait_start_fair(cfs_rq, prev);
 		/* Put 'current' back into the tree. */
@@ -5536,9 +5264,6 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 			hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
 		return;
 #endif
-
-	if (!sched_feat(EEVDF) && cfs_rq->nr_running > 1)
-		check_preempt_tick(cfs_rq, curr);
 }
 
 
@@ -6610,8 +6335,7 @@ static void hrtick_update(struct rq *rq)
 	if (!hrtick_enabled_fair(rq) || curr->sched_class != &fair_sched_class)
 		return;
 
-	if (cfs_rq_of(&curr->se)->nr_running < sched_nr_latency)
-		hrtick_start_fair(rq, curr);
+	hrtick_start_fair(rq, curr);
 }
 #else /* !CONFIG_SCHED_HRTICK */
 static inline void
@@ -6652,17 +6376,6 @@ static int sched_idle_rq(struct rq *rq)
 			rq->nr_running);
 }
 
-/*
- * Returns true if cfs_rq only has SCHED_IDLE entities enqueued. Note the use
- * of idle_nr_running, which does not consider idle descendants of normal
- * entities.
- */
-static bool sched_idle_cfs_rq(struct cfs_rq *cfs_rq)
-{
-	return cfs_rq->nr_running &&
-		cfs_rq->nr_running == cfs_rq->idle_nr_running;
-}
-
 #ifdef CONFIG_SMP
 static int sched_idle_cpu(int cpu)
 {
@@ -8205,66 +7918,6 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 }
 #endif /* CONFIG_SMP */
 
-static unsigned long wakeup_gran(struct sched_entity *se)
-{
-	unsigned long gran = sysctl_sched_wakeup_granularity;
-
-	/*
-	 * Since its curr running now, convert the gran from real-time
-	 * to virtual-time in his units.
-	 *
-	 * By using 'se' instead of 'curr' we penalize light tasks, so
-	 * they get preempted easier. That is, if 'se' < 'curr' then
-	 * the resulting gran will be larger, therefore penalizing the
-	 * lighter, if otoh 'se' > 'curr' then the resulting gran will
-	 * be smaller, again penalizing the lighter task.
-	 *
-	 * This is especially important for buddies when the leftmost
-	 * task is higher priority than the buddy.
-	 */
-	return calc_delta_fair(gran, se);
-}
-
-/*
- * Should 'se' preempt 'curr'.
- *
- *             |s1
- *        |s2
- *   |s3
- *         g
- *      |<--->|c
- *
- *  w(c, s1) = -1
- *  w(c, s2) =  0
- *  w(c, s3) =  1
- *
- */
-static int
-wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
-{
-	s64 gran, vdiff = curr->vruntime - se->vruntime;
-
-	if (vdiff <= 0)
-		return -1;
-
-	gran = wakeup_gran(se);
-	if (vdiff > gran)
-		return 1;
-
-	return 0;
-}
-
-static void set_last_buddy(struct sched_entity *se)
-{
-	for_each_sched_entity(se) {
-		if (SCHED_WARN_ON(!se->on_rq))
-			return;
-		if (se_is_idle(se))
-			return;
-		cfs_rq_of(se)->last = se;
-	}
-}
-
 static void set_next_buddy(struct sched_entity *se)
 {
 	for_each_sched_entity(se) {
@@ -8276,12 +7929,6 @@ static void set_next_buddy(struct sched_entity *se)
 	}
 }
 
-static void set_skip_buddy(struct sched_entity *se)
-{
-	for_each_sched_entity(se)
-		cfs_rq_of(se)->skip = se;
-}
-
 /*
  * Preempt the current task with a newly woken task if needed:
  */
@@ -8290,7 +7937,6 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	struct task_struct *curr = rq->curr;
 	struct sched_entity *se = &curr->se, *pse = &p->se;
 	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
-	int scale = cfs_rq->nr_running >= sched_nr_latency;
 	int next_buddy_marked = 0;
 	int cse_is_idle, pse_is_idle;
 
@@ -8306,7 +7952,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
 		return;
 
-	if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
+	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK)) {
 		set_next_buddy(pse);
 		next_buddy_marked = 1;
 	}
@@ -8354,44 +8000,16 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	cfs_rq = cfs_rq_of(se);
 	update_curr(cfs_rq);
 
-	if (sched_feat(EEVDF)) {
-		/*
-		 * XXX pick_eevdf(cfs_rq) != se ?
-		 */
-		if (pick_eevdf(cfs_rq) == pse)
-			goto preempt;
-
-		return;
-	}
-
-	if (wakeup_preempt_entity(se, pse) == 1) {
-		/*
-		 * Bias pick_next to pick the sched entity that is
-		 * triggering this preemption.
-		 */
-		if (!next_buddy_marked)
-			set_next_buddy(pse);
+	/*
+	 * XXX pick_eevdf(cfs_rq) != se ?
+	 */
+	if (pick_eevdf(cfs_rq) == pse)
 		goto preempt;
-	}
 
 	return;
 
 preempt:
 	resched_curr(rq);
-	/*
-	 * Only set the backward buddy when the current task is still
-	 * on the rq. This can happen when a wakeup gets interleaved
-	 * with schedule on the ->pre_schedule() or idle_balance()
-	 * point, either of which can * drop the rq lock.
-	 *
-	 * Also, during early boot the idle thread is in the fair class,
-	 * for obvious reasons its a bad idea to schedule back to it.
-	 */
-	if (unlikely(!se->on_rq || curr == rq->idle))
-		return;
-
-	if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
-		set_last_buddy(se);
 }
 
 #ifdef CONFIG_SMP
@@ -8592,8 +8210,6 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
 
 /*
  * sched_yield() is very simple
- *
- * The magic of dealing with the ->skip buddy is in pick_next_entity.
  */
 static void yield_task_fair(struct rq *rq)
 {
@@ -8609,23 +8225,19 @@ static void yield_task_fair(struct rq *rq)
 
 	clear_buddies(cfs_rq, se);
 
-	if (sched_feat(EEVDF) || curr->policy != SCHED_BATCH) {
-		update_rq_clock(rq);
-		/*
-		 * Update run-time statistics of the 'current'.
-		 */
-		update_curr(cfs_rq);
-		/*
-		 * Tell update_rq_clock() that we've just updated,
-		 * so we don't do microscopic update in schedule()
-		 * and double the fastpath cost.
-		 */
-		rq_clock_skip_update(rq);
-	}
-	if (sched_feat(EEVDF))
-		se->deadline += calc_delta_fair(se->slice, se);
+	update_rq_clock(rq);
+	/*
+	 * Update run-time statistics of the 'current'.
+	 */
+	update_curr(cfs_rq);
+	/*
+	 * Tell update_rq_clock() that we've just updated,
+	 * so we don't do microscopic update in schedule()
+	 * and double the fastpath cost.
+	 */
+	rq_clock_skip_update(rq);
 
-	set_skip_buddy(se);
+	se->deadline += calc_delta_fair(se->slice, se);
 }
 
 static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
@@ -8873,8 +8485,7 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 	 * Buddy candidates are cache hot:
 	 */
 	if (sched_feat(CACHE_HOT_BUDDY) && env->dst_rq->nr_running &&
-			(&p->se == cfs_rq_of(&p->se)->next ||
-			 &p->se == cfs_rq_of(&p->se)->last))
+	    (&p->se == cfs_rq_of(&p->se)->next))
 		return 1;
 
 	if (sysctl_sched_migration_cost == -1)
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 2a830ec..54334ca 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -15,13 +15,6 @@ SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
 SCHED_FEAT(NEXT_BUDDY, false)
 
 /*
- * Prefer to schedule the task that ran last (when we did
- * wake-preempt) as that likely will touch the same data, increases
- * cache locality.
- */
-SCHED_FEAT(LAST_BUDDY, true)
-
-/*
  * Consider buddies to be cache hot, decreases the likeliness of a
  * cache buddy being migrated away, increases cache locality.
  */
@@ -93,8 +86,3 @@ SCHED_FEAT(UTIL_EST, true)
 SCHED_FEAT(UTIL_EST_FASTUP, true)
 
 SCHED_FEAT(LATENCY_WARN, false)
-
-SCHED_FEAT(ALT_PERIOD, true)
-SCHED_FEAT(BASE_SLICE, true)
-
-SCHED_FEAT(EEVDF, true)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index aa5b293..f814bb7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -570,8 +570,6 @@ struct cfs_rq {
 	 */
 	struct sched_entity	*curr;
 	struct sched_entity	*next;
-	struct sched_entity	*last;
-	struct sched_entity	*skip;
 
 #ifdef	CONFIG_SCHED_DEBUG
 	unsigned int		nr_spread_over;
@@ -2508,9 +2506,6 @@ extern const_debug unsigned int sysctl_sched_migration_cost;
 extern unsigned int sysctl_sched_min_granularity;
 
 #ifdef CONFIG_SCHED_DEBUG
-extern unsigned int sysctl_sched_latency;
-extern unsigned int sysctl_sched_idle_min_granularity;
-extern unsigned int sysctl_sched_wakeup_granularity;
 extern int sysctl_resched_latency_warn_ms;
 extern int sysctl_resched_latency_warn_once;
 

^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [tip: sched/core] sched/fair: Commit to lag based placement
  2023-05-31 11:58 ` [PATCH 06/15] sched: Commit to lag based placement Peter Zijlstra
@ 2023-08-10  7:10   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 104+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2023-08-10  7:10 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Ingo Molnar, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     76cae9dbe185b82aeb0640aa2b73da4a8e0088ce
Gitweb:        https://git.kernel.org/tip/76cae9dbe185b82aeb0640aa2b73da4a8e0088ce
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Wed, 31 May 2023 13:58:45 +02:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 19 Jul 2023 09:43:58 +02:00

sched/fair: Commit to lag based placement

Removes the FAIR_SLEEPERS code in favour of the new LAG based
placement.

Specifically, the whole FAIR_SLEEPER thing was a very crude
approximation to make up for the lack of lag based placement,
specifically the 'service owed' part. This is important for things
like 'starve' and 'hackbench'.

One side effect of FAIR_SLEEPER is that it caused 'small' unfairness:
by always ignoring up to 'thresh' of sleep time it would give a
50%/50% time distribution for a 50% sleeper vs a 100% runner, while
strictly speaking this should (of course) result in a 33%/67% split
(as CFS will also do if the sleep period exceeds 'thresh').
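
To see where the 33%/67% figure comes from, a simple illustrative model
(assuming the 50% sleeper computes for d and then sleeps for d): while it
is runnable it competes with the 100% runner and gets half the CPU, so
its d units of work take 2d of wall time, after which it sleeps for d.
Over that 3d cycle the sleeper received d (1/3) and the runner 2d (2/3).
Discounting up to 'thresh' of the sleep hides it from the scheduler and
pushes the split back towards 50%/50%.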

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.000198861@infradead.org
---
 kernel/sched/fair.c     | 59 +----------------------------------------
 kernel/sched/features.h |  8 +-----
 2 files changed, 1 insertion(+), 66 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4d3505d..58798da 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5068,29 +5068,6 @@ static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
 #endif
 }
 
-static inline bool entity_is_long_sleeper(struct sched_entity *se)
-{
-	struct cfs_rq *cfs_rq;
-	u64 sleep_time;
-
-	if (se->exec_start == 0)
-		return false;
-
-	cfs_rq = cfs_rq_of(se);
-
-	sleep_time = rq_clock_task(rq_of(cfs_rq));
-
-	/* Happen while migrating because of clock task divergence */
-	if (sleep_time <= se->exec_start)
-		return false;
-
-	sleep_time -= se->exec_start;
-	if (sleep_time > ((1ULL << 63) / scale_load_down(NICE_0_LOAD)))
-		return true;
-
-	return false;
-}
-
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
@@ -5172,43 +5149,9 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 		if (WARN_ON_ONCE(!load))
 			load = 1;
 		lag = div_s64(lag, load);
-
-		vruntime -= lag;
-	}
-
-	if (sched_feat(FAIR_SLEEPERS)) {
-
-		/* sleeps up to a single latency don't count. */
-		if (!initial) {
-			unsigned long thresh;
-
-			if (se_is_idle(se))
-				thresh = sysctl_sched_min_granularity;
-			else
-				thresh = sysctl_sched_latency;
-
-			/*
-			 * Halve their sleep time's effect, to allow
-			 * for a gentler effect of sleepers:
-			 */
-			if (sched_feat(GENTLE_FAIR_SLEEPERS))
-				thresh >>= 1;
-
-			vruntime -= thresh;
-		}
-
-		/*
-		 * Pull vruntime of the entity being placed to the base level of
-		 * cfs_rq, to prevent boosting it if placed backwards.  If the entity
-		 * slept for a long time, don't even try to compare its vruntime with
-		 * the base as it may be too far off and the comparison may get
-		 * inversed due to s64 overflow.
-		 */
-		if (!entity_is_long_sleeper(se))
-			vruntime = max_vruntime(se->vruntime, vruntime);
 	}
 
-	se->vruntime = vruntime;
+	se->vruntime = vruntime - lag;
 
 	/*
+	 * When joining the competition, the existing tasks will be,
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 60cce1e..2a830ec 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -1,14 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
 /*
- * Only give sleepers 50% of their service deficit. This allows
- * them to run sooner, but does not allow tons of sleepers to
- * rip the spread apart.
- */
-SCHED_FEAT(FAIR_SLEEPERS, false)
-SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true)
-
-/*
  * Using the avg_vruntime, do the right thing and preserve lag across
  * sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
  */

^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [tip: sched/core] sched/fair: Implement an EEVDF-like scheduling policy
  2023-05-31 11:58 ` [PATCH 05/15] sched/fair: Implement an EEVDF like policy Peter Zijlstra
@ 2023-08-10  7:10   ` tip-bot2 for Peter Zijlstra
  2023-09-29 21:40   ` [PATCH 05/15] sched/fair: Implement an EEVDF like policy Benjamin Segall
  2023-09-30  0:09   ` [PATCH] sched/fair: fix pick_eevdf to always find the correct se Benjamin Segall
  2 siblings, 0 replies; 104+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2023-08-10  7:10 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Ingo Molnar, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     147f3efaa24182a21706bca15eab2f3f4630b5fe
Gitweb:        https://git.kernel.org/tip/147f3efaa24182a21706bca15eab2f3f4630b5fe
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Wed, 31 May 2023 13:58:44 +02:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 19 Jul 2023 09:43:58 +02:00

sched/fair: Implement an EEVDF-like scheduling policy

CFS is currently a WFQ based scheduler with only a single knob, the
weight. The addition of a second, latency oriented parameter makes
something like a WF2Q or EEVDF based scheduler a much better fit.

Specifically, EEVDF does EDF like scheduling in the left half of the
tree -- those entities that are owed service. Except because this is a
virtual time scheduler, the deadlines are in virtual time as well,
which is what allows over-subscription.

EEVDF has two parameters:

 - weight, or time-slope: which is mapped to nice just as before

 - request size, or slice length: which is used to compute
   the virtual deadline as: vd_i = ve_i + r_i/w_i

Basically, by setting a smaller slice, the deadline will be earlier
and the task will be more eligible and run earlier.

Tick driven preemption is driven by request/slice completion; while
wakeup preemption is driven by the deadline.

Because the tree is now effectively an interval tree, and the
selection is no longer 'leftmost', over-scheduling is less of a
problem.
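
As a hypothetical numeric example of vd_i = ve_i + r_i/w_i (assuming
nice-0 tasks, so the virtual and wall-clock request sizes coincide): a
task at ve_i = 100ms requesting r_i = 3ms gets vd_i = 103ms, while one
requesting r_i = 12ms gets vd_i = 112ms. While both are eligible the 3ms
request is served first, but it is also rescheduled after only 3ms of
service, trading throughput for latency.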

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.931005524@infradead.org
---
 include/linux/sched.h   |   4 +-
 kernel/sched/core.c     |   1 +-
 kernel/sched/debug.c    |   6 +-
 kernel/sched/fair.c     | 338 +++++++++++++++++++++++++++++++++------
 kernel/sched/features.h |   3 +-
 kernel/sched/sched.h    |   4 +-
 6 files changed, 308 insertions(+), 48 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ba1828b..177b3f3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -549,6 +549,9 @@ struct sched_entity {
 	/* For load-balancing: */
 	struct load_weight		load;
 	struct rb_node			run_node;
+	u64				deadline;
+	u64				min_deadline;
+
 	struct list_head		group_node;
 	unsigned int			on_rq;
 
@@ -557,6 +560,7 @@ struct sched_entity {
 	u64				prev_sum_exec_runtime;
 	u64				vruntime;
 	s64				vlag;
+	u64				slice;
 
 	u64				nr_migrations;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 84b0d47..e85a2fd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4502,6 +4502,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
 	p->se.vlag			= 0;
+	p->se.slice			= sysctl_sched_min_granularity;
 	INIT_LIST_HEAD(&p->se.group_node);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index e48d2b2..18efc6d 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -582,9 +582,13 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
 	else
 		SEQ_printf(m, " %c", task_state_to_char(p));
 
-	SEQ_printf(m, " %15s %5d %9Ld.%06ld %9Ld %5d ",
+	SEQ_printf(m, "%15s %5d %9Ld.%06ld %c %9Ld.%06ld %9Ld.%06ld %9Ld.%06ld %9Ld %5d ",
 		p->comm, task_pid_nr(p),
 		SPLIT_NS(p->se.vruntime),
+		entity_eligible(cfs_rq_of(&p->se), &p->se) ? 'E' : 'N',
+		SPLIT_NS(p->se.deadline),
+		SPLIT_NS(p->se.slice),
+		SPLIT_NS(p->se.sum_exec_runtime),
 		(long long)(p->nvcsw + p->nivcsw),
 		p->prio);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dd12ada..4d3505d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -47,6 +47,7 @@
 #include <linux/psi.h>
 #include <linux/ratelimit.h>
 #include <linux/task_work.h>
+#include <linux/rbtree_augmented.h>
 
 #include <asm/switch_to.h>
 
@@ -347,6 +348,16 @@ static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight
 	return mul_u64_u32_shr(delta_exec, fact, shift);
 }
 
+/*
+ * delta /= w
+ */
+static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
+{
+	if (unlikely(se->load.weight != NICE_0_LOAD))
+		delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
+
+	return delta;
+}
 
 const struct sched_class fair_sched_class;
 
@@ -717,11 +728,62 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
 
 /*
  * lag_i = S - s_i = w_i * (V - v_i)
+ *
+ * However, since V is approximated by the weighted average of all entities it
+ * is possible -- by addition/removal/reweight to the tree -- to move V around
+ * and end up with a larger lag than we started with.
+ *
+ * Limit this to double the slice length, with a minimum of TICK_NSEC
+ * since that is the timing granularity.
+ *
+ * EEVDF gives the following limit for a steady state system:
+ *
+ *   -r_max < lag < max(r_max, q)
+ *
+ * XXX could add max_slice to the augmented data to track this.
  */
 void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	s64 lag, limit;
+
 	SCHED_WARN_ON(!se->on_rq);
-	se->vlag = avg_vruntime(cfs_rq) - se->vruntime;
+	lag = avg_vruntime(cfs_rq) - se->vruntime;
+
+	limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
+	se->vlag = clamp(lag, -limit, limit);
+}
+
+/*
+ * Entity is eligible once it received less service than it ought to have,
+ * eg. lag >= 0.
+ *
+ * lag_i = S - s_i = w_i*(V - v_i)
+ *
+ * lag_i >= 0 -> V >= v_i
+ *
+ *     \Sum (v_i - v)*w_i
+ * V = ------------------ + v
+ *          \Sum w_i
+ *
+ * lag_i >= 0 -> \Sum (v_i - v)*w_i >= (v_i - v)*(\Sum w_i)
+ *
+ * Note: using 'avg_vruntime() > se->vruntime' is inaccurate due
+ *       to the loss in precision caused by the division.
+ */
+int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	struct sched_entity *curr = cfs_rq->curr;
+	s64 avg = cfs_rq->avg_vruntime;
+	long load = cfs_rq->avg_load;
+
+	if (curr && curr->on_rq) {
+		unsigned long weight = scale_load_down(curr->load.weight);
+
+		avg += entity_key(cfs_rq, curr) * weight;
+		load += weight;
+	}
+
+	return avg >= entity_key(cfs_rq, se) * load;
 }
 
 static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
@@ -740,8 +802,8 @@ static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
 
 static void update_min_vruntime(struct cfs_rq *cfs_rq)
 {
+	struct sched_entity *se = __pick_first_entity(cfs_rq);
 	struct sched_entity *curr = cfs_rq->curr;
-	struct rb_node *leftmost = rb_first_cached(&cfs_rq->tasks_timeline);
 
 	u64 vruntime = cfs_rq->min_vruntime;
 
@@ -752,9 +814,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
 			curr = NULL;
 	}
 
-	if (leftmost) { /* non-empty tree */
-		struct sched_entity *se = __node_2_se(leftmost);
-
+	if (se) {
 		if (!curr)
 			vruntime = se->vruntime;
 		else
@@ -771,18 +831,50 @@ static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
 	return entity_before(__node_2_se(a), __node_2_se(b));
 }
 
+#define deadline_gt(field, lse, rse) ({ (s64)((lse)->field - (rse)->field) > 0; })
+
+static inline void __update_min_deadline(struct sched_entity *se, struct rb_node *node)
+{
+	if (node) {
+		struct sched_entity *rse = __node_2_se(node);
+		if (deadline_gt(min_deadline, se, rse))
+			se->min_deadline = rse->min_deadline;
+	}
+}
+
+/*
+ * se->min_deadline = min(se->deadline, left->min_deadline, right->min_deadline)
+ */
+static inline bool min_deadline_update(struct sched_entity *se, bool exit)
+{
+	u64 old_min_deadline = se->min_deadline;
+	struct rb_node *node = &se->run_node;
+
+	se->min_deadline = se->deadline;
+	__update_min_deadline(se, node->rb_right);
+	__update_min_deadline(se, node->rb_left);
+
+	return se->min_deadline == old_min_deadline;
+}
+
+RB_DECLARE_CALLBACKS(static, min_deadline_cb, struct sched_entity,
+		     run_node, min_deadline, min_deadline_update);
+
 /*
  * Enqueue an entity into the rb-tree:
  */
 static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	avg_vruntime_add(cfs_rq, se);
-	rb_add_cached(&se->run_node, &cfs_rq->tasks_timeline, __entity_less);
+	se->min_deadline = se->deadline;
+	rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
+				__entity_less, &min_deadline_cb);
 }
 
 static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline);
+	rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
+				  &min_deadline_cb);
 	avg_vruntime_sub(cfs_rq, se);
 }
 
@@ -806,6 +898,97 @@ static struct sched_entity *__pick_next_entity(struct sched_entity *se)
 	return __node_2_se(next);
 }
 
+static struct sched_entity *pick_cfs(struct cfs_rq *cfs_rq, struct sched_entity *curr)
+{
+	struct sched_entity *left = __pick_first_entity(cfs_rq);
+
+	/*
+	 * If curr is set we have to see if it's left of the leftmost entity
+	 * still in the tree, provided there was anything in the tree at all.
+	 */
+	if (!left || (curr && entity_before(curr, left)))
+		left = curr;
+
+	return left;
+}
+
+/*
+ * Earliest Eligible Virtual Deadline First
+ *
+ * In order to provide latency guarantees for different request sizes
+ * EEVDF selects the best runnable task from two criteria:
+ *
+ *  1) the task must be eligible (must be owed service)
+ *
+ *  2) from those tasks that meet 1), we select the one
+ *     with the earliest virtual deadline.
+ *
+ * We can do this in O(log n) time due to an augmented RB-tree. The
+ * tree keeps the entries sorted on service, but also functions as a
+ * heap based on the deadline by keeping:
+ *
+ *  se->min_deadline = min(se->deadline, se->{left,right}->min_deadline)
+ *
+ * Which allows an EDF like search on (sub)trees.
+ */
+static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
+{
+	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
+	struct sched_entity *curr = cfs_rq->curr;
+	struct sched_entity *best = NULL;
+
+	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
+		curr = NULL;
+
+	while (node) {
+		struct sched_entity *se = __node_2_se(node);
+
+		/*
+		 * If this entity is not eligible, try the left subtree.
+		 */
+		if (!entity_eligible(cfs_rq, se)) {
+			node = node->rb_left;
+			continue;
+		}
+
+		/*
+		 * If this entity has an earlier deadline than the previous
+		 * best, take this one. If it also has the earliest deadline
+		 * of its subtree, we're done.
+		 */
+		if (!best || deadline_gt(deadline, best, se)) {
+			best = se;
+			if (best->deadline == best->min_deadline)
+				break;
+		}
+
+		/*
+		 * If the earliest deadline in this subtree is in the fully
+		 * eligible left half of our space, go there.
+		 */
+		if (node->rb_left &&
+		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
+			node = node->rb_left;
+			continue;
+		}
+
+		node = node->rb_right;
+	}
+
+	if (!best || (curr && deadline_gt(deadline, best, curr)))
+		best = curr;
+
+	if (unlikely(!best)) {
+		struct sched_entity *left = __pick_first_entity(cfs_rq);
+		if (left) {
+			pr_err("EEVDF scheduling fail, picking leftmost\n");
+			return left;
+		}
+	}
+
+	return best;
+}
+
 #ifdef CONFIG_SCHED_DEBUG
 struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
 {
@@ -840,17 +1023,6 @@ int sched_update_scaling(void)
 #endif
 
 /*
- * delta /= w
- */
-static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
-{
-	if (unlikely(se->load.weight != NICE_0_LOAD))
-		delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
-
-	return delta;
-}
-
-/*
  * The idea is to set a period in which each task runs once.
  *
  * When there are too many tasks (sched_nr_latency) we have to stretch
@@ -915,6 +1087,48 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	return slice;
 }
 
+static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
+
+/*
+ * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
+ * this is probably good enough.
+ */
+static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	if ((s64)(se->vruntime - se->deadline) < 0)
+		return;
+
+	if (sched_feat(EEVDF)) {
+		/*
+		 * For EEVDF the virtual time slope is determined by w_i (iow.
+		 * nice) while the request time r_i is determined by
+		 * sysctl_sched_min_granularity.
+		 */
+		se->slice = sysctl_sched_min_granularity;
+
+		/*
+		 * The task has consumed its request, reschedule.
+		 */
+		if (cfs_rq->nr_running > 1) {
+			resched_curr(rq_of(cfs_rq));
+			clear_buddies(cfs_rq, se);
+		}
+	} else {
+		/*
+		 * When many tasks blow up the sched_period; it is possible
+		 * that sched_slice() reports unusually large results (when
+		 * many tasks are very light for example). Therefore impose a
+		 * maximum.
+		 */
+		se->slice = min_t(u64, sched_slice(cfs_rq, se), sysctl_sched_latency);
+	}
+
+	/*
+	 * EEVDF: vd_i = ve_i + r_i / w_i
+	 */
+	se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
+}
+
 #include "pelt.h"
 #ifdef CONFIG_SMP
 
@@ -1047,6 +1261,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	schedstat_add(cfs_rq->exec_clock, delta_exec);
 
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
+	update_deadline(cfs_rq, curr);
 	update_min_vruntime(cfs_rq);
 
 	if (entity_is_task(curr)) {
@@ -3521,6 +3736,14 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 		 * we need to scale se->vlag when w_i changes.
 		 */
 		se->vlag = div_s64(se->vlag * old_weight, weight);
+	} else {
+		s64 deadline = se->deadline - se->vruntime;
+		/*
+		 * When the weight changes, the virtual time slope changes and
+		 * we should adjust the relative virtual deadline accordingly.
+		 */
+		deadline = div_s64(deadline * old_weight, weight);
+		se->deadline = se->vruntime + deadline;
 	}
 
 #ifdef CONFIG_SMP
@@ -4871,6 +5094,7 @@ static inline bool entity_is_long_sleeper(struct sched_entity *se)
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
+	u64 vslice = calc_delta_fair(se->slice, se);
 	u64 vruntime = avg_vruntime(cfs_rq);
 	s64 lag = 0;
 
@@ -4942,9 +5166,9 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 		 */
 		load = cfs_rq->avg_load;
 		if (curr && curr->on_rq)
-			load += curr->load.weight;
+			load += scale_load_down(curr->load.weight);
 
-		lag *= load + se->load.weight;
+		lag *= load + scale_load_down(se->load.weight);
 		if (WARN_ON_ONCE(!load))
 			load = 1;
 		lag = div_s64(lag, load);
@@ -4985,6 +5209,19 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 	}
 
 	se->vruntime = vruntime;
+
+	/*
+	 * When joining the competition, the existing tasks will be,
+	 * on average, halfway through their slice, as such start tasks
+	 * off with half a slice to ease into the competition.
+	 */
+	if (sched_feat(PLACE_DEADLINE_INITIAL) && initial)
+		vslice /= 2;
+
+	/*
+	 * EEVDF: vd_i = ve_i + r_i/w_i
+	 */
+	se->deadline = se->vruntime + vslice;
 }
 
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
@@ -5207,19 +5444,12 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 static void
 check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 {
-	unsigned long ideal_runtime, delta_exec;
+	unsigned long delta_exec;
 	struct sched_entity *se;
 	s64 delta;
 
-	/*
-	 * When many tasks blow up the sched_period; it is possible that
-	 * sched_slice() reports unusually large results (when many tasks are
-	 * very light for example). Therefore impose a maximum.
-	 */
-	ideal_runtime = min_t(u64, sched_slice(cfs_rq, curr), sysctl_sched_latency);
-
 	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
-	if (delta_exec > ideal_runtime) {
+	if (delta_exec > curr->slice) {
 		resched_curr(rq_of(cfs_rq));
 		/*
 		 * The current task ran long enough, ensure it doesn't get
@@ -5243,7 +5473,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	if (delta < 0)
 		return;
 
-	if (delta > ideal_runtime)
+	if (delta > curr->slice)
 		resched_curr(rq_of(cfs_rq));
 }
 
@@ -5298,17 +5528,20 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
 static struct sched_entity *
 pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 {
-	struct sched_entity *left = __pick_first_entity(cfs_rq);
-	struct sched_entity *se;
+	struct sched_entity *left, *se;
 
-	/*
-	 * If curr is set we have to see if its left of the leftmost entity
-	 * still in the tree, provided there was anything in the tree at all.
-	 */
-	if (!left || (curr && entity_before(curr, left)))
-		left = curr;
+	if (sched_feat(EEVDF)) {
+		/*
+		 * Enabling NEXT_BUDDY will affect latency but not fairness.
+		 */
+		if (sched_feat(NEXT_BUDDY) &&
+		    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
+			return cfs_rq->next;
+
+		return pick_eevdf(cfs_rq);
+	}
 
-	se = left; /* ideally we run the leftmost entity */
+	se = left = pick_cfs(cfs_rq, curr);
 
 	/*
 	 * Avoid running the skip buddy, if running something else can
@@ -5401,7 +5634,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 		return;
 #endif
 
-	if (cfs_rq->nr_running > 1)
+	if (!sched_feat(EEVDF) && cfs_rq->nr_running > 1)
 		check_preempt_tick(cfs_rq, curr);
 }
 
@@ -6445,13 +6678,12 @@ static inline void unthrottle_offline_cfs_rqs(struct rq *rq) {}
 static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 	SCHED_WARN_ON(task_rq(p) != rq);
 
 	if (rq->cfs.h_nr_running > 1) {
-		u64 slice = sched_slice(cfs_rq, se);
 		u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+		u64 slice = se->slice;
 		s64 delta = slice - ran;
 
 		if (delta < 0) {
@@ -8228,7 +8460,19 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	if (cse_is_idle != pse_is_idle)
 		return;
 
-	update_curr(cfs_rq_of(se));
+	cfs_rq = cfs_rq_of(se);
+	update_curr(cfs_rq);
+
+	if (sched_feat(EEVDF)) {
+		/*
+		 * XXX pick_eevdf(cfs_rq) != se ?
+		 */
+		if (pick_eevdf(cfs_rq) == pse)
+			goto preempt;
+
+		return;
+	}
+
 	if (wakeup_preempt_entity(se, pse) == 1) {
 		/*
 		 * Bias pick_next to pick the sched entity that is
@@ -8474,7 +8718,7 @@ static void yield_task_fair(struct rq *rq)
 
 	clear_buddies(cfs_rq, se);
 
-	if (curr->policy != SCHED_BATCH) {
+	if (sched_feat(EEVDF) || curr->policy != SCHED_BATCH) {
 		update_rq_clock(rq);
 		/*
 		 * Update run-time statistics of the 'current'.
@@ -8487,6 +8731,8 @@ static void yield_task_fair(struct rq *rq)
 		 */
 		rq_clock_skip_update(rq);
 	}
+	if (sched_feat(EEVDF))
+		se->deadline += calc_delta_fair(se->slice, se);
 
 	set_skip_buddy(se);
 }
@@ -12363,8 +12609,8 @@ static void rq_offline_fair(struct rq *rq)
 static inline bool
 __entity_slice_used(struct sched_entity *se, int min_nr_tasks)
 {
-	u64 slice = sched_slice(cfs_rq_of(se), se);
 	u64 rtime = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+	u64 slice = se->slice;
 
 	return (rtime * min_nr_tasks > slice);
 }
@@ -13059,7 +13305,7 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
 	 * idle runqueue:
 	 */
 	if (rq->cfs.load.weight)
-		rr_interval = NS_TO_JIFFIES(sched_slice(cfs_rq_of(se), se));
+		rr_interval = NS_TO_JIFFIES(se->slice);
 
 	return rr_interval;
 }
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 7958a10..60cce1e 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -13,6 +13,7 @@ SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true)
  * sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
  */
 SCHED_FEAT(PLACE_LAG, true)
+SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
 
 /*
  * Prefer to schedule the task we woke last (assuming it failed
@@ -103,3 +104,5 @@ SCHED_FEAT(LATENCY_WARN, false)
 
 SCHED_FEAT(ALT_PERIOD, true)
 SCHED_FEAT(BASE_SLICE, true)
+
+SCHED_FEAT(EEVDF, true)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 52a0a4b..aa5b293 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2505,9 +2505,10 @@ extern void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags);
 extern const_debug unsigned int sysctl_sched_nr_migrate;
 extern const_debug unsigned int sysctl_sched_migration_cost;
 
+extern unsigned int sysctl_sched_min_granularity;
+
 #ifdef CONFIG_SCHED_DEBUG
 extern unsigned int sysctl_sched_latency;
-extern unsigned int sysctl_sched_min_granularity;
 extern unsigned int sysctl_sched_idle_min_granularity;
 extern unsigned int sysctl_sched_wakeup_granularity;
 extern int sysctl_resched_latency_warn_ms;
@@ -3487,5 +3488,6 @@ static inline void init_sched_mm_cid(struct task_struct *t) { }
 #endif
 
 extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
+extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
 
 #endif /* _KERNEL_SCHED_SCHED_H */

^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [tip: sched/core] rbtree: Add rb_add_augmented_cached() helper
  2023-05-31 11:58 ` [PATCH 04/15] rbtree: Add rb_add_augmented_cached() helper Peter Zijlstra
@ 2023-08-10  7:10   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 104+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2023-08-10  7:10 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Ingo Molnar, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     99d4d26551b56f4e523dd04e4970b94aa796a64e
Gitweb:        https://git.kernel.org/tip/99d4d26551b56f4e523dd04e4970b94aa796a64e
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Wed, 31 May 2023 13:58:43 +02:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 19 Jul 2023 09:43:58 +02:00

rbtree: Add rb_add_augmented_cached() helper

Updating the augmented data while going down the tree during lookup
would be faster, but alas the augment interface does not currently
allow for that. Provide a (slightly sub-optimal) generic helper to
add a node to an augmented cached tree.
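
For reference, a minimal sketch of how a caller might use the helper,
modelled on the sched/fair usage elsewhere in this thread; 'struct item'
and its max_key augment are hypothetical and untested, purely
illustrative:

#include <linux/rbtree_augmented.h>
#include <linux/minmax.h>

struct item {
	struct rb_node	node;
	u64		key;
	u64		max_key;	/* max(key) over this subtree */
};

#define __node_2_item(n) rb_entry((n), struct item, node)

static inline bool item_less(struct rb_node *a, const struct rb_node *b)
{
	return __node_2_item(a)->key < __node_2_item(b)->key;
}

/* Recompute max_key; return true if unchanged to stop propagation. */
static inline bool item_max_update(struct item *it, bool exit)
{
	u64 old = it->max_key;
	struct rb_node *node = &it->node;

	it->max_key = it->key;
	if (node->rb_left)
		it->max_key = max(it->max_key,
				  __node_2_item(node->rb_left)->max_key);
	if (node->rb_right)
		it->max_key = max(it->max_key,
				  __node_2_item(node->rb_right)->max_key);

	return it->max_key == old;
}

RB_DECLARE_CALLBACKS(static, item_cb, struct item,
		     node, max_key, item_max_update);

static void item_insert(struct item *it, struct rb_root_cached *tree)
{
	it->max_key = it->key;
	rb_add_augmented_cached(&it->node, tree, item_less, &item_cb);
}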

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.862983648@infradead.org
---
 include/linux/rbtree_augmented.h | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/include/linux/rbtree_augmented.h b/include/linux/rbtree_augmented.h
index 7ee7ed5..6dbc5a1 100644
--- a/include/linux/rbtree_augmented.h
+++ b/include/linux/rbtree_augmented.h
@@ -60,6 +60,32 @@ rb_insert_augmented_cached(struct rb_node *node,
 	rb_insert_augmented(node, &root->rb_root, augment);
 }
 
+static __always_inline struct rb_node *
+rb_add_augmented_cached(struct rb_node *node, struct rb_root_cached *tree,
+			bool (*less)(struct rb_node *, const struct rb_node *),
+			const struct rb_augment_callbacks *augment)
+{
+	struct rb_node **link = &tree->rb_root.rb_node;
+	struct rb_node *parent = NULL;
+	bool leftmost = true;
+
+	while (*link) {
+		parent = *link;
+		if (less(node, parent)) {
+			link = &parent->rb_left;
+		} else {
+			link = &parent->rb_right;
+			leftmost = false;
+		}
+	}
+
+	rb_link_node(node, parent, link);
+	augment->propagate(parent, NULL); /* suboptimal */
+	rb_insert_augmented_cached(node, tree, leftmost, augment);
+
+	return leftmost ? node : NULL;
+}
+
 /*
  * Template for declaring augmented rbtree callbacks (generic case)
  *

^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [tip: sched/core] sched/fair: Add lag based placement
  2023-05-31 11:58 ` [PATCH 03/15] sched/fair: Add lag based placement Peter Zijlstra
@ 2023-08-10  7:10   ` tip-bot2 for Peter Zijlstra
  2023-10-11 12:00   ` [PATCH 03/15] " Abel Wu
  2023-10-12 19:15   ` Benjamin Segall
  2 siblings, 0 replies; 104+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2023-08-10  7:10 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Ingo Molnar, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     86bfbb7ce4f67a88df2639198169b685668e7349
Gitweb:        https://git.kernel.org/tip/86bfbb7ce4f67a88df2639198169b685668e7349
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Wed, 31 May 2023 13:58:42 +02:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 19 Jul 2023 09:43:58 +02:00

sched/fair: Add lag based placement

With the introduction of avg_vruntime, it is possible to approximate
lag (the entire purpose of introducing it in fact). Use this to do lag
based placement over sleep+wake.

Specifically, the FAIR_SLEEPERS thing places things too far to the
left and messes up the deadline aspect of EEVDF.
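
As a worked example of the compensation derived in the place_entity()
comment below (hypothetical numbers): with W = 2048 (two nice-0 entities
already queued) and w_i = 1024, a task that should end up with
vl'_i = 2ms of lag after placement is placed with
vl_i = (W + w_i)*vl'_i / W = 3072*2ms/2048 = 3ms; adding it then moves V
back by w_i*vl_i/(W + w_i) = 1ms, leaving the effective lag at the
intended 2ms.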

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.794929315@infradead.org
---
 include/linux/sched.h   |   3 +-
 kernel/sched/core.c     |   1 +-
 kernel/sched/fair.c     | 168 ++++++++++++++++++++++++++++++---------
 kernel/sched/features.h |   8 ++-
 4 files changed, 141 insertions(+), 39 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2aab7be..ba1828b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -554,8 +554,9 @@ struct sched_entity {
 
 	u64				exec_start;
 	u64				sum_exec_runtime;
-	u64				vruntime;
 	u64				prev_sum_exec_runtime;
+	u64				vruntime;
+	s64				vlag;
 
 	u64				nr_migrations;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 83e3654..84b0d47 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4501,6 +4501,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->se.prev_sum_exec_runtime	= 0;
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
+	p->se.vlag			= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fc43482..dd12ada 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -715,6 +715,15 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
 	return cfs_rq->min_vruntime + avg;
 }
 
+/*
+ * lag_i = S - s_i = w_i * (V - v_i)
+ */
+void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	SCHED_WARN_ON(!se->on_rq);
+	se->vlag = avg_vruntime(cfs_rq) - se->vruntime;
+}
+
 static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
 {
 	u64 min_vruntime = cfs_rq->min_vruntime;
@@ -3492,6 +3501,8 @@ dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 			    unsigned long weight)
 {
+	unsigned long old_weight = se->load.weight;
+
 	if (se->on_rq) {
 		/* commit outstanding execution time */
 		if (cfs_rq->curr == se)
@@ -3504,6 +3515,14 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 
 	update_load_set(&se->load, weight);
 
+	if (!se->on_rq) {
+		/*
+		 * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i),
+		 * we need to scale se->vlag when w_i changes.
+		 */
+		se->vlag = div_s64(se->vlag * old_weight, weight);
+	}
+
 #ifdef CONFIG_SMP
 	do {
 		u32 divider = get_pelt_divider(&se->avg);
@@ -4853,49 +4872,119 @@ static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
 	u64 vruntime = avg_vruntime(cfs_rq);
+	s64 lag = 0;
 
-	/* sleeps up to a single latency don't count. */
-	if (!initial) {
-		unsigned long thresh;
+	/*
+	 * Due to how V is constructed as the weighted average of entities,
+	 * adding tasks with positive lag, or removing tasks with negative lag
+	 * will move 'time' backwards, this can screw around with the lag of
+	 * other tasks.
+	 *
+	 * EEVDF: placement strategy #1 / #2
+	 */
+	if (sched_feat(PLACE_LAG) && cfs_rq->nr_running > 1) {
+		struct sched_entity *curr = cfs_rq->curr;
+		unsigned long load;
 
-		if (se_is_idle(se))
-			thresh = sysctl_sched_min_granularity;
-		else
-			thresh = sysctl_sched_latency;
+		lag = se->vlag;
 
 		/*
-		 * Halve their sleep time's effect, to allow
-		 * for a gentler effect of sleepers:
+		 * If we want to place a task and preserve lag, we have to
+		 * consider the effect of the new entity on the weighted
+		 * average and compensate for this, otherwise lag can quickly
+		 * evaporate.
+		 *
+		 * Lag is defined as:
+		 *
+		 *   lag_i = S - s_i = w_i * (V - v_i)
+		 *
+		 * To avoid the 'w_i' term all over the place, we only track
+		 * the virtual lag:
+		 *
+		 *   vl_i = V - v_i <=> v_i = V - vl_i
+		 *
+		 * And we take V to be the weighted average of all v:
+		 *
+		 *   V = (\Sum w_j*v_j) / W
+		 *
+		 * Where W is: \Sum w_j
+		 *
+		 * Then, the weighted average after adding an entity with lag
+		 * vl_i is given by:
+		 *
+		 *   V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
+		 *      = (W*V + w_i*(V - vl_i)) / (W + w_i)
+		 *      = (W*V + w_i*V - w_i*vl_i) / (W + w_i)
+		 *      = (V*(W + w_i) - w_i*vl_i) / (W + w_i)
+		 *      = V - w_i*vl_i / (W + w_i)
+		 *
+		 * And the actual lag after adding an entity with vl_i is:
+		 *
+		 *   vl'_i = V' - v_i
+		 *         = V - w_i*vl_i / (W + w_i) - (V - vl_i)
+		 *         = vl_i - w_i*vl_i / (W + w_i)
+		 *
+		 * Which is strictly less than vl_i. So in order to preserve lag
+		 * we should inflate the lag before placement such that the
+		 * effective lag after placement comes out right.
+		 *
+		 * As such, invert the above relation for vl'_i to get the vl_i
+		 * we need to use such that the lag after placement is the lag
+		 * we computed before dequeue.
+		 *
+		 *   vl'_i = vl_i - w_i*vl_i / (W + w_i)
+		 *         = ((W + w_i)*vl_i - w_i*vl_i) / (W + w_i)
+		 *
+		 *   (W + w_i)*vl'_i = (W + w_i)*vl_i - w_i*vl_i
+		 *                   = W*vl_i
+		 *
+		 *   vl_i = (W + w_i)*vl'_i / W
 		 */
-		if (sched_feat(GENTLE_FAIR_SLEEPERS))
-			thresh >>= 1;
-
-		vruntime -= thresh;
-	}
-
-	/*
-	 * Pull vruntime of the entity being placed to the base level of
-	 * cfs_rq, to prevent boosting it if placed backwards.
-	 * However, min_vruntime can advance much faster than real time, with
-	 * the extreme being when an entity with the minimal weight always runs
-	 * on the cfs_rq. If the waking entity slept for a long time, its
-	 * vruntime difference from min_vruntime may overflow s64 and their
-	 * comparison may get inversed, so ignore the entity's original
-	 * vruntime in that case.
-	 * The maximal vruntime speedup is given by the ratio of normal to
-	 * minimal weight: scale_load_down(NICE_0_LOAD) / MIN_SHARES.
-	 * When placing a migrated waking entity, its exec_start has been set
-	 * from a different rq. In order to take into account a possible
-	 * divergence between new and prev rq's clocks task because of irq and
-	 * stolen time, we take an additional margin.
-	 * So, cutting off on the sleep time of
-	 *     2^63 / scale_load_down(NICE_0_LOAD) ~ 104 days
-	 * should be safe.
-	 */
-	if (entity_is_long_sleeper(se))
-		se->vruntime = vruntime;
-	else
-		se->vruntime = max_vruntime(se->vruntime, vruntime);
+		load = cfs_rq->avg_load;
+		if (curr && curr->on_rq)
+			load += curr->load.weight;
+
+		lag *= load + se->load.weight;
+		if (WARN_ON_ONCE(!load))
+			load = 1;
+		lag = div_s64(lag, load);
+
+		vruntime -= lag;
+	}
+
+	if (sched_feat(FAIR_SLEEPERS)) {
+
+		/* sleeps up to a single latency don't count. */
+		if (!initial) {
+			unsigned long thresh;
+
+			if (se_is_idle(se))
+				thresh = sysctl_sched_min_granularity;
+			else
+				thresh = sysctl_sched_latency;
+
+			/*
+			 * Halve their sleep time's effect, to allow
+			 * for a gentler effect of sleepers:
+			 */
+			if (sched_feat(GENTLE_FAIR_SLEEPERS))
+				thresh >>= 1;
+
+			vruntime -= thresh;
+		}
+
+		/*
+		 * Pull vruntime of the entity being placed to the base level of
+		 * cfs_rq, to prevent boosting it if placed backwards.  If the entity
+		 * slept for a long time, don't even try to compare its vruntime with
+		 * the base as it may be too far off and the comparison may get
+		 * inversed due to s64 overflow.
+		 */
+		if (!entity_is_long_sleeper(se))
+			vruntime = max_vruntime(se->vruntime, vruntime);
+	}
+
+	se->vruntime = vruntime;
 }
 
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
@@ -5077,6 +5166,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	clear_buddies(cfs_rq, se);
 
+	if (flags & DEQUEUE_SLEEP)
+		update_entity_lag(cfs_rq, se);
+
 	if (se != cfs_rq->curr)
 		__dequeue_entity(cfs_rq, se);
 	se->on_rq = 0;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index fa828b3..7958a10 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -1,12 +1,20 @@
 /* SPDX-License-Identifier: GPL-2.0 */
+
 /*
  * Only give sleepers 50% of their service deficit. This allows
  * them to run sooner, but does not allow tons of sleepers to
  * rip the spread apart.
  */
+SCHED_FEAT(FAIR_SLEEPERS, false)
 SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true)
 
 /*
+ * Using the avg_vruntime, do the right thing and preserve lag across
+ * sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
+ */
+SCHED_FEAT(PLACE_LAG, true)
+
+/*
  * Prefer to schedule the task we woke last (assuming it failed
  * wakeup-preemption), since its likely going to consume data we
  * touched, increases cache locality.

^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [tip: sched/core] sched/fair: Remove sched_feat(START_DEBIT)
  2023-05-31 11:58 ` [PATCH 02/15] sched/fair: Remove START_DEBIT Peter Zijlstra
@ 2023-08-10  7:10   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 104+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2023-08-10  7:10 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Ingo Molnar, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     e0c2ff903c320d3fd3c2c604dc401b3b7c0a1d13
Gitweb:        https://git.kernel.org/tip/e0c2ff903c320d3fd3c2c604dc401b3b7c0a1d13
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Wed, 31 May 2023 13:58:41 +02:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 19 Jul 2023 09:43:58 +02:00

sched/fair: Remove sched_feat(START_DEBIT)

With the introduction of avg_vruntime() there is no need to use worse
approximations. Take the 0-lag point as starting point for inserting
new tasks.
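
Concretely (see the place_entity() hunk below), a new task is now placed
at vruntime = avg_vruntime(cfs_rq), i.e. at v_i = V, which by
lag_i = w_i*(V - v_i) is exactly the 0-lag point.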

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.722361178@infradead.org
---
 kernel/sched/fair.c     | 21 +--------------------
 kernel/sched/features.h |  6 ------
 2 files changed, 1 insertion(+), 26 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bb54606..fc43482 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -906,16 +906,6 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	return slice;
 }
 
-/*
- * We calculate the vruntime slice of a to-be-inserted task.
- *
- * vs = s/w
- */
-static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-	return calc_delta_fair(sched_slice(cfs_rq, se), se);
-}
-
 #include "pelt.h"
 #ifdef CONFIG_SMP
 
@@ -4862,16 +4852,7 @@ static inline bool entity_is_long_sleeper(struct sched_entity *se)
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
-	u64 vruntime = cfs_rq->min_vruntime;
-
-	/*
-	 * The 'current' period is already promised to the current tasks,
-	 * however the extra weight of the new task will slow them down a
-	 * little, place the new task so that it fits in the slot that
-	 * stays open at the end.
-	 */
-	if (initial && sched_feat(START_DEBIT))
-		vruntime += sched_vslice(cfs_rq, se);
+	u64 vruntime = avg_vruntime(cfs_rq);
 
 	/* sleeps up to a single latency don't count. */
 	if (!initial) {
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c..fa828b3 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -7,12 +7,6 @@
 SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true)
 
 /*
- * Place new tasks ahead so that they do not starve already running
- * tasks
- */
-SCHED_FEAT(START_DEBIT, true)
-
-/*
  * Prefer to schedule the task we woke last (assuming it failed
  * wakeup-preemption), since its likely going to consume data we
  * touched, increases cache locality.

^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [tip: sched/core] sched/fair: Add cfs_rq::avg_vruntime
  2023-05-31 11:58 ` [PATCH 01/15] sched/fair: Add avg_vruntime Peter Zijlstra
  2023-06-02 13:51   ` Vincent Guittot
@ 2023-08-10  7:10   ` tip-bot2 for Peter Zijlstra
  2023-10-11  4:15   ` [PATCH 01/15] sched/fair: Add avg_vruntime Abel Wu
  2 siblings, 0 replies; 104+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2023-08-10  7:10 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Ingo Molnar, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     af4cf40470c22efa3987200fd19478199e08e103
Gitweb:        https://git.kernel.org/tip/af4cf40470c22efa3987200fd19478199e08e103
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Wed, 31 May 2023 13:58:40 +02:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 19 Jul 2023 09:43:58 +02:00

sched/fair: Add cfs_rq::avg_vruntime

In order to move to an eligibility based scheduling policy, we need
to have a better approximation of the ideal scheduler.

Specifically, for a virtual time weighted fair queueing based
scheduler the ideal scheduler will be the weighted average of the
individual virtual runtimes (math in the comment).

As such, compute the weighted average to approximate the ideal
scheduler -- note that the approximation is in the individual task
behaviour, which isn't strictly conformant.

Specifically consider adding a task with a vruntime left of center, in
this case the average will move backwards in time -- something the
ideal scheduler would of course never do.
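
For illustration (hypothetical numbers): with two nice-0 entities queued
at v = 10ms and v = 20ms the weighted average is V = 15ms; enqueueing a
third nice-0 entity at v = 6ms gives V = (10 + 20 + 6)/3 = 12ms, so the
approximation of the ideal service time just moved 3ms backwards.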

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.654144274@infradead.org
---
 kernel/sched/debug.c |  32 ++++------
 kernel/sched/fair.c  | 137 +++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |   5 ++-
 3 files changed, 154 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index aeeba46..e48d2b2 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -627,10 +627,9 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
 
 void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 {
-	s64 MIN_vruntime = -1, min_vruntime, max_vruntime = -1,
-		spread, rq0_min_vruntime, spread0;
+	s64 left_vruntime = -1, min_vruntime, right_vruntime = -1, spread;
+	struct sched_entity *last, *first;
 	struct rq *rq = cpu_rq(cpu);
-	struct sched_entity *last;
 	unsigned long flags;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -644,26 +643,25 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			SPLIT_NS(cfs_rq->exec_clock));
 
 	raw_spin_rq_lock_irqsave(rq, flags);
-	if (rb_first_cached(&cfs_rq->tasks_timeline))
-		MIN_vruntime = (__pick_first_entity(cfs_rq))->vruntime;
+	first = __pick_first_entity(cfs_rq);
+	if (first)
+		left_vruntime = first->vruntime;
 	last = __pick_last_entity(cfs_rq);
 	if (last)
-		max_vruntime = last->vruntime;
+		right_vruntime = last->vruntime;
 	min_vruntime = cfs_rq->min_vruntime;
-	rq0_min_vruntime = cpu_rq(0)->cfs.min_vruntime;
 	raw_spin_rq_unlock_irqrestore(rq, flags);
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "MIN_vruntime",
-			SPLIT_NS(MIN_vruntime));
+
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_vruntime",
+			SPLIT_NS(left_vruntime));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "min_vruntime",
 			SPLIT_NS(min_vruntime));
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "max_vruntime",
-			SPLIT_NS(max_vruntime));
-	spread = max_vruntime - MIN_vruntime;
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "spread",
-			SPLIT_NS(spread));
-	spread0 = min_vruntime - rq0_min_vruntime;
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "spread0",
-			SPLIT_NS(spread0));
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "avg_vruntime",
+			SPLIT_NS(avg_vruntime(cfs_rq)));
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "right_vruntime",
+			SPLIT_NS(right_vruntime));
+	spread = right_vruntime - left_vruntime;
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "spread", SPLIT_NS(spread));
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_spread_over",
 			cfs_rq->nr_spread_over);
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d3df5b1..bb54606 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -601,9 +601,134 @@ static inline bool entity_before(const struct sched_entity *a,
 	return (s64)(a->vruntime - b->vruntime) < 0;
 }
 
+static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	return (s64)(se->vruntime - cfs_rq->min_vruntime);
+}
+
 #define __node_2_se(node) \
 	rb_entry((node), struct sched_entity, run_node)
 
+/*
+ * Compute virtual time from the per-task service numbers:
+ *
+ * Fair schedulers conserve lag:
+ *
+ *   \Sum lag_i = 0
+ *
+ * Where lag_i is given by:
+ *
+ *   lag_i = S - s_i = w_i * (V - v_i)
+ *
+ * Where S is the ideal service time and V is its virtual time counterpart.
+ * Therefore:
+ *
+ *   \Sum lag_i = 0
+ *   \Sum w_i * (V - v_i) = 0
+ *   \Sum w_i * V - w_i * v_i = 0
+ *
+ * From which we can solve an expression for V in v_i (which we have in
+ * se->vruntime):
+ *
+ *       \Sum v_i * w_i   \Sum v_i * w_i
+ *   V = -------------- = --------------
+ *          \Sum w_i            W
+ *
+ * Specifically, this is the weighted average of all entity virtual runtimes.
+ *
+ * [[ NOTE: this is only equal to the ideal scheduler under the condition
+ *          that join/leave operations happen at lag_i = 0, otherwise the
+ *          virtual time has non-contiguous motion equivalent to:
+ *
+ *	      V +-= lag_i / W
+ *
+ *	    Also see the comment in place_entity() that deals with this. ]]
+ *
+ * However, since v_i is u64 and the multiplication could easily overflow,
+ * transform it into a relative form that uses smaller quantities:
+ *
+ * Substitute: v_i == (v_i - v0) + v0
+ *
+ *     \Sum ((v_i - v0) + v0) * w_i   \Sum (v_i - v0) * w_i
+ * V = ---------------------------- = --------------------- + v0
+ *                  W                            W
+ *
+ * Which we track using:
+ *
+ *                    v0 := cfs_rq->min_vruntime
+ * \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime
+ *              \Sum w_i := cfs_rq->avg_load
+ *
+ * Since min_vruntime is a monotonic increasing variable that closely tracks
+ * the per-task service, these deltas: (v_i - v), will be in the order of the
+ * maximal (virtual) lag induced in the system due to quantisation.
+ *
+ * Also, we use scale_load_down() to reduce the size.
+ *
+ * As measured, the max (key * weight) value was ~44 bits for a kernel build.
+ */
+static void
+avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	unsigned long weight = scale_load_down(se->load.weight);
+	s64 key = entity_key(cfs_rq, se);
+
+	cfs_rq->avg_vruntime += key * weight;
+	cfs_rq->avg_load += weight;
+}
+
+static void
+avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	unsigned long weight = scale_load_down(se->load.weight);
+	s64 key = entity_key(cfs_rq, se);
+
+	cfs_rq->avg_vruntime -= key * weight;
+	cfs_rq->avg_load -= weight;
+}
+
+static inline
+void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
+{
+	/*
+	 * v' = v + d ==> avg_vruntime' = avg_vruntime - d*avg_load
+	 */
+	cfs_rq->avg_vruntime -= cfs_rq->avg_load * delta;
+}
+
+u64 avg_vruntime(struct cfs_rq *cfs_rq)
+{
+	struct sched_entity *curr = cfs_rq->curr;
+	s64 avg = cfs_rq->avg_vruntime;
+	long load = cfs_rq->avg_load;
+
+	if (curr && curr->on_rq) {
+		unsigned long weight = scale_load_down(curr->load.weight);
+
+		avg += entity_key(cfs_rq, curr) * weight;
+		load += weight;
+	}
+
+	if (load)
+		avg = div_s64(avg, load);
+
+	return cfs_rq->min_vruntime + avg;
+}
+
+static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
+{
+	u64 min_vruntime = cfs_rq->min_vruntime;
+	/*
+	 * open coded max_vruntime() to allow updating avg_vruntime
+	 */
+	s64 delta = (s64)(vruntime - min_vruntime);
+	if (delta > 0) {
+		avg_vruntime_update(cfs_rq, delta);
+		min_vruntime = vruntime;
+	}
+	return min_vruntime;
+}
+
 static void update_min_vruntime(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *curr = cfs_rq->curr;
@@ -629,7 +754,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
 
 	/* ensure we never gain time by being placed backwards. */
 	u64_u32_store(cfs_rq->min_vruntime,
-		      max_vruntime(cfs_rq->min_vruntime, vruntime));
+		      __update_min_vruntime(cfs_rq, vruntime));
 }
 
 static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
@@ -642,12 +767,14 @@ static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
  */
 static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	avg_vruntime_add(cfs_rq, se);
 	rb_add_cached(&se->run_node, &cfs_rq->tasks_timeline, __entity_less);
 }
 
 static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline);
+	avg_vruntime_sub(cfs_rq, se);
 }
 
 struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
@@ -3379,6 +3506,8 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 		/* commit outstanding execution time */
 		if (cfs_rq->curr == se)
 			update_curr(cfs_rq);
+		else
+			avg_vruntime_sub(cfs_rq, se);
 		update_load_sub(&cfs_rq->load, se->load.weight);
 	}
 	dequeue_load_avg(cfs_rq, se);
@@ -3394,9 +3523,11 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 #endif
 
 	enqueue_load_avg(cfs_rq, se);
-	if (se->on_rq)
+	if (se->on_rq) {
 		update_load_add(&cfs_rq->load, se->load.weight);
-
+		if (cfs_rq->curr != se)
+			avg_vruntime_add(cfs_rq, se);
+	}
 }
 
 void reweight_task(struct task_struct *p, int prio)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9baeb1a..52a0a4b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -548,6 +548,9 @@ struct cfs_rq {
 	unsigned int		idle_nr_running;   /* SCHED_IDLE */
 	unsigned int		idle_h_nr_running; /* SCHED_IDLE */
 
+	s64			avg_vruntime;
+	u64			avg_load;
+
 	u64			exec_clock;
 	u64			min_vruntime;
 #ifdef CONFIG_SCHED_CORE
@@ -3483,4 +3486,6 @@ static inline void task_tick_mm_cid(struct rq *rq, struct task_struct *curr) { }
 static inline void init_sched_mm_cid(struct task_struct *t) { }
 #endif
 
+extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
+
 #endif /* _KERNEL_SCHED_SCHED_H */

^ permalink raw reply related	[flat|nested] 104+ messages in thread

* Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr
  2023-05-31 11:58 [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr Peter Zijlstra
                   ` (14 preceding siblings ...)
  2023-05-31 11:58 ` [RFC][PATCH 15/15] sched/eevdf: Use sched_attr::sched_runtime to set request/slice Peter Zijlstra
@ 2023-08-24  0:52 ` Daniel Jordan
  2023-09-06 13:13   ` Peter Zijlstra
  15 siblings, 1 reply; 104+ messages in thread
From: Daniel Jordan @ 2023-08-24  0:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel, efault, tglx

Hi Peter,

On Wed, May 31, 2023 at 01:58:39PM +0200, Peter Zijlstra wrote:
> 
> Hi!
> 
> Latest version of the EEVDF [1] patches.
> 
> The only real change since last time is the fix for tick-preemption [2], and a 
> simple safe-guard for the mixed slice heuristic.

We're seeing regressions from EEVDF with SPEC CPU, a database workload,
and a Java workload.  We tried SPEC CPU on five systems, and here are
numbers from one of them (high core count, two-socket x86 machine).

    SPECrate2017 oversubscribed by 2x (two copies of the test per CPU)

    Base: v6.3-based kernel
    EEVDF: Base + patches from May 31 [0]

    Performance comparison: >0 if EEVDF wins

    Integer
     
     -0.5% 500.perlbench_r
     -6.6% 502.gcc_r
     -8.7% 505.mcf_r
     -9.2% 520.omnetpp_r
     -6.6% 523.xalancbmk_r
     -0.7% 525.x264_r
     -2.1% 531.deepsjeng_r
     -0.4% 541.leela_r
     -0.3% 548.exchange2_r
     -2.6% 557.xz_r
     
     -3.8% Est(*) SPECrate2017_int_base
     
    Floating Point
     
     -0.6% 503.bwaves_r
     -1.3% 507.cactuBSSN_r
     -0.8% 508.namd_r
    -17.8% 510.parest_r
      0.3% 511.povray_r
     -1.0% 519.lbm_r
     -7.7% 521.wrf_r
     -2.4% 526.blender_r
     -6.1% 527.cam4_r
     -2.0% 538.imagick_r
      0.1% 544.nab_r
     -0.7% 549.fotonik3d_r
    -11.3% 554.roms_r
     
     -4.1% Est(*) SPECrate2017_fp_base
     
    (*) SPEC CPU Fair Use rules require that tests with non-production
        components must be marked as estimates.

The other machines show similarly consistent regressions, and we've tried a
v6.5-rc4-based kernel with the latest EEVDF patches from tip/sched/core
including the recent fixlet "sched/eevdf: Curb wakeup-preemption".  I can post
the rest of the numbers, but I'm trying to keep this on the shorter side for
now.

Running the database workload on a two-socket x86 server, we see
regressions of up to 6% when the number of users exceeds the number of
CPUs.

With the Java workload on another two-socket x86 server, we see a 10%
regression.

We're investigating the other benchmarks, but here's what I've found so far
with SPEC CPU.  Some schedstats showed that eevdf is tick-preemption happy
(patches below).  These stats were taken over 1 minute near the middle of a ~26
minute benchmark (502.gcc_r).

    Base: v6.5-rc4-based kernel
    EEVDF: Base + the latest EEVDF patches from tip/sched/core

    schedstat                     Base            EEVDF

    sched                    1,243,911        3,947,251

    tick_check_preempts     12,899,049
    tick_preempts            1,022,998

    check_deadline                           15,878,463
    update_deadline                           3,895,530
    preempt_deadline                          3,751,580

In both kernels, tick preemption is primarily what drives schedule()s.
Preemptions happen over three times more often for EEVDF because in the base,
tick preemption happens after a task has run through its ideal timeslice as a
fraction of sched_latency (so two tasks sharing a CPU each get 12ms on a server
with enough CPUs, sched_latency being 24ms), whereas with eevdf, a task's base
slice determines when it gets tick-preempted, and that's 3ms by default.  It
seems SPEC CPU isn't liking the increased scheduling of EEVDF in a cpu-bound
load like this.  When I set the base_slice_ns sysctl to 12000000, the
regression disappears.

I'm still thinking about how to fix it.  Pre-EEVDF, tick preemption was
more flexible in that a task's timeslice could change depending on how
much competition it had on the same CPU, but with EEVDF the timeslice is
fixed no matter what else is running, and growing or shrinking it
depending on nr_running doesn't honor whatever deadline was set for the
task.
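
As a rough sketch of that arithmetic (assuming two equally weighted CPU-bound
tasks sharing one runqueue and the default values quoted above; this is not
kernel code, just the back-of-the-envelope comparison):

#include <stdio.h>

int main(void)
{
	unsigned long long sched_latency_ns = 24000000ULL;	/* CFS period on this server  */
	unsigned long long base_slice_ns    =  3000000ULL;	/* EEVDF default request size */
	unsigned int nr_running = 2;

	/* CFS: the ideal runtime is a (weight-proportional) share of the period */
	unsigned long long cfs_slice = sched_latency_ns / nr_running;

	/* EEVDF: the request size is fixed, independent of nr_running */
	unsigned long long eevdf_slice = base_slice_ns;

	printf("tick-preemption budget: CFS ~%llums, EEVDF ~%llums\n",
	       cfs_slice / 1000000ULL, eevdf_slice / 1000000ULL);
	return 0;
}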

The schedstat patch for the base:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3e25be58e2b..fb5a35aa07ec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4996,6 +4996,8 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
        struct sched_entity *se;
        s64 delta;

+       schedstat_inc(rq_of(cfs_rq)->tick_check_preempts);
+
        /*
         * When many tasks blow up the sched_period; it is possible that
         * sched_slice() reports unusually large results (when many tasks are
@@ -5005,6 +5007,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)

        delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
        if (delta_exec > ideal_runtime) {
+               schedstat_inc(rq_of(cfs_rq)->tick_preempts);
                resched_curr(rq_of(cfs_rq));
                /*
                 * The current task ran long enough, ensure it doesn't get
@@ -5028,8 +5031,10 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
        if (delta < 0)
                return;

-       if (delta > ideal_runtime)
+       if (delta > ideal_runtime) {
+               schedstat_inc(rq_of(cfs_rq)->tick_preempts);
                resched_curr(rq_of(cfs_rq));
+       }
 }

 static void
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e93e006a942b..1bf12e271756 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1123,6 +1123,10 @@ struct rq {
        /* try_to_wake_up() stats */
        unsigned int            ttwu_count;
        unsigned int            ttwu_local;
+
+       /* tick preempt stats */
+       unsigned int            tick_check_preempts;
+       unsigned int            tick_preempts;
 #endif

 #ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..7997b8538b72 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -133,12 +133,13 @@ static int show_schedstat(struct seq_file *seq, void *v)

                /* runqueue-specific stats */
                seq_printf(seq,
-                   "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+                   "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u",
                    cpu, rq->yld_count,
                    rq->sched_count, rq->sched_goidle,
                    rq->ttwu_count, rq->ttwu_local,
                    rq->rq_cpu_time,
-                   rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+                   rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+                   rq->tick_check_preempts, rq->tick_preempts);

                seq_printf(seq, "\n");


The schedstat patch for eevdf:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cffec98724f3..675f4bbac471 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -975,18 +975,21 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
  */
 static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+       schedstat_inc(rq_of(cfs_rq)->check_deadline);
        if ((s64)(se->vruntime - se->deadline) < 0)
                return;

        /*
         * EEVDF: vd_i = ve_i + r_i / w_i
         */
+       schedstat_inc(rq_of(cfs_rq)->update_deadline);
        se->deadline = se->vruntime + calc_delta_fair(se->slice, se);

        /*
         * The task has consumed its request, reschedule.
         */
        if (cfs_rq->nr_running > 1) {
+               schedstat_inc(rq_of(cfs_rq)->preempt_deadline);
                resched_curr(rq_of(cfs_rq));
                clear_buddies(cfs_rq, se);
        }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 93c2dc80143f..c44b59556367 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1129,6 +1129,11 @@ struct rq {
        /* try_to_wake_up() stats */
        unsigned int            ttwu_count;
        unsigned int            ttwu_local;
+
+       /* update_deadline() stats */
+       unsigned int            check_deadline;
+       unsigned int            update_deadline;
+       unsigned int            preempt_deadline;
 #endif

 #ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..2a8bd742507d 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -133,12 +133,14 @@ static int show_schedstat(struct seq_file *seq, void *v)

                /* runqueue-specific stats */
                seq_printf(seq,
-                   "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+                   "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u %u",
                    cpu, rq->yld_count,
                    rq->sched_count, rq->sched_goidle,
                    rq->ttwu_count, rq->ttwu_local,
                    rq->rq_cpu_time,
-                   rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+                   rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+                   rq->check_deadline, rq->update_deadline,
+                   rq->preempt_deadline);

                seq_printf(seq, "\n");


[0] https://lore.kernel.org/all/20230531115839.089944915@infradead.org/

^ permalink raw reply related	[flat|nested] 104+ messages in thread

* Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr
  2023-08-24  0:52 ` [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr Daniel Jordan
@ 2023-09-06 13:13   ` Peter Zijlstra
  2023-09-29 16:54     ` Youssef Esmat
  0 siblings, 1 reply; 104+ messages in thread
From: Peter Zijlstra @ 2023-09-06 13:13 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel, efault, tglx

On Wed, Aug 23, 2023 at 08:52:26PM -0400, Daniel Jordan wrote:

> We're investigating the other benchmarks, but here's what I've found so far
> with SPEC CPU.  Some schedstats showed that eevdf is tick-preemption happy
> (patches below).  These stats were taken over 1 minute near the middle of a ~26
> minute benchmark (502.gcc_r).
> 
>     Base: v6.5-rc4-based kernel
>     EEVDF: Base + the latest EEVDF patches from tip/sched/core
> 
>     schedstat                     Base            EEVDF
> 
>     sched                    1,243,911        3,947,251
> 
>     tick_check_preempts     12,899,049
>     tick_preempts            1,022,998
> 
>     check_deadline                           15,878,463
>     update_deadline                           3,895,530
>     preempt_deadline                          3,751,580
> 
> In both kernels, tick preemption is primarily what drives schedule()s.
> Preemptions happen over three times more often for EEVDF because in the base,
> tick preemption happens after a task has run through its ideal timeslice as a
> fraction of sched_latency (so two tasks sharing a CPU each get 12ms on a server
> with enough CPUs, sched_latency being 24ms), whereas with eevdf, a task's base
> slice determines when it gets tick-preempted, and that's 3ms by default.  It
> seems SPEC CPU isn't liking the increased scheduling of EEVDF in a cpu-bound
> load like this.  When I set the base_slice_ns sysctl to 12000000, the
> regression disappears.
> 
> I'm still thinking about how to fix it. 

EEVDF fundamentally supports per task request/slice sizes, which is the
primary motivator for finally finishing these patches.

So the plan is to extend sched_setattr() to allow tasks setting their
own ideal slice length. But we're not quite there yet.

Having just returned from PTO the mailbox is an utter trainwreck, but
I'll try and refresh those few patches this week for consideration.

In the meantime I think you found the right knob to twiddle.
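
For the curious, a minimal userspace sketch of what such a call could look
like, assuming the sched_attr::sched_runtime semantics from the (untested)
RFC patch at the end of this series -- i.e. not a settled ABI:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>
#include <linux/sched/types.h>		/* struct sched_attr */

/* Request a per-task slice of slice_ns nanoseconds for @pid (0 == self). */
static int set_slice(pid_t pid, unsigned long long slice_ns)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = 0;			/* SCHED_OTHER */
	attr.sched_runtime = slice_ns;		/* request/slice, per the RFC */

	return syscall(SYS_sched_setattr, pid, &attr, 0);
}

int main(void)
{
	if (set_slice(0, 12000000ULL))		/* 12ms, as in the report above */
		perror("sched_setattr");
	return 0;
}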

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 07/15] sched/smp: Use lag to simplify cross-runqueue placement
  2023-05-31 11:58 ` [PATCH 07/15] sched/smp: Use lag to simplify cross-runqueue placement Peter Zijlstra
  2023-08-10  7:10   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
@ 2023-09-12 15:32   ` Sebastian Andrzej Siewior
  2023-09-13  9:03     ` Peter Zijlstra
  2023-10-04  1:17   ` [PATCH] sched/fair: Preserve PLACE_DEADLINE_INITIAL deadline Daniel Jordan
  2023-10-06 16:48   ` [PATCH] sched/fair: Always update_curr() before placing at enqueue Daniel Jordan
  3 siblings, 1 reply; 104+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-09-12 15:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel, efault, tglx

On 2023-05-31 13:58:46 [+0200], Peter Zijlstra wrote:
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12492,22 +12440,9 @@ static void task_fork_fair(struct task_s
>  
>  	cfs_rq = task_cfs_rq(current);
>  	curr = cfs_rq->curr;
> -	if (curr) {
> +	if (curr)
>  		update_curr(cfs_rq);
> -		se->vruntime = curr->vruntime;
> -	}
>  	place_entity(cfs_rq, se, 1);
> -
> -	if (sysctl_sched_child_runs_first && curr && entity_before(curr, se)) {

Since the removal of sysctl_sched_child_runs_first there is no user of
this anymore. There is still the sysctl file sched_child_runs_first with
no functionality.
Is this intended or should it be removed?

…
>  	rq_unlock(rq, &rf);
>  }

Sebastian

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 07/15] sched/smp: Use lag to simplify cross-runqueue placement
  2023-09-12 15:32   ` [PATCH 07/15] " Sebastian Andrzej Siewior
@ 2023-09-13  9:03     ` Peter Zijlstra
  0 siblings, 0 replies; 104+ messages in thread
From: Peter Zijlstra @ 2023-09-13  9:03 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel, efault, tglx

On Tue, Sep 12, 2023 at 05:32:21PM +0200, Sebastian Andrzej Siewior wrote:
> On 2023-05-31 13:58:46 [+0200], Peter Zijlstra wrote:
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -12492,22 +12440,9 @@ static void task_fork_fair(struct task_s
> >  
> >  	cfs_rq = task_cfs_rq(current);
> >  	curr = cfs_rq->curr;
> > -	if (curr) {
> > +	if (curr)
> >  		update_curr(cfs_rq);
> > -		se->vruntime = curr->vruntime;
> > -	}
> >  	place_entity(cfs_rq, se, 1);
> > -
> > -	if (sysctl_sched_child_runs_first && curr && entity_before(curr, se)) {
> 
> Since the removal of sysctl_sched_child_runs_first there is no user of
> this anymore. There is still the sysctl file sched_child_runs_first with
> no functionality.
> Is this intended or should it be removed?

Hurmph... I think that knob has been somewhat dysfunctional for a long
while and it might be time to remove it.

Ingo?

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr
  2023-09-06 13:13   ` Peter Zijlstra
@ 2023-09-29 16:54     ` Youssef Esmat
  2023-10-02 15:55       ` Youssef Esmat
  2023-10-02 18:41       ` Peter Zijlstra
  0 siblings, 2 replies; 104+ messages in thread
From: Youssef Esmat @ 2023-09-29 16:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Daniel Jordan, mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen, joel,
	efault, tglx

>
> EEVDF fundamentally supports per task request/slice sizes, which is the
> primary motivator for finally finishing these patches.
>
> So the plan is to extend sched_setattr() to allow tasks setting their
> own ideal slice length. But we're not quite there yet.
>
> Having just returned from PTO the mailbox is an utter trainwreck, but
> I'll try and refresh those few patches this week for consideration.
>
> In the meantime I think you found the right knob to twiddle.

Hello Peter,

I am trying to understand a little better the need for the eligibility
checks (entity_eligible). I understand the general concept, but I am
trying to find a scenario where it is necessary. And maybe propose to
have it toggled by a feature flag.
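
For reference, a sketch of the check in question, paraphrasing
entity_eligible() and the avg_vruntime accounting from this series (not the
exact kernel code): an entity is eligible when its lag is non-negative, i.e.
its vruntime is not past the load-weighted average vruntime V, evaluated
without a division by keeping keys relative to min_vruntime (v0):

	/*
	 *	V = v0 + (\Sum w_j * (v_j - v0)) / \Sum w_j
	 *
	 *	eligible(i)  <=>  v_i <= V
	 *	             <=>  (v_i - v0) * \Sum w_j  <=  \Sum w_j * (v_j - v0)
	 */
	static int eligible(long long key_i,		/* v_i - v0              */
			    long long weighted_key_sum,	/* \Sum w_j * (v_j - v0) */
			    unsigned long weight_sum)	/* \Sum w_j              */
	{
		return key_i * (long long)weight_sum <= weighted_key_sum;
	}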

Some of my testing:

All my testing was done on a two-core Celeron N400 CPU system at 1.1GHz.
It was done on the 6.5-rc3 kernel with EEVDF changes ported.

I have two CPU bound tasks one with a nice of -4 and the other with a
nice of 0. They are both affinitized to CPU 0. (while 1 { i++ })

With entity_eligible *enabled* and with entity_eligible *disabled*
(always returns 1):
Top showed consistent results, one at ~70% and the other at ~30%

So it seems the deadline adjustment will naturally achieve fairness.

I also added a few trace_printks to see if there is a case where
entity_eligible would have returned 0 before the deadline forced us to
reschedule. There were a few such cases. The following snippet of
prints shows that an entity became ineligible 2 slices before its
deadline expired. It seems like this will add more context switching
but still achieve a similar result at the end.

bprint:               pick_eevdf: eligibility check: tid=4568,
eligible=0, deadline=26577257249, vruntime=26575761118
bprint:               pick_eevdf: found best deadline: tid=4573,
deadline=26575451399, vruntime=26574838855
sched_switch:         prev_comm=loop prev_pid=4568 prev_prio=120
prev_state=R ==> next_comm=loop next_pid=4573 next_prio=116
bputs:                task_tick_fair: tick
bputs:                task_tick_fair: tick
bprint:               pick_eevdf: eligibility check: tid=4573,
eligible=1, deadline=26576270304, vruntime=26575657159
bprint:               pick_eevdf: found best deadline: tid=4573,
deadline=26576270304, vruntime=26575657159
bputs:                task_tick_fair: tick
bputs:                task_tick_fair: tick
bprint:               pick_eevdf: eligibility check: tid=4573,
eligible=0, deadline=26577089170, vruntime=26576476006
bprint:               pick_eevdf: found best deadline: tid=4573,
deadline=26577089170, vruntime=26576476006
bputs:                task_tick_fair: tick
bputs:                task_tick_fair: tick
bprint:               pick_eevdf: eligibility check: tid=4573,
eligible=0, deadline=26577908042, vruntime=26577294838
bprint:               pick_eevdf: found best deadline: tid=4568,
deadline=26577257249, vruntime=26575761118
sched_switch:         prev_comm=loop prev_pid=4573 prev_prio=116
prev_state=R ==> next_comm=loop next_pid=4568 next_prio=120

In a more practical example, I tried this with one of our benchmarks
that involves running Meet and Docs side by side and measuring the
input latency in the Docs document. The following is the average
latency for 5 runs:

(These numbers are after removing our cgroup hierarchy - that might be
a discussion for a later time).

CFS: 168ms
EEVDF with eligibility: 206ms (regression from CFS)
EEVDF *without* eligibility: 143ms (improvement to CFS)
EEVDF *without* eligibility and with a 6ms base_slice_ns (was 1.5ms):
104ms (great improvement)

Removing the eligibility check for this workload seemed to result in a
great improvement. I haven't dug deeper but I suspect it's related to
reduced context switches on our 2 core system.
As an extra test I also increased the base_slice_ns and it further
improved the input latency significantly.

I would love to hear your thoughts. Thanks!

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 05/15] sched/fair: Implement an EEVDF like policy
  2023-05-31 11:58 ` [PATCH 05/15] sched/fair: Implement an EEVDF like policy Peter Zijlstra
  2023-08-10  7:10   ` [tip: sched/core] sched/fair: Implement an EEVDF-like scheduling policy tip-bot2 for Peter Zijlstra
@ 2023-09-29 21:40   ` Benjamin Segall
  2023-10-02 17:39     ` Peter Zijlstra
  2023-10-11  4:14     ` Abel Wu
  2023-09-30  0:09   ` [PATCH] sched/fair: fix pick_eevdf to always find the correct se Benjamin Segall
  2 siblings, 2 replies; 104+ messages in thread
From: Benjamin Segall @ 2023-09-29 21:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, mgorman, bristot, corbet, qyousef,
	chris.hyser, patrick.bellasi, pjt, pavel, qperret, tim.c.chen,
	joshdon, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel,
	efault, tglx

Peter Zijlstra <peterz@infradead.org> writes:

> +
> +/*
> + * Earliest Eligible Virtual Deadline First
> + *
> + * In order to provide latency guarantees for different request sizes
> + * EEVDF selects the best runnable task from two criteria:
> + *
> + *  1) the task must be eligible (must be owed service)
> + *
> + *  2) from those tasks that meet 1), we select the one
> + *     with the earliest virtual deadline.
> + *
> + * We can do this in O(log n) time due to an augmented RB-tree. The
> + * tree keeps the entries sorted on service, but also functions as a
> + * heap based on the deadline by keeping:
> + *
> + *  se->min_deadline = min(se->deadline, se->{left,right}->min_deadline)
> + *
> + * Which allows an EDF like search on (sub)trees.
> + */
> +static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> +{
> +	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
> +	struct sched_entity *curr = cfs_rq->curr;
> +	struct sched_entity *best = NULL;
> +
> +	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> +		curr = NULL;
> +
> +	while (node) {
> +		struct sched_entity *se = __node_2_se(node);
> +
> +		/*
> +		 * If this entity is not eligible, try the left subtree.
> +		 */
> +		if (!entity_eligible(cfs_rq, se)) {
> +			node = node->rb_left;
> +			continue;
> +		}
> +
> +		/*
> +		 * If this entity has an earlier deadline than the previous
> +		 * best, take this one. If it also has the earliest deadline
> +		 * of its subtree, we're done.
> +		 */
> +		if (!best || deadline_gt(deadline, best, se)) {
> +			best = se;
> +			if (best->deadline == best->min_deadline)
> +				break;
> +		}
> +
> +		/*
> +		 * If the earlest deadline in this subtree is in the fully
> +		 * eligible left half of our space, go there.
> +		 */
> +		if (node->rb_left &&
> +		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
> +			node = node->rb_left;
> +			continue;
> +		}
> +
> +		node = node->rb_right;
> +	}

I believe that this can fail to actually find the earliest eligible
deadline, because the earliest deadline (min_deadline) can be in the
right branch, but that se isn't eligible, and the actual target se is in
the left branch. A trivial 3-se example with the nodes represented by
(vruntime, deadline, min_deadline):

   (5,9,7)
 /        \
(4,8,8)  (6,7,7)

AIUI, here the EEVDF pick should be (4,8,8), but pick_eevdf() will
instead pick (5,9,7), because it goes into the right branch and then
fails eligibility.
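
For concreteness, a walk of the loop above over that tree (assuming equal
weights, so the average vruntime is 5 and only (6,7,7) is ineligible):

  node = (5,9,7): eligible, best = (5,9,7); deadline 9 != min_deadline 7,
                  and the left child's min_deadline is 8 != 7, so go right
  node = (6,7,7): not eligible, go left -> NULL, loop ends

  result: (5,9,7), even though (4,8,8) has the earliest eligible deadline.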

I'm not sure how much of a problem this is in practice, either in
frequency or severity, but it probably should be mentioned if it's
an intentional tradeoff.



Thinking out loud, I think that it would be sufficient to recheck via something like

for_each_sched_entity(best) {
	check __node_2_se(best->rb_left)->min_deadline, store in actual_best
}

for the best min_deadline, and then go do a heap lookup in actual_best
to find the se matching that min_deadline.

I think this pass could then be combined with our initial descent for
better cache behavior by keeping track of the best rb_left->min_deadline
each time we take a right branch. We still have to look at up to ~2x the
nodes, but I don't think that's avoidable? I'll expand the quick hack I
used to test my simple case into something of a stress tester and try
some implementations.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* [PATCH] sched/fair: fix pick_eevdf to always find the correct se
  2023-05-31 11:58 ` [PATCH 05/15] sched/fair: Implement an EEVDF like policy Peter Zijlstra
  2023-08-10  7:10   ` [tip: sched/core] sched/fair: Implement an EEVDF-like scheduling policy tip-bot2 for Peter Zijlstra
  2023-09-29 21:40   ` [PATCH 05/15] sched/fair: Implement an EEVDF like policy Benjamin Segall
@ 2023-09-30  0:09   ` Benjamin Segall
  2023-10-03 10:42     ` [tip: sched/urgent] sched/fair: Fix pick_eevdf() tip-bot2 for Benjamin Segall
                       ` (3 more replies)
  2 siblings, 4 replies; 104+ messages in thread
From: Benjamin Segall @ 2023-09-30  0:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, mgorman, bristot, corbet, qyousef,
	chris.hyser, patrick.bellasi, pjt, pavel, qperret, tim.c.chen,
	joshdon, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel,
	efault, tglx

The old pick_eevdf could fail to find the actual earliest eligible
deadline when it descended to the right looking for min_deadline, but it
turned out that that min_deadline wasn't actually eligible. In that case
we need to go back and search through any left branches we skipped
looking for the actual best _eligible_ min_deadline.

This is more expensive, but still O(log n), and at worst should only
involve descending two branches of the rbtree.

I've run this through a userspace stress test (thank you
tools/lib/rbtree.c), so hopefully this implementation doesn't miss any
corner cases.

Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Ben Segall <bsegall@google.com>
---
 kernel/sched/fair.c | 72 ++++++++++++++++++++++++++++++++++++---------
 1 file changed, 58 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0c31cda0712f..77e9440b8ab3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -864,18 +864,20 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
  *
  *  se->min_deadline = min(se->deadline, se->{left,right}->min_deadline)
  *
  * Which allows an EDF like search on (sub)trees.
  */
-static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
+static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
 {
 	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
 	struct sched_entity *curr = cfs_rq->curr;
 	struct sched_entity *best = NULL;
+	struct sched_entity *best_left = NULL;
 
 	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
 		curr = NULL;
+	best = curr;
 
 	/*
 	 * Once selected, run a task until it either becomes non-eligible or
 	 * until it gets a new slice. See the HACK in set_next_entity().
 	 */
@@ -892,45 +894,87 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
 			node = node->rb_left;
 			continue;
 		}
 
 		/*
-		 * If this entity has an earlier deadline than the previous
-		 * best, take this one. If it also has the earliest deadline
-		 * of its subtree, we're done.
+		 * Now we heap search eligible trees for the best (min_)deadline
 		 */
-		if (!best || deadline_gt(deadline, best, se)) {
+		if (!best || deadline_gt(deadline, best, se))
 			best = se;
-			if (best->deadline == best->min_deadline)
-				break;
-		}
 
 		/*
-		 * If the earlest deadline in this subtree is in the fully
-		 * eligible left half of our space, go there.
+		 * Every se in a left branch is eligible, keep track of the
+		 * branch with the best min_deadline
 		 */
+		if (node->rb_left) {
+			struct sched_entity *left = __node_2_se(node->rb_left);
+
+			if (!best_left || deadline_gt(min_deadline, best_left, left))
+				best_left = left;
+
+			/*
+			 * min_deadline is in the left branch. rb_left and all
+			 * descendants are eligible, so immediately switch to the second
+			 * loop.
+			 */
+			if (left->min_deadline == se->min_deadline)
+				break;
+		}
+
+		/* min_deadline is at this node, no need to look right */
+		if (se->deadline == se->min_deadline)
+			break;
+
+		/* else min_deadline is in the right branch. */
+		node = node->rb_right;
+	}
+
+	/*
+	 * We ran into an eligible node which is itself the best.
+	 * (Or nr_running == 0 and both are NULL)
+	 */
+	if (!best_left || (s64)(best_left->min_deadline - best->deadline) > 0)
+		return best;
+
+	/*
+	 * Now best_left and all of its children are eligible, and we are just
+	 * looking for deadline == min_deadline
+	 */
+	node = &best_left->run_node;
+	while (node) {
+		struct sched_entity *se = __node_2_se(node);
+
+		/* min_deadline is the current node */
+		if (se->deadline == se->min_deadline)
+			return se;
+
+		/* min_deadline is in the left branch */
 		if (node->rb_left &&
 		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
 			node = node->rb_left;
 			continue;
 		}
 
+		/* else min_deadline is in the right branch */
 		node = node->rb_right;
 	}
+	return NULL;
+}
 
-	if (!best || (curr && deadline_gt(deadline, best, curr)))
-		best = curr;
+static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
+{
+	struct sched_entity *se = __pick_eevdf(cfs_rq);
 
-	if (unlikely(!best)) {
+	if (!se) {
 		struct sched_entity *left = __pick_first_entity(cfs_rq);
 		if (left) {
 			pr_err("EEVDF scheduling fail, picking leftmost\n");
 			return left;
 		}
 	}
 
-	return best;
+	return se;
 }
 
 #ifdef CONFIG_SCHED_DEBUG
 struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
 {
-- 
2.42.0.582.g8ccd20d70d-goog


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr
  2023-09-29 16:54     ` Youssef Esmat
@ 2023-10-02 15:55       ` Youssef Esmat
  2023-10-02 18:41       ` Peter Zijlstra
  1 sibling, 0 replies; 104+ messages in thread
From: Youssef Esmat @ 2023-10-02 15:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Daniel Jordan, mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen, joel,
	efault, tglx

On Fri, Sep 29, 2023 at 11:54 AM Youssef Esmat
<youssefesmat@chromium.org> wrote:
>
> >
> > EEVDF fundamentally supports per task request/slice sizes, which is the
> > primary motivator for finally finishing these patches.
> >
> > So the plan is to extend sched_setattr() to allow tasks setting their
> > own ideal slice length. But we're not quite there yet.
> >
> > Having just returned from PTO the mailbox is an utter trainwreck, but
> > I'll try and refresh those few patches this week for consideration.
> >
> > In the meantime I think you found the right knob to twiddle.
>
> Hello Peter,
>
> I am trying to understand a little better the need for the eligibility
> checks (entity_eligible). I understand the general concept, but I am
> trying to find a scenario where it is necessary. And maybe propose to
> have it toggled by a feature flag.
>
> Some of my testing:
>
> All my testing was done on a two-core Celeron N400 CPU system at 1.1GHz.
> It was done on the 6.5-rc3 kernel with EEVDF changes ported.
>
> I have two CPU bound tasks one with a nice of -4 and the other with a
> nice of 0. They are both affinitized to CPU 0. (while 1 { i++ })
>
> With entity_eligible *enabled* and with entity_eligible *disabled*
> (always returns 1):
> Top showed consistent results, one at ~70% and the other at ~30%
>
> So it seems the deadline adjustment will naturally achieve fairness.
>
> I also added a few trace_printks to see if there is a case where
> entity_eligible would have returned 0 before the deadline forced us to
> reschedule. There were a few such cases. The following snippet of
> prints shows that an entity became ineligible 2 slices before its
> deadline expired. It seems like this will add more context switching
> but still achieve a similar result at the end.
>
> bprint:               pick_eevdf: eligibility check: tid=4568,
> eligible=0, deadline=26577257249, vruntime=26575761118
> bprint:               pick_eevdf: found best deadline: tid=4573,
> deadline=26575451399, vruntime=26574838855
> sched_switch:         prev_comm=loop prev_pid=4568 prev_prio=120
> prev_state=R ==> next_comm=loop next_pid=4573 next_prio=116
> bputs:                task_tick_fair: tick
> bputs:                task_tick_fair: tick
> bprint:               pick_eevdf: eligibility check: tid=4573,
> eligible=1, deadline=26576270304, vruntime=26575657159
> bprint:               pick_eevdf: found best deadline: tid=4573,
> deadline=26576270304, vruntime=26575657159
> bputs:                task_tick_fair: tick
> bputs:                task_tick_fair: tick
> bprint:               pick_eevdf: eligibility check: tid=4573,
> eligible=0, deadline=26577089170, vruntime=26576476006
> bprint:               pick_eevdf: found best deadline: tid=4573,
> deadline=26577089170, vruntime=26576476006
> bputs:                task_tick_fair: tick
> bputs:                task_tick_fair: tick
> bprint:               pick_eevdf: eligibility check: tid=4573,
> eligible=0, deadline=26577908042, vruntime=26577294838
> bprint:               pick_eevdf: found best deadline: tid=4568,
> deadline=26577257249, vruntime=26575761118
> sched_switch:         prev_comm=loop prev_pid=4573 prev_prio=116
> prev_state=R ==> next_comm=loop next_pid=4568 next_prio=120
>
> In a more practical example, I tried this with one of our benchmarks
> that involves running Meet and Docs side by side and measuring the
> input latency in the Docs document. The following is the average
> latency for 5 runs:
>
> (These numbers are after removing our cgroup hierarchy - that might be
> a discussion for a later time).
>
> CFS: 168ms
> EEVDF with eligibility: 206ms (regression from CFS)
> EEVDF *without* eligibility: 143ms (improvement to CFS)
> EEVDF *without* eligibility and with a 6ms base_slice_ns (was 1.5ms):
> 104ms (great improvement)
>
> Removing the eligibility check for this workload seemed to result in a
> great improvement. I haven't dug deeper but I suspect it's related to
> reduced context switches on our 2 core system.
> As an extra test I also increased the base_slice_ns and it further
> improved the input latency significantly.
>
> I would love to hear your thoughts. Thanks!

For completeness I ran two more tests:

1. EEVDF with eligibility and 6ms base_slice_ns.
2. EEVDF with eligibility with Benjamin Segall's patch
(https://lore.kernel.org/all/xm261qego72d.fsf_-_@google.com/).

I copied over all the previous results for easier comparison.

CFS:                             168ms
EEVDF, eligib, 1.5ms slice:      206ms
EEVDF, eligib, 6ms slice:        167ms
EEVDF_Fix, eligib, 1.5ms slice:  190ms
EEVDF, *no*eligib, 1.5ms slice:  143ms
EEVDF, *no*eligib, 6ms slice:    104ms

It does seem like Benjamin's fix did have an improvement.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 05/15] sched/fair: Implement an EEVDF like policy
  2023-09-29 21:40   ` [PATCH 05/15] sched/fair: Implement an EEVDF like policy Benjamin Segall
@ 2023-10-02 17:39     ` Peter Zijlstra
  2023-10-11  4:14     ` Abel Wu
  1 sibling, 0 replies; 104+ messages in thread
From: Peter Zijlstra @ 2023-10-02 17:39 UTC (permalink / raw)
  To: Benjamin Segall
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, mgorman, bristot, corbet, qyousef,
	chris.hyser, patrick.bellasi, pjt, pavel, qperret, tim.c.chen,
	joshdon, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel,
	efault, tglx

On Fri, Sep 29, 2023 at 02:40:31PM -0700, Benjamin Segall wrote:

> > +static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> > +{
> > +	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
> > +	struct sched_entity *curr = cfs_rq->curr;
> > +	struct sched_entity *best = NULL;
> > +
> > +	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> > +		curr = NULL;
> > +
> > +	while (node) {
> > +		struct sched_entity *se = __node_2_se(node);
> > +
> > +		/*
> > +		 * If this entity is not eligible, try the left subtree.
> > +		 */
> > +		if (!entity_eligible(cfs_rq, se)) {
> > +			node = node->rb_left;
> > +			continue;
> > +		}
> > +
> > +		/*
> > +		 * If this entity has an earlier deadline than the previous
> > +		 * best, take this one. If it also has the earliest deadline
> > +		 * of its subtree, we're done.
> > +		 */
> > +		if (!best || deadline_gt(deadline, best, se)) {
> > +			best = se;
> > +			if (best->deadline == best->min_deadline)
> > +				break;
> > +		}
> > +
> > +		/*
> > +		 * If the earlest deadline in this subtree is in the fully
> > +		 * eligible left half of our space, go there.
> > +		 */
> > +		if (node->rb_left &&
> > +		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
> > +			node = node->rb_left;
> > +			continue;
> > +		}
> > +
> > +		node = node->rb_right;
> > +	}
> 
> I believe that this can fail to actually find the earliest eligible
> deadline, because the earliest deadline (min_deadline) can be in the
> right branch, but that se isn't eligible, and the actual target se is in
> the left branch. A trivial 3-se example with the nodes represented by
> (vruntime, deadline, min_deadline):
> 
>    (5,9,7)
>  /        \
> (4,8,8)  (6,7,7)
> 
> AIUI, here the EEVDF pick should be (4,8,8), but pick_eevdf() will
> instead pick (5,9,7), because it goes into the right branch and then
> fails eligibility.
> 
> I'm not sure how much of a problem this is in practice, either in
> frequency or severity, but it probably should be mentioned if it's
> an intentional tradeoff.

Well, that is embarrassing :-(

You're quite right -- and I *SHOULD* have double checked my decade old
patches, but alas.

Re-reading the paper, your proposal is fairly close to what they have.
Let me go stare at your patch in more detail.



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr
  2023-09-29 16:54     ` Youssef Esmat
  2023-10-02 15:55       ` Youssef Esmat
@ 2023-10-02 18:41       ` Peter Zijlstra
  2023-10-05 12:05         ` Peter Zijlstra
  1 sibling, 1 reply; 104+ messages in thread
From: Peter Zijlstra @ 2023-10-02 18:41 UTC (permalink / raw)
  To: Youssef Esmat
  Cc: Daniel Jordan, mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen, joel,
	efault, tglx

On Fri, Sep 29, 2023 at 11:54:25AM -0500, Youssef Esmat wrote:
> >
> > EEVDF fundamentally supports per task request/slice sizes, which is the
> > primary motivator for finally finishing these patches.
> >
> > So the plan is to extend sched_setattr() to allow tasks setting their
> > own ideal slice length. But we're not quite there yet.
> >
> > Having just returned from PTO the mailbox is an utter trainwreck, but
> > I'll try and refresh those few patches this week for consideration.
> >
> > In the meantime I think you found the right knob to twiddle.
> 
> Hello Peter,
> 
> I am trying to understand a little better the need for the eligibility
> checks (entity_eligible). I understand the general concept, but I am
> trying to find a scenario where it is necessary. And maybe propose to
> have it toggled by a feature flag.

My initial response was that it ensures fairness, but thinking a little
about it I'm not so sure anymore.

I do think it features in section 6 lemma 4 and 5, which provide
interference bounds. But I'd have to think a little more about it.

In the current setup, where all requests are of equal size, the virtual
deadline tree is basically identical to the virtual runtime tree, just
transposed (in the static-state scenario).

When mixing request sizes things become a little more interesting.

Let me ponder this a little bit more.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* [tip: sched/urgent] sched/fair: Fix pick_eevdf()
  2023-09-30  0:09   ` [PATCH] sched/fair: fix pick_eevdf to always find the correct se Benjamin Segall
@ 2023-10-03 10:42     ` tip-bot2 for Benjamin Segall
       [not found]     ` <CGME20231004203940eucas1p2f73b017497d1f4239a6e236fdb6019e2@eucas1p2.samsung.com>
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 104+ messages in thread
From: tip-bot2 for Benjamin Segall @ 2023-10-03 10:42 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Ben Segall, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     561c58efd2394d76a32254d91e4b1de8ecdeb5c8
Gitweb:        https://git.kernel.org/tip/561c58efd2394d76a32254d91e4b1de8ecdeb5c8
Author:        Benjamin Segall <bsegall@google.com>
AuthorDate:    Fri, 29 Sep 2023 17:09:30 -07:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 03 Oct 2023 12:32:30 +02:00

sched/fair: Fix pick_eevdf()

The old pick_eevdf() could fail to find the actual earliest eligible
deadline when it descended to the right looking for min_deadline, but
it turned out that that min_deadline wasn't actually eligible. In that
case we need to go back and search through any left branches we
skipped looking for the actual best _eligible_ min_deadline.

This is more expensive, but still O(log n), and at worst should only
involve descending two branches of the rbtree.

I've run this through a userspace stress test (thank you
tools/lib/rbtree.c), so hopefully this implementation doesn't miss any
corner cases.

Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/xm261qego72d.fsf_-_@google.com
---
 kernel/sched/fair.c | 72 +++++++++++++++++++++++++++++++++++---------
 1 file changed, 58 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ef7490c..929d21d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -872,14 +872,16 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
  *
  * Which allows an EDF like search on (sub)trees.
  */
-static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
+static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
 {
 	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
 	struct sched_entity *curr = cfs_rq->curr;
 	struct sched_entity *best = NULL;
+	struct sched_entity *best_left = NULL;
 
 	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
 		curr = NULL;
+	best = curr;
 
 	/*
 	 * Once selected, run a task until it either becomes non-eligible or
@@ -900,33 +902,75 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
 		}
 
 		/*
-		 * If this entity has an earlier deadline than the previous
-		 * best, take this one. If it also has the earliest deadline
-		 * of its subtree, we're done.
+		 * Now we heap search eligible trees for the best (min_)deadline
 		 */
-		if (!best || deadline_gt(deadline, best, se)) {
+		if (!best || deadline_gt(deadline, best, se))
 			best = se;
-			if (best->deadline == best->min_deadline)
-				break;
-		}
 
 		/*
-		 * If the earlest deadline in this subtree is in the fully
-		 * eligible left half of our space, go there.
+		 * Every se in a left branch is eligible, keep track of the
+		 * branch with the best min_deadline
 		 */
+		if (node->rb_left) {
+			struct sched_entity *left = __node_2_se(node->rb_left);
+
+			if (!best_left || deadline_gt(min_deadline, best_left, left))
+				best_left = left;
+
+			/*
+			 * min_deadline is in the left branch. rb_left and all
+			 * descendants are eligible, so immediately switch to the second
+			 * loop.
+			 */
+			if (left->min_deadline == se->min_deadline)
+				break;
+		}
+
+		/* min_deadline is at this node, no need to look right */
+		if (se->deadline == se->min_deadline)
+			break;
+
+		/* else min_deadline is in the right branch. */
+		node = node->rb_right;
+	}
+
+	/*
+	 * We ran into an eligible node which is itself the best.
+	 * (Or nr_running == 0 and both are NULL)
+	 */
+	if (!best_left || (s64)(best_left->min_deadline - best->deadline) > 0)
+		return best;
+
+	/*
+	 * Now best_left and all of its children are eligible, and we are just
+	 * looking for deadline == min_deadline
+	 */
+	node = &best_left->run_node;
+	while (node) {
+		struct sched_entity *se = __node_2_se(node);
+
+		/* min_deadline is the current node */
+		if (se->deadline == se->min_deadline)
+			return se;
+
+		/* min_deadline is in the left branch */
 		if (node->rb_left &&
 		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
 			node = node->rb_left;
 			continue;
 		}
 
+		/* else min_deadline is in the right branch */
 		node = node->rb_right;
 	}
+	return NULL;
+}
 
-	if (!best || (curr && deadline_gt(deadline, best, curr)))
-		best = curr;
+static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
+{
+	struct sched_entity *se = __pick_eevdf(cfs_rq);
 
-	if (unlikely(!best)) {
+	if (!se) {
 		struct sched_entity *left = __pick_first_entity(cfs_rq);
 		if (left) {
 			pr_err("EEVDF scheduling fail, picking leftmost\n");
@@ -934,7 +978,7 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
 		}
 	}
 
-	return best;
+	return se;
 }
 
 #ifdef CONFIG_SCHED_DEBUG

^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [PATCH] sched/fair: Preserve PLACE_DEADLINE_INITIAL deadline
  2023-05-31 11:58 ` [PATCH 07/15] sched/smp: Use lag to simplify cross-runqueue placement Peter Zijlstra
  2023-08-10  7:10   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2023-09-12 15:32   ` [PATCH 07/15] " Sebastian Andrzej Siewior
@ 2023-10-04  1:17   ` Daniel Jordan
  2023-10-04 13:09     ` [PATCH v2] " Daniel Jordan
  2023-10-05  5:56     ` [PATCH] " K Prateek Nayak
  2023-10-06 16:48   ` [PATCH] sched/fair: Always update_curr() before placing at enqueue Daniel Jordan
  3 siblings, 2 replies; 104+ messages in thread
From: Daniel Jordan @ 2023-10-04  1:17 UTC (permalink / raw)
  To: peterz
  Cc: bristot, bsegall, chris.hyser, corbet, dietmar.eggemann, efault,
	joel, joshdon, juri.lelli, kprateek.nayak, linux-kernel, mgorman,
	mingo, patrick.bellasi, pavel, pjt, qperret, qyousef, rostedt,
	tglx, tim.c.chen, timj, vincent.guittot, youssefesmat, yu.c.chen,
	daniel.m.jordan

An entity is supposed to get an earlier deadline with
PLACE_DEADLINE_INITIAL when it's forked, but the deadline gets
overwritten soon after in enqueue_entity() the first time a forked
entity is woken so that PLACE_DEADLINE_INITIAL is effectively a no-op.

Placing in task_fork_fair() seems unnecessary since none of the values
that get set (slice, vruntime, deadline) are used before they're set
again at enqueue time, so get rid of that and just pass ENQUEUE_INITIAL
to enqueue_entity() via wake_up_new_task().

Fixes: e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---

Tested on top of peterz/sched/eevdf from 2023-10-03.

 kernel/sched/core.c | 2 +-
 kernel/sched/fair.c | 1 -
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 779cdc7969c81..500e2dbfd41dd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4854,7 +4854,7 @@ void wake_up_new_task(struct task_struct *p)
 	update_rq_clock(rq);
 	post_init_entity_util_avg(p);
 
-	activate_task(rq, p, ENQUEUE_NOCLOCK);
+	activate_task(rq, p, ENQUEUE_INITIAL | ENQUEUE_NOCLOCK);
 	trace_sched_wakeup_new(p);
 	wakeup_preempt(rq, p, WF_FORK);
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a0b4dac2662c9..5872b8a3f5891 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12446,7 +12446,6 @@ static void task_fork_fair(struct task_struct *p)
 	curr = cfs_rq->curr;
 	if (curr)
 		update_curr(cfs_rq);
-	place_entity(cfs_rq, se, ENQUEUE_INITIAL);
 	rq_unlock(rq, &rf);
 }
 
-- 
2.41.0
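
For context, the placement that PLACE_DEADLINE_INITIAL influences looks
roughly like the below (a paraphrase of place_entity() in this series, not
the exact hunk): the forked entity's first virtual deadline is only half a
request away, so it is briefly preferred after fork -- unless a later
place_entity() call without ENQUEUE_INITIAL recomputes the deadline, which
is what currently happens at the first wakeup.

	vslice = calc_delta_fair(se->slice, se);

	/* new tasks join the competition with only half a slice */
	if (sched_feat(PLACE_DEADLINE_INITIAL) && (flags & ENQUEUE_INITIAL))
		vslice /= 2;

	/* EEVDF: vd_i = ve_i + r_i/w_i */
	se->deadline = se->vruntime + vslice;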


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* [PATCH v2] sched/fair: Preserve PLACE_DEADLINE_INITIAL deadline
  2023-10-04  1:17   ` [PATCH] sched/fair: Preserve PLACE_DEADLINE_INITIAL deadline Daniel Jordan
@ 2023-10-04 13:09     ` Daniel Jordan
  2023-10-04 15:46       ` Chen Yu
  2023-10-12  4:48       ` K Prateek Nayak
  2023-10-05  5:56     ` [PATCH] " K Prateek Nayak
  1 sibling, 2 replies; 104+ messages in thread
From: Daniel Jordan @ 2023-10-04 13:09 UTC (permalink / raw)
  To: peterz
  Cc: bristot, bsegall, chris.hyser, corbet, dietmar.eggemann, efault,
	joel, joshdon, juri.lelli, kprateek.nayak, linux-kernel, mgorman,
	mingo, patrick.bellasi, pavel, pjt, qperret, qyousef, rostedt,
	tglx, tim.c.chen, timj, vincent.guittot, youssefesmat, yu.c.chen,
	daniel.m.jordan

An entity is supposed to get an earlier deadline with
PLACE_DEADLINE_INITIAL when it's forked, but the deadline gets
overwritten soon after in enqueue_entity() the first time a forked
entity is woken so that PLACE_DEADLINE_INITIAL is effectively a no-op.

Placing in task_fork_fair() seems unnecessary since none of the values
that get set (slice, vruntime, deadline) are used before they're set
again at enqueue time, so get rid of that (and with it all of
task_fork_fair()) and just pass ENQUEUE_INITIAL to enqueue_entity() via
wake_up_new_task().

Fixes: e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---

v2
 - place_entity() seems like the only reason for task_fork_fair() to exist
   after the recent removal of sysctl_sched_child_runs_first, so take out
   the whole function.

Still based on today's peterz/sched/eevdf

 kernel/sched/core.c |  2 +-
 kernel/sched/fair.c | 24 ------------------------
 2 files changed, 1 insertion(+), 25 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 779cdc7969c81..500e2dbfd41dd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4854,7 +4854,7 @@ void wake_up_new_task(struct task_struct *p)
 	update_rq_clock(rq);
 	post_init_entity_util_avg(p);
 
-	activate_task(rq, p, ENQUEUE_NOCLOCK);
+	activate_task(rq, p, ENQUEUE_INITIAL | ENQUEUE_NOCLOCK);
 	trace_sched_wakeup_new(p);
 	wakeup_preempt(rq, p, WF_FORK);
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a0b4dac2662c9..3827b302eeb9b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12427,29 +12427,6 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	task_tick_core(rq, curr);
 }
 
-/*
- * called on fork with the child task as argument from the parent's context
- *  - child not yet on the tasklist
- *  - preemption disabled
- */
-static void task_fork_fair(struct task_struct *p)
-{
-	struct sched_entity *se = &p->se, *curr;
-	struct cfs_rq *cfs_rq;
-	struct rq *rq = this_rq();
-	struct rq_flags rf;
-
-	rq_lock(rq, &rf);
-	update_rq_clock(rq);
-
-	cfs_rq = task_cfs_rq(current);
-	curr = cfs_rq->curr;
-	if (curr)
-		update_curr(cfs_rq);
-	place_entity(cfs_rq, se, ENQUEUE_INITIAL);
-	rq_unlock(rq, &rf);
-}
-
 /*
  * Priority of the task has changed. Check to see if we preempt
  * the current task.
@@ -12953,7 +12930,6 @@ DEFINE_SCHED_CLASS(fair) = {
 #endif
 
 	.task_tick		= task_tick_fair,
-	.task_fork		= task_fork_fair,
 
 	.prio_changed		= prio_changed_fair,
 	.switched_from		= switched_from_fair,
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* Re: [PATCH v2] sched/fair: Preserve PLACE_DEADLINE_INITIAL deadline
  2023-10-04 13:09     ` [PATCH v2] " Daniel Jordan
@ 2023-10-04 15:46       ` Chen Yu
  2023-10-06 16:31         ` Daniel Jordan
  2023-10-12  4:48       ` K Prateek Nayak
  1 sibling, 1 reply; 104+ messages in thread
From: Chen Yu @ 2023-10-04 15:46 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: peterz, bristot, bsegall, chris.hyser, corbet, dietmar.eggemann,
	efault, joel, joshdon, juri.lelli, kprateek.nayak, linux-kernel,
	mgorman, mingo, patrick.bellasi, pavel, pjt, qperret, qyousef,
	rostedt, tglx, tim.c.chen, timj, vincent.guittot, youssefesmat,
	yu.chen.surf

Hi Daniel,

On 2023-10-04 at 09:09:08 -0400, Daniel Jordan wrote:
> An entity is supposed to get an earlier deadline with
> PLACE_DEADLINE_INITIAL when it's forked, but the deadline gets
> overwritten soon after in enqueue_entity() the first time a forked
> entity is woken so that PLACE_DEADLINE_INITIAL is effectively a no-op.
> 
> Placing in task_fork_fair() seems unnecessary since none of the values
> that get set (slice, vruntime, deadline) are used before they're set
> again at enqueue time, so get rid of that (and with it all of
> task_fork_fair()) and just pass ENQUEUE_INITIAL to enqueue_entity() via
> wake_up_new_task().
> 
> Fixes: e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement")
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> ---
> 
> v2
>  - place_entity() seems like the only reason for task_fork_fair() to exist
>    after the recent removal of sysctl_sched_child_runs_first, so take out
>    the whole function.

At first glance I thought that removing task_fork_fair() would lose one chance to
update the parent task's statistics in update_curr(): we might end up with an
out-of-date parent deadline and make the preemption decision based on stale data
in wake_up_new_task() -> wakeup_preempt() -> pick_eevdf(). But on second thought
I found that wake_up_new_task() -> enqueue_entity() itself invokes update_curr(),
so this should not be a problem.

Then I wondered why we can't just skip place_entity() in enqueue_entity() when
ENQUEUE_WAKEUP is not set, as the code did before e8f331bcc270; that way the
newly forked task's deadline would not be overwritten by wake_up_new_task() ->
enqueue_entity(). Then I realized that, after e8f331bcc270, the task's vruntime
and deadline are both calculated by place_entity() rather than being renormalised
against cfs_rq->min_vruntime in enqueue_entity(), so we cannot simply skip
place_entity() in enqueue_entity().
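
A simplified sketch of the call chain in question (as of this series, with
the patch applied):

    wake_up_new_task()
      activate_task(rq, p, ENQUEUE_INITIAL | ENQUEUE_NOCLOCK)
        enqueue_task() -> enqueue_task_fair()
          enqueue_entity()
            update_curr(cfs_rq)               /* parent's stats refreshed      */
            place_entity(cfs_rq, se, flags)   /* ENQUEUE_INITIAL honoured      */
      wakeup_preempt(rq, p, WF_FORK)          /* preemption check on fresh data */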

Per my understanding, this patch looks good,

Reviewed-by: Chen Yu <yu.c.chen@intel.com>

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH] sched/fair: fix pick_eevdf to always find the correct se
       [not found]     ` <CGME20231004203940eucas1p2f73b017497d1f4239a6e236fdb6019e2@eucas1p2.samsung.com>
@ 2023-10-04 20:39       ` Marek Szyprowski
  0 siblings, 0 replies; 104+ messages in thread
From: Marek Szyprowski @ 2023-10-04 20:39 UTC (permalink / raw)
  To: Benjamin Segall, Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, mgorman, bristot, corbet, qyousef,
	chris.hyser, patrick.bellasi, pjt, pavel, qperret, tim.c.chen,
	joshdon, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel,
	efault, tglx, Greg Kroah-Hartman

Hi,

On 30.09.2023 02:09, Benjamin Segall wrote:
> The old pick_eevdf could fail to find the actual earliest eligible
> deadline when it descended to the right looking for min_deadline, but it
> turned out that that min_deadline wasn't actually eligible. In that case
> we need to go back and search through any left branches we skipped
> looking for the actual best _eligible_ min_deadline.
>
> This is more expensive, but still O(log n), and at worst should only
> involve descending two branches of the rbtree.
>
> I've run this through a userspace stress test (thank you
> tools/lib/rbtree.c), so hopefully this implementation doesn't miss any
> corner cases.
>
> Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
> Signed-off-by: Ben Segall <bsegall@google.com>

This patch landed in today's linux-next as commit 561c58efd239 
("sched/fair: Fix pick_eevdf()"). Surprisingly it introduced a warning 
about circular locking dependency. It can be easily observed during boot 
from time to time on on qemu/arm64 'virt' machine:

======================================================
WARNING: possible circular locking dependency detected
6.6.0-rc4+ #7222 Not tainted
------------------------------------------------------
systemd-udevd/1187 is trying to acquire lock:
ffffbcc2be0c4de0 (console_owner){..-.}-{0:0}, at: 
console_flush_all+0x1b0/0x500

but task is already holding lock:
ffff5535ffdd2b18 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0xe0/0xc40

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #4 (&rq->__lock){-.-.}-{2:2}:
        _raw_spin_lock_nested+0x44/0x5c
        raw_spin_rq_lock_nested+0x24/0x40
        task_fork_fair+0x3c/0xac
        sched_cgroup_fork+0xe8/0x14c
        copy_process+0x11c4/0x1a14
        kernel_clone+0x8c/0x400
        user_mode_thread+0x70/0x98
        rest_init+0x28/0x190
        arch_post_acpi_subsys_init+0x0/0x8
        start_kernel+0x594/0x684
        __primary_switched+0xbc/0xc4

-> #3 (&p->pi_lock){-.-.}-{2:2}:
        _raw_spin_lock_irqsave+0x60/0x88
        try_to_wake_up+0x58/0x468
        default_wake_function+0x14/0x20
        woken_wake_function+0x20/0x2c
        __wake_up_common+0x94/0x170
        __wake_up_common_lock+0x7c/0xcc
        __wake_up+0x18/0x24
        tty_wakeup+0x34/0x70
        tty_port_default_wakeup+0x20/0x38
        tty_port_tty_wakeup+0x18/0x24
        uart_write_wakeup+0x18/0x28
        pl011_tx_chars+0x140/0x2b4
        pl011_start_tx+0xe8/0x190
        serial_port_runtime_resume+0x90/0xc0
        __rpm_callback+0x48/0x1a8
        rpm_callback+0x6c/0x78
        rpm_resume+0x438/0x6d8
        pm_runtime_work+0x84/0xc8
        process_one_work+0x1ec/0x53c
        worker_thread+0x298/0x408
        kthread+0x124/0x128
        ret_from_fork+0x10/0x20

-> #2 (&tty->write_wait){....}-{2:2}:
        _raw_spin_lock_irqsave+0x60/0x88
        __wake_up_common_lock+0x5c/0xcc
        __wake_up+0x18/0x24
        tty_wakeup+0x34/0x70
        tty_port_default_wakeup+0x20/0x38
        tty_port_tty_wakeup+0x18/0x24
        uart_write_wakeup+0x18/0x28
        pl011_tx_chars+0x140/0x2b4
        pl011_start_tx+0xe8/0x190
        serial_port_runtime_resume+0x90/0xc0
        __rpm_callback+0x48/0x1a8
        rpm_callback+0x6c/0x78
        rpm_resume+0x438/0x6d8
        pm_runtime_work+0x84/0xc8
        process_one_work+0x1ec/0x53c
        worker_thread+0x298/0x408
        kthread+0x124/0x128
        ret_from_fork+0x10/0x20

-> #1 (&port_lock_key){..-.}-{2:2}:
        _raw_spin_lock+0x48/0x60
        pl011_console_write+0x13c/0x1b0
        console_flush_all+0x20c/0x500
        console_unlock+0x6c/0x130
        vprintk_emit+0x228/0x3a0
        vprintk_default+0x38/0x44
        vprintk+0xa4/0xc0
        _printk+0x5c/0x84
        register_console+0x1f4/0x420
        serial_core_register_port+0x5a4/0x5d8
        serial_ctrl_register_port+0x10/0x1c
        uart_add_one_port+0x10/0x1c
        pl011_register_port+0x70/0x12c
        pl011_probe+0x1bc/0x1fc
        amba_probe+0x110/0x1c8
        really_probe+0x148/0x2b4
        __driver_probe_device+0x78/0x12c
        driver_probe_device+0xd8/0x160
        __device_attach_driver+0xb8/0x138
        bus_for_each_drv+0x84/0xe0
        __device_attach+0xa8/0x1b0
        device_initial_probe+0x14/0x20
        bus_probe_device+0xb0/0xb4
        device_add+0x574/0x738
        amba_device_add+0x40/0xac
        of_platform_bus_create+0x2b4/0x378
        of_platform_populate+0x50/0xfc
        of_platform_default_populate_init+0xd0/0xf0
        do_one_initcall+0x74/0x2f0
        kernel_init_freeable+0x28c/0x4dc
        kernel_init+0x24/0x1dc
        ret_from_fork+0x10/0x20

-> #0 (console_owner){..-.}-{0:0}:
        __lock_acquire+0x1318/0x20c4
        lock_acquire+0x1e8/0x318
        console_flush_all+0x1f8/0x500
        console_unlock+0x6c/0x130
        vprintk_emit+0x228/0x3a0
        vprintk_default+0x38/0x44
        vprintk+0xa4/0xc0
        _printk+0x5c/0x84
        pick_next_task_fair+0x28c/0x498
        __schedule+0x164/0xc40
        do_task_dead+0x54/0x58
        do_exit+0x61c/0x9e8
        do_group_exit+0x34/0x90
        __wake_up_parent+0x0/0x30
        invoke_syscall+0x48/0x114
        el0_svc_common.constprop.0+0x40/0xe0
        do_el0_svc_compat+0x1c/0x38
        el0_svc_compat+0x48/0xb4
        el0t_32_sync_handler+0x90/0x140
        el0t_32_sync+0x194/0x198

other info that might help us debug this:

Chain exists of:
   console_owner --> &p->pi_lock --> &rq->__lock

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&rq->__lock);
                                lock(&p->pi_lock);
                                lock(&rq->__lock);
   lock(console_owner);

  *** DEADLOCK ***

3 locks held by systemd-udevd/1187:
  #0: ffff5535ffdd2b18 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0xe0/0xc40
  #1: ffffbcc2be0c4c30 (console_lock){+.+.}-{0:0}, at: vprintk_emit+0x11c/0x3a0
  #2: ffffbcc2be0c4c88 (console_srcu){....}-{0:0}, at: console_flush_all+0x7c/0x500

stack backtrace:
CPU: 1 PID: 1187 Comm: systemd-udevd Not tainted 6.6.0-rc4+ #7222
Hardware name: linux,dummy-virt (DT)
Call trace:
  dump_backtrace+0x98/0xf0
  show_stack+0x18/0x24
  dump_stack_lvl+0x60/0xac
  dump_stack+0x18/0x24
  print_circular_bug+0x290/0x370
  check_noncircular+0x15c/0x170
  __lock_acquire+0x1318/0x20c4
  lock_acquire+0x1e8/0x318
  console_flush_all+0x1f8/0x500
  console_unlock+0x6c/0x130
  vprintk_emit+0x228/0x3a0
  vprintk_default+0x38/0x44
  vprintk+0xa4/0xc0
  _printk+0x5c/0x84
  pick_next_task_fair+0x28c/0x498
  __schedule+0x164/0xc40
  do_task_dead+0x54/0x58
  do_exit+0x61c/0x9e8
  do_group_exit+0x34/0x90
  __wake_up_parent+0x0/0x30
  invoke_syscall+0x48/0x114
  el0_svc_common.constprop.0+0x40/0xe0
  do_el0_svc_compat+0x1c/0x38
  el0_svc_compat+0x48/0xb4
  el0t_32_sync_handler+0x90/0x140
  el0t_32_sync+0x194/0x198

The problem is probably elsewhere, but this scheduler change only 
revealed it in a fully reproducible way. Reverting $subject on top of 
linux-next hides the problem deep enough that I was not able to 
reproduce it. Let me know if there is anything I can do to help fix 
this issue.
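
For what it's worth, the backtrace shows the printk() being issued from
pick_next_task_fair() while rq->__lock is held. Purely as an
illustration, not the actual fix, and with the '!best' condition and the
message text being only my guess at what such an error path might look
like, one conventional way to keep a diagnostic on that path without
dragging the console/tty locks under the runqueue lock would be the
deferred variant:

	if (unlikely(!best)) {
		/*
		 * A plain printk() may flush the console directly, and the
		 * UART/tty wakeup path calls try_to_wake_up(), which takes
		 * scheduler locks again (the inversion lockdep reports
		 * above). printk_deferred() only logs the message and
		 * defers the console flush to irq_work.
		 */
		printk_deferred(KERN_ERR "EEVDF scheduling fail, picking leftmost\n");
	}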

> ---
>   kernel/sched/fair.c | 72 ++++++++++++++++++++++++++++++++++++---------
>   1 file changed, 58 insertions(+), 14 deletions(-)
>
> ...

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH] sched/fair: Preserve PLACE_DEADLINE_INITIAL deadline
  2023-10-04  1:17   ` [PATCH] sched/fair: Preserve PLACE_DEADLINE_INITIAL deadline Daniel Jordan
  2023-10-04 13:09     ` [PATCH v2] " Daniel Jordan
@ 2023-10-05  5:56     ` K Prateek Nayak
  2023-10-06 16:35       ` Daniel Jordan
  1 sibling, 1 reply; 104+ messages in thread
From: K Prateek Nayak @ 2023-10-05  5:56 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: bristot, bsegall, chris.hyser, corbet, dietmar.eggemann, efault,
	joel, joshdon, juri.lelli, linux-kernel, mgorman, mingo,
	patrick.bellasi, pavel, pjt, qperret, qyousef, rostedt, tglx,
	tim.c.chen, timj, vincent.guittot, youssefesmat, yu.c.chen,
	peterz

Hello Daniel,

On 10/4/2023 6:47 AM, Daniel Jordan wrote:
> An entity is supposed to get an earlier deadline with
> PLACE_DEADLINE_INITIAL when it's forked, but the deadline gets
> overwritten soon after in enqueue_entity() the first time a forked
> entity is woken so that PLACE_DEADLINE_INITIAL is effectively a no-op.
> 
> Placing in task_fork_fair() seems unnecessary since none of the values
> that get set (slice, vruntime, deadline) are used before they're set
> again at enqueue time, so get rid of that and just pass ENQUEUE_INITIAL
> to enqueue_entity() via wake_up_new_task().
> 
> Fixes: e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement")
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>

I got a chance to test this on a 3rd Generation EPYC system. I don't
see anything out of the ordinary except for a small regression on
hackbench. I'll leave the full result below.

o System details

- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- Boost enabled, C2 Disabled (POLL and MWAIT based C1 remained enabled)


o Kernel Details

- tip:	tip:sched/core at commit d4d6596b4386 ("sched/headers: Remove
	duplicate header inclusions")

- place-deadline-fix: tip + this patch


o Benchmark Results

==================================================================
Test          : hackbench
Units         : Normalized time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
Case:           tip[pct imp](CV)    place-deadline-fix[pct imp](CV)
 1-groups     1.00 [ -0.00]( 2.58)     1.04 [ -3.63]( 3.14)
 2-groups     1.00 [ -0.00]( 1.87)     1.03 [ -2.98]( 1.85)
 4-groups     1.00 [ -0.00]( 1.63)     1.02 [ -2.35]( 1.59)
 8-groups     1.00 [ -0.00]( 1.38)     1.03 [ -2.92]( 1.20)
16-groups     1.00 [ -0.00]( 2.67)     1.02 [ -1.61]( 2.08)


==================================================================
Test          : tbench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:           tip[pct imp](CV)    place-deadline-fix[pct imp](CV)
    1     1.00 [  0.00]( 0.59)     1.02 [  2.09]( 0.07)
    2     1.00 [  0.00]( 1.19)     1.02 [  2.38]( 0.82)
    4     1.00 [  0.00]( 0.33)     1.03 [  2.89]( 0.99)
    8     1.00 [  0.00]( 0.76)     1.02 [  2.10]( 0.46)
   16     1.00 [  0.00]( 1.10)     1.01 [  0.81]( 0.49)
   32     1.00 [  0.00]( 1.47)     1.02 [  1.77]( 0.58)
   64     1.00 [  0.00]( 1.77)     1.02 [  1.83]( 1.77)
  128     1.00 [  0.00]( 0.41)     1.02 [  2.49]( 0.52)
  256     1.00 [  0.00]( 0.63)     1.03 [  3.03]( 1.38)
  512     1.00 [  0.00]( 0.32)     1.02 [  1.61]( 0.45)
 1024     1.00 [  0.00]( 0.22)     1.01 [  1.00]( 0.26)


==================================================================
Test          : stream-10
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:           tip[pct imp](CV)    place-deadline-fix[pct imp](CV)
 Copy     1.00 [  0.00]( 9.30)     0.85 [-15.36](11.26)
Scale     1.00 [  0.00]( 6.67)     0.98 [ -2.36]( 7.53)
  Add     1.00 [  0.00]( 6.77)     0.92 [ -7.86]( 7.83)
Triad     1.00 [  0.00]( 7.36)     0.94 [ -5.57]( 6.82)


==================================================================
Test          : stream-100
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:           tip[pct imp](CV)    place-deadline-fix[pct imp](CV)
 Copy     1.00 [  0.00]( 1.83)     0.96 [ -3.68]( 5.08)
Scale     1.00 [  0.00]( 6.41)     1.03 [  2.66]( 5.28)
  Add     1.00 [  0.00]( 6.23)     1.02 [  1.54]( 4.97)
Triad     1.00 [  0.00]( 0.89)     0.94 [ -5.68]( 6.78)


==================================================================
Test          : netperf
Units         : Normalized Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:           tip[pct imp](CV)    place-deadline-fix[pct imp](CV)
 1-clients     1.00 [  0.00]( 0.05)     1.02 [  1.83]( 1.98)
 2-clients     1.00 [  0.00]( 0.93)     1.02 [  1.87]( 2.45)
 4-clients     1.00 [  0.00]( 0.54)     1.02 [  2.19]( 1.99)
 8-clients     1.00 [  0.00]( 0.48)     1.02 [  2.29]( 2.27)
16-clients     1.00 [  0.00]( 0.42)     1.02 [  1.60]( 1.70)
32-clients     1.00 [  0.00]( 0.78)     1.02 [  1.88]( 2.08)
64-clients     1.00 [  0.00]( 1.45)     1.02 [  2.33]( 2.18)
128-clients    1.00 [  0.00]( 0.97)     1.02 [  2.38]( 1.95)
256-clients    1.00 [  0.00]( 4.57)     1.02 [  2.50]( 5.42)
512-clients    1.00 [  0.00](52.74)     1.03 [  3.38](49.69)


==================================================================
Test          : schbench
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers:           tip[pct imp](CV)    place-deadline-fix[pct imp](CV)
  1     1.00 [ -0.00]( 3.95)     0.90 [ 10.26](31.80)
  2     1.00 [ -0.00](10.45)     1.08 [ -7.89](15.33)
  4     1.00 [ -0.00]( 4.76)     0.93 [  7.14]( 3.95)
  8     1.00 [ -0.00]( 9.35)     1.06 [ -6.25]( 8.90)
 16     1.00 [ -0.00]( 8.84)     0.92 [  8.06]( 4.39)
 32     1.00 [ -0.00]( 3.33)     1.04 [ -4.40]( 3.68)
 64     1.00 [ -0.00]( 6.70)     0.96 [  4.17]( 2.75)
128     1.00 [ -0.00]( 0.71)     0.96 [  3.55]( 1.26)
256     1.00 [ -0.00](31.20)     1.28 [-28.21]( 9.69)
512     1.00 [ -0.00]( 4.98)     1.00 [  0.48]( 2.76)


==================================================================
Test          : ycsb-cassandra
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
Metric          tip    place-deadline-fix(pct imp)
Throughput      1.00    1.01 (%diff: 1.06%)


==================================================================
Test          : ycsb-mongodb
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
Metric          tip    place-deadline-fix(pct imp)
Throughput      1.00    1.00 (%diff: 0.25%)


==================================================================
Test          : DeathStarBench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
Pinning      scaling     tip            place-deadline-fix(pct imp)
 1CCD           1       1.00            1.00 (%diff: -0.32%)
 2CCD           2       1.00            1.00 (%diff: -0.26%)
 4CCD           4       1.00            1.00 (%diff: 0.17%)
 8CCD           8       1.00            1.00 (%diff: -0.17%)


--
I see there is a v2. I'll give that a spin as well.

> ---
> 
> Tested on top of peterz/sched/eevdf from 2023-10-03.
> 
>  kernel/sched/core.c | 2 +-
>  kernel/sched/fair.c | 1 -
>  2 files changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 779cdc7969c81..500e2dbfd41dd 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4854,7 +4854,7 @@ void wake_up_new_task(struct task_struct *p)
>  	update_rq_clock(rq);
>  	post_init_entity_util_avg(p);
>  
> -	activate_task(rq, p, ENQUEUE_NOCLOCK);
> +	activate_task(rq, p, ENQUEUE_INITIAL | ENQUEUE_NOCLOCK);
>  	trace_sched_wakeup_new(p);
>  	wakeup_preempt(rq, p, WF_FORK);
>  #ifdef CONFIG_SMP
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a0b4dac2662c9..5872b8a3f5891 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12446,7 +12446,6 @@ static void task_fork_fair(struct task_struct *p)
>  	curr = cfs_rq->curr;
>  	if (curr)
>  		update_curr(cfs_rq);
> -	place_entity(cfs_rq, se, ENQUEUE_INITIAL);
>  	rq_unlock(rq, &rf);
>  }
>  
 
--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr
  2023-10-02 18:41       ` Peter Zijlstra
@ 2023-10-05 12:05         ` Peter Zijlstra
  2023-10-05 14:14           ` Peter Zijlstra
                             ` (2 more replies)
  0 siblings, 3 replies; 104+ messages in thread
From: Peter Zijlstra @ 2023-10-05 12:05 UTC (permalink / raw)
  To: Youssef Esmat
  Cc: Daniel Jordan, mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen, joel,
	efault, tglx

[-- Attachment #1: Type: text/plain, Size: 10061 bytes --]

On Mon, Oct 02, 2023 at 08:41:36PM +0200, Peter Zijlstra wrote:

> When mixing request sizes things become a little more interesting.
> 
> Let me ponder this a little bit more.

Using the attached program (I got *REALLY* fed up trying to draw these
diagrams by hand), let us illustrate the difference between Earliest
*Eligible* Virtual Deadline First and the one with the Eligible test
taken out: EVDF.

Specifically, the program was used with the following argument for
EEVDF:

  ./eevdf -e "0,1024,6" -e "1,1024,2" -e "2,1024,18" -v 19 

and with an additional '-n' for the EVDF column.
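
(Per the sscanf() in the attached program, each -e argument is
"vruntime,weight,request", so the three entities above start at vruntime
0, 1 and 2, all with weight 1024, and with requests of 6, 2 and 18
respectively; that is where the d values printed below come from.)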


EEVDF							EVDF


d = 6							d = 6
d = 2                                                   d = 2
d = 18                                                  d = 18
q = 2                                                   q = 2
                                                        
t=0 V=1                                                 t=0 V=1
 A |----<                                                A |----<
>B  |<                                                  >B  |<
 C   |----------------<                                  C   |----------------<
   |*--------|---------|---------|---------|----           |*--------|---------|---------|---------|----
                                                        
                                                        
t=2 V=1                                                 t=2 V=1
>A |----<                                                A |----<
 B    |<                                                >B    |<
 C   |----------------<                                  C   |----------------<
   |*--------|---------|---------|---------|----           |*--------|---------|---------|---------|----
                                                        
                                                        
t=8 V=3                                                 t=4 V=2
 A       |----<                                         >A |----<
>B    |<                                                 B      |<
 C   |----------------<                                  C   |----------------<
   |--*------|---------|---------|---------|----           |-*-------|---------|---------|---------|----
                                                        
                                                        
t=10 V=4                                                t=10 V=4
 A       |----<                                          A       |----<
 B      |<                                              >B      |<
>C   |----------------<                                  C   |----------------<
   |---*-----|---------|---------|---------|----           |---*-----|---------|---------|---------|----
                                                        
                                                        
t=28 V=10                                               t=12 V=5
 A       |----<                                          A       |----<
>B      |<                                              >B        |<
 C                     |----------------<                C   |----------------<
   |---------*---------|---------|---------|----           |----*----|---------|---------|---------|----
                                                        
                                                        
t=30 V=11                                               t=14 V=5
 A       |----<                                          A       |----<
>B        |<                                            >B          |<
 C                     |----------------<                C   |----------------<
   |---------|*--------|---------|---------|----           |----*----|---------|---------|---------|----
                                                        
                                                        
t=32 V=11                                               t=16 V=6
 A       |----<                                         >A       |----<
>B          |<                                           B            |<
 C                     |----------------<                C   |----------------<
   |---------|*--------|---------|---------|----           |-----*---|---------|---------|---------|----
                                                        
                                                        
t=34 V=12                                               t=22 V=8
>A       |----<                                          A             |----<
 B            |<                                        >B            |<
 C                     |----------------<                C   |----------------<
   |---------|-*-------|---------|---------|----           |-------*-|---------|---------|---------|----
                                                        
                                                        
t=40 V=14                                               t=24 V=9
 A             |----<                                    A             |----<
>B            |<                                        >B              |<
 C                     |----------------<                C   |----------------<
   |---------|---*-----|---------|---------|----           |--------*|---------|---------|---------|----
                                                        
                                                        
t=42 V=15                                               t=26 V=9
 A             |----<                                    A             |----<
>B              |<                                      >B                |<
 C                     |----------------<                C   |----------------<
   |---------|----*----|---------|---------|----           |--------*|---------|---------|---------|----
                                                        
                                                        
t=44 V=15                                               t=28 V=10
 A             |----<                                   >A             |----<
>B                |<                                     B                  |<
 C                     |----------------<                C   |----------------<
   |---------|----*----|---------|---------|----           |---------*---------|---------|---------|----
                                                        
                                                        
t=46 V=16                                               t=34 V=12
>A             |----<                                    A                   |----<
 B                  |<                                  >B                  |<
 C                     |----------------<                C   |----------------<
   |---------|-----*---|---------|---------|----           |---------|-*-------|---------|---------|----
                                                        
                                                        
t=52 V=18                                               t=36 V=13
 A                   |----<                              A                   |----<
>B                  |<                                   B                    |<
 C                     |----------------<               >C   |----------------<
   |---------|-------*-|---------|---------|----           |---------|--*------|---------|---------|----
                                                        
                                                        
t=54 V=19                                               t=54 V=19
 A                   |----<                              A                   |----<
>B                    |<                                >B                    |<
 C                     |----------------<                C                     |----------------<
   |---------|--------*|---------|---------|----           |---------|--------*|---------|---------|----
                                                        
                                                        
lags: -10, 6                                            lags: -7, 11
                                                        
BAaaBCccccccccBBBAaaBBBAaaBB                            BBAaaBBBAaaBBBAaaBCccccccccB



As I wrote before; EVDF has worse lag bounds, but this is not
insurmountable. The biggest problem that I can see is that of wakeup
preemption. Currently we allow to preempt when 'current' has reached V
(RUN_TO_PARITY in pick_eevdf()).

With these rules, when EEVDF schedules C (our large slice task) at t=10
above, it is only a little behind V and can be readily preempted after
about 2 time units.

However, EVDF will delay scheduling C until much later; see how A and B
walk far ahead of V until t=36. Only then will we pick C. But this means
that we're firmly stuck with C for at least 11 time units. A newly
placed task will be around V and will have no chance to preempt.

That said, I do have me a patch to cure some of that:

  https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=d7edbe431f31762e516f2730196f41322edcc621

That would allow a task with a shorter request time to preempt in spite
of RUN_TO_PARITY.

However, in this example V is only 2/3 of the way to C's deadline, but
if we were to have many more tasks, you'll see V gets closer and closer
to C's deadline and it will become harder and harder to place such that
preemption becomes viable.

Adding 4 more tasks:

  ./eevdf -e "0,1024,6" -e "1,1024,2" -e "2,1024,18" -v 19 -n -e "3,1024,2" -e "4,1024,2" -e "5,1024,2" -e "6,1024,2"

t=92 V=16
 A                   |----<
 B                    |<
>C   |----------------<
 D                    |<
 E                   |<
 F                    |<
 G                   |<
   |---------|-----*---|---------|---------|----


And I worry this will create very real latency spikes.
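
To put rough numbers on that (made up, but chosen to approximate the
snapshots above; with all weights equal to 1024, V is simply the mean
vruntime): with C still at vruntime 2 and deadline 20, and the other
tasks sitting around vruntime 18.5, two of them give
V = (2*18.5 + 2)/3 = 13 while six give V = (6*18.5 + 2)/7 ~= 16, which
is roughly what the t=36 and t=92 snapshots above show; every extra task
ahead of C drags V further towards C's deadline.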

That said; I do see not having the eligibility check can help. So I'm
not opposed to having a sched_feat for this, but I would not want to
default to EVDF.
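
Such a knob would presumably be just another entry in
kernel/sched/features.h plus a sched_feat() test in the eligibility
check, something like (feature name made up):

	SCHED_FEAT(EVDF, false)

with pick_eevdf()/entity_eligible() skipping the eligibility filter when
sched_feat(EVDF) is set.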

[-- Attachment #2: eevdf.c --]
[-- Type: text/x-csrc, Size: 4009 bytes --]

/* GPL-2.0 */
#include <stdio.h>
#include <limits.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdbool.h>
#include <sys/param.h>

bool eligible = true;
unsigned long V_lim = 20;

struct entity {
	unsigned long vruntime;
	unsigned long weight;
	unsigned long request;
	unsigned long vdeadline;
	int idx;
};

unsigned int gcd(unsigned int a, unsigned int b)
{
	int gcd, m = MIN(a, b);

	for (int i = 1; i <= m; i++) {
		if (a%i == 0 && b%i == 0)
			gcd = i;
	}

	return gcd;
}

int init_entities(int nr, struct entity *es)
{
	unsigned int q = 0;

	for (int i = 0; i < nr; i++) {
		unsigned long d = (1024 * es[i].request) / es[i].weight;
		printf("d = %d\n", d);
		if (!q)
			q = d;
		else
			q = gcd(q, d);

		es[i].vdeadline = es[i].vruntime + d;
		es[i].idx = i;
	}

	printf("q = %d\n\n", q);

	return q;
}

int run_entity(struct entity *e)
{
	unsigned long d = e->vdeadline - e->vruntime;

	d *= e->weight;
	d /= 1024;

	e->vruntime = e->vdeadline;
	e->vdeadline += (1024 * e->request) / e->weight;

	return d;
}

unsigned long avg_vruntime(int nr, struct entity *es)
{
	unsigned long W = 0, V = 0;

	for (int i = 0; i < nr; i++) {
		V += es[i].weight * es[i].vruntime;
		W += es[i].weight;
	}

	V /= W;

	return V;
}

struct entity *pick_entity(int nr, struct entity *es)
{
	unsigned long W = 0, V = 0;
	struct entity *e = NULL;

	for (int i = 0; i < nr; i++) {
		V += es[i].weight * es[i].vruntime;
		W += es[i].weight;
	}

	for (int i = 0; i < nr; i++) {
		if (eligible && W*es[i].vruntime > V)
			continue;

		if (!e || es[i].vdeadline < e->vdeadline)
			e = &es[i];
	}

	return e;
}

void __print_space(int n)
{
	for (int i = 0; i < n; i++)
		putchar(' ');
}

void __print_arrow(int n)
{
	putchar('|');
	for (int i = 1; i < (n-1); i++)
		putchar('-');
	putchar('<');
}

void print_entity(struct entity *e)
{
	__print_space(e->vruntime);
	__print_arrow(e->vdeadline - e->vruntime);
}

void print_entities(int nr, struct entity *es, struct entity *p)
{
	for (int i = 0; i < nr; i++) {
		if (&es[i] == p)
			putchar('>');
		else
			putchar(' ');
		putchar('A' + i);
		putchar(' ');
		print_entity(&es[i]);
		putchar('\n');
	}
}

void print_timeline(unsigned long V)
{
	char timeline[] = "|---------|---------|---------|---------|----";

	if (V > sizeof(timeline)-1) {
		printf("Whoopsie! out of time\n");
		exit(0);
	}

	timeline[V] = '*';
	__print_space(3);
	puts(timeline);
	putchar('\n');
}

void update_lags(int nr, struct entity *es, unsigned long V, long *min, long *max)
{
	for (int i = 0; i < nr; i++) {
		long lag = V - es[i].vruntime;
		if (lag < *min)
			*min = lag;
		if (lag > *max)
			*max = lag;
	}
}

int main(int argc, char *argv[])
{
	unsigned int s = 0, t = 0, n = 0, q = 1;
	long min_lag = 0, max_lag = 0;
	struct entity *e, es[8];
	unsigned long V;
	char S[1024];
	int opt;

	const int N = sizeof(es) / sizeof(es[0]);

	while ((opt = getopt(argc, argv, "nv:e:")) != -1) {
		unsigned int v,w,r;

		switch (opt) {
		case 'n':
			eligible = false;
			break;

		case 'v':
			V_lim = atol(optarg);
			break;

		case 'e':
			if (n >= N) {
				printf("Whoopsie! too many entities\n");
				exit(0);
			}
			if (sscanf(optarg, "%u,%u,%u", &v,&w,&r) == 3) {
				es[n++] = (struct entity) {
					.vruntime = v,
					.weight = w,
					.request = r,
				};
			}
			break;

		default:
			printf("Whoopsie!, bad arguments\n");
			exit(0);
		}
	}

	if (!n) {
		printf("Whoopsie!, no entities\n");
		exit(0);
	}

	q = init_entities(n, es);

	do {
		int d;

		V = avg_vruntime(n, es);
		printf("t=%d V=%ld\n", t, V);

		update_lags(n, es, V, &min_lag, &max_lag);

		e = pick_entity(n, es);
		if (!e) {
			printf("Whoopsie, no pick\n");
			exit(0);
		}

		print_entities(n, es, e);
		print_timeline(V);

		d = run_entity(e);
		t += d;

		for (int i = 0; i < d; i += q) {
			char c = 'A' + e->idx;
			if (i)
				c = 'a' + e->idx;
			S[s++] = c;
			S[s] = '\0';
		}

		putchar('\n');
	} while (V < V_lim);

	printf("lags: %ld, %ld\n\n", min_lag, max_lag);

	puts(S);
	putchar('\n');

	return 0;
}
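
(The above is plain, self-contained userspace C; it should build with
something like "gcc -O2 -o eevdf eevdf.c".)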

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr
  2023-10-05 12:05         ` Peter Zijlstra
@ 2023-10-05 14:14           ` Peter Zijlstra
  2023-10-05 14:42             ` Peter Zijlstra
  2023-10-05 18:23           ` Youssef Esmat
  2023-10-07 22:04           ` Peter Zijlstra
  2 siblings, 1 reply; 104+ messages in thread
From: Peter Zijlstra @ 2023-10-05 14:14 UTC (permalink / raw)
  To: Youssef Esmat
  Cc: Daniel Jordan, mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen, joel,
	efault, tglx

On Thu, Oct 05, 2023 at 02:05:57PM +0200, Peter Zijlstra wrote:

> Using the attached program (I got *REALLY* fed up trying to draw these
> diagrams by hand), let us illustrate the difference between Earliest
> *Eligible* Virtual Deadline First and the one with the Eligible test
> taken out: EVDF.
> 
> Specifically, the program was used with the following argument for
> EEVDF:
> 
>   ./eevdf -e "0,1024,6" -e "1,1024,2" -e "2,1024,18" -v 19 
> 
> and with an additional '-n' for the EVDF column.
> 

<snip a metric ton of diagrams>

> 
> As I wrote before; EVDF has worse lag bounds, but this is not
> insurmountable. The biggest problem that I can see is that of wakeup
> preemption. Currently we allow to preempt when 'current' has reached V
> (RUN_TO_PARITY in pick_eevdf()).
> 
> With these rules, when EEVDF schedules C (our large slice task) at t=10
> above, it is only a little behind V and can be readily preempted after
> about 2 time units.
> 
> However, EVDF will delay scheduling C until much later; see how A and B
> walk far ahead of V until t=36. Only then will we pick C. But this means
> that we're firmly stuck with C for at least 11 time units. A newly
> placed task will be around V and will have no chance to preempt.
> 
> That said, I do have me a patch to cure some of that:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=d7edbe431f31762e516f2730196f41322edcc621
> 
> That would allow a task with a shorter request time to preempt in spite
> of RUN_TO_PARITY.
> 

So, doing all that gave me an idea, see queue/sched/eevdf BIAS_ELIGIBLE.

It pushes the eligibility threshold (V) right by one average request.
The below patch is against eevdf.c.

I've not yet run it through the normal set of hackbench, netperf, etc., so
it might eat your pet and set your granny on fire.


--- eevdf.c.orig	2023-10-05 16:11:35.821114320 +0200
+++ eevdf.c	2023-10-05 16:08:38.387080327 +0200
@@ -7,6 +7,7 @@
 #include <sys/param.h>
 
 bool eligible = true;
+bool bias = false;
 unsigned long V_lim = 20;
 
 struct entity {
@@ -79,16 +80,17 @@
 
 struct entity *pick_entity(int nr, struct entity *es)
 {
-	unsigned long W = 0, V = 0;
+	unsigned long W = 0, V = 0, R = 0;
 	struct entity *e = NULL;
 
 	for (int i = 0; i < nr; i++) {
 		V += es[i].weight * es[i].vruntime;
+		R += es[i].request;
 		W += es[i].weight;
 	}
 
 	for (int i = 0; i < nr; i++) {
-		if (eligible && W*es[i].vruntime > V)
+		if (eligible && W*es[i].vruntime > V + (bias * R))
 			continue;
 
 		if (!e || es[i].vdeadline < e->vdeadline)
@@ -169,10 +171,14 @@
 
 	const int N = sizeof(es) / sizeof(es[0]);
 
-	while ((opt = getopt(argc, argv, "nv:e:")) != -1) {
+	while ((opt = getopt(argc, argv, "bnv:e:")) != -1) {
 		unsigned int v,w,r;
 
 		switch (opt) {
+		case 'b':
+			bias = true;
+			break;
+
 		case 'n':
 			eligible = false;
 			break;
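
With that applied, the biased pick can be compared against the earlier
columns the same way as before, e.g. the original command plus the new
switch:

  ./eevdf -b -e "0,1024,6" -e "1,1024,2" -e "2,1024,18" -v 19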

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr
  2023-10-05 14:14           ` Peter Zijlstra
@ 2023-10-05 14:42             ` Peter Zijlstra
  0 siblings, 0 replies; 104+ messages in thread
From: Peter Zijlstra @ 2023-10-05 14:42 UTC (permalink / raw)
  To: Youssef Esmat
  Cc: Daniel Jordan, mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen, joel,
	efault, tglx

On Thu, Oct 05, 2023 at 04:14:08PM +0200, Peter Zijlstra wrote:

> --- eevdf.c.orig	2023-10-05 16:11:35.821114320 +0200
> +++ eevdf.c	2023-10-05 16:08:38.387080327 +0200
> @@ -7,6 +7,7 @@
>  #include <sys/param.h>
>  
>  bool eligible = true;
> +bool bias = false;
>  unsigned long V_lim = 20;
>  
>  struct entity {
> @@ -79,16 +80,17 @@
>  
>  struct entity *pick_entity(int nr, struct entity *es)
>  {
> -	unsigned long W = 0, V = 0;
> +	unsigned long W = 0, V = 0, R = 0;
>  	struct entity *e = NULL;
>  
>  	for (int i = 0; i < nr; i++) {
>  		V += es[i].weight * es[i].vruntime;
> +		R += es[i].request;

				* 1024

Also, average seems too much, one large value lifts it too easily. 

Need to come up with something better :/
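
(For reference on the units: pick_entity() compares W*es[i].vruntime
against V = sum(weight*vruntime), so both sides are in weight*time,
whereas R above sums plain request times; scaling R by 1024, the weight
used throughout these examples, turns the added threshold into W times
the average request, i.e. the vruntime cutoff moves right by one average
request in the equal-weight case, since W = 1024*nr there.)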

>  		W += es[i].weight;
>  	}
>  
>  	for (int i = 0; i < nr; i++) {
> -		if (eligible && W*es[i].vruntime > V)
> +		if (eligible && W*es[i].vruntime > V + (bias * R))
>  			continue;
>  
>  		if (!e || es[i].vdeadline < e->vdeadline)
> @@ -169,10 +171,14 @@
>  
>  	const int N = sizeof(es) / sizeof(es[0]);
>  
> -	while ((opt = getopt(argc, argv, "nv:e:")) != -1) {
> +	while ((opt = getopt(argc, argv, "bnv:e:")) != -1) {
>  		unsigned int v,w,r;
>  
>  		switch (opt) {
> +		case 'b':
> +			bias = true;
> +			break;
> +
>  		case 'n':
>  			eligible = false;
>  			break;

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr
  2023-10-05 12:05         ` Peter Zijlstra
  2023-10-05 14:14           ` Peter Zijlstra
@ 2023-10-05 18:23           ` Youssef Esmat
  2023-10-06  0:36             ` Youssef Esmat
  2023-10-10  8:08             ` Peter Zijlstra
  2023-10-07 22:04           ` Peter Zijlstra
  2 siblings, 2 replies; 104+ messages in thread
From: Youssef Esmat @ 2023-10-05 18:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Daniel Jordan, mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen, joel,
	efault, tglx

On Thu, Oct 5, 2023 at 7:06 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Oct 02, 2023 at 08:41:36PM +0200, Peter Zijlstra wrote:
>
> > When mixing request sizes things become a little more interesting.
> >
> > Let me ponder this a little bit more.
>
> Using the attached program (I got *REALLY* fed up trying to draw these
> diagrams by hand), let us illustrate the difference between Earliest
> *Eligible* Virtual Deadline First and the one with the Eligible test
> taken out: EVDF.
>
> Specifically, the program was used with the following argument for
> EEVDF:
>
>   ./eevdf -e "0,1024,6" -e "1,1024,2" -e "2,1024,18" -v 19
>
> and with an additional '-n' for the EVDF column.
>
>
> EEVDF                                                   EVDF
>
>
> d = 6                                                   d = 6
> d = 2                                                   d = 2
> d = 18                                                  d = 18
> q = 2                                                   q = 2
>
> t=0 V=1                                                 t=0 V=1
>  A |----<                                                A |----<
> >B  |<                                                  >B  |<
>  C   |----------------<                                  C   |----------------<
>    |*--------|---------|---------|---------|----           |*--------|---------|---------|---------|----
>
>
> t=2 V=1                                                 t=2 V=1
> >A |----<                                                A |----<
>  B    |<                                                >B    |<
>  C   |----------------<                                  C   |----------------<
>    |*--------|---------|---------|---------|----           |*--------|---------|---------|---------|----
>
>
> t=8 V=3                                                 t=4 V=2
>  A       |----<                                         >A |----<
> >B    |<                                                 B      |<
>  C   |----------------<                                  C   |----------------<
>    |--*------|---------|---------|---------|----           |-*-------|---------|---------|---------|----
>
>
> t=10 V=4                                                t=10 V=4
>  A       |----<                                          A       |----<
>  B      |<                                              >B      |<
> >C   |----------------<                                  C   |----------------<
>    |---*-----|---------|---------|---------|----           |---*-----|---------|---------|---------|----
>
>
> t=28 V=10                                               t=12 V=5
>  A       |----<                                          A       |----<
> >B      |<                                              >B        |<
>  C                     |----------------<                C   |----------------<
>    |---------*---------|---------|---------|----           |----*----|---------|---------|---------|----
>
>
> t=30 V=11                                               t=14 V=5
>  A       |----<                                          A       |----<
> >B        |<                                            >B          |<
>  C                     |----------------<                C   |----------------<
>    |---------|*--------|---------|---------|----           |----*----|---------|---------|---------|----
>
>
> t=32 V=11                                               t=16 V=6
>  A       |----<                                         >A       |----<
> >B          |<                                           B            |<
>  C                     |----------------<                C   |----------------<
>    |---------|*--------|---------|---------|----           |-----*---|---------|---------|---------|----
>
>
> t=34 V=12                                               t=22 V=8
> >A       |----<                                          A             |----<
>  B            |<                                        >B            |<
>  C                     |----------------<                C   |----------------<
>    |---------|-*-------|---------|---------|----           |-------*-|---------|---------|---------|----
>
>
> t=40 V=14                                               t=24 V=9
>  A             |----<                                    A             |----<
> >B            |<                                        >B              |<
>  C                     |----------------<                C   |----------------<
>    |---------|---*-----|---------|---------|----           |--------*|---------|---------|---------|----
>
>
> t=42 V=15                                               t=26 V=9
>  A             |----<                                    A             |----<
> >B              |<                                      >B                |<
>  C                     |----------------<                C   |----------------<
>    |---------|----*----|---------|---------|----           |--------*|---------|---------|---------|----
>
>
> t=44 V=15                                               t=28 V=10
>  A             |----<                                   >A             |----<
> >B                |<                                     B                  |<
>  C                     |----------------<                C   |----------------<
>    |---------|----*----|---------|---------|----           |---------*---------|---------|---------|----
>
>
> t=46 V=16                                               t=34 V=12
> >A             |----<                                    A                   |----<
>  B                  |<                                  >B                  |<
>  C                     |----------------<                C   |----------------<
>    |---------|-----*---|---------|---------|----           |---------|-*-------|---------|---------|----
>
>
> t=52 V=18                                               t=36 V=13
>  A                   |----<                              A                   |----<
> >B                  |<                                   B                    |<
>  C                     |----------------<               >C   |----------------<
>    |---------|-------*-|---------|---------|----           |---------|--*------|---------|---------|----
>
>
> t=54 V=19                                               t=54 V=19
>  A                   |----<                              A                   |----<
> >B                    |<                                >B                    |<
>  C                     |----------------<                C                     |----------------<
>    |---------|--------*|---------|---------|----           |---------|--------*|---------|---------|----
>
>
> lags: -10, 6                                            lags: -7, 11
>
> BAaaBCccccccccBBBAaaBBBAaaBB                            BBAaaBBBAaaBBBAaaBCccccccccB
>
>
>
> As I wrote before; EVDF has worse lag bounds, but this is not
> insurmountable. The biggest problem that I can see is that of wakeup
> preemption. Currently we allow to preempt when 'current' has reached V
> (RUN_TO_PARITY in pick_eevdf()).
>
> With these rules, when EEVDF schedules C (our large slice task) at t=10
> above, it is only a little behind V and can be readily preempted after
> about 2 time units.
>
> However, EVDF will delay scheduling C until much later; see how A and B
> walk far ahead of V until t=36. Only then will we pick C. But this means
> that we're firmly stuck with C for at least 11 time units. A newly
> placed task will be around V and will have no chance to preempt.
>

Thank you for the detailed analysis! I am still in the process of
digesting everything.
I do have a quick question: this will only be the case if we adjust
C's runtime without adjusting its nice value, correct? So it does not
currently apply to the submitted code where the only way to change the
deadline is to also change the nice value and thus how fast/slow
vruntime accumulates. In other words without the sched_runtime
patch[1] we should not run into this scenario, correct?

[1] https://lore.kernel.org/lkml/20230915124354.416936110@noisy.programming.kicks-ass.net/

> That said, I do have me a patch to cure some of that:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=d7edbe431f31762e516f2730196f41322edcc621
>
> That would allow a task with a shorter request time to preempt in spite
> of RUN_TO_PARITY.
>
> However, in this example V is only 2/3 of the way to C's deadline, but
> if we were to have many more tasks, you'll see V gets closer and closer
> to C's deadline and it will become harder and harder to place such that
> preemption becomes viable.
>
> Adding 4 more tasks:
>
>   ./eevdf -e "0,1024,6" -e "1,1024,2" -e "2,1024,18" -v 19 -n -e "3,1024,2" -e "4,1024,2" -e "5,1024,2" -e "6,1024,2"
>
> t=92 V=16
>  A                   |----<
>  B                    |<
> >C   |----------------<
>  D                    |<
>  E                   |<
>  F                    |<
>  G                   |<
>    |---------|-----*---|---------|---------|----
>
>
> And I worry this will create very real latency spikes.
>
> That said; I do see not having the eligibility check can help. So I'm
> not opposed to having a sched_feat for this, but I would not want to
> default to EVDF.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr
  2023-10-05 18:23           ` Youssef Esmat
@ 2023-10-06  0:36             ` Youssef Esmat
  2023-10-10  8:08             ` Peter Zijlstra
  1 sibling, 0 replies; 104+ messages in thread
From: Youssef Esmat @ 2023-10-06  0:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Daniel Jordan, mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen, joel,
	efault, tglx

On Thu, Oct 5, 2023 at 1:23 PM Youssef Esmat <youssefesmat@chromium.org> wrote:
>
> On Thu, Oct 5, 2023 at 7:06 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Oct 02, 2023 at 08:41:36PM +0200, Peter Zijlstra wrote:
> >
> > > When mixing request sizes things become a little more interesting.
> > >
> > > Let me ponder this a little bit more.
> >
> > Using the attached program (I got *REALLY* fed up trying to draw these
> > diagrams by hand), let us illustrate the difference between Earliest
> > *Eligible* Virtual Deadline First and the one with the Eligible test
> > taken out: EVDF.
> >
> > Specifically, the program was used with the following argument for
> > EEVDF:
> >
> >   ./eevdf -e "0,1024,6" -e "1,1024,2" -e "2,1024,18" -v 19
> >
> > and with an additional '-n' for the EVDF column.
> >

<snip diagrams>

> >
> >
> > As I wrote before; EVDF has worse lag bounds, but this is not
> > insurmountable. The biggest problem that I can see is that of wakeup
> > preemption. Currently we allow to preempt when 'current' has reached V
> > (RUN_TO_PARITY in pick_eevdf()).
> >
> > With these rules, when EEVDF schedules C (our large slice task) at t=10
> > above, it is only a little behind V and can be readily preempted after
> > about 2 time units.
> >
> > However, EVDF will delay scheduling C until much later; see how A and B
> > walk far ahead of V until t=36. Only then will we pick C. But this means
> > that we're firmly stuck with C for at least 11 time units. A newly
> > placed task will be around V and will have no chance to preempt.
> >
>
> Thank you for the detailed analysis! I am still in the process of
> digesting everything.
> I do have a quick question, this will only be the case if we adjust
> C's runtime without adjusting nice value, correct? So it does not
> currently apply to the submitted code where the only way to change the
> deadline is to also change the nice value and thus how fast/slow
> vruntime accumulates. In other words without the sched_runtime
> patch[1] we should not run into this scenario, correct?
>
> [1] https://lore.kernel.org/lkml/20230915124354.416936110@noisy.programming.kicks-ass.net/

Sorry, to clarify, by "this" I meant "that we're firmly stuck with C
for at least 11 time units".

>
> > That said, I do have me a patch to cure some of that:
> >
> >   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=d7edbe431f31762e516f2730196f41322edcc621
> >
> > That would allow a task with a shorter request time to preempt in spite
> > of RUN_TO_PARITY.
> >
> > However, in this example V is only 2/3 of the way to C's deadline, but
> > if we were to have many more tasks, you'll see V gets closer and closer
> > to C's deadline and it will become harder and harder to place such that
> > preemption becomes viable.
> >
> > Adding 4 more tasks:
> >
> >   ./eevdf -e "0,1024,6" -e "1,1024,2" -e "2,1024,18" -v 19 -n -e "3,1024,2" -e "4,1024,2" -e "5,1024,2" -e "6,1024,2"
> >
> > t=92 V=16
> >  A                   |----<
> >  B                    |<
> > >C   |----------------<
> >  D                    |<
> >  E                   |<
> >  F                    |<
> >  G                   |<
> >    |---------|-----*---|---------|---------|----
> >
> >
> > And I worry this will create very real latency spikes.
> >
> > That said; I do see not having the eligibility check can help. So I'm
> > not opposed to having a sched_feat for this, but I would not want to
> > default to EVDF.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH v2] sched/fair: Preserve PLACE_DEADLINE_INITIAL deadline
  2023-10-04 15:46       ` Chen Yu
@ 2023-10-06 16:31         ` Daniel Jordan
  0 siblings, 0 replies; 104+ messages in thread
From: Daniel Jordan @ 2023-10-06 16:31 UTC (permalink / raw)
  To: Chen Yu
  Cc: peterz, bristot, bsegall, chris.hyser, corbet, dietmar.eggemann,
	efault, joel, joshdon, juri.lelli, kprateek.nayak, linux-kernel,
	mgorman, mingo, patrick.bellasi, pavel, pjt, qperret, qyousef,
	rostedt, tglx, tim.c.chen, timj, vincent.guittot, youssefesmat,
	yu.chen.surf

Hi Chenyu,

On Wed, Oct 04, 2023 at 11:46:21PM +0800, Chen Yu wrote:
> Hi Daniel,
> 
> On 2023-10-04 at 09:09:08 -0400, Daniel Jordan wrote:
> > An entity is supposed to get an earlier deadline with
> > PLACE_DEADLINE_INITIAL when it's forked, but the deadline gets
> > overwritten soon after in enqueue_entity() the first time a forked
> > entity is woken so that PLACE_DEADLINE_INITIAL is effectively a no-op.
> > 
> > Placing in task_fork_fair() seems unnecessary since none of the values
> > that get set (slice, vruntime, deadline) are used before they're set
> > again at enqueue time, so get rid of that (and with it all of
> > task_fork_fair()) and just pass ENQUEUE_INITIAL to enqueue_entity() via
> > wake_up_new_task().
> > 
> > Fixes: e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement")
> > Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> > ---
> > 
> > v2
> >  - place_entity() seems like the only reason for task_fork_fair() to exist
> >    after the recent removal of sysctl_sched_child_runs_first, so take out
> >    the whole function.
> 
> At first glance I thought: if we remove task_fork_fair(), do we lose one chance to
> update the parent task's statistics in update_curr()?  We might get out-of-date
> parent task's deadline and make preemption decision based on the stale data in
> wake_up_new_task() -> wakeup_preempt() -> pick_eevdf(). But after a second thought,
> I found that wake_up_new_task() -> enqueue_entity() itself would invoke update_curr(),
> so this should not be a problem.
>
> Then I was wondering why we can't just skip place_entity() in enqueue_entity()
> if ENQUEUE_WAKEUP is not set, just like the code before e8f331bcc270? In this
> way the newly forked task's deadline will not be overwritten by wake_up_new_task()->
> enqueue_entity(). Then I realized that, after e8f331bcc270, the task's vruntime
> and deadline are all calculated by place_entity() rather than being renormalised
> to cfs_rq->min_vruntime in enqueue_entity(), so we can not simply skip place_entity()
> in enqueue_entity().

This all made me wonder if the order of update_curr() for the parent and
place_entity() for the child matters.  And it does, since placing uses
avg_vruntime(), which wants an up-to-date vruntime for current and
min_vruntime for cfs_rq.  Good that 'curr' in enqueue_entity() is false
on fork so that the parent's vruntime is up to date, but it seems
placing should always happen after update_curr().
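
For reference, and in the notation of the toy program earlier in the
thread rather than the exact kernel arithmetic: avg_vruntime() returns
V = sum(w_i * v_i) / sum(w_i) over the queued entities plus cfs_rq->curr
(the kernel keeps this relative to min_vruntime), so a stale
curr->vruntime or min_vruntime directly skews the placement target.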

> Per my understanding, this patch looks good,
> 
> Reviewed-by: Chen Yu <yu.c.chen@intel.com>

Thanks!

Daniel

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH] sched/fair: Preserve PLACE_DEADLINE_INITIAL deadline
  2023-10-05  5:56     ` [PATCH] " K Prateek Nayak
@ 2023-10-06 16:35       ` Daniel Jordan
  0 siblings, 0 replies; 104+ messages in thread
From: Daniel Jordan @ 2023-10-06 16:35 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: bristot, bsegall, chris.hyser, corbet, dietmar.eggemann, efault,
	joel, joshdon, juri.lelli, linux-kernel, mgorman, mingo,
	patrick.bellasi, pavel, pjt, qperret, qyousef, rostedt, tglx,
	tim.c.chen, timj, vincent.guittot, youssefesmat, yu.c.chen,
	peterz

Hi Prateek,

On Thu, Oct 05, 2023 at 11:26:07AM +0530, K Prateek Nayak wrote:
> Hello Daniel,
> 
> On 10/4/2023 6:47 AM, Daniel Jordan wrote:
> > An entity is supposed to get an earlier deadline with
> > PLACE_DEADLINE_INITIAL when it's forked, but the deadline gets
> > overwritten soon after in enqueue_entity() the first time a forked
> > entity is woken so that PLACE_DEADLINE_INITIAL is effectively a no-op.
> > 
> > Placing in task_fork_fair() seems unnecessary since none of the values
> > that get set (slice, vruntime, deadline) are used before they're set
> > again at enqueue time, so get rid of that and just pass ENQUEUE_INITIAL
> > to enqueue_entity() via wake_up_new_task().
> > 
> > Fixes: e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement")
> > Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> 
> I got a chance to test this on a 3rd Generation EPYC system. I don't
> see anything out of the ordinary except for a small regression on
> hackbench. I'll leave the full result below.

Thanks for testing!

> o System details
> 
> - 3rd Generation EPYC System
> - 2 sockets each with 64C/128T
> - NPS1 (Each socket is a NUMA node)
> - Boost enabled, C2 Disabled (POLL and MWAIT based C1 remained enabled)
> 
> 
> o Kernel Details
> 
> - tip:	tip:sched/core at commit d4d6596b4386 ("sched/headers: Remove
> 	duplicate header inclusions")
> 
> - place-deadline-fix: tip + this patch
> 
> 
> o Benchmark Results
> 
> ==================================================================
> Test          : hackbench
> Units         : Normalized time in seconds
> Interpretation: Lower is better
> Statistic     : AMean
> ==================================================================
> Case:           tip[pct imp](CV)    place-deadline-fix[pct imp](CV)
>  1-groups     1.00 [ -0.00]( 2.58)     1.04 [ -3.63]( 3.14)
>  2-groups     1.00 [ -0.00]( 1.87)     1.03 [ -2.98]( 1.85)
>  4-groups     1.00 [ -0.00]( 1.63)     1.02 [ -2.35]( 1.59)
>  8-groups     1.00 [ -0.00]( 1.38)     1.03 [ -2.92]( 1.20)
> 16-groups     1.00 [ -0.00]( 2.67)     1.02 [ -1.61]( 2.08)

Huh, numbers do seem a bit outside the noise.  Doesn't hackbench only
fork at the beginning?  I glanced at the perf messaging source just now, but
not sure if you use that version.  Anyway, I wouldn't expect this patch
to have much of an effect in that case.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* [PATCH] sched/fair: Always update_curr() before placing at enqueue
  2023-05-31 11:58 ` [PATCH 07/15] sched/smp: Use lag to simplify cross-runqueue placement Peter Zijlstra
                     ` (2 preceding siblings ...)
  2023-10-04  1:17   ` [PATCH] sched/fair: Preserve PLACE_DEADLINE_INITIAL deadline Daniel Jordan
@ 2023-10-06 16:48   ` Daniel Jordan
  2023-10-06 19:58     ` Peter Zijlstra
  2023-10-16  5:39     ` K Prateek Nayak
  3 siblings, 2 replies; 104+ messages in thread
From: Daniel Jordan @ 2023-10-06 16:48 UTC (permalink / raw)
  To: peterz
  Cc: bristot, bsegall, chris.hyser, corbet, dietmar.eggemann, efault,
	joel, joshdon, juri.lelli, kprateek.nayak, linux-kernel, mgorman,
	mingo, patrick.bellasi, pavel, pjt, qperret, qyousef, rostedt,
	tglx, tim.c.chen, timj, vincent.guittot, youssefesmat, yu.c.chen,
	daniel.m.jordan

Placing wants current's vruntime and the cfs_rq's min_vruntime up to
date so that avg_vruntime() is too, and similarly it wants the entity to
be re-weighted and lag adjusted so vslice and vlag are fresh, so always
do update_curr() and update_cfs_group() beforehand.

There doesn't seem to be a reason to treat the 'curr' case specially
after e8f331bcc270 since vruntime doesn't get normalized anymore.

Fixes: e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---

Not sure what the XXX above place_entity() is for, maybe it can go away?

Based on tip/sched/core.

 kernel/sched/fair.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04fbcbda97d5f..db2ca9bf9cc49 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5047,15 +5047,6 @@ static inline bool cfs_bandwidth_used(void);
 static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
-	bool curr = cfs_rq->curr == se;
-
-	/*
-	 * If we're the current task, we must renormalise before calling
-	 * update_curr().
-	 */
-	if (curr)
-		place_entity(cfs_rq, se, flags);
-
 	update_curr(cfs_rq);
 
 	/*
@@ -5080,8 +5071,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * XXX now that the entity has been re-weighted, and it's lag adjusted,
 	 * we can place the entity.
 	 */
-	if (!curr)
-		place_entity(cfs_rq, se, flags);
+	place_entity(cfs_rq, se, flags);
 
 	account_entity_enqueue(cfs_rq, se);
 
@@ -5091,7 +5081,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	check_schedstat_required();
 	update_stats_enqueue_fair(cfs_rq, se, flags);
-	if (!curr)
+	if (cfs_rq->curr != se)
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
 
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 104+ messages in thread

* Re: [PATCH] sched/fair: Always update_curr() before placing at enqueue
  2023-10-06 16:48   ` [PATCH] sched/fair: Always update_curr() before placing at enqueue Daniel Jordan
@ 2023-10-06 19:58     ` Peter Zijlstra
  2023-10-18  0:43       ` Daniel Jordan
  2023-10-16  5:39     ` K Prateek Nayak
  1 sibling, 1 reply; 104+ messages in thread
From: Peter Zijlstra @ 2023-10-06 19:58 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: bristot, bsegall, chris.hyser, corbet, dietmar.eggemann, efault,
	joel, joshdon, juri.lelli, kprateek.nayak, linux-kernel, mgorman,
	mingo, patrick.bellasi, pavel, pjt, qperret, qyousef, rostedt,
	tglx, tim.c.chen, timj, vincent.guittot, youssefesmat, yu.c.chen

On Fri, Oct 06, 2023 at 12:48:26PM -0400, Daniel Jordan wrote:
> Placing wants current's vruntime and the cfs_rq's min_vruntime up to
> date so that avg_vruntime() is too, and similarly it wants the entity to
> be re-weighted and lag-adjusted so vslice and vlag are fresh, so always
> do update_curr() and update_cfs_group() beforehand.
> 
> There doesn't seem to be a reason to treat the 'curr' case specially
> after e8f331bcc270 since vruntime doesn't get normalized anymore.
> 
> Fixes: e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement")
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> ---
> 
> Not sure what the XXX above place_entity() is for, maybe it can go away?
> 
> Based on tip/sched/core.
> 
>  kernel/sched/fair.c | 14 ++------------
>  1 file changed, 2 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 04fbcbda97d5f..db2ca9bf9cc49 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5047,15 +5047,6 @@ static inline bool cfs_bandwidth_used(void);
>  static void
>  enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  {
> -	bool curr = cfs_rq->curr == se;
> -
> -	/*
> -	 * If we're the current task, we must renormalise before calling
> -	 * update_curr().
> -	 */
> -	if (curr)
> -		place_entity(cfs_rq, se, flags);
> -
>  	update_curr(cfs_rq);

IIRC part of the reason for this order is the:

  dequeue
  update
  enqueue

pattern we have all over the place. You don't want the enqueue to move
time forward in this case.

Could be that all magically works, but please double check.
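
For reference, that pattern looks roughly like the following sketch, loosely
based on set_user_nice(); the exact flags and the attribute being changed
are illustrative only:

	queued = task_on_rq_queued(p);
	if (queued)
		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);

	/* update whatever attribute prompted the requeue */
	p->static_prio = NICE_TO_PRIO(nice);
	set_load_weight(p, true);

	if (queued)
		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);

The point being that the enqueue leg of such a sequence should not advance
time again via update_curr().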

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr
  2023-10-05 12:05         ` Peter Zijlstra
  2023-10-05 14:14           ` Peter Zijlstra
  2023-10-05 18:23           ` Youssef Esmat
@ 2023-10-07 22:04           ` Peter Zijlstra
  2023-10-09 14:41             ` Peter Zijlstra
  2023-10-10  0:51             ` Youssef Esmat
  2 siblings, 2 replies; 104+ messages in thread
From: Peter Zijlstra @ 2023-10-07 22:04 UTC (permalink / raw)
  To: Youssef Esmat
  Cc: Daniel Jordan, mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen, joel,
	efault, tglx

On Thu, Oct 05, 2023 at 02:05:57PM +0200, Peter Zijlstra wrote:

> t=10 V=4                                                t=10 V=4
>  A       |----<                                          A       |----<
>  B      |<                                              >B      |<
> >C   |----------------<                                  C   |----------------<
>    |---*-----|---------|---------|---------|----           |---*-----|---------|---------|---------|----
>                                                         

>                                                         
> t=52 V=18                                               t=36 V=13
>  A                   |----<                              A                   |----<
> >B                  |<                                   B                    |<
>  C                     |----------------<               >C   |----------------<
>    |---------|-------*-|---------|---------|----           |---------|--*------|---------|---------|----
>                                                         

>                                                         
> BAaaBCccccccccBBBAaaBBBAaaBB                            BBAaaBBBAaaBBBAaaBCccccccccB
> 
> 
> 
> As I wrote before: EVDF has worse lag bounds, but this is not
> insurmountable. The biggest problem that I can see is that of wakeup
> preemption. Currently we allow preemption when 'current' has reached V
> (RUN_TO_PARITY in pick_eevdf()).
> 
> With these rules, when EEVDF schedules C (our large slice task) at t=10
> above, it is only a little behind V and can be readily preempted after
> about 2 time units.
> 
> However, EVDF will delay scheduling C until much later, see how A and B
> walk far ahead of V until t=36. Only then will we pick C. But this means
> that we're firmly stuck with C for at least 11 time units. A newly
> placed task will be around V and will have no chance to preempt.
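
For reference, the RUN_TO_PARITY check mentioned above is the early return
at the top of pick_eevdf(), quoted elsewhere in this thread:

	/*
	 * Once selected, run a task until it either becomes non-eligible or
	 * until it gets a new slice. See the HACK in set_next_entity().
	 */
	if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline)
		return curr;

where, IIUC, vlag is re-used at pick time to stash the deadline, hence the
HACK note.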

Playing around with it a little:

EEVDF					EVDF

slice 30000000				slice 30000000
# Min Latencies: 00014                  # Min Latencies: 00048
# Avg Latencies: 00692                  # Avg Latencies: 188239
# Max Latencies: 94633                  # Max Latencies: 961241
                                        
slice 3000000                           slice 3000000
# Min Latencies: 00054                  # Min Latencies: 00055
# Avg Latencies: 00522                  # Avg Latencies: 00673
# Max Latencies: 41475                  # Max Latencies: 13297
                                        
slice 300000                            slice 300000
# Min Latencies: 00018                  # Min Latencies: 00024
# Avg Latencies: 00344                  # Avg Latencies: 00056
# Max Latencies: 20061                  # Max Latencies: 00860


So while it improves the short slices, it completely blows up the large
slices -- utterly slaughters the large slices in fact.

And all the many variants of BIAS_ELIGIBLE that I've tried so far only
manage to murder the high end while simultaneously not actually helping
the low end -- so that's a complete write off.


By far the sanest option so far is PLACE_SLEEPER -- and that is very
much not a nice option either :-(

^ permalink raw reply	[flat|nested] 104+ messages in thread

* [tip: sched/urgent] sched/eevdf: Fix pick_eevdf()
  2023-09-30  0:09   ` [PATCH] sched/fair: fix pick_eevdf to always find the correct se Benjamin Segall
  2023-10-03 10:42     ` [tip: sched/urgent] sched/fair: Fix pick_eevdf() tip-bot2 for Benjamin Segall
       [not found]     ` <CGME20231004203940eucas1p2f73b017497d1f4239a6e236fdb6019e2@eucas1p2.samsung.com>
@ 2023-10-09  7:53     ` tip-bot2 for Benjamin Segall
  2023-10-11 12:12     ` [PATCH] sched/fair: fix pick_eevdf to always find the correct se Abel Wu
  3 siblings, 0 replies; 104+ messages in thread
From: tip-bot2 for Benjamin Segall @ 2023-10-09  7:53 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Ben Segall, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     b01db23d5923a35023540edc4f0c5f019e11ac7d
Gitweb:        https://git.kernel.org/tip/b01db23d5923a35023540edc4f0c5f019e11ac7d
Author:        Benjamin Segall <bsegall@google.com>
AuthorDate:    Fri, 29 Sep 2023 17:09:30 -07:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 09 Oct 2023 09:48:33 +02:00

sched/eevdf: Fix pick_eevdf()

The old pick_eevdf() could fail to find the actual earliest eligible
deadline when it descended to the right looking for min_deadline, but
it turned out that that min_deadline wasn't actually eligible. In that
case we need to go back and search through any left branches we
skipped looking for the actual best _eligible_ min_deadline.

This is more expensive, but still O(log n), and at worst should only
involve descending two branches of the rbtree.

I've run this through a userspace stress test (thank you
tools/lib/rbtree.c), so hopefully this implementation doesn't miss any
corner cases.

Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/xm261qego72d.fsf_-_@google.com
---
 kernel/sched/fair.c | 72 +++++++++++++++++++++++++++++++++++---------
 1 file changed, 58 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a4b904a..061a30a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -872,14 +872,16 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
  *
  * Which allows an EDF like search on (sub)trees.
  */
-static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
+static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
 {
 	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
 	struct sched_entity *curr = cfs_rq->curr;
 	struct sched_entity *best = NULL;
+	struct sched_entity *best_left = NULL;
 
 	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
 		curr = NULL;
+	best = curr;
 
 	/*
 	 * Once selected, run a task until it either becomes non-eligible or
@@ -900,33 +902,75 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
 		}
 
 		/*
-		 * If this entity has an earlier deadline than the previous
-		 * best, take this one. If it also has the earliest deadline
-		 * of its subtree, we're done.
+		 * Now we heap search eligible trees for the best (min_)deadline
 		 */
-		if (!best || deadline_gt(deadline, best, se)) {
+		if (!best || deadline_gt(deadline, best, se))
 			best = se;
-			if (best->deadline == best->min_deadline)
-				break;
-		}
 
 		/*
-		 * If the earlest deadline in this subtree is in the fully
-		 * eligible left half of our space, go there.
+		 * Every se in a left branch is eligible, keep track of the
+		 * branch with the best min_deadline
 		 */
+		if (node->rb_left) {
+			struct sched_entity *left = __node_2_se(node->rb_left);
+
+			if (!best_left || deadline_gt(min_deadline, best_left, left))
+				best_left = left;
+
+			/*
+			 * min_deadline is in the left branch. rb_left and all
+			 * descendants are eligible, so immediately switch to the second
+			 * loop.
+			 */
+			if (left->min_deadline == se->min_deadline)
+				break;
+		}
+
+		/* min_deadline is at this node, no need to look right */
+		if (se->deadline == se->min_deadline)
+			break;
+
+		/* else min_deadline is in the right branch. */
+		node = node->rb_right;
+	}
+
+	/*
+	 * We ran into an eligible node which is itself the best.
+	 * (Or nr_running == 0 and both are NULL)
+	 */
+	if (!best_left || (s64)(best_left->min_deadline - best->deadline) > 0)
+		return best;
+
+	/*
+	 * Now best_left and all of its children are eligible, and we are just
+	 * looking for deadline == min_deadline
+	 */
+	node = &best_left->run_node;
+	while (node) {
+		struct sched_entity *se = __node_2_se(node);
+
+		/* min_deadline is the current node */
+		if (se->deadline == se->min_deadline)
+			return se;
+
+		/* min_deadline is in the left branch */
 		if (node->rb_left &&
 		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
 			node = node->rb_left;
 			continue;
 		}
 
+		/* else min_deadline is in the right branch */
 		node = node->rb_right;
 	}
+	return NULL;
+}
 
-	if (!best || (curr && deadline_gt(deadline, best, curr)))
-		best = curr;
+static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
+{
+	struct sched_entity *se = __pick_eevdf(cfs_rq);
 
-	if (unlikely(!best)) {
+	if (!se) {
 		struct sched_entity *left = __pick_first_entity(cfs_rq);
 		if (left) {
 			pr_err("EEVDF scheduling fail, picking leftmost\n");
@@ -934,7 +978,7 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
 		}
 	}
 
-	return best;
+	return se;
 }
 
 #ifdef CONFIG_SCHED_DEBUG

^ permalink raw reply related	[flat|nested] 104+ messages in thread

* Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr
  2023-10-07 22:04           ` Peter Zijlstra
@ 2023-10-09 14:41             ` Peter Zijlstra
  2023-10-10  0:51             ` Youssef Esmat
  1 sibling, 0 replies; 104+ messages in thread
From: Peter Zijlstra @ 2023-10-09 14:41 UTC (permalink / raw)
  To: Youssef Esmat
  Cc: Daniel Jordan, mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen, joel,
	efault, tglx

On Sun, Oct 08, 2023 at 12:04:00AM +0200, Peter Zijlstra wrote:
> On Thu, Oct 05, 2023 at 02:05:57PM +0200, Peter Zijlstra wrote:
> 
> > t=10 V=4                                                t=10 V=4
> >  A       |----<                                          A       |----<
> >  B      |<                                              >B      |<
> > >C   |----------------<                                  C   |----------------<
> >    |---*-----|---------|---------|---------|----           |---*-----|---------|---------|---------|----
> >                                                         
> 
> >                                                         
> > t=52 V=18                                               t=36 V=13
> >  A                   |----<                              A                   |----<
> > >B                  |<                                   B                    |<
> >  C                     |----------------<               >C   |----------------<
> >    |---------|-------*-|---------|---------|----           |---------|--*------|---------|---------|----
> >                                                         
> 
> >                                                         
> > BAaaBCccccccccBBBAaaBBBAaaBB                            BBAaaBBBAaaBBBAaaBCccccccccB
> > 
> > 
> > 
> > As I wrote before: EVDF has worse lag bounds, but this is not
> > insurmountable. The biggest problem that I can see is that of wakeup
> > preemption. Currently we allow preemption when 'current' has reached V
> > (RUN_TO_PARITY in pick_eevdf()).
> > 
> > With these rules, when EEVDF schedules C (our large slice task) at t=10
> > above, it is only a little behind V and can be readily preempted after
> > about 2 time units.
> > 
> > However, EVDF will delay scheduling C until much later, see how A and B
> > walk far ahead of V until t=36. Only then will we pick C. But this means
> > that we're firmly stuck with C for at least 11 time units. A newly
> > placed task will be around V and will have no chance to preempt.
> 
> Playing around with it a little:
> 
> EEVDF					EVDF
> 
> slice 30000000				slice 30000000
> # Min Latencies: 00014                  # Min Latencies: 00048
> # Avg Latencies: 00692                  # Avg Latencies: 188239
> # Max Latencies: 94633                  # Max Latencies: 961241
>                                         
> slice 3000000                           slice 3000000
> # Min Latencies: 00054                  # Min Latencies: 00055
> # Avg Latencies: 00522                  # Avg Latencies: 00673
> # Max Latencies: 41475                  # Max Latencies: 13297
>                                         
> slice 300000                            slice 300000
> # Min Latencies: 00018                  # Min Latencies: 00024
> # Avg Latencies: 00344                  # Avg Latencies: 00056
> # Max Latencies: 20061                  # Max Latencies: 00860
> 
> 
> So while it improves the short slices, it completely blows up the large
> slices -- utterly slaughters the large slices in fact.
> 
> And all the many variants of BIAS_ELIGIBLE that I've tried so far only
> manage to murder the high end while simultaneously not actually helping
> the low end -- so that's a complete write off.
> 
> 
> By far the sanest option so far is PLACE_SLEEPER -- and that is very
> much not a nice option either :-(

And this can be easily explained by the fact that we insert tasks around
0-lag, so if we delay execution past this point we create an effective
DoS window.
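
For context, the placement code puts a woken entity's vruntime at the
weighted average minus its preserved lag; roughly the following, leaving
out the (W + w_i)/W lag scaling from the lag-based placement patch:

	u64 vruntime = avg_vruntime(cfs_rq);	/* V, the 0-lag point */
	s64 lag = 0;

	if (sched_feat(PLACE_LAG) && cfs_rq->nr_running)
		lag = se->vlag;			/* lag preserved at dequeue */

	se->vruntime = vruntime - lag;

So newly placed tasks land near V, which is exactly the point execution
must not be delayed far past.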

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr
  2023-10-07 22:04           ` Peter Zijlstra
  2023-10-09 14:41             ` Peter Zijlstra
@ 2023-10-10  0:51             ` Youssef Esmat
  2023-10-10  8:01               ` Peter Zijlstra
  2023-10-16 16:50               ` Peter Zijlstra
  1 sibling, 2 replies; 104+ messages in thread
From: Youssef Esmat @ 2023-10-10  0:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Daniel Jordan, mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen, joel,
	efault, tglx

On Sat, Oct 7, 2023 at 5:04 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Oct 05, 2023 at 02:05:57PM +0200, Peter Zijlstra wrote:
>
> > t=10 V=4                                                t=10 V=4
> >  A       |----<                                          A       |----<
> >  B      |<                                              >B      |<
> > >C   |----------------<                                  C   |----------------<
> >    |---*-----|---------|---------|---------|----           |---*-----|---------|---------|---------|----
> >
>
> >
> > t=52 V=18                                               t=36 V=13
> >  A                   |----<                              A                   |----<
> > >B                  |<                                   B                    |<
> >  C                     |----------------<               >C   |----------------<
> >    |---------|-------*-|---------|---------|----           |---------|--*------|---------|---------|----
> >
>
> >
> > BAaaBCccccccccBBBAaaBBBAaaBB                            BBAaaBBBAaaBBBAaaBCccccccccB
> >
> >
> >
> > As I wrote before: EVDF has worse lag bounds, but this is not
> > insurmountable. The biggest problem that I can see is that of wakeup
> > preemption. Currently we allow preemption when 'current' has reached V
> > (RUN_TO_PARITY in pick_eevdf()).
> >
> > With these rules, when EEVDF schedules C (our large slice task) at t=10
> > above, it is only a little behind V and can be readily preempted after
> > about 2 time units.
> >
> > However, EVDF will delay scheduling C until much later, see how A and B
> > walk far ahead of V until t=36. Only then will we pick C. But this means
> > that we're firmly stuck with C for at least 11 time units. A newly
> > placed task will be around V and will have no chance to preempt.
>
> Playing around with it a little:
>
> EEVDF                                   EVDF
>
> slice 30000000                          slice 30000000
> # Min Latencies: 00014                  # Min Latencies: 00048
> # Avg Latencies: 00692                  # Avg Latencies: 188239
> # Max Latencies: 94633                  # Max Latencies: 961241
>
> slice 3000000                           slice 3000000
> # Min Latencies: 00054                  # Min Latencies: 00055
> # Avg Latencies: 00522                  # Avg Latencies: 00673
> # Max Latencies: 41475                  # Max Latencies: 13297
>
> slice 300000                            slice 300000
> # Min Latencies: 00018                  # Min Latencies: 00024
> # Avg Latencies: 00344                  # Avg Latencies: 00056
> # Max Latencies: 20061                  # Max Latencies: 00860
>

Thanks for sharing. Which workload was used to generate these numbers?

I think looking at the sched latency numbers alone does not show the
complete picture. I ran the same input latency test again and tried to
capture some of these numbers for the chrome processes.

EEVDF 1.5ms slice:

Input latency test result: 226ms
perf sched latency:
switches: 1,084,694
avg:   1.139 ms
max: 408.397 ms

EEVDF 6.0ms slice:

Input latency test result: 178ms
perf sched latency:
switches: 892,306
avg:   1.145 ms
max: 354.344 ms

EVDF 6.0ms slice:

Input latency test result: 112ms
perf sched latency:
switches: 134,200
avg:   2.610 ms
max: 374.888 ms

EVDF 6.0ms slice
(no run_to_parity, no place_lag, no place_deadline_initial):

Input latency test result: 110ms
perf sched latency:
switches: 531,656
avg:   0.830 ms
max: 520.463 ms

For our scenario, it is very expensive to interrupt UI threads. It
will increase the input latency significantly. Lowering the scheduling
latency at the cost of switching out important threads can be very
detrimental in this workload. UI and input threads run with a nice
value of -8.

This also seems to match Daniel's message earlier in this thread, where
using a 12ms base slice improved their benchmarks.

That said, this might not be beneficial for all workloads, and we are
still trying our other workloads out.

>
> So while it improves the short slices, it completely blows up the large
> slices -- utterly slaughters the large slices in fact.
>
> And all the many variants of BIAS_ELIGIBLE that I've tried so far only
> manage to murder the high end while simultaneously not actually helping
> the low end -- so that's a complete write off.
>
>
> By far the sanest option so far is PLACE_SLEEPER -- and that is very
> much not a nice option either :-(

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr
  2023-10-10  0:51             ` Youssef Esmat
@ 2023-10-10  8:01               ` Peter Zijlstra
  2023-10-16 16:50               ` Peter Zijlstra
  1 sibling, 0 replies; 104+ messages in thread
From: Peter Zijlstra @ 2023-10-10  8:01 UTC (permalink / raw)
  To: Youssef Esmat
  Cc: Daniel Jordan, mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen, joel,
	efault, tglx

On Mon, Oct 09, 2023 at 07:51:03PM -0500, Youssef Esmat wrote:

> > Playing around with it a little:
> >
> > EEVDF                                   EVDF
> >
> > slice 30000000                          slice 30000000
> > # Min Latencies: 00014                  # Min Latencies: 00048
> > # Avg Latencies: 00692                  # Avg Latencies: 188239
> > # Max Latencies: 94633                  # Max Latencies: 961241
> >
> > slice 3000000                           slice 3000000
> > # Min Latencies: 00054                  # Min Latencies: 00055
> > # Avg Latencies: 00522                  # Avg Latencies: 00673
> > # Max Latencies: 41475                  # Max Latencies: 13297
> >
> > slice 300000                            slice 300000
> > # Min Latencies: 00018                  # Min Latencies: 00024
> > # Avg Latencies: 00344                  # Avg Latencies: 00056
> > # Max Latencies: 20061                  # Max Latencies: 00860
> >
> 
> Thanks for sharing. Which workload was used to generate these numbers?

This is hackbench vs cyclictest, where cyclictest gets a custom slice set:
the big slice is 10 * normal, the middle slice is normal (equal to
hackbench's), and the short slice is normal / 10.
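
FWIW the custom slice goes in through sched_attr::sched_runtime, as in the
last patch of the series; a minimal userspace sketch, assuming that patch
is applied and glossing over any extra flag requirements it may have:

	#include <sched.h>		/* SCHED_OTHER */
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/sched/types.h>	/* struct sched_attr */

	static int set_slice(pid_t pid, unsigned long long slice_ns)
	{
		struct sched_attr attr = {
			.size		= sizeof(attr),
			.sched_policy	= SCHED_OTHER,
			.sched_runtime	= slice_ns,	/* e.g. 300000 for the 0.3ms case */
		};

		return syscall(SYS_sched_setattr, pid, &attr, 0);
	}

The hacked-up cyclictest presumably does something equivalent per
measurement thread.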

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr
  2023-10-05 18:23           ` Youssef Esmat
  2023-10-06  0:36             ` Youssef Esmat
@ 2023-10-10  8:08             ` Peter Zijlstra
  1 sibling, 0 replies; 104+ messages in thread
From: Peter Zijlstra @ 2023-10-10  8:08 UTC (permalink / raw)
  To: Youssef Esmat
  Cc: Daniel Jordan, mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen, joel,
	efault, tglx

On Thu, Oct 05, 2023 at 01:23:14PM -0500, Youssef Esmat wrote:
> On Thu, Oct 5, 2023 at 7:06 AM Peter Zijlstra <peterz@infradead.org> wrote:

> > BAaaBCccccccccBBBAaaBBBAaaBB                            BBAaaBBBAaaBBBAaaBCccccccccB
> >
> >
> >
> > As I wrote before: EVDF has worse lag bounds, but this is not
> > insurmountable. The biggest problem that I can see is that of wakeup
> > preemption. Currently we allow preemption when 'current' has reached V
> > (RUN_TO_PARITY in pick_eevdf()).
> >
> > With these rules, when EEVDF schedules C (our large slice task) at t=10
> > above, it is only a little behind V and can be readily preempted after
> > about 2 time units.
> >
> > However, EVDF will delay scheduling C until much later, see how A and B
> > walk far ahead of V until t=36. Only then will we pick C. But this means
> > that we're firmly stuck with C for at least 11 time units. A newly
> > placed task will be around V and will have no chance to preempt.
> >
> 
> Thank you for the detailed analysis! I am still in the process of
> digesting everything.
> I do have a quick question, this will only be the case if we adjust
> C's runtime without adjusting nice value, correct? So it does not
> currently apply to the submitted code where the only way to change the
> deadline is to also change the nice value and thus how fast/slow
> vruntime accumulates. In other words without the sched_runtime
> patch[1] we should not run into this scenario, correct?
> 
> [1] https://lore.kernel.org/lkml/20230915124354.416936110@noisy.programming.kicks-ass.net/


Much harder to run into it, but you can still hit it using nice.

 d_i = v_i + r_i / w_i

So for very light tasks, the effective v_deadline goes out lots.
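
As a rough worked example with the standard nice-to-weight table: nice 0
has weight 1024 and nice 19 has weight 15, so for the same request r the
r_i / w_i term is roughly 68x larger for the nice-19 task, pushing its
virtual deadline far out to the right.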

And as I wrote yesterday, by delaying scheduling (far) past 0-lag you
effectively open up a denial of service window, since new tasks are
placed around 0-lag.

Now, nobody will likely care much if very light tasks get horrific
treatment, but still, it's there.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Re: [PATCH 05/15] sched/fair: Implement an EEVDF like policy
  2023-09-29 21:40   ` [PATCH 05/15] sched/fair: Implement an EEVDF like policy Benjamin Segall
  2023-10-02 17:39     ` Peter Zijlstra
@ 2023-10-11  4:14     ` Abel Wu
  2023-10-11  7:33       ` Peter Zijlstra
  1 sibling, 1 reply; 104+ messages in thread
From: Abel Wu @ 2023-10-11  4:14 UTC (permalink / raw)
  To: Benjamin Segall, Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, mgorman, bristot, corbet, qyousef,
	chris.hyser, patrick.bellasi, pjt, pavel, qperret, tim.c.chen,
	joshdon, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel,
	efault, tglx

On 9/30/23 5:40 AM, Benjamin Segall wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> 
>> +
>> +/*
>> + * Earliest Eligible Virtual Deadline First
>> + *
>> + * In order to provide latency guarantees for different request sizes
>> + * EEVDF selects the best runnable task from two criteria:
>> + *
>> + *  1) the task must be eligible (must be owed service)
>> + *
>> + *  2) from those tasks that meet 1), we select the one
>> + *     with the earliest virtual deadline.
>> + *
>> + * We can do this in O(log n) time due to an augmented RB-tree. The
>> + * tree keeps the entries sorted on service, but also functions as a
>> + * heap based on the deadline by keeping:
>> + *
>> + *  se->min_deadline = min(se->deadline, se->{left,right}->min_deadline)
>> + *
>> + * Which allows an EDF like search on (sub)trees.
>> + */
>> +static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
>> +{
>> +	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
>> +	struct sched_entity *curr = cfs_rq->curr;
>> +	struct sched_entity *best = NULL;
>> +
>> +	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
>> +		curr = NULL;
>> +
>> +	while (node) {
>> +		struct sched_entity *se = __node_2_se(node);
>> +
>> +		/*
>> +		 * If this entity is not eligible, try the left subtree.
>> +		 */
>> +		if (!entity_eligible(cfs_rq, se)) {
>> +			node = node->rb_left;
>> +			continue;
>> +		}
>> +
>> +		/*
>> +		 * If this entity has an earlier deadline than the previous
>> +		 * best, take this one. If it also has the earliest deadline
>> +		 * of its subtree, we're done.
>> +		 */
>> +		if (!best || deadline_gt(deadline, best, se)) {
>> +			best = se;
>> +			if (best->deadline == best->min_deadline)
>> +				break;
>> +		}
>> +
>> +		/*
>> +		 * If the earlest deadline in this subtree is in the fully
>> +		 * eligible left half of our space, go there.
>> +		 */
>> +		if (node->rb_left &&
>> +		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
>> +			node = node->rb_left;
>> +			continue;
>> +		}
>> +
>> +		node = node->rb_right;
>> +	}
> 
> I believe that this can fail to actually find the earliest eligible
> deadline, because the earliest deadline (min_deadline) can be in the
> right branch, but that se isn't eligible, and the actual target se is in
> the left branch. A trivial 3-se example with the nodes represented by
> (vruntime, deadline, min_deadline):
> 
>     (5,9,7)
>   /        \
> (4,8,8)  (6,7,7)
> 
> AIUI, here the EEVDF pick should be (4,8,8), but pick_eevdf() will
> instead pick (5,9,7), because it goes into the right branch and then
> fails eligibility.
> 
> I'm not sure how much of a problem this is in practice, either in
> frequency or severity, but it probably should be mentioned if it's
> an intentional tradeoff.

Assume entity i satisfies (d_i == min_deadline) && (v_i > V); then there
must be an eligible entity j with (d_j >= d_i) && (v_j < V). Given how
the deadline is calculated (d = v + vslice), d_j >= d_i together with
v_j < v_i implies:

	vslice_i < vslice_j

IOW a more batch-like entity with a looser deadline will beat a more
interactive-like entity with a tighter deadline, only because the former
is eligible while the latter isn't.

With Benjamin's fix, the semantics of 'Earliest Eligible' is preserved.
But since all this is about latency rather than fairness, I wonder if
there are cases worth breaking the 'eligible' rule for.
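
For reference, a conceptual sketch of the 'eligible' test (the helper name
below is made up; IIUC the in-tree entity_eligible() compares against the
weighted sums instead, to avoid the division, but the condition is the
same):

	static bool entity_is_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
	{
		/* eligible <=> lag_i = w_i * (V - v_i) >= 0 <=> v_i <= V */
		return (s64)(avg_vruntime(cfs_rq) - se->vruntime) >= 0;
	}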

Thanks & Best,
	Abel

> 
> 
> 
> Thinking out loud, I think that it would be sufficient to recheck via something like
> 
> for_each_sched_entity(best) {
> 	check __node_2_se(best->rb_left)->min_deadline, store in actual_best
> }
> 
> for the best min_deadline, and then go do a heap lookup in actual_best
> to find the se matching that min_deadline.
> 
> I think this pass could then be combined with our initial descent for
> better cache behavior by keeping track of the best rb_left->min_deadline
> each time we take a right branch. We still have to look at up to ~2x the
> nodes, but I don't think that's avoidable? I'll expand my quick hack I
> used to test my simple case into a something of a stress tester and try
> some implementations.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 01/15] sched/fair: Add avg_vruntime
  2023-05-31 11:58 ` [PATCH 01/15] sched/fair: Add avg_vruntime Peter Zijlstra
  2023-06-02 13:51   ` Vincent Guittot
  2023-08-10  7:10   ` [tip: sched/core] sched/fair: Add cfs_rq::avg_vruntime tip-bot2 for Peter Zijlstra
@ 2023-10-11  4:15   ` Abel Wu
  2023-10-11  7:30     ` Peter Zijlstra
  2 siblings, 1 reply; 104+ messages in thread
From: Abel Wu @ 2023-10-11  4:15 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, vincent.guittot
  Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, corbet, qyousef, chris.hyser, patrick.bellasi,
	pjt, pavel, qperret, tim.c.chen, joshdon, timj, kprateek.nayak,
	yu.c.chen, youssefesmat, joel, efault, tglx

On 5/31/23 7:58 PM, Peter Zijlstra wrote:
> +/*
> + * Compute virtual time from the per-task service numbers:
> + *
> + * Fair schedulers conserve lag:
> + *
> + *   \Sum lag_i = 0
> + *
> + * Where lag_i is given by:
> + *
> + *   lag_i = S - s_i = w_i * (V - v_i)

Since the ideal service time S is task-specific, should this be:

	lag_i = S_i - s_i = w_i * (V - v_i)

> + *
> + * Where S is the ideal service time and V is it's virtual time counterpart.
> + * Therefore:
> + *
> + *   \Sum lag_i = 0
> + *   \Sum w_i * (V - v_i) = 0
> + *   \Sum w_i * V - w_i * v_i = 0
> + *
> + * From which we can solve an expression for V in v_i (which we have in
> + * se->vruntime):
> + *
> + *       \Sum v_i * w_i   \Sum v_i * w_i
> + *   V = -------------- = --------------
> + *          \Sum w_i            W
> + *
> + * Specifically, this is the weighted average of all entity virtual runtimes.
> + *
> + * [[ NOTE: this is only equal to the ideal scheduler under the condition
> + *          that join/leave operations happen at lag_i = 0, otherwise the
> + *          virtual time has non-continguous motion equivalent to:
> + *
> + *	      V +-= lag_i / W
> + *
> + *	    Also see the comment in place_entity() that deals with this. ]]
> + *
> + * However, since v_i is u64, and the multiplcation could easily overflow
> + * transform it into a relative form that uses smaller quantities:
> + *
> + * Substitute: v_i == (v_i - v0) + v0
> + *
> + *     \Sum ((v_i - v0) + v0) * w_i   \Sum (v_i - v0) * w_i
> + * V = ---------------------------- = --------------------- + v0
> + *                  W                            W
> + *
> + * Which we track using:
> + *
> + *                    v0 := cfs_rq->min_vruntime
> + * \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime

IMHO 'sum_runtime' would be more appropriate, since what is stored is
actually a weighted sum (in real-time-like units) rather than an average
virtual time. And likewise 'sum_load' instead of 'avg_load'.
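
For reference, the add-side bookkeeping behind those two fields is roughly:

	static void avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
	{
		unsigned long weight = scale_load_down(se->load.weight);
		s64 key = entity_key(cfs_rq, se);	/* v_i - v0 */

		cfs_rq->avg_vruntime += key * weight;
		cfs_rq->avg_load += weight;
	}

i.e. both fields accumulate sums; the 'avg' only materializes once
avg_vruntime() divides by avg_load.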

Thanks & Best,
	Abel

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 01/15] sched/fair: Add avg_vruntime
  2023-10-11  4:15   ` [PATCH 01/15] sched/fair: Add avg_vruntime Abel Wu
@ 2023-10-11  7:30     ` Peter Zijlstra
  2023-10-11  8:30       ` Abel Wu
  2023-10-11 13:08       ` Peter Zijlstra
  0 siblings, 2 replies; 104+ messages in thread
From: Peter Zijlstra @ 2023-10-11  7:30 UTC (permalink / raw)
  To: Abel Wu
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel, efault, tglx

On Wed, Oct 11, 2023 at 12:15:28PM +0800, Abel Wu wrote:
> On 5/31/23 7:58 PM, Peter Zijlstra wrote:
> > +/*
> > + * Compute virtual time from the per-task service numbers:
> > + *
> > + * Fair schedulers conserve lag:
> > + *
> > + *   \Sum lag_i = 0
> > + *
> > + * Where lag_i is given by:
> > + *
> > + *   lag_i = S - s_i = w_i * (V - v_i)
> 
> Since the ideal service time S is task-specific, should this be:
> 
> 	lag_i = S_i - s_i = w_i * (V - v_i)

It is not, S is the same for all tasks. Remember, the base form is a
differential equation and all tasks progress at the same time at dt/w_i
while S progresses at dt/W.

Infinitesimals are awesome, just not feasible in a discrete system like
a time-sharing computer.

> > + *
> > + * Where S is the ideal service time and V is it's virtual time counterpart.
> > + * Therefore:
> > + *
> > + *   \Sum lag_i = 0
> > + *   \Sum w_i * (V - v_i) = 0
> > + *   \Sum w_i * V - w_i * v_i = 0
> > + *
> > + * From which we can solve an expression for V in v_i (which we have in
> > + * se->vruntime):
> > + *
> > + *       \Sum v_i * w_i   \Sum v_i * w_i
> > + *   V = -------------- = --------------
> > + *          \Sum w_i            W
> > + *
> > + * Specifically, this is the weighted average of all entity virtual runtimes.
> > + *
> > + * [[ NOTE: this is only equal to the ideal scheduler under the condition
> > + *          that join/leave operations happen at lag_i = 0, otherwise the
> > + *          virtual time has non-continguous motion equivalent to:
> > + *
> > + *	      V +-= lag_i / W
> > + *
> > + *	    Also see the comment in place_entity() that deals with this. ]]
> > + *
> > + * However, since v_i is u64, and the multiplcation could easily overflow
> > + * transform it into a relative form that uses smaller quantities:
> > + *
> > + * Substitute: v_i == (v_i - v0) + v0
> > + *
> > + *     \Sum ((v_i - v0) + v0) * w_i   \Sum (v_i - v0) * w_i
> > + * V = ---------------------------- = --------------------- + v0
> > + *                  W                            W
> > + *
> > + * Which we track using:
> > + *
> > + *                    v0 := cfs_rq->min_vruntime
> > + * \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime
> 
> IMHO 'sum_runtime' would be more appropriate? Since it actually is
> the summed real time rather than virtual time. And also 'sum_load'
> instead of 'avg_load'.

Given we subtract v0 (min_vruntime) and play games with fixed-point
math, I don't think it makes sense to change this name. The purpose is
to compute the weighted average of things, so let's keep the current name.


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Re: [PATCH 05/15] sched/fair: Implement an EEVDF like policy
  2023-10-11  4:14     ` Abel Wu
@ 2023-10-11  7:33       ` Peter Zijlstra
  2023-10-11 11:49         ` Abel Wu
  0 siblings, 1 reply; 104+ messages in thread
From: Peter Zijlstra @ 2023-10-11  7:33 UTC (permalink / raw)
  To: Abel Wu
  Cc: Benjamin Segall, mingo, vincent.guittot, linux-kernel,
	juri.lelli, dietmar.eggemann, rostedt, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel, efault, tglx

On Wed, Oct 11, 2023 at 12:14:30PM +0800, Abel Wu wrote:

> With Benjamin's fix, the semantics of 'Earliest Eligible' preserved.

Yeah, my bad.

> But since all this is about latency rather than fairness, I wonder if

It is about both, fairness is absolutely important.

> there are cases worthy of breaking the 'eligible' rule.

See the discussion with Youssef, if we weaken the eligible rule you get
horrific interference because you end up placing new tasks around the
0-lag point.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Re: [PATCH 01/15] sched/fair: Add avg_vruntime
  2023-10-11  7:30     ` Peter Zijlstra
@ 2023-10-11  8:30       ` Abel Wu
  2023-10-11  9:45         ` Peter Zijlstra
  2023-10-11 13:08       ` Peter Zijlstra
  1 sibling, 1 reply; 104+ messages in thread
From: Abel Wu @ 2023-10-11  8:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel, efault, tglx

On 10/11/23 3:30 PM, Peter Zijlstra Wrote:
> On Wed, Oct 11, 2023 at 12:15:28PM +0800, Abel Wu wrote:
>> On 5/31/23 7:58 PM, Peter Zijlstra wrote:
>>> +/*
>>> + * Compute virtual time from the per-task service numbers:
>>> + *
>>> + * Fair schedulers conserve lag:
>>> + *
>>> + *   \Sum lag_i = 0
>>> + *
>>> + * Where lag_i is given by:
>>> + *
>>> + *   lag_i = S - s_i = w_i * (V - v_i)
>>
>> Since the ideal service time S is task-specific, should this be:
>>
>> 	lag_i = S_i - s_i = w_i * (V - v_i)
> 
> It is not, S is the same for all tasks. Remember, the base form is a
> differential equation and all tasks progress at the same time at dt/w_i
> while S progresses at dt/W.

IIUC it's V that progresses at dt/W and is the same for all tasks, not S,
which is measured in real time (V*w_i).

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Re: [PATCH 01/15] sched/fair: Add avg_vruntime
  2023-10-11  8:30       ` Abel Wu
@ 2023-10-11  9:45         ` Peter Zijlstra
  2023-10-11 10:05           ` Peter Zijlstra
  0 siblings, 1 reply; 104+ messages in thread
From: Peter Zijlstra @ 2023-10-11  9:45 UTC (permalink / raw)
  To: Abel Wu
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel, efault, tglx

On Wed, Oct 11, 2023 at 04:30:26PM +0800, Abel Wu wrote:
> On 10/11/23 3:30 PM, Peter Zijlstra Wrote:
> > On Wed, Oct 11, 2023 at 12:15:28PM +0800, Abel Wu wrote:
> > > On 5/31/23 7:58 PM, Peter Zijlstra wrote:
> > > > +/*
> > > > + * Compute virtual time from the per-task service numbers:
> > > > + *
> > > > + * Fair schedulers conserve lag:
> > > > + *
> > > > + *   \Sum lag_i = 0
> > > > + *
> > > > + * Where lag_i is given by:
> > > > + *
> > > > + *   lag_i = S - s_i = w_i * (V - v_i)
> > > 
> > > Since the ideal service time S is task-specific, should this be:
> > > 
> > > 	lag_i = S_i - s_i = w_i * (V - v_i)
> > 
> > It is not, S is the same for all tasks. Remember, the base form is a
> > differential equation and all tasks progress at the same time at dt/w_i
> > while S progresses at dt/W.
> 
> IIUC it's V that progresses at dt/W and is the same for all tasks, not S,
> which is measured in real time (V*w_i).

Clearly I should wake up before replying ;-)

  V = S/W, so dV = dt/W and dS = dt

Anyway, the point is that both V and S are the same across all tasks,
all tasks execute in parallel with infinitely small time increments.

In reality this can't work ofc, so we get the approximations v_i and s_i
and lag is the deviation from the ideal.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Re: [PATCH 01/15] sched/fair: Add avg_vruntime
  2023-10-11  9:45         ` Peter Zijlstra
@ 2023-10-11 10:05           ` Peter Zijlstra
  0 siblings, 0 replies; 104+ messages in thread
From: Peter Zijlstra @ 2023-10-11 10:05 UTC (permalink / raw)
  To: Abel Wu
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel, efault, tglx

On Wed, Oct 11, 2023 at 11:45:39AM +0200, Peter Zijlstra wrote:
> On Wed, Oct 11, 2023 at 04:30:26PM +0800, Abel Wu wrote:
> > On 10/11/23 3:30 PM, Peter Zijlstra Wrote:
> > > On Wed, Oct 11, 2023 at 12:15:28PM +0800, Abel Wu wrote:
> > > > On 5/31/23 7:58 PM, Peter Zijlstra wrote:
> > > > > +/*
> > > > > + * Compute virtual time from the per-task service numbers:
> > > > > + *
> > > > > + * Fair schedulers conserve lag:
> > > > > + *
> > > > > + *   \Sum lag_i = 0
> > > > > + *
> > > > > + * Where lag_i is given by:
> > > > > + *
> > > > > + *   lag_i = S - s_i = w_i * (V - v_i)
> > > > 
> > > > Since the ideal service time S is task-specific, should this be:
> > > > 
> > > > 	lag_i = S_i - s_i = w_i * (V - v_i)
> > > 
> > > It is not, S is the same for all tasks. Remember, the base form is a
> > > differential equation and all tasks progress at the same time at dt/w_i
> > > while S progresses at dt/W.
> > 
> > IIUC it's V that progresses at dt/W and is the same for all tasks, not S,
> > which is measured in real time (V*w_i).
> 
> Clearly I should wake up before replying ;-)
> 
>   V = S/W, so dV = dt/W and dS = dt
> 
> Anyway, the point is that both V and S are the same across all tasks,
> all tasks execute in parallel with infinitely small time increments.
> 
> In reality this can't work ofc, so we get the approximations v_i and s_i
> and lag is the deviation from the ideal.

Ah, I think I see. I'm making a mess of things, aren't I?

I've got to run some errands, but I'll try and reply more coherently
after.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Re: [PATCH 05/15] sched/fair: Implement an EEVDF like policy
  2023-10-11  7:33       ` Peter Zijlstra
@ 2023-10-11 11:49         ` Abel Wu
  0 siblings, 0 replies; 104+ messages in thread
From: Abel Wu @ 2023-10-11 11:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Benjamin Segall, mingo, vincent.guittot, linux-kernel,
	juri.lelli, dietmar.eggemann, rostedt, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel, efault, tglx

On 10/11/23 3:33 PM, Peter Zijlstra Wrote:
> On Wed, Oct 11, 2023 at 12:14:30PM +0800, Abel Wu wrote:
> 
>> there are cases worthy of breaking the 'eligible' rule.
> 
> See the discussion with Youssef, if we weaken the eligible rule you get
> horrific interference because you end up placing new tasks around the
> 0-lag point.

I have just begun studying the EEVDF scheduler, and obviously there
are lots of things to catch up with :)

At a quick glance at Youssef's first reply, I'm fairly sure that's exactly
what I had in mind: EVDF. The intention behind it is that, without the
eligibility check, the tasks_timeline can be organized by deadline rather
than vruntime, hence task selection can be done in O(1) while
min_vruntime can still be updated through the augmented rbtree (roughly
as sketched below).
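
A minimal sketch of that idea (hypothetical, not code from this series):
if the tree were keyed by deadline instead of vruntime, the pick
degenerates to the cached leftmost node, and min_vruntime can still be
maintained as an augmented per-node minimum on the side:

	static struct sched_entity *pick_evdf(struct cfs_rq *cfs_rq)
	{
		/* leftmost == earliest deadline when the key is the deadline */
		struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);

		return left ? __node_2_se(left) : NULL;
	}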

Anyway, I will learn from your discussion with Youssef first, thanks
for providing the info!

Best,
	Abel

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 03/15] sched/fair: Add lag based placement
  2023-05-31 11:58 ` [PATCH 03/15] sched/fair: Add lag based placement Peter Zijlstra
  2023-08-10  7:10   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
@ 2023-10-11 12:00   ` Abel Wu
  2023-10-11 13:24     ` Peter Zijlstra
  2023-10-12 19:15   ` Benjamin Segall
  2 siblings, 1 reply; 104+ messages in thread
From: Abel Wu @ 2023-10-11 12:00 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, vincent.guittot
  Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, corbet, qyousef, chris.hyser, patrick.bellasi,
	pjt, pavel, qperret, tim.c.chen, joshdon, timj, kprateek.nayak,
	yu.c.chen, youssefesmat, joel, efault, tglx

On 5/31/23 7:58 PM, Peter Zijlstra Wrote:
>   		/*
> -		 * Halve their sleep time's effect, to allow
> -		 * for a gentler effect of sleepers:
> +		 * If we want to place a task and preserve lag, we have to
> +		 * consider the effect of the new entity on the weighted
> +		 * average and compensate for this, otherwise lag can quickly
> +		 * evaporate.
> +		 *
> +		 * Lag is defined as:
> +		 *
> +		 *   lag_i = S - s_i = w_i * (V - v_i)
> +		 *
> +		 * To avoid the 'w_i' term all over the place, we only track
> +		 * the virtual lag:
> +		 *
> +		 *   vl_i = V - v_i <=> v_i = V - vl_i
> +		 *
> +		 * And we take V to be the weighted average of all v:
> +		 *
> +		 *   V = (\Sum w_j*v_j) / W
> +		 *
> +		 * Where W is: \Sum w_j
> +		 *
> +		 * Then, the weighted average after adding an entity with lag
> +		 * vl_i is given by:
> +		 *
> +		 *   V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
> +		 *      = (W*V + w_i*(V - vl_i)) / (W + w_i)
> +		 *      = (W*V + w_i*V - w_i*vl_i) / (W + w_i)
> +		 *      = (V*(W + w_i) - w_i*l) / (W + w_i)
> +		 *      = V - w_i*vl_i / (W + w_i)
> +		 *
> +		 * And the actual lag after adding an entity with vl_i is:
> +		 *
> +		 *   vl'_i = V' - v_i
> +		 *         = V - w_i*vl_i / (W + w_i) - (V - vl_i)
> +		 *         = vl_i - w_i*vl_i / (W + w_i)
> +		 *
> +		 * Which is strictly less than vl_i. So in order to preserve lag

Maybe a stupid question, but why vl'_i < vl_i? Since vl_i can be negative.

> +		 * we should inflate the lag before placement such that the
> +		 * effective lag after placement comes out right.
> +		 *
> +		 * As such, invert the above relation for vl'_i to get the vl_i
> +		 * we need to use such that the lag after placement is the lag
> +		 * we computed before dequeue.
> +		 *
> +		 *   vl'_i = vl_i - w_i*vl_i / (W + w_i)
> +		 *         = ((W + w_i)*vl_i - w_i*vl_i) / (W + w_i)
> +		 *
> +		 *   (W + w_i)*vl'_i = (W + w_i)*vl_i - w_i*vl_i
> +		 *                   = W*vl_i
> +		 *
> +		 *   vl_i = (W + w_i)*vl'_i / W
>   		 */
> -		if (sched_feat(GENTLE_FAIR_SLEEPERS))
> -			thresh >>= 1;
> +		load = cfs_rq->avg_load;
> +		if (curr && curr->on_rq)
> +			load += curr->load.weight;
>   
> -		vruntime -= thresh;
> +		lag *= load + se->load.weight;
> +		if (WARN_ON_ONCE(!load))
> +			load = 1;
> +		lag = div_s64(lag, load);
> +
> +		vruntime -= lag;
>   	}

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH] sched/fair: fix pick_eevdf to always find the correct se
  2023-09-30  0:09   ` [PATCH] sched/fair: fix pick_eevdf to always find the correct se Benjamin Segall
                       ` (2 preceding siblings ...)
  2023-10-09  7:53     ` [tip: sched/urgent] sched/eevdf: Fix pick_eevdf() tip-bot2 for Benjamin Segall
@ 2023-10-11 12:12     ` Abel Wu
  2023-10-11 13:14       ` Peter Zijlstra
  2023-10-11 21:01       ` Benjamin Segall
  3 siblings, 2 replies; 104+ messages in thread
From: Abel Wu @ 2023-10-11 12:12 UTC (permalink / raw)
  To: Benjamin Segall, Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, mgorman, bristot, corbet, qyousef,
	chris.hyser, patrick.bellasi, pjt, pavel, qperret, tim.c.chen,
	joshdon, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel,
	efault, tglx

On 9/30/23 8:09 AM, Benjamin Segall Wrote:
> The old pick_eevdf could fail to find the actual earliest eligible
> deadline when it descended to the right looking for min_deadline, but it
> turned out that that min_deadline wasn't actually eligible. In that case
> we need to go back and search through any left branches we skipped
> looking for the actual best _eligible_ min_deadline.
> 
> This is more expensive, but still O(log n), and at worst should only
> involve descending two branches of the rbtree.
> 
> I've run this through a userspace stress test (thank you
> tools/lib/rbtree.c), so hopefully this implementation doesn't miss any
> corner cases.
> 
> Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
> Signed-off-by: Ben Segall <bsegall@google.com>
> ---
>   kernel/sched/fair.c | 72 ++++++++++++++++++++++++++++++++++++---------
>   1 file changed, 58 insertions(+), 14 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0c31cda0712f..77e9440b8ab3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -864,18 +864,20 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
>    *
>    *  se->min_deadline = min(se->deadline, se->{left,right}->min_deadline)
>    *
>    * Which allows an EDF like search on (sub)trees.
>    */
> -static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> +static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
>   {
>   	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
>   	struct sched_entity *curr = cfs_rq->curr;
>   	struct sched_entity *best = NULL;
> +	struct sched_entity *best_left = NULL;
>   
>   	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
>   		curr = NULL;
> +	best = curr;
>   
>   	/*
>   	 * Once selected, run a task until it either becomes non-eligible or
>   	 * until it gets a new slice. See the HACK in set_next_entity().
>   	 */
> @@ -892,45 +894,87 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
>   			node = node->rb_left;
>   			continue;
>   		}
>   
>   		/*
> -		 * If this entity has an earlier deadline than the previous
> -		 * best, take this one. If it also has the earliest deadline
> -		 * of its subtree, we're done.
> +		 * Now we heap search eligible trees for the best (min_)deadline
>   		 */
> -		if (!best || deadline_gt(deadline, best, se)) {
> +		if (!best || deadline_gt(deadline, best, se))
>   			best = se;
> -			if (best->deadline == best->min_deadline)
> -				break;
> -		}
>   
>   		/*
> -		 * If the earlest deadline in this subtree is in the fully
> -		 * eligible left half of our space, go there.
> +		 * Every se in a left branch is eligible, keep track of the
> +		 * branch with the best min_deadline
>   		 */
> +		if (node->rb_left) {
> +			struct sched_entity *left = __node_2_se(node->rb_left);
> +
> +			if (!best_left || deadline_gt(min_deadline, best_left, left))
> +				best_left = left;
> +
> +			/*
> +			 * min_deadline is in the left branch. rb_left and all
> +			 * descendants are eligible, so immediately switch to the second
> +			 * loop.
> +			 */
> +			if (left->min_deadline == se->min_deadline)
> +				break;
> +		}
> +
> +		/* min_deadline is at this node, no need to look right */
> +		if (se->deadline == se->min_deadline)
> +			break;
> +
> +		/* else min_deadline is in the right branch. */
> +		node = node->rb_right;
> +	}
> +
> +	/*
> +	 * We ran into an eligible node which is itself the best.
> +	 * (Or nr_running == 0 and both are NULL)
> +	 */
> +	if (!best_left || (s64)(best_left->min_deadline - best->deadline) > 0)
> +		return best;
> +
> +	/*
> +	 * Now best_left and all of its children are eligible, and we are just
> +	 * looking for deadline == min_deadline
> +	 */
> +	node = &best_left->run_node;
> +	while (node) {
> +		struct sched_entity *se = __node_2_se(node);
> +
> +		/* min_deadline is the current node */
> +		if (se->deadline == se->min_deadline)
> +			return se;

IMHO it would be better to tiebreak on vruntime by moving this hunk to ..

> +
> +		/* min_deadline is in the left branch */
>   		if (node->rb_left &&
>   		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
>   			node = node->rb_left;
>   			continue;
>   		}

.. here, thoughts?

>   
> +		/* else min_deadline is in the right branch */
>   		node = node->rb_right;
>   	}
> +	return NULL;

Why not 'best'? Since ..

> +}
>   
> -	if (!best || (curr && deadline_gt(deadline, best, curr)))
> -		best = curr;
> +static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> +{
> +	struct sched_entity *se = __pick_eevdf(cfs_rq);
>   
> -	if (unlikely(!best)) {
> +	if (!se) {
>   		struct sched_entity *left = __pick_first_entity(cfs_rq);

.. cfs_rq->curr isn't considered here.

>   		if (left) {
>   			pr_err("EEVDF scheduling fail, picking leftmost\n");
>   			return left;
>   		}
>   	}
>   
> -	return best;
> +	return se;
>   }
>   
>   #ifdef CONFIG_SCHED_DEBUG
>   struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
>   {

The implementation of __pick_eevdf() is now quite complicated, which
makes it really hard to maintain. I'm trying my best to refactor this
part, hopefully it can be of some help. Below is only for illustration;
I still need to test it more.

static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
{
	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
	struct sched_entity *curr = cfs_rq->curr;
	struct sched_entity *best = NULL;
	struct sched_entity *cand = NULL;
	bool all_eligible = false;

	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
		curr = NULL;

	/*
	 * Once selected, run a task until it either becomes non-eligible or
	 * until it gets a new slice. See the HACK in set_next_entity().
	 */
	if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline)
		return curr;

	while (node) {
		struct sched_entity *se = __node_2_se(node);

		/*
		 * If this entity is not eligible, try the left subtree.
		 */
		if (!all_eligible && !entity_eligible(cfs_rq, se)) {
			node = node->rb_left;
			continue;
		}

		if (node->rb_left) {
			struct sched_entity *left = __node_2_se(node->rb_left);

			BUG_ON(left->min_deadline < se->min_deadline);

			/* Tiebreak on vruntime */
			if (left->min_deadline == se->min_deadline) {
				node = node->rb_left;
				all_eligible = true;
				continue;
			}

			if (!all_eligible) {
				/*
				 * We're going to search right subtree and the one
				 * with min_deadline can be non-eligible, so record
				 * the left subtree as a candidate.
				 */
				if (!cand || deadline_gt(min_deadline, cand, left))
					cand = left;
			}
		}

		/* min_deadline is at this node, no need to look right */
		if (se->deadline == se->min_deadline) {
			best = se;
			break;
		}

		node = node->rb_right;

		if (!node && cand) {
			/* right subtree exhausted; resume from the recorded candidate */
			node = &cand->run_node;
			all_eligible = true;
			cand = NULL;
		}
	}

	if (!best || (curr && deadline_gt(deadline, best, curr)))
		best = curr;

	return best;
}

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 01/15] sched/fair: Add avg_vruntime
  2023-10-11  7:30     ` Peter Zijlstra
  2023-10-11  8:30       ` Abel Wu
@ 2023-10-11 13:08       ` Peter Zijlstra
  1 sibling, 0 replies; 104+ messages in thread
From: Peter Zijlstra @ 2023-10-11 13:08 UTC (permalink / raw)
  To: Abel Wu
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel, efault, tglx

On Wed, Oct 11, 2023 at 09:30:01AM +0200, Peter Zijlstra wrote:
> On Wed, Oct 11, 2023 at 12:15:28PM +0800, Abel Wu wrote:
> > On 5/31/23 7:58 PM, Peter Zijlstra wrote:
> > > +/*
> > > + * Compute virtual time from the per-task service numbers:
> > > + *
> > > + * Fair schedulers conserve lag:
> > > + *
> > > + *   \Sum lag_i = 0
> > > + *
> > > + * Where lag_i is given by:
> > > + *
> > > + *   lag_i = S - s_i = w_i * (V - v_i)
> > 
> > Since the ideal service time S is task-specific, should this be:
> > 
> > 	lag_i = S_i - s_i = w_i * (V - v_i)
> 

Yes, it should be. Clearly I was delusional this morning.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH] sched/fair: fix pick_eevdf to always find the correct se
  2023-10-11 12:12     ` [PATCH] sched/fair: fix pick_eevdf to always find the correct se Abel Wu
@ 2023-10-11 13:14       ` Peter Zijlstra
  2023-10-12 10:04         ` Abel Wu
  2023-10-11 21:01       ` Benjamin Segall
  1 sibling, 1 reply; 104+ messages in thread
From: Peter Zijlstra @ 2023-10-11 13:14 UTC (permalink / raw)
  To: Abel Wu
  Cc: Benjamin Segall, mingo, vincent.guittot, linux-kernel,
	juri.lelli, dietmar.eggemann, rostedt, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel, efault, tglx

On Wed, Oct 11, 2023 at 08:12:09PM +0800, Abel Wu wrote:

> The implementation of __pick_eevdf() is now quite complicated, which
> makes it really hard to maintain. I'm trying my best to refactor this
> part, hopefully it can be of some help. Below is only for illustration, I
> need to test more.

Well, the form it has now is somewhat close to the one in the paper. I
think Ben added an early break that the paper's version doesn't have,
or something.

As the paper explains, you get two walks, one down the eligibility path,
and then one down the heap. I think the current code structure
represents that fairly well.

Very vague memories seem to suggest I thought I was being clever some
10+ years ago when I wrote that other descent loop that Ben showed to be
broken (and woe on me for not verifying it when I resurrected the
patches; I rewrote pretty much everything else, except this lookup :/).


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 03/15] sched/fair: Add lag based placement
  2023-10-11 12:00   ` [PATCH 03/15] " Abel Wu
@ 2023-10-11 13:24     ` Peter Zijlstra
  2023-10-12  7:04       ` Abel Wu
  0 siblings, 1 reply; 104+ messages in thread
From: Peter Zijlstra @ 2023-10-11 13:24 UTC (permalink / raw)
  To: Abel Wu
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel, efault, tglx

On Wed, Oct 11, 2023 at 08:00:22PM +0800, Abel Wu wrote:
> On 5/31/23 7:58 PM, Peter Zijlstra Wrote:
> >   		/*
> > +		 * If we want to place a task and preserve lag, we have to
> > +		 * consider the effect of the new entity on the weighted
> > +		 * average and compensate for this, otherwise lag can quickly
> > +		 * evaporate.
> > +		 *
> > +		 * Lag is defined as:
> > +		 *
> > +		 *   lag_i = S - s_i = w_i * (V - v_i)
> > +		 *
> > +		 * To avoid the 'w_i' term all over the place, we only track
> > +		 * the virtual lag:
> > +		 *
> > +		 *   vl_i = V - v_i <=> v_i = V - vl_i
> > +		 *
> > +		 * And we take V to be the weighted average of all v:
> > +		 *
> > +		 *   V = (\Sum w_j*v_j) / W
> > +		 *
> > +		 * Where W is: \Sum w_j
> > +		 *
> > +		 * Then, the weighted average after adding an entity with lag
> > +		 * vl_i is given by:
> > +		 *
> > +		 *   V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
> > +		 *      = (W*V + w_i*(V - vl_i)) / (W + w_i)
> > +		 *      = (W*V + w_i*V - w_i*vl_i) / (W + w_i)
> > +		 *      = (V*(W + w_i) - w_i*vl_i) / (W + w_i)
> > +		 *      = V - w_i*vl_i / (W + w_i)
> > +		 *
> > +		 * And the actual lag after adding an entity with vl_i is:
> > +		 *
> > +		 *   vl'_i = V' - v_i
> > +		 *         = V - w_i*vl_i / (W + w_i) - (V - vl_i)
> > +		 *         = vl_i - w_i*vl_i / (W + w_i)
> > +		 *
> > +		 * Which is strictly less than vl_i. So in order to preserve lag
> 
> Maybe a stupid question, but why vl'_i < vl_i? Since vl_i can be negative.

So the below doesn't care about the sign, it simply inverts this
relation to express vl_i in terms of vl'_i:

> > +		 * we should inflate the lag before placement such that the
> > +		 * effective lag after placement comes out right.
> > +		 *
> > +		 * As such, invert the above relation for vl'_i to get the vl_i
> > +		 * we need to use such that the lag after placement is the lag
> > +		 * we computed before dequeue.
> > +		 *
> > +		 *   vl'_i = vl_i - w_i*vl_i / (W + w_i)
> > +		 *         = ((W + w_i)*vl_i - w_i*vl_i) / (W + w_i)
> > +		 *
> > +		 *   (W + w_i)*vl'_i = (W + w_i)*vl_i - w_i*vl_i
> > +		 *                   = W*vl_i
> > +		 *
> > +		 *   vl_i = (W + w_i)*vl'_i / W

And then we obtain the scale factor: (W + w_i)/W, which is >1, right?

As such, that means that vl'_i must be smaller than vl_i in the absolute
sense, irrespective of sign.
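
As a concrete sanity check with made-up numbers: take W = 2048 (two
nice-0 entities already queued) and place a w_i = 1024 entity whose
stored lag is vl'_i = 6. Then:

  vl_i  = (W + w_i)/W * vl'_i = 3072/2048 * 6 = 9
  vl'_i = vl_i - w_i*vl_i / (W + w_i) = 9 - 1024*9/3072 = 9 - 3 = 6

so the inflated lag 9 shrinks back to the stored 6 after placement, and
indeed |vl'_i| < |vl_i|.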

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH] sched/fair: fix pick_eevdf to always find the correct se
  2023-10-11 12:12     ` [PATCH] sched/fair: fix pick_eevdf to always find the correct se Abel Wu
  2023-10-11 13:14       ` Peter Zijlstra
@ 2023-10-11 21:01       ` Benjamin Segall
  2023-10-12 10:25         ` Abel Wu
  1 sibling, 1 reply; 104+ messages in thread
From: Benjamin Segall @ 2023-10-11 21:01 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, mgorman, bristot, corbet, qyousef,
	chris.hyser, patrick.bellasi, pjt, pavel, qperret, tim.c.chen,
	joshdon, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel,
	efault, tglx

[-- Attachment #1: Type: text/plain, Size: 5024 bytes --]

Abel Wu <wuyun.abel@bytedance.com> writes:

> On 9/30/23 8:09 AM, Benjamin Segall Wrote:
>> +	/*
>> +	 * Now best_left and all of its children are eligible, and we are just
>> +	 * looking for deadline == min_deadline
>> +	 */
>> +	node = &best_left->run_node;
>> +	while (node) {
>> +		struct sched_entity *se = __node_2_se(node);
>> +
>> +		/* min_deadline is the current node */
>> +		if (se->deadline == se->min_deadline)
>> +			return se;
>
> IMHO it would be better tiebreak on vruntime by moving this hunk to ..
>
>> +
>> +		/* min_deadline is in the left branch */
>>   		if (node->rb_left &&
>>   		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
>>   			node = node->rb_left;
>>   			continue;
>>   		}
>
> .. here, thoughts?

Yeah, that should work and be better on the tiebreak (and my test code
agrees). There's an argument that the tiebreak will never really come up
and it's better to avoid the potential one extra cache line from
"__node_2_se(node->rb_left)->min_deadline" though.

>
>>   +		/* else min_deadline is in the right branch */
>>   		node = node->rb_right;
>>   	}
>> +	return NULL;
>
> Why not 'best'? Since ..

The only time this can happen is if the tree is corrupt. We only reach
this case if best_left is set at all (and best_left's min_deadline is
better than "best"'s, which includes curr). In that case getting an
error message is good, and in general I wasn't worrying about it much.

>
>> +}
>>   -	if (!best || (curr && deadline_gt(deadline, best, curr)))
>> -		best = curr;
>> +static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
>> +{
>> +	struct sched_entity *se = __pick_eevdf(cfs_rq);
>>   -	if (unlikely(!best)) {
>> +	if (!se) {
>>   		struct sched_entity *left = __pick_first_entity(cfs_rq);
>
> .. cfs_rq->curr isn't considered here.

That said, we should probably consider curr here in the error-case
fallback, if just as a "if (!left) left = cfs_rq->curr;"

>
>>   		if (left) {
>>   			pr_err("EEVDF scheduling fail, picking leftmost\n");
>>   			return left;
>>   		}
>>   	}
>>   -	return best;
>> +	return se;
>>   }
>>     #ifdef CONFIG_SCHED_DEBUG
>>   struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
>>   {
>
> The implementation of __pick_eevdf() is now quite complicated, which
> makes it really hard to maintain. I'm trying my best to refactor this
> part, hopefully it can be of some help. Below is only for illustration, I
> need to test more.
>

A passing version of that single-loop code minus the cfs_rq->curr
handling is here (just pasting the curr handling back on the start/end
should do):

static struct sched_entity *pick_eevdf_abel(struct cfs_rq *cfs_rq)
{
	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
	struct sched_entity *best = NULL;
	struct sched_entity *cand = NULL;
	bool all_eligible = false;

	while (node || cand) {
		struct sched_entity *se = __node_2_se(node);
		if (!node) {
			BUG_ON(!cand);
			node = &cand->run_node;
			se = __node_2_se(node);
			all_eligible = true;
			cand = NULL;

			/*
			 * Our initial pass ran into an eligible node which is
			 * itself the best
			 */
			if (best && (s64)(se->min_deadline - best->deadline) > 0)
				break;
		}

		/*
		 * If this entity is not eligible, try the left subtree.
		 */
		if (!all_eligible && !entity_eligible(cfs_rq, se)) {
			node = node->rb_left;
			continue;
		}

		if (node->rb_left) {
			struct sched_entity *left = __node_2_se(node->rb_left);

			BUG_ON(left->min_deadline < se->min_deadline);

			/* Tiebreak on vruntime */
			if (left->min_deadline == se->min_deadline) {
				node = node->rb_left;
				all_eligible = true;
				continue;
			}

			if (!all_eligible) {
				/*
				 * We're going to search right subtree and the one
				 * with min_deadline can be non-eligible, so record
				 * the left subtree as a candidate.
				 */
				if (!cand || deadline_gt(min_deadline, cand, left))
					cand = left;
			}
		}

		if (!all_eligible && (!best || deadline_gt(deadline, best, se)))
			best = se;

		/* min_deadline is at this node, no need to look right */
		if (se->deadline == se->min_deadline) {
			if (all_eligible && (!best || deadline_gte(deadline, best, se)))
				best = se;
			if (!all_eligible && (!best || deadline_gt(deadline, best, se)))
				best = se;
			node = NULL;
			continue;
		}

		node = node->rb_right;
	}

	return best;
}

This does exactly as well on the tiebreak condition of vruntime as the
improved two-loop version you suggested. If we don't care about the
tiebreak we can replace the ugly if(all_eligible) if(!all_eligible) in
the last section with a single version that just uses deadline_gt (or
gte) throughout.
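
For illustration, that single version would be roughly the below, which
is also what the pick_eevdf_abel variant in the attached tester ends up
doing:

		/* min_deadline is at this node, no need to look right */
		if (se->deadline == se->min_deadline) {
			if (!best || deadline_gt(deadline, best, se))
				best = se;
			node = NULL;
			continue;
		}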

Personally with all the extra bits I added for correctness this winds up
definitely uglier than the two-loop version, but it might be possible to
clean it up further. (And a bunch of it is just personal style
preference in the first place)

I've also attached my ugly userspace EEVDF tester; hopefully I attached
it in a mode that will make it through lkml.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: EEVDF tester --]
[-- Type: text/x-diff, Size: 18135 bytes --]

diff --git a/tools/testing/selftests/sched/Makefile b/tools/testing/selftests/sched/Makefile
index 099ee9213557..10e38c7afe5e 100644
--- a/tools/testing/selftests/sched/Makefile
+++ b/tools/testing/selftests/sched/Makefile
@@ -6,9 +6,11 @@ endif
 
 CFLAGS += -O2 -Wall -g -I./ $(KHDR_INCLUDES) -Wl,-rpath=./ \
 	  $(CLANG_FLAGS)
 LDLIBS += -lpthread
 
-TEST_GEN_FILES := cs_prctl_test
-TEST_PROGS := cs_prctl_test
+TEST_GEN_FILES := cs_prctl_test sim_rbtree
+TEST_PROGS := cs_prctl_test sim_rbtree
 
 include ../lib.mk
+
+CFLAGS += -I$(top_srcdir)/tools/include
diff --git a/tools/testing/selftests/sched/sim_rbtree.c b/tools/testing/selftests/sched/sim_rbtree.c
new file mode 100644
index 000000000000..dad2544e4d9d
--- /dev/null
+++ b/tools/testing/selftests/sched/sim_rbtree.c
@@ -0,0 +1,681 @@
+#include "linux/rbtree_augmented.h"
+#include <stdio.h>
+#include <string.h>
+#include <time.h>
+#include <stdlib.h>
+#include <assert.h>
+
+#include "../../../lib/rbtree.c"
+
+static inline s64 div_s64_rem(s64 dividend, s32 divisor, s32 *remainder)
+{
+	*remainder = dividend % divisor;
+	return dividend / divisor;
+}
+static inline s64 div_s64(s64 dividend, s32 divisor)
+{
+	s32 remainder;
+	return div_s64_rem(dividend, divisor, &remainder);
+}
+
+static __always_inline struct rb_node *
+rb_add_augmented_cached(struct rb_node *node, struct rb_root_cached *tree,
+			bool (*less)(struct rb_node *, const struct rb_node *),
+			const struct rb_augment_callbacks *augment)
+{
+	struct rb_node **link = &tree->rb_root.rb_node;
+	struct rb_node *parent = NULL;
+	bool leftmost = true;
+
+	while (*link) {
+		parent = *link;
+		if (less(node, parent)) {
+			link = &parent->rb_left;
+		} else {
+			link = &parent->rb_right;
+			leftmost = false;
+		}
+	}
+
+	rb_link_node(node, parent, link);
+	augment->propagate(parent, NULL); /* suboptimal */
+	rb_insert_augmented_cached(node, tree, leftmost, augment);
+
+	return leftmost ? node : NULL;
+}
+
+
+# define SCHED_FIXEDPOINT_SHIFT		10
+# define NICE_0_LOAD_SHIFT	(SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
+# define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
+# define scale_load_down(w) \
+({ \
+	unsigned long __w = (w); \
+	if (__w) \
+		__w = max(2UL, __w >> SCHED_FIXEDPOINT_SHIFT); \
+	__w; \
+})
+
+struct load_weight {
+	unsigned long			weight;
+	u32				inv_weight;
+};
+
+struct sched_entity {
+	char name;
+	struct load_weight		load;
+	struct rb_node			run_node;
+	u64				deadline;
+	u64				min_deadline;
+
+	unsigned int			on_rq;
+
+	u64				vruntime;
+	s64				vlag;
+	u64				slice;
+};
+
+struct cfs_rq {
+	struct load_weight	load;
+	s64			avg_vruntime;
+	u64			avg_load;
+	u64			min_vruntime;
+	struct rb_root_cached	tasks_timeline;
+	struct sched_entity	*curr;
+};
+
+void print_se(char *label, struct sched_entity *se);
+
+static inline bool entity_before(const struct sched_entity *a,
+				 const struct sched_entity *b)
+{
+	return (s64)(a->vruntime - b->vruntime) < 0;
+}
+
+static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	return (s64)(se->vruntime - cfs_rq->min_vruntime);
+}
+
+#define __node_2_se(node) \
+	rb_entry((node), struct sched_entity, run_node)
+
+int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	struct sched_entity *curr = cfs_rq->curr;
+	s64 avg = cfs_rq->avg_vruntime;
+	long load = cfs_rq->avg_load;
+
+	if (curr && curr->on_rq) {
+		unsigned long weight = scale_load_down(curr->load.weight);
+
+		avg += entity_key(cfs_rq, curr) * weight;
+		load += weight;
+	}
+
+	return avg >= entity_key(cfs_rq, se) * load;
+}
+
+static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
+{
+	return entity_before(__node_2_se(a), __node_2_se(b));
+}
+
+#define deadline_gt(field, lse, rse) ({ (s64)((lse)->field - (rse)->field) > 0; })
+
+static inline void __update_min_deadline(struct sched_entity *se, struct rb_node *node)
+{
+	if (node) {
+		struct sched_entity *rse = __node_2_se(node);
+		if (deadline_gt(min_deadline, se, rse))
+			se->min_deadline = rse->min_deadline;
+	}
+}
+
+/*
+ * se->min_deadline = min(se->deadline, left->min_deadline, right->min_deadline)
+ */
+static inline bool min_deadline_update(struct sched_entity *se, bool exit)
+{
+	u64 old_min_deadline = se->min_deadline;
+	struct rb_node *node = &se->run_node;
+
+	se->min_deadline = se->deadline;
+	__update_min_deadline(se, node->rb_right);
+	__update_min_deadline(se, node->rb_left);
+
+	return se->min_deadline == old_min_deadline;
+}
+
+static void
+avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	unsigned long weight = scale_load_down(se->load.weight);
+	s64 key = entity_key(cfs_rq, se);
+
+	cfs_rq->avg_vruntime += key * weight;
+	cfs_rq->avg_load += weight;
+}
+
+static void
+avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	unsigned long weight = scale_load_down(se->load.weight);
+	s64 key = entity_key(cfs_rq, se);
+
+	cfs_rq->avg_vruntime -= key * weight;
+	cfs_rq->avg_load -= weight;
+}
+
+static inline
+void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
+{
+	/*
+	 * v' = v + d ==> avg_vruntime' = avg_vruntime - d*avg_load
+	 */
+	cfs_rq->avg_vruntime -= cfs_rq->avg_load * delta;
+}
+
+u64 avg_vruntime(struct cfs_rq *cfs_rq)
+{
+	struct sched_entity *curr = cfs_rq->curr;
+	s64 avg = cfs_rq->avg_vruntime;
+	long load = cfs_rq->avg_load;
+
+	if (curr && curr->on_rq) {
+		unsigned long weight = scale_load_down(curr->load.weight);
+
+		avg += entity_key(cfs_rq, curr) * weight;
+		load += weight;
+	}
+
+	if (load)
+		avg = div_s64(avg, load);
+
+	return cfs_rq->min_vruntime + avg;
+}
+
+
+RB_DECLARE_CALLBACKS(static, min_deadline_cb, struct sched_entity,
+		     run_node, min_deadline, min_deadline_update);
+
+/*
+ * Enqueue an entity into the rb-tree:
+ */
+static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	avg_vruntime_add(cfs_rq, se);
+	se->min_deadline = se->deadline;
+	rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
+				__entity_less, &min_deadline_cb);
+}
+
+void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
+				  &min_deadline_cb);
+	avg_vruntime_sub(cfs_rq, se);
+}
+
+struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
+{
+	struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);
+
+	if (!left)
+		return NULL;
+
+	return __node_2_se(left);
+}
+
+
+static struct sched_entity *pick_eevdf_orig(struct cfs_rq *cfs_rq)
+{
+	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
+	struct sched_entity *curr = cfs_rq->curr;
+	struct sched_entity *best = NULL;
+
+	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
+		curr = NULL;
+
+	/*
+	 * Once selected, run a task until it either becomes non-eligible or
+	 * until it gets a new slice. See the HACK in set_next_entity().
+	 */
+	//if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline)
+	//return curr;
+
+	while (node) {
+		struct sched_entity *se = __node_2_se(node);
+
+		/*
+		 * If this entity is not eligible, try the left subtree.
+		 */
+		if (!entity_eligible(cfs_rq, se)) {
+			node = node->rb_left;
+			continue;
+		}
+
+		/*
+		 * If this entity has an earlier deadline than the previous
+		 * best, take this one. If it also has the earliest deadline
+		 * of its subtree, we're done.
+		 */
+		if (!best || deadline_gt(deadline, best, se)) {
+			best = se;
+			if (best->deadline == best->min_deadline)
+				break;
+		}
+
+		/*
+		 * If the earliest deadline in this subtree is in the fully
+		 * eligible left half of our space, go there.
+		 */
+		if (node->rb_left &&
+		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
+			node = node->rb_left;
+			continue;
+		}
+
+		node = node->rb_right;
+	}
+
+	return best;
+}
+
+#if 1
+#define print_se(...)
+#define printf(...)
+#endif
+
+static struct sched_entity *pick_eevdf_improved(struct cfs_rq *cfs_rq)
+{
+	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
+	struct sched_entity *best_left = NULL;
+	struct sched_entity *best = NULL;
+
+	if (!node)
+		return NULL;
+
+	while (node) {
+		struct sched_entity *se = __node_2_se(node);
+		print_se("search1", se);
+
+		/*
+		 * If this entity is not eligible, try the left subtree.
+		 */
+		if (!entity_eligible(cfs_rq, se)) {
+			node = node->rb_left;
+			printf("not eligible\n");
+			continue;
+		}
+
+		/*
+		 * Now we heap search eligible trees for the best (min_)deadline
+		 */
+		if (!best || deadline_gt(deadline, best, se)) {
+			print_se("best found", se);
+			best = se;
+		}
+
+		/*
+		 * Every se in a left branch is eligible, keep track of the one
+		 * with the best min_deadline
+		 */
+		if (node->rb_left) {
+			struct sched_entity *left = __node_2_se(node->rb_left);
+			print_se("going right, has left", left);
+			if (!best_left || deadline_gt(min_deadline, best_left, left)) {
+				printf("new best left\n");
+				best_left = left;
+			}
+
+			/*
+			 * min_deadline is in the left branch. rb_left and all
+			 * descendants are eligible, so immediately switch to the second
+			 * loop.
+			 */
+			if (left->min_deadline == se->min_deadline) {
+				printf("left");
+				break;
+			}
+		}
+
+		/* min_deadline is at node, no need to look right */
+		if (se->deadline == se->min_deadline) {
+			printf("found\n");
+			break;
+		}
+
+		/* else min_deadline is in the right branch. */
+		BUG_ON(__node_2_se(node->rb_right)->min_deadline != se->min_deadline);
+		node = node->rb_right;
+	}
+	BUG_ON(!best);
+	print_se("best_left", best_left);
+	print_se("best", best);
+
+	/* We ran into an eligible node which is itself the best */
+	if (!best_left || (s64)(best_left->min_deadline - best->deadline) > 0)
+		return best;
+	
+	/*
+	 * Now best_left and all of its children are eligible, and we are just
+	 * looking for deadline == min_deadline
+	 */
+	node = &best_left->run_node;
+	while (node) {
+		struct sched_entity *se = __node_2_se(node);
+
+		/* min_deadline is the current node */
+		if (se->deadline == se->min_deadline)
+			return se;
+
+		/* min_deadline is in the left branch */
+		if (node->rb_left &&
+		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
+			node = node->rb_left;
+			continue;
+		}
+
+		/* else min_deadline is in the right branch */
+		BUG_ON(__node_2_se(node->rb_right)->min_deadline != se->min_deadline);
+		node = node->rb_right;
+	}
+	BUG();
+	return NULL;
+}
+
+#undef printf
+#undef print_se
+
+static struct sched_entity *pick_eevdf_improved_ndebug(struct cfs_rq *cfs_rq)
+{
+	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
+	struct sched_entity *best_left = NULL;
+	struct sched_entity *best = NULL;
+
+	while (node) {
+		struct sched_entity *se = __node_2_se(node);
+
+		/*
+		 * If this entity is not eligible, try the left subtree.
+		 */
+		if (!entity_eligible(cfs_rq, se)) {
+			node = node->rb_left;
+			continue;
+		}
+
+		/*
+		 * Now we heap search eligible trees for the best (min_)deadline
+		 */
+		if (!best || deadline_gt(deadline, best, se))
+			best = se;
+
+		/*
+		 * Every se in a left branch is eligible, keep track of the one
+		 * with the best min_deadline
+		 */
+		if (node->rb_left) {
+			struct sched_entity *left = __node_2_se(node->rb_left);
+			if (!best_left || deadline_gt(min_deadline, best_left, left))
+				best_left = left;
+
+			/*
+			 * min_deadline is in the left branch. rb_left and all
+			 * descendants are eligible, so immediately switch to the second
+			 * loop.
+			 */
+			if (left->min_deadline == se->min_deadline)
+				break;
+		}
+
+		/* min_deadline is at node, no need to look right */
+		if (se->deadline == se->min_deadline)
+			break;
+
+		/* else min_deadline is in the right branch. */
+		BUG_ON(__node_2_se(node->rb_right)->min_deadline != se->min_deadline);
+		node = node->rb_right;
+	}
+	BUG_ON(!best && best_left);
+
+	/* We ran into an eligible node which is itself the best */
+	if (!best_left || (s64)(best_left->min_deadline - best->deadline) > 0)
+		return best;
+	
+	/*
+	 * Now best_left and all of its children are eligible, and we are just
+	 * looking for deadline == min_deadline
+	 */
+	node = &best_left->run_node;
+	while (node) {
+		struct sched_entity *se = __node_2_se(node);
+
+		/* min_deadline is in the left branch */
+		if (node->rb_left &&
+		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
+			node = node->rb_left;
+			continue;
+		}
+
+		/* min_deadline is the current node */
+		if (se->deadline == se->min_deadline)
+			return se;
+
+		/* else min_deadline is in the right branch */
+		BUG_ON(__node_2_se(node->rb_right)->min_deadline != se->min_deadline);
+		node = node->rb_right;
+	}
+	BUG();
+	return NULL;
+}
+
+static struct sched_entity *pick_eevdf_abel(struct cfs_rq *cfs_rq)
+{
+	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
+	struct sched_entity *best = NULL;
+	struct sched_entity *cand = NULL;
+	bool all_eligible = false;
+
+	while (node || cand) {
+		struct sched_entity *se = __node_2_se(node);
+		if (!node) {
+			BUG_ON(!cand);
+			node = &cand->run_node;
+			se = __node_2_se(node);
+			all_eligible = true;
+			cand = NULL;
+
+			/*
+			 * Our initial pass ran into an eligible node which is
+			 * itself the best
+			 */
+			if (best && (s64)(se->min_deadline - best->deadline) > 0)
+				break;
+		}
+
+		/*
+		 * If this entity is not eligible, try the left subtree.
+		 */
+		if (!all_eligible && !entity_eligible(cfs_rq, se)) {
+			node = node->rb_left;
+			continue;
+		}
+		if (!all_eligible && (!best || deadline_gt(deadline, best, se)))
+			best = se;
+
+		if (node->rb_left) {
+			struct sched_entity *left = __node_2_se(node->rb_left);
+
+			BUG_ON(left->min_deadline < se->min_deadline);
+
+			/* Tiebreak on vruntime */
+			if (left->min_deadline == se->min_deadline) {
+				node = node->rb_left;
+				all_eligible = true;
+				continue;
+			}
+
+			if (!all_eligible) {
+				/*
+				 * We're going to search right subtree and the one
+				 * with min_deadline can be non-eligible, so record
+				 * the left subtree as a candidate.
+				 */
+				if (!cand || deadline_gt(min_deadline, cand, left))
+					cand = left;
+			}
+		}
+
+		/* min_deadline is at this node, no need to look right */
+		if (se->deadline == se->min_deadline) {
+			if (!best || deadline_gt(deadline, best, se))
+				best = se;
+			node = NULL;
+			continue;
+		}
+
+		node = node->rb_right;
+	}
+
+	return best;
+}
+
+
+
+void init_se(struct cfs_rq *cfs_rq, struct sched_entity *se, char name, u64 vruntime, u64 deadline) {
+	memset(se, 0, sizeof(*se));
+	se->name = name;
+	se->slice = 1;
+	se->load.weight = 1024;
+	se->vruntime = vruntime;
+	se->deadline = deadline;
+	__enqueue_entity(cfs_rq, se);
+}
+
+void print_se(char *label, struct sched_entity *se) {
+	if (!se) {
+		printf("%s is null\n", label);
+	} else {
+		struct rb_node *parent = rb_parent(&se->run_node);
+		printf("%s(%c) vrun %ld dl %ld, min_dl %ld, parent: %c\n", label, se->name,
+		       se->vruntime, se->deadline, se->min_deadline, parent ? __node_2_se(parent)->name : ' ');
+	}
+}
+
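+/*
+ * O(n) reference picker: walk the tree in vruntime order and return the
+ * eligible entity with the earliest deadline.  Entities after the first
+ * non-eligible one have a larger vruntime and thus cannot be eligible
+ * either, so stop there.
+ */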
+struct sched_entity *correct_pick_eevdf(struct cfs_rq *cfs_rq) {
+	struct rb_node *node = rb_first_cached(&cfs_rq->tasks_timeline);
+	struct sched_entity *best = NULL;
+
+	while (node) {
+		struct sched_entity *se = __node_2_se(node);
+
+		/*
+		 * If this entity is not eligible, try the left subtree.
+		 */
+		if (!entity_eligible(cfs_rq, se)) {
+			return best;
+		}
+
+		/*
+		 * If this entity has an earlier deadline than the previous
+		 * best, take this one. If it also has the earliest deadline
+		 * of its subtree, we're done.
+		 */
+		if (!best || deadline_gt(deadline, best, se)) {
+			best = se;
+		}
+
+		node = rb_next(node);
+	}
+	return best;
+}
+
+
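+/*
+ * Randomized test: for each tree size, enqueue entities with random
+ * vruntime/deadline pairs and compare pick_fn against the O(n) reference
+ * picker, dumping the offending tree on the first mismatch.
+ */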
+void test_pick_function(struct sched_entity *(*pick_fn)(struct cfs_rq *), unsigned long skip_to) {
+#define MAX_SIZE 26
+	int size;
+	unsigned long n = 0;
+	struct sched_entity se[MAX_SIZE];
+	for (size = 0; size < MAX_SIZE; size++) {
+		int runs = 100000;
+		int count;
+		if (size <= 1)
+			runs = 1;
+		else if (size == 2)
+			runs = 100;
+		for (count = 0; count < runs; count++) {
+			int i;
+			struct cfs_rq cfs_rq = {0};
+			struct sched_entity *pick, *correct_pick;
+			cfs_rq.tasks_timeline = RB_ROOT_CACHED;
+			for (i = 0; i < size; i++) {
+				u64 v = (random() % size) * 10;
+				u64 d = v + (random() % size) * 10 + 1;
+				init_se(&cfs_rq, &se[i], 'A'+i, v, d);
+			}
+			n++;
+			if (n < skip_to)
+				continue;
+			pick = pick_fn(&cfs_rq);
+			correct_pick = correct_pick_eevdf(&cfs_rq);
+
+			if (size == 0) {
+				assert(!pick);
+				assert(!correct_pick);
+				continue;
+			}
+			if (!pick ||
+			    pick->deadline != correct_pick->deadline ||
+			    !entity_eligible(&cfs_rq, pick)) {
+
+				printf("Error (run %lu):\n", n);
+				print_se("correct pick", correct_pick);
+				print_se("actual pick ", pick);
+				printf("All ses:\n");
+				for (i = 0; i < size; i++) {
+					print_se("", &se[i]);
+				}
+				return;
+			}
+			//puts("");
+		}
+	}
+}
+
+void orig_check(void) {
+	struct cfs_rq cfs_rq = {0};
+	struct sched_entity sa, sb, sc;
+	struct sched_entity *root;
+
+
+	cfs_rq.tasks_timeline = RB_ROOT_CACHED;
+
+	init_se(&cfs_rq, &sa, 'a', 5, 9);
+	init_se(&cfs_rq, &sb, 'b', 4, 8);
+	init_se(&cfs_rq, &sc, 'c', 6, 7);
+
+	printf("cfs_rq min %ld avg %ld load %ld\n", cfs_rq.min_vruntime, cfs_rq.avg_vruntime, cfs_rq.avg_load);
+
+	root = __node_2_se(cfs_rq.tasks_timeline.rb_root.rb_node);
+	print_se("root", root);
+	if (root->run_node.rb_left)
+		print_se("left", __node_2_se(root->run_node.rb_left));
+	if (root->run_node.rb_right)
+		print_se("right", __node_2_se(root->run_node.rb_right));
+	print_se("picked", pick_eevdf_orig(&cfs_rq));
+}
+
+int main(int argc, char *argv[]) {
+	unsigned int seed = (unsigned int)time(NULL);
+	unsigned long skip_to = 0;
+	if (argc > 1)
+		seed = (unsigned int)atol(argv[1]);
+	srandom(seed);
+	if (argc > 2)
+		skip_to = (unsigned long)atol(argv[2]);
+	printf("Seed: %d\n", seed);
+	
+	test_pick_function(pick_eevdf_improved_ndebug, skip_to);
+	printf("Seed: %d\n", seed);
+}

^ permalink raw reply related	[flat|nested] 104+ messages in thread

* Re: [RFC][PATCH 13/15] sched/fair: Implement latency-nice
  2023-05-31 11:58 ` [RFC][PATCH 13/15] sched/fair: Implement latency-nice Peter Zijlstra
  2023-06-06 14:54   ` Vincent Guittot
@ 2023-10-11 23:24   ` Benjamin Segall
  1 sibling, 0 replies; 104+ messages in thread
From: Benjamin Segall @ 2023-10-11 23:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, mgorman, bristot, corbet, qyousef,
	chris.hyser, patrick.bellasi, pjt, pavel, qperret, tim.c.chen,
	joshdon, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel,
	efault, tglx

Peter Zijlstra <peterz@infradead.org> writes:

> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -952,6 +952,21 @@ int sched_update_scaling(void)
>  }
>  #endif
>  
> +void set_latency_fair(struct sched_entity *se, int prio)
> +{
> +	u32 weight = sched_prio_to_weight[prio];
> +	u64 base = sysctl_sched_base_slice;
> +
> +	/*
> +	 * For EEVDF the virtual time slope is determined by w_i (iow.
> +	 * nice) while the request time r_i is determined by
> +	 * latency-nice.
> +	 *
> +	 * Smaller request gets better latency.
> +	 */
> +	se->slice = div_u64(base << SCHED_FIXEDPOINT_SHIFT, weight);
> +}
> +
>  static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
>  
>  /*


This seems questionable in combination with the earlier changes that
make things use se->slice by itself as the expected time slice:


> @@ -6396,13 +6629,12 @@ static inline void unthrottle_offline_cf
>  static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
>  {
>  	struct sched_entity *se = &p->se;
> -	struct cfs_rq *cfs_rq = cfs_rq_of(se);
>  
>  	SCHED_WARN_ON(task_rq(p) != rq);
>  
>  	if (rq->cfs.h_nr_running > 1) {
> -		u64 slice = sched_slice(cfs_rq, se);
>  		u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> +		u64 slice = se->slice;
>  		s64 delta = slice - ran;
>  
>  		if (delta < 0) {
> @@ -12136,8 +12382,8 @@ static void rq_offline_fair(struct rq *r
>  static inline bool
>  __entity_slice_used(struct sched_entity *se, int min_nr_tasks)
>  {
> -	u64 slice = sched_slice(cfs_rq_of(se), se);
>  	u64 rtime = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> +	u64 slice = se->slice;
>  
>  	return (rtime * min_nr_tasks > slice);
>  }
> @@ -12832,7 +13078,7 @@ static unsigned int get_rr_interval_fair
>  	 * idle runqueue:
>  	 */
>  	if (rq->cfs.load.weight)
> -		rr_interval = NS_TO_JIFFIES(sched_slice(cfs_rq_of(se), se));
> +		rr_interval = NS_TO_JIFFIES(se->slice);
>  
>  	return rr_interval;
>  }

We probably do not want a task with normal weight and low latency-weight
(aka high latency / latency-nice value) to be expected to have a very
very high slice value for some of these. get_rr_interval_fair is
whatever, it's not really a number that exists, and CONFIG_SCHED_CORE
isn't updated for EEVDF at all, but HRTICK at least probably should be
updated. Having such a task run for 68 times normal seems likely to have
far worse latency effects than any gains from other parts.
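
(For reference, assuming the stock sched_prio_to_weight[] table, the 68x
comes straight out of the slice formula above: the smallest weight is 15,
so se->slice = base * 1024 / 15 ~= 68 * sysctl_sched_base_slice.)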

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH v2] sched/fair: Preserve PLACE_DEADLINE_INITIAL deadline
  2023-10-04 13:09     ` [PATCH v2] " Daniel Jordan
  2023-10-04 15:46       ` Chen Yu
@ 2023-10-12  4:48       ` K Prateek Nayak
  1 sibling, 0 replies; 104+ messages in thread
From: K Prateek Nayak @ 2023-10-12  4:48 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: peterz, bristot, bsegall, chris.hyser, corbet, dietmar.eggemann,
	efault, joel, joshdon, juri.lelli, linux-kernel, mgorman, mingo,
	patrick.bellasi, pavel, pjt, qperret, qyousef, rostedt, tglx,
	tim.c.chen, timj, vincent.guittot, youssefesmat, yu.c.chen

Hello Daniel,

Same as v1, I do not see any regressions with this version either.
I'll leave the full results below.

o Machine details

- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- C2 Disabled (POLL and C1(MWAIT) remained enabled)

o Kernel Details

- tip:	tip:sched/core at commit 238437d88cea ("intel_idle: Add ibrs_off
	module parameter to force-disable IBRS")
	[For DeathStarBench comparisons alone since I ran into the issue
	which the commit below solves]
	+ min_deadline fix commit 8dafa9d0eb1a ("sched/eevdf: Fix
	min_deadline heap integrity") from tip:sched/urgent 

- place-initial-fix: tip + this patch as is

o Benchmark Results

==================================================================
Test          : hackbench
Units         : Normalized time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
Case:           tip[pct imp](CV)    place-initial-fix[pct imp](CV)
 1-groups     1.00 [ -0.00]( 2.11)     1.01 [ -1.08]( 2.60)
 2-groups     1.00 [ -0.00]( 1.31)     1.01 [ -0.93]( 1.61)
 4-groups     1.00 [ -0.00]( 1.04)     1.00 [ -0.00]( 1.25)
 8-groups     1.00 [ -0.00]( 1.34)     0.99 [  1.15]( 0.85)
16-groups     1.00 [ -0.00]( 2.45)     1.00 [ -0.27]( 2.32)


==================================================================
Test          : tbench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:    tip[pct imp](CV)    place-initial-fix[pct imp](CV)
    1     1.00 [  0.00]( 0.46)     0.99 [ -0.59]( 0.88)
    2     1.00 [  0.00]( 0.64)     0.99 [ -1.43]( 0.69)
    4     1.00 [  0.00]( 0.59)     0.99 [ -1.49]( 0.76)
    8     1.00 [  0.00]( 0.34)     1.00 [ -0.35]( 0.20)
   16     1.00 [  0.00]( 0.72)     0.98 [ -1.96]( 1.97)
   32     1.00 [  0.00]( 0.65)     1.00 [ -0.24]( 1.07)
   64     1.00 [  0.00]( 0.59)     1.00 [ -0.14]( 1.18)
  128     1.00 [  0.00]( 1.19)     0.99 [ -1.04]( 0.93)
  256     1.00 [  0.00]( 0.16)     1.00 [ -0.18]( 0.34)
  512     1.00 [  0.00]( 0.20)     0.99 [ -0.62]( 0.02)
 1024     1.00 [  0.00]( 0.06)     1.00 [ -0.49]( 0.37)


==================================================================
Test          : stream-10
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:       tip[pct imp](CV)    place-initial-fix[pct imp](CV)
 Copy     1.00 [  0.00]( 6.04)     1.00 [ -0.21]( 7.98)
Scale     1.00 [  0.00]( 5.44)     0.99 [ -0.75]( 5.75)
  Add     1.00 [  0.00]( 5.44)     0.99 [ -1.48]( 5.40)
Triad     1.00 [  0.00]( 7.82)     1.02 [  2.21]( 8.33)


==================================================================
Test          : stream-100
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:       tip[pct imp](CV)    place-initial-fix[pct imp](CV)
 Copy     1.00 [  0.00]( 1.14)     1.00 [  0.40]( 1.12)
Scale     1.00 [  0.00]( 4.60)     1.01 [  1.05]( 4.99)
  Add     1.00 [  0.00]( 4.91)     1.00 [ -0.14]( 4.97)
Triad     1.00 [  0.00]( 0.60)     0.96 [ -3.53]( 6.13)


==================================================================
Test          : netperf
Units         : Normalized Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:         tip[pct imp](CV)    place-initial-fix[pct imp](CV)
 1-clients     1.00 [  0.00]( 0.61)     1.00 [  0.40]( 0.75)
 2-clients     1.00 [  0.00]( 0.44)     1.00 [ -0.47]( 0.91)
 4-clients     1.00 [  0.00]( 0.75)     1.00 [ -0.23]( 0.84)
 8-clients     1.00 [  0.00]( 0.65)     1.00 [ -0.07]( 0.62)
16-clients     1.00 [  0.00]( 0.49)     1.00 [ -0.29]( 0.56)
32-clients     1.00 [  0.00]( 0.57)     1.00 [ -0.14]( 0.46)
64-clients     1.00 [  0.00]( 1.67)     1.00 [ -0.14]( 1.81)
128-clients    1.00 [  0.00]( 1.11)     1.01 [  0.64]( 1.04)
256-clients    1.00 [  0.00]( 2.64)     0.99 [ -1.29]( 5.25)
512-clients    1.00 [  0.00](52.49)     0.99 [ -0.57](53.01)


==================================================================
Test          : schbench
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers: tip[pct imp](CV)    place-initial-fix[pct imp](CV)
  1     1.00 [ -0.00]( 8.41)     1.05 [ -5.41](13.45)
  2     1.00 [ -0.00]( 5.29)     0.88 [ 12.50](13.21)
  4     1.00 [ -0.00]( 1.32)     1.00 [ -0.00]( 4.80)
  8     1.00 [ -0.00]( 9.52)     0.94 [  6.25]( 8.85)
 16     1.00 [ -0.00]( 1.61)     0.97 [  3.23]( 5.00)
 32     1.00 [ -0.00]( 7.27)     0.88 [ 12.50]( 2.30)
 64     1.00 [ -0.00]( 6.96)     1.07 [ -6.94]( 4.94)
128     1.00 [ -0.00]( 3.41)     0.99 [  1.44]( 2.69)
256     1.00 [ -0.00](32.95)     0.81 [ 19.17](16.38)
512     1.00 [ -0.00]( 3.20)     0.98 [  1.66]( 2.35)


==================================================================
Test          : ycsb-cassandra
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
metric          tip    place-initial-fix(%diff)
throughput      1.00    0.99 (%diff: -0.67%)


==================================================================
Test          : ycsb-mongodb
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
metric          tip    place-initial-fix(%diff)
throughput      1.00    0.99 (%diff: -0.68%)


==================================================================
Test          : DeathStarBench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
Note	      : Comparisons contain additional commit 8dafa9d0eb1a
		("sched/eevdf: Fix min_deadline heap integrity") from
		tip:sched/urgent to fix an EEVDF issue being hit
==================================================================
Pinning      scaling    tip     place-initial-fix (%diff)
1CCD            1       1.00    1.00 (%diff: -0.09%)
2CCD            2       1.00    1.02 (%diff: 2.46%)
4CCD            4       1.00    1.00 (%diff: 0.45%)
8CCD            8       1.00    1.00 (%diff: -0.46%)

--

On 10/4/2023 6:39 PM, Daniel Jordan wrote:
> An entity is supposed to get an earlier deadline with
> PLACE_DEADLINE_INITIAL when it's forked, but the deadline gets
> overwritten soon after in enqueue_entity() the first time a forked
> entity is woken so that PLACE_DEADLINE_INITIAL is effectively a no-op.
> 
> Placing in task_fork_fair() seems unnecessary since none of the values
> that get set (slice, vruntime, deadline) are used before they're set
> again at enqueue time, so get rid of that (and with it all of
> task_fork_fair()) and just pass ENQUEUE_INITIAL to enqueue_entity() via
> wake_up_new_task().
> 
> Fixes: e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement")
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

> ---
> 
> v2
>  - place_entity() seems like the only reason for task_fork_fair() to exist
>    after the recent removal of sysctl_sched_child_runs_first, so take out
>    the whole function.
> 
> Still based on today's peterz/sched/eevdf
> 
>  kernel/sched/core.c |  2 +-
>  kernel/sched/fair.c | 24 ------------------------
>  2 files changed, 1 insertion(+), 25 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 779cdc7969c81..500e2dbfd41dd 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4854,7 +4854,7 @@ void wake_up_new_task(struct task_struct *p)
>  	update_rq_clock(rq);
>  	post_init_entity_util_avg(p);
>  
> -	activate_task(rq, p, ENQUEUE_NOCLOCK);
> +	activate_task(rq, p, ENQUEUE_INITIAL | ENQUEUE_NOCLOCK);
>  	trace_sched_wakeup_new(p);
>  	wakeup_preempt(rq, p, WF_FORK);
>  #ifdef CONFIG_SMP
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a0b4dac2662c9..3827b302eeb9b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12427,29 +12427,6 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>  	task_tick_core(rq, curr);
>  }
>  
> -/*
> - * called on fork with the child task as argument from the parent's context
> - *  - child not yet on the tasklist
> - *  - preemption disabled
> - */
> -static void task_fork_fair(struct task_struct *p)
> -{
> -	struct sched_entity *se = &p->se, *curr;
> -	struct cfs_rq *cfs_rq;
> -	struct rq *rq = this_rq();
> -	struct rq_flags rf;
> -
> -	rq_lock(rq, &rf);
> -	update_rq_clock(rq);
> -
> -	cfs_rq = task_cfs_rq(current);
> -	curr = cfs_rq->curr;
> -	if (curr)
> -		update_curr(cfs_rq);
> -	place_entity(cfs_rq, se, ENQUEUE_INITIAL);
> -	rq_unlock(rq, &rf);
> -}
> -
>  /*
>   * Priority of the task has changed. Check to see if we preempt
>   * the current task.
> @@ -12953,7 +12930,6 @@ DEFINE_SCHED_CLASS(fair) = {
>  #endif
>  
>  	.task_tick		= task_tick_fair,
> -	.task_fork		= task_fork_fair,
>  
>  	.prio_changed		= prio_changed_fair,
>  	.switched_from		= switched_from_fair,

--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Re: [PATCH 03/15] sched/fair: Add lag based placement
  2023-10-11 13:24     ` Peter Zijlstra
@ 2023-10-12  7:04       ` Abel Wu
  2023-10-13  7:37         ` Peter Zijlstra
  0 siblings, 1 reply; 104+ messages in thread
From: Abel Wu @ 2023-10-12  7:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel, efault, tglx

On 10/11/23 9:24 PM, Peter Zijlstra Wrote:
> On Wed, Oct 11, 2023 at 08:00:22PM +0800, Abel Wu wrote:
>> On 5/31/23 7:58 PM, Peter Zijlstra Wrote:
>>>    		/*
>>> +		 * If we want to place a task and preserve lag, we have to
>>> +		 * consider the effect of the new entity on the weighted
>>> +		 * average and compensate for this, otherwise lag can quickly
>>> +		 * evaporate.
>>> +		 *
>>> +		 * Lag is defined as:
>>> +		 *
>>> +		 *   lag_i = S - s_i = w_i * (V - v_i)
>>> +		 *
>>> +		 * To avoid the 'w_i' term all over the place, we only track
>>> +		 * the virtual lag:
>>> +		 *
>>> +		 *   vl_i = V - v_i <=> v_i = V - vl_i
>>> +		 *
>>> +		 * And we take V to be the weighted average of all v:
>>> +		 *
>>> +		 *   V = (\Sum w_j*v_j) / W
>>> +		 *
>>> +		 * Where W is: \Sum w_j
>>> +		 *
>>> +		 * Then, the weighted average after adding an entity with lag
>>> +		 * vl_i is given by:
>>> +		 *
>>> +		 *   V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
>>> +		 *      = (W*V + w_i*(V - vl_i)) / (W + w_i)
>>> +		 *      = (W*V + w_i*V - w_i*vl_i) / (W + w_i)
>>> +		 *      = (V*(W + w_i) - w_i*vl_i) / (W + w_i)
>>> +		 *      = V - w_i*vl_i / (W + w_i)
>>> +		 *
>>> +		 * And the actual lag after adding an entity with vl_i is:
>>> +		 *
>>> +		 *   vl'_i = V' - v_i
>>> +		 *         = V - w_i*vl_i / (W + w_i) - (V - vl_i)
>>> +		 *         = vl_i - w_i*vl_i / (W + w_i)
>>> +		 *
>>> +		 * Which is strictly less than vl_i. So in order to preserve lag
>>
>> Maybe a stupid question, but why vl'_i < vl_i? Since vl_i can be negative.
> 
> So the below doesn't care about the sign, it simply inverts this
> relation to express vl_i in terms of vl'_i:
> 
>>> +		 * we should inflate the lag before placement such that the
>>> +		 * effective lag after placement comes out right.
>>> +		 *
>>> +		 * As such, invert the above relation for vl'_i to get the vl_i
>>> +		 * we need to use such that the lag after placement is the lag
>>> +		 * we computed before dequeue.
>>> +		 *
>>> +		 *   vl'_i = vl_i - w_i*vl_i / (W + w_i)
>>> +		 *         = ((W + w_i)*vl_i - w_i*vl_i) / (W + w_i)
>>> +		 *
>>> +		 *   (W + w_i)*vl'_i = (W + w_i)*vl_i - w_i*vl_i
>>> +		 *                   = W*vl_i
>>> +		 *
>>> +		 *   vl_i = (W + w_i)*vl'_i / W
> 
> And then we obtain the scale factor: (W + w_i)/W, which is >1, right?

Yeah, I see. But the scale factor is only for the to-be-placed entity.
Say there is an entity k on the tree:

	vl_k	= V - v_k

adding the to-be-placed entity i affects this by:

	define delta := w_i*vl_i / (W + w_i)

	vl'_k	= V' - v_k
		= V - delta - (V - vl_k)
		= vl_k - delta

hence for any entity on the tree, its lag is offset by @delta. So
I wonder if we should simply do offsetting rather than scaling.
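
To make that concrete with made-up numbers: say three nice-0 entities
are queued (W = 3072) and we place a w_i = 1024 entity with stored lag
vl'_i = 6. Scaling gives vl_i = (W + w_i)/W * 6 = 8, hence

	delta = w_i*vl_i / (W + w_i) = 1024*8/4096 = 2

V moves back by 2, the placed entity ends up with lag 8 - 2 = 6 as
intended, and every entity already on the tree sees its lag shrink by
that same 2.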

> 
> As such, that means that vl'_i must be smaller than vl_i in the absolute
> sense, irrespective of sign.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Re: [PATCH] sched/fair: fix pick_eevdf to always find the correct se
  2023-10-11 13:14       ` Peter Zijlstra
@ 2023-10-12 10:04         ` Abel Wu
  0 siblings, 0 replies; 104+ messages in thread
From: Abel Wu @ 2023-10-12 10:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Benjamin Segall, mingo, vincent.guittot, linux-kernel,
	juri.lelli, dietmar.eggemann, rostedt, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel, efault, tglx

On 10/11/23 9:14 PM, Peter Zijlstra Wrote:
> On Wed, Oct 11, 2023 at 08:12:09PM +0800, Abel Wu wrote:
> 
> As the paper explains, you get two walks, one down the eligibility path,
> and then one down the heap. I think the current code structure
> represents that fairly well.

Yes, it does. I just wonder if the 2-step search is necessary, since
both steps obey the same heap-search rule:

   1) node->min_deadline > node->left->min_deadline
	1.1) BUG

   2) node->min_deadline = node->left->min_deadline
	2.1) go left if tiebreak on vruntime

   3) node->min_deadline < node->left->min_deadline
	3.1) return @node if it has min deadline, or
	3.2) go right

which gives:

	while (node) {
		if ((left = node->left)) {
			/* 1.1 */
			BUG_ON(left->min < node->min);

			/* 2.1 */
			if (left->min == node->min) {
				node = left;
				continue;
			}
		}

		/* 3.1 */
		if (node->deadline == node->min)
			return node;

		/* 3.2 */
		node = node->right;
	}

The above returns the entity with the earliest deadline (and with the
smallest vruntime if the deadlines are the same). Then comes eligibility:

   0) it helps pruning the tree since the right subtree of a
      non-eligible node can't contain any eligible node.

   3.2.1) record left as a fallback iff the eligibility check
          is active; saving the best one is enough since none
          of them contains a non-eligible node, IOW the one
          with min deadline in the left tree must be eligible.

   4) the eligibility check ends immediately once we go left from
      an eligible node, including switching to the fallback which
      essentially is the 'left' of an eligible node.

   5) fall back to the candidate (if it exists) if we failed to find
      an eligible entity with the earliest deadline.

which makes:

	candidate = NULL;
	need_check = true;

	while (node) {
		/* 0 */
		if (need_check && !eligible(node)) {
			node = node->left;
			goto next;
		}

		if ((left = node->left)) {
			/* 1.1 */
			BUG_ON(left->min < node->min);

			/* 2.1 */
			if (left->min == node->min) {
				node = left;
				/* 4 */
				need_check = false;
				continue;
			}

			/* 3.2.1 */
			if (need_check)
				candidate = better(candidate, left);
		}

		/* 3.1 */
		if (node->deadline == node->min)
			return node;

		/* 3.2 */
		node = node->right;
	next:
		/* 5 */
		if (!node && candidate) {
			node = candidate;
			need_check = false; /* 4 */
			candidate = NULL;
		}
	}

The search ends with a 'best' entity on the tree; comparing it with
curr, which is not on the tree, completes the pick.

But it's absolutely fine to me to honor the 2-step search given by
the paper if you think that is already clear enough :)

Best,
	Abel

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Re: [PATCH] sched/fair: fix pick_eevdf to always find the correct se
  2023-10-11 21:01       ` Benjamin Segall
@ 2023-10-12 10:25         ` Abel Wu
  2023-10-12 17:51           ` Benjamin Segall
  0 siblings, 1 reply; 104+ messages in thread
From: Abel Wu @ 2023-10-12 10:25 UTC (permalink / raw)
  To: Benjamin Segall
  Cc: Peter Zijlstra, mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, mgorman, bristot, corbet, qyousef,
	chris.hyser, patrick.bellasi, pjt, pavel, qperret, tim.c.chen,
	joshdon, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel,
	efault, tglx

On 10/12/23 5:01 AM, Benjamin Segall Wrote:
> Abel Wu <wuyun.abel@bytedance.com> writes:
> 
>> On 9/30/23 8:09 AM, Benjamin Segall Wrote:
>>> +	/*
>>> +	 * Now best_left and all of its children are eligible, and we are just
>>> +	 * looking for deadline == min_deadline
>>> +	 */
>>> +	node = &best_left->run_node;
>>> +	while (node) {
>>> +		struct sched_entity *se = __node_2_se(node);
>>> +
>>> +		/* min_deadline is the current node */
>>> +		if (se->deadline == se->min_deadline)
>>> +			return se;
>>
>> IMHO it would be better tiebreak on vruntime by moving this hunk to ..
>>
>>> +
>>> +		/* min_deadline is in the left branch */
>>>    		if (node->rb_left &&
>>>    		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
>>>    			node = node->rb_left;
>>>    			continue;
>>>    		}
>>
>> .. here, thoughts?
> 
> Yeah, that should work and be better on the tiebreak (and my test code
> agrees). There's an argument that the tiebreak will never really come up
> and it's better to avoid the potential one extra cache line from
> "__node_2_se(node->rb_left)->min_deadline" though.

I see. Then probably do the same thing in the first loop?

> 
>>
>>>    +		/* else min_deadline is in the right branch */
>>>    		node = node->rb_right;
>>>    	}
>>> +	return NULL;
>>
>> Why not 'best'? Since ..
> 
> The only time this can happen is if the tree is corrupt. We only reach
> this case if best_left is set at all (and best_left's min_deadline is
> better than "best"'s, which includes curr). In that case getting an
> error message is good, and in general I wasn't worrying about it much.

Right.

> 
>>
>>> +}
>>>    -	if (!best || (curr && deadline_gt(deadline, best, curr)))
>>> -		best = curr;
>>> +static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
>>> +{
>>> +	struct sched_entity *se = __pick_eevdf(cfs_rq);
>>>    -	if (unlikely(!best)) {
>>> +	if (!se) {
>>>    		struct sched_entity *left = __pick_first_entity(cfs_rq);
>>
>> .. cfs_rq->curr isn't considered here.
> 
> That said, we should probably consider curr here in the error-case
> fallback, if just as a "if (!left) left = cfs_rq->curr;"

I don't think so; if we get here there must be some bug in the scheduler,
so replacing 'pr_err' with 'BUG()' would be more appropriate.

> 
> 
> I've also attached my ugly userspace EEVDF tester as an attachment,
> hopefully I attached it in a correct mode to go through lkml.

Received. Thanks, Ben.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH] sched/fair: fix pick_eevdf to always find the correct se
  2023-10-12 10:25         ` Abel Wu
@ 2023-10-12 17:51           ` Benjamin Segall
  2023-10-13  3:46             ` Abel Wu
  0 siblings, 1 reply; 104+ messages in thread
From: Benjamin Segall @ 2023-10-12 17:51 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, mgorman, bristot, corbet, qyousef,
	chris.hyser, patrick.bellasi, pjt, pavel, qperret, tim.c.chen,
	joshdon, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel,
	efault, tglx

Abel Wu <wuyun.abel@bytedance.com> writes:

> On 10/12/23 5:01 AM, Benjamin Segall Wrote:
>> Abel Wu <wuyun.abel@bytedance.com> writes:
>> 
>>> On 9/30/23 8:09 AM, Benjamin Segall Wrote:
>>>> +	/*
>>>> +	 * Now best_left and all of its children are eligible, and we are just
>>>> +	 * looking for deadline == min_deadline
>>>> +	 */
>>>> +	node = &best_left->run_node;
>>>> +	while (node) {
>>>> +		struct sched_entity *se = __node_2_se(node);
>>>> +
>>>> +		/* min_deadline is the current node */
>>>> +		if (se->deadline == se->min_deadline)
>>>> +			return se;
>>>
>>> IMHO it would be better tiebreak on vruntime by moving this hunk to ..
>>>
>>>> +
>>>> +		/* min_deadline is in the left branch */
>>>>    		if (node->rb_left &&
>>>>    		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
>>>>    			node = node->rb_left;
>>>>    			continue;
>>>>    		}
>>>
>>> .. here, thoughts?
>> Yeah, that should work and be better on the tiebreak (and my test code
>> agrees). There's an argument that the tiebreak will never really come up
>> and it's better to avoid the potential one extra cache line from
>> "__node_2_se(node->rb_left)->min_deadline" though.
>
> I see. Then probably do the same thing in the first loop?
>

We effectively do that already sorta by accident almost always -
computing best and best_left via deadline_gt rather than gte prioritizes
earlier elements, which always have a better vruntime.

Then when we do the best_left->min_deadline vs best->deadline
computation, we prioritize best_left, which is the one case it can be
wrong; we'd need an additional
"if (se->min_deadline == best->deadline &&
(s64)(se->vruntime - best->vruntime) > 0) return best;" check at the end
of the second loop.

(Though again I don't know how much this sort of never-going-to-happen
slight fairness improvement is worth compared to the extra bit of
overhead)

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 03/15] sched/fair: Add lag based placement
  2023-05-31 11:58 ` [PATCH 03/15] sched/fair: Add lag based placement Peter Zijlstra
  2023-08-10  7:10   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2023-10-11 12:00   ` [PATCH 03/15] " Abel Wu
@ 2023-10-12 19:15   ` Benjamin Segall
  2023-10-12 22:34     ` Peter Zijlstra
  2023-10-13 14:34     ` Peter Zijlstra
  2 siblings, 2 replies; 104+ messages in thread
From: Benjamin Segall @ 2023-10-12 19:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, mgorman, bristot, corbet, qyousef,
	chris.hyser, patrick.bellasi, pjt, pavel, qperret, tim.c.chen,
	joshdon, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel,
	efault, tglx

Peter Zijlstra <peterz@infradead.org> writes:

> @@ -4853,49 +4872,119 @@ static void
>  place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
>  {
>  	u64 vruntime = avg_vruntime(cfs_rq);
> +	s64 lag = 0;
>  
> -	/* sleeps up to a single latency don't count. */
> -	if (!initial) {
> -		unsigned long thresh;
> +	/*
> +	 * Due to how V is constructed as the weighted average of entities,
> +	 * adding tasks with positive lag, or removing tasks with negative lag
> +	 * will move 'time' backwards, this can screw around with the lag of
> +	 * other tasks.
> +	 *
> +	 * EEVDF: placement strategy #1 / #2
> +	 */

So the big problem with EEVDF #1 compared to #2/#3 and CFS (hacky though
it is) is that it creates a significant perverse incentive to yield or
spin until you see yourself be preempted, rather than just sleep (if you
have any competition on the cpu). If you go to sleep immediately after
doing work and happen to do so near the end of a slice (arguably what
you _want_ to have happen overall), then you have to pay that negative
lag in wakeup latency later, because it is maintained through any amount
of sleep. (#1 or similar is good for reweight/migrate of course)

#2 in theory could be abused by micro-sleeping right before you are
preempted, but that isn't something tasks can really predict, unlike
seeing more "don't go to sleep, just spin, the latency numbers are so
much better" nonsense.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 03/15] sched/fair: Add lag based placement
  2023-10-12 19:15   ` Benjamin Segall
@ 2023-10-12 22:34     ` Peter Zijlstra
  2023-10-13 16:35       ` Peter Zijlstra
  2023-10-13 14:34     ` Peter Zijlstra
  1 sibling, 1 reply; 104+ messages in thread
From: Peter Zijlstra @ 2023-10-12 22:34 UTC (permalink / raw)
  To: Benjamin Segall
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, mgorman, bristot, corbet, qyousef,
	chris.hyser, patrick.bellasi, pjt, pavel, qperret, tim.c.chen,
	joshdon, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel,
	efault, tglx

On Thu, Oct 12, 2023 at 12:15:12PM -0700, Benjamin Segall wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > @@ -4853,49 +4872,119 @@ static void
> >  place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> >  {
> >  	u64 vruntime = avg_vruntime(cfs_rq);
> > +	s64 lag = 0;
> >  
> > -	/* sleeps up to a single latency don't count. */
> > -	if (!initial) {
> > -		unsigned long thresh;
> > +	/*
> > +	 * Due to how V is constructed as the weighted average of entities,
> > +	 * adding tasks with positive lag, or removing tasks with negative lag
> > +	 * will move 'time' backwards, this can screw around with the lag of
> > +	 * other tasks.
> > +	 *
> > +	 * EEVDF: placement strategy #1 / #2
> > +	 */
> 
> So the big problem with EEVDF #1 compared to #2/#3 and CFS (hacky though
> it is) is that it creates a significant perverse incentive to yield or
> spin until you see yourself be preempted, rather than just sleep (if you
> have any competition on the cpu). If you go to sleep immediately after
> doing work and happen to do so near the end of a slice (arguably what
> you _want_ to have happen overall), then you have to pay that negative
> lag in wakeup latency later, because it is maintained through any amount
> of sleep. (#1 or similar is good for reweight/migrate of course)
> 
> #2 in theory could be abused by micro-sleeping right before you are
> preempted, but that isn't something tasks can really predict, unlike
> seeing more "don't go to sleep, just spin, the latency numbers are so
> much better" nonsense.

Right, so I do have this:

  https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=344944e06f11da25b49328825ed15fedd63036d3

That allows tasks to sleep away the lag -- with all the gnarly bits that
sleep time has. And it reliably fixes the above. However, it also
depresses a bunch of other stuff. Never a free lunch etc.

It is so far the least horrible of the things I've tried. 

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Re: [PATCH] sched/fair: fix pick_eevdf to always find the correct se
  2023-10-12 17:51           ` Benjamin Segall
@ 2023-10-13  3:46             ` Abel Wu
  2023-10-13 16:51               ` Benjamin Segall
  0 siblings, 1 reply; 104+ messages in thread
From: Abel Wu @ 2023-10-13  3:46 UTC (permalink / raw)
  To: Benjamin Segall
  Cc: Peter Zijlstra, mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, mgorman, bristot, corbet, qyousef,
	chris.hyser, patrick.bellasi, pjt, pavel, qperret, tim.c.chen,
	joshdon, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel,
	efault, tglx

On 10/13/23 1:51 AM, Benjamin Segall Wrote:
> Abel Wu <wuyun.abel@bytedance.com> writes:
> 
>> On 10/12/23 5:01 AM, Benjamin Segall Wrote:
>>> Abel Wu <wuyun.abel@bytedance.com> writes:
>>>
>>>> On 9/30/23 8:09 AM, Benjamin Segall Wrote:
>>>>> +	/*
>>>>> +	 * Now best_left and all of its children are eligible, and we are just
>>>>> +	 * looking for deadline == min_deadline
>>>>> +	 */
>>>>> +	node = &best_left->run_node;
>>>>> +	while (node) {
>>>>> +		struct sched_entity *se = __node_2_se(node);
>>>>> +
>>>>> +		/* min_deadline is the current node */
>>>>> +		if (se->deadline == se->min_deadline)
>>>>> +			return se;
>>>>
>>>> IMHO it would be better tiebreak on vruntime by moving this hunk to ..
>>>>
>>>>> +
>>>>> +		/* min_deadline is in the left branch */
>>>>>     		if (node->rb_left &&
>>>>>     		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
>>>>>     			node = node->rb_left;
>>>>>     			continue;
>>>>>     		}
>>>>
>>>> .. here, thoughts?
>>> Yeah, that should work and be better on the tiebreak (and my test code
>>> agrees). There's an argument that the tiebreak will never really come up
>>> and it's better to avoid the potential one extra cache line from
>>> "__node_2_se(node->rb_left)->min_deadline" though.
>>
>> I see. Then probably do the same thing in the first loop?
>>
> 
> We effectively do that already sorta by accident almost always -
> computing best and best_left via deadline_gt rather than gte prioritizes
> earlier elements, which always have a better vruntime.

Sorry for not being clearer about the 'same thing'. What I meant
was to avoid touching the left branch if the node itself has the min deadline.

@@ -894,6 +894,9 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
                 if (!best || deadline_gt(deadline, best, se))
                         best = se;

+               if (se->deadline == se->min_deadline)
+                       break;
+
                 /*
                  * Every se in a left branch is eligible, keep track of the
                  * branch with the best min_deadline
@@ -913,10 +916,6 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
                                 break;
                 }

-               /* min_deadline is at this node, no need to look right */
-               if (se->deadline == se->min_deadline)
-                       break;
-
                 /* else min_deadline is in the right branch. */
                 node = node->rb_right;
         }

(But still thanks for the convincing explanation on fairness.)

Best,
	Abel

> 
> Then when we do the best_left->min_deadline vs best->deadline
> computation, we prioritize best_left, which is the one case it can be
> wrong; we'd need an additional
> "if (se->min_deadline == best->deadline &&
> (s64)(se->vruntime - best->vruntime) > 0) return best;" check at the end
> of the second loop.
> 
> (Though again I don't know how much this sort of never-going-to-happen
> slight fairness improvement is worth compared to the extra bit of
> overhead)

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Re: [PATCH 03/15] sched/fair: Add lag based placement
  2023-10-12  7:04       ` Abel Wu
@ 2023-10-13  7:37         ` Peter Zijlstra
  2023-10-13  8:14           ` Abel Wu
  0 siblings, 1 reply; 104+ messages in thread
From: Peter Zijlstra @ 2023-10-13  7:37 UTC (permalink / raw)
  To: Abel Wu
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel, efault, tglx

On Thu, Oct 12, 2023 at 03:04:47PM +0800, Abel Wu wrote:
> On 10/11/23 9:24 PM, Peter Zijlstra Wrote:

> > > > +		 * we should inflate the lag before placement such that the
> > > > +		 * effective lag after placement comes out right.
> > > > +		 *
> > > > +		 * As such, invert the above relation for vl'_i to get the vl_i
> > > > +		 * we need to use such that the lag after placement is the lag
> > > > +		 * we computed before dequeue.
> > > > +		 *
> > > > +		 *   vl'_i = vl_i - w_i*vl_i / (W + w_i)
> > > > +		 *         = ((W + w_i)*vl_i - w_i*vl_i) / (W + w_i)
> > > > +		 *
> > > > +		 *   (W + w_i)*vl'_i = (W + w_i)*vl_i - w_i*vl_i
> > > > +		 *                   = W*vl_i
> > > > +		 *
> > > > +		 *   vl_i = (W + w_i)*vl'_i / W
> > 
> > And then we obtain the scale factor: (W + w_i)/W, which is >1, right?
> 
> Yeah, I see. But the scale factor is only for the to-be-placed entity.
> Say there is an entity k on the tree:
> 
> 	vl_k	= V - v_k
> 
> adding the to-be-placed entity i affects this by:
> 
> 	define delta := w_i*vl_i / (W + w_i)
> 
> 	vl'_k	= V' - v_k
> 		= V - delta - (V - vl_k)
> 		= vl_k - delta
> 
> hence for any entity on the tree, its lag is offset by @delta. So
> I wonder if we should simply do offsetting rather than scaling.

I don't see the point, the result is the same and computing delta seems
numerically less stable.
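
In code form, the inflation step being discussed is essentially the
following -- a simplified sketch that leaves out the cfs_rq::curr
contribution to the load, the zero-load guard and the 64-bit division
helpers:

	/*
	 * 'vlag' is the lag remembered at dequeue (vl'_i above), 'W' the
	 * pre-placement weight sum of the queue and 'w_i' the weight of the
	 * entity being placed.  Returns the vl_i to use at placement so that
	 * the post-placement lag comes out as vl'_i again.
	 */
	static inline s64 inflate_vlag(s64 vlag, unsigned long W, unsigned long w_i)
	{
		return vlag * (s64)(W + w_i) / (s64)W;
	}

Placement then subtracts the inflated value, se->vruntime = V -
inflate_vlag(...), and once V is re-averaged over W + w_i the entity is
left with the lag it had before dequeue.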

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: Re: [PATCH 03/15] sched/fair: Add lag based placement
  2023-10-13  7:37         ` Peter Zijlstra
@ 2023-10-13  8:14           ` Abel Wu
  0 siblings, 0 replies; 104+ messages in thread
From: Abel Wu @ 2023-10-13  8:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel, efault, tglx

On 10/13/23 3:37 PM, Peter Zijlstra Wrote:
> On Thu, Oct 12, 2023 at 03:04:47PM +0800, Abel Wu wrote:
>> On 10/11/23 9:24 PM, Peter Zijlstra Wrote:
> 
>>>>> +		 * we should inflate the lag before placement such that the
>>>>> +		 * effective lag after placement comes out right.
>>>>> +		 *
>>>>> +		 * As such, invert the above relation for vl'_i to get the vl_i
>>>>> +		 * we need to use such that the lag after placement is the lag
>>>>> +		 * we computed before dequeue.
>>>>> +		 *
>>>>> +		 *   vl'_i = vl_i - w_i*vl_i / (W + w_i)
>>>>> +		 *         = ((W + w_i)*vl_i - w_i*vl_i) / (W + w_i)
>>>>> +		 *
>>>>> +		 *   (W + w_i)*vl'_i = (W + w_i)*vl_i - w_i*vl_i
>>>>> +		 *                   = W*vl_i
>>>>> +		 *
>>>>> +		 *   vl_i = (W + w_i)*vl'_i / W
>>>
>>> And then we obtain the scale factor: (W + w_i)/W, which is >1, right?
>>
>> Yeah, I see. But the scale factor is only for the to-be-placed entity.
>> Say there is an entity k on the tree:
>>
>> 	vl_k	= V - v_k
>>
>> adding the to-be-placed entity i affects this by:
>>
>> 	define delta := w_i*vl_i / (W + w_i)
>>
>> 	vl'_k	= V' - v_k
>> 		= V - delta - (V - vl_k)
>> 		= vl_k - delta
>>
>> hence for any entity on the tree, its lag is offset by @delta. So
>> I wonder if we should simply do offsetting rather than scaling.
> 
> I don't see the point, the result is the same and computing delta seems
> numerically less stable.

Right. I was not myself then; please forget what I said.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 03/15] sched/fair: Add lag based placement
  2023-10-12 19:15   ` Benjamin Segall
  2023-10-12 22:34     ` Peter Zijlstra
@ 2023-10-13 14:34     ` Peter Zijlstra
  1 sibling, 0 replies; 104+ messages in thread
From: Peter Zijlstra @ 2023-10-13 14:34 UTC (permalink / raw)
  To: Benjamin Segall
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, mgorman, bristot, corbet, qyousef,
	chris.hyser, patrick.bellasi, pjt, pavel, qperret, tim.c.chen,
	joshdon, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel,
	efault, tglx

On Thu, Oct 12, 2023 at 12:15:12PM -0700, Benjamin Segall wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > @@ -4853,49 +4872,119 @@ static void
> >  place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> >  {
> >  	u64 vruntime = avg_vruntime(cfs_rq);
> > +	s64 lag = 0;
> >  
> > -	/* sleeps up to a single latency don't count. */
> > -	if (!initial) {
> > -		unsigned long thresh;
> > +	/*
> > +	 * Due to how V is constructed as the weighted average of entities,
> > +	 * adding tasks with positive lag, or removing tasks with negative lag
> > +	 * will move 'time' backwards, this can screw around with the lag of
> > +	 * other tasks.
> > +	 *
> > +	 * EEVDF: placement strategy #1 / #2
> > +	 */
> 
> So the big problem with EEVDF #1 compared to #2/#3 and CFS (hacky though
> it is) is that it creates a significant perverse incentive to yield or
> spin until you see yourself be preempted, rather than just sleep (if you
> have any competition on the cpu). If you go to sleep immediately after
> doing work and happen to do so near the end of a slice (arguably what
> you _want_ to have happen overall), then you have to pay that negative
> lag in wakeup latency later, because it is maintained through any amount
> of sleep. (#1 or similar is good for reweight/migrate of course)
> 
> #2 in theory could be abused by micro-sleeping right before you are
> preempted, but that isn't something tasks can really predict, unlike
> seeing more "don't go to sleep, just spin, the latency numbers are so
> much better" nonsense.

For giggles (cyclictest vs hackbench):

$ echo PLACE_LAG > /debug/sched/features
$ ./doit-latency-slice.sh
# Running 'sched/messaging' benchmark:
slice 30000000
# /dev/cpu_dma_latency set to 0us
# Min Latencies: 00051
# Avg Latencies: 00819
# Max Latencies: 172558
slice 3000000
# /dev/cpu_dma_latency set to 0us
# Min Latencies: 00033
# Avg Latencies: 00407
# Max Latencies: 12024
slice 300000
# /dev/cpu_dma_latency set to 0us
# Min Latencies: 00055
# Avg Latencies: 00395
# Max Latencies: 11780


$ echo NO_PLACE_LAG > /debug/sched/features
$ ./doit-latency-slice.sh
# Running 'sched/messaging' benchmark:
slice 30000000
# /dev/cpu_dma_latency set to 0us
# Min Latencies: 00069
# Avg Latencies: 69071
# Max Latencies: 1492250
slice 3000000
# /dev/cpu_dma_latency set to 0us
# Min Latencies: 00062
# Avg Latencies: 10215
# Max Latencies: 21209
slice 300000
# /dev/cpu_dma_latency set to 0us
# Min Latencies: 00055
# Avg Latencies: 00060
# Max Latencies: 03088


IOW, insanely worse latencies in most cases. This is because when
everybody starts at 0-lag, everybody is always eligible, and 'fairness'
goes out the window fast.

Placement strategy #1 only really works when you have well-behaved
tasks (e.g. conforming to the periodic task model -- not waking up before
their time and all that).


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 03/15] sched/fair: Add lag based placement
  2023-10-12 22:34     ` Peter Zijlstra
@ 2023-10-13 16:35       ` Peter Zijlstra
  2023-10-14  8:08         ` Mike Galbraith
  0 siblings, 1 reply; 104+ messages in thread
From: Peter Zijlstra @ 2023-10-13 16:35 UTC (permalink / raw)
  To: Benjamin Segall
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, mgorman, bristot, corbet, qyousef,
	chris.hyser, patrick.bellasi, pjt, pavel, qperret, tim.c.chen,
	joshdon, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel,
	efault, tglx

On Fri, Oct 13, 2023 at 12:34:28AM +0200, Peter Zijlstra wrote:

> Right, so I do have this:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=344944e06f11da25b49328825ed15fedd63036d3
> 
> That allows tasks to sleep away the lag -- with all the gnarly bits that
> sleep time has. And it reliably fixes the above. However, it also
> depresses a bunch of other stuff. Never a free lunch etc.
> 
> It is so far the least horrible of the things I've tried. 

So the below is one I conceptually like more -- except I hate the code,
and it doesn't work as well as the one linked above.

(Mike, this isn't the same one you saw before -- it's been 'improved')

---

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 29daece54a74..7f17295931de 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -895,6 +895,7 @@ struct task_struct {
 	unsigned			sched_reset_on_fork:1;
 	unsigned			sched_contributes_to_load:1;
 	unsigned			sched_migrated:1;
+	unsigned			sched_delayed:1;
 
 	/* Force alignment to the next boundary: */
 	unsigned			:0;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7771a4d68280..38b2e0488a38 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3833,12 +3833,21 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
 
 	rq = __task_rq_lock(p, &rf);
 	if (task_on_rq_queued(p)) {
+		update_rq_clock(rq);
+		if (unlikely(p->sched_delayed)) {
+			p->sched_delayed = 0;
+			/* mustn't run a delayed task */
+			WARN_ON_ONCE(task_on_cpu(rq, p));
+			dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
+			if (p->se.vlag > 0)
+				p->se.vlag = 0;
+			enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
+		}
 		if (!task_on_cpu(rq, p)) {
 			/*
 			 * When on_rq && !on_cpu the task is preempted, see if
 			 * it should preempt the task that is current now.
 			 */
-			update_rq_clock(rq);
 			wakeup_preempt(rq, p, wake_flags);
 		}
 		ttwu_do_wakeup(p);
@@ -6520,6 +6529,16 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 # define SM_MASK_PREEMPT	SM_PREEMPT
 #endif
 
+static void __deschedule_task(struct rq *rq, struct task_struct *p)
+{
+	deactivate_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
+
+	if (p->in_iowait) {
+		atomic_inc(&rq->nr_iowait);
+		delayacct_blkio_start();
+	}
+}
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -6604,6 +6623,8 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 
 	switch_count = &prev->nivcsw;
 
+	WARN_ON_ONCE(prev->sched_delayed);
+
 	/*
 	 * We must load prev->state once (task_struct::state is volatile), such
 	 * that we form a control dependency vs deactivate_task() below.
@@ -6632,17 +6653,39 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 			 *
 			 * After this, schedule() must not care about p->state any more.
 			 */
-			deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
-
-			if (prev->in_iowait) {
-				atomic_inc(&rq->nr_iowait);
-				delayacct_blkio_start();
-			}
+			if (sched_feat(DELAY_DEQUEUE) &&
+			    prev->sched_class->eligible_task &&
+			    !prev->sched_class->eligible_task(rq, prev))
+				prev->sched_delayed = 1;
+			else
+				__deschedule_task(rq, prev);
 		}
 		switch_count = &prev->nvcsw;
 	}
 
-	next = pick_next_task(rq, prev, &rf);
+	for (struct task_struct *tmp = prev;;) {
+
+		next = pick_next_task(rq, tmp, &rf);
+		if (unlikely(tmp != prev))
+			finish_task(tmp);
+
+		if (likely(!next->sched_delayed))
+			break;
+
+		next->sched_delayed = 0;
+
+		/* ttwu_runnable() */
+		if (WARN_ON_ONCE(!next->__state))
+			break;
+
+		prepare_task(next);
+		smp_wmb();
+		__deschedule_task(rq, next);
+		if (next->se.vlag > 0)
+			next->se.vlag = 0;
+		tmp = next;
+	}
+
 	clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
 #ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b2210e7cc057..3084e21abfe7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8410,6 +8410,16 @@ static struct task_struct *__pick_next_task_fair(struct rq *rq)
 	return pick_next_task_fair(rq, NULL, NULL);
 }
 
+static bool eligible_task_fair(struct rq *rq, struct task_struct *p)
+{
+	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+	update_curr(cfs_rq);
+
+	return entity_eligible(cfs_rq, se);
+}
+
 /*
  * Account for a descheduled task:
  */
@@ -13006,6 +13016,7 @@ DEFINE_SCHED_CLASS(fair) = {
 
 	.wakeup_preempt		= check_preempt_wakeup_fair,
 
+	.eligible_task		= eligible_task_fair,
 	.pick_next_task		= __pick_next_task_fair,
 	.put_prev_task		= put_prev_task_fair,
 	.set_next_task          = set_next_task_fair,
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index a133b46efedd..0546905f1f8f 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -11,6 +11,7 @@ SCHED_FEAT(PREEMPT_SHORT, true)
 SCHED_FEAT(PLACE_SLEEPER, false)
 SCHED_FEAT(GENTLE_SLEEPER, true)
 SCHED_FEAT(EVDF, false)
+SCHED_FEAT(DELAY_DEQUEUE, true)
 
 /*
  * Prefer to schedule the task we woke last (assuming it failed
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 245df0c6d344..35d297e1d91b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2222,6 +2222,7 @@ struct sched_class {
 
 	void (*wakeup_preempt)(struct rq *rq, struct task_struct *p, int flags);
 
+	bool (*eligible_task)(struct rq *rq, struct task_struct *p);
 	struct task_struct *(*pick_next_task)(struct rq *rq);
 
 	void (*put_prev_task)(struct rq *rq, struct task_struct *p);
@@ -2275,7 +2276,7 @@ struct sched_class {
 
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
-	WARN_ON_ONCE(rq->curr != prev);
+//	WARN_ON_ONCE(rq->curr != prev);
 	prev->sched_class->put_prev_task(rq, prev);
 }
 

^ permalink raw reply related	[flat|nested] 104+ messages in thread

* Re: [PATCH] sched/fair: fix pick_eevdf to always find the correct se
  2023-10-13  3:46             ` Abel Wu
@ 2023-10-13 16:51               ` Benjamin Segall
  0 siblings, 0 replies; 104+ messages in thread
From: Benjamin Segall @ 2023-10-13 16:51 UTC (permalink / raw)
  To: Abel Wu
  Cc: Peter Zijlstra, mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, mgorman, bristot, corbet, qyousef,
	chris.hyser, patrick.bellasi, pjt, pavel, qperret, tim.c.chen,
	joshdon, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel,
	efault, tglx

Abel Wu <wuyun.abel@bytedance.com> writes:

> On 10/13/23 1:51 AM, Benjamin Segall Wrote:
>> Abel Wu <wuyun.abel@bytedance.com> writes:
>> 
>>> On 10/12/23 5:01 AM, Benjamin Segall Wrote:
>>>> Abel Wu <wuyun.abel@bytedance.com> writes:
>>>>
>>>>> On 9/30/23 8:09 AM, Benjamin Segall Wrote:
>>>>>> +	/*
>>>>>> +	 * Now best_left and all of its children are eligible, and we are just
>>>>>> +	 * looking for deadline == min_deadline
>>>>>> +	 */
>>>>>> +	node = &best_left->run_node;
>>>>>> +	while (node) {
>>>>>> +		struct sched_entity *se = __node_2_se(node);
>>>>>> +
>>>>>> +		/* min_deadline is the current node */
>>>>>> +		if (se->deadline == se->min_deadline)
>>>>>> +			return se;
>>>>>
>>>>> IMHO it would be better tiebreak on vruntime by moving this hunk to ..
>>>>>
>>>>>> +
>>>>>> +		/* min_deadline is in the left branch */
>>>>>>     		if (node->rb_left &&
>>>>>>     		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
>>>>>>     			node = node->rb_left;
>>>>>>     			continue;
>>>>>>     		}
>>>>>
>>>>> .. here, thoughts?
>>>> Yeah, that should work and be better on the tiebreak (and my test code
>>>> agrees). There's an argument that the tiebreak will never really come up
>>>> and it's better to avoid the potential one extra cache line from
>>>> "__node_2_se(node->rb_left)->min_deadline" though.
>>>
>>> I see. Then probably do the same thing in the first loop?
>>>
>> We effectively do that already sorta by accident almost always -
>> computing best and best_left via deadline_gt rather than gte prioritizes
>> earlier elements, which always have a better vruntime.
>
> Sorry for not being clearer about the 'same thing'. What I meant
> was to avoid touching the left branch if the node itself has the min deadline.
>
> @@ -894,6 +894,9 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
>                 if (!best || deadline_gt(deadline, best, se))
>                         best = se;
>
> +               if (se->deadline == se->min_deadline)
> +                       break;
> +
>                 /*
>                  * Every se in a left branch is eligible, keep track of the
>                  * branch with the best min_deadline
> @@ -913,10 +916,6 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
>                                 break;
>                 }
>
> -               /* min_deadline is at this node, no need to look right */
> -               if (se->deadline == se->min_deadline)
> -                       break;
> -
>                 /* else min_deadline is in the right branch. */
>                 node = node->rb_right;
>         }
>
> (But still thanks for the convincing explanation on fairness.)
>

Ah, yes, in terms of optimizing performance rather than marginal
fairness, that would help.

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 03/15] sched/fair: Add lag based placement
  2023-10-13 16:35       ` Peter Zijlstra
@ 2023-10-14  8:08         ` Mike Galbraith
  0 siblings, 0 replies; 104+ messages in thread
From: Mike Galbraith @ 2023-10-14  8:08 UTC (permalink / raw)
  To: Peter Zijlstra, Benjamin Segall
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, mgorman, bristot, corbet, qyousef,
	chris.hyser, patrick.bellasi, pjt, pavel, qperret, tim.c.chen,
	joshdon, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel,
	tglx

On Fri, 2023-10-13 at 18:35 +0200, Peter Zijlstra wrote:
> On Fri, Oct 13, 2023 at 12:34:28AM +0200, Peter Zijlstra wrote:
>
> > Right, so I do have this:
> >
> >   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=344944e06f11da25b49328825ed15fedd63036d3
> >
> > That allows tasks to sleep away the lag -- with all the gnarly bits that
> > sleep time has. And it reliably fixes the above. However, it also
> > depresses a bunch of other stuff. Never a free lunch etc.
> >
> > It is so far the least horrible of the things I've tried.
>
> So the below is one I conceptually like more -- except I hate the code,
> and it doesn't work as well as the one linked above.
>
> (Mike, this isn't the same one you saw before -- it's been 'improved')

Still improves high frequency switchers vs hogs nicely.

tbench vs massive_intr

6.4.16-stable                  avg
2353.57  2311.77  2399.27  2354.87  1.00

6.4.16-eevdf                   avg
2037.93  2014.57  2026.84  2026.44  .86   1.00  DELAY_DEQUEUE    v1
1893.53  1903.45  1851.57  1882.85  .79    .92  NO_DELAY_DEQUEUE
2193.33  2165.35  2201.82  2186.83  .92   1.07  DELAY_DEQUEUE    v2

It's barely visible in the mixed youtube vs compute load on an i7-4790
box, squinting required, and completely invisible on a dinky rpi4.

A clear win along with no harm to the mundane mix is a good start in my
book.  Here's hoping canned benchmark bots don't grumble.

	-Mike

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH] sched/fair: Always update_curr() before placing at enqueue
  2023-10-06 16:48   ` [PATCH] sched/fair: Always update_curr() before placing at enqueue Daniel Jordan
  2023-10-06 19:58     ` Peter Zijlstra
@ 2023-10-16  5:39     ` K Prateek Nayak
  1 sibling, 0 replies; 104+ messages in thread
From: K Prateek Nayak @ 2023-10-16  5:39 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: peterz, bristot, bsegall, chris.hyser, corbet, dietmar.eggemann,
	efault, joel, joshdon, juri.lelli, linux-kernel, mgorman, mingo,
	patrick.bellasi, pavel, pjt, qperret, qyousef, rostedt, tglx,
	tim.c.chen, timj, vincent.guittot, youssefesmat, yu.c.chen

Hello Daniel,

I see a good and consistent improvement in Stream (with shorter loops)
with this change. Everything else is more or less the same.

I'll leave the detailed results below.

On 10/6/2023 10:18 PM, Daniel Jordan wrote:
> Placing wants current's vruntime and the cfs_rq's min_vruntime up to
> date so that avg_vruntime() is too, and similarly it wants the entity to
> be re-weighted and lag adjusted so vslice and vlag are fresh, so always
> do update_curr() and update_cfs_group() beforehand.
> 
> There doesn't seem to be a reason to treat the 'curr' case specially
> after e8f331bcc270 since vruntime doesn't get normalized anymore.
> 
> Fixes: e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement")
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>

o Machine details

- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- C2 Disabled (POLL and C1(MWAIT) remained enabled)

o Kernel Details

- tip:	tip:sched/core at commit 3657680f38cd ("sched/psi: Delete the
	'update_total' function parameter from update_triggers()") +
	cherry-pick commit 8dafa9d0eb1a ("sched/eevdf: Fix min_deadline heap
	integrity") from tip:sched/urgent + cherry-pick commit b01db23d5923
	("sched/eevdf: Fix pick_eevdf()") from tip:sched/urgent

update_curr_opt: tip + this patch

o Benchmark Results

==================================================================
Test          : hackbench
Units         : Normalized time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
Case:           tip[pct imp](CV)    update_curr_opt[pct imp](CV)
 1-groups     1.00 [ -0.00]( 2.69)     1.01 [ -0.71]( 2.88)
 2-groups     1.00 [ -0.00]( 1.69)     1.01 [ -0.62]( 1.40)
 4-groups     1.00 [ -0.00]( 1.25)     1.01 [ -1.17]( 1.03)
 8-groups     1.00 [ -0.00]( 1.36)     1.00 [ -0.43]( 0.83)
16-groups     1.00 [ -0.00]( 1.44)     1.00 [ -0.13]( 2.32)


==================================================================
Test          : tbench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:    tip[pct imp](CV)    update_curr_opt[pct imp](CV)
    1     1.00 [  0.00]( 0.33)     1.01 [  0.52]( 1.51)
    2     1.00 [  0.00]( 0.22)     1.00 [ -0.01]( 0.37)
    4     1.00 [  0.00]( 0.25)     0.99 [ -0.71]( 0.60)
    8     1.00 [  0.00]( 0.71)     1.00 [ -0.26]( 0.36)
   16     1.00 [  0.00]( 0.79)     0.99 [ -1.21]( 0.77)
   32     1.00 [  0.00]( 0.94)     0.99 [ -0.82]( 1.46)
   64     1.00 [  0.00]( 1.76)     0.99 [ -0.92]( 1.25)
  128     1.00 [  0.00]( 0.68)     0.98 [ -2.22]( 1.19)
  256     1.00 [  0.00]( 1.23)     0.99 [ -1.43]( 0.79)
  512     1.00 [  0.00]( 0.28)     0.99 [ -0.93]( 0.14)
 1024     1.00 [  0.00]( 0.20)     0.99 [ -1.44]( 0.41)


==================================================================
Test          : stream-10
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:       tip[pct imp](CV)    update_curr_opt[pct imp](CV)
 Copy     1.00 [  0.00](11.88)     1.22 [ 21.92]( 7.37)
Scale     1.00 [  0.00]( 7.01)     1.04 [  4.02]( 4.89)
  Add     1.00 [  0.00]( 6.56)     1.11 [ 11.03]( 4.77)
Triad     1.00 [  0.00]( 8.81)     1.14 [ 14.12]( 3.89)


==================================================================
Test          : stream-100
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:       tip[pct imp](CV)    update_curr_opt[pct imp](CV)
 Copy     1.00 [  0.00]( 1.07)     1.01 [  0.77]( 1.59)
Scale     1.00 [  0.00]( 4.81)     0.97 [ -2.99]( 7.18)
  Add     1.00 [  0.00]( 4.56)     0.98 [ -2.39]( 6.86)
Triad     1.00 [  0.00]( 1.78)     1.00 [ -0.35]( 4.22)


==================================================================
Test          : netperf
Units         : Normalized Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:         tip[pct imp](CV)    update_curr_opt[pct imp](CV)
 1-clients     1.00 [  0.00]( 0.62)     0.99 [ -1.03]( 0.24)
 2-clients     1.00 [  0.00]( 0.36)     1.00 [ -0.32]( 0.66)
 4-clients     1.00 [  0.00]( 0.31)     1.00 [ -0.17]( 0.44)
 8-clients     1.00 [  0.00]( 0.39)     1.00 [  0.24]( 0.67)
16-clients     1.00 [  0.00]( 0.58)     1.00 [  0.50]( 0.46)
32-clients     1.00 [  0.00]( 0.71)     1.01 [  0.54]( 0.66)
64-clients     1.00 [  0.00]( 2.13)     1.00 [  0.35]( 1.80)
128-clients    1.00 [  0.00]( 0.94)     0.99 [ -0.71]( 0.97)
256-clients    1.00 [  0.00]( 6.09)     1.01 [  1.28]( 3.41)
512-clients    1.00 [  0.00](55.28)     1.01 [  1.32](49.78)


==================================================================
Test          : schbench
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers: tip[pct imp](CV)    update_curr_opt[pct imp](CV)
  1     1.00 [ -0.00]( 2.91)     0.97 [  2.56]( 1.53)
  2     1.00 [ -0.00](21.78)     0.89 [ 10.53](10.23)
  4     1.00 [ -0.00]( 4.88)     1.07 [ -7.32]( 6.82)
  8     1.00 [ -0.00]( 2.49)     1.00 [ -0.00]( 9.53)
 16     1.00 [ -0.00]( 3.70)     1.02 [ -1.75]( 0.99)
 32     1.00 [ -0.00](12.65)     0.83 [ 16.51]( 4.41)
 64     1.00 [ -0.00]( 3.98)     0.97 [  2.59]( 8.27)
128     1.00 [ -0.00]( 1.49)     0.96 [  3.60]( 8.01)
256     1.00 [ -0.00](40.79)     0.80 [ 20.39](36.89)
512     1.00 [ -0.00]( 1.12)     0.98 [  2.20]( 0.75)


==================================================================
Test          : ycsb-cassandra
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
metric      tip     update_curr_opt (%diff)
throughput  1.00    1.00 (%diff: -0.45%)


==================================================================
Test          : ycsb-mongodb
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
metric      tip     update_curr_opt (%diff)
throughput  1.00    1.00 (%diff: -0.13%)


==================================================================
Test          : DeathStarBench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
Pinning   scaling   tip     update_curr_opt (%diff)
1CCD        1       1.00    1.01 (%diff: 0.57%)
2CCD        2       1.00    1.00 (%diff: -0.27%)
4CCD        4       1.00    1.00 (%diff: 0.06%)
8CCD        8       1.00    1.00 (%diff: 0.45%)

--
Feel free to include

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

I'll keep a lookout for future versions.

> ---
> 
> Not sure what the XXX above place_entity() is for, maybe it can go away?
> 
> Based on tip/sched/core.
> 
>  kernel/sched/fair.c | 14 ++------------
>  1 file changed, 2 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 04fbcbda97d5f..db2ca9bf9cc49 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5047,15 +5047,6 @@ static inline bool cfs_bandwidth_used(void);
>  static void
>  enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  {
> -	bool curr = cfs_rq->curr == se;
> -
> -	/*
> -	 * If we're the current task, we must renormalise before calling
> -	 * update_curr().
> -	 */
> -	if (curr)
> -		place_entity(cfs_rq, se, flags);
> -
>  	update_curr(cfs_rq);
>  
>  	/*
> @@ -5080,8 +5071,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	 * XXX now that the entity has been re-weighted, and it's lag adjusted,
>  	 * we can place the entity.
>  	 */
> -	if (!curr)
> -		place_entity(cfs_rq, se, flags);
> +	place_entity(cfs_rq, se, flags);
>  
>  	account_entity_enqueue(cfs_rq, se);
>  
> @@ -5091,7 +5081,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  
>  	check_schedstat_required();
>  	update_stats_enqueue_fair(cfs_rq, se, flags);
> -	if (!curr)
> +	if (cfs_rq->curr != se)
>  		__enqueue_entity(cfs_rq, se);
>  	se->on_rq = 1;
>  

--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr
  2023-10-10  0:51             ` Youssef Esmat
  2023-10-10  8:01               ` Peter Zijlstra
@ 2023-10-16 16:50               ` Peter Zijlstra
  1 sibling, 0 replies; 104+ messages in thread
From: Peter Zijlstra @ 2023-10-16 16:50 UTC (permalink / raw)
  To: Youssef Esmat
  Cc: Daniel Jordan, mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen, joel,
	efault, tglx


Sorry, I seem to have forgotten to reply to this part...

On Mon, Oct 09, 2023 at 07:51:03PM -0500, Youssef Esmat wrote:

> I think looking at the sched latency numbers alone does not show the
> complete picture. I ran the same input latency test again and tried to
> capture some of these numbers for the chrome processes.
> 
> EEVDF 1.5ms slice:
> 
> Input latency test result: 226ms
> perf sched latency:
> switches: 1,084,694
> avg:   1.139 ms
> max: 408.397 ms
> 
> EEVDF 6.0ms slice:
> 
> Input latency test result: 178ms
> perf sched latency:
> switches: 892,306
> avg:   1.145 ms
> max: 354.344 ms

> For our scenario, it is very expensive to interrupt UI threads. It
> will increase the input latency significantly. Lowering the scheduling
> latency at the cost of switching out important threads can be very
> detrimental in this workload. UI and input threads run with a nice
> value of -8.

> That said, this might not be beneficial for all workloads, and we are
> still trying our other workloads out.

Right, this seems to suggest something on your critical path (you should
trace that) has more than 3ms of compute in a single activation. 

Basically this means chrome is fairly fat on this critical path. But it
seems you know about that.

Anyway, once you know the 95th percentile activation length on your
critical path, you can indeed set your slice to that. This should be
readily available from trace data.
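
For completeness, requesting such a slice through the interface proposed in
[RFC][PATCH 15/15] (sched_attr::sched_runtime carrying the request) would
look roughly like the sketch below from userspace. The struct is declared
locally since libc traditionally ships no sched_setattr() wrapper, and
whether a plain SCHED_OTHER task may pass a non-zero sched_runtime this way
is exactly what that RFC patch proposes, so treat it as illustrative only:

	#include <stdint.h>
	#include <string.h>
	#include <sched.h>		/* SCHED_OTHER */
	#include <sys/syscall.h>
	#include <unistd.h>

	struct sched_attr {
		uint32_t size;
		uint32_t sched_policy;
		uint64_t sched_flags;
		int32_t  sched_nice;
		uint32_t sched_priority;
		uint64_t sched_runtime;		/* request/slice in ns, per the RFC */
		uint64_t sched_deadline;
		uint64_t sched_period;
	};

	/* Ask for the given slice for the calling task. */
	static int request_slice(uint64_t slice_ns)
	{
		struct sched_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.sched_policy = SCHED_OTHER;
		attr.sched_runtime = slice_ns;	/* e.g. 3000000 for 3ms */

		return syscall(SYS_sched_setattr, 0, &attr, 0);
	}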


^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [PATCH] sched/fair: Always update_curr() before placing at enqueue
  2023-10-06 19:58     ` Peter Zijlstra
@ 2023-10-18  0:43       ` Daniel Jordan
  0 siblings, 0 replies; 104+ messages in thread
From: Daniel Jordan @ 2023-10-18  0:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: bristot, bsegall, chris.hyser, corbet, dietmar.eggemann, efault,
	joel, joshdon, juri.lelli, kprateek.nayak, linux-kernel, mgorman,
	mingo, patrick.bellasi, pavel, pjt, qperret, qyousef, rostedt,
	tglx, tim.c.chen, timj, vincent.guittot, youssefesmat, yu.c.chen

On Fri, Oct 06, 2023 at 09:58:10PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 06, 2023 at 12:48:26PM -0400, Daniel Jordan wrote:
> > Placing wants current's vruntime and the cfs_rq's min_vruntime up to
> > date so that avg_vruntime() is too, and similarly it wants the entity to
> > be re-weighted and lag adjusted so vslice and vlag are fresh, so always
> > do update_curr() and update_cfs_group() beforehand.
> > 
> > There doesn't seem to be a reason to treat the 'curr' case specially
> > after e8f331bcc270 since vruntime doesn't get normalized anymore.
> > 
> > Fixes: e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement")
> > Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> > ---
> > 
> > Not sure what the XXX above place_entity() is for, maybe it can go away?
> > 
> > Based on tip/sched/core.
> > 
> >  kernel/sched/fair.c | 14 ++------------
> >  1 file changed, 2 insertions(+), 12 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 04fbcbda97d5f..db2ca9bf9cc49 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5047,15 +5047,6 @@ static inline bool cfs_bandwidth_used(void);
> >  static void
> >  enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >  {
> > -	bool curr = cfs_rq->curr == se;
> > -
> > -	/*
> > -	 * If we're the current task, we must renormalise before calling
> > -	 * update_curr().
> > -	 */
> > -	if (curr)
> > -		place_entity(cfs_rq, se, flags);
> > -
> >  	update_curr(cfs_rq);
> 
> IIRC part of the reason for this order is the:
> 
>   dequeue
>   update
>   enqueue
> 
> pattern we have all over the place. You don't want the enqueue to move
> time forward in this case.
> 
> Could be that all magically works, but please double check.

Yes, I wasn't thinking of the dequeue/update/enqueue places.
Considering these, it seems like there's more to fix (from before EEVDF
even).

Sorry for the delayed response, been staring for a while thinking I'd
have it all by the next day.  It'll take a bit longer to sort out all
the cases, but I'll keep going.
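
For reference, the dequeue/update/enqueue pattern referred to above has
roughly this shape (a simplified sketch of the attribute-change paths in
kernel/sched/core.c such as set_user_nice(); locking and the
running/set_next_task() half are elided):

	queued = task_on_rq_queued(p);
	if (queued)
		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);

	/* ... change p's weight, policy or other attributes here ... */

	if (queued)
		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);

The worry above is that the enqueue half must not move time forward for
what is conceptually a single attribute update on an already-queued task.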

^ permalink raw reply	[flat|nested] 104+ messages in thread

end of thread, other threads:[~2023-10-18  0:44 UTC | newest]

Thread overview: 104+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-31 11:58 [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr Peter Zijlstra
2023-05-31 11:58 ` [PATCH 01/15] sched/fair: Add avg_vruntime Peter Zijlstra
2023-06-02 13:51   ` Vincent Guittot
2023-06-02 14:27     ` Peter Zijlstra
2023-06-05  7:18       ` Vincent Guittot
2023-08-10  7:10   ` [tip: sched/core] sched/fair: Add cfs_rq::avg_vruntime tip-bot2 for Peter Zijlstra
2023-10-11  4:15   ` [PATCH 01/15] sched/fair: Add avg_vruntime Abel Wu
2023-10-11  7:30     ` Peter Zijlstra
2023-10-11  8:30       ` Abel Wu
2023-10-11  9:45         ` Peter Zijlstra
2023-10-11 10:05           ` Peter Zijlstra
2023-10-11 13:08       ` Peter Zijlstra
2023-05-31 11:58 ` [PATCH 02/15] sched/fair: Remove START_DEBIT Peter Zijlstra
2023-08-10  7:10   ` [tip: sched/core] sched/fair: Remove sched_feat(START_DEBIT) tip-bot2 for Peter Zijlstra
2023-05-31 11:58 ` [PATCH 03/15] sched/fair: Add lag based placement Peter Zijlstra
2023-08-10  7:10   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2023-10-11 12:00   ` [PATCH 03/15] " Abel Wu
2023-10-11 13:24     ` Peter Zijlstra
2023-10-12  7:04       ` Abel Wu
2023-10-13  7:37         ` Peter Zijlstra
2023-10-13  8:14           ` Abel Wu
2023-10-12 19:15   ` Benjamin Segall
2023-10-12 22:34     ` Peter Zijlstra
2023-10-13 16:35       ` Peter Zijlstra
2023-10-14  8:08         ` Mike Galbraith
2023-10-13 14:34     ` Peter Zijlstra
2023-05-31 11:58 ` [PATCH 04/15] rbtree: Add rb_add_augmented_cached() helper Peter Zijlstra
2023-08-10  7:10   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2023-05-31 11:58 ` [PATCH 05/15] sched/fair: Implement an EEVDF like policy Peter Zijlstra
2023-08-10  7:10   ` [tip: sched/core] sched/fair: Implement an EEVDF-like scheduling policy tip-bot2 for Peter Zijlstra
2023-09-29 21:40   ` [PATCH 05/15] sched/fair: Implement an EEVDF like policy Benjamin Segall
2023-10-02 17:39     ` Peter Zijlstra
2023-10-11  4:14     ` Abel Wu
2023-10-11  7:33       ` Peter Zijlstra
2023-10-11 11:49         ` Abel Wu
2023-09-30  0:09   ` [PATCH] sched/fair: fix pick_eevdf to always find the correct se Benjamin Segall
2023-10-03 10:42     ` [tip: sched/urgent] sched/fair: Fix pick_eevdf() tip-bot2 for Benjamin Segall
     [not found]     ` <CGME20231004203940eucas1p2f73b017497d1f4239a6e236fdb6019e2@eucas1p2.samsung.com>
2023-10-04 20:39       ` [PATCH] sched/fair: fix pick_eevdf to always find the correct se Marek Szyprowski
2023-10-09  7:53     ` [tip: sched/urgent] sched/eevdf: Fix pick_eevdf() tip-bot2 for Benjamin Segall
2023-10-11 12:12     ` [PATCH] sched/fair: fix pick_eevdf to always find the correct se Abel Wu
2023-10-11 13:14       ` Peter Zijlstra
2023-10-12 10:04         ` Abel Wu
2023-10-11 21:01       ` Benjamin Segall
2023-10-12 10:25         ` Abel Wu
2023-10-12 17:51           ` Benjamin Segall
2023-10-13  3:46             ` Abel Wu
2023-10-13 16:51               ` Benjamin Segall
2023-05-31 11:58 ` [PATCH 06/15] sched: Commit to lag based placement Peter Zijlstra
2023-08-10  7:10   ` [tip: sched/core] sched/fair: " tip-bot2 for Peter Zijlstra
2023-05-31 11:58 ` [PATCH 07/15] sched/smp: Use lag to simplify cross-runqueue placement Peter Zijlstra
2023-08-10  7:10   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2023-09-12 15:32   ` [PATCH 07/15] " Sebastian Andrzej Siewior
2023-09-13  9:03     ` Peter Zijlstra
2023-10-04  1:17   ` [PATCH] sched/fair: Preserve PLACE_DEADLINE_INITIAL deadline Daniel Jordan
2023-10-04 13:09     ` [PATCH v2] " Daniel Jordan
2023-10-04 15:46       ` Chen Yu
2023-10-06 16:31         ` Daniel Jordan
2023-10-12  4:48       ` K Prateek Nayak
2023-10-05  5:56     ` [PATCH] " K Prateek Nayak
2023-10-06 16:35       ` Daniel Jordan
2023-10-06 16:48   ` [PATCH] sched/fair: Always update_curr() before placing at enqueue Daniel Jordan
2023-10-06 19:58     ` Peter Zijlstra
2023-10-18  0:43       ` Daniel Jordan
2023-10-16  5:39     ` K Prateek Nayak
2023-05-31 11:58 ` [PATCH 08/15] sched: Commit to EEVDF Peter Zijlstra
2023-06-16 21:23   ` Joel Fernandes
2023-06-22 12:01     ` Ingo Molnar
2023-06-22 13:11       ` Joel Fernandes
2023-08-10  7:10   ` [tip: sched/core] sched/fair: " tip-bot2 for Peter Zijlstra
2023-05-31 11:58 ` [PATCH 09/15] sched/debug: Rename min_granularity to base_slice Peter Zijlstra
2023-08-10  7:10   ` [tip: sched/core] sched/debug: Rename sysctl_sched_min_granularity to sysctl_sched_base_slice tip-bot2 for Peter Zijlstra
2023-05-31 11:58 ` [PATCH 10/15] sched/fair: Propagate enqueue flags into place_entity() Peter Zijlstra
2023-08-10  7:10   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2023-05-31 11:58 ` [PATCH 11/15] sched/eevdf: Better handle mixed slice length Peter Zijlstra
2023-06-02 13:45   ` Vincent Guittot
2023-06-02 15:06     ` Peter Zijlstra
2023-06-10  6:34   ` Chen Yu
2023-06-10 11:22     ` Peter Zijlstra
2023-05-31 11:58 ` [RFC][PATCH 12/15] sched: Introduce latency-nice as a per-task attribute Peter Zijlstra
2023-05-31 11:58 ` [RFC][PATCH 13/15] sched/fair: Implement latency-nice Peter Zijlstra
2023-06-06 14:54   ` Vincent Guittot
2023-06-08 10:34     ` Peter Zijlstra
2023-06-08 12:44       ` Peter Zijlstra
2023-10-11 23:24   ` Benjamin Segall
2023-05-31 11:58 ` [RFC][PATCH 14/15] sched/fair: Add sched group latency support Peter Zijlstra
2023-05-31 11:58 ` [RFC][PATCH 15/15] sched/eevdf: Use sched_attr::sched_runtime to set request/slice Peter Zijlstra
2023-06-01 13:55   ` Vincent Guittot
2023-06-08 11:52     ` Peter Zijlstra
2023-08-24  0:52 ` [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr Daniel Jordan
2023-09-06 13:13   ` Peter Zijlstra
2023-09-29 16:54     ` Youssef Esmat
2023-10-02 15:55       ` Youssef Esmat
2023-10-02 18:41       ` Peter Zijlstra
2023-10-05 12:05         ` Peter Zijlstra
2023-10-05 14:14           ` Peter Zijlstra
2023-10-05 14:42             ` Peter Zijlstra
2023-10-05 18:23           ` Youssef Esmat
2023-10-06  0:36             ` Youssef Esmat
2023-10-10  8:08             ` Peter Zijlstra
2023-10-07 22:04           ` Peter Zijlstra
2023-10-09 14:41             ` Peter Zijlstra
2023-10-10  0:51             ` Youssef Esmat
2023-10-10  8:01               ` Peter Zijlstra
2023-10-16 16:50               ` Peter Zijlstra
