* [PATCH 00/17] sched: EEVDF using latency-nice
@ 2023-03-28  9:26 Peter Zijlstra
  2023-03-28  9:26 ` [PATCH 01/17] sched: Introduce latency-nice as a per-task attribute Peter Zijlstra
                   ` (20 more replies)
  0 siblings, 21 replies; 55+ messages in thread
From: Peter Zijlstra @ 2023-03-28  9:26 UTC
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault

Hi!

Latest version of the EEVDF [1] patches.

Many changes since last time; most notably it now fully replaces CFS and uses
lag based placement for migrations. Smaller changes include:

 - uses scale_load_down() for avg_vruntime; I measured the max delta to be ~44
   bits on a system/cgroup based kernel build.
 - fixed a bunch of reweight / cgroup placement issues
 - adaptive placement strategy for smaller slices
 - rename se->lag to se->vlag

There's a bunch of RFC patches at the end and one DEBUG patch. Of those, the
PLACE_BONUS patch is a mixed bag of pain. A number of benchmarks regress
because EEVDF is actually fair and gives a 100% parent vs a 50% child a 67%/33%
split (stress-futex, stress-nanosleep, starve, etc.) instead of the 50%/50%
split that the sleeper bonus achieves. Mostly I think these benchmarks are
somewhat artificial/daft, but who knows.

The PLACE_BONUS thing horribly messes up things like hackbench and latency-nice
because it places things too far to the left in the tree. Basically it messes
with the whole 'when': by placing a task back in history you're putting a
burden on the now to accommodate catching up. More tinkering required.

But overall the thing seems to be fairly usable and could do with more
extensive testing.

[1] https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=805acf7726282721504c8f00575d91ebfd750564
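
For reference, the core quantities the series keeps coming back to (the
per-patch changelogs and comments below spell these out in ASCII; restated
here in LaTeX for readability): with entity weight $w_i$ and virtual
runtime $v_i$,

  \[
    V = \frac{\sum_i w_i\, v_i}{\sum_i w_i}, \qquad
    \mathrm{lag}_i = S - s_i = w_i\,(V - v_i)
  \]

and an entity is eligible to be picked when $\mathrm{lag}_i \ge 0$, i.e.
when $v_i \le V$.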

Results:

  hackbench -g $nr_cpu + cyclictest --policy other results:

			EEVDF			 CFS

		# Min Latencies: 00054
  LNICE(19)	# Avg Latencies: 00660
		# Max Latencies: 23103

		# Min Latencies: 00052		00053
  LNICE(0)	# Avg Latencies: 00318		00687
		# Max Latencies: 08593		13913

		# Min Latencies: 00054
  LNICE(-19)	# Avg Latencies: 00055
		# Max Latencies: 00061


Some preliminary results from Chen Yu on a slightly older version:

  schbench  (95% tail latency, lower is better)
  =================================================================================
  case                    nr_instance            baseline (std%)    compare% ( std%)
  normal                   25%                     1.00  (2.49%)    -81.2%   (4.27%)
  normal                   50%                     1.00  (2.47%)    -84.5%   (0.47%)
  normal                   75%                     1.00  (2.5%)     -81.3%   (1.27%)
  normal                  100%                     1.00  (3.14%)    -79.2%   (0.72%)
  normal                  125%                     1.00  (3.07%)    -77.5%   (0.85%)
  normal                  150%                     1.00  (3.35%)    -76.4%   (0.10%)
  normal                  175%                     1.00  (3.06%)    -76.2%   (0.56%)
  normal                  200%                     1.00  (3.11%)    -76.3%   (0.39%)
  ==================================================================================

  hackbench (throughput, higher is better)
  ==============================================================================
  case                    nr_instance            baseline(std%)  compare%( std%)
  threads-pipe              25%                      1.00 (<2%)    -17.5 (<2%)
  threads-socket            25%                      1.00 (<2%)    -1.9 (<2%)
  threads-pipe              50%                      1.00 (<2%)     +6.7 (<2%)
  threads-socket            50%                      1.00 (<2%)    -6.3  (<2%)
  threads-pipe              100%                     1.00 (3%)     +110.1 (3%)
  threads-socket            100%                     1.00 (<2%)    -40.2 (<2%)
  threads-pipe              150%                     1.00 (<2%)    +125.4 (<2%)
  threads-socket            150%                     1.00 (<2%)    -24.7 (<2%)
  threads-pipe              200%                     1.00 (<2%)    -89.5 (<2%)
  threads-socket            200%                     1.00 (<2%)    -27.4 (<2%)
  process-pipe              25%                      1.00 (<2%)    -15.0 (<2%)
  process-socket            25%                      1.00 (<2%)    -3.9 (<2%)
  process-pipe              50%                      1.00 (<2%)    -0.4  (<2%)
  process-socket            50%                      1.00 (<2%)    -5.3  (<2%)
  process-pipe              100%                     1.00 (<2%)    +62.0 (<2%)
  process-socket            100%                     1.00 (<2%)    -39.5  (<2%)
  process-pipe              150%                     1.00 (<2%)    +70.0 (<2%)
  process-socket            150%                     1.00 (<2%)    -20.3 (<2%)
  process-pipe              200%                     1.00 (<2%)    +79.2 (<2%)
  process-socket            200%                     1.00 (<2%)    -22.4  (<2%)
  ==============================================================================

  stress-ng (throughput, higher is better)
  ==============================================================================
  case                    nr_instance            baseline(std%)  compare%( std%)
  switch                  25%                      1.00 (<2%)    -6.5 (<2%)
  switch                  50%                      1.00 (<2%)    -9.2 (<2%)
  switch                  75%                      1.00 (<2%)    -1.2 (<2%)
  switch                  100%                     1.00 (<2%)    +11.1 (<2%)
  switch                  125%                     1.00 (<2%)    -16.7% (9%)
  switch                  150%                     1.00 (<2%)    -13.6 (<2%)
  switch                  175%                     1.00 (<2%)    -16.2 (<2%)
  switch                  200%                     1.00 (<2%)    -19.4% (<2%)
  fork                    50%                      1.00 (<2%)    -0.1 (<2%)
  fork                    75%                      1.00 (<2%)    -0.3 (<2%)
  fork                    100%                     1.00 (<2%)    -0.1 (<2%)
  fork                    125%                     1.00 (<2%)    -6.9 (<2%)
  fork                    150%                     1.00 (<2%)    -8.8 (<2%)
  fork                    200%                     1.00 (<2%)    -3.3 (<2%)
  futex                   25%                      1.00 (<2%)    -3.2 (<2%)
  futex                   50%                      1.00 (3%)     -19.9 (5%)
  futex                   75%                      1.00 (6%)     -19.1 (2%)
  futex                   100%                     1.00 (16%)    -30.5 (10%)
  futex                   125%                     1.00 (25%)    -39.3 (11%)
  futex                   150%                     1.00 (20%)    -27.2% (17%)
  futex                   175%                     1.00 (<2%)    -18.6 (<2%)
  futex                   200%                     1.00 (<2%)    -47.5 (<2%)
  nanosleep               25%                      1.00 (<2%)    -0.1 (<2%)
  nanosleep               50%                      1.00 (<2%)    -0.0% (<2%)
  nanosleep               75%                      1.00 (<2%)    +15.2% (<2%)
  nanosleep               100%                     1.00 (<2%)    -26.4 (<2%)
  nanosleep               125%                     1.00 (<2%)    -1.3 (<2%)
  nanosleep               150%                     1.00 (<2%)    +2.1  (<2%)
  nanosleep               175%                     1.00 (<2%)    +8.3 (<2%)
  nanosleep               200%                     1.00 (<2%)    +2.0% (<2%)
  ===============================================================================

  unixbench (throughput, higher is better)
  ==============================================================================
  case                    nr_instance            baseline(std%)  compare%( std%)
  spawn                   125%                      1.00 (<2%)    +8.1 (<2%)
  context1                100%                      1.00 (6%)     +17.4 (6%)
  context1                75%                       1.00 (13%)    +18.8 (8%)
  =================================================================================

  netperf  (throughput, higher is better)
  ===========================================================================
  case                    nr_instance          baseline(std%)  compare%( std%)
  UDP_RR                  25%                   1.00    (<2%)    -1.5%  (<2%)
  UDP_RR                  50%                   1.00    (<2%)    -0.3%  (<2%)
  UDP_RR                  75%                   1.00    (<2%)    +12.5% (<2%)
  UDP_RR                 100%                   1.00    (<2%)    -4.3%  (<2%)
  UDP_RR                 125%                   1.00    (<2%)    -4.9%  (<2%)
  UDP_RR                 150%                   1.00    (<2%)    -4.7%  (<2%)
  UDP_RR                 175%                   1.00    (<2%)    -6.1%  (<2%)
  UDP_RR                 200%                   1.00    (<2%)    -6.6%  (<2%)
  TCP_RR                  25%                   1.00    (<2%)    -1.4%  (<2%)
  TCP_RR                  50%                   1.00    (<2%)    -0.2%  (<2%)
  TCP_RR                  75%                   1.00    (<2%)    -3.9%  (<2%)
  TCP_RR                 100%                   1.00    (2%)     +3.6%  (5%)
  TCP_RR                 125%                   1.00    (<2%)    -4.2%  (<2%)
  TCP_RR                 150%                   1.00    (<2%)    -6.0%  (<2%)
  TCP_RR                 175%                   1.00    (<2%)    -7.4%  (<2%)
  TCP_RR                 200%                   1.00    (<2%)    -8.4%  (<2%)
  ==========================================================================


---
Also available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/eevdf

---
Parth Shah (1):
      sched: Introduce latency-nice as a per-task attribute

Peter Zijlstra (14):
      sched/fair: Add avg_vruntime
      sched/fair: Remove START_DEBIT
      sched/fair: Add lag based placement
      rbtree: Add rb_add_augmented_cached() helper
      sched/fair: Implement an EEVDF like policy
      sched: Commit to lag based placement
      sched/smp: Use lag to simplify cross-runqueue placement
      sched: Commit to EEVDF
      sched/debug: Rename min_granularity to base_slice
      sched: Merge latency_offset into slice
      sched/eevdf: Better handle mixed slice length
      sched/eevdf: Sleeper bonus
      sched/eevdf: Minimal vavg option
      sched/eevdf: Debug / validation crud

Vincent Guittot (2):
      sched/fair: Add latency_offset
      sched/fair: Add sched group latency support

 Documentation/admin-guide/cgroup-v2.rst |   10 +
 include/linux/rbtree_augmented.h        |   26 +
 include/linux/sched.h                   |    6 +
 include/uapi/linux/sched.h              |    4 +-
 include/uapi/linux/sched/types.h        |   19 +
 init/init_task.c                        |    3 +-
 kernel/sched/core.c                     |   65 +-
 kernel/sched/debug.c                    |   49 +-
 kernel/sched/fair.c                     | 1199 ++++++++++++++++---------------
 kernel/sched/features.h                 |   29 +-
 kernel/sched/sched.h                    |   23 +-
 tools/include/uapi/linux/sched.h        |    4 +-
 12 files changed, 794 insertions(+), 643 deletions(-)



* [PATCH 01/17] sched: Introduce latency-nice as a per-task attribute
  2023-03-28  9:26 [PATCH 00/17] sched: EEVDF using latency-nice Peter Zijlstra
@ 2023-03-28  9:26 ` Peter Zijlstra
  2023-03-28  9:26 ` [PATCH 02/17] sched/fair: Add latency_offset Peter Zijlstra
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2023-03-28  9:26 UTC
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault,
	Parth Shah

From: Parth Shah <parth@linux.ibm.com>

Latency-nice indicates the latency requirements of a task with respect
to the other tasks in the system. The attribute accepts values in the
range [-20, 19], both inclusive, in line with task nice values.

Just like task nice, -20 is the 'highest' priority and conveys that the
task should get minimal latency; conversely, 19 is the lowest priority
and conveys that the task will get the least consideration and will thus
receive maximal latency.
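
As an illustration only (not part of this patch), a minimal userspace
sketch of setting the attribute via sched_setattr(2); the struct layout,
flag value and attr size follow the uapi changes below, while the local
type and wrapper names are made up:

  #include <stdint.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>

  /* Local copy of sched_attr including the new VER2 field. */
  struct sched_attr_v2 {
  	uint32_t size;
  	uint32_t sched_policy;
  	uint64_t sched_flags;
  	int32_t  sched_nice;
  	uint32_t sched_priority;
  	uint64_t sched_runtime;		/* SCHED_DEADLINE */
  	uint64_t sched_deadline;
  	uint64_t sched_period;
  	uint32_t sched_util_min;	/* utilization clamps (VER1) */
  	uint32_t sched_util_max;
  	int32_t  sched_latency_nice;	/* new in VER2 */
  };

  #define SCHED_FLAG_LATENCY_NICE	0x80
  #define SCHED_ATTR_SIZE_VER2		60

  /*
   * Set latency_nice on the current task. Note this also (re)applies
   * SCHED_OTHER with nice 0; a real tool would OR in SCHED_FLAG_KEEP_ALL
   * to leave the policy and params untouched.
   */
  static int set_latency_nice(int latency_nice)
  {
  	struct sched_attr_v2 attr;

  	memset(&attr, 0, sizeof(attr));
  	attr.size = SCHED_ATTR_SIZE_VER2;
  	attr.sched_policy = 0;				/* SCHED_OTHER */
  	attr.sched_flags = SCHED_FLAG_LATENCY_NICE;
  	attr.sched_latency_nice = latency_nice;		/* [-20, 19] */

  	return syscall(SYS_sched_setattr, 0, &attr, 0);
  }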

[peterz: rebase, squash]
Signed-off-by: Parth Shah <parth@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h            |    1 +
 include/uapi/linux/sched.h       |    4 +++-
 include/uapi/linux/sched/types.h |   19 +++++++++++++++++++
 init/init_task.c                 |    3 ++-
 kernel/sched/core.c              |   27 ++++++++++++++++++++++++++-
 kernel/sched/debug.c             |    1 +
 tools/include/uapi/linux/sched.h |    4 +++-
 7 files changed, 55 insertions(+), 4 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -784,6 +784,7 @@ struct task_struct {
 	int				static_prio;
 	int				normal_prio;
 	unsigned int			rt_priority;
+	int				latency_prio;
 
 	struct sched_entity		se;
 	struct sched_rt_entity		rt;
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -132,6 +132,7 @@ struct clone_args {
 #define SCHED_FLAG_KEEP_PARAMS		0x10
 #define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
 #define SCHED_FLAG_UTIL_CLAMP_MAX	0x40
+#define SCHED_FLAG_LATENCY_NICE		0x80
 
 #define SCHED_FLAG_KEEP_ALL	(SCHED_FLAG_KEEP_POLICY | \
 				 SCHED_FLAG_KEEP_PARAMS)
@@ -143,6 +144,7 @@ struct clone_args {
 			 SCHED_FLAG_RECLAIM		| \
 			 SCHED_FLAG_DL_OVERRUN		| \
 			 SCHED_FLAG_KEEP_ALL		| \
-			 SCHED_FLAG_UTIL_CLAMP)
+			 SCHED_FLAG_UTIL_CLAMP		| \
+			 SCHED_FLAG_LATENCY_NICE)
 
 #endif /* _UAPI_LINUX_SCHED_H */
--- a/include/uapi/linux/sched/types.h
+++ b/include/uapi/linux/sched/types.h
@@ -10,6 +10,7 @@ struct sched_param {
 
 #define SCHED_ATTR_SIZE_VER0	48	/* sizeof first published struct */
 #define SCHED_ATTR_SIZE_VER1	56	/* add: util_{min,max} */
+#define SCHED_ATTR_SIZE_VER2	60	/* add: latency_nice */
 
 /*
  * Extended scheduling parameters data structure.
@@ -98,6 +99,22 @@ struct sched_param {
  * scheduled on a CPU with no more capacity than the specified value.
  *
  * A task utilization boundary can be reset by setting the attribute to -1.
+ *
+ * Latency Tolerance Attributes
+ * ============================
+ *
+ * A subset of sched_attr attributes allows specifying the relative latency
+ * requirements of a task with respect to the other tasks running/queued in
+ * the system.
+ *
+ * @ sched_latency_nice	task's latency_nice value
+ *
+ * The latency_nice of a task can have any value in the range
+ * [MIN_LATENCY_NICE..MAX_LATENCY_NICE].
+ *
+ * A task with a latency_nice value of LATENCY_NICE_MIN can be considered
+ * as requiring lower latency, as opposed to a task with a higher
+ * latency_nice value.
  */
 struct sched_attr {
 	__u32 size;
@@ -120,6 +137,8 @@ struct sched_attr {
 	__u32 sched_util_min;
 	__u32 sched_util_max;
 
+	/* latency requirement hints */
+	__s32 sched_latency_nice;
 };
 
 #endif /* _UAPI_LINUX_SCHED_TYPES_H */
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -78,6 +78,7 @@ struct task_struct init_task
 	.prio		= MAX_PRIO - 20,
 	.static_prio	= MAX_PRIO - 20,
 	.normal_prio	= MAX_PRIO - 20,
+	.latency_prio	= DEFAULT_PRIO,
 	.policy		= SCHED_NORMAL,
 	.cpus_ptr	= &init_task.cpus_mask,
 	.user_cpus_ptr	= NULL,
@@ -89,7 +90,7 @@ struct task_struct init_task
 		.fn = do_no_restart_syscall,
 	},
 	.se		= {
-		.group_node 	= LIST_HEAD_INIT(init_task.se.group_node),
+		.group_node	= LIST_HEAD_INIT(init_task.se.group_node),
 	},
 	.rt		= {
 		.run_list	= LIST_HEAD_INIT(init_task.rt.run_list),
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4684,6 +4684,8 @@ int sched_fork(unsigned long clone_flags
 		p->prio = p->normal_prio = p->static_prio;
 		set_load_weight(p, false);
 
+		p->latency_prio = NICE_TO_PRIO(0);
+
 		/*
 		 * We don't need the reset flag anymore after the fork. It has
 		 * fulfilled its duty:
@@ -7428,7 +7430,7 @@ static struct task_struct *find_process_
 #define SETPARAM_POLICY	-1
 
 static void __setscheduler_params(struct task_struct *p,
-		const struct sched_attr *attr)
+				  const struct sched_attr *attr)
 {
 	int policy = attr->sched_policy;
 
@@ -7452,6 +7454,13 @@ static void __setscheduler_params(struct
 	set_load_weight(p, true);
 }
 
+static void __setscheduler_latency(struct task_struct *p,
+				   const struct sched_attr *attr)
+{
+	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
+		p->latency_prio = NICE_TO_PRIO(attr->sched_latency_nice);
+}
+
 /*
  * Check the target process has a UID that matches the current process's:
  */
@@ -7592,6 +7601,13 @@ static int __sched_setscheduler(struct t
 			return retval;
 	}
 
+	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
+		if (attr->sched_latency_nice > MAX_NICE)
+			return -EINVAL;
+		if (attr->sched_latency_nice < MIN_NICE)
+			return -EINVAL;
+	}
+
 	if (pi)
 		cpuset_read_lock();
 
@@ -7626,6 +7642,9 @@ static int __sched_setscheduler(struct t
 			goto change;
 		if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
 			goto change;
+		if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE &&
+		    attr->sched_latency_nice != PRIO_TO_NICE(p->latency_prio))
+			goto change;
 
 		p->sched_reset_on_fork = reset_on_fork;
 		retval = 0;
@@ -7714,6 +7733,7 @@ static int __sched_setscheduler(struct t
 		__setscheduler_params(p, attr);
 		__setscheduler_prio(p, newprio);
 	}
+	__setscheduler_latency(p, attr);
 	__setscheduler_uclamp(p, attr);
 
 	if (queued) {
@@ -7924,6 +7944,9 @@ static int sched_copy_attr(struct sched_
 	    size < SCHED_ATTR_SIZE_VER1)
 		return -EINVAL;
 
+	if ((attr->sched_flags & SCHED_FLAG_LATENCY_NICE) &&
+	    size < SCHED_ATTR_SIZE_VER2)
+		return -EINVAL;
 	/*
 	 * XXX: Do we want to be lenient like existing syscalls; or do we want
 	 * to be strict and return an error on out-of-bounds values?
@@ -8161,6 +8184,8 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pi
 	get_params(p, &kattr);
 	kattr.sched_flags &= SCHED_FLAG_ALL;
 
+	kattr.sched_latency_nice = PRIO_TO_NICE(p->latency_prio);
+
 #ifdef CONFIG_UCLAMP_TASK
 	/*
 	 * This could race with another potential updater, but this is fine
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1043,6 +1043,7 @@ void proc_sched_show_task(struct task_st
 #endif
 	P(policy);
 	P(prio);
+	P(latency_prio);
 	if (task_has_dl_policy(p)) {
 		P(dl.runtime);
 		P(dl.deadline);
--- a/tools/include/uapi/linux/sched.h
+++ b/tools/include/uapi/linux/sched.h
@@ -132,6 +132,7 @@ struct clone_args {
 #define SCHED_FLAG_KEEP_PARAMS		0x10
 #define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
 #define SCHED_FLAG_UTIL_CLAMP_MAX	0x40
+#define SCHED_FLAG_LATENCY_NICE		0x80
 
 #define SCHED_FLAG_KEEP_ALL	(SCHED_FLAG_KEEP_POLICY | \
 				 SCHED_FLAG_KEEP_PARAMS)
@@ -143,6 +144,7 @@ struct clone_args {
 			 SCHED_FLAG_RECLAIM		| \
 			 SCHED_FLAG_DL_OVERRUN		| \
 			 SCHED_FLAG_KEEP_ALL		| \
-			 SCHED_FLAG_UTIL_CLAMP)
+			 SCHED_FLAG_UTIL_CLAMP		| \
+			 SCHED_FLAG_LATENCY_NICE)
 
 #endif /* _UAPI_LINUX_SCHED_H */




* [PATCH 02/17] sched/fair: Add latency_offset
  2023-03-28  9:26 [PATCH 00/17] sched: EEVDF using latency-nice Peter Zijlstra
  2023-03-28  9:26 ` [PATCH 01/17] sched: Introduce latency-nice as a per-task attribute Peter Zijlstra
@ 2023-03-28  9:26 ` Peter Zijlstra
  2023-03-28  9:26 ` [PATCH 03/17] sched/fair: Add sched group latency support Peter Zijlstra
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2023-03-28  9:26 UTC
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault

From: Vincent Guittot <vincent.guittot@linaro.org>


Murdered-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 include/linux/sched.h |    2 ++
 kernel/sched/core.c   |   12 +++++++++++-
 kernel/sched/fair.c   |    8 ++++++++
 kernel/sched/sched.h  |    2 ++
 4 files changed, 23 insertions(+), 1 deletion(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -568,6 +568,8 @@ struct sched_entity {
 	/* cached value of my_q->h_nr_running */
 	unsigned long			runnable_weight;
 #endif
+	/* preemption offset in ns */
+	long				latency_offset;
 
 #ifdef CONFIG_SMP
 	/*
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1285,6 +1285,11 @@ static void set_load_weight(struct task_
 	}
 }
 
+static void set_latency_offset(struct task_struct *p)
+{
+	p->se.latency_offset = calc_latency_offset(p->latency_prio - MAX_RT_PRIO);
+}
+
 #ifdef CONFIG_UCLAMP_TASK
 /*
  * Serializes updates of utilization clamp values
@@ -4433,6 +4438,8 @@ static void __sched_fork(unsigned long c
 	p->se.vruntime			= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
+	set_latency_offset(p);
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	p->se.cfs_rq			= NULL;
 #endif
@@ -4685,6 +4692,7 @@ int sched_fork(unsigned long clone_flags
 		set_load_weight(p, false);
 
 		p->latency_prio = NICE_TO_PRIO(0);
+		set_latency_offset(p);
 
 		/*
 		 * We don't need the reset flag anymore after the fork. It has
@@ -7457,8 +7465,10 @@ static void __setscheduler_params(struct
 static void __setscheduler_latency(struct task_struct *p,
 				   const struct sched_attr *attr)
 {
-	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
+	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
 		p->latency_prio = NICE_TO_PRIO(attr->sched_latency_nice);
+		set_latency_offset(p);
+	}
 }
 
 /*
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -703,6 +703,14 @@ int sched_update_scaling(void)
 }
 #endif
 
+long calc_latency_offset(int prio)
+{
+	u32 weight = sched_prio_to_weight[prio];
+	u64 base = sysctl_sched_min_granularity;
+
+	return div_u64(base << SCHED_FIXEDPOINT_SHIFT, weight);
+}
+
 /*
  * delta /= w
  */
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2475,6 +2475,8 @@ extern unsigned int sysctl_numa_balancin
 extern unsigned int sysctl_numa_balancing_hot_threshold;
 #endif
 
+extern long calc_latency_offset(int prio);
+
 #ifdef CONFIG_SCHED_HRTICK
 
 /*




* [PATCH 03/17] sched/fair: Add sched group latency support
  2023-03-28  9:26 [PATCH 00/17] sched: EEVDF using latency-nice Peter Zijlstra
  2023-03-28  9:26 ` [PATCH 01/17] sched: Introduce latency-nice as a per-task attribute Peter Zijlstra
  2023-03-28  9:26 ` [PATCH 02/17] sched/fair: Add latency_offset Peter Zijlstra
@ 2023-03-28  9:26 ` Peter Zijlstra
  2023-03-28  9:26 ` [PATCH 04/17] sched/fair: Add avg_vruntime Peter Zijlstra
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2023-03-28  9:26 UTC
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault

From: Vincent Guittot <vincent.guittot@linaro.org>

A task can set its latency priority with sched_setattr(), which is then
used to set the latency offset of its sched_entity, but sched group
entities still have the default latency offset value.

Add a latency.nice field to the cpu cgroup controller to set the latency
priority of the group, similarly to sched_setattr(). The latency priority
is then used to set the offset of the sched_entities of the group.
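
For illustration only (not part of this patch), setting the group value
from userspace is just a write to the new cgroup-v2 file; the cgroup path
below is a made-up example:

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  /* Write latency.nice for a cgroup-v2 group, e.g. /sys/fs/cgroup/mygroup */
  static int cgroup_set_latency_nice(const char *cgrp_path, int nice)
  {
  	char file[256];
  	int fd, ret;

  	snprintf(file, sizeof(file), "%s/cpu.latency.nice", cgrp_path);
  	fd = open(file, O_WRONLY);
  	if (fd < 0)
  		return -1;

  	/* valid range [-20, 19]; the kernel rejects others with -ERANGE */
  	ret = dprintf(fd, "%d\n", nice);
  	close(fd);

  	return ret < 0 ? -1 : 0;
  }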

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20230224093454.956298-7-vincent.guittot@linaro.org
---
 Documentation/admin-guide/cgroup-v2.rst |   10 ++++++++++
 kernel/sched/core.c                     |   30 ++++++++++++++++++++++++++++++
 kernel/sched/fair.c                     |   32 ++++++++++++++++++++++++++++++++
 kernel/sched/sched.h                    |    4 ++++
 4 files changed, 76 insertions(+)

--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1121,6 +1121,16 @@ All time durations are in microseconds.
         values similar to the sched_setattr(2). This maximum utilization
         value is used to clamp the task specific maximum utilization clamp.
 
+  cpu.latency.nice
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "0".
+
+	The nice value is in the range [-20, 19].
+
+	This interface file allows reading and setting latency using the
+	same values used by sched_setattr(2). The latency_nice of a group is
+	used to limit the impact of the latency_nice of a task outside the
+	group.
 
 
 Memory
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11068,6 +11068,25 @@ static int cpu_idle_write_s64(struct cgr
 {
 	return sched_group_set_idle(css_tg(css), idle);
 }
+
+static s64 cpu_latency_nice_read_s64(struct cgroup_subsys_state *css,
+				    struct cftype *cft)
+{
+	return PRIO_TO_NICE(css_tg(css)->latency_prio);
+}
+
+static int cpu_latency_nice_write_s64(struct cgroup_subsys_state *css,
+				     struct cftype *cft, s64 nice)
+{
+	int prio;
+
+	if (nice < MIN_NICE || nice > MAX_NICE)
+		return -ERANGE;
+
+	prio = NICE_TO_PRIO(nice);
+
+	return sched_group_set_latency(css_tg(css), prio);
+}
 #endif
 
 static struct cftype cpu_legacy_files[] = {
@@ -11082,6 +11101,11 @@ static struct cftype cpu_legacy_files[]
 		.read_s64 = cpu_idle_read_s64,
 		.write_s64 = cpu_idle_write_s64,
 	},
+	{
+		.name = "latency.nice",
+		.read_s64 = cpu_latency_nice_read_s64,
+		.write_s64 = cpu_latency_nice_write_s64,
+	},
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
@@ -11299,6 +11323,12 @@ static struct cftype cpu_files[] = {
 		.read_s64 = cpu_idle_read_s64,
 		.write_s64 = cpu_idle_write_s64,
 	},
+	{
+		.name = "latency.nice",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_s64 = cpu_latency_nice_read_s64,
+		.write_s64 = cpu_latency_nice_write_s64,
+	},
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12264,6 +12264,7 @@ int alloc_fair_sched_group(struct task_g
 		goto err;
 
 	tg->shares = NICE_0_LOAD;
+	tg->latency_prio = DEFAULT_PRIO;
 
 	init_cfs_bandwidth(tg_cfs_bandwidth(tg));
 
@@ -12362,6 +12363,9 @@ void init_tg_cfs_entry(struct task_group
 	}
 
 	se->my_q = cfs_rq;
+
+	se->latency_offset = calc_latency_offset(tg->latency_prio - MAX_RT_PRIO);
+
 	/* guarantee group entities always have weight */
 	update_load_set(&se->load, NICE_0_LOAD);
 	se->parent = parent;
@@ -12490,6 +12494,34 @@ int sched_group_set_idle(struct task_gro
 
 	mutex_unlock(&shares_mutex);
 	return 0;
+}
+
+int sched_group_set_latency(struct task_group *tg, int prio)
+{
+	long latency_offset;
+	int i;
+
+	if (tg == &root_task_group)
+		return -EINVAL;
+
+	mutex_lock(&shares_mutex);
+
+	if (tg->latency_prio == prio) {
+		mutex_unlock(&shares_mutex);
+		return 0;
+	}
+
+	tg->latency_prio = prio;
+	latency_offset = calc_latency_offset(prio - MAX_RT_PRIO);
+
+	for_each_possible_cpu(i) {
+		struct sched_entity *se = tg->se[i];
+
+		WRITE_ONCE(se->latency_offset, latency_offset);
+	}
+
+	mutex_unlock(&shares_mutex);
+	return 0;
 }
 
 #else /* CONFIG_FAIR_GROUP_SCHED */
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -378,6 +378,8 @@ struct task_group {
 
 	/* A positive value indicates that this is a SCHED_IDLE group. */
 	int			idle;
+	/* latency priority of the group. */
+	int			latency_prio;
 
 #ifdef	CONFIG_SMP
 	/*
@@ -488,6 +490,8 @@ extern int sched_group_set_shares(struct
 
 extern int sched_group_set_idle(struct task_group *tg, long idle);
 
+extern int sched_group_set_latency(struct task_group *tg, int prio);
+
 #ifdef CONFIG_SMP
 extern void set_task_rq_fair(struct sched_entity *se,
 			     struct cfs_rq *prev, struct cfs_rq *next);




* [PATCH 04/17] sched/fair: Add avg_vruntime
  2023-03-28  9:26 [PATCH 00/17] sched: EEVDF using latency-nice Peter Zijlstra
                   ` (2 preceding siblings ...)
  2023-03-28  9:26 ` [PATCH 03/17] sched/fair: Add sched group latency support Peter Zijlstra
@ 2023-03-28  9:26 ` Peter Zijlstra
  2023-03-28 23:57   ` Josh Don
  2023-03-28  9:26 ` [PATCH 05/17] sched/fair: Remove START_DEBIT Peter Zijlstra
                   ` (16 subsequent siblings)
  20 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2023-03-28  9:26 UTC
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault

In order to move to an eligibility based scheduling policy, a better
approximation of the ideal scheduler is needed.

Specifically, for a virtual time weighted fair queueing based
scheduler, the ideal scheduler will be the weighted average of the
individual virtual runtimes (math in the comment).

As such, compute the weighted average to approximate the ideal
scheduler -- note that the approximation is in the individual task
behaviour, which isn't strictly conformant.

Specifically, consider adding a task with a vruntime left of center;
in this case the average will move backwards in time -- something the
ideal scheduler would of course never do.
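
A standalone (userspace) sketch of the bookkeeping this patch introduces
may help; it models only the idea -- keep the sums relative to
min_vruntime so the products stay small -- and is not the kernel code
itself:

  #include <stdint.h>

  struct rq_model {
  	uint64_t min_vruntime;	/* monotonic baseline v */
  	int64_t  avg_vruntime;	/* \Sum (v_i - v) * w_i */
  	uint64_t avg_load;	/* \Sum w_i */
  };

  static void model_add(struct rq_model *rq, uint64_t vruntime, uint32_t weight)
  {
  	int64_t key = (int64_t)(vruntime - rq->min_vruntime);

  	rq->avg_vruntime += key * weight;
  	rq->avg_load += weight;
  }

  static void model_sub(struct rq_model *rq, uint64_t vruntime, uint32_t weight)
  {
  	int64_t key = (int64_t)(vruntime - rq->min_vruntime);

  	rq->avg_vruntime -= key * weight;
  	rq->avg_load -= weight;
  }

  /* When the baseline advances: v' = v + delta, so all keys shrink by delta. */
  static void model_advance(struct rq_model *rq, uint64_t delta)
  {
  	rq->avg_vruntime -= (int64_t)(delta * rq->avg_load);
  	rq->min_vruntime += delta;
  }

  /* V = v + \Sum (v_i - v)*w_i / \Sum w_i */
  static uint64_t model_avg_vruntime(struct rq_model *rq)
  {
  	int64_t avg = rq->avg_vruntime;

  	if (rq->avg_load)
  		avg /= (int64_t)rq->avg_load;

  	return rq->min_vruntime + avg;
  }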

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/debug.c |   32 ++++++--------
 kernel/sched/fair.c  |  111 +++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h |    5 ++
 3 files changed, 128 insertions(+), 20 deletions(-)

--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -580,10 +580,9 @@ static void print_rq(struct seq_file *m,
 
 void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 {
-	s64 MIN_vruntime = -1, min_vruntime, max_vruntime = -1,
-		spread, rq0_min_vruntime, spread0;
+	s64 left_vruntime = -1, min_vruntime, right_vruntime = -1, spread;
+	struct sched_entity *last, *first;
 	struct rq *rq = cpu_rq(cpu);
-	struct sched_entity *last;
 	unsigned long flags;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -597,26 +596,25 @@ void print_cfs_rq(struct seq_file *m, in
 			SPLIT_NS(cfs_rq->exec_clock));
 
 	raw_spin_rq_lock_irqsave(rq, flags);
-	if (rb_first_cached(&cfs_rq->tasks_timeline))
-		MIN_vruntime = (__pick_first_entity(cfs_rq))->vruntime;
+	first = __pick_first_entity(cfs_rq);
+	if (first)
+		left_vruntime = first->vruntime;
 	last = __pick_last_entity(cfs_rq);
 	if (last)
-		max_vruntime = last->vruntime;
+		right_vruntime = last->vruntime;
 	min_vruntime = cfs_rq->min_vruntime;
-	rq0_min_vruntime = cpu_rq(0)->cfs.min_vruntime;
 	raw_spin_rq_unlock_irqrestore(rq, flags);
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "MIN_vruntime",
-			SPLIT_NS(MIN_vruntime));
+
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_vruntime",
+			SPLIT_NS(left_vruntime));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "min_vruntime",
 			SPLIT_NS(min_vruntime));
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "max_vruntime",
-			SPLIT_NS(max_vruntime));
-	spread = max_vruntime - MIN_vruntime;
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "spread",
-			SPLIT_NS(spread));
-	spread0 = min_vruntime - rq0_min_vruntime;
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "spread0",
-			SPLIT_NS(spread0));
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "avg_vruntime",
+			SPLIT_NS(avg_vruntime(cfs_rq)));
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "right_vruntime",
+			SPLIT_NS(right_vruntime));
+	spread = right_vruntime - left_vruntime;
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "spread", SPLIT_NS(spread));
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_spread_over",
 			cfs_rq->nr_spread_over);
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -601,9 +601,108 @@ static inline bool entity_before(const s
 	return (s64)(a->vruntime - b->vruntime) < 0;
 }
 
+static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	return (s64)(se->vruntime - cfs_rq->min_vruntime);
+}
+
 #define __node_2_se(node) \
 	rb_entry((node), struct sched_entity, run_node)
 
+/*
+ * Compute virtual time from the per-task service numbers:
+ *
+ * Fair schedulers conserve lag: \Sum lag_i = 0
+ *
+ * lag_i = S - s_i = w_i * (V - v_i)
+ *
+ * \Sum lag_i = 0 -> \Sum w_i * (V - v_i) = V * \Sum w_i - \Sum w_i * v_i = 0
+ *
+ * From which we solve V:
+ *
+ *     \Sum v_i * w_i
+ * V = --------------
+ *        \Sum w_i
+ *
+ * However, since v_i is u64, and the multiplication could easily overflow,
+ * transform it into a relative form that uses smaller quantities:
+ *
+ * Substitute: v_i == (v_i - v) + v
+ *
+ *     \Sum ((v_i - v) + v) * w_i   \Sum (v_i - v) * w_i
+ * V = -------------------------- = -------------------- + v
+ *              \Sum w_i                   \Sum w_i
+ *
+ * min_vruntime = v
+ * avg_vruntime = \Sum (v_i - v) * w_i
+ * cfs_rq->load = \Sum w_i
+ *
+ * Since min_vruntime is a monotonic increasing variable that closely tracks
+ * the per-task service, these deltas: (v_i - v), will be in the order of the
+ * maximal (virtual) lag induced in the system due to quantisation.
+ */
+static void
+avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	unsigned long weight = scale_load_down(se->load.weight);
+	s64 key = entity_key(cfs_rq, se);
+
+	cfs_rq->avg_vruntime += key * weight;
+	cfs_rq->avg_load += weight;
+}
+
+static void
+avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	unsigned long weight = scale_load_down(se->load.weight);
+	s64 key = entity_key(cfs_rq, se);
+
+	cfs_rq->avg_vruntime -= key * weight;
+	cfs_rq->avg_load -= weight;
+}
+
+static inline
+void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
+{
+	/*
+	 * v' = v + d ==> avg_vruntime' = avg_vruntime - d*avg_load
+	 */
+	cfs_rq->avg_vruntime -= cfs_rq->avg_load * delta;
+}
+
+u64 avg_vruntime(struct cfs_rq *cfs_rq)
+{
+	struct sched_entity *curr = cfs_rq->curr;
+	s64 avg = cfs_rq->avg_vruntime;
+	long load = cfs_rq->avg_load;
+
+	if (curr && curr->on_rq) {
+		unsigned long weight = scale_load_down(curr->load.weight);
+
+		avg += entity_key(cfs_rq, curr) * weight;
+		load += weight;
+	}
+
+	if (load)
+		avg = div_s64(avg, load);
+
+	return cfs_rq->min_vruntime + avg;
+}
+
+static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
+{
+	u64 min_vruntime = cfs_rq->min_vruntime;
+	/*
+	 * open coded max_vruntime() to allow updating avg_vruntime
+	 */
+	s64 delta = (s64)(vruntime - min_vruntime);
+	if (delta > 0) {
+		avg_vruntime_update(cfs_rq, delta);
+		min_vruntime = vruntime;
+	}
+	return min_vruntime;
+}
+
 static void update_min_vruntime(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *curr = cfs_rq->curr;
@@ -629,7 +728,7 @@ static void update_min_vruntime(struct c
 
 	/* ensure we never gain time by being placed backwards. */
 	u64_u32_store(cfs_rq->min_vruntime,
-		      max_vruntime(cfs_rq->min_vruntime, vruntime));
+		      __update_min_vruntime(cfs_rq, vruntime));
 }
 
 static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
@@ -642,12 +741,14 @@ static inline bool __entity_less(struct
  */
 static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	avg_vruntime_add(cfs_rq, se);
 	rb_add_cached(&se->run_node, &cfs_rq->tasks_timeline, __entity_less);
 }
 
 static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline);
+	avg_vruntime_sub(cfs_rq, se);
 }
 
 struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
@@ -3330,6 +3431,8 @@ static void reweight_entity(struct cfs_r
 		/* commit outstanding execution time */
 		if (cfs_rq->curr == se)
 			update_curr(cfs_rq);
+		else
+			avg_vruntime_sub(cfs_rq, se);
 		update_load_sub(&cfs_rq->load, se->load.weight);
 	}
 	dequeue_load_avg(cfs_rq, se);
@@ -3345,9 +3448,11 @@ static void reweight_entity(struct cfs_r
 #endif
 
 	enqueue_load_avg(cfs_rq, se);
-	if (se->on_rq)
+	if (se->on_rq) {
 		update_load_add(&cfs_rq->load, se->load.weight);
-
+		if (cfs_rq->curr != se)
+			avg_vruntime_add(cfs_rq, se);
+	}
 }
 
 void reweight_task(struct task_struct *p, int prio)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -558,6 +558,9 @@ struct cfs_rq {
 	unsigned int		idle_nr_running;   /* SCHED_IDLE */
 	unsigned int		idle_h_nr_running; /* SCHED_IDLE */
 
+	s64			avg_vruntime;
+	u64			avg_load;
+
 	u64			exec_clock;
 	u64			min_vruntime;
 #ifdef CONFIG_SCHED_CORE
@@ -3312,4 +3315,6 @@ static inline void switch_mm_cid(struct
 static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next) { }
 #endif
 
+extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
+
 #endif /* _KERNEL_SCHED_SCHED_H */




* [PATCH 05/17] sched/fair: Remove START_DEBIT
  2023-03-28  9:26 [PATCH 00/17] sched: EEVDF using latency-nice Peter Zijlstra
                   ` (3 preceding siblings ...)
  2023-03-28  9:26 ` [PATCH 04/17] sched/fair: Add avg_vruntime Peter Zijlstra
@ 2023-03-28  9:26 ` Peter Zijlstra
  2023-03-28  9:26 ` [PATCH 06/17] sched/fair: Add lag based placement Peter Zijlstra
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2023-03-28  9:26 UTC
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault

With the introduction of avg_vruntime() there is no need to use worse
approximations. Take the 0-lag point as starting point for inserting
new tasks.
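
That is, a new task placed exactly at the weighted average has zero lag
by construction (restating the lag definition from the avg_vruntime
patch):

  \[
    v_i := V \;\Rightarrow\; \mathrm{lag}_i = w_i\,(V - v_i) = 0
  \]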

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c     |   21 +--------------------
 kernel/sched/features.h |    6 ------
 2 files changed, 1 insertion(+), 26 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -882,16 +882,6 @@ static u64 sched_slice(struct cfs_rq *cf
 	return slice;
 }
 
-/*
- * We calculate the vruntime slice of a to-be-inserted task.
- *
- * vs = s/w
- */
-static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-	return calc_delta_fair(sched_slice(cfs_rq, se), se);
-}
-
 #include "pelt.h"
 #ifdef CONFIG_SMP
 
@@ -4781,16 +4771,7 @@ static inline bool entity_is_long_sleepe
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
-	u64 vruntime = cfs_rq->min_vruntime;
-
-	/*
-	 * The 'current' period is already promised to the current tasks,
-	 * however the extra weight of the new task will slow them down a
-	 * little, place the new task so that it fits in the slot that
-	 * stays open at the end.
-	 */
-	if (initial && sched_feat(START_DEBIT))
-		vruntime += sched_vslice(cfs_rq, se);
+	u64 vruntime = avg_vruntime(cfs_rq);
 
 	/* sleeps up to a single latency don't count. */
 	if (!initial) {
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -7,12 +7,6 @@
 SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true)
 
 /*
- * Place new tasks ahead so that they do not starve already running
- * tasks
- */
-SCHED_FEAT(START_DEBIT, true)
-
-/*
  * Prefer to schedule the task we woke last (assuming it failed
  * wakeup-preemption), since its likely going to consume data we
  * touched, increases cache locality.




* [PATCH 06/17] sched/fair: Add lag based placement
  2023-03-28  9:26 [PATCH 00/17] sched: EEVDF using latency-nice Peter Zijlstra
                   ` (4 preceding siblings ...)
  2023-03-28  9:26 ` [PATCH 05/17] sched/fair: Remove START_DEBIT Peter Zijlstra
@ 2023-03-28  9:26 ` Peter Zijlstra
  2023-04-03  9:18   ` Chen Yu
  2023-03-28  9:26 ` [PATCH 07/17] rbtree: Add rb_add_augmented_cached() helper Peter Zijlstra
                   ` (14 subsequent siblings)
  20 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2023-03-28  9:26 UTC
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault

With the introduction of avg_vruntime, it is possible to approximate
lag (the entire purpose of introducing it in fact). Use this to do lag
based placement over sleep+wake.

Specifically, the FAIR_SLEEPERS thing places things too far to the
left and messes up the deadline aspect of EEVDF.
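
As a quick sanity check of the compensation done in place_entity() below
(using $l_i = V - v_i$, the virtual lag stored in se->vlag): placing the
woken entity at

  \[
    v_i = V - \frac{W + w_i}{W}\, l_i
  \]

moves the weighted average to

  \[
    V' = \frac{W\,V + w_i\,v_i}{W + w_i} = V - \frac{w_i}{W}\, l_i,
  \]

so its post-insertion lag is $V' - v_i = l_i$: the entity wakes up with
exactly the lag it went to sleep with.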

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h   |    1 
 kernel/sched/core.c     |    1 
 kernel/sched/fair.c     |  129 ++++++++++++++++++++++++++++++++++--------------
 kernel/sched/features.h |    8 ++
 4 files changed, 104 insertions(+), 35 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -555,6 +555,7 @@ struct sched_entity {
 	u64				sum_exec_runtime;
 	u64				vruntime;
 	u64				prev_sum_exec_runtime;
+	s64				vlag;
 
 	u64				nr_migrations;
 
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4439,6 +4439,7 @@ static void __sched_fork(unsigned long c
 	p->se.prev_sum_exec_runtime	= 0;
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
+	p->se.vlag			= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
 	set_latency_offset(p);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -689,6 +689,15 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
 	return cfs_rq->min_vruntime + avg;
 }
 
+/*
+ * lag_i = S - s_i = w_i * (V - v_i)
+ */
+void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	SCHED_WARN_ON(!se->on_rq);
+	se->vlag = avg_vruntime(cfs_rq) - se->vruntime;
+}
+
 static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
 {
 	u64 min_vruntime = cfs_rq->min_vruntime;
@@ -3417,6 +3426,8 @@ dequeue_load_avg(struct cfs_rq *cfs_rq,
 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 			    unsigned long weight)
 {
+	unsigned long old_weight = se->load.weight;
+
 	if (se->on_rq) {
 		/* commit outstanding execution time */
 		if (cfs_rq->curr == se)
@@ -3429,6 +3440,14 @@ static void reweight_entity(struct cfs_r
 
 	update_load_set(&se->load, weight);
 
+	if (!se->on_rq) {
+		/*
+		 * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v),
+		 * we need to scale se->vlag when w_i changes.
+		 */
+		se->vlag = div_s64(se->vlag * old_weight, weight);
+	}
+
 #ifdef CONFIG_SMP
 	do {
 		u32 divider = get_pelt_divider(&se->avg);
@@ -4778,49 +4797,86 @@ static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
 	u64 vruntime = avg_vruntime(cfs_rq);
+	s64 lag = 0;
 
-	/* sleeps up to a single latency don't count. */
-	if (!initial) {
-		unsigned long thresh;
+	/*
+	 * Due to how V is constructed as the weighted average of entities,
+	 * adding tasks with positive lag, or removing tasks with negative lag
+	 * will move 'time' backwards, this can screw around with the lag of
+	 * other tasks.
+	 *
+	 * EEVDF: placement strategy #1 / #2
+	 */
+	if (sched_feat(PLACE_LAG) && cfs_rq->nr_running > 1) {
+		struct sched_entity *curr = cfs_rq->curr;
+		unsigned long load;
 
-		if (se_is_idle(se))
-			thresh = sysctl_sched_min_granularity;
-		else
-			thresh = sysctl_sched_latency;
+		lag = se->vlag;
 
 		/*
-		 * Halve their sleep time's effect, to allow
-		 * for a gentler effect of sleepers:
+		 * If we want to place a task and preserve lag, we have to
+		 * consider the effect of the new entity on the weighted
+		 * average and compensate for this, otherwise lag can quickly
+		 * evaporate:
+		 *
+		 * l_i = V - v_i <=> v_i = V - l_i
+		 *
+		 * V = v_avg = W*v_avg / W
+		 *
+		 * V' = (W*v_avg + w_i*v_i) / (W + w_i)
+		 *    = (W*v_avg + w_i(v_avg - l_i)) / (W + w_i)
+		 *    = v_avg + w_i*l_i/(W + w_i)
+		 *
+		 * l_i' = V' - v_i = v_avg + w_i*l_i/(W + w_i) - (v_avg - l_i)
+		 *      = l_i - w_i*l_i/(W + w_i)
+		 *
+		 * l_i = (W + w_i) * l_i' / W
 		 */
-		if (sched_feat(GENTLE_FAIR_SLEEPERS))
-			thresh >>= 1;
+		load = cfs_rq->avg_load;
+		if (curr && curr->on_rq)
+			load += curr->load.weight;
+
+		lag *= load + se->load.weight;
+		if (WARN_ON_ONCE(!load))
+			load = 1;
+		lag = div_s64(lag, load);
 
-		vruntime -= thresh;
+		vruntime -= lag;
 	}
 
-	/*
-	 * Pull vruntime of the entity being placed to the base level of
-	 * cfs_rq, to prevent boosting it if placed backwards.
-	 * However, min_vruntime can advance much faster than real time, with
-	 * the extreme being when an entity with the minimal weight always runs
-	 * on the cfs_rq. If the waking entity slept for a long time, its
-	 * vruntime difference from min_vruntime may overflow s64 and their
-	 * comparison may get inversed, so ignore the entity's original
-	 * vruntime in that case.
-	 * The maximal vruntime speedup is given by the ratio of normal to
-	 * minimal weight: scale_load_down(NICE_0_LOAD) / MIN_SHARES.
-	 * When placing a migrated waking entity, its exec_start has been set
-	 * from a different rq. In order to take into account a possible
-	 * divergence between new and prev rq's clocks task because of irq and
-	 * stolen time, we take an additional margin.
-	 * So, cutting off on the sleep time of
-	 *     2^63 / scale_load_down(NICE_0_LOAD) ~ 104 days
-	 * should be safe.
-	 */
-	if (entity_is_long_sleeper(se))
-		se->vruntime = vruntime;
-	else
-		se->vruntime = max_vruntime(se->vruntime, vruntime);
+	if (sched_feat(FAIR_SLEEPERS)) {
+
+		/* sleeps up to a single latency don't count. */
+		if (!initial) {
+			unsigned long thresh;
+
+			if (se_is_idle(se))
+				thresh = sysctl_sched_min_granularity;
+			else
+				thresh = sysctl_sched_latency;
+
+			/*
+			 * Halve their sleep time's effect, to allow
+			 * for a gentler effect of sleepers:
+			 */
+			if (sched_feat(GENTLE_FAIR_SLEEPERS))
+				thresh >>= 1;
+
+			vruntime -= thresh;
+		}
+
+		/*
+		 * Pull vruntime of the entity being placed to the base level of
+		 * cfs_rq, to prevent boosting it if placed backwards.  If the entity
+		 * slept for a long time, don't even try to compare its vruntime with
+		 * the base as it may be too far off and the comparison may get
+		 * inversed due to s64 overflow.
+		 */
+		if (!entity_is_long_sleeper(se))
+			vruntime = max_vruntime(se->vruntime, vruntime);
+	}
+
+	se->vruntime = vruntime;
 }
 
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
@@ -4991,6 +5047,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 
 	clear_buddies(cfs_rq, se);
 
+	if (flags & DEQUEUE_SLEEP)
+		update_entity_lag(cfs_rq, se);
+
 	if (se != cfs_rq->curr)
 		__dequeue_entity(cfs_rq, se);
 	se->on_rq = 0;
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -1,12 +1,20 @@
 /* SPDX-License-Identifier: GPL-2.0 */
+
 /*
  * Only give sleepers 50% of their service deficit. This allows
  * them to run sooner, but does not allow tons of sleepers to
  * rip the spread apart.
  */
+SCHED_FEAT(FAIR_SLEEPERS, false)
 SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true)
 
 /*
+ * Using the avg_vruntime, do the right thing and preserve lag across
+ * sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
+ */
+SCHED_FEAT(PLACE_LAG, true)
+
+/*
  * Prefer to schedule the task we woke last (assuming it failed
  * wakeup-preemption), since its likely going to consume data we
  * touched, increases cache locality.




* [PATCH 07/17] rbtree: Add rb_add_augmented_cached() helper
  2023-03-28  9:26 [PATCH 00/17] sched: EEVDF using latency-nice Peter Zijlstra
                   ` (5 preceding siblings ...)
  2023-03-28  9:26 ` [PATCH 06/17] sched/fair: Add lag based placement Peter Zijlstra
@ 2023-03-28  9:26 ` Peter Zijlstra
  2023-03-28  9:26 ` [PATCH 08/17] sched/fair: Implement an EEVDF like policy Peter Zijlstra
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2023-03-28  9:26 UTC
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault

While slightly sub-optimal, updating the augmented data while going down
the tree during lookup would be faster -- alas, the augment interface does
not currently allow for that. Instead, provide a generic helper to add a
node to an augmented cached tree.
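
A kernel-context sketch of using the helper (it simply mirrors the way
the EEVDF patch later in this series uses it; the struct and names here
are made up): a tree ordered on ->key that also keeps the subtree minimum
of ->key in ->min_key via the augment callbacks.

  #include <linux/rbtree_augmented.h>

  struct item {
  	struct rb_node	node;
  	u64		key;
  	u64		min_key;	/* min(key) over this subtree */
  };

  #define __node_2_item(n) rb_entry((n), struct item, node)

  static inline bool item_less(struct rb_node *a, const struct rb_node *b)
  {
  	return __node_2_item(a)->key < __node_2_item(b)->key;
  }

  /* Recompute ->min_key from the node and its children; true if unchanged. */
  static inline bool min_key_update(struct item *it, bool exit)
  {
  	u64 old_min = it->min_key;
  	struct rb_node *rb;

  	it->min_key = it->key;

  	rb = it->node.rb_left;
  	if (rb && __node_2_item(rb)->min_key < it->min_key)
  		it->min_key = __node_2_item(rb)->min_key;

  	rb = it->node.rb_right;
  	if (rb && __node_2_item(rb)->min_key < it->min_key)
  		it->min_key = __node_2_item(rb)->min_key;

  	return it->min_key == old_min;
  }

  RB_DECLARE_CALLBACKS(static, item_cb, struct item, node, min_key, min_key_update);

  static void item_insert(struct item *it, struct rb_root_cached *tree)
  {
  	it->min_key = it->key;
  	rb_add_augmented_cached(&it->node, tree, item_less, &item_cb);
  }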

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/rbtree_augmented.h |   26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

--- a/include/linux/rbtree_augmented.h
+++ b/include/linux/rbtree_augmented.h
@@ -60,6 +60,32 @@ rb_insert_augmented_cached(struct rb_nod
 	rb_insert_augmented(node, &root->rb_root, augment);
 }
 
+static __always_inline struct rb_node *
+rb_add_augmented_cached(struct rb_node *node, struct rb_root_cached *tree,
+			bool (*less)(struct rb_node *, const struct rb_node *),
+			const struct rb_augment_callbacks *augment)
+{
+	struct rb_node **link = &tree->rb_root.rb_node;
+	struct rb_node *parent = NULL;
+	bool leftmost = true;
+
+	while (*link) {
+		parent = *link;
+		if (less(node, parent)) {
+			link = &parent->rb_left;
+		} else {
+			link = &parent->rb_right;
+			leftmost = false;
+		}
+	}
+
+	rb_link_node(node, parent, link);
+	augment->propagate(parent, NULL); /* suboptimal */
+	rb_insert_augmented_cached(node, tree, leftmost, augment);
+
+	return leftmost ? node : NULL;
+}
+
 /*
  * Template for declaring augmented rbtree callbacks (generic case)
  *




* [PATCH 08/17] sched/fair: Implement an EEVDF like policy
  2023-03-28  9:26 [PATCH 00/17] sched: EEVDF using latency-nice Peter Zijlstra
                   ` (6 preceding siblings ...)
  2023-03-28  9:26 ` [PATCH 07/17] rbtree: Add rb_add_augmented_cached() helper Peter Zijlstra
@ 2023-03-28  9:26 ` Peter Zijlstra
  2023-03-29  1:26   ` Josh Don
  2023-03-29 14:35   ` Vincent Guittot
  2023-03-28  9:26 ` [PATCH 09/17] sched: Commit to lag based placement Peter Zijlstra
                   ` (12 subsequent siblings)
  20 siblings, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2023-03-28  9:26 UTC
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault

CFS is currently a WFQ based scheduler with only a single knob, the
weight. The addition of a second, latency oriented parameter makes
something like WF2Q or EEVDF a much better fit.

Specifically, EEVDF does EDF like scheduling in the left half of the
tree -- those entities that are owed service. Except because this is a
virtual time scheduler, the deadlines are in virtual time as well,
which is what allows over-subscription.

EEVDF has two parameters:

 - weight, or time-slope; which is mapped to nice just as before
 - relative deadline; which is related to slice length and mapped
   to the new latency nice.

Basically, by setting a smaller slice, the deadline will be earlier
and the task will be more eligible and run earlier.

Preemption (both tick and wakeup) is driven by testing against a fresh
pick. Because the tree is now effectively an interval tree, and the
selection is no longer 'leftmost', over-scheduling is less of a
problem.
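
For reference, the selection rule in formulas (a paraphrase of the code
below, with $r_i$ the entity's request/slice and $w_i$ its weight; the
deadline is in virtual time):

  \[
    \mathrm{eligible}(i) \iff w_i\,(V - v_i) \ge 0, \qquad
    vd_i \approx v_i + \frac{r_i}{w_i}
  \]
  \[
    \mathrm{pick} = \arg\min_{i \,:\, \mathrm{eligible}(i)} vd_i
  \]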

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h   |    4 
 kernel/sched/debug.c    |    6 
 kernel/sched/fair.c     |  324 +++++++++++++++++++++++++++++++++++++++++-------
 kernel/sched/features.h |    3 
 kernel/sched/sched.h    |    1 
 5 files changed, 293 insertions(+), 45 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -548,6 +548,9 @@ struct sched_entity {
 	/* For load-balancing: */
 	struct load_weight		load;
 	struct rb_node			run_node;
+	u64				deadline;
+	u64				min_deadline;
+
 	struct list_head		group_node;
 	unsigned int			on_rq;
 
@@ -556,6 +559,7 @@ struct sched_entity {
 	u64				vruntime;
 	u64				prev_sum_exec_runtime;
 	s64				vlag;
+	u64				slice;
 
 	u64				nr_migrations;
 
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -535,9 +535,13 @@ print_task(struct seq_file *m, struct rq
 	else
 		SEQ_printf(m, " %c", task_state_to_char(p));
 
-	SEQ_printf(m, " %15s %5d %9Ld.%06ld %9Ld %5d ",
+	SEQ_printf(m, "%15s %5d %9Ld.%06ld %c %9Ld.%06ld %9Ld.%06ld %9Ld.%06ld %9Ld %5d ",
 		p->comm, task_pid_nr(p),
 		SPLIT_NS(p->se.vruntime),
+		entity_eligible(cfs_rq_of(&p->se), &p->se) ? 'E' : 'N',
+		SPLIT_NS(p->se.deadline),
+		SPLIT_NS(p->se.slice),
+		SPLIT_NS(p->se.sum_exec_runtime),
 		(long long)(p->nvcsw + p->nivcsw),
 		p->prio);
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -47,6 +47,7 @@
 #include <linux/psi.h>
 #include <linux/ratelimit.h>
 #include <linux/task_work.h>
+#include <linux/rbtree_augmented.h>
 
 #include <asm/switch_to.h>
 
@@ -347,6 +348,16 @@ static u64 __calc_delta(u64 delta_exec,
 	return mul_u64_u32_shr(delta_exec, fact, shift);
 }
 
+/*
+ * delta /= w
+ */
+static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
+{
+	if (unlikely(se->load.weight != NICE_0_LOAD))
+		delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
+
+	return delta;
+}
 
 const struct sched_class fair_sched_class;
 
@@ -691,11 +702,62 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
 
 /*
  * lag_i = S - s_i = w_i * (V - v_i)
+ *
+ * However, since V is approximated by the weighted average of all entities it
+ * is possible -- by addition/removal/reweight to the tree -- to move V around
+ * and end up with a larger lag than we started with.
+ *
+ * Limit this to double the slice length, with a minimum of TICK_NSEC
+ * since that is the timing granularity.
+ *
+ * EEVDF gives the following limit for a steady state system:
+ *
+ *   -r_max < lag < max(r_max, q)
+ *
+ * XXX could add max_slice to the augmented data to track this.
  */
 void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	s64 lag, limit;
+
 	SCHED_WARN_ON(!se->on_rq);
-	se->vlag = avg_vruntime(cfs_rq) - se->vruntime;
+	lag = avg_vruntime(cfs_rq) - se->vruntime;
+
+	limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
+	se->vlag = clamp(lag, -limit, limit);
+}
+
+/*
+ * Entity is eligible once it received less service than it ought to have,
+ * eg. lag >= 0.
+ *
+ * lag_i = S - s_i = w_i*(V - v_i)
+ *
+ * lag_i >= 0 -> V >= v_i
+ *
+ *     \Sum (v_i - v)*w_i
+ * V = ------------------ + v
+ *          \Sum w_i
+ *
+ * lag_i >= 0 -> \Sum (v_i - v)*w_i >= (v_i - v)*(\Sum w_i)
+ *
+ * Note: using 'avg_vruntime() > se->vruntime' is inaccurate due
+ *       to the loss in precision caused by the division.
+ */
+int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	struct sched_entity *curr = cfs_rq->curr;
+	s64 avg = cfs_rq->avg_vruntime;
+	long load = cfs_rq->avg_load;
+
+	if (curr && curr->on_rq) {
+		unsigned long weight = scale_load_down(curr->load.weight);
+
+		avg += entity_key(cfs_rq, curr) * weight;
+		load += weight;
+	}
+
+	return avg >= entity_key(cfs_rq, se) * load;
 }
 
 static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
@@ -714,8 +776,8 @@ static u64 __update_min_vruntime(struct
 
 static void update_min_vruntime(struct cfs_rq *cfs_rq)
 {
+	struct sched_entity *se = __pick_first_entity(cfs_rq);
 	struct sched_entity *curr = cfs_rq->curr;
-	struct rb_node *leftmost = rb_first_cached(&cfs_rq->tasks_timeline);
 
 	u64 vruntime = cfs_rq->min_vruntime;
 
@@ -726,9 +788,7 @@ static void update_min_vruntime(struct c
 			curr = NULL;
 	}
 
-	if (leftmost) { /* non-empty tree */
-		struct sched_entity *se = __node_2_se(leftmost);
-
+	if (se) {
 		if (!curr)
 			vruntime = se->vruntime;
 		else
@@ -745,18 +805,50 @@ static inline bool __entity_less(struct
 	return entity_before(__node_2_se(a), __node_2_se(b));
 }
 
+#define deadline_gt(field, lse, rse) ({ (s64)((lse)->field - (rse)->field) > 0; })
+
+static inline void __update_min_deadline(struct sched_entity *se, struct rb_node *node)
+{
+	if (node) {
+		struct sched_entity *rse = __node_2_se(node);
+		if (deadline_gt(min_deadline, se, rse))
+			se->min_deadline = rse->min_deadline;
+	}
+}
+
+/*
+ * se->min_deadline = min(se->deadline, left->min_deadline, right->min_deadline)
+ */
+static inline bool min_deadline_update(struct sched_entity *se, bool exit)
+{
+	u64 old_min_deadline = se->min_deadline;
+	struct rb_node *node = &se->run_node;
+
+	se->min_deadline = se->deadline;
+	__update_min_deadline(se, node->rb_right);
+	__update_min_deadline(se, node->rb_left);
+
+	return se->min_deadline == old_min_deadline;
+}
+
+RB_DECLARE_CALLBACKS(static, min_deadline_cb, struct sched_entity,
+		     run_node, min_deadline, min_deadline_update);
+
 /*
  * Enqueue an entity into the rb-tree:
  */
 static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	avg_vruntime_add(cfs_rq, se);
-	rb_add_cached(&se->run_node, &cfs_rq->tasks_timeline, __entity_less);
+	se->min_deadline = se->deadline;
+	rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
+				__entity_less, &min_deadline_cb);
 }
 
 static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline);
+	rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
+				  &min_deadline_cb);
 	avg_vruntime_sub(cfs_rq, se);
 }
 
@@ -780,6 +872,97 @@ static struct sched_entity *__pick_next_
 	return __node_2_se(next);
 }
 
+static struct sched_entity *pick_cfs(struct cfs_rq *cfs_rq, struct sched_entity *curr)
+{
+	struct sched_entity *left = __pick_first_entity(cfs_rq);
+
+	/*
+	 * If curr is set we have to see if its left of the leftmost entity
+	 * still in the tree, provided there was anything in the tree at all.
+	 */
+	if (!left || (curr && entity_before(curr, left)))
+		left = curr;
+
+	return left;
+}
+
+/*
+ * Earliest Eligible Virtual Deadline First
+ *
+ * In order to provide latency guarantees for different request sizes
+ * EEVDF selects the best runnable task from two criteria:
+ *
+ *  1) the task must be eligible (must be owed service)
+ *
+ *  2) from those tasks that meet 1), we select the one
+ *     with the earliest virtual deadline.
+ *
+ * We can do this in O(log n) time due to an augmented RB-tree. The
+ * tree keeps the entries sorted on service, but also functions as a
+ * heap based on the deadline by keeping:
+ *
+ *  se->min_deadline = min(se->deadline, se->{left,right}->min_deadline)
+ *
+ * Which allows an EDF like search on (sub)trees.
+ */
+static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
+{
+	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
+	struct sched_entity *curr = cfs_rq->curr;
+	struct sched_entity *best = NULL;
+
+	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
+		curr = NULL;
+
+	while (node) {
+		struct sched_entity *se = __node_2_se(node);
+
+		/*
+		 * If this entity is not eligible, try the left subtree.
+		 */
+		if (!entity_eligible(cfs_rq, se)) {
+			node = node->rb_left;
+			continue;
+		}
+
+		/*
+		 * If this entity has an earlier deadline than the previous
+		 * best, take this one. If it also has the earliest deadline
+		 * of its subtree, we're done.
+		 */
+		if (!best || deadline_gt(deadline, best, se)) {
+			best = se;
+			if (best->deadline == best->min_deadline)
+				break;
+		}
+
+		/*
+		 * If the earliest deadline in this subtree is in the fully
+		 * eligible left half of our space, go there.
+		 */
+		if (node->rb_left &&
+		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
+			node = node->rb_left;
+			continue;
+		}
+
+		node = node->rb_right;
+	}
+
+	if (!best || (curr && deadline_gt(deadline, best, curr)))
+		best = curr;
+
+	if (unlikely(!best)) {
+		struct sched_entity *left = __pick_first_entity(cfs_rq);
+		if (left) {
+			pr_err("EEVDF scheduling fail, picking leftmost\n");
+			return left;
+		}
+	}
+
+	return best;
+}
+
 #ifdef CONFIG_SCHED_DEBUG
 struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
 {
@@ -822,17 +1005,6 @@ long calc_latency_offset(int prio)
 }
 
 /*
- * delta /= w
- */
-static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
-{
-	if (unlikely(se->load.weight != NICE_0_LOAD))
-		delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
-
-	return delta;
-}
-
-/*
  * The idea is to set a period in which each task runs once.
  *
  * When there are too many tasks (sched_nr_latency) we have to stretch
@@ -897,6 +1069,38 @@ static u64 sched_slice(struct cfs_rq *cf
 	return slice;
 }
 
+/*
+ * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
+ * this is probably good enough.
+ */
+static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	if ((s64)(se->vruntime - se->deadline) < 0)
+		return;
+
+	if (sched_feat(EEVDF)) {
+		/*
+		 * For EEVDF the virtual time slope is determined by w_i (iow.
+		 * nice) while the request time r_i is determined by
+		 * latency-nice.
+		 */
+		se->slice = se->latency_offset;
+	} else {
+		/*
+		 * When many tasks blow up the sched_period; it is possible
+		 * that sched_slice() reports unusually large results (when
+		 * many tasks are very light for example). Therefore impose a
+		 * maximum.
+		 */
+		se->slice = min_t(u64, sched_slice(cfs_rq, se), sysctl_sched_latency);
+	}
+
+	/*
+	 * EEVDF: vd_i = ve_i + r_i / w_i
+	 */
+	se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
+}
+
 #include "pelt.h"
 #ifdef CONFIG_SMP
 
@@ -1029,6 +1233,7 @@ static void update_curr(struct cfs_rq *c
 	schedstat_add(cfs_rq->exec_clock, delta_exec);
 
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
+	update_deadline(cfs_rq, curr);
 	update_min_vruntime(cfs_rq);
 
 	if (entity_is_task(curr)) {
@@ -4796,6 +5001,7 @@ static inline bool entity_is_long_sleepe
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
+	u64 vslice = calc_delta_fair(se->slice, se);
 	u64 vruntime = avg_vruntime(cfs_rq);
 	s64 lag = 0;
 
@@ -4834,9 +5040,9 @@ place_entity(struct cfs_rq *cfs_rq, stru
 		 */
 		load = cfs_rq->avg_load;
 		if (curr && curr->on_rq)
-			load += curr->load.weight;
+			load += scale_load_down(curr->load.weight);
 
-		lag *= load + se->load.weight;
+		lag *= load + scale_load_down(se->load.weight);
 		if (WARN_ON_ONCE(!load))
 			load = 1;
 		lag = div_s64(lag, load);
@@ -4877,6 +5083,19 @@ place_entity(struct cfs_rq *cfs_rq, stru
 	}
 
 	se->vruntime = vruntime;
+
+	/*
+	 * When joining the competition; the existing tasks will be,
+	 * on average, halfway through their slice, as such start tasks
+	 * off with half a slice to ease into the competition.
+	 */
+	if (sched_feat(PLACE_DEADLINE_INITIAL) && initial)
+		vslice /= 2;
+
+	/*
+	 * EEVDF: vd_i = ve_i + r_i/w_i
+	 */
+	se->deadline = se->vruntime + vslice;
 }
 
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
@@ -5088,19 +5307,20 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 static void
 check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 {
-	unsigned long ideal_runtime, delta_exec;
+	unsigned long delta_exec;
 	struct sched_entity *se;
 	s64 delta;
 
-	/*
-	 * When many tasks blow up the sched_period; it is possible that
-	 * sched_slice() reports unusually large results (when many tasks are
-	 * very light for example). Therefore impose a maximum.
-	 */
-	ideal_runtime = min_t(u64, sched_slice(cfs_rq, curr), sysctl_sched_latency);
+	if (sched_feat(EEVDF)) {
+		if (pick_eevdf(cfs_rq) != curr)
+			goto preempt;
+
+		return;
+	}
 
 	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
-	if (delta_exec > ideal_runtime) {
+	if (delta_exec > curr->slice) {
+preempt:
 		resched_curr(rq_of(cfs_rq));
 		/*
 		 * The current task ran long enough, ensure it doesn't get
@@ -5124,7 +5344,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq
 	if (delta < 0)
 		return;
 
-	if (delta > ideal_runtime)
+	if (delta > curr->slice)
 		resched_curr(rq_of(cfs_rq));
 }
 
@@ -5179,17 +5399,20 @@ wakeup_preempt_entity(struct sched_entit
 static struct sched_entity *
 pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 {
-	struct sched_entity *left = __pick_first_entity(cfs_rq);
-	struct sched_entity *se;
+	struct sched_entity *left, *se;
 
-	/*
-	 * If curr is set we have to see if its left of the leftmost entity
-	 * still in the tree, provided there was anything in the tree at all.
-	 */
-	if (!left || (curr && entity_before(curr, left)))
-		left = curr;
+	if (sched_feat(EEVDF)) {
+		/*
+		 * Enabling NEXT_BUDDY will affect latency but not fairness.
+		 */
+		if (sched_feat(NEXT_BUDDY) &&
+		    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
+			return cfs_rq->next;
+
+		return pick_eevdf(cfs_rq);
+	}
 
-	se = left; /* ideally we run the leftmost entity */
+	se = left = pick_cfs(cfs_rq, curr);
 
 	/*
 	 * Avoid running the skip buddy, if running something else can
@@ -6284,13 +6507,12 @@ static inline void unthrottle_offline_cf
 static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 	SCHED_WARN_ON(task_rq(p) != rq);
 
 	if (rq->cfs.h_nr_running > 1) {
-		u64 slice = sched_slice(cfs_rq, se);
 		u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+		u64 slice = se->slice;
 		s64 delta = slice - ran;
 
 		if (delta < 0) {
@@ -8010,7 +8232,19 @@ static void check_preempt_wakeup(struct
 	if (cse_is_idle != pse_is_idle)
 		return;
 
-	update_curr(cfs_rq_of(se));
+	cfs_rq = cfs_rq_of(se);
+	update_curr(cfs_rq);
+
+	if (sched_feat(EEVDF)) {
+		/*
+		 * XXX pick_eevdf(cfs_rq) != se ?
+		 */
+		if (pick_eevdf(cfs_rq) == pse)
+			goto preempt;
+
+		return;
+	}
+
 	if (wakeup_preempt_entity(se, pse) == 1) {
 		/*
 		 * Bias pick_next to pick the sched entity that is
@@ -8256,7 +8490,7 @@ static void yield_task_fair(struct rq *r
 
 	clear_buddies(cfs_rq, se);
 
-	if (curr->policy != SCHED_BATCH) {
+	if (sched_feat(EEVDF) || curr->policy != SCHED_BATCH) {
 		update_rq_clock(rq);
 		/*
 		 * Update run-time statistics of the 'current'.
@@ -8269,6 +8503,8 @@ static void yield_task_fair(struct rq *r
 		 */
 		rq_clock_skip_update(rq);
 	}
+	if (sched_feat(EEVDF))
+		se->deadline += calc_delta_fair(se->slice, se);
 
 	set_skip_buddy(se);
 }
@@ -12012,8 +12248,8 @@ static void rq_offline_fair(struct rq *r
 static inline bool
 __entity_slice_used(struct sched_entity *se, int min_nr_tasks)
 {
-	u64 slice = sched_slice(cfs_rq_of(se), se);
 	u64 rtime = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+	u64 slice = se->slice;
 
 	return (rtime * min_nr_tasks > slice);
 }
@@ -12728,7 +12964,7 @@ static unsigned int get_rr_interval_fair
 	 * idle runqueue:
 	 */
 	if (rq->cfs.load.weight)
-		rr_interval = NS_TO_JIFFIES(sched_slice(cfs_rq_of(se), se));
+		rr_interval = NS_TO_JIFFIES(se->slice);
 
 	return rr_interval;
 }
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -13,6 +13,7 @@ SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true)
  * sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
  */
 SCHED_FEAT(PLACE_LAG, true)
+SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
 
 /*
  * Prefer to schedule the task we woke last (assuming it failed
@@ -103,3 +104,5 @@ SCHED_FEAT(LATENCY_WARN, false)
 
 SCHED_FEAT(ALT_PERIOD, true)
 SCHED_FEAT(BASE_SLICE, true)
+
+SCHED_FEAT(EEVDF, true)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3316,5 +3316,6 @@ static inline void switch_mm_cid(struct
 #endif
 
 extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
+extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
 
 #endif /* _KERNEL_SCHED_SCHED_H */


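As a companion to the entity_eligible() math above, here is a stand-alone
user-space sketch of the same division-free test (made-up values, scaled-down
weights, no curr handling; explicitly not kernel code). Instead of computing
the weighted average V and comparing, it checks
\Sum (v_i - v0)*w_i >= (v_j - v0) * \Sum w_i directly:

#include <stdio.h>

struct ent { long long vruntime; long long weight; };

int main(void)
{
	/* three runnable entities; v0 is the zero-lag reference (min_vruntime) */
	struct ent e[] = {
		{ .vruntime = 100, .weight = 1024 },
		{ .vruntime = 130, .weight = 2048 },
		{ .vruntime = 160, .weight = 1024 },
	};
	long long v0 = 100, avg = 0, load = 0;
	int i, j;

	for (i = 0; i < 3; i++) {
		avg  += (e[i].vruntime - v0) * e[i].weight;	/* avg_vruntime_add() */
		load += e[i].weight;
	}

	for (j = 0; j < 3; j++) {
		long long key = e[j].vruntime - v0;		/* entity_key() */

		/* lag_j >= 0  <=>  V >= v_j  <=>  avg >= key * load */
		printf("v=%lld: %s\n", e[j].vruntime,
		       avg >= key * load ? "eligible" : "not eligible");
	}
	return 0;	/* prints: eligible, eligible, not eligible */
}

The middle entity sits exactly on the average and still counts as eligible,
matching the '>=' in the kernel test.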

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 09/17] sched: Commit to lag based placement
  2023-03-28  9:26 [PATCH 00/17] sched: EEVDF using latency-nice Peter Zijlstra
                   ` (7 preceding siblings ...)
  2023-03-28  9:26 ` [PATCH 08/17] sched/fair: Implement an EEVDF like policy Peter Zijlstra
@ 2023-03-28  9:26 ` Peter Zijlstra
  2023-03-28  9:26 ` [PATCH 10/17] sched/smp: Use lag to simplify cross-runqueue placement Peter Zijlstra
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2023-03-28  9:26 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault

Removes the FAIR_SLEEPERS code in favour of the new LAG based
placement.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c     |   59 ------------------------------------------------
 kernel/sched/features.h |    8 ------
 2 files changed, 1 insertion(+), 66 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4970,29 +4970,6 @@ static void check_spread(struct cfs_rq *
 #endif
 }
 
-static inline bool entity_is_long_sleeper(struct sched_entity *se)
-{
-	struct cfs_rq *cfs_rq;
-	u64 sleep_time;
-
-	if (se->exec_start == 0)
-		return false;
-
-	cfs_rq = cfs_rq_of(se);
-
-	sleep_time = rq_clock_task(rq_of(cfs_rq));
-
-	/* Happen while migrating because of clock task divergence */
-	if (sleep_time <= se->exec_start)
-		return false;
-
-	sleep_time -= se->exec_start;
-	if (sleep_time > ((1ULL << 63) / scale_load_down(NICE_0_LOAD)))
-		return true;
-
-	return false;
-}
-
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
@@ -5041,43 +5018,9 @@ place_entity(struct cfs_rq *cfs_rq, stru
 		if (WARN_ON_ONCE(!load))
 			load = 1;
 		lag = div_s64(lag, load);
-
-		vruntime -= lag;
-	}
-
-	if (sched_feat(FAIR_SLEEPERS)) {
-
-		/* sleeps up to a single latency don't count. */
-		if (!initial) {
-			unsigned long thresh;
-
-			if (se_is_idle(se))
-				thresh = sysctl_sched_min_granularity;
-			else
-				thresh = sysctl_sched_latency;
-
-			/*
-			 * Halve their sleep time's effect, to allow
-			 * for a gentler effect of sleepers:
-			 */
-			if (sched_feat(GENTLE_FAIR_SLEEPERS))
-				thresh >>= 1;
-
-			vruntime -= thresh;
-		}
-
-		/*
-		 * Pull vruntime of the entity being placed to the base level of
-		 * cfs_rq, to prevent boosting it if placed backwards.  If the entity
-		 * slept for a long time, don't even try to compare its vruntime with
-		 * the base as it may be too far off and the comparison may get
-		 * inversed due to s64 overflow.
-		 */
-		if (!entity_is_long_sleeper(se))
-			vruntime = max_vruntime(se->vruntime, vruntime);
 	}
 
-	se->vruntime = vruntime;
+	se->vruntime = vruntime - lag;
 
 	/*
 	 * When joining the competition; the existing tasks will be,
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -1,14 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
 /*
- * Only give sleepers 50% of their service deficit. This allows
- * them to run sooner, but does not allow tons of sleepers to
- * rip the spread apart.
- */
-SCHED_FEAT(FAIR_SLEEPERS, false)
-SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true)
-
-/*
  * Using the avg_vruntime, do the right thing and preserve lag across
  * sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
  */


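The div_s64() in the place_entity() context above exists because adding the
woken entity's own weight to the average moves V itself. A quick user-space
check with invented round numbers (a sketch of the idea, not the kernel code
path) shows that scaling the preserved lag by (W + w) / W before subtracting
it leaves the entity with exactly its old lag once it is part of the average:

#include <stdio.h>

int main(void)
{
	long long V   = 1000;	/* avg_vruntime before insertion      */
	long long W   = 2048;	/* avg_load before insertion          */
	long long w   = 1024;	/* weight of the entity being placed  */
	long long lag = 40;	/* preserved (clamped) se->vlag        */

	/* lag *= (W + w); lag = div_s64(lag, W);  then  v = V - lag */
	long long v = V - (lag * (W + w)) / W;

	/* weighted average after the entity joins the tree */
	long long V_new = (V * W + v * w) / (W + w);

	printf("placed at v=%lld, new V=%lld, lag now %lld\n",
	       v, V_new, V_new - v);	/* lag now 40, unchanged */
	return 0;
}

With W=2048, w=1024 and a preserved lag of 40 this prints a post-insertion
lag of 40, i.e. the sleep/wake cycle neither gained nor lost service.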

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 10/17] sched/smp: Use lag to simplify cross-runqueue placement
  2023-03-28  9:26 [PATCH 00/17] sched: EEVDF using latency-nice Peter Zijlstra
                   ` (8 preceding siblings ...)
  2023-03-28  9:26 ` [PATCH 09/17] sched: Commit to lag based placement Peter Zijlstra
@ 2023-03-28  9:26 ` Peter Zijlstra
  2023-03-28  9:26 ` [PATCH 11/17] sched: Commit to EEVDF Peter Zijlstra
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2023-03-28  9:26 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault

Using lag is both more correct and simpler when moving between
runqueues.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |  145 ++++++----------------------------------------------
 1 file changed, 19 insertions(+), 126 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4985,7 +4985,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
 	 *
 	 * EEVDF: placement strategy #1 / #2
 	 */
-	if (sched_feat(PLACE_LAG) && cfs_rq->nr_running > 1) {
+	if (sched_feat(PLACE_LAG) && cfs_rq->nr_running) {
 		struct sched_entity *curr = cfs_rq->curr;
 		unsigned long load;
 
@@ -5040,61 +5040,21 @@ static void check_enqueue_throttle(struc
 
 static inline bool cfs_bandwidth_used(void);
 
-/*
- * MIGRATION
- *
- *	dequeue
- *	  update_curr()
- *	    update_min_vruntime()
- *	  vruntime -= min_vruntime
- *
- *	enqueue
- *	  update_curr()
- *	    update_min_vruntime()
- *	  vruntime += min_vruntime
- *
- * this way the vruntime transition between RQs is done when both
- * min_vruntime are up-to-date.
- *
- * WAKEUP (remote)
- *
- *	->migrate_task_rq_fair() (p->state == TASK_WAKING)
- *	  vruntime -= min_vruntime
- *
- *	enqueue
- *	  update_curr()
- *	    update_min_vruntime()
- *	  vruntime += min_vruntime
- *
- * this way we don't have the most up-to-date min_vruntime on the originating
- * CPU and an up-to-date min_vruntime on the destination CPU.
- */
-
 static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
-	bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED);
 	bool curr = cfs_rq->curr == se;
 
 	/*
 	 * If we're the current task, we must renormalise before calling
 	 * update_curr().
 	 */
-	if (renorm && curr)
-		se->vruntime += cfs_rq->min_vruntime;
+	if (curr)
+		place_entity(cfs_rq, se, 0);
 
 	update_curr(cfs_rq);
 
 	/*
-	 * Otherwise, renormalise after, such that we're placed at the current
-	 * moment in time, instead of some random moment in the past. Being
-	 * placed in the past could significantly boost this task to the
-	 * fairness detriment of existing tasks.
-	 */
-	if (renorm && !curr)
-		se->vruntime += cfs_rq->min_vruntime;
-
-	/*
 	 * When enqueuing a sched_entity, we must:
 	 *   - Update loads to have both entity and cfs_rq synced with now.
 	 *   - For group_entity, update its runnable_weight to reflect the new
@@ -5105,11 +5065,22 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	 */
 	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
 	se_update_runnable(se);
+	/*
+	 * XXX update_load_avg() above will have attached us to the pelt sum;
+	 * but update_cfs_group() here will re-adjust the weight and have to
+	 * undo/redo all that. Seems wasteful.
+	 */
 	update_cfs_group(se);
-	account_entity_enqueue(cfs_rq, se);
 
-	if (flags & ENQUEUE_WAKEUP)
+	/*
+	 * XXX now that the entity has been re-weighted, and its lag adjusted,
+	 * we can place the entity.
+	 */
+	if (!curr)
 		place_entity(cfs_rq, se, 0);
+
+	account_entity_enqueue(cfs_rq, se);
+
 	/* Entity has migrated, no longer consider this task hot */
 	if (flags & ENQUEUE_MIGRATED)
 		se->exec_start = 0;
@@ -5204,23 +5175,12 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 
 	clear_buddies(cfs_rq, se);
 
-	if (flags & DEQUEUE_SLEEP)
-		update_entity_lag(cfs_rq, se);
-
+	update_entity_lag(cfs_rq, se);
 	if (se != cfs_rq->curr)
 		__dequeue_entity(cfs_rq, se);
 	se->on_rq = 0;
 	account_entity_dequeue(cfs_rq, se);
 
-	/*
-	 * Normalize after update_curr(); which will also have moved
-	 * min_vruntime if @se is the one holding it back. But before doing
-	 * update_min_vruntime() again, which will discount @se's position and
-	 * can move min_vruntime forward still more.
-	 */
-	if (!(flags & DEQUEUE_SLEEP))
-		se->vruntime -= cfs_rq->min_vruntime;
-
 	/* return excess runtime on last dequeue */
 	return_cfs_rq_runtime(cfs_rq);
 
@@ -7975,18 +7935,6 @@ static void migrate_task_rq_fair(struct
 {
 	struct sched_entity *se = &p->se;
 
-	/*
-	 * As blocked tasks retain absolute vruntime the migration needs to
-	 * deal with this by subtracting the old and adding the new
-	 * min_vruntime -- the latter is done by enqueue_entity() when placing
-	 * the task on the new runqueue.
-	 */
-	if (READ_ONCE(p->__state) == TASK_WAKING) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
-
-		se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
-	}
-
 	if (!task_on_rq_migrating(p)) {
 		remove_entity_load_avg(se);
 
@@ -12331,8 +12279,8 @@ static void task_tick_fair(struct rq *rq
  */
 static void task_fork_fair(struct task_struct *p)
 {
-	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se, *curr;
+	struct cfs_rq *cfs_rq;
 	struct rq *rq = this_rq();
 	struct rq_flags rf;
 
@@ -12341,22 +12289,9 @@ static void task_fork_fair(struct task_s
 
 	cfs_rq = task_cfs_rq(current);
 	curr = cfs_rq->curr;
-	if (curr) {
+	if (curr)
 		update_curr(cfs_rq);
-		se->vruntime = curr->vruntime;
-	}
 	place_entity(cfs_rq, se, 1);
-
-	if (sysctl_sched_child_runs_first && curr && entity_before(curr, se)) {
-		/*
-		 * Upon rescheduling, sched_class::put_prev_task() will place
-		 * 'current' within the tree based on its new key value.
-		 */
-		swap(curr->vruntime, se->vruntime);
-		resched_curr(rq);
-	}
-
-	se->vruntime -= cfs_rq->min_vruntime;
 	rq_unlock(rq, &rf);
 }
 
@@ -12385,34 +12320,6 @@ prio_changed_fair(struct rq *rq, struct
 		check_preempt_curr(rq, p, 0);
 }
 
-static inline bool vruntime_normalized(struct task_struct *p)
-{
-	struct sched_entity *se = &p->se;
-
-	/*
-	 * In both the TASK_ON_RQ_QUEUED and TASK_ON_RQ_MIGRATING cases,
-	 * the dequeue_entity(.flags=0) will already have normalized the
-	 * vruntime.
-	 */
-	if (p->on_rq)
-		return true;
-
-	/*
-	 * When !on_rq, vruntime of the task has usually NOT been normalized.
-	 * But there are some cases where it has already been normalized:
-	 *
-	 * - A forked child which is waiting for being woken up by
-	 *   wake_up_new_task().
-	 * - A task which has been woken up by try_to_wake_up() and
-	 *   waiting for actually being woken up by sched_ttwu_pending().
-	 */
-	if (!se->sum_exec_runtime ||
-	    (READ_ONCE(p->__state) == TASK_WAKING && p->sched_remote_wakeup))
-		return true;
-
-	return false;
-}
-
 #ifdef CONFIG_FAIR_GROUP_SCHED
 /*
  * Propagate the changes of the sched_entity across the tg tree to make it
@@ -12483,16 +12390,6 @@ static void attach_entity_cfs_rq(struct
 static void detach_task_cfs_rq(struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-
-	if (!vruntime_normalized(p)) {
-		/*
-		 * Fix up our vruntime so that the current sleep doesn't
-		 * cause 'unlimited' sleep bonus.
-		 */
-		place_entity(cfs_rq, se, 0);
-		se->vruntime -= cfs_rq->min_vruntime;
-	}
 
 	detach_entity_cfs_rq(se);
 }
@@ -12500,12 +12397,8 @@ static void detach_task_cfs_rq(struct ta
 static void attach_task_cfs_rq(struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 	attach_entity_cfs_rq(se);
-
-	if (!vruntime_normalized(p))
-		se->vruntime += cfs_rq->min_vruntime;
 }
 
 static void switched_from_fair(struct rq *rq, struct task_struct *p)


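The effect on migration can be illustrated with two runqueues whose virtual
clocks have drifted arbitrarily far apart. All numbers below are invented and
the weight compensation from place_entity() is left out for brevity; the
point is only that the absolute vruntime means nothing on the destination
while the lag recorded at dequeue transfers directly:

#include <stdio.h>

int main(void)
{
	long long src_avg = 5000000;	/* avg_vruntime on the source rq */
	long long dst_avg =  900000;	/* avg_vruntime on the dest rq   */
	long long task_v  = 4980000;	/* task's vruntime on the source */

	/* dequeue_entity() -> update_entity_lag(): vlag = V - v_i */
	long long vlag = src_avg - task_v;

	/* enqueue_entity() -> place_entity(): start from the local average
	 * and subtract the preserved lag (weight compensation omitted) */
	long long placed = dst_avg - vlag;

	printf("vlag=%lld -> placed at %lld, %lld ahead of dst average\n",
	       vlag, placed, dst_avg - placed);
	return 0;	/* vlag=20000 -> placed at 880000, 20000 ahead */
}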

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 11/17] sched: Commit to EEVDF
  2023-03-28  9:26 [PATCH 00/17] sched: EEVDF using latency-nice Peter Zijlstra
                   ` (9 preceding siblings ...)
  2023-03-28  9:26 ` [PATCH 10/17] sched/smp: Use lag to simplify cross-runqueue placement Peter Zijlstra
@ 2023-03-28  9:26 ` Peter Zijlstra
  2023-03-28  9:26 ` [PATCH 12/17] sched/debug: Rename min_granularity to base_slice Peter Zijlstra
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2023-03-28  9:26 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault

Remove all the dead code...

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/debug.c    |    6 
 kernel/sched/fair.c     |  440 +++---------------------------------------------
 kernel/sched/features.h |   12 -
 kernel/sched/sched.h    |    5 
 4 files changed, 31 insertions(+), 432 deletions(-)

--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -308,10 +308,7 @@ static __init int sched_init_debug(void)
 	debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
 #endif
 
-	debugfs_create_u32("latency_ns", 0644, debugfs_sched, &sysctl_sched_latency);
 	debugfs_create_u32("min_granularity_ns", 0644, debugfs_sched, &sysctl_sched_min_granularity);
-	debugfs_create_u32("idle_min_granularity_ns", 0644, debugfs_sched, &sysctl_sched_idle_min_granularity);
-	debugfs_create_u32("wakeup_granularity_ns", 0644, debugfs_sched, &sysctl_sched_wakeup_granularity);
 
 	debugfs_create_u32("latency_warn_ms", 0644, debugfs_sched, &sysctl_resched_latency_warn_ms);
 	debugfs_create_u32("latency_warn_once", 0644, debugfs_sched, &sysctl_resched_latency_warn_once);
@@ -819,10 +816,7 @@ static void sched_debug_header(struct se
 	SEQ_printf(m, "  .%-40s: %Ld\n", #x, (long long)(x))
 #define PN(x) \
 	SEQ_printf(m, "  .%-40s: %Ld.%06ld\n", #x, SPLIT_NS(x))
-	PN(sysctl_sched_latency);
 	PN(sysctl_sched_min_granularity);
-	PN(sysctl_sched_idle_min_granularity);
-	PN(sysctl_sched_wakeup_granularity);
 	P(sysctl_sched_child_runs_first);
 	P(sysctl_sched_features);
 #undef PN
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -58,22 +58,6 @@
 #include "autogroup.h"
 
 /*
- * Targeted preemption latency for CPU-bound tasks:
- *
- * NOTE: this latency value is not the same as the concept of
- * 'timeslice length' - timeslices in CFS are of variable length
- * and have no persistent notion like in traditional, time-slice
- * based scheduling concepts.
- *
- * (to see the precise effective timeslice length of your workload,
- *  run vmstat and monitor the context-switches (cs) field)
- *
- * (default: 6ms * (1 + ilog(ncpus)), units: nanoseconds)
- */
-unsigned int sysctl_sched_latency			= 6000000ULL;
-static unsigned int normalized_sysctl_sched_latency	= 6000000ULL;
-
-/*
  * The initial- and re-scaling of tunables is configurable
  *
  * Options are:
@@ -95,36 +79,11 @@ unsigned int sysctl_sched_min_granularit
 static unsigned int normalized_sysctl_sched_min_granularity	= 750000ULL;
 
 /*
- * Minimal preemption granularity for CPU-bound SCHED_IDLE tasks.
- * Applies only when SCHED_IDLE tasks compete with normal tasks.
- *
- * (default: 0.75 msec)
- */
-unsigned int sysctl_sched_idle_min_granularity			= 750000ULL;
-
-/*
- * This value is kept at sysctl_sched_latency/sysctl_sched_min_granularity
- */
-static unsigned int sched_nr_latency = 8;
-
-/*
  * After fork, child runs first. If set to 0 (default) then
  * parent will (try to) run first.
  */
 unsigned int sysctl_sched_child_runs_first __read_mostly;
 
-/*
- * SCHED_OTHER wake-up granularity.
- *
- * This option delays the preemption effects of decoupled workloads
- * and reduces their over-scheduling. Synchronous workloads will still
- * have immediate wakeup/sleep latencies.
- *
- * (default: 1 msec * (1 + ilog(ncpus)), units: nanoseconds)
- */
-unsigned int sysctl_sched_wakeup_granularity			= 1000000UL;
-static unsigned int normalized_sysctl_sched_wakeup_granularity	= 1000000UL;
-
 const_debug unsigned int sysctl_sched_migration_cost	= 500000UL;
 
 int sched_thermal_decay_shift;
@@ -279,8 +238,6 @@ static void update_sysctl(void)
 #define SET_SYSCTL(name) \
 	(sysctl_##name = (factor) * normalized_sysctl_##name)
 	SET_SYSCTL(sched_min_granularity);
-	SET_SYSCTL(sched_latency);
-	SET_SYSCTL(sched_wakeup_granularity);
 #undef SET_SYSCTL
 }
 
@@ -853,30 +810,6 @@ struct sched_entity *__pick_first_entity
 	return __node_2_se(left);
 }
 
-static struct sched_entity *__pick_next_entity(struct sched_entity *se)
-{
-	struct rb_node *next = rb_next(&se->run_node);
-
-	if (!next)
-		return NULL;
-
-	return __node_2_se(next);
-}
-
-static struct sched_entity *pick_cfs(struct cfs_rq *cfs_rq, struct sched_entity *curr)
-{
-	struct sched_entity *left = __pick_first_entity(cfs_rq);
-
-	/*
-	 * If curr is set we have to see if its left of the leftmost entity
-	 * still in the tree, provided there was anything in the tree at all.
-	 */
-	if (!left || (curr && entity_before(curr, left)))
-		left = curr;
-
-	return left;
-}
-
 /*
  * Earliest Eligible Virtual Deadline First
  *
@@ -977,14 +910,9 @@ int sched_update_scaling(void)
 {
 	unsigned int factor = get_update_sysctl_factor();
 
-	sched_nr_latency = DIV_ROUND_UP(sysctl_sched_latency,
-					sysctl_sched_min_granularity);
-
 #define WRT_SYSCTL(name) \
 	(normalized_sysctl_##name = sysctl_##name / (factor))
 	WRT_SYSCTL(sched_min_granularity);
-	WRT_SYSCTL(sched_latency);
-	WRT_SYSCTL(sched_wakeup_granularity);
 #undef WRT_SYSCTL
 
 	return 0;
@@ -1000,71 +928,6 @@ long calc_latency_offset(int prio)
 }
 
 /*
- * The idea is to set a period in which each task runs once.
- *
- * When there are too many tasks (sched_nr_latency) we have to stretch
- * this period because otherwise the slices get too small.
- *
- * p = (nr <= nl) ? l : l*nr/nl
- */
-static u64 __sched_period(unsigned long nr_running)
-{
-	if (unlikely(nr_running > sched_nr_latency))
-		return nr_running * sysctl_sched_min_granularity;
-	else
-		return sysctl_sched_latency;
-}
-
-static bool sched_idle_cfs_rq(struct cfs_rq *cfs_rq);
-
-/*
- * We calculate the wall-time slice from the period by taking a part
- * proportional to the weight.
- *
- * s = p*P[w/rw]
- */
-static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-	unsigned int nr_running = cfs_rq->nr_running;
-	struct sched_entity *init_se = se;
-	unsigned int min_gran;
-	u64 slice;
-
-	if (sched_feat(ALT_PERIOD))
-		nr_running = rq_of(cfs_rq)->cfs.h_nr_running;
-
-	slice = __sched_period(nr_running + !se->on_rq);
-
-	for_each_sched_entity(se) {
-		struct load_weight *load;
-		struct load_weight lw;
-		struct cfs_rq *qcfs_rq;
-
-		qcfs_rq = cfs_rq_of(se);
-		load = &qcfs_rq->load;
-
-		if (unlikely(!se->on_rq)) {
-			lw = qcfs_rq->load;
-
-			update_load_add(&lw, se->load.weight);
-			load = &lw;
-		}
-		slice = __calc_delta(slice, se->load.weight, load);
-	}
-
-	if (sched_feat(BASE_SLICE)) {
-		if (se_is_idle(init_se) && !sched_idle_cfs_rq(cfs_rq))
-			min_gran = sysctl_sched_idle_min_granularity;
-		else
-			min_gran = sysctl_sched_min_granularity;
-
-		slice = max_t(u64, slice, min_gran);
-	}
-
-	return slice;
-}
-
-/*
  * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
  * this is probably good enough.
  */
@@ -1073,22 +936,12 @@ static void update_deadline(struct cfs_r
 	if ((s64)(se->vruntime - se->deadline) < 0)
 		return;
 
-	if (sched_feat(EEVDF)) {
-		/*
-		 * For EEVDF the virtual time slope is determined by w_i (iow.
-		 * nice) while the request time r_i is determined by
-		 * latency-nice.
-		 */
-		se->slice = se->latency_offset;
-	} else {
-		/*
-		 * When many tasks blow up the sched_period; it is possible
-		 * that sched_slice() reports unusually large results (when
-		 * many tasks are very light for example). Therefore impose a
-		 * maximum.
-		 */
-		se->slice = min_t(u64, sched_slice(cfs_rq, se), sysctl_sched_latency);
-	}
+	/*
+	 * For EEVDF the virtual time slope is determined by w_i (iow.
+	 * nice) while the request time r_i is determined by
+	 * latency-nice.
+	 */
+	se->slice = se->latency_offset;
 
 	/*
 	 * EEVDF: vd_i = ve_i + r_i / w_i
@@ -4957,19 +4810,6 @@ static inline void update_misfit_status(
 
 #endif /* CONFIG_SMP */
 
-static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-#ifdef CONFIG_SCHED_DEBUG
-	s64 d = se->vruntime - cfs_rq->min_vruntime;
-
-	if (d < 0)
-		d = -d;
-
-	if (d > 3*sysctl_sched_latency)
-		schedstat_inc(cfs_rq->nr_spread_over);
-#endif
-}
-
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
@@ -5087,7 +4927,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 
 	check_schedstat_required();
 	update_stats_enqueue_fair(cfs_rq, se, flags);
-	check_spread(cfs_rq, se);
 	if (!curr)
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
@@ -5099,17 +4938,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	}
 }
 
-static void __clear_buddies_last(struct sched_entity *se)
-{
-	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
-		if (cfs_rq->last != se)
-			break;
-
-		cfs_rq->last = NULL;
-	}
-}
-
 static void __clear_buddies_next(struct sched_entity *se)
 {
 	for_each_sched_entity(se) {
@@ -5121,27 +4949,10 @@ static void __clear_buddies_next(struct
 	}
 }
 
-static void __clear_buddies_skip(struct sched_entity *se)
-{
-	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
-		if (cfs_rq->skip != se)
-			break;
-
-		cfs_rq->skip = NULL;
-	}
-}
-
 static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if (cfs_rq->last == se)
-		__clear_buddies_last(se);
-
 	if (cfs_rq->next == se)
 		__clear_buddies_next(se);
-
-	if (cfs_rq->skip == se)
-		__clear_buddies_skip(se);
 }
 
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
@@ -5205,45 +5016,14 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 static void
 check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 {
-	unsigned long delta_exec;
-	struct sched_entity *se;
-	s64 delta;
-
-	if (sched_feat(EEVDF)) {
-		if (pick_eevdf(cfs_rq) != curr)
-			goto preempt;
-
-		return;
-	}
-
-	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
-	if (delta_exec > curr->slice) {
-preempt:
+	if (pick_eevdf(cfs_rq) != curr) {
 		resched_curr(rq_of(cfs_rq));
 		/*
 		 * The current task ran long enough, ensure it doesn't get
 		 * re-elected due to buddy favours.
 		 */
 		clear_buddies(cfs_rq, curr);
-		return;
 	}
-
-	/*
-	 * Ensure that a task that missed wakeup preemption by a
-	 * narrow margin doesn't have to wait for a full slice.
-	 * This also mitigates buddy induced latencies under load.
-	 */
-	if (delta_exec < sysctl_sched_min_granularity)
-		return;
-
-	se = __pick_first_entity(cfs_rq);
-	delta = curr->vruntime - se->vruntime;
-
-	if (delta < 0)
-		return;
-
-	if (delta > curr->slice)
-		resched_curr(rq_of(cfs_rq));
 }
 
 static void
@@ -5284,9 +5064,6 @@ set_next_entity(struct cfs_rq *cfs_rq, s
 	se->prev_sum_exec_runtime = se->sum_exec_runtime;
 }
 
-static int
-wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
-
 /*
  * Pick the next process, keeping these things in mind, in this order:
  * 1) keep things fair between processes/task groups
@@ -5297,53 +5074,14 @@ wakeup_preempt_entity(struct sched_entit
 static struct sched_entity *
 pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 {
-	struct sched_entity *left, *se;
-
-	if (sched_feat(EEVDF)) {
-		/*
-		 * Enabling NEXT_BUDDY will affect latency but not fairness.
-		 */
-		if (sched_feat(NEXT_BUDDY) &&
-		    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
-			return cfs_rq->next;
-
-		return pick_eevdf(cfs_rq);
-	}
-
-	se = left = pick_cfs(cfs_rq, curr);
-
 	/*
-	 * Avoid running the skip buddy, if running something else can
-	 * be done without getting too unfair.
+	 * Enabling NEXT_BUDDY will affect latency but not fairness.
 	 */
-	if (cfs_rq->skip && cfs_rq->skip == se) {
-		struct sched_entity *second;
+	if (sched_feat(NEXT_BUDDY) &&
+	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
+		return cfs_rq->next;
 
-		if (se == curr) {
-			second = __pick_first_entity(cfs_rq);
-		} else {
-			second = __pick_next_entity(se);
-			if (!second || (curr && entity_before(curr, second)))
-				second = curr;
-		}
-
-		if (second && wakeup_preempt_entity(second, left) < 1)
-			se = second;
-	}
-
-	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
-		/*
-		 * Someone really wants this to run. If it's not unfair, run it.
-		 */
-		se = cfs_rq->next;
-	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
-		/*
-		 * Prefer last buddy, try to return the CPU to a preempted task.
-		 */
-		se = cfs_rq->last;
-	}
-
-	return se;
+	return pick_eevdf(cfs_rq);
 }
 
 static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
@@ -5360,8 +5098,6 @@ static void put_prev_entity(struct cfs_r
 	/* throttle cfs_rqs exceeding runtime */
 	check_cfs_rq_runtime(cfs_rq);
 
-	check_spread(cfs_rq, prev);
-
 	if (prev->on_rq) {
 		update_stats_wait_start_fair(cfs_rq, prev);
 		/* Put 'current' back into the tree. */
@@ -6434,8 +6170,7 @@ static void hrtick_update(struct rq *rq)
 	if (!hrtick_enabled_fair(rq) || curr->sched_class != &fair_sched_class)
 		return;
 
-	if (cfs_rq_of(&curr->se)->nr_running < sched_nr_latency)
-		hrtick_start_fair(rq, curr);
+	hrtick_start_fair(rq, curr);
 }
 #else /* !CONFIG_SCHED_HRTICK */
 static inline void
@@ -6476,17 +6211,6 @@ static int sched_idle_rq(struct rq *rq)
 			rq->nr_running);
 }
 
-/*
- * Returns true if cfs_rq only has SCHED_IDLE entities enqueued. Note the use
- * of idle_nr_running, which does not consider idle descendants of normal
- * entities.
- */
-static bool sched_idle_cfs_rq(struct cfs_rq *cfs_rq)
-{
-	return cfs_rq->nr_running &&
-		cfs_rq->nr_running == cfs_rq->idle_nr_running;
-}
-
 #ifdef CONFIG_SMP
 static int sched_idle_cpu(int cpu)
 {
@@ -7972,66 +7696,6 @@ balance_fair(struct rq *rq, struct task_
 }
 #endif /* CONFIG_SMP */
 
-static unsigned long wakeup_gran(struct sched_entity *se)
-{
-	unsigned long gran = sysctl_sched_wakeup_granularity;
-
-	/*
-	 * Since its curr running now, convert the gran from real-time
-	 * to virtual-time in his units.
-	 *
-	 * By using 'se' instead of 'curr' we penalize light tasks, so
-	 * they get preempted easier. That is, if 'se' < 'curr' then
-	 * the resulting gran will be larger, therefore penalizing the
-	 * lighter, if otoh 'se' > 'curr' then the resulting gran will
-	 * be smaller, again penalizing the lighter task.
-	 *
-	 * This is especially important for buddies when the leftmost
-	 * task is higher priority than the buddy.
-	 */
-	return calc_delta_fair(gran, se);
-}
-
-/*
- * Should 'se' preempt 'curr'.
- *
- *             |s1
- *        |s2
- *   |s3
- *         g
- *      |<--->|c
- *
- *  w(c, s1) = -1
- *  w(c, s2) =  0
- *  w(c, s3) =  1
- *
- */
-static int
-wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
-{
-	s64 gran, vdiff = curr->vruntime - se->vruntime;
-
-	if (vdiff <= 0)
-		return -1;
-
-	gran = wakeup_gran(se);
-	if (vdiff > gran)
-		return 1;
-
-	return 0;
-}
-
-static void set_last_buddy(struct sched_entity *se)
-{
-	for_each_sched_entity(se) {
-		if (SCHED_WARN_ON(!se->on_rq))
-			return;
-		if (se_is_idle(se))
-			return;
-		cfs_rq_of(se)->last = se;
-	}
-}
-
 static void set_next_buddy(struct sched_entity *se)
 {
 	for_each_sched_entity(se) {
@@ -8043,12 +7707,6 @@ static void set_next_buddy(struct sched_
 	}
 }
 
-static void set_skip_buddy(struct sched_entity *se)
-{
-	for_each_sched_entity(se)
-		cfs_rq_of(se)->skip = se;
-}
-
 /*
  * Preempt the current task with a newly woken task if needed:
  */
@@ -8057,7 +7715,6 @@ static void check_preempt_wakeup(struct
 	struct task_struct *curr = rq->curr;
 	struct sched_entity *se = &curr->se, *pse = &p->se;
 	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
-	int scale = cfs_rq->nr_running >= sched_nr_latency;
 	int next_buddy_marked = 0;
 	int cse_is_idle, pse_is_idle;
 
@@ -8073,7 +7730,7 @@ static void check_preempt_wakeup(struct
 	if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
 		return;
 
-	if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
+	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK)) {
 		set_next_buddy(pse);
 		next_buddy_marked = 1;
 	}
@@ -8121,44 +7778,16 @@ static void check_preempt_wakeup(struct
 	cfs_rq = cfs_rq_of(se);
 	update_curr(cfs_rq);
 
-	if (sched_feat(EEVDF)) {
-		/*
-		 * XXX pick_eevdf(cfs_rq) != se ?
-		 */
-		if (pick_eevdf(cfs_rq) == pse)
-			goto preempt;
-
-		return;
-	}
-
-	if (wakeup_preempt_entity(se, pse) == 1) {
-		/*
-		 * Bias pick_next to pick the sched entity that is
-		 * triggering this preemption.
-		 */
-		if (!next_buddy_marked)
-			set_next_buddy(pse);
+	/*
+	 * XXX pick_eevdf(cfs_rq) != se ?
+	 */
+	if (pick_eevdf(cfs_rq) == pse)
 		goto preempt;
-	}
 
 	return;
 
 preempt:
 	resched_curr(rq);
-	/*
-	 * Only set the backward buddy when the current task is still
-	 * on the rq. This can happen when a wakeup gets interleaved
-	 * with schedule on the ->pre_schedule() or idle_balance()
-	 * point, either of which can * drop the rq lock.
-	 *
-	 * Also, during early boot the idle thread is in the fair class,
-	 * for obvious reasons its a bad idea to schedule back to it.
-	 */
-	if (unlikely(!se->on_rq || curr == rq->idle))
-		return;
-
-	if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
-		set_last_buddy(se);
 }
 
 #ifdef CONFIG_SMP
@@ -8359,8 +7988,6 @@ static void put_prev_task_fair(struct rq
 
 /*
  * sched_yield() is very simple
- *
- * The magic of dealing with the ->skip buddy is in pick_next_entity.
  */
 static void yield_task_fair(struct rq *rq)
 {
@@ -8376,23 +8003,19 @@ static void yield_task_fair(struct rq *r
 
 	clear_buddies(cfs_rq, se);
 
-	if (sched_feat(EEVDF) || curr->policy != SCHED_BATCH) {
-		update_rq_clock(rq);
-		/*
-		 * Update run-time statistics of the 'current'.
-		 */
-		update_curr(cfs_rq);
-		/*
-		 * Tell update_rq_clock() that we've just updated,
-		 * so we don't do microscopic update in schedule()
-		 * and double the fastpath cost.
-		 */
-		rq_clock_skip_update(rq);
-	}
-	if (sched_feat(EEVDF))
-		se->deadline += calc_delta_fair(se->slice, se);
+	update_rq_clock(rq);
+	/*
+	 * Update run-time statistics of the 'current'.
+	 */
+	update_curr(cfs_rq);
+	/*
+	 * Tell update_rq_clock() that we've just updated,
+	 * so we don't do microscopic update in schedule()
+	 * and double the fastpath cost.
+	 */
+	rq_clock_skip_update(rq);
 
-	set_skip_buddy(se);
+	se->deadline += calc_delta_fair(se->slice, se);
 }
 
 static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
@@ -8635,8 +8258,7 @@ static int task_hot(struct task_struct *
 	 * Buddy candidates are cache hot:
 	 */
 	if (sched_feat(CACHE_HOT_BUDDY) && env->dst_rq->nr_running &&
-			(&p->se == cfs_rq_of(&p->se)->next ||
-			 &p->se == cfs_rq_of(&p->se)->last))
+	    (&p->se == cfs_rq_of(&p->se)->next))
 		return 1;
 
 	if (sysctl_sched_migration_cost == -1)
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -15,13 +15,6 @@ SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
 SCHED_FEAT(NEXT_BUDDY, false)
 
 /*
- * Prefer to schedule the task that ran last (when we did
- * wake-preempt) as that likely will touch the same data, increases
- * cache locality.
- */
-SCHED_FEAT(LAST_BUDDY, true)
-
-/*
  * Consider buddies to be cache hot, decreases the likeliness of a
  * cache buddy being migrated away, increases cache locality.
  */
@@ -93,8 +86,3 @@ SCHED_FEAT(UTIL_EST, true)
 SCHED_FEAT(UTIL_EST_FASTUP, true)
 
 SCHED_FEAT(LATENCY_WARN, false)
-
-SCHED_FEAT(ALT_PERIOD, true)
-SCHED_FEAT(BASE_SLICE, true)
-
-SCHED_FEAT(EEVDF, true)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -580,8 +580,6 @@ struct cfs_rq {
 	 */
 	struct sched_entity	*curr;
 	struct sched_entity	*next;
-	struct sched_entity	*last;
-	struct sched_entity	*skip;
 
 #ifdef	CONFIG_SCHED_DEBUG
 	unsigned int		nr_spread_over;
@@ -2466,10 +2464,7 @@ extern const_debug unsigned int sysctl_s
 extern const_debug unsigned int sysctl_sched_migration_cost;
 
 #ifdef CONFIG_SCHED_DEBUG
-extern unsigned int sysctl_sched_latency;
 extern unsigned int sysctl_sched_min_granularity;
-extern unsigned int sysctl_sched_idle_min_granularity;
-extern unsigned int sysctl_sched_wakeup_granularity;
 extern int sysctl_resched_latency_warn_ms;
 extern int sysctl_resched_latency_warn_once;
 


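With the CFS paths gone, the pick boils down to the two EEVDF criteria. What
pick_eevdf() computes over the augmented rbtree can be mimicked with a flat
array, minus the O(log n) min_deadline pruning; toy values only, not kernel
code:

#include <stdio.h>

struct ent {
	const char *name;
	long long vruntime, deadline, weight;
};

int main(void)
{
	struct ent e[] = {
		{ "A", 100, 160, 1024 },	/* eligible, later deadline     */
		{ "B", 130, 145, 1024 },	/* eligible, earliest deadline  */
		{ "C", 170, 175, 1024 },	/* too far ahead: not eligible  */
	};
	long long v0 = 100, avg = 0, load = 0;
	struct ent *best = NULL;
	int i;

	for (i = 0; i < 3; i++) {
		avg  += (e[i].vruntime - v0) * e[i].weight;
		load += e[i].weight;
	}

	for (i = 0; i < 3; i++) {
		/* 1) must be eligible: same test as entity_eligible() */
		if (avg < (e[i].vruntime - v0) * load)
			continue;
		/* 2) earliest virtual deadline among the eligible ones */
		if (!best || e[i].deadline < best->deadline)
			best = &e[i];
	}
	printf("pick: %s\n", best ? best->name : "none");	/* pick: B */
	return 0;
}

A is eligible but has a later deadline, and C is not yet eligible, so B is
picked.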

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 12/17] sched/debug: Rename min_granularity to base_slice
  2023-03-28  9:26 [PATCH 00/17] sched: EEVDF using latency-nice Peter Zijlstra
                   ` (10 preceding siblings ...)
  2023-03-28  9:26 ` [PATCH 11/17] sched: Commit to EEVDF Peter Zijlstra
@ 2023-03-28  9:26 ` Peter Zijlstra
  2023-03-28  9:26 ` [PATCH 13/17] sched: Merge latency_offset into slice Peter Zijlstra
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2023-03-28  9:26 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault


Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/debug.c |    4 ++--
 kernel/sched/fair.c  |   10 +++++-----
 kernel/sched/sched.h |    2 +-
 3 files changed, 8 insertions(+), 8 deletions(-)

--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -308,7 +308,7 @@ static __init int sched_init_debug(void)
 	debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
 #endif
 
-	debugfs_create_u32("min_granularity_ns", 0644, debugfs_sched, &sysctl_sched_min_granularity);
+	debugfs_create_u32("base_slice_ns", 0644, debugfs_sched, &sysctl_sched_base_slice);
 
 	debugfs_create_u32("latency_warn_ms", 0644, debugfs_sched, &sysctl_resched_latency_warn_ms);
 	debugfs_create_u32("latency_warn_once", 0644, debugfs_sched, &sysctl_resched_latency_warn_once);
@@ -816,7 +816,7 @@ static void sched_debug_header(struct se
 	SEQ_printf(m, "  .%-40s: %Ld\n", #x, (long long)(x))
 #define PN(x) \
 	SEQ_printf(m, "  .%-40s: %Ld.%06ld\n", #x, SPLIT_NS(x))
-	PN(sysctl_sched_min_granularity);
+	PN(sysctl_sched_base_slice);
 	P(sysctl_sched_child_runs_first);
 	P(sysctl_sched_features);
 #undef PN
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -75,8 +75,8 @@ unsigned int sysctl_sched_tunable_scalin
  *
  * (default: 0.75 msec * (1 + ilog(ncpus)), units: nanoseconds)
  */
-unsigned int sysctl_sched_min_granularity			= 750000ULL;
-static unsigned int normalized_sysctl_sched_min_granularity	= 750000ULL;
+unsigned int sysctl_sched_base_slice			= 750000ULL;
+static unsigned int normalized_sysctl_sched_base_slice	= 750000ULL;
 
 /*
  * After fork, child runs first. If set to 0 (default) then
@@ -237,7 +237,7 @@ static void update_sysctl(void)
 
 #define SET_SYSCTL(name) \
 	(sysctl_##name = (factor) * normalized_sysctl_##name)
-	SET_SYSCTL(sched_min_granularity);
+	SET_SYSCTL(sched_base_slice);
 #undef SET_SYSCTL
 }
 
@@ -882,7 +882,7 @@ int sched_update_scaling(void)
 
 #define WRT_SYSCTL(name) \
 	(normalized_sysctl_##name = sysctl_##name / (factor))
-	WRT_SYSCTL(sched_min_granularity);
+	WRT_SYSCTL(sched_base_slice);
 #undef WRT_SYSCTL
 
 	return 0;
@@ -892,7 +892,7 @@ int sched_update_scaling(void)
 long calc_latency_offset(int prio)
 {
 	u32 weight = sched_prio_to_weight[prio];
-	u64 base = sysctl_sched_min_granularity;
+	u64 base = sysctl_sched_base_slice;
 
 	return div_u64(base << SCHED_FIXEDPOINT_SHIFT, weight);
 }
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2464,7 +2464,7 @@ extern const_debug unsigned int sysctl_s
 extern const_debug unsigned int sysctl_sched_migration_cost;
 
 #ifdef CONFIG_SCHED_DEBUG
-extern unsigned int sysctl_sched_min_granularity;
+extern unsigned int sysctl_sched_base_slice;
 extern int sysctl_resched_latency_warn_ms;
 extern int sysctl_resched_latency_warn_once;
 


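For anyone poking at the renamed knob from user space, something like the
following reads it back; the path assumes debugfs is mounted at the
conventional /sys/kernel/debug, so adjust as needed for your setup:

#include <stdio.h>

int main(void)
{
	/* path assumes the conventional debugfs mount point */
	const char *path = "/sys/kernel/debug/sched/base_slice_ns";
	unsigned long long ns;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%llu", &ns) != 1) {
		fprintf(stderr, "unexpected contents in %s\n", path);
		fclose(f);
		return 1;
	}
	fclose(f);
	printf("base slice: %llu ns\n", ns);
	return 0;
}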

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 13/17] sched: Merge latency_offset into slice
  2023-03-28  9:26 [PATCH 00/17] sched: EEVDF using latency-nice Peter Zijlstra
                   ` (11 preceding siblings ...)
  2023-03-28  9:26 ` [PATCH 12/17] sched/debug: Rename min_granularity to base_slice Peter Zijlstra
@ 2023-03-28  9:26 ` Peter Zijlstra
  2023-03-28  9:26 ` [PATCH 14/17] sched/eevdf: Better handle mixed slice length Peter Zijlstra
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2023-03-28  9:26 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault


Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h |    2 --
 kernel/sched/core.c   |   17 +++++++----------
 kernel/sched/fair.c   |   29 ++++++++++++-----------------
 kernel/sched/sched.h  |    2 +-
 4 files changed, 20 insertions(+), 30 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -573,8 +573,6 @@ struct sched_entity {
 	/* cached value of my_q->h_nr_running */
 	unsigned long			runnable_weight;
 #endif
-	/* preemption offset in ns */
-	long				latency_offset;
 
 #ifdef CONFIG_SMP
 	/*
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1285,9 +1285,10 @@ static void set_load_weight(struct task_
 	}
 }
 
-static void set_latency_offset(struct task_struct *p)
+static inline void set_latency_prio(struct task_struct *p, int prio)
 {
-	p->se.latency_offset = calc_latency_offset(p->latency_prio - MAX_RT_PRIO);
+	p->latency_prio = prio;
+	set_latency_fair(&p->se, prio - MAX_RT_PRIO);
 }
 
 #ifdef CONFIG_UCLAMP_TASK
@@ -4442,7 +4443,7 @@ static void __sched_fork(unsigned long c
 	p->se.vlag			= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
-	set_latency_offset(p);
+	set_latency_prio(p, p->latency_prio);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	p->se.cfs_rq			= NULL;
@@ -4694,9 +4695,7 @@ int sched_fork(unsigned long clone_flags
 
 		p->prio = p->normal_prio = p->static_prio;
 		set_load_weight(p, false);
-
-		p->latency_prio = NICE_TO_PRIO(0);
-		set_latency_offset(p);
+		set_latency_prio(p, NICE_TO_PRIO(0));
 
 		/*
 		 * We don't need the reset flag anymore after the fork. It has
@@ -7469,10 +7468,8 @@ static void __setscheduler_params(struct
 static void __setscheduler_latency(struct task_struct *p,
 				   const struct sched_attr *attr)
 {
-	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
-		p->latency_prio = NICE_TO_PRIO(attr->sched_latency_nice);
-		set_latency_offset(p);
-	}
+	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
+		set_latency_prio(p, NICE_TO_PRIO(attr->sched_latency_nice));
 }
 
 /*
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -919,12 +919,19 @@ int sched_update_scaling(void)
 }
 #endif
 
-long calc_latency_offset(int prio)
+void set_latency_fair(struct sched_entity *se, int prio)
 {
 	u32 weight = sched_prio_to_weight[prio];
 	u64 base = sysctl_sched_base_slice;
 
-	return div_u64(base << SCHED_FIXEDPOINT_SHIFT, weight);
+	/*
+	 * For EEVDF the virtual time slope is determined by w_i (iow.
+	 * nice) while the request time r_i is determined by
+	 * latency-nice.
+	 *
+	 * Smaller request gets better latency.
+	 */
+	se->slice = div_u64(base << SCHED_FIXEDPOINT_SHIFT, weight);
 }
 
 /*
@@ -937,13 +944,6 @@ static void update_deadline(struct cfs_r
 		return;
 
 	/*
-	 * For EEVDF the virtual time slope is determined by w_i (iow.
-	 * nice) while the request time r_i is determined by
-	 * latency-nice.
-	 */
-	se->slice = se->latency_offset;
-
-	/*
 	 * EEVDF: vd_i = ve_i + r_i / w_i
 	 */
 	se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
@@ -12231,7 +12231,7 @@ void init_tg_cfs_entry(struct task_group
 
 	se->my_q = cfs_rq;
 
-	se->latency_offset = calc_latency_offset(tg->latency_prio - MAX_RT_PRIO);
+	set_latency_fair(se, tg->latency_prio - MAX_RT_PRIO);
 
 	/* guarantee group entities always have weight */
 	update_load_set(&se->load, NICE_0_LOAD);
@@ -12365,7 +12365,6 @@ int sched_group_set_idle(struct task_gro
 
 int sched_group_set_latency(struct task_group *tg, int prio)
 {
-	long latency_offset;
 	int i;
 
 	if (tg == &root_task_group)
@@ -12379,13 +12378,9 @@ int sched_group_set_latency(struct task_
 	}
 
 	tg->latency_prio = prio;
-	latency_offset = calc_latency_offset(prio - MAX_RT_PRIO);
 
-	for_each_possible_cpu(i) {
-		struct sched_entity *se = tg->se[i];
-
-		WRITE_ONCE(se->latency_offset, latency_offset);
-	}
+	for_each_possible_cpu(i)
+		set_latency_fair(tg->se[i], prio - MAX_RT_PRIO);
 
 	mutex_unlock(&shares_mutex);
 	return 0;
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2477,7 +2477,7 @@ extern unsigned int sysctl_numa_balancin
 extern unsigned int sysctl_numa_balancing_hot_threshold;
 #endif
 
-extern long calc_latency_offset(int prio);
+extern void set_latency_fair(struct sched_entity *se, int prio);
 
 #ifdef CONFIG_SCHED_HRTICK
 


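With the offset folded into se->slice, the request size a task gets follows
directly from its latency-nice weight. A stand-alone calculation (not kernel
code), using the 0.75ms base slice from the previous patch and the standard
nice weight table values 88761, 1024 and 15:

#include <stdio.h>

#define SCHED_FIXEDPOINT_SHIFT	10
#define BASE_SLICE_NS		750000ULL	/* sysctl_sched_base_slice */

int main(void)
{
	struct { int lnice; unsigned long long weight; } t[] = {
		{ -20, 88761 }, { 0, 1024 }, { 19, 15 },
	};
	int i;

	for (i = 0; i < 3; i++) {
		/* se->slice = div_u64(base << SCHED_FIXEDPOINT_SHIFT, weight) */
		unsigned long long slice =
			(BASE_SLICE_NS << SCHED_FIXEDPOINT_SHIFT) / t[i].weight;

		printf("latency-nice %3d -> request %llu ns\n", t[i].lnice, slice);
	}
	return 0;	/* ~8.6us, 750us and 51.2ms respectively */
}

That spans roughly 8.6us for latency-nice -20 up to 51.2ms for +19, which is
the spread the mixed-slice handling in the next patch has to cope with.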

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 14/17] sched/eevdf: Better handle mixed slice length
  2023-03-28  9:26 [PATCH 00/17] sched: EEVDF using latency-nice Peter Zijlstra
                   ` (12 preceding siblings ...)
  2023-03-28  9:26 ` [PATCH 13/17] sched: Merge latency_offset into slice Peter Zijlstra
@ 2023-03-28  9:26 ` Peter Zijlstra
  2023-03-31 15:26   ` Vincent Guittot
       [not found]   ` <20230401232355.336-1-hdanton@sina.com>
  2023-03-28  9:26 ` [PATCH 15/17] [RFC] sched/eevdf: Sleeper bonus Peter Zijlstra
                   ` (6 subsequent siblings)
  20 siblings, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2023-03-28  9:26 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault

In the case where (due to latency-nice) there are different request
sizes in the tree, the smaller requests tend to be dominated by the
larger. Also note how the EEVDF lag limits are based on r_max.

Therefore, add a heuristic that, for the mixed request size case, moves
smaller requests to placement strategy #2, which ensures they're
immediately eligible and, due to their smaller (virtual) deadline,
will cause preemption.

NOTE: this relies on update_entity_lag() to impose lag limits above
a single slice.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c     |   14 ++++++++++++++
 kernel/sched/features.h |    1 +
 kernel/sched/sched.h    |    1 +
 3 files changed, 16 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -616,6 +616,7 @@ avg_vruntime_add(struct cfs_rq *cfs_rq,
 	s64 key = entity_key(cfs_rq, se);
 
 	cfs_rq->avg_vruntime += key * weight;
+	cfs_rq->avg_slice += se->slice * weight;
 	cfs_rq->avg_load += weight;
 }
 
@@ -626,6 +627,7 @@ avg_vruntime_sub(struct cfs_rq *cfs_rq,
 	s64 key = entity_key(cfs_rq, se);
 
 	cfs_rq->avg_vruntime -= key * weight;
+	cfs_rq->avg_slice -= se->slice * weight;
 	cfs_rq->avg_load -= weight;
 }
 
@@ -4832,6 +4834,18 @@ place_entity(struct cfs_rq *cfs_rq, stru
 		lag = se->vlag;
 
 		/*
+		 * For latency sensitive tasks; those that have a shorter than
+		 * average slice and do not fully consume the slice, transition
+		 * to EEVDF placement strategy #2.
+		 */
+		if (sched_feat(PLACE_FUDGE) &&
+		    cfs_rq->avg_slice > se->slice * cfs_rq->avg_load) {
+			lag += vslice;
+			if (lag > 0)
+				lag = 0;
+		}
+
+		/*
 		 * If we want to place a task and preserve lag, we have to
 		 * consider the effect of the new entity on the weighted
 		 * average and compensate for this, otherwise lag can quickly
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -5,6 +5,7 @@
  * sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
  */
 SCHED_FEAT(PLACE_LAG, true)
+SCHED_FEAT(PLACE_FUDGE, true)
 SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
 
 /*
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -559,6 +559,7 @@ struct cfs_rq {
 	unsigned int		idle_h_nr_running; /* SCHED_IDLE */
 
 	s64			avg_vruntime;
+	u64			avg_slice;
 	u64			avg_load;
 
 	u64			exec_clock;


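The avg_slice bookkeeping keeps the "is this request shorter than the
load-weighted average request" test free of divisions, the same trick
entity_eligible() uses. A toy run with invented numbers (not the kernel path;
the waking entity is not part of the averages because it was dequeued while
sleeping):

#include <stdio.h>

int main(void)
{
	/* two queued entities: 3ms slice @ weight 1024, 0.75ms @ weight 1024 */
	long long avg_slice = 3000000LL * 1024 + 750000LL * 1024;
	long long avg_load  = 1024 + 1024;

	/* the entity being placed: short request, woke up with negative lag */
	long long slice = 750000, vslice = 750000, lag = -100000;

	/* PLACE_FUDGE: avg_slice > se->slice * avg_load
	 *          <=> slice < avg_slice / avg_load  (without the division) */
	if (avg_slice > slice * avg_load) {
		lag += vslice;		/* strategy #2: forgive the lag ...  */
		if (lag > 0)
			lag = 0;	/* ... but never hand out a bonus    */
	}
	printf("adjusted lag = %lld\n", lag);	/* 0: eligible immediately */
	return 0;
}

The short request woke with slightly negative lag, but since it consumed only
a fraction of a slice the vslice forgiveness clamps it to zero: immediately
eligible, and its small virtual deadline will preempt the long-slice
competition.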

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 15/17] [RFC] sched/eevdf: Sleeper bonus
  2023-03-28  9:26 [PATCH 00/17] sched: EEVDF using latency-nice Peter Zijlstra
                   ` (13 preceding siblings ...)
  2023-03-28  9:26 ` [PATCH 14/17] sched/eevdf: Better handle mixed slice length Peter Zijlstra
@ 2023-03-28  9:26 ` Peter Zijlstra
  2023-03-29  9:10   ` Mike Galbraith
  2023-03-28  9:26 ` [PATCH 16/17] [RFC] sched/eevdf: Minimal vavg option Peter Zijlstra
                   ` (5 subsequent siblings)
  20 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2023-03-28  9:26 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault

Add a sleeper bonus hack, but keep it disabled by default. This should
make it easy to test whether regressions are due to it.

Specifically; this 'restores' performance for things like starve and
stress-futex, stress-nanosleep that rely on sleeper bonus to compete
against an always running parent (the fair 67%/33% split vs the
50%/50% bonus thing).

OTOH this completely destroys latency and hackbench (as in 5x worse).
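
To spell out the split mentioned above (my own back-of-the-envelope,
one CPU, made-up numbers): take a child that, alone, would run for t
and then sleep for t (a 50% duty cycle), competing with an
always-runnable parent.

  strictly fair:  while the child is runnable it only gets half the CPU,
                  so its t of work takes 2t of wall time; one cycle is
                  sleep t + runnable 2t = 3t, in which the parent runs
                  2t and the child runs t          -> 67% / 33%

  sleeper bonus:  the freshly woken child is placed far enough left to
                  run its t of work (mostly) uninterrupted; one cycle is
                  sleep t + run t = 2t, each runs t -> 50% / 50%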

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c     |   47 ++++++++++++++++++++++++++++++++++++++++-------
 kernel/sched/features.h |    1 +
 kernel/sched/sched.h    |    3 ++-
 3 files changed, 43 insertions(+), 8 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4819,7 +4819,7 @@ static inline void update_misfit_status(
 #endif /* CONFIG_SMP */
 
 static void
-place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
+place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
 	u64 vslice = calc_delta_fair(se->slice, se);
 	u64 vruntime = avg_vruntime(cfs_rq);
@@ -4878,22 +4878,55 @@ place_entity(struct cfs_rq *cfs_rq, stru
 		if (WARN_ON_ONCE(!load))
 			load = 1;
 		lag = div_s64(lag, load);
+
+		vruntime -= lag;
+	}
+
+	/*
+	 * Base the deadline on the 'normal' EEVDF placement policy in an
+	 * attempt to not let the bonus crud below wreck things completely.
+	 */
+	se->deadline = vruntime;
+
+	/*
+	 * The whole 'sleeper' bonus hack... :-/ This is strictly unfair.
+	 *
+	 * By giving a sleeping task a little boost, it becomes possible for a
+	 * 50% task to compete equally with a 100% task. That is, strictly fair
+	 * that setup would result in a 67% / 33% split. Sleeper bonus will
+	 * change that to 50% / 50%.
+	 *
+	 * This thing hurts my brain, because tasks leaving with negative lag
+	 * will move 'time' backward, so comparing against a historical
+	 * se->vruntime is dodgy as heck.
+	 */
+	if (sched_feat(PLACE_BONUS) &&
+	    (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED)) {
+		/*
+		 * If se->vruntime is ahead of vruntime, something dodgy
+		 * happened and we cannot give bonus due to not having valid
+		 * history.
+		 */
+		if ((s64)(se->vruntime - vruntime) < 0) {
+			vruntime -= se->slice/2;
+			vruntime = max_vruntime(se->vruntime, vruntime);
+		}
 	}
 
-	se->vruntime = vruntime - lag;
+	se->vruntime = vruntime;
 
 	/*
 	 * When joining the competition; the exisiting tasks will be,
 	 * on average, halfway through their slice, as such start tasks
 	 * off with half a slice to ease into the competition.
 	 */
-	if (sched_feat(PLACE_DEADLINE_INITIAL) && initial)
+	if (sched_feat(PLACE_DEADLINE_INITIAL) && (flags & ENQUEUE_INITIAL))
 		vslice /= 2;
 
 	/*
 	 * EEVDF: vd_i = ve_i + r_i/w_i
 	 */
-	se->deadline = se->vruntime + vslice;
+	se->deadline += vslice;
 }
 
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
@@ -4910,7 +4943,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	 * update_curr().
 	 */
 	if (curr)
-		place_entity(cfs_rq, se, 0);
+		place_entity(cfs_rq, se, flags);
 
 	update_curr(cfs_rq);
 
@@ -4937,7 +4970,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	 * we can place the entity.
 	 */
 	if (!curr)
-		place_entity(cfs_rq, se, 0);
+		place_entity(cfs_rq, se, flags);
 
 	account_entity_enqueue(cfs_rq, se);
 
@@ -11933,7 +11966,7 @@ static void task_fork_fair(struct task_s
 	curr = cfs_rq->curr;
 	if (curr)
 		update_curr(cfs_rq);
-	place_entity(cfs_rq, se, 1);
+	place_entity(cfs_rq, se, ENQUEUE_INITIAL);
 	rq_unlock(rq, &rf);
 }
 
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -7,6 +7,7 @@
 SCHED_FEAT(PLACE_LAG, true)
 SCHED_FEAT(PLACE_FUDGE, true)
 SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
+SCHED_FEAT(PLACE_BONUS, false)
 
 /*
  * Prefer to schedule the task we woke last (assuming it failed
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2143,7 +2143,7 @@ extern const u32		sched_prio_to_wmult[40
  * ENQUEUE_HEAD      - place at front of runqueue (tail if not specified)
  * ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline)
  * ENQUEUE_MIGRATED  - the task was migrated during wakeup
- *
+ * ENQUEUE_INITIAL   - place a new task (fork/clone)
  */
 
 #define DEQUEUE_SLEEP		0x01
@@ -2163,6 +2163,7 @@ extern const u32		sched_prio_to_wmult[40
 #else
 #define ENQUEUE_MIGRATED	0x00
 #endif
+#define ENQUEUE_INITIAL		0x80
 
 #define RETRY_TASK		((void *)-1UL)
 



^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 16/17] [RFC] sched/eevdf: Minimal vavg option
  2023-03-28  9:26 [PATCH 00/17] sched: EEVDF using latency-nice Peter Zijlstra
                   ` (14 preceding siblings ...)
  2023-03-28  9:26 ` [PATCH 15/17] [RFC] sched/eevdf: Sleeper bonus Peter Zijlstra
@ 2023-03-28  9:26 ` Peter Zijlstra
  2023-03-28  9:26 ` [PATCH 17/17] [DEBUG] sched/eevdf: Debug / validation crud Peter Zijlstra
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2023-03-28  9:26 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault

An alternative means of tracking min_vruntime to minimize the deltas
going into avg_vruntime -- note that because vavg can move backwards
this is all sorts of tricky.

It is also more expensive because of the extra divisions... I have not
found this convincing.
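
For reference, a userspace sketch (my own, with made-up numbers) of the
re-anchoring arithmetic this option leans on: avg_vruntime is kept as
\Sum w_i * (v_i - min_vruntime), so moving the min_vruntime anchor by
delta means subtracting delta * \Sum w_i from the running sum, which is
what the avg_vruntime_update() call above is assumed to do:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	long w[] = { 1024, 512, 2048 };
	int64_t v[] = { 100, 250, 400 };	/* vruntimes, arbitrary units */
	int64_t min_vruntime = 100;
	int64_t avg_vruntime = 0, avg_load = 0;
	int64_t V, delta, check = 0;
	int i;

	for (i = 0; i < 3; i++) {
		avg_vruntime += (v[i] - min_vruntime) * w[i];
		avg_load += w[i];
	}

	/* re-anchor min_vruntime at the current weighted average V */
	V = min_vruntime + avg_vruntime / avg_load;
	delta = V - min_vruntime;

	avg_vruntime -= delta * avg_load;	/* the re-anchor step */
	min_vruntime = V;

	/* cross-check against a recomputation from scratch */
	for (i = 0; i < 3; i++)
		check += (v[i] - min_vruntime) * w[i];

	printf("incremental %lld vs recomputed %lld\n",
	       (long long)avg_vruntime, (long long)check);
	return 0;
}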

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c     |   51 ++++++++++++++++++++++++++++--------------------
 kernel/sched/features.h |    2 +
 2 files changed, 32 insertions(+), 21 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -732,28 +732,37 @@ static u64 __update_min_vruntime(struct
 
 static void update_min_vruntime(struct cfs_rq *cfs_rq)
 {
-	struct sched_entity *se = __pick_first_entity(cfs_rq);
-	struct sched_entity *curr = cfs_rq->curr;
-
-	u64 vruntime = cfs_rq->min_vruntime;
-
-	if (curr) {
-		if (curr->on_rq)
-			vruntime = curr->vruntime;
-		else
-			curr = NULL;
+	if (sched_feat(MINIMAL_VA)) {
+		u64 vruntime = avg_vruntime(cfs_rq);
+		s64 delta = (s64)(vruntime - cfs_rq->min_vruntime);
+
+		avg_vruntime_update(cfs_rq, delta);
+
+		u64_u32_store(cfs_rq->min_vruntime, vruntime);
+	} else {
+		struct sched_entity *se = __pick_first_entity(cfs_rq);
+		struct sched_entity *curr = cfs_rq->curr;
+
+		u64 vruntime = cfs_rq->min_vruntime;
+
+		if (curr) {
+			if (curr->on_rq)
+				vruntime = curr->vruntime;
+			else
+				curr = NULL;
+		}
+
+		if (se) {
+			if (!curr)
+				vruntime = se->vruntime;
+			else
+				vruntime = min_vruntime(vruntime, se->vruntime);
+		}
+
+		/* ensure we never gain time by being placed backwards. */
+		u64_u32_store(cfs_rq->min_vruntime,
+				__update_min_vruntime(cfs_rq, vruntime));
 	}
-
-	if (se) {
-		if (!curr)
-			vruntime = se->vruntime;
-		else
-			vruntime = min_vruntime(vruntime, se->vruntime);
-	}
-
-	/* ensure we never gain time by being placed backwards. */
-	u64_u32_store(cfs_rq->min_vruntime,
-		      __update_min_vruntime(cfs_rq, vruntime));
 }
 
 static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -9,6 +9,8 @@ SCHED_FEAT(PLACE_FUDGE, true)
 SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
 SCHED_FEAT(PLACE_BONUS, false)
 
+SCHED_FEAT(MINIMAL_VA, false)
+
 /*
  * Prefer to schedule the task we woke last (assuming it failed
  * wakeup-preemption), since its likely going to consume data we



^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 17/17] [DEBUG] sched/eevdf: Debug / validation crud
  2023-03-28  9:26 [PATCH 00/17] sched: EEVDF using latency-nice Peter Zijlstra
                   ` (15 preceding siblings ...)
  2023-03-28  9:26 ` [PATCH 16/17] [RFC] sched/eevdf: Minimal vavg option Peter Zijlstra
@ 2023-03-28  9:26 ` Peter Zijlstra
  2023-04-03  7:42 ` [PATCH 00/17] sched: EEVDF using latency-nice Shrikanth Hegde
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2023-03-28  9:26 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault

XXX do not merge

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c     |   95 ++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/features.h |    2 +
 2 files changed, 97 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -793,6 +793,92 @@ static inline bool min_deadline_update(s
 RB_DECLARE_CALLBACKS(static, min_deadline_cb, struct sched_entity,
 		     run_node, min_deadline, min_deadline_update);
 
+#ifdef CONFIG_SCHED_DEBUG
+struct validate_data {
+	s64 va;
+	s64 avg_vruntime;
+	s64 avg_load;
+	s64 min_deadline;
+};
+
+static void __print_se(struct cfs_rq *cfs_rq, struct sched_entity *se, int level,
+		       struct validate_data *data)
+{
+	static const char indent[] = "                                           ";
+	unsigned long weight = scale_load_down(se->load.weight);
+	struct task_struct *p = NULL;
+
+	s64 v = se->vruntime - cfs_rq->min_vruntime;
+	s64 d = se->deadline - cfs_rq->min_vruntime;
+
+	data->avg_vruntime += v * weight;
+	data->avg_load += weight;
+
+	data->min_deadline = min(data->min_deadline, d);
+
+	if (entity_is_task(se))
+		p = task_of(se);
+
+	trace_printk("%.*s%lx w: %ld ve: %Ld lag: %Ld vd: %Ld vmd: %Ld %s (%d/%s)\n",
+		     level*2, indent, (unsigned long)se,
+		     weight,
+		     v, data->va - se->vruntime, d,
+		     se->min_deadline - cfs_rq->min_vruntime,
+		     entity_eligible(cfs_rq, se) ? "E" : "N",
+		     p ? p->pid : -1,
+		     p ? p->comm : "(null)");
+}
+
+static void __print_node(struct cfs_rq *cfs_rq, struct rb_node *node, int level,
+			 struct validate_data *data)
+{
+	if (!node)
+		return;
+
+	__print_se(cfs_rq, __node_2_se(node), level, data);
+	__print_node(cfs_rq, node->rb_left, level+1, data);
+	__print_node(cfs_rq, node->rb_right, level+1, data);
+}
+
+static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq);
+
+static void validate_cfs_rq(struct cfs_rq *cfs_rq, bool pick)
+{
+	struct sched_entity *curr = cfs_rq->curr;
+	struct rb_node *root = cfs_rq->tasks_timeline.rb_root.rb_node;
+	struct validate_data _data = {
+		.va = avg_vruntime(cfs_rq),
+		.min_deadline = (~0ULL) >> 1,
+	}, *data = &_data;
+
+	trace_printk("---\n");
+
+	__print_node(cfs_rq, root, 0, data);
+
+	trace_printk("min_deadline: %Ld avg_vruntime: %Ld / %Ld = %Ld\n",
+		     data->min_deadline,
+		     data->avg_vruntime, data->avg_load,
+		     data->avg_load ? div_s64(data->avg_vruntime, data->avg_load) : 0);
+
+	if (WARN_ON_ONCE(cfs_rq->avg_vruntime != data->avg_vruntime))
+		cfs_rq->avg_vruntime = data->avg_vruntime;
+
+	if (WARN_ON_ONCE(cfs_rq->avg_load != data->avg_load))
+		cfs_rq->avg_load = data->avg_load;
+
+	data->min_deadline += cfs_rq->min_vruntime;
+	WARN_ON_ONCE(cfs_rq->avg_load && __node_2_se(root)->min_deadline != data->min_deadline);
+
+	if (curr && curr->on_rq)
+		__print_se(cfs_rq, curr, 0, data);
+
+	if (pick)
+		trace_printk("pick: %lx\n", (unsigned long)pick_eevdf(cfs_rq));
+}
+#else
+static inline void validate_cfs_rq(struct cfs_rq *cfs_rq, bool pick) { }
+#endif
+
 /*
  * Enqueue an entity into the rb-tree:
  */
@@ -802,6 +888,9 @@ static void __enqueue_entity(struct cfs_
 	se->min_deadline = se->deadline;
 	rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
 				__entity_less, &min_deadline_cb);
+
+	if (sched_feat(VALIDATE_QUEUE))
+		validate_cfs_rq(cfs_rq, true);
 }
 
 static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -809,6 +898,9 @@ static void __dequeue_entity(struct cfs_
 	rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
 				  &min_deadline_cb);
 	avg_vruntime_sub(cfs_rq, se);
+
+	if (sched_feat(VALIDATE_QUEUE))
+		validate_cfs_rq(cfs_rq, true);
 }
 
 struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
@@ -894,6 +986,9 @@ static struct sched_entity *pick_eevdf(s
 	if (unlikely(!best)) {
 		struct sched_entity *left = __pick_first_entity(cfs_rq);
 		if (left) {
+			trace_printk("EEVDF scheduling fail, picking leftmost\n");
+			validate_cfs_rq(cfs_rq, false);
+			tracing_off();
 			pr_err("EEVDF scheduling fail, picking leftmost\n");
 			return left;
 		}
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -6,6 +6,8 @@ SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
 
 SCHED_FEAT(MINIMAL_VA, false)
 
+SCHED_FEAT(VALIDATE_QUEUE, false)
+
 /*
  * Prefer to schedule the task we woke last (assuming it failed
  * wakeup-preemption), since its likely going to consume data we



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 04/17] sched/fair: Add avg_vruntime
  2023-03-28  9:26 ` [PATCH 04/17] sched/fair: Add avg_vruntime Peter Zijlstra
@ 2023-03-28 23:57   ` Josh Don
  2023-03-29  7:50     ` Peter Zijlstra
  0 siblings, 1 reply; 55+ messages in thread
From: Josh Don @ 2023-03-28 23:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel,
	efault

On Tue, Mar 28, 2023 at 4:06 AM Peter Zijlstra <peterz@infradead.org> wrote:
[...]
> +/*
> + * Compute virtual time from the per-task service numbers:
> + *
> + * Fair schedulers conserve lag: \Sum lag_i = 0
> + *
> + * lag_i = S - s_i = w_i * (V - v_i)
> + *
> + * \Sum lag_i = 0 -> \Sum w_i * (V - v_i) = V * \Sum w_i - \Sum w_i * v_i = 0

Small note: I think it would be helpful to label these symbols
somewhere :) Weight  and vruntime are fairly obvious, but I don't
think 'S' and 'V' are as clear. Are these non-virtual ideal service
time, and average vruntime, respectively?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/17] sched/fair: Implement an EEVDF like policy
  2023-03-28  9:26 ` [PATCH 08/17] sched/fair: Implement an EEVDF like policy Peter Zijlstra
@ 2023-03-29  1:26   ` Josh Don
  2023-03-29  8:02     ` Peter Zijlstra
                       ` (3 more replies)
  2023-03-29 14:35   ` Vincent Guittot
  1 sibling, 4 replies; 55+ messages in thread
From: Josh Don @ 2023-03-29  1:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel,
	efault

Hi Peter,

This is a really interesting proposal and in general I think the
incorporation of latency/deadline is quite a nice enhancement. We've
struggled for a while to get better latency bounds on performance
sensitive threads in the face of antagonism from overcommit.

>  void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> +       s64 lag, limit;
> +
>         SCHED_WARN_ON(!se->on_rq);
> -       se->vlag = avg_vruntime(cfs_rq) - se->vruntime;
> +       lag = avg_vruntime(cfs_rq) - se->vruntime;
> +
> +       limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
> +       se->vlag = clamp(lag, -limit, limit);

This is for dequeue; presumably you'd want to update the vlag at
enqueue in case the average has moved again due to enqueue/dequeue of
other entities?

> +static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> +{
> +       struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
> +       struct sched_entity *curr = cfs_rq->curr;
> +       struct sched_entity *best = NULL;
> +
> +       if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> +               curr = NULL;
> +
> +       while (node) {
> +               struct sched_entity *se = __node_2_se(node);
> +
> +               /*
> +                * If this entity is not eligible, try the left subtree.
> +                */
> +               if (!entity_eligible(cfs_rq, se)) {
> +                       node = node->rb_left;
> +                       continue;
> +               }
> +
> +               /*
> +                * If this entity has an earlier deadline than the previous
> +                * best, take this one. If it also has the earliest deadline
> +                * of its subtree, we're done.
> +                */
> +               if (!best || deadline_gt(deadline, best, se)) {
> +                       best = se;
> +                       if (best->deadline == best->min_deadline)
> +                               break;

Isn't it possible to have a child with less vruntime (ie. rb->left)
but with the same deadline? Wouldn't it be preferable to choose the
child instead since the deadlines are equivalent but the child has
received less service time?

> +               }
> +
> +               /*
> +                * If the earlest deadline in this subtree is in the fully
> +                * eligible left half of our space, go there.
> +                */
> +               if (node->rb_left &&
> +                   __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
> +                       node = node->rb_left;
> +                       continue;
> +               }
> +
> +               node = node->rb_right;
> +       }
> +
> +       if (!best || (curr && deadline_gt(deadline, best, curr)))
> +               best = curr;
> +
> +       if (unlikely(!best)) {
> +               struct sched_entity *left = __pick_first_entity(cfs_rq);
> +               if (left) {
> +                       pr_err("EEVDF scheduling fail, picking leftmost\n");
> +                       return left;
> +               }
> +       }
> +
> +       return best;
> +}
> +
>
>  static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
> @@ -5088,19 +5307,20 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
>  static void
>  check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>  {
> -       unsigned long ideal_runtime, delta_exec;
> +       unsigned long delta_exec;
>         struct sched_entity *se;
>         s64 delta;
>
> -       /*
> -        * When many tasks blow up the sched_period; it is possible that
> -        * sched_slice() reports unusually large results (when many tasks are
> -        * very light for example). Therefore impose a maximum.
> -        */
> -       ideal_runtime = min_t(u64, sched_slice(cfs_rq, curr), sysctl_sched_latency);
> +       if (sched_feat(EEVDF)) {
> +               if (pick_eevdf(cfs_rq) != curr)
> +                       goto preempt;

This could shortcircuit the loop in pick_eevdf once we find a best
that has less vruntime and sooner deadline than curr, since we know
we'll never pick curr in that case. Might help performance when we
have a large tree for this cfs_rq.

> +
> +               return;
> +       }
>
>         delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
> -       if (delta_exec > ideal_runtime) {
> +       if (delta_exec > curr->slice) {
> +preempt:
>                 resched_curr(rq_of(cfs_rq));
>                 /*
>                  * The current task ran long enough, ensure it doesn't get
> @@ -5124,7 +5344,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq
>         if (delta < 0)
>                 return;
>
> -       if (delta > ideal_runtime)
> +       if (delta > curr->slice)
>                 resched_curr(rq_of(cfs_rq));
>  }

Best,
Josh

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 04/17] sched/fair: Add avg_vruntime
  2023-03-28 23:57   ` Josh Don
@ 2023-03-29  7:50     ` Peter Zijlstra
  2023-04-05 19:13       ` Peter Zijlstra
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2023-03-29  7:50 UTC (permalink / raw)
  To: Josh Don
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel,
	efault

On Tue, Mar 28, 2023 at 04:57:49PM -0700, Josh Don wrote:
> On Tue, Mar 28, 2023 at 4:06 AM Peter Zijlstra <peterz@infradead.org> wrote:
> [...]
> > +/*
> > + * Compute virtual time from the per-task service numbers:
> > + *
> > + * Fair schedulers conserve lag: \Sum lag_i = 0
> > + *
> > + * lag_i = S - s_i = w_i * (V - v_i)
> > + *
> > + * \Sum lag_i = 0 -> \Sum w_i * (V - v_i) = V * \Sum w_i - \Sum w_i * v_i = 0
> 
> Small note: I think it would be helpful to label these symbols
> somewhere :) Weight  and vruntime are fairly obvious, but I don't
> think 'S' and 'V' are as clear. Are these non-virtual ideal service
> time, and average vruntime, respectively?

Yep, they are. I'll see what I can do with the comments.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/17] sched/fair: Implement an EEVDF like policy
  2023-03-29  1:26   ` Josh Don
@ 2023-03-29  8:02     ` Peter Zijlstra
  2023-03-29  8:06     ` Peter Zijlstra
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2023-03-29  8:02 UTC (permalink / raw)
  To: Josh Don
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel,
	efault

On Tue, Mar 28, 2023 at 06:26:51PM -0700, Josh Don wrote:
> Hi Peter,
> 
> This is a really interesting proposal and in general I think the
> incorporation of latency/deadline is quite a nice enhancement. We've
> struggled for a while to get better latency bounds on performance
> sensitive threads in the face of antagonism from overcommit.
> 
> >  void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> > +       s64 lag, limit;
> > +
> >         SCHED_WARN_ON(!se->on_rq);
> > -       se->vlag = avg_vruntime(cfs_rq) - se->vruntime;
> > +       lag = avg_vruntime(cfs_rq) - se->vruntime;
> > +
> > +       limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
> > +       se->vlag = clamp(lag, -limit, limit);
> 
> This is for dequeue; presumably you'd want to update the vlag at
> enqueue in case the average has moved again due to enqueue/dequeue of
> other entities?

Ha, just adding the entry back will shift the average around and it's
all a giant pain in the backside.

 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
 	u64 vruntime = avg_vruntime(cfs_rq);
+	s64 lag = 0;
 
+	/*
+	 * Due to how V is constructed as the weighted average of entities,
+	 * adding tasks with positive lag, or removing tasks with negative lag
+	 * will move 'time' backwards, this can screw around with the lag of
+	 * other tasks.
+	 *
+	 * EEVDF: placement strategy #1 / #2
+	 */
+	if (sched_feat(PLACE_LAG) && cfs_rq->nr_running > 1) {
+		struct sched_entity *curr = cfs_rq->curr;
+		unsigned long load;
 
+		lag = se->vlag;
 
 		/*
+		 * If we want to place a task and preserve lag, we have to
+		 * consider the effect of the new entity on the weighted
+		 * average and compensate for this, otherwise lag can quickly
+		 * evaporate:
+		 *
+		 * l_i = V - v_i <=> v_i = V - l_i
+		 *
+		 * V = v_avg = W*v_avg / W
+		 *
+		 * V' = (W*v_avg + w_i*v_i) / (W + w_i)
+		 *    = (W*v_avg + w_i(v_avg - l_i)) / (W + w_i)
+		 *    = v_avg + w_i*l_i/(W + w_i)
+		 *
+		 * l_i' = V' - v_i = v_avg + w_i*l_i/(W + w_i) - (v_avg - l)
+		 *      = l_i - w_i*l_i/(W + w_i)
+		 *
+		 * l_i = (W + w_i) * l_i' / W
 		 */
+		load = cfs_rq->avg_load;
+		if (curr && curr->on_rq)
+			load += curr->load.weight;
+
+		lag *= load + se->load.weight;
+		if (WARN_ON_ONCE(!load))
+			load = 1;
+		lag = div_s64(lag, load);
 
+		vruntime -= lag;
 	}


That ^ is the other side of it.

But yes, once enqueued, additional join/leave operations can/will shift
V around and lag changes, nothing much to do about that.

The paper does it all a wee bit differently, but I think it ends up
being the same. They explicitly track V (and shift it around on
join/leave) while I implicitly track it through the average and then
need to play games like the above, but in the end it should be the same.
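
FWIW, a quick userspace check of the compensation (my own, with made-up
numbers): to end up with lag l after the new entity has itself dragged
the weighted average V along, place it with the pre-scaled lag
l * (W + w_i) / W:

#include <stdio.h>

int main(void)
{
	/* existing queue: weights and vruntimes (arbitrary units) */
	double w[] = { 1024, 2048, 512 };
	double v[] = { 1000, 1100, 900 };
	double W = 0, S = 0;
	double w_new = 1024, lag_wanted = 50;	/* entity being placed */
	double V, lag_scaled, v_new, V2;
	int i;

	for (i = 0; i < 3; i++) {
		W += w[i];
		S += w[i] * v[i];
	}
	V = S / W;				/* current weighted average */

	lag_scaled = lag_wanted * (W + w_new) / W;
	v_new = V - lag_scaled;			/* vruntime - lag */

	/* recompute the average with the new entity in the sum */
	V2 = (S + w_new * v_new) / (W + w_new);

	printf("wanted lag %.3f, got lag %.3f\n", lag_wanted, V2 - v_new);
	return 0;
}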

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/17] sched/fair: Implement an EEVDF like policy
  2023-03-29  1:26   ` Josh Don
  2023-03-29  8:02     ` Peter Zijlstra
@ 2023-03-29  8:06     ` Peter Zijlstra
  2023-03-29  8:22       ` Peter Zijlstra
  2023-03-29  8:12     ` Peter Zijlstra
  2023-03-29  8:18     ` Peter Zijlstra
  3 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2023-03-29  8:06 UTC (permalink / raw)
  To: Josh Don
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel,
	efault

On Tue, Mar 28, 2023 at 06:26:51PM -0700, Josh Don wrote:
> > +static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> > +{
> > +       struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
> > +       struct sched_entity *curr = cfs_rq->curr;
> > +       struct sched_entity *best = NULL;
> > +
> > +       if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> > +               curr = NULL;
> > +
> > +       while (node) {
> > +               struct sched_entity *se = __node_2_se(node);
> > +
> > +               /*
> > +                * If this entity is not eligible, try the left subtree.
> > +                */
> > +               if (!entity_eligible(cfs_rq, se)) {
> > +                       node = node->rb_left;
> > +                       continue;
> > +               }
> > +
> > +               /*
> > +                * If this entity has an earlier deadline than the previous
> > +                * best, take this one. If it also has the earliest deadline
> > +                * of its subtree, we're done.
> > +                */
> > +               if (!best || deadline_gt(deadline, best, se)) {
> > +                       best = se;
> > +                       if (best->deadline == best->min_deadline)
> > +                               break;
> 
> Isn't it possible to have a child with less vruntime (ie. rb->left)
> but with the same deadline? Wouldn't it be preferable to choose the
> child instead since the deadlines are equivalent but the child has
> received less service time?

Possible, yes I suppose. But given this is ns granular virtual time,
somewhat unlikely. You can modify the last (validation) patch and have
it detect the case, see if you can trigger it.

Doing that will make the pick always do a full descent of the tree
though, which is a little more expensive. Not sure it's worth the
effort.

> > +               }
> > +
> > +               /*
> > +                * If the earlest deadline in this subtree is in the fully
> > +                * eligible left half of our space, go there.
> > +                */
> > +               if (node->rb_left &&
> > +                   __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
> > +                       node = node->rb_left;
> > +                       continue;
> > +               }
> > +
> > +               node = node->rb_right;
> > +       }
> > +
> > +       if (!best || (curr && deadline_gt(deadline, best, curr)))
> > +               best = curr;
> > +
> > +       if (unlikely(!best)) {
> > +               struct sched_entity *left = __pick_first_entity(cfs_rq);
> > +               if (left) {
> > +                       pr_err("EEVDF scheduling fail, picking leftmost\n");
> > +                       return left;
> > +               }
> > +       }
> > +
> > +       return best;
> > +}

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/17] sched/fair: Implement an EEVDF like policy
  2023-03-29  1:26   ` Josh Don
  2023-03-29  8:02     ` Peter Zijlstra
  2023-03-29  8:06     ` Peter Zijlstra
@ 2023-03-29  8:12     ` Peter Zijlstra
  2023-03-29 18:54       ` Josh Don
  2023-03-29  8:18     ` Peter Zijlstra
  3 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2023-03-29  8:12 UTC (permalink / raw)
  To: Josh Don
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel,
	efault

On Tue, Mar 28, 2023 at 06:26:51PM -0700, Josh Don wrote:

> > @@ -5088,19 +5307,20 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
> >  static void
> >  check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> >  {
> > -       unsigned long ideal_runtime, delta_exec;
> > +       unsigned long delta_exec;
> >         struct sched_entity *se;
> >         s64 delta;
> >
> > -       /*
> > -        * When many tasks blow up the sched_period; it is possible that
> > -        * sched_slice() reports unusually large results (when many tasks are
> > -        * very light for example). Therefore impose a maximum.
> > -        */
> > -       ideal_runtime = min_t(u64, sched_slice(cfs_rq, curr), sysctl_sched_latency);
> > +       if (sched_feat(EEVDF)) {
> > +               if (pick_eevdf(cfs_rq) != curr)
> > +                       goto preempt;
> 
> This could shortcircuit the loop in pick_eevdf once we find a best
> that has less vruntime and sooner deadline than curr, since we know
> we'll never pick curr in that case. Might help performance when we
> have a large tree for this cfs_rq.

Yeah, one of the things I did consider was having this set cfs_rq->next
such that the reschedule pick doesn't have to do the pick again. But I
figured I'd keep things simple for now.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/17] sched/fair: Implement an EEVDF like policy
  2023-03-29  1:26   ` Josh Don
                       ` (2 preceding siblings ...)
  2023-03-29  8:12     ` Peter Zijlstra
@ 2023-03-29  8:18     ` Peter Zijlstra
  3 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2023-03-29  8:18 UTC (permalink / raw)
  To: Josh Don
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel,
	efault

On Tue, Mar 28, 2023 at 06:26:51PM -0700, Josh Don wrote:
> Hi Peter,
> 
> This is a really interesting proposal and in general I think the
> incorporation of latency/deadline is quite a nice enhancement. We've
> struggled for a while to get better latency bounds on performance
> sensitive threads in the face of antagonism from overcommit.

Right; so the big caveat is of course that these are virtual deadlines,
overcommit *will* push them out.

We have SCHED_DEADLINE that does the 'real' deadline thing ;-)

But what these virtual deadlines do is provide an order inside the
virtual time domain. It makes the whole preemption business well
defined. And with that, provides means to mix varying request sizes.
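
A tiny worked example (numbers made up, nice-0 weight so r maps 1:1 to
virtual time): two eligible tasks sitting at ve = V,

  task A: r = 3.0ms  ->  vd_A = V + 3.0ms
  task B: r = 0.5ms  ->  vd_B = V + 0.5ms

B's virtual deadline comes first, so B gets picked and preempts A. With
N other nice-0 hogs on the CPU the wall-clock time for virtual time to
advance those 0.5ms stretches by roughly a factor N, but the *order* of
the two deadlines, and hence who runs first, does not change.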

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/17] sched/fair: Implement an EEVDF like policy
  2023-03-29  8:06     ` Peter Zijlstra
@ 2023-03-29  8:22       ` Peter Zijlstra
  2023-03-29 18:48         ` Josh Don
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2023-03-29  8:22 UTC (permalink / raw)
  To: Josh Don
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel,
	efault

On Wed, Mar 29, 2023 at 10:06:46AM +0200, Peter Zijlstra wrote:
> On Tue, Mar 28, 2023 at 06:26:51PM -0700, Josh Don wrote:
> > > +static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> > > +{
> > > +       struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
> > > +       struct sched_entity *curr = cfs_rq->curr;
> > > +       struct sched_entity *best = NULL;
> > > +
> > > +       if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> > > +               curr = NULL;
> > > +
> > > +       while (node) {
> > > +               struct sched_entity *se = __node_2_se(node);
> > > +
> > > +               /*
> > > +                * If this entity is not eligible, try the left subtree.
> > > +                */
> > > +               if (!entity_eligible(cfs_rq, se)) {
> > > +                       node = node->rb_left;
> > > +                       continue;
> > > +               }
> > > +
> > > +               /*
> > > +                * If this entity has an earlier deadline than the previous
> > > +                * best, take this one. If it also has the earliest deadline
> > > +                * of its subtree, we're done.
> > > +                */
> > > +               if (!best || deadline_gt(deadline, best, se)) {
> > > +                       best = se;
> > > +                       if (best->deadline == best->min_deadline)
> > > +                               break;
> > 
> > Isn't it possible to have a child with less vruntime (ie. rb->left)
> > but with the same deadline? Wouldn't it be preferable to choose the
> > child instead since the deadlines are equivalent but the child has
> > received less service time?
> 
> Possible, yes I suppose. But given this is ns granular virtual time,
> somewhat unlikely. You can modify the last (validation) patch and have
> it detect the case, see if you can trigger it.
> 
> Doing that will make the pick always do a full descent of the tree
> though, which is a little more expensive. Not sure it's worth the
> effort.

Hmm, maybe not; if there is no smaller-or-equal deadline then the
min_deadline of the child will be greater and we can terminate the
descent right there.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 15/17] [RFC] sched/eevdf: Sleeper bonus
  2023-03-28  9:26 ` [PATCH 15/17] [RFC] sched/eevdf: Sleeper bonus Peter Zijlstra
@ 2023-03-29  9:10   ` Mike Galbraith
  0 siblings, 0 replies; 55+ messages in thread
From: Mike Galbraith @ 2023-03-29  9:10 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, vincent.guittot
  Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, corbet, qyousef, chris.hyser, patrick.bellasi,
	pjt, pavel, qperret, tim.c.chen, joshdon, timj, kprateek.nayak,
	yu.c.chen, youssefesmat, joel

On Tue, 2023-03-28 at 11:26 +0200, Peter Zijlstra wrote:
> Add a sleeper bonus hack, but keep it default disabled. This should
> allow easy testing if regressions are due to this.
>
> Specifically; this 'restores' performance for things like starve and
> stress-futex, stress-nanosleep that rely on sleeper bonus to compete
> against an always running parent (the fair 67%/33% split vs the
> 50%/50% bonus thing).
>
> OTOH this completely destroys latency and hackbench (as in 5x worse).

I profiled that again, but numbers were still.. not so lovely.

Point of this post is the sleeper/hog split business anyway.  I've been
running your patches on my desktop box and cute as button little rpi4b
since they appeared, and poking at them looking for any desktop deltas
and have noticed jack diddly spit.

A lot of benchmarks will notice both distribution and ctx deltas, but
humans.. the numbers I've seen so far say that's highly unlikely.

A couple perf sched lat summaries below for insomniacs.

Load is chrome playing BigBuckBunny (for the zillionth time), which on
this box as I set resolution/size wants ~35% of the box vs 8ms run 1ms
sleep massive_intr, 1 thread per CPU (profiles at ~91%), as a hog-ish
but not absurdly so competitor.

perf.data.stable.full sort=max - top 10 summary
 -----------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Sum delay ms     |
 -----------------------------------------------------------------------------------------------------------
  chrome:(7)            |   6274.683 ms |    63604 | avg:   0.172 ms | max:  41.796 ms | sum:10930.150 ms |
  massive_intr:(8)      |1673597.295 ms |   762617 | avg:   0.709 ms | max:  40.383 ms | sum:540374.853 ms |
  X:2476                |  86498.438 ms |   129657 | avg:   0.259 ms | max:  36.157 ms | sum:33588.933 ms |
  dav1d-worker:(8)      | 162369.504 ms |   411962 | avg:   0.682 ms | max:  30.648 ms | sum:280864.249 ms |
  ThreadPoolForeg:(13)  |  21177.187 ms |    60907 | avg:   0.401 ms | max:  30.424 ms | sum:24412.770 ms |
  gmain:(3)             |     95.617 ms |     3552 | avg:   0.755 ms | max:  26.365 ms | sum: 2680.738 ms |
  llvmpipe-0:(2)        |  24602.666 ms |    30828 | avg:   1.278 ms | max:  23.590 ms | sum:39408.811 ms |
  llvmpipe-2:(2)        |  27707.699 ms |    29226 | avg:   1.236 ms | max:  23.579 ms | sum:36126.717 ms |
  llvmpipe-7:(2)        |  34437.755 ms |    27017 | avg:   1.097 ms | max:  23.545 ms | sum:29634.448 ms |
  llvmpipe-5:(2)        |  24533.947 ms |    28503 | avg:   1.375 ms | max:  22.995 ms | sum:39191.132 ms |
 -----------------------------------------------------------------------------------------------------------
  TOTAL:                |2314609.811 ms |  2473891 | 96.4% util, 27.7% GUI   41.796 ms |   1361629.825 ms |
 -----------------------------------------------------------------------------------------------------------

perf.data.eevdf.full sort=max - top 10 summary
 -----------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Sum delay ms     |
 -----------------------------------------------------------------------------------------------------------
  chrome:(8)            |   6329.996 ms |    80080 | avg:   0.193 ms | max:  28.835 ms | sum:15432.012 ms |
  ThreadPoolForeg:(20)  |  20477.539 ms |   158457 | avg:   0.265 ms | max:  25.708 ms | sum:42003.063 ms |
  dav1d-worker:(8)      | 168022.569 ms |  1090719 | avg:   0.366 ms | max:  24.786 ms | sum:398971.023 ms |
  massive_intr:(8)      |1736052.944 ms |   721103 | avg:   0.658 ms | max:  23.427 ms | sum:474493.391 ms |
  llvmpipe-5:(2)        |  22970.555 ms |    31184 | avg:   1.448 ms | max:  22.465 ms | sum:45148.667 ms |
  llvmpipe-3:(2)        |  22803.121 ms |    31688 | avg:   1.436 ms | max:  22.076 ms | sum:45516.196 ms |
  llvmpipe-0:(2)        |  22050.612 ms |    33580 | avg:   1.397 ms | max:  22.007 ms | sum:46898.028 ms |
  VizCompositorTh:5538  |  90856.230 ms |    91865 | avg:   0.605 ms | max:  21.702 ms | sum:55542.418 ms |
  llvmpipe-1:(2)        |  22866.426 ms |    32870 | avg:   1.390 ms | max:  20.732 ms | sum:45690.066 ms |
  llvmpipe-2:(2)        |  22672.646 ms |    32319 | avg:   1.415 ms | max:  20.647 ms | sum:45731.838 ms |
 -----------------------------------------------------------------------------------------------------------
  TOTAL:                |2332092.393 ms |  3449563 | 97.1% util, 25.6% GUI   28.835 ms |   1570459.986 ms |
 -----------------------------------------------------------------------------------------------------------
 vs stable                                    1.394  distribution delta.. meaningless            1.153




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/17] sched/fair: Implement an EEVDF like policy
  2023-03-28  9:26 ` [PATCH 08/17] sched/fair: Implement an EEVDF like policy Peter Zijlstra
  2023-03-29  1:26   ` Josh Don
@ 2023-03-29 14:35   ` Vincent Guittot
  2023-03-30  8:01     ` Peter Zijlstra
  1 sibling, 1 reply; 55+ messages in thread
From: Vincent Guittot @ 2023-03-29 14:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault

On Tue, 28 Mar 2023 at 13:06, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Where CFS is currently a WFQ based scheduler with only a single knob,
> the weight. The addition of a second, latency oriented parameter,
> makes something like WF2Q or EEVDF based a much better fit.
>
> Specifically, EEVDF does EDF like scheduling in the left half of the
> tree -- those entities that are owed service. Except because this is a
> virtual time scheduler, the deadlines are in virtual time as well,
> which is what allows over-subscription.
>
> EEVDF has two parameters:
>
>  - weight, or time-slope; which is mapped to nice just as before
>  - relative deadline; which is related to slice length and mapped
>    to the new latency nice.
>
> Basically, by setting a smaller slice, the deadline will be earlier
> and the task will be more eligible and ran earlier.

IIUC how it works, vd_i = ve_i + r_i / w_i

So for the same weight, the vd will be earlier, but that is no longer
always true for different weights
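
e.g. (made-up numbers, stock nice weights, NICE_0_LOAD = 1024):

  task A: nice  0 (w = 1024), r = 1.0ms -> vd_A = ve + 1.0ms of vtime
  task B: nice -5 (w = 3121), r = 3.0ms -> vd_B = ve + 3.0ms * 1024/3121
                                               ~= ve + 0.98ms of vtime

so B gets the earlier virtual deadline despite asking for the larger
slice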

>
> Preemption (both tick and wakeup) is driven by testing against a fresh
> pick. Because the tree is now effectively an interval tree, and the
> selection is no longer 'leftmost', over-scheduling is less of a
> problem.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/sched.h   |    4
>  kernel/sched/debug.c    |    6
>  kernel/sched/fair.c     |  324 +++++++++++++++++++++++++++++++++++++++++-------
>  kernel/sched/features.h |    3
>  kernel/sched/sched.h    |    1
>  5 files changed, 293 insertions(+), 45 deletions(-)
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -548,6 +548,9 @@ struct sched_entity {
>         /* For load-balancing: */
>         struct load_weight              load;
>         struct rb_node                  run_node;
> +       u64                             deadline;
> +       u64                             min_deadline;
> +
>         struct list_head                group_node;
>         unsigned int                    on_rq;
>
> @@ -556,6 +559,7 @@ struct sched_entity {
>         u64                             vruntime;
>         u64                             prev_sum_exec_runtime;
>         s64                             vlag;
> +       u64                             slice;
>
>         u64                             nr_migrations;
>
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -535,9 +535,13 @@ print_task(struct seq_file *m, struct rq
>         else
>                 SEQ_printf(m, " %c", task_state_to_char(p));
>
> -       SEQ_printf(m, " %15s %5d %9Ld.%06ld %9Ld %5d ",
> +       SEQ_printf(m, "%15s %5d %9Ld.%06ld %c %9Ld.%06ld %9Ld.%06ld %9Ld.%06ld %9Ld %5d ",
>                 p->comm, task_pid_nr(p),
>                 SPLIT_NS(p->se.vruntime),
> +               entity_eligible(cfs_rq_of(&p->se), &p->se) ? 'E' : 'N',
> +               SPLIT_NS(p->se.deadline),
> +               SPLIT_NS(p->se.slice),
> +               SPLIT_NS(p->se.sum_exec_runtime),
>                 (long long)(p->nvcsw + p->nivcsw),
>                 p->prio);
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -47,6 +47,7 @@
>  #include <linux/psi.h>
>  #include <linux/ratelimit.h>
>  #include <linux/task_work.h>
> +#include <linux/rbtree_augmented.h>
>
>  #include <asm/switch_to.h>
>
> @@ -347,6 +348,16 @@ static u64 __calc_delta(u64 delta_exec,
>         return mul_u64_u32_shr(delta_exec, fact, shift);
>  }
>
> +/*
> + * delta /= w
> + */
> +static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
> +{
> +       if (unlikely(se->load.weight != NICE_0_LOAD))
> +               delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
> +
> +       return delta;
> +}
>
>  const struct sched_class fair_sched_class;
>
> @@ -691,11 +702,62 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
>
>  /*
>   * lag_i = S - s_i = w_i * (V - v_i)
> + *
> + * However, since V is approximated by the weighted average of all entities it
> + * is possible -- by addition/removal/reweight to the tree -- to move V around
> + * and end up with a larger lag than we started with.
> + *
> + * Limit this to either double the slice length with a minimum of TICK_NSEC
> + * since that is the timing granularity.
> + *
> + * EEVDF gives the following limit for a steady state system:
> + *
> + *   -r_max < lag < max(r_max, q)
> + *
> + * XXX could add max_slice to the augmented data to track this.
>   */
>  void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> +       s64 lag, limit;
> +
>         SCHED_WARN_ON(!se->on_rq);
> -       se->vlag = avg_vruntime(cfs_rq) - se->vruntime;
> +       lag = avg_vruntime(cfs_rq) - se->vruntime;
> +
> +       limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
> +       se->vlag = clamp(lag, -limit, limit);
> +}
> +
> +/*
> + * Entity is eligible once it received less service than it ought to have,
> + * eg. lag >= 0.
> + *
> + * lag_i = S - s_i = w_i*(V - v_i)
> + *
> + * lag_i >= 0 -> V >= v_i
> + *
> + *     \Sum (v_i - v)*w_i
> + * V = ------------------ + v
> + *          \Sum w_i
> + *
> + * lag_i >= 0 -> \Sum (v_i - v)*w_i >= (v_i - v)*(\Sum w_i)
> + *
> + * Note: using 'avg_vruntime() > se->vruntime' is inacurate due
> + *       to the loss in precision caused by the division.
> + */
> +int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +{
> +       struct sched_entity *curr = cfs_rq->curr;
> +       s64 avg = cfs_rq->avg_vruntime;
> +       long load = cfs_rq->avg_load;
> +
> +       if (curr && curr->on_rq) {
> +               unsigned long weight = scale_load_down(curr->load.weight);
> +
> +               avg += entity_key(cfs_rq, curr) * weight;
> +               load += weight;
> +       }
> +
> +       return avg >= entity_key(cfs_rq, se) * load;
>  }
>
>  static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
> @@ -714,8 +776,8 @@ static u64 __update_min_vruntime(struct
>
>  static void update_min_vruntime(struct cfs_rq *cfs_rq)
>  {
> +       struct sched_entity *se = __pick_first_entity(cfs_rq);
>         struct sched_entity *curr = cfs_rq->curr;
> -       struct rb_node *leftmost = rb_first_cached(&cfs_rq->tasks_timeline);
>
>         u64 vruntime = cfs_rq->min_vruntime;
>
> @@ -726,9 +788,7 @@ static void update_min_vruntime(struct c
>                         curr = NULL;
>         }
>
> -       if (leftmost) { /* non-empty tree */
> -               struct sched_entity *se = __node_2_se(leftmost);
> -
> +       if (se) {
>                 if (!curr)
>                         vruntime = se->vruntime;
>                 else
> @@ -745,18 +805,50 @@ static inline bool __entity_less(struct
>         return entity_before(__node_2_se(a), __node_2_se(b));
>  }
>
> +#define deadline_gt(field, lse, rse) ({ (s64)((lse)->field - (rse)->field) > 0; })
> +
> +static inline void __update_min_deadline(struct sched_entity *se, struct rb_node *node)
> +{
> +       if (node) {
> +               struct sched_entity *rse = __node_2_se(node);
> +               if (deadline_gt(min_deadline, se, rse))
> +                       se->min_deadline = rse->min_deadline;
> +       }
> +}
> +
> +/*
> + * se->min_deadline = min(se->deadline, left->min_deadline, right->min_deadline)
> + */
> +static inline bool min_deadline_update(struct sched_entity *se, bool exit)
> +{
> +       u64 old_min_deadline = se->min_deadline;
> +       struct rb_node *node = &se->run_node;
> +
> +       se->min_deadline = se->deadline;
> +       __update_min_deadline(se, node->rb_right);
> +       __update_min_deadline(se, node->rb_left);
> +
> +       return se->min_deadline == old_min_deadline;
> +}
> +
> +RB_DECLARE_CALLBACKS(static, min_deadline_cb, struct sched_entity,
> +                    run_node, min_deadline, min_deadline_update);
> +
>  /*
>   * Enqueue an entity into the rb-tree:
>   */
>  static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>         avg_vruntime_add(cfs_rq, se);
> -       rb_add_cached(&se->run_node, &cfs_rq->tasks_timeline, __entity_less);
> +       se->min_deadline = se->deadline;
> +       rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
> +                               __entity_less, &min_deadline_cb);
>  }
>
>  static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -       rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline);
> +       rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
> +                                 &min_deadline_cb);
>         avg_vruntime_sub(cfs_rq, se);
>  }
>
> @@ -780,6 +872,97 @@ static struct sched_entity *__pick_next_
>         return __node_2_se(next);
>  }
>
> +static struct sched_entity *pick_cfs(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> +{
> +       struct sched_entity *left = __pick_first_entity(cfs_rq);
> +
> +       /*
> +        * If curr is set we have to see if its left of the leftmost entity
> +        * still in the tree, provided there was anything in the tree at all.
> +        */
> +       if (!left || (curr && entity_before(curr, left)))
> +               left = curr;
> +
> +       return left;
> +}
> +
> +/*
> + * Earliest Eligible Virtual Deadline First
> + *
> + * In order to provide latency guarantees for different request sizes
> + * EEVDF selects the best runnable task from two criteria:
> + *
> + *  1) the task must be eligible (must be owed service)
> + *
> + *  2) from those tasks that meet 1), we select the one
> + *     with the earliest virtual deadline.
> + *
> + * We can do this in O(log n) time due to an augmented RB-tree. The
> + * tree keeps the entries sorted on service, but also functions as a
> + * heap based on the deadline by keeping:
> + *
> + *  se->min_deadline = min(se->deadline, se->{left,right}->min_deadline)
> + *
> + * Which allows an EDF like search on (sub)trees.
> + */
> +static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> +{
> +       struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
> +       struct sched_entity *curr = cfs_rq->curr;
> +       struct sched_entity *best = NULL;
> +
> +       if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> +               curr = NULL;
> +
> +       while (node) {
> +               struct sched_entity *se = __node_2_se(node);
> +
> +               /*
> +                * If this entity is not eligible, try the left subtree.
> +                */
> +               if (!entity_eligible(cfs_rq, se)) {
> +                       node = node->rb_left;
> +                       continue;
> +               }
> +
> +               /*
> +                * If this entity has an earlier deadline than the previous
> +                * best, take this one. If it also has the earliest deadline
> +                * of its subtree, we're done.
> +                */
> +               if (!best || deadline_gt(deadline, best, se)) {
> +                       best = se;
> +                       if (best->deadline == best->min_deadline)
> +                               break;
> +               }
> +
> +               /*
> +                * If the earlest deadline in this subtree is in the fully
> +                * eligible left half of our space, go there.
> +                */
> +               if (node->rb_left &&
> +                   __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
> +                       node = node->rb_left;
> +                       continue;
> +               }
> +
> +               node = node->rb_right;
> +       }
> +
> +       if (!best || (curr && deadline_gt(deadline, best, curr)))
> +               best = curr;
> +
> +       if (unlikely(!best)) {
> +               struct sched_entity *left = __pick_first_entity(cfs_rq);
> +               if (left) {
> +                       pr_err("EEVDF scheduling fail, picking leftmost\n");
> +                       return left;
> +               }
> +       }
> +
> +       return best;
> +}
> +
>  #ifdef CONFIG_SCHED_DEBUG
>  struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
>  {
> @@ -822,17 +1005,6 @@ long calc_latency_offset(int prio)
>  }
>
>  /*
> - * delta /= w
> - */
> -static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
> -{
> -       if (unlikely(se->load.weight != NICE_0_LOAD))
> -               delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
> -
> -       return delta;
> -}
> -
> -/*
>   * The idea is to set a period in which each task runs once.
>   *
>   * When there are too many tasks (sched_nr_latency) we have to stretch
> @@ -897,6 +1069,38 @@ static u64 sched_slice(struct cfs_rq *cf
>         return slice;
>  }
>
> +/*
> + * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
> + * this is probably good enough.
> + */
> +static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +{
> +       if ((s64)(se->vruntime - se->deadline) < 0)
> +               return;
> +
> +       if (sched_feat(EEVDF)) {
> +               /*
> +                * For EEVDF the virtual time slope is determined by w_i (iow.
> +                * nice) while the request time r_i is determined by
> +                * latency-nice.
> +                */
> +               se->slice = se->latency_offset;
> +       } else {
> +               /*
> +                * When many tasks blow up the sched_period; it is possible
> +                * that sched_slice() reports unusually large results (when
> +                * many tasks are very light for example). Therefore impose a
> +                * maximum.
> +                */
> +               se->slice = min_t(u64, sched_slice(cfs_rq, se), sysctl_sched_latency);
> +       }
> +
> +       /*
> +        * EEVDF: vd_i = ve_i + r_i / w_i
> +        */
> +       se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
> +}
> +
>  #include "pelt.h"
>  #ifdef CONFIG_SMP
>
> @@ -1029,6 +1233,7 @@ static void update_curr(struct cfs_rq *c
>         schedstat_add(cfs_rq->exec_clock, delta_exec);
>
>         curr->vruntime += calc_delta_fair(delta_exec, curr);
> +       update_deadline(cfs_rq, curr);
>         update_min_vruntime(cfs_rq);
>
>         if (entity_is_task(curr)) {
> @@ -4796,6 +5001,7 @@ static inline bool entity_is_long_sleepe
>  static void
>  place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
>  {
> +       u64 vslice = calc_delta_fair(se->slice, se);
>         u64 vruntime = avg_vruntime(cfs_rq);
>         s64 lag = 0;
>
> @@ -4834,9 +5040,9 @@ place_entity(struct cfs_rq *cfs_rq, stru
>                  */
>                 load = cfs_rq->avg_load;
>                 if (curr && curr->on_rq)
> -                       load += curr->load.weight;
> +                       load += scale_load_down(curr->load.weight);
>
> -               lag *= load + se->load.weight;
> +               lag *= load + scale_load_down(se->load.weight);
>                 if (WARN_ON_ONCE(!load))
>                         load = 1;
>                 lag = div_s64(lag, load);
> @@ -4877,6 +5083,19 @@ place_entity(struct cfs_rq *cfs_rq, stru
>         }
>
>         se->vruntime = vruntime;
> +
> +       /*
> +        * When joining the competition; the exisiting tasks will be,
> +        * on average, halfway through their slice, as such start tasks
> +        * off with half a slice to ease into the competition.
> +        */
> +       if (sched_feat(PLACE_DEADLINE_INITIAL) && initial)
> +               vslice /= 2;
> +
> +       /*
> +        * EEVDF: vd_i = ve_i + r_i/w_i
> +        */
> +       se->deadline = se->vruntime + vslice;
>  }
>
>  static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
> @@ -5088,19 +5307,20 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
>  static void
>  check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>  {
> -       unsigned long ideal_runtime, delta_exec;
> +       unsigned long delta_exec;
>         struct sched_entity *se;
>         s64 delta;
>
> -       /*
> -        * When many tasks blow up the sched_period; it is possible that
> -        * sched_slice() reports unusually large results (when many tasks are
> -        * very light for example). Therefore impose a maximum.
> -        */
> -       ideal_runtime = min_t(u64, sched_slice(cfs_rq, curr), sysctl_sched_latency);
> +       if (sched_feat(EEVDF)) {
> +               if (pick_eevdf(cfs_rq) != curr)
> +                       goto preempt;
> +
> +               return;
> +       }
>
>         delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
> -       if (delta_exec > ideal_runtime) {
> +       if (delta_exec > curr->slice) {
> +preempt:
>                 resched_curr(rq_of(cfs_rq));
>                 /*
>                  * The current task ran long enough, ensure it doesn't get
> @@ -5124,7 +5344,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq
>         if (delta < 0)
>                 return;
>
> -       if (delta > ideal_runtime)
> +       if (delta > curr->slice)
>                 resched_curr(rq_of(cfs_rq));
>  }
>
> @@ -5179,17 +5399,20 @@ wakeup_preempt_entity(struct sched_entit
>  static struct sched_entity *
>  pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>  {
> -       struct sched_entity *left = __pick_first_entity(cfs_rq);
> -       struct sched_entity *se;
> +       struct sched_entity *left, *se;
>
> -       /*
> -        * If curr is set we have to see if its left of the leftmost entity
> -        * still in the tree, provided there was anything in the tree at all.
> -        */
> -       if (!left || (curr && entity_before(curr, left)))
> -               left = curr;
> +       if (sched_feat(EEVDF)) {
> +               /*
> +                * Enabling NEXT_BUDDY will affect latency but not fairness.
> +                */
> +               if (sched_feat(NEXT_BUDDY) &&
> +                   cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
> +                       return cfs_rq->next;
> +
> +               return pick_eevdf(cfs_rq);
> +       }
>
> -       se = left; /* ideally we run the leftmost entity */
> +       se = left = pick_cfs(cfs_rq, curr);
>
>         /*
>          * Avoid running the skip buddy, if running something else can
> @@ -6284,13 +6507,12 @@ static inline void unthrottle_offline_cf
>  static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
>  {
>         struct sched_entity *se = &p->se;
> -       struct cfs_rq *cfs_rq = cfs_rq_of(se);
>
>         SCHED_WARN_ON(task_rq(p) != rq);
>
>         if (rq->cfs.h_nr_running > 1) {
> -               u64 slice = sched_slice(cfs_rq, se);
>                 u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> +               u64 slice = se->slice;
>                 s64 delta = slice - ran;
>
>                 if (delta < 0) {
> @@ -8010,7 +8232,19 @@ static void check_preempt_wakeup(struct
>         if (cse_is_idle != pse_is_idle)
>                 return;
>
> -       update_curr(cfs_rq_of(se));
> +       cfs_rq = cfs_rq_of(se);
> +       update_curr(cfs_rq);
> +
> +       if (sched_feat(EEVDF)) {
> +               /*
> +                * XXX pick_eevdf(cfs_rq) != se ?
> +                */
> +               if (pick_eevdf(cfs_rq) == pse)
> +                       goto preempt;
> +
> +               return;
> +       }
> +
>         if (wakeup_preempt_entity(se, pse) == 1) {
>                 /*
>                  * Bias pick_next to pick the sched entity that is
> @@ -8256,7 +8490,7 @@ static void yield_task_fair(struct rq *r
>
>         clear_buddies(cfs_rq, se);
>
> -       if (curr->policy != SCHED_BATCH) {
> +       if (sched_feat(EEVDF) || curr->policy != SCHED_BATCH) {
>                 update_rq_clock(rq);
>                 /*
>                  * Update run-time statistics of the 'current'.
> @@ -8269,6 +8503,8 @@ static void yield_task_fair(struct rq *r
>                  */
>                 rq_clock_skip_update(rq);
>         }
> +       if (sched_feat(EEVDF))
> +               se->deadline += calc_delta_fair(se->slice, se);
>
>         set_skip_buddy(se);
>  }
> @@ -12012,8 +12248,8 @@ static void rq_offline_fair(struct rq *r
>  static inline bool
>  __entity_slice_used(struct sched_entity *se, int min_nr_tasks)
>  {
> -       u64 slice = sched_slice(cfs_rq_of(se), se);
>         u64 rtime = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> +       u64 slice = se->slice;
>
>         return (rtime * min_nr_tasks > slice);
>  }
> @@ -12728,7 +12964,7 @@ static unsigned int get_rr_interval_fair
>          * idle runqueue:
>          */
>         if (rq->cfs.load.weight)
> -               rr_interval = NS_TO_JIFFIES(sched_slice(cfs_rq_of(se), se));
> +               rr_interval = NS_TO_JIFFIES(se->slice);
>
>         return rr_interval;
>  }
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -13,6 +13,7 @@ SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true)
>   * sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
>   */
>  SCHED_FEAT(PLACE_LAG, true)
> +SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
>
>  /*
>   * Prefer to schedule the task we woke last (assuming it failed
> @@ -103,3 +104,5 @@ SCHED_FEAT(LATENCY_WARN, false)
>
>  SCHED_FEAT(ALT_PERIOD, true)
>  SCHED_FEAT(BASE_SLICE, true)
> +
> +SCHED_FEAT(EEVDF, true)
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3316,5 +3316,6 @@ static inline void switch_mm_cid(struct
>  #endif
>
>  extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
> +extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
>
>  #endif /* _KERNEL_SCHED_SCHED_H */
>
>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/17] sched/fair: Implement an EEVDF like policy
  2023-03-29  8:22       ` Peter Zijlstra
@ 2023-03-29 18:48         ` Josh Don
  0 siblings, 0 replies; 55+ messages in thread
From: Josh Don @ 2023-03-29 18:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel,
	efault

On Wed, Mar 29, 2023 at 1:22 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Mar 29, 2023 at 10:06:46AM +0200, Peter Zijlstra wrote:
> > On Tue, Mar 28, 2023 at 06:26:51PM -0700, Josh Don wrote:
> > > > +static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> > > > +{
> > > > +       struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
> > > > +       struct sched_entity *curr = cfs_rq->curr;
> > > > +       struct sched_entity *best = NULL;
> > > > +
> > > > +       if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> > > > +               curr = NULL;
> > > > +
> > > > +       while (node) {
> > > > +               struct sched_entity *se = __node_2_se(node);
> > > > +
> > > > +               /*
> > > > +                * If this entity is not eligible, try the left subtree.
> > > > +                */
> > > > +               if (!entity_eligible(cfs_rq, se)) {
> > > > +                       node = node->rb_left;
> > > > +                       continue;
> > > > +               }
> > > > +
> > > > +               /*
> > > > +                * If this entity has an earlier deadline than the previous
> > > > +                * best, take this one. If it also has the earliest deadline
> > > > +                * of its subtree, we're done.
> > > > +                */
> > > > +               if (!best || deadline_gt(deadline, best, se)) {
> > > > +                       best = se;
> > > > +                       if (best->deadline == best->min_deadline)
> > > > +                               break;
> > >
> > > Isn't it possible to have a child with less vruntime (ie. rb->left)
> > > but with the same deadline? Wouldn't it be preferable to choose the
> > > child instead since the deadlines are equivalent but the child has
> > > received less service time?
> >
> > Possible, yes I suppose. But given this is ns granular virtual time,
> > somewhat unlikely. You can modify the last (validation) patch and have
> > it detect the case, see if you can trigger it.

Agreed on unlikely, was just checking my understanding here, since
this becomes a question of tradeoff (likelihood of a decent vs. the ideal
scheduling decision). Leaving as-is seems fine, with potentially a
short comment.

> > Doing that will make the pick always do a full descent of the tree
> > though, which is a little more expensive. Not sure it's worth the
> > effort.
>
> Hmm, maybe not, if there is no smaller-or-equal deadline then the
> min_deadline of the child will be greater and we can terminate the
> descent right there.
>
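
FWIW, a sketch of what that early-termination check could look like in the
quoted loop (illustrative only, not the posted code; it assumes the rest of
pick_eevdf() keeps descending towards the subtree holding the minimum
deadline, and it reuses deadline_gt()/__node_2_se() from the patch above):

		if (!best || deadline_gt(deadline, best, se)) {
			best = se;
			/*
			 * Stop only when no equal deadline can exist
			 * further left; otherwise keep descending to
			 * find the least-vruntime entity among the
			 * earliest deadlines.
			 */
			if (best->deadline == best->min_deadline &&
			    (!node->rb_left ||
			     (s64)(__node_2_se(node->rb_left)->min_deadline -
				   best->deadline) > 0))
				break;
		}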

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/17] sched/fair: Implement an EEVDF like policy
  2023-03-29  8:12     ` Peter Zijlstra
@ 2023-03-29 18:54       ` Josh Don
  0 siblings, 0 replies; 55+ messages in thread
From: Josh Don @ 2023-03-29 18:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel,
	efault

On Wed, Mar 29, 2023 at 1:12 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Mar 28, 2023 at 06:26:51PM -0700, Josh Don wrote:
>
> > > @@ -5088,19 +5307,20 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
> > >  static void
> > >  check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > >  {
> > > -       unsigned long ideal_runtime, delta_exec;
> > > +       unsigned long delta_exec;
> > >         struct sched_entity *se;
> > >         s64 delta;
> > >
> > > -       /*
> > > -        * When many tasks blow up the sched_period; it is possible that
> > > -        * sched_slice() reports unusually large results (when many tasks are
> > > -        * very light for example). Therefore impose a maximum.
> > > -        */
> > > -       ideal_runtime = min_t(u64, sched_slice(cfs_rq, curr), sysctl_sched_latency);
> > > +       if (sched_feat(EEVDF)) {
> > > +               if (pick_eevdf(cfs_rq) != curr)
> > > +                       goto preempt;
> >
> > This could short-circuit the loop in pick_eevdf once we find a best
> > that has less vruntime and sooner deadline than curr, since we know
> > we'll never pick curr in that case. Might help performance when we
> > have a large tree for this cfs_rq.
>
> Yeah, one of the things I did consider was having this set cfs_rq->next
> such that the reschedule pick doesn't have to do the pick again. But I
> figured keep things simple for now.

Yea that makes sense. I was thinking something similar along the lines
of cfs_rq->next as another way to avoid duplicate computation. But
agreed this can be a future optimization.
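
A rough sketch of that idea in the EEVDF branch of check_preempt_tick()
(just an illustration, not the posted code; as posted, pick_next_entity()
would only consult the cached entity when NEXT_BUDDY is enabled and the
entity is still eligible):

	if (sched_feat(EEVDF)) {
		struct sched_entity *pick = pick_eevdf(cfs_rq);

		if (pick != curr) {
			/*
			 * Remember the pick so the reschedule path does
			 * not have to repeat the tree walk.
			 */
			cfs_rq->next = pick;
			goto preempt;
		}

		return;
	}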

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/17] sched/fair: Implement an EEVDF like policy
  2023-03-29 14:35   ` Vincent Guittot
@ 2023-03-30  8:01     ` Peter Zijlstra
  2023-03-30 17:05       ` Vincent Guittot
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2023-03-30  8:01 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault

On Wed, Mar 29, 2023 at 04:35:25PM +0200, Vincent Guittot wrote:

> IIUC how it works, Vd = ve + r / wi
> 
> So for the same weight, the vd will be earlier, but that's no longer always
> true for different weights

Correct; but for a heavier task the time also goes slower and since it
needs more time, you want it to go first. But yes, this is weird at
first glance.

Let us consider a 3 task scenario, where one task (A) is double weight
wrt to the other two (B,C), and let them run one quanta (q) at a time.

Each step will see V advance q/4.

A: w=2, r=4q	B: w=1, r=4q	C: w=1, r=4q

  1) A runs -- earliest deadline

    A  |-------<
    B  |---------------<
    C  |---------------<
    ---+---+---+---+---+---+---+-----------
    V  ^

  2) B runs (tie break with C) -- A is ineligible due to v_a > V

    A    |-----<
    B  |---------------<
    C  |---------------<
    ---+---+---+---+---+---+---+-----------
    V   ^

  3) A runs -- earliest deadline

    A    |-----<
    B      |-----------<
    C  |---------------<
    ---+---+---+---+---+---+---+-----------
    V    ^

  4) C runs -- only eligible task

    A      |---<
    B      |-----------<
    C  |---------------<
    ---+---+---+---+---+---+---+-----------
    V     ^

  5) similar to 1)

    A      |---<
    B      |-----------<
    C      |-----------<
    ---+---+---+---+---+---+---+-----------
    V      ^

And we see that we get a very nice ABAC interleave, with the only other
possible schedule being ACAB.

> By virtue of the heavier task getting a shorter virtual deadline it gets
nicely interleaved with the other tasks and you get a very consistent
schedule with very little choice.

Like already said, step 2) is the only place we had a choice, and if we
were to have given either B or C a shorter request (IOW latency-nice)
that choice would have been fully determined.

So increasing w gets you more time (and the shorter deadline causes the
above interleaving), while for the same w, reducing r gets you picked
earlier.

Perhaps another way to look at it is that since heavier tasks run more
(often) you've got to compete against it more often for latency.
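
(A small stand-alone sketch of the above, in case it helps; this is only a toy
model of the rules used here -- eligible iff v_i <= V with V the load-weighted
average vruntime, pick the earliest virtual deadline vd_i = ve_i + r_i/w_i --
and not the kernel implementation:)

#include <stdio.h>

struct task { char name; double w, r, v, d; };

int main(void)
{
	struct task t[3] = {
		{ 'A', 2, 4, 0, 0 },
		{ 'B', 1, 4, 0, 0 },
		{ 'C', 1, 4, 0, 0 },
	};
	double q = 1.0, W = 0.0;
	int i, step;

	for (i = 0; i < 3; i++) {
		t[i].d = t[i].v + t[i].r / t[i].w;	/* vd_i = ve_i + r_i/w_i */
		W += t[i].w;
	}

	for (step = 0; step < 8; step++) {
		struct task *best = NULL;
		double V = 0.0;

		for (i = 0; i < 3; i++)			/* V = sum(w_i*v_i) / sum(w_i) */
			V += t[i].w * t[i].v;
		V /= W;

		for (i = 0; i < 3; i++) {
			if (t[i].v > V)			/* not eligible */
				continue;
			if (!best || t[i].d < best->d)	/* earliest eligible deadline */
				best = &t[i];
		}

		printf("%c", best->name);
		best->v += q / best->w;			/* virtual time advances by q/w_i */
		if (best->v >= best->d)			/* request consumed; new deadline */
			best->d = best->v + best->r / best->w;
	}
	printf("\n");		/* prints ABACABAC for the weights/requests above */

	return 0;
}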


Does that help?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/17] sched/fair: Implement an EEVDF like policy
  2023-03-30  8:01     ` Peter Zijlstra
@ 2023-03-30 17:05       ` Vincent Guittot
  2023-04-04 12:00         ` Peter Zijlstra
  0 siblings, 1 reply; 55+ messages in thread
From: Vincent Guittot @ 2023-03-30 17:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault

On Thu, 30 Mar 2023 at 10:04, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Mar 29, 2023 at 04:35:25PM +0200, Vincent Guittot wrote:
>
> > IIUC how it works, Vd = ve + r / wi
> >
> > So for the same weight, the vd will be earlier, but that's no longer always
> > true for different weights
>
> Correct; but for a heavier task the time also goes slower and since it
> needs more time, you want it to go first. But yes, this is weird at
> first glance.

Yeah, I understand that this is needed for bounding the lag to a
quantum max, but it makes the latency prioritization less obvious and
not always aligned with what we want.

Let's say that you have 2 tasks A and B waking up simultaneously with
the same vruntime; A has a negative latency nice to reflect some
latency constraint and B has the default value.  A will run 1st if they
both have the same prio, which is aligned with their latency nice
values, but B could run 1st if it increases its nice prio to reflect the
need for a larger cpu bandwidth, so you can defeat the purpose of the
latency nice although there is no unfairness.

A cares about its latency and sets a negative latency nice to reduce its
request slice:

A: w=2, r=4q    B: w=2, r=6q

     A  |-------<
     B  |-----------<

     ---+---+---+---+---+---+---+-----------
     V  ^

A runs 1st because its Vd is earlier

     A    |-----<
     B  |-----------<

     ---+---+---+---+---+---+---+-----------
     V   ^

     A    |-----<
     B    |---------<

     ---+---+---+---+---+---+---+-----------
     V    ^

     A      |---<
     B    |---------<

     ---+---+---+---+---+---+---+-----------
     V     ^


     A      |---<
     B      |-------<

     ---+---+---+---+---+---+---+-----------
     V      ^

If B increases its nice because it wants more bandwidth but still
doesn't care about latency:

A: w=2, r=4q    B: w=4, r=6q

     A  |-----------<
     B  |---------<

     ---+-----+-----+-----+-----+-----+---+-----------
     V  ^

B runs 1st even though A's latency nice is lower

     A  |-----------<
     B    |------<

     ---+-----+-----+-----+-----+-----+---+-----------
     V   ^

     A     |--------<
     B    |------<

     ---+-----+-----+-----+-----+-----+---+-----------
     V    ^

     A     |--------<
     B      |----<

     ---+-----+-----+-----+-----+-----+---+-----------
     V     ^

     A        |-----<
     B      |----<

     ---+-----+-----+-----+-----+-----+---+-----------
     V      ^

     A        |-----<
     B        |--<

     ---+-----+-----+-----+-----+-----+---+-----------
     V       ^

     A        |-----<
     B        |--<

     ---+-----+-----+-----+-----+-----+---+-----------
     V        ^

>
> Let us consider a 3 task scenario, where one task (A) is double weight
> wrt to the other two (B,C), and let them run one quanta (q) at a time.
>
> Each step will see V advance q/4.
>
> A: w=2, r=4q    B: w=1, r=4q    C: w=1, r=4q
>
>   1) A runs -- earliest deadline
>
>     A  |-------<
>     B  |---------------<
>     C  |---------------<
>     ---+---+---+---+---+---+---+-----------
>     V  ^
>
>   2) B runs (tie break with C) -- A is ineligible due to v_a > V
>
>     A    |-----<
>     B  |---------------<
>     C  |---------------<
>     ---+---+---+---+---+---+---+-----------
>     V   ^
>
>   3) A runs -- earliest deadline
>
>     A    |-----<
>     B      |-----------<
>     C  |---------------<
>     ---+---+---+---+---+---+---+-----------
>     V    ^
>
>   4) C runs -- only eligible task
>
>     A      |---<
>     B      |-----------<
>     C  |---------------<
>     ---+---+---+---+---+---+---+-----------
>     V     ^
>
>   5) similar to 1)
>
>     A      |---<
>     B      |-----------<
>     C      |-----------<
>     ---+---+---+---+---+---+---+-----------
>     V      ^
>
> And we see that we get a very nice ABAC interleave, with the only other
> possible schedule being ACAB.
>
> > By virtue of the heavier task getting a shorter virtual deadline it gets
> nicely interleaved with the other tasks and you get a very consistent
> schedule with very little choice.
>
> Like already said, step 2) is the only place we had a choice, and if we
> were to have given either B or C a shorter request (IOW latency-nice)
> that choice would have been fully determined.
>
> So increasing w gets you more time (and the shorter deadline causes the
> above interleaving), while for the same w, reducing r gets you picked
> earlier.
>
> Perhaps another way to look at it is that since heavier tasks run more
> (often) you've got to compete against it more often for latency.
>
>
> Does that help?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 14/17] sched/eevdf: Better handle mixed slice length
  2023-03-28  9:26 ` [PATCH 14/17] sched/eevdf: Better handle mixed slice length Peter Zijlstra
@ 2023-03-31 15:26   ` Vincent Guittot
  2023-04-04  9:29     ` Peter Zijlstra
       [not found]   ` <20230401232355.336-1-hdanton@sina.com>
  1 sibling, 1 reply; 55+ messages in thread
From: Vincent Guittot @ 2023-03-31 15:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault

On Tue, 28 Mar 2023 at 13:06, Peter Zijlstra <peterz@infradead.org> wrote:
>
> In the case where (due to latency-nice) there are different request
> sizes in the tree, the smaller requests tend to be dominated by the
> larger. Also note how the EEVDF lag limits are based on r_max.
>
> Therefore; add a heuristic that for the mixed request size case, moves
> smaller requests to placement strategy #2 which ensures they're
> immediately eligible and, due to their smaller (virtual) deadline,
> will cause preemption.
>
> NOTE: this relies on update_entity_lag() to impose lag limits above
> a single slice.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/fair.c     |   14 ++++++++++++++
>  kernel/sched/features.h |    1 +
>  kernel/sched/sched.h    |    1 +
>  3 files changed, 16 insertions(+)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -616,6 +616,7 @@ avg_vruntime_add(struct cfs_rq *cfs_rq,
>         s64 key = entity_key(cfs_rq, se);
>
>         cfs_rq->avg_vruntime += key * weight;
> +       cfs_rq->avg_slice += se->slice * weight;
>         cfs_rq->avg_load += weight;
>  }
>
> @@ -626,6 +627,7 @@ avg_vruntime_sub(struct cfs_rq *cfs_rq,
>         s64 key = entity_key(cfs_rq, se);
>
>         cfs_rq->avg_vruntime -= key * weight;
> +       cfs_rq->avg_slice -= se->slice * weight;
>         cfs_rq->avg_load -= weight;
>  }
>
> @@ -4832,6 +4834,18 @@ place_entity(struct cfs_rq *cfs_rq, stru
>                 lag = se->vlag;
>
>                 /*
> +                * For latency sensitive tasks; those that have a shorter than
> +                * average slice and do not fully consume the slice, transition
> +                * to EEVDF placement strategy #2.
> +                */
> +               if (sched_feat(PLACE_FUDGE) &&
> +                   cfs_rq->avg_slice > se->slice * cfs_rq->avg_load) {
> +                       lag += vslice;
> +                       if (lag > 0)
> +                               lag = 0;

By using different lag policies for tasks, doesn't this create
unfairness between tasks?

I wanted to stress this situation with a simple use case, but it seems
that even without changing the slice, there is a fairness problem:

Task A always runs
Task B loops on: running 1ms then sleeping 1ms
default nice and latency nice prio for both
each task should get around 50% of the time.

The fairness is OK with tip/sched/core,
but with eevdf Task B only gets around 30%.

I haven't identified the problem so far.
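
A minimal sketch of the workload (assuming both tasks are pinned to the same
CPU, e.g. with taskset -c 0, and their CPU shares compared with top):

/* no argument: Task A (always runs); any argument: Task B (1ms on, 1ms off) */
#include <time.h>

static void burn(long ns)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	do {
		clock_gettime(CLOCK_MONOTONIC, &t1);
	} while ((t1.tv_sec - t0.tv_sec) * 1000000000L +
		 (t1.tv_nsec - t0.tv_nsec) < ns);
}

int main(int argc, char **argv)
{
	struct timespec ms = { 0, 1000000 };	/* 1ms */

	if (argc > 1) {				/* Task B: run 1ms, sleep 1ms */
		for (;;) {
			burn(1000000);
			nanosleep(&ms, NULL);
		}
	}

	for (;;)				/* Task A: always runs */
		burn(1000000);
}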


> +               }
> +
> +               /*
>                  * If we want to place a task and preserve lag, we have to
>                  * consider the effect of the new entity on the weighted
>                  * average and compensate for this, otherwise lag can quickly
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -5,6 +5,7 @@
>   * sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
>   */
>  SCHED_FEAT(PLACE_LAG, true)
> +SCHED_FEAT(PLACE_FUDGE, true)
>  SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
>
>  /*
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -559,6 +559,7 @@ struct cfs_rq {
>         unsigned int            idle_h_nr_running; /* SCHED_IDLE */
>
>         s64                     avg_vruntime;
> +       u64                     avg_slice;
>         u64                     avg_load;
>
>         u64                     exec_clock;
>
>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 14/17] sched/eevdf: Better handle mixed slice length
       [not found]   ` <20230401232355.336-1-hdanton@sina.com>
@ 2023-04-02  2:40     ` Mike Galbraith
  0 siblings, 0 replies; 55+ messages in thread
From: Mike Galbraith @ 2023-04-02  2:40 UTC (permalink / raw)
  To: Hillf Danton, Vincent Guittot
  Cc: mingo, linux-kernel, Peter Zijlstra, rostedt, bsegall, mgorman,
	bristot, corbet, qyousef, joshdon, timj, kprateek.nayak,
	yu.c.chen, youssefesmat, linux-mm, joel

On Sun, 2023-04-02 at 07:23 +0800, Hillf Danton wrote:
> On 31 Mar 2023 17:26:51 +0200 Vincent Guittot <vincent.guittot@linaro.org>
> >
> > I wanted to stress this situation with a simple use case, but it seems
> > that even without changing the slice, there is a fairness problem:
> >
> > Task A always runs
> > Task B loops on: running 1ms then sleeping 1ms
> > default nice and latency nice prio for both
> > each task should get around 50% of the time.
> >
> > The fairness is OK with tip/sched/core,
> > but with eevdf Task B only gets around 30%.
>
> Convincing evidence for glitch in wakeup preempt.

If you turn on PLACE_BONUS, it'll mimic FAIR_SLEEPERS.. but if you then
do some testing, you'll probably turn it right back off.

The 50/50 split in the current code isn't really any more fair; as soon as
you leave the tiny bubble of fairness, it's not the least bit fair.
Nor is that tiny bubble all rainbows and unicorns: it brought with it
benchmark wins and losses, like everything that changes more than
comments, its price being service latency variance.

The short-term split doesn't really mean all that much; some things
will like the current fair-bubble better, some the eevdf virtual deadline
math and its less spiky service.  We'll see.

I'm kinda hoping eevdf works out, FAIR_SLEEPERS is quite annoying to
squabble with.

	-Mike

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/17] sched: EEVDF using latency-nice
  2023-03-28  9:26 [PATCH 00/17] sched: EEVDF using latency-nice Peter Zijlstra
                   ` (16 preceding siblings ...)
  2023-03-28  9:26 ` [PATCH 17/17] [DEBUG] sched/eevdf: Debug / validation crud Peter Zijlstra
@ 2023-04-03  7:42 ` Shrikanth Hegde
  2023-04-10  3:13 ` David Vernet
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 55+ messages in thread
From: Shrikanth Hegde @ 2023-04-03  7:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, corbet, qyousef, chris.hyser, patrick.bellasi,
	pjt, pavel, qperret, tim.c.chen, joshdon, timj, kprateek.nayak,
	yu.c.chen, youssefesmat, joel, efault, mingo, vincent.guittot,
	shrikanth hegde



On 3/28/23 2:56 PM, Peter Zijlstra wrote:
> Hi!
>
> Latest version of the EEVDF [1] patches.
>
> Many changes since last time; most notably it now fully replaces CFS and uses
> lag based placement for migrations. Smaller changes include:
>
>  - uses scale_load_down() for avg_vruntime; I measured the max delta to be ~44
>    bits on a system/cgroup based kernel build.
>  - fixed a bunch of reweight / cgroup placement issues
>  - adaptive placement strategy for smaller slices
>  - rename se->lag to se->vlag
>
> There's a bunch of RFC patches at the end and one DEBUG patch. Of those, the
> PLACE_BONUS patch is a mixed bag of pain. A number of benchmarks regress
> because EEVDF is actually fair and gives a 100% parent vs a 50% child a 67%/33%
> split (stress-futex, stress-nanosleep, starve, etc..) instead of a 50%/50%
> split that sleeper bonus achieves. Mostly I think these benchmarks are somewhat
> artificial/daft but who knows.
>
> The PLACE_BONUS thing horribly messes up things like hackbench and latency-nice
> because it places things too far to the left in the tree. Basically it messes
> with the whole 'when', by placing a task back in history you're putting a
> burden on the now to accomodate catching up. More tinkering required.
>
> But over-all the thing seems to be fairly usable and could do with more
> extensive testing.
>
> [1] https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=805acf7726282721504c8f00575d91ebfd750564
>
> Results:
>
>   hackbech -g $nr_cpu + cyclictest --policy other results:
>
> 			EEVDF			 CFS
>
> 		# Min Latencies: 00054
>   LNICE(19)	# Avg Latencies: 00660
> 		# Max Latencies: 23103
>
> 		# Min Latencies: 00052		00053
>   LNICE(0)	# Avg Latencies: 00318		00687
> 		# Max Latencies: 08593		13913
>
> 		# Min Latencies: 00054
>   LNICE(-19)	# Avg Latencies: 00055
> 		# Max Latencies: 00061
>
>
> Some preliminary results from Chen Yu on a slightly older version:
>
>   schbench  (95% tail latency, lower is better)
>   =================================================================================
>   case                    nr_instance            baseline (std%)    compare% ( std%)
>   normal                   25%                     1.00  (2.49%)    -81.2%   (4.27%)
>   normal                   50%                     1.00  (2.47%)    -84.5%   (0.47%)
>   normal                   75%                     1.00  (2.5%)     -81.3%   (1.27%)
>   normal                  100%                     1.00  (3.14%)    -79.2%   (0.72%)
>   normal                  125%                     1.00  (3.07%)    -77.5%   (0.85%)
>   normal                  150%                     1.00  (3.35%)    -76.4%   (0.10%)
>   normal                  175%                     1.00  (3.06%)    -76.2%   (0.56%)
>   normal                  200%                     1.00  (3.11%)    -76.3%   (0.39%)
>   ==================================================================================
>
>   hackbench (throughput, higher is better)
>   ==============================================================================
>   case                    nr_instance            baseline(std%)  compare%( std%)
>   threads-pipe              25%                      1.00 (<2%)    -17.5 (<2%)
>   threads-socket            25%                      1.00 (<2%)    -1.9 (<2%)
>   threads-pipe              50%                      1.00 (<2%)     +6.7 (<2%)
>   threads-socket            50%                      1.00 (<2%)    -6.3  (<2%)
>   threads-pipe              100%                     1.00 (3%)     +110.1 (3%)
>   threads-socket            100%                     1.00 (<2%)    -40.2 (<2%)
>   threads-pipe              150%                     1.00 (<2%)    +125.4 (<2%)
>   threads-socket            150%                     1.00 (<2%)    -24.7 (<2%)
>   threads-pipe              200%                     1.00 (<2%)    -89.5 (<2%)
>   threads-socket            200%                     1.00 (<2%)    -27.4 (<2%)
>   process-pipe              25%                      1.00 (<2%)    -15.0 (<2%)
>   process-socket            25%                      1.00 (<2%)    -3.9 (<2%)
>   process-pipe              50%                      1.00 (<2%)    -0.4  (<2%)
>   process-socket            50%                      1.00 (<2%)    -5.3  (<2%)
>   process-pipe              100%                     1.00 (<2%)    +62.0 (<2%)
>   process-socket            100%                     1.00 (<2%)    -39.5  (<2%)
>   process-pipe              150%                     1.00 (<2%)    +70.0 (<2%)
>   process-socket            150%                     1.00 (<2%)    -20.3 (<2%)
>   process-pipe              200%                     1.00 (<2%)    +79.2 (<2%)
>   process-socket            200%                     1.00 (<2%)    -22.4  (<2%)
>   ==============================================================================
>
>   stress-ng (throughput, higher is better)
>   ==============================================================================
>   case                    nr_instance            baseline(std%)  compare%( std%)
>   switch                  25%                      1.00 (<2%)    -6.5 (<2%)
>   switch                  50%                      1.00 (<2%)    -9.2 (<2%)
>   switch                  75%                      1.00 (<2%)    -1.2 (<2%)
>   switch                  100%                     1.00 (<2%)    +11.1 (<2%)
>   switch                  125%                     1.00 (<2%)    -16.7% (9%)
>   switch                  150%                     1.00 (<2%)    -13.6 (<2%)
>   switch                  175%                     1.00 (<2%)    -16.2 (<2%)
>   switch                  200%                     1.00 (<2%)    -19.4% (<2%)
>   fork                    50%                      1.00 (<2%)    -0.1 (<2%)
>   fork                    75%                      1.00 (<2%)    -0.3 (<2%)
>   fork                    100%                     1.00 (<2%)    -0.1 (<2%)
>   fork                    125%                     1.00 (<2%)    -6.9 (<2%)
>   fork                    150%                     1.00 (<2%)    -8.8 (<2%)
>   fork                    200%                     1.00 (<2%)    -3.3 (<2%)
>   futex                   25%                      1.00 (<2%)    -3.2 (<2%)
>   futex                   50%                      1.00 (3%)     -19.9 (5%)
>   futex                   75%                      1.00 (6%)     -19.1 (2%)
>   futex                   100%                     1.00 (16%)    -30.5 (10%)
>   futex                   125%                     1.00 (25%)    -39.3 (11%)
>   futex                   150%                     1.00 (20%)    -27.2% (17%)
>   futex                   175%                     1.00 (<2%)    -18.6 (<2%)
>   futex                   200%                     1.00 (<2%)    -47.5 (<2%)
>   nanosleep               25%                      1.00 (<2%)    -0.1 (<2%)
>   nanosleep               50%                      1.00 (<2%)    -0.0% (<2%)
>   nanosleep               75%                      1.00 (<2%)    +15.2% (<2%)
>   nanosleep               100%                     1.00 (<2%)    -26.4 (<2%)
>   nanosleep               125%                     1.00 (<2%)    -1.3 (<2%)
>   nanosleep               150%                     1.00 (<2%)    +2.1  (<2%)
>   nanosleep               175%                     1.00 (<2%)    +8.3 (<2%)
>   nanosleep               200%                     1.00 (<2%)    +2.0% (<2%)
>   ===============================================================================
>
>   unixbench (throughput, higher is better)
>   ==============================================================================
>   case                    nr_instance            baseline(std%)  compare%( std%)
>   spawn                   125%                      1.00 (<2%)    +8.1 (<2%)
>   context1                100%                      1.00 (6%)     +17.4 (6%)
>   context1                75%                       1.00 (13%)    +18.8 (8%)
>   =================================================================================
>
>   netperf  (throughput, higher is better)
>   ===========================================================================
>   case                    nr_instance          baseline(std%)  compare%( std%)
>   UDP_RR                  25%                   1.00    (<2%)    -1.5%  (<2%)
>   UDP_RR                  50%                   1.00    (<2%)    -0.3%  (<2%)
>   UDP_RR                  75%                   1.00    (<2%)    +12.5% (<2%)
>   UDP_RR                 100%                   1.00    (<2%)    -4.3%  (<2%)
>   UDP_RR                 125%                   1.00    (<2%)    -4.9%  (<2%)
>   UDP_RR                 150%                   1.00    (<2%)    -4.7%  (<2%)
>   UDP_RR                 175%                   1.00    (<2%)    -6.1%  (<2%)
>   UDP_RR                 200%                   1.00    (<2%)    -6.6%  (<2%)
>   TCP_RR                  25%                   1.00    (<2%)    -1.4%  (<2%)
>   TCP_RR                  50%                   1.00    (<2%)    -0.2%  (<2%)
>   TCP_RR                  75%                   1.00    (<2%)    -3.9%  (<2%)
>   TCP_RR                 100%                   1.00    (2%)     +3.6%  (5%)
>   TCP_RR                 125%                   1.00    (<2%)    -4.2%  (<2%)
>   TCP_RR                 150%                   1.00    (<2%)    -6.0%  (<2%)
>   TCP_RR                 175%                   1.00    (<2%)    -7.4%  (<2%)
>   TCP_RR                 200%                   1.00    (<2%)    -8.4%  (<2%)
>   ==========================================================================
>
>
> ---
> Also available at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/eevdf
>
> ---
> Parth Shah (1):
>       sched: Introduce latency-nice as a per-task attribute
>
> Peter Zijlstra (14):
>       sched/fair: Add avg_vruntime
>       sched/fair: Remove START_DEBIT
>       sched/fair: Add lag based placement
>       rbtree: Add rb_add_augmented_cached() helper
>       sched/fair: Implement an EEVDF like policy
>       sched: Commit to lag based placement
>       sched/smp: Use lag to simplify cross-runqueue placement
>       sched: Commit to EEVDF
>       sched/debug: Rename min_granularity to base_slice
>       sched: Merge latency_offset into slice
>       sched/eevdf: Better handle mixed slice length
>       sched/eevdf: Sleeper bonus
>       sched/eevdf: Minimal vavg option
>       sched/eevdf: Debug / validation crud
>
> Vincent Guittot (2):
>       sched/fair: Add latency_offset
>       sched/fair: Add sched group latency support
>
>  Documentation/admin-guide/cgroup-v2.rst |   10 +
>  include/linux/rbtree_augmented.h        |   26 +
>  include/linux/sched.h                   |    6 +
>  include/uapi/linux/sched.h              |    4 +-
>  include/uapi/linux/sched/types.h        |   19 +
>  init/init_task.c                        |    3 +-
>  kernel/sched/core.c                     |   65 +-
>  kernel/sched/debug.c                    |   49 +-
>  kernel/sched/fair.c                     | 1199 ++++++++++++++++---------------
>  kernel/sched/features.h                 |   29 +-
>  kernel/sched/sched.h                    |   23 +-
>  tools/include/uapi/linux/sched.h        |    4 +-
>  12 files changed, 794 insertions(+), 643 deletions(-)
>

Tested the patch on a Power system with 60 cores and SMT=8, for a total of 480 CPUs.
The system has four NUMA nodes.

TL;DR

A real-life workload like daytrader shows improvement in different cases, while
microbenchmarks show both gains and regressions.

Tested with microbenchmarks (hackbench, schbench, unixbench, STREAM and lmbench)
and a DB workload called daytrader. daytrader simulates real-life trading
activity and reports total transactions/s. It uses around 70% CPU.

Comparison is between tip/master and tip/master + this patch series. tip/master was at 4b7aa0abddff.
Small nit: the series applies cleanly to tip/master, but fails to apply cleanly to
sched/core. sched/core is at 05bfb338fa8d.
===============================================================================
Summary of methods and observations.
===============================================================================
Method 1:    Ran microbenchmarks on an idle system without any cgroups.
Observation: hackbench and unixbench show gains. schbench shows regression.
             Stream and lmbench values are the same.

Method 2:    Ran microbenchmarks on an idle system. Created a cgroup and ran
             benchmarks in that cgroup. Latency nice values are assigned to the cgroup.
             This is almost the same as Method 1.
Observation: hackbench pipe shows improvement. schbench shows regression. lmbench and stream
             are more or less the same.

Method 3:    Ran microbenchmarks in a cgroup while, in another terminal, stress-ng
             was running at 50% utilization. Here, also tried different latency nice
             values for the cgroup.
Observation: Hackbench shows gains in latency values. Schbench shows good gains in
             latency values except in the 1 thread case. lmbench and stream regress
             slightly. unixbench is mixed.

             One concerning throughput result is 4 X Shell Scripts (8 concurrent), which
             shows a 50% regression. This was verified with an additional run. The
             same holds true for 25% utilization as well.

Method 4:    Ran daytrader with no cgroups on an idle system.
Observation: we see around a 7% gain in throughput.

Method 5:    Ran daytrader in a cgroup while running stress-ng at 50% utilization.
Observation: we see around a 9% gain in throughput.

===============================================================================

Note:
positive values show improvement and negative values show regression.

hackbench has 50 iterations.
schbench has 10 iterations
unixbench has 10 iterations.
lmbench has 50 iterations.

# lscpu
Architecture:            ppc64le
  Byte Order:            Little Endian
CPU(s):                  480
  On-line CPU(s) list:   0-479
  Thread(s) per core:    8
  Core(s) per socket:    15
  Socket(s):             4
  Physical sockets:      4
  Physical chips:        1
  Physical cores/chip:   15

NUMA:
  NUMA node(s):          4
  NUMA node0 CPU(s):     0-119
  NUMA node1 CPU(s):     120-239
  NUMA node2 CPU(s):     240-359
  NUMA node3 CPU(s):     360-479


===============================================================================
Detailed logs from each method.
================
Method 1:
================
This is to compare the out-of-box performance of the two, with no load on the system;
benchmarks are run without any cgroup.

Hackbench shows improvement. schbench results are mixed, but schbench has run-to-run
variance. stream and lmbench are the same.

-------------------------------------------------------------------------------
lmbench                          tip/master      eevdf
-------------------------------------------------------------------------------
latency process fork       :     120.56,     120.32(+0.20)
latency process Exec       :     176.70,     177.22(-0.30)
latency process Exec       :       5.59,       5.89(-5.47)
latency syscall fstat      :       0.26,       0.26( 0.00)
latency syscall open       :       2.27,       2.29(-0.88)
AF_UNIX sock stream latency:       9.13,       9.34(-2.30)
Select on 200 fd's         :       2.16,       2.15(-0.46)
semaphore latency          :       0.85,       0.85(+0.00)

-------------------------------------------------------------------------------
Stream                           tip/master      eevdf
-------------------------------------------------------------------------------
copy latency                :       0.58,       0.59(-1.72)
copy bandwidth              :   27357.05,   27009.15(-1.27)
scale latency               :       0.61,       0.61(0.00)
scale bandwidth             :   26268.65,   26057.07(-0.81)
add latency                 :       1.25,       1.25(0.00)
add bandwidth               :   19176.21,   19177.24(0.01)
triad latency               :       0.74,       0.74(0.00)
triad bandwidth             :   32591.51,   32506.32(-0.26)

-------------------------------------------------------------------------------
Unixbench                              tip/master    eevdf
-------------------------------------------------------------------------------
1 X Execl Throughput               :    5158.07,    5228.97(1.37)
4 X Execl Throughput               :   12745.19,   12927.75(1.43)
1 X Pipe-based Context Switching   :  178280.42,  170140.15(-4.57)
4 X Pipe-based Context Switching   :  594414.36,  560509.01(-5.70)
1 X Process Creation               :    8657.10,    8659.28(0.03)
4 X Process Creation               :   16476.56,   17007.43(3.22)
1 X Shell Scripts (1 concurrent)   :   10179.24,   10307.21(1.26)
4 X Shell Scripts (1 concurrent)   :   32990.17,   33251.73(0.79)
1 X Shell Scripts (8 concurrent)   :    4878.56,    4940.22(1.26)
4 X Shell Scripts (8 concurrent)   :   14001.89,   13568.88(-3.09)

-------------------------------------------------------------------------------
Schbench    tip/master      eevdf
-------------------------------------------------------------------------------
1 Threads
50.0th:       7.20,       7.00(2.78)
75.0th:       8.20,       7.90(3.66)
90.0th:      10.10,       8.30(17.82)
95.0th:      11.40,       9.30(18.42)
99.0th:      13.30,      11.00(17.29)
99.5th:      13.60,      11.70(13.97)
99.9th:      15.40,      13.40(12.99)
2 Threads
50.0th:       8.60,       8.00(6.98)
75.0th:       9.80,       8.80(10.20)
90.0th:      11.50,       9.90(13.91)
95.0th:      12.40,      10.70(13.71)
99.0th:      13.50,      13.70(-1.48)
99.5th:      14.90,      15.00(-0.67)
99.9th:      27.60,      23.60(14.49)
4 Threads
50.0th:      10.00,       9.90(1.00)
75.0th:      11.70,      12.00(-2.56)
90.0th:      13.60,      14.30(-5.15)
95.0th:      14.90,      15.40(-3.36)
99.0th:      17.80,      18.50(-3.93)
99.5th:      19.00,      19.30(-1.58)
99.9th:      27.60,      32.10(-16.30)
8 Threads
50.0th:      12.20,      13.30(-9.02)
75.0th:      15.20,      17.50(-15.13)
90.0th:      18.40,      21.60(-17.39)
95.0th:      20.70,      24.10(-16.43)
99.0th:      26.30,      30.20(-14.83)
99.5th:      30.50,      37.90(-24.26)
99.9th:      53.10,      92.10(-73.45)
16 Threads
50.0th:      20.70,      19.70(4.83)
75.0th:      28.20,      27.20(3.55)
90.0th:      36.20,      33.80(6.63)
95.0th:      40.70,      37.50(7.86)
99.0th:      51.50,      45.30(12.04)
99.5th:      62.70,      49.40(21.21)
99.9th:     120.70,      88.40(26.76)
32 Threads
50.0th:      39.50,      38.60(2.28)
75.0th:      58.30,      56.10(3.77)
90.0th:      76.40,      72.60(4.97)
95.0th:      86.30,      82.20(4.75)
99.0th:     102.20,      98.90(3.23)
99.5th:     108.00,     105.30(2.50)
99.9th:     179.30,     188.80(-5.30)

-------------------------------------------------------------------------------
Hackbench			  tip/master     eevdf
-------------------------------------------------------------------------------
Process 10 groups          :       0.19,       0.19(0.00)
Process 20 groups          :       0.24,       0.26(-8.33)
Process 30 groups          :       0.30,       0.31(-3.33)
Process 40 groups          :       0.35,       0.37(-5.71)
Process 50 groups          :       0.41,       0.44(-7.32)
Process 60 groups          :       0.47,       0.50(-6.38)
thread  10 groups          :       0.22,       0.23(-4.55)
thread  20 groups          :       0.28,       0.27(3.57)
Process(Pipe) 10 groups    :       0.16,       0.16(0.00)
Process(Pipe) 20 groups    :       0.26,       0.24(7.69)
Process(Pipe) 30 groups    :       0.36,       0.30(16.67)
Process(Pipe) 40 groups    :       0.40,       0.35(12.50)
Process(Pipe) 50 groups    :       0.48,       0.40(16.67)
Process(Pipe) 60 groups    :       0.55,       0.44(20.00)
thread (Pipe) 10 groups    :       0.16,       0.14(12.50)
thread (Pipe) 20 groups    :       0.24,       0.22(8.33)


================
Method 2:
================
This was to compare baseline performance with eevdf when assigning different
latency nice values. In order to do that, created a cgroup and assigned latency
nice values to the cgroup. Microbenchmarks are run from that cgroup.

hackbench pipe shows improvement. schbench shows regression. lmbench and stream
are more or less the same.

-------------------------------------------------------------------------------
lmbench                     tip/master   eevdf(LN=0)   eevdf(LN=-20)  eevdf(LN=19)
-------------------------------------------------------------------------------
latency process fork       :  121.20,  121.35(-0.12),  121.75(-0.45), 120.61(0.49)
latency process Exec       :  177.60,  180.84(-1.82),  177.93(-0.18), 177.44(0.09)
latency process Exec       :    5.80,    6.16(-6.27),    6.14(-5.89),   6.14(-5.91)
latency syscall fstat      :    0.26,    0.26(0.00) ,    0.26(0.00) ,   0.26(0.00)
latency syscall open       :    2.27,    2.29(-0.88),    2.29(-0.88),   2.29(-0.88)
AF_UNIX sock_stream latency:    9.31,    9.61(-3.22),    9.61(-3.22),   9.53(-2.36)
Select on 200 fd'si        :    2.17,    2.15(0.92) ,    2.15(0.92) ,   2.15(0.92)
semaphore latency          :    0.88,    0.89(-1.14),    0.88(0.00) ,   0.88(0.00)

-------------------------------------------------------------------------------
Stream          tip/master   eevdf(LN=0)       eevdf(LN=-20)      eevdf(LN=19)
-------------------------------------------------------------------------------
copy latency   :     0.56,      0.58(-3.57),      0.58(-3.57),       0.58(-3.57)
copy bandwidth : 28767.80,  27520.04(-4.34),  27506.95(-4.38),   27381.61(-4.82)
scale latency  :     0.60,      0.61(-1.67),      0.61(-1.67),       0.61(-1.67)
scale bandwidth: 26875.58,  26385.22(-1.82),  26339.94(-1.99),   26302.86(-2.13)
add latency    :     1.25,      1.25(0.00) ,      1.25(0.00) ,       1.25(0.00)
add bandwidth  : 19175.76,  19177.48(0.01) ,  19177.60(0.01) ,   19176.32(0.00)
triad latency  :     0.74,      0.73(1.35) ,      0.74(0.00) ,       0.74(0.00)
triad bandwidth: 32545.70,  32658.95(0.35) ,  32581.78(0.11) ,   32561.74(0.05)

--------------------------------------------------------------------------------------------------
Unixbench                         tip/master  eevdf(LN=0)      eevdf(LN=-20)   eevdf(LN=19)
--------------------------------------------------------------------------------------------------
1 X Execl Throughput            :  5147.23,    5184.87(0.73),     5217.16(1.36),     5218.21(1.38)
4 X Execl Throughput            : 13225.55,   13638.36(3.12),    13643.07(3.16),    13636.50(3.11)
1 X Pipe-based Context Switching:171413.56,  162720.69(-5.07),  163420.54(-4.66),  163446.67(-4.65)
4 X Pipe-based Context Switching:564887.90,  554545.01(-1.83),  555561.24(-1.65),  547421.20(-3.09)
1 X Process Creation            :  8555.73,    8503.18(-0.61),    8556.39(0.01),     8621.36(0.77)
4 X Process Creation            : 17007.47,   16372.44(-3.73),   17002.88(-0.03),   16611.47(-2.33)
1 X Shell Scripts (1 concurrent): 10104.23,   10235.09(1.30),    10171.44(0.67),    10275.76(1.70)
4 X Shell Scripts (1 concurrent): 33752.14,   32278.50(-4.37),   32885.92(-2.57),   32256.58(-4.43)
1 X Shell Scripts (8 concurrent):  4864.71,    4909.30(0.92),     4914.62(1.03),     4896.45(0.65)
4 X Shell Scripts (8 concurrent): 14237.17,   13395.20(-5.91),   13599.52(-4.48),   12923.93(-9.22)


-------------------------------------------------------------------------------
schbench    tip/master   eevdf(LN=0)       eevdf(LN=-20)      eevdf(LN=19)
-------------------------------------------------------------------------------
1 Threads
50.0th:       6.90,       7.30(-5.80),       7.30(-5.80),      7.10(-2.90)
75.0th:       7.90,       8.40(-6.33),       8.60(-8.86),      8.00(-1.27)
90.0th:      10.10,       9.60(4.95),       10.50(-3.96),      8.90(11.88)
95.0th:      11.20,      10.60(5.36),       11.10(0.89),       9.40(16.07)
99.0th:      13.30,      12.70(4.51),       12.80(3.76),      11.80(11.28)
99.5th:      13.90,      13.50(2.88),       13.60(2.16),      12.40(10.79)
99.9th:      15.00,      15.40(-2.67),      15.20(-1.33),     13.70(8.67)
2 Threads
50.0th:       7.20,       8.10(-12.50),      8.00(-11.11),     8.40(-16.67)
75.0th:       8.30,       9.20(-10.84),      9.00(-8.43),      9.70(-16.87)
90.0th:      10.10,      11.00(-8.91),      10.00(0.99),      11.00(-8.91)
95.0th:      11.30,      12.60(-11.50),     10.60(6.19),      11.60(-2.65)
99.0th:      14.40,      15.40(-6.94),      11.90(17.36),     13.70(4.86)
99.5th:      15.20,      16.10(-5.92),      13.20(13.16),     14.60(3.95)
99.9th:      16.40,      17.30(-5.49),      14.70(10.37),     16.20(1.22)
4 Threads
50.0th:       8.90,      10.30(-15.73),     10.00(-12.36),    10.10(-13.48)
75.0th:      10.80,      12.10(-12.04),     11.80(-9.26),     12.00(-11.11)
90.0th:      13.00,      14.00(-7.69),      13.70(-5.38),     14.30(-10.00)
95.0th:      14.40,      15.20(-5.56),      14.90(-3.47),     15.80(-9.72)
99.0th:      16.90,      17.50(-3.55),      18.70(-10.65),    19.80(-17.16)
99.5th:      17.40,      18.50(-6.32),      19.80(-13.79),    22.10(-27.01)
99.9th:      18.70,      22.30(-19.25),     22.70(-21.39),    37.50(-100.53)
8 Threads
50.0th:      11.50,      12.80(-11.30),     13.30(-15.65),    12.80(-11.30)
75.0th:      15.00,      16.30(-8.67),      16.90(-12.67),    16.20(-8.00)
90.0th:      18.80,      19.50(-3.72),      20.30(-7.98),     19.90(-5.85)
95.0th:      21.40,      21.80(-1.87),      22.30(-4.21),     22.10(-3.27)
99.0th:      27.60,      26.30(4.71) ,      27.60(0.00),      27.30(1.09)
99.5th:      30.40,      32.40(-6.58),      36.40(-19.74),    30.00(1.32)
99.9th:      56.90,      59.10(-3.87),      66.70(-17.22),    60.90(-7.03)
16 Threads
50.0th:      19.20,      20.90(-8.85),      20.60(-7.29),     21.00(-9.38)
75.0th:      25.30,      27.50(-8.70),      27.80(-9.88),     28.30(-11.86)
90.0th:      31.20,      34.60(-10.90),     35.10(-12.50),    35.20(-12.82)
95.0th:      35.40,      38.90(-9.89),      39.50(-11.58),    39.20(-10.73)
99.0th:      44.90,      47.60(-6.01),      47.50(-5.79),     47.60(-6.01)
99.5th:      48.50,      50.50(-4.12),      50.20(-3.51),     55.60(-14.64)
99.9th:      70.80,      84.70(-19.63),     81.40(-14.97),   103.50(-46.19)
32 Threads
50.0th:      39.10,      38.60(1.28),       36.10(7.67),      39.50(-1.02)
75.0th:      57.20,      56.10(1.92),       52.00(9.09),      57.70(-0.87)
90.0th:      74.00,      73.70(0.41),       65.70(11.22),     74.40(-0.54)
95.0th:      82.30,      83.50(-1.46),      74.20(9.84),      84.50(-2.67)
99.0th:      95.80,      98.60(-2.92),      92.10(3.86),     100.50(-4.91)
99.5th:     101.50,     104.10(-2.56),      98.90(2.56),     108.20(-6.60)
99.9th:     185.70,     179.90(3.12),      163.50(11.95),    193.00(-3.93)

-------------------------------------------------------------------------------
Hackbench                tip/master   eevdf(LN=0)   eevdf(LN=-20)  eevdf(LN=19)
-------------------------------------------------------------------------------
Process 10 groups       :   0.19,    0.19(0.00),    0.19(0.00),    0.19(0.00)
Process 20 groups       :   0.24,    0.25(-4.17),   0.26(-8.33),   0.25(-4.17)
Process 30 groups       :   0.30,    0.31(-3.33),   0.31(-3.33),   0.30(0.00)
Process 40 groups       :   0.35,    0.37(-5.71),   0.38(-8.57),   0.38(-8.57)
Process 50 groups       :   0.43,    0.44(-2.33),   0.44(-2.33),   0.44(-2.33)
Process 60 groups       :   0.49,    0.52(-6.12),   0.51(-4.08),   0.51(-4.08)
thread  10 groups       :   0.23,    0.22(4.35),    0.23(0.00),    0.23(0.00)
thread  20 groups       :   0.28,    0.28(0.00),    0.27(3.57),    0.28(0.00)
Process(Pipe) 10 groups :   0.17,    0.16(5.88),    0.16(5.88),    0.16(5.88)
Process(Pipe) 20 groups :   0.25,    0.24(4.00),    0.24(4.00),    0.24(4.00)
Process(Pipe) 30 groups :   0.32,    0.29(9.38),    0.29(9.38),    0.29(9.38)
Process(Pipe) 40 groups :   0.39,    0.34(12.82),   0.34(12.82),   0.34(12.82)
Process(Pipe) 50 groups :   0.45,    0.39(13.33),   0.39(13.33),   0.38(15.56)
Process(Pipe) 60 groups :   0.51,    0.43(15.69),   0.43(15.69),   0.43(15.69)
thread(Pipe)  10 groups :   0.16,    0.15(6.25),    0.15(6.25),    0.15(6.25)
thread(Pipe)  20 groups :   0.24,    0.22(8.33),    0.22(8.33),    0.22(8.33)

================
Method 3:
================
Comparing baseline vs eevdf when the system utilization is 50%. A cpu cgroup is
created and different latency nice values are assigned to it. In another bash
terminal stress-ng is running at 50% utilization (stress-ng --cpu=480 -l 50).

Hackbench shows gains in latency values. Schbench shows good gains in latency values
except in the 1 thread case. lmbench and stream regress slightly. unixbench is mixed.

One concerning throughput result is 4 X Shell Scripts (8 concurrent), which shows a 50%
regression. This was verified with an additional run. The same holds true for 25%
utilization as well.

-------------------------------------------------------------------------------
lmbench                     tip/master   eevdf(LN=0)   eevdf(LN=-20)  eevdf(LN=19)
-------------------------------------------------------------------------------
latency process fork       :152.98,   158.34(-3.50),  155.07(-1.36),   157.57(-3.00)
latency process Exec       :214.30,   214.08(0.10),   214.41(-0.05),   215.16(-0.40)
latency process Exec       : 12.44,    11.86(4.66),    10.60(14.79),    10.58(14.94)
latency syscall fstat      :  0.44,     0.45(-2.27),    0.43(2.27),      0.45(-2.27)
latency syscall open       :  3.71,     3.68(0.81),     3.70(0.27),      3.74(-0.81)
AF_UNIX sock stream latency: 14.07,    13.44(4.48),    14.69(-4.41),    13.65(2.99)
Select on 200 fd'si        :  3.97,     4.16(-4.79),    4.02(-1.26),     4.21(-6.05)
semaphore latency          :  1.83,     1.82(0.55),     1.77(3.28),      1.75(4.37)

-------------------------------------------------------------------------------
Stream          tip/master   eevdf(LN=0)       eevdf(LN=-20)        eevdf(LN=19)
-------------------------------------------------------------------------------
copy latency   :       0.69,       0.69(0.00),       0.76(-10.14),      0.72(-4.35)
copy bandwidth :   23947.02,   24275.24(1.37),   22032.30(-8.00),   23487.29(-1.92)
scale latency  :       0.71,       0.74(-4.23),      0.75(-5.63),       0.77(-8.45)
scale bandwidth:   23490.27,   22713.99(-3.30),  22168.98(-5.62),   21782.47(-7.27)
add latency    :       1.34,       1.36(-1.49),      1.39(-3.73),       1.42(-5.97)
add bandwidth  :   17986.34,   17771.92(-1.19),  17461.59(-2.92),   17276.34(-3.95)
triad latency  :       0.91,       0.93(-2.20),      0.91(0.00),        0.94(-3.30)
triad bandwidth:   27948.13,   27652.98(-1.06),  28134.58(0.67),    27269.73(-2.43)

-------------------------------------------------------------------------------------------------
Unixbench                          tip/master    eevdf(LN=0)      eevdf(LN=-20)   eevdf(LN=19)
-------------------------------------------------------------------------------------------------
1 X Execl Throughput            :   4940.56,    4944.30(0.08),    4991.69(1.03),     4982.80(0.85)
4 X Execl Throughput            :  10737.13,   10885.69(1.38),   10615.75(-1.13),   10803.82(0.62)
1 X Pipe-based Context Switching:  91313.57,  103426.11(13.26), 102985.91(12.78),  104614.22(14.57)
4 X Pipe-based Context Switching: 370430.07,  408075.33(10.16), 409273.07(10.49),  431360.88(16.45)
1 X Process Creation            :   6844.45,    6854.06(0.14),    6887.63(0.63),     6894.30(0.73)
4 X Process Creation            :  18690.31,   19307.50(3.30),   19425.39(3.93),    19128.43(2.34)
1 X Shell Scripts (1 concurrent):   8184.52,    8135.30(-0.60),   8185.53(0.01),     8163.10(-0.26)
4 X Shell Scripts (1 concurrent):  25737.71,   22583.29(-12.26), 22470.35(-12.69),  22615.13(-12.13)
1 X Shell Scripts (8 concurrent):   3653.71,    3115.03(-14.74),  3156.26(-13.61),   3106.63(-14.97)    <<<<< This may be of concern.
4 X Shell Scripts (8 concurrent):   9625.38,    4505.63(-53.19),  4484.03(-53.41),   4468.70(-53.57)    <<<<< This is a concerning one.

-------------------------------------------------------------------------------
schbench    tip/master   eevdf(LN=0)       eevdf(LN=-20)      eevdf(LN=19)
-------------------------------------------------------------------------------
1 Threads
50.0th:      15.10,      15.20(-0.66),      15.10(0.00),       15.10(0.00)
75.0th:      17.20,      17.70(-2.91),      17.20(0.00),       17.40(-1.16)
90.0th:      20.10,      20.70(-2.99),      20.40(-1.49),      20.70(-2.99)
95.0th:      22.20,      22.80(-2.70),      22.60(-1.80),      23.10(-4.05)
99.0th:      45.10,      51.50(-14.19),     37.20(17.52),      44.50(1.33)
99.5th:      79.80,     106.20(-33.08),    103.10(-29.20),    101.00(-26.57)
99.9th:     206.60,     771.40(-273.38),  1003.50(-385.72),   905.50(-338.29)
2 Threads
50.0th:      16.50,      17.00(-3.03),      16.70(-1.21),      16.20(1.82)
75.0th:      19.20,      19.90(-3.65),      19.40(-1.04),      18.90(1.56)
90.0th:      22.20,      23.10(-4.05),      22.80(-2.70),      22.00(0.90)
95.0th:      24.30,      25.40(-4.53),      25.20(-3.70),      24.50(-0.82)
99.0th:      97.00,      41.70(57.01),      43.00(55.67),      45.10(53.51)
99.5th:     367.10,      96.70(73.66),      98.80(73.09),     104.60(71.51)
99.9th:    3770.80,     811.40(78.48),    1414.70(62.48),     886.90(76.48)
4 Threads
50.0th:      20.00,      20.10(-0.50),      19.70(1.50),       19.50(2.50)
75.0th:      23.50,      23.40(0.43),       22.80(2.98),       23.00(2.13)
90.0th:      28.00,      27.00(3.57),       26.50(5.36),       26.60(5.00)
95.0th:      37.20,      29.50(20.70),      28.90(22.31),      28.80(22.58)
99.0th:    2792.50,      42.80(98.47),      38.30(98.63),      37.00(98.68)
99.5th:    4964.00,     101.50(97.96),      85.00(98.29),      70.20(98.59)
99.9th:    7864.80,    1722.20(78.10),     755.40(90.40),     817.10(89.61)
8 Threads
50.0th:      25.30,      24.50(3.16),       24.30(3.95),       23.60(6.72)
75.0th:      31.80,      30.00(5.66),       29.90(5.97),       29.30(7.86)
90.0th:      39.30,      35.00(10.94),      35.00(10.94),      34.20(12.98)
95.0th:     198.00,      38.20(80.71),      38.20(80.71),      37.40(81.11)
99.0th:    4601.20,      56.30(98.78),      85.90(98.13),      65.30(98.58)
99.5th:    6422.40,     162.70(97.47),     195.30(96.96),     153.40(97.61)
99.9th:    9684.00,    3237.60(66.57),    3726.40(61.52),    3965.60(59.05)
16 Threads
50.0th:      37.00,      35.20(4.86),       33.90(8.38),       34.00(8.11)
75.0th:      49.20,      46.00(6.50),       44.20(10.16),      44.40(9.76)
90.0th:      64.20,      54.80(14.64),      52.80(17.76),      53.20(17.13)
95.0th:     890.20,      59.70(93.29),      58.20(93.46),      58.60(93.42)
99.0th:    5369.60,      85.30(98.41),     124.90(97.67),     116.90(97.82)
99.5th:    6952.00,     228.00(96.72),     680.20(90.22),     339.40(95.12)
99.9th:    9222.40,    4896.80(46.90),    4648.40(49.60),    4365.20(52.67)
32 Threads
50.0th:      59.60,      56.80(4.70),       55.30(7.21),       56.00(6.04)
75.0th:      83.70,      78.70(5.97),       75.90(9.32),       77.50(7.41)
90.0th:     122.70,      95.50(22.17),      92.40(24.69),      93.80(23.55)
95.0th:    1680.40,     105.00(93.75),     102.20(93.92),     103.70(93.83)
99.0th:    6540.80,     382.10(94.16),     321.10(95.09),     489.30(92.52)
99.5th:    8094.40,    2144.20(73.51),    2172.70(73.16),    1990.70(75.41)
99.9th:   11417.60,    6672.80(41.56),    6903.20(39.54),    6268.80(45.10)

-------------------------------------------------------------------------------
Hackbench                tip/master   eevdf(LN=0)   eevdf(LN=-20)  eevdf(LN=19)
-------------------------------------------------------------------------------
Process 10  groups     :   0.18,     0.18(0.00),    0.18(0.00),     0.18(0.00)
Process 20  groups     :   0.32,     0.33(-3.13),   0.33(-3.13),    0.33(-3.13)
Process 30  groups     :   0.42,     0.43(-2.38),   0.43(-2.38),    0.43(-2.38)
Process 40  groups     :   0.51,     0.53(-3.92),   0.53(-3.92),    0.53(-3.92)
Process 50  groups     :   0.62,     0.64(-3.23),   0.65(-4.84),    0.64(-3.23)
Process 60  groups     :   0.72,     0.73(-1.39),   0.74(-2.78),    0.74(-2.78)
thread  10  groups     :   0.19,     0.19(0.00),    0.19(0.00),     0.19(0.00)
thread  20  groups     :   0.33,     0.34(-3.03),   0.34(-3.03),    0.34(-3.03)
Process(Pipe) 10 groups:   0.17,     0.16(5.88),    0.16(5.88),     0.16(5.88)
Process(Pipe) 20 groups:   0.25,     0.23(8.00),    0.23(8.00),     0.23(8.00)
Process(Pipe) 30 groups:   0.36,     0.31(13.89),   0.31(13.89),    0.31(13.89)
Process(Pipe) 40 groups:   0.42,     0.36(14.29),   0.36(14.29),    0.36(14.29)
Process(Pipe) 50 groups:   0.49,     0.42(14.29),   0.41(16.33),    0.42(14.29)
Process(Pipe) 60 groups:   0.53,     0.44(16.98),   0.44(16.98),    0.44(16.98)
thread(Pipe)  10 groups:   0.14,     0.14(0.00),    0.14(0.00),     0.14(0.00)
thread(Pipe)  20 groups:   0.24,     0.24(0.00),    0.22(8.33),     0.23(4.17)


================
Method 4:
================
Running daytrader on an idle system without any cgroup. daytrader is a trading
simulator application which does buy/sell, intraday trading, etc. It is a
throughput oriented workload driven by JMeter.
reference: https://www.ibm.com/docs/en/linux-on-systems?topic=bad-daytrader

We see around 7% improvement in throughput with eevdf.

--------------------------------------------------------------------------
daytrader			tip/master		eevdf
--------------------------------------------------------------------------
Total throughputs                  1x                  1.0717x(7.17%)


================
Method 5:
================
Running daytrader on a system where utilization is 50%. A cgroup was created,
the benchmark was run in it, and different latency nice values were assigned to
it. On another bash terminal stress-ng is running at 50% utilization.

At LN=0, we see a 9% improvement with eevdf compared to baseline.

-------------------------------------------------------------------------
daytrader	   tip/master   eevdf	      eevdf          eevdf
				(LN=0)       (LN=-20)       (LN=19)
-------------------------------------------------------------------------
Total throughputs      1x      1.0923x(9.2%)   1.0759x(7.6%)   1.111x(11.1%)


Tested-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 06/17] sched/fair: Add lag based placement
  2023-03-28  9:26 ` [PATCH 06/17] sched/fair: Add lag based placement Peter Zijlstra
@ 2023-04-03  9:18   ` Chen Yu
  2023-04-05  9:47     ` Peter Zijlstra
  0 siblings, 1 reply; 55+ messages in thread
From: Chen Yu @ 2023-04-03  9:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, youssefesmat, joel,
	efault

On 2023-03-28 at 11:26:28 +0200, Peter Zijlstra wrote:
>  place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
[...]
>  		/*
> -		 * Halve their sleep time's effect, to allow
> -		 * for a gentler effect of sleepers:
> +		 * If we want to place a task and preserve lag, we have to
> +		 * consider the effect of the new entity on the weighted
> +		 * average and compensate for this, otherwise lag can quickly
> +		 * evaporate:
> +		 *
> +		 * l_i = V - v_i <=> v_i = V - l_i
> +		 *
> +		 * V = v_avg = W*v_avg / W
> +		 *
> +		 * V' = (W*v_avg + w_i*v_i) / (W + w_i)
If I understand correctly, V' means the avg_runtime if se_i is enqueued?
Then,

V  = (\Sum w_j*v_j) / W

V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)

Not sure how W*v_avg equals to Sum w_j*v_j ?

> +		 *    = (W*v_avg + w_i(v_avg - l_i)) / (W + w_i)
> +		 *    = v_avg + w_i*l_i/(W + w_i)
v_avg - w_i*l_i/(W + w_i) ?
> +		 *
> +		 * l_i' = V' - v_i = v_avg + w_i*l_i/(W + w_i) - (v_avg - l)
> +		 *      = l_i - w_i*l_i/(W + w_i)
> +		 *
> +		 * l_i = (W + w_i) * l_i' / W
>  		 */
[...]
> -		if (sched_feat(GENTLE_FAIR_SLEEPERS))
> -			thresh >>= 1;
> +		load = cfs_rq->avg_load;
> +		if (curr && curr->on_rq)
> +			load += curr->load.weight;
> +
> +		lag *= load + se->load.weight;
> +		if (WARN_ON_ONCE(!load))
> +			load = 1;
> +		lag = div_s64(lag, load);
>  
Should we calculate
l_i' = l_i * w / (W + w_i) instead of calculating l_i above? I thought we want to adjust
the lag(before enqueue) based on the new weight(after enqueued)


[I will start to run some benchmarks today.]

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 14/17] sched/eevdf: Better handle mixed slice length
  2023-03-31 15:26   ` Vincent Guittot
@ 2023-04-04  9:29     ` Peter Zijlstra
  2023-04-04 13:50       ` Joel Fernandes
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2023-04-04  9:29 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault

On Fri, Mar 31, 2023 at 05:26:51PM +0200, Vincent Guittot wrote:
> On Tue, 28 Mar 2023 at 13:06, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > In the case where (due to latency-nice) there are different request
> > sizes in the tree, the smaller requests tend to be dominated by the
> > larger. Also note how the EEVDF lag limits are based on r_max.
> >
> > Therefore; add a heuristic that for the mixed request size case, moves
> > smaller requests to placement strategy #2 which ensures they're
> > immediately eligible and due to their smaller (virtual) deadline
> > will cause preemption.
> >
> > NOTE: this relies on update_entity_lag() to impose lag limits above
> > a single slice.
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  kernel/sched/fair.c     |   14 ++++++++++++++
> >  kernel/sched/features.h |    1 +
> >  kernel/sched/sched.h    |    1 +
> >  3 files changed, 16 insertions(+)
> >
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -616,6 +616,7 @@ avg_vruntime_add(struct cfs_rq *cfs_rq,
> >         s64 key = entity_key(cfs_rq, se);
> >
> >         cfs_rq->avg_vruntime += key * weight;
> > +       cfs_rq->avg_slice += se->slice * weight;
> >         cfs_rq->avg_load += weight;
> >  }
> >
> > @@ -626,6 +627,7 @@ avg_vruntime_sub(struct cfs_rq *cfs_rq,
> >         s64 key = entity_key(cfs_rq, se);
> >
> >         cfs_rq->avg_vruntime -= key * weight;
> > +       cfs_rq->avg_slice -= se->slice * weight;
> >         cfs_rq->avg_load -= weight;
> >  }
> >
> > @@ -4832,6 +4834,18 @@ place_entity(struct cfs_rq *cfs_rq, stru
> >                 lag = se->vlag;
> >
> >                 /*
> > +                * For latency sensitive tasks; those that have a shorter than
> > +                * average slice and do not fully consume the slice, transition
> > +                * to EEVDF placement strategy #2.
> > +                */
> > +               if (sched_feat(PLACE_FUDGE) &&
> > +                   cfs_rq->avg_slice > se->slice * cfs_rq->avg_load) {
> > +                       lag += vslice;
> > +                       if (lag > 0)
> > +                               lag = 0;
> 
> By using different lag policies for tasks, doesn't this create
> unfairness between tasks ?

Possibly, I've just not managed to trigger it yet -- if it is an issue
it can be fixed by ensuring we don't place the entity before its
previous vruntime just like the sleeper hack later on.
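
For reference, the PLACE_FUDGE condition quoted above is an integer-only way of
asking whether this entity's slice is shorter than the weighted mean slice of
the queue; a minimal sketch (the helper name is illustrative, not an actual
kernel function), assuming avg_slice and avg_load are maintained as in the
avg_vruntime_add() hunk quoted above:

#include <stdint.h>

/*
 * avg_slice accumulates \Sum w_j * slice_j and avg_load accumulates \Sum w_j,
 * so:
 *
 *   avg_slice > slice * avg_load  <=>  slice < avg_slice / avg_load
 *
 * i.e. "is this entity's request shorter than the weighted mean slice?",
 * answered without doing the division.
 */
static inline int slice_shorter_than_mean(uint64_t avg_slice, uint64_t avg_load,
					  uint64_t slice)
{
	return avg_slice > avg_load * slice;
}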

> I wanted to stress this situation with a simple use case but it seems
> that even without changing the slice, there is a fairness problem:
> 
> Task A always run
> Task B loops on : running 1ms then sleeping 1ms
> default nice and latency nice prio for both
> each task should get around 50% of the time.
> 
> The fairness is ok with tip/sched/core
> but with eevdf, Task B only gets around 30%
> 
> I haven't identified the problem so far

Heh, this is actually the correct behaviour. If you have a u=1 and a
u=.5 task, you should distribute time on a 2:1 basis, eg. 67% vs 33%.

CFS has this sleeper bonus hack that makes it 50% vs 50% but that is
strictly not correct -- although it does help a number of weird
benchmarks.
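
A minimal user-space sketch (assuming ideal fair sharing whenever both tasks
are runnable, and ignoring scheduling granularity; not part of the test setup
in this thread) that reproduces the 2:1 split for an always-running task versus
a run-1ms/sleep-1ms task:

#include <stdio.h>

/*
 * Task A is always runnable; task B runs a 1ms CPU burst, then sleeps 1ms of
 * wall time.  While both are runnable they share the CPU 50/50; while B
 * sleeps, A runs alone.  One B cycle therefore takes 3ms of wall time: 2ms
 * contended (1ms of CPU each) plus 1ms with A alone.
 */
int main(void)
{
	double dt = 0.001;		/* 1us step; all times in ms */
	double t_a = 0, t_b = 0;	/* accumulated CPU time */
	double b_budget = 1.0;		/* B's remaining 1ms run burst */
	double b_sleep = 0.0;		/* B's remaining sleep time */

	for (double t = 0; t < 3000.0; t += dt) {
		if (b_sleep > 0) {		/* B asleep: A runs alone */
			t_a += dt;
			b_sleep -= dt;
		} else {			/* both runnable: split evenly */
			t_a += dt / 2;
			t_b += dt / 2;
			b_budget -= dt / 2;
			if (b_budget <= 0) {	/* B finished its 1ms burst */
				b_budget = 1.0;
				b_sleep = 1.0;
			}
		}
	}

	printf("A: %.1f%%  B: %.1f%%\n",	/* ~66.7% vs ~33.3% */
	       100.0 * t_a / (t_a + t_b), 100.0 * t_b / (t_a + t_b));
	return 0;
}

Per 3ms cycle B accrues 1ms of CPU while A accrues 2ms, hence the 67%/33%
split described above.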


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/17] sched/fair: Implement an EEVDF like policy
  2023-03-30 17:05       ` Vincent Guittot
@ 2023-04-04 12:00         ` Peter Zijlstra
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2023-04-04 12:00 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, efault

On Thu, Mar 30, 2023 at 07:05:17PM +0200, Vincent Guittot wrote:
> On Thu, 30 Mar 2023 at 10:04, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Mar 29, 2023 at 04:35:25PM +0200, Vincent Guittot wrote:
> >
> > > IIUC how it works, Vd = ve + r / wi
> > >
> > > So for a same weight, the vd will be earlier but it's no more alway
> > > true for different weight
> >
> > Correct; but for a heavier task the time also goes slower and since it
> > needs more time, you want it to go first. But yes, this is weird at
> > first glance.
> 
> Yeah, I understand that this is needed for bounding the lag to a
> quantum max but that makes the latency prioritization less obvious and
> not always aligned with what we want

That very much depends on what we want I suppose :-) So far there's not
been strong definitions of what we want, other than that we consider a
negative latency nice task to get its slice a little earlier where
possible.

(also, I rather like that vagueness -- just like nice is rather vague,
it gives us room for interpretation when implementing things)

> let's say that you have 2 tasks A and B waking up simultaneously with
> the same vruntime; A has a negative latency nice to reflect some
> latency constraint and B the default value.  A will run 1st if they
> both have the same prio, which is aligned with their latency nice
> values, but B could run 1st if it increases its nice prio to reflect
> the need for a larger cpu bandwidth, so you can defeat the purpose of
> the latency nice although there is no unfairness.
> 
> A cares about its latency and sets a negative latency nice to reduce
> its request slice.

This is true; but is it really a problem? It's all relative anyway :-)

Specifically, if you latency-nice harder than you nice it, you win again;
also, nice is privileged, while latency-nice is not (should it be?)

The thing I like about EEVDF is that it actually makes the whole thing
simpler, it takes away a whole bunch of magic, and yes the latency thing
is perhaps more relative than absolute, but isn't that good enough?


That said; IIRC there's a few papers (which I can no longer find because
apparently google can now only give me my own patches and the opinion of
the internet on them when I search EEVDF :/) that muck with the {w,r}
set to build 'realtime' schedulers on top of EEVDF. So there's certainly
room to play around a bit.
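
To make the Vd = ve + r / wi relation discussed at the top of this exchange
concrete, a small sketch with made-up numbers (same eligible time and request,
different weights):

#include <stdio.h>

/*
 * vd = ve + r / w: the heavier task gets the earlier virtual deadline and is
 * picked first, but while it runs its vruntime also advances more slowly
 * (roughly delta_exec / w), so the overall split stays weight-proportional.
 */
int main(void)
{
	double ve = 100.0, r = 3.0;		/* ms, arbitrary */
	double w_heavy = 2.0, w_light = 1.0;	/* relative weights */

	printf("vd(heavy) = %.1f\n", ve + r / w_heavy);	/* 101.5: picked first */
	printf("vd(light) = %.1f\n", ve + r / w_light);	/* 103.0 */

	/* virtual time consumed by the heavy task for its full 3ms request: */
	printf("heavy vruntime advance = %.1f\n", r / w_heavy);	/* only 1.5 */
	/* the light task would advance 3.0 for the same 3ms of CPU */
	return 0;
}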

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 14/17] sched/eevdf: Better handle mixed slice length
  2023-04-04  9:29     ` Peter Zijlstra
@ 2023-04-04 13:50       ` Joel Fernandes
  2023-04-05  5:41         ` Mike Galbraith
  2023-04-05  8:35         ` Peter Zijlstra
  0 siblings, 2 replies; 55+ messages in thread
From: Joel Fernandes @ 2023-04-04 13:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, mingo, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, efault

On Tue, Apr 04, 2023 at 11:29:36AM +0200, Peter Zijlstra wrote:
> On Fri, Mar 31, 2023 at 05:26:51PM +0200, Vincent Guittot wrote:
> > On Tue, 28 Mar 2023 at 13:06, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > In the case where (due to latency-nice) there are different request
> > > sizes in the tree, the smaller requests tend to be dominated by the
> > > larger. Also note how the EEVDF lag limits are based on r_max.
> > >
> > > Therefore; add a heuristic that for the mixed request size case, moves
> > > smaller requests to placement strategy #2 which ensures they're
> > > immediately eligible and due to their smaller (virtual) deadline
> > > will cause preemption.
> > >
> > > NOTE: this relies on update_entity_lag() to impose lag limits above
> > > a single slice.
> > >
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > ---
> > >  kernel/sched/fair.c     |   14 ++++++++++++++
> > >  kernel/sched/features.h |    1 +
> > >  kernel/sched/sched.h    |    1 +
> > >  3 files changed, 16 insertions(+)
> > >
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -616,6 +616,7 @@ avg_vruntime_add(struct cfs_rq *cfs_rq,
> > >         s64 key = entity_key(cfs_rq, se);
> > >
> > >         cfs_rq->avg_vruntime += key * weight;
> > > +       cfs_rq->avg_slice += se->slice * weight;
> > >         cfs_rq->avg_load += weight;
> > >  }
> > >
> > > @@ -626,6 +627,7 @@ avg_vruntime_sub(struct cfs_rq *cfs_rq,
> > >         s64 key = entity_key(cfs_rq, se);
> > >
> > >         cfs_rq->avg_vruntime -= key * weight;
> > > +       cfs_rq->avg_slice -= se->slice * weight;
> > >         cfs_rq->avg_load -= weight;
> > >  }
> > >
> > > @@ -4832,6 +4834,18 @@ place_entity(struct cfs_rq *cfs_rq, stru
> > >                 lag = se->vlag;
> > >
> > >                 /*
> > > +                * For latency sensitive tasks; those that have a shorter than
> > > +                * average slice and do not fully consume the slice, transition
> > > +                * to EEVDF placement strategy #2.
> > > +                */
> > > +               if (sched_feat(PLACE_FUDGE) &&
> > > +                   cfs_rq->avg_slice > se->slice * cfs_rq->avg_load) {
> > > +                       lag += vslice;
> > > +                       if (lag > 0)
> > > +                               lag = 0;
> > 
> > By using different lag policies for tasks, doesn't this create
> > unfairness between tasks ?
> 
> Possibly, I've just not managed to trigger it yet -- if it is an issue
> it can be fixed by ensuring we don't place the entity before its
> previous vruntime just like the sleeper hack later on.
> 
> > I wanted to stress this situation with a simple use case but it seems
> > that even without changing the slice, there is a fairness problem:
> > 
> > Task A always run
> > Task B loops on : running 1ms then sleeping 1ms
> > default nice and latency nice prio for both
> > each task should get around 50% of the time.
> > 
> > The fairness is ok with tip/sched/core
> > but with eevdf, Task B only gets around 30%
> > 
> > I haven't identified the problem so far
> 
> Heh, this is actually the correct behaviour. If you have a u=1 and a
> u=.5 task, you should distribute time on a 2:1 basis, eg. 67% vs 33%.

Splitting like that sounds like starvation of the sleeper to me. If something
sleeps a lot, it will get even less CPU time on an average than it would if
there was no contention from the u=1 task.

And also CGroups will be even more weird than it already is in such a world,
2 different containers will not get CPU time distributed properly- say if
tasks in one container sleep a lot and tasks in another container are CPU
bound.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 14/17] sched/eevdf: Better handle mixed slice length
  2023-04-04 13:50       ` Joel Fernandes
@ 2023-04-05  5:41         ` Mike Galbraith
  2023-04-05  8:35         ` Peter Zijlstra
  1 sibling, 0 replies; 55+ messages in thread
From: Mike Galbraith @ 2023-04-05  5:41 UTC (permalink / raw)
  To: Joel Fernandes, Peter Zijlstra
  Cc: Vincent Guittot, mingo, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat

On Tue, 2023-04-04 at 13:50 +0000, Joel Fernandes wrote:
> On Tue, Apr 04, 2023 at 11:29:36AM +0200, Peter Zijlstra wrote:
> > On Fri, Mar 31, 2023 at 05:26:51PM +0200, Vincent Guittot wrote:
> >
> > >
> > > Task A always run
> > > Task B loops on : running 1ms then sleeping 1ms
> > > default nice and latency nice prio for both
> > > each task should get around 50% of the time.
> > >
> > > The fairness is ok with tip/sched/core
> > > but with eevdf, Task B only gets around 30%
> > >
> > > I haven't identified the problem so far
> >
> > Heh, this is actually the correct behaviour. If you have a u=1 and a
> > u=.5 task, you should distribute time on a 2:1 basis, eg. 67% vs 33%.
>
> Splitting like that sounds like starvation of the sleeper to me. If something
> sleeps a lot, it will get even less CPU time on an average than it would if
> there was no contention from the u=1 task.
>
> And also CGroups will be even more weird than it already is in such a world,
> 2 different containers will not get CPU time distributed properly- say if
> tasks in one container sleep a lot and tasks in another container are CPU
> bound.

Let's take a quick peek at some group distribution numbers.

start tbench and massive_intr in their own VT (autogroup), then in
another, sleep 300;killall top massive_intr tbench_srv tbench.

(caveman method because perf's refusing to handle fast switchers well
for me.. top's plenty good enough for this anyway, and less intrusive)

massive_intr runs 8ms, sleeps 1ms, wants 88.8% of 8 runqueues.  tbench
buddy pairs want only a tad more CPU, 100% between them, but switch
orders of magnitude more frequently.  Very dissimilar breeds of hog.

master.today      accrued   of 2400s vs master
team massive_intr 1120.50s     .466      1.000
team tbench       1256.13s     .523      1.000

+eevdf
team massive_intr 1071.94s     .446       .956
team tbench       1301.56s     .542      1.036

There is of course a distribution delta.. but was it meaningful?

Between mostly idle but kinda noisy GUI perturbing things, and more
importantly, neither load having been manually distributed and pinned,
both schedulers came out pretty good, and both a tad shy of.. perfect
is the enemy of good.

Raw numbers below in case my mouse mucked up feeding of numbers to bc
(blame the inanimate, they can't do a damn thing about it).

6.3.0.g148341f-master
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND
 5641 root      20   0    2564    640    640 R 50.33 0.004   2:17.19 5 massive_intr
 5636 root      20   0    2564    640    640 S 49.00 0.004   2:20.05 5 massive_intr
 5647 root      20   0    2564    640    640 R 48.67 0.004   2:21.85 6 massive_intr
 5640 root      20   0    2564    640    640 R 48.00 0.004   2:21.13 6 massive_intr
 5645 root      20   0    2564    640    640 R 47.67 0.004   2:18.25 5 massive_intr
 5638 root      20   0    2564    640    640 R 46.67 0.004   2:22.39 2 massive_intr
 5634 root      20   0    2564    640    640 R 45.00 0.004   2:18.93 4 massive_intr
 5643 root      20   0    2564    640    640 R 44.00 0.004   2:20.71 7 massive_intr
 5639 root      20   0   23468   1664   1536 R 29.00 0.010   1:22.31 3 tbench
 5644 root      20   0   23468   1792   1664 R 28.67 0.011   1:22.32 3 tbench
 5637 root      20   0   23468   1664   1536 S 28.00 0.010   1:22.75 5 tbench
 5631 root      20   0   23468   1792   1664 R 27.00 0.011   1:21.47 4 tbench
 5632 root      20   0   23468   1536   1408 R 27.00 0.010   1:21.78 0 tbench
 5653 root      20   0    6748    896    768 S 26.67 0.006   1:15.26 3 tbench_srv
 5633 root      20   0   23468   1792   1664 R 26.33 0.011   1:22.53 0 tbench
 5635 root      20   0   23468   1920   1792 R 26.33 0.012   1:20.72 7 tbench
 5642 root      20   0   23468   1920   1792 R 26.00 0.012   1:21.73 2 tbench
 5650 root      20   0    6748    768    768 R 25.67 0.005   1:15.71 1 tbench_srv
 5652 root      20   0    6748    768    768 S 25.67 0.005   1:15.71 3 tbench_srv
 5646 root      20   0    6748    768    768 S 25.33 0.005   1:14.97 4 tbench_srv
 5648 root      20   0    6748    896    768 S 25.00 0.006   1:14.66 0 tbench_srv
 5651 root      20   0    6748    896    768 S 24.67 0.006   1:14.79 2 tbench_srv
 5654 root      20   0    6748    768    768 R 24.33 0.005   1:15.47 0 tbench_srv
 5649 root      20   0    6748    768    768 R 24.00 0.005   1:13.95 7 tbench_srv

6.3.0.g148341f-master-eevdf
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND
10561 root      20   0    2564    768    640 R 49.83 0.005   2:14.86 3 massive_intr
10562 root      20   0    2564    768    640 R 49.50 0.005   2:14.00 3 massive_intr
10564 root      20   0    2564    896    768 R 49.50 0.006   2:14.11 6 massive_intr
10559 root      20   0    2564    768    640 R 47.84 0.005   2:14.03 2 massive_intr
10560 root      20   0    2564    768    640 R 45.51 0.005   2:13.92 7 massive_intr
10557 root      20   0    2564    896    768 R 44.85 0.006   2:13.59 7 massive_intr
10563 root      20   0    2564    896    768 R 44.85 0.006   2:13.53 6 massive_intr
10558 root      20   0    2564    768    640 R 43.52 0.005   2:13.90 2 massive_intr
10577 root      20   0   23468   1920   1792 R 35.22 0.012   1:37.06 0 tbench
10574 root      20   0   23468   1920   1792 R 32.23 0.012   1:32.89 4 tbench
10580 root      20   0   23468   1920   1792 R 29.57 0.012   1:34.95 0 tbench
10575 root      20   0   23468   1792   1664 R 29.24 0.011   1:31.66 4 tbench
10576 root      20   0   23468   1792   1664 S 28.57 0.011   1:34.55 5 tbench
10573 root      20   0   23468   1792   1664 R 28.24 0.011   1:33.17 5 tbench
10578 root      20   0   23468   1920   1792 S 28.24 0.012   1:33.97 1 tbench
10579 root      20   0   23468   1920   1792 R 28.24 0.012   1:36.09 1 tbench
10587 root      20   0    6748    768    640 S 26.91 0.005   1:09.45 0 tbench_srv
10582 root      20   0    6748    768    640 R 24.25 0.005   1:08.19 4 tbench_srv
10588 root      20   0    6748    640    640 R 22.59 0.004   1:09.15 0 tbench_srv
10583 root      20   0    6748    640    640 R 21.93 0.004   1:07.93 4 tbench_srv
10586 root      20   0    6748    640    640 S 21.59 0.004   1:07.92 1 tbench_srv
10581 root      20   0    6748    640    640 S 21.26 0.004   1:07.08 5 tbench_srv
10585 root      20   0    6748    640    640 R 21.26 0.004   1:08.89 5 tbench_srv
10584 root      20   0    6748    768    640 S 20.93 0.005   1:08.61 1 tbench_srv



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 14/17] sched/eevdf: Better handle mixed slice length
  2023-04-04 13:50       ` Joel Fernandes
  2023-04-05  5:41         ` Mike Galbraith
@ 2023-04-05  8:35         ` Peter Zijlstra
  2023-04-05 20:05           ` Joel Fernandes
  1 sibling, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2023-04-05  8:35 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Vincent Guittot, mingo, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, efault

On Tue, Apr 04, 2023 at 01:50:50PM +0000, Joel Fernandes wrote:
> On Tue, Apr 04, 2023 at 11:29:36AM +0200, Peter Zijlstra wrote:

> > Heh, this is actually the correct behaviour. If you have a u=1 and a
> > u=.5 task, you should distribute time on a 2:1 basis, eg. 67% vs 33%.
> 
> Splitting like that sounds like starvation of the sleeper to me. If something
> sleeps a lot, it will get even less CPU time on an average than it would if
> there was no contention from the u=1 task.

No, sleeping, per definition, means you're not contending for CPU. What
CFS does, giving them a little boost, is strictly yuck and messes with
latency -- because suddenly you have a task that said it wasn't
competing appear as if it were, but you didn't run it (how could you, it
wasn't there to run) -- but it still needs to catch up.

The reason it does that, is mostly because at the time we didn't want to
do the whole lag thing -- it's somewhat heavy on the u64 mults and 32bit
computing was still a thing :/ So hacks happened.

That said; I'm starting to regret not pushing the EEVDF thing harder
back in 2010 when I first wrote it :/

> And also CGroups will be even more weird than it already is in such a world,
> 2 different containers will not get CPU time distributed properly- say if
> tasks in one container sleep a lot and tasks in another container are CPU
> bound.

Cgroups are an abomination anyway :-) /me runs like hell. But no, I
don't actually expect too much trouble there.

Or rather, as per the above, time distribution is now more proper than
it was :-)

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 06/17] sched/fair: Add lag based placement
  2023-04-03  9:18   ` Chen Yu
@ 2023-04-05  9:47     ` Peter Zijlstra
  2023-04-06  3:03       ` Chen Yu
  2023-04-13 15:42       ` Chen Yu
  0 siblings, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2023-04-05  9:47 UTC (permalink / raw)
  To: Chen Yu
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, youssefesmat, joel,
	efault

On Mon, Apr 03, 2023 at 05:18:06PM +0800, Chen Yu wrote:
> On 2023-03-28 at 11:26:28 +0200, Peter Zijlstra wrote:
> >  place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> [...]
> >  		/*
> > -		 * Halve their sleep time's effect, to allow
> > -		 * for a gentler effect of sleepers:
> > +		 * If we want to place a task and preserve lag, we have to
> > +		 * consider the effect of the new entity on the weighted
> > +		 * average and compensate for this, otherwise lag can quickly
> > +		 * evaporate:
> > +		 *
> > +		 * l_i = V - v_i <=> v_i = V - l_i
> > +		 *
> > +		 * V = v_avg = W*v_avg / W
> > +		 *
> > +		 * V' = (W*v_avg + w_i*v_i) / (W + w_i)
> If I understand correctly, V' means the avg_runtime if se_i is enqueued?
> Then,
> 
> V  = (\Sum w_j*v_j) / W

multiply by W on both sides to get:

  V*W = \Sum w_j*v_j

> V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
> 
> Not sure how W*v_avg equals to Sum w_j*v_j ?

 V := v_avg

(yeah, I should clean up this stuff, already said to Josh I would)

> > +		 *    = (W*v_avg + w_i(v_avg - l_i)) / (W + w_i)
> > +		 *    = v_avg + w_i*l_i/(W + w_i)
> v_avg - w_i*l_i/(W + w_i) ?

Yup -- seems typing is hard :-)

> > +		 *
> > +		 * l_i' = V' - v_i = v_avg + w_i*l_i/(W + w_i) - (v_avg - l)
> > +		 *      = l_i - w_i*l_i/(W + w_i)
> > +		 *
> > +		 * l_i = (W + w_i) * l_i' / W
> >  		 */
> [...]
> > -		if (sched_feat(GENTLE_FAIR_SLEEPERS))
> > -			thresh >>= 1;
> > +		load = cfs_rq->avg_load;
> > +		if (curr && curr->on_rq)
> > +			load += curr->load.weight;
> > +
> > +		lag *= load + se->load.weight;
> > +		if (WARN_ON_ONCE(!load))
> > +			load = 1;
> > +		lag = div_s64(lag, load);
> >  
> Should we calculate
> l_i' = l_i * w / (W + w_i) instead of calculating l_i above? I thought we want to adjust
> the lag(before enqueue) based on the new weight(after enqueued)

We want to ensure the lag after placement is the lag we got before
dequeue.

I've updated the comment to read like so:

		/*
		 * If we want to place a task and preserve lag, we have to
		 * consider the effect of the new entity on the weighted
		 * average and compensate for this, otherwise lag can quickly
		 * evaporate.
		 *
		 * Lag is defined as:
		 *
		 *   l_i = V - v_i <=> v_i = V - l_i
		 *
		 * And we take V to be the weighted average of all v:
		 *
		 *   V = (\Sum w_j*v_j) / W
		 *
		 * Where W is: \Sum w_j
		 *
		 * Then, the weighted average after adding an entity with lag
		 * l_i is given by:
		 *
		 *   V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
		 *      = (W*V + w_i*(V - l_i)) / (W + w_i)
		 *      = (W*V + w_i*V - w_i*l_i) / (W + w_i)
		 *      = (V*(W + w_i) - w_i*l) / (W + w_i)
		 *      = V - w_i*l_i / (W + w_i)
		 *
		 * And the actual lag after adding an entity with l_i is:
		 *
		 *   l'_i = V' - v_i
		 *        = V - w_i*l_i / (W + w_i) - (V - l_i)
		 *        = l_i - w_i*l_i / (W + w_i)
		 *
		 * Which is strictly less than l_i. So in order to preserve lag
		 * we should inflate the lag before placement such that the
		 * effective lag after placement comes out right.
		 *
		 * As such, invert the above relation for l'_i to get the l_i
		 * we need to use such that the lag after placement is the lag
		 * we computed before dequeue.
		 *
		 *   l'_i = l_i - w_i*l_i / (W + w_i)
		 *        = ((W + w_i)*l_i - w_i*l_i) / (W + w_i)
		 *
		 *   (W + w_i)*l'_i = (W + w_i)*l_i - w_i*l_i
		 *                  = W*l_i
		 *
		 *   l_i = (W + w_i)*l'_i / W
		 */
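
A quick arithmetic check of the inversion above, with made-up weights,
mirroring what the quoted kernel hunk computes (lag *= load + se->load.weight;
lag = div_s64(lag, load)):

#include <stdio.h>

/*
 * l_i = (W + w_i) * l'_i / W is what gets computed before placement; the
 * enqueue then dilutes it by w_i*l_i/(W + w_i), so the effective lag after
 * placement comes back out as the lag recorded at dequeue (l'_i).
 */
int main(void)
{
	double W = 3072.0;		/* \Sum w_j of entities already queued */
	double w_i = 1024.0;		/* weight of the entity being placed */
	double lag_before = 2.5;	/* l'_i: lag recorded at dequeue */

	double lag_inflated = (W + w_i) * lag_before / W;		/* l_i */
	double lag_after = lag_inflated - w_i * lag_inflated / (W + w_i);

	printf("inflated: %.4f  after placement: %.4f\n",
	       lag_inflated, lag_after);	/* 3.3333 and 2.5000 */
	return 0;
}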

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 04/17] sched/fair: Add avg_vruntime
  2023-03-29  7:50     ` Peter Zijlstra
@ 2023-04-05 19:13       ` Peter Zijlstra
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2023-04-05 19:13 UTC (permalink / raw)
  To: Josh Don
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel,
	efault

On Wed, Mar 29, 2023 at 09:50:51AM +0200, Peter Zijlstra wrote:
> On Tue, Mar 28, 2023 at 04:57:49PM -0700, Josh Don wrote:
> > On Tue, Mar 28, 2023 at 4:06 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > [...]
> > > +/*
> > > + * Compute virtual time from the per-task service numbers:
> > > + *
> > > + * Fair schedulers conserve lag: \Sum lag_i = 0
> > > + *
> > > + * lag_i = S - s_i = w_i * (V - v_i)
> > > + *
> > > + * \Sum lag_i = 0 -> \Sum w_i * (V - v_i) = V * \Sum w_i - \Sum w_i * v_i = 0
> > 
> > Small note: I think it would be helpful to label these symbols
> > somewhere :) Weight  and vruntime are fairly obvious, but I don't
> > think 'S' and 'V' are as clear. Are these non-virtual ideal service
> > time, and average vruntime, respectively?
> 
> Yep, they are. I'll see what I can do with the comments.


/*
 * Compute virtual time from the per-task service numbers:
 *
 * Fair schedulers conserve lag:
 *
 *   \Sum lag_i = 0
 *
 * Where lag_i is given by:
 *
 *   lag_i = S - s_i = w_i * (V - v_i)
 *
 * Where S is the ideal service time and V is its virtual time counterpart.
 * Therefore:
 *
 *   \Sum lag_i = 0
 *   \Sum w_i * (V - v_i) = 0
 *   \Sum w_i * V - w_i * v_i = 0
 *
 * From which we can solve an expression for V in v_i (which we have in
 * se->vruntime):
 *
 *       \Sum v_i * w_i   \Sum v_i * w_i
 *   V = -------------- = --------------
 *          \Sum w_i            W
 *
 * Specifically, this is the weighted average of all entity virtual runtimes.
 *
 * [[ NOTE: this is only equal to the ideal scheduler under the condition
 *          that join/leave operations happen at lag_i = 0, otherwise the
 *          virtual time has non-contiguous motion equivalent to:
 *
 *	      V +-= lag_i / W
 *
 *	    Also see the comment in place_entity() that deals with this. ]]
 *
 * However, since v_i is u64 and the multiplication could easily overflow,
 * transform it into a relative form that uses smaller quantities:
 *
 * Substitute: v_i == (v_i - v0) + v0
 *
 *     \Sum ((v_i - v0) + v0) * w_i   \Sum (v_i - v0) * w_i
 * V = ---------------------------- = --------------------- + v0
 *                  W                            W
 *
 * Which we track using:
 *
 *                    v0 := cfs_rq->min_vruntime
 * \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime
 *              \Sum w_i := cfs_rq->avg_load
 *
 * Since min_vruntime is a monotonically increasing variable that closely tracks
 * the per-task service, these deltas, (v_i - v0), will be on the order of the
 * maximal (virtual) lag induced in the system due to quantisation.
 *
 * Also, we use scale_load_down() to reduce the size.
 *
 * As measured, the max (key * weight) value was ~44 bits for a kernel build.
 */
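
A minimal user-space sketch of the relative-form bookkeeping described above
(plain doubles instead of the kernel's scaled 64-bit arithmetic, and ignoring
how the sum is kept consistent when min_vruntime advances):

#include <stdio.h>

/*
 * Track V = (\Sum w_i * v_i) / W without multiplying huge absolute vruntimes:
 * keep \Sum (v_i - v0) * w_i and \Sum w_i, with v0 = min_vruntime, and
 * recover V = v0 + avg_vruntime / avg_load.
 */
struct rq_avg {
	double min_vruntime;	/* v0 */
	double avg_vruntime;	/* \Sum (v_i - v0) * w_i */
	double avg_load;	/* \Sum w_i */
};

static void avg_add(struct rq_avg *rq, double v, double w)
{
	rq->avg_vruntime += (v - rq->min_vruntime) * w;
	rq->avg_load += w;
}

int main(void)
{
	struct rq_avg rq = { .min_vruntime = 1000.0 };

	avg_add(&rq, 1002.0, 1024);	/* v_i = 1002, nice-0-like weight */
	avg_add(&rq, 1005.0, 2048);	/* v_i = 1005, heavier entity */

	double V = rq.min_vruntime + rq.avg_vruntime / rq.avg_load;
	printf("V = %.3f\n", V);	/* 1004.000: weighted avg of 1002, 1005 */
	return 0;
}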


And the comment in place_entity() (slightly updated since this morning):


		/*
		 * If we want to place a task and preserve lag, we have to
		 * consider the effect of the new entity on the weighted
		 * average and compensate for this, otherwise lag can quickly
		 * evaporate.
		 *
		 * Lag is defined as:
		 *
		 *   lag_i = S - s_i = w_i * (V - v_i)
		 *
		 * To avoid the 'w_i' term all over the place, we only track
		 * the virtual lag:
		 *
		 *   vl_i = V - v_i <=> v_i = V - vl_i
		 *
		 * And we take V to be the weighted average of all v:
		 *
		 *   V = (\Sum w_j*v_j) / W
		 *
		 * Where W is: \Sum w_j
		 *
		 * Then, the weighted average after adding an entity with lag
		 * vl_i is given by:
		 *
		 *   V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
		 *      = (W*V + w_i*(V - vl_i)) / (W + w_i)
		 *      = (W*V + w_i*V - w_i*vl_i) / (W + w_i)
		 *      = (V*(W + w_i) - w_i*vl_i) / (W + w_i)
		 *      = V - w_i*vl_i / (W + w_i)
		 *
		 * And the actual lag after adding an entity with vl_i is:
		 *
		 *   vl'_i = V' - v_i
		 *         = V - w_i*vl_i / (W + w_i) - (V - vl_i)
		 *         = vl_i - w_i*vl_i / (W + w_i)
		 *
		 * Which is strictly less than vl_i. So in order to preserve lag
		 * we should inflate the lag before placement such that the
		 * effective lag after placement comes out right.
		 *
		 * As such, invert the above relation for vl'_i to get the vl_i
		 * we need to use such that the lag after placement is the lag
		 * we computed before dequeue.
		 *
		 *   vl'_i = vl_i - w_i*vl_i / (W + w_i)
		 *         = ((W + w_i)*vl_i - w_i*vl_i) / (W + w_i)
		 *
		 *   (W + w_i)*vl'_i = (W + w_i)*vl_i - w_i*vl_i
		 *                   = W*vl_i
		 *
		 *   vl_i = (W + w_i)*vl'_i / W
		 */



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 14/17] sched/eevdf: Better handle mixed slice length
  2023-04-05  8:35         ` Peter Zijlstra
@ 2023-04-05 20:05           ` Joel Fernandes
  2023-04-14 11:18             ` Phil Auld
  0 siblings, 1 reply; 55+ messages in thread
From: Joel Fernandes @ 2023-04-05 20:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, mingo, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, efault

On Wed, Apr 5, 2023 at 4:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Apr 04, 2023 at 01:50:50PM +0000, Joel Fernandes wrote:
> > On Tue, Apr 04, 2023 at 11:29:36AM +0200, Peter Zijlstra wrote:
>
> > > Heh, this is actually the correct behaviour. If you have a u=1 and a
> > > u=.5 task, you should distribute time on a 2:1 basis, eg. 67% vs 33%.
> >
> > Splitting like that sounds like starvation of the sleeper to me. If something
> > sleeps a lot, it will get even less CPU time on an average than it would if
> > there was no contention from the u=1 task.
>
> No, sleeping, per definition, means you're not contending for CPU. What
> CFS does, giving them a little boost, is strictly yuck and messes with
> latency -- because suddenly you have a task that said it wasn't
> competing appear as if it were, but you didn't run it (how could you, it
> wasn't there to run) -- but it still needs to catch up.
>
> The reason it does that, is mostly because at the time we didn't want to
> do the whole lag thing -- it's somewhat heavy on the u64 mults and 32bit
> computing was still a thing :/ So hacks happened.

Also, CFS has the whole boosting of tasks that sleep a lot, right?
Like a task handling user input sleeps a lot, but when it wakes up,
it gets higher dynamic priority as its vruntime did not advance. I
guess EEVDF also gets you the same thing, but still messes with the CPU
usage?

> That said; I'm starting to regret not pushing the EEVDF thing harder
> back in 2010 when I first wrote it :/
>
> > And also CGroups will be even more weird than it already is in such a world,
> > 2 different containers will not get CPU time distributed properly- say if
> > tasks in one container sleep a lot and tasks in another container are CPU
> > bound.
>
> Cgroups are an abomination anyway :-) /me runs like hell. But no, I
> don't actually expect too much trouble there.

So, with 2 equally weighted containers, if one has a task that sleeps
50% of the time, and another has a 100% task, then the sleeper will
only run 33% of the time? I can see people running containers having a
problem with that (a customer running one container gets less CPU than
the other). Sorry if I missed something.

But yeah I do find the whole EEVDF idea interesting but I admit I have
to research it more.

 - Joel

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 06/17] sched/fair: Add lag based placement
  2023-04-05  9:47     ` Peter Zijlstra
@ 2023-04-06  3:03       ` Chen Yu
  2023-04-13 15:42       ` Chen Yu
  1 sibling, 0 replies; 55+ messages in thread
From: Chen Yu @ 2023-04-06  3:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, youssefesmat, joel,
	efault

On 2023-04-05 at 11:47:20 +0200, Peter Zijlstra wrote:
> On Mon, Apr 03, 2023 at 05:18:06PM +0800, Chen Yu wrote:
> > On 2023-03-28 at 11:26:28 +0200, Peter Zijlstra wrote:
> > >  place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> > [...]
> > >  		/*
> > > -		 * Halve their sleep time's effect, to allow
> > > -		 * for a gentler effect of sleepers:
> > > +		 * If we want to place a task and preserve lag, we have to
> > > +		 * consider the effect of the new entity on the weighted
> > > +		 * average and compensate for this, otherwise lag can quickly
> > > +		 * evaporate:
> > > +		 *
> > > +		 * l_i = V - v_i <=> v_i = V - l_i
> > > +		 *
> > > +		 * V = v_avg = W*v_avg / W
> > > +		 *
> > > +		 * V' = (W*v_avg + w_i*v_i) / (W + w_i)
> > If I understand correctly, V' means the avg_runtime if se_i is enqueued?
> > Then,
> > 
> > V  = (\Sum w_j*v_j) / W
> 
> multiply by W on both sides to get:
> 
>   V*W = \Sum w_j*v_j
> 
> > V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
> > 
> > Not sure how W*v_avg equals to Sum w_j*v_j ?
> 
>  V := v_avg
>
I see, thanks for the explanation. 
> (yeah, I should clean up this stuff, already said to Josh I would)
> 
> > > +		 *    = (W*v_avg + w_i(v_avg - l_i)) / (W + w_i)
> > > +		 *    = v_avg + w_i*l_i/(W + w_i)
> > v_avg - w_i*l_i/(W + w_i) ?
> 
> Yup -- seems typing is hard :-)
> 
> > > +		 *
> > > +		 * l_i' = V' - v_i = v_avg + w_i*l_i/(W + w_i) - (v_avg - l)
> > > +		 *      = l_i - w_i*l_i/(W + w_i)
> > > +		 *
> > > +		 * l_i = (W + w_i) * l_i' / W
> > >  		 */
> > [...]
> > > -		if (sched_feat(GENTLE_FAIR_SLEEPERS))
> > > -			thresh >>= 1;
> > > +		load = cfs_rq->avg_load;
> > > +		if (curr && curr->on_rq)
> > > +			load += curr->load.weight;
> > > +
> > > +		lag *= load + se->load.weight;
> > > +		if (WARN_ON_ONCE(!load))
> > > +			load = 1;
> > > +		lag = div_s64(lag, load);
> > >  
> > Should we calculate
> > l_i' = l_i * w / (W + w_i) instead of calculating l_i above? I thought we want to adjust
> > the lag(before enqueue) based on the new weight(after enqueued)
> 
> We want to ensure the lag after placement is the lag we got before
> dequeue.
> 
> I've updated the comment to read like so:
> 
> 		/*
> 		 * If we want to place a task and preserve lag, we have to
> 		 * consider the effect of the new entity on the weighted
> 		 * average and compensate for this, otherwise lag can quickly
> 		 * evaporate.
> 		 *
> 		 * Lag is defined as:
> 		 *
> 		 *   l_i = V - v_i <=> v_i = V - l_i
> 		 *
> 		 * And we take V to be the weighted average of all v:
> 		 *
> 		 *   V = (\Sum w_j*v_j) / W
> 		 *
> 		 * Where W is: \Sum w_j
> 		 *
> 		 * Then, the weighted average after adding an entity with lag
> 		 * l_i is given by:
> 		 *
> 		 *   V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
> 		 *      = (W*V + w_i*(V - l_i)) / (W + w_i)
> 		 *      = (W*V + w_i*V - w_i*l_i) / (W + w_i)
> 		 *      = (V*(W + w_i) - w_i*l) / (W + w_i)
small typo  w_i*l -> w_i*l_i
> 		 *      = V - w_i*l_i / (W + w_i)
> 		 *
> 		 * And the actual lag after adding an entity with l_i is:
> 		 *
> 		 *   l'_i = V' - v_i
> 		 *        = V - w_i*l_i / (W + w_i) - (V - l_i)
> 		 *        = l_i - w_i*l_i / (W + w_i)
> 		 *
> 		 * Which is strictly less than l_i. So in order to preserve lag
> 		 * we should inflate the lag before placement such that the
> 		 * effective lag after placement comes out right.
> 		 *
> 		 * As such, invert the above relation for l'_i to get the l_i
> 		 * we need to use such that the lag after placement is the lag
> 		 * we computed before dequeue.
> 		 *
> 		 *   l'_i = l_i - w_i*l_i / (W + w_i)
> 		 *        = ((W + w_i)*l_i - w_i*l_i) / (W + w_i)
> 		 *
> 		 *   (W + w_i)*l'_i = (W + w_i)*l_i - w_i*l_i
> 		 *                  = W*l_i
> 		 *
> 		 *   l_i = (W + w_i)*l'_i / W
> 		 */
Got it, thanks! This is very clear.

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/17] sched: EEVDF using latency-nice
  2023-03-28  9:26 [PATCH 00/17] sched: EEVDF using latency-nice Peter Zijlstra
                   ` (17 preceding siblings ...)
  2023-04-03  7:42 ` [PATCH 00/17] sched: EEVDF using latency-nice Shrikanth Hegde
@ 2023-04-10  3:13 ` David Vernet
  2023-04-11  2:09   ` David Vernet
       [not found] ` <20230410082307.1327-1-hdanton@sina.com>
  2023-04-25 12:32 ` Phil Auld
  20 siblings, 1 reply; 55+ messages in thread
From: David Vernet @ 2023-04-10  3:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel, efault, clm, tj, djk

On Tue, Mar 28, 2023 at 11:26:22AM +0200, Peter Zijlstra wrote:
> Hi!
> 
> Latest version of the EEVDF [1] patches.
> 
> Many changes since last time; most notably it now fully replaces CFS and uses
> lag based placement for migrations. Smaller changes include:
> 
>  - uses scale_load_down() for avg_vruntime; I measured the max delta to be ~44
>    bits on a system/cgroup based kernel build.
>  - fixed a bunch of reweight / cgroup placement issues
>  - adaptive placement strategy for smaller slices
>  - rename se->lag to se->vlag
> 
> There's a bunch of RFC patches at the end and one DEBUG patch. Of those, the
> PLACE_BONUS patch is a mixed bag of pain. A number of benchmarks regress
> because EEVDF is actually fair and gives a 100% parent vs a 50% child a 67%/33%
> split (stress-futex, stress-nanosleep, starve, etc..) instead of a 50%/50%
> split that sleeper bonus achieves. Mostly I think these benchmarks are somewhat
> artificial/daft but who knows.
> 
> The PLACE_BONUS thing horribly messes up things like hackbench and latency-nice
> because it places things too far to the left in the tree. Basically it messes
> with the whole 'when', by placing a task back in history you're putting a
> burden on the now to accomodate catching up. More tinkering required.
> 
> But over-all the thing seems to be fairly usable and could do with more
> extensive testing.

Hi Peter,

I used the EEVDF scheduler to run workloads on one of Meta's largest
services (our main HHVM web server), and I wanted to share my
observations with you.

The TL;DR is that, unfortunately, it appears as though EEVDF regresses
these workloads rather substantially. Running with "vanilla" EEVDF (i.e.
on this patch set up to [0], with no changes to latency nice for any
task) compared to vanilla CFS results in the following outcomes to our
major KPIs for servicing web requests:

- .5 - 2.5% drop in throughput
- 1 - 5% increase in p95 latencies
- 1.75 - 6% increase in p99 latencies
- .5 - 4% drop in 50th percentile throughput

[0]: https://lore.kernel.org/lkml/20230328110354.562078801@infradead.org/

Decreasing latency nice for our critical web workers unfortunately did
not help either. For example, here are the numbers for a latency nice
value of -10:

- .8 - 2.5% drop in throughput
- 2 - 4% increase in p95 latencies
- 1 - 4.5% increase in p99 latencies
- 0 - 4.5% increase in 50th percentile throughput

Other latency nice values resulted in similar metrics. Some metrics may
get slightly better, and others slightly worse, but the end result was
always a relatively significant regression from vanilla CFS. Throughout
the rest of this write up, the remaining figures quoted will be from
vanilla EEVDF runs (modulo some numbers towards the end of this writeup
which describe the outcome of increasing the default base slice with
sysctl_sched_base_slice).

With that out the way, let me outline some of the reasons for these
regressions:

1. Improved runqueue delays, but costly faults and involuntary context
   switches

EEVDF substantially increased the number of context switches on the
system, by 15 - 35%. On its own, this doesn't necessarily imply a
problem. For example, we observed that EEVDF resulted in a 20 - 40%
reduction in the time that tasks were spent waiting on the runqueue
before being placed on a CPU.

There were, however, other metrics which were less encouraging. We
observed a 400 - 550% increase in involuntary context switches (which
are also presumably a reason for the improved runqueue delays), as well
as a 10 - 70% increase in major page faults per minute. Along these
lines, we also saw an erratic but often significant decrease in CPU
utilization.

It's hard to say exactly what kinds of issues such faults / involuntary
context switches could introduce, but it does seem clear that in
general, less time is being spent doing useful work, and more time is
spent thrashing on resources between tasks.

2. Front-end CPU pipeline bottlenecks

Like many (any?) other JIT engines / compilers, HHVM tends to be heavily
front-end bound in the CPU pipeline and to have very poor IPC
(Instructions Per Cycle). For HHVM, this is due to high branch resteers,
poor icache / iTLB locality, and poor uop caching / decoding (many uops
are being serviced through the MITE instead of the DSB). While using
BOLT [1] to improve the layout of the HHVM binary does help to minimize
these costs, they're still the main bottleneck for the application.

[1]: https://github.com/llvm/llvm-project/blob/main/bolt/docs/OptimizingClang.md

An implication of this is that any time a task runs on a CPU after one
of these web worker tasks, it is essentially guaranteed to have poor
front-end locality, and their IPCs will similarly suffer. In other
words, more context switches generally means fewer instructions being
run across the whole system. When profiling vanilla CFS vs. vanilla
EEVDF (that is, with no changes to latency nice for any task), we found
that it resulted in a 1 - 2% drop in IPC across the whole system.

Note that simply partitioning the system by cpuset won't necessarily
work either, as CPU utilization will drop even further, and we want to
keep the system as busy as possible. There's another internal patch set
we have (which we're planning to try and upstream soon) where waking
tasks are placed in a global shared "wakequeue", which is then always
pulled from in newidle_balance(). The discrepancy in performance between
CFS and EEVDF is even worse in this case, with throughput dropping by 2
- 4%, p95 tail latencies increasing by 3 - 5%, and p99 tail latencies
increasing by 6 - 11%.

3. Low latency + long slice are not mutually exclusive for us

An interesting quality of web workloads running JIT engines is that they
require both low latency and long slices on the CPU. The reason we need
the tasks to be low latency is they're on the critical path for
servicing web requests (for most of their runtime, at least), and the
reasons we need them to have long slices are enumerated above -- they
thrash the icache / DSB / iTLB, more aggressive context switching causes
us to thrash on paging from disk, and in general, these tasks are on the
critical path for servicing web requests and we want to encourage them
to run to completion.

This causes EEVDF to perform poorly for workloads with these
characteristics. If we decrease latency nice for our web workers then
they'll have lower latency, but only because their slices are smaller.
This in turn causes the increase in context switches, which causes the
thrashing described above.

Worth noting -- I did try and increase the default base slice length by
setting sysctl_sched_base_slice to 35ms, and these were the results:

With EEVDF slice 35ms and latency_nice 0
----------------------------------------
- .5 - 2.25% drop in throughput
- 2.5 - 4.5% increase in p95 latencies
- 2.5 - 5.25% increase in p99 latencies
- Context switch per minute increase: 9.5 - 12.4%
- Involuntary context switch increase: ~320 - 330%
- Major fault delta: -3.6% to 37.6%
- IPC decrease .5 - .9%

With EEVDF slice 35ms and latency_nice -8 for web workers
---------------------------------------------------------
- .5 - 2.5% drop in throughput
- 1.7 - 4.75% increase in p95 latencies
- 2.5 - 5% increase in p99 latencies
- Context switch per minute increase: 10.5 - 15%
- Involuntary context switch increase: ~327 - 350%
- Major fault delta: -1% to 45%
- IPC decrease .4 - 1.1%

I was expecting the increase in context switches and involuntary context
switches to be lower than what they ended up being with the increased
default slice length. Regardless, it still seems to tell a relatively
consistent story with the numbers from above. The improvement in IPC is
expected, though also smaller than I was anticipating (presumably
due to the still-high context switch rate). There were also fewer major
faults per minute compared to runs with a shorter default slice.

Note that even if increasing the slice length did cause fewer context
switches and major faults, I still expect that it would hurt throughput
and latency for HHVM given that when latency-nicer tasks are eventually
given the CPU, the web workers will have to wait around for longer than
we'd like for those tasks to burn through their longer slices.

In summary, I must admit that this patch set makes me a bit nervous.
Speaking for Meta at least, the patch set in its current form causes
regressions beyond what we're able to tolerate in production (generally
< .5% at the very most). More broadly, it will certainly
cause us to have to carefully consider how it affects our model for
server capacity.

Thanks,
David

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/17] sched: EEVDF using latency-nice
  2023-04-10  3:13 ` David Vernet
@ 2023-04-11  2:09   ` David Vernet
  0 siblings, 0 replies; 55+ messages in thread
From: David Vernet @ 2023-04-11  2:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel, efault, clm, tj, djk

On Sun, Apr 09, 2023 at 10:13:50PM -0500, David Vernet wrote:
> On Tue, Mar 28, 2023 at 11:26:22AM +0200, Peter Zijlstra wrote:
> > Hi!
> > 
> > Latest version of the EEVDF [1] patches.
> > 
> > Many changes since last time; most notably it now fully replaces CFS and uses
> > lag based placement for migrations. Smaller changes include:
> > 
> >  - uses scale_load_down() for avg_vruntime; I measured the max delta to be ~44
> >    bits on a system/cgroup based kernel build.
> >  - fixed a bunch of reweight / cgroup placement issues
> >  - adaptive placement strategy for smaller slices
> >  - rename se->lag to se->vlag
> > 
> > There's a bunch of RFC patches at the end and one DEBUG patch. Of those, the
> > PLACE_BONUS patch is a mixed bag of pain. A number of benchmarks regress
> > because EEVDF is actually fair and gives a 100% parent vs a 50% child a 67%/33%
> > split (stress-futex, stress-nanosleep, starve, etc..) instead of a 50%/50%
> > split that sleeper bonus achieves. Mostly I think these benchmarks are somewhat
> > artificial/daft but who knows.
> > 
> > The PLACE_BONUS thing horribly messes up things like hackbench and latency-nice
> > because it places things too far to the left in the tree. Basically it messes
> > with the whole 'when', by placing a task back in history you're putting a
> > burden on the now to accomodate catching up. More tinkering required.
> > 
> > But over-all the thing seems to be fairly usable and could do with more
> > extensive testing.
> 
> Hi Peter,
> 
> I used the EEVDF scheduler to run workloads on one of Meta's largest
> services (our main HHVM web server), and I wanted to share my
> observations with you.
> 
> The TL;DR is that, unfortunately, it appears as though EEVDF regresses
> these workloads rather substantially. Running with "vanilla" EEVDF (i.e.
> on this patch set up to [0], with no changes to latency nice for any
> task) compared to vanilla CFS results in the following outcomes to our
> major KPIs for servicing web requests:
> 
> - .5 - 2.5% drop in throughput
> - 1 - 5% increase in p95 latencies
> - 1.75 - 6% increase in p99 latencies
> - .5 - 4% drop in 50th percentile throughput
> 
> [0]: https://lore.kernel.org/lkml/20230328110354.562078801@infradead.org/
> 
> Decreasing latency nice for our critical web workers unfortunately did
> not help either. For example, here are the numbers for a latency nice
> value of -10:
> 
> - .8 - 2.5% drop in throughput
> - 2 - 4% increase in p95 latencies
> - 1 - 4.5% increase in p99 latencies
> - 0 - 4.5% increase in 50th percentile throughput
> 
> Other latency nice values resulted in similar metrics. Some metrics may
> get slightly better, and others slightly worse, but the end result was
> always a relatively significant regression from vanilla CFS. Throughout
> the rest of this write up, the remaining figures quoted will be from
> vanilla EEVDF runs (modulo some numbers towards the end of this writeup
> which describe the outcome of increasing the default base slice with
> sysctl_sched_base_slice).
> 
> With that out the way, let me outline some of the reasons for these
> regressions:
> 
> 1. Improved runqueue delays, but costly faults and involuntary context
>    switches
> 
> EEVDF substantially increased the number of context switches on the
> system, by 15 - 35%. On its own, this doesn't necessarily imply a
> problem. For example, we observed that EEVDF resulted in a 20 - 40%
> reduction in the time that tasks were spent waiting on the runqueue
> before being placed on a CPU.
> 
> There were, however, other metrics which were less encouraging. We
> observed a 400 - 550% increase in involuntary context switches (which
> are also presumably a reason for the improved runqueue delays), as well
> as a 10 - 70% increase in major page faults per minute. Along these
> lines, we also saw an erratic but often significant decrease in CPU
> utilization.
> 
> It's hard to say exactly what kinds of issues such faults / involuntary
> context switches could introduce, but it does seem clear that in
> general, less time is being spent doing useful work, and more time is
> spent thrashing on resources between tasks.
> 
> 2. Front-end CPU pipeline bottlenecks
> 
> Like many (any?) other JIT engines / compilers, HHVM tends to be heavily
> front-end bound in the CPU pipeline, and have very poor IPC
> (Instructions Per Cycle). For HHVM, this is due to high branch resteers,
> poor icache / iTLB locality, and poor uop caching / decoding (many uops
> are being serviced through the MITE instead of the DSB). While using
> BOLT [1] to improve the layout of the HHVM binary does help to minimize
> these costs, they're still the main bottleneck for the application.
> 
> [1]: https://github.com/llvm/llvm-project/blob/main/bolt/docs/OptimizingClang.md
> 
> An implication of this is that any time a task runs on a CPU after one
> of these web worker tasks, it is essentially guaranteed to have poor
> front-end locality, and their IPCs will similarly suffer. In other
> words, more context switches generally means fewer instructions being
> run across the whole system. When profiling vanilla CFS vs. vanilla
> EEVDF (that is, with no changes to latency nice for any task), we found
> that it resulted in a 1 - 2% drop in IPC across the whole system.
> 
> Note that simply partitioning the system by cpuset won't necessarily
> work either, as CPU utilization will drop even further, and we want to
> keep the system as busy as possible. There's another internal patch set
> we have (which we're planning to try and upstream soon) where waking
> tasks are placed in a global shared "wakequeue", which is then always
> pulled from in newidle_balance(). The discrepancy in performance between
> CFS and EEVDF is even worse in this case, with throughput dropping by 2
> - 4%, p95 tail latencies increasing by 3 - 5%, and p99 tail latencies
> increasing by 6 - 11%.
> 
> 3. Low latency + long slice are not mutually exclusive for us
> 
> An interesting quality of web workloads running JIT engines is that they
> require both low latency and long slices on the CPU. The reason we need
> the tasks to be low latency is they're on the critical path for
> servicing web requests (for most of their runtime, at least), and the
> reasons we need them to have long slices are enumerated above -- they
> thrash the icache / DSB / iTLB, more aggressive context switching causes
> us to thrash on paging from disk, and in general, these tasks are on the
> critical path for servicing web requests and we want to encourage them
> to run to completion.
> 
> This causes EEVDF to perform poorly for workloads with these
> characteristics. If we decrease latency nice for our web workers then
> they'll have lower latency, but only because their slices are smaller.
> This in turn causes the increase in context switches, which causes the
> thrashing described above.
> 
> Worth noting -- I did try and increase the default base slice length by
> setting sysctl_sched_base_slice to 35ms, and these were the results:
> 
> With EEVDF slice 35ms and latency_nice 0
> ----------------------------------------
> - .5 - 2.25% drop in throughput
> - 2.5 - 4.5% increase in p95 latencies
> - 2.5 - 5.25% increase in p99 latencies
> - Context switch per minute increase: 9.5 - 12.4%
> - Involuntary context switch increase: ~320 - 330%
> - Major fault delta: -3.6% to 37.6%
> - IPC decrease .5 - .9%
> 
> With EEVDF slice 35ms and latency_nice -8 for web workers
> ---------------------------------------------------------
> - .5 - 2.5% drop in throughput
> - 1.7 - 4.75% increase in p95 latencies
> - 2.5 - 5% increase in p99 latencies
> - Context switch per minute increase: 10.5 - 15%
> - Involuntary context switch increase: ~327 - 350%
> - Major fault delta: -1% to 45%
> - IPC decrease .4 - 1.1%
> 
> I was expecting the increase in context switches and involuntary context
> switches to be lower than what they ended up being with the increased
> default slice length. Regardless, it still seems to tell a relatively
> consistent story with the numbers from above. The improvement in IPC is
> expected, though smaller than I was anticipating (presumably due to the
> still-high context switch rate). There were also fewer major faults per
> minute compared to runs with a shorter default slice.

Ah, these numbers with a larger slice are inaccurate. I was being
careless and accidentally changed the wrong sysctl
(sysctl_sched_cfs_bandwidth_slice rather than sysctl_sched_base_slice)
yesterday when testing how a longer slice affects performance.
Increasing sysctl_sched_base_slice to 30ms actually does improve things
a bit, but we're still losing to CFS (note that higher is better for
throughput, and lower is better for p95 and p99 latency):

lat_nice | throughput     | p95 latency  | p99 latency   | total swtch    | invol swtch  | mjr faults | IPC
-----------------------------------------------------------------------------------------------------------------------
0        | -1.2% to 0%    | 1 to 2.5%    | 1.8 to 3.75%  | 0%             | 200 to 210%  | -3% to 14% | -0.4% to -0.25%
-----------------------------------------------------------------------------------------------------------------------
-8       | -1.8% to -1.3% | 1 to 2.3%    | .75% to 2.5%  | -1.9% to -1.3% | 185 to 193%  | -3% to 20% | -0.6% to -0.25%
-----------------------------------------------------------------------------------------------------------------------
-12      | -.9% to 0.25%  | -0.1 to 2.4% | -0.4% to 2.8% | -2% to -1.5%   | 180 to 185%  | -5% to 30% | -0.8% to -0.25%
-----------------------------------------------------------------------------------------------------------------------
-19      | -1.3% to 0%    | 0.3 to 3.5%  | -1% to 2.1%   | -2% to -1.4%   | 175 to 190%  | 4% to 27%  | -0.6% to -0.27%
-----------------------------------------------------------------------------------------------------------------------

I'm sure experimenting with various slice lengths, etc would yield
slightly different results, but the common theme seems to be that EEVDF
causes more involuntary context switches and more major faults, which
regress throughput and latency for our web workloads.
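
For reference, setting latency_nice on a task presumably goes through
sched_setattr() with the new attribute from this series; a minimal userspace
sketch is below. The struct layout, field name and flag value are my
assumptions based on the latency-nice patches, not something lifted verbatim
from the uapi headers, so double-check before using:

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SCHED_FLAG_LATENCY_NICE
#define SCHED_FLAG_LATENCY_NICE 0x80            /* assumed value */
#endif

struct sched_attr {                             /* local copy, glibc has no wrapper */
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;                 /* SCHED_DEADLINE only */
        uint64_t sched_deadline;
        uint64_t sched_period;
        uint32_t sched_util_min;                /* util clamps */
        uint32_t sched_util_max;
        int32_t  sched_latency_nice;            /* new in this series (assumed) */
};

int main(void)
{
        struct sched_attr attr = {
                .size                   = sizeof(attr),
                .sched_policy           = 0,    /* SCHED_OTHER */
                .sched_flags            = SCHED_FLAG_LATENCY_NICE,
                .sched_latency_nice     = -8,   /* one of the values tested above */
        };

        if (syscall(SYS_sched_setattr, 0, &attr, 0))
                perror("sched_setattr");
        return 0;
}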

> 
> Note that even if increasing the slice length did cause fewer context
> switches and major faults, I still expect that it would hurt throughput
> and latency for HHVM given that when latency-nicer tasks are eventually
> given the CPU, the web workers will have to wait around for longer than
> we'd like for those tasks to burn through their longer slices.
> 
> In summary, I must admit that this patch set makes me a bit nervous.
> Speaking for Meta at least, the patch set in its current form causes
> performance regressions beyond what we're able to tolerate in production
> (generally < .5% at the very most). More broadly, it will certainly
> cause us to have to carefully consider how it affects our model for
> server capacity.
> 
> Thanks,
> David

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/17] sched: EEVDF using latency-nice
       [not found] ` <20230410082307.1327-1-hdanton@sina.com>
@ 2023-04-11 10:15   ` Mike Galbraith
       [not found]   ` <20230411133333.1790-1-hdanton@sina.com>
  1 sibling, 0 replies; 55+ messages in thread
From: Mike Galbraith @ 2023-04-11 10:15 UTC (permalink / raw)
  To: Hillf Danton, David Vernet
  Cc: Peter Zijlstra, mingo, vincent.guittot, linux-kernel, linux-mm,
	mgorman, bristot, corbet, kprateek.nayak, youssefesmat, joel

On Mon, 2023-04-10 at 16:23 +0800, Hillf Danton wrote:
>
> In order to only narrow down the poor performance reported, make a tradeoff
> between runtime and latency simply by restoring sysctl_sched_min_granularity
> at tick preempt, given the known order on the runqueue.

Tick preemption isn't the primary contributor to the scheduling delta,
it's wakeup preemption. If you look at the perf summaries of 5 minute
recordings on my little 8 rq box below, you'll see that the delta is
more than twice what a 250Hz tick could inflict.  You could also just
turn off WAKEUP_PREEMPTION and watch the delta instantly peg negative.

Anyway...

Given we know preemption is markedly up, and as always a source of pain
(as well as gain), perhaps we can try to tamp it down a little without
inserting old constraints into the shiny new scheduler.

The dirt simple tweak below puts a dent in the sting by merely sticking
with whatever decision EEVDF last made until it itself invalidates that
decision. It still selects via the same math, just does so the tiniest
bit less frenetically.

---
 kernel/sched/fair.c     |    3 +++
 kernel/sched/features.h |    6 ++++++
 2 files changed, 9 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -950,6 +950,9 @@ static struct sched_entity *pick_eevdf(s
 	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
 		curr = NULL;

+	if (sched_feat(GENTLE_EEVDF) && curr)
+		return curr;
+
 	while (node) {
 		struct sched_entity *se = __node_2_se(node);

--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -14,6 +14,12 @@ SCHED_FEAT(MINIMAL_VA, false)
 SCHED_FEAT(VALIDATE_QUEUE, false)

 /*
+ * Don't be quite so damn twitchy, once you select a champion let the
+ * poor bastard carry the baton until no longer eligible to do so.
+ */
+SCHED_FEAT(GENTLE_EEVDF, true)
+
+/*
  * Prefer to schedule the task we woke last (assuming it failed
  * wakeup-preemption), since its likely going to consume data we
  * touched, increases cache locality.

perf.data.cfs
 ----------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Sum delay ms     |
 ----------------------------------------------------------------------------------------------------------
  massive_intr:(8)      |1665786.092 ms |   529819 | avg:   1.046 ms | max:  33.639 ms | sum:554226.960 ms |
  dav1d-worker:(8)      | 187982.593 ms |   448022 | avg:   0.881 ms | max:  35.806 ms | sum:394546.442 ms |
  X:2503                | 102533.714 ms |    89729 | avg:   0.071 ms | max:   9.448 ms | sum: 6372.383 ms |
  VizCompositorTh:5235  |  38717.241 ms |    76743 | avg:   0.632 ms | max:  24.308 ms | sum:48502.097 ms |
  llvmpipe-0:(2)        |  32520.412 ms |    42390 | avg:   1.041 ms | max:  19.804 ms | sum:44116.653 ms |
  llvmpipe-1:(2)        |  32374.548 ms |    35557 | avg:   1.247 ms | max:  17.439 ms | sum:44347.573 ms |
  llvmpipe-2:(2)        |  31579.168 ms |    34292 | avg:   1.312 ms | max:  16.775 ms | sum:45005.225 ms |
  llvmpipe-3:(2)        |  30478.664 ms |    33659 | avg:   1.375 ms | max:  16.863 ms | sum:46268.417 ms |
  llvmpipe-7:(2)        |  29778.002 ms |    30684 | avg:   1.543 ms | max:  17.384 ms | sum:47338.420 ms |
  llvmpipe-4:(2)        |  29741.774 ms |    32832 | avg:   1.433 ms | max:  18.571 ms | sum:47062.280 ms |
  llvmpipe-5:(2)        |  29462.794 ms |    32641 | avg:   1.455 ms | max:  19.802 ms | sum:47497.195 ms |
  llvmpipe-6:(2)        |  28367.114 ms |    32132 | avg:   1.514 ms | max:  16.562 ms | sum:48646.738 ms |
  ThreadPoolForeg:(16)  |  22238.667 ms |    66355 | avg:   0.353 ms | max:  46.477 ms | sum:23409.474 ms |
  VideoFrameCompo:5243  |  17071.755 ms |    75223 | avg:   0.288 ms | max:  33.358 ms | sum:21650.918 ms |
  chrome:(8)            |   6478.351 ms |    47110 | avg:   0.486 ms | max:  28.018 ms | sum:22910.980 ms |
 ----------------------------------------------------------------------------------------------------------
  TOTAL:                |2317066.420 ms |  2221889 |                 |       46.477 ms |   1629736.515 ms |
 ----------------------------------------------------------------------------------------------------------

perf.data.eevdf
 ----------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Sum delay ms     |
 ----------------------------------------------------------------------------------------------------------
  massive_intr:(8)      |1673379.930 ms |   743590 | avg:   0.745 ms | max:  28.003 ms | sum:554041.093 ms |
  dav1d-worker:(8)      | 197647.514 ms |  1139053 | avg:   0.434 ms | max:  22.357 ms | sum:494377.980 ms |
  X:2495                | 100741.946 ms |   114808 | avg:   0.191 ms | max:   8.583 ms | sum:21945.360 ms |
  VizCompositorTh:6571  |  37705.863 ms |    74900 | avg:   0.479 ms | max:  16.464 ms | sum:35843.010 ms |
  llvmpipe-6:(2)        |  30757.126 ms |    38941 | avg:   1.448 ms | max:  18.529 ms | sum:56371.507 ms |
  llvmpipe-3:(2)        |  30658.127 ms |    40296 | avg:   1.405 ms | max:  24.791 ms | sum:56601.212 ms |
  llvmpipe-4:(2)        |  30456.388 ms |    40011 | avg:   1.419 ms | max:  23.840 ms | sum:56793.272 ms |
  llvmpipe-2:(2)        |  30395.971 ms |    40828 | avg:   1.394 ms | max:  19.195 ms | sum:56897.961 ms |
  llvmpipe-5:(2)        |  30346.432 ms |    39393 | avg:   1.445 ms | max:  21.747 ms | sum:56917.495 ms |
  llvmpipe-1:(2)        |  30275.694 ms |    41349 | avg:   1.378 ms | max:  20.765 ms | sum:56989.923 ms |
  llvmpipe-7:(2)        |  29768.515 ms |    37626 | avg:   1.532 ms | max:  20.649 ms | sum:57639.337 ms |
  llvmpipe-0:(2)        |  28931.905 ms |    42568 | avg:   1.378 ms | max:  20.942 ms | sum:58667.379 ms |
  ThreadPoolForeg:(60)  |  22598.216 ms |   131514 | avg:   0.342 ms | max:  36.105 ms | sum:44927.149 ms |
  VideoFrameCompo:6587  |  16966.649 ms |    90751 | avg:   0.357 ms | max:  18.199 ms | sum:32379.045 ms |
  chrome:(25)           |   8862.695 ms |    75923 | avg:   0.308 ms | max:  30.821 ms | sum:23347.992 ms |
 ----------------------------------------------------------------------------------------------------------
  TOTAL:                |2331946.838 ms |  3471615 |                 |       36.105 ms |   1808071.407 ms |
 ----------------------------------------------------------------------------------------------------------

perf.data.eevdf+tweak
 ----------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Sum delay ms     |
 ----------------------------------------------------------------------------------------------------------
  massive_intr:(8)      |1687121.317 ms |   695518 | avg:   0.760 ms | max:  24.098 ms | sum:528302.626 ms |
  dav1d-worker:(8)      | 183514.008 ms |   922884 | avg:   0.489 ms | max:  32.093 ms | sum:451319.787 ms |
  X:2489                |  99164.486 ms |   101585 | avg:   0.239 ms | max:   8.896 ms | sum:24295.253 ms |
  VizCompositorTh:17881 |  37911.007 ms |    71122 | avg:   0.499 ms | max:  16.743 ms | sum:35460.994 ms |
  llvmpipe-1:(2)        |  29946.625 ms |    40320 | avg:   1.394 ms | max:  23.036 ms | sum:56222.367 ms |
  llvmpipe-2:(2)        |  29910.414 ms |    39677 | avg:   1.412 ms | max:  24.187 ms | sum:56011.791 ms |
  llvmpipe-6:(2)        |  29742.389 ms |    37822 | avg:   1.484 ms | max:  18.228 ms | sum:56109.947 ms |
  llvmpipe-3:(2)        |  29644.994 ms |    39155 | avg:   1.435 ms | max:  21.191 ms | sum:56202.636 ms |
  llvmpipe-5:(2)        |  29520.006 ms |    38037 | avg:   1.482 ms | max:  21.698 ms | sum:56373.679 ms |
  llvmpipe-4:(2)        |  29460.485 ms |    38562 | avg:   1.462 ms | max:  26.308 ms | sum:56389.022 ms |
  llvmpipe-7:(2)        |  29449.959 ms |    36308 | avg:   1.557 ms | max:  21.617 ms | sum:56547.129 ms |
  llvmpipe-0:(2)        |  29041.903 ms |    41207 | avg:   1.389 ms | max:  26.322 ms | sum:57239.666 ms |
  ThreadPoolForeg:(16)  |  22490.094 ms |   112591 | avg:   0.377 ms | max:  27.027 ms | sum:42414.618 ms |
  VideoFrameCompo:17888 |  17385.895 ms |    86651 | avg:   0.367 ms | max:  19.350 ms | sum:31767.043 ms |
  chrome:(8)            |   6826.127 ms |    61487 | avg:   0.306 ms | max:  20.000 ms | sum:18835.879 ms |
 ----------------------------------------------------------------------------------------------------------
  TOTAL:                |2326181.115 ms |  3081183 |                 |       32.093 ms |   1737425.434 ms |
 ----------------------------------------------------------------------------------------------------------



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/17] sched: EEVDF using latency-nice
       [not found]   ` <20230411133333.1790-1-hdanton@sina.com>
@ 2023-04-11 14:56     ` Mike Galbraith
       [not found]     ` <20230412025042.1413-1-hdanton@sina.com>
  1 sibling, 0 replies; 55+ messages in thread
From: Mike Galbraith @ 2023-04-11 14:56 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Peter Zijlstra, mingo, vincent.guittot, linux-kernel, linux-mm,
	mgorman, bristot, corbet, kprateek.nayak, youssefesmat, joel

On Tue, 2023-04-11 at 21:33 +0800, Hillf Danton wrote:
> On Tue, 11 Apr 2023 12:15:41 +0200 Mike Galbraith <efault@gmx.de>
> >
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -950,6 +950,9 @@ static struct sched_entity *pick_eevdf(s
> >         if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> >                 curr = NULL;
> >
> > +       if (sched_feat(GENTLE_EEVDF) && curr)
> > +               return curr;
> > +
>
> This is rather aggressive, given a latency-10 curr and a latency-0 candidate
> at a tick hit, for instance.

The numbers seem to indicate that the ~400k ctx switches eliminated
were meaningless to the load being measured.  I recorded everything for
5 minutes, and the recording-wide max actually went down.. but one-off
hits happen regularly in a noisy GUI regardless of scheduler, and are
difficult to assign meaning to.

Now I'm not saying there is no cost, if you change anything that's
converted to instructions, there is a price tag somewhere, whether you
notice it immediately or not.  Nor am I saying that patchlet is golden.  I
am saying that some of the ctx switch delta looks very much like useless
overhead that can and should be made to go away.  From my POV, patchlet
actually looks kinda viable, but to Peter and the regression reporter,
it and its associated data are presented as a datapoint.

>  And along your direction a mild change is to
> postpone the preempt wakeup to the next tick.
>
> +++ b/kernel/sched/fair.c
> @@ -7932,8 +7932,6 @@ static void check_preempt_wakeup(struct
>                 return;
>  
>         cfs_rq = cfs_rq_of(se);
> -       update_curr(cfs_rq);
> -
>         /*
>          * XXX pick_eevdf(cfs_rq) != se ?
>          */

Mmmm, stopping time is a bad idea methinks.

	-Mike

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/17] sched: EEVDF using latency-nice
       [not found]     ` <20230412025042.1413-1-hdanton@sina.com>
@ 2023-04-12  4:05       ` Mike Galbraith
  0 siblings, 0 replies; 55+ messages in thread
From: Mike Galbraith @ 2023-04-12  4:05 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Peter Zijlstra, mingo, vincent.guittot, linux-kernel, linux-mm,
	mgorman, bristot, corbet, kprateek.nayak, youssefesmat, joel

On Wed, 2023-04-12 at 10:50 +0800, Hillf Danton wrote:
> On Tue, 11 Apr 2023 16:56:24 +0200 Mike Galbraith <efault@gmx.de>
>
>
> The data from you and David (lat_nice: -12 throughput: -.9% to 0.25%) is
> supporting eevdf, given an optimization <5% could be safely ignored in general
> (while 10% good and 20% standing ovation).
>

There's nothing pro or con here, David's testing seems to agree with my
own testing that a bit of adjustment may be necessary and that's it.
Cold hard numbers to developer, completely optional mitigation tweak to
fellow tester.. and we're done.

	-Mike

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 06/17] sched/fair: Add lag based placement
  2023-04-05  9:47     ` Peter Zijlstra
  2023-04-06  3:03       ` Chen Yu
@ 2023-04-13 15:42       ` Chen Yu
  2023-04-13 15:55         ` Chen Yu
  1 sibling, 1 reply; 55+ messages in thread
From: Chen Yu @ 2023-04-13 15:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, youssefesmat, joel,
	efault

On 2023-04-05 at 11:47:20 +0200, Peter Zijlstra wrote:
> On Mon, Apr 03, 2023 at 05:18:06PM +0800, Chen Yu wrote:
> > On 2023-03-28 at 11:26:28 +0200, Peter Zijlstra wrote:
So I launched the test on another platform with more CPUs:

baseline: 6.3-rc6

compare:  sched/eevdf branch on top of commit 8c59a975d5ee ("sched/eevdf: Debug / validation crud")


--------------------------------------------------------------------------------------
schbench:mthreads = 2
                   baseline                    eevdf+NO_PLACE_BONUS
worker_threads
25%                80.00           +19.2%      95.40            schbench.latency_90%_us
                   (0.00%)                     (0.51%)          stddev
50%                183.70          +2.2%       187.80           schbench.latency_90%_us
                   (0.35%)                     (0.46%)          stddev
75%                4065            -21.4%      3193             schbench.latency_90%_us
                   (69.65%)                    (3.42%)          stddev
100%               13696           -92.4%      1040             schbench.latency_90%_us
                   (5.25%)                     (69.03%)         stddev
125%               16457           -78.6%      3514             schbench.latency_90%_us
                   (10.50%)                    (6.25%)          stddev
150%               31177           -77.5%      7008             schbench.latency_90%_us
                   (6.84%)                     (5.19%)          stddev
175%               40729           -75.1%      10160            schbench.latency_90%_us
                   (6.11%)                     (2.53%)          stddev
200%               52224           -74.4%      13385            schbench.latency_90%_us
                   (10.42%)                    (1.72%)          stddev


                  eevdf+NO_PLACE_BONUS       eevdf+PLACE_BONUS
worker_threads
25%               96.30             +0.2%      96.50            schbench.latency_90%_us
                  (0.66%)                      (0.52%)          stddev
50%               187.20            -3.0%      181.60           schbench.latency_90%_us
                  (0.21%)                      (0.71%)          stddev
75%                3034             -84.1%     482.50           schbench.latency_90%_us
                  (5.56%)                      (27.40%)         stddev
100%              648.20            +114.7%    1391             schbench.latency_90%_us
                  (64.70%)                     (10.05%)         stddev
125%              3506              -3.0%      3400             schbench.latency_90%_us
                  (2.79%)                      (9.89%)          stddev
150%              6793              +29.6%     8803             schbench.latency_90%_us
                  (1.39%)                      (7.30%)          stddev
175%               9961             +9.2%      10876            schbench.latency_90%_us
                  (1.51%)                      (6.54%)          stddev
200%              13660             +3.3%      14118            schbench.latency_90%_us
                  (1.38%)                      (6.02%)          stddev



Summary for schbench: in most cases eevdf+NO_PLACE_BONUS gives the best performance.
This is aligned with the previous test on another platform with a smaller number of
CPUs: eevdf benefits schbench overall.

---------------------------------------------------------------------------------------



hackbench: ipc=pipe mode=process default fd:20

                   baseline                     eevdf+NO_PLACE_BONUS
worker_threads
1                  103103            -0.3%     102794        hackbench.throughput_avg
25%                115562          +825.7%    1069725        hackbench.throughput_avg
50%                296514          +352.1%    1340414        hackbench.throughput_avg
75%                498059          +190.8%    1448156        hackbench.throughput_avg
100%               804560           +74.8%    1406413        hackbench.throughput_avg


                   eevdf+NO_PLACE_BONUS        eevdf+PLACE_BONUS
worker_threads
1                  102172            +1.5%     103661         hackbench.throughput_avg
25%                1076503           -52.8%     508612        hackbench.throughput_avg
50%                1394311           -68.2%     443251        hackbench.throughput_avg
75%                1476502           -70.2%     440391        hackbench.throughput_avg
100%               1512706           -76.2%     359741        hackbench.throughput_avg


Summary for hackbench pipe process test: in most cases eevdf+NO_PLACE_BONUS gives the best performance.

-------------------------------------------------------------------------------------
unixbench: test=pipe

                   baseline                     eevdf+NO_PLACE_BONUS
nr_task
1                  1405              -0.5%       1398        unixbench.score
25%                77942             +0.9%      78680        unixbench.score
50%                155384            +1.1%     157100        unixbench.score
75%                179756            +0.3%     180295        unixbench.score
100%               204030            -0.2%     203540        unixbench.score
125%               204972            -0.4%     204062        unixbench.score
150%               205891            -0.5%     204792        unixbench.score
175%               207051            -0.5%     206047        unixbench.score
200%               209387            -0.9%     207559        unixbench.score


                   eevdf+NO_PLACE_BONUS        eevdf+PLACE_BONUS
nr_task
1                  1405              -0.3%       1401        unixbench.score
25%                78640             +0.0%      78647        unixbench.score
50%                157153            -0.0%     157093        unixbench.score
75%                180152            +0.0%     180205        unixbench.score
100%               203479            -0.0%     203464        unixbench.score
125%               203866            +0.1%     204013        unixbench.score
150%               204872            -0.0%     204838        unixbench.score
175%               205799            +0.0%     205824        unixbench.score
200%               207152            +0.2%     207546        unixbench.score

Seems to have no impact on unixbench in pipe mode.
--------------------------------------------------------------------------------

netperf: TCP_RR, ipv4, loopback

                   baseline                    eevdf+NO_PLACE_BONUS
nr_threads
25%                56232            -1.7%      55265        netperf.Throughput_tps
50%                49876            -3.1%      48338        netperf.Throughput_tps
75%                24281            +1.9%      24741        netperf.Throughput_tps
100%               73598            +3.8%      76375        netperf.Throughput_tps
125%               59119            +1.4%      59968        netperf.Throughput_tps
150%               49124            +1.2%      49727        netperf.Throughput_tps
175%               41929            +0.2%      42004        netperf.Throughput_tps
200%               36543            +0.4%      36677        netperf.Throughput_tps

                   eevdf+NO_PLACE_BONUS        eevdf+PLACE_BONUS
nr_threads
25%                55296            +4.7%      57877        netperf.Throughput_tps
50%                48659            +1.9%      49585        netperf.Throughput_tps
75%                24741            +0.3%      24807        netperf.Throughput_tps
100%               76455            +6.7%      81548        netperf.Throughput_tps
125%               60082            +7.6%      64622        netperf.Throughput_tps
150%               49618            +7.7%      53429        netperf.Throughput_tps
175%               41974            +7.6%      45160        netperf.Throughput_tps
200%               36677            +6.5%      39067        netperf.Throughput_tps

Seems to have no impact on netperf.
-----------------------------------------------------------------------------------

stress-ng: futex

                   baseline                     eevdf+NO_PLACE_BONUS
nr_threads
25%                207926           -21.0%     164356       stress-ng.futex.ops_per_sec
50%                46611           -16.1%      39130        stress-ng.futex.ops_per_sec
75%                71381           -11.3%      63283        stress-ng.futex.ops_per_sec
100%               58766            -0.8%      58269        stress-ng.futex.ops_per_sec
125%               59859           +11.3%      66645        stress-ng.futex.ops_per_sec
150%               52869            +7.6%      56863        stress-ng.futex.ops_per_sec
175%               49607           +22.9%      60969        stress-ng.futex.ops_per_sec
200%               56011           +11.8%      62631        stress-ng.futex.ops_per_sec


When the system is not busy, there is a regression. When the system gets busier,
there is some improvement. Even with PLACE_BONUS enabled, there is still a regression.
Per the perf profile of the 50% case, the ratio of wakeups is nearly the same with vs
without the eevdf patches applied:
50.82            -0.7       50.15        perf-profile.children.cycles-pp.futex_wake
but there is more preemption with eevdf enabled:
135095           +15.4%     155943        stress-ng.time.involuntary_context_switches
which is close to the -16.1% performance loss.
That is to say, eevdf helps the futex wakee grab the CPU more easily (benefiting latency),
while possibly having some impact on throughput?

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 06/17] sched/fair: Add lag based placement
  2023-04-13 15:42       ` Chen Yu
@ 2023-04-13 15:55         ` Chen Yu
  0 siblings, 0 replies; 55+ messages in thread
From: Chen Yu @ 2023-04-13 15:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, youssefesmat, joel,
	efault

On 2023-04-13 at 23:42:34 +0800, Chen Yu wrote:
> On 2023-04-05 at 11:47:20 +0200, Peter Zijlstra wrote:
> > On Mon, Apr 03, 2023 at 05:18:06PM +0800, Chen Yu wrote:
> > > On 2023-03-28 at 11:26:28 +0200, Peter Zijlstra wrote:
> So I launched the test on another platform with more CPUs,
> 
> baseline: 6.3-rc6
> 
> compare:  sched/eevdf branch on top of commit 8c59a975d5ee ("sched/eevdf: Debug / validation crud")
> Chenyu
I realized that you pushed some changes to the eevdf branch yesterday, so the test was
actually run on top of this commit, which I pulled 1 week ago:
commit 4f58ee3ba245ff97a075b17b454256f9c4d769c4 ("sched/eevdf: Debug / validation crud")

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 14/17] sched/eevdf: Better handle mixed slice length
  2023-04-05 20:05           ` Joel Fernandes
@ 2023-04-14 11:18             ` Phil Auld
  2023-04-16  5:10               ` Joel Fernandes
  0 siblings, 1 reply; 55+ messages in thread
From: Phil Auld @ 2023-04-14 11:18 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Peter Zijlstra, Vincent Guittot, mingo, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, efault

On Wed, Apr 05, 2023 at 04:05:55PM -0400 Joel Fernandes wrote:
> On Wed, Apr 5, 2023 at 4:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, Apr 04, 2023 at 01:50:50PM +0000, Joel Fernandes wrote:
> > > On Tue, Apr 04, 2023 at 11:29:36AM +0200, Peter Zijlstra wrote:
> >
> > > > Heh, this is actually the correct behaviour. If you have a u=1 and a
> > > > u=.5 task, you should distribute time on a 2:1 basis, eg. 67% vs 33%.
> > >
> > > Splitting like that sounds like starvation of the sleeper to me. If something
> > > sleeps a lot, it will get even less CPU time on an average than it would if
> > > there was no contention from the u=1 task.
> >
> > No, sleeping, per definition, means you're not contending for CPU. What
> > CFS does, giving them a little boost, is strictly yuck and messes with
> > latency -- because suddenly you have a task that said it wasn't
> > competing appear as if it were, but you didn't run it (how could you, it
> > wasn't there to run) -- but it still needs to catch up.
> >
> > The reason it does that, is mostly because at the time we didn't want to
> > do the whole lag thing -- it's somewhat heavy on the u64 mults and 32bit
> > computing was still a thing :/ So hacks happened.
> 
> Also you have the whole "boost tasks" that sleep a lot with CFS right?
>  Like a task handling user input sleeps a lot, but when it wakes up,
> it gets higher dynamic priority as its vruntime did not advance. I
> guess EEVDF also gets you the same thing but still messes with the CPU
> usage?
> 
> > That said; I'm starting to regret not pushing the EEVDF thing harder
> > back in 2010 when I first wrote it :/
> >
> > > And also CGroups will be even more weird than it already is in such a world,
> > > 2 different containers will not get CPU time distributed properly- say if
> > > tasks in one container sleep a lot and tasks in another container are CPU
> > > bound.
> >
> > Cgroups are an abomination anyway :-) /me runs like hell. But no, I
> > don't actually expect too much trouble there.
> 
> So, with 2 equally weighted containers, if one has a task that sleeps
> 50% of the time, and another has a 100% task, then the sleeper will
> only run 33% of the time? I can see people running containers having a
> problem with that (a customer running one container gets less CPU than
> the other.). Sorry if I missed something.
>

But the 50% sleeper is _asking_ for less CPU.  Doing 50% for each would
mean that when the sleeper task was awake it always ran, always won, to
the exclusion of anyone else. (Assuming 1 CPU...)
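
To put numbers on where Peter's 67%/33% comes from, here is a toy model
(purely illustrative, nothing from the patches): one CPU, an always-runnable
hog, and a task that sleeps for as long as it last ran, with a plain 50/50
split whenever both are runnable:

/* Toy model, purely illustrative; compile with any C99 compiler. */
#include <stdio.h>

int main(void)
{
        double hog = 0.0, sleeper = 0.0;        /* accumulated CPU time */
        double sleep_until = 0.0;               /* wall clock when sleeper wakes */
        double ran = 0.0;                       /* CPU used since last sleep */
        const double dt = 0.001;

        for (double t = 0.0; t < 3000.0; t += dt) {
                if (t < sleep_until) {
                        hog += dt;              /* sleeper asleep: hog runs alone */
                        continue;
                }
                hog += dt / 2;                  /* both runnable: 50/50 split */
                sleeper += dt / 2;
                ran += dt / 2;
                if (ran >= 1.0) {               /* ran one unit -> sleep one unit */
                        sleep_until = t + 1.0;
                        ran = 0.0;
                }
        }
        printf("hog %5.1f%%  sleeper %5.1f%%\n",
               100.0 * hog / (hog + sleeper),
               100.0 * sleeper / (hog + sleeper));
        return 0;
}

In steady state the sleeper needs two units of wall time to bank one unit of
CPU, then sleeps for one, so it ends up with 1/3 of the CPU and the hog with
2/3 -- the 67%/33% split.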

Cheers,
Phil

> But yeah I do find the whole EEVDF idea interesting but I admit I have
> to research it more.
> 
>  - Joel
> 

-- 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 14/17] sched/eevdf: Better handle mixed slice length
  2023-04-14 11:18             ` Phil Auld
@ 2023-04-16  5:10               ` Joel Fernandes
  0 siblings, 0 replies; 55+ messages in thread
From: Joel Fernandes @ 2023-04-16  5:10 UTC (permalink / raw)
  To: Phil Auld
  Cc: Peter Zijlstra, Vincent Guittot, mingo, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, efault



> On Apr 14, 2023, at 1:18 PM, Phil Auld <pauld@redhat.com> wrote:
> 
> On Wed, Apr 05, 2023 at 04:05:55PM -0400 Joel Fernandes wrote:
>>> On Wed, Apr 5, 2023 at 4:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>> 
>>> On Tue, Apr 04, 2023 at 01:50:50PM +0000, Joel Fernandes wrote:
>>>>> On Tue, Apr 04, 2023 at 11:29:36AM +0200, Peter Zijlstra wrote:
>>> 
>>>>> Heh, this is actually the correct behaviour. If you have a u=1 and a
>>>>> u=.5 task, you should distribute time on a 2:1 basis, eg. 67% vs 33%.
>>>> 
>>>> Splitting like that sounds like starvation of the sleeper to me. If something
>>>> sleeps a lot, it will get even less CPU time on an average than it would if
>>>> there was no contention from the u=1 task.
>>> 
>>> No, sleeping, per definition, means you're not contending for CPU. What
>>> CFS does, giving them a little boost, is strictly yuck and messes with
>>> latency -- because suddenly you have a task that said it wasn't
>>> competing appear as if it were, but you didn't run it (how could you, it
>>> wasn't there to run) -- but it still needs to catch up.
>>> 
>>> The reason it does that, is mostly because at the time we didn't want to
>>> do the whole lag thing -- it's somewhat heavy on the u64 mults and 32bit
>>> computing was still a thing :/ So hacks happened.
>> 
>> Also you have the whole "boost tasks" that sleep a lot with CFS right?
>> Like a task handling user input sleeps a lot, but when it wakes up,
>> it gets higher dynamic priority as its vruntime did not advance. I
>> guess EEVDF also gets you the same thing but still messes with the CPU
>> usage?
>> 
>>> That said; I'm starting to regret not pushing the EEVDF thing harder
>>> back in 2010 when I first wrote it :/
>>> 
>>>> And also CGroups will be even more weird than it already is in such a world,
>>>> 2 different containers will not get CPU time distributed properly- say if
>>>> tasks in one container sleep a lot and tasks in another container are CPU
>>>> bound.
>>> 
>>> Cgroups are an abomination anyway :-) /me runs like hell. But no, I
>>> don't actually expect too much trouble there.
>> 
>> So, with 2 equally weighted containers, if one has a task that sleeps
>> 50% of the time, and another has a 100% task, then the sleeper will
>> only run 33% of the time? I can see people running containers having a
>> problem with that (a customer running one container gets less CPU than
>> the other.). Sorry if I missed something.
>> 
> 
> But the 50% sleeper is _asking_ for less CPU.  Doing 50% for each would
> mean that when the sleeper task was awake it always ran, always won, to
> the exclusion of anyone else. (Assuming 1 CPU...)
> 

It sounds like you are saying that if the task busy looped instead of sleeping, it would get more CPU during the time it is not busy looping but doing some real work. That sounds like encouraging abuse to get more perf.

But again, I have not looked too closely at EEVDF or Peter's patches. I was just going by Vincent's test and was cautioning not to break users who depend on CFS shares..

Cheers,

- Joel


> Cheers,
> Phil
> 
>> But yeah I do find the whole EEVDF idea interesting but I admit I have
>> to research it more.
>> 
>> - Joel
>> 
> 
> -- 
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/17] sched: EEVDF using latency-nice
  2023-03-28  9:26 [PATCH 00/17] sched: EEVDF using latency-nice Peter Zijlstra
                   ` (19 preceding siblings ...)
       [not found] ` <20230410082307.1327-1-hdanton@sina.com>
@ 2023-04-25 12:32 ` Phil Auld
  20 siblings, 0 replies; 55+ messages in thread
From: Phil Auld @ 2023-04-25 12:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel, efault, jhladky


Hi Peter,

On Tue, Mar 28, 2023 at 11:26:22AM +0200 Peter Zijlstra wrote:
> Hi!
> 
> Latest version of the EEVDF [1] patches.
> 
> Many changes since last time; most notably it now fully replaces CFS and uses
> lag based placement for migrations. Smaller changes include:
> 
>  - uses scale_load_down() for avg_vruntime; I measured the max delta to be ~44
>    bits on a system/cgroup based kernel build.
>  - fixed a bunch of reweight / cgroup placement issues
>  - adaptive placement strategy for smaller slices
>  - rename se->lag to se->vlag
> 
> There's a bunch of RFC patches at the end and one DEBUG patch. Of those, the
> PLACE_BONUS patch is a mixed bag of pain. A number of benchmarks regress
> because EEVDF is actually fair and gives a 100% parent vs a 50% child a 67%/33%
> split (stress-futex, stress-nanosleep, starve, etc..) instead of a 50%/50%
> split that sleeper bonus achieves. Mostly I think these benchmarks are somewhat
> artificial/daft but who knows.
> 
> The PLACE_BONUS thing horribly messes up things like hackbench and latency-nice
> because it places things too far to the left in the tree. Basically it messes
> with the whole 'when', by placing a task back in history you're putting a
> burden on the now to accomodate catching up. More tinkering required.
> 
> But over-all the thing seems to be fairly usable and could do with more
> extensive testing.

I had Jirka run his suite of perf workloads on this. These are macro benchmarks
on baremetal (NAS, SPECjbb etc). I can't share specific results because they
come out in nice html reports on an internal website. There was no noticeable
performance change, which is a good thing. Overall performance was comparable
to CFS.

There was a win in stability though. A number of the error boxes across the
board were smaller. So less variance.

These are mostly performance/throughput tests. We're going to run some more
latency sensitive tests now.

So, fwiw, EEVDF is performing well on macro workloads here.



Cheers,
Phil

> 
> [1] https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=805acf7726282721504c8f00575d91ebfd750564
> 
> Results:
> 
>   hackbech -g $nr_cpu + cyclictest --policy other results:
> 
> 			EEVDF			 CFS
> 
> 		# Min Latencies: 00054
>   LNICE(19)	# Avg Latencies: 00660
> 		# Max Latencies: 23103
> 
> 		# Min Latencies: 00052		00053
>   LNICE(0)	# Avg Latencies: 00318		00687
> 		# Max Latencies: 08593		13913
> 
> 		# Min Latencies: 00054
>   LNICE(-19)	# Avg Latencies: 00055
> 		# Max Latencies: 00061
> 
> 
> Some preliminary results from Chen Yu on a slightly older version:
> 
>   schbench  (95% tail latency, lower is better)
>   =================================================================================
>   case                    nr_instance            baseline (std%)    compare% ( std%)
>   normal                   25%                     1.00  (2.49%)    -81.2%   (4.27%)
>   normal                   50%                     1.00  (2.47%)    -84.5%   (0.47%)
>   normal                   75%                     1.00  (2.5%)     -81.3%   (1.27%)
>   normal                  100%                     1.00  (3.14%)    -79.2%   (0.72%)
>   normal                  125%                     1.00  (3.07%)    -77.5%   (0.85%)
>   normal                  150%                     1.00  (3.35%)    -76.4%   (0.10%)
>   normal                  175%                     1.00  (3.06%)    -76.2%   (0.56%)
>   normal                  200%                     1.00  (3.11%)    -76.3%   (0.39%)
>   ==================================================================================
> 
>   hackbench (throughput, higher is better)
>   ==============================================================================
>   case                    nr_instance            baseline(std%)  compare%( std%)
>   threads-pipe              25%                      1.00 (<2%)    -17.5 (<2%)
>   threads-socket            25%                      1.00 (<2%)    -1.9 (<2%)
>   threads-pipe              50%                      1.00 (<2%)     +6.7 (<2%)
>   threads-socket            50%                      1.00 (<2%)    -6.3  (<2%)
>   threads-pipe              100%                     1.00 (3%)     +110.1 (3%)
>   threads-socket            100%                     1.00 (<2%)    -40.2 (<2%)
>   threads-pipe              150%                     1.00 (<2%)    +125.4 (<2%)
>   threads-socket            150%                     1.00 (<2%)    -24.7 (<2%)
>   threads-pipe              200%                     1.00 (<2%)    -89.5 (<2%)
>   threads-socket            200%                     1.00 (<2%)    -27.4 (<2%)
>   process-pipe              25%                      1.00 (<2%)    -15.0 (<2%)
>   process-socket            25%                      1.00 (<2%)    -3.9 (<2%)
>   process-pipe              50%                      1.00 (<2%)    -0.4  (<2%)
>   process-socket            50%                      1.00 (<2%)    -5.3  (<2%)
>   process-pipe              100%                     1.00 (<2%)    +62.0 (<2%)
>   process-socket            100%                     1.00 (<2%)    -39.5  (<2%)
>   process-pipe              150%                     1.00 (<2%)    +70.0 (<2%)
>   process-socket            150%                     1.00 (<2%)    -20.3 (<2%)
>   process-pipe              200%                     1.00 (<2%)    +79.2 (<2%)
>   process-socket            200%                     1.00 (<2%)    -22.4  (<2%)
>   ==============================================================================
> 
>   stress-ng (throughput, higher is better)
>   ==============================================================================
>   case                    nr_instance            baseline(std%)  compare%( std%)
>   switch                  25%                      1.00 (<2%)    -6.5 (<2%)
>   switch                  50%                      1.00 (<2%)    -9.2 (<2%)
>   switch                  75%                      1.00 (<2%)    -1.2 (<2%)
>   switch                  100%                     1.00 (<2%)    +11.1 (<2%)
>   switch                  125%                     1.00 (<2%)    -16.7% (9%)
>   switch                  150%                     1.00 (<2%)    -13.6 (<2%)
>   switch                  175%                     1.00 (<2%)    -16.2 (<2%)
>   switch                  200%                     1.00 (<2%)    -19.4% (<2%)
>   fork                    50%                      1.00 (<2%)    -0.1 (<2%)
>   fork                    75%                      1.00 (<2%)    -0.3 (<2%)
>   fork                    100%                     1.00 (<2%)    -0.1 (<2%)
>   fork                    125%                     1.00 (<2%)    -6.9 (<2%)
>   fork                    150%                     1.00 (<2%)    -8.8 (<2%)
>   fork                    200%                     1.00 (<2%)    -3.3 (<2%)
>   futex                   25%                      1.00 (<2%)    -3.2 (<2%)
>   futex                   50%                      1.00 (3%)     -19.9 (5%)
>   futex                   75%                      1.00 (6%)     -19.1 (2%)
>   futex                   100%                     1.00 (16%)    -30.5 (10%)
>   futex                   125%                     1.00 (25%)    -39.3 (11%)
>   futex                   150%                     1.00 (20%)    -27.2% (17%)
>   futex                   175%                     1.00 (<2%)    -18.6 (<2%)
>   futex                   200%                     1.00 (<2%)    -47.5 (<2%)
>   nanosleep               25%                      1.00 (<2%)    -0.1 (<2%)
>   nanosleep               50%                      1.00 (<2%)    -0.0% (<2%)
>   nanosleep               75%                      1.00 (<2%)    +15.2% (<2%)
>   nanosleep               100%                     1.00 (<2%)    -26.4 (<2%)
>   nanosleep               125%                     1.00 (<2%)    -1.3 (<2%)
>   nanosleep               150%                     1.00 (<2%)    +2.1  (<2%)
>   nanosleep               175%                     1.00 (<2%)    +8.3 (<2%)
>   nanosleep               200%                     1.00 (<2%)    +2.0% (<2%)
>   ===============================================================================
> 
>   unixbench (throughput, higher is better)
>   ==============================================================================
>   case                    nr_instance            baseline(std%)  compare%( std%)
>   spawn                   125%                      1.00 (<2%)    +8.1 (<2%)
>   context1                100%                      1.00 (6%)     +17.4 (6%)
>   context1                75%                       1.00 (13%)    +18.8 (8%)
>   =================================================================================
> 
>   netperf  (throughput, higher is better)
>   ===========================================================================
>   case                    nr_instance          baseline(std%)  compare%( std%)
>   UDP_RR                  25%                   1.00    (<2%)    -1.5%  (<2%)
>   UDP_RR                  50%                   1.00    (<2%)    -0.3%  (<2%)
>   UDP_RR                  75%                   1.00    (<2%)    +12.5% (<2%)
>   UDP_RR                 100%                   1.00    (<2%)    -4.3%  (<2%)
>   UDP_RR                 125%                   1.00    (<2%)    -4.9%  (<2%)
>   UDP_RR                 150%                   1.00    (<2%)    -4.7%  (<2%)
>   UDP_RR                 175%                   1.00    (<2%)    -6.1%  (<2%)
>   UDP_RR                 200%                   1.00    (<2%)    -6.6%  (<2%)
>   TCP_RR                  25%                   1.00    (<2%)    -1.4%  (<2%)
>   TCP_RR                  50%                   1.00    (<2%)    -0.2%  (<2%)
>   TCP_RR                  75%                   1.00    (<2%)    -3.9%  (<2%)
>   TCP_RR                 100%                   1.00    (2%)     +3.6%  (5%)
>   TCP_RR                 125%                   1.00    (<2%)    -4.2%  (<2%)
>   TCP_RR                 150%                   1.00    (<2%)    -6.0%  (<2%)
>   TCP_RR                 175%                   1.00    (<2%)    -7.4%  (<2%)
>   TCP_RR                 200%                   1.00    (<2%)    -8.4%  (<2%)
>   ==========================================================================
> 
> 
> ---
> Also available at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/eevdf
> 
> ---
> Parth Shah (1):
>       sched: Introduce latency-nice as a per-task attribute
> 
> Peter Zijlstra (14):
>       sched/fair: Add avg_vruntime
>       sched/fair: Remove START_DEBIT
>       sched/fair: Add lag based placement
>       rbtree: Add rb_add_augmented_cached() helper
>       sched/fair: Implement an EEVDF like policy
>       sched: Commit to lag based placement
>       sched/smp: Use lag to simplify cross-runqueue placement
>       sched: Commit to EEVDF
>       sched/debug: Rename min_granularity to base_slice
>       sched: Merge latency_offset into slice
>       sched/eevdf: Better handle mixed slice length
>       sched/eevdf: Sleeper bonus
>       sched/eevdf: Minimal vavg option
>       sched/eevdf: Debug / validation crud
> 
> Vincent Guittot (2):
>       sched/fair: Add latency_offset
>       sched/fair: Add sched group latency support
> 
>  Documentation/admin-guide/cgroup-v2.rst |   10 +
>  include/linux/rbtree_augmented.h        |   26 +
>  include/linux/sched.h                   |    6 +
>  include/uapi/linux/sched.h              |    4 +-
>  include/uapi/linux/sched/types.h        |   19 +
>  init/init_task.c                        |    3 +-
>  kernel/sched/core.c                     |   65 +-
>  kernel/sched/debug.c                    |   49 +-
>  kernel/sched/fair.c                     | 1199 ++++++++++++++++---------------
>  kernel/sched/features.h                 |   29 +-
>  kernel/sched/sched.h                    |   23 +-
>  tools/include/uapi/linux/sched.h        |    4 +-
>  12 files changed, 794 insertions(+), 643 deletions(-)
> 

-- 


^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2023-04-25 12:33 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-28  9:26 [PATCH 00/17] sched: EEVDF using latency-nice Peter Zijlstra
2023-03-28  9:26 ` [PATCH 01/17] sched: Introduce latency-nice as a per-task attribute Peter Zijlstra
2023-03-28  9:26 ` [PATCH 02/17] sched/fair: Add latency_offset Peter Zijlstra
2023-03-28  9:26 ` [PATCH 03/17] sched/fair: Add sched group latency support Peter Zijlstra
2023-03-28  9:26 ` [PATCH 04/17] sched/fair: Add avg_vruntime Peter Zijlstra
2023-03-28 23:57   ` Josh Don
2023-03-29  7:50     ` Peter Zijlstra
2023-04-05 19:13       ` Peter Zijlstra
2023-03-28  9:26 ` [PATCH 05/17] sched/fair: Remove START_DEBIT Peter Zijlstra
2023-03-28  9:26 ` [PATCH 06/17] sched/fair: Add lag based placement Peter Zijlstra
2023-04-03  9:18   ` Chen Yu
2023-04-05  9:47     ` Peter Zijlstra
2023-04-06  3:03       ` Chen Yu
2023-04-13 15:42       ` Chen Yu
2023-04-13 15:55         ` Chen Yu
2023-03-28  9:26 ` [PATCH 07/17] rbtree: Add rb_add_augmented_cached() helper Peter Zijlstra
2023-03-28  9:26 ` [PATCH 08/17] sched/fair: Implement an EEVDF like policy Peter Zijlstra
2023-03-29  1:26   ` Josh Don
2023-03-29  8:02     ` Peter Zijlstra
2023-03-29  8:06     ` Peter Zijlstra
2023-03-29  8:22       ` Peter Zijlstra
2023-03-29 18:48         ` Josh Don
2023-03-29  8:12     ` Peter Zijlstra
2023-03-29 18:54       ` Josh Don
2023-03-29  8:18     ` Peter Zijlstra
2023-03-29 14:35   ` Vincent Guittot
2023-03-30  8:01     ` Peter Zijlstra
2023-03-30 17:05       ` Vincent Guittot
2023-04-04 12:00         ` Peter Zijlstra
2023-03-28  9:26 ` [PATCH 09/17] sched: Commit to lag based placement Peter Zijlstra
2023-03-28  9:26 ` [PATCH 10/17] sched/smp: Use lag to simplify cross-runqueue placement Peter Zijlstra
2023-03-28  9:26 ` [PATCH 11/17] sched: Commit to EEVDF Peter Zijlstra
2023-03-28  9:26 ` [PATCH 12/17] sched/debug: Rename min_granularity to base_slice Peter Zijlstra
2023-03-28  9:26 ` [PATCH 13/17] sched: Merge latency_offset into slice Peter Zijlstra
2023-03-28  9:26 ` [PATCH 14/17] sched/eevdf: Better handle mixed slice length Peter Zijlstra
2023-03-31 15:26   ` Vincent Guittot
2023-04-04  9:29     ` Peter Zijlstra
2023-04-04 13:50       ` Joel Fernandes
2023-04-05  5:41         ` Mike Galbraith
2023-04-05  8:35         ` Peter Zijlstra
2023-04-05 20:05           ` Joel Fernandes
2023-04-14 11:18             ` Phil Auld
2023-04-16  5:10               ` Joel Fernandes
     [not found]   ` <20230401232355.336-1-hdanton@sina.com>
2023-04-02  2:40     ` Mike Galbraith
2023-03-28  9:26 ` [PATCH 15/17] [RFC] sched/eevdf: Sleeper bonus Peter Zijlstra
2023-03-29  9:10   ` Mike Galbraith
2023-03-28  9:26 ` [PATCH 16/17] [RFC] sched/eevdf: Minimal vavg option Peter Zijlstra
2023-03-28  9:26 ` [PATCH 17/17] [DEBUG] sched/eevdf: Debug / validation crud Peter Zijlstra
2023-04-03  7:42 ` [PATCH 00/17] sched: EEVDF using latency-nice Shrikanth Hegde
2023-04-10  3:13 ` David Vernet
2023-04-11  2:09   ` David Vernet
     [not found] ` <20230410082307.1327-1-hdanton@sina.com>
2023-04-11 10:15   ` Mike Galbraith
     [not found]   ` <20230411133333.1790-1-hdanton@sina.com>
2023-04-11 14:56     ` Mike Galbraith
     [not found]     ` <20230412025042.1413-1-hdanton@sina.com>
2023-04-12  4:05       ` Mike Galbraith
2023-04-25 12:32 ` Phil Auld
