* [PATCH 00/10] sched: EEVDF using latency-nice
@ 2023-03-06 13:25 Peter Zijlstra
  2023-03-06 13:25 ` [PATCH 01/10] sched: Introduce latency-nice as a per-task attribute Peter Zijlstra
                   ` (13 more replies)
  0 siblings, 14 replies; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-06 13:25 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel

Hi!

Ever since looking at the latency-nice patches, I've wondered if EEVDF would
not make more sense, and I did point Vincent at some older patches I had for
that (which is where his augmented rbtree thing comes from).

Also, since I really dislike the dual tree, I figured we could dynamically
switch between an augmented tree and not (and while I have code for that,
it's not included in this posting because with the current results I don't
think we actually need it).

Anyway, since I'm somewhat under the weather, I spent last week desperately
trying to connect a small cluster of neurons in defiance of the snot overlord
and bring back the EEVDF patches from the dark crypts where they'd been
gathering cobwebs for the past 13-odd years.

By Friday they worked well enough, and this morning (because obviously I forgot
the weekend is ideal to run benchmarks) I ran a bunch of hackbench, netperf,
tbench and sysbench -- there are a bunch of wins and losses, but nothing that
indicates a total fail.

( in fact, some of the schbench results seem to indicate EEVDF schedules a lot
  more consistently than CFS and has a bunch of latency wins )

( hackbench also doesn't show the augmented tree and generally more expensive
  pick to be a loss, in fact it shows a slight win here )


  hackbench load + cyclictest --policy other results:


			EEVDF			 CFS

		# Min Latencies: 00053
  LNICE(19)	# Avg Latencies: 04350
		# Max Latencies: 76019

		# Min Latencies: 00052		00053
  LNICE(0)	# Avg Latencies: 00690		00687
		# Max Latencies: 14145		13913

		# Min Latencies: 00019
  LNICE(-19)	# Avg Latencies: 00261
		# Max Latencies: 05642


The LNICE(-19) numbers aren't as pretty as Vincent's, but by the end I was
going cross-eyed from staring at tree prints and I just couldn't figure out
where it was going sideways.

There's definitely more benchmarking/tweaking to be done (0-day already
reported a stress-ng loss), but if we can pull this off we can delete a whole
bunch of icky heuristics code. EEVDF is a much better-defined policy than what
we currently have.




* [PATCH 01/10] sched: Introduce latency-nice as a per-task attribute
  2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
@ 2023-03-06 13:25 ` Peter Zijlstra
  2023-03-06 13:25 ` [PATCH 02/10] sched/core: Propagate parent tasks latency requirements to the child task Peter Zijlstra
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-06 13:25 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, Parth Shah

From: Parth Shah <parth@linux.ibm.com>

Latency-nice indicates the latency requirements of a task with respect
to the other tasks in the system. The value of the attribute can be
anywhere in the range [-20, 19], both inclusive, in line with the task
nice values.

latency_nice = -20 indicates that the task requires the least latency,
as compared to tasks having latency_nice = +19.

The latency_nice attribute only affects the CFS scheduling class,
conveying latency requirements from userspace.

Additionally, add debugging bits for the newly added latency_nice
attribute.

[rebase, move defines in sched/prio.h]
Signed-off-by: Parth Shah <parth@linux.ibm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20230224093454.956298-3-vincent.guittot@linaro.org
---
 include/linux/sched.h      |  1 +
 include/linux/sched/prio.h | 18 ++++++++++++++++++
 kernel/sched/debug.c       |  1 +
 3 files changed, 20 insertions(+)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -784,6 +784,7 @@ struct task_struct {
 	int				static_prio;
 	int				normal_prio;
 	unsigned int			rt_priority;
+	int				latency_nice;
 
 	struct sched_entity		se;
 	struct sched_rt_entity		rt;
Index: linux-2.6/include/linux/sched/prio.h
===================================================================
--- linux-2.6.orig/include/linux/sched/prio.h
+++ linux-2.6/include/linux/sched/prio.h
@@ -42,4 +42,22 @@ static inline long rlimit_to_nice(long p
 	return (MAX_NICE - prio + 1);
 }
 
+/*
+ * Latency nice is meant to provide scheduler hints about the relative
+ * latency requirements of a task with respect to other tasks.
+ * Thus a task with latency_nice == 19 can be hinted as the task with no
+ * latency requirements, in contrast to the task with latency_nice == -20
+ * which should be given priority in terms of lower latency.
+ */
+#define MAX_LATENCY_NICE	19
+#define MIN_LATENCY_NICE	-20
+
+#define LATENCY_NICE_WIDTH	\
+	(MAX_LATENCY_NICE - MIN_LATENCY_NICE + 1)
+
+/*
+ * Default tasks should be treated as a task with latency_nice = 0.
+ */
+#define DEFAULT_LATENCY_NICE	0
+
 #endif /* _LINUX_SCHED_PRIO_H */
Index: linux-2.6/kernel/sched/debug.c
===================================================================
--- linux-2.6.orig/kernel/sched/debug.c
+++ linux-2.6/kernel/sched/debug.c
@@ -1043,6 +1043,7 @@ void proc_sched_show_task(struct task_st
 #endif
 	P(policy);
 	P(prio);
+	P(latency_nice);
 	if (task_has_dl_policy(p)) {
 		P(dl.runtime);
 		P(dl.deadline);




* [PATCH 02/10] sched/core: Propagate parent tasks latency requirements to the child task
  2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
  2023-03-06 13:25 ` [PATCH 01/10] sched: Introduce latency-nice as a per-task attribute Peter Zijlstra
@ 2023-03-06 13:25 ` Peter Zijlstra
  2023-03-06 13:25 ` [PATCH 03/10] sched: Allow sched_{get,set}attr to change latency_nice of the task Peter Zijlstra
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-06 13:25 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, Parth Shah

From: Parth Shah <parth@linux.ibm.com>

Clone the parent task's latency_nice attribute to the forked child task.

Reset the latency_nice value to the default when sched_reset_on_fork is
set for the child task.

Also, initialize init_task.latency_nice with the DEFAULT_LATENCY_NICE
value.

[rebase]
Signed-off-by: Parth Shah <parth@linux.ibm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20230224093454.956298-4-vincent.guittot@linaro.org
---
 init/init_task.c    | 1 +
 kernel/sched/core.c | 1 +
 2 files changed, 2 insertions(+)

Index: linux-2.6/init/init_task.c
===================================================================
--- linux-2.6.orig/init/init_task.c
+++ linux-2.6/init/init_task.c
@@ -78,6 +78,7 @@ struct task_struct init_task
 	.prio		= MAX_PRIO - 20,
 	.static_prio	= MAX_PRIO - 20,
 	.normal_prio	= MAX_PRIO - 20,
+	.latency_nice	= DEFAULT_LATENCY_NICE,
 	.policy		= SCHED_NORMAL,
 	.cpus_ptr	= &init_task.cpus_mask,
 	.user_cpus_ptr	= NULL,
Index: linux-2.6/kernel/sched/core.c
===================================================================
--- linux-2.6.orig/kernel/sched/core.c
+++ linux-2.6/kernel/sched/core.c
@@ -4684,6 +4684,7 @@ int sched_fork(unsigned long clone_flags
 		p->prio = p->normal_prio = p->static_prio;
 		set_load_weight(p, false);
 
+		p->latency_nice = DEFAULT_LATENCY_NICE;
 		/*
 		 * We don't need the reset flag anymore after the fork. It has
 		 * fulfilled its duty:




* [PATCH 03/10] sched: Allow sched_{get,set}attr to change latency_nice of the task
  2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
  2023-03-06 13:25 ` [PATCH 01/10] sched: Introduce latency-nice as a per-task attribute Peter Zijlstra
  2023-03-06 13:25 ` [PATCH 02/10] sched/core: Propagate parent tasks latency requirements to the child task Peter Zijlstra
@ 2023-03-06 13:25 ` Peter Zijlstra
  2023-03-06 13:25 ` [PATCH 04/10] sched/fair: Add latency_offset Peter Zijlstra
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-06 13:25 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel, Parth Shah

From: Parth Shah <parth@linux.ibm.com>

Introduce the latency_nice attribute to sched_attr and provide a
mechanism to change the value via the sched_setattr/sched_getattr
syscalls.

Also add a new flag, SCHED_FLAG_LATENCY_NICE, to indicate a change of
the task's latency_nice on a sched_setattr syscall.
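
For illustration only (not part of the patch): a minimal userspace sketch
that sets latency_nice on the current task via sched_setattr(). The local
struct and flag definitions mirror the uapi changes below; the KEEP_*
flags leave the existing policy and parameters untouched.

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Local copy of the VER2 layout (60 bytes), as libc headers lag behind. */
struct sched_attr_v2 {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;
    uint64_t sched_deadline;
    uint64_t sched_period;
    uint32_t sched_util_min;
    uint32_t sched_util_max;
    int32_t  sched_latency_nice;    /* new in SCHED_ATTR_SIZE_VER2 */
};

#define SCHED_FLAG_KEEP_POLICY    0x08
#define SCHED_FLAG_KEEP_PARAMS    0x10
#define SCHED_FLAG_LATENCY_NICE   0x80

int main(void)
{
    struct sched_attr_v2 attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    /* keep the current policy/params, only update latency_nice */
    attr.sched_flags = SCHED_FLAG_KEEP_POLICY | SCHED_FLAG_KEEP_PARAMS |
                       SCHED_FLAG_LATENCY_NICE;
    attr.sched_latency_nice = -20;

    if (syscall(SYS_sched_setattr, 0 /* self */, &attr, 0)) {
        perror("sched_setattr");
        return 1;
    }
    return 0;
}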

[rebase and add a dedicated __setscheduler_latency ]
Signed-off-by: Parth Shah <parth@linux.ibm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20230224093454.956298-5-vincent.guittot@linaro.org
---
 include/uapi/linux/sched.h       |  4 +++-
 include/uapi/linux/sched/types.h | 19 +++++++++++++++++++
 kernel/sched/core.c              | 24 ++++++++++++++++++++++++
 tools/include/uapi/linux/sched.h |  4 +++-
 4 files changed, 49 insertions(+), 2 deletions(-)

Index: linux-2.6/include/uapi/linux/sched.h
===================================================================
--- linux-2.6.orig/include/uapi/linux/sched.h
+++ linux-2.6/include/uapi/linux/sched.h
@@ -132,6 +132,7 @@ struct clone_args {
 #define SCHED_FLAG_KEEP_PARAMS		0x10
 #define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
 #define SCHED_FLAG_UTIL_CLAMP_MAX	0x40
+#define SCHED_FLAG_LATENCY_NICE		0x80
 
 #define SCHED_FLAG_KEEP_ALL	(SCHED_FLAG_KEEP_POLICY | \
 				 SCHED_FLAG_KEEP_PARAMS)
@@ -143,6 +144,7 @@ struct clone_args {
 			 SCHED_FLAG_RECLAIM		| \
 			 SCHED_FLAG_DL_OVERRUN		| \
 			 SCHED_FLAG_KEEP_ALL		| \
-			 SCHED_FLAG_UTIL_CLAMP)
+			 SCHED_FLAG_UTIL_CLAMP		| \
+			 SCHED_FLAG_LATENCY_NICE)
 
 #endif /* _UAPI_LINUX_SCHED_H */
Index: linux-2.6/include/uapi/linux/sched/types.h
===================================================================
--- linux-2.6.orig/include/uapi/linux/sched/types.h
+++ linux-2.6/include/uapi/linux/sched/types.h
@@ -10,6 +10,7 @@ struct sched_param {
 
 #define SCHED_ATTR_SIZE_VER0	48	/* sizeof first published struct */
 #define SCHED_ATTR_SIZE_VER1	56	/* add: util_{min,max} */
+#define SCHED_ATTR_SIZE_VER2	60	/* add: latency_nice */
 
 /*
  * Extended scheduling parameters data structure.
@@ -98,6 +99,22 @@ struct sched_param {
  * scheduled on a CPU with no more capacity than the specified value.
  *
  * A task utilization boundary can be reset by setting the attribute to -1.
+ *
+ * Latency Tolerance Attributes
+ * ===========================
+ *
+ * A subset of sched_attr attributes allows to specify the relative latency
+ * requirements of a task with respect to the other tasks running/queued in the
+ * system.
+ *
+ * @ sched_latency_nice	task's latency_nice value
+ *
+ * The latency_nice of a task can have any value in a range of
+ * [MIN_LATENCY_NICE..MAX_LATENCY_NICE].
+ *
+ * A task with latency_nice set to MIN_LATENCY_NICE can be taken as one
+ * requiring lower latency, as opposed to a task with a higher
+ * latency_nice value.
  */
 struct sched_attr {
 	__u32 size;
@@ -120,6 +137,8 @@ struct sched_attr {
 	__u32 sched_util_min;
 	__u32 sched_util_max;
 
+	/* latency requirement hints */
+	__s32 sched_latency_nice;
 };
 
 #endif /* _UAPI_LINUX_SCHED_TYPES_H */
Index: linux-2.6/kernel/sched/core.c
===================================================================
--- linux-2.6.orig/kernel/sched/core.c
+++ linux-2.6/kernel/sched/core.c
@@ -7451,6 +7451,13 @@ static void __setscheduler_params(struct
 	p->rt_priority = attr->sched_priority;
 	p->normal_prio = normal_prio(p);
 	set_load_weight(p, true);
+}
+
+static void __setscheduler_latency(struct task_struct *p,
+		const struct sched_attr *attr)
+{
+	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
+		p->latency_nice = attr->sched_latency_nice;
 }
 
 /*
@@ -7593,6 +7601,13 @@ recheck:
 			return retval;
 	}
 
+	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
+		if (attr->sched_latency_nice > MAX_LATENCY_NICE)
+			return -EINVAL;
+		if (attr->sched_latency_nice < MIN_LATENCY_NICE)
+			return -EINVAL;
+	}
+
 	if (pi)
 		cpuset_read_lock();
 
@@ -7627,6 +7642,9 @@ recheck:
 			goto change;
 		if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
 			goto change;
+		if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE &&
+		    attr->sched_latency_nice != p->latency_nice)
+			goto change;
 
 		p->sched_reset_on_fork = reset_on_fork;
 		retval = 0;
@@ -7715,6 +7733,7 @@ change:
 		__setscheduler_params(p, attr);
 		__setscheduler_prio(p, newprio);
 	}
+	__setscheduler_latency(p, attr);
 	__setscheduler_uclamp(p, attr);
 
 	if (queued) {
@@ -7925,6 +7944,9 @@ static int sched_copy_attr(struct sched_
 	    size < SCHED_ATTR_SIZE_VER1)
 		return -EINVAL;
 
+	if ((attr->sched_flags & SCHED_FLAG_LATENCY_NICE) &&
+	    size < SCHED_ATTR_SIZE_VER2)
+		return -EINVAL;
 	/*
 	 * XXX: Do we want to be lenient like existing syscalls; or do we want
 	 * to be strict and return an error on out-of-bounds values?
@@ -8162,6 +8184,8 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pi
 	get_params(p, &kattr);
 	kattr.sched_flags &= SCHED_FLAG_ALL;
 
+	kattr.sched_latency_nice = p->latency_nice;
+
 #ifdef CONFIG_UCLAMP_TASK
 	/*
 	 * This could race with another potential updater, but this is fine
Index: linux-2.6/tools/include/uapi/linux/sched.h
===================================================================
--- linux-2.6.orig/tools/include/uapi/linux/sched.h
+++ linux-2.6/tools/include/uapi/linux/sched.h
@@ -132,6 +132,7 @@ struct clone_args {
 #define SCHED_FLAG_KEEP_PARAMS		0x10
 #define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
 #define SCHED_FLAG_UTIL_CLAMP_MAX	0x40
+#define SCHED_FLAG_LATENCY_NICE		0x80
 
 #define SCHED_FLAG_KEEP_ALL	(SCHED_FLAG_KEEP_POLICY | \
 				 SCHED_FLAG_KEEP_PARAMS)
@@ -143,6 +144,7 @@ struct clone_args {
 			 SCHED_FLAG_RECLAIM		| \
 			 SCHED_FLAG_DL_OVERRUN		| \
 			 SCHED_FLAG_KEEP_ALL		| \
-			 SCHED_FLAG_UTIL_CLAMP)
+			 SCHED_FLAG_UTIL_CLAMP		| \
+			 SCHED_FLAG_LATENCY_NICE)
 
 #endif /* _UAPI_LINUX_SCHED_H */




* [PATCH 04/10] sched/fair: Add latency_offset
  2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
                   ` (2 preceding siblings ...)
  2023-03-06 13:25 ` [PATCH 03/10] sched: Allow sched_{get,set}attr to change latency_nice of the task Peter Zijlstra
@ 2023-03-06 13:25 ` Peter Zijlstra
  2023-03-06 13:25 ` [PATCH 05/10] sched/fair: Add sched group latency support Peter Zijlstra
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-06 13:25 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel

From: Vincent Guittot <vincent.guittot@linaro.org>

XXX fold back into previous patches

Murdered-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 include/linux/sched.h      |    4 +++-
 include/linux/sched/prio.h |    9 +++++++++
 init/init_task.c           |    2 +-
 kernel/sched/core.c        |   21 ++++++++++++++++-----
 kernel/sched/debug.c       |    2 +-
 kernel/sched/fair.c        |    8 ++++++++
 kernel/sched/sched.h       |    2 ++
 7 files changed, 40 insertions(+), 8 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -568,6 +568,8 @@ struct sched_entity {
 	/* cached value of my_q->h_nr_running */
 	unsigned long			runnable_weight;
 #endif
+	/* preemption offset in ns */
+	long				latency_offset;
 
 #ifdef CONFIG_SMP
 	/*
@@ -784,7 +786,7 @@ struct task_struct {
 	int				static_prio;
 	int				normal_prio;
 	unsigned int			rt_priority;
-	int				latency_nice;
+	int				latency_prio;
 
 	struct sched_entity		se;
 	struct sched_rt_entity		rt;
--- a/include/linux/sched/prio.h
+++ b/include/linux/sched/prio.h
@@ -59,5 +59,14 @@ static inline long rlimit_to_nice(long p
  * Default tasks should be treated as a task with latency_nice = 0.
  */
 #define DEFAULT_LATENCY_NICE	0
+#define DEFAULT_LATENCY_PRIO	(DEFAULT_LATENCY_NICE + LATENCY_NICE_WIDTH/2)
+
+/*
+ * Convert user-nice values [ -20 ... 0 ... 19 ]
+ * to static latency [ 0..39 ],
+ * and back.
+ */
+#define NICE_TO_LATENCY(nice)	((nice) + DEFAULT_LATENCY_PRIO)
+#define LATENCY_TO_NICE(prio)	((prio) - DEFAULT_LATENCY_PRIO)
 
 #endif /* _LINUX_SCHED_PRIO_H */
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -78,7 +78,7 @@ struct task_struct init_task
 	.prio		= MAX_PRIO - 20,
 	.static_prio	= MAX_PRIO - 20,
 	.normal_prio	= MAX_PRIO - 20,
-	.latency_nice	= DEFAULT_LATENCY_NICE,
+	.latency_prio	= DEFAULT_LATENCY_PRIO,
 	.policy		= SCHED_NORMAL,
 	.cpus_ptr	= &init_task.cpus_mask,
 	.user_cpus_ptr	= NULL,
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1285,6 +1285,11 @@ static void set_load_weight(struct task_
 	}
 }
 
+static void set_latency_offset(struct task_struct *p)
+{
+	p->se.latency_offset = calc_latency_offset(p->latency_prio);
+}
+
 #ifdef CONFIG_UCLAMP_TASK
 /*
  * Serializes updates of utilization clamp values
@@ -4433,6 +4438,8 @@ static void __sched_fork(unsigned long c
 	p->se.vruntime			= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
+	set_latency_offset(p);
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	p->se.cfs_rq			= NULL;
 #endif
@@ -4684,7 +4691,9 @@ int sched_fork(unsigned long clone_flags
 		p->prio = p->normal_prio = p->static_prio;
 		set_load_weight(p, false);
 
-		p->latency_nice = DEFAULT_LATENCY_NICE;
+		p->latency_prio = NICE_TO_LATENCY(0);
+		set_latency_offset(p);
+
 		/*
 		 * We don't need the reset flag anymore after the fork. It has
 		 * fulfilled its duty:
@@ -7456,8 +7465,10 @@ static void __setscheduler_params(struct
 static void __setscheduler_latency(struct task_struct *p,
 		const struct sched_attr *attr)
 {
-	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
-		p->latency_nice = attr->sched_latency_nice;
+	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
+		p->latency_prio = NICE_TO_LATENCY(attr->sched_latency_nice);
+		set_latency_offset(p);
+	}
 }
 
 /*
@@ -7642,7 +7653,7 @@ static int __sched_setscheduler(struct t
 		if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
 			goto change;
 		if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE &&
-		    attr->sched_latency_nice != p->latency_nice)
+		    attr->sched_latency_nice != LATENCY_TO_NICE(p->latency_prio))
 			goto change;
 
 		p->sched_reset_on_fork = reset_on_fork;
@@ -8183,7 +8194,7 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pi
 	get_params(p, &kattr);
 	kattr.sched_flags &= SCHED_FLAG_ALL;
 
-	kattr.sched_latency_nice = p->latency_nice;
+	kattr.sched_latency_nice = LATENCY_TO_NICE(p->latency_prio);
 
 #ifdef CONFIG_UCLAMP_TASK
 	/*
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1043,7 +1043,7 @@ void proc_sched_show_task(struct task_st
 #endif
 	P(policy);
 	P(prio);
-	P(latency_nice);
+	P(latency_prio);
 	if (task_has_dl_policy(p)) {
 		P(dl.runtime);
 		P(dl.deadline);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -703,6 +703,14 @@ int sched_update_scaling(void)
 }
 #endif
 
+long calc_latency_offset(int prio)
+{
+	u32 weight = sched_prio_to_weight[prio];
+	u64 base = sysctl_sched_min_granularity;
+
+	return div_u64(base << SCHED_FIXEDPOINT_SHIFT, weight);
+}
+
 /*
  * delta /= w
  */
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2475,6 +2475,8 @@ extern unsigned int sysctl_numa_balancin
 extern unsigned int sysctl_numa_balancing_hot_threshold;
 #endif
 
+extern long calc_latency_offset(int prio);
+
 #ifdef CONFIG_SCHED_HRTICK
 
 /*




* [PATCH 05/10] sched/fair: Add sched group latency support
  2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
                   ` (3 preceding siblings ...)
  2023-03-06 13:25 ` [PATCH 04/10] sched/fair: Add latency_offset Peter Zijlstra
@ 2023-03-06 13:25 ` Peter Zijlstra
  2023-03-06 13:25 ` [PATCH 06/10] sched/fair: Add avg_vruntime Peter Zijlstra
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-06 13:25 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel

From: Vincent Guittot <vincent.guittot@linaro.org>

A task can set its latency priority with sched_setattr(), which is then used
to set the latency offset of its sched_entity, but sched group entities
still have the default latency offset value.

Add a latency.nice field to the cpu cgroup controller to set the latency
priority of the group similarly to sched_setattr(). The latency priority
is then used to set the offset of the sched_entities of the group.
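
For illustration, a hypothetical user of the new file (assuming cgroup2 is
mounted at /sys/fs/cgroup and a group named "app" already exists; neither
the path nor the group name comes from the patch):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* valid range is [-20, 19]; -20 is the most latency-sensitive */
    const char *path = "/sys/fs/cgroup/app/cpu.latency.nice";
    const char *val = "-20\n";
    int fd = open(path, O_WRONLY);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (write(fd, val, strlen(val)) != (ssize_t)strlen(val))
        perror("write");
    close(fd);
    return 0;
}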

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20230224093454.956298-7-vincent.guittot@linaro.org
---
 Documentation/admin-guide/cgroup-v2.rst |   10 ++++++++++
 kernel/sched/core.c                     |   30 ++++++++++++++++++++++++++++++
 kernel/sched/fair.c                     |   32 ++++++++++++++++++++++++++++++++
 kernel/sched/sched.h                    |    4 ++++
 4 files changed, 76 insertions(+)

--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1121,6 +1121,16 @@ All time durations are in microseconds.
         values similar to the sched_setattr(2). This maximum utilization
         value is used to clamp the task specific maximum utilization clamp.
 
+  cpu.latency.nice
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "0".
+
+	The nice value is in the range [-20, 19].
+
+	This interface file allows reading and setting latency using the
+	same values used by sched_setattr(2). The latency_nice of a group is
+	used to limit the impact of the latency_nice of a task outside the
+	group.
 
 
 Memory
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11068,6 +11068,25 @@ static int cpu_idle_write_s64(struct cgr
 {
 	return sched_group_set_idle(css_tg(css), idle);
 }
+
+static s64 cpu_latency_nice_read_s64(struct cgroup_subsys_state *css,
+				    struct cftype *cft)
+{
+	return LATENCY_TO_NICE(css_tg(css)->latency_prio);
+}
+
+static int cpu_latency_nice_write_s64(struct cgroup_subsys_state *css,
+				     struct cftype *cft, s64 nice)
+{
+	int prio;
+
+	if (nice < MIN_LATENCY_NICE || nice > MAX_LATENCY_NICE)
+		return -ERANGE;
+
+	prio = NICE_TO_LATENCY(nice);
+
+	return sched_group_set_latency(css_tg(css), prio);
+}
 #endif
 
 static struct cftype cpu_legacy_files[] = {
@@ -11082,6 +11101,11 @@ static struct cftype cpu_legacy_files[]
 		.read_s64 = cpu_idle_read_s64,
 		.write_s64 = cpu_idle_write_s64,
 	},
+	{
+		.name = "latency.nice",
+		.read_s64 = cpu_latency_nice_read_s64,
+		.write_s64 = cpu_latency_nice_write_s64,
+	},
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
@@ -11299,6 +11323,12 @@ static struct cftype cpu_files[] = {
 		.read_s64 = cpu_idle_read_s64,
 		.write_s64 = cpu_idle_write_s64,
 	},
+	{
+		.name = "latency.nice",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_s64 = cpu_latency_nice_read_s64,
+		.write_s64 = cpu_latency_nice_write_s64,
+	},
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12264,6 +12264,7 @@ int alloc_fair_sched_group(struct task_g
 		goto err;
 
 	tg->shares = NICE_0_LOAD;
+	tg->latency_prio = DEFAULT_LATENCY_PRIO;
 
 	init_cfs_bandwidth(tg_cfs_bandwidth(tg));
 
@@ -12362,6 +12363,9 @@ void init_tg_cfs_entry(struct task_group
 	}
 
 	se->my_q = cfs_rq;
+
+	se->latency_offset = calc_latency_offset(tg->latency_prio);
+
 	/* guarantee group entities always have weight */
 	update_load_set(&se->load, NICE_0_LOAD);
 	se->parent = parent;
@@ -12490,6 +12494,34 @@ int sched_group_set_idle(struct task_gro
 
 	mutex_unlock(&shares_mutex);
 	return 0;
+}
+
+int sched_group_set_latency(struct task_group *tg, int prio)
+{
+	long latency_offset;
+	int i;
+
+	if (tg == &root_task_group)
+		return -EINVAL;
+
+	mutex_lock(&shares_mutex);
+
+	if (tg->latency_prio == prio) {
+		mutex_unlock(&shares_mutex);
+		return 0;
+	}
+
+	tg->latency_prio = prio;
+	latency_offset = calc_latency_offset(prio);
+
+	for_each_possible_cpu(i) {
+		struct sched_entity *se = tg->se[i];
+
+		WRITE_ONCE(se->latency_offset, latency_offset);
+	}
+
+	mutex_unlock(&shares_mutex);
+	return 0;
 }
 
 #else /* CONFIG_FAIR_GROUP_SCHED */
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -378,6 +378,8 @@ struct task_group {
 
 	/* A positive value indicates that this is a SCHED_IDLE group. */
 	int			idle;
+	/* latency priority of the group. */
+	int			latency_prio;
 
 #ifdef	CONFIG_SMP
 	/*
@@ -488,6 +490,8 @@ extern int sched_group_set_shares(struct
 
 extern int sched_group_set_idle(struct task_group *tg, long idle);
 
+extern int sched_group_set_latency(struct task_group *tg, int prio);
+
 #ifdef CONFIG_SMP
 extern void set_task_rq_fair(struct sched_entity *se,
 			     struct cfs_rq *prev, struct cfs_rq *next);




* [PATCH 06/10] sched/fair: Add avg_vruntime
  2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
                   ` (4 preceding siblings ...)
  2023-03-06 13:25 ` [PATCH 05/10] sched/fair: Add sched group latency support Peter Zijlstra
@ 2023-03-06 13:25 ` Peter Zijlstra
  2023-03-21 13:58   ` Chen Yu
  2023-03-06 13:25 ` [PATCH 07/10] sched/fair: Remove START_DEBIT Peter Zijlstra
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-06 13:25 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel

In order to move to an eligibility-based scheduling policy, we need a
better approximation of the ideal scheduler.

Specifically, for a virtual-time weighted fair queueing based
scheduler, the ideal scheduler is the weighted average of the
individual virtual runtimes (math in the comment).

As such, compute the weighted average to approximate the ideal
scheduler -- note that the approximation is in the individual task
behaviour, which isn't strictly conformant.

Specifically, consider adding a task with a vruntime left of center;
in this case the average will move backwards in time -- something the
ideal scheduler would of course never do.

[ Note: try and replace cfs_rq::avg_load with cfs_rq->load.weight, the
        conditions are slightly different but should be possible ]
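
For readers skimming the math, a toy user-space rendition of the relative
form maintained below (made-up numbers, not kernel code):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* v = min_vruntime; v_i and w_i per entity */
    uint64_t min_vruntime = 1000000;
    uint64_t vruntime[] = { 1000000, 1002000, 1005000 };
    uint32_t weight[]   = { 1024, 2048, 512 };
    int64_t avg = 0, load = 0;

    for (int i = 0; i < 3; i++) {
        /* avg_vruntime = \Sum (v_i - v) * w_i ; cfs_rq->load = \Sum w_i */
        avg  += (int64_t)(vruntime[i] - min_vruntime) * weight[i];
        load += weight[i];
    }

    /* V = v + \Sum (v_i - v)*w_i / \Sum w_i */
    printf("V = %llu\n", (unsigned long long)(min_vruntime + avg / load));
    return 0;
}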

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/debug.c |   32 +++++++--------
 kernel/sched/fair.c  |  105 +++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h |    5 ++
 3 files changed, 122 insertions(+), 20 deletions(-)

--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -580,10 +580,9 @@ static void print_rq(struct seq_file *m,
 
 void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 {
-	s64 MIN_vruntime = -1, min_vruntime, max_vruntime = -1,
-		spread, rq0_min_vruntime, spread0;
+	s64 left_vruntime = -1, min_vruntime, right_vruntime = -1, spread;
+	struct sched_entity *last, *first;
 	struct rq *rq = cpu_rq(cpu);
-	struct sched_entity *last;
 	unsigned long flags;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -597,26 +596,25 @@ void print_cfs_rq(struct seq_file *m, in
 			SPLIT_NS(cfs_rq->exec_clock));
 
 	raw_spin_rq_lock_irqsave(rq, flags);
-	if (rb_first_cached(&cfs_rq->tasks_timeline))
-		MIN_vruntime = (__pick_first_entity(cfs_rq))->vruntime;
+	first = __pick_first_entity(cfs_rq);
+	if (first)
+		left_vruntime = first->vruntime;
 	last = __pick_last_entity(cfs_rq);
 	if (last)
-		max_vruntime = last->vruntime;
+		right_vruntime = last->vruntime;
 	min_vruntime = cfs_rq->min_vruntime;
-	rq0_min_vruntime = cpu_rq(0)->cfs.min_vruntime;
 	raw_spin_rq_unlock_irqrestore(rq, flags);
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "MIN_vruntime",
-			SPLIT_NS(MIN_vruntime));
+
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "left_vruntime",
+			SPLIT_NS(left_vruntime));
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "min_vruntime",
 			SPLIT_NS(min_vruntime));
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "max_vruntime",
-			SPLIT_NS(max_vruntime));
-	spread = max_vruntime - MIN_vruntime;
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "spread",
-			SPLIT_NS(spread));
-	spread0 = min_vruntime - rq0_min_vruntime;
-	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "spread0",
-			SPLIT_NS(spread0));
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "avg_vruntime",
+			SPLIT_NS(avg_vruntime(cfs_rq)));
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "right_vruntime",
+			SPLIT_NS(right_vruntime));
+	spread = right_vruntime - left_vruntime;
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "spread", SPLIT_NS(spread));
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_spread_over",
 			cfs_rq->nr_spread_over);
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -601,9 +601,102 @@ static inline bool entity_before(const s
 	return (s64)(a->vruntime - b->vruntime) < 0;
 }
 
+static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	return (s64)(se->vruntime - cfs_rq->min_vruntime);
+}
+
 #define __node_2_se(node) \
 	rb_entry((node), struct sched_entity, run_node)
 
+/*
+ * Compute virtual time from the per-task service numbers:
+ *
+ * Fair schedulers conserve lag: \Sum lag_i = 0
+ *
+ * lag_i = S - s_i = w_i * (V - v_i)
+ *
+ * \Sum lag_i = 0 -> \Sum w_i * (V - v_i) = V * \Sum w_i - \Sum w_i * v_i = 0
+ *
+ * From which we solve V:
+ *
+ *     \Sum v_i * w_i
+ * V = --------------
+ *        \Sum w_i
+ *
+ * However, since v_i is u64, and the multiplication could easily overflow,
+ * transform it into a relative form that uses smaller quantities:
+ *
+ * Substitute: v_i == (v_i - v) + v
+ *
+ *     \Sum ((v_i - v) + v) * w_i   \Sum (v_i - v) * w_i
+ * V = -------------------------- = -------------------- + v
+ *              \Sum w_i                   \Sum w_i
+ *
+ * min_vruntime = v
+ * avg_vruntime = \Sum (v_i - v) * w_i
+ * cfs_rq->load = \Sum w_i
+ *
+ * Since min_vruntime is a monotonic increasing variable that closely tracks
+ * the per-task service, these deltas: (v_i - v), will be in the order of the
+ * maximal (virtual) lag induced in the system due to quantisation.
+ */
+static void
+avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	s64 key = entity_key(cfs_rq, se);
+	cfs_rq->avg_vruntime += key * se->load.weight;
+	cfs_rq->avg_load += se->load.weight;
+}
+
+static void
+avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	s64 key = entity_key(cfs_rq, se);
+	cfs_rq->avg_vruntime -= key * se->load.weight;
+	cfs_rq->avg_load -= se->load.weight;
+}
+
+static inline
+void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
+{
+	/*
+	 * v' = v + d ==> avg_vruntime' = avg_vruntime - d*avg_load
+	 */
+	cfs_rq->avg_vruntime -= cfs_rq->avg_load * delta;
+}
+
+u64 avg_vruntime(struct cfs_rq *cfs_rq)
+{
+	struct sched_entity *curr = cfs_rq->curr;
+	s64 lag = cfs_rq->avg_vruntime;
+	long load = cfs_rq->avg_load;
+
+	if (curr && curr->on_rq) {
+		lag += entity_key(cfs_rq, curr) * curr->load.weight;
+		load += curr->load.weight;
+	}
+
+	if (load)
+		lag = div_s64(lag, load);
+
+	return cfs_rq->min_vruntime + lag;
+}
+
+static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
+{
+	u64 min_vruntime = cfs_rq->min_vruntime;
+	/*
+	 * open coded max_vruntime() to allow updating avg_vruntime
+	 */
+	s64 delta = (s64)(vruntime - min_vruntime);
+	if (delta > 0) {
+		avg_vruntime_update(cfs_rq, delta);
+		min_vruntime = vruntime;
+	}
+	return min_vruntime;
+}
+
 static void update_min_vruntime(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *curr = cfs_rq->curr;
@@ -629,7 +722,7 @@ static void update_min_vruntime(struct c
 
 	/* ensure we never gain time by being placed backwards. */
 	u64_u32_store(cfs_rq->min_vruntime,
-		      max_vruntime(cfs_rq->min_vruntime, vruntime));
+		      __update_min_vruntime(cfs_rq, vruntime));
 }
 
 static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
@@ -642,12 +735,14 @@ static inline bool __entity_less(struct
  */
 static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	avg_vruntime_add(cfs_rq, se);
 	rb_add_cached(&se->run_node, &cfs_rq->tasks_timeline, __entity_less);
 }
 
 static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline);
+	avg_vruntime_sub(cfs_rq, se);
 }
 
 struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
@@ -3330,6 +3425,8 @@ static void reweight_entity(struct cfs_r
 		/* commit outstanding execution time */
 		if (cfs_rq->curr == se)
 			update_curr(cfs_rq);
+		else
+			avg_vruntime_sub(cfs_rq, se);
 		update_load_sub(&cfs_rq->load, se->load.weight);
 	}
 	dequeue_load_avg(cfs_rq, se);
@@ -3345,9 +3442,11 @@ static void reweight_entity(struct cfs_r
 #endif
 
 	enqueue_load_avg(cfs_rq, se);
-	if (se->on_rq)
+	if (se->on_rq) {
 		update_load_add(&cfs_rq->load, se->load.weight);
-
+		if (cfs_rq->curr != se)
+			avg_vruntime_add(cfs_rq, se);
+	}
 }
 
 void reweight_task(struct task_struct *p, int prio)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -558,6 +558,9 @@ struct cfs_rq {
 	unsigned int		idle_nr_running;   /* SCHED_IDLE */
 	unsigned int		idle_h_nr_running; /* SCHED_IDLE */
 
+	s64			avg_vruntime;
+	u64			avg_load;
+
 	u64			exec_clock;
 	u64			min_vruntime;
 #ifdef CONFIG_SCHED_CORE
@@ -3312,4 +3315,6 @@ static inline void switch_mm_cid(struct
 static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next) { }
 #endif
 
+extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
+
 #endif /* _KERNEL_SCHED_SCHED_H */




* [PATCH 07/10] sched/fair: Remove START_DEBIT
  2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
                   ` (5 preceding siblings ...)
  2023-03-06 13:25 ` [PATCH 06/10] sched/fair: Add avg_vruntime Peter Zijlstra
@ 2023-03-06 13:25 ` Peter Zijlstra
  2023-03-06 13:25 ` [PATCH 08/10] sched/fair: Add lag based placement Peter Zijlstra
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-06 13:25 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel

With the introduction of avg_vruntime() there is no need to use worse
approximations. Take the 0-lag point as the starting point for inserting
new tasks.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c     |   21 +--------------------
 kernel/sched/features.h |    6 ------
 2 files changed, 1 insertion(+), 26 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -882,16 +882,6 @@ static u64 sched_slice(struct cfs_rq *cf
 	return slice;
 }
 
-/*
- * We calculate the vruntime slice of a to-be-inserted task.
- *
- * vs = s/w
- */
-static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
-	return calc_delta_fair(sched_slice(cfs_rq, se), se);
-}
-
 #include "pelt.h"
 #ifdef CONFIG_SMP
 
@@ -4758,18 +4748,9 @@ static void check_spread(struct cfs_rq *
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
-	u64 vruntime = cfs_rq->min_vruntime;
+	u64 vruntime = avg_vruntime(cfs_rq);
 	u64 sleep_time;
 
-	/*
-	 * The 'current' period is already promised to the current tasks,
-	 * however the extra weight of the new task will slow them down a
-	 * little, place the new task so that it fits in the slot that
-	 * stays open at the end.
-	 */
-	if (initial && sched_feat(START_DEBIT))
-		vruntime += sched_vslice(cfs_rq, se);
-
 	/* sleeps up to a single latency don't count. */
 	if (!initial) {
 		unsigned long thresh;
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -7,12 +7,6 @@
 SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true)
 
 /*
- * Place new tasks ahead so that they do not starve already running
- * tasks
- */
-SCHED_FEAT(START_DEBIT, true)
-
-/*
  * Prefer to schedule the task we woke last (assuming it failed
  * wakeup-preemption), since its likely going to consume data we
  * touched, increases cache locality.




* [PATCH 08/10] sched/fair: Add lag based placement
  2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
                   ` (6 preceding siblings ...)
  2023-03-06 13:25 ` [PATCH 07/10] sched/fair: Remove START_DEBIT Peter Zijlstra
@ 2023-03-06 13:25 ` Peter Zijlstra
  2023-03-16 22:49   ` Tim Chen
  2023-03-06 13:25 ` [PATCH 09/10] rbtree: Add rb_add_augmented_cached() helper Peter Zijlstra
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-06 13:25 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel

With the introduction of avg_vruntime, it is possible to approximate
lag (the entire purpose of introducing it, in fact). Use this to do
lag-based placement across sleep+wake.

Specifically, the FAIR_SLEEPERS logic places entities too far to the
left and messes up the deadline aspect of EEVDF.
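
A made-up numerical illustration of the scheme below: if at dequeue the
queue average is V = 1000 and the entity's vruntime is v_i = 970, it
stores lag = V - v_i = 30; when it wakes and the queue has advanced to
V' = 5000, placement gives v_i = V' - lag = 4970, so the entity resumes
with the same 30 units of virtual time owed relative to the new average,
instead of receiving the fixed FAIR_SLEEPERS bonus.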

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h   |    1 
 kernel/sched/core.c     |    1 
 kernel/sched/fair.c     |   63 +++++++++++++++++++++++++++---------------------
 kernel/sched/features.h |    8 ++++++
 4 files changed, 46 insertions(+), 27 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -555,6 +555,7 @@ struct sched_entity {
 	u64				sum_exec_runtime;
 	u64				vruntime;
 	u64				prev_sum_exec_runtime;
+	s64				lag;
 
 	u64				nr_migrations;
 
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4436,6 +4436,7 @@ static void __sched_fork(unsigned long c
 	p->se.prev_sum_exec_runtime	= 0;
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
+	p->se.lag			= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
 	set_latency_offset(p);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4749,39 +4749,45 @@ static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
 	u64 vruntime = avg_vruntime(cfs_rq);
-	u64 sleep_time;
 
-	/* sleeps up to a single latency don't count. */
-	if (!initial) {
-		unsigned long thresh;
-
-		if (se_is_idle(se))
-			thresh = sysctl_sched_min_granularity;
-		else
-			thresh = sysctl_sched_latency;
+	if (sched_feat(FAIR_SLEEPERS)) {
+		u64 sleep_time;
+
+		/* sleeps up to a single latency don't count. */
+		if (!initial) {
+			unsigned long thresh;
+
+			if (se_is_idle(se))
+				thresh = sysctl_sched_min_granularity;
+			else
+				thresh = sysctl_sched_latency;
+
+			/*
+			 * Halve their sleep time's effect, to allow
+			 * for a gentler effect of sleepers:
+			 */
+			if (sched_feat(GENTLE_FAIR_SLEEPERS))
+				thresh >>= 1;
+
+			vruntime -= thresh;
+		}
 
 		/*
-		 * Halve their sleep time's effect, to allow
-		 * for a gentler effect of sleepers:
+		 * Pull vruntime of the entity being placed to the base level of
+		 * cfs_rq, to prevent boosting it if placed backwards.  If the entity
+		 * slept for a long time, don't even try to compare its vruntime with
+		 * the base as it may be too far off and the comparison may get
+		 * inversed due to s64 overflow.
 		 */
-		if (sched_feat(GENTLE_FAIR_SLEEPERS))
-			thresh >>= 1;
-
-		vruntime -= thresh;
+		sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start;
+		if ((s64)sleep_time < 60LL * NSEC_PER_SEC)
+			vruntime = max_vruntime(se->vruntime, vruntime);
 	}
 
-	/*
-	 * Pull vruntime of the entity being placed to the base level of
-	 * cfs_rq, to prevent boosting it if placed backwards.  If the entity
-	 * slept for a long time, don't even try to compare its vruntime with
-	 * the base as it may be too far off and the comparison may get
-	 * inversed due to s64 overflow.
-	 */
-	sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start;
-	if ((s64)sleep_time > 60LL * NSEC_PER_SEC)
-		se->vruntime = vruntime;
-	else
-		se->vruntime = max_vruntime(se->vruntime, vruntime);
+	if (sched_feat(PRESERVE_LAG))
+		vruntime -= se->lag;
+
+	se->vruntime = vruntime;
 }
 
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
@@ -4949,6 +4955,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 
 	clear_buddies(cfs_rq, se);
 
+	if (sched_feat(PRESERVE_LAG) && (flags & DEQUEUE_SLEEP))
+		se->lag = avg_vruntime(cfs_rq) - se->vruntime;
+
 	if (se != cfs_rq->curr)
 		__dequeue_entity(cfs_rq, se);
 	se->on_rq = 0;
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -1,12 +1,20 @@
 /* SPDX-License-Identifier: GPL-2.0 */
+
 /*
  * Only give sleepers 50% of their service deficit. This allows
  * them to run sooner, but does not allow tons of sleepers to
  * rip the spread apart.
  */
+SCHED_FEAT(FAIR_SLEEPERS, false)
 SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true)
 
 /*
+ * Using the avg_vruntime, do the right thing and preserve lag
+ * across sleep+wake cycles.
+ */
+SCHED_FEAT(PRESERVE_LAG, true)
+
+/*
  * Prefer to schedule the task we woke last (assuming it failed
  * wakeup-preemption), since its likely going to consume data we
  * touched, increases cache locality.




* [PATCH 09/10] rbtree: Add rb_add_augmented_cached() helper
  2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
                   ` (7 preceding siblings ...)
  2023-03-06 13:25 ` [PATCH 08/10] sched/fair: Add lag based placement Peter Zijlstra
@ 2023-03-06 13:25 ` Peter Zijlstra
  2023-03-06 13:25 ` [PATCH 10/10] sched/fair: Implement an EEVDF like policy Peter Zijlstra
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-06 13:25 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel

While slightly sub-optimal (updating the augmented data while going down
the tree during the insertion lookup would be faster), the augment
interface does not currently allow for that; provide a generic helper
to add a node to an augmented cached tree.
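
A minimal usage sketch (hypothetical node type and names, not part of this
patch; the next patch wires the helper up for se->min_deadline): an
augmented cached tree ordered by ->key that also tracks the subtree
maximum of ->val.

#include <linux/rbtree_augmented.h>
#include <linux/types.h>

struct my_node {
    struct rb_node  rb;
    u64             key;    /* sort key */
    u64             val;    /* per-node value */
    u64             max;    /* max ->val over this subtree (augment) */
};

static inline u64 my_node_val(struct my_node *n)
{
    return n->val;
}

RB_DECLARE_CALLBACKS_MAX(static, my_cbs, struct my_node, rb,
                         u64, max, my_node_val)

static inline bool my_less(struct rb_node *a, const struct rb_node *b)
{
    return rb_entry(a, struct my_node, rb)->key <
           rb_entry(b, struct my_node, rb)->key;
}

static void my_insert(struct my_node *n, struct rb_root_cached *tree)
{
    n->max = n->val;    /* seed the node's own augmented value */
    rb_add_augmented_cached(&n->rb, tree, my_less, &my_cbs);
}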

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/rbtree_augmented.h |   26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

--- a/include/linux/rbtree_augmented.h
+++ b/include/linux/rbtree_augmented.h
@@ -60,6 +60,32 @@ rb_insert_augmented_cached(struct rb_nod
 	rb_insert_augmented(node, &root->rb_root, augment);
 }
 
+static __always_inline struct rb_node *
+rb_add_augmented_cached(struct rb_node *node, struct rb_root_cached *tree,
+			bool (*less)(struct rb_node *, const struct rb_node *),
+			const struct rb_augment_callbacks *augment)
+{
+	struct rb_node **link = &tree->rb_root.rb_node;
+	struct rb_node *parent = NULL;
+	bool leftmost = true;
+
+	while (*link) {
+		parent = *link;
+		if (less(node, parent)) {
+			link = &parent->rb_left;
+		} else {
+			link = &parent->rb_right;
+			leftmost = false;
+		}
+	}
+
+	rb_link_node(node, parent, link);
+	augment->propagate(parent, NULL); /* suboptimal */
+	rb_insert_augmented_cached(node, tree, leftmost, augment);
+
+	return leftmost ? node : NULL;
+}
+
 /*
  * Template for declaring augmented rbtree callbacks (generic case)
  *




* [PATCH 10/10] sched/fair: Implement an EEVDF like policy
  2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
                   ` (8 preceding siblings ...)
  2023-03-06 13:25 ` [PATCH 09/10] rbtree: Add rb_add_augmented_cached() helper Peter Zijlstra
@ 2023-03-06 13:25 ` Peter Zijlstra
  2023-03-08  8:39   ` Mike Galbraith
  2023-03-07 10:27 ` [PATCH 00/10] sched: EEVDF using latency-nice Vincent Guittot
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-06 13:25 UTC (permalink / raw)
  To: mingo, vincent.guittot
  Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel

CFS is currently a WFQ based scheduler with only a single knob, the
weight. The addition of a second, latency oriented parameter makes
something like WF2Q or EEVDF a much better fit.

Specifically, EEVDF does EDF-like scheduling in the left half of the
tree -- those entities that are owed service. Except that, because this
is a virtual time scheduler, the deadlines are in virtual time as well,
which is what allows over-subscription.

EEVDF has two parameters:

 - weight; which is mapped to nice just as before
 - relative deadline; which is related to slice length and mapped
   to the new latency nice.

Basically, by setting a smaller slice, the deadline will be earlier
and the task will be more eligible and run earlier.

Preemption (both tick and wakeup) is driven by testing against a fresh
pick. Because the tree is now effectively an interval tree, and the
selection is no longer 'leftmost', over-scheduling is less of a
problem.
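
To make the selection rule concrete, a toy user-space model of "eligible,
earliest virtual deadline first" (made-up numbers; the kernel does this in
O(log n) on the augmented rbtree below):

#include <stdio.h>

struct entity {
    const char *name;
    double v;   /* vruntime v_i */
    double w;   /* weight w_i */
    double r;   /* request/slice r_i */
};

int main(void)
{
    struct entity e[] = {
        { "A", 100.0, 1024.0,  3000.0 },    /* small request, early deadline */
        { "B",  95.0, 1024.0, 24000.0 },
        { "C", 120.0, 2048.0,  6000.0 },
    };
    int n = sizeof(e) / sizeof(e[0]);
    double sum_vw = 0.0, sum_w = 0.0, V;
    const struct entity *best = NULL;

    /* V = \Sum v_i*w_i / \Sum w_i */
    for (int i = 0; i < n; i++) {
        sum_vw += e[i].v * e[i].w;
        sum_w  += e[i].w;
    }
    V = sum_vw / sum_w;

    for (int i = 0; i < n; i++) {
        double vd = e[i].v + e[i].r / e[i].w;   /* vd_i = ve_i + r_i/w_i */

        if (e[i].v > V)     /* not eligible: already received excess service */
            continue;
        if (!best || vd < best->v + best->r / best->w)
            best = &e[i];
    }

    printf("pick: %s\n", best ? best->name : "none");
    return 0;
}

With these numbers C is not eligible (v_i > V) and A beats B on deadline,
illustrating how a smaller request yields an earlier deadline and thus an
earlier pick.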

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h   |    4 
 kernel/sched/debug.c    |    6 -
 kernel/sched/fair.c     |  265 ++++++++++++++++++++++++++++++++++++++++++------
 kernel/sched/features.h |    2 
 kernel/sched/sched.h    |    1 
 5 files changed, 247 insertions(+), 31 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -548,6 +548,9 @@ struct sched_entity {
 	/* For load-balancing: */
 	struct load_weight		load;
 	struct rb_node			run_node;
+	u64				deadline;
+	u64				min_deadline;
+
 	struct list_head		group_node;
 	unsigned int			on_rq;
 
@@ -556,6 +559,7 @@ struct sched_entity {
 	u64				vruntime;
 	u64				prev_sum_exec_runtime;
 	s64				lag;
+	u64				slice;
 
 	u64				nr_migrations;
 
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -535,9 +535,13 @@ print_task(struct seq_file *m, struct rq
 	else
 		SEQ_printf(m, " %c", task_state_to_char(p));
 
-	SEQ_printf(m, " %15s %5d %9Ld.%06ld %9Ld %5d ",
+	SEQ_printf(m, "%15s %5d %9Ld.%06ld %c %9Ld.%06ld %9Ld.%06ld %9Ld.%06ld %9Ld %5d ",
 		p->comm, task_pid_nr(p),
 		SPLIT_NS(p->se.vruntime),
+		entity_eligible(cfs_rq_of(&p->se), &p->se) ? 'E' : 'N',
+		SPLIT_NS(p->se.deadline),
+		SPLIT_NS(p->se.slice),
+		SPLIT_NS(p->se.sum_exec_runtime),
 		(long long)(p->nvcsw + p->nivcsw),
 		p->prio);
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -47,6 +47,7 @@
 #include <linux/psi.h>
 #include <linux/ratelimit.h>
 #include <linux/task_work.h>
+#include <linux/rbtree_augmented.h>
 
 #include <asm/switch_to.h>
 
@@ -683,6 +684,34 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
 	return cfs_rq->min_vruntime + lag;
 }
 
+/*
+ * Entity is eligible once it received less service than it ought to have,
+ * eg. lag >= 0.
+ *
+ * lag_i = S - s_i = w_i*(V - v_i)
+ *
+ * lag_i >= 0 -> V >= v_i
+ *
+ *     \Sum (v_i - v)*w_i
+ * V = ------------------ + v
+ *          \Sum w_i
+ *
+ * lag_i >= 0 -> \Sum (v_i - v)*w_i >= (v_i - v)*(\Sum w_i)
+ */
+int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	struct sched_entity *curr = cfs_rq->curr;
+	s64 avg_vruntime = cfs_rq->avg_vruntime;
+	long avg_load = cfs_rq->avg_load;
+
+	if (curr && curr->on_rq) {
+		avg_vruntime += entity_key(cfs_rq, curr) * curr->load.weight;
+		avg_load += curr->load.weight;
+	}
+
+	return avg_vruntime >= entity_key(cfs_rq, se) * avg_load;
+}
+
 static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
 {
 	u64 min_vruntime = cfs_rq->min_vruntime;
@@ -699,8 +728,8 @@ static u64 __update_min_vruntime(struct
 
 static void update_min_vruntime(struct cfs_rq *cfs_rq)
 {
+	struct sched_entity *se = __pick_first_entity(cfs_rq);
 	struct sched_entity *curr = cfs_rq->curr;
-	struct rb_node *leftmost = rb_first_cached(&cfs_rq->tasks_timeline);
 
 	u64 vruntime = cfs_rq->min_vruntime;
 
@@ -711,9 +740,7 @@ static void update_min_vruntime(struct c
 			curr = NULL;
 	}
 
-	if (leftmost) { /* non-empty tree */
-		struct sched_entity *se = __node_2_se(leftmost);
-
+	if (se) {
 		if (!curr)
 			vruntime = se->vruntime;
 		else
@@ -730,18 +757,50 @@ static inline bool __entity_less(struct
 	return entity_before(__node_2_se(a), __node_2_se(b));
 }
 
+#define deadline_gt(field, lse, rse) ({ (s64)((lse)->field - (rse)->field) > 0; })
+
+static inline void __update_min_deadline(struct sched_entity *se, struct rb_node *node)
+{
+	if (node) {
+		struct sched_entity *rse = __node_2_se(node);
+		if (deadline_gt(min_deadline, se, rse))
+			se->min_deadline = rse->min_deadline;
+	}
+}
+
+/*
+ * se->min_deadline = min(se->deadline, left->min_deadline, right->min_deadline)
+ */
+static inline bool min_deadline_update(struct sched_entity *se, bool exit)
+{
+	u64 old_min_deadline = se->min_deadline;
+	struct rb_node *node = &se->run_node;
+
+	se->min_deadline = se->deadline;
+	__update_min_deadline(se, node->rb_right);
+	__update_min_deadline(se, node->rb_left);
+
+	return se->min_deadline == old_min_deadline;
+}
+
+RB_DECLARE_CALLBACKS(static, min_deadline_cb, struct sched_entity,
+		     run_node, min_deadline, min_deadline_update);
+
 /*
  * Enqueue an entity into the rb-tree:
  */
 static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	avg_vruntime_add(cfs_rq, se);
-	rb_add_cached(&se->run_node, &cfs_rq->tasks_timeline, __entity_less);
+	se->min_deadline = se->deadline;
+	rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
+				__entity_less, &min_deadline_cb);
 }
 
 static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline);
+	rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
+				  &min_deadline_cb);
 	avg_vruntime_sub(cfs_rq, se);
 }
 
@@ -765,6 +824,101 @@ static struct sched_entity *__pick_next_
 	return __node_2_se(next);
 }
 
+static struct sched_entity *pick_cfs(struct cfs_rq *cfs_rq, struct sched_entity *curr)
+{
+	struct sched_entity *left = __pick_first_entity(cfs_rq);
+
+	/*
+	 * If curr is set we have to see if its left of the leftmost entity
+	 * still in the tree, provided there was anything in the tree at all.
+	 */
+	if (!left || (curr && entity_before(curr, left)))
+		left = curr;
+
+	return left;
+}
+
+/*
+ * Earliest Eligible Virtual Deadline First
+ *
+ * In order to provide latency guarantees for different request sizes
+ * EEVDF selects the best runnable task from two criteria:
+ *
+ *  1) the task must be eligible (must be owed service)
+ *
+ *  2) from those tasks that meet 1), we select the one
+ *     with the earliest virtual deadline.
+ *
+ * We can do this in O(log n) time due to an augmented RB-tree. The
+ * tree keeps the entries sorted on service, but also functions as a
+ * heap based on the deadline by keeping:
+ *
+ *  se->min_deadline = min(se->deadline, se->{left,right}->min_deadline)
+ *
+ * Which allows an EDF like search on (sub)trees.
+ */
+static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
+{
+	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
+	struct sched_entity *curr = cfs_rq->curr;
+	struct sched_entity *best = NULL;
+
+	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
+		curr = NULL;
+
+	while (node) {
+		struct sched_entity *se = __node_2_se(node);
+
+		/*
+		 * If this entity is not eligible, try the left subtree.
+		 *
+		 * XXX: would it be worth it to do the single division for
+		 *      avg_vruntime() once, instead of the multiplication
+		 *      in entity_eligible() O(log n) times?
+		 */
+		if (!entity_eligible(cfs_rq, se)) {
+			node = node->rb_left;
+			continue;
+		}
+
+		/*
+		 * If this entity has an earlier deadline than the previous
+		 * best, take this one. If it also has the earliest deadline
+		 * of its subtree, we're done.
+		 */
+		if (!best || deadline_gt(deadline, best, se)) {
+			best = se;
+			if (best->deadline == best->min_deadline)
+				break;
+		}
+
+		/*
+		 * If the earliest deadline in this subtree is in the fully
+		 * eligible left half of our space, go there.
+		 */
+		if (node->rb_left &&
+		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
+			node = node->rb_left;
+			continue;
+		}
+
+		node = node->rb_right;
+	}
+
+	if (!best || (curr && deadline_gt(deadline, best, curr)))
+		best = curr;
+
+	if (unlikely(!best)) {
+		struct sched_entity *left = __pick_first_entity(cfs_rq);
+		if (left) {
+			pr_err("EEVDF scheduling fail, picking leftmost\n");
+			return left;
+		}
+	}
+
+	return best;
+}
+
 #ifdef CONFIG_SCHED_DEBUG
 struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
 {
@@ -882,6 +1036,32 @@ static u64 sched_slice(struct cfs_rq *cf
 	return slice;
 }
 
+static void set_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	if (sched_feat(EEVDF)) {
+		/*
+		 * For EEVDF the virtual time slope is determined by w_i (iow.
+		 * nice) while the request time r_i is determined by
+		 * latency-nice.
+		 */
+		se->slice = se->latency_offset;
+	} else {
+		/*
+		 * When many tasks blow up the sched_period; it is possible
+		 * that sched_slice() reports unusually large results (when
+		 * many tasks are very light for example). Therefore impose a
+		 * maximum.
+		 */
+		se->slice = min_t(u64, sched_slice(cfs_rq, se), sysctl_sched_latency);
+	}
+
+	/*
+	 * vd_i = ve_i + r_i / w_i
+	 */
+	se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
+	se->min_deadline = se->deadline;
+}
+
 #include "pelt.h"
 #ifdef CONFIG_SMP
 
@@ -1014,6 +1194,13 @@ static void update_curr(struct cfs_rq *c
 	schedstat_add(cfs_rq->exec_clock, delta_exec);
 
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
+	/*
+	 * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
+	 * this is probably good enough.
+	 */
+	if ((s64)(curr->vruntime - curr->deadline) > 0)
+		set_slice(cfs_rq, curr);
+
 	update_min_vruntime(cfs_rq);
 
 	if (entity_is_task(curr)) {
@@ -4788,6 +4975,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
 		vruntime -= se->lag;
 
 	se->vruntime = vruntime;
+	set_slice(cfs_rq, se);
 }
 
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
@@ -4996,19 +5184,20 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 static void
 check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 {
-	unsigned long ideal_runtime, delta_exec;
+	unsigned long delta_exec;
 	struct sched_entity *se;
 	s64 delta;
 
-	/*
-	 * When many tasks blow up the sched_period; it is possible that
-	 * sched_slice() reports unusually large results (when many tasks are
-	 * very light for example). Therefore impose a maximum.
-	 */
-	ideal_runtime = min_t(u64, sched_slice(cfs_rq, curr), sysctl_sched_latency);
+	if (sched_feat(EEVDF)) {
+		if (pick_eevdf(cfs_rq) != curr)
+			goto preempt;
+
+		return;
+	}
 
 	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
-	if (delta_exec > ideal_runtime) {
+	if (delta_exec > curr->slice) {
+preempt:
 		resched_curr(rq_of(cfs_rq));
 		/*
 		 * The current task ran long enough, ensure it doesn't get
@@ -5032,7 +5221,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq
 	if (delta < 0)
 		return;
 
-	if (delta > ideal_runtime)
+	if (delta > curr->slice)
 		resched_curr(rq_of(cfs_rq));
 }
 
@@ -5087,17 +5276,20 @@ wakeup_preempt_entity(struct sched_entit
 static struct sched_entity *
 pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 {
-	struct sched_entity *left = __pick_first_entity(cfs_rq);
-	struct sched_entity *se;
+	struct sched_entity *left, *se;
 
-	/*
-	 * If curr is set we have to see if its left of the leftmost entity
-	 * still in the tree, provided there was anything in the tree at all.
-	 */
-	if (!left || (curr && entity_before(curr, left)))
-		left = curr;
+	if (sched_feat(EEVDF)) {
+		/*
+		 * Enabling NEXT_BUDDY will affect latency but not fairness.
+		 */
+		if (sched_feat(NEXT_BUDDY) &&
+		    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
+			return cfs_rq->next;
+
+		return pick_eevdf(cfs_rq);
+	}
 
-	se = left; /* ideally we run the leftmost entity */
+	se = left = pick_cfs(cfs_rq, curr);
 
 	/*
 	 * Avoid running the skip buddy, if running something else can
@@ -6192,13 +6384,12 @@ static inline void unthrottle_offline_cf
 static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 	SCHED_WARN_ON(task_rq(p) != rq);
 
 	if (rq->cfs.h_nr_running > 1) {
-		u64 slice = sched_slice(cfs_rq, se);
 		u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+		u64 slice = se->slice;
 		s64 delta = slice - ran;
 
 		if (delta < 0) {
@@ -7921,7 +8112,19 @@ static void check_preempt_wakeup(struct
 	if (cse_is_idle != pse_is_idle)
 		return;
 
-	update_curr(cfs_rq_of(se));
+	cfs_rq = cfs_rq_of(se);
+	update_curr(cfs_rq);
+
+	if (sched_feat(EEVDF)) {
+		/*
+		 * XXX pick_eevdf(cfs_rq) != se ?
+		 */
+		if (pick_eevdf(cfs_rq) == pse)
+			goto preempt;
+
+		return;
+	}
+
 	if (wakeup_preempt_entity(se, pse) == 1) {
 		/*
 		 * Bias pick_next to pick the sched entity that is
@@ -8167,7 +8370,7 @@ static void yield_task_fair(struct rq *r
 
 	clear_buddies(cfs_rq, se);
 
-	if (curr->policy != SCHED_BATCH) {
+	if (sched_feat(EEVDF) || curr->policy != SCHED_BATCH) {
 		update_rq_clock(rq);
 		/*
 		 * Update run-time statistics of the 'current'.
@@ -8180,6 +8383,8 @@ static void yield_task_fair(struct rq *r
 		 */
 		rq_clock_skip_update(rq);
 	}
+	if (sched_feat(EEVDF))
+		se->deadline += calc_delta_fair(se->slice, se);
 
 	set_skip_buddy(se);
 }
@@ -11923,8 +12128,8 @@ static void rq_offline_fair(struct rq *r
 static inline bool
 __entity_slice_used(struct sched_entity *se, int min_nr_tasks)
 {
-	u64 slice = sched_slice(cfs_rq_of(se), se);
 	u64 rtime = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+	u64 slice = se->slice;
 
 	return (rtime * min_nr_tasks > slice);
 }
@@ -12639,7 +12844,7 @@ static unsigned int get_rr_interval_fair
 	 * idle runqueue:
 	 */
 	if (rq->cfs.load.weight)
-		rr_interval = NS_TO_JIFFIES(sched_slice(cfs_rq_of(se), se));
+		rr_interval = NS_TO_JIFFIES(se->slice);
 
 	return rr_interval;
 }
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -103,3 +103,5 @@ SCHED_FEAT(LATENCY_WARN, false)
 
 SCHED_FEAT(ALT_PERIOD, true)
 SCHED_FEAT(BASE_SLICE, true)
+
+SCHED_FEAT(EEVDF, true)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3316,5 +3316,6 @@ static inline void switch_mm_cid(struct
 #endif
 
 extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
+extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
 
 #endif /* _KERNEL_SCHED_SCHED_H */
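
For readers not familiar with augmented rbtrees, here is a minimal
standalone sketch of the invariant pick_eevdf() relies on -- the struct
and helper below are illustrative only, not the kernel's actual rbtree
callbacks:

	typedef unsigned long long u64;	/* stand-in for the kernel type */

	struct demo_node {
		u64			deadline;
		u64			min_deadline;
		struct demo_node	*left, *right;
	};

	/*
	 * Each node caches the minimum deadline found in its subtree:
	 *
	 *   min_deadline = min(deadline, left->min_deadline, right->min_deadline)
	 *
	 * so the search can tell, per node, whether a subtree can possibly
	 * contain an earlier deadline without visiting it -- which is what
	 * makes the EDF-like descent O(log n).
	 */
	static u64 demo_min_deadline(const struct demo_node *n)
	{
		u64 min = n->deadline;

		if (n->left && n->left->min_deadline < min)
			min = n->left->min_deadline;
		if (n->right && n->right->min_deadline < min)
			min = n->right->min_deadline;

		return min;
	}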



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 00/10] sched: EEVDF using latency-nice
  2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
                   ` (9 preceding siblings ...)
  2023-03-06 13:25 ` [PATCH 10/10] sched/fair: Implement an EEVDF like policy Peter Zijlstra
@ 2023-03-07 10:27 ` Vincent Guittot
  2023-03-07 13:08   ` Peter Zijlstra
  2023-03-08 15:13 ` Shrikanth Hegde
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 35+ messages in thread
From: Vincent Guittot @ 2023-03-07 10:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel

On Mon, 6 Mar 2023 at 15:17, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Hi!
>
> Ever since looking at the latency-nice patches, I've wondered if EEVDF would
> not make more sense, and I did point Vincent at some older patches I had for
> that (which is here his augmented rbtree thing comes from).
>
> Also, since I really dislike the dual tree, I also figured we could dynamically
> switch between an augmented tree and not (and while I have code for that,
> that's not included in this posting because with the current results I don't
> think we actually need this).
>
> Anyway, since I'm somewhat under the weather, I spend last week desperately
> trying to connect a small cluster of neurons in defiance of the snot overlord
> and bring back the EEVDF patches from the dark crypts where they'd been
> gathering cobwebs for the past 13 odd years.

I haven't studied your patchset in detail yet, but at first glance this
seems to be a major rework of the CFS task placement, and the latency
part is just an add-on on top of moving to EEVDF scheduling.

>
> By friday they worked well enough, and this morning (because obviously I forgot
> the weekend is ideal to run benchmarks) I ran a bunch of hackbenck, netperf,
> tbench and sysbench -- there's a bunch of wins and losses, but nothing that
> indicates a total fail.
>
> ( in fact, some of the schbench results seem to indicate EEVDF schedules a lot
>   more consistent than CFS and has a bunch of latency wins )
>
> ( hackbench also doesn't show the augmented tree and generally more expensive
>   pick to be a loss, in fact it shows a slight win here )
>
>
>   hackbech load + cyclictest --policy other results:
>
>
>                         EEVDF                    CFS
>
>                 # Min Latencies: 00053
>   LNICE(19)     # Avg Latencies: 04350
>                 # Max Latencies: 76019
>
>                 # Min Latencies: 00052          00053
>   LNICE(0)      # Avg Latencies: 00690          00687
>                 # Max Latencies: 14145          13913
>
>                 # Min Latencies: 00019
>   LNICE(-19)    # Avg Latencies: 00261
>                 # Max Latencies: 05642
>
>
> The nice -19 numbers aren't as pretty as Vincent's, but at the end I was going
> cross-eyed from staring at tree prints and I just couldn't figure out where it
> was going side-ways.
>
> There's definitely more benchmarking/tweaking to be done (0-day already
> reported a stress-ng loss), but if we can pull this off we can delete a whole
> much of icky heuristics code. EEVDF is a much better defined policy than what
> we currently have.
>
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 00/10] sched: EEVDF using latency-nice
  2023-03-07 10:27 ` [PATCH 00/10] sched: EEVDF using latency-nice Vincent Guittot
@ 2023-03-07 13:08   ` Peter Zijlstra
  0 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-07 13:08 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
	patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
	kprateek.nayak, yu.c.chen, youssefesmat, joel

On Tue, Mar 07, 2023 at 11:27:37AM +0100, Vincent Guittot wrote:
> On Mon, 6 Mar 2023 at 15:17, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Hi!
> >
> > Ever since looking at the latency-nice patches, I've wondered if EEVDF would
> > not make more sense, and I did point Vincent at some older patches I had for
> > that (which is here his augmented rbtree thing comes from).
> >
> > Also, since I really dislike the dual tree, I also figured we could dynamically
> > switch between an augmented tree and not (and while I have code for that,
> > that's not included in this posting because with the current results I don't
> > think we actually need this).
> >
> > Anyway, since I'm somewhat under the weather, I spend last week desperately
> > trying to connect a small cluster of neurons in defiance of the snot overlord
> > and bring back the EEVDF patches from the dark crypts where they'd been
> > gathering cobwebs for the past 13 odd years.
> 
> I haven't studied your patchset in detail yet, but at first glance this
> seems to be a major rework of the CFS task placement, and the latency
> part is just an add-on on top of moving to EEVDF scheduling.

It completely reworks the base scheduler -- placement, preemption,
picking, everything. The only thing they have in common is that they're
both virtual-time based schedulers.

The big advantage I see is that EEVDF is fairly well known and studied,
and a much better defined scheduler than WFQ. Specifically, WFQ is only
well defined in how much time is given to any task (bandwidth), but it
says nothing about how that time is distributed. That is, there is no
native preemption condition/constraint etc. -- all the code we have for
that is mostly random heuristics.

The WF2Q/EEVDF class of schedulers otoh *do* define all that. There is a
lot less wiggle room as a result. The avg_vruntime / placement stuff I
did is fundamental to how it controls bandwidth distribution and
guarantees the WFQ subset. Specifically, by limiting the pick to that
subset of tasks that has positive lag (owed time), it guarantees this
fairness -- but that means we need a working measure of lag.

Similarly, since the whole 'when' thing is well defined in order to
provide the additional latency goals of these schedulers, placement is
crucial. Things like sleeper bonus is fundamentally incompatible with
latency guarantees -- both affect the 'when'.
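
To make 'lag' and eligibility concrete, a minimal sketch (names and types
here are illustrative; the actual kernel code avoids the division by
keeping weighted sums around):

	/*
	 * V     = (\Sum w_j * v_j) / (\Sum w_j)   -- weighted avg vruntime
	 * lag_i = V - v_i                         -- service owed to entity i
	 *
	 * An entity may be picked (is eligible) only when it is not ahead
	 * of the average, i.e. lag_i >= 0.
	 */
	static inline int demo_entity_eligible(long long avg_vruntime,
					       long long vruntime)
	{
		return avg_vruntime - vruntime >= 0;
	}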


Initial EEVDF paper is here:

https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=805acf7726282721504c8f00575d91ebfd750564

It contains a few 'mistakes' and oversights, but those should not
matter.

Anyway, I'm still struggling to make complete sense of what you did --
will continue to stare at that.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 10/10] sched/fair: Implement an EEVDF like policy
  2023-03-06 13:25 ` [PATCH 10/10] sched/fair: Implement an EEVDF like policy Peter Zijlstra
@ 2023-03-08  8:39   ` Mike Galbraith
  2023-03-08  9:26     ` Mike Galbraith
  2023-03-08 13:36     ` Mike Galbraith
  0 siblings, 2 replies; 35+ messages in thread
From: Mike Galbraith @ 2023-03-08  8:39 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, vincent.guittot
  Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, corbet, qyousef, chris.hyser, patrick.bellasi,
	pjt, pavel, qperret, tim.c.chen, joshdon, timj, kprateek.nayak,
	yu.c.chen, youssefesmat, joel

On Mon, 2023-03-06 at 14:25 +0100, Peter Zijlstra wrote:
> Where CFS is currently a WFQ based scheduler with only a single knob,
> the weight. The addition of a second, latency oriented parameter,
> makes something like WF2Q or EEVDF based a much better fit.
>
> Specifically, EEVDF does EDF like scheduling in the left half of the
> tree -- those entities that are owed service. Except because this is a
> virtual time scheduler, the deadlines are in virtual time as well,
> which is what allows over-subscription.

Curiosity got the best of me, and I stuffed this series into master and
did a little light testing.  Unsurprisingly, the numbers move about as
they are wont to do when you diddle anything in sched-land; most notable
were the UDP_STREAM numbers, which moved a LOT.

Another thing I found interesting was the comparison of massive_intr
vs a youtube clip load.  I was expecting less scheduling from the +eevdf
tree, but got more in total and more total wait time.

Season to taste or just toss: I didn't start with a virgin tree, though
local hacks *should* be meaningless to tbench and netperf, and while
they reduce stacking, they make zero perceptible difference to the
desktop, and are common to both trees.

Some numbers:
box is old i4790 desktop box

30 sec tbench
6.3.0.g8ca09d5
Throughput 3655.33 MB/sec  8 clients  8 procs  max_latency=16.322 ms
Throughput 3651.73 MB/sec  8 clients  8 procs  max_latency=16.260 ms
Throughput 3627.60 MB/sec  8 clients  8 procs  max_latency=16.262 ms
6.3.0.g8ca09d5+eevdf
Throughput 3497.39 MB/sec  8 clients  8 procs  max_latency=12.380 ms
Throughput 3403.28 MB/sec  8 clients  8 procs  max_latency=12.296 ms
Throughput 3466.86 MB/sec  8 clients  8 procs  max_latency=12.349 ms
avg vs avg     .948


netperf                   cfs +eevdf  vs cfs
TCP_SENDFILE-1    Avg:  89258  92080   1.031
TCP_SENDFILE-2    Avg:  83271  83371   1.001
TCP_SENDFILE-4    Avg:  56395  53011    .939
TCP_SENDFILE-8    Avg:  26389  39470   1.495
TCP_SENDFILE-16   Avg:  10251  19590   1.911
TCP_STREAM-1      Avg:  72583  71276    .981
TCP_STREAM-2      Avg:  59627  54424    .912
TCP_STREAM-4      Avg:  33310  22717    .681
TCP_STREAM-8      Avg:   8062   7718    .957
TCP_STREAM-16     Avg:   5143   3726    .724
TCP_MAERTS-1      Avg:  72897  71052    .974
TCP_MAERTS-2      Avg:  55065  60022   1.090
TCP_MAERTS-4      Avg:  45525  26531    .582
TCP_MAERTS-8      Avg:  11435   7937    .694
TCP_MAERTS-16     Avg:    5437  3351    .616
TCP_RR-1          Avg: 192766 180505    .936
TCP_RR-2          Avg: 169043 164731    .974
TCP_RR-4          Avg: 115702 110938    .958
TCP_RR-8          Avg: 113085 111775    .988
TCP_RR-16         Avg:  55226  53439    .967
UDP_RR-1          Avg: 261076 242027    .927
UDP_RR-2          Avg: 224808 221913    .987
UDP_RR-4          Avg: 158232 155162    .980
UDP_RR-8          Avg: 152255 149527    .982
UDP_RR-16         Avg:  75148  72739    .967
UDP_STREAM-1      Avg: 102612 102728   1.001 hmm: UDP_STREAM deltas verified repeatable
UDP_STREAM-2      Avg:  93774  93563    .997
UDP_STREAM-4      Avg:  54728  55702   1.017
UDP_STREAM-8      Avg:  30153  63329   2.100
UDP_STREAM-16     Avg:  12997  31644   2.434

(if your mailer don't do wide, this is gonna make a mess)

youtube BigBuckBunny vs 8 hogs (ancient 8ms run 1ms sleep cpu distribution checker)
massive_intr 8 9999& perf sched record -- firefox https://www.youtube.com/watch?v=aqz-KE-bpKQ& sleep 300;killall massive_intr firefox
note: perf obviously twiddled to add 'Sum delay'.

6.3.0.g8ca09d5
 --------------------------------------------------------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Sum delay ms     | Max delay start           | Max delay end          |
 --------------------------------------------------------------------------------------------------------------------------------------------------------------
  Renderer:5410         | 201867.875 ms |   106745 | avg:   0.468 ms | max:  22.465 ms | sum:49998.624 ms | max start:   100.880096 s | max end:   100.902561 s
  SwComposite:5453      | 189304.441 ms |    73405 | avg:   0.595 ms | max:  23.934 ms | sum:43657.252 ms | max start:   155.074533 s | max end:   155.098467 s
  massive_intr:5571     | 179445.480 ms |    79984 | avg:   0.705 ms | max:  22.822 ms | sum:56395.232 ms | max start:   289.707867 s | max end:   289.730689 s
  massive_intr:5569     | 179337.834 ms |    80101 | avg:   0.711 ms | max:  25.517 ms | sum:56965.103 ms | max start:   103.652981 s | max end:   103.678498 s
  massive_intr:5575     | 179190.634 ms |    79433 | avg:   0.720 ms | max:  18.873 ms | sum:57223.508 ms | max start:   223.520540 s | max end:   223.539413 s
  massive_intr:5568     | 179081.476 ms |    77118 | avg:   0.742 ms | max:  21.715 ms | sum:57188.088 ms | max start:    90.071642 s | max end:    90.093358 s
  massive_intr:5574     | 179079.481 ms |    78924 | avg:   0.723 ms | max:  23.821 ms | sum:57024.111 ms | max start:   345.637059 s | max end:   345.660881 s
  massive_intr:5570     | 178945.128 ms |    79466 | avg:   0.724 ms | max:  22.105 ms | sum:57554.610 ms | max start:    90.511319 s | max end:    90.533423 s
  massive_intr:5572     | 178867.227 ms |    80016 | avg:   0.717 ms | max:  25.140 ms | sum:57393.314 ms | max start:   193.801127 s | max end:   193.826267 s
  massive_intr:5573     | 178560.808 ms |    79764 | avg:   0.719 ms | max:  40.232 ms | sum:57338.388 ms | max start:    87.446558 s | max end:    87.486790 s
  X:2492                |  86585.854 ms |    17924 | avg:   0.294 ms | max:  20.053 ms | sum: 5262.780 ms | max start:   106.798324 s | max end:   106.818377 s
  llvmpipe-7:2870       |  36803.778 ms |     7268 | avg:   1.727 ms | max:  30.168 ms | sum:12548.310 ms | max start:    87.406812 s | max end:    87.436981 s
  llvmpipe-3:2866       |  35004.654 ms |     6161 | avg:   1.385 ms | max:  19.992 ms | sum: 8531.873 ms | max start:    87.410811 s | max end:    87.430803 s
  llvmpipe-1:2864       |  34615.309 ms |     6423 | avg:   1.238 ms | max:  21.740 ms | sum: 7954.871 ms | max start:    93.834245 s | max end:    93.855985 s
  llvmpipe-2:2865       |  34375.917 ms |     6205 | avg:   1.273 ms | max:  22.031 ms | sum: 7897.655 ms | max start:    87.414812 s | max end:    87.436843 s
  llvmpipe-0:2863       |  32479.993 ms |     8472 | avg:   0.906 ms | max:  18.145 ms | sum: 7674.587 ms | max start:   156.041479 s | max end:   156.059624 s
  llvmpipe-5:2868       |  32284.589 ms |     5668 | avg:   1.271 ms | max:  21.562 ms | sum: 7203.223 ms | max start:    98.798222 s | max end:    98.819784 s
  llvmpipe-6:2869       |  31752.624 ms |     5689 | avg:   1.241 ms | max:  18.067 ms | sum: 7057.222 ms | max start:    87.422817 s | max end:    87.440885 s
  llvmpipe-4:2867       |  31621.552 ms |     5298 | avg:   1.327 ms | max:  21.903 ms | sum: 7029.350 ms | max start:    87.418812 s | max end:    87.440715 s
  MediaPD~oder #1:5910  |  24623.459 ms |     7900 | avg:   0.455 ms | max:  18.267 ms | sum: 3596.813 ms | max start:   143.181740 s | max end:   143.200008 s
  MediaPD~oder #1:5908  |  24498.697 ms |     7698 | avg:   0.470 ms | max:  24.616 ms | sum: 3614.831 ms | max start:   222.568707 s | max end:   222.593322 s
  MediaPD~oder #1:5909  |  24447.400 ms |     7744 | avg:   0.476 ms | max:  22.047 ms | sum: 3683.826 ms | max start:   234.216798 s | max end:   234.238845 s
  MediaPD~oder #1:5907  |  24349.888 ms |     7819 | avg:   0.437 ms | max:  19.288 ms | sum: 3413.911 ms | max start:   131.920097 s | max end:   131.939385 s
  Isolated Web Co:5457  |  10982.274 ms |     5768 | avg:   0.295 ms | max:  25.414 ms | sum: 1701.308 ms | max start:    91.801774 s | max end:    91.827188 s
...
 ------------------------------------------------------------------------------------------------------------
  TOTAL:                |2370376.486 ms |  1189527 |                                     |    654188.153 ms |
 ------------------------------------------------------------------------------------------------------------

6.3.0.g8ca09d5 +eevdf
 --------------------------------------------------------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Sum delay ms     | Max delay start           | Max delay end          |
 --------------------------------------------------------------------------------------------------------------------------------------------------------------
  Renderer:5675         | 211250.021 ms |    84700 | avg:   0.546 ms | max:  20.503 ms | sum:46225.011 ms | max start:   317.446170 s | max end:   317.466673 s
  SwComposite:5719      | 205804.660 ms |    66043 | avg:   0.685 ms | max:  19.871 ms | sum:45214.452 ms | max start:   119.470385 s | max end:   119.490256 s
  massive_intr:5838     | 195285.673 ms |    75458 | avg:   0.885 ms | max:  18.171 ms | sum:66793.033 ms | max start:   177.270747 s | max end:   177.288919 s
  massive_intr:5835     | 195217.246 ms |    75005 | avg:   0.884 ms | max:  18.211 ms | sum:66340.670 ms | max start:   340.966607 s | max end:   340.984818 s
  massive_intr:5836     | 195148.073 ms |    74723 | avg:   0.891 ms | max:  22.868 ms | sum:66544.981 ms | max start:    92.619771 s | max end:    92.642639 s
  massive_intr:5840     | 195093.638 ms |    75229 | avg:   0.886 ms | max:  21.502 ms | sum:66630.906 ms | max start:    96.715761 s | max end:    96.737263 s
  massive_intr:5837     | 195081.767 ms |    74906 | avg:   0.890 ms | max:  18.384 ms | sum:66672.064 ms | max start:   157.736916 s | max end:   157.755300 s
  massive_intr:5839     | 195067.653 ms |    74433 | avg:   0.892 ms | max:  17.731 ms | sum:66391.236 ms | max start:   327.225658 s | max end:   327.243389 s
  massive_intr:5841     | 194947.303 ms |    75468 | avg:   0.890 ms | max:  19.250 ms | sum:67155.572 ms | max start:    94.936517 s | max end:    94.955767 s
  massive_intr:5834     | 194820.690 ms |    74531 | avg:   0.901 ms | max:  16.593 ms | sum:67172.443 ms | max start:    87.268034 s | max end:    87.284627 s
  X:2673                |  55537.286 ms |    24293 | avg:   0.624 ms | max:  19.071 ms | sum:15149.642 ms | max start:   130.940240 s | max end:   130.959311 s
  MediaPD~oder #1:6190  |  24885.689 ms |    19835 | avg:   0.576 ms | max:  16.064 ms | sum:11424.368 ms | max start:   274.211226 s | max end:   274.227290 s
  MediaPD~oder #1:6188  |  24802.432 ms |    19475 | avg:   0.589 ms | max:  24.496 ms | sum:11465.965 ms | max start:   255.251754 s | max end:   255.276250 s
  MediaPD~oder #1:6189  |  24784.917 ms |    19684 | avg:   0.573 ms | max:  18.250 ms | sum:11277.622 ms | max start:   263.644372 s | max end:   263.662622 s
  MediaPD~oder #1:6187  |  24660.751 ms |    19613 | avg:   0.572 ms | max:  19.865 ms | sum:11221.213 ms | max start:   172.852720 s | max end:   172.872585 s
  llvmpipe-6:3047       |  18633.251 ms |     7545 | avg:   3.428 ms | max:  27.450 ms | sum:25864.293 ms | max start:   150.773081 s | max end:   150.800531 s
  llvmpipe-7:3048       |  18414.068 ms |     8024 | avg:   3.880 ms | max:  23.249 ms | sum:31135.265 ms | max start:   150.776388 s | max end:   150.799637 s
  llvmpipe-5:3046       |  17914.117 ms |     7336 | avg:   3.429 ms | max:  19.998 ms | sum:25155.781 ms | max start:   137.463848 s | max end:   137.483845 s
  llvmpipe-3:3044       |  17669.019 ms |     7913 | avg:   3.329 ms | max:  21.280 ms | sum:26340.572 ms | max start:   232.804014 s | max end:   232.825294 s
  llvmpipe-4:3045       |  17539.666 ms |     7438 | avg:   3.353 ms | max:  22.802 ms | sum:24936.398 ms | max start:    94.800014 s | max end:    94.822817 s
  llvmpipe-0:3041       |  17428.494 ms |     9445 | avg:   2.663 ms | max:  29.456 ms | sum:25153.007 ms | max start:    96.231519 s | max end:    96.260975 s
  llvmpipe-2:3043       |  17239.204 ms |     7962 | avg:   3.282 ms | max:  24.674 ms | sum:26133.925 ms | max start:   161.019506 s | max end:   161.044179 s
  llvmpipe-1:3042       |  17143.242 ms |     8261 | avg:   3.118 ms | max:  22.086 ms | sum:25756.911 ms | max start:   379.521740 s | max end:   379.543826 s
  Isolated Web Co:5723  |  11075.262 ms |    15573 | avg:   0.280 ms | max:  23.996 ms | sum: 4360.925 ms | max start:   230.161012 s | max end:   230.185008 s
...
 ------------------------------------------------------------------------------------------------------------
  TOTAL:                |2368428.034 ms |  2114759 |                                     |   1048388.278 ms |
 ------------------------------------------------------------------------------------------------------------
vs 6.3.0.g8ca09d5               .999         1.777                                                 1.602


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 10/10] sched/fair: Implement an EEVDF like policy
  2023-03-08  8:39   ` Mike Galbraith
@ 2023-03-08  9:26     ` Mike Galbraith
  2023-03-08 13:36     ` Mike Galbraith
  1 sibling, 0 replies; 35+ messages in thread
From: Mike Galbraith @ 2023-03-08  9:26 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, vincent.guittot
  Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, corbet, qyousef, chris.hyser, patrick.bellasi,
	pjt, pavel, qperret, tim.c.chen, joshdon, timj, kprateek.nayak,
	yu.c.chen, youssefesmat, joel

On Wed, 2023-03-08 at 09:39 +0100, Mike Galbraith wrote:
> 
> netperf                   cfs +eevdf  vs cfs
> TCP_SENDFILE-1    Avg:  89258  92080   1.031
> TCP_SENDFILE-2    Avg:  83271  83371   1.001
> TCP_SENDFILE-4    Avg:  56395  53011    .939
> TCP_SENDFILE-8    Avg:  26389  39470   1.495
> TCP_SENDFILE-16   Avg:  10251  19590   1.911

Forgot to recheck this not-so-modest clients >= CPUS win; it's also
repeatable.

	-Mike

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 10/10] sched/fair: Implement an EEVDF like policy
  2023-03-08  8:39   ` Mike Galbraith
  2023-03-08  9:26     ` Mike Galbraith
@ 2023-03-08 13:36     ` Mike Galbraith
  2023-03-09  4:23       ` Mike Galbraith
  2023-03-09  9:06       ` Peter Zijlstra
  1 sibling, 2 replies; 35+ messages in thread
From: Mike Galbraith @ 2023-03-08 13:36 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, vincent.guittot
  Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, corbet, qyousef, chris.hyser, patrick.bellasi,
	pjt, pavel, qperret, tim.c.chen, joshdon, timj, kprateek.nayak,
	yu.c.chen, youssefesmat, joel

On Wed, 2023-03-08 at 09:39 +0100, Mike Galbraith wrote:
>
> Curiosity got the best of me...

Remember this little bugger, allegedly distilled from a real
application control thread starvation issue?

6.3.0.g8ca09d5-master
homer:/root # time taskset -c 3 starve
expecting to receive 10000000 signals

real    0m24.424s
user    0m4.468s
sys     0m18.957s

6.3.0.g8ca09d5-eevdf
homer:/root # time taskset -c 3 starve
expecting to receive 10000000 signals
zzzzzz
^C

real    15m24.115s
user    0m3.078s
sys     0m0.000s


#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <unistd.h>

#include <sys/types.h>
#include <sys/wait.h>

volatile unsigned long loop;

/* SIGUSR1 handler: counts down the number of signals still expected */
void handler(int n)
{
	if (loop > 0)
		--loop;
}

/* child: floods the parent with SIGUSR1 as fast as it can */
static int child(void)
{
	pid_t ppid = getppid();

	sleep(1);
	while (1)
		kill(ppid, SIGUSR1);
	return 0;
}

int main(int argc, char **argv)
{
	pid_t child_pid;
	int r;

	loop = argc > 1 ? strtoul(argv[1], NULL, 10) : 10000000;
	printf("expecting to receive %lu signals\n", loop);

	if ((child_pid = fork()) == 0)
		exit(child());

	signal(SIGUSR1, handler);
	while (loop)
		sleep(1);
	r = kill(child_pid, SIGTERM);
	waitpid(child_pid, NULL, 0);
	return 0;
}
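
(For anyone wanting to reproduce: the above is self-contained and, assuming
a plain toolchain, should build and run with something like

	cc -O2 -o starve starve.c
	time taskset -c 3 ./starve 10000000

matching the invocations earlier in the thread.)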


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 00/10] sched: EEVDF using latency-nice
  2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
                   ` (10 preceding siblings ...)
  2023-03-07 10:27 ` [PATCH 00/10] sched: EEVDF using latency-nice Vincent Guittot
@ 2023-03-08 15:13 ` Shrikanth Hegde
  2023-03-22  6:49 ` K Prateek Nayak
  2023-03-23 11:53 ` Pavel Machek
  13 siblings, 0 replies; 35+ messages in thread
From: Shrikanth Hegde @ 2023-03-08 15:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, corbet, qyousef, chris.hyser, patrick.bellasi,
	pjt, pavel, qperret, tim.c.chen, joshdon, timj, kprateek.nayak,
	yu.c.chen, youssefesmat, joel, mingo, vincent.guittot




> Hi!
>
> Ever since looking at the latency-nice patches, I've wondered if EEVDF would
> not make more sense, and I did point Vincent at some older patches I had for
> that (which is here his augmented rbtree thing comes from).
>
> Also, since I really dislike the dual tree, I also figured we could dynamically
> switch between an augmented tree and not (and while I have code for that,
> that's not included in this posting because with the current results I don't
> think we actually need this).
>
> Anyway, since I'm somewhat under the weather, I spend last week desperately
> trying to connect a small cluster of neurons in defiance of the snot overlord
> and bring back the EEVDF patches from the dark crypts where they'd been
> gathering cobwebs for the past 13 odd years.
>
> By friday they worked well enough, and this morning (because obviously I forgot
> the weekend is ideal to run benchmarks) I ran a bunch of hackbenck, netperf,
> tbench and sysbench -- there's a bunch of wins and losses, but nothing that
> indicates a total fail.
>
> ( in fact, some of the schbench results seem to indicate EEVDF schedules a lot
>   more consistent than CFS and has a bunch of latency wins )
>
> ( hackbench also doesn't show the augmented tree and generally more expensive
>   pick to be a loss, in fact it shows a slight win here )
>
>
>   hackbech load + cyclictest --policy other results:
>
>
> 			EEVDF			 CFS
>
> 		# Min Latencies: 00053
>   LNICE(19)	# Avg Latencies: 04350
> 		# Max Latencies: 76019
>
> 		# Min Latencies: 00052		00053
>   LNICE(0)	# Avg Latencies: 00690		00687
> 		# Max Latencies: 14145		13913
>
> 		# Min Latencies: 00019
>   LNICE(-19)	# Avg Latencies: 00261
> 		# Max Latencies: 05642
>
>
> The nice -19 numbers aren't as pretty as Vincent's, but at the end I was going
> cross-eyed from staring at tree prints and I just couldn't figure out where it
> was going side-ways.
>
> There's definitely more benchmarking/tweaking to be done (0-day already
> reported a stress-ng loss), but if we can pull this off we can delete a whole
> much of icky heuristics code. EEVDF is a much better defined policy than what
> we currently have.
>
Tested the patch series on powerpc systems. The test is done in the same way
as was done for Vincent's V12 series.

Two cgroups are created. cgroup2 runs stress-ng -l 50 --cpu=<total_cpu> and
cgroup1 runs the micro benchmarks; different latency-nice values are assigned
to cgroup1.
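
Roughly, the setup looks like this (sketch only; the cpu.latency.nice file
name is assumed from the cgroup support in the latency-nice series, adjust
to whatever knob your tree actually exposes):

	cd /sys/fs/cgroup
	mkdir cgroup1 cgroup2

	# cgroup2: background load, 50% on every CPU
	echo $$ > cgroup2/cgroup.procs
	stress-ng -l 50 --cpu=$(nproc) &

	# cgroup1: the benchmark, with the latency value under test
	echo $$ > cgroup1/cgroup.procs
	echo -20 > cgroup1/cpu.latency.nice
	# ... then run hackbench / schbench as in the tables below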

Tested on two different systems: one with 480 CPUs and the other with 96
CPUs.

++++++++
Summary:
++++++++
For hackbench, the 480 CPU system shows good improvement.
The 96 CPU system shows the same numbers as 6.2. The smaller system was showing
regressions with Vincent's V12 series, as discussed there; with this patch
series there is no regression.

Schbench shows good improvement compared to v6.2 at LN=0 or LN=-20, whereas
at LN=19 it shows a regression.

Please suggest whether any variation of the benchmark, or a different benchmark,
should be run.


++++++++++++++++++
480 CPU system
++++++++++++++++++

==========
schbench
==========

		 v6.2		|  v6.2+LN=0    |  v6.2+LN=-20 |  v6.2+LN=19
1 Thread
  50.0th:	 14.00          |   12.00      |   14.50       |   15.00
  75.0th:	 16.50          |   14.50      |   17.00       |   18.00
  90.0th:	 18.50          |   17.00      |   19.50       |   20.00
  95.0th:	 20.50          |   18.50      |   22.00       |   23.50
  99.0th:	 27.50          |   24.50      |   31.50       |   155.00
  99.5th:	 36.00          |   30.00      |   44.50       |   2991.00
  99.9th:	 81.50          |   171.50     |   153.00      |   4621.00
2 Threads
  50.0th:	 14.00          |   15.50      |   17.00       |   16.00
  75.0th:	 17.00          |   18.00      |   19.00       |   19.00
  90.0th:	 20.00          |   21.00      |   22.00       |   22.50
  95.0th:	 23.00          |   23.00      |   25.00       |   25.50
  99.0th:	 71.00          |   30.50      |   35.50       |   990.50
  99.5th:	 1170.00        |   53.00      |   71.00       |   3719.00
  99.9th:	 5088.00        |   245.50     |   138.00      |   6644.00
4 Threads
  50.0th:	 20.50          |   20.00      |   20.00       |   19.50
  75.0th:	 24.50          |   23.00      |   23.00       |   23.50
  90.0th:	 31.00          |   27.00      |   26.50       |   27.50
  95.0th:	 260.50         |   29.50      |   29.00       |   35.00
  99.0th:	 3644.00        |   106.00     |   37.50       |   2884.00
  99.5th:	 5152.00        |   227.00     |   92.00       |   5496.00
  99.9th:	 8076.00        |   3662.50    |   517.00      |   8640.00
8 Threads
  50.0th:	 26.00          |   23.50      |   22.50       |   25.00
  75.0th:	 32.50          |   29.50      |   27.50       |   31.00
  90.0th:	 41.50          |   34.50      |   31.50       |   39.00
  95.0th:	 794.00         |   37.00      |   34.50       |   579.50
  99.0th:	 5992.00        |   48.50      |   52.00       |   5872.00
  99.5th:	 7208.00        |   100.50     |   97.50       |   7280.00
  99.9th:	 9392.00        |   4098.00    |   1226.00     |   9328.00
16 Threads
  50.0th:	 37.50          |   33.00      |   34.00       |   37.00
  75.0th:	 49.50          |   43.50      |   44.00       |   49.00
  90.0th:	 70.00          |   52.00      |   53.00       |   66.00
  95.0th:	 1284.00        |   57.50      |   59.00       |   1162.50
  99.0th:	 5600.00        |   79.50      |   111.50      |   5912.00
  99.5th:	 7216.00        |   282.00     |   194.50      |   7392.00
  99.9th:	 9328.00        |   4026.00    |   2009.00     |   9440.00
32 Threads
  50.0th:	 59.00          |   56.00      |   57.00       |   59.00
  75.0th:	 83.00          |   77.50      |   79.00       |   83.00
  90.0th:	 118.50         |   94.00      |   95.00       |   120.50
  95.0th:	 1921.00        |   104.50     |   104.00      |   1800.00
  99.0th:	 6672.00        |   425.00     |   255.00      |   6384.00
  99.5th:	 8252.00        |   2800.00    |   1252.00     |   7696.00
  99.9th:	 10448.00       |   7264.00    |   5888.00     |   9504.00

=========
hackbench
=========

Type	      groups    v6.2	  | v6.2+LN=0  | v6.2+LN=-20 | v6.2+LN=19
Process		 10	 0.19   |   0.18      |   0.17      |   0.18
Process		 20 	 0.34   |   0.32      |   0.33      |   0.31
Process		 30 	 0.45   |   0.42      |   0.43      |   0.43
Process		 40 	 0.58   |   0.53      |   0.53      |   0.53
Process		 50 	 0.70   |   0.64      |   0.64      |   0.65
Process		 60 	 0.82   |   0.74      |   0.75      |   0.76
thread 		 10 	 0.20   |   0.19      |   0.19      |   0.19
thread 		 20 	 0.36   |   0.34      |   0.34      |   0.34
Process(Pipe)	 10 	 0.24   |   0.15      |   0.15      |   0.15
Process(Pipe)	 20 	 0.46   |   0.22      |   0.22      |   0.21
Process(Pipe)	 30 	 0.65   |   0.30      |   0.29      |   0.29
Process(Pipe)	 40 	 0.90   |   0.35      |   0.36      |   0.34
Process(Pipe)	 50 	 1.04   |   0.38      |   0.39      |   0.38
Process(Pipe)	 60 	 1.16   |   0.42      |   0.42      |   0.43
thread(Pipe)	 10 	 0.19   |   0.13      |   0.13      |   0.13
thread(Pipe)	 20 	 0.46   |   0.21      |   0.21      |   0.21


++++++++++++++++++
96 CPU system
++++++++++++++++++

===========
schbench
===========
		 v6.2        |  v6.2+LN=0    |  v6.2+LN=-20 |  v6.2+LN=19
1 Thread
  50.0th:	 10.50       |   10.00       |   10.00      |   11.00
  75.0th:	 12.50       |   11.50       |   11.50      |   12.50
  90.0th:	 15.00       |   13.00       |   13.50      |   16.50
  95.0th:	 47.50       |   15.00       |   15.00      |   274.50
  99.0th:	 4744.00     |   17.50       |   18.00      |   5032.00
  99.5th:	 7640.00     |   18.50       |   525.00     |   6636.00
  99.9th:	 8916.00     |   538.00      |   6704.00    |   9264.00
2 Threads
  50.0th:	 11.00       |   10.00       |   11.00      |   11.00
  75.0th:	 13.50       |   12.00       |   12.50      |   13.50
  90.0th:	 17.00       |   14.00       |   14.00      |   17.00
  95.0th:	 451.50      |   16.00       |   15.50      |   839.00
  99.0th:	 5488.00     |   20.50       |   18.00      |   6312.00
  99.5th:	 6712.00     |   986.00      |   19.00      |   7664.00
  99.9th:	 9856.00     |   4913.00     |   1154.00    |   8736.00
4 Threads
  50.0th:	 13.00       |   12.00       |   12.00      |   13.00
  75.0th:	 15.00       |   14.00       |   14.00      |   15.00
  90.0th:	 23.50       |   16.00       |   16.00      |   20.00
  95.0th:	 2508.00     |   17.50       |   17.50      |   1818.00
  99.0th:	 7232.00     |   777.00      |   38.50      |   5952.00
  99.5th:	 8720.00     |   3548.00     |   1926.00    |   7788.00
  99.9th:	 10352.00    |   6320.00     |   7160.00    |   10000.00
8 Threads
  50.0th:	 16.00       |   15.00       |   15.00      |   16.00
  75.0th:	 20.00       |   18.00       |   18.00      |   19.50
  90.0th:	 371.50      |   20.00       |   21.00      |   245.50
  95.0th:	 2992.00     |   22.00       |   23.00      |   2608.00
  99.0th:	 7784.00     |   1084.50     |   563.50     |   7136.00
  99.5th:	 9488.00     |   2612.00     |   2696.00    |   8720.00
  99.9th:	 15568.00    |   6656.00     |   7496.00    |   10000.00
16 Threads
  50.0th:	 23.00       |   21.00       |   20.00      |   22.50
  75.0th:	 31.00       |   27.50       |   26.00      |   29.50
  90.0th:	 1981.00     |   32.50       |   30.50      |   1500.50
  95.0th:	 4856.00     |   304.50      |   34.00      |   4046.00
  99.0th:	 10112.00    |   5720.00     |   4590.00    |   8220.00
  99.5th:	 13104.00    |   7828.00     |   7008.00    |   9312.00
  99.9th:	 18624.00    |   9856.00     |   9504.00    |   11984.00
32 Threads
  50.0th:	 36.50       |   34.50       |   33.50      |   35.50
  75.0th:	 56.50       |   48.00       |   46.00      |   52.50
  90.0th:	 4728.00     |   1470.50     |   376.00     |   3624.00
  95.0th:	 7808.00     |   4130.00     |   3850.00    |   6488.00
  99.0th:	 15776.00    |   8972.00     |   9060.00    |   9872.00
  99.5th:	 19072.00    |   11328.00    |   12224.00   |   11520.00
  99.9th:	 28864.00    |   18016.00    |   18368.00   |   18848.00


==========
Hackbench
=========

Type	      groups    v6.2	  | v6.2+LN=0  | v6.2+LN=-20 | v6.2+LN=19
Process		10	0.33      |   0.33     |   0.33      |   0.33
Process		20	0.61      |   0.56     |   0.58      |   0.57
Process		30	0.87      |   0.82     |   0.81      |   0.81
Process		40	1.10      |   1.05     |   1.06      |   1.05
Process		50	1.34      |   1.28     |   1.29      |   1.29
Process		60	1.58      |   1.53     |   1.52      |   1.51
thread		10	0.36      |   0.35     |   0.35      |   0.35
thread		20	0.64	  |   0.63     |   0.62      |   0.62
Process(Pipe)	10	0.18      |   0.18     |   0.18      |   0.17
Process(Pipe)   20	0.32      |   0.31     |   0.31      |   0.31
Process(Pipe)   30	0.42      |   0.41     |   0.41      |   0.42
Process(Pipe)   40	0.56      |   0.53     |   0.55      |   0.53
Process(Pipe)   50	0.68      |   0.66     |   0.66      |   0.66
Process(Pipe)   60	0.80      |   0.78     |   0.78      |   0.78
thread(Pipe)	10	0.20      |   0.18     |   0.19      |   0.18
thread(Pipe)	20	0.34      |   0.34     |   0.33      |   0.33

Tested-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 10/10] sched/fair: Implement an EEVDF like policy
  2023-03-08 13:36     ` Mike Galbraith
@ 2023-03-09  4:23       ` Mike Galbraith
  2023-03-10 20:38         ` Peter Zijlstra
  2023-03-09  9:06       ` Peter Zijlstra
  1 sibling, 1 reply; 35+ messages in thread
From: Mike Galbraith @ 2023-03-09  4:23 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, vincent.guittot
  Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, corbet, qyousef, chris.hyser, patrick.bellasi,
	pjt, pavel, qperret, tim.c.chen, joshdon, timj, kprateek.nayak,
	yu.c.chen, youssefesmat, joel

On Wed, 2023-03-08 at 14:36 +0100, Mike Galbraith wrote:
>
> Remember this little bugger, allegedly distilled from a real
> application control thread starvation issue?
>
> 6.3.0.g8ca09d5-master
> homer:/root # time taskset -c 3 starve      
> expecting to receive 10000000 signals
>
> real    0m24.424s
> user    0m4.468s
> sys     0m18.957s
>
> 6.3.0.g8ca09d5-eevdf
> homer:/root # time taskset -c 3 starve
> expecting to receive 10000000 signals
> zzzzzz
> ^C

Ok, seems there must be a math booboo lurking.

virgin source, 100% hog vs tbench buddy pair, all pinned.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND
 5060 root      20   0    4420    680    680 R 96.01 0.004   0:41.40 3 cpuhog
 5058 root      20   0   25500   1920   1792 S 2.326 0.012   0:01.05 3 tbench
 5059 root      20   0    8796    896    768 R 1.661 0.006   0:00.78 3 tbench_srv

echo NO_PRESERVE_LAG > features

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND
 5060 root      20   0    4420    680    680 R 99.33 0.004   1:28.24 3 cpuhog
 5058 root      20   0   25500   1920   1792 R 0.333 0.012   0:01.75 3 tbench
 5059 root      20   0    8796    896    768 S 0.333 0.006   0:01.30 3 tbench_srv


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 10/10] sched/fair: Implement an EEVDF like policy
  2023-03-08 13:36     ` Mike Galbraith
  2023-03-09  4:23       ` Mike Galbraith
@ 2023-03-09  9:06       ` Peter Zijlstra
  2023-03-09 12:44         ` Peter Zijlstra
  1 sibling, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-09  9:06 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel

Hi Mike!

On Wed, Mar 08, 2023 at 02:36:01PM +0100, Mike Galbraith wrote:
> On Wed, 2023-03-08 at 09:39 +0100, Mike Galbraith wrote:
> >
> > Curiosity got the best of me...
> 
> Remember this little bugger, allegedly distilled from a real
> application control thread starvation issue?

Oooh, yeah, I should still have that somewhere. I'll try and remember
what exactly was needed to make it behave properly.

Thanks for your feedback. I'll go prod at things a bit more. Like I
wrote, the code is more or less at the point where it stopped crashing
and doing obviously bad things. It definitely needs more attention.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 10/10] sched/fair: Implement an EEVDF like policy
  2023-03-09  9:06       ` Peter Zijlstra
@ 2023-03-09 12:44         ` Peter Zijlstra
  2023-03-09 15:29           ` Peter Zijlstra
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-09 12:44 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel

On Thu, Mar 09, 2023 at 10:06:33AM +0100, Peter Zijlstra wrote:
> Hi Mike!
> 
> On Wed, Mar 08, 2023 at 02:36:01PM +0100, Mike Galbraith wrote:
> > On Wed, 2023-03-08 at 09:39 +0100, Mike Galbraith wrote:
> > >
> > > Curiosity got the best of me...
> > 
> > Remember this little bugger, allegedly distilled from a real
> > application control thread starvation issue?
> 
> Oooh, yeah, I should still have that somewhere. I'll try and remember
> what exactly was needed to make it behave properly.

That thing wants both wakeup preemption and sleeper bonus. Specifically,
it needs the signal to insta-preempt the 'pointless' kill loop.

What happens is that while the lag is positive we get this; when the lag
goes negative, wakeup-preemption is not achieved and we get delayed by a
full tick.

This gets us very little actual runtime.

Let me see what to do about that...

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 10/10] sched/fair: Implement an EEVDF like policy
  2023-03-09 12:44         ` Peter Zijlstra
@ 2023-03-09 15:29           ` Peter Zijlstra
  2023-03-09 15:39             ` Peter Zijlstra
                               ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-09 15:29 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel

On Thu, Mar 09, 2023 at 01:44:13PM +0100, Peter Zijlstra wrote:
> On Thu, Mar 09, 2023 at 10:06:33AM +0100, Peter Zijlstra wrote:
> > Hi Mike!
> > 
> > On Wed, Mar 08, 2023 at 02:36:01PM +0100, Mike Galbraith wrote:
> > > On Wed, 2023-03-08 at 09:39 +0100, Mike Galbraith wrote:
> > > >
> > > > Curiosity got the best of me...
> > > 
> > > Remember this little bugger, allegedly distilled from a real
> > > application control thread starvation issue?
> > 
> > Oooh, yeah, I should still have that somewhere. I'll try and remember
> > what exactly was needed to make it behave properly.
> 
> That thing wants both wakeup preemption and sleeper bonus. Specifically,
> it needs the signal to insta-preempt the 'pointless' kill loop.
> 
> What happens is that while the lag is positive we get this; when the lag
> goes negative, wakeup-preemption is not achieved and we get delayed by a
> full tick.
> 
> This gets us very little actual runtime.
> 
> Let me see what to do about that...

So if I add TICK_NSEC based sleeper bonus (/2 for gentle), then starve
works -- this is the absolutely minimal amount required. It sucks a bit
it's HZ dependent, but alas.

Also, the whole sleeper bonus gets us back into needing to track the old
vruntime and the overflow crap for super long sleeps and all that fugly
:/ I was so hoping we could delete that code.

Oh well.

(also, did you know that removing the debug cruft helps with running
numbers? ;-)

I think it adds a bit of variance to the numbers -- but I've not run
long enough to tell with certainty.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 10/10] sched/fair: Implement an EEVDF like policy
  2023-03-09 15:29           ` Peter Zijlstra
@ 2023-03-09 15:39             ` Peter Zijlstra
  2023-03-09 16:24             ` Mike Galbraith
  2023-03-09 16:42             ` Peter Zijlstra
  2 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-09 15:39 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel

On Thu, Mar 09, 2023 at 04:29:04PM +0100, Peter Zijlstra wrote:
> On Thu, Mar 09, 2023 at 01:44:13PM +0100, Peter Zijlstra wrote:
> > On Thu, Mar 09, 2023 at 10:06:33AM +0100, Peter Zijlstra wrote:
> > > Hi Mike!
> > > 
> > > On Wed, Mar 08, 2023 at 02:36:01PM +0100, Mike Galbraith wrote:
> > > > On Wed, 2023-03-08 at 09:39 +0100, Mike Galbraith wrote:
> > > > >
> > > > > Curiosity got the best of me...
> > > > 
> > > > Remember this little bugger, allegedly distilled from a real
> > > > application control thread starvation issue?
> > > 
> > > Oooh, yeah, I should still have that somewhere. I'll try and remember
> > > what exactly was needed to make it behave properly.
> > 
> > That thing wants both wakeup preemption and sleeper bonus. Specifically,
> > it needs the signal to insta-preempt the 'pointless' kill loop.
> > 
> > What happens is that while the lag is positive we get this; when the lag
> > goes negative, wakeup-preemption is not achieved and we get delayed by a
> > full tick.
> > 
> > This gets us very little actual runtime.
> > 
> > Let me see what to do about that...
> 
> So if I add TICK_NSEC based sleeper bonus (/2 for gentle), then starve
> works -- this is the absolutely minimal amount required. It sucks a bit
> it's HZ dependent, but alas.
> 
> Also, the whole sleeper bonus gets us back into needing to track the old
> vruntime and the overflow crap for super long sleeps and all that fugly
> :/ I was so hoping we could delete that code.
> 
> Oh well.
> 
> (also, did you know that removing the debug cruft helps with running
> numbers? ;-)

Also, it helps to turn the sched_feat on... clearly I should be calling
it a day.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 10/10] sched/fair: Implement an EEVDF like policy
  2023-03-09 15:29           ` Peter Zijlstra
  2023-03-09 15:39             ` Peter Zijlstra
@ 2023-03-09 16:24             ` Mike Galbraith
  2023-03-09 16:42             ` Peter Zijlstra
  2 siblings, 0 replies; 35+ messages in thread
From: Mike Galbraith @ 2023-03-09 16:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel

On Thu, 2023-03-09 at 16:29 +0100, Peter Zijlstra wrote:
>
> So if I add TICK_NSEC based sleeper bonus (/2 for gentle), then starve
> works -- this is the absolutely minimal amount required. It sucks a bit
> it's HZ dependent, but alas.
>
> Also, the whole sleeper bonus gets us back into needing to track the old
> vruntime and the overflow crap for super long sleeps and all that fugly
> :/ I was so hoping we could delete that code.
>
> Oh well.

Yeah, it's a worthy target.

	-Mike

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 10/10] sched/fair: Implement an EEVDF like policy
  2023-03-09 15:29           ` Peter Zijlstra
  2023-03-09 15:39             ` Peter Zijlstra
  2023-03-09 16:24             ` Mike Galbraith
@ 2023-03-09 16:42             ` Peter Zijlstra
  2 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-09 16:42 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel

On Thu, Mar 09, 2023 at 04:29:04PM +0100, Peter Zijlstra wrote:

> So if I add TICK_NSEC based sleeper bonus (/2 for gentle), then starve
> works -- this is the absolutely minimal amount required. It sucks a bit
> it's HZ dependent, but alas.

Fixes starve, sucks for schbench and hackbench :/

Clearly more thinking is required...

root@ivb-ep:~/bench# echo NO_FAIR_SLEEPERS > /debug/sched/features
root@ivb-ep:~/bench# ./doit-schbench.sh ; ./doit-hackbench-series.sh
Latency percentiles (usec)
50.0000th: 83
75.0000th: 102
90.0000th: 109
95.0000th: 114
*99.0000th: 450
99.5000th: 723
99.9000th: 985
min=0, max=1067
1:            0.55355 +- 0.00290 seconds time elapsed  ( +-  0.52% )
2:            0.79591 +- 0.00545 seconds time elapsed  ( +-  0.68% )
5:             1.5804 +- 0.0102 seconds time elapsed  ( +-  0.65% )
10:             2.5674 +- 0.0110 seconds time elapsed  ( +-  0.43% )
20:             4.6116 +- 0.0160 seconds time elapsed  ( +-  0.35% )
40:             9.5965 +- 0.0167 seconds time elapsed  ( +-  0.17% )
root@ivb-ep:~/bench# time taskset -c 3 ./starve/starve 1000000
expecting to receive 1000000 signals
^C

real    0m32.999s
user    0m0.000s
sys     0m0.719s
root@ivb-ep:~/bench# echo FAIR_SLEEPERS > /debug/sched/features
root@ivb-ep:~/bench# ./doit-schbench.sh ; ./doit-hackbench-series.sh
Latency percentiles (usec)
50.0000th: 87
75.0000th: 103
90.0000th: 111
95.0000th: 116
*99.0000th: 163
99.5000th: 697
99.9000th: 1110
min=0, max=1522
1:            0.59076 +- 0.00577 seconds time elapsed  ( +-  0.98% )
2:            0.86093 +- 0.00407 seconds time elapsed  ( +-  0.47% )
5:             2.1018 +- 0.0129 seconds time elapsed  ( +-  0.61% )
10:             3.6378 +- 0.0395 seconds time elapsed  ( +-  1.09% )
20:            5.56884 +- 0.00979 seconds time elapsed  ( +-  0.18% )
40:            10.8570 +- 0.0207 seconds time elapsed  ( +-  0.19% )
root@ivb-ep:~/bench# time taskset -c 3 ./starve/starve 1000000
expecting to receive 1000000 signals

real    0m5.651s
user    0m0.604s
sys     0m4.047s


---

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4938,17 +4938,22 @@ place_entity(struct cfs_rq *cfs_rq, stru
 {
 	u64 vruntime = avg_vruntime(cfs_rq);
 
+	if (sched_feat(PRESERVE_LAG))
+		vruntime -= se->lag;
+
 	if (sched_feat(FAIR_SLEEPERS)) {
-		u64 sleep_time;
+//		u64 sleep_time;
 
 		/* sleeps up to a single latency don't count. */
 		if (!initial) {
-			unsigned long thresh;
+			unsigned long thresh = TICK_NSEC;
 
-			if (se_is_idle(se))
-				thresh = sysctl_sched_min_granularity;
-			else
-				thresh = sysctl_sched_latency;
+			if (!sched_feat(EEVDF)) {
+				if (se_is_idle(se))
+					thresh = sysctl_sched_min_granularity;
+				else
+					thresh = sysctl_sched_latency;
+			}
 
 			/*
 			 * Halve their sleep time's effect, to allow
@@ -4957,7 +4962,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
 			if (sched_feat(GENTLE_FAIR_SLEEPERS))
 				thresh >>= 1;
 
-			vruntime -= thresh;
+			vruntime -= calc_delta_fair(thresh, se);
 		}
 
 		/*
@@ -4966,15 +4971,12 @@ place_entity(struct cfs_rq *cfs_rq, stru
 		 * slept for a long time, don't even try to compare its vruntime with
 		 * the base as it may be too far off and the comparison may get
 		 * inversed due to s64 overflow.
-		 */
 		sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start;
 		if ((s64)sleep_time < 60LL * NSEC_PER_SEC)
+		 */
 			vruntime = max_vruntime(se->vruntime, vruntime);
 	}
 
-	if (sched_feat(PRESERVE_LAG))
-		vruntime -= se->lag;
-
 	se->vruntime = vruntime;
 	set_slice(cfs_rq, se);
 }

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 10/10] sched/fair: Implement an EEVDF like policy
  2023-03-09  4:23       ` Mike Galbraith
@ 2023-03-10 20:38         ` Peter Zijlstra
  2023-03-11  5:53           ` Mike Galbraith
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-10 20:38 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel

On Thu, Mar 09, 2023 at 05:23:33AM +0100, Mike Galbraith wrote:
> Ok, seems there must be a math booboo lurking.

Yep that.. please try the version I pushed out here:

  https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/eevdf

(It's based on 6.3-rc1, but it should be trivial to rebase onto v6.2 if
you so wish.)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 10/10] sched/fair: Implement an EEVDF like policy
  2023-03-10 20:38         ` Peter Zijlstra
@ 2023-03-11  5:53           ` Mike Galbraith
  2023-03-11  7:56             ` Mike Galbraith
  0 siblings, 1 reply; 35+ messages in thread
From: Mike Galbraith @ 2023-03-11  5:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel

On Fri, 2023-03-10 at 21:38 +0100, Peter Zijlstra wrote:
> On Thu, Mar 09, 2023 at 05:23:33AM +0100, Mike Galbraith wrote:
> > Ok, seems there must be a math booboo lurking.
>
> Yep that.. please try the version I pushed out here:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/eevdf
>
> (It's based on 6.3-rc1, but it should be trivial to rebase onto v6.2 if
> you so wish.)

I stuffed it into .today.  Yup, extreme sleeper stinginess is gone.

tbench/netperf numbers look roughly as before.  massive_intr vs
desktop CPU distribution improved as expected, but (to me) oddly, the
amount of the desktop beating itself up did not.

These are all .today

perf.data.eevdf
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Sum delay ms     | Max delay start           | Max delay end          |
  massive_intr:(9)      |1499451.694 ms |   647385 | avg:   0.940 ms | max:  27.969 ms | sum:608348.424 ms | max start:   467.219496 s | max end:   467.247465 s
  TOTAL:                |2369880.457 ms |  2797272 |                                     |   1243484.857 ms |
perf.data.eevdf_FAIR_SLEEPERS
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Sum delay ms     | Max delay start           | Max delay end          |
  massive_intr:(9)      |1415743.272 ms |   714999 | avg:   1.013 ms | max:  73.722 ms | sum:724624.600 ms | max start:  2678.846216 s | max end:  2678.919938 s
  TOTAL:                |2362333.392 ms |  2871586 |                                     |   1238748.802 ms |
perf.data.master
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Sum delay ms     | Max delay start           | Max delay end          |
  massive_intr:(9)      |1463798.673 ms |   598383 | avg:   0.730 ms | max:  40.029 ms | sum:437069.674 ms | max start:   945.022965 s | max end:   945.062994 s
  TOTAL:                |2370011.937 ms |  1109828 |                                     |    616408.789 ms |

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 10/10] sched/fair: Implement an EEVDF like policy
  2023-03-11  5:53           ` Mike Galbraith
@ 2023-03-11  7:56             ` Mike Galbraith
  0 siblings, 0 replies; 35+ messages in thread
From: Mike Galbraith @ 2023-03-11  7:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel

On Sat, 2023-03-11 at 06:53 +0100, Mike Galbraith wrote:
>
> massive_intr vs desktop CPU distribution improved as expected, but
> (to me) oddly, the amount of desktop beating up itself did not.

Hmm, maybe expected when fancy deadline math tries to wedge a very wide
GUI into a not so wide and pretty well saturated CPU.

	-Mike

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 08/10] sched/fair: Add lag based placement
  2023-03-06 13:25 ` [PATCH 08/10] sched/fair: Add lag based placement Peter Zijlstra
@ 2023-03-16 22:49   ` Tim Chen
  0 siblings, 0 replies; 35+ messages in thread
From: Tim Chen @ 2023-03-16 22:49 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, vincent.guittot
  Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, corbet, qyousef, chris.hyser, patrick.bellasi,
	pjt, pavel, qperret, joshdon, timj, kprateek.nayak, yu.c.chen,
	youssefesmat, joel

On Mon, 2023-03-06 at 14:25 +0100, Peter Zijlstra wrote:
> With the introduction of avg_vruntime, it is possible to approximate
> lag (the entire purpose of introducing it in fact). Use this to do lag
> based placement over sleep+wake.
> 
> Specifically, the FAIR_SLEEPERS thing places things too far to the
> left and messes up the deadline aspect of EEVDF.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/sched.h   |    1 
>  kernel/sched/core.c     |    1 
>  kernel/sched/fair.c     |   63 +++++++++++++++++++++++++++---------------------
>  kernel/sched/features.h |    8 ++++++
>  4 files changed, 46 insertions(+), 27 deletions(-)
> 
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -555,6 +555,7 @@ struct sched_entity {
>  	u64				sum_exec_runtime;
>  	u64				vruntime;
>  	u64				prev_sum_exec_runtime;
> +	s64				lag;
>  
>  	u64				nr_migrations;
>  
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4436,6 +4436,7 @@ static void __sched_fork(unsigned long c
>  	p->se.prev_sum_exec_runtime	= 0;
>  	p->se.nr_migrations		= 0;
>  	p->se.vruntime			= 0;
> +	p->se.lag			= 0;
>  	INIT_LIST_HEAD(&p->se.group_node);
>  
>  	set_latency_offset(p);
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4749,39 +4749,45 @@ static void
>  place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
>  {
>  	u64 vruntime = avg_vruntime(cfs_rq);
> -	u64 sleep_time;
>  
> -	/* sleeps up to a single latency don't count. */
> -	if (!initial) {
> -		unsigned long thresh;
> -
> -		if (se_is_idle(se))
> -			thresh = sysctl_sched_min_granularity;
> -		else
> -			thresh = sysctl_sched_latency;
> +	if (sched_feat(FAIR_SLEEPERS)) {
> +		u64 sleep_time;
> +
> +		/* sleeps up to a single latency don't count. */
> +		if (!initial) {
> +			unsigned long thresh;
> +
> +			if (se_is_idle(se))
> +				thresh = sysctl_sched_min_granularity;
> +			else
> +				thresh = sysctl_sched_latency;
> +
> +			/*
> +			 * Halve their sleep time's effect, to allow
> +			 * for a gentler effect of sleepers:
> +			 */
> +			if (sched_feat(GENTLE_FAIR_SLEEPERS))
> +				thresh >>= 1;
> +
> +			vruntime -= thresh;
> +		}
>  
>  		/*
> -		 * Halve their sleep time's effect, to allow
> -		 * for a gentler effect of sleepers:
> +		 * Pull vruntime of the entity being placed to the base level of
> +		 * cfs_rq, to prevent boosting it if placed backwards.  If the entity
> +		 * slept for a long time, don't even try to compare its vruntime with
> +		 * the base as it may be too far off and the comparison may get
> +		 * inversed due to s64 overflow.
>  		 */
> -		if (sched_feat(GENTLE_FAIR_SLEEPERS))
> -			thresh >>= 1;
> -
> -		vruntime -= thresh;
> +		sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start;
> +		if ((s64)sleep_time < 60LL * NSEC_PER_SEC)
> +			vruntime = max_vruntime(se->vruntime, vruntime);
>  	}
>  
> -	/*
> -	 * Pull vruntime of the entity being placed to the base level of
> -	 * cfs_rq, to prevent boosting it if placed backwards.  If the entity
> -	 * slept for a long time, don't even try to compare its vruntime with
> -	 * the base as it may be too far off and the comparison may get
> -	 * inversed due to s64 overflow.
> -	 */
> -	sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start;
> -	if ((s64)sleep_time > 60LL * NSEC_PER_SEC)
> -		se->vruntime = vruntime;
> -	else
> -		se->vruntime = max_vruntime(se->vruntime, vruntime);
> +	if (sched_feat(PRESERVE_LAG))
> +		vruntime -= se->lag;
> +
> +	se->vruntime = vruntime;

I was going to say that
when we migrate a task to a new runqueue, we subtract vruntime
by the old queue's min_vruntime, and that math needs updating.
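
(For reference, the math in question is roughly the classic
normalization below -- a simplified sketch with made-up types, not
the actual kernel code:

#include <stdint.h>

struct rq_sketch { uint64_t min_vruntime; };
struct se_sketch { uint64_t vruntime; };

/* make vruntime relative to the old queue, rebase it on the new one */
static void migrate_vruntime(struct se_sketch *se,
			     struct rq_sketch *old_rq,
			     struct rq_sketch *new_rq)
{
	se->vruntime -= old_rq->min_vruntime;	/* on dequeue from the old rq */
	se->vruntime += new_rq->min_vruntime;	/* on enqueue to the new rq   */
}

and that is the piece that has to agree with the new lag bookkeeping.)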

But then I saw you did that in 
https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/kernel/sched?h=sched/eevdf

With this new lag based placement, I think it should properly fix
the starvation issues we have seen caused by quickly cpu hopping tasks. 

Tim

>  }
>  
>  static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
> @@ -4949,6 +4955,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
>  
>  	clear_buddies(cfs_rq, se);
>  
> +	if (sched_feat(PRESERVE_LAG) && (flags & DEQUEUE_SLEEP))
> +		se->lag = avg_vruntime(cfs_rq) - se->vruntime;
> +
>  	if (se != cfs_rq->curr)
>  		__dequeue_entity(cfs_rq, se);
>  	se->on_rq = 0;
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -1,12 +1,20 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
> +
>  /*
>   * Only give sleepers 50% of their service deficit. This allows
>   * them to run sooner, but does not allow tons of sleepers to
>   * rip the spread apart.
>   */
> +SCHED_FEAT(FAIR_SLEEPERS, false)
>  SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true)
>  
>  /*
> + * Using the avg_vruntime, do the right thing and preserve lag
> + * across sleep+wake cycles.
> + */
> +SCHED_FEAT(PRESERVE_LAG, true)
> +
> +/*
>   * Prefer to schedule the task we woke last (assuming it failed
>   * wakeup-preemption), since its likely going to consume data we
>   * touched, increases cache locality.
> 
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 06/10] sched/fair: Add avg_vruntime
  2023-03-06 13:25 ` [PATCH 06/10] sched/fair: Add avg_vruntime Peter Zijlstra
@ 2023-03-21 13:58   ` Chen Yu
  2023-03-21 16:04     ` Peter Zijlstra
  0 siblings, 1 reply; 35+ messages in thread
From: Chen Yu @ 2023-03-21 13:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, youssefesmat, joel

On 2023-03-06 at 14:25:27 +0100, Peter Zijlstra wrote:
[...]
>  
> +/*
> + * Compute virtual time from the per-task service numbers:
> + *
> + * Fair schedulers conserve lag: \Sum lag_i = 0
> + *
> + * lag_i = S - s_i = w_i * (V - v_i)
> + *
The definition of lag_i above seems to be inconsistent with the definition
of se->lag in PATCH 8. Maybe rename lag_i to something else to avoid confusion?
> + * \Sum lag_i = 0 -> \Sum w_i * (V - v_i) = V * \Sum w_i - \Sum w_i * v_i = 0
> + *
> + * From which we solve V:
> + *
> + *     \Sum v_i * w_i
> + * V = --------------
> + *        \Sum w_i
> + *
> + * However, since v_i is u64, and the multiplcation could easily overflow
> + * transform it into a relative form that uses smaller quantities:
> + *
> + * Substitute: v_i == (v_i - v) + v
> + *
> + *     \Sum ((v_i - v) + v) * w_i   \Sum (v_i - v) * w_i
> + * V = -------------------------- = -------------------- + v
> + *              \Sum w_i                   \Sum w_i
> + *
> + *
Not sure if I understand it correctly, does it mean that (v_i - v) * w_i will not
overflow? If the weight of the task is 15 (nice 19), then if v_i - v > (S64_MAX / 15)
it overflows. Is it possible that v_i is much larger than cfs_rq->min_vruntime
in this case?
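
To make the quantities concrete, my reading of the relative form is
roughly the following userspace sketch (simplified, with made-up names,
and ignoring how curr is folded in) -- the key * w_i multiply is the
one I am asking about:

#include <stdint.h>

struct avg_state {
	int64_t  avg_vruntime;	/* \Sum (v_i - v) * w_i */
	uint64_t avg_load;	/* \Sum w_i             */
};

/* v is the base (cfs_rq->min_vruntime), v_i the entity's vruntime */
static void avg_add(struct avg_state *a, uint64_t v, uint64_t v_i, uint64_t w_i)
{
	int64_t key = (int64_t)(v_i - v);	/* small while v_i stays near v */

	a->avg_vruntime += key * (int64_t)w_i;	/* the multiply in question */
	a->avg_load     += w_i;
}

static uint64_t compute_V(struct avg_state *a, uint64_t v)
{
	int64_t avg = a->avg_vruntime;

	if (a->avg_load)
		avg /= (int64_t)a->avg_load;

	/* V = v + \Sum (v_i - v) * w_i / \Sum w_i */
	return v + (uint64_t)avg;
}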

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 06/10] sched/fair: Add avg_vruntime
  2023-03-21 13:58   ` Chen Yu
@ 2023-03-21 16:04     ` Peter Zijlstra
  2023-03-24  7:12       ` Chen Yu
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-21 16:04 UTC (permalink / raw)
  To: Chen Yu
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, youssefesmat, joel

On Tue, Mar 21, 2023 at 09:58:13PM +0800, Chen Yu wrote:
> On 2023-03-06 at 14:25:27 +0100, Peter Zijlstra wrote:
> [...]
> >  
> > +/*
> > + * Compute virtual time from the per-task service numbers:
> > + *
> > + * Fair schedulers conserve lag: \Sum lag_i = 0
> > + *
> > + * lag_i = S - s_i = w_i * (V - v_i)
> > + *
> The definination of above lag_i seems to be inconsistent with the defininatin
> of se->lag in PATCH 8. Maybe rename lag_i to something other to avoid confusion?

Yeah, I ran into that the other day, I think I'll introduce vlag_i = V - v_i
or so.

> > + * \Sum lag_i = 0 -> \Sum w_i * (V - v_i) = V * \Sum w_i - \Sum w_i * v_i = 0
> > + *
> > + * From which we solve V:
> > + *
> > + *     \Sum v_i * w_i
> > + * V = --------------
> > + *        \Sum w_i
> > + *
> > + * However, since v_i is u64, and the multiplcation could easily overflow
> > + * transform it into a relative form that uses smaller quantities:
> > + *
> > + * Substitute: v_i == (v_i - v) + v
> > + *
> > + *     \Sum ((v_i - v) + v) * w_i   \Sum (v_i - v) * w_i
> > + * V = -------------------------- = -------------------- + v
> > + *              \Sum w_i                   \Sum w_i
> > + *
> > + *

> Not sure if I understand it correctly, does it mean  (v_i - v) * w_i will not
> overflow? If the weight of task is 15 (nice 19), then if v_i - v > (S64_MAX / 15)
> it gets overflow. Is it possible that v_i is much larger than cfs_rq->min_vruntime
> in this case?

Or worse, SCHED_IDLE, where weight is 2 (IIRC) or cgroups, then vtime
advances at 512 times realtime. Now, the tick puts a limit on how long
we'll overshoot these super low weight entities, for HZ=1000 we still
only get 0.5s of vtime for weight=2.

That would be only 30 bits used, except we use double FIXEDPOINT_SHIFT
on 64bit, so we'll end up at 40-ish.

That should give us enough room to carry an average of deltas around
min_vruntime.
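
( Spelling that out: a 1ms tick at weight 2 is ~1e6 * 1024/2 ~= 5.1e8 ns
  of vtime, which is about 2^29, so ~30 bits; the extra 10 bits of fixed
  point we carry on 64bit is what lands us at the 40-ish above. )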

But yes, I've seen this go sideways and I need to stare a bit more at
this. One of the things I've considered is changing the min_vruntime
update rules to instead move to avg_vruntime() to minimize the deltas.
But I've not yet actually written that code.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 00/10] sched: EEVDF using latency-nice
  2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
                   ` (11 preceding siblings ...)
  2023-03-08 15:13 ` Shrikanth Hegde
@ 2023-03-22  6:49 ` K Prateek Nayak
  2023-03-22  9:38   ` K Prateek Nayak
  2023-03-23 11:53 ` Pavel Machek
  13 siblings, 1 reply; 35+ messages in thread
From: K Prateek Nayak @ 2023-03-22  6:49 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, vincent.guittot
  Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, corbet, qyousef, chris.hyser, patrick.bellasi,
	pjt, pavel, qperret, tim.c.chen, joshdon, timj, yu.c.chen,
	youssefesmat, joel

Hello Peter,

Leaving some results from my testing on a dual socket Zen3 machine
(2 x 64C/128T) below.

tl;dr

o I've not tested workloads with nice and latency nice yet, focusing more
  on the out-of-the-box performance. No changes to sched_feat were made
  for the same reason.

o Except for hackbench (m:n communication relationship), I do not see any
  regression for the other standard benchmarks (mostly 1:1 or 1:n relations)
  when the system is below fully loaded.

o At the fully loaded scenario, schbench seems to be unhappy. Looking at the
  data from /proc/<pid>/sched for the tasks with schedstats enabled,
  there is an increase in the number of context switches and the total wait
  sum. When the system is overloaded, things flip and the schbench tail
  latency improves drastically. I suspect the involuntary
  context switches help workers make progress much sooner after wakeup
  compared to tip, thus leading to lower tail latency.

o For the same reason as above, tbench throughput takes a hit, with the
  number of involuntary context switches increasing drastically for the
  tbench server. An increase in wait sum is also noticed.

o A couple of real-world workloads were also tested. DeathStarBench
  throughput tanks much more with the updated version in your tree
  compared to this series as is.
  SpecJBB Max-jOPS sees large improvements, but they come at the cost of
  a drop in Critical-jOPS, signifying an increase in either wait time or
  involuntary context switches, which can lead to transactions taking
  longer to complete.

o Apart from DeathStarBench, all the trends reported remain the same when
  comparing the version in your tree and this series, as is, applied
  on the same base kernel.

I'll leave the detailed results below and some limited analysis. 

On 3/6/2023 6:55 PM, Peter Zijlstra wrote:
> Hi!
> 
> Ever since looking at the latency-nice patches, I've wondered if EEVDF would
> not make more sense, and I did point Vincent at some older patches I had for
> that (which is here his augmented rbtree thing comes from).
> 
> Also, since I really dislike the dual tree, I also figured we could dynamically
> switch between an augmented tree and not (and while I have code for that,
> that's not included in this posting because with the current results I don't
> think we actually need this).
> 
> Anyway, since I'm somewhat under the weather, I spend last week desperately
> trying to connect a small cluster of neurons in defiance of the snot overlord
> and bring back the EEVDF patches from the dark crypts where they'd been
> gathering cobwebs for the past 13 odd years.
> 
> By friday they worked well enough, and this morning (because obviously I forgot
> the weekend is ideal to run benchmarks) I ran a bunch of hackbenck, netperf,
> tbench and sysbench -- there's a bunch of wins and losses, but nothing that
> indicates a total fail.
> 
> ( in fact, some of the schbench results seem to indicate EEVDF schedules a lot
>   more consistent than CFS and has a bunch of latency wins )
> 
> ( hackbench also doesn't show the augmented tree and generally more expensive
>   pick to be a loss, in fact it shows a slight win here )
> 
> 
>   hackbech load + cyclictest --policy other results:
> 
> 
> 			EEVDF			 CFS
> 
> 		# Min Latencies: 00053
>   LNICE(19)	# Avg Latencies: 04350
> 		# Max Latencies: 76019
> 
> 		# Min Latencies: 00052		00053
>   LNICE(0)	# Avg Latencies: 00690		00687
> 		# Max Latencies: 14145		13913
> 
> 		# Min Latencies: 00019
>   LNICE(-19)	# Avg Latencies: 00261
> 		# Max Latencies: 05642
> 

Following are the results from testing the series on a dual socket
Zen3 machine (2 x 64C/128T):

NPS Modes are used to logically divide a single socket into
multiple NUMA regions.
Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
    Total 2 NUMA nodes in the dual socket machine.

    Node 0: 0-63,   128-191
    Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
    Total 4 NUMA nodes exist over 2 sockets.
   
    Node 0: 0-31,   128-159
    Node 1: 32-63,  160-191
    Node 2: 64-95,  192-223
    Node 3: 96-127, 223-255

NPS4: Each socket is logically divided into 4 NUMA regions.
    Total 8 NUMA nodes exist over 2 sockets.
   
    Node 0: 0-15,    128-143
    Node 1: 16-31,   144-159
    Node 2: 32-47,   160-175
    Node 3: 48-63,   176-191
    Node 4: 64-79,   192-207
    Node 5: 80-95,   208-223
    Node 6: 96-111,  223-231
    Node 7: 112-127, 232-255

Kernel versions:
- tip:          6.2.0-rc6 tip sched/core
- eevdf: 	6.2.0-rc6 tip sched/core
		+ eevdf commits from your tree
		(https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/eevdf)

- eevdf prev:	6.2.0-rc6 tip sched/core + this series as is

When the testing started, the tip was at:
commit 7c4a5b89a0b5 "sched/rt: pick_next_rt_entity(): check list_entry"

Benchmark Results:

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

o NPS1

Test:			tip			eevdf
 1-groups:	   4.63 (0.00 pct)	   4.52 (2.37 pct)
 2-groups:	   4.42 (0.00 pct)	   5.41 (-22.39 pct)	*
 4-groups:	   4.21 (0.00 pct)	   5.26 (-24.94 pct)	*
 8-groups:	   4.95 (0.00 pct)	   5.01 (-1.21 pct)
16-groups:	   5.43 (0.00 pct)	   6.24 (-14.91 pct)	*

o NPS2

Test:			tip			eevdf
 1-groups:	   4.68 (0.00 pct)	   4.56 (2.56 pct)
 2-groups:	   4.45 (0.00 pct)	   5.19 (-16.62 pct)	*
 4-groups:	   4.19 (0.00 pct)	   4.53 (-8.11 pct)	*
 8-groups:	   4.80 (0.00 pct)	   4.81 (-0.20 pct)
16-groups:	   5.60 (0.00 pct)	   6.22 (-11.07 pct)	*

o NPS4

Test:			tip			eevdf
 1-groups:	   4.68 (0.00 pct)	   4.57 (2.35 pct)
 2-groups:	   4.56 (0.00 pct)	   5.19 (-13.81 pct)	*
 4-groups:	   4.50 (0.00 pct)	   4.96 (-10.22 pct)	*
 8-groups:	   5.76 (0.00 pct)	   5.49 (4.68 pct)
16-groups:	   5.60 (0.00 pct)	   6.53 (-16.60 pct)	*

~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~

o NPS1

#workers:	tip			eevdf
  1:	  36.00 (0.00 pct)	  36.00 (0.00 pct)
  2:	  37.00 (0.00 pct)	  37.00 (0.00 pct)
  4:	  38.00 (0.00 pct)	  39.00 (-2.63 pct)
  8:	  52.00 (0.00 pct)	  50.00 (3.84 pct)
 16:	  66.00 (0.00 pct)	  68.00 (-3.03 pct)
 32:	 111.00 (0.00 pct)	 109.00 (1.80 pct)
 64:	 213.00 (0.00 pct)	 212.00 (0.46 pct)
128:	 502.00 (0.00 pct)	 637.00 (-26.89 pct)	*
256:	 45632.00 (0.00 pct)	 24992.00 (45.23 pct)	^
512:	 78720.00 (0.00 pct)	 44096.00 (43.98 pct)	^

o NPS2

#workers:	tip			eevdf
  1:	  31.00 (0.00 pct)	  23.00 (25.80 pct)
  2:	  32.00 (0.00 pct)	  33.00 (-3.12 pct)
  4:	  39.00 (0.00 pct)	  37.00 (5.12 pct)
  8:	  52.00 (0.00 pct)	  49.00 (5.76 pct)
 16:	  67.00 (0.00 pct)	  68.00 (-1.49 pct)
 32:	 113.00 (0.00 pct)	 112.00 (0.88 pct)
 64:	 213.00 (0.00 pct)	 214.00 (-0.46 pct)
128:	 508.00 (0.00 pct)	 491.00 (3.34 pct)
256:	 46912.00 (0.00 pct)	 22304.00 (52.45 pct)	^
512:	 76672.00 (0.00 pct)	 42944.00 (43.98 pct)	^

o NPS4

#workers:	tip			eevdf
  1:	  33.00 (0.00 pct)	  30.00 (9.09 pct)
  2:	  40.00 (0.00 pct)	  36.00 (10.00 pct)
  4:	  44.00 (0.00 pct)	  41.00 (6.81 pct)
  8:	  73.00 (0.00 pct)	  73.00 (0.00 pct)
 16:	  71.00 (0.00 pct)	  71.00 (0.00 pct)
 32:	 111.00 (0.00 pct)	 115.00 (-3.60 pct)
 64:	 217.00 (0.00 pct)	 211.00 (2.76 pct)
128:	 509.00 (0.00 pct)	 553.00 (-8.64 pct)	*
256:	 44352.00 (0.00 pct)	 26848.00 (39.46 pct)	^
512:	 75392.00 (0.00 pct)	 44352.00 (41.17 pct)	^


~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

o NPS1

Clients:	tip			eevdf
    1	 483.10 (0.00 pct)	 476.46 (-1.37 pct)
    2	 956.03 (0.00 pct)	 943.12 (-1.35 pct)
    4	 1786.36 (0.00 pct)	 1760.64 (-1.43 pct)
    8	 3304.47 (0.00 pct)	 3105.19 (-6.03 pct)
   16	 5440.44 (0.00 pct)	 5609.24 (3.10 pct)
   32	 10462.02 (0.00 pct)	 10416.02 (-0.43 pct)
   64	 18995.99 (0.00 pct)	 19317.34 (1.69 pct)
  128	 27896.44 (0.00 pct)	 28459.38 (2.01 pct)
  256	 49742.89 (0.00 pct)	 46371.44 (-6.77 pct)	*
  512	 49583.01 (0.00 pct)	 45717.22 (-7.79 pct)	*
 1024	 48467.75 (0.00 pct)	 43475.31 (-10.30 pct)	*

o NPS2

Clients:	tip			eevdf
    1	 472.57 (0.00 pct)	 475.35 (0.58 pct)
    2	 938.27 (0.00 pct)	 942.19 (0.41 pct)
    4	 1764.34 (0.00 pct)	 1783.50 (1.08 pct)
    8	 3043.57 (0.00 pct)	 3205.85 (5.33 pct)
   16	 5103.53 (0.00 pct)	 5154.94 (1.00 pct)
   32	 9767.22 (0.00 pct)	 9793.81 (0.27 pct)
   64	 18712.65 (0.00 pct)	 18601.10 (-0.59 pct)
  128	 27691.95 (0.00 pct)	 27542.57 (-0.53 pct)
  256	 47939.24 (0.00 pct)	 43401.62 (-9.46 pct)	*
  512	 47843.70 (0.00 pct)	 43971.16 (-8.09 pct)	*
 1024	 48412.05 (0.00 pct)	 42808.58 (-11.57 pct)	*

o NPS4

Clients:	tip			eevdf
    1	 486.74 (0.00 pct)	 484.88 (-0.38 pct)
    2	 950.50 (0.00 pct)	 950.04 (-0.04 pct)
    4	 1778.58 (0.00 pct)	 1796.03 (0.98 pct)
    8	 3106.36 (0.00 pct)	 3180.09 (2.37 pct)
   16	 5139.81 (0.00 pct)	 5139.50 (0.00 pct)
   32	 9911.04 (0.00 pct)	 10086.37 (1.76 pct)
   64	 18201.46 (0.00 pct)	 18289.40 (0.48 pct)
  128	 27284.67 (0.00 pct)	 26947.19 (-1.23 pct)
  256	 46793.72 (0.00 pct)	 43971.87 (-6.03 pct)	*
  512	 48841.96 (0.00 pct)	 44255.01 (-9.39 pct)	*
 1024	 48811.99 (0.00 pct)	 43118.99 (-11.66 pct)	*

~~~~~~~~~~
~ stream ~
~~~~~~~~~~

o NPS1

- 10 Runs:

Test:		tip			eevdf
 Copy:	 321229.54 (0.00 pct)	 332975.45 (3.65 pct)
Scale:	 207471.32 (0.00 pct)	 212534.83 (2.44 pct)
  Add:	 234962.15 (0.00 pct)	 243011.39 (3.42 pct)
Triad:	 246256.00 (0.00 pct)	 256453.73 (4.14 pct)

- 100 Runs:

Test:		tip			eevdf
 Copy:	 332714.94 (0.00 pct)	 333183.42 (0.14 pct)
Scale:	 216140.84 (0.00 pct)	 212160.53 (-1.84 pct)
  Add:	 239605.00 (0.00 pct)	 233168.69 (-2.68 pct)
Triad:	 258580.84 (0.00 pct)	 256972.33 (-0.62 pct)

o NPS2

- 10 Runs:

Test:		tip			eevdf
 Copy:	 324423.92 (0.00 pct)	 340685.20 (5.01 pct)
Scale:	 215993.56 (0.00 pct)	 217895.31 (0.88 pct)
  Add:	 250590.28 (0.00 pct)	 257495.12 (2.75 pct)
Triad:	 261284.44 (0.00 pct)	 261373.49 (0.03 pct)

- 100 Runs:

Test:		tip			eevdf
 Copy:	 325993.72 (0.00 pct)	 341244.18 (4.67 pct)
Scale:	 227201.27 (0.00 pct)	 227255.98 (0.02 pct)
  Add:	 256601.84 (0.00 pct)	 258026.75 (0.55 pct)
Triad:	 260222.19 (0.00 pct)	 269878.75 (3.71 pct)

o NPS4

- 10 Runs:

Test:		tip			eevdf
 Copy:	 356850.80 (0.00 pct)	 371230.27 (4.02 pct)
Scale:	 247219.39 (0.00 pct)	 237846.20 (-3.79 pct)
  Add:	 268588.78 (0.00 pct)	 261088.54 (-2.79 pct)
Triad:	 272932.59 (0.00 pct)	 284068.07 (4.07 pct)

- 100 Runs:

Test:		tip			eevdf
 Copy:	 365965.18 (0.00 pct)	 371186.97 (1.42 pct)
Scale:	 246068.58 (0.00 pct)	 245991.10 (-0.03 pct)
  Add:	 263677.73 (0.00 pct)	 269021.14 (2.02 pct)
Triad:	 273701.36 (0.00 pct)	 280566.44 (2.50 pct)

~~~~~~~~~~~~~
~ Unixbench ~
~~~~~~~~~~~~~

o NPS1

Test			Metric	  Parallelism			tip		          eevdf
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      49077561.21 (   0.00%)    49144835.64 (   0.14%)
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6285373890.61 (   0.00%)  6270537933.92 (  -0.24%)
unixbench-syscall       Amean     unixbench-syscall-1        2664815.40 (   0.00%)     2679289.17 *  -0.54%*
unixbench-syscall       Amean     unixbench-syscall-512      7848462.70 (   0.00%)     7456802.37 *   4.99%*
unixbench-pipe          Hmean     unixbench-pipe-1           2531131.89 (   0.00%)     2475863.05 *  -2.18%*
unixbench-pipe          Hmean     unixbench-pipe-512       305171024.40 (   0.00%)   301182156.60 (  -1.31%)
unixbench-spawn         Hmean     unixbench-spawn-1             4058.05 (   0.00%)        4284.38 *   5.58%*
unixbench-spawn         Hmean     unixbench-spawn-512          79893.24 (   0.00%)       78234.45 *  -2.08%*
unixbench-execl         Hmean     unixbench-execl-1             4148.64 (   0.00%)        4086.73 *  -1.49%*
unixbench-execl         Hmean     unixbench-execl-512          11077.20 (   0.00%)       11137.79 (   0.55%)

o NPS2

Test			Metric	  Parallelism			tip		          eevdf
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      49394822.56 (   0.00%)    49175574.26 (  -0.44%)
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6267817215.36 (   0.00%)  6282838979.08 *   0.24%*
unixbench-syscall       Amean     unixbench-syscall-1        2663675.03 (   0.00%)     2677018.53 *  -0.50%*
unixbench-syscall       Amean     unixbench-syscall-512      7342392.90 (   0.00%)     7443264.00 *  -1.37%*
unixbench-pipe          Hmean     unixbench-pipe-1           2533194.04 (   0.00%)     2475969.01 *  -2.26%*
unixbench-pipe          Hmean     unixbench-pipe-512       303588239.03 (   0.00%)   302217597.98 *  -0.45%*
unixbench-spawn         Hmean     unixbench-spawn-1             5141.40 (   0.00%)        4862.78 (  -5.42%)    *
unixbench-spawn         Hmean     unixbench-spawn-512          82993.79 (   0.00%)       79139.42 *  -4.64%*    *
unixbench-execl         Hmean     unixbench-execl-1             4140.15 (   0.00%)        4084.20 *  -1.35%*
unixbench-execl         Hmean     unixbench-execl-512          12229.25 (   0.00%)       11445.22 (  -6.41%)    *

o NPS4

Test			Metric	  Parallelism			tip		          eevdf
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      48970677.27 (   0.00%)    49070289.56 (   0.20%)
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6297506696.81 (   0.00%)  6311038905.07 (   0.21%)
unixbench-syscall       Amean     unixbench-syscall-1        2664715.13 (   0.00%)     2677752.20 *  -0.49%*
unixbench-syscall       Amean     unixbench-syscall-512      7938670.70 (   0.00%)     7972291.60 (  -0.42%)
unixbench-pipe          Hmean     unixbench-pipe-1           2527605.54 (   0.00%)     2476140.77 *  -2.04%*
unixbench-pipe          Hmean     unixbench-pipe-512       305068507.23 (   0.00%)   304114548.50 (  -0.31%)
unixbench-spawn         Hmean     unixbench-spawn-1             5207.34 (   0.00%)        4964.39 (  -4.67%)    *
unixbench-spawn         Hmean     unixbench-spawn-512          81352.38 (   0.00%)       74467.00 *  -8.46%*    *
unixbench-execl         Hmean     unixbench-execl-1             4131.37 (   0.00%)        4044.09 *  -2.11%*
unixbench-execl         Hmean     unixbench-execl-512          13025.56 (   0.00%)       11124.77 * -14.59%*    *

~~~~~~~~~~~
~ netperf ~
~~~~~~~~~~~

o NPS1

                        tip                     eevdf
 1-clients:      107932.22 (0.00 pct)    106167.39 (-1.63 pct)
 2-clients:      106887.99 (0.00 pct)    105304.25 (-1.48 pct)
 4-clients:      106676.11 (0.00 pct)    104328.10 (-2.20 pct)
 8-clients:      98645.45 (0.00 pct)     94076.26 (-4.63 pct)
16-clients:      88881.23 (0.00 pct)     86831.85 (-2.30 pct)
32-clients:      86654.28 (0.00 pct)     86313.80 (-0.39 pct)
64-clients:      81431.90 (0.00 pct)     74885.75 (-8.03 pct)
128-clients:     55993.77 (0.00 pct)     55378.10 (-1.09 pct)
256-clients:     43865.59 (0.00 pct)     44326.30 (1.05 pct)

o NPS2

                        tip                     eevdf
 1-clients:      106711.81 (0.00 pct)    108576.27 (1.74 pct)
 2-clients:      106987.79 (0.00 pct)    108348.24 (1.27 pct)
 4-clients:      105275.37 (0.00 pct)    105702.12 (0.40 pct)
 8-clients:      103028.31 (0.00 pct)    96250.20 (-6.57 pct)
16-clients:      87382.43 (0.00 pct)     87683.29 (0.34 pct)
32-clients:      86578.14 (0.00 pct)     86968.29 (0.45 pct)
64-clients:      81470.63 (0.00 pct)     75906.15 (-6.83 pct)
128-clients:     54803.35 (0.00 pct)     55051.90 (0.45 pct)
256-clients:     42910.29 (0.00 pct)     44062.33 (2.68 pct)

~~~~~~~~~~~
~ SpecJBB ~
~~~~~~~~~~~

o NPS1

			tip		    eevdf
Max-jOPS	       100%		  115.71%	(+15.71%)  ^
Critical-jOPS	       100%		   93.59%	 (-6.41%)  *

~~~~~~~~~~~~~~~~~~
~ DeathStarBench ~
~~~~~~~~~~~~~~~~~~

o NPS1

  #CCX                      	 1 CCX    2 CCX   3 CCX   4 CCX
o eevdf compared to tip		-10.93	 -14.35	  -9.74	  -6.07
o eevdf prev (this series as is)
  compared to tip                -1.99    -6.64   -4.99   -3.87

Note: #CCX is the number of LLCs the services are pinned to.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ Some Preliminary Analysis ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

tl;dr

- There seems to be an increase in the number of involuntary context switches
  when the system is overloaded. This probably allows newly waking tasks to
  make progress, benefiting latency-sensitive workloads like schbench in the
  overloaded scenario compared to tip, but hurts tbench performance.
  When the system is fully loaded, the larger average wait time seems to hurt
  the schbench performance.
  More analysis is needed to get to the bottom of the problem.

- For the hackbench 2-groups scenario, the wait time seems to go up
  drastically.

Scheduler statistics of interest are listed in detail below.

Note: Units of all metrics denoting time are ms. They are processed from
      per-task schedstats in /proc/<pid>/sched.
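
For reference, a minimal sketch of this kind of per-task aggregation
(a hypothetical standalone helper, not the exact tooling used here; it
only sees tasks still alive when it is run, so it has to be sampled
while the benchmark is running, and per-thread data would need
/proc/<pid>/task/<tid>/sched instead):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>
#include <ctype.h>

/*
 * Sum one /proc/<pid>/sched field (e.g. "wait_sum") over all tasks whose
 * comm matches the given prefix. Field matching is by substring, so pass
 * a specific enough name ("wait_sum" would also match "iowait_sum").
 */
int main(int argc, char **argv)
{
	DIR *proc;
	struct dirent *de;
	double total = 0.0;

	if (argc < 3)
		return 1;

	proc = opendir("/proc");
	if (!proc)
		return 1;

	while ((de = readdir(proc))) {
		char path[64], line[256];
		FILE *f;

		if (!isdigit((unsigned char)de->d_name[0]))
			continue;

		snprintf(path, sizeof(path), "/proc/%s/sched", de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;

		/* first line is "<comm> (<pid>, #threads: N)" */
		if (fgets(line, sizeof(line), f) &&
		    !strncmp(line, argv[1], strlen(argv[1]))) {
			while (fgets(line, sizeof(line), f)) {
				char *colon = strchr(line, ':');

				if (colon && strstr(line, argv[2]))
					total += atof(colon + 1);
			}
		}
		fclose(f);
	}
	closedir(proc);

	printf("Sum of %s for %s: %f\n", argv[2], argv[1], total);
	return 0;
}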

o Hackbench (2 Groups) (NPS1)

				tip		eevdf			%diff
Comm				sched-messaging	sched-messaging		N/A
Sum of avg_atom			282.0024818	19.04355233		-93.24702669
Average of avg_atom		3.481512121	0.235105584		-93.24702669
Sum of avg_per_cpu		1761.949461	61.52537145		-96.50810805
Average of avg_per_cpu		21.75246248	0.759572487		-96.50810805
Average of avg_wait_time	0.007239228	0.012899105		78.18343632
Sum of nr_switches		4897740		4728784			-3.449672706
Sum of nr_voluntary_switches	4742512		4621606			-2.549408415
Sum of nr_involuntary_switches	155228		107178			-30.95446698
Sum of nr_wakeups		4742648		4623175			-2.51912012
Sum of nr_migrations		1263925		930600			-26.37221354
Sum of sum_exec_runtime		288481.15	262255.2574		-9.091024712
Sum of sum_idle_runtime		2576164.568	2851759.68		10.69788457
Sum of sum_sleep_runtime	76890.14753	78632.31679		2.265789982
Sum of wait_count		4897894		4728939			-3.449543824
Sum of wait_sum			3041.78227	24167.4694		694.5167422

o schbench (2 messengers, 128 workers - fully loaded) (NPS1)

				tip		eevdf		%diff
Comm				schbench	schbench	N/A
Sum of avg_atom			7538.162897	7289.565705	-3.297848503
Average of avg_atom		29.10487605	28.14504133	-3.297848503
Sum of avg_per_cpu		630248.6079	471215.3671	-25.23341406
Average of avg_per_cpu		2433.392309	1819.364352	-25.23341406
Average of avg_wait_time	0.054147456	25.34304285	46703.75524
Sum of nr_switches		85210		88176		3.480812111
Sum of nr_voluntary_switches	83165		83457		0.351109241
Sum of nr_involuntary_switches	2045		4719		130.7579462
Sum of nr_wakeups		83168		83459		0.34989419
Sum of nr_migrations		3265		3025		-7.350689127
Sum of sum_exec_runtime		2476504.52	2469058.164	-0.300680129
Sum of sum_idle_runtime		110294825.8	132520924.2	20.15153321
Sum of sum_sleep_runtime	5293337.741	5297778.714	0.083897408
Sum of sum_block_runtime	56.043253	15.12936	-73.00413664
Sum of wait_count		85615		88606		3.493546692
Sum of wait_sum			4653.340163	9605.221964	106.4156418

o schbench (2 messengers, 256 workers - overloaded) (NPS1)

				tip		eevdf		%diff
Comm				schbench	schbench	N/A
Sum of avg_atom			11676.77306	4803.485728	-58.8629007
Average of avg_atom		22.67334574	9.327156753	-58.8629007
Sum of avg_per_cpu		55235.68013	38286.47722	-30.68524343
Average of avg_per_cpu		107.2537478	74.34267421	-30.68524343
Average of avg_wait_time	2.23189096	2.58191945	15.68304621
Sum of nr_switches		202862		425258		109.6292061
Sum of nr_voluntary_switches	163079		165058		1.213522281
Sum of nr_involuntary_switches	39783		260200		554.0482115
Sum of nr_wakeups		163082		165058		1.211660392
Sum of nr_migrations		44199		54894		24.19738003
Sum of sum_exec_runtime		4586675.667	3963846.024	-13.57910801
Sum of sum_idle_runtime		201050644.2	195126863.7	-2.946412087
Sum of sum_sleep_runtime	10418117.66	10402686.4	-0.148119407
Sum of sum_block_runtime	1548.979156	516.115078	-66.68030838
Sum of wait_count		203377		425792		109.3609405
Sum of wait_sum			455609.3122	1100885.201	141.6292142

o tbench (256 clients - overloaded) (NPS1)

- tbench client
				tip		eevdf		% diff
comm				tbench		tbench		N/A
Sum of avg_atom			3.594587941	5.112101854	42.21663064
Average of avg_atom		0.013986724	0.019891447	42.21663064
Sum of avg_per_cpu		392838.0975	142065.4206	-63.83613975
Average of avg_per_cpu		1528.552909	552.7837377	-63.83613975
Average of avg_wait_time	0.010512441	0.006861579	-34.72895916
Sum of nr_switches		692845080	511780111	-26.1335433
Sum of nr_voluntary_switches	178151085	371234907	108.3820635
Sum of nr_involuntary_switches	514693995	140545204	-72.69344399
Sum of nr_wakeups		178151085	371234909	108.3820646
Sum of nr_migrations		45279		71177		57.19649286
Sum of sum_exec_runtime		9192343.465	9624025.792	4.69610746
Sum of sum_idle_runtime		7125370.721	16145736.39	126.5950365
Sum of sum_sleep_runtime	2222469.726	5792868.629	160.650058
Sum of sum_block_runtime	68.60879	446.080476	550.1797743
Sum of wait_count		692845479	511780543	-26.13352349
Sum of wait_sum			7287852.246	3297894.139	-54.7480653

- tbench server

				tip		eevdf		% diff
Comm				tbench_srv	tbench_srv	N/A
Sum of avg_atom			5.077837807	5.447267364	7.275331971
Average of avg_atom2		0.019758124	0.021195593	7.275331971
Sum of avg_per_cpu		538586.1634	87925.51225	-83.67475471
Average of avg_per_cpu2		2095.666006	342.1226158	-83.67475471
Average of avg_wait_time	0.000827346	0.006505748	686.3392261
Sum of nr_switches		692980666	511838912	-26.13951051
Sum of nr_voluntary_switches	690367607	390304935	-43.46418762
Sum of nr_involuntary_switches	2613059		121533977	4551.023073
Sum of nr_wakeups		690367607	390304935	-43.46418762
Sum of nr_migrations		39486		84474		113.9340526
Sum of sum_exec_runtime		9176708.278	8734423.401	-4.819646259
Sum of sum_idle_runtime		413900.3645	447180.3879	8.040588086
Sum of sum_sleep_runtime	8966201.976	6690818.107	-25.37734345
Sum of sum_block_runtime	1.776413	1.617435	-8.949382829
Sum of wait_count		692980942	511839229	-26.13949418
Sum of wait_sum			565739.6984	3295519.077	482.5150836

> 
> The nice -19 numbers aren't as pretty as Vincent's, but at the end I was going
> cross-eyed from staring at tree prints and I just couldn't figure out where it
> was going side-ways.
> 
> There's definitely more benchmarking/tweaking to be done (0-day already
> reported a stress-ng loss), but if we can pull this off we can delete a whole
> much of icky heuristics code. EEVDF is a much better defined policy than what
> we currently have.
> 

DeathStarBench and SpecJBB are slightly more complex to analyze. I'll
get the schedstat data for both soon. I'll rerun some of the above
workloads with NO_PRESERVE_LAG to see if that makes any difference.
In the meantime, if you need more data from the test system for any
particular workload, please do let me know. I will collect the per-task
and system-wide schedstat data for the workload as it is rather
inexpensive to collect and gives good insights but if you need any
other data, I'll be more than happy to get those too for analysis.

--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 00/10] sched: EEVDF using latency-nice
  2023-03-22  6:49 ` K Prateek Nayak
@ 2023-03-22  9:38   ` K Prateek Nayak
  0 siblings, 0 replies; 35+ messages in thread
From: K Prateek Nayak @ 2023-03-22  9:38 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, vincent.guittot
  Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, corbet, qyousef, chris.hyser, patrick.bellasi,
	pjt, pavel, qperret, tim.c.chen, joshdon, timj, yu.c.chen,
	youssefesmat, joel

Hello Peter,

One important detail I forgot to mention: When I picked eevdf commits
from your tree
(https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=sched/core),
they were based on v6.3-rc1 with the sched/eevdf HEAD at:

commit: 0dddbc0b54ad ("sched/fair: Implement an EEVDF like policy")

On 3/22/2023 12:19 PM, K Prateek Nayak wrote:
> Hello Peter,
> 
> Leaving some results from my testing on a dual socket Zen3 machine
> (2 x 64C/128T) below.
> 
> tl;dr
> 
> o I've not tested workloads with nice and latency nice yet focusing more
>   on the out of the box performance. No changes to sched_feat were made
>   for the same reason. 
> 
> o Except for hackbench (m:n communication relationship), I do not see any
>   regression for other standard benchmarks (mostly 1:1 or 1:n) relation
>   when system is below fully loaded.
> 
> o At fully loaded scenario, schbench seems to be unhappy. Looking at the
>   data from /proc/<pid>/sched for the tasks with schedstats enabled,
>   there is an increase in number of context switches and the total wait
>   sum. When system is overloaded, things flip and the schbench tail
>   latency improves drastically. I suspect the involuntary
>   context-switches help workers make progress much sooner after wakeup
>   compared to tip thus leading to lower tail latency.
> 
> o For the same reason as above, tbench throughput takes a hit with
>   number of involuntary context-switches increasing drastically for the
>   tbench server. There is also an increase in wait sum noticed.
> 
> o Couple of real world workloads were also tested. DeathStarBench
>   throughput tanks much more with the updated version in your tree
>   compared to this series as is.
>   SpecJBB Max-jOPS sees large improvements but comes at a cost of
>   drop in Critical-jOPS signifying an increase in either wait time
>   or an increase in involuntary context-switches which can lead to
>   transactions taking longer to complete.
> 
> o Apart from DeathStarBench, the all the trends reported remain same
>   comparing the version in your tree and this series, as is, applied
>   on the same base kernel.
> 
> I'll leave the detailed results below and some limited analysis. 
> 
> On 3/6/2023 6:55 PM, Peter Zijlstra wrote:
>> Hi!
>>
>> Ever since looking at the latency-nice patches, I've wondered if EEVDF would
>> not make more sense, and I did point Vincent at some older patches I had for
>> that (which is here his augmented rbtree thing comes from).
>>
>> Also, since I really dislike the dual tree, I also figured we could dynamically
>> switch between an augmented tree and not (and while I have code for that,
>> that's not included in this posting because with the current results I don't
>> think we actually need this).
>>
>> Anyway, since I'm somewhat under the weather, I spend last week desperately
>> trying to connect a small cluster of neurons in defiance of the snot overlord
>> and bring back the EEVDF patches from the dark crypts where they'd been
>> gathering cobwebs for the past 13 odd years.
>>
>> By friday they worked well enough, and this morning (because obviously I forgot
>> the weekend is ideal to run benchmarks) I ran a bunch of hackbenck, netperf,
>> tbench and sysbench -- there's a bunch of wins and losses, but nothing that
>> indicates a total fail.
>>
>> ( in fact, some of the schbench results seem to indicate EEVDF schedules a lot
>>   more consistent than CFS and has a bunch of latency wins )
>>
>> ( hackbench also doesn't show the augmented tree and generally more expensive
>>   pick to be a loss, in fact it shows a slight win here )
>>
>>
>>   hackbech load + cyclictest --policy other results:
>>
>>
>> 			EEVDF			 CFS
>>
>> 		# Min Latencies: 00053
>>   LNICE(19)	# Avg Latencies: 04350
>> 		# Max Latencies: 76019
>>
>> 		# Min Latencies: 00052		00053
>>   LNICE(0)	# Avg Latencies: 00690		00687
>> 		# Max Latencies: 14145		13913
>>
>> 		# Min Latencies: 00019
>>   LNICE(-19)	# Avg Latencies: 00261
>> 		# Max Latencies: 05642
>>
> 
> Following are the results from testing the series on a dual socket
> Zen3 machine (2 x 64C/128T):
> 
> NPS Modes are used to logically divide single socket into
> multiple NUMA region.
> Following is the NUMA configuration for each NPS mode on the system:
> 
> NPS1: Each socket is a NUMA node.
>     Total 2 NUMA nodes in the dual socket machine.
> 
>     Node 0: 0-63,   128-191
>     Node 1: 64-127, 192-255
> 
> NPS2: Each socket is further logically divided into 2 NUMA regions.
>     Total 4 NUMA nodes exist over 2 socket.
>    
>     Node 0: 0-31,   128-159
>     Node 1: 32-63,  160-191
>     Node 2: 64-95,  192-223
>     Node 3: 96-127, 223-255
> 
> NPS4: Each socket is logically divided into 4 NUMA regions.
>     Total 8 NUMA nodes exist over 2 socket.
>    
>     Node 0: 0-15,    128-143
>     Node 1: 16-31,   144-159
>     Node 2: 32-47,   160-175
>     Node 3: 48-63,   176-191
>     Node 4: 64-79,   192-207
>     Node 5: 80-95,   208-223
>     Node 6: 96-111,  223-231
>     Node 7: 112-127, 232-255
> 
> Kernel versions:
> - tip:          6.2.0-rc6 tip sched/core
> - eevdf: 	6.2.0-rc6 tip sched/core
> 		+ eevdf commits from your tree
> 		(https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/eevdf)

I had cherry picked the following commits for eevdf:

commit: b84a8f6b6fa3 ("sched: Introduce latency-nice as a per-task attribute")
commit: eea7fc6f13b4 ("sched/core: Propagate parent task's latency requirements to the child task")
commit: a143d2bcef65 ("sched: Allow sched_{get,set}attr to change latency_nice of the task")
commit: d9790468df14 ("sched/fair: Add latency_offset")
commit: 3d4d37acaba4 ("sched/fair: Add sched group latency support")
commit: 707840ffc8fa ("sched/fair: Add avg_vruntime")
commit: 394af9db316b ("sched/fair: Remove START_DEBIT")
commit: 89b2a2ee0e9d ("sched/fair: Add lag based placement")
commit: e3db9631d8ca ("rbtree: Add rb_add_augmented_cached() helper")
commit: 0dddbc0b54ad ("sched/fair: Implement an EEVDF like policy")

from the sched/eevdf branch in your tree onto the tip branch back when
I started testing. I notice some more changes have been added since then.
Queuing testing of latest changes on the updated tip:sched/core based
on v6.3-rc3. I was able to cherry pick the latest commits from
sched/eevdf cleanly.

> 
> - eevdf prev:	6.2.0-rc6 tip sched/core + this series as is
> 
> When the testing started, the tip was at:
> commit 7c4a5b89a0b5 "sched/rt: pick_next_rt_entity(): check list_entry"
> [..snip..]
> 
--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 00/10] sched: EEVDF using latency-nice
  2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
                   ` (12 preceding siblings ...)
  2023-03-22  6:49 ` K Prateek Nayak
@ 2023-03-23 11:53 ` Pavel Machek
  13 siblings, 0 replies; 35+ messages in thread
From: Pavel Machek @ 2023-03-23 11:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, qperret, tim.c.chen,
	joshdon, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel

[-- Attachment #1: Type: text/plain, Size: 426 bytes --]

Hi!

> Ever since looking at the latency-nice patches, I've wondered if EEVDF would
> not make more sense, and I did point Vincent at some older patches I had for
> that (which is here his augmented rbtree thing comes from).

Link for context: https://lwn.net/Articles/925371/ . "EEVDF" is not
a commonly known acronym :-).

BR,								Pavel


-- 
People of Russia, stop Putin before his war on Ukraine escalates.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 06/10] sched/fair: Add avg_vruntime
  2023-03-21 16:04     ` Peter Zijlstra
@ 2023-03-24  7:12       ` Chen Yu
  2023-03-24 10:03         ` Peter Zijlstra
  0 siblings, 1 reply; 35+ messages in thread
From: Chen Yu @ 2023-03-24  7:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, youssefesmat, joel

On 2023-03-21 at 17:04:58 +0100, Peter Zijlstra wrote:
> On Tue, Mar 21, 2023 at 09:58:13PM +0800, Chen Yu wrote:
> > On 2023-03-06 at 14:25:27 +0100, Peter Zijlstra wrote:
> > [...]
> > >  
> > > +/*
> > > + * Compute virtual time from the per-task service numbers:
> > > + *
> > > + * Fair schedulers conserve lag: \Sum lag_i = 0
> > > + *
> > > + * lag_i = S - s_i = w_i * (V - v_i)
> > > + *
> > The definination of above lag_i seems to be inconsistent with the defininatin
> > of se->lag in PATCH 8. Maybe rename lag_i to something other to avoid confusion?
> 
> Yeah, I ran into that the other day, I think I'll introduce vlag_i = V - v_i
> or so.
> 
> > > + * \Sum lag_i = 0 -> \Sum w_i * (V - v_i) = V * \Sum w_i - \Sum w_i * v_i = 0
> > > + *
> > > + * From which we solve V:
> > > + *
> > > + *     \Sum v_i * w_i
> > > + * V = --------------
> > > + *        \Sum w_i
> > > + *
> > > + * However, since v_i is u64, and the multiplcation could easily overflow
> > > + * transform it into a relative form that uses smaller quantities:
> > > + *
> > > + * Substitute: v_i == (v_i - v) + v
> > > + *
> > > + *     \Sum ((v_i - v) + v) * w_i   \Sum (v_i - v) * w_i
> > > + * V = -------------------------- = -------------------- + v
> > > + *              \Sum w_i                   \Sum w_i
> > > + *
> > > + *
> 
> > Not sure if I understand it correctly, does it mean  (v_i - v) * w_i will not
> > overflow? If the weight of task is 15 (nice 19), then if v_i - v > (S64_MAX / 15)
> > it gets overflow. Is it possible that v_i is much larger than cfs_rq->min_vruntime
> > in this case?
> 
> Or worse, SCHED_IDLE, where weight is 2 (IIRC) or cgroups, then vtime
> advances at 512 times realtime. Now, the tick puts a limit on how long
> we'll overshoot these super low weight entities, for HZ=1000 we still
> only get 0.5s of vtime for weight=2.
>
> That would be only 30 bits used, except we use double FIXEDPOINT_SHIFT
> on 64bit, so we'll end up at 40-ish.
> 
> That should give us enough room to carry an average of deltas around
> min_vruntime.
> 
I'm trying to digest how ticks could prevent the overflow.
In update_curr() -> update_min_vruntime(cfs_rq), the cfs_rq->min_vruntime
is set to
max (cfs_rq->min_vruntime, min(curr->vruntime, leftmost(se->vruntime)))
so, although curr->vruntime increases by 0.5 seconds in each tick,
the leftmost(se->vruntime) could still be very small and unchanged,
thus the delta between v_i and cfs_rq->min_vruntime is still large.
Instead, sysctl_sched_latency could decide how far apart se.vruntime
and cfs_rq.min_vruntime are, by calculating the vruntime
delta between task1 and task2:

    sched_vslice(task1) = (NICE0_LOAD/se1.weight)  * (w1/Sum wi * sysctl_sched_latency)
    sched_vslice(task2) = (NICE0_LOAD/se2.weight)  * (w2/Sum wi * sysctl_sched_latency)

Besides, in patch 10, entity_eligible() checks
\Sum (v_i - v)*w_i >= (v_i - v)*(\Sum w_i)
and the \Sum w_i could become large if there are many runnable tasks and
bring overflow? But yes, if the v_i - v is very small we can get rid of this
issue IMO.

thanks,
Chenyu
> But yes, I've seen this go sideways and I need to stare a bit more at
> this. One of the things I've considered is changing the min_vruntime
> update rules to instead move to avg_vruntime() to minimize the deltas.
> But I've not yet actually written that code.
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 06/10] sched/fair: Add avg_vruntime
  2023-03-24  7:12       ` Chen Yu
@ 2023-03-24 10:03         ` Peter Zijlstra
  0 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-24 10:03 UTC (permalink / raw)
  To: Chen Yu
  Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
	qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
	tim.c.chen, joshdon, timj, kprateek.nayak, youssefesmat, joel

On Fri, Mar 24, 2023 at 03:12:19PM +0800, Chen Yu wrote:

> > Or worse, SCHED_IDLE, where weight is 2 (IIRC) or cgroups, then vtime
> > advances at 512 times realtime. Now, the tick puts a limit on how long
> > we'll overshoot these super low weight entities, for HZ=1000 we still
> > only get 0.5s of vtime for weight=2.
> >
> > That would be only 30 bits used, except we use double FIXEDPOINT_SHIFT
> > on 64bit, so we'll end up at 40-ish.
> > 
> > That should give us enough room to carry an average of deltas around
> > min_vruntime.
> > 
> I'm trying to digest how ticks could prevent the overflow.

They don't prevent overflow per se, but they do limit how far
vruntime can advance ahead of the pack.

> In update_curr() -> update_min_vruntime(cfs_rq), the cfs_rq->min_vruntime
> is set to
> max (cfs_rq->min_vruntime, min(curr->vruntime, leftmost(se->vruntime)))
> so, although curr->vruntime increase by 0.5 seconds in each tick,
> the leftmost(se->vruntime) could still be very small and unchanged,
> thus the delta between v_i and cfs_rq->min_vruntime is still large.

Well, since the basic task selection rule is: pick leftmost, the idea is
that leftmost and hence min_vruntime advances. The only problem is that
placement can place new entities left of min_vruntime and then it stalls
for a bit. But then rightmost tasks shouldn't get more runtime and the
whole situation should be 'stable'-ish.

> Instead sysctl_sched_latency could decide how far it is between the
> se.vruntime and the cfs_rq.min_vruntime, by calculating the vruntime
> delta between task1 and task2:
> 
>     sched_vslice(task1) = (NICE0_LOAD/se1.weight)  * (w1/Sum wi * sysctl_sched_latency)
>     sched_vslice(task2) = (NICE0_LOAD/se2.weight)  * (w2/Sum wi * sysctl_sched_latency)

Yes, vslice is obviously involved, but low weight tasks are the ones
that tend to shoot away and are tick limited.

> Besides in patch 10, entity_eligible() checks
> \Sum (v_i - v)*w_i >= (v_i - v)*(\Sum w_i)
> and the \Sum w_i could become large if there are many runnable tasks and
> bring overflow?

Indeed; I'll check there too. I think I'll make it do the division on
32bit and use 64x64->128 on 64bit.

Let me have a play..
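
Roughly along these lines for the eligibility test -- a sketch only
(entity_eligible_sketch is a made-up name), and it assumes the
accumulated \Sum (v_i - v) * w_i itself still fits in s64:

#include <stdint.h>
#include <stdbool.h>

/*
 * Eligible iff:  \Sum (v_i - v) * w_i  >=  (v_i - v) * \Sum w_i
 * Do the right-hand multiply in 128bit so a large key * load product
 * cannot wrap.
 */
static bool entity_eligible_sketch(int64_t avg_vruntime,	/* \Sum (v_i - v) * w_i */
				   uint64_t avg_load,		/* \Sum w_i             */
				   int64_t key)			/* v_i - v              */
{
	return (__int128)avg_vruntime >= (__int128)key * (__int128)avg_load;
}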

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2023-03-24 10:05 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
2023-03-06 13:25 ` [PATCH 01/10] sched: Introduce latency-nice as a per-task attribute Peter Zijlstra
2023-03-06 13:25 ` [PATCH 02/10] sched/core: Propagate parent tasks latency requirements to the child task Peter Zijlstra
2023-03-06 13:25 ` [PATCH 03/10] sched: Allow sched_{get,set}attr to change latency_nice of the task Peter Zijlstra
2023-03-06 13:25 ` [PATCH 04/10] sched/fair: Add latency_offset Peter Zijlstra
2023-03-06 13:25 ` [PATCH 05/10] sched/fair: Add sched group latency support Peter Zijlstra
2023-03-06 13:25 ` [PATCH 06/10] sched/fair: Add avg_vruntime Peter Zijlstra
2023-03-21 13:58   ` Chen Yu
2023-03-21 16:04     ` Peter Zijlstra
2023-03-24  7:12       ` Chen Yu
2023-03-24 10:03         ` Peter Zijlstra
2023-03-06 13:25 ` [PATCH 07/10] sched/fair: Remove START_DEBIT Peter Zijlstra
2023-03-06 13:25 ` [PATCH 08/10] sched/fair: Add lag based placement Peter Zijlstra
2023-03-16 22:49   ` Tim Chen
2023-03-06 13:25 ` [PATCH 09/10] rbtree: Add rb_add_augmented_cached() helper Peter Zijlstra
2023-03-06 13:25 ` [PATCH 10/10] sched/fair: Implement an EEVDF like policy Peter Zijlstra
2023-03-08  8:39   ` Mike Galbraith
2023-03-08  9:26     ` Mike Galbraith
2023-03-08 13:36     ` Mike Galbraith
2023-03-09  4:23       ` Mike Galbraith
2023-03-10 20:38         ` Peter Zijlstra
2023-03-11  5:53           ` Mike Galbraith
2023-03-11  7:56             ` Mike Galbraith
2023-03-09  9:06       ` Peter Zijlstra
2023-03-09 12:44         ` Peter Zijlstra
2023-03-09 15:29           ` Peter Zijlstra
2023-03-09 15:39             ` Peter Zijlstra
2023-03-09 16:24             ` Mike Galbraith
2023-03-09 16:42             ` Peter Zijlstra
2023-03-07 10:27 ` [PATCH 00/10] sched: EEVDF using latency-nice Vincent Guittot
2023-03-07 13:08   ` Peter Zijlstra
2023-03-08 15:13 ` Shrikanth Hegde
2023-03-22  6:49 ` K Prateek Nayak
2023-03-22  9:38   ` K Prateek Nayak
2023-03-23 11:53 ` Pavel Machek

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.