* [PATCH 01/10] sched: Introduce latency-nice as a per-task attribute
2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
@ 2023-03-06 13:25 ` Peter Zijlstra
2023-03-06 13:25 ` [PATCH 02/10] sched/core: Propagate parent tasks latency requirements to the child task Peter Zijlstra
` (12 subsequent siblings)
13 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-06 13:25 UTC (permalink / raw)
To: mingo, vincent.guittot
Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
kprateek.nayak, yu.c.chen, youssefesmat, joel, Parth Shah
From: Parth Shah <parth@linux.ibm.com>
Latency-nice indicates the latency requirements of a task relative to
the other tasks in the system. The attribute takes values in the range
[-20, 19], both inclusive, mirroring task nice values.
A task with latency_nice = -20 requires the lowest latency, as compared
to a task with latency_nice = +19.
For now, latency_nice only conveys the userspace latency requirement to
the CFS scheduling class.
Additionally, add debugging bits for newly added latency_nice attribute.
[rebase, move defines in sched/prio.h]
Signed-off-by: Parth Shah <parth@linux.ibm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20230224093454.956298-3-vincent.guittot@linaro.org
---
include/linux/sched.h | 1 +
include/linux/sched/prio.h | 18 ++++++++++++++++++
kernel/sched/debug.c | 1 +
3 files changed, 20 insertions(+)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -784,6 +784,7 @@ struct task_struct {
int static_prio;
int normal_prio;
unsigned int rt_priority;
+ int latency_nice;
struct sched_entity se;
struct sched_rt_entity rt;
Index: linux-2.6/include/linux/sched/prio.h
===================================================================
--- linux-2.6.orig/include/linux/sched/prio.h
+++ linux-2.6/include/linux/sched/prio.h
@@ -42,4 +42,22 @@ static inline long rlimit_to_nice(long p
return (MAX_NICE - prio + 1);
}
+/*
+ * Latency nice is meant to provide scheduler hints about the relative
+ * latency requirements of a task with respect to other tasks.
+ * Thus a task with latency_nice == 19 can be hinted as the task with no
+ * latency requirements, in contrast to the task with latency_nice == -20
+ * which should be given priority in terms of lower latency.
+ */
+#define MAX_LATENCY_NICE 19
+#define MIN_LATENCY_NICE -20
+
+#define LATENCY_NICE_WIDTH \
+ (MAX_LATENCY_NICE - MIN_LATENCY_NICE + 1)
+
+/*
+ * Default tasks should be treated as a task with latency_nice = 0.
+ */
+#define DEFAULT_LATENCY_NICE 0
+
#endif /* _LINUX_SCHED_PRIO_H */
Index: linux-2.6/kernel/sched/debug.c
===================================================================
--- linux-2.6.orig/kernel/sched/debug.c
+++ linux-2.6/kernel/sched/debug.c
@@ -1043,6 +1043,7 @@ void proc_sched_show_task(struct task_st
#endif
P(policy);
P(prio);
+ P(latency_nice);
if (task_has_dl_policy(p)) {
P(dl.runtime);
P(dl.deadline);
* [PATCH 02/10] sched/core: Propagate parent tasks latency requirements to the child task
2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
2023-03-06 13:25 ` [PATCH 01/10] sched: Introduce latency-nice as a per-task attribute Peter Zijlstra
@ 2023-03-06 13:25 ` Peter Zijlstra
2023-03-06 13:25 ` [PATCH 03/10] sched: Allow sched_{get,set}attr to change latency_nice of the task Peter Zijlstra
` (11 subsequent siblings)
13 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-06 13:25 UTC (permalink / raw)
To: mingo, vincent.guittot
Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
kprateek.nayak, yu.c.chen, youssefesmat, joel, Parth Shah
From: Parth Shah <parth@linux.ibm.com>
Clone the parent task's latency_nice attribute to the forked child task.
Reset the latency_nice value to the default when the child task has
sched_reset_on_fork set.
Also, initialize init_task.latency_nice with DEFAULT_LATENCY_NICE.
[rebase]
Signed-off-by: Parth Shah <parth@linux.ibm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20230224093454.956298-4-vincent.guittot@linaro.org
---
init/init_task.c | 1 +
kernel/sched/core.c | 1 +
2 files changed, 2 insertions(+)
Index: linux-2.6/init/init_task.c
===================================================================
--- linux-2.6.orig/init/init_task.c
+++ linux-2.6/init/init_task.c
@@ -78,6 +78,7 @@ struct task_struct init_task
.prio = MAX_PRIO - 20,
.static_prio = MAX_PRIO - 20,
.normal_prio = MAX_PRIO - 20,
+ .latency_nice = DEFAULT_LATENCY_NICE,
.policy = SCHED_NORMAL,
.cpus_ptr = &init_task.cpus_mask,
.user_cpus_ptr = NULL,
Index: linux-2.6/kernel/sched/core.c
===================================================================
--- linux-2.6.orig/kernel/sched/core.c
+++ linux-2.6/kernel/sched/core.c
@@ -4684,6 +4684,7 @@ int sched_fork(unsigned long clone_flags
p->prio = p->normal_prio = p->static_prio;
set_load_weight(p, false);
+ p->latency_nice = DEFAULT_LATENCY_NICE;
/*
* We don't need the reset flag anymore after the fork. It has
* fulfilled its duty:
* [PATCH 03/10] sched: Allow sched_{get,set}attr to change latency_nice of the task
2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
2023-03-06 13:25 ` [PATCH 01/10] sched: Introduce latency-nice as a per-task attribute Peter Zijlstra
2023-03-06 13:25 ` [PATCH 02/10] sched/core: Propagate parent tasks latency requirements to the child task Peter Zijlstra
@ 2023-03-06 13:25 ` Peter Zijlstra
2023-03-06 13:25 ` [PATCH 04/10] sched/fair: Add latency_offset Peter Zijlstra
` (10 subsequent siblings)
13 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-06 13:25 UTC (permalink / raw)
To: mingo, vincent.guittot
Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
kprateek.nayak, yu.c.chen, youssefesmat, joel, Parth Shah
From: Parth Shah <parth@linux.ibm.com>
Introduce the latency_nice attribute to sched_attr and provide a
mechanism to change the value via the sched_setattr()/sched_getattr()
syscalls.
Also add a new flag, SCHED_FLAG_LATENCY_NICE, which must be set in
sched_setattr() to request a change of the task's latency_nice.
[rebase and add a dedicated __setscheduler_latency]
Signed-off-by: Parth Shah <parth@linux.ibm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20230224093454.956298-5-vincent.guittot@linaro.org
---
include/uapi/linux/sched.h | 4 +++-
include/uapi/linux/sched/types.h | 19 +++++++++++++++++++
kernel/sched/core.c | 24 ++++++++++++++++++++++++
tools/include/uapi/linux/sched.h | 4 +++-
4 files changed, 49 insertions(+), 2 deletions(-)
Index: linux-2.6/include/uapi/linux/sched.h
===================================================================
--- linux-2.6.orig/include/uapi/linux/sched.h
+++ linux-2.6/include/uapi/linux/sched.h
@@ -132,6 +132,7 @@ struct clone_args {
#define SCHED_FLAG_KEEP_PARAMS 0x10
#define SCHED_FLAG_UTIL_CLAMP_MIN 0x20
#define SCHED_FLAG_UTIL_CLAMP_MAX 0x40
+#define SCHED_FLAG_LATENCY_NICE 0x80
#define SCHED_FLAG_KEEP_ALL (SCHED_FLAG_KEEP_POLICY | \
SCHED_FLAG_KEEP_PARAMS)
@@ -143,6 +144,7 @@ struct clone_args {
SCHED_FLAG_RECLAIM | \
SCHED_FLAG_DL_OVERRUN | \
SCHED_FLAG_KEEP_ALL | \
- SCHED_FLAG_UTIL_CLAMP)
+ SCHED_FLAG_UTIL_CLAMP | \
+ SCHED_FLAG_LATENCY_NICE)
#endif /* _UAPI_LINUX_SCHED_H */
Index: linux-2.6/include/uapi/linux/sched/types.h
===================================================================
--- linux-2.6.orig/include/uapi/linux/sched/types.h
+++ linux-2.6/include/uapi/linux/sched/types.h
@@ -10,6 +10,7 @@ struct sched_param {
#define SCHED_ATTR_SIZE_VER0 48 /* sizeof first published struct */
#define SCHED_ATTR_SIZE_VER1 56 /* add: util_{min,max} */
+#define SCHED_ATTR_SIZE_VER2 60 /* add: latency_nice */
/*
* Extended scheduling parameters data structure.
@@ -98,6 +99,22 @@ struct sched_param {
* scheduled on a CPU with no more capacity than the specified value.
*
* A task utilization boundary can be reset by setting the attribute to -1.
+ *
+ * Latency Tolerance Attributes
+ * ============================
+ *
+ * A subset of sched_attr attributes allows to specify the relative latency
+ * requirements of a task with respect to the other tasks running/queued in the
+ * system.
+ *
+ * @ sched_latency_nice task's latency_nice value
+ *
+ * The latency_nice of a task can have any value in a range of
+ * [MIN_LATENCY_NICE..MAX_LATENCY_NICE].
+ *
+ * A task with latency_nice == MIN_LATENCY_NICE can be assumed to
+ * require lower latency than a task with a higher latency_nice
+ * value.
*/
struct sched_attr {
__u32 size;
@@ -120,6 +137,8 @@ struct sched_attr {
__u32 sched_util_min;
__u32 sched_util_max;
+ /* latency requirement hints */
+ __s32 sched_latency_nice;
};
#endif /* _UAPI_LINUX_SCHED_TYPES_H */
Index: linux-2.6/kernel/sched/core.c
===================================================================
--- linux-2.6.orig/kernel/sched/core.c
+++ linux-2.6/kernel/sched/core.c
@@ -7451,6 +7451,13 @@ static void __setscheduler_params(struct
p->rt_priority = attr->sched_priority;
p->normal_prio = normal_prio(p);
set_load_weight(p, true);
+}
+
+static void __setscheduler_latency(struct task_struct *p,
+ const struct sched_attr *attr)
+{
+ if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
+ p->latency_nice = attr->sched_latency_nice;
}
/*
@@ -7593,6 +7601,13 @@ recheck:
return retval;
}
+ if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
+ if (attr->sched_latency_nice > MAX_LATENCY_NICE)
+ return -EINVAL;
+ if (attr->sched_latency_nice < MIN_LATENCY_NICE)
+ return -EINVAL;
+ }
+
if (pi)
cpuset_read_lock();
@@ -7627,6 +7642,9 @@ recheck:
goto change;
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
goto change;
+ if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE &&
+ attr->sched_latency_nice != p->latency_nice)
+ goto change;
p->sched_reset_on_fork = reset_on_fork;
retval = 0;
@@ -7715,6 +7733,7 @@ change:
__setscheduler_params(p, attr);
__setscheduler_prio(p, newprio);
}
+ __setscheduler_latency(p, attr);
__setscheduler_uclamp(p, attr);
if (queued) {
@@ -7925,6 +7944,9 @@ static int sched_copy_attr(struct sched_
size < SCHED_ATTR_SIZE_VER1)
return -EINVAL;
+ if ((attr->sched_flags & SCHED_FLAG_LATENCY_NICE) &&
+ size < SCHED_ATTR_SIZE_VER2)
+ return -EINVAL;
/*
* XXX: Do we want to be lenient like existing syscalls; or do we want
* to be strict and return an error on out-of-bounds values?
@@ -8162,6 +8184,8 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pi
get_params(p, &kattr);
kattr.sched_flags &= SCHED_FLAG_ALL;
+ kattr.sched_latency_nice = p->latency_nice;
+
#ifdef CONFIG_UCLAMP_TASK
/*
* This could race with another potential updater, but this is fine
Index: linux-2.6/tools/include/uapi/linux/sched.h
===================================================================
--- linux-2.6.orig/tools/include/uapi/linux/sched.h
+++ linux-2.6/tools/include/uapi/linux/sched.h
@@ -132,6 +132,7 @@ struct clone_args {
#define SCHED_FLAG_KEEP_PARAMS 0x10
#define SCHED_FLAG_UTIL_CLAMP_MIN 0x20
#define SCHED_FLAG_UTIL_CLAMP_MAX 0x40
+#define SCHED_FLAG_LATENCY_NICE 0x80
#define SCHED_FLAG_KEEP_ALL (SCHED_FLAG_KEEP_POLICY | \
SCHED_FLAG_KEEP_PARAMS)
@@ -143,6 +144,7 @@ struct clone_args {
SCHED_FLAG_RECLAIM | \
SCHED_FLAG_DL_OVERRUN | \
SCHED_FLAG_KEEP_ALL | \
- SCHED_FLAG_UTIL_CLAMP)
+ SCHED_FLAG_UTIL_CLAMP | \
+ SCHED_FLAG_LATENCY_NICE)
#endif /* _UAPI_LINUX_SCHED_H */
* [PATCH 04/10] sched/fair: Add latency_offset
2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
` (2 preceding siblings ...)
2023-03-06 13:25 ` [PATCH 03/10] sched: Allow sched_{get,set}attr to change latency_nice of the task Peter Zijlstra
@ 2023-03-06 13:25 ` Peter Zijlstra
2023-03-06 13:25 ` [PATCH 05/10] sched/fair: Add sched group latency support Peter Zijlstra
` (9 subsequent siblings)
13 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-06 13:25 UTC (permalink / raw)
To: mingo, vincent.guittot
Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
kprateek.nayak, yu.c.chen, youssefesmat, joel
From: Vincent Guittot <vincent.guittot@linaro.org>
XXX fold back into previous patches
Murdered-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
include/linux/sched.h | 4 +++-
include/linux/sched/prio.h | 9 +++++++++
init/init_task.c | 2 +-
kernel/sched/core.c | 21 ++++++++++++++++-----
kernel/sched/debug.c | 2 +-
kernel/sched/fair.c | 8 ++++++++
kernel/sched/sched.h | 2 ++
7 files changed, 40 insertions(+), 8 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -568,6 +568,8 @@ struct sched_entity {
/* cached value of my_q->h_nr_running */
unsigned long runnable_weight;
#endif
+ /* preemption offset in ns */
+ long latency_offset;
#ifdef CONFIG_SMP
/*
@@ -784,7 +786,7 @@ struct task_struct {
int static_prio;
int normal_prio;
unsigned int rt_priority;
- int latency_nice;
+ int latency_prio;
struct sched_entity se;
struct sched_rt_entity rt;
--- a/include/linux/sched/prio.h
+++ b/include/linux/sched/prio.h
@@ -59,5 +59,14 @@ static inline long rlimit_to_nice(long p
* Default tasks should be treated as a task with latency_nice = 0.
*/
#define DEFAULT_LATENCY_NICE 0
+#define DEFAULT_LATENCY_PRIO (DEFAULT_LATENCY_NICE + LATENCY_NICE_WIDTH/2)
+
+/*
+ * Convert user-nice values [ -20 ... 0 ... 19 ]
+ * to static latency [ 0..39 ],
+ * and back.
+ */
+#define NICE_TO_LATENCY(nice) ((nice) + DEFAULT_LATENCY_PRIO)
+#define LATENCY_TO_NICE(prio) ((prio) - DEFAULT_LATENCY_PRIO)
#endif /* _LINUX_SCHED_PRIO_H */
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -78,7 +78,7 @@ struct task_struct init_task
.prio = MAX_PRIO - 20,
.static_prio = MAX_PRIO - 20,
.normal_prio = MAX_PRIO - 20,
- .latency_nice = DEFAULT_LATENCY_NICE,
+ .latency_prio = DEFAULT_LATENCY_PRIO,
.policy = SCHED_NORMAL,
.cpus_ptr = &init_task.cpus_mask,
.user_cpus_ptr = NULL,
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1285,6 +1285,11 @@ static void set_load_weight(struct task_
}
}
+static void set_latency_offset(struct task_struct *p)
+{
+ p->se.latency_offset = calc_latency_offset(p->latency_prio);
+}
+
#ifdef CONFIG_UCLAMP_TASK
/*
* Serializes updates of utilization clamp values
@@ -4433,6 +4438,8 @@ static void __sched_fork(unsigned long c
p->se.vruntime = 0;
INIT_LIST_HEAD(&p->se.group_node);
+ set_latency_offset(p);
+
#ifdef CONFIG_FAIR_GROUP_SCHED
p->se.cfs_rq = NULL;
#endif
@@ -4684,7 +4691,9 @@ int sched_fork(unsigned long clone_flags
p->prio = p->normal_prio = p->static_prio;
set_load_weight(p, false);
- p->latency_nice = DEFAULT_LATENCY_NICE;
+ p->latency_prio = NICE_TO_LATENCY(0);
+ set_latency_offset(p);
+
/*
* We don't need the reset flag anymore after the fork. It has
* fulfilled its duty:
@@ -7456,8 +7465,10 @@ static void __setscheduler_params(struct
static void __setscheduler_latency(struct task_struct *p,
const struct sched_attr *attr)
{
- if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
- p->latency_nice = attr->sched_latency_nice;
+ if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
+ p->latency_prio = NICE_TO_LATENCY(attr->sched_latency_nice);
+ set_latency_offset(p);
+ }
}
/*
@@ -7642,7 +7653,7 @@ static int __sched_setscheduler(struct t
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
goto change;
if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE &&
- attr->sched_latency_nice != p->latency_nice)
+ attr->sched_latency_nice != LATENCY_TO_NICE(p->latency_prio))
goto change;
p->sched_reset_on_fork = reset_on_fork;
@@ -8183,7 +8194,7 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pi
get_params(p, &kattr);
kattr.sched_flags &= SCHED_FLAG_ALL;
- kattr.sched_latency_nice = p->latency_nice;
+ kattr.sched_latency_nice = LATENCY_TO_NICE(p->latency_prio);
#ifdef CONFIG_UCLAMP_TASK
/*
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1043,7 +1043,7 @@ void proc_sched_show_task(struct task_st
#endif
P(policy);
P(prio);
- P(latency_nice);
+ P(latency_prio);
if (task_has_dl_policy(p)) {
P(dl.runtime);
P(dl.deadline);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -703,6 +703,14 @@ int sched_update_scaling(void)
}
#endif
+long calc_latency_offset(int prio)
+{
+ u32 weight = sched_prio_to_weight[prio];
+ u64 base = sysctl_sched_min_granularity;
+
+ return div_u64(base << SCHED_FIXEDPOINT_SHIFT, weight);
+}
+
/*
* delta /= w
*/
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2475,6 +2475,8 @@ extern unsigned int sysctl_numa_balancin
extern unsigned int sysctl_numa_balancing_hot_threshold;
#endif
+extern long calc_latency_offset(int prio);
+
#ifdef CONFIG_SCHED_HRTICK
/*
* [PATCH 05/10] sched/fair: Add sched group latency support
2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
` (3 preceding siblings ...)
2023-03-06 13:25 ` [PATCH 04/10] sched/fair: Add latency_offset Peter Zijlstra
@ 2023-03-06 13:25 ` Peter Zijlstra
2023-03-06 13:25 ` [PATCH 06/10] sched/fair: Add avg_vruntime Peter Zijlstra
` (8 subsequent siblings)
13 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-06 13:25 UTC (permalink / raw)
To: mingo, vincent.guittot
Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
kprateek.nayak, yu.c.chen, youssefesmat, joel
From: Vincent Guittot <vincent.guittot@linaro.org>
A task can set its latency priority with sched_setattr(), which is then
used to set the latency offset of its sched_entity, but sched group
entities still have the default latency offset value.
Add a latency.nice field in the cpu cgroup controller to set the latency
priority of the group, similarly to sched_setattr(). The latency priority
is then used to set the offset of the group's sched_entities.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20230224093454.956298-7-vincent.guittot@linaro.org
---
Documentation/admin-guide/cgroup-v2.rst | 10 ++++++++++
kernel/sched/core.c | 30 ++++++++++++++++++++++++++++++
kernel/sched/fair.c | 32 ++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 4 ++++
4 files changed, 76 insertions(+)
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1121,6 +1121,16 @@ All time durations are in microseconds.
values similar to the sched_setattr(2). This maximum utilization
value is used to clamp the task specific maximum utilization clamp.
+ cpu.latency.nice
+ A read-write single value file which exists on non-root
+ cgroups. The default is "0".
+
+ The nice value is in the range [-20, 19].
+
+ This interface file allows reading and setting latency using the
+ same values used by sched_setattr(2). The latency_nice of a group is
+ used to limit the impact of the latency_nice of a task outside the
+ group.
Memory
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11068,6 +11068,25 @@ static int cpu_idle_write_s64(struct cgr
{
return sched_group_set_idle(css_tg(css), idle);
}
+
+static s64 cpu_latency_nice_read_s64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return LATENCY_TO_NICE(css_tg(css)->latency_prio);
+}
+
+static int cpu_latency_nice_write_s64(struct cgroup_subsys_state *css,
+ struct cftype *cft, s64 nice)
+{
+ int prio;
+
+ if (nice < MIN_LATENCY_NICE || nice > MAX_LATENCY_NICE)
+ return -ERANGE;
+
+ prio = NICE_TO_LATENCY(nice);
+
+ return sched_group_set_latency(css_tg(css), prio);
+}
#endif
static struct cftype cpu_legacy_files[] = {
@@ -11082,6 +11101,11 @@ static struct cftype cpu_legacy_files[]
.read_s64 = cpu_idle_read_s64,
.write_s64 = cpu_idle_write_s64,
},
+ {
+ .name = "latency.nice",
+ .read_s64 = cpu_latency_nice_read_s64,
+ .write_s64 = cpu_latency_nice_write_s64,
+ },
#endif
#ifdef CONFIG_CFS_BANDWIDTH
{
@@ -11299,6 +11323,12 @@ static struct cftype cpu_files[] = {
.read_s64 = cpu_idle_read_s64,
.write_s64 = cpu_idle_write_s64,
},
+ {
+ .name = "latency.nice",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_s64 = cpu_latency_nice_read_s64,
+ .write_s64 = cpu_latency_nice_write_s64,
+ },
#endif
#ifdef CONFIG_CFS_BANDWIDTH
{
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12264,6 +12264,7 @@ int alloc_fair_sched_group(struct task_g
goto err;
tg->shares = NICE_0_LOAD;
+ tg->latency_prio = DEFAULT_LATENCY_PRIO;
init_cfs_bandwidth(tg_cfs_bandwidth(tg));
@@ -12362,6 +12363,9 @@ void init_tg_cfs_entry(struct task_group
}
se->my_q = cfs_rq;
+
+ se->latency_offset = calc_latency_offset(tg->latency_prio);
+
/* guarantee group entities always have weight */
update_load_set(&se->load, NICE_0_LOAD);
se->parent = parent;
@@ -12490,6 +12494,34 @@ int sched_group_set_idle(struct task_gro
mutex_unlock(&shares_mutex);
return 0;
+}
+
+int sched_group_set_latency(struct task_group *tg, int prio)
+{
+ long latency_offset;
+ int i;
+
+ if (tg == &root_task_group)
+ return -EINVAL;
+
+ mutex_lock(&shares_mutex);
+
+ if (tg->latency_prio == prio) {
+ mutex_unlock(&shares_mutex);
+ return 0;
+ }
+
+ tg->latency_prio = prio;
+ latency_offset = calc_latency_offset(prio);
+
+ for_each_possible_cpu(i) {
+ struct sched_entity *se = tg->se[i];
+
+ WRITE_ONCE(se->latency_offset, latency_offset);
+ }
+
+ mutex_unlock(&shares_mutex);
+ return 0;
}
#else /* CONFIG_FAIR_GROUP_SCHED */
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -378,6 +378,8 @@ struct task_group {
/* A positive value indicates that this is a SCHED_IDLE group. */
int idle;
+ /* latency priority of the group. */
+ int latency_prio;
#ifdef CONFIG_SMP
/*
@@ -488,6 +490,8 @@ extern int sched_group_set_shares(struct
extern int sched_group_set_idle(struct task_group *tg, long idle);
+extern int sched_group_set_latency(struct task_group *tg, int prio);
+
#ifdef CONFIG_SMP
extern void set_task_rq_fair(struct sched_entity *se,
struct cfs_rq *prev, struct cfs_rq *next);
* [PATCH 06/10] sched/fair: Add avg_vruntime
2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
` (4 preceding siblings ...)
2023-03-06 13:25 ` [PATCH 05/10] sched/fair: Add sched group latency support Peter Zijlstra
@ 2023-03-06 13:25 ` Peter Zijlstra
2023-03-21 13:58 ` Chen Yu
2023-03-06 13:25 ` [PATCH 07/10] sched/fair: Remove START_DEBIT Peter Zijlstra
` (7 subsequent siblings)
13 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-06 13:25 UTC (permalink / raw)
To: mingo, vincent.guittot
Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
kprateek.nayak, yu.c.chen, youssefesmat, joel
In order to move to an eligibility based scheduling policy, it is
necessary to have a better approximation of the ideal scheduler.
Specifically, for a virtual-time weighted fair queueing based
scheduler, the ideal scheduler is the weighted average of the
individual virtual runtimes (math in the comment).
As such, compute the weighted average to approximate the ideal
scheduler -- note that the approximation is in the individual task
behaviour, which isn't strictly conformant.
Specifically, consider adding a task with a vruntime left of center; in
this case the average will move backwards in time -- something the
ideal scheduler would of course never do.
[ Note: try and replace cfs_rq::avg_load with cfs_rq->load.weight, the
conditions are slightly different but should be possible ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/debug.c | 32 +++++++--------
kernel/sched/fair.c | 105 +++++++++++++++++++++++++++++++++++++++++++++++++--
kernel/sched/sched.h | 5 ++
3 files changed, 122 insertions(+), 20 deletions(-)
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -580,10 +580,9 @@ static void print_rq(struct seq_file *m,
void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
{
- s64 MIN_vruntime = -1, min_vruntime, max_vruntime = -1,
- spread, rq0_min_vruntime, spread0;
+ s64 left_vruntime = -1, min_vruntime, right_vruntime = -1, spread;
+ struct sched_entity *last, *first;
struct rq *rq = cpu_rq(cpu);
- struct sched_entity *last;
unsigned long flags;
#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -597,26 +596,25 @@ void print_cfs_rq(struct seq_file *m, in
SPLIT_NS(cfs_rq->exec_clock));
raw_spin_rq_lock_irqsave(rq, flags);
- if (rb_first_cached(&cfs_rq->tasks_timeline))
- MIN_vruntime = (__pick_first_entity(cfs_rq))->vruntime;
+ first = __pick_first_entity(cfs_rq);
+ if (first)
+ left_vruntime = first->vruntime;
last = __pick_last_entity(cfs_rq);
if (last)
- max_vruntime = last->vruntime;
+ right_vruntime = last->vruntime;
min_vruntime = cfs_rq->min_vruntime;
- rq0_min_vruntime = cpu_rq(0)->cfs.min_vruntime;
raw_spin_rq_unlock_irqrestore(rq, flags);
- SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "MIN_vruntime",
- SPLIT_NS(MIN_vruntime));
+
+ SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "left_vruntime",
+ SPLIT_NS(left_vruntime));
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "min_vruntime",
SPLIT_NS(min_vruntime));
- SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "max_vruntime",
- SPLIT_NS(max_vruntime));
- spread = max_vruntime - MIN_vruntime;
- SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "spread",
- SPLIT_NS(spread));
- spread0 = min_vruntime - rq0_min_vruntime;
- SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "spread0",
- SPLIT_NS(spread0));
+ SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "avg_vruntime",
+ SPLIT_NS(avg_vruntime(cfs_rq)));
+ SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "right_vruntime",
+ SPLIT_NS(right_vruntime));
+ spread = right_vruntime - left_vruntime;
+ SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "spread", SPLIT_NS(spread));
SEQ_printf(m, " .%-30s: %d\n", "nr_spread_over",
cfs_rq->nr_spread_over);
SEQ_printf(m, " .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -601,9 +601,102 @@ static inline bool entity_before(const s
return (s64)(a->vruntime - b->vruntime) < 0;
}
+static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ return (s64)(se->vruntime - cfs_rq->min_vruntime);
+}
+
#define __node_2_se(node) \
rb_entry((node), struct sched_entity, run_node)
+/*
+ * Compute virtual time from the per-task service numbers:
+ *
+ * Fair schedulers conserve lag: \Sum lag_i = 0
+ *
+ * lag_i = S - s_i = w_i * (V - v_i)
+ *
+ * \Sum lag_i = 0 -> \Sum w_i * (V - v_i) = V * \Sum w_i - \Sum w_i * v_i = 0
+ *
+ * From which we solve V:
+ *
+ * \Sum v_i * w_i
+ * V = --------------
+ * \Sum w_i
+ *
+ * However, since v_i is u64, and the multiplication could easily overflow,
+ * transform it into a relative form that uses smaller quantities:
+ *
+ * Substitute: v_i == (v_i - v) + v
+ *
+ * \Sum ((v_i - v) + v) * w_i \Sum (v_i - v) * w_i
+ * V = -------------------------- = -------------------- + v
+ * \Sum w_i \Sum w_i
+ *
+ * min_vruntime = v
+ * avg_vruntime = \Sum (v_i - v) * w_i
+ * cfs_rq->load = \Sum w_i
+ *
+ * Since min_vruntime is a monotonic increasing variable that closely tracks
+ * the per-task service, these deltas: (v_i - v), will be in the order of the
+ * maximal (virtual) lag induced in the system due to quantisation.
+ */
+static void
+avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ s64 key = entity_key(cfs_rq, se);
+ cfs_rq->avg_vruntime += key * se->load.weight;
+ cfs_rq->avg_load += se->load.weight;
+}
+
+static void
+avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ s64 key = entity_key(cfs_rq, se);
+ cfs_rq->avg_vruntime -= key * se->load.weight;
+ cfs_rq->avg_load -= se->load.weight;
+}
+
+static inline
+void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
+{
+ /*
+ * v' = v + d ==> avg_vruntime' = avg_vruntime - d*avg_load
+ */
+ cfs_rq->avg_vruntime -= cfs_rq->avg_load * delta;
+}
+
+u64 avg_vruntime(struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *curr = cfs_rq->curr;
+ s64 lag = cfs_rq->avg_vruntime;
+ long load = cfs_rq->avg_load;
+
+ if (curr && curr->on_rq) {
+ lag += entity_key(cfs_rq, curr) * curr->load.weight;
+ load += curr->load.weight;
+ }
+
+ if (load)
+ lag = div_s64(lag, load);
+
+ return cfs_rq->min_vruntime + lag;
+}
+
+static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
+{
+ u64 min_vruntime = cfs_rq->min_vruntime;
+ /*
+ * open coded max_vruntime() to allow updating avg_vruntime
+ */
+ s64 delta = (s64)(vruntime - min_vruntime);
+ if (delta > 0) {
+ avg_vruntime_update(cfs_rq, delta);
+ min_vruntime = vruntime;
+ }
+ return min_vruntime;
+}
+
static void update_min_vruntime(struct cfs_rq *cfs_rq)
{
struct sched_entity *curr = cfs_rq->curr;
@@ -629,7 +722,7 @@ static void update_min_vruntime(struct c
/* ensure we never gain time by being placed backwards. */
u64_u32_store(cfs_rq->min_vruntime,
- max_vruntime(cfs_rq->min_vruntime, vruntime));
+ __update_min_vruntime(cfs_rq, vruntime));
}
static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
@@ -642,12 +735,14 @@ static inline bool __entity_less(struct
*/
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+ avg_vruntime_add(cfs_rq, se);
rb_add_cached(&se->run_node, &cfs_rq->tasks_timeline, __entity_less);
}
static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline);
+ avg_vruntime_sub(cfs_rq, se);
}
struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
@@ -3330,6 +3425,8 @@ static void reweight_entity(struct cfs_r
/* commit outstanding execution time */
if (cfs_rq->curr == se)
update_curr(cfs_rq);
+ else
+ avg_vruntime_sub(cfs_rq, se);
update_load_sub(&cfs_rq->load, se->load.weight);
}
dequeue_load_avg(cfs_rq, se);
@@ -3345,9 +3442,11 @@ static void reweight_entity(struct cfs_r
#endif
enqueue_load_avg(cfs_rq, se);
- if (se->on_rq)
+ if (se->on_rq) {
update_load_add(&cfs_rq->load, se->load.weight);
-
+ if (cfs_rq->curr != se)
+ avg_vruntime_add(cfs_rq, se);
+ }
}
void reweight_task(struct task_struct *p, int prio)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -558,6 +558,9 @@ struct cfs_rq {
unsigned int idle_nr_running; /* SCHED_IDLE */
unsigned int idle_h_nr_running; /* SCHED_IDLE */
+ s64 avg_vruntime;
+ u64 avg_load;
+
u64 exec_clock;
u64 min_vruntime;
#ifdef CONFIG_SCHED_CORE
@@ -3312,4 +3315,6 @@ static inline void switch_mm_cid(struct
static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next) { }
#endif
+extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
+
#endif /* _KERNEL_SCHED_SCHED_H */
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 06/10] sched/fair: Add avg_vruntime
2023-03-06 13:25 ` [PATCH 06/10] sched/fair: Add avg_vruntime Peter Zijlstra
@ 2023-03-21 13:58 ` Chen Yu
2023-03-21 16:04 ` Peter Zijlstra
0 siblings, 1 reply; 35+ messages in thread
From: Chen Yu @ 2023-03-21 13:58 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
tim.c.chen, joshdon, timj, kprateek.nayak, youssefesmat, joel
On 2023-03-06 at 14:25:27 +0100, Peter Zijlstra wrote:
[...]
>
> +/*
> + * Compute virtual time from the per-task service numbers:
> + *
> + * Fair schedulers conserve lag: \Sum lag_i = 0
> + *
> + * lag_i = S - s_i = w_i * (V - v_i)
> + *
The definition of lag_i above seems to be inconsistent with the definition
of se->lag in PATCH 8. Maybe rename lag_i to something else to avoid confusion?
> + * \Sum lag_i = 0 -> \Sum w_i * (V - v_i) = V * \Sum w_i - \Sum w_i * v_i = 0
> + *
> + * From which we solve V:
> + *
> + * \Sum v_i * w_i
> + * V = --------------
> + * \Sum w_i
> + *
> + * However, since v_i is u64, and the multiplication could easily overflow
> + * transform it into a relative form that uses smaller quantities:
> + *
> + * Substitute: v_i == (v_i - v) + v
> + *
> + * \Sum ((v_i - v) + v) * w_i \Sum (v_i - v) * w_i
> + * V = -------------------------- = -------------------- + v
> + * \Sum w_i \Sum w_i
> + *
> + *
Not sure if I understand it correctly: does it mean (v_i - v) * w_i will not
overflow? If the weight of a task is 15 (nice 19), then if v_i - v > (S64_MAX / 15)
it overflows. Is it possible that v_i is much larger than cfs_rq->min_vruntime
in this case?
thanks,
Chenyu
* Re: [PATCH 06/10] sched/fair: Add avg_vruntime
2023-03-21 13:58 ` Chen Yu
@ 2023-03-21 16:04 ` Peter Zijlstra
2023-03-24 7:12 ` Chen Yu
0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-21 16:04 UTC (permalink / raw)
To: Chen Yu
Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
tim.c.chen, joshdon, timj, kprateek.nayak, youssefesmat, joel
On Tue, Mar 21, 2023 at 09:58:13PM +0800, Chen Yu wrote:
> On 2023-03-06 at 14:25:27 +0100, Peter Zijlstra wrote:
> [...]
> >
> > +/*
> > + * Compute virtual time from the per-task service numbers:
> > + *
> > + * Fair schedulers conserve lag: \Sum lag_i = 0
> > + *
> > + * lag_i = S - s_i = w_i * (V - v_i)
> > + *
> The definition of lag_i above seems to be inconsistent with the definition
> of se->lag in PATCH 8. Maybe rename lag_i to something else to avoid confusion?
Yeah, I ran into that the other day, I think I'll introduce vlag_i = V - v_i
or so.
> > + * \Sum lag_i = 0 -> \Sum w_i * (V - v_i) = V * \Sum w_i - \Sum w_i * v_i = 0
> > + *
> > + * From which we solve V:
> > + *
> > + * \Sum v_i * w_i
> > + * V = --------------
> > + * \Sum w_i
> > + *
> > + * However, since v_i is u64, and the multiplication could easily overflow
> > + * transform it into a relative form that uses smaller quantities:
> > + *
> > + * Substitute: v_i == (v_i - v) + v
> > + *
> > + * \Sum ((v_i - v) + v) * w_i \Sum (v_i - v) * w_i
> > + * V = -------------------------- = -------------------- + v
> > + * \Sum w_i \Sum w_i
> > + *
> > + *
> Not sure if I understand it correctly: does it mean (v_i - v) * w_i will not
> overflow? If the weight of a task is 15 (nice 19), then if v_i - v > (S64_MAX / 15)
> it overflows. Is it possible that v_i is much larger than cfs_rq->min_vruntime
> in this case?
Or worse, SCHED_IDLE, where weight is 2 (IIRC) or cgroups, then vtime
advances at 512 times realtime. Now, the tick puts a limit on how long
we'll overshoot these super low weight entities, for HZ=1000 we still
only get 0.5s of vtime for weight=2.
That would be only 30 bits used, except we use double FIXEDPOINT_SHIFT
on 64bit, so we'll end up at 40-ish.
That should give us enough room to carry an average of deltas around
min_vruntime.
But yes, I've seen this go sideways and I need to stare a bit more at
this. One of the things I've considered is changing the min_vruntime
update rules to instead move to avg_vruntime() to minimize the deltas.
But I've not yet actually written that code.
* Re: [PATCH 06/10] sched/fair: Add avg_vruntime
2023-03-21 16:04 ` Peter Zijlstra
@ 2023-03-24 7:12 ` Chen Yu
2023-03-24 10:03 ` Peter Zijlstra
0 siblings, 1 reply; 35+ messages in thread
From: Chen Yu @ 2023-03-24 7:12 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
tim.c.chen, joshdon, timj, kprateek.nayak, youssefesmat, joel
On 2023-03-21 at 17:04:58 +0100, Peter Zijlstra wrote:
> On Tue, Mar 21, 2023 at 09:58:13PM +0800, Chen Yu wrote:
> > On 2023-03-06 at 14:25:27 +0100, Peter Zijlstra wrote:
> > [...]
> > >
> > > +/*
> > > + * Compute virtual time from the per-task service numbers:
> > > + *
> > > + * Fair schedulers conserve lag: \Sum lag_i = 0
> > > + *
> > > + * lag_i = S - s_i = w_i * (V - v_i)
> > > + *
> > The definition of lag_i above seems to be inconsistent with the definition
> > of se->lag in PATCH 8. Maybe rename lag_i to something else to avoid confusion?
>
> Yeah, I ran into that the other day, I think I'll introduce vlag_i = V - v_i
> or so.
>
> > > + * \Sum lag_i = 0 -> \Sum w_i * (V - v_i) = V * \Sum w_i - \Sum w_i * v_i = 0
> > > + *
> > > + * From which we solve V:
> > > + *
> > > + * \Sum v_i * w_i
> > > + * V = --------------
> > > + * \Sum w_i
> > > + *
> > > + * However, since v_i is u64, and the multiplication could easily overflow
> > > + * transform it into a relative form that uses smaller quantities:
> > > + *
> > > + * Substitute: v_i == (v_i - v) + v
> > > + *
> > > + * \Sum ((v_i - v) + v) * w_i \Sum (v_i - v) * w_i
> > > + * V = -------------------------- = -------------------- + v
> > > + * \Sum w_i \Sum w_i
> > > + *
> > > + *
>
> > Not sure if I understand it correctly: does it mean (v_i - v) * w_i will not
> > overflow? If the weight of a task is 15 (nice 19), then if v_i - v > (S64_MAX / 15)
> > it overflows. Is it possible that v_i is much larger than cfs_rq->min_vruntime
> > in this case?
>
> Or worse, SCHED_IDLE, where weight is 2 (IIRC) or cgroups, then vtime
> advances at 512 times realtime. Now, the tick puts a limit on how long
> we'll overshoot these super low weight entities, for HZ=1000 we still
> only get 0.5s of vtime for weight=2.
>
> That would be only 30 bits used, except we use double FIXEDPOINT_SHIFT
> on 64bit, so we'll end up at 40-ish.
>
> That should give us enough room to carry an average of deltas around
> min_vruntime.
>
I'm trying to digest how ticks could prevent the overflow.
In update_curr() -> update_min_vruntime(cfs_rq), the cfs_rq->min_vruntime
is set to
max (cfs_rq->min_vruntime, min(curr->vruntime, leftmost(se->vruntime)))
so, although curr->vruntime increase by 0.5 seconds in each tick,
the leftmost(se->vruntime) could still be very small and unchanged,
thus the delta between v_i and cfs_rq->min_vruntime is still large.
Instead, sysctl_sched_latency could bound how far apart the se.vruntime and
the cfs_rq.min_vruntime get, by calculating the vruntime delta between
task1 and task2:
sched_vslice(task1) = (NICE0_LOAD/se1.weight) * (w1/Sum wi * sysctl_sched_latency)
sched_vslice(task2) = (NICE0_LOAD/se2.weight) * (w2/Sum wi * sysctl_sched_latency)
Besides, in patch 10, entity_eligible() checks
\Sum (v_i - v)*w_i >= (v_i - v)*(\Sum w_i)
and \Sum w_i could become large if there are many runnable tasks, which
could bring overflow? But yes, if v_i - v is very small we can get rid of
this issue IMO.
thanks,
Chenyu
> But yes, I've seen this go sideways and I need to stare a bit more at
> this. One of the things I've considered is changing the min_vruntime
> update rules to instead move to avg_vruntime() to minimize the deltas.
> But I've not yet actually written that code.
>
* Re: [PATCH 06/10] sched/fair: Add avg_vruntime
2023-03-24 7:12 ` Chen Yu
@ 2023-03-24 10:03 ` Peter Zijlstra
0 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-24 10:03 UTC (permalink / raw)
To: Chen Yu
Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
tim.c.chen, joshdon, timj, kprateek.nayak, youssefesmat, joel
On Fri, Mar 24, 2023 at 03:12:19PM +0800, Chen Yu wrote:
> > Or worse, SCHED_IDLE, where weight is 2 (IIRC) or cgroups, then vtime
> > advances at 512 times realtime. Now, the tick puts a limit on how long
> > we'll overshoot these super low weight entities, for HZ=1000 we still
> > only get 0.5s of vtime for weight=2.
> >
> > That would be only 30 bits used, except we use double FIXEDPOINT_SHIFT
> > on 64bit, so we'll end up at 40-ish.
> >
> > That should give us enough room to carry an average of deltas around
> > min_vruntime.
> >
> I'm trying to digest how ticks could prevent the overflow.
They don't prevent overflow per se, but they do limit how far
vruntime can advance ahead of the pack.
> In update_curr() -> update_min_vruntime(cfs_rq), the cfs_rq->min_vruntime
> is set to
> max (cfs_rq->min_vruntime, min(curr->vruntime, leftmost(se->vruntime)))
> so, although curr->vruntime increase by 0.5 seconds in each tick,
> the leftmost(se->vruntime) could still be very small and unchanged,
> thus the delta between v_i and cfs_rq->min_vruntime is still large.
Well, since the basic task selection rule is: pick leftmost, the idea is
that leftmost and hence min_vruntime advances. The only problem is that
placement can place new entities left of min_vruntime and then it stalls
for a bit. But then rightmost tasks shouldn't get more runtime and the
whole situation should be 'stable'-ish.
> Instead sysctl_sched_latency could decide how far it is between the
> se.vruntime and the cfs_rq.min_vruntime, by calculating the vruntime
> delta between task1 and task2:
>
> sched_vslice(task1) = (NICE0_LOAD/se1.weight) * (w1/Sum wi * sysctl_sched_latency)
> sched_vslice(task2) = (NICE0_LOAD/se2.weight) * (w2/Sum wi * sysctl_sched_latency)
Yes, vslice is obviously involved, but low weight tasks are the ones
that tend to shoot away and are tick limited.
> Besides in patch 10, entity_eligible() checks
> \Sum (v_i - v)*w_i >= (v_i - v)*(\Sum w_i)
> and the \Sum w_i could become large if there are many runnable tasks and
> bring overflow?
Indeed; I'll check there too. I think I'll make it do the division on
32bit and use 64x64->128 on 64bit.
Let me have a play..
* [PATCH 07/10] sched/fair: Remove START_DEBIT
2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
` (5 preceding siblings ...)
2023-03-06 13:25 ` [PATCH 06/10] sched/fair: Add avg_vruntime Peter Zijlstra
@ 2023-03-06 13:25 ` Peter Zijlstra
2023-03-06 13:25 ` [PATCH 08/10] sched/fair: Add lag based placement Peter Zijlstra
` (6 subsequent siblings)
13 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-06 13:25 UTC (permalink / raw)
To: mingo, vincent.guittot
Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
kprateek.nayak, yu.c.chen, youssefesmat, joel
With the introduction of avg_vruntime() there is no need to use worse
approximations. Take the 0-lag point as the starting point for inserting
new tasks.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/fair.c | 21 +--------------------
kernel/sched/features.h | 6 ------
2 files changed, 1 insertion(+), 26 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -882,16 +882,6 @@ static u64 sched_slice(struct cfs_rq *cf
return slice;
}
-/*
- * We calculate the vruntime slice of a to-be-inserted task.
- *
- * vs = s/w
- */
-static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
- return calc_delta_fair(sched_slice(cfs_rq, se), se);
-}
-
#include "pelt.h"
#ifdef CONFIG_SMP
@@ -4758,18 +4748,9 @@ static void check_spread(struct cfs_rq *
static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
- u64 vruntime = cfs_rq->min_vruntime;
+ u64 vruntime = avg_vruntime(cfs_rq);
u64 sleep_time;
- /*
- * The 'current' period is already promised to the current tasks,
- * however the extra weight of the new task will slow them down a
- * little, place the new task so that it fits in the slot that
- * stays open at the end.
- */
- if (initial && sched_feat(START_DEBIT))
- vruntime += sched_vslice(cfs_rq, se);
-
/* sleeps up to a single latency don't count. */
if (!initial) {
unsigned long thresh;
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -7,12 +7,6 @@
SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true)
/*
- * Place new tasks ahead so that they do not starve already running
- * tasks
- */
-SCHED_FEAT(START_DEBIT, true)
-
-/*
* Prefer to schedule the task we woke last (assuming it failed
* wakeup-preemption), since its likely going to consume data we
* touched, increases cache locality.
* [PATCH 08/10] sched/fair: Add lag based placement
2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
` (6 preceding siblings ...)
2023-03-06 13:25 ` [PATCH 07/10] sched/fair: Remove START_DEBIT Peter Zijlstra
@ 2023-03-06 13:25 ` Peter Zijlstra
2023-03-16 22:49 ` Tim Chen
2023-03-06 13:25 ` [PATCH 09/10] rbtree: Add rb_add_augmented_cached() helper Peter Zijlstra
` (5 subsequent siblings)
13 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-06 13:25 UTC (permalink / raw)
To: mingo, vincent.guittot
Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
kprateek.nayak, yu.c.chen, youssefesmat, joel
With the introduction of avg_vruntime, it is possible to approximate
lag (the entire purpose of introducing it in fact). Use this to do lag
based placement over sleep+wake.
Specifically, the FAIR_SLEEPERS thing places things too far to the
left and messes up the deadline aspect of EEVDF.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
include/linux/sched.h | 1
kernel/sched/core.c | 1
kernel/sched/fair.c | 63 +++++++++++++++++++++++++++---------------------
kernel/sched/features.h | 8 ++++++
4 files changed, 46 insertions(+), 27 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -555,6 +555,7 @@ struct sched_entity {
u64 sum_exec_runtime;
u64 vruntime;
u64 prev_sum_exec_runtime;
+ s64 lag;
u64 nr_migrations;
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4436,6 +4436,7 @@ static void __sched_fork(unsigned long c
p->se.prev_sum_exec_runtime = 0;
p->se.nr_migrations = 0;
p->se.vruntime = 0;
+ p->se.lag = 0;
INIT_LIST_HEAD(&p->se.group_node);
set_latency_offset(p);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4749,39 +4749,45 @@ static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
u64 vruntime = avg_vruntime(cfs_rq);
- u64 sleep_time;
- /* sleeps up to a single latency don't count. */
- if (!initial) {
- unsigned long thresh;
-
- if (se_is_idle(se))
- thresh = sysctl_sched_min_granularity;
- else
- thresh = sysctl_sched_latency;
+ if (sched_feat(FAIR_SLEEPERS)) {
+ u64 sleep_time;
+
+ /* sleeps up to a single latency don't count. */
+ if (!initial) {
+ unsigned long thresh;
+
+ if (se_is_idle(se))
+ thresh = sysctl_sched_min_granularity;
+ else
+ thresh = sysctl_sched_latency;
+
+ /*
+ * Halve their sleep time's effect, to allow
+ * for a gentler effect of sleepers:
+ */
+ if (sched_feat(GENTLE_FAIR_SLEEPERS))
+ thresh >>= 1;
+
+ vruntime -= thresh;
+ }
/*
- * Halve their sleep time's effect, to allow
- * for a gentler effect of sleepers:
+ * Pull vruntime of the entity being placed to the base level of
+ * cfs_rq, to prevent boosting it if placed backwards. If the entity
+ * slept for a long time, don't even try to compare its vruntime with
+ * the base as it may be too far off and the comparison may get
+ * inversed due to s64 overflow.
*/
- if (sched_feat(GENTLE_FAIR_SLEEPERS))
- thresh >>= 1;
-
- vruntime -= thresh;
+ sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start;
+ if ((s64)sleep_time < 60LL * NSEC_PER_SEC)
+ vruntime = max_vruntime(se->vruntime, vruntime);
}
- /*
- * Pull vruntime of the entity being placed to the base level of
- * cfs_rq, to prevent boosting it if placed backwards. If the entity
- * slept for a long time, don't even try to compare its vruntime with
- * the base as it may be too far off and the comparison may get
- * inversed due to s64 overflow.
- */
- sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start;
- if ((s64)sleep_time > 60LL * NSEC_PER_SEC)
- se->vruntime = vruntime;
- else
- se->vruntime = max_vruntime(se->vruntime, vruntime);
+ if (sched_feat(PRESERVE_LAG))
+ vruntime -= se->lag;
+
+ se->vruntime = vruntime;
}
static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
@@ -4949,6 +4955,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
clear_buddies(cfs_rq, se);
+ if (sched_feat(PRESERVE_LAG) && (flags & DEQUEUE_SLEEP))
+ se->lag = avg_vruntime(cfs_rq) - se->vruntime;
+
if (se != cfs_rq->curr)
__dequeue_entity(cfs_rq, se);
se->on_rq = 0;
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -1,12 +1,20 @@
/* SPDX-License-Identifier: GPL-2.0 */
+
/*
* Only give sleepers 50% of their service deficit. This allows
* them to run sooner, but does not allow tons of sleepers to
* rip the spread apart.
*/
+SCHED_FEAT(FAIR_SLEEPERS, false)
SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true)
/*
+ * Using the avg_vruntime, do the right thing and preserve lag
+ * across sleep+wake cycles.
+ */
+SCHED_FEAT(PRESERVE_LAG, true)
+
+/*
* Prefer to schedule the task we woke last (assuming it failed
* wakeup-preemption), since its likely going to consume data we
* touched, increases cache locality.
* Re: [PATCH 08/10] sched/fair: Add lag based placement
2023-03-06 13:25 ` [PATCH 08/10] sched/fair: Add lag based placement Peter Zijlstra
@ 2023-03-16 22:49 ` Tim Chen
0 siblings, 0 replies; 35+ messages in thread
From: Tim Chen @ 2023-03-16 22:49 UTC (permalink / raw)
To: Peter Zijlstra, mingo, vincent.guittot
Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, bristot, corbet, qyousef, chris.hyser, patrick.bellasi,
pjt, pavel, qperret, joshdon, timj, kprateek.nayak, yu.c.chen,
youssefesmat, joel
On Mon, 2023-03-06 at 14:25 +0100, Peter Zijlstra wrote:
> With the introduction of avg_vruntime, it is possible to approximate
> lag (the entire purpose of introducing it in fact). Use this to do lag
> based placement over sleep+wake.
>
> Specifically, the FAIR_SLEEPERS thing places things too far to the
> left and messes up the deadline aspect of EEVDF.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> include/linux/sched.h | 1
> kernel/sched/core.c | 1
> kernel/sched/fair.c | 63 +++++++++++++++++++++++++++---------------------
> kernel/sched/features.h | 8 ++++++
> 4 files changed, 46 insertions(+), 27 deletions(-)
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -555,6 +555,7 @@ struct sched_entity {
> u64 sum_exec_runtime;
> u64 vruntime;
> u64 prev_sum_exec_runtime;
> + s64 lag;
>
> u64 nr_migrations;
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4436,6 +4436,7 @@ static void __sched_fork(unsigned long c
> p->se.prev_sum_exec_runtime = 0;
> p->se.nr_migrations = 0;
> p->se.vruntime = 0;
> + p->se.lag = 0;
> INIT_LIST_HEAD(&p->se.group_node);
>
> set_latency_offset(p);
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4749,39 +4749,45 @@ static void
> place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> {
> u64 vruntime = avg_vruntime(cfs_rq);
> - u64 sleep_time;
>
> - /* sleeps up to a single latency don't count. */
> - if (!initial) {
> - unsigned long thresh;
> -
> - if (se_is_idle(se))
> - thresh = sysctl_sched_min_granularity;
> - else
> - thresh = sysctl_sched_latency;
> + if (sched_feat(FAIR_SLEEPERS)) {
> + u64 sleep_time;
> +
> + /* sleeps up to a single latency don't count. */
> + if (!initial) {
> + unsigned long thresh;
> +
> + if (se_is_idle(se))
> + thresh = sysctl_sched_min_granularity;
> + else
> + thresh = sysctl_sched_latency;
> +
> + /*
> + * Halve their sleep time's effect, to allow
> + * for a gentler effect of sleepers:
> + */
> + if (sched_feat(GENTLE_FAIR_SLEEPERS))
> + thresh >>= 1;
> +
> + vruntime -= thresh;
> + }
>
> /*
> - * Halve their sleep time's effect, to allow
> - * for a gentler effect of sleepers:
> + * Pull vruntime of the entity being placed to the base level of
> + * cfs_rq, to prevent boosting it if placed backwards. If the entity
> + * slept for a long time, don't even try to compare its vruntime with
> + * the base as it may be too far off and the comparison may get
> + * inversed due to s64 overflow.
> */
> - if (sched_feat(GENTLE_FAIR_SLEEPERS))
> - thresh >>= 1;
> -
> - vruntime -= thresh;
> + sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start;
> + if ((s64)sleep_time < 60LL * NSEC_PER_SEC)
> + vruntime = max_vruntime(se->vruntime, vruntime);
> }
>
> - /*
> - * Pull vruntime of the entity being placed to the base level of
> - * cfs_rq, to prevent boosting it if placed backwards. If the entity
> - * slept for a long time, don't even try to compare its vruntime with
> - * the base as it may be too far off and the comparison may get
> - * inversed due to s64 overflow.
> - */
> - sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start;
> - if ((s64)sleep_time > 60LL * NSEC_PER_SEC)
> - se->vruntime = vruntime;
> - else
> - se->vruntime = max_vruntime(se->vruntime, vruntime);
> + if (sched_feat(PRESERVE_LAG))
> + vruntime -= se->lag;
> +
> + se->vruntime = vruntime;
I was going to say that
when we migrate a task to a new runqueue, we subtract the old queue's
min_vruntime from vruntime, and that math needs updating.
But then I saw you did that in
https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/kernel/sched?h=sched/eevdf
With this new lag based placement, I think it should properly fix
the starvation issues we have seen caused by quickly cpu hopping tasks.
Tim
> }
>
> static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
> @@ -4949,6 +4955,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
>
> clear_buddies(cfs_rq, se);
>
> + if (sched_feat(PRESERVE_LAG) && (flags & DEQUEUE_SLEEP))
> + se->lag = avg_vruntime(cfs_rq) - se->vruntime;
> +
> if (se != cfs_rq->curr)
> __dequeue_entity(cfs_rq, se);
> se->on_rq = 0;
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -1,12 +1,20 @@
> /* SPDX-License-Identifier: GPL-2.0 */
> +
> /*
> * Only give sleepers 50% of their service deficit. This allows
> * them to run sooner, but does not allow tons of sleepers to
> * rip the spread apart.
> */
> +SCHED_FEAT(FAIR_SLEEPERS, false)
> SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true)
>
> /*
> + * Using the avg_vruntime, do the right thing and preserve lag
> + * across sleep+wake cycles.
> + */
> +SCHED_FEAT(PRESERVE_LAG, true)
> +
> +/*
> * Prefer to schedule the task we woke last (assuming it failed
> * wakeup-preemption), since its likely going to consume data we
> * touched, increases cache locality.
>
>
* [PATCH 09/10] rbtree: Add rb_add_augmented_cached() helper
2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
` (7 preceding siblings ...)
2023-03-06 13:25 ` [PATCH 08/10] sched/fair: Add lag based placement Peter Zijlstra
@ 2023-03-06 13:25 ` Peter Zijlstra
2023-03-06 13:25 ` [PATCH 10/10] sched/fair: Implement an EEVDF like policy Peter Zijlstra
` (4 subsequent siblings)
13 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-06 13:25 UTC (permalink / raw)
To: mingo, vincent.guittot
Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
kprateek.nayak, yu.c.chen, youssefesmat, joel
While slightly sub-optimal, updating the augmented data while going
down the tree during lookup would be faster -- alas the augment
interface does not currently allow for that. Provide a generic helper
to add a node to an augmented cached tree.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
include/linux/rbtree_augmented.h | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
--- a/include/linux/rbtree_augmented.h
+++ b/include/linux/rbtree_augmented.h
@@ -60,6 +60,32 @@ rb_insert_augmented_cached(struct rb_nod
rb_insert_augmented(node, &root->rb_root, augment);
}
+static __always_inline struct rb_node *
+rb_add_augmented_cached(struct rb_node *node, struct rb_root_cached *tree,
+ bool (*less)(struct rb_node *, const struct rb_node *),
+ const struct rb_augment_callbacks *augment)
+{
+ struct rb_node **link = &tree->rb_root.rb_node;
+ struct rb_node *parent = NULL;
+ bool leftmost = true;
+
+ while (*link) {
+ parent = *link;
+ if (less(node, parent)) {
+ link = &parent->rb_left;
+ } else {
+ link = &parent->rb_right;
+ leftmost = false;
+ }
+ }
+
+ rb_link_node(node, parent, link);
+ augment->propagate(parent, NULL); /* suboptimal */
+ rb_insert_augmented_cached(node, tree, leftmost, augment);
+
+ return leftmost ? node : NULL;
+}
+
/*
* Template for declaring augmented rbtree callbacks (generic case)
*
* [PATCH 10/10] sched/fair: Implement an EEVDF like policy
2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
` (8 preceding siblings ...)
2023-03-06 13:25 ` [PATCH 09/10] rbtree: Add rb_add_augmented_cached() helper Peter Zijlstra
@ 2023-03-06 13:25 ` Peter Zijlstra
2023-03-08 8:39 ` Mike Galbraith
2023-03-07 10:27 ` [PATCH 00/10] sched: EEVDF using latency-nice Vincent Guittot
` (3 subsequent siblings)
13 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-06 13:25 UTC (permalink / raw)
To: mingo, vincent.guittot
Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
kprateek.nayak, yu.c.chen, youssefesmat, joel
CFS is currently a WFQ based scheduler with only a single knob, the
weight. The addition of a second, latency oriented parameter makes
something like WF2Q or EEVDF a much better fit.
Specifically, EEVDF does EDF like scheduling in the left half of the
tree -- those entities that are owed service. Except because this is a
virtual time scheduler, the deadlines are in virtual time as well,
which is what allows over-subscription.
EEVDF has two parameters:
- weight; which is mapped to nice just as before
- relative deadline; which is related to slice length and mapped
to the new latency nice.
Basically, by setting a smaller slice, the deadline will be earlier
and the task will be more eligible and run earlier.
Preemption (both tick and wakeup) is driven by testing against a fresh
pick. Because the tree is now effectively an interval tree, and the
selection is no longer 'leftmost', over-scheduling is less of a
problem.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
include/linux/sched.h | 4
kernel/sched/debug.c | 6 -
kernel/sched/fair.c | 265 ++++++++++++++++++++++++++++++++++++++++++------
kernel/sched/features.h | 2
kernel/sched/sched.h | 1
5 files changed, 247 insertions(+), 31 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -548,6 +548,9 @@ struct sched_entity {
/* For load-balancing: */
struct load_weight load;
struct rb_node run_node;
+ u64 deadline;
+ u64 min_deadline;
+
struct list_head group_node;
unsigned int on_rq;
@@ -556,6 +559,7 @@ struct sched_entity {
u64 vruntime;
u64 prev_sum_exec_runtime;
s64 lag;
+ u64 slice;
u64 nr_migrations;
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -535,9 +535,13 @@ print_task(struct seq_file *m, struct rq
else
SEQ_printf(m, " %c", task_state_to_char(p));
- SEQ_printf(m, " %15s %5d %9Ld.%06ld %9Ld %5d ",
+ SEQ_printf(m, "%15s %5d %9Ld.%06ld %c %9Ld.%06ld %9Ld.%06ld %9Ld.%06ld %9Ld %5d ",
p->comm, task_pid_nr(p),
SPLIT_NS(p->se.vruntime),
+ entity_eligible(cfs_rq_of(&p->se), &p->se) ? 'E' : 'N',
+ SPLIT_NS(p->se.deadline),
+ SPLIT_NS(p->se.slice),
+ SPLIT_NS(p->se.sum_exec_runtime),
(long long)(p->nvcsw + p->nivcsw),
p->prio);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -47,6 +47,7 @@
#include <linux/psi.h>
#include <linux/ratelimit.h>
#include <linux/task_work.h>
+#include <linux/rbtree_augmented.h>
#include <asm/switch_to.h>
@@ -683,6 +684,34 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
return cfs_rq->min_vruntime + lag;
}
+/*
+ * Entity is eligible once it received less service than it ought to have,
+ * eg. lag >= 0.
+ *
+ * lag_i = S - s_i = w_i*(V - v_i)
+ *
+ * lag_i >= 0 -> V >= v_i
+ *
+ * \Sum (v_i - v)*w_i
+ * V = ------------------ + v
+ * \Sum w_i
+ *
+ * lag_i >= 0 -> \Sum (v_i - v)*w_i >= (v_i - v)*(\Sum w_i)
+ */
+int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ struct sched_entity *curr = cfs_rq->curr;
+ s64 avg_vruntime = cfs_rq->avg_vruntime;
+ long avg_load = cfs_rq->avg_load;
+
+ if (curr && curr->on_rq) {
+ avg_vruntime += entity_key(cfs_rq, curr) * curr->load.weight;
+ avg_load += curr->load.weight;
+ }
+
+ return avg_vruntime >= entity_key(cfs_rq, se) * avg_load;
+}
+
static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
{
u64 min_vruntime = cfs_rq->min_vruntime;
@@ -699,8 +728,8 @@ static u64 __update_min_vruntime(struct
static void update_min_vruntime(struct cfs_rq *cfs_rq)
{
+ struct sched_entity *se = __pick_first_entity(cfs_rq);
struct sched_entity *curr = cfs_rq->curr;
- struct rb_node *leftmost = rb_first_cached(&cfs_rq->tasks_timeline);
u64 vruntime = cfs_rq->min_vruntime;
@@ -711,9 +740,7 @@ static void update_min_vruntime(struct c
curr = NULL;
}
- if (leftmost) { /* non-empty tree */
- struct sched_entity *se = __node_2_se(leftmost);
-
+ if (se) {
if (!curr)
vruntime = se->vruntime;
else
@@ -730,18 +757,50 @@ static inline bool __entity_less(struct
return entity_before(__node_2_se(a), __node_2_se(b));
}
+#define deadline_gt(field, lse, rse) ({ (s64)((lse)->field - (rse)->field) > 0; })
+
+static inline void __update_min_deadline(struct sched_entity *se, struct rb_node *node)
+{
+ if (node) {
+ struct sched_entity *rse = __node_2_se(node);
+ if (deadline_gt(min_deadline, se, rse))
+ se->min_deadline = rse->min_deadline;
+ }
+}
+
+/*
+ * se->min_deadline = min(se->deadline, left->min_deadline, right->min_deadline)
+ */
+static inline bool min_deadline_update(struct sched_entity *se, bool exit)
+{
+ u64 old_min_deadline = se->min_deadline;
+ struct rb_node *node = &se->run_node;
+
+ se->min_deadline = se->deadline;
+ __update_min_deadline(se, node->rb_right);
+ __update_min_deadline(se, node->rb_left);
+
+ return se->min_deadline == old_min_deadline;
+}
+
+RB_DECLARE_CALLBACKS(static, min_deadline_cb, struct sched_entity,
+ run_node, min_deadline, min_deadline_update);
+
/*
* Enqueue an entity into the rb-tree:
*/
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
avg_vruntime_add(cfs_rq, se);
- rb_add_cached(&se->run_node, &cfs_rq->tasks_timeline, __entity_less);
+ se->min_deadline = se->deadline;
+ rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
+ __entity_less, &min_deadline_cb);
}
static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline);
+ rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
+ &min_deadline_cb);
avg_vruntime_sub(cfs_rq, se);
}
@@ -765,6 +824,101 @@ static struct sched_entity *__pick_next_
return __node_2_se(next);
}
+static struct sched_entity *pick_cfs(struct cfs_rq *cfs_rq, struct sched_entity *curr)
+{
+ struct sched_entity *left = __pick_first_entity(cfs_rq);
+
+ /*
+ * If curr is set we have to see if it's left of the leftmost entity
+ * still in the tree, provided there was anything in the tree at all.
+ */
+ if (!left || (curr && entity_before(curr, left)))
+ left = curr;
+
+ return left;
+}
+
+/*
+ * Earliest Eligible Virtual Deadline First
+ *
+ * In order to provide latency guarantees for different request sizes
+ * EEVDF selects the best runnable task from two criteria:
+ *
+ * 1) the task must be eligible (must be owed service)
+ *
+ * 2) from those tasks that meet 1), we select the one
+ * with the earliest virtual deadline.
+ *
+ * We can do this in O(log n) time due to an augmented RB-tree. The
+ * tree keeps the entries sorted on service, but also functions as a
+ * heap based on the deadline by keeping:
+ *
+ * se->min_deadline = min(se->deadline, se->{left,right}->min_deadline)
+ *
+ * Which allows an EDF like search on (sub)trees.
+ */
+static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
+{
+ struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
+ struct sched_entity *curr = cfs_rq->curr;
+ struct sched_entity *best = NULL;
+
+ if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
+ curr = NULL;
+
+ while (node) {
+ struct sched_entity *se = __node_2_se(node);
+
+ /*
+ * If this entity is not eligible, try the left subtree.
+ *
+ * XXX: would it be worth it to do the single division for
+ * avg_vruntime() once, instead of the multiplication
+ * in entity_eligible() O(log n) times?
+ */
+ if (!entity_eligible(cfs_rq, se)) {
+ node = node->rb_left;
+ continue;
+ }
+
+ /*
+ * If this entity has an earlier deadline than the previous
+ * best, take this one. If it also has the earliest deadline
+ * of its subtree, we're done.
+ */
+ if (!best || deadline_gt(deadline, best, se)) {
+ best = se;
+ if (best->deadline == best->min_deadline)
+ break;
+ }
+
+ /*
+ * If the earliest deadline in this subtree is in the fully
+ * eligible left half of our space, go there.
+ */
+ if (node->rb_left &&
+ __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
+ node = node->rb_left;
+ continue;
+ }
+
+ node = node->rb_right;
+ }
+
+ if (!best || (curr && deadline_gt(deadline, best, curr)))
+ best = curr;
+
+ if (unlikely(!best)) {
+ struct sched_entity *left = __pick_first_entity(cfs_rq);
+ if (left) {
+ pr_err("EEVDF scheduling fail, picking leftmost\n");
+ return left;
+ }
+ }
+
+ return best;
+}
+
#ifdef CONFIG_SCHED_DEBUG
struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
{
@@ -882,6 +1036,32 @@ static u64 sched_slice(struct cfs_rq *cf
return slice;
}
+static void set_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ if (sched_feat(EEVDF)) {
+ /*
+ * For EEVDF the virtual time slope is determined by w_i (iow.
+ * nice) while the request time r_i is determined by
+ * latency-nice.
+ */
+ se->slice = se->latency_offset;
+ } else {
+ /*
+ * When many tasks blow up the sched_period, it is possible
+ * that sched_slice() reports unusually large results (when
+ * many tasks are very light for example). Therefore impose a
+ * maximum.
+ */
+ se->slice = min_t(u64, sched_slice(cfs_rq, se), sysctl_sched_latency);
+ }
+
+ /*
+ * vd_i = ve_i + r_i / w_i
+ */
+ se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
+ se->min_deadline = se->deadline;
+}
+
#include "pelt.h"
#ifdef CONFIG_SMP
@@ -1014,6 +1194,13 @@ static void update_curr(struct cfs_rq *c
schedstat_add(cfs_rq->exec_clock, delta_exec);
curr->vruntime += calc_delta_fair(delta_exec, curr);
+ /*
+ * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
+ * this is probably good enough.
+ */
+ if ((s64)(curr->vruntime - curr->deadline) > 0)
+ set_slice(cfs_rq, curr);
+
update_min_vruntime(cfs_rq);
if (entity_is_task(curr)) {
@@ -4788,6 +4975,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
vruntime -= se->lag;
se->vruntime = vruntime;
+ set_slice(cfs_rq, se);
}
static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
@@ -4996,19 +5184,20 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
- unsigned long ideal_runtime, delta_exec;
+ unsigned long delta_exec;
struct sched_entity *se;
s64 delta;
- /*
- * When many tasks blow up the sched_period; it is possible that
- * sched_slice() reports unusually large results (when many tasks are
- * very light for example). Therefore impose a maximum.
- */
- ideal_runtime = min_t(u64, sched_slice(cfs_rq, curr), sysctl_sched_latency);
+ if (sched_feat(EEVDF)) {
+ if (pick_eevdf(cfs_rq) != curr)
+ goto preempt;
+
+ return;
+ }
delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
- if (delta_exec > ideal_runtime) {
+ if (delta_exec > curr->slice) {
+preempt:
resched_curr(rq_of(cfs_rq));
/*
* The current task ran long enough, ensure it doesn't get
@@ -5032,7 +5221,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq
if (delta < 0)
return;
- if (delta > ideal_runtime)
+ if (delta > curr->slice)
resched_curr(rq_of(cfs_rq));
}
@@ -5087,17 +5276,20 @@ wakeup_preempt_entity(struct sched_entit
static struct sched_entity *
pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
- struct sched_entity *left = __pick_first_entity(cfs_rq);
- struct sched_entity *se;
+ struct sched_entity *left, *se;
- /*
- * If curr is set we have to see if its left of the leftmost entity
- * still in the tree, provided there was anything in the tree at all.
- */
- if (!left || (curr && entity_before(curr, left)))
- left = curr;
+ if (sched_feat(EEVDF)) {
+ /*
+ * Enabling NEXT_BUDDY will affect latency but not fairness.
+ */
+ if (sched_feat(NEXT_BUDDY) &&
+ cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
+ return cfs_rq->next;
+
+ return pick_eevdf(cfs_rq);
+ }
- se = left; /* ideally we run the leftmost entity */
+ se = left = pick_cfs(cfs_rq, curr);
/*
* Avoid running the skip buddy, if running something else can
@@ -6192,13 +6384,12 @@ static inline void unthrottle_offline_cf
static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
{
struct sched_entity *se = &p->se;
- struct cfs_rq *cfs_rq = cfs_rq_of(se);
SCHED_WARN_ON(task_rq(p) != rq);
if (rq->cfs.h_nr_running > 1) {
- u64 slice = sched_slice(cfs_rq, se);
u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+ u64 slice = se->slice;
s64 delta = slice - ran;
if (delta < 0) {
@@ -7921,7 +8112,19 @@ static void check_preempt_wakeup(struct
if (cse_is_idle != pse_is_idle)
return;
- update_curr(cfs_rq_of(se));
+ cfs_rq = cfs_rq_of(se);
+ update_curr(cfs_rq);
+
+ if (sched_feat(EEVDF)) {
+ /*
+ * XXX pick_eevdf(cfs_rq) != se ?
+ */
+ if (pick_eevdf(cfs_rq) == pse)
+ goto preempt;
+
+ return;
+ }
+
if (wakeup_preempt_entity(se, pse) == 1) {
/*
* Bias pick_next to pick the sched entity that is
@@ -8167,7 +8370,7 @@ static void yield_task_fair(struct rq *r
clear_buddies(cfs_rq, se);
- if (curr->policy != SCHED_BATCH) {
+ if (sched_feat(EEVDF) || curr->policy != SCHED_BATCH) {
update_rq_clock(rq);
/*
* Update run-time statistics of the 'current'.
@@ -8180,6 +8383,8 @@ static void yield_task_fair(struct rq *r
*/
rq_clock_skip_update(rq);
}
+ if (sched_feat(EEVDF))
+ se->deadline += calc_delta_fair(se->slice, se);
set_skip_buddy(se);
}
@@ -11923,8 +12128,8 @@ static void rq_offline_fair(struct rq *r
static inline bool
__entity_slice_used(struct sched_entity *se, int min_nr_tasks)
{
- u64 slice = sched_slice(cfs_rq_of(se), se);
u64 rtime = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+ u64 slice = se->slice;
return (rtime * min_nr_tasks > slice);
}
@@ -12639,7 +12844,7 @@ static unsigned int get_rr_interval_fair
* idle runqueue:
*/
if (rq->cfs.load.weight)
- rr_interval = NS_TO_JIFFIES(sched_slice(cfs_rq_of(se), se));
+ rr_interval = NS_TO_JIFFIES(se->slice);
return rr_interval;
}
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -103,3 +103,5 @@ SCHED_FEAT(LATENCY_WARN, false)
SCHED_FEAT(ALT_PERIOD, true)
SCHED_FEAT(BASE_SLICE, true)
+
+SCHED_FEAT(EEVDF, true)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3316,5 +3316,6 @@ static inline void switch_mm_cid(struct
#endif
extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
+extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
#endif /* _KERNEL_SCHED_SCHED_H */
* Re: [PATCH 10/10] sched/fair: Implement an EEVDF like policy
2023-03-06 13:25 ` [PATCH 10/10] sched/fair: Implement an EEVDF like policy Peter Zijlstra
@ 2023-03-08 8:39 ` Mike Galbraith
2023-03-08 9:26 ` Mike Galbraith
2023-03-08 13:36 ` Mike Galbraith
0 siblings, 2 replies; 35+ messages in thread
From: Mike Galbraith @ 2023-03-08 8:39 UTC (permalink / raw)
To: Peter Zijlstra, mingo, vincent.guittot
Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, bristot, corbet, qyousef, chris.hyser, patrick.bellasi,
pjt, pavel, qperret, tim.c.chen, joshdon, timj, kprateek.nayak,
yu.c.chen, youssefesmat, joel
On Mon, 2023-03-06 at 14:25 +0100, Peter Zijlstra wrote:
> Where CFS is currently a WFQ based scheduler with only a single knob,
> the weight. The addition of a second, latency oriented parameter,
> makes something like WF2Q or EEVDF based a much better fit.
>
> Specifically, EEVDF does EDF like scheduling in the left half of the
> tree -- those entities that are owed service. Except because this is a
> virtual time scheduler, the deadlines are in virtual time as well,
> which is what allows over-subscription.
Curiosity got the best of me, and I stuffed this series into master and
did a little light testing. Unsurprisingly, the numbers move about as
they are wont to do when you diddle anything in sched-land; most
notable were the UDP_STREAM numbers, which moved a LOT.
Another thing I found interesting was in the comparison of massive_intr
vs youtube clip load. I was expecting less scheduling from +eevdf
tree, but got more in total and more total wait time.
Season to taste or just toss: I didn't start with a virgin tree, though
local hacks *should* be meaningless to tbench and netperf, and while
they reduce stacking, they make zero perceptible difference to the
desktop, and are common to both trees.
Some numbers:
box is old i4790 desktop box
30 sec tbench
6.3.0.g8ca09d5
Throughput 3655.33 MB/sec 8 clients 8 procs max_latency=16.322 ms
Throughput 3651.73 MB/sec 8 clients 8 procs max_latency=16.260 ms
Throughput 3627.60 MB/sec 8 clients 8 procs max_latency=16.262 ms
6.3.0.g8ca09d5+eevdf
Throughput 3497.39 MB/sec 8 clients 8 procs max_latency=12.380 ms
Throughput 3403.28 MB/sec 8 clients 8 procs max_latency=12.296 ms
Throughput 3466.86 MB/sec 8 clients 8 procs max_latency=12.349 ms
avg vs avg .948
netperf cfs +eevdf vs cfs
TCP_SENDFILE-1 Avg: 89258 92080 1.031
TCP_SENDFILE-2 Avg: 83271 83371 1.001
TCP_SENDFILE-4 Avg: 56395 53011 .939
TCP_SENDFILE-8 Avg: 26389 39470 1.495
TCP_SENDFILE-16 Avg: 10251 19590 1.911
TCP_STREAM-1 Avg: 72583 71276 .981
TCP_STREAM-2 Avg: 59627 54424 .912
TCP_STREAM-4 Avg: 33310 22717 .681
TCP_STREAM-8 Avg: 8062 7718 .957
TCP_STREAM-16 Avg: 5143 3726 .724
TCP_MAERTS-1 Avg: 72897 71052 .974
TCP_MAERTS-2 Avg: 55065 60022 1.090
TCP_MAERTS-4 Avg: 45525 26531 .582
TCP_MAERTS-8 Avg: 11435 7937 .694
TCP_MAERTS-16 Avg: 5437 3351 .616
TCP_RR-1 Avg: 192766 180505 .936
TCP_RR-2 Avg: 169043 164731 .974
TCP_RR-4 Avg: 115702 110938 .958
TCP_RR-8 Avg: 113085 111775 .988
TCP_RR-16 Avg: 55226 53439 .967
UDP_RR-1 Avg: 261076 242027 .927
UDP_RR-2 Avg: 224808 221913 .987
UDP_RR-4 Avg: 158232 155162 .980
UDP_RR-8 Avg: 152255 149527 .982
UDP_RR-16 Avg: 75148 72739 .967
UDP_STREAM-1 Avg: 102612 102728 1.001 hmm: UDP_STREAM deltas verified repeatable
UDP_STREAM-2 Avg: 93774 93563 .997
UDP_STREAM-4 Avg: 54728 55702 1.017
UDP_STREAM-8 Avg: 30153 63329 2.100
UDP_STREAM-16 Avg: 12997 31644 2.434
(if your mailer don't do wide, this is gonna make a mess)
youtube BigBuckBunny vs 8 hogs (ancient 8ms run 1ms sleep cpu distribution checker)
massive_intr 8 9999& perf sched record -- firefox https://www.youtube.com/watch?v=aqz-KE-bpKQ& sleep 300;killall massive_intr firefox
note: perf obviously twiddled to add 'Sum delay'.
6.3.0.g8ca09d5
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms | Max delay start | Max delay end |
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Renderer:5410 | 201867.875 ms | 106745 | avg: 0.468 ms | max: 22.465 ms | sum:49998.624 ms | max start: 100.880096 s | max end: 100.902561 s
SwComposite:5453 | 189304.441 ms | 73405 | avg: 0.595 ms | max: 23.934 ms | sum:43657.252 ms | max start: 155.074533 s | max end: 155.098467 s
massive_intr:5571 | 179445.480 ms | 79984 | avg: 0.705 ms | max: 22.822 ms | sum:56395.232 ms | max start: 289.707867 s | max end: 289.730689 s
massive_intr:5569 | 179337.834 ms | 80101 | avg: 0.711 ms | max: 25.517 ms | sum:56965.103 ms | max start: 103.652981 s | max end: 103.678498 s
massive_intr:5575 | 179190.634 ms | 79433 | avg: 0.720 ms | max: 18.873 ms | sum:57223.508 ms | max start: 223.520540 s | max end: 223.539413 s
massive_intr:5568 | 179081.476 ms | 77118 | avg: 0.742 ms | max: 21.715 ms | sum:57188.088 ms | max start: 90.071642 s | max end: 90.093358 s
massive_intr:5574 | 179079.481 ms | 78924 | avg: 0.723 ms | max: 23.821 ms | sum:57024.111 ms | max start: 345.637059 s | max end: 345.660881 s
massive_intr:5570 | 178945.128 ms | 79466 | avg: 0.724 ms | max: 22.105 ms | sum:57554.610 ms | max start: 90.511319 s | max end: 90.533423 s
massive_intr:5572 | 178867.227 ms | 80016 | avg: 0.717 ms | max: 25.140 ms | sum:57393.314 ms | max start: 193.801127 s | max end: 193.826267 s
massive_intr:5573 | 178560.808 ms | 79764 | avg: 0.719 ms | max: 40.232 ms | sum:57338.388 ms | max start: 87.446558 s | max end: 87.486790 s
X:2492 | 86585.854 ms | 17924 | avg: 0.294 ms | max: 20.053 ms | sum: 5262.780 ms | max start: 106.798324 s | max end: 106.818377 s
llvmpipe-7:2870 | 36803.778 ms | 7268 | avg: 1.727 ms | max: 30.168 ms | sum:12548.310 ms | max start: 87.406812 s | max end: 87.436981 s
llvmpipe-3:2866 | 35004.654 ms | 6161 | avg: 1.385 ms | max: 19.992 ms | sum: 8531.873 ms | max start: 87.410811 s | max end: 87.430803 s
llvmpipe-1:2864 | 34615.309 ms | 6423 | avg: 1.238 ms | max: 21.740 ms | sum: 7954.871 ms | max start: 93.834245 s | max end: 93.855985 s
llvmpipe-2:2865 | 34375.917 ms | 6205 | avg: 1.273 ms | max: 22.031 ms | sum: 7897.655 ms | max start: 87.414812 s | max end: 87.436843 s
llvmpipe-0:2863 | 32479.993 ms | 8472 | avg: 0.906 ms | max: 18.145 ms | sum: 7674.587 ms | max start: 156.041479 s | max end: 156.059624 s
llvmpipe-5:2868 | 32284.589 ms | 5668 | avg: 1.271 ms | max: 21.562 ms | sum: 7203.223 ms | max start: 98.798222 s | max end: 98.819784 s
llvmpipe-6:2869 | 31752.624 ms | 5689 | avg: 1.241 ms | max: 18.067 ms | sum: 7057.222 ms | max start: 87.422817 s | max end: 87.440885 s
llvmpipe-4:2867 | 31621.552 ms | 5298 | avg: 1.327 ms | max: 21.903 ms | sum: 7029.350 ms | max start: 87.418812 s | max end: 87.440715 s
MediaPD~oder #1:5910 | 24623.459 ms | 7900 | avg: 0.455 ms | max: 18.267 ms | sum: 3596.813 ms | max start: 143.181740 s | max end: 143.200008 s
MediaPD~oder #1:5908 | 24498.697 ms | 7698 | avg: 0.470 ms | max: 24.616 ms | sum: 3614.831 ms | max start: 222.568707 s | max end: 222.593322 s
MediaPD~oder #1:5909 | 24447.400 ms | 7744 | avg: 0.476 ms | max: 22.047 ms | sum: 3683.826 ms | max start: 234.216798 s | max end: 234.238845 s
MediaPD~oder #1:5907 | 24349.888 ms | 7819 | avg: 0.437 ms | max: 19.288 ms | sum: 3413.911 ms | max start: 131.920097 s | max end: 131.939385 s
Isolated Web Co:5457 | 10982.274 ms | 5768 | avg: 0.295 ms | max: 25.414 ms | sum: 1701.308 ms | max start: 91.801774 s | max end: 91.827188 s
...
------------------------------------------------------------------------------------------------------------
TOTAL: |2370376.486 ms | 1189527 | | 654188.153 ms |
------------------------------------------------------------------------------------------------------------
6.3.0.g8ca09d5 +eevdf
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms | Max delay start | Max delay end |
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Renderer:5675 | 211250.021 ms | 84700 | avg: 0.546 ms | max: 20.503 ms | sum:46225.011 ms | max start: 317.446170 s | max end: 317.466673 s
SwComposite:5719 | 205804.660 ms | 66043 | avg: 0.685 ms | max: 19.871 ms | sum:45214.452 ms | max start: 119.470385 s | max end: 119.490256 s
massive_intr:5838 | 195285.673 ms | 75458 | avg: 0.885 ms | max: 18.171 ms | sum:66793.033 ms | max start: 177.270747 s | max end: 177.288919 s
massive_intr:5835 | 195217.246 ms | 75005 | avg: 0.884 ms | max: 18.211 ms | sum:66340.670 ms | max start: 340.966607 s | max end: 340.984818 s
massive_intr:5836 | 195148.073 ms | 74723 | avg: 0.891 ms | max: 22.868 ms | sum:66544.981 ms | max start: 92.619771 s | max end: 92.642639 s
massive_intr:5840 | 195093.638 ms | 75229 | avg: 0.886 ms | max: 21.502 ms | sum:66630.906 ms | max start: 96.715761 s | max end: 96.737263 s
massive_intr:5837 | 195081.767 ms | 74906 | avg: 0.890 ms | max: 18.384 ms | sum:66672.064 ms | max start: 157.736916 s | max end: 157.755300 s
massive_intr:5839 | 195067.653 ms | 74433 | avg: 0.892 ms | max: 17.731 ms | sum:66391.236 ms | max start: 327.225658 s | max end: 327.243389 s
massive_intr:5841 | 194947.303 ms | 75468 | avg: 0.890 ms | max: 19.250 ms | sum:67155.572 ms | max start: 94.936517 s | max end: 94.955767 s
massive_intr:5834 | 194820.690 ms | 74531 | avg: 0.901 ms | max: 16.593 ms | sum:67172.443 ms | max start: 87.268034 s | max end: 87.284627 s
X:2673 | 55537.286 ms | 24293 | avg: 0.624 ms | max: 19.071 ms | sum:15149.642 ms | max start: 130.940240 s | max end: 130.959311 s
MediaPD~oder #1:6190 | 24885.689 ms | 19835 | avg: 0.576 ms | max: 16.064 ms | sum:11424.368 ms | max start: 274.211226 s | max end: 274.227290 s
MediaPD~oder #1:6188 | 24802.432 ms | 19475 | avg: 0.589 ms | max: 24.496 ms | sum:11465.965 ms | max start: 255.251754 s | max end: 255.276250 s
MediaPD~oder #1:6189 | 24784.917 ms | 19684 | avg: 0.573 ms | max: 18.250 ms | sum:11277.622 ms | max start: 263.644372 s | max end: 263.662622 s
MediaPD~oder #1:6187 | 24660.751 ms | 19613 | avg: 0.572 ms | max: 19.865 ms | sum:11221.213 ms | max start: 172.852720 s | max end: 172.872585 s
llvmpipe-6:3047 | 18633.251 ms | 7545 | avg: 3.428 ms | max: 27.450 ms | sum:25864.293 ms | max start: 150.773081 s | max end: 150.800531 s
llvmpipe-7:3048 | 18414.068 ms | 8024 | avg: 3.880 ms | max: 23.249 ms | sum:31135.265 ms | max start: 150.776388 s | max end: 150.799637 s
llvmpipe-5:3046 | 17914.117 ms | 7336 | avg: 3.429 ms | max: 19.998 ms | sum:25155.781 ms | max start: 137.463848 s | max end: 137.483845 s
llvmpipe-3:3044 | 17669.019 ms | 7913 | avg: 3.329 ms | max: 21.280 ms | sum:26340.572 ms | max start: 232.804014 s | max end: 232.825294 s
llvmpipe-4:3045 | 17539.666 ms | 7438 | avg: 3.353 ms | max: 22.802 ms | sum:24936.398 ms | max start: 94.800014 s | max end: 94.822817 s
llvmpipe-0:3041 | 17428.494 ms | 9445 | avg: 2.663 ms | max: 29.456 ms | sum:25153.007 ms | max start: 96.231519 s | max end: 96.260975 s
llvmpipe-2:3043 | 17239.204 ms | 7962 | avg: 3.282 ms | max: 24.674 ms | sum:26133.925 ms | max start: 161.019506 s | max end: 161.044179 s
llvmpipe-1:3042 | 17143.242 ms | 8261 | avg: 3.118 ms | max: 22.086 ms | sum:25756.911 ms | max start: 379.521740 s | max end: 379.543826 s
Isolated Web Co:5723 | 11075.262 ms | 15573 | avg: 0.280 ms | max: 23.996 ms | sum: 4360.925 ms | max start: 230.161012 s | max end: 230.185008 s
...
------------------------------------------------------------------------------------------------------------
TOTAL: |2368428.034 ms | 2114759 | | 1048388.278 ms |
------------------------------------------------------------------------------------------------------------
vs 6.3.0.g8ca09d5 .999 1.777 1.602
* Re: [PATCH 10/10] sched/fair: Implement an EEVDF like policy
2023-03-08 8:39 ` Mike Galbraith
@ 2023-03-08 9:26 ` Mike Galbraith
2023-03-08 13:36 ` Mike Galbraith
1 sibling, 0 replies; 35+ messages in thread
From: Mike Galbraith @ 2023-03-08 9:26 UTC (permalink / raw)
To: Peter Zijlstra, mingo, vincent.guittot
Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, bristot, corbet, qyousef, chris.hyser, patrick.bellasi,
pjt, pavel, qperret, tim.c.chen, joshdon, timj, kprateek.nayak,
yu.c.chen, youssefesmat, joel
On Wed, 2023-03-08 at 09:39 +0100, Mike Galbraith wrote:
>
> netperf cfs +eevdf vs cfs
> TCP_SENDFILE-1 Avg: 89258 92080 1.031
> TCP_SENDFILE-2 Avg: 83271 83371 1.001
> TCP_SENDFILE-4 Avg: 56395 53011 .939
> TCP_SENDFILE-8 Avg: 26389 39470 1.495
> TCP_SENDFILE-16 Avg: 10251 19590 1.911
Forgot to recheck this not-so-modest clients >= CPUS win; it's also
repeatable.
-Mike
* Re: [PATCH 10/10] sched/fair: Implement an EEVDF like policy
2023-03-08 8:39 ` Mike Galbraith
2023-03-08 9:26 ` Mike Galbraith
@ 2023-03-08 13:36 ` Mike Galbraith
2023-03-09 4:23 ` Mike Galbraith
2023-03-09 9:06 ` Peter Zijlstra
1 sibling, 2 replies; 35+ messages in thread
From: Mike Galbraith @ 2023-03-08 13:36 UTC (permalink / raw)
To: Peter Zijlstra, mingo, vincent.guittot
Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, bristot, corbet, qyousef, chris.hyser, patrick.bellasi,
pjt, pavel, qperret, tim.c.chen, joshdon, timj, kprateek.nayak,
yu.c.chen, youssefesmat, joel
On Wed, 2023-03-08 at 09:39 +0100, Mike Galbraith wrote:
>
> Curiosity got the best of me...
Remember this little bugger, allegedly distilled from a real
application control thread starvation issue?
6.3.0.g8ca09d5-master
homer:/root # time taskset -c 3 starve
expecting to receive 10000000 signals
real 0m24.424s
user 0m4.468s
sys 0m18.957s
6.3.0.g8ca09d5-eevdf
homer:/root # time taskset -c 3 starve
expecting to receive 10000000 signals
zzzzzz
^C
real 15m24.115s
user 0m3.078s
sys 0m0.000s
#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
volatile unsigned long loop;
void handler(int n)
{
if (loop > 0)
--loop;
}
static int child(void)
{
pid_t ppid = getppid();
sleep(1);
while (1)
kill(ppid, SIGUSR1);
return 0;
}
int main(int argc, char **argv)
{
pid_t child_pid;
int r;
loop = argc > 1 ? strtoul(argv[1], NULL, 10) : 10000000;
printf("expecting to receive %lu signals\n", loop);
if ((child_pid = fork()) == 0)
exit(child());
signal(SIGUSR1, handler);
while (loop)
sleep(1);
r = kill(child_pid, SIGTERM);
waitpid(child_pid, NULL, 0);
return 0;
}
* Re: [PATCH 10/10] sched/fair: Implement an EEVDF like policy
2023-03-08 13:36 ` Mike Galbraith
@ 2023-03-09 4:23 ` Mike Galbraith
2023-03-10 20:38 ` Peter Zijlstra
2023-03-09 9:06 ` Peter Zijlstra
1 sibling, 1 reply; 35+ messages in thread
From: Mike Galbraith @ 2023-03-09 4:23 UTC (permalink / raw)
To: Peter Zijlstra, mingo, vincent.guittot
Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, bristot, corbet, qyousef, chris.hyser, patrick.bellasi,
pjt, pavel, qperret, tim.c.chen, joshdon, timj, kprateek.nayak,
yu.c.chen, youssefesmat, joel
On Wed, 2023-03-08 at 14:36 +0100, Mike Galbraith wrote:
>
> Remember this little bugger, allegedly distilled from a real
> application control thread starvation issue?
>
> 6.3.0.g8ca09d5-master
> homer:/root # time taskset -c 3 starve
> expecting to receive 10000000 signals
>
> real 0m24.424s
> user 0m4.468s
> sys 0m18.957s
>
> 6.3.0.g8ca09d5-eevdf
> homer:/root # time taskset -c 3 starve
> expecting to receive 10000000 signals
> zzzzzz
> ^C
Ok, seems there must be a math booboo lurking.
virgin source, 100% hog vs tbench buddy pair, all pinned.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
5060 root 20 0 4420 680 680 R 96.01 0.004 0:41.40 3 cpuhog
5058 root 20 0 25500 1920 1792 S 2.326 0.012 0:01.05 3 tbench
5059 root 20 0 8796 896 768 R 1.661 0.006 0:00.78 3 tbench_srv
echo NO_PRESERVE_LAG > features
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
5060 root 20 0 4420 680 680 R 99.33 0.004 1:28.24 3 cpuhog
5058 root 20 0 25500 1920 1792 R 0.333 0.012 0:01.75 3 tbench
5059 root 20 0 8796 896 768 S 0.333 0.006 0:01.30 3 tbench_srv
* Re: [PATCH 10/10] sched/fair: Implement an EEVDF like policy
2023-03-09 4:23 ` Mike Galbraith
@ 2023-03-10 20:38 ` Peter Zijlstra
2023-03-11 5:53 ` Mike Galbraith
0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-10 20:38 UTC (permalink / raw)
To: Mike Galbraith
Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
youssefesmat, joel
On Thu, Mar 09, 2023 at 05:23:33AM +0100, Mike Galbraith wrote:
> Ok, seems there must be a math booboo lurking.
Yep that.. please try the version I pushed out here:
https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/eevdf
(It's based on 6.3-rc1, but it should be trivial to rebase onto v6.2 if
you so wish.)
* Re: [PATCH 10/10] sched/fair: Implement an EEVDF like policy
2023-03-10 20:38 ` Peter Zijlstra
@ 2023-03-11 5:53 ` Mike Galbraith
2023-03-11 7:56 ` Mike Galbraith
0 siblings, 1 reply; 35+ messages in thread
From: Mike Galbraith @ 2023-03-11 5:53 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
youssefesmat, joel
On Fri, 2023-03-10 at 21:38 +0100, Peter Zijlstra wrote:
> On Thu, Mar 09, 2023 at 05:23:33AM +0100, Mike Galbraith wrote:
> > Ok, seems there must be a math booboo lurking.
>
> Yep that.. please try the version I pushed out here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/eevdf
>
> (It's based on 6.3-rc1, but it should be trivial to rebase onto v6.2 if
> you so wish.)
I stuffed it into .today. Yup, extreme sleeper stinginess is gone.
tbench/netperf numbers look roughly as previously. massive_intr vs
desktop CPU distribution improved as expected, but (to me) oddly, the
amount of desktop beating up itself did not.
These are all .today
perf.data.eevdf
Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms | Max delay start | Max delay end |
massive_intr:(9) |1499451.694 ms | 647385 | avg: 0.940 ms | max: 27.969 ms | sum:608348.424 ms | max start: 467.219496 s | max end: 467.247465 s
TOTAL: |2369880.457 ms | 2797272 | | 1243484.857 ms |
perf.data.eevdf_FAIR_SLEEPERS
Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms | Max delay start | Max delay end |
massive_intr:(9) |1415743.272 ms | 714999 | avg: 1.013 ms | max: 73.722 ms | sum:724624.600 ms | max start: 2678.846216 s | max end: 2678.919938 s
TOTAL: |2362333.392 ms | 2871586 | | 1238748.802 ms |
perf.data.master
Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms | Max delay start | Max delay end |
massive_intr:(9) |1463798.673 ms | 598383 | avg: 0.730 ms | max: 40.029 ms | sum:437069.674 ms | max start: 945.022965 s | max end: 945.062994 s
TOTAL: |2370011.937 ms | 1109828 | | 616408.789 ms |
* Re: [PATCH 10/10] sched/fair: Implement an EEVDF like policy
2023-03-11 5:53 ` Mike Galbraith
@ 2023-03-11 7:56 ` Mike Galbraith
0 siblings, 0 replies; 35+ messages in thread
From: Mike Galbraith @ 2023-03-11 7:56 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
youssefesmat, joel
On Sat, 2023-03-11 at 06:53 +0100, Mike Galbraith wrote:
>
> massive_intr vs desktop CPU distribution improved as expected, but
> (to me) oddly, the amount of desktop beating up itself did not.
Hmm, maybe expected when fancy deadline math tries to wedge a very wide
GUI into a not so wide and pretty well saturated CPU.
-Mike
* Re: [PATCH 10/10] sched/fair: Implement an EEVDF like policy
2023-03-08 13:36 ` Mike Galbraith
2023-03-09 4:23 ` Mike Galbraith
@ 2023-03-09 9:06 ` Peter Zijlstra
2023-03-09 12:44 ` Peter Zijlstra
1 sibling, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-09 9:06 UTC (permalink / raw)
To: Mike Galbraith
Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
youssefesmat, joel
Hi Mike!
On Wed, Mar 08, 2023 at 02:36:01PM +0100, Mike Galbraith wrote:
> On Wed, 2023-03-08 at 09:39 +0100, Mike Galbraith wrote:
> >
> > Curiosity got the best of me...
>
> Remember this little bugger, allegedly distilled from a real
> application control thread starvation issue?
Oooh, yeah, I should still have that somewhere. I'll try and remember
what exactly was needed to make it behave properly.
Thanks for your feedback. I'll go prod at things a bit more. Like I
wrote, the code is more or less at the point where it stopped crashing
and doing obviously bad things. It definitely needs more attention.
* Re: [PATCH 10/10] sched/fair: Implement an EEVDF like policy
2023-03-09 9:06 ` Peter Zijlstra
@ 2023-03-09 12:44 ` Peter Zijlstra
2023-03-09 15:29 ` Peter Zijlstra
0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-09 12:44 UTC (permalink / raw)
To: Mike Galbraith
Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
youssefesmat, joel
On Thu, Mar 09, 2023 at 10:06:33AM +0100, Peter Zijlstra wrote:
> Hi Mike!
>
> On Wed, Mar 08, 2023 at 02:36:01PM +0100, Mike Galbraith wrote:
> > On Wed, 2023-03-08 at 09:39 +0100, Mike Galbraith wrote:
> > >
> > > Curiosity got the best of me...
> >
> > Remember this little bugger, allegedly distilled from a real
> > application control thread starvation issue?
>
> Oooh, yeah, I should still have that somewhere. I'll try and remember
> what exactly was needed to make it behave properly.
That thing wants both wakeup preemption and sleeper bonus. Specifically,
it needs the signal to insta-preempt the 'pointless' kill loop.
What happens is that while lag is positive we get wakeup preemption, but
when lag goes negative wakeup preemption is not achieved and we get
delayed by a full tick.
This gets us very little actual runtime.
Let me see what to do about that...
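A minimal sketch of the cost being described (the tick length is an assumption for illustration, this is not the kernel's code): a wakeup only preempts when the waking entity is eligible, i.e. has non-negative lag; otherwise it is only reconsidered at the next scheduler tick.

```c
#include <assert.h>

#define TICK_NSEC 1000000LL /* assume HZ=1000, i.e. a 1 ms tick */

/*
 * Worst-case wakeup-to-run delay in this sketch: an eligible waker
 * (non-negative lag) preempts immediately; an ineligible one has to
 * wait out the remainder of the tick.
 */
static long long wakeup_delay_ns(long long lag)
{
    return lag >= 0 ? 0 : TICK_NSEC;
}
```

With a signal-driven loop like starve, that per-wakeup tick of delay dominates, which is why the task gets very little actual runtime.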
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 10/10] sched/fair: Implement an EEVDF like policy
2023-03-09 12:44 ` Peter Zijlstra
@ 2023-03-09 15:29 ` Peter Zijlstra
2023-03-09 15:39 ` Peter Zijlstra
` (2 more replies)
0 siblings, 3 replies; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-09 15:29 UTC (permalink / raw)
To: Mike Galbraith
Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
youssefesmat, joel
On Thu, Mar 09, 2023 at 01:44:13PM +0100, Peter Zijlstra wrote:
> On Thu, Mar 09, 2023 at 10:06:33AM +0100, Peter Zijlstra wrote:
> > Hi Mike!
> >
> > On Wed, Mar 08, 2023 at 02:36:01PM +0100, Mike Galbraith wrote:
> > > On Wed, 2023-03-08 at 09:39 +0100, Mike Galbraith wrote:
> > > >
> > > > Curiosity got the best of me...
> > >
> > > Remember this little bugger, allegedly distilled from a real
> > > application control thread starvation issue?
> >
> > Oooh, yeah, I should still have that somewhere. I'll try and remember
> > what exactly was needed to make it behave properly.
>
> That thing wants both wakeup preemption and sleeper bonus. Specifically,
> it needs the signal to insta-preempt the 'pointless' kill loop.
>
> What happens is that while lag is positive we get wakeup preemption, but
> when lag goes negative wakeup preemption is not achieved and we get
> delayed by a full tick.
>
> This gets us very little actual runtime.
>
> Let me see what to do about that...
So if I add TICK_NSEC based sleeper bonus (/2 for gentle), then starve
works -- this is the absolutely minimal amount required. It sucks a bit
it's HZ dependent, but alas.
Also, the whole sleeper bonus gets us back into needing to track the old
vruntime and the overflow crap for super long sleeps and all that fugly
:/ I was so hoping we could delete that code.
Oh well.
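The overflow issue referred to here is that vruntime comparisons are done on wrapping u64 values via a signed delta, which silently inverts once two values drift roughly 2^63 ns apart — hence CFS cannot simply trust a very long sleeper's ancient vruntime. A minimal illustration (not kernel code):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Order two wrapping u64 virtual times via a signed difference,
 * the way the kernel's entity ordering does. This is only correct
 * while the two values are less than 2^63 apart.
 */
static int vruntime_before(uint64_t a, uint64_t b)
{
    return (int64_t)(a - b) < 0;
}
```

Once the gap exceeds 2^63 the signed delta changes sign and the comparison inverts, which is why long sleeps need the extra capping logic.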
(also, did you know that removing the debug cruft helps with running
numbers? ;-)
I think it adds a bit of variance to the numbers -- but I've not run
long enough to tell with certainty.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 10/10] sched/fair: Implement an EEVDF like policy
2023-03-09 15:29 ` Peter Zijlstra
@ 2023-03-09 15:39 ` Peter Zijlstra
2023-03-09 16:24 ` Mike Galbraith
2023-03-09 16:42 ` Peter Zijlstra
2 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-09 15:39 UTC (permalink / raw)
To: Mike Galbraith
Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
youssefesmat, joel
On Thu, Mar 09, 2023 at 04:29:04PM +0100, Peter Zijlstra wrote:
> On Thu, Mar 09, 2023 at 01:44:13PM +0100, Peter Zijlstra wrote:
> > On Thu, Mar 09, 2023 at 10:06:33AM +0100, Peter Zijlstra wrote:
> > > Hi Mike!
> > >
> > > On Wed, Mar 08, 2023 at 02:36:01PM +0100, Mike Galbraith wrote:
> > > > On Wed, 2023-03-08 at 09:39 +0100, Mike Galbraith wrote:
> > > > >
> > > > > Curiosity got the best of me...
> > > >
> > > > Remember this little bugger, allegedly distilled from a real
> > > > application control thread starvation issue?
> > >
> > > Oooh, yeah, I should still have that somewhere. I'll try and remember
> > > what exactly was needed to make it behave properly.
> >
> > That thing wants both wakeup preemption and sleeper bonus. Specifically,
> > it needs the signal to insta-preempt the 'pointless' kill loop.
> >
> > What happens is that while lag is positive we get wakeup preemption, but
> > when lag goes negative wakeup preemption is not achieved and we get
> > delayed by a full tick.
> >
> > This gets us very little actual runtime.
> >
> > Let me see what to do about that...
>
> So if I add TICK_NSEC based sleeper bonus (/2 for gentle), then starve
> works -- this is the absolutely minimal amount required. It sucks a bit
> it's HZ dependent, but alas.
>
> Also, the whole sleeper bonus gets us back into needing to track the old
> vruntime and the overflow crap for super long sleeps and all that fugly
> :/ I was so hoping we could delete that code.
>
> Oh well.
>
> (also, did you know that removing the debug cruft helps with running
> numbers? ;-)
Also, it helps to turn the sched_feat on... clearly I should be calling
it a day.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 10/10] sched/fair: Implement an EEVDF like policy
2023-03-09 15:29 ` Peter Zijlstra
2023-03-09 15:39 ` Peter Zijlstra
@ 2023-03-09 16:24 ` Mike Galbraith
2023-03-09 16:42 ` Peter Zijlstra
2 siblings, 0 replies; 35+ messages in thread
From: Mike Galbraith @ 2023-03-09 16:24 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
youssefesmat, joel
On Thu, 2023-03-09 at 16:29 +0100, Peter Zijlstra wrote:
>
> So if I add TICK_NSEC based sleeper bonus (/2 for gentle), then starve
> works -- this is the absolutely minimal amount required. It sucks a bit
> it's HZ dependent, but alas.
>
> Also, the whole sleeper bonus gets us back into needing to track the old
> vruntime and the overflow crap for super long sleeps and all that fugly
> :/ I was so hoping we could delete that code.
>
> Oh well.
Yeah, it's a worthy target.
-Mike
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 10/10] sched/fair: Implement an EEVDF like policy
2023-03-09 15:29 ` Peter Zijlstra
2023-03-09 15:39 ` Peter Zijlstra
2023-03-09 16:24 ` Mike Galbraith
@ 2023-03-09 16:42 ` Peter Zijlstra
2 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-09 16:42 UTC (permalink / raw)
To: Mike Galbraith
Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
qyousef, chris.hyser, patrick.bellasi, pjt, pavel, qperret,
tim.c.chen, joshdon, timj, kprateek.nayak, yu.c.chen,
youssefesmat, joel
On Thu, Mar 09, 2023 at 04:29:04PM +0100, Peter Zijlstra wrote:
> So if I add TICK_NSEC based sleeper bonus (/2 for gentle), then starve
> works -- this is the absolutely minimal amount required. It sucks a bit
> it's HZ dependent, but alas.
Fixes starve, sucks for schbench and hackbench :/
Clearly more thinking is required...
root@ivb-ep:~/bench# echo NO_FAIR_SLEEPERS > /debug/sched/features
root@ivb-ep:~/bench# ./doit-schbench.sh ; ./doit-hackbench-series.sh
Latency percentiles (usec)
50.0000th: 83
75.0000th: 102
90.0000th: 109
95.0000th: 114
*99.0000th: 450
99.5000th: 723
99.9000th: 985
min=0, max=1067
1: 0.55355 +- 0.00290 seconds time elapsed ( +- 0.52% )
2: 0.79591 +- 0.00545 seconds time elapsed ( +- 0.68% )
5: 1.5804 +- 0.0102 seconds time elapsed ( +- 0.65% )
10: 2.5674 +- 0.0110 seconds time elapsed ( +- 0.43% )
20: 4.6116 +- 0.0160 seconds time elapsed ( +- 0.35% )
40: 9.5965 +- 0.0167 seconds time elapsed ( +- 0.17% )
root@ivb-ep:~/bench# time taskset -c 3 ./starve/starve 1000000
expecting to receive 1000000 signals
^C
real 0m32.999s
user 0m0.000s
sys 0m0.719s
root@ivb-ep:~/bench# echo FAIR_SLEEPERS > /debug/sched/features
root@ivb-ep:~/bench# ./doit-schbench.sh ; ./doit-hackbench-series.sh
Latency percentiles (usec)
50.0000th: 87
75.0000th: 103
90.0000th: 111
95.0000th: 116
*99.0000th: 163
99.5000th: 697
99.9000th: 1110
min=0, max=1522
1: 0.59076 +- 0.00577 seconds time elapsed ( +- 0.98% )
2: 0.86093 +- 0.00407 seconds time elapsed ( +- 0.47% )
5: 2.1018 +- 0.0129 seconds time elapsed ( +- 0.61% )
10: 3.6378 +- 0.0395 seconds time elapsed ( +- 1.09% )
20: 5.56884 +- 0.00979 seconds time elapsed ( +- 0.18% )
40: 10.8570 +- 0.0207 seconds time elapsed ( +- 0.19% )
root@ivb-ep:~/bench# time taskset -c 3 ./starve/starve 1000000
expecting to receive 1000000 signals
real 0m5.651s
user 0m0.604s
sys 0m4.047s
---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4938,17 +4938,22 @@ place_entity(struct cfs_rq *cfs_rq, stru
{
u64 vruntime = avg_vruntime(cfs_rq);
+ if (sched_feat(PRESERVE_LAG))
+ vruntime -= se->lag;
+
if (sched_feat(FAIR_SLEEPERS)) {
- u64 sleep_time;
+// u64 sleep_time;
/* sleeps up to a single latency don't count. */
if (!initial) {
- unsigned long thresh;
+ unsigned long thresh = TICK_NSEC;
- if (se_is_idle(se))
- thresh = sysctl_sched_min_granularity;
- else
- thresh = sysctl_sched_latency;
+ if (!sched_feat(EEVDF)) {
+ if (se_is_idle(se))
+ thresh = sysctl_sched_min_granularity;
+ else
+ thresh = sysctl_sched_latency;
+ }
/*
* Halve their sleep time's effect, to allow
@@ -4957,7 +4962,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
if (sched_feat(GENTLE_FAIR_SLEEPERS))
thresh >>= 1;
- vruntime -= thresh;
+ vruntime -= calc_delta_fair(thresh, se);
}
/*
@@ -4966,15 +4971,12 @@ place_entity(struct cfs_rq *cfs_rq, stru
* slept for a long time, don't even try to compare its vruntime with
* the base as it may be too far off and the comparison may get
* inversed due to s64 overflow.
- */
sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start;
if ((s64)sleep_time < 60LL * NSEC_PER_SEC)
+ */
vruntime = max_vruntime(se->vruntime, vruntime);
}
- if (sched_feat(PRESERVE_LAG))
- vruntime -= se->lag;
-
se->vruntime = vruntime;
set_slice(cfs_rq, se);
}
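A hedged userspace sketch of what the patched placement computes (the constants and the simplified weight-scaling helper are assumptions for illustration, not the kernel's exact code): start from the queue's average vruntime, subtract the preserved lag, then subtract a weight-scaled tick's worth of sleeper bonus, halved for gentle sleepers.

```c
#include <assert.h>

#define TICK_NSEC   1000000LL   /* assume HZ=1000, i.e. a 1 ms tick   */
#define NICE_0_LOAD 1024LL      /* assumed weight of a nice-0 entity  */

/* Weight-scale a wall-clock delta into virtual time, a la calc_delta_fair(). */
static long long calc_delta_fair(long long delta, long long weight)
{
    return delta * NICE_0_LOAD / weight;
}

/*
 * Sketch of the patched place_entity(): subtract preserved lag
 * (PRESERVE_LAG), then a tick's worth of sleeper bonus, halved when
 * gentle (GENTLE_FAIR_SLEEPERS), scaled by the entity's weight.
 */
static long long place_entity(long long avg_vruntime, long long lag,
                              long long weight, int gentle)
{
    long long vruntime = avg_vruntime - lag;
    long long thresh = TICK_NSEC;

    if (gentle)
        thresh >>= 1;
    return vruntime - calc_delta_fair(thresh, weight);
}
```

This also shows why the result is HZ dependent: the bonus is a tick, not a fixed amount of virtual time.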
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 00/10] sched: EEVDF using latency-nice
2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
` (9 preceding siblings ...)
2023-03-06 13:25 ` [PATCH 10/10] sched/fair: Implement an EEVDF like policy Peter Zijlstra
@ 2023-03-07 10:27 ` Vincent Guittot
2023-03-07 13:08 ` Peter Zijlstra
2023-03-08 15:13 ` Shrikanth Hegde
` (2 subsequent siblings)
13 siblings, 1 reply; 35+ messages in thread
From: Vincent Guittot @ 2023-03-07 10:27 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, linux-kernel, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
kprateek.nayak, yu.c.chen, youssefesmat, joel
On Mon, 6 Mar 2023 at 15:17, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Hi!
>
> Ever since looking at the latency-nice patches, I've wondered if EEVDF would
> not make more sense, and I did point Vincent at some older patches I had for
> that (which is where his augmented rbtree thing comes from).
>
> Also, since I really dislike the dual tree, I also figured we could dynamically
> switch between an augmented tree and not (and while I have code for that,
> that's not included in this posting because with the current results I don't
> think we actually need this).
>
> Anyway, since I'm somewhat under the weather, I spent last week desperately
> trying to connect a small cluster of neurons in defiance of the snot overlord
> and bring back the EEVDF patches from the dark crypts where they'd been
> gathering cobwebs for the past 13 odd years.
I haven't studied your patchset in detail yet, but at first glance this
seems to be a major rework of the CFS task placement, and the latency
part is just an add-on on top of moving to EEVDF scheduling.
>
> By friday they worked well enough, and this morning (because obviously I forgot
> the weekend is ideal to run benchmarks) I ran a bunch of hackbench, netperf,
> tbench and sysbench -- there's a bunch of wins and losses, but nothing that
> indicates a total fail.
>
> ( in fact, some of the schbench results seem to indicate EEVDF schedules a lot
> more consistent than CFS and has a bunch of latency wins )
>
> ( hackbench also doesn't show the augmented tree and generally more expensive
> pick to be a loss, in fact it shows a slight win here )
>
>
> hackbench load + cyclictest --policy other results:
>
>
> EEVDF CFS
>
> # Min Latencies: 00053
> LNICE(19) # Avg Latencies: 04350
> # Max Latencies: 76019
>
> # Min Latencies: 00052 00053
> LNICE(0) # Avg Latencies: 00690 00687
> # Max Latencies: 14145 13913
>
> # Min Latencies: 00019
> LNICE(-19) # Avg Latencies: 00261
> # Max Latencies: 05642
>
>
> The nice -19 numbers aren't as pretty as Vincent's, but at the end I was going
> cross-eyed from staring at tree prints and I just couldn't figure out where it
> was going side-ways.
>
> There's definitely more benchmarking/tweaking to be done (0-day already
> reported a stress-ng loss), but if we can pull this off we can delete a whole
> bunch of icky heuristics code. EEVDF is a much better defined policy than what
> we currently have.
>
>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 00/10] sched: EEVDF using latency-nice
2023-03-07 10:27 ` [PATCH 00/10] sched: EEVDF using latency-nice Vincent Guittot
@ 2023-03-07 13:08 ` Peter Zijlstra
0 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2023-03-07 13:08 UTC (permalink / raw)
To: Vincent Guittot
Cc: mingo, linux-kernel, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, bristot, corbet, qyousef, chris.hyser,
patrick.bellasi, pjt, pavel, qperret, tim.c.chen, joshdon, timj,
kprateek.nayak, yu.c.chen, youssefesmat, joel
On Tue, Mar 07, 2023 at 11:27:37AM +0100, Vincent Guittot wrote:
> On Mon, 6 Mar 2023 at 15:17, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Hi!
> >
> > Ever since looking at the latency-nice patches, I've wondered if EEVDF would
> > not make more sense, and I did point Vincent at some older patches I had for
> > that (which is where his augmented rbtree thing comes from).
> >
> > Also, since I really dislike the dual tree, I also figured we could dynamically
> > switch between an augmented tree and not (and while I have code for that,
> > that's not included in this posting because with the current results I don't
> > think we actually need this).
> >
> > Anyway, since I'm somewhat under the weather, I spent last week desperately
> > trying to connect a small cluster of neurons in defiance of the snot overlord
> > and bring back the EEVDF patches from the dark crypts where they'd been
> > gathering cobwebs for the past 13 odd years.
>
> I haven't studied your patchset in detail yet, but at first glance this
> seems to be a major rework of the CFS task placement, and the latency
> part is just an add-on on top of moving to EEVDF scheduling.
It completely reworks the base scheduler, placement, preemption, picking
-- everything. The only thing they have in common is that they're both a
virtual time based scheduler.
The big advantage I see is that EEVDF is fairly well known and studied,
and a much better defined scheduler than WFQ. Specifically, where WFQ is
only well defined in how much time is given to any task (bandwidth), but
says nothing about how that is distributed in time. That is, there is no
native preemption condition/constraint etc. -- all that code we have is
random heuristics mostly.
The WF2Q/EEVDF class of schedulers otoh *do* define all that. There is a
lot less wiggle room as a result. The avg_vruntime / placement stuff I
did is fundamental to how it controls bandwidth distribution and
guarantees the WFQ subset. Specifically, by limiting the pick to that
subset of tasks that has positive lag (owed time), it guarantees this
fairness -- but that means we need a working measure of lag.
Similarly, since the whole 'when' thing is well defined in order to
provide the additional latency goals of these schedulers, placement is
crucial. Things like the sleeper bonus are fundamentally incompatible with
latency guarantees -- both affect the 'when'.
Initial EEVDF paper is here:
https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=805acf7726282721504c8f00575d91ebfd750564
It contains a few 'mistakes' and oversights, but those should not
matter.
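The pick rule described above can be sketched in plain C, under the simplifying assumption of a flat array instead of the augmented rbtree: only entities with non-negative lag (vruntime at or below avg_vruntime) are eligible, and among those the earliest virtual deadline wins.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative entity state; all times in abstract virtual-time units. */
struct entity {
    long long vruntime;   /* weight-scaled service received            */
    long long deadline;   /* virtual deadline = vruntime + slice/weight */
};

/*
 * Lag is avg_vruntime - vruntime: positive means the entity is owed
 * service. Restricting the pick to non-negative lag is what guarantees
 * the WFQ bandwidth subset.
 */
static int eligible(const struct entity *se, long long avg_vruntime)
{
    return se->vruntime <= avg_vruntime;
}

/* Earliest Eligible Virtual Deadline First over a plain array. */
static const struct entity *pick_eevdf(const struct entity *rq, size_t n,
                                       long long avg_vruntime)
{
    const struct entity *best = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        if (!eligible(&rq[i], avg_vruntime))
            continue;
        if (!best || rq[i].deadline < best->deadline)
            best = &rq[i];
    }
    return best;
}
```

Note how an entity with the earliest deadline but negative lag (vruntime above the average) is skipped entirely: the deadline controls "when", eligibility controls "whether".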
Anyway, I'm still struggling to make complete sense of what you did --
will continue to stare at that.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 00/10] sched: EEVDF using latency-nice
2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
` (10 preceding siblings ...)
2023-03-07 10:27 ` [PATCH 00/10] sched: EEVDF using latency-nice Vincent Guittot
@ 2023-03-08 15:13 ` Shrikanth Hegde
2023-03-22 6:49 ` K Prateek Nayak
2023-03-23 11:53 ` Pavel Machek
13 siblings, 0 replies; 35+ messages in thread
From: Shrikanth Hegde @ 2023-03-08 15:13 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, bristot, corbet, qyousef, chris.hyser, patrick.bellasi,
pjt, pavel, qperret, tim.c.chen, joshdon, timj, kprateek.nayak,
yu.c.chen, youssefesmat, joel, mingo, vincent.guittot
> Hi!
>
> Ever since looking at the latency-nice patches, I've wondered if EEVDF would
> not make more sense, and I did point Vincent at some older patches I had for
> that (which is where his augmented rbtree thing comes from).
>
> Also, since I really dislike the dual tree, I also figured we could dynamically
> switch between an augmented tree and not (and while I have code for that,
> that's not included in this posting because with the current results I don't
> think we actually need this).
>
> Anyway, since I'm somewhat under the weather, I spent last week desperately
> trying to connect a small cluster of neurons in defiance of the snot overlord
> and bring back the EEVDF patches from the dark crypts where they'd been
> gathering cobwebs for the past 13 odd years.
>
> By friday they worked well enough, and this morning (because obviously I forgot
> the weekend is ideal to run benchmarks) I ran a bunch of hackbench, netperf,
> tbench and sysbench -- there's a bunch of wins and losses, but nothing that
> indicates a total fail.
>
> ( in fact, some of the schbench results seem to indicate EEVDF schedules a lot
> more consistent than CFS and has a bunch of latency wins )
>
> ( hackbench also doesn't show the augmented tree and generally more expensive
> pick to be a loss, in fact it shows a slight win here )
>
>
> hackbench load + cyclictest --policy other results:
>
>
> EEVDF CFS
>
> # Min Latencies: 00053
> LNICE(19) # Avg Latencies: 04350
> # Max Latencies: 76019
>
> # Min Latencies: 00052 00053
> LNICE(0) # Avg Latencies: 00690 00687
> # Max Latencies: 14145 13913
>
> # Min Latencies: 00019
> LNICE(-19) # Avg Latencies: 00261
> # Max Latencies: 05642
>
>
> The nice -19 numbers aren't as pretty as Vincent's, but at the end I was going
> cross-eyed from staring at tree prints and I just couldn't figure out where it
> was going side-ways.
>
> There's definitely more benchmarking/tweaking to be done (0-day already
> reported a stress-ng loss), but if we can pull this off we can delete a whole
> bunch of icky heuristics code. EEVDF is a much better defined policy than what
> we currently have.
>
Tested the patch series on powerpc systems. This test was done in the same way
as for Vincent's V12 series.
Two cgroups are created: cgroup2 runs stress-ng -l 50 --cpu=<total_cpu> and
cgroup1 runs micro-benchmarks. Different latency values are assigned to
cgroup1.
Tested on two different systems: one system has 480 CPUs and the other has 96
CPUs.
++++++++
Summary:
++++++++
For hackbench, the 480 CPU system shows good improvement.
The 96 CPU system shows the same numbers as 6.2. The smaller system was showing
regressions as discussed in Vincent's V12 series. With this patch, there is no regression.
Schbench shows good improvement compared to v6.2 at LN=0 or LN=-20, whereas
at LN=19 it shows a regression.
Please suggest if any variation of these benchmarks, or a different benchmark, should be run.
++++++++++++++++++
480 CPU system
++++++++++++++++++
==========
schbench
==========
v6.2 | v6.2+LN=0 | v6.2+LN=-20 | v6.2+LN=19
1 Threads
50.0th: 14.00 | 12.00 | 14.50 | 15.00
75.0th: 16.50 | 14.50 | 17.00 | 18.00
90.0th: 18.50 | 17.00 | 19.50 | 20.00
95.0th: 20.50 | 18.50 | 22.00 | 23.50
99.0th: 27.50 | 24.50 | 31.50 | 155.00
99.5th: 36.00 | 30.00 | 44.50 | 2991.00
99.9th: 81.50 | 171.50 | 153.00 | 4621.00
2 Threads
50.0th: 14.00 | 15.50 | 17.00 | 16.00
75.0th: 17.00 | 18.00 | 19.00 | 19.00
90.0th: 20.00 | 21.00 | 22.00 | 22.50
95.0th: 23.00 | 23.00 | 25.00 | 25.50
99.0th: 71.00 | 30.50 | 35.50 | 990.50
99.5th: 1170.00 | 53.00 | 71.00 | 3719.00
99.9th: 5088.00 | 245.50 | 138.00 | 6644.00
4 Threads
50.0th: 20.50 | 20.00 | 20.00 | 19.50
75.0th: 24.50 | 23.00 | 23.00 | 23.50
90.0th: 31.00 | 27.00 | 26.50 | 27.50
95.0th: 260.50 | 29.50 | 29.00 | 35.00
99.0th: 3644.00 | 106.00 | 37.50 | 2884.00
99.5th: 5152.00 | 227.00 | 92.00 | 5496.00
99.9th: 8076.00 | 3662.50 | 517.00 | 8640.00
8 Threads
50.0th: 26.00 | 23.50 | 22.50 | 25.00
75.0th: 32.50 | 29.50 | 27.50 | 31.00
90.0th: 41.50 | 34.50 | 31.50 | 39.00
95.0th: 794.00 | 37.00 | 34.50 | 579.50
99.0th: 5992.00 | 48.50 | 52.00 | 5872.00
99.5th: 7208.00 | 100.50 | 97.50 | 7280.00
99.9th: 9392.00 | 4098.00 | 1226.00 | 9328.00
16 Threads
50.0th: 37.50 | 33.00 | 34.00 | 37.00
75.0th: 49.50 | 43.50 | 44.00 | 49.00
90.0th: 70.00 | 52.00 | 53.00 | 66.00
95.0th: 1284.00 | 57.50 | 59.00 | 1162.50
99.0th: 5600.00 | 79.50 | 111.50 | 5912.00
99.5th: 7216.00 | 282.00 | 194.50 | 7392.00
99.9th: 9328.00 | 4026.00 | 2009.00 | 9440.00
32 Threads
50.0th: 59.00 | 56.00 | 57.00 | 59.00
75.0th: 83.00 | 77.50 | 79.00 | 83.00
90.0th: 118.50 | 94.00 | 95.00 | 120.50
95.0th: 1921.00 | 104.50 | 104.00 | 1800.00
99.0th: 6672.00 | 425.00 | 255.00 | 6384.00
99.5th: 8252.00 | 2800.00 | 1252.00 | 7696.00
99.9th: 10448.00 | 7264.00 | 5888.00 | 9504.00
=========
hackbench
=========
Type groups v6.2 | v6.2+LN=0 | v6.2+LN=-20 | v6.2+LN=19
Process 10 0.19 | 0.18 | 0.17 | 0.18
Process 20 0.34 | 0.32 | 0.33 | 0.31
Process 30 0.45 | 0.42 | 0.43 | 0.43
Process 40 0.58 | 0.53 | 0.53 | 0.53
Process 50 0.70 | 0.64 | 0.64 | 0.65
Process 60 0.82 | 0.74 | 0.75 | 0.76
thread 10 0.20 | 0.19 | 0.19 | 0.19
thread 20 0.36 | 0.34 | 0.34 | 0.34
Process(Pipe) 10 0.24 | 0.15 | 0.15 | 0.15
Process(Pipe) 20 0.46 | 0.22 | 0.22 | 0.21
Process(Pipe) 30 0.65 | 0.30 | 0.29 | 0.29
Process(Pipe) 40 0.90 | 0.35 | 0.36 | 0.34
Process(Pipe) 50 1.04 | 0.38 | 0.39 | 0.38
Process(Pipe) 60 1.16 | 0.42 | 0.42 | 0.43
thread(Pipe) 10 0.19 | 0.13 | 0.13 | 0.13
thread(Pipe) 20 0.46 | 0.21 | 0.21 | 0.21
++++++++++++++++++
96 CPU system
++++++++++++++++++
===========
schbench
===========
v6.2 | v6.2+LN=0 | v6.2+LN=-20 | v6.2+LN=19
1 Thread
50.0th: 10.50 | 10.00 | 10.00 | 11.00
75.0th: 12.50 | 11.50 | 11.50 | 12.50
90.0th: 15.00 | 13.00 | 13.50 | 16.50
95.0th: 47.50 | 15.00 | 15.00 | 274.50
99.0th: 4744.00 | 17.50 | 18.00 | 5032.00
99.5th: 7640.00 | 18.50 | 525.00 | 6636.00
99.9th: 8916.00 | 538.00 | 6704.00 | 9264.00
2 Threads
50.0th: 11.00 | 10.00 | 11.00 | 11.00
75.0th: 13.50 | 12.00 | 12.50 | 13.50
90.0th: 17.00 | 14.00 | 14.00 | 17.00
95.0th: 451.50 | 16.00 | 15.50 | 839.00
99.0th: 5488.00 | 20.50 | 18.00 | 6312.00
99.5th: 6712.00 | 986.00 | 19.00 | 7664.00
99.9th: 9856.00 | 4913.00 | 1154.00 | 8736.00
4 Threads
50.0th: 13.00 | 12.00 | 12.00 | 13.00
75.0th: 15.00 | 14.00 | 14.00 | 15.00
90.0th: 23.50 | 16.00 | 16.00 | 20.00
95.0th: 2508.00 | 17.50 | 17.50 | 1818.00
99.0th: 7232.00 | 777.00 | 38.50 | 5952.00
99.5th: 8720.00 | 3548.00 | 1926.00 | 7788.00
99.9th: 10352.00 | 6320.00 | 7160.00 | 10000.00
8 Threads
50.0th: 16.00 | 15.00 | 15.00 | 16.00
75.0th: 20.00 | 18.00 | 18.00 | 19.50
90.0th: 371.50 | 20.00 | 21.00 | 245.50
95.0th: 2992.00 | 22.00 | 23.00 | 2608.00
99.0th: 7784.00 | 1084.50 | 563.50 | 7136.00
99.5th: 9488.00 | 2612.00 | 2696.00 | 8720.00
99.9th: 15568.00 | 6656.00 | 7496.00 | 10000.00
16 Threads
50.0th: 23.00 | 21.00 | 20.00 | 22.50
75.0th: 31.00 | 27.50 | 26.00 | 29.50
90.0th: 1981.00 | 32.50 | 30.50 | 1500.50
95.0th: 4856.00 | 304.50 | 34.00 | 4046.00
99.0th: 10112.00 | 5720.00 | 4590.00 | 8220.00
99.5th: 13104.00 | 7828.00 | 7008.00 | 9312.00
99.9th: 18624.00 | 9856.00 | 9504.00 | 11984.00
32 Threads
50.0th: 36.50 | 34.50 | 33.50 | 35.50
75.0th: 56.50 | 48.00 | 46.00 | 52.50
90.0th: 4728.00 | 1470.50 | 376.00 | 3624.00
95.0th: 7808.00 | 4130.00 | 3850.00 | 6488.00
99.0th: 15776.00 | 8972.00 | 9060.00 | 9872.00
99.5th: 19072.00 | 11328.00 | 12224.00 | 11520.00
99.9th: 28864.00 | 18016.00 | 18368.00 | 18848.00
==========
Hackbench
==========
Type groups v6.2 | v6.2+LN=0 | v6.2+LN=-20 | v6.2+LN=19
Process 10 0.33 | 0.33 | 0.33 | 0.33
Process 20 0.61 | 0.56 | 0.58 | 0.57
Process 30 0.87 | 0.82 | 0.81 | 0.81
Process 40 1.10 | 1.05 | 1.06 | 1.05
Process 50 1.34 | 1.28 | 1.29 | 1.29
Process 60 1.58 | 1.53 | 1.52 | 1.51
thread 10 0.36 | 0.35 | 0.35 | 0.35
thread 20 0.64 | 0.63 | 0.62 | 0.62
Process(Pipe) 10 0.18 | 0.18 | 0.18 | 0.17
Process(Pipe) 20 0.32 | 0.31 | 0.31 | 0.31
Process(Pipe) 30 0.42 | 0.41 | 0.41 | 0.42
Process(Pipe) 40 0.56 | 0.53 | 0.55 | 0.53
Process(Pipe) 50 0.68 | 0.66 | 0.66 | 0.66
Process(Pipe) 60 0.80 | 0.78 | 0.78 | 0.78
thread(Pipe) 10 0.20 | 0.18 | 0.19 | 0.18
thread(Pipe) 20 0.34 | 0.34 | 0.33 | 0.33
Tested-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 00/10] sched: EEVDF using latency-nice
2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
` (11 preceding siblings ...)
2023-03-08 15:13 ` Shrikanth Hegde
@ 2023-03-22 6:49 ` K Prateek Nayak
2023-03-22 9:38 ` K Prateek Nayak
2023-03-23 11:53 ` Pavel Machek
13 siblings, 1 reply; 35+ messages in thread
From: K Prateek Nayak @ 2023-03-22 6:49 UTC (permalink / raw)
To: Peter Zijlstra, mingo, vincent.guittot
Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, bristot, corbet, qyousef, chris.hyser, patrick.bellasi,
pjt, pavel, qperret, tim.c.chen, joshdon, timj, yu.c.chen,
youssefesmat, joel
Hello Peter,
Leaving some results from my testing on a dual socket Zen3 machine
(2 x 64C/128T) below.
tl;dr
o I've not tested workloads with nice and latency nice yet, focusing more
on the out-of-the-box performance. No changes to sched_feat were made
for the same reason.
o Except for hackbench (m:n communication relationship), I do not see any
regression for other standard benchmarks (mostly 1:1 or 1:n relationships)
when the system is below fully loaded.
o At the fully loaded scenario, schbench seems to be unhappy. Looking at the
data from /proc/<pid>/sched for the tasks with schedstats enabled,
there is an increase in the number of context switches and the total wait
sum. When the system is overloaded, things flip and the schbench tail
latency improves drastically. I suspect the involuntary
context switches help workers make progress much sooner after wakeup
compared to tip, thus leading to lower tail latency.
o For the same reason as above, tbench throughput takes a hit, with the
number of involuntary context switches increasing drastically for the
tbench server. There is also an increase in wait sum noticed.
o A couple of real-world workloads were also tested. DeathStarBench
throughput tanks much more with the updated version in your tree
compared to this series as is.
SpecJBB Max-jOPS sees large improvements, but this comes at the cost of a
drop in Critical-jOPS, signifying an increase in either wait time
or involuntary context switches, which can lead to
transactions taking longer to complete.
o Apart from DeathStarBench, all the trends reported remain the same when
comparing the version in your tree and this series, as is, applied
on the same base kernel.
I'll leave the detailed results below and some limited analysis.
On 3/6/2023 6:55 PM, Peter Zijlstra wrote:
> Hi!
>
> Ever since looking at the latency-nice patches, I've wondered if EEVDF would
> not make more sense, and I did point Vincent at some older patches I had for
> that (which is where his augmented rbtree thing comes from).
>
> Also, since I really dislike the dual tree, I also figured we could dynamically
> switch between an augmented tree and not (and while I have code for that,
> that's not included in this posting because with the current results I don't
> think we actually need this).
>
> Anyway, since I'm somewhat under the weather, I spent last week desperately
> trying to connect a small cluster of neurons in defiance of the snot overlord
> and bring back the EEVDF patches from the dark crypts where they'd been
> gathering cobwebs for the past 13 odd years.
>
> By friday they worked well enough, and this morning (because obviously I forgot
> the weekend is ideal to run benchmarks) I ran a bunch of hackbench, netperf,
> tbench and sysbench -- there's a bunch of wins and losses, but nothing that
> indicates a total fail.
>
> ( in fact, some of the schbench results seem to indicate EEVDF schedules a lot
> more consistent than CFS and has a bunch of latency wins )
>
> ( hackbench also doesn't show the augmented tree and generally more expensive
> pick to be a loss, in fact it shows a slight win here )
>
>
> hackbench load + cyclictest --policy other results:
>
>
> EEVDF CFS
>
> # Min Latencies: 00053
> LNICE(19) # Avg Latencies: 04350
> # Max Latencies: 76019
>
> # Min Latencies: 00052 00053
> LNICE(0) # Avg Latencies: 00690 00687
> # Max Latencies: 14145 13913
>
> # Min Latencies: 00019
> LNICE(-19) # Avg Latencies: 00261
> # Max Latencies: 05642
>
Following are the results from testing the series on a dual socket
Zen3 machine (2 x 64C/128T):
NPS modes are used to logically divide a single socket into
multiple NUMA regions.
Following is the NUMA configuration for each NPS mode on the system:
NPS1: Each socket is a NUMA node.
Total 2 NUMA nodes in the dual socket machine.
Node 0: 0-63, 128-191
Node 1: 64-127, 192-255
NPS2: Each socket is further logically divided into 2 NUMA regions.
Total 4 NUMA nodes exist over 2 socket.
Node 0: 0-31, 128-159
Node 1: 32-63, 160-191
Node 2: 64-95, 192-223
Node 3: 96-127, 224-255
NPS4: Each socket is logically divided into 4 NUMA regions.
Total 8 NUMA nodes exist over 2 socket.
Node 0: 0-15, 128-143
Node 1: 16-31, 144-159
Node 2: 32-47, 160-175
Node 3: 48-63, 176-191
Node 4: 64-79, 192-207
Node 5: 80-95, 208-223
Node 6: 96-111, 224-239
Node 7: 112-127, 240-255
Kernel versions:
- tip: 6.2.0-rc6 tip sched/core
- eevdf: 6.2.0-rc6 tip sched/core
+ eevdf commits from your tree
(https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/eevdf)
- eevdf prev: 6.2.0-rc6 tip sched/core + this series as is
When the testing started, the tip was at:
commit 7c4a5b89a0b5 "sched/rt: pick_next_rt_entity(): check list_entry"
Benchmark Results:
~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~
o NPS1
Test: tip eevdf
1-groups: 4.63 (0.00 pct) 4.52 (2.37 pct)
2-groups: 4.42 (0.00 pct) 5.41 (-22.39 pct) *
4-groups: 4.21 (0.00 pct) 5.26 (-24.94 pct) *
8-groups: 4.95 (0.00 pct) 5.01 (-1.21 pct)
16-groups: 5.43 (0.00 pct) 6.24 (-14.91 pct) *
o NPS2
Test: tip eevdf
1-groups: 4.68 (0.00 pct) 4.56 (2.56 pct)
2-groups: 4.45 (0.00 pct) 5.19 (-16.62 pct) *
4-groups: 4.19 (0.00 pct) 4.53 (-8.11 pct) *
8-groups: 4.80 (0.00 pct) 4.81 (-0.20 pct)
16-groups: 5.60 (0.00 pct) 6.22 (-11.07 pct) *
o NPS4
Test: tip eevdf
1-groups: 4.68 (0.00 pct) 4.57 (2.35 pct)
2-groups: 4.56 (0.00 pct) 5.19 (-13.81 pct) *
4-groups: 4.50 (0.00 pct) 4.96 (-10.22 pct) *
8-groups: 5.76 (0.00 pct) 5.49 (4.68 pct)
16-groups: 5.60 (0.00 pct) 6.53 (-16.60 pct) *
~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~
o NPS1
#workers: tip eevdf
1: 36.00 (0.00 pct) 36.00 (0.00 pct)
2: 37.00 (0.00 pct) 37.00 (0.00 pct)
4: 38.00 (0.00 pct) 39.00 (-2.63 pct)
8: 52.00 (0.00 pct) 50.00 (3.84 pct)
16: 66.00 (0.00 pct) 68.00 (-3.03 pct)
32: 111.00 (0.00 pct) 109.00 (1.80 pct)
64: 213.00 (0.00 pct) 212.00 (0.46 pct)
128: 502.00 (0.00 pct) 637.00 (-26.89 pct) *
256: 45632.00 (0.00 pct) 24992.00 (45.23 pct) ^
512: 78720.00 (0.00 pct) 44096.00 (43.98 pct) ^
o NPS2
#workers: tip eevdf
1: 31.00 (0.00 pct) 23.00 (25.80 pct)
2: 32.00 (0.00 pct) 33.00 (-3.12 pct)
4: 39.00 (0.00 pct) 37.00 (5.12 pct)
8: 52.00 (0.00 pct) 49.00 (5.76 pct)
16: 67.00 (0.00 pct) 68.00 (-1.49 pct)
32: 113.00 (0.00 pct) 112.00 (0.88 pct)
64: 213.00 (0.00 pct) 214.00 (-0.46 pct)
128: 508.00 (0.00 pct) 491.00 (3.34 pct)
256: 46912.00 (0.00 pct) 22304.00 (52.45 pct) ^
512: 76672.00 (0.00 pct) 42944.00 (43.98 pct) ^
o NPS4
#workers: tip eevdf
1: 33.00 (0.00 pct) 30.00 (9.09 pct)
2: 40.00 (0.00 pct) 36.00 (10.00 pct)
4: 44.00 (0.00 pct) 41.00 (6.81 pct)
8: 73.00 (0.00 pct) 73.00 (0.00 pct)
16: 71.00 (0.00 pct) 71.00 (0.00 pct)
32: 111.00 (0.00 pct) 115.00 (-3.60 pct)
64: 217.00 (0.00 pct) 211.00 (2.76 pct)
128: 509.00 (0.00 pct) 553.00 (-8.64 pct) *
256: 44352.00 (0.00 pct) 26848.00 (39.46 pct) ^
512: 75392.00 (0.00 pct) 44352.00 (41.17 pct) ^
~~~~~~~~~~
~ tbench ~
~~~~~~~~~~
o NPS1
Clients: tip eevdf
1 483.10 (0.00 pct) 476.46 (-1.37 pct)
2 956.03 (0.00 pct) 943.12 (-1.35 pct)
4 1786.36 (0.00 pct) 1760.64 (-1.43 pct)
8 3304.47 (0.00 pct) 3105.19 (-6.03 pct)
16 5440.44 (0.00 pct) 5609.24 (3.10 pct)
32 10462.02 (0.00 pct) 10416.02 (-0.43 pct)
64 18995.99 (0.00 pct) 19317.34 (1.69 pct)
128 27896.44 (0.00 pct) 28459.38 (2.01 pct)
256 49742.89 (0.00 pct) 46371.44 (-6.77 pct) *
512 49583.01 (0.00 pct) 45717.22 (-7.79 pct) *
1024 48467.75 (0.00 pct) 43475.31 (-10.30 pct) *
o NPS2
Clients: tip eevdf
1 472.57 (0.00 pct) 475.35 (0.58 pct)
2 938.27 (0.00 pct) 942.19 (0.41 pct)
4 1764.34 (0.00 pct) 1783.50 (1.08 pct)
8 3043.57 (0.00 pct) 3205.85 (5.33 pct)
16 5103.53 (0.00 pct) 5154.94 (1.00 pct)
32 9767.22 (0.00 pct) 9793.81 (0.27 pct)
64 18712.65 (0.00 pct) 18601.10 (-0.59 pct)
128 27691.95 (0.00 pct) 27542.57 (-0.53 pct)
256 47939.24 (0.00 pct) 43401.62 (-9.46 pct) *
512 47843.70 (0.00 pct) 43971.16 (-8.09 pct) *
1024 48412.05 (0.00 pct) 42808.58 (-11.57 pct) *
o NPS4
Clients: tip eevdf
1 486.74 (0.00 pct) 484.88 (-0.38 pct)
2 950.50 (0.00 pct) 950.04 (-0.04 pct)
4 1778.58 (0.00 pct) 1796.03 (0.98 pct)
8 3106.36 (0.00 pct) 3180.09 (2.37 pct)
16 5139.81 (0.00 pct) 5139.50 (0.00 pct)
32 9911.04 (0.00 pct) 10086.37 (1.76 pct)
64 18201.46 (0.00 pct) 18289.40 (0.48 pct)
128 27284.67 (0.00 pct) 26947.19 (-1.23 pct)
256 46793.72 (0.00 pct) 43971.87 (-6.03 pct) *
512 48841.96 (0.00 pct) 44255.01 (-9.39 pct) *
1024 48811.99 (0.00 pct) 43118.99 (-11.66 pct) *
~~~~~~~~~~
~ stream ~
~~~~~~~~~~
o NPS1
- 10 Runs:
Test: tip eevdf
Copy: 321229.54 (0.00 pct) 332975.45 (3.65 pct)
Scale: 207471.32 (0.00 pct) 212534.83 (2.44 pct)
Add: 234962.15 (0.00 pct) 243011.39 (3.42 pct)
Triad: 246256.00 (0.00 pct) 256453.73 (4.14 pct)
- 100 Runs:
Test: tip eevdf
Copy: 332714.94 (0.00 pct) 333183.42 (0.14 pct)
Scale: 216140.84 (0.00 pct) 212160.53 (-1.84 pct)
Add: 239605.00 (0.00 pct) 233168.69 (-2.68 pct)
Triad: 258580.84 (0.00 pct) 256972.33 (-0.62 pct)
o NPS2
- 10 Runs:
Test: tip eevdf
Copy: 324423.92 (0.00 pct) 340685.20 (5.01 pct)
Scale: 215993.56 (0.00 pct) 217895.31 (0.88 pct)
Add: 250590.28 (0.00 pct) 257495.12 (2.75 pct)
Triad: 261284.44 (0.00 pct) 261373.49 (0.03 pct)
- 100 Runs:
Test: tip eevdf
Copy: 325993.72 (0.00 pct) 341244.18 (4.67 pct)
Scale: 227201.27 (0.00 pct) 227255.98 (0.02 pct)
Add: 256601.84 (0.00 pct) 258026.75 (0.55 pct)
Triad: 260222.19 (0.00 pct) 269878.75 (3.71 pct)
o NPS4
- 10 Runs:
Test: tip eevdf
Copy: 356850.80 (0.00 pct) 371230.27 (4.02 pct)
Scale: 247219.39 (0.00 pct) 237846.20 (-3.79 pct)
Add: 268588.78 (0.00 pct) 261088.54 (-2.79 pct)
Triad: 272932.59 (0.00 pct) 284068.07 (4.07 pct)
- 100 Runs:
Test: tip eevdf
Copy: 365965.18 (0.00 pct) 371186.97 (1.42 pct)
Scale: 246068.58 (0.00 pct) 245991.10 (-0.03 pct)
Add: 263677.73 (0.00 pct) 269021.14 (2.02 pct)
Triad: 273701.36 (0.00 pct) 280566.44 (2.50 pct)
~~~~~~~~~~~~~
~ Unixbench ~
~~~~~~~~~~~~~
o NPS1
Test Metric Parallelism tip eevdf
unixbench-dhry2reg Hmean unixbench-dhry2reg-1 49077561.21 ( 0.00%) 49144835.64 ( 0.14%)
unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6285373890.61 ( 0.00%) 6270537933.92 ( -0.24%)
unixbench-syscall Amean unixbench-syscall-1 2664815.40 ( 0.00%) 2679289.17 * -0.54%*
unixbench-syscall Amean unixbench-syscall-512 7848462.70 ( 0.00%) 7456802.37 * 4.99%*
unixbench-pipe Hmean unixbench-pipe-1 2531131.89 ( 0.00%) 2475863.05 * -2.18%*
unixbench-pipe Hmean unixbench-pipe-512 305171024.40 ( 0.00%) 301182156.60 ( -1.31%)
unixbench-spawn Hmean unixbench-spawn-1 4058.05 ( 0.00%) 4284.38 * 5.58%*
unixbench-spawn Hmean unixbench-spawn-512 79893.24 ( 0.00%) 78234.45 * -2.08%*
unixbench-execl Hmean unixbench-execl-1 4148.64 ( 0.00%) 4086.73 * -1.49%*
unixbench-execl Hmean unixbench-execl-512 11077.20 ( 0.00%) 11137.79 ( 0.55%)
o NPS2
Test Metric Parallelism tip eevdf
unixbench-dhry2reg Hmean unixbench-dhry2reg-1 49394822.56 ( 0.00%) 49175574.26 ( -0.44%)
unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6267817215.36 ( 0.00%) 6282838979.08 * 0.24%*
unixbench-syscall Amean unixbench-syscall-1 2663675.03 ( 0.00%) 2677018.53 * -0.50%*
unixbench-syscall Amean unixbench-syscall-512 7342392.90 ( 0.00%) 7443264.00 * -1.37%*
unixbench-pipe Hmean unixbench-pipe-1 2533194.04 ( 0.00%) 2475969.01 * -2.26%*
unixbench-pipe Hmean unixbench-pipe-512 303588239.03 ( 0.00%) 302217597.98 * -0.45%*
unixbench-spawn Hmean unixbench-spawn-1 5141.40 ( 0.00%) 4862.78 ( -5.42%) *
unixbench-spawn Hmean unixbench-spawn-512 82993.79 ( 0.00%) 79139.42 * -4.64%* *
unixbench-execl Hmean unixbench-execl-1 4140.15 ( 0.00%) 4084.20 * -1.35%*
unixbench-execl Hmean unixbench-execl-512 12229.25 ( 0.00%) 11445.22 ( -6.41%) *
o NPS4
Test Metric Parallelism tip eevdf
unixbench-dhry2reg Hmean unixbench-dhry2reg-1 48970677.27 ( 0.00%) 49070289.56 ( 0.20%)
unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6297506696.81 ( 0.00%) 6311038905.07 ( 0.21%)
unixbench-syscall Amean unixbench-syscall-1 2664715.13 ( 0.00%) 2677752.20 * -0.49%*
unixbench-syscall Amean unixbench-syscall-512 7938670.70 ( 0.00%) 7972291.60 ( -0.42%)
unixbench-pipe Hmean unixbench-pipe-1 2527605.54 ( 0.00%) 2476140.77 * -2.04%*
unixbench-pipe Hmean unixbench-pipe-512 305068507.23 ( 0.00%) 304114548.50 ( -0.31%)
unixbench-spawn Hmean unixbench-spawn-1 5207.34 ( 0.00%) 4964.39 ( -4.67%) *
unixbench-spawn Hmean unixbench-spawn-512 81352.38 ( 0.00%) 74467.00 * -8.46%* *
unixbench-execl Hmean unixbench-execl-1 4131.37 ( 0.00%) 4044.09 * -2.11%*
unixbench-execl Hmean unixbench-execl-512 13025.56 ( 0.00%) 11124.77 * -14.59%* *
~~~~~~~~~~~
~ netperf ~
~~~~~~~~~~~
o NPS1
tip eevdf
1-clients: 107932.22 (0.00 pct) 106167.39 (-1.63 pct)
2-clients: 106887.99 (0.00 pct) 105304.25 (-1.48 pct)
4-clients: 106676.11 (0.00 pct) 104328.10 (-2.20 pct)
8-clients: 98645.45 (0.00 pct) 94076.26 (-4.63 pct)
16-clients: 88881.23 (0.00 pct) 86831.85 (-2.30 pct)
32-clients: 86654.28 (0.00 pct) 86313.80 (-0.39 pct)
64-clients: 81431.90 (0.00 pct) 74885.75 (-8.03 pct)
128-clients: 55993.77 (0.00 pct) 55378.10 (-1.09 pct)
256-clients: 43865.59 (0.00 pct) 44326.30 (1.05 pct)
o NPS2
tip eevdf
1-clients: 106711.81 (0.00 pct) 108576.27 (1.74 pct)
2-clients: 106987.79 (0.00 pct) 108348.24 (1.27 pct)
4-clients: 105275.37 (0.00 pct) 105702.12 (0.40 pct)
8-clients: 103028.31 (0.00 pct) 96250.20 (-6.57 pct)
16-clients: 87382.43 (0.00 pct) 87683.29 (0.34 pct)
32-clients: 86578.14 (0.00 pct) 86968.29 (0.45 pct)
64-clients: 81470.63 (0.00 pct) 75906.15 (-6.83 pct)
128-clients: 54803.35 (0.00 pct) 55051.90 (0.45 pct)
256-clients: 42910.29 (0.00 pct) 44062.33 (2.68 pct)
~~~~~~~~~~~
~ SpecJBB ~
~~~~~~~~~~~
o NPS1
tip eevdf
Max-jOPS 100% 115.71% (+15.71%) ^
Critical-jOPS 100% 93.59% (-6.41%) *
~~~~~~~~~~~~~~~~~~
~ DeathStarBench ~
~~~~~~~~~~~~~~~~~~
o NPS1
#CCX 1 CCX 2 CCX 3 CCX 4 CCX
o eevdf compared to tip -10.93 -14.35 -9.74 -6.07
o eevdf prev (this series as is)
compared to tip -1.99 -6.64 -4.99 -3.87
Note: #CCX is the number of LLCs the services are pinned to.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ Some Preliminary Analysis ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tl;dr
- There seems to be an increase in the number of involuntary context
switches when the system is overloaded. This probably allows newly waking
tasks to make progress, benefiting latency-sensitive workloads like
schbench in the overloaded scenario compared to tip, but it hurts tbench
performance.
When the system is fully loaded, the larger average wait time seems to
hurt schbench performance.
More analysis is needed to get to the bottom of the problem.
- For the hackbench 2-groups scenario, the wait time seems to go up
drastically.
Scheduler statistics of interest are listed in detail below.
Note: All time-related metrics are in ms. They are processed from the
per-task schedstats in /proc/<pid>/sched.
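The aggregation itself is mechanical; below is a minimal sketch of how the "key : value" lines of /proc/<pid>/sched dumps might be parsed and summed. The field names follow the kernel's schedstat output, but the helper functions and the sample text are mine, not from the report:

```python
from collections import defaultdict

def parse_sched(text):
    """Parse the 'key : value' lines of one /proc/<pid>/sched dump."""
    stats = {}
    for line in text.splitlines():
        if ':' not in line:
            continue
        key, _, value = line.partition(':')
        try:
            stats[key.strip()] = float(value.strip())
        except ValueError:
            pass  # skip non-numeric lines such as the comm/pid banner
    return stats

def aggregate(dumps, fields):
    """Sum the requested fields across a list of parsed dumps."""
    totals = defaultdict(float)
    for d in dumps:
        for f in fields:
            totals[f] += d.get(f, 0.0)
    return dict(totals)

sample = """\
se.sum_exec_runtime              :         288481.150000
nr_switches                      :              4897740
nr_voluntary_switches            :              4742512
nr_involuntary_switches          :               155228
"""
d = parse_sched(sample)
print(aggregate([d, d], ["nr_switches"]))   # {'nr_switches': 9795480.0}
```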
o Hackbench (2 Groups) (NPS1)
tip eevdf %diff
Comm sched-messaging sched-messaging N/A
Sum of avg_atom 282.0024818 19.04355233 -93.24702669
Average of avg_atom 3.481512121 0.235105584 -93.24702669
Sum of avg_per_cpu 1761.949461 61.52537145 -96.50810805
Average of avg_per_cpu 21.75246248 0.759572487 -96.50810805
Average of avg_wait_time 0.007239228 0.012899105 78.18343632
Sum of nr_switches 4897740 4728784 -3.449672706
Sum of nr_voluntary_switches 4742512 4621606 -2.549408415
Sum of nr_involuntary_switches 155228 107178 -30.95446698
Sum of nr_wakeups 4742648 4623175 -2.51912012
Sum of nr_migrations 1263925 930600 -26.37221354
Sum of sum_exec_runtime 288481.15 262255.2574 -9.091024712
Sum of sum_idle_runtime 2576164.568 2851759.68 10.69788457
Sum of sum_sleep_runtime 76890.14753 78632.31679 2.265789982
Sum of wait_count 4897894 4728939 -3.449543824
Sum of wait_sum 3041.78227 24167.4694 694.5167422
o schbench (2 messengers, 128 workers - fully loaded) (NPS1)
tip eevdf %diff
Comm schbench schbench N/A
Sum of avg_atom 7538.162897 7289.565705 -3.297848503
Average of avg_atom 29.10487605 28.14504133 -3.297848503
Sum of avg_per_cpu 630248.6079 471215.3671 -25.23341406
Average of avg_per_cpu 2433.392309 1819.364352 -25.23341406
Average of avg_wait_time 0.054147456 25.34304285 46703.75524
Sum of nr_switches 85210 88176 3.480812111
Sum of nr_voluntary_switches 83165 83457 0.351109241
Sum of nr_involuntary_switches 2045 4719 130.7579462
Sum of nr_wakeups 83168 83459 0.34989419
Sum of nr_migrations 3265 3025 -7.350689127
Sum of sum_exec_runtime 2476504.52 2469058.164 -0.300680129
Sum of sum_idle_runtime 110294825.8 132520924.2 20.15153321
Sum of sum_sleep_runtime 5293337.741 5297778.714 0.083897408
Sum of sum_block_runtime 56.043253 15.12936 -73.00413664
Sum of wait_count 85615 88606 3.493546692
Sum of wait_sum 4653.340163 9605.221964 106.4156418
o schbench (2 messengers, 256 workers - overloaded) (NPS1)
tip eevdf %diff
Comm schbench schbench N/A
Sum of avg_atom 11676.77306 4803.485728 -58.8629007
Average of avg_atom 22.67334574 9.327156753 -58.8629007
Sum of avg_per_cpu 55235.68013 38286.47722 -30.68524343
Average of avg_per_cpu 107.2537478 74.34267421 -30.68524343
Average of avg_wait_time 2.23189096 2.58191945 15.68304621
Sum of nr_switches 202862 425258 109.6292061
Sum of nr_voluntary_switches 163079 165058 1.213522281
Sum of nr_involuntary_switches 39783 260200 554.0482115
Sum of nr_wakeups 163082 165058 1.211660392
Sum of nr_migrations 44199 54894 24.19738003
Sum of sum_exec_runtime 4586675.667 3963846.024 -13.57910801
Sum of sum_idle_runtime 201050644.2 195126863.7 -2.946412087
Sum of sum_sleep_runtime 10418117.66 10402686.4 -0.148119407
Sum of sum_block_runtime 1548.979156 516.115078 -66.68030838
Sum of wait_count 203377 425792 109.3609405
Sum of wait_sum 455609.3122 1100885.201 141.6292142
o tbench (256 clients - overloaded) (NPS1)
- tbench client
tip eevdf % diff
comm tbench tbench N/A
Sum of avg_atom 3.594587941 5.112101854 42.21663064
Average of avg_atom 0.013986724 0.019891447 42.21663064
Sum of avg_per_cpu 392838.0975 142065.4206 -63.83613975
Average of avg_per_cpu 1528.552909 552.7837377 -63.83613975
Average of avg_wait_time 0.010512441 0.006861579 -34.72895916
Sum of nr_switches 692845080 511780111 -26.1335433
Sum of nr_voluntary_switches 178151085 371234907 108.3820635
Sum of nr_involuntary_switches 514693995 140545204 -72.69344399
Sum of nr_wakeups 178151085 371234909 108.3820646
Sum of nr_migrations 45279 71177 57.19649286
Sum of sum_exec_runtime 9192343.465 9624025.792 4.69610746
Sum of sum_idle_runtime 7125370.721 16145736.39 126.5950365
Sum of sum_sleep_runtime 2222469.726 5792868.629 160.650058
Sum of sum_block_runtime 68.60879 446.080476 550.1797743
Sum of wait_count 692845479 511780543 -26.13352349
Sum of wait_sum 7287852.246 3297894.139 -54.7480653
- tbench server
tip eevdf % diff
Comm tbench_srv tbench_srv N/A
Sum of avg_atom 5.077837807 5.447267364 7.275331971
Average of avg_atom 0.019758124 0.021195593 7.275331971
Sum of avg_per_cpu 538586.1634 87925.51225 -83.67475471
Average of avg_per_cpu 2095.666006 342.1226158 -83.67475471
Average of avg_wait_time 0.000827346 0.006505748 686.3392261
Sum of nr_switches 692980666 511838912 -26.13951051
Sum of nr_voluntary_switches 690367607 390304935 -43.46418762
Sum of nr_involuntary_switches 2613059 121533977 4551.023073
Sum of nr_wakeups 690367607 390304935 -43.46418762
Sum of nr_migrations 39486 84474 113.9340526
Sum of sum_exec_runtime 9176708.278 8734423.401 -4.819646259
Sum of sum_idle_runtime 413900.3645 447180.3879 8.040588086
Sum of sum_sleep_runtime 8966201.976 6690818.107 -25.37734345
Sum of sum_block_runtime 1.776413 1.617435 -8.949382829
Sum of wait_count 692980942 511839229 -26.13949418
Sum of wait_sum 565739.6984 3295519.077 482.5150836
>
> The nice -19 numbers aren't as pretty as Vincent's, but at the end I was going
> cross-eyed from staring at tree prints and I just couldn't figure out where it
> was going side-ways.
>
> There's definitely more benchmarking/tweaking to be done (0-day already
> reported a stress-ng loss), but if we can pull this off we can delete a whole
> bunch of icky heuristics code. EEVDF is a much better defined policy than what
> we currently have.
>
DeathStarBench and SpecJBB are slightly more complex to analyze. I'll
get the schedstat data for both soon. I'll rerun some of the above
workloads with NO_PRESERVE_LAG to see if that makes any difference.
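For reference, a sched_feat like PRESERVE_LAG is normally toggled at runtime through debugfs. A hedged sketch (assumes CONFIG_SCHED_DEBUG; the exact path has moved between kernel versions, so verify it on the test machine; the effectful lines are left commented out):

```shell
# Location of the scheduler feature flags (older kernels expose
# /sys/kernel/debug/sched_features instead).
FEAT=/sys/kernel/debug/sched/features

# cat "$FEAT"                     # list the currently active flags
# echo NO_PRESERVE_LAG > "$FEAT"  # disable lag preservation
# echo PRESERVE_LAG > "$FEAT"     # restore the default
```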
In the meantime, if you need more data from the test system for any
particular workload, please do let me know. I will collect the per-task
and system-wide schedstat data for the workload, as it is rather
inexpensive to collect and gives good insights, but if you need any
other data, I'll be more than happy to get those too for analysis.
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 00/10] sched: EEVDF using latency-nice
2023-03-22 6:49 ` K Prateek Nayak
@ 2023-03-22 9:38 ` K Prateek Nayak
0 siblings, 0 replies; 35+ messages in thread
From: K Prateek Nayak @ 2023-03-22 9:38 UTC (permalink / raw)
To: Peter Zijlstra, mingo, vincent.guittot
Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, bristot, corbet, qyousef, chris.hyser, patrick.bellasi,
pjt, pavel, qperret, tim.c.chen, joshdon, timj, yu.c.chen,
youssefesmat, joel
Hello Peter,
One important detail I forgot to mention: When I picked eevdf commits
from your tree
(https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=sched/core),
they were based on v6.3-rc1 with the sched/eevdf HEAD at:
commit: 0dddbc0b54ad ("sched/fair: Implement an EEVDF like policy")
On 3/22/2023 12:19 PM, K Prateek Nayak wrote:
> Hello Peter,
>
> Leaving some results from my testing on a dual socket Zen3 machine
> (2 x 64C/128T) below.
>
> tl;dr
>
> o I've not tested workloads with nice and latency nice yet focusing more
> on the out of the box performance. No changes to sched_feat were made
> for the same reason.
>
> o Except for hackbench (an m:n communication relationship), I do not see any
> regression for other standard benchmarks (mostly a 1:1 or 1:n relation)
> when the system is below fully loaded.
>
> o In the fully loaded scenario, schbench seems to be unhappy. Looking at the
> data from /proc/<pid>/sched for the tasks with schedstats enabled,
> there is an increase in the number of context switches and the total wait
> sum. When the system is overloaded, things flip and the schbench tail
> latency improves drastically. I suspect the involuntary
> context-switches help workers make progress much sooner after wakeup
> compared to tip, thus leading to lower tail latency.
>
> o For the same reason as above, tbench throughput takes a hit, with the
> number of involuntary context-switches increasing drastically for the
> tbench server. There is also an increase in the wait sum.
>
> o Couple of real world workloads were also tested. DeathStarBench
> throughput tanks much more with the updated version in your tree
> compared to this series as is.
> SpecJBB Max-jOPS sees large improvements but comes at a cost of
> drop in Critical-jOPS signifying an increase in either wait time
> or an increase in involuntary context-switches which can lead to
> transactions taking longer to complete.
>
> o Apart from DeathStarBench, all the trends reported remain the same when
> comparing the version in your tree and this series, as is, applied
> on the same base kernel.
>
> I'll leave the detailed results below and some limited analysis.
>
> On 3/6/2023 6:55 PM, Peter Zijlstra wrote:
>> Hi!
>>
>> Ever since looking at the latency-nice patches, I've wondered if EEVDF would
>> not make more sense, and I did point Vincent at some older patches I had for
>> that (which is here his augmented rbtree thing comes from).
>>
>> Also, since I really dislike the dual tree, I also figured we could dynamically
>> switch between an augmented tree and not (and while I have code for that,
>> that's not included in this posting because with the current results I don't
>> think we actually need this).
>>
>> Anyway, since I'm somewhat under the weather, I spent last week desperately
>> trying to connect a small cluster of neurons in defiance of the snot overlord
>> and bring back the EEVDF patches from the dark crypts where they'd been
>> gathering cobwebs for the past 13 odd years.
>>
>> By friday they worked well enough, and this morning (because obviously I forgot
>> the weekend is ideal to run benchmarks) I ran a bunch of hackbench, netperf,
>> tbench and sysbench -- there's a bunch of wins and losses, but nothing that
>> indicates a total fail.
>>
>> ( in fact, some of the schbench results seem to indicate EEVDF schedules a lot
>> more consistent than CFS and has a bunch of latency wins )
>>
>> ( hackbench also doesn't show the augmented tree and generally more expensive
>> pick to be a loss, in fact it shows a slight win here )
>>
>>
>> hackbench load + cyclictest --policy other results:
>>
>>
>> EEVDF CFS
>>
>> # Min Latencies: 00053
>> LNICE(19) # Avg Latencies: 04350
>> # Max Latencies: 76019
>>
>> # Min Latencies: 00052 00053
>> LNICE(0) # Avg Latencies: 00690 00687
>> # Max Latencies: 14145 13913
>>
>> # Min Latencies: 00019
>> LNICE(-19) # Avg Latencies: 00261
>> # Max Latencies: 05642
>>
>
> Following are the results from testing the series on a dual socket
> Zen3 machine (2 x 64C/128T):
>
> NPS Modes are used to logically divide single socket into
> multiple NUMA region.
> Following is the NUMA configuration for each NPS mode on the system:
>
> NPS1: Each socket is a NUMA node.
> Total 2 NUMA nodes in the dual socket machine.
>
> Node 0: 0-63, 128-191
> Node 1: 64-127, 192-255
>
> NPS2: Each socket is further logically divided into 2 NUMA regions.
> Total 4 NUMA nodes exist over 2 sockets.
>
> Node 0: 0-31, 128-159
> Node 1: 32-63, 160-191
> Node 2: 64-95, 192-223
> Node 3: 96-127, 224-255
>
> NPS4: Each socket is logically divided into 4 NUMA regions.
> Total 8 NUMA nodes exist over 2 sockets.
>
> Node 0: 0-15, 128-143
> Node 1: 16-31, 144-159
> Node 2: 32-47, 160-175
> Node 3: 48-63, 176-191
> Node 4: 64-79, 192-207
> Node 5: 80-95, 208-223
> Node 6: 96-111, 224-239
> Node 7: 112-127, 240-255
>
> Kernel versions:
> - tip: 6.2.0-rc6 tip sched/core
> - eevdf: 6.2.0-rc6 tip sched/core
> + eevdf commits from your tree
> (https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/eevdf)
I had cherry-picked the following commits for eevdf:
commit: b84a8f6b6fa3 ("sched: Introduce latency-nice as a per-task attribute")
commit: eea7fc6f13b4 ("sched/core: Propagate parent task's latency requirements to the child task")
commit: a143d2bcef65 ("sched: Allow sched_{get,set}attr to change latency_nice of the task")
commit: d9790468df14 ("sched/fair: Add latency_offset")
commit: 3d4d37acaba4 ("sched/fair: Add sched group latency support")
commit: 707840ffc8fa ("sched/fair: Add avg_vruntime")
commit: 394af9db316b ("sched/fair: Remove START_DEBIT")
commit: 89b2a2ee0e9d ("sched/fair: Add lag based placement")
commit: e3db9631d8ca ("rbtree: Add rb_add_augmented_cached() helper")
commit: 0dddbc0b54ad ("sched/fair: Implement an EEVDF like policy")
from the sched/eevdf branch in your tree onto the tip branch back when
I started testing. I notice some more changes have been added since then.
Queuing testing of latest changes on the updated tip:sched/core based
on v6.3-rc3. I was able to cherry pick the latest commits from
sched/eevdf cleanly.
>
> - eevdf prev: 6.2.0-rc6 tip sched/core + this series as is
>
> When the testing started, the tip was at:
> commit 7c4a5b89a0b5 "sched/rt: pick_next_rt_entity(): check list_entry"
> [..snip..]
>
--
Thanks and Regards,
Prateek
* Re: [PATCH 00/10] sched: EEVDF using latency-nice
2023-03-06 13:25 [PATCH 00/10] sched: EEVDF using latency-nice Peter Zijlstra
` (12 preceding siblings ...)
2023-03-22 6:49 ` K Prateek Nayak
@ 2023-03-23 11:53 ` Pavel Machek
13 siblings, 0 replies; 35+ messages in thread
From: Pavel Machek @ 2023-03-23 11:53 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
dietmar.eggemann, rostedt, bsegall, mgorman, bristot, corbet,
qyousef, chris.hyser, patrick.bellasi, pjt, qperret, tim.c.chen,
joshdon, timj, kprateek.nayak, yu.c.chen, youssefesmat, joel
Hi!
> Ever since looking at the latency-nice patches, I've wondered if EEVDF would
> not make more sense, and I did point Vincent at some older patches I had for
> that (which is here his augmented rbtree thing comes from).
Link for context: https://lwn.net/Articles/925371/ . "EEVDF" is not a
commonly known acronym :-).
BR, Pavel
--
People of Russia, stop Putin before his war on Ukraine escalates.