* [PATCH v4 0/8]  Add latency priority for CFS class
@ 2022-09-16  8:02 Vincent Guittot
  2022-09-16  8:02 ` [PATCH v4 1/8] sched: Introduce latency-nice as a per-task attribute Vincent Guittot
                   ` (8 more replies)
  0 siblings, 9 replies; 44+ messages in thread
From: Vincent Guittot @ 2022-09-16  8:02 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel, parth
  Cc: qais.yousef, chris.hyser, valentin.schneider, patrick.bellasi,
	David.Laight, pjt, pavel, tj, qperret, tim.c.chen, joshdon,
	Vincent Guittot

This patchset restarts the work on adding a latency priority to describe
the latency tolerance of cfs tasks.

The patches [1-4] have been done by Parth:
https://lore.kernel.org/lkml/20200228090755.22829-1-parth@linux.ibm.com/

I have just rebased them and moved the setting of the latency priority
outside of the priority update. I have removed the Reviewed-by tags because
the patches are two years old.

Patch [5] uses the latency nice priority to define a latency offset, which
is then used to decide whether a cfs task can preempt the currently running
task. The patch gives some test results with cyclictest and hackbench to
highlight the benefit of the latency priority for short interactive tasks
and long intensive tasks.

Patch [6] adds support for latency_offset in task groups by adding a
cpu.latency field. There were discussions on the previous version about
using a real unit for the field, so I decided to expose the latency
offset directly: it reflects the time up to which we can preempt when the
value is negative, or up to which we can defer the preemption when the
value is positive.
The range is [-sysctl_sched_latency:sysctl_sched_latency].
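
As an illustration, here is a minimal sketch of how an administrator could
set this offset for a group. It assumes a cgroup v2 hierarchy mounted at
/sys/fs/cgroup and an already existing group named "app" (both the mount
point and the group name are just examples), and that the value is given in
nanoseconds, the unit of sysctl_sched_latency:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	/* -10ms: allow preempting up to 10ms "earlier" than by default */
	long long latency_offset_ns = -10000000LL;
	FILE *f = fopen("/sys/fs/cgroup/app/cpu.latency", "w");

	if (!f) {
		perror("cpu.latency");
		return EXIT_FAILURE;
	}
	/* values outside the range above are rejected with -EINVAL */
	fprintf(f, "%lld\n", latency_offset_ns);
	fclose(f);
	return EXIT_SUCCESS;
}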

Patch [7] makes sched_core take the latency offset into account.

Patch [8] adds an rb tree to cover some corner cases where the latency
sensitive task is preempted by a high priority task (RT/DL) or fails to
preempt it. This patch ensures that such tasks will get at least a
slice of sched_min_granularity in priority at wakeup. The patch gives
results that show the benefit in addition to patch [5].

I have also backported the patchset onto a Dragonboard RB3 with an Android
mainline kernel based on v5.18 for a quick test. I have used
the TouchLatency app, which is part of AOSP and described as a very good
test to highlight jitter and jank frame sources of a system [1].
In addition to the app, I have added some short running tasks waking up
regularly (using the 8 CPUs for 4 ms every 37777 us) to stress the system
without overloading it (and disabling EAS). The first results show that the
patchset helps to reduce the missed deadline frames from 5% to less than
0.1% when the cpu.latency of the task groups is set.

[1] https://source.android.com/docs/core/debug/eval_perf#touchlatency

Change since v3:
- Fix 2 compilation issues raised by kernel test robot <lkp@intel.com>

Change since v2:
- Set a latency_offset field instead of saving a weight and computing it
  on the fly.
- Make latency_offset available for task group: cpu.latency
- Fix some corner cases to make latency sensitive tasks schedule first and
  add an rb tree for latency sensitive tasks.

Change since v1:
- fix typo
- move some code into the right patch to make bisect happy
- simplify and fix how the weight is computed
- add support for sched core in patch 7

Parth Shah (4):
  sched: Introduce latency-nice as a per-task attribute
  sched/core: Propagate parent task's latency requirements to the child
    task
  sched: Allow sched_{get,set}attr to change latency_nice of the task
  sched/core: Add permission checks for setting the latency_nice value

Vincent Guittot (4):
  sched/fair: Take into account latency priority at wakeup
  sched/fair: Add sched group latency support
  sched/core: support latency priority with sched core
  sched/fair: Add latency list

 include/linux/sched.h            |   5 +
 include/uapi/linux/sched.h       |   4 +-
 include/uapi/linux/sched/types.h |  19 ++++
 init/init_task.c                 |   1 +
 kernel/sched/core.c              |  81 +++++++++++++
 kernel/sched/debug.c             |   1 +
 kernel/sched/fair.c              | 189 ++++++++++++++++++++++++++++++-
 kernel/sched/sched.h             |  37 +++++-
 tools/include/uapi/linux/sched.h |   4 +-
 9 files changed, 333 insertions(+), 8 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v4 1/8] sched: Introduce latency-nice as a per-task attribute
  2022-09-16  8:02 [PATCH v4 0/8] Add latency priority for CFS class Vincent Guittot
@ 2022-09-16  8:02 ` Vincent Guittot
  2022-09-16  8:02 ` [PATCH v4 2/8] sched/core: Propagate parent task's latency requirements to the child task Vincent Guittot
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 44+ messages in thread
From: Vincent Guittot @ 2022-09-16  8:02 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel, parth
  Cc: qais.yousef, chris.hyser, valentin.schneider, patrick.bellasi,
	David.Laight, pjt, pavel, tj, qperret, tim.c.chen, joshdon,
	Vincent Guittot

From: Parth Shah <parth@linux.ibm.com>

Latency-nice indicates the latency requirements of a task with respect
to the other tasks in the system. The value of the attribute can be
within the range [-20, 19], both inclusive, in line with task nice
values.

latency_nice = -20 indicates that the task should get the lowest
latency, as compared to tasks having latency_nice = +19.

latency_nice only affects the CFS sched class, taking the latency
requirements from userspace.

Additionally, add debugging bits for newly added latency_nice attribute.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
[rebase]
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h |  1 +
 kernel/sched/debug.c  |  1 +
 kernel/sched/sched.h  | 18 ++++++++++++++++++
 3 files changed, 20 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 88b8817b827d..a0adb55efa1c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -773,6 +773,7 @@ struct task_struct {
 	int				static_prio;
 	int				normal_prio;
 	unsigned int			rt_priority;
+	int				latency_nice;
 
 	struct sched_entity		se;
 	struct sched_rt_entity		rt;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index bb3d63bdf4ae..a3f7876217a6 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1042,6 +1042,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 #endif
 	P(policy);
 	P(prio);
+	P(latency_nice);
 	if (task_has_dl_policy(p)) {
 		P(dl.runtime);
 		P(dl.deadline);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 74130a69d365..b927269b84f9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -124,6 +124,24 @@ extern int sched_rr_timeslice;
  */
 #define NS_TO_JIFFIES(TIME)	((unsigned long)(TIME) / (NSEC_PER_SEC / HZ))
 
+/*
+ * Latency nice is meant to provide scheduler hints about the relative
+ * latency requirements of a task with respect to other tasks.
+ * Thus a task with latency_nice == 19 can be hinted as the task with no
+ * latency requirements, in contrast to the task with latency_nice == -20
+ * which should be given priority in terms of lower latency.
+ */
+#define MAX_LATENCY_NICE	19
+#define MIN_LATENCY_NICE	-20
+
+#define LATENCY_NICE_WIDTH	\
+	(MAX_LATENCY_NICE - MIN_LATENCY_NICE + 1)
+
+/*
+ * Default tasks should be treated as a task with latency_nice = 0.
+ */
+#define DEFAULT_LATENCY_NICE	0
+
 /*
  * Increase resolution of nice-level calculations for 64-bit architectures.
  * The extra resolution improves shares distribution and load balancing of
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v4 2/8] sched/core: Propagate parent task's latency requirements to the child task
  2022-09-16  8:02 [PATCH v4 0/8] Add latency priority for CFS class Vincent Guittot
  2022-09-16  8:02 ` [PATCH v4 1/8] sched: Introduce latency-nice as a per-task attribute Vincent Guittot
@ 2022-09-16  8:02 ` Vincent Guittot
  2022-09-16  8:03 ` [PATCH v4 3/8] sched: Allow sched_{get,set}attr to change latency_nice of the task Vincent Guittot
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 44+ messages in thread
From: Vincent Guittot @ 2022-09-16  8:02 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel, parth
  Cc: qais.yousef, chris.hyser, valentin.schneider, patrick.bellasi,
	David.Laight, pjt, pavel, tj, qperret, tim.c.chen, joshdon,
	Vincent Guittot

From: Parth Shah <parth@linux.ibm.com>

Clone the parent task's latency_nice attribute to the forked child task.

Reset the latency_nice value to the default when the child task has
sched_reset_on_fork set.

Also, initialize init_task.latency_nice with the DEFAULT_LATENCY_NICE
value.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
[rebase]
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 init/init_task.c    | 1 +
 kernel/sched/core.c | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/init/init_task.c b/init/init_task.c
index 73cc8f03511a..225d11a39bc9 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -78,6 +78,7 @@ struct task_struct init_task
 	.prio		= MAX_PRIO - 20,
 	.static_prio	= MAX_PRIO - 20,
 	.normal_prio	= MAX_PRIO - 20,
+	.latency_nice	= DEFAULT_LATENCY_NICE,
 	.policy		= SCHED_NORMAL,
 	.cpus_ptr	= &init_task.cpus_mask,
 	.user_cpus_ptr	= NULL,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 603a80ec9b0e..3c5fb04f00e1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4511,6 +4511,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 	 */
 	p->prio = current->normal_prio;
 
+	/* Propagate the parent's latency requirements to the child as well */
+	p->latency_nice = current->latency_nice;
+
 	uclamp_fork(p);
 
 	/*
@@ -4527,6 +4530,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 		p->prio = p->normal_prio = p->static_prio;
 		set_load_weight(p, false);
 
+		p->latency_nice = DEFAULT_LATENCY_NICE;
 		/*
 		 * We don't need the reset flag anymore after the fork. It has
 		 * fulfilled its duty:
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v4 3/8] sched: Allow sched_{get,set}attr to change latency_nice of the task
  2022-09-16  8:02 [PATCH v4 0/8] Add latency priority for CFS class Vincent Guittot
  2022-09-16  8:02 ` [PATCH v4 1/8] sched: Introduce latency-nice as a per-task attribute Vincent Guittot
  2022-09-16  8:02 ` [PATCH v4 2/8] sched/core: Propagate parent task's latency requirements to the child task Vincent Guittot
@ 2022-09-16  8:03 ` Vincent Guittot
  2022-09-16  8:03 ` [PATCH v4 4/8] sched/core: Add permission checks for setting the latency_nice value Vincent Guittot
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 44+ messages in thread
From: Vincent Guittot @ 2022-09-16  8:03 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel, parth
  Cc: qais.yousef, chris.hyser, valentin.schneider, patrick.bellasi,
	David.Laight, pjt, pavel, tj, qperret, tim.c.chen, joshdon,
	Vincent Guittot

From: Parth Shah <parth@linux.ibm.com>

Introduce the latency_nice attribute in sched_attr and provide a
mechanism to change the value via the sched_setattr/sched_getattr
syscalls.

Also add a new flag, SCHED_FLAG_LATENCY_NICE, to signal a change of
the task's latency_nice on a sched_setattr syscall.
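
For illustration, here is a minimal user-space sketch (not part of the
patch) of how a task could request a latency_nice value with this
interface. The flag value and the extra field mirror the uapi changes
below; the local structure definition, the chosen latency_nice of -20 and
the direct syscall(2) invocation (glibc typically provides no
sched_setattr wrapper) are assumptions of the example:

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SCHED_FLAG_LATENCY_NICE	0x80

/* local copy of the VER2 layout of struct sched_attr */
struct sched_attr_v2 {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
	int32_t  sched_latency_nice;	/* new in SCHED_ATTR_SIZE_VER2 */
};

int main(void)
{
	struct sched_attr_v2 attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);	/* >= SCHED_ATTR_SIZE_VER2 (60) */
	attr.sched_policy = SCHED_OTHER;
	attr.sched_flags = SCHED_FLAG_LATENCY_NICE;
	/* request low latency; lowering the value needs CAP_SYS_NICE
	 * once the permission checks of a later patch are applied */
	attr.sched_latency_nice = -20;

	if (syscall(SYS_sched_setattr, 0 /* self */, &attr, 0))
		perror("sched_setattr");

	return 0;
}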

Signed-off-by: Parth Shah <parth@linux.ibm.com>
[rebase and add a dedicated __setscheduler_latency ]
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/uapi/linux/sched.h       |  4 +++-
 include/uapi/linux/sched/types.h | 19 +++++++++++++++++++
 kernel/sched/core.c              | 24 ++++++++++++++++++++++++
 tools/include/uapi/linux/sched.h |  4 +++-
 4 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 3bac0a8ceab2..b2e932c25be6 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -132,6 +132,7 @@ struct clone_args {
 #define SCHED_FLAG_KEEP_PARAMS		0x10
 #define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
 #define SCHED_FLAG_UTIL_CLAMP_MAX	0x40
+#define SCHED_FLAG_LATENCY_NICE		0x80
 
 #define SCHED_FLAG_KEEP_ALL	(SCHED_FLAG_KEEP_POLICY | \
 				 SCHED_FLAG_KEEP_PARAMS)
@@ -143,6 +144,7 @@ struct clone_args {
 			 SCHED_FLAG_RECLAIM		| \
 			 SCHED_FLAG_DL_OVERRUN		| \
 			 SCHED_FLAG_KEEP_ALL		| \
-			 SCHED_FLAG_UTIL_CLAMP)
+			 SCHED_FLAG_UTIL_CLAMP		| \
+			 SCHED_FLAG_LATENCY_NICE)
 
 #endif /* _UAPI_LINUX_SCHED_H */
diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
index f2c4589d4dbf..db1e8199e8c8 100644
--- a/include/uapi/linux/sched/types.h
+++ b/include/uapi/linux/sched/types.h
@@ -10,6 +10,7 @@ struct sched_param {
 
 #define SCHED_ATTR_SIZE_VER0	48	/* sizeof first published struct */
 #define SCHED_ATTR_SIZE_VER1	56	/* add: util_{min,max} */
+#define SCHED_ATTR_SIZE_VER2	60	/* add: latency_nice */
 
 /*
  * Extended scheduling parameters data structure.
@@ -98,6 +99,22 @@ struct sched_param {
  * scheduled on a CPU with no more capacity than the specified value.
  *
  * A task utilization boundary can be reset by setting the attribute to -1.
+ *
+ * Latency Tolerance Attributes
+ * ===========================
+ *
+ * A subset of sched_attr attributes allows specifying the relative latency
+ * requirements of a task with respect to the other tasks running/queued in the
+ * system.
+ *
+ * @ sched_latency_nice	task's latency_nice value
+ *
+ * The latency_nice of a task can have any value in a range of
+ * [MIN_LATENCY_NICE..MAX_LATENCY_NICE].
+ *
+ * A task with a latency_nice value of MIN_LATENCY_NICE can be
+ * considered as one requiring lower latency, as opposed to a task
+ * with a higher latency_nice value.
  */
 struct sched_attr {
 	__u32 size;
@@ -120,6 +137,8 @@ struct sched_attr {
 	__u32 sched_util_min;
 	__u32 sched_util_max;
 
+	/* latency requirement hints */
+	__s32 sched_latency_nice;
 };
 
 #endif /* _UAPI_LINUX_SCHED_TYPES_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3c5fb04f00e1..2015e7d1f8b2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7289,6 +7289,14 @@ static void __setscheduler_params(struct task_struct *p,
 	p->rt_priority = attr->sched_priority;
 	p->normal_prio = normal_prio(p);
 	set_load_weight(p, true);
+
+}
+
+static void __setscheduler_latency(struct task_struct *p,
+		const struct sched_attr *attr)
+{
+	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
+		p->latency_nice = attr->sched_latency_nice;
 }
 
 /*
@@ -7431,6 +7439,13 @@ static int __sched_setscheduler(struct task_struct *p,
 			return retval;
 	}
 
+	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
+		if (attr->sched_latency_nice > MAX_LATENCY_NICE)
+			return -EINVAL;
+		if (attr->sched_latency_nice < MIN_LATENCY_NICE)
+			return -EINVAL;
+	}
+
 	if (pi)
 		cpuset_read_lock();
 
@@ -7465,6 +7480,9 @@ static int __sched_setscheduler(struct task_struct *p,
 			goto change;
 		if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
 			goto change;
+		if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE &&
+		    attr->sched_latency_nice != p->latency_nice)
+			goto change;
 
 		p->sched_reset_on_fork = reset_on_fork;
 		retval = 0;
@@ -7553,6 +7571,7 @@ static int __sched_setscheduler(struct task_struct *p,
 		__setscheduler_params(p, attr);
 		__setscheduler_prio(p, newprio);
 	}
+	__setscheduler_latency(p, attr);
 	__setscheduler_uclamp(p, attr);
 
 	if (queued) {
@@ -7763,6 +7782,9 @@ static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *a
 	    size < SCHED_ATTR_SIZE_VER1)
 		return -EINVAL;
 
+	if ((attr->sched_flags & SCHED_FLAG_LATENCY_NICE) &&
+	    size < SCHED_ATTR_SIZE_VER2)
+		return -EINVAL;
 	/*
 	 * XXX: Do we want to be lenient like existing syscalls; or do we want
 	 * to be strict and return an error on out-of-bounds values?
@@ -8000,6 +8022,8 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
 	get_params(p, &kattr);
 	kattr.sched_flags &= SCHED_FLAG_ALL;
 
+	kattr.sched_latency_nice = p->latency_nice;
+
 #ifdef CONFIG_UCLAMP_TASK
 	/*
 	 * This could race with another potential updater, but this is fine
diff --git a/tools/include/uapi/linux/sched.h b/tools/include/uapi/linux/sched.h
index 3bac0a8ceab2..ecc4884bfe4b 100644
--- a/tools/include/uapi/linux/sched.h
+++ b/tools/include/uapi/linux/sched.h
@@ -132,6 +132,7 @@ struct clone_args {
 #define SCHED_FLAG_KEEP_PARAMS		0x10
 #define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
 #define SCHED_FLAG_UTIL_CLAMP_MAX	0x40
+#define SCHED_FLAG_LATENCY_NICE		0x80
 
 #define SCHED_FLAG_KEEP_ALL	(SCHED_FLAG_KEEP_POLICY | \
 				 SCHED_FLAG_KEEP_PARAMS)
@@ -143,6 +144,7 @@ struct clone_args {
 			 SCHED_FLAG_RECLAIM		| \
 			 SCHED_FLAG_DL_OVERRUN		| \
 			 SCHED_FLAG_KEEP_ALL		| \
-			 SCHED_FLAG_UTIL_CLAMP)
+			 SCHED_FLAG_UTIL_CLAMP		| \
+			 SCHED_FLAG_LATENCY_NICE)
 
 #endif /* _UAPI_LINUX_SCHED_H */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v4 4/8] sched/core: Add permission checks for setting the latency_nice value
  2022-09-16  8:02 [PATCH v4 0/8] Add latency priority for CFS class Vincent Guittot
                   ` (2 preceding siblings ...)
  2022-09-16  8:03 ` [PATCH v4 3/8] sched: Allow sched_{get,set}attr to change latency_nice of the task Vincent Guittot
@ 2022-09-16  8:03 ` Vincent Guittot
  2022-09-19  8:52   ` timj
  2022-09-16  8:03 ` [PATCH v4 5/8] sched/fair: Take into account latency priority at wakeup Vincent Guittot
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 44+ messages in thread
From: Vincent Guittot @ 2022-09-16  8:03 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel, parth
  Cc: qais.yousef, chris.hyser, valentin.schneider, patrick.bellasi,
	David.Laight, pjt, pavel, tj, qperret, tim.c.chen, joshdon,
	Vincent Guittot

From: Parth Shah <parth@linux.ibm.com>

Since latency_nice uses infrastructure similar to NICE, use the
already existing CAP_SYS_NICE security checks for latency_nice. This
returns -EPERM when a non-root user tries to set the task's
latency_nice value lower than its current value.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
[rebase]
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2015e7d1f8b2..3c79c5419d1b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7444,6 +7444,10 @@ static int __sched_setscheduler(struct task_struct *p,
 			return -EINVAL;
 		if (attr->sched_latency_nice < MIN_LATENCY_NICE)
 			return -EINVAL;
+		/* Use the same security checks as NICE */
+		if (attr->sched_latency_nice < p->latency_nice &&
+		    !capable(CAP_SYS_NICE))
+			return -EPERM;
 	}
 
 	if (pi)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v4 5/8] sched/fair: Take into account latency priority at wakeup
  2022-09-16  8:02 [PATCH v4 0/8] Add latency priority for CFS class Vincent Guittot
                   ` (3 preceding siblings ...)
  2022-09-16  8:03 ` [PATCH v4 4/8] sched/core: Add permission checks for setting the latency_nice value Vincent Guittot
@ 2022-09-16  8:03 ` Vincent Guittot
       [not found]   ` <20220916120245.2951-1-hdanton@sina.com>
  2022-09-19 10:05   ` Dietmar Eggemann
  2022-09-16  8:03 ` [PATCH v4 6/8] sched/fair: Add sched group latency support Vincent Guittot
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 44+ messages in thread
From: Vincent Guittot @ 2022-09-16  8:03 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel, parth
  Cc: qais.yousef, chris.hyser, valentin.schneider, patrick.bellasi,
	David.Laight, pjt, pavel, tj, qperret, tim.c.chen, joshdon,
	Vincent Guittot

Take into account the latency priority of a thread when deciding
whether to preempt the currently running thread. We don't want to
provide more CPU bandwidth to a thread, but reorder the scheduling so
that a latency sensitive task runs first whenever possible.

As long as a thread has not used up its bandwidth, it will be able to
preempt the current thread.

Conversely, a thread with a low latency priority will preempt the
current thread at wakeup only to keep fair CPU bandwidth sharing.
Otherwise it will wait for the tick to get its sched slice.

                                   curr vruntime
                                       |
                      sysctl_sched_wakeup_granularity
                                   <-->
----------------------------------|----|-----------------------|---------------
                                  |    |<--------------------->
                                  |    .  sysctl_sched_latency
                                  |    .
default/current latency entity    |    .
                                  |    .
1111111111111111111111111111111111|0000|-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-
se preempts curr at wakeup ------>|<- se doesn't preempt curr -----------------
                                  |    .
                                  |    .
                                  |    .
low latency entity                |    .
                                   ---------------------->|
                               % of sysctl_sched_latency  |
1111111111111111111111111111111111111111111111111111111111|0000|-1-1-1-1-1-1-1-
preempt ------------------------------------------------->|<- do not preempt --
                                  |    .
                                  |    .
                                  |    .
high latency entity               |    .
         |<-----------------------|    .
         | % of sysctl_sched_latency   .
111111111|0000|-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1
preempt->|<- se doesn't preempt curr ------------------------------------------
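
To make the mapping concrete, here is a small stand-alone sketch of the
arithmetic done by set_latency_offset() in this patch: latency_nice is
turned into an index into the sched_latency_to_weight[] table added below,
and the weight scales sysctl_sched_latency with a right shift by
NICE_LATENCY_SHIFT (SCHED_FIXEDPOINT_SHIFT, i.e. 10). The 24ms value of
sysctl_sched_latency used here is only an example, it depends on the
system:

#include <stdio.h>

#define NICE_LATENCY_SHIFT	10	/* SCHED_FIXEDPOINT_SHIFT */

/* same values as the sched_latency_to_weight[] table added below */
static const long long latency_to_weight[40] = {
 /* -20 */ -1024, -973, -922, -870, -819,
 /* -15 */  -768, -717, -666, -614, -563,
 /* -10 */  -512, -461, -410, -358, -307,
 /*  -5 */  -256, -205, -154, -102,  -51,
 /*   0 */     0,   51,  102,  154,  205,
 /*   5 */   256,  307,  358,  410,  461,
 /*  10 */   512,  563,  614,  666,  717,
 /*  15 */   768,  819,  870,  922,  973,
};

int main(void)
{
	long long sysctl_sched_latency = 24000000LL;	/* 24ms, example only */
	int latency_nice;

	for (latency_nice = -20; latency_nice <= 19; latency_nice += 13) {
		long long weight = latency_to_weight[latency_nice + 20];
		/* same expression as set_latency_offset() in the hunk below */
		long long offset = (sysctl_sched_latency * weight) >> NICE_LATENCY_SHIFT;

		/* -20 maps to -sysctl_sched_latency, +19 to ~0.95 * sysctl_sched_latency */
		printf("latency_nice %3d -> latency_offset %9lld ns\n",
		       latency_nice, offset);
	}
	return 0;
}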

Test results of the nice latency impact on a heavy load like hackbench:

hackbench -l (2560 / group) -g group
group        latency 0             latency 19
1            1.350(+/- 1.10%)      1.287(+/- 1.60%) + 5%
4            1.392(+/- 1.67%)      1.248(+/- 1.51%) +11%
8            1.294(+/- 1.56%)      1.254(+/- 1.96%) + 3%
16           1.315(+/- 1.02%)      1.288(+/- 1.53%) + 2%

hackbench -p -l (2560 / group) -g group
group
1            1.768(+/- 13%)      0.805(+/- 2%) +54%
4            1.634(+/- 13%)      0.765(+/- 1%) +53%
8            1.305(+/-  4%)      0.745(+/- 2%) +43%
16           0.786(+/-  4%)      0.705(+/- 2%) +10%

By decreasing the latency prio, we reduce the number of preemptions at
wakeup and help hackbench make progress.

Test results of the nice latency impact on a short-lived load like
cyclictest while competing with a heavy load like hackbench:

hackbench -l 10000 -g $group &
cyclictest --policy other -D 5 -q -n
        latency 0           latency -20
group   min  avg    max     min  avg    max
0       17    18     28      16   18     28
1       67   319   7369      63   76   2479
4       64   453  51699      45   95   8788
8       65   814  51699      63  116  19535
16      64  1275  37744      63  159  51134

group = 0 means that hackbench is not running.

The avg is significantly improved with nice latency -20, especially with a
large number of groups, but min and max remain quite similar. If we add the
histogram parameter to get the latency details, we have:

hackbench -l 10000 -g 16 &
cyclictest --policy other -D 5 -q -n  -H 20000 --histfile data.txt
              latency 0    latency -20
Min Latencies:    63           63
Avg Latencies:  1459          214
Max Latencies: 58274        82358
50% latencies:   164           87
75% latencies:   848           91
85% latencies:  1246           94
90% latencies:  2149           96
95% latencies:  8463           99
99% latencies:>20000          120

With the percentile details, we see the benefit of nice latency -20: only
1% of the latencies stay above 120us, whereas with the default latency 25%
are above 848us, 10% over 2ms and 1% above 20ms.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h |  4 +++-
 init/init_task.c      |  2 +-
 kernel/sched/core.c   | 39 ++++++++++++++++++++++++------
 kernel/sched/debug.c  |  2 +-
 kernel/sched/fair.c   | 56 +++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h  | 14 ++++++++++-
 6 files changed, 104 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a0adb55efa1c..2d2a361d65ee 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -557,6 +557,8 @@ struct sched_entity {
 	/* cached value of my_q->h_nr_running */
 	unsigned long			runnable_weight;
 #endif
+	/* preemption offset in ns */
+	long				latency_offset;
 
 #ifdef CONFIG_SMP
 	/*
@@ -773,7 +775,7 @@ struct task_struct {
 	int				static_prio;
 	int				normal_prio;
 	unsigned int			rt_priority;
-	int				latency_nice;
+	int				latency_prio;
 
 	struct sched_entity		se;
 	struct sched_rt_entity		rt;
diff --git a/init/init_task.c b/init/init_task.c
index 225d11a39bc9..e98c71f24981 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -78,7 +78,7 @@ struct task_struct init_task
 	.prio		= MAX_PRIO - 20,
 	.static_prio	= MAX_PRIO - 20,
 	.normal_prio	= MAX_PRIO - 20,
-	.latency_nice	= DEFAULT_LATENCY_NICE,
+	.latency_prio	= NICE_WIDTH - 20,
 	.policy		= SCHED_NORMAL,
 	.cpus_ptr	= &init_task.cpus_mask,
 	.user_cpus_ptr	= NULL,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3c79c5419d1b..13cf794708ee 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1288,6 +1288,13 @@ static void set_load_weight(struct task_struct *p, bool update_load)
 	}
 }
 
+static void set_latency_offset(struct task_struct *p)
+{
+	long weight = sched_latency_to_weight[p->latency_prio];
+
+	p->se.latency_offset = (sysctl_sched_latency * weight) >> NICE_LATENCY_SHIFT;
+}
+
 #ifdef CONFIG_UCLAMP_TASK
 /*
  * Serializes updates of utilization clamp values
@@ -4512,7 +4519,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->prio = current->normal_prio;
 
 	/* Propagate the parent's latency requirements to the child as well */
-	p->latency_nice = current->latency_nice;
+	p->latency_prio = current->latency_prio;
 
 	uclamp_fork(p);
 
@@ -4530,7 +4537,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 		p->prio = p->normal_prio = p->static_prio;
 		set_load_weight(p, false);
 
-		p->latency_nice = DEFAULT_LATENCY_NICE;
+		p->latency_prio = NICE_TO_LATENCY(0);
+		set_latency_offset(p);
+
 		/*
 		 * We don't need the reset flag anymore after the fork. It has
 		 * fulfilled its duty:
@@ -7295,8 +7304,10 @@ static void __setscheduler_params(struct task_struct *p,
 static void __setscheduler_latency(struct task_struct *p,
 		const struct sched_attr *attr)
 {
-	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
-		p->latency_nice = attr->sched_latency_nice;
+	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
+		p->latency_prio = NICE_TO_LATENCY(attr->sched_latency_nice);
+		set_latency_offset(p);
+	}
 }
 
 /*
@@ -7445,7 +7456,7 @@ static int __sched_setscheduler(struct task_struct *p,
 		if (attr->sched_latency_nice < MIN_LATENCY_NICE)
 			return -EINVAL;
 		/* Use the same security checks as NICE */
-		if (attr->sched_latency_nice < p->latency_nice &&
+		if (attr->sched_latency_nice < LATENCY_TO_NICE(p->latency_prio) &&
 		    !capable(CAP_SYS_NICE))
 			return -EPERM;
 	}
@@ -7485,7 +7496,7 @@ static int __sched_setscheduler(struct task_struct *p,
 		if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
 			goto change;
 		if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE &&
-		    attr->sched_latency_nice != p->latency_nice)
+		    attr->sched_latency_nice != LATENCY_TO_NICE(p->latency_prio))
 			goto change;
 
 		p->sched_reset_on_fork = reset_on_fork;
@@ -8026,7 +8037,7 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
 	get_params(p, &kattr);
 	kattr.sched_flags &= SCHED_FLAG_ALL;
 
-	kattr.sched_latency_nice = p->latency_nice;
+	kattr.sched_latency_nice = LATENCY_TO_NICE(p->latency_prio);
 
 #ifdef CONFIG_UCLAMP_TASK
 	/*
@@ -11177,6 +11188,20 @@ const u32 sched_prio_to_wmult[40] = {
  /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
 };
 
+/*
+ * latency weight for wakeup preemption
+ */
+const int sched_latency_to_weight[40] = {
+ /* -20 */     -1024,     -973,     -922,      -870,      -819,
+ /* -15 */      -768,     -717,     -666,      -614,      -563,
+ /* -10 */      -512,     -461,     -410,      -358,      -307,
+ /*  -5 */      -256,     -205,     -154,      -102,       -51,
+ /*   0 */         0,       51,      102,       154,       205,
+ /*   5 */       256,      307,      358,       410,       461,
+ /*  10 */       512,      563,      614,       666,       717,
+ /*  15 */       768,      819,      870,       922,       973,
+};
+
 void call_trace_sched_update_nr_running(struct rq *rq, int count)
 {
         trace_sched_update_nr_running_tp(rq, count);
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index a3f7876217a6..06aaa0c81d4b 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1042,7 +1042,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 #endif
 	P(policy);
 	P(prio);
-	P(latency_nice);
+	P(latency_prio);
 	if (task_has_dl_policy(p)) {
 		P(dl.runtime);
 		P(dl.deadline);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e8c1b889dcbb..a20eadb0af97 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4574,6 +4574,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		update_idle_cfs_rq_clock_pelt(cfs_rq);
 }
 
+static long wakeup_latency_gran(struct sched_entity *curr, struct sched_entity *se);
+
 /*
  * Preempt the current task with a newly woken task if needed:
  */
@@ -4606,6 +4608,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 
 	se = __pick_first_entity(cfs_rq);
 	delta = curr->vruntime - se->vruntime;
+	delta -= wakeup_latency_gran(curr, se);
 
 	if (delta < 0)
 		return;
@@ -5732,6 +5735,35 @@ static int sched_idle_cpu(int cpu)
 }
 #endif
 
+static void set_next_buddy(struct sched_entity *se);
+
+static void check_preempt_from_others(struct cfs_rq *cfs, struct sched_entity *se)
+{
+	struct sched_entity *next;
+
+	if (se->latency_offset >= 0)
+		return;
+
+	if (cfs->nr_running <= 1)
+		return;
+	/*
+	 * When waking from idle, we don't need to check to preempt at wakeup
+	 * the idle thread and don't set next buddy as a candidate for being
+	 * picked in priority.
+	 * In case of simultaneous wakeup from idle, the latency sensitive tasks
+	 * lost opportunity to preempt non sensitive tasks which woke up
+	 * simultaneously.
+	 */
+
+	if (cfs->next)
+		next = cfs->next;
+	else
+		next = __pick_first_entity(cfs);
+
+	if (next && wakeup_preempt_entity(next, se) == 1)
+		set_next_buddy(se);
+}
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
@@ -5818,14 +5850,15 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!task_new)
 		update_overutilized_status(rq);
 
+	if (rq->curr->sched_class != &fair_sched_class)
+		check_preempt_from_others(cfs_rq_of(&p->se), &p->se);
+
 enqueue_throttle:
 	assert_list_leaf_cfs_rq(rq);
 
 	hrtick_update(rq);
 }
 
-static void set_next_buddy(struct sched_entity *se);
-
 /*
  * The dequeue_task method is called before nr_running is
  * decreased. We remove the task from the rbtree and
@@ -7148,6 +7181,22 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 }
 #endif /* CONFIG_SMP */
 
+static long wakeup_latency_gran(struct sched_entity *curr, struct sched_entity *se)
+{
+	long latency_offset = se->latency_offset;
+
+	/*
+	 * A negative latency offset means that the sched_entity has a latency
+	 * requirement that needs to be evaluated versus other entities.
+	 * Otherwise, use the latency offset to evaluate how much scheduling
+	 * delay is acceptable by se.
+	 */
+	if ((se->latency_offset < 0) || (curr->latency_offset < 0))
+		latency_offset -= curr->latency_offset;
+
+	return latency_offset;
+}
+
 static unsigned long wakeup_gran(struct sched_entity *se)
 {
 	unsigned long gran = sysctl_sched_wakeup_granularity;
@@ -7187,6 +7236,9 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
 {
 	s64 gran, vdiff = curr->vruntime - se->vruntime;
 
+	/* Take into account latency priority */
+	vdiff -= wakeup_latency_gran(curr, se);
+
 	if (vdiff <= 0)
 		return -1;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b927269b84f9..a4cb8058f1b2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -141,6 +141,17 @@ extern int sched_rr_timeslice;
  * Default tasks should be treated as a task with latency_nice = 0.
  */
 #define DEFAULT_LATENCY_NICE	0
+#define DEFAULT_LATENCY_PRIO	(DEFAULT_LATENCY_NICE + LATENCY_NICE_WIDTH/2)
+
+/*
+ * Convert user-nice values [ -20 ... 0 ... 19 ]
+ * to static latency [ 0..39 ],
+ * and back.
+ */
+#define NICE_TO_LATENCY(nice)	((nice) + DEFAULT_LATENCY_PRIO)
+#define LATENCY_TO_NICE(prio)	((prio) - DEFAULT_LATENCY_PRIO)
+#define NICE_LATENCY_SHIFT	(SCHED_FIXEDPOINT_SHIFT)
+#define NICE_LATENCY_WEIGHT_MAX	(1L << NICE_LATENCY_SHIFT)
 
 /*
  * Increase resolution of nice-level calculations for 64-bit architectures.
@@ -2114,6 +2125,7 @@ static_assert(WF_TTWU == SD_BALANCE_WAKE);
 
 extern const int		sched_prio_to_weight[40];
 extern const u32		sched_prio_to_wmult[40];
+extern const int		sched_latency_to_weight[40];
 
 /*
  * {de,en}queue flags:
@@ -2442,8 +2454,8 @@ extern void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags);
 extern const_debug unsigned int sysctl_sched_nr_migrate;
 extern const_debug unsigned int sysctl_sched_migration_cost;
 
-#ifdef CONFIG_SCHED_DEBUG
 extern unsigned int sysctl_sched_latency;
+#ifdef CONFIG_SCHED_DEBUG
 extern unsigned int sysctl_sched_min_granularity;
 extern unsigned int sysctl_sched_idle_min_granularity;
 extern unsigned int sysctl_sched_wakeup_granularity;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v4 6/8] sched/fair: Add sched group latency support
  2022-09-16  8:02 [PATCH v4 0/8] Add latency priority for CFS class Vincent Guittot
                   ` (4 preceding siblings ...)
  2022-09-16  8:03 ` [PATCH v4 5/8] sched/fair: Take into account latency priority at wakeup Vincent Guittot
@ 2022-09-16  8:03 ` Vincent Guittot
  2022-09-19 11:55   ` Dietmar Eggemann
  2022-09-16  8:03 ` [PATCH v4 7/8] sched/core: support latency priority with sched core Vincent Guittot
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 44+ messages in thread
From: Vincent Guittot @ 2022-09-16  8:03 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel, parth
  Cc: qais.yousef, chris.hyser, valentin.schneider, patrick.bellasi,
	David.Laight, pjt, pavel, tj, qperret, tim.c.chen, joshdon,
	Vincent Guittot

A task can set its latency priority, which is then used to decide whether
it preempts the current running entity of the cfs, but sched group
entities still have the default latency offset.

Add a latency field to the task group to set the latency offset of the
sched_entities of the group, which will be used against other entities in
the parent cfs when deciding which entity to schedule first.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/core.c  | 24 ++++++++++++++++++++++++
 kernel/sched/fair.c  | 33 +++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  4 ++++
 3 files changed, 61 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 13cf794708ee..bfea862a3588 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10860,6 +10860,19 @@ static int cpu_idle_write_s64(struct cgroup_subsys_state *css,
 {
 	return sched_group_set_idle(css_tg(css), idle);
 }
+
+static s64 cpu_latency_read_s64(struct cgroup_subsys_state *css,
+			       struct cftype *cft)
+{
+	return css_tg(css)->latency_offset;
+}
+
+static int cpu_latency_write_s64(struct cgroup_subsys_state *css,
+				struct cftype *cft, s64 latency)
+{
+	return sched_group_set_latency(css_tg(css), latency);
+}
+
 #endif
 
 static struct cftype cpu_legacy_files[] = {
@@ -10874,6 +10887,11 @@ static struct cftype cpu_legacy_files[] = {
 		.read_s64 = cpu_idle_read_s64,
 		.write_s64 = cpu_idle_write_s64,
 	},
+	{
+		.name = "latency",
+		.read_s64 = cpu_latency_read_s64,
+		.write_s64 = cpu_latency_write_s64,
+	},
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
@@ -11091,6 +11109,12 @@ static struct cftype cpu_files[] = {
 		.read_s64 = cpu_idle_read_s64,
 		.write_s64 = cpu_idle_write_s64,
 	},
+	{
+		.name = "latency",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_s64 = cpu_latency_read_s64,
+		.write_s64 = cpu_latency_write_s64,
+	},
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a20eadb0af97..6cc4f2a9725d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11769,6 +11769,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 		goto err;
 
 	tg->shares = NICE_0_LOAD;
+	tg->latency_offset = 0;
 
 	init_cfs_bandwidth(tg_cfs_bandwidth(tg));
 
@@ -11867,6 +11868,9 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 	}
 
 	se->my_q = cfs_rq;
+
+	se->latency_offset = tg->latency_offset;
+
 	/* guarantee group entities always have weight */
 	update_load_set(&se->load, NICE_0_LOAD);
 	se->parent = parent;
@@ -11997,6 +12001,35 @@ int sched_group_set_idle(struct task_group *tg, long idle)
 	return 0;
 }
 
+int sched_group_set_latency(struct task_group *tg, long latency)
+{
+	int i;
+
+	if (tg == &root_task_group)
+		return -EINVAL;
+
+	if (abs(latency) > sysctl_sched_latency)
+		return -EINVAL;
+
+	mutex_lock(&shares_mutex);
+
+	if (tg->latency_offset == latency) {
+		mutex_unlock(&shares_mutex);
+		return 0;
+	}
+
+	tg->latency_offset = latency;
+
+	for_each_possible_cpu(i) {
+		struct sched_entity *se = tg->se[i];
+
+		WRITE_ONCE(se->latency_offset, latency);
+	}
+
+	mutex_unlock(&shares_mutex);
+	return 0;
+}
+
 #else /* CONFIG_FAIR_GROUP_SCHED */
 
 void free_fair_sched_group(struct task_group *tg) { }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a4cb8058f1b2..619132f4f480 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -406,6 +406,8 @@ struct task_group {
 
 	/* A positive value indicates that this is a SCHED_IDLE group. */
 	int			idle;
+	/* latency constraint of the group. */
+	int			latency_offset;
 
 #ifdef	CONFIG_SMP
 	/*
@@ -519,6 +521,8 @@ extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
 
 extern int sched_group_set_idle(struct task_group *tg, long idle);
 
+extern int sched_group_set_latency(struct task_group *tg, long latency);
+
 #ifdef CONFIG_SMP
 extern void set_task_rq_fair(struct sched_entity *se,
 			     struct cfs_rq *prev, struct cfs_rq *next);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v4 7/8] sched/core: support latency priority with sched core
  2022-09-16  8:02 [PATCH v4 0/8] Add latency priority for CFS class Vincent Guittot
                   ` (5 preceding siblings ...)
  2022-09-16  8:03 ` [PATCH v4 6/8] sched/fair: Add sched group latency support Vincent Guittot
@ 2022-09-16  8:03 ` Vincent Guittot
  2022-09-16  8:03 ` [PATCH v4 8/8] sched/fair: Add latency list Vincent Guittot
  2022-09-21 16:08 ` [PATCH v4 0/8] Add latency priority for CFS class Qais Yousef
  8 siblings, 0 replies; 44+ messages in thread
From: Vincent Guittot @ 2022-09-16  8:03 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel, parth
  Cc: qais.yousef, chris.hyser, valentin.schneider, patrick.bellasi,
	David.Laight, pjt, pavel, tj, qperret, tim.c.chen, joshdon,
	Vincent Guittot

Take into account wakeup_latency_gran() when ordering the cfs threads.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6cc4f2a9725d..7563fb16aba1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11444,6 +11444,10 @@ bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool in_fi)
 	delta = (s64)(sea->vruntime - seb->vruntime) +
 		(s64)(cfs_rqb->min_vruntime_fi - cfs_rqa->min_vruntime_fi);
 
+	/* Take into account latency prio */
+	delta -= wakeup_latency_gran(sea, seb);
+
+
 	return delta > 0;
 }
 #else
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v4 8/8] sched/fair: Add latency list
  2022-09-16  8:02 [PATCH v4 0/8] Add latency priority for CFS class Vincent Guittot
                   ` (6 preceding siblings ...)
  2022-09-16  8:03 ` [PATCH v4 7/8] sched/core: support latency priority with sched core Vincent Guittot
@ 2022-09-16  8:03 ` Vincent Guittot
  2022-09-21 16:08 ` [PATCH v4 0/8] Add latency priority for CFS class Qais Yousef
  8 siblings, 0 replies; 44+ messages in thread
From: Vincent Guittot @ 2022-09-16  8:03 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel, parth
  Cc: qais.yousef, chris.hyser, valentin.schneider, patrick.bellasi,
	David.Laight, pjt, pavel, tj, qperret, tim.c.chen, joshdon,
	Vincent Guittot

Add an rb tree for latency sensitive entities so we can schedule the most
sensitive one first, even when it failed to preempt current at wakeup or
when it got quickly preempted by another entity of higher priority.

In order to keep fairness, the latency is used once at wakeup to get a
minimum slice, and not during the following scheduling slices, to prevent
a long running entity from getting more running time than allocated by
its nice priority.

The rb tree enables us to cover the last corner case where a latency
sensitive entity can't get scheduled quickly after the wakeup.

hackbench -l 10000 -g $group &
cyclictest --policy other -D 5 -q -n
        latency 0           latency -20
group   min  avg    max     min  avg    max
0       17    19     29      17   18     30
1       65   306   7149      64   83    208
4       50   395  15731      56   80    271
8       56   781  41548      54   80    301
16      60  1392  87237      59   86    490

group = 0 means that hackbench is not running.

Both avg and max are significantly improved with nice latency -20. If we
add the histogram parameters to get the latency details, we have:

hackbench -l 10000 -g 16 &
cyclictest --policy other -D 5 -q -n  -H 20000 --histfile data.txt
              latency 0    latency -20
Min Latencies:    60           61
Avg Latencies:  1077           86
Max Latencies: 87311          444
50% latencies:    92           85
75% latencies:   554           90
85% latencies:  1019           93
90% latencies:  1346           96
95% latencies:  5400          100
99% latencies: 19044          110

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h |  2 +
 kernel/sched/fair.c   | 96 +++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h  |  1 +
 3 files changed, 96 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2d2a361d65ee..8b7bf0cb4ee9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -537,6 +537,8 @@ struct sched_entity {
 	/* For load-balancing: */
 	struct load_weight		load;
 	struct rb_node			run_node;
+	struct rb_node			latency_node;
+	unsigned int			on_latency;
 	struct list_head		group_node;
 	unsigned int			on_rq;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7563fb16aba1..edec532bc89d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -664,7 +664,77 @@ struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
 
 	return __node_2_se(last);
 }
+#endif
 
+/**************************************************************
+ * Scheduling class tree data structure manipulation methods:
+ * for latency
+ */
+
+static inline bool latency_before(struct sched_entity *a,
+				struct sched_entity *b)
+{
+	return (s64)(a->vruntime + a->latency_offset - b->vruntime - b->latency_offset) < 0;
+}
+
+#define __latency_node_2_se(node) \
+	rb_entry((node), struct sched_entity, latency_node)
+
+static inline bool __latency_less(struct rb_node *a, const struct rb_node *b)
+{
+	return latency_before(__latency_node_2_se(a), __latency_node_2_se(b));
+}
+
+/*
+ * Enqueue an entity into the latency rb-tree:
+ */
+static void __enqueue_latency(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+{
+
+	/* Only latency sensitive entity can be added to the list */
+	if (se->latency_offset >= 0)
+		return;
+
+	if (se->on_latency)
+		return;
+
+	/*
+	 * An execution time less than sysctl_sched_min_granularity means that
+	 * the entity has been preempted by a higher sched class or an entity
+	 * with higher latency constraint.
+	 * Put it back in the list so it gets a chance to run 1st during the
+	 * next slice.
+	 */
+	if (!(flags & ENQUEUE_WAKEUP)) {
+		u64 delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+
+		if (delta_exec >= sysctl_sched_min_granularity)
+			return;
+	}
+
+	rb_add_cached(&se->latency_node, &cfs_rq->latency_timeline, __latency_less);
+	se->on_latency = 1;
+}
+
+static void __dequeue_latency(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	if (se->on_latency) {
+		rb_erase_cached(&se->latency_node, &cfs_rq->latency_timeline);
+		se->on_latency = 0;
+	}
+}
+
+static struct sched_entity *__pick_first_latency(struct cfs_rq *cfs_rq)
+{
+	struct rb_node *left = rb_first_cached(&cfs_rq->latency_timeline);
+
+	if (!left)
+		return NULL;
+
+	return __latency_node_2_se(left);
+}
+
+#ifdef CONFIG_SCHED_DEBUG
 /**************************************************************
  * Scheduling class statistics methods:
  */
@@ -4455,8 +4525,10 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	check_schedstat_required();
 	update_stats_enqueue_fair(cfs_rq, se, flags);
 	check_spread(cfs_rq, se);
-	if (!curr)
+	if (!curr) {
 		__enqueue_entity(cfs_rq, se);
+		__enqueue_latency(cfs_rq, se, flags);
+	}
 	se->on_rq = 1;
 
 	if (cfs_rq->nr_running == 1) {
@@ -4542,8 +4614,10 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	clear_buddies(cfs_rq, se);
 
-	if (se != cfs_rq->curr)
+	if (se != cfs_rq->curr) {
 		__dequeue_entity(cfs_rq, se);
+		__dequeue_latency(cfs_rq, se);
+	}
 	se->on_rq = 0;
 	account_entity_dequeue(cfs_rq, se);
 
@@ -4631,6 +4705,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		 */
 		update_stats_wait_end_fair(cfs_rq, se);
 		__dequeue_entity(cfs_rq, se);
+		__dequeue_latency(cfs_rq, se);
 		update_load_avg(cfs_rq, se, UPDATE_TG);
 	}
 
@@ -4669,7 +4744,7 @@ static struct sched_entity *
 pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 {
 	struct sched_entity *left = __pick_first_entity(cfs_rq);
-	struct sched_entity *se;
+	struct sched_entity *latency, *se;
 
 	/*
 	 * If curr is set we have to see if its left of the leftmost entity
@@ -4711,6 +4786,12 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 		se = cfs_rq->last;
 	}
 
+	/* Check for latency sensitive entity waiting for running */
+	latency = __pick_first_latency(cfs_rq);
+	if (latency && (latency != se) &&
+	    wakeup_preempt_entity(latency, se) < 1)
+		se = latency;
+
 	return se;
 }
 
@@ -4734,6 +4815,7 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 		update_stats_wait_start_fair(cfs_rq, prev);
 		/* Put 'current' back into the tree. */
 		__enqueue_entity(cfs_rq, prev);
+		__enqueue_latency(cfs_rq, prev, 0);
 		/* in !on_rq case, update occurred at dequeue */
 		update_load_avg(cfs_rq, prev, 0);
 	}
@@ -11718,6 +11800,7 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
 void init_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	cfs_rq->tasks_timeline = RB_ROOT_CACHED;
+	cfs_rq->latency_timeline = RB_ROOT_CACHED;
 	u64_u32_store(cfs_rq->min_vruntime, (u64)(-(1LL << 20)));
 #ifdef CONFIG_SMP
 	raw_spin_lock_init(&cfs_rq->removed.lock);
@@ -12026,8 +12109,15 @@ int sched_group_set_latency(struct task_group *tg, long latency)
 
 	for_each_possible_cpu(i) {
 		struct sched_entity *se = tg->se[i];
+		struct rq *rq = cpu_rq(i);
+		struct rq_flags rf;
+
+		rq_lock_irqsave(rq, &rf);
 
+		__dequeue_latency(se->cfs_rq, se);
 		WRITE_ONCE(se->latency_offset, latency);
+
+		rq_unlock_irqrestore(rq, &rf);
 	}
 
 	mutex_unlock(&shares_mutex);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 619132f4f480..cab1c241d6df 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -597,6 +597,7 @@ struct cfs_rq {
 #endif
 
 	struct rb_root_cached	tasks_timeline;
+	struct rb_root_cached	latency_timeline;
 
 	/*
 	 * 'curr' points to currently running entity on this cfs_rq.
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/8] sched/fair: Take into account latency priority at wakeup
       [not found]   ` <20220916120245.2951-1-hdanton@sina.com>
@ 2022-09-16 13:36     ` Vincent Guittot
       [not found]       ` <20220917225819.817-1-hdanton@sina.com>
  0 siblings, 1 reply; 44+ messages in thread
From: Vincent Guittot @ 2022-09-16 13:36 UTC (permalink / raw)
  To: Hillf Danton
  Cc: peterz, mgorman, valentin.schneider, parth, linux-mm, linux-kernel

Hi Hillf,

On Fri, 16 Sept 2022 at 14:03, Hillf Danton <hdanton@sina.com> wrote:
>
> Hello Vincent
>
> On 16 Sep 2022 10:03:02 +0200 Vincent Guittot <vincent.guittot@linaro.org> wrote:
> >
> > @@ -4606,6 +4608,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> >
> >       se = __pick_first_entity(cfs_rq);
> >       delta = curr->vruntime - se->vruntime;
> > +     delta -= wakeup_latency_gran(curr, se);
> >
> >       if (delta < 0)
> >               return;
>
> What is derived from the latency nice you added is the runtime granularity
> which has a role in preempting the current task.
>
> Given the same definition of latency nice as the nice, the runtime granularity
> can be computed without introducing the latency nice.
>
> Only for thoughts now.
>
> Hillf
>
> +++ b/kernel/sched/fair.c
> @@ -4569,7 +4569,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
>  static void
>  check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>  {
> -       unsigned long ideal_runtime, delta_exec;
> +       unsigned long ideal_runtime, delta_exec, granu;
>         struct sched_entity *se;
>         s64 delta;
>
> @@ -4594,6 +4594,14 @@ check_preempt_tick(struct cfs_rq *cfs_rq
>                 return;
>
>         se = __pick_first_entity(cfs_rq);
> +
> +       granu = sysctl_sched_min_granularity +
> +               (ideal_runtime - sysctl_sched_min_granularity) *
> +               (se->latency_nice + 20) / LATENCY_NICE_WIDTH;

There is no latency_nice field in se, but a latency_offset instead.

Also, at this step we are sure that curr has run at least
sysctl_sched_min_granularity, and we now want to compare curr's
vruntime with the first se's one. We take the latency offset into
account to make sure we will not preempt curr too early.

> +
> +       if (delta_exec < granu)
> +               return;
> +
>         delta = curr->vruntime - se->vruntime;
>
>         if (delta < 0)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/8] sched/fair: Take into account latency priority at wakeup
       [not found]       ` <20220917225819.817-1-hdanton@sina.com>
@ 2022-09-18 10:46         ` Vincent Guittot
       [not found]           ` <20220920113238.1176-1-hdanton@sina.com>
  0 siblings, 1 reply; 44+ messages in thread
From: Vincent Guittot @ 2022-09-18 10:46 UTC (permalink / raw)
  To: Hillf Danton
  Cc: peterz, mgorman, valentin.schneider, parth, linux-mm, linux-kernel

On Sun, 18 Sept 2022 at 00:58, Hillf Danton <hdanton@sina.com> wrote:
>
> On 16 Sep 2022 15:36:53 +0200 Vincent Guittot <vincent.guittot@linaro.org> wrote:
> >
> > Hi Hillf,
> >
> > On Fri, 16 Sept 2022 at 14:03, Hillf Danton <hdanton@sina.com> wrote:
> > >
> > > Hello Vincent
> > >
> > > On 16 Sep 2022 10:03:02 +0200 Vincent Guittot <vincent.guittot@linaro.org> wrote:
> > > >
> > > > @@ -4606,6 +4608,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > > >
> > > >       se = __pick_first_entity(cfs_rq);
> > > >       delta = curr->vruntime - se->vruntime;
> > > > +     delta -= wakeup_latency_gran(curr, se);
> > > >
> > > >       if (delta < 0)
> > > >               return;
> > >
> > > What is derived from the latency nice you added is the runtime granularity
> > > which has a role in preempting the current task.
> > >
> > > Given the same definition of latency nice as the nice, the runtime granularity
> > > can be computed without introducing the latency nice.
> > >
> > > Only for thoughts now.
> > >
> > > Hillf
> > >
> > > +++ b/kernel/sched/fair.c
> > > @@ -4569,7 +4569,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
> > >  static void
> > >  check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > >  {
> > > -       unsigned long ideal_runtime, delta_exec;
> > > +       unsigned long ideal_runtime, delta_exec, granu;
> > >         struct sched_entity *se;
> > >         s64 delta;
> > >
> > > @@ -4594,6 +4594,14 @@ check_preempt_tick(struct cfs_rq *cfs_rq
> > >                 return;
> > >
> > >         se = __pick_first_entity(cfs_rq);
> > > +
> > > +       granu = sysctl_sched_min_granularity +
> > > +               (ideal_runtime - sysctl_sched_min_granularity) *
> > > +               (se->latency_nice + 20) / LATENCY_NICE_WIDTH;
> >
> > There is no latency_nice field in se, but a latency_offset instead.
> >
> > Also, at this step we are sure that curr has run at least
> > sysctl_sched_min_granularity, and we now want to compare curr's
> > vruntime with the first se's one. We take the latency offset into
> > account to make sure we will not preempt curr too early.
> >
> > > +
> > > +       if (delta_exec < granu)
> > > +               return;
> > > +
> > >         delta = curr->vruntime - se->vruntime;
> > >
> > >         if (delta < 0)
>                 return;
>
>             if (delta > ideal_runtime)
>                 resched_curr(rq_of(cfs_rq));
>
> After another look, curr is not preempted unless the gap in vruntime
> between curr and the first entity grows more than the ideal runtime, while

Curr can be preempted as it has run more than the ideal time (1st
test). This one is to make sure that the diff does not become too
large. Here we reuse the same comparison as at wakeup to make sure that
a newly running curr will get a chance to run its ideal time after
having preempted the previous current at wakeup.

> with latency_offset, since the gap becomes larger, preemption happens later
> than the ideal runtime, IMO.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 4/8] sched/core: Add permission checks for setting the latency_nice value
  2022-09-16  8:03 ` [PATCH v4 4/8] sched/core: Add permission checks for setting the latency_nice value Vincent Guittot
@ 2022-09-19  8:52   ` timj
  2022-09-19  8:52     ` timj
  2022-09-19 12:41     ` Vincent Guittot
  0 siblings, 2 replies; 44+ messages in thread
From: timj @ 2022-09-19  8:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: qais.yousef, chris.hyser, valentin.schneider, patrick.bellasi,
	David.Laight, pjt, pavel, tj, qperret, tim.c.chen, joshdon

Hi,

great to see work on improving low latency scenarios.

There is an entire class of low latency audio applications that seems
to be underrepresented in the current low latency use case descriptions.
Experiencing audio drop outs (clicks or gaps) is generally perceived
to be much more disruptive than graphics glitches like missing a frame
deadline. That's why audio programs and devices with huge buffers put
so much emphasis on avoiding drop outs however possible.

Now, video players and tools like mpg123 are well served with increased
audio buffer sizes to avoid drop outs. But the story is different with
MIDI synthesizers (hit a piano key and expect immediate onset -
especially drum synthesizers) or audio filters that implement realtime
auto-tuning or short delay effects. Here, useful applications require
IO latency guarantees significantly below 10ms. A useful setup would be
e.g. 48kHz sampling frequency and an audio buffer of two fragments with
128 samples each. Simplified, that provides a combined IO latency of
2 * 128 / 48000 = 5.33ms if the audio application can rely on scheduling
deadlines of 1 * 128 / 48000 = 2.6ms.
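For reference, a quick sketch of that arithmetic (illustrative only, not
part of any patch):

/* Two 128-sample fragments at 48kHz, as in the setup described above. */
#include <stdio.h>

int main(void)
{
	const double rate_hz = 48000.0;
	const double fragment = 128.0;

	/* one fragment must be refilled before it is played out */
	printf("scheduling deadline: %.2f ms\n", fragment / rate_hz * 1e3);
	/* round trip through a two-fragment buffer */
	printf("combined IO latency: %.2f ms\n", 2.0 * fragment / rate_hz * 1e3);
	return 0;
}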

So far, low latency applications have to jump through a number of hoops
to roughly achieve such deadlines, ranging from requiring suid-
installations for SCHED_DEADLINE (rarely used in practice), needing 
CAP_SYS_NICE, requiring a Jackd distro or depending on RtKit with all 
its shortcomings [1].
I.e. there's a plethora of user space workarounds that have piled up
over the decades, because regular user applications lack a simple way
to tell the kernel one of:

+1) I'm interested in throughput and don't care about latency: make -j
  0) I'm fine with latency handling defaults: bash, X11
-1) I'm depending on low latency scheduling much more than throughput:
       audio-filter, synthesizer

The emphasis here is on *USER* applications, not programs run by root.
As you write in [PATCH v4 5/8]: "We don't want to provide more CPU
bandwidth to a thread but reorder the scheduling to run latency
sensitive task first whenever possible" and as outlined in the
presentation "Latency_nice - Implementation and Use-case for Scheduler
Optimization" by Shah et al [2], changing latency nice will not result
in taking over the CPU or extending the possibility for DoS attacks.

So what I'm getting at with this lengthy use case description is that
it is vitally important for low latency audio applications without
elevated privileges to be able to request low latency scheduling.
I.e. please remove the check for !capable(CAP_SYS_NICE) to request low
latency scheduling from the patch set, so audio applications can simply
use their time slices (and nothing more) at the *right* time and have a
way to tell the kernel about it without requiring root or daemons to
relay the message. Conversely, idle tasks or make -j also don't need
root to enjoy up to 100% of idle CPU time.

[1] https://github.com/heftig/rtkit/issues/25
[2] https://static.lwn.net/images/conf/2020/ospm/latency-nice-slides.pdf
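For illustration, a minimal sketch of what requesting a latency nice value
could look like from unprivileged user space with this series. The struct
layout and the SCHED_FLAG_LATENCY_NICE name/value are my assumptions based
on the earlier patches; I have not run this against the series:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/types.h>

#ifndef SCHED_FLAG_LATENCY_NICE
#define SCHED_FLAG_LATENCY_NICE	0x80	/* assumed flag from the series */
#endif

/* local copy of the extended sched_attr layout assumed by this series */
struct sched_attr_lat {
	__u32 size;
	__u32 sched_policy;
	__u64 sched_flags;
	__s32 sched_nice;
	__u32 sched_priority;
	__u64 sched_runtime;
	__u64 sched_deadline;
	__u64 sched_period;
	__u32 sched_util_min;
	__u32 sched_util_max;
	__s32 sched_latency_nice;	/* new field added by the series */
};

int main(void)
{
	struct sched_attr_lat attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = 0;				/* SCHED_NORMAL */
	attr.sched_flags = SCHED_FLAG_LATENCY_NICE;
	attr.sched_latency_nice = -20;			/* most latency sensitive */

	if (syscall(SYS_sched_setattr, 0, &attr, 0))	/* pid 0 == calling thread */
		perror("sched_setattr");
	return 0;
}

With the CAP_SYS_NICE check from this patch in place, the call above fails
with EPERM for an unprivileged process lowering its value, which is exactly
the case audio applications need to work.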

On 16.09.22 10:03, Vincent Guittot wrote:
> From: Parth Shah <parth@linux.ibm.com>
> 
> Since the latency_nice uses the similar infrastructure as NICE, use the
> already existing CAP_SYS_NICE security checks for the latency_nice. This
> should return -EPERM for the non-root user when trying to set the task
> latency_nice value to any lower than the current value.
> 
> Signed-off-by: Parth Shah <parth@linux.ibm.com>
> [rebase]
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>   kernel/sched/core.c | 4 ++++
>   1 file changed, 4 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2015e7d1f8b2..3c79c5419d1b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7444,6 +7444,10 @@ static int __sched_setscheduler(struct task_struct *p,
>   			return -EINVAL;
>   		if (attr->sched_latency_nice < MIN_LATENCY_NICE)
>   			return -EINVAL;
> +		/* Use the same security checks as NICE */
> +		if (attr->sched_latency_nice < p->latency_nice &&
> +		    !capable(CAP_SYS_NICE))
> +			return -EPERM;
>   	}
>   
>   	if (pi)



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/8] sched/fair: Take into account latency priority at wakeup
  2022-09-16  8:03 ` [PATCH v4 5/8] sched/fair: Take into account latency priority at wakeup Vincent Guittot
       [not found]   ` <20220916120245.2951-1-hdanton@sina.com>
@ 2022-09-19 10:05   ` Dietmar Eggemann
  2022-09-19 15:39     ` Vincent Guittot
  1 sibling, 1 reply; 44+ messages in thread
From: Dietmar Eggemann @ 2022-09-19 10:05 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel, parth
  Cc: qais.yousef, chris.hyser, valentin.schneider, patrick.bellasi,
	David.Laight, pjt, pavel, tj, qperret, tim.c.chen, joshdon

On 16/09/2022 10:03, Vincent Guittot wrote:

[...]

> @@ -4512,7 +4519,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
>  	p->prio = current->normal_prio;
>  
>  	/* Propagate the parent's latency requirements to the child as well */
> -	p->latency_nice = current->latency_nice;
> +	p->latency_prio = current->latency_prio;

Isn't a `set_latency_offset(p)` missing here?

>  
>  	uclamp_fork(p);
>  

[...]

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e8c1b889dcbb..a20eadb0af97 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4574,6 +4574,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  		update_idle_cfs_rq_clock_pelt(cfs_rq);
>  }
>  
> +static long wakeup_latency_gran(struct sched_entity *curr, struct sched_entity *se);

minor: `struct sched_entity *curr` ... doesn't have to be current
(cfs_rq->curr). Isn't this more like `struct sched_entity *sea, struct
sched_entity *seb`? Anyway, it's already the case for
`wakeup_preempt_entity`.

[...]

> @@ -5732,6 +5735,35 @@ static int sched_idle_cpu(int cpu)
>  }
>  #endif
>  
> +static void set_next_buddy(struct sched_entity *se);
> +
> +static void check_preempt_from_others(struct cfs_rq *cfs, struct sched_entity *se)

minor: Why `struct cfs_rq *cfs` and not `struct cfs_rq *cfs_rq` ?

Using `cfs_rq` would make it more consistent when looking for things
like `cfs_rq->nr_running` for example.

> +{
> +	struct sched_entity *next;
> +
> +	if (se->latency_offset >= 0)
> +		return;
> +
> +	if (cfs->nr_running <= 1)
> +		return;
> +	/*
> +	 * When waking from idle, we don't need to check to preempt at wakeup

s/idle/others ?

> +	 * the idle thread and don't set next buddy as a candidate for being
> +	 * picked in priority.
> +	 * In case of simultaneous wakeup from idle, the latency sensitive tasks
> +	 * lost opportunity to preempt non sensitive tasks which woke up
> +	 * simultaneously.
> +	 */

The position of this comment block within this function is somewhat
misleading since it describes the reason for the function rather than a
particular condition within this function. Wouldn't it be more readable
as a function header comment instead?

[...]

> @@ -7148,6 +7181,22 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  }
>  #endif /* CONFIG_SMP */
>  
> +static long wakeup_latency_gran(struct sched_entity *curr, struct sched_entity *se)
> +{
> +	long latency_offset = se->latency_offset;
> +
> +	/*
> +	 * A negative latency weigth means that the sched_entity has latency

s/weigth/latency_offset ?


> +	 * requirement that needs to be evaluated versus other entity.
> +	 * Otherwise, use the latency weight to evaluate how much scheduling
> +	 * delay is acceptable by se.
> +	 */
> +	if ((se->latency_offset < 0) || (curr->latency_offset < 0))
> +		latency_offset -= curr->latency_offset;

I still don't get the rationale behind why when either one (se or curr)
of the latency_nice values is negative, we use the diff between them but
if not, we only care about se's value. Why don't you always use the diff
between se and curr? Since we have a range [-20 ... 19] why shouldn't we
use the difference between let's say se = 19 and curr = 5?
You discussed this with Tao Zhou on the v1 but I didn't understand it fully.

[...]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 6/8] sched/fair: Add sched group latency support
  2022-09-16  8:03 ` [PATCH v4 6/8] sched/fair: Add sched group latency support Vincent Guittot
@ 2022-09-19 11:55   ` Dietmar Eggemann
  2022-09-19 15:49     ` Vincent Guittot
  2022-09-19 17:34     ` Tejun Heo
  0 siblings, 2 replies; 44+ messages in thread
From: Dietmar Eggemann @ 2022-09-19 11:55 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel, parth
  Cc: qais.yousef, chris.hyser, valentin.schneider, patrick.bellasi,
	David.Laight, pjt, pavel, tj, qperret, tim.c.chen, joshdon

s/valentin.schneider@arm.com//

On 16/09/2022 10:03, Vincent Guittot wrote:
> Task can set its latency priority, which is then used to decide to preempt
> the current running entity of the cfs, but sched group entities still have
> the default latency offset.
> 
> Add a latency field in task group to set the latency offset of the
> sched_eneities of the group, which will be used against other entities in

s/sched_eneities/sched_entity

> the parent cfs when deciding which entity to schedule first.

So latency for cgroups does not follow any (existing) Resource
Distribution Model/Scheme (Documentation/admin-guide/cgroup-v2.rst)?
Latency values are only used to compare sched entities at the same level.

[...]

> +static int cpu_latency_write_s64(struct cgroup_subsys_state *css,
> +				struct cftype *cft, s64 latency)
> +{

There is no [MIN, MAX] checking?

min_weight = sched_latency_to_weight[0]  = -1024
max_weight = sched_latency_to_weight[39] =   973

[MIN, MAX] = [sysctl_sched_latency * min_weight >> NICE_LATENCY_SHIFT,
              sysctl_sched_latency * max_weight >> NICE_LATENCY_SHIFT]


With the `cpu.latency` knob the user would have to know for example that the
value is -24,000,000ns to get the same behaviour as for a task latency
nice = -20 (latency prio = 0) (w/ sysctl_sched_latency = 24ms)?

For `nice` we have `cpu.weight.nice` next to `cpu.weight` in cgroup v2 ?

[...]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 4/8] sched/core: Add permission checks for setting the latency_nice value
  2022-09-19  8:52   ` timj
  2022-09-19  8:52     ` timj
@ 2022-09-19 12:41     ` Vincent Guittot
  2022-09-20 10:18       ` Tim Janik
  1 sibling, 1 reply; 44+ messages in thread
From: Vincent Guittot @ 2022-09-19 12:41 UTC (permalink / raw)
  To: timj
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel, parth, qais.yousef,
	chris.hyser, valentin.schneider, patrick.bellasi, David.Laight,
	pjt, pavel, tj, qperret, tim.c.chen, joshdon

Hi,

Thank you for describing in detail your use case.

On Mon, 19 Sept 2022 at 10:52, timj <timj@gnu.org> wrote:
>
> Hi,
>
> great to see work on improving low latency scenarios.
>
> There is an entire class of low latency audio applications that seems
> to be underrepresented in the current low latency use case descriptions.
> Experiencing audio drop outs (clicks or gaps) is generally perceived
> to be much more disruptive than graphics glitches like missing a frame
> deadline. That's why audio programs and devices with huge buffers put
> so much emphasis on avoiding drop outs however possible.
>
> Now, video players and tools like mpg123 are well served with increased
> audio buffer sizes to avoid drop outs. But the story is different with
> MIDI synthesizers (hit a piano key and expect immediate onset -
> especially drum synthesizers) or audio filters that implement realtime
> auto-tuning or short delay effects. Here, useful applications require
> IO latency guarantees significantly below 10ms. A useful setup would be
> e.g. 48kHz sampling frequency and an audio buffer of two fragments with
> 128 samples each. Simplified, that provides a combined IO latency of
> 2 * 128 / 48000 = 5.33ms if the audio application can rely on scheduling
> deadlines of 1 * 128 / 48000 = 2.6ms.
>
> So far, low latency applications have to jump through a number of hoops
> to roughly achieve such deadlines, ranging from requiring suid-
> installations for SCHED_DEADLINE (rarely used in practice), needing
> CAP_SYS_NICE, requiring a Jackd distro or depending on RtKit with all
> its shortcomings [1].
> I.e. there's a plethora of user space workarounds that have piled up
> over the decades, because regular user applications lack a simple way
> to tell the kernel one of:
>
> +1) I'm interested in throughput and don't care about latency: make -j
>   0) I'm fine with latency handling defaults: bash, X11
> -1) I'm depending on low latency scheduling much more than throughput:
>        audio-filter, synthesizer
>
> The emphasis here is on *USER* applications, not programs run by root.
> As you write in [PATCH v4 5/8]: "We don't want to provide more CPU
> bandwidth to a thread but reorder the scheduling to run latency
> sensitive task first whenever possible" and as outlined in the
> presentation "Latency_nice - Implementation and Use-case for Scheduler
> Optimization" by Shah et al [2], changing latency nice will not result
> in taking over the CPU or extending the possibility for DoS attacks.
>
> So what I'm getting at with this lengthy use case description is that
> it is vitally important for low latency audio applications without
> elevated privileges to be able to request low latency scheduling.
> I.e. please remove the check for !capable(CAP_SYS_NICE) to request low
> latency scheduling from the patch set, so audio applications can simply
> use their time slices (and nothing more) at the *right* time and have a
> way to tell the kernel about it without requiring root or daemons to
> relay the message. Conversely, idle tasks or make -j also don't need
> root to enjoy up to 100% of idle CPU time.

Ok, your explanation makes sense to me, especially because we want to
ensure that this latency prio does not provide more cpu time. I'm
curious to see the feedback from others about the reasons for wanting
CAP_SYS_NICE here other than simply following the nice priority.

Side question: have you tried this patchset (minus this patch) with
your use case?

>
> [1] https://github.com/heftig/rtkit/issues/25
> [2] https://static.lwn.net/images/conf/2020/ospm/latency-nice-slides.pdf
>
> On 16.09.22 10:03, Vincent Guittot wrote:
> > From: Parth Shah <parth@linux.ibm.com>
> >
> > Since the latency_nice uses the similar infrastructure as NICE, use the
> > already existing CAP_SYS_NICE security checks for the latency_nice. This
> > should return -EPERM for the non-root user when trying to set the task
> > latency_nice value to any lower than the current value.
> >
> > Signed-off-by: Parth Shah <parth@linux.ibm.com>
> > [rebase]
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >   kernel/sched/core.c | 4 ++++
> >   1 file changed, 4 insertions(+)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 2015e7d1f8b2..3c79c5419d1b 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -7444,6 +7444,10 @@ static int __sched_setscheduler(struct task_struct *p,
> >                       return -EINVAL;
> >               if (attr->sched_latency_nice < MIN_LATENCY_NICE)
> >                       return -EINVAL;
> > +             /* Use the same security checks as NICE */
> > +             if (attr->sched_latency_nice < p->latency_nice &&
> > +                 !capable(CAP_SYS_NICE))
> > +                     return -EPERM;
> >       }
> >
> >       if (pi)
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/8] sched/fair: Take into account latency priority at wakeup
  2022-09-19 10:05   ` Dietmar Eggemann
@ 2022-09-19 15:39     ` Vincent Guittot
  2022-09-20 13:18       ` Dietmar Eggemann
  0 siblings, 1 reply; 44+ messages in thread
From: Vincent Guittot @ 2022-09-19 15:39 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot,
	vschneid, linux-kernel, parth, qais.yousef, chris.hyser,
	valentin.schneider, patrick.bellasi, David.Laight, pjt, pavel,
	tj, qperret, tim.c.chen, joshdon

On Mon, 19 Sept 2022 at 12:05, Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 16/09/2022 10:03, Vincent Guittot wrote:
>
> [...]
>
> > @@ -4512,7 +4519,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
> >       p->prio = current->normal_prio;
> >
> >       /* Propagate the parent's latency requirements to the child as well */
> > -     p->latency_nice = current->latency_nice;
> > +     p->latency_prio = current->latency_prio;
>
> Isn't here a `set_latency_offset(p)` missing here?

Hmm, I think it's the opposite and the line above is a nop from the
beginning (i.e. patch 2).

>
> >
> >       uclamp_fork(p);
> >
>
> [...]
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e8c1b889dcbb..a20eadb0af97 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4574,6 +4574,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >               update_idle_cfs_rq_clock_pelt(cfs_rq);
> >  }
> >
> > +static long wakeup_latency_gran(struct sched_entity *curr, struct sched_entity *se);
>
> minor: `struct sched_entity *curr` ... doesn't have to be current
> (cfs_rq->curr). Isn't this more like `struct sched_entity *sea, struct
> sched_entity *seb`? Anyway, it's already the case for
> `wakeup_preempt_entity`.
>
> [...]
>
> > @@ -5732,6 +5735,35 @@ static int sched_idle_cpu(int cpu)
> >  }
> >  #endif
> >
> > +static void set_next_buddy(struct sched_entity *se);
> > +
> > +static void check_preempt_from_others(struct cfs_rq *cfs, struct sched_entity *se)
>
> minor: Why `struct cfs_rq *cfs` and not `struct cfs_rq *cfs_rq` ?
>
> Using `cfs_rq` would make it more consistent when looking for things
> like `cfs_rq->nr_running` for example.
>
> > +{
> > +     struct sched_entity *next;
> > +
> > +     if (se->latency_offset >= 0)
> > +             return;
> > +
> > +     if (cfs->nr_running <= 1)
> > +             return;
> > +     /*
> > +      * When waking from idle, we don't need to check to preempt at wakeup
>
> s/idle/others ?

yes, I forgot to update the comment

>
> > +      * the idle thread and don't set next buddy as a candidate for being
> > +      * picked in priority.
> > +      * In case of simultaneous wakeup from idle, the latency sensitive tasks
> > +      * lost opportunity to preempt non sensitive tasks which woke up
> > +      * simultaneously.
> > +      */
>
> The position of this comment block within this function is somehow
> misleading since it describes the reason for the function rather then a
> particular condition within this function. Wouldn't it be more readable
> when it would be a function header comment instead?

I put it after the usual early return tests to put the comment close
to the useful part: the use of next buddy and __pick_first_entity()

>
> [...]
>
> > @@ -7148,6 +7181,22 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >  }
> >  #endif /* CONFIG_SMP */
> >
> > +static long wakeup_latency_gran(struct sched_entity *curr, struct sched_entity *se)
> > +{
> > +     long latency_offset = se->latency_offset;
> > +
> > +     /*
> > +      * A negative latency weigth means that the sched_entity has latency
>
> s/weigth/latency_offset ?

yes

>
>
> > +      * requirement that needs to be evaluated versus other entity.
> > +      * Otherwise, use the latency weight to evaluate how much scheduling
> > +      * delay is acceptable by se.
> > +      */
> > +     if ((se->latency_offset < 0) || (curr->latency_offset < 0))
> > +             latency_offset -= curr->latency_offset;
>
> I still don't get the rationale behind why when either one (se or curr)
> of the latency_nice values is negative, we use the diff between them but
> if not, we only care about se's value. Why don't you always use the diff
> between se and curr? Since we have a range [-20 ... 19] why shouldn't we
> use the difference between let's say se = 19 and curr = 5?
> You discussed this with Tao Zhou on the v1 but I didn't understand it fully.

Let's say that current has a latency nice prio of 19 and a task A with a
latency nice of 10 wakes up. Neither task cares about scheduling
latency (current even less than task A). If we used the diff, the output of
wakeup_latency_gran() would be negative (-10ms), which would reflect a
waking task that is sensitive to latency and wants to preempt current
even if its vruntime is after. But obviously neither current nor task A
cares about preempting at wakeup.
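To put rough numbers on that, here is a sketch of the wakeup_latency_gran()
logic from patch 5 with the example above plugged in; the offsets are
approximations assuming sysctl_sched_latency = 24ms:

/*
 * current: latency nice 19 -> latency_offset ~ +23ms
 * task A:  latency nice 10 -> latency_offset ~ +13ms
 */
static long wakeup_latency_gran(struct sched_entity *curr,
				struct sched_entity *se)
{
	long latency_offset = se->latency_offset;	/* +13ms */

	/* neither offset is negative, so the diff is not used */
	if ((se->latency_offset < 0) || (curr->latency_offset < 0))
		latency_offset -= curr->latency_offset;

	/*
	 * Return +13ms: task A tolerates up to 13ms of extra wakeup delay.
	 * Using the diff unconditionally would return -10ms here and let
	 * task A preempt current although neither task cares about latency.
	 */
	return latency_offset;
}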

>
> [...]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 6/8] sched/fair: Add sched group latency support
  2022-09-19 11:55   ` Dietmar Eggemann
@ 2022-09-19 15:49     ` Vincent Guittot
  2022-09-19 17:34       ` Tejun Heo
  2022-09-20 18:17       ` Dietmar Eggemann
  2022-09-19 17:34     ` Tejun Heo
  1 sibling, 2 replies; 44+ messages in thread
From: Vincent Guittot @ 2022-09-19 15:49 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot,
	vschneid, linux-kernel, parth, qais.yousef, chris.hyser,
	valentin.schneider, patrick.bellasi, David.Laight, pjt, pavel,
	tj, qperret, tim.c.chen, joshdon

On Mon, 19 Sept 2022 at 13:55, Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> s/valentin.schneider@arm.com//
>
> On 16/09/2022 10:03, Vincent Guittot wrote:
> > Task can set its latency priority, which is then used to decide to preempt
> > the current running entity of the cfs, but sched group entities still have
> > the default latency offset.
> >
> > Add a latency field in task group to set the latency offset of the
> > sched_eneities of the group, which will be used against other entities in
>
> s/sched_eneities/sched_entity
>
> > the parent cfs when deciding which entity to schedule first.
>
> So latency for cgroups does not follow any (existing) Resource
> Distribution Model/Scheme (Documentation/admin-guide/cgroup-v2.rst)?
> Latency values are only used to compare sched entities at the same level.

Just like share/cpu.weight value does for time sharing

>
> [...]
>
> > +static int cpu_latency_write_s64(struct cgroup_subsys_state *css,
> > +                             struct cftype *cft, s64 latency)
> > +{
>
> There is no [MIN, MAX] checking?

This is done in sched_group_set_latency(), which checks that
abs(latency) < sysctl_sched_latency

>
> min_weight = sched_latency_to_weight[0]  = -1024
> max_weight = sched_latency_to_weight[39] =   973
>
> [MIN, MAX] = [sysctl_sched_latency * min_weight >> NICE_LATENCY_SHIFT,
>               sysctl_sched_latency * max_weight >> NICE_LATENCY_SHIFT]
>
>
> With the `cpu.latency` knob user would have to know for example that the
> value is -24,000,000ns to get the same behaviour as for a task latency
> nice = -20 (latency prio = 0) (w/ sysctl_sched_latency = 24ms)?

Yes, Tejun raised some concerns about adding an interface like nice in
the task group in v2 so I have removed it.

>
> For `nice` we have `cpu.weight.nice` next to `cpu.weight` in cgroup v2 ?

If everybody is ok, I can add back the cpu.latency.nice interface in
the v5 in addition to the cpu.latency

>
> [...]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 6/8] sched/fair: Add sched group latency support
  2022-09-19 11:55   ` Dietmar Eggemann
  2022-09-19 15:49     ` Vincent Guittot
@ 2022-09-19 17:34     ` Tejun Heo
  2022-09-20  7:02       ` Vincent Guittot
  1 sibling, 1 reply; 44+ messages in thread
From: Tejun Heo @ 2022-09-19 17:34 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, mingo, peterz, juri.lelli, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel, parth, qais.yousef,
	chris.hyser, valentin.schneider, patrick.bellasi, David.Laight,
	pjt, pavel, qperret, tim.c.chen, joshdon

Hello,

On Mon, Sep 19, 2022 at 01:55:15PM +0200, Dietmar Eggemann wrote:
> s/valentin.schneider@arm.com//
> 
> On 16/09/2022 10:03, Vincent Guittot wrote:
> > Task can set its latency priority, which is then used to decide to preempt
> > the current running entity of the cfs, but sched group entities still have
> > the default latency offset.
> > 
> > Add a latency field in task group to set the latency offset of the
> > sched_eneities of the group, which will be used against other entities in
> 
> s/sched_eneities/sched_entity
> 
> > the parent cfs when deciding which entity to schedule first.
> 
> So latency for cgroups does not follow any (existing) Resource
> Distribution Model/Scheme (Documentation/admin-guide/cgroup-v2.rst)?
> Latency values are only used to compare sched entities at the same level.

I think it'd still result in a hierarchical behavior as scheduling is done
recursively top-down. Right, Vincent?

It doesn't fit any of the resource distribution models. But it's rather
difficult to map latencies to existing concepts of resources, and we have a
precedent in the cpu controller (.idle) which behaves similarly, so as long
as the behavior is hierarchical, I think it's okay.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 6/8] sched/fair: Add sched group latency support
  2022-09-19 15:49     ` Vincent Guittot
@ 2022-09-19 17:34       ` Tejun Heo
  2022-09-20  7:03         ` Vincent Guittot
  2022-09-21 16:07         ` Qais Yousef
  2022-09-20 18:17       ` Dietmar Eggemann
  1 sibling, 2 replies; 44+ messages in thread
From: Tejun Heo @ 2022-09-19 17:34 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Dietmar Eggemann, mingo, peterz, juri.lelli, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel, parth, qais.yousef,
	chris.hyser, valentin.schneider, patrick.bellasi, David.Laight,
	pjt, pavel, qperret, tim.c.chen, joshdon

On Mon, Sep 19, 2022 at 05:49:27PM +0200, Vincent Guittot wrote:
> > For `nice` we have `cpu.weight.nice` next to `cpu.weight` in cgroup v2 ?
> 
> If everybody is ok, I can add back the cpu.latency.nice interface in
> the v5 in addition to the cpu.latency

Yeah, that sounds fine to me.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 6/8] sched/fair: Add sched group latency support
  2022-09-19 17:34     ` Tejun Heo
@ 2022-09-20  7:02       ` Vincent Guittot
  0 siblings, 0 replies; 44+ messages in thread
From: Vincent Guittot @ 2022-09-20  7:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Dietmar Eggemann, mingo, peterz, juri.lelli, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel, parth, qais.yousef,
	chris.hyser, valentin.schneider, patrick.bellasi, David.Laight,
	pjt, pavel, qperret, tim.c.chen, joshdon

On Mon, 19 Sept 2022 at 19:34, Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Mon, Sep 19, 2022 at 01:55:15PM +0200, Dietmar Eggemann wrote:
> > s/valentin.schneider@arm.com//
> >
> > On 16/09/2022 10:03, Vincent Guittot wrote:
> > > Task can set its latency priority, which is then used to decide to preempt
> > > the current running entity of the cfs, but sched group entities still have
> > > the default latency offset.
> > >
> > > Add a latency field in task group to set the latency offset of the
> > > sched_eneities of the group, which will be used against other entities in
> >
> > s/sched_eneities/sched_entity
> >
> > > the parent cfs when deciding which entity to schedule first.
> >
> > So latency for cgroups does not follow any (existing) Resource
> > Distribution Model/Scheme (Documentation/admin-guide/cgroup-v2.rst)?
> > Latency values are only used to compare sched entities at the same level.
>
> I think it'd still result in a hierarchical behavior as scheduling is done
> recursively top-down. Right, vincent?

Correct

>
> It doesn't fit any of the resource distribution models. But it's rather
> difficult to map latencies to existing concepts of resources, and we have a
> precedent in the cpu controller (.idle) which behaves similarly, so as long
> as the behavior is hierarchical, I think it's okay.
>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 6/8] sched/fair: Add sched group latency support
  2022-09-19 17:34       ` Tejun Heo
@ 2022-09-20  7:03         ` Vincent Guittot
  2022-09-21 16:07         ` Qais Yousef
  1 sibling, 0 replies; 44+ messages in thread
From: Vincent Guittot @ 2022-09-20  7:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Dietmar Eggemann, mingo, peterz, juri.lelli, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel, parth, qais.yousef,
	chris.hyser, valentin.schneider, patrick.bellasi, David.Laight,
	pjt, pavel, qperret, tim.c.chen, joshdon

On Mon, 19 Sept 2022 at 19:34, Tejun Heo <tj@kernel.org> wrote:
>
> On Mon, Sep 19, 2022 at 05:49:27PM +0200, Vincent Guittot wrote:
> > > For `nice` we have `cpu.weight.nice` next to `cpu.weight` in cgroup v2 ?
> >
> > If everybody is ok, I can add back the cpu.latency.nice interface in
> > the v5 in addition to the cpu.latency
>
> Yeah, that sounds fine to me.

Thanks

>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 4/8] sched/core: Add permission checks for setting the latency_nice value
  2022-09-19 12:41     ` Vincent Guittot
@ 2022-09-20 10:18       ` Tim Janik
  2022-09-20 14:56         ` Vincent Guittot
  0 siblings, 1 reply; 44+ messages in thread
From: Tim Janik @ 2022-09-20 10:18 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel, parth, qais.yousef,
	chris.hyser, valentin.schneider, patrick.bellasi, David.Laight,
	pjt, pavel, tj, qperret, tim.c.chen, joshdon

Hi.

On 19.09.22 14:41, Vincent Guittot wrote:
> Hi,
> 
> Thank you for describing in detail your use case.

> Ok, Your explanation makes sense to me especially because we want to
> ensure to not provide more cpu time with this latency prio. I'm
> curious to see the feedback from others about the reason we want
> CAP_SYS_NICE other than following nice priority.
> 
> Side question, Have you tried this patchset (minus this patch) with
> your use case ?

I have now tested a modified version of the ALSA Test_latency.c program 
that acquires latency nice as non-root:
   https://gist.github.com/tim-janik/88f9df5456b879ecc59da93dc6ce6be1

With a busy but not overloaded CPU, the short time latency tests are 
often better, measured with: ./lnice-latency -p -s 1

But the results aren't very reliable with this test. I.e. requesting a 
latency nice value of -20 reduces the chance for underruns somewhat but 
doesn't eliminate them (and lnice-latency.c gives up on the first XRUN 
in the given time period). It might be better to instead count the XRUN 
occurrences over a given time period.


-- 
Anklang Free Software DAW
https://anklang.testbit.eu/

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/8] sched/fair: Take into account latency priority at wakeup
  2022-09-19 15:39     ` Vincent Guittot
@ 2022-09-20 13:18       ` Dietmar Eggemann
  2022-09-20 15:49         ` Vincent Guittot
  0 siblings, 1 reply; 44+ messages in thread
From: Dietmar Eggemann @ 2022-09-20 13:18 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot,
	vschneid, linux-kernel, parth, qais.yousef, chris.hyser,
	valentin.schneider, patrick.bellasi, David.Laight, pjt, pavel,
	tj, qperret, tim.c.chen, joshdon

On 19/09/2022 17:39, Vincent Guittot wrote:
> On Mon, 19 Sept 2022 at 12:05, Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
>>
>> On 16/09/2022 10:03, Vincent Guittot wrote:
>>
>> [...]
>>
>>> @@ -4512,7 +4519,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
>>>       p->prio = current->normal_prio;
>>>
>>>       /* Propagate the parent's latency requirements to the child as well */
>>> -     p->latency_nice = current->latency_nice;
>>> +     p->latency_prio = current->latency_prio;
>>
>> Isn't here a `set_latency_offset(p)` missing here?
> 
> Hmm, I think it's the opposite and the line above is a nop from the
> beginning (i.e. patch 2).

Yeah, you're right! It looked suspicious ...

[...]

>>> +      * the idle thread and don't set next buddy as a candidate for being
>>> +      * picked in priority.
>>> +      * In case of simultaneous wakeup from idle, the latency sensitive tasks
>>> +      * lost opportunity to preempt non sensitive tasks which woke up
>>> +      * simultaneously.
>>> +      */
>>
>> The position of this comment block within this function is somehow
>> misleading since it describes the reason for the function rather then a
>> particular condition within this function. Wouldn't it be more readable
>> when it would be a function header comment instead?
> 
> I put it after the usual early return tests to put the comment close
> to the useful part: the use of next buddy and __pick_first_entity()

So you want to have the `wakeup_preempt_entity(se, pse) == 1` condition
from check_preempt_wakeup() also for cfs_task woken up by others.

[...]

>>> +      * requirement that needs to be evaluated versus other entity.
>>> +      * Otherwise, use the latency weight to evaluate how much scheduling
>>> +      * delay is acceptable by se.
>>> +      */
>>> +     if ((se->latency_offset < 0) || (curr->latency_offset < 0))
>>> +             latency_offset -= curr->latency_offset;
>>
>> I still don't get the rationale behind why when either one (se or curr)
>> of the latency_nice values is negative, we use the diff between them but
>> if not, we only care about se's value. Why don't you always use the diff
>> between se and curr? Since we have a range [-20 ... 19] why shouldn't we
>> use the difference between let's say se = 19 and curr = 5?
>> You discussed this with Tao Zhou on the v1 but I didn't understand it fully.
> 
> Let say that current has a latency nice prio of 19 and a task A with a
> latency nice of 10 wakes up. Both tasks don't care about scheduling
> latency (current more than task A). If we use the diff, the output of
> wakeup_latency_gran() would be negative (-10ms) which reflects the
> fact that the waking task is sensitive to the latency and wants to
> preempt current even if its vruntime is after. But obviously both
> current and task A don't care to preempt at wakeup.

OK, I understand but there is a certain level of unsteadiness here.

If p has a value >0, it gets treated differently depending on whether
current has >=0 or <0.

Do we expect that tasks set their value to [1..19] in this case, when
the default 0 already indicates a 'don't care'?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 4/8] sched/core: Add permission checks for setting the latency_nice value
  2022-09-20 10:18       ` Tim Janik
@ 2022-09-20 14:56         ` Vincent Guittot
  2022-09-21 16:11           ` Vincent Guittot
  0 siblings, 1 reply; 44+ messages in thread
From: Vincent Guittot @ 2022-09-20 14:56 UTC (permalink / raw)
  To: Tim Janik
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel, parth, qais.yousef,
	chris.hyser, valentin.schneider, patrick.bellasi, David.Laight,
	pjt, pavel, tj, qperret, tim.c.chen, joshdon

On Tue, 20 Sept 2022 at 12:18, Tim Janik <timj@gnu.org> wrote:
>
> Hi.
>
> On 19.09.22 14:41, Vincent Guittot wrote:
> > Hi,
> >
> > Thank you for describing in detail your use case.
>
> > Ok, Your explanation makes sense to me especially because we want to
> > ensure to not provide more cpu time with this latency prio. I'm
> > curious to see the feedback from others about the reason we want
> > CAP_SYS_NICE other than following nice priority.
> >
> > Side question, Have you tried this patchset (minus this patch) with
> > your use case ?
>
> I have now tested a modified version of the ALSA Test_latency.c program
> that acquires latency nice as non-root:
>    https://gist.github.com/tim-janik/88f9df5456b879ecc59da93dc6ce6be1
>
> With a busy but not overloaded CPU, the short time latency tests are
> often better, measured with: ./lnice-latency -p -s 1
>
> But the results aren't very reliable with this test. I.e. requesting a
> latency nice value of -20 reduces the chance for underruns somewhat but
> doesn't eliminate them (and lnice-latency.c gives up on the first XRUN

It's expected that latency nice can't fix all scheduling latency
problems. Hard real-time constraints can only be ensured with the FIFO
or deadline scheduler.

> in the given time period). It might be better to instead count the XRUN
> occurrences over a given time period.

Thanks. I'm going to have a look at the test.

>
>
> --
> Anklang Free Software DAW
> https://anklang.testbit.eu/

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/8] sched/fair: Take into account latency priority at wakeup
       [not found]           ` <20220920113238.1176-1-hdanton@sina.com>
@ 2022-09-20 15:17             ` Vincent Guittot
  0 siblings, 0 replies; 44+ messages in thread
From: Vincent Guittot @ 2022-09-20 15:17 UTC (permalink / raw)
  To: Hillf Danton
  Cc: peterz, mgorman, valentin.schneider, parth, linux-mm, linux-kernel

On Tue, 20 Sept 2022 at 13:32, Hillf Danton <hdanton@sina.com> wrote:
>
> On 18 Sep 2022 12:46:00 +0200 Vincent Guittot <vincent.guittot@linaro.org> wrote:
> > On Sun, 18 Sept 2022 at 00:58, Hillf Danton <hdanton@sina.com> wrote:
> > >
> > > On 16 Sep 2022 15:36:53 +0200 Vincent Guittot <vincent.guittot@linaro.org> wrote:
> > > >
> > > > Hi Hillf,
> > > >
> > > > On Fri, 16 Sept 2022 at 14:03, Hillf Danton <hdanton@sina.com> wrote:
> > > > >
> > > > > Hello Vincent
> > > > >
> > > > > On 16 Sep 2022 10:03:02 +0200 Vincent Guittot <vincent.guittot@linaro.org> wrote:
> > > > > >
> > > > > > @@ -4606,6 +4608,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > > > > >
> > > > > >       se = __pick_first_entity(cfs_rq);
> > > > > >       delta = curr->vruntime - se->vruntime;
> > > > > > +     delta -= wakeup_latency_gran(curr, se);
> > > > > >
> > > > > >       if (delta < 0)
> > > > > >               return;
> > > > >
> > > > > What is derived from the latency nice you added is the runtime granularity
> > > > > which has a role in preempting the current task.
> > > > >
> > > > > Given the same definition of latency nice as the nice, the runtime granularity
> > > > > can be computed without introducing the latency nice.
> > > > >
> > > > > Only for thoughts now.
> > > > >
> > > > > Hillf
> > > > >
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -4569,7 +4569,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
> > > > >  static void
> > > > >  check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > > > >  {
> > > > > -       unsigned long ideal_runtime, delta_exec;
> > > > > +       unsigned long ideal_runtime, delta_exec, granu;
> > > > >         struct sched_entity *se;
> > > > >         s64 delta;
> > > > >
> > > > > @@ -4594,6 +4594,14 @@ check_preempt_tick(struct cfs_rq *cfs_rq
> > > > >                 return;
> > > > >
> > > > >         se = __pick_first_entity(cfs_rq);
> > > > > +
> > > > > +       granu = sysctl_sched_min_granularity +
> > > > > +               (ideal_runtime - sysctl_sched_min_granularity) *
> > > > > +               (se->latency_nice + 20) / LATENCY_NICE_WIDTH;
> > > >
> > > > There is no latency_nice field in se but a latency_offset instead
> > > >
> > > > Also at this step, we are sure that curr has run at least
> > > > sysctl_sched_min_granularity and we want now to compare curr vruntime
> > > > with first se's one. We take the latency offset into account to make
> > > > sure we will not preempt curr too early
> > > >
> > > > > +
> > > > > +       if (delta_exec < granu)
> > > > > +               return;
> > > > > +
> > > > >         delta = curr->vruntime - se->vruntime;
> > > > >
> > > > >         if (delta < 0)
> > >                 return;
> > >
> > >             if (delta > ideal_runtime)
> > >                 resched_curr(rq_of(cfs_rq));
> > >
> > > After another look, curr is not preempted without the gap in vruntime
> > > between curr and the first entity growing more than ideal runtime, while
> >
> > Curr can be preempted as it has run more than the ideal time (1st
> > test). This one is to make sure that the diff does not become too
> > large. Here we reuse the same comparison as wakeup to make sure that a
> > newly curr will get a chance to run its ideal time after  having
> > preempted at wakeup the previous current
>
> IIUC it would take two checks to preempt correctly - diff in vruntime
> is checked first to avoid preempting too early, then it is checked again
> with latency_offset taken into account to avoid preempting too late.

The 1st test in check_preempt_tick(): if (delta_exec > ideal_runtime)
ensures that a resched happens once current has run its slice.

The 2nd test: if (delta_exec < sysctl_sched_min_granularity)
ensures that current will run for at least 3ms.

The 3rd one: if (delta > ideal_runtime)
is there to make sure that the difference between the vruntimes does not
become too large. But we are comparing virtual runtime with execution
time and, as Peter mentioned in another thread, that's weird; [1] will
fix it in addition to ensuring a max runtime.
Then, current might have preempted first_entity a few ms earlier thanks
to its latency_offset. If the tick happens quickly after that
preemption, delta might be above ideal_runtime whereas current has not
yet run its ideal time.

[1] https://lore.kernel.org/all/20220916131538.24706-1-vincent.guittot@linaro.org/
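To summarize, a condensed sketch of how the three checks line up with the
wakeup_latency_gran() adjustment of this patch (simplified from
check_preempt_tick(), not the literal code):

ideal_runtime = sched_slice(cfs_rq, curr);
delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;

/* 1st: curr consumed its slice -> resched */
if (delta_exec > ideal_runtime) {
	resched_curr(rq_of(cfs_rq));
	return;
}

/* 2nd: let curr run at least sysctl_sched_min_granularity (~3ms) */
if (delta_exec < sysctl_sched_min_granularity)
	return;

/*
 * 3rd: bound the vruntime gap to the first waiting entity, taking the
 * latency offset into account so that a curr which preempted at wakeup
 * still gets a chance to run its ideal time.
 */
se = __pick_first_entity(cfs_rq);
delta = curr->vruntime - se->vruntime;
delta -= wakeup_latency_gran(curr, se);

if (delta < 0)
	return;

if (delta > ideal_runtime)
	resched_curr(rq_of(cfs_rq));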

>
> +++ b/kernel/sched/fair.c
> @@ -4571,7 +4571,7 @@ check_preempt_tick(struct cfs_rq *cfs_rq
>  {
>         unsigned long ideal_runtime, delta_exec;
>         struct sched_entity *se;
> -       s64 delta;
> +       s64 delta, d2;
>
>         ideal_runtime = sched_slice(cfs_rq, curr);
>         delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
> @@ -4595,11 +4595,9 @@ check_preempt_tick(struct cfs_rq *cfs_rq
>
>         se = __pick_first_entity(cfs_rq);
>         delta = curr->vruntime - se->vruntime;
> +       d2 = delta - wakeup_latency_gran(curr, se);
>
> -       if (delta < 0)
> -               return;
> -
> -       if (delta > ideal_runtime)
> +       if (delta > ideal_runtime || d2 > ideal_runtime)
>                 resched_curr(rq_of(cfs_rq));
>  }
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/8] sched/fair: Take into account latency priority at wakeup
  2022-09-20 13:18       ` Dietmar Eggemann
@ 2022-09-20 15:49         ` Vincent Guittot
  2022-09-21 22:41           ` Dietmar Eggemann
  0 siblings, 1 reply; 44+ messages in thread
From: Vincent Guittot @ 2022-09-20 15:49 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot,
	vschneid, linux-kernel, parth, qais.yousef, chris.hyser,
	valentin.schneider, patrick.bellasi, David.Laight, pjt, pavel,
	tj, qperret, tim.c.chen, joshdon

On Tue, 20 Sept 2022 at 15:18, Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 19/09/2022 17:39, Vincent Guittot wrote:
> > On Mon, 19 Sept 2022 at 12:05, Dietmar Eggemann
> > <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 16/09/2022 10:03, Vincent Guittot wrote:
> >>
> >> [...]
> >>
> >>> @@ -4512,7 +4519,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
> >>>       p->prio = current->normal_prio;
> >>>
> >>>       /* Propagate the parent's latency requirements to the child as well */
> >>> -     p->latency_nice = current->latency_nice;
> >>> +     p->latency_prio = current->latency_prio;
> >>
> >> Isn't here a `set_latency_offset(p)` missing here?
> >
> > Hmm, I think it's the opposite and the line above is a nop from the
> > beginning (i.e. patch 2).
>
> Yeah, you're right! It looked suspicious ...
>
> [...]
>
> >>> +      * the idle thread and don't set next buddy as a candidate for being
> >>> +      * picked in priority.
> >>> +      * In case of simultaneous wakeup from idle, the latency sensitive tasks
> >>> +      * lost opportunity to preempt non sensitive tasks which woke up
> >>> +      * simultaneously.
> >>> +      */
> >>
> >> The position of this comment block within this function is somehow
> >> misleading since it describes the reason for the function rather then a
> >> particular condition within this function. Wouldn't it be more readable
> >> when it would be a function header comment instead?
> >
> > I put it after the usual early return tests to put the comment close
> > to the useful part: the use of next buddy and __pick_first_entity()
>
> So you want to have the `wakeup_preempt_entity(se, pse) == 1` condition
> from check_preempt_wakeup() also for cfs_task woken up by others.

I use the wakeup_preempt_entity(cfs_rq->next, left) < 1 check in
pick_next_entity() to pick the task with the highest latency constraint
when another class was running at wakeup.

>
> [...]
>
> >>> +      * requirement that needs to be evaluated versus other entity.
> >>> +      * Otherwise, use the latency weight to evaluate how much scheduling
> >>> +      * delay is acceptable by se.
> >>> +      */
> >>> +     if ((se->latency_offset < 0) || (curr->latency_offset < 0))
> >>> +             latency_offset -= curr->latency_offset;
> >>
> >> I still don't get the rationale behind why when either one (se or curr)
> >> of the latency_nice values is negative, we use the diff between them but
> >> if not, we only care about se's value. Why don't you always use the diff
> >> between se and curr? Since we have a range [-20 ... 19] why shouldn't we
> >> use the difference between let's say se = 19 and curr = 5?
> >> You discussed this with Tao Zhou on the v1 but I didn't understand it fully.
> >
> > Let say that current has a latency nice prio of 19 and a task A with a
> > latency nice of 10 wakes up. Both tasks don't care about scheduling
> > latency (current more than task A). If we use the diff, the output of
> > wakeup_latency_gran() would be negative (-10ms) which reflects the
> > fact that the waking task is sensitive to the latency and wants to
> > preempt current even if its vruntime is after. But obviously both
> > current and task A don't care to preempt at wakeup.
>
> OK, I understand but there is a certain level of unsteadiness here.
>
> If p has >0 it gets treated differently in case current has >=0 and case

"If p >=0"; 0 has same behavior than [1..19]

> current has <0.
>
> Do we expect that tasks set their value to [1..19] in this case, when
> the default 0 already indicates a 'don't care'?

I'm not sure that I understand your concern, as [0..19] are treated in
the same way. Only tasks (curr or se) with an offset <0 need a relative
comparison to the other. If curr and se both have a latency nice of
-19, se should not blindly preempt curr but should do so only once curr
has already run for its amount of time.
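For completeness, the case I have in mind with rough numbers (sketch only,
offsets approximated assuming sysctl_sched_latency = 24ms):

/* both tasks are latency sensitive with latency nice -19 */
long curr_offset = -23;			/* ms: curr's latency_offset */
long se_offset = -23;			/* ms: se's latency_offset */
long gran = se_offset - curr_offset;	/* one offset < 0, so the diff is used */
/*
 * gran == 0: se gets no extra preemption credit over curr, so the normal
 * vruntime comparison decides and se does not blindly preempt curr.
 */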

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 6/8] sched/fair: Add sched group latency support
  2022-09-19 15:49     ` Vincent Guittot
  2022-09-19 17:34       ` Tejun Heo
@ 2022-09-20 18:17       ` Dietmar Eggemann
  2022-09-21  7:48         ` Vincent Guittot
  1 sibling, 1 reply; 44+ messages in thread
From: Dietmar Eggemann @ 2022-09-20 18:17 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot,
	vschneid, linux-kernel, parth, qais.yousef, chris.hyser,
	valentin.schneider, patrick.bellasi, David.Laight, pjt, pavel,
	tj, qperret, tim.c.chen, joshdon

On 19/09/2022 17:49, Vincent Guittot wrote:
> On Mon, 19 Sept 2022 at 13:55, Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
>>
>> s/valentin.schneider@arm.com//
>>
>> On 16/09/2022 10:03, Vincent Guittot wrote:
>>> Task can set its latency priority, which is then used to decide to preempt
>>> the current running entity of the cfs, but sched group entities still have
>>> the default latency offset.
>>>
>>> Add a latency field in task group to set the latency offset of the
>>> sched_eneities of the group, which will be used against other entities in
>>
>> s/sched_eneities/sched_entity
>>
>>> the parent cfs when deciding which entity to schedule first.
>>
>> So latency for cgroups does not follow any (existing) Resource
>> Distribution Model/Scheme (Documentation/admin-guide/cgroup-v2.rst)?
>> Latency values are only used to compare sched entities at the same level.
> 
> Just like share/cpu.weight value does for time sharing

But for this one we define it as following the `Weights` scheme. That's why
I was asking.

>> [...]
>>
>>> +static int cpu_latency_write_s64(struct cgroup_subsys_state *css,
>>> +                             struct cftype *cft, s64 latency)
>>> +{
>>
>> There is no [MIN, MAX] checking?
> 
> > This is done in sched_group_set_latency() which checks that
> abs(latency) < sysctl_sched_latency

I see. Nit-picking: Wouldn't this allow specifying a latency offset
value for the non-existent `nice = 20`? The highest nice value 19 maps to
`973/1024 * sysctl_sched_latency`.

> 
>>
>> min_weight = sched_latency_to_weight[0]  = -1024
>> max_weight = sched_latency_to_weight[39] =   973
>>
>> [MIN, MAX] = [sysctl_sched_latency * min_weight >> NICE_LATENCY_SHIFT,
>>               sysctl_sched_latency * max_weight >> NICE_LATENCY_SHIFT]
>>
>>
>> With the `cpu.latency` knob user would have to know for example that the
>> value is -24,000,000ns to get the same behaviour as for a task latency
>> nice = -20 (latency prio = 0) (w/ sysctl_sched_latency = 24ms)?
> 
> Yes, Tejun raised some concerns about adding an interface like nice in
> the task group in v2 so I have removed it.
> 
>>
>> For `nice` we have `cpu.weight.nice` next to `cpu.weight` in cgroup v2 ?
> 
> If everybody is ok, I can add back the cpu.latency.nice interface in
> the v5 in addition to the cpu.latency

cpu.weight/cpu.weight.nice interface:

echo X > cpu.weight	   tg->shares

    1                          10,240
  100                       1,048,576
10000                     104,857,600

echo X > cpu.weight.nice

  -20                     90,891,264
    0                      1,048,576
   19                         15,360

Wouldn't then a similar interface for cpu.latency [1..100..10000] and
cpu.latency.nice [-20..0..19] make most sense?

Raw latency_offset values at interface level are not portable.
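(For reference, my understanding of where the cpu.weight numbers above come
from; a sketch with the current constants, the exact rounding being an
assumption on my side:)

/* Sketch only - how the cpu.weight table above maps to tg->shares. */
#define SCHED_FIXEDPOINT_SHIFT	10
#define scale_load(w)		((unsigned long)(w) << SCHED_FIXEDPOINT_SHIFT)
#define CGROUP_WEIGHT_DFL	100

/* echo X > cpu.weight      => tg->shares ~= scale_load(X * 1024 / CGROUP_WEIGHT_DFL) */
/* echo N > cpu.weight.nice => tg->shares  = scale_load(sched_prio_to_weight[N + 20])  */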



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 6/8] sched/fair: Add sched group latency support
  2022-09-20 18:17       ` Dietmar Eggemann
@ 2022-09-21  7:48         ` Vincent Guittot
  0 siblings, 0 replies; 44+ messages in thread
From: Vincent Guittot @ 2022-09-21  7:48 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot,
	vschneid, linux-kernel, parth, qais.yousef, chris.hyser,
	valentin.schneider, patrick.bellasi, David.Laight, pjt, pavel,
	tj, qperret, tim.c.chen, joshdon

On Tue, 20 Sept 2022 at 20:17, Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 19/09/2022 17:49, Vincent Guittot wrote:
> > On Mon, 19 Sept 2022 at 13:55, Dietmar Eggemann
> > <dietmar.eggemann@arm.com> wrote:
> >>
> >> s/valentin.schneider@arm.com//
> >>
> >> On 16/09/2022 10:03, Vincent Guittot wrote:
> >>> Task can set its latency priority, which is then used to decide to preempt
> >>> the current running entity of the cfs, but sched group entities still have
> >>> the default latency offset.
> >>>
> >>> Add a latency field in task group to set the latency offset of the
> >>> sched_eneities of the group, which will be used against other entities in
> >>
> >> s/sched_eneities/sched_entity
> >>
> >>> the parent cfs when deciding which entity to schedule first.
> >>
> >> So latency for cgroups does not follow any (existing) Resource
> >> Distribution Model/Scheme (Documentation/admin-guide/cgroup-v2.rst)?
> >> Latency values are only used to compare sched entities at the same level.
> >
> > Just like share/cpu.weight value does for time sharing
>
> But for this we define it as following the `Weights` scheme. That's why
> I was asking,
>
> >> [...]
> >>
> >>> +static int cpu_latency_write_s64(struct cgroup_subsys_state *css,
> >>> +                             struct cftype *cft, s64 latency)
> >>> +{
> >>
> >> There is no [MIN, MAX] checking?
> >
> > > This is done in sched_group_set_latency() which checks that
> > abs(latency) < sysctl_sched_latency
>
> I see. Nit-picking: Wouldn't this allow specifying a latency offset
> value for the non-existent `nice = 20`? Highest nice value 19 maps to
> `973/1024 * sysctl_sched_latency`.

Yes, but the same applies to tg->shares and cpu.weight: we can set
a tg->shares of 104,857,600 whereas the max shares for nice -20 is
90,891,264. Furthermore, I don't see a real problem with the ability
to set a latency offset up to sysctl_sched_latency because it means
being even nicer to other tasks, not the opposite.

>
> >
> >>
> >> min_weight = sched_latency_to_weight[0]  = -1024
> >> max_weight = sched_latency_to_weight[39] =   973
> >>
> >> [MIN, MAX] = [sysctl_sched_latency * min_weight >> NICE_LATENCY_SHIFT,
> >>               sysctl_sched_latency * max_weight >> NICE_LATENCY_SHIFT]
> >>
> >>
> >> With the `cpu.latency` knob user would have to know for example that the
> >> value is -24,000,000ns to get the same behaviour as for a task latency
> >> nice = -20 (latency prio = 0) (w/ sysctl_sched_latency = 24ms)?
> >
> > Yes, Tejun raised some concerns about adding an interface like nice in
> > the task group in v2 so I have removed it.
> >
> >>
> >> For `nice` we have `cpu.weight.nice` next to `cpu.weight` in cgroup v2 ?
> >
> > If everybody is ok, I can add back the cpu.latency.nice interface in
> > the v5 in addition to the cpu.latency
>
> cpu.weight/cpu.weight.nice interface:
>
> echo X > cpu.weight        tg->shares
>
>     1                          10,240
>   100                       1,048,576
> 10000                     104,857,600

>
> echo X > cpu.weight.nice
>
>   -20                     90,891,264
>     0                      1,048,576
>    19                         15,360
>
> Wouldn't then a similar interface for cpu.latency [1..100..10000] and
> cpu.latency.nice [-20..0..19] make most sense?

We need at least a signed value for cpu.latency to distinguish between
being sensitive to latency and not caring about it.

>
> Raw latency_offset values at interface level are not portable.

I can use [-1000:1000] but I'm not sure it's better than the raw value in the end

>
>
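
For comparison, cgroup v2 converts cpu.weight roughly as
tg->shares = scale_load(weight * 1024 / 100), which is where the 10,240 /
1,048,576 / 104,857,600 values in the table above come from. A signed
cpu.latency range could be converted the same way; a purely illustrative
sketch (the helper name and the [-1000..1000] range are only an example,
not part of the patchset):

static long latency_cgroup_to_offset(s64 value, long sched_latency_ns)
{
        /* user-visible signed range, 0 meaning "don't care" */
        value = clamp_t(s64, value, -1000, 1000);

        /* scale to [-sysctl_sched_latency..sysctl_sched_latency] */
        return sched_latency_ns * value / 1000;
}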

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 6/8] sched/fair: Add sched group latency support
  2022-09-19 17:34       ` Tejun Heo
  2022-09-20  7:03         ` Vincent Guittot
@ 2022-09-21 16:07         ` Qais Yousef
  2022-09-21 16:48           ` Tejun Heo
  1 sibling, 1 reply; 44+ messages in thread
From: Qais Yousef @ 2022-09-21 16:07 UTC (permalink / raw)
  To: Tejun Heo, Vincent Guittot
  Cc: Dietmar Eggemann, mingo, peterz, juri.lelli, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel, parth, chris.hyser,
	valentin.schneider, patrick.bellasi, David.Laight, pjt, pavel,
	qperret, tim.c.chen, joshdon

On 09/19/22 07:34, Tejun Heo wrote:
> On Mon, Sep 19, 2022 at 05:49:27PM +0200, Vincent Guittot wrote:
> > > For `nice` we have `cpu.weight.nice` next to `cpu.weight` in cgroup v2 ?
> > 
> > If everybody is ok, I can add back the cpu.latency.nice interface in
> > the v5 in addition to the cpu.latency
> 
> Yeah, that sounds fine to me.

I do have concerns about the cpu.latency knob as it exposes latency_offset,
which won't be meaningful for all consumers of latency_nice [1].

The current use case proposed by Vincent is not going to be the only consumer
of this interface. The concept of converting this latency_nice value to a
weight in a similar fashion to the existing nice value is specific to it. In
previous discussions this conversion was not required and I'd expect it to
still not be required for the other use cases.

Wouldn't cpu.latency.nice be enough? I think the latency_offset is an
implementation detail that userspace shouldn't be concerned about.


[1] https://lwn.net/Articles/820659/


Cheers

--
Qais Yousef

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 0/8]  Add latency priority for CFS class
  2022-09-16  8:02 [PATCH v4 0/8] Add latency priority for CFS class Vincent Guittot
                   ` (7 preceding siblings ...)
  2022-09-16  8:03 ` [PATCH v4 8/8] sched/fair: Add latency list Vincent Guittot
@ 2022-09-21 16:08 ` Qais Yousef
  2022-09-22  7:19   ` Vincent Guittot
  8 siblings, 1 reply; 44+ messages in thread
From: Qais Yousef @ 2022-09-21 16:08 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel, parth, chris.hyser,
	valentin.schneider, patrick.bellasi, David.Laight, pjt, pavel,
	tj, qperret, tim.c.chen, joshdon

Hi Vincent

On 09/16/22 10:02, Vincent Guittot wrote:
> This patchset restarts the work about adding a latency priority to describe
> the latency tolerance of cfs tasks.
> 
> The patches [1-4] have been done by Parth:
> https://lore.kernel.org/lkml/20200228090755.22829-1-parth@linux.ibm.com/
> 
> I have just rebased and moved the set of latency priority outside the
> priority update. I have removed the reviewed tag because the patches
> are 2 years old.
> 
> The patch [5] uses latency nice priority to define a latency offset
> and then to decide if a cfs task can preempt the current running task. The
> patch gives some tests results with cyclictests and hackbench to highlight
> the benefit of latency priority for short interactive task or
> long intensive tasks.
> 
> Patch [6] adds the support of latency_offset to task group by adding a
> cpu.latency field. There were discussions for the last version about
> using a real unit for the field so I decided to expose directly the
> latency offset which reflects the time up to which we can preempt when the
> value is negative, or up to which we can defer the preemption when the
> value is positive.
> The range is [-sysctl_sched_latency:sysctl_sched_latency]
> 
> Patch [7] makes sched_core taking into account the latency offset.
> 
> Patch [8] adds a rb tree to cover some corner cases where the latency
> sensitive task is preempted by high priority task (RT/DL) or fails to
> preempt them. This patch ensures that tasks will have at least a
> slice of sched_min_granularity in priority at wakeup. The patch gives
> results to show the benefit in addition to patch 5
> 
> I have also backported the patchset on a dragonboard RB3 with an android
> mainline kernel based on v5.18 for a quick test. I have used
> the TouchLatency app which is part of AOSP and described to be very good
> test to highlight jitter and jank frame sources of a system [1].
> In addition to the app, I have added some short running tasks waking-up
> regularly (to use the 8 cpus for 4 ms every 37777us) to stress the system
> without overloading it (and disabling EAS). The 1st results shows that the
> patchset helps to reduce the missed deadline frames from 5% to less than
> 0.1% when the cpu.latency of task group are set.
> 
> [1] https://source.android.com/docs/core/debug/eval_perf#touchlatency

One thing that still confuses me is whether this proposal is supposed to be the
only consumer of this interface or we still expect other users to be able to
use this hint to optimize other sources of latency in the scheduler? Last
discussion [1] raised issues on the interface and I haven't seen any
discussions on the suitability of the interface to enable future consumers.

I might have missed something. What's the current state on this?


[1] https://lwn.net/Articles/820659/


Thanks

--
Qais Yousef

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 4/8] sched/core: Add permission checks for setting the latency_nice value
  2022-09-20 14:56         ` Vincent Guittot
@ 2022-09-21 16:11           ` Vincent Guittot
  0 siblings, 0 replies; 44+ messages in thread
From: Vincent Guittot @ 2022-09-21 16:11 UTC (permalink / raw)
  To: Tim Janik
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel, parth, qais.yousef,
	chris.hyser, valentin.schneider, patrick.bellasi, David.Laight,
	pjt, pavel, tj, qperret, tim.c.chen, joshdon

On Tue, 20 Sept 2022 at 16:56, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Tue, 20 Sept 2022 at 12:18, Tim Janik <timj@gnu.org> wrote:
> >
> > Hi.
> >
> > On 19.09.22 14:41, Vincent Guittot wrote:
> > > Hi,
> > >
> > > Thanks you for describing in detail your use case.
> >
> > > Ok, Your explanation makes sense to me especially because we want to
> > > ensure to not provide more cpu time with this latency prio. I'm
> > > curious to see the feedback from others about the reason we want
> > > CAP_SYS_NICE other than following nice priority.
> > >
> > > Side question, Have you tried this patchset (minus this patch) with
> > > your use case ?
> >
> > I have now tested a modified version of the ALSA Test_latency.c program
> > that acquires latency nice as non-root:
> >    https://gist.github.com/tim-janik/88f9df5456b879ecc59da93dc6ce6be1
> >
> > With a busy but not overloaded CPU, the short time latency tests are
> > often better, measured with: ./lnice-latency -p -s 1
> >
> > But the results aren't very reliable with this test. I.e. requesting a
> > latency nice value of -20 reduces the chance for underruns somewhat but
> > doesn't eliminate them (and lnice-latency.c gives up on the first XRUN
>
> It's expected that latency nice can't fix all scheduling latency
> problems. Hard real-time constraints can only be ensured with the FIFO
> or deadline scheduler
>
> > in the given time period). It might be better to instead count the XRUN
> occurrences over a given time period.
>
> Thanks. I'm going to have a look the test

I have done some tests with your modified ALSA Test_latency.c on my
dev system. I have been able to run lnice-latency -p -s 8 -e -m 128
simultaneously with hackbench -l 20000 -g 2 on my 8-core arm64
system. The same test with the default latency nice of 0 triggers a
lot of XRUNs.

Side note: my system doesn't have many RT threads, IRQs or softirqs
running. But as explained, latency nice can't do much about those


>
> >
> >
> > --
> > Anklang Free Software DAW
> > https://anklang.testbit.eu/

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 6/8] sched/fair: Add sched group latency support
  2022-09-21 16:07         ` Qais Yousef
@ 2022-09-21 16:48           ` Tejun Heo
  2022-09-21 17:02             ` Vincent Guittot
  0 siblings, 1 reply; 44+ messages in thread
From: Tejun Heo @ 2022-09-21 16:48 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Vincent Guittot, Dietmar Eggemann, mingo, peterz, juri.lelli,
	rostedt, bsegall, mgorman, bristot, vschneid, linux-kernel,
	parth, chris.hyser, valentin.schneider, patrick.bellasi,
	David.Laight, pjt, pavel, qperret, tim.c.chen, joshdon

Hello,

On Wed, Sep 21, 2022 at 05:07:38PM +0100, Qais Yousef wrote:
> Wouldn't cpu.latency.nice be enough? I think the latency_offset is an
> implementation detail that userspace shouldn't be concerned about.

One option could be just using the same mapping as cpu.weight so that 100
maps to neutral, 10000 maps close to -20, 1 maps close to 19. It isn't great
that the value can't be interpreted in any intuitive way (e.g. a time
duration based interface would be a lot easier to grok even if it still is
best effort) but if that's what the per-task interface is gonna be, it'd be
best to keep cgroup interface in line.

As for whether a single value would fit the bill, it's again something which
should be answered for both task and cgroup based interface at the same
time. That said, my not-too-thought-through opinion is that a single value
for per-task / per-cgroup interface + system level knobs to fine tune how
that actually applies is likely enough and probably better than exposing
a host of internal details to applications directly.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 6/8] sched/fair: Add sched group latency support
  2022-09-21 16:48           ` Tejun Heo
@ 2022-09-21 17:02             ` Vincent Guittot
  2022-09-21 17:12               ` Tejun Heo
  0 siblings, 1 reply; 44+ messages in thread
From: Vincent Guittot @ 2022-09-21 17:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Qais Yousef, Dietmar Eggemann, mingo, peterz, juri.lelli,
	rostedt, bsegall, mgorman, bristot, vschneid, linux-kernel,
	parth, chris.hyser, valentin.schneider, patrick.bellasi,
	David.Laight, pjt, pavel, qperret, tim.c.chen, joshdon

On Wed, 21 Sept 2022 at 18:48, Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Wed, Sep 21, 2022 at 05:07:38PM +0100, Qais Yousef wrote:
> > Wouldn't cpu.latency.nice be enough? I think the latency_offset is an
> > implementation detail that userspace shouldn't be concerned about.
>
> One option could be just using the same mapping as cpu.weight so that 100
> maps to neutral, 10000 maps close to -20, 1 maps close to 19. It isn't great
> that the value can't be interpreted in any intuitive way (e.g. a time
> duration based interface would be a lot easier to grok even if it still is
> best effort) but if that's what the per-task interface is gonna be, it'd be
> best to keep cgroup interface in line.

I would prefer a signed range like [-1000:1000] as the behavior is
different for latency-sensitive and non-sensitive tasks, unlike
cpu.weight which reflects that a bigger value gets more

>
> As for whether a single value would fit the bill, it's again something which
> should be answered for both task and cgroup based interface at the same
> time. That said, my not-too-thought-through opinion is that a single value
> for per-task / per-cgroup interface + system level knobs to fine tune how
> that actually applies is likely enough and probably better than exposing
> a host of internal details to applications directly.
>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 6/8] sched/fair: Add sched group latency support
  2022-09-21 17:02             ` Vincent Guittot
@ 2022-09-21 17:12               ` Tejun Heo
  2022-09-22  6:40                 ` Vincent Guittot
  0 siblings, 1 reply; 44+ messages in thread
From: Tejun Heo @ 2022-09-21 17:12 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Qais Yousef, Dietmar Eggemann, mingo, peterz, juri.lelli,
	rostedt, bsegall, mgorman, bristot, vschneid, linux-kernel,
	parth, chris.hyser, valentin.schneider, patrick.bellasi,
	David.Laight, pjt, pavel, qperret, tim.c.chen, joshdon

On Wed, Sep 21, 2022 at 07:02:57PM +0200, Vincent Guittot wrote:
> > One option could be just using the same mapping as cpu.weight so that 100
> > maps to neutral, 10000 maps close to -20, 1 maps close to 19. It isn't great
> > that the value can't be interpreted in any intuitive way (e.g. a time
> > duration based interface would be a lot easier to grok even if it still is
> > best effort) but if that's what the per-task interface is gonna be, it'd be
> > best to keep cgroup interface in line.
> 
> I would prefer a signed range like [-1000:1000] as the behavior is
> different for latency-sensitive and non-sensitive tasks, unlike
> cpu.weight which reflects that a bigger value gets more

How about just sticking with .nice?

-- 
tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/8] sched/fair: Take into account latency priority at wakeup
  2022-09-20 15:49         ` Vincent Guittot
@ 2022-09-21 22:41           ` Dietmar Eggemann
  2022-09-22  7:12             ` Vincent Guittot
  0 siblings, 1 reply; 44+ messages in thread
From: Dietmar Eggemann @ 2022-09-21 22:41 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot,
	vschneid, linux-kernel, parth, qais.yousef, chris.hyser,
	valentin.schneider, patrick.bellasi, David.Laight, pjt, pavel,
	tj, qperret, tim.c.chen, joshdon

On 20/09/2022 17:49, Vincent Guittot wrote:
> On Tue, 20 Sept 2022 at 15:18, Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
>>
>> On 19/09/2022 17:39, Vincent Guittot wrote:
>>> On Mon, 19 Sept 2022 at 12:05, Dietmar Eggemann
>>> <dietmar.eggemann@arm.com> wrote:
>>>>
>>>> On 16/09/2022 10:03, Vincent Guittot wrote:

[...]

>>>>> +      * the idle thread and don't set next buddy as a candidate for being
>>>>> +      * picked in priority.
>>>>> +      * In case of simultaneous wakeup from idle, the latency sensitive tasks
>>>>> +      * lost opportunity to preempt non sensitive tasks which woke up
>>>>> +      * simultaneously.
>>>>> +      */
>>>>
>>>> The position of this comment block within this function is somehow
>>>> misleading since it describes the reason for the function rather than a
>>>> particular condition within this function. Wouldn't it be more readable
>>>> when it would be a function header comment instead?
>>>
>>> I put it after the usual early return tests to put the comment close
>>> to the useful part: the use of next buddy and __pick_first_entity()
>>
>> So you want to have the `wakeup_preempt_entity(se, pse) == 1` condition
>> from check_preempt_wakeup() also for cfs_task woken up by others.
> 
> I use the wakeup_preempt_entity(cfs_rq->next, left) < 1 in
> pick_next_entity() to pick the task with highest latency constraint
> when another class is running while waking up

That's correct. This is where you potentially pick this task since it is
the next_buddy.
All I wanted to say is that check_preempt_from_others() and its `next &&
wakeup_preempt_entity(next, se) == 1` is the counterpart of the
`wakeup_preempt_entity(se, pse) == 1` in check_preempt_wakeup() to be
able to set next_buddy even when curr is from another class than CFS.

[...]

>>>> I still don't get the rationale behind why when either one (se or curr)
>>>> of the latency_nice values is negative, we use the diff between them but
>>>> if not, we only care about se's value. Why don't you always use the diff
>>>> between se and curr? Since we have a range [-20 ... 19] why shouldn't we
>>>> use the difference between let's say se = 19 and curr = 5?
>>>> You discussed this with Tao Zhou on the v1 but I didn't understand it fully.
>>>
>>> Let's say that current has a latency nice prio of 19 and a task A with a
>>> latency nice of 10 wakes up. Both tasks don't care about scheduling
>>> latency (current more than task A). If we use the diff, the output of
>>> wakeup_latency_gran() would be negative (-10ms) which reflects the
>>> fact that the waking task is sensitive to the latency and wants to
>>> preempt current even if its vruntime is after. But obviously both
>>> current and task A don't care to preempt at wakeup.
>>
>> OK, I understand but there is a certain level of unsteadiness here.
>>
>> If p has >0 it gets treated differently in case current has >=0 and case
> 
> "If p >=0"; 0 has same behavior than [1..19]
> 
>> current has <0.

Not quite. It depends on curr. With sysctl_sched_latency = 24ms:

(1) p = 10 curr =  19 -> wakeup_latency_gran() returns 12ms

(2) p = 10 curr = -10 -> wakeup_latency_gran() returns 24ms

In (1) only p's own latency counts whereas in (2) we take the diff,

In (A) we 'punish' p even though it competes against curr which has an
even lower latency requirement than p,

>> Do we expect that tasks set their value to [1..19] in this case, when
>> the default 0 already indicates a 'don't care'?
> 
> I'm not sure that I understand your concern as [0..19] are treated in
> the same way. Only tasks (curr or se) with offset <0 need a relative
> comparison to the other. If curr and se both have a latency nice of
> -19, se should not blindly preempt curr but only if curr has already run
> for its amount of time

With p = -19 and curr = -19 we would take the diff, so 0ms.

With p = 19 and curr = 19, if we would use `latency_offset -=
curr->latency_offset` wakeup_latency_gran() would return 973/1024*24ms -
973/1024*24ms = 0ms and nothing will shift.

OTOH, in case (1) wakeup_latency_gran() would return 512/1024*24ms -
973/1024*24ms = - 10.80ms. So p would gain an advantage here instead of
a penalty.

Essentially using the full [-20 .. 19] nice scope for `se vs. curr`
comparison.
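
For reference, a sketch of the model described in this thread (illustration
only, not necessarily the exact patch code):

static long wakeup_latency_gran(struct sched_entity *curr,
                                struct sched_entity *se)
{
        long latency_offset = se->latency_offset;

        /*
         * Take curr into account only when at least one of the two
         * entities is latency sensitive (offset < 0); otherwise return
         * se's own offset, i.e. its tolerance to scheduling delay.
         */
        if (latency_offset < 0 || curr->latency_offset < 0)
                latency_offset -= curr->latency_offset;

        return latency_offset;
}

/*
 * With sysctl_sched_latency = 24ms this gives the two cases above:
 * (1) p = 10, curr =  19:  512/1024 * 24ms                      = 12ms
 * (2) p = 10, curr = -10:  512/1024 * 24ms - (-512/1024 * 24ms) = 24ms
 */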

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 6/8] sched/fair: Add sched group latency support
  2022-09-21 17:12               ` Tejun Heo
@ 2022-09-22  6:40                 ` Vincent Guittot
  2022-09-22 10:49                   ` Qais Yousef
  0 siblings, 1 reply; 44+ messages in thread
From: Vincent Guittot @ 2022-09-22  6:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Qais Yousef, Dietmar Eggemann, mingo, peterz, juri.lelli,
	rostedt, bsegall, mgorman, bristot, vschneid, linux-kernel,
	parth, chris.hyser, valentin.schneider, patrick.bellasi,
	David.Laight, pjt, pavel, qperret, tim.c.chen, joshdon

On Wed, 21 Sept 2022 at 19:12, Tejun Heo <tj@kernel.org> wrote:
>
> On Wed, Sep 21, 2022 at 07:02:57PM +0200, Vincent Guittot wrote:
> > > One option could be just using the same mapping as cpu.weight so that 100
> > > maps to neutral, 10000 maps close to -20, 1 maps close to 19. It isn't great
> > > that the value can't be interpreted in any intuitive way (e.g. a time
> > > duration based interface would be a lot easier to grok even if it still is
> > > best effort) but if that's what the per-task interface is gonna be, it'd be
> > > best to keep cgroup interface in line.
> >
> > I would prefer a signed range like [-1000:1000] as the behavior is
> > different for latency-sensitive and non-sensitive tasks, unlike
> > cpu.weight which reflects that a bigger value gets more
>
> How about just sticking with .nice?

Looks good to me. I will just implement the cpu.latency.nice

>
> --
> tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/8] sched/fair: Take into account latency priority at wakeup
  2022-09-21 22:41           ` Dietmar Eggemann
@ 2022-09-22  7:12             ` Vincent Guittot
  2022-09-22 16:50               ` Dietmar Eggemann
  0 siblings, 1 reply; 44+ messages in thread
From: Vincent Guittot @ 2022-09-22  7:12 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot,
	vschneid, linux-kernel, parth, qais.yousef, chris.hyser,
	valentin.schneider, patrick.bellasi, David.Laight, pjt, pavel,
	tj, qperret, tim.c.chen, joshdon

On Thu, 22 Sept 2022 at 00:41, Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 20/09/2022 17:49, Vincent Guittot wrote:
> > On Tue, 20 Sept 2022 at 15:18, Dietmar Eggemann
> > <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 19/09/2022 17:39, Vincent Guittot wrote:
> >>> On Mon, 19 Sept 2022 at 12:05, Dietmar Eggemann
> >>> <dietmar.eggemann@arm.com> wrote:
> >>>>
> >>>> On 16/09/2022 10:03, Vincent Guittot wrote:
>
> [...]
>
> >>>>> +      * the idle thread and don't set next buddy as a candidate for being
> >>>>> +      * picked in priority.
> >>>>> +      * In case of simultaneous wakeup from idle, the latency sensitive tasks
> >>>>> +      * lost opportunity to preempt non sensitive tasks which woke up
> >>>>> +      * simultaneously.
> >>>>> +      */
> >>>>
> >>>> The position of this comment block within this function is somehow
> >>>> misleading since it describes the reason for the function rather than a
> >>>> particular condition within this function. Wouldn't it be more readable
> >>>> when it would be a function header comment instead?
> >>>
> >>> I put it after the usual early return tests to put the comment close
> >>> to the useful part: the use of next buddy and __pick_first_entity()
> >>
> >> So you want to have the `wakeup_preempt_entity(se, pse) == 1` condition
> >> from check_preempt_wakeup() also for cfs_task woken up by others.
> >
> > I use the wakeup_preempt_entity(cfs_rq->next, left) < 1 in
> > pick_next_entity() to pick the task with highest latency constraint
> > when another class is running while waking up
>
> That's correct. This is where you potentially pick this task since it is
> the next_buddy.
> All I wanted to say is that check_preempt_from_others() and its `next &&
> wakeup_preempt_entity(next, se) == 1` is the counterpart of the
> `wakeup_preempt_entity(se, pse) == 1` in check_preempt_wakeup() to be
> able to set next_buddy even when curr is from another class than CFS.
>
> [...]
>
> >>>> I still don't get the rationale behind why when either one (se or curr)
> >>>> of the latency_nice values is negative, we use the diff between them but
> >>>> if not, we only care about se's value. Why don't you always use the diff
> >>>> between se and curr? Since we have a range [-20 ... 19] why shouldn't we
> >>>> use the difference between let's say se = 19 and curr = 5?
> >>>> You discussed this with Tao Zhou on the v1 but I didn't understand it fully.
> >>>
> >>> Let's say that current has a latency nice prio of 19 and a task A with a
> >>> latency nice of 10 wakes up. Both tasks don't care about scheduling
> >>> latency (current more than task A). If we use the diff, the output of
> >>> wakeup_latency_gran() would be negative (-10ms) which reflects the
> >>> fact that the waking task is sensitive to the latency and wants to
> >>> preempt current even if its vruntime is after. But obviously both
> >>> current and task A don't care to preempt at wakeup.
> >>
> >> OK, I understand but there is a certain level of unsteadiness here.
> >>
> >> If p has >0 it gets treated differently in case current has >=0 and case
> >
> > "If p >=0"; 0 has same behavior than [1..19]
> >
> >> current has <0.
>
> Not quite. It depends on curr. With sysctl_sched_latency = 24ms:

I thought you were speaking about priority 0 vs [1..19] as you made a
difference in your previous comment below

>
> (1) p = 10 curr =  19 -> wakeup_latency_gran() returns 12ms
>
> (2) p = 10 curr = -10 -> wakeup_latency_gran() returns 24ms
>
> In (1) only p's own latency counts whereas in (2) we take the diff,

Yes because  curr is latency sensitive in (2) whereas it's not in (1)

>
> In (A) we 'punish' p even though it competes against curr which has an
> even lower latency requirement than p,

What is (A) ?  Assuming you meant (1), having a positive nice latency
means that you don't have a latency requirement but you are tolerant to
scheduling delay, so we don't 'punish' p. P will preempt curr if we are
above the tolerance

>
> >> Do we expect that tasks set their value to [1..19] in this case, when
> >> the default 0 already indicates a 'don't care'?
> >
> > I'm not sure that I understand your concern as [0..19] are treated in
> > the same way. Only tasks (curr or se) with offset <0 need a relative
> > comparison to the other. If curr and se both have a latency nice of
> > -19, se should not blindly preempt curr but only if curr has already run
> > for its amount of time
>
> With p = -19 and curr = -19 we would take the diff, so 0ms.
>
> With p = 19 and curr = 19, if we would use `latency_offset -=
> curr->latency_offset` wakeup_latency_gran() would return 973/1024*24ms -
> 973/1024*24ms = 0ms and nothing will shift.
>
> OTOH, in case (1) wakeup_latency_gran() would return 512/1024*24ms -
> 973/1024*24ms = - 10.80ms. So p would gain an advantage here instead of
> a penalty.

And that's the whole point. A priority >= 0 means that you don't care
about scheduling delays so there is no reason to be more aggressive
with a task that is also latency tolerant. We only have to ensure that
the delay stays in the acceptable range

>
> Essentially using the full [-20 .. 19] nice scope for `se vs. curr`
> comparison.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 0/8] Add latency priority for CFS class
  2022-09-21 16:08 ` [PATCH v4 0/8] Add latency priority for CFS class Qais Yousef
@ 2022-09-22  7:19   ` Vincent Guittot
  2022-09-22 11:00     ` Qais Yousef
  0 siblings, 1 reply; 44+ messages in thread
From: Vincent Guittot @ 2022-09-22  7:19 UTC (permalink / raw)
  To: Qais Yousef
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel, parth, chris.hyser,
	valentin.schneider, patrick.bellasi, David.Laight, pjt, pavel,
	tj, qperret, tim.c.chen, joshdon

On Wed, 21 Sept 2022 at 18:08, Qais Yousef <qais.yousef@arm.com> wrote:
>
> Hi Vincent
>
> On 09/16/22 10:02, Vincent Guittot wrote:
> > This patchset restarts the work about adding a latency priority to describe
> > the latency tolerance of cfs tasks.
> >
> > The patches [1-4] have been done by Parth:
> > https://lore.kernel.org/lkml/20200228090755.22829-1-parth@linux.ibm.com/
> >
> > I have just rebased and moved the set of latency priority outside the
> > priority update. I have removed the reviewed tag because the patches
> > are 2 years old.
> >
> > The patch [5] uses latency nice priority to define a latency offset
> > and then to decide if a cfs task can preempt the current running task. The
> > patch gives some tests results with cyclictests and hackbench to highlight
> > the benefit of latency priority for short interactive task or
> > long intensive tasks.
> >
> > Patch [6] adds the support of latency_offset to task group by adding a
> > cpu.latency field. There were discussions for the last version about
> > using a real unit for the field so I decided to expose directly the
> > latency offset which reflects the time up to which we can preempt when the
> > value is negative, or up to which we can defer the preemption when the
> > value is positive.
> > The range is [-sysctl_sched_latency:sysctl_sched_latency]
> >
> > Patch [7] makes sched_core taking into account the latency offset.
> >
> > Patch [8] adds a rb tree to cover some corner cases where the latency
> > sensitive task is preempted by high priority task (RT/DL) or fails to
> > preempt them. This patch ensures that tasks will have at least a
> > slice of sched_min_granularity in priority at wakeup. The patch gives
> > results to show the benefit in addition to patch 5
> >
> > I have also backported the patchset on a dragonboard RB3 with an android
> > mainline kernel based on v5.18 for a quick test. I have used
> > the TouchLatency app which is part of AOSP and described to be very good
> > test to highlight jitter and jank frame sources of a system [1].
> > In addition to the app, I have added some short running tasks waking-up
> > regularly (to use the 8 cpus for 4 ms every 37777us) to stress the system
> > without overloading it (and disabling EAS). The 1st results shows that the
> > patchset helps to reduce the missed deadline frames from 5% to less than
> > 0.1% when the cpu.latency of task group are set.
> >
> > [1] https://source.android.com/docs/core/debug/eval_perf#touchlatency
>
> One thing that still confuses me is whether this proposal is supposed to be the
> only consumer of this interface or we still expect other users to be able to
> use this hint to optimize other sources of latency in the scheduler? Last
> discussion [1] raised issues on the interface and I haven't seen any
> discussions on the suitability of the interface to enable future consumers.
>
> I might have missed something. What's the current state on this?

Nothing has changed since the discussion in March:
https://lore.kernel.org/lkml/CAKfTPtBCKyqa-472Z7LtiWTq+Yirq6=jSrkzJtNbkdKXnwP-mA@mail.gmail.com/T/

>
>
> [1] https://lwn.net/Articles/820659/
>
>
> Thanks
>
> --
> Qais Yousef

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 6/8] sched/fair: Add sched group latency support
  2022-09-22  6:40                 ` Vincent Guittot
@ 2022-09-22 10:49                   ` Qais Yousef
  0 siblings, 0 replies; 44+ messages in thread
From: Qais Yousef @ 2022-09-22 10:49 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Tejun Heo, Dietmar Eggemann, mingo, peterz, juri.lelli, rostedt,
	bsegall, mgorman, bristot, vschneid, linux-kernel, parth,
	chris.hyser, valentin.schneider, patrick.bellasi, David.Laight,
	pjt, pavel, qperret, tim.c.chen, joshdon

On 09/22/22 08:40, Vincent Guittot wrote:
> On Wed, 21 Sept 2022 at 19:12, Tejun Heo <tj@kernel.org> wrote:
> >
> > On Wed, Sep 21, 2022 at 07:02:57PM +0200, Vincent Guittot wrote:
> > > > One option could be just using the same mapping as cpu.weight so that 100
> > > > maps to neutral, 10000 maps close to -20, 1 maps close to 19. It isn't great
> > > > that the value can't be interpreted in any intuitive way (e.g. a time
> > > > duration based interface would be a lot easier to grok even if it still is
> > > > best effort) but if that's what the per-task interface is gonna be, it'd be
> > > > best to keep cgroup interface in line.
> > >
> > > I would prefer a signed range like [-1000:1000] as the behavior is
> > > different for latency-sensitive and non-sensitive tasks, unlike
> > > cpu.weight which reflects that a bigger value gets more
> >
> > How about just sticking with .nice?
> 
> Looks good to me. I will just implement the cpu.latency.nice

+1

Keeping both interfaces exposing the same thing would make the most sense IMHO.

Though it begs the question: how should the per-task and cgroup interfaces
interact?

For example, one of the proposed use cases was to use this knob to control how
hard we search for the best cpu in the load balancer (IIRC). If a task sets
latency_nice to -19 but is attached to a cgroup that has cpu.latency.nice set
to 20, how should the new consumer (load balancer) interpret the effective
value for the task?

IIUC, the current use case doesn't care about an effective value as it
considers the group and the task as separate entities. But other paths like
the above wouldn't see this separation and would want to know which of the
two to consider.
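
Purely as an illustration of that question (none of the names below exist in
the patchset, and this is only one possible policy): a consumer that wanted a
single per-task value could let the group setting cap how latency sensitive a
task can claim to be, e.g.

static int effective_latency_nice(struct task_struct *p)
{
        int task_nice  = p->latency_nice;
        int group_nice = task_group(p)->latency_nice; /* hypothetical field */

        /* higher nice means more tolerant: the group caps the sensitivity */
        return max(task_nice, group_nice);
}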

We should update Documentation/admin-guide/cgroup-v2.rst with these details for
cpu.latency.nice.


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 0/8] Add latency priority for CFS class
  2022-09-22  7:19   ` Vincent Guittot
@ 2022-09-22 11:00     ` Qais Yousef
  2022-09-22 13:03       ` Vincent Guittot
  0 siblings, 1 reply; 44+ messages in thread
From: Qais Yousef @ 2022-09-22 11:00 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel, parth, chris.hyser,
	valentin.schneider, patrick.bellasi, David.Laight, pjt, pavel,
	tj, qperret, tim.c.chen, joshdon

On 09/22/22 09:19, Vincent Guittot wrote:
> On Wed, 21 Sept 2022 at 18:08, Qais Yousef <qais.yousef@arm.com> wrote:
> >
> > Hi Vincent
> >
> > On 09/16/22 10:02, Vincent Guittot wrote:
> > > This patchset restarts the work about adding a latency priority to describe
> > > the latency tolerance of cfs tasks.
> > >
> > > The patches [1-4] have been done by Parth:
> > > https://lore.kernel.org/lkml/20200228090755.22829-1-parth@linux.ibm.com/
> > >
> > > I have just rebased and moved the set of latency priority outside the
> > > priority update. I have removed the reviewed tag because the patches
> > > are 2 years old.
> > >
> > > The patch [5] uses latency nice priority to define a latency offset
> > > and then to decide if a cfs task can preempt the current running task. The
> > > patch gives some tests results with cyclictests and hackbench to highlight
> > > the benefit of latency priority for short interactive task or
> > > long intensive tasks.
> > >
> > > Patch [6] adds the support of latency_offset to task group by adding a
> > > cpu.latency field. There were discussions for the last version about
> > > using a real unit for the field so I decided to expose directly the
> > > latency offset which reflects the time up to which we can preempt when the
> > > value is negative, or up to which we can defer the preemption when the
> > > value is positive.
> > > The range is [-sysctl_sched_latency:sysctl_sched_latency]
> > >
> > > Patch [7] makes sched_core taking into account the latency offset.
> > >
> > > Patch [8] adds a rb tree to cover some corner cases where the latency
> > > sensitive task is preempted by high priority task (RT/DL) or fails to
> > > preempt them. This patch ensures that tasks will have at least a
> > > slice of sched_min_granularity in priority at wakeup. The patch gives
> > > results to show the benefit in addition to patch 5
> > >
> > > I have also backported the patchset on a dragonboard RB3 with an android
> > > mainline kernel based on v5.18 for a quick test. I have used
> > > the TouchLatency app which is part of AOSP and described to be very good
> > > test to highlight jitter and jank frame sources of a system [1].
> > > In addition to the app, I have added some short running tasks waking-up
> > > regularly (to use the 8 cpus for 4 ms every 37777us) to stress the system
> > > without overloading it (and disabling EAS). The 1st results shows that the
> > > patchset helps to reduce the missed deadline frames from 5% to less than
> > > 0.1% when the cpu.latency of task group are set.
> > >
> > > [1] https://source.android.com/docs/core/debug/eval_perf#touchlatency
> >
> > One thing that still confuses me is whether this proposal is supposed to be the
> > only consumer of this interface or we still expect other users to be able to
> > use this hint to optimize other sources of latency in the scheduler? Last
> > discussion [1] raised issues on the interface and I haven't seen any
> > discussions on the suitability of the interface to enable future consumers.
> >
> > I might have missed something. What's the current state on this?
> 
> Nothing has changed since the discussion in March:
> https://lore.kernel.org/lkml/CAKfTPtBCKyqa-472Z7LtiWTq+Yirq6=jSrkzJtNbkdKXnwP-mA@mail.gmail.com/T/

Okay. For my peace of mind, could you update the cover letter please to be more
explicit that this is only one use case of potential others to come later in
the future? The other two that I remember are improving the load balancer
search and modifying EAS behavior. Parth had a use case to help take advantage
of turbo frequencies, but I'm not sure if this is still being pursued/desired.

Your proposed use case could actually help make the EAS one unnecessary. But it
is hard to tell at this stage and I prefer to continue to consider it as
a potential new consumer of this interface.


Thanks!

--
Qais Yousef

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 0/8] Add latency priority for CFS class
  2022-09-22 11:00     ` Qais Yousef
@ 2022-09-22 13:03       ` Vincent Guittot
  0 siblings, 0 replies; 44+ messages in thread
From: Vincent Guittot @ 2022-09-22 13:03 UTC (permalink / raw)
  To: Qais Yousef
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel, parth, chris.hyser,
	valentin.schneider, patrick.bellasi, David.Laight, pjt, pavel,
	tj, qperret, tim.c.chen, joshdon

On Thu, 22 Sept 2022 at 13:00, Qais Yousef <qais.yousef@arm.com> wrote:
>
> On 09/22/22 09:19, Vincent Guittot wrote:
> > On Wed, 21 Sept 2022 at 18:08, Qais Yousef <qais.yousef@arm.com> wrote:
> > >
> > > Hi Vincent
> > >
> > > On 09/16/22 10:02, Vincent Guittot wrote:
> > > > This patchset restarts the work about adding a latency priority to describe
> > > > the latency tolerance of cfs tasks.
> > > >
> > > > The patches [1-4] have been done by Parth:
> > > > https://lore.kernel.org/lkml/20200228090755.22829-1-parth@linux.ibm.com/
> > > >
> > > > I have just rebased and moved the set of latency priority outside the
> > > > priority update. I have removed the reviewed tag because the patches
> > > > are 2 years old.
> > > >
> > > > The patch [5] uses latency nice priority to define a latency offset
> > > > and then to decide if a cfs task can preempt the current running task. The
> > > > patch gives some tests results with cyclictests and hackbench to highlight
> > > > the benefit of latency priority for short interactive task or
> > > > long intensive tasks.
> > > >
> > > > Patch [6] adds the support of latency_offset to task group by adding a
> > > > cpu.latency field. There were discussions for the last version about
> > > > using a real unit for the field so I decided to expose directly the
> > > > latency offset which reflects the time up to which we can preempt when the
> > > > value is negative, or up to which we can defer the preemption when the
> > > > value is positive.
> > > > The range is [-sysctl_sched_latency:sysctl_sched_latency]
> > > >
> > > > Patch [7] makes sched_core taking into account the latency offset.
> > > >
> > > > Patch [8] adds a rb tree to cover some corner cases where the latency
> > > > sensitive task is preempted by high priority task (RT/DL) or fails to
> > > > preempt them. This patch ensures that tasks will have at least a
> > > > slice of sched_min_granularity in priority at wakeup. The patch gives
> > > > results to show the benefit in addition to patch 5
> > > >
> > > > I have also backported the patchset on a dragonboard RB3 with an android
> > > > mainline kernel based on v5.18 for a quick test. I have used
> > > > the TouchLatency app which is part of AOSP and described to be very good
> > > > test to highlight jitter and jank frame sources of a system [1].
> > > > In addition to the app, I have added some short running tasks waking-up
> > > > regularly (to use the 8 cpus for 4 ms every 37777us) to stress the system
> > > > without overloading it (and disabling EAS). The 1st results shows that the
> > > > patchset helps to reduce the missed deadline frames from 5% to less than
> > > > 0.1% when the cpu.latency of task group are set.
> > > >
> > > > [1] https://source.android.com/docs/core/debug/eval_perf#touchlatency
> > >
> > > One thing that still confuses me is whether this proposal is supposed to be the
> > > only consumer of this interface or we still expect other users to be able to
> > > use this hint to optimize other sources of latency in the scheduler? Last
> > > discussion [1] raised issues on the interface and I haven't seen any
> > > discussions on the suitability of the interface to enable future consumers.
> > >
> > > I might have missed something. What's the current state on this?
> >
> > Nothing has changed since the discussion in March:
> > https://lore.kernel.org/lkml/CAKfTPtBCKyqa-472Z7LtiWTq+Yirq6=jSrkzJtNbkdKXnwP-mA@mail.gmail.com/T/
>
> Okay. For my peace of mind, could you update the cover letter please to be more
> explicit that this is only one use case of potential others to come later in

I thought that the reference to Parth's patchset at the beginning of the
cover letter was enough to highlight that there were other possible
uses of the interface. But I can mention that this is not the only use
of the interface that can be made in the scheduler, and that other
places could also take advantage of this new hint.

> the future? The other 2 that I remember are improve load balancer search and
> modify EAS behavior. Parth had a use case to help take advantage of turbo
> frequencies, but not sure if this is still being pursued/desired.
>
> Your proposed use case could actually help make the EAS one unnecessary. But it
> is hard to tell at this stage and prefer to continue to consider it as
> a potential new consumer of this interface.
>
>
> Thanks!
>
> --
> Qais Yousef

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/8] sched/fair: Take into account latency priority at wakeup
  2022-09-22  7:12             ` Vincent Guittot
@ 2022-09-22 16:50               ` Dietmar Eggemann
  2022-09-23  6:01                 ` Vincent Guittot
  0 siblings, 1 reply; 44+ messages in thread
From: Dietmar Eggemann @ 2022-09-22 16:50 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot,
	vschneid, linux-kernel, parth, qais.yousef, chris.hyser,
	valentin.schneider, patrick.bellasi, David.Laight, pjt, pavel,
	tj, qperret, tim.c.chen, joshdon

On 22/09/2022 09:12, Vincent Guittot wrote:
> On Thu, 22 Sept 2022 at 00:41, Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
>>
>> On 20/09/2022 17:49, Vincent Guittot wrote:
>>> On Tue, 20 Sept 2022 at 15:18, Dietmar Eggemann
>>> <dietmar.eggemann@arm.com> wrote:
>>>>
>>>> On 19/09/2022 17:39, Vincent Guittot wrote:
>>>>> On Mon, 19 Sept 2022 at 12:05, Dietmar Eggemann
>>>>> <dietmar.eggemann@arm.com> wrote:
>>>>>>
>>>>>> On 16/09/2022 10:03, Vincent Guittot wrote:

[...]

> I thought you were speaking about priority 0 vs [1..19] as you made a
> difference in your previous comment below
> 
>>
>> (1) p = 10 curr =  19 -> wakeup_latency_gran() returns 12ms
>>
>> (2) p = 10 curr = -10 -> wakeup_latency_gran() returns 24ms
>>
>> In (1) only p's own latency counts whereas in (2) we take the diff,
> 
> Yes because  curr is latency sensitive in (2) whereas it's not in (1)
> 
>>
>> In (A) we 'punish' p even though it competes against curr which has an
>> even lower latency requirement than p,
> 
> What is (A) ?  Assuming you meant (1), having a positive nice latency

Sorry, yes I meant (1).

> means that you don't have a latency requirement but you are tolerant to
> scheduling delay, so we don't 'punish' p. P will preempt curr if we are
> above the tolerance

wakeup_preempt_entity() {

    vdiff = curr->vruntime - se->vruntime

    vdiff -= wakeup_latency_gran(curr, se)   <-- (3)

    if (vdiff <= 0)
        return -1;

    ...
}

Wouldn't it be more suitable to return 0 from wakeup_latency_gran() if
both have latency_nice >=0 in this case instead of se->latency_offset?

By `punish` I mean that vdiff (3) gets smaller in case we return (the
positive) `se->latency_offset` even when `latency nice of curr` > `latency
nice of p`.

[...]

>> With p = -19 and curr = -19 we would take the diff, so 0ms.
>>
>> With p = 19 and curr = 19, if we would use `latency_offset -=
>> curr->latency_offset` wakeup_latency_gran() would return 973/1024*24ms -
>> 973/1024*24ms = 0ms and nothing will shift.
>>
>> OTOH, in case (1) wakeup_latency_gran() would return 512/1024*24ms -
>> 973/1024*24ms = - 10.80ms. So p would gain an advantage here instead of
>> a penalty.
> 
> And that's the whole point. A priority >= 0 means that you don't care
> about scheduling delays so there is no reason to be more aggressive
> with a task that is also latency tolerant. We only have to ensure that
> the delay stays in the acceptable range

OK, I understand your model here but I'm still not convinced. Might be
interesting to hear what others think.
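
For completeness, one reading of the alternative suggested above, as a sketch
(not actual patch code):

static long wakeup_latency_gran(struct sched_entity *curr,
                                struct sched_entity *se)
{
        /* two latency-tolerant entities are neutral w.r.t. each other */
        if (curr->latency_offset >= 0 && se->latency_offset >= 0)
                return 0;

        /* otherwise always compare the two offsets */
        return se->latency_offset - curr->latency_offset;
}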

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/8] sched/fair: Take into account latency priority at wakeup
  2022-09-22 16:50               ` Dietmar Eggemann
@ 2022-09-23  6:01                 ` Vincent Guittot
  0 siblings, 0 replies; 44+ messages in thread
From: Vincent Guittot @ 2022-09-23  6:01 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: mingo, peterz, juri.lelli, rostedt, bsegall, mgorman, bristot,
	vschneid, linux-kernel, parth, qais.yousef, chris.hyser,
	valentin.schneider, patrick.bellasi, David.Laight, pjt, pavel,
	tj, qperret, tim.c.chen, joshdon

On Thu, 22 Sept 2022 at 18:50, Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 22/09/2022 09:12, Vincent Guittot wrote:
> > On Thu, 22 Sept 2022 at 00:41, Dietmar Eggemann
> > <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 20/09/2022 17:49, Vincent Guittot wrote:
> >>> On Tue, 20 Sept 2022 at 15:18, Dietmar Eggemann
> >>> <dietmar.eggemann@arm.com> wrote:
> >>>>
> >>>> On 19/09/2022 17:39, Vincent Guittot wrote:
> >>>>> On Mon, 19 Sept 2022 at 12:05, Dietmar Eggemann
> >>>>> <dietmar.eggemann@arm.com> wrote:
> >>>>>>
> >>>>>> On 16/09/2022 10:03, Vincent Guittot wrote:
>
> [...]
>
> > I thought you were speaking about priority 0 vs [1..19] as you made a
> > difference in your previous comment below
> >
> >>
> >> (1) p = 10 curr =  19 -> wakeup_latency_gran() returns 12ms
> >>
> >> (2) p = 10 curr = -10 -> wakeup_latency_gran() returns 24ms
> >>
> >> In (1) only p's own latency counts whereas in (2) we take the diff,
> >
> > Yes because  curr is latency sensitive in (2) whereas it's not in (1)
> >
> >>
> >> In (A) we 'punish' p even though it competes against curr which has an
> >> even lower latency requirement than p,
> >
> > What is (A) ?  Assuming you meant (1), having a positive nice latency
>
> Sorry, yes I meant (1).
>
> > means that you don't have a latency requirement but you are tolerant to
> > scheduling delay, so we don't 'punish' p. P will preempt curr if we are
> > above the tolerance
>
> wakeup_preempt_entity() {
>
>     vdiff = curr->vruntime - se->vruntime
>
>     vdiff -= wakeup_latency_gran(curr, se)   <-- (3)
>
>     if (vdiff <= 0)
>         return -1;
>
>     ...
> }
>
> Wouldn't it be more suitable to return 0 from wakeup_latency_gran() if
> both have latency_nice >=0 in this case instead of se->latency_offset?

No, because that defeats the whole purpose of being tolerant to
scheduling delay and not trying to preempt current as usual at
wakeup. In this case there would be no difference compared to not
setting the latency nice value.

>
> By `punish` I mean that vdiff (3) gets smaller in case we return (the
> positive) `se->latency_offset` even when `latency nice of curr` > `latency
> nice of p`.
>
> [...]
>
> >> With p = -19 and curr = -19 we would take the diff, so 0ms.
> >>
> >> With p = 19 and curr = 19, if we would use `latency_offset -=
> >> curr->latency_offset` wakeup_latency_gran() would return 973/1024*24ms -
> >> 973/1024*24ms = 0ms and nothing will shift.
> >>
> >> OTOH, in case (1) wakeup_latency_gran() would return 512/1024*24ms -
> >> 973/1024*24ms = - 10.80ms. So p would gain an advantage here instead of
> >> a penalty.
> >
> > And that's the whole point. A priority >= 0 means that you don't care
> > about scheduling delays so there is no reason to be more aggressive
> > with a task that is also latency tolerant. We only have to ensure that
> > the delay stays in the acceptable range
>
> OK, I understand you model here but I'm still not convinced. Might be
> interesting to hear what others think.

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2022-09-23  6:02 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-16  8:02 [PATCH v4 0/8] Add latency priority for CFS class Vincent Guittot
2022-09-16  8:02 ` [PATCH v4 1/8] sched: Introduce latency-nice as a per-task attribute Vincent Guittot
2022-09-16  8:02 ` [PATCH v4 2/8] sched/core: Propagate parent task's latency requirements to the child task Vincent Guittot
2022-09-16  8:03 ` [PATCH v4 3/8] sched: Allow sched_{get,set}attr to change latency_nice of the task Vincent Guittot
2022-09-16  8:03 ` [PATCH v4 4/8] sched/core: Add permission checks for setting the latency_nice value Vincent Guittot
2022-09-19  8:52   ` timj
2022-09-19  8:52     ` timj
2022-09-19 12:41     ` Vincent Guittot
2022-09-20 10:18       ` Tim Janik
2022-09-20 14:56         ` Vincent Guittot
2022-09-21 16:11           ` Vincent Guittot
2022-09-16  8:03 ` [PATCH v4 5/8] sched/fair: Take into account latency priority at wakeup Vincent Guittot
     [not found]   ` <20220916120245.2951-1-hdanton@sina.com>
2022-09-16 13:36     ` Vincent Guittot
     [not found]       ` <20220917225819.817-1-hdanton@sina.com>
2022-09-18 10:46         ` Vincent Guittot
     [not found]           ` <20220920113238.1176-1-hdanton@sina.com>
2022-09-20 15:17             ` Vincent Guittot
2022-09-19 10:05   ` Dietmar Eggemann
2022-09-19 15:39     ` Vincent Guittot
2022-09-20 13:18       ` Dietmar Eggemann
2022-09-20 15:49         ` Vincent Guittot
2022-09-21 22:41           ` Dietmar Eggemann
2022-09-22  7:12             ` Vincent Guittot
2022-09-22 16:50               ` Dietmar Eggemann
2022-09-23  6:01                 ` Vincent Guittot
2022-09-16  8:03 ` [PATCH v4 6/8] sched/fair: Add sched group latency support Vincent Guittot
2022-09-19 11:55   ` Dietmar Eggemann
2022-09-19 15:49     ` Vincent Guittot
2022-09-19 17:34       ` Tejun Heo
2022-09-20  7:03         ` Vincent Guittot
2022-09-21 16:07         ` Qais Yousef
2022-09-21 16:48           ` Tejun Heo
2022-09-21 17:02             ` Vincent Guittot
2022-09-21 17:12               ` Tejun Heo
2022-09-22  6:40                 ` Vincent Guittot
2022-09-22 10:49                   ` Qais Yousef
2022-09-20 18:17       ` Dietmar Eggemann
2022-09-21  7:48         ` Vincent Guittot
2022-09-19 17:34     ` Tejun Heo
2022-09-20  7:02       ` Vincent Guittot
2022-09-16  8:03 ` [PATCH v4 7/8] sched/core: support latency priority with sched core Vincent Guittot
2022-09-16  8:03 ` [PATCH v4 8/8] sched/fair: Add latency list Vincent Guittot
2022-09-21 16:08 ` [PATCH v4 0/8] Add latency priority for CFS class Qais Yousef
2022-09-22  7:19   ` Vincent Guittot
2022-09-22 11:00     ` Qais Yousef
2022-09-22 13:03       ` Vincent Guittot
