* [PATCH v5 0/7] SCHED_DEADLINE server infrastructure
@ 2023-11-04 10:59 Daniel Bristot de Oliveira
  2023-11-04 10:59 ` [PATCH v5 1/7] sched: Unify runtime accounting across classes Daniel Bristot de Oliveira
                   ` (8 more replies)
  0 siblings, 9 replies; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-11-04 10:59 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, bristot, Phil Auld

This is v5 of Peter's SCHED_DEADLINE server infrastructure
implementation [1].

SCHED_DEADLINE servers can help fix starvation issues of low priority
tasks (e.g., SCHED_OTHER) when higher priority tasks monopolize CPU
cycles. Today we have RT Throttling; DEADLINE servers should be able to
replace and improve that.

In v1 there was a discussion about the consequences of using deadline-based
servers on fixed-priority workloads. As a demonstration, here is the
baseline timerlat scheduling latency as-is, with a kernel build as
background workload:

 # rtla timerlat top -u -d 10m

  --------------------- %< ------------------------
                                     Timer Latency
  0 01:42:24   |          IRQ Timer Latency (us)        |         Thread Timer Latency (us)      |    Ret user Timer Latency (us)
CPU COUNT      |      cur       min       avg       max |      cur       min       avg       max |      cur       min       avg       max
  0 #6143559   |        0         0         0        92 |        2         1         3        98 |        4         1         5       100
  1 #6143559   |        1         0         0        97 |        7         1         5       101 |        9         1         7       103
  2 #6143559   |        0         0         0        88 |        3         1         5        95 |        5         1         7        99
  3 #6143559   |        0         0         0        90 |        6         1         5       103 |       10         1         7       126
  4 #6143558   |        1         0         0        81 |        7         1         4        86 |        9         1         7        90
  5 #6143558   |        0         0         0        74 |        3         1         5        79 |        4         1         7        83
  6 #6143558   |        0         0         0        83 |        2         1         5        89 |        3         0         7       108
  7 #6143558   |        0         0         0        85 |        3         1         4       126 |        5         1         6       137
  --------------------- >% ------------------------

And this is the same test with the DL server activating without any delay:
  --------------------- %< ------------------------
  0 00:10:01   |          IRQ Timer Latency (us)        |         Thread Timer Latency (us)      |    Ret user Timer Latency (us)
CPU COUNT      |      cur       min       avg       max |      cur       min       avg       max |      cur       min       avg       max
  0 #579147    |        0         0         0        54 |        2         1        52     61095 |        2         2        56     61102
  1 #578766    |        0         0         0        83 |        2         1        49     55824 |        3         2        53     55831
  2 #578559    |        0         0         1        59 |        2         1        50     55760 |        3         2        54     55770
  3 #578318    |        0         0         0        76 |        2         1        49     55751 |        3         2        54     55760
  4 #578611    |        0         0         0        64 |        2         1        49     55811 |        3         2        53     55820
  5 #578347    |        0         0         1        40 |        2         1        50     56121 |        3         2        55     56133
  6 #578938    |        0         0         1        75 |        2         1        49     55755 |        3         2        53     55764
  7 #578631    |        0         0         1        36 |        3         1        51     55528 |        4         2        55     55541
  --------------------- >% ------------------------

The problem with a DL-server-only implementation is that FIFO tasks might
suffer preemption from NORMAL tasks even when spare CPU cycles are available.
In fact, the fair deadline server is enqueued right away when NORMAL tasks
wake up, and they are first scheduled by the server, thus potentially
preempting a well-behaving FIFO task. This is of course not ideal.

We had discussions about it, and one of the possibilities would be
using a different scheduling algorithm for this. But IMHO that is
overkill.

Juri and I discussed this and thought about delaying the server
activation to the 0-lag time, thus enabling the server only if the
fair scheduler is about to starve.

Patch 6/7 adds the possibility to defer the server start to the
(absolute deadline - runtime) point in time. This is achieved by
enqueuing the dl server throttled, with the next replenishment time
set to activate the server at (absolute deadline - runtime).

Differently from v4, the server is now enqueued with its runtime
replenished. As the fair scheduler runs without boost, its runtime
is consumed. If the fair server consumes its runtime before the
0-laxity time, a new period is set, and the timer is armed for the new
(deadline - runtime).
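
For illustration, here is a minimal sketch of that arithmetic (the helper
below is made up for this cover letter; it is not part of the patches):

	/*
	 * Illustrative only: the deferred ("zero-laxity") activation point
	 * of the server is the latest instant at which the reserved runtime
	 * still fits entirely before the absolute deadline. The server is
	 * enqueued throttled, and its replenishment timer fires here.
	 */
	static u64 fair_server_defer_point(struct sched_dl_entity *dl_se)
	{
		return dl_se->deadline - dl_se->runtime;
	}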

Patch 7/7 adds a per-rq interface for the knobs:
	fair_server_runtime (950 ms)
	fair_server_period  (1s)
	fair_server_defer   (enabled)

With defer enabled on CPUs [0:3], the results get better, showing a
behavior similar to the one we have with RT throttling.

  --------------------- %< ------------------------
                                     Timer Latency                                                                                       
  0 00:10:01   |          IRQ Timer Latency (us)        |         Thread Timer Latency (us)      |    Ret user Timer Latency (us)
CPU COUNT      |      cur       min       avg       max |      cur       min       avg       max |      cur       min       avg       max
  0 #599979    |        0         0         0        64 |        4         1         4        67 |        6         1         5        69
  1 #599979    |        0         0         1        17 |        6         1         5        50 |       10         2         7        71
  2 #599984    |        1         0         1        22 |        4         1         5        78 |        5         2         7       107
  3 #599986    |        0         0         1        72 |        7         1         5        79 |       10         2         7        82
  4 #581580    |        1         0         1        37 |        6         1        38     52797 |       10         2        41     52805
  5 #583270    |        1         0         1        41 |        9         1        36     52617 |       12         2        38     52623
  6 #581240    |        0         0         1        25 |        7         1        39     52870 |       11         2        41     52876
  7 #581208    |        0         0         1        69 |        6         1        39     52917 |        9         2        41     52923
  --------------------- >% ------------------------

Here are some osnoise measurements, with osnoise threads running as FIFO:1 in
different setups (defer enabled):
 - CPU 2 isolated
 - CPU 3 isolated shared with a CFS busy loop task
 - CPU 8 non-isolated
 - CPU 9 non-isolated shared with a CFS busy loop task

  --------------------- %< ------------------------
 ~# pgrep ktimer | while read pid; do chrt -p -f 2 $pid; done # for RT kernel
 ~# sysctl kernel.sched_rt_runtime_us=-1
 ~# tuna  isolate -c 2
 ~# tuna  isolate -c 3
 ~# taskset -c 3 ./f &
 ~# taskset -c 9 ./f &
 ~# osnoise -P f:1 -c 2,3,8,9 -T 1 -d 10m -H 1
                                          Operating System Noise
duration:   0 00:10:00 | time is in us
CPU Period       Runtime        Noise  % CPU Aval   Max Noise   Max Single          HW          NMI          IRQ      Softirq       Thread
  2 #599       599000000          178    99.99997          18            2           0            0          270            0            0
  3 #598       598054434     31351553    94.75774      104442       104442           0            0      2837523            0         1794
  8 #599       599000001       567456    99.90526        3260         2375           2           89       620490            0        13539
  9 #598       598021196     31742537    94.69207       71707        53357           0           90      3411023            0         1762
   --------------------- >% ------------------------

The system runs fine!
	- no crashes (famous last words)
	- the FIFO property is kept
	- per-CPU interface, because it is more flexible - and to detach this from
	  the throttling concept.

Global is broken, but it will go to /dev/null.

TODO:
  - Move the rt throttling code to RT_GROUP_SCHED for now (then send it to the
    same place as global).

Changes from V4:
  - Enable the server when nr fair tasks is > 0 (peter)
  - Consume runtime if the zerolax server is not boosted (peterz)
  - Adjust interface to deal with admission control (peterz)
  - Rebased to 6.6
Changes from V3:
  - Add the defer server (Daniel)
  - Add a per-rq interface (Daniel, with Peter's feedback)
  - Add an option to not defer the server (for Joel)
  - Typos and 1-liner fixes (Valentin, Luca, Peter)
  - Fair scheduler running on a dl server does not account as an RT task (Daniel)
  - Changed the condition to enable the server (RT & fair tasks) (Daniel)
Changes from v2:
  - Refactor/rephrase/typo fixes
  - Deferrable server using throttling
  - The server starts when RT && Fair tasks are enqueued
  - Interface with runtime/period/defer option
Changes from v1:
  - rebased on 6.4-rc1 tip/sched/core

Daniel Bristot de Oliveira (2):
  sched/deadline: Deferrable dl server
  sched/fair: Fair server interface

Peter Zijlstra (5):
  sched: Unify runtime accounting across classes
  sched/deadline: Collect sched_dl_entity initialization
  sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity
  sched/deadline: Introduce deadline servers
  sched/fair: Add trivial fair server

 include/linux/sched.h    |  26 +-
 kernel/sched/core.c      |  23 +-
 kernel/sched/deadline.c  | 671 ++++++++++++++++++++++++++++-----------
 kernel/sched/debug.c     | 202 ++++++++++++
 kernel/sched/fair.c      |  87 ++++-
 kernel/sched/rt.c        |  15 +-
 kernel/sched/sched.h     |  56 +++-
 kernel/sched/stop_task.c |  13 +-
 8 files changed, 847 insertions(+), 246 deletions(-)

-- 
2.40.1



* [PATCH v5 1/7] sched: Unify runtime accounting across classes
  2023-11-04 10:59 [PATCH v5 0/7] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
@ 2023-11-04 10:59 ` Daniel Bristot de Oliveira
  2023-11-15  9:04   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2023-11-04 10:59 ` [PATCH v5 2/7] sched/deadline: Collect sched_dl_entity initialization Daniel Bristot de Oliveira
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-11-04 10:59 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, bristot, Phil Auld

From: Peter Zijlstra <peterz@infradead.org>

All classes use sched_entity::exec_start to track runtime and have
copies of the exact same code around to compute runtime.

Collapse all that.

Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
---
 include/linux/sched.h    |  2 +-
 kernel/sched/deadline.c  | 15 +++--------
 kernel/sched/fair.c      | 57 ++++++++++++++++++++++++++++++----------
 kernel/sched/rt.c        | 15 +++--------
 kernel/sched/sched.h     | 12 ++-------
 kernel/sched/stop_task.c | 13 +--------
 6 files changed, 53 insertions(+), 61 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 12ec109ce8c9..31eee8b03dcd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -523,7 +523,7 @@ struct sched_statistics {
 	u64				block_max;
 	s64				sum_block_runtime;
 
-	u64				exec_max;
+	s64				exec_max;
 	u64				slice_max;
 
 	u64				nr_migrations_cold;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index b28114478b82..de79719c63c0 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1275,9 +1275,8 @@ static void update_curr_dl(struct rq *rq)
 {
 	struct task_struct *curr = rq->curr;
 	struct sched_dl_entity *dl_se = &curr->dl;
-	u64 delta_exec, scaled_delta_exec;
+	s64 delta_exec, scaled_delta_exec;
 	int cpu = cpu_of(rq);
-	u64 now;
 
 	if (!dl_task(curr) || !on_dl_rq(dl_se))
 		return;
@@ -1290,21 +1289,13 @@ static void update_curr_dl(struct rq *rq)
 	 * natural solution, but the full ramifications of this
 	 * approach need further study.
 	 */
-	now = rq_clock_task(rq);
-	delta_exec = now - curr->se.exec_start;
-	if (unlikely((s64)delta_exec <= 0)) {
+	delta_exec = update_curr_common(rq);
+	if (unlikely(delta_exec <= 0)) {
 		if (unlikely(dl_se->dl_yielded))
 			goto throttle;
 		return;
 	}
 
-	schedstat_set(curr->stats.exec_max,
-		      max(curr->stats.exec_max, delta_exec));
-
-	trace_sched_stat_runtime(curr, delta_exec, 0);
-
-	update_current_exec_runtime(curr, now, delta_exec);
-
 	if (dl_entity_is_special(dl_se))
 		return;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8767988242ee..2613704a2d2d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1129,23 +1129,17 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
 }
 #endif /* CONFIG_SMP */
 
-/*
- * Update the current task's runtime statistics.
- */
-static void update_curr(struct cfs_rq *cfs_rq)
+static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
 {
-	struct sched_entity *curr = cfs_rq->curr;
-	u64 now = rq_clock_task(rq_of(cfs_rq));
-	u64 delta_exec;
-
-	if (unlikely(!curr))
-		return;
+	u64 now = rq_clock_task(rq);
+	s64 delta_exec;
 
 	delta_exec = now - curr->exec_start;
-	if (unlikely((s64)delta_exec <= 0))
-		return;
+	if (unlikely(delta_exec <= 0))
+		return delta_exec;
 
 	curr->exec_start = now;
+	curr->sum_exec_runtime += delta_exec;
 
 	if (schedstat_enabled()) {
 		struct sched_statistics *stats;
@@ -1155,8 +1149,43 @@ static void update_curr(struct cfs_rq *cfs_rq)
 				max(delta_exec, stats->exec_max));
 	}
 
-	curr->sum_exec_runtime += delta_exec;
-	schedstat_add(cfs_rq->exec_clock, delta_exec);
+	return delta_exec;
+}
+
+/*
+ * Used by other classes to account runtime.
+ */
+s64 update_curr_common(struct rq *rq)
+{
+	struct task_struct *curr = rq->curr;
+	s64 delta_exec;
+
+	delta_exec = update_curr_se(rq, &curr->se);
+	if (unlikely(delta_exec <= 0))
+		return delta_exec;
+
+	trace_sched_stat_runtime(curr, delta_exec, 0);
+
+	account_group_exec_runtime(curr, delta_exec);
+	cgroup_account_cputime(curr, delta_exec);
+
+	return delta_exec;
+}
+
+/*
+ * Update the current task's runtime statistics.
+ */
+static void update_curr(struct cfs_rq *cfs_rq)
+{
+	struct sched_entity *curr = cfs_rq->curr;
+	s64 delta_exec;
+
+	if (unlikely(!curr))
+		return;
+
+	delta_exec = update_curr_se(rq_of(cfs_rq), curr);
+	if (unlikely(delta_exec <= 0))
+		return;
 
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
 	update_deadline(cfs_rq, curr);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 6aaf0a3d6081..3261b067b67e 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1002,24 +1002,15 @@ static void update_curr_rt(struct rq *rq)
 {
 	struct task_struct *curr = rq->curr;
 	struct sched_rt_entity *rt_se = &curr->rt;
-	u64 delta_exec;
-	u64 now;
+	s64 delta_exec;
 
 	if (curr->sched_class != &rt_sched_class)
 		return;
 
-	now = rq_clock_task(rq);
-	delta_exec = now - curr->se.exec_start;
-	if (unlikely((s64)delta_exec <= 0))
+	delta_exec = update_curr_common(rq);
+	if (unlikely(delta_exec <= 0))
 		return;
 
-	schedstat_set(curr->stats.exec_max,
-		      max(curr->stats.exec_max, delta_exec));
-
-	trace_sched_stat_runtime(curr, delta_exec, 0);
-
-	update_current_exec_runtime(curr, now, delta_exec);
-
 	if (!rt_bandwidth_enabled())
 		return;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2e5a95486a42..3e0e4fc8734b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2212,6 +2212,8 @@ struct affinity_context {
 	unsigned int flags;
 };
 
+extern s64 update_curr_common(struct rq *rq);
+
 struct sched_class {
 
 #ifdef CONFIG_UCLAMP_TASK
@@ -3261,16 +3263,6 @@ extern int sched_dynamic_mode(const char *str);
 extern void sched_dynamic_update(int mode);
 #endif
 
-static inline void update_current_exec_runtime(struct task_struct *curr,
-						u64 now, u64 delta_exec)
-{
-	curr->se.sum_exec_runtime += delta_exec;
-	account_group_exec_runtime(curr, delta_exec);
-
-	curr->se.exec_start = now;
-	cgroup_account_cputime(curr, delta_exec);
-}
-
 #ifdef CONFIG_SCHED_MM_CID
 
 #define SCHED_MM_CID_PERIOD_NS	(100ULL * 1000000)	/* 100ms */
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 6cf7304e6449..b1b8fe61c532 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -70,18 +70,7 @@ static void yield_task_stop(struct rq *rq)
 
 static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
 {
-	struct task_struct *curr = rq->curr;
-	u64 now, delta_exec;
-
-	now = rq_clock_task(rq);
-	delta_exec = now - curr->se.exec_start;
-	if (unlikely((s64)delta_exec < 0))
-		delta_exec = 0;
-
-	schedstat_set(curr->stats.exec_max,
-		      max(curr->stats.exec_max, delta_exec));
-
-	update_current_exec_runtime(curr, now, delta_exec);
+	update_curr_common(rq);
 }
 
 /*
-- 
2.40.1



* [PATCH v5 2/7] sched/deadline: Collect sched_dl_entity initialization
  2023-11-04 10:59 [PATCH v5 0/7] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
  2023-11-04 10:59 ` [PATCH v5 1/7] sched: Unify runtime accounting across classes Daniel Bristot de Oliveira
@ 2023-11-04 10:59 ` Daniel Bristot de Oliveira
  2023-11-15  9:04   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2023-11-04 10:59 ` [PATCH v5 3/7] sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity Daniel Bristot de Oliveira
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-11-04 10:59 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, bristot, Phil Auld

From: Peter Zijlstra <peterz@infradead.org>

Create a single function that initializes a sched_dl_entity.

Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
---
 kernel/sched/core.c     |  5 +----
 kernel/sched/deadline.c | 22 +++++++++++++++-------
 kernel/sched/sched.h    |  5 +----
 3 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3d7e2d702699..257369d30303 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4509,10 +4509,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	memset(&p->stats, 0, sizeof(p->stats));
 #endif
 
-	RB_CLEAR_NODE(&p->dl.rb_node);
-	init_dl_task_timer(&p->dl);
-	init_dl_inactive_task_timer(&p->dl);
-	__dl_clear_params(p);
+	init_dl_entity(&p->dl);
 
 	INIT_LIST_HEAD(&p->rt.run_list);
 	p->rt.timeout		= 0;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index de79719c63c0..e80bb884262d 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -335,6 +335,8 @@ static void dl_change_utilization(struct task_struct *p, u64 new_bw)
 	__add_rq_bw(new_bw, &rq->dl);
 }
 
+static void __dl_clear_params(struct sched_dl_entity *dl_se);
+
 /*
  * The utilization of a task cannot be immediately removed from
  * the rq active utilization (running_bw) when the task blocks.
@@ -434,7 +436,7 @@ static void task_non_contending(struct task_struct *p)
 			raw_spin_lock(&dl_b->lock);
 			__dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
 			raw_spin_unlock(&dl_b->lock);
-			__dl_clear_params(p);
+			__dl_clear_params(dl_se);
 		}
 
 		return;
@@ -1183,7 +1185,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 	return HRTIMER_NORESTART;
 }
 
-void init_dl_task_timer(struct sched_dl_entity *dl_se)
+static void init_dl_task_timer(struct sched_dl_entity *dl_se)
 {
 	struct hrtimer *timer = &dl_se->dl_timer;
 
@@ -1389,7 +1391,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 		raw_spin_lock(&dl_b->lock);
 		__dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
 		raw_spin_unlock(&dl_b->lock);
-		__dl_clear_params(p);
+		__dl_clear_params(dl_se);
 
 		goto unlock;
 	}
@@ -1405,7 +1407,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 	return HRTIMER_NORESTART;
 }
 
-void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se)
+static void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se)
 {
 	struct hrtimer *timer = &dl_se->inactive_timer;
 
@@ -2957,10 +2959,8 @@ bool __checkparam_dl(const struct sched_attr *attr)
 /*
  * This function clears the sched_dl_entity static params.
  */
-void __dl_clear_params(struct task_struct *p)
+static void __dl_clear_params(struct sched_dl_entity *dl_se)
 {
-	struct sched_dl_entity *dl_se = &p->dl;
-
 	dl_se->dl_runtime		= 0;
 	dl_se->dl_deadline		= 0;
 	dl_se->dl_period		= 0;
@@ -2978,6 +2978,14 @@ void __dl_clear_params(struct task_struct *p)
 #endif
 }
 
+void init_dl_entity(struct sched_dl_entity *dl_se)
+{
+	RB_CLEAR_NODE(&dl_se->rb_node);
+	init_dl_task_timer(dl_se);
+	init_dl_inactive_task_timer(dl_se);
+	__dl_clear_params(dl_se);
+}
+
 bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr)
 {
 	struct sched_dl_entity *dl_se = &p->dl;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3e0e4fc8734b..4f5f5a2778a9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -273,8 +273,6 @@ struct rt_bandwidth {
 	unsigned int		rt_period_active;
 };
 
-void __dl_clear_params(struct task_struct *p);
-
 static inline int dl_bandwidth_enabled(void)
 {
 	return sysctl_sched_rt_runtime >= 0;
@@ -2427,8 +2425,7 @@ extern struct rt_bandwidth def_rt_bandwidth;
 extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
 extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
 
-extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
-extern void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se);
+extern void init_dl_entity(struct sched_dl_entity *dl_se);
 
 #define BW_SHIFT		20
 #define BW_UNIT			(1 << BW_SHIFT)
-- 
2.40.1



* [PATCH v5 3/7] sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity
  2023-11-04 10:59 [PATCH v5 0/7] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
  2023-11-04 10:59 ` [PATCH v5 1/7] sched: Unify runtime accounting across classes Daniel Bristot de Oliveira
  2023-11-04 10:59 ` [PATCH v5 2/7] sched/deadline: Collect sched_dl_entity initialization Daniel Bristot de Oliveira
@ 2023-11-04 10:59 ` Daniel Bristot de Oliveira
  2023-11-15  9:04   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2023-11-04 10:59 ` [PATCH v5 4/7] sched/deadline: Introduce deadline servers Daniel Bristot de Oliveira
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-11-04 10:59 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, bristot, Phil Auld

From: Peter Zijlstra <peterz@infradead.org>

In preparation for introducing a !task sched_dl_entity, move the
bandwidth accounting into {en,de}queue_dl_entity().

Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
---
 kernel/sched/deadline.c | 130 ++++++++++++++++++++++------------------
 kernel/sched/sched.h    |   6 ++
 2 files changed, 78 insertions(+), 58 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e80bb884262d..81810f67df7a 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -391,12 +391,12 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se);
  * up, and checks if the task is still in the "ACTIVE non contending"
  * state or not (in the second case, it updates running_bw).
  */
-static void task_non_contending(struct task_struct *p)
+static void task_non_contending(struct sched_dl_entity *dl_se)
 {
-	struct sched_dl_entity *dl_se = &p->dl;
 	struct hrtimer *timer = &dl_se->inactive_timer;
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
+	struct task_struct *p = dl_task_of(dl_se);
 	s64 zerolag_time;
 
 	/*
@@ -428,13 +428,14 @@ static void task_non_contending(struct task_struct *p)
 	if ((zerolag_time < 0) || hrtimer_active(&dl_se->inactive_timer)) {
 		if (dl_task(p))
 			sub_running_bw(dl_se, dl_rq);
+
 		if (!dl_task(p) || READ_ONCE(p->__state) == TASK_DEAD) {
 			struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
 
 			if (READ_ONCE(p->__state) == TASK_DEAD)
-				sub_rq_bw(&p->dl, &rq->dl);
+				sub_rq_bw(dl_se, &rq->dl);
 			raw_spin_lock(&dl_b->lock);
-			__dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
+			__dl_sub(dl_b, dl_se->dl_bw, dl_bw_cpus(task_cpu(p)));
 			raw_spin_unlock(&dl_b->lock);
 			__dl_clear_params(dl_se);
 		}
@@ -1601,6 +1602,41 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 
 	update_stats_enqueue_dl(dl_rq_of_se(dl_se), dl_se, flags);
 
+	/*
+	 * Check if a constrained deadline task was activated
+	 * after the deadline but before the next period.
+	 * If that is the case, the task will be throttled and
+	 * the replenishment timer will be set to the next period.
+	 */
+	if (!dl_se->dl_throttled && !dl_is_implicit(dl_se))
+		dl_check_constrained_dl(dl_se);
+
+	if (flags & (ENQUEUE_RESTORE|ENQUEUE_MIGRATING)) {
+		struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+		add_rq_bw(dl_se, dl_rq);
+		add_running_bw(dl_se, dl_rq);
+	}
+
+	/*
+	 * If p is throttled, we do not enqueue it. In fact, if it exhausted
+	 * its budget it needs a replenishment and, since it now is on
+	 * its rq, the bandwidth timer callback (which clearly has not
+	 * run yet) will take care of this.
+	 * However, the active utilization does not depend on the fact
+	 * that the task is on the runqueue or not (but depends on the
+	 * task's state - in GRUB parlance, "inactive" vs "active contending").
+	 * In other words, even if a task is throttled its utilization must
+	 * be counted in the active utilization; hence, we need to call
+	 * add_running_bw().
+	 */
+	if (dl_se->dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
+		if (flags & ENQUEUE_WAKEUP)
+			task_contending(dl_se, flags);
+
+		return;
+	}
+
 	/*
 	 * If this is a wakeup or a new instance, the scheduling
 	 * parameters of the task might need updating. Otherwise,
@@ -1620,9 +1656,28 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 	__enqueue_dl_entity(dl_se);
 }
 
-static void dequeue_dl_entity(struct sched_dl_entity *dl_se)
+static void dequeue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 {
 	__dequeue_dl_entity(dl_se);
+
+	if (flags & (DEQUEUE_SAVE|DEQUEUE_MIGRATING)) {
+		struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+		sub_running_bw(dl_se, dl_rq);
+		sub_rq_bw(dl_se, dl_rq);
+	}
+
+	/*
+	 * This check allows to start the inactive timer (or to immediately
+	 * decrease the active utilization, if needed) in two cases:
+	 * when the task blocks and when it is terminating
+	 * (p->state == TASK_DEAD). We can handle the two cases in the same
+	 * way, because from GRUB's point of view the same thing is happening
+	 * (the task moves from "active contending" to "active non contending"
+	 * or "inactive")
+	 */
+	if (flags & DEQUEUE_SLEEP)
+		task_non_contending(dl_se);
 }
 
 static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
@@ -1667,76 +1722,35 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 		return;
 	}
 
-	/*
-	 * Check if a constrained deadline task was activated
-	 * after the deadline but before the next period.
-	 * If that is the case, the task will be throttled and
-	 * the replenishment timer will be set to the next period.
-	 */
-	if (!p->dl.dl_throttled && !dl_is_implicit(&p->dl))
-		dl_check_constrained_dl(&p->dl);
-
-	if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & ENQUEUE_RESTORE) {
-		add_rq_bw(&p->dl, &rq->dl);
-		add_running_bw(&p->dl, &rq->dl);
-	}
-
-	/*
-	 * If p is throttled, we do not enqueue it. In fact, if it exhausted
-	 * its budget it needs a replenishment and, since it now is on
-	 * its rq, the bandwidth timer callback (which clearly has not
-	 * run yet) will take care of this.
-	 * However, the active utilization does not depend on the fact
-	 * that the task is on the runqueue or not (but depends on the
-	 * task's state - in GRUB parlance, "inactive" vs "active contending").
-	 * In other words, even if a task is throttled its utilization must
-	 * be counted in the active utilization; hence, we need to call
-	 * add_running_bw().
-	 */
-	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
-		if (flags & ENQUEUE_WAKEUP)
-			task_contending(&p->dl, flags);
-
-		return;
-	}
-
 	check_schedstat_required();
 	update_stats_wait_start_dl(dl_rq_of_se(&p->dl), &p->dl);
 
+	if (p->on_rq == TASK_ON_RQ_MIGRATING)
+		flags |= ENQUEUE_MIGRATING;
+
 	enqueue_dl_entity(&p->dl, flags);
 
-	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
+	if (!task_current(rq, p) && !p->dl.dl_throttled && p->nr_cpus_allowed > 1)
 		enqueue_pushable_dl_task(rq, p);
 }
 
 static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	update_stats_dequeue_dl(&rq->dl, &p->dl, flags);
-	dequeue_dl_entity(&p->dl);
-	dequeue_pushable_dl_task(rq, p);
+	dequeue_dl_entity(&p->dl, flags);
+
+	if (!p->dl.dl_throttled)
+		dequeue_pushable_dl_task(rq, p);
 }
 
 static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	update_curr_dl(rq);
-	__dequeue_task_dl(rq, p, flags);
 
-	if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & DEQUEUE_SAVE) {
-		sub_running_bw(&p->dl, &rq->dl);
-		sub_rq_bw(&p->dl, &rq->dl);
-	}
+	if (p->on_rq == TASK_ON_RQ_MIGRATING)
+		flags |= DEQUEUE_MIGRATING;
 
-	/*
-	 * This check allows to start the inactive timer (or to immediately
-	 * decrease the active utilization, if needed) in two cases:
-	 * when the task blocks and when it is terminating
-	 * (p->state == TASK_DEAD). We can handle the two cases in the same
-	 * way, because from GRUB's point of view the same thing is happening
-	 * (the task moves from "active contending" to "active non contending"
-	 * or "inactive")
-	 */
-	if (flags & DEQUEUE_SLEEP)
-		task_non_contending(p);
+	__dequeue_task_dl(rq, p, flags);
 }
 
 /*
@@ -2551,7 +2565,7 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
 	 * will reset the task parameters.
 	 */
 	if (task_on_rq_queued(p) && p->dl.dl_runtime)
-		task_non_contending(p);
+		task_non_contending(&p->dl);
 
 	/*
 	 * In case a task is setscheduled out from SCHED_DEADLINE we need to
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4f5f5a2778a9..a0cdc540029c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2177,6 +2177,10 @@ extern const u32		sched_prio_to_wmult[40];
  * MOVE - paired with SAVE/RESTORE, explicitly does not preserve the location
  *        in the runqueue.
  *
+ * NOCLOCK - skip the update_rq_clock() (avoids double updates)
+ *
+ * MIGRATION - p->on_rq == TASK_ON_RQ_MIGRATING (used for DEADLINE)
+ *
  * ENQUEUE_HEAD      - place at front of runqueue (tail if not specified)
  * ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline)
  * ENQUEUE_MIGRATED  - the task was migrated during wakeup
@@ -2187,6 +2191,7 @@ extern const u32		sched_prio_to_wmult[40];
 #define DEQUEUE_SAVE		0x02 /* Matches ENQUEUE_RESTORE */
 #define DEQUEUE_MOVE		0x04 /* Matches ENQUEUE_MOVE */
 #define DEQUEUE_NOCLOCK		0x08 /* Matches ENQUEUE_NOCLOCK */
+#define DEQUEUE_MIGRATING	0x100 /* Matches ENQUEUE_MIGRATING */
 
 #define ENQUEUE_WAKEUP		0x01
 #define ENQUEUE_RESTORE		0x02
@@ -2201,6 +2206,7 @@ extern const u32		sched_prio_to_wmult[40];
 #define ENQUEUE_MIGRATED	0x00
 #endif
 #define ENQUEUE_INITIAL		0x80
+#define ENQUEUE_MIGRATING	0x100
 
 #define RETRY_TASK		((void *)-1UL)
 
-- 
2.40.1



* [PATCH v5 4/7] sched/deadline: Introduce deadline servers
  2023-11-04 10:59 [PATCH v5 0/7] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
                   ` (2 preceding siblings ...)
  2023-11-04 10:59 ` [PATCH v5 3/7] sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity Daniel Bristot de Oliveira
@ 2023-11-04 10:59 ` Daniel Bristot de Oliveira
  2023-11-15  9:04   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2023-11-04 10:59 ` [PATCH v5 5/7] sched/fair: Add trivial fair server Daniel Bristot de Oliveira
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-11-04 10:59 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, bristot, Phil Auld

From: Peter Zijlstra <peterz@infradead.org>

Low priority tasks (e.g., SCHED_OTHER) can suffer starvation if tasks
with higher priority (e.g., SCHED_FIFO) monopolize CPU(s).

RT Throttling was introduced a while ago as a (mostly debug)
countermeasure one can use to reserve some CPU time for low priority
tasks (usually background type of work, e.g. workqueues, timers, etc.).
However, it has its own problems (see documentation) and the undesired
effect of unconditionally throttling FIFO tasks even when no lower
priority activity needs to run (there are mechanisms to fix this issue
as well, but, again, with their own problems).

Introduce deadline servers to service low priority tasks' needs under
starvation conditions. Deadline servers are built by extending the
SCHED_DEADLINE implementation to allow 2-level scheduling (a sched_deadline
entity becomes a container for lower priority scheduling entities).
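
To make the two-level structure concrete, here is an illustrative sketch
(not part of this patch) of how a scheduling class is expected to use the
server hooks added below; the my_* names are made up, and the actual
fair-class wiring comes in patch 5/7:

	static bool my_server_has_tasks(struct sched_dl_entity *dl_se)
	{
		/* true while the contained class still has runnable tasks */
		return my_class_nr_running(dl_se->rq) > 0;
	}

	static struct task_struct *my_server_pick(struct sched_dl_entity *dl_se)
	{
		/* nested pick_next_task() for the contained class */
		return my_class_pick_task(dl_se->rq);
	}

	static void my_class_server_init(struct rq *rq, struct sched_dl_entity *dl_se)
	{
		/* bind the server entity to its runqueue and callbacks */
		dl_server_init(dl_se, rq, my_server_has_tasks, my_server_pick);
	}

The class then calls dl_server_start()/dl_server_stop() as it gains/loses
runnable tasks, and the consumed runtime is propagated to the server from
update_curr_common() via dl_server_update().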

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
---
 include/linux/sched.h   |  22 ++-
 kernel/sched/core.c     |  17 ++
 kernel/sched/deadline.c | 332 +++++++++++++++++++++++++++-------------
 kernel/sched/fair.c     |   4 +
 kernel/sched/sched.h    |  27 ++++
 5 files changed, 294 insertions(+), 108 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 31eee8b03dcd..5ac1f252e136 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -63,11 +63,13 @@ struct robust_list_head;
 struct root_domain;
 struct rq;
 struct sched_attr;
+struct sched_dl_entity;
 struct seq_file;
 struct sighand_struct;
 struct signal_struct;
 struct task_delay_info;
 struct task_group;
+struct task_struct;
 struct user_event_mm;
 
 /*
@@ -607,6 +609,9 @@ struct sched_rt_entity {
 #endif
 } __randomize_layout;
 
+typedef bool (*dl_server_has_tasks_f)(struct sched_dl_entity *);
+typedef struct task_struct *(*dl_server_pick_f)(struct sched_dl_entity *);
+
 struct sched_dl_entity {
 	struct rb_node			rb_node;
 
@@ -654,6 +659,7 @@ struct sched_dl_entity {
 	unsigned int			dl_yielded        : 1;
 	unsigned int			dl_non_contending : 1;
 	unsigned int			dl_overrun	  : 1;
+	unsigned int			dl_server         : 1;
 
 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
@@ -668,7 +674,20 @@ struct sched_dl_entity {
 	 * timer is needed to decrease the active utilization at the correct
 	 * time.
 	 */
-	struct hrtimer inactive_timer;
+	struct hrtimer			inactive_timer;
+
+	/*
+	 * Bits for DL-server functionality. Also see the comment near
+	 * dl_server_update().
+	 *
+	 * @rq the runqueue this server is for
+	 *
+	 * @server_has_tasks() returns true if @server_pick return a
+	 * runnable task.
+	 */
+	struct rq			*rq;
+	dl_server_has_tasks_f		server_has_tasks;
+	dl_server_pick_f		server_pick;
 
 #ifdef CONFIG_RT_MUTEXES
 	/*
@@ -795,6 +814,7 @@ struct task_struct {
 	struct sched_entity		se;
 	struct sched_rt_entity		rt;
 	struct sched_dl_entity		dl;
+	struct sched_dl_entity		*server;
 	const struct sched_class	*sched_class;
 
 #ifdef CONFIG_SCHED_CORE
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 257369d30303..a721f6776b12 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3795,6 +3795,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 		rq->idle_stamp = 0;
 	}
 #endif
+
+	p->server = NULL;
 }
 
 /*
@@ -6001,12 +6003,27 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			p = pick_next_task_idle(rq);
 		}
 
+		/*
+		 * This is the fast path; it cannot be a DL server pick;
+		 * therefore even if @p == @prev, ->server must be NULL.
+		 */
+		if (p->server)
+			p->server = NULL;
+
 		return p;
 	}
 
 restart:
 	put_prev_task_balance(rq, prev, rf);
 
+	/*
+	 * We've updated @prev and no longer need the server link, clear it.
+	 * Must be done before ->pick_next_task() because that can (re)set
+	 * ->server.
+	 */
+	if (prev->server)
+		prev->server = NULL;
+
 	for_each_class(class) {
 		p = class->pick_next_task(rq);
 		if (p)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 81810f67df7a..541d547e1019 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -54,8 +54,14 @@ static int __init sched_dl_sysctl_init(void)
 late_initcall(sched_dl_sysctl_init);
 #endif
 
+static bool dl_server(struct sched_dl_entity *dl_se)
+{
+	return dl_se->dl_server;
+}
+
 static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
 {
+	BUG_ON(dl_server(dl_se));
 	return container_of(dl_se, struct task_struct, dl);
 }
 
@@ -64,12 +70,19 @@ static inline struct rq *rq_of_dl_rq(struct dl_rq *dl_rq)
 	return container_of(dl_rq, struct rq, dl);
 }
 
-static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
+static inline struct rq *rq_of_dl_se(struct sched_dl_entity *dl_se)
 {
-	struct task_struct *p = dl_task_of(dl_se);
-	struct rq *rq = task_rq(p);
+	struct rq *rq = dl_se->rq;
+
+	if (!dl_server(dl_se))
+		rq = task_rq(dl_task_of(dl_se));
 
-	return &rq->dl;
+	return rq;
+}
+
+static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
+{
+	return &rq_of_dl_se(dl_se)->dl;
 }
 
 static inline int on_dl_rq(struct sched_dl_entity *dl_se)
@@ -394,9 +407,8 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se);
 static void task_non_contending(struct sched_dl_entity *dl_se)
 {
 	struct hrtimer *timer = &dl_se->inactive_timer;
-	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
-	struct rq *rq = rq_of_dl_rq(dl_rq);
-	struct task_struct *p = dl_task_of(dl_se);
+	struct rq *rq = rq_of_dl_se(dl_se);
+	struct dl_rq *dl_rq = &rq->dl;
 	s64 zerolag_time;
 
 	/*
@@ -426,25 +438,33 @@ static void task_non_contending(struct sched_dl_entity *dl_se)
 	 * utilization now, instead of starting a timer
 	 */
 	if ((zerolag_time < 0) || hrtimer_active(&dl_se->inactive_timer)) {
-		if (dl_task(p))
+		if (dl_server(dl_se)) {
 			sub_running_bw(dl_se, dl_rq);
+		} else {
+			struct task_struct *p = dl_task_of(dl_se);
 
-		if (!dl_task(p) || READ_ONCE(p->__state) == TASK_DEAD) {
-			struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+			if (dl_task(p))
+				sub_running_bw(dl_se, dl_rq);
 
-			if (READ_ONCE(p->__state) == TASK_DEAD)
-				sub_rq_bw(dl_se, &rq->dl);
-			raw_spin_lock(&dl_b->lock);
-			__dl_sub(dl_b, dl_se->dl_bw, dl_bw_cpus(task_cpu(p)));
-			raw_spin_unlock(&dl_b->lock);
-			__dl_clear_params(dl_se);
+			if (!dl_task(p) || READ_ONCE(p->__state) == TASK_DEAD) {
+				struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+
+				if (READ_ONCE(p->__state) == TASK_DEAD)
+					sub_rq_bw(dl_se, &rq->dl);
+				raw_spin_lock(&dl_b->lock);
+				__dl_sub(dl_b, dl_se->dl_bw, dl_bw_cpus(task_cpu(p)));
+				raw_spin_unlock(&dl_b->lock);
+				__dl_clear_params(dl_se);
+			}
 		}
 
 		return;
 	}
 
 	dl_se->dl_non_contending = 1;
-	get_task_struct(p);
+	if (!dl_server(dl_se))
+		get_task_struct(dl_task_of(dl_se));
+
 	hrtimer_start(timer, ns_to_ktime(zerolag_time), HRTIMER_MODE_REL_HARD);
 }
 
@@ -471,8 +491,10 @@ static void task_contending(struct sched_dl_entity *dl_se, int flags)
 		 * will not touch the rq's active utilization,
 		 * so we are still safe.
 		 */
-		if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1)
-			put_task_struct(dl_task_of(dl_se));
+		if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1) {
+			if (!dl_server(dl_se))
+				put_task_struct(dl_task_of(dl_se));
+		}
 	} else {
 		/*
 		 * Since "dl_non_contending" is not set, the
@@ -485,10 +507,8 @@ static void task_contending(struct sched_dl_entity *dl_se, int flags)
 	}
 }
 
-static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
+static inline int is_leftmost(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 {
-	struct sched_dl_entity *dl_se = &p->dl;
-
 	return rb_first_cached(&dl_rq->root) == &dl_se->rb_node;
 }
 
@@ -740,8 +760,10 @@ static inline void deadline_queue_pull_task(struct rq *rq)
 }
 #endif /* CONFIG_SMP */
 
+static void
+enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags);
 static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
-static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
+static void dequeue_dl_entity(struct sched_dl_entity *dl_se, int flags);
 static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p, int flags);
 
 static inline void replenish_dl_new_period(struct sched_dl_entity *dl_se,
@@ -989,8 +1011,7 @@ static inline bool dl_is_implicit(struct sched_dl_entity *dl_se)
  */
 static void update_dl_entity(struct sched_dl_entity *dl_se)
 {
-	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
-	struct rq *rq = rq_of_dl_rq(dl_rq);
+	struct rq *rq = rq_of_dl_se(dl_se);
 
 	if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
 	    dl_entity_overflow(dl_se, rq_clock(rq))) {
@@ -1021,11 +1042,11 @@ static inline u64 dl_next_period(struct sched_dl_entity *dl_se)
  * actually started or not (i.e., the replenishment instant is in
  * the future or in the past).
  */
-static int start_dl_timer(struct task_struct *p)
+static int start_dl_timer(struct sched_dl_entity *dl_se)
 {
-	struct sched_dl_entity *dl_se = &p->dl;
 	struct hrtimer *timer = &dl_se->dl_timer;
-	struct rq *rq = task_rq(p);
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
 	ktime_t now, act;
 	s64 delta;
 
@@ -1059,13 +1080,33 @@ static int start_dl_timer(struct task_struct *p)
 	 * and observe our state.
 	 */
 	if (!hrtimer_is_queued(timer)) {
-		get_task_struct(p);
+		if (!dl_server(dl_se))
+			get_task_struct(dl_task_of(dl_se));
 		hrtimer_start(timer, act, HRTIMER_MODE_ABS_HARD);
 	}
 
 	return 1;
 }
 
+static void __push_dl_task(struct rq *rq, struct rq_flags *rf)
+{
+#ifdef CONFIG_SMP
+	/*
+	 * Queueing this task back might have overloaded rq, check if we need
+	 * to kick someone away.
+	 */
+	if (has_pushable_dl_tasks(rq)) {
+		/*
+		 * Nothing relies on rq->lock after this, so its safe to drop
+		 * rq->lock.
+		 */
+		rq_unpin_lock(rq, rf);
+		push_dl_task(rq);
+		rq_repin_lock(rq, rf);
+	}
+#endif
+}
+
 /*
  * This is the bandwidth enforcement timer callback. If here, we know
  * a task is not on its dl_rq, since the fact that the timer was running
@@ -1084,10 +1125,34 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 	struct sched_dl_entity *dl_se = container_of(timer,
 						     struct sched_dl_entity,
 						     dl_timer);
-	struct task_struct *p = dl_task_of(dl_se);
+	struct task_struct *p;
 	struct rq_flags rf;
 	struct rq *rq;
 
+	if (dl_server(dl_se)) {
+		struct rq *rq = rq_of_dl_se(dl_se);
+		struct rq_flags rf;
+
+		rq_lock(rq, &rf);
+		if (dl_se->dl_throttled) {
+			sched_clock_tick();
+			update_rq_clock(rq);
+
+			if (dl_se->server_has_tasks(dl_se)) {
+				enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH);
+				resched_curr(rq);
+				__push_dl_task(rq, &rf);
+			} else {
+				replenish_dl_entity(dl_se);
+			}
+
+		}
+		rq_unlock(rq, &rf);
+
+		return HRTIMER_NORESTART;
+	}
+
+	p = dl_task_of(dl_se);
 	rq = task_rq_lock(p, &rf);
 
 	/*
@@ -1158,21 +1223,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 	else
 		resched_curr(rq);
 
-#ifdef CONFIG_SMP
-	/*
-	 * Queueing this task back might have overloaded rq, check if we need
-	 * to kick someone away.
-	 */
-	if (has_pushable_dl_tasks(rq)) {
-		/*
-		 * Nothing relies on rq->lock after this, so its safe to drop
-		 * rq->lock.
-		 */
-		rq_unpin_lock(rq, &rf);
-		push_dl_task(rq);
-		rq_repin_lock(rq, &rf);
-	}
-#endif
+	__push_dl_task(rq, &rf);
 
 unlock:
 	task_rq_unlock(rq, p, &rf);
@@ -1214,12 +1265,11 @@ static void init_dl_task_timer(struct sched_dl_entity *dl_se)
  */
 static inline void dl_check_constrained_dl(struct sched_dl_entity *dl_se)
 {
-	struct task_struct *p = dl_task_of(dl_se);
-	struct rq *rq = rq_of_dl_rq(dl_rq_of_se(dl_se));
+	struct rq *rq = rq_of_dl_se(dl_se);
 
 	if (dl_time_before(dl_se->deadline, rq_clock(rq)) &&
 	    dl_time_before(rq_clock(rq), dl_next_period(dl_se))) {
-		if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(p)))
+		if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(dl_se)))
 			return;
 		dl_se->dl_throttled = 1;
 		if (dl_se->runtime > 0)
@@ -1270,29 +1320,13 @@ static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
 	return (delta * u_act) >> BW_SHIFT;
 }
 
-/*
- * Update the current task's runtime statistics (provided it is still
- * a -deadline task and has not been removed from the dl_rq).
- */
-static void update_curr_dl(struct rq *rq)
+static inline void
+update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
+                        int flags);
+static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta_exec)
 {
-	struct task_struct *curr = rq->curr;
-	struct sched_dl_entity *dl_se = &curr->dl;
-	s64 delta_exec, scaled_delta_exec;
-	int cpu = cpu_of(rq);
-
-	if (!dl_task(curr) || !on_dl_rq(dl_se))
-		return;
+	s64 scaled_delta_exec;
 
-	/*
-	 * Consumed budget is computed considering the time as
-	 * observed by schedulable tasks (excluding time spent
-	 * in hardirq context, etc.). Deadlines are instead
-	 * computed using hard walltime. This seems to be the more
-	 * natural solution, but the full ramifications of this
-	 * approach need further study.
-	 */
-	delta_exec = update_curr_common(rq);
 	if (unlikely(delta_exec <= 0)) {
 		if (unlikely(dl_se->dl_yielded))
 			goto throttle;
@@ -1310,10 +1344,9 @@ static void update_curr_dl(struct rq *rq)
 	 * according to current frequency and CPU maximum capacity.
 	 */
 	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
-		scaled_delta_exec = grub_reclaim(delta_exec,
-						 rq,
-						 &curr->dl);
+		scaled_delta_exec = grub_reclaim(delta_exec, rq, dl_se);
 	} else {
+		int cpu = cpu_of(rq);
 		unsigned long scale_freq = arch_scale_freq_capacity(cpu);
 		unsigned long scale_cpu = arch_scale_cpu_capacity(cpu);
 
@@ -1332,11 +1365,20 @@ static void update_curr_dl(struct rq *rq)
 		    (dl_se->flags & SCHED_FLAG_DL_OVERRUN))
 			dl_se->dl_overrun = 1;
 
-		__dequeue_task_dl(rq, curr, 0);
-		if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(curr)))
-			enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
+		dequeue_dl_entity(dl_se, 0);
+		if (!dl_server(dl_se)) {
+			update_stats_dequeue_dl(&rq->dl, dl_se, 0);
+			dequeue_pushable_dl_task(rq, dl_task_of(dl_se));
+		}
 
-		if (!is_leftmost(curr, &rq->dl))
+		if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(dl_se))) {
+			if (dl_server(dl_se))
+				enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH);
+			else
+				enqueue_task_dl(rq, dl_task_of(dl_se), ENQUEUE_REPLENISH);
+		}
+
+		if (!is_leftmost(dl_se, &rq->dl))
 			resched_curr(rq);
 	}
 
@@ -1366,20 +1408,82 @@ static void update_curr_dl(struct rq *rq)
 	}
 }
 
+void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
+{
+	update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
+}
+
+void dl_server_start(struct sched_dl_entity *dl_se)
+{
+	if (!dl_server(dl_se)) {
+		dl_se->dl_server = 1;
+		setup_new_dl_entity(dl_se);
+	}
+	enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
+}
+
+void dl_server_stop(struct sched_dl_entity *dl_se)
+{
+	dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
+}
+
+void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
+		    dl_server_has_tasks_f has_tasks,
+		    dl_server_pick_f pick)
+{
+	dl_se->rq = rq;
+	dl_se->server_has_tasks = has_tasks;
+	dl_se->server_pick = pick;
+}
+
+/*
+ * Update the current task's runtime statistics (provided it is still
+ * a -deadline task and has not been removed from the dl_rq).
+ */
+static void update_curr_dl(struct rq *rq)
+{
+	struct task_struct *curr = rq->curr;
+	struct sched_dl_entity *dl_se = &curr->dl;
+	s64 delta_exec;
+
+	if (!dl_task(curr) || !on_dl_rq(dl_se))
+		return;
+
+	/*
+	 * Consumed budget is computed considering the time as
+	 * observed by schedulable tasks (excluding time spent
+	 * in hardirq context, etc.). Deadlines are instead
+	 * computed using hard walltime. This seems to be the more
+	 * natural solution, but the full ramifications of this
+	 * approach need further study.
+	 */
+	delta_exec = update_curr_common(rq);
+	update_curr_dl_se(rq, dl_se, delta_exec);
+}
+
 static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 {
 	struct sched_dl_entity *dl_se = container_of(timer,
 						     struct sched_dl_entity,
 						     inactive_timer);
-	struct task_struct *p = dl_task_of(dl_se);
+	struct task_struct *p = NULL;
 	struct rq_flags rf;
 	struct rq *rq;
 
-	rq = task_rq_lock(p, &rf);
+	if (!dl_server(dl_se)) {
+		p = dl_task_of(dl_se);
+		rq = task_rq_lock(p, &rf);
+	} else {
+		rq = dl_se->rq;
+		rq_lock(rq, &rf);
+	}
 
 	sched_clock_tick();
 	update_rq_clock(rq);
 
+	if (dl_server(dl_se))
+		goto no_task;
+
 	if (!dl_task(p) || READ_ONCE(p->__state) == TASK_DEAD) {
 		struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
 
@@ -1396,14 +1500,21 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 
 		goto unlock;
 	}
+
+no_task:
 	if (dl_se->dl_non_contending == 0)
 		goto unlock;
 
 	sub_running_bw(dl_se, &rq->dl);
 	dl_se->dl_non_contending = 0;
 unlock:
-	task_rq_unlock(rq, p, &rf);
-	put_task_struct(p);
+
+	if (!dl_server(dl_se)) {
+		task_rq_unlock(rq, p, &rf);
+		put_task_struct(p);
+	} else {
+		rq_unlock(rq, &rf);
+	}
 
 	return HRTIMER_NORESTART;
 }
@@ -1466,10 +1577,8 @@ static inline void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
 static inline
 void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 {
-	int prio = dl_task_of(dl_se)->prio;
 	u64 deadline = dl_se->deadline;
 
-	WARN_ON(!dl_prio(prio));
 	dl_rq->dl_nr_running++;
 	add_nr_running(rq_of_dl_rq(dl_rq), 1);
 
@@ -1479,9 +1588,6 @@ void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 static inline
 void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 {
-	int prio = dl_task_of(dl_se)->prio;
-
-	WARN_ON(!dl_prio(prio));
 	WARN_ON(!dl_rq->dl_nr_running);
 	dl_rq->dl_nr_running--;
 	sub_nr_running(rq_of_dl_rq(dl_rq), 1);
@@ -1648,8 +1754,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 	} else if (flags & ENQUEUE_REPLENISH) {
 		replenish_dl_entity(dl_se);
 	} else if ((flags & ENQUEUE_RESTORE) &&
-		  dl_time_before(dl_se->deadline,
-				 rq_clock(rq_of_dl_rq(dl_rq_of_se(dl_se))))) {
+		   dl_time_before(dl_se->deadline, rq_clock(rq_of_dl_se(dl_se)))) {
 		setup_new_dl_entity(dl_se);
 	}
 
@@ -1730,19 +1835,13 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 
 	enqueue_dl_entity(&p->dl, flags);
 
+	if (dl_server(&p->dl))
+		return;
+
 	if (!task_current(rq, p) && !p->dl.dl_throttled && p->nr_cpus_allowed > 1)
 		enqueue_pushable_dl_task(rq, p);
 }
 
-static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
-{
-	update_stats_dequeue_dl(&rq->dl, &p->dl, flags);
-	dequeue_dl_entity(&p->dl, flags);
-
-	if (!p->dl.dl_throttled)
-		dequeue_pushable_dl_task(rq, p);
-}
-
 static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	update_curr_dl(rq);
@@ -1750,7 +1849,9 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	if (p->on_rq == TASK_ON_RQ_MIGRATING)
 		flags |= DEQUEUE_MIGRATING;
 
-	__dequeue_task_dl(rq, p, flags);
+	dequeue_dl_entity(&p->dl, flags);
+	if (!p->dl.dl_throttled && !dl_server(&p->dl))
+		dequeue_pushable_dl_task(rq, p);
 }
 
 /*
@@ -1940,12 +2041,12 @@ static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
 }
 
 #ifdef CONFIG_SCHED_HRTICK
-static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
+static void start_hrtick_dl(struct rq *rq, struct sched_dl_entity *dl_se)
 {
-	hrtick_start(rq, p->dl.runtime);
+	hrtick_start(rq, dl_se->runtime);
 }
 #else /* !CONFIG_SCHED_HRTICK */
-static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
+static void start_hrtick_dl(struct rq *rq, struct sched_dl_entity *dl_se)
 {
 }
 #endif
@@ -1965,9 +2066,6 @@ static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
 	if (!first)
 		return;
 
-	if (hrtick_enabled_dl(rq))
-		start_hrtick_dl(rq, p);
-
 	if (rq->curr->sched_class != &dl_sched_class)
 		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
 
@@ -1990,12 +2088,25 @@ static struct task_struct *pick_task_dl(struct rq *rq)
 	struct dl_rq *dl_rq = &rq->dl;
 	struct task_struct *p;
 
+again:
 	if (!sched_dl_runnable(rq))
 		return NULL;
 
 	dl_se = pick_next_dl_entity(dl_rq);
 	WARN_ON_ONCE(!dl_se);
-	p = dl_task_of(dl_se);
+
+	if (dl_server(dl_se)) {
+		p = dl_se->server_pick(dl_se);
+		if (!p) {
+			WARN_ON_ONCE(1);
+			dl_se->dl_yielded = 1;
+			update_curr_dl_se(rq, dl_se, 0);
+			goto again;
+		}
+		p->server = dl_se;
+	} else {
+		p = dl_task_of(dl_se);
+	}
 
 	return p;
 }
@@ -2005,9 +2116,15 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
 	struct task_struct *p;
 
 	p = pick_task_dl(rq);
-	if (p)
+	if (!p)
+		return p;
+
+	if (!p->server)
 		set_next_task_dl(rq, p, true);
 
+	if (hrtick_enabled(rq))
+		start_hrtick_dl(rq, &p->dl);
+
 	return p;
 }
 
@@ -2045,8 +2162,8 @@ static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
 	 * be set and schedule() will start a new hrtick for the next task.
 	 */
 	if (hrtick_enabled_dl(rq) && queued && p->dl.runtime > 0 &&
-	    is_leftmost(p, &rq->dl))
-		start_hrtick_dl(rq, p);
+	    is_leftmost(&p->dl, &rq->dl))
+		start_hrtick_dl(rq, &p->dl);
 }
 
 static void task_fork_dl(struct task_struct *p)
@@ -2986,6 +3103,7 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se)
 	dl_se->dl_yielded		= 0;
 	dl_se->dl_non_contending	= 0;
 	dl_se->dl_overrun		= 0;
+	dl_se->dl_server		= 0;
 
 #ifdef CONFIG_RT_MUTEXES
 	dl_se->pi_se			= dl_se;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2613704a2d2d..bc3a4bc6c438 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1168,6 +1168,8 @@ s64 update_curr_common(struct rq *rq)
 
 	account_group_exec_runtime(curr, delta_exec);
 	cgroup_account_cputime(curr, delta_exec);
+	if (curr->server)
+		dl_server_update(curr->server, delta_exec);
 
 	return delta_exec;
 }
@@ -1197,6 +1199,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
 		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
 		cgroup_account_cputime(curtask, delta_exec);
 		account_group_exec_runtime(curtask, delta_exec);
+		if (curtask->server)
+			dl_server_update(curtask->server, delta_exec);
 	}
 
 	account_cfs_rq_runtime(cfs_rq, delta_exec);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a0cdc540029c..24a2bc7c453b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -313,6 +313,33 @@ extern bool dl_param_changed(struct task_struct *p, const struct sched_attr *att
 extern int  dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
 extern int  dl_bw_check_overflow(int cpu);
 
+/*
+ * SCHED_DEADLINE supports servers (nested scheduling) with the following
+ * interface:
+ *
+ *   dl_se::rq -- runqueue we belong to.
+ *
+ *   dl_se::server_has_tasks() -- used on bandwidth enforcement; we 'stop' the
+ *                                server when it runs out of tasks to run.
+ *
+ *   dl_se::server_pick() -- nested pick_next_task(); we yield the period if this
+ *                           returns NULL.
+ *
+ *   dl_server_update() -- called from update_curr_common(), propagates runtime
+ *                         to the server.
+ *
+ *   dl_server_start()
+ *   dl_server_stop()  -- start/stop the server when it has (no) tasks.
+ *
+ *   dl_server_init() -- initializes the server.
+ */
+extern void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec);
+extern void dl_server_start(struct sched_dl_entity *dl_se);
+extern void dl_server_stop(struct sched_dl_entity *dl_se);
+extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
+		    dl_server_has_tasks_f has_tasks,
+		    dl_server_pick_f pick);
+
 #ifdef CONFIG_CGROUP_SCHED
 
 struct cfs_rq;
-- 
2.40.1
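
For illustration, a minimal sketch of how a scheduling class could plug into
the dl_server interface documented in the sched.h hunk above. The
rq->my_server field and the my_rq_has_tasks()/my_rq_pick() helpers are
hypothetical placeholders; the only user actually wired up in this series is
the fair server added in the next patch.

static bool my_server_has_tasks(struct sched_dl_entity *dl_se)
{
	/* Bandwidth enforcement 'stops' the server when this returns false. */
	return my_rq_has_tasks(dl_se->rq);
}

static struct task_struct *my_server_pick(struct sched_dl_entity *dl_se)
{
	/* Nested pick_next_task(); returning NULL yields the period. */
	return my_rq_pick(dl_se->rq);
}

void my_server_setup(struct rq *rq)
{
	struct sched_dl_entity *dl_se = &rq->my_server;	/* hypothetical field */

	dl_server_init(dl_se, rq, my_server_has_tasks, my_server_pick);
}

/*
 * The class then calls dl_server_start()/dl_server_stop() when its first
 * task is enqueued / its last task is dequeued, and dl_server_update()
 * from its runtime accounting path to consume the reserved budget.
 */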


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v5 5/7] sched/fair: Add trivial fair server
  2023-11-04 10:59 [PATCH v5 0/7] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
                   ` (3 preceding siblings ...)
  2023-11-04 10:59 ` [PATCH v5 4/7] sched/deadline: Introduce deadline servers Daniel Bristot de Oliveira
@ 2023-11-04 10:59 ` Daniel Bristot de Oliveira
  2023-11-06 14:24   ` Peter Zijlstra
  2023-11-04 10:59 ` [PATCH v5 6/7] sched/deadline: Deferrable dl server Daniel Bristot de Oliveira
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-11-04 10:59 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, bristot, Phil Auld

From: Peter Zijlstra <peterz@infradead.org>

Use deadline servers to service fair tasks.

This patch adds a fair_server deadline entity which acts as a container
for fair entities and can be used to fix starvation when higher priority
(wrt fair) tasks are monopolizing CPU(s).

[ dl_server do not account for rt ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
---
 kernel/sched/core.c     |  1 +
 kernel/sched/deadline.c |  7 +++++++
 kernel/sched/fair.c     | 29 +++++++++++++++++++++++++++++
 kernel/sched/sched.h    |  4 ++++
 4 files changed, 41 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a721f6776b12..939266d29681 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10019,6 +10019,7 @@ void __init sched_init(void)
 #endif /* CONFIG_SMP */
 		hrtick_rq_init(rq);
 		atomic_set(&rq->nr_iowait, 0);
+		fair_server_init(rq);
 
 #ifdef CONFIG_SCHED_CORE
 		rq->core = rq;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 541d547e1019..1d7b96ca9011 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1382,6 +1382,13 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
 			resched_curr(rq);
 	}
 
+	/*
+	 * The fair server (sole dl_server) does not account for real-time
+	 * workload because it is running fair work.
+	 */
+	if (dl_server(dl_se))
+		return;
+
 	/*
 	 * Because -- for now -- we share the rt bandwidth, we need to
 	 * account our runtime there too, otherwise actual rt tasks
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc3a4bc6c438..b15f7f376a67 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6600,6 +6600,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	 */
 	util_est_enqueue(&rq->cfs, p);
 
+	if (!rq->cfs.h_nr_running)
+		dl_server_start(&rq->fair_server);
+
 	/*
 	 * If in_iowait is set, the code below may not trigger any cpufreq
 	 * utilization updates, so do it here explicitly with the IOWAIT flag
@@ -6744,6 +6747,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		rq->next_balance = jiffies;
 
 dequeue_throttle:
+	if (!rq->cfs.h_nr_running)
+		dl_server_stop(&rq->fair_server);
+
 	util_est_update(&rq->cfs, p, task_sleep);
 	hrtick_update(rq);
 }
@@ -8396,6 +8402,29 @@ static struct task_struct *__pick_next_task_fair(struct rq *rq)
 	return pick_next_task_fair(rq, NULL, NULL);
 }
 
+static bool fair_server_has_tasks(struct sched_dl_entity *dl_se)
+{
+	return !!dl_se->rq->cfs.nr_running;
+}
+
+static struct task_struct *fair_server_pick(struct sched_dl_entity *dl_se)
+{
+	return pick_next_task_fair(dl_se->rq, NULL, NULL);
+}
+
+void fair_server_init(struct rq *rq)
+{
+	struct sched_dl_entity *dl_se = &rq->fair_server;
+
+	init_dl_entity(dl_se);
+
+	dl_se->dl_runtime = 50 * NSEC_PER_MSEC;
+	dl_se->dl_deadline = 1000 * NSEC_PER_MSEC;
+	dl_se->dl_period = 1000 * NSEC_PER_MSEC;
+
+	dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
+}
+
 /*
  * Account for a descheduled task:
  */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 24a2bc7c453b..ec0e288c8e06 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -340,6 +340,8 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 		    dl_server_has_tasks_f has_tasks,
 		    dl_server_pick_f pick);
 
+extern void fair_server_init(struct rq *);
+
 #ifdef CONFIG_CGROUP_SCHED
 
 struct cfs_rq;
@@ -1005,6 +1007,8 @@ struct rq {
 	struct rt_rq		rt;
 	struct dl_rq		dl;
 
+	struct sched_dl_entity	fair_server;
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this CPU: */
 	struct list_head	leaf_cfs_rq_list;
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-04 10:59 [PATCH v5 0/7] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
                   ` (4 preceding siblings ...)
  2023-11-04 10:59 ` [PATCH v5 5/7] sched/fair: Add trivial fair server Daniel Bristot de Oliveira
@ 2023-11-04 10:59 ` Daniel Bristot de Oliveira
  2023-11-06 14:55   ` Peter Zijlstra
                     ` (4 more replies)
  2023-11-04 10:59 ` [PATCH v5 7/7] sched/fair: Fair server interface Daniel Bristot de Oliveira
                   ` (2 subsequent siblings)
  8 siblings, 5 replies; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-11-04 10:59 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, bristot, Phil Auld

Among the motivations for the DL servers is the real-time throttling
mechanism. This mechanism works by throttling the rt_rq after
running for a long period without leaving space for fair tasks.

The base dl server avoids this problem by boosting fair tasks instead
of throttling the rt_rq. The point is that it boosts without waiting
for potential starvation, causing some non-intuitive cases.

For example, an IRQ dispatches two tasks on an idle system, a fair
and an RT. The DL server will be activated, running the fair task
before the RT one. This problem can be avoided by deferring the
dl server activation.

By setting the zerolax option, the dl_server will dispatch an
SCHED_DEADLINE reservation with replenished runtime, but throttled.

The dl_timer will be set to fire (period - runtime) ns after the start
time, thus boosting the fair rq at its 0-laxity time with respect to
the rt_rq.

If the fair scheduler has the opportunity to run while waiting for the
zerolax time, the dl server runtime will be consumed. If the runtime is
completely consumed before the zerolax time, the server will be
replenished while still in a throttled state. Then, the dl_timer will
be reset to the new zerolax time.

If the fair server reaches the zerolax time without consuming
its runtime, the server will be boosted, following CBS rules
(thus without breaking SCHED_DEADLINE).
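
As a rough illustration of the deferral arithmetic, a standalone sketch
using the fair server defaults from the previous patch (runtime = 50 ms,
period = deadline = 1000 ms); the numbers are illustrative only and the
NSEC_PER_MSEC define below is local to the example:

#include <stdio.h>

#define NSEC_PER_MSEC	1000000ULL

int main(void)
{
	unsigned long long runtime  =   50 * NSEC_PER_MSEC;
	unsigned long long period   = 1000 * NSEC_PER_MSEC;
	unsigned long long start    = 0;			/* activation (rq_clock) */
	unsigned long long deadline = start + period;		/* one period ahead */
	unsigned long long zerolax  = deadline - runtime;	/* dl_timer target */

	printf("runtime consumed by fair tasks before %llu ms: replenish and re-arm\n",
	       zerolax / NSEC_PER_MSEC);
	printf("otherwise: boost at %llu ms, with %llu ms of runtime until the deadline\n",
	       zerolax / NSEC_PER_MSEC, runtime / NSEC_PER_MSEC);
	return 0;
}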

Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
---
 include/linux/sched.h   |   2 +
 kernel/sched/deadline.c | 100 +++++++++++++++++++++++++++++++++++++++-
 kernel/sched/fair.c     |   3 ++
 3 files changed, 103 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5ac1f252e136..56e53e6fd5a0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -660,6 +660,8 @@ struct sched_dl_entity {
 	unsigned int			dl_non_contending : 1;
 	unsigned int			dl_overrun	  : 1;
 	unsigned int			dl_server         : 1;
+	unsigned int			dl_zerolax	  : 1;
+	unsigned int			dl_zerolax_armed  : 1;
 
 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 1d7b96ca9011..69ee1fbd60e4 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -772,6 +772,14 @@ static inline void replenish_dl_new_period(struct sched_dl_entity *dl_se,
 	/* for non-boosted task, pi_of(dl_se) == dl_se */
 	dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
 	dl_se->runtime = pi_of(dl_se)->dl_runtime;
+
+	/*
+	 * If it is a zerolax reservation, throttle it.
+	 */
+	if (dl_se->dl_zerolax) {
+		dl_se->dl_throttled = 1;
+		dl_se->dl_zerolax_armed = 1;
+	}
 }
 
 /*
@@ -828,6 +836,7 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
  * could happen are, typically, a entity voluntarily trying to overcome its
  * runtime, or it just underestimated it during sched_setattr().
  */
+static int start_dl_timer(struct sched_dl_entity *dl_se);
 static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
@@ -874,6 +883,28 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 		dl_se->dl_yielded = 0;
 	if (dl_se->dl_throttled)
 		dl_se->dl_throttled = 0;
+
+	/*
+	 * If this is the replenishment of a zerolax reservation,
+	 * clear the flag and return.
+	 */
+	if (dl_se->dl_zerolax_armed) {
+		dl_se->dl_zerolax_armed = 0;
+		return;
+	}
+
+	/*
+	 * At this point, if the zerolax server is not armed and the deadline
+	 * is in the future, throttle the server and arm the zerolax timer.
+	 */
+	if (dl_se->dl_zerolax &&
+	    dl_time_before(dl_se->deadline - dl_se->runtime, rq_clock(rq))) {
+		if (!is_dl_boosted(dl_se)) {
+			dl_se->dl_zerolax_armed = 1;
+			dl_se->dl_throttled = 1;
+			start_dl_timer(dl_se);
+		}
+	}
 }
 
 /*
@@ -1024,6 +1055,13 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
 		}
 
 		replenish_dl_new_period(dl_se, rq);
+	} else if (dl_server(dl_se) && dl_se->dl_zerolax) {
+		/*
+		 * The server can still use its previous deadline, so throttle
+		 * and arm the zero-laxity timer.
+		 */
+		dl_se->dl_zerolax_armed = 1;
+		dl_se->dl_throttled = 1;
 	}
 }
 
@@ -1056,8 +1094,20 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
 	 * We want the timer to fire at the deadline, but considering
 	 * that it is actually coming from rq->clock and not from
 	 * hrtimer's time base reading.
+	 *
+	 * The zerolax reservation will have its timer set to the
+	 * deadline - runtime. At that point, the CBS rule will decide
+	 * if the current deadline can be used, or if a replenishment
+	 * is required to avoid adding too much pressure on the system
+	 * (current u > U).
 	 */
-	act = ns_to_ktime(dl_next_period(dl_se));
+	if (dl_se->dl_zerolax_armed) {
+		WARN_ON_ONCE(!dl_se->dl_throttled);
+		act = ns_to_ktime(dl_se->deadline - dl_se->runtime);
+	} else {
+		act = ns_to_ktime(dl_next_period(dl_se));
+	}
+
 	now = hrtimer_cb_get_time(timer);
 	delta = ktime_to_ns(now) - rq_clock(rq);
 	act = ktime_add_ns(act, delta);
@@ -1333,6 +1383,9 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
 		return;
 	}
 
+	if (dl_server(dl_se) && dl_se->dl_throttled && !dl_se->dl_zerolax)
+		return;
+
 	if (dl_entity_is_special(dl_se))
 		return;
 
@@ -1356,6 +1409,39 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
 
 	dl_se->runtime -= scaled_delta_exec;
 
+	/*
+	 * The fair server can consume its runtime while throttled (not queued).
+	 *
+	 * If the server consumes its entire runtime in this state, it is not
+	 * required for the current period. Thus, reset the server by
+	 * starting a new period, pushing the activation to the zero-lax time.
+	 */
+	if (dl_se->dl_zerolax && dl_se->dl_throttled && dl_runtime_exceeded(dl_se)) {
+		s64 runtime_diff = dl_se->runtime + dl_se->dl_runtime;
+
+		/*
+		 * If this is a regular throttling case, let it run negative until
+		 * the dl_runtime - runtime > 0. The reason is that the next
+		 * replenishment will result in a positive runtime one period ahead.
+		 *
+		 * Otherwise, the deadline will be pushed more than one period, not
+		 * providing runtime/period anymore.
+		 *
+		 * If the dl_runtime - runtime < 0, then the server was able to get
+		 * the runtime/period before the replenishment. So it is safe
+		 * to start a new deferred period.
+		 */
+		if (!dl_se->dl_zerolax_armed && runtime_diff > 0)
+			return;
+
+		hrtimer_try_to_cancel(&dl_se->dl_timer);
+
+		replenish_dl_new_period(dl_se, dl_se->rq);
+		start_dl_timer(dl_se);
+
+		return;
+	}
+
 throttle:
 	if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
 		dl_se->dl_throttled = 1;
@@ -1432,6 +1518,9 @@ void dl_server_start(struct sched_dl_entity *dl_se)
 void dl_server_stop(struct sched_dl_entity *dl_se)
 {
 	dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
+	hrtimer_try_to_cancel(&dl_se->dl_timer);
+	dl_se->dl_zerolax_armed = 0;
+	dl_se->dl_throttled = 0;
 }
 
 void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
@@ -1743,7 +1832,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 	 * be counted in the active utilization; hence, we need to call
 	 * add_running_bw().
 	 */
-	if (dl_se->dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
+	if (!dl_se->dl_zerolax && dl_se->dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
 		if (flags & ENQUEUE_WAKEUP)
 			task_contending(dl_se, flags);
 
@@ -1765,6 +1854,13 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 		setup_new_dl_entity(dl_se);
 	}
 
+	/*
+	 * If we are still throttled, e.g. we got replenished but are a
+	 * zero-laxity task and still have to wait, don't enqueue.
+	 */
+	if (dl_se->dl_throttled && start_dl_timer(dl_se))
+		return;
+
 	__enqueue_dl_entity(dl_se);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b15f7f376a67..399237cd9f59 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1201,6 +1201,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
 		account_group_exec_runtime(curtask, delta_exec);
 		if (curtask->server)
 			dl_server_update(curtask->server, delta_exec);
+		else
+			dl_server_update(&rq_of(cfs_rq)->fair_server, delta_exec);
 	}
 
 	account_cfs_rq_runtime(cfs_rq, delta_exec);
@@ -8421,6 +8423,7 @@ void fair_server_init(struct rq *rq)
 	dl_se->dl_runtime = 50 * NSEC_PER_MSEC;
 	dl_se->dl_deadline = 1000 * NSEC_PER_MSEC;
 	dl_se->dl_period = 1000 * NSEC_PER_MSEC;
+	dl_se->dl_zerolax = 1;
 
 	dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
 }
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH v5 7/7] sched/fair: Fair server interface
  2023-11-04 10:59 [PATCH v5 0/7] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
                   ` (5 preceding siblings ...)
  2023-11-04 10:59 ` [PATCH v5 6/7] sched/deadline: Deferrable dl server Daniel Bristot de Oliveira
@ 2023-11-04 10:59 ` Daniel Bristot de Oliveira
  2023-11-04 15:18   ` kernel test robot
                     ` (6 more replies)
  2023-12-08 21:47 ` [PATCH v5 0/7] SCHED_DEADLINE server infrastructure Joel Fernandes
  2024-02-19  7:33 ` Huang, Ying
  8 siblings, 7 replies; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-11-04 10:59 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, bristot, Phil Auld

Add an interface for fair server setup on debugfs.

Each rq has three files under /sys/kernel/debug/sched/rq/CPU{ID}:

 - fair_server_runtime: set runtime in ns
 - fair_server_period: set period in ns
 - fair_server_defer: on/off for the defer mechanism
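
A minimal usage sketch (assuming debugfs is mounted at /sys/kernel/debug and
run as root; the cpu0 path follows the per-CPU directory names created by
this patch, and the 20 ms / 100 ms values are only examples): reserve 20 ms
every 100 ms for fair tasks on CPU 0 and keep the defer mechanism enabled.

#include <stdio.h>

static int set_knob(const char *knob, unsigned long long val)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/kernel/debug/sched/rq/cpu0/%s", knob);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%llu\n", val);
	return fclose(f);
}

int main(void)
{
	/* write the period first so runtime <= period holds at every step */
	set_knob("fair_server_period",  100000000ULL);	/* 100 ms, in ns */
	set_knob("fair_server_runtime",  20000000ULL);	/*  20 ms, in ns */
	set_knob("fair_server_defer",    1);		/* defer (zerolax) on */
	return 0;
}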

Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
---
 kernel/sched/deadline.c |  89 +++++++++++++++---
 kernel/sched/debug.c    | 202 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c     |   6 --
 kernel/sched/sched.h    |   2 +
 4 files changed, 279 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 69ee1fbd60e4..1092ca8892e0 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -321,19 +321,12 @@ void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 		__sub_running_bw(dl_se->dl_bw, dl_rq);
 }
 
-static void dl_change_utilization(struct task_struct *p, u64 new_bw)
+static void dl_rq_change_utilization(struct rq *rq, struct sched_dl_entity *dl_se, u64 new_bw)
 {
-	struct rq *rq;
-
-	WARN_ON_ONCE(p->dl.flags & SCHED_FLAG_SUGOV);
-
-	if (task_on_rq_queued(p))
-		return;
+	if (dl_se->dl_non_contending) {
+		sub_running_bw(dl_se, &rq->dl);
+		dl_se->dl_non_contending = 0;
 
-	rq = task_rq(p);
-	if (p->dl.dl_non_contending) {
-		sub_running_bw(&p->dl, &rq->dl);
-		p->dl.dl_non_contending = 0;
 		/*
 		 * If the timer handler is currently running and the
 		 * timer cannot be canceled, inactive_task_timer()
@@ -341,13 +334,25 @@ static void dl_change_utilization(struct task_struct *p, u64 new_bw)
 		 * will not touch the rq's active utilization,
 		 * so we are still safe.
 		 */
-		if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
-			put_task_struct(p);
+		if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1) {
+			if (!dl_server(dl_se))
+				put_task_struct(dl_task_of(dl_se));
+		}
 	}
-	__sub_rq_bw(p->dl.dl_bw, &rq->dl);
+	__sub_rq_bw(dl_se->dl_bw, &rq->dl);
 	__add_rq_bw(new_bw, &rq->dl);
 }
 
+static void dl_change_utilization(struct task_struct *p, u64 new_bw)
+{
+	WARN_ON_ONCE(p->dl.flags & SCHED_FLAG_SUGOV);
+
+	if (task_on_rq_queued(p))
+		return;
+
+	dl_rq_change_utilization(task_rq(p), &p->dl, new_bw);
+}
+
 static void __dl_clear_params(struct sched_dl_entity *dl_se);
 
 /*
@@ -1508,10 +1513,22 @@ void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
 
 void dl_server_start(struct sched_dl_entity *dl_se)
 {
+	/*
+	 * XXX: applying the parameters does not work at the init phase for
+	 * the fair server because things are not yet set up. We need to
+	 * improve this before making it generic.
+	 */
 	if (!dl_server(dl_se)) {
+		u64 runtime = 50 * NSEC_PER_MSEC;
+		u64 period = 1000 * NSEC_PER_MSEC;
+
+		dl_server_apply_params(dl_se, runtime, period, 1);
+
+		dl_se->dl_zerolax = 1;
 		dl_se->dl_server = 1;
 		setup_new_dl_entity(dl_se);
 	}
+
 	enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
 }
 
@@ -1532,6 +1549,50 @@ void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 	dl_se->server_pick = pick;
 }
 
+int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
+{
+	u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
+	u64 new_bw = to_ratio(period, runtime);
+	struct rq *rq = dl_se->rq;
+	int cpu = cpu_of(rq);
+	struct dl_bw *dl_b;
+	unsigned long cap;
+	int retval = 0;
+	int cpus;
+
+	dl_b = dl_bw_of(cpu);
+	raw_spin_lock(&dl_b->lock);
+	cpus = dl_bw_cpus(cpu);
+	cap = dl_bw_capacity(cpu);
+
+	if (__dl_overflow(dl_b, cap, old_bw, new_bw)) {
+		retval = -EBUSY;
+		goto out;
+	}
+
+	if (init) {
+		__add_rq_bw(new_bw, &rq->dl);
+		__dl_add(dl_b, new_bw, cpus);
+	} else {
+		__dl_sub(dl_b, dl_se->dl_bw, cpus);
+		__dl_add(dl_b, new_bw, cpus);
+
+		dl_rq_change_utilization(rq, dl_se, new_bw);
+	}
+
+	rq->fair_server.dl_runtime = runtime;
+	rq->fair_server.dl_deadline  = period;
+	rq->fair_server.dl_period  = period;
+
+	dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
+	dl_se->dl_density = to_ratio(dl_se->dl_deadline, dl_se->dl_runtime);
+
+out:
+	raw_spin_unlock(&dl_b->lock);
+
+	return retval;
+}
+
 /*
  * Update the current task's runtime statistics (provided it is still
  * a -deadline task and has not been removed from the dl_rq).
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 4580a450700e..bd7ad6b8d3de 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -333,8 +333,208 @@ static const struct file_operations sched_debug_fops = {
 	.release	= seq_release,
 };
 
+enum dl_param {
+	DL_RUNTIME = 0,
+	DL_PERIOD,
+	DL_ZEROLAX
+};
+
+static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */
+static unsigned long fair_server_period_min = (100) * NSEC_PER_USEC;     /* 100 us */
+
+static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubuf,
+				       size_t cnt, loff_t *ppos, enum dl_param param)
+{
+	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
+	u64 runtime, period, zerolax;
+	struct rq *rq = cpu_rq(cpu);
+	size_t err;
+	int retval;
+	u64 value;
+
+	err = kstrtoull_from_user(ubuf, cnt, 10, &value);
+	if (err)
+		return err;
+
+	scoped_guard (rq_lock_irqsave, rq) {
+
+		runtime  = rq->fair_server.dl_runtime;
+		period = rq->fair_server.dl_period;
+		zerolax = rq->fair_server.dl_zerolax;
+
+		switch (param) {
+		case DL_RUNTIME:
+			if (runtime == value)
+				goto out;
+			runtime = value;
+			break;
+		case DL_PERIOD:
+			if (value == period)
+				goto out;
+			period = value;
+			break;
+		case DL_ZEROLAX:
+			if (zerolax == value)
+				goto out;
+			zerolax = value;
+			break;
+		}
+
+		if (runtime > period
+				|| period > fair_server_period_max
+				|| period < fair_server_period_min
+				|| zerolax > 1) {
+			cnt = -EINVAL;
+			goto out;
+		}
+
+		if (rq->cfs.h_nr_running) {
+			update_rq_clock(rq);
+			dl_server_stop(&rq->fair_server);
+		}
+
+		/*
+		 * The zerolax does not change utilization, so just
+		 * setting it is enough.
+		 */
+		if (rq->fair_server.dl_zerolax != zerolax) {
+			rq->fair_server.dl_zerolax = zerolax;
+		} else {
+			retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0);
+			if (retval)
+				cnt = retval;
+		}
+
+		if (rq->cfs.h_nr_running)
+			dl_server_start(&rq->fair_server);
+	}
+
+out:
+	*ppos += cnt;
+	return cnt;
+}
+
+static size_t sched_fair_server_show(struct seq_file *m, void *v, enum dl_param param)
+{
+	unsigned long cpu = (unsigned long) m->private;
+	struct rq *rq = cpu_rq(cpu);
+	u64 value;
+
+	switch (param) {
+	case DL_RUNTIME:
+		value = rq->fair_server.dl_runtime;
+		break;
+	case DL_PERIOD:
+		value = rq->fair_server.dl_period;
+		break;
+	case DL_ZEROLAX:
+		value = rq->fair_server.dl_zerolax;
+	}
+
+	seq_printf(m, "%llu\n", value);
+	return 0;
+
+}
+
+static ssize_t
+sched_fair_server_runtime_write(struct file *filp, const char __user *ubuf,
+				size_t cnt, loff_t *ppos)
+{
+	return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_RUNTIME);
+}
+
+static int sched_fair_server_runtime_show(struct seq_file *m, void *v)
+{
+	return sched_fair_server_show(m, v, DL_RUNTIME);
+}
+
+static int sched_fair_server_runtime_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_fair_server_runtime_show, inode->i_private);
+}
+
+static const struct file_operations fair_server_runtime_fops = {
+	.open		= sched_fair_server_runtime_open,
+	.write		= sched_fair_server_runtime_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static ssize_t
+sched_fair_server_period_write(struct file *filp, const char __user *ubuf,
+			       size_t cnt, loff_t *ppos)
+{
+	return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_PERIOD);
+}
+
+static int sched_fair_server_period_show(struct seq_file *m, void *v)
+{
+	return sched_fair_server_show(m, v, DL_PERIOD);
+}
+
+static int sched_fair_server_period_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_fair_server_period_show, inode->i_private);
+}
+
+static const struct file_operations fair_server_period_fops = {
+	.open		= sched_fair_server_period_open,
+	.write		= sched_fair_server_period_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static ssize_t
+sched_fair_server_defer_write(struct file *filp, const char __user *ubuf,
+			      size_t cnt, loff_t *ppos)
+{
+	return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_ZEROLAX);
+}
+
+static int sched_fair_server_defer_show(struct seq_file *m, void *v)
+{
+	return sched_fair_server_show(m, v, DL_ZEROLAX);
+}
+
+static int sched_fair_server_defer_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_fair_server_defer_show, inode->i_private);
+}
+
+static const struct file_operations fair_server_defer_fops = {
+	.open		= sched_fair_server_defer_open,
+	.write		= sched_fair_server_defer_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
 static struct dentry *debugfs_sched;
 
+static void debugfs_fair_server_init(void)
+{
+	long cpu;
+	struct dentry *rq_dentry;
+
+	rq_dentry = debugfs_create_dir("rq", debugfs_sched);
+	if (!rq_dentry)
+		return;
+
+	for_each_possible_cpu(cpu) {
+		struct dentry *d_cpu;
+		char buf[32];
+
+		snprintf(buf, sizeof(buf), "cpu%ld", cpu);
+		d_cpu = debugfs_create_dir(buf, rq_dentry);
+
+		debugfs_create_file("fair_server_runtime", 0644, d_cpu, (void *) cpu, &fair_server_runtime_fops);
+		debugfs_create_file("fair_server_period", 0644, d_cpu, (void *) cpu, &fair_server_period_fops);
+		debugfs_create_file("fair_server_defer", 0644, d_cpu, (void *) cpu, &fair_server_defer_fops);
+	}
+}
+
 static __init int sched_init_debug(void)
 {
 	struct dentry __maybe_unused *numa;
@@ -374,6 +574,8 @@ static __init int sched_init_debug(void)
 
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
 
+	debugfs_fair_server_init();
+
 	return 0;
 }
 late_initcall(sched_init_debug);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 399237cd9f59..5434c52f470d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8419,12 +8419,6 @@ void fair_server_init(struct rq *rq)
 	struct sched_dl_entity *dl_se = &rq->fair_server;
 
 	init_dl_entity(dl_se);
-
-	dl_se->dl_runtime = 50 * NSEC_PER_MSEC;
-	dl_se->dl_deadline = 1000 * NSEC_PER_MSEC;
-	dl_se->dl_period = 1000 * NSEC_PER_MSEC;
-	dl_se->dl_zerolax = 1;
-
 	dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec0e288c8e06..312b31df5860 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -341,6 +341,8 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 		    dl_server_pick_f pick);
 
 extern void fair_server_init(struct rq *);
+extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
+		    u64 runtime, u64 period, bool init);
 
 #ifdef CONFIG_CGROUP_SCHED
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 7/7] sched/fair: Fair server interface
  2023-11-04 10:59 ` [PATCH v5 7/7] sched/fair: Fair server interface Daniel Bristot de Oliveira
@ 2023-11-04 15:18   ` kernel test robot
  2023-11-05  0:55   ` kernel test robot
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 76+ messages in thread
From: kernel test robot @ 2023-11-04 15:18 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot
  Cc: oe-kbuild-all, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	linux-kernel, Luca Abeni, Tommaso Cucinotta, Thomas Gleixner,
	Joel Fernandes, Vineeth Pillai, Shuah Khan, bristot, Phil Auld

Hi Daniel,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/sched/core]
[also build test WARNING on tip/master linus/master next-20231103]
[cannot apply to tip/auto-latest v6.6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Daniel-Bristot-de-Oliveira/sched-Unify-runtime-accounting-across-classes/20231104-201952
base:   tip/sched/core
patch link:    https://lore.kernel.org/r/26adad2378c8b15533e4f6216c2863341e587f57.1699095159.git.bristot%40kernel.org
patch subject: [PATCH v5 7/7] sched/fair: Fair server interface
config: i386-buildonly-randconfig-001-20231104 (https://download.01.org/0day-ci/archive/20231104/202311042329.PB1gTIL4-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231104/202311042329.PB1gTIL4-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311042329.PB1gTIL4-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from kernel/sched/build_utility.c:72:
>> kernel/sched/debug.c:342:57: warning: integer overflow in expression of type 'long int' results in '-100663296' [-Woverflow]
     342 | static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */
         |                                                         ^


vim +342 kernel/sched/debug.c

   341	
 > 342	static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */
   343	static unsigned long fair_server_period_min = (100) * NSEC_PER_USEC;     /* 100 us */
   344	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
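
The warning above comes down to 32-bit arithmetic: (1 << 22) is a plain int
and NSEC_PER_USEC is 1000L, so on i386 the product 4,194,304,000 does not fit
in a 32-bit signed long and wraps to -100663296. One possible fix, sketched
here (not necessarily the change folded into a later revision), is to force
the multiplication to be done on an unsigned long, which can hold the value
even on 32-bit; making the variables u64 with 1ULL << 22 would work as well:

static unsigned long fair_server_period_max = (1UL << 22) * NSEC_PER_USEC; /* ~4 seconds */
static unsigned long fair_server_period_min = (100) * NSEC_PER_USEC;       /* 100 us */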

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 7/7] sched/fair: Fair server interface
  2023-11-04 10:59 ` [PATCH v5 7/7] sched/fair: Fair server interface Daniel Bristot de Oliveira
  2023-11-04 15:18   ` kernel test robot
@ 2023-11-05  0:55   ` kernel test robot
  2023-11-06 15:40   ` Peter Zijlstra
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 76+ messages in thread
From: kernel test robot @ 2023-11-05  0:55 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot
  Cc: oe-kbuild-all, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	linux-kernel, Luca Abeni, Tommaso Cucinotta, Thomas Gleixner,
	Joel Fernandes, Vineeth Pillai, Shuah Khan, bristot, Phil Auld

Hi Daniel,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/sched/core]
[also build test WARNING on tip/master linus/master next-20231103]
[cannot apply to tip/auto-latest v6.6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Daniel-Bristot-de-Oliveira/sched-Unify-runtime-accounting-across-classes/20231104-201952
base:   tip/sched/core
patch link:    https://lore.kernel.org/r/26adad2378c8b15533e4f6216c2863341e587f57.1699095159.git.bristot%40kernel.org
patch subject: [PATCH v5 7/7] sched/fair: Fair server interface
config: i386-randconfig-013-20231105 (https://download.01.org/0day-ci/archive/20231105/202311050844.d45KQ8sK-lkp@intel.com/config)
compiler: gcc-7 (Ubuntu 7.5.0-6ubuntu2) 7.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231105/202311050844.d45KQ8sK-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311050844.d45KQ8sK-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from kernel/sched/build_utility.c:72:0:
>> kernel/sched/debug.c:342:57: warning: integer overflow in expression [-Woverflow]
    static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */
                                                            ^


vim +342 kernel/sched/debug.c

   341	
 > 342	static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */
   343	static unsigned long fair_server_period_min = (100) * NSEC_PER_USEC;     /* 100 us */
   344	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 5/7] sched/fair: Add trivial fair server
  2023-11-04 10:59 ` [PATCH v5 5/7] sched/fair: Add trivial fair server Daniel Bristot de Oliveira
@ 2023-11-06 14:24   ` Peter Zijlstra
  2023-11-06 14:26     ` Daniel Bristot de Oliveira
  0 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2023-11-06 14:24 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld

On Sat, Nov 04, 2023 at 11:59:22AM +0100, Daniel Bristot de Oliveira wrote:

> [ dl_server do not account for rt ]

> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 541d547e1019..1d7b96ca9011 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1382,6 +1382,13 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
>  			resched_curr(rq);
>  	}
>  
> +	/*
> +	 * The fair server (sole dl_server) does not account for real-time
> +	 * workload because it is running fair work.
> +	 */
> +	if (dl_server(dl_se))
> +		return;
> +
>  	/*
>  	 * Because -- for now -- we share the rt bandwidth, we need to
>  	 * account our runtime there too, otherwise actual rt tasks

Should we perhaps write this like so?

	if (dl_se == &rq->fair_server)
		return;


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 5/7] sched/fair: Add trivial fair server
  2023-11-06 14:24   ` Peter Zijlstra
@ 2023-11-06 14:26     ` Daniel Bristot de Oliveira
  0 siblings, 0 replies; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-11-06 14:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld

On 11/6/23 15:24, Peter Zijlstra wrote:
> On Sat, Nov 04, 2023 at 11:59:22AM +0100, Daniel Bristot de Oliveira wrote:
> 
>> [ dl_server do not account for rt ]
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index 541d547e1019..1d7b96ca9011 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -1382,6 +1382,13 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
>>  			resched_curr(rq);
>>  	}
>>  
>> +	/*
>> +	 * The fair server (sole dl_server) does not account for real-time
>> +	 * workload because it is running fair work.
>> +	 */
>> +	if (dl_server(dl_se))
>> +		return;
>> +
>>  	/*
>>  	 * Because -- for now -- we share the rt bandwidth, we need to
>>  	 * account our runtime there too, otherwise actual rt tasks
> Should we perhaps write this like so?
> 
> 	if (dl_se == &rq->fair_server)
> 		return;

right, it is better for the next step (making it generic).

-- Daniel

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-04 10:59 ` [PATCH v5 6/7] sched/deadline: Deferrable dl server Daniel Bristot de Oliveira
@ 2023-11-06 14:55   ` Peter Zijlstra
  2023-11-06 17:05     ` Daniel Bristot de Oliveira
  2023-11-06 19:32   ` Joel Fernandes
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2023-11-06 14:55 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld

On Sat, Nov 04, 2023 at 11:59:23AM +0100, Daniel Bristot de Oliveira wrote:

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b15f7f376a67..399237cd9f59 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1201,6 +1201,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
>  		account_group_exec_runtime(curtask, delta_exec);
>  		if (curtask->server)
>  			dl_server_update(curtask->server, delta_exec);
> +		else
> +			dl_server_update(&rq_of(cfs_rq)->fair_server, delta_exec);
>  	}
>  
>  	account_cfs_rq_runtime(cfs_rq, delta_exec);

I've rewritten that something like so..

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1182,12 +1182,13 @@ s64 update_curr_common(struct rq *rq)
 static void update_curr(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *curr = cfs_rq->curr;
+	struct rq *rq = rq_of(cfs_rq);
 	s64 delta_exec;
 
 	if (unlikely(!curr))
 		return;
 
-	delta_exec = update_curr_se(rq_of(cfs_rq), curr);
+	delta_exec = update_curr_se(rq, curr);
 	if (unlikely(delta_exec <= 0))
 		return;
 
@@ -1195,8 +1196,17 @@ static void update_curr(struct cfs_rq *c
 	update_deadline(cfs_rq, curr);
 	update_min_vruntime(cfs_rq);
 
-	if (entity_is_task(curr))
-		update_curr_task(task_of(curr), delta_exec);
+	if (entity_is_task(curr)) {
+		struct task_struct *p = task_of(curr);
+		update_curr_task(p, delta_exec);
+		/*
+		 * Any fair task that runs outside of fair_server should
+		 * account against fair_server such that it can account for
+		 * this time and possible avoid running this period.
+		 */
+		if (p->dl_server != &rq->fair_server)
+			dl_server_update(&rq->fair_server, delta_exec);
+	}
 
 	account_cfs_rq_runtime(cfs_rq, delta_exec);
 }

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 7/7] sched/fair: Fair server interface
  2023-11-04 10:59 ` [PATCH v5 7/7] sched/fair: Fair server interface Daniel Bristot de Oliveira
  2023-11-04 15:18   ` kernel test robot
  2023-11-05  0:55   ` kernel test robot
@ 2023-11-06 15:40   ` Peter Zijlstra
  2023-11-06 16:29     ` Daniel Bristot de Oliveira
  2023-11-07 12:38   ` Peter Zijlstra
                     ` (3 subsequent siblings)
  6 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2023-11-06 15:40 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld

On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
> Add an interface for fair server setup on debugfs.
> 
> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
> 
>  - fair_server_runtime: set runtime in ns
>  - fair_server_period: set period in ns
>  - fair_server_defer: on/off for the defer mechanism
> 

This then leaves /proc/sys/kernel/sched_rt_{period,runtime}_us to be the
total available bandwidth control, right?

But then shouldn've we also rip out the throttle thingy right quick?

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 7/7] sched/fair: Fair server interface
  2023-11-06 15:40   ` Peter Zijlstra
@ 2023-11-06 16:29     ` Daniel Bristot de Oliveira
  2023-11-07  8:16       ` Peter Zijlstra
  2023-11-07 14:44       ` Peter Zijlstra
  0 siblings, 2 replies; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-11-06 16:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	linux-kernel, Luca Abeni, Tommaso Cucinotta, Thomas Gleixner,
	Joel Fernandes, Vineeth Pillai, Shuah Khan, Phil Auld

On 11/6/23 16:40, Peter Zijlstra wrote:
> On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
>> Add an interface for fair server setup on debugfs.
>>
>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>
>>  - fair_server_runtime: set runtime in ns
>>  - fair_server_period: set period in ns
>>  - fair_server_defer: on/off for the defer mechanism
>>
> 
> This then leaves /proc/sys/kernel/sched_rt_{period,runtime}_us to be the
> total available bandwidth control, right?

right, but thinking aloud... given that the per-cpu files are already allocating the
bandwidth on the dl_rq, the spare time for fair scheduler is granted.

Still, we can have them there as a safeguard to not overloading the deadline
scheduler... (thinking aloud 2) as long as global is a thing... as we get away
from it, that global limitation will make less sense - still better to have a form
of limitation so people are aware of bandwidth until there.

> But then shouldn've we also rip out the throttle thingy right quick?
> 

I was thinking about moving the entire throttling machinery inside CONFIG_RT_GROUP_SCHED
for now, because GROUP_SCHED depends on it, no?

With the next step on moving the dl server as the base for the hierarchical scheduling...
That will rip out the CONFIG_RT_GROUP_SCHED... with a thing with a per-cpu interface.

Does it make sense?



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-06 14:55   ` Peter Zijlstra
@ 2023-11-06 17:05     ` Daniel Bristot de Oliveira
  0 siblings, 0 replies; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-11-06 17:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld

On 11/6/23 15:55, Peter Zijlstra wrote:
> On Sat, Nov 04, 2023 at 11:59:23AM +0100, Daniel Bristot de Oliveira wrote:
> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b15f7f376a67..399237cd9f59 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1201,6 +1201,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
>>  		account_group_exec_runtime(curtask, delta_exec);
>>  		if (curtask->server)
>>  			dl_server_update(curtask->server, delta_exec);
>> +		else
>> +			dl_server_update(&rq_of(cfs_rq)->fair_server, delta_exec);
>>  	}
>>  
>>  	account_cfs_rq_runtime(cfs_rq, delta_exec);
>  
> @@ -1195,8 +1196,17 @@ static void update_curr(struct cfs_rq *c
>  	update_deadline(cfs_rq, curr);
>  	update_min_vruntime(cfs_rq);
>  
> -	if (entity_is_task(curr))
> -		update_curr_task(task_of(curr), delta_exec);
> +	if (entity_is_task(curr)) {
> +		struct task_struct *p = task_of(curr);
> +		update_curr_task(p, delta_exec);
> +		/*
> +		 * Any fair task that runs outside of fair_server should
> +		 * account against fair_server such that it can account for
> +		 * this time and possible avoid running this period.
> +		 */
> +		if (p->dl_server != &rq->fair_server)
> +			dl_server_update(&rq->fair_server, delta_exec);
aren't we missing:
		   else
			dl_server_update(&rq_of(cfs_rq)->fair_server, delta_exec);

or am I missing something? :-)

> +	}
>  
>  	account_cfs_rq_runtime(cfs_rq, delta_exec);
>  }


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-04 10:59 ` [PATCH v5 6/7] sched/deadline: Deferrable dl server Daniel Bristot de Oliveira
  2023-11-06 14:55   ` Peter Zijlstra
@ 2023-11-06 19:32   ` Joel Fernandes
  2023-11-06 21:32     ` Joel Fernandes
  2023-11-07  7:30     ` Daniel Bristot de Oliveira
  2023-11-07 16:37   ` Steven Rostedt
                     ` (2 subsequent siblings)
  4 siblings, 2 replies; 76+ messages in thread
From: Joel Fernandes @ 2023-11-06 19:32 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld

Hi Daniel,

On Sat, Nov 4, 2023 at 6:59 AM Daniel Bristot de Oliveira
<bristot@kernel.org> wrote:
>
> Among the motivations for the DL servers is the real-time throttling
> mechanism. This mechanism works by throttling the rt_rq after
> running for a long period without leaving space for fair tasks.
>
> The base dl server avoids this problem by boosting fair tasks instead
> of throttling the rt_rq. The point is that it boosts without waiting
> for potential starvation, causing some non-intuitive cases.
>
> For example, an IRQ dispatches two tasks on an idle system, a fair
> and an RT. The DL server will be activated, running the fair task
> before the RT one. This problem can be avoided by deferring the
> dl server activation.
>
> By setting the zerolax option, the dl_server will dispatch an
> SCHED_DEADLINE reservation with replenished runtime, but throttled.
>
> The dl_timer will be set for (period - runtime) ns from start time.
> Thus boosting the fair rq on its 0-laxity time with respect to
> rt_rq.
>
> If the fair scheduler has the opportunity to run while waiting
> for zerolax time, the dl server runtime will be consumed. If
> the runtime is completely consumed before the zerolax time, the
> server will be replenished while still in a throttled state. Then,
> the dl_timer will be reset to the new zerolax time
>
> If the fair server reaches the zerolax time without consuming
> its runtime, the server will be boosted, following CBS rules
> (thus without breaking SCHED_DEADLINE).
>
> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
> ---
>  include/linux/sched.h   |   2 +
>  kernel/sched/deadline.c | 100 +++++++++++++++++++++++++++++++++++++++-
>  kernel/sched/fair.c     |   3 ++
>  3 files changed, 103 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 5ac1f252e136..56e53e6fd5a0 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -660,6 +660,8 @@ struct sched_dl_entity {
>         unsigned int                    dl_non_contending : 1;
>         unsigned int                    dl_overrun        : 1;
>         unsigned int                    dl_server         : 1;
> +       unsigned int                    dl_zerolax        : 1;
> +       unsigned int                    dl_zerolax_armed  : 1;
>
>         /*
>          * Bandwidth enforcement timer. Each -deadline task has its
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 1d7b96ca9011..69ee1fbd60e4 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -772,6 +772,14 @@ static inline void replenish_dl_new_period(struct sched_dl_entity *dl_se,
>         /* for non-boosted task, pi_of(dl_se) == dl_se */
>         dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
>         dl_se->runtime = pi_of(dl_se)->dl_runtime;
> +
> +       /*
> +        * If it is a zerolax reservation, throttle it.
> +        */
> +       if (dl_se->dl_zerolax) {
> +               dl_se->dl_throttled = 1;
> +               dl_se->dl_zerolax_armed = 1;
> +       }
>  }
>
>  /*
> @@ -828,6 +836,7 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
>   * could happen are, typically, a entity voluntarily trying to overcome its
>   * runtime, or it just underestimated it during sched_setattr().
>   */
> +static int start_dl_timer(struct sched_dl_entity *dl_se);
>  static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>  {
>         struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> @@ -874,6 +883,28 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>                 dl_se->dl_yielded = 0;
>         if (dl_se->dl_throttled)
>                 dl_se->dl_throttled = 0;
> +
> +       /*
> +        * If this is the replenishment of a zerolax reservation,
> +        * clear the flag and return.
> +        */
> +       if (dl_se->dl_zerolax_armed) {
> +               dl_se->dl_zerolax_armed = 0;
> +               return;
> +       }
> +
> +       /*
> +        * A this point, if the zerolax server is not armed, and the deadline
> +        * is in the future, throttle the server and arm the zerolax timer.
> +        */
> +       if (dl_se->dl_zerolax &&
> +           dl_time_before(dl_se->deadline - dl_se->runtime, rq_clock(rq))) {
> +               if (!is_dl_boosted(dl_se)) {
> +                       dl_se->dl_zerolax_armed = 1;
> +                       dl_se->dl_throttled = 1;
> +                       start_dl_timer(dl_se);
> +               }
> +       }
>  }
>
>  /*
> @@ -1024,6 +1055,13 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
>                 }
>
>                 replenish_dl_new_period(dl_se, rq);
> +       } else if (dl_server(dl_se) && dl_se->dl_zerolax) {
> +               /*
> +                * The server can still use its previous deadline, so throttle
> +                * and arm the zero-laxity timer.
> +                */
> +               dl_se->dl_zerolax_armed = 1;
> +               dl_se->dl_throttled = 1;
>         }
>  }
>
> @@ -1056,8 +1094,20 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
>          * We want the timer to fire at the deadline, but considering
>          * that it is actually coming from rq->clock and not from
>          * hrtimer's time base reading.
> +        *
> +        * The zerolax reservation will have its timer set to the
> +        * deadline - runtime. At that point, the CBS rule will decide
> +        * if the current deadline can be used, or if a replenishment
> +        * is required to avoid add too much pressure on the system
> +        * (current u > U).
>          */
> -       act = ns_to_ktime(dl_next_period(dl_se));
> +       if (dl_se->dl_zerolax_armed) {
> +               WARN_ON_ONCE(!dl_se->dl_throttled);
> +               act = ns_to_ktime(dl_se->deadline - dl_se->runtime);

Just a question: here, if dl_se->deadline - dl_se->runtime is large,
does that mean that the server activation will be much further in the
future? So say I want to give CFS 30%, then it will take 70% of the
period before CFS preempts RT, thus "starving" CFS for this duration. I
think that's OK for smaller periods and runtimes, though.

I think it does reserve the amount of required CFS bandwidth so it is
probably OK, though it is perhaps letting RT run more initially (say
if CFS tasks are not CPU bound and occasionally wake up, they will
always be hit by the 70% latency AFAICS which may be large for large
periods and small runtimes).
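
To put rough numbers on this (an illustration only, not measured behaviour):
in defer mode the boost fires at deadline - runtime, so the reserved share
and the worst-case wait before CFS preempts a CPU-hogging RT task scale as
in the sketch below.

#include <stdio.h>

static void show(double runtime_ms, double period_ms)
{
	printf("runtime %5.1f ms / period %6.1f ms -> %4.1f%% reserved, "
	       "up to %6.1f ms before the boost\n",
	       runtime_ms, period_ms,
	       100.0 * runtime_ms / period_ms,
	       period_ms - runtime_ms);
}

int main(void)
{
	show(30.0, 100.0);	/* the 30% example above */
	show(50.0, 1000.0);	/* the fair server defaults in this series */
	show(3.0, 10.0);	/* same 30% share, much smaller period */
	return 0;
}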

I/we're currently trying these patches on ChromeOS as well.

Just started going over it to understand the patch. Looking nice so
far and thanks,

 - Joel

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-06 19:32   ` Joel Fernandes
@ 2023-11-06 21:32     ` Joel Fernandes
  2023-11-06 21:37       ` Joel Fernandes
  2023-11-07  7:30     ` Daniel Bristot de Oliveira
  1 sibling, 1 reply; 76+ messages in thread
From: Joel Fernandes @ 2023-11-06 21:32 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld

On Mon, Nov 6, 2023 at 2:32 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Hi Daniel,
>
> On Sat, Nov 4, 2023 at 6:59 AM Daniel Bristot de Oliveira
> <bristot@kernel.org> wrote:
> >
> > Among the motivations for the DL servers is the real-time throttling
> > mechanism. This mechanism works by throttling the rt_rq after
> > running for a long period without leaving space for fair tasks.
> >
> > The base dl server avoids this problem by boosting fair tasks instead
> > of throttling the rt_rq. The point is that it boosts without waiting
> > for potential starvation, causing some non-intuitive cases.
> >
> > For example, an IRQ dispatches two tasks on an idle system, a fair
> > and an RT. The DL server will be activated, running the fair task
> > before the RT one. This problem can be avoided by deferring the
> > dl server activation.
> >
> > By setting the zerolax option, the dl_server will dispatch an
> > SCHED_DEADLINE reservation with replenished runtime, but throttled.
> >
> > The dl_timer will be set for (period - runtime) ns from start time.
> > Thus boosting the fair rq on its 0-laxity time with respect to
> > rt_rq.
> >
> > If the fair scheduler has the opportunity to run while waiting
> > for zerolax time, the dl server runtime will be consumed. If
> > the runtime is completely consumed before the zerolax time, the
> > server will be replenished while still in a throttled state. Then,
> > the dl_timer will be reset to the new zerolax time
> >
> > If the fair server reaches the zerolax time without consuming
> > its runtime, the server will be boosted, following CBS rules
> > (thus without breaking SCHED_DEADLINE).
> >
> > Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
> > ---
> >  include/linux/sched.h   |   2 +
> >  kernel/sched/deadline.c | 100 +++++++++++++++++++++++++++++++++++++++-
> >  kernel/sched/fair.c     |   3 ++
> >  3 files changed, 103 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 5ac1f252e136..56e53e6fd5a0 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -660,6 +660,8 @@ struct sched_dl_entity {
> >         unsigned int                    dl_non_contending : 1;
> >         unsigned int                    dl_overrun        : 1;
> >         unsigned int                    dl_server         : 1;
> > +       unsigned int                    dl_zerolax        : 1;
> > +       unsigned int                    dl_zerolax_armed  : 1;
> >
> >         /*
> >          * Bandwidth enforcement timer. Each -deadline task has its
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index 1d7b96ca9011..69ee1fbd60e4 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -772,6 +772,14 @@ static inline void replenish_dl_new_period(struct sched_dl_entity *dl_se,
> >         /* for non-boosted task, pi_of(dl_se) == dl_se */
> >         dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
> >         dl_se->runtime = pi_of(dl_se)->dl_runtime;
> > +
> > +       /*
> > +        * If it is a zerolax reservation, throttle it.
> > +        */
> > +       if (dl_se->dl_zerolax) {
> > +               dl_se->dl_throttled = 1;
> > +               dl_se->dl_zerolax_armed = 1;
> > +       }
> >  }
> >
> >  /*
> > @@ -828,6 +836,7 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
> >   * could happen are, typically, a entity voluntarily trying to overcome its
> >   * runtime, or it just underestimated it during sched_setattr().
> >   */
> > +static int start_dl_timer(struct sched_dl_entity *dl_se);
> >  static void replenish_dl_entity(struct sched_dl_entity *dl_se)
> >  {
> >         struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> > @@ -874,6 +883,28 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
> >                 dl_se->dl_yielded = 0;
> >         if (dl_se->dl_throttled)
> >                 dl_se->dl_throttled = 0;
> > +
> > +       /*
> > +        * If this is the replenishment of a zerolax reservation,
> > +        * clear the flag and return.
> > +        */
> > +       if (dl_se->dl_zerolax_armed) {
> > +               dl_se->dl_zerolax_armed = 0;
> > +               return;
> > +       }
> > +
> > +       /*
> > +        * A this point, if the zerolax server is not armed, and the deadline
> > +        * is in the future, throttle the server and arm the zerolax timer.
> > +        */
> > +       if (dl_se->dl_zerolax &&
> > +           dl_time_before(dl_se->deadline - dl_se->runtime, rq_clock(rq))) {
> > +               if (!is_dl_boosted(dl_se)) {
> > +                       dl_se->dl_zerolax_armed = 1;
> > +                       dl_se->dl_throttled = 1;
> > +                       start_dl_timer(dl_se);
> > +               }
> > +       }
> >  }
> >
> >  /*
> > @@ -1024,6 +1055,13 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
> >                 }
> >
> >                 replenish_dl_new_period(dl_se, rq);
> > +       } else if (dl_server(dl_se) && dl_se->dl_zerolax) {
> > +               /*
> > +                * The server can still use its previous deadline, so throttle
> > +                * and arm the zero-laxity timer.
> > +                */
> > +               dl_se->dl_zerolax_armed = 1;
> > +               dl_se->dl_throttled = 1;
> >         }
> >  }
> >
> > @@ -1056,8 +1094,20 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
> >          * We want the timer to fire at the deadline, but considering
> >          * that it is actually coming from rq->clock and not from
> >          * hrtimer's time base reading.
> > +        *
> > +        * The zerolax reservation will have its timer set to the
> > +        * deadline - runtime. At that point, the CBS rule will decide
> > +        * if the current deadline can be used, or if a replenishment
> > +        * is required to avoid add too much pressure on the system
> > +        * (current u > U).
> >          */
> > -       act = ns_to_ktime(dl_next_period(dl_se));
> > +       if (dl_se->dl_zerolax_armed) {
> > +               WARN_ON_ONCE(!dl_se->dl_throttled);
> > +               act = ns_to_ktime(dl_se->deadline - dl_se->runtime);
>
> Just a question, here if dl_se->deadline - dl_se->runtime is large,
> then does that mean that server activation will be much more into the
> future? So say I want to give CFS 30%, then it will take 70% of the
> period before CFS preempts RT thus "starving" CFS for this duration. I
> think that's Ok for smaller periods and runtimes, though.
>
> I think it does reserve the amount of required CFS bandwidth so it is
> probably OK, though it is perhaps letting RT run more initially (say
> if CFS tasks are not CPU bound and occasionally wake up, they will
> always be hit by the 70% latency AFAICS which may be large for large
> periods and small runtimes).
>

One more consideration I guess is: because the server is throttled
till the 0-laxity time, it is possible that if CFS sleeps even a bit
(after the DL-server is unthrottled), then it will be pushed out to a
full current deadline + period due to CBS. In such a situation, if the
CFS-server is the only DL task running, it might starve RT for a bit
more time.

Example, say CFS runtime is 0.3s and period is 1s. At 0.7s, 0-laxity
timer fires. CFS runs for 0.29s, then sleeps for 0.005s and wakes up
at 0.295s (its remaining runtime is 0.01s at this point which is < the
"time till deadline" of 0.005s). Now the runtime of the CFS-server
will be replenished to the full 3s (due to CBS) and the deadline
pushed out. The end result is the total runtime that the CFS-server
actually gets is 0.0595s (though yes it did sleep for 5ms in between,
still that's tiny -- say if it briefly blocked on a kernel mutex).

On the other hand, if the CFS server started a bit earlier than the
0-laxity time, it would probably not have had CBS pushing it out.

This is likely also not an issue for shorter runtime/period values;
still, keeping the server throttled until later has a small trade-off
(not saying we should not do this, this whole series is likely a huge
improvement over the current RT throttling).

There is a chance I am uttering nonsense as I am not a DL expert, so
apologies if so.

Thanks.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-06 21:32     ` Joel Fernandes
@ 2023-11-06 21:37       ` Joel Fernandes
  2023-11-07 11:58         ` Daniel Bristot de Oliveira
  2023-11-07 16:47         ` Steven Rostedt
  0 siblings, 2 replies; 76+ messages in thread
From: Joel Fernandes @ 2023-11-06 21:37 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld

On Mon, Nov 6, 2023 at 4:32 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Mon, Nov 6, 2023 at 2:32 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > Hi Daniel,
> >
> > On Sat, Nov 4, 2023 at 6:59 AM Daniel Bristot de Oliveira
> > <bristot@kernel.org> wrote:
> > >
> > > Among the motivations for the DL servers is the real-time throttling
> > > mechanism. This mechanism works by throttling the rt_rq after
> > > running for a long period without leaving space for fair tasks.
> > >
> > > The base dl server avoids this problem by boosting fair tasks instead
> > > of throttling the rt_rq. The point is that it boosts without waiting
> > > for potential starvation, causing some non-intuitive cases.
> > >
> > > For example, an IRQ dispatches two tasks on an idle system, a fair
> > > and an RT. The DL server will be activated, running the fair task
> > > before the RT one. This problem can be avoided by deferring the
> > > dl server activation.
> > >
> > > By setting the zerolax option, the dl_server will dispatch an
> > > SCHED_DEADLINE reservation with replenished runtime, but throttled.
> > >
> > > The dl_timer will be set for (period - runtime) ns from start time.
> > > Thus boosting the fair rq on its 0-laxity time with respect to
> > > rt_rq.
> > >
> > > If the fair scheduler has the opportunity to run while waiting
> > > for zerolax time, the dl server runtime will be consumed. If
> > > the runtime is completely consumed before the zerolax time, the
> > > server will be replenished while still in a throttled state. Then,
> > > the dl_timer will be reset to the new zerolax time
> > >
> > > If the fair server reaches the zerolax time without consuming
> > > its runtime, the server will be boosted, following CBS rules
> > > (thus without breaking SCHED_DEADLINE).
> > >
> > > Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
> > > ---
> > >  include/linux/sched.h   |   2 +
> > >  kernel/sched/deadline.c | 100 +++++++++++++++++++++++++++++++++++++++-
> > >  kernel/sched/fair.c     |   3 ++
> > >  3 files changed, 103 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > index 5ac1f252e136..56e53e6fd5a0 100644
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -660,6 +660,8 @@ struct sched_dl_entity {
> > >         unsigned int                    dl_non_contending : 1;
> > >         unsigned int                    dl_overrun        : 1;
> > >         unsigned int                    dl_server         : 1;
> > > +       unsigned int                    dl_zerolax        : 1;
> > > +       unsigned int                    dl_zerolax_armed  : 1;
> > >
> > >         /*
> > >          * Bandwidth enforcement timer. Each -deadline task has its
> > > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > > index 1d7b96ca9011..69ee1fbd60e4 100644
> > > --- a/kernel/sched/deadline.c
> > > +++ b/kernel/sched/deadline.c
> > > @@ -772,6 +772,14 @@ static inline void replenish_dl_new_period(struct sched_dl_entity *dl_se,
> > >         /* for non-boosted task, pi_of(dl_se) == dl_se */
> > >         dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
> > >         dl_se->runtime = pi_of(dl_se)->dl_runtime;
> > > +
> > > +       /*
> > > +        * If it is a zerolax reservation, throttle it.
> > > +        */
> > > +       if (dl_se->dl_zerolax) {
> > > +               dl_se->dl_throttled = 1;
> > > +               dl_se->dl_zerolax_armed = 1;
> > > +       }
> > >  }
> > >
> > >  /*
> > > @@ -828,6 +836,7 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
> > >   * could happen are, typically, a entity voluntarily trying to overcome its
> > >   * runtime, or it just underestimated it during sched_setattr().
> > >   */
> > > +static int start_dl_timer(struct sched_dl_entity *dl_se);
> > >  static void replenish_dl_entity(struct sched_dl_entity *dl_se)
> > >  {
> > >         struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> > > @@ -874,6 +883,28 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
> > >                 dl_se->dl_yielded = 0;
> > >         if (dl_se->dl_throttled)
> > >                 dl_se->dl_throttled = 0;
> > > +
> > > +       /*
> > > +        * If this is the replenishment of a zerolax reservation,
> > > +        * clear the flag and return.
> > > +        */
> > > +       if (dl_se->dl_zerolax_armed) {
> > > +               dl_se->dl_zerolax_armed = 0;
> > > +               return;
> > > +       }
> > > +
> > > +       /*
> > > +        * A this point, if the zerolax server is not armed, and the deadline
> > > +        * is in the future, throttle the server and arm the zerolax timer.
> > > +        */
> > > +       if (dl_se->dl_zerolax &&
> > > +           dl_time_before(dl_se->deadline - dl_se->runtime, rq_clock(rq))) {
> > > +               if (!is_dl_boosted(dl_se)) {
> > > +                       dl_se->dl_zerolax_armed = 1;
> > > +                       dl_se->dl_throttled = 1;
> > > +                       start_dl_timer(dl_se);
> > > +               }
> > > +       }
> > >  }
> > >
> > >  /*
> > > @@ -1024,6 +1055,13 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
> > >                 }
> > >
> > >                 replenish_dl_new_period(dl_se, rq);
> > > +       } else if (dl_server(dl_se) && dl_se->dl_zerolax) {
> > > +               /*
> > > +                * The server can still use its previous deadline, so throttle
> > > +                * and arm the zero-laxity timer.
> > > +                */
> > > +               dl_se->dl_zerolax_armed = 1;
> > > +               dl_se->dl_throttled = 1;
> > >         }
> > >  }
> > >
> > > @@ -1056,8 +1094,20 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
> > >          * We want the timer to fire at the deadline, but considering
> > >          * that it is actually coming from rq->clock and not from
> > >          * hrtimer's time base reading.
> > > +        *
> > > +        * The zerolax reservation will have its timer set to the
> > > +        * deadline - runtime. At that point, the CBS rule will decide
> > > +        * if the current deadline can be used, or if a replenishment
> > > +        * is required to avoid add too much pressure on the system
> > > +        * (current u > U).
> > >          */
> > > -       act = ns_to_ktime(dl_next_period(dl_se));
> > > +       if (dl_se->dl_zerolax_armed) {
> > > +               WARN_ON_ONCE(!dl_se->dl_throttled);
> > > +               act = ns_to_ktime(dl_se->deadline - dl_se->runtime);
> >
> > Just a question, here if dl_se->deadline - dl_se->runtime is large,
> > then does that mean that server activation will be much more into the
> > future? So say I want to give CFS 30%, then it will take 70% of the
> > period before CFS preempts RT thus "starving" CFS for this duration. I
> > think that's Ok for smaller periods and runtimes, though.
> >
> > I think it does reserve the amount of required CFS bandwidth so it is
> > probably OK, though it is perhaps letting RT run more initially (say
> > if CFS tasks are not CPU bound and occasionally wake up, they will
> > always be hit by the 70% latency AFAICS which may be large for large
> > periods and small runtimes).
> >
>
> One more consideration I guess is, because the server is throttled
> till 0-laxity time, it is possible that if CFS sleeps even a bit
> (after the DL-server is unthrottled), then it will be pushed out to a
> full current deadline + period due to CBS. In such a situation,  if
> CFS-server is the only DL task running, it might starve RT for a bit
> more time.
>
> Example, say CFS runtime is 0.3s and period is 1s. At 0.7s, 0-laxity
> timer fires. CFS runs for 0.29s, then sleeps for 0.005s and wakes up
> at 0.295s (its remaining runtime is 0.01s at this point which is < the
> "time till deadline" of 0.005s). Now the runtime of the CFS-server
> will be replenished to the full 3s (due to CBS) and the deadline
> pushed out. The end result is the total runtime that the CFS-server
> actually gets is 0.0595s (though yes it did sleep for 5ms in between,
> still that's tiny -- say if it briefly blocked on a kernel mutex).

Blah, I got lost in decimal points. Here's the example again:

Say CFS-server runtime is 0.3s and period is 1s.

At 0.7s, 0-laxity timer fires. CFS runs for 0.29s, then sleeps for
0.005s and wakes up at 0.295s (its remaining runtime is 0.01s at this
point, which is > the "time till deadline" of 0.005s)

Now the runtime of the CFS-server will be replenished to the full 0.3s
(due to CBS) and the deadline
pushed out.

The end result is, the total runtime that the CFS-server actually gets
is 0.595s (though yes it did sleep for 5ms in between, still that's
tiny -- say if it briefly blocked on a kernel mutex). That's almost
double the allocated runtime.
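
For whoever wants to play with the numbers, here is a tiny standalone
sketch of the CBS wake-up check as I understand it (helper names and the
microsecond units are made up for illustration; this is not the kernel
code):

#include <stdbool.h>
#include <stdio.h>

/* replenish and push the deadline out if the remaining (runtime, deadline)
 * pair would exceed the reserved bandwidth dl_runtime/dl_period */
static bool cbs_overflow(long long runtime, long long deadline, long long now,
			 long long dl_runtime, long long dl_period)
{
	return runtime * dl_period > (deadline - now) * dl_runtime;
}

int main(void)
{
	/* the example above, in microseconds */
	long long dl_runtime = 300000;	/* 0.3s reservation */
	long long dl_period  = 1000000;	/* 1s period */
	long long deadline   = 1000000;	/* end of the first period */
	long long now        = 995000;	/* CFS wakes up at 0.995s */
	long long runtime    = 10000;	/* 0.01s left */

	if (cbs_overflow(runtime, deadline, now, dl_runtime, dl_period))
		printf("replenish to %lld us, deadline pushed to %lld us\n",
		       dl_runtime, now + dl_period);
	else
		printf("keep the current runtime and deadline\n");
	return 0;
}

It prints "replenish to 300000 us, deadline pushed to 1995000 us", which
is the near-double-allocation case described above.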

This is just theoretical and I have yet to see if it is actually an
issue in practice.

Thanks.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-06 19:32   ` Joel Fernandes
  2023-11-06 21:32     ` Joel Fernandes
@ 2023-11-07  7:30     ` Daniel Bristot de Oliveira
  1 sibling, 0 replies; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-11-07  7:30 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld

On 11/6/23 20:32, Joel Fernandes wrote:
> Hi Daniel,
> 
> On Sat, Nov 4, 2023 at 6:59 AM Daniel Bristot de Oliveira
> <bristot@kernel.org> wrote:
>>
>> Among the motivations for the DL servers is the real-time throttling
>> mechanism. This mechanism works by throttling the rt_rq after
>> running for a long period without leaving space for fair tasks.
>>
>> The base dl server avoids this problem by boosting fair tasks instead
>> of throttling the rt_rq. The point is that it boosts without waiting
>> for potential starvation, causing some non-intuitive cases.
>>
>> For example, an IRQ dispatches two tasks on an idle system, a fair
>> and an RT. The DL server will be activated, running the fair task
>> before the RT one. This problem can be avoided by deferring the
>> dl server activation.
>>
>> By setting the zerolax option, the dl_server will dispatch an
>> SCHED_DEADLINE reservation with replenished runtime, but throttled.
>>
>> The dl_timer will be set for (period - runtime) ns from start time.
>> Thus boosting the fair rq on its 0-laxity time with respect to
>> rt_rq.
>>
>> If the fair scheduler has the opportunity to run while waiting
>> for zerolax time, the dl server runtime will be consumed. If
>> the runtime is completely consumed before the zerolax time, the
>> server will be replenished while still in a throttled state. Then,
>> the dl_timer will be reset to the new zerolax time
>>
>> If the fair server reaches the zerolax time without consuming
>> its runtime, the server will be boosted, following CBS rules
>> (thus without breaking SCHED_DEADLINE).
>>
>> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
>> ---
>>  include/linux/sched.h   |   2 +
>>  kernel/sched/deadline.c | 100 +++++++++++++++++++++++++++++++++++++++-
>>  kernel/sched/fair.c     |   3 ++
>>  3 files changed, 103 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 5ac1f252e136..56e53e6fd5a0 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -660,6 +660,8 @@ struct sched_dl_entity {
>>         unsigned int                    dl_non_contending : 1;
>>         unsigned int                    dl_overrun        : 1;
>>         unsigned int                    dl_server         : 1;
>> +       unsigned int                    dl_zerolax        : 1;
>> +       unsigned int                    dl_zerolax_armed  : 1;
>>
>>         /*
>>          * Bandwidth enforcement timer. Each -deadline task has its
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index 1d7b96ca9011..69ee1fbd60e4 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -772,6 +772,14 @@ static inline void replenish_dl_new_period(struct sched_dl_entity *dl_se,
>>         /* for non-boosted task, pi_of(dl_se) == dl_se */
>>         dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
>>         dl_se->runtime = pi_of(dl_se)->dl_runtime;
>> +
>> +       /*
>> +        * If it is a zerolax reservation, throttle it.
>> +        */
>> +       if (dl_se->dl_zerolax) {
>> +               dl_se->dl_throttled = 1;
>> +               dl_se->dl_zerolax_armed = 1;
>> +       }
>>  }
>>
>>  /*
>> @@ -828,6 +836,7 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
>>   * could happen are, typically, a entity voluntarily trying to overcome its
>>   * runtime, or it just underestimated it during sched_setattr().
>>   */
>> +static int start_dl_timer(struct sched_dl_entity *dl_se);
>>  static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>>  {
>>         struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>> @@ -874,6 +883,28 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>>                 dl_se->dl_yielded = 0;
>>         if (dl_se->dl_throttled)
>>                 dl_se->dl_throttled = 0;
>> +
>> +       /*
>> +        * If this is the replenishment of a zerolax reservation,
>> +        * clear the flag and return.
>> +        */
>> +       if (dl_se->dl_zerolax_armed) {
>> +               dl_se->dl_zerolax_armed = 0;
>> +               return;
>> +       }
>> +
>> +       /*
>> +        * A this point, if the zerolax server is not armed, and the deadline
>> +        * is in the future, throttle the server and arm the zerolax timer.
>> +        */
>> +       if (dl_se->dl_zerolax &&
>> +           dl_time_before(dl_se->deadline - dl_se->runtime, rq_clock(rq))) {
>> +               if (!is_dl_boosted(dl_se)) {
>> +                       dl_se->dl_zerolax_armed = 1;
>> +                       dl_se->dl_throttled = 1;
>> +                       start_dl_timer(dl_se);
>> +               }
>> +       }
>>  }
>>
>>  /*
>> @@ -1024,6 +1055,13 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
>>                 }
>>
>>                 replenish_dl_new_period(dl_se, rq);
>> +       } else if (dl_server(dl_se) && dl_se->dl_zerolax) {
>> +               /*
>> +                * The server can still use its previous deadline, so throttle
>> +                * and arm the zero-laxity timer.
>> +                */
>> +               dl_se->dl_zerolax_armed = 1;
>> +               dl_se->dl_throttled = 1;
>>         }
>>  }
>>
>> @@ -1056,8 +1094,20 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
>>          * We want the timer to fire at the deadline, but considering
>>          * that it is actually coming from rq->clock and not from
>>          * hrtimer's time base reading.
>> +        *
>> +        * The zerolax reservation will have its timer set to the
>> +        * deadline - runtime. At that point, the CBS rule will decide
>> +        * if the current deadline can be used, or if a replenishment
>> +        * is required to avoid add too much pressure on the system
>> +        * (current u > U).
>>          */
>> -       act = ns_to_ktime(dl_next_period(dl_se));
>> +       if (dl_se->dl_zerolax_armed) {
>> +               WARN_ON_ONCE(!dl_se->dl_throttled);
>> +               act = ns_to_ktime(dl_se->deadline - dl_se->runtime);
> 
> Just a question, here if dl_se->deadline - dl_se->runtime is large,
> then does that mean that server activation will be much more into the
> future? So say I want to give CFS 30%, then it will take 70% of the
> period before CFS preempts RT thus "starving" CFS for this duration. I
> think that's Ok for smaller periods and runtimes, though.

I think you are answering yourself here :-)

If the default values are not good, change them o/

The current interface allows you to have a more responsive/small chunk of CPU
or a less responsive/large chunk of CPU... you can even place RT below CFS
for a "bounded amount of time" by disabling the defer option... per CPU.
All at once, with different period patterns on CPUs to increase the
chances of having a cfs rq ready on another CPU... like...

[3/10 - 2/6 - 1.5/5 - 1/3 no defer] in a 4 cpus system :-).
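
Just to make that trade-off concrete, reading those pairs as
runtime/period in milliseconds (the units here are only my assumption for
illustration), a quick sketch of what each CPU ends up with:

#include <stdio.h>

int main(void)
{
	/* the [3/10 - 2/6 - 1.5/5 - 1/3 no defer] example above */
	struct { double runtime, period; int defer; } cfg[] = {
		{ 3.0, 10.0, 1 },
		{ 2.0,  6.0, 1 },
		{ 1.5,  5.0, 1 },
		{ 1.0,  3.0, 0 },
	};

	for (int i = 0; i < 4; i++) {
		/* with defer, the boost can arrive up to (period - runtime) late */
		double delay = cfg[i].defer ? cfg[i].period - cfg[i].runtime : 0.0;

		printf("CPU%d: fair utilization %.2f, boost delay up to %.1f ms%s\n",
		       i, cfg[i].runtime / cfg[i].period, delay,
		       cfg[i].defer ? "" : " (no defer)");
	}
	return 0;
}

Roughly 30% of fair bandwidth everywhere, but the worst-case wait for the
deferred boost shrinks as the period does.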

The default setup is based on the throttling to avoid changing
the historical behavior for those that... are happy with them.

-- Daniel
>  - Joel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 7/7] sched/fair: Fair server interface
  2023-11-06 16:29     ` Daniel Bristot de Oliveira
@ 2023-11-07  8:16       ` Peter Zijlstra
  2023-11-07 14:06         ` Daniel Bristot de Oliveira
  2023-11-07 14:44       ` Peter Zijlstra
  1 sibling, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2023-11-07  8:16 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	linux-kernel, Luca Abeni, Tommaso Cucinotta, Thomas Gleixner,
	Joel Fernandes, Vineeth Pillai, Shuah Khan, Phil Auld

On Mon, Nov 06, 2023 at 05:29:49PM +0100, Daniel Bristot de Oliveira wrote:
> On 11/6/23 16:40, Peter Zijlstra wrote:
> > On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
> >> Add an interface for fair server setup on debugfs.
> >>
> >> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
> >>
> >>  - fair_server_runtime: set runtime in ns
> >>  - fair_server_period: set period in ns
> >>  - fair_server_defer: on/off for the defer mechanism
> >>
> > 
> > This then leaves /proc/sys/kernel/sched_rt_{period,runtime}_us to be the
> > total available bandwidth control, right?
> 
> right, but thinking aloud... given that the per-cpu files are already allocating the
> bandwidth on the dl_rq, the spare time for fair scheduler is granted.
> 
> Still, we can have them there as a safeguard to not overloading the deadline
> scheduler... (thinking aloud 2) as long as global is a thing... as we get away
> from it, that global limitation will make less sense - still better to have a form
> of limitation so people are aware of bandwidth until there.

Yeah, so having a limit on the deadline thing seems prudent as a way to
model system overhead. I mean 100% sounds nice, but then all the models
also assume no interrupts, no scheduler or migration overhead etc.. So
setting a slightly lower max seems far more realistic to me.
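
For instance, keeping sched_rt_runtime_us/sched_rt_period_us at something
like 950000/1000000 caps the admitted DL (and server) bandwidth at 95%,
leaving headroom for IRQs and scheduler overhead. A simplified model of
that admission cap (just the arithmetic, not the kernel code):

#include <stdbool.h>
#include <stdio.h>

static bool admit(double admitted_util, double new_util,
		  long rt_runtime_us, long rt_period_us)
{
	double cap = (double)rt_runtime_us / (double)rt_period_us;

	return admitted_util + new_util <= cap;
}

int main(void)
{
	/* 0.6 already admitted, cap at 95% */
	printf("admit 0.3 more: %d\n", admit(0.6, 0.3, 950000, 1000000)); /* 1 */
	printf("admit 0.4 more: %d\n", admit(0.6, 0.4, 950000, 1000000)); /* 0 */
	return 0;
}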

That said, the period/bandwidth thing is now slightly odd, as we really
only care about the utilization. But whatever. One thing at a time.

> > But then shouldn've we also rip out the throttle thingy right quick?
> > 
> 
> I was thinking about moving the entire throttling machinery inside CONFIG_RT_GROUP_SCHED
> for now, because GROUP_SCHED depends on it, no?

Yes. Until we can delete all that code we'll have to keep some of that.

> With the next step on moving the dl server as the base for the
> hierarchical scheduling...  That will rip out the
> CONFIG_RT_GROUP_SCHED... with a thing with a per-cpu interface.
> 
> Does it make sense?

I'm still not sure how to deal with affinities and deadline servers for
RT.

There are a bunch of issues and I think we've only got some of them solved.

The semi-partitioned thing (someone was working on that, I think you
know the guy) solves DL 'entities' having affinities.

But the problem with FIFO is that FIFO tasks don't have inherent bandwidth.
This in turn means that any server for FIFO needs to be minimally concurrent,
otherwise you hand out bandwidth to lower priority tasks that the higher
priority task might want etc.. (Andersson's group has papers here).

Specifically, imagine a server with U=1.5 and 3 tasks, a high prio task
that requires .8 a medium prio task that requires .6 and a low prio task
that soaks up whatever it can get its little grubby paws on.

Then with minimal concurrency this works out nicely: high gets .8, mid
gets .6 and low gets the remaining .1.

If OTOH you don't limit concurrency and let them all run concurrently,
you can end up with the situation where they each get .5. Which is
obviously fail.
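
FWIW, the arithmetic of that example, as a toy only (no actual
scheduling, just the bandwidth bookkeeping):

#include <stdio.h>

int main(void)
{
	double U = 1.5;
	double demand[] = { 0.8, 0.6, 1.0 };	/* high, mid, low (CPU hog) */
	double left = U;

	/* minimal concurrency: hand the budget out in priority order */
	for (int i = 0; i < 3; i++) {
		double got = demand[i] < left ? demand[i] : left;

		left -= got;
		printf("min concurrency: prio %d gets %.1f\n", i, got);
	}

	/* unlimited concurrency: 1.5 smeared over 3 CPUs, 0.5 each */
	for (int i = 0; i < 3; i++) {
		double slice = U / 3.0;
		double got = demand[i] < slice ? demand[i] : slice;

		printf("full concurrency: prio %d gets %.1f\n", i, got);
	}
	return 0;
}

Which prints .8/.6/.1 for the first case and .5/.5/.5 for the second.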

Add affinities here though and you're up a creek: how do you distribute
utilization between the slices, what slices, etc.. You say give them a
per-cpu cgroup interface, and have them configure it themselves, but
that's a god-awful thing to ask userspace to do.

Ideally, I'd delete all of FIFO, it's such a horrid trainwreck, a total
and abysmal failure of a model -- thank you POSIX :-(

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-06 21:37       ` Joel Fernandes
@ 2023-11-07 11:58         ` Daniel Bristot de Oliveira
  2023-11-08  2:42           ` Joel Fernandes
  2023-11-07 16:47         ` Steven Rostedt
  1 sibling, 1 reply; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-11-07 11:58 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld

On 11/6/23 22:37, Joel Fernandes wrote:
> On Mon, Nov 6, 2023 at 4:32 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>>
>> On Mon, Nov 6, 2023 at 2:32 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>>>
>>> Hi Daniel,
>>>
>>> On Sat, Nov 4, 2023 at 6:59 AM Daniel Bristot de Oliveira
>>> <bristot@kernel.org> wrote:
>>>>
>>>> Among the motivations for the DL servers is the real-time throttling
>>>> mechanism. This mechanism works by throttling the rt_rq after
>>>> running for a long period without leaving space for fair tasks.
>>>>
>>>> The base dl server avoids this problem by boosting fair tasks instead
>>>> of throttling the rt_rq. The point is that it boosts without waiting
>>>> for potential starvation, causing some non-intuitive cases.
>>>>
>>>> For example, an IRQ dispatches two tasks on an idle system, a fair
>>>> and an RT. The DL server will be activated, running the fair task
>>>> before the RT one. This problem can be avoided by deferring the
>>>> dl server activation.
>>>>
>>>> By setting the zerolax option, the dl_server will dispatch an
>>>> SCHED_DEADLINE reservation with replenished runtime, but throttled.
>>>>
>>>> The dl_timer will be set for (period - runtime) ns from start time.
>>>> Thus boosting the fair rq on its 0-laxity time with respect to
>>>> rt_rq.
>>>>
>>>> If the fair scheduler has the opportunity to run while waiting
>>>> for zerolax time, the dl server runtime will be consumed. If
>>>> the runtime is completely consumed before the zerolax time, the
>>>> server will be replenished while still in a throttled state. Then,
>>>> the dl_timer will be reset to the new zerolax time
>>>>
>>>> If the fair server reaches the zerolax time without consuming
>>>> its runtime, the server will be boosted, following CBS rules
>>>> (thus without breaking SCHED_DEADLINE).
>>>>
>>>> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
>>>> ---
>>>>  include/linux/sched.h   |   2 +
>>>>  kernel/sched/deadline.c | 100 +++++++++++++++++++++++++++++++++++++++-
>>>>  kernel/sched/fair.c     |   3 ++
>>>>  3 files changed, 103 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>>> index 5ac1f252e136..56e53e6fd5a0 100644
>>>> --- a/include/linux/sched.h
>>>> +++ b/include/linux/sched.h
>>>> @@ -660,6 +660,8 @@ struct sched_dl_entity {
>>>>         unsigned int                    dl_non_contending : 1;
>>>>         unsigned int                    dl_overrun        : 1;
>>>>         unsigned int                    dl_server         : 1;
>>>> +       unsigned int                    dl_zerolax        : 1;
>>>> +       unsigned int                    dl_zerolax_armed  : 1;
>>>>
>>>>         /*
>>>>          * Bandwidth enforcement timer. Each -deadline task has its
>>>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>>>> index 1d7b96ca9011..69ee1fbd60e4 100644
>>>> --- a/kernel/sched/deadline.c
>>>> +++ b/kernel/sched/deadline.c
>>>> @@ -772,6 +772,14 @@ static inline void replenish_dl_new_period(struct sched_dl_entity *dl_se,
>>>>         /* for non-boosted task, pi_of(dl_se) == dl_se */
>>>>         dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
>>>>         dl_se->runtime = pi_of(dl_se)->dl_runtime;
>>>> +
>>>> +       /*
>>>> +        * If it is a zerolax reservation, throttle it.
>>>> +        */
>>>> +       if (dl_se->dl_zerolax) {
>>>> +               dl_se->dl_throttled = 1;
>>>> +               dl_se->dl_zerolax_armed = 1;
>>>> +       }
>>>>  }
>>>>
>>>>  /*
>>>> @@ -828,6 +836,7 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
>>>>   * could happen are, typically, a entity voluntarily trying to overcome its
>>>>   * runtime, or it just underestimated it during sched_setattr().
>>>>   */
>>>> +static int start_dl_timer(struct sched_dl_entity *dl_se);
>>>>  static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>>>>  {
>>>>         struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>>>> @@ -874,6 +883,28 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>>>>                 dl_se->dl_yielded = 0;
>>>>         if (dl_se->dl_throttled)
>>>>                 dl_se->dl_throttled = 0;
>>>> +
>>>> +       /*
>>>> +        * If this is the replenishment of a zerolax reservation,
>>>> +        * clear the flag and return.
>>>> +        */
>>>> +       if (dl_se->dl_zerolax_armed) {
>>>> +               dl_se->dl_zerolax_armed = 0;
>>>> +               return;
>>>> +       }
>>>> +
>>>> +       /*
>>>> +        * A this point, if the zerolax server is not armed, and the deadline
>>>> +        * is in the future, throttle the server and arm the zerolax timer.
>>>> +        */
>>>> +       if (dl_se->dl_zerolax &&
>>>> +           dl_time_before(dl_se->deadline - dl_se->runtime, rq_clock(rq))) {
>>>> +               if (!is_dl_boosted(dl_se)) {
>>>> +                       dl_se->dl_zerolax_armed = 1;
>>>> +                       dl_se->dl_throttled = 1;
>>>> +                       start_dl_timer(dl_se);
>>>> +               }
>>>> +       }
>>>>  }
>>>>
>>>>  /*
>>>> @@ -1024,6 +1055,13 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
>>>>                 }
>>>>
>>>>                 replenish_dl_new_period(dl_se, rq);
>>>> +       } else if (dl_server(dl_se) && dl_se->dl_zerolax) {
>>>> +               /*
>>>> +                * The server can still use its previous deadline, so throttle
>>>> +                * and arm the zero-laxity timer.
>>>> +                */
>>>> +               dl_se->dl_zerolax_armed = 1;
>>>> +               dl_se->dl_throttled = 1;
>>>>         }
>>>>  }
>>>>
>>>> @@ -1056,8 +1094,20 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
>>>>          * We want the timer to fire at the deadline, but considering
>>>>          * that it is actually coming from rq->clock and not from
>>>>          * hrtimer's time base reading.
>>>> +        *
>>>> +        * The zerolax reservation will have its timer set to the
>>>> +        * deadline - runtime. At that point, the CBS rule will decide
>>>> +        * if the current deadline can be used, or if a replenishment
>>>> +        * is required to avoid add too much pressure on the system
>>>> +        * (current u > U).
>>>>          */
>>>> -       act = ns_to_ktime(dl_next_period(dl_se));
>>>> +       if (dl_se->dl_zerolax_armed) {
>>>> +               WARN_ON_ONCE(!dl_se->dl_throttled);
>>>> +               act = ns_to_ktime(dl_se->deadline - dl_se->runtime);
>>>
>>> Just a question, here if dl_se->deadline - dl_se->runtime is large,
>>> then does that mean that server activation will be much more into the
>>> future? So say I want to give CFS 30%, then it will take 70% of the
>>> period before CFS preempts RT thus "starving" CFS for this duration. I
>>> think that's Ok for smaller periods and runtimes, though.
>>>
>>> I think it does reserve the amount of required CFS bandwidth so it is
>>> probably OK, though it is perhaps letting RT run more initially (say
>>> if CFS tasks are not CPU bound and occasionally wake up, they will
>>> always be hit by the 70% latency AFAICS which may be large for large
>>> periods and small runtimes).
>>>
>>
>> One more consideration I guess is, because the server is throttled
>> till 0-laxity time, it is possible that if CFS sleeps even a bit
>> (after the DL-server is unthrottled), then it will be pushed out to a
>> full current deadline + period due to CBS. In such a situation,  if
>> CFS-server is the only DL task running, it might starve RT for a bit
>> more time.
>>
>> Example, say CFS runtime is 0.3s and period is 1s. At 0.7s, 0-laxity
>> timer fires. CFS runs for 0.29s, then sleeps for 0.005s and wakes up
>> at 0.295s (its remaining runtime is 0.01s at this point which is < the
>> "time till deadline" of 0.005s). Now the runtime of the CFS-server
>> will be replenished to the full 3s (due to CBS) and the deadline
>> pushed out. The end result is the total runtime that the CFS-server
>> actually gets is 0.0595s (though yes it did sleep for 5ms in between,
>> still that's tiny -- say if it briefly blocked on a kernel mutex).
> 
> Blah, I got lost in decimal points. Here's the example again:
> 
> Say CFS-server runtime is 0.3s and period is 1s.
> 
> At 0.7s, 0-laxity timer fires. CFS runs for 0.29s, then sleeps for
> 0.005s and wakes up at 0.295s (its remaining runtime is 0.01s at this
> point which is < the "time till deadline" of 0.005s)
> 
> Now the runtime of the CFS-server will be replenished to the full 0.3s
> (due to CBS) and the deadline
> pushed out.
> 
> The end result is, the total runtime that the CFS-server actually gets
> is 0.595s (though yes it did sleep for 5ms in between, still that's
> tiny -- say if it briefly blocked on a kernel mutex). That's almost
> double the allocated runtime.

I think I got what you mean, and I think I took for granted that we were
doing overload control on the replenishment, but it seems that we are not..
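
For illustration only, one form such an overload control could take (a
sketch, not what the posted patch does): on the wakeup of the deferred
server, keep the old deadline but trim the runtime to the bandwidth still
available until it, instead of handing out a full replenishment.

#include <stdio.h>

static long long trim_runtime(long long deadline, long long now,
			      long long dl_runtime, long long dl_period)
{
	/* runtime <= (deadline - now) * dl_runtime / dl_period */
	return (deadline - now) * dl_runtime / dl_period;
}

int main(void)
{
	/* Joel's numbers, in microseconds */
	long long deadline = 1000000, now = 995000;
	long long dl_runtime = 300000, dl_period = 1000000;

	printf("trimmed runtime: %lld us\n",
	       trim_runtime(deadline, now, dl_runtime, dl_period));
	/* 1500 us: the server keeps running, but cannot consume anything
	 * close to a second full replenishment before its deadline */
	return 0;
}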

I just got back from a doctor's appointment; I will do a proper reply later today.

Thanks Joel!
-- Daniel


> This is just theoretical and I have yet to see if it is actually an
> issue in practice.
> 
> Thanks.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 7/7] sched/fair: Fair server interface
  2023-11-04 10:59 ` [PATCH v5 7/7] sched/fair: Fair server interface Daniel Bristot de Oliveira
                     ` (2 preceding siblings ...)
  2023-11-06 15:40   ` Peter Zijlstra
@ 2023-11-07 12:38   ` Peter Zijlstra
  2023-11-07 13:24     ` Daniel Bristot de Oliveira
  2024-01-19  1:49   ` Joel Fernandes
                     ` (2 subsequent siblings)
  6 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2023-11-07 12:38 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld

On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
> Add an interface for fair server setup on debugfs.
> 
> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
> 
>  - fair_server_runtime: set runtime in ns
>  - fair_server_period: set period in ns
>  - fair_server_defer: on/off for the defer mechanism
> 
> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
> ---

I did the below, that gives us:

/debug/sched/fair_server/cpuX/{runtime,period,defer}

I wanted to also add:

/debug/sched/fair_server/{runtime,period,defer}

to more easily set all CPUs to the same value, but then figured
userspace will just have to loop.

---

--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -380,10 +380,10 @@ static ssize_t sched_fair_server_write(s
 			break;
 		}
 
-		if (runtime > period
-				|| period > fair_server_period_max
-				|| period < fair_server_period_min
-				|| zerolax > 1) {
+		if (runtime > period ||
+		    period > fair_server_period_max ||
+		    period < fair_server_period_min ||
+		    zerolax > 1) {
 			cnt = -EINVAL;
 			goto out;
 		}
@@ -515,11 +515,11 @@ static struct dentry *debugfs_sched;
 
 static void debugfs_fair_server_init(void)
 {
-	long cpu;
-	struct dentry *rq_dentry;
+	struct dentry *d_fair;
+	unsigned int cpu;
 
-	rq_dentry = debugfs_create_dir("rq", debugfs_sched);
-	if (!rq_dentry)
+	d_fair = debugfs_create_dir("fair_server", debugfs_sched);
+	if (!d_fair)
 		return;
 
 	for_each_possible_cpu(cpu) {
@@ -527,11 +527,11 @@ static void debugfs_fair_server_init(voi
 		char buf[32];
 
 		snprintf(buf, sizeof(buf), "cpu%ld", cpu);
-		d_cpu = debugfs_create_dir(buf, rq_dentry);
+		d_cpu = debugfs_create_dir(buf, d_fair);
 
-		debugfs_create_file("fair_server_runtime", 0644, d_cpu, (void *) cpu, &fair_server_runtime_fops);
-		debugfs_create_file("fair_server_period", 0644, d_cpu, (void *) cpu, &fair_server_period_fops);
-		debugfs_create_file("fair_server_defer", 0644, d_cpu, (void *) cpu, &fair_server_defer_fops);
+		debugfs_create_file("runtime", 0644, d_cpu, (void *) cpu, &fair_server_runtime_fops);
+		debugfs_create_file("period", 0644, d_cpu, (void *) cpu, &fair_server_period_fops);
+		debugfs_create_file("defer", 0644, d_cpu, (void *) cpu, &fair_server_defer_fops);
 	}
 }
 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 7/7] sched/fair: Fair server interface
  2023-11-07 12:38   ` Peter Zijlstra
@ 2023-11-07 13:24     ` Daniel Bristot de Oliveira
  0 siblings, 0 replies; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-11-07 13:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld

On 11/7/23 13:38, Peter Zijlstra wrote:
> On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
>> Add an interface for fair server setup on debugfs.
>>
>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>
>>  - fair_server_runtime: set runtime in ns
>>  - fair_server_period: set period in ns
>>  - fair_server_defer: on/off for the defer mechanism
>>
>> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
>> ---
> 
> I did the below, that gives us:
> 
> /debug/sched/fair_server/cpuX/{runtime,period,defer}

right!

> I wanted to also add:
> 
> /debug/sched/fair_server/{runtime,period,defer}
> 
> to more easily set all CPUs to the same value, but then figured
> userspace will just have to loop.

I thought about that too... and had the same conclusion... let 'em loop.
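
For reference, such a loop is trivial against the relocated layout
above (a sketch; the debugfs mount point and the example values are
assumptions, not part of the series):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static void set(int cpu, const char *file, long long val)
{
	char path[128];
	int fd;

	snprintf(path, sizeof(path),
		 "/sys/kernel/debug/sched/fair_server/cpu%d/%s", cpu, file);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return;
	dprintf(fd, "%lld\n", val);
	close(fd);
}

int main(void)
{
	long cpus = sysconf(_SC_NPROCESSORS_ONLN);

	for (int cpu = 0; cpu < cpus; cpu++) {
		set(cpu, "runtime",   50000000LL);	/* e.g. 50ms */
		set(cpu, "period",  1000000000LL);	/* 1s */
		set(cpu, "defer", 1);
	}
	return 0;
}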

-- Daniel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 7/7] sched/fair: Fair server interface
  2023-11-07  8:16       ` Peter Zijlstra
@ 2023-11-07 14:06         ` Daniel Bristot de Oliveira
  0 siblings, 0 replies; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-11-07 14:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	linux-kernel, Luca Abeni, Tommaso Cucinotta, Thomas Gleixner,
	Joel Fernandes, Vineeth Pillai, Shuah Khan, Phil Auld

On 11/7/23 09:16, Peter Zijlstra wrote:
> On Mon, Nov 06, 2023 at 05:29:49PM +0100, Daniel Bristot de Oliveira wrote:
>> On 11/6/23 16:40, Peter Zijlstra wrote:
>>> On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
>>>> Add an interface for fair server setup on debugfs.
>>>>
>>>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>>>
>>>>  - fair_server_runtime: set runtime in ns
>>>>  - fair_server_period: set period in ns
>>>>  - fair_server_defer: on/off for the defer mechanism
>>>>
>>>
>>> This then leaves /proc/sys/kernel/sched_rt_{period,runtime}_us to be the
>>> total available bandwidth control, right?
>>
>> right, but thinking aloud... given that the per-cpu files are already allocating the
>> bandwidth on the dl_rq, the spare time for fair scheduler is granted.
>>
>> Still, we can have them there as a safeguard to not overloading the deadline
>> scheduler... (thinking aloud 2) as long as global is a thing... as we get away
>> from it, that global limitation will make less sense - still better to have a form
>> of limitation so people are aware of bandwidth until there.
> 
> Yeah, so having a limit on the deadline thing seems prudent as a way to
> model system overhead. I mean 100% sounds nice, but then all the models
> also assume no interrupts, no scheduler or migration overhead etc.. So
> setting a slightly lower max seems far more realistic to me.
> 
> That said, the period/bandwidth thing is now slightly odd, as we really
> only care about the utilization. But whatever. One thing at a time.

Yep, that is why I am mentioning the generalization as a second phase; it is
a harder problem... But having the rt throttling out of the way by default is
already a good step.

> 
>>> But then shouldn've we also rip out the throttle thingy right quick?
>>>
>>
>> I was thinking about moving the entire throttling machinery inside CONFIG_RT_GROUP_SCHED
>> for now, because GROUP_SCHED depends on it, no?
> 
> Yes. Until we can delete all that code we'll have to keep some of that.
> 
>> With the next step on moving the dl server as the base for the
>> hierarchical scheduling...  That will rip out the
>> CONFIG_RT_GROUP_SCHED... with a thing with a per-cpu interface.
>>
>> Does it make sense?
> 
> I'm still not sure how to deal with affinities and deadline servers for
> RT.
> 
> There's a bunch of issues and I thing we've only got some of them solved.
> 
> The semi-partitioned thing (someone was working on that, I think you
> know the guy), solves DL 'entities' having affinities.

Yep, then having arbitrary affinities is another step towards more flexible models...

> But the problem of FIFO is that they don't have inherent bandwidth. This
> in turn means that any server for FIFO needs to be minimally concurrent,
> otherwise you hand out bandwidth to lower priority tasks that the higher
> priority task might want etc.. (Andersson's group has papers here).
> 
> Specifically, imagine a server with U=1.5 and 3 tasks, a high prio task
> that requires .8 a medium prio task that requires .6 and a low prio task
> that soaks up whatever it can get its little grubby paws on.
> 
> Then with minimal concurreny this works out nicely, high gets .8, mid
> gets .6 and low gets the remaining .1.
> 
> If OTOH you don't limit concurrency and let them all run concurrently,
> you can end up with the situation where they each get .5. Which is
> obviously fail.
> 
> Add affinities here though and you're up a creek, how do you distribute
> utilization between the slices, what slices, etc.. You say given them a
> per-cpu cgroup interface, and have them configure it themselves, but
> that's a god-aweful thing to ask userspace to do.

and yep again... It is definitely a harder topic... but it gets simpler as we do
those other moves...

> Ideally, I'd delete all of FIFO, it's such a horrid trainwreck, a total
> and abysmal failure of a model -- thank you POSIX :-(

-- Daniel

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 7/7] sched/fair: Fair server interface
  2023-11-06 16:29     ` Daniel Bristot de Oliveira
  2023-11-07  8:16       ` Peter Zijlstra
@ 2023-11-07 14:44       ` Peter Zijlstra
  1 sibling, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2023-11-07 14:44 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	linux-kernel, Luca Abeni, Tommaso Cucinotta, Thomas Gleixner,
	Joel Fernandes, Vineeth Pillai, Shuah Khan, Phil Auld

On Mon, Nov 06, 2023 at 05:29:49PM +0100, Daniel Bristot de Oliveira wrote:

> I was thinking about moving the entire throttling machinery inside CONFIG_RT_GROUP_SCHED
> for now, because GROUP_SCHED depends on it, no?

This builds and boots..

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9930,8 +9930,6 @@ void __init sched_init(void)
 #endif /* CONFIG_RT_GROUP_SCHED */
 	}
 
-	init_rt_bandwidth(&def_rt_bandwidth, global_rt_period(), global_rt_runtime());
-
 #ifdef CONFIG_SMP
 	init_defrootdomain();
 #endif
@@ -9986,7 +9984,6 @@ void __init sched_init(void)
 		init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-		rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime;
 #ifdef CONFIG_RT_GROUP_SCHED
 		init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
 #endif
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1480,6 +1480,7 @@ static void update_curr_dl_se(struct rq
 	if (dl_se == &rq->fair_server)
 		return;
 
+#ifdef CONFIG_RT_GROUP_SCHED
 	/*
 	 * Because -- for now -- we share the rt bandwidth, we need to
 	 * account our runtime there too, otherwise actual rt tasks
@@ -1504,6 +1505,7 @@ static void update_curr_dl_se(struct rq
 			rt_rq->rt_time += delta_exec;
 		raw_spin_unlock(&rt_rq->rt_runtime_lock);
 	}
+#endif
 }
 
 void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -8,10 +8,6 @@ int sched_rr_timeslice = RR_TIMESLICE;
 /* More than 4 hours if BW_SHIFT equals 20. */
 static const u64 max_rt_runtime = MAX_BW;
 
-static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun);
-
-struct rt_bandwidth def_rt_bandwidth;
-
 /*
  * period over which we measure -rt task CPU usage in us.
  * default: 1s
@@ -67,6 +63,40 @@ static int __init sched_rt_sysctl_init(v
 late_initcall(sched_rt_sysctl_init);
 #endif
 
+void init_rt_rq(struct rt_rq *rt_rq)
+{
+	struct rt_prio_array *array;
+	int i;
+
+	array = &rt_rq->active;
+	for (i = 0; i < MAX_RT_PRIO; i++) {
+		INIT_LIST_HEAD(array->queue + i);
+		__clear_bit(i, array->bitmap);
+	}
+	/* delimiter for bitsearch: */
+	__set_bit(MAX_RT_PRIO, array->bitmap);
+
+#if defined CONFIG_SMP
+	rt_rq->highest_prio.curr = MAX_RT_PRIO-1;
+	rt_rq->highest_prio.next = MAX_RT_PRIO-1;
+	rt_rq->overloaded = 0;
+	plist_head_init(&rt_rq->pushable_tasks);
+#endif /* CONFIG_SMP */
+	/* We start is dequeued state, because no RT tasks are queued */
+	rt_rq->rt_queued = 0;
+
+#ifdef CONFIG_RT_GROUP_SCHED
+	rt_rq->rt_time = 0;
+	rt_rq->rt_throttled = 0;
+	rt_rq->rt_runtime = 0;
+	raw_spin_lock_init(&rt_rq->rt_runtime_lock);
+#endif
+}
+
+#ifdef CONFIG_RT_GROUP_SCHED
+
+static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun);
+
 static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer)
 {
 	struct rt_bandwidth *rt_b =
@@ -131,35 +161,6 @@ static void start_rt_bandwidth(struct rt
 	do_start_rt_bandwidth(rt_b);
 }
 
-void init_rt_rq(struct rt_rq *rt_rq)
-{
-	struct rt_prio_array *array;
-	int i;
-
-	array = &rt_rq->active;
-	for (i = 0; i < MAX_RT_PRIO; i++) {
-		INIT_LIST_HEAD(array->queue + i);
-		__clear_bit(i, array->bitmap);
-	}
-	/* delimiter for bitsearch: */
-	__set_bit(MAX_RT_PRIO, array->bitmap);
-
-#if defined CONFIG_SMP
-	rt_rq->highest_prio.curr = MAX_RT_PRIO-1;
-	rt_rq->highest_prio.next = MAX_RT_PRIO-1;
-	rt_rq->overloaded = 0;
-	plist_head_init(&rt_rq->pushable_tasks);
-#endif /* CONFIG_SMP */
-	/* We start is dequeued state, because no RT tasks are queued */
-	rt_rq->rt_queued = 0;
-
-	rt_rq->rt_time = 0;
-	rt_rq->rt_throttled = 0;
-	rt_rq->rt_runtime = 0;
-	raw_spin_lock_init(&rt_rq->rt_runtime_lock);
-}
-
-#ifdef CONFIG_RT_GROUP_SCHED
 static void destroy_rt_bandwidth(struct rt_bandwidth *rt_b)
 {
 	hrtimer_cancel(&rt_b->rt_period_timer);
@@ -254,9 +255,6 @@ int alloc_rt_sched_group(struct task_gro
 	if (!tg->rt_se)
 		goto err;
 
-	init_rt_bandwidth(&tg->rt_bandwidth,
-			ktime_to_ns(def_rt_bandwidth.rt_period), 0);
-
 	for_each_possible_cpu(i) {
 		rt_rq = kzalloc_node(sizeof(struct rt_rq),
 				     GFP_KERNEL, cpu_to_node(i));
@@ -605,70 +603,6 @@ static inline struct rt_bandwidth *sched
 	return &rt_rq->tg->rt_bandwidth;
 }
 
-#else /* !CONFIG_RT_GROUP_SCHED */
-
-static inline u64 sched_rt_runtime(struct rt_rq *rt_rq)
-{
-	return rt_rq->rt_runtime;
-}
-
-static inline u64 sched_rt_period(struct rt_rq *rt_rq)
-{
-	return ktime_to_ns(def_rt_bandwidth.rt_period);
-}
-
-typedef struct rt_rq *rt_rq_iter_t;
-
-#define for_each_rt_rq(rt_rq, iter, rq) \
-	for ((void) iter, rt_rq = &rq->rt; rt_rq; rt_rq = NULL)
-
-#define for_each_sched_rt_entity(rt_se) \
-	for (; rt_se; rt_se = NULL)
-
-static inline struct rt_rq *group_rt_rq(struct sched_rt_entity *rt_se)
-{
-	return NULL;
-}
-
-static inline void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
-{
-	struct rq *rq = rq_of_rt_rq(rt_rq);
-
-	if (!rt_rq->rt_nr_running)
-		return;
-
-	enqueue_top_rt_rq(rt_rq);
-	resched_curr(rq);
-}
-
-static inline void sched_rt_rq_dequeue(struct rt_rq *rt_rq)
-{
-	dequeue_top_rt_rq(rt_rq, rt_rq->rt_nr_running);
-}
-
-static inline int rt_rq_throttled(struct rt_rq *rt_rq)
-{
-	return rt_rq->rt_throttled;
-}
-
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
-	return cpu_online_mask;
-}
-
-static inline
-struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
-{
-	return &cpu_rq(cpu)->rt;
-}
-
-static inline struct rt_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
-{
-	return &def_rt_bandwidth;
-}
-
-#endif /* CONFIG_RT_GROUP_SCHED */
-
 bool sched_rt_bandwidth_account(struct rt_rq *rt_rq)
 {
 	struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
@@ -860,7 +794,7 @@ static int do_sched_rt_period_timer(stru
 	const struct cpumask *span;
 
 	span = sched_rt_period_mask();
-#ifdef CONFIG_RT_GROUP_SCHED
+
 	/*
 	 * FIXME: isolated CPUs should really leave the root task group,
 	 * whether they are isolcpus or were isolated via cpusets, lest
@@ -872,7 +806,7 @@ static int do_sched_rt_period_timer(stru
 	 */
 	if (rt_b == &root_task_group.rt_bandwidth)
 		span = cpu_online_mask;
-#endif
+
 	for_each_cpu(i, span) {
 		int enqueue = 0;
 		struct rt_rq *rt_rq = sched_rt_period_rt_rq(rt_b, i);
@@ -939,18 +873,6 @@ static int do_sched_rt_period_timer(stru
 	return idle;
 }
 
-static inline int rt_se_prio(struct sched_rt_entity *rt_se)
-{
-#ifdef CONFIG_RT_GROUP_SCHED
-	struct rt_rq *rt_rq = group_rt_rq(rt_se);
-
-	if (rt_rq)
-		return rt_rq->highest_prio.curr;
-#endif
-
-	return rt_task_of(rt_se)->prio;
-}
-
 static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
 {
 	u64 runtime = sched_rt_runtime(rt_rq);
@@ -994,6 +916,70 @@ static int sched_rt_runtime_exceeded(str
 	return 0;
 }
 
+#else /* !CONFIG_RT_GROUP_SCHED */
+
+typedef struct rt_rq *rt_rq_iter_t;
+
+#define for_each_rt_rq(rt_rq, iter, rq) \
+	for ((void) iter, rt_rq = &rq->rt; rt_rq; rt_rq = NULL)
+
+#define for_each_sched_rt_entity(rt_se) \
+	for (; rt_se; rt_se = NULL)
+
+static inline struct rt_rq *group_rt_rq(struct sched_rt_entity *rt_se)
+{
+	return NULL;
+}
+
+static inline void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
+{
+	struct rq *rq = rq_of_rt_rq(rt_rq);
+
+	if (!rt_rq->rt_nr_running)
+		return;
+
+	enqueue_top_rt_rq(rt_rq);
+	resched_curr(rq);
+}
+
+static inline void sched_rt_rq_dequeue(struct rt_rq *rt_rq)
+{
+	dequeue_top_rt_rq(rt_rq, rt_rq->rt_nr_running);
+}
+
+static inline int rt_rq_throttled(struct rt_rq *rt_rq)
+{
+	return false;
+}
+
+static inline const struct cpumask *sched_rt_period_mask(void)
+{
+	return cpu_online_mask;
+}
+
+static inline
+struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
+{
+	return &cpu_rq(cpu)->rt;
+}
+
+static void __enable_runtime(struct rq *rq) { }
+static void __disable_runtime(struct rq *rq) { }
+
+#endif /* CONFIG_RT_GROUP_SCHED */
+
+static inline int rt_se_prio(struct sched_rt_entity *rt_se)
+{
+#ifdef CONFIG_RT_GROUP_SCHED
+	struct rt_rq *rt_rq = group_rt_rq(rt_se);
+
+	if (rt_rq)
+		return rt_rq->highest_prio.curr;
+#endif
+
+	return rt_task_of(rt_se)->prio;
+}
+
 /*
  * Update the current task's runtime statistics. Skip current tasks that
  * are not in our scheduling class.
@@ -1001,7 +987,6 @@ static int sched_rt_runtime_exceeded(str
 static void update_curr_rt(struct rq *rq)
 {
 	struct task_struct *curr = rq->curr;
-	struct sched_rt_entity *rt_se = &curr->rt;
 	s64 delta_exec;
 
 	if (curr->sched_class != &rt_sched_class)
@@ -1011,6 +996,9 @@ static void update_curr_rt(struct rq *rq
 	if (unlikely(delta_exec <= 0))
 		return;
 
+#ifdef CONFIG_RT_GROUP_SCHED
+	struct sched_rt_entity *rt_se = &curr->rt;
+
 	if (!rt_bandwidth_enabled())
 		return;
 
@@ -1029,6 +1017,7 @@ static void update_curr_rt(struct rq *rq
 				do_start_rt_bandwidth(sched_rt_bandwidth(rt_rq));
 		}
 	}
+#endif
 }
 
 static void
@@ -1185,7 +1174,6 @@ dec_rt_group(struct sched_rt_entity *rt_
 static void
 inc_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 {
-	start_rt_bandwidth(&def_rt_bandwidth);
 }
 
 static inline
@@ -2913,19 +2901,6 @@ int sched_rt_can_attach(struct task_grou
 #ifdef CONFIG_SYSCTL
 static int sched_rt_global_constraints(void)
 {
-	unsigned long flags;
-	int i;
-
-	raw_spin_lock_irqsave(&def_rt_bandwidth.rt_runtime_lock, flags);
-	for_each_possible_cpu(i) {
-		struct rt_rq *rt_rq = &cpu_rq(i)->rt;
-
-		raw_spin_lock(&rt_rq->rt_runtime_lock);
-		rt_rq->rt_runtime = global_rt_runtime();
-		raw_spin_unlock(&rt_rq->rt_runtime_lock);
-	}
-	raw_spin_unlock_irqrestore(&def_rt_bandwidth.rt_runtime_lock, flags);
-
 	return 0;
 }
 #endif /* CONFIG_SYSCTL */
@@ -2945,12 +2920,6 @@ static int sched_rt_global_validate(void
 
 static void sched_rt_do_global(void)
 {
-	unsigned long flags;
-
-	raw_spin_lock_irqsave(&def_rt_bandwidth.rt_runtime_lock, flags);
-	def_rt_bandwidth.rt_runtime = global_rt_runtime();
-	def_rt_bandwidth.rt_period = ns_to_ktime(global_rt_period());
-	raw_spin_unlock_irqrestore(&def_rt_bandwidth.rt_runtime_lock, flags);
 }
 
 static int sched_rt_handler(struct ctl_table *table, int write, void *buffer,
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -698,13 +698,13 @@ struct rt_rq {
 #endif /* CONFIG_SMP */
 	int			rt_queued;
 
+#ifdef CONFIG_RT_GROUP_SCHED
 	int			rt_throttled;
 	u64			rt_time;
 	u64			rt_runtime;
 	/* Nests inside the rq lock: */
 	raw_spinlock_t		rt_runtime_lock;
 
-#ifdef CONFIG_RT_GROUP_SCHED
 	unsigned int		rt_nr_boosted;
 
 	struct rq		*rq;
@@ -2460,7 +2460,6 @@ extern void reweight_task(struct task_st
 extern void resched_curr(struct rq *rq);
 extern void resched_cpu(int cpu);
 
-extern struct rt_bandwidth def_rt_bandwidth;
 extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
 extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-04 10:59 ` [PATCH v5 6/7] sched/deadline: Deferrable dl server Daniel Bristot de Oliveira
  2023-11-06 14:55   ` Peter Zijlstra
  2023-11-06 19:32   ` Joel Fernandes
@ 2023-11-07 16:37   ` Steven Rostedt
  2023-11-13 15:05   ` kernel test robot
  2024-03-20  0:03   ` Joel Fernandes
  4 siblings, 0 replies; 76+ messages in thread
From: Steven Rostedt @ 2023-11-07 16:37 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld

On Sat,  4 Nov 2023 11:59:23 +0100
Daniel Bristot de Oliveira <bristot@kernel.org> wrote:

> @@ -828,6 +836,7 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
>   * could happen are, typically, a entity voluntarily trying to overcome its
>   * runtime, or it just underestimated it during sched_setattr().
>   */
> +static int start_dl_timer(struct sched_dl_entity *dl_se);
>  static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>  {
>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);

Nit, but you really shouldn't have a function prototype declaration right
next to a function, and especially not between the function's comment and
the function itself.
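
For example, something like this (just to illustrate the placement, not a real diff):

/* forward declaration grouped with the others near the top of the file */
static int start_dl_timer(struct sched_dl_entity *dl_se);

/*
 * (the kernel-doc comment stays directly above replenish_dl_entity(),
 * with nothing in between)
 */
static void replenish_dl_entity(struct sched_dl_entity *dl_se)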

-- Steve

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-06 21:37       ` Joel Fernandes
  2023-11-07 11:58         ` Daniel Bristot de Oliveira
@ 2023-11-07 16:47         ` Steven Rostedt
  2023-11-07 17:35           ` Steven Rostedt
                             ` (2 more replies)
  1 sibling, 3 replies; 76+ messages in thread
From: Steven Rostedt @ 2023-11-07 16:47 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Daniel Bristot de Oliveira, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	linux-kernel, Luca Abeni, Tommaso Cucinotta, Thomas Gleixner,
	Vineeth Pillai, Shuah Khan, Phil Auld

On Mon, 6 Nov 2023 16:37:32 -0500
Joel Fernandes <joel@joelfernandes.org> wrote:

> Say CFS-server runtime is 0.3s and period is 1s.
> 
> At 0.7s, 0-laxity timer fires. CFS runs for 0.29s, then sleeps for
> 0.005s and wakes up at 0.295s (its remaining runtime is 0.01s at this
> point which is < the "time till deadline" of 0.005s)
> 
> Now the runtime of the CFS-server will be replenished to the full 0.3s
> (due to CBS) and the deadline
> pushed out.
> 
> The end result is, the total runtime that the CFS-server actually gets
> is 0.595s (though yes it did sleep for 5ms in between, still that's
> tiny -- say if it briefly blocked on a kernel mutex). That's almost
> double the allocated runtime.
> 
> This is just theoretical and I have yet to see if it is actually an
> issue in practice.

Let me see if I understand what you are asking. By pushing the execution of
the CFS-server to the end of its period, if it was briefly blocked and
was not able to consume all of its zerolax time, its bandwidth gets
refreshed. Then it can run again, basically doubling its total time.

But this is basically saying that it ran for its runtime at the start of
one period and at the beginning of another, right?

Is that an issue? The CFS-server is still just consuming its time per
period. That means an RT task was starving the system badly enough to
push it that far forward anyway. I wonder if just documenting this
behavior would be enough?

-- Steve

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-07 16:47         ` Steven Rostedt
@ 2023-11-07 17:35           ` Steven Rostedt
  2023-11-07 17:46             ` Steven Rostedt
  2023-11-07 17:54             ` Steven Rostedt
  2023-11-07 17:37           ` Daniel Bristot de Oliveira
  2023-11-08  2:37           ` Joel Fernandes
  2 siblings, 2 replies; 76+ messages in thread
From: Steven Rostedt @ 2023-11-07 17:35 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Daniel Bristot de Oliveira, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	linux-kernel, Luca Abeni, Tommaso Cucinotta, Thomas Gleixner,
	Vineeth Pillai, Shuah Khan, Phil Auld

[-- Attachment #1: Type: text/plain, Size: 1541 bytes --]

On Tue, 7 Nov 2023 11:47:32 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> Let me see if I understand what you are asking. By pushing the execution of
> the CFS-server to the end of its period, if it was briefly blocked and
> was not able to consume all of its zerolax time, its bandwidth gets
> refreshed. Then it can run again, basically doubling its total time.
> 
> But this is basically saying that it ran for its runtime at the start of
> one period and at the beginning of another, right?
> 
> Is that an issue? The CFS-server is still just consuming its time per
> period. That means an RT task was starving the system badly enough to
> push it that far forward anyway. I wonder if just documenting this
> behavior would be enough?

I may have even captured this scenario.

I ran my migrate[1] program which I use to test RT migration, and it kicks
off a bunch of RT tasks. I like this test because with the
/proc/sys/kernel/sched_rt_* options set, it shows the lines where they are
throttled really well.

This time, I disabled those, and just kept the default:

~# cat /sys/kernel/debug/sched/rq/cpu0/fair_server_defer
1

~# cat /sys/kernel/debug/sched/rq/cpu0/fair_server_period 
1000000000

~# cat /sys/kernel/debug/sched/rq/cpu0/fair_server_runtime 
50000000

And ran my userspin[2] program. And recorded it with:

  trace-cmd record -e sched_switch

The kernelshark output shows the delay from userspin taking up 0.1 seconds
(double the time usually given), with a little preemption in between.

-- Steve

[-- Attachment #2: userspin.png --]
[-- Type: image/png, Size: 307442 bytes --]

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-07 16:47         ` Steven Rostedt
  2023-11-07 17:35           ` Steven Rostedt
@ 2023-11-07 17:37           ` Daniel Bristot de Oliveira
  2023-11-07 18:50             ` Daniel Bristot de Oliveira
  2023-11-08  2:37           ` Joel Fernandes
  2 siblings, 1 reply; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-11-07 17:37 UTC (permalink / raw)
  To: Steven Rostedt, Joel Fernandes
  Cc: Daniel Bristot de Oliveira, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Luca Abeni,
	Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai, Shuah Khan,
	Phil Auld

On 11/7/23 17:47, Steven Rostedt wrote:
> On Mon, 6 Nov 2023 16:37:32 -0500
> Joel Fernandes <joel@joelfernandes.org> wrote:
> 
>> Say CFS-server runtime is 0.3s and period is 1s.
>>
>> At 0.7s, 0-laxity timer fires. CFS runs for 0.29s, then sleeps for
>> 0.005s and wakes up at 0.295s (its remaining runtime is 0.01s at this
>> point which is < the "time till deadline" of 0.005s)
>>
>> Now the runtime of the CFS-server will be replenished to the full 0.3s
>> (due to CBS) and the deadline
>> pushed out.
>>
>> The end result is, the total runtime that the CFS-server actually gets
>> is 0.595s (though yes it did sleep for 5ms in between, still that's
>> tiny -- say if it briefly blocked on a kernel mutex). That's almost
>> double the allocated runtime.
>>
>> This is just theoretical and I have yet to see if it is actually an
>> issue in practice.
> 
> Let me see if I understand what you are asking. By pushing the execution of
> the CFS-server to the end of its period, if it was briefly blocked and
> was not able to consume all of its zerolax time, its bandwidth gets
> refreshed. Then it can run again, basically doubling its total time.
> 
> But this is basically saying that it ran for its runtime at the start of
> one period and at the beginning of another, right?
> 
> Is that an issue? The CFS-server is still just consuming its time per
> period. That means an RT task was starving the system badly enough to
> push it that far forward anyway. I wonder if just documenting this
> behavior would be enough?

The code is not doing what I intended because I thought it was doing overload
control on the replenishment, but it is not (my bad).

He is seeing this timeline:

- w=waiting
- r=running
- s=sleeping
- T=throttled
- 3/10 reservation (30%).

|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww|rrrrrrrrrrrrrrrrrrrrrrrrrrr|s|rrrrrrrrr+rrrrrrrr+rrrrrrrrr|TTTTTTTTTT <CPU
|___________________________period 1_______________________________________________________________|________period 2_______________________ < internal-period
0---------1---------2---------3---------4---------5---------6--------7--------8---------9----------10.......11.......12.........13......... < Real-time

It is not actually that bad because the ~2x runtime is over 2 periods.

But it is not what I intended... I intended this:

|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww|rrrrrrrrrrrrrrrrrrrrrrrrrrrrsr|TTTTTTTTTT[...]TTTTTTTTTTT|rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr|TTTTTTT
|___________________________period 1_________________________________|_________period 2________________________[...]___________|___period 3____________________|[.... internal-period
0---------1---------2---------3---------4---------5---------6--------7--------8---------9----------10.......11.[...]16.........17........18........19........20|[.... < Real-time
---------------------------------------------------------------------+---------------------------------------------------------|
                                                                     |                                                         +new period
                                                                     +30/30>30/100, thus new period.

At replenishment time, if the runtime left/period left > dl_runtime/dl_period,
replenish with a new period to avoid adding too much pressure to CBS/EDF.
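
In C-ish terms, the check I have in mind is roughly the following (an illustrative
sketch only, not the actual patch; it cross-multiplies to stay in integer math and
ignores the u64 scaling that a dl_entity_overflow()-style helper would do):

	/*
	 * Start a fresh period if the bandwidth left in the current period
	 * would exceed the reserved bandwidth, i.e. if
	 *
	 *     runtime / (deadline - now)  >  dl_runtime / dl_period
	 */
	if (dl_se->runtime * pi_of(dl_se)->dl_period >
	    pi_of(dl_se)->dl_runtime * (dl_se->deadline - rq_clock(rq)))
		replenish_dl_new_period(dl_se, rq);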

One might say: but then the task period is different... or out of sync...
but it is not a problem: look at the "real-time" line... the task starts and
runs at "deadline - runtime", emulating the "zerolax"
(note, I do not like the term zerolax here... but (thomas voice:) whatever :-)).

One could say: in the presence of deadline tasks, this timeline will be different...

But that is intentional, as we do not want the fair server to break DL. More
than that, if one has DL tasks, the FIFO latency "property" is already broken, and
they should just disable the defer option...

That is what I mentioned in the changelog:

"If the fair server reaches the zerolax time without consuming
its runtime, the server will be boosted, following CBS rules
(thus without breaking SCHED_DEADLINE)."

By the rule I meant doing the overload check... I thought it was
already there... but it was not... there was no need for it.

I am working on it... it is a simple change (but I need to test).

-- Daniel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-07 17:35           ` Steven Rostedt
@ 2023-11-07 17:46             ` Steven Rostedt
  2023-11-07 17:54             ` Steven Rostedt
  1 sibling, 0 replies; 76+ messages in thread
From: Steven Rostedt @ 2023-11-07 17:46 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Daniel Bristot de Oliveira, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	linux-kernel, Luca Abeni, Tommaso Cucinotta, Thomas Gleixner,
	Vineeth Pillai, Shuah Khan, Phil Auld

On Tue, 7 Nov 2023 12:35:40 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> I ran my migrate[1] program which I use to test RT migration, and it kicks

> 
> And ran my userspin[2] program. And recorded it with:

I forgot to add the [1] and [2]

[1] https://rostedt.org/code/migrate.c
[2] https://rostedt.org/code/userspin.c

-- Steve

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-07 17:35           ` Steven Rostedt
  2023-11-07 17:46             ` Steven Rostedt
@ 2023-11-07 17:54             ` Steven Rostedt
  2023-11-07 19:32               ` Steven Rostedt
  1 sibling, 1 reply; 76+ messages in thread
From: Steven Rostedt @ 2023-11-07 17:54 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Daniel Bristot de Oliveira, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	linux-kernel, Luca Abeni, Tommaso Cucinotta, Thomas Gleixner,
	Vineeth Pillai, Shuah Khan, Phil Auld


What's more interesting, when looking at the userspin task, I see a lot of this:

         migrate-1153  [003]  1272.988097: sched_switch:         migrate:1153 [90] S ==> userspin:1135 [120]
        userspin-1135  [003]  1272.988111: sched_switch:         userspin:1135 [120] R ==> migrate:1146 [97]

userspin sneaks in for 14 microseconds

         migrate-1146  [003]  1272.988141: sched_switch:         migrate:1146 [97] R ==> migrate:1159 [84]
         migrate-1159  [003]  1272.988159: print:                tracing_mark_write: thread 13 iter 15, took lock 15020 in 140726333419648 us
         migrate-1159  [003]  1272.992161: print:                tracing_mark_write: thread 13 iter 15, unlock lock 6
         migrate-1159  [003]  1272.992169: print:                tracing_mark_write: thread 13 iter 15 sleeping
         migrate-1159  [003]  1272.992177: sched_switch:         migrate:1159 [84] S ==> userspin:1135 [120]
        userspin-1135  [003]  1272.992190: sched_switch:         userspin:1135 [120] R ==> migrate:1150 [93]

Again for 13 microseconds.

         migrate-1150  [003]  1272.995118: sched_switch:         migrate:1150 [93] R ==> migrate:1153 [90]
         migrate-1153  [003]  1272.995129: print:                tracing_mark_write: thread 7 iter 15, taking lock 5
         migrate-1153  [003]  1272.995164: print:                tracing_mark_write: thread 7 iter 15, took lock 32 in 140726333419648 us
         migrate-1153  [003]  1273.005166: print:                tracing_mark_write: thread 7 iter 15, unlock lock 5
         migrate-1153  [003]  1273.005174: print:                tracing_mark_write: thread 7 iter 15 sleeping
         migrate-1153  [003]  1273.005183: sched_switch:         migrate:1153 [90] S ==> userspin:1135 [120]
        userspin-1135  [003]  1273.005204: sched_switch:         userspin:1135 [120] R ==> migrate:1159 [84]

For 21 microseconds.

         migrate-1159  [003]  1273.005216: print:                tracing_mark_write: thread 13 iter 15, taking lock 7
         migrate-1159  [003]  1273.005271: print:                tracing_mark_write: thread 13 iter 15, took lock 53 in 140726333419648 us
         migrate-1159  [003]  1273.009273: print:                tracing_mark_write: thread 13 iter 15, unlock lock 7
         migrate-1159  [003]  1273.009281: print:                tracing_mark_write: thread 13 iter 15 sleeping
         migrate-1159  [003]  1273.009289: sched_switch:         migrate:1159 [84] S ==> userspin:1135 [120]
        userspin-1135  [003]  1273.009301: sched_switch:         userspin:1135 [120] R ==> migrate:1147 [96]

12 microseconds

         migrate-1147  [003]  1273.012205: sched_switch:         migrate:1147 [96] R ==> migrate:1153 [90]
         migrate-1153  [003]  1273.012217: print:                tracing_mark_write: thread 7 iter 15, taking lock 6
         migrate-1153  [003]  1273.012228: sched_switch:         migrate:1153 [90] S ==> userspin:1135 [120]
        userspin-1135  [003]  1273.012242: sched_switch:         userspin:1135 [120] R ==> migrate:1146 [97]
         migrate-1146  [003]  1273.014251: sched_switch:         migrate:1146 [97] R ==> migrate:1148 [95]

2 milliseconds (which is probably fine).

         migrate-1148  [003]  1273.020300: print:                tracing_mark_write: thread 2 iter 14, unlock lock 2
         migrate-1148  [003]  1273.020302: print:                tracing_mark_write: thread 2 iter 14 sleeping
         migrate-1148  [003]  1273.020309: sched_switch:         migrate:1148 [95] S ==> userspin:1135 [120]
        userspin-1135  [003]  1273.020324: sched_switch:         userspin:1135 [120] R ==> migrate:1147 [96]

15 microseconds.

         migrate-1147  [003]  1273.020360: print:                tracing_mark_write: thread 1 iter 14, unlock lock 1
         migrate-1147  [003]  1273.020373: print:                tracing_mark_write: thread 1 iter 14 sleeping
         migrate-1147  [003]  1273.020381: sched_switch:         migrate:1147 [96] S ==> userspin:1135 [120]
        userspin-1135  [003]  1273.021397: sched_switch:         userspin:1135 [120] R ==> migrate:1147 [96]

1 millisecond.

         migrate-1147  [003]  1273.021402: print:                tracing_mark_write: thread 1 iter 14, taking lock 2
         migrate-1147  [003]  1273.021404: print:                tracing_mark_write: thread 1 iter 14, took lock 1 in 140726333419648 us
         migrate-1147  [003]  1273.022200: sched_switch:         migrate:1147 [96] R ==> migrate:1152 [91]
         migrate-1152  [003]  1273.022206: print:                tracing_mark_write: thread 6 iter 15, taking lock 6
         migrate-1152  [003]  1273.022217: sched_switch:         migrate:1152 [91] S ==> migrate:1147 [96]
         migrate-1147  [003]  1273.022289: sched_switch:         migrate:1147 [96] R ==> migrate:1159 [84]
         migrate-1159  [003]  1273.022299: print:                tracing_mark_write: thread 13 iter 16, taking lock 0
         migrate-1159  [003]  1273.022326: print:                tracing_mark_write: thread 13 iter 16, took lock 25 in 140726333419648 us
         migrate-1159  [003]  1273.026328: print:                tracing_mark_write: thread 13 iter 16, unlock lock 0
         migrate-1159  [003]  1273.026337: print:                tracing_mark_write: thread 13 iter 16 sleeping
         migrate-1159  [003]  1273.026346: sched_switch:         migrate:1159 [84] S ==> userspin:1135 [120]
        userspin-1135  [003]  1273.026359: sched_switch:         userspin:1135 [120] R ==> migrate:1146 [97]

13 microseconds, and so on...

         migrate-1146  [003]  1273.027170: sched_switch:         migrate:1146 [97] R ==> migrate:1149 [94]
         migrate-1149  [003]  1273.027189: print:                tracing_mark_write: thread 3 iter 14, took lock 1927 in 140726333419648 us
         migrate-1149  [003]  1273.027335: sched_switch:         migrate:1149 [94] R ==> migrate:1153 [90]
         migrate-1153  [003]  1273.027349: print:                tracing_mark_write: thread 7 iter 15, took lock 15130 in 140726333419648 us
         migrate-1153  [003]  1273.037352: print:                tracing_mark_write: thread 7 iter 15, unlock lock 6
         migrate-1153  [003]  1273.037362: print:                tracing_mark_write: thread 7 iter 15 sleeping
         migrate-1153  [003]  1273.037370: sched_switch:         migrate:1153 [90] S ==> userspin:1135 [120]
        userspin-1135  [003]  1273.037395: sched_switch:         userspin:1135 [120] R ==> migrate:1147 [96]
         migrate-1147  [003]  1273.037406: print:                tracing_mark_write: thread 1 iter 14, unlock lock 2
         migrate-1147  [003]  1273.037408: print:                tracing_mark_write: thread 1 iter 14 sleeping
         migrate-1147  [003]  1273.037414: sched_switch:         migrate:1147 [96] S ==> userspin:1135 [120]
        userspin-1135  [003]  1273.038428: sched_switch:         userspin:1135 [120] R ==> migrate:1147 [96]


It looks like it sneaks in when it's about to schedule a new RT task.

Is this expected?

-- Steve

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-07 17:37           ` Daniel Bristot de Oliveira
@ 2023-11-07 18:50             ` Daniel Bristot de Oliveira
  2023-11-08  3:20               ` Joel Fernandes
  2023-11-08 12:44               ` Peter Zijlstra
  0 siblings, 2 replies; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-11-07 18:50 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira, Steven Rostedt, Joel Fernandes
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	linux-kernel, Luca Abeni, Tommaso Cucinotta, Thomas Gleixner,
	Vineeth Pillai, Shuah Khan, Phil Auld

> The code is not doing what I intended because I thought it was doing overload
> control on the replenishment, but it is not (my bad).
> 

I am still testing but... it is missing something like this (famous last words).

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 1092ca8892e0..6e2d21c47a04 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -842,6 +842,8 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
  * runtime, or it just underestimated it during sched_setattr().
  */
 static int start_dl_timer(struct sched_dl_entity *dl_se);
+static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t);
+
 static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
@@ -852,9 +854,18 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 	/*
 	 * This could be the case for a !-dl task that is boosted.
 	 * Just go with full inherited parameters.
+	 *
+	 * Or, it could be the case of a zerolax reservation that
+	 * was not able to consume its runtime in background and
+	 * reached this point with current u > U.
+	 *
+	 * In both cases, set a new period.
 	 */
-	if (dl_se->dl_deadline == 0)
-		replenish_dl_new_period(dl_se, rq);
+	if (dl_se->dl_deadline == 0 ||
+		(dl_se->dl_zerolax_armed && dl_entity_overflow(dl_se, rq_clock(rq)))) {
+			dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
+			dl_se->runtime = pi_of(dl_se)->dl_runtime;
+	}

 	if (dl_se->dl_yielded && dl_se->runtime > 0)
 		dl_se->runtime = 0;


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-07 17:54             ` Steven Rostedt
@ 2023-11-07 19:32               ` Steven Rostedt
  2023-11-07 20:07                 ` Steven Rostedt
  0 siblings, 1 reply; 76+ messages in thread
From: Steven Rostedt @ 2023-11-07 19:32 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Daniel Bristot de Oliveira, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	linux-kernel, Luca Abeni, Tommaso Cucinotta, Thomas Gleixner,
	Vineeth Pillai, Shuah Khan, Phil Auld



Just got this too (with the 20% we talked about on IRC).

 migrate-991     6.....   713.996237: print:                tracing_mark_write: thread 7 iter 3 sleeping

The above is task 991 in userspace writing to trace_marker.

 migrate-991     6d..2.   713.996251: bprint:               __schedule: Pick userspin:973:120

I added the above printk in the core pick_next_task().
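
Something along these lines, for reference (not the exact diff):

	/* illustrative only: print what __schedule() just picked */
	next = pick_next_task(rq, prev, &rf);
	trace_printk("Pick %s:%d:%d\n", next->comm, next->pid, next->prio);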

 migrate-991     6d..2.   713.996254: sched_switch:         migrate:991 [90] S ==> userspin:973 [120]

We switch to userspin for just 16 microseconds, and notice that NEED_RESCHED is
not set.

userspin-973     6dN.2.   713.996270: bprint:               pick_task_rt: Pick RT migrate:988:93

The above printk is in pick_next_task_rt(), and NEED_RESCHED is now set!

userspin-973     6dN.2.   713.996271: bprint:               __schedule: Pick migrate:988:93
userspin-973     6d..2.   713.996272: sched_switch:         userspin:973 [120] R ==> migrate:988 [93]

I'll add your latest patch and see if that's different.

I'll also test this without any of the patches first.

-- Steve

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-07 19:32               ` Steven Rostedt
@ 2023-11-07 20:07                 ` Steven Rostedt
  0 siblings, 0 replies; 76+ messages in thread
From: Steven Rostedt @ 2023-11-07 20:07 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Daniel Bristot de Oliveira, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	linux-kernel, Luca Abeni, Tommaso Cucinotta, Thomas Gleixner,
	Vineeth Pillai, Shuah Khan, Phil Auld

On Tue, 7 Nov 2023 14:32:16 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> I'll also test this without any of the patches first.

And it still happens, and now I know why :-)

Duh! The program is "migrate", which stress-tests how RT tasks migrate
between CPUs when there are more RT tasks running than CPUs to run them on.

This is the push/pull logic in action!

 migrate-958     4d..2.   266.971936: sched_switch:         migrate:958 [89] S ==> userspin:939 [120]

Task 958 of priority 89 (lower is higher) goes to sleep. There are no RT
tasks left for CPU 4 to run, so it runs userspin.

 migrate-953     2d.h2.   266.971938: sched_waking:         comm=migrate pid=957 prio=90 target_cpu=002
 migrate-953     2d..2.   266.971944: sched_switch:         migrate:953 [94] R ==> migrate:957 [90]

On CPU 2, task 957 (prio 90) preempts 953 (prio 94).

userspin-939     4d..2.   266.971953: sched_switch:         userspin:939 [120] R ==> migrate:953 [94]

Now 953 migrates over to CPU 4 as it's currently the CPU running the lowest
priority task.

There are other cases where another CPU was simply overloaded, and when the
RT task on the CPU running userspin went to sleep, it triggered an IPI to the
overloaded CPU to tell it to push a task over here.

All is good. Nothing to see here ;-)

-- Steve

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-07 16:47         ` Steven Rostedt
  2023-11-07 17:35           ` Steven Rostedt
  2023-11-07 17:37           ` Daniel Bristot de Oliveira
@ 2023-11-08  2:37           ` Joel Fernandes
  2 siblings, 0 replies; 76+ messages in thread
From: Joel Fernandes @ 2023-11-08  2:37 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Daniel Bristot de Oliveira, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	linux-kernel, Luca Abeni, Tommaso Cucinotta, Thomas Gleixner,
	Vineeth Pillai, Shuah Khan, Phil Auld

On Tue, Nov 07, 2023 at 11:47:32AM -0500, Steven Rostedt wrote:
> On Mon, 6 Nov 2023 16:37:32 -0500
> Joel Fernandes <joel@joelfernandes.org> wrote:
> 
> > Say CFS-server runtime is 0.3s and period is 1s.
> > 
> > At 0.7s, 0-laxity timer fires. CFS runs for 0.29s, then sleeps for
> > 0.005s and wakes up at 0.295s (its remaining runtime is 0.01s at this
> > point which is < the "time till deadline" of 0.005s)
> > 
> > Now the runtime of the CFS-server will be replenished to the full 0.3s
> > (due to CBS) and the deadline
> > pushed out.
> > 
> > The end result is, the total runtime that the CFS-server actually gets
> > is 0.595s (though yes it did sleep for 5ms in between, still that's
> > tiny -- say if it briefly blocked on a kernel mutex). That's almost
> > double the allocated runtime.
> > 
> > This is just theoretical and I have yet to see if it is actually an
> > issue in practice.
> 
> Let me see if I understand what you are asking. By pushing the execution of
> the CFS-server to the end of its period, if it was briefly blocked and
> was not able to consume all of its zerolax time, its bandwidth gets
> refreshed. Then it can run again, basically doubling its total time.

I think my assumption about what happens during blocking was wrong. If it
blocked, the server is actually stopped via dl_server_stop() and it starts
all over again on enqueue.

That makes me worry about the opposite issue now. If the server restarts
because it blocked briefly, that means it again starts in a throttled state
and has to wait to run till the zero-lax time. If CFS is a 99% load but blocks
very briefly after getting to run a little bit (totalling 1% of the time),
then it won't get 30% because it will keep getting delayed to the new 0-lax
every time it wakes up from its very brief nap. Is that really OK?

> But this is basically saying that it ran for its runtime at the start of
> one period and at the beginning of another, right?

I am not sure if this can happen, but I could be missing something. AFAICS,
there is no scenario where the DL server gets to run at the start of a new
period unless RT is not running. The way the patch is written, whenever the
DL-server runs out of runtime, it gets throttled and a timer is armed to
fire at the beginning of the next period
(update_curr_dl_se() -> dl_runtime_exceeded() -> start_dl_timer()).

In this timer handler (which fires at the beginning of the next period), it
will call replenish_dl_entity() to refresh the runtime and push the period
forward. Then it will throttle the server till the 0-lax time. That means we
always end up running at the 0-lax time when starting a new period if RT is
running, and never at the beginning. Did I miss something?
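
Spelling that flow out as comments (my paraphrase of the patch, not literal code):

	/*
	 * update_curr_dl_se()
	 *   -> dl_runtime_exceeded()    // runtime is depleted
	 *     -> start_dl_timer()       // arm the timer for the next period
	 *
	 * timer handler, at the start of the next period:
	 *   -> replenish_dl_entity()    // refresh runtime, push deadline out
	 *   -> keep the server throttled until the 0-lax time
	 */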

On the other hand, if it does not run out of runtime, it will keep running
within its 0-lax time. We know there is enough time within its 0-lax time for
it to run because when we unthrottled it, we checked for that.

Switching gears, another (most likely theoretical) concern I had is: what if
the 0-lax timer interrupt gets delayed a little bit? Then we will always end
up not having enough 0-lax time and keep requeuing the timer, which means CFS
will always be starved as we keep pushing the execution to the next period's
0-lax time.

Anyway, I guess I'd better get to testing this stuff tomorrow and the day after
on ChromeOS before LPC starts. Personally I feel this is a great first cut and
hope we can get v5 into mainline and iteratively improve. :)

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-07 11:58         ` Daniel Bristot de Oliveira
@ 2023-11-08  2:42           ` Joel Fernandes
  0 siblings, 0 replies; 76+ messages in thread
From: Joel Fernandes @ 2023-11-08  2:42 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld

On Tue, Nov 07, 2023 at 12:58:48PM +0100, Daniel Bristot de Oliveira wrote:
[...]
> >> One more consideration I guess is, because the server is throttled
> >> till 0-laxity time, it is possible that if CFS sleeps even a bit
> >> (after the DL-server is unthrottled), then it will be pushed out to a
> >> full current deadline + period due to CBS. In such a situation,  if
> >> CFS-server is the only DL task running, it might starve RT for a bit
> >> more time.
> >>
> >> Example, say CFS runtime is 0.3s and period is 1s. At 0.7s, 0-laxity
> >> timer fires. CFS runs for 0.29s, then sleeps for 0.005s and wakes up
> >> at 0.295s (its remaining runtime is 0.01s at this point which is < the
> >> "time till deadline" of 0.005s). Now the runtime of the CFS-server
> >> will be replenished to the full 3s (due to CBS) and the deadline
> >> pushed out. The end result is the total runtime that the CFS-server
> >> actually gets is 0.0595s (though yes it did sleep for 5ms in between,
> >> still that's tiny -- say if it briefly blocked on a kernel mutex).
> > 
> > Blah, I got lost in decimal points. Here's the example again:
> > 
> > Say CFS-server runtime is 0.3s and period is 1s.
> > 
> > At 0.7s, 0-laxity timer fires. CFS runs for 0.29s, then sleeps for
> > 0.005s and wakes up at 0.295s (its remaining runtime is 0.01s at this
> > point which is < the "time till deadline" of 0.005s)
> > 
> > Now the runtime of the CFS-server will be replenished to the full 0.3s
> > (due to CBS) and the deadline
> > pushed out.
> > 
> > The end result is, the total runtime that the CFS-server actually gets
> > is 0.595s (though yes it did sleep for 5ms in between, still that's
> > tiny -- say if it briefly blocked on a kernel mutex). That's almost
> > double the allocated runtime.
> 
> I think I got what you mean, and I think I took for granted that we were
> doing overload control on the replenishment, but it seems that we are not..
> 
> I just got back from a doct appt, I will do a proper reply later today.

Ah ok! Thanks Daniel! And hope the appointment went well.

 - Joel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-07 18:50             ` Daniel Bristot de Oliveira
@ 2023-11-08  3:20               ` Joel Fernandes
  2023-11-08  8:01                 ` Daniel Bristot de Oliveira
  2023-11-08 12:44               ` Peter Zijlstra
  1 sibling, 1 reply; 76+ messages in thread
From: Joel Fernandes @ 2023-11-08  3:20 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Daniel Bristot de Oliveira, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld

Hi Daniel,

On Tue, Nov 7, 2023 at 1:50 PM Daniel Bristot de Oliveira
<bristot@kernel.org> wrote:
>
> > The code is not doing what I intended because I thought it was doing overload
> > control on the replenishment, but it is not (my bad).
> >
>
> I am still testing but... it is missing something like this (famous last words).
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 1092ca8892e0..6e2d21c47a04 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -842,6 +842,8 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
>   * runtime, or it just underestimated it during sched_setattr().
>   */
>  static int start_dl_timer(struct sched_dl_entity *dl_se);
> +static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t);
> +
>  static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>  {
>         struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> @@ -852,9 +854,18 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>         /*
>          * This could be the case for a !-dl task that is boosted.
>          * Just go with full inherited parameters.
> +        *
> +        * Or, it could be the case of a zerolax reservation that
> +        * was not able to consume its runtime in background and
> +        * reached this point with current u > U.
> +        *
> +        * In both cases, set a new period.
>          */
> -       if (dl_se->dl_deadline == 0)
> -               replenish_dl_new_period(dl_se, rq);
> +       if (dl_se->dl_deadline == 0 ||
> +               (dl_se->dl_zerolax_armed && dl_entity_overflow(dl_se, rq_clock(rq)))) {
> +                       dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
> +                       dl_se->runtime = pi_of(dl_se)->dl_runtime;
> +       }
>
>         if (dl_se->dl_yielded && dl_se->runtime > 0)
>                 dl_se->runtime = 0;

I was wondering does this mean GRUB needs to be enabled? Otherwise I
can see that "runtime / (deadline - t) > dl_runtime / dl_deadline"
will be true almost all the time due to the constraint of executing at
the 0-lax time.

Because at the 0-lax time, AFAICS this will be 100% > 30% (say if CFS
has a 30% reservation).
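
For reference, as far as I can tell dl_entity_overflow() evaluates that condition
roughly like this (paraphrased from kernel/sched/deadline.c, not copied verbatim;
the DL_SCALE shifts just keep the u64 products from overflowing):

static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
{
	u64 left, right;

	/* runtime / (deadline - t) > dl_runtime / dl_deadline, cross-multiplied */
	left  = (pi_of(dl_se)->dl_deadline >> DL_SCALE) *
		(dl_se->runtime >> DL_SCALE);
	right = ((dl_se->deadline - t) >> DL_SCALE) *
		(pi_of(dl_se)->dl_runtime >> DL_SCALE);

	return dl_time_before(right, left);
}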

And I think even if GRUB is enabled, it is possible other DL tasks may
have reserved bandwidth.

Or is there a subtlety that makes that not possible?

thanks,

 - Joel

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-08  3:20               ` Joel Fernandes
@ 2023-11-08  8:01                 ` Daniel Bristot de Oliveira
  2023-11-08 18:25                   ` Joel Fernandes
  0 siblings, 1 reply; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-11-08  8:01 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Daniel Bristot de Oliveira, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld

On 11/8/23 04:20, Joel Fernandes wrote:
> Hi Daniel,
> 
> On Tue, Nov 7, 2023 at 1:50 PM Daniel Bristot de Oliveira
> <bristot@kernel.org> wrote:
>>
>>> The code is not doing what I intended because I thought it was doing overload
>>> control on the replenishment, but it is not (my bad).
>>>
>>
>> I am still testing but... it is missing something like this (famous last words).
>>
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index 1092ca8892e0..6e2d21c47a04 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -842,6 +842,8 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
>>   * runtime, or it just underestimated it during sched_setattr().
>>   */
>>  static int start_dl_timer(struct sched_dl_entity *dl_se);
>> +static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t);
>> +
>>  static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>>  {
>>         struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>> @@ -852,9 +854,18 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>>         /*
>>          * This could be the case for a !-dl task that is boosted.
>>          * Just go with full inherited parameters.
>> +        *
>> +        * Or, it could be the case of a zerolax reservation that
>> +        * was not able to consume its runtime in background and
>> +        * reached this point with current u > U.
>> +        *
>> +        * In both cases, set a new period.
>>          */
>> -       if (dl_se->dl_deadline == 0)
>> -               replenish_dl_new_period(dl_se, rq);
>> +       if (dl_se->dl_deadline == 0 ||
>> +               (dl_se->dl_zerolax_armed && dl_entity_overflow(dl_se, rq_clock(rq)))) {
>> +                       dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
>> +                       dl_se->runtime = pi_of(dl_se)->dl_runtime;
>> +       }
>>
>>         if (dl_se->dl_yielded && dl_se->runtime > 0)
>>                 dl_se->runtime = 0;
> 
> I was wondering does this mean GRUB needs to be enabled? Otherwise I
> can see that "runtime / (deadline - t) > dl_runtime / dl_deadline"
> will be true almost all the time due to the constraint of executing at
> the 0-lax time.

No GRUB needed. It will only happen if the fair server did not have any chance to run.

If it happens, it is not a problem; see the timeline I posted in the previous
email.

We do not want a zerolax scheduler, because it breaks everything else. It is
a deferred EDF that, looked at from the wall clock, composes a "zerolax-ish" timeline.

> Because at the 0-lax time, AFAICS this will be 100% > 30% (say if CFS
> has a 30% reservation).
> 
> And I think even if GRUB is enabled, it is possible other DL task may
> have reserved bandwidth.
> 
> Or is there a subtlety that makes that not possible?
> 
> thanks,
> 
>  - Joel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-07 18:50             ` Daniel Bristot de Oliveira
  2023-11-08  3:20               ` Joel Fernandes
@ 2023-11-08 12:44               ` Peter Zijlstra
  2023-11-08 12:50                 ` Peter Zijlstra
                                   ` (3 more replies)
  1 sibling, 4 replies; 76+ messages in thread
From: Peter Zijlstra @ 2023-11-08 12:44 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Daniel Bristot de Oliveira, Steven Rostedt, Joel Fernandes,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld

On Tue, Nov 07, 2023 at 07:50:28PM +0100, Daniel Bristot de Oliveira wrote:
> > The code is not doing what I intended because I thought it was doing overload
> > control on the replenishment, but it is not (my bad).
> > 
> 
> I am still testing but... it is missing something like this (famous last words).
> 
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 1092ca8892e0..6e2d21c47a04 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -842,6 +842,8 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
>   * runtime, or it just underestimated it during sched_setattr().
>   */
>  static int start_dl_timer(struct sched_dl_entity *dl_se);
> +static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t);
> +
>  static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>  {
>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> @@ -852,9 +854,18 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>  	/*
>  	 * This could be the case for a !-dl task that is boosted.
>  	 * Just go with full inherited parameters.
> +	 *
> +	 * Or, it could be the case of a zerolax reservation that
> +	 * was not able to consume its runtime in background and
> +	 * reached this point with current u > U.
> +	 *
> +	 * In both cases, set a new period.
>  	 */
> -	if (dl_se->dl_deadline == 0)
> -		replenish_dl_new_period(dl_se, rq);
> +	if (dl_se->dl_deadline == 0 ||
> +		(dl_se->dl_zerolax_armed && dl_entity_overflow(dl_se, rq_clock(rq)))) {
> +			dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
> +			dl_se->runtime = pi_of(dl_se)->dl_runtime;
> +	}
> 
>  	if (dl_se->dl_yielded && dl_se->runtime > 0)
>  		dl_se->runtime = 0;

Should we rather not cap the runtime, something like so?

Because the above also causes period drift, which we do not want.

---
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 58b542bf2893..1453a2cd0680 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -829,10 +829,12 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
  */
 static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 {
+	struct sched_dl_entity *pi_se = pi_of(dl_se);
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
+	u64 dl_runtime = pi_se->dl_runtime;
 
-	WARN_ON_ONCE(pi_of(dl_se)->dl_runtime <= 0);
+	WARN_ON_ONCE(dl_runtime <= 0);
 
 	/*
 	 * This could be the case for a !-dl task that is boosted.
@@ -851,10 +853,13 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 	 * arbitrary large.
 	 */
 	while (dl_se->runtime <= 0) {
-		dl_se->deadline += pi_of(dl_se)->dl_period;
-		dl_se->runtime += pi_of(dl_se)->dl_runtime;
+		dl_se->deadline += pi_se->dl_period;
+		dl_se->runtime += dl_runtime;
 	}
 
+	if (dl_se->zerolax && dl_se->runtime > dl_runtime)
+		dl_se->runtime = dl_runtime;
+
 	/*
 	 * At this point, the deadline really should be "in
 	 * the future" with respect to rq->clock. If it's

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-08 12:44               ` Peter Zijlstra
@ 2023-11-08 12:50                 ` Peter Zijlstra
  2023-11-08 14:52                   ` Daniel Bristot de Oliveira
  2023-11-08 13:46                 ` Daniel Bristot de Oliveira
                                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2023-11-08 12:50 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Daniel Bristot de Oliveira, Steven Rostedt, Joel Fernandes,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld

On Wed, Nov 08, 2023 at 01:44:01PM +0100, Peter Zijlstra wrote:

> Should we rather not cap the runtime, something like so?
> 

Clearly I should've done the patch against a tree that includes the
changes... 

> ---
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 58b542bf2893..1453a2cd0680 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -829,10 +829,12 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
>   */
>  static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>  {
> +	struct sched_dl_entity *pi_se = pi_of(dl_se);
>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>  	struct rq *rq = rq_of_dl_rq(dl_rq);
> +	u64 dl_runtime = pi_se->dl_runtime;
>  
> -	WARN_ON_ONCE(pi_of(dl_se)->dl_runtime <= 0);
> +	WARN_ON_ONCE(dl_runtime <= 0);
>  
>  	/*
>  	 * This could be the case for a !-dl task that is boosted.
> @@ -851,10 +853,13 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>  	 * arbitrary large.
>  	 */
>  	while (dl_se->runtime <= 0) {
> -		dl_se->deadline += pi_of(dl_se)->dl_period;
> -		dl_se->runtime += pi_of(dl_se)->dl_runtime;
> +		dl_se->deadline += pi_se->dl_period;
> +		dl_se->runtime += dl_runtime;
>  	}
>  
> +	if (dl_se->zerolax && dl_se->runtime > dl_runtime)
> +		dl_se->runtime = dl_runtime;
> +

This should of course go in the if (dl_se->dl_zerolax_armed) branch a
little down from here.

>  	/*
>  	 * At this point, the deadline really should be "in
>  	 * the future" with respect to rq->clock. If it's

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-08 12:44               ` Peter Zijlstra
  2023-11-08 12:50                 ` Peter Zijlstra
@ 2023-11-08 13:46                 ` Daniel Bristot de Oliveira
  2023-11-08 13:58                 ` Daniel Bristot de Oliveira
  2023-11-08 15:14                 ` Juri Lelli
  3 siblings, 0 replies; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-11-08 13:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Daniel Bristot de Oliveira, Steven Rostedt, Joel Fernandes,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld

On 11/8/23 13:44, Peter Zijlstra wrote:
> Because the above also causes period drift, which we do not want.

The period drift is not a problem when we do not have DL tasks because
... we do not have dl tasks. The task will still run for its runtime.

But not doing the period drift is bad if we have DL tasks because we
break the (current u <= U) rule... which breaks CBS/EDF.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-08 12:44               ` Peter Zijlstra
  2023-11-08 12:50                 ` Peter Zijlstra
  2023-11-08 13:46                 ` Daniel Bristot de Oliveira
@ 2023-11-08 13:58                 ` Daniel Bristot de Oliveira
  2023-11-08 15:14                 ` Juri Lelli
  3 siblings, 0 replies; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-11-08 13:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Daniel Bristot de Oliveira, Steven Rostedt, Joel Fernandes,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld

On 11/8/23 13:44, Peter Zijlstra wrote:
> Should we rather not cap the runtime, something like so?
> 
> Because the above also causes period drift, which we do not want.

Like in the example I showed before:

- 3/10 reservation (30%).
- w=waiting
- r=running
- s=sleeping
- T=throttled
- fair server dispatched at 0, starvation from RT.


|wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww|rrrrrrrrrrrrrrrrrrrrrrrrrrrrrr|TTTTTTTTTT[...]TTTTTTTTTTT|rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr|TTTTTTT
|___________________________period 1_________________________________|_________period 2________________________[...]___________|___period 3____________________|[.... internal-period
0---------1---------2---------3---------4---------5---------6--------7--------8---------9----------10.......11.[...]16.........17........18........19........20|[.... < Real-time
---------------------------------------------------------------------+---------------------------------------------------------|
                                                                     |                                                         +new period

From "real-world/wall clock" the internal period shift produces the
"zerolax" timeline. It runs 3 units of time before the 10's.

If one has a mix of DL and FIFO tasks and wants to enforce
a given response time for the fair server, they can reduce the
fair server period to achieve that.

-- Daniel

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-08 12:50                 ` Peter Zijlstra
@ 2023-11-08 14:52                   ` Daniel Bristot de Oliveira
  0 siblings, 0 replies; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-11-08 14:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Daniel Bristot de Oliveira, Steven Rostedt, Joel Fernandes,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld

On 11/8/23 13:50, Peter Zijlstra wrote:
>> ---
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index 58b542bf2893..1453a2cd0680 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -829,10 +829,12 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
>>   */
>>  static void replenish_dl_entity(struct sched_dl_entity *dl_se)>>  {

assuming starving RT and 3/10 params:

it arrives here with:

	runtime = 3
	laxity = 10 - 7 = 3
	u = 1

>> +	struct sched_dl_entity *pi_se = pi_of(dl_se);
>>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>>  	struct rq *rq = rq_of_dl_rq(dl_rq);
>> +	u64 dl_runtime = pi_se->dl_runtime;
>>  
>> -	WARN_ON_ONCE(pi_of(dl_se)->dl_runtime <= 0);
>> +	WARN_ON_ONCE(dl_runtime <= 0);
>>  
>>  	/*
>>  	 * This could be the case for a !-dl task that is boosted.
>> @@ -851,10 +853,13 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>>  	 * arbitrary large.
>>  	 */

skip the while because runtime = 3 > 0

>>  	while (dl_se->runtime <= 0) {
>> -		dl_se->deadline += pi_of(dl_se)->dl_period;
>> -		dl_se->runtime += pi_of(dl_se)->dl_runtime;
>> +		dl_se->deadline += pi_se->dl_period;
>> +		dl_se->runtime += dl_runtime;
>>  	}

runtime is already = dl_runtime...

>> +	if (dl_se->zerolax && dl_se->runtime > dl_runtime)
>> +		dl_se->runtime = dl_runtime;
>> +

There is a way to cap it: it is doing the revised wakeup rule...
the runtime will become 1. That is not what we want...

and we would have to keep arming the server... while shifting the
(internal) period puts the scheduler in the regular case :-)

Externally, e.g., the user with the mouse on his laptop, sees the
"zerolax" timeline... :-)

i.e., after at most 7, they get 3, before 10.

it is simpler...

and breaking the U thing breaks GRUB, admission control... and so on...
by default - not in a DL overload scenario... it is by default :-/.

> This should ofcourse go in the if (dl_se->dl_zerolax_armed) branch a
> little down from here.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-08 12:44               ` Peter Zijlstra
                                   ` (2 preceding siblings ...)
  2023-11-08 13:58                 ` Daniel Bristot de Oliveira
@ 2023-11-08 15:14                 ` Juri Lelli
  2023-11-08 16:57                   ` Peter Zijlstra
  3 siblings, 1 reply; 76+ messages in thread
From: Juri Lelli @ 2023-11-08 15:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Daniel Bristot de Oliveira, Daniel Bristot de Oliveira,
	Steven Rostedt, Joel Fernandes, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	linux-kernel, Luca Abeni, Tommaso Cucinotta, Thomas Gleixner,
	Vineeth Pillai, Shuah Khan, Phil Auld

Hi Peter,

On 08/11/23 13:44, Peter Zijlstra wrote:
> On Tue, Nov 07, 2023 at 07:50:28PM +0100, Daniel Bristot de Oliveira wrote:
> > > The code is not doing what I intended because I thought it was doing overload
> > > control on the replenishment, but it is not (my bad).
> > > 
> > 
> > I am still testing but... it is missing something like this (famous last words).
> > 
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index 1092ca8892e0..6e2d21c47a04 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -842,6 +842,8 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
> >   * runtime, or it just underestimated it during sched_setattr().
> >   */
> >  static int start_dl_timer(struct sched_dl_entity *dl_se);
> > +static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t);
> > +
> >  static void replenish_dl_entity(struct sched_dl_entity *dl_se)
> >  {
> >  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> > @@ -852,9 +854,18 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
> >  	/*
> >  	 * This could be the case for a !-dl task that is boosted.
> >  	 * Just go with full inherited parameters.
> > +	 *
> > +	 * Or, it could be the case of a zerolax reservation that
> > +	 * was not able to consume its runtime in background and
> > +	 * reached this point with current u > U.
> > +	 *
> > +	 * In both cases, set a new period.
> >  	 */
> > -	if (dl_se->dl_deadline == 0)
> > -		replenish_dl_new_period(dl_se, rq);
> > +	if (dl_se->dl_deadline == 0 ||
> > +		(dl_se->dl_zerolax_armed && dl_entity_overflow(dl_se, rq_clock(rq)))) {
> > +			dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
> > +			dl_se->runtime = pi_of(dl_se)->dl_runtime;
> > +	}
> > 
> >  	if (dl_se->dl_yielded && dl_se->runtime > 0)
> >  		dl_se->runtime = 0;
> 
> Should we rather not cap the runtime, something like so?
> 
> Because the above also causes period drift, which we do not want.

I was honestly also concerned with the drift, but then thought it might
not be an issue for the dl_server (zerolax), as it doesn't have a
userspace counterpart that relies on synchronized clocks?

> 
> ---
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 58b542bf2893..1453a2cd0680 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -829,10 +829,12 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
>   */
>  static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>  {
> +	struct sched_dl_entity *pi_se = pi_of(dl_se);
>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>  	struct rq *rq = rq_of_dl_rq(dl_rq);
> +	u64 dl_runtime = pi_se->dl_runtime;
>  
> -	WARN_ON_ONCE(pi_of(dl_se)->dl_runtime <= 0);
> +	WARN_ON_ONCE(dl_runtime <= 0);
>  
>  	/*
>  	 * This could be the case for a !-dl task that is boosted.
> @@ -851,10 +853,13 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>  	 * arbitrary large.
>  	 */
>  	while (dl_se->runtime <= 0) {
> -		dl_se->deadline += pi_of(dl_se)->dl_period;
> -		dl_se->runtime += pi_of(dl_se)->dl_runtime;
> +		dl_se->deadline += pi_se->dl_period;
> +		dl_se->runtime += dl_runtime;
>  	}
>  
> +	if (dl_se->zerolax && dl_se->runtime > dl_runtime)
> +		dl_se->runtime = dl_runtime;
> +

Anyway, I have the impression that this breaks EDF/CBS, as we are letting
the dl_server run with full dl_runtime w/o postponing the period
(essentially a u = 1 reservation until the runtime is depleted).

I would say we need to either do

dl_se->deadline += pi_of(dl_se)->dl_period;
dl_se->runtime = pi_of(dl_se)->dl_runtime;

or (as Daniel proposed)

dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
dl_se->runtime = pi_of(dl_se)->dl_runtime;

and I am inclined towards the latter, as the former would at times
effectively reduce the dl_server bandwidth below dl_runtime/dl_period.
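
As a concrete illustration, here is a toy user-space sketch of the two
options (made-up numbers and names, not kernel code), for a server that
wakes up with a stale deadline because it could not run in background:

/* cc -o replenish-sketch replenish-sketch.c && ./replenish-sketch */
#include <stdio.h>
#include <stdint.h>

#define MS(x)	((uint64_t)(x) * 1000000ULL)	/* milliseconds in ns */

int main(void)
{
	uint64_t dl_runtime  = MS(50);		/* server budget		*/
	uint64_t dl_deadline = MS(100);		/* relative deadline		*/
	uint64_t dl_period   = MS(100);		/* implicit: period == deadline	*/

	uint64_t old_deadline = MS(100);	/* stale, set one period ago	*/
	uint64_t now          = MS(180);	/* rq_clock() at replenish time	*/

	/* former: keep the period grid, just refill the budget */
	uint64_t dl_former = old_deadline + dl_period;
	/* latter: re-anchor the new period at "now" (Daniel's diff) */
	uint64_t dl_latter = now + dl_deadline;

	printf("now = %llu ms\n", (unsigned long long)(now / MS(1)));
	printf("former: deadline = %llu ms -> only %llu ms left to spend %llu ms of runtime\n",
	       (unsigned long long)(dl_former / MS(1)),
	       (unsigned long long)((dl_former - now) / MS(1)),
	       (unsigned long long)(dl_runtime / MS(1)));
	printf("latter: deadline = %llu ms -> a full window, but off the old period grid\n",
	       (unsigned long long)(dl_latter / MS(1)));
	return 0;
}

With the former the server keeps its original activation grid (no drift)
but, as noted above, can be left with far less than dl_deadline to spend
dl_runtime in; with the latter it always gets a full window, at the cost
of the drift Peter pointed out.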

Best,
Juri



* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-08 15:14                 ` Juri Lelli
@ 2023-11-08 16:57                   ` Peter Zijlstra
  0 siblings, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2023-11-08 16:57 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Daniel Bristot de Oliveira, Daniel Bristot de Oliveira,
	Steven Rostedt, Joel Fernandes, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	linux-kernel, Luca Abeni, Tommaso Cucinotta, Thomas Gleixner,
	Vineeth Pillai, Shuah Khan, Phil Auld

On Wed, Nov 08, 2023 at 04:14:18PM +0100, Juri Lelli wrote:
> > +	if (dl_se->zerolax && dl_se->runtime > dl_runtime)
> > +		dl_se->runtime = dl_runtime;
> > +
> 
> Anyway, I have the impression that this breaks EDF/CBS, as we are letting
> the dl_server run with full dl_runtime w/o postponing the period
> (essentially a u = 1 reservation until runtime is depleted).

Yeah, I sorted it out with Daniel; we were not trying to fix the same
problem :-)


* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-08  8:01                 ` Daniel Bristot de Oliveira
@ 2023-11-08 18:25                   ` Joel Fernandes
  0 siblings, 0 replies; 76+ messages in thread
From: Joel Fernandes @ 2023-11-08 18:25 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Daniel Bristot de Oliveira, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld

On Wed, Nov 08, 2023 at 09:01:17AM +0100, Daniel Bristot de Oliveira wrote:
> On 11/8/23 04:20, Joel Fernandes wrote:
> > Hi Daniel,
> > 
> > On Tue, Nov 7, 2023 at 1:50 PM Daniel Bristot de Oliveira
> > <bristot@kernel.org> wrote:
> >>
> >>> The code is not doing what I intended because I thought it was doing overload
> >>> control on the replenishment, but it is not (my bad).
> >>>
> >>
> >> I am still testing but... it is missing something like this (famous last words).
> >>
> >> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> >> index 1092ca8892e0..6e2d21c47a04 100644
> >> --- a/kernel/sched/deadline.c
> >> +++ b/kernel/sched/deadline.c
> >> @@ -842,6 +842,8 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
> >>   * runtime, or it just underestimated it during sched_setattr().
> >>   */
> >>  static int start_dl_timer(struct sched_dl_entity *dl_se);
> >> +static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t);
> >> +
> >>  static void replenish_dl_entity(struct sched_dl_entity *dl_se)
> >>  {
> >>         struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> >> @@ -852,9 +854,18 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
> >>         /*
> >>          * This could be the case for a !-dl task that is boosted.
> >>          * Just go with full inherited parameters.
> >> +        *
> >> +        * Or, it could be the case of a zerolax reservation that
> >> +        * was not able to consume its runtime in background and
> >> +        * reached this point with current u > U.
> >> +        *
> >> +        * In both cases, set a new period.
> >>          */
> >> -       if (dl_se->dl_deadline == 0)
> >> -               replenish_dl_new_period(dl_se, rq);
> >> +       if (dl_se->dl_deadline == 0 ||
> >> +               (dl_se->dl_zerolax_armed && dl_entity_overflow(dl_se, rq_clock(rq)))) {
> >> +                       dl_se->deadline = rq_clock(rq) + pi_of(dl_se)->dl_deadline;
> >> +                       dl_se->runtime = pi_of(dl_se)->dl_runtime;
> >> +       }
> >>
> >>         if (dl_se->dl_yielded && dl_se->runtime > 0)
> >>                 dl_se->runtime = 0;
> > 
> > I was wondering: does this mean GRUB needs to be enabled? Otherwise I
> > can see that "runtime / (deadline - t) > dl_runtime / dl_deadline"
> > will be true almost all the time due to the constraint of executing at
> > the 0-lax time.
> 
> No GRUB needed. It will only happen if the fair server did not have any chance to run.
> 
> If it happens, it is not a problem; see the timeline I described in the previous
> email.

Ah, you're right, I mistakenly read your diff as calling
replenish_dl_new_period() on dl_entity_overflow(). Indeed the diff is needed
(I was actually wondering why that was not done in my initial review as
well -- so it's good we found it in discussion).
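
(For reference, the condition quoted above -- runtime / (deadline - t) >
dl_runtime / dl_deadline -- rewritten without divisions looks roughly
like the sketch below. This is illustrative user-space code only; the
in-tree dl_entity_overflow() differs in details such as the fixed-point
scaling it uses to avoid multiplication overflow.)

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* runtime / (deadline - t) > dl_runtime / dl_deadline, cross-multiplied */
static bool dl_overflow_sketch(uint64_t runtime, uint64_t deadline, uint64_t t,
			       uint64_t dl_runtime, uint64_t dl_deadline)
{
	return runtime * dl_deadline > (deadline - t) * dl_runtime;
}

int main(void)
{
	/*
	 * 50/100 reservation that wakes at t = 180 with its old deadline
	 * at 200: 50 ms of runtime left but only 20 ms to the deadline.
	 */
	printf("%d\n", dl_overflow_sketch(50, 200, 180, 50, 100));	/* 1: overflow, set a new period */
	printf("%d\n", dl_overflow_sketch(10, 200, 180, 50, 100));	/* 0: current deadline is fine   */
	return 0;
}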

> We do not want a zerolax scheduler, because it breaks everything else. It is
> a deferred EDF that, seen from the wall clock, composes a "zerolaxish" timeline.

Indeed. I was not suggesting that we do a zerolax scheduler; I merely
misread the diff, assuming you were throttling the DL server once again at
the zerolax time.

thanks,

 - Joel



* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-04 10:59 ` [PATCH v5 6/7] sched/deadline: Deferrable dl server Daniel Bristot de Oliveira
                     ` (2 preceding siblings ...)
  2023-11-07 16:37   ` Steven Rostedt
@ 2023-11-13 15:05   ` kernel test robot
  2024-03-20  0:03   ` Joel Fernandes
  4 siblings, 0 replies; 76+ messages in thread
From: kernel test robot @ 2023-11-13 15:05 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: oe-lkp, lkp, linux-kernel, aubrey.li, yu.c.chen, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Luca Abeni,
	Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, bristot, Phil Auld, oliver.sang



Hello,

kernel test robot noticed "WARNING:at_kernel/sched/deadline.c:#enqueue_dl_entity" on:

commit: dea46af8e193ed4f23c37123bfd4a825399aedfe ("[PATCH v5 6/7] sched/deadline: Deferrable dl server")
url: https://github.com/intel-lab-lkp/linux/commits/Daniel-Bristot-de-Oliveira/sched-Unify-runtime-accounting-across-classes/20231104-201952
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 984ffb6a4366752c949f7b39640aecdce222607f
patch link: https://lore.kernel.org/all/c7b706d30d6316c52853ca056db5beb82ba72863.1699095159.git.bristot@kernel.org/
patch subject: [PATCH v5 6/7] sched/deadline: Deferrable dl server

in testcase: trinity
version: trinity-i386-abe9de86-1_20230429
with following parameters:

	runtime: 600s

test-description: Trinity is a linux system call fuzz tester.
test-url: http://codemonkey.org.uk/projects/trinity/


compiler: gcc-9
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)


the issue does not always happen in our tests (it showed up in 9 out of 20 runs),
but the parent commit stays clean.

6f69498ee58c052e dea46af8e193ed4f23c37123bfd
---------------- ---------------------------
       fail:runs  %reproduction    fail:runs
           |             |             |
           :20          45%           9:20    dmesg.RIP:enqueue_dl_entity
           :20          45%           9:20    dmesg.WARNING:at_kernel/sched/deadline.c:#enqueue_dl_entity




If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202311132217.2a9a4aac-oliver.sang@intel.com


[   59.623267][    C0] ------------[ cut here ]------------
[ 59.627229][ C0] WARNING: CPU: 0 PID: 1 at kernel/sched/deadline.c:1803 enqueue_dl_entity (kernel/sched/deadline.c:1803 (discriminator 1)) 
[   59.627229][    C0] Modules linked in:
[   59.627229][    C0] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G                T  6.6.0-rc7-00090-gdea46af8e193 #1
[   59.627229][    C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 59.627229][ C0] RIP: 0010:enqueue_dl_entity (kernel/sched/deadline.c:1803 (discriminator 1)) 
[ 59.627229][ C0] Code: 8e 74 ed ff ff 45 84 f6 0f 85 fd 08 00 00 48 8d b5 08 d8 ff ff 48 8d 95 c8 ee ff ff 4c 89 ff e8 40 f8 01 00 e9 50 ed ff ff 90 <0f> 0b 90 e9 9b ec ff ff 48 8d bd 1c d8 ff ff 48 c7 c3 40 fa 1f 00
All code
========
   0:	8e 74 ed ff          	mov    -0x1(%rbp,%rbp,8),%?
   4:	ff 45 84             	incl   -0x7c(%rbp)
   7:	f6 0f 85             	testb  $0x85,(%rdi)
   a:	fd                   	std    
   b:	08 00                	or     %al,(%rax)
   d:	00 48 8d             	add    %cl,-0x73(%rax)
  10:	b5 08                	mov    $0x8,%ch
  12:	d8 ff                	fdivr  %st(7),%st
  14:	ff 48 8d             	decl   -0x73(%rax)
  17:	95                   	xchg   %eax,%ebp
  18:	c8 ee ff ff          	enterq $0xffee,$0xff
  1c:	4c 89 ff             	mov    %r15,%rdi
  1f:	e8 40 f8 01 00       	callq  0x1f864
  24:	e9 50 ed ff ff       	jmpq   0xffffffffffffed79
  29:	90                   	nop
  2a:*	0f 0b                	ud2    		<-- trapping instruction
  2c:	90                   	nop
  2d:	e9 9b ec ff ff       	jmpq   0xffffffffffffeccd
  32:	48 8d bd 1c d8 ff ff 	lea    -0x27e4(%rbp),%rdi
  39:	48 c7 c3 40 fa 1f 00 	mov    $0x1ffa40,%rbx

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2    
   2:	90                   	nop
   3:	e9 9b ec ff ff       	jmpq   0xffffffffffffeca3
   8:	48 8d bd 1c d8 ff ff 	lea    -0x27e4(%rbp),%rdi
   f:	48 c7 c3 40 fa 1f 00 	mov    $0x1ffa40,%rbx
[   59.627229][    C0] RSP: 0000:ffffc90000007d28 EFLAGS: 00010092
[   59.627229][    C0] RAX: dffffc0000000000 RBX: ffff8883aec00418 RCX: 1ffffffff096d168
[   59.627229][    C0] RDX: 1ffff11075d80078 RSI: 0000000000000020 RDI: ffff8883aec003c0
[   59.627229][    C0] RBP: ffff8883aec003c0 R08: ffff8883aec004f0 R09: ffff8883aec00500
[   59.627229][    C0] R10: 0000000000000000 R11: ffffffff873fba8f R12: ffff8883aec00414
[   59.627229][    C0] R13: 0000000000000020 R14: ffff8883aebffa58 R15: ffff8883aec003c0
[   59.627229][    C0] FS:  0000000000000000(0000) GS:ffff8883aea00000(0000) knlGS:0000000000000000
[   59.627229][    C0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   59.627229][    C0] CR2: ffff88843ffff000 CR3: 0000000004e70000 CR4: 00000000000006b0
[   59.627229][    C0] Call Trace:
[   59.627229][    C0]  <IRQ>
[ 59.627229][ C0] ? __warn (kernel/panic.c:673) 
[ 59.627229][ C0] ? enqueue_dl_entity (kernel/sched/deadline.c:1803 (discriminator 1)) 
[ 59.627229][ C0] ? report_bug (lib/bug.c:201 lib/bug.c:219) 
[ 59.627229][ C0] ? handle_bug (arch/x86/kernel/traps.c:237) 
[ 59.627229][ C0] ? exc_invalid_op (arch/x86/kernel/traps.c:258 (discriminator 1)) 
[ 59.627229][ C0] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568) 
[ 59.627229][ C0] ? enqueue_dl_entity (kernel/sched/deadline.c:1803 (discriminator 1)) 
[ 59.627229][ C0] ? update_rq_clock (kernel/sched/core.c:765 kernel/sched/core.c:750) 
[ 59.627229][ C0] ? kvm_sched_clock_read (arch/x86/kernel/kvmclock.c:91) 
[ 59.627229][ C0] ? sched_clock_tick (kernel/sched/clock.c:270 kernel/sched/clock.c:426 kernel/sched/clock.c:412) 
[ 59.627229][ C0] dl_task_timer (kernel/sched/deadline.c:1193) 
[ 59.627229][ C0] ? pick_task_dl (kernel/sched/deadline.c:1174) 
[ 59.627229][ C0] __hrtimer_run_queues (kernel/time/hrtimer.c:1688 kernel/time/hrtimer.c:1752) 
[ 59.627229][ C0] ? enqueue_hrtimer (kernel/time/hrtimer.c:1722) 
[ 59.627229][ C0] ? kvm_clock_read (arch/x86/include/asm/preempt.h:95 arch/x86/kernel/kvmclock.c:80) 
[ 59.627229][ C0] ? ktime_get_update_offsets_now (kernel/time/timekeeping.c:292 (discriminator 4) kernel/time/timekeeping.c:388 (discriminator 4) kernel/time/timekeeping.c:2320 (discriminator 4)) 
[ 59.627229][ C0] hrtimer_interrupt (kernel/time/hrtimer.c:1817) 
[ 59.627229][ C0] __sysvec_apic_timer_interrupt (arch/x86/include/asm/atomic.h:23 include/linux/atomic/atomic-arch-fallback.h:444 include/linux/jump_label.h:260 include/linux/jump_label.h:270 arch/x86/include/asm/trace/irq_vectors.h:41 arch/x86/kernel/apic/apic.c:1081) 
[ 59.627229][ C0] sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1074 (discriminator 14)) 
[   59.627229][    C0]  </IRQ>
[   59.627229][    C0]  <TASK>
[ 59.627229][ C0] asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:645) 
[ 59.627229][ C0] RIP: 0010:_raw_spin_unlock_irqrestore (include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:194) 
[ 59.627229][ C0] Code: 83 c7 18 e8 ca 80 74 fd 48 89 ef e8 82 18 75 fd 81 e3 00 02 00 00 75 25 9c 58 f6 c4 02 75 2d 48 85 db 74 01 fb bf 01 00 00 00 <e8> 93 55 69 fd 65 8b 05 54 1a 66 7c 85 c0 74 0a 5b 5d c3 e8 30 63
All code
========
   0:	83 c7 18             	add    $0x18,%edi
   3:	e8 ca 80 74 fd       	callq  0xfffffffffd7480d2
   8:	48 89 ef             	mov    %rbp,%rdi
   b:	e8 82 18 75 fd       	callq  0xfffffffffd751892
  10:	81 e3 00 02 00 00    	and    $0x200,%ebx
  16:	75 25                	jne    0x3d
  18:	9c                   	pushfq 
  19:	58                   	pop    %rax
  1a:	f6 c4 02             	test   $0x2,%ah
  1d:	75 2d                	jne    0x4c
  1f:	48 85 db             	test   %rbx,%rbx
  22:	74 01                	je     0x25
  24:	fb                   	sti    
  25:	bf 01 00 00 00       	mov    $0x1,%edi
  2a:*	e8 93 55 69 fd       	callq  0xfffffffffd6955c2		<-- trapping instruction
  2f:	65 8b 05 54 1a 66 7c 	mov    %gs:0x7c661a54(%rip),%eax        # 0x7c661a8a
  36:	85 c0                	test   %eax,%eax
  38:	74 0a                	je     0x44
  3a:	5b                   	pop    %rbx
  3b:	5d                   	pop    %rbp
  3c:	c3                   	retq   
  3d:	e8                   	.byte 0xe8
  3e:	30                   	.byte 0x30
  3f:	63                   	.byte 0x63

Code starting with the faulting instruction
===========================================
   0:	e8 93 55 69 fd       	callq  0xfffffffffd695598
   5:	65 8b 05 54 1a 66 7c 	mov    %gs:0x7c661a54(%rip),%eax        # 0x7c661a60
   c:	85 c0                	test   %eax,%eax
   e:	74 0a                	je     0x1a
  10:	5b                   	pop    %rbx
  11:	5d                   	pop    %rbp
  12:	c3                   	retq   
  13:	e8                   	.byte 0xe8
  14:	30                   	.byte 0x30
  15:	63                   	.byte 0x63
[   59.627229][    C0] RSP: 0000:ffffc9000001fbb8 EFLAGS: 00000206
[   59.627229][    C0] RAX: 0000000000000006 RBX: 0000000000000200 RCX: ffffffff812e2631
[   59.627229][    C0] RDX: 0000000000000000 RSI: ffffffff83eaa940 RDI: 0000000000000001
[   59.627229][    C0] RBP: ffff88812d07ed00 R08: 0000000000000001 R09: fffffbfff0e7f757
[   59.627229][    C0] R10: fffffbfff0e7f756 R11: ffffffff873fbab7 R12: 0000000000000000
[   59.627229][    C0] R13: 0000000000000246 R14: ffff888195a701a8 R15: ffff88812cd23350
[ 59.627229][ C0] ? mark_lock (arch/x86/include/asm/bitops.h:228 (discriminator 3) arch/x86/include/asm/bitops.h:240 (discriminator 3) include/asm-generic/bitops/instrumented-non-atomic.h:142 (discriminator 3) kernel/locking/lockdep.c:228 (discriminator 3) kernel/locking/lockdep.c:4655 (discriminator 3)) 
[ 59.627229][ C0] dma_fence_signal (drivers/dma-buf/dma-fence.c:327 drivers/dma-buf/dma-fence.c:476) 
[ 59.627229][ C0] wait_backward (drivers/dma-buf/st-dma-fence-chain.c:621) 
[ 59.627229][ C0] ? find_gap (drivers/dma-buf/st-dma-fence-chain.c:603) 
[ 59.627229][ C0] ? __cond_resched (kernel/sched/core.c:8521) 
[ 59.627229][ C0] __subtests (drivers/dma-buf/selftest.c:106 (discriminator 1)) 
[ 59.627229][ C0] ? kmem_cache_open (mm/slub.c:2479 mm/slub.c:4232 mm/slub.c:4560) 
[ 59.627229][ C0] ? __sanitycheck__ (drivers/dma-buf/selftest.c:92) 
[ 59.627229][ C0] ? kmem_cache_create_usercopy (mm/slab_common.c:351) 
[ 59.627229][ C0] dma_fence_chain (drivers/dma-buf/st-dma-fence-chain.c:708) 
[ 59.627229][ C0] st_init (drivers/dma-buf/selftest.c:141 drivers/dma-buf/selftest.c:155) 
[ 59.627229][ C0] ? udmabuf_dev_init (drivers/dma-buf/selftest.c:154) 
[ 59.627229][ C0] do_one_initcall (init/main.c:1232) 
[ 59.627229][ C0] ? trace_event_raw_event_initcall_level (init/main.c:1223) 
[ 59.627229][ C0] ? parameq (kernel/params.c:171) 
[ 59.627229][ C0] ? strcpy (lib/string.c:83 (discriminator 1)) 
[ 59.627229][ C0] kernel_init_freeable (init/main.c:1293 init/main.c:1310 init/main.c:1329 init/main.c:1547) 
[ 59.627229][ C0] ? finish_task_switch (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:77 kernel/sched/sched.h:1390 kernel/sched/core.c:5129 kernel/sched/core.c:5247) 
[ 59.627229][ C0] ? rest_init (init/main.c:1429) 
[ 59.627229][ C0] kernel_init (init/main.c:1439) 
[ 59.627229][ C0] ? _raw_spin_unlock_irq (arch/x86/include/asm/preempt.h:104 include/linux/spinlock_api_smp.h:160 kernel/locking/spinlock.c:202) 
[ 59.627229][ C0] ret_from_fork (arch/x86/kernel/process.c:153) 
[ 59.627229][ C0] ? rest_init (init/main.c:1429) 
[ 59.627229][ C0] ret_from_fork_asm (arch/x86/entry/entry_64.S:312) 
[   59.627229][    C0]  </TASK>
[   59.627229][    C0] irq event stamp: 842784
[ 59.627229][ C0] hardirqs last enabled at (842783): irqentry_exit (kernel/entry/common.c:436) 
[ 59.627229][ C0] hardirqs last disabled at (842784): sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1074) 
[ 59.627229][ C0] softirqs last enabled at (842772): __do_softirq (arch/x86/include/asm/preempt.h:27 kernel/softirq.c:400 kernel/softirq.c:582) 
[ 59.627229][ C0] softirqs last disabled at (842761): irq_exit_rcu (kernel/softirq.c:427 kernel/softirq.c:632 kernel/softirq.c:622 kernel/softirq.c:644) 
[   59.627229][    C0] ---[ end trace 0000000000000000 ]---


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231113/202311132217.2a9a4aac-oliver.sang@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



* [tip: sched/core] sched/deadline: Introduce deadline servers
  2023-11-04 10:59 ` [PATCH v5 4/7] sched/deadline: Introduce deadline servers Daniel Bristot de Oliveira
@ 2023-11-15  9:04   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2023-11-15  9:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Daniel Bristot de Oliveira, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     63ba8422f876e32ee564ea95da9a7313b13ff0a1
Gitweb:        https://git.kernel.org/tip/63ba8422f876e32ee564ea95da9a7313b13ff0a1
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Sat, 04 Nov 2023 11:59:21 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 15 Nov 2023 09:57:51 +01:00

sched/deadline: Introduce deadline servers

Low priority tasks (e.g., SCHED_OTHER) can suffer starvation if tasks
with higher priority (e.g., SCHED_FIFO) monopolize CPU(s).

RT Throttling was introduced a while ago as a (mostly debug)
countermeasure one can utilize to reserve some CPU time for low priority
tasks (usually background type of work, e.g. workqueues, timers, etc.).
It however has its own problems (see documentation) and the undesired
effect of unconditionally throttling FIFO tasks even when no lower
priority activity needs to run (there are mechanisms to fix this issue
as well, but, again, with their own problems).

Introduce deadline servers to service low priority tasks' needs under
starvation conditions. Deadline servers are built extending SCHED_DEADLINE
implementation to allow 2-level scheduling (a sched_deadline entity
becomes a container for lower priority scheduling entities).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/4968601859d920335cf85822eb573a5f179f04b8.1699095159.git.bristot@kernel.org
---
 include/linux/sched.h   |  22 ++-
 kernel/sched/core.c     |  17 ++-
 kernel/sched/deadline.c | 332 ++++++++++++++++++++++++++-------------
 kernel/sched/fair.c     |   2 +-
 kernel/sched/sched.h    |  27 +++-
 5 files changed, 292 insertions(+), 108 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 44b46d9..8d25816 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -63,11 +63,13 @@ struct robust_list_head;
 struct root_domain;
 struct rq;
 struct sched_attr;
+struct sched_dl_entity;
 struct seq_file;
 struct sighand_struct;
 struct signal_struct;
 struct task_delay_info;
 struct task_group;
+struct task_struct;
 struct user_event_mm;
 
 /*
@@ -607,6 +609,9 @@ struct sched_rt_entity {
 #endif
 } __randomize_layout;
 
+typedef bool (*dl_server_has_tasks_f)(struct sched_dl_entity *);
+typedef struct task_struct *(*dl_server_pick_f)(struct sched_dl_entity *);
+
 struct sched_dl_entity {
 	struct rb_node			rb_node;
 
@@ -654,6 +659,7 @@ struct sched_dl_entity {
 	unsigned int			dl_yielded        : 1;
 	unsigned int			dl_non_contending : 1;
 	unsigned int			dl_overrun	  : 1;
+	unsigned int			dl_server         : 1;
 
 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
@@ -668,7 +674,20 @@ struct sched_dl_entity {
 	 * timer is needed to decrease the active utilization at the correct
 	 * time.
 	 */
-	struct hrtimer inactive_timer;
+	struct hrtimer			inactive_timer;
+
+	/*
+	 * Bits for DL-server functionality. Also see the comment near
+	 * dl_server_update().
+	 *
+	 * @rq the runqueue this server is for
+	 *
+	 * @server_has_tasks() returns true if @server_pick return a
+	 * runnable task.
+	 */
+	struct rq			*rq;
+	dl_server_has_tasks_f		server_has_tasks;
+	dl_server_pick_f		server_pick;
 
 #ifdef CONFIG_RT_MUTEXES
 	/*
@@ -795,6 +814,7 @@ struct task_struct {
 	struct sched_entity		se;
 	struct sched_rt_entity		rt;
 	struct sched_dl_entity		dl;
+	struct sched_dl_entity		*dl_server;
 	const struct sched_class	*sched_class;
 
 #ifdef CONFIG_SCHED_CORE
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 966631f..f5f4495 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3797,6 +3797,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 		rq->idle_stamp = 0;
 	}
 #endif
+
+	p->dl_server = NULL;
 }
 
 /*
@@ -6003,12 +6005,27 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			p = pick_next_task_idle(rq);
 		}
 
+		/*
+		 * This is the fast path; it cannot be a DL server pick;
+		 * therefore even if @p == @prev, ->dl_server must be NULL.
+		 */
+		if (p->dl_server)
+			p->dl_server = NULL;
+
 		return p;
 	}
 
 restart:
 	put_prev_task_balance(rq, prev, rf);
 
+	/*
+	 * We've updated @prev and no longer need the server link, clear it.
+	 * Must be done before ->pick_next_task() because that can (re)set
+	 * ->dl_server.
+	 */
+	if (prev->dl_server)
+		prev->dl_server = NULL;
+
 	for_each_class(class) {
 		p = class->pick_next_task(rq);
 		if (p)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 81810f6..a04a436 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -54,8 +54,14 @@ static int __init sched_dl_sysctl_init(void)
 late_initcall(sched_dl_sysctl_init);
 #endif
 
+static bool dl_server(struct sched_dl_entity *dl_se)
+{
+	return dl_se->dl_server;
+}
+
 static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
 {
+	BUG_ON(dl_server(dl_se));
 	return container_of(dl_se, struct task_struct, dl);
 }
 
@@ -64,12 +70,19 @@ static inline struct rq *rq_of_dl_rq(struct dl_rq *dl_rq)
 	return container_of(dl_rq, struct rq, dl);
 }
 
-static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
+static inline struct rq *rq_of_dl_se(struct sched_dl_entity *dl_se)
 {
-	struct task_struct *p = dl_task_of(dl_se);
-	struct rq *rq = task_rq(p);
+	struct rq *rq = dl_se->rq;
+
+	if (!dl_server(dl_se))
+		rq = task_rq(dl_task_of(dl_se));
 
-	return &rq->dl;
+	return rq;
+}
+
+static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
+{
+	return &rq_of_dl_se(dl_se)->dl;
 }
 
 static inline int on_dl_rq(struct sched_dl_entity *dl_se)
@@ -394,9 +407,8 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se);
 static void task_non_contending(struct sched_dl_entity *dl_se)
 {
 	struct hrtimer *timer = &dl_se->inactive_timer;
-	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
-	struct rq *rq = rq_of_dl_rq(dl_rq);
-	struct task_struct *p = dl_task_of(dl_se);
+	struct rq *rq = rq_of_dl_se(dl_se);
+	struct dl_rq *dl_rq = &rq->dl;
 	s64 zerolag_time;
 
 	/*
@@ -426,25 +438,33 @@ static void task_non_contending(struct sched_dl_entity *dl_se)
 	 * utilization now, instead of starting a timer
 	 */
 	if ((zerolag_time < 0) || hrtimer_active(&dl_se->inactive_timer)) {
-		if (dl_task(p))
+		if (dl_server(dl_se)) {
 			sub_running_bw(dl_se, dl_rq);
+		} else {
+			struct task_struct *p = dl_task_of(dl_se);
 
-		if (!dl_task(p) || READ_ONCE(p->__state) == TASK_DEAD) {
-			struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+			if (dl_task(p))
+				sub_running_bw(dl_se, dl_rq);
 
-			if (READ_ONCE(p->__state) == TASK_DEAD)
-				sub_rq_bw(dl_se, &rq->dl);
-			raw_spin_lock(&dl_b->lock);
-			__dl_sub(dl_b, dl_se->dl_bw, dl_bw_cpus(task_cpu(p)));
-			raw_spin_unlock(&dl_b->lock);
-			__dl_clear_params(dl_se);
+			if (!dl_task(p) || READ_ONCE(p->__state) == TASK_DEAD) {
+				struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+
+				if (READ_ONCE(p->__state) == TASK_DEAD)
+					sub_rq_bw(dl_se, &rq->dl);
+				raw_spin_lock(&dl_b->lock);
+				__dl_sub(dl_b, dl_se->dl_bw, dl_bw_cpus(task_cpu(p)));
+				raw_spin_unlock(&dl_b->lock);
+				__dl_clear_params(dl_se);
+			}
 		}
 
 		return;
 	}
 
 	dl_se->dl_non_contending = 1;
-	get_task_struct(p);
+	if (!dl_server(dl_se))
+		get_task_struct(dl_task_of(dl_se));
+
 	hrtimer_start(timer, ns_to_ktime(zerolag_time), HRTIMER_MODE_REL_HARD);
 }
 
@@ -471,8 +491,10 @@ static void task_contending(struct sched_dl_entity *dl_se, int flags)
 		 * will not touch the rq's active utilization,
 		 * so we are still safe.
 		 */
-		if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1)
-			put_task_struct(dl_task_of(dl_se));
+		if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1) {
+			if (!dl_server(dl_se))
+				put_task_struct(dl_task_of(dl_se));
+		}
 	} else {
 		/*
 		 * Since "dl_non_contending" is not set, the
@@ -485,10 +507,8 @@ static void task_contending(struct sched_dl_entity *dl_se, int flags)
 	}
 }
 
-static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
+static inline int is_leftmost(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 {
-	struct sched_dl_entity *dl_se = &p->dl;
-
 	return rb_first_cached(&dl_rq->root) == &dl_se->rb_node;
 }
 
@@ -740,8 +760,10 @@ static inline void deadline_queue_pull_task(struct rq *rq)
 }
 #endif /* CONFIG_SMP */
 
+static void
+enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags);
 static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
-static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
+static void dequeue_dl_entity(struct sched_dl_entity *dl_se, int flags);
 static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p, int flags);
 
 static inline void replenish_dl_new_period(struct sched_dl_entity *dl_se,
@@ -989,8 +1011,7 @@ static inline bool dl_is_implicit(struct sched_dl_entity *dl_se)
  */
 static void update_dl_entity(struct sched_dl_entity *dl_se)
 {
-	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
-	struct rq *rq = rq_of_dl_rq(dl_rq);
+	struct rq *rq = rq_of_dl_se(dl_se);
 
 	if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
 	    dl_entity_overflow(dl_se, rq_clock(rq))) {
@@ -1021,11 +1042,11 @@ static inline u64 dl_next_period(struct sched_dl_entity *dl_se)
  * actually started or not (i.e., the replenishment instant is in
  * the future or in the past).
  */
-static int start_dl_timer(struct task_struct *p)
+static int start_dl_timer(struct sched_dl_entity *dl_se)
 {
-	struct sched_dl_entity *dl_se = &p->dl;
 	struct hrtimer *timer = &dl_se->dl_timer;
-	struct rq *rq = task_rq(p);
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
 	ktime_t now, act;
 	s64 delta;
 
@@ -1059,13 +1080,33 @@ static int start_dl_timer(struct task_struct *p)
 	 * and observe our state.
 	 */
 	if (!hrtimer_is_queued(timer)) {
-		get_task_struct(p);
+		if (!dl_server(dl_se))
+			get_task_struct(dl_task_of(dl_se));
 		hrtimer_start(timer, act, HRTIMER_MODE_ABS_HARD);
 	}
 
 	return 1;
 }
 
+static void __push_dl_task(struct rq *rq, struct rq_flags *rf)
+{
+#ifdef CONFIG_SMP
+	/*
+	 * Queueing this task back might have overloaded rq, check if we need
+	 * to kick someone away.
+	 */
+	if (has_pushable_dl_tasks(rq)) {
+		/*
+		 * Nothing relies on rq->lock after this, so its safe to drop
+		 * rq->lock.
+		 */
+		rq_unpin_lock(rq, rf);
+		push_dl_task(rq);
+		rq_repin_lock(rq, rf);
+	}
+#endif
+}
+
 /*
  * This is the bandwidth enforcement timer callback. If here, we know
  * a task is not on its dl_rq, since the fact that the timer was running
@@ -1084,10 +1125,34 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 	struct sched_dl_entity *dl_se = container_of(timer,
 						     struct sched_dl_entity,
 						     dl_timer);
-	struct task_struct *p = dl_task_of(dl_se);
+	struct task_struct *p;
 	struct rq_flags rf;
 	struct rq *rq;
 
+	if (dl_server(dl_se)) {
+		struct rq *rq = rq_of_dl_se(dl_se);
+		struct rq_flags rf;
+
+		rq_lock(rq, &rf);
+		if (dl_se->dl_throttled) {
+			sched_clock_tick();
+			update_rq_clock(rq);
+
+			if (dl_se->server_has_tasks(dl_se)) {
+				enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH);
+				resched_curr(rq);
+				__push_dl_task(rq, &rf);
+			} else {
+				replenish_dl_entity(dl_se);
+			}
+
+		}
+		rq_unlock(rq, &rf);
+
+		return HRTIMER_NORESTART;
+	}
+
+	p = dl_task_of(dl_se);
 	rq = task_rq_lock(p, &rf);
 
 	/*
@@ -1158,21 +1223,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 	else
 		resched_curr(rq);
 
-#ifdef CONFIG_SMP
-	/*
-	 * Queueing this task back might have overloaded rq, check if we need
-	 * to kick someone away.
-	 */
-	if (has_pushable_dl_tasks(rq)) {
-		/*
-		 * Nothing relies on rq->lock after this, so its safe to drop
-		 * rq->lock.
-		 */
-		rq_unpin_lock(rq, &rf);
-		push_dl_task(rq);
-		rq_repin_lock(rq, &rf);
-	}
-#endif
+	__push_dl_task(rq, &rf);
 
 unlock:
 	task_rq_unlock(rq, p, &rf);
@@ -1214,12 +1265,11 @@ static void init_dl_task_timer(struct sched_dl_entity *dl_se)
  */
 static inline void dl_check_constrained_dl(struct sched_dl_entity *dl_se)
 {
-	struct task_struct *p = dl_task_of(dl_se);
-	struct rq *rq = rq_of_dl_rq(dl_rq_of_se(dl_se));
+	struct rq *rq = rq_of_dl_se(dl_se);
 
 	if (dl_time_before(dl_se->deadline, rq_clock(rq)) &&
 	    dl_time_before(rq_clock(rq), dl_next_period(dl_se))) {
-		if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(p)))
+		if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(dl_se)))
 			return;
 		dl_se->dl_throttled = 1;
 		if (dl_se->runtime > 0)
@@ -1270,29 +1320,13 @@ static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
 	return (delta * u_act) >> BW_SHIFT;
 }
 
-/*
- * Update the current task's runtime statistics (provided it is still
- * a -deadline task and has not been removed from the dl_rq).
- */
-static void update_curr_dl(struct rq *rq)
+static inline void
+update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
+                        int flags);
+static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta_exec)
 {
-	struct task_struct *curr = rq->curr;
-	struct sched_dl_entity *dl_se = &curr->dl;
-	s64 delta_exec, scaled_delta_exec;
-	int cpu = cpu_of(rq);
-
-	if (!dl_task(curr) || !on_dl_rq(dl_se))
-		return;
+	s64 scaled_delta_exec;
 
-	/*
-	 * Consumed budget is computed considering the time as
-	 * observed by schedulable tasks (excluding time spent
-	 * in hardirq context, etc.). Deadlines are instead
-	 * computed using hard walltime. This seems to be the more
-	 * natural solution, but the full ramifications of this
-	 * approach need further study.
-	 */
-	delta_exec = update_curr_common(rq);
 	if (unlikely(delta_exec <= 0)) {
 		if (unlikely(dl_se->dl_yielded))
 			goto throttle;
@@ -1310,10 +1344,9 @@ static void update_curr_dl(struct rq *rq)
 	 * according to current frequency and CPU maximum capacity.
 	 */
 	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
-		scaled_delta_exec = grub_reclaim(delta_exec,
-						 rq,
-						 &curr->dl);
+		scaled_delta_exec = grub_reclaim(delta_exec, rq, dl_se);
 	} else {
+		int cpu = cpu_of(rq);
 		unsigned long scale_freq = arch_scale_freq_capacity(cpu);
 		unsigned long scale_cpu = arch_scale_cpu_capacity(cpu);
 
@@ -1332,11 +1365,20 @@ throttle:
 		    (dl_se->flags & SCHED_FLAG_DL_OVERRUN))
 			dl_se->dl_overrun = 1;
 
-		__dequeue_task_dl(rq, curr, 0);
-		if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(curr)))
-			enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
+		dequeue_dl_entity(dl_se, 0);
+		if (!dl_server(dl_se)) {
+			update_stats_dequeue_dl(&rq->dl, dl_se, 0);
+			dequeue_pushable_dl_task(rq, dl_task_of(dl_se));
+		}
 
-		if (!is_leftmost(curr, &rq->dl))
+		if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(dl_se))) {
+			if (dl_server(dl_se))
+				enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH);
+			else
+				enqueue_task_dl(rq, dl_task_of(dl_se), ENQUEUE_REPLENISH);
+		}
+
+		if (!is_leftmost(dl_se, &rq->dl))
 			resched_curr(rq);
 	}
 
@@ -1366,20 +1408,82 @@ throttle:
 	}
 }
 
+void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
+{
+	update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
+}
+
+void dl_server_start(struct sched_dl_entity *dl_se)
+{
+	if (!dl_server(dl_se)) {
+		dl_se->dl_server = 1;
+		setup_new_dl_entity(dl_se);
+	}
+	enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
+}
+
+void dl_server_stop(struct sched_dl_entity *dl_se)
+{
+	dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
+}
+
+void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
+		    dl_server_has_tasks_f has_tasks,
+		    dl_server_pick_f pick)
+{
+	dl_se->rq = rq;
+	dl_se->server_has_tasks = has_tasks;
+	dl_se->server_pick = pick;
+}
+
+/*
+ * Update the current task's runtime statistics (provided it is still
+ * a -deadline task and has not been removed from the dl_rq).
+ */
+static void update_curr_dl(struct rq *rq)
+{
+	struct task_struct *curr = rq->curr;
+	struct sched_dl_entity *dl_se = &curr->dl;
+	s64 delta_exec;
+
+	if (!dl_task(curr) || !on_dl_rq(dl_se))
+		return;
+
+	/*
+	 * Consumed budget is computed considering the time as
+	 * observed by schedulable tasks (excluding time spent
+	 * in hardirq context, etc.). Deadlines are instead
+	 * computed using hard walltime. This seems to be the more
+	 * natural solution, but the full ramifications of this
+	 * approach need further study.
+	 */
+	delta_exec = update_curr_common(rq);
+	update_curr_dl_se(rq, dl_se, delta_exec);
+}
+
 static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 {
 	struct sched_dl_entity *dl_se = container_of(timer,
 						     struct sched_dl_entity,
 						     inactive_timer);
-	struct task_struct *p = dl_task_of(dl_se);
+	struct task_struct *p = NULL;
 	struct rq_flags rf;
 	struct rq *rq;
 
-	rq = task_rq_lock(p, &rf);
+	if (!dl_server(dl_se)) {
+		p = dl_task_of(dl_se);
+		rq = task_rq_lock(p, &rf);
+	} else {
+		rq = dl_se->rq;
+		rq_lock(rq, &rf);
+	}
 
 	sched_clock_tick();
 	update_rq_clock(rq);
 
+	if (dl_server(dl_se))
+		goto no_task;
+
 	if (!dl_task(p) || READ_ONCE(p->__state) == TASK_DEAD) {
 		struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
 
@@ -1396,14 +1500,21 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 
 		goto unlock;
 	}
+
+no_task:
 	if (dl_se->dl_non_contending == 0)
 		goto unlock;
 
 	sub_running_bw(dl_se, &rq->dl);
 	dl_se->dl_non_contending = 0;
 unlock:
-	task_rq_unlock(rq, p, &rf);
-	put_task_struct(p);
+
+	if (!dl_server(dl_se)) {
+		task_rq_unlock(rq, p, &rf);
+		put_task_struct(p);
+	} else {
+		rq_unlock(rq, &rf);
+	}
 
 	return HRTIMER_NORESTART;
 }
@@ -1466,10 +1577,8 @@ static inline void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
 static inline
 void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 {
-	int prio = dl_task_of(dl_se)->prio;
 	u64 deadline = dl_se->deadline;
 
-	WARN_ON(!dl_prio(prio));
 	dl_rq->dl_nr_running++;
 	add_nr_running(rq_of_dl_rq(dl_rq), 1);
 
@@ -1479,9 +1588,6 @@ void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 static inline
 void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 {
-	int prio = dl_task_of(dl_se)->prio;
-
-	WARN_ON(!dl_prio(prio));
 	WARN_ON(!dl_rq->dl_nr_running);
 	dl_rq->dl_nr_running--;
 	sub_nr_running(rq_of_dl_rq(dl_rq), 1);
@@ -1648,8 +1754,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 	} else if (flags & ENQUEUE_REPLENISH) {
 		replenish_dl_entity(dl_se);
 	} else if ((flags & ENQUEUE_RESTORE) &&
-		  dl_time_before(dl_se->deadline,
-				 rq_clock(rq_of_dl_rq(dl_rq_of_se(dl_se))))) {
+		   dl_time_before(dl_se->deadline, rq_clock(rq_of_dl_se(dl_se)))) {
 		setup_new_dl_entity(dl_se);
 	}
 
@@ -1730,19 +1835,13 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 
 	enqueue_dl_entity(&p->dl, flags);
 
+	if (dl_server(&p->dl))
+		return;
+
 	if (!task_current(rq, p) && !p->dl.dl_throttled && p->nr_cpus_allowed > 1)
 		enqueue_pushable_dl_task(rq, p);
 }
 
-static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
-{
-	update_stats_dequeue_dl(&rq->dl, &p->dl, flags);
-	dequeue_dl_entity(&p->dl, flags);
-
-	if (!p->dl.dl_throttled)
-		dequeue_pushable_dl_task(rq, p);
-}
-
 static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	update_curr_dl(rq);
@@ -1750,7 +1849,9 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	if (p->on_rq == TASK_ON_RQ_MIGRATING)
 		flags |= DEQUEUE_MIGRATING;
 
-	__dequeue_task_dl(rq, p, flags);
+	dequeue_dl_entity(&p->dl, flags);
+	if (!p->dl.dl_throttled && !dl_server(&p->dl))
+		dequeue_pushable_dl_task(rq, p);
 }
 
 /*
@@ -1940,12 +2041,12 @@ static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
 }
 
 #ifdef CONFIG_SCHED_HRTICK
-static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
+static void start_hrtick_dl(struct rq *rq, struct sched_dl_entity *dl_se)
 {
-	hrtick_start(rq, p->dl.runtime);
+	hrtick_start(rq, dl_se->runtime);
 }
 #else /* !CONFIG_SCHED_HRTICK */
-static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
+static void start_hrtick_dl(struct rq *rq, struct sched_dl_entity *dl_se)
 {
 }
 #endif
@@ -1965,9 +2066,6 @@ static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
 	if (!first)
 		return;
 
-	if (hrtick_enabled_dl(rq))
-		start_hrtick_dl(rq, p);
-
 	if (rq->curr->sched_class != &dl_sched_class)
 		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
 
@@ -1990,12 +2088,25 @@ static struct task_struct *pick_task_dl(struct rq *rq)
 	struct dl_rq *dl_rq = &rq->dl;
 	struct task_struct *p;
 
+again:
 	if (!sched_dl_runnable(rq))
 		return NULL;
 
 	dl_se = pick_next_dl_entity(dl_rq);
 	WARN_ON_ONCE(!dl_se);
-	p = dl_task_of(dl_se);
+
+	if (dl_server(dl_se)) {
+		p = dl_se->server_pick(dl_se);
+		if (!p) {
+			WARN_ON_ONCE(1);
+			dl_se->dl_yielded = 1;
+			update_curr_dl_se(rq, dl_se, 0);
+			goto again;
+		}
+		p->dl_server = dl_se;
+	} else {
+		p = dl_task_of(dl_se);
+	}
 
 	return p;
 }
@@ -2005,9 +2116,15 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
 	struct task_struct *p;
 
 	p = pick_task_dl(rq);
-	if (p)
+	if (!p)
+		return p;
+
+	if (!p->dl_server)
 		set_next_task_dl(rq, p, true);
 
+	if (hrtick_enabled(rq))
+		start_hrtick_dl(rq, &p->dl);
+
 	return p;
 }
 
@@ -2045,8 +2162,8 @@ static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
 	 * be set and schedule() will start a new hrtick for the next task.
 	 */
 	if (hrtick_enabled_dl(rq) && queued && p->dl.runtime > 0 &&
-	    is_leftmost(p, &rq->dl))
-		start_hrtick_dl(rq, p);
+	    is_leftmost(&p->dl, &rq->dl))
+		start_hrtick_dl(rq, &p->dl);
 }
 
 static void task_fork_dl(struct task_struct *p)
@@ -2986,6 +3103,7 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se)
 	dl_se->dl_yielded		= 0;
 	dl_se->dl_non_contending	= 0;
 	dl_se->dl_overrun		= 0;
+	dl_se->dl_server		= 0;
 
 #ifdef CONFIG_RT_MUTEXES
 	dl_se->pi_se			= dl_se;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1cd92b1..07f5558 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1131,6 +1131,8 @@ static inline void update_curr_task(struct task_struct *p, s64 delta_exec)
 	trace_sched_stat_runtime(p, delta_exec);
 	account_group_exec_runtime(p, delta_exec);
 	cgroup_account_cputime(p, delta_exec);
+	if (p->dl_server)
+		dl_server_update(p->dl_server, delta_exec);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1cda787..8a70d51 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -313,6 +313,33 @@ extern bool dl_param_changed(struct task_struct *p, const struct sched_attr *att
 extern int  dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
 extern int  dl_bw_check_overflow(int cpu);
 
+/*
+ * SCHED_DEADLINE supports servers (nested scheduling) with the following
+ * interface:
+ *
+ *   dl_se::rq -- runqueue we belong to.
+ *
+ *   dl_se::server_has_tasks() -- used on bandwidth enforcement; we 'stop' the
+ *                                server when it runs out of tasks to run.
+ *
+ *   dl_se::server_pick() -- nested pick_next_task(); we yield the period if this
+ *                           returns NULL.
+ *
+ *   dl_server_update() -- called from update_curr_common(), propagates runtime
+ *                         to the server.
+ *
+ *   dl_server_start()
+ *   dl_server_stop()  -- start/stop the server when it has (no) tasks.
+ *
+ *   dl_server_init() -- initializes the server.
+ */
+extern void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec);
+extern void dl_server_start(struct sched_dl_entity *dl_se);
+extern void dl_server_stop(struct sched_dl_entity *dl_se);
+extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
+		    dl_server_has_tasks_f has_tasks,
+		    dl_server_pick_f pick);
+
 #ifdef CONFIG_CGROUP_SCHED
 
 struct cfs_rq;
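
As a hypothetical illustration of the interface documented in the sched.h
hunk above (every my_* name below is made up for this sketch; the actual
user in this series, the fair server, is wired up in a separate patch):

/* somewhere in the client class, e.g. hooked into its rq sub-structure */
static bool my_server_has_tasks(struct sched_dl_entity *dl_se)
{
	return my_nr_queued(dl_se->rq) > 0;	/* any runnable work left?  */
}

static struct task_struct *my_server_pick(struct sched_dl_entity *dl_se)
{
	return my_pick_task(dl_se->rq);		/* nested pick_next_task()  */
}

static void my_server_setup(struct rq *rq)
{
	struct sched_dl_entity *dl_se = &rq->my_server;	/* hypothetical field */

	init_dl_entity(dl_se);
	dl_server_init(dl_se, rq, my_server_has_tasks, my_server_pick);
}

The class would then call dl_server_start() when its first task shows up,
dl_server_stop() when the last one leaves, and let the runtime consumed by
the picked tasks flow back through dl_server_update() -- which is what the
fair.c hunk above does via p->dl_server in update_curr_task().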


* [tip: sched/core] sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity
  2023-11-04 10:59 ` [PATCH v5 3/7] sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity Daniel Bristot de Oliveira
@ 2023-11-15  9:04   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2023-11-15  9:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel),
	Daniel Bristot de Oliveira, Phil Auld, Valentin Schneider, x86,
	linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     2f7a0f58948d8231236e2facecc500f1930fb996
Gitweb:        https://git.kernel.org/tip/2f7a0f58948d8231236e2facecc500f1930fb996
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Sat, 04 Nov 2023 11:59:20 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 15 Nov 2023 09:57:50 +01:00

sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity

In preparation for introducing a !task sched_dl_entity, move the
bandwidth accounting into {en,de}queue_dl_entity().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/a86dccbbe44e021b8771627e1dae01a69b73466d.1699095159.git.bristot@kernel.org
---
 kernel/sched/deadline.c | 130 +++++++++++++++++++++------------------
 kernel/sched/sched.h    |   6 ++-
 2 files changed, 78 insertions(+), 58 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e80bb88..81810f6 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -391,12 +391,12 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se);
  * up, and checks if the task is still in the "ACTIVE non contending"
  * state or not (in the second case, it updates running_bw).
  */
-static void task_non_contending(struct task_struct *p)
+static void task_non_contending(struct sched_dl_entity *dl_se)
 {
-	struct sched_dl_entity *dl_se = &p->dl;
 	struct hrtimer *timer = &dl_se->inactive_timer;
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
+	struct task_struct *p = dl_task_of(dl_se);
 	s64 zerolag_time;
 
 	/*
@@ -428,13 +428,14 @@ static void task_non_contending(struct task_struct *p)
 	if ((zerolag_time < 0) || hrtimer_active(&dl_se->inactive_timer)) {
 		if (dl_task(p))
 			sub_running_bw(dl_se, dl_rq);
+
 		if (!dl_task(p) || READ_ONCE(p->__state) == TASK_DEAD) {
 			struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
 
 			if (READ_ONCE(p->__state) == TASK_DEAD)
-				sub_rq_bw(&p->dl, &rq->dl);
+				sub_rq_bw(dl_se, &rq->dl);
 			raw_spin_lock(&dl_b->lock);
-			__dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
+			__dl_sub(dl_b, dl_se->dl_bw, dl_bw_cpus(task_cpu(p)));
 			raw_spin_unlock(&dl_b->lock);
 			__dl_clear_params(dl_se);
 		}
@@ -1602,6 +1603,41 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 	update_stats_enqueue_dl(dl_rq_of_se(dl_se), dl_se, flags);
 
 	/*
+	 * Check if a constrained deadline task was activated
+	 * after the deadline but before the next period.
+	 * If that is the case, the task will be throttled and
+	 * the replenishment timer will be set to the next period.
+	 */
+	if (!dl_se->dl_throttled && !dl_is_implicit(dl_se))
+		dl_check_constrained_dl(dl_se);
+
+	if (flags & (ENQUEUE_RESTORE|ENQUEUE_MIGRATING)) {
+		struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+		add_rq_bw(dl_se, dl_rq);
+		add_running_bw(dl_se, dl_rq);
+	}
+
+	/*
+	 * If p is throttled, we do not enqueue it. In fact, if it exhausted
+	 * its budget it needs a replenishment and, since it now is on
+	 * its rq, the bandwidth timer callback (which clearly has not
+	 * run yet) will take care of this.
+	 * However, the active utilization does not depend on the fact
+	 * that the task is on the runqueue or not (but depends on the
+	 * task's state - in GRUB parlance, "inactive" vs "active contending").
+	 * In other words, even if a task is throttled its utilization must
+	 * be counted in the active utilization; hence, we need to call
+	 * add_running_bw().
+	 */
+	if (dl_se->dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
+		if (flags & ENQUEUE_WAKEUP)
+			task_contending(dl_se, flags);
+
+		return;
+	}
+
+	/*
 	 * If this is a wakeup or a new instance, the scheduling
 	 * parameters of the task might need updating. Otherwise,
 	 * we want a replenishment of its runtime.
@@ -1620,9 +1656,28 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 	__enqueue_dl_entity(dl_se);
 }
 
-static void dequeue_dl_entity(struct sched_dl_entity *dl_se)
+static void dequeue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 {
 	__dequeue_dl_entity(dl_se);
+
+	if (flags & (DEQUEUE_SAVE|DEQUEUE_MIGRATING)) {
+		struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+		sub_running_bw(dl_se, dl_rq);
+		sub_rq_bw(dl_se, dl_rq);
+	}
+
+	/*
+	 * This check allows to start the inactive timer (or to immediately
+	 * decrease the active utilization, if needed) in two cases:
+	 * when the task blocks and when it is terminating
+	 * (p->state == TASK_DEAD). We can handle the two cases in the same
+	 * way, because from GRUB's point of view the same thing is happening
+	 * (the task moves from "active contending" to "active non contending"
+	 * or "inactive")
+	 */
+	if (flags & DEQUEUE_SLEEP)
+		task_non_contending(dl_se);
 }
 
 static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
@@ -1667,76 +1722,35 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 		return;
 	}
 
-	/*
-	 * Check if a constrained deadline task was activated
-	 * after the deadline but before the next period.
-	 * If that is the case, the task will be throttled and
-	 * the replenishment timer will be set to the next period.
-	 */
-	if (!p->dl.dl_throttled && !dl_is_implicit(&p->dl))
-		dl_check_constrained_dl(&p->dl);
-
-	if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & ENQUEUE_RESTORE) {
-		add_rq_bw(&p->dl, &rq->dl);
-		add_running_bw(&p->dl, &rq->dl);
-	}
-
-	/*
-	 * If p is throttled, we do not enqueue it. In fact, if it exhausted
-	 * its budget it needs a replenishment and, since it now is on
-	 * its rq, the bandwidth timer callback (which clearly has not
-	 * run yet) will take care of this.
-	 * However, the active utilization does not depend on the fact
-	 * that the task is on the runqueue or not (but depends on the
-	 * task's state - in GRUB parlance, "inactive" vs "active contending").
-	 * In other words, even if a task is throttled its utilization must
-	 * be counted in the active utilization; hence, we need to call
-	 * add_running_bw().
-	 */
-	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
-		if (flags & ENQUEUE_WAKEUP)
-			task_contending(&p->dl, flags);
-
-		return;
-	}
-
 	check_schedstat_required();
 	update_stats_wait_start_dl(dl_rq_of_se(&p->dl), &p->dl);
 
+	if (p->on_rq == TASK_ON_RQ_MIGRATING)
+		flags |= ENQUEUE_MIGRATING;
+
 	enqueue_dl_entity(&p->dl, flags);
 
-	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
+	if (!task_current(rq, p) && !p->dl.dl_throttled && p->nr_cpus_allowed > 1)
 		enqueue_pushable_dl_task(rq, p);
 }
 
 static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	update_stats_dequeue_dl(&rq->dl, &p->dl, flags);
-	dequeue_dl_entity(&p->dl);
-	dequeue_pushable_dl_task(rq, p);
+	dequeue_dl_entity(&p->dl, flags);
+
+	if (!p->dl.dl_throttled)
+		dequeue_pushable_dl_task(rq, p);
 }
 
 static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	update_curr_dl(rq);
-	__dequeue_task_dl(rq, p, flags);
 
-	if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & DEQUEUE_SAVE) {
-		sub_running_bw(&p->dl, &rq->dl);
-		sub_rq_bw(&p->dl, &rq->dl);
-	}
+	if (p->on_rq == TASK_ON_RQ_MIGRATING)
+		flags |= DEQUEUE_MIGRATING;
 
-	/*
-	 * This check allows to start the inactive timer (or to immediately
-	 * decrease the active utilization, if needed) in two cases:
-	 * when the task blocks and when it is terminating
-	 * (p->state == TASK_DEAD). We can handle the two cases in the same
-	 * way, because from GRUB's point of view the same thing is happening
-	 * (the task moves from "active contending" to "active non contending"
-	 * or "inactive")
-	 */
-	if (flags & DEQUEUE_SLEEP)
-		task_non_contending(p);
+	__dequeue_task_dl(rq, p, flags);
 }
 
 /*
@@ -2551,7 +2565,7 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
 	 * will reset the task parameters.
 	 */
 	if (task_on_rq_queued(p) && p->dl.dl_runtime)
-		task_non_contending(p);
+		task_non_contending(&p->dl);
 
 	/*
 	 * In case a task is setscheduled out from SCHED_DEADLINE we need to
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3c62df1..1cda787 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2177,6 +2177,10 @@ extern const u32		sched_prio_to_wmult[40];
  * MOVE - paired with SAVE/RESTORE, explicitly does not preserve the location
  *        in the runqueue.
  *
+ * NOCLOCK - skip the update_rq_clock() (avoids double updates)
+ *
+ * MIGRATION - p->on_rq == TASK_ON_RQ_MIGRATING (used for DEADLINE)
+ *
  * ENQUEUE_HEAD      - place at front of runqueue (tail if not specified)
  * ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline)
  * ENQUEUE_MIGRATED  - the task was migrated during wakeup
@@ -2187,6 +2191,7 @@ extern const u32		sched_prio_to_wmult[40];
 #define DEQUEUE_SAVE		0x02 /* Matches ENQUEUE_RESTORE */
 #define DEQUEUE_MOVE		0x04 /* Matches ENQUEUE_MOVE */
 #define DEQUEUE_NOCLOCK		0x08 /* Matches ENQUEUE_NOCLOCK */
+#define DEQUEUE_MIGRATING	0x100 /* Matches ENQUEUE_MIGRATING */
 
 #define ENQUEUE_WAKEUP		0x01
 #define ENQUEUE_RESTORE		0x02
@@ -2201,6 +2206,7 @@ extern const u32		sched_prio_to_wmult[40];
 #define ENQUEUE_MIGRATED	0x00
 #endif
 #define ENQUEUE_INITIAL		0x80
+#define ENQUEUE_MIGRATING	0x100
 
 #define RETRY_TASK		((void *)-1UL)
 


* [tip: sched/core] sched/deadline: Collect sched_dl_entity initialization
  2023-11-04 10:59 ` [PATCH v5 2/7] sched/deadline: Collect sched_dl_entity initialization Daniel Bristot de Oliveira
@ 2023-11-15  9:04   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2023-11-15  9:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel),
	Daniel Bristot de Oliveira, Phil Auld, Valentin Schneider, x86,
	linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     9e07d45c5210f5dd6701c00d55791983db7320fa
Gitweb:        https://git.kernel.org/tip/9e07d45c5210f5dd6701c00d55791983db7320fa
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Sat, 04 Nov 2023 11:59:19 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 15 Nov 2023 09:57:50 +01:00

sched/deadline: Collect sched_dl_entity initialization

Create a single function that initializes a sched_dl_entity.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/51acc695eecf0a1a2f78f9a044e11ffd9b316bcf.1699095159.git.bristot@kernel.org
---
 kernel/sched/core.c     |  5 +----
 kernel/sched/deadline.c | 22 +++++++++++++++-------
 kernel/sched/sched.h    |  5 +----
 3 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9d5099d..966631f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4511,10 +4511,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	memset(&p->stats, 0, sizeof(p->stats));
 #endif
 
-	RB_CLEAR_NODE(&p->dl.rb_node);
-	init_dl_task_timer(&p->dl);
-	init_dl_inactive_task_timer(&p->dl);
-	__dl_clear_params(p);
+	init_dl_entity(&p->dl);
 
 	INIT_LIST_HEAD(&p->rt.run_list);
 	p->rt.timeout		= 0;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index de79719..e80bb88 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -335,6 +335,8 @@ static void dl_change_utilization(struct task_struct *p, u64 new_bw)
 	__add_rq_bw(new_bw, &rq->dl);
 }
 
+static void __dl_clear_params(struct sched_dl_entity *dl_se);
+
 /*
  * The utilization of a task cannot be immediately removed from
  * the rq active utilization (running_bw) when the task blocks.
@@ -434,7 +436,7 @@ static void task_non_contending(struct task_struct *p)
 			raw_spin_lock(&dl_b->lock);
 			__dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
 			raw_spin_unlock(&dl_b->lock);
-			__dl_clear_params(p);
+			__dl_clear_params(dl_se);
 		}
 
 		return;
@@ -1183,7 +1185,7 @@ unlock:
 	return HRTIMER_NORESTART;
 }
 
-void init_dl_task_timer(struct sched_dl_entity *dl_se)
+static void init_dl_task_timer(struct sched_dl_entity *dl_se)
 {
 	struct hrtimer *timer = &dl_se->dl_timer;
 
@@ -1389,7 +1391,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 		raw_spin_lock(&dl_b->lock);
 		__dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
 		raw_spin_unlock(&dl_b->lock);
-		__dl_clear_params(p);
+		__dl_clear_params(dl_se);
 
 		goto unlock;
 	}
@@ -1405,7 +1407,7 @@ unlock:
 	return HRTIMER_NORESTART;
 }
 
-void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se)
+static void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se)
 {
 	struct hrtimer *timer = &dl_se->inactive_timer;
 
@@ -2957,10 +2959,8 @@ bool __checkparam_dl(const struct sched_attr *attr)
 /*
  * This function clears the sched_dl_entity static params.
  */
-void __dl_clear_params(struct task_struct *p)
+static void __dl_clear_params(struct sched_dl_entity *dl_se)
 {
-	struct sched_dl_entity *dl_se = &p->dl;
-
 	dl_se->dl_runtime		= 0;
 	dl_se->dl_deadline		= 0;
 	dl_se->dl_period		= 0;
@@ -2978,6 +2978,14 @@ void __dl_clear_params(struct task_struct *p)
 #endif
 }
 
+void init_dl_entity(struct sched_dl_entity *dl_se)
+{
+	RB_CLEAR_NODE(&dl_se->rb_node);
+	init_dl_task_timer(dl_se);
+	init_dl_inactive_task_timer(dl_se);
+	__dl_clear_params(dl_se);
+}
+
 bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr)
 {
 	struct sched_dl_entity *dl_se = &p->dl;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6703e9e..3c62df1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -273,8 +273,6 @@ struct rt_bandwidth {
 	unsigned int		rt_period_active;
 };
 
-void __dl_clear_params(struct task_struct *p);
-
 static inline int dl_bandwidth_enabled(void)
 {
 	return sysctl_sched_rt_runtime >= 0;
@@ -2427,8 +2425,7 @@ extern struct rt_bandwidth def_rt_bandwidth;
 extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
 extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
 
-extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
-extern void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se);
+extern void init_dl_entity(struct sched_dl_entity *dl_se);
 
 #define BW_SHIFT		20
 #define BW_UNIT			(1 << BW_SHIFT)

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [tip: sched/core] sched: Unify runtime accounting across classes
  2023-11-04 10:59 ` [PATCH v5 1/7] sched: Unify runtime accounting across classes Daniel Bristot de Oliveira
@ 2023-11-15  9:04   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 76+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2023-11-15  9:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel),
	Daniel Bristot de Oliveira, Phil Auld, Valentin Schneider,
	Steven Rostedt (Google),
	x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     5d69eca542ee17c618f9a55da52191d5e28b435f
Gitweb:        https://git.kernel.org/tip/5d69eca542ee17c618f9a55da52191d5e28b435f
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Sat, 04 Nov 2023 11:59:18 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 15 Nov 2023 09:57:48 +01:00

sched: Unify runtime accounting across classes

All classes use sched_entity::exec_start to track runtime and have
copies of the exact same code around to compute runtime.

Collapse all that.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/54d148a144f26d9559698c4dd82d8859038a7380.1699095159.git.bristot@kernel.org
---
 include/linux/sched.h    |  2 +-
 kernel/sched/deadline.c  | 15 ++--------
 kernel/sched/fair.c      | 57 +++++++++++++++++++++++++++++----------
 kernel/sched/rt.c        | 15 ++--------
 kernel/sched/sched.h     | 12 +-------
 kernel/sched/stop_task.c | 13 +---------
 6 files changed, 53 insertions(+), 61 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index cd56d40..44b46d9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -523,7 +523,7 @@ struct sched_statistics {
 	u64				block_max;
 	s64				sum_block_runtime;
 
-	u64				exec_max;
+	s64				exec_max;
 	u64				slice_max;
 
 	u64				nr_migrations_cold;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index b281144..de79719 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1275,9 +1275,8 @@ static void update_curr_dl(struct rq *rq)
 {
 	struct task_struct *curr = rq->curr;
 	struct sched_dl_entity *dl_se = &curr->dl;
-	u64 delta_exec, scaled_delta_exec;
+	s64 delta_exec, scaled_delta_exec;
 	int cpu = cpu_of(rq);
-	u64 now;
 
 	if (!dl_task(curr) || !on_dl_rq(dl_se))
 		return;
@@ -1290,21 +1289,13 @@ static void update_curr_dl(struct rq *rq)
 	 * natural solution, but the full ramifications of this
 	 * approach need further study.
 	 */
-	now = rq_clock_task(rq);
-	delta_exec = now - curr->se.exec_start;
-	if (unlikely((s64)delta_exec <= 0)) {
+	delta_exec = update_curr_common(rq);
+	if (unlikely(delta_exec <= 0)) {
 		if (unlikely(dl_se->dl_yielded))
 			goto throttle;
 		return;
 	}
 
-	schedstat_set(curr->stats.exec_max,
-		      max(curr->stats.exec_max, delta_exec));
-
-	trace_sched_stat_runtime(curr, delta_exec, 0);
-
-	update_current_exec_runtime(curr, now, delta_exec);
-
 	if (dl_entity_is_special(dl_se))
 		return;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d3e045d..11073cf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1103,23 +1103,17 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
 }
 #endif /* CONFIG_SMP */
 
-/*
- * Update the current task's runtime statistics.
- */
-static void update_curr(struct cfs_rq *cfs_rq)
+static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
 {
-	struct sched_entity *curr = cfs_rq->curr;
-	u64 now = rq_clock_task(rq_of(cfs_rq));
-	u64 delta_exec;
-
-	if (unlikely(!curr))
-		return;
+	u64 now = rq_clock_task(rq);
+	s64 delta_exec;
 
 	delta_exec = now - curr->exec_start;
-	if (unlikely((s64)delta_exec <= 0))
-		return;
+	if (unlikely(delta_exec <= 0))
+		return delta_exec;
 
 	curr->exec_start = now;
+	curr->sum_exec_runtime += delta_exec;
 
 	if (schedstat_enabled()) {
 		struct sched_statistics *stats;
@@ -1129,8 +1123,43 @@ static void update_curr(struct cfs_rq *cfs_rq)
 				max(delta_exec, stats->exec_max));
 	}
 
-	curr->sum_exec_runtime += delta_exec;
-	schedstat_add(cfs_rq->exec_clock, delta_exec);
+	return delta_exec;
+}
+
+/*
+ * Used by other classes to account runtime.
+ */
+s64 update_curr_common(struct rq *rq)
+{
+	struct task_struct *curr = rq->curr;
+	s64 delta_exec;
+
+	delta_exec = update_curr_se(rq, &curr->se);
+	if (unlikely(delta_exec <= 0))
+		return delta_exec;
+
+	trace_sched_stat_runtime(curr, delta_exec, 0);
+
+	account_group_exec_runtime(curr, delta_exec);
+	cgroup_account_cputime(curr, delta_exec);
+
+	return delta_exec;
+}
+
+/*
+ * Update the current task's runtime statistics.
+ */
+static void update_curr(struct cfs_rq *cfs_rq)
+{
+	struct sched_entity *curr = cfs_rq->curr;
+	s64 delta_exec;
+
+	if (unlikely(!curr))
+		return;
+
+	delta_exec = update_curr_se(rq_of(cfs_rq), curr);
+	if (unlikely(delta_exec <= 0))
+		return;
 
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
 	update_deadline(cfs_rq, curr);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 6aaf0a3..3261b06 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1002,24 +1002,15 @@ static void update_curr_rt(struct rq *rq)
 {
 	struct task_struct *curr = rq->curr;
 	struct sched_rt_entity *rt_se = &curr->rt;
-	u64 delta_exec;
-	u64 now;
+	s64 delta_exec;
 
 	if (curr->sched_class != &rt_sched_class)
 		return;
 
-	now = rq_clock_task(rq);
-	delta_exec = now - curr->se.exec_start;
-	if (unlikely((s64)delta_exec <= 0))
+	delta_exec = update_curr_common(rq);
+	if (unlikely(delta_exec <= 0))
 		return;
 
-	schedstat_set(curr->stats.exec_max,
-		      max(curr->stats.exec_max, delta_exec));
-
-	trace_sched_stat_runtime(curr, delta_exec, 0);
-
-	update_current_exec_runtime(curr, now, delta_exec);
-
 	if (!rt_bandwidth_enabled())
 		return;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 539c7e7..6703e9e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2212,6 +2212,8 @@ struct affinity_context {
 	unsigned int flags;
 };
 
+extern s64 update_curr_common(struct rq *rq);
+
 struct sched_class {
 
 #ifdef CONFIG_UCLAMP_TASK
@@ -3262,16 +3264,6 @@ extern int sched_dynamic_mode(const char *str);
 extern void sched_dynamic_update(int mode);
 #endif
 
-static inline void update_current_exec_runtime(struct task_struct *curr,
-						u64 now, u64 delta_exec)
-{
-	curr->se.sum_exec_runtime += delta_exec;
-	account_group_exec_runtime(curr, delta_exec);
-
-	curr->se.exec_start = now;
-	cgroup_account_cputime(curr, delta_exec);
-}
-
 #ifdef CONFIG_SCHED_MM_CID
 
 #define SCHED_MM_CID_PERIOD_NS	(100ULL * 1000000)	/* 100ms */
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 6cf7304..b1b8fe6 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -70,18 +70,7 @@ static void yield_task_stop(struct rq *rq)
 
 static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
 {
-	struct task_struct *curr = rq->curr;
-	u64 now, delta_exec;
-
-	now = rq_clock_task(rq);
-	delta_exec = now - curr->se.exec_start;
-	if (unlikely((s64)delta_exec < 0))
-		delta_exec = 0;
-
-	schedstat_set(curr->stats.exec_max,
-		      max(curr->stats.exec_max, delta_exec));
-
-	update_current_exec_runtime(curr, now, delta_exec);
+	update_curr_common(rq);
 }
 
 /*

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 0/7] SCHED_DEADLINE server infrastructure
  2023-11-04 10:59 [PATCH v5 0/7] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
                   ` (6 preceding siblings ...)
  2023-11-04 10:59 ` [PATCH v5 7/7] sched/fair: Fair server interface Daniel Bristot de Oliveira
@ 2023-12-08 21:47 ` Joel Fernandes
  2024-02-19  7:33 ` Huang, Ying
  8 siblings, 0 replies; 76+ messages in thread
From: Joel Fernandes @ 2023-12-08 21:47 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira, Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld

On Sat, Nov 04, 2023 at 11:59:17AM +0100, Daniel Bristot de Oliveira wrote:
> This is v5 of Peter's SCHED_DEADLINE server infrastructure
> implementation [1].
> 
> SCHED_DEADLINE servers can help fixing starvation issues of low priority
> tasks (e.g., SCHED_OTHER) when higher priority tasks monopolize CPU
> cycles. Today we have RT Throttling; DEADLINE servers should be able to
> replace and improve that.

Hello!
Just wanted to provide some ChromeOS data on these patches. There is
great improvement when using the DL-server along with RT for Chrome's
foreground display, audio and main threads. Once they are in the background,
we set them back to CFS (except audio). I think these patches are ready to
move forward as the data looks good to me. I see Peter picked up some of them
already, which is nice.

One of the key metrics for us is event latency. We have a test that measures
various latency metrics while typing happens in a Google Docs document in one
window and a 16-person Google Meet call runs in the other. This is a very
complex test but gets us close to what the user experiences (as is typical -
meeting attendees in a Google Meet call take notes in a Google Doc). As a
result, getting stable numbers requires a lot of care, which is why I used
P-values to measure the statistical significance of the results. The P-values
for some metrics show lower significance, so we can ignore those, but I still
provided them in the table.

The test is run on a Chromebook with 4 cores (Intel(R) Celeron(R) N4100 CPU @
1.10GHz) and 16GB of RAM. No Hyperthreading.

All units are microseconds. Each average is computed over 20 runs, with and
without "Chrome using RT + DL-server". The default of 5% every 1 second does
not work for us, so I changed the DL server parameters to 5ms every 30ms.
This allows CFS to run more often.
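
(For reference, a minimal sketch of how such parameters could be set from
userspace, assuming the per-CPU debugfs files added by patch 7/7 and the
fair_server/cpuN directory layout used later in this thread; values are in
nanoseconds, and runtime is written before period so that runtime <= period
holds at every step. The admission quirk discussed further down the thread
may still require intermediate values:

  # echo 5000000 > /sys/kernel/debug/sched/fair_server/cpu0/runtime
  # echo 30000000 > /sys/kernel/debug/sched/fair_server/cpu0/period

and similarly for the other CPUs.)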

This test runs for 6 hours. Total test time for both before and after is 12 hours:

---------------------------------------------------------------------------------------------------------
| MetricName                                      | Average Before | Average After | Change % | P-value |
---------------------------------------------------------------------------------------------------------
| Ash.EventLatency.Core.TotalLatency              | 90.19          | 78.22         | 13.27%   | 0.03    |
---------------------------------------------------------------------------------------------------------
| Ash.EventLatency.KeyReleased.TotalLatency       | 90733.76       | 78602.72      | 13.37%   | 0.03    |
---------------------------------------------------------------------------------------------------------
| Ash.EventLatency.TotalLatency                   | 90.19          | 78.22         | 13.27%   | 0.03    |
---------------------------------------------------------------------------------------------------------
| Docs.EventLatency.KeyPressed.TotalLatency       | 68269.21       | 63310.99      | 7.26%    | 0.00    |
---------------------------------------------------------------------------------------------------------
| Docs.EventLatency.MousePressed.TotalLatency     | 192080.44      | 179264.31     | 6.67%    | 0.26    |
---------------------------------------------------------------------------------------------------------
| Docs.EventLatency.TotalLatency                  | 68795.99       | 63860.04      | 7.17%    | 0.00    |
---------------------------------------------------------------------------------------------------------
| EventLatency.GestureScrollUpdt.Wheel.TotalLat   | 63420.88       | 59394.18      | 6.35%    | 0.02    |
---------------------------------------------------------------------------------------------------------
| EventLatency.KeyPressed.TotalLatency            | 68269.21       | 63310.99      | 7.26%    | 0.00    |
---------------------------------------------------------------------------------------------------------
| EventLatency.MouseDragged.TotalLatency          | 106393.09      | 104152.50     | 2.11%    | 0.57    |
---------------------------------------------------------------------------------------------------------
| EventLatency.MouseMoved.TotalLatency            | 129225.65      | 113268.48     | 12.35%   | 0.01    |
---------------------------------------------------------------------------------------------------------
| EventLatency.MousePressed.TotalLatency          | 192080.44      | 179264.31     | 6.67%    | 0.26    |
---------------------------------------------------------------------------------------------------------
| EventLatency.MouseReleased.TotalLatency         | 152366.33      | 140309.50     | 7.91%    | 0.44    |
---------------------------------------------------------------------------------------------------------
| EventLatency.TotalLatency                       | 68795.99       | 63862.45      | 7.17%    | 0.00    |
---------------------------------------------------------------------------------------------------------
| EventLatency.TotalLatency_ash-Chrome            | 68795.99       | 63862.45      | 7.17%    | 0.00    |
---------------------------------------------------------------------------------------------------------

I also did another test where I measured the CFS maximum latency (using perf
sched) while a YouTube video was playing, and the CFS max latency looks great
too. In fact, with the vanilla RT throttling, our CFS tasks are doing really
badly (perhaps because they depend on RT tasks through locks or such). So we
definitely need the DL-server to use RT properly!
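
(A rough sketch of that kind of measurement, assuming plain perf sched and not
necessarily the exact invocation used here:

  # perf sched record -- sleep 30
  # perf sched latency -s max

which records scheduling events for 30 seconds and then prints the per-task
latency summary sorted by maximum delay.)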

We are also testing the DL server with 5ms/50ms and 5ms/100ms to see the
impact, but at the moment 5ms/30ms is looking good.

Thanks for all of your work, here's to better Linux and better Chromebooks ;)

 - Joel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 7/7] sched/fair: Fair server interface
  2023-11-04 10:59 ` [PATCH v5 7/7] sched/fair: Fair server interface Daniel Bristot de Oliveira
                     ` (3 preceding siblings ...)
  2023-11-07 12:38   ` Peter Zijlstra
@ 2024-01-19  1:49   ` Joel Fernandes
  2024-01-19  1:55   ` Joel Fernandes
  2024-02-13  2:13   ` Joel Fernandes
  6 siblings, 0 replies; 76+ messages in thread
From: Joel Fernandes @ 2024-01-19  1:49 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira, Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld, suleiman

On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
> Add an interface for fair server setup on debugfs.
> 
> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
> 
>  - fair_server_runtime: set runtime in ns
>  - fair_server_period: set period in ns
>  - fair_server_defer: on/off for the defer mechanism
> 
> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>

Hi Daniel, Peter,
I am writing on behalf of the ChromeOS scheduler team.

We had to revert the last 3 patches in this series because of a
syzkaller-reported bug; this happens on the sched/more branch in Peter's tree:

 WARNING: CPU: 0 PID: 2404 at kernel/sched/fair.c:5220
 place_entity+0x240/0x290 kernel/sched/fair.c:5147
 Call Trace:
 <TASK>
  enqueue_entity+0xdf/0x1130 kernel/sched/fair.c:5283
  enqueue_task_fair+0x241/0xbd0 kernel/sched/fair.c:6717
  enqueue_task+0x199/0x2f0 kernel/sched/core.c:2117
  activate_task+0x60/0xc0 kernel/sched/core.c:2147
  ttwu_do_activate+0x18d/0x6b0 kernel/sched/core.c:3794
  ttwu_queue kernel/sched/core.c:4047 [inline]
  try_to_wake_up+0x805/0x12f0 kernel/sched/core.c:4368
  kick_pool+0x2e7/0x3b0 kernel/workqueue.c:1142
  __queue_work+0xcf8/0xfe0 kernel/workqueue.c:1800
  queue_delayed_work_on+0x15a/0x260 kernel/workqueue.c:1986
  queue_delayed_work include/linux/workqueue.h:577 [inline]
  srcu_funnel_gp_start kernel/rcu/srcutree.c:1068 [inline]

which is basically this warning in place_entity:
		if (WARN_ON_ONCE(!load))
			load = 1;

Full log (scroll to the bottom as there are console/lockdep side effects which
are likely not relevant to this issue): https://paste.debian.net/1304579/

Side note, we are also looking into a KASAN nullptr deref but this happens
only on our backport of the patches to a 5.15 kernel, as far as we know.

KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 0 PID: 1592 Comm: syz-executor.0 Not tainted [...]
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
 RIP: 0010:____rb_erase_color lib/rbtree.c:354 [inline] 
 RIP: 0010:rb_erase+0x664/0xe1e lib/rbtree.c:445
 [...]
Call Trace:
 <TASK>
  set_next_entity+0x6e/0x576 kernel/sched/fair.c:4728
  set_next_task_fair+0x1bb/0x355 kernel/sched/fair.c:11943
  set_next_task kernel/sched/sched.h:2241 [inline] 
  pick_next_task kernel/sched/core.c:6014 [inline] 
  __schedule+0x36fb/0x402d kernel/sched/core.c:6378
  preempt_schedule_common+0x74/0xc0 kernel/sched/core.c:6590
  preempt_schedule+0xd6/0xdd kernel/sched/core.c:6615

Full splat: https://paste.debian.net/1304573/

Investigation is ongoing, but could you also please take a look at these? They
are hard to reproduce and only syzkaller's syzbot has luck reproducing them.

Also I had a comment below:

> +int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
> +{
> +	u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
> +	u64 new_bw = to_ratio(period, runtime);
> +	struct rq *rq = dl_se->rq;
> +	int cpu = cpu_of(rq);
> +	struct dl_bw *dl_b;
> +	unsigned long cap;
> +	int retval = 0;
> +	int cpus;
> +
> +	dl_b = dl_bw_of(cpu);
> +	raw_spin_lock(&dl_b->lock);
> +	cpus = dl_bw_cpus(cpu);
> +	cap = dl_bw_capacity(cpu);
> +
> +	if (__dl_overflow(dl_b, cap, old_bw, new_bw)) {

The dl_overflow() call here seems to introduce an issue with our conceptual
understanding of how the dl server is supposed to work.

Suppose we have a 4 CPU system. Also suppose RT throttling is disabled.
Suppose the DL server params are 50ms runtime in 100ms period (basically we
want to dedicate 50% of the bandwidth of each CPU to CFS).

In such a situation, __dl_overflow() will return an error right? Because
total bandwidth will exceed 100% (4 times 50% is 200%).

Further, this complicates the setting of the parameters since it means we
have to check the number of CPUs in advance and then set the parameters to
prevent dl_overflow(). As an example of this, 30% (runtime / period) for each
CPU will work fine if we have 2 CPUs. But if we have 4 CPUs, it will not work
because __dl_overflow() will fail.

How do you suggest we remedy this? Can we make the dlserver calculate how
much bandwidth is allowed on a per-CPU basis? My understanding is the fake
dl_se are all pinned to their respective CPUs, so we don't have the same
requirement as real DL tasks which may migrate freely within the root domain.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 7/7] sched/fair: Fair server interface
  2023-11-04 10:59 ` [PATCH v5 7/7] sched/fair: Fair server interface Daniel Bristot de Oliveira
                     ` (4 preceding siblings ...)
  2024-01-19  1:49   ` Joel Fernandes
@ 2024-01-19  1:55   ` Joel Fernandes
  2024-01-22 14:14     ` Daniel Bristot de Oliveira
  2024-02-13  2:13   ` Joel Fernandes
  6 siblings, 1 reply; 76+ messages in thread
From: Joel Fernandes @ 2024-01-19  1:55 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld, suleiman

On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
> Add an interface for fair server setup on debugfs.
> 
> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
> 
>  - fair_server_runtime: set runtime in ns
>  - fair_server_period: set period in ns
>  - fair_server_defer: on/off for the defer mechanism
> 
> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>

Hi Daniel, Peter,
I am writing on behalf of the ChromeOS scheduler team.

We had to revert the last 3 patches in this series because of a
syzkaller-reported bug; this happens on the sched/more branch in Peter's tree:

 WARNING: CPU: 0 PID: 2404 at kernel/sched/fair.c:5220
 place_entity+0x240/0x290 kernel/sched/fair.c:5147
 Call Trace:
 <TASK>
  enqueue_entity+0xdf/0x1130 kernel/sched/fair.c:5283
  enqueue_task_fair+0x241/0xbd0 kernel/sched/fair.c:6717
  enqueue_task+0x199/0x2f0 kernel/sched/core.c:2117
  activate_task+0x60/0xc0 kernel/sched/core.c:2147
  ttwu_do_activate+0x18d/0x6b0 kernel/sched/core.c:3794
  ttwu_queue kernel/sched/core.c:4047 [inline]
  try_to_wake_up+0x805/0x12f0 kernel/sched/core.c:4368
  kick_pool+0x2e7/0x3b0 kernel/workqueue.c:1142
  __queue_work+0xcf8/0xfe0 kernel/workqueue.c:1800
  queue_delayed_work_on+0x15a/0x260 kernel/workqueue.c:1986
  queue_delayed_work include/linux/workqueue.h:577 [inline]
  srcu_funnel_gp_start kernel/rcu/srcutree.c:1068 [inline]

which is basically this warning in place_entity:
		if (WARN_ON_ONCE(!load))
			load = 1;

Full log (scroll to the bottom as there are console/lockdep side effects which
are likely not relevant to this issue): https://paste.debian.net/1304579/

Side note, we are also looking into a KASAN nullptr deref but this happens
only on our backport of the patches to a 5.15 kernel, as far as we know.

KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 0 PID: 1592 Comm: syz-executor.0 Not tainted [...]
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
 RIP: 0010:____rb_erase_color lib/rbtree.c:354 [inline] 
 RIP: 0010:rb_erase+0x664/0xe1e lib/rbtree.c:445
 [...]
Call Trace:
 <TASK>
  set_next_entity+0x6e/0x576 kernel/sched/fair.c:4728
  set_next_task_fair+0x1bb/0x355 kernel/sched/fair.c:11943
  set_next_task kernel/sched/sched.h:2241 [inline] 
  pick_next_task kernel/sched/core.c:6014 [inline] 
  __schedule+0x36fb/0x402d kernel/sched/core.c:6378
  preempt_schedule_common+0x74/0xc0 kernel/sched/core.c:6590
  preempt_schedule+0xd6/0xdd kernel/sched/core.c:6615

Full splat: https://paste.debian.net/1304573/

Investigation is ongoing, but could you also please take a look at these? They
are hard to reproduce and only syzbot has luck reproducing them.

Also I had a comment below:

> +int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
> +{
> +	u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
> +	u64 new_bw = to_ratio(period, runtime);
> +	struct rq *rq = dl_se->rq;
> +	int cpu = cpu_of(rq);
> +	struct dl_bw *dl_b;
> +	unsigned long cap;
> +	int retval = 0;
> +	int cpus;
> +
> +	dl_b = dl_bw_of(cpu);
> +	raw_spin_lock(&dl_b->lock);
> +	cpus = dl_bw_cpus(cpu);
> +	cap = dl_bw_capacity(cpu);
> +
> +	if (__dl_overflow(dl_b, cap, old_bw, new_bw)) {

The dl_overflow() call here seems to introduce an issue with our conceptual
understanding of how the dl server is supposed to work.

Suppose we have a 4 CPU system. Also suppose RT throttling is disabled.
Suppose the DL server params are 50ms runtime in 100ms period (basically we
want to dedicate 50% of the bandwidth of each CPU to CFS).

In such a situation, __dl_overflow() will return an error right? Because
total bandwidth will exceed 100% (4 times 50% is 200%).

Further, this complicates the setting of the parameters since it means we
have to check the number of CPUs in advance and then set the parameters to
prevent dl_overflow(). As an example of this, 30% (runtime / period) for each
CPU will work fine if we have 2 CPUs. But if we have 4 CPUs, it will not work
because __dl_overflow() will fail.

How do you suggest we remedy this? Can we make the dlserver calculate how
much bandwidth is allowed on a per-CPU basis? My understanding is the fake
dl_se are all pinned to their respective CPUs, so we don't have the same
requirement as real DL tasks which may migrate freely within the root domain.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 7/7] sched/fair: Fair server interface
  2024-01-19  1:55   ` Joel Fernandes
@ 2024-01-22 14:14     ` Daniel Bristot de Oliveira
  2024-01-23 15:39       ` Joel Fernandes
  2024-01-23 15:44       ` Joel Fernandes
  0 siblings, 2 replies; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2024-01-22 14:14 UTC (permalink / raw)
  To: Joel Fernandes, Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel, Luca Abeni, Tommaso Cucinotta,
	Thomas Gleixner, Vineeth Pillai, Shuah Khan, Phil Auld, suleiman

On 1/19/24 02:55, Joel Fernandes wrote:
> On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
>> Add an interface for fair server setup on debugfs.
>>
>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>
>>  - fair_server_runtime: set runtime in ns
>>  - fair_server_period: set period in ns
>>  - fair_server_defer: on/off for the defer mechanism
>>
>> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
> 
> Hi Daniel, Peter,
> I am writing on behalf of the ChromeOS scheduler team.
> 
> We had to revert the last 3 patches in this series because of a syzkaller
> reported bug, this happens on the sched/more branch in Peter's tree:
> 
>  WARNING: CPU: 0 PID: 2404 at kernel/sched/fair.c:5220
>  place_entity+0x240/0x290 kernel/sched/fair.c:5147
>  Call Trace:
>  <TASK>
>   enqueue_entity+0xdf/0x1130 kernel/sched/fair.c:5283
>   enqueue_task_fair+0x241/0xbd0 kernel/sched/fair.c:6717
>   enqueue_task+0x199/0x2f0 kernel/sched/core.c:2117
>   activate_task+0x60/0xc0 kernel/sched/core.c:2147
>   ttwu_do_activate+0x18d/0x6b0 kernel/sched/core.c:3794
>   ttwu_queue kernel/sched/core.c:4047 [inline]
>   try_to_wake_up+0x805/0x12f0 kernel/sched/core.c:4368
>   kick_pool+0x2e7/0x3b0 kernel/workqueue.c:1142
>   __queue_work+0xcf8/0xfe0 kernel/workqueue.c:1800
>   queue_delayed_work_on+0x15a/0x260 kernel/workqueue.c:1986
>   queue_delayed_work include/linux/workqueue.h:577 [inline]
>   srcu_funnel_gp_start kernel/rcu/srcutree.c:1068 [inline]
> 
> which is basically this warning in place_entity:
> 		if (WARN_ON_ONCE(!load))
> 			load = 1;
> 
> Full log (scroll to the bottom as there is console/lockdep side effects which
> are likely not relevant to this issue): https://paste.debian.net/1304579/
> 
> Side note, we are also looking into a KASAN nullptr deref but this happens
> only on our backport of the patches to a 5.15 kernel, as far as we know.
> 
> KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
> CPU: 0 PID: 1592 Comm: syz-executor.0 Not tainted [...]
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
>  RIP: 0010:____rb_erase_color lib/rbtree.c:354 [inline] 
>  RIP: 0010:rb_erase+0x664/0xe1e lib/rbtree.c:445
>  [...]
> Call Trace:
>  <TASK>
>   set_next_entity+0x6e/0x576 kernel/sched/fair.c:4728
>   set_next_task_fair+0x1bb/0x355 kernel/sched/fair.c:11943
>   set_next_task kernel/sched/sched.h:2241 [inline] 
>   pick_next_task kernel/sched/core.c:6014 [inline] 
>   __schedule+0x36fb/0x402d kernel/sched/core.c:6378
>   preempt_schedule_common+0x74/0xc0 kernel/sched/core.c:6590
>   preempt_schedule+0xd6/0xdd kernel/sched/core.c:6615
> 
> Full splat: https://paste.debian.net/1304573/

Interesting, does it keep any task hung? I am having a case where I see
a hung task, but I do not get the splat because the system freezes (printk
with rq_lock I guess)...

It might be the same problem.

> Investigation is on going but could you also please take a look at these? It
> is hard to reproduce and only syzbot has luck reproducing these.
> 
> Also I had a comment below:
> 
>> +int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
>> +{
>> +	u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
>> +	u64 new_bw = to_ratio(period, runtime);
>> +	struct rq *rq = dl_se->rq;
>> +	int cpu = cpu_of(rq);
>> +	struct dl_bw *dl_b;
>> +	unsigned long cap;
>> +	int retval = 0;
>> +	int cpus;
>> +
>> +	dl_b = dl_bw_of(cpu);
>> +	raw_spin_lock(&dl_b->lock);
>> +	cpus = dl_bw_cpus(cpu);
>> +	cap = dl_bw_capacity(cpu);
>> +
>> +	if (__dl_overflow(dl_b, cap, old_bw, new_bw)) {
> 
> The dl_overflow() call here seems introducing an issue with our conceptual
> understanding of how the dl server is supposed to work.
> 
> Suppose we have a 4 CPU system. Also suppose RT throttling is disabled.
> Suppose the DL server params are 50ms runtime in 100ms period (basically we
> want to dedicate 50% of the bandwidth of each CPU to CFS).
> 
> In such a situation, __dl_overflow() will return an error right? Because
> total bandwidth will exceed 100% (4 times 50% is 200%).

I might be missing something in your case, but it accepts:

root@fedora:/sys/kernel/debug/sched/fair_server# find . -type f -exec cat {} \;
1
1000000000
500000000
1
1000000000
500000000
1
1000000000
500000000
1
1000000000
500000000

your system accepts 400%... the percentage is "global".
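
(Roughly worked out, assuming the default admission limit of 95% per CPU
(sched_rt_runtime_us / sched_rt_period_us = 950000 / 1000000) and admission
control applied over the whole root domain rather than per CPU:

   per-server bandwidth:  500000000 / 1000000000 = 0.50
   total requested:       4 * 0.50               = 2.00
   admission limit:       4 * 0.95               = 3.80

so 2.00 <= 3.80 and all four servers are admitted.)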

is it failing in your system?

-- Daniel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 7/7] sched/fair: Fair server interface
  2024-01-22 14:14     ` Daniel Bristot de Oliveira
@ 2024-01-23 15:39       ` Joel Fernandes
  2024-01-23 15:44       ` Joel Fernandes
  1 sibling, 0 replies; 76+ messages in thread
From: Joel Fernandes @ 2024-01-23 15:39 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira, Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel, Luca Abeni, Tommaso Cucinotta,
	Thomas Gleixner, Vineeth Pillai, Shuah Khan, Phil Auld, suleiman

Hi Daniel,

On 1/22/2024 9:14 AM, Daniel Bristot de Oliveira wrote:
> On 1/19/24 02:55, Joel Fernandes wrote:
>> On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
>>> Add an interface for fair server setup on debugfs.
>>>
>>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>>
>>>  - fair_server_runtime: set runtime in ns
>>>  - fair_server_period: set period in ns
>>>  - fair_server_defer: on/off for the defer mechanism
>>>
>>> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
>>
>> Hi Daniel, Peter,
>> I am writing on behalf of the ChromeOS scheduler team.
>>
>> We had to revert the last 3 patches in this series because of a syzkaller
>> reported bug, this happens on the sched/more branch in Peter's tree:
>>
>>  WARNING: CPU: 0 PID: 2404 at kernel/sched/fair.c:5220
>>  place_entity+0x240/0x290 kernel/sched/fair.c:5147
>>  Call Trace:
>>  <TASK>
>>   enqueue_entity+0xdf/0x1130 kernel/sched/fair.c:5283
>>   enqueue_task_fair+0x241/0xbd0 kernel/sched/fair.c:6717
>>   enqueue_task+0x199/0x2f0 kernel/sched/core.c:2117
>>   activate_task+0x60/0xc0 kernel/sched/core.c:2147
>>   ttwu_do_activate+0x18d/0x6b0 kernel/sched/core.c:3794
>>   ttwu_queue kernel/sched/core.c:4047 [inline]
>>   try_to_wake_up+0x805/0x12f0 kernel/sched/core.c:4368
>>   kick_pool+0x2e7/0x3b0 kernel/workqueue.c:1142
>>   __queue_work+0xcf8/0xfe0 kernel/workqueue.c:1800
>>   queue_delayed_work_on+0x15a/0x260 kernel/workqueue.c:1986
>>   queue_delayed_work include/linux/workqueue.h:577 [inline]
>>   srcu_funnel_gp_start kernel/rcu/srcutree.c:1068 [inline]
>>
>> which is basically this warning in place_entity:
>> 		if (WARN_ON_ONCE(!load))
>> 			load = 1;
>>
>> Full log (scroll to the bottom as there is console/lockdep side effects which
>> are likely not relevant to this issue): https://paste.debian.net/1304579/
>>
>> Side note, we are also looking into a KASAN nullptr deref but this happens
>> only on our backport of the patches to a 5.15 kernel, as far as we know.
>>
>> KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
>> CPU: 0 PID: 1592 Comm: syz-executor.0 Not tainted [...]
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
>>  RIP: 0010:____rb_erase_color lib/rbtree.c:354 [inline] 
>>  RIP: 0010:rb_erase+0x664/0xe1e lib/rbtree.c:445
>>  [...]
>> Call Trace:
>>  <TASK>
>>   set_next_entity+0x6e/0x576 kernel/sched/fair.c:4728
>>   set_next_task_fair+0x1bb/0x355 kernel/sched/fair.c:11943
>>   set_next_task kernel/sched/sched.h:2241 [inline] 
>>   pick_next_task kernel/sched/core.c:6014 [inline] 
>>   __schedule+0x36fb/0x402d kernel/sched/core.c:6378
>>   preempt_schedule_common+0x74/0xc0 kernel/sched/core.c:6590
>>   preempt_schedule+0xd6/0xdd kernel/sched/core.c:6615
>>
>> Full splat: https://paste.debian.net/1304573/
> 
> Interesting, does it keep any task hung? I am having a case where I see
> a hung task, but I do not get the splat because the system freezes (printk
> with rq_lock I guess)...
> 
> It might be the same problem.

Ah, we have an update. Suleiman found this is happening because of core
scheduling's pick logic. I have some patches to fix it, and we have more fixes
for other issues as well. Will coordinate with the team to send these out soon.
We are currently testing them more.

>> Investigation is on going but could you also please take a look at these? It
>> is hard to reproduce and only syzbot has luck reproducing these.
>>
>> Also I had a comment below:
>>
>>> +int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
>>> +{
>>> +	u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
>>> +	u64 new_bw = to_ratio(period, runtime);
>>> +	struct rq *rq = dl_se->rq;
>>> +	int cpu = cpu_of(rq);
>>> +	struct dl_bw *dl_b;
>>> +	unsigned long cap;
>>> +	int retval = 0;
>>> +	int cpus;
>>> +
>>> +	dl_b = dl_bw_of(cpu);
>>> +	raw_spin_lock(&dl_b->lock);
>>> +	cpus = dl_bw_cpus(cpu);
>>> +	cap = dl_bw_capacity(cpu);
>>> +
>>> +	if (__dl_overflow(dl_b, cap, old_bw, new_bw)) {
>>
>> The dl_overflow() call here seems introducing an issue with our conceptual
>> understanding of how the dl server is supposed to work.
>>
>> Suppose we have a 4 CPU system. Also suppose RT throttling is disabled.
>> Suppose the DL server params are 50ms runtime in 100ms period (basically we
>> want to dedicate 50% of the bandwidth of each CPU to CFS).
>>
>> In such a situation, __dl_overflow() will return an error right? Because
>> total bandwidth will exceed 100% (4 times 50% is 200%).
> 
> I might be missing something in your case, but, it accepts:
> 
> root@fedora:/sys/kernel/debug/sched/fair_server# find . -type f -exec cat {} \;
> 1
> 1000000000
> 500000000
> 1
> 1000000000
> 500000000
> 1
> 1000000000
> 500000000
> 1
> 1000000000
> 500000000
> 
> your system accepts 400%... the percentage is "global".
> 
> is it failing in your system?

You are right, I was actually trying to change it manually in my kernel in
dl_server_start(). In this case dl_server_apply_params() gets init=1 and
old_bw is 0.

I tried using the debugfs, and that works. So I think we will just use the
debugfs. I was being lazy and setting it in my kernel manually for testing like
this:

@@ -1475,7 +1475,7 @@ void dl_server_start(struct sched_dl_entity *dl_se)
         * this before getting generic.
         */
        if (!dl_server(dl_se)) {
               u64 runtime = 12 * NSEC_PER_MSEC;
                u64 period = 15 * NSEC_PER_MSEC;

That doesn't work, but setting it via debugfs does. For production we will
set it from userspace anyway, so it should not be an issue.

I feel so much better now :) Thanks Daniel.

By the way, what's the plan for the remaining patches in the sched/more branch?
Are you planning to resend them later? If so, we can just post our fixes on top
of that, and if you don't mind, you could include them in your next series
posting (sched/more + our fixes + your fixes).

Thanks!

 - Joel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 7/7] sched/fair: Fair server interface
  2024-01-22 14:14     ` Daniel Bristot de Oliveira
  2024-01-23 15:39       ` Joel Fernandes
@ 2024-01-23 15:44       ` Joel Fernandes
  1 sibling, 0 replies; 76+ messages in thread
From: Joel Fernandes @ 2024-01-23 15:44 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira, Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel, Luca Abeni, Tommaso Cucinotta,
	Thomas Gleixner, Vineeth Pillai, Shuah Khan, Phil Auld, suleiman

On 1/22/2024 9:14 AM, Daniel Bristot de Oliveira wrote:
> Interesting, does it keep any task hung? I am having a case where I see
> a hung task, but I do not get the splat because the system freezes (printk
> with rq_lock I guess)...

I missed replying to this part of your email. We see it as a syzkaller report
with splats, so it is not clear if a task was hung at the time of a splat.
syzkaller runs in its own separate VM instance.

The fun part, though, is that I found a way to get an ftrace dump from it
without having to pass any boot parameters (not being able to pass boot
parameters is another issue). So we have that tool in our debug arsenal for
these issues, thankfully.
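
(One runtime knob that can do this, as an assumption about the general
approach and not necessarily what was actually used here:

  # echo 1 > /proc/sys/kernel/ftrace_dump_on_oops

which makes the kernel dump the ftrace ring buffer to the console when an
oops happens, without needing the ftrace_dump_on_oops boot parameter.)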

> It might be the same problem.

True, possibly.

thanks,

 - Joel

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 7/7] sched/fair: Fair server interface
  2023-11-04 10:59 ` [PATCH v5 7/7] sched/fair: Fair server interface Daniel Bristot de Oliveira
                     ` (5 preceding siblings ...)
  2024-01-19  1:55   ` Joel Fernandes
@ 2024-02-13  2:13   ` Joel Fernandes
  2024-02-13  2:21     ` Joel Fernandes
  2024-02-14 14:23     ` Daniel Bristot de Oliveira
  6 siblings, 2 replies; 76+ messages in thread
From: Joel Fernandes @ 2024-02-13  2:13 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld



On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote:
> Add an interface for fair server setup on debugfs.
> 
> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
> 
>  - fair_server_runtime: set runtime in ns
>  - fair_server_period: set period in ns
>  - fair_server_defer: on/off for the defer mechanism

Btw Daniel, there is an interesting side-effect of this interface having runtime
and period in 2 separate files :)

Say I want to set a CPU to 5ms / 10ms.

I cannot set either period or runtime to 5ms or 10ms directly.

I have to first set period to 100ms, then set runtime to 50ms, then set period
to 50ms, then set runtime to 5ms, then finally set period to 10ms.

The reason seems to be that otherwise runtime / period would not be
accommodated and would cause dl_overflow issues.

I'd suggest providing both runtime and period in the same interface to make it
easier to use. However, for the testing I am going with what we have.

Also a request:

I was wondering if a new version of the last 3 patches could be posted to
LKML or shared in a tree somewhere. I am trying to sync to mainline and
rebase our latest fixes on top of that; however, it is difficult to do because
these 3 patches are in a bit of flux (for example, the discussion between you
and Peter about update_curr()). What's the best way to move forward with
rebasing our fix contributions? I am going with the sched/more in Peter's queue.git
unless you/Peter prefer something else. And I added your update_curr()
suggestion onto that, let me know if you disagree with it:

@@ -1173,6 +1171,8 @@ static void update_curr(struct cfs_rq *cfs_rq)

        if (entity_is_task(curr))
                update_curr_task(task_of(curr), delta_exec);
+       else
+               dl_server_update(&rq_of(cfs_rq)->fair_server, delta_exec);

        account_cfs_rq_runtime(cfs_rq, delta_exec);
 }

thanks,

 - Joel

> 
> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
> ---
>  kernel/sched/deadline.c |  89 +++++++++++++++---
>  kernel/sched/debug.c    | 202 ++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/fair.c     |   6 --
>  kernel/sched/sched.h    |   2 +
>  4 files changed, 279 insertions(+), 20 deletions(-)
> 
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 69ee1fbd60e4..1092ca8892e0 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -321,19 +321,12 @@ void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>  		__sub_running_bw(dl_se->dl_bw, dl_rq);
>  }
>  
> -static void dl_change_utilization(struct task_struct *p, u64 new_bw)
> +static void dl_rq_change_utilization(struct rq *rq, struct sched_dl_entity *dl_se, u64 new_bw)
>  {
> -	struct rq *rq;
> -
> -	WARN_ON_ONCE(p->dl.flags & SCHED_FLAG_SUGOV);
> -
> -	if (task_on_rq_queued(p))
> -		return;
> +	if (dl_se->dl_non_contending) {
> +		sub_running_bw(dl_se, &rq->dl);
> +		dl_se->dl_non_contending = 0;
>  
> -	rq = task_rq(p);
> -	if (p->dl.dl_non_contending) {
> -		sub_running_bw(&p->dl, &rq->dl);
> -		p->dl.dl_non_contending = 0;
>  		/*
>  		 * If the timer handler is currently running and the
>  		 * timer cannot be canceled, inactive_task_timer()
> @@ -341,13 +334,25 @@ static void dl_change_utilization(struct task_struct *p, u64 new_bw)
>  		 * will not touch the rq's active utilization,
>  		 * so we are still safe.
>  		 */
> -		if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
> -			put_task_struct(p);
> +		if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1) {
> +			if (!dl_server(dl_se))
> +				put_task_struct(dl_task_of(dl_se));
> +		}
>  	}
> -	__sub_rq_bw(p->dl.dl_bw, &rq->dl);
> +	__sub_rq_bw(dl_se->dl_bw, &rq->dl);
>  	__add_rq_bw(new_bw, &rq->dl);
>  }
>  
> +static void dl_change_utilization(struct task_struct *p, u64 new_bw)
> +{
> +	WARN_ON_ONCE(p->dl.flags & SCHED_FLAG_SUGOV);
> +
> +	if (task_on_rq_queued(p))
> +		return;
> +
> +	dl_rq_change_utilization(task_rq(p), &p->dl, new_bw);
> +}
> +
>  static void __dl_clear_params(struct sched_dl_entity *dl_se);
>  
>  /*
> @@ -1508,10 +1513,22 @@ void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
>  
>  void dl_server_start(struct sched_dl_entity *dl_se)
>  {
> +	/*
> +	 * XXX: the apply do not work fine at the init phase for the
> +	 * fair server because things are not yet set. We need to improve
> +	 * this before getting generic.
> +	 */
>  	if (!dl_server(dl_se)) {
> +		u64 runtime = 50 * NSEC_PER_MSEC;
> +		u64 period = 1000 * NSEC_PER_MSEC;
> +
> +		dl_server_apply_params(dl_se, runtime, period, 1);
> +
> +		dl_se->dl_zerolax = 1;
>  		dl_se->dl_server = 1;
>  		setup_new_dl_entity(dl_se);
>  	}
> +
>  	enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
>  }
>  
> @@ -1532,6 +1549,50 @@ void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
>  	dl_se->server_pick = pick;
>  }
>  
> +int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
> +{
> +	u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
> +	u64 new_bw = to_ratio(period, runtime);
> +	struct rq *rq = dl_se->rq;
> +	int cpu = cpu_of(rq);
> +	struct dl_bw *dl_b;
> +	unsigned long cap;
> +	int retval = 0;
> +	int cpus;
> +
> +	dl_b = dl_bw_of(cpu);
> +	raw_spin_lock(&dl_b->lock);
> +	cpus = dl_bw_cpus(cpu);
> +	cap = dl_bw_capacity(cpu);
> +
> +	if (__dl_overflow(dl_b, cap, old_bw, new_bw)) {
> +		retval = -EBUSY;
> +		goto out;
> +	}
> +
> +	if (init) {
> +		__add_rq_bw(new_bw, &rq->dl);
> +		__dl_add(dl_b, new_bw, cpus);
> +	} else {
> +		__dl_sub(dl_b, dl_se->dl_bw, cpus);
> +		__dl_add(dl_b, new_bw, cpus);
> +
> +		dl_rq_change_utilization(rq, dl_se, new_bw);
> +	}
> +
> +	rq->fair_server.dl_runtime = runtime;
> +	rq->fair_server.dl_deadline  = period;
> +	rq->fair_server.dl_period  = period;
> +
> +	dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
> +	dl_se->dl_density = to_ratio(dl_se->dl_deadline, dl_se->dl_runtime);
> +
> +out:
> +	raw_spin_unlock(&dl_b->lock);
> +
> +	return retval;
> +}
> +
>  /*
>   * Update the current task's runtime statistics (provided it is still
>   * a -deadline task and has not been removed from the dl_rq).
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 4580a450700e..bd7ad6b8d3de 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -333,8 +333,208 @@ static const struct file_operations sched_debug_fops = {
>  	.release	= seq_release,
>  };
>  
> +enum dl_param {
> +	DL_RUNTIME = 0,
> +	DL_PERIOD,
> +	DL_ZEROLAX
> +};
> +
> +static unsigned long fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */
> +static unsigned long fair_server_period_min = (100) * NSEC_PER_USEC;     /* 100 us */
> +
> +static ssize_t sched_fair_server_write(struct file *filp, const char __user *ubuf,
> +				       size_t cnt, loff_t *ppos, enum dl_param param)
> +{
> +	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
> +	u64 runtime, period, zerolax;
> +	struct rq *rq = cpu_rq(cpu);
> +	size_t err;
> +	int retval;
> +	u64 value;
> +
> +	err = kstrtoull_from_user(ubuf, cnt, 10, &value);
> +	if (err)
> +		return err;
> +
> +	scoped_guard (rq_lock_irqsave, rq) {
> +
> +		runtime  = rq->fair_server.dl_runtime;
> +		period = rq->fair_server.dl_period;
> +		zerolax = rq->fair_server.dl_zerolax;
> +
> +		switch (param) {
> +		case DL_RUNTIME:
> +			if (runtime == value)
> +				goto out;
> +			runtime = value;
> +			break;
> +		case DL_PERIOD:
> +			if (value == period)
> +				goto out;
> +			period = value;
> +			break;
> +		case DL_ZEROLAX:
> +			if (zerolax == value)
> +				goto out;
> +			zerolax = value;
> +			break;
> +		}
> +
> +		if (runtime > period
> +				|| period > fair_server_period_max
> +				|| period < fair_server_period_min
> +				|| zerolax > 1) {
> +			cnt = -EINVAL;
> +			goto out;
> +		}
> +
> +		if (rq->cfs.h_nr_running) {
> +			update_rq_clock(rq);
> +			dl_server_stop(&rq->fair_server);
> +		}
> +
> +		/*
> +		 * The zerolax does not change utilization, so just
> +		 * setting it is enough.
> +		 */
> +		if (rq->fair_server.dl_zerolax != zerolax) {
> +			rq->fair_server.dl_zerolax = zerolax;
> +		} else {
> +			retval = dl_server_apply_params(&rq->fair_server, runtime, period, 0);
> +			if (retval)
> +				cnt = retval;
> +		}
> +
> +		if (rq->cfs.h_nr_running)
> +			dl_server_start(&rq->fair_server);
> +	}
> +
> +out:
> +	*ppos += cnt;
> +	return cnt;
> +}
> +
> +static size_t sched_fair_server_show(struct seq_file *m, void *v, enum dl_param param)
> +{
> +	unsigned long cpu = (unsigned long) m->private;
> +	struct rq *rq = cpu_rq(cpu);
> +	u64 value;
> +
> +	switch (param) {
> +	case DL_RUNTIME:
> +		value = rq->fair_server.dl_runtime;
> +		break;
> +	case DL_PERIOD:
> +		value = rq->fair_server.dl_period;
> +		break;
> +	case DL_ZEROLAX:
> +		value = rq->fair_server.dl_zerolax;
> +	}
> +
> +	seq_printf(m, "%llu\n", value);
> +	return 0;
> +
> +}
> +
> +static ssize_t
> +sched_fair_server_runtime_write(struct file *filp, const char __user *ubuf,
> +				size_t cnt, loff_t *ppos)
> +{
> +	return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_RUNTIME);
> +}
> +
> +static int sched_fair_server_runtime_show(struct seq_file *m, void *v)
> +{
> +	return sched_fair_server_show(m, v, DL_RUNTIME);
> +}
> +
> +static int sched_fair_server_runtime_open(struct inode *inode, struct file *filp)
> +{
> +	return single_open(filp, sched_fair_server_runtime_show, inode->i_private);
> +}
> +
> +static const struct file_operations fair_server_runtime_fops = {
> +	.open		= sched_fair_server_runtime_open,
> +	.write		= sched_fair_server_runtime_write,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +	.release	= single_release,
> +};
> +
> +static ssize_t
> +sched_fair_server_period_write(struct file *filp, const char __user *ubuf,
> +			       size_t cnt, loff_t *ppos)
> +{
> +	return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_PERIOD);
> +}
> +
> +static int sched_fair_server_period_show(struct seq_file *m, void *v)
> +{
> +	return sched_fair_server_show(m, v, DL_PERIOD);
> +}
> +
> +static int sched_fair_server_period_open(struct inode *inode, struct file *filp)
> +{
> +	return single_open(filp, sched_fair_server_period_show, inode->i_private);
> +}
> +
> +static const struct file_operations fair_server_period_fops = {
> +	.open		= sched_fair_server_period_open,
> +	.write		= sched_fair_server_period_write,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +	.release	= single_release,
> +};
> +
> +static ssize_t
> +sched_fair_server_defer_write(struct file *filp, const char __user *ubuf,
> +			      size_t cnt, loff_t *ppos)
> +{
> +	return sched_fair_server_write(filp, ubuf, cnt, ppos, DL_ZEROLAX);
> +}
> +
> +static int sched_fair_server_defer_show(struct seq_file *m, void *v)
> +{
> +	return sched_fair_server_show(m, v, DL_ZEROLAX);
> +}
> +
> +static int sched_fair_server_defer_open(struct inode *inode, struct file *filp)
> +{
> +	return single_open(filp, sched_fair_server_defer_show, inode->i_private);
> +}
> +
> +static const struct file_operations fair_server_defer_fops = {
> +	.open		= sched_fair_server_defer_open,
> +	.write		= sched_fair_server_defer_write,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +	.release	= single_release,
> +};
> +
>  static struct dentry *debugfs_sched;
>  
> +static void debugfs_fair_server_init(void)
> +{
> +	long cpu;
> +	struct dentry *rq_dentry;
> +
> +	rq_dentry = debugfs_create_dir("rq", debugfs_sched);
> +	if (!rq_dentry)
> +		return;
> +
> +	for_each_possible_cpu(cpu) {
> +		struct dentry *d_cpu;
> +		char buf[32];
> +
> +		snprintf(buf, sizeof(buf), "cpu%ld", cpu);
> +		d_cpu = debugfs_create_dir(buf, rq_dentry);
> +
> +		debugfs_create_file("fair_server_runtime", 0644, d_cpu, (void *) cpu, &fair_server_runtime_fops);
> +		debugfs_create_file("fair_server_period", 0644, d_cpu, (void *) cpu, &fair_server_period_fops);
> +		debugfs_create_file("fair_server_defer", 0644, d_cpu, (void *) cpu, &fair_server_defer_fops);
> +	}
> +}
> +
>  static __init int sched_init_debug(void)
>  {
>  	struct dentry __maybe_unused *numa;
> @@ -374,6 +574,8 @@ static __init int sched_init_debug(void)
>  
>  	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
>  
> +	debugfs_fair_server_init();
> +
>  	return 0;
>  }
>  late_initcall(sched_init_debug);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 399237cd9f59..5434c52f470d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8419,12 +8419,6 @@ void fair_server_init(struct rq *rq)
>  	struct sched_dl_entity *dl_se = &rq->fair_server;
>  
>  	init_dl_entity(dl_se);
> -
> -	dl_se->dl_runtime = 50 * NSEC_PER_MSEC;
> -	dl_se->dl_deadline = 1000 * NSEC_PER_MSEC;
> -	dl_se->dl_period = 1000 * NSEC_PER_MSEC;
> -	dl_se->dl_zerolax = 1;
> -
>  	dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
>  }
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ec0e288c8e06..312b31df5860 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -341,6 +341,8 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
>  		    dl_server_pick_f pick);
>  
>  extern void fair_server_init(struct rq *);
> +extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
> +		    u64 runtime, u64 period, bool init);
>  
>  #ifdef CONFIG_CGROUP_SCHED
>  

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 7/7] sched/fair: Fair server interface
  2024-02-13  2:13   ` Joel Fernandes
@ 2024-02-13  2:21     ` Joel Fernandes
  2024-02-14 14:23     ` Daniel Bristot de Oliveira
  1 sibling, 0 replies; 76+ messages in thread
From: Joel Fernandes @ 2024-02-13  2:21 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld, Suleiman Souhlal, youssefesmat



On 2/12/2024 9:13 PM, Joel Fernandes wrote:
> 
> 
> On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote:
>> Add an interface for fair server setup on debugfs.
>>
>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>
>>  - fair_server_runtime: set runtime in ns
>>  - fair_server_period: set period in ns
>>  - fair_server_defer: on/off for the defer mechanism
> 
> Btw Daniel, there is an interesting side-effect of this interface having runtime
> and period in 2 separate files :)
> 
> Say I want to set a CPU to 5ms / 10ms.
> 

Sorry, let me try again: say I want to set 500us / 1ms.

Then I have to do the following to get it to work:

# echo 100000000 > /sys/kernel/debug/sched/fair_server/cpu0/period
# echo 5000000 > /sys/kernel/debug/sched/fair_server/cpu0/runtime
# echo 10000000 > /sys/kernel/debug/sched/fair_server/cpu0/period
# echo 500000 > /sys/kernel/debug/sched/fair_server/cpu0/runtime
# echo 1000000 > /sys/kernel/debug/sched/fair_server/cpu0/period

IOW, if I boot and do the following, it fails:

# echo 500000 > /sys/kernel/debug/sched/fair_server/cpu0/runtime
bash: echo: write error: Device or resource busy

 - Joel

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 7/7] sched/fair: Fair server interface
  2024-02-13  2:13   ` Joel Fernandes
  2024-02-13  2:21     ` Joel Fernandes
@ 2024-02-14 14:23     ` Daniel Bristot de Oliveira
  2024-02-15 13:57       ` Joel Fernandes
  1 sibling, 1 reply; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2024-02-14 14:23 UTC (permalink / raw)
  To: Joel Fernandes, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld

On 2/13/24 03:13, Joel Fernandes wrote:
> 
> 
> On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote:
>> Add an interface for fair server setup on debugfs.
>>
>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>
>>  - fair_server_runtime: set runtime in ns
>>  - fair_server_period: set period in ns
>>  - fair_server_defer: on/off for the defer mechanism
> 
> Btw Daniel, there is an interesting side-effect of this interface having runtime
> and period in 2 separate files :)
> 
> Say I want to set a CPU to 5ms / 10ms.
> 
> I cannot set either period or runtime to 5ms or 10ms directly.
> 
> I have to first set period to 100ms, then set runtime to 50ms, then set period
> to 50ms, then set runtime to 5ms, then finally set period to 10ms.

Hummm, yeah, I could reproduce that. It seems that it is not even a problem of having
two files, but a bug in the logic; I will have a look.

> The reason seems to be because otherwise runtime / period will not be
> accommodated and will cause dl_overflow issues.
> 
> I'd suggest providing both runtime and period in the same interface to make it
> easier to use. However, for the testing I am going with what we have.
> 
> Also a request:
> 
> I was wondering if a new version of the last 3 patches could be posted to
> LKML or shared in a tree somewhere. I am trying to sync to mainline and
> rebase our latest fixes on top of that, however it is difficult to do because
> these 3 patches are in a bit of flux (for example, the discussion between you and
> Peter about update_curr()). What's the best way to move forward with rebasing
> our fix contributions?

Juri and I chatted about it, and we think it is a good thing to re-send this patch set,
including a fix I have for it (to avoid a regression wrt rt throttling), explaining
these things in the mailing list so Peter will be able to follow the discussion.

I still need to finish testing and to make a proper cover page with all updates; the
latest thing is here (tm):

https://git.kernel.org/pub/scm/linux/kernel/git/bristot/linux.git/log/?h=dl_server_v6

It is based on Peter's sched/more. I will probably re-send it today or tomorrow,
but at least you can have a look at it.

Another reason to send it is to get the regression test machinery running....

> I am going with the sched/more in Peter's queue.git
> unless you/Peter prefer something else. And I added your update_curr()
> suggestion onto that, let me know if you disagree with it:
> 
> @@ -1173,6 +1171,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
> 
>         if (entity_is_task(curr))
>                 update_curr_task(task_of(curr), delta_exec);
> +       else
> +               dl_server_update(&rq_of(cfs_rq)->fair_server, delta_exec);
> 
>         account_cfs_rq_runtime(cfs_rq, delta_exec);
>  }

That part of the code was optimized by Peter during the last round of discussions.

It is like this now:

------------ %< -----------
-       if (entity_is_task(curr))
-               update_curr_task(task_of(curr), delta_exec);
+       if (entity_is_task(curr)) {
+               struct task_struct *p = task_of(curr);
+               update_curr_task(p, delta_exec);
+               /*
+                * Any fair task that runs outside of fair_server should
+                * account against fair_server such that it can account for
+                * this time and possibly avoid running this period.
+                */
+               if (p->dl_server != &rq->fair_server)
+                       dl_server_update(&rq->fair_server, delta_exec);
+       }
------------ >% -----------

It is not straightforward to understand... but the idea is:

if it is a task, and its dl_server is not the fair server, discount the time
directly from the fair server. This also means that if dl_server is NULL
(the server is not enabled), it will discount time from the fair server.

-- Daniel


> thanks,
> 
>  - Joel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 7/7] sched/fair: Fair server interface
  2024-02-14 14:23     ` Daniel Bristot de Oliveira
@ 2024-02-15 13:57       ` Joel Fernandes
  2024-02-15 17:27         ` Daniel Bristot de Oliveira
  0 siblings, 1 reply; 76+ messages in thread
From: Joel Fernandes @ 2024-02-15 13:57 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld, Suleiman Souhlal, youssefesmat

Hello, Daniel,

On 2/14/2024 9:23 AM, Daniel Bristot de Oliveira wrote:
> On 2/13/24 03:13, Joel Fernandes wrote:
>>
>>
>> On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote:
>>> Add an interface for fair server setup on debugfs.
>>>
>>> Each rq has three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>>
>>>  - fair_server_runtime: set runtime in ns
>>>  - fair_server_period: set period in ns
>>>  - fair_server_defer: on/off for the defer mechanism
>>
>> Btw Daniel, there is an interesting side-effect of this interface having runtime
>> and period in 2 separate files :)
>>
>> Say I want to set a CPU to 5ms / 10ms.
>>
>> I cannot set either period or runtime to 5ms or 10ms directly.
>>
>> I have to first set period to 100ms, then set runtime to 50ms, then set period
>> to 50ms, then set runtime to 5ms, then finally set period to 10ms.
> 
> Hummm yeah I could reproduce that, it seems that it is not even a problem of having
> two files, but a bug in the logic, I will have a look.

Thanks for taking a look. My colleague Suleiman hit the issue too. He's unable to
set 45ms/50ms, for instance.

Also just want to mention, if you could please CC my colleagues Suleiman and
Youssef on the patches who are also working on / reviewing these:

Suleiman Souhlal <suleiman@google.com>
Youssef Esmat <youssefesmat@google.com>

>> The reason seems to be because otherwise runtime / period will not be
>> accomodated and will cause dl_overflow issues.
>>
>> I'd suggest providing both runtime and period in the same interface to make it
>> more easier to use. However, for the testing I am going with what we have.
>>
>> Also a request:
>>
>> I was wondering if a new version of the last 3 patches could be posted to
>> LKML or shared in a tree somewhere. I am trying to sync to mainline and
>> rebase our latest fixes on top of that, however it is difficult to do because
>> these 3 patches are in bit of a flux (example the discussion between you and
>> Peter about update_curr()). What's the best way to move forward with rebasing
>> our fix contributions?
> 
> Juri and I chat about, and we think it is a good thing to re-send this patch set,
> including a fix I have to it (to avoid regression wrt rt throttling), explaining
> these things in the mailing list so peter will be able to follow the discussion.
> 
> I still need to finish testing, and to make a proper cover page with all updates, the
> latest thing is here (tm):
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/bristot/linux.git/log/?h=dl_server_v6
> 
> It is based on peter's sched/more. I will probably re-send it today or tomorrow,
> but at least you can have a look at it.
>> Another reason to send it is to get the regression test machinery running....

Sure, looking forward to it. I rebased on the above tree and it applied cleanly.
What I'll do is send our patches today (not those in sched/more) after a
bit more testing and tweaks.

There are 2 reasons for this:
1. Can get the build robot do its thing.
2. Our internal system checks whether backported patches were posted upstream to
the list.

Hope that sounds good to you and we can start reviewing as well.

>  I am going with the sched/more in Peter's queue.git
>> unless you/Peter prefer something else. And I added your update_curr()
>> suggestion onto that, let me know if you disagree with it:
>>
>> @@ -1173,6 +1171,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
>>
>>         if (entity_is_task(curr))
>>                 update_curr_task(task_of(curr), delta_exec);
>> +       else
>> +               dl_server_update(&rq_of(cfs_rq)->fair_server, delta_exec);
>>
>>         account_cfs_rq_runtime(cfs_rq, delta_exec);
>>  }
> 
> That part of the code was optimized by peter during the last round of discussions.
> 
> It is like this now:
> 
> ------------ %< -----------
> -       if (entity_is_task(curr))
> -               update_curr_task(task_of(curr), delta_exec);
> +       if (entity_is_task(curr)) {
> +               struct task_struct *p = task_of(curr);
> +               update_curr_task(p, delta_exec);
> +               /*
> +                * Any fair task that runs outside of fair_server should
> +                * account against fair_server such that it can account for
> +                * this time and possibly avoid running this period.
> +                */
> +               if (p->dl_server != &rq->fair_server)
> +                       dl_server_update(&rq->fair_server, delta_exec);
> +       }
> ------------ >% -----------
> 
> It is not straightforward to understand... but the ideia is:
> 
> if it is a task, and the server is ! of the fair server, discount time
> directly from the fair server. This also means that if dl_server is NULL
> (the server is not enabled) it will discount time from the fair server.

Yes, that makes sense. We certainly want to debit from the server even when DL
is not picking the task indirectly. I guess Peter's optimization also handles
the case where multiple servers are in play. That will help us when/if we make
RT a server as well, right?

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 7/7] sched/fair: Fair server interface
  2024-02-15 13:57       ` Joel Fernandes
@ 2024-02-15 17:27         ` Daniel Bristot de Oliveira
  2024-02-15 17:41           ` Joel Fernandes
  0 siblings, 1 reply; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2024-02-15 17:27 UTC (permalink / raw)
  To: Joel Fernandes, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld, Suleiman Souhlal, youssefesmat

On 2/15/24 14:57, Joel Fernandes wrote:
> Hello, Daniel,
> 
> On 2/14/2024 9:23 AM, Daniel Bristot de Oliveira wrote:
>> On 2/13/24 03:13, Joel Fernandes wrote:
>>>
>>>
>>> On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote:
>>>> Add an interface for fair server setup on debugfs.
>>>>
>>>> Each rq has three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>>>
>>>>  - fair_server_runtime: set runtime in ns
>>>>  - fair_server_period: set period in ns
>>>>  - fair_server_defer: on/off for the defer mechanism
>>>
>>> Btw Daniel, there is an interesting side-effect of this interface having runtime
>>> and period in 2 separate files :)
>>>
>>> Say I want to set a CPU to 5ms / 10ms.
>>>
>>> I cannot set either period or runtime to 5ms or 10ms directly.
>>>
>>> I have to first set period to 100ms, then set runtime to 50ms, then set period
>>> to 50ms, then set runtime to 5ms, then finally set period to 10ms.
>>
>> Hummm yeah I could reproduce that, it seems that it is not even a problem of having
>> two files, but a bug in the logic, I will have a look.
> 
> Thanks for taking a look. My colleague Suleiman hit the issue too. He's able to
> not set 45ms/50ms for instance.

I isolated the problem. It is not an interface problem.

Long story short, the servers are initialized at the defrootdomain, but
the dl_bw info is not being carried over to the new domain because the
servers are not tasks.

I am discussing this with Valentin (topology) & Juri. We will try to find a
solution, or at least a presentable XXX: solution... in the next few days.

You can work around it by disabling the admission control via:

# sysctl kernel.sched_rt_runtime_us=-1

then the values will be accepted. To the best of my knowledge, you guys are
only using SCHED_RR/FIFO, so the admission control for DL is not an issue.
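
For reference, a minimal sketch of that workaround (the debugfs paths are the
ones used earlier in this thread and may differ between versions of the series;
the 500us/1ms values are just the example from above):

# sysctl kernel.sched_rt_runtime_us=-1
# echo 500000  > /sys/kernel/debug/sched/fair_server/cpu0/runtime
# echo 1000000 > /sys/kernel/debug/sched/fair_server/cpu0/period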

>> I still need to finish testing, and to make a proper cover page with all updates, the
>> latest thing is here (tm):
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/bristot/linux.git/log/?h=dl_server_v6
>>
>> It is based on peter's sched/more. I will probably re-send it today or tomorrow,
>> but at least you can have a look at it.
>>> Another reason to send it is to get the regression test machinery running....
> 
> Sure, looking forward to it. I rebased on above tree and it applied cleanly.
> What I'll do is I will send our patches today (not those in sched/more) after a
> bit more testing and tweaks.
> 
> There are 2 reasons for this:
> 1. Can get the build robot do its thing.
> 2. Our internal system checks whether patches backported were posted upstream to
> list.
> 
> Hope that sounds good to you and we can start reviewing as well.

If it helps downstream for you guys, it is not a problem for me. Still, Peter is
the person who has more comments to give, so...

-- Daniel

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 7/7] sched/fair: Fair server interface
  2024-02-15 17:27         ` Daniel Bristot de Oliveira
@ 2024-02-15 17:41           ` Joel Fernandes
  2024-04-04 17:43             ` Daniel Bristot de Oliveira
  0 siblings, 1 reply; 76+ messages in thread
From: Joel Fernandes @ 2024-02-15 17:41 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld, Suleiman Souhlal, youssefesmat



On 2/15/2024 12:27 PM, Daniel Bristot de Oliveira wrote:
> On 2/15/24 14:57, Joel Fernandes wrote:
>> Hello, Daniel,
>>
>> On 2/14/2024 9:23 AM, Daniel Bristot de Oliveira wrote:
>>> On 2/13/24 03:13, Joel Fernandes wrote:
>>>>
>>>>
>>>> On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote:
>>>>> Add an interface for fair server setup on debugfs.
>>>>>
>>>>> Each rq has three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>>>>
>>>>>  - fair_server_runtime: set runtime in ns
>>>>>  - fair_server_period: set period in ns
>>>>>  - fair_server_defer: on/off for the defer mechanism
>>>>
>>>> Btw Daniel, there is an interesting side-effect of this interface having runtime
>>>> and period in 2 separate files :)
>>>>
>>>> Say I want to set a CPU to 5ms / 10ms.
>>>>
>>>> I cannot set either period or runtime to 5ms or 10ms directly.
>>>>
>>>> I have to first set period to 100ms, then set runtime to 50ms, then set period
>>>> to 50ms, then set runtime to 5ms, then finally set period to 10ms.
>>>
>>> Hummm yeah I could reproduce that, it seems that it is not even a problem of having
>>> two files, but a bug in the logic, I will have a look.
>>
>> Thanks for taking a look. My colleague Suleiman hit the issue too. He's able to
>> not set 45ms/50ms for instance.
> 
> I isolated the problem. It is not an interface problem.
> 
> Long story short, the servers are initialized at the defrootdomain, but
> the dl_bw info is not being carried over to the new domain because the
> servers are not a task.

Nice work on nailing the issue.

> I am discussing this with Valentin (topology) & Juri. We will try to find a
> solution, or at least an presentable XXX: solution... in the next days.
> 
> You can work around it by disabling the admission control via:
> 
> # sysctl kernel.sched_rt_runtime_us=-1
> 
> the the values will be accepted. For the best of my knowledge, you guys are
> only using SCHED_RR/FIFO so the admission control for DL is not an issue.

That's right, we only use deadline for the server. However, on some devices,
schedutil is used and AFAIR its kthread uses SCHED_DEADLINE. I don't anticipate
problems related to admission control and that kthread, so I think your proposed
workaround sounds good to me.

thanks,

 - Joel



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 0/7] SCHED_DEADLINE server infrastructure
  2023-11-04 10:59 [PATCH v5 0/7] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
                   ` (7 preceding siblings ...)
  2023-12-08 21:47 ` [PATCH v5 0/7] SCHED_DEADLINE server infrastructure Joel Fernandes
@ 2024-02-19  7:33 ` Huang, Ying
  2024-02-19 10:23   ` Daniel Bristot de Oliveira
  8 siblings, 1 reply; 76+ messages in thread
From: Huang, Ying @ 2024-02-19  7:33 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld, Aaron Lu, Kairui Song,
	Guo Ziliang

Hi, Daniel,

Thanks a lot for your great patchset!

We have a similar starvation issue in the mm subsystem too.  Details are in
the patch description of the commit below.  In short, task A is busy
looping on some event, while task B will signal the event after some
work.  If the priority of task A is higher than that of task B, task B
may be starved.
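
As an illustration only, here is a minimal userspace sketch of that pattern
(the thread names, the priority of 50, and the CPU-0 pinning are made up for
the example; it needs root for SCHED_FIFO):

------------ %< -----------
/* Task A: SCHED_FIFO, busy loops on an event flag.  Task B: SCHED_OTHER,
 * sets the flag after "some work".  Pinned to one CPU, B is starved and A
 * spins forever unless something (throttling, a DL server) lets B run. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_int event;

static void pin_cpu0(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);
	sched_setaffinity(0, sizeof(set), &set);
}

static void *task_a(void *arg)		/* high priority, busy waits */
{
	struct sched_param sp = { .sched_priority = 50 };

	pin_cpu0();
	pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
	while (!atomic_load(&event))
		;			/* busy looping on the event */
	return NULL;
}

static void *task_b(void *arg)		/* fair task, signals the event */
{
	pin_cpu0();
	usleep(100 * 1000);		/* "some work" */
	atomic_store(&event, 1);	/* starved while task A spins */
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&b, NULL, task_b, NULL);
	pthread_create(&a, NULL, task_a, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	puts("task B finally ran");
	return 0;
}
------------ >% -----------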

IIUC, if task A is an RT task while task B is a fair task, then your patchset
will solve the issue.  If both task A and task B are RT tasks, is there
some way to solve the issue?

Now, we use an ugly schedule_timeout_uninterruptible(1) in the loop to
resolve the issue; is there something better?

Best Regards,
Huang, Ying

--------------------------8<---------------------------------------
commit 029c4628b2eb2ca969e9bf979b05dc18d8d5575e
Author: Guo Ziliang <guo.ziliang@zte.com.cn>
Date:   Wed Mar 16 16:15:03 2022 -0700

    mm: swap: get rid of livelock in swapin readahead
    
    In our testing, a livelock task was found.  Through sysrq printing, same
    stack was found every time, as follows:
    
      __swap_duplicate+0x58/0x1a0
      swapcache_prepare+0x24/0x30
      __read_swap_cache_async+0xac/0x220
      read_swap_cache_async+0x58/0xa0
      swapin_readahead+0x24c/0x628
      do_swap_page+0x374/0x8a0
      __handle_mm_fault+0x598/0xd60
      handle_mm_fault+0x114/0x200
      do_page_fault+0x148/0x4d0
      do_translation_fault+0xb0/0xd4
      do_mem_abort+0x50/0xb0
    
    The reason for the livelock is that swapcache_prepare() always returns
    EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so that it
    cannot jump out of the loop.  We suspect that the task that clears the
    SWAP_HAS_CACHE flag never gets a chance to run.  We try to lower the
    priority of the task stuck in a livelock so that the task that clears
    the SWAP_HAS_CACHE flag will run.  The results show that the system
    returns to normal after the priority is lowered.
    
    In our testing, multiple real-time tasks are bound to the same core, and
    the task in the livelock is the highest priority task of the core, so
    the livelocked task cannot be preempted.
    
    Although cond_resched() is used by __read_swap_cache_async, it is an
    empty function in the preemptive system and cannot achieve the purpose
    of releasing the CPU.  A high-priority task cannot release the CPU
    unless preempted by a higher-priority task.  But when this task is
    already the highest priority task on this core, other tasks will not be
    able to be scheduled.  So we think we should replace cond_resched() with
    schedule_timeout_uninterruptible(1), schedule_timeout_interruptible will
    call set_current_state first to set the task state, so the task will be
    removed from the running queue, so as to achieve the purpose of giving
    up the CPU and prevent it from running in kernel mode for too long.
    
    (akpm: ugly hack becomes uglier.  But it fixes the issue in a
    backportable-to-stable fashion while we hopefully work on something
    better)
    
    Link: https://lkml.kernel.org/r/20220221111749.1928222-1-cgel.zte@gmail.com
    Signed-off-by: Guo Ziliang <guo.ziliang@zte.com.cn>
    Reported-by: Zeal Robot <zealci@zte.com.cn>
    Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
    Reviewed-by: Jiang Xuexin <jiang.xuexin@zte.com.cn>
    Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
    Acked-by: Hugh Dickins <hughd@google.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Roger Quadros <rogerq@kernel.org>
    Cc: Ziliang Guo <guo.ziliang@zte.com.cn>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 8d4104242100..ee67164531c0 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -478,7 +478,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
 		 * in swap_map, but not yet added its page to swap cache.
 		 */
-		cond_resched();
+		schedule_timeout_uninterruptible(1);
 	}
 
 	/*

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 0/7] SCHED_DEADLINE server infrastructure
  2024-02-19  7:33 ` Huang, Ying
@ 2024-02-19 10:23   ` Daniel Bristot de Oliveira
  2024-02-20  3:28     ` Huang, Ying
  0 siblings, 1 reply; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2024-02-19 10:23 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld, Aaron Lu, Kairui Song,
	Guo Ziliang

Hi

On 2/19/24 08:33, Huang, Ying wrote:
> Hi, Daniel,
> 
> Thanks a lot for your great patchset!
> 
> We have a similar starvation issue in mm subsystem too.  Details are in
> the patch description of the below commit.  In short, task A is busy
> looping on some event, while task B will signal the event after some
> work.  If the priority of task A is higher than that of task B, task B
> may be starved.

ok...

> 
> IIUC, if task A is RT task while task B is fair task, then your patchset
> will solve the issue.

This patch set will not solve the issue. It will mitigate the effect of the
problem. Still the system will perform very poorly...

> If both task A and task B is RT tasks, is there
> some way to solve the issue?

I would say reworking the swap algorithm, as it is not meant to be used when
real-time tasks are in place.

As an exercise, let's say that we add a server per priority on FIFO, with a default
50ms/1s runtime/period. A deferred server only kicks in at (period - runtime) =
1s - 50ms = 950ms, so your "real-time" workload would suffer a 950ms latency,
busy looping in vain.

Then one would say, let's lower the parameters, so the granularity of
the server would provide lower latencies. The same problem would still
exist, as it exists with sched fair....

So, the solution is not in the scheduler. Busy-loop waiting is not right when you
have RT tasks. That is why PREEMPT_RT reworks the locking pattern to remove
spin_locks that do busy waiting. spin_locks do not have the problem you
show because they disable preemption... but disabling preemption is not
the best solution either.

So, a first duct-tape attempt would be using (local_?) locks like in
PREEMPT_RT to make things sleepable...

AFAICS, this was already discussed in the previous link, right?

> 
> Now, we use a ugly schedule_timeout_uninterruptible(1) in the loop to
> resolve the issue, is there something better?

I am not a swap/mm expert.. my guesses would be all on sleepable locking.
But I know there are many smart people on the mm side with better guesses...

It is just that the DL server or any type of starvation avoidance does not
seem to be a solution for your problem.

-- Daniel


> Best Regards,
> Huang, Ying
> 
> --------------------------8<---------------------------------------
> commit 029c4628b2eb2ca969e9bf979b05dc18d8d5575e
> Author: Guo Ziliang <guo.ziliang@zte.com.cn>
> Date:   Wed Mar 16 16:15:03 2022 -0700
> 
>     mm: swap: get rid of livelock in swapin readahead
>     
>     In our testing, a livelock task was found.  Through sysrq printing, same
>     stack was found every time, as follows:
>     
>       __swap_duplicate+0x58/0x1a0
>       swapcache_prepare+0x24/0x30
>       __read_swap_cache_async+0xac/0x220
>       read_swap_cache_async+0x58/0xa0
>       swapin_readahead+0x24c/0x628
>       do_swap_page+0x374/0x8a0
>       __handle_mm_fault+0x598/0xd60
>       handle_mm_fault+0x114/0x200
>       do_page_fault+0x148/0x4d0
>       do_translation_fault+0xb0/0xd4
>       do_mem_abort+0x50/0xb0
>     
>     The reason for the livelock is that swapcache_prepare() always returns
>     EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so that it
>     cannot jump out of the loop.  We suspect that the task that clears the
>     SWAP_HAS_CACHE flag never gets a chance to run.  We try to lower the
>     priority of the task stuck in a livelock so that the task that clears
>     the SWAP_HAS_CACHE flag will run.  The results show that the system
>     returns to normal after the priority is lowered.
>     
>     In our testing, multiple real-time tasks are bound to the same core, and
>     the task in the livelock is the highest priority task of the core, so
>     the livelocked task cannot be preempted.
>     
>     Although cond_resched() is used by __read_swap_cache_async, it is an
>     empty function in the preemptive system and cannot achieve the purpose
>     of releasing the CPU.  A high-priority task cannot release the CPU
>     unless preempted by a higher-priority task.  But when this task is
>     already the highest priority task on this core, other tasks will not be
>     able to be scheduled.  So we think we should replace cond_resched() with
>     schedule_timeout_uninterruptible(1), schedule_timeout_interruptible will
>     call set_current_state first to set the task state, so the task will be
>     removed from the running queue, so as to achieve the purpose of giving
>     up the CPU and prevent it from running in kernel mode for too long.
>     
>     (akpm: ugly hack becomes uglier.  But it fixes the issue in a
>     backportable-to-stable fashion while we hopefully work on something
>     better)
>     
>     Link: https://lkml.kernel.org/r/20220221111749.1928222-1-cgel.zte@gmail.com
>     Signed-off-by: Guo Ziliang <guo.ziliang@zte.com.cn>
>     Reported-by: Zeal Robot <zealci@zte.com.cn>
>     Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
>     Reviewed-by: Jiang Xuexin <jiang.xuexin@zte.com.cn>
>     Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
>     Acked-by: Hugh Dickins <hughd@google.com>
>     Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
>     Cc: Michal Hocko <mhocko@kernel.org>
>     Cc: Minchan Kim <minchan@kernel.org>
>     Cc: Johannes Weiner <hannes@cmpxchg.org>
>     Cc: Roger Quadros <rogerq@kernel.org>
>     Cc: Ziliang Guo <guo.ziliang@zte.com.cn>
>     Cc: <stable@vger.kernel.org>
>     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> 
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 8d4104242100..ee67164531c0 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -478,7 +478,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  		 * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
>  		 * in swap_map, but not yet added its page to swap cache.
>  		 */
> -		cond_resched();
> +		schedule_timeout_uninterruptible(1);
>  	}
>  
>  	/*


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 0/7] SCHED_DEADLINE server infrastructure
  2024-02-19 10:23   ` Daniel Bristot de Oliveira
@ 2024-02-20  3:28     ` Huang, Ying
  2024-02-20  8:31       ` Daniel Bristot de Oliveira
  0 siblings, 1 reply; 76+ messages in thread
From: Huang, Ying @ 2024-02-20  3:28 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld, Aaron Lu, Kairui Song,
	Guo Ziliang

Daniel Bristot de Oliveira <bristot@kernel.org> writes:

> Hi
>
> On 2/19/24 08:33, Huang, Ying wrote:
>> Hi, Daniel,
>> 
>> Thanks a lot for your great patchset!
>> 
>> We have a similar starvation issue in mm subsystem too.  Details are in
>> the patch description of the below commit.  In short, task A is busy
>> looping on some event, while task B will signal the event after some
>> work.  If the priority of task A is higher than that of task B, task B
>> may be starved.
>
> ok...
>
>> 
>> IIUC, if task A is RT task while task B is fair task, then your patchset
>> will solve the issue.
>
> This patch set will not solve the issue. It will mitigate the effect of the
> problem. Still the system will perform very poorly...

I don't think that it's common (or even reasonable) for real-time tasks
to use swap.  So, IMHO, performance isn't very important here.  But, we
need to avoid live-lock anyway.  I think that your patchset solves the
live-lock issue.

>> If both task A and task B is RT tasks, is there
>> some way to solve the issue?
>
> I would say reworking the swap algorithm, as it is not meant to be used when
> real-time tasks are in place.
>
> As an exercise, let's say that we add a server per priority on FIFO, with a default
> 50ms/1s runtime period. Your "real-time" workload would suffer a 950ms latency,
> busy loop in vain.

If the target is only the live-lock avoidance, is it possible to run
lower priority runnable tasks for a short while if we run long enough in
the busy loop?

> Then one would say, let's lower the parameters, so the granularity of
> the server would provide lower latencies. The same problem would still
> exist, as it exists with sched fair....
>
> So, the solution is not on schedule. Busy loop waiting is not right when you
> have RT tasks. That is why PREEMPT_RT reworks the locking pattern to remove
> spin_locks that do busy waiting. spin_locks do not have this problem you
> show because they disable preemption... but disabling preemption is not
> the best solution either.
>
> So, a first try of duct tape would using (local_?) locks like in
> preempt rt to make things sleepable...
>
> AFAICS, this was already discussed in the previous link, right?
>
>> 
>> Now, we use a ugly schedule_timeout_uninterruptible(1) in the loop to
>> resolve the issue, is there something better?
>
> I am not a swap/mm expert.. my guesses would be all on sleepable locking.
> But I know there are many smart people on the mm side with better guesses...
>
> It is just that the DL server or any type of starvation avoidance does not
> seem to be a solution for your problem.

Yes.  To improve the performance, we need something else.

--
Best Regards,
Huang, Ying

> -- Daniel
>
>
>> Best Regards,
>> Huang, Ying
>> 
>> --------------------------8<---------------------------------------
>> commit 029c4628b2eb2ca969e9bf979b05dc18d8d5575e
>> Author: Guo Ziliang <guo.ziliang@zte.com.cn>
>> Date:   Wed Mar 16 16:15:03 2022 -0700
>> 
>>     mm: swap: get rid of livelock in swapin readahead
>>     
>>     In our testing, a livelock task was found.  Through sysrq printing, same
>>     stack was found every time, as follows:
>>     
>>       __swap_duplicate+0x58/0x1a0
>>       swapcache_prepare+0x24/0x30
>>       __read_swap_cache_async+0xac/0x220
>>       read_swap_cache_async+0x58/0xa0
>>       swapin_readahead+0x24c/0x628
>>       do_swap_page+0x374/0x8a0
>>       __handle_mm_fault+0x598/0xd60
>>       handle_mm_fault+0x114/0x200
>>       do_page_fault+0x148/0x4d0
>>       do_translation_fault+0xb0/0xd4
>>       do_mem_abort+0x50/0xb0
>>     
>>     The reason for the livelock is that swapcache_prepare() always returns
>>     EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so that it
>>     cannot jump out of the loop.  We suspect that the task that clears the
>>     SWAP_HAS_CACHE flag never gets a chance to run.  We try to lower the
>>     priority of the task stuck in a livelock so that the task that clears
>>     the SWAP_HAS_CACHE flag will run.  The results show that the system
>>     returns to normal after the priority is lowered.
>>     
>>     In our testing, multiple real-time tasks are bound to the same core, and
>>     the task in the livelock is the highest priority task of the core, so
>>     the livelocked task cannot be preempted.
>>     
>>     Although cond_resched() is used by __read_swap_cache_async, it is an
>>     empty function in the preemptive system and cannot achieve the purpose
>>     of releasing the CPU.  A high-priority task cannot release the CPU
>>     unless preempted by a higher-priority task.  But when this task is
>>     already the highest priority task on this core, other tasks will not be
>>     able to be scheduled.  So we think we should replace cond_resched() with
>>     schedule_timeout_uninterruptible(1), schedule_timeout_interruptible will
>>     call set_current_state first to set the task state, so the task will be
>>     removed from the running queue, so as to achieve the purpose of giving
>>     up the CPU and prevent it from running in kernel mode for too long.
>>     
>>     (akpm: ugly hack becomes uglier.  But it fixes the issue in a
>>     backportable-to-stable fashion while we hopefully work on something
>>     better)
>>     
>>     Link: https://lkml.kernel.org/r/20220221111749.1928222-1-cgel.zte@gmail.com
>>     Signed-off-by: Guo Ziliang <guo.ziliang@zte.com.cn>
>>     Reported-by: Zeal Robot <zealci@zte.com.cn>
>>     Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
>>     Reviewed-by: Jiang Xuexin <jiang.xuexin@zte.com.cn>
>>     Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
>>     Acked-by: Hugh Dickins <hughd@google.com>
>>     Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
>>     Cc: Michal Hocko <mhocko@kernel.org>
>>     Cc: Minchan Kim <minchan@kernel.org>
>>     Cc: Johannes Weiner <hannes@cmpxchg.org>
>>     Cc: Roger Quadros <rogerq@kernel.org>
>>     Cc: Ziliang Guo <guo.ziliang@zte.com.cn>
>>     Cc: <stable@vger.kernel.org>
>>     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>>     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>> 
>> diff --git a/mm/swap_state.c b/mm/swap_state.c
>> index 8d4104242100..ee67164531c0 100644
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -478,7 +478,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>>  		 * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
>>  		 * in swap_map, but not yet added its page to swap cache.
>>  		 */
>> -		cond_resched();
>> +		schedule_timeout_uninterruptible(1);
>>  	}
>>  
>>  	/*

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 0/7] SCHED_DEADLINE server infrastructure
  2024-02-20  3:28     ` Huang, Ying
@ 2024-02-20  8:31       ` Daniel Bristot de Oliveira
  2024-02-20  8:41         ` Huang, Ying
  0 siblings, 1 reply; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2024-02-20  8:31 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld, Aaron Lu, Kairui Song,
	Guo Ziliang

On 2/20/24 04:28, Huang, Ying wrote:
> Daniel Bristot de Oliveira <bristot@kernel.org> writes:
> 
>> Hi
>>
>> On 2/19/24 08:33, Huang, Ying wrote:
>>> Hi, Daniel,
>>>
>>> Thanks a lot for your great patchset!
>>>
>>> We have a similar starvation issue in mm subsystem too.  Details are in
>>> the patch description of the below commit.  In short, task A is busy
>>> looping on some event, while task B will signal the event after some
>>> work.  If the priority of task A is higher than that of task B, task B
>>> may be starved.
>>
>> ok...
>>
>>>
>>> IIUC, if task A is RT task while task B is fair task, then your patchset
>>> will solve the issue.
>>
>> This patch set will not solve the issue. It will mitigate the effect of the
>> problem. Still the system will perform very poorly...
> 
> I don't think that it's common (or even reasonable) for real-time tasks
> to use swap.  So, IMHO, performance isn't very important here.  But, we
> need to avoid live-lock anyway.  I think that your patchset solves the
> live-lock issue.

I mean, if for you this is solving your user problem, be happy :-) Play with parameters...
find a way to tune your system as a user... use it :)

But your problem is also "solved" with RT throttling without RT_RUNTIME_SHARE (the
default since... two years ago, I think). So there is not much news here.

IMHO, it is not a solution. As a developer, there is a synchronization problem
in swap code, and pushing a workaround to the scheduling side is not the way to go...

> 
>>> If both task A and task B is RT tasks, is there
>>> some way to solve the issue?
>>
>> I would say reworking the swap algorithm, as it is not meant to be used when
>> real-time tasks are in place.
>>
>> As an exercise, let's say that we add a server per priority on FIFO, with a default
>> 50ms/1s runtime period. Your "real-time" workload would suffer a 950ms latency,
>> busy loop in vain.
> 
> If the target is only the live-lock avoidance, is it possible to run
> lower priority runnable tasks for a short while if we run long enough in
> the busy loop?

If you do it on the algorithm side (instead of relying on scheduling), it could be a
thing.

I think NAPI still uses something like this: busy-loop for two jiffies in the softirq
context (a priority higher than all threads on the !rt kernel), then move to the
thread context to avoid starvation. In the swap case, it could run for two jiffies
and then go to sleep for a while. How well swap people would receive this as a solution...
I do not know :) I would first try something better than this using synchronization
primitives.
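
Just to make the shape of that idea concrete, a hypothetical kernel-style
sketch (illustrative only, not a proposal for mm/; the helper name and the
two-jiffies budget are made up):

------------ %< -----------
#include <linux/jiffies.h>
#include <linux/processor.h>
#include <linux/sched.h>

/* Poll a condition for at most ~2 jiffies, then really give the CPU up so
 * that a lower-priority task can run and make the condition true. */
static void poll_then_sleep(bool (*done)(void *data), void *data)
{
	unsigned long budget = jiffies + 2;

	while (!done(data)) {
		if (time_before(jiffies, budget)) {
			cpu_relax();			/* bounded busy wait */
			continue;
		}
		/* budget exhausted: sleep so the starved task can progress */
		schedule_timeout_uninterruptible(1);
		budget = jiffies + 2;
	}
}
------------ >% -----------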

This patch set is for things outside of kernel control. For example, people running
poll mode DPDK in user-space with FIFO priority; FIFO tasks in user-space for too long...
with a better design than rt throttling.

Will this patch set help with misbehaving kernel activities? Yes. Is that a reason not to
fix kernel problems? I do not think so, and I bet many other people do not think so
either.

-- Daniel

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 0/7] SCHED_DEADLINE server infrastructure
  2024-02-20  8:31       ` Daniel Bristot de Oliveira
@ 2024-02-20  8:41         ` Huang, Ying
  0 siblings, 0 replies; 76+ messages in thread
From: Huang, Ying @ 2024-02-20  8:41 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld, Aaron Lu, Kairui Song,
	Guo Ziliang

Daniel Bristot de Oliveira <bristot@kernel.org> writes:

> On 2/20/24 04:28, Huang, Ying wrote:
>> Daniel Bristot de Oliveira <bristot@kernel.org> writes:
>> 
>>> Hi
>>>
>>> On 2/19/24 08:33, Huang, Ying wrote:
>>>> Hi, Daniel,
>>>>
>>>> Thanks a lot for your great patchset!
>>>>
>>>> We have a similar starvation issue in mm subsystem too.  Details are in
>>>> the patch description of the below commit.  In short, task A is busy
>>>> looping on some event, while task B will signal the event after some
>>>> work.  If the priority of task A is higher than that of task B, task B
>>>> may be starved.
>>>
>>> ok...
>>>
>>>>
>>>> IIUC, if task A is RT task while task B is fair task, then your patchset
>>>> will solve the issue.
>>>
>>> This patch set will not solve the issue. It will mitigate the effect of the
>>> problem. Still the system will perform very poorly...
>> 
>> I don't think that it's common (or even reasonable) for real-time tasks
>> to use swap.  So, IMHO, performance isn't very important here.  But, we
>> need to avoid live-lock anyway.  I think that your patchset solves the
>> live-lock issue.
>
> I mean, if for you this is solving your user problem, be happy :-) Play with parameters...
> find a way to tune your system as a user... use it :)
>
> But your problem is also "solved" with RT throttling without RT_RUNTIME_SHARE (the
> default since... two years ago, I think). So there is not much news here.
>
> IMHO, it is not a solution. As a developer, there is a synchronization problem
> in swap code, and pushing a workaround to the scheduling side is not the way to go...
>
>> 
>>>> If both task A and task B is RT tasks, is there
>>>> some way to solve the issue?
>>>
>>> I would say reworking the swap algorithm, as it is not meant to be used when
>>> real-time tasks are in place.
>>>
>>> As an exercise, let's say that we add a server per priority on FIFO, with a default
>>> 50ms/1s runtime period. Your "real-time" workload would suffer a 950ms latency,
>>> busy loop in vain.
>> 
>> If the target is only the live-lock avoidance, is it possible to run
>> lower priority runnable tasks for a short while if we run long enough in
>> the busy loop?
>
> If you do it in the algorithm side (instead of relying on scheduling), it could be a
> thing.
>
> I think NAPI still uses something like this: Busy-loop for two jiffies in the softirq
> context (a priority higher than all threads on the !rt kernel), then move to thread
> the thread context to avoid starvation. In the swap case, it could run for two jiffies
> and then go to sleep for a while. How well will swap people receive this as a solution...
> I do not know :) I would first try something better than this using synchronization
> primitives.
>
> This patch set is for things outside of kernel control. For example, people running
> poll mode DPDK in user-space with FIFO priority; FIFO tasks in user-space for too long...
> with a better design than rt throttling.
>
> Will this patch help in misbehaving kernel activities: yes. Is it a reason not to
> fix kernel problems? I do not think so, and I bet many other people do not believe as
> well.

I totally agree with you that we need to fix the kernel problems.  And
thanks for your information!

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2023-11-04 10:59 ` [PATCH v5 6/7] sched/deadline: Deferrable dl server Daniel Bristot de Oliveira
                     ` (3 preceding siblings ...)
  2023-11-13 15:05   ` kernel test robot
@ 2024-03-20  0:03   ` Joel Fernandes
  2024-03-20 19:24     ` Daniel Bristot de Oliveira
  4 siblings, 1 reply; 76+ messages in thread
From: Joel Fernandes @ 2024-03-20  0:03 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld



On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote:
> Among the motivations for the DL servers is the real-time throttling
> mechanism. This mechanism works by throttling the rt_rq after
> running for a long period without leaving space for fair tasks.
> 
> The base dl server avoids this problem by boosting fair tasks instead
> of throttling the rt_rq. The point is that it boosts without waiting
> for potential starvation, causing some non-intuitive cases.
> 
> For example, an IRQ dispatches two tasks on an idle system, a fair
> and an RT. The DL server will be activated, running the fair task
> before the RT one. This problem can be avoided by deferring the
> dl server activation.
> 
> By setting the zerolax option, the dl_server will dispatch an
> SCHED_DEADLINE reservation with replenished runtime, but throttled.
> 
> The dl_timer will be set for (period - runtime) ns from start time.
> Thus boosting the fair rq on its 0-laxity time with respect to
> rt_rq.
> 
> If the fair scheduler has the opportunity to run while waiting
> for zerolax time, the dl server runtime will be consumed. If
> the runtime is completely consumed before the zerolax time, the
> server will be replenished while still in a throttled state. Then,
> the dl_timer will be reset to the new zerolax time
> 
> If the fair server reaches the zerolax time without consuming
> its runtime, the server will be boosted, following CBS rules
> (thus without breaking SCHED_DEADLINE).
> 
> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>

Hi, Daniel,
We have one additional patch (other than the 15 I just sent). Since I have just
3 more working days for the next 3 weeks, I thought I might as well reply inline
here since it might be unnecessary to resend all 15 patches so soon just for the
one new addition below. I am replying to this patch here, because the new patch
is related (to 0-laxity).  But once I am back from holiday, I can resend it with
the set I have unless you've applied it.

So, Vineeth and I came up with the patch below to "max cap" the DL server 0-lax
time (the max cap defaults to off, keeping the regular behavior). This is needed to
guarantee bandwidth for periodic CFS runners/sleepers.

The example usecase is:

Consider DL server params 25ms / 50ms.

Consider CFS task with duty cycle of 25ms / 76ms (run 25ms sleep 51ms).

         run 25ms                    run 25ms
         _______                     _______
        |       | sleep 51          |       |  sleep 51
-|------|-------|---------|---------|-------|----------|--------|------> t
 0     25      50       101        126      151       202      227
                          \ 0-lax /                    \ 0-lax /

Here the 0-lax addition in the original v5's zero-lax patch causes lesser bandwidth.

So the task runs 50ms every 227ms, instead of 50ms every 152ms.

A simple unit test confirms the issue, and it is fixed by Vineeth's patch below:
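
For context, a hypothetical userspace sketch of what such a test could look
like (not the actual unit test): pin a SCHED_FIFO hog and a SCHED_OTHER thread
with the 25ms-run/51ms-sleep pattern to CPU 0 and count completed duty cycles;
the 10s window, priority 50 and cpu0 are arbitrary, and it needs root:

------------ %< -----------
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static long long now_ns(clockid_t clk)
{
	struct timespec ts;

	clock_gettime(clk, &ts);
	return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static void pin_cpu0(int policy, int prio)
{
	cpu_set_t set;
	struct sched_param sp = { .sched_priority = prio };

	CPU_ZERO(&set);
	CPU_SET(0, &set);
	sched_setaffinity(0, sizeof(set), &set);
	sched_setscheduler(0, policy, &sp);
}

static void *fifo_hog(void *arg)
{
	pin_cpu0(SCHED_FIFO, 50);
	for (;;)
		;	/* monopolize CPU 0: the fair task only runs when boosted */
	return NULL;
}

int main(void)
{
	pthread_t hog;
	long long end = now_ns(CLOCK_MONOTONIC) + 10 * 1000000000LL;
	int cycles = 0;

	pin_cpu0(SCHED_OTHER, 0);
	pthread_create(&hog, NULL, fifo_hog, NULL);

	while (now_ns(CLOCK_MONOTONIC) < end) {
		/* burn 25 ms of CPU time, then sleep 51 ms */
		long long burn = now_ns(CLOCK_THREAD_CPUTIME_ID) + 25 * 1000000LL;

		while (now_ns(CLOCK_THREAD_CPUTIME_ID) < burn)
			;
		usleep(51 * 1000);
		cycles++;
	}
	/* compare the cycle count against what the server bandwidth allows */
	printf("completed %d duty cycles in 10s\n", cycles);
	return 0;
}
------------ >% -----------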

Please take a look at the patch below (it applies only to v5.15, but Vineeth is
rebasing it on mainline as we speak), thanks.

-----8<--------
From: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Subject: [PATCH] sched/deadline/dlserver: sysctl for dlserver maxdefer time

In order to avoid the dlserver preempting RT tasks when it wakes up, the
dlserver is throttled (deferred) until the zero-lax time. This is the
farthest time before the deadline at which the dlserver can still meet its
deadline.

Deferring until the zero-lax time causes cfs tasks with a sleep/run pattern
to not get the bandwidth promised by the dlserver. So introduce a sysctl
for limiting the defer time of the dlserver.

Suggested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
---
 include/linux/sched/sysctl.h | 2 ++
 kernel/sched/deadline.c      | 6 ++++++
 kernel/sysctl.c              | 7 +++++++
 3 files changed, 15 insertions(+)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 4939e6128840..a27fba6fe0ab 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -41,6 +41,8 @@ extern unsigned int sysctl_iowait_apply_ticks;
 extern unsigned int sysctl_sched_dl_period_max;
 extern unsigned int sysctl_sched_dl_period_min;
 
+extern unsigned int sysctl_sched_dlserver_maxdefer_ms;
+
 #ifdef CONFIG_UCLAMP_TASK
 extern unsigned int sysctl_sched_uclamp_util_min;
 extern unsigned int sysctl_sched_uclamp_util_max;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d638cc5b45c7..69c9fd80a67d 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1071,6 +1071,11 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
 	if (dl_se->dl_defer_armed) {
 		WARN_ON_ONCE(!dl_se->dl_throttled);
 		act = ns_to_ktime(dl_se->deadline - dl_se->runtime);
+		if (sysctl_sched_dlserver_maxdefer_ms) {
+			ktime_t dlserver_maxdefer = rq_clock(rq) + ms_to_ktime(sysctl_sched_dlserver_maxdefer_ms);
+			if (ktime_after(act, dlserver_maxdefer))
+				act = dlserver_maxdefer;
+		}
 	} else {
 		act = ns_to_ktime(dl_next_period(dl_se));
 	}
@@ -3099,6 +3104,7 @@ void __getparam_dl(struct task_struct *p, struct sched_attr *attr)
  */
 unsigned int sysctl_sched_dl_period_max = 1 << 22; /* ~4 seconds */
 unsigned int sysctl_sched_dl_period_min = 100;     /* 100 us */
+unsigned int sysctl_sched_dlserver_maxdefer_ms = 2;
 
 /*
  * This function validates the new parameters of a -deadline task.
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 39f47a871fb4..027193302e7e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1842,6 +1842,13 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "sched_dlserver_maxdefer_ms",
+		.data		= &sysctl_sched_dlserver_maxdefer_ms,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 	{
 		.procname	= "sched_rr_timeslice_ms",
 		.data		= &sysctl_sched_rr_timeslice,
-- 
2.40.1
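
If the sketch above were applied, the knob would presumably be reachable as a
regular kernel.* sysctl, e.g.:

# sysctl kernel.sched_dlserver_maxdefer_ms=2

with 0 keeping today's full (period - runtime) defer.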



^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2024-03-20  0:03   ` Joel Fernandes
@ 2024-03-20 19:24     ` Daniel Bristot de Oliveira
  2024-03-21 16:15       ` Joel Fernandes
  0 siblings, 1 reply; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2024-03-20 19:24 UTC (permalink / raw)
  To: Joel Fernandes, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld

On 3/20/24 01:03, Joel Fernandes wrote:
> 
> 
> On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote:
>> Among the motivations for the DL servers is the real-time throttling
>> mechanism. This mechanism works by throttling the rt_rq after
>> running for a long period without leaving space for fair tasks.
>>
>> The base dl server avoids this problem by boosting fair tasks instead
>> of throttling the rt_rq. The point is that it boosts without waiting
>> for potential starvation, causing some non-intuitive cases.
>>
>> For example, an IRQ dispatches two tasks on an idle system, a fair
>> and an RT. The DL server will be activated, running the fair task
>> before the RT one. This problem can be avoided by deferring the
>> dl server activation.
>>
>> By setting the zerolax option, the dl_server will dispatch an
>> SCHED_DEADLINE reservation with replenished runtime, but throttled.
>>
>> The dl_timer will be set for (period - runtime) ns from start time.
>> Thus boosting the fair rq on its 0-laxity time with respect to
>> rt_rq.
>>
>> If the fair scheduler has the opportunity to run while waiting
>> for zerolax time, the dl server runtime will be consumed. If
>> the runtime is completely consumed before the zerolax time, the
>> server will be replenished while still in a throttled state. Then,
>> the dl_timer will be reset to the new zerolax time
>>
>> If the fair server reaches the zerolax time without consuming
>> its runtime, the server will be boosted, following CBS rules
>> (thus without breaking SCHED_DEADLINE).

Notice: at this point in history, the term zero-laxity was removed from
the latest code we have; the term was moved to the defer server... I will
remove it from the log the next time I send it.

>> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
> 
> Hi, Daniel,
> We have one additional patch (other than the 15 I just sent).

Those 15 are the next thing I will review... I was busy with the merge window.

> Since I have just
> 3 more working days for the next 3 weeks, I thought I might as well reply inline
> here since it might be unnecessary to resend all 15 patches so soon just for the
> one new addition below. I am replying to this patch here, because the new patch
> is related (to 0-laxity).  But once I am back from holiday, I can resend it with
> the set I have unless you've applied it.

Before starting... in three weeks we will still be very far from any attempt to
ping Peter & Ingo to ask if they could think about putting us into a queue.
Have fun :-)

As I explained at LPC... and in chat... there is no "0-laxity", and repeating it
only creates more confusion. Let's use a deferred server... a regular deadline
reservation with a deferred starting time at (now + period - runtime), and at that point
it will receive a new deadline one period away (not one runtime away).

There will always be a person reading these emails and echoing the wrong things...
using the 0-lax/0-laxity term here is a lose-lose.

> So, Vineeth and me came up with a patch below to "max cap" the DL server 0-lax
> time (max cap is default off keeping the regular behavior). This is needed to
> guarantee bandwidth for periodic CFS runners/sleepers.

Another point... "guarantee bandwidth"... the bandwidth is provided under certain conditions.
If the conditions are not respected, the guarantee a dl reservation will provide is that
the task will not put a utilization higher than the one requested, so yes, a dl reservation
can and will enforce a lower utilization if the task does not respect the conditions.
Also, if the reservation is ready, but no task is ready...

> 
> The example usecase is:
> 
> Consider DL server params 25ms / 50ms.
> 
> Consider CFS task with duty cycle of 25ms / 76ms (run 25ms sleep 51ms).

define duty... like, runtime 25, period 76? sleeps for 51 relative to a starting time
or not?

there are some holes in your explanation, it is tricky to reply inline for these cases...
I am continuing but....

> 
>          run 25ms                    run 25ms
>          _______                     _______
>         |       | sleep 51          |       |  sleep 51
> -|------|-------|---------|---------|-------|----------|--------|------> t
>  0     25      50       101        126      151       202      227
>                           \ 0-lax /                    \ 0-lax /


trying to understand...

At time 0 the task is activated... RT tasks are spinning... assuming so, the
server was deferred to 25?

At 25, it becomes DL, with a new deadline of 75. If there is no other DL
task, run [25..50].

at 75 the task would have another 25 ms... but it decided not to run, throwing
away 25 ms until 100. At this point, I would say: is not an odd setup....

At point 100, the deferred server assumes that the starvation condition is gone,
and goes to the initial state.

now, that 0-lax means what? the zero-laxity time for the task... but not
from its start time, but from the beginning of the deferred server = 25 + 76 = 101
... also 0-lax is a range?

and here, the system seems to start the same cycle with a 1 ms shift, repeating the
case where the task and the reservation did not match.

looking back, for the first cycle... with defer at 75, without at point 50, there
would be time for the server to run, but the task is just ready, so runtime is
thrown away.

So, this mismatch between the configuration and the task setup is... clearly
causing runtime to be wasted... but one can change the parameters for them
to be better in sync...

and here, I assume that there is something missing in the explanation because...


> Here the 0-lax addition in the original v5's zero-lax patch causes lesser bandwidth.

which addition?

> So the task runs 50ms every 227ms, instead of 50ms every 152ms.

...

> 
> A simple unit test confirms the issue, and it is fixed by Vineeth's patch below:
> 
> Please take a look at the patch below (applies only to v5.15 but Vineeth is
> rebase on mainline as we speak), thanks.
> 
> -----8<--------
> From: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
> Subject: [PATCH] sched/deadline/dlserver: sysctl for dlserver maxdefer time
> 
> Inorder to avoid dlserver preempting RT tasks when it wakes up, dlserver
> is throttled(deferred) until zero lax time. This is the farthest time
> before deadline where dlserver can meet its deadline.
> 
> Zero lax time causes cfs tasks with sleep/run pattern where the cfs
> tasks doesn't get the bandwidth promised by dlserver. So introduce a
> sysctl for limiting the defer time of dlserver.

So... that explanation above is supposed to lead to the conclusion that limiting the
amount of time to defer the server is a fix? There is a huge gap here.

> 
> Suggested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
> ---
>  include/linux/sched/sysctl.h | 2 ++
>  kernel/sched/deadline.c      | 6 ++++++
>  kernel/sysctl.c              | 7 +++++++
>  3 files changed, 15 insertions(+)
> 
> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> index 4939e6128840..a27fba6fe0ab 100644
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -41,6 +41,8 @@ extern unsigned int sysctl_iowait_apply_ticks;
>  extern unsigned int sysctl_sched_dl_period_max;
>  extern unsigned int sysctl_sched_dl_period_min;
>  
> +extern unsigned int sysctl_sched_dlserver_maxdefer_ms;
> +
>  #ifdef CONFIG_UCLAMP_TASK
>  extern unsigned int sysctl_sched_uclamp_util_min;
>  extern unsigned int sysctl_sched_uclamp_util_max;
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index d638cc5b45c7..69c9fd80a67d 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1071,6 +1071,11 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
>  	if (dl_se->dl_defer_armed) {
>  		WARN_ON_ONCE(!dl_se->dl_throttled);
>  		act = ns_to_ktime(dl_se->deadline - dl_se->runtime);
> +		if (sysctl_sched_dlserver_maxdefer_ms) {
> +			ktime_t dlserver_maxdefer = rq_clock(rq) + ms_to_ktime(sysctl_sched_dlserver_maxdefer_ms);
> +			if (ktime_after(act, dlserver_maxdefer))
> +				act = dlserver_maxdefer;


that is, a global limit... so we end up with a per-cpu set of variables
that is bounded by a global limit.

<joking>
It is already hard to get things in sync with two parameters. Now we need a
third one that is global :-)
</joking>

reading the code was actually more instructive than reading the comments.
The good point is that this puts a nail in the coffin of "zerolax" :-)

that is, you all also want to control how long the defer lasts. Now
it is fixed at (period - runtime). You would like to have the ability
to set it to something closer, so as to defer less.

That phrase is simple :-)

It is already possible! How? By adjusting runtime/period, which are already
per CPU. The DL server does not need to hand out the entire chunk all at once;
one can split it into smaller runtime/period slices, as EEVDF does. And here we
get back to the things we talked about when trying to use EDF... but now it is
only for one task :-) It is easier.
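
For example (numbers purely for illustration): instead of 25 ms / 50 ms one could
use 2.5 ms / 5 ms. The utilization is the same, 25/50 = 2.5/5 = 50%, but the
worst-case defer per period drops from (50 - 25) = 25 ms to (5 - 2.5) = 2.5 ms,
so a ready fair task waits far less for its slice.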

This also reduces the amount of time thrown away because no task is ready,
which is the main cause of wasted time anyway. And CFS workloads are really
aperiodic in practice; these more corner-case timelines will hardly happen...
unless you try to force them with a simple test case.

now... could I change the defer option to take a per-cpu value between
0 and (period - runtime), so we could defer less than (period - runtime),
with proper checks and granularity (per cpu), in the next version of the
interface I send? maybe... it is less likely to get a no-no from
people who care a lot about the interface.

But I am asking myself: is it worth the complexity? I would first try getting
used to the runtime/period setup only...

I will start reviewing the other patches as soon as the worries of the
merge window pass. Hopefully tomorrow.

-- Daniel

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2024-03-20 19:24     ` Daniel Bristot de Oliveira
@ 2024-03-21 16:15       ` Joel Fernandes
  2024-03-23 14:37         ` Joel Fernandes
  2024-04-05 14:35         ` Daniel Bristot de Oliveira
  0 siblings, 2 replies; 76+ messages in thread
From: Joel Fernandes @ 2024-03-21 16:15 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld, David Vernet

Hello Daniel,
Thank you for the reply; I have replied inline below.

On 3/20/2024 3:24 PM, Daniel Bristot de Oliveira wrote:
> On 3/20/24 01:03, Joel Fernandes wrote:
>>
>>
>> On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote:
>>> Among the motivations for the DL servers is the real-time throttling
>>> mechanism. This mechanism works by throttling the rt_rq after
>>> running for a long period without leaving space for fair tasks.
>>>
>>> The base dl server avoids this problem by boosting fair tasks instead
>>> of throttling the rt_rq. The point is that it boosts without waiting
>>> for potential starvation, causing some non-intuitive cases.
>>>
>>> For example, an IRQ dispatches two tasks on an idle system, a fair
>>> and an RT. The DL server will be activated, running the fair task
>>> before the RT one. This problem can be avoided by deferring the
>>> dl server activation.
>>>
>>> By setting the zerolax option, the dl_server will dispatch an
>>> SCHED_DEADLINE reservation with replenished runtime, but throttled.
>>>
>>> The dl_timer will be set for (period - runtime) ns from start time.
>>> Thus boosting the fair rq on its 0-laxity time with respect to 

Note your patch changelog. --- (1)

>>> rt_rq.
>>>
>>> If the fair scheduler has the opportunity to run while waiting
>>> for zerolax time, the dl server runtime will be consumed. If
>>> the runtime is completely consumed before the zerolax time, the

Note your patch changelog. --- (2)

>>> server will be replenished while still in a throttled state. Then,
>>> the dl_timer will be reset to the new zerolax time
>>>
>>> If the fair server reaches the zerolax time without consuming
>>> its runtime, the server will be boosted, following CBS rules
>>> (thus without breaking SCHED_DEADLINE).
> 
> notice: at this point in history, the term zero-laxity was removed from
> the latest code we have, the term was moved to defer server... I will
> remove from the long in the next time I send it.

I am confused because your changelog in your v5 (and the v6 on kernel.org)
mentions "zero-laxity" all over the place (see (1) and (2) above for examples).
Further, the terminology is the least of the problems unfortunately (more
below). Call it defer or whatever, but it has to work. :)

>>> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
>>
>> Hi, Daniel,
>> We have one additional patch (other than the 15 I just sent).
> 
> Those 15 are the next thing I will review... I was working in the merge window.

Me too ;-) It has been a fun merge window.

> Since I have just
>> 3 more working days for the next 3 weeks, I thought I might as well reply inline
>> here since it might be unnecessary to resend all 15 patches so soon just for the
>> one new addition below. I am replying to this patch here, because the new patch
>> is related (to 0-laxity).  But once I am back from holiday, I can resend it with
>> the set I have unless you've applied it.
> 
> before starting... in three weeks, we will be very far still from any attempt to
> ping Peter & ingo to ask if they could think about putting us into a queue.

I wasn't really expecting an imminent "ping for merge", to be honest. My main
motivation for posting it before my 3 weeks off was to trigger some review
and discussion here (which we desperately need more of, IMO ;)). I am not in a
hurry to request a merge without it working correctly either.

> Have fun :-)

Thanks!

> as I explained at LPC.. and in chat... there is no "0-laxity", and repeating it
> only creates more confusion.

We can stick to defer, but could you please update your changelog as well? I
don't think you have posted an update to the patch since November either, so maybe
you should do that too, with the patches adjusted for the bugs we found. I do
agree that using zero-laxity creates confusion and prefer "defer".

> Let's use a deferred server... a regular deadline
> reservation with deferred starting time at (now + period - runtime), and at that point
> il will receive a new deadline one period away - (not one runtime away).
> 
> There will always be a person reading these emails and echoing the wrong things...
> using 0-lax/0-laxity term here is a lose-lose.

Agreed, so why not update your patch changelog to correct that (or post a new
revision)?

>> So, Vineeth and me came up with a patch below to "max cap" the DL server 0-lax
>> time (max cap is default off keeping the regular behavior). This is needed to
>> guarantee bandwidth for periodic CFS runners/sleepers.
> 
> Another point... "guarantee bandwidth"... the bandwidth is provided under certain conditions.
> If the conditions are not respected, the guarantee that a dl reservation provides is that
> the task will not get a utilization higher than the one requested. So yes, a dl reservation
> can and will enforce a lower utilization if the task does not respect the conditions.
> Also, if the reservation is ready, but no task is ready...

Please clarify what conditions you mean? The conditions I am looking for are
those given by RT throttling (see below). i.e., if RT takes up all of the CPU in
a certain amount of time, then there is a certain amount reserved for CFS.

>> The example usecase is:
>>
>> Consider DL server params 25ms / 50ms.
>>
>> Consider CFS task with duty cycle of 25ms / 76ms (run 25ms sleep 51ms).
> 
> define duty... like, runtime 25, period 76? sleeps for 51 relative to a starting time
> or not?

There is no starting time. The CFS task in the quoted example only does run +
sleep. Run for 25ms and sleep for 51ms, then run again for 25ms, etc.
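
Roughly this kind of loop (just a sketch of the run/sleep pattern, not the actual
reproducer from the link):

	#include <stdint.h>
	#include <time.h>

	static uint64_t now_ns(void)
	{
		struct timespec ts;

		clock_gettime(CLOCK_MONOTONIC, &ts);
		return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
	}

	int main(void)
	{
		for (;;) {
			uint64_t start = now_ns();
			struct timespec rel = { .tv_sec = 0, .tv_nsec = 51 * 1000000L };

			/* burn the CPU for ~25 ms of wall clock (the real test is more careful) */
			while (now_ns() - start < 25ULL * 1000000ULL)
				;

			/* then sleep 51 ms *relative* to whenever the burn finished */
			nanosleep(&rel, NULL);
		}
	}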

> 
> there are some holes in your explanation, it is tricky to reply inline for these cases...
> I am continuing but....
> 
>>
>>          run 25ms                    run 25ms
>>          _______                     _______
>>         |       | sleep 51          |       |  sleep 51
>> -|------|-------|---------|---------|-------|----------|--------|------> t
>>  0     25      50       101        126      151       202      227
>>                           \ 0-lax /                    \ 0-lax /

So before going into the discussion below, I just want to mention that if the
DL-server cannot provide the same bandwidth that RT throttling provided,
then it's broken by definition. And the breakage comes specifically because of
this patch and nothing else. There are many breakages with this patch, and I will
go over some of them with unit tests below. Basically, in my view -- if a test
case shows it works with RT throttling, but not with the DL-server, then it's
broken. Period. And most of those functional "breakages" come about because of
this patch (6/7), not the initial series.

Here are some cases to shed some light:

Case 1. Consider a CFS task with runtime 15ms and period 50ms. With the
parameters set to 25ms runtime and 50ms period.
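
(That is a 15/50 = 30% duty cycle; the 0.900000 s expectation in the output below
is just 30% of the test's measurement window.)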

The test fails with DL server (because of 6/7), and passes with RT throttling.
See results below. For this test's code, see: https://shorturl.at/rwW07

Specifically, it breaks because of this patch (6/7). If you revert the patch,
the issue goes away.

With the patch 6/7:
# ./dlserver_test
# Runtime of PID 85 is 0.430000 seconds
Bail out! Runtime of PID 85 is not within 10% of expected runtime 0.900000

Without the patch 6/7:
# ./dlserver_test
# Runtime of PID 87 is 0.900000 seconds
ok 1 PASS

So basically, because of defer (or whatever you want to call it ;)), it gives
less than 50% of the bandwidth that it gave without the defer.

And here it is with vanilla RT throttling:
# ./dlserver_test
[   44.968900] sched: RT throttling activated
# Runtime of PID 87 is 0.880000 seconds

Vineeth's patch I shared fixes the issue.

Case 2:
For this case, please run the core scheduling test I provided with
CONFIG_SCHED_CORE=y. It is basically like Case 1 but with some slight changes.

You will see that, because of patch 6/7, that test also breaks and gets less
bandwidth. And it happens because of the negative runtime handling you added to
update_curr_dl_se(). Deleting that fixes this, but it is an indication of yet
another problem with this patch.

The patch fixing this issue (by deleting that block) is also included in the set
of 15 I posted earlier.
-------

Further, my impression is that this patch (6/7) does not even solve all the issues it
intended to. For example, consider that a CFS task is in the boosted phase, and now
an RT task wakes up. That RT task *will wait* for possibly the whole runtime
granted to CFS, so it might not always help. Contrasting that with RT
throttling, if an RT task is very well behaved (well behaved defined as not
running to the limit that RT throttling should kick in), and it wakes up, it
will run right away without any wait time, regardless of what CFS was or was not
doing.

thanks,

 - Joel

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2024-03-21 16:15       ` Joel Fernandes
@ 2024-03-23 14:37         ` Joel Fernandes
  2024-04-05 14:35         ` Daniel Bristot de Oliveira
  1 sibling, 0 replies; 76+ messages in thread
From: Joel Fernandes @ 2024-03-23 14:37 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld, David Vernet

> On 3/20/2024 3:24 PM, Daniel Bristot de Oliveira wrote:
>>> On 11/4/2023 6:59 AM, Daniel Bristot de Oliveira wrote:
>>>> Among the motivations for the DL servers is the real-time throttling
>>>> mechanism. This mechanism works by throttling the rt_rq after
>>>> running for a long period without leaving space for fair tasks.
>>>>
>>>> The base dl server avoids this problem by boosting fair tasks instead
>>>> of throttling the rt_rq. The point is that it boosts without waiting
>>>> for potential starvation, causing some non-intuitive cases.
>>>>
>>>> For example, an IRQ dispatches two tasks on an idle system, a fair
>>>> and an RT. The DL server will be activated, running the fair task
>>>> before the RT one. This problem can be avoided by deferring the
>>>> dl server activation.
>>>>
>>>> By setting the zerolax option, the dl_server will dispatch an
>>>> SCHED_DEADLINE reservation with replenished runtime, but throttled.
>>>>
>>>> The dl_timer will be set for (period - runtime) ns from start time.
>>>> Thus boosting the fair rq on its 0-laxity time with respect to 
> 

Hi,
Upon reflection, might we simplify the solution by treating RT as a deadline
reservation as well?

The RT deadline reservation can have a shorter deadline, so under EDF it will be
less likely to be preempted right away by CFS. Would that work, or was that already
tried and has other dragons?
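
(For instance, if the RT reservation's relative deadline were 10ms and the fair
server's 50ms, then on a common wakeup EDF would pick the RT reservation first,
with no defer/timer logic needed -- purely as an illustration of the idea.)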

If we could pull that off, then we would not need all the deferral/timer stuff and
could consider dropping this patch. Yes, it is more code, but this 6th patch
is also big and non-trivial.

Juri, Daniel, all, what do you think?

(On the other hand, if we want to keep this patch as a first step and improve it
incrementally, that is OK, but I believe we do need to make a decision.)

By the way, is there still a slot available at OSPM to discuss these? If so,
that would be great. I can put up some slides.

cheers,

 - Joel

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 7/7] sched/fair: Fair server interface
  2024-02-15 17:41           ` Joel Fernandes
@ 2024-04-04 17:43             ` Daniel Bristot de Oliveira
  0 siblings, 0 replies; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2024-04-04 17:43 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld, Suleiman Souhlal, youssefesmat,
	Joel Fernandes

On 2/15/24 18:41, Joel Fernandes wrote:
>> Thanks for taking a look. My colleague Suleiman hit the issue too. He's able to
>> not set 45ms/50ms for instance.
> I isolated the problem. It is not an interface problem.
> 
> Long story short, the servers are initialized at the defrootdomain, but
> the dl_bw info is not being carried over to the new domain because the
> servers are not a task.

Fixed in v6. When an rq is attached to a new rd, at rq_attach_root(), the bw of
the fair server is added to the new rd.

Without this, the new rd has no bandwidth registered for the server (it is not a
dl task), and so admission control becomes broken.
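
Roughly along these lines (just a sketch of the idea; the helper name and the exact
locking/placement are illustrative, not the actual v6 code):

	/* deadline.c: carry the fair server's bandwidth over to the rq's new rd */
	void fair_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq)
	{
		struct dl_bw *dl_b = dl_bw_of(cpu_of(rq));

		raw_spin_lock(&dl_b->lock);
		__dl_add(dl_b, dl_se->dl_bw, dl_bw_cpus(cpu_of(rq)));
		raw_spin_unlock(&dl_b->lock);
	}

	/* topology.c: rq_attach_root(), after rq->rd points to the new rd */
	if (rq->fair_server.dl_server)
		fair_server_attach_root(&rq->fair_server, rq);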

-- Daniel

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2024-03-21 16:15       ` Joel Fernandes
  2024-03-23 14:37         ` Joel Fernandes
@ 2024-04-05 14:35         ` Daniel Bristot de Oliveira
  2024-04-08 17:11           ` Steven Rostedt
  1 sibling, 1 reply; 76+ messages in thread
From: Daniel Bristot de Oliveira @ 2024-04-05 14:35 UTC (permalink / raw)
  To: Joel Fernandes, Daniel Bristot de Oliveira, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel, Luca Abeni, Tommaso Cucinotta,
	Thomas Gleixner, Vineeth Pillai, Shuah Khan, Phil Auld,
	David Vernet



>> There will always be a person reading these emails and echoing the wrong things...
>> using 0-lax/0-laxity term here is a lose-lose.
> 
> Agreed, so why not update your patch changelog to correct that (or post a new
> revision)?

That v6 in the repo was a partial update that I pushed there to sync our work. The v6 that I will
send has already removed that.

>>> So, Vineeth and me came up with a patch below to "max cap" the DL server 0-lax
>>> time (max cap is default off keeping the regular behavior). This is needed to
>>> guarantee bandwidth for periodic CFS runners/sleepers.
>>
>> Another point... "guarantee bandwidth"... the bandwidth is provided under certain conditions.
>> If the conditions are not respected, the guarantee that a dl reservation provides is that
>> the task will not get a utilization higher than the one requested. So yes, a dl reservation
>> can and will enforce a lower utilization if the task does not respect the conditions.
>> Also, if the reservation is ready, but no task is ready...
> 
> Please clarify what conditions you mean? The conditions I am looking for are
> those given by RT throttling (see below). i.e., if RT takes up all of the CPU in
> a certain amount of time, then there is a certain amount reserved for CFS.
> 
>>> The example usecase is:
>>>
>>> Consider DL server params 25ms / 50ms.
>>>
>>> Consider CFS task with duty cycle of 25ms / 76ms (run 25ms sleep 51ms).
>>
>> define duty... like, runtime 25, period 76? sleeps for 51 relative to a starting time
>> or not?
> 
> There is no starting time. The CFS task in the quoted example only does run +
> sleep. Run for 25ms and sleep for 51ms, then run again for 25ms, etc.
> 
>>
>> there are some holes in your explanation, it is tricky to reply inline for these cases...
>> I am continuing but....
>>
>>>
>>>          run 25ms                    run 25ms
>>>          _______                     _______
>>>         |       | sleep 51          |       |  sleep 51
>>> -|------|-------|---------|---------|-------|----------|--------|------> t
>>>  0     25      50       101        126      151       202      227
>>>                           \ 0-lax /                    \ 0-lax /
> 
> So before going into the discussion below, I just want to mention that if the
> DL-server cannot provide the same bandwidth that RT throttling provided,
> then it's broken by definition. And the breakage comes specifically because of
> this patch and nothing else. There are many breakages with this patch, and I will
> go over some of them with unit tests below. Basically, in my view -- if a test
> case shows it works with RT throttling, but not with the DL-server, then it's
> broken. Period. And most of those functional "breakages" come about because of
> this patch (6/7), not the initial series.
> 
> Here are some cases to shed some light:
> 
> Case 1. Consider a CFS task with runtime 15ms and period 50ms. With the
> parameters set to 25ms runtime and 50ms period.
> 
> The test fails with DL server (because of 6/7), and passes with RT throttling.
> See results below. For this test's code, see: https://shorturl.at/rwW07

A reproducer always helps. So, your task there is not a periodic task... it is
a sporadic task because it sleeps for a fixed amount of time after the runtime.

A periodic task with period 76 would wake at 0, 76, 152 - like cyclictest...
so consuming at a fixed time rate if the scheduler allows it.
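
(i.e., the timing loop of a periodic task would look something like the fragment
below, sleeping until an absolute next-activation time instead of a relative 51 ms
-- just to illustrate the difference; it needs <time.h> and a surrounding function)

	struct timespec next;

	clock_gettime(CLOCK_MONOTONIC, &next);
	for (;;) {
		/* ~25 ms of work here ... */

		/* wake at t0 + 76 ms, t0 + 152 ms, ... no matter how long the work took */
		next.tv_nsec += 76 * 1000000L;
		if (next.tv_nsec >= 1000000000L) {
			next.tv_nsec -= 1000000000L;
			next.tv_sec++;
		}
		clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
	}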

In the case of a fixed sleep time at the end of the execution, it will end up
"throwing away bandwidth" if the runtime is not given at the beginning of the
period because it will run slower... accumulating error. But that was not the
main point here...

The problem here was more like: if a fair task goes to sleep in the middle of
the server activation (for a lock?), and then wakes up again, the code in v5 is
forcing it to defer... again. Thus, it is getting less bandwidth... notice that
it does not even need to be at the start of the period. It is the middle of the
execution.

Intuitively, reducing the deferred time would help there. But the best thing to do is:

If the fair task waited for the defer, and the real-time tasks are still using all
CPU time, do not defer the activation again, and keep the defer mechanism disabled
until the real-time tasks allow the fair scheduler to run in the background. So,
making the defer mode equivalent to the non-defer mode until the RT tasks start
to behave again.

For that, in the v6, there is a variable (dl_defer_running), once the dl_server
is enqueued after the defer time, the variable dl_defer_running is set.

If the fair task sleeps in the middle of the period, that variable does not change.

If the fair task wakes up and the dl_defer_running is still set, do not defer.
Keep running until you consume the reservation.

The variable dl_defer_running is set to 0 only after the fair tasks consume
their runtime without being in a dl_server... IOW, when the RT tasks start to
behave.
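
Restating the same thing as a tiny model (illustrative only, this is not the v6
diff, just the decision logic):

	#include <stdbool.h>

	/* one instance per fair server (per CPU) */
	struct defer_state {
		bool dl_defer_running;	/* the server already ran after waiting out a defer */
	};

	/* the defer timer fired and the server was finally enqueued */
	void server_enqueued_after_defer(struct defer_state *s)
	{
		s->dl_defer_running = true;
	}

	/* the fair task wakes up again within the period: defer the server or not? */
	bool should_defer_again(const struct defer_state *s)
	{
		/* RT was already starving fair: behave like the non-deferred server */
		return !s->dl_defer_running;
	}

	/* fair tasks consumed their runtime without the server's help: RT behaves again */
	void fair_ran_unboosted(struct defer_state *s)
	{
		s->dl_defer_running = false;
	}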

No interface change.

With that in place, your reproducers are working. I have a periodic version
of your reproducer, also improving how the task consumes the runtime... I
will send it to you so you can have a look.

> 
> Specifically, it breaks because of this patch (6/7). If you revert the patch,
> the issue goes away.
> 
> With the patch 6/7:
> # ./dlserver_test
> # Runtime of PID 85 is 0.430000 seconds
> Bail out! Runtime of PID 85 is not within 10% of expected runtime 0.900000
> 
> Without the patch 6/7:
> # ./dlserver_test
> # Runtime of PID 87 is 0.900000 seconds
> ok 1 PASS
> 
> So basically, because of defer (or whatever you want to call it ;)), it gives
> less than 50% of the bandwidth that it gave without the defer.

There was a problem with the non-defer mode as well: dl_server_start() was
missing a set need resched. Fixed that in v6.

> Further, my impression is that this patch (6/7) does not even solve all the issues it
> intended to. For example, consider that a CFS task is in the boosted phase, and now
> an RT task wakes up. That RT task *will wait* for possibly the whole runtime
> granted to CFS, so it might not always help. Contrasting that with RT
> throttling, if an RT task is very well behaved (well behaved defined as not
> running to the limit that RT throttling should kick in), and it wakes up, it
> will run right away without any wait time, regardless of what CFS was or was not
> doing.


I fixed that as well.

The problem happens when a DL server has a large runtime (>=~ 50%).
Let's say 25 ms runtime, 50 ms period.

At time 0, the defer timer will be set at 25 ms (50 - 25).

From 0 to 25, the RT task would consume, for instance, only 2 ms... so
it is behaving...

At time 25, the defer timer fires... and as the fair task ran for 23 ms
(25 - 2 ms taken by RT), the server still has 2 ms of runtime left... so the
server is activated... which is not correct.

The change I made in v6 is:

Same case...

At time 25, the defer timer fires...
	Then, the timer will re-compute the defer time:
		If the RT tasks are behaving, forward the timer for the
		new (deadline - runtime).
		return;

For instance, in the previous case, the new defer timer would be: 50 ms - 2 ms.
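
(That is: the fair tasks already got 23 ms of the 25 ms runtime, so only 2 ms still
have to be guaranteed before the 50 ms deadline, and the latest point at which those
2 ms can still fit is 50 - 2 = 48 ms.)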

CFS will continue working, consuming runtime and resetting the period to avoid
activating the dl server.

The idea of forwarding the timer was taken from the cfs period timer. It is also
possible to forward the timer at other points... if necessary...

I did more testing, with different task sets, including tasks that go to sleep...
it is working as expected.

-- Daniel


> thanks,
> 
>  - Joel
> 


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v5 6/7] sched/deadline: Deferrable dl server
  2024-04-05 14:35         ` Daniel Bristot de Oliveira
@ 2024-04-08 17:11           ` Steven Rostedt
  0 siblings, 0 replies; 76+ messages in thread
From: Steven Rostedt @ 2024-04-08 17:11 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Joel Fernandes, Daniel Bristot de Oliveira, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Vineeth Pillai,
	Shuah Khan, Phil Auld, David Vernet

On Fri, 5 Apr 2024 16:35:49 +0200
Daniel Bristot de Oliveira <bristot@redhat.com> wrote:

> A reproducer always helps. So, your task there is not a periodic task... it is
> a sporadic task because it sleeps for a fixed amount of time after the runtime.
> 
> A periodic task with period 76 would wake at 0, 76, 152 - like cyclictest...
> so consuming at a fixed time rate if the scheduler allows it.
> 
> In the case of a fixed sleep time at the end of the execution, it will end up
> "throwing away bandwidth" if the runtime is not given at the beginning of the
> period because it will run slower... accumulating error. But that was not the
> main point here...
> 
> The problem here was more like: if a fair task goes to sleep in the middle of
> the server activation (for a lock?), and then wakes up again, the code in v5 is
> forcing it to defer... again. Thus, it is getting less bandwidth... notice that
> it does not even need to be at the start of the period. It is the middle of the
> execution.
> 
> Intuitively, reducing the deferred time would help there. But the best thing to do is:
> 
> If the fair task waited for the defer, and the real-time tasks are still using all
> CPU time, do not defer the activation again, and keep the defer mechanism disabled
> until the real-time tasks allow the fair scheduler to run in the background. So,
> making the defer mode equivalent to the non-defer mode until the RT tasks start
> to behave again.
> 
> For that, in the v6, there is a variable (dl_defer_running), once the dl_server
> is enqueued after the defer time, the variable dl_defer_running is set.
> 
> If the fair task sleeps in the middle of the period, that variable does not change.
> 
> If the fair task wakes up and the dl_defer_running is still set, do not defer.
> Keep running until you consume the reservation.
> 
> The variable dl_defer_running is set to 0 only after the fair tasks consume
> their runtime without being in a dl_server... IOW, when the RT tasks start to
> behave.

Very nice explanation! Thanks Daniel.

> 
> No interface change.
> 
> With that in place, your reproducers are working. I have a periodic version
> of your reproducer, also improving how the task consumes the runtime... I
> will send it to you so you can have a look.

Looking forward to reviewing your patches when I'm back from PTO.

-- Steve

^ permalink raw reply	[flat|nested] 76+ messages in thread

end of thread, other threads:[~2024-04-08 17:11 UTC | newest]

Thread overview: 76+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-04 10:59 [PATCH v5 0/7] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
2023-11-04 10:59 ` [PATCH v5 1/7] sched: Unify runtime accounting across classes Daniel Bristot de Oliveira
2023-11-15  9:04   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2023-11-04 10:59 ` [PATCH v5 2/7] sched/deadline: Collect sched_dl_entity initialization Daniel Bristot de Oliveira
2023-11-15  9:04   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2023-11-04 10:59 ` [PATCH v5 3/7] sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity Daniel Bristot de Oliveira
2023-11-15  9:04   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2023-11-04 10:59 ` [PATCH v5 4/7] sched/deadline: Introduce deadline servers Daniel Bristot de Oliveira
2023-11-15  9:04   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2023-11-04 10:59 ` [PATCH v5 5/7] sched/fair: Add trivial fair server Daniel Bristot de Oliveira
2023-11-06 14:24   ` Peter Zijlstra
2023-11-06 14:26     ` Daniel Bristot de Oliveira
2023-11-04 10:59 ` [PATCH v5 6/7] sched/deadline: Deferrable dl server Daniel Bristot de Oliveira
2023-11-06 14:55   ` Peter Zijlstra
2023-11-06 17:05     ` Daniel Bristot de Oliveira
2023-11-06 19:32   ` Joel Fernandes
2023-11-06 21:32     ` Joel Fernandes
2023-11-06 21:37       ` Joel Fernandes
2023-11-07 11:58         ` Daniel Bristot de Oliveira
2023-11-08  2:42           ` Joel Fernandes
2023-11-07 16:47         ` Steven Rostedt
2023-11-07 17:35           ` Steven Rostedt
2023-11-07 17:46             ` Steven Rostedt
2023-11-07 17:54             ` Steven Rostedt
2023-11-07 19:32               ` Steven Rostedt
2023-11-07 20:07                 ` Steven Rostedt
2023-11-07 17:37           ` Daniel Bristot de Oliveira
2023-11-07 18:50             ` Daniel Bristot de Oliveira
2023-11-08  3:20               ` Joel Fernandes
2023-11-08  8:01                 ` Daniel Bristot de Oliveira
2023-11-08 18:25                   ` Joel Fernandes
2023-11-08 12:44               ` Peter Zijlstra
2023-11-08 12:50                 ` Peter Zijlstra
2023-11-08 14:52                   ` Daniel Bristot de Oliveira
2023-11-08 13:46                 ` Daniel Bristot de Oliveira
2023-11-08 13:58                 ` Daniel Bristot de Oliveira
2023-11-08 15:14                 ` Juri Lelli
2023-11-08 16:57                   ` Peter Zijlstra
2023-11-08  2:37           ` Joel Fernandes
2023-11-07  7:30     ` Daniel Bristot de Oliveira
2023-11-07 16:37   ` Steven Rostedt
2023-11-13 15:05   ` kernel test robot
2024-03-20  0:03   ` Joel Fernandes
2024-03-20 19:24     ` Daniel Bristot de Oliveira
2024-03-21 16:15       ` Joel Fernandes
2024-03-23 14:37         ` Joel Fernandes
2024-04-05 14:35         ` Daniel Bristot de Oliveira
2024-04-08 17:11           ` Steven Rostedt
2023-11-04 10:59 ` [PATCH v5 7/7] sched/fair: Fair server interface Daniel Bristot de Oliveira
2023-11-04 15:18   ` kernel test robot
2023-11-05  0:55   ` kernel test robot
2023-11-06 15:40   ` Peter Zijlstra
2023-11-06 16:29     ` Daniel Bristot de Oliveira
2023-11-07  8:16       ` Peter Zijlstra
2023-11-07 14:06         ` Daniel Bristot de Oliveira
2023-11-07 14:44       ` Peter Zijlstra
2023-11-07 12:38   ` Peter Zijlstra
2023-11-07 13:24     ` Daniel Bristot de Oliveira
2024-01-19  1:49   ` Joel Fernandes
2024-01-19  1:55   ` Joel Fernandes
2024-01-22 14:14     ` Daniel Bristot de Oliveira
2024-01-23 15:39       ` Joel Fernandes
2024-01-23 15:44       ` Joel Fernandes
2024-02-13  2:13   ` Joel Fernandes
2024-02-13  2:21     ` Joel Fernandes
2024-02-14 14:23     ` Daniel Bristot de Oliveira
2024-02-15 13:57       ` Joel Fernandes
2024-02-15 17:27         ` Daniel Bristot de Oliveira
2024-02-15 17:41           ` Joel Fernandes
2024-04-04 17:43             ` Daniel Bristot de Oliveira
2023-12-08 21:47 ` [PATCH v5 0/7] SCHED_DEADLINE server infrastructure Joel Fernandes
2024-02-19  7:33 ` Huang, Ying
2024-02-19 10:23   ` Daniel Bristot de Oliveira
2024-02-20  3:28     ` Huang, Ying
2024-02-20  8:31       ` Daniel Bristot de Oliveira
2024-02-20  8:41         ` Huang, Ying
