* [PATCH v4 0/7] SCHED_DEADLINE server infrastructure
@ 2023-08-31 20:28 Daniel Bristot de Oliveira
  2023-08-31 20:28 ` [PATCH v4 1/7] sched: Unify runtime accounting across classes Daniel Bristot de Oliveira
                   ` (6 more replies)
  0 siblings, 7 replies; 25+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-08-31 20:28 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, bristot, Phil Auld

This is v4 of Peter's SCHED_DEADLINE server infrastructure
implementation [1].

SCHED_DEADLINE servers can help fix starvation issues of low priority
tasks (e.g., SCHED_OTHER) when higher priority tasks monopolize CPU
cycles. Today we have RT Throttling; DEADLINE servers should be able to
replace and improve that.

In v1, a discussion was raised about the consequences of using
deadline-based servers on fixed-priority workloads. As a demonstration,
here is the baseline timerlat scheduling latency as-is, with a kernel
build as background workload:

 # rtla timerlat top -u -d 10m

  --------------------- %< ------------------------
                                     Timer Latency
  0 01:42:24   |          IRQ Timer Latency (us)        |         Thread Timer Latency (us)      |    Ret user Timer Latency (us)
CPU COUNT      |      cur       min       avg       max |      cur       min       avg       max |      cur       min       avg       max
  0 #6143559   |        0         0         0        92 |        2         1         3        98 |        4         1         5       100
  1 #6143559   |        1         0         0        97 |        7         1         5       101 |        9         1         7       103
  2 #6143559   |        0         0         0        88 |        3         1         5        95 |        5         1         7        99
  3 #6143559   |        0         0         0        90 |        6         1         5       103 |       10         1         7       126
  4 #6143558   |        1         0         0        81 |        7         1         4        86 |        9         1         7        90
  5 #6143558   |        0         0         0        74 |        3         1         5        79 |        4         1         7        83
  6 #6143558   |        0         0         0        83 |        2         1         5        89 |        3         0         7       108
  7 #6143558   |        0         0         0        85 |        3         1         4       126 |        5         1         6       137
  --------------------- >% ------------------------

And these are the same tests with the DL server activating without any delay:
  --------------------- %< ------------------------
  0 00:10:01   |          IRQ Timer Latency (us)        |         Thread Timer Latency (us)      |    Ret user Timer Latency (us)
CPU COUNT      |      cur       min       avg       max |      cur       min       avg       max |      cur       min       avg       max
  0 #579147    |        0         0         0        54 |        2         1        52     61095 |        2         2        56     61102
  1 #578766    |        0         0         0        83 |        2         1        49     55824 |        3         2        53     55831
  2 #578559    |        0         0         1        59 |        2         1        50     55760 |        3         2        54     55770
  3 #578318    |        0         0         0        76 |        2         1        49     55751 |        3         2        54     55760
  4 #578611    |        0         0         0        64 |        2         1        49     55811 |        3         2        53     55820
  5 #578347    |        0         0         1        40 |        2         1        50     56121 |        3         2        55     56133
  6 #578938    |        0         0         1        75 |        2         1        49     55755 |        3         2        53     55764
  7 #578631    |        0         0         1        36 |        3         1        51     55528 |        4         2        55     55541
  --------------------- >% ------------------------

The problem with a DL-server-only implementation is that FIFO tasks might
suffer preemption from NORMAL tasks even when spare CPU cycles are
available. In fact, the fair deadline server is enqueued right away when
NORMAL tasks wake up, and they are first scheduled by the server, thus
potentially preempting a well-behaving FIFO task. This is of course not ideal.

We had discussions about it, and one of the possibilities would be
using a different scheduling algorithm for this. But IMHO that is
overkill.

Juri and I discussed this and thought about delaying the server
activation to the 0-lag time, thus enabling the server only
if the fair scheduler is about to starve.

Patch 6/7 adds the possibility to defer the server start
to the (absolute deadline - runtime) point in time. This is
achieved by enqueuing the DL server throttled, with the next
replenishment time set to activate the server at
(absolute deadline - runtime).
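
As a minimal illustration of the idea (this is a sketch, not the patch code;
only the field names follow struct sched_dl_entity):

  /*
   * Sketch only: assuming a fresh period starting at 'now', the latest
   * point at which the server can be activated and still consume its
   * full runtime before the absolute deadline is:
   */
  static u64 fair_server_defer_until(struct sched_dl_entity *dl_se, u64 now)
  {
  	u64 abs_deadline = now + dl_se->dl_deadline;

  	return abs_deadline - dl_se->dl_runtime;
  }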

Patch 7/7 adds a per-rq interface for the knobs:
	fair_server_runtime (950 ms)
	fair_server_period  (1s)
	fair_server_defer   (enabled)

With defer enabled on CPUs [0:3], the results get better, showing
a behavior similar to the one we have with RT throttling:
  --------------------- %< ------------------------
  0 00:10:01   |          IRQ Timer Latency (us)        |         Thread Timer Latency (us)      |    Ret user Timer Latency (us)
CPU COUNT      |      cur       min       avg       max |      cur       min       avg       max |      cur       min       avg       max
  0 #600003    |        0         0         0        34 |        6         1         5        75 |       10         2         7       108
  1 #600003    |        1         0         1        38 |        9         1         6        96 |       14         2         9       144
  2 #600005    |        1         0         1        85 |       10         1         6        94 |       14         2         9       120
  3 #600006    |        0         0         1        72 |        8         1         6       103 |       13         2         9       108
  4 #600005    |        1         0         1        61 |       10         1         6       110 |       14         2         9       126
  5 #578569    |        0         0         0        65 |       13         1        49     55962 |       20         2        54     55974
  6 #578852    |        0         0         0        56 |        5         1        48     55559 |        9         2        53     55568
  7 #578710    |        0         0         0        91 |       10         1        49     55773 |       16         2        53     55786
  --------------------- >% ------------------------

Here are some osnoise measurements, with osnoise threads running as FIFO:1 in
different setups (defer enabled):
 - CPU 2 isolated
 - CPU 3 isolated, shared with a CFS busy loop task
 - CPU 8 non-isolated
 - CPU 9 non-isolated, shared with a CFS busy loop task

  --------------------- %< ------------------------
 ~# pgrep ktimer | while read pid; do chrt -p -f 2 $pid; done # for RT kernel
 ~# sysctl kernel.sched_rt_runtime_us=-1
 ~# tuna  isolate -c 2
 ~# tuna  isolate -c 3
 ~# taskset -c 3 ./f &
 ~# taskset -c 9 ./f &
 ~# osnoise -P f:1 -c 2,3,8,9 -T 1 -d 10m -H 1
                                          Operating System Noise
duration:   0 00:10:00 | time is in us
CPU Period       Runtime        Noise  % CPU Aval   Max Noise   Max Single          HW          NMI          IRQ      Softirq       Thread
  2 #599       599000000            3    99.99999           2            1           3            0            0            0            0
  3 #598       598001768     31188796    94.78449       53907        53907           0            0      2842602            0         2394
  8 #599       599000000       918224    99.84670        1735           36           0           88       615903            0        37958
  9 #598       598000000     31441197    94.74227       53875        53448           0           88      3417253            0         1364
   --------------------- >% ------------------------

The system runs fine!
	- no crashes (famous last words)
	- the FIFO property is kept
	- per-CPU interface, because it is more flexible and to detach this from
	  the throttling concept.

Global is broken, but it will > /dev/null.

Changes from v3:
  - Add the defer server (Daniel)
  - Add a per-rq interface (Daniel, with Peter's feedback)
  - Add an option to not defer the server (for Joel)
  - Typos and 1-liner fixes (Valentin, Luca, Peter)
  - Fair scheduler running on the dl server does not account as an RT task (Daniel)
  - Changed the condition to enable the server (RT & fair tasks) (Daniel)
Changes from v2:
  - Refactor/rephrase/typo fixes
  - Deferrable server using throttling
  - The server starts when RT && Fair tasks are enqueued
  - Interface with runtime/period/defer option
Changes from v1:
  - rebased on 6.4-rc1 tip/sched/core

Daniel Bristot de Oliveira (2):
  sched/deadline: Deferrable dl server
  sched/fair: Fair server interface

Peter Zijlstra (5):
  sched: Unify runtime accounting across classes
  sched/deadline: Collect sched_dl_entity initialization
  sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity
  sched/deadline: Introduce deadline servers
  sched/fair: Add trivial fair server

 include/linux/sched.h    |  31 ++-
 kernel/sched/core.c      |  23 +-
 kernel/sched/deadline.c  | 555 ++++++++++++++++++++++++++-------------
 kernel/sched/debug.c     | 177 +++++++++++++
 kernel/sched/fair.c      |  92 ++++++-
 kernel/sched/rt.c        |  21 +-
 kernel/sched/sched.h     |  64 ++++-
 kernel/sched/stop_task.c |  13 +-
 8 files changed, 737 insertions(+), 239 deletions(-)

-- 
2.40.1



* [PATCH v4 1/7] sched: Unify runtime accounting across classes
  2023-08-31 20:28 [PATCH v4 0/7] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
@ 2023-08-31 20:28 ` Daniel Bristot de Oliveira
  2023-09-15 21:41   ` Steven Rostedt
  2023-08-31 20:28 ` [PATCH v4 2/7] sched/deadline: Collect sched_dl_entity initialization Daniel Bristot de Oliveira
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 25+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-08-31 20:28 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, bristot, Phil Auld

From: Peter Zijlstra <peterz@infradead.org>

All classes use sched_entity::exec_start to track runtime and have
copies of the exact same code around to compute runtime.

Collapse all that.

Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
---
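For illustration only (not part of the patch): with the helper in place, a
class's runtime accounting boils down to the shared call plus whatever is
class specific, e.g.:

  /* Sketch; 'someclass' is hypothetical. */
  static void update_curr_someclass(struct rq *rq)
  {
  	s64 delta_exec;

  	/* exec_start/sum_exec_runtime, exec_max and cputime accounting */
  	delta_exec = update_curr_common(rq);
  	if (unlikely(delta_exec <= 0))
  		return;

  	/* class-specific bandwidth/deadline handling goes here */
  }
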
 include/linux/sched.h    |  2 +-
 kernel/sched/deadline.c  | 15 +++--------
 kernel/sched/fair.c      | 57 ++++++++++++++++++++++++++++++----------
 kernel/sched/rt.c        | 15 +++--------
 kernel/sched/sched.h     | 12 ++-------
 kernel/sched/stop_task.c | 13 +--------
 6 files changed, 53 insertions(+), 61 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 177b3f3676ef..639f6eb9bd4f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -520,7 +520,7 @@ struct sched_statistics {
 	u64				block_max;
 	s64				sum_block_runtime;
 
-	u64				exec_max;
+	s64				exec_max;
 	u64				slice_max;
 
 	u64				nr_migrations_cold;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 58b542bf2893..9a09d9dafd88 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1299,9 +1299,8 @@ static void update_curr_dl(struct rq *rq)
 {
 	struct task_struct *curr = rq->curr;
 	struct sched_dl_entity *dl_se = &curr->dl;
-	u64 delta_exec, scaled_delta_exec;
+	s64 delta_exec, scaled_delta_exec;
 	int cpu = cpu_of(rq);
-	u64 now;
 
 	if (!dl_task(curr) || !on_dl_rq(dl_se))
 		return;
@@ -1314,21 +1313,13 @@ static void update_curr_dl(struct rq *rq)
 	 * natural solution, but the full ramifications of this
 	 * approach need further study.
 	 */
-	now = rq_clock_task(rq);
-	delta_exec = now - curr->se.exec_start;
-	if (unlikely((s64)delta_exec <= 0)) {
+	delta_exec = update_curr_common(rq);
+	if (unlikely(delta_exec <= 0)) {
 		if (unlikely(dl_se->dl_yielded))
 			goto throttle;
 		return;
 	}
 
-	schedstat_set(curr->stats.exec_max,
-		      max(curr->stats.exec_max, delta_exec));
-
-	trace_sched_stat_runtime(curr, delta_exec, 0);
-
-	update_current_exec_runtime(curr, now, delta_exec);
-
 	if (dl_entity_is_special(dl_se))
 		return;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 911d0063763c..52c8219623b1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1092,23 +1092,17 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
 }
 #endif /* CONFIG_SMP */
 
-/*
- * Update the current task's runtime statistics.
- */
-static void update_curr(struct cfs_rq *cfs_rq)
+static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
 {
-	struct sched_entity *curr = cfs_rq->curr;
-	u64 now = rq_clock_task(rq_of(cfs_rq));
-	u64 delta_exec;
-
-	if (unlikely(!curr))
-		return;
+	u64 now = rq_clock_task(rq);
+	s64 delta_exec;
 
 	delta_exec = now - curr->exec_start;
-	if (unlikely((s64)delta_exec <= 0))
-		return;
+	if (unlikely(delta_exec <= 0))
+		return delta_exec;
 
 	curr->exec_start = now;
+	curr->sum_exec_runtime += delta_exec;
 
 	if (schedstat_enabled()) {
 		struct sched_statistics *stats;
@@ -1118,8 +1112,43 @@ static void update_curr(struct cfs_rq *cfs_rq)
 				max(delta_exec, stats->exec_max));
 	}
 
-	curr->sum_exec_runtime += delta_exec;
-	schedstat_add(cfs_rq->exec_clock, delta_exec);
+	return delta_exec;
+}
+
+/*
+ * Used by other classes to account runtime.
+ */
+s64 update_curr_common(struct rq *rq)
+{
+	struct task_struct *curr = rq->curr;
+	s64 delta_exec;
+
+	delta_exec = update_curr_se(rq, &curr->se);
+	if (unlikely(delta_exec <= 0))
+		return delta_exec;
+
+	trace_sched_stat_runtime(curr, delta_exec, 0);
+
+	account_group_exec_runtime(curr, delta_exec);
+	cgroup_account_cputime(curr, delta_exec);
+
+	return delta_exec;
+}
+
+/*
+ * Update the current task's runtime statistics.
+ */
+static void update_curr(struct cfs_rq *cfs_rq)
+{
+	struct sched_entity *curr = cfs_rq->curr;
+	s64 delta_exec;
+
+	if (unlikely(!curr))
+		return;
+
+	delta_exec = update_curr_se(rq_of(cfs_rq), curr);
+	if (unlikely(delta_exec <= 0))
+		return;
 
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
 	update_deadline(cfs_rq, curr);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 0597ba0f85ff..e23cc67c9467 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1046,24 +1046,15 @@ static void update_curr_rt(struct rq *rq)
 {
 	struct task_struct *curr = rq->curr;
 	struct sched_rt_entity *rt_se = &curr->rt;
-	u64 delta_exec;
-	u64 now;
+	s64 delta_exec;
 
 	if (curr->sched_class != &rt_sched_class)
 		return;
 
-	now = rq_clock_task(rq);
-	delta_exec = now - curr->se.exec_start;
-	if (unlikely((s64)delta_exec <= 0))
+	delta_exec = update_curr_common(rq);
+	if (unlikely(delta_exec <= 0))
 		return;
 
-	schedstat_set(curr->stats.exec_max,
-		      max(curr->stats.exec_max, delta_exec));
-
-	trace_sched_stat_runtime(curr, delta_exec, 0);
-
-	update_current_exec_runtime(curr, now, delta_exec);
-
 	if (!rt_bandwidth_enabled())
 		return;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 04846272409c..1def5b7fa1df 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2228,6 +2228,8 @@ struct affinity_context {
 	unsigned int flags;
 };
 
+extern s64 update_curr_common(struct rq *rq);
+
 struct sched_class {
 
 #ifdef CONFIG_UCLAMP_TASK
@@ -3280,16 +3282,6 @@ extern int sched_dynamic_mode(const char *str);
 extern void sched_dynamic_update(int mode);
 #endif
 
-static inline void update_current_exec_runtime(struct task_struct *curr,
-						u64 now, u64 delta_exec)
-{
-	curr->se.sum_exec_runtime += delta_exec;
-	account_group_exec_runtime(curr, delta_exec);
-
-	curr->se.exec_start = now;
-	cgroup_account_cputime(curr, delta_exec);
-}
-
 #ifdef CONFIG_SCHED_MM_CID
 
 #define SCHED_MM_CID_PERIOD_NS	(100ULL * 1000000)	/* 100ms */
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 85590599b4d6..7595494ceb6d 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -70,18 +70,7 @@ static void yield_task_stop(struct rq *rq)
 
 static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
 {
-	struct task_struct *curr = rq->curr;
-	u64 now, delta_exec;
-
-	now = rq_clock_task(rq);
-	delta_exec = now - curr->se.exec_start;
-	if (unlikely((s64)delta_exec < 0))
-		delta_exec = 0;
-
-	schedstat_set(curr->stats.exec_max,
-		      max(curr->stats.exec_max, delta_exec));
-
-	update_current_exec_runtime(curr, now, delta_exec);
+	update_curr_common(rq);
 }
 
 /*
-- 
2.40.1



* [PATCH v4 2/7] sched/deadline: Collect sched_dl_entity initialization
  2023-08-31 20:28 [PATCH v4 0/7] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
  2023-08-31 20:28 ` [PATCH v4 1/7] sched: Unify runtime accounting across classes Daniel Bristot de Oliveira
@ 2023-08-31 20:28 ` Daniel Bristot de Oliveira
  2023-08-31 20:28 ` [PATCH v4 3/7] sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity Daniel Bristot de Oliveira
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-08-31 20:28 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, bristot, Phil Auld

From: Peter Zijlstra <peterz@infradead.org>

Create a single function that initializes a sched_dl_entity.

Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
---
 kernel/sched/core.c     |  5 +----
 kernel/sched/deadline.c | 22 +++++++++++++++-------
 kernel/sched/sched.h    |  5 +----
 3 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2299a5cfbfb9..b57746237a43 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4513,10 +4513,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	memset(&p->stats, 0, sizeof(p->stats));
 #endif
 
-	RB_CLEAR_NODE(&p->dl.rb_node);
-	init_dl_task_timer(&p->dl);
-	init_dl_inactive_task_timer(&p->dl);
-	__dl_clear_params(p);
+	init_dl_entity(&p->dl);
 
 	INIT_LIST_HEAD(&p->rt.run_list);
 	p->rt.timeout		= 0;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 9a09d9dafd88..f8c402079404 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -335,6 +335,8 @@ static void dl_change_utilization(struct task_struct *p, u64 new_bw)
 	__add_rq_bw(new_bw, &rq->dl);
 }
 
+static void __dl_clear_params(struct sched_dl_entity *dl_se);
+
 /*
  * The utilization of a task cannot be immediately removed from
  * the rq active utilization (running_bw) when the task blocks.
@@ -434,7 +436,7 @@ static void task_non_contending(struct task_struct *p)
 			raw_spin_lock(&dl_b->lock);
 			__dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
 			raw_spin_unlock(&dl_b->lock);
-			__dl_clear_params(p);
+			__dl_clear_params(dl_se);
 		}
 
 		return;
@@ -1207,7 +1209,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 	return HRTIMER_NORESTART;
 }
 
-void init_dl_task_timer(struct sched_dl_entity *dl_se)
+static void init_dl_task_timer(struct sched_dl_entity *dl_se)
 {
 	struct hrtimer *timer = &dl_se->dl_timer;
 
@@ -1413,7 +1415,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 		raw_spin_lock(&dl_b->lock);
 		__dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
 		raw_spin_unlock(&dl_b->lock);
-		__dl_clear_params(p);
+		__dl_clear_params(dl_se);
 
 		goto unlock;
 	}
@@ -1429,7 +1431,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 	return HRTIMER_NORESTART;
 }
 
-void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se)
+static void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se)
 {
 	struct hrtimer *timer = &dl_se->inactive_timer;
 
@@ -2984,10 +2986,8 @@ bool __checkparam_dl(const struct sched_attr *attr)
 /*
  * This function clears the sched_dl_entity static params.
  */
-void __dl_clear_params(struct task_struct *p)
+static void __dl_clear_params(struct sched_dl_entity *dl_se)
 {
-	struct sched_dl_entity *dl_se = &p->dl;
-
 	dl_se->dl_runtime		= 0;
 	dl_se->dl_deadline		= 0;
 	dl_se->dl_period		= 0;
@@ -3005,6 +3005,14 @@ void __dl_clear_params(struct task_struct *p)
 #endif
 }
 
+void init_dl_entity(struct sched_dl_entity *dl_se)
+{
+	RB_CLEAR_NODE(&dl_se->rb_node);
+	init_dl_task_timer(dl_se);
+	init_dl_inactive_task_timer(dl_se);
+	__dl_clear_params(dl_se);
+}
+
 bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr)
 {
 	struct sched_dl_entity *dl_se = &p->dl;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1def5b7fa1df..5e0df4bba476 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -284,8 +284,6 @@ struct rt_bandwidth {
 	unsigned int		rt_period_active;
 };
 
-void __dl_clear_params(struct task_struct *p);
-
 static inline int dl_bandwidth_enabled(void)
 {
 	return sysctl_sched_rt_runtime >= 0;
@@ -2443,8 +2441,7 @@ extern struct rt_bandwidth def_rt_bandwidth;
 extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
 extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
 
-extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
-extern void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se);
+extern void init_dl_entity(struct sched_dl_entity *dl_se);
 
 #define BW_SHIFT		20
 #define BW_UNIT			(1 << BW_SHIFT)
-- 
2.40.1



* [PATCH v4 3/7] sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity
  2023-08-31 20:28 [PATCH v4 0/7] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
  2023-08-31 20:28 ` [PATCH v4 1/7] sched: Unify runtime accounting across classes Daniel Bristot de Oliveira
  2023-08-31 20:28 ` [PATCH v4 2/7] sched/deadline: Collect sched_dl_entity initialization Daniel Bristot de Oliveira
@ 2023-08-31 20:28 ` Daniel Bristot de Oliveira
  2023-08-31 20:28 ` [PATCH v4 4/7] sched/deadline: Introduce deadline servers Daniel Bristot de Oliveira
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-08-31 20:28 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, bristot, Phil Auld

From: Peter Zijlstra <peterz@infradead.org>

In preparation for introducing a !task sched_dl_entity, move the
bandwidth accounting into {en,de}queue_dl_entity().

Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
---
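For illustration only (not part of the patch): the task-level hooks now just
translate the migration state into a flag and leave the bandwidth accounting
to the dl_entity-level functions, roughly:

  /* Sketch; the real code is in the hunks below. */
  static void enqueue_task_dl_sketch(struct rq *rq, struct task_struct *p, int flags)
  {
  	if (p->on_rq == TASK_ON_RQ_MIGRATING)
  		flags |= ENQUEUE_MIGRATING;

  	/* add_rq_bw()/add_running_bw() happen inside, on RESTORE/MIGRATING */
  	enqueue_dl_entity(&p->dl, flags);
  }
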
 kernel/sched/deadline.c | 130 ++++++++++++++++++++++------------------
 kernel/sched/sched.h    |   6 ++
 2 files changed, 78 insertions(+), 58 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f8c402079404..957baaf6dc92 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -391,12 +391,12 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se);
  * up, and checks if the task is still in the "ACTIVE non contending"
  * state or not (in the second case, it updates running_bw).
  */
-static void task_non_contending(struct task_struct *p)
+static void task_non_contending(struct sched_dl_entity *dl_se)
 {
-	struct sched_dl_entity *dl_se = &p->dl;
 	struct hrtimer *timer = &dl_se->inactive_timer;
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
+	struct task_struct *p = dl_task_of(dl_se);
 	s64 zerolag_time;
 
 	/*
@@ -428,13 +428,14 @@ static void task_non_contending(struct task_struct *p)
 	if ((zerolag_time < 0) || hrtimer_active(&dl_se->inactive_timer)) {
 		if (dl_task(p))
 			sub_running_bw(dl_se, dl_rq);
+
 		if (!dl_task(p) || READ_ONCE(p->__state) == TASK_DEAD) {
 			struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
 
 			if (READ_ONCE(p->__state) == TASK_DEAD)
-				sub_rq_bw(&p->dl, &rq->dl);
+				sub_rq_bw(dl_se, &rq->dl);
 			raw_spin_lock(&dl_b->lock);
-			__dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
+			__dl_sub(dl_b, dl_se->dl_bw, dl_bw_cpus(task_cpu(p)));
 			raw_spin_unlock(&dl_b->lock);
 			__dl_clear_params(dl_se);
 		}
@@ -1627,6 +1628,41 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 
 	update_stats_enqueue_dl(dl_rq_of_se(dl_se), dl_se, flags);
 
+	/*
+	 * Check if a constrained deadline task was activated
+	 * after the deadline but before the next period.
+	 * If that is the case, the task will be throttled and
+	 * the replenishment timer will be set to the next period.
+	 */
+	if (!dl_se->dl_throttled && !dl_is_implicit(dl_se))
+		dl_check_constrained_dl(dl_se);
+
+	if (flags & (ENQUEUE_RESTORE|ENQUEUE_MIGRATING)) {
+		struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+		add_rq_bw(dl_se, dl_rq);
+		add_running_bw(dl_se, dl_rq);
+	}
+
+	/*
+	 * If p is throttled, we do not enqueue it. In fact, if it exhausted
+	 * its budget it needs a replenishment and, since it now is on
+	 * its rq, the bandwidth timer callback (which clearly has not
+	 * run yet) will take care of this.
+	 * However, the active utilization does not depend on the fact
+	 * that the task is on the runqueue or not (but depends on the
+	 * task's state - in GRUB parlance, "inactive" vs "active contending").
+	 * In other words, even if a task is throttled its utilization must
+	 * be counted in the active utilization; hence, we need to call
+	 * add_running_bw().
+	 */
+	if (dl_se->dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
+		if (flags & ENQUEUE_WAKEUP)
+			task_contending(dl_se, flags);
+
+		return;
+	}
+
 	/*
 	 * If this is a wakeup or a new instance, the scheduling
 	 * parameters of the task might need updating. Otherwise,
@@ -1646,9 +1682,28 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 	__enqueue_dl_entity(dl_se);
 }
 
-static void dequeue_dl_entity(struct sched_dl_entity *dl_se)
+static void dequeue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 {
 	__dequeue_dl_entity(dl_se);
+
+	if (flags & (DEQUEUE_SAVE|DEQUEUE_MIGRATING)) {
+		struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+		sub_running_bw(dl_se, dl_rq);
+		sub_rq_bw(dl_se, dl_rq);
+	}
+
+	/*
+	 * This check allows to start the inactive timer (or to immediately
+	 * decrease the active utilization, if needed) in two cases:
+	 * when the task blocks and when it is terminating
+	 * (p->state == TASK_DEAD). We can handle the two cases in the same
+	 * way, because from GRUB's point of view the same thing is happening
+	 * (the task moves from "active contending" to "active non contending"
+	 * or "inactive")
+	 */
+	if (flags & DEQUEUE_SLEEP)
+		task_non_contending(dl_se);
 }
 
 static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
@@ -1693,76 +1748,35 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 		return;
 	}
 
-	/*
-	 * Check if a constrained deadline task was activated
-	 * after the deadline but before the next period.
-	 * If that is the case, the task will be throttled and
-	 * the replenishment timer will be set to the next period.
-	 */
-	if (!p->dl.dl_throttled && !dl_is_implicit(&p->dl))
-		dl_check_constrained_dl(&p->dl);
-
-	if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & ENQUEUE_RESTORE) {
-		add_rq_bw(&p->dl, &rq->dl);
-		add_running_bw(&p->dl, &rq->dl);
-	}
-
-	/*
-	 * If p is throttled, we do not enqueue it. In fact, if it exhausted
-	 * its budget it needs a replenishment and, since it now is on
-	 * its rq, the bandwidth timer callback (which clearly has not
-	 * run yet) will take care of this.
-	 * However, the active utilization does not depend on the fact
-	 * that the task is on the runqueue or not (but depends on the
-	 * task's state - in GRUB parlance, "inactive" vs "active contending").
-	 * In other words, even if a task is throttled its utilization must
-	 * be counted in the active utilization; hence, we need to call
-	 * add_running_bw().
-	 */
-	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
-		if (flags & ENQUEUE_WAKEUP)
-			task_contending(&p->dl, flags);
-
-		return;
-	}
-
 	check_schedstat_required();
 	update_stats_wait_start_dl(dl_rq_of_se(&p->dl), &p->dl);
 
+	if (p->on_rq == TASK_ON_RQ_MIGRATING)
+		flags |= ENQUEUE_MIGRATING;
+
 	enqueue_dl_entity(&p->dl, flags);
 
-	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
+	if (!task_current(rq, p) && !p->dl.dl_throttled && p->nr_cpus_allowed > 1)
 		enqueue_pushable_dl_task(rq, p);
 }
 
 static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	update_stats_dequeue_dl(&rq->dl, &p->dl, flags);
-	dequeue_dl_entity(&p->dl);
-	dequeue_pushable_dl_task(rq, p);
+	dequeue_dl_entity(&p->dl, flags);
+
+	if (!p->dl.dl_throttled)
+		dequeue_pushable_dl_task(rq, p);
 }
 
 static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	update_curr_dl(rq);
-	__dequeue_task_dl(rq, p, flags);
 
-	if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & DEQUEUE_SAVE) {
-		sub_running_bw(&p->dl, &rq->dl);
-		sub_rq_bw(&p->dl, &rq->dl);
-	}
+	if (p->on_rq == TASK_ON_RQ_MIGRATING)
+		flags |= DEQUEUE_MIGRATING;
 
-	/*
-	 * This check allows to start the inactive timer (or to immediately
-	 * decrease the active utilization, if needed) in two cases:
-	 * when the task blocks and when it is terminating
-	 * (p->state == TASK_DEAD). We can handle the two cases in the same
-	 * way, because from GRUB's point of view the same thing is happening
-	 * (the task moves from "active contending" to "active non contending"
-	 * or "inactive")
-	 */
-	if (flags & DEQUEUE_SLEEP)
-		task_non_contending(p);
+	__dequeue_task_dl(rq, p, flags);
 }
 
 /*
@@ -2578,7 +2592,7 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
 	 * will reset the task parameters.
 	 */
 	if (task_on_rq_queued(p) && p->dl.dl_runtime)
-		task_non_contending(p);
+		task_non_contending(&p->dl);
 
 	/*
 	 * In case a task is setscheduled out from SCHED_DEADLINE we need to
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5e0df4bba476..9f48ed3e9028 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2193,6 +2193,10 @@ extern const u32		sched_prio_to_wmult[40];
  * MOVE - paired with SAVE/RESTORE, explicitly does not preserve the location
  *        in the runqueue.
  *
+ * NOCLOCK - skip the update_rq_clock() (avoids double updates)
+ *
+ * MIGRATION - p->on_rq == TASK_ON_RQ_MIGRATING (used for DEADLINE)
+ *
  * ENQUEUE_HEAD      - place at front of runqueue (tail if not specified)
  * ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline)
  * ENQUEUE_MIGRATED  - the task was migrated during wakeup
@@ -2203,6 +2207,7 @@ extern const u32		sched_prio_to_wmult[40];
 #define DEQUEUE_SAVE		0x02 /* Matches ENQUEUE_RESTORE */
 #define DEQUEUE_MOVE		0x04 /* Matches ENQUEUE_MOVE */
 #define DEQUEUE_NOCLOCK		0x08 /* Matches ENQUEUE_NOCLOCK */
+#define DEQUEUE_MIGRATING	0x100 /* Matches ENQUEUE_MIGRATING */
 
 #define ENQUEUE_WAKEUP		0x01
 #define ENQUEUE_RESTORE		0x02
@@ -2217,6 +2222,7 @@ extern const u32		sched_prio_to_wmult[40];
 #define ENQUEUE_MIGRATED	0x00
 #endif
 #define ENQUEUE_INITIAL		0x80
+#define ENQUEUE_MIGRATING	0x100
 
 #define RETRY_TASK		((void *)-1UL)
 
-- 
2.40.1



* [PATCH v4 4/7] sched/deadline: Introduce deadline servers
  2023-08-31 20:28 [PATCH v4 0/7] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
                   ` (2 preceding siblings ...)
  2023-08-31 20:28 ` [PATCH v4 3/7] sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity Daniel Bristot de Oliveira
@ 2023-08-31 20:28 ` Daniel Bristot de Oliveira
  2023-08-31 20:28 ` [PATCH v4 5/7] sched/fair: Add trivial fair server Daniel Bristot de Oliveira
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-08-31 20:28 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, bristot, Phil Auld

From: Peter Zijlstra <peterz@infradead.org>

Low priority tasks (e.g., SCHED_OTHER) can suffer starvation if tasks
with higher priority (e.g., SCHED_FIFO) monopolize CPU(s).

RT Throttling was introduced a while ago as a (mostly debug)
countermeasure one can use to reserve some CPU time for low priority
tasks (usually background type of work, e.g. workqueues, timers, etc.).
It however has its own problems (see documentation) and the undesired
effect of unconditionally throttling FIFO tasks even when no lower
priority activity needs to run (there are mechanisms to fix this issue
as well, but, again, with their own problems).

Introduce deadline servers to service low priority tasks' needs under
starvation conditions. Deadline servers are built by extending the
SCHED_DEADLINE implementation to allow 2-level scheduling (a sched_deadline
entity becomes a container for lower priority scheduling entities).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
---
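For illustration only (not part of the patch): a scheduling class wires up a
server by providing the two callbacks and driving the start/stop hooks; the
'my_*' names and the per-rq field below are hypothetical, only the
dl_server_*() calls are the API added here:

  static bool my_server_has_tasks(struct sched_dl_entity *dl_se)
  {
  	return my_class_has_runnable_tasks(dl_se->rq);	/* hypothetical helper */
  }

  static struct task_struct *my_server_pick(struct sched_dl_entity *dl_se)
  {
  	return my_class_pick_task(dl_se->rq);		/* nested pick, hypothetical */
  }

  static void my_server_setup(struct rq *rq)
  {
  	/* rq->my_server would be a per-rq sched_dl_entity */
  	dl_server_init(&rq->my_server, rq, my_server_has_tasks, my_server_pick);
  }

  /*
   * dl_server_start()/dl_server_stop() are then called when the first
   * serviced task is enqueued / the last one is dequeued, and
   * dl_server_update() feeds the runtime consumed by the picked tasks
   * back into the server (see update_curr_common()).
   */
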
 include/linux/sched.h   |  22 ++-
 kernel/sched/core.c     |  17 ++
 kernel/sched/deadline.c | 344 +++++++++++++++++++++++++++-------------
 kernel/sched/fair.c     |   4 +
 kernel/sched/sched.h    |  27 ++++
 5 files changed, 301 insertions(+), 113 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 639f6eb9bd4f..40fbf3f034e0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -63,12 +63,14 @@ struct robust_list_head;
 struct root_domain;
 struct rq;
 struct sched_attr;
+struct sched_dl_entity;
 struct sched_param;
 struct seq_file;
 struct sighand_struct;
 struct signal_struct;
 struct task_delay_info;
 struct task_group;
+struct task_struct;
 struct user_event_mm;
 
 /*
@@ -604,6 +606,9 @@ struct sched_rt_entity {
 #endif
 } __randomize_layout;
 
+typedef bool (*dl_server_has_tasks_f)(struct sched_dl_entity *);
+typedef struct task_struct *(*dl_server_pick_f)(struct sched_dl_entity *);
+
 struct sched_dl_entity {
 	struct rb_node			rb_node;
 
@@ -651,6 +656,7 @@ struct sched_dl_entity {
 	unsigned int			dl_yielded        : 1;
 	unsigned int			dl_non_contending : 1;
 	unsigned int			dl_overrun	  : 1;
+	unsigned int			dl_server         : 1;
 
 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
@@ -665,7 +671,20 @@ struct sched_dl_entity {
 	 * timer is needed to decrease the active utilization at the correct
 	 * time.
 	 */
-	struct hrtimer inactive_timer;
+	struct hrtimer			inactive_timer;
+
+	/*
+	 * Bits for DL-server functionality. Also see the comment near
+	 * dl_server_update().
+	 *
+	 * @rq the runqueue this server is for
+	 *
+	 * @server_has_tasks() returns true if @server_pick return a
+	 * runnable task.
+	 */
+	struct rq			*rq;
+	dl_server_has_tasks_f		server_has_tasks;
+	dl_server_pick_f		server_pick;
 
 #ifdef CONFIG_RT_MUTEXES
 	/*
@@ -794,6 +813,7 @@ struct task_struct {
 	struct sched_entity		se;
 	struct sched_rt_entity		rt;
 	struct sched_dl_entity		dl;
+	struct sched_dl_entity		*server;
 	const struct sched_class	*sched_class;
 
 #ifdef CONFIG_SCHED_CORE
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b57746237a43..c780707e1761 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3815,6 +3815,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 		rq->idle_stamp = 0;
 	}
 #endif
+
+	p->server = NULL;
 }
 
 /*
@@ -6008,12 +6010,27 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			p = pick_next_task_idle(rq);
 		}
 
+		/*
+		 * This is the fast path; it cannot be a DL server pick;
+		 * therefore even if @p == @prev, ->server must be NULL.
+		 */
+		if (p->server)
+			p->server = NULL;
+
 		return p;
 	}
 
 restart:
 	put_prev_task_balance(rq, prev, rf);
 
+	/*
+	 * We've updated @prev and no longer need the server link, clear it.
+	 * Must be done before ->pick_next_task() because that can (re)set
+	 * ->server.
+	 */
+	if (prev->server)
+		prev->server = NULL;
+
 	for_each_class(class) {
 		p = class->pick_next_task(rq);
 		if (p)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 957baaf6dc92..4dac16ed1317 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -54,8 +54,14 @@ static int __init sched_dl_sysctl_init(void)
 late_initcall(sched_dl_sysctl_init);
 #endif
 
+static bool dl_server(struct sched_dl_entity *dl_se)
+{
+	return dl_se->dl_server;
+}
+
 static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
 {
+	BUG_ON(dl_server(dl_se));
 	return container_of(dl_se, struct task_struct, dl);
 }
 
@@ -64,14 +70,22 @@ static inline struct rq *rq_of_dl_rq(struct dl_rq *dl_rq)
 	return container_of(dl_rq, struct rq, dl);
 }
 
-static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
+static inline struct rq *rq_of_dl_se(struct sched_dl_entity *dl_se)
 {
-	struct task_struct *p = dl_task_of(dl_se);
-	struct rq *rq = task_rq(p);
+	struct rq *rq = dl_se->rq;
+
+	if (!dl_server(dl_se))
+		rq = task_rq(dl_task_of(dl_se));
 
-	return &rq->dl;
+	return rq;
 }
 
+static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
+{
+	return &rq_of_dl_se(dl_se)->dl;
+}
+
+
 static inline int on_dl_rq(struct sched_dl_entity *dl_se)
 {
 	return !RB_EMPTY_NODE(&dl_se->rb_node);
@@ -394,9 +408,8 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se);
 static void task_non_contending(struct sched_dl_entity *dl_se)
 {
 	struct hrtimer *timer = &dl_se->inactive_timer;
-	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
-	struct rq *rq = rq_of_dl_rq(dl_rq);
-	struct task_struct *p = dl_task_of(dl_se);
+	struct rq *rq = rq_of_dl_se(dl_se);
+	struct dl_rq *dl_rq = &rq->dl;
 	s64 zerolag_time;
 
 	/*
@@ -426,25 +439,33 @@ static void task_non_contending(struct sched_dl_entity *dl_se)
 	 * utilization now, instead of starting a timer
 	 */
 	if ((zerolag_time < 0) || hrtimer_active(&dl_se->inactive_timer)) {
-		if (dl_task(p))
+		if (dl_server(dl_se)) {
 			sub_running_bw(dl_se, dl_rq);
+		} else {
+			struct task_struct *p = dl_task_of(dl_se);
+
+			if (dl_task(p))
+				sub_running_bw(dl_se, dl_rq);
 
-		if (!dl_task(p) || READ_ONCE(p->__state) == TASK_DEAD) {
-			struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+			if (!dl_task(p) || READ_ONCE(p->__state) == TASK_DEAD) {
+				struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
 
-			if (READ_ONCE(p->__state) == TASK_DEAD)
-				sub_rq_bw(dl_se, &rq->dl);
-			raw_spin_lock(&dl_b->lock);
-			__dl_sub(dl_b, dl_se->dl_bw, dl_bw_cpus(task_cpu(p)));
-			raw_spin_unlock(&dl_b->lock);
-			__dl_clear_params(dl_se);
+				if (READ_ONCE(p->__state) == TASK_DEAD)
+					sub_rq_bw(dl_se, &rq->dl);
+				raw_spin_lock(&dl_b->lock);
+				__dl_sub(dl_b, dl_se->dl_bw, dl_bw_cpus(task_cpu(p)));
+				raw_spin_unlock(&dl_b->lock);
+				__dl_clear_params(dl_se);
+			}
 		}
 
 		return;
 	}
 
 	dl_se->dl_non_contending = 1;
-	get_task_struct(p);
+	if (!dl_server(dl_se))
+		get_task_struct(dl_task_of(dl_se));
+
 	hrtimer_start(timer, ns_to_ktime(zerolag_time), HRTIMER_MODE_REL_HARD);
 }
 
@@ -471,8 +492,10 @@ static void task_contending(struct sched_dl_entity *dl_se, int flags)
 		 * will not touch the rq's active utilization,
 		 * so we are still safe.
 		 */
-		if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1)
-			put_task_struct(dl_task_of(dl_se));
+		if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1) {
+			if (!dl_server(dl_se))
+				put_task_struct(dl_task_of(dl_se));
+		}
 	} else {
 		/*
 		 * Since "dl_non_contending" is not set, the
@@ -485,10 +508,8 @@ static void task_contending(struct sched_dl_entity *dl_se, int flags)
 	}
 }
 
-static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
+static inline int is_leftmost(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 {
-	struct sched_dl_entity *dl_se = &p->dl;
-
 	return rb_first_cached(&dl_rq->root) == &dl_se->rb_node;
 }
 
@@ -575,8 +596,6 @@ static void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 
 	if (p->nr_cpus_allowed > 1)
 		dl_rq->dl_nr_migratory++;
-
-	update_dl_migration(dl_rq);
 }
 
 static void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
@@ -585,8 +604,6 @@ static void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 
 	if (p->nr_cpus_allowed > 1)
 		dl_rq->dl_nr_migratory--;
-
-	update_dl_migration(dl_rq);
 }
 
 #define __node_2_pdl(node) \
@@ -764,8 +781,10 @@ static inline void deadline_queue_pull_task(struct rq *rq)
 }
 #endif /* CONFIG_SMP */
 
+static void
+enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags);
 static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
-static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
+static void dequeue_dl_entity(struct sched_dl_entity *dl_se, int flags);
 static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p, int flags);
 
 static inline void replenish_dl_new_period(struct sched_dl_entity *dl_se,
@@ -1013,8 +1032,7 @@ static inline bool dl_is_implicit(struct sched_dl_entity *dl_se)
  */
 static void update_dl_entity(struct sched_dl_entity *dl_se)
 {
-	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
-	struct rq *rq = rq_of_dl_rq(dl_rq);
+	struct rq *rq = rq_of_dl_se(dl_se);
 
 	if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
 	    dl_entity_overflow(dl_se, rq_clock(rq))) {
@@ -1045,11 +1063,11 @@ static inline u64 dl_next_period(struct sched_dl_entity *dl_se)
  * actually started or not (i.e., the replenishment instant is in
  * the future or in the past).
  */
-static int start_dl_timer(struct task_struct *p)
+static int start_dl_timer(struct sched_dl_entity *dl_se)
 {
-	struct sched_dl_entity *dl_se = &p->dl;
 	struct hrtimer *timer = &dl_se->dl_timer;
-	struct rq *rq = task_rq(p);
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
 	ktime_t now, act;
 	s64 delta;
 
@@ -1083,13 +1101,33 @@ static int start_dl_timer(struct task_struct *p)
 	 * and observe our state.
 	 */
 	if (!hrtimer_is_queued(timer)) {
-		get_task_struct(p);
+		if (!dl_server(dl_se))
+			get_task_struct(dl_task_of(dl_se));
 		hrtimer_start(timer, act, HRTIMER_MODE_ABS_HARD);
 	}
 
 	return 1;
 }
 
+static void __push_dl_task(struct rq *rq, struct rq_flags *rf)
+{
+#ifdef CONFIG_SMP
+	/*
+	 * Queueing this task back might have overloaded rq, check if we need
+	 * to kick someone away.
+	 */
+	if (has_pushable_dl_tasks(rq)) {
+		/*
+		 * Nothing relies on rq->lock after this, so its safe to drop
+		 * rq->lock.
+		 */
+		rq_unpin_lock(rq, rf);
+		push_dl_task(rq);
+		rq_repin_lock(rq, rf);
+	}
+#endif
+}
+
 /*
  * This is the bandwidth enforcement timer callback. If here, we know
  * a task is not on its dl_rq, since the fact that the timer was running
@@ -1108,10 +1146,34 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 	struct sched_dl_entity *dl_se = container_of(timer,
 						     struct sched_dl_entity,
 						     dl_timer);
-	struct task_struct *p = dl_task_of(dl_se);
+	struct task_struct *p;
 	struct rq_flags rf;
 	struct rq *rq;
 
+	if (dl_server(dl_se)) {
+		struct rq *rq = rq_of_dl_se(dl_se);
+		struct rq_flags rf;
+
+		rq_lock(rq, &rf);
+		if (dl_se->dl_throttled) {
+			sched_clock_tick();
+			update_rq_clock(rq);
+
+			if (dl_se->server_has_tasks(dl_se)) {
+				enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH);
+				resched_curr(rq);
+				__push_dl_task(rq, &rf);
+			} else {
+				replenish_dl_entity(dl_se);
+			}
+
+		}
+		rq_unlock(rq, &rf);
+
+		return HRTIMER_NORESTART;
+	}
+
+	p = dl_task_of(dl_se);
 	rq = task_rq_lock(p, &rf);
 
 	/*
@@ -1182,21 +1244,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 	else
 		resched_curr(rq);
 
-#ifdef CONFIG_SMP
-	/*
-	 * Queueing this task back might have overloaded rq, check if we need
-	 * to kick someone away.
-	 */
-	if (has_pushable_dl_tasks(rq)) {
-		/*
-		 * Nothing relies on rq->lock after this, so its safe to drop
-		 * rq->lock.
-		 */
-		rq_unpin_lock(rq, &rf);
-		push_dl_task(rq);
-		rq_repin_lock(rq, &rf);
-	}
-#endif
+	__push_dl_task(rq, &rf);
 
 unlock:
 	task_rq_unlock(rq, p, &rf);
@@ -1238,12 +1286,11 @@ static void init_dl_task_timer(struct sched_dl_entity *dl_se)
  */
 static inline void dl_check_constrained_dl(struct sched_dl_entity *dl_se)
 {
-	struct task_struct *p = dl_task_of(dl_se);
-	struct rq *rq = rq_of_dl_rq(dl_rq_of_se(dl_se));
+	struct rq *rq = rq_of_dl_se(dl_se);
 
 	if (dl_time_before(dl_se->deadline, rq_clock(rq)) &&
 	    dl_time_before(rq_clock(rq), dl_next_period(dl_se))) {
-		if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(p)))
+		if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(dl_se)))
 			return;
 		dl_se->dl_throttled = 1;
 		if (dl_se->runtime > 0)
@@ -1294,29 +1341,13 @@ static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
 	return (delta * u_act) >> BW_SHIFT;
 }
 
-/*
- * Update the current task's runtime statistics (provided it is still
- * a -deadline task and has not been removed from the dl_rq).
- */
-static void update_curr_dl(struct rq *rq)
+static inline void
+update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
+                        int flags);
+static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta_exec)
 {
-	struct task_struct *curr = rq->curr;
-	struct sched_dl_entity *dl_se = &curr->dl;
-	s64 delta_exec, scaled_delta_exec;
-	int cpu = cpu_of(rq);
+	s64 scaled_delta_exec;
 
-	if (!dl_task(curr) || !on_dl_rq(dl_se))
-		return;
-
-	/*
-	 * Consumed budget is computed considering the time as
-	 * observed by schedulable tasks (excluding time spent
-	 * in hardirq context, etc.). Deadlines are instead
-	 * computed using hard walltime. This seems to be the more
-	 * natural solution, but the full ramifications of this
-	 * approach need further study.
-	 */
-	delta_exec = update_curr_common(rq);
 	if (unlikely(delta_exec <= 0)) {
 		if (unlikely(dl_se->dl_yielded))
 			goto throttle;
@@ -1334,10 +1365,9 @@ static void update_curr_dl(struct rq *rq)
 	 * according to current frequency and CPU maximum capacity.
 	 */
 	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
-		scaled_delta_exec = grub_reclaim(delta_exec,
-						 rq,
-						 &curr->dl);
+		scaled_delta_exec = grub_reclaim(delta_exec, rq, dl_se);
 	} else {
+		int cpu = cpu_of(rq);
 		unsigned long scale_freq = arch_scale_freq_capacity(cpu);
 		unsigned long scale_cpu = arch_scale_cpu_capacity(cpu);
 
@@ -1356,11 +1386,20 @@ static void update_curr_dl(struct rq *rq)
 		    (dl_se->flags & SCHED_FLAG_DL_OVERRUN))
 			dl_se->dl_overrun = 1;
 
-		__dequeue_task_dl(rq, curr, 0);
-		if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(curr)))
-			enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
+		dequeue_dl_entity(dl_se, 0);
+		if (!dl_server(dl_se)) {
+			update_stats_dequeue_dl(&rq->dl, dl_se, 0);
+			dequeue_pushable_dl_task(rq, dl_task_of(dl_se));
+		}
+
+		if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(dl_se))) {
+			if (dl_server(dl_se))
+				enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH);
+			else
+				enqueue_task_dl(rq, dl_task_of(dl_se), ENQUEUE_REPLENISH);
+		}
 
-		if (!is_leftmost(curr, &rq->dl))
+		if (!is_leftmost(dl_se, &rq->dl))
 			resched_curr(rq);
 	}
 
@@ -1390,20 +1429,82 @@ static void update_curr_dl(struct rq *rq)
 	}
 }
 
+void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
+{
+	update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
+}
+
+void dl_server_start(struct sched_dl_entity *dl_se)
+{
+	if (!dl_server(dl_se)) {
+		dl_se->dl_server = 1;
+		setup_new_dl_entity(dl_se);
+	}
+	enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
+}
+
+void dl_server_stop(struct sched_dl_entity *dl_se)
+{
+	dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
+}
+
+void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
+		    dl_server_has_tasks_f has_tasks,
+		    dl_server_pick_f pick)
+{
+	dl_se->rq = rq;
+	dl_se->server_has_tasks = has_tasks;
+	dl_se->server_pick = pick;
+}
+
+/*
+ * Update the current task's runtime statistics (provided it is still
+ * a -deadline task and has not been removed from the dl_rq).
+ */
+static void update_curr_dl(struct rq *rq)
+{
+	struct task_struct *curr = rq->curr;
+	struct sched_dl_entity *dl_se = &curr->dl;
+	s64 delta_exec;
+
+	if (!dl_task(curr) || !on_dl_rq(dl_se))
+		return;
+
+	/*
+	 * Consumed budget is computed considering the time as
+	 * observed by schedulable tasks (excluding time spent
+	 * in hardirq context, etc.). Deadlines are instead
+	 * computed using hard walltime. This seems to be the more
+	 * natural solution, but the full ramifications of this
+	 * approach need further study.
+	 */
+	delta_exec = update_curr_common(rq);
+	update_curr_dl_se(rq, dl_se, delta_exec);
+}
+
 static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 {
 	struct sched_dl_entity *dl_se = container_of(timer,
 						     struct sched_dl_entity,
 						     inactive_timer);
-	struct task_struct *p = dl_task_of(dl_se);
+	struct task_struct *p = NULL;
 	struct rq_flags rf;
 	struct rq *rq;
 
-	rq = task_rq_lock(p, &rf);
+	if (!dl_server(dl_se)) {
+		p = dl_task_of(dl_se);
+		rq = task_rq_lock(p, &rf);
+	} else {
+		rq = dl_se->rq;
+		rq_lock(rq, &rf);
+	}
 
 	sched_clock_tick();
 	update_rq_clock(rq);
 
+	if (dl_server(dl_se))
+		goto no_task;
+
 	if (!dl_task(p) || READ_ONCE(p->__state) == TASK_DEAD) {
 		struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
 
@@ -1420,14 +1521,21 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 
 		goto unlock;
 	}
+
+no_task:
 	if (dl_se->dl_non_contending == 0)
 		goto unlock;
 
 	sub_running_bw(dl_se, &rq->dl);
 	dl_se->dl_non_contending = 0;
 unlock:
-	task_rq_unlock(rq, p, &rf);
-	put_task_struct(p);
+
+	if (!dl_server(dl_se)) {
+		task_rq_unlock(rq, p, &rf);
+		put_task_struct(p);
+	} else {
+		rq_unlock(rq, &rf);
+	}
 
 	return HRTIMER_NORESTART;
 }
@@ -1485,34 +1593,35 @@ static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
 static inline void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
 static inline void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
 
+static inline void update_dl_migration(struct dl_rq *dl_rq) {}
+
 #endif /* CONFIG_SMP */
 
 static inline
 void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 {
-	int prio = dl_task_of(dl_se)->prio;
 	u64 deadline = dl_se->deadline;
 
-	WARN_ON(!dl_prio(prio));
 	dl_rq->dl_nr_running++;
 	add_nr_running(rq_of_dl_rq(dl_rq), 1);
 
 	inc_dl_deadline(dl_rq, deadline);
-	inc_dl_migration(dl_se, dl_rq);
+	if (!dl_server(dl_se))
+		inc_dl_migration(dl_se, dl_rq);
+	update_dl_migration(dl_rq);
 }
 
 static inline
 void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 {
-	int prio = dl_task_of(dl_se)->prio;
-
-	WARN_ON(!dl_prio(prio));
 	WARN_ON(!dl_rq->dl_nr_running);
 	dl_rq->dl_nr_running--;
 	sub_nr_running(rq_of_dl_rq(dl_rq), 1);
 
 	dec_dl_deadline(dl_rq, dl_se->deadline);
-	dec_dl_migration(dl_se, dl_rq);
+	if (!dl_server(dl_se))
+		dec_dl_migration(dl_se, dl_rq);
+	update_dl_migration(dl_rq);
 }
 
 static inline bool __dl_less(struct rb_node *a, const struct rb_node *b)
@@ -1674,8 +1783,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 	} else if (flags & ENQUEUE_REPLENISH) {
 		replenish_dl_entity(dl_se);
 	} else if ((flags & ENQUEUE_RESTORE) &&
-		  dl_time_before(dl_se->deadline,
-				 rq_clock(rq_of_dl_rq(dl_rq_of_se(dl_se))))) {
+		   dl_time_before(dl_se->deadline, rq_clock(rq_of_dl_se(dl_se)))) {
 		setup_new_dl_entity(dl_se);
 	}
 
@@ -1760,14 +1868,6 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 		enqueue_pushable_dl_task(rq, p);
 }
 
-static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
-{
-	update_stats_dequeue_dl(&rq->dl, &p->dl, flags);
-	dequeue_dl_entity(&p->dl, flags);
-
-	if (!p->dl.dl_throttled)
-		dequeue_pushable_dl_task(rq, p);
-}
 
 static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
@@ -1776,7 +1876,9 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	if (p->on_rq == TASK_ON_RQ_MIGRATING)
 		flags |= DEQUEUE_MIGRATING;
 
-	__dequeue_task_dl(rq, p, flags);
+	dequeue_dl_entity(&p->dl, flags);
+	if (!p->dl.dl_throttled)
+		dequeue_pushable_dl_task(rq, p);
 }
 
 /*
@@ -1966,12 +2068,12 @@ static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
 }
 
 #ifdef CONFIG_SCHED_HRTICK
-static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
+static void start_hrtick_dl(struct rq *rq, struct sched_dl_entity *dl_se)
 {
-	hrtick_start(rq, p->dl.runtime);
+	hrtick_start(rq, dl_se->runtime);
 }
 #else /* !CONFIG_SCHED_HRTICK */
-static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
+static void start_hrtick_dl(struct rq *rq, struct sched_dl_entity *dl_se)
 {
 }
 #endif
@@ -1991,9 +2093,6 @@ static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
 	if (!first)
 		return;
 
-	if (hrtick_enabled_dl(rq))
-		start_hrtick_dl(rq, p);
-
 	if (rq->curr->sched_class != &dl_sched_class)
 		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
 
@@ -2016,12 +2115,26 @@ static struct task_struct *pick_task_dl(struct rq *rq)
 	struct dl_rq *dl_rq = &rq->dl;
 	struct task_struct *p;
 
+again:
 	if (!sched_dl_runnable(rq))
 		return NULL;
 
 	dl_se = pick_next_dl_entity(dl_rq);
 	WARN_ON_ONCE(!dl_se);
-	p = dl_task_of(dl_se);
+
+
+	if (dl_server(dl_se)) {
+		p = dl_se->server_pick(dl_se);
+		if (!p) {
+			WARN_ON_ONCE(1);
+			dl_se->dl_yielded = 1;
+			update_curr_dl_se(rq, dl_se, 0);
+			goto again;
+		}
+		p->server = dl_se;
+	} else {
+		p = dl_task_of(dl_se);
+	}
 
 	return p;
 }
@@ -2031,9 +2144,15 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
 	struct task_struct *p;
 
 	p = pick_task_dl(rq);
-	if (p)
+	if (!p)
+		return p;
+
+	if (!p->server)
 		set_next_task_dl(rq, p, true);
 
+	if (hrtick_enabled(rq))
+		start_hrtick_dl(rq, &p->dl);
+
 	return p;
 }
 
@@ -2071,8 +2190,8 @@ static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
 	 * be set and schedule() will start a new hrtick for the next task.
 	 */
 	if (hrtick_enabled_dl(rq) && queued && p->dl.runtime > 0 &&
-	    is_leftmost(p, &rq->dl))
-		start_hrtick_dl(rq, p);
+	    is_leftmost(&p->dl, &rq->dl))
+		start_hrtick_dl(rq, &p->dl);
 }
 
 static void task_fork_dl(struct task_struct *p)
@@ -3013,6 +3132,7 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se)
 	dl_se->dl_yielded		= 0;
 	dl_se->dl_non_contending	= 0;
 	dl_se->dl_overrun		= 0;
+	dl_se->dl_server		= 0;
 
 #ifdef CONFIG_RT_MUTEXES
 	dl_se->pi_se			= dl_se;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 52c8219623b1..5ded18e28609 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1131,6 +1131,8 @@ s64 update_curr_common(struct rq *rq)
 
 	account_group_exec_runtime(curr, delta_exec);
 	cgroup_account_cputime(curr, delta_exec);
+	if (curr->server)
+		dl_server_update(curr->server, delta_exec);
 
 	return delta_exec;
 }
@@ -1160,6 +1162,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
 		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
 		cgroup_account_cputime(curtask, delta_exec);
 		account_group_exec_runtime(curtask, delta_exec);
+		if (curtask->server)
+			dl_server_update(curtask->server, delta_exec);
 	}
 
 	account_cfs_rq_runtime(cfs_rq, delta_exec);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f48ed3e9028..f30be4ae4c22 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -324,6 +324,33 @@ extern bool dl_param_changed(struct task_struct *p, const struct sched_attr *att
 extern int  dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
 extern int  dl_bw_check_overflow(int cpu);
 
+/*
+ * SCHED_DEADLINE supports servers (nested scheduling) with the following
+ * interface:
+ *
+ *   dl_se::rq -- runqueue we belong to.
+ *
+ *   dl_se::server_has_tasks() -- used on bandwidth enforcement; we 'stop' the
+ *                                server when it runs out of tasks to run.
+ *
+ *   dl_se::server_pick() -- nested pick_next_task(); we yield the period if this
+ *                           returns NULL.
+ *
+ *   dl_server_update() -- called from update_curr_common(), propagates runtime
+ *                         to the server.
+ *
+ *   dl_server_start()
+ *   dl_server_stop()  -- start/stop the server when it has (no) tasks.
+ *
+ *   dl_server_init() -- initializes the server.
+ */
+extern void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec);
+extern void dl_server_start(struct sched_dl_entity *dl_se);
+extern void dl_server_stop(struct sched_dl_entity *dl_se);
+extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
+		    dl_server_has_tasks_f has_tasks,
+		    dl_server_pick_f pick);
+
 #ifdef CONFIG_CGROUP_SCHED
 
 struct cfs_rq;
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v4 5/7] sched/fair: Add trivial fair server
  2023-08-31 20:28 [PATCH v4 0/7] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
                   ` (3 preceding siblings ...)
  2023-08-31 20:28 ` [PATCH v4 4/7] sched/deadline: Introduce deadline servers Daniel Bristot de Oliveira
@ 2023-08-31 20:28 ` Daniel Bristot de Oliveira
  2023-08-31 20:28 ` [PATCH v4 6/7] sched/deadline: Deferrable dl server Daniel Bristot de Oliveira
  2023-08-31 20:28 ` [PATCH v4 7/7] sched/fair: Fair server interface Daniel Bristot de Oliveira
  6 siblings, 0 replies; 25+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-08-31 20:28 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, bristot, Phil Auld

From: Peter Zijlstra <peterz@infradead.org>

Use deadline servers to service fair tasks.

This patch adds a fair_server deadline entity which acts as a container
for fair entities and can be used to fix starvation when higher priority
(wrt fair) tasks are monopolizing CPU(s).
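
With the defaults set below, the fair server reserves 50 ms every 1000 ms
(i.e. 5% of each CPU) for fair tasks; later patches in the series make the
activation deferrable and the parameters tunable via debugfs.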

[ dl_server does not account for rt ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
---
 kernel/sched/core.c     |  1 +
 kernel/sched/deadline.c |  7 +++++++
 kernel/sched/fair.c     | 29 +++++++++++++++++++++++++++++
 kernel/sched/sched.h    |  4 ++++
 4 files changed, 41 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c780707e1761..4ba4f1e09a80 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10055,6 +10055,7 @@ void __init sched_init(void)
 #endif /* CONFIG_SMP */
 		hrtick_rq_init(rq);
 		atomic_set(&rq->nr_iowait, 0);
+		fair_server_init(rq);
 
 #ifdef CONFIG_SCHED_CORE
 		rq->core = rq;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 4dac16ed1317..7844cfb73029 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1403,6 +1403,13 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
 			resched_curr(rq);
 	}
 
+	/*
+	 * The fair server (sole dl_server) does not account for real-time
+	 * workload because it is running fair work.
+	 */
+	if (dl_server(dl_se))
+		return;
+
 	/*
 	 * Because -- for now -- we share the rt bandwidth, we need to
 	 * account our runtime there too, otherwise actual rt tasks
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5ded18e28609..580e6764a68b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6499,6 +6499,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	 */
 	util_est_enqueue(&rq->cfs, p);
 
+	if (!rq->cfs.h_nr_running)
+		dl_server_start(&rq->fair_server);
+
 	/*
 	 * If in_iowait is set, the code below may not trigger any cpufreq
 	 * utilization updates, so do it here explicitly with the IOWAIT flag
@@ -6643,6 +6646,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		rq->next_balance = jiffies;
 
 dequeue_throttle:
+	if (!rq->cfs.h_nr_running)
+		dl_server_stop(&rq->fair_server);
+
 	util_est_update(&rq->cfs, p, task_sleep);
 	hrtick_update(rq);
 }
@@ -8291,6 +8297,29 @@ static struct task_struct *__pick_next_task_fair(struct rq *rq)
 	return pick_next_task_fair(rq, NULL, NULL);
 }
 
+static bool fair_server_has_tasks(struct sched_dl_entity *dl_se)
+{
+	return !!dl_se->rq->cfs.nr_running;
+}
+
+static struct task_struct *fair_server_pick(struct sched_dl_entity *dl_se)
+{
+	return pick_next_task_fair(dl_se->rq, NULL, NULL);
+}
+
+void fair_server_init(struct rq *rq)
+{
+	struct sched_dl_entity *dl_se = &rq->fair_server;
+
+	init_dl_entity(dl_se);
+
+	dl_se->dl_runtime = 50 * NSEC_PER_MSEC;
+	dl_se->dl_deadline = 1000 * NSEC_PER_MSEC;
+	dl_se->dl_period = 1000 * NSEC_PER_MSEC;
+
+	dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
+}
+
 /*
  * Account for a descheduled task:
  */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f30be4ae4c22..ac94c386741c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -351,6 +351,8 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 		    dl_server_has_tasks_f has_tasks,
 		    dl_server_pick_f pick);
 
+extern void fair_server_init(struct rq *);
+
 #ifdef CONFIG_CGROUP_SCHED
 
 struct cfs_rq;
@@ -1024,6 +1026,8 @@ struct rq {
 	struct rt_rq		rt;
 	struct dl_rq		dl;
 
+	struct sched_dl_entity	fair_server;
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this CPU: */
 	struct list_head	leaf_cfs_rq_list;
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v4 6/7] sched/deadline: Deferrable dl server
  2023-08-31 20:28 [PATCH v4 0/7] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
                   ` (4 preceding siblings ...)
  2023-08-31 20:28 ` [PATCH v4 5/7] sched/fair: Add trivial fair server Daniel Bristot de Oliveira
@ 2023-08-31 20:28 ` Daniel Bristot de Oliveira
  2023-09-05 13:42   ` Peter Zijlstra
  2023-08-31 20:28 ` [PATCH v4 7/7] sched/fair: Fair server interface Daniel Bristot de Oliveira
  6 siblings, 1 reply; 25+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-08-31 20:28 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, bristot, Phil Auld

Among the motivations for the DL servers is the real-time throttling
mechanism. This mechanism works by throttling the rt_rq after
running for a long period without leaving space for fair tasks.

The base dl server avoids this problem by boosting fair tasks instead
of throttling the rt_rq. The point is that it boosts without waiting
for potential starvation, causing some non-intuitive cases.

For example, an IRQ dispatches two tasks on an idle system, a fair
and an RT. The DL server will be activated, running the fair task
before the RT one. This problem can be avoided by deferring the
dl server activation.

When the defer option is passed, the dl_server dispatches a
SCHED_DEADLINE reservation in the throttled state, with the
replenishment timer set to fire (period - runtime) ns after the
start time. The fair rq is thus boosted at its 0-laxity time with
respect to the rt_rq.

The fair server is then scheduled under EDF, with a new period
starting at the replenishment time, thus not breaking dl tasks.
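
As a rough illustration (assuming the 50 ms runtime / 1000 ms period defaults
from the previous patch and no other DL tasks):

  t           : fair and RT tasks are both queued, dl_server_start(defer=1)
  t .. t+950ms: the RT task runs; the server sits throttled with runtime = 0
                and deadline = t + period - runtime = t + 950 ms
  t + 950 ms  : the replenishment timer fires, the server gets its 50 ms and,
                being a DL entity, runs on top of the RT task
  t + 1000 ms : fair tasks have received their runtime within [t, t + period]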

Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
---
 include/linux/sched.h   |  7 +++++
 kernel/sched/deadline.c | 61 ++++++++++++++++++++++++++++++++++++++---
 kernel/sched/fair.c     | 10 ++++---
 kernel/sched/rt.c       |  6 ++++
 kernel/sched/sched.h    | 12 +++++++-
 5 files changed, 87 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 40fbf3f034e0..38d0b3de03b2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -609,6 +609,12 @@ struct sched_rt_entity {
 typedef bool (*dl_server_has_tasks_f)(struct sched_dl_entity *);
 typedef struct task_struct *(*dl_server_pick_f)(struct sched_dl_entity *);
 
+enum dl_server_state {
+	DL_SERVER_STOPPED = 0,
+	DL_SERVER_DEFER,
+	DL_SERVER_RUNNING
+};
+
 struct sched_dl_entity {
 	struct rb_node			rb_node;
 
@@ -685,6 +691,7 @@ struct sched_dl_entity {
 	struct rq			*rq;
 	dl_server_has_tasks_f		server_has_tasks;
 	dl_server_pick_f		server_pick;
+	enum dl_server_state			server_state;
 
 #ifdef CONFIG_RT_MUTEXES
 	/*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 7844cfb73029..7f1c52bfe78f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -422,7 +422,7 @@ static void task_non_contending(struct sched_dl_entity *dl_se)
 	if (dl_entity_is_special(dl_se))
 		return;
 
-	WARN_ON(dl_se->dl_non_contending);
+	WARN_ON_ONCE(dl_se->dl_non_contending);
 
 	zerolag_time = dl_se->deadline -
 		 div64_long((dl_se->runtime * dl_se->dl_period),
@@ -1155,6 +1155,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 		struct rq_flags rf;
 
 		rq_lock(rq, &rf);
+
 		if (dl_se->dl_throttled) {
 			sched_clock_tick();
 			update_rq_clock(rq);
@@ -1165,9 +1166,12 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 				__push_dl_task(rq, &rf);
 			} else {
 				replenish_dl_entity(dl_se);
+				task_non_contending(dl_se);
 			}
 
 		}
+
+		dl_se->server_state = DL_SERVER_RUNNING;
 		rq_unlock(rq, &rf);
 
 		return HRTIMER_NORESTART;
@@ -1441,18 +1445,65 @@ void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
 	update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
 }
 
-void dl_server_start(struct sched_dl_entity *dl_se)
+void dl_server_start(struct sched_dl_entity *dl_se, int defer)
 {
+	if (dl_se->server_state != DL_SERVER_STOPPED) {
+		WARN_ON_ONCE(!(on_dl_rq(dl_se) || dl_se->dl_throttled));
+		return;
+	}
+
+	if (defer) {
+		/*
+		 * Postpone the replenishment to the (next period - the execution time)
+		 *
+		 * With this in place, we have two cases:
+		 *
+		 * On the absence of DL tasks:
+		 *	The server will start at the replenishment time, getting
+		 *	its runtime before now + period. This is the expected
+		 *	throttling behavior.
+		 *
+		 * In the presense of DL tasks:
+		 *	The server will be replenished, and then it will be
+		 *	schedule according to EDF, not breaking SCHED_DEADLINE.
+		 *
+		 *	In the first cycle the server will be postponed at most
+		 *	at period + period - runtime at most. But then the
+		 *	server will receive its runtime/period.
+		 *
+		 *	The server will, however, run on top of any RT task, which
+		 *	is the expected throttling behavior.
+		 */
+		dl_se->deadline = rq_clock(dl_se->rq) + dl_se->dl_period - dl_se->dl_runtime;
+		/* Zero the runtime */
+		dl_se->runtime = 0;
+		/* throttle the server */
+		dl_se->dl_throttled = 1;
+
+		dl_se->server_state = DL_SERVER_DEFER;
+		start_dl_timer(dl_se);
+		return;
+	}
+
 	if (!dl_server(dl_se)) {
 		dl_se->dl_server = 1;
 		setup_new_dl_entity(dl_se);
 	}
+
+	dl_se->server_state = DL_SERVER_RUNNING;
 	enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
 }
 
 void dl_server_stop(struct sched_dl_entity *dl_se)
 {
+	if (dl_se->server_state == DL_SERVER_STOPPED)
+		return;
+
+	hrtimer_try_to_cancel(&dl_se->dl_timer);
 	dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
+
+	dl_se->dl_throttled = 0;
+	dl_se->server_state = DL_SERVER_STOPPED;
 }
 
 void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
@@ -1462,6 +1513,8 @@ void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 	dl_se->rq = rq;
 	dl_se->server_has_tasks = has_tasks;
 	dl_se->server_pick = pick;
+	dl_se->server_state = DL_SERVER_STOPPED;
+	dl_se->dl_server = 1;
 }
 
 /*
@@ -1817,8 +1870,9 @@ static void dequeue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 	 * (the task moves from "active contending" to "active non contending"
 	 * or "inactive")
 	 */
-	if (flags & DEQUEUE_SLEEP)
+	if (flags & DEQUEUE_SLEEP && !dl_server(dl_se))
 		task_non_contending(dl_se);
+
 }
 
 static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
@@ -1875,7 +1929,6 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 		enqueue_pushable_dl_task(rq, p);
 }
 
-
 static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	update_curr_dl(rq);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 580e6764a68b..b9d0f08dc8ca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6499,9 +6499,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	 */
 	util_est_enqueue(&rq->cfs, p);
 
-	if (!rq->cfs.h_nr_running)
-		dl_server_start(&rq->fair_server);
-
 	/*
 	 * If in_iowait is set, the code below may not trigger any cpufreq
 	 * utilization updates, so do it here explicitly with the IOWAIT flag
@@ -6568,6 +6565,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_overutilized_status(rq);
 
 enqueue_throttle:
+	if (sched_fair_server_needed(rq))
+		dl_server_start(&rq->fair_server, rq->fair_server_defer);
+
 	assert_list_leaf_cfs_rq(rq);
 
 	hrtick_update(rq);
@@ -6646,7 +6646,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		rq->next_balance = jiffies;
 
 dequeue_throttle:
-	if (!rq->cfs.h_nr_running)
+	if (!sched_fair_server_needed(rq))
 		dl_server_stop(&rq->fair_server);
 
 	util_est_update(&rq->cfs, p, task_sleep);
@@ -8317,6 +8317,8 @@ void fair_server_init(struct rq *rq)
 	dl_se->dl_deadline = 1000 * NSEC_PER_MSEC;
 	dl_se->dl_period = 1000 * NSEC_PER_MSEC;
 
+	rq->fair_server_defer = 1;
+
 	dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
 }
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index e23cc67c9467..7595110a5a3e 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1537,6 +1537,9 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 
 	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_task(rq, p);
+
+	if (sched_fair_server_needed(rq))
+		dl_server_start(&rq->fair_server, rq->fair_server_defer);
 }
 
 static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
@@ -1547,6 +1550,9 @@ static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 	dequeue_rt_entity(rt_se, flags);
 
 	dequeue_pushable_task(rq, p);
+
+	if (!sched_fair_server_needed(rq))
+		dl_server_stop(&rq->fair_server);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ac94c386741c..510c4db379be 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -345,7 +345,7 @@ extern int  dl_bw_check_overflow(int cpu);
  *   dl_server_init() -- initializes the server.
  */
 extern void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec);
-extern void dl_server_start(struct sched_dl_entity *dl_se);
+extern void dl_server_start(struct sched_dl_entity *dl_se, int defer);
 extern void dl_server_stop(struct sched_dl_entity *dl_se);
 extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 		    dl_server_has_tasks_f has_tasks,
@@ -1027,6 +1027,7 @@ struct rq {
 	struct dl_rq		dl;
 
 	struct sched_dl_entity	fair_server;
+	int			fair_server_defer;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this CPU: */
@@ -2394,6 +2395,15 @@ static inline bool sched_fair_runnable(struct rq *rq)
 	return rq->cfs.nr_running > 0;
 }
 
+static inline bool sched_fair_server_needed(struct rq *rq)
+{
+	/*
+	 * The fair server will activate anytime a fair task can starve
+	 * because of real-time tasks.
+	 */
+	return (sched_rt_runnable(rq) && sched_fair_runnable(rq));
+}
+
 extern struct task_struct *pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
 extern struct task_struct *pick_next_task_idle(struct rq *rq);
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v4 7/7] sched/fair: Fair server interface
  2023-08-31 20:28 [PATCH v4 0/7] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
                   ` (5 preceding siblings ...)
  2023-08-31 20:28 ` [PATCH v4 6/7] sched/deadline: Deferrable dl server Daniel Bristot de Oliveira
@ 2023-08-31 20:28 ` Daniel Bristot de Oliveira
  2023-09-01  2:01   ` kernel test robot
  2023-09-05 13:55   ` Peter Zijlstra
  6 siblings, 2 replies; 25+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-08-31 20:28 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, bristot, Phil Auld

Add an interface for fair server setup on debugfs.

Each rq has three files under /sys/kernel/debug/sched/rq/cpu{ID} (example below):

 - fair_server_runtime: set runtime in ns
 - fair_server_period: set period in ns
 - fair_server_defer: on/off for the defer mechanism
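
For example (a sketch, assuming debugfs is mounted at /sys/kernel/debug and
the 50 ms / 1 s defaults from the previous patches):

  # cd /sys/kernel/debug/sched/rq/cpu0
  # cat fair_server_runtime
  50000000
  # echo 20000000 > fair_server_runtime   # reserve 20 ms of fair time per period
  # echo 0 > fair_server_defer            # boost immediately instead of deferring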

Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
---
 kernel/sched/debug.c | 177 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 177 insertions(+)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 4c3d0d9f3db6..dad7d5d073ef 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -333,8 +333,183 @@ static const struct file_operations sched_debug_fops = {
 	.release	= seq_release,
 };
 
+static ssize_t
+sched_fair_server_runtime_write(struct file *filp, const char __user *ubuf,
+				size_t cnt, loff_t *ppos)
+{
+	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
+	struct rq *rq = cpu_rq(cpu);
+	unsigned long flags;
+	u64 runtime;
+	int err;
+
+	err = kstrtoull_from_user(ubuf, cnt, 10, &runtime);
+	if (err)
+		return err;
+
+	raw_spin_rq_lock_irqsave(rq, flags);
+	if (runtime > rq->fair_server.dl_period)
+		err = -EINVAL;
+	else
+		rq->fair_server.dl_runtime = runtime;
+	raw_spin_rq_unlock_irqrestore(rq, flags);
+
+	if (err)
+		return err;
+
+	*ppos += cnt;
+	return cnt;
+}
+
+static int sched_fair_server_runtime_show(struct seq_file *m, void *v)
+{
+	unsigned long cpu = (unsigned long) m->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	seq_printf(m, "%llu\n", rq->fair_server.dl_runtime);
+	return 0;
+}
+
+static int sched_fair_server_runtime_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_fair_server_runtime_show, inode->i_private);
+}
+
+static const struct file_operations fair_server_runtime_fops = {
+	.open		= sched_fair_server_runtime_open,
+	.write		= sched_fair_server_runtime_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static unsigned int fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */
+static unsigned int fair_server_period_min = (100) * NSEC_PER_USEC;     /* 100 us */
+
+static ssize_t
+sched_fair_server_period_write(struct file *filp, const char __user *ubuf,
+			       size_t cnt, loff_t *ppos)
+{
+	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
+	struct rq *rq = cpu_rq(cpu);
+	unsigned long flags;
+	u64 period;
+	int err;
+
+	err = kstrtoull_from_user(ubuf, cnt, 10, &period);
+	if (err)
+		return err;
+
+	if (period < fair_server_period_min || period > fair_server_period_max)
+		return -EINVAL;
+
+	raw_spin_rq_lock_irqsave(rq, flags);
+	if (period < rq->fair_server.dl_runtime)
+		err = -EINVAL;
+	else
+		rq->fair_server.dl_period = period;
+	raw_spin_rq_unlock_irqrestore(rq, flags);
+
+	if (err)
+		return err;
+
+	*ppos += cnt;
+	return cnt;
+}
+
+static int sched_fair_server_period_show(struct seq_file *m, void *v)
+{
+	unsigned long cpu = (unsigned long) m->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	seq_printf(m, "%llu\n", rq->fair_server.dl_period);
+	return 0;
+}
+
+static int sched_fair_server_period_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_fair_server_period_show, inode->i_private);
+}
+
+static const struct file_operations fair_server_period_fops = {
+	.open		= sched_fair_server_period_open,
+	.write		= sched_fair_server_period_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static ssize_t
+sched_fair_server_defer_write(struct file *filp, const char __user *ubuf,
+			      size_t cnt, loff_t *ppos)
+{
+	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
+	struct rq *rq = cpu_rq(cpu);
+	unsigned long flags;
+	u64 defer;
+	int err;
+
+	err = kstrtoull_from_user(ubuf, cnt, 10, &defer);
+	if (err)
+		return err;
+
+	if (defer < 0 || defer > 1)
+		return -EINVAL;
+
+	raw_spin_rq_lock_irqsave(rq, flags);
+	rq->fair_server_defer = defer;
+	raw_spin_rq_unlock_irqrestore(rq, flags);
+
+	*ppos += cnt;
+	return cnt;
+}
+
+static int sched_fair_server_defer_show(struct seq_file *m, void *v)
+{
+	unsigned long cpu = (unsigned long) m->private;
+	struct rq *rq = cpu_rq(cpu);
+
+	seq_printf(m, "%d\n", rq->fair_server_defer);
+	return 0;
+}
+
+static int sched_fair_server_defer_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_fair_server_defer_show, inode->i_private);
+}
+
+static const struct file_operations fair_server_defer_fops = {
+	.open		= sched_fair_server_defer_open,
+	.write		= sched_fair_server_defer_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
 static struct dentry *debugfs_sched;
 
+void debugfs_fair_server_init(void)
+{
+	long cpu;
+	struct dentry *rq_dentry;
+
+	rq_dentry = debugfs_create_dir("rq", debugfs_sched);
+	if (!rq_dentry)
+		return;
+
+	for_each_possible_cpu(cpu) {
+		struct dentry *d_cpu;
+		char buf[32];
+
+		snprintf(buf, sizeof(buf), "cpu%ld", cpu);
+		d_cpu = debugfs_create_dir(buf, rq_dentry);
+
+		debugfs_create_file("fair_server_runtime", 0644, d_cpu, (void *) cpu, &fair_server_runtime_fops);
+		debugfs_create_file("fair_server_period", 0644, d_cpu, (void *) cpu, &fair_server_period_fops);
+		debugfs_create_file("fair_server_defer", 0644, d_cpu, (void *) cpu, &fair_server_defer_fops);
+	}
+}
+
 static __init int sched_init_debug(void)
 {
 	struct dentry __maybe_unused *numa;
@@ -374,6 +549,8 @@ static __init int sched_init_debug(void)
 
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
 
+	debugfs_fair_server_init();
+
 	return 0;
 }
 late_initcall(sched_init_debug);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 7/7] sched/fair: Fair server interface
  2023-08-31 20:28 ` [PATCH v4 7/7] sched/fair: Fair server interface Daniel Bristot de Oliveira
@ 2023-09-01  2:01   ` kernel test robot
  2023-09-05 13:55   ` Peter Zijlstra
  1 sibling, 0 replies; 25+ messages in thread
From: kernel test robot @ 2023-09-01  2:01 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot
  Cc: oe-kbuild-all, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	linux-kernel, Luca Abeni, Tommaso Cucinotta, Thomas Gleixner,
	Joel Fernandes, Vineeth Pillai, Shuah Khan, bristot, Phil Auld

Hi Daniel,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/sched/core]
[also build test WARNING on tip/master linus/master next-20230831]
[cannot apply to tip/auto-latest v6.5]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Daniel-Bristot-de-Oliveira/sched-Unify-runtime-accounting-across-classes/20230901-043307
base:   tip/sched/core
patch link:    https://lore.kernel.org/r/db775d65b18ddac4a75faad6761c6c2abf3efb78.1693510979.git.bristot%40kernel.org
patch subject: [PATCH v4 7/7] sched/fair: Fair server interface
config: m68k-allyesconfig (https://download.01.org/0day-ci/archive/20230901/202309010917.ryl6BbIf-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230901/202309010917.ryl6BbIf-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202309010917.ryl6BbIf-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from kernel/sched/build_utility.c:73:
>> kernel/sched/debug.c:386:56: warning: integer overflow in expression of type 'long int' results in '-100663296' [-Woverflow]
     386 | static unsigned int fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */
         |                                                        ^
>> kernel/sched/debug.c:491:6: warning: no previous prototype for 'debugfs_fair_server_init' [-Wmissing-prototypes]
     491 | void debugfs_fair_server_init(void)
         |      ^~~~~~~~~~~~~~~~~~~~~~~~


vim +386 kernel/sched/debug.c

   385	
 > 386	static unsigned int fair_server_period_max = (1 << 22) * NSEC_PER_USEC; /* ~4 seconds */
   387	static unsigned int fair_server_period_min = (100) * NSEC_PER_USEC;     /* 100 us */
   388	
   389	static ssize_t
   390	sched_fair_server_period_write(struct file *filp, const char __user *ubuf,
   391				       size_t cnt, loff_t *ppos)
   392	{
   393		long cpu = (long) ((struct seq_file *) filp->private_data)->private;
   394		struct rq *rq = cpu_rq(cpu);
   395		unsigned long flags;
   396		u64 period;
   397		int err;
   398	
   399		err = kstrtoull_from_user(ubuf, cnt, 10, &period);
   400		if (err)
   401			return err;
   402	
   403		if (period < fair_server_period_min || period > fair_server_period_max)
   404			return -EINVAL;
   405	
   406		raw_spin_rq_lock_irqsave(rq, flags);
   407		if (period < rq->fair_server.dl_runtime)
   408			err = -EINVAL;
   409		else
   410			rq->fair_server.dl_period = period;
   411		raw_spin_rq_unlock_irqrestore(rq, flags);
   412	
   413		if (err)
   414			return err;
   415	
   416		*ppos += cnt;
   417		return cnt;
   418	}
   419	
   420	static int sched_fair_server_period_show(struct seq_file *m, void *v)
   421	{
   422		unsigned long cpu = (unsigned long) m->private;
   423		struct rq *rq = cpu_rq(cpu);
   424	
   425		seq_printf(m, "%llu\n", rq->fair_server.dl_period);
   426		return 0;
   427	}
   428	
   429	static int sched_fair_server_period_open(struct inode *inode, struct file *filp)
   430	{
   431		return single_open(filp, sched_fair_server_period_show, inode->i_private);
   432	}
   433	
   434	static const struct file_operations fair_server_period_fops = {
   435		.open		= sched_fair_server_period_open,
   436		.write		= sched_fair_server_period_write,
   437		.read		= seq_read,
   438		.llseek		= seq_lseek,
   439		.release	= single_release,
   440	};
   441	
   442	static ssize_t
   443	sched_fair_server_defer_write(struct file *filp, const char __user *ubuf,
   444				      size_t cnt, loff_t *ppos)
   445	{
   446		long cpu = (long) ((struct seq_file *) filp->private_data)->private;
   447		struct rq *rq = cpu_rq(cpu);
   448		unsigned long flags;
   449		u64 defer;
   450		int err;
   451	
   452		err = kstrtoull_from_user(ubuf, cnt, 10, &defer);
   453		if (err)
   454			return err;
   455	
   456		if (defer < 0 || defer > 1)
   457			return -EINVAL;
   458	
   459		raw_spin_rq_lock_irqsave(rq, flags);
   460		rq->fair_server_defer = defer;
   461		raw_spin_rq_unlock_irqrestore(rq, flags);
   462	
   463		*ppos += cnt;
   464		return cnt;
   465	}
   466	
   467	static int sched_fair_server_defer_show(struct seq_file *m, void *v)
   468	{
   469		unsigned long cpu = (unsigned long) m->private;
   470		struct rq *rq = cpu_rq(cpu);
   471	
   472		seq_printf(m, "%d\n", rq->fair_server_defer);
   473		return 0;
   474	}
   475	
   476	static int sched_fair_server_defer_open(struct inode *inode, struct file *filp)
   477	{
   478		return single_open(filp, sched_fair_server_defer_show, inode->i_private);
   479	}
   480	
   481	static const struct file_operations fair_server_defer_fops = {
   482		.open		= sched_fair_server_defer_open,
   483		.write		= sched_fair_server_defer_write,
   484		.read		= seq_read,
   485		.llseek		= seq_lseek,
   486		.release	= single_release,
   487	};
   488	
   489	static struct dentry *debugfs_sched;
   490	
 > 491	void debugfs_fair_server_init(void)
   492	{
   493		long cpu;
   494		struct dentry *rq_dentry;
   495	
   496		rq_dentry = debugfs_create_dir("rq", debugfs_sched);
   497		if (!rq_dentry)
   498			return;
   499	
   500		for_each_possible_cpu(cpu) {
   501			struct dentry *d_cpu;
   502			char buf[32];
   503	
   504			snprintf(buf, sizeof(buf), "cpu%ld", cpu);
   505			d_cpu = debugfs_create_dir(buf, rq_dentry);
   506	
   507			debugfs_create_file("fair_server_runtime", 0644, d_cpu, (void *) cpu, &fair_server_runtime_fops);
   508			debugfs_create_file("fair_server_period", 0644, d_cpu, (void *) cpu, &fair_server_period_fops);
   509			debugfs_create_file("fair_server_defer", 0644, d_cpu, (void *) cpu, &fair_server_defer_fops);
   510		}
   511	}
   512	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 6/7] sched/deadline: Deferrable dl server
  2023-08-31 20:28 ` [PATCH v4 6/7] sched/deadline: Deferrable dl server Daniel Bristot de Oliveira
@ 2023-09-05 13:42   ` Peter Zijlstra
  2023-09-05 15:24     ` Daniel Bristot de Oliveira
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2023-09-05 13:42 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld

On Thu, Aug 31, 2023 at 10:28:57PM +0200, Daniel Bristot de Oliveira wrote:
> +void dl_server_start(struct sched_dl_entity *dl_se, int defer)
>  {
> +	if (dl_se->server_state != DL_SERVER_STOPPED) {
> +		WARN_ON_ONCE(!(on_dl_rq(dl_se) || dl_se->dl_throttled));
> +		return;
> +	}
> +
> +	if (defer) {
> +		/*
> +		 * Postpone the replenishment to the (next period - the execution time)
> +		 *
> +		 * With this in place, we have two cases:
> +		 *
> +		 * On the absence of DL tasks:
> +		 *	The server will start at the replenishment time, getting
> +		 *	its runtime before now + period. This is the expected
> +		 *	throttling behavior.
> +		 *
> +		 * In the presense of DL tasks:
> +		 *	The server will be replenished, and then it will be
> +		 *	schedule according to EDF, not breaking SCHED_DEADLINE.
> +		 *
> +		 *	In the first cycle the server will be postponed at most
> +		 *	at period + period - runtime at most. But then the
> +		 *	server will receive its runtime/period.
> +		 *
> +		 *	The server will, however, run on top of any RT task, which
> +		 *	is the expected throttling behavior.
> +		 */
> +		dl_se->deadline = rq_clock(dl_se->rq) + dl_se->dl_period - dl_se->dl_runtime;

I was confused by this, but this is an instance of
replenish_dl_new_period(), where we explicitly do not preserve the
period.

> +		/* Zero the runtime */
> +		dl_se->runtime = 0;
> +		/* throttle the server */
> +		dl_se->dl_throttled = 1;

These comments are both obvious and placed so as to make everything
unreadable :/ 

> +
> +		dl_se->server_state = DL_SERVER_DEFER;
> +		start_dl_timer(dl_se);
> +		return;
> +	}
> +
>  	if (!dl_server(dl_se)) {
>  		dl_se->dl_server = 1;
>  		setup_new_dl_entity(dl_se);
>  	}
> +
> +	dl_se->server_state = DL_SERVER_RUNNING;
>  	enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
>  }


> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 580e6764a68b..b9d0f08dc8ca 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6499,9 +6499,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  	 */
>  	util_est_enqueue(&rq->cfs, p);
>  
> -	if (!rq->cfs.h_nr_running)
> -		dl_server_start(&rq->fair_server);
> -
>  	/*
>  	 * If in_iowait is set, the code below may not trigger any cpufreq
>  	 * utilization updates, so do it here explicitly with the IOWAIT flag
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index e23cc67c9467..7595110a5a3e 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1537,6 +1537,9 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
>  
>  	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
>  		enqueue_pushable_task(rq, p);
> +
> +	if (sched_fair_server_needed(rq))
> +		dl_server_start(&rq->fair_server, rq->fair_server_defer);
>  }
>  
>  static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
> @@ -1547,6 +1550,9 @@ static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
>  	dequeue_rt_entity(rt_se, flags);
>  
>  	dequeue_pushable_task(rq, p);
> +
> +	if (!sched_fair_server_needed(rq))
> +		dl_server_stop(&rq->fair_server);
>  }
>  
>  /*

I'm thinking this is wrong -- ISTR there definite benefits to explicit
slack time scheduling, meaning the server should also run while DL tasks
are on. Additionally, running the server when there are only fair tasks
is mostly harmless, right? So why this change?

One of the benefits was -- IIRC, that we no longer need to subtract some
random margin from the reservation limit, but there were more I think.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 7/7] sched/fair: Fair server interface
  2023-08-31 20:28 ` [PATCH v4 7/7] sched/fair: Fair server interface Daniel Bristot de Oliveira
  2023-09-01  2:01   ` kernel test robot
@ 2023-09-05 13:55   ` Peter Zijlstra
  2023-09-05 16:17     ` Daniel Bristot de Oliveira
  1 sibling, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2023-09-05 13:55 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld

On Thu, Aug 31, 2023 at 10:28:58PM +0200, Daniel Bristot de Oliveira wrote:
> +static ssize_t
> +sched_fair_server_runtime_write(struct file *filp, const char __user *ubuf,
> +				size_t cnt, loff_t *ppos)
> +{
> +	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
> +	struct rq *rq = cpu_rq(cpu);
> +	unsigned long flags;
> +	u64 runtime;
> +	int err;
> +
> +	err = kstrtoull_from_user(ubuf, cnt, 10, &runtime);
> +	if (err)
> +		return err;
> +
> +	raw_spin_rq_lock_irqsave(rq, flags);
> +	if (runtime > rq->fair_server.dl_period)
> +		err = -EINVAL;
> +	else
> +		rq->fair_server.dl_runtime = runtime;
> +	raw_spin_rq_unlock_irqrestore(rq, flags);
> +
> +	if (err)
> +		return err;
> +
> +	*ppos += cnt;
> +	return cnt;
> +}

> +static ssize_t
> +sched_fair_server_period_write(struct file *filp, const char __user *ubuf,
> +			       size_t cnt, loff_t *ppos)
> +{
> +	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
> +	struct rq *rq = cpu_rq(cpu);
> +	unsigned long flags;
> +	u64 period;
> +	int err;
> +
> +	err = kstrtoull_from_user(ubuf, cnt, 10, &period);
> +	if (err)
> +		return err;
> +
> +	if (period < fair_server_period_min || period > fair_server_period_max)
> +		return -EINVAL;
> +
> +	raw_spin_rq_lock_irqsave(rq, flags);
> +	if (period < rq->fair_server.dl_runtime)
> +		err = -EINVAL;
> +	else
> +		rq->fair_server.dl_period = period;
> +	raw_spin_rq_unlock_irqrestore(rq, flags);
> +
> +	if (err)
> +		return err;
> +
> +	*ppos += cnt;
> +	return cnt;
> +}

> +static ssize_t
> +sched_fair_server_defer_write(struct file *filp, const char __user *ubuf,
> +			      size_t cnt, loff_t *ppos)
> +{
> +	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
> +	struct rq *rq = cpu_rq(cpu);
> +	unsigned long flags;
> +	u64 defer;
> +	int err;
> +
> +	err = kstrtoull_from_user(ubuf, cnt, 10, &defer);
> +	if (err)
> +		return err;
> +
> +	if (defer < 0 || defer > 1)
> +		return -EINVAL;
> +
> +	raw_spin_rq_lock_irqsave(rq, flags);
> +	rq->fair_server_defer = defer;
> +	raw_spin_rq_unlock_irqrestore(rq, flags);
> +
> +	*ppos += cnt;
> +	return cnt;
> +}

Surely we can write a single function that does all of that with less
duplication?

Additionally, should not the deadline parameters be vetted by access
control before being accepted ?

Perhaps something like so:

static ssize_t
sched_fair_server_write(struct file *filp, const char __user *ubuf, size_t cnt, loff_t *ppos, enum dl_param param)
{
	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
	struct rq *rq = cpu_rq(cpu);
	u64 value;
	int err;

	err = kstrtoull_from_user(ubuf, cnt, 10, &value);
	if (err)
		return err;

	scoped_guard (rq_lock_irqsave, rq) {

		u64 runtime, deadline, period;

		runtime  = rq->fair_server.dl_runtime;
		deadline = rq->fair_server.dl_deadline;
		period   = rq->fair_server.dl_period;

		switch (param) {
			case dl_runtime:  runtime  = value; break;
			case dl_deadline: deadline = value; break;
			case dl_period:   period   = value; break;
		}

		if (runtime > deadline ||
		    deadline > period ||
		    /* more stuff like access control */)
			return -EINVAL;

		rq->fair_server.dl_runtime  = runtime;
		rq->fair_server.dl_deadline = deadline;
		rq->fair_server.dl_period   = period;
	}

	*ppos += cnt;
	return cnt;
}
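
With an enum along those lines (names made up here, following the case labels
above), the three per-file handlers then collapse into thin wrappers, e.g.:

enum dl_param {
	dl_runtime = 0,
	dl_deadline,
	dl_period,
};

static ssize_t
sched_fair_server_runtime_write(struct file *filp, const char __user *ubuf,
				size_t cnt, loff_t *ppos)
{
	return sched_fair_server_write(filp, ubuf, cnt, ppos, dl_runtime);
}

static ssize_t
sched_fair_server_period_write(struct file *filp, const char __user *ubuf,
			       size_t cnt, loff_t *ppos)
{
	return sched_fair_server_write(filp, ubuf, cnt, ppos, dl_period);
}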



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 6/7] sched/deadline: Deferrable dl server
  2023-09-05 13:42   ` Peter Zijlstra
@ 2023-09-05 15:24     ` Daniel Bristot de Oliveira
  2023-09-06  8:29       ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-09-05 15:24 UTC (permalink / raw)
  To: Peter Zijlstra, Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	linux-kernel, Luca Abeni, Tommaso Cucinotta, Thomas Gleixner,
	Joel Fernandes, Vineeth Pillai, Shuah Khan, Phil Auld

On 9/5/23 15:42, Peter Zijlstra wrote:
> On Thu, Aug 31, 2023 at 10:28:57PM +0200, Daniel Bristot de Oliveira wrote:
>> +void dl_server_start(struct sched_dl_entity *dl_se, int defer)
>>  {
>> +	if (dl_se->server_state != DL_SERVER_STOPPED) {
>> +		WARN_ON_ONCE(!(on_dl_rq(dl_se) || dl_se->dl_throttled));
>> +		return;
>> +	}
>> +
>> +	if (defer) {
>> +		/*
>> +		 * Postpone the replenishment to the (next period - the execution time)
>> +		 *
>> +		 * With this in place, we have two cases:
>> +		 *
>> +		 * On the absence of DL tasks:
>> +		 *	The server will start at the replenishment time, getting
>> +		 *	its runtime before now + period. This is the expected
>> +		 *	throttling behavior.
>> +		 *
>> +		 * In the presense of DL tasks:
>> +		 *	The server will be replenished, and then it will be
>> +		 *	schedule according to EDF, not breaking SCHED_DEADLINE.
>> +		 *
>> +		 *	In the first cycle the server will be postponed at most
>> +		 *	at period + period - runtime at most. But then the
>> +		 *	server will receive its runtime/period.
>> +		 *
>> +		 *	The server will, however, run on top of any RT task, which
>> +		 *	is the expected throttling behavior.
>> +		 */
>> +		dl_se->deadline = rq_clock(dl_se->rq) + dl_se->dl_period - dl_se->dl_runtime;
> 
> I was confused by this, but this is an instance of
> replenish_dl_new_period(), where we explicitly do not preserve the
> period.

yeah, it is two operations in one (so not straight forward). It sets the period as the now - runtime, so..


>> +		/* Zero the runtime */
>> +		dl_se->runtime = 0;
>> +		/* throttle the server */
>> +		dl_se->dl_throttled = 1;

as the server is throttled, it will set the replenishing timer for now - runtime + period because
of the deadline.

> These comments are both obvious and placed so as to make everything
> unreadable :/ 

I can't disagree :-) I will remove.

>> +
>> +		dl_se->server_state = DL_SERVER_DEFER;
>> +		start_dl_timer(dl_se);
>> +		return;
>> +	}
>> +
>>  	if (!dl_server(dl_se)) {
>>  		dl_se->dl_server = 1;
>>  		setup_new_dl_entity(dl_se);
>>  	}
>> +
>> +	dl_se->server_state = DL_SERVER_RUNNING;
>>  	enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
>>  }
> 
> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 580e6764a68b..b9d0f08dc8ca 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6499,9 +6499,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>  	 */
>>  	util_est_enqueue(&rq->cfs, p);
>>  
>> -	if (!rq->cfs.h_nr_running)
>> -		dl_server_start(&rq->fair_server);
>> -
>>  	/*
>>  	 * If in_iowait is set, the code below may not trigger any cpufreq
>>  	 * utilization updates, so do it here explicitly with the IOWAIT flag
>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> index e23cc67c9467..7595110a5a3e 100644
>> --- a/kernel/sched/rt.c
>> +++ b/kernel/sched/rt.c
>> @@ -1537,6 +1537,9 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
>>  
>>  	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
>>  		enqueue_pushable_task(rq, p);
>> +
>> +	if (sched_fair_server_needed(rq))
>> +		dl_server_start(&rq->fair_server, rq->fair_server_defer);
>>  }
>>  
>>  static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
>> @@ -1547,6 +1550,9 @@ static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
>>  	dequeue_rt_entity(rt_se, flags);
>>  
>>  	dequeue_pushable_task(rq, p);
>> +
>> +	if (!sched_fair_server_needed(rq))
>> +		dl_server_stop(&rq->fair_server);
>>  }
>>  
>>  /*
> 
> I'm thinking this is wrong -- ISTR there definite benefits to explicit
> slack time scheduling, meaning the server should also run while DL tasks
> are on.

I can add the check to enable it also in the presence of DL tasks, we just need
to consider that the dl server is a dl task as well when disabling.

It is a benefit because it will fix the case in which a CPU is full of DL tasks
(possible with global).

> Additionally, running the server when there are only fair tasks
> is mostly harmless, right? So why this change?

The problem is that we never know when an RT one will arrive :-/

E.g., only fair tasks, server enabled. All fine :-) Then an RT arrives, and gets
postponed by the server... RT people complaining (for those thinking on adding
a server for RT, we will have a similar problem as we have with throttling today +
DL has no fixed priority).

We will still face the same problem with the deferrable server, and the example is the same,
just with a shift in the RT arrival. e.g., only fair tasks for (server period - runtime),
then the server gets enabled, and 1us after, the RT arrives and gets postponed... again.

So the need to enable it only after the dispatch of a kind of RT (or DL to be added)
tasks that can cause starvation.

We could simplify it by enabling only on RT/DL arrival but... if there is nothing to
starve... it is not needed anyways so less overhead avoiding the enqueue...

> 
> One of the benefits was -- IIRC, that we no longer need to subtract some
> random margin from the reservation limit, but there were more I think.
> 

We can simplify things because we can start ignoring the rt throttling limit when
there is no need to throttle... like on grub (but, only after we remove the rt
throttling).

Thoughts?
-- Daniel


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 7/7] sched/fair: Fair server interface
  2023-09-05 13:55   ` Peter Zijlstra
@ 2023-09-05 16:17     ` Daniel Bristot de Oliveira
  2023-09-06  7:25       ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-09-05 16:17 UTC (permalink / raw)
  To: Peter Zijlstra, Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	linux-kernel, Luca Abeni, Tommaso Cucinotta, Thomas Gleixner,
	Joel Fernandes, Vineeth Pillai, Shuah Khan, Phil Auld

On 9/5/23 15:55, Peter Zijlstra wrote:
> On Thu, Aug 31, 2023 at 10:28:58PM +0200, Daniel Bristot de Oliveira wrote:
>> +static ssize_t
>> +sched_fair_server_runtime_write(struct file *filp, const char __user *ubuf,
>> +				size_t cnt, loff_t *ppos)
>> +{
>> +	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
>> +	struct rq *rq = cpu_rq(cpu);
>> +	unsigned long flags;
>> +	u64 runtime;
>> +	int err;
>> +
>> +	err = kstrtoull_from_user(ubuf, cnt, 10, &runtime);
>> +	if (err)
>> +		return err;
>> +
>> +	raw_spin_rq_lock_irqsave(rq, flags);
>> +	if (runtime > rq->fair_server.dl_period)
>> +		err = -EINVAL;
>> +	else
>> +		rq->fair_server.dl_runtime = runtime;
>> +	raw_spin_rq_unlock_irqrestore(rq, flags);
>> +
>> +	if (err)
>> +		return err;
>> +
>> +	*ppos += cnt;
>> +	return cnt;
>> +}
> 
>> +static ssize_t
>> +sched_fair_server_period_write(struct file *filp, const char __user *ubuf,
>> +			       size_t cnt, loff_t *ppos)
>> +{
>> +	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
>> +	struct rq *rq = cpu_rq(cpu);
>> +	unsigned long flags;
>> +	u64 period;
>> +	int err;
>> +
>> +	err = kstrtoull_from_user(ubuf, cnt, 10, &period);
>> +	if (err)
>> +		return err;
>> +
>> +	if (period < fair_server_period_min || period > fair_server_period_max)
>> +		return -EINVAL;
>> +
>> +	raw_spin_rq_lock_irqsave(rq, flags);
>> +	if (period < rq->fair_server.dl_runtime)
>> +		err = -EINVAL;
>> +	else
>> +		rq->fair_server.dl_period = period;
>> +	raw_spin_rq_unlock_irqrestore(rq, flags);
>> +
>> +	if (err)
>> +		return err;
>> +
>> +	*ppos += cnt;
>> +	return cnt;
>> +}
> 
>> +static ssize_t
>> +sched_fair_server_defer_write(struct file *filp, const char __user *ubuf,
>> +			      size_t cnt, loff_t *ppos)
>> +{
>> +	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
>> +	struct rq *rq = cpu_rq(cpu);
>> +	unsigned long flags;
>> +	u64 defer;
>> +	int err;
>> +
>> +	err = kstrtoull_from_user(ubuf, cnt, 10, &defer);
>> +	if (err)
>> +		return err;
>> +
>> +	if (defer < 0 || defer > 1)
>> +		return -EINVAL;
>> +
>> +	raw_spin_rq_lock_irqsave(rq, flags);
>> +	rq->fair_server_defer = defer;
>> +	raw_spin_rq_unlock_irqrestore(rq, flags);
>> +
>> +	*ppos += cnt;
>> +	return cnt;
>> +}
> 
> Surely we can write a single function that does all of that with less
> duplication?

I agree, I will use your code as starting point for that...

> 
> Additionally, should not the deadline parameters be vetted by access
> control before being accepted ?

like security_task_getscheduler(p)? But we have no p...

I checked rt throttling, but it seems that it does not check. Do you have
a pointer?

-- Daniel


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 7/7] sched/fair: Fair server interface
  2023-09-05 16:17     ` Daniel Bristot de Oliveira
@ 2023-09-06  7:25       ` Peter Zijlstra
  2023-09-06  8:25         ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2023-09-06  7:25 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Daniel Bristot de Oliveira, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Luca Abeni,
	Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld

On Tue, Sep 05, 2023 at 06:17:26PM +0200, Daniel Bristot de Oliveira wrote:

> > Additionally, should not the deadline parameters be vetted by access
> > control before being accepted ?
> 
> like security_task_getscheduler(p)? But we have no p...

I was thinking sched_dl_overflow() or thereabout. That still runs on p,
but I'm thinking that should be easily adapted to dl_se or somesuch.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 7/7] sched/fair: Fair server interface
  2023-09-06  7:25       ` Peter Zijlstra
@ 2023-09-06  8:25         ` Peter Zijlstra
  0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2023-09-06  8:25 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Daniel Bristot de Oliveira, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Luca Abeni,
	Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld

On Wed, Sep 06, 2023 at 09:25:01AM +0200, Peter Zijlstra wrote:
> On Tue, Sep 05, 2023 at 06:17:26PM +0200, Daniel Bristot de Oliveira wrote:
> 
> > > Additionally, should not the deadline parameters be vetted by access
> > > control before being accepted ?
> > 
> > like security_task_getscheduler(p)? But we have no p...
> 
> I was thinking sched_dl_overflow() or thereabout. That still runs on p,
> but I'm thikning that should be easily adapted to dl_se or somesuch.

Clearly my scrambled brain confused admission control and access
control.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 6/7] sched/deadline: Deferrable dl server
  2023-09-05 15:24     ` Daniel Bristot de Oliveira
@ 2023-09-06  8:29       ` Peter Zijlstra
  2023-09-06 14:58         ` Daniel Bristot de Oliveira
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2023-09-06  8:29 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Daniel Bristot de Oliveira, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Luca Abeni,
	Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld

On Tue, Sep 05, 2023 at 05:24:40PM +0200, Daniel Bristot de Oliveira wrote:
> On 9/5/23 15:42, Peter Zijlstra wrote:
> > On Thu, Aug 31, 2023 at 10:28:57PM +0200, Daniel Bristot de Oliveira wrote:
> >> +void dl_server_start(struct sched_dl_entity *dl_se, int defer)
> >>  {
> >> +	if (dl_se->server_state != DL_SERVER_STOPPED) {
> >> +		WARN_ON_ONCE(!(on_dl_rq(dl_se) || dl_se->dl_throttled));
> >> +		return;
> >> +	}
> >> +
> >> +	if (defer) {
> >> +		/*
> >> +		 * Postpone the replenishment to the (next period - the execution time)

perhaps explicitly mention zero-laxity here, then we all know what is
meant, no?

> >> +		 *
> >> +		 * With this in place, we have two cases:
> >> +		 *
> >> +		 * On the absence of DL tasks:
> >> +		 *	The server will start at the replenishment time, getting
> >> +		 *	its runtime before now + period. This is the expected
> >> +		 *	throttling behavior.
> >> +		 *
> >> +		 * In the presense of DL tasks:
> >> +		 *	The server will be replenished, and then it will be
> >> +		 *	schedule according to EDF, not breaking SCHED_DEADLINE.
> >> +		 *
> >> +		 *	In the first cycle the server will be postponed at most
> >> +		 *	at period + period - runtime at most. But then the
> >> +		 *	server will receive its runtime/period.
> >> +		 *
> >> +		 *	The server will, however, run on top of any RT task, which
> >> +		 *	is the expected throttling behavior.
> >> +		 */
> >> +		dl_se->deadline = rq_clock(dl_se->rq) + dl_se->dl_period - dl_se->dl_runtime;
> > 
> > I was confused by this, but this is an instance of
> > replenish_dl_new_period(), where we explicitly do not preserve the
> > period.
> 
> yeah, it is two operations in one (so not straight forward). It sets
> the period as the now - runtime, so..

Right. I'll just attribute it to me being generally confused about
everything after holidays ;-)

The thing that tripped me was that the use of rq_clock() breaks
periodicity, except of course you want/need that when you start a new
activation.

> >> +		/* Zero the runtime */
> >> +		dl_se->runtime = 0;
> >> +		/* throttle the server */
> >> +		dl_se->dl_throttled = 1;
> 
> as the server is throttled, it will set the replenishing timer for now - runtime + period because
> of the deadline.

Yeah, it's a wee hack to move it to the zero-laxity point. I was
considering if it makes sense to push that down and make it available
for all DL tasks, but I'm not sure..

> > I'm thinking this is wrong -- ISTR there definite benefits to explicit
> > slack time scheduling, meaning the server should also run while DL tasks
> > are on.
> 
> I can add the check to enable it also in the presence of DL tasks, we just need
> to consider that the dl server is a dl task as well when disabling.

Why not keep what was there, always run the thing when there's FAIR
tasks around? Or do you see severe overhead from running it with only
FAIR tasks on?

> It is a benefit because it will fix the case in which a CPU is full of DL tasks
> (possible with global).
> 
> > Additionally, running the server when there are only fair tasks
> > is mostly harmless, right? So why this change?
> 
> The problem is that we never know when an RT one will arrive :-/
> 
> E.g., only fair tasks, server enabled. All fine :-) Then an RT arrives, and gets
> postponed by the server... RT people complaining (for those thinking on adding
> a server for RT, we will have a similar problem as we have with throttling today +
> DL has no fixed priority).

Well, the thing is, the moment you run DL tasks on that machine those RT
tasks will get delayed *anyway*. RT people need to stop pretending FIFO
is the highest class.

But yeah, I can see why they might get upset if the time is then used to
run FAIR tasks while their RT task gets to wait.

> We will still face the same problem with the deferrable server, and the example is the same,
> just with a shift in the RT arrival. e.g., only fair tasks for (server period - runtime),
> then the server gets enabled, and 1us after, the RT arrives and gets postponed... again.
> 
> So the need to enable it only after the dispatch of a kind of RT (or DL to be added)
> tasks that can cause starvation.
> 
> We could simplify it by enabling only on RT/DL arrival but... if there is nothing to
> starve... it is not needed anyways so less overhead avoiding the enqueue...

So one thing we could do is have update_curr_fair() decrement
fair_server's runtime and yield the period when it hits 0 (and capping
it at 0, not allowing it to go negative or so).

That way you only force the situation when FAIR hasn't had its allotted
time this period, and only for as much as to make up for the time it
lacks.
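
Roughly something like this, as a sketch only (the helper name is made up, and
how the yield maps onto the dl_yielded/replenish machinery is an assumption):

	/* called from update_curr_fair()/update_curr() with the accounted delta_exec */
	static inline void fair_server_consume(struct rq *rq, s64 delta_exec)
	{
		struct sched_dl_entity *dl_se = &rq->fair_server;

		if (dl_se->runtime <= 0)
			return;

		dl_se->runtime -= delta_exec;
		if (dl_se->runtime <= 0) {
			/* FAIR already got its allotted time this period */
			dl_se->runtime = 0;
			dl_se->dl_yielded = 1;
		}
	}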

> > 
> > One of the benefits was -- IIRC, that we no longer need to subtract some
> > random margin from the reservation limit, but there were more I think.
> > 
> 
> We can simplify things because we can start ignoring the rt throttling limit when
> there is no need to throttle... like on grub (but, only after we remove the rt
> throttling).
> 
> Thoughts?

The moment the fair server is there (these patches) you should also kill
the throttling. It doesn't make sense to have both.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 6/7] sched/deadline: Deferrable dl server
  2023-09-06  8:29       ` Peter Zijlstra
@ 2023-09-06 14:58         ` Daniel Bristot de Oliveira
  2023-09-06 20:04           ` Peter Zijlstra
  2023-09-07  8:07           ` Peter Zijlstra
  0 siblings, 2 replies; 25+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-09-06 14:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Daniel Bristot de Oliveira, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Luca Abeni,
	Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld

On 9/6/23 10:29, Peter Zijlstra wrote:
> On Tue, Sep 05, 2023 at 05:24:40PM +0200, Daniel Bristot de Oliveira wrote:
>> On 9/5/23 15:42, Peter Zijlstra wrote:
>>> On Thu, Aug 31, 2023 at 10:28:57PM +0200, Daniel Bristot de Oliveira wrote:
>>>> +void dl_server_start(struct sched_dl_entity *dl_se, int defer)
>>>>  {
>>>> +	if (dl_se->server_state != DL_SERVER_STOPPED) {
>>>> +		WARN_ON_ONCE(!(on_dl_rq(dl_se) || dl_se->dl_throttled));
>>>> +		return;
>>>> +	}
>>>> +
>>>> +	if (defer) {
>>>> +		/*
>>>> +		 * Postpone the replenishment to the (next period - the execution time)
> 
> perhaps explicitly mention zero-laxity here, then we all know what is
> meant, no?

Last time I used that word it caused more problems than it helped :-) But I will
add it and specify that it is "for this task".

>>>> +		 *
>>>> +		 * With this in place, we have two cases:
>>>> +		 *
>>>> +		 * On the absence of DL tasks:
>>>> +		 *	The server will start at the replenishment time, getting
>>>> +		 *	its runtime before now + period. This is the expected
>>>> +		 *	throttling behavior.
>>>> +		 *
>>>> +		 * In the presence of DL tasks:
>>>> +		 *	The server will be replenished, and then it will be
>>>> +		 *	scheduled according to EDF, not breaking SCHED_DEADLINE.
>>>> +		 *
>>>> +		 *	In the first cycle the server will be postponed by at
>>>> +		 *	most period + period - runtime. But then the
>>>> +		 *	server will receive its runtime/period.
>>>> +		 *
>>>> +		 *	The server will, however, run on top of any RT task, which
>>>> +		 *	is the expected throttling behavior.
>>>> +		 */
>>>> +		dl_se->deadline = rq_clock(dl_se->rq) + dl_se->dl_period - dl_se->dl_runtime;
>>>
>>> I was confused by this, but this is an instance of
>>> replenish_dl_new_period(), where we explicitly do not preserve the
>>> period.
>>
>> yeah, it is two operations in one (so not straightforward). It starts
>> a new period at now and subtracts the runtime from the deadline, so..
> 
> Right. I'll just attribute it to me being generally confused about
> everything after holidays ;-)
> 
> The thing that tripped me was that the use of rq_clock() breaks
> periodicy, except of course you want/need that when you start a new
> activation.

that rq_clock() is only used at the time at which we start the deferred
server. If the throttling condition stays on, the server will act as a
regular periodic DL task...
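
Just to illustrate with arbitrary numbers (a user-space toy, not the
kernel code): the deferred start is the only activation anchored to the
clock, after that the deadlines advance purely by the period:

--- %< ---
#include <stdio.h>

int main(void)
{
	const long long period = 100, runtime = 10;	/* arbitrary units */
	long long now = 42;

	/*
	 * Deferred start: the deadline is taken once from the clock as
	 * now + period - runtime, i.e. the latest point at which the
	 * server can still get its full runtime before now + period.
	 */
	long long deadline = now + period - runtime;
	printf("deferred start deadline: %lld\n", deadline);

	/* From here on it behaves like a regular periodic DL entity. */
	for (int i = 0; i < 3; i++) {
		deadline += period;
		printf("next period deadline:    %lld\n", deadline);
	}
	return 0;
}
--- >% ---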

> 
>>>> +		/* Zero the runtime */
>>>> +		dl_se->runtime = 0;
>>>> +		/* throttle the server */
>>>> +		dl_se->dl_throttled = 1;
>>
>> as the server is throttled, it will set the replenishment timer to now + period - runtime
>> because of the deadline.
> 
> Yeah, it's a wee hack to move it to the zero-laxity point. I was
> considering if it makes sense to push that down and make it available
> for all DL tasks, but I'm not sure..

It might be useful in the future, like when DL dominates all other schedulers, so
we can have a way to schedule deferred work, like kworkers... :-) But it might be
too early for that..

>>> I'm thinking this is wrong -- ISTR there were definite benefits to explicit
>>> slack time scheduling, meaning the server should also run while DL tasks
>>> are on.
>>
>> I can add the check to enable it also in the presence of DL tasks, we just need
>> to consider that the dl server is a dl task as well when disabling.
> 
> Why not keep what was there, always run the thing when there's FAIR
> tasks around? Or do you see severe overhead from running it with only
> FAIR tasks on?

Assuming that in most cases people only have fair tasks, which is probably
true with a regular kernel... (like, no threaded IRQs).

>> It is a benefit because it will fix the case in which a CPU is full of DL tasks
>> (possible with global scheduling).
>>
>>> Additionally, running the server when there are only fair tasks
>>> is mostly harmless, right? So why this change?
>>
>> The problem is that we never know when an RT one will arrive :-/
>>
>> E.g., only fair tasks, server enabled. All fine :-) Then an RT arrives, and gets
>> postponed by the server... RT people complaining (for those thinking of adding
>> a server for RT, we will have a similar problem as we have with throttling today +
>> DL has no fixed priority).
> 
> Well, the thing is, the moment you run DL tasks on that machine those RT
> tasks will get delayed *anyway*. RT people need to stop pretending FIFO
> is the highest class.
> 
> But yeah, I can see why they might get upset if the time is then used to
> run FAIR tasks while their RT task gets to wait.

right

>> We will still face the same problem with a deferrable server, and the example is the same,
>> just with a shift in the RT arrival. E.g., only fair tasks for (server period - runtime),
>> then the server gets enabled, and 1us after that the RT arrives and gets postponed... again.
>>
>> Hence the need to enable it only after the dispatch of the kind of RT (or, to be added, DL)
>> tasks that can cause starvation.
>>
>> We could simplify it by enabling only on RT/DL arrival but... if there is nothing to
>> starve... it is not needed anyway, so there is less overhead by avoiding the enqueue...
> 
> So one thing we could do is have update_curr_fair() decrement
> fair_server's runtime and yield the period when it hits 0 (and capping
> it at 0, not allowing it to go negative or so).
> 
> That way you only force the situation when FAIR hasn't had its allotted
> time this period, and only for as much as to make up for the time it
> lacks.

We can also decrease the runtime to a negative number while in the defer/throttle state, and let
the while loop in replenish_dl_entity() replenish with the += runtime;

here:
--- %< ---
        /*
         * We keep moving the deadline away until we get some
         * available runtime for the entity. This ensures correct
         * handling of situations where the runtime overrun is
         * arbitrary large.
         */
        while (dl_se->runtime <= 0) {
                dl_se->deadline += pi_of(dl_se)->dl_period;
                dl_se->runtime += pi_of(dl_se)->dl_runtime;
        }

--- >% ---

it is already done... no?
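
To put numbers on it (arbitrary values, just walking the loop above by
hand): if the runtime went down to -2ms while deferred, and the server
is 10ms/100ms, one pass of the loop leaves 8ms of budget in the delayed
period:

--- %< ---
/* Hand-run of the quoted while loop with made-up parameters. */
#include <stdio.h>

int main(void)
{
	long long runtime  = -2000000;		/* ns: went negative while deferred */
	long long deadline = 100000000;		/* ns: deadline armed at defer time */
	const long long dl_runtime = 10000000;	/* ns: 10ms runtime */
	const long long dl_period  = 100000000;	/* ns: 100ms period */

	while (runtime <= 0) {
		deadline += dl_period;
		runtime  += dl_runtime;
	}

	/* Prints runtime = 8000000: the 2ms already consumed is subtracted
	 * from the fresh 10ms instead of handing out a full budget. */
	printf("runtime = %lld ns, deadline = %lld ns\n", runtime, deadline);
	return 0;
}
--- >% ---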

> 
>>>
>>> One of the benefits was -- IIRC, that we no longer need to subtract some
>>> random margin from the reservation limit, but there were more I think.
>>>
>>
>> We can simplify things because we can start ignoring the rt throttling limit when
>> there is no need to throttle... like on grub (but, only after we remove the rt
>> throttling).
>>
>> Thoughts?
> 
> The moment the fair server is there (these patches) you should also kill
> the throttling. It doesn't make sense to have both.
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 6/7] sched/deadline: Deferrable dl server
  2023-09-06 14:58         ` Daniel Bristot de Oliveira
@ 2023-09-06 20:04           ` Peter Zijlstra
  2023-09-06 20:08             ` Peter Zijlstra
  2023-09-08 13:59             ` Daniel Bristot de Oliveira
  2023-09-07  8:07           ` Peter Zijlstra
  1 sibling, 2 replies; 25+ messages in thread
From: Peter Zijlstra @ 2023-09-06 20:04 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Daniel Bristot de Oliveira, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Luca Abeni,
	Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld

On Wed, Sep 06, 2023 at 04:58:11PM +0200, Daniel Bristot de Oliveira wrote:

> > So one thing we could do is have update_curr_fair() decrement
> > fair_server's runtime and yield the period when it hits 0 (and capping
> > it at 0, not allowing it to go negative or so).
> > 
> > That way you only force the situation when FAIR hasn't had its allotted
> > time this period, and only for as much as to make up for the time it
> > lacks.
> 
> We can also decrease the runtime to a negative number while in the
> defer/throttle state, and let the while loop in replenish_dl_entity()
> replenish with the += runtime;

Yes, but my point was that fair_server gives a lower bound of runtime
per period, more -- if available -- is fine.

If we allow negative runtime, you'll affect future periods, and that is
not desired in this case.

Or am I still confused?


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 6/7] sched/deadline: Deferrable dl server
  2023-09-06 20:04           ` Peter Zijlstra
@ 2023-09-06 20:08             ` Peter Zijlstra
  2023-09-08 14:14               ` Daniel Bristot de Oliveira
  2023-09-08 13:59             ` Daniel Bristot de Oliveira
  1 sibling, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2023-09-06 20:08 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Daniel Bristot de Oliveira, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Luca Abeni,
	Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld

On Wed, Sep 06, 2023 at 10:04:06PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 06, 2023 at 04:58:11PM +0200, Daniel Bristot de Oliveira wrote:
> 
> > > So one thing we could do is have update_curr_fair() decrement
> > > fair_server's runtime and yield the period when it hits 0 (and capping
> > > it at 0, not allowing it to go negative or so).
> > > 
> > > That way you only force the situation when FAIR hasn't had its allotted
> > > time this period, and only for as much as to make up for the time it
> > > lacks.
> > 
> > We can also decrease the runtime to a negative number while in the
> > defer/throttle state, and let the while loop in replenish_dl_entity()
> > replenish with the += runtime;
> 
> Yes, but my point was that fair_server gives a lower bound of runtime
> per period, more -- if available -- is fine.
> 
> If we allow negative runtime, you'll affect future periods, and that is
> not desired in this case.
> 
> Or am I still confused?

That is, let update_curr_fair() decrement fair_server runtime
*unconditionally* -- even if the task was not selected through the
server.

Specifically, if the fair task is selected normally due to a lack of
deadline tasks, that runtime *still* counts towards the fair server, and
the server yields the period when it hits zero.

This means that fair_server is only effective IFF 'normal' execution
doesn't match the fair_server.runtime execution.
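
That is -- toy numbers and invented names, not the real accounting --
something like:

--- %< ---
/* User-space toy: 50ms of server budget per 100ms period. */
#include <stdio.h>

/*
 * Time the server still has to inject after normal fair execution ran
 * for 'fair_ran' ns in a period with 'budget' ns of server runtime.
 */
static long long fake_server_deficit(long long budget, long long fair_ran)
{
	long long left = budget - fair_ran;

	return left > 0 ? left : 0;	/* clamp at zero, never negative */
}

int main(void)
{
	const long long budget = 50000000;	/* 50ms per period */

	/* Period 1: fair tasks got 80ms on their own -> server stays idle. */
	printf("period 1: server injects %lld ns\n",
	       fake_server_deficit(budget, 80000000));

	/* Period 2: fair tasks only got 5ms -> the server makes up the
	 * missing 45ms of its budget, and no more. */
	printf("period 2: server injects %lld ns\n",
	       fake_server_deficit(budget, 5000000));
	return 0;
}
--- >% ---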



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 6/7] sched/deadline: Deferrable dl server
  2023-09-06 14:58         ` Daniel Bristot de Oliveira
  2023-09-06 20:04           ` Peter Zijlstra
@ 2023-09-07  8:07           ` Peter Zijlstra
  2023-09-08 15:28             ` Daniel Bristot de Oliveira
  1 sibling, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2023-09-07  8:07 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Daniel Bristot de Oliveira, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Luca Abeni,
	Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld

On Wed, Sep 06, 2023 at 04:58:11PM +0200, Daniel Bristot de Oliveira wrote:

> > Yeah, it's a wee hack to move it to the zero-laxity point. I was
> > considering if it makes sense to push that down and make it available
> > for all DL tasks, but I'm not sure..
> 
>> It might be useful in the future, like when DL dominates all other schedulers, so
>> we can have a way to schedule deferred work, like kworkers... :-) But it might be
>> too early for that..

So... that scheme I was pushing where we unconditionally decrement
fair_server.dl_runtime from update_curr_fair(), that relies on it being
a proper zero-laxity scheduler, and doesn't work with the proposed defer
hack.

That is, it relies on dl_runtime > 0 during throttle, and you explicitly
set it to 0.

Now, I've not looked at all this code in detail in a minute, but would
not something like the below work?

AFAICT the regular dl_task_timer() callback works to make it go, because
replenish will see positive runtime (or not, when already consumed) and
DTRT.


Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -657,6 +657,7 @@ struct sched_dl_entity {
 	unsigned int			dl_non_contending : 1;
 	unsigned int			dl_overrun	  : 1;
 	unsigned int			dl_server         : 1;
+	unsigned int			dl_zerolax        : 1;
 
 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
Index: linux-2.6/kernel/sched/deadline.c
===================================================================
--- linux-2.6.orig/kernel/sched/deadline.c
+++ linux-2.6/kernel/sched/deadline.c
@@ -895,6 +895,16 @@ static void replenish_dl_entity(struct s
 		dl_se->dl_yielded = 0;
 	if (dl_se->dl_throttled)
 		dl_se->dl_throttled = 0;
+
+	/*
+	 * If this is a zero-laxity task, and we're before the zero-laxity
+	 * point, throttle it.
+	 */
+	if (dl_se->dl_zerolax &&
+	    dl_time_before(dl_se->deadline - dl_se->runtime, rq_clock(rq))) {
+		if (!is_dl_boosted(dl_se) && start_dl_timer(dl_se))
+			dl_se->dl_throttled = 1;
+	}
 }
 
 /*
@@ -1078,7 +1088,12 @@ static int start_dl_timer(struct sched_d
 	 * that it is actually coming from rq->clock and not from
 	 * hrtimer's time base reading.
 	 */
-	act = ns_to_ktime(dl_next_period(dl_se));
+	if (dl_se->dl_zerolax && !dl_se->dl_throttled) {
+		act = ns_to_ktime(dl_se->deadline - dl_se->runtime);
+	} else {
+		act = ns_to_ktime(dl_next_period(dl_se));
+	}
+
 	now = hrtimer_cb_get_time(timer);
 	delta = ktime_to_ns(now) - rq_clock(rq);
 	act = ktime_add_ns(act, delta);
@@ -1794,6 +1809,13 @@ enqueue_dl_entity(struct sched_dl_entity
 		setup_new_dl_entity(dl_se);
 	}
 
+	/*
+	 * If we are still throttled, eg. we got replenished but are a
+	 * zero-laxity task and still got to wait, don't enqueue.
+	 */
+	if (dl_se->dl_throttled)
+		return;
+
 	__enqueue_dl_entity(dl_se);
 }
 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 6/7] sched/deadline: Deferrable dl server
  2023-09-06 20:04           ` Peter Zijlstra
  2023-09-06 20:08             ` Peter Zijlstra
@ 2023-09-08 13:59             ` Daniel Bristot de Oliveira
  1 sibling, 0 replies; 25+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-09-08 13:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Daniel Bristot de Oliveira, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Luca Abeni,
	Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld

On 9/6/23 22:04, Peter Zijlstra wrote:
> On Wed, Sep 06, 2023 at 04:58:11PM +0200, Daniel Bristot de Oliveira wrote:
> 
>>> So one thing we could do is have update_curr_fair() decrement
>>> fair_server's runtime and yield the period when it hits 0 (and capping
>>> it at 0, not allowing it to go negative or so).
>>>
>>> That way you only force the situation when FAIR hasn't had its allotted
>>> time this period, and only for as much as to make up for the time it
>>> lacks.
>>
>> We can also decrease the runtime to a negative number while in the
>> defer/throttle state, and let the while loop in replenish_dl_entity()
>> replenish with the += runtime;

Replying in sequence... but mostly to try to understand/explain my point (we might even
be in agreement, but touching different parts of the code).

> Yes, but my point was that fair_server gives a lower bound of runtime
> per period, more -- if available -- is fine.

I am targeting that as well, and it works for the case in which we have only RT
tasks causing starvation.

If we have other DL tasks, we cannot force the fair server to run till
completion because it would add a U=1 load to the system. Like, if we have a
50ms server runtime... BOOM, we will miss lots of deadlines of regular DL tasks
with a 1ms period. I do not think it is worth breaking deadlines to give the fair
server time immediately. So, the fair server is scheduled as a periodic DL task.

After the initial defer state, the DL server will get its runtime/period
even with a CPU load of DL tasks. But:

	- We do not have such a high load of DL tasks anyway
	- If one cares about it more, they can reduce the runtime/period
	  granularity to mitigate the defer time
	- If one does not care about RT tasks, just disable the defer mechanism

So I think we are well covered, without having to break the basic CBS+EDF assumptions
(like that a task will not add a higher load than its U).


> If we allow negative runtime, you'll affect future periods, and that is
> not desired in this case.

I think that I need to clarify this. I was thinking of this case:

	- Fair server deferred
	- If the server gets some time to run while waiting for the 0-laxity
	  point, we decrease the runtime...
	- When the defer starts, the replenishment will happen, and will +=
	  the runtime, giving it the correct proportional time left for the
	  period the timer was armed for. So it is not the next period, it is
	  the delayed period.

So I think we are thinking the same thing... just with a shift.

> 
> Or am I still confused?
> 

You are not alone... there are many states and I fear that I might be focusing
on a different state.

-- Daniel


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 6/7] sched/deadline: Deferrable dl server
  2023-09-06 20:08             ` Peter Zijlstra
@ 2023-09-08 14:14               ` Daniel Bristot de Oliveira
  0 siblings, 0 replies; 25+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-09-08 14:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Daniel Bristot de Oliveira, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Luca Abeni,
	Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld

On 9/6/23 22:08, Peter Zijlstra wrote:
> On Wed, Sep 06, 2023 at 10:04:06PM +0200, Peter Zijlstra wrote:
>> On Wed, Sep 06, 2023 at 04:58:11PM +0200, Daniel Bristot de Oliveira wrote:
>>
>>>> So one thing we could do is have update_curr_fair() decrement
>>>> fair_server's runtime and yield the period when it hits 0 (and capping
>>>> it at 0, not allowing it to go negative or so).
>>>>
>>>> That way you only force the situation when FAIR hasn't had its allotted
>>>> time this period, and only for as much as to make up for the time it
>>>> lacks.
>>>
>>> We can also decrease the runtime to a negative number while in the
>>> defer/throttle state, and let the while loop in replenish_dl_entity()
>>> replenish with the += runtime;
>>
>> Yes, but my point was that fair_server gives a lower bound of runtime
>> per period, more -- if available -- is fine.
>>
>> If we allow negative runtime, you'll affect future periods, and that is
>> not desired in this case.
>>
>> Or am I still confused?
> 
> That is, let update_curr_fair() decrement fair_server runtime
> *unconditionally* -- even if the task was not selected through the
> server.

Ah, I see! But then we would also have to consider the period, and control a
period... without SCHED_DEADLINE watching us...

I was considering only the "waiting for the 0-lag time to start running (after
being armed)"

If there is no need for the server to be armed... do nothing :-) If it is armed,
reduce the amount of time the fair server could get.

> Specifically, if the fair task is selected normally due to a lack of
> deadline tasks, that runtime *still* counts towards the fair server, and
> the server yields the period when it hits zero.
> 
> This means that fair_server is only effective IFF 'normal' execution
> doesn't match the fair_server.runtime execution.

I see! I *think* it will cause more overhead than doing nothing unless there
is something that can cause starvation. But I need to think more.

-- Daniel



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 6/7] sched/deadline: Deferrable dl server
  2023-09-07  8:07           ` Peter Zijlstra
@ 2023-09-08 15:28             ` Daniel Bristot de Oliveira
  2023-09-08 16:11               ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Bristot de Oliveira @ 2023-09-08 15:28 UTC (permalink / raw)
  To: Peter Zijlstra, Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	linux-kernel, Luca Abeni, Tommaso Cucinotta, Thomas Gleixner,
	Joel Fernandes, Vineeth Pillai, Shuah Khan, Phil Auld

On 9/7/23 10:07, Peter Zijlstra wrote:
> On Wed, Sep 06, 2023 at 04:58:11PM +0200, Daniel Bristot de Oliveira wrote:
> 
>>> Yeah, it's a wee hack to move it to the zero-laxity point. I was
>>> considering if it makes sense to push that down and make it available
>>> for all DL tasks, but I'm not sure..
>>
>> It might be useful in the future, like when DL dominates all other schedulers, so
>> we can have a way to schedule deferred work, like kworkers... :-) But it might be
>> too early for that..
> 
> So... that scheme I was pushing where we unconditionally decrement
> fair_server.dl_runtime from update_curr_fair(), that relies on it being
> a proper zero-laxity scheduler, and doesn't work with the proposed defer
> hack.
> 
> That is, it relies on dl_runtime > 0 during throttle, and you explicitly
> set it to 0.
> 
> Now, I've not looked at all this code in detail in a minute, but would
> not something like the below work?
> 
> AFAICT the regular dl_task_timer() callback works to make it go, because
> replenish will see positive runtime (or not, when already consumed) and
> DTRT.
> 
> 
> Index: linux-2.6/include/linux/sched.h
> ===================================================================
> --- linux-2.6.orig/include/linux/sched.h
> +++ linux-2.6/include/linux/sched.h
> @@ -657,6 +657,7 @@ struct sched_dl_entity {
>  	unsigned int			dl_non_contending : 1;
>  	unsigned int			dl_overrun	  : 1;
>  	unsigned int			dl_server         : 1;
> +	unsigned int			dl_zerolax        : 1;
>  
>  	/*
>  	 * Bandwidth enforcement timer. Each -deadline task has its
> Index: linux-2.6/kernel/sched/deadline.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched/deadline.c
> +++ linux-2.6/kernel/sched/deadline.c
> @@ -895,6 +895,16 @@ static void replenish_dl_entity(struct s
>  		dl_se->dl_yielded = 0;
>  	if (dl_se->dl_throttled)
>  		dl_se->dl_throttled = 0;
> +
> +	/*
> +	 * If this is a zero-laxity task, and we're before the zero-laxity
> +	 * point, throttle it.
> +	 */
> +	if (dl_se->dl_zerolax &&
> +	    dl_time_before(dl_se->deadline - dl_se->runtime, rq_clock(rq))) {
> +		if (!is_dl_boosted(dl_se) && start_dl_timer(dl_se))
> +			dl_se->dl_throttled = 1;
> +	}
>  }
>  
>  /*
> @@ -1078,7 +1088,12 @@ static int start_dl_timer(struct sched_d
>  	 * that it is actually coming from rq->clock and not from
>  	 * hrtimer's time base reading.
>  	 */
> -	act = ns_to_ktime(dl_next_period(dl_se));
> +	if (dl_se->dl_zerolax && !dl_se->dl_throttled) {
> +		act = ns_to_ktime(dl_se->deadline - dl_se->runtime);
> +	} else {
> +		act = ns_to_ktime(dl_next_period(dl_se));
> +	}
> +
>  	now = hrtimer_cb_get_time(timer);
>  	delta = ktime_to_ns(now) - rq_clock(rq);
>  	act = ktime_add_ns(act, delta);
> @@ -1794,6 +1809,13 @@ enqueue_dl_entity(struct sched_dl_entity
>  		setup_new_dl_entity(dl_se);
>  	}
>  
> +	/*
> +	 * If we are still throttled, eg. we got replenished but are a
> +	 * zero-laxity task and still got to wait, don't enqueue.
> +	 */
> +	if (dl_se->dl_throttled)
> +		return;
> +
>  	__enqueue_dl_entity(dl_se);
>  }

Let me see if I got it:

	- Always start the server, but throttled with full runtime...
	- Unconditionally decrement fair_server.dl_runtime from update_curr_fair()
		(check if it is not decremented twice as it runs)
	- When the dl timer fires, replenish or throttle for the next period?

is that the basis for it?

-- Daniel

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 6/7] sched/deadline: Deferrable dl server
  2023-09-08 15:28             ` Daniel Bristot de Oliveira
@ 2023-09-08 16:11               ` Peter Zijlstra
  0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2023-09-08 16:11 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Daniel Bristot de Oliveira, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Luca Abeni,
	Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld

On Fri, Sep 08, 2023 at 05:28:46PM +0200, Daniel Bristot de Oliveira wrote:

> Let me see if I got it:
> 
> 	- Always start the server, but throttled with full runtime...
> 	- Unconditionally decrement fair_server.dl_runtime from update_curr_fair()
> 		(check if it is not decremented twice as it runs)
> 	- When the dl timer fires, replenish or throttle for the next period?
> 
> is that the basis for it?

Yes. So if the dl timer fires and it still has runtime, replenish will not
move the deadline and it will become eligible to run. If it has 0
runtime, replenish does its thing, ups the runtime and moves the deadline
forward, and then the zero-laxity condition will re-throttle, goto 1,
etc..
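
In rough pseudo-C (invented "fake_" names, not the actual kernel paths),
that cycle reads like:

--- %< ---
/* Sketch only: the real logic lives in dl_task_timer()/replenish. */
#include <stdio.h>

struct fake_dl_se {
	long long runtime, deadline;
	long long dl_runtime, dl_period;
};

enum fake_timer_result { FAKE_ELIGIBLE, FAKE_RETHROTTLED };

static enum fake_timer_result
fake_dl_server_timer(struct fake_dl_se *dl_se, long long now)
{
	if (dl_se->runtime > 0) {
		/* Budget survived the period: replenish leaves the
		 * deadline alone and the server can run right away. */
		return FAKE_ELIGIBLE;
	}

	/* Budget was already consumed by normal fair execution: refill
	 * it and push the deadline one period ahead... */
	dl_se->deadline += dl_se->dl_period;
	dl_se->runtime  += dl_se->dl_runtime;

	/* ...and if we are still before the new zero-laxity point
	 * (deadline - runtime), throttle again and wait for the next
	 * timer fire: the "goto 1" above. */
	if (now < dl_se->deadline - dl_se->runtime)
		return FAKE_RETHROTTLED;

	return FAKE_ELIGIBLE;
}

int main(void)
{
	struct fake_dl_se se = {
		.runtime = 0, .deadline = 100, .dl_runtime = 10, .dl_period = 100,
	};

	/* Timer fires at t=120 with the budget already consumed: */
	printf("%s\n", fake_dl_server_timer(&se, 120) == FAKE_RETHROTTLED ?
	       "re-throttled" : "eligible");
	return 0;
}
--- >% ---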

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 1/7] sched: Unify runtime accounting across classes
  2023-08-31 20:28 ` [PATCH v4 1/7] sched: Unify runtime accounting across classes Daniel Bristot de Oliveira
@ 2023-09-15 21:41   ` Steven Rostedt
  0 siblings, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2023-09-15 21:41 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	Luca Abeni, Tommaso Cucinotta, Thomas Gleixner, Joel Fernandes,
	Vineeth Pillai, Shuah Khan, Phil Auld

On Thu, 31 Aug 2023 22:28:52 +0200
Daniel Bristot de Oliveira <bristot@kernel.org> wrote:

As I have a vested interest in this work, I started a deep dive into the
code.

> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1299,9 +1299,8 @@ static void update_curr_dl(struct rq *rq)
>  {
>  	struct task_struct *curr = rq->curr;
>  	struct sched_dl_entity *dl_se = &curr->dl;
> -	u64 delta_exec, scaled_delta_exec;
> +	s64 delta_exec, scaled_delta_exec;
>  	int cpu = cpu_of(rq);
> -	u64 now;
>  
>  	if (!dl_task(curr) || !on_dl_rq(dl_se))
>  		return;
> @@ -1314,21 +1313,13 @@ static void update_curr_dl(struct rq *rq)
>  	 * natural solution, but the full ramifications of this
>  	 * approach need further study.
>  	 */
> -	now = rq_clock_task(rq);
> -	delta_exec = now - curr->se.exec_start;
> -	if (unlikely((s64)delta_exec <= 0)) {
> +	delta_exec = update_curr_common(rq);

I have to say, mapping update_curr_common() to the removed code below
wasn't as easy as I thought it would be, as the above function is broken up
slightly differently.

But the conclusion does appear to be pretty much the same.

Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>

-- Steve


> +	if (unlikely(delta_exec <= 0)) {
>  		if (unlikely(dl_se->dl_yielded))
>  			goto throttle;
>  		return;
>  	}
>  
> -	schedstat_set(curr->stats.exec_max,
> -		      max(curr->stats.exec_max, delta_exec));
> -
> -	trace_sched_stat_runtime(curr, delta_exec, 0);
> -
> -	update_current_exec_runtime(curr, now, delta_exec);
> -
>  	if (dl_entity_is_special(dl_se))
>  		return;
>  

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2023-09-15 21:42 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-31 20:28 [PATCH v4 0/7] SCHED_DEADLINE server infrastructure Daniel Bristot de Oliveira
2023-08-31 20:28 ` [PATCH v4 1/7] sched: Unify runtime accounting across classes Daniel Bristot de Oliveira
2023-09-15 21:41   ` Steven Rostedt
2023-08-31 20:28 ` [PATCH v4 2/7] sched/deadline: Collect sched_dl_entity initialization Daniel Bristot de Oliveira
2023-08-31 20:28 ` [PATCH v4 3/7] sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity Daniel Bristot de Oliveira
2023-08-31 20:28 ` [PATCH v4 4/7] sched/deadline: Introduce deadline servers Daniel Bristot de Oliveira
2023-08-31 20:28 ` [PATCH v4 5/7] sched/fair: Add trivial fair server Daniel Bristot de Oliveira
2023-08-31 20:28 ` [PATCH v4 6/7] sched/deadline: Deferrable dl server Daniel Bristot de Oliveira
2023-09-05 13:42   ` Peter Zijlstra
2023-09-05 15:24     ` Daniel Bristot de Oliveira
2023-09-06  8:29       ` Peter Zijlstra
2023-09-06 14:58         ` Daniel Bristot de Oliveira
2023-09-06 20:04           ` Peter Zijlstra
2023-09-06 20:08             ` Peter Zijlstra
2023-09-08 14:14               ` Daniel Bristot de Oliveira
2023-09-08 13:59             ` Daniel Bristot de Oliveira
2023-09-07  8:07           ` Peter Zijlstra
2023-09-08 15:28             ` Daniel Bristot de Oliveira
2023-09-08 16:11               ` Peter Zijlstra
2023-08-31 20:28 ` [PATCH v4 7/7] sched/fair: Fair server interface Daniel Bristot de Oliveira
2023-09-01  2:01   ` kernel test robot
2023-09-05 13:55   ` Peter Zijlstra
2023-09-05 16:17     ` Daniel Bristot de Oliveira
2023-09-06  7:25       ` Peter Zijlstra
2023-09-06  8:25         ` Peter Zijlstra
