Linux-rt-users Archive on lore.kernel.org
 help / color / Atom feed
* [RFC PATCH v2 0/6] SCHED_DEADLINE server infrastructure
@ 2020-08-07  9:50 Juri Lelli
  2020-08-07  9:50 ` [RFC PATCH v2 1/6] sched: Unify runtime accounting across classes Juri Lelli
                   ` (7 more replies)
  0 siblings, 8 replies; 29+ messages in thread
From: Juri Lelli @ 2020-08-07  9:50 UTC (permalink / raw)
  To: peterz, mingo
  Cc: rostedt, tglx, linux-kernel, luca.abeni, tommaso.cucinotta,
	alessio.balsini, bristot, dietmar.eggemann, linux-rt-users,
	mtosatti, williams, valentin.schneider, Juri Lelli

Hi,

This is RFC v2 of Peter's SCHED_DEADLINE server infrastructure
implementation [1].

SCHED_DEADLINE servers can help fixing starvation issues of low priority
tasks (e.g., SCHED_OTHER) when higher priority tasks monopolize CPU
cycles. Today we have RT Throttling; DEADLINE servers should be able to
replace and improve that.

I rebased Peter's patches (adding changelogs where needed) on
tip/sched/core as of today and incorporated fixes to issues discussed
during RFC v1. Current set seems to even boot on real HW! :-)

While playing with RFC v1 set (and discussing it further offline with
Daniel) it has emerged the need to slightly change the behavior. Patch
6/6 is a (cumbersome?) attempt to show what's probably needed.
The problem with "original" implementation is that FIFO tasks might
suffer preemption from NORMAL even when spare CPU cycles are available.
In fact, fair deadline server is enqueued right away when NORMAL tasks
wake up and they are first scheduled by the server, thus potentially
preempting a well behaving FIFO task. This is of course not ideal.
So, in patch 6/6 I propose to use some kind of starvation monitor/
watchdog that delays enqueuing of deadline servers to the point when
fair tasks might start to actually suffer from starvation (just randomly
picked HZ/2 for now). One problem I already see with the current
implementation is that it adds overhead to fair paths, so I'm pretty
sure there are better ways to implement the idea (e.g., Daniel already
suggested using a starvation monitor kthread sort of thing).

Receiving comments and suggestions is the sole purpose of this posting
at this stage. Hopefully we can further discuss the idea at Plumbers in
a few weeks. So, please don't focus too much into actual implementation
(which I plan to revise anyway after I'm back from pto :), but try to
see if this might actually fly. The feature seems to be very much needed.

Thanks!

Juri

1 - https://lore.kernel.org/lkml/20190726145409.947503076@infradead.org/

Juri Lelli (1):
  sched/fair: Implement starvation monitor

Peter Zijlstra (5):
  sched: Unify runtime accounting across classes
  sched/deadline: Collect sched_dl_entity initialization
  sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity
  sched/deadline: Introduce deadline servers
  sched/fair: Add trivial fair server

 include/linux/sched.h    |  28 ++-
 kernel/sched/core.c      |  23 +-
 kernel/sched/deadline.c  | 483 ++++++++++++++++++++++++---------------
 kernel/sched/fair.c      | 136 ++++++++++-
 kernel/sched/rt.c        |  17 +-
 kernel/sched/sched.h     |  50 +++-
 kernel/sched/stop_task.c |  16 +-
 7 files changed, 522 insertions(+), 231 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC PATCH v2 1/6] sched: Unify runtime accounting across classes
  2020-08-07  9:50 [RFC PATCH v2 0/6] SCHED_DEADLINE server infrastructure Juri Lelli
@ 2020-08-07  9:50 ` Juri Lelli
  2020-08-07  9:50 ` [RFC PATCH v2 2/6] sched/deadline: Collect sched_dl_entity initialization Juri Lelli
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 29+ messages in thread
From: Juri Lelli @ 2020-08-07  9:50 UTC (permalink / raw)
  To: peterz, mingo
  Cc: rostedt, tglx, linux-kernel, luca.abeni, tommaso.cucinotta,
	alessio.balsini, bristot, dietmar.eggemann, linux-rt-users,
	mtosatti, williams, valentin.schneider

From: Peter Zijlstra <peterz@infradead.org>

All classes use sched_entity::exec_start to track runtime and have
copies of the exact same code around to compute runtime.

Collapse all that.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h    |  2 +-
 kernel/sched/deadline.c  | 17 +++-----------
 kernel/sched/fair.c      | 50 +++++++++++++++++++++++++++++++---------
 kernel/sched/rt.c        | 17 +++-----------
 kernel/sched/sched.h     |  2 ++
 kernel/sched/stop_task.c | 16 +------------
 6 files changed, 49 insertions(+), 55 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a6bf77c346876..f7b9ba04970bc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -424,7 +424,7 @@ struct sched_statistics {
 
 	u64				block_start;
 	u64				block_max;
-	u64				exec_max;
+	s64				exec_max;
 	u64				slice_max;
 
 	u64				nr_migrations_cold;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 3862a28cd05d0..2ece83b5991f5 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1221,9 +1221,8 @@ static void update_curr_dl(struct rq *rq)
 {
 	struct task_struct *curr = rq->curr;
 	struct sched_dl_entity *dl_se = &curr->dl;
-	u64 delta_exec, scaled_delta_exec;
+	s64 delta_exec, scaled_delta_exec;
 	int cpu = cpu_of(rq);
-	u64 now;
 
 	if (!dl_task(curr) || !on_dl_rq(dl_se))
 		return;
@@ -1236,23 +1235,13 @@ static void update_curr_dl(struct rq *rq)
 	 * natural solution, but the full ramifications of this
 	 * approach need further study.
 	 */
-	now = rq_clock_task(rq);
-	delta_exec = now - curr->se.exec_start;
-	if (unlikely((s64)delta_exec <= 0)) {
+	delta_exec = update_curr_common(rq);
+	if (unlikely(delta_exec <= 0)) {
 		if (unlikely(dl_se->dl_yielded))
 			goto throttle;
 		return;
 	}
 
-	schedstat_set(curr->se.statistics.exec_max,
-		      max(curr->se.statistics.exec_max, delta_exec));
-
-	curr->se.sum_exec_runtime += delta_exec;
-	account_group_exec_runtime(curr, delta_exec);
-
-	curr->se.exec_start = now;
-	cgroup_account_cputime(curr, delta_exec);
-
 	if (dl_entity_is_special(dl_se))
 		return;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2ba8f230feb9a..10a230d85104a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -836,30 +836,58 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
 }
 #endif /* CONFIG_SMP */
 
+static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
+{
+	u64 now = rq_clock_task(rq);
+	s64 delta_exec;
+
+	delta_exec = now - curr->exec_start;
+	if (unlikely(delta_exec <= 0))
+		return delta_exec;
+
+	curr->exec_start = now;
+	curr->sum_exec_runtime += delta_exec;
+
+	schedstat_set(curr->statistics.exec_max,
+		      max(delta_exec, curr->statistics.exec_max));
+
+	return delta_exec;
+}
+
+/*
+ * Used by other classes to account runtime.
+ */
+s64 update_curr_common(struct rq *rq)
+{
+	struct task_struct *curr = rq->curr;
+	s64 delta_exec;
+
+	delta_exec = update_curr_se(rq, &curr->se);
+	if (unlikely(delta_exec <= 0))
+		return delta_exec;
+
+	account_group_exec_runtime(curr, delta_exec);
+	cgroup_account_cputime(curr, delta_exec);
+
+	return delta_exec;
+}
+
 /*
  * Update the current task's runtime statistics.
  */
 static void update_curr(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *curr = cfs_rq->curr;
-	u64 now = rq_clock_task(rq_of(cfs_rq));
-	u64 delta_exec;
+	s64 delta_exec;
 
 	if (unlikely(!curr))
 		return;
 
-	delta_exec = now - curr->exec_start;
-	if (unlikely((s64)delta_exec <= 0))
+	delta_exec = update_curr_se(rq_of(cfs_rq), curr);
+	if (unlikely(delta_exec <= 0))
 		return;
 
-	curr->exec_start = now;
-
-	schedstat_set(curr->statistics.exec_max,
-		      max(delta_exec, curr->statistics.exec_max));
-
-	curr->sum_exec_runtime += delta_exec;
 	schedstat_add(cfs_rq->exec_clock, delta_exec);
-
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
 	update_min_vruntime(cfs_rq);
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f215eea6a9661..196171fbf5978 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -997,26 +997,15 @@ static void update_curr_rt(struct rq *rq)
 {
 	struct task_struct *curr = rq->curr;
 	struct sched_rt_entity *rt_se = &curr->rt;
-	u64 delta_exec;
-	u64 now;
+	s64 delta_exec;
 
 	if (curr->sched_class != &rt_sched_class)
 		return;
 
-	now = rq_clock_task(rq);
-	delta_exec = now - curr->se.exec_start;
-	if (unlikely((s64)delta_exec <= 0))
+	delta_exec = update_curr_common(rq);
+	if (unlikely(delta_exec < 0))
 		return;
 
-	schedstat_set(curr->se.statistics.exec_max,
-		      max(curr->se.statistics.exec_max, delta_exec));
-
-	curr->se.sum_exec_runtime += delta_exec;
-	account_group_exec_runtime(curr, delta_exec);
-
-	curr->se.exec_start = now;
-	cgroup_account_cputime(curr, delta_exec);
-
 	if (!rt_bandwidth_enabled())
 		return;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3fd283892761d..963c16fc27500 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1769,6 +1769,8 @@ extern const u32		sched_prio_to_wmult[40];
 
 #define RETRY_TASK		((void *)-1UL)
 
+extern s64 update_curr_common(struct rq *rq);
+
 struct sched_class {
 
 #ifdef CONFIG_UCLAMP_TASK
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 394bc8126a1e5..1eb1e336e18e7 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -62,21 +62,7 @@ static void yield_task_stop(struct rq *rq)
 
 static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
 {
-	struct task_struct *curr = rq->curr;
-	u64 delta_exec;
-
-	delta_exec = rq_clock_task(rq) - curr->se.exec_start;
-	if (unlikely((s64)delta_exec < 0))
-		delta_exec = 0;
-
-	schedstat_set(curr->se.statistics.exec_max,
-			max(curr->se.statistics.exec_max, delta_exec));
-
-	curr->se.sum_exec_runtime += delta_exec;
-	account_group_exec_runtime(curr, delta_exec);
-
-	curr->se.exec_start = rq_clock_task(rq);
-	cgroup_account_cputime(curr, delta_exec);
+	update_curr_common(rq);
 }
 
 /*
-- 
2.26.2


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC PATCH v2 2/6] sched/deadline: Collect sched_dl_entity initialization
  2020-08-07  9:50 [RFC PATCH v2 0/6] SCHED_DEADLINE server infrastructure Juri Lelli
  2020-08-07  9:50 ` [RFC PATCH v2 1/6] sched: Unify runtime accounting across classes Juri Lelli
@ 2020-08-07  9:50 ` Juri Lelli
  2020-08-07  9:50 ` [RFC PATCH v2 3/6] sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity Juri Lelli
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 29+ messages in thread
From: Juri Lelli @ 2020-08-07  9:50 UTC (permalink / raw)
  To: peterz, mingo
  Cc: rostedt, tglx, linux-kernel, luca.abeni, tommaso.cucinotta,
	alessio.balsini, bristot, dietmar.eggemann, linux-rt-users,
	mtosatti, williams, valentin.schneider

From: Peter Zijlstra <peterz@infradead.org>

Create a single function that initializes a sched_dl_entity.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c     |  5 +----
 kernel/sched/deadline.c | 22 +++++++++++++++-------
 kernel/sched/sched.h    |  5 +----
 3 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 12e1f3a2cabc6..6b36bf82b53c2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3081,10 +3081,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
 #endif
 
-	RB_CLEAR_NODE(&p->dl.rb_node);
-	init_dl_task_timer(&p->dl);
-	init_dl_inactive_task_timer(&p->dl);
-	__dl_clear_params(p);
+	init_dl_entity(&p->dl);
 
 	INIT_LIST_HEAD(&p->rt.run_list);
 	p->rt.timeout		= 0;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 2ece83b5991f5..8d909bdb9a119 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -219,6 +219,8 @@ static void dl_change_utilization(struct task_struct *p, u64 new_bw)
 	__add_rq_bw(new_bw, &rq->dl);
 }
 
+static void __dl_clear_params(struct sched_dl_entity *dl_se);
+
 /*
  * The utilization of a task cannot be immediately removed from
  * the rq active utilization (running_bw) when the task blocks.
@@ -317,7 +319,7 @@ static void task_non_contending(struct task_struct *p)
 				sub_rq_bw(&p->dl, &rq->dl);
 			raw_spin_lock(&dl_b->lock);
 			__dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
-			__dl_clear_params(p);
+			__dl_clear_params(dl_se);
 			raw_spin_unlock(&dl_b->lock);
 		}
 
@@ -1123,7 +1125,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 	return HRTIMER_NORESTART;
 }
 
-void init_dl_task_timer(struct sched_dl_entity *dl_se)
+static void init_dl_task_timer(struct sched_dl_entity *dl_se)
 {
 	struct hrtimer *timer = &dl_se->dl_timer;
 
@@ -1335,7 +1337,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 		raw_spin_lock(&dl_b->lock);
 		__dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
 		raw_spin_unlock(&dl_b->lock);
-		__dl_clear_params(p);
+		__dl_clear_params(dl_se);
 
 		goto unlock;
 	}
@@ -1351,7 +1353,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 	return HRTIMER_NORESTART;
 }
 
-void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se)
+static void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se)
 {
 	struct hrtimer *timer = &dl_se->inactive_timer;
 
@@ -2741,10 +2743,8 @@ bool __checkparam_dl(const struct sched_attr *attr)
 /*
  * This function clears the sched_dl_entity static params.
  */
-void __dl_clear_params(struct task_struct *p)
+static void __dl_clear_params(struct sched_dl_entity *dl_se)
 {
-	struct sched_dl_entity *dl_se = &p->dl;
-
 	dl_se->dl_runtime		= 0;
 	dl_se->dl_deadline		= 0;
 	dl_se->dl_period		= 0;
@@ -2759,6 +2759,14 @@ void __dl_clear_params(struct task_struct *p)
 	dl_se->dl_overrun		= 0;
 }
 
+void init_dl_entity(struct sched_dl_entity *dl_se)
+{
+	RB_CLEAR_NODE(&dl_se->rb_node);
+	init_dl_task_timer(dl_se);
+	init_dl_inactive_task_timer(dl_se);
+	__dl_clear_params(dl_se);
+}
+
 bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr)
 {
 	struct sched_dl_entity *dl_se = &p->dl;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 963c16fc27500..62304d4de99cc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -255,8 +255,6 @@ struct rt_bandwidth {
 	unsigned int		rt_period_active;
 };
 
-void __dl_clear_params(struct task_struct *p);
-
 /*
  * To keep the bandwidth of -deadline tasks and groups under control
  * we need some place where:
@@ -1939,8 +1937,7 @@ extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime
 
 extern struct dl_bandwidth def_dl_bandwidth;
 extern void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime);
-extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
-extern void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se);
+extern void init_dl_entity(struct sched_dl_entity *dl_se);
 
 #define BW_SHIFT		20
 #define BW_UNIT			(1 << BW_SHIFT)
-- 
2.26.2


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC PATCH v2 3/6] sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity
  2020-08-07  9:50 [RFC PATCH v2 0/6] SCHED_DEADLINE server infrastructure Juri Lelli
  2020-08-07  9:50 ` [RFC PATCH v2 1/6] sched: Unify runtime accounting across classes Juri Lelli
  2020-08-07  9:50 ` [RFC PATCH v2 2/6] sched/deadline: Collect sched_dl_entity initialization Juri Lelli
@ 2020-08-07  9:50 ` Juri Lelli
  2020-08-07  9:50 ` [RFC PATCH v2 4/6] sched/deadline: Introduce deadline servers Juri Lelli
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 29+ messages in thread
From: Juri Lelli @ 2020-08-07  9:50 UTC (permalink / raw)
  To: peterz, mingo
  Cc: rostedt, tglx, linux-kernel, luca.abeni, tommaso.cucinotta,
	alessio.balsini, bristot, dietmar.eggemann, linux-rt-users,
	mtosatti, williams, valentin.schneider

From: Peter Zijlstra <peterz@infradead.org>

In preparation of introducing !task sched_dl_entity; move the
bandwidth accounting into {en.de}queue_dl_entity().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/deadline.c | 128 ++++++++++++++++++++++------------------
 kernel/sched/sched.h    |   6 ++
 2 files changed, 77 insertions(+), 57 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 8d909bdb9a119..d4007d1461522 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -275,12 +275,12 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se);
  * up, and checks if the task is still in the "ACTIVE non contending"
  * state or not (in the second case, it updates running_bw).
  */
-static void task_non_contending(struct task_struct *p)
+static void task_non_contending(struct sched_dl_entity *dl_se)
 {
-	struct sched_dl_entity *dl_se = &p->dl;
 	struct hrtimer *timer = &dl_se->inactive_timer;
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
+	struct task_struct *p = dl_task_of(dl_se);
 	s64 zerolag_time;
 
 	/*
@@ -312,13 +312,14 @@ static void task_non_contending(struct task_struct *p)
 	if ((zerolag_time < 0) || hrtimer_active(&dl_se->inactive_timer)) {
 		if (dl_task(p))
 			sub_running_bw(dl_se, dl_rq);
+
 		if (!dl_task(p) || p->state == TASK_DEAD) {
 			struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
 
 			if (p->state == TASK_DEAD)
-				sub_rq_bw(&p->dl, &rq->dl);
+				sub_rq_bw(dl_se, &rq->dl);
 			raw_spin_lock(&dl_b->lock);
-			__dl_sub(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
+			__dl_sub(dl_b, dl_se->dl_bw, dl_bw_cpus(task_cpu(p)));
 			__dl_clear_params(dl_se);
 			raw_spin_unlock(&dl_b->lock);
 		}
@@ -1477,6 +1478,41 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
 {
 	BUG_ON(on_dl_rq(dl_se));
 
+	/*
+	 * Check if a constrained deadline task was activated
+	 * after the deadline but before the next period.
+	 * If that is the case, the task will be throttled and
+	 * the replenishment timer will be set to the next period.
+	 */
+	if (!dl_se->dl_throttled && !dl_is_implicit(dl_se))
+		dl_check_constrained_dl(dl_se);
+
+	if (flags & (ENQUEUE_RESTORE|ENQUEUE_MIGRATING)) {
+		struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+		add_rq_bw(dl_se, dl_rq);
+		add_running_bw(dl_se, dl_rq);
+	}
+
+	/*
+	 * If p is throttled, we do not enqueue it. In fact, if it exhausted
+	 * its budget it needs a replenishment and, since it now is on
+	 * its rq, the bandwidth timer callback (which clearly has not
+	 * run yet) will take care of this.
+	 * However, the active utilization does not depend on the fact
+	 * that the task is on the runqueue or not (but depends on the
+	 * task's state - in GRUB parlance, "inactive" vs "active contending").
+	 * In other words, even if a task is throttled its utilization must
+	 * be counted in the active utilization; hence, we need to call
+	 * add_running_bw().
+	 */
+	if (dl_se->dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
+		if (flags & ENQUEUE_WAKEUP)
+			task_contending(dl_se, flags);
+
+		return;
+	}
+
 	/*
 	 * If this is a wakeup or a new instance, the scheduling
 	 * parameters of the task might need updating. Otherwise,
@@ -1496,9 +1532,28 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
 	__enqueue_dl_entity(dl_se);
 }
 
-static void dequeue_dl_entity(struct sched_dl_entity *dl_se)
+static void dequeue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 {
 	__dequeue_dl_entity(dl_se);
+
+	if (flags & (DEQUEUE_SAVE|DEQUEUE_MIGRATING)) {
+		struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+		sub_running_bw(dl_se, dl_rq);
+		sub_rq_bw(dl_se, dl_rq);
+	}
+
+	/*
+	 * This check allows to start the inactive timer (or to immediately
+	 * decrease the active utilization, if needed) in two cases:
+	 * when the task blocks and when it is terminating
+	 * (p->state == TASK_DEAD). We can handle the two cases in the same
+	 * way, because from GRUB's point of view the same thing is happening
+	 * (the task moves from "active contending" to "active non contending"
+	 * or "inactive")
+	 */
+	if (flags & DEQUEUE_SLEEP)
+		task_non_contending(dl_se);
 }
 
 static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
@@ -1528,72 +1583,31 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 		return;
 	}
 
-	/*
-	 * Check if a constrained deadline task was activated
-	 * after the deadline but before the next period.
-	 * If that is the case, the task will be throttled and
-	 * the replenishment timer will be set to the next period.
-	 */
-	if (!p->dl.dl_throttled && !dl_is_implicit(&p->dl))
-		dl_check_constrained_dl(&p->dl);
-
-	if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & ENQUEUE_RESTORE) {
-		add_rq_bw(&p->dl, &rq->dl);
-		add_running_bw(&p->dl, &rq->dl);
-	}
-
-	/*
-	 * If p is throttled, we do not enqueue it. In fact, if it exhausted
-	 * its budget it needs a replenishment and, since it now is on
-	 * its rq, the bandwidth timer callback (which clearly has not
-	 * run yet) will take care of this.
-	 * However, the active utilization does not depend on the fact
-	 * that the task is on the runqueue or not (but depends on the
-	 * task's state - in GRUB parlance, "inactive" vs "active contending").
-	 * In other words, even if a task is throttled its utilization must
-	 * be counted in the active utilization; hence, we need to call
-	 * add_running_bw().
-	 */
-	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
-		if (flags & ENQUEUE_WAKEUP)
-			task_contending(&p->dl, flags);
-
-		return;
-	}
+	if (p->on_rq == TASK_ON_RQ_MIGRATING)
+		flags |= ENQUEUE_MIGRATING;
 
 	enqueue_dl_entity(&p->dl, pi_se, flags);
 
-	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
+	if (!task_current(rq, p) && !p->dl.dl_throttled && p->nr_cpus_allowed > 1)
 		enqueue_pushable_dl_task(rq, p);
 }
 
 static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
-	dequeue_dl_entity(&p->dl);
-	dequeue_pushable_dl_task(rq, p);
+	dequeue_dl_entity(&p->dl, flags);
+
+	if (!p->dl.dl_throttled)
+		dequeue_pushable_dl_task(rq, p);
 }
 
 static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	update_curr_dl(rq);
-	__dequeue_task_dl(rq, p, flags);
 
-	if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & DEQUEUE_SAVE) {
-		sub_running_bw(&p->dl, &rq->dl);
-		sub_rq_bw(&p->dl, &rq->dl);
-	}
+	if (p->on_rq == TASK_ON_RQ_MIGRATING)
+		flags |= DEQUEUE_MIGRATING;
 
-	/*
-	 * This check allows to start the inactive timer (or to immediately
-	 * decrease the active utilization, if needed) in two cases:
-	 * when the task blocks and when it is terminating
-	 * (p->state == TASK_DEAD). We can handle the two cases in the same
-	 * way, because from GRUB's point of view the same thing is happening
-	 * (the task moves from "active contending" to "active non contending"
-	 * or "inactive")
-	 */
-	if (flags & DEQUEUE_SLEEP)
-		task_non_contending(p);
+	__dequeue_task_dl(rq, p, flags);
 }
 
 /*
@@ -2373,7 +2387,7 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
 	 * will reset the task parameters.
 	 */
 	if (task_on_rq_queued(p) && p->dl.dl_runtime)
-		task_non_contending(p);
+		task_non_contending(&p->dl);
 
 	if (!task_on_rq_queued(p)) {
 		/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 62304d4de99cc..d3db8c7d8b641 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1741,6 +1741,10 @@ extern const u32		sched_prio_to_wmult[40];
  * MOVE - paired with SAVE/RESTORE, explicitly does not preserve the location
  *        in the runqueue.
  *
+ * NOCLOCK - skip the update_rq_clock() (avoids double updates)
+ *
+ * MIGRATION - p->on_rq == TASK_ON_RQ_MIGRATING (used for DEADLINE)
+ *
  * ENQUEUE_HEAD      - place at front of runqueue (tail if not specified)
  * ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline)
  * ENQUEUE_MIGRATED  - the task was migrated during wakeup
@@ -1751,6 +1755,7 @@ extern const u32		sched_prio_to_wmult[40];
 #define DEQUEUE_SAVE		0x02 /* Matches ENQUEUE_RESTORE */
 #define DEQUEUE_MOVE		0x04 /* Matches ENQUEUE_MOVE */
 #define DEQUEUE_NOCLOCK		0x08 /* Matches ENQUEUE_NOCLOCK */
+#define DEQUEUE_MIGRATING	0x80 /* Matches ENQUEUE_MIGRATING */
 
 #define ENQUEUE_WAKEUP		0x01
 #define ENQUEUE_RESTORE		0x02
@@ -1764,6 +1769,7 @@ extern const u32		sched_prio_to_wmult[40];
 #else
 #define ENQUEUE_MIGRATED	0x00
 #endif
+#define ENQUEUE_MIGRATING	0x80
 
 #define RETRY_TASK		((void *)-1UL)
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC PATCH v2 4/6] sched/deadline: Introduce deadline servers
  2020-08-07  9:50 [RFC PATCH v2 0/6] SCHED_DEADLINE server infrastructure Juri Lelli
                   ` (2 preceding siblings ...)
  2020-08-07  9:50 ` [RFC PATCH v2 3/6] sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity Juri Lelli
@ 2020-08-07  9:50 ` Juri Lelli
  2020-10-06  7:56   ` luca abeni
  2020-08-07  9:50 ` [RFC PATCH v2 5/6] sched/fair: Add trivial fair server Juri Lelli
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 29+ messages in thread
From: Juri Lelli @ 2020-08-07  9:50 UTC (permalink / raw)
  To: peterz, mingo
  Cc: rostedt, tglx, linux-kernel, luca.abeni, tommaso.cucinotta,
	alessio.balsini, bristot, dietmar.eggemann, linux-rt-users,
	mtosatti, williams, valentin.schneider

From: Peter Zijlstra <peterz@infradead.org>

Low priority tasks (e.g., SCHED_OTHER) can suffer starvation if tasks
with higher priority (e.g., SCHED_FIFO) monopolize CPU(s).

RT Throttling has been introduced a while ago as a (mostly debug)
countermeasure one can utilize to reserve some CPU time for low priority
tasks (usually background type of work, e.g. workqueues, timers, etc.).
It however has its own problems (see documentation) and the undesired
effect of unconditionally throttling FIFO tasks even when no lower
priority activity needs to run (there are mechanisms to fix this issue
as well, but, again, with their own problems).

Introduce deadline servers to service low priority tasks needs under
starvation conditions. Deadline servers are built extending SCHED_DEADLINE
implementation to allow 2-level scheduling (a sched_deadline entity
becomes a container for lower priority scheduling entities).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h   |  26 +++-
 kernel/sched/core.c     |  17 ++
 kernel/sched/deadline.c | 336 ++++++++++++++++++++++++++--------------
 kernel/sched/fair.c     |   4 +
 kernel/sched/sched.h    |  29 ++++
 5 files changed, 298 insertions(+), 114 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f7b9ba04970bc..3ede6dd39e55e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -55,12 +55,14 @@ struct robust_list_head;
 struct root_domain;
 struct rq;
 struct sched_attr;
+struct sched_dl_entity;
 struct sched_param;
 struct seq_file;
 struct sighand_struct;
 struct signal_struct;
 struct task_delay_info;
 struct task_group;
+struct task_struct;
 
 /*
  * Task state bitmask. NOTE! These bits are also
@@ -501,6 +503,9 @@ struct sched_rt_entity {
 #endif
 } __randomize_layout;
 
+typedef bool (*dl_server_has_tasks_f)(struct sched_dl_entity *);
+typedef struct task_struct *(*dl_server_pick_f)(struct sched_dl_entity *);
+
 struct sched_dl_entity {
 	struct rb_node			rb_node;
 
@@ -553,6 +558,7 @@ struct sched_dl_entity {
 	unsigned int			dl_yielded        : 1;
 	unsigned int			dl_non_contending : 1;
 	unsigned int			dl_overrun	  : 1;
+	unsigned int			dl_server         : 1;
 
 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
@@ -567,7 +573,20 @@ struct sched_dl_entity {
 	 * timer is needed to decrease the active utilization at the correct
 	 * time.
 	 */
-	struct hrtimer inactive_timer;
+	struct hrtimer			inactive_timer;
+
+	/*
+	 * Bits for DL-server functionality. Also see the comment near
+	 * dl_server_update().
+	 *
+	 * @rq the runqueue this server is for
+	 *
+	 * @server_has_tasks() returns true if @server_pick return a
+	 * runnable task.
+	 */
+	struct rq			*rq;
+	dl_server_has_tasks_f		server_has_tasks;
+	dl_server_pick_f		server_pick;
 };
 
 #ifdef CONFIG_UCLAMP_TASK
@@ -680,10 +699,13 @@ struct task_struct {
 	const struct sched_class	*sched_class;
 	struct sched_entity		se;
 	struct sched_rt_entity		rt;
+	struct sched_dl_entity		dl;
+
+	struct sched_dl_entity		*server;
+
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group		*sched_task_group;
 #endif
-	struct sched_dl_entity		dl;
 
 #ifdef CONFIG_UCLAMP_TASK
 	/*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6b36bf82b53c2..7c471961fd0b8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3100,6 +3100,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 #ifdef CONFIG_SMP
 	p->wake_entry.u_flags = CSD_TYPE_TTWU;
 #endif
+
+	p->server = NULL;
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
@@ -4348,12 +4350,27 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			p = pick_next_task_idle(rq);
 		}
 
+		/*
+		 * This is the fast path; it cannot be a DL server pick;
+		 * therefore even if @p == @prev, ->server must be NULL.
+		 */
+		if (p->server)
+			p->server = NULL;
+
 		return p;
 	}
 
 restart:
 	put_prev_task_balance(rq, prev, rf);
 
+	/*
+	 * We've updated @prev and no longer need the server link, clear it.
+	 * Must be done before ->pick_next_task() because that can (re)set
+	 * ->server.
+	 */
+	if (prev->server)
+		prev->server = NULL;
+
 	for_each_class(class) {
 		p = class->pick_next_task(rq);
 		if (p)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d4007d1461522..26dd82f0a586a 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -20,8 +20,14 @@
 
 struct dl_bandwidth def_dl_bandwidth;
 
+static bool dl_server(struct sched_dl_entity *dl_se)
+{
+	return dl_se->dl_server;
+}
+
 static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
 {
+	BUG_ON(dl_server(dl_se));
 	return container_of(dl_se, struct task_struct, dl);
 }
 
@@ -30,14 +36,22 @@ static inline struct rq *rq_of_dl_rq(struct dl_rq *dl_rq)
 	return container_of(dl_rq, struct rq, dl);
 }
 
-static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
+static inline struct rq *rq_of_dl_se(struct sched_dl_entity *dl_se)
 {
-	struct task_struct *p = dl_task_of(dl_se);
-	struct rq *rq = task_rq(p);
+	struct rq *rq = dl_se->rq;
+
+	if (!dl_server(dl_se))
+		rq = task_rq(dl_task_of(dl_se));
 
-	return &rq->dl;
+	return rq;
 }
 
+static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
+{
+	return &rq_of_dl_se(dl_se)->dl;
+}
+
+
 static inline int on_dl_rq(struct sched_dl_entity *dl_se)
 {
 	return !RB_EMPTY_NODE(&dl_se->rb_node);
@@ -278,9 +292,8 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se);
 static void task_non_contending(struct sched_dl_entity *dl_se)
 {
 	struct hrtimer *timer = &dl_se->inactive_timer;
-	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
-	struct rq *rq = rq_of_dl_rq(dl_rq);
-	struct task_struct *p = dl_task_of(dl_se);
+	struct rq *rq = rq_of_dl_se(dl_se);
+	struct dl_rq *dl_rq = &rq->dl;
 	s64 zerolag_time;
 
 	/*
@@ -310,25 +323,32 @@ static void task_non_contending(struct sched_dl_entity *dl_se)
 	 * utilization now, instead of starting a timer
 	 */
 	if ((zerolag_time < 0) || hrtimer_active(&dl_se->inactive_timer)) {
-		if (dl_task(p))
+		if (dl_server(dl_se)) {
 			sub_running_bw(dl_se, dl_rq);
+		} else {
+			struct task_struct *p = dl_task_of(dl_se);
+
+			if (dl_task(p))
+				sub_running_bw(dl_se, dl_rq);
 
-		if (!dl_task(p) || p->state == TASK_DEAD) {
-			struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+			if (!dl_task(p) || p->state == TASK_DEAD) {
+				struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
 
-			if (p->state == TASK_DEAD)
-				sub_rq_bw(dl_se, &rq->dl);
-			raw_spin_lock(&dl_b->lock);
-			__dl_sub(dl_b, dl_se->dl_bw, dl_bw_cpus(task_cpu(p)));
-			__dl_clear_params(dl_se);
-			raw_spin_unlock(&dl_b->lock);
+				if (p->state == TASK_DEAD)
+					sub_rq_bw(dl_se, &rq->dl);
+				raw_spin_lock(&dl_b->lock);
+				__dl_sub(dl_b, dl_se->dl_bw, dl_bw_cpus(task_cpu(p)));
+				__dl_clear_params(dl_se);
+				raw_spin_unlock(&dl_b->lock);
+			}
 		}
 
 		return;
 	}
 
 	dl_se->dl_non_contending = 1;
-	get_task_struct(p);
+	if (!dl_server(dl_se))
+		get_task_struct(dl_task_of(dl_se));
 	hrtimer_start(timer, ns_to_ktime(zerolag_time), HRTIMER_MODE_REL_HARD);
 }
 
@@ -355,8 +375,10 @@ static void task_contending(struct sched_dl_entity *dl_se, int flags)
 		 * will not touch the rq's active utilization,
 		 * so we are still safe.
 		 */
-		if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1)
-			put_task_struct(dl_task_of(dl_se));
+		if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1) {
+			if (!dl_server(dl_se))
+				put_task_struct(dl_task_of(dl_se));
+		}
 	} else {
 		/*
 		 * Since "dl_non_contending" is not set, the
@@ -369,10 +391,8 @@ static void task_contending(struct sched_dl_entity *dl_se, int flags)
 	}
 }
 
-static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
+static inline int is_leftmost(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 {
-	struct sched_dl_entity *dl_se = &p->dl;
-
 	return dl_rq->root.rb_leftmost == &dl_se->rb_node;
 }
 
@@ -468,8 +488,6 @@ static void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 
 	if (p->nr_cpus_allowed > 1)
 		dl_rq->dl_nr_migratory++;
-
-	update_dl_migration(dl_rq);
 }
 
 static void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
@@ -478,8 +496,6 @@ static void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 
 	if (p->nr_cpus_allowed > 1)
 		dl_rq->dl_nr_migratory--;
-
-	update_dl_migration(dl_rq);
 }
 
 /*
@@ -680,8 +696,11 @@ static inline void deadline_queue_pull_task(struct rq *rq)
 }
 #endif /* CONFIG_SMP */
 
+static void
+enqueue_dl_entity(struct sched_dl_entity *dl_se,
+		  struct sched_dl_entity *pi_se, int flags);
 static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
-static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
+static void dequeue_dl_entity(struct sched_dl_entity *dl_se, int flags);
 static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p, int flags);
 
 /*
@@ -928,8 +947,7 @@ static inline bool dl_is_implicit(struct sched_dl_entity *dl_se)
 static void update_dl_entity(struct sched_dl_entity *dl_se,
 			     struct sched_dl_entity *pi_se)
 {
-	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
-	struct rq *rq = rq_of_dl_rq(dl_rq);
+	struct rq *rq = rq_of_dl_se(dl_se);
 
 	if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
 	    dl_entity_overflow(dl_se, pi_se, rq_clock(rq))) {
@@ -961,11 +979,11 @@ static inline u64 dl_next_period(struct sched_dl_entity *dl_se)
  * actually started or not (i.e., the replenishment instant is in
  * the future or in the past).
  */
-static int start_dl_timer(struct task_struct *p)
+static int start_dl_timer(struct sched_dl_entity *dl_se)
 {
-	struct sched_dl_entity *dl_se = &p->dl;
 	struct hrtimer *timer = &dl_se->dl_timer;
-	struct rq *rq = task_rq(p);
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
 	ktime_t now, act;
 	s64 delta;
 
@@ -999,13 +1017,33 @@ static int start_dl_timer(struct task_struct *p)
 	 * and observe our state.
 	 */
 	if (!hrtimer_is_queued(timer)) {
-		get_task_struct(p);
+		if (!dl_server(dl_se))
+			get_task_struct(dl_task_of(dl_se));
 		hrtimer_start(timer, act, HRTIMER_MODE_ABS_HARD);
 	}
 
 	return 1;
 }
 
+static void __push_dl_task(struct rq *rq, struct rq_flags *rf)
+{
+#ifdef CONFIG_SMP
+	/*
+	 * Queueing this task back might have overloaded rq, check if we need
+	 * to kick someone away.
+	 */
+	if (has_pushable_dl_tasks(rq)) {
+		/*
+		 * Nothing relies on rq->lock after this, so its safe to drop
+		 * rq->lock.
+		 */
+		rq_unpin_lock(rq, rf);
+		push_dl_task(rq);
+		rq_repin_lock(rq, rf);
+	}
+#endif
+}
+
 /*
  * This is the bandwidth enforcement timer callback. If here, we know
  * a task is not on its dl_rq, since the fact that the timer was running
@@ -1024,10 +1062,34 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 	struct sched_dl_entity *dl_se = container_of(timer,
 						     struct sched_dl_entity,
 						     dl_timer);
-	struct task_struct *p = dl_task_of(dl_se);
+	struct task_struct *p;
 	struct rq_flags rf;
 	struct rq *rq;
 
+	if (dl_server(dl_se)) {
+		struct rq *rq = rq_of_dl_se(dl_se);
+		struct rq_flags rf;
+
+		rq_lock(rq, &rf);
+		if (dl_se->dl_throttled) {
+			sched_clock_tick();
+			update_rq_clock(rq);
+
+			if (dl_se->server_has_tasks(dl_se)) {
+				enqueue_dl_entity(dl_se, dl_se, ENQUEUE_REPLENISH);
+				resched_curr(rq);
+				__push_dl_task(rq, &rf);
+			} else {
+				replenish_dl_entity(dl_se, dl_se);
+			}
+
+		}
+		rq_unlock(rq, &rf);
+
+		return HRTIMER_NORESTART;
+	}
+
+	p = dl_task_of(dl_se);
 	rq = task_rq_lock(p, &rf);
 
 	/*
@@ -1098,21 +1160,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 	else
 		resched_curr(rq);
 
-#ifdef CONFIG_SMP
-	/*
-	 * Queueing this task back might have overloaded rq, check if we need
-	 * to kick someone away.
-	 */
-	if (has_pushable_dl_tasks(rq)) {
-		/*
-		 * Nothing relies on rq->lock after this, so its safe to drop
-		 * rq->lock.
-		 */
-		rq_unpin_lock(rq, &rf);
-		push_dl_task(rq);
-		rq_repin_lock(rq, &rf);
-	}
-#endif
+	__push_dl_task(rq, &rf);
 
 unlock:
 	task_rq_unlock(rq, p, &rf);
@@ -1154,12 +1202,11 @@ static void init_dl_task_timer(struct sched_dl_entity *dl_se)
  */
 static inline void dl_check_constrained_dl(struct sched_dl_entity *dl_se)
 {
-	struct task_struct *p = dl_task_of(dl_se);
-	struct rq *rq = rq_of_dl_rq(dl_rq_of_se(dl_se));
+	struct rq *rq = rq_of_dl_se(dl_se);
 
 	if (dl_time_before(dl_se->deadline, rq_clock(rq)) &&
 	    dl_time_before(rq_clock(rq), dl_next_period(dl_se))) {
-		if (unlikely(dl_se->dl_boosted || !start_dl_timer(p)))
+		if (unlikely(dl_se->dl_boosted || !start_dl_timer(dl_se)))
 			return;
 		dl_se->dl_throttled = 1;
 		if (dl_se->runtime > 0)
@@ -1216,29 +1263,10 @@ static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
 	return (delta * u_act) >> BW_SHIFT;
 }
 
-/*
- * Update the current task's runtime statistics (provided it is still
- * a -deadline task and has not been removed from the dl_rq).
- */
-static void update_curr_dl(struct rq *rq)
+static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta_exec)
 {
-	struct task_struct *curr = rq->curr;
-	struct sched_dl_entity *dl_se = &curr->dl;
-	s64 delta_exec, scaled_delta_exec;
-	int cpu = cpu_of(rq);
+	s64 scaled_delta_exec;
 
-	if (!dl_task(curr) || !on_dl_rq(dl_se))
-		return;
-
-	/*
-	 * Consumed budget is computed considering the time as
-	 * observed by schedulable tasks (excluding time spent
-	 * in hardirq context, etc.). Deadlines are instead
-	 * computed using hard walltime. This seems to be the more
-	 * natural solution, but the full ramifications of this
-	 * approach need further study.
-	 */
-	delta_exec = update_curr_common(rq);
 	if (unlikely(delta_exec <= 0)) {
 		if (unlikely(dl_se->dl_yielded))
 			goto throttle;
@@ -1256,10 +1284,9 @@ static void update_curr_dl(struct rq *rq)
 	 * according to current frequency and CPU maximum capacity.
 	 */
 	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
-		scaled_delta_exec = grub_reclaim(delta_exec,
-						 rq,
-						 &curr->dl);
+		scaled_delta_exec = grub_reclaim(delta_exec, rq, dl_se);
 	} else {
+		int cpu = cpu_of(rq);
 		unsigned long scale_freq = arch_scale_freq_capacity(cpu);
 		unsigned long scale_cpu = arch_scale_cpu_capacity(cpu);
 
@@ -1278,11 +1305,18 @@ static void update_curr_dl(struct rq *rq)
 		    (dl_se->flags & SCHED_FLAG_DL_OVERRUN))
 			dl_se->dl_overrun = 1;
 
-		__dequeue_task_dl(rq, curr, 0);
-		if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr)))
-			enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
+		dequeue_dl_entity(dl_se, 0);
+		if (!dl_server(dl_se))
+			dequeue_pushable_dl_task(rq, dl_task_of(dl_se));
 
-		if (!is_leftmost(curr, &rq->dl))
+		if (unlikely(dl_se->dl_boosted || !start_dl_timer(dl_se))) {
+			if (dl_server(dl_se))
+				enqueue_dl_entity(dl_se, dl_se, ENQUEUE_REPLENISH);
+			else
+				enqueue_task_dl(rq, dl_task_of(dl_se), ENQUEUE_REPLENISH);
+		}
+
+		if (!is_leftmost(dl_se, &rq->dl))
 			resched_curr(rq);
 	}
 
@@ -1312,20 +1346,82 @@ static void update_curr_dl(struct rq *rq)
 	}
 }
 
+void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
+{
+	update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
+}
+
+void dl_server_start(struct sched_dl_entity *dl_se)
+{
+	if (!dl_server(dl_se)) {
+		dl_se->dl_server = 1;
+		setup_new_dl_entity(dl_se);
+	}
+	enqueue_dl_entity(dl_se, dl_se, ENQUEUE_WAKEUP);
+}
+
+void dl_server_stop(struct sched_dl_entity *dl_se)
+{
+	dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
+}
+
+void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
+		    dl_server_has_tasks_f has_tasks,
+		    dl_server_pick_f pick)
+{
+	dl_se->rq = rq;
+	dl_se->server_has_tasks = has_tasks;
+	dl_se->server_pick = pick;
+}
+
+/*
+ * Update the current task's runtime statistics (provided it is still
+ * a -deadline task and has not been removed from the dl_rq).
+ */
+static void update_curr_dl(struct rq *rq)
+{
+	struct task_struct *curr = rq->curr;
+	struct sched_dl_entity *dl_se = &curr->dl;
+	s64 delta_exec;
+
+	if (!dl_task(curr) || !on_dl_rq(dl_se))
+		return;
+
+	/*
+	 * Consumed budget is computed considering the time as
+	 * observed by schedulable tasks (excluding time spent
+	 * in hardirq context, etc.). Deadlines are instead
+	 * computed using hard walltime. This seems to be the more
+	 * natural solution, but the full ramifications of this
+	 * approach need further study.
+	 */
+	delta_exec = update_curr_common(rq);
+	update_curr_dl_se(rq, dl_se, delta_exec);
+}
+
 static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 {
 	struct sched_dl_entity *dl_se = container_of(timer,
 						     struct sched_dl_entity,
 						     inactive_timer);
-	struct task_struct *p = dl_task_of(dl_se);
+	struct task_struct *p = NULL;
 	struct rq_flags rf;
 	struct rq *rq;
 
-	rq = task_rq_lock(p, &rf);
+	if (!dl_server(dl_se)) {
+		p = dl_task_of(dl_se);
+		rq = task_rq_lock(p, &rf);
+	} else {
+		rq = dl_se->rq;
+		rq_lock(rq, &rf);
+	}
 
 	sched_clock_tick();
 	update_rq_clock(rq);
 
+	if (dl_server(dl_se))
+		goto no_task;
+
 	if (!dl_task(p) || p->state == TASK_DEAD) {
 		struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
 
@@ -1342,14 +1438,21 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 
 		goto unlock;
 	}
+
+no_task:
 	if (dl_se->dl_non_contending == 0)
 		goto unlock;
 
 	sub_running_bw(dl_se, &rq->dl);
 	dl_se->dl_non_contending = 0;
 unlock:
-	task_rq_unlock(rq, p, &rf);
-	put_task_struct(p);
+
+	if (!dl_server(dl_se)) {
+		task_rq_unlock(rq, p, &rf);
+		put_task_struct(p);
+	} else {
+		rq_unlock(rq, &rf);
+	}
 
 	return HRTIMER_NORESTART;
 }
@@ -1402,34 +1505,35 @@ static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
 static inline void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
 static inline void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
 
+static inline void update_dl_migration(struct dl_rq *dl_rq) {}
+
 #endif /* CONFIG_SMP */
 
 static inline
 void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 {
-	int prio = dl_task_of(dl_se)->prio;
 	u64 deadline = dl_se->deadline;
 
-	WARN_ON(!dl_prio(prio));
 	dl_rq->dl_nr_running++;
 	add_nr_running(rq_of_dl_rq(dl_rq), 1);
 
 	inc_dl_deadline(dl_rq, deadline);
-	inc_dl_migration(dl_se, dl_rq);
+	if (!dl_server(dl_se))
+		inc_dl_migration(dl_se, dl_rq);
+	update_dl_migration(dl_rq);
 }
 
 static inline
 void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 {
-	int prio = dl_task_of(dl_se)->prio;
-
-	WARN_ON(!dl_prio(prio));
 	WARN_ON(!dl_rq->dl_nr_running);
 	dl_rq->dl_nr_running--;
 	sub_nr_running(rq_of_dl_rq(dl_rq), 1);
 
 	dec_dl_deadline(dl_rq, dl_se->deadline);
-	dec_dl_migration(dl_se, dl_rq);
+	if (!dl_server(dl_se))
+		dec_dl_migration(dl_se, dl_rq);
+	update_dl_migration(dl_rq);
 }
 
 static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
@@ -1524,8 +1628,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
 	} else if (flags & ENQUEUE_REPLENISH) {
 		replenish_dl_entity(dl_se, pi_se);
 	} else if ((flags & ENQUEUE_RESTORE) &&
-		  dl_time_before(dl_se->deadline,
-				 rq_clock(rq_of_dl_rq(dl_rq_of_se(dl_se))))) {
+		   dl_time_before(dl_se->deadline, rq_clock(rq_of_dl_se(dl_se)))) {
 		setup_new_dl_entity(dl_se);
 	}
 
@@ -1592,14 +1695,6 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 		enqueue_pushable_dl_task(rq, p);
 }
 
-static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
-{
-	dequeue_dl_entity(&p->dl, flags);
-
-	if (!p->dl.dl_throttled)
-		dequeue_pushable_dl_task(rq, p);
-}
-
 static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	update_curr_dl(rq);
@@ -1607,7 +1702,9 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	if (p->on_rq == TASK_ON_RQ_MIGRATING)
 		flags |= DEQUEUE_MIGRATING;
 
-	__dequeue_task_dl(rq, p, flags);
+	dequeue_dl_entity(&p->dl, flags);
+	if (!p->dl.dl_throttled)
+		dequeue_pushable_dl_task(rq, p);
 }
 
 /*
@@ -1789,12 +1886,12 @@ static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
 }
 
 #ifdef CONFIG_SCHED_HRTICK
-static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
+static void start_hrtick_dl(struct rq *rq, struct sched_dl_entity *dl_se)
 {
-	hrtick_start(rq, p->dl.runtime);
+	hrtick_start(rq, dl_se->runtime);
 }
 #else /* !CONFIG_SCHED_HRTICK */
-static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
+static void start_hrtick_dl(struct rq *rq, struct sched_dl_entity *dl_se)
 {
 }
 #endif
@@ -1809,9 +1906,6 @@ static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
 	if (!first)
 		return;
 
-	if (hrtick_enabled(rq))
-		start_hrtick_dl(rq, p);
-
 	if (rq->curr->sched_class != &dl_sched_class)
 		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
 
@@ -1835,13 +1929,30 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
 	struct dl_rq *dl_rq = &rq->dl;
 	struct task_struct *p;
 
+again:
 	if (!sched_dl_runnable(rq))
 		return NULL;
 
 	dl_se = pick_next_dl_entity(rq, dl_rq);
 	BUG_ON(!dl_se);
-	p = dl_task_of(dl_se);
-	set_next_task_dl(rq, p, true);
+	if (dl_server(dl_se)) {
+		p = dl_se->server_pick(dl_se);
+		if (!p) {
+			// XXX should not happen, warn?!
+			dl_se->dl_yielded = 1;
+			update_curr_dl_se(rq, dl_se, 0);
+			goto again;
+		}
+		p->server = dl_se;
+	} else {
+		p = dl_task_of(dl_se);
+		set_next_task_dl(rq, p, true);
+	}
+
+	/* XXX not quite right */
+	if (hrtick_enabled(rq))
+		start_hrtick_dl(rq, dl_se);
+
 	return p;
 }
 
@@ -1873,8 +1984,8 @@ static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
 	 * be set and schedule() will start a new hrtick for the next task.
 	 */
 	if (hrtick_enabled(rq) && queued && p->dl.runtime > 0 &&
-	    is_leftmost(p, &rq->dl))
-		start_hrtick_dl(rq, p);
+	    is_leftmost(&p->dl, &rq->dl))
+		start_hrtick_dl(rq, &p->dl);
 }
 
 static void task_fork_dl(struct task_struct *p)
@@ -2771,6 +2882,7 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se)
 	dl_se->dl_yielded		= 0;
 	dl_se->dl_non_contending	= 0;
 	dl_se->dl_overrun		= 0;
+	dl_se->dl_server		= 0;
 }
 
 void init_dl_entity(struct sched_dl_entity *dl_se)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 10a230d85104a..5130239c0e1e5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -868,6 +868,8 @@ s64 update_curr_common(struct rq *rq)
 
 	account_group_exec_runtime(curr, delta_exec);
 	cgroup_account_cputime(curr, delta_exec);
+	if (curr->server)
+		dl_server_update(curr->server, delta_exec);
 
 	return delta_exec;
 }
@@ -897,6 +899,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
 		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
 		cgroup_account_cputime(curtask, delta_exec);
 		account_group_exec_runtime(curtask, delta_exec);
+		if (curtask->server)
+			dl_server_update(curtask->server, delta_exec);
 	}
 
 	account_cfs_rq_runtime(cfs_rq, delta_exec);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d3db8c7d8b641..f035cd8ccd224 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -346,6 +346,35 @@ extern int  dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_c
 extern int  dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
 extern bool dl_cpu_busy(unsigned int cpu);
 
+/*
+ * SCHED_DEADLINE supports servers (nested scheduling) with the following
+ * interface:
+ *
+ *   dl_se::rq -- runqueue we belong to.
+ *
+ *   dl_se::server_has_tasks() -- used on bandwidth enforcement; we 'stop' the
+ *                                server when it runs out of tasks to run.
+ *
+ *   dl_se::server_pick() -- nested pick_next_task(); we yield the period if this
+ *                           returns NULL.
+ *
+ *   dl_server_update() -- called from update_curr_common(), propagates runtime
+ *                         to the server.
+ *
+ *   dl_server_start()
+ *   dl_server_stop()  -- start/stop the server when it has (no) tasks
+ *
+ *   dl_server_init()
+ *
+ * XXX
+ */
+extern void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec);
+extern void dl_server_start(struct sched_dl_entity *dl_se);
+extern void dl_server_stop(struct sched_dl_entity *dl_se);
+extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
+		    dl_server_has_tasks_f has_tasks,
+		    dl_server_pick_f pick);
+
 #ifdef CONFIG_CGROUP_SCHED
 
 #include <linux/cgroup.h>
-- 
2.26.2


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC PATCH v2 5/6] sched/fair: Add trivial fair server
  2020-08-07  9:50 [RFC PATCH v2 0/6] SCHED_DEADLINE server infrastructure Juri Lelli
                   ` (3 preceding siblings ...)
  2020-08-07  9:50 ` [RFC PATCH v2 4/6] sched/deadline: Introduce deadline servers Juri Lelli
@ 2020-08-07  9:50 ` Juri Lelli
  2020-08-07  9:56 ` [RFC PATCH v2 6/6] sched/fair: Implement starvation monitor Juri Lelli
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 29+ messages in thread
From: Juri Lelli @ 2020-08-07  9:50 UTC (permalink / raw)
  To: peterz, mingo
  Cc: rostedt, tglx, linux-kernel, luca.abeni, tommaso.cucinotta,
	alessio.balsini, bristot, dietmar.eggemann, linux-rt-users,
	mtosatti, williams, valentin.schneider

From: Peter Zijlstra <peterz@infradead.org>

Use deadline servers to service fair tasks.

This patch adds a fair_server deadline entity which acts as a container
for fair entities and can be used to fix starvation when higher priority
(wrt fair) tasks are monopolizing CPU(s).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  |  1 +
 kernel/sched/fair.c  | 29 +++++++++++++++++++++++++++++
 kernel/sched/sched.h |  4 ++++
 3 files changed, 34 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7c471961fd0b8..6537637139c63 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7170,6 +7170,7 @@ void __init sched_init(void)
 #endif /* CONFIG_SMP */
 		hrtick_rq_init(rq);
 		atomic_set(&rq->nr_iowait, 0);
+		fair_server_init(rq);
 	}
 
 	set_load_weight(&init_task, false);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5130239c0e1e5..6a97ee2a4e26d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5514,6 +5514,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	 */
 	util_est_enqueue(&rq->cfs, p);
 
+	if (!rq->cfs.h_nr_running)
+		dl_server_start(&rq->fair_server);
+
 	/*
 	 * If in_iowait is set, the code below may not trigger any cpufreq
 	 * utilization updates, so do it here explicitly with the IOWAIT flag
@@ -5666,6 +5669,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		rq->next_balance = jiffies;
 
 dequeue_throttle:
+	if (!rq->cfs.h_nr_running)
+		dl_server_stop(&rq->fair_server);
+
 	util_est_dequeue(&rq->cfs, p, task_sleep);
 	hrtick_update(rq);
 }
@@ -7151,6 +7157,29 @@ static struct task_struct *__pick_next_task_fair(struct rq *rq)
 	return pick_next_task_fair(rq, NULL, NULL);
 }
 
+static bool fair_server_has_tasks(struct sched_dl_entity *dl_se)
+{
+	return !!dl_se->rq->cfs.nr_running;
+}
+
+static struct task_struct *fair_server_pick(struct sched_dl_entity *dl_se)
+{
+	return pick_next_task_fair(dl_se->rq, NULL, NULL);
+}
+
+void fair_server_init(struct rq *rq)
+{
+	struct sched_dl_entity *dl_se = &rq->fair_server;
+
+	init_dl_entity(dl_se);
+
+	dl_se->dl_runtime = TICK_NSEC;
+	dl_se->dl_deadline = 20 * TICK_NSEC;
+	dl_se->dl_period = 20 * TICK_NSEC;
+
+	dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
+}
+
 /*
  * Account for a descheduled task:
  */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f035cd8ccd224..bf8c9c07705c9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -375,6 +375,8 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 		    dl_server_has_tasks_f has_tasks,
 		    dl_server_pick_f pick);
 
+extern void fair_server_init(struct rq *);
+
 #ifdef CONFIG_CGROUP_SCHED
 
 #include <linux/cgroup.h>
@@ -959,6 +961,8 @@ struct rq {
 	struct rt_rq		rt;
 	struct dl_rq		dl;
 
+	struct sched_dl_entity	fair_server;
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this CPU: */
 	struct list_head	leaf_cfs_rq_list;
-- 
2.26.2


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC PATCH v2 6/6] sched/fair: Implement starvation monitor
  2020-08-07  9:50 [RFC PATCH v2 0/6] SCHED_DEADLINE server infrastructure Juri Lelli
                   ` (4 preceding siblings ...)
  2020-08-07  9:50 ` [RFC PATCH v2 5/6] sched/fair: Add trivial fair server Juri Lelli
@ 2020-08-07  9:56 ` Juri Lelli
  2020-08-07 10:46   ` peterz
  2020-08-07 13:28   ` luca abeni
  2020-08-07 13:16 ` [RFC PATCH v2 0/6] SCHED_DEADLINE server infrastructure luca abeni
  2020-09-08 22:22 ` Pavel Machek
  7 siblings, 2 replies; 29+ messages in thread
From: Juri Lelli @ 2020-08-07  9:56 UTC (permalink / raw)
  To: peterz, mingo
  Cc: rostedt, tglx, linux-kernel, luca.abeni, tommaso.cucinotta,
	alessio.balsini, bristot, dietmar.eggemann, linux-rt-users,
	mtosatti, williams, valentin.schneider, juri.lelli

Starting deadline server for lower priority classes right away when
first task is enqueued might break guarantees, as tasks belonging to
intermediate priority classes could be uselessly preempted. E.g., a well
behaving (non hog) FIFO task can be preempted by NORMAL tasks even if
there are still CPU cycles available for NORMAL tasks to run, as they'll
be running inside the fair deadline server for some period of time.

To prevent this issue, implement a starvation monitor mechanism that
starts the deadline server only if a (fair in this case) task hasn't
been scheduled for some interval of time after it has been enqueued.
Use pick/put functions to manage starvation monitor status.

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
 kernel/sched/fair.c  | 57 ++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h |  4 ++++
 2 files changed, 59 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6a97ee2a4e26d..5cdf76e508074 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5494,6 +5494,53 @@ static int sched_idle_cpu(int cpu)
 }
 #endif
 
+
+static void fair_server_watchdog(struct timer_list *list)
+{
+	struct rq *rq = container_of(list, struct rq, fair_server_wd);
+	struct rq_flags rf;
+
+	rq_lock_irqsave(rq, &rf);
+	rq->fair_server_wd_running = 0;
+
+	if (!rq->cfs.h_nr_running)
+		goto out;
+
+	update_rq_clock(rq);
+	dl_server_start(&rq->fair_server);
+	rq->fair_server_active = 1;
+	resched_curr(rq);
+
+out:
+	rq_unlock_irqrestore(rq, &rf);
+}
+
+static inline void fair_server_watchdog_start(struct rq *rq)
+{
+	if (rq->fair_server_wd_running || rq->fair_server_active)
+		return;
+
+	timer_setup(&rq->fair_server_wd, fair_server_watchdog, 0);
+	rq->fair_server_wd.expires = jiffies + FAIR_SERVER_WATCHDOG_INTERVAL;
+	add_timer_on(&rq->fair_server_wd, cpu_of(rq));
+	rq->fair_server_active = 0;
+	rq->fair_server_wd_running = 1;
+}
+
+static inline void fair_server_watchdog_stop(struct rq *rq, bool stop_server)
+{
+	if (!rq->fair_server_wd_running && !stop_server)
+		return;
+
+	del_timer(&rq->fair_server_wd);
+	rq->fair_server_wd_running = 0;
+
+	if (stop_server && rq->fair_server_active) {
+		dl_server_stop(&rq->fair_server);
+		rq->fair_server_active = 0;
+	}
+}
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
@@ -5515,7 +5562,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	util_est_enqueue(&rq->cfs, p);
 
 	if (!rq->cfs.h_nr_running)
-		dl_server_start(&rq->fair_server);
+		fair_server_watchdog_start(rq);
 
 	/*
 	 * If in_iowait is set, the code below may not trigger any cpufreq
@@ -5670,7 +5717,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 dequeue_throttle:
 	if (!rq->cfs.h_nr_running)
-		dl_server_stop(&rq->fair_server);
+		fair_server_watchdog_stop(rq, true);
 
 	util_est_dequeue(&rq->cfs, p, task_sleep);
 	hrtick_update(rq);
@@ -7123,6 +7170,7 @@ done: __maybe_unused;
 		hrtick_start_fair(rq, p);
 
 	update_misfit_status(p, rq);
+	fair_server_watchdog_stop(rq, false);
 
 	return p;
 
@@ -7178,6 +7226,8 @@ void fair_server_init(struct rq *rq)
 	dl_se->dl_period = 20 * TICK_NSEC;
 
 	dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick);
+
+	rq->fair_server_wd_running = 0;
 }
 
 /*
@@ -7192,6 +7242,9 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
 		cfs_rq = cfs_rq_of(se);
 		put_prev_entity(cfs_rq, se);
 	}
+
+	if (rq->cfs.h_nr_running)
+		fair_server_watchdog_start(rq);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bf8c9c07705c9..1e1a5436be725 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -375,6 +375,7 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 		    dl_server_has_tasks_f has_tasks,
 		    dl_server_pick_f pick);
 
+#define FAIR_SERVER_WATCHDOG_INTERVAL (HZ >> 1)
 extern void fair_server_init(struct rq *);
 
 #ifdef CONFIG_CGROUP_SCHED
@@ -962,6 +963,9 @@ struct rq {
 	struct dl_rq		dl;
 
 	struct sched_dl_entity	fair_server;
+	int			fair_server_active;
+	struct timer_list	fair_server_wd;
+	int			fair_server_wd_running;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this CPU: */
-- 
2.26.2


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 6/6] sched/fair: Implement starvation monitor
  2020-08-07  9:56 ` [RFC PATCH v2 6/6] sched/fair: Implement starvation monitor Juri Lelli
@ 2020-08-07 10:46   ` peterz
  2020-08-07 11:30     ` Daniel Bristot de Oliveira
  2020-08-07 13:49     ` luca abeni
  2020-08-07 13:28   ` luca abeni
  1 sibling, 2 replies; 29+ messages in thread
From: peterz @ 2020-08-07 10:46 UTC (permalink / raw)
  To: Juri Lelli
  Cc: mingo, rostedt, tglx, linux-kernel, luca.abeni,
	tommaso.cucinotta, alessio.balsini, bristot, dietmar.eggemann,
	linux-rt-users, mtosatti, williams, valentin.schneider

On Fri, Aug 07, 2020 at 11:56:04AM +0200, Juri Lelli wrote:
> Starting deadline server for lower priority classes right away when
> first task is enqueued might break guarantees, as tasks belonging to
> intermediate priority classes could be uselessly preempted. E.g., a well
> behaving (non hog) FIFO task can be preempted by NORMAL tasks even if
> there are still CPU cycles available for NORMAL tasks to run, as they'll
> be running inside the fair deadline server for some period of time.
> 
> To prevent this issue, implement a starvation monitor mechanism that
> starts the deadline server only if a (fair in this case) task hasn't
> been scheduled for some interval of time after it has been enqueued.
> Use pick/put functions to manage starvation monitor status.

One thing I considerd was scheduling this as a least-laxity entity --
such that it runs late, not early -- and start the server when
rq->nr_running != rq->cfs.h_nr_running, IOW when there's !fair tasks
around.

Not saying we should do it like that, but that's perhaps more
deterministic than this.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 6/6] sched/fair: Implement starvation monitor
  2020-08-07 10:46   ` peterz
@ 2020-08-07 11:30     ` Daniel Bristot de Oliveira
  2020-08-07 12:50       ` Juri Lelli
  2020-08-07 13:49     ` luca abeni
  1 sibling, 1 reply; 29+ messages in thread
From: Daniel Bristot de Oliveira @ 2020-08-07 11:30 UTC (permalink / raw)
  To: peterz, Juri Lelli
  Cc: mingo, rostedt, tglx, linux-kernel, luca.abeni,
	tommaso.cucinotta, alessio.balsini, dietmar.eggemann,
	linux-rt-users, mtosatti, williams, valentin.schneider

On 8/7/20 12:46 PM, peterz@infradead.org wrote:
> On Fri, Aug 07, 2020 at 11:56:04AM +0200, Juri Lelli wrote:
>> Starting deadline server for lower priority classes right away when
>> first task is enqueued might break guarantees, as tasks belonging to
>> intermediate priority classes could be uselessly preempted. E.g., a well
>> behaving (non hog) FIFO task can be preempted by NORMAL tasks even if
>> there are still CPU cycles available for NORMAL tasks to run, as they'll
>> be running inside the fair deadline server for some period of time.
>>
>> To prevent this issue, implement a starvation monitor mechanism that
>> starts the deadline server only if a (fair in this case) task hasn't
>> been scheduled for some interval of time after it has been enqueued.
>> Use pick/put functions to manage starvation monitor status.
> One thing I considerd was scheduling this as a least-laxity entity --
> such that it runs late, not early -- and start the server when
> rq->nr_running != rq->cfs.h_nr_running, IOW when there's !fair tasks
> around.
> 
> Not saying we should do it like that, but that's perhaps more
> deterministic than this.
> 

I agree, what we want here is something that schedules the server if it still
retains some runtime when the laxity is 0. But this is easier said than done, as
this would require another scheduler (other pros and cons and analysis (and
hours of work)...).

But, for the starvation monitor purpose, the goal is not (necessarily) to
provide a deterministic guarantee for the starving task, but to avoid system
issues while minimizing the damage to the "real" real-time workload. With that
in mind, we could relax our ambitions...

Thoughts?

-- Daniel


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 6/6] sched/fair: Implement starvation monitor
  2020-08-07 11:30     ` Daniel Bristot de Oliveira
@ 2020-08-07 12:50       ` Juri Lelli
  0 siblings, 0 replies; 29+ messages in thread
From: Juri Lelli @ 2020-08-07 12:50 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: peterz, mingo, rostedt, tglx, linux-kernel, luca.abeni,
	tommaso.cucinotta, alessio.balsini, dietmar.eggemann,
	linux-rt-users, mtosatti, williams, valentin.schneider

On 07/08/20 13:30, Daniel Bristot de Oliveira wrote:
> On 8/7/20 12:46 PM, peterz@infradead.org wrote:
> > On Fri, Aug 07, 2020 at 11:56:04AM +0200, Juri Lelli wrote:
> >> Starting deadline server for lower priority classes right away when
> >> first task is enqueued might break guarantees, as tasks belonging to
> >> intermediate priority classes could be uselessly preempted. E.g., a well
> >> behaving (non hog) FIFO task can be preempted by NORMAL tasks even if
> >> there are still CPU cycles available for NORMAL tasks to run, as they'll
> >> be running inside the fair deadline server for some period of time.
> >>
> >> To prevent this issue, implement a starvation monitor mechanism that
> >> starts the deadline server only if a (fair in this case) task hasn't
> >> been scheduled for some interval of time after it has been enqueued.
> >> Use pick/put functions to manage starvation monitor status.
> > One thing I considerd was scheduling this as a least-laxity entity --
> > such that it runs late, not early -- and start the server when
> > rq->nr_running != rq->cfs.h_nr_running, IOW when there's !fair tasks
> > around.

IIUC, this would still require programming a timer to fire when laxity
is 0, but doing that only when there are !fair tasks around (so when
enqueuing the first !fair or if there are !fair already when first fair
is enqueued) would probably save us some overhead, I agree (as no timer
and no enqueue of deadline server would be needed in the common "only
fair" case).

> > 
> > Not saying we should do it like that, but that's perhaps more
> > deterministic than this.
> > 
> 
> I agree, what we want here is something that schedules the server if it still
> retains some runtime when the laxity is 0. But this is easier said than done, as
> this would require another scheduler (other pros and cons and analysis (and
> hours of work)...).
> 
> But, for the starvation monitor purpose, the goal is not (necessarily) to
> provide a deterministic guarantee for the starving task, but to avoid system
> issues while minimizing the damage to the "real" real-time workload. With that
> in mind, we could relax our ambitions...
> 
> Thoughts?

I agree that we don't probably want to develop an additional scheduler/
policy for this, but I'll think a bit more about Peter's idea. Maybe
it's already a viable optimization w/o changing EDF/CBS.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 0/6] SCHED_DEADLINE server infrastructure
  2020-08-07  9:50 [RFC PATCH v2 0/6] SCHED_DEADLINE server infrastructure Juri Lelli
                   ` (5 preceding siblings ...)
  2020-08-07  9:56 ` [RFC PATCH v2 6/6] sched/fair: Implement starvation monitor Juri Lelli
@ 2020-08-07 13:16 ` luca abeni
  2020-08-07 13:30   ` Juri Lelli
  2020-08-07 14:14   ` peterz
  2020-09-08 22:22 ` Pavel Machek
  7 siblings, 2 replies; 29+ messages in thread
From: luca abeni @ 2020-08-07 13:16 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, mingo, rostedt, tglx, linux-kernel, tommaso.cucinotta,
	alessio.balsini, bristot, dietmar.eggemann, linux-rt-users,
	mtosatti, williams, valentin.schneider

Hi Juri,

thanks for sharing the v2 patchset!

In the next days I'll have a look at it, and try some tests...

In the meanwhile, I have some questions/comments after a first quick
look.

If I understand well, the patchset does not apply deadline servers to
FIFO and RR tasks, right? How does this patchset interact with RT
throttling?

If I understand well, patch 6/6 does something like "use deadline
servers for SCHED_OTHER only if FIFO/RR tasks risk to starve
SCHED_OTHER tasks"... Right? I understand this is because you do not
want to delay RT tasks if they are not starving other tasks. But then,
maybe what you want is not deadline-based scheduling. Maybe a
reservation-based scheduler based on fixed priorities is what you want?
(with SCHED_DEADLINE, you could provide exact performance guarantees to
SCHED_OTHER tasks, but I suspect patch 6/6 breaks these guarantees?)


			Thanks,
				Luca

On Fri,  7 Aug 2020 11:50:45 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:

> Hi,
> 
> This is RFC v2 of Peter's SCHED_DEADLINE server infrastructure
> implementation [1].
> 
> SCHED_DEADLINE servers can help fixing starvation issues of low
> priority tasks (e.g., SCHED_OTHER) when higher priority tasks
> monopolize CPU cycles. Today we have RT Throttling; DEADLINE servers
> should be able to replace and improve that.
> 
> I rebased Peter's patches (adding changelogs where needed) on
> tip/sched/core as of today and incorporated fixes to issues discussed
> during RFC v1. Current set seems to even boot on real HW! :-)
> 
> While playing with RFC v1 set (and discussing it further offline with
> Daniel) it has emerged the need to slightly change the behavior. Patch
> 6/6 is a (cumbersome?) attempt to show what's probably needed.
> The problem with "original" implementation is that FIFO tasks might
> suffer preemption from NORMAL even when spare CPU cycles are
> available. In fact, fair deadline server is enqueued right away when
> NORMAL tasks wake up and they are first scheduled by the server, thus
> potentially preempting a well behaving FIFO task. This is of course
> not ideal. So, in patch 6/6 I propose to use some kind of starvation
> monitor/ watchdog that delays enqueuing of deadline servers to the
> point when fair tasks might start to actually suffer from starvation
> (just randomly picked HZ/2 for now). One problem I already see with
> the current implementation is that it adds overhead to fair paths, so
> I'm pretty sure there are better ways to implement the idea (e.g.,
> Daniel already suggested using a starvation monitor kthread sort of
> thing).
> 
> Receiving comments and suggestions is the sole purpose of this posting
> at this stage. Hopefully we can further discuss the idea at Plumbers
> in a few weeks. So, please don't focus too much into actual
> implementation (which I plan to revise anyway after I'm back from pto
> :), but try to see if this might actually fly. The feature seems to
> be very much needed.
> 
> Thanks!
> 
> Juri
> 
> 1 -
> https://lore.kernel.org/lkml/20190726145409.947503076@infradead.org/
> 
> Juri Lelli (1):
>   sched/fair: Implement starvation monitor
> 
> Peter Zijlstra (5):
>   sched: Unify runtime accounting across classes
>   sched/deadline: Collect sched_dl_entity initialization
>   sched/deadline: Move bandwidth accounting into
> {en,de}queue_dl_entity sched/deadline: Introduce deadline servers
>   sched/fair: Add trivial fair server
> 
>  include/linux/sched.h    |  28 ++-
>  kernel/sched/core.c      |  23 +-
>  kernel/sched/deadline.c  | 483
> ++++++++++++++++++++++++--------------- kernel/sched/fair.c      |
> 136 ++++++++++- kernel/sched/rt.c        |  17 +-
>  kernel/sched/sched.h     |  50 +++-
>  kernel/sched/stop_task.c |  16 +-
>  7 files changed, 522 insertions(+), 231 deletions(-)
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 6/6] sched/fair: Implement starvation monitor
  2020-08-07  9:56 ` [RFC PATCH v2 6/6] sched/fair: Implement starvation monitor Juri Lelli
  2020-08-07 10:46   ` peterz
@ 2020-08-07 13:28   ` luca abeni
  2020-08-07 13:43     ` Juri Lelli
  1 sibling, 1 reply; 29+ messages in thread
From: luca abeni @ 2020-08-07 13:28 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, mingo, rostedt, tglx, linux-kernel, tommaso.cucinotta,
	alessio.balsini, bristot, dietmar.eggemann, linux-rt-users,
	mtosatti, williams, valentin.schneider

Hi Juri,

On Fri, 7 Aug 2020 11:56:04 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:

> Starting deadline server for lower priority classes right away when
> first task is enqueued might break guarantees

Which guarantees are you thinking about, here? Response times of fixed
priority tasks?

If fixed priority tasks are also scheduled through deadline servers,
then you can provide response-time guarantees to them even when
lower-priority/non-real-time tasks are scheduled through deadline
servers.


			Thanks,
				Luca

> as tasks belonging to
> intermediate priority classes could be uselessly preempted. E.g., a
> well behaving (non hog) FIFO task can be preempted by NORMAL tasks
> even if there are still CPU cycles available for NORMAL tasks to run,
> as they'll be running inside the fair deadline server for some period
> of time.
> 
> To prevent this issue, implement a starvation monitor mechanism that
> starts the deadline server only if a (fair in this case) task hasn't
> been scheduled for some interval of time after it has been enqueued.
> Use pick/put functions to manage starvation monitor status.
> 
> Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
> ---
>  kernel/sched/fair.c  | 57
> ++++++++++++++++++++++++++++++++++++++++++-- kernel/sched/sched.h |
> 4 ++++ 2 files changed, 59 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6a97ee2a4e26d..5cdf76e508074 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5494,6 +5494,53 @@ static int sched_idle_cpu(int cpu)
>  }
>  #endif
>  
> +
> +static void fair_server_watchdog(struct timer_list *list)
> +{
> +	struct rq *rq = container_of(list, struct rq,
> fair_server_wd);
> +	struct rq_flags rf;
> +
> +	rq_lock_irqsave(rq, &rf);
> +	rq->fair_server_wd_running = 0;
> +
> +	if (!rq->cfs.h_nr_running)
> +		goto out;
> +
> +	update_rq_clock(rq);
> +	dl_server_start(&rq->fair_server);
> +	rq->fair_server_active = 1;
> +	resched_curr(rq);
> +
> +out:
> +	rq_unlock_irqrestore(rq, &rf);
> +}
> +
> +static inline void fair_server_watchdog_start(struct rq *rq)
> +{
> +	if (rq->fair_server_wd_running || rq->fair_server_active)
> +		return;
> +
> +	timer_setup(&rq->fair_server_wd, fair_server_watchdog, 0);
> +	rq->fair_server_wd.expires = jiffies +
> FAIR_SERVER_WATCHDOG_INTERVAL;
> +	add_timer_on(&rq->fair_server_wd, cpu_of(rq));
> +	rq->fair_server_active = 0;
> +	rq->fair_server_wd_running = 1;
> +}
> +
> +static inline void fair_server_watchdog_stop(struct rq *rq, bool
> stop_server) +{
> +	if (!rq->fair_server_wd_running && !stop_server)
> +		return;
> +
> +	del_timer(&rq->fair_server_wd);
> +	rq->fair_server_wd_running = 0;
> +
> +	if (stop_server && rq->fair_server_active) {
> +		dl_server_stop(&rq->fair_server);
> +		rq->fair_server_active = 0;
> +	}
> +}
> +
>  /*
>   * The enqueue_task method is called before nr_running is
>   * increased. Here we update the fair scheduling stats and
> @@ -5515,7 +5562,7 @@ enqueue_task_fair(struct rq *rq, struct
> task_struct *p, int flags) util_est_enqueue(&rq->cfs, p);
>  
>  	if (!rq->cfs.h_nr_running)
> -		dl_server_start(&rq->fair_server);
> +		fair_server_watchdog_start(rq);
>  
>  	/*
>  	 * If in_iowait is set, the code below may not trigger any
> cpufreq @@ -5670,7 +5717,7 @@ static void dequeue_task_fair(struct rq
> *rq, struct task_struct *p, int flags) 
>  dequeue_throttle:
>  	if (!rq->cfs.h_nr_running)
> -		dl_server_stop(&rq->fair_server);
> +		fair_server_watchdog_stop(rq, true);
>  
>  	util_est_dequeue(&rq->cfs, p, task_sleep);
>  	hrtick_update(rq);
> @@ -7123,6 +7170,7 @@ done: __maybe_unused;
>  		hrtick_start_fair(rq, p);
>  
>  	update_misfit_status(p, rq);
> +	fair_server_watchdog_stop(rq, false);
>  
>  	return p;
>  
> @@ -7178,6 +7226,8 @@ void fair_server_init(struct rq *rq)
>  	dl_se->dl_period = 20 * TICK_NSEC;
>  
>  	dl_server_init(dl_se, rq, fair_server_has_tasks,
> fair_server_pick); +
> +	rq->fair_server_wd_running = 0;
>  }
>  
>  /*
> @@ -7192,6 +7242,9 @@ static void put_prev_task_fair(struct rq *rq,
> struct task_struct *prev) cfs_rq = cfs_rq_of(se);
>  		put_prev_entity(cfs_rq, se);
>  	}
> +
> +	if (rq->cfs.h_nr_running)
> +		fair_server_watchdog_start(rq);
>  }
>  
>  /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index bf8c9c07705c9..1e1a5436be725 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -375,6 +375,7 @@ extern void dl_server_init(struct sched_dl_entity
> *dl_se, struct rq *rq, dl_server_has_tasks_f has_tasks,
>  		    dl_server_pick_f pick);
>  
> +#define FAIR_SERVER_WATCHDOG_INTERVAL (HZ >> 1)
>  extern void fair_server_init(struct rq *);
>  
>  #ifdef CONFIG_CGROUP_SCHED
> @@ -962,6 +963,9 @@ struct rq {
>  	struct dl_rq		dl;
>  
>  	struct sched_dl_entity	fair_server;
> +	int			fair_server_active;
> +	struct timer_list	fair_server_wd;
> +	int			fair_server_wd_running;
>  
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  	/* list of leaf cfs_rq on this CPU: */


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 0/6] SCHED_DEADLINE server infrastructure
  2020-08-07 13:16 ` [RFC PATCH v2 0/6] SCHED_DEADLINE server infrastructure luca abeni
@ 2020-08-07 13:30   ` Juri Lelli
  2020-08-07 13:41     ` luca abeni
  2020-08-07 14:14   ` peterz
  1 sibling, 1 reply; 29+ messages in thread
From: Juri Lelli @ 2020-08-07 13:30 UTC (permalink / raw)
  To: luca abeni
  Cc: peterz, mingo, rostedt, tglx, linux-kernel, tommaso.cucinotta,
	alessio.balsini, bristot, dietmar.eggemann, linux-rt-users,
	mtosatti, williams, valentin.schneider

Hi Luca,

On 07/08/20 15:16, luca abeni wrote:
> Hi Juri,
> 
> thanks for sharing the v2 patchset!
> 
> In the next days I'll have a look at it, and try some tests...

Thanks!

> In the meanwhile, I have some questions/comments after a first quick
> look.
> 
> If I understand well, the patchset does not apply deadline servers to
> FIFO and RR tasks, right? How does this patchset interact with RT
> throttling?

Well, it's something like the dual of it, in that RT Throttling directly
throttles RT tasks to make spare CPU cycles available to fair tasks
while this patchset introduces deadline servers to schedule fair tasks,
thus still reserving CPU time for those (when needed).

> If I understand well, patch 6/6 does something like "use deadline
> servers for SCHED_OTHER only if FIFO/RR tasks risk to starve
> SCHED_OTHER tasks"... Right?

That's the basic idea, yes.

> I understand this is because you do not
> want to delay RT tasks if they are not starving other tasks. But then,
> maybe what you want is not deadline-based scheduling. Maybe a
> reservation-based scheduler based on fixed priorities is what you want?
> (with SCHED_DEADLINE, you could provide exact performance guarantees to
> SCHED_OTHER tasks, but I suspect patch 6/6 breaks these guarantees?)

I agree that we are not interested in exact guarantees in this case, but
why not using something that it's already there and would give us the
functionality we need (fix starvation for fair)? It would also work well
in presence of "real" deadline tasks I think, in that you could account
for these fair servers while performing admission control.

Best,

Juri


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 0/6] SCHED_DEADLINE server infrastructure
  2020-08-07 13:30   ` Juri Lelli
@ 2020-08-07 13:41     ` luca abeni
  2020-08-07 14:04       ` Juri Lelli
  0 siblings, 1 reply; 29+ messages in thread
From: luca abeni @ 2020-08-07 13:41 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, mingo, rostedt, tglx, linux-kernel, tommaso.cucinotta,
	alessio.balsini, bristot, dietmar.eggemann, linux-rt-users,
	mtosatti, williams, valentin.schneider

Hi Juri,

On Fri, 7 Aug 2020 15:30:41 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:
[...]
> > In the meanwhile, I have some questions/comments after a first quick
> > look.
> > 
> > If I understand well, the patchset does not apply deadline servers
> > to FIFO and RR tasks, right? How does this patchset interact with RT
> > throttling?  
> 
> Well, it's something like the dual of it, in that RT Throttling
> directly throttles RT tasks to make spare CPU cycles available to
> fair tasks while this patchset introduces deadline servers to
> schedule fair tasks, thus still reserving CPU time for those (when
> needed).

Ah, OK... I was thinking about using deadline servers for both RT and
non-RT tasks. And to use them not only to throttle, but also to provide
some kind of performance guarantees (to all the scheduling classes).
Think about what can be done when combining this mechanism with
cgroups/containers :)

[...]
> > I understand this is because you do not
> > want to delay RT tasks if they are not starving other tasks. But
> > then, maybe what you want is not deadline-based scheduling. Maybe a
> > reservation-based scheduler based on fixed priorities is what you
> > want? (with SCHED_DEADLINE, you could provide exact performance
> > guarantees to SCHED_OTHER tasks, but I suspect patch 6/6 breaks
> > these guarantees?)  
> 
> I agree that we are not interested in exact guarantees in this case,
> but why not using something that it's already there and would give us
> the functionality we need (fix starvation for fair)?

Ok, if performance guarantees to non-RT tasks are not the goal, then I
agree. I was thinking that since the patchset provides a mechanism to
schedule various classes of tasks through deadline servers, then
using these servers to provide some kinds of guarantees could be
interesting ;-)



			Thanks,
				Luca

> It would also
> work well in presence of "real" deadline tasks I think, in that you
> could account for these fair servers while performing admission
> control.
> 
> Best,
> 
> Juri
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 6/6] sched/fair: Implement starvation monitor
  2020-08-07 13:28   ` luca abeni
@ 2020-08-07 13:43     ` Juri Lelli
  2020-08-07 13:55       ` luca abeni
  2020-08-07 14:13       ` peterz
  0 siblings, 2 replies; 29+ messages in thread
From: Juri Lelli @ 2020-08-07 13:43 UTC (permalink / raw)
  To: luca abeni
  Cc: peterz, mingo, rostedt, tglx, linux-kernel, tommaso.cucinotta,
	alessio.balsini, bristot, dietmar.eggemann, linux-rt-users,
	mtosatti, williams, valentin.schneider

On 07/08/20 15:28, luca abeni wrote:
> Hi Juri,
> 
> On Fri, 7 Aug 2020 11:56:04 +0200
> Juri Lelli <juri.lelli@redhat.com> wrote:
> 
> > Starting deadline server for lower priority classes right away when
> > first task is enqueued might break guarantees
> 
> Which guarantees are you thinking about, here? Response times of fixed
> priority tasks?

Response time, but also wakeup latency (which, for better or worse, is
another important metric).

> If fixed priority tasks are also scheduled through deadline servers,
> then you can provide response-time guarantees to them even when
> lower-priority/non-real-time tasks are scheduled through deadline
> servers.

Right, but I fear we won't be able to keep current behavior for wakeups:
RT with highest prio always gets scheduled right away?


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 6/6] sched/fair: Implement starvation monitor
  2020-08-07 10:46   ` peterz
  2020-08-07 11:30     ` Daniel Bristot de Oliveira
@ 2020-08-07 13:49     ` luca abeni
  2020-08-07 14:11       ` peterz
  1 sibling, 1 reply; 29+ messages in thread
From: luca abeni @ 2020-08-07 13:49 UTC (permalink / raw)
  To: peterz
  Cc: Juri Lelli, mingo, rostedt, tglx, linux-kernel,
	tommaso.cucinotta, alessio.balsini, bristot, dietmar.eggemann,
	linux-rt-users, mtosatti, williams, valentin.schneider

Hi Peter,

On Fri, 7 Aug 2020 12:46:18 +0200
peterz@infradead.org wrote:

> On Fri, Aug 07, 2020 at 11:56:04AM +0200, Juri Lelli wrote:
> > Starting deadline server for lower priority classes right away when
> > first task is enqueued might break guarantees, as tasks belonging to
> > intermediate priority classes could be uselessly preempted. E.g., a
> > well behaving (non hog) FIFO task can be preempted by NORMAL tasks
> > even if there are still CPU cycles available for NORMAL tasks to
> > run, as they'll be running inside the fair deadline server for some
> > period of time.
> > 
> > To prevent this issue, implement a starvation monitor mechanism that
> > starts the deadline server only if a (fair in this case) task hasn't
> > been scheduled for some interval of time after it has been enqueued.
> > Use pick/put functions to manage starvation monitor status.  
> 
> One thing I considerd was scheduling this as a least-laxity entity --
> such that it runs late, not early

Are you thinking about scheduling both RT and non-RT tasks through
deadline servers? If yes, then I think that using something like
laxity-based scheduling for the SCHED_OTHER server can be a good idea
(but then we need to understand how to combine deadline-based
scheduling with laxity-based scheduling, etc...)

Or are you thinking about keeping the SCHED_OTHER server throttled
until its laxity is 0 (or until its laxity is lower than some small
value)? In this second case, the approach would work even if RT tasks
are not scheduled through a server (but I do not know which kind of
performance guarantee we could provide).


> -- and start the server when
> rq->nr_running != rq->cfs.h_nr_running, IOW when there's !fair tasks
> around.

Yes, this could be a good optimization.



			Luca
> 
> Not saying we should do it like that, but that's perhaps more
> deterministic than this.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 6/6] sched/fair: Implement starvation monitor
  2020-08-07 13:43     ` Juri Lelli
@ 2020-08-07 13:55       ` luca abeni
  2020-08-07 14:11         ` Juri Lelli
  2020-08-07 14:13       ` peterz
  1 sibling, 1 reply; 29+ messages in thread
From: luca abeni @ 2020-08-07 13:55 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, mingo, rostedt, tglx, linux-kernel, tommaso.cucinotta,
	alessio.balsini, bristot, dietmar.eggemann, linux-rt-users,
	mtosatti, williams, valentin.schneider

On Fri, 7 Aug 2020 15:43:53 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:

> On 07/08/20 15:28, luca abeni wrote:
> > Hi Juri,
> > 
> > On Fri, 7 Aug 2020 11:56:04 +0200
> > Juri Lelli <juri.lelli@redhat.com> wrote:
> >   
> > > Starting deadline server for lower priority classes right away
> > > when first task is enqueued might break guarantees  
> > 
> > Which guarantees are you thinking about, here? Response times of
> > fixed priority tasks?  
> 
> Response time, but also wakeup latency (which, for better or worse, is
> another important metric).
> 
> > If fixed priority tasks are also scheduled through deadline servers,
> > then you can provide response-time guarantees to them even when
> > lower-priority/non-real-time tasks are scheduled through deadline
> > servers.  
> 
> Right, but I fear we won't be able to keep current behavior for
> wakeups: RT with highest prio always gets scheduled right away?

Uhm... I think this depends on how the servers' parameters are
designed: assigning "wrong" (or "bad") parameters to the server used to
schedule RT tasks, this property is broken.

(however, notice that even with the current patchset the highest
priority task might be scheduled with some delay --- if the SCHED_OTHER
deadline server is active because SCHED_OTHER tasks are being starved).



			Luca

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 0/6] SCHED_DEADLINE server infrastructure
  2020-08-07 13:41     ` luca abeni
@ 2020-08-07 14:04       ` Juri Lelli
  0 siblings, 0 replies; 29+ messages in thread
From: Juri Lelli @ 2020-08-07 14:04 UTC (permalink / raw)
  To: luca abeni
  Cc: peterz, mingo, rostedt, tglx, linux-kernel, tommaso.cucinotta,
	alessio.balsini, bristot, dietmar.eggemann, linux-rt-users,
	mtosatti, williams, valentin.schneider

On 07/08/20 15:41, luca abeni wrote:
> Hi Juri,
> 
> On Fri, 7 Aug 2020 15:30:41 +0200
> Juri Lelli <juri.lelli@redhat.com> wrote:
> [...]
> > > In the meanwhile, I have some questions/comments after a first quick
> > > look.
> > > 
> > > If I understand well, the patchset does not apply deadline servers
> > > to FIFO and RR tasks, right? How does this patchset interact with RT
> > > throttling?  
> > 
> > Well, it's something like the dual of it, in that RT Throttling
> > directly throttles RT tasks to make spare CPU cycles available to
> > fair tasks while this patchset introduces deadline servers to
> > schedule fair tasks, thus still reserving CPU time for those (when
> > needed).
> 
> Ah, OK... I was thinking about using deadline servers for both RT and
> non-RT tasks. And to use them not only to throttle, but also to provide
> some kind of performance guarantees (to all the scheduling classes).
> Think about what can be done when combining this mechanism with
> cgroups/containers :)
> 
> [...]
> > > I understand this is because you do not
> > > want to delay RT tasks if they are not starving other tasks. But
> > > then, maybe what you want is not deadline-based scheduling. Maybe a
> > > reservation-based scheduler based on fixed priorities is what you
> > > want? (with SCHED_DEADLINE, you could provide exact performance
> > > guarantees to SCHED_OTHER tasks, but I suspect patch 6/6 breaks
> > > these guarantees?)  
> > 
> > I agree that we are not interested in exact guarantees in this case,
> > but why not using something that it's already there and would give us
> > the functionality we need (fix starvation for fair)?
> 
> Ok, if performance guarantees to non-RT tasks are not the goal, then I
> agree. I was thinking that since the patchset provides a mechanism to
> schedule various classes of tasks through deadline servers, then
> using these servers to provide some kinds of guarantees could be
> interesting ;-)

Not saying that the wider scope approach is not worth pursuing, you know
I would be very much interested into that as well :-), but I'd leave it
for a later time. This proposal looks reasonably achieaveble in somewhat
short times and it should provide provable benefits to production today.
Plus, you are again right, foundations would be there already when we'll
be ready for the wider solution.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 6/6] sched/fair: Implement starvation monitor
  2020-08-07 13:49     ` luca abeni
@ 2020-08-07 14:11       ` peterz
  2020-08-07 16:48         ` Daniel Bristot de Oliveira
  0 siblings, 1 reply; 29+ messages in thread
From: peterz @ 2020-08-07 14:11 UTC (permalink / raw)
  To: luca abeni
  Cc: Juri Lelli, mingo, rostedt, tglx, linux-kernel,
	tommaso.cucinotta, alessio.balsini, bristot, dietmar.eggemann,
	linux-rt-users, mtosatti, williams, valentin.schneider

On Fri, Aug 07, 2020 at 03:49:41PM +0200, luca abeni wrote:
> Hi Peter,
> 
> peterz@infradead.org wrote:

> > One thing I considerd was scheduling this as a least-laxity entity --
> > such that it runs late, not early
> 
> Are you thinking about scheduling both RT and non-RT tasks through
> deadline servers? If yes,

Maybe, I initially considered this for mixed criticality, where the
'soft' class would run EDF and the 'hard' class would run LLF (or the
other way around, I can't quite remember how I figured it).

If you restrict the hard class to single CPU assignment (IOW the UP
case) and ensure that u_llf + U_gedf/N < 1, it should just work out.

But I shelved all that after I heard about that other balancer idea
Danial was suppose to be working on ;-)))

> then I think that using something like
> laxity-based scheduling for the SCHED_OTHER server can be a good idea
> (but then we need to understand how to combine deadline-based
> scheduling with laxity-based scheduling, etc...)

/me consults notes, EDZL is I think the closest thing there.

> Or are you thinking about keeping the SCHED_OTHER server throttled
> until its laxity is 0 (or until its laxity is lower than some small
> value)? In this second case, the approach would work even if RT tasks
> are not scheduled through a server (but I do not know which kind of
> performance guarantee we could provide).

That would certainly be sufficient for OTHER servers I think.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 6/6] sched/fair: Implement starvation monitor
  2020-08-07 13:55       ` luca abeni
@ 2020-08-07 14:11         ` Juri Lelli
  0 siblings, 0 replies; 29+ messages in thread
From: Juri Lelli @ 2020-08-07 14:11 UTC (permalink / raw)
  To: luca abeni
  Cc: peterz, mingo, rostedt, tglx, linux-kernel, tommaso.cucinotta,
	alessio.balsini, bristot, dietmar.eggemann, linux-rt-users,
	mtosatti, williams, valentin.schneider

On 07/08/20 15:55, luca abeni wrote:
> On Fri, 7 Aug 2020 15:43:53 +0200
> Juri Lelli <juri.lelli@redhat.com> wrote:
> 
> > On 07/08/20 15:28, luca abeni wrote:
> > > Hi Juri,
> > > 
> > > On Fri, 7 Aug 2020 11:56:04 +0200
> > > Juri Lelli <juri.lelli@redhat.com> wrote:
> > >   
> > > > Starting deadline server for lower priority classes right away
> > > > when first task is enqueued might break guarantees  
> > > 
> > > Which guarantees are you thinking about, here? Response times of
> > > fixed priority tasks?  
> > 
> > Response time, but also wakeup latency (which, for better or worse, is
> > another important metric).
> > 
> > > If fixed priority tasks are also scheduled through deadline servers,
> > > then you can provide response-time guarantees to them even when
> > > lower-priority/non-real-time tasks are scheduled through deadline
> > > servers.  
> > 
> > Right, but I fear we won't be able to keep current behavior for
> > wakeups: RT with highest prio always gets scheduled right away?
> 
> Uhm... I think this depends on how the servers' parameters are
> designed: assigning "wrong" (or "bad") parameters to the server used to
> schedule RT tasks, this property is broken.
> 
> (however, notice that even with the current patchset the highest
> priority task might be scheduled with some delay --- if the SCHED_OTHER
> deadline server is active because SCHED_OTHER tasks are being starved).

But that's OK I think, because if the server is active it means that
OTHER didn't get a chance to run for some time and if it continues to
hung than worse problems than breaking FIFO assumptions will happen. :-/


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 6/6] sched/fair: Implement starvation monitor
  2020-08-07 13:43     ` Juri Lelli
  2020-08-07 13:55       ` luca abeni
@ 2020-08-07 14:13       ` peterz
  2020-08-07 15:06         ` Juri Lelli
  1 sibling, 1 reply; 29+ messages in thread
From: peterz @ 2020-08-07 14:13 UTC (permalink / raw)
  To: Juri Lelli
  Cc: luca abeni, mingo, rostedt, tglx, linux-kernel,
	tommaso.cucinotta, alessio.balsini, bristot, dietmar.eggemann,
	linux-rt-users, mtosatti, williams, valentin.schneider

On Fri, Aug 07, 2020 at 03:43:53PM +0200, Juri Lelli wrote:

> Right, but I fear we won't be able to keep current behavior for wakeups:
> RT with highest prio always gets scheduled right away?

If you consider RT throttling, that's already not the case. We can
consider this fair server to be just another way to implement that.

At some point, we'll have have to preempt higher priority tasks, that's
the entire point of the thing after all.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 0/6] SCHED_DEADLINE server infrastructure
  2020-08-07 13:16 ` [RFC PATCH v2 0/6] SCHED_DEADLINE server infrastructure luca abeni
  2020-08-07 13:30   ` Juri Lelli
@ 2020-08-07 14:14   ` peterz
  1 sibling, 0 replies; 29+ messages in thread
From: peterz @ 2020-08-07 14:14 UTC (permalink / raw)
  To: luca abeni
  Cc: Juri Lelli, mingo, rostedt, tglx, linux-kernel,
	tommaso.cucinotta, alessio.balsini, bristot, dietmar.eggemann,
	linux-rt-users, mtosatti, williams, valentin.schneider

On Fri, Aug 07, 2020 at 03:16:32PM +0200, luca abeni wrote:

> If I understand well, the patchset does not apply deadline servers to
> FIFO and RR tasks, right? How does this patchset interact with RT
> throttling?

ideally it will replace it ;-)

But of course, there's the whole cgroup trainwreck waiting there :/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 6/6] sched/fair: Implement starvation monitor
  2020-08-07 14:13       ` peterz
@ 2020-08-07 15:06         ` Juri Lelli
  0 siblings, 0 replies; 29+ messages in thread
From: Juri Lelli @ 2020-08-07 15:06 UTC (permalink / raw)
  To: peterz
  Cc: luca abeni, mingo, rostedt, tglx, linux-kernel,
	tommaso.cucinotta, alessio.balsini, bristot, dietmar.eggemann,
	linux-rt-users, mtosatti, williams, valentin.schneider

On 07/08/20 16:13, peterz@infradead.org wrote:
> On Fri, Aug 07, 2020 at 03:43:53PM +0200, Juri Lelli wrote:
> 
> > Right, but I fear we won't be able to keep current behavior for wakeups:
> > RT with highest prio always gets scheduled right away?
> 
> If you consider RT throttling, that's already not the case. We can
> consider this fair server to be just another way to implement that.
> 
> At some point, we'll have have to preempt higher priority tasks, that's
> the entire point of the thing after all.
> 

Ah, indeed. But I was thinking we might try to do a little "better" wrt
to RT Throttling while we are at it. :-)

This proposal has the potential to be more flexible and to kick in only
when really needed.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 6/6] sched/fair: Implement starvation monitor
  2020-08-07 14:11       ` peterz
@ 2020-08-07 16:48         ` Daniel Bristot de Oliveira
  0 siblings, 0 replies; 29+ messages in thread
From: Daniel Bristot de Oliveira @ 2020-08-07 16:48 UTC (permalink / raw)
  To: peterz, luca abeni
  Cc: Juri Lelli, mingo, rostedt, tglx, linux-kernel,
	tommaso.cucinotta, alessio.balsini, dietmar.eggemann,
	linux-rt-users, mtosatti, williams, valentin.schneider

On 8/7/20 4:11 PM, peterz@infradead.org wrote:
> But I shelved all that after I heard about that other balancer idea
> Danial was suppose to be working on ;-)))

The PhD bureaucracy (and behind the scenes) were blocking me... but I am free
man now and will catch up on that ;-).

[ also because I made enough progress on this:

https://github.com/bristot/rtsl/

and can start other things. ]

-- Daniel


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 0/6] SCHED_DEADLINE server infrastructure
  2020-08-07  9:50 [RFC PATCH v2 0/6] SCHED_DEADLINE server infrastructure Juri Lelli
                   ` (6 preceding siblings ...)
  2020-08-07 13:16 ` [RFC PATCH v2 0/6] SCHED_DEADLINE server infrastructure luca abeni
@ 2020-09-08 22:22 ` Pavel Machek
  2020-09-09  5:51   ` Juri Lelli
  7 siblings, 1 reply; 29+ messages in thread
From: Pavel Machek @ 2020-09-08 22:22 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, mingo, rostedt, tglx, linux-kernel, luca.abeni,
	tommaso.cucinotta, alessio.balsini, bristot, dietmar.eggemann,
	linux-rt-users, mtosatti, williams, valentin.schneider

Hi!

> This is RFC v2 of Peter's SCHED_DEADLINE server infrastructure
> implementation [1].
> 
> SCHED_DEADLINE servers can help fixing starvation issues of low priority tasks (e.g., 
> SCHED_OTHER) when higher priority tasks monopolize CPU cycles. Today we have RT 
> Throttling; DEADLINE servers should be able to replace and improve that.

It would be worth noting what "server" is in this context.

It is not white box with CPU inside, it is not even an userland process, afaict.

Subject is quite confusing.

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 0/6] SCHED_DEADLINE server infrastructure
  2020-09-08 22:22 ` Pavel Machek
@ 2020-09-09  5:51   ` Juri Lelli
  0 siblings, 0 replies; 29+ messages in thread
From: Juri Lelli @ 2020-09-09  5:51 UTC (permalink / raw)
  To: Pavel Machek
  Cc: peterz, mingo, rostedt, tglx, linux-kernel, luca.abeni,
	tommaso.cucinotta, alessio.balsini, bristot, dietmar.eggemann,
	linux-rt-users, mtosatti, williams, valentin.schneider

Hi Pavel,

On 09/09/20 00:22, Pavel Machek wrote:
> Hi!
> 
> > This is RFC v2 of Peter's SCHED_DEADLINE server infrastructure
> > implementation [1].
> > 
> > SCHED_DEADLINE servers can help fixing starvation issues of low priority tasks (e.g., 
> > SCHED_OTHER) when higher priority tasks monopolize CPU cycles. Today we have RT 
> > Throttling; DEADLINE servers should be able to replace and improve that.
> 
> It would be worth noting what "server" is in this context.

It comes from Constant Bandwidth Server (CBS), that SCHED_DEADLINE is
implementing [1].

> 
> It is not white box with CPU inside, it is not even an userland process, afaict.
> 
> Subject is quite confusing.

Best,
Juri

1 - https://elixir.bootlin.com/linux/latest/source/Documentation/scheduler/sched-deadline.rst#L42


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 4/6] sched/deadline: Introduce deadline servers
  2020-08-07  9:50 ` [RFC PATCH v2 4/6] sched/deadline: Introduce deadline servers Juri Lelli
@ 2020-10-06  7:56   ` luca abeni
  2020-10-06  9:35     ` Juri Lelli
  0 siblings, 1 reply; 29+ messages in thread
From: luca abeni @ 2020-10-06  7:56 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, mingo, rostedt, tglx, linux-kernel, tommaso.cucinotta,
	alessio.balsini, bristot, dietmar.eggemann, linux-rt-users,
	mtosatti, williams, valentin.schneider

Hi,

sorry for the late reply... Anyway, I am currently testing this
patchset (and trying to use it for the "SCHED_DEADLINE-based cgroup
scheduling" patchset).
And during my tests I had a doubt:



On Fri,  7 Aug 2020 11:50:49 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:

> From: Peter Zijlstra <peterz@infradead.org>
> 
> Low priority tasks (e.g., SCHED_OTHER) can suffer starvation if tasks
> with higher priority (e.g., SCHED_FIFO) monopolize CPU(s).
[...]
> @@ -1024,10 +1062,34 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
>  	struct sched_dl_entity *dl_se = container_of(timer,
>  						     struct sched_dl_entity,
>  						     dl_timer);
> -	struct task_struct *p = dl_task_of(dl_se);
> +	struct task_struct *p;
>  	struct rq_flags rf;
>  	struct rq *rq;
>  
> +	if (dl_server(dl_se)) {
> +		struct rq *rq = rq_of_dl_se(dl_se);
> +		struct rq_flags rf;
> +
> +		rq_lock(rq, &rf);
> +		if (dl_se->dl_throttled) {
> +			sched_clock_tick();
> +			update_rq_clock(rq);
> +
> +			if (dl_se->server_has_tasks(dl_se)) {
> +				enqueue_dl_entity(dl_se, dl_se, ENQUEUE_REPLENISH);
> +				resched_curr(rq);
> +				__push_dl_task(rq, &rf);
> +			} else {
> +				replenish_dl_entity(dl_se, dl_se);

I am wondering if here we need a "task_non_contending(dl_se)" after
"replenish_dl_entity(dl_se, dl_se);"...

Basically, what happened is that all the served tasks blocked while the
server was throttled... So, now the server should be disabled (so, we
replenish the dl entity but we do not insert it in runqueue).
But when the server finished its budget and has been throttled, it has
not been disabled (so, its utilization is still in running_bw).

I think that if we do not call task_non_contending() the server's
utilization is not correctly removed from running_bw (this does not
happen for regular tasks, because task_non_contending() is invoked when
they block, even if the dl entity is throttled... But for servers I
think we have an issue).

What do you think? Do you agree with this change?

I have no easy way to reproduce the issue, but this change fixed some
strange behaviours I was seeing when using this patch to schedule RT
cgroups.



			Thanks,
				Luca
> +			}
> +
> +		}
> +		rq_unlock(rq, &rf);
> +
> +		return HRTIMER_NORESTART;
> +	}
> +
> +	p = dl_task_of(dl_se);
>  	rq = task_rq_lock(p, &rf);
>  
>  	/*
> @@ -1098,21 +1160,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
>  	else
>  		resched_curr(rq);
>  
> -#ifdef CONFIG_SMP
> -	/*
> -	 * Queueing this task back might have overloaded rq, check if we need
> -	 * to kick someone away.
> -	 */
> -	if (has_pushable_dl_tasks(rq)) {
> -		/*
> -		 * Nothing relies on rq->lock after this, so its safe to drop
> -		 * rq->lock.
> -		 */
> -		rq_unpin_lock(rq, &rf);
> -		push_dl_task(rq);
> -		rq_repin_lock(rq, &rf);
> -	}
> -#endif
> +	__push_dl_task(rq, &rf);
>  
>  unlock:
>  	task_rq_unlock(rq, p, &rf);
> @@ -1154,12 +1202,11 @@ static void init_dl_task_timer(struct sched_dl_entity *dl_se)
>   */
>  static inline void dl_check_constrained_dl(struct sched_dl_entity *dl_se)
>  {
> -	struct task_struct *p = dl_task_of(dl_se);
> -	struct rq *rq = rq_of_dl_rq(dl_rq_of_se(dl_se));
> +	struct rq *rq = rq_of_dl_se(dl_se);
>  
>  	if (dl_time_before(dl_se->deadline, rq_clock(rq)) &&
>  	    dl_time_before(rq_clock(rq), dl_next_period(dl_se))) {
> -		if (unlikely(dl_se->dl_boosted || !start_dl_timer(p)))
> +		if (unlikely(dl_se->dl_boosted || !start_dl_timer(dl_se)))
>  			return;
>  		dl_se->dl_throttled = 1;
>  		if (dl_se->runtime > 0)
> @@ -1216,29 +1263,10 @@ static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
>  	return (delta * u_act) >> BW_SHIFT;
>  }
>  
> -/*
> - * Update the current task's runtime statistics (provided it is still
> - * a -deadline task and has not been removed from the dl_rq).
> - */
> -static void update_curr_dl(struct rq *rq)
> +static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta_exec)
>  {
> -	struct task_struct *curr = rq->curr;
> -	struct sched_dl_entity *dl_se = &curr->dl;
> -	s64 delta_exec, scaled_delta_exec;
> -	int cpu = cpu_of(rq);
> +	s64 scaled_delta_exec;
>  
> -	if (!dl_task(curr) || !on_dl_rq(dl_se))
> -		return;
> -
> -	/*
> -	 * Consumed budget is computed considering the time as
> -	 * observed by schedulable tasks (excluding time spent
> -	 * in hardirq context, etc.). Deadlines are instead
> -	 * computed using hard walltime. This seems to be the more
> -	 * natural solution, but the full ramifications of this
> -	 * approach need further study.
> -	 */
> -	delta_exec = update_curr_common(rq);
>  	if (unlikely(delta_exec <= 0)) {
>  		if (unlikely(dl_se->dl_yielded))
>  			goto throttle;
> @@ -1256,10 +1284,9 @@ static void update_curr_dl(struct rq *rq)
>  	 * according to current frequency and CPU maximum capacity.
>  	 */
>  	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
> -		scaled_delta_exec = grub_reclaim(delta_exec,
> -						 rq,
> -						 &curr->dl);
> +		scaled_delta_exec = grub_reclaim(delta_exec, rq, dl_se);
>  	} else {
> +		int cpu = cpu_of(rq);
>  		unsigned long scale_freq = arch_scale_freq_capacity(cpu);
>  		unsigned long scale_cpu = arch_scale_cpu_capacity(cpu);
>  
> @@ -1278,11 +1305,18 @@ static void update_curr_dl(struct rq *rq)
>  		    (dl_se->flags & SCHED_FLAG_DL_OVERRUN))
>  			dl_se->dl_overrun = 1;
>  
> -		__dequeue_task_dl(rq, curr, 0);
> -		if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr)))
> -			enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
> +		dequeue_dl_entity(dl_se, 0);
> +		if (!dl_server(dl_se))
> +			dequeue_pushable_dl_task(rq, dl_task_of(dl_se));
>  
> -		if (!is_leftmost(curr, &rq->dl))
> +		if (unlikely(dl_se->dl_boosted || !start_dl_timer(dl_se))) {
> +			if (dl_server(dl_se))
> +				enqueue_dl_entity(dl_se, dl_se, ENQUEUE_REPLENISH);
> +			else
> +				enqueue_task_dl(rq, dl_task_of(dl_se), ENQUEUE_REPLENISH);
> +		}
> +
> +		if (!is_leftmost(dl_se, &rq->dl))
>  			resched_curr(rq);
>  	}
>  
> @@ -1312,20 +1346,82 @@ static void update_curr_dl(struct rq *rq)
>  	}
>  }
>  
> +void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
> +{
> +	update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
> +}
> +
> +void dl_server_start(struct sched_dl_entity *dl_se)
> +{
> +	if (!dl_server(dl_se)) {
> +		dl_se->dl_server = 1;
> +		setup_new_dl_entity(dl_se);
> +	}
> +	enqueue_dl_entity(dl_se, dl_se, ENQUEUE_WAKEUP);
> +}
> +
> +void dl_server_stop(struct sched_dl_entity *dl_se)
> +{
> +	dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
> +}
> +
> +void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
> +		    dl_server_has_tasks_f has_tasks,
> +		    dl_server_pick_f pick)
> +{
> +	dl_se->rq = rq;
> +	dl_se->server_has_tasks = has_tasks;
> +	dl_se->server_pick = pick;
> +}
> +
> +/*
> + * Update the current task's runtime statistics (provided it is still
> + * a -deadline task and has not been removed from the dl_rq).
> + */
> +static void update_curr_dl(struct rq *rq)
> +{
> +	struct task_struct *curr = rq->curr;
> +	struct sched_dl_entity *dl_se = &curr->dl;
> +	s64 delta_exec;
> +
> +	if (!dl_task(curr) || !on_dl_rq(dl_se))
> +		return;
> +
> +	/*
> +	 * Consumed budget is computed considering the time as
> +	 * observed by schedulable tasks (excluding time spent
> +	 * in hardirq context, etc.). Deadlines are instead
> +	 * computed using hard walltime. This seems to be the more
> +	 * natural solution, but the full ramifications of this
> +	 * approach need further study.
> +	 */
> +	delta_exec = update_curr_common(rq);
> +	update_curr_dl_se(rq, dl_se, delta_exec);
> +}
> +
>  static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
>  {
>  	struct sched_dl_entity *dl_se = container_of(timer,
>  						     struct sched_dl_entity,
>  						     inactive_timer);
> -	struct task_struct *p = dl_task_of(dl_se);
> +	struct task_struct *p = NULL;
>  	struct rq_flags rf;
>  	struct rq *rq;
>  
> -	rq = task_rq_lock(p, &rf);
> +	if (!dl_server(dl_se)) {
> +		p = dl_task_of(dl_se);
> +		rq = task_rq_lock(p, &rf);
> +	} else {
> +		rq = dl_se->rq;
> +		rq_lock(rq, &rf);
> +	}
>  
>  	sched_clock_tick();
>  	update_rq_clock(rq);
>  
> +	if (dl_server(dl_se))
> +		goto no_task;
> +
>  	if (!dl_task(p) || p->state == TASK_DEAD) {
>  		struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
>  
> @@ -1342,14 +1438,21 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
>  
>  		goto unlock;
>  	}
> +
> +no_task:
>  	if (dl_se->dl_non_contending == 0)
>  		goto unlock;
>  
>  	sub_running_bw(dl_se, &rq->dl);
>  	dl_se->dl_non_contending = 0;
>  unlock:
> -	task_rq_unlock(rq, p, &rf);
> -	put_task_struct(p);
> +
> +	if (!dl_server(dl_se)) {
> +		task_rq_unlock(rq, p, &rf);
> +		put_task_struct(p);
> +	} else {
> +		rq_unlock(rq, &rf);
> +	}
>  
>  	return HRTIMER_NORESTART;
>  }
> @@ -1402,34 +1505,35 @@ static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
>  static inline void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
>  static inline void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
>  
> +static inline void update_dl_migration(struct dl_rq *dl_rq) {}
> +
>  #endif /* CONFIG_SMP */
>  
>  static inline
>  void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>  {
> -	int prio = dl_task_of(dl_se)->prio;
>  	u64 deadline = dl_se->deadline;
>  
> -	WARN_ON(!dl_prio(prio));
>  	dl_rq->dl_nr_running++;
>  	add_nr_running(rq_of_dl_rq(dl_rq), 1);
>  
>  	inc_dl_deadline(dl_rq, deadline);
> -	inc_dl_migration(dl_se, dl_rq);
> +	if (!dl_server(dl_se))
> +		inc_dl_migration(dl_se, dl_rq);
> +	update_dl_migration(dl_rq);
>  }
>  
>  static inline
>  void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>  {
> -	int prio = dl_task_of(dl_se)->prio;
> -
> -	WARN_ON(!dl_prio(prio));
>  	WARN_ON(!dl_rq->dl_nr_running);
>  	dl_rq->dl_nr_running--;
>  	sub_nr_running(rq_of_dl_rq(dl_rq), 1);
>  
>  	dec_dl_deadline(dl_rq, dl_se->deadline);
> -	dec_dl_migration(dl_se, dl_rq);
> +	if (!dl_server(dl_se))
> +		dec_dl_migration(dl_se, dl_rq);
> +	update_dl_migration(dl_rq);
>  }
>  
>  static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
> @@ -1524,8 +1628,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
>  	} else if (flags & ENQUEUE_REPLENISH) {
>  		replenish_dl_entity(dl_se, pi_se);
>  	} else if ((flags & ENQUEUE_RESTORE) &&
> -		  dl_time_before(dl_se->deadline,
> -				 rq_clock(rq_of_dl_rq(dl_rq_of_se(dl_se))))) {
> +		   dl_time_before(dl_se->deadline, rq_clock(rq_of_dl_se(dl_se)))) {
>  		setup_new_dl_entity(dl_se);
>  	}
>  
> @@ -1592,14 +1695,6 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
>  		enqueue_pushable_dl_task(rq, p);
>  }
>  
> -static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> -{
> -	dequeue_dl_entity(&p->dl, flags);
> -
> -	if (!p->dl.dl_throttled)
> -		dequeue_pushable_dl_task(rq, p);
> -}
> -
>  static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
>  {
>  	update_curr_dl(rq);
> @@ -1607,7 +1702,9 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
>  	if (p->on_rq == TASK_ON_RQ_MIGRATING)
>  		flags |= DEQUEUE_MIGRATING;
>  
> -	__dequeue_task_dl(rq, p, flags);
> +	dequeue_dl_entity(&p->dl, flags);
> +	if (!p->dl.dl_throttled)
> +		dequeue_pushable_dl_task(rq, p);
>  }
>  
>  /*
> @@ -1789,12 +1886,12 @@ static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
>  }
>  
>  #ifdef CONFIG_SCHED_HRTICK
> -static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
> +static void start_hrtick_dl(struct rq *rq, struct sched_dl_entity *dl_se)
>  {
> -	hrtick_start(rq, p->dl.runtime);
> +	hrtick_start(rq, dl_se->runtime);
>  }
>  #else /* !CONFIG_SCHED_HRTICK */
> -static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
> +static void start_hrtick_dl(struct rq *rq, struct sched_dl_entity *dl_se)
>  {
>  }
>  #endif
> @@ -1809,9 +1906,6 @@ static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
>  	if (!first)
>  		return;
>  
> -	if (hrtick_enabled(rq))
> -		start_hrtick_dl(rq, p);
> -
>  	if (rq->curr->sched_class != &dl_sched_class)
>  		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
>  
> @@ -1835,13 +1929,30 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
>  	struct dl_rq *dl_rq = &rq->dl;
>  	struct task_struct *p;
>  
> +again:
>  	if (!sched_dl_runnable(rq))
>  		return NULL;
>  
>  	dl_se = pick_next_dl_entity(rq, dl_rq);
>  	BUG_ON(!dl_se);
> -	p = dl_task_of(dl_se);
> -	set_next_task_dl(rq, p, true);
> +	if (dl_server(dl_se)) {
> +		p = dl_se->server_pick(dl_se);
> +		if (!p) {
> +			// XXX should not happen, warn?!
> +			dl_se->dl_yielded = 1;
> +			update_curr_dl_se(rq, dl_se, 0);
> +			goto again;
> +		}
> +		p->server = dl_se;
> +	} else {
> +		p = dl_task_of(dl_se);
> +		set_next_task_dl(rq, p, true);
> +	}
> +
> +	/* XXX not quite right */
> +	if (hrtick_enabled(rq))
> +		start_hrtick_dl(rq, dl_se);
> +
>  	return p;
>  }
>  
> @@ -1873,8 +1984,8 @@ static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
>  	 * be set and schedule() will start a new hrtick for the next task.
>  	 */
>  	if (hrtick_enabled(rq) && queued && p->dl.runtime > 0 &&
> -	    is_leftmost(p, &rq->dl))
> -		start_hrtick_dl(rq, p);
> +	    is_leftmost(&p->dl, &rq->dl))
> +		start_hrtick_dl(rq, &p->dl);
>  }
>  
>  static void task_fork_dl(struct task_struct *p)
> @@ -2771,6 +2882,7 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se)
>  	dl_se->dl_yielded		= 0;
>  	dl_se->dl_non_contending	= 0;
>  	dl_se->dl_overrun		= 0;
> +	dl_se->dl_server		= 0;
>  }
>  
>  void init_dl_entity(struct sched_dl_entity *dl_se)
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 10a230d85104a..5130239c0e1e5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -868,6 +868,8 @@ s64 update_curr_common(struct rq *rq)
>  
>  	account_group_exec_runtime(curr, delta_exec);
>  	cgroup_account_cputime(curr, delta_exec);
> +	if (curr->server)
> +		dl_server_update(curr->server, delta_exec);
>  
>  	return delta_exec;
>  }
> @@ -897,6 +899,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
>  		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
>  		cgroup_account_cputime(curtask, delta_exec);
>  		account_group_exec_runtime(curtask, delta_exec);
> +		if (curtask->server)
> +			dl_server_update(curtask->server, delta_exec);
>  	}
>  
>  	account_cfs_rq_runtime(cfs_rq, delta_exec);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index d3db8c7d8b641..f035cd8ccd224 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -346,6 +346,35 @@ extern int  dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_c
>  extern int  dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
>  extern bool dl_cpu_busy(unsigned int cpu);
>  
> +/*
> + * SCHED_DEADLINE supports servers (nested scheduling) with the following
> + * interface:
> + *
> + *   dl_se::rq -- runqueue we belong to.
> + *
> + *   dl_se::server_has_tasks() -- used on bandwidth enforcement; we 'stop' the
> + *                                server when it runs out of tasks to run.
> + *
> + *   dl_se::server_pick() -- nested pick_next_task(); we yield the period if this
> + *                           returns NULL.
> + *
> + *   dl_server_update() -- called from update_curr_common(), propagates runtime
> + *                         to the server.
> + *
> + *   dl_server_start()
> + *   dl_server_stop()  -- start/stop the server when it has (no) tasks
> + *
> + *   dl_server_init()
> + *
> + * XXX
> + */
> +extern void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec);
> +extern void dl_server_start(struct sched_dl_entity *dl_se);
> +extern void dl_server_stop(struct sched_dl_entity *dl_se);
> +extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
> +		    dl_server_has_tasks_f has_tasks,
> +		    dl_server_pick_f pick);
> +
>  #ifdef CONFIG_CGROUP_SCHED
>  
>  #include <linux/cgroup.h>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 4/6] sched/deadline: Introduce deadline servers
  2020-10-06  7:56   ` luca abeni
@ 2020-10-06  9:35     ` Juri Lelli
  2020-10-06  9:51       ` luca abeni
  0 siblings, 1 reply; 29+ messages in thread
From: Juri Lelli @ 2020-10-06  9:35 UTC (permalink / raw)
  To: luca abeni
  Cc: peterz, mingo, rostedt, tglx, linux-kernel, tommaso.cucinotta,
	alessio.balsini, bristot, dietmar.eggemann, linux-rt-users,
	mtosatti, williams, valentin.schneider

On 06/10/20 09:56, luca abeni wrote:
> Hi,
> 
> sorry for the late reply... Anyway, I am currently testing this
> patchset (and trying to use it for the "SCHED_DEADLINE-based cgroup
> scheduling" patchset).
> And during my tests I had a doubt:
> 
> 
> 
> On Fri,  7 Aug 2020 11:50:49 +0200
> Juri Lelli <juri.lelli@redhat.com> wrote:
> 
> > From: Peter Zijlstra <peterz@infradead.org>
> > 
> > Low priority tasks (e.g., SCHED_OTHER) can suffer starvation if tasks
> > with higher priority (e.g., SCHED_FIFO) monopolize CPU(s).
> [...]
> > @@ -1024,10 +1062,34 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
> >  	struct sched_dl_entity *dl_se = container_of(timer,
> >  						     struct sched_dl_entity,
> >  						     dl_timer);
> > -	struct task_struct *p = dl_task_of(dl_se);
> > +	struct task_struct *p;
> >  	struct rq_flags rf;
> >  	struct rq *rq;
> >  
> > +	if (dl_server(dl_se)) {
> > +		struct rq *rq = rq_of_dl_se(dl_se);
> > +		struct rq_flags rf;
> > +
> > +		rq_lock(rq, &rf);
> > +		if (dl_se->dl_throttled) {
> > +			sched_clock_tick();
> > +			update_rq_clock(rq);
> > +
> > +			if (dl_se->server_has_tasks(dl_se)) {
> > +				enqueue_dl_entity(dl_se, dl_se, ENQUEUE_REPLENISH);
> > +				resched_curr(rq);
> > +				__push_dl_task(rq, &rf);
> > +			} else {
> > +				replenish_dl_entity(dl_se, dl_se);
> 
> I am wondering if here we need a "task_non_contending(dl_se)" after
> "replenish_dl_entity(dl_se, dl_se);"...
> 
> Basically, what happened is that all the served tasks blocked while the
> server was throttled... So, now the server should be disabled (so, we
> replenish the dl entity but we do not insert it in runqueue).
> But when the server finished its budget and has been throttled, it has
> not been disabled (so, its utilization is still in running_bw).

Hummm. For CFS, we call dl_server_stop() after the last CFS task blocks
and that calls dequeue_dl(SLEEP), which should be calling
task_non_contending(). That should be happening also while the server is
throttled and CFS tasks are running outside of it, no? Guess I'm missing
something.

> I think that if we do not call task_non_contending() the server's
> utilization is not correctly removed from running_bw (this does not
> happen for regular tasks, because task_non_contending() is invoked when
> they block, even if the dl entity is throttled... But for servers I
> think we have an issue).
> 
> What do you think? Do you agree with this change?
> 
> I have no easy way to reproduce the issue, but this change fixed some
> strange behaviours I was seeing when using this patch to schedule RT
> cgroups.
> 
> 
> 
> 			Thanks,
> 				Luca
> > +			}
> > +
> > +		}
> > +		rq_unlock(rq, &rf);
> > +
> > +		return HRTIMER_NORESTART;
> > +	}
> > +
> > +	p = dl_task_of(dl_se);
> >  	rq = task_rq_lock(p, &rf);
> >  
> >  	/*
> > @@ -1098,21 +1160,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
> >  	else
> >  		resched_curr(rq);
> >  
> > -#ifdef CONFIG_SMP
> > -	/*
> > -	 * Queueing this task back might have overloaded rq, check if we need
> > -	 * to kick someone away.
> > -	 */
> > -	if (has_pushable_dl_tasks(rq)) {
> > -		/*
> > -		 * Nothing relies on rq->lock after this, so its safe to drop
> > -		 * rq->lock.
> > -		 */
> > -		rq_unpin_lock(rq, &rf);
> > -		push_dl_task(rq);
> > -		rq_repin_lock(rq, &rf);
> > -	}
> > -#endif
> > +	__push_dl_task(rq, &rf);
> >  
> >  unlock:
> >  	task_rq_unlock(rq, p, &rf);
> > @@ -1154,12 +1202,11 @@ static void init_dl_task_timer(struct sched_dl_entity *dl_se)
> >   */
> >  static inline void dl_check_constrained_dl(struct sched_dl_entity *dl_se)
> >  {
> > -	struct task_struct *p = dl_task_of(dl_se);
> > -	struct rq *rq = rq_of_dl_rq(dl_rq_of_se(dl_se));
> > +	struct rq *rq = rq_of_dl_se(dl_se);
> >  
> >  	if (dl_time_before(dl_se->deadline, rq_clock(rq)) &&
> >  	    dl_time_before(rq_clock(rq), dl_next_period(dl_se))) {
> > -		if (unlikely(dl_se->dl_boosted || !start_dl_timer(p)))
> > +		if (unlikely(dl_se->dl_boosted || !start_dl_timer(dl_se)))
> >  			return;
> >  		dl_se->dl_throttled = 1;
> >  		if (dl_se->runtime > 0)
> > @@ -1216,29 +1263,10 @@ static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
> >  	return (delta * u_act) >> BW_SHIFT;
> >  }
> >  
> > -/*
> > - * Update the current task's runtime statistics (provided it is still
> > - * a -deadline task and has not been removed from the dl_rq).
> > - */
> > -static void update_curr_dl(struct rq *rq)
> > +static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta_exec)
> >  {
> > -	struct task_struct *curr = rq->curr;
> > -	struct sched_dl_entity *dl_se = &curr->dl;
> > -	s64 delta_exec, scaled_delta_exec;
> > -	int cpu = cpu_of(rq);
> > +	s64 scaled_delta_exec;
> >  
> > -	if (!dl_task(curr) || !on_dl_rq(dl_se))
> > -		return;
> > -
> > -	/*
> > -	 * Consumed budget is computed considering the time as
> > -	 * observed by schedulable tasks (excluding time spent
> > -	 * in hardirq context, etc.). Deadlines are instead
> > -	 * computed using hard walltime. This seems to be the more
> > -	 * natural solution, but the full ramifications of this
> > -	 * approach need further study.
> > -	 */
> > -	delta_exec = update_curr_common(rq);
> >  	if (unlikely(delta_exec <= 0)) {
> >  		if (unlikely(dl_se->dl_yielded))
> >  			goto throttle;
> > @@ -1256,10 +1284,9 @@ static void update_curr_dl(struct rq *rq)
> >  	 * according to current frequency and CPU maximum capacity.
> >  	 */
> >  	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
> > -		scaled_delta_exec = grub_reclaim(delta_exec,
> > -						 rq,
> > -						 &curr->dl);
> > +		scaled_delta_exec = grub_reclaim(delta_exec, rq, dl_se);
> >  	} else {
> > +		int cpu = cpu_of(rq);
> >  		unsigned long scale_freq = arch_scale_freq_capacity(cpu);
> >  		unsigned long scale_cpu = arch_scale_cpu_capacity(cpu);
> >  
> > @@ -1278,11 +1305,18 @@ static void update_curr_dl(struct rq *rq)
> >  		    (dl_se->flags & SCHED_FLAG_DL_OVERRUN))
> >  			dl_se->dl_overrun = 1;
> >  
> > -		__dequeue_task_dl(rq, curr, 0);
> > -		if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr)))
> > -			enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
> > +		dequeue_dl_entity(dl_se, 0);
> > +		if (!dl_server(dl_se))
> > +			dequeue_pushable_dl_task(rq, dl_task_of(dl_se));
> >  
> > -		if (!is_leftmost(curr, &rq->dl))
> > +		if (unlikely(dl_se->dl_boosted || !start_dl_timer(dl_se))) {
> > +			if (dl_server(dl_se))
> > +				enqueue_dl_entity(dl_se, dl_se, ENQUEUE_REPLENISH);
> > +			else
> > +				enqueue_task_dl(rq, dl_task_of(dl_se), ENQUEUE_REPLENISH);
> > +		}
> > +
> > +		if (!is_leftmost(dl_se, &rq->dl))
> >  			resched_curr(rq);
> >  	}
> >  
> > @@ -1312,20 +1346,82 @@ static void update_curr_dl(struct rq *rq)
> >  	}
> >  }
> >  
> > +void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
> > +{
> > +	update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
> > +}
> > +
> > +void dl_server_start(struct sched_dl_entity *dl_se)
> > +{
> > +	if (!dl_server(dl_se)) {
> > +		dl_se->dl_server = 1;
> > +		setup_new_dl_entity(dl_se);
> > +	}
> > +	enqueue_dl_entity(dl_se, dl_se, ENQUEUE_WAKEUP);
> > +}
> > +
> > +void dl_server_stop(struct sched_dl_entity *dl_se)
> > +{
> > +	dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
> > +}
> > +
> > +void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
> > +		    dl_server_has_tasks_f has_tasks,
> > +		    dl_server_pick_f pick)
> > +{
> > +	dl_se->rq = rq;
> > +	dl_se->server_has_tasks = has_tasks;
> > +	dl_se->server_pick = pick;
> > +}
> > +
> > +/*
> > + * Update the current task's runtime statistics (provided it is still
> > + * a -deadline task and has not been removed from the dl_rq).
> > + */
> > +static void update_curr_dl(struct rq *rq)
> > +{
> > +	struct task_struct *curr = rq->curr;
> > +	struct sched_dl_entity *dl_se = &curr->dl;
> > +	s64 delta_exec;
> > +
> > +	if (!dl_task(curr) || !on_dl_rq(dl_se))
> > +		return;
> > +
> > +	/*
> > +	 * Consumed budget is computed considering the time as
> > +	 * observed by schedulable tasks (excluding time spent
> > +	 * in hardirq context, etc.). Deadlines are instead
> > +	 * computed using hard walltime. This seems to be the more
> > +	 * natural solution, but the full ramifications of this
> > +	 * approach need further study.
> > +	 */
> > +	delta_exec = update_curr_common(rq);
> > +	update_curr_dl_se(rq, dl_se, delta_exec);
> > +}
> > +
> >  static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
> >  {
> >  	struct sched_dl_entity *dl_se = container_of(timer,
> >  						     struct sched_dl_entity,
> >  						     inactive_timer);
> > -	struct task_struct *p = dl_task_of(dl_se);
> > +	struct task_struct *p = NULL;
> >  	struct rq_flags rf;
> >  	struct rq *rq;
> >  
> > -	rq = task_rq_lock(p, &rf);
> > +	if (!dl_server(dl_se)) {
> > +		p = dl_task_of(dl_se);
> > +		rq = task_rq_lock(p, &rf);
> > +	} else {
> > +		rq = dl_se->rq;
> > +		rq_lock(rq, &rf);
> > +	}
> >  
> >  	sched_clock_tick();
> >  	update_rq_clock(rq);
> >  
> > +	if (dl_server(dl_se))
> > +		goto no_task;
> > +
> >  	if (!dl_task(p) || p->state == TASK_DEAD) {
> >  		struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
> >  
> > @@ -1342,14 +1438,21 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
> >  
> >  		goto unlock;
> >  	}
> > +
> > +no_task:
> >  	if (dl_se->dl_non_contending == 0)
> >  		goto unlock;
> >  
> >  	sub_running_bw(dl_se, &rq->dl);
> >  	dl_se->dl_non_contending = 0;
> >  unlock:
> > -	task_rq_unlock(rq, p, &rf);
> > -	put_task_struct(p);
> > +
> > +	if (!dl_server(dl_se)) {
> > +		task_rq_unlock(rq, p, &rf);
> > +		put_task_struct(p);
> > +	} else {
> > +		rq_unlock(rq, &rf);
> > +	}
> >  
> >  	return HRTIMER_NORESTART;
> >  }
> > @@ -1402,34 +1505,35 @@ static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
> >  static inline void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
> >  static inline void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
> >  
> > +static inline void update_dl_migration(struct dl_rq *dl_rq) {}
> > +
> >  #endif /* CONFIG_SMP */
> >  
> >  static inline
> >  void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> >  {
> > -	int prio = dl_task_of(dl_se)->prio;
> >  	u64 deadline = dl_se->deadline;
> >  
> > -	WARN_ON(!dl_prio(prio));
> >  	dl_rq->dl_nr_running++;
> >  	add_nr_running(rq_of_dl_rq(dl_rq), 1);
> >  
> >  	inc_dl_deadline(dl_rq, deadline);
> > -	inc_dl_migration(dl_se, dl_rq);
> > +	if (!dl_server(dl_se))
> > +		inc_dl_migration(dl_se, dl_rq);
> > +	update_dl_migration(dl_rq);
> >  }
> >  
> >  static inline
> >  void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> >  {
> > -	int prio = dl_task_of(dl_se)->prio;
> > -
> > -	WARN_ON(!dl_prio(prio));
> >  	WARN_ON(!dl_rq->dl_nr_running);
> >  	dl_rq->dl_nr_running--;
> >  	sub_nr_running(rq_of_dl_rq(dl_rq), 1);
> >  
> >  	dec_dl_deadline(dl_rq, dl_se->deadline);
> > -	dec_dl_migration(dl_se, dl_rq);
> > +	if (!dl_server(dl_se))
> > +		dec_dl_migration(dl_se, dl_rq);
> > +	update_dl_migration(dl_rq);
> >  }
> >  
> >  static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
> > @@ -1524,8 +1628,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
> >  	} else if (flags & ENQUEUE_REPLENISH) {
> >  		replenish_dl_entity(dl_se, pi_se);
> >  	} else if ((flags & ENQUEUE_RESTORE) &&
> > -		  dl_time_before(dl_se->deadline,
> > -				 rq_clock(rq_of_dl_rq(dl_rq_of_se(dl_se))))) {
> > +		   dl_time_before(dl_se->deadline, rq_clock(rq_of_dl_se(dl_se)))) {
> >  		setup_new_dl_entity(dl_se);
> >  	}
> >  
> > @@ -1592,14 +1695,6 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> >  		enqueue_pushable_dl_task(rq, p);
> >  }
> >  
> > -static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> > -{
> > -	dequeue_dl_entity(&p->dl, flags);
> > -
> > -	if (!p->dl.dl_throttled)
> > -		dequeue_pushable_dl_task(rq, p);
> > -}
> > -
> >  static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> >  {
> >  	update_curr_dl(rq);
> > @@ -1607,7 +1702,9 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> >  	if (p->on_rq == TASK_ON_RQ_MIGRATING)
> >  		flags |= DEQUEUE_MIGRATING;
> >  
> > -	__dequeue_task_dl(rq, p, flags);
> > +	dequeue_dl_entity(&p->dl, flags);
> > +	if (!p->dl.dl_throttled)
> > +		dequeue_pushable_dl_task(rq, p);
> >  }
> >  
> >  /*
> > @@ -1789,12 +1886,12 @@ static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
> >  }
> >  
> >  #ifdef CONFIG_SCHED_HRTICK
> > -static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
> > +static void start_hrtick_dl(struct rq *rq, struct sched_dl_entity *dl_se)
> >  {
> > -	hrtick_start(rq, p->dl.runtime);
> > +	hrtick_start(rq, dl_se->runtime);
> >  }
> >  #else /* !CONFIG_SCHED_HRTICK */
> > -static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
> > +static void start_hrtick_dl(struct rq *rq, struct sched_dl_entity *dl_se)
> >  {
> >  }
> >  #endif
> > @@ -1809,9 +1906,6 @@ static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
> >  	if (!first)
> >  		return;
> >  
> > -	if (hrtick_enabled(rq))
> > -		start_hrtick_dl(rq, p);
> > -
> >  	if (rq->curr->sched_class != &dl_sched_class)
> >  		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
> >  
> > @@ -1835,13 +1929,30 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
> >  	struct dl_rq *dl_rq = &rq->dl;
> >  	struct task_struct *p;
> >  
> > +again:
> >  	if (!sched_dl_runnable(rq))
> >  		return NULL;
> >  
> >  	dl_se = pick_next_dl_entity(rq, dl_rq);
> >  	BUG_ON(!dl_se);
> > -	p = dl_task_of(dl_se);
> > -	set_next_task_dl(rq, p, true);
> > +	if (dl_server(dl_se)) {
> > +		p = dl_se->server_pick(dl_se);
> > +		if (!p) {
> > +			// XXX should not happen, warn?!
> > +			dl_se->dl_yielded = 1;
> > +			update_curr_dl_se(rq, dl_se, 0);
> > +			goto again;
> > +		}
> > +		p->server = dl_se;
> > +	} else {
> > +		p = dl_task_of(dl_se);
> > +		set_next_task_dl(rq, p, true);
> > +	}
> > +
> > +	/* XXX not quite right */
> > +	if (hrtick_enabled(rq))
> > +		start_hrtick_dl(rq, dl_se);
> > +
> >  	return p;
> >  }
> >  
> > @@ -1873,8 +1984,8 @@ static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
> >  	 * be set and schedule() will start a new hrtick for the next task.
> >  	 */
> >  	if (hrtick_enabled(rq) && queued && p->dl.runtime > 0 &&
> > -	    is_leftmost(p, &rq->dl))
> > -		start_hrtick_dl(rq, p);
> > +	    is_leftmost(&p->dl, &rq->dl))
> > +		start_hrtick_dl(rq, &p->dl);
> >  }
> >  
> >  static void task_fork_dl(struct task_struct *p)
> > @@ -2771,6 +2882,7 @@ static void __dl_clear_params(struct sched_dl_entity *dl_se)
> >  	dl_se->dl_yielded		= 0;
> >  	dl_se->dl_non_contending	= 0;
> >  	dl_se->dl_overrun		= 0;
> > +	dl_se->dl_server		= 0;
> >  }
> >  
> >  void init_dl_entity(struct sched_dl_entity *dl_se)
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 10a230d85104a..5130239c0e1e5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -868,6 +868,8 @@ s64 update_curr_common(struct rq *rq)
> >  
> >  	account_group_exec_runtime(curr, delta_exec);
> >  	cgroup_account_cputime(curr, delta_exec);
> > +	if (curr->server)
> > +		dl_server_update(curr->server, delta_exec);
> >  
> >  	return delta_exec;
> >  }
> > @@ -897,6 +899,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
> >  		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
> >  		cgroup_account_cputime(curtask, delta_exec);
> >  		account_group_exec_runtime(curtask, delta_exec);
> > +		if (curtask->server)
> > +			dl_server_update(curtask->server, delta_exec);
> >  	}
> >  
> >  	account_cfs_rq_runtime(cfs_rq, delta_exec);
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index d3db8c7d8b641..f035cd8ccd224 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -346,6 +346,35 @@ extern int  dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_c
> >  extern int  dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
> >  extern bool dl_cpu_busy(unsigned int cpu);
> >  
> > +/*
> > + * SCHED_DEADLINE supports servers (nested scheduling) with the following
> > + * interface:
> > + *
> > + *   dl_se::rq -- runqueue we belong to.
> > + *
> > + *   dl_se::server_has_tasks() -- used on bandwidth enforcement; we 'stop' the
> > + *                                server when it runs out of tasks to run.
> > + *
> > + *   dl_se::server_pick() -- nested pick_next_task(); we yield the period if this
> > + *                           returns NULL.
> > + *
> > + *   dl_server_update() -- called from update_curr_common(), propagates runtime
> > + *                         to the server.
> > + *
> > + *   dl_server_start()
> > + *   dl_server_stop()  -- start/stop the server when it has (no) tasks
> > + *
> > + *   dl_server_init()
> > + *
> > + * XXX
> > + */
> > +extern void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec);
> > +extern void dl_server_start(struct sched_dl_entity *dl_se);
> > +extern void dl_server_stop(struct sched_dl_entity *dl_se);
> > +extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
> > +		    dl_server_has_tasks_f has_tasks,
> > +		    dl_server_pick_f pick);
> > +
> >  #ifdef CONFIG_CGROUP_SCHED
> >  
> >  #include <linux/cgroup.h>
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH v2 4/6] sched/deadline: Introduce deadline servers
  2020-10-06  9:35     ` Juri Lelli
@ 2020-10-06  9:51       ` luca abeni
  0 siblings, 0 replies; 29+ messages in thread
From: luca abeni @ 2020-10-06  9:51 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, mingo, rostedt, tglx, linux-kernel, tommaso.cucinotta,
	alessio.balsini, bristot, dietmar.eggemann, linux-rt-users,
	mtosatti, williams, valentin.schneider

On Tue, 6 Oct 2020 11:35:23 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:
[...]
> > > +			if (dl_se->server_has_tasks(dl_se)) {
> > > +				enqueue_dl_entity(dl_se, dl_se, ENQUEUE_REPLENISH);
> > > +				resched_curr(rq);
> > > +				__push_dl_task(rq, &rf);
> > > +			} else {
> > > +				replenish_dl_entity(dl_se, dl_se);  
> > 
> > I am wondering if here we need a "task_non_contending(dl_se)" after
> > "replenish_dl_entity(dl_se, dl_se);"...
> > 
> > Basically, what happened is that all the served tasks blocked while the
> > server was throttled... So, now the server should be disabled (so, we
> > replenish the dl entity but we do not insert it in runqueue).
> > But when the server finished its budget and has been throttled, it has
> > not been disabled (so, its utilization is still in running_bw).  
> 
> Hummm. For CFS, we call dl_server_stop() after the last CFS task blocks
> and that calls dequeue_dl(SLEEP), which should be calling
> task_non_contending(). That should be happening also while the server is
> throttled and CFS tasks are running outside of it, no?

You are right... I somehow lost this detail.


> Guess I'm missing something.

No, I was the one missing something :)
Sorry about the noise.



			Thanks,
				Luca

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, back to index

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-07  9:50 [RFC PATCH v2 0/6] SCHED_DEADLINE server infrastructure Juri Lelli
2020-08-07  9:50 ` [RFC PATCH v2 1/6] sched: Unify runtime accounting across classes Juri Lelli
2020-08-07  9:50 ` [RFC PATCH v2 2/6] sched/deadline: Collect sched_dl_entity initialization Juri Lelli
2020-08-07  9:50 ` [RFC PATCH v2 3/6] sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity Juri Lelli
2020-08-07  9:50 ` [RFC PATCH v2 4/6] sched/deadline: Introduce deadline servers Juri Lelli
2020-10-06  7:56   ` luca abeni
2020-10-06  9:35     ` Juri Lelli
2020-10-06  9:51       ` luca abeni
2020-08-07  9:50 ` [RFC PATCH v2 5/6] sched/fair: Add trivial fair server Juri Lelli
2020-08-07  9:56 ` [RFC PATCH v2 6/6] sched/fair: Implement starvation monitor Juri Lelli
2020-08-07 10:46   ` peterz
2020-08-07 11:30     ` Daniel Bristot de Oliveira
2020-08-07 12:50       ` Juri Lelli
2020-08-07 13:49     ` luca abeni
2020-08-07 14:11       ` peterz
2020-08-07 16:48         ` Daniel Bristot de Oliveira
2020-08-07 13:28   ` luca abeni
2020-08-07 13:43     ` Juri Lelli
2020-08-07 13:55       ` luca abeni
2020-08-07 14:11         ` Juri Lelli
2020-08-07 14:13       ` peterz
2020-08-07 15:06         ` Juri Lelli
2020-08-07 13:16 ` [RFC PATCH v2 0/6] SCHED_DEADLINE server infrastructure luca abeni
2020-08-07 13:30   ` Juri Lelli
2020-08-07 13:41     ` luca abeni
2020-08-07 14:04       ` Juri Lelli
2020-08-07 14:14   ` peterz
2020-09-08 22:22 ` Pavel Machek
2020-09-09  5:51   ` Juri Lelli

Linux-rt-users Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-rt-users/0 linux-rt-users/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-rt-users linux-rt-users/ https://lore.kernel.org/linux-rt-users \
		linux-rt-users@vger.kernel.org
	public-inbox-index linux-rt-users

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-rt-users


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git