* [RFC 0/8] CPU reclaiming for SCHED_DEADLINE
@ 2016-01-14 15:24 Luca Abeni
  2016-01-14 15:24 ` [RFC 1/8] Track the active utilisation Luca Abeni
                   ` (8 more replies)
  0 siblings, 9 replies; 58+ messages in thread
From: Luca Abeni @ 2016-01-14 15:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Luca Abeni

Hi all,

this patchset implements CPU reclaiming (using the GRUB algorithm[1])
for SCHED_DEADLINE: basically, this feature allows SCHED_DEADLINE tasks
to consume more than their reserved runtime, up to a maximum fraction
of the CPU time (so that some spare CPU time is left for the other
tasks), provided this does not break the guarantees of the other
SCHED_DEADLINE tasks.

I send this RFC because I think the code still needs some work and/or
cleanups (or maybe the patches should be split or merged in a different
way), but I'd like to check if there is interest in merging this feature
and if the current implementation strategy is reasonable.

I added in cc the usual people interested in SCHED_DEADLINE patches; if
you think that I should have added someone else, let me know (or please
forward these patches to interested people).

The implemented CPU reclaiming algorithm is based on tracking the
utilization U_act of active tasks (first 5 patches), and modifying the
runtime accounting rule (see patch 0006). The original GRUB algorithm is
modified as described in [2] to support multiple CPUs (the original
algorithm considered a single CPU; this one tracks U_act per
runqueue) and to leave an "unreclaimable" fraction of CPU time to
non-SCHED_DEADLINE tasks (the original algorithm can consume 100% of
the CPU time, starving all the other tasks).

I tried to split the patches so that the whole patchset can be better
understood; if they should be organized in a different way, let me know.
The first 5 patches (tracking of per-runqueue active utilization) can
be useful for frequency scaling too (the tracked "active utilization"
gives a clear hint about how much the core speed can be reduced without
compromising the SCHED_DEADLINE guarantees):
- patches 0001 and 0002 implement a simple tracking of the active
  utilization that is too optimistic from the theoretical point of
  view
- patch 0003 is mainly useful for debugging this patchset and can
  be removed without problems
- patch 0004 implements the "active utilization" tracking algorithm
  described in [1,2]. It uses a timer (named "inactive timer" here) to
  decrease U_act at the correct time (I called it the "0-lag time").
  I am working on an alternative implementation that does not use
  additional timers, but it is not ready yet; I'll post it when ready
  and tested
- patch 0005 tracks the utilization of the tasks that can execute on
  each runqueue. It is a pessimistic approximation of U_act (so, if
  used instead of U_act it allows reclaiming less CPU time, but does
  not break SCHED_DEADLINE guarantees)
- patches 0006-0008 implement the reclaiming algorithm.

[1] http://retis.sssup.it/~lipari/papers/lipariBaruah2000.pdf
[2] http://disi.unitn.it/~abeni/reclaiming/rtlws14-grub.pdf



Juri Lelli (1):
  sched/deadline: add some tracepoints

Luca Abeni (7):
  Track the active utilisation
  Correctly track the active utilisation for migrating tasks
  Improve the tracking of active utilisation
  Track the "total rq utilisation" too
  GRUB accounting
  Make GRUB a task's flag
  Do not reclaim the whole CPU bandwidth

 include/linux/sched.h        |   1 +
 include/trace/events/sched.h |  69 ++++++++++++++
 include/uapi/linux/sched.h   |   1 +
 kernel/sched/core.c          |   3 +-
 kernel/sched/deadline.c      | 214 +++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h         |  12 +++
 6 files changed, 292 insertions(+), 8 deletions(-)

-- 
1.9.1

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [RFC 1/8] Track the active utilisation
  2016-01-14 15:24 [RFC 0/8] CPU reclaiming for SCHED_DEADLINE Luca Abeni
@ 2016-01-14 15:24 ` Luca Abeni
  2016-01-14 16:49   ` Peter Zijlstra
  2016-01-14 19:13   ` Peter Zijlstra
  2016-01-14 15:24 ` [RFC 2/8] Correctly track the active utilisation for migrating tasks Luca Abeni
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 58+ messages in thread
From: Luca Abeni @ 2016-01-14 15:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Luca Abeni

The active utilisation here is defined as the total utilisation of the
active (TASK_RUNNING) tasks queued on a runqueue. Hence, it is increased
when a task wakes up and is decreased when a task blocks.
This might need to be fixed / improved by decreasing the active
utilisation at the so-called "0-lag time" instead of when the task blocks.

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
---
 kernel/sched/deadline.c | 36 +++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h    |  5 +++++
 2 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index cd64c97..e779cce 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -43,6 +43,24 @@ static inline int on_dl_rq(struct sched_dl_entity *dl_se)
 	return !RB_EMPTY_NODE(&dl_se->rb_node);
 }
 
+static void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	u64 se_bw = dl_se->dl_bw;
+
+	dl_rq->running_bw += se_bw;
+}
+
+static void clear_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	u64 se_bw = dl_se->dl_bw;
+
+	dl_rq->running_bw -= se_bw;
+	if (dl_rq->running_bw < 0) {
+		WARN_ON(1);
+		dl_rq->running_bw = 0;
+	}
+}
+
 static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
 {
 	struct sched_dl_entity *dl_se = &p->dl;
@@ -500,6 +518,8 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
 
+	add_running_bw(dl_se, dl_rq);
+
 	/*
 	 * The arrival of a new instance needs special treatment, i.e.,
 	 * the actual scheduling parameters have to be "renewed".
@@ -961,8 +981,10 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	 * its rq, the bandwidth timer callback (which clearly has not
 	 * run yet) will take care of this.
 	 */
-	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH))
+	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
+		add_running_bw(&p->dl, &rq->dl);
 		return;
+	}
 
 	enqueue_dl_entity(&p->dl, pi_se, flags);
 
@@ -980,6 +1002,8 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	update_curr_dl(rq);
 	__dequeue_task_dl(rq, p, flags);
+	if (flags & DEQUEUE_SLEEP)
+		clear_running_bw(&p->dl, &rq->dl);
 }
 
 /*
@@ -1218,6 +1242,8 @@ static void task_fork_dl(struct task_struct *p)
 static void task_dead_dl(struct task_struct *p)
 {
 	struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+	struct dl_rq *dl_rq = dl_rq_of_se(&p->dl);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
 
 	/*
 	 * Since we are TASK_DEAD we won't slip out of the domain!
@@ -1226,6 +1252,10 @@ static void task_dead_dl(struct task_struct *p)
 	/* XXX we should retain the bw until 0-lag */
 	dl_b->total_bw -= p->dl.dl_bw;
 	raw_spin_unlock_irq(&dl_b->lock);
+
+	if (task_on_rq_queued(p)) {
+		clear_running_bw(&p->dl, &rq->dl);
+	}
 }
 
 static void set_curr_task_dl(struct rq *rq)
@@ -1705,6 +1735,10 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
 	if (!start_dl_timer(p))
 		__dl_clear_params(p);
 
+	if (task_on_rq_queued(p)) {
+		clear_running_bw(&p->dl, &rq->dl);
+	}
+
 	/*
 	 * Since this might be the only -deadline task on the rq,
 	 * this is the right place to try to pull some other one
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 10f1637..826ca6a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -519,6 +519,11 @@ struct dl_rq {
 #else
 	struct dl_bw dl_bw;
 #endif
+	/* This is the "active utilization" for this runqueue.
+	 * Increased when a task wakes up (becomes TASK_RUNNING)
+	 * and decreased when a task blocks
+	 */
+	s64 running_bw;
 };
 
 #ifdef CONFIG_SMP
-- 
1.9.1


* [RFC 2/8] Correctly track the active utilisation for migrating tasks
  2016-01-14 15:24 [RFC 0/8] CPU reclaiming for SCHED_DEADLINE Luca Abeni
  2016-01-14 15:24 ` [RFC 1/8] Track the active utilisation Luca Abeni
@ 2016-01-14 15:24 ` Luca Abeni
  2016-01-14 15:24 ` [RFC 3/8] sched/deadline: add some tracepoints Luca Abeni
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 58+ messages in thread
From: Luca Abeni @ 2016-01-14 15:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Luca Abeni

Fix active utilisation accounting on migration: when a task is migrated
from CPUi to CPUj, immediately subtract the task's utilisation from
CPUi and add it to CPUj. This mechanism is implemented by modifying the
pull and push functions.

Note: this is not fully correct from the theoretical point of view
(the utilisation should be removed from CPUi only at the 0 lag time),
but doing the right thing would be _MUCH_ more complex (leaving the
timer armed when the task is on a different CPU... Inactive timers should
be moved from per-task timers to per-runqueue lists of timers! Bah...)
---
 kernel/sched/deadline.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e779cce..8d7ee79 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1541,7 +1541,9 @@ retry:
 	}
 
 	deactivate_task(rq, next_task, 0);
+	clear_running_bw(&next_task->dl, &rq->dl);
 	set_task_cpu(next_task, later_rq->cpu);
+	add_running_bw(&next_task->dl, &later_rq->dl);
 	activate_task(later_rq, next_task, 0);
 	ret = 1;
 
@@ -1629,7 +1631,9 @@ static void pull_dl_task(struct rq *this_rq)
 			resched = true;
 
 			deactivate_task(src_rq, p, 0);
+			clear_running_bw(&p->dl, &src_rq->dl);
 			set_task_cpu(p, this_cpu);
+			add_running_bw(&p->dl, &this_rq->dl);
 			activate_task(this_rq, p, 0);
 			dmin = p->dl.deadline;
 
-- 
1.9.1


* [RFC 3/8] sched/deadline: add some tracepoints
  2016-01-14 15:24 [RFC 0/8] CPU reclaiming for SCHED_DEADLINE Luca Abeni
  2016-01-14 15:24 ` [RFC 1/8] Track the active utilisation Luca Abeni
  2016-01-14 15:24 ` [RFC 2/8] Correctly track the active utilisation for migrating tasks Luca Abeni
@ 2016-01-14 15:24 ` Luca Abeni
  2016-01-14 15:24 ` [RFC 4/8] Improve the tracking of active utilisation Luca Abeni
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 58+ messages in thread
From: Luca Abeni @ 2016-01-14 15:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli

From: Juri Lelli <juri.lelli@arm.com>

These tracepoints can be used to check the active bandwidth
tracking and to show SCHED_DEADLINE parameters.

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
---
 include/trace/events/sched.h | 69 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/deadline.c      |  6 ++++
 2 files changed, 75 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 9b90c57..52644c7 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -408,6 +408,75 @@ DEFINE_EVENT(sched_stat_runtime, sched_stat_runtime,
 	     TP_ARGS(tsk, runtime, vruntime));
 
 /*
+ * Tracepoint for accounting running bandwidth of active SCHED_DEADLINE
+ * tasks (XXX specific to SCHED_FLAG_GRUB).
+ */
+DECLARE_EVENT_CLASS(sched_stat_running_bw,
+
+	TP_PROTO(struct task_struct *tsk, u64 tsk_bw, s64 running_bw),
+
+	TP_ARGS(tsk, tsk_bw, running_bw),
+
+	TP_STRUCT__entry(
+		__array( char,	comm,	TASK_COMM_LEN	)
+		__field( pid_t,	pid			)
+		__field( u64,	tsk_bw			)
+		__field( s64,	running_bw		)
+		__field( int,	cpu			)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+		__entry->cpu		= task_cpu(tsk);
+		__entry->pid		= tsk->pid;
+		__entry->tsk_bw		= tsk_bw;
+		__entry->running_bw	= running_bw;
+	),
+
+	TP_printk("comm=%s pid=%d cpu=%d tsk_bw=%Lu running_bw=%Ld ",
+			__entry->comm, __entry->pid, __entry->cpu,
+			(unsigned long long)__entry->tsk_bw,
+			(unsigned long long)__entry->running_bw)
+);
+
+DEFINE_EVENT(sched_stat_running_bw, sched_stat_running_bw_add,
+	     TP_PROTO(struct task_struct *tsk, u64 tsk_bw, s64 running_bw),
+	     TP_ARGS(tsk, tsk_bw, running_bw));
+
+DEFINE_EVENT(sched_stat_running_bw, sched_stat_running_bw_clear,
+	     TP_PROTO(struct task_struct *tsk, u64 tsk_bw, s64 running_bw),
+	     TP_ARGS(tsk, tsk_bw, running_bw));
+/*
+ * Tracepoint for showing actual parameters of SCHED_DEADLINE
+ * tasks.
+ */
+TRACE_EVENT(sched_stat_params_dl,
+
+	TP_PROTO(struct task_struct *tsk, s64 runtime, u64 deadline),
+
+	TP_ARGS(tsk, runtime, deadline),
+
+	TP_STRUCT__entry(
+		__array( char,	comm,	TASK_COMM_LEN	)
+		__field( pid_t,	pid			)
+		__field( s64,	runtime			)
+		__field( u64,	deadline		)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+		__entry->pid		= tsk->pid;
+		__entry->runtime	= runtime;
+		__entry->deadline	= deadline;
+	),
+
+	TP_printk("comm=%s pid=%d runtime=%Ld [ns] deadline=%Lu",
+			__entry->comm, __entry->pid,
+			__entry->runtime, __entry->deadline)
+);
+
+
+/*
  * Tracepoint for showing priority inheritance modifying a tasks
  * priority.
  */
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 8d7ee79..d8e9962 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -18,6 +18,8 @@
 
 #include <linux/slab.h>
 
+#include <trace/events/sched.h>
+
 struct dl_bandwidth def_dl_bandwidth;
 
 static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
@@ -48,6 +50,7 @@ static void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 	u64 se_bw = dl_se->dl_bw;
 
 	dl_rq->running_bw += se_bw;
+	trace_sched_stat_running_bw_add(dl_task_of(dl_se), se_bw, dl_rq->running_bw);
 }
 
 static void clear_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
@@ -55,6 +58,7 @@ static void clear_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 	u64 se_bw = dl_se->dl_bw;
 
 	dl_rq->running_bw -= se_bw;
+	trace_sched_stat_running_bw_clear(dl_task_of(dl_se), se_bw, dl_rq->running_bw);
 	if (dl_rq->running_bw < 0) {
 		WARN_ON(1);
 		dl_rq->running_bw = 0;
@@ -770,6 +774,7 @@ static void update_curr_dl(struct rq *rq)
 	sched_rt_avg_update(rq, delta_exec);
 
 	dl_se->runtime -= dl_se->dl_yielded ? 0 : delta_exec;
+	trace_sched_stat_params_dl(curr, dl_se->runtime, dl_se->deadline);
 	if (dl_runtime_exceeded(dl_se)) {
 		dl_se->dl_throttled = 1;
 		__dequeue_task_dl(rq, curr, 0);
@@ -987,6 +992,7 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	}
 
 	enqueue_dl_entity(&p->dl, pi_se, flags);
+	trace_sched_stat_params_dl(p, p->dl.runtime, p->dl.deadline);
 
 	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_dl_task(rq, p);
-- 
1.9.1


* [RFC 4/8] Improve the tracking of active utilisation
  2016-01-14 15:24 [RFC 0/8] CPU reclaiming for SCHED_DEADLINE Luca Abeni
                   ` (2 preceding siblings ...)
  2016-01-14 15:24 ` [RFC 3/8] sched/deadline: add some tracepoints Luca Abeni
@ 2016-01-14 15:24 ` Luca Abeni
  2016-01-14 17:16   ` Peter Zijlstra
                     ` (2 more replies)
  2016-01-14 15:24 ` [RFC 5/8] Track the "total rq utilisation" too Luca Abeni
                   ` (4 subsequent siblings)
  8 siblings, 3 replies; 58+ messages in thread
From: Luca Abeni @ 2016-01-14 15:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Luca Abeni

This patch implements a more theoretically sound algorithm for
tracking the active utilisation: instead of decreasing it when a
task blocks, use a timer (the "inactive timer", named after the
"Inactive" task state of the GRUB algorithm) to decrease the
active utilisation at the so-called "0-lag time".
---
 include/linux/sched.h   |   1 +
 kernel/sched/core.c     |   1 +
 kernel/sched/deadline.c | 152 ++++++++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h    |   1 +
 4 files changed, 137 insertions(+), 18 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 61aa9bb..50f212f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1346,6 +1346,7 @@ struct sched_dl_entity {
 	 * own bandwidth to be enforced, thus we need one timer per task.
 	 */
 	struct hrtimer dl_timer;
+	struct hrtimer inactive_timer;
 };
 
 union rcu_special {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 44253ad..7ca17e4c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2215,6 +2215,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 
 	RB_CLEAR_NODE(&p->dl.rb_node);
 	init_dl_task_timer(&p->dl);
+	init_inactive_task_timer(&p->dl);
 	__dl_clear_params(p);
 
 	INIT_LIST_HEAD(&p->rt.run_list);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d8e9962..0efa596 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -49,6 +49,7 @@ static void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 {
 	u64 se_bw = dl_se->dl_bw;
 
+	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
 	dl_rq->running_bw += se_bw;
 	trace_sched_stat_running_bw_add(dl_task_of(dl_se), se_bw, dl_rq->running_bw);
 }
@@ -57,6 +58,7 @@ static void clear_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 {
 	u64 se_bw = dl_se->dl_bw;
 
+	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
 	dl_rq->running_bw -= se_bw;
 	trace_sched_stat_running_bw_clear(dl_task_of(dl_se), se_bw, dl_rq->running_bw);
 	if (dl_rq->running_bw < 0) {
@@ -65,6 +67,62 @@ static void clear_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 	}
 }
 
+static void task_go_inactive(struct task_struct *p)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+	struct hrtimer *timer = &dl_se->inactive_timer;
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+	ktime_t now, act;
+	s64 delta;
+	u64 zerolag_time;
+
+	WARN_ON(dl_se->dl_runtime == 0);
+
+	/* If the inactive timer is already armed, return immediately */
+	if (hrtimer_active(&dl_se->inactive_timer))
+		return;
+
+
+	/*
+	 * We want the timer to fire at the "0 lag time", but considering
+	 * that it is actually coming from rq->clock and not from
+	 * hrtimer's time base reading.
+	 */
+        zerolag_time = dl_se->deadline - div64_long((dl_se->runtime * dl_se->dl_period), dl_se->dl_runtime);
+
+	act = ns_to_ktime(zerolag_time);
+	now = hrtimer_cb_get_time(timer);
+	delta = ktime_to_ns(now) - rq_clock(rq);
+	act = ktime_add_ns(act, delta);
+
+	/*
+	 * If the "0-lag time" already passed, decrease the active
+	 * utilization now, instead of starting a timer
+	 */
+	if (ktime_us_delta(act, now) < 0) {
+		clear_running_bw(dl_se, dl_rq);
+		if (!dl_task(p)) {
+			__dl_clear_params(p);
+		}
+		return;
+	}
+
+	if (!hrtimer_is_queued(timer)) {
+		hrtimer_start(timer, act, HRTIMER_MODE_ABS);
+	}
+
+	if (hrtimer_active(timer) == 0) {
+		printk("Problem activating inactive_timer!\n");
+		clear_running_bw(dl_se, dl_rq);
+		if (!dl_task(p)) {
+			__dl_clear_params(p);
+		}
+	} else {
+		get_task_struct(p);
+	}
+}
+
 static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
 {
 	struct sched_dl_entity *dl_se = &p->dl;
@@ -522,7 +580,6 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
 
-	add_running_bw(dl_se, dl_rq);
 
 	/*
 	 * The arrival of a new instance needs special treatment, i.e.,
@@ -530,9 +587,20 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
 	 */
 	if (dl_se->dl_new) {
 		setup_new_dl_entity(dl_se, pi_se);
+		add_running_bw(dl_se, dl_rq);
 		return;
 	}
 
+	/* If the "inactive timer" is still active, stop it and leave
+	 * the active utilisation unchanged.
+	 * If it is running, increase the active utilisation
+	 */
+	if (hrtimer_active(&dl_se->inactive_timer)) {
+		hrtimer_try_to_cancel(&dl_se->inactive_timer);
+	} else {
+	        add_running_bw(dl_se, dl_rq);
+	}
+
 	if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
 	    dl_entity_overflow(dl_se, pi_se, rq_clock(rq))) {
 		dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
@@ -619,12 +687,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 
 	rq = task_rq_lock(p, &flags);
 
-	/*
-	 * The task might have changed its scheduling policy to something
-	 * different than SCHED_DEADLINE (through switched_fromd_dl()).
-	 */
 	if (!dl_task(p)) {
-		__dl_clear_params(p);
 		goto unlock;
 	}
 
@@ -811,6 +874,49 @@ static void update_curr_dl(struct rq *rq)
 	}
 }
 
+static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
+{
+	struct sched_dl_entity *dl_se = container_of(timer,
+						     struct sched_dl_entity,
+						     inactive_timer);
+	struct task_struct *p = dl_task_of(dl_se);
+	unsigned long flags;
+	struct rq *rq;
+
+	rq = task_rq_lock(p, &flags);
+
+	if (dl_se->dl_new) {
+		printk("Problem! New task was inactive?\n");
+		goto unlock;
+	}
+	if (!dl_task(p)) {
+		__dl_clear_params(p);
+
+		goto unlock;
+	}
+	if (p->state == TASK_RUNNING) {
+		goto unlock;
+	}
+
+	sched_clock_tick();
+	update_rq_clock(rq);
+
+	clear_running_bw(dl_se, &rq->dl);
+unlock:
+	task_rq_unlock(rq, p, &flags);
+	put_task_struct(p);
+
+	return HRTIMER_NORESTART;
+}
+
+void init_inactive_task_timer(struct sched_dl_entity *dl_se)
+{
+	struct hrtimer *timer = &dl_se->inactive_timer;
+
+	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	timer->function = inactive_task_timer;
+}
+
 #ifdef CONFIG_SMP
 
 static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
@@ -987,7 +1093,10 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	 * run yet) will take care of this.
 	 */
 	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
-		add_running_bw(&p->dl, &rq->dl);
+		if (hrtimer_try_to_cancel(&p->dl.inactive_timer) < 0) {
+			printk("Waking up a depleted task, but cannot cancel inactive timer!\n");
+			add_running_bw(&p->dl, &rq->dl);
+		}
 		return;
 	}
 
@@ -1009,7 +1118,7 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	update_curr_dl(rq);
 	__dequeue_task_dl(rq, p, flags);
 	if (flags & DEQUEUE_SLEEP)
-		clear_running_bw(&p->dl, &rq->dl);
+		task_go_inactive(p);
 }
 
 /*
@@ -1087,6 +1196,19 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
 	}
 	rcu_read_unlock();
 
+	if (rq != cpu_rq(cpu)) {
+		if (hrtimer_active(&p->dl.inactive_timer)) {
+			raw_spin_lock(&rq->lock);
+			clear_running_bw(&p->dl, &rq->dl);
+			raw_spin_unlock(&rq->lock);
+			rq = cpu_rq(cpu);
+			raw_spin_lock(&rq->lock);
+			add_running_bw(&p->dl, &rq->dl);
+			raw_spin_unlock(&rq->lock);
+		}
+	}
+
+
 out:
 	return cpu;
 }
@@ -1248,8 +1370,6 @@ static void task_fork_dl(struct task_struct *p)
 static void task_dead_dl(struct task_struct *p)
 {
 	struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
-	struct dl_rq *dl_rq = dl_rq_of_se(&p->dl);
-	struct rq *rq = rq_of_dl_rq(dl_rq);
 
 	/*
 	 * Since we are TASK_DEAD we won't slip out of the domain!
@@ -1258,10 +1378,6 @@ static void task_dead_dl(struct task_struct *p)
 	/* XXX we should retain the bw until 0-lag */
 	dl_b->total_bw -= p->dl.dl_bw;
 	raw_spin_unlock_irq(&dl_b->lock);
-
-	if (task_on_rq_queued(p)) {
-		clear_running_bw(&p->dl, &rq->dl);
-	}
 }
 
 static void set_curr_task_dl(struct rq *rq)
@@ -1742,12 +1858,12 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
 	 * SCHED_DEADLINE until the deadline passes, the timer will reset the
 	 * task.
 	 */
-	if (!start_dl_timer(p))
+	if (task_on_rq_queued(p))
+		task_go_inactive(p);
+	if (!hrtimer_active(&p->dl.inactive_timer))
 		__dl_clear_params(p);
-
-	if (task_on_rq_queued(p)) {
+	else if (!hrtimer_callback_running(&p->dl.inactive_timer))
 		clear_running_bw(&p->dl, &rq->dl);
-	}
 
 	/*
 	 * Since this might be the only -deadline task on the rq,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 826ca6a..9d0fdb1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1278,6 +1278,7 @@ extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime
 extern struct dl_bandwidth def_dl_bandwidth;
 extern void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime);
 extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
+extern void init_inactive_task_timer(struct sched_dl_entity *dl_se);
 
 unsigned long to_ratio(u64 period, u64 runtime);
 
-- 
1.9.1


* [RFC 5/8] Track the "total rq utilisation" too
  2016-01-14 15:24 [RFC 0/8] CPU reclaiming for SCHED_DEADLINE Luca Abeni
                   ` (3 preceding siblings ...)
  2016-01-14 15:24 ` [RFC 4/8] Improve the tracking of active utilisation Luca Abeni
@ 2016-01-14 15:24 ` Luca Abeni
  2016-01-14 19:12   ` Peter Zijlstra
  2016-01-14 19:48   ` Peter Zijlstra
  2016-01-14 15:24 ` [RFC 6/8] GRUB accounting Luca Abeni
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 58+ messages in thread
From: Luca Abeni @ 2016-01-14 15:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Luca Abeni

This is the sum of the utilisations of the tasks that are assigned to
a runqueue, independently of their state (TASK_RUNNING or blocked).
---
 kernel/sched/deadline.c | 35 +++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h    |  2 ++
 2 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0efa596..15d3fd8 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -52,6 +52,10 @@ static void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
 	dl_rq->running_bw += se_bw;
 	trace_sched_stat_running_bw_add(dl_task_of(dl_se), se_bw, dl_rq->running_bw);
+	if (dl_rq->running_bw > dl_rq->this_bw) {
+		WARN_ON(1);
+		dl_rq->running_bw = dl_rq->this_bw;
+	}
 }
 
 static void clear_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
@@ -67,6 +71,22 @@ static void clear_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 	}
 }
 
+static void clear_rq_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	u64 se_bw = dl_se->dl_bw;
+
+	dl_rq->this_bw -= se_bw;
+	WARN_ON(dl_rq->this_bw < 0);
+	if (dl_rq->this_bw < 0) dl_rq->this_bw = 0;
+}
+
+static void add_rq_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	u64 se_bw = dl_se->dl_bw;
+
+	dl_rq->this_bw += se_bw;
+}
+
 static void task_go_inactive(struct task_struct *p)
 {
 	struct sched_dl_entity *dl_se = &p->dl;
@@ -104,6 +124,7 @@ static void task_go_inactive(struct task_struct *p)
 		clear_running_bw(dl_se, dl_rq);
 		if (!dl_task(p)) {
 			__dl_clear_params(p);
+			clear_rq_bw(&p->dl, &rq->dl);
 		}
 		return;
 	}
@@ -117,6 +138,7 @@ static void task_go_inactive(struct task_struct *p)
 		clear_running_bw(dl_se, dl_rq);
 		if (!dl_task(p)) {
 			__dl_clear_params(p);
+			clear_rq_bw(&p->dl, &rq->dl);
 		}
 	} else {
 		get_task_struct(p);
@@ -587,6 +609,7 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
 	 */
 	if (dl_se->dl_new) {
 		setup_new_dl_entity(dl_se, pi_se);
+		add_rq_bw(dl_se, dl_rq);
 		add_running_bw(dl_se, dl_rq);
 		return;
 	}
@@ -891,6 +914,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 	}
 	if (!dl_task(p)) {
 		__dl_clear_params(p);
+		clear_rq_bw(&p->dl, &rq->dl);
 
 		goto unlock;
 	}
@@ -1200,9 +1224,11 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
 		if (hrtimer_active(&p->dl.inactive_timer)) {
 			raw_spin_lock(&rq->lock);
 			clear_running_bw(&p->dl, &rq->dl);
+			clear_rq_bw(&p->dl, &rq->dl);
 			raw_spin_unlock(&rq->lock);
 			rq = cpu_rq(cpu);
 			raw_spin_lock(&rq->lock);
+			add_rq_bw(&p->dl, &rq->dl);
 			add_running_bw(&p->dl, &rq->dl);
 			raw_spin_unlock(&rq->lock);
 		}
@@ -1664,7 +1690,9 @@ retry:
 
 	deactivate_task(rq, next_task, 0);
 	clear_running_bw(&next_task->dl, &rq->dl);
+	clear_rq_bw(&next_task->dl, &rq->dl);
 	set_task_cpu(next_task, later_rq->cpu);
+	add_rq_bw(&next_task->dl, &later_rq->dl);
 	add_running_bw(&next_task->dl, &later_rq->dl);
 	activate_task(later_rq, next_task, 0);
 	ret = 1;
@@ -1754,7 +1782,9 @@ static void pull_dl_task(struct rq *this_rq)
 
 			deactivate_task(src_rq, p, 0);
 			clear_running_bw(&p->dl, &src_rq->dl);
+			clear_rq_bw(&p->dl, &src_rq->dl);
 			set_task_cpu(p, this_cpu);
+			add_rq_bw(&p->dl, &this_rq->dl);
 			add_running_bw(&p->dl, &this_rq->dl);
 			activate_task(this_rq, p, 0);
 			dmin = p->dl.deadline;
@@ -1860,9 +1890,10 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
 	 */
 	if (task_on_rq_queued(p))
 		task_go_inactive(p);
-	if (!hrtimer_active(&p->dl.inactive_timer))
+	if (!hrtimer_active(&p->dl.inactive_timer)) {
 		__dl_clear_params(p);
-	else if (!hrtimer_callback_running(&p->dl.inactive_timer))
+		clear_rq_bw(&p->dl, &rq->dl);
+	} else if (!hrtimer_callback_running(&p->dl.inactive_timer))
 		clear_running_bw(&p->dl, &rq->dl);
 
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9d0fdb1..d06005b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -524,6 +524,8 @@ struct dl_rq {
 	 * and decreased when a task blocks
 	 */
 	s64 running_bw;
+
+	s64 this_bw;
 };
 
 #ifdef CONFIG_SMP
-- 
1.9.1


* [RFC 6/8] GRUB accounting
  2016-01-14 15:24 [RFC 0/8] CPU reclaiming for SCHED_DEADLINE Luca Abeni
                   ` (4 preceding siblings ...)
  2016-01-14 15:24 ` [RFC 5/8] Track the "total rq utilisation" too Luca Abeni
@ 2016-01-14 15:24 ` Luca Abeni
  2016-01-14 19:50   ` Peter Zijlstra
  2016-01-14 15:24 ` [RFC 7/8] Make GRUB a task's flag Luca Abeni
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 58+ messages in thread
From: Luca Abeni @ 2016-01-14 15:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Luca Abeni

---
 kernel/sched/deadline.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 15d3fd8..4795d7f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -823,6 +823,11 @@ int dl_runtime_exceeded(struct sched_dl_entity *dl_se)
 
 extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
 
+u64 grub_reclaim(u64 delta, struct rq *rq, u64 u)
+{
+	return (delta * rq->dl.running_bw) >> 20;
+}
+
 /*
  * Update the current task's runtime statistics (provided it is still
  * a -deadline task and has not been removed from the dl_rq).
@@ -859,6 +864,7 @@ static void update_curr_dl(struct rq *rq)
 
 	sched_rt_avg_update(rq, delta_exec);
 
+	delta_exec = grub_reclaim(delta_exec, rq, curr->dl.dl_bw);
 	dl_se->runtime -= dl_se->dl_yielded ? 0 : delta_exec;
 	trace_sched_stat_params_dl(curr, dl_se->runtime, dl_se->deadline);
 	if (dl_runtime_exceeded(dl_se)) {
-- 
1.9.1


* [RFC 7/8] Make GRUB a task's flag
  2016-01-14 15:24 [RFC 0/8] CPU reclaiming for SCHED_DEADLINE Luca Abeni
                   ` (5 preceding siblings ...)
  2016-01-14 15:24 ` [RFC 6/8] GRUB accounting Luca Abeni
@ 2016-01-14 15:24 ` Luca Abeni
  2016-01-14 19:56   ` Peter Zijlstra
  2016-01-14 15:24 ` [RFC 8/8] Do not reclaim the whole CPU bandwidth Luca Abeni
  2016-01-19 10:11 ` [RFC 0/8] CPU reclaiming for SCHED_DEADLINE Juri Lelli
  8 siblings, 1 reply; 58+ messages in thread
From: Luca Abeni @ 2016-01-14 15:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Luca Abeni

---
 include/uapi/linux/sched.h | 1 +
 kernel/sched/core.c        | 2 +-
 kernel/sched/deadline.c    | 4 +++-
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index cc89dde..9279562 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -48,5 +48,6 @@
  * For the sched_{set,get}attr() calls
  */
 #define SCHED_FLAG_RESET_ON_FORK	0x01
+#define SCHED_FLAG_RECLAIM		0x02
 
 #endif /* _UAPI_LINUX_SCHED_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7ca17e4c..1a384c7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3911,7 +3911,7 @@ recheck:
 			return -EINVAL;
 	}
 
-	if (attr->sched_flags & ~(SCHED_FLAG_RESET_ON_FORK))
+	if (attr->sched_flags & ~(SCHED_FLAG_RESET_ON_FORK | SCHED_FLAG_RECLAIM))
 		return -EINVAL;
 
 	/*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 4795d7f..712cc6d 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -864,7 +864,9 @@ static void update_curr_dl(struct rq *rq)
 
 	sched_rt_avg_update(rq, delta_exec);
 
-	delta_exec = grub_reclaim(delta_exec, rq, curr->dl.dl_bw);
+	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
+		delta_exec = grub_reclaim(delta_exec, rq, curr->dl.dl_bw);
+	}
 	dl_se->runtime -= dl_se->dl_yielded ? 0 : delta_exec;
 	trace_sched_stat_params_dl(curr, dl_se->runtime, dl_se->deadline);
 	if (dl_runtime_exceeded(dl_se)) {
-- 
1.9.1

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [RFC 8/8] Do not reclaim the whole CPU bandwidth
  2016-01-14 15:24 [RFC 0/8] CPU reclaiming for SCHED_DEADLINE Luca Abeni
                   ` (6 preceding siblings ...)
  2016-01-14 15:24 ` [RFC 7/8] Make GRUB a task's flag Luca Abeni
@ 2016-01-14 15:24 ` Luca Abeni
  2016-01-14 19:59   ` Peter Zijlstra
  2016-01-19 10:11 ` [RFC 0/8] CPU reclaiming for SCHED_DEADLINE Juri Lelli
  8 siblings, 1 reply; 58+ messages in thread
From: Luca Abeni @ 2016-01-14 15:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Luca Abeni

Original GRUB tends to reclaim 100% of the CPU time... And this allows a
"CPU hog" (i.e., a busy loop) to starve non-deadline tasks.
To address this issue, allow the scheduler to reclaim only a specified
fraction of CPU time.
NOTE: the fraction of CPU time that cannot be reclaimed is currently
hardcoded as (1 << 20) / 10 (that is, 10%, leaving at most 90% of the
CPU time reclaimable), but it must be made configurable!
---
 kernel/sched/deadline.c | 3 ++-
 kernel/sched/sched.h    | 4 ++++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 712cc6d..57b693b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -185,6 +185,7 @@ void init_dl_rq(struct dl_rq *dl_rq)
 #else
 	init_dl_bw(&dl_rq->dl_bw);
 #endif
+	dl_rq->unusable_bw = (1 << 20) / 10;		// FIXME: allow to set this!
 }
 
 #ifdef CONFIG_SMP
@@ -825,7 +826,7 @@ extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
 
 u64 grub_reclaim(u64 delta, struct rq *rq, u64 u)
 {
-	return (delta * rq->dl.running_bw) >> 20;
+	return (delta * (rq->dl.unusable_bw + rq->dl.running_bw)) >> 20;
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d06005b..76df0ff 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -524,6 +524,10 @@ struct dl_rq {
 	 * and decreased when a task blocks
 	 */
 	s64 running_bw;
+	/* This is the amount of utilization that GRUB can not
+         * reclaim (per runqueue)
+         */
+	s64 unusable_bw;
 
 	s64 this_bw;
 };
-- 
1.9.1

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 1/8] Track the active utilisation
  2016-01-14 15:24 ` [RFC 1/8] Track the active utilisation Luca Abeni
@ 2016-01-14 16:49   ` Peter Zijlstra
  2016-01-15  6:37     ` Luca Abeni
  2016-01-14 19:13   ` Peter Zijlstra
  1 sibling, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-14 16:49 UTC (permalink / raw)
  To: Luca Abeni; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On Thu, Jan 14, 2016 at 04:24:46PM +0100, Luca Abeni wrote:
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 10f1637..826ca6a 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -519,6 +519,11 @@ struct dl_rq {
>  #else
>  	struct dl_bw dl_bw;
>  #endif
> +	/* This is the "active utilization" for this runqueue.

Incorrect comment style.

> +	 * Increased when a task wakes up (becomes TASK_RUNNING)
> +	 * and decreased when a task blocks
> +	 */
> +	s64 running_bw;
>  };
>  
>  #ifdef CONFIG_SMP
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 4/8] Improve the tracking of active utilisation
  2016-01-14 15:24 ` [RFC 4/8] Improve the tracking of active utilisation Luca Abeni
@ 2016-01-14 17:16   ` Peter Zijlstra
  2016-01-15  6:48     ` Luca Abeni
  2016-01-14 19:43   ` Peter Zijlstra
  2016-01-14 19:47   ` Peter Zijlstra
  2 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-14 17:16 UTC (permalink / raw)
  To: Luca Abeni; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On Thu, Jan 14, 2016 at 04:24:49PM +0100, Luca Abeni wrote:
> @@ -65,6 +67,62 @@ static void clear_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>  	}
>  }
>  
> +static void task_go_inactive(struct task_struct *p)
> +{
> +	struct sched_dl_entity *dl_se = &p->dl;
> +	struct hrtimer *timer = &dl_se->inactive_timer;
> +	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> +	struct rq *rq = rq_of_dl_rq(dl_rq);
> +	ktime_t now, act;
> +	s64 delta;
> +	u64 zerolag_time;
> +
> +	WARN_ON(dl_se->dl_runtime == 0);
> +
> +	/* If the inactive timer is already armed, return immediately */
> +	if (hrtimer_active(&dl_se->inactive_timer))
> +		return;
> +
> +
> +	/*
> +	 * We want the timer to fire at the "0 lag time", but considering
> +	 * that it is actually coming from rq->clock and not from
> +	 * hrtimer's time base reading.
> +	 */
> +        zerolag_time = dl_se->deadline - div64_long((dl_se->runtime * dl_se->dl_period), dl_se->dl_runtime);

whitespace damage

> @@ -530,9 +587,20 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
>  	 */
>  	if (dl_se->dl_new) {
>  		setup_new_dl_entity(dl_se, pi_se);
> +		add_running_bw(dl_se, dl_rq);
>  		return;
>  	}
>  
> +	/* If the "inactive timer" is still active, stop it and leave
> +	 * the active utilisation unchanged.
> +	 * If it is running, increase the active utilisation
> +	 */
> +	if (hrtimer_active(&dl_se->inactive_timer)) {
> +		hrtimer_try_to_cancel(&dl_se->inactive_timer);

what if cancel fails?

> +	} else {
> +	        add_running_bw(dl_se, dl_rq);
> +	}
> +
>  	if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
>  	    dl_entity_overflow(dl_se, pi_se, rq_clock(rq))) {
>  		dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;

> @@ -1248,8 +1370,6 @@ static void task_fork_dl(struct task_struct *p)
>  static void task_dead_dl(struct task_struct *p)
>  {
>  	struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
> -	struct dl_rq *dl_rq = dl_rq_of_se(&p->dl);
> -	struct rq *rq = rq_of_dl_rq(dl_rq);
>  
>  	/*
>  	 * Since we are TASK_DEAD we won't slip out of the domain!
> @@ -1258,10 +1378,6 @@ static void task_dead_dl(struct task_struct *p)
>  	/* XXX we should retain the bw until 0-lag */
>  	dl_b->total_bw -= p->dl.dl_bw;
>  	raw_spin_unlock_irq(&dl_b->lock);
> -
> -	if (task_on_rq_queued(p)) {
> -		clear_running_bw(&p->dl, &rq->dl);
> -	}

what happens if the timer is still active here? then we get the timer
storage freed while enqueued?

> @@ -1742,12 +1858,12 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
>  	 * SCHED_DEADLINE until the deadline passes, the timer will reset the
>  	 * task.
>  	 */
> -	if (!start_dl_timer(p))
> +	if (task_on_rq_queued(p))
> +		task_go_inactive(p);
> +	if (!hrtimer_active(&p->dl.inactive_timer))
>  		__dl_clear_params(p);
> -
> -	if (task_on_rq_queued(p)) {
> +	else if (!hrtimer_callback_running(&p->dl.inactive_timer))
>  		clear_running_bw(&p->dl, &rq->dl);
> -	}
>  
>  	/*
>  	 * Since this might be the only -deadline task on the rq,

idem, what if the task dies while !dl but with timer pending?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 5/8] Track the "total rq utilisation" too
  2016-01-14 15:24 ` [RFC 5/8] Track the "total rq utilisation" too Luca Abeni
@ 2016-01-14 19:12   ` Peter Zijlstra
  2016-01-15  8:04     ` Luca Abeni
  2016-01-14 19:48   ` Peter Zijlstra
  1 sibling, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-14 19:12 UTC (permalink / raw)
  To: Luca Abeni; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On Thu, Jan 14, 2016 at 04:24:50PM +0100, Luca Abeni wrote:
> +	if (dl_rq->running_bw > dl_rq->this_bw) {
> +		WARN_ON(1);
> +		dl_rq->running_bw = dl_rq->this_bw;
> +	}

FWIW you can write this as:

	if (WARN_ON(dl_rq->running_bw > dl_rq->this_bw))
		dl_rq->running_bw = dl_rq->this_bw;

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 1/8] Track the active utilisation
  2016-01-14 15:24 ` [RFC 1/8] Track the active utilisation Luca Abeni
  2016-01-14 16:49   ` Peter Zijlstra
@ 2016-01-14 19:13   ` Peter Zijlstra
  2016-01-15  8:07     ` Luca Abeni
  1 sibling, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-14 19:13 UTC (permalink / raw)
  To: Luca Abeni; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On Thu, Jan 14, 2016 at 04:24:46PM +0100, Luca Abeni wrote:
> +static void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> +static void clear_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)

I would prefer {add,sub}, I read clear as =0.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 4/8] Improve the tracking of active utilisation
  2016-01-14 15:24 ` [RFC 4/8] Improve the tracking of active utilisation Luca Abeni
  2016-01-14 17:16   ` Peter Zijlstra
@ 2016-01-14 19:43   ` Peter Zijlstra
  2016-01-15  9:27     ` Luca Abeni
  2016-01-19 12:20     ` Luca Abeni
  2016-01-14 19:47   ` Peter Zijlstra
  2 siblings, 2 replies; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-14 19:43 UTC (permalink / raw)
  To: Luca Abeni; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On Thu, Jan 14, 2016 at 04:24:49PM +0100, Luca Abeni wrote:
> This patch implements a more theoretically sound algorithm for
> tracking the active utilisation: instead of decreasing it when a
> task blocks, use a timer (the "inactive timer", named after the
> "Inactive" task state of the GRUB algorithm) to decrease the
> active utilisation at the so-called "0-lag time".

See also the large-ish comment in __setparam_dl().

If we go do proper 0-lag, as GRUB requires, then we might as well use it
for that.

But we need to sort the issue of the task exiting with an armed timer.
The solution suggested there is keeping a task reference with the timer.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 4/8] Improve the tracking of active utilisation
  2016-01-14 15:24 ` [RFC 4/8] Improve the tracking of active utilisation Luca Abeni
  2016-01-14 17:16   ` Peter Zijlstra
  2016-01-14 19:43   ` Peter Zijlstra
@ 2016-01-14 19:47   ` Peter Zijlstra
  2016-01-15  8:10     ` Luca Abeni
  2 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-14 19:47 UTC (permalink / raw)
  To: Luca Abeni; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On Thu, Jan 14, 2016 at 04:24:49PM +0100, Luca Abeni wrote:

> +	if (!hrtimer_is_queued(timer)) {
> +		hrtimer_start(timer, act, HRTIMER_MODE_ABS);
> +	}
> +
> +	if (hrtimer_active(timer) == 0) {
> +		printk("Problem activating inactive_timer!\n");
> +		clear_running_bw(dl_se, dl_rq);
> +		if (!dl_task(p)) {
> +			__dl_clear_params(p);
> +		}
> +	} else {
> +		get_task_struct(p);

Ah, I missed that one. I would suggest putting that right _before_
hrtimer_start(), because hrtimer_start() guarantees the callback will
run.

> +	}

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 5/8] Track the "total rq utilisation" too
  2016-01-14 15:24 ` [RFC 5/8] Track the "total rq utilisation" too Luca Abeni
  2016-01-14 19:12   ` Peter Zijlstra
@ 2016-01-14 19:48   ` Peter Zijlstra
  2016-01-15  6:50     ` Luca Abeni
  1 sibling, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-14 19:48 UTC (permalink / raw)
  To: Luca Abeni; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On Thu, Jan 14, 2016 at 04:24:50PM +0100, Luca Abeni wrote:
> This is the sum of the utilisations of tasks that are assigned to
> a runqueue, independently from their state (TASK_RUNNING or blocked)

Is it actually used?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 6/8] GRUB accounting
  2016-01-14 15:24 ` [RFC 6/8] GRUB accounting Luca Abeni
@ 2016-01-14 19:50   ` Peter Zijlstra
  2016-01-15  8:05     ` Luca Abeni
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-14 19:50 UTC (permalink / raw)
  To: Luca Abeni; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On Thu, Jan 14, 2016 at 04:24:51PM +0100, Luca Abeni wrote:


It would be good to have a really short recap of GRUB in a comment right
about here...

> +u64 grub_reclaim(u64 delta, struct rq *rq, u64 u)
> +{
> +	return (delta * rq->dl.running_bw) >> 20;
> +}

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 7/8] Make GRUB a task's flag
  2016-01-14 15:24 ` [RFC 7/8] Make GRUB a task's flag Luca Abeni
@ 2016-01-14 19:56   ` Peter Zijlstra
  2016-01-15  8:15     ` Luca Abeni
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-14 19:56 UTC (permalink / raw)
  To: Luca Abeni; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On Thu, Jan 14, 2016 at 04:24:52PM +0100, Luca Abeni wrote:
> +++ b/include/uapi/linux/sched.h
> @@ -48,5 +48,6 @@
>   * For the sched_{set,get}attr() calls
>   */
>  #define SCHED_FLAG_RESET_ON_FORK	0x01
> +#define SCHED_FLAG_RECLAIM		0x02

With an eye towards unpriv usage of SCHED_DEADLINE, this isn't something
we could allow unpriv tasks, right? Since (IIRC) GRUB will allow eating
all !deadline time.

Something with an average runtime/budget that also puts limits on the
max (say 2*avg) would be far more amenable to be exposed to unpriv
tasks, except since that would directly result in an average tardiness
bound this might be non-trivial to combine with tasks not opting for
this.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 8/8] Do not reclaim the whole CPU bandwidth
  2016-01-14 15:24 ` [RFC 8/8] Do not reclaim the whole CPU bandwidth Luca Abeni
@ 2016-01-14 19:59   ` Peter Zijlstra
  2016-01-15  8:21     ` Luca Abeni
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-14 19:59 UTC (permalink / raw)
  To: Luca Abeni; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On Thu, Jan 14, 2016 at 04:24:53PM +0100, Luca Abeni wrote:
> Original GRUB tends to reclaim 100% of the CPU time... And this allows a
> "CPU hog" (i.e., a busy loop) to starve non-deadline tasks.
> To address this issue, allow the scheduler to reclaim only a specified
> fraction of CPU time.
> NOTE: the fraction of CPU time that cannot be reclaimed is currently
> hardcoded as (1 << 20) / 10 -> 90%, but it must be made configurable!

So the alternative is an explicit SCHED_OTHER server which is
configurable.

That would maybe fit in nicely with the DL based FIFO/RR servers from
this other pending project.

> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -524,6 +524,10 @@ struct dl_rq {
>  	 * and decreased when a task blocks
>  	 */
>  	s64 running_bw;
> +	/* This is the amount of utilization that GRUB can not
> +         * reclaim (per runqueue)
> +         */
> +	s64 unusable_bw;


Wrong comment style and whitespace challenged.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 1/8] Track the active utilisation
  2016-01-14 16:49   ` Peter Zijlstra
@ 2016-01-15  6:37     ` Luca Abeni
  0 siblings, 0 replies; 58+ messages in thread
From: Luca Abeni @ 2016-01-15  6:37 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On Thu, 14 Jan 2016 17:49:14 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Jan 14, 2016 at 04:24:46PM +0100, Luca Abeni wrote:
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 10f1637..826ca6a 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -519,6 +519,11 @@ struct dl_rq {
> >  #else
> >  	struct dl_bw dl_bw;
> >  #endif
> > +	/* This is the "active utilization" for this runqueue.
> 
> Incorrect comment style.
Oops... Sorry, I did not realize that I was using the wrong style. I am
fixing this locally, along with all the other style issues you pointed out.


			Thanks,
				Luca

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 4/8] Improve the tracking of active utilisation
  2016-01-14 17:16   ` Peter Zijlstra
@ 2016-01-15  6:48     ` Luca Abeni
  0 siblings, 0 replies; 58+ messages in thread
From: Luca Abeni @ 2016-01-15  6:48 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

Hi Peter,


On Thu, 14 Jan 2016 18:16:19 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

[...]
> > +	/* If the "inactive timer" is still active, stop it and leave
> > +	 * the active utilisation unchanged.
> > +	 * If it is running, increase the active utilisation
> > +	 */
> > +	if (hrtimer_active(&dl_se->inactive_timer)) {
> > +		hrtimer_try_to_cancel(&dl_se->inactive_timer);
> 
> what if cancel fails?
Eh, this is a tricky point :)
In this case, the "if (p->state == TASK_RUNNING) {" in
inactive_task_timer() should detect what happened, and avoid decreasing
the active utilization. So, we should be safe... At least, this was my
plan, maybe I missed something.


> > @@ -1248,8 +1370,6 @@ static void task_fork_dl(struct task_struct
> > *p) static void task_dead_dl(struct task_struct *p)
> >  {
> >  	struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
> > -	struct dl_rq *dl_rq = dl_rq_of_se(&p->dl);
> > -	struct rq *rq = rq_of_dl_rq(dl_rq);
> >  
> >  	/*
> >  	 * Since we are TASK_DEAD we won't slip out of the domain!
> > @@ -1258,10 +1378,6 @@ static void task_dead_dl(struct task_struct
> > *p) /* XXX we should retain the bw until 0-lag */
> >  	dl_b->total_bw -= p->dl.dl_bw;
> >  	raw_spin_unlock_irq(&dl_b->lock);
> > -
> > -	if (task_on_rq_queued(p)) {
> > -		clear_running_bw(&p->dl, &rq->dl);
> > -	}
> 
> what happens if the timer is still active here? then we get the timer
> storage freed while enqueued?
I think here (and in the successive comment) we are safe because of the
get_task_struct() you mention in another email, right? Or am I missing
something else?


			Thanks,
				Luca

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 5/8] Track the "total rq utilisation" too
  2016-01-14 19:48   ` Peter Zijlstra
@ 2016-01-15  6:50     ` Luca Abeni
  2016-01-15  8:34       ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Luca Abeni @ 2016-01-15  6:50 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On Thu, 14 Jan 2016 20:48:37 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Jan 14, 2016 at 04:24:50PM +0100, Luca Abeni wrote:
> > This is the sum of the utilisations of tasks that are assigned to
> > a runqueue, independently from their state (TASK_RUNNING or blocked)
> 
> Is it actually used?
Not in this patchset...
It is a possible "cheap" (but less accurate) alternative to the
tracking introduced in patch 4, or it can be used in more advanced
implementations of multi-processor GRUB; this patchset does neither.

So, it can be removed from the patchset; I added it so that people can
see all the possible alternative utilization tracking strategies.


		Thanks,
			Luca

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 5/8] Track the "total rq utilisation" too
  2016-01-14 19:12   ` Peter Zijlstra
@ 2016-01-15  8:04     ` Luca Abeni
  0 siblings, 0 replies; 58+ messages in thread
From: Luca Abeni @ 2016-01-15  8:04 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On 01/14/2016 08:12 PM, Peter Zijlstra wrote:
> On Thu, Jan 14, 2016 at 04:24:50PM +0100, Luca Abeni wrote:
>> +	if (dl_rq->running_bw > dl_rq->this_bw) {
>> +		WARN_ON(1);
>> +		dl_rq->running_bw = dl_rq->this_bw;
>> +	}
>
> FWIW you can write this as:
>
> 	if (WARN_ON(dl_rq->running_bw > dl_rq->this_bw))
> 		dl_rq->running_bw = dl_rq->this_bw;
Ah, thanks! I did not know that WARN_ON() returns a value...
This looks much nicer; I am changing it this way locally.


			Thanks,
				Luca

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 6/8] GRUB accounting
  2016-01-14 19:50   ` Peter Zijlstra
@ 2016-01-15  8:05     ` Luca Abeni
  0 siblings, 0 replies; 58+ messages in thread
From: Luca Abeni @ 2016-01-15  8:05 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On 01/14/2016 08:50 PM, Peter Zijlstra wrote:
> On Thu, Jan 14, 2016 at 04:24:51PM +0100, Luca Abeni wrote:
>
>
> It would be good to have a really short recap of GRUB in a comment right
> about here...
Ok; I'll add it


			Thanks,
				Luca

>
>> +u64 grub_reclaim(u64 delta, struct rq *rq, u64 u)
>> +{
>> +	return (delta * rq->dl.running_bw) >> 20;
>> +}

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 1/8] Track the active utilisation
  2016-01-14 19:13   ` Peter Zijlstra
@ 2016-01-15  8:07     ` Luca Abeni
  0 siblings, 0 replies; 58+ messages in thread
From: Luca Abeni @ 2016-01-15  8:07 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On 01/14/2016 08:13 PM, Peter Zijlstra wrote:
> On Thu, Jan 14, 2016 at 04:24:46PM +0100, Luca Abeni wrote:
>> +static void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>> +static void clear_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>
> I would prefer {add,sub}, I read clear as =0.
I agree; I am going to change it.


			Thanks,
				Luca

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 4/8] Improve the tracking of active utilisation
  2016-01-14 19:47   ` Peter Zijlstra
@ 2016-01-15  8:10     ` Luca Abeni
  2016-01-15  8:32       ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Luca Abeni @ 2016-01-15  8:10 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On 01/14/2016 08:47 PM, Peter Zijlstra wrote:
> On Thu, Jan 14, 2016 at 04:24:49PM +0100, Luca Abeni wrote:
>
>> +	if (!hrtimer_is_queued(timer)) {
>> +		hrtimer_start(timer, act, HRTIMER_MODE_ABS);
>> +	}
>> +
>> +	if (hrtimer_active(timer) == 0) {
>> +		printk("Problem activating inactive_timer!\n");
>> +		clear_running_bw(dl_se, dl_rq);
>> +		if (!dl_task(p)) {
>> +			__dl_clear_params(p);
>> +		}
>> +	} else {
>> +		get_task_struct(p);
>
> Ah, I missed that one. I would suggest putting that right _before_
> hrtimer_start(), because hrtimer_start() guarantees the callback will
> run.
Ok. So, if I understand correctly, the "if (hrtimer_active(timer) == 0)"
check is useless (or should somehow be revised)... Right?



			Thanks,
				Luca

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 7/8] Make GRUB a task's flag
  2016-01-14 19:56   ` Peter Zijlstra
@ 2016-01-15  8:15     ` Luca Abeni
  2016-01-15  8:41       ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Luca Abeni @ 2016-01-15  8:15 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On 01/14/2016 08:56 PM, Peter Zijlstra wrote:
> On Thu, Jan 14, 2016 at 04:24:52PM +0100, Luca Abeni wrote:
>> +++ b/include/uapi/linux/sched.h
>> @@ -48,5 +48,6 @@
>>    * For the sched_{set,get}attr() calls
>>    */
>>   #define SCHED_FLAG_RESET_ON_FORK	0x01
>> +#define SCHED_FLAG_RECLAIM		0x02
>
> With an eye towards unpriv usage of SCHED_DEADLINE, this isn't something
> we could allow unpriv tasks, right? Since (IIRC) GRUB will allow eating
> all !deadline time.
Yes, the original algorithm allowed deadline tasks to use 100% of the CPU
time, starving all !deadline tasks.
But in this patchset I modified the algorithm to reclaim only a fraction
U_max of the CPU time... So, 1 - U_max can be left free for !deadline tasks.


> Something with an average runtime/budget that also puts limits on the
> max (say 2*avg) would be far more amenable to be exposed to unpriv
> tasks, except since that would directly result in an average tardiness
> bound this might be non-trivial to combine with tasks not opting for
> this.
I'll try to think about this... The advantage of GRUB is that a theoretically
sound algorithm already existed; here, we would need to design the algorithm
so that it does not break the SCHED_DEADLINE guarantees. Anyway, this is an
interesting challenge, I'll work on it :)



			Thanks,
				Luca

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 8/8] Do not reclaim the whole CPU bandwidth
  2016-01-14 19:59   ` Peter Zijlstra
@ 2016-01-15  8:21     ` Luca Abeni
  2016-01-15  8:50       ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Luca Abeni @ 2016-01-15  8:21 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On 01/14/2016 08:59 PM, Peter Zijlstra wrote:
> On Thu, Jan 14, 2016 at 04:24:53PM +0100, Luca Abeni wrote:
>> Original GRUB tends to reclaim 100% of the CPU time... And this allows a
>> "CPU hog" (i.e., a busy loop) to starve non-deadline tasks.
>> To address this issue, allow the scheduler to reclaim only a specified
>> fraction of CPU time.
>> NOTE: the fraction of CPU time that cannot be reclaimed is currently
>> hardcoded as (1 << 20) / 10 -> 90%, but it must be made configurable!
>
> So the alternative is an explicit SCHED_OTHER server which is
> configurable.
Yes, I have thought about something similar: it is the strategy I
implemented in my first CBS/GRUB scheduler (with the "old" 2.4 scheduler,
this was easier :)).
But I think the solution implemented in this patch is much simpler (it
just requires a very small modification to grub_reclaim()) and is more
elegant from the theoretical point of view.


> That would maybe fit in nicely with the DL based FIFO/RR servers from
> this other pending project.
Yes, this reminds me of the half-finished patch for RT throttling using
SCHED_DEADLINE... But that patch needs much more work, IMHO.


				Thanks,
					Luca
>
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -524,6 +524,10 @@ struct dl_rq {
>>   	 * and decreased when a task blocks
>>   	 */
>>   	s64 running_bw;
>> +	/* This is the amount of utilization that GRUB can not
>> +         * reclaim (per runqueue)
>> +         */
>> +	s64 unusable_bw;
>
>
> Wrong comment style and whitespace challenged.
>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 4/8] Improve the tracking of active utilisation
  2016-01-15  8:10     ` Luca Abeni
@ 2016-01-15  8:32       ` Peter Zijlstra
  0 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-15  8:32 UTC (permalink / raw)
  To: Luca Abeni; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On Fri, Jan 15, 2016 at 09:10:15AM +0100, Luca Abeni wrote:
> On 01/14/2016 08:47 PM, Peter Zijlstra wrote:
> >On Thu, Jan 14, 2016 at 04:24:49PM +0100, Luca Abeni wrote:
> >
> >>+	if (!hrtimer_is_queued(timer)) {
> >>+		hrtimer_start(timer, act, HRTIMER_MODE_ABS);
> >>+	}
> >>+
> >>+	if (hrtimer_active(timer) == 0) {
> >>+		printk("Problem activating inactive_timer!\n");
> >>+		clear_running_bw(dl_se, dl_rq);
> >>+		if (!dl_task(p)) {
> >>+			__dl_clear_params(p);
> >>+		}
> >>+	} else {
> >>+		get_task_struct(p);
> >
> >Ah, I missed that one. I would suggest putting that right _before_
> >hrtimer_start(), because hrtimer_start() guarantees the callback will
> >run.

> Ok. So, if I understand well, the "if (hrtimer_active(timer) == 0)" check
> is useless (or should be somehow revised)... Right?

Yes, ever since: c6eb3f70d448 ("hrtimer: Get rid of hrtimer softirq")
hrtimer_start() is guaranteed to work and result in a callback, even if
the time is in the past.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 5/8] Track the "total rq utilisation" too
  2016-01-15  6:50     ` Luca Abeni
@ 2016-01-15  8:34       ` Peter Zijlstra
  2016-01-15  9:15         ` Luca Abeni
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-15  8:34 UTC (permalink / raw)
  To: Luca Abeni; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On Fri, Jan 15, 2016 at 07:50:49AM +0100, Luca Abeni wrote:
> On Thu, 14 Jan 2016 20:48:37 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Thu, Jan 14, 2016 at 04:24:50PM +0100, Luca Abeni wrote:
> > > This is the sum of the utilisations of tasks that are assigned to
> > > a runqueue, independently from their state (TASK_RUNNING or blocked)
> > 
> > Is it actually used?
> Not in this patchset...
> It is a possible "cheap" (but less accurate) alternative to the
> tracking introduced in patch 4. Or can be used in more advanced
> implementations of multi-processor GRUB, but not in this patchset.
> 
> So, it can be removed from the patchset; I added it so that people can
> see all the possible alternative utilization tracking strategies.

OK, so that might've been useful text for the changelog. But given that,
maybe leave it out for now.

BTW, have you got a paper on smp grub?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 7/8] Make GRUB a task's flag
  2016-01-15  8:15     ` Luca Abeni
@ 2016-01-15  8:41       ` Peter Zijlstra
  2016-01-15  9:08         ` Luca Abeni
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-15  8:41 UTC (permalink / raw)
  To: Luca Abeni; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On Fri, Jan 15, 2016 at 09:15:39AM +0100, Luca Abeni wrote:
> On 01/14/2016 08:56 PM, Peter Zijlstra wrote:

> >Something with an average runtime/budget that also puts limits on the
> >max (say 2*avg) would be far more amenable to be exposed to unpriv
> >tasks, except since that would directly result in an average tardiness
> >bound this might be non-trivial to combine with tasks not opting for
> >this.

> I'll try to think about this... The advantage of GRUB is that a theoretically
> sound algorithm already existed; here, we would need to design the algorithm
> so that it does not break the SCHED_DEADLINE guarantees. Anyway, this is an
> interesting challenge, I'll work on it :)

Didn't Baruah and Jim do the whole theory on statistical EDF? Which
shows that if you use a statistical budget the combined distribution
transfers to the tardiness. With stdev=0 for the budgets this trivially
collapses to the regular EDF, since then the combined distribution is
also stdev=0 and you get 0 tardiness (on UP).

But yes, combining the two into one scheduler is 'interesting'. I was
thinking it would be possible with least-laxity-first, since you can
assign the hard (stdev=0) tasks a tighter laxity bound.

But LLF is horrendously painful to implement IIRC.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC 8/8] Do not reclaim the whole CPU bandwidth
  2016-01-15  8:21     ` Luca Abeni
@ 2016-01-15  8:50       ` Peter Zijlstra
  2016-01-15  9:49         ` Luca Abeni
  2016-01-26 12:52         ` luca abeni
  0 siblings, 2 replies; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-15  8:50 UTC (permalink / raw)
  To: Luca Abeni; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On Fri, Jan 15, 2016 at 09:21:17AM +0100, Luca Abeni wrote:
> On 01/14/2016 08:59 PM, Peter Zijlstra wrote:
> >On Thu, Jan 14, 2016 at 04:24:53PM +0100, Luca Abeni wrote:
> >>Original GRUB tends to reclaim 100% of the CPU time... And this allows a
> >>"CPU hog" (i.e., a busy loop) to starve non-deadline tasks.
> >>To address this issue, allow the scheduler to reclaim only a specified
> >>fraction of CPU time.
> >>NOTE: the fraction of CPU time that cannot be reclaimed is currently
> >>hardcoded as (1 << 20) / 10 -> 90%, but it must be made configurable!
> >
> >So the alternative is an explicit SCHED_OTHER server which is
> >configurable.
> Yes, I have thought about something similar; actually, this is the strategy
> I implemented in my first CBS/GRUB scheduler (with the "old" 2.4 scheduler,
> this was easier :).
> But I think the solution I implemented in this patch is much simpler (it
> just requires a very simple modification to grub_reclaim()) and is more
> elegant from the theoretical point of view.

It is certainly simpler, agreed.

The trouble is with interfaces. Once we expose them we're stuck with
them. And from that POV I think an explicit SCHED_OTHER server (or a
minimum budget for a slack time scheme) makes more sense.

It provides this same information while also providing more benefit, no?

> >That would maybe fit in nicely with the DL based FIFO/RR servers from
> >this other pending project.
> Yes, this reminds me about the half-finished patch for RT throttling using
> SCHED_DEADLINE... But that patch needs much more work IMHO.

IIRC two years ago at RTLWS there was a presentation claiming that the SMP
issues were 'solved' and that the patches would be posted 'soon'.


* Re: [RFC 7/8] Make GRUB a task's flag
  2016-01-15  8:41       ` Peter Zijlstra
@ 2016-01-15  9:08         ` Luca Abeni
  0 siblings, 0 replies; 58+ messages in thread
From: Luca Abeni @ 2016-01-15  9:08 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On Fri, 15 Jan 2016 09:41:50 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Jan 15, 2016 at 09:15:39AM +0100, Luca Abeni wrote:
> > On 01/14/2016 08:56 PM, Peter Zijlstra wrote:
> 
> > >Something with an average runtime/budget that also puts limits on
> > >the max (say 2*avg) would be far more amenable to be exposed to
> > >unpriv tasks, except since that would directly result in an
> > >average tardiness bound this might be non-trivial to combine with
> > >tasks not opting for this.
> 
> > I'll try to think about this... The advantage of GRUB is that a
> > theoretically sound algorithm already existed; here, we would need
> > to design the algorithm so that it does not break the
> > SCHED_DEADLINE guarantees. Anyway, this is an interesting
> > challenge, I'll work on it :)
> 
> Didn't Baruah and Jim do the whole theory on statistical EDF? Which
> shows that if you use a statistical budget the combined distribution
> transfers to the tardiness. With stdev=0 for the budgets this
> trivially collapses to the regular EDF, since then the combined
> distribution is also stdev=0 and you get 0 tardiness (on UP).
I remember a paper by Anderson, but it was slightly different from
this... Maybe I am remembering the wrong paper... I'll check it again.

> But yes, combining the two into one scheduler is 'interesting'. I was
> thinking it would be possible with least-laxity-first, since you can
> assign the hard (stdev=0) tasks a tighter laxity bound.
> 
> But LLF is horrendously painful to implement IIRC.
LLF would be interesting (also because it would help in implementing an
optimal SMP scheduler), but yes, it is not simple to implement (at
least, as far as I remember). I had a student working on it, who
implemented some kind of LLF approximation, but I still have to clean up
the code and see how useful it can be.



			Thanks,
				Luca


* Re: [RFC 5/8] Track the "total rq utilisation" too
  2016-01-15  8:34       ` Peter Zijlstra
@ 2016-01-15  9:15         ` Luca Abeni
  2016-01-29 15:06           ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Luca Abeni @ 2016-01-15  9:15 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On Fri, 15 Jan 2016 09:34:00 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Jan 15, 2016 at 07:50:49AM +0100, Luca Abeni wrote:
> > On Thu, 14 Jan 2016 20:48:37 +0100
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > On Thu, Jan 14, 2016 at 04:24:50PM +0100, Luca Abeni wrote:
> > > > This is the sum of the utilisations of tasks that are assigned
> > > > to a runqueue, independently from their state (TASK_RUNNING or
> > > > blocked)
> > > 
> > > Is it actually used?
> > Not in this patchset...
> > It is a possible "cheap" (but less accurate) alternative to the
> > tracking introduced in patch 4. Or can be used in more advanced
> > implementations of multi-processor GRUB, but not in this patchset.
> > 
> > So, it can be removed from the patchset; I added it so that people
> > can see all the possible alternative utilization tracking
> > strategies.
> 
> OK, so that might've been useful text for the changelog. But given
> that, maybe leave it out for now.
> 
> BTW, have you got a paper on smp grub?
The one mentioned in the cover letter describes the implementation I
posted:
http://disi.unitn.it/~abeni/reclaiming/rtlws14-grub.pdf

There is also a newer paper, that will be published at ACM SAC 2016
(so, it is not available yet), but is based on this technical report:
http://arxiv.org/abs/1512.01984
This second paper describes some more complex algorithms (easily
implementable over this patchset) that are able to guarantee hard
schedulability for SCHED_DEADLINE tasks with reclaiming on SMP.



				Luca


* Re: [RFC 4/8] Improve the tracking of active utilisation
  2016-01-14 19:43   ` Peter Zijlstra
@ 2016-01-15  9:27     ` Luca Abeni
  2016-01-19 12:20     ` Luca Abeni
  1 sibling, 0 replies; 58+ messages in thread
From: Luca Abeni @ 2016-01-15  9:27 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On Thu, 14 Jan 2016 20:43:23 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Jan 14, 2016 at 04:24:49PM +0100, Luca Abeni wrote:
> > This patch implements a more theoretically sound algorithm for
> > tracking the active utilisation: instead of decreasing it when a
> > task blocks, use a timer (the "inactive timer", named after the
> > "Inactive" task state of the GRUB algorithm) to decrease the
> > active utilisation at the so-called "0-lag time".
> 
> See also the large-ish comment in __setparam_dl().
> 
> If we go do proper 0-lag, as GRUB requires, then we might as well use
> it for that.
Yes, I initially tried to do this, but I found some issues (I do not
remember the details, but I think they were related to tasks moving from
SCHED_DEADLINE to SCHED_OTHER, and then migrating to some other
runqueue while SCHED_OTHER but before the 0-lag time).

I'll search my notes for this issue in the next days and check
again (maybe when I wrote this code I was just misunderstanding
something).


			Luca

> 
> But we need to sort the issue of the task exiting with an armed timer.
> The solution suggested there is keeping a task reference with the
> timer.


* Re: [RFC 8/8] Do not reclaim the whole CPU bandwidth
  2016-01-15  8:50       ` Peter Zijlstra
@ 2016-01-15  9:49         ` Luca Abeni
  2016-01-26 12:52         ` luca abeni
  1 sibling, 0 replies; 58+ messages in thread
From: Luca Abeni @ 2016-01-15  9:49 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On Fri, 15 Jan 2016 09:50:04 +0100
Peter Zijlstra <peterz@infradead.org> wrote:
[...]
> The trouble is with interfaces. Once we expose them we're stuck with
> them. And from that POV I think an explicit SCHED_OTHER server (or a
> minimum budget for a slack time scheme) makes more sense.
> 
> It provides this same information while also providing more benefit,
> no?
From an interface point of view, I agree.

> > >That would maybe fit in nicely with the DL based FIFO/RR servers
> > >from this other pending project.
> > Yes, this reminds me about the half-finished patch for RT
> > throttling using SCHED_DEADLINE... But that patch needs much more
> > work IMHO.
> 
> IIRC two years ago at RTLWS there was a presentation claiming that the
> SMP issues were 'solved' and that the patches would be posted 'soon'.
Do you mean this paper?
http://retis.sssup.it/~nino/publication/rtlws14bdm.pdf

I started from that patch, and I have something that "basically works",
but I am still discussing some theoretical and implementation issues
with the paper's authors.



				Luca


* Re: [RFC 0/8] CPU reclaiming for SCHED_DEADLINE
  2016-01-14 15:24 [RFC 0/8] CPU reclaiming for SCHED_DEADLINE Luca Abeni
                   ` (7 preceding siblings ...)
  2016-01-14 15:24 ` [RFC 8/8] Do not reclaim the whole CPU bandwidth Luca Abeni
@ 2016-01-19 10:11 ` Juri Lelli
  2016-01-19 11:50   ` Luca Abeni
  8 siblings, 1 reply; 58+ messages in thread
From: Juri Lelli @ 2016-01-19 10:11 UTC (permalink / raw)
  To: Luca Abeni; +Cc: linux-kernel, Peter Zijlstra, Ingo Molnar

On 14/01/16 16:24, Luca Abeni wrote:
> Hi all,
> 

Hi Luca,

thanks a lot for posting these patches, it is something that we need to
have at this point, IMHO.

I'll try to do some more testing hopefully soon, but, if you already
addressed Peter's comments and want to post a v2, please don't wait for
me. I'll try to test and review the next version. Let me see if I'm able
to set up that testing in the meantime :).

Best,

- Juri

> this patchset implements CPU reclaiming (using the GRUB algorithm[1])
> for SCHED_DEADLINE: basically, this feature allows SCHED_DEADLINE tasks
> to consume more than their reserved runtime, up to a maximum fraction
> of the CPU time (so that other tasks are left some spare CPU time to
> execute), if this does not break the guarantees of other SCHED_DEADLINE
> tasks.
> 
> I send this RFC because I think the code still needs some work and/or
> cleanups (or maybe the patches should be split or merged in a different
> way), but I'd like to check if there is interest in merging this feature
> and if the current implementation strategy is reasonable.
> 
> I added in cc the usual people interested in SCHED_DEADLINE patches; if
> you think that I should have added someone else, let me know (or please
> forward these patches to interested people).
> 
> The implemented CPU reclaiming algorithm is based on tracking the
> utilization U_act of active tasks (first 5 patches), and modifying the
> runtime accounting rule (see patch 0006). The original GRUB algorithm is
> modified as described in [2] to support multiple CPUs (the original
> algorithm only considered one single CPU, this one tracks U_act per
> runqueue) and to leave an "unreclaimable" fraction of CPU time to non
> SCHED_DEADLINE tasks (the original algorithm can consume 100% of the CPU
> time, starving all the other tasks).
> 
> I tried to split the patches so that the whole patchset can be better
> understood; if they should be organized in a different way, let me know.
> The first 5 patches (tracking of per-runqueue active utilization) can
> be useful for frequency scaling too (the tracked "active utilization"
> gives a clear hint about how much the core speed can be reduced without
> compromising the SCHED_DEADLINE guarantees):
> - patches 0001 and 0002 implement a simple tracking of the active
>   utilization that is too optimistic from the theoretical point of
>   view
> - patch 0003 is mainly useful for debugging this patchset and can
>   be removed without problems
> - patch 0004 implements the "active utilization" tracking algorithm
>   described in [1,2]. It uses a timer (named "inactive timer" here) to
>   decrease U_act at the correct time (I called it the "0-lag time").
>   I am working on an alternative implementation that does not use
>   additional timers, but it is not ready yet; I'll post it when ready
>   and tested
> - patch 0005 tracks the utilization of the tasks that can execute on
>   each runqueue. It is a pessimistic approximation of U_act (so, if
>   used instead of U_act it allows reclaiming less CPU time, but does
>   not break SCHED_DEADLINE guarantees)
> - patches 0006-0008 implement the reclaiming algorithm.
> 
> [1] http://retis.sssup.it/~lipari/papers/lipariBaruah2000.pdf
> [2] http://disi.unitn.it/~abeni/reclaiming/rtlws14-grub.pdf
> 
> 
> 
> Juri Lelli (1):
>   sched/deadline: add some tracepoints
> 
> Luca Abeni (7):
>   Track the active utilisation
>   Correctly track the active utilisation for migrating tasks
>   Improve the tracking of active utilisation
>   Track the "total rq utilisation" too
>   GRUB accounting
>   Make GRUB a task's flag
>   Do not reclaim the whole CPU bandwidth
> 
>  include/linux/sched.h        |   1 +
>  include/trace/events/sched.h |  69 ++++++++++++++
>  include/uapi/linux/sched.h   |   1 +
>  kernel/sched/core.c          |   3 +-
>  kernel/sched/deadline.c      | 214 +++++++++++++++++++++++++++++++++++++++++--
>  kernel/sched/sched.h         |  12 +++
>  6 files changed, 292 insertions(+), 8 deletions(-)
> 
> -- 
> 1.9.1
> 
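[Editorial aside: the modified runtime accounting rule summarised in the
quoted cover letter (patch 0006) can be sketched as follows. This is an
illustration of the idea with hypothetical names, not code from the
patches; the fixed-point convention (1.0 == 1 << 20) is the one
mentioned in patch 0008.]

```c
#include <assert.h>

#define BW_SHIFT 20			/* fixed point: 1.0 == 1 << 20 */
#define BW_UNIT  (1UL << BW_SHIFT)

/*
 * Illustrative sketch: plain CBS accounting charges a running task
 * dq = -dt for dt units of execution; GRUB instead charges
 * dq = -U_act * dt, where U_act is the tracked active utilisation of
 * the runqueue.  When U_act < 1 the budget depletes more slowly,
 * which is how unused bandwidth gets reclaimed.
 */
static long long grub_charge(long long delta_exec, unsigned long u_act)
{
	return (delta_exec * (long long)u_act) >> BW_SHIFT;
}
```

With u_act == BW_UNIT (a fully loaded runqueue) this degenerates to
plain CBS accounting; with u_act == BW_UNIT / 2, a task is charged only
half of the time it actually ran.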


* Re: [RFC 0/8] CPU reclaiming for SCHED_DEADLINE
  2016-01-19 10:11 ` [RFC 0/8] CPU reclaiming for SCHED_DEADLINE Juri Lelli
@ 2016-01-19 11:50   ` Luca Abeni
  0 siblings, 0 replies; 58+ messages in thread
From: Luca Abeni @ 2016-01-19 11:50 UTC (permalink / raw)
  To: Juri Lelli; +Cc: linux-kernel, Peter Zijlstra, Ingo Molnar

On 01/19/2016 11:11 AM, Juri Lelli wrote:
> On 14/01/16 16:24, Luca Abeni wrote:
>> Hi all,
>>
>
> Hi Luca,
>
> thanks a lot for posting these patches, it is something that we need to
> have at this point, IMHO.
>
> I'll try to do some more testing hopefully soon, but, if you already
> addressed Peter's comments and want to post a v2, please don't wait for
> me. I'll try to test and review the next version. Let me see if I'm able
> to set up that testing in the meantime :).
Thanks Juri; I'll work on Peter's comments in the next days, and I'll
post a v2 of the RFC, probably in the first days of February.


			Thanks,
				Luca

>
> Best,
>
> - Juri
>
>> this patchset implements CPU reclaiming (using the GRUB algorithm[1])
>> for SCHED_DEADLINE: basically, this feature allows SCHED_DEADLINE tasks
>> to consume more than their reserved runtime, up to a maximum fraction
>> of the CPU time (so that other tasks are left some spare CPU time to
>> execute), if this does not break the guarantees of other SCHED_DEADLINE
>> tasks.
>>
>> I send this RFC because I think the code still needs some work and/or
>> cleanups (or maybe the patches should be split or merged in a different
>> way), but I'd like to check if there is interest in merging this feature
>> and if the current implementation strategy is reasonable.
>>
>> I added in cc the usual people interested in SCHED_DEADLINE patches; if
>> you think that I should have added someone else, let me know (or please
>> forward these patches to interested people).
>>
>> The implemented CPU reclaiming algorithm is based on tracking the
>> utilization U_act of active tasks (first 5 patches), and modifying the
>> runtime accounting rule (see patch 0006). The original GRUB algorithm is
>> modified as described in [2] to support multiple CPUs (the original
>> algorithm only considered one single CPU, this one tracks U_act per
>> runqueue) and to leave an "unreclaimable" fraction of CPU time to non
>> SCHED_DEADLINE tasks (the original algorithm can consume 100% of the CPU
>> time, starving all the other tasks).
>>
>> I tried to split the patches so that the whole patchset can be better
>> understood; if they should be organized in a different way, let me know.
>> The first 5 patches (tracking of per-runqueue active utilization) can
>> be useful for frequency scaling too (the tracked "active utilization"
>> gives a clear hint about how much the core speed can be reduced without
>> compromising the SCHED_DEADLINE guarantees):
>> - patches 0001 and 0002 implement a simple tracking of the active
>>    utilization that is too optimistic from the theoretical point of
>>    view
>> - patch 0003 is mainly useful for debugging this patchset and can
>>    be removed without problems
>> - patch 0004 implements the "active utilization" tracking algorithm
>>    described in [1,2]. It uses a timer (named "inactive timer" here) to
>>    decrease U_act at the correct time (I called it the "0-lag time").
>>    I am working on an alternative implementation that does not use
>>    additional timers, but it is not ready yet; I'll post it when ready
>>    and tested
>> - patch 0005 tracks the utilization of the tasks that can execute on
>>    each runqueue. It is a pessimistic approximation of U_act (so, if
>>    used instead of U_act it allows reclaiming less CPU time, but does
>>    not break SCHED_DEADLINE guarantees)
>> - patches 0006-0008 implement the reclaiming algorithm.
>>
>> [1] http://retis.sssup.it/~lipari/papers/lipariBaruah2000.pdf
>> [2] http://disi.unitn.it/~abeni/reclaiming/rtlws14-grub.pdf
>>
>>
>>
>> Juri Lelli (1):
>>    sched/deadline: add some tracepoints
>>
>> Luca Abeni (7):
>>    Track the active utilisation
>>    Correctly track the active utilisation for migrating tasks
>>    Improve the tracking of active utilisation
>>    Track the "total rq utilisation" too
>>    GRUB accounting
>>    Make GRUB a task's flag
>>    Do not reclaim the whole CPU bandwidth
>>
>>   include/linux/sched.h        |   1 +
>>   include/trace/events/sched.h |  69 ++++++++++++++
>>   include/uapi/linux/sched.h   |   1 +
>>   kernel/sched/core.c          |   3 +-
>>   kernel/sched/deadline.c      | 214 +++++++++++++++++++++++++++++++++++++++++--
>>   kernel/sched/sched.h         |  12 +++
>>   6 files changed, 292 insertions(+), 8 deletions(-)
>>
>> --
>> 1.9.1
>>


* Re: [RFC 4/8] Improve the tracking of active utilisation
  2016-01-14 19:43   ` Peter Zijlstra
  2016-01-15  9:27     ` Luca Abeni
@ 2016-01-19 12:20     ` Luca Abeni
  2016-01-19 13:47       ` Peter Zijlstra
  1 sibling, 1 reply; 58+ messages in thread
From: Luca Abeni @ 2016-01-19 12:20 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

Hi Peter,

On 01/14/2016 08:43 PM, Peter Zijlstra wrote:
> On Thu, Jan 14, 2016 at 04:24:49PM +0100, Luca Abeni wrote:
>> This patch implements a more theoretically sound algorithm for
> >> tracking the active utilisation: instead of decreasing it when a
> >> task blocks, use a timer (the "inactive timer", named after the
> >> "Inactive" task state of the GRUB algorithm) to decrease the
> >> active utilisation at the so-called "0-lag time".
>
> See also the large-ish comment in __setparam_dl().
>
> If we go do proper 0-lag, as GRUB requires, then we might as well use it
> for that.
Just to check if I understand correctly:
I would need to remove "dl_b->total_bw -= p->dl.dl_bw;" from task_dead_dl(),
and __dl_clear() from "else if (!dl_policy(policy) && task_has_dl_policy(p))"
in dl_overflow(). Then, arm the inactive_timer in these cases, and add the
__dl_clear() in the "if (!dl_task(p))" in inactive_task_timer()... Right?

If this understanding is correct (modulo some details that I'll figure out
during testing), I'll try this.

In theory, the inactive_timer would be the right place to also decrease
the active utilisation when a task switches from SCHED_DEADLINE to something
else... But this is problematic if the task migrates after switching from
SCHED_DEADLINE and before the timer fires.



			Thanks,
				Luca
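[Editorial aside: the "0-lag time" discussed in this subthread can be
computed from the CBS parameters. A hedged sketch with hypothetical
names, not the patchset's code, assuming utilisation
U = dl_runtime / dl_period:]

```c
#include <assert.h>

/*
 * Sketch: the lag of a CBS task reaches zero at
 *
 *     t0 = deadline - remaining_runtime / U
 *        = deadline - remaining_runtime * dl_period / dl_runtime
 *
 * which is when the "inactive timer" should fire and the active
 * utilisation can safely be decreased.
 */
static long long zero_lag_time(long long deadline, long long runtime,
			       long long dl_runtime, long long dl_period)
{
	return deadline - (runtime * dl_period) / dl_runtime;
}
```

For example, a task blocking halfway through its budget (runtime 15
remaining, dl_runtime 30, dl_period 100) reaches 0-lag 50 time units
before its deadline.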


* Re: [RFC 4/8] Improve the tracking of active utilisation
  2016-01-19 12:20     ` Luca Abeni
@ 2016-01-19 13:47       ` Peter Zijlstra
  2016-01-27 13:36         ` Luca Abeni
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-19 13:47 UTC (permalink / raw)
  To: Luca Abeni; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On Tue, Jan 19, 2016 at 01:20:13PM +0100, Luca Abeni wrote:
> Hi Peter,
> 
> On 01/14/2016 08:43 PM, Peter Zijlstra wrote:
> >On Thu, Jan 14, 2016 at 04:24:49PM +0100, Luca Abeni wrote:
> >>This patch implements a more theoretically sound algorithm for
> >>tracking the active utilisation: instead of decreasing it when a
> >>task blocks, use a timer (the "inactive timer", named after the
> >>"Inactive" task state of the GRUB algorithm) to decrease the
> >>active utilisation at the so-called "0-lag time".
> >
> >See also the large-ish comment in __setparam_dl().
> >
> >If we go do proper 0-lag, as GRUB requires, then we might as well use it
> >for that.
> Just to check if I understand correctly:
> I would need to remove "dl_b->total_bw -= p->dl.dl_bw;" from task_dead_dl(),
> and __dl_clear() from "else if (!dl_policy(policy) && task_has_dl_policy(p))"
> in dl_overflow(). Then, arm the inactive_timer in these cases, and add the
> __dl_clear() in the "if (!dl_task(p))" in inactive_task_timer()... Right?

Correct.

> If this understanding is correct (modulo some details that I'll figure out
> during testing), I'll try this.

Yes, there's bound to be 'fun' details..

> In theory, the inactive_timer would be the right place to also decrease
> the active utilisation when a task switches from SCHED_DEADLINE to something
> else... But this is problematic if the task migrates after switching from
> SCHED_DEADLINE and before the timer fires.

urgh, yes.. details :-)


* Re: [RFC 8/8] Do not reclaim the whole CPU bandwidth
  2016-01-15  8:50       ` Peter Zijlstra
  2016-01-15  9:49         ` Luca Abeni
@ 2016-01-26 12:52         ` luca abeni
  2016-01-27 14:44           ` Peter Zijlstra
  1 sibling, 1 reply; 58+ messages in thread
From: luca abeni @ 2016-01-26 12:52 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

Hi Peter,

On Fri, 15 Jan 2016 09:50:04 +0100
Peter Zijlstra <peterz@infradead.org> wrote:
[...]
> > >>NOTE: the fraction of CPU time that cannot be reclaimed is
> > >>currently hardcoded as (1 << 20) / 10 -> 90%, but it must be made
> > >>configurable!
> > >
> > >So the alternative is an explicit SCHED_OTHER server which is
> > >configurable.
> > Yes, I have thought about something similar; actually, this is the
> > strategy I implemented in my first CBS/GRUB scheduler (with the
> > "old" 2.4 scheduler, this was easier :).
> > But I think the solution I implemented in this patch is much
> > simpler (it just requires a very simple modification to
> > grub_reclaim()) and is more elegant from the theoretical point of
> > view.
> 
> It is certainly simpler, agreed.
> 
> The trouble is with interfaces. Once we expose them we're stuck with
> them. And from that POV I think an explicit SCHED_OTHER server (or a
> minimum budget for a slack time scheme) makes more sense.
I am trying to work on this.
Which kind of interface is better for this? Would adding something like
/proc/sys/kernel/sched_other_period_us
/proc/sys/kernel/sched_other_runtime_us
be ok?

If this is ok, I'll add these two procfs files, and store
(sched_other_runtime / sched_other_period) << 20 in the runqueue field
which represents the unreclaimable utilization (implementing
hierarchical SCHED_DEADLINE/CFS scheduling right now is too complex for
this patchset... But if the exported interface is ok, it can be
implemented later).

Is this approach acceptable? Or am I misunderstanding your comment?



			Thanks,
				Luca
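[Editorial aside: the conversion described above, storing
(runtime / period) << 20 per runqueue, would presumably look something
like the following. The sysctl names are the ones proposed in this
message and the helper is a sketch (mainline has a similar to_ratio()
helper, but treat the details here as an assumption).]

```c
#include <assert.h>

#define BW_SHIFT 20			/* fixed point: 1.0 == 1 << 20 */

/*
 * Sketch: turn the proposed sched_other_runtime_us /
 * sched_other_period_us sysctl pair into the fixed-point unreclaimable
 * utilisation stored in the runqueue.  Shift before dividing so that
 * integer division does not truncate the ratio to zero.
 */
static unsigned long unreclaimable_bw(unsigned long long runtime_us,
				      unsigned long long period_us)
{
	if (!period_us)
		return 0;
	return (unsigned long)((runtime_us << BW_SHIFT) / period_us);
}
```

E.g. reserving 100 ms out of every 1 s for non-deadline tasks yields
roughly (1 << 20) / 10, matching the 10% currently hardcoded in
patch 8.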


* Re: [RFC 4/8] Improve the tracking of active utilisation
  2016-01-19 13:47       ` Peter Zijlstra
@ 2016-01-27 13:36         ` Luca Abeni
  2016-01-27 14:39           ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Luca Abeni @ 2016-01-27 13:36 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

Hi Peter,

On Tue, 19 Jan 2016 14:47:39 +0100
Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jan 19, 2016 at 01:20:13PM +0100, Luca Abeni wrote:
> > Hi Peter,
> > 
> > On 01/14/2016 08:43 PM, Peter Zijlstra wrote:
> > >On Thu, Jan 14, 2016 at 04:24:49PM +0100, Luca Abeni wrote:
> > >>This patch implements a more theoretically sound algorithm for
> > >>tracking the active utilisation: instead of decreasing it when a
> > >>task blocks, use a timer (the "inactive timer", named after the
> > >>"Inactive" task state of the GRUB algorithm) to decrease the
> > >>active utilisation at the so-called "0-lag time".
> > >
> > >See also the large-ish comment in __setparam_dl().
> > >
> > >If we go do proper 0-lag, as GRUB requires, then we might as well
> > >use it for that.
> > Just to check if I understand correctly:
> > I would need to remove "dl_b->total_bw -= p->dl.dl_bw;" from
> > task_dead_dl(), and __dl_clear() from "else if (!dl_policy(policy)
> > && task_has_dl_policy(p))" in dl_overflow(). Then, arm the
> > inactive_timer in these cases, and add the __dl_clear() in the "if
> > (!dl_task(p))" in inactive_task_timer()... Right?
> 
> Correct.
> 
> > If this understanding is correct (modulo some details that I'll
> > figure out during testing), I'll try this.
> 
> Yes, there's bound to be 'fun' details..

Ok, so I implemented this idea, and I am currently testing it...
The first experiments seem to show that there are no problems, but I
just tried some simple workload (rt-app, or some other periodic taskset
scheduled by SCHED_DEADLINE). Do you have suggestions for more
"interesting" (and meaningful) tests/experiments?


			Thanks,
				Luca


* Re: [RFC 4/8] Improve the tracking of active utilisation
  2016-01-27 13:36         ` Luca Abeni
@ 2016-01-27 14:39           ` Peter Zijlstra
  2016-01-27 14:45             ` Luca Abeni
  2016-01-28 11:14             ` luca abeni
  0 siblings, 2 replies; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-27 14:39 UTC (permalink / raw)
  To: Luca Abeni; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On Wed, Jan 27, 2016 at 02:36:51PM +0100, Luca Abeni wrote:
> Ok, so I implemented this idea, and I am currently testing it...
> The first experiments seem to show that there are no problems, but I
> just tried some simple workload (rt-app, or some other periodic taskset
> scheduled by SCHED_DEADLINE). Do you have suggestions for more
> "interesting" (and meaningful) tests/experiments?

rt-app is the workload generator, right?

I think the most interesting part here is the switched_from path, so
you'd want the workload to include a !rt task that gets PI boosted to
deadline every so often.

Also, does rt-app let tasks die? Or does it spawn N tasks and let them
run jobs until the end? I think you want to put some effort into
task_dead_dl() as well.

After that, just make sure rt-app generates a _lot_ of tasks such that
the migration thing gets used.

Other than that, no, not really :-)


* Re: [RFC 8/8] Do not reclaim the whole CPU bandwidth
  2016-01-26 12:52         ` luca abeni
@ 2016-01-27 14:44           ` Peter Zijlstra
  2016-02-02 20:53             ` Luca Abeni
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-27 14:44 UTC (permalink / raw)
  To: luca abeni; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On Tue, Jan 26, 2016 at 01:52:19PM +0100, luca abeni wrote:

> > The trouble is with interfaces. Once we expose them we're stuck with
> > them. And from that POV I think an explicit SCHED_OTHER server (or a
> > minimum budget for a slack time scheme) makes more sense.

> I am trying to work on this.
> Which kind of interface is better for this? Would adding something like
> /proc/sys/kernel/sched_other_period_us
> /proc/sys/kernel/sched_other_runtime_us
> be ok?
> 
> If this is ok, I'll add these two procfs files, and store
> (sched_other_runtime / sched_other_period) << 20 in the runqueue field
> which represents the unreclaimable utilization (implementing
> hierarchical SCHED_DEADLINE/CFS scheduling right now is too complex for
> this patchset... But if the exported interface is ok, it can be
> implemented later).
> 
> Is this approach acceptable? Or am I misunderstanding your comment?

No, I think that's fine.

Although now you have me worrying about per-root_domain settings and the
like. But I think we can do that with additional interfaces, if needed.

So yes, please go with that.

And agreed, a full CFS server is a bit outside scope for this patch-set.


* Re: [RFC 4/8] Improve the tracking of active utilisation
  2016-01-27 14:39           ` Peter Zijlstra
@ 2016-01-27 14:45             ` Luca Abeni
  2016-01-28 13:08               ` Vincent Guittot
       [not found]               ` <CAKfTPtAt0gTwk9aAZN238NT1O-zJvxVQDTh2QN_KxAnE61xMww@mail.gmail.com>
  2016-01-28 11:14             ` luca abeni
  1 sibling, 2 replies; 58+ messages in thread
From: Luca Abeni @ 2016-01-27 14:45 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

Hi Peter,

On Wed, 27 Jan 2016 15:39:46 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Jan 27, 2016 at 02:36:51PM +0100, Luca Abeni wrote:
> > Ok, so I implemented this idea, and I am currently testing it...
> > The first experiments seem to show that there are no problems, but I
> > just tried some simple workload (rt-app, or some other periodic
> > taskset scheduled by SCHED_DEADLINE). Do you have suggestions for
> > more "interesting" (and meaningful) tests/experiments?
> 
> rt-app is the workload generator, right?
> 
> I think the most interesting part here is the switched_from path, so
> you'd want the workload to include a !rt task that gets PI boosted to
> deadline every so often.
> 
> Also, does rt-app let tasks die? Or does it spawn N tasks and lets
> them run jobs until the end? I think you want to put some effort in
> task_dead_dl() as well.
> 
> After that, just make sure rt-app generates a _lot_ of tasks such that
> the migration thing gets used.

Thanks; I'll check with Juri how to do all of this with rt-app (or how
to modify rt-app to stress these functionalities).


			Thanks,
				Luca


* Re: [RFC 4/8] Improve the tracking of active utilisation
  2016-01-27 14:39           ` Peter Zijlstra
  2016-01-27 14:45             ` Luca Abeni
@ 2016-01-28 11:14             ` luca abeni
  2016-01-28 12:21               ` Peter Zijlstra
  1 sibling, 1 reply; 58+ messages in thread
From: luca abeni @ 2016-01-28 11:14 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

Hi Peter,

On Wed, 27 Jan 2016 15:39:46 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Jan 27, 2016 at 02:36:51PM +0100, Luca Abeni wrote:
> > Ok, so I implemented this idea, and I am currently testing it...
> > The first experiments seem to show that there are no problems, but I
> > just tried some simple workload (rt-app, or some other periodic
> > taskset scheduled by SCHED_DEADLINE). Do you have suggestions for
> > more "interesting" (and meaningful) tests/experiments?
> 
> rt-app is the workload generator, right?
> 
> I think the most interesting part here is the switched_from path, so
> you'd want the workload to include a !rt task that gets PI boosted to
> deadline every so often.
I am looking at the PI stuff right now... And I am not sure if
SCHED_DEADLINE does the right thing for PI :)

Anyway, I think the total SCHED_DEADLINE utilization (rd->dl_bw) is
currently not changed when a SCHED_OTHER task is boosted to
SCHED_DEADLINE due to PI... Right? Is this the desired behaviour?
If yes, I'll make sure that my patch does not change it.



			Thanks,
				Luca


* Re: [RFC 4/8] Improve the tracking of active utilisation
  2016-01-28 11:14             ` luca abeni
@ 2016-01-28 12:21               ` Peter Zijlstra
  2016-01-28 13:41                 ` luca abeni
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-28 12:21 UTC (permalink / raw)
  To: luca abeni; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On Thu, Jan 28, 2016 at 12:14:41PM +0100, luca abeni wrote:
> I am looking at the PI stuff right now... And I am not sure if
> SCHED_DEADLINE does the right thing for PI :)

Strictly speaking it does not, dl-pi is a giant hack.

Some day we should fix this :-)

But as you might be aware, SMP capable PI protocols for this are
somewhat tricky.

> Anyway, I think the total SCHED_DEADLINE utilization (rd->dl_bw) is
> currently not changed when a SCHED_OTHER task is boosted to
> SCHED_DEADLINE due to PI... Right? 

From memory that is accurate, but not right as per the above. Ideally we
would indeed charge the boosted task against the booster's bandwidth.

This has the 'fun' consequence that while you deplete the bandwidth of
the booster the PI order can change and we should pick another booster
etc.

> Is this the desired behaviour?

Nope, but fixing this is likely to be non-trivial.


* Re: [RFC 4/8] Improve the tracking of active utilisation
  2016-01-27 14:45             ` Luca Abeni
@ 2016-01-28 13:08               ` Vincent Guittot
       [not found]               ` <CAKfTPtAt0gTwk9aAZN238NT1O-zJvxVQDTh2QN_KxAnE61xMww@mail.gmail.com>
  1 sibling, 0 replies; 58+ messages in thread
From: Vincent Guittot @ 2016-01-28 13:08 UTC (permalink / raw)
  To: Luca Abeni; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Juri Lelli

Hi Luca,

On 27 January 2016 at 15:45, Luca Abeni <luca.abeni@unitn.it> wrote:
> Hi Peter,
>
> On Wed, 27 Jan 2016 15:39:46 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
>
>> On Wed, Jan 27, 2016 at 02:36:51PM +0100, Luca Abeni wrote:
>> > Ok, so I implemented this idea, and I am currently testing it...
>> > The first experiments seem to show that there are no problems, but I
>> > just tried some simple workload (rt-app, or some other periodic
>> > taskset scheduled by SCHED_DEADLINE). Do you have suggestions for
>> > more "interesting" (and meaningful) tests/experiments?
>>
>> rt-app is the workload generator, right?
>>
>> I think the most interesting part here is the switched_from path, so
>> you'd want the workload to include a !rt task that gets PI boosted to
>> deadline every so often.
>>
>> Also, does rt-app let tasks die? Or does it spawn N tasks and lets
>> them run jobs until the end? I think you want to put some effort in
>> task_dead_dl() as well.
>>
>> After that, just make sure rt-app generates a _lot_ of tasks such that
>> the migration thing gets used.
>
> Thanks; I'll check with Juri how to do all of this with rt-app (or how
> to modify rt-app to stress these functionalities).

This version of the rt-app workload generator can do all the sequences you want:
https://git.linaro.org/power/rt-app.git/shortlog/refs/heads/master
The merge of these changes is ongoing but not finished yet.

Let me know if you need help to use it and create some use cases

Regards,
Vincent

>
>
>                         Thanks,
>                                 Luca


* Re: [RFC 4/8] Improve the tracking of active utilisation
  2016-01-28 12:21               ` Peter Zijlstra
@ 2016-01-28 13:41                 ` luca abeni
  2016-01-28 14:00                   ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: luca abeni @ 2016-01-28 13:41 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

Hi Peter,

On Thu, 28 Jan 2016 13:21:00 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Jan 28, 2016 at 12:14:41PM +0100, luca abeni wrote:
> > I am looking at the PI stuff right now... And I am not sure if
> > SCHED_DEADLINE does the right thing for PI :)
> 
> Strictly speaking it does not, dl-pi is a giant hack.
> 
> Some day we should fix this :-)
I am trying to have a better look at the code, and I think that
implementing bandwidth inheritance (BWI) could be easy (implementing
M-BWI, that can be analyzed on multi-processor systems, is more complex
because it requires busy waiting or similar).


> But as you might be aware, SMP capable PI protocols for this are
> somewhat tricky.
Right :)


> > Anyway, I think the total SCHED_DEADLINE utilization (rd->dl_bw) is
> > currently not changed when a SCHED_OTHER task is boosted to
> > SCHED_DEADLINE due to PI... Right? 
> 
> From memory that is accurate, but not right as per the above. Ideally
> we would indeed charge the boosted task against the booster's
> bandwidth.
Yes, this would be the BWI approach


> This has the 'fun' consequence that while you deplete the bandwidth of
> the booster the PI order can change and we should pick another booster
> etc.
> 
> > Is this the desired behaviour?
> 
> Nope, but fixing this is likely to be non-trivial.
Ok... So, if this is acceptable for this patchset I'll try to keep the
current PI behaviour, and I'll try to have a look at a better PI
protocol after the runtime reclaiming stuff is done (that is, I make it
acceptable for mainline, or we decide that a different approach is
needed).



			Luca


* Re: [RFC 4/8] Improve the tracking of active utilisation
       [not found]               ` <CAKfTPtAt0gTwk9aAZN238NT1O-zJvxVQDTh2QN_KxAnE61xMww@mail.gmail.com>
@ 2016-01-28 13:48                 ` luca abeni
  2016-01-28 13:56                   ` Vincent Guittot
  0 siblings, 1 reply; 58+ messages in thread
From: luca abeni @ 2016-01-28 13:48 UTC (permalink / raw)
  To: Vincent Guittot; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Juri Lelli

On Thu, 28 Jan 2016 14:05:44 +0100
Vincent Guittot <vincent.guittot@linaro.org> wrote:

> Hi Luca,
> 
> 
> On 27 January 2016 at 15:45, Luca Abeni <luca.abeni@unitn.it> wrote:
> 
> > Hi Peter,
> >
> > On Wed, 27 Jan 2016 15:39:46 +0100
> > Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > On Wed, Jan 27, 2016 at 02:36:51PM +0100, Luca Abeni wrote:
> > > > Ok, so I implemented this idea, and I am currently testing it...
> > > > The first experiments seem to show that there are no problems,
> > > > but I just tried some simple workload (rt-app, or some other
> > > > periodic taskset scheduled by SCHED_DEADLINE). Do you have
> > > > suggestions for more "interesting" (and meaningful)
> > > > tests/experiments?
> > >
> > > rt-app is the workload generator, right?
> > >
> > > I think the most interesting part here is the switched_from path,
> > > so you'd want the workload to include a !rt task that gets PI
> > > boosted to deadline every so often.
> > >
> > > Also, does rt-app let tasks die? Or does it spawn N tasks and lets
> > > them run jobs until the end? I think you want to put some effort
> > > in task_dead_dl() as well.
> > >
> > > After that, just make sure rt-app generates a _lot_ of tasks such
> > > that the migration thing gets used.
> >
> > Thanks; I'll check with Juri how to do all of this with rt-app (or
> > how to modify rt-app to stress these functionalities).
> >
> 
> This version of the rt-app workload generator can do all the sequences
> you want:
> https://git.linaro.org/power/rt-app.git/shortlog/refs/heads/master
Thanks Vincent; I am going to have a look at it.
Are the "lock_order" and "resources" task parameters documented or
described somewhere?


			Thanks,
				Luca

> The merge of these changes is ongoing but not finished yet.
> 
> Let me know if you need help to use it and create some use cases
> 
> Regards,
> Vincent
> 
> 
> >
> >
> >                         Thanks,
> >                                 Luca
> >


* Re: [RFC 4/8] Improve the tracking of active utilisation
  2016-01-28 13:48                 ` luca abeni
@ 2016-01-28 13:56                   ` Vincent Guittot
  0 siblings, 0 replies; 58+ messages in thread
From: Vincent Guittot @ 2016-01-28 13:56 UTC (permalink / raw)
  To: luca abeni; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Juri Lelli

There is a How-to that explains how to describe a scenario:
https://git.linaro.org/power/rt-app.git/blob/HEAD:/doc/tutorial.txt

as well as some examples in the doc directory

Regards,
Vincent

On 28 January 2016 at 14:48, luca abeni <luca.abeni@unitn.it> wrote:
> On Thu, 28 Jan 2016 14:05:44 +0100
> Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
>> Hi Luca,
>>
>>
>> On 27 January 2016 at 15:45, Luca Abeni <luca.abeni@unitn.it> wrote:
>>
>> > Hi Peter,
>> >
>> > On Wed, 27 Jan 2016 15:39:46 +0100
>> > Peter Zijlstra <peterz@infradead.org> wrote:
>> >
>> > > On Wed, Jan 27, 2016 at 02:36:51PM +0100, Luca Abeni wrote:
>> > > > Ok, so I implemented this idea, and I am currently testing it...
>> > > > The first experiments seem to show that there are no problems,
>> > > > but I just tried some simple workload (rt-app, or some other
>> > > > periodic taskset scheduled by SCHED_DEADLINE). Do you have
>> > > > suggestions for more "interesting" (and meaningful)
>> > > > tests/experiments?
>> > >
>> > > rt-app is the workload generator, right?
>> > >
>> > > I think the most interesting part here is the switched_from path,
>> > > so you'd want the workload to include a !rt task that gets PI
>> > > boosted to deadline every so often.
>> > >
>> > > Also, does rt-app let tasks die? Or does it spawn N tasks and lets
>> > > them run jobs until the end? I think you want to put some effort
>> > > in task_dead_dl() as well.
>> > >
>> > > After that, just make sure rt-app generates a _lot_ of tasks such
>> > > that the migration thing gets used.
>> >
>> > Thanks; I'll check with Juri how to do all of this with rt-app (or
>> > how to modify rt-app to stress these functionalities).
>> >
>>
>> This version of the rt-app workload generator can do all the sequences
>> you want:
>> https://git.linaro.org/power/rt-app.git/shortlog/refs/heads/master
> Thanks Vincent; I am going to have a look at it.
> Are the "lock_order" and "resources" task parameters documented or
> described somewhere?
>
>
>                         Thanks,
>                                 Luca
>
>> The merge of these changes is ongoing but not finished yet.
>>
>> Let me know if you need help to use it and create some use cases
>>
>> Regards,
>> Vincent
>>
>>
>> >
>> >
>> >                         Thanks,
>> >                                 Luca
>> >
>


* Re: [RFC 4/8] Improve the tracking of active utilisation
  2016-01-28 13:41                 ` luca abeni
@ 2016-01-28 14:00                   ` Peter Zijlstra
  2016-01-28 21:15                     ` Luca Abeni
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-28 14:00 UTC (permalink / raw)
  To: luca abeni; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On Thu, Jan 28, 2016 at 02:41:29PM +0100, luca abeni wrote:

> > Some day we should fix this :-)

> I am trying to have a better look at the code, and I think that
> implementing bandwidth inheritance (BWI) could be easy (implementing
> M-BWI, that can be analyzed on multi-processor systems, is more complex
> because it requires busy waiting or similar).

Ah indeed, I remember now. To which I said that if busy-waiting is
'correct', then so must not-busy-waiting be, for that consumes less
cputime and would allow more actual work to be done.

Of course, I might have missed some subtle detail, but intuition
suggests the above.

> > Nope, but fixing this is likely to be non-trivial.

> Ok... So, if this is acceptable for this patchset I'll try to keep the
> current PI behaviour,

Yeah that's fine. That's decidedly outside the scope of these patches.

> and I'll try to have a look at a better PI
> protocol after the runtime reclaiming stuff is done (that is, I make it
> acceptable for mainline, or we decide that a different approach is
> needed).

That would be very nice indeed!


* Re: [RFC 4/8] Improve the tracking of active utilisation
  2016-01-28 14:00                   ` Peter Zijlstra
@ 2016-01-28 21:15                     ` Luca Abeni
  0 siblings, 0 replies; 58+ messages in thread
From: Luca Abeni @ 2016-01-28 21:15 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On Thu, 28 Jan 2016 15:00:53 +0100
Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Jan 28, 2016 at 02:41:29PM +0100, luca abeni wrote:
> 
> > > Some day we should fix this :-)
> 
> > I am trying to have a better look at the code, and I think that
> > implementing bandwidth inheritance (BWI) could be easy (implementing
> > M-BWI, that can be analyzed on multi-processor systems, is more
> > complex because it requires busy waiting or similar).
> 
> Ah indeed, I remember now. To which I said that if busy-waiting is
> 'correct', then so must not-busy-waiting be, for that consumes less
> cputime and would allow more actual work to be done.
The issue is that when the task wakes up after blocking, the scheduler
has to check if the current deadline and runtime can still be used (and
if they cannot, it generates a new deadline).
Of course a blocking solution can work, but the strategy used to check
if deadline and runtime are still valid must be changed. I discussed
this with the M-BWI authors about one year ago, but we did not arrive
at a definitive conclusion.

Anyway, I suspect that implementing BWI (without the "M-"), which has
no busy-waiting, could be an improvement with respect to the current
mechanism.


			Thanks,
				Luca

> 
> Of course, I might have missed some subtle detail, but intuition
> suggests the above.
> 
> > > Nope, but fixing this is likely to be non-trivial.
> 
> > Ok... So, if this is acceptable for this patchset I'll try to keep
> > the current PI behaviour,
> 
> Yeah that's fine. That's decidedly outside the scope of these patches.
> 
> > and I'll try to have a look at a better PI
> > protocol after the runtime reclaiming stuff is done (that is, I
> > make it acceptable for mainline, or we decide that a different
> > approach is needed).
> 
> That would be very nice indeed!


* Re: [RFC 5/8] Track the "total rq utilisation" too
  2016-01-15  9:15         ` Luca Abeni
@ 2016-01-29 15:06           ` Peter Zijlstra
  2016-01-29 21:21             ` Luca Abeni
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2016-01-29 15:06 UTC (permalink / raw)
  To: Luca Abeni; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On Fri, Jan 15, 2016 at 10:15:11AM +0100, Luca Abeni wrote:

> There is also a newer paper, that will be published at ACM SAC 2016
> (so, it is not available yet), but is based on this technical report:
> http://arxiv.org/abs/1512.01984
> This second paper describes some more complex algorithms (easily
> implementable over this patchset) that are able to guarantee hard
> schedulability for SCHED_DEADLINE tasks with reclaiming on SMP.

So I finally got around to reading the relevant sections of that paper
(5.1 and 5.2).

The paper introduces two alternatives;

 - parallel reclaim (5.1)
 - sequential reclaim (5.2)

The parent patch introduces the accounting required for sequential
reclaiming IIUC.

Thinking about things however, I think I would prefer parallel reclaim
over sequential reclaim. The problem I see with sequential reclaim is
that under light load jobs might land on different CPUs and not benefit
from reclaim (as much) since the 'spare' bandwidth is stuck on other
CPUs.

Now I suppose the exact conditions to hit that worst case might be quite
hard to trigger, in which case it might just not matter in practical
terms.

But maybe I'm mistaken, the paper doesn't seem to compare the two
approaches in this way.


* Re: [RFC 5/8] Track the "total rq utilisation" too
  2016-01-29 15:06           ` Peter Zijlstra
@ 2016-01-29 21:21             ` Luca Abeni
  0 siblings, 0 replies; 58+ messages in thread
From: Luca Abeni @ 2016-01-29 21:21 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

Hi Peter,

On Fri, 29 Jan 2016 16:06:05 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Jan 15, 2016 at 10:15:11AM +0100, Luca Abeni wrote:
> 
> > There is also a newer paper, that will be published at ACM SAC 2016
> > (so, it is not available yet), but is based on this technical
> > report: http://arxiv.org/abs/1512.01984
> > This second paper describes some more complex algorithms (easily
> > implementable over this patchset) that are able to guarantee hard
> > schedulability for SCHED_DEADLINE tasks with reclaiming on SMP.
> 
> So I finally got around to reading the relevant sections of that paper
> (5.1 and 5.2).
> 
> The paper introduces two alternatives;
> 
>  - parallel reclaim (5.1)
>  - sequential reclaim (5.2)
> 
> The parent patch introduces the accounting required for sequential
> reclaiming IIUC.
The patches I posted implement something similar to sequential
reclaiming (they actually implement the algorithm described in the
RTLWS paper: http://disi.unitn.it/~abeni/reclaiming/rtlws14-grub.pdf).

The parallel and sequential reclaiming algorithms can be implemented on
top of the RFC patches I posted. See
https://github.com/lucabe72/linux-reclaiming/commits/reclaiming-new-v3
commits from "Move to the new M-GRUB definition (BCL incarnation)",
which implements sequential reclaiming. Parallel reclaiming is
implemented in the following 2 commits.
I did not post these last patches because I feel they are too premature
even for an RFC :)

> Thinking about things however, I think I would prefer parallel reclaim
> over sequential reclaim. The problem I see with sequential reclaim is
> that under light load jobs might land on different CPUs and not
> benefit from reclaim (as much) since the 'spare' bandwidth is stuck
> on other CPUs.
> 
> Now I suppose the exact conditions to hit that worst case might be
> quite hard to trigger, in which case it might just not matter in
> practical terms.
> 
> But maybe I'm mistaken, the paper doesn't seem to compare the two
> approaches in this way.
The technical report does not present any comparison, but we have an ACM
SAC paper (still to be published) that presents some experiments
comparing the two algorithms. And, you are right: parallel reclaiming
seems to work better.

However, parallel reclaiming requires a "global" (per scheduling
domain) variable to keep track of the total active (or inactive)
utilization... And this is updated every time a task blocks/unblocks.
So, I expected more overhead, or scalability issues... But in the
experiments I have not been able to measure this overhead (I "only"
tested on a dual Xeon, with 4 cores per CPU... Maybe I needed more
runqueues/CPU cores to see scalability issues?).
Also, the implementation of parallel reclaiming is approximate (for
example: when a task wakes up on a runqueue, the "global" active
utilization is updated... But before updating it we should properly
account the runtime of the SCHED_DEADLINE tasks executing on all the
other runqueues of the scheduling domain... However, I only account the
runtime when a tick fires or when a task blocks). But my experiments
were not able to show any performance degradation because of this
approximation...

I got the impression that "per-runqueue active utilization" can be
tracked with less overhead (and can be more useful: for example, to
drive frequency scaling), and this is why I posted this patchset as an
RFC... But if you think that "global active utilization" (leading to
parallel reclaiming) can be more useful, I can reorder the patches and
post an RFC with a parallel reclaiming implementation.



			Thanks,
				Luca


* Re: [RFC 8/8] Do not reclaim the whole CPU bandwidth
  2016-01-27 14:44           ` Peter Zijlstra
@ 2016-02-02 20:53             ` Luca Abeni
  2016-02-03 11:30               ` Juri Lelli
  0 siblings, 1 reply; 58+ messages in thread
From: Luca Abeni @ 2016-02-02 20:53 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Juri Lelli

On Wed, 27 Jan 2016 15:44:22 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Jan 26, 2016 at 01:52:19PM +0100, luca abeni wrote:
> 
> > > The trouble is with interfaces. Once we expose them we're stuck
> > > with them. And from that POV I think an explicit SCHED_OTHER
> > > server (or a minimum budget for a slack time scheme) makes more
> > > sense.
> 
> > I am trying to work on this.
> > Which kind of interface is better for this? Would adding something
> > like /proc/sys/kernel/sched_other_period_us
> > /proc/sys/kernel/sched_other_runtime_us
> > be ok?
> > 
> > If this is ok, I'll add these two procfs files, and store
> > (sched_other_runtime / sched_other_period) << 20 in the runqueue
> > field which represents the unreclaimable utilization (implementing
> > hierarchical SCHED_DEADLINE/CFS scheduling right now is too complex
> > for this patchset... But if the exported interface is ok, it can be
> > implemented later).
> > 
> > Is this approach acceptable? Or am I misunderstanding your comment?
> 
> No, I think that's fine.
So, I implemented this idea (/proc/sys/kernel/sched_other_period_us
and /proc/sys/kernel/sched_other_runtime_us to set the unreclaimable
utilization), and some initial testing seems to show that it works fine.

However, after thinking about it again I am wondering whether using a
runqueue field to store the unreclaimable utilization (unusable_bw in my
original patch) makes sense or not... This value is the same for all
the runqueues, and changing sched_other_runtime/sched_other_period
changes the unreclaimable utilization on all the runqueues... So maybe
it is better to use a global variable instead of a runqueue field?

Any ideas / suggestions? Before sending a v2 of the RFC, I'd like to
be sure that I am doing the right thing.


			Thanks,
				Luca


* Re: [RFC 8/8] Do not reclaim the whole CPU bandwidth
  2016-02-02 20:53             ` Luca Abeni
@ 2016-02-03 11:30               ` Juri Lelli
  2016-02-03 13:28                 ` luca abeni
  0 siblings, 1 reply; 58+ messages in thread
From: Juri Lelli @ 2016-02-03 11:30 UTC (permalink / raw)
  To: Luca Abeni; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar

Hi Luca, Peter,

On 02/02/16 21:53, Luca Abeni wrote:
> On Wed, 27 Jan 2016 15:44:22 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Tue, Jan 26, 2016 at 01:52:19PM +0100, luca abeni wrote:
> > 
> > > > The trouble is with interfaces. Once we expose them we're stuck
> > > > with them. And from that POV I think an explicit SCHED_OTHER
> > > > server (or a minimum budget for a slack time scheme) makes more
> > > > sense.
> > 
> > > I am trying to work on this.
> > > Which kind of interface is better for this? Would adding something
> > > like /proc/sys/kernel/sched_other_period_us
> > > /proc/sys/kernel/sched_other_runtime_us
> > > be ok?
> > > 
> > > If this is ok, I'll add these two procfs files, and store
> > > (sched_other_runtime / sched_other_period) << 20 in the runqueue
> > > field which represents the unreclaimable utilization (implementing
> > > hierarchical SCHED_DEADLINE/CFS scheduling right now is too complex
> > > for this patchset... But if the exported interface is ok, it can be
> > > implemented later).
> > > 
> > > Is this approach acceptable? Or am I misunderstanding your comment?
> > 
> > No, I think that's fine.
> So, I implemented this idea (/proc/sys/kernel/sched_other_period_us
> and /proc/sys/kernel/sched_other_runtime_us to set the unreclaimable
> utilization), and some initial testing seems to show that it works fine.
> 

Sorry for not saying this before, but why can't we use the existing
sched_rt_runtime_us/sched_rt_period_us cap for this? I mean, SCHED_OTHER
will have (1 - rt_runtime_ratio) available to run.

Best,

- Juri


* Re: [RFC 8/8] Do not reclaim the whole CPU bandwidth
  2016-02-03 11:30               ` Juri Lelli
@ 2016-02-03 13:28                 ` luca abeni
  0 siblings, 0 replies; 58+ messages in thread
From: luca abeni @ 2016-02-03 13:28 UTC (permalink / raw)
  To: Juri Lelli; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar

Hi Juri,

On Wed, 3 Feb 2016 11:30:19 +0000
Juri Lelli <juri.lelli@arm.com> wrote:
[...]
> > > > Which kind of interface is better for this? Would adding
> > > > something like /proc/sys/kernel/sched_other_period_us
> > > > /proc/sys/kernel/sched_other_runtime_us
> > > > be ok?
> > > > 
> > > > If this is ok, I'll add these two procfs files, and store
> > > > (sched_other_runtime / sched_other_period) << 20 in the runqueue
> > > > field which represents the unreclaimable utilization
> > > > (implementing hierarchical SCHED_DEADLINE/CFS scheduling right
> > > > now is too complex for this patchset... But if the exported
> > > > interface is ok, it can be implemented later).
> > > > 
> > > > Is this approach acceptable? Or am I misunderstanding your
> > > > comment?
> > > 
> > > No, I think that's fine.
> > So, I implemented this idea (/proc/sys/kernel/sched_other_period_us
> > and /proc/sys/kernel/sched_other_runtime_us to set the unreclaimable
> > utilization), and some initial testing seems to show that it works
> > fine.
> > 
> 
> Sorry for not saying this before, but why can't we use the existing
> sched_rt_runtime_us/sched_rt_runtime_period cap for this? I mean,
> other will have (1 - rt_runtime_ratio) available to run.

I was thinking about providing a more flexible interface (allowing the
use of rt_runtime/rt_period for admission control and
other_runtime/other_period for reclaiming), but using
sched_rt_runtime_us/sched_rt_period_us makes sense too. If this
solution is preferred, I'll adapt my patch.


			Thanks,
				Luca


end of thread, other threads:[~2016-02-03 13:28 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-14 15:24 [RFC 0/8] CPU reclaiming for SCHED_DEADLINE Luca Abeni
2016-01-14 15:24 ` [RFC 1/8] Track the active utilisation Luca Abeni
2016-01-14 16:49   ` Peter Zijlstra
2016-01-15  6:37     ` Luca Abeni
2016-01-14 19:13   ` Peter Zijlstra
2016-01-15  8:07     ` Luca Abeni
2016-01-14 15:24 ` [RFC 2/8] Correctly track the active utilisation for migrating tasks Luca Abeni
2016-01-14 15:24 ` [RFC 3/8] sched/deadline: add some tracepoints Luca Abeni
2016-01-14 15:24 ` [RFC 4/8] Improve the tracking of active utilisation Luca Abeni
2016-01-14 17:16   ` Peter Zijlstra
2016-01-15  6:48     ` Luca Abeni
2016-01-14 19:43   ` Peter Zijlstra
2016-01-15  9:27     ` Luca Abeni
2016-01-19 12:20     ` Luca Abeni
2016-01-19 13:47       ` Peter Zijlstra
2016-01-27 13:36         ` Luca Abeni
2016-01-27 14:39           ` Peter Zijlstra
2016-01-27 14:45             ` Luca Abeni
2016-01-28 13:08               ` Vincent Guittot
     [not found]               ` <CAKfTPtAt0gTwk9aAZN238NT1O-zJvxVQDTh2QN_KxAnE61xMww@mail.gmail.com>
2016-01-28 13:48                 ` luca abeni
2016-01-28 13:56                   ` Vincent Guittot
2016-01-28 11:14             ` luca abeni
2016-01-28 12:21               ` Peter Zijlstra
2016-01-28 13:41                 ` luca abeni
2016-01-28 14:00                   ` Peter Zijlstra
2016-01-28 21:15                     ` Luca Abeni
2016-01-14 19:47   ` Peter Zijlstra
2016-01-15  8:10     ` Luca Abeni
2016-01-15  8:32       ` Peter Zijlstra
2016-01-14 15:24 ` [RFC 5/8] Track the "total rq utilisation" too Luca Abeni
2016-01-14 19:12   ` Peter Zijlstra
2016-01-15  8:04     ` Luca Abeni
2016-01-14 19:48   ` Peter Zijlstra
2016-01-15  6:50     ` Luca Abeni
2016-01-15  8:34       ` Peter Zijlstra
2016-01-15  9:15         ` Luca Abeni
2016-01-29 15:06           ` Peter Zijlstra
2016-01-29 21:21             ` Luca Abeni
2016-01-14 15:24 ` [RFC 6/8] GRUB accounting Luca Abeni
2016-01-14 19:50   ` Peter Zijlstra
2016-01-15  8:05     ` Luca Abeni
2016-01-14 15:24 ` [RFC 7/8] Make GRUB a task's flag Luca Abeni
2016-01-14 19:56   ` Peter Zijlstra
2016-01-15  8:15     ` Luca Abeni
2016-01-15  8:41       ` Peter Zijlstra
2016-01-15  9:08         ` Luca Abeni
2016-01-14 15:24 ` [RFC 8/8] Do not reclaim the whole CPU bandwidth Luca Abeni
2016-01-14 19:59   ` Peter Zijlstra
2016-01-15  8:21     ` Luca Abeni
2016-01-15  8:50       ` Peter Zijlstra
2016-01-15  9:49         ` Luca Abeni
2016-01-26 12:52         ` luca abeni
2016-01-27 14:44           ` Peter Zijlstra
2016-02-02 20:53             ` Luca Abeni
2016-02-03 11:30               ` Juri Lelli
2016-02-03 13:28                 ` luca abeni
2016-01-19 10:11 ` [RFC 0/8] CPU reclaiming for SCHED_DEADLINE Juri Lelli
2016-01-19 11:50   ` Luca Abeni
