* [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE
@ 2016-12-30 11:33 Luca Abeni
  2016-12-30 11:33 ` [RFC v4 1/6] sched/deadline: track the active utilization Luca Abeni
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: Luca Abeni @ 2016-12-30 11:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Claudio Scordino,
	Steven Rostedt, Tommaso Cucinotta, Daniel Bristot de Oliveira,
	Luca Abeni

From: Luca Abeni <luca.abeni@unitn.it>

Hi all,

here is a new version of the patchset implementing CPU reclaiming
(using the GRUB algorithm[1]) for SCHED_DEADLINE.
Basically, this feature allows a SCHED_DEADLINE task to consume more
than its reserved runtime, up to a maximum fraction of the CPU time
(so that some spare CPU time is left for other tasks to execute),
provided that this does not break the guarantees of the other
SCHED_DEADLINE tasks.
The patchset applies on top of tip/master.


The implemented CPU reclaiming algorithm is based on tracking the
utilization U_act of active tasks (first 2 patches), and modifying the
runtime accounting rule (see patch 0004). The original GRUB algorithm is
modified as described in [2] to support multiple CPUs (the original
algorithm only considered one single CPU, this one tracks U_act per
runqueue) and to leave an "unreclaimable" fraction of CPU time to non
SCHED_DEADLINE tasks (see patch 0005: the original algorithm can consume
100% of the CPU time, starving all the other tasks).
Patch 0003 uses the "inactive timer" introduced in patch 0002 to fix
dl_overflow() and __setparam_dl().
Patch 0006 allows enabling CPU reclaiming only for selected tasks; a
user-space usage sketch follows.
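
For reference, a task could opt in to reclaiming from user space with
something like the sketch below. This is a hypothetical example, not
part of the patchset: sched_setattr() has no glibc wrapper, so the
raw syscall is used and struct sched_attr is defined by hand;
SCHED_FLAG_RECLAIM is the flag introduced in patch 0006.

	#define _GNU_SOURCE
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/types.h>
	#include <linux/sched.h>	/* SCHED_DEADLINE */

	#ifndef SCHED_FLAG_RECLAIM
	#define SCHED_FLAG_RECLAIM	0x02	/* from patch 0006 */
	#endif

	struct sched_attr {
		__u32 size;
		__u32 sched_policy;
		__u64 sched_flags;
		__s32 sched_nice;
		__u32 sched_priority;
		__u64 sched_runtime;
		__u64 sched_deadline;
		__u64 sched_period;
	};

	/* Make the calling task SCHED_DEADLINE, with reclaiming enabled */
	static int enable_reclaiming(void)
	{
		struct sched_attr attr = {
			.size		= sizeof(attr),
			.sched_policy	= SCHED_DEADLINE,
			.sched_flags	= SCHED_FLAG_RECLAIM,
			.sched_runtime	= 10 * 1000 * 1000,	/* 10ms */
			.sched_deadline	= 30 * 1000 * 1000,	/* 30ms */
			.sched_period	= 30 * 1000 * 1000,	/* 30ms */
		};

		return syscall(__NR_sched_setattr, 0, &attr, 0);
	}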


Changes since v3:
the most important change is the introduction of a new "dl_non_contending"
flag in the "sched_dl_entity" structure, which avoids a race
condition identified by Peter
(http://lkml.iu.edu/hypermail/linux/kernel/1604.0/02822.html) and Juri
(http://lkml.iu.edu/hypermail/linux/kernel/1611.1/02298.html).
For the moment, I added a new field (similar to the other "dl_*" flags)
to the deadline scheduling entity; if needed I can move all the dl_* flags
to a single field in a following patch. 

Other than this, I tried to address all the comments I received, and to
add comments requested in the previous reviews.
In particular, the add_running_bw() and sub_running_bw() functions are now
marked as inline, and have been simplified as suggested by Daniel and
Steven.
The overflow and underflow checks in these functions have been modified
as suggested by Peter; because of a limitation of SCHED_WARN_ON(), the
code in sub_running_bw() is slightly more complex. If SCHED_WARN_ON() is
improved (as suggested in a previous email of mine), I can simplify
sub_running_bw() in a following patch. 
I also updated the patches to apply on top of tip/master.
Finally, I (hopefully) fixed an issue with my usage of get_task_struct() /
put_task_struct() in the previous patches: previously, I did
"get_task_struct(p)" before arming the "inactive task timer", and
"put_task_struct(p)" in the timer handler... But I forgot to call
"put_task_struct(p)" when successfully cancelling the timer; this should
be fixed in the new version of patch 0002.

[1] Lipari, G., & Baruah, S. (2000). Greedy reclamation of unused bandwidth in constant-bandwidth servers. In Proceedings of the 12th Euromicro Conference on Real-Time Systems (Euromicro RTS 2000), pp. 193-200. IEEE.
[2] Abeni, L., Lelli, J., Scordino, C., & Palopoli, L. (2014, October). Greedy CPU reclaiming for SCHED DEADLINE. In Proceedings of the Real-Time Linux Workshop (RTLWS), Dusseldorf, Germany. 

Luca Abeni (6):
  sched/deadline: track the active utilization
  sched/deadline: improve the tracking of active utilization
  sched/deadline: fix the update of the total -deadline utilization
  sched/deadline: implement GRUB accounting
  sched/deadline: do not reclaim the whole CPU bandwidth
  sched/deadline: make GRUB a task's flag

 include/linux/sched.h      |  18 +++-
 include/uapi/linux/sched.h |   1 +
 kernel/sched/core.c        |  45 ++++----
 kernel/sched/deadline.c    | 260 +++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h       |  13 +++
 5 files changed, 291 insertions(+), 46 deletions(-)

-- 
2.7.4


* [RFC v4 1/6] sched/deadline: track the active utilization
  2016-12-30 11:33 [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE Luca Abeni
@ 2016-12-30 11:33 ` Luca Abeni
  2016-12-30 11:33 ` [RFC v4 2/6] sched/deadline: improve the tracking of " Luca Abeni
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Luca Abeni @ 2016-12-30 11:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Claudio Scordino,
	Steven Rostedt, Tommaso Cucinotta, Daniel Bristot de Oliveira,
	Luca Abeni

From: Luca Abeni <luca.abeni@unitn.it>

Active utilization is defined as the total utilization of active
(TASK_RUNNING) tasks queued on a runqueue. Hence, it is increased
when a task wakes up and is decreased when a task blocks.
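
For instance (illustrative numbers): two TASK_RUNNING -deadline tasks
with bandwidths dl_bw = 0.25 and dl_bw = 0.5 queued on the same
runqueue give running_bw = 0.75 (stored as 0.75 * 2^20, since dl_bw
is kept in 20-bit fixed point by to_ratio()).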

When a task is migrated from CPUi to CPUj, immediately subtract the
task's utilization from CPUi and add it to CPUj. This mechanism is
implemented by modifying the pull and push functions.
Note: this is not fully correct from the theoretical point of view
(the utilization should be removed from CPUi only at the "0-lag
time"); a more theoretically sound solution will follow.

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
---
 kernel/sched/deadline.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h    |  6 +++++
 2 files changed, 69 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 70ef2b1..23c840e 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -43,6 +43,28 @@ static inline int on_dl_rq(struct sched_dl_entity *dl_se)
 	return !RB_EMPTY_NODE(&dl_se->rb_node);
 }
 
+static inline
+void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	u64 old = dl_rq->running_bw;
+
+	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	dl_rq->running_bw += dl_se->dl_bw;
+	SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
+}
+
+static inline
+void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	u64 old = dl_rq->running_bw;
+
+	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	dl_rq->running_bw -= dl_se->dl_bw;
+	SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
+	if (dl_rq->running_bw > old)
+		dl_rq->running_bw = 0;
+}
+
 static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
 {
 	struct sched_dl_entity *dl_se = &p->dl;
@@ -909,8 +931,12 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
 	 * parameters of the task might need updating. Otherwise,
 	 * we want a replenishment of its runtime.
 	 */
-	if (flags & ENQUEUE_WAKEUP)
+	if (flags & ENQUEUE_WAKEUP) {
+		struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+		add_running_bw(dl_se, dl_rq);
 		update_dl_entity(dl_se, pi_se);
+	}
 	else if (flags & ENQUEUE_REPLENISH)
 		replenish_dl_entity(dl_se, pi_se);
 
@@ -947,14 +973,25 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 		return;
 	}
 
+	if (p->on_rq == TASK_ON_RQ_MIGRATING)
+		add_running_bw(&p->dl, &rq->dl);
+
 	/*
-	 * If p is throttled, we do nothing. In fact, if it exhausted
+	 * If p is throttled, we do not enqueue it. In fact, if it exhausted
 	 * its budget it needs a replenishment and, since it now is on
 	 * its rq, the bandwidth timer callback (which clearly has not
 	 * run yet) will take care of this.
+	 * However, the active utilization does not depend on whether
+	 * the task is on the runqueue (it depends on the task's state -
+	 * in GRUB parlance, "inactive" vs "active contending").
+	 * In other words, even if a task is throttled its utilization must
+	 * be counted in the active utilization; hence, we need to call
+	 * add_running_bw().
 	 */
-	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH))
+	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
+		add_running_bw(&p->dl, &rq->dl);
 		return;
+	}
 
 	enqueue_dl_entity(&p->dl, pi_se, flags);
 
@@ -972,6 +1009,21 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	update_curr_dl(rq);
 	__dequeue_task_dl(rq, p, flags);
+
+	if (p->on_rq == TASK_ON_RQ_MIGRATING)
+		sub_running_bw(&p->dl, &rq->dl);
+
+	/*
+	 * This check allows us to start the inactive timer (or to immediately
+	 * decrease the active utilization, if needed) in two cases:
+	 * when the task blocks and when it is terminating
+	 * (p->state == TASK_DEAD). We can handle the two cases in the same
+	 * way, because from GRUB's point of view the same thing is happening
+	 * (the task moves from "active contending" to "active non contending"
+	 * or "inactive")
+	 */
+	if (flags & DEQUEUE_SLEEP)
+		sub_running_bw(&p->dl, &rq->dl);
 }
 
 /*
@@ -1501,7 +1553,9 @@ static int push_dl_task(struct rq *rq)
 	}
 
 	deactivate_task(rq, next_task, 0);
+	sub_running_bw(&next_task->dl, &rq->dl);
 	set_task_cpu(next_task, later_rq->cpu);
+	add_running_bw(&next_task->dl, &later_rq->dl);
 	activate_task(later_rq, next_task, 0);
 	ret = 1;
 
@@ -1589,7 +1643,9 @@ static void pull_dl_task(struct rq *this_rq)
 			resched = true;
 
 			deactivate_task(src_rq, p, 0);
+			sub_running_bw(&p->dl, &src_rq->dl);
 			set_task_cpu(p, this_cpu);
+			add_running_bw(&p->dl, &this_rq->dl);
 			activate_task(this_rq, p, 0);
 			dmin = p->dl.deadline;
 
@@ -1695,6 +1751,9 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
 	if (!start_dl_timer(p))
 		__dl_clear_params(p);
 
+	if (task_on_rq_queued(p))
+		sub_running_bw(&p->dl, &rq->dl);
+
 	/*
 	 * Since this might be the only -deadline task on the rq,
 	 * this is the right place to try to pull some other one
@@ -1712,6 +1771,7 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
  */
 static void switched_to_dl(struct rq *rq, struct task_struct *p)
 {
+	add_running_bw(&p->dl, &rq->dl);
 
 	/* If p is not queued we will update its parameters at next wakeup. */
 	if (!task_on_rq_queued(p))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7b34c78..0659772 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -536,6 +536,12 @@ struct dl_rq {
 #else
 	struct dl_bw dl_bw;
 #endif
+	/*
+	 * "Active utilization" for this runqueue: increased when a
+	 * task wakes up (becomes TASK_RUNNING) and decreased when a
+	 * task blocks
+	 */
+	u64 running_bw;
 };
 
 #ifdef CONFIG_SMP
-- 
2.7.4


* [RFC v4 2/6] sched/deadline: improve the tracking of active utilization
  2016-12-30 11:33 [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE Luca Abeni
  2016-12-30 11:33 ` [RFC v4 1/6] sched/deadline: track the active utilization Luca Abeni
@ 2016-12-30 11:33 ` Luca Abeni
  2017-01-11 17:05   ` Juri Lelli
  2016-12-30 11:33 ` [RFC v4 3/6] sched/deadline: fix the update of the total -deadline utilization Luca Abeni
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Luca Abeni @ 2016-12-30 11:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Claudio Scordino,
	Steven Rostedt, Tommaso Cucinotta, Daniel Bristot de Oliveira,
	Luca Abeni

From: Luca Abeni <luca.abeni@unitn.it>

This patch implements a more theoretically sound algorithm for
tracking active utilization: instead of decreasing it when a
task blocks, use a timer (the "inactive timer", named after the
"Inactive" task state of the GRUB algorithm) to decrease the
active utilization at the so-called "0-lag time".
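
As an illustration of the "0-lag time" (with invented numbers):
consider a task with dl_runtime = 10ms and dl_period = 30ms that
blocks with runtime = 6ms of budget left and scheduling deadline d.
Following the formula used in task_go_inactive(),

	t_0lag = d - runtime * dl_period / dl_runtime
	       = d - 6ms * 30 / 10 = d - 18ms

so the "inactive timer" is armed to fire 18ms before the current
deadline, and only at that point is the task's bandwidth subtracted
from the active utilization.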

Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
---
 include/linux/sched.h   |  18 +++++-
 kernel/sched/core.c     |   2 +
 kernel/sched/deadline.c | 150 ++++++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h    |   1 +
 4 files changed, 158 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4d19052..f34633c2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1451,14 +1451,30 @@ struct sched_dl_entity {
 	 *
 	 * @dl_yielded tells if task gave up the cpu before consuming
 	 * all its available runtime during the last job.
+	 *
+	 * @dl_non_contending tells if task is inactive while still
+	 * contributing to the active utilization. In other words, it
+	 * indicates if the inactive timer has been armed and its handler
+	 * has not been executed yet. This flag is useful to avoid race
+	 * conditions between the inactive timer handler and the wakeup
+	 * code.
 	 */
-	int dl_throttled, dl_boosted, dl_yielded;
+	int dl_throttled, dl_boosted, dl_yielded, dl_non_contending;
 
 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
 	 * own bandwidth to be enforced, thus we need one timer per task.
 	 */
 	struct hrtimer dl_timer;
+
+	/*
+	 * Inactive timer, responsible for decreasing the active utilization
+	 * at the "0-lag time". When a -deadline task blocks, it contributes
+	 * to GRUB's active utilization until the "0-lag time", hence a
+	 * timer is needed to decrease the active utilization at the correct
+	 * time.
+	 */
+	struct hrtimer inactive_timer;
 };
 
 union rcu_special {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c56fb57..98f9944 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2187,6 +2187,7 @@ void __dl_clear_params(struct task_struct *p)
 
 	dl_se->dl_throttled = 0;
 	dl_se->dl_yielded = 0;
+	dl_se->dl_non_contending = 0;
 }
 
 /*
@@ -2218,6 +2219,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 
 	RB_CLEAR_NODE(&p->dl.rb_node);
 	init_dl_task_timer(&p->dl);
+	init_inactive_task_timer(&p->dl);
 	__dl_clear_params(p);
 
 	INIT_LIST_HEAD(&p->rt.run_list);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 23c840e..cdb7274 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -65,6 +65,46 @@ void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 		dl_rq->running_bw = 0;
 }
 
+static void task_go_inactive(struct task_struct *p)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+	struct hrtimer *timer = &dl_se->inactive_timer;
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+	s64 zerolag_time;
+
+	WARN_ON(dl_se->dl_runtime == 0);
+
+	WARN_ON(hrtimer_active(&dl_se->inactive_timer));
+	WARN_ON(dl_se->dl_non_contending);
+
+	zerolag_time = dl_se->deadline -
+		 div64_long((dl_se->runtime * dl_se->dl_period),
+			dl_se->dl_runtime);
+
+	/*
+	 * Using relative times instead of the absolute "0-lag time"
+	 * allows simplifying the code
+	 */
+	zerolag_time -= rq_clock(rq);
+
+	/*
+	 * If the "0-lag time" has already passed, decrease the active
+	 * utilization now, instead of starting a timer
+	 */
+	if (zerolag_time < 0) {
+		sub_running_bw(dl_se, dl_rq);
+		if (!dl_task(p))
+			__dl_clear_params(p);
+
+		return;
+	}
+
+	dl_se->dl_non_contending = 1;
+	get_task_struct(p);
+	hrtimer_start(timer, ns_to_ktime(zerolag_time), HRTIMER_MODE_REL);
+}
+
 static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
 {
 	struct sched_dl_entity *dl_se = &p->dl;
@@ -610,10 +650,8 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 	 * The task might have changed its scheduling policy to something
 	 * different than SCHED_DEADLINE (through switched_from_dl()).
 	 */
-	if (!dl_task(p)) {
-		__dl_clear_params(p);
+	if (!dl_task(p))
 		goto unlock;
-	}
 
 	/*
 	 * The task might have been boosted by someone else and might be in the
@@ -800,6 +838,48 @@ static void update_curr_dl(struct rq *rq)
 	}
 }
 
+static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
+{
+	struct sched_dl_entity *dl_se = container_of(timer,
+						     struct sched_dl_entity,
+						     inactive_timer);
+	struct task_struct *p = dl_task_of(dl_se);
+	struct rq_flags rf;
+	struct rq *rq;
+
+	rq = task_rq_lock(p, &rf);
+
+	if (!dl_task(p) || p->state == TASK_DEAD) {
+		if (p->state == TASK_DEAD && dl_se->dl_non_contending)
+			sub_running_bw(&p->dl, dl_rq_of_se(&p->dl));
+
+		__dl_clear_params(p);
+
+		goto unlock;
+	}
+	if (dl_se->dl_non_contending == 0)
+		goto unlock;
+
+	sched_clock_tick();
+	update_rq_clock(rq);
+
+	sub_running_bw(dl_se, &rq->dl);
+	dl_se->dl_non_contending = 0;
+unlock:
+	task_rq_unlock(rq, p, &rf);
+	put_task_struct(p);
+
+	return HRTIMER_NORESTART;
+}
+
+void init_inactive_task_timer(struct sched_dl_entity *dl_se)
+{
+	struct hrtimer *timer = &dl_se->inactive_timer;
+
+	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	timer->function = inactive_task_timer;
+}
+
 #ifdef CONFIG_SMP
 
 static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
@@ -934,7 +1014,28 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
 	if (flags & ENQUEUE_WAKEUP) {
 		struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 
-		add_running_bw(dl_se, dl_rq);
+		if (dl_se->dl_non_contending) {
+			/*
+			 * If the timer handler is currently running and the
+			 * timer cannot be cancelled, inactive_task_timer()
+			 * will see that dl_not_contending is not set, and
+			 * will see that dl_non_contending is not set, and
+			 */
+			if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1)
+				put_task_struct(dl_task_of(dl_se));
+			WARN_ON(dl_task_of(dl_se)->nr_cpus_allowed > 1);
+			dl_se->dl_non_contending = 0;
+		} else {
+			/*
+			 * Since "dl_non_contending" is not set, the
+			 * task's utilization has already been removed from
+			 * active utilization (either when the task blocked,
+			 * when the "inactive timer" fired, or when it has
+			 * been cancelled in select_task_rq_dl()).
+			 * So, add it back.
+			 */
+			add_running_bw(dl_se, dl_rq);
+		}
 		update_dl_entity(dl_se, pi_se);
 	}
 	else if (flags & ENQUEUE_REPLENISH)
@@ -1023,7 +1124,7 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	 * or "inactive")
 	 */
 	if (flags & DEQUEUE_SLEEP)
-		sub_running_bw(&p->dl, &rq->dl);
+		task_go_inactive(p);
 }
 
 /*
@@ -1097,6 +1198,22 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
 	}
 	rcu_read_unlock();
 
+	rq = task_rq(p);
+	raw_spin_lock(&rq->lock);
+	if (p->dl.dl_non_contending) {
+		sub_running_bw(&p->dl, &rq->dl);
+		p->dl.dl_non_contending = 0;
+		/*
+		 * If the timer handler is currently running and the
+		 * timer cannot be cancelled, inactive_task_timer()
+		 * will see that dl_non_contending is not set, and
+		 * will do nothing, so we are still safe.
+		 */
+		if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
+			put_task_struct(p);
+	}
+	raw_spin_unlock(&rq->lock);
+
 out:
 	return cpu;
 }
@@ -1743,16 +1860,25 @@ void __init init_sched_dl_class(void)
 static void switched_from_dl(struct rq *rq, struct task_struct *p)
 {
 	/*
-	 * Start the deadline timer; if we switch back to dl before this we'll
-	 * continue consuming our current CBS slice. If we stay outside of
-	 * SCHED_DEADLINE until the deadline passes, the timer will reset the
-	 * task.
+	 * task_go_inactive() can start the "inactive timer" (if the 0-lag
+	 * time is in the future). If the task switches back to dl before
+	 * the "inactive timer" fires, it can continue to consume its current
+	 * runtime using its current deadline. If it stays outside of
+	 * SCHED_DEADLINE until the 0-lag time passes, inactive_task_timer()
+	 * will reset the task parameters.
 	 */
-	if (!start_dl_timer(p))
-		__dl_clear_params(p);
+	if (task_on_rq_queued(p) && p->dl.dl_runtime)
+		task_go_inactive(p);
 
-	if (task_on_rq_queued(p))
+	/*
+	 * We cannot use inactive_task_timer() to invoke sub_running_bw()
+	 * at the 0-lag time, because the task could have been migrated
+	 * in the meanwhile, while it was running as SCHED_OTHER.
+	 */
+	if (p->dl.dl_non_contending) {
 		sub_running_bw(&p->dl, &rq->dl);
+		p->dl.dl_non_contending = 0;
+	}
 
 	/*
 	 * Since this might be the only -deadline task on the rq,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0659772..e422803 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1367,6 +1367,7 @@ extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime
 extern struct dl_bandwidth def_dl_bandwidth;
 extern void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime);
 extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
+extern void init_inactive_task_timer(struct sched_dl_entity *dl_se);
 
 unsigned long to_ratio(u64 period, u64 runtime);
 
-- 
2.7.4


* [RFC v4 3/6] sched/deadline: fix the update of the total -deadline utilization
  2016-12-30 11:33 [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE Luca Abeni
  2016-12-30 11:33 ` [RFC v4 1/6] sched/deadline: track the active utilization Luca Abeni
  2016-12-30 11:33 ` [RFC v4 2/6] sched/deadline: improve the tracking of " Luca Abeni
@ 2016-12-30 11:33 ` Luca Abeni
  2016-12-30 11:33 ` [RFC v4 4/6] sched/deadline: implement GRUB accounting Luca Abeni
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Luca Abeni @ 2016-12-30 11:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Claudio Scordino,
	Steven Rostedt, Tommaso Cucinotta, Daniel Bristot de Oliveira,
	Luca Abeni

From: Luca Abeni <luca.abeni@unitn.it>

Now that the inactive timer can be armed to fire at the 0-lag time,
it is possible to use inactive_task_timer() to update the total
-deadline utilization (dl_b->total_bw) at the correct time, fixing
dl_overflow() and __setparam_dl().

Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
---
 kernel/sched/core.c     | 36 ++++++++++++------------------------
 kernel/sched/deadline.c | 32 +++++++++++++++++++++++---------
 2 files changed, 35 insertions(+), 33 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 98f9944..5030b3c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2509,9 +2509,6 @@ static inline int dl_bw_cpus(int i)
  * allocated bandwidth to reflect the new situation.
  *
  * This function is called while holding p's rq->lock.
- *
- * XXX we should delay bw change until the task's 0-lag point, see
- * __setparam_dl().
  */
 static int dl_overflow(struct task_struct *p, int policy,
 		       const struct sched_attr *attr)
@@ -2540,11 +2537,22 @@ static int dl_overflow(struct task_struct *p, int policy,
 		err = 0;
 	} else if (dl_policy(policy) && task_has_dl_policy(p) &&
 		   !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
+		/*
+		 * XXX this is slightly incorrect: when the task
+		 * utilization decreases, we should delay the total
+		 * utilization change until the task's 0-lag point.
+		 * But this would require setting the task's "inactive
+		 * timer" when the task is not inactive.
+		 */
 		__dl_clear(dl_b, p->dl.dl_bw);
 		__dl_add(dl_b, new_bw);
 		err = 0;
 	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
-		__dl_clear(dl_b, p->dl.dl_bw);
+		/*
+		 * Do not decrease the total deadline utilization here,
+		 * switched_from_dl() will take care to do it at the correct
+		 * (0-lag) time.
+		 */
 		err = 0;
 	}
 	raw_spin_unlock(&dl_b->lock);
@@ -3914,26 +3922,6 @@ __setparam_dl(struct task_struct *p, const struct sched_attr *attr)
 	dl_se->dl_period = attr->sched_period ?: dl_se->dl_deadline;
 	dl_se->flags = attr->sched_flags;
 	dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
-
-	/*
-	 * Changing the parameters of a task is 'tricky' and we're not doing
-	 * the correct thing -- also see task_dead_dl() and switched_from_dl().
-	 *
-	 * What we SHOULD do is delay the bandwidth release until the 0-lag
-	 * point. This would include retaining the task_struct until that time
-	 * and change dl_overflow() to not immediately decrement the current
-	 * amount.
-	 *
-	 * Instead we retain the current runtime/deadline and let the new
-	 * parameters take effect after the current reservation period lapses.
-	 * This is safe (albeit pessimistic) because the 0-lag point is always
-	 * before the current scheduling deadline.
-	 *
-	 * We can still have temporary overloads because we do not delay the
-	 * change in bandwidth until that time; so admission control is
-	 * not on the safe side. It does however guarantee tasks will never
-	 * consume more than promised.
-	 */
 }
 
 /*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index cdb7274..c087c3d 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -94,8 +94,14 @@ static void task_go_inactive(struct task_struct *p)
 	 */
 	if (zerolag_time < 0) {
 		sub_running_bw(dl_se, dl_rq);
-		if (!dl_task(p))
+		if (!dl_task(p)) {
+			struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+
+			raw_spin_lock(&dl_b->lock);
+			__dl_clear(dl_b, p->dl.dl_bw);
 			__dl_clear_params(p);
+			raw_spin_unlock(&dl_b->lock);
+		}
 
 		return;
 	}
@@ -850,9 +856,14 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 	rq = task_rq_lock(p, &rf);
 
 	if (!dl_task(p) || p->state == TASK_DEAD) {
+		struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+
 		if (p->state == TASK_DEAD && dl_se->dl_non_contending)
 			sub_running_bw(&p->dl, dl_rq_of_se(&p->dl));
 
+		raw_spin_lock(&dl_b->lock);
+		__dl_clear(dl_b, p->dl.dl_bw);
+		raw_spin_unlock(&dl_b->lock);
 		__dl_clear_params(p);
 
 		goto unlock;
@@ -1375,15 +1386,18 @@ static void task_fork_dl(struct task_struct *p)
 
 static void task_dead_dl(struct task_struct *p)
 {
-	struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+	if (!hrtimer_active(&p->dl.inactive_timer)) {
+		struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
 
-	/*
-	 * Since we are TASK_DEAD we won't slip out of the domain!
-	 */
-	raw_spin_lock_irq(&dl_b->lock);
-	/* XXX we should retain the bw until 0-lag */
-	dl_b->total_bw -= p->dl.dl_bw;
-	raw_spin_unlock_irq(&dl_b->lock);
+		/*
+		 * If the "inactive timer" is not active, the 0-lag time
+		 * has already passed, so we immediately decrease the
+		 * total deadline utilization
+		 */
+		raw_spin_lock_irq(&dl_b->lock);
+		__dl_clear(dl_b, p->dl.dl_bw);
+		raw_spin_unlock_irq(&dl_b->lock);
+	}
 }
 
 static void set_curr_task_dl(struct rq *rq)
-- 
2.7.4


* [RFC v4 4/6] sched/deadline: implement GRUB accounting
  2016-12-30 11:33 [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE Luca Abeni
                   ` (2 preceding siblings ...)
  2016-12-30 11:33 ` [RFC v4 3/6] sched/deadline: fix the update of the total -deadline utilization Luca Abeni
@ 2016-12-30 11:33 ` Luca Abeni
  2016-12-30 11:33 ` [RFC v4 5/6] sched/deadline: do not reclaim the whole CPU bandwidth Luca Abeni
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Luca Abeni @ 2016-12-30 11:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Claudio Scordino,
	Steven Rostedt, Tommaso Cucinotta, Daniel Bristot de Oliveira,
	Luca Abeni

From: Luca Abeni <luca.abeni@unitn.it>

According to the GRUB (Greedy Reclamation of Unused Bandwidth)
reclaiming algorithm, the runtime is not decreased as "dq = -dt",
but as "dq = -Uact dt" (where Uact is the per-runqueue active
utilization).
Hence, this commit modifies the runtime accounting rule in
update_curr_dl() to implement the GRUB rule.
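
For example (illustrative numbers): with Uact = 0.5 on a runqueue, a
reclaiming task that executes for delta = 10ms of wall-clock time is
charged only 0.5 * 10ms = 5ms of its budget, so it can run for up to
dl_runtime / Uact of wall-clock time per period before being
throttled.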

Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
---
 kernel/sched/deadline.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c087c3d..361887b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -764,6 +764,19 @@ int dl_runtime_exceeded(struct sched_dl_entity *dl_se)
 extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
 
 /*
+ * This function implements the GRUB accounting rule:
+ * according to the GRUB reclaiming algorithm, the runtime is
+ * not decreased as "dq = -dt", but as "dq = -Uact dt", where
+ * Uact is the (per-runqueue) active utilization.
+ * Since rq->dl.running_bw contains Uact * 2^20, the result
+ * has to be shifted right by 20.
+ */
+u64 grub_reclaim(u64 delta, struct rq *rq)
+{
+	return (delta * rq->dl.running_bw) >> 20;
+}
+
+/*
  * Update the current task's runtime statistics (provided it is still
  * a -deadline task and has not been removed from the dl_rq).
  */
@@ -805,6 +818,7 @@ static void update_curr_dl(struct rq *rq)
 
 	sched_rt_avg_update(rq, delta_exec);
 
+	delta_exec = grub_reclaim(delta_exec, rq);
 	dl_se->runtime -= delta_exec;
 
 throttle:
-- 
2.7.4


* [RFC v4 5/6] sched/deadline: do not reclaim the whole CPU bandwidth
  2016-12-30 11:33 [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE Luca Abeni
                   ` (3 preceding siblings ...)
  2016-12-30 11:33 ` [RFC v4 4/6] sched/deadline: implement GRUB accounting Luca Abeni
@ 2016-12-30 11:33 ` Luca Abeni
  2016-12-30 11:33 ` [RFC v4 6/6] sched/deadline: make GRUB a task's flag Luca Abeni
  2017-01-03 18:58 ` [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE Daniel Bristot de Oliveira
  6 siblings, 0 replies; 20+ messages in thread
From: Luca Abeni @ 2016-12-30 11:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Claudio Scordino,
	Steven Rostedt, Tommaso Cucinotta, Daniel Bristot de Oliveira,
	Luca Abeni

From: Luca Abeni <luca.abeni@unitn.it>

The original GRUB algorithm tends to reclaim 100% of the CPU time,
which allows a CPU hog to starve non-deadline tasks.
To address this issue, allow the scheduler to reclaim only a
specified fraction of the CPU time.
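
For example, with the default values of the rt throttling knobs
(sched_rt_runtime_us = 950000, sched_rt_period_us = 1000000), the
unreclaimable fraction is non_deadline_bw = (1 << 20) -
to_ratio(1000000, 950000), i.e. 5% of each CPU, so GRUB can reclaim
at most 95%.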

Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
---
 kernel/sched/core.c     | 4 ++++
 kernel/sched/deadline.c | 7 ++++++-
 kernel/sched/sched.h    | 6 ++++++
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5030b3c..4010af7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8286,6 +8286,10 @@ static void sched_dl_do_global(void)
 		raw_spin_unlock_irqrestore(&dl_b->lock, flags);
 
 		rcu_read_unlock_sched();
+		if (dl_b->bw == -1)
+			cpu_rq(cpu)->dl.non_deadline_bw = 0;
+		else
+			cpu_rq(cpu)->dl.non_deadline_bw = (1 << 20) - new_bw;
 	}
 }
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 361887b..7585dfb 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -151,6 +151,11 @@ void init_dl_rq(struct dl_rq *dl_rq)
 #else
 	init_dl_bw(&dl_rq->dl_bw);
 #endif
+	if (global_rt_runtime() == RUNTIME_INF)
+		dl_rq->non_deadline_bw = 0;
+	else
+		dl_rq->non_deadline_bw = (1 << 20) -
+			to_ratio(global_rt_period(), global_rt_runtime());
 }
 
 #ifdef CONFIG_SMP
@@ -773,7 +778,7 @@ extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
  */
 u64 grub_reclaim(u64 delta, struct rq *rq)
 {
-	return (delta * rq->dl.running_bw) >> 20;
+	return (delta * (rq->dl.non_deadline_bw + rq->dl.running_bw)) >> 20;
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e422803..ef4bdaa 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -542,6 +542,12 @@ struct dl_rq {
 	 * task blocks
 	 */
 	u64 running_bw;
+
+	/*
+	 * Fraction of the CPU utilization that cannot be reclaimed
+	 * by the GRUB algorithm.
+	 */
+	u64 non_deadline_bw;
 };
 
 #ifdef CONFIG_SMP
-- 
2.7.4


* [RFC v4 6/6] sched/deadline: make GRUB a task's flag
  2016-12-30 11:33 [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE Luca Abeni
                   ` (4 preceding siblings ...)
  2016-12-30 11:33 ` [RFC v4 5/6] sched/deadline: do not reclaim the whole CPU bandwidth Luca Abeni
@ 2016-12-30 11:33 ` Luca Abeni
  2017-01-03 18:58 ` [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE Daniel Bristot de Oliveira
  6 siblings, 0 replies; 20+ messages in thread
From: Luca Abeni @ 2016-12-30 11:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Claudio Scordino,
	Steven Rostedt, Tommaso Cucinotta, Daniel Bristot de Oliveira,
	Luca Abeni

From: Luca Abeni <luca.abeni@unitn.it>

Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
---
 include/uapi/linux/sched.h | 1 +
 kernel/sched/core.c        | 3 ++-
 kernel/sched/deadline.c    | 3 ++-
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 5f0fe01..e2a6c7b 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -47,5 +47,6 @@
  * For the sched_{set,get}attr() calls
  */
 #define SCHED_FLAG_RESET_ON_FORK	0x01
+#define SCHED_FLAG_RECLAIM		0x02
 
 #endif /* _UAPI_LINUX_SCHED_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4010af7..af9c882 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4089,7 +4089,8 @@ static int __sched_setscheduler(struct task_struct *p,
 			return -EINVAL;
 	}
 
-	if (attr->sched_flags & ~(SCHED_FLAG_RESET_ON_FORK))
+	if (attr->sched_flags &
+		~(SCHED_FLAG_RESET_ON_FORK | SCHED_FLAG_RECLAIM))
 		return -EINVAL;
 
 	/*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 7585dfb..93ff400 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -823,7 +823,8 @@ static void update_curr_dl(struct rq *rq)
 
 	sched_rt_avg_update(rq, delta_exec);
 
-	delta_exec = grub_reclaim(delta_exec, rq);
+	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
+		delta_exec = grub_reclaim(delta_exec, rq);
 	dl_se->runtime -= delta_exec;
 
 throttle:
-- 
2.7.4


* Re: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE
  2016-12-30 11:33 [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE Luca Abeni
                   ` (5 preceding siblings ...)
  2016-12-30 11:33 ` [RFC v4 6/6] sched/deadline: make GRUB a task's flag Luca Abeni
@ 2017-01-03 18:58 ` Daniel Bristot de Oliveira
  2017-01-03 21:33   ` luca abeni
  2017-01-04 12:17   ` luca abeni
  6 siblings, 2 replies; 20+ messages in thread
From: Daniel Bristot de Oliveira @ 2017-01-03 18:58 UTC (permalink / raw)
  To: Luca Abeni, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Claudio Scordino,
	Steven Rostedt, Tommaso Cucinotta

On 12/30/2016 12:33 PM, Luca Abeni wrote:
> From: Luca Abeni <luca.abeni@unitn.it>
> 
> Hi all,
> 
> here is a new version of the patchset implementing CPU reclaiming
> (using the GRUB algorithm[1]) for SCHED_DEADLINE.
> Basically, this feature allows SCHED_DEADLINE tasks to consume more
> than their reserved runtime, up to a maximum fraction of the CPU time
> (so that other tasks are left some spare CPU time to execute), if this
> does not break the guarantees of other SCHED_DEADLINE tasks.
> The patchset applies on top of tip/master.
> 
> 
> The implemented CPU reclaiming algorithm is based on tracking the
> utilization U_act of active tasks (first 2 patches), and modifying the
> runtime accounting rule (see patch 0004). The original GRUB algorithm is
> modified as described in [2] to support multiple CPUs (the original
> algorithm only considered one single CPU, this one tracks U_act per
> runqueue) and to leave an "unreclaimable" fraction of CPU time to non
> SCHED_DEADLINE tasks (see patch 0005: the original algorithm can consume
> 100% of the CPU time, starving all the other tasks).
> Patch 0003 uses the newly introduced "inactive timer" (introduced in
> patch 0002) to fix dl_overflow() and __setparam_dl().
> Patch 0006 allows to enable CPU reclaiming only on selected tasks.

Hi,

Today I ran some tests on this patch set. Unfortunately, it seems
that there is a problem :-(.

In a four-core box, if I dispatch 11 tasks [1] with this setup:

  period = 30 ms
  runtime = 10 ms
  flags = 0 (GRUB disabled)

I see this:
------------------------------- HTOP ------------------------------------
  1  [|||||||||||||||||||||92.5%]   Tasks: 128, 259 thr; 14 running
  2  [|||||||||||||||||||||91.0%]   Load average: 4.65 4.66 4.81 
  3  [|||||||||||||||||||||92.5%]   Uptime: 05:12:43
  4  [|||||||||||||||||||||92.5%]
  Mem[|||||||||||||||1.13G/3.78G]
  Swp[                  0K/3.90G]

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
16247 root      -101   0  4204   632   564 R 32.4  0.0  2:10.35 d
16249 root	-101   0  4204   624   556 R 32.4  0.0  2:09.80 d
16250 root	-101   0  4204   728   660 R 32.4  0.0  2:09.58 d
16252 root	-101   0  4204   676   608 R 32.4  0.0  2:09.08 d
16253 root	-101   0  4204   636   568 R 32.4  0.0  2:08.85 d
16254 root      -101   0  4204   732   664 R 32.4  0.0  2:08.62 d
16255 root	-101   0  4204   620   556 R 32.4  0.0  2:08.40 d
16257 root	-101   0  4204   708   640 R 32.4  0.0  2:07.98 d
16256 root	-101   0  4204   624   560 R 32.4  0.0  2:08.18 d
16248 root	-101   0  4204   680   612 R 33.0  0.0  2:10.15 d
16251 root	-101   0  4204   676   608 R 33.0  0.0  2:09.34 d
16259 root       20   0  124M  4692  3120 R  1.1  0.1  0:02.82 htop
 2191 bristot    20   0  649M 41312 32048 S  0.0  1.0  0:28.77 gnome-ter
------------------------------- HTOP ------------------------------------

All tasks are using +- the same amount of CPU time, a little bit more
than 30%, as expected. However, if I enable GRUB in the same task set
I get this:

------------------------------- HTOP ------------------------------------
  1  [|||||||||||||||||||||93.8%]   Tasks: 128, 260 thr; 15 running
  2  [|||||||||||||||||||||95.2%]   Load average: 5.13 5.01 4.98 
  3  [|||||||||||||||||||||93.3%]   Uptime: 05:01:02
  4  [|||||||||||||||||||||96.4%]
  Mem[|||||||||||||||1.13G/3.78G]
  Swp[                  0K/3.90G]

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
14967 root      -101   0  4204   628   564 R 45.8  0.0  1h07:49 g
14962 root	-101   0  4204   728   660 R 45.8  0.0  1h05:06 g
14959 root	-101   0  4204   680   612 R 45.2  0.0  1h07:29 g
14927 root	-101   0  4204   624   556 R 44.6  0.0  1h04:30 g
14928 root	-101   0  4204   656   588 R 31.1  0.0 47:37.21 g
14961 root	-101   0  4204   684   616 R 31.1  0.0 47:19.75 g
14968 root	-101   0  4204   636   568 R 31.1  0.0 46:27.36 g
14960 root	-101   0  4204   684   616 R 23.8  0.0 37:31.06 g
14969 root	-101   0  4204   684   616 R 23.8  0.0 38:11.50 g
14925 root	-101   0  4204   636   568 R 23.8  0.0 37:34.88 g
14926 root	-101   0  4204   684   616 R 23.8  0.0 38:27.37 g
16182 root	 20   0  124M  3972  3212 R  0.6  0.1  0:00.23 htop
  862 root       20   0  264M  5668  4832 S  0.6  0.1  0:03.30 iio-sensor
 2191 bristot    20   0  649M 41312 32048 S  0.0  1.0  0:27.62 gnome-term
  588 root       20   0  257M  121M  120M S  0.0  3.1  0:13.53 systemd-jo
------------------------------- HTOP ------------------------------------

Some tasks start to use more CPU time, while others seem to use less
CPU than was reserved for them. See task 14926: it is using only
23.8% of the CPU, which is less than its 10/30 reservation.

I traced this task's activations and noticed this:

         swapper     0 [003] 14968.332244: sched:sched_switch: swapper/3:0 [120] R ==> g:14926 [-1]
               g 14926 [003] 14968.339294: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
runtime: 7050 us (14968.339294 - 14968.332244)

period:  29997 us (14968.362241 - 14968.332244)
         swapper     0 [003] 14968.362241: sched:sched_switch: swapper/3:0 [120] R ==> g:14926 [-1]
               g 14926 [003] 14968.369294: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
runtime: 7053 us (14968.369294 - 14968.362241)

period: 29994 us (14968.392235 - 14968.362241)
         swapper     0 [003] 14968.392235: sched:sched_switch: swapper/3:0 [120] R ==> g:14926 [-1]
               g 14926 [003] 14968.399301: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
runtime: 7066 us (14968.399301 - 14968.392235)

period:  30008 us (14968.422243 - 14968.392235)
         swapper     0 [003] 14968.422243: sched:sched_switch: swapper/3:0 [120] R ==> g:14926 [-1]
               g 14926 [003] 14968.429294: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
runtime: 7051 us (14968.429294 - 14968.422243)

period:  29995 us (14968.452238 - 14968.422243)
         swapper     0 [003] 14968.452238: sched:sched_switch: swapper/3:0 [120] R ==> g:14926 [-1]
               g 14926 [003] 14968.459293: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
runtime: 7055 us (14968.459293 - 14968.452238)

period:  30055 us (14968.482293 - 14968.452238)
               g 14925 [003] 14968.482293: sched:sched_switch: g:14925 [-1] R ==> g:14926 [-1]
               g 14926 [003] 14968.490293: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
runtime: 8000 us (14968.490293 - 14968.482293)

The task is using less CPU time than was reserved/guaranteed for it.

After some debugging, it seems that in this case GRUB is also
_reducing_ the runtime of the task, by making the accounted runtime
greater than the actually consumed runtime.

You can see this with this code snippet:

------------------- %<-------------------
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 93ff400..1abb594 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -823,9 +823,21 @@ static void update_curr_dl(struct rq *rq)
 
 	sched_rt_avg_update(rq, delta_exec);
 
-	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
-		delta_exec = grub_reclaim(delta_exec, rq);
-	dl_se->runtime -= delta_exec;
+	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
+		u64 new_delta_exec;
+		new_delta_exec = grub_reclaim(delta_exec, rq);
+		if (new_delta_exec > delta_exec)
+			trace_printk("new delta exec (%llu) is greater than delta exec (%llu) by %llu\n",
+					new_delta_exec,
+					delta_exec,
+					(new_delta_exec - delta_exec));
+		dl_se->runtime -= new_delta_exec;
+	}
+	else {
+		dl_se->runtime -= delta_exec;
+	}
+
+
 
 throttle:
 	if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
--------------------------- >% -------------

It seems to be related to "sched/deadline: do not reclaim the whole
CPU bandwidth", because the trace_printk() message I added starts to
appear when we start to touch this limit, and the value of
(new_delta_exec - delta_exec) seems to be somehow limited by the
non_deadline_bw.

Output with sysctl -w kernel.sched_rt_runtime_us=950000
               g-1984  [001] d.h1  1108.783349: update_curr_dl: new delta exec (1050043) is greater than delta exec (1000042) by 50001
               g-1983  [002] d.h1  1108.783349: update_curr_dl: new delta exec (1049974) is greater than delta exec (999976) by 49998
               g-1981  [003] d.h1  1108.783350: update_curr_dl: new delta exec (1050054) is greater than delta exec (1000053) by 50001

Output with sysctl -w kernel.sched_rt_runtime_us=900000
               g-1748  [001] d.h1   418.879815: update_curr_dl: new delta exec (1099995) is greater than delta exec (999996) by 99999
               g-1749  [002] d.h1   418.880815: update_curr_dl: new delta exec (1099986) is greater than delta exec (999988) by 99998
               g-1748  [001] d.h1   418.880815: update_curr_dl: new delta exec (1099962) is greater than delta exec (999966) by 99996
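
(If I am reading patches 0004 and 0005 correctly, these numbers are
consistent: with three 1/3-utilization tasks on a runqueue,
running_bw is 1 << 20, and sched_rt_runtime_us=950000 gives
non_deadline_bw of about 0.05 * (1 << 20), so delta_exec is scaled by
1.0 + 0.05 = 1.05 -- the ~5% excess above; with 900000 the factor is
1.10, matching the ~10% excess.)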

With fewer tasks, this error appears only at the dispatch of a new
task, and stabilizes after a few ms. But it does not stabilize when
we are closer to the limit of the rt runtime.

That is all I could find today. Am I missing something?

[1] http://bristot.me/lkml/d.c

-- Daniel


* Re: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE
  2017-01-03 18:58 ` [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE Daniel Bristot de Oliveira
@ 2017-01-03 21:33   ` luca abeni
  2017-01-04 12:17   ` luca abeni
  1 sibling, 0 replies; 20+ messages in thread
From: luca abeni @ 2017-01-03 21:33 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Claudio Scordino, Steven Rostedt, Tommaso Cucinotta

Hi Daniel,
(sorry for the previous html email; I replied from my phone and I did
not realise how the email client was configured)

On Tue, 3 Jan 2017 19:58:38 +0100
Daniel Bristot de Oliveira <bristot@redhat.com> wrote:

[...]
> > The implemented CPU reclaiming algorithm is based on tracking the
> > utilization U_act of active tasks (first 2 patches), and modifying
> > the runtime accounting rule (see patch 0004). The original GRUB
> > algorithm is modified as described in [2] to support multiple CPUs
> > (the original algorithm only considered one single CPU, this one
> > tracks U_act per runqueue) and to leave an "unreclaimable" fraction
> > of CPU time to non SCHED_DEADLINE tasks (see patch 0005: the
> > original algorithm can consume 100% of the CPU time, starving all
> > the other tasks). Patch 0003 uses the newly introduced "inactive
> > timer" (introduced in patch 0002) to fix dl_overflow() and
> > __setparam_dl(). Patch 0006 allows to enable CPU reclaiming only on
> > selected tasks.  
> 
> Hi,
> 
> Today I did some tests in this patch set. Unfortunately, it seems that
> there is a problem :-(.
[...]
I reproduced this issue; thanks for the report. It seems to be due to
the fact that there are more reclaiming tasks than CPU cores and the
load is very high (near the utilisation limit).

I am investigating it, and will hopefully post an update in the next
few days.



			Thanks,
				Luca




* Re: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE
  2017-01-03 18:58 ` [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE Daniel Bristot de Oliveira
  2017-01-03 21:33   ` luca abeni
@ 2017-01-04 12:17   ` luca abeni
  2017-01-04 15:14     ` Daniel Bristot de Oliveira
  1 sibling, 1 reply; 20+ messages in thread
From: luca abeni @ 2017-01-04 12:17 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Claudio Scordino, Steven Rostedt, Tommaso Cucinotta

Hi Daniel,

On Tue, 3 Jan 2017 19:58:38 +0100
Daniel Bristot de Oliveira <bristot@redhat.com> wrote:

[...]
> In a four-core box, if I dispatch 11 tasks [1] with this setup:
> 
>   period = 30 ms
>   runtime = 10 ms
>   flags = 0 (GRUB disabled)
> 
> I see this:
> ------------------------------- HTOP ------------------------------------
>   1  [|||||||||||||||||||||92.5%]   Tasks: 128, 259 thr; 14 running
>   2  [|||||||||||||||||||||91.0%]   Load average: 4.65 4.66 4.81
>   3  [|||||||||||||||||||||92.5%]   Uptime: 05:12:43
>   4  [|||||||||||||||||||||92.5%]
>   Mem[|||||||||||||||1.13G/3.78G]
>   Swp[                  0K/3.90G]
> 
>   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
> 16247 root      -101   0  4204   632   564 R 32.4  0.0  2:10.35 d
> 16249 root	-101   0  4204   624   556 R 32.4  0.0  2:09.80 d
> 16250 root	-101   0  4204   728   660 R 32.4  0.0  2:09.58 d
> 16252 root	-101   0  4204   676   608 R 32.4  0.0  2:09.08 d
> 16253 root	-101   0  4204   636   568 R 32.4  0.0  2:08.85 d
> 16254 root      -101   0  4204   732   664 R 32.4  0.0  2:08.62 d
> 16255 root	-101   0  4204   620   556 R 32.4  0.0  2:08.40 d
> 16257 root	-101   0  4204   708   640 R 32.4  0.0  2:07.98 d
> 16256 root	-101   0  4204   624   560 R 32.4  0.0  2:08.18 d
> 16248 root	-101   0  4204   680   612 R 33.0  0.0  2:10.15 d
> 16251 root	-101   0  4204   676   608 R 33.0  0.0  2:09.34 d
> 16259 root       20   0  124M  4692  3120 R  1.1  0.1  0:02.82 htop
>  2191 bristot    20   0  649M 41312 32048 S  0.0  1.0  0:28.77 gnome-ter
> ------------------------------- HTOP ------------------------------------
> 
> All tasks are using +- the same amount of CPU time, a little bit more
> than 30%, as expected.

Notice that, if I understand correctly, each task should receive
33.33% (1/3) of the CPU time. Anyway, I think this is ok...

> However, if I enable GRUB in the same task set I get this:
> 
> ------------------------------- HTOP ------------------------------------
>   1  [|||||||||||||||||||||93.8%]   Tasks: 128, 260 thr; 15 running
>   2  [|||||||||||||||||||||95.2%]   Load average: 5.13 5.01 4.98
>   3  [|||||||||||||||||||||93.3%]   Uptime: 05:01:02
>   4  [|||||||||||||||||||||96.4%]
>   Mem[|||||||||||||||1.13G/3.78G]
>   Swp[                  0K/3.90G]
> 
>   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
> 14967 root      -101   0  4204   628   564 R 45.8  0.0  1h07:49 g
> 14962 root	-101   0  4204   728   660 R 45.8  0.0  1h05:06 g
> 14959 root	-101   0  4204   680   612 R 45.2  0.0  1h07:29 g
> 14927 root	-101   0  4204   624   556 R 44.6  0.0  1h04:30 g
> 14928 root	-101   0  4204   656   588 R 31.1  0.0 47:37.21 g
> 14961 root	-101   0  4204   684   616 R 31.1  0.0 47:19.75 g
> 14968 root	-101   0  4204   636   568 R 31.1  0.0 46:27.36 g
> 14960 root	-101   0  4204   684   616 R 23.8  0.0 37:31.06 g
> 14969 root	-101   0  4204   684   616 R 23.8  0.0 38:11.50 g
> 14925 root	-101   0  4204   636   568 R 23.8  0.0 37:34.88 g
> 14926 root	-101   0  4204   684   616 R 23.8  0.0 38:27.37 g
> 16182 root	 20   0  124M  3972  3212 R  0.6  0.1  0:00.23 htop
>   862 root       20   0  264M  5668  4832 S  0.6  0.1  0:03.30
> iio-sensor 2191 bristot    20   0  649M 41312 32048 S  0.0  1.0
> 0:27.62 gnome-term 588 root       20   0  257M  121M  120M S  0.0
> 3.1  0:13.53 systemd-jo ------------------------------- HTOP
> ------------------------------------
> 
> Some tasks start to use more CPU time, while others seem to use less
> CPU than was reserved for them. See task 14926: it is using
> only 23.8% of the CPU, which is less than its 10/30 reservation.

What happened here is that some runqueues have an active utilisation
larger than 0.95. So, GRUB is decreasing the amount of time received by
the tasks on those runqueues, so that they consume less than 95% of the
CPU... This is the reason for the effect you noticed below:


> After some debugging, it seems that in this case GRUB is also
> _reducing_ the runtime of the task by making the notion of consumed
> runtime be greater than the actual consumed runtime.
[...]

Now, this is "kind of expected", because you have 11 tasks, each one
having utilisation 1/3, distributed on 4 CPUs... So, some CPU will have
3 tasks on it, resulting in a utilisation = 1 > 0.95. But this should
not result in what you have seen in htop...
The real issue seems to be that at some point some runqueues have an
active utilisation = 1.33 (4 dl tasks in the runqueue), with other
runqueues only having 2 tasks... And this results in the huge imbalance
in utilisations you noticed. I am trying to understand why this
happens... It seems to me that a "pull_dl_task()" might end up pulling
more than 1 task... Is this possible?
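
(For reference, the accounting rule being discussed - patch 0004, the
one exercised by the snippet quoted below - boils down to something
like this sketch, assuming running_bw stores U_act in fixed point with
20 fractional bits; patch 0005 additionally folds the non-reclaimable
bandwidth into the scaling:)

static u64 grub_reclaim(u64 delta, struct rq *rq)
{
	/*
	 * Charge wall-clock time scaled by the runqueue's active
	 * utilization: dq = -U_act * dt. With U_act close to 1 almost
	 * the whole delta is charged, so nothing is reclaimed; with a
	 * small U_act the charge shrinks and the reservation stretches.
	 */
	return (delta * rq->dl.running_bw) >> 20;
}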


			Luca

> 
> You can see this with this code snip:
> 
> ------------------- %<-------------------
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 93ff400..1abb594 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -823,9 +823,21 @@ static void update_curr_dl(struct rq *rq)
>  
>  	sched_rt_avg_update(rq, delta_exec);
>  
> -	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
> -		delta_exec = grub_reclaim(delta_exec, rq);
> -	dl_se->runtime -= delta_exec;
> +	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
> +		u64 new_delta_exec;
> +		new_delta_exec = grub_reclaim(delta_exec, rq);
> +		if (new_delta_exec > delta_exec)
> +			trace_printk("new delta exec (%llu) is greater than delta exec (%llu) by %llu\n",
> +					new_delta_exec,
> +					delta_exec,
> +					(new_delta_exec - delta_exec));
> +		dl_se->runtime -= new_delta_exec;
> +	}
> +	else {
> +		dl_se->runtime -= delta_exec;
> +	}
> +
> +
>  
>  throttle:
>  	if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
> --------------------------- >% -------------  
> 
> It seems to be related to the "sched/deadline: do not reclaim the
> whole CPU bandwidth", because the trace_printk message I put starts
> to appear when we start to touch this limit, and the (new_delta_exec
> - delta_exec) seems to be somehow limited to the non_deadline_bw.
> 
> Output with sysctl -w kernel.sched_rt_runtime_us=950000
>   g-1984  [001] d.h1  1108.783349: update_curr_dl: new delta exec (1050043) is greater than delta exec (1000042) by 50001
>   g-1983  [002] d.h1  1108.783349: update_curr_dl: new delta exec (1049974) is greater than delta exec (999976) by 49998
>   g-1981  [003] d.h1  1108.783350: update_curr_dl: new delta exec (1050054) is greater than delta exec (1000053) by 50001
> 
> Output with sysctl -w kernel.sched_rt_runtime_us=900000
>   g-1748  [001] d.h1   418.879815: update_curr_dl: new delta exec (1099995) is greater than delta exec (999996) by 99999
>   g-1749  [002] d.h1   418.880815: update_curr_dl: new delta exec (1099986) is greater than delta exec (999988) by 99998
>   g-1748  [001] d.h1   418.880815: update_curr_dl: new delta exec (1099962) is greater than delta exec (999966) by 99996
> 
> In the case of fewer tasks, this error appears just in the
> dispatch of a new task, stabilizing after some ms. But it
> does not stabilize when we are closer to the limit of the rt
> runtime.
> 
> That is all I could find today. Am I missing something?
> 
> [1] http://bristot.me/lkml/d.c
> 
> -- Daniel

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE
  2017-01-04 12:17   ` luca abeni
@ 2017-01-04 15:14     ` Daniel Bristot de Oliveira
  2017-01-04 16:42       ` Luca Abeni
  0 siblings, 1 reply; 20+ messages in thread
From: Daniel Bristot de Oliveira @ 2017-01-04 15:14 UTC (permalink / raw)
  To: luca abeni
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Claudio Scordino, Steven Rostedt, Tommaso Cucinotta

On 01/04/2017 01:17 PM, luca abeni wrote:
> Hi Daniel,
> 
> On Tue, 3 Jan 2017 19:58:38 +0100
> Daniel Bristot de Oliveira <bristot@redhat.com> wrote:
> 
> [...]
>> In a four core box, if I dispatch 11 tasks [1] with setup:
>>
>>   period = 30 ms
>>   runtime = 10 ms
>>   flags = 0 (GRUB disabled)
>>
>> I see this:
>> ------------------------------- HTOP ------------------------------------
>>  1 [|||||||||||||||||||||92.5%]   Tasks: 128, 259 thr; 14 running
>>  2 [|||||||||||||||||||||91.0%]   Load average: 4.65 4.66 4.81
>>  3 [|||||||||||||||||||||92.5%]   Uptime: 05:12:43
>>  4 [|||||||||||||||||||||92.5%]   Mem[|||||||||||||||1.13G/3.78G]
>>    Swp[                  0K/3.90G]
>>
>>   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
>> 16247 root      -101   0  4204   632   564 R 32.4  0.0  2:10.35 d
>> 16249 root      -101   0  4204   624   556 R 32.4  0.0  2:09.80 d
>> 16250 root      -101   0  4204   728   660 R 32.4  0.0  2:09.58 d
>> 16252 root      -101   0  4204   676   608 R 32.4  0.0  2:09.08 d
>> 16253 root      -101   0  4204   636   568 R 32.4  0.0  2:08.85 d
>> 16254 root      -101   0  4204   732   664 R 32.4  0.0  2:08.62 d
>> 16255 root      -101   0  4204   620   556 R 32.4  0.0  2:08.40 d
>> 16257 root      -101   0  4204   708   640 R 32.4  0.0  2:07.98 d
>> 16256 root      -101   0  4204   624   560 R 32.4  0.0  2:08.18 d
>> 16248 root      -101   0  4204   680   612 R 33.0  0.0  2:10.15 d
>> 16251 root      -101   0  4204   676   608 R 33.0  0.0  2:09.34 d
>> 16259 root       20   0  124M  4692  3120 R  1.1  0.1  0:02.82 htop
>>  2191 bristot    20   0  649M 41312 32048 S  0.0  1.0  0:28.77 gnome-ter
>> ------------------------------- HTOP ------------------------------------
>>
>> All tasks are using +- the same amount of CPU time, a little bit more
>> than 30%, as expected.
> 
> Notice that, if I understand well, each task should receive 33.33% (1/3)
> of CPU time. Anyway, I think this is ok...

If we think of a partitioned system, yes, for the CPUs on which 3 'd'
tasks are able to run. But as sched deadline is global by definition,
the load is:

SUM(U_i)  / M processors.

1/3 * 11  / 4            = 0.916666667

So 10/30 (1/3) of this workload is:
91.6 / 3 = 30.533333333

Well, the rest is probably overheads, like scheduling, migration...

>> However, if I enable GRUB in the same task set I get this:
>>
>> ------------------------------- HTOP ------------------------------------
>>  1 [|||||||||||||||||||||93.8%]   Tasks: 128, 260 thr; 15 running
>>  2 [|||||||||||||||||||||95.2%]   Load average: 5.13 5.01 4.98
>>  3 [|||||||||||||||||||||93.3%]   Uptime: 05:01:02
>>  4 [|||||||||||||||||||||96.4%]   Mem[|||||||||||||||1.13G/3.78G]
>>    Swp[                  0K/3.90G]
>>
>>   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
>> 14967 root      -101   0  4204   628   564 R 45.8  0.0  1h07:49 g
>> 14962 root      -101   0  4204   728   660 R 45.8  0.0  1h05:06 g
>> 14959 root      -101   0  4204   680   612 R 45.2  0.0  1h07:29 g
>> 14927 root      -101   0  4204   624   556 R 44.6  0.0  1h04:30 g
>> 14928 root      -101   0  4204   656   588 R 31.1  0.0 47:37.21 g
>> 14961 root      -101   0  4204   684   616 R 31.1  0.0 47:19.75 g
>> 14968 root      -101   0  4204   636   568 R 31.1  0.0 46:27.36 g
>> 14960 root      -101   0  4204   684   616 R 23.8  0.0 37:31.06 g
>> 14969 root      -101   0  4204   684   616 R 23.8  0.0 38:11.50 g
>> 14925 root      -101   0  4204   636   568 R 23.8  0.0 37:34.88 g
>> 14926 root      -101   0  4204   684   616 R 23.8  0.0 38:27.37 g
>> 16182 root       20   0  124M  3972  3212 R  0.6  0.1  0:00.23 htop
>>   862 root       20   0  264M  5668  4832 S  0.6  0.1  0:03.30 iio-sensor
>>  2191 bristot    20   0  649M 41312 32048 S  0.0  1.0  0:27.62 gnome-term
>>   588 root       20   0  257M  121M  120M S  0.0  3.1  0:13.53 systemd-jo
>> ------------------------------- HTOP ------------------------------------
>>
>> Some tasks start to use more CPU time, while others seems to use less
>> CPU than it was reserved for them. See the task 14926, it is using
>> only 23.8 % of the CPU, which is less than its 10/30 reservation.
> 
> What happened here is that some runqueues have an active utilisation
> larger than 0.95. So, GRUB is decreasing the amount of time received by
> the tasks on those runqueues to consume less than 95%... This is the
> reason for the effect you noticed below:

I see. But, AFAIK, Linux's sched deadline measures the load
globally, not locally. So, it is not a problem to have a load > 95%
in a local queue if the global load is < 95%.

Am I missing something?

> 
>> After some debugging, it seems that in this case GRUB is also
>> _reducing_ the runtime of the task by making the notion of consumed
>> runtime be greater than the actual consumed runtime.
> [...]
> 
> Now, this is "kind of expected", because you have 11 tasks each one
> having utilisation 1/3, distributed on 4 CPUs... So, some CPU will have
> 3 tasks on it, resulting in an utilisation = 1 > 0.95. But this should
> not result in what you have seen in htop...

Well, sched deadline aims to schedule the M highest priority tasks,
and migrates tasks to achieve this goal. However, I am not sure if
keeping the runqueues balanced is a goal/restriction/feature of the
deadline scheduler.

Maybe this is the difference between the GRUB and sched deadline
assumptions that is causing the problem. Just thinking aloud.

> The real issue seems to be that at some point some runqueues have an
> active utilisation = 1.33 (4 dl tasks in the runqueue), with other
> runqueues only having 2 tasks... And this results in the huge imbalance
> in utilisations you noticed. I am trying to understand why this
> happens... It seems to me that a "pull_dl_task()" might end up pulling
> more than 1 task... Is this possible?

Yeah, this explains the numbers.

Brainstorm time! (sorry if it sounds obviously unfeasible):
Is it possible to think on GRUB tracking the global utilization?

-- Daniel

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE
  2017-01-04 15:14     ` Daniel Bristot de Oliveira
@ 2017-01-04 16:42       ` Luca Abeni
  2017-01-04 18:00         ` Daniel Bristot de Oliveira
  0 siblings, 1 reply; 20+ messages in thread
From: Luca Abeni @ 2017-01-04 16:42 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Claudio Scordino, Steven Rostedt, Tommaso Cucinotta

Hi Daniel,

2017-01-04 16:14 GMT+01:00, Daniel Bristot de Oliveira <bristot@redhat.com>:
> On 01/04/2017 01:17 PM, luca abeni wrote:
>> Hi Daniel,
>>
>> On Tue, 3 Jan 2017 19:58:38 +0100
>> Daniel Bristot de Oliveira <bristot@redhat.com> wrote:
>>
>> [...]
>>> In a four core box, if I dispatch 11 tasks [1] with setup:
>>>
>>>   period = 30 ms
>>>   runtime = 10 ms
>>>   flags = 0 (GRUB disabled)
>>>
>>> I see this:
>>> ------------------------------- HTOP ------------------------------------
>>>  1 [|||||||||||||||||||||92.5%]   Tasks: 128, 259 thr; 14 running
>>>  2 [|||||||||||||||||||||91.0%]   Load average: 4.65 4.66 4.81
>>>  3 [|||||||||||||||||||||92.5%]   Uptime: 05:12:43
>>>  4 [|||||||||||||||||||||92.5%]   Mem[|||||||||||||||1.13G/3.78G]
>>>    Swp[                  0K/3.90G]
>>>
>>>   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
>>> 16247 root      -101   0  4204   632   564 R 32.4  0.0  2:10.35 d
>>> 16249 root      -101   0  4204   624   556 R 32.4  0.0  2:09.80 d
>>> 16250 root      -101   0  4204   728   660 R 32.4  0.0  2:09.58 d
>>> 16252 root      -101   0  4204   676   608 R 32.4  0.0  2:09.08 d
>>> 16253 root      -101   0  4204   636   568 R 32.4  0.0  2:08.85 d
>>> 16254 root      -101   0  4204   732   664 R 32.4  0.0  2:08.62 d
>>> 16255 root      -101   0  4204   620   556 R 32.4  0.0  2:08.40 d
>>> 16257 root      -101   0  4204   708   640 R 32.4  0.0  2:07.98 d
>>> 16256 root      -101   0  4204   624   560 R 32.4  0.0  2:08.18 d
>>> 16248 root      -101   0  4204   680   612 R 33.0  0.0  2:10.15 d
>>> 16251 root      -101   0  4204   676   608 R 33.0  0.0  2:09.34 d
>>> 16259 root       20   0  124M  4692  3120 R  1.1  0.1  0:02.82 htop
>>>  2191 bristot    20   0  649M 41312 32048 S  0.0  1.0  0:28.77 gnome-ter
>>> ------------------------------- HTOP ------------------------------------
>>>
>>> All tasks are using +- the same amount of CPU time, a little bit more
>>> than 30%, as expected.
>>
>> Notice that, if I understand well, each task should receive 33.33% (1/3)
>> of CPU time. Anyway, I think this is ok...
>
> If we think on a partitioned system, yes for the CPUs in which 3 'd'
> tasks are able to run. But as sched deadline is global by definition,
> the load is:
>
> SUM(U_i)  / M processors.
>
> 1/3 * 11  / 4            = 0.916666667
>
> So 10/30 (1/3) of this workload is:
> 91.6 / 3 = 30.533333333
>
> Well, the rest is probably overheads, like scheduling, migration...

I do not think this math is correct... Yes, the total utilization of
the taskset is 0.91 (or 3.66, depending on how you define the
utilization...), but I still think that the percentage of CPU time
shown by "top" or "htop" should be 33.33 (or 8.33, depending on how
the tool computes it).
runtime=10 and period=30 means "schedule the task for 10ms every
30ms", so the task will consume 33% of the CPU time of a single core.
In other words, 10/30 is a fraction of the CPU time, not a fraction of
the time consumed by SCHED_DEADLINE tasks.
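
(For concreteness, a reservation like the ones in this test can be set
up with something like the sketch below - user-space code, assuming
the sched_setattr(2) syscall; SCHED_FLAG_RECLAIM is the flag
introduced by patch 0006, and the value used for it here is an
assumption:)

#define _GNU_SOURCE
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <sys/syscall.h>

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* in ns */
	uint64_t sched_deadline;	/* in ns */
	uint64_t sched_period;		/* in ns */
};

#define SCHED_DEADLINE		6
#define SCHED_FLAG_RECLAIM	0x02	/* assumed value, from patch 0006 */

/* Ask for 10ms of runtime every 30ms; with "reclaim" set, the task
 * may additionally consume reclaimed bandwidth under GRUB. */
static int set_deadline_10_30(int reclaim)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy   = SCHED_DEADLINE;
	attr.sched_runtime  = 10 * 1000 * 1000;
	attr.sched_deadline = 30 * 1000 * 1000;
	attr.sched_period   = 30 * 1000 * 1000;
	attr.sched_flags    = reclaim ? SCHED_FLAG_RECLAIM : 0;

	return syscall(SYS_sched_setattr, 0, &attr, 0);
}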


>>> However, if I enable GRUB in the same task set I get this:
>>>
>>> ------------------------------- HTOP ------------------------------------
>>>  1 [|||||||||||||||||||||93.8%]   Tasks: 128, 260 thr; 15 running
>>>  2 [|||||||||||||||||||||95.2%]   Load average: 5.13 5.01 4.98
>>>  3 [|||||||||||||||||||||93.3%]   Uptime: 05:01:02
>>>  4 [|||||||||||||||||||||96.4%]   Mem[|||||||||||||||1.13G/3.78G]
>>>    Swp[                  0K/3.90G]
>>>
>>>   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
>>> 14967 root      -101   0  4204   628   564 R 45.8  0.0  1h07:49 g
>>> 14962 root      -101   0  4204   728   660 R 45.8  0.0  1h05:06 g
>>> 14959 root      -101   0  4204   680   612 R 45.2  0.0  1h07:29 g
>>> 14927 root      -101   0  4204   624   556 R 44.6  0.0  1h04:30 g
>>> 14928 root      -101   0  4204   656   588 R 31.1  0.0 47:37.21 g
>>> 14961 root      -101   0  4204   684   616 R 31.1  0.0 47:19.75 g
>>> 14968 root      -101   0  4204   636   568 R 31.1  0.0 46:27.36 g
>>> 14960 root      -101   0  4204   684   616 R 23.8  0.0 37:31.06 g
>>> 14969 root      -101   0  4204   684   616 R 23.8  0.0 38:11.50 g
>>> 14925 root      -101   0  4204   636   568 R 23.8  0.0 37:34.88 g
>>> 14926 root      -101   0  4204   684   616 R 23.8  0.0 38:27.37 g
>>> 16182 root       20   0  124M  3972  3212 R  0.6  0.1  0:00.23 htop
>>>   862 root       20   0  264M  5668  4832 S  0.6  0.1  0:03.30 iio-sensor
>>>  2191 bristot    20   0  649M 41312 32048 S  0.0  1.0  0:27.62 gnome-term
>>>   588 root       20   0  257M  121M  120M S  0.0  3.1  0:13.53 systemd-jo
>>> ------------------------------- HTOP ------------------------------------
>>>
>>> Some tasks start to use more CPU time, while others seems to use less
>>> CPU than it was reserved for them. See the task 14926, it is using
>>> only 23.8 % of the CPU, which is less than its 10/30 reservation.
>>
>> What happened here is that some runqueues have an active utilisation
>> larger than 0.95. So, GRUB is decreasing the amount of time received by
>> the tasks on those runqueues to consume less than 95%... This is the
>> reason for the effect you noticed below:
>
> I see. But, AFAIK, the Linux's sched deadline measures the load
> globally, not locally. So, it is not a problem having a load > than 95%
> in the local queue if the global queue is < 95%.
>
> Am I missing something?

The version of GRUB reclaiming implemented in my patches tracks a
per-runqueue "active utilization", and uses it for reclaiming.

>>> After some debugging, it seems that in this case GRUB is also
>>> _reducing_ the runtime of the task by making the notion of consumed
>>> runtime be greater than the actual consumed runtime.
>> [...]
>>
>> Now, this is "kind of expected", because you have 11 tasks each one
>> having utilisation 1/3, distributed on 4 CPUs... So, some CPU will have
>> 3 tasks on it, resulting in an utilisation = 1 > 0.95. But this should
>> not result in what you have seen in htop...
>
> Well, the sched deadline aims to schedule the M highest priority tasks,
> and migrates tasks to achieve this goal. However, I am not sure if
> having the whole runqueue balance is a goal/restriction/feature of the
> deadline scheduler.
>
> Maybe this is the difference between the GRUB and sched deadline
> assumptions that is causing the problem. Just thinking aloud.

I think I found some strange behaviour in the push/pull mechanisms (at
least it seems strange to me): a "pull" operation might end up pulling
multiple tasks (I see this can simplify the implementation, but I
think pulling multiple tasks is useless and might introduce some
overhead even independently of my patches), and I suspect (but I
still need to verify this) that a "push" operation can push a task to
a "wrong" destination runqueue (I mean, a task is pushed to a runqueue
where it is not the earliest deadline task)...

Without reclaiming, this just results in useless migrations (if I did
not misunderstand something), but with my reclaiming patches this is
probably the source of the strange effect you saw. But I am still
investigating this, so I am not too sure...

>> The real issue seems to be that at some point some runqueues have an
>> active utilisation = 1.33 (4 dl tasks in the runqueue), with other
>> runqueues only having 2 tasks... And this results in the huge imbalance
>> in utilisations you noticed. I am trying to understand why this
>> happens... It seems to me that a "pull_dl_task()" might end up pulling
>> more than 1 task... Is this possible?
>
> Yeah, this explain the numbers.
>
> Brainstorm time! (sorry if it sounds obviously unfeasible):
> Is it possible to think on GRUB tracking the global utilization?

Yes, and I even had a version of my patches using a "per root domain"
global active utilization. If needed I can update my patchset to
implement the global active utilization again.
I switched to per-runqueue active utilization because:
- this can be used for controlling the CPU frequency scaling... And
I've been told that frequency scaling is generally per-core / per-CPU
(but I need to verify this)
- the patches based on global active utilization needed to access this
global utilization in mutual exclusion, so I used a spinlock to
protect it... And I am not sure about scalability issues
- I suspect there were issues when the root domain / exclusive cpuset
is modified.


Thanks,
Luca

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE
  2017-01-04 16:42       ` Luca Abeni
@ 2017-01-04 18:00         ` Daniel Bristot de Oliveira
  2017-01-04 18:30           ` Luca Abeni
  0 siblings, 1 reply; 20+ messages in thread
From: Daniel Bristot de Oliveira @ 2017-01-04 18:00 UTC (permalink / raw)
  To: Luca Abeni
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Claudio Scordino, Steven Rostedt, Tommaso Cucinotta

On 01/04/2017 05:42 PM, Luca Abeni wrote:
> Hi Daniel,
> 
> 2017-01-04 16:14 GMT+01:00, Daniel Bristot de Oliveira <bristot@redhat.com>:
>> On 01/04/2017 01:17 PM, luca abeni wrote:
>>> Hi Daniel,
>>>
>>> On Tue, 3 Jan 2017 19:58:38 +0100
>>> Daniel Bristot de Oliveira <bristot@redhat.com> wrote:
>>>
>>> [...]
>>>> In a four core box, if I dispatch 11 tasks [1] with setup:
>>>>
>>>>   period = 30 ms
>>>>   runtime = 10 ms
>>>>   flags = 0 (GRUB disabled)
>>>>
>>>> I see this:
>>>> ------------------------------- HTOP ------------------------------------
>>>>  1 [|||||||||||||||||||||92.5%]   Tasks: 128, 259 thr; 14 running
>>>>  2 [|||||||||||||||||||||91.0%]   Load average: 4.65 4.66 4.81
>>>>  3 [|||||||||||||||||||||92.5%]   Uptime: 05:12:43
>>>>  4 [|||||||||||||||||||||92.5%]   Mem[|||||||||||||||1.13G/3.78G]
>>>>    Swp[                  0K/3.90G]
>>>>
>>>>   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
>>>> 16247 root      -101   0  4204   632   564 R 32.4  0.0  2:10.35 d
>>>> 16249 root      -101   0  4204   624   556 R 32.4  0.0  2:09.80 d
>>>> 16250 root      -101   0  4204   728   660 R 32.4  0.0  2:09.58 d
>>>> 16252 root      -101   0  4204   676   608 R 32.4  0.0  2:09.08 d
>>>> 16253 root      -101   0  4204   636   568 R 32.4  0.0  2:08.85 d
>>>> 16254 root      -101   0  4204   732   664 R 32.4  0.0  2:08.62 d
>>>> 16255 root      -101   0  4204   620   556 R 32.4  0.0  2:08.40 d
>>>> 16257 root      -101   0  4204   708   640 R 32.4  0.0  2:07.98 d
>>>> 16256 root      -101   0  4204   624   560 R 32.4  0.0  2:08.18 d
>>>> 16248 root      -101   0  4204   680   612 R 33.0  0.0  2:10.15 d
>>>> 16251 root      -101   0  4204   676   608 R 33.0  0.0  2:09.34 d
>>>> 16259 root       20   0  124M  4692  3120 R  1.1  0.1  0:02.82 htop
>>>>  2191 bristot    20   0  649M 41312 32048 S  0.0  1.0  0:28.77 gnome-ter
>>>> ------------------------------- HTOP ------------------------------------
>>>>
>>>> All tasks are using +- the same amount of CPU time, a little bit more
>>>> than 30%, as expected.
>>>
>>> Notice that, if I understand well, each task should receive 33.33% (1/3)
>>> of CPU time. Anyway, I think this is ok...
>>
>> If we think on a partitioned system, yes for the CPUs in which 3 'd'
>> tasks are able to run. But as sched deadline is global by definition,
>> the load is:
>>
>> SUM(U_i)  / M processors.
>>
>> 1/3 * 11  / 4            = 0.916666667
>>
>> So 10/30 (1/3) of this workload is:
>> 91.6 / 3 = 30.533333333
>>
>> Well, the rest is probably overheads, like scheduling, migration...
> 
> I do not think this math is correct... Yes, the total utilization of
> the taskset is 0.91 (or 3.66, depending on how you define the
> utilization...), but I still think that the percentage of CPU time
> shown by "top" or "htop" should be 33.33 (or 8.33, depending on how
> the tool computes it).
> runtime=10 and period=30 means "schedule the task for 10ms every
> 30ms", so the task will consume 33% of the CPU time of a single core.
> In other words, 10/30 is a fraction of the CPU time, not a fraction of
> the time consumed by SCHED_DEADLINE tasks.

Ack! You are correct, I was so focused on the global utilization that I
ended up missing this point. For top/htop it should be 33.3%.

> 
>>>> However, if I enable GRUB in the same task set I get this:
>>>>
>>>> ------------------------------- HTOP ------------------------------------
>>>>  1 [|||||||||||||||||||||93.8%]   Tasks: 128, 260 thr; 15 running
>>>>  2 [|||||||||||||||||||||95.2%]   Load average: 5.13 5.01 4.98
>>>>  3 [|||||||||||||||||||||93.3%]   Uptime: 05:01:02
>>>>  4 [|||||||||||||||||||||96.4%]   Mem[|||||||||||||||1.13G/3.78G]
>>>>    Swp[                  0K/3.90G]
>>>>
>>>>   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
>>>> 14967 root      -101   0  4204   628   564 R 45.8  0.0  1h07:49 g
>>>> 14962 root      -101   0  4204   728   660 R 45.8  0.0  1h05:06 g
>>>> 14959 root      -101   0  4204   680   612 R 45.2  0.0  1h07:29 g
>>>> 14927 root      -101   0  4204   624   556 R 44.6  0.0  1h04:30 g
>>>> 14928 root      -101   0  4204   656   588 R 31.1  0.0 47:37.21 g
>>>> 14961 root      -101   0  4204   684   616 R 31.1  0.0 47:19.75 g
>>>> 14968 root      -101   0  4204   636   568 R 31.1  0.0 46:27.36 g
>>>> 14960 root      -101   0  4204   684   616 R 23.8  0.0 37:31.06 g
>>>> 14969 root      -101   0  4204   684   616 R 23.8  0.0 38:11.50 g
>>>> 14925 root      -101   0  4204   636   568 R 23.8  0.0 37:34.88 g
>>>> 14926 root      -101   0  4204   684   616 R 23.8  0.0 38:27.37 g
>>>> 16182 root       20   0  124M  3972  3212 R  0.6  0.1  0:00.23 htop
>>>>   862 root       20   0  264M  5668  4832 S  0.6  0.1  0:03.30 iio-sensor
>>>>  2191 bristot    20   0  649M 41312 32048 S  0.0  1.0  0:27.62 gnome-term
>>>>   588 root       20   0  257M  121M  120M S  0.0  3.1  0:13.53 systemd-jo
>>>> ------------------------------- HTOP ------------------------------------
>>>>
>>>> Some tasks start to use more CPU time, while others seems to use less
>>>> CPU than it was reserved for them. See the task 14926, it is using
>>>> only 23.8 % of the CPU, which is less than its 10/30 reservation.
>>>
>>> What happened here is that some runqueues have an active utilisation
>>> larger than 0.95. So, GRUB is decreasing the amount of time received by
>>> the tasks on those runqueues to consume less than 95%... This is the
>>> reason for the effect you noticed below:
>>
>> I see. But, AFAIK, the Linux's sched deadline measures the load
>> globally, not locally. So, it is not a problem having a load > than 95%
>> in the local queue if the global queue is < 95%.
>>
>> Am I missing something?
> 
> The version of GRUB reclaiming implemented in my patches tracks a
> per-runqueue "active utilization", and uses it for reclaiming.

I _think_ that this might be (one of) the source(s) of the problem...

Just exercising...

For example, with my taskset, with a hypothetical perfect balance of the
whole runqueue, one possible scenario is:

   CPU    0    1     2     3
# TASKS   3    3     3     2

In this case, CPUs 0, 1 and 2 are at 100% of local utilization. Thus,
the current tasks on these CPUs will have their runtime decreased by
GRUB. Meanwhile, the lucky tasks on CPU 3 would use additional time
that they "globally" do not have - because the system, globally, has a
load higher than the 66.6...% seen by that local runqueue. Actually,
part of the time taken away from the tasks on CPUs [0-2] is being used
by the tasks on CPU 3, until the next migration of any task, which will
change which tasks are the lucky ones... but without any guarantee that
all tasks will be the lucky ones on every activation, causing the
problem.

Does it make sense?

If it does, this leads me to think that only by tracking the
utilization globally will we achieve the correct result... but I may be
missing something... :-).
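
(To make the arithmetic concrete, a tiny user-space sketch of this
3/3/3/2 scenario - purely illustrative, not kernel code:)

#include <stdio.h>

int main(void)
{
	/* 11 tasks with runtime/period = 10/30, spread 3/3/3/2 */
	int tasks_per_cpu[4] = { 3, 3, 3, 2 };
	double u_task = 10.0 / 30.0;

	for (int cpu = 0; cpu < 4; cpu++) {
		double u_act = tasks_per_cpu[cpu] * u_task;
		printf("CPU %d: Uact = %.3f%s\n", cpu, u_act,
		       u_act > 0.95 ? "  (locally throttled: > 0.95)" : "");
	}
	/* Globally, (11/3) / 4 = 0.917 < 0.95: there is spare
	 * bandwidth, but CPUs 0-2 cannot see CPU 3's slack. */
	return 0;
}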

-- Daniel

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE
  2017-01-04 18:00         ` Daniel Bristot de Oliveira
@ 2017-01-04 18:30           ` Luca Abeni
  2017-01-11 12:19             ` Juri Lelli
  0 siblings, 1 reply; 20+ messages in thread
From: Luca Abeni @ 2017-01-04 18:30 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Claudio Scordino, Steven Rostedt, Tommaso Cucinotta

2017-01-04 19:00 GMT+01:00, Daniel Bristot de Oliveira <bristot@redhat.com>:
[...]
>>>>> Some tasks start to use more CPU time, while others seems to use less
>>>>> CPU than it was reserved for them. See the task 14926, it is using
>>>>> only 23.8 % of the CPU, which is less than its 10/30 reservation.
>>>>
>>>> What happened here is that some runqueues have an active utilisation
>>>> larger than 0.95. So, GRUB is decreasing the amount of time received by
>>>> the tasks on those runqueues to consume less than 95%... This is the
>>>> reason for the effect you noticed below:
>>>
>>> I see. But, AFAIK, the Linux's sched deadline measures the load
>>> globally, not locally. So, it is not a problem having a load > than 95%
>>> in the local queue if the global queue is < 95%.
>>>
>>> Am I missing something?
>>
>> The version of GRUB reclaiming implemented in my patches tracks a
>> per-runqueue "active utilization", and uses it for reclaiming.
>
> I _think_ that this might be (one of) the source(s) of the problem...

I agree that this can cause some problems, but I am not sure if it
justifies the huge difference in utilisations you observed.
> Just exercising...
>
> For example, with my taskset, with a hypothetical perfect balance of the
> whole runqueue, one possible scenario is:
>
>    CPU    0    1     2     3
> # TASKS   3    3     3     2
>
> In this case, CPUs 0 1 2 are with 100% of local utilization. Thus, the
> current task on these CPUs will have their runtime decreased by GRUB.
> Meanwhile, the luck tasks in the CPU 3 would use an additional time that
> they "globally" do not have - because the system, globally, has a load
> higher than the 66.6...% of the local runqueue. Actually, part of the
> time decreased from tasks on [0-2] are being used by the tasks on 3,
> until the next migration of any task, which will change the luck
> tasks... but without any guaranty that all tasks will be the luck one on
> every activation, causing the problem.
>
> Does it make sense?

Yes; but my impression is that gEDF will migrate tasks so that the
distribution of the reclaimed CPU bandwidth is almost uniform...
Instead, you saw huge differences in the utilisations (and I do not
think that "compressing" the utilisations from 100% to 95% can
decrease the utilisation of a task from 33% to 25% / 26%; that kind of
compression would only bring 33% down to roughly 31.6%... :)

I suspect there is something more going on here (might be some bug in
one of my patches). I am trying to better understand what happened.

> If it does, this let me think that only with the global track of
> utilization we will achieve the correct result... but I may be missing
> something... :-).

Of course tracking the global active utilisation can be a solution,
but I also want to better understand what is wrong with the current
approach.

Thanks,
Luca

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE
  2017-01-04 18:30           ` Luca Abeni
@ 2017-01-11 12:19             ` Juri Lelli
  2017-01-11 12:39               ` Luca Abeni
  0 siblings, 1 reply; 20+ messages in thread
From: Juri Lelli @ 2017-01-11 12:19 UTC (permalink / raw)
  To: Luca Abeni
  Cc: Daniel Bristot de Oliveira, linux-kernel, Peter Zijlstra,
	Ingo Molnar, Claudio Scordino, Steven Rostedt, Tommaso Cucinotta

Hi,

On 04/01/17 19:30, Luca Abeni wrote:
> 2017-01-04 19:00 GMT+01:00, Daniel Bristot de Oliveira <bristot@redhat.com>:
> [...]
> >>>>> Some tasks start to use more CPU time, while others seems to use less
> >>>>> CPU than it was reserved for them. See the task 14926, it is using
> >>>>> only 23.8 % of the CPU, which is less than its 10/30 reservation.
> >>>>
> >>>> What happened here is that some runqueues have an active utilisation
> >>>> larger than 0.95. So, GRUB is decreasing the amount of time received by
> >>>> the tasks on those runqueues to consume less than 95%... This is the
> >>>> reason for the effect you noticed below:
> >>>
> >>> I see. But, AFAIK, the Linux's sched deadline measures the load
> >>> globally, not locally. So, it is not a problem having a load > than 95%
> >>> in the local queue if the global queue is < 95%.
> >>>
> >>> Am I missing something?
> >>
> >> The version of GRUB reclaiming implemented in my patches tracks a
> >> per-runqueue "active utilization", and uses it for reclaiming.
> >
> > I _think_ that this might be (one of) the source(s) of the problem...
> I agree that this can cause some problems, but I am not sure if it
> justifies the huge difference in utilisations you observed
> 
> > Just exercising...
> >
> > For example, with my taskset, with a hypothetical perfect balance of the
> > whole runqueue, one possible scenario is:
> >
> >    CPU    0    1     2     3
> > # TASKS   3    3     3     2
> >
> > In this case, CPUs 0 1 2 are with 100% of local utilization. Thus, the
> > current task on these CPUs will have their runtime decreased by GRUB.
> > Meanwhile, the luck tasks in the CPU 3 would use an additional time that
> > they "globally" do not have - because the system, globally, has a load
> > higher than the 66.6...% of the local runqueue. Actually, part of the
> > time decreased from tasks on [0-2] are being used by the tasks on 3,
> > until the next migration of any task, which will change the luck
> > tasks... but without any guaranty that all tasks will be the luck one on
> > every activation, causing the problem.
> >
> > Does it make sense?
> 
> Yes; but my impression is that gEDF will migrate tasks so that the
> distribution of the reclaimed CPU bandwidth is almost uniform...
> Instead, you saw huge differences in the utilisations (and I do not
> think that "compressing" the utilisations from 100% to 95% can
> decrease the utilisation of a task from 33% to 25% / 26%... :)
>

I tried to replicate Daniel's experiment, but I don't see such a skewed
allocation. The tasks get a reasonably uniform bandwidth and the trace
looks fairly good as well (all processes get to run on the different
processors at some time).

> I suspect there is something more going on here (might be some bug in
> one of my patches). I am trying to better understand what happened.
> 

However, playing with this a bit further, I found out one thing that
looks counter-intuitive (at least to me :).

Simplifying Daniel's example, let's say that we have one 10/30 task
running on a CPU with a 500/1000 global limit. Applying the
grub_reclaim() formula we have:

 delta_exec = delta * (0.5 + 0.333) = delta * 0.833

which in practice means that 1ms of real delta (at 1000HZ) corresponds
to 0.833ms of virtual delta. Considering this, a 10ms (over 30ms)
reservation gets "extended" to ~12ms (over 30ms), that is to say the
task consumes 0.4 of the CPU's bandwidth. top seems to back what I'm
saying, but am I still talking nonsense? :)

I was expecting that the task could consume 0.5 worth of bandwidth with
the given global limit. Is the current behaviour intended?

If we want to change this behaviour maybe something like the following
might work?

 delta_exec = (delta * to_ratio((1ULL << 20) - rq->dl.non_deadline_bw,
                                rq->dl.running_bw)) >> 20

The idea would be to normalize running_bw over the available dl_bw.

Thoughts?

Best,

- Juri

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE
  2017-01-11 12:19             ` Juri Lelli
@ 2017-01-11 12:39               ` Luca Abeni
  2017-01-11 15:06                 ` Juri Lelli
  0 siblings, 1 reply; 20+ messages in thread
From: Luca Abeni @ 2017-01-11 12:39 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Daniel Bristot de Oliveira, linux-kernel, Peter Zijlstra,
	Ingo Molnar, Claudio Scordino, Steven Rostedt, Tommaso Cucinotta

Hi Juri,
(I reply from my new email address)

On Wed, 11 Jan 2017 12:19:51 +0000
Juri Lelli <juri.lelli@arm.com> wrote:
[...]
> > > For example, with my taskset, with a hypothetical perfect balance
> > > of the whole runqueue, one possible scenario is:
> > >
> > >    CPU    0    1     2     3
> > > # TASKS   3    3     3     2
> > >
> > > In this case, CPUs 0 1 2 are with 100% of local utilization.
> > > Thus, the current task on these CPUs will have their runtime
> > > decreased by GRUB. Meanwhile, the luck tasks in the CPU 3 would
> > > use an additional time that they "globally" do not have - because
> > > the system, globally, has a load higher than the 66.6...% of the
> > > local runqueue. Actually, part of the time decreased from tasks
> > > on [0-2] are being used by the tasks on 3, until the next
> > > migration of any task, which will change the luck tasks... but
> > > without any guaranty that all tasks will be the luck one on every
> > > activation, causing the problem.
> > >
> > > Does it make sense?  
> > 
> > Yes; but my impression is that gEDF will migrate tasks so that the
> > distribution of the reclaimed CPU bandwidth is almost uniform...
> > Instead, you saw huge differences in the utilisations (and I do not
> > think that "compressing" the utilisations from 100% to 95% can
> > decrease the utilisation of a task from 33% to 25% / 26%... :)
> >  
> 
> I tried to replicate Daniel's experiment, but I don't see such a
> skewed allocation. They get a reasonably uniform bandwidth and the
> trace looks fairly good as well (all processes get to run on the
> different processors at some time).

With some effort, I replicated the issue noticed by Daniel... I think
it also depends on the CPU speed (and on good or bad luck :), but the
"unfair" CPU allocation can actually happen.
I am working on a fix (based on the m-grub modifications proposed at
last April's SAC - in my original patchset, I over-simplified the
algorithm).


> > I suspect there is something more going on here (might be some bug
> > in one of my patches). I am trying to better understand what
> > happened. 
> 
> However, playing with this a bit further, I found out one thing that
> looks counter-intuitive (at least to me :).
> 
> Simplifying Daniel's example, let's say that we have one 10/30 task
> running on a CPU with a 500/1000 global limit. Applying grub_reclaim()
> formula we have:
> 
>  delta_exec = delta * (0.5 + 0.333) = delta * 0.833
> 
> Which in practice means that 1ms of real delta (at 1000HZ) corresponds
> to 0.833ms of virtual delta. Considering this, a 10ms (over 30ms)
> reservation gets "extended" to ~12ms (over 30ms), that is to say the
> task consumes 0.4 of the CPU's bandwidth. top seems to back what I'm
> saying, but am I still talking nonsense? :)

You are right; my "Do not reclaim the whole CPU bandwidth" patch is an
approximation... I hoped that this approximation could be more precise
than it really is.
I used the "Uact + unreclaimable utilization" equation to avoid
divisions in grub_reclaim(), but the equation should really be "Uact /
reclaimable utilization"... So, in your example it is
	delta * 0.3333 / 0.5 = delta * 0.6666
which results in 15ms over 30ms, as expected.

I'll fix that patch for the next submission.

> I was expecting that the task could consume 0.5 worth of bandwidth
> with the given global limit. Is the current behaviour intended?
> 
> If we want to change this behaviour maybe something like the following
> might work?
> 
>  delta_exec = (delta * to_ratio((1ULL << 20) - rq->dl.non_deadline_bw,
>                                 rq->dl.running_bw)) >> 20
My current patch does
	(delta * rq->dl.running_bw * rq->dl.deadline_bw_inv) >> 20 >> 8;
where rq->dl.deadline_bw_inv has been set to
	to_ratio(global_rt_runtime(), global_rt_period()) >> 12;
	
This seems to work fine, and should introduce less overhead than
to_ratio().
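
(A numeric double-check of the formula above, as a user-space sketch
with the values from your example - Uact = 10/30 and a 500/1000 limit -
so running_bw and deadline_bw_inv below are illustrative constants:)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t running_bw = (1ULL << 20) / 3;	/* Uact, 20-bit fixed point */
	/* to_ratio(runtime, period) >> 12 = ((1 / 0.5) << 20) >> 12 */
	uint64_t deadline_bw_inv = (2ULL << 20) >> 12;
	uint64_t delta = 1000000;		/* 1ms of real time, in ns */

	uint64_t delta_exec =
		(delta * running_bw * deadline_bw_inv) >> 20 >> 8;

	/* Prints ~666664 ns: equivalent to delta * Uact / Umax, so a
	 * 10ms runtime stretches to ~15ms, matching the exact result. */
	printf("delta_exec = %llu ns\n", (unsigned long long)delta_exec);
	return 0;
}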


		Thanks,
			Luca

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE
  2017-01-11 12:39               ` Luca Abeni
@ 2017-01-11 15:06                 ` Juri Lelli
  2017-01-11 21:16                   ` luca abeni
  0 siblings, 1 reply; 20+ messages in thread
From: Juri Lelli @ 2017-01-11 15:06 UTC (permalink / raw)
  To: Luca Abeni
  Cc: Daniel Bristot de Oliveira, linux-kernel, Peter Zijlstra,
	Ingo Molnar, Claudio Scordino, Steven Rostedt, Tommaso Cucinotta

On 11/01/17 13:39, Luca Abeni wrote:
> Hi Juri,
> (I reply from my new email address)
> 
> On Wed, 11 Jan 2017 12:19:51 +0000
> Juri Lelli <juri.lelli@arm.com> wrote:
> [...]
> > > > For example, with my taskset, with a hypothetical perfect balance
> > > > of the whole runqueue, one possible scenario is:
> > > >
> > > >    CPU    0    1     2     3
> > > > # TASKS   3    3     3     2
> > > >
> > > > In this case, CPUs 0 1 2 are with 100% of local utilization.
> > > > Thus, the current task on these CPUs will have their runtime
> > > > decreased by GRUB. Meanwhile, the luck tasks in the CPU 3 would
> > > > use an additional time that they "globally" do not have - because
> > > > the system, globally, has a load higher than the 66.6...% of the
> > > > local runqueue. Actually, part of the time decreased from tasks
> > > > on [0-2] are being used by the tasks on 3, until the next
> > > > migration of any task, which will change the luck tasks... but
> > > > without any guaranty that all tasks will be the luck one on every
> > > > activation, causing the problem.
> > > >
> > > > Does it make sense?  
> > > 
> > > Yes; but my impression is that gEDF will migrate tasks so that the
> > > distribution of the reclaimed CPU bandwidth is almost uniform...
> > > Instead, you saw huge differences in the utilisations (and I do not
> > > think that "compressing" the utilisations from 100% to 95% can
> > > decrease the utilisation of a task from 33% to 25% / 26%... :)
> > >  
> > 
> > I tried to replicate Daniel's experiment, but I don't see such a
> > skewed allocation. They get a reasonably uniform bandwidth and the
> > trace looks fairly good as well (all processes get to run on the
> > different processors at some time).
> 
> With some effort, I replicated the issue noticed by Daniel... I think
> it also depends on the CPU speed (and on good or bad luck :), but the
> "unfair" CPU allocation can actually happen.

Yeah, actual allocation in general varies. I guess the question is: do
we care? We currently don't load balance considering utilizations, only
dynamic deadlines matter.

> I am working on a fix (based on the m-grub modifications proposed at
> last April's SAC - in my original patchset, I over-simplified the
> algorithm).
> 

OK, will have a look to next version.

> 
> > > I suspect there is something more going on here (might be some bug
> > > in one of my patches). I am trying to better understand what
> > > happened. 
> > 
> > However, playing with this a bit further, I found out one thing that
> > looks counter-intuitive (at least to me :).
> > 
> > Simplifying Daniel's example, let's say that we have one 10/30 task
> > running on a CPU with a 500/1000 global limit. Applying grub_reclaim()
> > formula we have:
> > 
> >  delta_exec = delta * (0.5 + 0.333) = delta * 0.833
> > 
> > Which in practice means that 1ms of real delta (at 1000HZ) corresponds
> > to 0.833ms of virtual delta. Considering this, a 10ms (over 30ms)
> > reservation gets "extended" to ~12ms (over 30ms), that is to say the
> > task consumes 0.4 of the CPU's bandwidth. top seems to back what I'm
> > saying, but am I still talking nonsense? :)
> 
> You are right; my "Do not reclaim the whole CPU bandwidth" patch is an
> approximation... I hoped that this approximation could be more precise
> than what it really is.
> I used the "Uact + unreclaimable utilization" equation to avoid
> divisions in grub_reclaim(), but the equation should really be "Uact /
> reclaimable utilization"... So, in your example it is
> 	delta * 0.3333 / 0.5 = delta * 0.6666
> that results in 15ms over 30ms, as expected.
> 
> I'll fix that patch for the next submission.
> 

Right, OK.

> > I was expecting that the task could consume 0.5 worth of bandwidth
> > with the given global limit. Is the current behaviour intended?
> > 
> > If we want to change this behaviour maybe something like the following
> > might work?
> > 
> >  delta_exec = (delta * to_ratio((1ULL << 20) - rq->dl.non_deadline_bw,
> >                                 rq->dl.running_bw)) >> 20
> My current patch does
> 	(delta * rq->dl.running_bw * rq->dl.deadline_bw_inv) >> 20 >> 8;
> where rq->dl.deadline_bw_inv has been set to
> 	to_ratio(global_rt_runtime(), global_rt_period()) >> 12;
> 	
> This seems to work fine, and should introduce less overhead than
> to_ratio().
> 

Sure, we don't want to do divisions if we can avoid them. Why the
intermediate right shifts, though?

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC v4 2/6] sched/deadline: improve the tracking of active utilization
  2016-12-30 11:33 ` [RFC v4 2/6] sched/deadline: improve the tracking of " Luca Abeni
@ 2017-01-11 17:05   ` Juri Lelli
  2017-01-11 21:22     ` luca abeni
  0 siblings, 1 reply; 20+ messages in thread
From: Juri Lelli @ 2017-01-11 17:05 UTC (permalink / raw)
  To: Luca Abeni
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Claudio Scordino,
	Steven Rostedt, Tommaso Cucinotta, Daniel Bristot de Oliveira

Hi,

On 30/12/16 12:33, Luca Abeni wrote:
> From: Luca Abeni <luca.abeni@unitn.it>
> 
> This patch implements a more theoretically sound algorithm for
> tracking active utilization: instead of decreasing it when a
> task blocks, use a timer (the "inactive timer", named after the
> "Inactive" task state of the GRUB algorithm) to decrease the
> active utilization at the so called "0-lag time".
> 
> Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
> ---

[...]

> +static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
> +{
> +	struct sched_dl_entity *dl_se = container_of(timer,
> +						     struct sched_dl_entity,
> +						     inactive_timer);
> +	struct task_struct *p = dl_task_of(dl_se);
> +	struct rq_flags rf;
> +	struct rq *rq;
> +
> +	rq = task_rq_lock(p, &rf);
> +
> +	if (!dl_task(p) || p->state == TASK_DEAD) {
> +		if (p->state == TASK_DEAD && dl_se->dl_non_contending)
> +			sub_running_bw(&p->dl, dl_rq_of_se(&p->dl));
> +
> +		__dl_clear_params(p);
> +
> +		goto unlock;
> +	}
> +	if (dl_se->dl_non_contending == 0)
> +		goto unlock;
> +
> +	sched_clock_tick();
> +	update_rq_clock(rq);
> +
> +	sub_running_bw(dl_se, &rq->dl);
> +	dl_se->dl_non_contending = 0;
> +unlock:
> +	task_rq_unlock(rq, p, &rf);
> +	put_task_struct(p);
> +
> +	return HRTIMER_NORESTART;
> +}
> +

[...]

>  static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
> @@ -934,7 +1014,28 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
>  	if (flags & ENQUEUE_WAKEUP) {
>  		struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>  
> -		add_running_bw(dl_se, dl_rq);
> +		if (dl_se->dl_non_contending) {
> +			/*
> +			 * If the timer handler is currently running and the
> +			 * timer cannot be cancelled, inactive_task_timer()
> +			 * will see that dl_not_contending is not set, and
> +			 * will do nothing, so we are still safe.

Here and below: the timer callback will actually put_task_struct() (see
above) if dl_not_contending is not set; that's why we don't need to do
that if try_to_cancel returned -1 (or 0). Saying "will do nothing" is a
bit misleading, IMHO.

> +			 */
> +			if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1)
> +				put_task_struct(dl_task_of(dl_se));
> +			WARN_ON(dl_task_of(dl_se)->nr_cpus_allowed > 1);
> +			dl_se->dl_non_contending = 0;
> +		} else {

[...]

> @@ -1097,6 +1198,22 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
>  	}
>  	rcu_read_unlock();
>  
> +	rq = task_rq(p);
> +	raw_spin_lock(&rq->lock);
> +	if (p->dl.dl_non_contending) {
> +		sub_running_bw(&p->dl, &rq->dl);
> +		p->dl.dl_non_contending = 0;
> +		/*
> +		 * If the timer handler is currently running and the
> +		 * timer cannot be cancelled, inactive_task_timer()
> +		 * will see that dl_not_contending is not set, and
> +		 * will do nothing, so we are still safe.
> +		 */
> +		if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
> +			put_task_struct(p);
> +	}
> +	raw_spin_unlock(&rq->lock);
> +
>  out:
>  	return cpu;
>  }

We already raised the issue about having to lock the rq in
select_task_rq_dl() while reviewing the previous version; did you have
any thinking about possible solutions? Maybe simply bail out (need to
see how frequent this is however) or use an inner lock?

Best,

- Juri

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE
  2017-01-11 15:06                 ` Juri Lelli
@ 2017-01-11 21:16                   ` luca abeni
  0 siblings, 0 replies; 20+ messages in thread
From: luca abeni @ 2017-01-11 21:16 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Daniel Bristot de Oliveira, linux-kernel, Peter Zijlstra,
	Ingo Molnar, Claudio Scordino, Steven Rostedt, Tommaso Cucinotta

On Wed, 11 Jan 2017 15:06:47 +0000
Juri Lelli <juri.lelli@arm.com> wrote:

> On 11/01/17 13:39, Luca Abeni wrote:
> > Hi Juri,
> > (I reply from my new email address)
> > 
> > On Wed, 11 Jan 2017 12:19:51 +0000
> > Juri Lelli <juri.lelli@arm.com> wrote:
> > [...]  
> > > > > For example, with my taskset, with a hypothetical perfect
> > > > > balance of the whole runqueue, one possible scenario is:
> > > > >
> > > > >    CPU    0    1     2     3
> > > > > # TASKS   3    3     3     2
> > > > >
> > > > > In this case, CPUs 0 1 2 are with 100% of local utilization.
> > > > > Thus, the current task on these CPUs will have their runtime
> > > > > decreased by GRUB. Meanwhile, the luck tasks in the CPU 3
> > > > > would use an additional time that they "globally" do not have
> > > > > - because the system, globally, has a load higher than the
> > > > > 66.6...% of the local runqueue. Actually, part of the time
> > > > > decreased from tasks on [0-2] are being used by the tasks on
> > > > > 3, until the next migration of any task, which will change
> > > > > the luck tasks... but without any guaranty that all tasks
> > > > > will be the luck one on every activation, causing the problem.
> > > > >
> > > > > Does it make sense?    
> > > > 
> > > > Yes; but my impression is that gEDF will migrate tasks so that
> > > > the distribution of the reclaimed CPU bandwidth is almost
> > > > uniform... Instead, you saw huge differences in the
> > > > utilisations (and I do not think that "compressing" the
> > > > utilisations from 100% to 95% can decrease the utilisation of a
> > > > task from 33% to 25% / 26%... :) 
> > > 
> > > I tried to replicate Daniel's experiment, but I don't see such a
> > > skewed allocation. They get a reasonably uniform bandwidth and the
> > > trace looks fairly good as well (all processes get to run on the
> > > different processors at some time).  
> > 
> > With some effort, I replicated the issue noticed by Daniel... I
> > think it also depends on the CPU speed (and on good or bad luck :),
> > but the "unfair" CPU allocation can actually happen.  
> 
> Yeah, actual allocation in general varies. I guess the question is: do
> we care? We currently don't load balance considering utilizations,
> only dynamic deadlines matter.

Right... But the problem is that with the version of GRUB I proposed
this unfairness can result in some tasks receiving less CPU time than
the guaranteed amount (because some other tasks receive much more). I
think there are at least two possible ways to fix this (without
changing the migration strategy), and I am working on them...
(hopefully, I'll post something next week)


> > > I was expecting that the task could consume 0.5 worth of bandwidth
> > > with the given global limit. Is the current behaviour intended?
> > > 
> > > If we want to change this behaviour maybe something like the
> > > following might work?
> > > 
> > >  delta_exec = (delta * to_ratio((1ULL << 20) - rq->dl.non_deadline_bw,
> > >                                 rq->dl.running_bw)) >> 20
> > My current patch does
> > 	(delta * rq->dl.running_bw * rq->dl.deadline_bw_inv) >> 20 >> 8;
> > where rq->dl.deadline_bw_inv has been set to
> > 	to_ratio(global_rt_runtime(), global_rt_period()) >> 12;
> > 
> > This seems to work fine, and should introduce less overhead than
> > to_ratio().
> >   
> 
> Sure, we don't want to do divisions if we can. Why the intermediate
> right shifts, though?

I wrote it like this to remember that the ">> 20" comes from how
"to_ratio()" computes the utilization, and the additional ">> 8"
comes from the fact that deadline_bw_inv is shifted left by 8, to avoid
losing precision (I used 8 instead of 20 so that the computation can be
- hopefully - performed on 32 bits... Of course I can revise this if
needed).

If needed I can change the ">> 20 >> 8" to ">> 28", or remove the
">> 12" from the deadline_bw_inv computation (so that we can use
">> 40" or ">> 20 >> 20" in grub_reclaim()).


			Thanks,
				Luca

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC v4 2/6] sched/deadline: improve the tracking of active utilization
  2017-01-11 17:05   ` Juri Lelli
@ 2017-01-11 21:22     ` luca abeni
  0 siblings, 0 replies; 20+ messages in thread
From: luca abeni @ 2017-01-11 21:22 UTC (permalink / raw)
  To: Juri Lelli
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Claudio Scordino,
	Steven Rostedt, Tommaso Cucinotta, Daniel Bristot de Oliveira

On Wed, 11 Jan 2017 17:05:42 +0000
Juri Lelli <juri.lelli@arm.com> wrote:

> Hi,
> 
> On 30/12/16 12:33, Luca Abeni wrote:
> > From: Luca Abeni <luca.abeni@unitn.it>
> > 
> > This patch implements a more theoretically sound algorithm for
> > tracking active utilization: instead of decreasing it when a
> > task blocks, use a timer (the "inactive timer", named after the
> > "Inactive" task state of the GRUB algorithm) to decrease the
> > active utilization at the so called "0-lag time".
> > 
> > Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
> > ---  
> 
> [...]
> 
> > +static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
> > +{
> > +	struct sched_dl_entity *dl_se = container_of(timer,
> > +						     struct sched_dl_entity,
> > +						     inactive_timer);
> > +	struct task_struct *p = dl_task_of(dl_se);
> > +	struct rq_flags rf;
> > +	struct rq *rq;
> > +
> > +	rq = task_rq_lock(p, &rf);
> > +
> > +	if (!dl_task(p) || p->state == TASK_DEAD) {
> > +		if (p->state == TASK_DEAD && dl_se->dl_non_contending)
> > +			sub_running_bw(&p->dl, dl_rq_of_se(&p->dl));
> > +
> > +		__dl_clear_params(p);
> > +
> > +		goto unlock;
> > +	}
> > +	if (dl_se->dl_non_contending == 0)
> > +		goto unlock;
> > +
> > +	sched_clock_tick();
> > +	update_rq_clock(rq);
> > +
> > +	sub_running_bw(dl_se, &rq->dl);
> > +	dl_se->dl_non_contending = 0;
> > +unlock:
> > +	task_rq_unlock(rq, p, &rf);
> > +	put_task_struct(p);
> > +
> > +	return HRTIMER_NORESTART;
> > +}
> > +
> 
> [...]
> 
> >  static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
> > @@ -934,7 +1014,28 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
> >  	if (flags & ENQUEUE_WAKEUP) {
> >  		struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> >  
> > -		add_running_bw(dl_se, dl_rq);
> > +		if (dl_se->dl_non_contending) {
> > +			/*
> > +			 * If the timer handler is currently running and the
> > +			 * timer cannot be cancelled, inactive_task_timer()
> > +			 * will see that dl_not_contending is not set, and
> > +			 * will do nothing, so we are still safe.
> 
> Here and below: the timer callback will actually put_task_struct()
> (see above) if dl_not_contending is not set; that's why we don't need
> to do that if try_to_cancel returned -1 (or 0). Saying "will do
> nothing" is a bit misleading, IMHO.

Sorry... I originally had a bug with this put_task_struct() thing. The
bug is now (hopefully :) fixed, but I forgot to update the comment...
I'll fix it for the next submission.

> > @@ -1097,6 +1198,22 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> >  	}
> >  	rcu_read_unlock();
> >  
> > +	rq = task_rq(p);
> > +	raw_spin_lock(&rq->lock);
> > +	if (p->dl.dl_non_contending) {
> > +		sub_running_bw(&p->dl, &rq->dl);
> > +		p->dl.dl_non_contending = 0;
> > +		/*
> > +		 * If the timer handler is currently running and the
> > +		 * timer cannot be cancelled, inactive_task_timer()
> > +		 * will see that dl_not_contending is not set, and
> > +		 * will do nothing, so we are still safe.
> > +		 */
> > +		if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
> > +			put_task_struct(p);
> > +	}
> > +	raw_spin_unlock(&rq->lock);
> > +
> >  out:
> >  	return cpu;
> >  }
> 
> We already raised the issue about having to lock the rq in
> select_task_rq_dl() while reviewing the previous version; did you have
> any thinking about possible solutions? Maybe simply bail out (need to
> see how frequent this is however) or use an inner lock?

Sorry; I did not come up with any good idea for avoiding the rq
lock... I'll think about this again... The only alternative idea I
have is just to avoid changing the CPU, but I do not know if that is
acceptable...




			Thanks,
				Luca

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2017-01-11 21:23 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-30 11:33 [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE Luca Abeni
2016-12-30 11:33 ` [RFC v4 1/6] sched/deadline: track the active utilization Luca Abeni
2016-12-30 11:33 ` [RFC v4 2/6] sched/deadline: improve the tracking of " Luca Abeni
2017-01-11 17:05   ` Juri Lelli
2017-01-11 21:22     ` luca abeni
2016-12-30 11:33 ` [RFC v4 3/6] sched/deadline: fix the update of the total -deadline utilization Luca Abeni
2016-12-30 11:33 ` [RFC v4 4/6] sched/deadline: implement GRUB accounting Luca Abeni
2016-12-30 11:33 ` [RFC v4 5/6] sched/deadline: do not reclaim the whole CPU bandwidth Luca Abeni
2016-12-30 11:33 ` [RFC v4 6/6] sched/deadline: make GRUB a task's flag Luca Abeni
2017-01-03 18:58 ` [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE Daniel Bristot de Oliveira
2017-01-03 21:33   ` luca abeni
2017-01-04 12:17   ` luca abeni
2017-01-04 15:14     ` Daniel Bristot de Oliveira
2017-01-04 16:42       ` Luca Abeni
2017-01-04 18:00         ` Daniel Bristot de Oliveira
2017-01-04 18:30           ` Luca Abeni
2017-01-11 12:19             ` Juri Lelli
2017-01-11 12:39               ` Luca Abeni
2017-01-11 15:06                 ` Juri Lelli
2017-01-11 21:16                   ` luca abeni

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).