* [PATCH 00/10] CPU reclaiming for SCHED_DEADLINE
From: luca abeni @ 2017-05-18 20:13 UTC
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Claudio Scordino,
	Steven Rostedt, Tommaso Cucinotta, Daniel Bristot de Oliveira,
	Joel Fernandes, Mathieu Poirier, Luca Abeni

From: Luca Abeni <luca.abeni@santannapisa.it>

Hi all,

here is the next iteration of my patchset implementing CPU reclaiming
(using the GRUB algorithm[1]) for SCHED_DEADLINE (since I think the
patchset is now mature enough, I removed the "RFC" keyword from the
email subjects).
Basically, this feature allows SCHED_DEADLINE tasks to consume more
than their reserved runtime, up to a maximum fraction of the CPU time
(so that other tasks are left some spare CPU time), provided that this
does not break the guarantees of the other SCHED_DEADLINE tasks.
The patchset applies on top of tip/master.
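
For reference, reclaiming is requested per task through the new
SCHED_FLAG_RECLAIM flag (see patch 0006). The following minimal userspace
sketch is not part of the patchset: the struct layout follows the
sched_setattr() ABI, SYS_sched_setattr is assumed to be provided by the
libc headers, and the runtime/deadline/period values are purely
illustrative.

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE		6
#endif
#ifndef SCHED_FLAG_RECLAIM
#define SCHED_FLAG_RECLAIM	0x02	/* added by patch 0006 */
#endif

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
};

int main(void)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_policy	= SCHED_DEADLINE,
		.sched_flags	= SCHED_FLAG_RECLAIM,
		.sched_runtime	= 10 * 1000 * 1000,	/* 10ms of runtime... */
		.sched_deadline	= 30 * 1000 * 1000,	/* ...every 30ms */
		.sched_period	= 30 * 1000 * 1000,
	};

	if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
		perror("sched_setattr");
		return 1;
	}

	for (;;)	/* CPU hog: with reclaiming it can run above 10ms/30ms */
		;
}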


The implemented CPU reclaiming algorithm is based on tracking the
utilization U_act of active tasks (first 2 patches), and on modifying the
runtime accounting rule (see patches 0004, 0008 and 0009).
The original GRUB algorithm is modified as described in [2,3] to support
multiple CPUs (the original algorithm only considered a single CPU,
while this one tracks U_act per runqueue) and to leave an "unreclaimable"
fraction of CPU time to non-SCHED_DEADLINE tasks (see patch 0005: the
original algorithm can consume 100% of the CPU time, starving all the
other tasks).
Patch 0003 uses the "inactive timer" introduced in patch 0002 to fix
dl_overflow() and __setparam_dl().
Patch 0006 allows enabling CPU reclaiming only for selected tasks.
Patches 0007, 0008 and 0009 fix an issue found by Daniel in a previous 
submission, by basing the GRUB reclaiming algorithm on the inactive
utilization U_inact as shown in [3]. Here, U_inact is computed as
the difference between the "rq utilization" (see patch 0007) and
U_act.
Patch 0010 adds some documentation contributed by Claudio.
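
To make these quantities concrete, here is a small standalone sketch
(toy code, not taken from the patches; the to_bw() helper and the task
parameters are made up) that mimics the fixed-point bookkeeping:
utilizations are stored as runtime/period scaled by 2^20, U_act sums the
bandwidth of the active tasks, and U_inact is the gap between the rq
utilization and U_act.

#include <stdio.h>
#include <stdint.h>

#define BW_SHIFT	20
#define BW_UNIT		(1ULL << BW_SHIFT)

/* task utilization in the same fixed-point format used by the patches */
static uint64_t to_bw(uint64_t runtime_ns, uint64_t period_ns)
{
	return (runtime_ns << BW_SHIFT) / period_ns;
}

int main(void)
{
	/* three -deadline tasks "assigned" to the same runqueue */
	uint64_t u1 = to_bw(10000000,  50000000);	/* 0.2 */
	uint64_t u2 = to_bw(30000000, 100000000);	/* 0.3 */
	uint64_t u3 = to_bw(10000000, 100000000);	/* 0.1 */

	uint64_t rq_util = u1 + u2 + u3;	/* "rq utilization" (patch 0007) */
	uint64_t u_act   = u1 + u2;		/* the third task is blocked */
	uint64_t u_inact = rq_util - u_act;

	/* patch 0008 then charges task 1 at rate max{u1, 1 - U_inact} = 0.9 */
	printf("rq_util=%.2f U_act=%.2f U_inact=%.2f\n",
	       (double)rq_util / BW_UNIT, (double)u_act / BW_UNIT,
	       (double)u_inact / BW_UNIT);
	return 0;
}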

The changes with respect to v5 are mostly cosmetic, affecting comments and
documentation:
- I rearranged some code to avoid excessive indentation,
  as requested by Peter
  (http://lkml.iu.edu/hypermail/linux/kernel/1703.3/00291.html)
- I modified dl_non_contending() to avoid doubts about use-after-free
  (http://lkml.iu.edu/hypermail/linux/kernel/1703.3/00291.html)
- I added a comment to migrate_task_rq_dl() explaining the locking
  (http://lkml.iu.edu/hypermail/linux/kernel/1703.3/00291.html)
- I added some braces to comply with the coding style rules
- I renamed some functions and structure fields according to the
  feedback I received
- I added a big comment documenting the GRUB state transitions
- I changed grub_reclaim() (and the comments to that function) to
  make it more understandable
- I added some documentation contributed by Claudio

Finally, I updated the patches to apply on top of tip/master.

[1] Lipari, G., & Baruah, S. (2000). Greedy reclamation of unused bandwidth in constant-bandwidth servers. In Real-Time Systems, 2000. Euromicro RTS 2000. 12th Euromicro Conference on (pp. 193-200). IEEE.
[2] Abeni, L., Lelli, J., Scordino, C., & Palopoli, L. (2014, October). Greedy CPU reclaiming for SCHED DEADLINE. In Proceedings of the Real-Time Linux Workshop (RTLWS), Dusseldorf, Germany.
[3] Abeni, L., Lipari, G., Parri, A., & Sun, Y. (2016, April). Multicore CPU reclaiming: parallel or sequential?. In Proceedings of the 31st Annual ACM Symposium on Applied Computing (pp. 1877-1884). ACM.

 
Claudio Scordino (1):
  sched/deadline: documentation about GRUB reclaiming

Luca Abeni (9):
  sched/deadline: track the active utilization
  sched/deadline: improve the tracking of active utilization
  sched/deadline: fix the update of the total -deadline utilization
  sched/deadline: implement GRUB accounting
  sched/deadline: do not reclaim the whole CPU bandwidth
  sched/deadline: make GRUB a task's flag
  sched/deadline: track the "total rq utilization" too
  sched/deadline: base GRUB reclaiming on the inactive utilization
  sched/deadline: also reclaim bandwidth not used by dl tasks

 Documentation/scheduler/sched-deadline.txt | 168 +++++++++++
 include/linux/sched.h                      |  17 ++
 include/uapi/linux/sched.h                 |   1 +
 kernel/sched/core.c                        |  74 ++---
 kernel/sched/deadline.c                    | 448 +++++++++++++++++++++++++++--
 kernel/sched/sched.h                       |  66 ++++-
 6 files changed, 709 insertions(+), 65 deletions(-)

-- 
2.7.4


* [PATCH 01/10] sched/deadline: track the active utilization
From: luca abeni @ 2017-05-18 20:13 UTC
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Claudio Scordino,
	Steven Rostedt, Tommaso Cucinotta, Daniel Bristot de Oliveira,
	Joel Fernandes, Mathieu Poirier, Luca Abeni

From: Luca Abeni <luca.abeni@unitn.it>

Active utilization is defined as the total utilization of active
(TASK_RUNNING) tasks queued on a runqueue. Hence, it is increased
when a task wakes up and is decreased when a task blocks.

When a task is migrated from CPUi to CPUj, immediately subtract the
task's utilization from CPUi and add it to CPUj. This mechanism is
implemented by modifying the pull and push functions.
Note: this is not fully correct from the theoretical point of view
(the utilization should be removed from CPUi only at the 0-lag
time); a more theoretically sound solution is presented in the
next patches.
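
As a toy model of what is being tracked (user-space sketch, not the kernel
code; the toy_rq type and the 0.3 utilization are made up), running_bw is
simply a per-runqueue sum of the dl_bw of the tasks that are currently
runnable there, updated on wakeup, block and push/pull migration:

#include <stdio.h>
#include <stdint.h>

struct toy_rq { uint64_t running_bw; };

static void add_running_bw(uint64_t dl_bw, struct toy_rq *rq)
{
	rq->running_bw += dl_bw;
}

static void sub_running_bw(uint64_t dl_bw, struct toy_rq *rq)
{
	rq->running_bw -= dl_bw;
}

int main(void)
{
	struct toy_rq cpu0 = { 0 }, cpu1 = { 0 };
	uint64_t dl_bw = (3ULL << 20) / 10;	/* utilization 0.3, <<20 fixed point */

	add_running_bw(dl_bw, &cpu0);	/* task wakes up on CPU0 */
	sub_running_bw(dl_bw, &cpu0);	/* push_dl_task() moves it away... */
	add_running_bw(dl_bw, &cpu1);	/* ...to CPU1 */
	sub_running_bw(dl_bw, &cpu1);	/* task blocks (next patch delays this to 0-lag) */

	printf("cpu0: %llu, cpu1: %llu\n",
	       (unsigned long long)cpu0.running_bw,
	       (unsigned long long)cpu1.running_bw);
	return 0;
}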

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
---
 kernel/sched/deadline.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h    |  6 +++++
 2 files changed, 67 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index a2ce590..9be399d 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -43,6 +43,28 @@ static inline int on_dl_rq(struct sched_dl_entity *dl_se)
 	return !RB_EMPTY_NODE(&dl_se->rb_node);
 }
 
+static inline
+void add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
+{
+	u64 old = dl_rq->running_bw;
+
+	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	dl_rq->running_bw += dl_bw;
+	SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
+}
+
+static inline
+void sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
+{
+	u64 old = dl_rq->running_bw;
+
+	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	dl_rq->running_bw -= dl_bw;
+	SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
+	if (dl_rq->running_bw > old)
+		dl_rq->running_bw = 0;
+}
+
 static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
 {
 	struct sched_dl_entity *dl_se = &p->dl;
@@ -83,6 +105,8 @@ void init_dl_rq(struct dl_rq *dl_rq)
 #else
 	init_dl_bw(&dl_rq->dl_bw);
 #endif
+
+	dl_rq->running_bw = 0;
 }
 
 #ifdef CONFIG_SMP
@@ -946,10 +970,14 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
 	 * parameters of the task might need updating. Otherwise,
 	 * we want a replenishment of its runtime.
 	 */
-	if (flags & ENQUEUE_WAKEUP)
+	if (flags & ENQUEUE_WAKEUP) {
+		struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+		add_running_bw(dl_se->dl_bw, dl_rq);
 		update_dl_entity(dl_se, pi_se);
-	else if (flags & ENQUEUE_REPLENISH)
+	} else if (flags & ENQUEUE_REPLENISH) {
 		replenish_dl_entity(dl_se, pi_se);
+	}
 
 	__enqueue_dl_entity(dl_se);
 }
@@ -998,14 +1026,25 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	if (!p->dl.dl_throttled && dl_is_constrained(&p->dl))
 		dl_check_constrained_dl(&p->dl);
 
+	if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & ENQUEUE_RESTORE)
+		add_running_bw(p->dl.dl_bw, &rq->dl);
+
 	/*
-	 * If p is throttled, we do nothing. In fact, if it exhausted
+	 * If p is throttled, we do not enqueue it. In fact, if it exhausted
 	 * its budget it needs a replenishment and, since it now is on
 	 * its rq, the bandwidth timer callback (which clearly has not
 	 * run yet) will take care of this.
+	 * However, the active utilization does not depend on the fact
+	 * that the task is on the runqueue or not (but depends on the
+	 * task's state - in GRUB parlance, "inactive" vs "active contending").
+	 * In other words, even if a task is throttled its utilization must
+	 * be counted in the active utilization; hence, we need to call
+	 * add_running_bw().
 	 */
-	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH))
+	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
+		add_running_bw(p->dl.dl_bw, &rq->dl);
 		return;
+	}
 
 	enqueue_dl_entity(&p->dl, pi_se, flags);
 
@@ -1023,6 +1062,20 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	update_curr_dl(rq);
 	__dequeue_task_dl(rq, p, flags);
+
+	if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & DEQUEUE_SAVE)
+		sub_running_bw(p->dl.dl_bw, &rq->dl);
+
+	/*
+	 * This check allows to decrease the active utilization in two cases:
+	 * when the task blocks and when it is terminating
+	 * (p->state == TASK_DEAD). We can handle the two cases in the same
+	 * way, because from GRUB's point of view the same thing is happening
+	 * (the task moves from "active contending" to "active non contending"
+	 * or "inactive")
+	 */
+	if (flags & DEQUEUE_SLEEP)
+		sub_running_bw(p->dl.dl_bw, &rq->dl);
 }
 
 /*
@@ -1551,7 +1604,9 @@ static int push_dl_task(struct rq *rq)
 	}
 
 	deactivate_task(rq, next_task, 0);
+	sub_running_bw(next_task->dl.dl_bw, &rq->dl);
 	set_task_cpu(next_task, later_rq->cpu);
+	add_running_bw(next_task->dl.dl_bw, &later_rq->dl);
 	activate_task(later_rq, next_task, 0);
 	ret = 1;
 
@@ -1639,7 +1694,9 @@ static void pull_dl_task(struct rq *this_rq)
 			resched = true;
 
 			deactivate_task(src_rq, p, 0);
+			sub_running_bw(p->dl.dl_bw, &src_rq->dl);
 			set_task_cpu(p, this_cpu);
+			add_running_bw(p->dl.dl_bw, &this_rq->dl);
 			activate_task(this_rq, p, 0);
 			dmin = p->dl.deadline;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bcdf6a1..28b7a74 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -558,6 +558,12 @@ struct dl_rq {
 #else
 	struct dl_bw dl_bw;
 #endif
+	/*
+	 * "Active utilization" for this runqueue: increased when a
+	 * task wakes up (becomes TASK_RUNNING) and decreased when a
+	 * task blocks
+	 */
+	u64 running_bw;
 };
 
 #ifdef CONFIG_SMP
-- 
2.7.4


* [PATCH 02/10] sched/deadline: improve the tracking of active utilization
From: luca abeni @ 2017-05-18 20:13 UTC
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Claudio Scordino,
	Steven Rostedt, Tommaso Cucinotta, Daniel Bristot de Oliveira,
	Joel Fernandes, Mathieu Poirier, Luca Abeni

From: Luca Abeni <luca.abeni@santannapisa.it>

This patch implements a more theoretically sound algorithm for
tracking active utilization: instead of decreasing it when a
task blocks, use a timer (the "inactive timer", named after the
"Inactive" task state of the GRUB algorithm) to decrease the
active utilization at the so-called "0-lag time".
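
For reference, this is the kind of computation the new timer is armed
with (standalone sketch matching the expression used in
task_non_contending() below; the numbers are made up):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	int64_t dl_runtime = 10000000;	/*  10ms reserved runtime          */
	int64_t dl_period  = 30000000;	/*  30ms period (utilization 1/3)  */
	int64_t runtime    =  4000000;	/*   4ms of runtime still left     */
	int64_t deadline   = 100000000;	/* current absolute deadline (ns)  */

	/* the task stops contributing to the active utilization only here */
	int64_t zerolag = deadline - (runtime * dl_period) / dl_runtime;

	printf("0-lag time: %lld ns (%lld ns before the deadline)\n",
	       (long long)zerolag, (long long)(deadline - zerolag));
	return 0;
}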

Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Tested-by: Claudio Scordino <claudio@evidence.eu.com>
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
---
 include/linux/sched.h   |  17 +++
 kernel/sched/core.c     |   3 +
 kernel/sched/deadline.c | 269 +++++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h    |   2 +
 4 files changed, 276 insertions(+), 15 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2b69fc6..8127f8b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -445,16 +445,33 @@ struct sched_dl_entity {
 	 *
 	 * @dl_yielded tells if task gave up the CPU before consuming
 	 * all its available runtime during the last job.
+	 *
+	 * @dl_non_contending tells if the task is inactive while still
+	 * contributing to the active utilization. In other words, it
+	 * indicates if the inactive timer has been armed and its handler
+	 * has not been executed yet. This flag is useful to avoid race
+	 * conditions between the inactive timer handler and the wakeup
+	 * code.
 	 */
 	int				dl_throttled;
 	int				dl_boosted;
 	int				dl_yielded;
+	int				dl_non_contending;
 
 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
 	 * own bandwidth to be enforced, thus we need one timer per task.
 	 */
 	struct hrtimer			dl_timer;
+
+	/*
+	 * Inactive timer, responsible for decreasing the active utilization
+	 * at the "0-lag time". When a -deadline task blocks, it contributes
+	 * to GRUB's active utilization until the "0-lag time", hence a
+	 * timer is needed to decrease the active utilization at the correct
+	 * time.
+	 */
+	struct hrtimer inactive_timer;
 };
 
 union rcu_special {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c888bd3..5d49374 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2165,6 +2165,7 @@ void __dl_clear_params(struct task_struct *p)
 
 	dl_se->dl_throttled = 0;
 	dl_se->dl_yielded = 0;
+	dl_se->dl_non_contending = 0;
 }
 
 /*
@@ -2196,6 +2197,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 
 	RB_CLEAR_NODE(&p->dl.rb_node);
 	init_dl_task_timer(&p->dl);
+	init_dl_inactive_task_timer(&p->dl);
 	__dl_clear_params(p);
 
 	INIT_LIST_HEAD(&p->rt.run_list);
@@ -2518,6 +2520,7 @@ static int dl_overflow(struct task_struct *p, int policy,
 		   !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
 		__dl_clear(dl_b, p->dl.dl_bw);
 		__dl_add(dl_b, new_bw);
+		dl_change_utilization(p, new_bw);
 		err = 0;
 	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
 		__dl_clear(dl_b, p->dl.dl_bw);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 9be399d..147cbe6 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -65,6 +65,161 @@ void sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
 		dl_rq->running_bw = 0;
 }
 
+void dl_change_utilization(struct task_struct *p, u64 new_bw)
+{
+	if (task_on_rq_queued(p))
+		return;
+
+	if (!p->dl.dl_non_contending)
+		return;
+
+	sub_running_bw(p->dl.dl_bw, &task_rq(p)->dl);
+	p->dl.dl_non_contending = 0;
+	/*
+	 * If the timer handler is currently running and the
+	 * timer cannot be cancelled, inactive_task_timer()
+	 * will see that dl_not_contending is not set, and
+	 * will not touch the rq's active utilization,
+	 * so we are still safe.
+	 */
+	if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
+		put_task_struct(p);
+}
+
+/*
+ * The utilization of a task cannot be immediately removed from
+ * the rq active utilization (running_bw) when the task blocks.
+ * Instead, we have to wait for the so called "0-lag time".
+ *
+ * If a task blocks before the "0-lag time", a timer (the inactive
+ * timer) is armed, and running_bw is decreased when the timer
+ * fires.
+ *
+ * If the task wakes up again before the inactive timer fires,
+ * the timer is cancelled, whereas if the task wakes up after the
+ * inactive timer fired (and running_bw has been decreased) the
+ * task's utilization has to be added to running_bw again.
+ * A flag in the deadline scheduling entity (dl_non_contending)
+ * is used to avoid race conditions between the inactive timer handler
+ * and task wakeups.
+ *
+ * The following diagram shows how running_bw is updated. A task is
+ * "ACTIVE" when its utilization contributes to running_bw; an
+ * "ACTIVE contending" task is in the TASK_RUNNING state, while an
+ * "ACTIVE non contending" task is a blocked task for which the "0-lag time"
+ * has not passed yet. An "INACTIVE" task is a task for which the "0-lag"
+ * time already passed, which does not contribute to running_bw anymore.
+ *                              +------------------+
+ *             wakeup           |    ACTIVE        |
+ *          +------------------>+   contending     |
+ *          | add_running_bw    |                  |
+ *          |                   +----+------+------+
+ *          |                        |      ^
+ *          |                dequeue |      |
+ * +--------+-------+                |      |
+ * |                |   t >= 0-lag   |      | wakeup
+ * |    INACTIVE    |<---------------+      |
+ * |                | sub_running_bw |      |
+ * +--------+-------+                |      |
+ *          ^                        |      |
+ *          |              t < 0-lag |      |
+ *          |                        |      |
+ *          |                        V      |
+ *          |                   +----+------+------+
+ *          | sub_running_bw    |    ACTIVE        |
+ *          +-------------------+                  |
+ *            inactive timer    |  non contending  |
+ *            fired             +------------------+
+ *
+ * The task_non_contending() function is invoked when a task
+ * blocks, and checks if the 0-lag time already passed or
+ * not (in the first case, it directly updates running_bw;
+ * in the second case, it arms the inactive timer).
+ *
+ * The task_contending() function is invoked when a task wakes
+ * up, and checks if the task is still in the "ACTIVE non contending"
+ * state or not (in the second case, it updates running_bw).
+ */
+static void task_non_contending(struct task_struct *p)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+	struct hrtimer *timer = &dl_se->inactive_timer;
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+	s64 zerolag_time;
+
+	/*
+	 * If this is a non-deadline task that has been boosted,
+	 * do nothing
+	 */
+	if (dl_se->dl_runtime == 0)
+		return;
+
+	WARN_ON(hrtimer_active(&dl_se->inactive_timer));
+	WARN_ON(dl_se->dl_non_contending);
+
+	zerolag_time = dl_se->deadline -
+		 div64_long((dl_se->runtime * dl_se->dl_period),
+			dl_se->dl_runtime);
+
+	/*
+	 * Using relative times instead of the absolute "0-lag time"
+	 * allows to simplify the code
+	 */
+	zerolag_time -= rq_clock(rq);
+
+	/*
+	 * If the "0-lag time" already passed, decrease the active
+	 * utilization now, instead of starting a timer
+	 */
+	if (zerolag_time < 0) {
+		if (dl_task(p))
+			sub_running_bw(dl_se->dl_bw, dl_rq);
+		if (!dl_task(p) || p->state == TASK_DEAD)
+			__dl_clear_params(p);
+
+		return;
+	}
+
+	dl_se->dl_non_contending = 1;
+	get_task_struct(p);
+	hrtimer_start(timer, ns_to_ktime(zerolag_time), HRTIMER_MODE_REL);
+}
+
+static void task_contending(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+	/*
+	 * If this is a non-deadline task that has been boosted,
+	 * do nothing
+	 */
+	if (dl_se->dl_runtime == 0)
+		return;
+
+	if (dl_se->dl_non_contending) {
+		dl_se->dl_non_contending = 0;
+		/*
+		 * If the timer handler is currently running and the
+		 * timer cannot be cancelled, inactive_task_timer()
+		 * will see that dl_not_contending is not set, and
+		 * will not touch the rq's active utilization,
+		 * so we are still safe.
+		 */
+		if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1)
+			put_task_struct(dl_task_of(dl_se));
+	} else {
+		/*
+		 * Since "dl_non_contending" is not set, the
+		 * task's utilization has already been removed from
+		 * active utilization (either when the task blocked,
+		 * when the "inactive timer" fired).
+		 * So, add it back.
+		 */
+		add_running_bw(dl_se->dl_bw, dl_rq);
+	}
+}
+
 static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
 {
 	struct sched_dl_entity *dl_se = &p->dl;
@@ -617,10 +772,8 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 	 * The task might have changed its scheduling policy to something
 	 * different than SCHED_DEADLINE (through switched_from_dl()).
 	 */
-	if (!dl_task(p)) {
-		__dl_clear_params(p);
+	if (!dl_task(p))
 		goto unlock;
-	}
 
 	/*
 	 * The task might have been boosted by someone else and might be in the
@@ -839,6 +992,49 @@ static void update_curr_dl(struct rq *rq)
 	}
 }
 
+static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
+{
+	struct sched_dl_entity *dl_se = container_of(timer,
+						     struct sched_dl_entity,
+						     inactive_timer);
+	struct task_struct *p = dl_task_of(dl_se);
+	struct rq_flags rf;
+	struct rq *rq;
+
+	rq = task_rq_lock(p, &rf);
+
+	if (!dl_task(p) || p->state == TASK_DEAD) {
+		if (p->state == TASK_DEAD && dl_se->dl_non_contending) {
+			sub_running_bw(p->dl.dl_bw, dl_rq_of_se(&p->dl));
+			dl_se->dl_non_contending = 0;
+		}
+		__dl_clear_params(p);
+
+		goto unlock;
+	}
+	if (dl_se->dl_non_contending == 0)
+		goto unlock;
+
+	sched_clock_tick();
+	update_rq_clock(rq);
+
+	sub_running_bw(dl_se->dl_bw, &rq->dl);
+	dl_se->dl_non_contending = 0;
+unlock:
+	task_rq_unlock(rq, p, &rf);
+	put_task_struct(p);
+
+	return HRTIMER_NORESTART;
+}
+
+void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se)
+{
+	struct hrtimer *timer = &dl_se->inactive_timer;
+
+	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	timer->function = inactive_task_timer;
+}
+
 #ifdef CONFIG_SMP
 
 static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
@@ -971,9 +1167,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
 	 * we want a replenishment of its runtime.
 	 */
 	if (flags & ENQUEUE_WAKEUP) {
-		struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
-
-		add_running_bw(dl_se->dl_bw, dl_rq);
+		task_contending(dl_se);
 		update_dl_entity(dl_se, pi_se);
 	} else if (flags & ENQUEUE_REPLENISH) {
 		replenish_dl_entity(dl_se, pi_se);
@@ -1042,7 +1236,9 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	 * add_running_bw().
 	 */
 	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
-		add_running_bw(p->dl.dl_bw, &rq->dl);
+		if (flags & ENQUEUE_WAKEUP)
+			task_contending(&p->dl);
+
 		return;
 	}
 
@@ -1067,7 +1263,8 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 		sub_running_bw(p->dl.dl_bw, &rq->dl);
 
 	/*
-	 * This check allows to decrease the active utilization in two cases:
+	 * This check allows to start the inactive timer (or to immediately
+	 * decrease the active utilization, if needed) in two cases:
 	 * when the task blocks and when it is terminating
 	 * (p->state == TASK_DEAD). We can handle the two cases in the same
 	 * way, because from GRUB's point of view the same thing is happening
@@ -1075,7 +1272,7 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	 * or "inactive")
 	 */
 	if (flags & DEQUEUE_SLEEP)
-		sub_running_bw(p->dl.dl_bw, &rq->dl);
+		task_non_contending(p);
 }
 
 /*
@@ -1153,6 +1350,35 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
 	return cpu;
 }
 
+static void migrate_task_rq_dl(struct task_struct *p)
+{
+	struct rq *rq;
+
+	if (!(p->state == TASK_WAKING) || !(p->dl.dl_non_contending))
+		return;
+
+	rq = task_rq(p);
+	/*
+	 * Since p->state == TASK_WAKING, set_task_cpu() has been called
+	 * from try_to_wake_up(). Hence, p->pi_lock is locked, but
+	 * rq->lock is not... So, lock it
+	 */
+	raw_spin_lock(&rq->lock);
+	sub_running_bw(p->dl.dl_bw, &rq->dl);
+	p->dl.dl_non_contending = 0;
+	/*
+	 * If the timer handler is currently running and the
+	 * timer cannot be cancelled, inactive_task_timer()
+	 * will see that dl_not_contending is not set, and
+	 * will not touch the rq's active utilization,
+	 * so we are still safe.
+	 */
+	if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
+		put_task_struct(p);
+
+	raw_spin_unlock(&rq->lock);
+}
+
 static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
 {
 	/*
@@ -1794,13 +2020,23 @@ void __init init_sched_dl_class(void)
 static void switched_from_dl(struct rq *rq, struct task_struct *p)
 {
 	/*
-	 * Start the deadline timer; if we switch back to dl before this we'll
-	 * continue consuming our current CBS slice. If we stay outside of
-	 * SCHED_DEADLINE until the deadline passes, the timer will reset the
-	 * task.
+	 * task_non_contending() can start the "inactive timer" (if the 0-lag
+	 * time is in the future). If the task switches back to dl before
+	 * the "inactive timer" fires, it can continue to consume its current
+	 * runtime using its current deadline. If it stays outside of
+	 * SCHED_DEADLINE until the 0-lag time passes, inactive_task_timer()
+	 * will reset the task parameters.
 	 */
-	if (!start_dl_timer(p))
-		__dl_clear_params(p);
+	if (task_on_rq_queued(p) && p->dl.dl_runtime)
+		task_non_contending(p);
+
+	/*
+	 * We cannot use inactive_task_timer() to invoke sub_running_bw()
+	 * at the 0-lag time, because the task could have been migrated
+	 * while SCHED_OTHER in the meanwhile.
+	 */
+	if (p->dl.dl_non_contending)
+		p->dl.dl_non_contending = 0;
 
 	/*
 	 * Since this might be the only -deadline task on the rq,
@@ -1819,6 +2055,8 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
  */
 static void switched_to_dl(struct rq *rq, struct task_struct *p)
 {
+	if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
+		put_task_struct(p);
 
 	/* If p is not queued we will update its parameters at next wakeup. */
 	if (!task_on_rq_queued(p))
@@ -1893,6 +2131,7 @@ const struct sched_class dl_sched_class = {
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_dl,
+	.migrate_task_rq	= migrate_task_rq_dl,
 	.set_cpus_allowed       = set_cpus_allowed_dl,
 	.rq_online              = rq_online_dl,
 	.rq_offline             = rq_offline_dl,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 28b7a74..3069e45 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -244,6 +244,7 @@ bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64 new_bw)
 	       dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw;
 }
 
+void dl_change_utilization(struct task_struct *p, u64 new_bw);
 extern void init_dl_bw(struct dl_bw *dl_b);
 
 #ifdef CONFIG_CGROUP_SCHED
@@ -1493,6 +1494,7 @@ extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime
 extern struct dl_bandwidth def_dl_bandwidth;
 extern void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime);
 extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
+extern void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se);
 
 unsigned long to_ratio(u64 period, u64 runtime);
 
-- 
2.7.4


* [PATCH 03/10] sched/deadline: fix the update of the total -deadline utilization
From: luca abeni @ 2017-05-18 20:13 UTC
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Claudio Scordino,
	Steven Rostedt, Tommaso Cucinotta, Daniel Bristot de Oliveira,
	Joel Fernandes, Mathieu Poirier, Luca Abeni

From: Luca Abeni <luca.abeni@santannapisa.it>

Now that the inactive timer can be armed to fire at the 0-lag time,
it is possible to use inactive_task_timer() to update the total
-deadline utilization (dl_b->total_bw) at the correct time, fixing
dl_overflow() and __setparam_dl().

Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
---
 kernel/sched/core.c     | 38 ++++++++++++++------------------------
 kernel/sched/deadline.c | 28 +++++++++++++---------------
 2 files changed, 27 insertions(+), 39 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5d49374..57856b2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2487,9 +2487,6 @@ static inline int dl_bw_cpus(int i)
  * allocated bandwidth to reflect the new situation.
  *
  * This function is called while holding p's rq->lock.
- *
- * XXX we should delay bw change until the task's 0-lag point, see
- * __setparam_dl().
  */
 static int dl_overflow(struct task_struct *p, int policy,
 		       const struct sched_attr *attr)
@@ -2514,16 +2511,29 @@ static int dl_overflow(struct task_struct *p, int policy,
 	cpus = dl_bw_cpus(task_cpu(p));
 	if (dl_policy(policy) && !task_has_dl_policy(p) &&
 	    !__dl_overflow(dl_b, cpus, 0, new_bw)) {
+		if (hrtimer_active(&p->dl.inactive_timer))
+			__dl_clear(dl_b, p->dl.dl_bw);
 		__dl_add(dl_b, new_bw);
 		err = 0;
 	} else if (dl_policy(policy) && task_has_dl_policy(p) &&
 		   !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
+		/*
+		 * XXX this is slightly incorrect: when the task
+		 * utilization decreases, we should delay the total
+		 * utilization change until the task's 0-lag point.
+		 * But this would require to set the task's "inactive
+		 * timer" when the task is not inactive.
+		 */
 		__dl_clear(dl_b, p->dl.dl_bw);
 		__dl_add(dl_b, new_bw);
 		dl_change_utilization(p, new_bw);
 		err = 0;
 	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
-		__dl_clear(dl_b, p->dl.dl_bw);
+		/*
+		 * Do not decrease the total deadline utilization here,
+		 * switched_from_dl() will take care to do it at the correct
+		 * (0-lag) time.
+		 */
 		err = 0;
 	}
 	raw_spin_unlock(&dl_b->lock);
@@ -4032,26 +4042,6 @@ __setparam_dl(struct task_struct *p, const struct sched_attr *attr)
 	dl_se->dl_period = attr->sched_period ?: dl_se->dl_deadline;
 	dl_se->flags = attr->sched_flags;
 	dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
-
-	/*
-	 * Changing the parameters of a task is 'tricky' and we're not doing
-	 * the correct thing -- also see task_dead_dl() and switched_from_dl().
-	 *
-	 * What we SHOULD do is delay the bandwidth release until the 0-lag
-	 * point. This would include retaining the task_struct until that time
-	 * and change dl_overflow() to not immediately decrement the current
-	 * amount.
-	 *
-	 * Instead we retain the current runtime/deadline and let the new
-	 * parameters take effect after the current reservation period lapses.
-	 * This is safe (albeit pessimistic) because the 0-lag point is always
-	 * before the current scheduling deadline.
-	 *
-	 * We can still have temporary overloads because we do not delay the
-	 * change in bandwidth until that time; so admission control is
-	 * not on the safe side. It does however guarantee tasks will never
-	 * consume more than promised.
-	 */
 }
 
 /*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 147cbe6..bb71d65 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -175,8 +175,14 @@ static void task_non_contending(struct task_struct *p)
 	if (zerolag_time < 0) {
 		if (dl_task(p))
 			sub_running_bw(dl_se->dl_bw, dl_rq);
-		if (!dl_task(p) || p->state == TASK_DEAD)
+		if (!dl_task(p) || p->state == TASK_DEAD) {
+			struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+
+			raw_spin_lock(&dl_b->lock);
+			__dl_clear(dl_b, p->dl.dl_bw);
 			__dl_clear_params(p);
+			raw_spin_unlock(&dl_b->lock);
+		}
 
 		return;
 	}
@@ -1004,10 +1010,16 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 	rq = task_rq_lock(p, &rf);
 
 	if (!dl_task(p) || p->state == TASK_DEAD) {
+		struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+
 		if (p->state == TASK_DEAD && dl_se->dl_non_contending) {
 			sub_running_bw(p->dl.dl_bw, dl_rq_of_se(&p->dl));
 			dl_se->dl_non_contending = 0;
 		}
+
+		raw_spin_lock(&dl_b->lock);
+		__dl_clear(dl_b, p->dl.dl_bw);
+		raw_spin_unlock(&dl_b->lock);
 		__dl_clear_params(p);
 
 		goto unlock;
@@ -1534,19 +1546,6 @@ static void task_fork_dl(struct task_struct *p)
 	 */
 }
 
-static void task_dead_dl(struct task_struct *p)
-{
-	struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
-
-	/*
-	 * Since we are TASK_DEAD we won't slip out of the domain!
-	 */
-	raw_spin_lock_irq(&dl_b->lock);
-	/* XXX we should retain the bw until 0-lag */
-	dl_b->total_bw -= p->dl.dl_bw;
-	raw_spin_unlock_irq(&dl_b->lock);
-}
-
 static void set_curr_task_dl(struct rq *rq)
 {
 	struct task_struct *p = rq->curr;
@@ -2141,7 +2140,6 @@ const struct sched_class dl_sched_class = {
 	.set_curr_task		= set_curr_task_dl,
 	.task_tick		= task_tick_dl,
 	.task_fork              = task_fork_dl,
-	.task_dead		= task_dead_dl,
 
 	.prio_changed           = prio_changed_dl,
 	.switched_from		= switched_from_dl,
-- 
2.7.4


* [PATCH 04/10] sched/deadline: implement GRUB accounting
From: luca abeni @ 2017-05-18 20:13 UTC
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Claudio Scordino,
	Steven Rostedt, Tommaso Cucinotta, Daniel Bristot de Oliveira,
	Joel Fernandes, Mathieu Poirier, Luca Abeni

From: Luca Abeni <luca.abeni@santannapisa.it>

According to the GRUB (Greedy Reclamation of Unused Bandwidth)
reclaiming algorithm, the runtime is not decreased as "dq = -dt",
but as "dq = -Uact dt" (where Uact is the per-runqueue active
utilization).
Hence, this commit modifies the runtime accounting rule in
update_curr_dl() to implement the GRUB rule.
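
As a quick numeric check (standalone sketch, not kernel code; BW_SHIFT is
the fixed-point shift introduced by this patch), with Uact = 0.5 a task
that executes for 10ms is charged only about 5ms of runtime, which is how
the unused bandwidth gets reclaimed:

#include <stdio.h>
#include <stdint.h>

#define BW_SHIFT	20

int main(void)
{
	uint64_t running_bw = 1ULL << (BW_SHIFT - 1);	/* Uact = 0.5 */
	uint64_t delta_exec = 10000000;			/* ran for 10ms */

	/* same arithmetic as grub_reclaim() below: dq = -Uact dt */
	uint64_t charged = (delta_exec * running_bw) >> BW_SHIFT;

	printf("charged %llu ns out of %llu ns executed\n",
	       (unsigned long long)charged, (unsigned long long)delta_exec);
	return 0;
}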

Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
---
 kernel/sched/core.c     |  4 ++--
 kernel/sched/deadline.c | 17 +++++++++++++++++
 kernel/sched/sched.h    |  2 ++
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 57856b2..14dd97e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2435,7 +2435,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 unsigned long to_ratio(u64 period, u64 runtime)
 {
 	if (runtime == RUNTIME_INF)
-		return 1ULL << 20;
+		return BW_UNIT;
 
 	/*
 	 * Doing this here saves a lot of checks in all
@@ -2445,7 +2445,7 @@ unsigned long to_ratio(u64 period, u64 runtime)
 	if (period == 0)
 		return 0;
 
-	return div64_u64(runtime << 20, period);
+	return div64_u64(runtime << BW_SHIFT, period);
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index bb71d65..4e0c02c 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -918,6 +918,22 @@ int dl_runtime_exceeded(struct sched_dl_entity *dl_se)
 extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
 
 /*
+ * This function implements the GRUB accounting rule:
+ * according to the GRUB reclaiming algorithm, the runtime is
+ * not decreased as "dq = -dt", but as "dq = -Uact dt", where
+ * Uact is the (per-runqueue) active utilization.
+ * Since rq->dl.running_bw contains Uact * 2^BW_SHIFT, the result
+ * has to be shifted right by BW_SHIFT.
+ */
+u64 grub_reclaim(u64 delta, struct rq *rq)
+{
+	delta *= rq->dl.running_bw;
+	delta >>= BW_SHIFT;
+
+	return delta;
+}
+
+/*
  * Update the current task's runtime statistics (provided it is still
  * a -deadline task and has not been removed from the dl_rq).
  */
@@ -959,6 +975,7 @@ static void update_curr_dl(struct rq *rq)
 
 	sched_rt_avg_update(rq, delta_exec);
 
+	delta_exec = grub_reclaim(delta_exec, rq);
 	dl_se->runtime -= delta_exec;
 
 throttle:
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3069e45..c0001ab 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1496,6 +1496,8 @@ extern void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime
 extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
 extern void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se);
 
+#define BW_SHIFT	20
+#define BW_UNIT		(1 << BW_SHIFT)
 unsigned long to_ratio(u64 period, u64 runtime);
 
 extern void init_entity_runnable_average(struct sched_entity *se);
-- 
2.7.4


* [PATCH 05/10] sched/deadline: do not reclaim the whole CPU bandwidth
From: luca abeni @ 2017-05-18 20:13 UTC
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Claudio Scordino,
	Steven Rostedt, Tommaso Cucinotta, Daniel Bristot de Oliveira,
	Joel Fernandes, Mathieu Poirier, Luca Abeni

From: Luca Abeni <luca.abeni@santannapisa.it>

The original GRUB algorithm tends to reclaim 100% of the CPU time,
and this allows a CPU hog to starve non-deadline tasks.
To address this issue, allow the scheduler to reclaim only a
specified fraction of CPU time, stored in the new "bw_ratio"
field of the dl runqueue structure.
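
As a worked example (standalone sketch, not kernel code; 950000/1000000
are the usual sched_rt_runtime_us/sched_rt_period_us defaults, i.e.
Umax = 0.95), bw_ratio ends up being about 256/0.95, so a task running
while Uact = 0.5 is charged at rate Uact/Umax (about 0.53) rather than
plain Uact:

#include <stdio.h>
#include <stdint.h>

#define BW_SHIFT	20
#define RATIO_SHIFT	8

int main(void)
{
	uint64_t rt_runtime = 950000000;	/* 950ms (default) */
	uint64_t rt_period  = 1000000000;	/*   1s (default)  */

	/* like init_dl_rq_bw_ratio(): 1/Umax in RATIO_SHIFT fixed point */
	uint64_t bw_ratio = ((rt_period << BW_SHIFT) / rt_runtime)
				>> (BW_SHIFT - RATIO_SHIFT);

	uint64_t running_bw = 1ULL << (BW_SHIFT - 1);	/* Uact = 0.5 */
	uint64_t delta = 10000000;			/* ran for 10ms */

	/* modified rule: dq = -(Uact / Umax) dt */
	uint64_t charged = (delta * running_bw * bw_ratio)
				>> (BW_SHIFT + RATIO_SHIFT);

	printf("bw_ratio=%llu, charged %llu ns out of %llu ns\n",
	       (unsigned long long)bw_ratio, (unsigned long long)charged,
	       (unsigned long long)delta);
	return 0;
}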

Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
---
 kernel/sched/core.c     | 11 +++++++++++
 kernel/sched/deadline.c | 12 +++++++++++-
 kernel/sched/sched.h    |  8 ++++++++
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 14dd97e..84cd00f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6754,6 +6754,16 @@ static int sched_dl_global_validate(void)
 	return ret;
 }
 
+void init_dl_rq_bw_ratio(struct dl_rq *dl_rq)
+{
+	if (global_rt_runtime() == RUNTIME_INF) {
+		dl_rq->bw_ratio = 1 << RATIO_SHIFT;
+	} else {
+		dl_rq->bw_ratio = to_ratio(global_rt_runtime(),
+			  global_rt_period()) >> (BW_SHIFT - RATIO_SHIFT);
+	}
+}
+
 static void sched_dl_do_global(void)
 {
 	u64 new_bw = -1;
@@ -6779,6 +6789,7 @@ static void sched_dl_do_global(void)
 		raw_spin_unlock_irqrestore(&dl_b->lock, flags);
 
 		rcu_read_unlock_sched();
+		init_dl_rq_bw_ratio(&cpu_rq(cpu)->dl);
 	}
 }
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 4e0c02c..f4d36aa 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -268,6 +268,7 @@ void init_dl_rq(struct dl_rq *dl_rq)
 #endif
 
 	dl_rq->running_bw = 0;
+	init_dl_rq_bw_ratio(dl_rq);
 }
 
 #ifdef CONFIG_SMP
@@ -924,11 +925,20 @@ extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
  * Uact is the (per-runqueue) active utilization.
  * Since rq->dl.running_bw contains Uact * 2^BW_SHIFT, the result
  * has to be shifted right by BW_SHIFT.
+ * To reclaim only a fraction Umax of the CPU time, the
+ * runtime accounting rule is modified as
+ * "dq = -Uact / Umax dt"; since rq->dl.bw_ratio contains
+ * 2^RATIO_SHIFT / Umax, delta is multiplied by bw_ratio and shifted
+ * right by RATIO_SHIFT.
+ * Since delta is a 64 bit variable, to have an overflow its value
+ * should be larger than 2^(64 - 20 - 8), which is more than 64 seconds.
+ * So, overflow is not an issue here.
  */
 u64 grub_reclaim(u64 delta, struct rq *rq)
 {
 	delta *= rq->dl.running_bw;
-	delta >>= BW_SHIFT;
+	delta *= rq->dl.bw_ratio;
+	delta >>= BW_SHIFT + RATIO_SHIFT;
 
 	return delta;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c0001ab..cd20da1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -565,6 +565,12 @@ struct dl_rq {
 	 * task blocks
 	 */
 	u64 running_bw;
+
+	/*
+	 * Inverse of the fraction of CPU utilization that can be reclaimed
+	 * by the GRUB algorithm.
+	 */
+	u64 bw_ratio;
 };
 
 #ifdef CONFIG_SMP
@@ -1495,9 +1501,11 @@ extern struct dl_bandwidth def_dl_bandwidth;
 extern void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime);
 extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
 extern void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se);
+extern void init_dl_rq_bw_ratio(struct dl_rq *dl_rq);
 
 #define BW_SHIFT	20
 #define BW_UNIT		(1 << BW_SHIFT)
+#define RATIO_SHIFT	8
 unsigned long to_ratio(u64 period, u64 runtime);
 
 extern void init_entity_runnable_average(struct sched_entity *se);
-- 
2.7.4


* [PATCH 06/10] sched/deadline: make GRUB a task's flag
From: luca abeni @ 2017-05-18 20:13 UTC
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Claudio Scordino,
	Steven Rostedt, Tommaso Cucinotta, Daniel Bristot de Oliveira,
	Joel Fernandes, Mathieu Poirier, Luca Abeni

From: Luca Abeni <luca.abeni@santannapisa.it>

This patch introduces the SCHED_FLAG_RECLAIM flag to specify
that a DL task is allowed to reclaim unused CPU time (using
the GRUB algorithm).

Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
---
 include/uapi/linux/sched.h | 1 +
 kernel/sched/core.c        | 3 ++-
 kernel/sched/deadline.c    | 3 ++-
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 5f0fe01..e2a6c7b 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -47,5 +47,6 @@
  * For the sched_{set,get}attr() calls
  */
 #define SCHED_FLAG_RESET_ON_FORK	0x01
+#define SCHED_FLAG_RECLAIM		0x02
 
 #endif /* _UAPI_LINUX_SCHED_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 84cd00f..7fcc3bd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4207,7 +4207,8 @@ static int __sched_setscheduler(struct task_struct *p,
 			return -EINVAL;
 	}
 
-	if (attr->sched_flags & ~(SCHED_FLAG_RESET_ON_FORK))
+	if (attr->sched_flags &
+		~(SCHED_FLAG_RESET_ON_FORK | SCHED_FLAG_RECLAIM))
 		return -EINVAL;
 
 	/*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f4d36aa..0f545c1 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -985,7 +985,8 @@ static void update_curr_dl(struct rq *rq)
 
 	sched_rt_avg_update(rq, delta_exec);
 
-	delta_exec = grub_reclaim(delta_exec, rq);
+	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
+		delta_exec = grub_reclaim(delta_exec, rq);
 	dl_se->runtime -= delta_exec;
 
 throttle:
-- 
2.7.4


* [PATCH 07/10] sched/deadline: track the "total rq utilization" too
From: luca abeni @ 2017-05-18 20:13 UTC
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Claudio Scordino,
	Steven Rostedt, Tommaso Cucinotta, Daniel Bristot de Oliveira,
	Joel Fernandes, Mathieu Poirier, Luca Abeni

From: Luca Abeni <luca.abeni@santannapisa.it>

The total rq utilization is defined as the sum of the utilizations of
the tasks that are "assigned" to a runqueue, independently of their
state (TASK_RUNNING or blocked).
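
For example (toy numbers, plain floating point instead of the kernel's
fixed point): a task that blocked and already passed its 0-lag time keeps
contributing to this utilization, but not to running_bw; the gap between
the two is what the next patches use as the "inactive utilization".

#include <stdio.h>

int main(void)
{
	double u_a = 0.4, u_b = 0.2;	/* two tasks assigned to this rq */

	double this_bw = u_a + u_b;	/* both are still "assigned" here   */
	double running_bw = u_a;	/* B blocked, its 0-lag time passed */

	printf("this_bw=%.1f running_bw=%.1f Uinact=%.1f\n",
	       this_bw, running_bw, this_bw - running_bw);
	return 0;
}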

Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Claudio Scordino <claudio@evidence.eu.com>
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
---
 kernel/sched/deadline.c | 118 ++++++++++++++++++++++++++++++++++--------------
 kernel/sched/sched.h    |  11 +++++
 2 files changed, 95 insertions(+), 34 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0f545c1..c224c0dd 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -51,6 +51,7 @@ void add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
 	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
 	dl_rq->running_bw += dl_bw;
 	SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
+	SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
 }
 
 static inline
@@ -65,25 +66,52 @@ void sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
 		dl_rq->running_bw = 0;
 }
 
+static inline
+void add_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
+{
+	u64 old = dl_rq->this_bw;
+
+	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	dl_rq->this_bw += dl_bw;
+	SCHED_WARN_ON(dl_rq->this_bw < old); /* overflow */
+}
+
+static inline
+void sub_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
+{
+	u64 old = dl_rq->this_bw;
+
+	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	dl_rq->this_bw -= dl_bw;
+	SCHED_WARN_ON(dl_rq->this_bw > old); /* underflow */
+	if (dl_rq->this_bw > old)
+		dl_rq->this_bw = 0;
+	SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
+}
+
 void dl_change_utilization(struct task_struct *p, u64 new_bw)
 {
-	if (task_on_rq_queued(p))
-		return;
+	struct rq *rq;
 
-	if (!p->dl.dl_non_contending)
+	if (task_on_rq_queued(p))
 		return;
 
-	sub_running_bw(p->dl.dl_bw, &task_rq(p)->dl);
-	p->dl.dl_non_contending = 0;
-	/*
-	 * If the timer handler is currently running and the
-	 * timer cannot be cancelled, inactive_task_timer()
-	 * will see that dl_not_contending is not set, and
-	 * will not touch the rq's active utilization,
-	 * so we are still safe.
-	 */
-	if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
-		put_task_struct(p);
+	rq = task_rq(p);
+	if (p->dl.dl_non_contending) {
+		sub_running_bw(p->dl.dl_bw, &rq->dl);
+		p->dl.dl_non_contending = 0;
+		/*
+		 * If the timer handler is currently running and the
+		 * timer cannot be cancelled, inactive_task_timer()
+		 * will see that dl_not_contending is not set, and
+		 * will not touch the rq's active utilization,
+		 * so we are still safe.
+		 */
+		if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
+			put_task_struct(p);
+	}
+	sub_rq_bw(p->dl.dl_bw, &rq->dl);
+	add_rq_bw(new_bw, &rq->dl);
 }
 
 /*
@@ -178,6 +206,8 @@ static void task_non_contending(struct task_struct *p)
 		if (!dl_task(p) || p->state == TASK_DEAD) {
 			struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
 
+			if (p->state == TASK_DEAD)
+				sub_rq_bw(p->dl.dl_bw, &rq->dl);
 			raw_spin_lock(&dl_b->lock);
 			__dl_clear(dl_b, p->dl.dl_bw);
 			__dl_clear_params(p);
@@ -192,7 +222,7 @@ static void task_non_contending(struct task_struct *p)
 	hrtimer_start(timer, ns_to_ktime(zerolag_time), HRTIMER_MODE_REL);
 }
 
-static void task_contending(struct sched_dl_entity *dl_se)
+static void task_contending(struct sched_dl_entity *dl_se, int flags)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 
@@ -203,6 +233,9 @@ static void task_contending(struct sched_dl_entity *dl_se)
 	if (dl_se->dl_runtime == 0)
 		return;
 
+	if (flags & ENQUEUE_MIGRATED)
+		add_rq_bw(dl_se->dl_bw, dl_rq);
+
 	if (dl_se->dl_non_contending) {
 		dl_se->dl_non_contending = 0;
 		/*
@@ -268,6 +301,7 @@ void init_dl_rq(struct dl_rq *dl_rq)
 #endif
 
 	dl_rq->running_bw = 0;
+	dl_rq->this_bw = 0;
 	init_dl_rq_bw_ratio(dl_rq);
 }
 
@@ -1042,6 +1076,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 
 		if (p->state == TASK_DEAD && dl_se->dl_non_contending) {
 			sub_running_bw(p->dl.dl_bw, dl_rq_of_se(&p->dl));
+			sub_rq_bw(p->dl.dl_bw, dl_rq_of_se(&p->dl));
 			dl_se->dl_non_contending = 0;
 		}
 
@@ -1207,7 +1242,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
 	 * we want a replenishment of its runtime.
 	 */
 	if (flags & ENQUEUE_WAKEUP) {
-		task_contending(dl_se);
+		task_contending(dl_se, flags);
 		update_dl_entity(dl_se, pi_se);
 	} else if (flags & ENQUEUE_REPLENISH) {
 		replenish_dl_entity(dl_se, pi_se);
@@ -1260,8 +1295,10 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	if (!p->dl.dl_throttled && dl_is_constrained(&p->dl))
 		dl_check_constrained_dl(&p->dl);
 
-	if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & ENQUEUE_RESTORE)
+	if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & ENQUEUE_RESTORE) {
+		add_rq_bw(p->dl.dl_bw, &rq->dl);
 		add_running_bw(p->dl.dl_bw, &rq->dl);
+	}
 
 	/*
 	 * If p is throttled, we do not enqueue it. In fact, if it exhausted
@@ -1277,7 +1314,7 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	 */
 	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
 		if (flags & ENQUEUE_WAKEUP)
-			task_contending(&p->dl);
+			task_contending(&p->dl, flags);
 
 		return;
 	}
@@ -1299,8 +1336,10 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	update_curr_dl(rq);
 	__dequeue_task_dl(rq, p, flags);
 
-	if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & DEQUEUE_SAVE)
+	if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & DEQUEUE_SAVE) {
 		sub_running_bw(p->dl.dl_bw, &rq->dl);
+		sub_rq_bw(p->dl.dl_bw, &rq->dl);
+	}
 
 	/*
 	 * This check allows to start the inactive timer (or to immediately
@@ -1394,7 +1433,7 @@ static void migrate_task_rq_dl(struct task_struct *p)
 {
 	struct rq *rq;
 
-	if (!(p->state == TASK_WAKING) || !(p->dl.dl_non_contending))
+	if (p->state != TASK_WAKING)
 		return;
 
 	rq = task_rq(p);
@@ -1404,18 +1443,20 @@ static void migrate_task_rq_dl(struct task_struct *p)
 	 * rq->lock is not... So, lock it
 	 */
 	raw_spin_lock(&rq->lock);
-	sub_running_bw(p->dl.dl_bw, &rq->dl);
-	p->dl.dl_non_contending = 0;
-	/*
-	 * If the timer handler is currently running and the
-	 * timer cannot be cancelled, inactive_task_timer()
-	 * will see that dl_not_contending is not set, and
-	 * will not touch the rq's active utilization,
-	 * so we are still safe.
-	 */
-	if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
-		put_task_struct(p);
-
+	if (p->dl.dl_non_contending) {
+		sub_running_bw(p->dl.dl_bw, &rq->dl);
+		p->dl.dl_non_contending = 0;
+		/*
+		 * If the timer handler is currently running and the
+		 * timer cannot be cancelled, inactive_task_timer()
+		 * will see that dl_not_contending is not set, and
+		 * will not touch the rq's active utilization,
+		 * so we are still safe.
+		 */
+		if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
+			put_task_struct(p);
+	}
+	sub_rq_bw(p->dl.dl_bw, &rq->dl);
 	raw_spin_unlock(&rq->lock);
 }
 
@@ -1858,7 +1899,9 @@ static int push_dl_task(struct rq *rq)
 
 	deactivate_task(rq, next_task, 0);
 	sub_running_bw(next_task->dl.dl_bw, &rq->dl);
+	sub_rq_bw(next_task->dl.dl_bw, &rq->dl);
 	set_task_cpu(next_task, later_rq->cpu);
+	add_rq_bw(next_task->dl.dl_bw, &later_rq->dl);
 	add_running_bw(next_task->dl.dl_bw, &later_rq->dl);
 	activate_task(later_rq, next_task, 0);
 	ret = 1;
@@ -1948,7 +1991,9 @@ static void pull_dl_task(struct rq *this_rq)
 
 			deactivate_task(src_rq, p, 0);
 			sub_running_bw(p->dl.dl_bw, &src_rq->dl);
+			sub_rq_bw(p->dl.dl_bw, &src_rq->dl);
 			set_task_cpu(p, this_cpu);
+			add_rq_bw(p->dl.dl_bw, &this_rq->dl);
 			add_running_bw(p->dl.dl_bw, &this_rq->dl);
 			activate_task(this_rq, p, 0);
 			dmin = p->dl.deadline;
@@ -2057,6 +2102,9 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
 	if (task_on_rq_queued(p) && p->dl.dl_runtime)
 		task_non_contending(p);
 
+	if (!task_on_rq_queued(p))
+		sub_rq_bw(p->dl.dl_bw, &rq->dl);
+
 	/*
 	 * We cannot use inactive_task_timer() to invoke sub_running_bw()
 	 * at the 0-lag time, because the task could have been migrated
@@ -2086,9 +2134,11 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
 		put_task_struct(p);
 
 	/* If p is not queued we will update its parameters at next wakeup. */
-	if (!task_on_rq_queued(p))
-		return;
+	if (!task_on_rq_queued(p)) {
+		add_rq_bw(p->dl.dl_bw, &rq->dl);
 
+		return;
+	}
 	/*
 	 * If p is boosted we already updated its params in
 	 * rt_mutex_setprio()->enqueue_task(..., ENQUEUE_REPLENISH),
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cd20da1..0939ac4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -567,6 +567,17 @@ struct dl_rq {
 	u64 running_bw;
 
 	/*
+	 * Utilization of the tasks "assigned" to this runqueue (including
+	 * the tasks that are in runqueue and the tasks that executed on this
+	 * CPU and blocked). Increased when a task moves to this runqueue, and
+	 * decreased when the task moves away (migrates, changes scheduling
+	 * policy, or terminates).
+	 * This is needed to compute the "inactive utilization" for the
+	 * runqueue (inactive utilization = this_bw - running_bw).
+	 */
+	u64 this_bw;
+
+	/*
 	 * Inverse of the fraction of CPU utilization that can be reclaimed
 	 * by the GRUB algorithm.
 	 */
-- 
2.7.4


* [PATCH 08/10] sched/deadline: base GRUB reclaiming on the inactive utilization
From: luca abeni @ 2017-05-18 20:13 UTC
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Claudio Scordino,
	Steven Rostedt, Tommaso Cucinotta, Daniel Bristot de Oliveira,
	Joel Fernandes, Mathieu Poirier, Luca Abeni

From: Luca Abeni <luca.abeni@santannapisa.it>

Instead of decreasing the runtime as "dq = -Uact dt" (possibly
divided by the maximum utilization available for deadline tasks),
decrease it as "dq = -max{u, (1 - Uinact)} dt", where u is the task
utilization and Uinact is the "inactive utilization".
In this way, the maximum fraction of CPU time that can be reclaimed
is given by the total utilization of deadline tasks.
This approach solves a fairness issue with "traditional" global GRUB
reclaiming: with the traditional GRUB algorithm, if tasks are
allocated to the various cores in a non-uniform way, the
reclaiming mechanism allows some tasks to reclaim more time than
others. This issue is visible when starting 11 CPU-hungry tasks with
runtime 10ms and period 30ms (total utilization 3.666) on a 4-core
system: some tasks will receive much more than the reserved runtime
(thanks to the reclaiming mechanism), while other tasks will receive
less than the reserved runtime.
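
The following standalone sketch (not kernel code; the charge_rate() helper
and the utilization values are made up) shows how the new rule behaves,
including the clamping to the task's own utilization when Uinact is large:

#include <stdio.h>
#include <stdint.h>

#define BW_SHIFT	20
#define BW_UNIT		(1ULL << BW_SHIFT)

/* depletion rate "max{u, (1 - Uinact)}" in <<BW_SHIFT fixed point */
static uint64_t charge_rate(uint64_t u, uint64_t this_bw, uint64_t running_bw)
{
	uint64_t u_inact = this_bw - running_bw;

	/* compare u_inact with 1 - u, because u_inact can exceed 1 */
	if (u_inact > BW_UNIT - u)
		return u;		/* clamped to the task's utilization */
	return BW_UNIT - u_inact;
}

int main(void)
{
	uint64_t u = BW_UNIT / 4;	/* task utilization 0.25 */

	/* Uinact = 0.9: 1 - Uinact would be below u, so charge at u = 0.25 */
	printf("rate = %.2f\n", (double)charge_rate(u, 12 * BW_UNIT / 10,
						    3 * BW_UNIT / 10) / BW_UNIT);

	/* Uinact = 0.2: charge at 1 - Uinact = 0.8 */
	printf("rate = %.2f\n", (double)charge_rate(u, 5 * BW_UNIT / 10,
						    3 * BW_UNIT / 10) / BW_UNIT);
	return 0;
}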

Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
---
 kernel/sched/deadline.c | 40 ++++++++++++++++++++++------------------
 1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c224c0dd..4286bb8 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -955,26 +955,30 @@ extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
 /*
  * This function implements the GRUB accounting rule:
  * according to the GRUB reclaiming algorithm, the runtime is
- * not decreased as "dq = -dt", but as "dq = -Uact dt", where
- * Uact is the (per-runqueue) active utilization.
- * Since rq->dl.running_bw contains Uact * 2^BW_SHIFT, the result
- * has to be shifted right by BW_SHIFT.
- * To reclaim only a fraction Umax of the CPU time, the
- * runtime accounting rule is modified as
- * "dq = -Uact / Umax dt"; since rq->dl.bw_ratio contains
- * 2^RATIO_SHIFT / Umax, delta is multiplied by bw_ratio and shifted
- * right by RATIO_SHIFT.
- * Since delta is a 64 bit variable, to have an overflow its value
- * should be larger than 2^(64 - 20 - 8), which is more than 64 seconds.
- * So, overflow is not an issue here.
+ * not decreased as "dq = -dt", but as "dq = -max{u, (1 - Uinact)} dt",
+ * where u is the utilization of the task and Uinact is the
+ * (per-runqueue) inactive utilization, computed as the difference
+ * between the "total runqueue utilization" and the runqueue
+ * active utilization.
+ * Since rq->dl.running_bw and rq->dl.this_bw contain utilizations
+ * multiplied by 2^BW_SHIFT, the result has to be shifted right by BW_SHIFT.
  */
-u64 grub_reclaim(u64 delta, struct rq *rq)
+u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
 {
-	delta *= rq->dl.running_bw;
-	delta *= rq->dl.bw_ratio;
-	delta >>= BW_SHIFT + RATIO_SHIFT;
+	u64 u_inact = rq->dl.this_bw - rq->dl.running_bw; /* Utot - Uact */
+	u64 u_act;
 
-	return delta;
+	/*
+	 * Instead of computing max{u, (1 - u_inact)}, we compare
+	 * u_inact with 1 - u, because u_inact can be larger than 1
+	 * (so, 1 - u_inact would be negative leading to wrong results)
+	 */
+	if (u_inact > BW_UNIT - dl_se->dl_bw)
+		u_act = dl_se->dl_bw;
+	else
+		u_act = BW_UNIT - u_inact;
+
+	return (delta * u_act) >> BW_SHIFT;
 }
 
 /*
@@ -1020,7 +1024,7 @@ static void update_curr_dl(struct rq *rq)
 	sched_rt_avg_update(rq, delta_exec);
 
 	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
-		delta_exec = grub_reclaim(delta_exec, rq);
+		delta_exec = grub_reclaim(delta_exec, rq, &curr->dl);
 	dl_se->runtime -= delta_exec;
 
 throttle:
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 09/10] sched/deadline: also reclaim bandwidth not used by dl tasks
  2017-05-18 20:13 [PATCH 00/10] CPU reclaiming for SCHED_DEADLINE luca abeni
                   ` (7 preceding siblings ...)
  2017-05-18 20:13 ` [PATCH 08/10] sched/deadline: base GRUB reclaiming on the inactive utilization luca abeni
@ 2017-05-18 20:13 ` luca abeni
  2017-06-08  9:28   ` [tip:sched/core] sched/deadline: Reclaim " tip-bot for Luca Abeni
  2017-05-18 20:13 ` [PATCH 10/10] sched/deadline: documentation about GRUB reclaiming luca abeni
  9 siblings, 1 reply; 26+ messages in thread
From: luca abeni @ 2017-05-18 20:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Claudio Scordino,
	Steven Rostedt, Tommaso Cucinotta, Daniel Bristot de Oliveira,
	Joel Fernandes, Mathieu Poirier, Luca Abeni

From: Luca Abeni <luca.abeni@santannapisa.it>

This commit introduces a per-runqueue "extra utilization" that can be
reclaimed by deadline tasks. In this way, the maximum fraction of CPU
time that can be reclaimed by deadline tasks is fixed (and configurable)
and does not depend on the total deadline utilization.
The GRUB accounting rule is modified to add this "extra utilization"
to the inactive utilization of the runqueue, and to avoid reclaiming
more than a maximum fraction of the CPU time.
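
For illustration, here is a minimal user-space sketch (not part of the
patch) of the modified rule; BW_SHIFT = 20 and RATIO_SHIFT = 8 are
assumed, as in the comments below:

/*
 * Sketch of "dq = -max{u / Umax, (1 - Uinact - Uextra)} dt", mirroring
 * the modified grub_reclaim() below.
 */
#include <stdio.h>
#include <stdint.h>

#define BW_SHIFT    20
#define BW_UNIT     (1ULL << BW_SHIFT)
#define RATIO_SHIFT 8

static uint64_t reclaim(uint64_t delta, uint64_t u_inact, uint64_t extra_bw,
			uint64_t bw_ratio, uint64_t u_task)
{
	uint64_t u_act_min = (u_task * bw_ratio) >> RATIO_SHIFT;
	uint64_t u_act;

	if (u_inact + extra_bw > BW_UNIT - u_act_min)
		u_act = u_act_min;
	else
		u_act = BW_UNIT - u_inact - extra_bw;

	return (delta * u_act) >> BW_SHIFT;
}

int main(void)
{
	uint64_t bw_ratio = (1ULL << RATIO_SHIFT) * 100 / 95; /* 1/Umax, Umax = 0.95 */
	uint64_t u_inact  = BW_UNIT / 8;	/* Uinact = 0.125  */
	uint64_t extra_bw = BW_UNIT / 16;	/* Uextra = 0.0625 */
	uint64_t u_task   = BW_UNIT / 4;	/* u      = 0.25   */

	/* charged at rate 1 - 0.125 - 0.0625 = 0.8125 */
	printf("10ms of execution are charged as %llu ns\n",
	       (unsigned long long)reclaim(10000000ULL, u_inact, extra_bw,
					   bw_ratio, u_task));
	return 0;
}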

Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
---
 kernel/sched/core.c     | 17 ++++++++++-------
 kernel/sched/deadline.c | 42 +++++++++++++++++++++++++++---------------
 kernel/sched/sched.h    | 37 +++++++++++++++++++++++++++++++++++--
 3 files changed, 72 insertions(+), 24 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7fcc3bd..0628c50 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2456,7 +2456,7 @@ inline struct dl_bw *dl_bw_of(int i)
 	return &cpu_rq(i)->rd->dl_bw;
 }
 
-static inline int dl_bw_cpus(int i)
+inline int dl_bw_cpus(int i)
 {
 	struct root_domain *rd = cpu_rq(i)->rd;
 	int cpus = 0;
@@ -2474,7 +2474,7 @@ inline struct dl_bw *dl_bw_of(int i)
 	return &cpu_rq(i)->dl.dl_bw;
 }
 
-static inline int dl_bw_cpus(int i)
+inline int dl_bw_cpus(int i)
 {
 	return 1;
 }
@@ -2512,8 +2512,8 @@ static int dl_overflow(struct task_struct *p, int policy,
 	if (dl_policy(policy) && !task_has_dl_policy(p) &&
 	    !__dl_overflow(dl_b, cpus, 0, new_bw)) {
 		if (hrtimer_active(&p->dl.inactive_timer))
-			__dl_clear(dl_b, p->dl.dl_bw);
-		__dl_add(dl_b, new_bw);
+			__dl_clear(dl_b, p->dl.dl_bw, cpus);
+		__dl_add(dl_b, new_bw, cpus);
 		err = 0;
 	} else if (dl_policy(policy) && task_has_dl_policy(p) &&
 		   !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
@@ -2524,8 +2524,8 @@ static int dl_overflow(struct task_struct *p, int policy,
 		 * But this would require to set the task's "inactive
 		 * timer" when the task is not inactive.
 		 */
-		__dl_clear(dl_b, p->dl.dl_bw);
-		__dl_add(dl_b, new_bw);
+		__dl_clear(dl_b, p->dl.dl_bw, cpus);
+		__dl_add(dl_b, new_bw, cpus);
 		dl_change_utilization(p, new_bw);
 		err = 0;
 	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
@@ -5527,7 +5527,7 @@ int task_can_attach(struct task_struct *p,
 			 * We will free resources in the source root_domain
 			 * later on (see set_cpus_allowed_dl()).
 			 */
-			__dl_add(dl_b, p->dl.dl_bw);
+			__dl_add(dl_b, p->dl.dl_bw, cpus);
 		}
 		raw_spin_unlock_irqrestore(&dl_b->lock, flags);
 		rcu_read_unlock_sched();
@@ -6759,9 +6759,12 @@ void init_dl_rq_bw_ratio(struct dl_rq *dl_rq)
 {
 	if (global_rt_runtime() == RUNTIME_INF) {
 		dl_rq->bw_ratio = 1 << RATIO_SHIFT;
+		dl_rq->extra_bw = 1 << BW_SHIFT;
 	} else {
 		dl_rq->bw_ratio = to_ratio(global_rt_runtime(),
 			  global_rt_period()) >> (BW_SHIFT - RATIO_SHIFT);
+		dl_rq->extra_bw = to_ratio(global_rt_period(),
+						    global_rt_runtime());
 	}
 }
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 4286bb8..1da44b3 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -209,7 +209,7 @@ static void task_non_contending(struct task_struct *p)
 			if (p->state == TASK_DEAD)
 				sub_rq_bw(p->dl.dl_bw, &rq->dl);
 			raw_spin_lock(&dl_b->lock);
-			__dl_clear(dl_b, p->dl.dl_bw);
+			__dl_clear(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
 			__dl_clear_params(p);
 			raw_spin_unlock(&dl_b->lock);
 		}
@@ -955,28 +955,40 @@ extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
 /*
  * This function implements the GRUB accounting rule:
  * according to the GRUB reclaiming algorithm, the runtime is
- * not decreased as "dq = -dt", but as "dq = -max{u, (1 - Uinact)} dt",
- * where u is the utilization of the task and Uinact is the
- * (per-runqueue) inactive utilization, computed as the difference
- * between the "total runqueue utilization" and the runqueue
- * active utilization.
+ * not decreased as "dq = -dt", but as
+ * "dq = -max{u / Umax, (1 - Uinact - Uextra)} dt",
+ * where u is the utilization of the task, Umax is the maximum reclaimable
+ * utilization, Uinact is the (per-runqueue) inactive utilization, computed
+ * as the difference between the "total runqueue utilization" and the
+ * runqueue active utilization, and Uextra is the (per runqueue) extra
+ * reclaimable utilization.
  * Since rq->dl.running_bw and rq->dl.this_bw contain utilizations
- * multiplied by 2^BW_SHIFT, the result has to be shifted right by BW_SHIFT.
+ * multiplied by 2^BW_SHIFT, the result has to be shifted right by
+ * BW_SHIFT.
+ * Since rq->dl.bw_ratio contains 1 / Umax multiplied by 2^RATIO_SHIFT,
+ * dl_bw is multiplied by rq->dl.bw_ratio and shifted right by RATIO_SHIFT.
+ * Since delta is a 64 bit variable, to have an overflow its value
+ * should be larger than 2^(64 - 20 - 8), which is more than 64 seconds.
+ * So, overflow is not an issue here.
  */
 u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
 {
 	u64 u_inact = rq->dl.this_bw - rq->dl.running_bw; /* Utot - Uact */
 	u64 u_act;
+	u64 u_act_min = (dl_se->dl_bw * rq->dl.bw_ratio) >> RATIO_SHIFT;
 
 	/*
-	 * Instead of computing max{u, (1 - u_inact)}, we compare
-	 * u_inact with 1 - u, because u_inact can be larger than 1
-	 * (so, 1 - u_inact would be negative leading to wrong results)
+	 * Instead of computing max{u * bw_ratio, (1 - u_inact - u_extra)},
+	 * we compare u_inact + rq->dl.extra_bw with
+	 * 1 - (u * rq->dl.bw_ratio >> RATIO_SHIFT), because
+	 * u_inact + rq->dl.extra_bw can be larger than
+	 * 1 (so, 1 - u_inact - rq->dl.extra_bw would be negative
+	 * leading to wrong results)
 	 */
-	if (u_inact > BW_UNIT - dl_se->dl_bw)
-		u_act = dl_se->dl_bw;
+	if (u_inact + rq->dl.extra_bw > BW_UNIT - u_act_min)
+		u_act = u_act_min;
 	else
-		u_act = BW_UNIT - u_inact;
+		u_act = BW_UNIT - u_inact - rq->dl.extra_bw;
 
 	return (delta * u_act) >> BW_SHIFT;
 }
@@ -1085,7 +1097,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 		}
 
 		raw_spin_lock(&dl_b->lock);
-		__dl_clear(dl_b, p->dl.dl_bw);
+		__dl_clear(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
 		raw_spin_unlock(&dl_b->lock);
 		__dl_clear_params(p);
 
@@ -2054,7 +2066,7 @@ static void set_cpus_allowed_dl(struct task_struct *p,
 		 * until we complete the update.
 		 */
 		raw_spin_lock(&src_dl_b->lock);
-		__dl_clear(src_dl_b, p->dl.dl_bw);
+		__dl_clear(src_dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
 		raw_spin_unlock(&src_dl_b->lock);
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0939ac4..fdbc30d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -219,22 +219,27 @@ static inline int dl_bandwidth_enabled(void)
 }
 
 extern struct dl_bw *dl_bw_of(int i);
+extern int dl_bw_cpus(int i);
 
 struct dl_bw {
 	raw_spinlock_t lock;
 	u64 bw, total_bw;
 };
 
+static inline void __dl_update(struct dl_bw *dl_b, s64 bw);
+
 static inline
-void __dl_clear(struct dl_bw *dl_b, u64 tsk_bw)
+void __dl_clear(struct dl_bw *dl_b, u64 tsk_bw, int cpus)
 {
 	dl_b->total_bw -= tsk_bw;
+	__dl_update(dl_b, (s32)tsk_bw / cpus);
 }
 
 static inline
-void __dl_add(struct dl_bw *dl_b, u64 tsk_bw)
+void __dl_add(struct dl_bw *dl_b, u64 tsk_bw, int cpus)
 {
 	dl_b->total_bw += tsk_bw;
+	__dl_update(dl_b, -((s32)tsk_bw / cpus));
 }
 
 static inline
@@ -576,6 +581,7 @@ struct dl_rq {
 	 * runqueue (inactive utilization = this_bw - running_bw).
 	 */
 	u64 this_bw;
+	u64 extra_bw;
 
 	/*
 	 * Inverse of the fraction of CPU utilization that can be reclaimed
@@ -1958,6 +1964,33 @@ extern void nohz_balance_exit_idle(unsigned int cpu);
 static inline void nohz_balance_exit_idle(unsigned int cpu) { }
 #endif
 
+
+#ifdef CONFIG_SMP
+static inline
+void __dl_update(struct dl_bw *dl_b, s64 bw)
+{
+	struct root_domain *rd = container_of(dl_b, struct root_domain, dl_bw);
+	int i;
+
+	RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
+			 "sched RCU must be held");
+	for_each_cpu_and(i, rd->span, cpu_active_mask) {
+		struct rq *rq = cpu_rq(i);
+
+		rq->dl.extra_bw += bw;
+	}
+}
+#else
+static inline
+void __dl_update(struct dl_bw *dl_b, s64 bw)
+{
+	struct dl_rq *dl = container_of(dl_b, struct dl_rq, dl_bw);
+
+	dl->extra_bw += bw;
+}
+#endif
+
+
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 struct irqtime {
 	u64			total;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 10/10] sched/deadline: documentation about GRUB reclaiming
  2017-05-18 20:13 [PATCH 00/10] CPU reclaiming for SCHED_DEADLINE luca abeni
                   ` (8 preceding siblings ...)
  2017-05-18 20:13 ` [PATCH 09/10] sched/deadline: also reclaim bandwidth not used by dl tasks luca abeni
@ 2017-05-18 20:13 ` luca abeni
  2017-06-08  9:28   ` [tip:sched/core] sched/deadline: Add " tip-bot for Claudio Scordino
  9 siblings, 1 reply; 26+ messages in thread
From: luca abeni @ 2017-05-18 20:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Claudio Scordino,
	Steven Rostedt, Tommaso Cucinotta, Daniel Bristot de Oliveira,
	Joel Fernandes, Mathieu Poirier, Luca Abeni

From: Claudio Scordino <claudio@evidence.eu.com>

This patch adds the documentation about the GRUB reclaiming algorithm,
including a few details discussed on the mailing list.
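
As a usage illustration (not part of the patch), here is a minimal sketch
of how a task can request a reclaiming-enabled reservation with
sched_setattr(). It assumes the SCHED_FLAG_RECLAIM value introduced
earlier in this series, the usual SCHED_DEADLINE privileges, and a C
library whose <sys/syscall.h> exposes SYS_sched_setattr; the raw syscall
is used because glibc provides no wrapper:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE		6
#endif
#ifndef SCHED_FLAG_RECLAIM
#define SCHED_FLAG_RECLAIM	0x02	/* as defined by the uapi headers */
#endif

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
};

int main(void)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_policy	= SCHED_DEADLINE,
		.sched_flags	= SCHED_FLAG_RECLAIM,
		.sched_runtime	= 10 * 1000 * 1000,	/* 10 ms */
		.sched_deadline	= 30 * 1000 * 1000,	/* 30 ms */
		.sched_period	= 30 * 1000 * 1000,	/* 30 ms */
	};

	if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
		perror("sched_setattr");
		return 1;
	}

	/* ... periodic real-time work here: this reservation can now
	 * reclaim bandwidth left unused by other -deadline tasks ... */
	return 0;
}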

Signed-off-by: Claudio Scordino <claudio@evidence.eu.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
---
 Documentation/scheduler/sched-deadline.txt | 168 +++++++++++++++++++++++++++++
 1 file changed, 168 insertions(+)

diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt
index cbc1b46..e89e36e 100644
--- a/Documentation/scheduler/sched-deadline.txt
+++ b/Documentation/scheduler/sched-deadline.txt
@@ -7,6 +7,8 @@ CONTENTS
  0. WARNING
  1. Overview
  2. Scheduling algorithm
+   2.1 Main algorithm
+   2.2 Bandwidth reclaiming
  3. Scheduling Real-Time Tasks
    3.1 Definitions
    3.2 Schedulability Analysis for Uniprocessor Systems
@@ -44,6 +46,9 @@ CONTENTS
 2. Scheduling algorithm
 ==================
 
+2.1 Main algorithm
+------------------
+
  SCHED_DEADLINE uses three parameters, named "runtime", "period", and
  "deadline", to schedule tasks. A SCHED_DEADLINE task should receive
  "runtime" microseconds of execution time every "period" microseconds, and
@@ -113,6 +118,160 @@ CONTENTS
          remaining runtime = remaining runtime + runtime
 
 
+2.2 Bandwidth reclaiming
+------------------------
+
+ Bandwidth reclaiming for deadline tasks is based on the GRUB (Greedy
+ Reclamation of Unused Bandwidth) algorithm [15, 16, 17] and it is enabled
+ when the SCHED_FLAG_RECLAIM flag is set.
+
+ The following diagram illustrates the state names for tasks handled by GRUB:
+
+                             ------------
+                 (d)        |   Active   |
+              ------------->|            |
+              |             | Contending |
+              |              ------------
+              |                A      |
+          ----------           |      |
+         |          |          |      |
+         | Inactive |          |(b)   | (a)
+         |          |          |      |
+          ----------           |      |
+              A                |      V
+              |              ------------
+              |             |   Active   |
+              --------------|     Non    |
+                 (c)        | Contending |
+                             ------------
+
+ A task can be in one of the following states:
+
+  - ActiveContending: if it is ready for execution (or executing);
+
+  - ActiveNonContending: if it just blocked and has not yet surpassed the 0-lag
+    time;
+
+  - Inactive: if it is blocked and has surpassed the 0-lag time.
+
+ State transitions:
+
+  (a) When a task blocks, it does not become immediately inactive since its
+      bandwidth cannot be immediately reclaimed without breaking the
+      real-time guarantees. It therefore enters a transitional state called
+      ActiveNonContending. The scheduler arms the "inactive timer" to fire at
+      the 0-lag time, when the task's bandwidth can be reclaimed without
+      breaking the real-time guarantees.
+
+      The 0-lag time for a task entering the ActiveNonContending state is
+      computed as
+
+                        (runtime * dl_period)
+             deadline - ---------------------
+                             dl_runtime
+
+      where runtime is the remaining runtime, while dl_runtime and dl_period
+      are the reservation parameters.
+
+  (b) If the task wakes up before the inactive timer fires, the task re-enters
+      the ActiveContending state and the "inactive timer" is canceled.
+      In addition, if the task wakes up on a different runqueue, then
+      the task's utilization must be removed from the previous runqueue's active
+      utilization and must be added to the new runqueue's active utilization.
+      In order to avoid races between a task waking up on a runqueue and the
+      "inactive timer" running on a different CPU, the "dl_non_contending"
+      flag is used to indicate that a task is not on a runqueue but is active
+      (so, the flag is set when the task blocks and is cleared when the
+      "inactive timer" fires or when the task wakes up).
+
+  (c) When the "inactive timer" fires, the task enters the Inactive state and
+      its utilization is removed from the runqueue's active utilization.
+
+  (d) When an inactive task wakes up, it enters the ActiveContending state and
+      its utilization is added to the active utilization of the runqueue where
+      it has been enqueued.
+
+ For each runqueue, the GRUB algorithm keeps track of two different bandwidths:
+
+  - Active bandwidth (running_bw): this is the sum of the bandwidths of all
+    tasks in active state (i.e., ActiveContending or ActiveNonContending);
+
+  - Total bandwidth (this_bw): this is the sum of the bandwidths of all tasks
+    "belonging" to the runqueue, including the tasks in Inactive state.
+
+
+ The algorithm reclaims the bandwidth of the tasks in Inactive state.
+ It does so by decrementing the runtime of the executing task Ti at a pace equal
+ to
+
+           dq = -max{ Ui, (1 - Uinact) } dt
+
+ where Uinact is the inactive utilization, computed as (this_bw - running_bw),
+ and Ui is the bandwidth of task Ti.
+
+
+ Let's now see a trivial example of two deadline tasks with runtime equal
+ to 4 and period equal to 8 (i.e., bandwidth equal to 0.5):
+
+     A            Task T1
+     |
+     |                               |
+     |                               |
+     |--------                       |----
+     |       |                       V
+     |---|---|---|---|---|---|---|---|--------->t
+     0   1   2   3   4   5   6   7   8
+
+
+     A            Task T2
+     |
+     |                               |
+     |                               |
+     |       ------------------------|
+     |       |                       V
+     |---|---|---|---|---|---|---|---|--------->t
+     0   1   2   3   4   5   6   7   8
+
+
+     A            running_bw
+     |
+   1 -----------------               ------
+     |               |               |
+  0.5-               -----------------
+     |                               |
+     |---|---|---|---|---|---|---|---|--------->t
+     0   1   2   3   4   5   6   7   8
+
+
+  - Time t = 0:
+
+    Both tasks are ready for execution and therefore in ActiveContending state.
+    Suppose Task T1 is the first task to start execution.
+    Since there are no inactive tasks, its runtime is decreased as dq = -1 dt.
+
+  - Time t = 2:
+
+    Suppose that task T1 blocks.
+    Task T1 therefore enters the ActiveNonContending state. Since its remaining
+    runtime is equal to 2, its 0-lag time is equal to t = 4.
+    Task T2 starts executing, with its runtime still decreased as dq = -1 dt
+    since there are no inactive tasks.
+
+  - Time t = 4:
+
+    This is the 0-lag time for Task T1. Since it didn't wake up in the
+    meantime, it enters the Inactive state. Its bandwidth is removed from
+    running_bw.
+    Task T2 continues its execution. However, its runtime is now decreased as
+    dq = - 0.5 dt because Uinact = 0.5.
+    Task T2 therefore reclaims the bandwidth unused by Task T1.
+
+  - Time t = 8:
+
+    Task T1 wakes up. It enters the ActiveContending state again, and the
+    running_bw is incremented.
+
+
 3. Scheduling Real-Time Tasks
 =============================
 
@@ -330,6 +489,15 @@ CONTENTS
   14 - J. Erickson, U. Devi and S. Baruah. Improved tardiness bounds for
        Global EDF. Proceedings of the 22nd Euromicro Conference on
        Real-Time Systems, 2010.
+  15 - G. Lipari, S. Baruah, Greedy reclamation of unused bandwidth in
+       constant-bandwidth servers, 12th IEEE Euromicro Conference on Real-Time
+       Systems, 2000.
+  16 - L. Abeni, J. Lelli, C. Scordino, L. Palopoli, Greedy CPU reclaiming for
+       SCHED DEADLINE. In Proceedings of the Real-Time Linux Workshop (RTLWS),
+       Dusseldorf, Germany, 2014.
+  17 - L. Abeni, G. Lipari, A. Parri, Y. Sun, Multicore CPU reclaiming: parallel
+       or sequential?. In Proceedings of the 31st Annual ACM Symposium on Applied
+       Computing, 2016.
 
 
 4. Bandwidth management
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 01/10] sched/deadline: track the active utilization
  2017-05-18 20:13 ` [PATCH 01/10] sched/deadline: track the active utilization luca abeni
@ 2017-06-08  8:31   ` Ingo Molnar
  2017-06-08  8:43     ` Luca Abeni
  2017-06-08  9:23   ` [tip:sched/core] sched/deadline: Track " tip-bot for Luca Abeni
  1 sibling, 1 reply; 26+ messages in thread
From: Ingo Molnar @ 2017-06-08  8:31 UTC (permalink / raw)
  To: luca abeni
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Claudio Scordino, Steven Rostedt, Tommaso Cucinotta,
	Daniel Bristot de Oliveira, Joel Fernandes, Mathieu Poirier,
	Luca Abeni


* luca abeni <luca.abeni@santannapisa.it> wrote:

> From: Luca Abeni <luca.abeni@unitn.it>
> 
> Active utilization is defined as the total utilization of active
> (TASK_RUNNING) tasks queued on a runqueue. Hence, it is increased
> when a task wakes up and is decreased when a task blocks.
> 
> When a task is migrated from CPUi to CPUj, immediately subtract the
> task's utilization from CPUi and add it to CPUj. This mechanism is
> implemented by modifying the pull and push functions.
> Note: this is not fully correct from the theoretical point of view
> (the utilization should be removed from CPUi only at the 0 lag
> time), a more theoretically sound solution is presented in the
> next patches.
> 
> Signed-off-by: Juri Lelli <juri.lelli@arm.com>
> Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
> Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>

So that SOB chain is not valid - either Juri needs to be the From: author, or it 
should be an Acked-by (or Reviewed-by).

For now I've converted this to:

> Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
> Acked-by: Juri Lelli <juri.lelli@arm.com>

Please holler if you'd like something else.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 01/10] sched/deadline: track the active utilization
  2017-06-08  8:31   ` Ingo Molnar
@ 2017-06-08  8:43     ` Luca Abeni
  2017-06-08  9:05       ` Juri Lelli
  0 siblings, 1 reply; 26+ messages in thread
From: Luca Abeni @ 2017-06-08  8:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Claudio Scordino, Steven Rostedt, Tommaso Cucinotta,
	Daniel Bristot de Oliveira, Joel Fernandes, Mathieu Poirier,
	Luca Abeni

On Thu, 8 Jun 2017 10:31:25 +0200
Ingo Molnar <mingo@kernel.org> wrote:

> * luca abeni <luca.abeni@santannapisa.it> wrote:
> 
> > From: Luca Abeni <luca.abeni@unitn.it>
> > 
> > Active utilization is defined as the total utilization of active
> > (TASK_RUNNING) tasks queued on a runqueue. Hence, it is increased
> > when a task wakes up and is decreased when a task blocks.
> > 
> > When a task is migrated from CPUi to CPUj, immediately subtract the
> > task's utilization from CPUi and add it to CPUj. This mechanism is
> > implemented by modifying the pull and push functions.
> > Note: this is not fully correct from the theoretical point of view
> > (the utilization should be removed from CPUi only at the 0 lag
> > time), a more theoretically sound solution is presented in the
> > next patches.
> > 
> > Signed-off-by: Juri Lelli <juri.lelli@arm.com>
> > Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
> > Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>  
> 
> So that SOB chain is not valid - either Juri needs to be the From:
> author, or it should be an Acked-by (or Reviewed-by).
> 
> For now I've converted this to:
> 
> > Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
> > Acked-by: Juri Lelli <juri.lelli@arm.com>  

Sorry, my fault: I must have misunderstood how to use the Signed-off-by
stuff.

The story here is that I took a patch originally developed by Juri and
fixed and heavily modified it. Since the current patch is very
different from the original one, Juri suggested I should be the "From:"
author, and I simply added his Signed-off-by to acknowledge that he was
the author of the original patch.

If Juri is ok with your change, I agree with it.


			Thanks,
				Luca

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 01/10] sched/deadline: track the active utilization
  2017-06-08  8:43     ` Luca Abeni
@ 2017-06-08  9:05       ` Juri Lelli
  2017-06-08 13:36         ` Steven Rostedt
  0 siblings, 1 reply; 26+ messages in thread
From: Juri Lelli @ 2017-06-08  9:05 UTC (permalink / raw)
  To: Luca Abeni
  Cc: Ingo Molnar, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Claudio Scordino, Steven Rostedt, Tommaso Cucinotta,
	Daniel Bristot de Oliveira, Joel Fernandes, Mathieu Poirier,
	Luca Abeni

On 08/06/17 10:43, Luca Abeni wrote:
> On Thu, 8 Jun 2017 10:31:25 +0200
> Ingo Molnar <mingo@kernel.org> wrote:
> 
> > * luca abeni <luca.abeni@santannapisa.it> wrote:
> > 
> > > From: Luca Abeni <luca.abeni@unitn.it>
> > > 
> > > Active utilization is defined as the total utilization of active
> > > (TASK_RUNNING) tasks queued on a runqueue. Hence, it is increased
> > > when a task wakes up and is decreased when a task blocks.
> > > 
> > > When a task is migrated from CPUi to CPUj, immediately subtract the
> > > task's utilization from CPUi and add it to CPUj. This mechanism is
> > > implemented by modifying the pull and push functions.
> > > Note: this is not fully correct from the theoretical point of view
> > > (the utilization should be removed from CPUi only at the 0 lag
> > > time), a more theoretically sound solution is presented in the
> > > next patches.
> > > 
> > > Signed-off-by: Juri Lelli <juri.lelli@arm.com>
> > > Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
> > > Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>  
> > 
> > So that SOB chain is not valid - either Juri needs to be the From:
> > author, or it should be an Acked-by (or Reviewed-by).
> > 
> > For now I've converted this to:
> > 
> > > Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
> > > Acked-by: Juri Lelli <juri.lelli@arm.com>  
> 
> Sorry, my fault: I must have misunderstood how to use the Signed-off-by
> stuff.
> 
> The story here is that I took a patch originally developed by Juri and
> fixed and I heavily modified it. Since the current patch is very
> different from the original one, Juri suggested I should by the "From:"
> author, and I simply added his Signed-off-by to acknowledge that he was
> the author of the original patch.
> 
> If Juri is ok with your change, I agree with it.
> 

Yep, I'm OK with Ingo's solution.

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [tip:sched/core] sched/deadline: Track the active utilization
  2017-05-18 20:13 ` [PATCH 01/10] sched/deadline: track the active utilization luca abeni
  2017-06-08  8:31   ` Ingo Molnar
@ 2017-06-08  9:23   ` tip-bot for Luca Abeni
  1 sibling, 0 replies; 26+ messages in thread
From: tip-bot for Luca Abeni @ 2017-06-08  9:23 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: efault, juri.lelli, joelaf, linux-kernel, torvalds, luca.abeni,
	mingo, rostedt, peterz, tglx, hpa, bristot, mathieu.poirier,
	tommaso.cucinotta, claudio

Commit-ID:  e36d8677bfa55054e4194ec3683189b882a538f6
Gitweb:     http://git.kernel.org/tip/e36d8677bfa55054e4194ec3683189b882a538f6
Author:     Luca Abeni <luca.abeni@unitn.it>
AuthorDate: Thu, 18 May 2017 22:13:28 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 8 Jun 2017 10:27:56 +0200

sched/deadline: Track the active utilization

Active utilization is defined as the total utilization of active
(TASK_RUNNING) tasks queued on a runqueue. Hence, it is increased
when a task wakes up and is decreased when a task blocks.

When a task is migrated from CPUi to CPUj, immediately subtract the
task's utilization from CPUi and add it to CPUj. This mechanism is
implemented by modifying the pull and push functions.
Note: this is not fully correct from the theoretical point of view
(the utilization should be removed from CPUi only at the 0-lag
time); a more theoretically sound solution is presented in the
next patches.
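
As an illustration (not part of the commit), here is a minimal user-space
sketch of the bookkeeping, assuming the kernel's BW_SHIFT = 20 fixed-point
representation of utilizations:

#include <stdio.h>
#include <stdint.h>

#define BW_SHIFT 20

/* runtime / period in fixed point, like the kernel's to_ratio() */
static uint64_t to_ratio(uint64_t period, uint64_t runtime)
{
	return (runtime << BW_SHIFT) / period;
}

int main(void)
{
	uint64_t running_bw = 0;

	/* two tasks wake up: 10ms/30ms and 20ms/100ms */
	running_bw += to_ratio(30000000, 10000000);
	running_bw += to_ratio(100000000, 20000000);
	printf("running_bw = %.3f\n", (double)running_bw / (1 << BW_SHIFT));

	/* the first task blocks: its utilization is subtracted */
	running_bw -= to_ratio(30000000, 10000000);
	printf("running_bw = %.3f\n", (double)running_bw / (1 << BW_SHIFT));
	return 0;
}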

Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-2-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/deadline.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h    |  6 +++++
 2 files changed, 67 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index df6c291..b36ecc2 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -43,6 +43,28 @@ static inline int on_dl_rq(struct sched_dl_entity *dl_se)
 	return !RB_EMPTY_NODE(&dl_se->rb_node);
 }
 
+static inline
+void add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
+{
+	u64 old = dl_rq->running_bw;
+
+	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	dl_rq->running_bw += dl_bw;
+	SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
+}
+
+static inline
+void sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
+{
+	u64 old = dl_rq->running_bw;
+
+	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	dl_rq->running_bw -= dl_bw;
+	SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
+	if (dl_rq->running_bw > old)
+		dl_rq->running_bw = 0;
+}
+
 static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
 {
 	struct sched_dl_entity *dl_se = &p->dl;
@@ -83,6 +105,8 @@ void init_dl_rq(struct dl_rq *dl_rq)
 #else
 	init_dl_bw(&dl_rq->dl_bw);
 #endif
+
+	dl_rq->running_bw = 0;
 }
 
 #ifdef CONFIG_SMP
@@ -946,10 +970,14 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
 	 * parameters of the task might need updating. Otherwise,
 	 * we want a replenishment of its runtime.
 	 */
-	if (flags & ENQUEUE_WAKEUP)
+	if (flags & ENQUEUE_WAKEUP) {
+		struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+		add_running_bw(dl_se->dl_bw, dl_rq);
 		update_dl_entity(dl_se, pi_se);
-	else if (flags & ENQUEUE_REPLENISH)
+	} else if (flags & ENQUEUE_REPLENISH) {
 		replenish_dl_entity(dl_se, pi_se);
+	}
 
 	__enqueue_dl_entity(dl_se);
 }
@@ -998,14 +1026,25 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	if (!p->dl.dl_throttled && dl_is_constrained(&p->dl))
 		dl_check_constrained_dl(&p->dl);
 
+	if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & ENQUEUE_RESTORE)
+		add_running_bw(p->dl.dl_bw, &rq->dl);
+
 	/*
-	 * If p is throttled, we do nothing. In fact, if it exhausted
+	 * If p is throttled, we do not enqueue it. In fact, if it exhausted
 	 * its budget it needs a replenishment and, since it now is on
 	 * its rq, the bandwidth timer callback (which clearly has not
 	 * run yet) will take care of this.
+	 * However, the active utilization does not depend on whether
+	 * the task is on the runqueue or not (but depends on the
+	 * task's state - in GRUB parlance, "inactive" vs "active contending").
+	 * In other words, even if a task is throttled its utilization must
+	 * be counted in the active utilization; hence, we need to call
+	 * add_running_bw().
 	 */
-	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH))
+	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
+		add_running_bw(p->dl.dl_bw, &rq->dl);
 		return;
+	}
 
 	enqueue_dl_entity(&p->dl, pi_se, flags);
 
@@ -1023,6 +1062,20 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	update_curr_dl(rq);
 	__dequeue_task_dl(rq, p, flags);
+
+	if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & DEQUEUE_SAVE)
+		sub_running_bw(p->dl.dl_bw, &rq->dl);
+
+	/*
+	 * This check allows to decrease the active utilization in two cases:
+	 * when the task blocks and when it is terminating
+	 * (p->state == TASK_DEAD). We can handle the two cases in the same
+	 * way, because from GRUB's point of view the same thing is happening
+	 * (the task moves from "active contending" to "active non contending"
+	 * or "inactive")
+	 */
+	if (flags & DEQUEUE_SLEEP)
+		sub_running_bw(p->dl.dl_bw, &rq->dl);
 }
 
 /*
@@ -1551,7 +1604,9 @@ retry:
 	}
 
 	deactivate_task(rq, next_task, 0);
+	sub_running_bw(next_task->dl.dl_bw, &rq->dl);
 	set_task_cpu(next_task, later_rq->cpu);
+	add_running_bw(next_task->dl.dl_bw, &later_rq->dl);
 	activate_task(later_rq, next_task, 0);
 	ret = 1;
 
@@ -1639,7 +1694,9 @@ static void pull_dl_task(struct rq *this_rq)
 			resched = true;
 
 			deactivate_task(src_rq, p, 0);
+			sub_running_bw(p->dl.dl_bw, &src_rq->dl);
 			set_task_cpu(p, this_cpu);
+			add_running_bw(p->dl.dl_bw, &this_rq->dl);
 			activate_task(this_rq, p, 0);
 			dmin = p->dl.deadline;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f8cf1d8..ee26867 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -558,6 +558,12 @@ struct dl_rq {
 #else
 	struct dl_bw dl_bw;
 #endif
+	/*
+	 * "Active utilization" for this runqueue: increased when a
+	 * task wakes up (becomes TASK_RUNNING) and decreased when a
+	 * task blocks
+	 */
+	u64 running_bw;
 };
 
 #ifdef CONFIG_SMP

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [tip:sched/core] sched/deadline: Improve the tracking of active utilization
  2017-05-18 20:13 ` [PATCH 02/10] sched/deadline: improve the tracking of " luca abeni
@ 2017-06-08  9:24   ` tip-bot for Luca Abeni
  0 siblings, 0 replies; 26+ messages in thread
From: tip-bot for Luca Abeni @ 2017-06-08  9:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, mingo, luca.abeni, joelaf, hpa, bristot, rostedt, torvalds,
	juri.lelli, efault, tommaso.cucinotta, linux-kernel,
	mathieu.poirier, claudio, peterz

Commit-ID:  209a0cbda7a01d2ea32a8b631d35e873bee498e9
Gitweb:     http://git.kernel.org/tip/209a0cbda7a01d2ea32a8b631d35e873bee498e9
Author:     Luca Abeni <luca.abeni@santannapisa.it>
AuthorDate: Thu, 18 May 2017 22:13:29 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 8 Jun 2017 10:31:49 +0200

sched/deadline: Improve the tracking of active utilization

This patch implements a more theoretically sound algorithm for
tracking active utilization: instead of decreasing it when a
task blocks, use a timer (the "inactive timer", named after the
"Inactive" task state of the GRUB algorithm) to decrease the
active utilization at the so called "0-lag time".
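
As an illustration (not part of the commit), here is a worked example of
the 0-lag time at which the inactive timer fires, using the same formula
as task_non_contending() below; the numbers match the two-task example
later added to Documentation/scheduler/sched-deadline.txt:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* reservation: dl_runtime = 4ms, dl_period = 8ms; the task blocks
	 * with 2ms of runtime left and an absolute deadline at t = 8ms */
	int64_t dl_runtime = 4000000, dl_period = 8000000;
	int64_t runtime = 2000000, deadline = 8000000;

	/* 0-lag = deadline - runtime * dl_period / dl_runtime */
	int64_t zerolag = deadline - runtime * dl_period / dl_runtime;

	printf("0-lag time at t = %lld ns\n", (long long)zerolag);	/* 4ms */
	return 0;
}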

Tested-by: Claudio Scordino <claudio@evidence.eu.com>
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-3-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h   |  17 +++
 kernel/sched/core.c     |   3 +
 kernel/sched/deadline.c | 269 +++++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h    |   2 +
 4 files changed, 276 insertions(+), 15 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1abaa37..f1ead2e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -445,16 +445,33 @@ struct sched_dl_entity {
 	 *
 	 * @dl_yielded tells if task gave up the CPU before consuming
 	 * all its available runtime during the last job.
+	 *
+	 * @dl_non_contending tells if the task is inactive while still
+	 * contributing to the active utilization. In other words, it
+	 * indicates if the inactive timer has been armed and its handler
+	 * has not been executed yet. This flag is useful to avoid race
+	 * conditions between the inactive timer handler and the wakeup
+	 * code.
 	 */
 	int				dl_throttled;
 	int				dl_boosted;
 	int				dl_yielded;
+	int				dl_non_contending;
 
 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
 	 * own bandwidth to be enforced, thus we need one timer per task.
 	 */
 	struct hrtimer			dl_timer;
+
+	/*
+	 * Inactive timer, responsible for decreasing the active utilization
+	 * at the "0-lag time". When a -deadline task blocks, it contributes
+	 * to GRUB's active utilization until the "0-lag time", hence a
+	 * timer is needed to decrease the active utilization at the correct
+	 * time.
+	 */
+	struct hrtimer inactive_timer;
 };
 
 union rcu_special {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c3e50ca..968c655 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2153,6 +2153,7 @@ void __dl_clear_params(struct task_struct *p)
 
 	dl_se->dl_throttled = 0;
 	dl_se->dl_yielded = 0;
+	dl_se->dl_non_contending = 0;
 }
 
 /*
@@ -2184,6 +2185,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 
 	RB_CLEAR_NODE(&p->dl.rb_node);
 	init_dl_task_timer(&p->dl);
+	init_dl_inactive_task_timer(&p->dl);
 	__dl_clear_params(p);
 
 	INIT_LIST_HEAD(&p->rt.run_list);
@@ -2506,6 +2508,7 @@ static int dl_overflow(struct task_struct *p, int policy,
 		   !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
 		__dl_clear(dl_b, p->dl.dl_bw);
 		__dl_add(dl_b, new_bw);
+		dl_change_utilization(p, new_bw);
 		err = 0;
 	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
 		__dl_clear(dl_b, p->dl.dl_bw);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index b36ecc2..6480a92 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -65,6 +65,161 @@ void sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
 		dl_rq->running_bw = 0;
 }
 
+void dl_change_utilization(struct task_struct *p, u64 new_bw)
+{
+	if (task_on_rq_queued(p))
+		return;
+
+	if (!p->dl.dl_non_contending)
+		return;
+
+	sub_running_bw(p->dl.dl_bw, &task_rq(p)->dl);
+	p->dl.dl_non_contending = 0;
+	/*
+	 * If the timer handler is currently running and the
+	 * timer cannot be cancelled, inactive_task_timer()
+	 * will see that dl_not_contending is not set, and
+	 * will not touch the rq's active utilization,
+	 * so we are still safe.
+	 */
+	if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
+		put_task_struct(p);
+}
+
+/*
+ * The utilization of a task cannot be immediately removed from
+ * the rq active utilization (running_bw) when the task blocks.
+ * Instead, we have to wait for the so called "0-lag time".
+ *
+ * If a task blocks before the "0-lag time", a timer (the inactive
+ * timer) is armed, and running_bw is decreased when the timer
+ * fires.
+ *
+ * If the task wakes up again before the inactive timer fires,
+ * the timer is cancelled, whereas if the task wakes up after the
+ * inactive timer fired (and running_bw has been decreased) the
+ * task's utilization has to be added to running_bw again.
+ * A flag in the deadline scheduling entity (dl_non_contending)
+ * is used to avoid race conditions between the inactive timer handler
+ * and task wakeups.
+ *
+ * The following diagram shows how running_bw is updated. A task is
+ * "ACTIVE" when its utilization contributes to running_bw; an
+ * "ACTIVE contending" task is in the TASK_RUNNING state, while an
+ * "ACTIVE non contending" task is a blocked task for which the "0-lag time"
+ * has not passed yet. An "INACTIVE" task is a task for which the "0-lag"
+ * time already passed, which does not contribute to running_bw anymore.
+ *                              +------------------+
+ *             wakeup           |    ACTIVE        |
+ *          +------------------>+   contending     |
+ *          | add_running_bw    |                  |
+ *          |                   +----+------+------+
+ *          |                        |      ^
+ *          |                dequeue |      |
+ * +--------+-------+                |      |
+ * |                |   t >= 0-lag   |      | wakeup
+ * |    INACTIVE    |<---------------+      |
+ * |                | sub_running_bw |      |
+ * +--------+-------+                |      |
+ *          ^                        |      |
+ *          |              t < 0-lag |      |
+ *          |                        |      |
+ *          |                        V      |
+ *          |                   +----+------+------+
+ *          | sub_running_bw    |    ACTIVE        |
+ *          +-------------------+                  |
+ *            inactive timer    |  non contending  |
+ *            fired             +------------------+
+ *
+ * The task_non_contending() function is invoked when a task
+ * blocks, and checks if the 0-lag time already passed or
+ * not (in the first case, it directly updates running_bw;
+ * in the second case, it arms the inactive timer).
+ *
+ * The task_contending() function is invoked when a task wakes
+ * up, and checks if the task is still in the "ACTIVE non contending"
+ * state or not (in the second case, it updates running_bw).
+ */
+static void task_non_contending(struct task_struct *p)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+	struct hrtimer *timer = &dl_se->inactive_timer;
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+	s64 zerolag_time;
+
+	/*
+	 * If this is a non-deadline task that has been boosted,
+	 * do nothing
+	 */
+	if (dl_se->dl_runtime == 0)
+		return;
+
+	WARN_ON(hrtimer_active(&dl_se->inactive_timer));
+	WARN_ON(dl_se->dl_non_contending);
+
+	zerolag_time = dl_se->deadline -
+		 div64_long((dl_se->runtime * dl_se->dl_period),
+			dl_se->dl_runtime);
+
+	/*
+	 * Using relative times instead of the absolute "0-lag time"
+	 * allows to simplify the code
+	 */
+	zerolag_time -= rq_clock(rq);
+
+	/*
+	 * If the "0-lag time" already passed, decrease the active
+	 * utilization now, instead of starting a timer
+	 */
+	if (zerolag_time < 0) {
+		if (dl_task(p))
+			sub_running_bw(dl_se->dl_bw, dl_rq);
+		if (!dl_task(p) || p->state == TASK_DEAD)
+			__dl_clear_params(p);
+
+		return;
+	}
+
+	dl_se->dl_non_contending = 1;
+	get_task_struct(p);
+	hrtimer_start(timer, ns_to_ktime(zerolag_time), HRTIMER_MODE_REL);
+}
+
+static void task_contending(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+	/*
+	 * If this is a non-deadline task that has been boosted,
+	 * do nothing
+	 */
+	if (dl_se->dl_runtime == 0)
+		return;
+
+	if (dl_se->dl_non_contending) {
+		dl_se->dl_non_contending = 0;
+		/*
+		 * If the timer handler is currently running and the
+		 * timer cannot be cancelled, inactive_task_timer()
+		 * will see that dl_not_contending is not set, and
+		 * will not touch the rq's active utilization,
+		 * so we are still safe.
+		 */
+		if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1)
+			put_task_struct(dl_task_of(dl_se));
+	} else {
+		/*
+		 * Since "dl_non_contending" is not set, the
+		 * task's utilization has already been removed from
+		 * active utilization (either when the task blocked,
+		 * or when the "inactive timer" fired).
+		 * So, add it back.
+		 */
+		add_running_bw(dl_se->dl_bw, dl_rq);
+	}
+}
+
 static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
 {
 	struct sched_dl_entity *dl_se = &p->dl;
@@ -617,10 +772,8 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 	 * The task might have changed its scheduling policy to something
 	 * different than SCHED_DEADLINE (through switched_from_dl()).
 	 */
-	if (!dl_task(p)) {
-		__dl_clear_params(p);
+	if (!dl_task(p))
 		goto unlock;
-	}
 
 	/*
 	 * The task might have been boosted by someone else and might be in the
@@ -839,6 +992,49 @@ throttle:
 	}
 }
 
+static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
+{
+	struct sched_dl_entity *dl_se = container_of(timer,
+						     struct sched_dl_entity,
+						     inactive_timer);
+	struct task_struct *p = dl_task_of(dl_se);
+	struct rq_flags rf;
+	struct rq *rq;
+
+	rq = task_rq_lock(p, &rf);
+
+	if (!dl_task(p) || p->state == TASK_DEAD) {
+		if (p->state == TASK_DEAD && dl_se->dl_non_contending) {
+			sub_running_bw(p->dl.dl_bw, dl_rq_of_se(&p->dl));
+			dl_se->dl_non_contending = 0;
+		}
+		__dl_clear_params(p);
+
+		goto unlock;
+	}
+	if (dl_se->dl_non_contending == 0)
+		goto unlock;
+
+	sched_clock_tick();
+	update_rq_clock(rq);
+
+	sub_running_bw(dl_se->dl_bw, &rq->dl);
+	dl_se->dl_non_contending = 0;
+unlock:
+	task_rq_unlock(rq, p, &rf);
+	put_task_struct(p);
+
+	return HRTIMER_NORESTART;
+}
+
+void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se)
+{
+	struct hrtimer *timer = &dl_se->inactive_timer;
+
+	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	timer->function = inactive_task_timer;
+}
+
 #ifdef CONFIG_SMP
 
 static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
@@ -971,9 +1167,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
 	 * we want a replenishment of its runtime.
 	 */
 	if (flags & ENQUEUE_WAKEUP) {
-		struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
-
-		add_running_bw(dl_se->dl_bw, dl_rq);
+		task_contending(dl_se);
 		update_dl_entity(dl_se, pi_se);
 	} else if (flags & ENQUEUE_REPLENISH) {
 		replenish_dl_entity(dl_se, pi_se);
@@ -1042,7 +1236,9 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	 * add_running_bw().
 	 */
 	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
-		add_running_bw(p->dl.dl_bw, &rq->dl);
+		if (flags & ENQUEUE_WAKEUP)
+			task_contending(&p->dl);
+
 		return;
 	}
 
@@ -1067,7 +1263,8 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 		sub_running_bw(p->dl.dl_bw, &rq->dl);
 
 	/*
-	 * This check allows to decrease the active utilization in two cases:
+	 * This check allows to start the inactive timer (or to immediately
+	 * decrease the active utilization, if needed) in two cases:
 	 * when the task blocks and when it is terminating
 	 * (p->state == TASK_DEAD). We can handle the two cases in the same
 	 * way, because from GRUB's point of view the same thing is happening
@@ -1075,7 +1272,7 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	 * or "inactive")
 	 */
 	if (flags & DEQUEUE_SLEEP)
-		sub_running_bw(p->dl.dl_bw, &rq->dl);
+		task_non_contending(p);
 }
 
 /*
@@ -1153,6 +1350,35 @@ out:
 	return cpu;
 }
 
+static void migrate_task_rq_dl(struct task_struct *p)
+{
+	struct rq *rq;
+
+	if (!(p->state == TASK_WAKING) || !(p->dl.dl_non_contending))
+		return;
+
+	rq = task_rq(p);
+	/*
+	 * Since p->state == TASK_WAKING, set_task_cpu() has been called
+	 * from try_to_wake_up(). Hence, p->pi_lock is locked, but
+	 * rq->lock is not... So, lock it
+	 */
+	raw_spin_lock(&rq->lock);
+	sub_running_bw(p->dl.dl_bw, &rq->dl);
+	p->dl.dl_non_contending = 0;
+	/*
+	 * If the timer handler is currently running and the
+	 * timer cannot be cancelled, inactive_task_timer()
+	 * will see that dl_not_contending is not set, and
+	 * will not touch the rq's active utilization,
+	 * so we are still safe.
+	 */
+	if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
+		put_task_struct(p);
+
+	raw_spin_unlock(&rq->lock);
+}
+
 static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
 {
 	/*
@@ -1794,13 +2020,23 @@ void __init init_sched_dl_class(void)
 static void switched_from_dl(struct rq *rq, struct task_struct *p)
 {
 	/*
-	 * Start the deadline timer; if we switch back to dl before this we'll
-	 * continue consuming our current CBS slice. If we stay outside of
-	 * SCHED_DEADLINE until the deadline passes, the timer will reset the
-	 * task.
+	 * task_non_contending() can start the "inactive timer" (if the 0-lag
+	 * time is in the future). If the task switches back to dl before
+	 * the "inactive timer" fires, it can continue to consume its current
+	 * runtime using its current deadline. If it stays outside of
+	 * SCHED_DEADLINE until the 0-lag time passes, inactive_task_timer()
+	 * will reset the task parameters.
 	 */
-	if (!start_dl_timer(p))
-		__dl_clear_params(p);
+	if (task_on_rq_queued(p) && p->dl.dl_runtime)
+		task_non_contending(p);
+
+	/*
+	 * We cannot use inactive_task_timer() to invoke sub_running_bw()
+	 * at the 0-lag time, because the task could have been migrated
+	 * while it was SCHED_OTHER.
+	 */
+	if (p->dl.dl_non_contending)
+		p->dl.dl_non_contending = 0;
 
 	/*
 	 * Since this might be the only -deadline task on the rq,
@@ -1819,6 +2055,8 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
  */
 static void switched_to_dl(struct rq *rq, struct task_struct *p)
 {
+	if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
+		put_task_struct(p);
 
 	/* If p is not queued we will update its parameters at next wakeup. */
 	if (!task_on_rq_queued(p))
@@ -1893,6 +2131,7 @@ const struct sched_class dl_sched_class = {
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_dl,
+	.migrate_task_rq	= migrate_task_rq_dl,
 	.set_cpus_allowed       = set_cpus_allowed_dl,
 	.rq_online              = rq_online_dl,
 	.rq_offline             = rq_offline_dl,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ee26867..c58f389 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -244,6 +244,7 @@ bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64 new_bw)
 	       dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw;
 }
 
+void dl_change_utilization(struct task_struct *p, u64 new_bw);
 extern void init_dl_bw(struct dl_bw *dl_b);
 
 #ifdef CONFIG_CGROUP_SCHED
@@ -1493,6 +1494,7 @@ extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime
 extern struct dl_bandwidth def_dl_bandwidth;
 extern void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime);
 extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
+extern void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se);
 
 unsigned long to_ratio(u64 period, u64 runtime);
 

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [tip:sched/core] sched/deadline: Fix the update of the total -deadline utilization
  2017-05-18 20:13 ` [PATCH 03/10] sched/deadline: fix the update of the total -deadline utilization luca abeni
@ 2017-06-08  9:24   ` tip-bot for Luca Abeni
  0 siblings, 0 replies; 26+ messages in thread
From: tip-bot for Luca Abeni @ 2017-06-08  9:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: luca.abeni, bristot, efault, hpa, peterz, juri.lelli, rostedt,
	mingo, tommaso.cucinotta, joelaf, linux-kernel, mathieu.poirier,
	tglx, claudio, torvalds

Commit-ID:  387e31300b5760169e6d3f7a9e1eeed12cc5a30b
Gitweb:     http://git.kernel.org/tip/387e31300b5760169e6d3f7a9e1eeed12cc5a30b
Author:     Luca Abeni <luca.abeni@santannapisa.it>
AuthorDate: Thu, 18 May 2017 22:13:30 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 8 Jun 2017 10:31:50 +0200

sched/deadline: Fix the update of the total -deadline utilization

Now that the inactive timer can be armed to fire at the 0-lag time,
it is possible to use inactive_task_timer() to update the total
-deadline utilization (dl_b->total_bw) at the correct time, fixing
dl_overflow() and __setparam_dl().
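
As an illustration (not part of the commit), here is a minimal sketch of
the admission test that dl_overflow() relies on, mirroring
__dl_overflow() in kernel/sched/sched.h (where bw == -1 means "no
limit"); utilizations are in the usual BW_SHIFT = 20 fixed point:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BW_SHIFT 20
#define BW_UNIT  (1ULL << BW_SHIFT)

/* a change from old_bw to new_bw overflows if the resulting total
 * exceeds the allowed bandwidth on the domain (bw per CPU) */
static bool dl_overflow(uint64_t bw, uint64_t total_bw, int cpus,
			uint64_t old_bw, uint64_t new_bw)
{
	return bw != (uint64_t)-1 &&
	       bw * cpus < total_bw - old_bw + new_bw;
}

int main(void)
{
	/* 2 CPUs at 95% each, 120% already allocated, new task asks 50% */
	uint64_t bw = BW_UNIT * 95 / 100;
	uint64_t total_bw = BW_UNIT * 120 / 100;
	uint64_t new_bw = BW_UNIT / 2;

	printf("overflow: %d\n", dl_overflow(bw, total_bw, 2, 0, new_bw));
	return 0;
}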

Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-4-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c     | 38 ++++++++++++++------------------------
 kernel/sched/deadline.c | 28 +++++++++++++---------------
 2 files changed, 27 insertions(+), 39 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 968c655..126339d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2475,9 +2475,6 @@ static inline int dl_bw_cpus(int i)
  * allocated bandwidth to reflect the new situation.
  *
  * This function is called while holding p's rq->lock.
- *
- * XXX we should delay bw change until the task's 0-lag point, see
- * __setparam_dl().
  */
 static int dl_overflow(struct task_struct *p, int policy,
 		       const struct sched_attr *attr)
@@ -2502,16 +2499,29 @@ static int dl_overflow(struct task_struct *p, int policy,
 	cpus = dl_bw_cpus(task_cpu(p));
 	if (dl_policy(policy) && !task_has_dl_policy(p) &&
 	    !__dl_overflow(dl_b, cpus, 0, new_bw)) {
+		if (hrtimer_active(&p->dl.inactive_timer))
+			__dl_clear(dl_b, p->dl.dl_bw);
 		__dl_add(dl_b, new_bw);
 		err = 0;
 	} else if (dl_policy(policy) && task_has_dl_policy(p) &&
 		   !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
+		/*
+		 * XXX this is slightly incorrect: when the task
+		 * utilization decreases, we should delay the total
+		 * utilization change until the task's 0-lag point.
+		 * But this would require to set the task's "inactive
+		 * timer" when the task is not inactive.
+		 */
 		__dl_clear(dl_b, p->dl.dl_bw);
 		__dl_add(dl_b, new_bw);
 		dl_change_utilization(p, new_bw);
 		err = 0;
 	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
-		__dl_clear(dl_b, p->dl.dl_bw);
+		/*
+		 * Do not decrease the total deadline utilization here,
+		 * switched_from_dl() will take care to do it at the correct
+		 * (0-lag) time.
+		 */
 		err = 0;
 	}
 	raw_spin_unlock(&dl_b->lock);
@@ -4020,26 +4030,6 @@ __setparam_dl(struct task_struct *p, const struct sched_attr *attr)
 	dl_se->dl_period = attr->sched_period ?: dl_se->dl_deadline;
 	dl_se->flags = attr->sched_flags;
 	dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
-
-	/*
-	 * Changing the parameters of a task is 'tricky' and we're not doing
-	 * the correct thing -- also see task_dead_dl() and switched_from_dl().
-	 *
-	 * What we SHOULD do is delay the bandwidth release until the 0-lag
-	 * point. This would include retaining the task_struct until that time
-	 * and change dl_overflow() to not immediately decrement the current
-	 * amount.
-	 *
-	 * Instead we retain the current runtime/deadline and let the new
-	 * parameters take effect after the current reservation period lapses.
-	 * This is safe (albeit pessimistic) because the 0-lag point is always
-	 * before the current scheduling deadline.
-	 *
-	 * We can still have temporary overloads because we do not delay the
-	 * change in bandwidth until that time; so admission control is
-	 * not on the safe side. It does however guarantee tasks will never
-	 * consume more than promised.
-	 */
 }
 
 /*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 6480a92..add9cba 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -175,8 +175,14 @@ static void task_non_contending(struct task_struct *p)
 	if (zerolag_time < 0) {
 		if (dl_task(p))
 			sub_running_bw(dl_se->dl_bw, dl_rq);
-		if (!dl_task(p) || p->state == TASK_DEAD)
+		if (!dl_task(p) || p->state == TASK_DEAD) {
+			struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+
+			raw_spin_lock(&dl_b->lock);
+			__dl_clear(dl_b, p->dl.dl_bw);
 			__dl_clear_params(p);
+			raw_spin_unlock(&dl_b->lock);
+		}
 
 		return;
 	}
@@ -1004,10 +1010,16 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 	rq = task_rq_lock(p, &rf);
 
 	if (!dl_task(p) || p->state == TASK_DEAD) {
+		struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+
 		if (p->state == TASK_DEAD && dl_se->dl_non_contending) {
 			sub_running_bw(p->dl.dl_bw, dl_rq_of_se(&p->dl));
 			dl_se->dl_non_contending = 0;
 		}
+
+		raw_spin_lock(&dl_b->lock);
+		__dl_clear(dl_b, p->dl.dl_bw);
+		raw_spin_unlock(&dl_b->lock);
 		__dl_clear_params(p);
 
 		goto unlock;
@@ -1534,19 +1546,6 @@ static void task_fork_dl(struct task_struct *p)
 	 */
 }
 
-static void task_dead_dl(struct task_struct *p)
-{
-	struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
-
-	/*
-	 * Since we are TASK_DEAD we won't slip out of the domain!
-	 */
-	raw_spin_lock_irq(&dl_b->lock);
-	/* XXX we should retain the bw until 0-lag */
-	dl_b->total_bw -= p->dl.dl_bw;
-	raw_spin_unlock_irq(&dl_b->lock);
-}
-
 static void set_curr_task_dl(struct rq *rq)
 {
 	struct task_struct *p = rq->curr;
@@ -2141,7 +2140,6 @@ const struct sched_class dl_sched_class = {
 	.set_curr_task		= set_curr_task_dl,
 	.task_tick		= task_tick_dl,
 	.task_fork              = task_fork_dl,
-	.task_dead		= task_dead_dl,
 
 	.prio_changed           = prio_changed_dl,
 	.switched_from		= switched_from_dl,

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [tip:sched/core] sched/deadline: Implement GRUB accounting
  2017-05-18 20:13 ` [PATCH 04/10] sched/deadline: implement GRUB accounting luca abeni
@ 2017-06-08  9:25   ` tip-bot for Luca Abeni
  0 siblings, 0 replies; 26+ messages in thread
From: tip-bot for Luca Abeni @ 2017-06-08  9:25 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: rostedt, linux-kernel, claudio, tglx, bristot, juri.lelli,
	mathieu.poirier, luca.abeni, efault, tommaso.cucinotta, mingo,
	peterz, joelaf, torvalds, hpa

Commit-ID:  c52f14d384628db0217a7a9080ab800d5ffb2d72
Gitweb:     http://git.kernel.org/tip/c52f14d384628db0217a7a9080ab800d5ffb2d72
Author:     Luca Abeni <luca.abeni@santannapisa.it>
AuthorDate: Thu, 18 May 2017 22:13:31 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 8 Jun 2017 10:31:51 +0200

sched/deadline: Implement GRUB accounting

According to the GRUB (Greedy Reclamation of Unused Bandwidth)
reclaiming algorithm, the runtime is not decreased as "dq = -dt",
but as "dq = -Uact dt" (where Uact is the per-runqueue active
utilization).
Hence, this commit modifies the runtime accounting rule in
update_curr_dl() to implement the GRUB rule.

Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-5-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c     |  4 ++--
 kernel/sched/deadline.c | 17 +++++++++++++++++
 kernel/sched/sched.h    |  2 ++
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 126339d..b68a1fa 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2423,7 +2423,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 unsigned long to_ratio(u64 period, u64 runtime)
 {
 	if (runtime == RUNTIME_INF)
-		return 1ULL << 20;
+		return BW_UNIT;
 
 	/*
 	 * Doing this here saves a lot of checks in all
@@ -2433,7 +2433,7 @@ unsigned long to_ratio(u64 period, u64 runtime)
 	if (period == 0)
 		return 0;
 
-	return div64_u64(runtime << 20, period);
+	return div64_u64(runtime << BW_SHIFT, period);
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index add9cba..0bee537 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -918,6 +918,22 @@ int dl_runtime_exceeded(struct sched_dl_entity *dl_se)
 extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
 
 /*
+ * This function implements the GRUB accounting rule:
+ * according to the GRUB reclaiming algorithm, the runtime is
+ * not decreased as "dq = -dt", but as "dq = -Uact dt", where
+ * Uact is the (per-runqueue) active utilization.
+ * Since rq->dl.running_bw contains Uact * 2^BW_SHIFT, the result
+ * has to be shifted right by BW_SHIFT.
+ */
+u64 grub_reclaim(u64 delta, struct rq *rq)
+{
+	delta *= rq->dl.running_bw;
+	delta >>= BW_SHIFT;
+
+	return delta;
+}
+
+/*
  * Update the current task's runtime statistics (provided it is still
  * a -deadline task and has not been removed from the dl_rq).
  */
@@ -959,6 +975,7 @@ static void update_curr_dl(struct rq *rq)
 
 	sched_rt_avg_update(rq, delta_exec);
 
+	delta_exec = grub_reclaim(delta_exec, rq);
 	dl_se->runtime -= delta_exec;
 
 throttle:
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c58f389..bb409ef 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1496,6 +1496,8 @@ extern void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime
 extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
 extern void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se);
 
+#define BW_SHIFT	20
+#define BW_UNIT		(1 << BW_SHIFT)
 unsigned long to_ratio(u64 period, u64 runtime);
 
 extern void init_entity_runnable_average(struct sched_entity *se);
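
For illustration, here is a minimal user-space sketch of the fixed-point
arithmetic used by to_ratio() and grub_reclaim() above. BW_SHIFT and the
"dq = -Uact dt" rule follow the patch; the standalone main(), the sample
10ms/30ms reservation and the printed output are made up for the example.

/*
 * Stand-alone sketch: runtime accounting with one dl task of
 * utilization 1/3 active on the runqueue.
 */
#include <stdio.h>
#include <stdint.h>

#define BW_SHIFT	20
#define BW_UNIT		(1ULL << BW_SHIFT)

static uint64_t to_ratio(uint64_t period, uint64_t runtime)
{
	return (runtime << BW_SHIFT) / period;	/* utilization * 2^20 */
}

static uint64_t grub_reclaim(uint64_t delta, uint64_t running_bw)
{
	return (delta * running_bw) >> BW_SHIFT;	/* dq = -Uact dt */
}

int main(void)
{
	/* one task with runtime 10ms and period 30ms: Uact = 1/3 */
	uint64_t running_bw = to_ratio(30000000ULL, 10000000ULL);
	uint64_t delta = 1000000;	/* 1ms of execution, in ns */

	/* prints roughly delta/3: only a third of the wall time is charged */
	printf("charged %llu ns instead of %llu ns\n",
	       (unsigned long long)grub_reclaim(delta, running_bw),
	       (unsigned long long)delta);
	return 0;
}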

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [tip:sched/core] sched/deadline: Do not reclaim the whole CPU bandwidth
  2017-05-18 20:13 ` [PATCH 05/10] sched/deadline: do not reclaim the whole CPU bandwidth luca abeni
@ 2017-06-08  9:25   ` tip-bot for Luca Abeni
  0 siblings, 0 replies; 26+ messages in thread
From: tip-bot for Luca Abeni @ 2017-06-08  9:25 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, mathieu.poirier, claudio, efault, luca.abeni, tglx,
	joelaf, bristot, linux-kernel, juri.lelli, rostedt, torvalds,
	hpa, tommaso.cucinotta, peterz

Commit-ID:  4da3abcefe178c650033f371e94fa10e80bce167
Gitweb:     http://git.kernel.org/tip/4da3abcefe178c650033f371e94fa10e80bce167
Author:     Luca Abeni <luca.abeni@santannapisa.it>
AuthorDate: Thu, 18 May 2017 22:13:32 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 8 Jun 2017 10:31:51 +0200

sched/deadline: Do not reclaim the whole CPU bandwidth

The original GRUB algorithm tends to reclaim 100% of the CPU time, and this
allows a CPU hog to starve non-deadline tasks.
To address this issue, allow the scheduler to reclaim only a
specified fraction of CPU time, stored in the new "bw_ratio"
field of the dl runqueue structure.

Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-6-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c     | 11 +++++++++++
 kernel/sched/deadline.c | 12 +++++++++++-
 kernel/sched/sched.h    |  8 ++++++++
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b68a1fa..7abd064 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6759,6 +6759,16 @@ static int sched_dl_global_validate(void)
 	return ret;
 }
 
+void init_dl_rq_bw_ratio(struct dl_rq *dl_rq)
+{
+	if (global_rt_runtime() == RUNTIME_INF) {
+		dl_rq->bw_ratio = 1 << RATIO_SHIFT;
+	} else {
+		dl_rq->bw_ratio = to_ratio(global_rt_runtime(),
+			  global_rt_period()) >> (BW_SHIFT - RATIO_SHIFT);
+	}
+}
+
 static void sched_dl_do_global(void)
 {
 	u64 new_bw = -1;
@@ -6784,6 +6794,7 @@ static void sched_dl_do_global(void)
 		raw_spin_unlock_irqrestore(&dl_b->lock, flags);
 
 		rcu_read_unlock_sched();
+		init_dl_rq_bw_ratio(&cpu_rq(cpu)->dl);
 	}
 }
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0bee537..6a0614b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -268,6 +268,7 @@ void init_dl_rq(struct dl_rq *dl_rq)
 #endif
 
 	dl_rq->running_bw = 0;
+	init_dl_rq_bw_ratio(dl_rq);
 }
 
 #ifdef CONFIG_SMP
@@ -924,11 +925,20 @@ extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
  * Uact is the (per-runqueue) active utilization.
  * Since rq->dl.running_bw contains Uact * 2^BW_SHIFT, the result
  * has to be shifted right by BW_SHIFT.
+ * To reclaim only a fraction Umax of the CPU time, the
+ * runtime accounting rule is modified as
+ * "dq = -Uact / Umax dt"; since rq->dl.bw_ratio contains
+ * 2^RATIO_SHIFT / Umax, delta is multiplied by bw_ratio and shifted
+ * right by RATIO_SHIFT.
+ * Since delta is a 64 bit variable, to have an overflow its value
+ * should be larger than 2^(64 - 20 - 8), which is more than 64 seconds.
+ * So, overflow is not an issue here.
  */
 u64 grub_reclaim(u64 delta, struct rq *rq)
 {
 	delta *= rq->dl.running_bw;
-	delta >>= BW_SHIFT;
+	delta *= rq->dl.bw_ratio;
+	delta >>= BW_SHIFT + RATIO_SHIFT;
 
 	return delta;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bb409ef..878fe75 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -565,6 +565,12 @@ struct dl_rq {
 	 * task blocks
 	 */
 	u64 running_bw;
+
+	/*
+	 * Inverse of the fraction of CPU utilization that can be reclaimed
+	 * by the GRUB algorithm.
+	 */
+	u64 bw_ratio;
 };
 
 #ifdef CONFIG_SMP
@@ -1495,9 +1501,11 @@ extern struct dl_bandwidth def_dl_bandwidth;
 extern void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime);
 extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
 extern void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se);
+extern void init_dl_rq_bw_ratio(struct dl_rq *dl_rq);
 
 #define BW_SHIFT	20
 #define BW_UNIT		(1 << BW_SHIFT)
+#define RATIO_SHIFT	8
 unsigned long to_ratio(u64 period, u64 runtime);
 
 extern void init_entity_runnable_average(struct sched_entity *se);
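
A small user-space sketch of the capped accounting rule "dq = -Uact / Umax dt"
added by this patch; the shifts mirror the patch, while the default
950000/1000000 rt bandwidth (Umax = 0.95), the sample task and the main()
wrapper are assumptions made only for the example.

#include <stdio.h>
#include <stdint.h>

#define BW_SHIFT	20
#define RATIO_SHIFT	8

static uint64_t to_ratio(uint64_t period, uint64_t runtime)
{
	return (runtime << BW_SHIFT) / period;
}

int main(void)
{
	/* bw_ratio = (1 / Umax) << RATIO_SHIFT, with Umax = 0.95 */
	uint64_t bw_ratio = to_ratio(950000, 1000000) >> (BW_SHIFT - RATIO_SHIFT);
	uint64_t running_bw = to_ratio(30000000ULL, 10000000ULL);	/* Uact = 1/3 */
	uint64_t delta = 1000000;					/* 1ms in ns */
	uint64_t charged = (delta * running_bw * bw_ratio) >> (BW_SHIFT + RATIO_SHIFT);

	/* Uact / Umax ~= 0.35, so about 0.35ms is charged per 1ms of execution */
	printf("charged %llu ns for %llu ns of execution\n",
	       (unsigned long long)charged, (unsigned long long)delta);
	return 0;
}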

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [tip:sched/core] sched/deadline: Make GRUB a task's flag
  2017-05-18 20:13 ` [PATCH 06/10] sched/deadline: make GRUB a task's flag luca abeni
@ 2017-06-08  9:26   ` tip-bot for Luca Abeni
  0 siblings, 0 replies; 26+ messages in thread
From: tip-bot for Luca Abeni @ 2017-06-08  9:26 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: joelaf, tommaso.cucinotta, claudio, rostedt, torvalds,
	linux-kernel, tglx, efault, juri.lelli, luca.abeni, hpa, mingo,
	bristot, mathieu.poirier, peterz

Commit-ID:  2d4283e9d583a3ee8cfb1cbb9c1270614df4c29d
Gitweb:     http://git.kernel.org/tip/2d4283e9d583a3ee8cfb1cbb9c1270614df4c29d
Author:     Luca Abeni <luca.abeni@santannapisa.it>
AuthorDate: Thu, 18 May 2017 22:13:33 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 8 Jun 2017 10:31:52 +0200

sched/deadline: Make GRUB a task's flag

This patch introduces the SCHED_FLAG_RECLAIM flag to specify
that a DL task is allowed to reclaim unused CPU time (using
the GRUB algorithm).

Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-7-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/uapi/linux/sched.h | 1 +
 kernel/sched/core.c        | 3 ++-
 kernel/sched/deadline.c    | 3 ++-
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 5f0fe01..e2a6c7b 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -47,5 +47,6 @@
  * For the sched_{set,get}attr() calls
  */
 #define SCHED_FLAG_RESET_ON_FORK	0x01
+#define SCHED_FLAG_RECLAIM		0x02
 
 #endif /* _UAPI_LINUX_SCHED_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7abd064..8d1a5a6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4195,7 +4195,8 @@ recheck:
 			return -EINVAL;
 	}
 
-	if (attr->sched_flags & ~(SCHED_FLAG_RESET_ON_FORK))
+	if (attr->sched_flags &
+		~(SCHED_FLAG_RESET_ON_FORK | SCHED_FLAG_RECLAIM))
 		return -EINVAL;
 
 	/*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 6a0614b..61ea303 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -985,7 +985,8 @@ static void update_curr_dl(struct rq *rq)
 
 	sched_rt_avg_update(rq, delta_exec);
 
-	delta_exec = grub_reclaim(delta_exec, rq);
+	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
+		delta_exec = grub_reclaim(delta_exec, rq);
 	dl_se->runtime -= delta_exec;
 
 throttle:
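
As an illustration of how user space could request this behaviour, here is a
minimal sketch that sets SCHED_DEADLINE with SCHED_FLAG_RECLAIM via
sched_setattr(). The hand-rolled struct sched_attr and the raw syscall() are
only needed because glibc provides no wrapper; the 10ms/30ms reservation and
the #ifndef fallback defines are assumptions for the example, and running it
needs the proper privileges (CAP_SYS_NICE) plus free deadline bandwidth.

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE		6
#endif
#ifndef SCHED_FLAG_RECLAIM
#define SCHED_FLAG_RECLAIM	0x02
#endif

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
};

int main(void)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_policy	= SCHED_DEADLINE,
		.sched_flags	= SCHED_FLAG_RECLAIM,
		.sched_runtime	= 10 * 1000 * 1000,	/* 10ms */
		.sched_deadline	= 30 * 1000 * 1000,	/* 30ms */
		.sched_period	= 30 * 1000 * 1000,	/* 30ms */
	};

	/* SYS_sched_setattr needs a glibc that knows about the syscall */
	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		perror("sched_setattr");
	else
		printf("SCHED_DEADLINE with reclaiming enabled\n");
	return 0;
}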

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [tip:sched/core] sched/deadline: Track the "total rq utilization" too
  2017-05-18 20:13 ` [PATCH 07/10] sched/deadline: track the "total rq utilization" too luca abeni
@ 2017-06-08  9:26   ` tip-bot for Luca Abeni
  0 siblings, 0 replies; 26+ messages in thread
From: tip-bot for Luca Abeni @ 2017-06-08  9:26 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mathieu.poirier, luca.abeni, tommaso.cucinotta, efault, bristot,
	tglx, torvalds, claudio, juri.lelli, peterz, hpa, mingo, joelaf,
	linux-kernel, rostedt

Commit-ID:  8fd27231c3302e0c7e1907df1252db97b65eb241
Gitweb:     http://git.kernel.org/tip/8fd27231c3302e0c7e1907df1252db97b65eb241
Author:     Luca Abeni <luca.abeni@santannapisa.it>
AuthorDate: Thu, 18 May 2017 22:13:34 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 8 Jun 2017 10:31:53 +0200

sched/deadline: Track the "total rq utilization" too

The total rq utilization is defined as the sum of the utilizations of the
tasks that are "assigned" to a runqueue, independently of their state
(TASK_RUNNING or blocked).

Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Claudio Scordino <claudio@evidence.eu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-8-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/deadline.c | 118 ++++++++++++++++++++++++++++++++++--------------
 kernel/sched/sched.h    |  11 +++++
 2 files changed, 95 insertions(+), 34 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 61ea303..6c6a1f09 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -51,6 +51,7 @@ void add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
 	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
 	dl_rq->running_bw += dl_bw;
 	SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
+	SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
 }
 
 static inline
@@ -65,25 +66,52 @@ void sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
 		dl_rq->running_bw = 0;
 }
 
+static inline
+void add_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
+{
+	u64 old = dl_rq->this_bw;
+
+	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	dl_rq->this_bw += dl_bw;
+	SCHED_WARN_ON(dl_rq->this_bw < old); /* overflow */
+}
+
+static inline
+void sub_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
+{
+	u64 old = dl_rq->this_bw;
+
+	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
+	dl_rq->this_bw -= dl_bw;
+	SCHED_WARN_ON(dl_rq->this_bw > old); /* underflow */
+	if (dl_rq->this_bw > old)
+		dl_rq->this_bw = 0;
+	SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
+}
+
 void dl_change_utilization(struct task_struct *p, u64 new_bw)
 {
-	if (task_on_rq_queued(p))
-		return;
+	struct rq *rq;
 
-	if (!p->dl.dl_non_contending)
+	if (task_on_rq_queued(p))
 		return;
 
-	sub_running_bw(p->dl.dl_bw, &task_rq(p)->dl);
-	p->dl.dl_non_contending = 0;
-	/*
-	 * If the timer handler is currently running and the
-	 * timer cannot be cancelled, inactive_task_timer()
-	 * will see that dl_not_contending is not set, and
-	 * will not touch the rq's active utilization,
-	 * so we are still safe.
-	 */
-	if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
-		put_task_struct(p);
+	rq = task_rq(p);
+	if (p->dl.dl_non_contending) {
+		sub_running_bw(p->dl.dl_bw, &rq->dl);
+		p->dl.dl_non_contending = 0;
+		/*
+		 * If the timer handler is currently running and the
+		 * timer cannot be cancelled, inactive_task_timer()
+		 * will see that dl_not_contending is not set, and
+		 * will not touch the rq's active utilization,
+		 * so we are still safe.
+		 */
+		if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
+			put_task_struct(p);
+	}
+	sub_rq_bw(p->dl.dl_bw, &rq->dl);
+	add_rq_bw(new_bw, &rq->dl);
 }
 
 /*
@@ -178,6 +206,8 @@ static void task_non_contending(struct task_struct *p)
 		if (!dl_task(p) || p->state == TASK_DEAD) {
 			struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
 
+			if (p->state == TASK_DEAD)
+				sub_rq_bw(p->dl.dl_bw, &rq->dl);
 			raw_spin_lock(&dl_b->lock);
 			__dl_clear(dl_b, p->dl.dl_bw);
 			__dl_clear_params(p);
@@ -192,7 +222,7 @@ static void task_non_contending(struct task_struct *p)
 	hrtimer_start(timer, ns_to_ktime(zerolag_time), HRTIMER_MODE_REL);
 }
 
-static void task_contending(struct sched_dl_entity *dl_se)
+static void task_contending(struct sched_dl_entity *dl_se, int flags)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 
@@ -203,6 +233,9 @@ static void task_contending(struct sched_dl_entity *dl_se)
 	if (dl_se->dl_runtime == 0)
 		return;
 
+	if (flags & ENQUEUE_MIGRATED)
+		add_rq_bw(dl_se->dl_bw, dl_rq);
+
 	if (dl_se->dl_non_contending) {
 		dl_se->dl_non_contending = 0;
 		/*
@@ -268,6 +301,7 @@ void init_dl_rq(struct dl_rq *dl_rq)
 #endif
 
 	dl_rq->running_bw = 0;
+	dl_rq->this_bw = 0;
 	init_dl_rq_bw_ratio(dl_rq);
 }
 
@@ -1042,6 +1076,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 
 		if (p->state == TASK_DEAD && dl_se->dl_non_contending) {
 			sub_running_bw(p->dl.dl_bw, dl_rq_of_se(&p->dl));
+			sub_rq_bw(p->dl.dl_bw, dl_rq_of_se(&p->dl));
 			dl_se->dl_non_contending = 0;
 		}
 
@@ -1207,7 +1242,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
 	 * we want a replenishment of its runtime.
 	 */
 	if (flags & ENQUEUE_WAKEUP) {
-		task_contending(dl_se);
+		task_contending(dl_se, flags);
 		update_dl_entity(dl_se, pi_se);
 	} else if (flags & ENQUEUE_REPLENISH) {
 		replenish_dl_entity(dl_se, pi_se);
@@ -1260,8 +1295,10 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	if (!p->dl.dl_throttled && dl_is_constrained(&p->dl))
 		dl_check_constrained_dl(&p->dl);
 
-	if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & ENQUEUE_RESTORE)
+	if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & ENQUEUE_RESTORE) {
+		add_rq_bw(p->dl.dl_bw, &rq->dl);
 		add_running_bw(p->dl.dl_bw, &rq->dl);
+	}
 
 	/*
 	 * If p is throttled, we do not enqueue it. In fact, if it exhausted
@@ -1277,7 +1314,7 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	 */
 	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
 		if (flags & ENQUEUE_WAKEUP)
-			task_contending(&p->dl);
+			task_contending(&p->dl, flags);
 
 		return;
 	}
@@ -1299,8 +1336,10 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	update_curr_dl(rq);
 	__dequeue_task_dl(rq, p, flags);
 
-	if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & DEQUEUE_SAVE)
+	if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & DEQUEUE_SAVE) {
 		sub_running_bw(p->dl.dl_bw, &rq->dl);
+		sub_rq_bw(p->dl.dl_bw, &rq->dl);
+	}
 
 	/*
 	 * This check allows to start the inactive timer (or to immediately
@@ -1394,7 +1433,7 @@ static void migrate_task_rq_dl(struct task_struct *p)
 {
 	struct rq *rq;
 
-	if (!(p->state == TASK_WAKING) || !(p->dl.dl_non_contending))
+	if (p->state != TASK_WAKING)
 		return;
 
 	rq = task_rq(p);
@@ -1404,18 +1443,20 @@ static void migrate_task_rq_dl(struct task_struct *p)
 	 * rq->lock is not... So, lock it
 	 */
 	raw_spin_lock(&rq->lock);
-	sub_running_bw(p->dl.dl_bw, &rq->dl);
-	p->dl.dl_non_contending = 0;
-	/*
-	 * If the timer handler is currently running and the
-	 * timer cannot be cancelled, inactive_task_timer()
-	 * will see that dl_not_contending is not set, and
-	 * will not touch the rq's active utilization,
-	 * so we are still safe.
-	 */
-	if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
-		put_task_struct(p);
-
+	if (p->dl.dl_non_contending) {
+		sub_running_bw(p->dl.dl_bw, &rq->dl);
+		p->dl.dl_non_contending = 0;
+		/*
+		 * If the timer handler is currently running and the
+		 * timer cannot be cancelled, inactive_task_timer()
+		 * will see that dl_not_contending is not set, and
+		 * will not touch the rq's active utilization,
+		 * so we are still safe.
+		 */
+		if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
+			put_task_struct(p);
+	}
+	sub_rq_bw(p->dl.dl_bw, &rq->dl);
 	raw_spin_unlock(&rq->lock);
 }
 
@@ -1858,7 +1899,9 @@ retry:
 
 	deactivate_task(rq, next_task, 0);
 	sub_running_bw(next_task->dl.dl_bw, &rq->dl);
+	sub_rq_bw(next_task->dl.dl_bw, &rq->dl);
 	set_task_cpu(next_task, later_rq->cpu);
+	add_rq_bw(next_task->dl.dl_bw, &later_rq->dl);
 	add_running_bw(next_task->dl.dl_bw, &later_rq->dl);
 	activate_task(later_rq, next_task, 0);
 	ret = 1;
@@ -1948,7 +1991,9 @@ static void pull_dl_task(struct rq *this_rq)
 
 			deactivate_task(src_rq, p, 0);
 			sub_running_bw(p->dl.dl_bw, &src_rq->dl);
+			sub_rq_bw(p->dl.dl_bw, &src_rq->dl);
 			set_task_cpu(p, this_cpu);
+			add_rq_bw(p->dl.dl_bw, &this_rq->dl);
 			add_running_bw(p->dl.dl_bw, &this_rq->dl);
 			activate_task(this_rq, p, 0);
 			dmin = p->dl.deadline;
@@ -2057,6 +2102,9 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
 	if (task_on_rq_queued(p) && p->dl.dl_runtime)
 		task_non_contending(p);
 
+	if (!task_on_rq_queued(p))
+		sub_rq_bw(p->dl.dl_bw, &rq->dl);
+
 	/*
 	 * We cannot use inactive_task_timer() to invoke sub_running_bw()
 	 * at the 0-lag time, because the task could have been migrated
@@ -2086,9 +2134,11 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
 		put_task_struct(p);
 
 	/* If p is not queued we will update its parameters at next wakeup. */
-	if (!task_on_rq_queued(p))
-		return;
+	if (!task_on_rq_queued(p)) {
+		add_rq_bw(p->dl.dl_bw, &rq->dl);
 
+		return;
+	}
 	/*
 	 * If p is boosted we already updated its params in
 	 * rt_mutex_setprio()->enqueue_task(..., ENQUEUE_REPLENISH),
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 878fe75..b7321da 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -567,6 +567,17 @@ struct dl_rq {
 	u64 running_bw;
 
 	/*
+	 * Utilization of the tasks "assigned" to this runqueue (including
+	 * the tasks that are in runqueue and the tasks that executed on this
+	 * CPU and blocked). Increased when a task moves to this runqueue, and
+	 * decreased when the task moves away (migrates, changes scheduling
+	 * policy, or terminates).
+	 * This is needed to compute the "inactive utilization" for the
+	 * runqueue (inactive utilization = this_bw - running_bw).
+	 */
+	u64 this_bw;
+
+	/*
 	 * Inverse of the fraction of CPU utilization that can be reclaimed
 	 * by the GRUB algorithm.
 	 */
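
To make the two counters easier to picture, here is a toy user-space model of
this_bw and running_bw; the struct, the helper names and the scenario are
invented for illustration, and only the bookkeeping and the
running_bw <= this_bw invariant mirror the patch.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct toy_dl_rq {
	uint64_t running_bw;	/* Uact * 2^20: active tasks only */
	uint64_t this_bw;	/* Utot * 2^20: every task assigned to the rq */
};

static void task_assigned(struct toy_dl_rq *rq, uint64_t bw) { rq->this_bw += bw; }
static void task_wakes(struct toy_dl_rq *rq, uint64_t bw)    { rq->running_bw += bw; }
static void task_0lag(struct toy_dl_rq *rq, uint64_t bw)     { rq->running_bw -= bw; }
static void task_leaves(struct toy_dl_rq *rq, uint64_t bw)   { rq->this_bw -= bw; }

int main(void)
{
	struct toy_dl_rq rq = { 0, 0 };
	uint64_t bw = (10ULL << 20) / 30;	/* utilization 1/3 */

	task_assigned(&rq, bw);			/* task moves to this rq */
	task_wakes(&rq, bw);			/* task becomes ActiveContending */
	assert(rq.running_bw <= rq.this_bw);

	task_0lag(&rq, bw);			/* blocked and past its 0-lag time */
	printf("Uinact = %llu (in 2^-20 units)\n",
	       (unsigned long long)(rq.this_bw - rq.running_bw));

	task_leaves(&rq, bw);			/* terminates or migrates away */
	return 0;
}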

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [tip:sched/core] sched/deadline: Base GRUB reclaiming on the inactive utilization
  2017-05-18 20:13 ` [PATCH 08/10] sched/deadline: base GRUB reclaiming on the inactive utilization luca abeni
@ 2017-06-08  9:27   ` tip-bot for Luca Abeni
  0 siblings, 0 replies; 26+ messages in thread
From: tip-bot for Luca Abeni @ 2017-06-08  9:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, luca.abeni, torvalds, mingo, bristot, mathieu.poirier,
	juri.lelli, joelaf, linux-kernel, peterz, rostedt, claudio, hpa,
	efault, tommaso.cucinotta

Commit-ID:  9f0d1a5077399143aad7e1244bb031e29116074e
Gitweb:     http://git.kernel.org/tip/9f0d1a5077399143aad7e1244bb031e29116074e
Author:     Luca Abeni <luca.abeni@santannapisa.it>
AuthorDate: Thu, 18 May 2017 22:13:35 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 8 Jun 2017 10:31:54 +0200

sched/deadline: Base GRUB reclaiming on the inactive utilization

Instead of decreasing the runtime as "dq = -Uact dt" (possibly
divided by the maximum utilization available for deadline tasks),
decrease it as "dq = -max{u, (1 - Uinact)} dt", where u is the task
utilization and Uinact is the "inactive utilization".
In this way, the maximum fraction of CPU time that can be reclaimed
is given by the total utilization of deadline tasks.
This approach solves a fairness issue with "traditional" global GRUB
reclaiming: using the traditional GRUB algorithm, if tasks are
allocated to the various cores in a non-uniform way, the
reclaiming mechanism allows some tasks to reclaim more time than
others. This issue is visible when starting 11 time-consuming tasks with
runtime 10ms and period 30ms (total utilization 3.666) on a 4-core
system: some tasks will receive much more than the reserved runtime
(thanks to the reclaiming mechanism), while other tasks will receive
less than the reserved runtime.

Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-9-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/deadline.c | 40 ++++++++++++++++++++++------------------
 1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 6c6a1f09..7d2f057 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -955,26 +955,30 @@ extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
 /*
  * This function implements the GRUB accounting rule:
  * according to the GRUB reclaiming algorithm, the runtime is
- * not decreased as "dq = -dt", but as "dq = -Uact dt", where
- * Uact is the (per-runqueue) active utilization.
- * Since rq->dl.running_bw contains Uact * 2^BW_SHIFT, the result
- * has to be shifted right by BW_SHIFT.
- * To reclaim only a fraction Umax of the CPU time, the
- * runtime accounting rule is modified as
- * "dq = -Uact / Umax dt"; since rq->dl.bw_ratio contains
- * 2^RATIO_SHIFT / Umax, delta is multiplied by bw_ratio and shifted
- * right by RATIO_SHIFT.
- * Since delta is a 64 bit variable, to have an overflow its value
- * should be larger than 2^(64 - 20 - 8), which is more than 64 seconds.
- * So, overflow is not an issue here.
+ * not decreased as "dq = -dt", but as "dq = -max{u, (1 - Uinact)} dt",
+ * where u is the utilization of the task and Uinact is the
+ * (per-runqueue) inactive utilization, computed as the difference
+ * between the "total runqueue utilization" and the runqueue
+ * active utilization.
+ * Since rq->dl.running_bw and rq->dl.this_bw contain utilizations
+ * multiplied by 2^BW_SHIFT, the result has to be shifted right by BW_SHIFT.
  */
-u64 grub_reclaim(u64 delta, struct rq *rq)
+u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
 {
-	delta *= rq->dl.running_bw;
-	delta *= rq->dl.bw_ratio;
-	delta >>= BW_SHIFT + RATIO_SHIFT;
+	u64 u_inact = rq->dl.this_bw - rq->dl.running_bw; /* Utot - Uact */
+	u64 u_act;
 
-	return delta;
+	/*
+	 * Instead of computing max{u, (1 - u_inact)}, we compare
+	 * u_inact with 1 - u, because u_inact can be larger than 1
+	 * (so, 1 - u_inact would be negative leading to wrong results)
+	 */
+	if (u_inact > BW_UNIT - dl_se->dl_bw)
+		u_act = dl_se->dl_bw;
+	else
+		u_act = BW_UNIT - u_inact;
+
+	return (delta * u_act) >> BW_SHIFT;
 }
 
 /*
@@ -1020,7 +1024,7 @@ static void update_curr_dl(struct rq *rq)
 	sched_rt_avg_update(rq, delta_exec);
 
 	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
-		delta_exec = grub_reclaim(delta_exec, rq);
+		delta_exec = grub_reclaim(delta_exec, rq, &curr->dl);
 	dl_se->runtime -= delta_exec;
 
 throttle:
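
A user-space sketch of the new rule "dq = -max{u, (1 - Uinact)} dt"; the
comparison against (BW_UNIT - u) mirrors the patch, while the sample
utilizations in main() are made up for the example.

#include <stdio.h>
#include <stdint.h>

#define BW_SHIFT	20
#define BW_UNIT		(1ULL << BW_SHIFT)

static uint64_t grub_reclaim(uint64_t delta, uint64_t this_bw,
			     uint64_t running_bw, uint64_t u)
{
	uint64_t u_inact = this_bw - running_bw;	/* Utot - Uact */
	uint64_t u_act;

	/* compare u_inact with 1 - u, since u_inact may exceed 1 */
	if (u_inact > BW_UNIT - u)
		u_act = u;
	else
		u_act = BW_UNIT - u_inact;

	return (delta * u_act) >> BW_SHIFT;
}

int main(void)
{
	uint64_t u = BW_UNIT / 3;		/* task utilization 1/3 */
	uint64_t this_bw = 2 * BW_UNIT / 3;	/* two such tasks assigned */
	uint64_t running_bw = BW_UNIT / 3;	/* only one currently active */

	/* 1ms of execution is charged as roughly (1 - 1/3) * 1ms */
	printf("charged %llu ns\n",
	       (unsigned long long)grub_reclaim(1000000, this_bw,
						running_bw, u));
	return 0;
}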

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [tip:sched/core] sched/deadline: Reclaim bandwidth not used by dl tasks
  2017-05-18 20:13 ` [PATCH 09/10] sched/deadline: also reclaim bandwidth not used by dl tasks luca abeni
@ 2017-06-08  9:28   ` tip-bot for Luca Abeni
  0 siblings, 0 replies; 26+ messages in thread
From: tip-bot for Luca Abeni @ 2017-06-08  9:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, claudio, tglx, peterz, rostedt, efault, juri.lelli,
	tommaso.cucinotta, hpa, joelaf, mathieu.poirier, luca.abeni,
	bristot, torvalds, linux-kernel

Commit-ID:  daec5798367012951cdb54fdb5c006e4379c9ae9
Gitweb:     http://git.kernel.org/tip/daec5798367012951cdb54fdb5c006e4379c9ae9
Author:     Luca Abeni <luca.abeni@santannapisa.it>
AuthorDate: Thu, 18 May 2017 22:13:36 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 8 Jun 2017 10:31:55 +0200

sched/deadline: Reclaim bandwidth not used by dl tasks

This commit introduces a per-runqueue "extra utilization" that can be
reclaimed by deadline tasks. In this way, the maximum fraction of CPU
time that can be reclaimed by deadline tasks is fixed (and configurable)
and does not depend on the total deadline utilization.
The GRUB accounting rule is modified to add this "extra utilization"
to the inactive utilization of the runqueue, and to avoid reclaiming
more than a maximum fraction of the CPU time.

Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-10-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c     | 17 ++++++++++-------
 kernel/sched/deadline.c | 42 +++++++++++++++++++++++++++---------------
 kernel/sched/sched.h    | 37 +++++++++++++++++++++++++++++++++++--
 3 files changed, 72 insertions(+), 24 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8d1a5a6..7996479 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2444,7 +2444,7 @@ inline struct dl_bw *dl_bw_of(int i)
 	return &cpu_rq(i)->rd->dl_bw;
 }
 
-static inline int dl_bw_cpus(int i)
+inline int dl_bw_cpus(int i)
 {
 	struct root_domain *rd = cpu_rq(i)->rd;
 	int cpus = 0;
@@ -2462,7 +2462,7 @@ inline struct dl_bw *dl_bw_of(int i)
 	return &cpu_rq(i)->dl.dl_bw;
 }
 
-static inline int dl_bw_cpus(int i)
+inline int dl_bw_cpus(int i)
 {
 	return 1;
 }
@@ -2500,8 +2500,8 @@ static int dl_overflow(struct task_struct *p, int policy,
 	if (dl_policy(policy) && !task_has_dl_policy(p) &&
 	    !__dl_overflow(dl_b, cpus, 0, new_bw)) {
 		if (hrtimer_active(&p->dl.inactive_timer))
-			__dl_clear(dl_b, p->dl.dl_bw);
-		__dl_add(dl_b, new_bw);
+			__dl_clear(dl_b, p->dl.dl_bw, cpus);
+		__dl_add(dl_b, new_bw, cpus);
 		err = 0;
 	} else if (dl_policy(policy) && task_has_dl_policy(p) &&
 		   !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
@@ -2512,8 +2512,8 @@ static int dl_overflow(struct task_struct *p, int policy,
 		 * But this would require to set the task's "inactive
 		 * timer" when the task is not inactive.
 		 */
-		__dl_clear(dl_b, p->dl.dl_bw);
-		__dl_add(dl_b, new_bw);
+		__dl_clear(dl_b, p->dl.dl_bw, cpus);
+		__dl_add(dl_b, new_bw, cpus);
 		dl_change_utilization(p, new_bw);
 		err = 0;
 	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
@@ -5515,7 +5515,7 @@ int task_can_attach(struct task_struct *p,
 			 * We will free resources in the source root_domain
 			 * later on (see set_cpus_allowed_dl()).
 			 */
-			__dl_add(dl_b, p->dl.dl_bw);
+			__dl_add(dl_b, p->dl.dl_bw, cpus);
 		}
 		raw_spin_unlock_irqrestore(&dl_b->lock, flags);
 		rcu_read_unlock_sched();
@@ -6764,9 +6764,12 @@ void init_dl_rq_bw_ratio(struct dl_rq *dl_rq)
 {
 	if (global_rt_runtime() == RUNTIME_INF) {
 		dl_rq->bw_ratio = 1 << RATIO_SHIFT;
+		dl_rq->extra_bw = 1 << BW_SHIFT;
 	} else {
 		dl_rq->bw_ratio = to_ratio(global_rt_runtime(),
 			  global_rt_period()) >> (BW_SHIFT - RATIO_SHIFT);
+		dl_rq->extra_bw = to_ratio(global_rt_period(),
+						    global_rt_runtime());
 	}
 }
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 7d2f057..e3b25df 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -209,7 +209,7 @@ static void task_non_contending(struct task_struct *p)
 			if (p->state == TASK_DEAD)
 				sub_rq_bw(p->dl.dl_bw, &rq->dl);
 			raw_spin_lock(&dl_b->lock);
-			__dl_clear(dl_b, p->dl.dl_bw);
+			__dl_clear(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
 			__dl_clear_params(p);
 			raw_spin_unlock(&dl_b->lock);
 		}
@@ -955,28 +955,40 @@ extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
 /*
  * This function implements the GRUB accounting rule:
  * according to the GRUB reclaiming algorithm, the runtime is
- * not decreased as "dq = -dt", but as "dq = -max{u, (1 - Uinact)} dt",
- * where u is the utilization of the task and Uinact is the
- * (per-runqueue) inactive utilization, computed as the difference
- * between the "total runqueue utilization" and the runqueue
- * active utilization.
+ * not decreased as "dq = -dt", but as
+ * "dq = -max{u / Umax, (1 - Uinact - Uextra)} dt",
+ * where u is the utilization of the task, Umax is the maximum reclaimable
+ * utilization, Uinact is the (per-runqueue) inactive utilization, computed
+ * as the difference between the "total runqueue utilization" and the
+ * runqueue active utilization, and Uextra is the (per runqueue) extra
+ * reclaimable utilization.
  * Since rq->dl.running_bw and rq->dl.this_bw contain utilizations
- * multiplied by 2^BW_SHIFT, the result has to be shifted right by BW_SHIFT.
+ * multiplied by 2^BW_SHIFT, the result has to be shifted right by
+ * BW_SHIFT.
+ * Since rq->dl.bw_ratio contains 1 / Umax multiplied by 2^RATIO_SHIFT,
+ * dl_bw is multiplied by rq->dl.bw_ratio and shifted right by RATIO_SHIFT.
+ * Since delta is a 64 bit variable, to have an overflow its value
+ * should be larger than 2^(64 - 20 - 8), which is more than 64 seconds.
+ * So, overflow is not an issue here.
  */
 u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
 {
 	u64 u_inact = rq->dl.this_bw - rq->dl.running_bw; /* Utot - Uact */
 	u64 u_act;
+	u64 u_act_min = (dl_se->dl_bw * rq->dl.bw_ratio) >> RATIO_SHIFT;
 
 	/*
-	 * Instead of computing max{u, (1 - u_inact)}, we compare
-	 * u_inact with 1 - u, because u_inact can be larger than 1
-	 * (so, 1 - u_inact would be negative leading to wrong results)
+	 * Instead of computing max{u * bw_ratio, (1 - u_inact - u_extra)},
+	 * we compare u_inact + rq->dl.extra_bw with
+	 * 1 - (u * rq->dl.bw_ratio >> RATIO_SHIFT), because
+	 * u_inact + rq->dl.extra_bw can be larger than
+	 * 1 (so, 1 - u_inact - rq->dl.extra_bw would be negative
+	 * leading to wrong results)
 	 */
-	if (u_inact > BW_UNIT - dl_se->dl_bw)
-		u_act = dl_se->dl_bw;
+	if (u_inact + rq->dl.extra_bw > BW_UNIT - u_act_min)
+		u_act = u_act_min;
 	else
-		u_act = BW_UNIT - u_inact;
+		u_act = BW_UNIT - u_inact - rq->dl.extra_bw;
 
 	return (delta * u_act) >> BW_SHIFT;
 }
@@ -1085,7 +1097,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 		}
 
 		raw_spin_lock(&dl_b->lock);
-		__dl_clear(dl_b, p->dl.dl_bw);
+		__dl_clear(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
 		raw_spin_unlock(&dl_b->lock);
 		__dl_clear_params(p);
 
@@ -2054,7 +2066,7 @@ static void set_cpus_allowed_dl(struct task_struct *p,
 		 * until we complete the update.
 		 */
 		raw_spin_lock(&src_dl_b->lock);
-		__dl_clear(src_dl_b, p->dl.dl_bw);
+		__dl_clear(src_dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
 		raw_spin_unlock(&src_dl_b->lock);
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b7321da..f1e400c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -219,22 +219,27 @@ static inline int dl_bandwidth_enabled(void)
 }
 
 extern struct dl_bw *dl_bw_of(int i);
+extern int dl_bw_cpus(int i);
 
 struct dl_bw {
 	raw_spinlock_t lock;
 	u64 bw, total_bw;
 };
 
+static inline void __dl_update(struct dl_bw *dl_b, s64 bw);
+
 static inline
-void __dl_clear(struct dl_bw *dl_b, u64 tsk_bw)
+void __dl_clear(struct dl_bw *dl_b, u64 tsk_bw, int cpus)
 {
 	dl_b->total_bw -= tsk_bw;
+	__dl_update(dl_b, (s32)tsk_bw / cpus);
 }
 
 static inline
-void __dl_add(struct dl_bw *dl_b, u64 tsk_bw)
+void __dl_add(struct dl_bw *dl_b, u64 tsk_bw, int cpus)
 {
 	dl_b->total_bw += tsk_bw;
+	__dl_update(dl_b, -((s32)tsk_bw / cpus));
 }
 
 static inline
@@ -576,6 +581,7 @@ struct dl_rq {
 	 * runqueue (inactive utilization = this_bw - running_bw).
 	 */
 	u64 this_bw;
+	u64 extra_bw;
 
 	/*
 	 * Inverse of the fraction of CPU utilization that can be reclaimed
@@ -1958,6 +1964,33 @@ extern void nohz_balance_exit_idle(unsigned int cpu);
 static inline void nohz_balance_exit_idle(unsigned int cpu) { }
 #endif
 
+
+#ifdef CONFIG_SMP
+static inline
+void __dl_update(struct dl_bw *dl_b, s64 bw)
+{
+	struct root_domain *rd = container_of(dl_b, struct root_domain, dl_bw);
+	int i;
+
+	RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
+			 "sched RCU must be held");
+	for_each_cpu_and(i, rd->span, cpu_active_mask) {
+		struct rq *rq = cpu_rq(i);
+
+		rq->dl.extra_bw += bw;
+	}
+}
+#else
+static inline
+void __dl_update(struct dl_bw *dl_b, s64 bw)
+{
+	struct dl_rq *dl = container_of(dl_b, struct dl_rq, dl_bw);
+
+	dl->extra_bw += bw;
+}
+#endif
+
+
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 struct irqtime {
 	u64			total;
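
A user-space sketch of the final rule
"dq = -max{u / Umax, (1 - Uinact - Uextra)} dt"; the shifts and the structure
of the comparison follow the patch, while the scenario in main() (a single
u = 1/3 task on one CPU with the default 950000/1000000 rt bandwidth, so
Umax = 0.95) is only an example.

#include <stdio.h>
#include <stdint.h>

#define BW_SHIFT	20
#define BW_UNIT		(1ULL << BW_SHIFT)
#define RATIO_SHIFT	8

static uint64_t to_ratio(uint64_t period, uint64_t runtime)
{
	return (runtime << BW_SHIFT) / period;
}

static uint64_t grub_reclaim(uint64_t delta, uint64_t u_inact,
			     uint64_t extra_bw, uint64_t bw_ratio,
			     uint64_t dl_bw)
{
	uint64_t u_act_min = (dl_bw * bw_ratio) >> RATIO_SHIFT;	/* u / Umax */
	uint64_t u_act;

	if (u_inact + extra_bw > BW_UNIT - u_act_min)
		u_act = u_act_min;
	else
		u_act = BW_UNIT - u_inact - extra_bw;

	return (delta * u_act) >> BW_SHIFT;
}

int main(void)
{
	uint64_t dl_bw = BW_UNIT / 3;	/* task with runtime/period = 1/3 */
	uint64_t bw_ratio = to_ratio(950000, 1000000) >> (BW_SHIFT - RATIO_SHIFT);
	/* extra_bw starts at Umax and loses dl_bw when the task is admitted */
	uint64_t extra_bw = to_ratio(1000000, 950000) - dl_bw;

	/* single active task, so Uinact = this_bw - running_bw = 0;
	 * with these numbers roughly 0.38ms is charged per 1ms of execution */
	printf("1ms of execution charged as %llu ns\n",
	       (unsigned long long)grub_reclaim(1000000, 0, extra_bw,
						bw_ratio, dl_bw));
	return 0;
}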

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [tip:sched/core] sched/deadline: Add documentation about GRUB reclaiming
  2017-05-18 20:13 ` [PATCH 10/10] sched/deadline: documentation about GRUB reclaiming luca abeni
@ 2017-06-08  9:28   ` tip-bot for Claudio Scordino
  0 siblings, 0 replies; 26+ messages in thread
From: tip-bot for Claudio Scordino @ 2017-06-08  9:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: joelaf, claudio, luca.abeni, efault, rostedt, tommaso.cucinotta,
	linux-kernel, mathieu.poirier, hpa, tglx, torvalds, peterz,
	mingo, bristot, juri.lelli

Commit-ID:  ccc9d651a7d26fec88d2375c4dd4e097cc0f8de7
Gitweb:     http://git.kernel.org/tip/ccc9d651a7d26fec88d2375c4dd4e097cc0f8de7
Author:     Claudio Scordino <claudio@evidence.eu.com>
AuthorDate: Thu, 18 May 2017 22:13:37 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 8 Jun 2017 10:31:56 +0200

sched/deadline: Add documentation about GRUB reclaiming

This patch adds documentation about the GRUB reclaiming algorithm,
including a few details discussed on the mailing list.

Signed-off-by: Claudio Scordino <claudio@evidence.eu.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1495138417-6203-11-git-send-email-luca.abeni@santannapisa.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 Documentation/scheduler/sched-deadline.txt | 168 +++++++++++++++++++++++++++++
 1 file changed, 168 insertions(+)

diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt
index cbc1b46..e89e36e 100644
--- a/Documentation/scheduler/sched-deadline.txt
+++ b/Documentation/scheduler/sched-deadline.txt
@@ -7,6 +7,8 @@ CONTENTS
  0. WARNING
  1. Overview
  2. Scheduling algorithm
+   2.1 Main algorithm
+   2.2 Bandwidth reclaiming
  3. Scheduling Real-Time Tasks
    3.1 Definitions
    3.2 Schedulability Analysis for Uniprocessor Systems
@@ -44,6 +46,9 @@ CONTENTS
 2. Scheduling algorithm
 ==================
 
+2.1 Main algorithm
+------------------
+
  SCHED_DEADLINE uses three parameters, named "runtime", "period", and
  "deadline", to schedule tasks. A SCHED_DEADLINE task should receive
  "runtime" microseconds of execution time every "period" microseconds, and
@@ -113,6 +118,160 @@ CONTENTS
          remaining runtime = remaining runtime + runtime
 
 
+2.2 Bandwidth reclaiming
+------------------------
+
+ Bandwidth reclaiming for deadline tasks is based on the GRUB (Greedy
+ Reclamation of Unused Bandwidth) algorithm [15, 16, 17] and it is enabled
+ when the SCHED_FLAG_RECLAIM flag is set.
+
+ The following diagram illustrates the state names for tasks handled by GRUB:
+
+                             ------------
+                 (d)        |   Active   |
+              ------------->|            |
+              |             | Contending |
+              |              ------------
+              |                A      |
+          ----------           |      |
+         |          |          |      |
+         | Inactive |          |(b)   | (a)
+         |          |          |      |
+          ----------           |      |
+              A                |      V
+              |              ------------
+              |             |   Active   |
+              --------------|     Non    |
+                 (c)        | Contending |
+                             ------------
+
+ A task can be in one of the following states:
+
+  - ActiveContending: if it is ready for execution (or executing);
+
+  - ActiveNonContending: if it just blocked and has not yet surpassed the 0-lag
+    time;
+
+  - Inactive: if it is blocked and has surpassed the 0-lag time.
+
+ State transitions:
+
+  (a) When a task blocks, it does not become immediately inactive since its
+      bandwidth cannot be immediately reclaimed without breaking the
+      real-time guarantees. It therefore enters a transitional state called
+      ActiveNonContending. The scheduler arms the "inactive timer" to fire at
+      the 0-lag time, when the task's bandwidth can be reclaimed without
+      breaking the real-time guarantees.
+
+      The 0-lag time for a task entering the ActiveNonContending state is
+      computed as
+
+                        (runtime * dl_period)
+             deadline - ---------------------
+                             dl_runtime
+
+      where runtime is the remaining runtime, while dl_runtime and dl_period
+      are the reservation parameters.
+
+  (b) If the task wakes up before the inactive timer fires, the task re-enters
+      the ActiveContending state and the "inactive timer" is canceled.
+      In addition, if the task wakes up on a different runqueue, then
+      the task's utilization must be removed from the previous runqueue's active
+      utilization and must be added to the new runqueue's active utilization.
+      In order to avoid races between a task waking up on a runqueue and the
+      "inactive timer" running on a different CPU, the "dl_non_contending"
+      flag is used to indicate that a task is not on a runqueue but is active
+      (so, the flag is set when the task blocks and is cleared when the
+      "inactive timer" fires or when the task wakes up).
+
+  (c) When the "inactive timer" fires, the task enters the Inactive state and
+      its utilization is removed from the runqueue's active utilization.
+
+  (d) When an inactive task wakes up, it enters the ActiveContending state and
+      its utilization is added to the active utilization of the runqueue where
+      it has been enqueued.
+
+ For each runqueue, the algorithm GRUB keeps track of two different bandwidths:
+
+  - Active bandwidth (running_bw): this is the sum of the bandwidths of all
+    tasks in active state (i.e., ActiveContending or ActiveNonContending);
+
+  - Total bandwidth (this_bw): this is the sum of the bandwidths of all tasks
+    "belonging" to the runqueue, including the tasks in Inactive state.
+
+
+ The algorithm reclaims the bandwidth of the tasks in Inactive state.
+ It does so by decrementing the runtime of the executing task Ti at a pace equal
+ to
+
+           dq = -max{ Ui, (1 - Uinact) } dt
+
+ where Uinact is the inactive utilization, computed as (this_bw - running_bw),
+ and Ui is the bandwidth of task Ti.
+
+
+ Let's now see a trivial example of two deadline tasks with runtime equal
+ to 4 and period equal to 8 (i.e., bandwidth equal to 0.5):
+
+     A            Task T1
+     |
+     |                               |
+     |                               |
+     |--------                       |----
+     |       |                       V
+     |---|---|---|---|---|---|---|---|--------->t
+     0   1   2   3   4   5   6   7   8
+
+
+     A            Task T2
+     |
+     |                               |
+     |                               |
+     |       ------------------------|
+     |       |                       V
+     |---|---|---|---|---|---|---|---|--------->t
+     0   1   2   3   4   5   6   7   8
+
+
+     A            running_bw
+     |
+   1 -----------------               ------
+     |               |               |
+  0.5-               -----------------
+     |                               |
+     |---|---|---|---|---|---|---|---|--------->t
+     0   1   2   3   4   5   6   7   8
+
+
+  - Time t = 0:
+
+    Both tasks are ready for execution and therefore in ActiveContending state.
+    Suppose Task T1 is the first task to start execution.
+    Since there are no inactive tasks, its runtime is decreased as dq = -1 dt.
+
+  - Time t = 2:
+
+    Suppose that task T1 blocks.
+    Task T1 therefore enters the ActiveNonContending state. Since its remaining
+    runtime is equal to 2, its 0-lag time is equal to t = 4.
+    Task T2 starts executing, with its runtime still decreased as dq = -1 dt since
+    there are no inactive tasks.
+
+  - Time t = 4:
+
+    This is the 0-lag time for Task T1. Since it did not wake up in the
+    meantime, it enters the Inactive state. Its bandwidth is removed from
+    running_bw.
+    Task T2 continues its execution. However, its runtime is now decreased as
+    dq = -0.5 dt because Uinact = 0.5.
+    Task T2 therefore reclaims the bandwidth unused by Task T1.
+
+  - Time t = 8:
+
+    Task T1 wakes up. It enters the ActiveContending state again, and the
+    running_bw is incremented.
+
+
 3. Scheduling Real-Time Tasks
 =============================
 
@@ -330,6 +489,15 @@ CONTENTS
   14 - J. Erickson, U. Devi and S. Baruah. Improved tardiness bounds for
        Global EDF. Proceedings of the 22nd Euromicro Conference on
        Real-Time Systems, 2010.
+  15 - G. Lipari, S. Baruah, Greedy reclamation of unused bandwidth in
+       constant-bandwidth servers, 12th IEEE Euromicro Conference on Real-Time
+       Systems, 2000.
+  16 - L. Abeni, J. Lelli, C. Scordino, L. Palopoli, Greedy CPU reclaiming for
+       SCHED DEADLINE. In Proceedings of the Real-Time Linux Workshop (RTLWS),
+       Dusseldorf, Germany, 2014.
+  17 - L. Abeni, G. Lipari, A. Parri, Y. Sun, Multicore CPU reclaiming: parallel
+       or sequential?. In Proceedings of the 31st Annual ACM Symposium on Applied
+       Computing, 2016.
 
 
 4. Bandwidth management
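
As a quick check of the 0-lag formula quoted in the new documentation, here is
a tiny user-space sketch using the Task T1 numbers from the example above;
the helper name and the unit-less integers are illustrative only.

#include <stdio.h>
#include <stdint.h>

static int64_t zerolag_time(int64_t deadline, int64_t runtime,
			    int64_t dl_runtime, int64_t dl_period)
{
	return deadline - (runtime * dl_period) / dl_runtime;
}

int main(void)
{
	/* T1 blocks at t = 2 with 2 units of runtime left and deadline 8:
	 * 0-lag time = 8 - 2*8/4 = 4, matching the documentation example */
	printf("0-lag time = %lld\n",
	       (long long)zerolag_time(8, 2, 4, 8));
	return 0;
}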

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 01/10] sched/deadline: track the active utilization
  2017-06-08  9:05       ` Juri Lelli
@ 2017-06-08 13:36         ` Steven Rostedt
  2017-06-08 13:47           ` Juri Lelli
  0 siblings, 1 reply; 26+ messages in thread
From: Steven Rostedt @ 2017-06-08 13:36 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Luca Abeni, Ingo Molnar, linux-kernel, Peter Zijlstra,
	Ingo Molnar, Claudio Scordino, Tommaso Cucinotta,
	Daniel Bristot de Oliveira, Joel Fernandes, Mathieu Poirier,
	Luca Abeni

On Thu, 8 Jun 2017 10:05:55 +0100
Juri Lelli <juri.lelli@arm.com> wrote:

> On 08/06/17 10:43, Luca Abeni wrote:
> > On Thu, 8 Jun 2017 10:31:25 +0200
> > Ingo Molnar <mingo@kernel.org> wrote:
> >   
> > > * luca abeni <luca.abeni@santannapisa.it> wrote:
> > >   
> > > > From: Luca Abeni <luca.abeni@unitn.it>
> > > > 
> > > > Active utilization is defined as the total utilization of active
> > > > (TASK_RUNNING) tasks queued on a runqueue. Hence, it is increased
> > > > when a task wakes up and is decreased when a task blocks.
> > > > 
> > > > When a task is migrated from CPUi to CPUj, immediately subtract the
> > > > task's utilization from CPUi and add it to CPUj. This mechanism is
> > > > implemented by modifying the pull and push functions.
> > > > Note: this is not fully correct from the theoretical point of view
> > > > (the utilization should be removed from CPUi only at the 0 lag
> > > > time), a more theoretically sound solution is presented in the
> > > > next patches.
> > > > 
> > > > Signed-off-by: Juri Lelli <juri.lelli@arm.com>
> > > > Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
> > > > Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>    
> > > 
> > > So that SOB chain is not valid - either Juri needs to be the From:
> > > author, or it should be an Acked-by (or Reviewed-by).
> > > 
> > > For now I've converted this to:
> > >   
> > > > Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
> > > > Acked-by: Juri Lelli <juri.lelli@arm.com>    
> > 
> > Sorry, my fault: I must have misunderstood how to use the Signed-off-by
> > stuff.
> > 
> > The story here is that I took a patch originally developed by Juri, and
> > fixed and heavily modified it. Since the current patch is very
> > different from the original one, Juri suggested I should be the "From:"
> > author, and I simply added his Signed-off-by to acknowledge that he was
> > the author of the original patch.
> > 
> > If Juri is ok with your change, I agree with it.
> >   
> 
> Yep, I'm OK with Ingo's solution.
> 

Although, since the code originally came from you, a Signed-off-by is
appropriate. The SOB is a chain of where the patch came from. As Juri
actually has part ownership, Juri should have a Signed-off-by on the
patch. The problem with git is that it allows for multiple Signed-off-bys
but only one owner.

-- Steve

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 01/10] sched/deadline: track the active utilization
  2017-06-08 13:36         ` Steven Rostedt
@ 2017-06-08 13:47           ` Juri Lelli
  0 siblings, 0 replies; 26+ messages in thread
From: Juri Lelli @ 2017-06-08 13:47 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Luca Abeni, Ingo Molnar, linux-kernel, Peter Zijlstra,
	Ingo Molnar, Claudio Scordino, Tommaso Cucinotta,
	Daniel Bristot de Oliveira, Joel Fernandes, Mathieu Poirier,
	Luca Abeni

On 08/06/17 09:36, Steven Rostedt wrote:
> On Thu, 8 Jun 2017 10:05:55 +0100
> Juri Lelli <juri.lelli@arm.com> wrote:
> 
> > On 08/06/17 10:43, Luca Abeni wrote:
> > > On Thu, 8 Jun 2017 10:31:25 +0200
> > > Ingo Molnar <mingo@kernel.org> wrote:
> > >   
> > > > * luca abeni <luca.abeni@santannapisa.it> wrote:
> > > >   
> > > > > From: Luca Abeni <luca.abeni@unitn.it>
> > > > > 
> > > > > Active utilization is defined as the total utilization of active
> > > > > (TASK_RUNNING) tasks queued on a runqueue. Hence, it is increased
> > > > > when a task wakes up and is decreased when a task blocks.
> > > > > 
> > > > > When a task is migrated from CPUi to CPUj, immediately subtract the
> > > > > task's utilization from CPUi and add it to CPUj. This mechanism is
> > > > > implemented by modifying the pull and push functions.
> > > > > Note: this is not fully correct from the theoretical point of view
> > > > > (the utilization should be removed from CPUi only at the 0 lag
> > > > > time), a more theoretically sound solution is presented in the
> > > > > next patches.
> > > > > 
> > > > > Signed-off-by: Juri Lelli <juri.lelli@arm.com>
> > > > > Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
> > > > > Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>    
> > > > 
> > > > So that SOB chain is not valid - either Juri needs to be the From:
> > > > author, or it should be an Acked-by (or Reviewed-by).
> > > > 
> > > > For now I've converted this to:
> > > >   
> > > > > Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
> > > > > Acked-by: Juri Lelli <juri.lelli@arm.com>    
> > > 
> > > Sorry, my fault: I must have misunderstood how to use the Signed-off-by
> > > stuff.
> > > 
> > > The story here is that I took a patch originally developed by Juri, and
> > > fixed and heavily modified it. Since the current patch is very
> > > different from the original one, Juri suggested I should be the "From:"
> > > author, and I simply added his Signed-off-by to acknowledge that he was
> > > the author of the original patch.
> > > 
> > > If Juri is ok with your change, I agree with it.
> > >   
> > 
> > Yep, I'm OK with Ingo's solution.
> > 
> 
> Although, since the code originally came from you a Signed-off-by is
> appropriate. The SOB is a chain of where the patch came from. As Juri
> actually has part ownership, Juri should have a signed-off-by on the
> patch. The problem with git is that it allows for multiple signed off
> bys but only one owner.
> 

Right. I've also been using Co-authored-by: in some other set, but I
don't think it's actually documented anywhere. :/

Anyway, not a big deal in this particular case. :)

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 26+ messages in thread


Thread overview: 26+ messages
2017-05-18 20:13 [PATCH 00/10] CPU reclaiming for SCHED_DEADLINE luca abeni
2017-05-18 20:13 ` [PATCH 01/10] sched/deadline: track the active utilization luca abeni
2017-06-08  8:31   ` Ingo Molnar
2017-06-08  8:43     ` Luca Abeni
2017-06-08  9:05       ` Juri Lelli
2017-06-08 13:36         ` Steven Rostedt
2017-06-08 13:47           ` Juri Lelli
2017-06-08  9:23   ` [tip:sched/core] sched/deadline: Track " tip-bot for Luca Abeni
2017-05-18 20:13 ` [PATCH 02/10] sched/deadline: improve the tracking of " luca abeni
2017-06-08  9:24   ` [tip:sched/core] sched/deadline: Improve " tip-bot for Luca Abeni
2017-05-18 20:13 ` [PATCH 03/10] sched/deadline: fix the update of the total -deadline utilization luca abeni
2017-06-08  9:24   ` [tip:sched/core] sched/deadline: Fix " tip-bot for Luca Abeni
2017-05-18 20:13 ` [PATCH 04/10] sched/deadline: implement GRUB accounting luca abeni
2017-06-08  9:25   ` [tip:sched/core] sched/deadline: Implement " tip-bot for Luca Abeni
2017-05-18 20:13 ` [PATCH 05/10] sched/deadline: do not reclaim the whole CPU bandwidth luca abeni
2017-06-08  9:25   ` [tip:sched/core] sched/deadline: Do " tip-bot for Luca Abeni
2017-05-18 20:13 ` [PATCH 06/10] sched/deadline: make GRUB a task's flag luca abeni
2017-06-08  9:26   ` [tip:sched/core] sched/deadline: Make " tip-bot for Luca Abeni
2017-05-18 20:13 ` [PATCH 07/10] sched/deadline: track the "total rq utilization" too luca abeni
2017-06-08  9:26   ` [tip:sched/core] sched/deadline: Track " tip-bot for Luca Abeni
2017-05-18 20:13 ` [PATCH 08/10] sched/deadline: base GRUB reclaiming on the inactive utilization luca abeni
2017-06-08  9:27   ` [tip:sched/core] sched/deadline: Base " tip-bot for Luca Abeni
2017-05-18 20:13 ` [PATCH 09/10] sched/deadline: also reclaim bandwidth not used by dl tasks luca abeni
2017-06-08  9:28   ` [tip:sched/core] sched/deadline: Reclaim " tip-bot for Luca Abeni
2017-05-18 20:13 ` [PATCH 10/10] sched/deadline: documentation about GRUB reclaiming luca abeni
2017-06-08  9:28   ` [tip:sched/core] sched/deadline: Add " tip-bot for Claudio Scordino
