linux-kernel.vger.kernel.org archive mirror
* [RFC v3 0/6] CPU reclaiming for SCHED_DEADLINE
@ 2016-10-24 14:06 Luca Abeni
  2016-10-24 14:06 ` [RFC v3 1/6] Track the active utilisation Luca Abeni
                   ` (5 more replies)
  0 siblings, 6 replies; 45+ messages in thread
From: Luca Abeni @ 2016-10-24 14:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Claudio Scordino,
	Steven Rostedt, Luca Abeni

Hi all,

this patchset implements CPU reclaiming (using the GRUB algorithm [1])
for SCHED_DEADLINE: the feature allows SCHED_DEADLINE tasks to consume
more than their reserved runtime, up to a maximum fraction of the CPU
time (so that some spare CPU time is left for other tasks to execute),
as long as this does not break the guarantees of the other
SCHED_DEADLINE tasks.
The patchset applies on top of tip/master.


The implemented CPU reclaiming algorithm is based on tracking the
utilization U_act of the active tasks (first 2 patches) and on modifying
the runtime accounting rule (see patch 0004). The original GRUB
algorithm is modified as described in [2] to support multiple CPUs (the
original algorithm only considered a single CPU, while this one tracks
U_act per runqueue) and to leave an "unreclaimable" fraction of CPU time
to non-SCHED_DEADLINE tasks (see patch 0005: the original algorithm can
consume 100% of the CPU time, starving all the other tasks).
Patch 0003 uses the "inactive timer" introduced in patch 0002 to fix
dl_overflow() and __setparam_dl().
Patch 0006 allows CPU reclaiming to be enabled only for selected tasks.
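
To give an idea of the effect of the new accounting rule (the
reservation parameters below are purely illustrative): with the
standard CBS accounting the runtime decreases as dq = -dt, while with
GRUB it decreases as dq = -U_act dt. So, if a task with a 25ms/100ms
reservation (U = 0.25) is the only active -deadline task on its CPU,
U_act = 0.25 and each millisecond of execution consumes only 0.25ms of
runtime: the task can keep executing for (almost) the whole period
instead of being throttled after 25ms, minus the fraction of CPU time
that patch 0005 leaves to non-deadline tasks.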


Changes since v2:
in general, I tried to address all the comments I received, and to add
some more comments in "critical" parts of the code. In particular, I:
- Updated to latest tip/master. This required some changes (for example,
  using "struct rq_flags" instead of "unsigned long" for task_rq_lock())
- Merged patches 0001 and 0002, as suggested by Juri
- Added some comments about GRUB in the changelog of the patch adding
  GRUB accounting
- Exchanged the order of two patches ("Make GRUB a task's flag" and
  "Do not reclaim the whole CPU bandwidth"), as suggested by Juri
- Removed unused code ("if (task_on_rq_queued(p))" in task_dead_dl(),
  noticed by Peter) from patch 0001
- Properly consider the migrations of queued dl tasks when updating
  the active utilization. This should address Peter's concern from
  http://lkml.iu.edu/hypermail/linux/kernel/1604.0/02612.html
- Simplified the code for setting up the inactive timer, as suggested
  by Peter: http://lkml.iu.edu/hypermail/linux/kernel/1604.0/02620.html
- Use hrtimer_is_queued() instead of "(hrtimer_active() &&
  !hrtimer_callback_running())", as pointed out by Peter:
  http://lkml.iu.edu/hypermail/linux/kernel/1604.0/02805.html
- Fix select_task_rq_dl() (using "task_cpu(p) != cpu" instead
  of "rq != cpu_rq(cpu)"), as pointed out by Peter:
  http://lkml.iu.edu/hypermail/linux/kernel/1604.0/02822.html
  I also changed the logic used in select_task_rq_dl()
  (that now does not increase the active utilization in the selected
  runqueue)
- Because of the changes in the code, I am not sure if the race condition
  pointed out by Peter can still happen. I tried to trigger it in many ways,
  but I failed... If it turns out that the race is still possible, I'll fix
  it in the next round of patches, by introducing a new "is_contending" field
  (protected by pi_mutex) in the dl scheduling entity.


[1] Lipari, G., & Baruah, S. (2000). Greedy reclamation of unused bandwidth in constant-bandwidth servers. In Proceedings of the 12th Euromicro Conference on Real-Time Systems (Euromicro RTS 2000), pp. 193-200. IEEE.
[2] Abeni, L., Lelli, J., Scordino, C., & Palopoli, L. (2014, October). Greedy CPU reclaiming for SCHED DEADLINE. In Proceedings of the Real-Time Linux Workshop (RTLWS), Dusseldorf, Germany. 



Luca Abeni (6):
  Track the active utilisation
  Improve the tracking of active utilisation
  Fix the update of the total -deadline utilization
  GRUB accounting
  Do not reclaim the whole CPU bandwidth
  Make GRUB a task's flag

 include/linux/sched.h      |   1 +
 include/uapi/linux/sched.h |   1 +
 kernel/sched/core.c        |  44 ++++-----
 kernel/sched/deadline.c    | 220 ++++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h       |  13 +++
 5 files changed, 234 insertions(+), 45 deletions(-)

-- 
2.7.4


* [RFC v3 1/6] Track the active utilisation
  2016-10-24 14:06 [RFC v3 0/6] CPU reclaiming for SCHED_DEADLINE Luca Abeni
@ 2016-10-24 14:06 ` Luca Abeni
  2016-10-25  9:09   ` Daniel Bristot de Oliveira
                     ` (2 more replies)
  2016-10-24 14:06 ` [RFC v3 2/6] Improve the tracking of " Luca Abeni
                   ` (4 subsequent siblings)
  5 siblings, 3 replies; 45+ messages in thread
From: Luca Abeni @ 2016-10-24 14:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Claudio Scordino,
	Steven Rostedt, Luca Abeni

The active utilisation here is defined as the total utilisation of the
active (TASK_RUNNING) tasks queued on a runqueue. Hence, it is increased
when a task wakes up and is decreased when a task blocks.

When a task is migrated from CPUi to CPUj, immediately subtract the task's
utilisation from CPUi and add it to CPUj. This mechanism is implemented by
modifying the pull and push functions.
Note: this is not fully correct from the theoretical point of view
(the utilisation should be removed from CPUi only at the 0 lag time),
but doing the right thing would be _MUCH_ more complex (leaving the
timer armed when the task is on a different CPU... Inactive timers should
be moved from per-task timers to per-runqueue lists of timers! Bah...)

The utilisation tracking mechanism implemented in this commit can be
fixed / improved by decreasing the active utilisation at the so-called
"0-lag time" instead of when the task blocks.

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
---
 kernel/sched/deadline.c | 39 ++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h    |  6 ++++++
 2 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 37e2449..3d95c1d 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -43,6 +43,22 @@ static inline int on_dl_rq(struct sched_dl_entity *dl_se)
 	return !RB_EMPTY_NODE(&dl_se->rb_node);
 }
 
+static void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	u64 se_bw = dl_se->dl_bw;
+
+	dl_rq->running_bw += se_bw;
+}
+
+static void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	u64 se_bw = dl_se->dl_bw;
+
+	dl_rq->running_bw -= se_bw;
+	if (WARN_ON(dl_rq->running_bw < 0))
+		dl_rq->running_bw = 0;
+}
+
 static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
 {
 	struct sched_dl_entity *dl_se = &p->dl;
@@ -498,6 +514,8 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
 
+	add_running_bw(dl_se, dl_rq);
+
 	if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
 	    dl_entity_overflow(dl_se, pi_se, rq_clock(rq))) {
 		dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
@@ -947,14 +965,19 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 		return;
 	}
 
+	if (p->on_rq == TASK_ON_RQ_MIGRATING)
+		add_running_bw(&p->dl, &rq->dl);
+
 	/*
 	 * If p is throttled, we do nothing. In fact, if it exhausted
 	 * its budget it needs a replenishment and, since it now is on
 	 * its rq, the bandwidth timer callback (which clearly has not
 	 * run yet) will take care of this.
 	 */
-	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH))
+	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
+		add_running_bw(&p->dl, &rq->dl);
 		return;
+	}
 
 	enqueue_dl_entity(&p->dl, pi_se, flags);
 
@@ -972,6 +995,12 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	update_curr_dl(rq);
 	__dequeue_task_dl(rq, p, flags);
+
+	if (p->on_rq == TASK_ON_RQ_MIGRATING)
+		sub_running_bw(&p->dl, &rq->dl);
+
+	if (flags & DEQUEUE_SLEEP)
+		sub_running_bw(&p->dl, &rq->dl);
 }
 
 /*
@@ -1501,7 +1530,9 @@ static int push_dl_task(struct rq *rq)
 	}
 
 	deactivate_task(rq, next_task, 0);
+	sub_running_bw(&next_task->dl, &rq->dl);
 	set_task_cpu(next_task, later_rq->cpu);
+	add_running_bw(&next_task->dl, &later_rq->dl);
 	activate_task(later_rq, next_task, 0);
 	ret = 1;
 
@@ -1589,7 +1620,9 @@ static void pull_dl_task(struct rq *this_rq)
 			resched = true;
 
 			deactivate_task(src_rq, p, 0);
+			sub_running_bw(&p->dl, &src_rq->dl);
 			set_task_cpu(p, this_cpu);
+			add_running_bw(&p->dl, &this_rq->dl);
 			activate_task(this_rq, p, 0);
 			dmin = p->dl.deadline;
 
@@ -1695,6 +1728,9 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
 	if (!start_dl_timer(p))
 		__dl_clear_params(p);
 
+	if (task_on_rq_queued(p))
+		sub_running_bw(&p->dl, &rq->dl);
+
 	/*
 	 * Since this might be the only -deadline task on the rq,
 	 * this is the right place to try to pull some other one
@@ -1712,6 +1748,7 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
  */
 static void switched_to_dl(struct rq *rq, struct task_struct *p)
 {
+	add_running_bw(&p->dl, &rq->dl);
 
 	/* If p is not queued we will update its parameters at next wakeup. */
 	if (!task_on_rq_queued(p))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 055f935..3a36c74 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -535,6 +535,12 @@ struct dl_rq {
 #else
 	struct dl_bw dl_bw;
 #endif
+	/*
+	 * "Active utilization" for this runqueue: increased when a
+	 * task wakes up (becomes TASK_RUNNING) and decreased when a
+	 * task blocks
+	 */
+	s64 running_bw;
 };
 
 #ifdef CONFIG_SMP
-- 
2.7.4


* [RFC v3 2/6] Improve the tracking of active utilisation
  2016-10-24 14:06 [RFC v3 0/6] CPU reclaiming for SCHED_DEADLINE Luca Abeni
  2016-10-24 14:06 ` [RFC v3 1/6] Track the active utilisation Luca Abeni
@ 2016-10-24 14:06 ` Luca Abeni
  2016-11-01 16:46   ` Juri Lelli
                     ` (3 more replies)
  2016-10-24 14:06 ` [RFC v3 3/6] Fix the update of the total -deadline utilization Luca Abeni
                   ` (3 subsequent siblings)
  5 siblings, 4 replies; 45+ messages in thread
From: Luca Abeni @ 2016-10-24 14:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Claudio Scordino,
	Steven Rostedt, Luca Abeni

This patch implements a more theoretically sound algorithm for
thracking the active utilisation: instead of decreasing it when a
task blocks, use a timer (the "inactive timer", named after the
"Inactive" task state of the GRUB algorithm) to decrease the
active utilisaation at the so called "0-lag time".
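
For reference, the 0-lag time used by the new timer is computed (as in
task_go_inactive() below) as

	t_0lag = deadline - remaining_runtime * dl_period / dl_runtime

To give a purely illustrative example (the numbers are made up): a task
with dl_runtime = 10ms and dl_period = 100ms that blocks with 4ms of
runtime left and with its current deadline 60ms in the future has its
0-lag point 60ms - 4ms * (100 / 10) = 20ms from now; the "inactive
timer" is armed to fire at that instant, and the active utilisation is
decreased when it fires (or immediately, if the 0-lag time has already
passed).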

Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
---
 include/linux/sched.h   |   1 +
 kernel/sched/core.c     |   1 +
 kernel/sched/deadline.c | 139 ++++++++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h    |   1 +
 4 files changed, 126 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 348f51b..22543c6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1433,6 +1433,7 @@ struct sched_dl_entity {
 	 * own bandwidth to be enforced, thus we need one timer per task.
 	 */
 	struct hrtimer dl_timer;
+	struct hrtimer inactive_timer;
 };
 
 union rcu_special {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 94732d1..664c618 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2217,6 +2217,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 
 	RB_CLEAR_NODE(&p->dl.rb_node);
 	init_dl_task_timer(&p->dl);
+	init_inactive_task_timer(&p->dl);
 	__dl_clear_params(p);
 
 	INIT_LIST_HEAD(&p->rt.run_list);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 3d95c1d..80d1541 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -47,6 +47,7 @@ static void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 {
 	u64 se_bw = dl_se->dl_bw;
 
+	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
 	dl_rq->running_bw += se_bw;
 }
 
@@ -54,11 +55,52 @@ static void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 {
 	u64 se_bw = dl_se->dl_bw;
 
+	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
 	dl_rq->running_bw -= se_bw;
 	if (WARN_ON(dl_rq->running_bw < 0))
 		dl_rq->running_bw = 0;
 }
 
+static void task_go_inactive(struct task_struct *p)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+	struct hrtimer *timer = &dl_se->inactive_timer;
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+	s64 zerolag_time;
+
+	WARN_ON(dl_se->dl_runtime == 0);
+
+	/* If the inactive timer is already armed, return immediately */
+	if (hrtimer_active(&dl_se->inactive_timer))
+		return;
+
+	zerolag_time = dl_se->deadline -
+		 div64_long((dl_se->runtime * dl_se->dl_period),
+			dl_se->dl_runtime);
+
+	/*
+	 * Using relative times instead of the absolute "0-lag time"
+	 * simplifies the code
+	 */
+	zerolag_time -= rq_clock(rq);
+
+	/*
+	 * If the "0-lag time" already passed, decrease the active
+	 * utilization now, instead of starting a timer
+	 */
+	if (zerolag_time < 0) {
+		sub_running_bw(dl_se, dl_rq);
+		if (!dl_task(p))
+			__dl_clear_params(p);
+
+		return;
+	}
+
+	get_task_struct(p);
+	hrtimer_start(timer, ns_to_ktime(zerolag_time), HRTIMER_MODE_REL);
+}
+
 static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
 {
 	struct sched_dl_entity *dl_se = &p->dl;
@@ -514,7 +556,20 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
 
-	add_running_bw(dl_se, dl_rq);
+	if (hrtimer_is_queued(&dl_se->inactive_timer)) {
+		hrtimer_try_to_cancel(&dl_se->inactive_timer);
+		WARN_ON(dl_task_of(dl_se)->nr_cpus_allowed > 1);
+	} else {
+		/*
+		 * The "inactive timer" has been cancelled in
+		 * select_task_rq_dl() (and the active utilisation has
+		 * been decreased). So, increase the active utilisation.
+		 * If select_task_rq_dl() could not cancel the timer,
+		 * inactive_task_timer() will find the task state as
+		 * TASK_RUNNING, and will do nothing, so we are still safe.
+		 */
+		add_running_bw(dl_se, dl_rq);
+	}
 
 	if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
 	    dl_entity_overflow(dl_se, pi_se, rq_clock(rq))) {
@@ -602,14 +657,8 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 
 	rq = task_rq_lock(p, &rf);
 
-	/*
-	 * The task might have changed its scheduling policy to something
-	 * different than SCHED_DEADLINE (through switched_fromd_dl()).
-	 */
-	if (!dl_task(p)) {
-		__dl_clear_params(p);
+	if (!dl_task(p))
 		goto unlock;
-	}
 
 	/*
 	 * The task might have been boosted by someone else and might be in the
@@ -796,6 +845,44 @@ static void update_curr_dl(struct rq *rq)
 	}
 }
 
+static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
+{
+	struct sched_dl_entity *dl_se = container_of(timer,
+						     struct sched_dl_entity,
+						     inactive_timer);
+	struct task_struct *p = dl_task_of(dl_se);
+	struct rq_flags rf;
+	struct rq *rq;
+
+	rq = task_rq_lock(p, &rf);
+
+	if (!dl_task(p)) {
+		__dl_clear_params(p);
+
+		goto unlock;
+	}
+	if (p->state == TASK_RUNNING)
+		goto unlock;
+
+	sched_clock_tick();
+	update_rq_clock(rq);
+
+	sub_running_bw(dl_se, &rq->dl);
+unlock:
+	task_rq_unlock(rq, p, &rf);
+	put_task_struct(p);
+
+	return HRTIMER_NORESTART;
+}
+
+void init_inactive_task_timer(struct sched_dl_entity *dl_se)
+{
+	struct hrtimer *timer = &dl_se->inactive_timer;
+
+	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	timer->function = inactive_task_timer;
+}
+
 #ifdef CONFIG_SMP
 
 static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
@@ -1000,7 +1087,7 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 		sub_running_bw(&p->dl, &rq->dl);
 
 	if (flags & DEQUEUE_SLEEP)
-		sub_running_bw(&p->dl, &rq->dl);
+		task_go_inactive(p);
 }
 
 /*
@@ -1074,6 +1161,14 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
 	}
 	rcu_read_unlock();
 
+	rq = task_rq(p);
+	raw_spin_lock(&rq->lock);
+	if (hrtimer_active(&p->dl.inactive_timer)) {
+		sub_running_bw(&p->dl, &rq->dl);
+		hrtimer_try_to_cancel(&p->dl.inactive_timer);
+	}
+	raw_spin_unlock(&rq->lock);
+
 out:
 	return cpu;
 }
@@ -1244,6 +1339,11 @@ static void task_dead_dl(struct task_struct *p)
 	/* XXX we should retain the bw until 0-lag */
 	dl_b->total_bw -= p->dl.dl_bw;
 	raw_spin_unlock_irq(&dl_b->lock);
+	if (hrtimer_active(&p->dl.inactive_timer)) {
+		raw_spin_lock_irq(&task_rq(p)->lock);
+		sub_running_bw(&p->dl, dl_rq_of_se(&p->dl));
+		raw_spin_unlock_irq(&task_rq(p)->lock);
+	}
 }
 
 static void set_curr_task_dl(struct rq *rq)
@@ -1720,15 +1820,22 @@ void __init init_sched_dl_class(void)
 static void switched_from_dl(struct rq *rq, struct task_struct *p)
 {
 	/*
-	 * Start the deadline timer; if we switch back to dl before this we'll
-	 * continue consuming our current CBS slice. If we stay outside of
-	 * SCHED_DEADLINE until the deadline passes, the timer will reset the
-	 * task.
+	 * task_go_inactive() can start the "inactive timer" (if the 0-lag
+	 * time is in the future). If the task switches back to dl before
+	 * the "inactive timer" fires, it can continue to consume its current
+	 * runtime using its current deadline. If it stays outside of
+	 * SCHED_DEADLINE until the 0-lag time passes, inactive_task_timer()
+	 * will reset the task parameters.
 	 */
-	if (!start_dl_timer(p))
-		__dl_clear_params(p);
+	if (task_on_rq_queued(p) && p->dl.dl_runtime)
+		task_go_inactive(p);
 
-	if (task_on_rq_queued(p))
+	/*
+	 * We cannot use inactive_task_timer() to invoke sub_running_bw()
+	 * at the 0-lag time, because the task could have been migrated
+	 * while SCHED_OTHER in the meanwhile.
+	 */
+	if (hrtimer_is_queued(&p->dl.inactive_timer))
 		sub_running_bw(&p->dl, &rq->dl);
 
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3a36c74..e82c419 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1358,6 +1358,7 @@ extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime
 extern struct dl_bandwidth def_dl_bandwidth;
 extern void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime);
 extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
+extern void init_inactive_task_timer(struct sched_dl_entity *dl_se);
 
 unsigned long to_ratio(u64 period, u64 runtime);
 
-- 
2.7.4


* [RFC v3 3/6] Fix the update of the total -deadline utilization
  2016-10-24 14:06 [RFC v3 0/6] CPU reclaiming for SCHED_DEADLINE Luca Abeni
  2016-10-24 14:06 ` [RFC v3 1/6] Track the active utilisation Luca Abeni
  2016-10-24 14:06 ` [RFC v3 2/6] Improve the tracking of " Luca Abeni
@ 2016-10-24 14:06 ` Luca Abeni
  2016-10-24 14:06 ` [RFC v3 4/6] GRUB accounting Luca Abeni
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 45+ messages in thread
From: Luca Abeni @ 2016-10-24 14:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Claudio Scordino,
	Steven Rostedt, Luca Abeni

Now that the inactive timer can be armed to fire at the 0-lag time,
it is possible to use inactive_task_timer() to update the total
-deadline utilization (dl_b->total_bw) at the correct time, fixing
dl_overflow() and __setparam_dl().
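
To illustrate the problem being fixed (with made-up numbers): when a
task with utilization 0.5 leaves SCHED_DEADLINE, the old code returned
the 0.5 to dl_b->total_bw immediately, so another task could pass
admission control right away even though the leaving task may still
legitimately consume its remaining runtime until its 0-lag time (say,
30ms later), causing a temporary overload. With this patch the
bandwidth is released only when inactive_task_timer() fires at the
0-lag time (or immediately, if that time has already passed).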

Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
---
 kernel/sched/core.c     | 36 ++++++++++++------------------------
 kernel/sched/deadline.c | 34 +++++++++++++++++++++++++---------
 2 files changed, 37 insertions(+), 33 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 664c618..337a5f0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2507,9 +2507,6 @@ static inline int dl_bw_cpus(int i)
  * allocated bandwidth to reflect the new situation.
  *
  * This function is called while holding p's rq->lock.
- *
- * XXX we should delay bw change until the task's 0-lag point, see
- * __setparam_dl().
  */
 static int dl_overflow(struct task_struct *p, int policy,
 		       const struct sched_attr *attr)
@@ -2538,11 +2535,22 @@ static int dl_overflow(struct task_struct *p, int policy,
 		err = 0;
 	} else if (dl_policy(policy) && task_has_dl_policy(p) &&
 		   !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
+		/*
+		 * XXX this is slightly incorrect: when the task
+		 * utilization decreases, we should delay the total
+		 * utilization change until the task's 0-lag point.
+		 * But this would require setting the task's "inactive
+		 * timer" when the task is not inactive.
+		 */
 		__dl_clear(dl_b, p->dl.dl_bw);
 		__dl_add(dl_b, new_bw);
 		err = 0;
 	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
-		__dl_clear(dl_b, p->dl.dl_bw);
+		/*
+		 * Do not decrease the total deadline utilization here;
+		 * switched_from_dl() will take care of doing it at the correct
+		 * (0-lag) time.
+		 */
 		err = 0;
 	}
 	raw_spin_unlock(&dl_b->lock);
@@ -3912,26 +3920,6 @@ __setparam_dl(struct task_struct *p, const struct sched_attr *attr)
 	dl_se->dl_period = attr->sched_period ?: dl_se->dl_deadline;
 	dl_se->flags = attr->sched_flags;
 	dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
-
-	/*
-	 * Changing the parameters of a task is 'tricky' and we're not doing
-	 * the correct thing -- also see task_dead_dl() and switched_from_dl().
-	 *
-	 * What we SHOULD do is delay the bandwidth release until the 0-lag
-	 * point. This would include retaining the task_struct until that time
-	 * and change dl_overflow() to not immediately decrement the current
-	 * amount.
-	 *
-	 * Instead we retain the current runtime/deadline and let the new
-	 * parameters take effect after the current reservation period lapses.
-	 * This is safe (albeit pessimistic) because the 0-lag point is always
-	 * before the current scheduling deadline.
-	 *
-	 * We can still have temporary overloads because we do not delay the
-	 * change in bandwidth until that time; so admission control is
-	 * not on the safe side. It does however guarantee tasks will never
-	 * consume more than promised.
-	 */
 }
 
 /*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 80d1541..4d3545b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -91,8 +91,14 @@ static void task_go_inactive(struct task_struct *p)
 	 */
 	if (zerolag_time < 0) {
 		sub_running_bw(dl_se, dl_rq);
-		if (!dl_task(p))
+		if (!dl_task(p)) {
+			struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+
+			raw_spin_lock(&dl_b->lock);
+			__dl_clear(dl_b, p->dl.dl_bw);
 			__dl_clear_params(p);
+			raw_spin_unlock(&dl_b->lock);
+		}
 
 		return;
 	}
@@ -856,8 +862,13 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
 
 	rq = task_rq_lock(p, &rf);
 
-	if (!dl_task(p)) {
+	if (!dl_task(p) || p->state == TASK_DEAD) {
+		struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+
+		raw_spin_lock(&dl_b->lock);
+		__dl_clear(dl_b, p->dl.dl_bw);
 		__dl_clear_params(p);
+		raw_spin_unlock(&dl_b->lock);
 
 		goto unlock;
 	}
@@ -1330,16 +1341,21 @@ static void task_fork_dl(struct task_struct *p)
 
 static void task_dead_dl(struct task_struct *p)
 {
-	struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
-
 	/*
 	 * Since we are TASK_DEAD we won't slip out of the domain!
 	 */
-	raw_spin_lock_irq(&dl_b->lock);
-	/* XXX we should retain the bw until 0-lag */
-	dl_b->total_bw -= p->dl.dl_bw;
-	raw_spin_unlock_irq(&dl_b->lock);
-	if (hrtimer_active(&p->dl.inactive_timer)) {
+	if (!hrtimer_active(&p->dl.inactive_timer)) {
+		struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+
+		/*
+		 * If the "inactive timer" is not active, the 0-lag time
+		 * has already passed, so we immediately decrease the
+		 * total deadline utilization
+		 */
+		raw_spin_lock_irq(&dl_b->lock);
+		__dl_clear(dl_b, p->dl.dl_bw);
+		raw_spin_unlock_irq(&dl_b->lock);
+	} else {
 		raw_spin_lock_irq(&task_rq(p)->lock);
 		sub_running_bw(&p->dl, dl_rq_of_se(&p->dl));
 		raw_spin_unlock_irq(&task_rq(p)->lock);
-- 
2.7.4


* [RFC v3 4/6] GRUB accounting
  2016-10-24 14:06 [RFC v3 0/6] CPU reclaiming for SCHED_DEADLINE Luca Abeni
                   ` (2 preceding siblings ...)
  2016-10-24 14:06 ` [RFC v3 3/6] Fix the update of the total -deadline utilization Luca Abeni
@ 2016-10-24 14:06 ` Luca Abeni
  2016-10-24 14:06 ` [RFC v3 5/6] Do not reclaim the whole CPU bandwidth Luca Abeni
  2016-10-24 14:06 ` [RFC v3 6/6] Make GRUB a task's flag Luca Abeni
  5 siblings, 0 replies; 45+ messages in thread
From: Luca Abeni @ 2016-10-24 14:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Claudio Scordino,
	Steven Rostedt, Luca Abeni

According to the GRUB (Greedy Reclamation of Unused Bandwidth) reclaiming
algorithm, the runtime is not decreased as "dq = -dt", but as
"dq = -Uact dt" (where Uact is the per-runqueue active utilization).
Hence, this commit modifies the runtime accounting rule in update_curr_dl()
to implement the GRUB rule.
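
As a concrete (purely illustrative) instance of the fixed-point
arithmetic used by grub_reclaim() below: running_bw stores Uact scaled
by 2^20 (the same scale used by to_ratio() for dl_bw), so with
Uact = 0.5 the field holds 524288 and 1ms of measured execution time
is accounted as

	(1000000 * 524288) >> 20 = 500000 ns

i.e. the running task is charged only half of the elapsed time.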

Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
---
 kernel/sched/deadline.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 4d3545b..fa10fcd 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -771,6 +771,19 @@ int dl_runtime_exceeded(struct sched_dl_entity *dl_se)
 extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
 
 /*
+ * This function implements the GRUB accounting rule:
+ * according to the GRUB reclaiming algorithm, the runtime is
+ * not decreased as "dq = -dt", but as "dq = -Uact dt", where
+ * Uact is the (per-runqueue) active utilization.
+ * Since rq->dl.running_bw contains Uact * 2^20, the result
+ * has to be shifted right by 20.
+ */
+u64 grub_reclaim(u64 delta, struct rq *rq)
+{
+	return (delta * rq->dl.running_bw) >> 20;
+}
+
+/*
  * Update the current task's runtime statistics (provided it is still
  * a -deadline task and has not been removed from the dl_rq).
  */
@@ -812,6 +825,7 @@ static void update_curr_dl(struct rq *rq)
 
 	sched_rt_avg_update(rq, delta_exec);
 
+	delta_exec = grub_reclaim(delta_exec, rq);
 	dl_se->runtime -= delta_exec;
 
 throttle:
-- 
2.7.4


* [RFC v3 5/6] Do not reclaim the whole CPU bandwidth
  2016-10-24 14:06 [RFC v3 0/6] CPU reclaiming for SCHED_DEADLINE Luca Abeni
                   ` (3 preceding siblings ...)
  2016-10-24 14:06 ` [RFC v3 4/6] GRUB accounting Luca Abeni
@ 2016-10-24 14:06 ` Luca Abeni
  2016-10-24 14:06 ` [RFC v3 6/6] Make GRUB a task's flag Luca Abeni
  5 siblings, 0 replies; 45+ messages in thread
From: Luca Abeni @ 2016-10-24 14:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Claudio Scordino,
	Steven Rostedt, Luca Abeni

The original GRUB algorithm tends to reclaim 100% of the CPU time, and
this allows a CPU hog to starve non-deadline tasks.
To address this issue, allow the scheduler to reclaim only a specified
fraction of the CPU time.
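
With the default global limits (sched_rt_runtime_us = 950000,
sched_rt_period_us = 1000000, i.e. a 0.95 ratio) this gives, for
example, non_deadline_bw = (1 << 20) - to_ratio(period, runtime) =
1048576 - 996147 = 52429, about 5% of the fixed-point scale:
grub_reclaim() then charges the running task at least ~5% of the
elapsed wall-clock time even when running_bw is close to zero, so
-deadline tasks always leave roughly 5% (or more) of the CPU time to
non-deadline tasks.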

Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
---
 kernel/sched/core.c     | 4 ++++
 kernel/sched/deadline.c | 7 ++++++-
 kernel/sched/sched.h    | 6 ++++++
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 337a5f0..6706b5f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8256,6 +8256,10 @@ static void sched_dl_do_global(void)
 		raw_spin_unlock_irqrestore(&dl_b->lock, flags);
 
 		rcu_read_unlock_sched();
+		if (dl_b->bw == -1)
+			cpu_rq(cpu)->dl.non_deadline_bw = 0;
+		else
+			cpu_rq(cpu)->dl.non_deadline_bw = (1 << 20) - new_bw;
 	}
 }
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index fa10fcd..cab76e8 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -147,6 +147,11 @@ void init_dl_rq(struct dl_rq *dl_rq)
 #else
 	init_dl_bw(&dl_rq->dl_bw);
 #endif
+	if (global_rt_runtime() == RUNTIME_INF)
+		dl_rq->non_deadline_bw = 0;
+	else
+		dl_rq->non_deadline_bw = (1 << 20) -
+			to_ratio(global_rt_period(), global_rt_runtime());
 }
 
 #ifdef CONFIG_SMP
@@ -780,7 +785,7 @@ extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
  */
 u64 grub_reclaim(u64 delta, struct rq *rq)
 {
-	return (delta * rq->dl.running_bw) >> 20;
+	return (delta * (rq->dl.non_deadline_bw + rq->dl.running_bw)) >> 20;
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e82c419..2748431 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -541,6 +541,12 @@ struct dl_rq {
 	 * task blocks
 	 */
 	s64 running_bw;
+
+	/*
+	 * Fraction of the CPU utilization that cannot be reclaimed
+	 * by the GRUB algorithm.
+	 */
+	s64 non_deadline_bw;
 };
 
 #ifdef CONFIG_SMP
-- 
2.7.4


* [RFC v3 6/6] Make GRUB a task's flag
  2016-10-24 14:06 [RFC v3 0/6] CPU reclaiming for SCHED_DEADLINE Luca Abeni
                   ` (4 preceding siblings ...)
  2016-10-24 14:06 ` [RFC v3 5/6] Do not reclaim the whole CPU bandwidth Luca Abeni
@ 2016-10-24 14:06 ` Luca Abeni
  5 siblings, 0 replies; 45+ messages in thread
From: Luca Abeni @ 2016-10-24 14:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Claudio Scordino,
	Steven Rostedt, Luca Abeni
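
Illustrative userspace usage (not part of this patch): a task can opt
in to reclaiming by passing SCHED_FLAG_RECLAIM together with its
SCHED_DEADLINE parameters to sched_setattr(). The struct sched_attr
layout and the raw syscall invocation below follow the sched_setattr(2)
conventions (there is no glibc wrapper); the reservation values are
made up.

#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/sched.h>	/* SCHED_DEADLINE, SCHED_FLAG_* */

#ifndef SCHED_FLAG_RECLAIM
#define SCHED_FLAG_RECLAIM	0x02
#endif

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
};

static int set_deadline_with_reclaiming(void)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_policy	= SCHED_DEADLINE,
		.sched_flags	= SCHED_FLAG_RECLAIM,
		.sched_runtime	=  10 * 1000 * 1000,	/*  10ms */
		.sched_deadline	= 100 * 1000 * 1000,	/* 100ms */
		.sched_period	= 100 * 1000 * 1000,	/* 100ms */
	};

	/* 0 == current thread */
	return syscall(SYS_sched_setattr, 0, &attr, 0);
}
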

Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
---
 include/uapi/linux/sched.h | 1 +
 kernel/sched/core.c        | 3 ++-
 kernel/sched/deadline.c    | 3 ++-
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 5f0fe01..e2a6c7b 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -47,5 +47,6 @@
  * For the sched_{set,get}attr() calls
  */
 #define SCHED_FLAG_RESET_ON_FORK	0x01
+#define SCHED_FLAG_RECLAIM		0x02
 
 #endif /* _UAPI_LINUX_SCHED_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6706b5f..421a315 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4087,7 +4087,8 @@ static int __sched_setscheduler(struct task_struct *p,
 			return -EINVAL;
 	}
 
-	if (attr->sched_flags & ~(SCHED_FLAG_RESET_ON_FORK))
+	if (attr->sched_flags &
+		~(SCHED_FLAG_RESET_ON_FORK | SCHED_FLAG_RECLAIM))
 		return -EINVAL;
 
 	/*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index cab76e8..8c4c0e0 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -830,7 +830,8 @@ static void update_curr_dl(struct rq *rq)
 
 	sched_rt_avg_update(rq, delta_exec);
 
-	delta_exec = grub_reclaim(delta_exec, rq);
+	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
+		delta_exec = grub_reclaim(delta_exec, rq);
 	dl_se->runtime -= delta_exec;
 
 throttle:
-- 
2.7.4


* Re: [RFC v3 1/6] Track the active utilisation
  2016-10-24 14:06 ` [RFC v3 1/6] Track the active utilisation Luca Abeni
@ 2016-10-25  9:09   ` Daniel Bristot de Oliveira
  2016-10-25  9:29     ` luca abeni
  2016-11-01 16:45   ` Juri Lelli
  2016-11-18 13:55   ` Peter Zijlstra
  2 siblings, 1 reply; 45+ messages in thread
From: Daniel Bristot de Oliveira @ 2016-10-25  9:09 UTC (permalink / raw)
  To: Luca Abeni, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Claudio Scordino,
	Steven Rostedt

On 24/10/2016 16:06, Luca Abeni wrote:
> The active utilisation here is defined as the total utilisation of the
> active (TASK_RUNNING) tasks queued on a runqueue. Hence, it is increased
> when a task wakes up and is decreased when a task blocks.
> 
> When a task is migrated from CPUi to CPUj, immediately subtract the task's
> utilisation from CPUi and add it to CPUj. This mechanism is implemented by
> modifying the pull and push functions.
> Note: this is not fully correct from the theoretical point of view
> (the utilisation should be removed from CPUi only at the 0 lag time),
> but doing the right thing would be _MUCH_ more complex (leaving the
> timer armed when the task is on a different CPU... Inactive timers should
> be moved from per-task timers to per-runqueue lists of timers! Bah...)
> 
> The utilisation tracking mechanism implemented in this commit can be
> fixed / improved by decreasing the active utilisation at the so-called
> "0-lag time" instead of when the task blocks.
> 
> Signed-off-by: Juri Lelli <juri.lelli@arm.com>
> Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
> ---
>  kernel/sched/deadline.c | 39 ++++++++++++++++++++++++++++++++++++++-
>  kernel/sched/sched.h    |  6 ++++++
>  2 files changed, 44 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 37e2449..3d95c1d 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -43,6 +43,22 @@ static inline int on_dl_rq(struct sched_dl_entity *dl_se)
>  	return !RB_EMPTY_NODE(&dl_se->rb_node);
>  }
>  
> +static void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> +{
> +	u64 se_bw = dl_se->dl_bw;
> +
> +	dl_rq->running_bw += se_bw;
> +}

why not...

static *inline*
void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
{
	dl_rq->running_bw += dl_se->dl_bw;
}

am I missing something?

> +static void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> +{
> +	u64 se_bw = dl_se->dl_bw;
> +
> +	dl_rq->running_bw -= se_bw;
> +	if (WARN_ON(dl_rq->running_bw < 0))
> +		dl_rq->running_bw = 0;
> +}

(if I am not missing anything...)

the same in the above function: use inline and remove the se_bw variable.

-- Daniel


* Re: [RFC v3 1/6] Track the active utilisation
  2016-10-25  9:09   ` Daniel Bristot de Oliveira
@ 2016-10-25  9:29     ` luca abeni
  2016-10-25 13:58       ` Steven Rostedt
  0 siblings, 1 reply; 45+ messages in thread
From: luca abeni @ 2016-10-25  9:29 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Claudio Scordino, Steven Rostedt

Hi Daniel,

On Tue, 25 Oct 2016 11:09:52 +0200
Daniel Bristot de Oliveira <bristot@redhat.com> wrote:
[...]
> > +static void add_running_bw(struct sched_dl_entity *dl_se, struct
> > dl_rq *dl_rq) +{
> > +	u64 se_bw = dl_se->dl_bw;
> > +
> > +	dl_rq->running_bw += se_bw;
> > +}  
> 
> why not...
> 
> static *inline*
> void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq
> *dl_rq) {
> 	dl_rq->running_bw += dl_se->dl_bw;
> }
> 
> am I missing something?

I do not know... Maybe I am the one missing something :)
I assumed that the compiler is smart enough to inline the function (and
to avoid creating a local variable on the stack), but if there is
agreement I can change the function in this way.



			Thanks,
				Luca


> 
> > +static void sub_running_bw(struct sched_dl_entity *dl_se, struct
> > dl_rq *dl_rq) +{
> > +	u64 se_bw = dl_se->dl_bw;
> > +
> > +	dl_rq->running_bw -= se_bw;
> > +	if (WARN_ON(dl_rq->running_bw < 0))
> > +		dl_rq->running_bw = 0;
> > +}  
> 
> (if I am not missing anything...)
> 
> the same in the above function: use inline and remove the se_bw
> variable.
> 
> -- Daniel


* Re: [RFC v3 1/6] Track the active utilisation
  2016-10-25  9:29     ` luca abeni
@ 2016-10-25 13:58       ` Steven Rostedt
  2016-10-25 18:04         ` Luca Abeni
  2016-11-18 14:23         ` Peter Zijlstra
  0 siblings, 2 replies; 45+ messages in thread
From: Steven Rostedt @ 2016-10-25 13:58 UTC (permalink / raw)
  To: luca abeni
  Cc: Daniel Bristot de Oliveira, linux-kernel, Peter Zijlstra,
	Ingo Molnar, Juri Lelli, Claudio Scordino

On Tue, 25 Oct 2016 11:29:16 +0200
luca abeni <luca.abeni@unitn.it> wrote:

> Hi Daniel,
> 
> On Tue, 25 Oct 2016 11:09:52 +0200
> Daniel Bristot de Oliveira <bristot@redhat.com> wrote:
> [...]
> > > +static void add_running_bw(struct sched_dl_entity *dl_se, struct
> > > dl_rq *dl_rq) +{
> > > +	u64 se_bw = dl_se->dl_bw;
> > > +
> > > +	dl_rq->running_bw += se_bw;
> > > +}    
> > 
> > why not...
> > 
> > static *inline*
> > void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq
> > *dl_rq) {
> > 	dl_rq->running_bw += dl_se->dl_bw;
> > }
> > 
> > am I missing something?  
> 
> I do not know... Maybe I am the one missing something :)
> I assumed that the compiler is smart enough to inline the function (and
> to avoid creating a local variable on the stack), but if there is
> agreement I can change the function in this way.
> 
> 

I agree with Daniel, especially since I don't usually trust the
compiler. And the added variable is more of a distraction as it doesn't
seem to have any real purpose.

-- Steve


* Re: [RFC v3 1/6] Track the active utilisation
  2016-10-25 13:58       ` Steven Rostedt
@ 2016-10-25 18:04         ` Luca Abeni
  2016-11-18 14:23         ` Peter Zijlstra
  1 sibling, 0 replies; 45+ messages in thread
From: Luca Abeni @ 2016-10-25 18:04 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Daniel Bristot de Oliveira, linux-kernel, Peter Zijlstra,
	Ingo Molnar, Juri Lelli, Claudio Scordino

On Tue, 25 Oct 2016 09:58:11 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 25 Oct 2016 11:29:16 +0200
> luca abeni <luca.abeni@unitn.it> wrote:
> 
> > Hi Daniel,
> > 
> > On Tue, 25 Oct 2016 11:09:52 +0200
> > Daniel Bristot de Oliveira <bristot@redhat.com> wrote:
> > [...]
> > > > +static void add_running_bw(struct sched_dl_entity *dl_se,
> > > > struct dl_rq *dl_rq) +{
> > > > +	u64 se_bw = dl_se->dl_bw;
> > > > +
> > > > +	dl_rq->running_bw += se_bw;
> > > > +}    
> > > 
> > > why not...
> > > 
> > > static *inline*
> > > void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq
> > > *dl_rq) {
> > > 	dl_rq->running_bw += dl_se->dl_bw;
> > > }
> > > 
> > > am I missing something?  
> > 
> > I do not know... Maybe I am the one missing something :)
> > I assumed that the compiler is smart enough to inline the function
> > (and to avoid creating a local variable on the stack), but if there
> > is agreement I can change the function in this way.
> > 
> > 
> 
> I agree with Daniel, especially since I don't usually trust the
> compiler. And the added variable is more of a distraction as it
> doesn't seem to have any real purpose.

Ok, then; I'll fix this in the next round of patches.


		Thanks,
			Luca


* Re: [RFC v3 1/6] Track the active utilisation
  2016-10-24 14:06 ` [RFC v3 1/6] Track the active utilisation Luca Abeni
  2016-10-25  9:09   ` Daniel Bristot de Oliveira
@ 2016-11-01 16:45   ` Juri Lelli
  2016-11-01 21:10     ` luca abeni
  2016-11-18 13:55   ` Peter Zijlstra
  2 siblings, 1 reply; 45+ messages in thread
From: Juri Lelli @ 2016-11-01 16:45 UTC (permalink / raw)
  To: Luca Abeni
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Claudio Scordino,
	Steven Rostedt

Hi,

a few nitpicks on subject and changelog and a couple of questions below.

Subject should be changed to something like

 sched/deadline: track the active utilisation

On 24/10/16 16:06, Luca Abeni wrote:
> The active utilisation here is defined as the total utilisation of the

s/The active/Active/
s/here//
s/of the active/of active/

> active (TASK_RUNNING) tasks queued on a runqueue. Hence, it is increased
> when a task wakes up and is decreased when a task blocks.
> 
> When a task is migrated from CPUi to CPUj, immediately subtract the task's
> utilisation from CPUi and add it to CPUj. This mechanism is implemented by
> modifying the pull and push functions.
> Note: this is not fully correct from the theoretical point of view
> (the utilisation should be removed from CPUi only at the 0 lag time),

a more theoretically sound solution will follow.

> but doing the right thing would be _MUCH_ more complex (leaving the
> timer armed when the task is on a different CPU... Inactive timers should
> be moved from per-task timers to per-runqueue lists of timers! Bah...)

I'd remove this paragraph above.

> 
> The utilisation tracking mechanism implemented in this commit can be
> fixed / improved by decreasing the active utilisation at the so-called
> "0-lag time" instead of when the task blocks.

And maybe this as well, or put it as more information about the "more
theoretically sound" solution?

> 
> Signed-off-by: Juri Lelli <juri.lelli@arm.com>
> Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
> ---
>  kernel/sched/deadline.c | 39 ++++++++++++++++++++++++++++++++++++++-
>  kernel/sched/sched.h    |  6 ++++++
>  2 files changed, 44 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 37e2449..3d95c1d 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -43,6 +43,22 @@ static inline int on_dl_rq(struct sched_dl_entity *dl_se)
>  	return !RB_EMPTY_NODE(&dl_se->rb_node);
>  }
>  
> +static void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> +{
> +	u64 se_bw = dl_se->dl_bw;
> +
> +	dl_rq->running_bw += se_bw;
> +}
> +
> +static void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> +{
> +	u64 se_bw = dl_se->dl_bw;
> +
> +	dl_rq->running_bw -= se_bw;
> +	if (WARN_ON(dl_rq->running_bw < 0))
> +		dl_rq->running_bw = 0;
> +}
> +
>  static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
>  {
>  	struct sched_dl_entity *dl_se = &p->dl;
> @@ -498,6 +514,8 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>  	struct rq *rq = rq_of_dl_rq(dl_rq);
>  
> +	add_running_bw(dl_se, dl_rq);
> +
>  	if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
>  	    dl_entity_overflow(dl_se, pi_se, rq_clock(rq))) {
>  		dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
> @@ -947,14 +965,19 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
>  		return;
>  	}
>  
> +	if (p->on_rq == TASK_ON_RQ_MIGRATING)
> +		add_running_bw(&p->dl, &rq->dl);
> +
>  	/*
>  	 * If p is throttled, we do nothing. In fact, if it exhausted
>  	 * its budget it needs a replenishment and, since it now is on
>  	 * its rq, the bandwidth timer callback (which clearly has not
>  	 * run yet) will take care of this.
>  	 */
> -	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH))
> +	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
> +		add_running_bw(&p->dl, &rq->dl);

Don't remember if we discussed this already, but do we need to add the bw here
even if the task is not actually enqueued until after the replenishment timer
fires?

>  		return;
> +	}
>  
>  	enqueue_dl_entity(&p->dl, pi_se, flags);
>  
> @@ -972,6 +995,12 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
>  {
>  	update_curr_dl(rq);
>  	__dequeue_task_dl(rq, p, flags);
> +
> +	if (p->on_rq == TASK_ON_RQ_MIGRATING)
> +		sub_running_bw(&p->dl, &rq->dl);
> +
> +	if (flags & DEQUEUE_SLEEP)
> +		sub_running_bw(&p->dl, &rq->dl);
>  }
>  
>  /*
> @@ -1501,7 +1530,9 @@ static int push_dl_task(struct rq *rq)
>  	}
>  
>  	deactivate_task(rq, next_task, 0);
> +	sub_running_bw(&next_task->dl, &rq->dl);
>  	set_task_cpu(next_task, later_rq->cpu);
> +	add_running_bw(&next_task->dl, &later_rq->dl);
>  	activate_task(later_rq, next_task, 0);
>  	ret = 1;
>  
> @@ -1589,7 +1620,9 @@ static void pull_dl_task(struct rq *this_rq)
>  			resched = true;
>  
>  			deactivate_task(src_rq, p, 0);
> +			sub_running_bw(&p->dl, &src_rq->dl);
>  			set_task_cpu(p, this_cpu);
> +			add_running_bw(&p->dl, &this_rq->dl);
>  			activate_task(this_rq, p, 0);
>  			dmin = p->dl.deadline;
>  
> @@ -1695,6 +1728,9 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
>  	if (!start_dl_timer(p))
>  		__dl_clear_params(p);
>  
> +	if (task_on_rq_queued(p))
> +		sub_running_bw(&p->dl, &rq->dl);
> +
>  	/*
>  	 * Since this might be the only -deadline task on the rq,
>  	 * this is the right place to try to pull some other one
> @@ -1712,6 +1748,7 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
>   */
>  static void switched_to_dl(struct rq *rq, struct task_struct *p)
>  {
> +	add_running_bw(&p->dl, &rq->dl);
>  
>  	/* If p is not queued we will update its parameters at next wakeup. */
>  	if (!task_on_rq_queued(p))

Don't we also need to remove bw in task_dead_dl()?

Thanks,

- Juri


* Re: [RFC v3 2/6] Improve the tracking of active utilisation
  2016-10-24 14:06 ` [RFC v3 2/6] Improve the tracking of " Luca Abeni
@ 2016-11-01 16:46   ` Juri Lelli
  2016-11-01 21:46     ` luca abeni
  2016-11-02  2:41   ` luca abeni
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 45+ messages in thread
From: Juri Lelli @ 2016-11-01 16:46 UTC (permalink / raw)
  To: Luca Abeni
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Claudio Scordino,
	Steven Rostedt

Hi,

On 24/10/16 16:06, Luca Abeni wrote:
> This patch implements a more theoretically sound algorithm for
> thracking the active utilisation: instead of decreasing it when a

s/thracking/tracking/
s/the//

> task blocks, use a timer (the "inactive timer", named after the
> "Inactive" task state of the GRUB algorithm) to decrease the
> active utilisaation at the so called "0-lag time".

s/utilisaation/utilisation/

> 
> Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
> ---
>  include/linux/sched.h   |   1 +
>  kernel/sched/core.c     |   1 +
>  kernel/sched/deadline.c | 139 ++++++++++++++++++++++++++++++++++++++++++------
>  kernel/sched/sched.h    |   1 +
>  4 files changed, 126 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 348f51b..22543c6 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1433,6 +1433,7 @@ struct sched_dl_entity {
>  	 * own bandwidth to be enforced, thus we need one timer per task.
>  	 */
>  	struct hrtimer dl_timer;
> +	struct hrtimer inactive_timer;
>  };
>  
>  union rcu_special {
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 94732d1..664c618 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2217,6 +2217,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
>  
>  	RB_CLEAR_NODE(&p->dl.rb_node);
>  	init_dl_task_timer(&p->dl);
> +	init_inactive_task_timer(&p->dl);
>  	__dl_clear_params(p);
>  
>  	INIT_LIST_HEAD(&p->rt.run_list);
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 3d95c1d..80d1541 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -47,6 +47,7 @@ static void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>  {
>  	u64 se_bw = dl_se->dl_bw;
>  
> +	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);

This and the one below go in 1/6.

>  	dl_rq->running_bw += se_bw;
>  }
>  
> @@ -54,11 +55,52 @@ static void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>  {
>  	u64 se_bw = dl_se->dl_bw;
>  
> +	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
>  	dl_rq->running_bw -= se_bw;
>  	if (WARN_ON(dl_rq->running_bw < 0))
>  		dl_rq->running_bw = 0;
>  }
>  
> +static void task_go_inactive(struct task_struct *p)
> +{
> +	struct sched_dl_entity *dl_se = &p->dl;
> +	struct hrtimer *timer = &dl_se->inactive_timer;
> +	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> +	struct rq *rq = rq_of_dl_rq(dl_rq);
> +	s64 zerolag_time;
> +
> +	WARN_ON(dl_se->dl_runtime == 0);
> +
> +	/* If the inactive timer is already armed, return immediately */
> +	if (hrtimer_active(&dl_se->inactive_timer))
> +		return;
> +
> +	zerolag_time = dl_se->deadline -
> +		 div64_long((dl_se->runtime * dl_se->dl_period),
> +			dl_se->dl_runtime);
> +
> +	/*
> +	 * Using relative times instead of the absolute "0-lag time"
> +	 * allows to simplify the code
> +	 */
> +	zerolag_time -= rq_clock(rq);
> +
> +	/*
> +	 * If the "0-lag time" already passed, decrease the active
> +	 * utilization now, instead of starting a timer
> +	 */
> +	if (zerolag_time < 0) {
> +		sub_running_bw(dl_se, dl_rq);
> +		if (!dl_task(p))
> +			__dl_clear_params(p);
> +
> +		return;
> +	}
> +
> +	get_task_struct(p);
> +	hrtimer_start(timer, ns_to_ktime(zerolag_time), HRTIMER_MODE_REL);
> +}
> +
>  static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
>  {
>  	struct sched_dl_entity *dl_se = &p->dl;
> @@ -514,7 +556,20 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>  	struct rq *rq = rq_of_dl_rq(dl_rq);
>  
> -	add_running_bw(dl_se, dl_rq);
> +	if (hrtimer_is_queued(&dl_se->inactive_timer)) {
> +		hrtimer_try_to_cancel(&dl_se->inactive_timer);

Why we are OK with just trying to cancel the inactive timer?

> +		WARN_ON(dl_task_of(dl_se)->nr_cpus_allowed > 1);

What's wrong with nr_cpus_allowed > 1 tasks?

> +	} else {
> +		/*
> +		 * The "inactive timer" has been cancelled in
> +		 * select_task_rq_dl() (and the acvive utilisation has
> +		 * been decreased). So, increase the active utilisation.
> +		 * If select_task_rq_dl() could not cancel the timer,
> +		 * inactive_task_timer() will * find the task state as
> +		 * TASK_RUNNING, and will do nothing, so we are still safe.
> +		 */
> +		add_running_bw(dl_se, dl_rq);
> +	}
>  
>  	if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
>  	    dl_entity_overflow(dl_se, pi_se, rq_clock(rq))) {
> @@ -602,14 +657,8 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
>  
>  	rq = task_rq_lock(p, &rf);
>  
> -	/*
> -	 * The task might have changed its scheduling policy to something
> -	 * different than SCHED_DEADLINE (through switched_fromd_dl()).
> -	 */
> -	if (!dl_task(p)) {
> -		__dl_clear_params(p);
> +	if (!dl_task(p))
>  		goto unlock;
> -	}
>  
>  	/*
>  	 * The task might have been boosted by someone else and might be in the
> @@ -796,6 +845,44 @@ static void update_curr_dl(struct rq *rq)
>  	}
>  }
>  
> +static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
> +{
> +	struct sched_dl_entity *dl_se = container_of(timer,
> +						     struct sched_dl_entity,
> +						     inactive_timer);
> +	struct task_struct *p = dl_task_of(dl_se);
> +	struct rq_flags rf;
> +	struct rq *rq;
> +
> +	rq = task_rq_lock(p, &rf);
> +
> +	if (!dl_task(p)) {
> +		__dl_clear_params(p);
> +
> +		goto unlock;
> +	}
> +	if (p->state == TASK_RUNNING)
> +		goto unlock;
> +
> +	sched_clock_tick();
> +	update_rq_clock(rq);
> +
> +	sub_running_bw(dl_se, &rq->dl);
> +unlock:
> +	task_rq_unlock(rq, p, &rf);
> +	put_task_struct(p);
> +
> +	return HRTIMER_NORESTART;
> +}
> +
> +void init_inactive_task_timer(struct sched_dl_entity *dl_se)
> +{
> +	struct hrtimer *timer = &dl_se->inactive_timer;
> +
> +	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +	timer->function = inactive_task_timer;
> +}
> +
>  #ifdef CONFIG_SMP
>  
>  static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
> @@ -1000,7 +1087,7 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
>  		sub_running_bw(&p->dl, &rq->dl);
>  
>  	if (flags & DEQUEUE_SLEEP)
> -		sub_running_bw(&p->dl, &rq->dl);
> +		task_go_inactive(p);
>  }
>  
>  /*
> @@ -1074,6 +1161,14 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
>  	}
>  	rcu_read_unlock();
>  
> +	rq = task_rq(p);
> +	raw_spin_lock(&rq->lock);
> +	if (hrtimer_active(&p->dl.inactive_timer)) {
> +		sub_running_bw(&p->dl, &rq->dl);
> +		hrtimer_try_to_cancel(&p->dl.inactive_timer);

Can't we subtract twice if it happens that after we grabbed rq_lock the timer
fired, so it's now waiting for that lock and it goes ahead and sub_running_bw
again after we release the lock?

> +	}
> +	raw_spin_unlock(&rq->lock);
> +
>  out:
>  	return cpu;
>  }
> @@ -1244,6 +1339,11 @@ static void task_dead_dl(struct task_struct *p)
>  	/* XXX we should retain the bw until 0-lag */
>  	dl_b->total_bw -= p->dl.dl_bw;
>  	raw_spin_unlock_irq(&dl_b->lock);
> +	if (hrtimer_active(&p->dl.inactive_timer)) {
> +		raw_spin_lock_irq(&task_rq(p)->lock);
> +		sub_running_bw(&p->dl, dl_rq_of_se(&p->dl));

Don't we still need to wait for the 0-lag? Or maybe since the task is dying we
can release its bw instantaneously? In this case I'd add a comment about it.

> +		raw_spin_unlock_irq(&task_rq(p)->lock);
> +	}
>  }
>  
>  static void set_curr_task_dl(struct rq *rq)
> @@ -1720,15 +1820,22 @@ void __init init_sched_dl_class(void)
>  static void switched_from_dl(struct rq *rq, struct task_struct *p)
>  {
>  	/*
> -	 * Start the deadline timer; if we switch back to dl before this we'll
> -	 * continue consuming our current CBS slice. If we stay outside of
> -	 * SCHED_DEADLINE until the deadline passes, the timer will reset the
> -	 * task.
> +	 * task_go_inactive() can start the "inactive timer" (if the 0-lag
> +	 * time is in the future). If the task switches back to dl before
> +	 * the "inactive timer" fires, it can continue to consume its current
> +	 * runtime using its current deadline. If it stays outside of
> +	 * SCHED_DEADLINE until the 0-lag time passes, inactive_task_timer()
> +	 * will reset the task parameters.
>  	 */
> -	if (!start_dl_timer(p))
> -		__dl_clear_params(p);
> +	if (task_on_rq_queued(p) && p->dl.dl_runtime)
> +		task_go_inactive(p);
>  
> -	if (task_on_rq_queued(p))
> +	/*
> +	 * We cannot use inactive_task_timer() to invoke sub_running_bw()
> +	 * at the 0-lag time, because the task could have been migrated
> +	 * while SCHED_OTHER in the meanwhile.

But, from a theoretical PoV, we very much should, right?
Is this taken care of in next patch?

> +	 */
> +	if (hrtimer_is_queued(&p->dl.inactive_timer))
>  		sub_running_bw(&p->dl, &rq->dl);
>  

Thanks,

- Juri


* Re: [RFC v3 1/6] Track the active utilisation
  2016-11-01 16:45   ` Juri Lelli
@ 2016-11-01 21:10     ` luca abeni
  2016-11-08 17:56       ` Juri Lelli
  0 siblings, 1 reply; 45+ messages in thread
From: luca abeni @ 2016-11-01 21:10 UTC (permalink / raw)
  To: Juri Lelli
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Claudio Scordino,
	Steven Rostedt

Hi Juri,

On Tue, 1 Nov 2016 16:45:43 +0000
Juri Lelli <juri.lelli@arm.com> wrote:

> Hi,
> 
> a few nitpicks on subject and changelog and a couple of questions below.
> 
> Subject should be changed to something like
> 
>  sched/deadline: track the active utilisation
Ok; that's easy :)
I guess a similar change should be applied to the subjects of all the
other patches, right?


> 
> On 24/10/16 16:06, Luca Abeni wrote:
> > The active utilisation here is defined as the total utilisation of the  
> 
> s/The active/Active/
> s/here//
> s/of the active/of active/
Ok; I'll do this in the next revision of the patchset.


> > active (TASK_RUNNING) tasks queued on a runqueue. Hence, it is increased
> > when a task wakes up and is decreased when a task blocks.
> > 
> > When a task is migrated from CPUi to CPUj, immediately subtract the task's
> > utilisation from CPUi and add it to CPUj. This mechanism is implemented by
> > modifying the pull and push functions.
> > Note: this is not fully correct from the theoretical point of view
> > (the utilisation should be removed from CPUi only at the 0 lag time),  
> 
> a more theoretically sound solution will follow.
Notice that even the next patch (introducing the "inactive timer") ends up
migrating the utilisation immediately (on tasks' migration), without waiting
for the 0-lag time.
This is because of the reason explained in the following paragraph:

> > but doing the right thing would be _MUCH_ more complex (leaving the
> > timer armed when the task is on a different CPU... Inactive timers should
> > be moved from per-task timers to per-runqueue lists of timers! Bah...)  
> 
> I'd remove this paragraph above.
Ok. Re-reading the changelog, I suspect this is not the correct place for this
comment.


> > The utilisation tracking mechanism implemented in this commit can be
> > fixed / improved by decreasing the active utilisation at the so-called
> > "0-lag time" instead of when the task blocks.  
> 
> And maybe this as well, or put it as more information about the "more
> theoretically sound" solution?
Ok... I can remove the paragraph, or point to the next commit (which
implements the more theoretically sound solution). Is such a "forward
reference" in changelogs ok?

[...]
> > @@ -947,14 +965,19 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> >  		return;
> >  	}
> >  
> > +	if (p->on_rq == TASK_ON_RQ_MIGRATING)
> > +		add_running_bw(&p->dl, &rq->dl);
> > +
> >  	/*
> >  	 * If p is throttled, we do nothing. In fact, if it exhausted
> >  	 * its budget it needs a replenishment and, since it now is on
> >  	 * its rq, the bandwidth timer callback (which clearly has not
> >  	 * run yet) will take care of this.
> >  	 */
> > -	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH))
> > +	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
> > +		add_running_bw(&p->dl, &rq->dl);  
> 
> Don't remember if we discussed this already, but do we need to add the bw here
> even if the task is not actually enqueued until after the replenishment timer
> fires?
I think yes... The active utilization does not depend on the fact that the task
is on the runqueue or not, but depends on the task's state (in GRUB parlance,
"inactive" vs "active contending"). In other words, even when a task is throttled
its utilization must be counted in the active utilization.


[...]
> >  	/*
> >  	 * Since this might be the only -deadline task on the rq,
> >  	 * this is the right place to try to pull some other one
> > @@ -1712,6 +1748,7 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
> >   */
> >  static void switched_to_dl(struct rq *rq, struct task_struct *p)
> >  {
> > +	add_running_bw(&p->dl, &rq->dl);
> >  
> >  	/* If p is not queued we will update its parameters at next wakeup. */
> >  	if (!task_on_rq_queued(p))  
> 
> Don't we also need to remove bw in task_dead_dl()?
I think task_dead_dl() is invoked after invoking dequeue_task_dl(), which takes care
of this... Or am I wrong? (I think I explicitly tested this, and modifications to
task_dead_dl() turned out to be unneeded)



			Thanks,
				Luca

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 2/6] Improve the tracking of active utilisation
  2016-11-01 16:46   ` Juri Lelli
@ 2016-11-01 21:46     ` luca abeni
  2016-11-02  2:35       ` luca abeni
  0 siblings, 1 reply; 45+ messages in thread
From: luca abeni @ 2016-11-01 21:46 UTC (permalink / raw)
  To: Juri Lelli
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Claudio Scordino,
	Steven Rostedt

Hi Juri,

On Tue, 1 Nov 2016 16:46:04 +0000
Juri Lelli <juri.lelli@arm.com> wrote:
[...]
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index 3d95c1d..80d1541 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -47,6 +47,7 @@ static void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> >  {
> >  	u64 se_bw = dl_se->dl_bw;
> >  
> > +	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);  
> 
> This and the one below go in 1/6.
Ok; I'll move it there


> 
> >  	dl_rq->running_bw += se_bw;
> >  }
> >  
> > @@ -54,11 +55,52 @@ static void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> >  {
> >  	u64 se_bw = dl_se->dl_bw;
> >  
> > +	lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
> >  	dl_rq->running_bw -= se_bw;
> >  	if (WARN_ON(dl_rq->running_bw < 0))
> >  		dl_rq->running_bw = 0;
> >  }
> >  
> > +static void task_go_inactive(struct task_struct *p)
> > +{
> > +	struct sched_dl_entity *dl_se = &p->dl;
> > +	struct hrtimer *timer = &dl_se->inactive_timer;
> > +	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> > +	struct rq *rq = rq_of_dl_rq(dl_rq);
> > +	s64 zerolag_time;
> > +
> > +	WARN_ON(dl_se->dl_runtime == 0);
> > +
> > +	/* If the inactive timer is already armed, return immediately */
> > +	if (hrtimer_active(&dl_se->inactive_timer))
> > +		return;
> > +
> > +	zerolag_time = dl_se->deadline -
> > +		 div64_long((dl_se->runtime * dl_se->dl_period),
> > +			dl_se->dl_runtime);
> > +
> > +	/*
> > +	 * Using relative times instead of the absolute "0-lag time"
> > +	 * allows to simplify the code
> > +	 */
> > +	zerolag_time -= rq_clock(rq);
> > +
> > +	/*
> > +	 * If the "0-lag time" already passed, decrease the active
> > +	 * utilization now, instead of starting a timer
> > +	 */
> > +	if (zerolag_time < 0) {
> > +		sub_running_bw(dl_se, dl_rq);
> > +		if (!dl_task(p))
> > +			__dl_clear_params(p);
> > +
> > +		return;
> > +	}
> > +
> > +	get_task_struct(p);
> > +	hrtimer_start(timer, ns_to_ktime(zerolag_time), HRTIMER_MODE_REL);
> > +}
> > +
> >  static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
> >  {
> >  	struct sched_dl_entity *dl_se = &p->dl;
> > @@ -514,7 +556,20 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
> >  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> >  	struct rq *rq = rq_of_dl_rq(dl_rq);
> >  
> > -	add_running_bw(dl_se, dl_rq);
> > +	if (hrtimer_is_queued(&dl_se->inactive_timer)) {
> > +		hrtimer_try_to_cancel(&dl_se->inactive_timer);  
> 
> Why we are OK with just trying to cancel the inactive timer?
This is my understanding of things: if the timer cannot be
cancelled, it either:
1) Already decreased the active utilisation (and the timer handler is
   just finishing its execution). In this case, cancelling it or not
   does not make any difference
2) Still has to decrease the active utilisation (and is probably
   waiting for the runqueue lock). In this case, the timer handler
   will find the task RUNNING, and will not decrease the active
   utilisation (see the sketch below)
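
For reference, the guard that makes case 2 safe looks roughly like this
(a simplified sketch of the handler's logic, not the exact code from the
patch):

	static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
	{
		struct sched_dl_entity *dl_se = container_of(timer,
						struct sched_dl_entity,
						inactive_timer);
		struct task_struct *p = dl_task_of(dl_se);
		struct rq_flags rf;
		struct rq *rq = task_rq_lock(p, &rf);

		/* The task woke up again before the 0-lag time: nothing to do */
		if (p->state == TASK_RUNNING)
			goto unlock;

		sub_running_bw(dl_se, &rq->dl);
	unlock:
		task_rq_unlock(rq, p, &rf);
		put_task_struct(p);

		return HRTIMER_NORESTART;
	}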

> 
> > +		WARN_ON(dl_task_of(dl_se)->nr_cpus_allowed > 1);  
> 
> What's wrong with nr_cpus_allowed > 1 tasks?
If nr_cpus_allowed > 1, then select_task_rq_dl() ran before the task
was enqueued, and it already took care of cancelling the
inactive timer. I added this WARN_ON() to check for possible races
between the timer handler and select_task_rq / enqueue_task_dl...
I tried hard to trigger such a race, but the WARN_ON() never
fired (it only fired when I had tasks bound to one single
CPU - I tried this combination for testing purposes).

[...]
> > @@ -1074,6 +1161,14 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> >  	}
> >  	rcu_read_unlock();
> >  
> > +	rq = task_rq(p);
> > +	raw_spin_lock(&rq->lock);
> > +	if (hrtimer_active(&p->dl.inactive_timer)) {
> > +		sub_running_bw(&p->dl, &rq->dl);
> > +		hrtimer_try_to_cancel(&p->dl.inactive_timer);  
> 
> Can't we subtract twice if it happens that after we grabbed rq_lock the timer
> fired, so it's now waiting for that lock and it goes ahead and sub_running_bw
> again after we release the lock?
Uhm... I somehow convinced myself that this could not happen, but I do not
remember the details, sorry :(
I'll check again in the next few days (BTW, wouldn't this trigger the WARN_ON you
mentioned above?)


> 
> > +	}
> > +	raw_spin_unlock(&rq->lock);
> > +
> >  out:
> >  	return cpu;
> >  }
> > @@ -1244,6 +1339,11 @@ static void task_dead_dl(struct task_struct *p)
> >  	/* XXX we should retain the bw until 0-lag */
> >  	dl_b->total_bw -= p->dl.dl_bw;
> >  	raw_spin_unlock_irq(&dl_b->lock);
> > +	if (hrtimer_active(&p->dl.inactive_timer)) {
> > +		raw_spin_lock_irq(&task_rq(p)->lock);
> > +		sub_running_bw(&p->dl, dl_rq_of_se(&p->dl));  
> 
> Don't we still need to wait for the 0-lag? Or maybe since the task is dying we
> can release it's bw instantaneously? In this case I'd add a comment about it.
Uhmm... I think you are right; I'll double-check this.


> >  static void set_curr_task_dl(struct rq *rq)
> > @@ -1720,15 +1820,22 @@ void __init init_sched_dl_class(void)
> >  static void switched_from_dl(struct rq *rq, struct task_struct *p)
> >  {
> >  	/*
> > -	 * Start the deadline timer; if we switch back to dl before this we'll
> > -	 * continue consuming our current CBS slice. If we stay outside of
> > -	 * SCHED_DEADLINE until the deadline passes, the timer will reset the
> > -	 * task.
> > +	 * task_go_inactive() can start the "inactive timer" (if the 0-lag
> > +	 * time is in the future). If the task switches back to dl before
> > +	 * the "inactive timer" fires, it can continue to consume its current
> > +	 * runtime using its current deadline. If it stays outside of
> > +	 * SCHED_DEADLINE until the 0-lag time passes, inactive_task_timer()
> > +	 * will reset the task parameters.
> >  	 */
> > -	if (!start_dl_timer(p))
> > -		__dl_clear_params(p);
> > +	if (task_on_rq_queued(p) && p->dl.dl_runtime)
> > +		task_go_inactive(p);
> >  
> > -	if (task_on_rq_queued(p))
> > +	/*
> > +	 * We cannot use inactive_task_timer() to invoke sub_running_bw()
> > +	 * at the 0-lag time, because the task could have been migrated
> > +	 * while SCHED_OTHER in the meanwhile.  
> 
> But, from a theoretical pov, we very much should, right?
Yes, we should.

> Is this taken care of in next patch?
No, I left this small divergence between theory and practice. The problem is
that the inactive timer handler looks at the task's runqueue, and decreases
the active utilization of such a runqueue. If the task is a SCHED_DEADLINE
task, this is not an issue, because the inactive timer is armed when the
task is not on a runqueue... When the task is inserted in a runqueue again,
the timer is cancelled. But when the task becomes SCHED_NORMAL, it can go
on a runqueue (and migrate to different runqueues) without cancelling the
inactive timer... So, when the inactive timer fires we have no guarantee
that the task's runqueue is still the one from which we need to subtract
the utilisation (I've actually seen this bug in action, before changing the
code in the way it is now).
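
In code terms, the handler does something like this (a simplified
sketch), which is where the assumption breaks for a task that migrated
while SCHED_OTHER:

	rq = task_rq_lock(p, &rf);
	/*
	 * rq is the runqueue p is on *now*; if p migrated while it was
	 * SCHED_OTHER, this is not the runqueue whose running_bw was
	 * increased when the timer was armed.
	 */
	sub_running_bw(&p->dl, &rq->dl);
	task_rq_unlock(rq, p, &rf);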

If this is not acceptable, I'll try to find another solution to this issue.




			Thanks,
				Luca

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 2/6] Improve the tracking of active utilisation
  2016-11-01 21:46     ` luca abeni
@ 2016-11-02  2:35       ` luca abeni
  2016-11-10 10:04         ` Juri Lelli
  0 siblings, 1 reply; 45+ messages in thread
From: luca abeni @ 2016-11-02  2:35 UTC (permalink / raw)
  To: Juri Lelli
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Claudio Scordino,
	Steven Rostedt

On Tue, 1 Nov 2016 22:46:33 +0100
luca abeni <luca.abeni@unitn.it> wrote:
[...]
> > > @@ -1074,6 +1161,14 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> > >  	}
> > >  	rcu_read_unlock();
> > >  
> > > +	rq = task_rq(p);
> > > +	raw_spin_lock(&rq->lock);
> > > +	if (hrtimer_active(&p->dl.inactive_timer)) {
> > > +		sub_running_bw(&p->dl, &rq->dl);
> > > +		hrtimer_try_to_cancel(&p->dl.inactive_timer);    
> > 
> > Can't we subtract twice if it happens that after we grabbed rq_lock the timer
> > fired, so it's now waiting for that lock and it goes ahead and sub_running_bw
> > again after we release the lock?  
> Uhm... I somehow convinced myself that this could not happen, but I do not
> remember the details, sorry :(
I think I remember the answer now: pi_lock is acquired before invoking select_task_rq
and is released after invoking enqueue_task... So, if there is a pending inactive
timer, its handler will be executed after the task is enqueued... It will see the task
as RUNNING, and will not decrease the active utilisation.
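
The ordering I am relying on is roughly this (simplified from
try_to_wake_up(); the timer handler uses task_rq_lock(), which also
needs pi_lock):

	/* waker side */
	raw_spin_lock_irqsave(&p->pi_lock, flags);
	...
	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
	...
	ttwu_queue(p, cpu, wake_flags);	/* enqueue_task_dl() runs here */
	raw_spin_unlock_irqrestore(&p->pi_lock, flags);

	/*
	 * timer side: task_rq_lock(p, ...) in the handler also takes
	 * p->pi_lock, so a pending handler can only run after the wakeup
	 * has enqueued p, and it will see the task as RUNNING.
	 */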

Does this make sense?



				Luca

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 2/6] Improve the tracking of active utilisation
  2016-10-24 14:06 ` [RFC v3 2/6] Improve the tracking of " Luca Abeni
  2016-11-01 16:46   ` Juri Lelli
@ 2016-11-02  2:41   ` luca abeni
  2016-11-18 15:36   ` Peter Zijlstra
  2016-11-18 15:47   ` Peter Zijlstra
  3 siblings, 0 replies; 45+ messages in thread
From: luca abeni @ 2016-11-02  2:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Claudio Scordino,
	Steven Rostedt

On Mon, 24 Oct 2016 16:06:34 +0200
Luca Abeni <luca.abeni@unitn.it> wrote:
[...]
> @@ -514,7 +556,20 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>  	struct rq *rq = rq_of_dl_rq(dl_rq);
>  
> -	add_running_bw(dl_se, dl_rq);
> +	if (hrtimer_is_queued(&dl_se->inactive_timer)) {
> +		hrtimer_try_to_cancel(&dl_se->inactive_timer);
Replying to myself here: after re-reading this code again, I now think that
if hrtimer_try_to_cancel() does not fail I need a put_task_struct() to
compensate for the one that would have happened in the inactive timer handler
I just cancelled...
I do not know how I previously missed this (or maybe I just managed to
confuse myself now :)

I will check this next week.
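
Something like this, I think (an untested sketch of the fix I have in
mind):

	if (hrtimer_is_queued(&dl_se->inactive_timer)) {
		if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1)
			put_task_struct(dl_task_of(dl_se));
	}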



				Luca

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 1/6] Track the active utilisation
  2016-11-01 21:10     ` luca abeni
@ 2016-11-08 17:56       ` Juri Lelli
  2016-11-08 18:17         ` Luca Abeni
                           ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Juri Lelli @ 2016-11-08 17:56 UTC (permalink / raw)
  To: luca abeni
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Claudio Scordino,
	Steven Rostedt

On 01/11/16 22:10, Luca Abeni wrote:
> Hi Juri,
> 
> On Tue, 1 Nov 2016 16:45:43 +0000
> Juri Lelli <juri.lelli@arm.com> wrote:
> 
> > Hi,
> > 
> > a few nitpicks on subject and changelog and a couple of questions below.
> > 
> > Subject should be changed to something like
> > 
> >  sched/deadline: track the active utilisation
> Ok; that's easy :)
> I guess a similar change should be applied to the subjects of all the
> other patches, right?
> 

Yep. Subjects usually have the form:

 <modified_file(s)>: <short title>

> 
> > 
> > On 24/10/16 16:06, Luca Abeni wrote:
> > > The active utilisation here is defined as the total utilisation of the  
> > 
> > s/The active/Active/
> > s/here//
> > s/of the active/of active/
> Ok; I'll do this in the next revision of the patchset.
> 

Thanks.

> 
> > > active (TASK_RUNNING) tasks queued on a runqueue. Hence, it is increased
> > > when a task wakes up and is decreased when a task blocks.
> > > 
> > > When a task is migrated from CPUi to CPUj, immediately subtract the task's
> > > utilisation from CPUi and add it to CPUj. This mechanism is implemented by
> > > modifying the pull and push functions.
> > > Note: this is not fully correct from the theoretical point of view
> > > (the utilisation should be removed from CPUi only at the 0 lag time),  
> > 
> > a more theoretically sound solution will follow.
> Notice that even the next patch (introducing the "inactive timer") ends up
> migrating the utilisation immediately (on tasks' migration), without waiting
> for the 0-lag time.
> This is because of the reason explained in the following paragraph:
> 

OK, but it is still _more_ theoretically sound. :)

> > > but doing the right thing would be _MUCH_ more complex (leaving the
> > > timer armed when the task is on a different CPU... Inactive timers should
> > > be moved from per-task timers to per-runqueue lists of timers! Bah...)  
> > 
> > I'd remove this paragraph above.
> Ok. Re-reading the changelog, I suspect this is not the correct place for this
> comment.
> 
> 
> > > The utilisation tracking mechanism implemented in this commit can be
> > > fixed / improved by decreasing the active utilisation at the so-called
> > > "0-lag time" instead of when the task blocks.  
> > 
> > And maybe this as well, or put it as more information about the "more
> > theoretically sound" solution?
> Ok... I can remove the paragraph, or point to the next commit (which
> implements the more theoretically sound solution). Is such a "forward
> reference" in changelogs ok?
> 

I'd just say that a better solution will follow. The details about why
it's better can then go in the changelog and in comments in the
code of the next patch.

> [...]
> > > @@ -947,14 +965,19 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> > >  		return;
> > >  	}
> > >  
> > > +	if (p->on_rq == TASK_ON_RQ_MIGRATING)
> > > +		add_running_bw(&p->dl, &rq->dl);
> > > +
> > >  	/*
> > >  	 * If p is throttled, we do nothing. In fact, if it exhausted
> > >  	 * its budget it needs a replenishment and, since it now is on
> > >  	 * its rq, the bandwidth timer callback (which clearly has not
> > >  	 * run yet) will take care of this.
> > >  	 */
> > > -	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH))
> > > +	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
> > > +		add_running_bw(&p->dl, &rq->dl);  
> > 
> > Don't remember if we discussed this already, but do we need to add the bw here
> > even if the task is not actually enqueued until after the replenishment timer
> > fires?
> I think yes... The active utilization does not depend on the fact that the task
> is on the runqueue or not, but depends on the task's state (in GRUB parlance,
> "inactive" vs "active contending"). In other words, even when a task is throttled
> its utilization must be counted in the active utilization.
> 

OK. Could you add a comment about this point please (so that I don't
forget again :)?

> 
> [...]
> > >  	/*
> > >  	 * Since this might be the only -deadline task on the rq,
> > >  	 * this is the right place to try to pull some other one
> > > @@ -1712,6 +1748,7 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
> > >   */
> > >  static void switched_to_dl(struct rq *rq, struct task_struct *p)
> > >  {
> > > +	add_running_bw(&p->dl, &rq->dl);
> > >  
> > >  	/* If p is not queued we will update its parameters at next wakeup. */
> > >  	if (!task_on_rq_queued(p))  
> > 
> > Don't we also need to remove bw in task_dead_dl()?
> I think task_dead_dl() is invoked after invoking dequeue_task_dl(), which takes care
> of this... Or am I wrong? (I think I explicitly tested this, and modifications to
> task_dead_dl() turned out to be unneeded)
> 

Mmm. You explicitly check that TASK_ON_RQ_MIGRATING or DEQUEUE_SLEEP
(which btw can be actually put together with an or condition), so I
don't think that any of those turn out to be true when the task dies.
Also, AFAIU, do_exit() works on current and the TASK_DEAD case is
handled in finish_task_switch(), so I don't think we are taking care of
the "task is dying" condition.

Peter, does what I'm saying make any sense? :)

I still have to set up things here to test these patches (sorry, I was
travelling), but could you try to create some tasks and then kill them
from another shell to see if the accounting deviates or not? Or did you
already do this test?

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 1/6] Track the active utilisation
  2016-11-08 17:56       ` Juri Lelli
@ 2016-11-08 18:17         ` Luca Abeni
  2016-11-08 18:53           ` Juri Lelli
  2016-11-09 16:29         ` luca abeni
  2016-11-18 14:55         ` Peter Zijlstra
  2 siblings, 1 reply; 45+ messages in thread
From: Luca Abeni @ 2016-11-08 18:17 UTC (permalink / raw)
  To: Juri Lelli
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Claudio Scordino,
	Steven Rostedt

Hi Juri,

On Tue, 8 Nov 2016 17:56:35 +0000
Juri Lelli <juri.lelli@arm.com> wrote:
[...]
> > > >  static void switched_to_dl(struct rq *rq, struct task_struct *p)
> > > >  {
> > > > +	add_running_bw(&p->dl, &rq->dl);
> > > >  
> > > >  	/* If p is not queued we will update its parameters at next wakeup. */
> > > >  	if (!task_on_rq_queued(p))  
> > > 
> > > Don't we also need to remove bw in task_dead_dl()?
> > I think task_dead_dl() is invoked after invoking dequeue_task_dl(),
> > which takes care of this... Or am I wrong? (I think I explicitly
> > tested this, and modifications to task_dead_dl() turned out to be
> > unneeded)
> > 
> 
> Mmm. You explicitly check that TASK_ON_RQ_MIGRATING or DEQUEUE_SLEEP
> (which btw can be actually put together with an or condition), so I
> don't think that any of those turn out to be true when the task dies.
I might be very wrong here, but I think do_exit() just does something
like
	tsk->state = TASK_DEAD;
and then invokes schedule(), and __schedule() does
        if (!preempt && prev->state) {
                if (unlikely(signal_pending_state(prev->state, prev))) {
                        prev->state = TASK_RUNNING;
                } else {
                        deactivate_task(rq, prev, DEQUEUE_SLEEP);
			[...]
so dequeue_task_dl() will see DEQUEUE_SLEEP... Or am I misunderstanding
what you are saying?

> Also, AFAIU, do_exit() works on current and the TASK_DEAD case is
> handled in finish_task_switch(), so I don't think we are taking care
> of the "task is dying" condition.
Ok, so I am missing something... The state is set to TASK_DEAD, and
then schedule() is called... So, __schedule() sees the dying task as
"prev" and invokes deactivate_task() with the DEQUEUE_SLEEP flag...
After that, finish_task_switch() calls task_dead_dl(). Is this wrong?
If not, why aren't we taking care of the "task is dying" condition?


> Peter, does what I'm saying make any sense? :)
> 
> I still have to set up things here to test these patches (sorry, I was
> travelling), but could you try to create some tasks and then kill them
> from another shell to see if the accounting deviates or not? Or did
> you already do this test?
I think this is one of the tests I tried... 
I have to check if I changed this code after the test (but I do not
think I did). Anyway, tomorrow I'll write a script for automating this
test, and I'll leave it running for some hours.



				Luca

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 1/6] Track the active utilisation
  2016-11-08 18:17         ` Luca Abeni
@ 2016-11-08 18:53           ` Juri Lelli
  2016-11-08 19:09             ` Luca Abeni
  0 siblings, 1 reply; 45+ messages in thread
From: Juri Lelli @ 2016-11-08 18:53 UTC (permalink / raw)
  To: Luca Abeni
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Claudio Scordino,
	Steven Rostedt

On 08/11/16 19:17, Luca Abeni wrote:
> Hi Juri,
> 
> On Tue, 8 Nov 2016 17:56:35 +0000
> Juri Lelli <juri.lelli@arm.com> wrote:
> [...]
> > > > >  static void switched_to_dl(struct rq *rq, struct task_struct *p)
> > > > >  {
> > > > > +	add_running_bw(&p->dl, &rq->dl);
> > > > >  
> > > > >  	/* If p is not queued we will update its parameters at next wakeup. */
> > > > >  	if (!task_on_rq_queued(p))  
> > > > 
> > > > Don't we also need to remove bw in task_dead_dl()?
> > > I think task_dead_dl() is invoked after invoking dequeue_task_dl(),
> > > which takes care of this... Or am I wrong? (I think I explicitly
> > > tested this, and modifications to task_dead_dl() turned out to be
> > > unneeded)
> > > 
> > 
> > Mmm. You explicitly check that TASK_ON_RQ_MIGRATING or DEQUEUE_SLEEP
> > (which btw can be actually put together with an or condition), so I
> > don't think that any of those turn out to be true when the task dies.
> I might be very wrong here, but I think do_exit() just does something
> like
> 	tsk->state = TASK_DEAD;
> and then invokes schedule(), and __schedule() does
>         if (!preempt && prev->state) {
>                 if (unlikely(signal_pending_state(prev->state, prev))) {
>                         prev->state = TASK_RUNNING;
>                 } else {
>                         deactivate_task(rq, prev, DEQUEUE_SLEEP);
> 			[...]
> so dequeue_task_dl() will see DEQUEUE_SLEEP... Or am I misunderstanding
> what you are saying?
> 
> > Also, AFAIU, do_exit() works on current and the TASK_DEAD case is
> > handled in finish_task_switch(), so I don't think we are taking care
> > of the "task is dying" condition.
> Ok, so I am missing something... The state is set to TASK_DEAD, and
> then schedule() is called... So, __schedule() sees the dying task as
> "prev" and invokes deactivate_task() with the DEQUEUE_SLEEP flag...
> After that, finish_task_switch() calls task_dead_dl(). Is this wrong?
> If not, why aren't we taking care of the "task is dying" condition?
> 

No, I think you are right. But, semantically this cleanup goes in
task_dead_dl(), IMHO. It's most probably moot if it complicates things,
but it might be helpful to differentiate the case between a task that is
actually going to sleep (and for which we want to activate the timer)
and a task that is dying (and for which we want to release bw
immediately). So, it actually matters for next patch, not here. But,
maybe we want to do things clean from start?

> 
> > Peter, does what I'm saying make any sense? :)
> > 
> > I still have to set up things here to test these patches (sorry, I was
> > travelling), but could you try to create some tasks and then kill them
> > from another shell to see if the accounting deviates or not? Or did
> > you already do this test?
> I think this is one of the tests I tried... 
> I have to check if I changed this code after the test (but I do not
> think I did). Anyway, tomorrow I'll write a script for automating this
> test, and I'll leave it running for some hours.
> 

OK, thanks. As said I think that you actually handle the case already,
but I'll try to setup testing as well soon.

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 1/6] Track the active utilisation
  2016-11-08 18:53           ` Juri Lelli
@ 2016-11-08 19:09             ` Luca Abeni
  2016-11-08 20:02               ` Juri Lelli
  0 siblings, 1 reply; 45+ messages in thread
From: Luca Abeni @ 2016-11-08 19:09 UTC (permalink / raw)
  To: Juri Lelli
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Claudio Scordino,
	Steven Rostedt

Hi again,

On Tue, 8 Nov 2016 18:53:09 +0000
Juri Lelli <juri.lelli@arm.com> wrote:
[...]
> > > Also, AFAIU, do_exit() works on current and the TASK_DEAD case is
> > > handled in finish_task_switch(), so I don't think we are taking
> > > care of the "task is dying" condition.
> > Ok, so I am missing something... The state is set to TASK_DEAD, and
> > then schedule() is called... So, __schedule() sees the dying task as
> > "prev" and invokes deactivate_task() with the DEQUEUE_SLEEP flag...
> > After that, finish_task_switch() calls task_dead_dl(). Is this
> > wrong? If not, why aren't we taking care of the "task is dying"
> > condition?
> > 
> 
> No, I think you are right. But, semantically this cleanup goes in
> task_dead_dl(), IMHO.
Just to be sure I understand correctly: you suggest to add a check for
"state == TASK_DEAD" (skipping the cleanup if the condition is true) in
dequeue_task_dl(), and to add a sub_running_bw() in task_dead_dl()...
Is this understanding correct?
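
In code, I understand the suggestion as something like this (just a
hypothetical sketch, to make sure we are talking about the same thing):

	static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
	{
		update_curr_dl(rq);
		__dequeue_task_dl(rq, p, flags);

		if (p->on_rq == TASK_ON_RQ_MIGRATING)
			sub_running_bw(&p->dl, &rq->dl);

		/* Skip the cleanup here if the task is dying... */
		if ((flags & DEQUEUE_SLEEP) && p->state != TASK_DEAD)
			sub_running_bw(&p->dl, &rq->dl);
	}

	/* ...and release the bandwidth from task_dead_dl() instead */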

> It's most probably moot if it complicates
> things, but it might be helpful to differentiate the case between a
> task that is actually going to sleep (and for which we want to
> activate the timer) and a task that is dying (and for which we want
> to release bw immediately).
I suspect the two cases should be handled in the same way :)

> So, it actually matters for next patch,
> not here. But, maybe we want to do things clean from start?
You mean, because patch 2/6 adds
+       if (hrtimer_active(&p->dl.inactive_timer)) {
+               raw_spin_lock_irq(&task_rq(p)->lock);
+               sub_running_bw(&p->dl, dl_rq_of_se(&p->dl));
+               raw_spin_unlock_irq(&task_rq(p)->lock);
+       }
in task_dead_dl()? I suspect this hunk is actually unneeded (worse, it
is wrong :). I am trying to remember why it is there, but I cannot find
any reason... In the next days, I'll run some tests to check if that
hunk is actually needed. If yes, then I'll modify patch 1/6 as you
suggest; if it is not needed, I'll remove it from patch 2/6 and I'll
not do this change to patch 1/6... Is this ok?



			Thanks,
				Luca


> 
> > 
> > > Peter, does what I'm saying make any sense? :)
> > > 
> > > I still have to set up things here to test these patches (sorry,
> > > I was travelling), but could you try to create some tasks and
> > > then kill them from another shell to see if the accounting
> > > deviates or not? Or did you already do this test?
> > I think this is one of the tests I tried... 
> > I have to check if I changed this code after the test (but I do not
> > think I did). Anyway, tomorrow I'll write a script for automating
> > this test, and I'll leave it running for some hours.
> > 
> 
> OK, thanks. As said I think that you actually handle the case already,
> but I'll try to setup testing as well soon.
> 
> Thanks,
> 
> - Juri

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 1/6] Track the active utilisation
  2016-11-08 19:09             ` Luca Abeni
@ 2016-11-08 20:02               ` Juri Lelli
  2016-11-09 15:25                 ` luca abeni
  0 siblings, 1 reply; 45+ messages in thread
From: Juri Lelli @ 2016-11-08 20:02 UTC (permalink / raw)
  To: Luca Abeni
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Claudio Scordino,
	Steven Rostedt

On 08/11/16 20:09, Luca Abeni wrote:
> Hi again,
> 
> On Tue, 8 Nov 2016 18:53:09 +0000
> Juri Lelli <juri.lelli@arm.com> wrote:
> [...]
> > > > Also, AFAIU, do_exit() works on current and the TASK_DEAD case is
> > > > handled in finish_task_switch(), so I don't think we are taking
> > > > care of the "task is dying" condition.
> > > Ok, so I am missing something... The state is set to TASK_DEAD, and
> > > then schedule() is called... So, __schedule() sees the dying task as
> > > "prev" and invokes deactivate_task() with the DEQUEUE_SLEEP flag...
> > > After that, finish_task_switch() calls task_dead_dl(). Is this
> > > wrong? If not, why aren't we taking care of the "task is dying"
> > > condition?
> > > 
> > 
> > No, I think you are right. But, semantically this cleanup goes in
> > task_dead_dl(), IMHO.
> Just to be sure I understand correctly: you suggest to add a check for
> "state == TASK_DEAD" (skipping the cleanup if the condition is true) in
> dequeue_task_dl(), and to add a sub_running_bw() in task_dead_dl()...
> Is this understanding correct?

This is more ugly, I know. It probably makes sense though if we then need it in
the next patch. But you are saying the opposite (we don't actually need it), so
in that case we might just want to add a comment here explaining why we handle
the "task is dying" case together with "task is going to sleep" (so that I
don't forget? :).

> 
> > It's most probably moot if it complicates
> > things, but it might be helpful to differentiate the case between a
> > task that is actually going to sleep (and for which we want to
> > activate the timer) and a task that is dying (and for which we want
> > to release bw immediately).
> I suspect the two cases should be handled in the same way :)
> 
> > So, it actually matters for next patch,
> > not here. But, maybe we want to do things clean from start?
> You mean, because patch 2/6 adds
> +       if (hrtimer_active(&p->dl.inactive_timer)) {
> +               raw_spin_lock_irq(&task_rq(p)->lock);
> +               sub_running_bw(&p->dl, dl_rq_of_se(&p->dl));
> +               raw_spin_unlock_irq(&task_rq(p)->lock);
> +       }
> in task_dead_dl()? I suspect this hunk is actually unneeded (worse, it
> is wrong :). I am trying to remember why it is there, but I cannot find
> any reason... In the next days, I'll run some tests to check if that
> hunk is actually needed. If yes, then I'll modify patch 1/6 as you
> suggest; if it is not needed, I'll remove it from patch 2/6 and I'll
> not do this change to patch 1/6... Is this ok?
> 

I guess yes, if we don't need to differentiate. Maybe just add a comment as I
am saying above?

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 1/6] Track the active utilisation
  2016-11-08 20:02               ` Juri Lelli
@ 2016-11-09 15:25                 ` luca abeni
  0 siblings, 0 replies; 45+ messages in thread
From: luca abeni @ 2016-11-09 15:25 UTC (permalink / raw)
  To: Juri Lelli
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Claudio Scordino,
	Steven Rostedt

On Tue, 8 Nov 2016 20:02:29 +0000
Juri Lelli <juri.lelli@arm.com> wrote:
[...]
> > > So, it actually matters for next patch,
> > > not here. But, maybe we want to do things clean from start?  
> > You mean, because patch 2/6 adds
> > +       if (hrtimer_active(&p->dl.inactive_timer)) {
> > +               raw_spin_lock_irq(&task_rq(p)->lock);
> > +               sub_running_bw(&p->dl, dl_rq_of_se(&p->dl));
> > +               raw_spin_unlock_irq(&task_rq(p)->lock);
> > +       }
> > in task_dead_dl()? I suspect this hunk is actually unneeded (worse, it
> > is wrong :). I am trying to remember why it is there, but I cannot find
> > any reason... In the next days, I'll run some tests to check if that
> > hunk is actually needed. If yes, then I'll modify patch 1/6 as you
> > suggest; if it is not needed, I'll remove it from patch 2/6 and I'll
> > not do this change to patch 1/6... Is this ok?
> >   
> 
> I guess yes, if we don't need to differentiate.
Ok; so, I ran some tests (and I found some old notes of mine). The
modifications to task_dead_dl() mentioned above are not actually needed;
I added them as a preparation for a change needed by patch 3... But I
now think this was an error; I am reworking this part of the code
(removing changes from task_dead_dl() and adding a "p->state == TASK_DEAD"
check in the inactive timer handler).

I'll post an update for patches 2 and 3 in a few days, after I finish
some more tests.



				Luca

> Maybe just add a comment as I  am saying above?
> 
> Thanks,
> 
> - Juri

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 1/6] Track the active utilisation
  2016-11-08 17:56       ` Juri Lelli
  2016-11-08 18:17         ` Luca Abeni
@ 2016-11-09 16:29         ` luca abeni
  2016-11-18 14:55         ` Peter Zijlstra
  2 siblings, 0 replies; 45+ messages in thread
From: luca abeni @ 2016-11-09 16:29 UTC (permalink / raw)
  To: Juri Lelli
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Claudio Scordino,
	Steven Rostedt

On Tue, 8 Nov 2016 17:56:35 +0000
Juri Lelli <juri.lelli@arm.com> wrote:
[...]
> > > > @@ -947,14 +965,19 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> > > >  		return;
> > > >  	}
> > > >  
> > > > +	if (p->on_rq == TASK_ON_RQ_MIGRATING)
> > > > +		add_running_bw(&p->dl, &rq->dl);
> > > > +
> > > >  	/*
> > > >  	 * If p is throttled, we do nothing. In fact, if it exhausted
> > > >  	 * its budget it needs a replenishment and, since it now is on
> > > >  	 * its rq, the bandwidth timer callback (which clearly has not
> > > >  	 * run yet) will take care of this.
> > > >  	 */
> > > > -	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH))
> > > > +	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
> > > > +		add_running_bw(&p->dl, &rq->dl);    
> > > 
> > > Don't remember if we discussed this already, but do we need to add the bw here
> > > even if the task is not actually enqueued until after the replenishment timer
> > > fires?  
> > I think yes... The active utilization does not depend on the fact that the task
> > is on the runqueue or not, but depends on the task's state (in GRUB parlance,
> > "inactive" vs "active contending"). In other words, even when a task is throttled
> > its utilization must be counted in the active utilization.
> >   
> 
> OK. Could you add a comment about this point please (so that I don't
> forget again :)?
So, I just changed the comment to:

        /*
         * If p is throttled, we do not enqueue it. In fact, if it exhausted
         * its budget it needs a replenishment and, since it now is on
         * its rq, the bandwidth timer callback (which clearly has not
         * run yet) will take care of this.
         * However, the active utilization does not depend on the fact
         * that the task is on the runqueue or not (but depends on the
         * task's state - in GRUB parlance, "inactive" vs "active contending").
         * In other words, even if a task is throttled its utilization must
         * be counted in the active utilization; hence, we need to call
         * add_running_bw().
         */

Is this ok?


			Thanks,
				Luca

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 2/6] Improve the tracking of active utilisation
  2016-11-02  2:35       ` luca abeni
@ 2016-11-10 10:04         ` Juri Lelli
  2016-11-10 11:56           ` Juri Lelli
  0 siblings, 1 reply; 45+ messages in thread
From: Juri Lelli @ 2016-11-10 10:04 UTC (permalink / raw)
  To: luca abeni
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Claudio Scordino,
	Steven Rostedt

On 02/11/16 03:35, Luca Abeni wrote:
> On Tue, 1 Nov 2016 22:46:33 +0100
> luca abeni <luca.abeni@unitn.it> wrote:
> [...]
> > > > @@ -1074,6 +1161,14 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> > > >  	}
> > > >  	rcu_read_unlock();
> > > >  
> > > > +	rq = task_rq(p);
> > > > +	raw_spin_lock(&rq->lock);
> > > > +	if (hrtimer_active(&p->dl.inactive_timer)) {
> > > > +		sub_running_bw(&p->dl, &rq->dl);
> > > > +		hrtimer_try_to_cancel(&p->dl.inactive_timer);    
> > > 
> > > Can't we subtract twice if it happens that after we grabbed rq_lock the timer
> > > fired, so it's now waiting for that lock and it goes ahead and sub_running_bw
> > > again after we release the lock?  
> > Uhm... I somehow convinced myself that this could not happen, but I do not
> > remember the details, sorry :(
> I think I remember the answer now: pi_lock is acquired before invoking select_task_rq
> and is released after invoking enqueue_task... So, if there is a pending inactive
> timer, its handler will be executed after the task is enqueued... It will see the task
> as RUNNING, and will not decrease the active utilisation.
> 

Oh, because we do task_rq_lock() in inactive_task_timer(). So, that should
save us from the double subtract. Would you mind adding something along
the line of what you said above as a comment for next version?

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 2/6] Improve the tracking of active utilisation
  2016-11-10 10:04         ` Juri Lelli
@ 2016-11-10 11:56           ` Juri Lelli
  2016-11-10 12:15             ` luca abeni
  0 siblings, 1 reply; 45+ messages in thread
From: Juri Lelli @ 2016-11-10 11:56 UTC (permalink / raw)
  To: luca abeni
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Claudio Scordino,
	Steven Rostedt

On 10/11/16 10:04, Juri Lelli wrote:
> On 02/11/16 03:35, Luca Abeni wrote:
> > On Tue, 1 Nov 2016 22:46:33 +0100
> > luca abeni <luca.abeni@unitn.it> wrote:
> > [...]
> > > > > @@ -1074,6 +1161,14 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> > > > >  	}
> > > > >  	rcu_read_unlock();
> > > > >  
> > > > > +	rq = task_rq(p);
> > > > > +	raw_spin_lock(&rq->lock);
> > > > > +	if (hrtimer_active(&p->dl.inactive_timer)) {
> > > > > +		sub_running_bw(&p->dl, &rq->dl);
> > > > > +		hrtimer_try_to_cancel(&p->dl.inactive_timer);    
> > > > 
> > > > Can't we subtract twice if it happens that after we grabbed rq_lock the timer
> > > > fired, so it's now waiting for that lock and it goes ahead and sub_running_bw
> > > > again after we release the lock?  
> > > Uhm... I somehow convinced myself that this could not happen, but I do not
> > > remember the details, sorry :(
> > I think I remember the answer now: pi_lock is acquired before invoking select_task_rq
> > and is released after invoking enqueue_task... So, if there is a pending inactive
> > timer, its handler will be executed after the task is enqueued... It will see the task
> > as RUNNING, and will not decrease the active utilisation.
> > 
> 
> Oh, because we do task_rq_lock() in inactive_task_timer(). So, that should
> save us from the double subtract. Would you mind adding something along
> the line of what you said above as a comment for next version?
> 

Mmm, wait again.

Cannot the following happen?

 - inactive_timer fires and does sub_running_bw (as the task is not
   RUNNING)
 - another cpu does try_to_wake_up and blocks on pi_lock
 - inactive timer releases both pi and rq locks (but is still executing,
   let's say it is doing put_task_struct())
 - try_to_wake_up goes ahead and calls select_task_rq_dl
   + it finds inactive_timer active
   + sub_running_bw again :(

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 2/6] Improve the tracking of active utilisation
  2016-11-10 11:56           ` Juri Lelli
@ 2016-11-10 12:15             ` luca abeni
  2016-11-10 12:34               ` Juri Lelli
  0 siblings, 1 reply; 45+ messages in thread
From: luca abeni @ 2016-11-10 12:15 UTC (permalink / raw)
  To: Juri Lelli
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Claudio Scordino,
	Steven Rostedt

On Thu, 10 Nov 2016 11:56:10 +0000
Juri Lelli <juri.lelli@arm.com> wrote:

> On 10/11/16 10:04, Juri Lelli wrote:
> > On 02/11/16 03:35, Luca Abeni wrote:  
> > > On Tue, 1 Nov 2016 22:46:33 +0100
> > > luca abeni <luca.abeni@unitn.it> wrote:
> > > [...]  
> > > > > > @@ -1074,6 +1161,14 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> > > > > >  	}
> > > > > >  	rcu_read_unlock();
> > > > > >  
> > > > > > +	rq = task_rq(p);
> > > > > > +	raw_spin_lock(&rq->lock);
> > > > > > +	if (hrtimer_active(&p->dl.inactive_timer)) {
> > > > > > +		sub_running_bw(&p->dl, &rq->dl);
> > > > > > +		hrtimer_try_to_cancel(&p->dl.inactive_timer);      
> > > > > 
> > > > > Can't we subtract twice if it happens that after we grabbed
> > > > > rq_lock the timer fired, so it's now waiting for that lock
> > > > > and it goes ahead and sub_running_bw again after we release
> > > > > the lock?    
> > > > Uhm... I somehow convinced myself that this could not happen,
> > > > but I do not remember the details, sorry :(  
> > > I think I remember the answer now: pi_lock is acquired before
> > > invoking select_task_rq and is released after invoking
> > > enqueue_task... So, if there is a pending inactive timer, its
> > > handler will be executed after the task is enqueued... It will
> > > see the task as RUNNING, and will not decrease the active
> > > utilisation. 
> > 
> > Oh, because we do task_rq_lock() inactive_task_timer(). So, that
> > should save us from the double subtract. Would you mind adding
> > something along the line of what you said above as a comment for
> > next version? 
> 
> Mmm, wait again.
> 
> Cannot the following happen?
> 
>  - inactive_timer fires and does sub_running_bw (as the task is not
>    RUNNING)
>  - another cpu does try_to_wake_up and blocks on pi_lock
>  - inactive timer releases both pi and rq locks (but is still
> executing, let's say it is doing put_task_struct())
>  - try_to_wake_up goes ahead and calls select_task_rq_dl
>    + it finds inactive_timer active
>    + sub_running_bw again :(
Uhm... Right; this can happen :(

Ok; I'll think about some possible solution for this race... If I do
not find any simple way to solve it, I'll add a "contending" flag,
which allows to know if the inactive timer handler already executed or
not.

BTW, talking about sched_dl_entity flags: I see there are three
different int fields "dl_throttled", "dl_boosted" and "dl_yielded"; any
reason for doing this instead of having a "dl_flags" field and setting
its different bits when the entity is throttled, boosted or yielded? In
other words: if I need this "contending" flag, should I add a new
"dl_contending" field?



			Thanks,
				Luca

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 2/6] Improve the tracking of active utilisation
  2016-11-10 12:15             ` luca abeni
@ 2016-11-10 12:34               ` Juri Lelli
  2016-11-10 12:45                 ` luca abeni
  0 siblings, 1 reply; 45+ messages in thread
From: Juri Lelli @ 2016-11-10 12:34 UTC (permalink / raw)
  To: luca abeni
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Claudio Scordino,
	Steven Rostedt

On 10/11/16 13:15, Luca Abeni wrote:
> On Thu, 10 Nov 2016 11:56:10 +0000
> Juri Lelli <juri.lelli@arm.com> wrote:
> 
> > On 10/11/16 10:04, Juri Lelli wrote:
> > > On 02/11/16 03:35, Luca Abeni wrote:  
> > > > On Tue, 1 Nov 2016 22:46:33 +0100
> > > > luca abeni <luca.abeni@unitn.it> wrote:
> > > > [...]  
> > > > > > > @@ -1074,6 +1161,14 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> > > > > > >  	}
> > > > > > >  	rcu_read_unlock();
> > > > > > >  
> > > > > > > +	rq = task_rq(p);
> > > > > > > +	raw_spin_lock(&rq->lock);
> > > > > > > +	if (hrtimer_active(&p->dl.inactive_timer)) {
> > > > > > > +		sub_running_bw(&p->dl, &rq->dl);
> > > > > > > +		hrtimer_try_to_cancel(&p->dl.inactive_timer);      
> > > > > > 
> > > > > > Can't we subtract twice if it happens that after we grabbed
> > > > > > rq_lock the timer fired, so it's now waiting for that lock
> > > > > > and it goes ahead and sub_running_bw again after we release
> > > > > > the lock?    
> > > > > Uhm... I somehow convinced myself that this could not happen,
> > > > > but I do not remember the details, sorry :(  
> > > > I think I remember the answer now: pi_lock is acquired before
> > > > invoking select_task_rq and is released after invoking
> > > > enqueue_task... So, if there is a pending inactive timer, its
> > > > handler will be executed after the task is enqueued... It will
> > > > see the task as RUNNING, and will not decrease the active
> > > > utilisation. 
> > > 
> > > Oh, because we do task_rq_lock() in inactive_task_timer(). So, that
> > > should save us from the double subtract. Would you mind adding
> > > something along the line of what you said above as a comment for
> > > next version? 
> > 
> > Mmm, wait again.
> > 
> > Cannot the following happen?
> > 
> >  - inactive_timer fires and does sub_running_bw (as the task is not
> >    RUNNING)
> >  - another cpu does try_to_wake_up and blocks on pi_lock
> >  - inactive timer releases both pi and rq locks (but is still
> > executing, let's say it is doing put_task_struct())
> >  - try_to_wake_up goes ahead and calls select_task_rq_dl
> >    + it finds inactive_timer active
> >    + sub_running_bw again :(
> Uhm... Right; this can happen :(
> 

:(

> Ok; I'll think about some possible solution for this race... If I do
> not find any simple way to solve it, I'll add a "contending" flag,
> which allows to know if the inactive timer handler already executed or
> not.
> 

Right, this might help.

Another thing that I was thinking of is whether we can use the return
value of hrtimer_try_to_cancel() to decide what to do (see the sketch
after this list):

 - if it returns 0 it means that the callback executed or the timer was
   never set, so nothing to do (as in nothing to sub_running_bw from)
 - if it returns 1 we succeeded, so we need to actively sub_running_bw
 - if -1 we can assume that it will eventually do sub_running_bw() so we
   don't need to care explicitly
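
Something like this, just to sketch the idea (assuming it replaces the
hrtimer_active() check, and that the reference taken when arming the
timer has to be dropped when we cancel it):

	switch (hrtimer_try_to_cancel(&p->dl.inactive_timer)) {
	case 1:		/* timer was queued and we cancelled it: do its work */
		sub_running_bw(&p->dl, &rq->dl);
		put_task_struct(p);
		break;
	case 0:		/* never armed, or the callback already completed */
	case -1:	/* callback running: it will do sub_running_bw() itself */
		break;
	}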

Now I guess the problem is that the task can be migrated while its
inactive_timer is set (by select_task_rq_dl or by other classes' load
balancing if setscheduled to a different class). Can't we store a back
reference to the rq from which the inactive_timer was queued and use
that to sub_running_bw() from? It seems that we might end up with some
"shadow" bandwidth, say when we do a wakeup migration, but maybe this is
something we can tolerate? Just thinking aloud. :)

> BTW, talking about sched_dl_entity flags: I see there are three
> different int fields "dl_throttled, "dl_boosted" and "dl_yielded"; any
> reason for doing this instead of having a "dl_flags" field and setting
> its different bits when the entity is throttled, boosted or yielded? In
> other words: if I need this "contending" flag, should I add a new
> "dl_contending" field?
> 

I think you might want to add a clean-up patch to your series (or a
separate one) fixing the current situation, and the build on to adding
the new flag if needed.

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 2/6] Improve the tracking of active utilisation
  2016-11-10 12:34               ` Juri Lelli
@ 2016-11-10 12:45                 ` luca abeni
  0 siblings, 0 replies; 45+ messages in thread
From: luca abeni @ 2016-11-10 12:45 UTC (permalink / raw)
  To: Juri Lelli
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Claudio Scordino,
	Steven Rostedt

On Thu, 10 Nov 2016 12:34:15 +0000
Juri Lelli <juri.lelli@arm.com> wrote:
[...]
> > Ok; I'll think about some possible solution for this race... If I do
> > not find any simple way to solve it, I'll add a "contending" flag,
> > which allows to know if the inactive timer handler already executed
> > or not.
> >   
> 
> Right, this might help.
> 
> Another thing that I was thinking of is whether we can use the return
> value of hrtimer_try_to_cancel() to decide what to do:
> 
>  - if it returns 0 it means that the callback executed or the timer
> was never set, so nothing to do (as in nothing to sub_running_bw from)
>  - if it returns 1 we succeeded, so we need to actively sub_running_bw
>  - if -1 we can assume that it will eventually do sub_running_bw() so
> we don't need to care explicitly
This is about what I did in the initial version of the patch, but I
found some problems... I'll try to have another look at this approach.


> Now I guess the problem is that the task can be migrated while its
> inactive_timer is set (by select_task_rq_dl or by other classes' load
> balancing if setscheduled to a different class). Can't we store a back
> reference to the rq from which the inactive_timer was queued and use
> that to sub_running_bw() from? It seems that we might end up with some
> "shadow" bandwidth, say when we do a wakeup migration, but maybe this
> is something we can tolerate? Just thinking aloud. :)
I'll think about this...


> > BTW, talking about sched_dl_entity flags: I see there are three
> > different int fields "dl_throttled", "dl_boosted" and "dl_yielded";
> > any reason for doing this instead of having a "dl_flags" field and
> > setting its different bits when the entity is throttled, boosted or
> > yielded? In other words: if I need this "contending" flag, should I
> > add a new "dl_contending" field?
> >   
> 
> I think you might want to add a clean-up patch to your series (or a
> separate one) fixing the current situation, and the build on to adding
> the new flag if needed.
Ok, can do this. I just wanted to know if there is a reason for having
different "dl_*" fields instead of a "flags" field (I do not know,
maybe performance?) or if it is just an historical accident and this
can be changed.



			Thanks,
				Luca

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 1/6] Track the active utilisation
  2016-10-24 14:06 ` [RFC v3 1/6] Track the active utilisation Luca Abeni
  2016-10-25  9:09   ` Daniel Bristot de Oliveira
  2016-11-01 16:45   ` Juri Lelli
@ 2016-11-18 13:55   ` Peter Zijlstra
  2016-11-18 15:06     ` luca abeni
  2 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2016-11-18 13:55 UTC (permalink / raw)
  To: Luca Abeni
  Cc: linux-kernel, Ingo Molnar, Juri Lelli, Claudio Scordino, Steven Rostedt

On Mon, Oct 24, 2016 at 04:06:33PM +0200, Luca Abeni wrote:

> @@ -498,6 +514,8 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>  	struct rq *rq = rq_of_dl_rq(dl_rq);
>  
> +	add_running_bw(dl_se, dl_rq);
> +
>  	if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
>  	    dl_entity_overflow(dl_se, pi_se, rq_clock(rq))) {
>  		dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
> @@ -947,14 +965,19 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
>  		return;
>  	}
>  
> +	if (p->on_rq == TASK_ON_RQ_MIGRATING)
> +		add_running_bw(&p->dl, &rq->dl);
> +
>  	/*
>  	 * If p is throttled, we do nothing. In fact, if it exhausted
>  	 * its budget it needs a replenishment and, since it now is on
>  	 * its rq, the bandwidth timer callback (which clearly has not
>  	 * run yet) will take care of this.
>  	 */
> -	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH))
> +	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
> +		add_running_bw(&p->dl, &rq->dl);
>  		return;
> +	}
>  
>  	enqueue_dl_entity(&p->dl, pi_se, flags);
>  

I realize the enqueue path is a bit of a maze, but this hurts my head.

Isn't there anything we can do to streamline this a bit?

Maybe move the add_running_bw() from update_dl_entity() to the
ENQUEUE_WAKEUP branch in enqueue_dl_entity()? Because that's what you
really want, isn't it? It's not actually related to recomputing the
absolute deadline.
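
Something like this (just to illustrate the placement, assuming the
current shape of enqueue_dl_entity()):

	static void
	enqueue_dl_entity(struct sched_dl_entity *dl_se,
			  struct sched_dl_entity *pi_se, int flags)
	{
		BUG_ON(on_dl_rq(dl_se));

		if (flags & ENQUEUE_WAKEUP) {
			add_running_bw(dl_se, dl_rq_of_se(dl_se));	/* moved here */
			update_dl_entity(dl_se, pi_se);
		} else if (flags & ENQUEUE_REPLENISH) {
			replenish_dl_entity(dl_se, pi_se);
		}

		__enqueue_dl_entity(dl_se);
	}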

> @@ -972,6 +995,12 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
>  {
>  	update_curr_dl(rq);
>  	__dequeue_task_dl(rq, p, flags);
> +
> +	if (p->on_rq == TASK_ON_RQ_MIGRATING)
> +		sub_running_bw(&p->dl, &rq->dl);
> +
> +	if (flags & DEQUEUE_SLEEP)
> +		sub_running_bw(&p->dl, &rq->dl);
>  }

We could look at adding more enqueue/dequeue flags to streamline this a
bit, but let's not do that now.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 1/6] Track the active utilisation
  2016-10-25 13:58       ` Steven Rostedt
  2016-10-25 18:04         ` Luca Abeni
@ 2016-11-18 14:23         ` Peter Zijlstra
  2016-11-18 15:10           ` luca abeni
                             ` (2 more replies)
  1 sibling, 3 replies; 45+ messages in thread
From: Peter Zijlstra @ 2016-11-18 14:23 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: luca abeni, Daniel Bristot de Oliveira, linux-kernel,
	Ingo Molnar, Juri Lelli, Claudio Scordino

On Tue, Oct 25, 2016 at 09:58:11AM -0400, Steven Rostedt wrote:

> I agree with Daniel, especially since I don't usually trust the
> compiler. And the added variable is more of a distraction as it doesn't
> seem to have any real purpose.

I don't think there's anything here to trust the compiler on. Either
it inlines or it doesn't, it should generate 'correct' code either way.

If it doesn't inline, it's a dumb compiler and it will make these dumb
decisions throughout the tree and your kernel will be slow, not my
problem really ;-)

That said, I agree that the single line thing is actually easier to
read.

That said; there's something to be said for:

	u64 running_bw;

static void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
{
	u64 old = dl_rq->running_bw;

	dl_rq->running_bw += dl_se->dl_bw;
	SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
}

static void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
{
	u64 old = dl_rq->running_bw;

	dl_rq->running_bw -= dl_se->dl_bw;
	SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
}

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 1/6] Track the active utilisation
  2016-11-08 17:56       ` Juri Lelli
  2016-11-08 18:17         ` Luca Abeni
  2016-11-09 16:29         ` luca abeni
@ 2016-11-18 14:55         ` Peter Zijlstra
  2 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2016-11-18 14:55 UTC (permalink / raw)
  To: Juri Lelli
  Cc: luca abeni, linux-kernel, Ingo Molnar, Claudio Scordino, Steven Rostedt

On Tue, Nov 08, 2016 at 05:56:35PM +0000, Juri Lelli wrote:
> Mmm. You explicitly check that TASK_ON_RQ_MIGRATING or DEQUEUE_SLEEP
> (which btw can be actually put together with an or condition), so I
> don't think that any of those turn out to be true when the task dies.
> Also, AFAIU, do_exit() works on current and the TASK_DEAD case is
> handled in finish_task_switch(), so I don't think we are taking care of
> the "task is dying" condition.
> 
> Peter, does what I'm saying make any sense? :)

do_task_dead():
	__set_current_state(TASK_DEAD);
	schedule():
		if (prev->state)
			deactivate_task(DEQUEUE_SLEEP);

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 1/6] Track the active utilisation
  2016-11-18 13:55   ` Peter Zijlstra
@ 2016-11-18 15:06     ` luca abeni
  0 siblings, 0 replies; 45+ messages in thread
From: luca abeni @ 2016-11-18 15:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Juri Lelli, Claudio Scordino, Steven Rostedt

Hi Peter,

On Fri, 18 Nov 2016 14:55:54 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Oct 24, 2016 at 04:06:33PM +0200, Luca Abeni wrote:
> 
> > @@ -498,6 +514,8 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
> >  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> >  	struct rq *rq = rq_of_dl_rq(dl_rq);
> >  
> > +	add_running_bw(dl_se, dl_rq);
> > +
> >  	if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
> >  	    dl_entity_overflow(dl_se, pi_se, rq_clock(rq))) {
> >  		dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
> > @@ -947,14 +965,19 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> >  		return;
> >  	}
> >  
> > +	if (p->on_rq == TASK_ON_RQ_MIGRATING)
> > +		add_running_bw(&p->dl, &rq->dl);
> > +
> >  	/*
> >  	 * If p is throttled, we do nothing. In fact, if it exhausted
> >  	 * its budget it needs a replenishment and, since it now is on
> >  	 * its rq, the bandwidth timer callback (which clearly has not
> >  	 * run yet) will take care of this.
> >  	 */
> > -	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH))
> > +	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) {
> > +		add_running_bw(&p->dl, &rq->dl);
> >  		return;
> > +	}
> >  
> >  	enqueue_dl_entity(&p->dl, pi_se, flags);
> >    
> 
> I realize the enqueue path is a bit of a maze, but this hurts my head.
> 
> Isn't there anything we can do to streamline this a bit?
> 
> Maybe move the add_running_bw() from update_dl_entity() to the
> ENQUEUE_WAKEUP branch in enqueue_dl_entity()? Because that's what you
> really want, isn't it? Its not actually related to recomputing the
> absolute deadline.

Right... I'll change the code accordingly. I am currently reworking
this code (after Juri's review I realized that there is an issue - see
the last email exchange with Juri), so I'll make this change as soon as
the code stabilizes.
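
For reference, this is roughly what I understand the change to look
like (only a sketch on top of the current enqueue_dl_entity(); the
exact placement still has to survive the fix mentioned above):

static void
enqueue_dl_entity(struct sched_dl_entity *dl_se,
		  struct sched_dl_entity *pi_se, int flags)
{
	BUG_ON(on_dl_rq(dl_se));

	if (flags & ENQUEUE_WAKEUP) {
		/*
		 * The entity becomes "active" on wakeup: account its
		 * bandwidth here instead of in update_dl_entity().
		 */
		add_running_bw(dl_se, dl_rq_of_se(dl_se));
		update_dl_entity(dl_se, pi_se);
	} else if (flags & ENQUEUE_REPLENISH) {
		replenish_dl_entity(dl_se, pi_se);
	}

	__enqueue_dl_entity(dl_se);
}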


		Thanks,
			Luca

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 1/6] Track the active utilisation
  2016-11-18 14:23         ` Peter Zijlstra
@ 2016-11-18 15:10           ` luca abeni
  2016-11-18 15:28             ` Peter Zijlstra
  2016-11-18 16:42           ` Steven Rostedt
  2016-12-05 22:30           ` luca abeni
  2 siblings, 1 reply; 45+ messages in thread
From: luca abeni @ 2016-11-18 15:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Daniel Bristot de Oliveira, linux-kernel,
	Ingo Molnar, Juri Lelli, Claudio Scordino

On Fri, 18 Nov 2016 15:23:59 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Oct 25, 2016 at 09:58:11AM -0400, Steven Rostedt wrote:
> 
> > I agree with Daniel, especially since I don't usually trust the
> > compiler. And the added variable is more of a distraction as it
> > doesn't seem to have any real purpose.  
> 
> I don't think there's anything here to trust the compiler on. Either
> it inlines or it doesn't, it should generate 'correct' code either
> way.
> 
> If it doesn't inline, it's a dumb compiler and it will make these dumb
> decisions throughout the tree and your kernel will be slow, not my
> problem really ;-)
> 
> That said, I agree that the single line thing is actually easier to
> read.

Ok; I already made that change locally.


> 
> That said; there's something to be said for:
> 
> 	u64 running_bw;

Ok; I originally made it signed because I wanted the "running_bw < 0"
check, but I can change it to "dl_rq->running_bw > old" (I did not
think about it).

I'll make these changes locally ASAP.


			Thanks,
				Luca

> 
> static void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> {
> 	u64 old = dl_rq->running_bw;
> 
> 	dl_rq->running_bw += dl_se->dl_bw;
> 	SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
> }
> 
> static void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> {
> 	u64 old = dl_rq->running_bw;
> 
> 	dl_rq->running_bw -= dl_se->dl_bw;
> 	SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
> }
> 
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 1/6] Track the active utilisation
  2016-11-18 15:10           ` luca abeni
@ 2016-11-18 15:28             ` Peter Zijlstra
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2016-11-18 15:28 UTC (permalink / raw)
  To: luca abeni
  Cc: Steven Rostedt, Daniel Bristot de Oliveira, linux-kernel,
	Ingo Molnar, Juri Lelli, Claudio Scordino

On Fri, Nov 18, 2016 at 04:10:17PM +0100, luca abeni wrote:

> Ok; I originally made it signed because I wanted the "running_bw < 0"
> check, but I can change it to "dl_rq->running_bw > old" (I did not
> think about it).

Happy accident of me staring at unsigned over/under-flow for the past
several days :-)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 2/6] Improve the tracking of active utilisation
  2016-10-24 14:06 ` [RFC v3 2/6] Improve the tracking of " Luca Abeni
  2016-11-01 16:46   ` Juri Lelli
  2016-11-02  2:41   ` luca abeni
@ 2016-11-18 15:36   ` Peter Zijlstra
  2016-11-18 15:56     ` luca abeni
  2016-11-18 15:47   ` Peter Zijlstra
  3 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2016-11-18 15:36 UTC (permalink / raw)
  To: Luca Abeni
  Cc: linux-kernel, Ingo Molnar, Juri Lelli, Claudio Scordino, Steven Rostedt

On Mon, Oct 24, 2016 at 04:06:34PM +0200, Luca Abeni wrote:
> @@ -514,7 +556,20 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>  	struct rq *rq = rq_of_dl_rq(dl_rq);
>  
> +	if (hrtimer_is_queued(&dl_se->inactive_timer)) {
> +		hrtimer_try_to_cancel(&dl_se->inactive_timer);
> +		WARN_ON(dl_task_of(dl_se)->nr_cpus_allowed > 1);

Isn't that always so? That is, DL tasks cannot be but 'global', right?

Also, you could use the return value of hrtimer_try_to_cancel() to
determine hrtimer_is_queued() I suppose.

> +	} else {
> +		/*
> +		 * The "inactive timer" has been cancelled in
> +		 * select_task_rq_dl() (and the acvive utilisation has
> +		 * been decreased). So, increase the active utilisation.
> +		 * If select_task_rq_dl() could not cancel the timer,
> +		 * inactive_task_timer() will * find the task state as
                                             ^^^
superfluous '*'?

> +		 * TASK_RUNNING, and will do nothing, so we are still safe.
> +		 */
> +		add_running_bw(dl_se, dl_rq);
> +	}
>  
>  	if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
>  	    dl_entity_overflow(dl_se, pi_se, rq_clock(rq))) {

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 2/6] Improve the tracking of active utilisation
  2016-10-24 14:06 ` [RFC v3 2/6] Improve the tracking of " Luca Abeni
                     ` (2 preceding siblings ...)
  2016-11-18 15:36   ` Peter Zijlstra
@ 2016-11-18 15:47   ` Peter Zijlstra
  2016-11-18 16:06     ` luca abeni
  3 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2016-11-18 15:47 UTC (permalink / raw)
  To: Luca Abeni
  Cc: linux-kernel, Ingo Molnar, Juri Lelli, Claudio Scordino, Steven Rostedt

On Mon, Oct 24, 2016 at 04:06:34PM +0200, Luca Abeni wrote:
> @@ -1074,6 +1161,14 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
>  	}
>  	rcu_read_unlock();
>  
> +	rq = task_rq(p);
> +	raw_spin_lock(&rq->lock);
> +	if (hrtimer_active(&p->dl.inactive_timer)) {
> +		sub_running_bw(&p->dl, &rq->dl);
> +		hrtimer_try_to_cancel(&p->dl.inactive_timer);
> +	}
> +	raw_spin_unlock(&rq->lock);

It's a bit sad having to take rq->lock here...

Also, what happens when hrtimer_try_to_cancel() fails?

> +
>  out:
>  	return cpu;
>  }

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 2/6] Improve the tracking of active utilisation
  2016-11-18 15:36   ` Peter Zijlstra
@ 2016-11-18 15:56     ` luca abeni
  0 siblings, 0 replies; 45+ messages in thread
From: luca abeni @ 2016-11-18 15:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Juri Lelli, Claudio Scordino, Steven Rostedt

On Fri, 18 Nov 2016 16:36:15 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Oct 24, 2016 at 04:06:34PM +0200, Luca Abeni wrote:
> > @@ -514,7 +556,20 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
> >  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> >  	struct rq *rq = rq_of_dl_rq(dl_rq);
> >  
> > +	if (hrtimer_is_queued(&dl_se->inactive_timer)) {
> > +		hrtimer_try_to_cancel(&dl_se->inactive_timer);
> > +		WARN_ON(dl_task_of(dl_se)->nr_cpus_allowed > 1);  
> 
> Isn't that always so? That is, DL tasks cannot be but 'global', right?
Well, if I understand correctly, in general (that is, if admission
control is enabled) nr_cpus_allowed is equal to the number of CPUs in
the cpuset...
This is generally > 1 (and in this case select_task_rq_dl() is invoked
first, and tries to cancel the timer - so I think the timer cannot be
queued), or can be = 1 if we do partitioned scheduling (cpusets
containing only 1 CPU, or disabled admission control). If
nr_cpus_allowed is 1, then select_task_rq_dl() is not invoked, so the
timer can be queued.
In some of my tests I used partitioned scheduling; in some other tests
I disabled admission control to mix tasks with different affinities, so
I made the warning conditional on the number of CPUs being > 1.


> Also, you could use the return value of hrtimer_try_to_cancel() to
> determine hrtimer_is_queued() I suppose.

Ah, ok... I was under the impression that
"if (hrtimer_is_queued()) hrtimer_try_to_cancel()"
has less overhead than a plain "hrtimer_try_to_cancel()", but this was
just an uneducated guess... I'll change the code to avoid the check on
hrtimer_is_queued().
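
Something like this, I guess (just a sketch; it relies on
hrtimer_try_to_cancel() returning 1 only when it cancelled a queued
timer, 0 when the timer was not active, and -1 when the callback is
currently running):

	if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1)
		WARN_ON(dl_task_of(dl_se)->nr_cpus_allowed > 1);
	else
		add_running_bw(dl_se, dl_rq);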

> 
> > +	} else {
> > +		/*
> > +		 * The "inactive timer" has been cancelled in
> > +		 * select_task_rq_dl() (and the acvive utilisation
> > has
> > +		 * been decreased). So, increase the active
> > utilisation.
> > +		 * If select_task_rq_dl() could not cancel the
> > timer,
> > +		 * inactive_task_timer() will * find the task
> > state as  
>                                              ^^^
> superfluous '*'?

Yes, sorry... Something went wrong when I re-indented the comment :(


			Thanks,
				Luca

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 2/6] Improve the tracking of active utilisation
  2016-11-18 15:47   ` Peter Zijlstra
@ 2016-11-18 16:06     ` luca abeni
  2016-11-18 18:49       ` Peter Zijlstra
  0 siblings, 1 reply; 45+ messages in thread
From: luca abeni @ 2016-11-18 16:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Juri Lelli, Claudio Scordino, Steven Rostedt

On Fri, 18 Nov 2016 16:47:48 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Oct 24, 2016 at 04:06:34PM +0200, Luca Abeni wrote:
> > @@ -1074,6 +1161,14 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> >  	}
> >  	rcu_read_unlock();
> >  
> > +	rq = task_rq(p);
> > +	raw_spin_lock(&rq->lock);
> > +	if (hrtimer_active(&p->dl.inactive_timer)) {
> > +		sub_running_bw(&p->dl, &rq->dl);
> > +		hrtimer_try_to_cancel(&p->dl.inactive_timer);
> > +	}
> > +	raw_spin_unlock(&rq->lock);  
> 
> It's a bit sad having to take rq->lock here...

I think I can move the locking inside the if() (so that rq->lock is not
taken if the inactive timer is not active); apart from this, the only
solution I can think of is to modify select_task_rq_dl() not to
change the cpu if the timer is active... (I think the task will be
migrated by a following push() if needed). What do you think? Any other
solution I am not seeing?


> Also, what happens when hrtimer_try_to_cancel() fails?

This is something I am working on... My original idea was that nothing
bad happens, because the timer handler will see the task as RUNNING and
will not decrease the running bw... But this is wrong.
My new idea is to add a "dl_contending" flag in the scheduling entity,
that indicates if the running bw has already been subtracted or not.
With this, the issue should be solved (if anyone sees additional
issues, or a better solution that does not require an additional flag,
let me know).

BTW, this code is also missing a put_task_struct() for the case in which
hrtimer_try_to_cancel() does not fail :(
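
Roughly, this is what I have in mind for select_task_rq_dl() (only a
sketch; the flag name, its exact semantics and the locking are not
final yet):

	rq = task_rq(p);
	raw_spin_lock(&rq->lock);
	if (p->dl.dl_contending) {
		/* The bandwidth is still accounted on the old rq. */
		sub_running_bw(&p->dl, &rq->dl);
		p->dl.dl_contending = 0;
		/*
		 * If we cancelled a queued inactive timer, drop the
		 * reference taken when it was armed; if the cancel
		 * fails, inactive_task_timer() will see the cleared
		 * flag and only drop its own reference.
		 */
		if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
			put_task_struct(p);
	}
	raw_spin_unlock(&rq->lock);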


			Thanks,
				Luca
> 
> > +
> >  out:
> >  	return cpu;
> >  }  

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 1/6] Track the active utilisation
  2016-11-18 14:23         ` Peter Zijlstra
  2016-11-18 15:10           ` luca abeni
@ 2016-11-18 16:42           ` Steven Rostedt
  2016-12-05 22:30           ` luca abeni
  2 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2016-11-18 16:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: luca abeni, Daniel Bristot de Oliveira, linux-kernel,
	Ingo Molnar, Juri Lelli, Claudio Scordino

On Fri, 18 Nov 2016 15:23:59 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

 
> That said; there's something to be said for:
> 
> 	u64 running_bw;
> 
> static void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> {
> 	u64 old = dl_rq->running_bw;
> 
> 	dl_rq->running_bw += dl_se->dl_bw;
> 	SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
> }
> 
> static void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> {
> 	u64 old = dl_rq->running_bw;
> 
> 	dl_rq->running_bw -= dl_se->dl_bw;
> 	SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
> }
> 

Acked-by: me

-- Steve

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 2/6] Improve the tracking of active utilisation
  2016-11-18 16:06     ` luca abeni
@ 2016-11-18 18:49       ` Peter Zijlstra
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2016-11-18 18:49 UTC (permalink / raw)
  To: luca abeni
  Cc: linux-kernel, Ingo Molnar, Juri Lelli, Claudio Scordino, Steven Rostedt

On Fri, Nov 18, 2016 at 05:06:50PM +0100, luca abeni wrote:
> 
> > Also, what happens when hrtimer_try_to_cancel() fails?
> 
> This is something I am working on... My original idea was that nothing
> bad happens, because the timer handler will see the task as RUNNING and
> will not decrease the running bw... But this is wrong.
> My new idea is to add a "dl_contending" flag in the scheduling entity,
> that indicates if the running bw has already been subtracted or not.
> With this, the issue should be solved (if anyone sees additional
> issues, or a better solution that does not require an additional flag,
> let me know).

Right. My suggestion would be to make it obvious, use that flag if
that's what it takes. We can always try and be clever later.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 1/6] Track the active utilisation
  2016-11-18 14:23         ` Peter Zijlstra
  2016-11-18 15:10           ` luca abeni
  2016-11-18 16:42           ` Steven Rostedt
@ 2016-12-05 22:30           ` luca abeni
  2016-12-06  8:35             ` Peter Zijlstra
  2 siblings, 1 reply; 45+ messages in thread
From: luca abeni @ 2016-12-05 22:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Daniel Bristot de Oliveira, linux-kernel,
	Ingo Molnar, Juri Lelli, Claudio Scordino

Hi Peter,

On Fri, 18 Nov 2016 15:23:59 +0100
Peter Zijlstra <peterz@infradead.org> wrote:
[...]
> 	u64 running_bw;
> 
> static void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> {
> 	u64 old = dl_rq->running_bw;
> 
> 	dl_rq->running_bw += dl_se->dl_bw;
> 	SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
> }
> 
> static void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> {
> 	u64 old = dl_rq->running_bw;
> 
> 	dl_rq->running_bw -= dl_se->dl_bw;
> 	SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
> }

I wanted to change "SCHED_WARN_ON(dl_rq->running_bw > old); /*
underflow */" into "if (SCHED_WARN_ON(...)) dl_rq->running_bw = 0" (to
avoid using nonsensical "running_bw" values), but I see that
"SCHED_WARN_ON()" cannot be used inside an if (this seems to be a
difference with respect to "SCHED_WARN()").
This is because of the definition used when CONFIG_SCHED_DEBUG is not
defined (I noticed the issue when testing with random kernel
configurations).

Is this expected? If yes, what should I do in this case? Something like
	SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
	if (dl_rq->running_bw > old)
		dl_rq->running_bw = 0;
?
Or something else?


			Thanks,
				Luca

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 1/6] Track the active utilisation
  2016-12-05 22:30           ` luca abeni
@ 2016-12-06  8:35             ` Peter Zijlstra
  2016-12-06  8:57               ` luca abeni
  2016-12-06 13:47               ` luca abeni
  0 siblings, 2 replies; 45+ messages in thread
From: Peter Zijlstra @ 2016-12-06  8:35 UTC (permalink / raw)
  To: luca abeni
  Cc: Steven Rostedt, Daniel Bristot de Oliveira, linux-kernel,
	Ingo Molnar, Juri Lelli, Claudio Scordino

On Mon, Dec 05, 2016 at 11:30:05PM +0100, luca abeni wrote:
> Hi Peter,
> 
> On Fri, 18 Nov 2016 15:23:59 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> [...]
> > 	u64 running_bw;
> > 
> > static void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> > {
> > 	u64 old = dl_rq->running_bw;
> > 
> > 	dl_rq->running_bw += dl_se->dl_bw;
> > 	SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
> > }
> > 
> > static void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> > {
> > 	u64 old = dl_rq->running_bw;
> > 
> > 	dl_rq->running_bw -= dl_se->dl_bw;
> > 	SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
> > }
> 
> I wanted to change "SCHED_WARN_ON(dl_rq->running_bw > old); /*
> underflow */" into "if (SCHED_WARN_ON(...)) dl_rq->running_bw = 0" (to
> avoid using nonsensical "running_bw" values), but I see that
> "SCHED_WARN_ON()" cannot be used inside an if (this seems to be a
> difference with respect to "SCHED_WARN()").

There's a SCHED_WARN? Did you mean to say WARN_ON()?

And yes, mostly by accident I think, I'm not a big user of that pattern
and neglected it when I did SCHED_WARN_ON().

> This is because of the definition used when CONFIG_SCHED_DEBUG is not
> defined (I noticed the issue when testing with random kernel
> configurations).

I'm fine changing the definition, just find something that works. The
current ((void)(x)) thing was to avoid unused complaints -- although I'm
not sure there were any.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 1/6] Track the active utilisation
  2016-12-06  8:35             ` Peter Zijlstra
@ 2016-12-06  8:57               ` luca abeni
  2016-12-06 13:47               ` luca abeni
  1 sibling, 0 replies; 45+ messages in thread
From: luca abeni @ 2016-12-06  8:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Daniel Bristot de Oliveira, linux-kernel,
	Ingo Molnar, Juri Lelli, Claudio Scordino

On Tue, 6 Dec 2016 09:35:01 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Dec 05, 2016 at 11:30:05PM +0100, luca abeni wrote:
> > Hi Peter,
> > 
> > On Fri, 18 Nov 2016 15:23:59 +0100
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > [...]  
> > > 	u64 running_bw;
> > > 
> > > static void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> > > {
> > > 	u64 old = dl_rq->running_bw;
> > > 
> > > 	dl_rq->running_bw += dl_se->dl_bw;
> > > 	SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
> > > }
> > > 
> > > static void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> > > {
> > > 	u64 old = dl_rq->running_bw;
> > > 
> > > 	dl_rq->running_bw -= dl_se->dl_bw;
> > > 	SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
> > > }  
> > 
> > I wanted to change "SCHED_WARN_ON(dl_rq->running_bw > old); /*
> > underflow */" into "if (SCHED_WARN_ON(...)) dl_rq->running_bw =
> > 0" (to avoid using nonsensical "running_bw" values), but I see that
> > "SCHED_WARN_ON()" cannot be used inside an if (this seems to be a
> > difference with respect to "SCHED_WARN()").  
> 
> There's a SCHED_WARN? Did you mean to say WARN_ON()?

Sorry, I managed to confuse myself... I was thinking about WARN_ON()
(the one I used in the previous version of my patches).

> And yes, mostly by accident I think, I'm not a big user of that
> pattern and neglected it when I did SCHED_WARN_ON().

You mean the "if(WARN(...))" pattern? I think it was suggested in a
previous round of reviews.


> > This is because of the definition used when CONFIG_SCHED_DEBUG is
> > not defined (I noticed the issue when testing with random kernel
> > configurations).  
> 
> I'm fine changing the definition, just find something that works. The
> current ((void)(x)) thing was to avoid unused complaints -- although
> I'm not sure there were any.

Ok; I'll see if I manage to find a working definition.


			Thanks,
				Luca

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC v3 1/6] Track the active utilisation
  2016-12-06  8:35             ` Peter Zijlstra
  2016-12-06  8:57               ` luca abeni
@ 2016-12-06 13:47               ` luca abeni
  1 sibling, 0 replies; 45+ messages in thread
From: luca abeni @ 2016-12-06 13:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Daniel Bristot de Oliveira, linux-kernel,
	Ingo Molnar, Juri Lelli, Claudio Scordino

Hi Peter,

On Tue, 6 Dec 2016 09:35:01 +0100
Peter Zijlstra <peterz@infradead.org> wrote:
[...]
> > This is because of the definition used when CONFIG_SCHED_DEBUG is
> > not defined (I noticed the issue when testing with random kernel
> > configurations).  
> 
> I'm fine changing the definition, just find something that works. The
> current ((void)(x)) thing was to avoid unused complaints -- although
> I'm not sure there were any.

Below is what I came up with... It fixes the build and seems to work
fine, generating no warnings (I tested with gcc 5.4.0). To write this
patch, I re-used some code from include/asm-generic/bug.h, which has no
copyright header, so I just added my Signed-off-by (but I am not sure
if this is the correct way to go).
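
(The usage I am after is the one discussed earlier, e.g. in
sub_running_bw():

	u64 old = dl_rq->running_bw;

	dl_rq->running_bw -= dl_se->dl_bw;
	if (SCHED_WARN_ON(dl_rq->running_bw > old))	/* underflow */
		dl_rq->running_bw = 0;

and the patch below makes this build both with and without
CONFIG_SCHED_DEBUG.)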


				Luca



From 74e67d61c4b98c2498880932b953c65e9653c121 Mon Sep 17 00:00:00 2001
From: Luca Abeni <luca.abeni@unitn.it>
Date: Tue, 6 Dec 2016 10:02:28 +0100
Subject: [PATCH 7/7] sched.h: Improve SCHED_WARN_ON() when CONFIG_SCHED_DEBUG is not defined

With the current definition of SCHED_WARN_ON(), something like
	if (SCHED_WARN_ON(condition)) ...
fails with
	error: void value not ignored as it ought to be
	 #define SCHED_WARN_ON(x) ((void)(x))
	                          ^

This patch fixes the issue by using the same code used by WARN_ON().

Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
---
 kernel/sched/sched.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ef4bdaa..2e96aa4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -19,7 +19,10 @@
 #ifdef CONFIG_SCHED_DEBUG
 #define SCHED_WARN_ON(x)	WARN_ONCE(x, #x)
 #else
-#define SCHED_WARN_ON(x)	((void)(x))
+#define SCHED_WARN_ON(x)	({					\
+	int __ret_warn_on = !!(x);					\
+	unlikely(__ret_warn_on);					\
+})
 #endif
 
 struct rq;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2016-12-06 14:01 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-24 14:06 [RFC v3 0/6] CPU reclaiming for SCHED_DEADLINE Luca Abeni
2016-10-24 14:06 ` [RFC v3 1/6] Track the active utilisation Luca Abeni
2016-10-25  9:09   ` Daniel Bristot de Oliveira
2016-10-25  9:29     ` luca abeni
2016-10-25 13:58       ` Steven Rostedt
2016-10-25 18:04         ` Luca Abeni
2016-11-18 14:23         ` Peter Zijlstra
2016-11-18 15:10           ` luca abeni
2016-11-18 15:28             ` Peter Zijlstra
2016-11-18 16:42           ` Steven Rostedt
2016-12-05 22:30           ` luca abeni
2016-12-06  8:35             ` Peter Zijlstra
2016-12-06  8:57               ` luca abeni
2016-12-06 13:47               ` luca abeni
2016-11-01 16:45   ` Juri Lelli
2016-11-01 21:10     ` luca abeni
2016-11-08 17:56       ` Juri Lelli
2016-11-08 18:17         ` Luca Abeni
2016-11-08 18:53           ` Juri Lelli
2016-11-08 19:09             ` Luca Abeni
2016-11-08 20:02               ` Juri Lelli
2016-11-09 15:25                 ` luca abeni
2016-11-09 16:29         ` luca abeni
2016-11-18 14:55         ` Peter Zijlstra
2016-11-18 13:55   ` Peter Zijlstra
2016-11-18 15:06     ` luca abeni
2016-10-24 14:06 ` [RFC v3 2/6] Improve the tracking of " Luca Abeni
2016-11-01 16:46   ` Juri Lelli
2016-11-01 21:46     ` luca abeni
2016-11-02  2:35       ` luca abeni
2016-11-10 10:04         ` Juri Lelli
2016-11-10 11:56           ` Juri Lelli
2016-11-10 12:15             ` luca abeni
2016-11-10 12:34               ` Juri Lelli
2016-11-10 12:45                 ` luca abeni
2016-11-02  2:41   ` luca abeni
2016-11-18 15:36   ` Peter Zijlstra
2016-11-18 15:56     ` luca abeni
2016-11-18 15:47   ` Peter Zijlstra
2016-11-18 16:06     ` luca abeni
2016-11-18 18:49       ` Peter Zijlstra
2016-10-24 14:06 ` [RFC v3 3/6] Fix the update of the total -deadline utilization Luca Abeni
2016-10-24 14:06 ` [RFC v3 4/6] GRUB accounting Luca Abeni
2016-10-24 14:06 ` [RFC v3 5/6] Do not reclaim the whole CPU bandwidth Luca Abeni
2016-10-24 14:06 ` [RFC v3 6/6] Make GRUB a task's flag Luca Abeni

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).