* [PATCH 0/2] sched/deadline: fix cpusets bandwidth accounting
@ 2016-02-08 12:45 Juri Lelli
  2016-02-08 12:45 ` [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth Juri Lelli
  2016-02-08 12:45 ` [PATCH 2/2] sched/deadline: rq_{online,offline}_dl for root_domain changes Juri Lelli
  0 siblings, 2 replies; 46+ messages in thread
From: Juri Lelli @ 2016-02-08 12:45 UTC (permalink / raw)
  To: rostedt
  Cc: linux-kernel, peterz, mingo, luca.abeni, vincent.guittot,
	wanpeng.li, juri.lelli

Steven Rostedt provided a detailed bug report [1] highlighting the fact that we
lose bandwidth accounting information across scheduling domain changes. This
happens when we make changes to cpusets or when hotplug comes into play (as
already pointed out in the past by Wanpeng Li). The problem is that these
operations are destructive w.r.t. the root_domain data structure(s), which is
where we store information about the amount of bandwidth that has been admitted.

This set proposes to fix that by relying on the rq_{online,offline} calls,
which IIUC were introduced exactly to cope with this kind of problem.

Patches description:

 o 01/02 introduces per-rq admitted bandwidth accounting (I stole this from
         Luca with his approval, but I changed it a bit; I kept attribution in
         the patch, but everything that I could have broken is on me; also,
         Vincent posted something similar as part of the sched-freq story)
 o 02/02 uses information provided by 01/02 to save/restore bandwidth
         information in root_domain(s)

Steven already provided instructions on how to reproduce the problem and test
the proposed fix in [1]. I updated my tests accordingly:

 https://github.com/jlelli/tests

This set is based on tip/sched/core as of today. I pushed the set to a branch
together with Steve's sched_debug enhancements to print root_domain bandwidth,
plus some trace_printks that should help track what these changes are doing:

 git://linux-arm.org/linux-jl.git upstream/fixes/dl_rootdomain_account-v1

Changelogs can be improved and the patches require more comments, but I wanted
to send this out early to understand whether we are really fixing the problem
and to see if people think this approach could fly. Also, I expect that there
are still corner cases that we don't cover with the bandwidth tracking.

Testing and feedback is more than welcome.

Best,

- Juri

[1] https://lkml.org/lkml/2016/2/3/966

Juri Lelli (2):
  sched/deadline: add per rq tracking of admitted bandwidth
  sched/deadline: rq_{online,offline}_dl for root_domain changes

 kernel/sched/core.c     |  2 ++
 kernel/sched/deadline.c | 20 ++++++++++++++++++++
 kernel/sched/sched.h    | 22 ++++++++++++++++++++++
 3 files changed, 44 insertions(+)

-- 
2.7.0

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-08 12:45 [PATCH 0/2] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
@ 2016-02-08 12:45 ` Juri Lelli
  2016-02-10 11:32   ` Juri Lelli
  2016-02-08 12:45 ` [PATCH 2/2] sched/deadline: rq_{online,offline}_dl for root_domain changes Juri Lelli
  1 sibling, 1 reply; 46+ messages in thread
From: Juri Lelli @ 2016-02-08 12:45 UTC (permalink / raw)
  To: rostedt
  Cc: linux-kernel, peterz, mingo, luca.abeni, vincent.guittot,
	wanpeng.li, juri.lelli

Currently the SCHED_DEADLINE scheduling policy tracks the bandwidth of tasks
that passed admission control at root_domain level only. This creates
problems when such data structure(s) are destroyed, for example when we
reconfigure scheduling domains.

This is part one of two changes required to fix the problem. In this
patch we add per-rq tracking of admitted bandwidth. Tasks bring with
them their bandwidth contribution when they enter the system and are
enqueued for the first time. Contributions are then moved around when
migrations happen and removed when tasks die.

Per-rq admitted bandwidth information will be leveraged in the next
commit to save/restore per-rq bandwidth contribution towards the root
domain (using the rq_{online,offline} mechanism).

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reported-by: Wanpeng Li <wanpeng.li@hotmail.com>
Reported-by: Steven Rostedt <rostedt@goodmis.org>
[ original patch by ]
Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
---
 kernel/sched/core.c     |  2 ++
 kernel/sched/deadline.c | 18 ++++++++++++++++++
 kernel/sched/sched.h    | 22 ++++++++++++++++++++++
 3 files changed, 42 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 24fcdbf..706ca23 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2449,7 +2449,9 @@ static int dl_overflow(struct task_struct *p, int policy,
 	} else if (dl_policy(policy) && task_has_dl_policy(p) &&
 		   !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
 		__dl_clear(dl_b, p->dl.dl_bw);
+		__dl_sub_ac(task_rq(p), p->dl.dl_bw);
 		__dl_add(dl_b, new_bw);
+		__dl_add_ac(task_rq(p), new_bw);
 		err = 0;
 	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
 		__dl_clear(dl_b, p->dl.dl_bw);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index cd64c97..2480cab 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -83,6 +83,7 @@ void init_dl_rq(struct dl_rq *dl_rq)
 #else
 	init_dl_bw(&dl_rq->dl_bw);
 #endif
+	dl_rq->ac_bw = 0;
 }
 
 #ifdef CONFIG_SMP
@@ -278,8 +279,10 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
 	 * By now the task is replenished and enqueued; migrate it.
 	 */
 	deactivate_task(rq, p, 0);
+	__dl_sub_ac(rq, p->dl.dl_bw);
 	set_task_cpu(p, later_rq->cpu);
 	activate_task(later_rq, p, 0);
+	__dl_add_ac(later_rq, p->dl.dl_bw);
 
 	if (!fallback)
 		resched_curr(later_rq);
@@ -506,6 +509,7 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
 	 */
 	if (dl_se->dl_new) {
 		setup_new_dl_entity(dl_se, pi_se);
+		__dl_add_ac(rq, dl_se->dl_bw);
 		return;
 	}
 
@@ -955,6 +959,9 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 		return;
 	}
 
+	if (p->on_rq == TASK_ON_RQ_MIGRATING)
+		__dl_add_ac(rq, p->dl.dl_bw);
+
 	/*
 	 * If p is throttled, we do nothing. In fact, if it exhausted
 	 * its budget it needs a replenishment and, since it now is on
@@ -980,6 +987,8 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	update_curr_dl(rq);
 	__dequeue_task_dl(rq, p, flags);
+	if (p->on_rq == TASK_ON_RQ_MIGRATING)
+		__dl_sub_ac(rq, p->dl.dl_bw);
 }
 
 /*
@@ -1219,6 +1228,8 @@ static void task_dead_dl(struct task_struct *p)
 {
 	struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
 
+	__dl_sub_ac(task_rq(p), p->dl.dl_bw);
+
 	/*
 	 * Since we are TASK_DEAD we won't slip out of the domain!
 	 */
@@ -1511,8 +1522,10 @@ retry:
 	}
 
 	deactivate_task(rq, next_task, 0);
+	__dl_sub_ac(rq, next_task->dl.dl_bw);
 	set_task_cpu(next_task, later_rq->cpu);
 	activate_task(later_rq, next_task, 0);
+	__dl_add_ac(later_rq, next_task->dl.dl_bw);
 	ret = 1;
 
 	resched_curr(later_rq);
@@ -1599,8 +1612,10 @@ static void pull_dl_task(struct rq *this_rq)
 			resched = true;
 
 			deactivate_task(src_rq, p, 0);
+			__dl_sub_ac(src_rq, p->dl.dl_bw);
 			set_task_cpu(p, this_cpu);
 			activate_task(this_rq, p, 0);
+			__dl_add_ac(this_rq, p->dl.dl_bw);
 			dmin = p->dl.deadline;
 
 			/* Is there any other task even earlier? */
@@ -1705,6 +1720,9 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
 	if (!start_dl_timer(p))
 		__dl_clear_params(p);
 
+	if (dl_prio(p->normal_prio))
+		__dl_sub_ac(rq, p->dl.dl_bw);
+
 	/*
 	 * Since this might be the only -deadline task on the rq,
 	 * this is the right place to try to pull some other one
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 10f1637..e754680 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -519,6 +519,14 @@ struct dl_rq {
 #else
 	struct dl_bw dl_bw;
 #endif
+
+	/*
+	 * ac_bw keeps track of per-rq admitted bandwidth. It only changes
+	 * when a new task is admitted, dies, changes scheduling policy or
+	 * is migrated to another rq. It is used to correctly save/restore
+	 * total_bw on root_domain changes.
+	 */
+	u64 ac_bw;
 };
 
 #ifdef CONFIG_SMP
@@ -720,6 +728,20 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		raw_cpu_ptr(&runqueues)
 
+static inline
+void __dl_sub_ac(struct rq *rq, u64 tsk_bw)
+{
+	WARN_ON(rq->dl.ac_bw == 0);
+
+	rq->dl.ac_bw -= tsk_bw;
+}
+
+static inline
+void __dl_add_ac(struct rq *rq, u64 tsk_bw)
+{
+	rq->dl.ac_bw += tsk_bw;
+}
+
 static inline u64 __rq_clock_broken(struct rq *rq)
 {
 	return READ_ONCE(rq->clock);
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 2/2] sched/deadline: rq_{online,offline}_dl for root_domain changes
  2016-02-08 12:45 [PATCH 0/2] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
  2016-02-08 12:45 ` [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth Juri Lelli
@ 2016-02-08 12:45 ` Juri Lelli
  1 sibling, 0 replies; 46+ messages in thread
From: Juri Lelli @ 2016-02-08 12:45 UTC (permalink / raw)
  To: rostedt
  Cc: linux-kernel, peterz, mingo, luca.abeni, vincent.guittot,
	wanpeng.li, juri.lelli

Steven Rostedt reported that we lose information about the amount of
bandwidth that has been admitted to the system across operations that
reconfigure scheduling domains (and are thus destructive w.r.t. root
domains). Wanpeng Li also previously reported a similar problem related
to hotplug, but, since hotplug operations entail scheduling domain
reconfiguration, we can fix both problems at once.

This patch leverages the per-rq admitted bandwidth information introduced by
the previous commit and couples it with the rq_{online,offline}_dl calls to
save and restore root_domain state across scheduling domain
reconfiguration.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reported-by: Wanpeng Li <wanpeng.li@hotmail.com>
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
---
 kernel/sched/deadline.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 2480cab..925814e 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1683,6 +1683,7 @@ static void rq_online_dl(struct rq *rq)
 	if (rq->dl.overloaded)
 		dl_set_overload(rq);
 
+	__dl_add(&rq->rd->dl_bw, rq->dl.ac_bw);
 	cpudl_set_freecpu(&rq->rd->cpudl, rq->cpu);
 	if (rq->dl.dl_nr_running > 0)
 		cpudl_set(&rq->rd->cpudl, rq->cpu, rq->dl.earliest_dl.curr, 1);
@@ -1696,6 +1697,7 @@ static void rq_offline_dl(struct rq *rq)
 
 	cpudl_set(&rq->rd->cpudl, rq->cpu, 0, 0);
 	cpudl_clear_freecpu(&rq->rd->cpudl, rq->cpu);
+	__dl_clear(&rq->rd->dl_bw, rq->dl.ac_bw);
 }
 
 void __init init_sched_dl_class(void)
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-08 12:45 ` [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth Juri Lelli
@ 2016-02-10 11:32   ` Juri Lelli
  2016-02-10 11:43     ` luca abeni
                       ` (2 more replies)
  0 siblings, 3 replies; 46+ messages in thread
From: Juri Lelli @ 2016-02-10 11:32 UTC (permalink / raw)
  To: rostedt
  Cc: linux-kernel, peterz, mingo, luca.abeni, vincent.guittot,
	wanpeng.li, Juri Lelli

Hi,

I've updated this patch since, with a bit more testing and after talking with
Luca in private, I realized that the previous version didn't handle switching
back and forth from SCHED_DEADLINE correctly. Thanks a lot Luca for your
feedback (even if not visible on the list).

I updated the testing branch accordingly and added a test to my test suite
that stresses switching in and out of SCHED_DEADLINE.

I don't particularly like the fact that we break the scheduling class
abstraction in __dl_overflow(), so I think a little bit of refactoring
is still needed. But that can also happen afterwards, if we fix the
problem with root domains.

Best,

- Juri

--->8---

From 62f70ca3051672dce209e8355cf5eddc9d825c2a Mon Sep 17 00:00:00 2001
From: Juri Lelli <juri.lelli@arm.com>
Date: Sat, 6 Feb 2016 12:41:09 +0000
Subject: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth

Currently the SCHED_DEADLINE scheduling policy tracks the bandwidth of tasks
that passed admission control at root_domain level only. This creates
problems when such data structure(s) are destroyed, for example when we
reconfigure scheduling domains.

This is part one of two changes required to fix the problem. In this
patch we add per-rq tracking of admitted bandwidth. Tasks bring with
them their bandwidth contribution when they enter the system and are
enqueued for the first time. Contributions are then moved around when
migrations happen and removed when tasks die.

Per-rq admitted bandwidth information will be leveraged in the next
commit to save/restore per-rq bandwidth contribution towards the root
domain (using the rq_{online,offline} mechanism).

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reported-by: Wanpeng Li <wanpeng.li@hotmail.com>
Reported-by: Steven Rostedt <rostedt@goodmis.org>
[ original patch by ]
Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
---
 kernel/sched/core.c     |  6 +++++-
 kernel/sched/deadline.c | 16 +++++++++++++++-
 kernel/sched/sched.h    | 22 ++++++++++++++++++++++
 3 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9503d59..0ee0ec2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2432,7 +2432,7 @@ static int dl_overflow(struct task_struct *p, int policy,
 	u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
 	int cpus, err = -1;
 
-	if (new_bw == p->dl.dl_bw)
+	if (task_has_dl_policy(p) && new_bw == p->dl.dl_bw)
 		return 0;
 
 	/*
@@ -2445,14 +2445,18 @@ static int dl_overflow(struct task_struct *p, int policy,
 	if (dl_policy(policy) && !task_has_dl_policy(p) &&
 	    !__dl_overflow(dl_b, cpus, 0, new_bw)) {
 		__dl_add(dl_b, new_bw);
+		__dl_add_ac(task_rq(p), new_bw);
 		err = 0;
 	} else if (dl_policy(policy) && task_has_dl_policy(p) &&
 		   !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
 		__dl_clear(dl_b, p->dl.dl_bw);
+		__dl_sub_ac(task_rq(p), p->dl.dl_bw);
 		__dl_add(dl_b, new_bw);
+		__dl_add_ac(task_rq(p), new_bw);
 		err = 0;
 	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
 		__dl_clear(dl_b, p->dl.dl_bw);
+		__dl_sub_ac(task_rq(p), p->dl.dl_bw);
 		err = 0;
 	}
 	raw_spin_unlock(&dl_b->lock);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index cd64c97..8ac0fee 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -83,6 +83,7 @@ void init_dl_rq(struct dl_rq *dl_rq)
 #else
 	init_dl_bw(&dl_rq->dl_bw);
 #endif
+	dl_rq->ac_bw = 0;
 }
 
 #ifdef CONFIG_SMP
@@ -278,8 +279,10 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
 	 * By now the task is replenished and enqueued; migrate it.
 	 */
 	deactivate_task(rq, p, 0);
+	__dl_sub_ac(rq, p->dl.dl_bw);
 	set_task_cpu(p, later_rq->cpu);
 	activate_task(later_rq, p, 0);
+	__dl_add_ac(later_rq, p->dl.dl_bw);
 
 	if (!fallback)
 		resched_curr(later_rq);
@@ -597,7 +600,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 
 	/*
 	 * The task might have changed its scheduling policy to something
-	 * different than SCHED_DEADLINE (through switched_fromd_dl()).
+	 * different than SCHED_DEADLINE (through switched_from_dl()).
 	 */
 	if (!dl_task(p)) {
 		__dl_clear_params(p);
@@ -955,6 +958,9 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 		return;
 	}
 
+	if (p->on_rq == TASK_ON_RQ_MIGRATING)
+		__dl_add_ac(rq, p->dl.dl_bw);
+
 	/*
 	 * If p is throttled, we do nothing. In fact, if it exhausted
 	 * its budget it needs a replenishment and, since it now is on
@@ -980,6 +986,8 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	update_curr_dl(rq);
 	__dequeue_task_dl(rq, p, flags);
+	if (p->on_rq == TASK_ON_RQ_MIGRATING)
+		__dl_sub_ac(rq, p->dl.dl_bw);
 }
 
 /*
@@ -1219,6 +1227,8 @@ static void task_dead_dl(struct task_struct *p)
 {
 	struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
 
+	__dl_sub_ac(task_rq(p), p->dl.dl_bw);
+
 	/*
 	 * Since we are TASK_DEAD we won't slip out of the domain!
 	 */
@@ -1511,8 +1521,10 @@ retry:
 	}
 
 	deactivate_task(rq, next_task, 0);
+	__dl_sub_ac(rq, next_task->dl.dl_bw);
 	set_task_cpu(next_task, later_rq->cpu);
 	activate_task(later_rq, next_task, 0);
+	__dl_add_ac(later_rq, next_task->dl.dl_bw);
 	ret = 1;
 
 	resched_curr(later_rq);
@@ -1599,8 +1611,10 @@ static void pull_dl_task(struct rq *this_rq)
 			resched = true;
 
 			deactivate_task(src_rq, p, 0);
+			__dl_sub_ac(src_rq, p->dl.dl_bw);
 			set_task_cpu(p, this_cpu);
 			activate_task(this_rq, p, 0);
+			__dl_add_ac(this_rq, p->dl.dl_bw);
 			dmin = p->dl.deadline;
 
 			/* Is there any other task even earlier? */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 10f1637..242907f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -519,6 +519,14 @@ struct dl_rq {
 #else
 	struct dl_bw dl_bw;
 #endif
+
+	/*
+	 * ac_bw keeps track of per-rq admitted bandwidth. It only changes
+	 * when a new task is admitted, dies, changes scheduling policy or
+	 * is migrated to another rq. It is used to correctly save/restore
+	 * total_bw on root_domain changes.
+	 */
+	u64 ac_bw;
 };
 
 #ifdef CONFIG_SMP
@@ -720,6 +728,20 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		raw_cpu_ptr(&runqueues)
 
+static inline
+void __dl_sub_ac(struct rq *rq, u64 tsk_bw)
+{
+	WARN_ON(rq->dl.ac_bw < tsk_bw);
+
+	rq->dl.ac_bw -= tsk_bw;
+}
+
+static inline
+void __dl_add_ac(struct rq *rq, u64 tsk_bw)
+{
+	rq->dl.ac_bw += tsk_bw;
+}
+
 static inline u64 __rq_clock_broken(struct rq *rq)
 {
 	return READ_ONCE(rq->clock);
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-10 11:32   ` Juri Lelli
@ 2016-02-10 11:43     ` luca abeni
  2016-02-10 11:58       ` Juri Lelli
  2016-02-10 12:48     ` [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth luca abeni
  2016-02-10 14:37     ` Steven Rostedt
  2 siblings, 1 reply; 46+ messages in thread
From: luca abeni @ 2016-02-10 11:43 UTC (permalink / raw)
  To: Juri Lelli
  Cc: rostedt, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li

Hi all,

On Wed, 10 Feb 2016 11:32:58 +0000
Juri Lelli <juri.lelli@arm.com> wrote:
[...]
> @@ -2445,14 +2445,18 @@ static int dl_overflow(struct task_struct *p,
> int policy, if (dl_policy(policy) && !task_has_dl_policy(p) &&
>  	    !__dl_overflow(dl_b, cpus, 0, new_bw)) {
>  		__dl_add(dl_b, new_bw);
> +		__dl_add_ac(task_rq(p), new_bw);
>  		err = 0;
>  	} else if (dl_policy(policy) && task_has_dl_policy(p) &&
>  		   !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
>  		__dl_clear(dl_b, p->dl.dl_bw);
> +		__dl_sub_ac(task_rq(p), p->dl.dl_bw);
>  		__dl_add(dl_b, new_bw);
> +		__dl_add_ac(task_rq(p), new_bw);
>  		err = 0;
>  	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
>  		__dl_clear(dl_b, p->dl.dl_bw);
> +		__dl_sub_ac(task_rq(p), p->dl.dl_bw);

Instead of adding __dl_add_ac() and __dl_sub_ac() calls here, maybe they
can be added in switched_to_dl() and switched_from_dl()?

I'll test this idea locally, and I'll send an updated patch if it works.
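
Roughly, that would look something like the sketch below (illustrative only,
not tested; the two hooks already exist in deadline.c with these signatures,
the only addition would be the accounting calls):

static void switched_from_dl(struct rq *rq, struct task_struct *p)
{
	/* task leaves SCHED_DEADLINE: drop its contribution from this rq */
	__dl_sub_ac(rq, p->dl.dl_bw);

	/* ... existing switched_from_dl() body ... */
}

static void switched_to_dl(struct rq *rq, struct task_struct *p)
{
	/* task enters SCHED_DEADLINE: account its bandwidth on this rq */
	__dl_add_ac(rq, p->dl.dl_bw);

	/* ... existing switched_to_dl() body ... */
}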



			Luca

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-10 11:43     ` luca abeni
@ 2016-02-10 11:58       ` Juri Lelli
  2016-02-19 13:43         ` luca abeni
  2016-02-22 10:57         ` [PATCH 0/3] cleanup " Luca Abeni
  0 siblings, 2 replies; 46+ messages in thread
From: Juri Lelli @ 2016-02-10 11:58 UTC (permalink / raw)
  To: luca abeni
  Cc: rostedt, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li

On 10/02/16 12:43, Luca Abeni wrote:
> Hi all,
> 

Hi Luca,

> On Wed, 10 Feb 2016 11:32:58 +0000
> Juri Lelli <juri.lelli@arm.com> wrote:
> [...]
> > @@ -2445,14 +2445,18 @@ static int dl_overflow(struct task_struct *p,
> > int policy, if (dl_policy(policy) && !task_has_dl_policy(p) &&
> >  	    !__dl_overflow(dl_b, cpus, 0, new_bw)) {
> >  		__dl_add(dl_b, new_bw);
> > +		__dl_add_ac(task_rq(p), new_bw);
> >  		err = 0;
> >  	} else if (dl_policy(policy) && task_has_dl_policy(p) &&
> >  		   !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
> >  		__dl_clear(dl_b, p->dl.dl_bw);
> > +		__dl_sub_ac(task_rq(p), p->dl.dl_bw);
> >  		__dl_add(dl_b, new_bw);
> > +		__dl_add_ac(task_rq(p), new_bw);
> >  		err = 0;
> >  	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
> >  		__dl_clear(dl_b, p->dl.dl_bw);
> > +		__dl_sub_ac(task_rq(p), p->dl.dl_bw);
> 
> Instead of adding __dl_add_ac() and __dl_sub_ac) calls here, maybe they
> can be added in switched_to_dl() and switched_from_dl()?
> 

That might work too, yes. I think there is value in being able to move
all the __dl_{add,sub}_ac calls into deadline.c. Actually, we should probably
move __dl_{add,clear} there as well, so that changes to the rq and to the
root_domain happen at the same time.
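
Something like the sketch below is what I have in mind (SMP case; the helper
names are made up just for illustration, and the callers would hold the same
locks that dl_overflow() holds today):

static void dl_account_task_bw(struct rq *rq, struct task_struct *p)
{
	__dl_add(&rq->rd->dl_bw, p->dl.dl_bw);	/* root_domain side */
	__dl_add_ac(rq, p->dl.dl_bw);		/* per-rq side */
}

static void dl_unaccount_task_bw(struct rq *rq, struct task_struct *p)
{
	__dl_clear(&rq->rd->dl_bw, p->dl.dl_bw);
	__dl_sub_ac(rq, p->dl.dl_bw);
}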

> I'll test this idea locally, and I'll send an updated patch if it works.
> 

Thanks! Will wait for it :).

Best,

- Juri

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-10 11:32   ` Juri Lelli
  2016-02-10 11:43     ` luca abeni
@ 2016-02-10 12:48     ` luca abeni
  2016-02-10 13:42       ` Juri Lelli
  2016-02-10 14:37     ` Steven Rostedt
  2 siblings, 1 reply; 46+ messages in thread
From: luca abeni @ 2016-02-10 12:48 UTC (permalink / raw)
  To: Juri Lelli
  Cc: rostedt, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li

Hi,

On Wed, 10 Feb 2016 11:32:58 +0000
Juri Lelli <juri.lelli@arm.com> wrote:
[...]
> From 62f70ca3051672dce209e8355cf5eddc9d825c2a Mon Sep 17 00:00:00 2001
> From: Juri Lelli <juri.lelli@arm.com>
> Date: Sat, 6 Feb 2016 12:41:09 +0000
> Subject: [PATCH 1/2] sched/deadline: add per rq tracking of admitted
> bandwidth
> 
> Currently SCHED_DEADLINE scheduling policy tracks bandwidth of tasks
> that passed admission control at root_domain level only. This creates
> problems when such data structure(s) are destroyed, when we
> reconfigure scheduling domains for example.
> 
> This is part one of two changes required to fix the problem. In this
> patch we add per-rq tracking of admitted bandwidth. Tasks bring with
> them their bandwidth contribution when they enter the system and are
> enqueued for the first time. Contributions are then moved around when
> migrations happen and removed when tasks die.

I think this patch actually does two different things (addressing two
separate problems):
1) it introduces the tracking of per-rq utilization (used in your next
   patch to address the root domain issues)
2) it fixes a bug in the current utilization tracking mechanism.
   Currently, a task doing
	while(1) {
		switch to SCHED_DEADLINE
		switch to SCHED_OTHER
	}
   brings dl_b->total_bw below 0. Thanks to Juri for showing me this
   problem (and how to reproduce it) in a private email.
   This happens because when the task switches back from SCHED_DEADLINE
   to SCHED_OTHER, switched_from_dl() does not clear its deadline
   parameters (they will be cleared by the deadline timer when it
   fires). But dl_overflow() removes its utilization from
   dl_b->total_bw. When the task switches back to SCHED_DEADLINE, the
   if (new_bw == p->dl.dl_bw) check in dl_overflow() prevents
   __dl_add() from being called, and so when the task switches back to
   SCHED_OTHER dl_b->total_bw becomes negative.
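
   For reference, a minimal userspace sketch of that reproducer could look
   like the program below (it assumes __NR_sched_setattr is available in the
   installed headers, needs root, and the runtime/period values are only
   illustrative):

	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/sched.h>	/* SCHED_DEADLINE, SCHED_NORMAL */

	struct sched_attr {
		uint32_t size;
		uint32_t sched_policy;
		uint64_t sched_flags;
		int32_t  sched_nice;
		uint32_t sched_priority;
		uint64_t sched_runtime;
		uint64_t sched_deadline;
		uint64_t sched_period;
	};

	static int sched_setattr(pid_t pid, const struct sched_attr *attr)
	{
		return syscall(__NR_sched_setattr, pid, attr, 0);
	}

	int main(void)
	{
		struct sched_attr dl = {
			.size		= sizeof(dl),
			.sched_policy	= SCHED_DEADLINE,
			.sched_runtime	=  10 * 1000 * 1000,	/*  10ms */
			.sched_deadline	= 100 * 1000 * 1000,	/* 100ms */
			.sched_period	= 100 * 1000 * 1000,	/* 100ms */
		};
		struct sched_attr other = {
			.size		= sizeof(other),
			.sched_policy	= SCHED_NORMAL,	/* i.e. SCHED_OTHER */
		};

		for (;;) {
			/* each cycle leaks the task's dl_bw from total_bw */
			if (sched_setattr(0, &dl))
				perror("switch to SCHED_DEADLINE");
			if (sched_setattr(0, &other))
				perror("switch back to SCHED_OTHER");
		}

		return 0;
	}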

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9503d59..0ee0ec2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2432,7 +2432,7 @@ static int dl_overflow(struct task_struct *p,
> int policy, u64 new_bw = dl_policy(policy) ? to_ratio(period,
> runtime) : 0; int cpus, err = -1;
>  
> -	if (new_bw == p->dl.dl_bw)
> +	if (task_has_dl_policy(p) && new_bw == p->dl.dl_bw)
>  		return 0;

This hunk actually fixes issue 2) mentioned above, so I think it should
be committed in a short time (independently of the rest of the patch).
And maybe it is a good candidate for backporting to stable kernels?



			Thanks,
				Luca

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-10 12:48     ` [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth luca abeni
@ 2016-02-10 13:42       ` Juri Lelli
  2016-02-23 15:48         ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Juri Lelli @ 2016-02-10 13:42 UTC (permalink / raw)
  To: luca abeni
  Cc: rostedt, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li

On 10/02/16 13:48, Luca Abeni wrote:
> Hi,
> 
> On Wed, 10 Feb 2016 11:32:58 +0000
> Juri Lelli <juri.lelli@arm.com> wrote:
> [...]
> > From 62f70ca3051672dce209e8355cf5eddc9d825c2a Mon Sep 17 00:00:00 2001
> > From: Juri Lelli <juri.lelli@arm.com>
> > Date: Sat, 6 Feb 2016 12:41:09 +0000
> > Subject: [PATCH 1/2] sched/deadline: add per rq tracking of admitted
> > bandwidth
> > 
> > Currently SCHED_DEADLINE scheduling policy tracks bandwidth of tasks
> > that passed admission control at root_domain level only. This creates
> > problems when such data structure(s) are destroyed, when we
> > reconfigure scheduling domains for example.
> > 
> > This is part one of two changes required to fix the problem. In this
> > patch we add per-rq tracking of admitted bandwidth. Tasks bring with
> > them their bandwidth contribution when they enter the system and are
> > enqueued for the first time. Contributions are then moved around when
> > migrations happen and removed when tasks die.
> 
> I think this patch actually does two different things (addressing two
> separate problems):
> 1) it introduces the tracking of per-rq utilization (used in your next
>    patch to address the root domain issues)
> 2) it fixes a bug in the current utilization tracking mechanism.
>    Currently, a task doing
> 	while(1) {
> 		switch to SCHED_DEADLINE
> 		switch to SCHED_OTHER
> 	}
>    brings dl_b->total_bw below 0. Thanks to Juri for showing me this
>    problem (and how to reproduce it) in a private email.
>    This happens because when the task switches back from SCHED_DEADLINE
>    to SCHED_OTHER, switched_from_dl() does not clear its deadline
>    parameters (they will be cleared by the deadline timer when it
>    fires). But dl_overflow() removes its utilization from
>    dl_b->total_bw. When the task switches back to SCHED_DEADLINE, the
>    if (new_bw == p->dl.dl_bw) check in dl_overflow() prevents
>    __dl_add() from being called, and so when the task switches back to
>    SCHED_OTHER dl_b->total_bw becomes negative.
> 

Yep. That is also the reason why I think that, whatever refactoring we are
going to do, we should keep the per-rq and root_domain accounting done
together.

> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 9503d59..0ee0ec2 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -2432,7 +2432,7 @@ static int dl_overflow(struct task_struct *p,
> > int policy, u64 new_bw = dl_policy(policy) ? to_ratio(period,
> > runtime) : 0; int cpus, err = -1;
> >  
> > -	if (new_bw == p->dl.dl_bw)
> > +	if (task_has_dl_policy(p) && new_bw == p->dl.dl_bw)
> >  		return 0;
> 
> This hunk actually fixes issue 2) mentioned above, so I think it should
> be committed in a short time (independently from the rest of the
> patch). And maybe is a good candidate for backporting to stable kernels?
> 

Yes, this is a sensible fix per se. I can split it and send it
separately.

Best,

- Juri

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-10 11:32   ` Juri Lelli
  2016-02-10 11:43     ` luca abeni
  2016-02-10 12:48     ` [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth luca abeni
@ 2016-02-10 14:37     ` Steven Rostedt
  2016-02-10 16:27       ` Juri Lelli
  2 siblings, 1 reply; 46+ messages in thread
From: Steven Rostedt @ 2016-02-10 14:37 UTC (permalink / raw)
  To: Juri Lelli
  Cc: linux-kernel, peterz, mingo, luca.abeni, vincent.guittot, wanpeng.li

On Wed, 10 Feb 2016 11:32:58 +0000
Juri Lelli <juri.lelli@arm.com> wrote:

> Hi,
> 
> I've updated this patch since, with a bit more testing and talking with
> Luca in private, I realized that the previous version didn't manage
> switching back and forth from SCHED_DEADLINE correctly. Thanks a lot
> Luca for your feedback (even if not visible on the list).
> 
> I updated the testing branch accordingly and added a test to my tests
> that stresses switch-in/switch-out.
> 
> I don't particularly like the fact that we break the scheduling classes
> abstraction in __dl_overflow(), so I think a little bit of refactoring
> is still needed. But that can also happen afterwards, if we fix the
> problem with root domains.
> 
> Best,
> 
> - Juri
> 
> --->8---  
> 
> >From 62f70ca3051672dce209e8355cf5eddc9d825c2a Mon Sep 17 00:00:00 2001  
> From: Juri Lelli <juri.lelli@arm.com>
> Date: Sat, 6 Feb 2016 12:41:09 +0000
> Subject: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
> 


I applied this patch and patch 2 and hit this:

[ 2298.134284] ------------[ cut here ]------------
[ 2298.138933] WARNING: CPU: 4 PID: 0 at /home/rostedt/work/git/linux-trace.git/kernel/sched/sched.h:735 task_dead_dl+0xc5/0xd0()
[ 2298.150350] Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables xt_CHECKSUM iptable_mangle bridge stp llc bluetooth lockd grace snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core x86_pkg_temp_thermal snd_seq vhost_net snd_seq_device tun vhost macvtap macvlan coretemp iTCO_wdt snd_pcm hp_wmi rfkill kvm_intel sparse_keymap iTCO_vendor_support snd_timer snd acpi_cpufreq kvm i2c_i801 mei_me mei soundcore lpc_ich mfd_core irqbypass wmi serio_raw uinput i915 i2c_algo_bit e1000e drm_kms_helper crc32_pclmul ptp crc32c_intel drm pps_core i2c_core video sunrpc
[ 2298.207495] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.5.0-rc1-test+ #204
[ 2298.214392] Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
[ 2298.223371]  ffffffff81abc680 ffff880119433d68 ffffffff81411c3f 0000000000000000
[ 2298.230904]  ffffffff81abc680 ffff880119433da0 ffffffff810acf66 ffff88011eb16f40
[ 2298.238435]  ffff88001ee16200 ffffffff81fd4a00 00000000000aaaaa 0000000000000001
[ 2298.245958] Call Trace:
[ 2298.248431]  [<ffffffff81411c3f>] dump_stack+0x50/0xb1
[ 2298.253597]  [<ffffffff810acf66>] warn_slowpath_common+0x86/0xc0
[ 2298.259627]  [<ffffffff810ad05a>] warn_slowpath_null+0x1a/0x20
[ 2298.265490]  [<ffffffff810f92a5>] task_dead_dl+0xc5/0xd0
[ 2298.270828]  [<ffffffff810d833f>] finish_task_switch+0x16f/0x310
[ 2298.276871]  [<ffffffff810fa7f3>] ? pick_next_task_dl+0xb3/0x250
[ 2298.282906]  [<ffffffff817f07a3>] __schedule+0x3d3/0x9e0
[ 2298.288252]  [<ffffffff817f1001>] schedule+0x41/0xc0
[ 2298.293242]  [<ffffffff817f12c8>] schedule_preempt_disabled+0x18/0x30
[ 2298.299712]  [<ffffffff810fc974>] cpu_startup_entry+0x74/0x4e0
[ 2298.305573]  [<ffffffff8105d16f>] start_secondary+0x2bf/0x330
[ 2298.311347] ---[ end trace 732d16efabe456f1 ]---

It's the warning you added in __dl_sub_ac().

Things appear to still get all screwed up with cpusets and root
domains, but I can still recover by playing with load_balance.


-- Steve

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-10 14:37     ` Steven Rostedt
@ 2016-02-10 16:27       ` Juri Lelli
  2016-02-11 12:12         ` Juri Lelli
  0 siblings, 1 reply; 46+ messages in thread
From: Juri Lelli @ 2016-02-10 16:27 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, peterz, mingo, luca.abeni, vincent.guittot, wanpeng.li

On 10/02/16 09:37, Steven Rostedt wrote:
> On Wed, 10 Feb 2016 11:32:58 +0000
> Juri Lelli <juri.lelli@arm.com> wrote:
> 
> > Hi,
> > 
> > I've updated this patch since, with a bit more testing and talking with
> > Luca in private, I realized that the previous version didn't manage
> > switching back and forth from SCHED_DEADLINE correctly. Thanks a lot
> > Luca for your feedback (even if not visible on the list).
> > 
> > I updated the testing branch accordingly and added a test to my tests
> > that stresses switch-in/switch-out.
> > 
> > I don't particularly like the fact that we break the scheduling classes
> > abstraction in __dl_overflow(), so I think a little bit of refactoring
> > is still needed. But that can also happen afterwards, if we fix the
> > problem with root domains.
> > 
> > Best,
> > 
> > - Juri
> > 
> > --->8---  
> > 
> > >From 62f70ca3051672dce209e8355cf5eddc9d825c2a Mon Sep 17 00:00:00 2001  
> > From: Juri Lelli <juri.lelli@arm.com>
> > Date: Sat, 6 Feb 2016 12:41:09 +0000
> > Subject: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
> > 
> 
> 
> I applied this patch and patch 2 and hit this:
> 
> [ 2298.134284] ------------[ cut here ]------------
> [ 2298.138933] WARNING: CPU: 4 PID: 0 at /home/rostedt/work/git/linux-trace.git/kernel/sched/sched.h:735 task_dead_dl+0xc5/0xd0()
> [ 2298.150350] Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables xt_CHECKSUM iptable_mangle bridge stp llc bluetooth lockd grace snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core x86_pkg_temp_thermal snd_seq vhost_net snd_seq_device tun vhost macvtap macvlan coretemp iTCO_wdt snd_pcm hp_wmi rfkill kvm_intel sparse_keymap iTCO_vendor_support snd_timer snd acpi_cpufreq kvm i2c_i801 mei_me mei soundcore lpc_ich mfd_core irqbypass wmi serio_raw uinput i915 i2c_algo_bit e1000e drm_kms_helper crc32_pclmul ptp crc32c_intel drm pps_core i2c_core video sunrpc
> [ 2298.207495] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.5.0-rc1-test+ #204
> [ 2298.214392] Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
> [ 2298.223371]  ffffffff81abc680 ffff880119433d68 ffffffff81411c3f 0000000000000000
> [ 2298.230904]  ffffffff81abc680 ffff880119433da0 ffffffff810acf66 ffff88011eb16f40
> [ 2298.238435]  ffff88001ee16200 ffffffff81fd4a00 00000000000aaaaa 0000000000000001
> [ 2298.245958] Call Trace:
> [ 2298.248431]  [<ffffffff81411c3f>] dump_stack+0x50/0xb1
> [ 2298.253597]  [<ffffffff810acf66>] warn_slowpath_common+0x86/0xc0
> [ 2298.259627]  [<ffffffff810ad05a>] warn_slowpath_null+0x1a/0x20
> [ 2298.265490]  [<ffffffff810f92a5>] task_dead_dl+0xc5/0xd0
> [ 2298.270828]  [<ffffffff810d833f>] finish_task_switch+0x16f/0x310
> [ 2298.276871]  [<ffffffff810fa7f3>] ? pick_next_task_dl+0xb3/0x250
> [ 2298.282906]  [<ffffffff817f07a3>] __schedule+0x3d3/0x9e0
> [ 2298.288252]  [<ffffffff817f1001>] schedule+0x41/0xc0
> [ 2298.293242]  [<ffffffff817f12c8>] schedule_preempt_disabled+0x18/0x30
> [ 2298.299712]  [<ffffffff810fc974>] cpu_startup_entry+0x74/0x4e0
> [ 2298.305573]  [<ffffffff8105d16f>] start_secondary+0x2bf/0x330
> [ 2298.311347] ---[ end trace 732d16efabe456f1 ]---
> 
> It's the warning you added in __dl_sub_ac().
> 

OK. There are still holes where we fail to properly update the per-rq bw. It
seems (by running your test) that we fail to move the per-rq bw when we
move the root_domain bw due to css_set_move_task(). So, the final
task_dead_dl() tries to remove bw from an rq where it isn't accounted.

I'm trying to see how we can close this hole.

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-10 16:27       ` Juri Lelli
@ 2016-02-11 12:12         ` Juri Lelli
  2016-02-11 12:22           ` luca abeni
  0 siblings, 1 reply; 46+ messages in thread
From: Juri Lelli @ 2016-02-11 12:12 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, peterz, mingo, luca.abeni, vincent.guittot, wanpeng.li

On 10/02/16 16:27, Juri Lelli wrote:
> On 10/02/16 09:37, Steven Rostedt wrote:
> > On Wed, 10 Feb 2016 11:32:58 +0000
> > Juri Lelli <juri.lelli@arm.com> wrote:
> > 

[...]

> > 
> > I applied this patch and patch 2 and hit this:
> > 

[...]

> > 
> > It's the warning you added in __dl_sub_ac().
> > 
> 
> OK. There are still holes where we fail to properly update per-rq bw. It
> seems (by running you test) that we fail to move the per-rq bw when we
> move the root_domain bw due css_set_move_task(). So, the final
> task_dead_dl() tries to remove bw from where there isn't.
> 
> I'm trying to see how we can close this hole.
> 

So, just to give an update from yesterday (kind of tricky this one :/).

I think we still have (at least) two problems:

 - select_task_rq_dl, if we select a different target
 - select_task_rq might make use of select_fallback_rq, if cpus_allowed
   changed after the task went to sleep

The second case is what creates the problem here, as we don't update the
ac_bw of task_rq(p) and of the fallback CPU. I was thinking we might do so
from migrate_task_rq_dl() (which has to be added, yes), maybe storing
fallback_cpu in task_struct, but I fear that we would need to hold both rq
locks :/.

Luca, did you already face this problem (if I got it right) and thought
of a way to fix it? I'll go back and stare a bit more at those paths.

Best,

- Juri

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-11 12:12         ` Juri Lelli
@ 2016-02-11 12:22           ` luca abeni
  2016-02-11 12:27             ` Juri Lelli
  0 siblings, 1 reply; 46+ messages in thread
From: luca abeni @ 2016-02-11 12:22 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Steven Rostedt, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li

Hi Juri,

On Thu, 11 Feb 2016 12:12:57 +0000
Juri Lelli <juri.lelli@arm.com> wrote:
[...]
> I think we still have (at least) two problems:
> 
>  - select_task_rq_dl, if we select a different target
>  - select_task_rq might make use of select_fallback_rq, if
> cpus_allowed changed after the task went to sleep
> 
> Second case is what creates the problem here, as we don't update
> task_rq(p) and fallback_cpu ac_bw. I was thinking we might do so,
> maybe adding fallback_cpu in task_struct, from migrate_task_rq_dl()
> (it has to be added yes), but I fear that we should hold both rq
> locks :/.
> 
> Luca, did you already face this problem (if I got it right) and
> thought of a way to fix it? I'll go back and stare a bit more at
> those paths.
In my patch I took care of the first case (modifying
select_task_rq_dl() to move the utilization from the "old rq" to the
"new rq"), but I never managed to trigger select_fallback_rq() in my
tests, so I overlooked that case.


				Luca

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-11 12:22           ` luca abeni
@ 2016-02-11 12:27             ` Juri Lelli
  2016-02-11 12:40               ` luca abeni
  0 siblings, 1 reply; 46+ messages in thread
From: Juri Lelli @ 2016-02-11 12:27 UTC (permalink / raw)
  To: luca abeni
  Cc: Steven Rostedt, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li

On 11/02/16 13:22, Luca Abeni wrote:
> Hi Juri,
> 
> On Thu, 11 Feb 2016 12:12:57 +0000
> Juri Lelli <juri.lelli@arm.com> wrote:
> [...]
> > I think we still have (at least) two problems:
> > 
> >  - select_task_rq_dl, if we select a different target
> >  - select_task_rq might make use of select_fallback_rq, if
> > cpus_allowed changed after the task went to sleep
> > 
> > Second case is what creates the problem here, as we don't update
> > task_rq(p) and fallback_cpu ac_bw. I was thinking we might do so,
> > maybe adding fallback_cpu in task_struct, from migrate_task_rq_dl()
> > (it has to be added yes), but I fear that we should hold both rq
> > locks :/.
> > 
> > Luca, did you already face this problem (if I got it right) and
> > thought of a way to fix it? I'll go back and stare a bit more at
> > those paths.
> In my patch I took care of the first case (modifying
> select_task_rq_dl() to move the utilization from the "old rq" to the
> "new rq"), but I never managed to trigger select_fallback_rq() in my
> tests, so I overlooked that case.
> 

Right, I was thinking of doing the same. And you did that after grabbing
both locks, right?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-11 12:27             ` Juri Lelli
@ 2016-02-11 12:40               ` luca abeni
  2016-02-11 12:49                 ` Juri Lelli
  0 siblings, 1 reply; 46+ messages in thread
From: luca abeni @ 2016-02-11 12:40 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Steven Rostedt, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li

On Thu, 11 Feb 2016 12:27:54 +0000
Juri Lelli <juri.lelli@arm.com> wrote:

> On 11/02/16 13:22, Luca Abeni wrote:
> > Hi Juri,
> > 
> > On Thu, 11 Feb 2016 12:12:57 +0000
> > Juri Lelli <juri.lelli@arm.com> wrote:
> > [...]
> > > I think we still have (at least) two problems:
> > > 
> > >  - select_task_rq_dl, if we select a different target
> > >  - select_task_rq might make use of select_fallback_rq, if
> > > cpus_allowed changed after the task went to sleep
> > > 
> > > Second case is what creates the problem here, as we don't update
> > > task_rq(p) and fallback_cpu ac_bw. I was thinking we might do so,
> > > maybe adding fallback_cpu in task_struct, from
> > > migrate_task_rq_dl() (it has to be added yes), but I fear that we
> > > should hold both rq locks :/.
> > > 
> > > Luca, did you already face this problem (if I got it right) and
> > > thought of a way to fix it? I'll go back and stare a bit more at
> > > those paths.
> > In my patch I took care of the first case (modifying
> > select_task_rq_dl() to move the utilization from the "old rq" to the
> > "new rq"), but I never managed to trigger select_fallback_rq() in my
> > tests, so I overlooked that case.
> > 
> 
> Right, I was thinking to do the same. And you did that after grabbing
> both locks, right?

Not sure if I did everything correctly, but my code in
select_task_rq_dl() currently looks like this (you can obviously
ignore the "migrate_active" and "*_running_bw()" parts, and focus on
the "*_rq_bw()" stuff):
[...]
        if (rq != cpu_rq(cpu)) {
                int migrate_active;

                raw_spin_lock(&rq->lock);
                migrate_active = hrtimer_active(&p->dl.inactive_timer);
                if (migrate_active) {
                        hrtimer_try_to_cancel(&p->dl.inactive_timer);
                        sub_running_bw(&p->dl, &rq->dl);
                }
                sub_rq_bw(&p->dl, &rq->dl);
                raw_spin_unlock(&rq->lock);
                rq = cpu_rq(cpu);
                raw_spin_lock(&rq->lock);
                add_rq_bw(&p->dl, &rq->dl);
                if (migrate_active)
                        add_running_bw(&p->dl, &rq->dl);
                raw_spin_unlock(&rq->lock);
        }
[...]

lockdep is not screaming, and I am not able to trigger any race
condition or strange behaviour (I am currently at more than 24h of
continuous stress-testing, but maybe my testcase is not so good at
finding races here :)



				Luca

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-11 12:40               ` luca abeni
@ 2016-02-11 12:49                 ` Juri Lelli
  2016-02-11 13:05                   ` luca abeni
  0 siblings, 1 reply; 46+ messages in thread
From: Juri Lelli @ 2016-02-11 12:49 UTC (permalink / raw)
  To: luca abeni
  Cc: Steven Rostedt, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li

On 11/02/16 13:40, Luca Abeni wrote:
> On Thu, 11 Feb 2016 12:27:54 +0000
> Juri Lelli <juri.lelli@arm.com> wrote:
> 
> > On 11/02/16 13:22, Luca Abeni wrote:
> > > Hi Juri,
> > > 
> > > On Thu, 11 Feb 2016 12:12:57 +0000
> > > Juri Lelli <juri.lelli@arm.com> wrote:
> > > [...]
> > > > I think we still have (at least) two problems:
> > > > 
> > > >  - select_task_rq_dl, if we select a different target
> > > >  - select_task_rq might make use of select_fallback_rq, if
> > > > cpus_allowed changed after the task went to sleep
> > > > 
> > > > Second case is what creates the problem here, as we don't update
> > > > task_rq(p) and fallback_cpu ac_bw. I was thinking we might do so,
> > > > maybe adding fallback_cpu in task_struct, from
> > > > migrate_task_rq_dl() (it has to be added yes), but I fear that we
> > > > should hold both rq locks :/.
> > > > 
> > > > Luca, did you already face this problem (if I got it right) and
> > > > thought of a way to fix it? I'll go back and stare a bit more at
> > > > those paths.
> > > In my patch I took care of the first case (modifying
> > > select_task_rq_dl() to move the utilization from the "old rq" to the
> > > "new rq"), but I never managed to trigger select_fallback_rq() in my
> > > tests, so I overlooked that case.
> > > 
> > 
> > Right, I was thinking to do the same. And you did that after grabbing
> > both locks, right?
> 
> Not sure if I did everything correctly, but my code in
> select_task_rq_dl() currently looks like this (you can obviously
> ignore the "migrate_active" and "*_running_bw()" parts, and focus on
> the "*_rq_bw()" stuff):
> [...]
>         if (rq != cpu_rq(cpu)) {
>                 int migrate_active;
> 
>                 raw_spin_lock(&rq->lock);
>                 migrate_active = hrtimer_active(&p->dl.inactive_timer);
>                 if (migrate_active) {
>                         hrtimer_try_to_cancel(&p->dl.inactive_timer);
>                         sub_running_bw(&p->dl, &rq->dl);
>                 }
>                 sub_rq_bw(&p->dl, &rq->dl);
>                 raw_spin_unlock(&rq->lock);
>                 rq = cpu_rq(cpu);

Can't something happen here? My problem is that I use per-rq bw tracking
to save/restore root_domain state. So, I fear that a root_domain update
can happen while we are in the middle of moving bw from one cpu to
another.

>                 raw_spin_lock(&rq->lock);
>                 add_rq_bw(&p->dl, &rq->dl);
>                 if (migrate_active)
>                         add_running_bw(&p->dl, &rq->dl);
>                 raw_spin_unlock(&rq->lock);
>         }
> [...]
> 
> lockdep is not screaming, and I am not able to trigger any race
> condition or strange behaviour (I am currently at more than 24h of
> continuous stress-testing, but maybe my testcase is not so good in
> finding races here :)
> 

Thanks for sharing what you have!

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-11 12:49                 ` Juri Lelli
@ 2016-02-11 13:05                   ` luca abeni
  2016-02-11 14:25                     ` Steven Rostedt
  0 siblings, 1 reply; 46+ messages in thread
From: luca abeni @ 2016-02-11 13:05 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Steven Rostedt, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li

On Thu, 11 Feb 2016 12:49:59 +0000
Juri Lelli <juri.lelli@arm.com> wrote:
[...]
> > > > > Luca, did you already face this problem (if I got it right)
> > > > > and thought of a way to fix it? I'll go back and stare a bit
> > > > > more at those paths.
> > > > In my patch I took care of the first case (modifying
> > > > select_task_rq_dl() to move the utilization from the "old rq"
> > > > to the "new rq"), but I never managed to trigger
> > > > select_fallback_rq() in my tests, so I overlooked that case.
> > > > 
> > > 
> > > Right, I was thinking to do the same. And you did that after
> > > grabbing both locks, right?
> > 
> > Not sure if I did everything correctly, but my code in
> > select_task_rq_dl() currently looks like this (you can obviously
> > ignore the "migrate_active" and "*_running_bw()" parts, and focus on
> > the "*_rq_bw()" stuff):
> > [...]
> >         if (rq != cpu_rq(cpu)) {
> >                 int migrate_active;
> > 
> >                 raw_spin_lock(&rq->lock);
> >                 migrate_active =
> > hrtimer_active(&p->dl.inactive_timer); if (migrate_active) {
> >                         hrtimer_try_to_cancel(&p->dl.inactive_timer);
> >                         sub_running_bw(&p->dl, &rq->dl);
> >                 }
> >                 sub_rq_bw(&p->dl, &rq->dl);
> >                 raw_spin_unlock(&rq->lock);
> >                 rq = cpu_rq(cpu);
> 
> Can't something happen here? My problem is that I use per-rq bw
> tracking to save/restore root_domain state. So, I fear that a
> root_domain update can happen while we are in the middle of moving bw
> from one cpu to another.
Well, I never used the rq utilization to re-build the root_domain
utilization (and I never played with root domains too much... :)...
So, I do not really know. Maybe the code should do:
	raw_spin_lock(&rq->lock);
	raw_spin_lock(&cpu_rq(cpu)->lock);
	sub_rq_bw(&p->dl, &rq->dl);
	add_rq_bw(&p->dl, &cpu_rq(cpu)->dl);
	[...]
?


			Luca

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-11 13:05                   ` luca abeni
@ 2016-02-11 14:25                     ` Steven Rostedt
  2016-02-11 17:10                       ` Juri Lelli
  2016-02-11 21:48                       ` Luca Abeni
  0 siblings, 2 replies; 46+ messages in thread
From: Steven Rostedt @ 2016-02-11 14:25 UTC (permalink / raw)
  To: luca abeni
  Cc: Juri Lelli, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li

On Thu, 11 Feb 2016 14:05:45 +0100
luca abeni <luca.abeni@unitn.it> wrote:

  
> Well, I never used the rq utilization to re-build the root_domain
> utilization (and I never played with root domains too much... :)...
> So, I do not really know. Maybe the code should do:
> 	raw_spin_lock(&rq->lock);
> 	raw_spin_lock(&cpu_rq(cpu)->lock);

Of course you want to use double_rq_lock() here instead.
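
I.e., something along the lines of this rough sketch (untested, helper name
made up; it reuses the sub_rq_bw()/add_rq_bw() helpers from your snippet, and
double_rq_lock() expects interrupts to be disabled already):

	static void move_task_rq_bw(struct task_struct *p, struct rq *src, int cpu)
	{
		struct rq *dst = cpu_rq(cpu);
		unsigned long flags;

		local_irq_save(flags);
		/* takes both rq locks in a fixed order, avoiding ABBA deadlocks */
		double_rq_lock(src, dst);
		sub_rq_bw(&p->dl, &src->dl);
		add_rq_bw(&p->dl, &dst->dl);
		double_rq_unlock(src, dst);
		local_irq_restore(flags);
	}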

-- Steve

> 	sub_rq_bw(&p->dl, &rq->dl);
> 	add_rq_bw(&p->dl, &cpu_rq(cpu)->dl);
> 	[...]
> ?
> 
> 
> 			Luca

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-11 14:25                     ` Steven Rostedt
@ 2016-02-11 17:10                       ` Juri Lelli
  2016-02-12 17:05                         ` Peter Zijlstra
  2016-02-11 21:48                       ` Luca Abeni
  1 sibling, 1 reply; 46+ messages in thread
From: Juri Lelli @ 2016-02-11 17:10 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: luca abeni, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li

On 11/02/16 09:25, Steven Rostedt wrote:
> On Thu, 11 Feb 2016 14:05:45 +0100
> luca abeni <luca.abeni@unitn.it> wrote:
> 
>   
> > Well, I never used the rq utilization to re-build the root_domain
> > utilization (and I never played with root domains too much... :)...
> > So, I do not really know. Maybe the code should do:
> > 	raw_spin_lock(&rq->lock);
> > 	raw_spin_lock(&cpu_rq(cpu)->lock);
> 
> Of course you want to use double_rq_lock() here instead.
> 

Right. Is something like this completely out of the question/broken?

I slightly tested it with Steve's test and I don't see the warning
anymore (sched_debug looks good as well); but my confidence is still
pretty low. :(

--->8---

From 9713e12bc682ca364e62f9d69bcd44598c50a8a9 Mon Sep 17 00:00:00 2001
From: Juri Lelli <juri.lelli@arm.com>
Date: Thu, 11 Feb 2016 16:55:49 +0000
Subject: [PATCH] fixup! sched/deadline: add per rq tracking of admitted
 bandwidth

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
---
 include/linux/init_task.h |  1 +
 include/linux/sched.h     |  1 +
 kernel/sched/core.c       |  6 +++++-
 kernel/sched/deadline.c   | 26 +++++++++++++++++++++-
 4 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index f2cb8d4..c582f9d 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -199,6 +199,7 @@ extern struct task_group root_task_group;
 	.policy		= SCHED_NORMAL,					\
 	.cpus_allowed	= CPU_MASK_ALL,					\
 	.nr_cpus_allowed= NR_CPUS,					\
+	.fallback_cpu	= -1,						\
 	.mm		= NULL,						\
 	.active_mm	= &init_mm,					\
 	.restart_block = {						\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a10494a..a6fc95c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1401,6 +1401,7 @@ struct task_struct {
 	struct task_struct *last_wakee;
 
 	int wake_cpu;
+	int fallback_cpu;
 #endif
 	int on_rq;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7fb9246..4e4bc41 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1442,7 +1442,9 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
 				continue;
 			if (!cpu_active(dest_cpu))
 				continue;
-			if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
+			if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p))) {
+				p->fallback_cpu = dest_cpu;
 				return dest_cpu;
+			}
 		}
 	}
@@ -1490,6 +1492,7 @@ out:
 		}
 	}
 
+	p->fallback_cpu = dest_cpu;
 	return dest_cpu;
 }
 
@@ -1954,6 +1957,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;
 		set_task_cpu(p, cpu);
+		p->fallback_cpu = -1;
 	}
 #endif /* CONFIG_SMP */
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 6368f43..1eccecf 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1043,6 +1043,21 @@ static void yield_task_dl(struct rq *rq)
 
 #ifdef CONFIG_SMP
 
+static void swap_task_ac_bw(struct task_struct *p,
+			    struct rq *from,
+			    struct rq *to)
+{
+	unsigned long flags;
+
+	lockdep_assert_held(&p->pi_lock);
+	local_irq_save(flags);
+	double_rq_lock(from, to);
+	__dl_sub_ac(from, p->dl.dl_bw);
+	__dl_add_ac(to, p->dl.dl_bw);
+	double_rq_unlock(from, to);
+	local_irq_restore(flags);
+}
+
 static int find_later_rq(struct task_struct *task);
 
 static int
@@ -1077,8 +1092,10 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
 		if (target != -1 &&
 				(dl_time_before(p->dl.deadline,
 					cpu_rq(target)->dl.earliest_dl.curr) ||
-				(cpu_rq(target)->dl.dl_nr_running == 0)))
+				(cpu_rq(target)->dl.dl_nr_running == 0))) {
 			cpu = target;
+			swap_task_ac_bw(p, rq, cpu_rq(target));
+		}
 	}
 	rcu_read_unlock();
 
@@ -1807,6 +1824,12 @@ static void prio_changed_dl(struct rq *rq, struct task_struct *p,
 		switched_to_dl(rq, p);
 }
 
+static void migrate_task_rq_dl(struct task_struct *p)
+{
+	if (p->fallback_cpu != -1)
+		swap_task_ac_bw(p, task_rq(p), cpu_rq(p->fallback_cpu));
+}
+
 const struct sched_class dl_sched_class = {
 	.next			= &rt_sched_class,
 	.enqueue_task		= enqueue_task_dl,
@@ -1820,6 +1843,7 @@ const struct sched_class dl_sched_class = {
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_dl,
+	.migrate_task_rq	= migrate_task_rq_dl,
 	.set_cpus_allowed       = set_cpus_allowed_dl,
 	.rq_online              = rq_online_dl,
 	.rq_offline             = rq_offline_dl,
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-11 14:25                     ` Steven Rostedt
  2016-02-11 17:10                       ` Juri Lelli
@ 2016-02-11 21:48                       ` Luca Abeni
  1 sibling, 0 replies; 46+ messages in thread
From: Luca Abeni @ 2016-02-11 21:48 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Juri Lelli, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li

On Thu, 11 Feb 2016 09:25:46 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Thu, 11 Feb 2016 14:05:45 +0100
> luca abeni <luca.abeni@unitn.it> wrote:
> 
>   
> > Well, I never used the rq utilization to re-build the root_domain
> > utilization (and I never played with root domains too much... :)...
> > So, I do not really know. Maybe the code should do:
> > 	raw_spin_lock(&rq->lock);
> > 	raw_spin_lock(&cpu_rq(cpu)->lock);
> 
> Of course you want to use double_rq_lock() here instead.
Right... Of course I always miss something obvious ;-)


			Luca

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-11 17:10                       ` Juri Lelli
@ 2016-02-12 17:05                         ` Peter Zijlstra
  2016-02-12 17:19                           ` Juri Lelli
  2016-02-24 19:17                           ` Peter Zijlstra
  0 siblings, 2 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-02-12 17:05 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Steven Rostedt, luca abeni, linux-kernel, mingo, vincent.guittot,
	wanpeng.li

On Thu, Feb 11, 2016 at 05:10:12PM +0000, Juri Lelli wrote:
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 6368f43..1eccecf 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c

> +static void swap_task_ac_bw(struct task_struct *p,
> +			    struct rq *from,
> +			    struct rq *to)
> +{
> +	unsigned long flags;
> +
> +	lockdep_assert_held(&p->pi_lock);
> +	local_irq_save(flags);
> +	double_rq_lock(from, to);
> +	__dl_sub_ac(from, p->dl.dl_bw);
> +	__dl_add_ac(to, p->dl.dl_bw);
> +	double_rq_unlock(from, to);
> +	local_irq_restore(flags);
> +}

> +static void migrate_task_rq_dl(struct task_struct *p)
> +{
> +	if (p->fallback_cpu != -1)
> +		swap_task_ac_bw(p, task_rq(p), cpu_rq(p->fallback_cpu));
> +}

This patch scares me.

Now, my brain is having an awfully hard time trying to re-engage after
flu, but this looks very wrong.

So we call sched_class::migrate_task_rq() from set_task_cpu(), and we
call set_task_cpu() while potentially holding rq::lock's (try
push_dl_task() for kicks).

Sure, you play horrible games with fallback_cpu, but those games are
just that, horrible.
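
To make the locking concern concrete, this is the kind of path the
fallback_cpu test has to guard against (just a sketch of the call chain,
not actual code):

	/*
	 * push_dl_task(rq)                  rq->lock held, later_rq->lock taken
	 *   deactivate_task(rq, next_task, 0)
	 *   set_task_cpu(next_task, later_rq->cpu)
	 *     migrate_task_rq_dl(next_task)
	 *       swap_task_ac_bw(next_task, ...)
	 *         double_rq_lock(from, to)   would try to re-take rq->lock
	 *
	 * Only the fallback_cpu == -1 check keeps this path from deadlocking.
	 */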


So your initial patch migrates the bandwidth along when a runnable task
gets moved about; this hack seems to be mostly about waking up. The
'normal' accounting is done on enqueue/dequeue, while here you use the
migration hook.

Having two separate means of accounting this also feels more fragile
than one would want.

Let me think a bit about this.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-12 17:05                         ` Peter Zijlstra
@ 2016-02-12 17:19                           ` Juri Lelli
  2016-02-24 19:17                           ` Peter Zijlstra
  1 sibling, 0 replies; 46+ messages in thread
From: Juri Lelli @ 2016-02-12 17:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, luca abeni, linux-kernel, mingo, vincent.guittot,
	wanpeng.li

On 12/02/16 18:05, Peter Zijlstra wrote:
> On Thu, Feb 11, 2016 at 05:10:12PM +0000, Juri Lelli wrote:
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index 6368f43..1eccecf 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> 
> > +static void swap_task_ac_bw(struct task_struct *p,
> > +			    struct rq *from,
> > +			    struct rq *to)
> > +{
> > +	unsigned long flags;
> > +
> > +	lockdep_assert_held(&p->pi_lock);
> > +	local_irq_save(flags);
> > +	double_rq_lock(from, to);
> > +	__dl_sub_ac(from, p->dl.dl_bw);
> > +	__dl_add_ac(to, p->dl.dl_bw);
> > +	double_rq_unlock(from, to);
> > +	local_irq_restore(flags);
> > +}
> 
> > +static void migrate_task_rq_dl(struct task_struct *p)
> > +{
> > +	if (p->fallback_cpu != -1)
> > +		swap_task_ac_bw(p, task_rq(p), cpu_rq(p->fallback_cpu));
> > +}
> 
> This patch scares me.
> 

Yeah, same here. However, I haven't yet found a different way to fix
this and wanted some help :).

> Now, my brain is having an awfully hard time trying to re-engage after
> flu, but this looks very wrong.
> 
> So we call sched_class::migrate_task_rq() from set_task_cpu(), and we
> call set_task_cpu() while potentially holding rq::lock's (try
> push_dl_task() for kicks).
> 
> Sure, you play horrible games with fallback_cpu, but those games are
> just that, horrible.
> 

Right. I'm counting on fallback_cpu so that we don't call
swap_task_ac_bw() (and take the rq locks) during push/pull migrations.
I was actually thinking that we could have a non-locked version of the
swap and call that from migrate_task_rq_dl during push/pull. But this
is most probably even more horrible.

> 
> So your initial patch migrates the bandwidth along when a runnable task
> gets moved about; this hack seems to be mostly about waking up. The
> 'normal' accounting is done on enqueue/dequeue, while here you use the
> migration hook.
> 

The problem is that I don't do anything in enqueue/dequeue (apart from
when cpuset migrates us while still on_rq), and I think we don't want to
do anything there, as a task's dl_bw should remain in ac_bw when it goes
to sleep, etc. This is the static view of admitted bw. We want to
save/restore the admitted bw in the root_domain also when tasks are
sleeping/blocked.
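
For reference, the per-rq helpers all of this relies on boil down to
something like the following (a sketch; the actual patch may differ in
detail, but ac_bw is only touched on admission control, on class/param
changes and on migrations, never on plain enqueue/dequeue):

	static inline void __dl_add_ac(struct rq *rq, u64 tsk_bw)
	{
		rq->dl.ac_bw += tsk_bw;
	}

	static inline void __dl_sub_ac(struct rq *rq, u64 tsk_bw)
	{
		rq->dl.ac_bw -= tsk_bw;
	}

So a task that blocks keeps contributing to the ac_bw of the rq it was
last admitted on, which is exactly the "static view" above.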

> Having two separate means of accounting this also feels more fragile
> than one would want.
> 
> Let me think a bit about this.
> 

I was looking at sending out a v2 with this as an RFC. I guess it's
better if I wait :).

Thanks!

Best,

- Juri

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-10 11:58       ` Juri Lelli
@ 2016-02-19 13:43         ` luca abeni
  2016-02-19 14:20           ` Steven Rostedt
  2016-02-22 10:57         ` [PATCH 0/3] cleanup " Luca Abeni
  1 sibling, 1 reply; 46+ messages in thread
From: luca abeni @ 2016-02-19 13:43 UTC (permalink / raw)
  To: Juri Lelli
  Cc: rostedt, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li

[-- Attachment #1: Type: text/plain, Size: 3661 bytes --]

Hi all,

On Wed, 10 Feb 2016 11:58:12 +0000
Juri Lelli <juri.lelli@arm.com> wrote:
[...]
> > >  	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
> > >  		__dl_clear(dl_b, p->dl.dl_bw);
> > > +		__dl_sub_ac(task_rq(p), p->dl.dl_bw);
> > 
> > Instead of adding __dl_add_ac() and __dl_sub_ac) calls here, maybe
> > they can be added in switched_to_dl() and switched_from_dl()?
> > 
> 
> That might work too, yes. I think there is value if we are able to move
> all __dl_{add,sub}_ac calls into deadline.c. Actually, we should
> probably move __dl_{add,clear} there as well, so that changes to rq
> and root_domain happen at the same time.
> 
> > I'll test this idea locally, and I'll send an updated patch if it
> > works.
> > 
> 
> Thanks! Will wait for it :).

I know there are still open issues with select_fallback_rq() and
similar stuff, but as promised I worked on moving the __dl_*_ac() calls
from core.c to deadline.c (which is orthogonal with respect to the
select_fallback_rq() thing). I post these patches so that people can
see how the code would look after moving the __dl_*_ac() calls to
deadline.c and can provide some comments; the patches are not submitted
for inclusion (yet).


So, the first attached patch (to be applied over Juri's patch) just
moves two __dl_sub_ac() and __dl_add_ac() invocations from
dl_overflow() to deadline.c, using the switched_from_dl() and
switched_to_dl() methods. This should cover the cases of tasks moving
from SCHED_{OTHER,RT} to SCHED_DEADLINE and vice-versa. The case in
which the deadline parameters of a task are changed still needs some
__dl_* calls in dl_overflow().

These calls are moved to deadline.c by the second attached patch, which
uses prio_changed_dl()... Here, prio_changed_dl() needs to know the
"old" task utilization, while it is given the "old task
priority" (which, for a deadline task, is equal to the current
priority)... So, I changed the meaning of the third parameter of
prio_changed_dl(), from "old priority" to "old utilization".
I know that changing the meaning of a parameter between different
scheduling classes might be a bad idea... But the "prio_changed" method
seems to be designed for priority-based schedulers only, and does not
provide good information to the SCHED_DEADLINE scheduling class... The
alternative is to change the prototype of the function, adding an
"oldbw" (or, better, "oldperiod" and "oldruntime") parameter.
Let me know what you think about this.
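
For reference, the alternative would be something along these lines (just
a sketch of the prototype change, not part of the attached patches):

	/* Alternative (sketch only): extend the method instead of
	 * overloading the meaning of oldprio. */
	void (*prio_changed)(struct rq *this_rq, struct task_struct *task,
			     int oldprio, u64 old_dl_runtime, u64 old_dl_period);

With something like this, prio_changed_dl() could recompute the old
bandwidth itself, and the oldprio semantics would stay untouched for the
other scheduling classes.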

Notice that the second patch also contains two hunks that can be
extracted from it (if needed):
1) remove the call to switched_to_dl() from prio_changed_dl(): this
   call seems to be useless
2) change __sched_setscheduler() to invoke prio_changed for deadline
   tasks changing their parameters. Currently, this method is not
   called (because when changing parameters deadline tasks do not
   change their priority, so new_effective_prio == oldprio). I suspect
   calling prio_changed is the correct behaviour; of course, this can
   be done in different ways, let me know your opinion.

I tested these two patches in various ways, and I do not see
regressions respect to Juri's patch (but I expect issues with
select_fallback_rq()... BTW, if anyone can provide some testcases for
it, I can try to fix that case too).

Finally, when playing with switched_to_dl() I realized that it can be
used to remove the dl_new field. So, I wrote patch 0003 (attached),
which seems to be working correctly, but I am probably missing some
important tests. Let me know what you think about it: I think it might
be a nice simplification of the code, but as usual I might be missing
something :)



			Thanks,
				Luca

[-- Attachment #2: 0001-Move-some-calls-to-__dl_-sub-add-_ac-from-core.c-to-.patch --]
[-- Type: text/x-patch, Size: 2219 bytes --]

>From 704d8f7f803ff10d2e84a9fcd4737ce684d7ef78 Mon Sep 17 00:00:00 2001
From: Luca Abeni <luca.abeni@unitn.it>
Date: Thu, 11 Feb 2016 13:12:12 +0100
Subject: [PATCH 1/4] Move some calls to __dl_{sub,add}_ac() from core.c to
 deadline.c

This moves some deadline-specific calls from core.c (dl_overflow()) to
switched_from_dl() and switched_to_dl().
Some __dl_{sub,add}_ac() calls are left in dl_overflow(), to handle
the case in which the deadline parameters of a task are changed without
changing the scheduling class.
---
 kernel/sched/core.c     | 2 --
 kernel/sched/deadline.c | 3 +++
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0ee0ec2..a4f08d1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2445,7 +2445,6 @@ static int dl_overflow(struct task_struct *p, int policy,
 	if (dl_policy(policy) && !task_has_dl_policy(p) &&
 	    !__dl_overflow(dl_b, cpus, 0, new_bw)) {
 		__dl_add(dl_b, new_bw);
-		__dl_add_ac(task_rq(p), new_bw);
 		err = 0;
 	} else if (dl_policy(policy) && task_has_dl_policy(p) &&
 		   !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
@@ -2456,7 +2455,6 @@ static int dl_overflow(struct task_struct *p, int policy,
 		err = 0;
 	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
 		__dl_clear(dl_b, p->dl.dl_bw);
-		__dl_sub_ac(task_rq(p), p->dl.dl_bw);
 		err = 0;
 	}
 	raw_spin_unlock(&dl_b->lock);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 8ac0fee..4cc713a 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1710,6 +1710,8 @@ void __init init_sched_dl_class(void)
 
 static void switched_from_dl(struct rq *rq, struct task_struct *p)
 {
+	__dl_sub_ac(task_rq(p), p->dl.dl_bw);
+
 	/*
 	 * Start the deadline timer; if we switch back to dl before this we'll
 	 * continue consuming our current CBS slice. If we stay outside of
@@ -1736,6 +1738,7 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
  */
 static void switched_to_dl(struct rq *rq, struct task_struct *p)
 {
+	__dl_add_ac(rq, p->dl.dl_bw);
 	if (task_on_rq_queued(p) && rq->curr != p) {
 #ifdef CONFIG_SMP
 		if (p->nr_cpus_allowed > 1 && rq->dl.overloaded)
-- 
2.5.0


[-- Attachment #3: 0002-Move-the-remaining-__dl_-sub-add-_ac-calls-from-core.patch --]
[-- Type: text/x-patch, Size: 3602 bytes --]

>From e9703fcf14722111e16657ba4e9b08055fb4ba30 Mon Sep 17 00:00:00 2001
From: Luca Abeni <luca.abeni@unitn.it>
Date: Thu, 11 Feb 2016 15:33:23 +0100
Subject: [PATCH 2/4] Move the remaining __dl_{sub,add}_ac() calls from core.c
 to deadline.c

To move these calls from dl_overflow() to deadline.c, we must change
the meaning of the third parameter of prio_changed_dl().
Instead of passing the "old priority" (which is always equal to the current
one, for SCHED_DEADLINE) we pass the old utilization. An alternative approach
is to change the prototype of the "prio_changed" method of the scheduling
class.
---
 kernel/sched/core.c     | 10 ++++++----
 kernel/sched/deadline.c | 10 +++++++---
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a4f08d1..5dc12db 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2449,9 +2449,7 @@ static int dl_overflow(struct task_struct *p, int policy,
 	} else if (dl_policy(policy) && task_has_dl_policy(p) &&
 		   !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
 		__dl_clear(dl_b, p->dl.dl_bw);
-		__dl_sub_ac(task_rq(p), p->dl.dl_bw);
 		__dl_add(dl_b, new_bw);
-		__dl_add_ac(task_rq(p), new_bw);
 		err = 0;
 	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
 		__dl_clear(dl_b, p->dl.dl_bw);
@@ -3522,6 +3520,7 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 		} else
 			p->dl.dl_boosted = 0;
 		p->sched_class = &dl_sched_class;
+		oldprio = 0;
 	} else if (rt_prio(prio)) {
 		if (dl_prio(oldprio))
 			p->dl.dl_boosted = 0;
@@ -3891,7 +3890,7 @@ static int __sched_setscheduler(struct task_struct *p,
 {
 	int newprio = dl_policy(attr->sched_policy) ? MAX_DL_PRIO - 1 :
 		      MAX_RT_PRIO - 1 - attr->sched_priority;
-	int retval, oldprio, oldpolicy = -1, queued, running;
+	int retval, oldprio, oldbw, oldpolicy = -1, queued, running;
 	int new_effective_prio, policy = attr->sched_policy;
 	unsigned long flags;
 	const struct sched_class *prev_class;
@@ -4069,6 +4068,7 @@ change:
 
 	p->sched_reset_on_fork = reset_on_fork;
 	oldprio = p->prio;
+	oldbw = p->dl.dl_bw;
 
 	if (pi) {
 		/*
@@ -4081,6 +4081,8 @@ change:
 		new_effective_prio = rt_mutex_get_effective_prio(p, newprio);
 		if (new_effective_prio == oldprio) {
 			__setscheduler_params(p, attr);
+			if (p->sched_class == &dl_sched_class)
+				p->sched_class->prio_changed(rq, p, oldbw);
 			task_rq_unlock(rq, p, &flags);
 			return 0;
 		}
@@ -4110,7 +4112,7 @@ change:
 		enqueue_task(rq, p, enqueue_flags);
 	}
 
-	check_class_changed(rq, p, prev_class, oldprio);
+	check_class_changed(rq, p, prev_class, ((prev_class == &dl_sched_class) && (p->sched_class == &dl_sched_class)) ? oldbw : oldprio);
 	preempt_disable(); /* avoid rq from going away on us */
 	task_rq_unlock(rq, p, &flags);
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 4cc713a..959e7b7 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1757,8 +1757,13 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
  * a push or pull operation might be needed.
  */
 static void prio_changed_dl(struct rq *rq, struct task_struct *p,
-			    int oldprio)
+			    int oldbw)
 {
+	if (oldbw) {
+		__dl_sub_ac(rq, oldbw);
+		__dl_add_ac(rq, p->dl.dl_bw);
+	}
+
 	if (task_on_rq_queued(p) || rq->curr == p) {
 #ifdef CONFIG_SMP
 		/*
@@ -1785,8 +1790,7 @@ static void prio_changed_dl(struct rq *rq, struct task_struct *p,
 		 */
 		resched_curr(rq);
 #endif /* CONFIG_SMP */
-	} else
-		switched_to_dl(rq, p);
+	}
 }
 
 const struct sched_class dl_sched_class = {
-- 
2.5.0


[-- Attachment #4: 0003-Remove-dl_new.patch --]
[-- Type: text/x-patch, Size: 3200 bytes --]

>From 12466b28951fc9327af66e450d046cd232a41354 Mon Sep 17 00:00:00 2001
From: Luca Abeni <luca.abeni@unitn.it>
Date: Fri, 19 Feb 2016 12:22:59 +0100
Subject: [PATCH 3/4] Remove dl_new

switched_to_dl() can be used instead
---
 kernel/sched/core.c     |  1 -
 kernel/sched/deadline.c | 28 +++++-----------------------
 2 files changed, 5 insertions(+), 24 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5dc12db..4246b1b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2183,7 +2183,6 @@ void __dl_clear_params(struct task_struct *p)
 	dl_se->dl_bw = 0;
 
 	dl_se->dl_throttled = 0;
-	dl_se->dl_new = 1;
 	dl_se->dl_yielded = 0;
 }
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 959e7b7..12cb934 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -355,8 +355,6 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se,
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
 
-	WARN_ON(!dl_se->dl_new || dl_se->dl_throttled);
-
 	/*
 	 * We use the regular wall clock time to set deadlines in the
 	 * future; in fact, we must consider execution overheads (time
@@ -364,7 +362,6 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se,
 	 */
 	dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
 	dl_se->runtime = pi_se->dl_runtime;
-	dl_se->dl_new = 0;
 }
 
 /*
@@ -503,15 +500,6 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
 
-	/*
-	 * The arrival of a new instance needs special treatment, i.e.,
-	 * the actual scheduling parameters have to be "renewed".
-	 */
-	if (dl_se->dl_new) {
-		setup_new_dl_entity(dl_se, pi_se);
-		return;
-	}
-
 	if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
 	    dl_entity_overflow(dl_se, pi_se, rq_clock(rq))) {
 		dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
@@ -608,16 +596,6 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 	}
 
 	/*
-	 * This is possible if switched_from_dl() raced against a running
-	 * callback that took the above !dl_task() path and we've since then
-	 * switched back into SCHED_DEADLINE.
-	 *
-	 * There's nothing to do except drop our task reference.
-	 */
-	if (dl_se->dl_new)
-		goto unlock;
-
-	/*
 	 * The task might have been boosted by someone else and might be in the
 	 * boosting/deboosting path, its not throttled.
 	 */
@@ -920,7 +898,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
 	 * parameters of the task might need updating. Otherwise,
 	 * we want a replenishment of its runtime.
 	 */
-	if (dl_se->dl_new || flags & ENQUEUE_WAKEUP)
+	if (flags & ENQUEUE_WAKEUP)
 		update_dl_entity(dl_se, pi_se);
 	else if (flags & ENQUEUE_REPLENISH)
 		replenish_dl_entity(dl_se, pi_se);
@@ -1738,6 +1716,10 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
  */
 static void switched_to_dl(struct rq *rq, struct task_struct *p)
 {
+	if (p->dl.deadline <= rq_clock(rq)) {
+		setup_new_dl_entity(&p->dl, &p->dl);
+	}
+
 	__dl_add_ac(rq, p->dl.dl_bw);
 	if (task_on_rq_queued(p) && rq->curr != p) {
 #ifdef CONFIG_SMP
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-19 13:43         ` luca abeni
@ 2016-02-19 14:20           ` Steven Rostedt
  2016-02-19 14:53             ` luca abeni
  0 siblings, 1 reply; 46+ messages in thread
From: Steven Rostedt @ 2016-02-19 14:20 UTC (permalink / raw)
  To: luca abeni
  Cc: Juri Lelli, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li

On Fri, 19 Feb 2016 14:43:47 +0100
luca abeni <luca.abeni@unitn.it> wrote:


> So, the first attached patch (to be applied over Juri's patch) just
> moves two __dl_sub_ac() and __dl_add_ac() invocations from
> dl_overflow() to deadline.c, using the switched_from_dl() and
> switched_to_dl() methods. This should cover the cases of tasks moving
> from SCHED_{OTHER,RT} to SCHED_DEADLINE and vice-versa. The case in
> which the deadline parameters of a task are changed still needs some
> __dl_* calls in dl_overflow().

Hi Luca,

Please send the patches in a patch series. It is very hard to review
patches that are attachments. And our scripts are made to apply patches
from mailing lists. Having attachments just makes the job harder.

-- Steve

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-19 14:20           ` Steven Rostedt
@ 2016-02-19 14:53             ` luca abeni
  2016-02-19 14:57               ` Steven Rostedt
  2016-02-22 11:03               ` luca abeni
  0 siblings, 2 replies; 46+ messages in thread
From: luca abeni @ 2016-02-19 14:53 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Juri Lelli, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li

On Fri, 19 Feb 2016 09:20:08 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Fri, 19 Feb 2016 14:43:47 +0100
> luca abeni <luca.abeni@unitn.it> wrote:
> 
> 
> > So, the first attached patch (to be applied over Juri's patch) just
> > moves two __dl_sub_ac() and __dl_add_ac() invocations from
> > dl_overflow() to deadline.c, using the switched_from_dl() and
> > switched_to_dl() methods. This should cover the cases of tasks
> > moving from SCHED_{OTHER,RT} to SCHED_DEADLINE and vice-versa. The
> > case in which the deadline parameters of a task are changed still
> > needs some __dl_* calls in dl_overflow().
> 
> Hi Luca,
> 
> Please send the patches in a patch series. It is very hard to review
> patches that are attachments. And our scripts are made to apply
> patches from mailing lists. Having attachments just makes the job
> harder.

Sorry about that; I was in a hurry, and I tried to do the quickest
thing... I'll resend the patches in a more appropriate way on Monday.


			Luca

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-19 14:53             ` luca abeni
@ 2016-02-19 14:57               ` Steven Rostedt
  2016-02-22 11:03               ` luca abeni
  1 sibling, 0 replies; 46+ messages in thread
From: Steven Rostedt @ 2016-02-19 14:57 UTC (permalink / raw)
  To: luca abeni
  Cc: Juri Lelli, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li

On Fri, 19 Feb 2016 15:53:06 +0100
luca abeni <luca.abeni@unitn.it> wrote:


> Sorry about that; I was in a hurry, and I tried to do the quickest
> thing... I'll resend the patches in a more appropriate way on Monday.

Thanks! No rush.

-- Steve

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 0/3] cleanup per rq tracking of admitted bandwidth
  2016-02-10 11:58       ` Juri Lelli
  2016-02-19 13:43         ` luca abeni
@ 2016-02-22 10:57         ` Luca Abeni
  2016-02-22 10:57           ` [PATCH 1/4] Move some calls to __dl_{sub,add}_ac() from core.c to deadline.c Luca Abeni
                             ` (2 more replies)
  1 sibling, 3 replies; 46+ messages in thread
From: Luca Abeni @ 2016-02-22 10:57 UTC (permalink / raw)
  To: Juri Lelli
  Cc: rostedt, linux-kernel, peterz, mingo, vincent.guittot,
	wanpeng.li, Luca Abeni

I know there are still open issues with select_fallback_rq() and
similar stuff, but as promised I worked on moving the __dl_*_ac() calls
from core.c to deadline.c (which is orthogonal with respect to the
select_fallback_rq() thing). I post these patches so that people can
see how the code would look after moving the __dl_*_ac() calls to
deadline.c and can provide some comments; the patches are not submitted
for inclusion (yet).

All the following patches have to be applied over Juri's patch.


So, patch 0001 just moves two __dl_sub_ac() and __dl_add_ac()
invocations from dl_overflow() to deadline.c, using the
switched_from_dl() and switched_to_dl() methods. This should cover
the cases of tasks moving from SCHED_{OTHER,RT} to SCHED_DEADLINE
and vice-versa. The case in which the deadline parameters of a
task are changed still needs some __dl_* calls in dl_overflow().

These calls are moved to deadline.c by patch 0002, which uses
prio_changed_dl()... Here, prio_changed_dl() needs to know the
"old" task utilization, while it is given the "old task
priority" (which, for a deadline task, is equal to the current
priority)... So, I changed the meaning of the third parameter of
prio_changed_dl(), from "old priority" to "old utilization".
I know that changing the meaning of a parameter between different
scheduling classes might be a bad idea... But the "prio_changed" method
seems to be designed for priority-based schedulers only, and does not
provide good information to the SCHED_DEADLINE scheduling class... The
alternative is to change the prototype of the function, adding an
"oldbw" (or, better, "oldperiod" and "oldruntime") parameter.
Let me know what you think about this.

Notice that patch 0002 also contains two hunks that can be extracted
from it (if needed):
1) remove the call to switched_to_dl() from prio_changed_dl(): this
   call seems to be useless
2) change __sched_setscheduler() to invoke prio_changed for deadline
   tasks changing their parameters. Currently, this method is not
   called (because when changing parameters deadline tasks do not
   change their priority, so new_effective_prio == oldprio). I suspect
   calling prio_changed is the correct behaviour; of course, this can
   be done in different ways, let me know your opinion.

I tested patches 0001+0002 in various ways, and I do not see
regressions with respect to Juri's patch (but I expect issues with
select_fallback_rq()... BTW, if anyone can provide some testcases for
it, I can try to fix that case too).

Finally, when playing with switched_to_dl() I realized that it can be
used to remove the dl_new field. So, I wrote patch 0003, which seems
to be working correctly, but I am probably missing some important
tests. Let me know what you think about it: I think it might be
a nice simplification of the code, but as usual I might be missing
something :)


Luca Abeni (4):
  Move some calls to __dl_{sub,add}_ac() from core.c to deadline.c
  Move the remaining __dl_{sub,add}_ac() calls from core.c to deadline.c
  Remove dl_new

 kernel/sched/core.c     | 13 ++++++-------
 kernel/sched/deadline.c | 43 +++++++++++++++++--------------------------
 kernel/sched/sched.h    |  5 +++++
 3 files changed, 28 insertions(+), 33 deletions(-)

-- 
2.5.0

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 1/4] Move some calls to __dl_{sub,add}_ac() from core.c to deadline.c
  2016-02-22 10:57         ` [PATCH 0/3] cleanup " Luca Abeni
@ 2016-02-22 10:57           ` Luca Abeni
  2016-02-22 10:57           ` [PATCH 2/4] Move the remaining __dl_{sub,add}_ac() calls " Luca Abeni
  2016-02-22 10:57           ` [PATCH 3/4] Remove dl_new Luca Abeni
  2 siblings, 0 replies; 46+ messages in thread
From: Luca Abeni @ 2016-02-22 10:57 UTC (permalink / raw)
  To: Juri Lelli
  Cc: rostedt, linux-kernel, peterz, mingo, vincent.guittot,
	wanpeng.li, Luca Abeni

This moves some deadline-specific calls from core.c (dl_overflow()) to
switched_from_dl() and switched_to_dl().
Some __dl_{sub,add}_ac() calls are left in dl_overflow(), to handle
the case in which the deadline parameters of a task are changed without
changing the scheduling class.
---
 kernel/sched/core.c     | 2 --
 kernel/sched/deadline.c | 3 +++
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0ee0ec2..a4f08d1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2445,7 +2445,6 @@ static int dl_overflow(struct task_struct *p, int policy,
 	if (dl_policy(policy) && !task_has_dl_policy(p) &&
 	    !__dl_overflow(dl_b, cpus, 0, new_bw)) {
 		__dl_add(dl_b, new_bw);
-		__dl_add_ac(task_rq(p), new_bw);
 		err = 0;
 	} else if (dl_policy(policy) && task_has_dl_policy(p) &&
 		   !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
@@ -2456,7 +2455,6 @@ static int dl_overflow(struct task_struct *p, int policy,
 		err = 0;
 	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
 		__dl_clear(dl_b, p->dl.dl_bw);
-		__dl_sub_ac(task_rq(p), p->dl.dl_bw);
 		err = 0;
 	}
 	raw_spin_unlock(&dl_b->lock);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 8ac0fee..4cc713a 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1710,6 +1710,8 @@ void __init init_sched_dl_class(void)
 
 static void switched_from_dl(struct rq *rq, struct task_struct *p)
 {
+	__dl_sub_ac(task_rq(p), p->dl.dl_bw);
+
 	/*
 	 * Start the deadline timer; if we switch back to dl before this we'll
 	 * continue consuming our current CBS slice. If we stay outside of
@@ -1736,6 +1738,7 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
  */
 static void switched_to_dl(struct rq *rq, struct task_struct *p)
 {
+	__dl_add_ac(rq, p->dl.dl_bw);
 	if (task_on_rq_queued(p) && rq->curr != p) {
 #ifdef CONFIG_SMP
 		if (p->nr_cpus_allowed > 1 && rq->dl.overloaded)
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 2/4] Move the remaining __dl_{sub,add}_ac() calls from core.c to deadline.c
  2016-02-22 10:57         ` [PATCH 0/3] cleanup " Luca Abeni
  2016-02-22 10:57           ` [PATCH 1/4] Move some calls to __dl_{sub,add}_ac() from core.c to deadline.c Luca Abeni
@ 2016-02-22 10:57           ` Luca Abeni
  2016-02-22 10:57           ` [PATCH 3/4] Remove dl_new Luca Abeni
  2 siblings, 0 replies; 46+ messages in thread
From: Luca Abeni @ 2016-02-22 10:57 UTC (permalink / raw)
  To: Juri Lelli
  Cc: rostedt, linux-kernel, peterz, mingo, vincent.guittot,
	wanpeng.li, Luca Abeni

To move these calls from dl_overflow() to deadline.c, we must change
the meaning of the third parameter of prio_changed_dl().
Instead of passing the "old priority" (which is always equal to the current
one, for SCHED_DEADLINE) we pass the old utilization. An alternative approach
is to change the prototype of the "prio_changed" method of the scheduling
class.
---
 kernel/sched/core.c     | 10 ++++++----
 kernel/sched/deadline.c | 10 +++++++---
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a4f08d1..5dc12db 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2449,9 +2449,7 @@ static int dl_overflow(struct task_struct *p, int policy,
 	} else if (dl_policy(policy) && task_has_dl_policy(p) &&
 		   !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
 		__dl_clear(dl_b, p->dl.dl_bw);
-		__dl_sub_ac(task_rq(p), p->dl.dl_bw);
 		__dl_add(dl_b, new_bw);
-		__dl_add_ac(task_rq(p), new_bw);
 		err = 0;
 	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
 		__dl_clear(dl_b, p->dl.dl_bw);
@@ -3522,6 +3520,7 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 		} else
 			p->dl.dl_boosted = 0;
 		p->sched_class = &dl_sched_class;
+		oldprio = 0;
 	} else if (rt_prio(prio)) {
 		if (dl_prio(oldprio))
 			p->dl.dl_boosted = 0;
@@ -3891,7 +3890,7 @@ static int __sched_setscheduler(struct task_struct *p,
 {
 	int newprio = dl_policy(attr->sched_policy) ? MAX_DL_PRIO - 1 :
 		      MAX_RT_PRIO - 1 - attr->sched_priority;
-	int retval, oldprio, oldpolicy = -1, queued, running;
+	int retval, oldprio, oldbw, oldpolicy = -1, queued, running;
 	int new_effective_prio, policy = attr->sched_policy;
 	unsigned long flags;
 	const struct sched_class *prev_class;
@@ -4069,6 +4068,7 @@ change:
 
 	p->sched_reset_on_fork = reset_on_fork;
 	oldprio = p->prio;
+	oldbw = p->dl.dl_bw;
 
 	if (pi) {
 		/*
@@ -4081,6 +4081,8 @@ change:
 		new_effective_prio = rt_mutex_get_effective_prio(p, newprio);
 		if (new_effective_prio == oldprio) {
 			__setscheduler_params(p, attr);
+			if (p->sched_class == &dl_sched_class)
+				p->sched_class->prio_changed(rq, p, oldbw);
 			task_rq_unlock(rq, p, &flags);
 			return 0;
 		}
@@ -4110,7 +4112,7 @@ change:
 		enqueue_task(rq, p, enqueue_flags);
 	}
 
-	check_class_changed(rq, p, prev_class, oldprio);
+	check_class_changed(rq, p, prev_class, ((prev_class == &dl_sched_class) && (p->sched_class == &dl_sched_class)) ? oldbw : oldprio);
 	preempt_disable(); /* avoid rq from going away on us */
 	task_rq_unlock(rq, p, &flags);
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 4cc713a..959e7b7 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1757,8 +1757,13 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
  * a push or pull operation might be needed.
  */
 static void prio_changed_dl(struct rq *rq, struct task_struct *p,
-			    int oldprio)
+			    int oldbw)
 {
+	if (oldbw) {
+		__dl_sub_ac(rq, oldbw);
+		__dl_add_ac(rq, p->dl.dl_bw);
+	}
+
 	if (task_on_rq_queued(p) || rq->curr == p) {
 #ifdef CONFIG_SMP
 		/*
@@ -1785,8 +1790,7 @@ static void prio_changed_dl(struct rq *rq, struct task_struct *p,
 		 */
 		resched_curr(rq);
 #endif /* CONFIG_SMP */
-	} else
-		switched_to_dl(rq, p);
+	}
 }
 
 const struct sched_class dl_sched_class = {
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 3/4] Remove dl_new
  2016-02-22 10:57         ` [PATCH 0/3] cleanup " Luca Abeni
  2016-02-22 10:57           ` [PATCH 1/4] Move some calls to __dl_{sub,add}_ac() from core.c to deadline.c Luca Abeni
  2016-02-22 10:57           ` [PATCH 2/4] Move the remaining __dl_{sub,add}_ac() calls " Luca Abeni
@ 2016-02-22 10:57           ` Luca Abeni
  2016-02-23 15:42             ` Peter Zijlstra
  2 siblings, 1 reply; 46+ messages in thread
From: Luca Abeni @ 2016-02-22 10:57 UTC (permalink / raw)
  To: Juri Lelli
  Cc: rostedt, linux-kernel, peterz, mingo, vincent.guittot,
	wanpeng.li, Luca Abeni

switched_to_dl() can be used instead
---
 kernel/sched/core.c     |  1 -
 kernel/sched/deadline.c | 28 +++++-----------------------
 2 files changed, 5 insertions(+), 24 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5dc12db..4246b1b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2183,7 +2183,6 @@ void __dl_clear_params(struct task_struct *p)
 	dl_se->dl_bw = 0;
 
 	dl_se->dl_throttled = 0;
-	dl_se->dl_new = 1;
 	dl_se->dl_yielded = 0;
 }
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 959e7b7..12cb934 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -355,8 +355,6 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se,
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
 
-	WARN_ON(!dl_se->dl_new || dl_se->dl_throttled);
-
 	/*
 	 * We use the regular wall clock time to set deadlines in the
 	 * future; in fact, we must consider execution overheads (time
@@ -364,7 +362,6 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se,
 	 */
 	dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
 	dl_se->runtime = pi_se->dl_runtime;
-	dl_se->dl_new = 0;
 }
 
 /*
@@ -503,15 +500,6 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
 
-	/*
-	 * The arrival of a new instance needs special treatment, i.e.,
-	 * the actual scheduling parameters have to be "renewed".
-	 */
-	if (dl_se->dl_new) {
-		setup_new_dl_entity(dl_se, pi_se);
-		return;
-	}
-
 	if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
 	    dl_entity_overflow(dl_se, pi_se, rq_clock(rq))) {
 		dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
@@ -608,16 +596,6 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 	}
 
 	/*
-	 * This is possible if switched_from_dl() raced against a running
-	 * callback that took the above !dl_task() path and we've since then
-	 * switched back into SCHED_DEADLINE.
-	 *
-	 * There's nothing to do except drop our task reference.
-	 */
-	if (dl_se->dl_new)
-		goto unlock;
-
-	/*
 	 * The task might have been boosted by someone else and might be in the
 	 * boosting/deboosting path, its not throttled.
 	 */
@@ -920,7 +898,7 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se,
 	 * parameters of the task might need updating. Otherwise,
 	 * we want a replenishment of its runtime.
 	 */
-	if (dl_se->dl_new || flags & ENQUEUE_WAKEUP)
+	if (flags & ENQUEUE_WAKEUP)
 		update_dl_entity(dl_se, pi_se);
 	else if (flags & ENQUEUE_REPLENISH)
 		replenish_dl_entity(dl_se, pi_se);
@@ -1738,6 +1716,10 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
  */
 static void switched_to_dl(struct rq *rq, struct task_struct *p)
 {
+	if (p->dl.deadline <= rq_clock(rq)) {
+		setup_new_dl_entity(&p->dl, &p->dl);
+	}
+
 	__dl_add_ac(rq, p->dl.dl_bw);
 	if (task_on_rq_queued(p) && rq->curr != p) {
 #ifdef CONFIG_SMP
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-19 14:53             ` luca abeni
  2016-02-19 14:57               ` Steven Rostedt
@ 2016-02-22 11:03               ` luca abeni
  1 sibling, 0 replies; 46+ messages in thread
From: luca abeni @ 2016-02-22 11:03 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Juri Lelli, linux-kernel, peterz, mingo, vincent.guittot, wanpeng.li

On Fri, 19 Feb 2016 15:53:06 +0100
luca abeni <luca.abeni@unitn.it> wrote:
[...]
> > Please send the patches in a patch series. It is very hard to review
> > patches that are attachments. And our scripts are made to apply
> > patches from mailing lists. Having attachments just makes the job
> > harder.
> 
> Sorry about that; I was in hurry, and I tried to do the quickest
> thing... I'll resend the patches in a more appropriate way on Monday.

Ok; I just re-sent the patches with git send-email. Hopefully, I
managed to send them as a reply to Juri's original patch, so that the
context of the discussion is not lost.

I just realized that the subjects of the emails say "[*/4]" instead of
"[*/3]" because of a stupid error of mine; I hope this is not a problem
(the patches are sent only for discussion). If this is a problem, I can
resend, of course.


			Thanks (and sorry for the mess),
				Luca

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/4] Remove dl_new
  2016-02-22 10:57           ` [PATCH 3/4] Remove dl_new Luca Abeni
@ 2016-02-23 15:42             ` Peter Zijlstra
  2016-02-24 13:53               ` luca abeni
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2016-02-23 15:42 UTC (permalink / raw)
  To: Luca Abeni
  Cc: Juri Lelli, rostedt, linux-kernel, mingo, vincent.guittot, wanpeng.li

On Mon, Feb 22, 2016 at 11:57:04AM +0100, Luca Abeni wrote:
> switched_to_dl() can be used instead

This seems unrelated to the other patches, and looks like a nice
cleanup.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-10 13:42       ` Juri Lelli
@ 2016-02-23 15:48         ` Peter Zijlstra
  2016-02-23 15:51           ` Juri Lelli
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2016-02-23 15:48 UTC (permalink / raw)
  To: Juri Lelli
  Cc: luca abeni, rostedt, linux-kernel, mingo, vincent.guittot, wanpeng.li

On Wed, Feb 10, 2016 at 01:42:40PM +0000, Juri Lelli wrote:
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 9503d59..0ee0ec2 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -2432,7 +2432,7 @@ static int dl_overflow(struct task_struct *p,
> > > int policy, u64 new_bw = dl_policy(policy) ? to_ratio(period,
> > > runtime) : 0; int cpus, err = -1;
> > >  
> > > -	if (new_bw == p->dl.dl_bw)
> > > +	if (task_has_dl_policy(p) && new_bw == p->dl.dl_bw)
> > >  		return 0;
> > 
> > This hunk actually fixes issue 2) mentioned above, so I think it should
> > be committed in a short time (independently from the rest of the
> > patch). And maybe is a good candidate for backporting to stable kernels?
> > 
> 
> Yes, this is a sensible fix per se. I can split it and send it
> separately.

Did you ever send that?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-23 15:48         ` Peter Zijlstra
@ 2016-02-23 15:51           ` Juri Lelli
  0 siblings, 0 replies; 46+ messages in thread
From: Juri Lelli @ 2016-02-23 15:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: luca abeni, rostedt, linux-kernel, mingo, vincent.guittot, wanpeng.li

On 23/02/16 16:48, Peter Zijlstra wrote:
> On Wed, Feb 10, 2016 at 01:42:40PM +0000, Juri Lelli wrote:
> > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > > index 9503d59..0ee0ec2 100644
> > > > --- a/kernel/sched/core.c
> > > > +++ b/kernel/sched/core.c
> > > > @@ -2432,7 +2432,7 @@ static int dl_overflow(struct task_struct *p,
> > > > int policy, u64 new_bw = dl_policy(policy) ? to_ratio(period,
> > > > runtime) : 0; int cpus, err = -1;
> > > >  
> > > > -	if (new_bw == p->dl.dl_bw)
> > > > +	if (task_has_dl_policy(p) && new_bw == p->dl.dl_bw)
> > > >  		return 0;
> > > 
> > > This hunk actually fixes issue 2) mentioned above, so I think it should
> > > be committed in a short time (independently from the rest of the
> > > patch). And maybe is a good candidate for backporting to stable kernels?
> > > 
> > 
> > Yes, this is a sensible fix per se. I can split it and send it
> > separately.
> 
> Did you ever send that?
> 

No, I didn't. Will do ASAP.

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/4] Remove dl_new
  2016-02-23 15:42             ` Peter Zijlstra
@ 2016-02-24 13:53               ` luca abeni
  2016-02-25  9:46                 ` Juri Lelli
  0 siblings, 1 reply; 46+ messages in thread
From: luca abeni @ 2016-02-24 13:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, rostedt, linux-kernel, mingo, vincent.guittot, wanpeng.li

On Tue, 23 Feb 2016 16:42:49 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Feb 22, 2016 at 11:57:04AM +0100, Luca Abeni wrote:
> > switched_to_dl() can be used instead
> 
> This seems unrelated to the other patches, and looks like a nice
> cleanup.

Ok; I'll rebase the patch on master and I'll run some more serious
tests (Juri, your tests repository is available on github, right? Can I
assume that if the patch passes your tests then it is ok?).
If everything goes well, I'll submit the patch.


			Thanks,
				Luca

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-12 17:05                         ` Peter Zijlstra
  2016-02-12 17:19                           ` Juri Lelli
@ 2016-02-24 19:17                           ` Peter Zijlstra
  2016-02-24 21:46                             ` luca abeni
  2016-02-25 10:07                             ` Juri Lelli
  1 sibling, 2 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-02-24 19:17 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Steven Rostedt, luca abeni, linux-kernel, mingo, vincent.guittot,
	wanpeng.li

On Fri, Feb 12, 2016 at 06:05:30PM +0100, Peter Zijlstra wrote:
> Having two separate means of accounting this also feels more fragile
> than one would want.
> 
> Let me think a bit about this.

I think there's a fundamental problem that makes the whole notion of
per-rq accounting 'impossible'.

On hot-unplug we only migrate runnable tasks, all blocked tasks remain
on the dead cpu. This would very much include their bandwidth
requirements.

This means that between a hot-unplug and the moment that _all_ those
blocked tasks have run at least once, the sum of online bandwidth
doesn't match and we can get into admission trouble (same for GRUB,
which can also use per-rq bw like this).

The main problem is that there is no real way to find blocked tasks;
currently the only way is to iterate _all_ tasks and filter on
task_cpu().
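
Something along these lines is the only option today, which is clearly
not what you want to be doing from hotplug context (sketch only; the
dead_cpu parameter and the function itself are made up):

	static u64 dl_blocked_bw_of(int dead_cpu)
	{
		struct task_struct *g, *p;
		u64 bw = 0;

		/* O(nr_tasks): walk everything and filter on task_cpu() */
		rcu_read_lock();
		for_each_process_thread(g, p) {
			if (dl_task(p) && task_cpu(p) == dead_cpu &&
			    !task_on_rq_queued(p))
				bw += p->dl.dl_bw;
		}
		rcu_read_unlock();

		return bw;
	}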

We could of course add a blocked tree/list for deadline tasks, to
explicitly keep track of all these; this would allow migrating blocked
tasks on hotplug and avoid the real ugly bits, I think. But I've not tried
yet.
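
A very rough sketch of the idea, purely to illustrate it (field and
function names are made up, nothing here is tested):

	/* in struct sched_dl_entity:	struct list_head blocked_node; */
	/* in struct dl_rq:		struct list_head blocked_list;  */

	/* task really blocks (dequeue, not a migration): remember it */
	static void dl_track_blocked(struct dl_rq *dl_rq, struct task_struct *p)
	{
		list_add_tail(&p->dl.blocked_node, &dl_rq->blocked_list);
	}

	/* task wakes up again: forget it */
	static void dl_untrack_blocked(struct task_struct *p)
	{
		list_del_init(&p->dl.blocked_node);
	}

rq_offline_dl() could then walk blocked_list, move those tasks (and their
dl_bw) to an online rq, and the per-rq sums would match again.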

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-24 19:17                           ` Peter Zijlstra
@ 2016-02-24 21:46                             ` luca abeni
  2016-02-25  7:53                               ` Peter Zijlstra
  2016-02-25 10:07                             ` Juri Lelli
  1 sibling, 1 reply; 46+ messages in thread
From: luca abeni @ 2016-02-24 21:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, Steven Rostedt, linux-kernel, mingo, vincent.guittot,
	wanpeng.li

Hi,

On Wed, 24 Feb 2016 20:17:52 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Feb 12, 2016 at 06:05:30PM +0100, Peter Zijlstra wrote:
> > Having two separate means of accounting this also feels more fragile
> > than one would want.
> > 
> > Let me think a bit about this.
> 
> I think there's a fundamental problem that makes the whole notion of
> per-rq accounting 'impossible'.
> 
> On hot-unplug we only migrate runnable tasks, all blocked tasks remain
> on the dead cpu. This would very much include their bandwidth
> requirements.
> 
> This means that between a hot-unplug and the moment that _all_ those
> blocked tasks have run at least once, the sum of online bandwidth
> doesn't match and we can get into admission trouble (same for GRUB,
> which can also use per-rq bw like this).

After Juri's patch and emails, I tried to think about the CPU
hot-(un)plugging issues, and to check if/how they affect GRUB
reclaiming...

I arrived at the conclusion that for GRUB this is not a problem (but,
as usual, I might be wrong): GRUB just needs to track the per-runqueue
active/inactive utilization, and is not badly affected by the fact that
inactive utilization is migrated "too late" (when a task wakes up
instead of when the CPU goes offline). This is because GRUB does not
care about "global" utilization, but considers the various runqueues in
isolation (there is a flavor of the m-grub algorithm that uses global
inactive utilization, but it is not implemented by the patches I
submitted).
In other words: Juri's patch uses per-runqueue utilizations to re-build
the global utilization, while GRUB does not care whether the sum of the
"active utilizations" matches the utilization used for admission
control.
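
To illustrate the difference, the per-rq bookkeeping GRUB needs is
roughly the following (a simplified sketch; field and function names are
only for illustration, not the actual RFC code):

	/* task becomes active on this rq (wakeup, or migration in) */
	static void add_active_bw(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se)
	{
		dl_rq->active_bw += dl_se->dl_bw;
	}

	/* task's 0-lag time elapsed, or it moved away: it no longer counts */
	static void sub_active_bw(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se)
	{
		dl_rq->active_bw -= dl_se->dl_bw;
	}

Reclaiming only ever looks at the local active_bw, so there is no global
sum that has to be rebuilt after a root_domain change.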

I still have to check some details, and to run some more tests with CPU
hot-(un)plug (and this is why I did not send a v2 of the reclaiming RFC
yet)... In particular, I need to check what happens if the "inactive
timer" fires when the CPU on which the task was running is already
offline.



			Thanks,
				Luca

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-24 21:46                             ` luca abeni
@ 2016-02-25  7:53                               ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-02-25  7:53 UTC (permalink / raw)
  To: luca abeni
  Cc: Juri Lelli, Steven Rostedt, linux-kernel, mingo, vincent.guittot,
	wanpeng.li

On Wed, Feb 24, 2016 at 10:46:43PM +0100, luca abeni wrote:
> 
> I arrived to the conclusion that for GRUB this is not a problem (but,
> as usual, I might be wrong): GRUB just needs to track the per-runqueue
> active/inactive utilization,

Ah! indeed, my bad.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/4] Remove dl_new
  2016-02-24 13:53               ` luca abeni
@ 2016-02-25  9:46                 ` Juri Lelli
  2016-03-03  9:03                   ` luca abeni
  0 siblings, 1 reply; 46+ messages in thread
From: Juri Lelli @ 2016-02-25  9:46 UTC (permalink / raw)
  To: luca abeni
  Cc: Peter Zijlstra, rostedt, linux-kernel, mingo, vincent.guittot,
	wanpeng.li

Hi,

On 24/02/16 14:53, luca abeni wrote:
> On Tue, 23 Feb 2016 16:42:49 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Mon, Feb 22, 2016 at 11:57:04AM +0100, Luca Abeni wrote:
> > > switched_to_dl() can be used instead
> > 
> > This seems unrelated to the other patches, and looks like a nice
> > cleanup.
> 
> Ok; I'll rebase the patch on master and I'll run some more serious
> tests (Juri, your tests repository is available on github, right? Can I
> assume that if the patch passes your tests then it is ok?).
> If everything goes well, I'll submit the patch.
> 

Yes, tests reside here https://github.com/jlelli/tests. They should give
you some confidence that things are not completely broken, but of course
they might still be broken in ways you don't notice by running such
tests. :-)

Please also run the PI-related ones; they might be important in this
case.

Anyway, I'd say that, once you are fine with it, you can submit the
patch and then we can have a second look at it.

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-24 19:17                           ` Peter Zijlstra
  2016-02-24 21:46                             ` luca abeni
@ 2016-02-25 10:07                             ` Juri Lelli
  2016-02-25 10:20                               ` Peter Zijlstra
  1 sibling, 1 reply; 46+ messages in thread
From: Juri Lelli @ 2016-02-25 10:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, luca abeni, linux-kernel, mingo, vincent.guittot,
	wanpeng.li

Hi Peter,

On 24/02/16 20:17, Peter Zijlstra wrote:
> On Fri, Feb 12, 2016 at 06:05:30PM +0100, Peter Zijlstra wrote:
> > Having two separate means of accounting this also feels more fragile
> > than one would want.
> > 
> > Let me think a bit about this.
> 
> I think there's a fundamental problem that makes the whole notion of
> per-rq accounting 'impossible'.
>
> On hot-unplug we only migrate runnable tasks, all blocked tasks remain
> on the dead cpu. This would very much include their bandwidth
> requirements.
> 
> This means that between a hot-unplug and the moment that _all_ those
> blocked tasks have run at least once, the sum of online bandwidth
> doesn't match and we can get into admission trouble (same for GRUB,
> which can also use per-rq bw like this).
> 
> The main problem is that there is no real way to find blocked tasks;
> currently the only way is to iterate _all_ tasks and filter on
> task_cpu().
> 
> We could of course add a blocked tree/list for deadline tasks, to
> explicitly keep track of all these; this would allow migrating blocked
> tasks on hotplug and avoid the real ugly I think. But I've not tried
> yet.
> 

Argh, this makes a lot of sense to me. I've actually pondered a tree/list
solution, but then decided to try the cumulative approach because it
looked nicer. But it contains holes, I'm afraid. As Luca already said,
GRUB shouldn't have these problems though.

I'll try and see what introducing a list of blocked/throttled deadline
tasks means, also considering the interaction with cpusets and such.
Maybe it's simpler than it seems.

I'm not sure this will come anytime soon, unfortunately. I'm almost 100%
on the sched-freq/schedutil discussion these days.

Anyway, do you also think that what we want to solve the root domain
issue is something based on rq_online/offline and per-rq information?
Everything else that I tried or thought of was broken/more horrible. :-/

Best,

- Juri

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-25 10:07                             ` Juri Lelli
@ 2016-02-25 10:20                               ` Peter Zijlstra
  2016-03-24  9:20                                 ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2016-02-25 10:20 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Steven Rostedt, luca abeni, linux-kernel, mingo, vincent.guittot,
	wanpeng.li

On Thu, Feb 25, 2016 at 10:07:06AM +0000, Juri Lelli wrote:
> Argh, this makes a lot of sense to me. I've actually pondered a tree/list
> solution, but then decided to try the cumulative approach because it
> looked nicer. But it contains holes, I'm afraid. As Luca already said,
> GRUB shouldn't have these problems though.
> 
> I'll try and see what introducing a list of blocked/throttled deadline
> tasks means, also considering the interaction with cpusets and such.
> Maybe it's simpler than it seems.
> 
> I'm not sure this will come anytime soon, unfortunately. I'm almost 100%
> on the sched-freq/schedutil discussion these days.

Just skip sleep and write them when it's dark outside :-)

> Anyway, do you also think that what we want to solve the root domain
> issue is something based on rq_online/offline and per-rq information?
> Everything else that I tried or thought of was broken/more horrible. :-/

I was still trying to get my head around this; the above was my
suggestion for the per-rq state, but I've not thought hard on alternative
approaches to the root_domain issue.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/4] Remove dl_new
  2016-02-25  9:46                 ` Juri Lelli
@ 2016-03-03  9:03                   ` luca abeni
  2016-03-03  9:28                     ` Juri Lelli
  0 siblings, 1 reply; 46+ messages in thread
From: luca abeni @ 2016-03-03  9:03 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Peter Zijlstra, rostedt, linux-kernel, mingo, vincent.guittot,
	wanpeng.li

On Thu, 25 Feb 2016 09:46:55 +0000
Juri Lelli <juri.lelli@arm.com> wrote:

> Hi,
> 
> On 24/02/16 14:53, luca abeni wrote:
> > On Tue, 23 Feb 2016 16:42:49 +0100
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > On Mon, Feb 22, 2016 at 11:57:04AM +0100, Luca Abeni wrote:
> > > > switched_to_dl() can be used instead
> > > 
> > > This seems unrelated to the other patches, and looks like a nice
> > > cleanup.
> > 
> > Ok; I'll rebase the patch on master and I'll run some more serious
> > tests (Juri, your tests repository is available on github, right?
> > Can I assume that if the patch passes your tests then it is ok?).
> > If everything goes well, I'll submit the patch.
> > 
> 
> Yes, tests reside here https://github.com/jlelli/tests. They should
> give you some confidence that things are not completely broken, but
> of course they might be still broken and you do not notice by running
> such tests. :-)
I am trying these tests, but... Some scripts use "schedtool"; where can
I find a proper version of it (supporting SCHED_DEADLINE)?
I tried https://github.com/scheduler-tools/schedtool-dl but it does not
seem to work correctly...



			Thanks,
				Luca

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/4] Remove dl_new
  2016-03-03  9:03                   ` luca abeni
@ 2016-03-03  9:28                     ` Juri Lelli
  2016-03-03 14:23                       ` Steven Rostedt
  0 siblings, 1 reply; 46+ messages in thread
From: Juri Lelli @ 2016-03-03  9:28 UTC (permalink / raw)
  To: luca abeni
  Cc: Peter Zijlstra, rostedt, linux-kernel, mingo, vincent.guittot,
	wanpeng.li

Hi Luca,

On 03/03/16 10:03, Luca Abeni wrote:
> On Thu, 25 Feb 2016 09:46:55 +0000
> Juri Lelli <juri.lelli@arm.com> wrote:
> 
> > Hi,
> > 
> > On 24/02/16 14:53, luca abeni wrote:
> > > On Tue, 23 Feb 2016 16:42:49 +0100
> > > Peter Zijlstra <peterz@infradead.org> wrote:
> > > 
> > > > On Mon, Feb 22, 2016 at 11:57:04AM +0100, Luca Abeni wrote:
> > > > > switched_to_dl() can be used instead
> > > > 
> > > > This seems unrelated to the other patches, and looks like a nice
> > > > cleanup.
> > > 
> > > Ok; I'll rebase the patch on master and I'll run some more serious
> > > tests (Juri, your tests repository is available on github, right?
> > > Can I assume that if the patch passes your tests then it is ok?).
> > > If everything goes well, I'll submit the patch.
> > > 
> > 
> > Yes, tests reside here https://github.com/jlelli/tests. They should
> > give you some confidence that things are not completely broken, but
> > of course they might be still broken and you do not notice by running
> > such tests. :-)
> I am trying these tests, but... Some scripts use "schedtool"; where can
> I find a proper version of it (supporting SCHED_DEADLINE)?
> I tried https://github.com/scheduler-tools/schedtool-dl but it does not
> seem to work correctly...
> 

That's the one that I use, and I'm not seeing any problems with it. I'll
send you the binary in private.

Best,

- Juri

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/4] Remove dl_new
  2016-03-03  9:28                     ` Juri Lelli
@ 2016-03-03 14:23                       ` Steven Rostedt
  2016-03-03 14:31                         ` luca abeni
  2016-03-03 16:12                         ` Juri Lelli
  0 siblings, 2 replies; 46+ messages in thread
From: Steven Rostedt @ 2016-03-03 14:23 UTC (permalink / raw)
  To: Juri Lelli
  Cc: luca abeni, Peter Zijlstra, linux-kernel, mingo, vincent.guittot,
	wanpeng.li

On Thu, 3 Mar 2016 09:28:01 +0000
Juri Lelli <juri.lelli@arm.com> wrote:

> That's the one that I use, and I'm not seeing any problems with it. I'll
> send you the binary in private.

That's the one I use too. BTW, Juri, do you plan on submitting patches
to schedtool upstream?

-- Steve

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/4] Remove dl_new
  2016-03-03 14:23                       ` Steven Rostedt
@ 2016-03-03 14:31                         ` luca abeni
  2016-03-03 16:12                         ` Juri Lelli
  1 sibling, 0 replies; 46+ messages in thread
From: luca abeni @ 2016-03-03 14:31 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Juri Lelli, Peter Zijlstra, linux-kernel, mingo, vincent.guittot,
	wanpeng.li

Hi Steven,

On Thu, 3 Mar 2016 09:23:44 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Thu, 3 Mar 2016 09:28:01 +0000
> Juri Lelli <juri.lelli@arm.com> wrote:
> 
> > That's the one that I use, and I'm not seeing any problems with it.
> > I'll send you the binary in private.
> 
> That's the one I use too.
Juri provided me with a working binary, and I think I found the cause of
the issue: it works fine on 64bit systems, but fails on 32bit systems.
I think the issue is in the sched_setattr() definition present in
syscall_magic.h (which ignores the "flags" parameter).
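For reference, a wrapper that forwards all three arguments to the raw syscall
would look roughly like the sketch below. The struct layout follows the
sched_setattr(2) uapi definition; the helper itself is only illustrative and
is not what syscall_magic.h actually ships:

/*
 * Sketch of a sched_setattr() wrapper that forwards the "flags"
 * argument to the kernel instead of dropping it.
 */
#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	/* SCHED_DEADLINE parameters, in nanoseconds */
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
};

static inline int sched_setattr(pid_t pid, const struct sched_attr *attr,
				unsigned int flags)
{
	/* All three arguments reach the kernel; flags must be 0 for now. */
	return syscall(__NR_sched_setattr, pid, attr, flags);
}

Filling size = sizeof(attr), sched_policy = SCHED_DEADLINE (from the uapi
headers) and runtime/deadline/period in nanoseconds, then calling
sched_setattr(0, &attr, 0), should then behave the same on 32-bit and 64-bit.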


			Luca

> BTW, Juri, do you plan on submitting patches
> to schedtool upstream?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/4] Remove dl_new
  2016-03-03 14:23                       ` Steven Rostedt
  2016-03-03 14:31                         ` luca abeni
@ 2016-03-03 16:12                         ` Juri Lelli
  1 sibling, 0 replies; 46+ messages in thread
From: Juri Lelli @ 2016-03-03 16:12 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: luca abeni, Peter Zijlstra, linux-kernel, mingo, vincent.guittot,
	wanpeng.li

Hi Steve,

On 03/03/16 09:23, Steven Rostedt wrote:
> On Thu, 3 Mar 2016 09:28:01 +0000
> Juri Lelli <juri.lelli@arm.com> wrote:
> 
> > That's the one that I use, and I'm not seeing any problems with it. I'll
> > send you the binary in private.
> 
> That's the one I use too. BTW, Juri, do you plan on submitting patches
> to schedtool upstream?
> 

Good point, I should. But I don't have any plans ATM :-/. OTOH, if
anyone else wants to do that I'll be more than happy. :-)

Best,

- Juri

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth
  2016-02-25 10:20                               ` Peter Zijlstra
@ 2016-03-24  9:20                                 ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-03-24  9:20 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Steven Rostedt, luca abeni, linux-kernel, mingo, vincent.guittot,
	wanpeng.li

On Thu, Feb 25, 2016 at 11:20:34AM +0100, Peter Zijlstra wrote:
> On Thu, Feb 25, 2016 at 10:07:06AM +0000, Juri Lelli wrote:
> > Argh, this makes lot of sense to me. I've actually pondered a tree/list
> > solution, but then decided to try the cumulative approach because it
> > looked nicer. But it contains holes, I'm afraid. As Luca already said,
> > GRUB shouldn't have these problems though.
> > 
> > I'll try and see what introducting a list of blocked/throttled deadline
> > tasks means, considering also the interaction with cpusets and such.
> > Maybe it's simpler than it seems.
> > 
> > I'm not sure this will come anytime soon, unfortunately. I'm almost 100%
> > on the sched-freq/schedutil discussion these days.
> 
> Just skip sleep and write them when its dark outside :-)
> 
> > Anyway, do you also think that what we want to solve the root domain
> > issue is something based on rq_online/offline and per-rq information?
> > Everything else that I tried or thought of was broken/more horrible. :-/
> 
> I was still trying to get my head around this, the above was my
> suggestion to the per-rq state, but I've not thought hard on alternative
> approaches to the root_domain issue.

So the below is the inactive list; it seems to not insta-explode when I
run a few simple dl proglets.

I don't particularly like it because it makes wakeups (esp. cross-cpu
ones) more expensive for the benefit of hotplug/cpusets which is
something that 'never' happens.

So what I'm going to try and do is forget all about this here patch and
see what I can do with a full task-list iteration on rebuild. But I
figured that since I wrote it and it might work, I might as well post
it.
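For comparison, that task-list walk would be along these lines. This is
purely an illustrative sketch and not part of the patch below: the helper
name and the locking context are hypothetical, and it reuses the same
total_bw arithmetic that task_dead_dl() already does.

/*
 * Illustrative only: re-derive a root_domain's admitted deadline
 * bandwidth by walking every task after a domain rebuild, instead of
 * tracking blocked tasks on per-rq inactive lists.
 */
static void dl_bw_reaccount(struct dl_bw *dl_b, const struct cpumask *span)
{
	struct task_struct *g, *p;

	raw_spin_lock(&dl_b->lock);
	dl_b->total_bw = 0;

	rcu_read_lock();
	for_each_process_thread(g, p) {
		if (!dl_task(p))
			continue;
		/* Only tasks that can run inside this root_domain. */
		if (!cpumask_intersects(&p->cpus_allowed, span))
			continue;
		/* Same accounting as task_dead_dl(), in reverse. */
		dl_b->total_bw += p->dl.dl_bw;
	}
	rcu_read_unlock();

	raw_spin_unlock(&dl_b->lock);
}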

---
 include/linux/sched.h   |   5 ++
 kernel/sched/core.c     |   6 ++-
 kernel/sched/deadline.c | 118 ++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/fair.c     |   2 +-
 kernel/sched/sched.h    |   7 ++-
 5 files changed, 132 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c617ea12c6b7..d9848eac35f2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1351,6 +1351,11 @@ struct sched_dl_entity {
 	 * own bandwidth to be enforced, thus we need one timer per task.
 	 */
 	struct hrtimer dl_timer;
+
+#ifdef CONFIG_SMP
+	struct list_head dl_inactive_entry;
+	int		 dl_inactive_cpu;
+#endif
 };
 
 union rcu_special {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0b21e7a724e1..7f3fab6349a4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1162,7 +1162,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 
 	if (task_cpu(p) != new_cpu) {
 		if (p->sched_class->migrate_task_rq)
-			p->sched_class->migrate_task_rq(p);
+			p->sched_class->migrate_task_rq(p, new_cpu);
 		p->se.nr_migrations++;
 		perf_event_task_migrate(p);
 	}
@@ -2077,6 +2077,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	RB_CLEAR_NODE(&p->dl.rb_node);
 	init_dl_task_timer(&p->dl);
 	__dl_clear_params(p);
+#ifdef CONFIG_SMP
+	INIT_LIST_HEAD(&p->dl.dl_inactive_entry);
+#endif
 
 	INIT_LIST_HEAD(&p->rt.run_list);
 	p->rt.timeout		= 0;
@@ -5397,6 +5400,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
 		migrate_tasks(rq);
 		BUG_ON(rq->nr_running != 1); /* the migration thread */
 		raw_spin_unlock_irqrestore(&rq->lock, flags);
+		migrate_inactive_dl(rq);
 		break;
 
 	case CPU_DEAD:
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c7a036facbe1..f999b8bb6fea 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -80,6 +80,9 @@ void init_dl_rq(struct dl_rq *dl_rq)
 	dl_rq->dl_nr_migratory = 0;
 	dl_rq->overloaded = 0;
 	dl_rq->pushable_dl_tasks_root = RB_ROOT;
+
+	raw_spin_lock_init(&dl_rq->dl_inactive_lock);
+	INIT_LIST_HEAD(&dl_rq->dl_inactive_list);
 #else
 	init_dl_bw(&dl_rq->dl_bw);
 #endif
@@ -289,6 +292,62 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
 	return later_rq;
 }
 
+static void enqueue_inactive(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	raw_spin_lock(&dl_rq->dl_inactive_lock);
+	WRITE_ONCE(dl_se->dl_inactive_cpu, rq_of_dl_rq(dl_rq)->cpu);
+	list_add(&dl_se->dl_inactive_entry, &dl_rq->dl_inactive_list);
+	raw_spin_unlock(&dl_rq->dl_inactive_lock);
+}
+
+static void dequeue_inactive(struct sched_dl_entity *dl_se)
+{
+	int tmp, cpu = READ_ONCE(dl_se->dl_inactive_cpu);
+	struct rq *rq;
+
+again:
+	if (cpu == -1)
+		return;
+	rq = cpu_rq(cpu);
+
+	raw_spin_lock(&rq->dl.dl_inactive_lock);
+	tmp = READ_ONCE(dl_se->dl_inactive_cpu);
+	if (cpu != tmp) {
+		cpu = tmp;
+		raw_spin_unlock(&rq->dl.dl_inactive_lock);
+		goto again;
+	}
+	list_del_init(&dl_se->dl_inactive_entry);
+	WRITE_ONCE(dl_se->dl_inactive_cpu, -1);
+	raw_spin_unlock(&rq->dl.dl_inactive_lock);
+}
+
+static void migrate_inactive(struct sched_dl_entity *dl_se, int new_cpu)
+{
+	int tmp, cpu = READ_ONCE(dl_se->dl_inactive_cpu);
+	struct rq *src_rq, *dst_rq;
+
+	dst_rq = cpu_rq(new_cpu);
+again:
+	if (cpu == -1)
+		return;
+	src_rq = cpu_rq(cpu);
+
+	double_raw_lock(&src_rq->dl.dl_inactive_lock,
+			&dst_rq->dl.dl_inactive_lock);
+	tmp = READ_ONCE(dl_se->dl_inactive_cpu);
+	if (cpu != tmp) {
+		cpu = tmp;
+		raw_spin_unlock(&dst_rq->dl.dl_inactive_lock);
+		raw_spin_unlock(&src_rq->dl.dl_inactive_lock);
+		goto again;
+	}
+	list_move(&dl_se->dl_inactive_entry, &dst_rq->dl.dl_inactive_list);
+	WRITE_ONCE(dl_se->dl_inactive_cpu, new_cpu);
+	raw_spin_unlock(&dst_rq->dl.dl_inactive_lock);
+	raw_spin_unlock(&src_rq->dl.dl_inactive_lock);
+}
+
 #else
 
 static inline
@@ -327,6 +386,11 @@ static inline void queue_push_tasks(struct rq *rq)
 static inline void queue_pull_task(struct rq *rq)
 {
 }
+
+static inline void enqueue_inactive(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq) { }
+static inline void dequeue_inactive(struct sched_dl_entity *dl_se) { }
+static inline void migrate_inactive(struct sched_dl_entity *dl_se, int new_cpu) { }
+
 #endif /* CONFIG_SMP */
 
 static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
@@ -960,6 +1024,9 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH))
 		return;
 
+	if (!(flags & ENQUEUE_RESTORE))
+		dequeue_inactive(&p->dl);
+
 	enqueue_dl_entity(&p->dl, pi_se, flags);
 
 	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
@@ -970,6 +1037,8 @@ static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	dequeue_dl_entity(&p->dl);
 	dequeue_pushable_dl_task(rq, p);
+	if (!(flags & DEQUEUE_SAVE))
+		enqueue_inactive(&p->dl, &rq->dl);
 }
 
 static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
@@ -1074,6 +1143,34 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
 	resched_curr(rq);
 }
 
+static void migrate_task_rq_dl(struct task_struct *p, int new_cpu)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+
+	if (list_empty(&dl_se->dl_inactive_entry))
+		return;
+
+	migrate_inactive(dl_se, new_cpu);
+}
+
+void migrate_inactive_dl(struct rq *src_rq)
+{
+	int cpu = cpumask_any_and(src_rq->rd->online, cpu_active_mask);
+	struct rq *dst_rq = cpu_rq(cpu);
+	struct sched_dl_entity *dl_se, *tmp;
+
+	double_raw_lock(&src_rq->dl.dl_inactive_lock,
+			&dst_rq->dl.dl_inactive_lock);
+
+	list_for_each_entry_safe(dl_se, tmp, &src_rq->dl.dl_inactive_list, dl_inactive_entry) {
+		WRITE_ONCE(dl_se->dl_inactive_cpu, cpu);
+		list_move(&dl_se->dl_inactive_entry, &dst_rq->dl.dl_inactive_list);
+	}
+
+	raw_spin_unlock(&dst_rq->dl.dl_inactive_lock);
+	raw_spin_unlock(&src_rq->dl.dl_inactive_lock);
+}
+
 #endif /* CONFIG_SMP */
 
 /*
@@ -1211,13 +1308,19 @@ static void task_dead_dl(struct task_struct *p)
 {
 	struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
 
+	local_irq_disable();
+
 	/*
 	 * Since we are TASK_DEAD we won't slip out of the domain!
 	 */
-	raw_spin_lock_irq(&dl_b->lock);
+	raw_spin_lock(&dl_b->lock);
 	/* XXX we should retain the bw until 0-lag */
 	dl_b->total_bw -= p->dl.dl_bw;
-	raw_spin_unlock_irq(&dl_b->lock);
+	raw_spin_unlock(&dl_b->lock);
+
+	dequeue_inactive(&p->dl);
+
+	local_irq_enable();
 }
 
 static void set_curr_task_dl(struct rq *rq)
@@ -1702,7 +1805,12 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
 	 * this is the right place to try to pull some other one
 	 * from an overloaded cpu, if any.
 	 */
-	if (!task_on_rq_queued(p) || rq->dl.dl_nr_running)
+	if (!task_on_rq_queued(p)) {
+		dequeue_inactive(&p->dl);
+		return;
+	}
+
+	if (rq->dl.dl_nr_running)
 		return;
 
 	queue_pull_task(rq);
@@ -1728,6 +1836,9 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
 			resched_curr(rq);
 #endif
 	}
+
+	if (!task_on_rq_queued(p))
+		enqueue_inactive(&p->dl, &rq->dl);
 }
 
 /*
@@ -1779,6 +1890,7 @@ const struct sched_class dl_sched_class = {
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_dl,
+	.migrate_task_rq	= migrate_task_rq_dl,
 	.set_cpus_allowed       = set_cpus_allowed_dl,
 	.rq_online              = rq_online_dl,
 	.rq_offline             = rq_offline_dl,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 303d6392b389..04e856a85c0f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5231,7 +5231,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
  * cfs_rq_of(p) references at time of call are still valid and identify the
  * previous cpu. The caller guarantees p->pi_lock or task_rq(p)->lock is held.
  */
-static void migrate_task_rq_fair(struct task_struct *p)
+static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 {
 	/*
 	 * We are supposed to update the task to "current" time, then its up to date
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e6d4a3fa3660..0de1e2894d22 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -517,6 +517,9 @@ struct dl_rq {
 	 */
 	struct rb_root pushable_dl_tasks_root;
 	struct rb_node *pushable_dl_tasks_leftmost;
+
+	raw_spinlock_t   dl_inactive_lock;
+	struct list_head dl_inactive_list;
 #else
 	struct dl_bw dl_bw;
 #endif
@@ -776,6 +779,8 @@ extern int migrate_swap(struct task_struct *, struct task_struct *);
 
 #ifdef CONFIG_SMP
 
+extern void migrate_inactive_dl(struct rq *src_rq);
+
 static inline void
 queue_balance_callback(struct rq *rq,
 		       struct callback_head *head,
@@ -1205,7 +1210,7 @@ struct sched_class {
 
 #ifdef CONFIG_SMP
 	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
-	void (*migrate_task_rq)(struct task_struct *p);
+	void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
 
 	void (*task_waking) (struct task_struct *task);
 	void (*task_woken) (struct rq *this_rq, struct task_struct *task);

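A side note on the locking idiom used by dequeue_inactive() and
migrate_inactive() above: the entity records which CPU's list it sits on, and
removal has to re-read that CPU after taking the lock, because a concurrent
migrate may have moved it in the meantime. The same retry pattern in
isolation, as a self-contained userspace sketch with hypothetical names:

#include <pthread.h>
#include <stdatomic.h>

#define NBUCKETS 4

struct bucket {
	pthread_mutex_t lock;
	/* per-"CPU" list head would live here */
};

struct item {
	_Atomic int owner;	/* bucket index we are queued on, -1 = none */
	/* list linkage would live here */
};

static struct bucket buckets[NBUCKETS] = {
	[0 ... NBUCKETS - 1] = { .lock = PTHREAD_MUTEX_INITIALIZER },
};

/* Remove @it from whichever bucket currently owns it. */
static void remove_item(struct item *it)
{
	int cpu = atomic_load(&it->owner);

	for (;;) {
		if (cpu == -1)
			return;		/* not queued anywhere */

		pthread_mutex_lock(&buckets[cpu].lock);
		int now = atomic_load(&it->owner);
		if (now != cpu) {
			/* Migrated while we waited; retry on the new owner. */
			pthread_mutex_unlock(&buckets[cpu].lock);
			cpu = now;
			continue;
		}
		/* ... unlink from buckets[cpu]'s list ... */
		atomic_store(&it->owner, -1);
		pthread_mutex_unlock(&buckets[cpu].lock);
		return;
	}
}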
^ permalink raw reply related	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2016-03-24  9:21 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-08 12:45 [PATCH 0/2] sched/deadline: fix cpusets bandwidth accounting Juri Lelli
2016-02-08 12:45 ` [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth Juri Lelli
2016-02-10 11:32   ` Juri Lelli
2016-02-10 11:43     ` luca abeni
2016-02-10 11:58       ` Juri Lelli
2016-02-19 13:43         ` luca abeni
2016-02-19 14:20           ` Steven Rostedt
2016-02-19 14:53             ` luca abeni
2016-02-19 14:57               ` Steven Rostedt
2016-02-22 11:03               ` luca abeni
2016-02-22 10:57         ` [PATCH 0/3] cleanup " Luca Abeni
2016-02-22 10:57           ` [PATCH 1/4] Move some calls to __dl_{sub,add}_ac() from core.c to deadline.c Luca Abeni
2016-02-22 10:57           ` [PATCH 2/4] Move the remaining __dl_{sub,add}_ac() calls " Luca Abeni
2016-02-22 10:57           ` [PATCH 3/4] Remove dl_new Luca Abeni
2016-02-23 15:42             ` Peter Zijlstra
2016-02-24 13:53               ` luca abeni
2016-02-25  9:46                 ` Juri Lelli
2016-03-03  9:03                   ` luca abeni
2016-03-03  9:28                     ` Juri Lelli
2016-03-03 14:23                       ` Steven Rostedt
2016-03-03 14:31                         ` luca abeni
2016-03-03 16:12                         ` Juri Lelli
2016-02-10 12:48     ` [PATCH 1/2] sched/deadline: add per rq tracking of admitted bandwidth luca abeni
2016-02-10 13:42       ` Juri Lelli
2016-02-23 15:48         ` Peter Zijlstra
2016-02-23 15:51           ` Juri Lelli
2016-02-10 14:37     ` Steven Rostedt
2016-02-10 16:27       ` Juri Lelli
2016-02-11 12:12         ` Juri Lelli
2016-02-11 12:22           ` luca abeni
2016-02-11 12:27             ` Juri Lelli
2016-02-11 12:40               ` luca abeni
2016-02-11 12:49                 ` Juri Lelli
2016-02-11 13:05                   ` luca abeni
2016-02-11 14:25                     ` Steven Rostedt
2016-02-11 17:10                       ` Juri Lelli
2016-02-12 17:05                         ` Peter Zijlstra
2016-02-12 17:19                           ` Juri Lelli
2016-02-24 19:17                           ` Peter Zijlstra
2016-02-24 21:46                             ` luca abeni
2016-02-25  7:53                               ` Peter Zijlstra
2016-02-25 10:07                             ` Juri Lelli
2016-02-25 10:20                               ` Peter Zijlstra
2016-03-24  9:20                                 ` Peter Zijlstra
2016-02-11 21:48                       ` Luca Abeni
2016-02-08 12:45 ` [PATCH 2/2] sched/deadline: rq_{online,offline}_dl for root_domain changes Juri Lelli
