linux-kernel.vger.kernel.org archive mirror
* [RFC][PATCH 0/11] sched: SCHED_DEADLINE v2
@ 2010-02-28 19:06 Raistlin
  2010-02-28 19:15 ` [RFC][PATCH 01/11] sched: add sched_class->task_dead Raistlin
                   ` (11 more replies)
  0 siblings, 12 replies; 25+ messages in thread
From: Raistlin @ 2010-02-28 19:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	Frederic Weisbecker, Darren Hart, Henrik Austad, Johan Eker,
	p.faure, linux-kernel, Dario, Claudio Scordino,
	michael trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Nicola Manica, Luca Abeni


Hi everybody,

Here is the new version of the SCHED_DEADLINE patchset.

For those who missed the first posting[*], it is basically a new
scheduling policy (implemented inside its own scheduling class) that
introduces deadline scheduling for Linux tasks. It is being developed
by Evidence S.r.l. (http://www.evidence.eu.com) in the context of the
EU-funded project ACTORS (http://www.actors-project.eu/).

Main changes from the previous version:
 - I followed all the suggestions --about bug fixing and code
   restyling-- that Peter kindly gave us in his review;
 - in particular, I rewrote the group part from scratch, i.e.,
   bandwidth is now accounted for in a per-CPU fashion, so that the
   interface is the same as the one we already have for RT throttling;
 - I added a first draft implementation of deadline inheritance, even
   though it surely needs thorough testing and refinement.

The official page of the project is:
  http://www.evidence.eu.com/sched_deadline.html

while the development is taking place at[**]:
  http://gitorious.org/sched_deadline/pages/Home
  http://gitorious.org/sched_deadline

Still missing parts:
 - bandwidth reclaiming mechanisms, i.e., methods that avoid stopping
   overrunning tasks until their next deadline. There are at least
   three of them that are very simple, and patches are on their way;
 - porting to PREEMPT-RT is almost done --thanks to Nicola and Luca--
   and separate patches will be available on the project website soon;
 - migration of tasks through push and pull operations (as in -rt), to
   make it possible to deploy global-EDF scheduling. Patches are ready,
   they're just being tested and adapted to this latest version;
 - refinements in deadline inheritance, especially regarding the
   possibility of retaining bandwidth isolation among non-interacting
   tasks. This is being studied from both theoretical and practical
   points of view, and hopefully we can have some demonstrative code
   soon.

As Fabio said recently, I would also like to remark that we are
addressing different problems and providing the kernel with different
new features, but we can share some code, and we will definitely do
so where it makes sense.

The patchset, against Linus' master (at the time of this writing),
follows; comments and any kind of feedback are, as usual, more than
welcome.

Many thanks and Regards,
	Dario Faggioli
	Claudio Scordino
	Michael Trimarchi

[*] http://lwn.net/Articles/353797/
[**] the repositories are outdated right now; I'll push the latest
     changes and the latest versions of the utilities and testing
     programs in the next few days.

Dario Faggioli, SCHED_DEADLINE (11)
  sched: add sched_class->task_dead.
  sched: SCHED_DEADLINE policy implementation.
  sched: add extended scheduling interface.
  sched: add resource limits for -deadline tasks.
  sched: add a syscall to wait for the next instance.
  sched: add the sched-debug bits for sched_dl.
  sched: add latency tracing for -deadline tasks.
  sched: send SIGXCPU at -deadline task overruns.
  sched: first draft of deadline inheritance.
  sched: add bandwidth management for sched_dl.
  sched: add sched_dl documentation.

 Documentation/scheduler/sched-deadline.txt |  188 ++++
 arch/arm/include/asm/unistd.h              |    4 +
 arch/arm/kernel/calls.S                    |    4 +
 arch/x86/ia32/ia32entry.S                  |    4 +
 arch/x86/include/asm/unistd_32.h           |    6 +-
 arch/x86/include/asm/unistd_64.h           |    8 +
 arch/x86/kernel/syscall_table_32.S         |    4 +
 include/asm-generic/resource.h             |    7 +-
 include/linux/sched.h                      |  173 ++++
 include/linux/syscalls.h                   |    9 +
 init/Kconfig                               |   15 +
 kernel/fork.c                              |   12 +
 kernel/hrtimer.c                           |    2 +-
 kernel/sched.c                             | 1329 +++++++++++++++++++++++++++-
 kernel/sched_debug.c                       |   36 +-
 kernel/sched_dl.c                          |  787 ++++++++++++++++
 kernel/sched_fair.c                        |    2 +-
 kernel/sched_rt.c                          |    2 +-
 kernel/sysctl.c                            |   21 +
 kernel/trace/trace_sched_wakeup.c          |   44 +-
 kernel/trace/trace_selftest.c              |   30 +-
 21 files changed, 2630 insertions(+), 57 deletions(-)

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org



* [RFC][PATCH 01/11] sched: add sched_class->task_dead.
  2010-02-28 19:06 [RFC][PATCH 0/11] sched: SCHED_DEADLINE v2 Raistlin
@ 2010-02-28 19:15 ` Raistlin
  2010-02-28 19:17 ` [RFC][PATCH 02/11] sched: SCHED_DEADLINE policy implementation Raistlin
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 25+ messages in thread
From: Raistlin @ 2010-02-28 19:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	Frederic Weisbecker, Darren Hart, Henrik Austad, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni


Add a new function to the scheduling class interface. It is called
at the end of a context switch, if the task being switched out is in
TASK_DEAD state.

It might be of help in case a scheduling class wants to be
notified when one of its tasks dies, e.g. to perform some
cleanup actions.
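
As a minimal sketch of how a scheduling class might use the new hook
(the foo_* names below are purely hypothetical and only meant as an
illustration; patch 02 uses the hook to cancel the per-task bandwidth
enforcement timer of a dying -deadline task):

  /* Called from finish_task_switch() once the task is really dead. */
  static void task_dead_foo(struct task_struct *p)
  {
          /* release whatever per-task state the class keeps around */
          foo_cleanup_task(p);            /* hypothetical helper */
  }

  static const struct sched_class foo_sched_class = {
          /* ... other callbacks ... */
          .task_dead      = task_dead_foo,
  };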

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 include/linux/sched.h |    1 +
 kernel/sched.c        |    2 ++
 2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 78efe7c..d1de995 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1119,6 +1119,7 @@ struct sched_class {
 	void (*set_curr_task) (struct rq *rq);
 	void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
 	void (*task_fork) (struct task_struct *p);
+	void (*task_dead) (struct task_struct *p);
 
 	void (*switched_from) (struct rq *this_rq, struct task_struct *task,
 			       int running);
diff --git a/kernel/sched.c b/kernel/sched.c
index 3a8fb30..532dcf1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2801,6 +2801,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	if (mm)
 		mmdrop(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {
+		if (prev->sched_class->task_dead)
+			prev->sched_class->task_dead(prev);
 		/*
 		 * Remove function-return probe instances associated with this
 		 * task and put them back on the free list.
-- 
1.7.0

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org



* [RFC][PATCH 02/11] sched: SCHED_DEADLINE policy implementation.
  2010-02-28 19:06 [RFC][PATCH 0/11] sched: SCHED_DEADLINE v2 Raistlin
  2010-02-28 19:15 ` [RFC][PATCH 01/11] sched: add sched_class->task_dead Raistlin
@ 2010-02-28 19:17 ` Raistlin
  2010-04-13 18:22   ` Peter Zijlstra
                     ` (4 more replies)
  2010-02-28 19:18 ` [RFC][PATCH 03/11] sched: add extended scheduling interface Raistlin
                   ` (9 subsequent siblings)
  11 siblings, 5 replies; 25+ messages in thread
From: Raistlin @ 2010-02-28 19:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	Frederic Weisbecker, Darren Hart, Henrik Austad, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni


Add a new scheduling class, in sched_dl.c, and a new policy called
SCHED_DEADLINE. It is basically an implementation of the Earliest
Deadline First (EDF) scheduling algorithm, augmented with a mechanism
(called Constant Bandwidth Server, CBS) that makes it possible to
isolate the behaviour of tasks from one another.

The typical -deadline task is made up of a computation phase
(instance) which is activated in a periodic or sporadic fashion. The
expected (maximum) duration of such a computation is called the task's
runtime; the time interval within which each instance must be
completed is called the task's relative deadline. The task's absolute
deadline is dynamically calculated as the time instant a task (or
better, an instance) activates plus the relative deadline.
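
As an example, with made-up numbers: a task with runtime 10 ms and
relative deadline 100 ms whose instance activates at time t = 50 ms
gets an absolute deadline of 50 ms + 100 ms = 150 ms, and is expected
to need at most 10 ms of CPU time before that instant.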

The EDF algorithm selects the task with the smallest absolute
deadline as the one to be executed first, while the CBS ensures that
each task runs for at most its runtime every (relative) deadline
length time interval, avoiding any interference between different
tasks (bandwidth isolation).
Thanks to this feature, tasks that do not strictly comply with the
computational model sketched above can also effectively use the new
policy.
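
To make the bandwidth isolation concrete (again, with purely
illustrative numbers): two tasks, each with runtime 40 ms every 100 ms,
use at most 40% of a CPU each. If one of them misbehaves and tries to
run for 60 ms, the CBS throttles it as soon as its 40 ms budget is
depleted and postpones its deadline by another 100 ms, so the other
task still gets its own 40 ms before its own deadline.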

This patch:
 - defines the new scheduling policy and all the new data structures
   needed for its core implementation (e.g., task parameters storage,
   runqueues, etc.), and takes care of their initialization;
 - implements the core logic of the scheduling algorithm in the new
   scheduling class file;
 - provides all the glue code between the new scheduling class and
   the core scheduler and narrows the interactions between sched_dl
   and the other existing scheduling classes.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Michael Trimarchi <trimarchimichael@yahoo.it>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
---
 include/linux/sched.h |   61 +++++
 kernel/fork.c         |   12 +
 kernel/hrtimer.c      |    2 +-
 kernel/sched.c        |   58 ++++-
 kernel/sched_dl.c     |  606 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_fair.c   |    2 +-
 kernel/sched_rt.c     |    2 +-
 7 files changed, 731 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d1de995..cd24a7a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -38,6 +38,7 @@
 #define SCHED_BATCH		3
 /* SCHED_ISO: reserved but not implemented yet */
 #define SCHED_IDLE		5
+#define SCHED_DEADLINE		6
 /* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
 #define SCHED_RESET_ON_FORK     0x40000000
 
@@ -1228,6 +1229,53 @@ struct sched_rt_entity {
 #endif
 };
 
+/*
+ * Scheduler internal flags.
+ *
+ * These values are known only by the scheduler and are not meant to
+ * be exported to userspace. On the other hand, bits lower than the 16th
+ * are reserved for userspace to (try to) affect the scheduler's
+ * behaviour, and that's why we start from 0x10000.
+ *
+ *  @DL_NEW        tells us that a new instance arrived and that we must
+ *                 start executing it with full runtime and with absolute
+ *                 deadline just dl_deadline away from that time;
+ *  @DL_THROTTLED  tells us that the last instance exhausted the runtime
+ *                 and that the task is waiting for a replenishment to
+ *                 be performed at the next firing of dl_timer.
+ */
+#define DL_NEW			0x00010000
+#define DL_THROTTLED		0x00020000
+
+struct sched_dl_entity {
+	struct rb_node	rb_node;
+	int nr_cpus_allowed;
+
+	/*
+	 * Original parameters. They are copied here  from sched_param_ex
+	 * during sched_setscheduler_ex(), and will remain the same until the
+	 * next sched_setscheduler_ex().
+	 */
+	u64 dl_runtime;		/* maximum runtime for each instance 	*/
+	u64 dl_deadline;	/* relative deadline of each instance	*/
+
+	/*
+	 * Actual scheduling parameters. They are initialized with the
+	 * values above, but they are updated by the scheduler according
+	 * to the task's behaviour. Note that the remaining runtime could
+	 * be < 0 in case we are in overrun.
+	 */
+	s64 runtime;		/* remaining runtime for this instance	*/
+	u64 deadline;		/* absolute deadline for this instance	*/
+	unsigned int flags;
+
+	/*
+	 * Bandwidth enforcement timer. In fact, since each task has its
+	 * own bandwidth to be enforced, we need one timer per task.
+	 */
+	struct hrtimer dl_timer;
+};
+
 struct rcu_node;
 
 struct task_struct {
@@ -1250,6 +1298,7 @@ struct task_struct {
 	const struct sched_class *sched_class;
 	struct sched_entity se;
 	struct sched_rt_entity rt;
+	struct sched_dl_entity dl;
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	/* list of struct preempt_notifier: */
@@ -1603,6 +1652,18 @@ static inline int rt_task(struct task_struct *p)
 	return rt_prio(p->prio);
 }
 
+static inline int dl_policy(int policy)
+{
+	if (unlikely(policy == SCHED_DEADLINE))
+		return 1;
+	return 0;
+}
+
+static inline int dl_task(struct task_struct *p)
+{
+	return dl_policy(p->policy);
+}
+
 static inline struct pid *task_pid(struct task_struct *task)
 {
 	return task->pids[PIDTYPE_PID].pid;
diff --git a/kernel/fork.c b/kernel/fork.c
index f88bd98..10d3753 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -937,6 +937,16 @@ SYSCALL_DEFINE1(set_tid_address, int __user *, tidptr)
 	return task_pid_vnr(current);
 }
 
+static void init_task_dl_entity(struct task_struct *p)
+{
+	RB_CLEAR_NODE(&p->dl.rb_node);
+	hrtimer_init(&p->dl.dl_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+
+	p->dl.dl_runtime = p->dl.runtime = 0;
+	p->dl.dl_deadline = p->dl.deadline = 0;
+	p->dl.flags = 0;
+}
+
 static void rt_mutex_init_task(struct task_struct *p)
 {
 	raw_spin_lock_init(&p->pi_lock);
@@ -1025,6 +1035,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 
 	ftrace_graph_init_task(p);
 
+	init_task_dl_entity(p);
+
 	rt_mutex_init_task(p);
 
 #ifdef CONFIG_PROVE_LOCKING
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 0086628..e59c2de 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1571,7 +1571,7 @@ long hrtimer_nanosleep(struct timespec *rqtp, struct timespec __user *rmtp,
 	unsigned long slack;
 
 	slack = current->timer_slack_ns;
-	if (rt_task(current))
+	if (dl_task(current) || rt_task(current))
 		slack = 0;
 
 	hrtimer_init_on_stack(&t.timer, clockid, mode);
diff --git a/kernel/sched.c b/kernel/sched.c
index 532dcf1..c5ee6f9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -131,6 +131,11 @@ static inline int task_has_rt_policy(struct task_struct *p)
 	return rt_policy(p->policy);
 }
 
+static inline int task_has_dl_policy(struct task_struct *p)
+{
+	return dl_policy(p->policy);
+}
+
 /*
  * This is the priority-queue data structure of the RT scheduling class:
  */
@@ -482,6 +487,14 @@ struct rt_rq {
 #endif
 };
 
+struct dl_rq {
+	/* runqueue is an rbtree, ordered by deadline */
+	struct rb_root rb_root;
+	struct rb_node *rb_leftmost;
+
+	unsigned long dl_nr_running;
+};
+
 #ifdef CONFIG_SMP
 
 /*
@@ -544,6 +557,7 @@ struct rq {
 
 	struct cfs_rq cfs;
 	struct rt_rq rt;
+	struct dl_rq dl;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this cpu: */
@@ -1838,11 +1852,12 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 #include "sched_idletask.c"
 #include "sched_fair.c"
 #include "sched_rt.c"
+#include "sched_dl.c"
 #ifdef CONFIG_SCHED_DEBUG
 # include "sched_debug.c"
 #endif
 
-#define sched_class_highest (&rt_sched_class)
+#define sched_class_highest (&dl_sched_class)
 #define for_each_class(class) \
    for (class = sched_class_highest; class; class = class->next)
 
@@ -1858,7 +1873,7 @@ static void dec_nr_running(struct rq *rq)
 
 static void set_load_weight(struct task_struct *p)
 {
-	if (task_has_rt_policy(p)) {
+	if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
 		p->se.load.weight = prio_to_weight[0] * 2;
 		p->se.load.inv_weight = prio_to_wmult[0] >> 1;
 		return;
@@ -1930,7 +1945,12 @@ static inline int normal_prio(struct task_struct *p)
 {
 	int prio;
 
-	if (task_has_rt_policy(p))
+	if (task_has_dl_policy(p)) {
+		/*
+		 * FIXME: horrible hack here... Deadline inheritance needed!!
+		 */
+		prio = -(1<<7);
+	} else if (task_has_rt_policy(p))
 		prio = MAX_RT_PRIO-1 - p->rt_priority;
 	else
 		prio = __normal_prio(p);
@@ -2589,7 +2609,8 @@ void sched_fork(struct task_struct *p, int clone_flags)
 	 * Revert to default priority/policy on fork if requested.
 	 */
 	if (unlikely(p->sched_reset_on_fork)) {
-		if (p->policy == SCHED_FIFO || p->policy == SCHED_RR) {
+		if (p->policy == SCHED_DEADLINE ||
+		    p->policy == SCHED_FIFO || p->policy == SCHED_RR) {
 			p->policy = SCHED_NORMAL;
 			p->normal_prio = p->static_prio;
 		}
@@ -2612,7 +2633,14 @@ void sched_fork(struct task_struct *p, int clone_flags)
 	 */
 	p->prio = current->normal_prio;
 
-	if (!rt_prio(p->prio))
+	/*
+	 * FIXME: deadline inheritance needed here!!
+	 */
+	if (dl_task(p))
+		p->sched_class = &dl_sched_class;
+	else if (rt_prio(p->prio))
+		p->sched_class = &rt_sched_class;
+	else
 		p->sched_class = &fair_sched_class;
 
 	if (p->sched_class->task_fork)
@@ -6071,7 +6099,12 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 	if (running)
 		p->sched_class->put_prev_task(rq, p);
 
-	if (rt_prio(prio))
+	/*
+	 * FIXME: deadline inheritance needed here!!
+	 */
+	if (dl_task(p))
+		p->sched_class = &dl_sched_class;
+	else if (rt_prio(prio))
 		p->sched_class = &rt_sched_class;
 	else
 		p->sched_class = &fair_sched_class;
@@ -6108,9 +6141,9 @@ void set_user_nice(struct task_struct *p, long nice)
 	 * The RT priorities are set via sched_setscheduler(), but we still
 	 * allow the 'normal' nice value to be set - but as expected
 	 * it wont have any effect on scheduling until the task is
-	 * SCHED_FIFO/SCHED_RR:
+	 * SCHED_DEADLINE, SCHED_FIFO or SCHED_RR:
 	 */
-	if (task_has_rt_policy(p)) {
+	if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
 		p->static_prio = NICE_TO_PRIO(nice);
 		goto out_unlock;
 	}
@@ -6298,7 +6331,8 @@ recheck:
 		reset_on_fork = !!(policy & SCHED_RESET_ON_FORK);
 		policy &= ~SCHED_RESET_ON_FORK;
 
-		if (policy != SCHED_FIFO && policy != SCHED_RR &&
+		if (policy != SCHED_DEADLINE &&
+				policy != SCHED_FIFO && policy != SCHED_RR &&
 				policy != SCHED_NORMAL && policy != SCHED_BATCH &&
 				policy != SCHED_IDLE)
 			return -EINVAL;
@@ -9415,6 +9449,11 @@ static void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
 #endif
 }
 
+static void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
+{
+	dl_rq->rb_root = RB_ROOT;
+}
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 				struct sched_entity *se, int cpu, int add,
@@ -9573,6 +9612,7 @@ void __init sched_init(void)
 		rq->calc_load_update = jiffies + LOAD_FREQ;
 		init_cfs_rq(&rq->cfs, rq);
 		init_rt_rq(&rq->rt, rq);
+		init_dl_rq(&rq->dl, rq);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 		init_task_group.shares = init_task_group_load;
 		INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
new file mode 100644
index 0000000..7585227
--- /dev/null
+++ b/kernel/sched_dl.c
@@ -0,0 +1,606 @@
+/*
+ * Deadline Scheduling Class (SCHED_DEADLINE)
+ *
+ * Earliest Deadline First (EDF) + Constant Bandwidth Server (CBS).
+ *
+ * Tasks that periodically execute their instances for less than their
+ * runtime won't miss any of their deadlines.
+ * Tasks that are not periodic or sporadic, or that try to execute more
+ * than their reserved bandwidth, will be slowed down (and may potentially
+ * miss some of their deadlines), and won't affect any other task.
+ *
+ * Copyright (C) 2010 Dario Faggioli <raistlin@linux.it>,
+ *                    Michael Trimarchi <trimarchimichael@yahoo.it>,
+ *                    Fabio Checconi <fabio@gandalf.sssup.it>
+ */
+
+static const struct sched_class dl_sched_class;
+
+static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
+{
+	return container_of(dl_se, struct task_struct, dl);
+}
+
+static inline struct rq *rq_of_dl_rq(struct dl_rq *dl_rq)
+{
+	return container_of(dl_rq, struct rq, dl);
+}
+
+static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
+{
+	struct task_struct *p = dl_task_of(dl_se);
+	struct rq *rq = task_rq(p);
+
+	return &rq->dl;
+}
+
+static inline int on_dl_rq(struct sched_dl_entity *dl_se)
+{
+	return !RB_EMPTY_NODE(&dl_se->rb_node);
+}
+
+static inline int dl_time_before(u64 a, u64 b)
+{
+	return (s64)(a - b) < 0;
+}
+
+static void enqueue_dl_entity(struct sched_dl_entity *dl_se);
+static void dequeue_dl_entity(struct sched_dl_entity *dl_se);
+static void check_dl_preempt_curr(struct task_struct *p, struct rq *rq);
+
+/*
+ * We are being explicitly informed that a new task instance is starting,
+ * and this means that:
+ *  - the absolute deadline of the task has to be placed at
+ *    current time + relative deadline;
+ *  - the runtime of the task has to be set to the maximum value.
+ *
+ * The capability of specifying such event is useful whenever a -deadline
+ * task wants to (try to!) synchronize its behaviour with the scheduler's
+ * one, and to (try to!) reconcile itself with its own scheduling
+ * parameters.
+ */
+static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+
+	WARN_ON(!(dl_se->flags & DL_NEW) || dl_se->flags & DL_THROTTLED);
+
+	dl_se->deadline = rq->clock + dl_se->dl_deadline;
+	dl_se->runtime = dl_se->dl_runtime;
+	dl_se->flags &= ~DL_NEW;
+}
+
+/*
+ * Pure Earliest Deadline First (EDF) scheduling does not deal with the
+ * possibility of a task lasting more than what it declared, and thus
+ * exhausting its runtime.
+ *
+ * Here we are interested in making runtime overrun possible, but we do
+ * not want a task which is misbehaving to affect the scheduling of all
+ * other tasks.
+ * Therefore, a budgeting strategy called Constant Bandwidth Server (CBS)
+ * is used, in order to confine each task within its own bandwidth.
+ *
+ * This function deals exactly with that, and ensures that when the runtime
+ * of a task is replenished, its deadline is also postponed. That results
+ * in "priority unboosting" for the overrunning task, and makes it impossible
+ * for it to cause unexpected interference with other tasks in the system.
+ */
+static void replenish_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+
+	/*
+	 * We keep moving the deadline away until we get some
+	 * available runtime for the task. This ensures correct
+	 * handling of situations where the runtime overrun is
+	 * arbitrarily large.
+	 */
+	while (dl_se->runtime <= 0) {
+		dl_se->deadline += dl_se->dl_deadline;
+		dl_se->runtime += dl_se->dl_runtime;
+	}
+
+	WARN_ON(dl_se->runtime > dl_se->dl_runtime);
+	WARN_ON(dl_time_before(dl_se->deadline, rq->clock));
+}
+
+/*
+ * Here we check if --at time t-- a task (which is probably being
+ * [re]activated or, in general, enqueued) can use its remaining runtime
+ * and its current deadline _without_ exceeding the bandwidth it is
+ * assigned (function returns true if it can).
+ *
+ * For this to hold, we must check if:
+ *   runtime / (deadline - t) < dl_runtime / dl_deadline .
+ */
+static bool dl_check_bandwidth(struct sched_dl_entity *dl_se, u64 t)
+{
+	u64 left, right;
+
+	/*
+	 * left and right are the two sides of the equation above,
+	 * after a bit of shuffling to use multiplications instead
+	 * of divisions.
+	 */
+	left = dl_se->dl_deadline * dl_se->runtime;
+	right = (dl_se->deadline - t) * dl_se->dl_runtime;
+
+	return dl_time_before(left, right);
+}
+
+/*
+ * When a -deadline task is queued back on the runqueue, its runtime and
+ * deadline might need updating.
+ *
+ * The policy here is that we update the deadline of the task only if:
+ *  - the current deadline is in the past,
+ *  - using the remaining runtime with the current deadline would make
+ *    the task exceed its bandwidth.
+ */
+static void update_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+
+	/*
+	 * The arrival of a new task (or of a new task instance) needs
+	 * special treatment. The actual scheduling parameters have to be
+	 * "renewed" instead of recalculated according to the bandwidth
+	 * enforcement rule.
+	 */
+	if (dl_se->flags & DL_NEW) {
+		setup_new_dl_entity(dl_se);
+		return;
+	}
+
+	if (dl_time_before(dl_se->deadline, rq->clock))
+		goto update;
+
+	if (!dl_check_bandwidth(dl_se, rq->clock)) {
+update:
+		dl_se->deadline = rq->clock + dl_se->dl_deadline;
+		dl_se->runtime = dl_se->dl_runtime;
+	}
+}
+
+/*
+ * If the task depleted all its runtime, and if we want it to sleep
+ * while waiting for some new execution time to become available to it,
+ * we set the bandwidth enforcement timer to the replenishment instant
+ * and try to activate it.
+ *
+ * Notice that it is important for the caller to know whether the timer
+ * actually started or not, because the replenishment instant might
+ * already have passed.
+ */
+static int start_dl_timer(struct sched_dl_entity *dl_se, u64 wakeup)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+	ktime_t now, act;
+	ktime_t soft, hard;
+	unsigned long range;
+	s64 delta;
+
+	/*
+	 * Arm the timer to fire at wakeup, trying to compensate for
+	 * the fact that wakeup is actually coming from rq->clock and
+	 * not from hrtimer's time base reading.
+	 */
+	act = ns_to_ktime(wakeup);
+	now = hrtimer_cb_get_time(&dl_se->dl_timer);
+	delta = ktime_to_ns(now) - rq->clock;
+	act = ktime_add_ns(act, delta);
+
+	hrtimer_set_expires(&dl_se->dl_timer, act);
+
+	soft = hrtimer_get_softexpires(&dl_se->dl_timer);
+	hard = hrtimer_get_expires(&dl_se->dl_timer);
+	range = ktime_to_ns(ktime_sub(hard, soft));
+	__hrtimer_start_range_ns(&dl_se->dl_timer, soft,
+				 range, HRTIMER_MODE_ABS, 0);
+
+	return hrtimer_active(&dl_se->dl_timer);
+}
+
+/*
+ * This is the bandwidth enforcement timer callback. If here, we know
+ * the task is not on its dl_rq, since the fact that the timer was running
+ * means the task is throttled and needs a runtime replenishment.
+ *
+ * However, what we actually do depends on whether the task is active
+ * --i.e., it is on its rq-- or has been removed from there by a call to
+ * dequeue_task_dl(). In fact, in the former case we must issue the runtime
+ * replenishment and add the task back to the dl_rq; in the latter, we just
+ * do nothing but clearing the DL_THROTTLED flag, so that the runtime and
+ * deadline updating and the queueing back to dl_rq will be done by the
+ * next call to enqueue_task_dl().
+ */
+static enum hrtimer_restart dl_timer(struct hrtimer *timer)
+{
+	unsigned long flags;
+	struct sched_dl_entity *dl_se = container_of(timer,
+						     struct sched_dl_entity,
+						     dl_timer);
+	struct task_struct *p = dl_task_of(dl_se);
+	struct rq *rq = task_rq_lock(p, &flags);
+
+	/*
+	 * We need to take care of possible races here. In fact, the
+	 * task might have changed its scheduling policy to something
+	 * different from SCHED_DEADLINE (through sched_setscheduler()).
+	 */
+	if (!dl_task(p))
+		goto unlock;
+
+	dl_se->flags &= ~DL_THROTTLED;
+	if (p->se.on_rq) {
+		replenish_dl_entity(dl_se);
+		enqueue_dl_entity(dl_se);
+		check_dl_preempt_curr(p, rq);
+	}
+unlock:
+	task_rq_unlock(rq, &flags);
+
+	return HRTIMER_NORESTART;
+}
+
+static void init_dl_timer(struct sched_dl_entity *dl_se)
+{
+	struct hrtimer *timer = &dl_se->dl_timer;
+
+	if (hrtimer_active(timer)) {
+		hrtimer_try_to_cancel(timer);
+		return;
+	}
+
+	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	timer->function = dl_timer;
+}
+
+static
+int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
+{
+	int dmiss = dl_time_before(dl_se->deadline, rq->clock);
+	int rorun = dl_se->runtime <= 0;
+
+	if (!rorun && !dmiss)
+		return 0;
+
+	/*
+	 * If we are beyond our current deadline and we are still
+	 * executing, then we have already used some of the runtime of
+	 * the next instance. Thus, if we do not account that, we are
+	 * stealing bandwidth from the system at each deadline miss!
+	 */
+	if (dmiss)
+		dl_se->runtime -= rq->clock - dl_se->deadline;
+
+	dequeue_dl_entity(dl_se);
+	if (!start_dl_timer(dl_se, dl_se->deadline)) {
+		if (!rorun)
+			dl_se->runtime = 0;
+		replenish_dl_entity(dl_se);
+		enqueue_dl_entity(dl_se);
+	} else
+		dl_se->flags |= DL_THROTTLED;
+
+	return 1;
+}
+
+static void update_curr_dl(struct rq *rq)
+{
+	struct task_struct *curr = rq->curr;
+	struct sched_dl_entity *dl_se = &curr->dl;
+	u64 delta_exec;
+
+	if (!task_has_dl_policy(curr) || !on_dl_rq(dl_se))
+		return;
+
+	delta_exec = rq->clock - curr->se.exec_start;
+	if (unlikely((s64)delta_exec < 0))
+		delta_exec = 0;
+
+	schedstat_set(curr->se.exec_max, max(curr->se.exec_max, delta_exec));
+
+	curr->se.sum_exec_runtime += delta_exec;
+	account_group_exec_runtime(curr, delta_exec);
+
+	curr->se.exec_start = rq->clock;
+	cpuacct_charge(curr, delta_exec);
+
+	dl_se->runtime -= delta_exec;
+	if (dl_runtime_exceeded(rq, dl_se))
+		resched_task(curr);
+}
+
+static void enqueue_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rb_node **link = &dl_rq->rb_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct sched_dl_entity *entry;
+	int leftmost = 1;
+
+	BUG_ON(!RB_EMPTY_NODE(&dl_se->rb_node));
+
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct sched_dl_entity, rb_node);
+		if (dl_time_before(dl_se->deadline, entry->deadline))
+			link = &parent->rb_left;
+		else {
+			link = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	if (leftmost)
+		dl_rq->rb_leftmost = &dl_se->rb_node;
+
+	rb_link_node(&dl_se->rb_node, parent, link);
+	rb_insert_color(&dl_se->rb_node, &dl_rq->rb_root);
+
+	dl_rq->dl_nr_running++;
+}
+
+static void dequeue_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+	if (RB_EMPTY_NODE(&dl_se->rb_node))
+		return;
+
+	if (dl_rq->rb_leftmost == &dl_se->rb_node) {
+		struct rb_node *next_node;
+
+		next_node = rb_next(&dl_se->rb_node);
+		dl_rq->rb_leftmost = next_node;
+	}
+
+	rb_erase(&dl_se->rb_node, &dl_rq->rb_root);
+	RB_CLEAR_NODE(&dl_se->rb_node);
+
+	dl_rq->dl_nr_running--;
+}
+
+static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
+				   int sync)
+{
+	if (dl_task(p) &&
+	    dl_time_before(p->dl.deadline, rq->curr->dl.deadline))
+		resched_task(rq->curr);
+}
+
+/*
+ * There are a few cases where we must check if a -deadline task p should
+ * preempt the current task of a given rq (e.g., inside the bandwidth
+ * enforcement timer callback).
+ */
+static void check_dl_preempt_curr(struct task_struct *p, struct rq *rq)
+{
+	if (!dl_task(rq->curr) ||
+	    dl_time_before(p->dl.deadline, rq->curr->dl.deadline))
+		resched_task(rq->curr);
+}
+
+static void
+enqueue_task_dl(struct rq *rq, struct task_struct *p, int wakeup)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+
+	BUG_ON(on_dl_rq(dl_se));
+
+	/*
+	 * If the task is DL_THROTTLED, we do nothing. In fact,
+	 * if it exhausted its budget it needs a replenishment and,
+	 * since it now is on its rq, the bandwidth timer callback
+	 * (which clearly has not run yet) will take care of this.
+	 */
+	if (dl_se->flags & DL_THROTTLED)
+		return;
+
+	update_dl_entity(dl_se);
+	enqueue_dl_entity(dl_se);
+}
+
+static void
+dequeue_task_dl(struct rq *rq, struct task_struct *p, int sleep)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+
+	if (!on_dl_rq(dl_se))
+		return;
+
+	update_curr_dl(rq);
+	dequeue_dl_entity(dl_se);
+}
+
+static void yield_task_dl(struct rq *rq)
+{
+}
+
+#ifdef CONFIG_SCHED_HRTICK
+static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+	s64 delta;
+
+	delta = dl_se->dl_runtime - dl_se->runtime;
+
+	if (delta > 10000)
+		hrtick_start(rq, delta);
+}
+#else
+static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
+{
+}
+#endif
+
+static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
+						   struct dl_rq *dl_rq)
+{
+	struct rb_node *left = dl_rq->rb_leftmost;
+
+	if (!left)
+		return NULL;
+
+	return rb_entry(left, struct sched_dl_entity, rb_node);
+}
+
+struct task_struct *pick_next_task_dl(struct rq *rq)
+{
+	struct sched_dl_entity *dl_se;
+	struct task_struct *p;
+	struct dl_rq *dl_rq;
+
+	dl_rq = &rq->dl;
+
+	if (likely(!dl_rq->dl_nr_running))
+		return NULL;
+
+	dl_se = pick_next_dl_entity(rq, dl_rq);
+	BUG_ON(!dl_se);
+
+	p = dl_task_of(dl_se);
+	p->se.exec_start = rq->clock;
+#ifdef CONFIG_SCHED_HRTICK
+	if (hrtick_enabled(rq))
+		start_hrtick_dl(rq, p);
+#endif
+	return p;
+}
+
+static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
+{
+	update_curr_dl(rq);
+	p->se.exec_start = 0;
+}
+
+static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
+{
+	update_curr_dl(rq);
+
+#ifdef CONFIG_SCHED_HRTICK
+	if (hrtick_enabled(rq) && queued && p->dl.runtime > 0)
+		start_hrtick_dl(rq, p);
+#endif
+}
+
+static void task_fork_dl(struct task_struct *p)
+{
+	/*
+	 * The child of a -deadline task will be SCHED_DEADLINE, but
+	 * with zero runtime and in DL_THROTTLED and !DL_NEW internal state.
+	 * This means the parent (or some other task) must call
+	 * sched_setscheduler_ex() on it, or it won't ever start.
+	 */
+	p->dl.flags |= DL_THROTTLED;
+	p->dl.flags &= ~DL_NEW;
+}
+
+static void task_dead_dl(struct task_struct *p)
+{
+	/*
+	 * We are not holding any lock here, so it is safe to
+	 * wait for the bandwidth timer to be removed.
+	 */
+	hrtimer_cancel(&p->dl.dl_timer);
+}
+
+static void set_curr_task_dl(struct rq *rq)
+{
+	struct task_struct *p = rq->curr;
+
+	p->se.exec_start = rq->clock;
+}
+
+static void switched_from_dl(struct rq *rq, struct task_struct *p,
+			     int running)
+{
+	if (hrtimer_active(&p->dl.dl_timer))
+		hrtimer_try_to_cancel(&p->dl.dl_timer);
+}
+
+static void switched_to_dl(struct rq *rq, struct task_struct *p,
+			   int running)
+{
+	init_dl_timer(&p->dl);
+	check_dl_preempt_curr(p, rq);
+}
+
+static void prio_changed_dl(struct rq *rq, struct task_struct *p,
+			    int oldprio, int running)
+{
+	switched_to_dl(rq, p, running);
+}
+
+#ifdef CONFIG_SMP
+static int select_task_rq_dl(struct task_struct *p,
+				   int sd_flag, int flags)
+{
+	return task_cpu(p);
+}
+
+static unsigned long
+load_balance_dl(struct rq *this_rq, int this_cpu, struct rq *busiest,
+		 unsigned long max_load_move,
+		 struct sched_domain *sd, enum cpu_idle_type idle,
+		 int *all_pinned, int *this_best_prio)
+{
+	/* Don't dare touching SCHED_DEADLINE tasks! */
+	return 0;
+}
+
+static int
+move_one_task_dl(struct rq *this_rq, int this_cpu, struct rq *busiest,
+		  struct sched_domain *sd, enum cpu_idle_type idle)
+{
+	return 0;
+}
+
+static void set_cpus_allowed_dl(struct task_struct *p,
+				 const struct cpumask *new_mask)
+{
+	int weight = cpumask_weight(new_mask);
+
+	BUG_ON(!dl_task(p));
+
+	cpumask_copy(&p->cpus_allowed, new_mask);
+	p->dl.nr_cpus_allowed = weight;
+}
+#endif
+
+static const struct sched_class dl_sched_class = {
+	.next			= &rt_sched_class,
+	.enqueue_task		= enqueue_task_dl,
+	.dequeue_task		= dequeue_task_dl,
+	.yield_task		= yield_task_dl,
+
+	.check_preempt_curr	= check_preempt_curr_dl,
+
+	.pick_next_task		= pick_next_task_dl,
+	.put_prev_task		= put_prev_task_dl,
+
+#ifdef CONFIG_SMP
+	.select_task_rq		= select_task_rq_dl,
+
+	.load_balance           = load_balance_dl,
+	.move_one_task		= move_one_task_dl,
+	.set_cpus_allowed       = set_cpus_allowed_dl,
+#endif
+
+	.set_curr_task		= set_curr_task_dl,
+	.task_tick		= task_tick_dl,
+	.task_fork              = task_fork_dl,
+	.task_dead		= task_dead_dl,
+
+	.prio_changed           = prio_changed_dl,
+	.switched_from		= switched_from_dl,
+	.switched_to		= switched_to_dl,
+};
+
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 8fe7ee8..7c31185 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1707,7 +1707,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	int sync = wake_flags & WF_SYNC;
 	int scale = cfs_rq->nr_running >= sched_nr_latency;
 
-	if (unlikely(rt_prio(p->prio)))
+	if (unlikely(dl_task(p) || rt_prio(p->prio)))
 		goto preempt;
 
 	if (unlikely(p->sched_class != &fair_sched_class))
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index f48328a..84eee27 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1004,7 +1004,7 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
  */
 static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int flags)
 {
-	if (p->prio < rq->curr->prio) {
+	if (dl_task(p) || p->prio < rq->curr->prio) {
 		resched_task(rq->curr);
 		return;
 	}
-- 
1.7.0

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org



* [RFC][PATCH 03/11] sched: add extended scheduling interface.
  2010-02-28 19:06 [RFC][PATCH 0/11] sched: SCHED_DEADLINE v2 Raistlin
  2010-02-28 19:15 ` [RFC][PATCH 01/11] sched: add sched_class->task_dead Raistlin
  2010-02-28 19:17 ` [RFC][PATCH 02/11] sched: SCHED_DEADLINE policy implementation Raistlin
@ 2010-02-28 19:18 ` Raistlin
  2010-02-28 19:19 ` [RFC][PATCH 04/11] sched: add resource limits for -deadline tasks Raistlin
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 25+ messages in thread
From: Raistlin @ 2010-02-28 19:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	Frederic Weisbecker, Darren Hart, Henrik Austad, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni


Add the interface bits needed for supporting scheduling algorithms
with extended parameters (e.g., SCHED_DEADLINE).
In fact, specifying a periodic/sporadic task that executes for a
given amount of runtime at each instance, and that is scheduled
according to the urgency of its own timing constraints, needs,
in general:
 - a (maximum/typical) instance execution time,
 - a minimum interval between consecutive instances,
 - a time constraint by which each instance must be completed.

For this model to be usable, both the data structure that holds the
scheduling parameters of tasks and the system calls that deal with
them have to be extended.
Unfortunately, modifying the existing struct sched_param would
break the ABI and result in potentially serious compatibility
issues with legacy binary code.

For these reasons, this patch:
 - defines the new struct sched_param_ex, containing all the fields
   that are necessary for specifying a task in the computational
   model described above;
 - defines and implements the new scheduling related syscalls that
   manipulate it, i.e., sched_setscheduler_ex(), sched_setparam_ex()
   and sched_getparam_ex().

Syscalls are introduced for x86 (32 and 64 bit) and ARM only, as a
proof of concept and for development and testing purposes. However,
making them available on other archs is straightforward.

The SCHED_DEADLINE policy is, as of now, the only user of this
extended interface.
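
To give an idea of how the new interface is meant to be used, here is
a minimal userspace sketch. It is only an illustration: there is no
libc wrapper yet, so it goes through syscall(); the syscall number and
struct sched_param_ex are copied from the patched headers (338 is the
x86-32 number, it is 300 on x86-64 and __NR_SYSCALL_BASE+366 on ARM),
and the timing values are just examples. Changing the policy also
needs the usual privileges (but see patch 04 for the new rlimits):

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <string.h>
  #include <time.h>
  #include <unistd.h>
  #include <sys/syscall.h>

  #define SCHED_DEADLINE  6

  #ifndef __NR_sched_setscheduler_ex
  #define __NR_sched_setscheduler_ex      338     /* x86-32 value */
  #endif

  /* copied from the patched include/linux/sched.h */
  struct sched_param_ex {
          int sched_priority;
          struct timespec sched_runtime;
          struct timespec sched_deadline;
          struct timespec sched_period;
          unsigned int sched_flags;
  };

  int main(void)
  {
          struct sched_param_ex p;

          memset(&p, 0, sizeof(p));
          /* 10ms of runtime every 100ms (deadline == period, for now) */
          p.sched_runtime.tv_nsec  = 10 * 1000 * 1000;
          p.sched_deadline.tv_nsec = 100 * 1000 * 1000;

          /* pid 0 means the calling task */
          if (syscall(__NR_sched_setscheduler_ex, 0, SCHED_DEADLINE,
                      sizeof(p), &p) < 0)
                  perror("sched_setscheduler_ex");

          return 0;
  }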

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 arch/arm/include/asm/unistd.h      |    3 +
 arch/arm/kernel/calls.S            |    3 +
 arch/x86/ia32/ia32entry.S          |    3 +
 arch/x86/include/asm/unistd_32.h   |    5 +-
 arch/x86/include/asm/unistd_64.h   |    6 ++
 arch/x86/kernel/syscall_table_32.S |    3 +
 include/linux/sched.h              |   54 +++++++++++
 include/linux/syscalls.h           |    7 ++
 kernel/sched.c                     |  176 +++++++++++++++++++++++++++++++++++-
 9 files changed, 254 insertions(+), 6 deletions(-)

diff --git a/arch/arm/include/asm/unistd.h b/arch/arm/include/asm/unistd.h
index cf9cdaa..e741cd6 100644
--- a/arch/arm/include/asm/unistd.h
+++ b/arch/arm/include/asm/unistd.h
@@ -392,6 +392,9 @@
 #define __NR_rt_tgsigqueueinfo		(__NR_SYSCALL_BASE+363)
 #define __NR_perf_event_open		(__NR_SYSCALL_BASE+364)
 #define __NR_recvmmsg			(__NR_SYSCALL_BASE+365)
+#define __NR_sched_setscheduler_ex	(__NR_SYSCALL_BASE+366)
+#define __NR_sched_setparam_ex		(__NR_SYSCALL_BASE+367)
+#define __NR_sched_getparam_ex		(__NR_SYSCALL_BASE+368)
 
 /*
  * The following SWIs are ARM private.
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index 9314a2d..8eeb552 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -375,6 +375,9 @@
 		CALL(sys_rt_tgsigqueueinfo)
 		CALL(sys_perf_event_open)
 /* 365 */	CALL(sys_recvmmsg)
+		CALL(sys_sched_setscheduler_ex)
+		CALL(sys_sched_setparam_ex)
+		CALL(sys_sched_getparam_ex)
 #ifndef syscalls_counted
 .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
 #define syscalls_counted
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 53147ad..f24e9fa 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -842,4 +842,7 @@ ia32_sys_call_table:
 	.quad compat_sys_rt_tgsigqueueinfo	/* 335 */
 	.quad sys_perf_event_open
 	.quad compat_sys_recvmmsg
+	.quad sys_sched_setscheduler_ex
+	.quad sys_sched_setparam_ex
+	.quad sys_sched_getparam_ex		/* 340 */
 ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 3baf379..1db148b 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -343,10 +343,13 @@
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_event_open	336
 #define __NR_recvmmsg		337
+#define __NR_sched_setscheduler_ex	338
+#define __NR_sched_setparam_ex		339
+#define __NR_sched_getparam_ex		340
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 338
+#define NR_syscalls 341
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 4843f7b..d254154 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -663,6 +663,12 @@ __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
 __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
 #define __NR_recvmmsg				299
 __SYSCALL(__NR_recvmmsg, sys_recvmmsg)
+#define __NR_sched_setscheduler_ex		300
+__SYSCALL(__NR_sched_setscheduler_ex, sys_sched_setscheduler_ex)
+#define __NR_sched_setparam_ex			301
+__SYSCALL(__NR_sched_setparam_ex, sys_sched_setparam_ex)
+#define __NR_sched_getparam_ex			302
+__SYSCALL(__NR_sched_getparam_ex, sys_sched_getparam_ex)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 15228b5..e27e002 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -337,3 +337,6 @@ ENTRY(sys_call_table)
 	.long sys_rt_tgsigqueueinfo	/* 335 */
 	.long sys_perf_event_open
 	.long sys_recvmmsg
+	.long sys_sched_setscheduler_ex
+	.long sys_sched_setparam_ex
+	.long sys_sched_getparam_ex	/* 340 */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index cd24a7a..3c466a6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -95,6 +95,57 @@ struct sched_param {
 
 #include <asm/processor.h>
 
+/*
+ * Extended scheduling parameters data structure.
+ *
+ * This is needed because the original struct sched_param can not be
+ * altered without introducing ABI issues with legacy applications
+ * (e.g., in sched_getparam()).
+ *
+ * However, the possibility of specifying more than just a priority for
+ * the tasks may be useful for a wide variety of application fields, e.g.,
+ * multimedia, streaming, automation and control, and many others.
+ *
+ * This variant (sched_param_ex) is meant at describing a so-called
+ * sporadic time-constrained task. In such model a task is specified by:
+ *  - the activation period or minimum instance inter-arrival time;
+ *  - the maximum (or average, depending on the actual scheduling
+ *    discipline) computation time of all instances, a.k.a. runtime;
+ *  - the deadline (relative to the actual activation time) of each
+ *    instance.
+ * Very briefly, a periodic (sporadic) task asks for the execution of
+ * some specific computation --which is typically called an instance--
+ * (at most) every period. Moreover, each instance typically lasts no more
+ * than the runtime and must be completed by time instant t equal to
+ * the instance activation time + the deadline.
+ *
+ * This is reflected by the actual fields of the sched_param_ex structure:
+ *
+ *  @sched_priority:    task's priority (might be still useful)
+ *  @sched_deadline:    representative of the task's deadline
+ *  @sched_runtime:     representative of the task's runtime
+ *  @sched_period:      representative of the task's period
+ *  @sched_flags:       available for specifying some specific
+ *                      task or scheduling behaviour
+ *
+ * Given this task model, there is a multiplicity of scheduling algorithms
+ * and policies, each with its own advantages and drawbacks in having all
+ * the tasks meet their timing constraints.
+ *
+ * As of now, the SCHED_DEADLINE policy (sched_dl scheduling class) is the
+ * only user of this new interface. It right now assumes deadlines are
+ * always equal to periods, thus it does not use the sched_period field.
+ * More information about the algorithm is available in the scheduling
+ * class file or in Documentation/.
+ */
+struct sched_param_ex {
+	int sched_priority;
+	struct timespec sched_runtime;
+	struct timespec sched_deadline;
+	struct timespec sched_period;
+	unsigned int sched_flags;
+};
+
 struct exec_domain;
 struct futex_pi_state;
 struct robust_list_head;
@@ -2048,6 +2099,9 @@ extern int idle_cpu(int cpu);
 extern int sched_setscheduler(struct task_struct *, int, struct sched_param *);
 extern int sched_setscheduler_nocheck(struct task_struct *, int,
 				      struct sched_param *);
+extern int sched_setscheduler_ex(struct task_struct *, int,
+				 struct sched_param *,
+				 struct sched_param_ex *);
 extern struct task_struct *idle_task(int cpu);
 extern struct task_struct *curr_task(int cpu);
 extern void set_curr_task(int cpu, struct task_struct *p);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 207466a..9e3ad66 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -34,6 +34,7 @@ struct pollfd;
 struct rlimit;
 struct rusage;
 struct sched_param;
+struct sched_param_ex;
 struct semaphore;
 struct sembuf;
 struct shmid_ds;
@@ -340,11 +341,17 @@ asmlinkage long sys_clock_nanosleep(clockid_t which_clock, int flags,
 asmlinkage long sys_nice(int increment);
 asmlinkage long sys_sched_setscheduler(pid_t pid, int policy,
 					struct sched_param __user *param);
+asmlinkage long sys_sched_setscheduler_ex(pid_t pid, int policy, unsigned len,
+					struct sched_param_ex __user *param);
 asmlinkage long sys_sched_setparam(pid_t pid,
 					struct sched_param __user *param);
+asmlinkage long sys_sched_setparam_ex(pid_t pid, unsigned len,
+					struct sched_param_ex __user *param);
 asmlinkage long sys_sched_getscheduler(pid_t pid);
 asmlinkage long sys_sched_getparam(pid_t pid,
 					struct sched_param __user *param);
+asmlinkage long sys_sched_getparam_ex(pid_t pid, unsigned len,
+					struct sched_param_ex __user *param);
 asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
 					unsigned long __user *user_mask_ptr);
 asmlinkage long sys_sched_getaffinity(pid_t pid, unsigned int len,
diff --git a/kernel/sched.c b/kernel/sched.c
index c5ee6f9..19e90fc 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6288,7 +6288,13 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
 	p->normal_prio = normal_prio(p);
 	/* we are holding p->pi_lock already */
 	p->prio = rt_mutex_getprio(p);
-	if (rt_prio(p->prio))
+
+	/*
+	 * FIXME: deadline inheritance needed here!!
+	 */
+	if (dl_policy(policy))
+		p->sched_class = &dl_sched_class;
+	else if (rt_prio(p->prio))
 		p->sched_class = &rt_sched_class;
 	else
 		p->sched_class = &fair_sched_class;
@@ -6296,6 +6302,50 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
 }
 
 /*
+ * This function initializes the sched_dl_entity of a newly becoming
+ * SCHED_DEADLINE task.
+ *
+ * Only the static values are considered here, the actual runtime and the
+ * absolute deadline will be properly calculated when the task is enqueued
+ * for the first time with its new policy.
+ */
+static void
+__setparam_dl(struct task_struct *p, struct sched_param_ex *param_ex)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+
+	dl_se->dl_runtime = timespec_to_ns(&param_ex->sched_runtime);
+	dl_se->dl_deadline = timespec_to_ns(&param_ex->sched_deadline);
+	dl_se->flags = param_ex->sched_flags;
+	dl_se->flags &= ~DL_THROTTLED;
+	dl_se->flags |= DL_NEW;
+}
+
+static void
+__getparam_dl(struct task_struct *p, struct sched_param_ex *param_ex)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+
+	param_ex->sched_priority = p->rt_priority;
+	param_ex->sched_runtime = ns_to_timespec(dl_se->dl_runtime);
+	param_ex->sched_deadline = ns_to_timespec(dl_se->dl_deadline);
+	param_ex->sched_flags = dl_se->flags;
+}
+
+/*
+ * This function validates the new parameters of a -deadline task.
+ * We ask for the deadline not being zero, and greater or equal
+ * than the runtime.
+ */
+static bool
+__checkparam_dl(struct sched_param_ex *prm)
+{
+	return prm && timespec_to_ns(&prm->sched_deadline) != 0 &&
+	       timespec_to_ns(&prm->sched_deadline) >=
+	       timespec_to_ns(&prm->sched_runtime);
+}
+
+/*
  * check the target process has a UID that matches the current process's
  */
 static bool check_same_owner(struct task_struct *p)
@@ -6312,7 +6362,9 @@ static bool check_same_owner(struct task_struct *p)
 }
 
 static int __sched_setscheduler(struct task_struct *p, int policy,
-				struct sched_param *param, bool user)
+				struct sched_param *param,
+				struct sched_param_ex *param_ex,
+				bool user)
 {
 	int retval, oldprio, oldpolicy = -1, on_rq, running;
 	unsigned long flags;
@@ -6347,7 +6399,8 @@ recheck:
 	    (p->mm && param->sched_priority > MAX_USER_RT_PRIO-1) ||
 	    (!p->mm && param->sched_priority > MAX_RT_PRIO-1))
 		return -EINVAL;
-	if (rt_policy(policy) != (param->sched_priority != 0))
+	if ((dl_policy(policy) && !__checkparam_dl(param_ex)) ||
+	    (rt_policy(policy) != (param->sched_priority != 0)))
 		return -EINVAL;
 
 	/*
@@ -6431,6 +6484,8 @@ recheck:
 	p->sched_reset_on_fork = reset_on_fork;
 
 	oldprio = p->prio;
+	if (dl_policy(policy))
+		__setparam_dl(p, param_ex);
 	__setscheduler(rq, p, policy, param->sched_priority);
 
 	if (running)
@@ -6459,10 +6514,18 @@ recheck:
 int sched_setscheduler(struct task_struct *p, int policy,
 		       struct sched_param *param)
 {
-	return __sched_setscheduler(p, policy, param, true);
+	return __sched_setscheduler(p, policy, param, NULL, true);
 }
 EXPORT_SYMBOL_GPL(sched_setscheduler);
 
+int sched_setscheduler_ex(struct task_struct *p, int policy,
+			  struct sched_param *param,
+			  struct sched_param_ex *param_ex)
+{
+	return __sched_setscheduler(p, policy, param, param_ex, true);
+}
+EXPORT_SYMBOL_GPL(sched_setscheduler_ex);
+
 /**
  * sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace.
  * @p: the task in question.
@@ -6477,7 +6540,7 @@ EXPORT_SYMBOL_GPL(sched_setscheduler);
 int sched_setscheduler_nocheck(struct task_struct *p, int policy,
 			       struct sched_param *param)
 {
-	return __sched_setscheduler(p, policy, param, false);
+	return __sched_setscheduler(p, policy, param, NULL, false);
 }
 
 static int
@@ -6502,6 +6565,36 @@ do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)
 	return retval;
 }
 
+static int
+do_sched_setscheduler_ex(pid_t pid, int policy, unsigned len,
+			 struct sched_param_ex __user *param_ex)
+{
+	struct sched_param lparam;
+	struct sched_param_ex lparam_ex;
+	struct task_struct *p;
+	int retval;
+
+	if (!param_ex || pid < 0)
+		return -EINVAL;
+	if (len > sizeof(lparam_ex))
+		return -EINVAL;
+
+	memset(&lparam_ex, 0, sizeof(lparam_ex));
+	if (copy_from_user(&lparam_ex, param_ex, len))
+		return -EFAULT;
+
+	rcu_read_lock();
+	retval = -ESRCH;
+	p = find_process_by_pid(pid);
+	if (p != NULL) {
+		lparam.sched_priority = lparam_ex.sched_priority;
+		retval = sched_setscheduler_ex(p, policy, &lparam, &lparam_ex);
+	}
+	rcu_read_unlock();
+
+	return retval;
+}
+
 /**
  * sys_sched_setscheduler - set/change the scheduler policy and RT priority
  * @pid: the pid in question.
@@ -6519,6 +6612,22 @@ SYSCALL_DEFINE3(sched_setscheduler, pid_t, pid, int, policy,
 }
 
 /**
+ * sys_sched_setscheduler_ex - same as above, but with extended sched_param
+ * @pid: the pid in question.
+ * @policy: new policy (could use extended sched_param).
+ * @len: size of the data pointed to by param_ex.
+ * @param_ex: structure containing the extended parameters.
+ */
+SYSCALL_DEFINE4(sched_setscheduler_ex, pid_t, pid, int, policy,
+		unsigned, len, struct sched_param_ex __user *, param_ex)
+{
+	if (policy < 0)
+		return -EINVAL;
+
+	return do_sched_setscheduler_ex(pid, policy, len, param_ex);
+}
+
+/**
  * sys_sched_setparam - set/change the RT priority of a thread
  * @pid: the pid in question.
  * @param: structure containing the new RT priority.
@@ -6529,6 +6638,18 @@ SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
 }
 
 /**
+ * sys_sched_setparam_ex - same as above, but with extended sched_param
+ * @pid: the pid in question.
+ * @len: size of the data pointed to by param_ex.
+ * @param_ex: structure containing the extended parameters.
+ */
+SYSCALL_DEFINE3(sched_setparam_ex, pid_t, pid, unsigned, len,
+		struct sched_param_ex __user *, param_ex)
+{
+	return do_sched_setscheduler_ex(pid, -1, len, param_ex);
+}
+
+/**
  * sys_sched_getscheduler - get the policy (scheduling class) of a thread
  * @pid: the pid in question.
  */
@@ -6592,6 +6713,51 @@ out_unlock:
 	return retval;
 }
 
+/**
+ * sys_sched_getparam_ex - same as above, but with extended sched_param
+ * @pid: the pid in question.
+ * @len: size of the data pointed to by param_ex.
+ * @param_ex: structure containing the extended parameters.
+ */
+SYSCALL_DEFINE3(sched_getparam_ex, pid_t, pid, unsigned, len,
+		struct sched_param_ex __user *, param_ex)
+{
+	struct sched_param_ex lp;
+	struct task_struct *p;
+	int retval;
+
+	if (!param_ex || pid < 0)
+		return -EINVAL;
+	if (len > sizeof(lp))
+		return -EINVAL;
+
+	rcu_read_lock();
+	p = find_process_by_pid(pid);
+	retval = -ESRCH;
+	if (!p)
+		goto out_unlock;
+
+	retval = security_task_getscheduler(p);
+	if (retval)
+		goto out_unlock;
+
+	if (task_has_dl_policy(p))
+		__getparam_dl(p, &lp);
+	rcu_read_unlock();
+
+	/*
+	 * This one might sleep, we cannot do it with a spinlock held ...
+	 */
+	retval = copy_to_user(param_ex, &lp, len) ? -EFAULT : 0;
+
+	return retval;
+
+out_unlock:
+	rcu_read_unlock();
+	return retval;
+
+}
+
 long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 {
 	cpumask_var_t cpus_allowed, new_mask;
-- 
1.7.0

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC][PATCH 04/11] sched: add resource limits for -deadline tasks.
  2010-02-28 19:06 [RFC][PATCH 0/11] sched: SCHED_DEADLINE v2 Raistlin
                   ` (2 preceding siblings ...)
  2010-02-28 19:18 ` [RFC][PATCH 03/11] sched: add extended scheduling interface Raistlin
@ 2010-02-28 19:19 ` Raistlin
  2010-02-28 19:20 ` [RFC][PATCH 05/11] sched: add a syscall to wait for the next instance Raistlin
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 25+ messages in thread
From: Raistlin @ 2010-02-28 19:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	Frederic Weisbecker, Darren Hart, Henrik Austad, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni

[-- Attachment #1: Type: text/plain, Size: 3135 bytes --]

Add resource limits for non-root tasks using the SCHED_DEADLINE
policy, very similar to what already exists for the RT policies.

In fact, this patch:
 - adds the resource limit RLIMIT_DLDLINE, which is the minimum value
   a user task can use as its own deadline;
 - adds the resource limit RLIMIT_DLRTIME, which is the maximum value
   a user task can use as its own runtime.

Notice that, to exploit these, a modified version of the ulimit
utility and a modified resource.h header file are needed. They will
both be available on the website of the project.
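
As a quick illustration, here is a minimal userspace sketch of how these
limits could be adjusted before handing a CPU share to an unprivileged
-deadline user. It is purely illustrative: the RLIMIT_DL* values are
hard-coded only because a stock resource.h does not know about them yet,
and raising the runtime cap above its default of 0 needs privilege (or
the modified ulimit mentioned above):

#include <stdio.h>
#include <sys/resource.h>

#ifndef RLIMIT_DLDLINE
#define RLIMIT_DLDLINE	16	/* minimum deadline in us (from this patch) */
#define RLIMIT_DLRTIME	17	/* maximum runtime in us (from this patch) */
#endif

int main(void)
{
	struct rlimit rl;

	/* allow deadlines down to 1000us ... */
	rl.rlim_cur = rl.rlim_max = 1000;
	if (setrlimit(RLIMIT_DLDLINE, &rl))
		perror("setrlimit(RLIMIT_DLDLINE)");

	/* ... and runtimes up to 100us */
	rl.rlim_cur = rl.rlim_max = 100;
	if (setrlimit(RLIMIT_DLRTIME, &rl))
		perror("setrlimit(RLIMIT_DLRTIME)");

	return 0;
}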

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 include/asm-generic/resource.h |    7 ++++++-
 kernel/sched.c                 |   25 +++++++++++++++++++++++++
 2 files changed, 31 insertions(+), 1 deletions(-)

diff --git a/include/asm-generic/resource.h b/include/asm-generic/resource.h
index 587566f..4a1d0e2 100644
--- a/include/asm-generic/resource.h
+++ b/include/asm-generic/resource.h
@@ -45,7 +45,10 @@
 					   0-39 for nice level 19 .. -20 */
 #define RLIMIT_RTPRIO		14	/* maximum realtime priority */
 #define RLIMIT_RTTIME		15	/* timeout for RT tasks in us */
-#define RLIM_NLIMITS		16
+
+#define RLIMIT_DLDLINE		16	/* minimum deadline in us */
+#define RLIMIT_DLRTIME		17	/* maximum runtime in us */
+#define RLIM_NLIMITS		18
 
 /*
  * SuS says limits have to be unsigned.
@@ -87,6 +90,8 @@
 	[RLIMIT_NICE]		= { 0, 0 },				\
 	[RLIMIT_RTPRIO]		= { 0, 0 },				\
 	[RLIMIT_RTTIME]		= {  RLIM_INFINITY,  RLIM_INFINITY },	\
+	[RLIMIT_DLDLINE]	= { ULONG_MAX, ULONG_MAX },		\
+	[RLIMIT_DLRTIME]	= { 0, 0 },				\
 }
 
 #endif	/* __KERNEL__ */
diff --git a/kernel/sched.c b/kernel/sched.c
index 19e90fc..61b1561 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6407,6 +6407,31 @@ recheck:
 	 * Allow unprivileged RT tasks to decrease priority:
 	 */
 	if (user && !capable(CAP_SYS_NICE)) {
+		if (dl_policy(policy)) {
+			u64 rlim_dline, rlim_rtime;
+			u64 dline, rtime;
+
+			if (!lock_task_sighand(p, &flags))
+				return -ESRCH;
+			rlim_dline = p->signal->rlim[RLIMIT_DLDLINE].rlim_cur;
+			rlim_rtime = p->signal->rlim[RLIMIT_DLRTIME].rlim_cur;
+			unlock_task_sighand(p, &flags);
+
+			/* can't set/change -deadline policy */
+			if (policy != p->policy && !rlim_rtime)
+				return -EPERM;
+
+			/* can't decrease the deadline */
+			rlim_dline *= NSEC_PER_USEC;
+			dline = timespec_to_ns(&param_ex->sched_deadline);
+			if (dline < p->dl.dl_deadline && dline < rlim_dline)
+				return -EPERM;
+			/* can't increase the runtime */
+			rlim_rtime *= NSEC_PER_USEC;
+			rtime = timespec_to_ns(&param_ex->sched_runtime);
+			if (rtime > p->dl.dl_runtime && rtime > rlim_rtime)
+				return -EPERM;
+		}
 		if (rt_policy(policy)) {
 			unsigned long rlim_rtprio;
 
-- 
1.7.0

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC][PATCH 05/11] sched: add a syscall to wait for the next instance.
  2010-02-28 19:06 [RFC][PATCH 0/11] sched: SCHED_DEADLINE v2 Raistlin
                   ` (3 preceding siblings ...)
  2010-02-28 19:19 ` [RFC][PATCH 04/11] sched: add resource limits for -deadline tasks Raistlin
@ 2010-02-28 19:20 ` Raistlin
  2010-02-28 19:22 ` [RFC][PATCH 06/11] sched: add the sched-debug bits for sched_dl Raistlin
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 25+ messages in thread
From: Raistlin @ 2010-02-28 19:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	Frederic Weisbecker, Darren Hart, Henrik Austad, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni

[-- Attachment #1: Type: text/plain, Size: 10907 bytes --]

Introduce the sched_wait_interval() syscall (and the corresponding
scheduling class interface call). In general, this aims at providing
each scheduling class with a means of making one of its own tasks
sleep for some time according to some specific rule of the class
itself.

As of now, the sched_dl scheduling class is the only one that needs
this kind of service, and thus the only one that implements the
class-specific logic. For other classes, calling it has the same
effect as calling clock_nanosleep() with the CLOCK_MONOTONIC clockid
and the TIMER_ABSTIME flag set.

For -deadline tasks, the idea is to give them the possibility of
notifying the scheduler that a periodic/sporadic instance has just
ended and of asking it to wake them up at the beginning of the next
one, with:
 - a fully replenished runtime and
 - the absolute deadline set just one relative deadline interval
   away from the wakeup time.
This is an effective means of synchronizing the task's behaviour with
the scheduler's, which might be useful in some situations (a usage
sketch follows below).

This patch:
 - adds the new syscall (x86-32, x86-64 and ARM; extension to all
   archs is straightforward);
 - implements the class-specific logic for -deadline tasks, making it
   impossible for them to exploit this call to use more bandwidth than
   they are given.
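
To make the intended usage pattern concrete, a minimal userspace loop
for a periodic -deadline task might look as follows. This is only a
sketch: the syscall number is the x86-64 one introduced here, and the
task is assumed to have already been switched to SCHED_DEADLINE via
sched_setscheduler_ex():

#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_sched_wait_interval
#define __NR_sched_wait_interval 303	/* x86-64 number from this patch */
#endif

static void do_instance_work(void)
{
	/* one instance worth of work */
}

int main(void)
{
	struct timespec next, rem;

	clock_gettime(CLOCK_MONOTONIC, &next);
	for (;;) {
		do_instance_work();
		/*
		 * Sleep at least until the next activation; the scheduler
		 * replenishes the runtime and sets the new absolute
		 * deadline when it wakes us up.
		 */
		next.tv_sec += 1;	/* nominal 1s period */
		syscall(__NR_sched_wait_interval, &next, &rem);
	}
	return 0;
}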

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 arch/arm/include/asm/unistd.h      |    1 +
 arch/arm/kernel/calls.S            |    1 +
 arch/x86/ia32/ia32entry.S          |    1 +
 arch/x86/include/asm/unistd_32.h   |    3 +-
 arch/x86/include/asm/unistd_64.h   |    2 +
 arch/x86/kernel/syscall_table_32.S |    1 +
 include/linux/sched.h              |    2 +
 include/linux/syscalls.h           |    2 +
 kernel/sched.c                     |   39 +++++++++++++++++++
 kernel/sched_dl.c                  |   74 +++++++++++++++++++++++++++++++++++-
 10 files changed, 124 insertions(+), 2 deletions(-)

diff --git a/arch/arm/include/asm/unistd.h b/arch/arm/include/asm/unistd.h
index e741cd6..8ca0ece 100644
--- a/arch/arm/include/asm/unistd.h
+++ b/arch/arm/include/asm/unistd.h
@@ -395,6 +395,7 @@
 #define __NR_sched_setscheduler_ex	(__NR_SYSCALL_BASE+366)
 #define __NR_sched_setparam_ex		(__NR_SYSCALL_BASE+367)
 #define __NR_sched_getparam_ex		(__NR_SYSCALL_BASE+368)
+#define __NR_sched_wait_interval	(__NR_SYSCALL_BASE+369)
 
 /*
  * The following SWIs are ARM private.
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index 8eeb552..6cb3842 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -378,6 +378,7 @@
 		CALL(sys_sched_setscheduler_ex)
 		CALL(sys_sched_setparam_ex)
 		CALL(sys_sched_getparam_ex)
+		CALL(sys_sched_wait_interval)
 #ifndef syscalls_counted
 .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
 #define syscalls_counted
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index f24e9fa..85d383b 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -845,4 +845,5 @@ ia32_sys_call_table:
 	.quad sys_sched_setscheduler_ex
 	.quad sys_sched_setparam_ex
 	.quad sys_sched_getparam_ex		/* 340 */
+	.quad sys_sched_wait_interval
 ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 1db148b..83c8a3f 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -346,10 +346,11 @@
 #define __NR_sched_setscheduler_ex	338
 #define __NR_sched_setparam_ex		339
 #define __NR_sched_getparam_ex		340
+#define __NR_sched_wait_interval	341
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 341
+#define NR_syscalls 342
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index d254154..b9e3356 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -669,6 +669,8 @@ __SYSCALL(__NR_sched_setscheduler_ex, sys_sched_setscheduler_ex)
 __SYSCALL(__NR_sched_setparam_ex, sys_sched_setparam_ex)
 #define __NR_sched_getparam_ex			302
 __SYSCALL(__NR_sched_getparam_ex, sys_sched_getparam_ex)
+#define __NR_sched_wait_interval		303
+__SYSCALL(__NR_sched_wait_interval, sys_sched_wait_interval)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index e27e002..1609ba9 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -340,3 +340,4 @@ ENTRY(sys_call_table)
 	.long sys_sched_setscheduler_ex
 	.long sys_sched_setparam_ex
 	.long sys_sched_getparam_ex	/* 340 */
+	.long sys_sched_wait_interval
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3c466a6..e6c1cda 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1139,6 +1139,8 @@ struct sched_class {
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
 	void (*yield_task) (struct rq *rq);
+	long (*wait_interval) (struct task_struct *p, struct timespec *rqtp,
+			       struct timespec __user *rmtp);
 
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
 
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 9e3ad66..97eb7c4 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -357,6 +357,8 @@ asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
 asmlinkage long sys_sched_getaffinity(pid_t pid, unsigned int len,
 					unsigned long __user *user_mask_ptr);
 asmlinkage long sys_sched_yield(void);
+asmlinkage long sys_sched_wait_interval(const struct timespec __user *rqtp,
+					struct timespec *rmtp);
 asmlinkage long sys_sched_get_priority_max(int policy);
 asmlinkage long sys_sched_get_priority_min(int policy);
 asmlinkage long sys_sched_rr_get_interval(pid_t pid,
diff --git a/kernel/sched.c b/kernel/sched.c
index 61b1561..d5e8b6c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6966,6 +6966,45 @@ SYSCALL_DEFINE0(sched_yield)
 	return 0;
 }
 
+/**
+ * sys_sched_wait_interval - sleep according to the scheduling class rules.
+ *
+ * This function is implemented inside each scheduling class, in case it
+ * wants to provide its tasks with a means of waiting for a specific
+ * instant in time, while also honouring some specific rule of its own.
+ */
+SYSCALL_DEFINE2(sched_wait_interval,
+	const struct timespec __user *, rqtp,
+	struct timespec __user *, rmtp)
+{
+	struct timespec lrq, lrm;
+	int ret;
+
+	if (rqtp != NULL) {
+		if (copy_from_user(&lrq, rqtp, sizeof(struct timespec)))
+			return -EFAULT;
+		if (!timespec_valid(&lrq))
+			return -EINVAL;
+	}
+
+	if (current->sched_class->wait_interval)
+		ret = current->sched_class->wait_interval(current,
+							  rqtp ? &lrq : NULL,
+							  &lrm);
+	else {
+		if (!rqtp)
+			return -EINVAL;
+
+		ret = hrtimer_nanosleep(&lrq, &lrm, HRTIMER_MODE_ABS,
+					CLOCK_MONOTONIC);
+	}
+
+	if (rmtp && copy_to_user(rmtp, &lrm, sizeof(struct timespec)))
+		return -EFAULT;
+
+	return ret;
+}
+
 static inline int should_resched(void)
 {
 	return need_resched() && !(preempt_count() & PREEMPT_ACTIVE);
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index 7585227..a1202ac 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -268,7 +268,7 @@ int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
 	int dmiss = dl_time_before(dl_se->deadline, rq->clock);
 	int rorun = dl_se->runtime <= 0;
 
-	if (!rorun && !dmiss)
+	if (dl_se->flags & DL_NEW || (!rorun && !dmiss))
 		return 0;
 
 	/*
@@ -424,6 +424,77 @@ static void yield_task_dl(struct rq *rq)
 {
 }
 
+/*
+ * This function makes the task sleep until at least the absolute time
+ * instant specified in @rqtp.
+ * The _at_least_ part comes from the fact that we want to be able
+ * to give the task --as soon as it wakes up-- its full runtime.
+ * Therefore, if e.g. it is in overrun when this function is invoked,
+ * or if @rqtp is too early, the sleeping time might be longer than asked.
+ * It is intended to be used at the end of a periodic -deadline task
+ * instance.
+ */
+long wait_interval_dl(struct task_struct *p, struct timespec *rqtp,
+		      struct timespec *rmtp)
+{
+	unsigned long flags;
+	struct sched_dl_entity *dl_se = &p->dl;
+	struct rq *rq = task_rq_lock(p, &flags);
+	struct timespec lrqtp;
+	u64 wakeup;
+
+	p->dl.flags |= DL_NEW;
+	update_curr_dl(rq);
+
+	/*
+	 * Task is asking for a new instance with full runtime but, since
+	 * it is in overrun, the only thing we can do is put it to sleep
+	 * until the time it will have paid back for that (which could
+	 * be its next deadline or farther).
+	 */
+	if (dl_se->runtime < 0) {
+		u64 runtime_exec = dl_se->dl_runtime - dl_se->runtime;
+		u64 rorun_ratio = runtime_exec / dl_se->dl_runtime;
+
+		WARN_ON(rorun_ratio == 0);
+
+		wakeup = dl_se->deadline + dl_se->dl_deadline * rorun_ratio;
+		goto unlock;
+	}
+
+	if (!rqtp) {
+		wakeup = p->dl.deadline;
+		goto unlock;
+	}
+
+	/*
+	 * If the task wants to wake up _before_ its absolute deadline
+	 * we must be sure that reusing its (actual) runtime and deadline
+	 * at that time _would_ overcome its bandwidth limitation, so
+	 * that we know it will be given new parameters.
+	 *
+	 * If this is not true, we postpone the wake-up time up to the right
+	 * instant. This involves a division (to calculate the reverse of the
+	 * task's bandwidth), but it is worth noticing that it is quite
+	 * unlikely that we get in here very often.
+	 */
+	wakeup = timespec_to_ns(rqtp);
+	if (dl_time_before(wakeup, dl_se->deadline) &&
+	    dl_check_bandwidth(dl_se, wakeup)) {
+		u64 ibw = (u64)dl_se->runtime * dl_se->dl_deadline;
+
+		ibw = div_u64(ibw, dl_se->dl_runtime);
+		wakeup = dl_se->deadline - ibw;
+	}
+
+unlock:
+	task_rq_unlock(rq, &flags);
+	lrqtp = ns_to_timespec(wakeup);
+
+	return hrtimer_nanosleep(&lrqtp, rmtp, HRTIMER_MODE_ABS,
+				 CLOCK_MONOTONIC);
+}
+
 #ifdef CONFIG_SCHED_HRTICK
 static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
 {
@@ -580,6 +651,7 @@ static const struct sched_class dl_sched_class = {
 	.enqueue_task		= enqueue_task_dl,
 	.dequeue_task		= dequeue_task_dl,
 	.yield_task		= yield_task_dl,
+	.wait_interval		= wait_interval_dl,
 
 	.check_preempt_curr	= check_preempt_curr_dl,
 
-- 
1.7.0

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC][PATCH 06/11] sched: add the sched-debug bits for sched_dl.
  2010-02-28 19:06 [RFC][PATCH 0/11] sched: SCHED_DEADLINE v2 Raistlin
                   ` (4 preceding siblings ...)
  2010-02-28 19:20 ` [RFC][PATCH 05/11] sched: add a syscall to wait for the next instance Raistlin
@ 2010-02-28 19:22 ` Raistlin
  2010-02-28 19:23 ` [RFC][PATCH 07/11] sched: add latency tracing for -deadline tasks Raistlin
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 25+ messages in thread
From: Raistlin @ 2010-02-28 19:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	Frederic Weisbecker, Darren Hart, Henrik Austad, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni

[-- Attachment #1: Type: text/plain, Size: 3482 bytes --]

Add the typical debugging output provided by sched_debug (if enabled)
to the sched_dl class as well.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 kernel/sched_debug.c |   33 +++++++++++++++++++++++++++++++++
 kernel/sched_dl.c    |   26 ++++++++++++++++++++++++++
 2 files changed, 59 insertions(+), 0 deletions(-)

diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 67f95aa..407a761 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -246,6 +246,38 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
 #undef P
 }
 
+void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq)
+{
+	s64 min_deadline = -1, max_deadline = -1;
+	struct rq *rq = &per_cpu(runqueues, cpu);
+	struct sched_dl_entity *last;
+	unsigned long flags;
+
+	SEQ_printf(m, "\ndl_rq[%d]:\n", cpu);
+
+	raw_spin_lock_irqsave(&rq->lock, flags);
+	if (dl_rq->rb_leftmost)
+		min_deadline = (rb_entry(dl_rq->rb_leftmost,
+					 struct sched_dl_entity,
+					 rb_node))->deadline;
+	last = __pick_dl_last_entity(dl_rq);
+	if (last)
+		max_deadline = last->deadline;
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
+
+#define P(x) \
+	SEQ_printf(m, "  .%-30s: %Ld\n", #x, (long long)(x))
+#define PN(x) \
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", #x, SPLIT_NS(x))
+
+	P(dl_rq->dl_nr_running);
+	PN(min_deadline);
+	PN(max_deadline);
+
+#undef PN
+#undef P
+}
+
 static void print_cpu(struct seq_file *m, int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -305,6 +337,7 @@ static void print_cpu(struct seq_file *m, int cpu)
 #endif
 	print_cfs_stats(m, cpu);
 	print_rt_stats(m, cpu);
+	print_dl_stats(m, cpu);
 
 	print_rq(m, rq, cpu);
 }
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index a1202ac..dee2668 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -44,6 +44,9 @@ static inline int dl_time_before(u64 a, u64 b)
 	return (s64)(a - b) < 0;
 }
 
+#define for_each_leaf_dl_rq(dl_rq, rq) \
+	for (dl_rq = &rq->dl; dl_rq; dl_rq = NULL)
+
 static void enqueue_dl_entity(struct sched_dl_entity *dl_se);
 static void dequeue_dl_entity(struct sched_dl_entity *dl_se);
 static void check_dl_preempt_curr(struct task_struct *p, struct rq *rq);
@@ -512,6 +515,16 @@ static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
 }
 #endif
 
+static struct sched_dl_entity *__pick_dl_last_entity(struct dl_rq *dl_rq)
+{
+	struct rb_node *last = rb_last(&dl_rq->rb_root);
+
+	if (!last)
+		return NULL;
+
+	return rb_entry(last, struct sched_dl_entity, rb_node);
+}
+
 static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
 						   struct dl_rq *dl_rq)
 {
@@ -676,3 +689,16 @@ static const struct sched_class dl_sched_class = {
 	.switched_to		= switched_to_dl,
 };
 
+#ifdef CONFIG_SCHED_DEBUG
+extern void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq);
+
+static void print_dl_stats(struct seq_file *m, int cpu)
+{
+	struct dl_rq *dl_rq = &cpu_rq(cpu)->dl;
+
+	rcu_read_lock();
+	for_each_leaf_dl_rq(dl_rq, cpu_rq(cpu))
+		print_dl_rq(m, cpu, dl_rq);
+	rcu_read_unlock();
+}
+#endif /* CONFIG_SCHED_DEBUG */
-- 
1.7.0

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC][PATCH 07/11] sched: add latency tracing for -deadline tasks.
  2010-02-28 19:06 [RFC][PATCH 0/11] sched: SCHED_DEADLINE v2 Raistlin
                   ` (5 preceding siblings ...)
  2010-02-28 19:22 ` [RFC][PATCH 06/11] sched: add the sched-debug bits for sched_dl Raistlin
@ 2010-02-28 19:23 ` Raistlin
  2010-02-28 19:24 ` [RFC][PATCH 08/11] sched: send SIGXCPU at -deadline task overruns Raistlin
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 25+ messages in thread
From: Raistlin @ 2010-02-28 19:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	Frederic Weisbecker, Darren Hart, Henrik Austad, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni

[-- Attachment #1: Type: text/plain, Size: 6487 bytes --]

It is very likely that systems that want/need to use the new
SCHED_DEADLINE policy also want to have the scheduling latency of
-deadline tasks under control.

For this reason a new version of the wakeup latency tracer,
called "wakeup_dl", is introduced.

As a consequence of applying this patch there will be three wakeup
latency tracers (a quick usage sketch follows the list):
 * "wakeup", that deals with all tasks in the system;
 * "wakeup_rt", that deals with -rt and -deadline tasks only;
 * "wakeup_dl", that deals with -deadline tasks only.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 kernel/trace/trace_sched_wakeup.c |   44 +++++++++++++++++++++++++++++++++---
 kernel/trace/trace_selftest.c     |   30 +++++++++++++++----------
 2 files changed, 58 insertions(+), 16 deletions(-)

diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
index 0271742..a02ce04 100644
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -27,6 +27,7 @@ static int			wakeup_cpu;
 static int			wakeup_current_cpu;
 static unsigned			wakeup_prio = -1;
 static int			wakeup_rt;
+static int			wakeup_dl;
 
 static arch_spinlock_t wakeup_lock =
 	(arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
@@ -214,9 +215,17 @@ probe_wakeup(struct rq *rq, struct task_struct *p, int success)
 	tracing_record_cmdline(p);
 	tracing_record_cmdline(current);
 
-	if ((wakeup_rt && !rt_task(p)) ||
-			p->prio >= wakeup_prio ||
-			p->prio >= current->prio)
+	/*
+	 * Semantic is like this:
+	 *  - wakeup tracer handles all tasks in the system, independently
+	 *    from their scheduling class;
+	 *  - wakeup_rt tracer handles tasks belonging to sched_dl and
+	 *    sched_rt class;
+	 *  - wakeup_dl handles tasks belonging to sched_dl class only.
+	 */
+	if ((wakeup_dl && !dl_task(p)) ||
+	    (wakeup_rt && !dl_task(p) && !rt_task(p)) ||
+	    (p->prio >= wakeup_prio || p->prio >= current->prio))
 		return;
 
 	pc = preempt_count();
@@ -228,7 +237,7 @@ probe_wakeup(struct rq *rq, struct task_struct *p, int success)
 	arch_spin_lock(&wakeup_lock);
 
 	/* check for races. */
-	if (!tracer_enabled || p->prio >= wakeup_prio)
+	if (!tracer_enabled || (!dl_task(p) && p->prio >= wakeup_prio))
 		goto out_locked;
 
 	/* reset the trace */
@@ -340,16 +349,25 @@ static int __wakeup_tracer_init(struct trace_array *tr)
 
 static int wakeup_tracer_init(struct trace_array *tr)
 {
+	wakeup_dl = 0;
 	wakeup_rt = 0;
 	return __wakeup_tracer_init(tr);
 }
 
 static int wakeup_rt_tracer_init(struct trace_array *tr)
 {
+	wakeup_dl = 0;
 	wakeup_rt = 1;
 	return __wakeup_tracer_init(tr);
 }
 
+static int wakeup_dl_tracer_init(struct trace_array *tr)
+{
+	wakeup_dl = 1;
+	wakeup_rt = 0;
+	return __wakeup_tracer_init(tr);
+}
+
 static void wakeup_tracer_reset(struct trace_array *tr)
 {
 	stop_wakeup_tracer(tr);
@@ -398,6 +416,20 @@ static struct tracer wakeup_rt_tracer __read_mostly =
 #endif
 };
 
+static struct tracer wakeup_dl_tracer __read_mostly =
+{
+	.name		= "wakeup_dl",
+	.init		= wakeup_dl_tracer_init,
+	.reset		= wakeup_tracer_reset,
+	.start		= wakeup_tracer_start,
+	.stop		= wakeup_tracer_stop,
+	.wait_pipe	= poll_wait_pipe,
+	.print_max	= 1,
+#ifdef CONFIG_FTRACE_SELFTEST
+	.selftest    = trace_selftest_startup_wakeup,
+#endif
+};
+
 __init static int init_wakeup_tracer(void)
 {
 	int ret;
@@ -410,6 +442,10 @@ __init static int init_wakeup_tracer(void)
 	if (ret)
 		return ret;
 
+	ret = register_tracer(&wakeup_dl_tracer);
+	if (ret)
+		return ret;
+
 	return 0;
 }
 device_initcall(init_wakeup_tracer);
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index 280fea4..4834411 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -558,11 +558,17 @@ trace_selftest_startup_nop(struct tracer *trace, struct trace_array *tr)
 #ifdef CONFIG_SCHED_TRACER
 static int trace_wakeup_test_thread(void *data)
 {
-	/* Make this a RT thread, doesn't need to be too high */
-	struct sched_param param = { .sched_priority = 5 };
+	/* Make this a -deadline thread */
+	struct sched_param param = { .sched_priority = 0 };
+	struct sched_param_ex paramx = {
+		.sched_priority = 0,
+		.sched_runtime = { .tv_sec = 0, .tv_nsec = 100000 },
+		.sched_deadline = { .tv_sec = 0, .tv_nsec = 10000000 },
+		.sched_flags = 0
+	};
 	struct completion *x = data;
 
-	sched_setscheduler(current, SCHED_FIFO, &param);
+	sched_setscheduler_ex(current, SCHED_DEADLINE, &param, &paramx);
 
 	/* Make it know we have a new prio */
 	complete(x);
@@ -574,8 +580,8 @@ static int trace_wakeup_test_thread(void *data)
 	/* we are awake, now wait to disappear */
 	while (!kthread_should_stop()) {
 		/*
-		 * This is an RT task, do short sleeps to let
-		 * others run.
+		 * This will likely be the system top priority
+		 * task, do short sleeps to let others run.
 		 */
 		msleep(100);
 	}
@@ -588,21 +594,21 @@ trace_selftest_startup_wakeup(struct tracer *trace, struct trace_array *tr)
 {
 	unsigned long save_max = tracing_max_latency;
 	struct task_struct *p;
-	struct completion isrt;
+	struct completion is_ready;
 	unsigned long count;
 	int ret;
 
-	init_completion(&isrt);
+	init_completion(&is_ready);
 
-	/* create a high prio thread */
-	p = kthread_run(trace_wakeup_test_thread, &isrt, "ftrace-test");
+	/* create a -deadline thread */
+	p = kthread_run(trace_wakeup_test_thread, &is_ready, "ftrace-test");
 	if (IS_ERR(p)) {
 		printk(KERN_CONT "Failed to create ftrace wakeup test thread ");
 		return -1;
 	}
 
-	/* make sure the thread is running at an RT prio */
-	wait_for_completion(&isrt);
+	/* make sure the thread is running at -deadline policy */
+	wait_for_completion(&is_ready);
 
 	/* start the tracing */
 	ret = tracer_init(trace, tr);
@@ -614,7 +620,7 @@ trace_selftest_startup_wakeup(struct tracer *trace, struct trace_array *tr)
 	/* reset the max latency */
 	tracing_max_latency = 0;
 
-	/* sleep to let the RT thread sleep too */
+	/* sleep to let the thread sleep too */
 	msleep(100);
 
 	/*
-- 
1.7.0

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC][PATCH 08/11] sched: send SIGXCPU at -deadline task overruns.
  2010-02-28 19:06 [RFC][PATCH 0/11] sched: SCHED_DEADLINE v2 Raistlin
                   ` (6 preceding siblings ...)
  2010-02-28 19:23 ` [RFC][PATCH 07/11] sched: add latency tracing for -deadline tasks Raistlin
@ 2010-02-28 19:24 ` Raistlin
  2010-04-13 18:22   ` Peter Zijlstra
  2010-02-28 19:26 ` [RFC][PATCH 09/11] sched: first draft of deadline inheritance Raistlin
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 25+ messages in thread
From: Raistlin @ 2010-02-28 19:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	Frederic Weisbecker, Darren Hart, Henrik Austad, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni

[-- Attachment #1: Type: text/plain, Size: 5095 bytes --]

Add to the scheduler the capability of notifying when -deadline tasks
overrun their maximum runtime and/or miss their scheduling
deadline.

Runtime overruns might be quite common, e.g., due to the coarse
granularity of execution time accounting or to wrong assignment of
tasks' parameters (especially the runtime). However, since the
scheduler enforces bandwidth isolation among tasks, this is not at all
a threat to other tasks' schedulability. For this reason, it is not
common for a task to want to be notified about that. Moreover, if we
are using the SCHED_DEADLINE policy with sporadic tasks, or to limit
the bandwidth of tasks that are neither periodic nor sporadic, runtime
overruns are very likely to occur at each and every instance, and
again they should not be considered a problem.

On the other hand, a deadline miss in any task means that, even though
we are trying our best to keep each task isolated and to avoid
reciprocal interference, something went very, very bad, and one task
did not manage to consume its runtime by its deadline. This is
something that should only happen on an oversubscribed system, and
thus being notified when it occurs could be very useful.

The user can ask for these notifications to be sent to his task by
raising, at sched_setscheduler_ex() time, two specific flags in the
sched_flags field of struct sched_param_ex (a usage sketch follows
the list of changes below):
 * SCHED_SIG_RORUN (if he wants to be signaled on runtime overrun),
 * SCHED_SIG_DMISS (if he wants to be signaled on deadline misses).

This patch:
 - adds the logic needed to send a SIGXCPU signal to a -deadline task
   in case its actual runtime becomes negative;
 - adds the logic needed to send a SIGXCPU signal to a -deadline task
   in case it is still being scheduled while its absolute deadline
   passes.
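
As an illustration, a task could opt in to deadline-miss notifications
roughly as follows. This is only a sketch: the struct layout and flag
values mirror this patchset, while SCHED_DEADLINE's numeric value and
the syscall number are assumptions that depend on the arch/tree:

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

struct sched_param_ex {			/* as introduced by this patchset */
	int sched_priority;
	struct timespec sched_runtime;
	struct timespec sched_deadline;
	struct timespec sched_period;
	unsigned int sched_flags;
};

#define SCHED_DEADLINE			6	/* assumption */
#define SCHED_SIG_RORUN			0x0001
#define SCHED_SIG_DMISS			0x0002
#define __NR_sched_setscheduler_ex	300	/* assumption: x86-64 */

static void xcpu_handler(int sig)
{
	/* a runtime overrun or a deadline miss has been reported */
	(void)sig;
}

int main(void)
{
	struct sched_param_ex p;

	signal(SIGXCPU, xcpu_handler);

	memset(&p, 0, sizeof(p));
	p.sched_runtime.tv_nsec  =  5 * 1000 * 1000;	/*  5 ms */
	p.sched_deadline.tv_nsec = 20 * 1000 * 1000;	/* 20 ms */
	p.sched_flags = SCHED_SIG_DMISS;	/* only care about misses */

	if (syscall(__NR_sched_setscheduler_ex, 0, SCHED_DEADLINE,
		    sizeof(p), &p))
		perror("sched_setscheduler_ex");

	/* ... periodic work ... */

	return 0;
}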

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 include/linux/sched.h |   18 ++++++++++++++++++
 kernel/sched_dl.c     |   24 ++++++++++++++++++++++++
 2 files changed, 42 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e6c1cda..64a7df2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -138,6 +138,7 @@ struct sched_param {
  * More information about the algorithm are available in the scheduling
  * class file or in Documentation/.
  */
+
 struct sched_param_ex {
 	int sched_priority;
 	struct timespec sched_runtime;
@@ -146,6 +147,23 @@ struct sched_param_ex {
 	unsigned int sched_flags;
 };
 
+/*
+ * User -deadline flags.
+ *
+ * These flags are exported to userspace so that tasks can try to affect
+ * the scheduler behaviour and/or specify that they want to be informed
+ * of the occurrence of some events. There are at most 16 of them available
+ * (lowest bits), since values above 0x10000 are reserved for kernel
+ * internal flags.
+ *
+ *  @SCHED_SIG_RORUN	tells us the task wants to be notified whenever
+ *                      a runtime overrun occurs;
+ *  @SCHED_SIG_DMISS	tells us the task wants to be notified whenever
+ *                      a scheduling deadline is missed.
+ */
+#define SCHED_SIG_RORUN		0x0001
+#define SCHED_SIG_DMISS		0x0002
+
 struct exec_domain;
 struct futex_pi_state;
 struct robust_list_head;
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index dee2668..b5dde44 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -265,6 +265,14 @@ static void init_dl_timer(struct sched_dl_entity *dl_se)
 	timer->function = dl_timer;
 }
 
+#define dl_se_signal(se, s, msg)					\
+ do {									\
+  struct task_struct *t = dl_task_of(se);				\
+  sigaddset(&t->pending.signal, s);					\
+  set_tsk_thread_flag(t, TIF_SIGPENDING);				\
+  printk(KERN_INFO msg "in %d (%s)\n", task_pid_nr(t), t->comm);	\
+ } while (0)
+
 static
 int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
 {
@@ -283,6 +291,22 @@ int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
 	if (dmiss)
 		dl_se->runtime -= rq->clock - dl_se->deadline;
 
+	/*
+	 * if the userspace asked for that, we notify about (scheduling)
+	 * deadline misses and runtime overruns via sending SIGXCPU to
+	 * "faulting" task.
+	 *
+	 * Note that (hopefully small) runtime overruns are very likely
+	 * to occur, mainly due to accounting resolution, while missing a
+	 * scheduling deadline should be very rare, and only happen on
+	 * an oversubscribed system.
+	 */
+	if (unlikely(dmiss && dl_se->flags & SCHED_SIG_DMISS))
+		dl_se_signal(dl_se, SIGXCPU, "Deadline miss");
+
+	if (unlikely(rorun && dl_se->flags & SCHED_SIG_RORUN))
+		dl_se_signal(dl_se, SIGXCPU, "Runtime overrun");
+
 	dequeue_dl_entity(dl_se);
 	if (!start_dl_timer(dl_se, dl_se->deadline)) {
 		if (!rorun)
-- 
1.7.0

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC][PATCH 09/11] sched: first draft of deadline inheritance.
  2010-02-28 19:06 [RFC][PATCH 0/11] sched: SCHED_DEADLINE v2 Raistlin
                   ` (7 preceding siblings ...)
  2010-02-28 19:24 ` [RFC][PATCH 08/11] sched: send SIGXCPU at -deadline task overruns Raistlin
@ 2010-02-28 19:26 ` Raistlin
  2010-04-14  8:25   ` Peter Zijlstra
  2010-02-28 19:27 ` [RFC][PATCH 10/11] sched: add bandwidth management for sched_dl Raistlin
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 25+ messages in thread
From: Raistlin @ 2010-02-28 19:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	Frederic Weisbecker, Darren Hart, Henrik Austad, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni

[-- Attachment #1: Type: text/plain, Size: 14416 bytes --]

Some method to deal with rt-mutexes and make sched_dl interact with
the current PI code is needed. The issues this raises are not at all
trivial, but a simple enough implementation is possible that avoids
completely restructuring either of the two.
The solution we are proposing here is based on two key concepts:
(i) relative deadline inheritance, and (ii) boosting of the
inheriting task's bandwidth.

In some more detail:

  (i) the relative deadline is representative of how "urgent" each
      instance of a task is and, while executing, the absolute
      deadline --according to which all scheduling decisions are
      taken-- is calculated exactly by adding that value to the
      instance activation time. Therefore, a task inheriting a
      relative deadline smaller than its original one will likely
      have a smaller absolute deadline too, which looks very
      similar to what happens when priority inheritance (e.g.,
      because of rt-mutexes) takes place among RT tasks. Moreover,
      relative deadlines are meaningful on a per-task basis,
      independently of the CPU the tasks are running on (and of
      any kind of clock synchronization issues among the different
      CPUs).
 (ii) in order to have the inheriting task finish its "operation"
      quickly, we also want it to be temporarily insensitive to
      the bandwidth enforcement mechanism, which would otherwise
      slow it down and provide another source of priority inversion.
      It is true that this opens the door to situations in which
      the system is more loaded than we expected (and perhaps
      even oversubscribed), but as long as the fundamental
      assumption that these time intervals are small enough holds,
      this won't be anything unbearable.

This is not a 100% theoretically correct solution, although some kind
of analysis can be attempted. The main flaw is that it breaks the
bandwidth isolation property the sched_dl policy (also) strives for.
Different solutions exist that do a better job of preserving such an
important feature, but they are way more complicated, and their
implementation strategy is still under consideration and study.

Therefore, as of now, this patch:
 - implements priority inheritance for -deadline tasks, according to
   what is described above;
 - to make this possible without rewriting outstanding chunks of
   code of both -deadline scheduling and the existing PI, the task's
   relative deadline is stored in the prio field of the task_struct.
   This is done in such a way that (see the sketch after this list):
    * prio is always < 0 for -deadline tasks,
    * p1->prio < p2->prio still means p1 has higher priority than
      p2, i.e., in our case, p1 has the smaller relative deadline.
 - the point above means that, since prio is of int type, a relative
   deadline has to be smaller than INT_MAX. This is about 2sec,
   which is something (we think! :-)) we can afford, at least
   for now.
 - disables bandwidth throttling for tasks while they are deadline
   boosted. It also tries to make them pay back for runtime overruns
   and deadline misses in this phase, but only "locally", in the
   sense that instances farther away than the one right next to the
   overrun are not going to be directly affected.
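
To make the prio encoding above concrete, here is a tiny standalone
sketch of the mapping this patch uses. The names mirror the patch; the
program itself is just for illustration and runs entirely in
userspace:

#include <limits.h>
#include <stdio.h>

#define MAX_SCHED_DEADLINE	INT_MAX

/* what normal_prio() does for a -deadline task */
static int deadline_to_prio(int dl_deadline_ns)
{
	return -(MAX_SCHED_DEADLINE - dl_deadline_ns);
}

/* what dl_se_prio_to_deadline() does */
static int prio_to_deadline(int prio)
{
	return MAX_SCHED_DEADLINE + prio;
}

int main(void)
{
	int d1 = 10 * 1000 * 1000;	/*  10 ms, in ns */
	int d2 = 100 * 1000 * 1000;	/* 100 ms, in ns */

	/* a smaller relative deadline maps to a smaller (i.e., higher) prio */
	printf("%d ns -> prio %d\n", d1, deadline_to_prio(d1));
	printf("%d ns -> prio %d\n", d2, deadline_to_prio(d2));

	/* and the mapping is exactly invertible */
	printf("round trip: %d ns\n", prio_to_deadline(deadline_to_prio(d1)));

	return 0;
}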

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 include/linux/sched.h |   30 ++++++++++++++++++++--
 kernel/sched.c        |   64 +++++++++++++++++++++++++++++--------------------
 kernel/sched_dl.c     |   57 +++++++++++++++++++++++++++++++++++++++----
 3 files changed, 116 insertions(+), 35 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 64a7df2..0b3a302 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1723,16 +1723,40 @@ static inline int rt_task(struct task_struct *p)
 	return rt_prio(p->prio);
 }
 
-static inline int dl_policy(int policy)
+/*
+ * Relative deadline of a process is a time interval representative
+ * of how "urgent" each instance of such a process is. During actual
+ * execution the absolute deadline of the process --i.e., the value
+ * according to which all scheduling decisions are taken-- is
+ * calculated by adding to the instance activation time the process'
+ * relative deadline.
+ * Therefore, a process that inherits a smaller relative deadline will
+ * likely have a smaller absolute deadline than its original one, which
+ * means something very similar to the priority boosting that occurs
+ * when priority inheritance (e.g., because of rt-mutexes) take place
+ * among RT tasks.
+ * Thus we can somehow treat relative deadlines as priorities and have
+ * all the priority inheritance code seamlessly working even for -deadline
+ * tasks.
+ *
+ * This is how the current implementation of deadline inheritance works
+ * and, in order for this to be possible, we need the relative deadline
+ * value to be representable in the integer field "prio" of the process'
+ * task_struct.
+ */
+
+#define MAX_SCHED_DEADLINE	INT_MAX
+
+static inline int dl_prio(int prio)
 {
-	if (unlikely(policy == SCHED_DEADLINE))
+	if (unlikely(prio < 0))
 		return 1;
 	return 0;
 }
 
 static inline int dl_task(struct task_struct *p)
 {
-	return dl_policy(p->policy);
+	return dl_prio(p->prio);
 }
 
 static inline struct pid *task_pid(struct task_struct *task)
diff --git a/kernel/sched.c b/kernel/sched.c
index d5e8b6c..87782a3 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -131,6 +131,13 @@ static inline int task_has_rt_policy(struct task_struct *p)
 	return rt_policy(p->policy);
 }
 
+static inline int dl_policy(int policy)
+{
+	if (unlikely(policy == SCHED_DEADLINE))
+		return 1;
+	return 0;
+}
+
 static inline int task_has_dl_policy(struct task_struct *p)
 {
 	return dl_policy(p->policy);
@@ -1940,17 +1947,20 @@ static inline int __normal_prio(struct task_struct *p)
  * boosted by interactivity modifiers. Changes upon fork,
  * setprio syscalls, and whenever the interactivity
  * estimator recalculates.
+ *
+ * For -deadline tasks the relative deadline is used as
+ * priority but in such a way that:
+ *  - prio is always < 0,
+ *  - p1->prio < p2->prio means p1's rel. deadline is smaller
+ *    than p2's one.
  */
 static inline int normal_prio(struct task_struct *p)
 {
 	int prio;
 
-	if (task_has_dl_policy(p)) {
-		/*
-		 * FIXME: horrible hack here... Deadline inheritance needed!!
-		 */
-		prio = -(1<<7);
-	} else if (task_has_rt_policy(p))
+	if (task_has_dl_policy(p))
+		prio = -(MAX_SCHED_DEADLINE - p->dl.dl_deadline);
+	else if (task_has_rt_policy(p))
 		prio = MAX_RT_PRIO-1 - p->rt_priority;
 	else
 		prio = __normal_prio(p);
@@ -1960,19 +1970,20 @@ static inline int normal_prio(struct task_struct *p)
 /*
  * Calculate the current priority, i.e. the priority
  * taken into account by the scheduler. This value might
- * be boosted by RT tasks, or might be boosted by
- * interactivity modifiers. Will be RT if the task got
- * RT-boosted. If not then it returns p->normal_prio.
+ * be boosted by deadline/RT tasks, or might be boosted by
+ * interactivity modifiers. Will be deadline if the task got
+ * deadline-boosted (and the same for RT). If not then it
+ * returns p->normal_prio.
  */
 static int effective_prio(struct task_struct *p)
 {
 	p->normal_prio = normal_prio(p);
 	/*
-	 * If we are RT tasks or we were boosted to RT priority,
+	 * If we are deadline/RT tasks or we were boosted to such priority,
 	 * keep the priority unchanged. Otherwise, update priority
 	 * to the normal priority:
 	 */
-	if (!rt_prio(p->prio))
+	if (!dl_prio(p->prio) && !rt_prio(p->prio))
 		return p->normal_prio;
 	return p->prio;
 }
@@ -2633,10 +2644,7 @@ void sched_fork(struct task_struct *p, int clone_flags)
 	 */
 	p->prio = current->normal_prio;
 
-	/*
-	 * FIXME: deadline inheritance needed here!!
-	 */
-	if (dl_task(p))
+	if (dl_prio(p->prio))
 		p->sched_class = &dl_sched_class;
 	else if (rt_prio(p->prio))
 		p->sched_class = &rt_sched_class;
@@ -6086,8 +6094,6 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 	struct rq *rq;
 	const struct sched_class *prev_class = p->sched_class;
 
-	BUG_ON(prio < 0 || prio > MAX_PRIO);
-
 	rq = task_rq_lock(p, &flags);
 	update_rq_clock(rq);
 
@@ -6099,12 +6105,20 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 	if (running)
 		p->sched_class->put_prev_task(rq, p);
 
-	/*
-	 * FIXME: deadline inheritance needed here!!
-	 */
-	if (dl_task(p))
+	if (dl_prio(prio)) {
+		/*
+		 * If we are boosting p to -deadline its static runtime and
+		 * relative deadline are both set to the relative deadline
+		 * being inherited, which means the maximum boosting we are
+		 * able to provide as of now!
+		 */
+		if (!task_has_dl_policy(p)) {
+			p->dl.dl_runtime = dl_se_prio_to_deadline(prio);
+			p->dl.dl_deadline = dl_se_prio_to_deadline(prio);
+			p->dl.flags = DL_NEW;
+		}
 		p->sched_class = &dl_sched_class;
-	else if (rt_prio(prio))
+	} else if (rt_prio(prio))
 		p->sched_class = &rt_sched_class;
 	else
 		p->sched_class = &fair_sched_class;
@@ -6289,10 +6303,7 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
 	/* we are holding p->pi_lock already */
 	p->prio = rt_mutex_getprio(p);
 
-	/*
-	 * FIXME: deadline inheritance needed here!!
-	 */
-	if (dl_policy(policy))
+	if (dl_prio(p->prio))
 		p->sched_class = &dl_sched_class;
 	else if (rt_prio(p->prio))
 		p->sched_class = &rt_sched_class;
@@ -6341,6 +6352,7 @@ static bool
 __checkparam_dl(struct sched_param_ex *prm)
 {
 	return prm && timespec_to_ns(&prm->sched_deadline) != 0 &&
+	       timespec_to_ns(&prm->sched_deadline) <= MAX_SCHED_DEADLINE &&
 	       timespec_to_ns(&prm->sched_deadline) >=
 	       timespec_to_ns(&prm->sched_runtime);
 }
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index b5dde44..3613cbd 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -39,6 +39,31 @@ static inline int on_dl_rq(struct sched_dl_entity *dl_se)
 	return !RB_EMPTY_NODE(&dl_se->rb_node);
 }
 
+static inline int dl_se_boosted(struct sched_dl_entity *dl_se)
+{
+	struct task_struct *p = dl_task_of(dl_se);
+
+	return p->prio != p->normal_prio;
+}
+
+/*
+ * Decodes the task's priority value and returns the corresponding
+ * relative deadline value.
+ */
+static inline int dl_se_prio_to_deadline(int prio)
+{
+	return MAX_SCHED_DEADLINE + prio;
+}
+
+/*
+ * Returns the actual relative deadline of the scheduling entity,
+ * i.e., its own one or the earliest it has inherited so far.
+ */
+static inline int dl_se_deadline(struct sched_dl_entity *dl_se)
+{
+	return dl_se_prio_to_deadline(dl_task_of(dl_se)->prio);
+}
+
 static inline int dl_time_before(u64 a, u64 b)
 {
 	return (s64)(a - b) < 0;
@@ -70,7 +95,7 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
 
 	WARN_ON(!(dl_se->flags & DL_NEW) || dl_se->flags & DL_THROTTLED);
 
-	dl_se->deadline = rq->clock + dl_se->dl_deadline;
+	dl_se->deadline = rq->clock + dl_se_deadline(dl_se);
 	dl_se->runtime = dl_se->dl_runtime;
 	dl_se->flags &= ~DL_NEW;
 }
@@ -103,7 +128,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 	 * arbitrary large.
 	 */
 	while (dl_se->runtime <= 0) {
-		dl_se->deadline += dl_se->dl_deadline;
+		dl_se->deadline += dl_se_deadline(dl_se);
 		dl_se->runtime += dl_se->dl_runtime;
 	}
 
@@ -129,7 +154,7 @@ static bool dl_check_bandwidth(struct sched_dl_entity *dl_se, u64 t)
 	 * after a bit of shuffling to use multiplications instead
 	 * of divisions.
 	 */
-	left = dl_se->dl_deadline * dl_se->runtime;
+	left = dl_se_deadline(dl_se) * dl_se->runtime;
 	right = (dl_se->deadline - t) * dl_se->dl_runtime;
 
 	return dl_time_before(left, right);
@@ -165,7 +190,7 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
 
 	if (!dl_check_bandwidth(dl_se, rq->clock)) {
 update:
-		dl_se->deadline = rq->clock + dl_se->dl_deadline;
+		dl_se->deadline = rq->clock + dl_se_deadline(dl_se);
 		dl_se->runtime = dl_se->dl_runtime;
 	}
 }
@@ -190,6 +215,26 @@ static int start_dl_timer(struct sched_dl_entity *dl_se, u64 wakeup)
 	s64 delta;
 
 	/*
+	 * This is the second part of the boosting logic (the first one
+	 * being relative deadline inheritance through the prio field).
+	 * In fact, this function is called when a task is overrunning its
+	 * runtime (or missing its deadline) in order to put it to sleep
+	 * for a while, thus enforcing its bandwidth limitation.
+	 * However, if such a task is boosted, we tell the caller (by
+	 * returning 0) to immediately give the task new scheduling
+	 * parameters and enqueue it back without waiting for the next abs.
+	 * deadline.
+	 *
+	 * This means tasks _are_ able to overcome their bandwidth while
+	 * boosted. As for paying back for these overruns, this
+	 * is done at this time, in terms of smaller runtime replenishment
+	 * (and perhaps farther deadline postponement), but subsequent
+	 * instances _won't_ be affected.
+	 */
+	if (dl_se_boosted(dl_se))
+		return 0;
+
+	/*
+	 * Arm the timer to fire at wakeup, trying to compensate for
 	 * the fact that wakeup is actually coming from rq->clock and
 	 * not from hrtimer's time base reading.
@@ -485,7 +530,7 @@ long wait_interval_dl(struct task_struct *p, struct timespec *rqtp,
 
 		WARN_ON(rorun_ratio == 0);
 
-		wakeup = dl_se->deadline + dl_se->dl_deadline * rorun_ratio;
+		wakeup = dl_se->deadline + dl_se_deadline(dl_se) * rorun_ratio;
 		goto unlock;
 	}
 
@@ -508,7 +553,7 @@ long wait_interval_dl(struct task_struct *p, struct timespec *rqtp,
 	wakeup = timespec_to_ns(rqtp);
 	if (dl_time_before(wakeup, dl_se->deadline) &&
 	    dl_check_bandwidth(dl_se, wakeup)) {
-		u64 ibw = (u64)dl_se->runtime * dl_se->dl_deadline;
+		u64 ibw = (u64)dl_se->runtime * dl_se_deadline(dl_se);
 
 		ibw = div_u64(ibw, dl_se->dl_runtime);
 		wakeup = dl_se->deadline - ibw;
-- 
1.7.0

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC][PATCH 10/11] sched: add bandwidth management for sched_dl.
  2010-02-28 19:06 [RFC][PATCH 0/11] sched: SCHED_DEADLINE v2 Raistlin
                   ` (8 preceding siblings ...)
  2010-02-28 19:26 ` [RFC][PATCH 09/11] sched: first draft of deadline inheritance Raistlin
@ 2010-02-28 19:27 ` Raistlin
  2010-04-14 10:09   ` Peter Zijlstra
  2010-02-28 19:28 ` [RFC][PATCH 11/11] sched: add sched_dl documentation Raistlin
  2010-04-14 10:17 ` [RFC][PATCH 0/11] sched: SCHED_DEADLINE v2 Peter Zijlstra
  11 siblings, 1 reply; 25+ messages in thread
From: Raistlin @ 2010-02-28 19:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	Frederic Weisbecker, Darren Hart, Henrik Austad, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni

[-- Attachment #1: Type: text/plain, Size: 45369 bytes --]

In order for -deadline scheduling to be effective and useful, it is
important to have some method of keeping the allocation of the
available CPU bandwidth to tasks and task groups under control.
This is usually called "admission control", and if it is not performed
at all, no guarantee can be given on the actual scheduling of the
-deadline tasks.

Since RT-throttling was introduced, each task group has had a
bandwidth associated with it, calculated as a certain amount of
runtime over a period. Moreover, to make it possible to manipulate
such bandwidth, readable/writable controls have been added to both
procfs (for system-wide settings) and cgroupfs (for per-group
settings).
Therefore, the same interface is used for controlling the bandwidth
distribution to -deadline tasks and task groups, i.e., new controls
with similar names, equivalent meaning and the same usage paradigm
are added.

The main difference between deadline bandwidth management and
RT-throttling is that -deadline tasks have a bandwidth of their own
(while -rt ones don't!), and thus we don't need a throttling
mechanism inside the groups, which are used for nothing more than
admission control of tasks.

This patch:
 - adds system-wide deadline bandwidth management by means of:
    * /proc/sys/kernel/sched_dl_runtime_us,
    * /proc/sys/kernel/sched_dl_period_us,
   which determine (as runtime / period) the total bandwidth
   available on each CPU for -deadline tasks and task groups;

 - adds system-wide deadline bandwidth reporting by means of:
    * /proc/sys/kernel/sched_dl_total_bw,
   which --after the index of an online CPU is written to it-- tells
   how much of the total available bandwidth of that CPU is
   currently allocated.

 - adds per-group deadline bandwidth management by means of:
    * /cgroup/<group>/cpu.dl_runtime_us,
    * /cgroup/<group>/cpu.dl_period_us,
   (same as above, but per-group);

 - adds per-group deadline bandwidth management by means of:
    * /cgroup/<group>/cpu.dl_total_bw,
   (same as above, but per-group).

 - couples the RT and deadline bandwidth management (at system
   level only for now), i.e., the sum of the bandwidth devoted to
   -rt entities and to -deadline tasks and task groups must stay
   below 100% (the sketch below shows how these bandwidth fractions
   are computed).
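
For reference, the admission tests work on bandwidths expressed as
fixed-point (Q20) fractions of a CPU. A small standalone sketch of the
arithmetic, mirroring to_ratio() in the patch, shows where the default
5% setting comes from:

#include <stdint.h>
#include <stdio.h>

/* same arithmetic as to_ratio() in the patch: runtime/period in Q20 */
static uint64_t to_ratio(uint64_t period, uint64_t runtime)
{
	if (period == 0)
		return 0;
	return (runtime << 20) / period;
}

int main(void)
{
	/* defaults: 50000us of runtime every 1000000us of period => 5% */
	uint64_t bw = to_ratio(1000000, 50000);

	printf("dl_bw = %llu (%.2f%% of a CPU)\n",
	       (unsigned long long)bw, bw * 100.0 / (1 << 20));

	return 0;
}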

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 include/linux/sched.h |   13 +
 init/Kconfig          |   14 +
 kernel/sched.c        | 1003 ++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched_debug.c  |    3 +-
 kernel/sched_dl.c     |   16 +-
 kernel/sysctl.c       |   21 +
 6 files changed, 1054 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0b3a302..66f6872 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1329,6 +1329,7 @@ struct sched_dl_entity {
 	 */
 	u64 dl_runtime;		/* maximum runtime for each instance 	*/
 	u64 dl_deadline;	/* relative deadline of each instance	*/
+	u64 dl_bw;		/* dl_runtime / dl_deadline		*/
 
 	/*
 	 * Actual scheduling parameters. They are initialized with the
@@ -2120,6 +2121,18 @@ int sched_rt_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos);
 
+extern unsigned int sysctl_sched_dl_total_bw;
+extern unsigned int sysctl_sched_dl_period;
+extern int sysctl_sched_dl_runtime;
+
+int sched_dl_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos);
+
+int sched_dl_total_bw_handler(struct ctl_table *table, int write,
+			void __user *buffer, size_t *lenp,
+			loff_t *ppos);
+
 extern unsigned int sysctl_sched_compat_yield;
 
 #ifdef CONFIG_RT_MUTEXES
diff --git a/init/Kconfig b/init/Kconfig
index 1510e17..de57415 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -473,6 +473,20 @@ config RT_GROUP_SCHED
 	  realtime bandwidth for them.
 	  See Documentation/scheduler/sched-rt-group.txt for more information.
 
+config DEADLINE_GROUP_SCHED
+	bool "Group scheduling for SCHED_DEADLINE"
+	depends on EXPERIMENTAL
+	depends on GROUP_SCHED
+	depends on CGROUPS
+	depends on !USER_SCHED
+	default n
+	help
+	  This feature lets you explicitly specify, in terms of runtime
+	  and period, the bandwidth of each task control group. This means
+	  tasks (and other groups) can be added to it only up to such
+	  "bandwidth cap", which might be useful for avoiding or
+	  controlling oversubscription.
+
 choice
 	depends on GROUP_SCHED
 	prompt "Basis for grouping tasks"
diff --git a/kernel/sched.c b/kernel/sched.c
index 87782a3..ec458ff 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -239,6 +239,97 @@ static void destroy_rt_bandwidth(struct rt_bandwidth *rt_b)
 }
 #endif
 
+static unsigned long to_ratio(u64 period, u64 runtime)
+{
+	if (runtime == RUNTIME_INF)
+		return 1ULL << 20;
+
+	/*
+	 * Doing this here saves a lot of checks in all
+	 * the calling paths, and returning zero seems
+	 * safe for them anyway.
+	 */
+	if (period == 0)
+		return 0;
+
+	return div64_u64(runtime << 20, period);
+}
+
+/*
+ * To keep the bandwidth of -deadline tasks and groups under control
+ * we need some place where:
+ *  - store the maximum -deadline bandwidth of the system (the group);
+ *  - cache the fraction of that bandwidth that is currently allocated.
+ *
+ * This is all done in the data structure below. It is similar to the
+ * one used for RT-throttling (rt_bandwidth), with the main difference
+ * that, since here we are only interested in admission control, we
+ * do not decrease any runtime while the group "executes", neither we
+ * need a timer to replenish it.
+ *
+ * With respect to SMP, the bandwidth is given on a per-CPU basis,
+ * meaning that:
+ *  - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;
+ *  - the dl_total_bw array contains, in the i-th element, the currently
+ *    allocated bandwidth on the i-th CPU.
+ * Moreover, groups consume bandwidth on each CPU, while tasks only
+ * consume bandwidth on the CPU they're running on.
+ * Finally, dl_total_bw_cpu is used to cache the index of dl_total_bw
+ * that will be shown the next time the proc or cgroup controls are
+ * read. It can in turn be changed by writing to its own
+ * control.
+ */
+struct dl_bandwidth {
+	raw_spinlock_t dl_runtime_lock;
+
+	/* dl_bw = dl_runtime / dl_period */
+	u64 dl_runtime;
+	u64 dl_period;
+	u64 dl_bw;
+
+	/* dl_total_bw[cpu] < dl_bw (for each cpu) */
+	int dl_total_bw_cpu;
+	u64 *dl_total_bw;
+};
+
+static struct dl_bandwidth def_dl_bandwidth;
+
+static
+int init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime)
+{
+	raw_spin_lock_init(&dl_b->dl_runtime_lock);
+	dl_b->dl_period = period;
+	dl_b->dl_runtime = runtime;
+	dl_b->dl_bw = to_ratio(period, runtime);
+
+	dl_b->dl_total_bw_cpu = 0;
+	dl_b->dl_total_bw = kzalloc(sizeof(u64) * nr_cpu_ids, GFP_KERNEL);
+	if (!dl_b->dl_total_bw)
+		return 0;
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	if (dl_b == &def_dl_bandwidth) {
+		int i;
+
+		for_each_possible_cpu(i)
+			dl_b->dl_total_bw[i] = sysctl_sched_dl_total_bw;
+	}
+#endif
+
+	return 1;
+}
+
+static inline int dl_bandwidth_enabled(void)
+{
+	return sysctl_sched_dl_runtime >= 0;
+}
+
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+static void destroy_dl_bandwidth(struct dl_bandwidth *dl_b)
+{
+	kfree(dl_b->dl_total_bw);
+}
+#endif
+
 /*
  * sched_domains_mutex serializes calls to arch_init_sched_domains,
  * detach_destroy_domains and partition_sched_domains.
@@ -278,6 +369,12 @@ struct task_group {
 	struct rt_bandwidth rt_bandwidth;
 #endif
 
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	struct dl_rq **dl_rq;
+
+	struct dl_bandwidth dl_bandwidth;
+#endif
+
 	struct rcu_head rcu;
 	struct list_head list;
 
@@ -312,6 +409,7 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct cfs_rq, init_tg_cfs_rq);
 static DEFINE_PER_CPU(struct sched_rt_entity, init_sched_rt_entity);
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct rt_rq, init_rt_rq_var);
 #endif /* CONFIG_RT_GROUP_SCHED */
+
 #else /* !CONFIG_USER_SCHED */
 #define root_task_group init_task_group
 #endif /* CONFIG_USER_SCHED */
@@ -500,8 +598,30 @@ struct dl_rq {
 	struct rb_node *rb_leftmost;
 
 	unsigned long dl_nr_running;
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	struct rq *rq;
+#endif
 };
 
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+static inline struct dl_bandwidth *task_dl_bandwidth(struct task_struct *p)
+{
+	return &task_group(p)->dl_bandwidth;
+}
+
+static inline struct dl_bandwidth *parent_dl_bandwidth(struct task_group *tg)
+{
+	if (tg->parent)
+		return &tg->parent->dl_bandwidth;
+	return &def_dl_bandwidth;
+}
+#else
+static inline struct dl_bandwidth *task_dl_bandwidth(struct task_struct *p)
+{
+	return &def_dl_bandwidth;
+}
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
+
 #ifdef CONFIG_SMP
 
 /*
@@ -879,6 +999,33 @@ static inline u64 global_rt_runtime(void)
 	return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
 }
 
+/*
+ * Maximum bandwidth available for all -deadline tasks and groups
+ * (if group scheduling is configured) on each CPU.
+ *
+ * default: 5%
+ */
+unsigned int sysctl_sched_dl_period = 1000000;
+int sysctl_sched_dl_runtime = 50000;
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+unsigned int sysctl_sched_dl_total_bw = 52428;	/* 50000<<20 / 1000000 */
+#else
+unsigned int sysctl_sched_dl_total_bw = 0;
+#endif
+
+static inline u64 global_dl_period(void)
+{
+	return (u64)sysctl_sched_dl_period * NSEC_PER_USEC;
+}
+
+static inline u64 global_dl_runtime(void)
+{
+	if (sysctl_sched_dl_runtime < 0)
+		return RUNTIME_INF;
+
+	return (u64)sysctl_sched_dl_runtime * NSEC_PER_USEC;
+}
+
 #ifndef prepare_arch_switch
 # define prepare_arch_switch(next)	do { } while (0)
 #endif
@@ -2063,6 +2210,30 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 	return delta < (s64)sysctl_sched_migration_cost;
 }
 
+/*
+ * When dealing with a -deadline task, we have to check if moving it to
+ * a new CPU is possible or not. In fact, this is the case only if there
+ * is enough bandwidth available on that CPU; otherwise, we want the
+ * whole migration procedure to fail.
+ */
+static inline
+bool __set_task_cpu_dl(struct task_struct *p, unsigned int cpu)
+{
+	struct dl_bandwidth *dl_b = task_dl_bandwidth(p);
+
+	raw_spin_lock(&dl_b->dl_runtime_lock);
+	if (dl_b->dl_bw < dl_b->dl_total_bw[cpu] + p->dl.dl_bw) {
+		raw_spin_unlock(&dl_b->dl_runtime_lock);
+
+		return 0;
+	}
+	dl_b->dl_total_bw[task_cpu(p)] -= p->dl.dl_bw;
+	dl_b->dl_total_bw[cpu] += p->dl.dl_bw;
+	raw_spin_unlock(&dl_b->dl_runtime_lock);
+
+	return 1;
+}
+
 void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 {
 #ifdef CONFIG_SCHED_DEBUG
@@ -2077,6 +2248,9 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	trace_sched_migrate_task(p, new_cpu);
 
 	if (task_cpu(p) != new_cpu) {
+		if (task_has_dl_policy(p) && !__set_task_cpu_dl(p, new_cpu))
+			return;
+
 		p->se.nr_migrations++;
 		perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 1, NULL, 0);
 	}
@@ -2672,6 +2846,91 @@ void sched_fork(struct task_struct *p, int clone_flags)
 	put_cpu();
 }
 
+static inline
+void __dl_clear_task_bw(struct dl_bandwidth *dl_b, int cpu, u64 tsk_bw)
+{
+	dl_b->dl_total_bw[cpu] -= tsk_bw;
+}
+
+static inline
+void __dl_add_task_bw(struct dl_bandwidth *dl_b, int cpu, u64 tsk_bw)
+{
+	dl_b->dl_total_bw[cpu] += tsk_bw;
+}
+
+static inline
+bool __dl_check_new_task(struct dl_bandwidth *dl_b, int cpu, u64 tsk_bw)
+{
+	return dl_b->dl_runtime == RUNTIME_INF ||
+	       dl_b->dl_bw >= dl_b->dl_total_bw[cpu] + tsk_bw;
+}
+
+static inline
+bool __dl_check_chg_task(struct dl_bandwidth *dl_b,
+			 int cpu, u64 old_tsk_bw, u64 new_tsk_bw)
+{
+	return dl_b->dl_runtime == RUNTIME_INF ||
+	       dl_b->dl_bw >= dl_b->dl_total_bw[cpu] - old_tsk_bw + new_tsk_bw;
+}
+
+/*
+ * We must be sure that accepting a new task (or allowing changing the
+ * parameters of an existing one) is consistent with the bandwidth
+ * constraints. If yes, this function also accordingly updates the currently
+ * allocated bandwidth to reflect the new situation.
+ *
+ * This function is called while holding p's rq->lock.
+ */
+static int dl_check_task_bw(struct task_struct *p, int policy,
+			    struct sched_param_ex *param_ex)
+{
+	struct dl_bandwidth *dl_b = task_dl_bandwidth(p);
+	int cpu = task_cpu(p), err = -EBUSY;
+	u64 new_tsk_bw;
+
+	raw_spin_lock(&dl_b->dl_runtime_lock);
+
+	/*
+	 * It is forbidden to create a task inside a container
+	 * that has no bandwidth.
+	 */
+	if (dl_b->dl_runtime != RUNTIME_INF && !dl_b->dl_bw) {
+		err = -EPERM;
+		goto unlock;
+	}
+
+	new_tsk_bw = to_ratio(timespec_to_ns(&param_ex->sched_deadline),
+			      timespec_to_ns(&param_ex->sched_runtime));
+	if (new_tsk_bw == p->dl.dl_bw) {
+		err = 0;
+		goto unlock;
+	}
+
+	/*
+	 * Whether a task enters, leaves, or stays -deadline but changes
+	 * its parameters, we may need to update the total allocated
+	 * bandwidth of the container accordingly.
+	 */
+	if (dl_policy(policy) && !task_has_dl_policy(p) &&
+	    __dl_check_new_task(dl_b, cpu, new_tsk_bw)) {
+		__dl_add_task_bw(dl_b, cpu, new_tsk_bw);
+		err = 0;
+	} else if (dl_policy(policy) && task_has_dl_policy(p) &&
+		   __dl_check_chg_task(dl_b, cpu, p->dl.dl_bw, new_tsk_bw)) {
+		__dl_clear_task_bw(dl_b, cpu, p->dl.dl_bw);
+		__dl_add_task_bw(dl_b, cpu, new_tsk_bw);
+		err = 0;
+	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
+		__dl_clear_task_bw(dl_b, cpu, p->dl.dl_bw);
+		err = 0;
+	}
+unlock:
+	raw_spin_unlock(&dl_b->dl_runtime_lock);
+
+	return err;
+}
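+
+/*
+ * Worked example (purely illustrative): a task asking for 10ms of runtime
+ * every 100ms of relative deadline gets dl_bw = to_ratio(100ms, 10ms),
+ * which is (10000000 << 20) / 100000000 = 104857, i.e., ~10% of 1 << 20.
+ * Against the default 5% cap (dl_bw = 52428) that request fails with
+ * -EBUSY even on an idle CPU, while e.g. 4ms over 100ms (41943, ~4%) is
+ * admitted as long as dl_total_bw[cpu] does not exceed 10485.
+ */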
+
+
 /*
  * wake_up_new_task - wake up a newly created task for the first time.
  *
@@ -6111,10 +6370,14 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 		 * relative deadline are both set to the relative deadline
 		 * being inherited, which means the maximum boosting we are
 		 * able to provide as of now!
+		 *
+		 * Notice that it also does not count in admission control,
+		 * since its bandwidth is set to 0.
 		 */
 		if (!task_has_dl_policy(p)) {
 			p->dl.dl_runtime = dl_se_prio_to_deadline(prio);
 			p->dl.dl_deadline = dl_se_prio_to_deadline(prio);
+			p->dl.dl_bw = 0;
 			p->dl.flags = DL_NEW;
 		}
 		p->sched_class = &dl_sched_class;
@@ -6327,6 +6590,7 @@ __setparam_dl(struct task_struct *p, struct sched_param_ex *param_ex)
 
 	dl_se->dl_runtime = timespec_to_ns(&param_ex->sched_runtime);
 	dl_se->dl_deadline = timespec_to_ns(&param_ex->sched_deadline);
+	dl_se->dl_bw = to_ratio(dl_se->dl_deadline, dl_se->dl_runtime);
 	dl_se->flags = param_ex->sched_flags;
 	dl_se->flags &= ~DL_THROTTLED;
 	dl_se->flags |= DL_NEW;
@@ -6488,6 +6752,15 @@ recheck:
 			return -EPERM;
 #endif
 
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+		/*
+		 * And the same for -deadline tasks.
+		 */
+		if (dl_bandwidth_enabled() && dl_policy(policy) &&
+				task_group(p)->dl_bandwidth.dl_runtime == 0)
+			return -EPERM;
+#endif
+
 		retval = security_task_setscheduler(p, policy, param);
 		if (retval)
 			return retval;
@@ -6510,6 +6783,22 @@ recheck:
 		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 		goto recheck;
 	}
+	/*
+	 * If setscheduling to SCHED_DEADLINE (or changing the parameters
+	 * of a SCHED_DEADLINE task) we need to check if enough bandwidth
+	 * is available.
+	 */
+	if (dl_policy(policy) || dl_task(p)) {
+		int err;
+
+		err = dl_check_task_bw(p, policy, param_ex);
+		if (err) {
+			__task_rq_unlock(rq);
+			raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+			return err;
+		}
+	}
+
 	update_rq_clock(rq);
 	on_rq = p->se.on_rq;
 	running = task_current(rq, p);
@@ -7476,6 +7765,8 @@ again:
 		put_task_struct(mt);
 		wait_for_completion(&req.done);
 		tlb_migrate_finish(p->mm);
+		if (task_cpu(p) != req.dest_cpu)
+			return -EAGAIN;
 		return 0;
 	}
 out:
@@ -7522,8 +7813,8 @@ static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
 	if (p->se.on_rq) {
 		deactivate_task(rq_src, p, 0);
 		set_task_cpu(p, dest_cpu);
-		activate_task(rq_dest, p, 0);
-		check_preempt_curr(rq_dest, p, 0);
+		activate_task(task_rq(p), p, 0);
+		check_preempt_curr(task_rq(p), p, 0);
 	}
 done:
 	ret = 1;
@@ -9694,6 +9985,10 @@ static void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
 static void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
 {
 	dl_rq->rb_root = RB_ROOT;
+
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	dl_rq->rq = rq;
+#endif
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -9755,6 +10050,15 @@ static void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
 }
 #endif
 
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+void init_tg_dl_entry(struct task_group *tg, struct dl_rq *dl_rq,
+		struct sched_dl_entity *dl_se, int cpu, int add,
+		struct sched_dl_entity *parent)
+{
+	tg->dl_rq[cpu] = dl_rq;
+}
+#endif
+
 void __init sched_init(void)
 {
 	int i, j;
@@ -9766,6 +10070,9 @@ void __init sched_init(void)
 #ifdef CONFIG_RT_GROUP_SCHED
 	alloc_size += 2 * nr_cpu_ids * sizeof(void **);
 #endif
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	alloc_size += nr_cpu_ids * sizeof(void **);
+#endif
 #ifdef CONFIG_USER_SCHED
 	alloc_size *= 2;
 #endif
@@ -9805,6 +10112,10 @@ void __init sched_init(void)
 		ptr += nr_cpu_ids * sizeof(void **);
 #endif /* CONFIG_USER_SCHED */
 #endif /* CONFIG_RT_GROUP_SCHED */
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+		init_task_group.dl_rq = (struct dl_rq **)ptr;
+		ptr += nr_cpu_ids * sizeof(void **);
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
 #ifdef CONFIG_CPUMASK_OFFSTACK
 		for_each_possible_cpu(i) {
 			per_cpu(load_balance_tmpmask, i) = (void *)ptr;
@@ -9819,6 +10130,8 @@ void __init sched_init(void)
 
 	init_rt_bandwidth(&def_rt_bandwidth,
 			global_rt_period(), global_rt_runtime());
+	init_dl_bandwidth(&def_dl_bandwidth,
+			global_dl_period(), global_dl_runtime());
 
 #ifdef CONFIG_RT_GROUP_SCHED
 	init_rt_bandwidth(&init_task_group.rt_bandwidth,
@@ -9829,6 +10142,11 @@ void __init sched_init(void)
 #endif /* CONFIG_USER_SCHED */
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	init_dl_bandwidth(&init_task_group.dl_bandwidth,
+			global_dl_period(), global_dl_runtime());
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
+
 #ifdef CONFIG_GROUP_SCHED
 	list_add(&init_task_group.list, &task_groups);
 	INIT_LIST_HEAD(&init_task_group.children);
@@ -9915,6 +10233,10 @@ void __init sched_init(void)
 #endif
 #endif
 
+#if defined CONFIG_DEADLINE_GROUP_SCHED && defined CONFIG_CGROUP_SCHED
+		init_tg_dl_entry(&init_task_group, &rq->dl, NULL, i, 1, NULL);
+#endif
+
 		for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
 			rq->cpu_load[j] = 0;
 #ifdef CONFIG_SMP
@@ -10307,11 +10629,89 @@ static inline void unregister_rt_sched_group(struct task_group *tg, int cpu)
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+static void free_dl_sched_group(struct task_group *tg)
+{
+	destroy_dl_bandwidth(&tg->dl_bandwidth);
+
+	kfree(tg->dl_rq);
+}
+
+int alloc_dl_sched_group(struct task_group *tg, struct task_group *parent)
+{
+	struct rq *rq;
+	int i;
+
+	tg->dl_rq = kzalloc(sizeof(struct dl_rq *) * nr_cpu_ids, GFP_KERNEL);
+	if (!tg->dl_rq)
+		return 0;
+
+	if (!init_dl_bandwidth(&tg->dl_bandwidth, global_dl_period(), 0))
+		return 0;
+
+	for_each_possible_cpu(i) {
+		rq = cpu_rq(i);
+		init_tg_dl_entry(tg, &rq->dl, NULL, i, 0, NULL);
+	}
+
+	return 1;
+}
+
+int sched_dl_can_attach(struct cgroup *cgrp, struct task_struct *tsk)
+{
+	struct task_group *tg = container_of(cgroup_subsys_state(cgrp,
+					     cpu_cgroup_subsys_id),
+					     struct task_group, css);
+	unsigned long flags;
+	struct rq *rq = task_rq_lock(tsk, &flags);
+	int ret = 1;
+
+	if (!dl_task(tsk))
+		goto unlock_rq;
+
+	raw_spin_lock(&tg->dl_bandwidth.dl_runtime_lock);
+	if (tg->dl_bandwidth.dl_runtime == RUNTIME_INF)
+		goto unlock;
+	/*
+	 * Check if the group has enough bandwidth available.
+	 */
+	if (tg->dl_bandwidth.dl_bw <
+	    tg->dl_bandwidth.dl_total_bw[task_cpu(tsk)] + tsk->dl.dl_bw)
+		ret = 0;
+unlock:
+	raw_spin_unlock(&tg->dl_bandwidth.dl_runtime_lock);
+unlock_rq:
+	task_rq_unlock(rq, &flags);
+
+	return ret;
+}
+#else /* !CONFIG_DEADLINE_GROUP_SCHED */
+static inline void free_dl_sched_group(struct task_group *tg)
+{
+}
+
+static inline
+int alloc_dl_sched_group(struct task_group *tg, struct task_group *parent)
+{
+	return 1;
+}
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
+static inline
+void register_dl_sched_group(struct task_group *tg, int cpu)
+{
+}
+
+static inline
+void unregister_dl_sched_group(struct task_group *tg, int cpu)
+{
+}
+
 #ifdef CONFIG_GROUP_SCHED
 static void free_sched_group(struct task_group *tg)
 {
 	free_fair_sched_group(tg);
 	free_rt_sched_group(tg);
+	free_dl_sched_group(tg);
 	kfree(tg);
 }
 
@@ -10332,10 +10732,14 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!alloc_rt_sched_group(tg, parent))
 		goto err;
 
+	if (!alloc_dl_sched_group(tg, parent))
+		goto err;
+
 	spin_lock_irqsave(&task_group_lock, flags);
 	for_each_possible_cpu(i) {
 		register_fair_sched_group(tg, i);
 		register_rt_sched_group(tg, i);
+		register_dl_sched_group(tg, i);
 	}
 	list_add_rcu(&tg->list, &task_groups);
 
@@ -10365,11 +10769,26 @@ void sched_destroy_group(struct task_group *tg)
 {
 	unsigned long flags;
 	int i;
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	struct task_group *parent = tg->parent;
+
+	spin_lock_irqsave(&task_group_lock, flags);
 
+	/*
+	 * If a deadline group goes away, its bandwidth must be
+	 * freed in its parent.
+	 */
+	raw_spin_lock(&parent->dl_bandwidth.dl_runtime_lock);
+	for_each_possible_cpu(i)
+		parent->dl_bandwidth.dl_total_bw[i] -= tg->dl_bandwidth.dl_bw;
+	raw_spin_unlock(&parent->dl_bandwidth.dl_runtime_lock);
+#else
 	spin_lock_irqsave(&task_group_lock, flags);
+#endif
 	for_each_possible_cpu(i) {
 		unregister_fair_sched_group(tg, i);
 		unregister_rt_sched_group(tg, i);
+		unregister_dl_sched_group(tg, i);
 	}
 	list_del_rcu(&tg->list);
 	list_del_rcu(&tg->siblings);
@@ -10516,14 +10935,6 @@ unsigned long sched_group_shares(struct task_group *tg)
  */
 static DEFINE_MUTEX(rt_constraints_mutex);
 
-static unsigned long to_ratio(u64 period, u64 runtime)
-{
-	if (runtime == RUNTIME_INF)
-		return 1ULL << 20;
-
-	return div64_u64(runtime << 20, period);
-}
-
 /* Must be called with tasklist_lock held */
 static inline int tg_has_rt_tasks(struct task_group *tg)
 {
@@ -10617,7 +11028,7 @@ static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
 	return walk_tg_tree(tg_schedulable, tg_nop, &data);
 }
 
-static int tg_set_bandwidth(struct task_group *tg,
+static int tg_set_rt_bandwidth(struct task_group *tg,
 		u64 rt_period, u64 rt_runtime)
 {
 	int i, err = 0;
@@ -10656,7 +11067,7 @@ int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us)
 	if (rt_runtime_us < 0)
 		rt_runtime = RUNTIME_INF;
 
-	return tg_set_bandwidth(tg, rt_period, rt_runtime);
+	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
 
 long sched_group_rt_runtime(struct task_group *tg)
@@ -10681,7 +11092,7 @@ int sched_group_set_rt_period(struct task_group *tg, long rt_period_us)
 	if (rt_period == 0)
 		return -EINVAL;
 
-	return tg_set_bandwidth(tg, rt_period, rt_runtime);
+	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
 
 long sched_group_rt_period(struct task_group *tg)
@@ -10692,10 +11103,44 @@ long sched_group_rt_period(struct task_group *tg)
 	do_div(rt_period_us, NSEC_PER_USEC);
 	return rt_period_us;
 }
+#endif /* CONFIG_RT_GROUP_SCHED */
+
+/*
+ * Coupling of -rt and -dl bandwidth.
+ *
+ * Here we check, while setting the system wide bandwidth available
+ * for all -rt entities, if the new values are consistent with the
+ * system settings for the bandwidth available to -dl tasks and groups.
+ *
+ * IOW, we want to enforce that
+ *
+ *   rt_bandwidth + dl_bandwidth <= 100%
+ *
+ * is always true.
+ */
+static bool __sched_rt_dl_global_constraints(u64 global_rt_bw)
+{
+	unsigned long flags;
+	u64 global_dl_bw;
+	bool ret;
+
+	raw_spin_lock_irqsave(&def_dl_bandwidth.dl_runtime_lock, flags);
+
+	global_dl_bw = to_ratio(def_dl_bandwidth.dl_period,
+				def_dl_bandwidth.dl_runtime);
+
+	ret = global_rt_bw + global_dl_bw <=
+		to_ratio(RUNTIME_INF, RUNTIME_INF);
 
+	raw_spin_unlock_irqrestore(&def_dl_bandwidth.dl_runtime_lock, flags);
+
+	return ret;
+}
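+
+/*
+ * Illustrative note: to_ratio(RUNTIME_INF, RUNTIME_INF) is 1 << 20, i.e.,
+ * the fixed-point representation of 100%. Assuming the usual -rt defaults
+ * (950000us over 1000000us, ~95%) and the -dl defaults above (50000us
+ * over 1000000us, 5%), the check reads 996147 + 52428 = 1048575 <= 1048576,
+ * so the two sets of defaults fit together.
+ */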
+
+#ifdef CONFIG_RT_GROUP_SCHED
 static int sched_rt_global_constraints(void)
 {
-	u64 runtime, period;
+	u64 runtime, period, global_bw;
 	int ret = 0;
 
 	if (sysctl_sched_rt_period <= 0)
@@ -10710,6 +11155,10 @@ static int sched_rt_global_constraints(void)
 	if (runtime > period && runtime != RUNTIME_INF)
 		return -EINVAL;
 
+	global_bw = to_ratio(period, runtime);
+	if (!__sched_rt_dl_global_constraints(global_bw))
+		return -EINVAL;
+
 	mutex_lock(&rt_constraints_mutex);
 	read_lock(&tasklist_lock);
 	ret = __rt_schedulable(NULL, 0, 0);
@@ -10732,6 +11181,7 @@ int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
 static int sched_rt_global_constraints(void)
 {
 	unsigned long flags;
+	u64 global_bw;
 	int i;
 
 	if (sysctl_sched_rt_period <= 0)
@@ -10744,6 +11194,10 @@ static int sched_rt_global_constraints(void)
 	if (sysctl_sched_rt_runtime == 0)
 		return -EBUSY;
 
+	global_bw = to_ratio(global_rt_period(), global_rt_runtime());
+	if (!__sched_rt_dl_global_constraints(global_bw))
+		return -EINVAL;
+
 	raw_spin_lock_irqsave(&def_rt_bandwidth.rt_runtime_lock, flags);
 	for_each_possible_cpu(i) {
 		struct rt_rq *rt_rq = &cpu_rq(i)->rt;
@@ -10758,6 +11212,359 @@ static int sched_rt_global_constraints(void)
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+/* Must be called with tasklist_lock held */
+static inline int tg_has_dl_tasks(struct task_group *tg)
+{
+	struct task_struct *g, *p;
+
+	do_each_thread(g, p) {
+		if (task_has_dl_policy(p) && task_group(p) == tg)
+			return 1;
+	} while_each_thread(g, p);
+
+	return 0;
+}
+
+/*
+ * If we were RUNTIME_INF and we want to constrain our own bandwidth,
+ * we must be sure that none of our children is RUNTIME_INF.
+ *
+ * Something different (i.e., letting children stay RUNTIME_INF even
+ * if we are constrained) could have been done if a different architecture
+ * were chosen for -deadline group scheduling (similar to -rt throttling).
+ */
+static int __check_children_bandwidth(struct task_group *tg, u64 dl_runtime)
+{
+	struct task_group *child;
+
+	/*
+	 * Either we were not RUNTIME_INF or we are going to
+	 * become RUNTIME_INF, so no more checking on our children
+	 * is needed.
+	 */
+	if (tg->dl_bandwidth.dl_runtime != RUNTIME_INF ||
+	    dl_runtime == RUNTIME_INF)
+		return 1;
+
+	list_for_each_entry_rcu(child, &tg->children, siblings) {
+		raw_spin_lock(&child->dl_bandwidth.dl_runtime_lock);
+		if (child->dl_bandwidth.dl_runtime == RUNTIME_INF) {
+			raw_spin_unlock(&child->dl_bandwidth.dl_runtime_lock);
+			return 0;
+		}
+		raw_spin_unlock(&child->dl_bandwidth.dl_runtime_lock);
+	}
+
+	return 1;
+}
+
+/*
+ * If we want to decrease our own bandwidth from old_tg_bw to
+ * new_tg_bw we must be sure that none of our runqueues has more
+ * allocated bandwidth than new_tg_bw.
+ *
+ * This is called holding _both_ tg's and tg's parent's bandwidth
+ * parameters locks (dl_bandwidth.dl_runtime_lock).
+ */
+static
+int __check_tg_bandwidth(struct task_group *tg, u64 new_tg_bw)
+{
+	int i;
+
+	if (new_tg_bw < tg->dl_bandwidth.dl_bw) {
+		for_each_possible_cpu(i) {
+			if (new_tg_bw < tg->dl_bandwidth.dl_total_bw[i])
+				return 0;
+		}
+	}
+
+	return 1;
+}
+
+/*
+ * Here we check if the new bandwidth parameters of the cgroup would
+ * still lead to a schedulable system.
+ *
+ * This is called holding _both_ tg's and tg's parent's bandwidth
+ * parameters locks (dl_bandwidth.dl_runtime_lock).
+ */
+static
+int __deadline_schedulable(struct task_group *tg,
+			   struct dl_bandwidth *parent_dl_b,
+			   u64 dl_runtime, u64 new_tg_bw)
+{
+	int i;
+
+	/*
+	 * RUNTIME_INF is allowed only if our parent is
+	 * RUNTIME_INF as well (see the comment to the
+	 * above function).
+	 */
+	if (parent_dl_b->dl_runtime == RUNTIME_INF)
+		return 0;
+
+	if (dl_runtime == RUNTIME_INF)
+		return -EINVAL;
+
+	if (new_tg_bw > parent_dl_b->dl_bw ||
+	    !__check_tg_bandwidth(tg, new_tg_bw) ||
+	    !__check_children_bandwidth(tg, dl_runtime))
+		return -EBUSY;
+
+	/*
+	 * The root group has no parent, but its assigned bandwidth has
+	 * to stay below the global bandwidth value given by
+	 * sysctl_sched_dl_runtime / sysctl_sched_dl_period.
+	 *
+	 * For other groups, what is required is that the sum of the bandwidths
+	 * of all the children of a group does not exceed the bandwidth of
+	 * such group.
+	 */
+	for_each_possible_cpu(i) {
+		if (parent_dl_b->dl_bw < parent_dl_b->dl_total_bw[i] -
+		    tg->dl_bandwidth.dl_bw + new_tg_bw)
+			return -EBUSY;
+	}
+
+	return 0;
+}
+
+/*
+ * This checks if the new parameters of the task group are consistent
+ * and, if yes, updates the allocated bandwidth in the higher level
+ * entity (which could be the parent cgroup or the system default).
+ */
+static int tg_set_dl_bandwidth(struct task_group *tg,
+			       u64 dl_period, u64 dl_runtime)
+{
+	struct dl_bandwidth *dl_b = &tg->dl_bandwidth,
+			*parent_dl_b = parent_dl_bandwidth(tg);
+	u64 new_tg_bw;
+	int i, err = 0;
+
+	if (dl_runtime != RUNTIME_INF && dl_runtime > dl_period)
+		return -EINVAL;
+
+	read_lock(&tasklist_lock);
+
+	if (!dl_runtime && tg_has_dl_tasks(tg)) {
+		err = -EBUSY;
+		goto runlock;
+	}
+
+	raw_spin_lock_irq(&parent_dl_b->dl_runtime_lock);
+	raw_spin_lock(&dl_b->dl_runtime_lock);
+
+	/*
+	 * Calculate the old and new bandwidth for the group and,
+	 * if different, check if the new value is consistent.
+	 */
+	new_tg_bw = to_ratio(dl_period, dl_runtime);
+	if (new_tg_bw != dl_b->dl_bw) {
+		err = __deadline_schedulable(tg, parent_dl_b,
+					     dl_runtime, new_tg_bw);
+		if (err)
+			goto unlock;
+	}
+
+	/*
+	 * If here, we can now update tg's bandwidth and tg's
+	 * parent allocated bandwidth value (on each CPU).
+	 */
+	for_each_possible_cpu(i)
+		parent_dl_b->dl_total_bw[i] += new_tg_bw - dl_b->dl_bw;
+	dl_b->dl_bw = new_tg_bw;
+	dl_b->dl_period = dl_period;
+	dl_b->dl_runtime = dl_runtime;
+
+unlock:
+	raw_spin_unlock(&dl_b->dl_runtime_lock);
+	raw_spin_unlock_irq(&parent_dl_b->dl_runtime_lock);
+runlock:
+	read_unlock(&tasklist_lock);
+
+	return err;
+}
+
+int sched_group_set_dl_runtime(struct task_group *tg, long dl_runtime_us)
+{
+	u64 dl_runtime, dl_period;
+
+	dl_period = tg->dl_bandwidth.dl_period;
+	dl_runtime = (u64)dl_runtime_us * NSEC_PER_USEC;
+	if (dl_runtime_us < 0)
+		dl_runtime = RUNTIME_INF;
+
+	return tg_set_dl_bandwidth(tg, dl_period, dl_runtime);
+}
+
+long sched_group_dl_runtime(struct task_group *tg)
+{
+	s64 dl_runtime;
+
+	raw_spin_lock_irq(&tg->dl_bandwidth.dl_runtime_lock);
+	dl_runtime = tg->dl_bandwidth.dl_runtime;
+	raw_spin_unlock_irq(&tg->dl_bandwidth.dl_runtime_lock);
+
+	if (dl_runtime == RUNTIME_INF)
+		return -1;
+
+	do_div(dl_runtime, NSEC_PER_USEC);
+	return dl_runtime;
+}
+
+int sched_group_set_dl_period(struct task_group *tg, long dl_period_us)
+{
+	u64 dl_runtime, dl_period;
+
+	dl_period = (u64)dl_period_us * NSEC_PER_USEC;
+	dl_runtime = tg->dl_bandwidth.dl_runtime;
+
+	if (dl_period == 0)
+		return -EINVAL;
+
+	return tg_set_dl_bandwidth(tg, dl_period, dl_runtime);
+}
+
+long sched_group_dl_period(struct task_group *tg)
+{
+	u64 dl_period_us;
+
+	raw_spin_lock_irq(&tg->dl_bandwidth.dl_runtime_lock);
+	dl_period_us = tg->dl_bandwidth.dl_period;
+	raw_spin_unlock_irq(&tg->dl_bandwidth.dl_runtime_lock);
+	do_div(dl_period_us, NSEC_PER_USEC);
+
+	return dl_period_us;
+}
+
+int sched_group_set_dl_total_bw(struct task_group *tg, int cpu)
+{
+	if (!cpu_online(cpu))
+		return -EINVAL;
+
+	raw_spin_lock_irq(&tg->dl_bandwidth.dl_runtime_lock);
+	tg->dl_bandwidth.dl_total_bw_cpu = cpu;
+	raw_spin_unlock_irq(&tg->dl_bandwidth.dl_runtime_lock);
+
+	return 0;
+}
+
+long sched_group_dl_total_bw(struct task_group *tg)
+{
+	int cpu = tg->dl_bandwidth.dl_total_bw_cpu;
+	u64 dl_total_bw;
+
+	raw_spin_lock_irq(&tg->dl_bandwidth.dl_runtime_lock);
+	dl_total_bw = tg->dl_bandwidth.dl_total_bw[cpu];
+	raw_spin_unlock_irq(&tg->dl_bandwidth.dl_runtime_lock);
+
+	return dl_total_bw;
+}
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
+
+static bool __sched_dl_global_constraints(void)
+{
+	u64 global_runtime = global_dl_runtime();
+	u64 global_period = global_dl_period();
+
+	return (global_period != 0 && (global_runtime == RUNTIME_INF ||
+		global_runtime <= global_period));
+}
+
+/*
+ * Coupling of -dl and -rt bandwidth.
+ *
+ * Here we check, while setting the system wide bandwidth available
+ * for -dl tasks and groups, if the new values are consistent with
+ * the system settings for the bandwidth available to -rt entities.
+ *
+ * IOW, we want to enforce that
+ *
+ *   rt_bandwidth + dl_bandwidth <= 100%
+ *
+ * is always true.
+ */
+static bool __sched_dl_rt_global_constraints(u64 global_dl_bw)
+{
+	u64 global_rt_bw;
+	bool ret;
+
+	raw_spin_lock(&def_rt_bandwidth.rt_runtime_lock);
+
+	global_rt_bw = to_ratio(ktime_to_ns(def_rt_bandwidth.rt_period),
+				def_rt_bandwidth.rt_runtime);
+
+	ret = global_rt_bw + global_dl_bw <=
+		to_ratio(RUNTIME_INF, RUNTIME_INF);
+
+	raw_spin_unlock(&def_rt_bandwidth.rt_runtime_lock);
+
+	return ret;
+}
+
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+static int sched_dl_global_constraints(void)
+{
+	int err = 0;
+	unsigned long flags;
+	struct dl_bandwidth *init_tg_dl_b = &init_task_group.dl_bandwidth;
+	u64 global_bw;
+
+	if (!__sched_dl_global_constraints())
+		return -EINVAL;
+
+	global_bw = to_ratio(global_dl_period(), global_dl_runtime());
+	if (!__sched_dl_rt_global_constraints(global_bw))
+		return -EINVAL;
+
+	/*
+	 * It is not allowed to set the global system bandwidth
+	 * below the current bandwidth of the root task group (nor
+	 * to constrain it if the root task group is RUNTIME_INF)
+	 */
+	raw_spin_lock_irqsave(&init_tg_dl_b->dl_runtime_lock, flags);
+
+	if (global_bw < init_tg_dl_b->dl_bw ||
+	    (global_dl_runtime() != RUNTIME_INF &&
+	     init_tg_dl_b->dl_runtime == RUNTIME_INF))
+		err = -EBUSY;
+
+	raw_spin_unlock_irqrestore(&init_tg_dl_b->dl_runtime_lock, flags);
+
+	return err;
+}
+#else /* !CONFIG_DEADLINE_GROUP_SCHED */
+static int sched_dl_global_constraints(void)
+{
+	int i;
+	u64 global_bw;
+
+	if (!__sched_dl_global_constraints())
+		return -EINVAL;
+
+	global_bw = to_ratio(global_dl_period(), global_dl_runtime());
+	if (!__sched_dl_rt_global_constraints(global_bw))
+		return -EINVAL;
+
+	/*
+	 * In the !DEADLINE_GROUP_SCHED case it is here that we enforce
+	 * the global system bandwidth not being set to a value smaller
+	 * than the currently allocated bandwidth on any runqueue.
+	 *
+	 * This is safe since we are called with the dl_runtime_lock
+	 * of def_sl_bandwidth held.
+	 * of def_dl_bandwidth held.
+	for_each_possible_cpu(i) {
+		if (global_bw < def_dl_bandwidth.dl_total_bw[i])
+			return -EBUSY;
+	}
+
+	return 0;
+}
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
+
 int sched_rt_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos)
@@ -10788,6 +11595,72 @@ int sched_rt_handler(struct ctl_table *table, int write,
 	return ret;
 }
 
+int sched_dl_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos)
+{
+	int ret;
+	int old_period, old_runtime;
+	static DEFINE_MUTEX(mutex);
+	unsigned long flags;
+
+	mutex_lock(&mutex);
+	old_period = sysctl_sched_dl_period;
+	old_runtime = sysctl_sched_dl_runtime;
+
+	ret = proc_dointvec(table, write, buffer, lenp, ppos);
+
+	if (!ret && write) {
+		raw_spin_lock_irqsave(&def_dl_bandwidth.dl_runtime_lock,
+				      flags);
+
+		ret = sched_dl_global_constraints();
+		if (ret) {
+			sysctl_sched_dl_period = old_period;
+			sysctl_sched_dl_runtime = old_runtime;
+		} else {
+			def_dl_bandwidth.dl_period = global_dl_period();
+			def_dl_bandwidth.dl_runtime = global_dl_runtime();
+			def_dl_bandwidth.dl_bw = to_ratio(global_dl_period(),
+							  global_dl_runtime());
+		}
+
+		raw_spin_unlock_irqrestore(&def_dl_bandwidth.dl_runtime_lock,
+					   flags);
+	}
+	mutex_unlock(&mutex);
+
+	return ret;
+}
+
+int sched_dl_total_bw_handler(struct ctl_table *table, int write,
+			void __user *buffer, size_t *lenp,
+			loff_t *ppos)
+{
+	int ret, old_cpu, cpu;
+	static DEFINE_MUTEX(mutex);
+
+	mutex_lock(&mutex);
+	old_cpu = cpu = def_dl_bandwidth.dl_total_bw_cpu;
+
+	if (!write)
+		sysctl_sched_dl_total_bw = def_dl_bandwidth.dl_total_bw[cpu];
+
+	ret = proc_dointvec(table, write, buffer, lenp, ppos);
+
+	if (!ret && write) {
+		cpu = sysctl_sched_dl_total_bw;
+		if (!cpu_online(cpu)) {
+			cpu = old_cpu;
+			ret = -EINVAL;
+		} else
+			def_dl_bandwidth.dl_total_bw_cpu = cpu;
+	}
+	mutex_unlock(&mutex);
+
+	return ret;
+}
+
 #ifdef CONFIG_CGROUP_SCHED
 
 /* return corresponding task_group object of a cgroup */
@@ -10826,9 +11699,15 @@ cpu_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
 static int
 cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
+#if defined CONFIG_DEADLINE_GROUP_SCHED || defined CONFIG_RT_GROUP_SCHED
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	if (!sched_dl_can_attach(cgrp, tsk))
+		return -EINVAL;
+#endif
 #ifdef CONFIG_RT_GROUP_SCHED
 	if (!sched_rt_can_attach(cgroup_tg(cgrp), tsk))
 		return -EINVAL;
+#endif
 #else
 	/* We don't support RT-tasks being in separate groups */
 	if (tsk->sched_class != &fair_sched_class)
@@ -10859,11 +11738,55 @@ cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	return 0;
 }
 
+/*
+ * The bandwidth of tsk is freed from its former task
+ * group, and has to be considered occupied in the
+ * new task group.
+ */
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+static
+void __cpu_cgroup_attach_dl(struct cgroup *cgrp, struct cgroup *old_cgrp,
+			    struct task_struct *tsk)
+{
+	unsigned long flags;
+	struct task_group *tg = container_of(cgroup_subsys_state(cgrp,
+					     cpu_cgroup_subsys_id),
+					     struct task_group, css);
+	struct task_group *old_tg = container_of(cgroup_subsys_state(old_cgrp,
+						 cpu_cgroup_subsys_id),
+						 struct task_group, css);
+	struct rq *rq = task_rq_lock(tsk, &flags);
+
+	raw_spin_lock_irq(&tg->dl_bandwidth.dl_runtime_lock);
+	raw_spin_lock(&old_tg->dl_bandwidth.dl_runtime_lock);
+
+	/*
+	 * Actually move the bandwidth the task occupies
+	 * from its old to its new cgroup.
+	 */
+	tg->dl_bandwidth.dl_total_bw[task_cpu(tsk)] += tsk->dl.dl_bw;
+	old_tg->dl_bandwidth.dl_total_bw[task_cpu(tsk)] -= tsk->dl.dl_bw;
+
+	raw_spin_unlock(&old_tg->dl_bandwidth.dl_runtime_lock);
+	raw_spin_unlock_irq(&tg->dl_bandwidth.dl_runtime_lock);
+
+	task_rq_unlock(rq, &flags);
+}
+#else
+static
+void __cpu_cgroup_attach_dl(struct cgroup *cgrp, struct cgroup *old_cgrp,
+			    struct task_struct *tsk)
+{
+}
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
+
 static void
 cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 		  struct cgroup *old_cont, struct task_struct *tsk,
 		  bool threadgroup)
 {
+	__cpu_cgroup_attach_dl(cgrp, old_cont, tsk);
+
 	sched_move_task(tsk);
 	if (threadgroup) {
 		struct task_struct *c;
@@ -10914,6 +11837,41 @@ static u64 cpu_rt_period_read_uint(struct cgroup *cgrp, struct cftype *cft)
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+static int cpu_dl_runtime_write(struct cgroup *cgrp, struct cftype *cftype,
+				s64 dl_runtime_us)
+{
+	return sched_group_set_dl_runtime(cgroup_tg(cgrp), dl_runtime_us);
+}
+
+static s64 cpu_dl_runtime_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	return sched_group_dl_runtime(cgroup_tg(cgrp));
+}
+
+static int cpu_dl_period_write(struct cgroup *cgrp, struct cftype *cftype,
+			       u64 dl_period_us)
+{
+	return sched_group_set_dl_period(cgroup_tg(cgrp), dl_period_us);
+}
+
+static u64 cpu_dl_period_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	return sched_group_dl_period(cgroup_tg(cgrp));
+}
+
+static int cpu_dl_total_bw_write(struct cgroup *cgrp, struct cftype *cftype,
+				 u64 cpu)
+{
+	return sched_group_set_dl_total_bw(cgroup_tg(cgrp), cpu);
+}
+
+static u64 cpu_dl_total_bw_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	return sched_group_dl_total_bw(cgroup_tg(cgrp));
+}
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
+
 static struct cftype cpu_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -10934,6 +11892,23 @@ static struct cftype cpu_files[] = {
 		.write_u64 = cpu_rt_period_write_uint,
 	},
 #endif
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	{
+		.name = "dl_runtime_us",
+		.read_s64 = cpu_dl_runtime_read,
+		.write_s64 = cpu_dl_runtime_write,
+	},
+	{
+		.name = "dl_period_us",
+		.read_u64 = cpu_dl_period_read,
+		.write_u64 = cpu_dl_period_write,
+	},
+	{
+		.name = "dl_total_bw",
+		.read_u64 = cpu_dl_total_bw_read,
+		.write_u64 = cpu_dl_total_bw_write,
+	},
+#endif
 };
 
 static int cpu_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cont)
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 407a761..84d8d40 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -146,7 +146,8 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
 }
 
 #if defined(CONFIG_CGROUP_SCHED) && \
-	(defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED))
+	(defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED) || \
+	 defined(CONFIG_DEADLINE_GROUP_SCHED))
 static void task_group_path(struct task_group *tg, char *buf, int buflen)
 {
 	/* may be NULL if the underlying cgroup isn't fully-created yet */
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index 3613cbd..3e466cb 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -9,6 +9,10 @@
  * than their reserved bandwidth will be slowed down (and may potentially
  * miss some of their deadlines), and won't affect any other task.
  *
+ * Group scheduling, if configured, is utilized for admission control
+ * purposes, i.e., the sum of the bandwidths of tasks and groups belonging
+ * to group A must stay below A's own bandwidth.
+ *
  * Copyright (C) 2010 Dario Faggioli <raistlin@linux.it>,
  *                    Michael Trimarchi <trimarchimichael@yahoo.it>,
  *                    Fabio Checconi <fabio@gandalf.sssup.it>
@@ -658,8 +662,18 @@ static void task_fork_dl(struct task_struct *p)
 
 static void task_dead_dl(struct task_struct *p)
 {
+	struct dl_bandwidth *dl_b = task_dl_bandwidth(p);
+
+	/*
+	 * Since the task is TASK_DEAD we hope
+	 * it will not migrate or change group!
+	 */
+	raw_spin_lock_irq(&dl_b->dl_runtime_lock);
+	dl_b->dl_total_bw[task_cpu(p)] -= p->dl.dl_bw;
+	raw_spin_unlock_irq(&dl_b->dl_runtime_lock);
+
 	/*
-	 * We are not holding any lock here, so it is safe to
+	 * We are no longer holding any lock here, so it is safe to
 	 * wait for the bandwidth timer to be removed.
 	 */
 	hrtimer_cancel(&p->dl.dl_timer);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8a68b24..2f3cdba 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -358,6 +358,27 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= sched_rt_handler,
 	},
 	{
+		.procname	= "sched_dl_period_us",
+		.data		= &sysctl_sched_dl_period,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= &sched_dl_handler,
+	},
+	{
+		.procname	= "sched_dl_runtime_us",
+		.data		= &sysctl_sched_dl_runtime,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &sched_dl_handler,
+	},
+	{
+		.procname	= "sched_dl_total_bw",
+		.data		= &sysctl_sched_dl_total_bw,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &sched_dl_total_bw_handler,
+	},
+	{
 		.procname	= "sched_compat_yield",
 		.data		= &sysctl_sched_compat_yield,
 		.maxlen		= sizeof(unsigned int),
-- 
1.7.0

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC][PATCH 11/11] sched: add sched_dl documentation.
  2010-02-28 19:06 [RFC][PATCH 0/11] sched: SCHED_DEADLINE v2 Raistlin
                   ` (9 preceding siblings ...)
  2010-02-28 19:27 ` [RFC][PATCH 10/11] sched: add bandwidth management for sched_dl Raistlin
@ 2010-02-28 19:28 ` Raistlin
  2010-04-14 10:17 ` [RFC][PATCH 0/11] sched: SCHED_DEADLINE v2 Peter Zijlstra
  11 siblings, 0 replies; 25+ messages in thread
From: Raistlin @ 2010-02-28 19:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	Frederic Weisbecker, Darren Hart, Henrik Austad, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni

[-- Attachment #1: Type: text/plain, Size: 9012 bytes --]

Add in Documentation/scheduler/ some hints about the design
choices, the usage and the future possible developments of the
sched_dl scheduling class and of the SCHED_DEADLINE policy.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 Documentation/scheduler/sched-deadline.txt |  188 ++++++++++++++++++++++++++++
 init/Kconfig                               |    1 +
 2 files changed, 189 insertions(+), 0 deletions(-)

diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt
new file mode 100644
index 0000000..1ff0e1e
--- /dev/null
+++ b/Documentation/scheduler/sched-deadline.txt
@@ -0,0 +1,188 @@
+			Deadline Task and Group Scheduling
+			----------------------------------
+
+CONTENTS
+========
+
+0. WARNING
+1. Overview
+  1.1 Task scheduling
+  1.2 Group scheduling
+2. The interface
+  2.1 System wide settings
+  2.2 Task interface
+  2.3 Group interface
+  2.4 Default behavior
+3. Future plans
+
+
+0. WARNING
+==========
+
+ Fiddling with these settings can result in unpredictable or even unstable
+ system behavior. As for -rt (group) scheduling, it is assumed that root
+ users know what they are doing.
+
+
+1. Overview
+===========
+
+ The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
+ basically an implementation of the Earliest Deadline First (EDF) scheduling
+ algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
+ that makes it possible to isolate the behaviour of tasks from one another.
+
+
+1.1 Task scheduling
+-------------------
+
+ The typical -deadline task will be made up of a computation phase (instance)
+ which is activated in a periodic or sporadic fashion. The expected (maximum)
+ duration of such computation is called the task's runtime; the time interval
+ by which each instance needs to be completed is called the task's relative
+ deadline. The task's absolute deadline is dynamically calculated as the
+ time instant a task (better, an instance) activates plus the relative
+ deadline.
+
+ The EDF algorithm selects the task with the smallest absolute deadline as
+ the one to be executed first, while the CBS ensures that each task runs for
+ at most its runtime every (relative) deadline length time interval,
+ avoiding any interference between different tasks (bandwidth isolation).
+ Thanks to this feature, even tasks that do not strictly comply with the
+ computational model sketched above can effectively use the new policy.
+ IOW, there are no limitations on what kind of task can exploit this new
+ scheduling discipline, even if it must be said that it is particularly
+ suited for periodic or sporadic tasks that need guarantees on their
+ timing behaviour, e.g., multimedia, streaming, control applications, etc.
+
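+ As a purely illustrative example, a video decoding task that needs at most
+ 25ms of CPU time per frame, one frame every 100ms, can be given a runtime
+ of 25ms and a (relative) deadline of 100ms. EDF+CBS then both guarantees it
+ those 25ms every 100ms and prevents it from consuming more, no matter how
+ long decoding a single frame actually takes.
+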
+
+1.2 Group scheduling
+--------------------
+
+ In order for -deadline scheduling to be effective and useful, it is important
+ to have some method of keeping the allocation of the available CPU bandwidth
+ to tasks and task groups under control.
+ This is usually called "admission control" and if it is not performed at all,
+ no guarantee can be given on the actual scheduling of the -deadline tasks.
+
+ Since RT-throttling was introduced, each task group has had a bandwidth
+ associated with it, calculated as a certain amount of runtime over a
+ period. Moreover, to make it possible to manipulate such bandwidth,
+ readable/writable controls have been added to both procfs (for system
+ wide settings) and cgroupfs (for per-group settings).
+ Therefore, the same interface is being used for controlling the bandwidth
+ distribution to -deadline tasks and task groups, i.e., new controls with
+ similar names, equivalent meaning and the same usage paradigm are
+ added.
+
+ The main difference between deadline bandwidth management and RT-throttling
+ is that -deadline tasks have bandwidth of their own (while -rt ones don't!),
+ and thus we don't need a throttling mechanism inside the groups, which are
+ used for nothing more than admission control of tasks.
+
+ This means that what we check is that the sum of the bandwidths of all the
+ tasks belonging to a group stays, on each CPU, below the bandwidth of the
+ group itself.
+
+
+2. The Interface
+================
+
+2.1 System wide settings
+------------------------
+
+ The system wide settings are configured under the /proc virtual file system.
+
+ The controls added there are:
+  * /proc/sys/kernel/sched_dl_runtime_us,
+  * /proc/sys/kernel/sched_dl_period_us,
+  * /proc/sys/kernel/sched_dl_total_bw.
+
+ The first two accept (if written) and provide (if read) the new runtime and
+ period, respectively, for each CPU.
+ The last one accepts (if written) the index of one online CPU, and it provides
+ (if read) the total amount of bandwidth currently allocated on that CPU.
+
+ Settings are checked against the following limit:
+
+  * for the whole system, on each CPU:
+      rt_runtime / rt_period + dl_runtime / dl_period <= 100%
+
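+ For example, assuming the usual -rt defaults (rt_runtime = 950000us,
+ rt_period = 1000000us, i.e., 95%) and the -dl defaults described in
+ section 2.4 (5%), the sum is exactly 100%, so neither limit can be raised
+ without lowering the other one first.
+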
+
+2.2 Task interface
+------------------
+
+ Specifying a periodic/sporadic task that executes for a given amount of
+ runtime at each instance, and that is scheduled according to the urgency of
+ its own timing constraints needs, in general, a way of declaring:
+  - a (maximum/typical) instance execution time,
+  - a minimum interval between consecutive instances,
+  - a time constraint by which each instance must be completed.
+
+ Therefore:
+  * a new struct sched_param_ex, containing all the necessary fields is
+    provided;
+  * the new scheduling related syscalls that manipulate it, i.e.,
+    sched_setscheduler_ex(), sched_setparam_ex() and sched_getparam_ex()
+    are implemented (a minimal usage sketch follows).
+
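+ A minimal, purely illustrative sketch of the task side follows. The header
+ providing struct sched_param_ex and the SCHED_DEADLINE policy value, the
+ standard userspace headers (e.g., <string.h>), and the exact argument list
+ of sched_setscheduler_ex() are defined by the earlier patches of this
+ series and are only assumed here:
+
+	struct sched_param_ex param_ex;
+
+	memset(&param_ex, 0, sizeof(param_ex));
+	/* 25ms of runtime every 100ms of relative deadline, i.e., 25% */
+	param_ex.sched_runtime.tv_sec = 0;
+	param_ex.sched_runtime.tv_nsec = 25 * 1000 * 1000;
+	param_ex.sched_deadline.tv_sec = 0;
+	param_ex.sched_deadline.tv_nsec = 100 * 1000 * 1000;
+	param_ex.sched_flags = 0;
+
+	/*
+	 * The structure is then passed to sched_setscheduler_ex() (or
+	 * sched_setparam_ex()) for the calling task or for another pid.
+	 */
+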
+
+2.3 Group Interface
+-------------------
+
+ The per-group controls that are added to the cgroupfs virtual file system are:
+  * /cgroup/<cgroup>/cpu.dl_runtime_us,
+  * /cgroup/<cgroup>/cpu.dl_period_us,
+  * /cgroup/<cgroup>/cpu.dl_total_bw.
+
+ The first two accept (if written) and provide (if read) the new runtime and
+ period, respectively, of the group for each CPU.
+ The last one accepts (if written) the index of one online CPU, and it provides
+ (if read) the total amount of bandwidth currently allocated inside the group
+ on that CPU.
+
+ Group settings are checked against the following limits:
+
+  * for the root group {r}, on each CPU:
+      dl_runtime_{r} / dl_period_{r} <= dl_runtime / dl_period
+
+  * for each group {i}, subgroup of group {j}, on each CPU:
+      dl_runtime_{i} / dl_period_{i} < 100%
+      \Sum_{i} dl_runtime_{i} / dl_period_{i} <= dl_runtime_{j} / dl_period_{j}
+
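+ As an illustration, if a group has dl_runtime_us = 50000 and
+ dl_period_us = 1000000 (5%), two of its subgroups can be given, say, 3%
+ and 2%, but not 3% and 3%, since the sum of the children's bandwidths
+ would then exceed the parent's 5% and the write would fail with -EBUSY.
+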
+ For more information on working with control groups,
+ Documentation/cgroups/cgroups.txt should be read.
+
+
+2.4 Default behavior
+---------------------
+
+The default values for system wide and root group dl_runtime and dl_period are
+50000 and 1000000, respectively. This means -deadline tasks and task groups
+can use at most 5% of the bandwidth on each CPU.
+
+When a -deadline task forks a child, its dl_runtime is set to 0, which means
+someone must call sched_setscheduler_ex() on it, or it won't even start.
+
+When a new group is created, its dl_runtime is 0, which means someone must
+(try to) increase it before tasks can be added to the group.
+
+
+3. Future plans
+===============
+
+Still Missing parts:
+
+ - bandwidth reclaiming mechanisms, i.e., methods that avoid stopping the
+   tasks until their next deadline when overrunning. There are at least
+   three of them that are very simple, and patches are on their way;
+
+ - migration of tasks throughout push and pull (as in -rt) to make it
+   possible to deploy global-EDF scheduling. Patches are ready, they're
+   just being tested and adapted to this last version;
+
+ - refinements in deadline inheritance, especially regarding the possibility
+   of retaining bandwidth isolation among non-interacting tasks. This is
+   being studied from both theoretical and practical points of view, and
+   hopefully we can have some demonstrative code soon.
+
diff --git a/init/Kconfig b/init/Kconfig
index de57415..377caed 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -486,6 +486,7 @@ config DEADLINE_GROUP_SCHED
 	  tasks (and other groups) can be added to it only up to such
 	  "bandwidth cap", which might be useful for avoiding or
 	  controlling oversubscription.
+	  See Documentation/scheduler/sched-deadline.txt for more.
 
 choice
 	depends on GROUP_SCHED
-- 
1.7.0

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 02/11] sched: SCHED_DEADLINE policy implementation.
  2010-02-28 19:17 ` [RFC][PATCH 02/11] sched: SCHED_DEADLINE policy implementation Raistlin
@ 2010-04-13 18:22   ` Peter Zijlstra
  2010-04-13 18:22   ` Peter Zijlstra
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2010-04-13 18:22 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	Frederic Weisbecker, Darren Hart, Henrik Austad, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni

On Sun, 2010-02-28 at 20:17 +0100, Raistlin wrote:
> +/*
> + * Pure Earliest Deadline First (EDF) scheduling does not deal with the
> + * possibility of a task lasting more than what it declared, and thus
> + * exhausting its runtime.
> + *
> + * Here we are interested in making runtime overrun possible, but we do
> + * not want a task which is misbehaving to affect the scheduling of all
> + * other tasks.
> + * Therefore, a budgeting strategy called Constant Bandwidth Server (CBS)
> + * is used, in order to confine each task within its own bandwidth.
> + *
> + * This function deals exactly with that, and ensures that when the runtime
> + * of a task is replenished, its deadline is also postponed. That results
> + * in "priority unboosting" for the overrunning task, and makes it impossible
> + * for it to cause unexpected interfere to other tasks in the system.
> + */ 

This is not about lock/inheritance related overrun, right? But simply a
task that got its WCET wrong.



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 02/11] sched: SCHED_DEADLINE policy implementation.
  2010-02-28 19:17 ` [RFC][PATCH 02/11] sched: SCHED_DEADLINE policy implementation Raistlin
  2010-04-13 18:22   ` Peter Zijlstra
@ 2010-04-13 18:22   ` Peter Zijlstra
  2010-04-13 18:22   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2010-04-13 18:22 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	Frederic Weisbecker, Darren Hart, Henrik Austad, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni

On Sun, 2010-02-28 at 20:17 +0100, Raistlin wrote:
> +       /*
> +        * Arm the timer to fire at wakeup, tying to compensate for
> +        * the fact that wakeup is actually coming from rq->clock and
> +        * not from hrtimer's time base reading.
> +        */
> +       act = ns_to_ktime(wakeup);
> +       now = hrtimer_cb_get_time(&dl_se->dl_timer);
> +       delta = ktime_to_ns(now) - rq->clock;
> +       act = ktime_add_ns(act, delta); 

This is sad, but I'm afraid a reality we have to live with..

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 02/11] sched: SCHED_DEADLINE policy implementation.
  2010-02-28 19:17 ` [RFC][PATCH 02/11] sched: SCHED_DEADLINE policy implementation Raistlin
                     ` (2 preceding siblings ...)
  2010-04-13 18:22   ` Peter Zijlstra
@ 2010-04-13 18:22   ` Peter Zijlstra
  2010-04-13 18:22   ` Peter Zijlstra
  4 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2010-04-13 18:22 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	Frederic Weisbecker, Darren Hart, Henrik Austad, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni

On Sun, 2010-02-28 at 20:17 +0100, Raistlin wrote:
> +/*
> + * When a -deadline task is queued back on the runqueue, its runtime and
> + * deadline might need updating.
> + *
> + * The policy here is that we update the deadline of the task only if:
> + *  - the current deadline is in the past,
> + *  - using the remaining remaining with the current deadline would make

"remaining runtime", I presume?

> + *    the task exceed its bandwidth.
> + */
> +static void update_dl_entity(struct sched_dl_entity *dl_se)
> +{
> +       struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> +       struct rq *rq = rq_of_dl_rq(dl_rq);
> +
> +       /*
> +        * The arrival of a new task (or of a new task instance) needs
> +        * special treatment. The actual scheduling parameters have to be
> +        * "renewed" instead of recalculatetd accordingly to the bandwidth
> +        * enforcement rule.
> +        */
> +       if (dl_se->flags & DL_NEW) {
> +               setup_new_dl_entity(dl_se);
> +               return;
> +       }
> +
> +       if (dl_time_before(dl_se->deadline, rq->clock))
> +               goto update;
> +
> +       if (!dl_check_bandwidth(dl_se, rq->clock)) {
> +update:
> +               dl_se->deadline = rq->clock + dl_se->dl_deadline;
> +               dl_se->runtime = dl_se->dl_runtime;
> +       }
> +} 

We could write that as:

  if (dl_time_before(dl_se, rq->clock) ||
      !dl_check_bandwidth(dl_se, rq->clock)) {
	/* reset parameters */
  }

Also, I was wondering about a more descriptive name for
dl_check_bandwidth(), check _what_ about the bandwidth!?

dl_bandwidth_overflow() perhaps, that would also remove that negation.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 02/11] sched: SCHED_DEADLINE policy implementation.
  2010-02-28 19:17 ` [RFC][PATCH 02/11] sched: SCHED_DEADLINE policy implementation Raistlin
  2010-04-13 18:22   ` Peter Zijlstra
  2010-04-13 18:22   ` Peter Zijlstra
@ 2010-04-13 18:22   ` Peter Zijlstra
  2010-04-13 18:55     ` Steven Rostedt
  2010-04-13 18:22   ` Peter Zijlstra
  2010-04-13 18:22   ` Peter Zijlstra
  4 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2010-04-13 18:22 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	Frederic Weisbecker, Darren Hart, Henrik Austad, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni

On Sun, 2010-02-28 at 20:17 +0100, Raistlin wrote:
> +/*
> + * Here we check if --at time t-- a task (which is probably being
> + * [re]activated or, in general, enqueued) can use its remaining runtime
> + * and its current deadline _without_ exceeding the bandwidth it is
> + * assigned (function returns true if it can).
> + *
> + * For this to hold, we must check if:
> + *   runtime / (deadline - t) < dl_runtime / dl_deadline .
> + */
> +static bool dl_check_bandwidth(struct sched_dl_entity *dl_se, u64 t)
> +{
> +       u64 left, right;
> +
> +       /*
> +        * left and right are the two sides of the equation above,
> +        * after a bit of shuffling to use multiplications instead
> +        * of divisions.
> +        */
> +       left = dl_se->dl_deadline * dl_se->runtime;
> +       right = (dl_se->deadline - t) * dl_se->dl_runtime;
> +
> +       return dl_time_before(left, right);
> +} 

So what happens when we overflow u64?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 02/11] sched: SCHED_DEADLINE policy implementation.
  2010-02-28 19:17 ` [RFC][PATCH 02/11] sched: SCHED_DEADLINE policy implementation Raistlin
                     ` (3 preceding siblings ...)
  2010-04-13 18:22   ` Peter Zijlstra
@ 2010-04-13 18:22   ` Peter Zijlstra
  4 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2010-04-13 18:22 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	Frederic Weisbecker, Darren Hart, Henrik Austad, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni

On Sun, 2010-02-28 at 20:17 +0100, Raistlin wrote:
> +++ b/kernel/fork.c
> @@ -937,6 +937,16 @@ SYSCALL_DEFINE1(set_tid_address, int __user *, tidptr)
>         return task_pid_vnr(current);
>  }
>  
> +static void init_task_dl_entity(struct task_struct *p)
> +{
> +       RB_CLEAR_NODE(&p->dl.rb_node);
> +       hrtimer_init(&p->dl.dl_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +
> +       p->dl.dl_runtime = p->dl.runtime = 0;
> +       p->dl.dl_deadline = p->dl.deadline = 0;
> +       p->dl.flags = 0;
> +}
> +
>  static void rt_mutex_init_task(struct task_struct *p)
>  {
>         raw_spin_lock_init(&p->pi_lock);
> @@ -1025,6 +1035,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  
>         ftrace_graph_init_task(p);
>  
> +       init_task_dl_entity(p);
> +
>         rt_mutex_init_task(p);
>  

Why doesn't that live in/use sched_fork()?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 08/11] sched: send SIGXCPU at -deadline task overruns.
  2010-02-28 19:24 ` [RFC][PATCH 08/11] sched: send SIGXCPU at -deadline task overruns Raistlin
@ 2010-04-13 18:22   ` Peter Zijlstra
  2010-04-13 19:32     ` Oleg Nesterov
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2010-04-13 18:22 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	Frederic Weisbecker, Darren Hart, Henrik Austad, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, oleg

On Sun, 2010-02-28 at 20:24 +0100, Raistlin wrote:
> Add to the scheduler the capability of notifying when -deadline tasks
> overrun their maximum runtime and/or miss their scheduling
> deadline.
> 
> Runtime overruns might be quite common, e.g., due to the coarse
> resolution of execution-time accounting or to a wrong assignment of a
> task's parameters (especially the runtime). However, since the scheduler
> enforces bandwidth isolation among tasks, this is not at all a threat to
> other tasks' schedulability. For this reason, it is not common for a
> task to want to be notified about it. Moreover, if the SCHED_DEADLINE
> policy is used with sporadic tasks, or to limit the bandwidth of tasks
> that are neither periodic nor sporadic, runtime overruns are very likely
> to occur at each and every instance, and again they should not be
> considered a problem.
> 
> On the other hand, a deadline miss in any task means that, even though
> we are trying our best to keep each task isolated and to avoid
> reciprocal interference among them, something went very, very wrong,
> and one task did not manage to consume its runtime by its deadline.
> This is something that should happen only on an oversubscribed
> system, and thus being notified when it occurs could be very useful.
> 
> The user can specify which signals should be sent to the task, at
> sched_setscheduler_ex() time, by raising two specific flags in the
> sched_flags field of struct sched_param_ex:
>  * SCHED_SIG_RORUN (to be signaled on runtime overruns),
>  * SCHED_SIG_DMISS (to be signaled on deadline misses).
> 
> This patch:
>  - adds the logic needed to send SIGXCPU signal to a -deadline task
>    in case its actual runtime becomes negative;
>  - adds the logic needed to send SIGXCPU signal to a -deadline task
>    in case it is still being scheduled while its absolute deadline
>    passes.
> 
> Signed-off-by: Dario Faggioli <raistlin@linux.it>
> ---
>  include/linux/sched.h |   18 ++++++++++++++++++
>  kernel/sched_dl.c     |   24 ++++++++++++++++++++++++
>  2 files changed, 42 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index e6c1cda..64a7df2 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -138,6 +138,7 @@ struct sched_param {
>   * More information about the algorithm are available in the scheduling
>   * class file or in Documentation/.
>   */
> +
>  struct sched_param_ex {
>  	int sched_priority;
>  	struct timespec sched_runtime;
> @@ -146,6 +147,23 @@ struct sched_param_ex {
>  	unsigned int sched_flags;
>  };
>  
> +/*
> + * User -deadline flags.
> + *
> + * These flags are exported to userspace so that tasks can try to affect
> + * the scheduler behaviour and/or specify that they want to be informed
> + * of the occurrence of some events. There are at most 16 of them available
> + * (the lowest bits), since values from 0x10000 upwards are reserved for
> + * kernel-internal flags.
> + *
> + *  @SCHED_SIG_RORUN	tells us the task wants to be notified whenever
> + *                      a runtime overrun occurs;
> + *  @SCHED_SIG_DMISS	tells us the task wants to be notified whenever
> + *                      a scheduling deadline is missed.
> + */
> +#define SCHED_SIG_RORUN		0x0001
> +#define SCHED_SIG_DMISS		0x0002
> +
>  struct exec_domain;
>  struct futex_pi_state;
>  struct robust_list_head;
> diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
> index dee2668..b5dde44 100644
> --- a/kernel/sched_dl.c
> +++ b/kernel/sched_dl.c
> @@ -265,6 +265,14 @@ static void init_dl_timer(struct sched_dl_entity *dl_se)
>  	timer->function = dl_timer;
>  }
>  
> +#define dl_se_signal(se, s, msg)					\
> + do {									\
> +  struct task_struct *t = dl_task_of(se);				\
> +  sigaddset(&t->pending.signal, s);					\
> +  set_tsk_thread_flag(t, TIF_SIGPENDING);				\
> +  printk(KERN_INFO msg " in %d (%s)\n", task_pid_nr(t), t->comm);	\
> + } while (0)
> +

Can't this be a regular vararg function?
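
For illustration, a sketch of the same logic as a plain function (not
from the patchset; no varargs are actually needed since only the message
prefix varies, and the locking issues raised later in the thread still
apply):

static void dl_se_signal(struct sched_dl_entity *dl_se, int sig, const char *msg)
{
	struct task_struct *t = dl_task_of(dl_se);

	sigaddset(&t->pending.signal, sig);
	set_tsk_thread_flag(t, TIF_SIGPENDING);
	printk(KERN_INFO "%s in %d (%s)\n", msg, task_pid_nr(t), t->comm);
}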

Also, I wonder: would we want to put some information in the siginfo
struct to allow the user to distinguish between the various SIGXCPU
signal users?
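
As a userspace-side sketch of that idea (hypothetical: it assumes the
kernel were extended to fill in a distinguishing si_code for -deadline
notifications, which the quoted patch does not do):

#include <signal.h>
#include <string.h>
#include <unistd.h>

static void xcpu_handler(int sig, siginfo_t *info, void *ctx)
{
	/*
	 * With such an extension, info->si_code (or si_value) would tell
	 * RLIMIT_CPU-style SIGXCPU apart from -deadline notifications.
	 * Only async-signal-safe calls are used here.
	 */
	static const char msg[] = "SIGXCPU caught\n";

	(void)sig; (void)info; (void)ctx;
	write(STDERR_FILENO, msg, sizeof(msg) - 1);
}

int main(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = xcpu_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGXCPU, &sa, NULL);

	/* ... switch to SCHED_DEADLINE with SCHED_SIG_DMISS set, then run ... */
	for (;;)
		pause();
}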

>  static
>  int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
>  {
> @@ -283,6 +291,22 @@ int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
>  	if (dmiss)
>  		dl_se->runtime -= rq->clock - dl_se->deadline;
>  
> +	/*
> +	 * If userspace asked for it, we notify about (scheduling)
> +	 * deadline misses and runtime overruns by sending SIGXCPU to
> +	 * the "faulting" task.
> +	 *
> +	 * Note that (hopefully small) runtime overruns are very likely
> +	 * to occur, mainly due to accounting resolution, while missing a
> +	 * scheduling deadline should be very rare, and only happen on
> +	 * an oversubscribed system.
> +	 */
> +	if (unlikely(dmiss && dl_se->flags & SCHED_SIG_DMISS))
> +		dl_se_signal(dl_se, SIGXCPU, "Deadline miss");
> +
> +	if (unlikely(rorun && dl_se->flags & SCHED_SIG_RORUN))
> +		dl_se_signal(dl_se, SIGXCPU, "Runtime overrun");
> +
>  	dequeue_dl_entity(dl_se);
>  	if (!start_dl_timer(dl_se, dl_se->deadline)) {
>  		if (!rorun)
> -- 
> 1.7.0
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 02/11] sched: SCHED_DEADLINE policy implementation.
  2010-04-13 18:22   ` Peter Zijlstra
@ 2010-04-13 18:55     ` Steven Rostedt
  2010-04-15  7:34       ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Steven Rostedt @ 2010-04-13 18:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Chris Friesen,
	Frederic Weisbecker, Darren Hart, Henrik Austad, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni

On Tue, 2010-04-13 at 20:22 +0200, Peter Zijlstra wrote:
> On Sun, 2010-02-28 at 20:17 +0100, Raistlin wrote:
> > +/*
> > + * Here we check if --at time t-- a task (which is probably being
> > + * [re]activated or, in general, enqueued) can use its remaining runtime
> > + * and its current deadline _without_ exceeding the bandwidth it is
> > + * assigned (function returns true if it can).
> > + *
> > + * For this to hold, we must check if:
> > + *   runtime / (deadline - t) < dl_runtime / dl_deadline .
> > + */
> > +static bool dl_check_bandwidth(struct sched_dl_entity *dl_se, u64 t)
> > +{
> > +       u64 left, right;
> > +
> > +       /*
> > +        * left and right are the two sides of the equation above,
> > +        * after a bit of shuffling to use multiplications instead
> > +        * of divisions.
> > +        */
> > +       left = dl_se->dl_deadline * dl_se->runtime;
> > +       right = (dl_se->deadline - t) * dl_se->dl_runtime;
> > +
> > +       return dl_time_before(left, right);
> > +} 
> 
> So what happens when we overflow u64?

Is the resolution in nanosecs starting from zero? If so, then we don't
need to worry about overflow for 583 years? And that is only if the
difference in time is greater than 292 years since dl_time_before() does
a:

  (s64)(a - b) < 0

The (s64)(a - b) returns the difference even on overflow as long as the
difference is not greater than 2^63
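
A standalone illustration of that wraparound property (sketch only, with
u64/s64 spelled as stdint types):

#include <stdint.h>
#include <stdio.h>

static int dl_time_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

int main(void)
{
	uint64_t before_wrap = UINT64_MAX - 5;	/* just before the clock wraps */
	uint64_t after_wrap  = 10;		/* 16 ns later, after the wrap */

	/* still ordered correctly despite the u64 wraparound: prints 1 */
	printf("%d\n", dl_time_before(before_wrap, after_wrap));
	return 0;
}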


-- Steve



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 08/11] sched: send SIGXCPU at -deadline task overruns.
  2010-04-13 18:22   ` Peter Zijlstra
@ 2010-04-13 19:32     ` Oleg Nesterov
  0 siblings, 0 replies; 25+ messages in thread
From: Oleg Nesterov @ 2010-04-13 19:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, Frederic Weisbecker, Darren Hart, Henrik Austad,
	Johan Eker, p.faure, linux-kernel, Claudio Scordino,
	michael trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Nicola Manica, Luca Abeni

On 04/13, Peter Zijlstra wrote:
>
> On Sun, 2010-02-28 at 20:24 +0100, Raistlin wrote:
> >
> > +#define dl_se_signal(se, s, msg)					\
> > + do {									\
> > +  struct task_struct *t = dl_task_of(se);				\
> > +  sigaddset(&t->pending.signal, s);					\
> > +  set_tsk_thread_flag(t, TIF_SIGPENDING);				\
> > +  printk(KERN_INFO msg " in %d (%s)\n", task_pid_nr(t), t->comm);	\
> > + } while (0)
> > +

Without ->siglock?

This is racy even if dl_task_of(se) == current, but I guess it can
be !current. For example, we must never set TIF_SIGPENDING without
wake_up_state(). A fatal signal should kill the whole process, etc.
Even sigaddset() itself can race with tkill(); it is not atomic.
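
A sketch of the alternative, going through the regular signal-delivery
path, which takes ->siglock and handles the wakeup itself; whether
send_sig() is safe to call from the scheduler hot path with rq->lock
held is a separate question that would need auditing:

static void dl_se_signal(struct sched_dl_entity *dl_se, int sig, const char *msg)
{
	struct task_struct *t = dl_task_of(dl_se);

	printk(KERN_INFO "%s in %d (%s)\n", msg, task_pid_nr(t), t->comm);
	send_sig(sig, t, 1);	/* 1: treat as a kernel-generated signal */
}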

Oleg.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 09/11] sched: first draft of deadline inheritance.
  2010-02-28 19:26 ` [RFC][PATCH 09/11] sched: first draft of deadline inheritance Raistlin
@ 2010-04-14  8:25   ` Peter Zijlstra
  2010-04-14  9:45     ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2010-04-14  8:25 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	Frederic Weisbecker, Darren Hart, Henrik Austad, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni

On Sun, 2010-02-28 at 20:26 +0100, Raistlin wrote:
> Therefore, as of now, this patch:
>  - implements priority inheritance for -deadline tasks, according to
>    what is described above;
>  - to make this possible without rewriting outstanding chunks of
>    code of both -deadline scheduling and existing PI, the task's
>    relative deadline is stored in the prio field of the task_struct.
>    This is done in such a way that:
>     * prio is always < 0 for -deadline tasks,
>     * p1->prio < p2->prio still means p1 has higher priority than
>       p2, i.e., in our case, p1 has smaller relative deadline.
>  - the point above means that, since prio is of int type, a relative
>    deadline has to be smaller than INT_MAX. This is about 2 sec,
>    which is something (we think! :-)) we can afford, at least
>    for now.

Right, except that this makes the plist stuff O(INT_MAX) [ or rather
O(nr_dl_tasks) ].

But I guess it serves for testing.

I think it would be relatively straightforward to modify the existing
PI chain code to use an RB-tree instead of the plist stuff.

An RB-tree would also make the whole ->prio mess much easier to solve:
we could simply use a more complex comparison function, like:

int rt_mutex_prio_less(struct task_struct *left, struct task_struct *right)
{
	if (left->prio < right->prio)
		return 1;

	if (left->prio == right->prio && left->prio == -1) {
		if (left->deadline < right->deadline)
			return 1;
	}

	return 0;
}

Which uses the static prio -1 to represent all deadline tasks.

>  - disables bandwidth throttling for tasks while they are deadline
>    boosted. It also tries to make them pay back for runtime overruns
>    and deadline misses in this phase, but it's only "local", in the
>    sense that instances farther than the one right next to the
>    overrun are not going to be directly affected.
> 
Yeah, good enough to start with ;-)


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 09/11] sched: first draft of deadline inheritance.
  2010-04-14  8:25   ` Peter Zijlstra
@ 2010-04-14  9:45     ` Peter Zijlstra
  0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2010-04-14  9:45 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	Frederic Weisbecker, Darren Hart, Henrik Austad, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni

On Wed, 2010-04-14 at 10:25 +0200, Peter Zijlstra wrote:
> I think it would be relatively straightforward to modify the existing
> PI chain code to use an RB-tree instead of the plist stuff.
> 
Something like the below (totally untested)

---
 include/linux/init_task.h |   10 ++++
 include/linux/rtmutex.h   |   13 +----
 include/linux/sched.h     |    3 +-
 kernel/fork.c             |    3 +-
 kernel/rtmutex-debug.c    |    8 +--
 kernel/rtmutex.c          |  121 +++++++++++++++++++++++++++++++++++---------
 kernel/rtmutex_common.h   |   21 ++++----
 kernel/sched.c            |    4 --
 8 files changed, 125 insertions(+), 58 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index b1ed1cd..2ee9259 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -10,6 +10,7 @@
 #include <linux/pid_namespace.h>
 #include <linux/user_namespace.h>
 #include <linux/securebits.h>
+#include <linux/rbtree.h>
 #include <net/net_namespace.h>
 
 extern struct files_struct init_files;
@@ -103,6 +104,14 @@ extern struct cred init_cred;
 # define INIT_PERF_EVENTS(tsk)
 #endif
 
+#ifdef CONFIG_RT_MUTEXES
+# define INIT_RT_MUTEXES						\
+	.pi_waiters = RB_ROOT,						\
+	.pi_waiters_leftmost = NULL,
+#else
+# define INIT_RT_MUTEXES
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -172,6 +181,7 @@ extern struct cred init_cred;
 	INIT_FTRACE_GRAPH						\
 	INIT_TRACE_RECURSION						\
 	INIT_TASK_RCU_PREEMPT(tsk)					\
+	INIT_RT_MUTEXES							\
 }
 
 
diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index 8d522ff..8a68b29 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -13,7 +13,7 @@
 #define __LINUX_RT_MUTEX_H
 
 #include <linux/linkage.h>
-#include <linux/plist.h>
+#include <linux/rbtree.h>
 #include <linux/spinlock_types.h>
 
 extern int max_lock_depth; /* for sysctl */
@@ -27,7 +27,8 @@ extern int max_lock_depth; /* for sysctl */
  */
 struct rt_mutex {
 	raw_spinlock_t		wait_lock;
-	struct plist_head	wait_list;
+	struct rb_root		waiters;
+	struct rb_node		*waiters_leftmost;
 	struct task_struct	*owner;
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 	int			save_state;
@@ -98,12 +99,4 @@ extern int rt_mutex_trylock(struct rt_mutex *lock);
 
 extern void rt_mutex_unlock(struct rt_mutex *lock);
 
-#ifdef CONFIG_RT_MUTEXES
-# define INIT_RT_MUTEXES(tsk)						\
-	.pi_waiters	= PLIST_HEAD_INIT(tsk.pi_waiters, tsk.pi_lock),	\
-	INIT_RT_MUTEX_DEBUG(tsk)
-#else
-# define INIT_RT_MUTEXES(tsk)
-#endif
-
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c46b6e5..d4bf0c2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1364,7 +1364,8 @@ struct task_struct {
 
 #ifdef CONFIG_RT_MUTEXES
 	/* PI waiters blocked on a rt_mutex held by this task */
-	struct plist_head pi_waiters;
+	struct rb_root pi_waiters;
+	struct rb_node *pi_waiters_leftmost;
 	/* Deadlock detection and priority inheritance handling */
 	struct rt_mutex_waiter *pi_blocked_on;
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 5d3592d..23e037f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -918,7 +918,8 @@ static void rt_mutex_init_task(struct task_struct *p)
 {
 	raw_spin_lock_init(&p->pi_lock);
 #ifdef CONFIG_RT_MUTEXES
-	plist_head_init_raw(&p->pi_waiters, &p->pi_lock);
+	p->pi_waiters = RB_ROOT;
+	p->pi_waiters_leftmost = NULL;
 	p->pi_blocked_on = NULL;
 #endif
 }
diff --git a/kernel/rtmutex-debug.c b/kernel/rtmutex-debug.c
index ddabb54..7cc8376 100644
--- a/kernel/rtmutex-debug.c
+++ b/kernel/rtmutex-debug.c
@@ -23,7 +23,7 @@
 #include <linux/kallsyms.h>
 #include <linux/syscalls.h>
 #include <linux/interrupt.h>
-#include <linux/plist.h>
+#include <linux/rbtree.h>
 #include <linux/fs.h>
 #include <linux/debug_locks.h>
 
@@ -111,7 +111,7 @@ static void printk_lock(struct rt_mutex *lock, int print_owner)
 
 void rt_mutex_debug_task_free(struct task_struct *task)
 {
-	WARN_ON(!plist_head_empty(&task->pi_waiters));
+	WARN_ON(!RB_EMPTY_ROOT(&task->pi_waiters));
 	WARN_ON(task->pi_blocked_on);
 }
 
@@ -205,16 +205,12 @@ void debug_rt_mutex_proxy_unlock(struct rt_mutex *lock)
 void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
 {
 	memset(waiter, 0x11, sizeof(*waiter));
-	plist_node_init(&waiter->list_entry, MAX_PRIO);
-	plist_node_init(&waiter->pi_list_entry, MAX_PRIO);
 	waiter->deadlock_task_pid = NULL;
 }
 
 void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter)
 {
 	put_pid(waiter->deadlock_task_pid);
-	TRACE_WARN_ON(!plist_node_empty(&waiter->list_entry));
-	TRACE_WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
 	TRACE_WARN_ON(waiter->task);
 	memset(waiter, 0x22, sizeof(*waiter));
 }
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index a960481..765b407 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -97,6 +97,82 @@ static inline void mark_rt_mutex_waiters(struct rt_mutex *lock)
 }
 #endif
 
+static inline int
+rt_mutex_waiter_less(struct rt_mutex_waiter *left, struct rt_mutex_waiter *right)
+{
+	return left->task->prio < right->task->prio;
+}
+
+static void
+rt_mutex_enqueue(struct rt_mutex *lock, struct rt_mutex_waiter *waiter)
+{
+	struct rb_node **link = &lock->waiters.rb_node;
+	struct rb_node *parent = NULL;
+	struct rt_mutex_waiter *entry;
+	int leftmost = 1;
+
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct rt_mutex_waiter, tree_entry);
+		if (rt_mutex_waiter_less(waiter, entry)) {
+			link = &parent->rb_left;
+		} else {
+			link = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	if (leftmost)
+		lock->waiters_leftmost = &waiter->tree_entry;
+
+	rb_link_node(&waiter->tree_entry, parent, link);
+	rb_insert_color(&waiter->tree_entry, &lock->waiters);
+}
+
+static void
+rt_mutex_dequeue(struct rt_mutex *lock, struct rt_mutex_waiter *waiter)
+{
+	if (lock->waiters_leftmost == &waiter->tree_entry)
+		lock->waiters_leftmost = rb_next(&waiter->tree_entry);
+
+	rb_erase(&waiter->tree_entry, &lock->waiters);
+}
+
+static void 
+rt_mutex_enqueue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
+{
+	struct rb_node **link = &task->pi_waiters.rb_node;
+	struct rb_node *parent = NULL;
+	struct rt_mutex_waiter *entry;
+	int leftmost = 1;
+
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct rt_mutex_waiter, pi_tree_entry);
+		if (rt_mutex_waiter_less(waiter, entry)) {
+			link = &parent->rb_left;
+		} else {
+			link = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	if (leftmost)
+		task->pi_waiters_leftmost = &waiter->pi_tree_entry;
+
+	rb_link_node(&waiter->pi_tree_entry, parent, link);
+	rb_insert_color(&waiter->pi_tree_entry, &task->pi_waiters);
+}
+
+static void
+rt_mutex_dequeue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
+{
+	if (task->pi_waiters_leftmost == &waiter->pi_tree_entry)
+		task->pi_waiters_leftmost = rb_next(&waiter->pi_tree_entry);
+
+	rb_erase(&waiter->pi_tree_entry, &task->pi_waiters);
+}
+
 /*
  * Calculate task priority from the waiter list priority
  *
@@ -248,9 +324,8 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 	top_waiter = rt_mutex_top_waiter(lock);
 
 	/* Requeue the waiter */
-	plist_del(&waiter->list_entry, &lock->wait_list);
-	waiter->list_entry.prio = task->prio;
-	plist_add(&waiter->list_entry, &lock->wait_list);
+	rt_mutex_dequeue(lock, waiter);
+	rt_mutex_enqueue(lock, waiter);
 
 	/* Release the task */
 	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
@@ -263,17 +338,15 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 
 	if (waiter == rt_mutex_top_waiter(lock)) {
 		/* Boost the owner */
-		plist_del(&top_waiter->pi_list_entry, &task->pi_waiters);
-		waiter->pi_list_entry.prio = waiter->list_entry.prio;
-		plist_add(&waiter->pi_list_entry, &task->pi_waiters);
+		rt_mutex_dequeue_pi(task, top_waiter);
+		rt_mutex_enqueue_pi(task, waiter);
 		__rt_mutex_adjust_prio(task);
 
 	} else if (top_waiter == waiter) {
 		/* Deboost the owner */
-		plist_del(&waiter->pi_list_entry, &task->pi_waiters);
+		rt_mutex_dequeue_pi(task, waiter);
 		waiter = rt_mutex_top_waiter(lock);
-		waiter->pi_list_entry.prio = waiter->list_entry.prio;
-		plist_add(&waiter->pi_list_entry, &task->pi_waiters);
+		rt_mutex_enqueue_pi(task, waiter);
 		__rt_mutex_adjust_prio(task);
 	}
 
@@ -331,7 +404,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock,
 
 	/* No chain handling, pending owner is not blocked on anything: */
 	next = rt_mutex_top_waiter(lock);
-	plist_del(&next->pi_list_entry, &pendowner->pi_waiters);
+	rt_mutex_dequeue_pi(pendowner, next);
 	__rt_mutex_adjust_prio(pendowner);
 	raw_spin_unlock_irqrestore(&pendowner->pi_lock, flags);
 
@@ -351,7 +424,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock,
 	 */
 	if (likely(next->task != task)) {
 		raw_spin_lock_irqsave(&task->pi_lock, flags);
-		plist_add(&next->pi_list_entry, &task->pi_waiters);
+		rt_mutex_enqueue_pi(task, next);
 		__rt_mutex_adjust_prio(task);
 		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 	}
@@ -424,13 +497,11 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 	__rt_mutex_adjust_prio(task);
 	waiter->task = task;
 	waiter->lock = lock;
-	plist_node_init(&waiter->list_entry, task->prio);
-	plist_node_init(&waiter->pi_list_entry, task->prio);
 
 	/* Get the top priority waiter on the lock */
 	if (rt_mutex_has_waiters(lock))
 		top_waiter = rt_mutex_top_waiter(lock);
-	plist_add(&waiter->list_entry, &lock->wait_list);
+	rt_mutex_enqueue(lock, waiter);
 
 	task->pi_blocked_on = waiter;
 
@@ -438,10 +509,11 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 
 	if (waiter == rt_mutex_top_waiter(lock)) {
 		raw_spin_lock_irqsave(&owner->pi_lock, flags);
-		plist_del(&top_waiter->pi_list_entry, &owner->pi_waiters);
-		plist_add(&waiter->pi_list_entry, &owner->pi_waiters);
 
+		rt_mutex_dequeue_pi(owner, top_waiter);
+		rt_mutex_enqueue_pi(owner, waiter);
 		__rt_mutex_adjust_prio(owner);
+
 		if (owner->pi_blocked_on)
 			chain_walk = 1;
 		raw_spin_unlock_irqrestore(&owner->pi_lock, flags);
@@ -486,7 +558,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock)
 	raw_spin_lock_irqsave(&current->pi_lock, flags);
 
 	waiter = rt_mutex_top_waiter(lock);
-	plist_del(&waiter->list_entry, &lock->wait_list);
+	rt_mutex_dequeue(lock, waiter);
 
 	/*
 	 * Remove it from current->pi_waiters. We do not adjust a
@@ -494,7 +566,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock)
 	 * boosted mode and go back to normal after releasing
 	 * lock->wait_lock.
 	 */
-	plist_del(&waiter->pi_list_entry, &current->pi_waiters);
+	rt_mutex_dequeue_pi(current, waiter);
 	pendowner = waiter->task;
 	waiter->task = NULL;
 
@@ -521,7 +593,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock)
 		struct rt_mutex_waiter *next;
 
 		next = rt_mutex_top_waiter(lock);
-		plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
+		rt_mutex_enqueue_pi(pendowner, next);
 	}
 	raw_spin_unlock_irqrestore(&pendowner->pi_lock, flags);
 
@@ -542,7 +614,7 @@ static void remove_waiter(struct rt_mutex *lock,
 	int chain_walk = 0;
 
 	raw_spin_lock_irqsave(&current->pi_lock, flags);
-	plist_del(&waiter->list_entry, &lock->wait_list);
+	rt_mutex_dequeue(lock, waiter);
 	waiter->task = NULL;
 	current->pi_blocked_on = NULL;
 	raw_spin_unlock_irqrestore(&current->pi_lock, flags);
@@ -551,13 +623,13 @@ static void remove_waiter(struct rt_mutex *lock,
 
 		raw_spin_lock_irqsave(&owner->pi_lock, flags);
 
-		plist_del(&waiter->pi_list_entry, &owner->pi_waiters);
+		rt_mutex_dequeue_pi(owner, waiter);
 
 		if (rt_mutex_has_waiters(lock)) {
 			struct rt_mutex_waiter *next;
 
 			next = rt_mutex_top_waiter(lock);
-			plist_add(&next->pi_list_entry, &owner->pi_waiters);
+			rt_mutex_enqueue_pi(owner, next);
 		}
 		__rt_mutex_adjust_prio(owner);
 
@@ -567,8 +639,6 @@ static void remove_waiter(struct rt_mutex *lock,
 		raw_spin_unlock_irqrestore(&owner->pi_lock, flags);
 	}
 
-	WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
-
 	if (!chain_walk)
 		return;
 
@@ -971,7 +1041,8 @@ void __rt_mutex_init(struct rt_mutex *lock, const char *name)
 {
 	lock->owner = NULL;
 	raw_spin_lock_init(&lock->wait_lock);
-	plist_head_init_raw(&lock->wait_list, &lock->wait_lock);
+	lock->waiters = RB_ROOT;
+	lock->waiters_leftmost = NULL;
 
 	debug_rt_mutex_init(lock, name);
 }
diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
index 97a2f81..b522322 100644
--- a/kernel/rtmutex_common.h
+++ b/kernel/rtmutex_common.h
@@ -40,13 +40,13 @@ extern void schedule_rt_mutex_test(struct rt_mutex *lock);
  * This is the control structure for tasks blocked on a rt_mutex,
  * which is allocated on the kernel stack on of the blocked task.
  *
- * @list_entry:		pi node to enqueue into the mutex waiters list
- * @pi_list_entry:	pi node to enqueue into the mutex owner waiters list
+ * @tree_entry:		pi node to enqueue into the mutex waiters tree
+ * @pi_tree_entry:	pi node to enqueue into the mutex owner waiters tree
  * @task:		task reference to the blocked task
  */
 struct rt_mutex_waiter {
-	struct plist_node	list_entry;
-	struct plist_node	pi_list_entry;
+	struct rb_node		tree_entry;
+	struct rb_node		pi_tree_entry;
 	struct task_struct	*task;
 	struct rt_mutex		*lock;
 #ifdef CONFIG_DEBUG_RT_MUTEXES
@@ -57,11 +57,11 @@ struct rt_mutex_waiter {
 };
 
 /*
- * Various helpers to access the waiters-plist:
+ * Various helpers to access the waiters-tree:
  */
 static inline int rt_mutex_has_waiters(struct rt_mutex *lock)
 {
-	return !plist_head_empty(&lock->wait_list);
+	return !RB_EMPTY_ROOT(&lock->waiters);
 }
 
 static inline struct rt_mutex_waiter *
@@ -69,8 +69,7 @@ rt_mutex_top_waiter(struct rt_mutex *lock)
 {
 	struct rt_mutex_waiter *w;
 
-	w = plist_first_entry(&lock->wait_list, struct rt_mutex_waiter,
-			       list_entry);
+	w = rb_entry(lock->waiters_leftmost, struct rt_mutex_waiter, tree_entry);
 	BUG_ON(w->lock != lock);
 
 	return w;
@@ -78,14 +77,14 @@ rt_mutex_top_waiter(struct rt_mutex *lock)
 
 static inline int task_has_pi_waiters(struct task_struct *p)
 {
-	return !plist_head_empty(&p->pi_waiters);
+	return !RB_EMPTY_ROOT(&p->pi_waiters);
 }
 
 static inline struct rt_mutex_waiter *
 task_top_pi_waiter(struct task_struct *p)
 {
-	return plist_first_entry(&p->pi_waiters, struct rt_mutex_waiter,
-				  pi_list_entry);
+	return rb_entry(p->pi_waiters_leftmost, struct rt_mutex_waiter,
+			pi_tree_entry);
 }
 
 /*
diff --git a/kernel/sched.c b/kernel/sched.c
index 3acf694..40b9bc5 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7677,10 +7677,6 @@ void __init sched_init(void)
 	open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);
 #endif
 
-#ifdef CONFIG_RT_MUTEXES
-	plist_head_init_raw(&init_task.pi_waiters, &init_task.pi_lock);
-#endif
-
 	/*
 	 * The boot idle thread does lazy MMU switching as well:
 	 */



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 10/11] sched: add bandwidth management for sched_dl.
  2010-02-28 19:27 ` [RFC][PATCH 10/11] sched: add bandwidth management for sched_dl Raistlin
@ 2010-04-14 10:09   ` Peter Zijlstra
  0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2010-04-14 10:09 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	Frederic Weisbecker, Darren Hart, Henrik Austad, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni

On Sun, 2010-02-28 at 20:27 +0100, Raistlin wrote:
> @@ -2063,6 +2210,30 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
>         return delta < (s64)sysctl_sched_migration_cost;
>  }
>  
> +/*
> + * When dealing with a -deadline task, we have to check if moving it to
> + * a new CPU is possible or not. In fact, this is only true iff there
> + * is enough bandwidth available on that CPU; otherwise we want the
> + * whole migration procedure to fail.
> + */
> +static inline
> +bool __set_task_cpu_dl(struct task_struct *p, unsigned int cpu)
> +{
> +       struct dl_bandwidth *dl_b = task_dl_bandwidth(p);
> +
> +       raw_spin_lock(&dl_b->dl_runtime_lock);
> +       if (dl_b->dl_bw < dl_b->dl_total_bw[cpu] + p->dl.dl_bw) {
> +               raw_spin_unlock(&dl_b->dl_runtime_lock);
> +
> +               return 0;
> +       }
> +       dl_b->dl_total_bw[task_cpu(p)] -= p->dl.dl_bw;
> +       dl_b->dl_total_bw[cpu] += p->dl.dl_bw;
> +       raw_spin_unlock(&dl_b->dl_runtime_lock);
> +
> +       return 1;
> +}
> +
>  void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>  {
>  #ifdef CONFIG_SCHED_DEBUG
> @@ -2077,6 +2248,9 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>         trace_sched_migrate_task(p, new_cpu);
>  
>         if (task_cpu(p) != new_cpu) {
> +               if (task_has_dl_policy(p) && !__set_task_cpu_dl(p, new_cpu))
> +                       return;
> +
>                 p->se.nr_migrations++;
>                 perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 1, NULL, 0);
>         } 

Yikes!! I'm not sure we can sanely deal with set_task_cpu() doing that.

I'd much rather see us never attempting set_task_cpu() when we know it's
not going to be possible.

That also means that things like set_cpus_allowed_ptr() /
sys_sched_setaffinity() will need to propagate the error back to their
users, which in turn will need to be able to cope.
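
One possible shape for that, as a sketch only (dl_bw_check() and the
-EBUSY choice are illustrative, not from the patchset; the bandwidth
bookkeeping mirrors the quoted __set_task_cpu_dl()):

/* Admission check done by the caller, before any migration is attempted. */
static int dl_bw_check(struct task_struct *p, unsigned int cpu)
{
	struct dl_bandwidth *dl_b = task_dl_bandwidth(p);
	int ret = 0;

	raw_spin_lock(&dl_b->dl_runtime_lock);
	if (dl_b->dl_bw < dl_b->dl_total_bw[cpu] + p->dl.dl_bw)
		ret = -EBUSY;
	raw_spin_unlock(&dl_b->dl_runtime_lock);

	return ret;
}

set_cpus_allowed_ptr() / sys_sched_setaffinity() would call this for the
destination CPU(s) and return the error to their callers, so that
set_task_cpu() itself never has to back out.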


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 0/11] sched: SCHED_DEADLINE v2
  2010-02-28 19:06 [RFC][PATCH 0/11] sched: SCHED_DEADLINE v2 Raistlin
                   ` (10 preceding siblings ...)
  2010-02-28 19:28 ` [RFC][PATCH 11/11] sched: add sched_dl documentation Raistlin
@ 2010-04-14 10:17 ` Peter Zijlstra
  11 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2010-04-14 10:17 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	Frederic Weisbecker, Darren Hart, Henrik Austad, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni

Overall very nice code, awesome!

On Sun, 2010-02-28 at 20:06 +0100, Raistlin wrote:
>  - migration of tasks throughout push and pull (as in -rt), to make it
>    possible to deploy global-EDF scheduling. Patches are ready, they're
>    just being tested and adapted to this last version;

Please post a version that includes this :-)

>  - refinements in deadline inheritance, especially regarding the
>    possibility of retaining bandwidth isolation among non-interacting
>    tasks. This is being studied from both theoretical and practical
>    points of view, and hopefully we can have some demonstrative code
>    soon. 

Right, I hope that with the PI code changed to use RB trees this will
become much less of a kludge than fudging the ->prio field.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC][PATCH 02/11] sched: SCHED_DEADLINE policy implementation.
  2010-04-13 18:55     ` Steven Rostedt
@ 2010-04-15  7:34       ` Peter Zijlstra
  0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2010-04-15  7:34 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Chris Friesen,
	Frederic Weisbecker, Darren Hart, Henrik Austad, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni

On Tue, 2010-04-13 at 14:55 -0400, Steven Rostedt wrote:
> On Tue, 2010-04-13 at 20:22 +0200, Peter Zijlstra wrote:
> > On Sun, 2010-02-28 at 20:17 +0100, Raistlin wrote:
> > > +/*
> > > + * Here we check if --at time t-- a task (which is probably being
> > > + * [re]activated or, in general, enqueued) can use its remaining runtime
> > > + * and its current deadline _without_ exceeding the bandwidth it is
> > > + * assigned (function returns true if it can).
> > > + *
> > > + * For this to hold, we must check if:
> > > + *   runtime / (deadline - t) < dl_runtime / dl_deadline .
> > > + */
> > > +static bool dl_check_bandwidth(struct sched_dl_entity *dl_se, u64 t)
> > > +{
> > > +       u64 left, right;
> > > +
> > > +       /*
> > > +        * left and right are the two sides of the equation above,
> > > +        * after a bit of shuffling to use multiplications instead
> > > +        * of divisions.
> > > +        */
> > > +       left = dl_se->dl_deadline * dl_se->runtime;
> > > +       right = (dl_se->deadline - t) * dl_se->dl_runtime;
> > > +
> > > +       return dl_time_before(left, right);
> > > +} 
> > 
> > So what happens when we overflow u64?
> 
> Is the resolution in nanosecs starting from zero? If so, then we don't
> need to worry about overflow for 583 years? And that is only if the
> difference in time is greater than 292 years since dl_time_before() does
> a:
> 
>   (s64)(a - b) < 0
> 
> The (s64)(a - b) returns the difference even on overflow as long as the
> difference is not greater than 2^63

It's a multiplication of two u64s; that's a lot easier to overflow.
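
With parameters expressed in nanoseconds, deadlines and runtimes in the
hundreds-of-milliseconds to seconds range make each factor roughly
10^8-10^10, so the products can exceed 2^64 (~1.8*10^19). A sketch of
one way out is to shed some precision from both factors before
multiplying; the 10-bit shift (~1us granularity) is illustrative, and
the field names follow the quoted code:

static bool dl_check_bandwidth(struct sched_dl_entity *dl_se, u64 t)
{
	u64 left, right;

	/*
	 * runtime / (deadline - t) < dl_runtime / dl_deadline, rewritten
	 * with multiplications; both factors of each product are shifted
	 * down first so the products stay well within 64 bits for any
	 * sane parameter range (at the cost of ~1us resolution).
	 */
	left = (dl_se->dl_deadline >> 10) * (dl_se->runtime >> 10);
	right = ((dl_se->deadline - t) >> 10) * (dl_se->dl_runtime >> 10);

	return dl_time_before(left, right);
}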


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2010-04-15  7:34 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-02-28 19:06 [RFC][PATCH 0/11] sched: SCHED_DEADLINE v2 Raistlin
2010-02-28 19:15 ` [RFC][PATCH 01/11] sched: add sched_class->task_dead Raistlin
2010-02-28 19:17 ` [RFC][PATCH 02/11] sched: SCHED_DEADLINE policy implementation Raistlin
2010-04-13 18:22   ` Peter Zijlstra
2010-04-13 18:22   ` Peter Zijlstra
2010-04-13 18:22   ` Peter Zijlstra
2010-04-13 18:55     ` Steven Rostedt
2010-04-15  7:34       ` Peter Zijlstra
2010-04-13 18:22   ` Peter Zijlstra
2010-04-13 18:22   ` Peter Zijlstra
2010-02-28 19:18 ` [RFC][PATCH 03/11] sched: add extended scheduling interface Raistlin
2010-02-28 19:19 ` [RFC][PATCH 04/11] sched: add resource limits for -deadline tasks Raistlin
2010-02-28 19:20 ` [RFC][PATCH 05/11] sched: add a syscall to wait for the next instance Raistlin
2010-02-28 19:22 ` [RFC][PATCH 06/11] sched: add the sched-debug bits for sched_dl Raistlin
2010-02-28 19:23 ` [RFC][PATCH 07/11] sched: add latency tracing for -deadline tasks Raistlin
2010-02-28 19:24 ` [RFC][PATCH 08/11] sched: send SIGXCPU at -deadline task overruns Raistlin
2010-04-13 18:22   ` Peter Zijlstra
2010-04-13 19:32     ` Oleg Nesterov
2010-02-28 19:26 ` [RFC][PATCH 09/11] sched: first draft of deadline inheritance Raistlin
2010-04-14  8:25   ` Peter Zijlstra
2010-04-14  9:45     ` Peter Zijlstra
2010-02-28 19:27 ` [RFC][PATCH 10/11] sched: add bandwidth management for sched_dl Raistlin
2010-04-14 10:09   ` Peter Zijlstra
2010-02-28 19:28 ` [RFC][PATCH 11/11] sched: add sched_dl documentation Raistlin
2010-04-14 10:17 ` [RFC][PATCH 0/11] sched: SCHED_DEADLINE v2 Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).