linux-kernel.vger.kernel.org archive mirror
* [RFC 0/12][PATCH] SCHED_DEADLINE (new version of SCHED_EDF)
@ 2009-10-16 15:35 Raistlin
  2009-10-16 15:38 ` [RFC 1/12][PATCH] Extended scheduling parameters structure added Raistlin
                   ` (11 more replies)
  0 siblings, 12 replies; 45+ messages in thread
From: Raistlin @ 2009-10-16 15:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: raistlin, linux-kernel, michael trimarchi, Fabio Checconi,
	Ingo Molnar, Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 2634 bytes --]

Hi Peter, Hi all,

Given all the comments and feedback we got, here is the new version
of our EDF patch, this time as a split series. :-)

Special thanks to everyone who gave us suggestions of any kind,
especially during the last RTLWS in Dresden.

The rationale/motivation for the new scheduler is the same as in the first
e-mail (http://lwn.net/Articles/353797/), so I only add some new and
(I think) interesting links:
 - Ericsson posting about SCHED_EDF/DEADLINE
    https://labs.ericsson.com/blog/making-linux-more-real-time

 - The slides we presented at RTLWS in Dresden:
    http://retis.sssup.it/people/faggioli/sched_deadline/rtlw_EDF.pdf

 - Luca Abeni's presentation about using deadline based reservation 
   schedulers for IRQ-Threads:
    http://www.disi.unitn.it/~abeni/rtlws-slides.pdf

Moreover, we moved the project on gitorious.org, therefore:
 http://gitorious.org/sched_deadline

I am also setting up the Wiki section right now, where you will find
more detailed usage instructions, examples and overhead estimates:
 http://gitorious.org/sched_deadline/pages/Home

Git repositories are up and running, and ready at:
(mainline)
 git://gitorious.org/sched_deadline/linux-deadline.git sched-deadline 

(sched-devel)
 git://gitorious.org/sched_deadline/linux-deadline.git sched-devel-deadline

(preempt-rt [*])
 git://gitorious.org/sched_deadline/linux-deadline.git rt-deadline

The new project homepage is
 http://www.evidence.eu.com/sched_deadline.html

Here are the main changes we made, following what many of you --and mainly
Peter-- suggested:
- name changed from SCHED_EDF to SCHED_DEADLINE

- SCHED_DEADLINE has higher priority than SCHED_FIFO/SCHED_RR

- flags added in sched_param_ex to signal deadline misses
  (in case of utilization > 100%; see the short utilization sketch
  after this list) and/or budget overruns

- new sched_*_ex prototypes, with a len field to accommodate the size of
  sched_param_ex, trying to avoid further ABI issues if it changes

- new syscall sched_wait_interval added. It behaves like clock_nanosleep
  but, for a SCHED_DEADLINE task, it also represents the end of the
  current instance (sched_yield no longer needed)

- on fork, the child starts but with 0 bandwidth (i.e., it does not actually
  start running! :-D)

- bug fixing :)
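
To give a rough idea of what ``utilization > 100%'' means above, here is a
small userspace sketch (not part of the patch set): with EDF on a single CPU,
all deadlines can be met only as long as the sum of runtime_i/period_i over
all admitted tasks stays at or below 1.0.

#include <stdio.h>

struct dl_params { double runtime_ms, period_ms; };

int main(void)
{
	/* three hypothetical tasks: runtime / period, in milliseconds */
	struct dl_params t[] = { {10, 30}, {20, 80}, {50, 100} };
	double u = 0.0;
	unsigned int i;

	for (i = 0; i < sizeof(t) / sizeof(t[0]); i++)
		u += t[i].runtime_ms / t[i].period_ms;

	/* prints ~1.08 here: above 1.0, so deadline misses are unavoidable */
	printf("total utilization = %.2f\n", u);
	return 0;
}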

Any feedback or contribution is welcome.

Many thanks,

             Dario Faggioli
             Claudio Scordino
             Michael Trimarchi

[*] porting to preempt-rt is a work in progress. The code is there, but we
are still testing and fixing it, with the help of Luca and Nicola from
Trento... So don't consider it the final version!


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [RFC 1/12][PATCH] Extended scheduling parameters structure added.
  2009-10-16 15:35 [RFC 0/12][PATCH] SCHED_DEADLINE (new version of SCHED_EDF) Raistlin
@ 2009-10-16 15:38 ` Raistlin
  2009-12-29 12:15   ` Peter Zijlstra
  2009-10-16 15:40 ` [RFC 0/12][PATCH] SCHED_DEADLINE: core of the scheduling class Raistlin
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 45+ messages in thread
From: Raistlin @ 2009-10-16 15:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 1680 bytes --]

An extended scheduling parameter structure, sched_param_ex, is defined in
this commit, as the starting point for supporting task models more
sophisticated than fixed-priority.

One that is both popular and (hopefully!) general enough is the so-called
sporadic task model, in which each task's computation is divided into
instances, each one with:
 * a (maximum/typical) execution time,
 * a minimum interval between the activation of two consecutive instances,
 * a time instant by which the computation of the instance must be completed.

The new sched_param_ex reflects this model, and thus allows for a better
specification of the time-sensitive workloads typical of, for example,
real-time, control and/or continuous media applications.
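
As a concrete sketch, a task needing at most 5 ms of CPU time, with a relative
deadline of 20 ms from each activation and a minimum of 30 ms between
activations, would fill the new structure (defined below in this patch)
roughly like this -- the syscalls actually taking it are added later in the
series:

	struct sched_param_ex param = {
		.sched_priority = 0,	/* not used by the deadline policy */
		.sched_runtime  = { .tv_sec = 0, .tv_nsec =  5000000 },
		.sched_deadline = { .tv_sec = 0, .tv_nsec = 20000000 },
		.sched_period   = { .tv_sec = 0, .tv_nsec = 30000000 },
		.sched_flags    = 0,
	};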

Signed-off-by: Raistlin <raistlin@linux.it>
---
 include/linux/sched.h |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 75e6e60..ac9837c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -94,6 +94,14 @@ struct sched_param {
 
 #include <asm/processor.h>
 
+struct sched_param_ex {
+	int sched_priority;
+	struct timespec sched_runtime;
+	struct timespec sched_deadline;
+	struct timespec sched_period;
+	int sched_flags;
+};
+
 struct exec_domain;
 struct futex_pi_state;
 struct robust_list_head;
-- 
1.6.0.4

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [RFC 0/12][PATCH] SCHED_DEADLINE: core of the scheduling class
  2009-10-16 15:35 [RFC 0/12][PATCH] SCHED_DEADLINE (new version of SCHED_EDF) Raistlin
  2009-10-16 15:38 ` [RFC 1/12][PATCH] Extended scheduling parameters structure added Raistlin
@ 2009-10-16 15:40 ` Raistlin
  2009-12-29 12:25   ` Peter Zijlstra
                     ` (3 more replies)
  2009-10-16 15:41 ` [RFC 0/12][PATCH] SCHED_DEADLINE: fork and terminate task logic Raistlin
                   ` (9 subsequent siblings)
  11 siblings, 4 replies; 45+ messages in thread
From: Raistlin @ 2009-10-16 15:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 23000 bytes --]

This commit introduces a new scheduling policy (SCHED_DEADLINE), implemented
in a new scheduling class (sched_deadline.c).

As of now, it implements the popular Earliest Deadline First (EDF) real-time
scheduling algorithm.
This basically means that each (instance of each) task has a deadline,
indicating the time instant by which its computation has to be completed.
The scheduler always picks the task with the earliest deadline as the next
one to be executed.
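
For example, if task A's next deadline is at t=8ms and task B's is at t=12ms,
A runs first. The runqueue introduced below keeps ready tasks in an rbtree
ordered by absolute deadline, so the next task to run is simply the leftmost
node; as an illustrative sketch (a naive linear scan instead of the rbtree,
but using the same wraparound-safe comparison), the selection rule amounts to:

static inline int deadline_before(u64 a, u64 b)
{
	/* signed difference handles wraparound of the u64 clock values */
	return (s64)(a - b) < 0;
}

static struct task_struct *edf_pick(struct task_struct **ready, int n)
{
	struct task_struct *best = NULL;
	int i;

	/* EDF: among the ready tasks, run the one with the earliest deadline */
	for (i = 0; i < n; i++)
		if (!best || deadline_before(ready[i]->dl.deadline,
					     best->dl.deadline))
			best = ready[i];

	return best;
}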

Some more logic is added in order to prevent tasks from interfering with each
other, i.e., a deadline miss by task A should not affect the ability of
task B to meet its own deadline.

Open issues:
 - this implementation is ``fully partitioned'', which means each task has to
   be bound to one processor at any given time. Turning it into ``global
   scheduling'' (i.e., migrations are allowed) is work in progress;
 - proper dealing with critical sections/rt-mutexes is also missing, and
   is also work in progress.

Signed-off-by: Raistlin <raistlin@linux.it>
---
 include/linux/sched.h   |   36 ++++
 kernel/hrtimer.c        |    2 +-
 kernel/sched.c          |   44 ++++-
 kernel/sched_deadline.c |  513 +++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_fair.c     |    2 +-
 kernel/sched_rt.c       |    2 +-
 6 files changed, 587 insertions(+), 12 deletions(-)
 create mode 100644 kernel/sched_deadline.c

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ac9837c..20e1a6a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -38,6 +38,7 @@
 #define SCHED_BATCH		3
 /* SCHED_ISO: reserved but not implemented yet */
 #define SCHED_IDLE		5
+#define SCHED_DEADLINE		6
 /* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
 #define SCHED_RESET_ON_FORK     0x40000000
 
@@ -159,6 +160,7 @@ extern unsigned long get_parent_ip(unsigned long addr);
 
 struct seq_file;
 struct cfs_rq;
+struct dl_rq;
 struct task_group;
 #ifdef CONFIG_SCHED_DEBUG
 extern void proc_sched_show_task(struct task_struct *p, struct seq_file *m);
@@ -1218,6 +1220,27 @@ struct sched_rt_entity {
 #endif
 };
 
+#define DL_NEW			0x00000001
+#define DL_THROTTLED		0x00000002
+#define DL_BOOSTED		0x00000004
+
+struct sched_dl_entity {
+	struct rb_node	rb_node;
+	/* actual scheduling parameters */
+	s64		runtime;
+	u64		deadline;
+	unsigned int	flags;
+
+	/* original parameters taken from sched_param_ex */
+	u64		sched_runtime;
+	u64		sched_deadline;
+	u64		sched_period;
+	u64		bw;
+
+	int		nr_cpus_allowed;
+	struct hrtimer	dl_timer;
+};
+
 struct rcu_node;
 
 struct task_struct {
@@ -1240,6 +1263,7 @@ struct task_struct {
 	const struct sched_class *sched_class;
 	struct sched_entity se;
 	struct sched_rt_entity rt;
+	struct sched_dl_entity dl;
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	/* list of struct preempt_notifier: */
@@ -1583,6 +1607,18 @@ static inline int rt_task(struct task_struct *p)
 	return rt_prio(p->prio);
 }
 
+static inline int deadline_policy(int policy)
+{
+	if (unlikely(policy == SCHED_DEADLINE))
+		return 1;
+	return 0;
+}
+
+static inline int deadline_task(struct task_struct *p)
+{
+	return deadline_policy(p->policy);
+}
+
 static inline struct pid *task_pid(struct task_struct *task)
 {
 	return task->pids[PIDTYPE_PID].pid;
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 3e1c36e..bf6a3b1 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1537,7 +1537,7 @@ long hrtimer_nanosleep(struct timespec *rqtp, struct timespec __user *rmtp,
 	unsigned long slack;
 
 	slack = current->timer_slack_ns;
-	if (rt_task(current))
+	if (deadline_task(current) || rt_task(current))
 		slack = 0;
 
 	hrtimer_init_on_stack(&t.timer, clockid, mode);
diff --git a/kernel/sched.c b/kernel/sched.c
index e886895..adf1414 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -131,6 +131,11 @@ static inline int task_has_rt_policy(struct task_struct *p)
 	return rt_policy(p->policy);
 }
 
+static inline int task_has_deadline_policy(struct task_struct *p)
+{
+	return deadline_policy(p->policy);
+}
+
 /*
  * This is the priority-queue data structure of the RT scheduling class:
  */
@@ -481,6 +486,14 @@ struct rt_rq {
 #endif
 };
 
+struct dl_rq {
+	unsigned long dl_nr_running;
+
+	/* runqueue is an rbtree, ordered by deadline */
+	struct rb_root rb_root;
+	struct rb_node *rb_leftmost;
+};
+
 #ifdef CONFIG_SMP
 
 /*
@@ -545,6 +558,7 @@ struct rq {
 
 	struct cfs_rq cfs;
 	struct rt_rq rt;
+	struct dl_rq dl;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this cpu: */
@@ -1818,11 +1832,12 @@ static void calc_load_account_active(struct rq *this_rq);
 #include "sched_idletask.c"
 #include "sched_fair.c"
 #include "sched_rt.c"
+#include "sched_deadline.c"
 #ifdef CONFIG_SCHED_DEBUG
 # include "sched_debug.c"
 #endif
 
-#define sched_class_highest (&rt_sched_class)
+#define sched_class_highest (&deadline_sched_class)
 #define for_each_class(class) \
    for (class = sched_class_highest; class; class = class->next)
 
@@ -1838,7 +1853,7 @@ static void dec_nr_running(struct rq *rq)
 
 static void set_load_weight(struct task_struct *p)
 {
-	if (task_has_rt_policy(p)) {
+	if (task_has_deadline_policy(p) || task_has_rt_policy(p)) {
 		p->se.load.weight = prio_to_weight[0] * 2;
 		p->se.load.inv_weight = prio_to_wmult[0] >> 1;
 		return;
@@ -2523,7 +2538,8 @@ void sched_fork(struct task_struct *p, int clone_flags)
 	 * Revert to default priority/policy on fork if requested.
 	 */
 	if (unlikely(p->sched_reset_on_fork)) {
-		if (p->policy == SCHED_FIFO || p->policy == SCHED_RR) {
+		if (deadline_policy(p->policy) ||
+		    p->policy == SCHED_FIFO || p->policy == SCHED_RR) {
 			p->policy = SCHED_NORMAL;
 			p->normal_prio = p->static_prio;
 		}
@@ -5966,10 +5982,14 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 	if (running)
 		p->sched_class->put_prev_task(rq, p);
 
-	if (rt_prio(prio))
-		p->sched_class = &rt_sched_class;
-	else
-		p->sched_class = &fair_sched_class;
+	if (deadline_task(p))
+		p->sched_class = &deadline_sched_class;
+	else {
+		if (rt_prio(prio))
+			p->sched_class = &rt_sched_class;
+		else
+			p->sched_class = &fair_sched_class;
+	}
 
 	p->prio = prio;
 
@@ -6003,9 +6023,9 @@ void set_user_nice(struct task_struct *p, long nice)
 	 * The RT priorities are set via sched_setscheduler(), but we still
 	 * allow the 'normal' nice value to be set - but as expected
 	 * it wont have any effect on scheduling until the task is
-	 * SCHED_FIFO/SCHED_RR:
+	 * SCHED_DEADLINE, SCHED_FIFO or SCHED_RR:
 	 */
-	if (task_has_rt_policy(p)) {
+	if (unlikely(task_has_deadline_policy(p) || task_has_rt_policy(p))) {
 		p->static_prio = NICE_TO_PRIO(nice);
 		goto out_unlock;
 	}
@@ -9259,6 +9279,11 @@ static void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
 #endif
 }
 
+static void init_deadline_rq(struct dl_rq *dl_rq, struct rq *rq)
+{
+	dl_rq->rb_root = RB_ROOT;
+}
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 				struct sched_entity *se, int cpu, int add,
@@ -9417,6 +9442,7 @@ void __init sched_init(void)
 		rq->calc_load_update = jiffies + LOAD_FREQ;
 		init_cfs_rq(&rq->cfs, rq);
 		init_rt_rq(&rq->rt, rq);
+		init_deadline_rq(&rq->dl, rq);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 		init_task_group.shares = init_task_group_load;
 		INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
diff --git a/kernel/sched_deadline.c b/kernel/sched_deadline.c
new file mode 100644
index 0000000..5430c48
--- /dev/null
+++ b/kernel/sched_deadline.c
@@ -0,0 +1,513 @@
+/*
+ * Deadline Scheduling Class (SCHED_DEADLINE policy)
+ *
+ * This scheduling class implements the Earliest Deadline First (EDF)
+ * scheduling algorithm, suited for hard and soft real-time tasks.
+ *
+ * The strategy used to confine each task inside its bandwidth reservation
+ * is the Constant Bandwidth Server (CBS) scheduling, a slight variation on
+ * EDF that makes this possible.
+ *
+ * Correct behavior, i.e., no task missing any deadline, is only guaranteed
+ * if the task's parameters are:
+ *  - correctly assigned, so that the system is not overloaded,
+ *  - respected during actual execution.
+ * However, thanks to bandwidth isolation, overruns and deadline misses
+ * remain local, and do not affect any other task in the system.
+ *
+ * Copyright (C) 2009 Dario Faggioli, Michael Trimarchi
+ */
+
+static const struct sched_class deadline_sched_class;
+
+static inline struct task_struct *deadline_task_of(struct sched_dl_entity *dl_se)
+{
+	return container_of(dl_se, struct task_struct, dl);
+}
+
+static inline struct rq *rq_of_deadline_rq(struct dl_rq *dl_rq)
+{
+	return container_of(dl_rq, struct rq, dl);
+}
+
+static inline struct dl_rq *deadline_rq_of_se(struct sched_dl_entity *dl_se)
+{
+	struct task_struct *p = deadline_task_of(dl_se);
+	struct rq *rq = task_rq(p);
+
+	return &rq->dl;
+}
+
+/*
+ * FIXME:
+ *  This is broken for now, correct implementation of a BWI/PEP
+ *  solution is needed here!
+ */
+static inline int deadline_se_boosted(struct sched_dl_entity *dl_se)
+{
+	struct task_struct *p = deadline_task_of(dl_se);
+
+	return p->prio != p->normal_prio;
+}
+
+static inline int on_deadline_rq(struct sched_dl_entity *dl_se)
+{
+	return !RB_EMPTY_NODE(&dl_se->rb_node);
+}
+
+#define for_each_leaf_deadline_rq(dl_rq, rq) \
+	for (dl_rq = &rq->dl; dl_rq; dl_rq = NULL)
+
+static inline int deadline_time_before(u64 a, u64 b)
+{
+	return (s64)(a - b) < 0;
+}
+
+static inline u64 deadline_max_deadline(u64 a, u64 b)
+{
+	s64 delta = (s64)(b - a);
+	if (delta > 0)
+		a = b;
+
+	return a;
+}
+
+static void enqueue_deadline_entity(struct sched_dl_entity *dl_se);
+static void dequeue_deadline_entity(struct sched_dl_entity *dl_se);
+static void check_deadline_preempt_curr(struct task_struct *p, struct rq *rq);
+
+/*
+ * setup a new SCHED_DEADLINE task instance.
+ */
+static inline void setup_new_deadline_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = deadline_rq_of_se(dl_se);
+	struct rq *rq = rq_of_deadline_rq(dl_rq);
+
+	dl_se->flags &= ~DL_NEW;
+	dl_se->deadline = max(dl_se->deadline, rq->clock) +
+			      dl_se->sched_deadline;
+	dl_se->runtime = dl_se->sched_runtime;
+}
+
+/*
+ * gives a SCHED_DEADLINE task that ran out of runtime the possibility
+ * of restarting execution, with a refilled runtime and a new
+ * (postponed) deadline.
+ */
+static void replenish_deadline_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = deadline_rq_of_se(dl_se);
+	struct rq *rq = rq_of_deadline_rq(dl_rq);
+
+	/*
+	 * Keep moving the deadline and replenishing runtime by the
+	 * proper amount until the runtime becomes positive.
+	 */
+	while (dl_se->runtime < 0) {
+		dl_se->deadline += dl_se->sched_deadline;
+		dl_se->runtime += dl_se->sched_runtime;
+	}
+
+	WARN_ON(dl_se->runtime > dl_se->sched_runtime);
+	WARN_ON(deadline_time_before(dl_se->deadline, rq->clock));
+}
+
+static void update_deadline_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = deadline_rq_of_se(dl_se);
+	struct rq *rq = rq_of_deadline_rq(dl_rq);
+	u64 left, right;
+
+	if (dl_se->flags & DL_NEW) {
+		setup_new_deadline_entity(dl_se);
+		return;
+	}
+
+	/*
+	 * Update the deadline of the task only if:
+	 * - the budget has been completely exhausted;
+	 * - using the remaining budget, with the current deadline, would
+	 *   make the task exceed its bandwidth;
+	 * - the deadline itself is in the past.
+	 *
+	 * For the second condition to hold, we check if:
+	 *  runtime / (deadline - rq->clock) >= sched_runtime / sched_deadline
+	 *
+	 * Which basically says if, in the time left before the current
+	 * deadline, the task would overcome its expected runtime by using the
+	 * residual budget (left and right are the two sides of the equation,
+	 * after a bit of shuffling to use multiplications instead of
+	 * divisions).
+	 */
+	if (deadline_time_before(dl_se->deadline, rq->clock))
+		goto update;
+
+	left = dl_se->sched_deadline * dl_se->runtime;
+	right = (dl_se->deadline - rq->clock) * dl_se->sched_runtime;
+
+	if (deadline_time_before(right, left)) {
+update:
+		dl_se->deadline = rq->clock + dl_se->sched_deadline;
+		dl_se->runtime = dl_se->sched_runtime;
+	}
+}
+
+/*
+ * the task just depleted its runtime, so we try to post the
+ * replenishment timer to fire at the next absolute deadline.
+ *
+ * In fact, the task was allowed to execute for at most sched_runtime
+ * over each period of sched_deadline length.
+ */
+static int start_deadline_timer(struct sched_dl_entity *dl_se, u64 wakeup)
+{
+	struct dl_rq *dl_rq = deadline_rq_of_se(dl_se);
+	struct rq *rq = rq_of_deadline_rq(dl_rq);
+	ktime_t now, act;
+	s64 delta;
+
+	act = ns_to_ktime(wakeup);
+	now = hrtimer_cb_get_time(&dl_se->dl_timer);
+	delta = ktime_to_ns(now) - rq->clock;
+	act = ktime_add_ns(act, delta);
+
+	hrtimer_set_expires(&dl_se->dl_timer, act);
+	hrtimer_start_expires(&dl_se->dl_timer, HRTIMER_MODE_ABS);
+
+	return hrtimer_active(&dl_se->dl_timer);
+}
+
+static enum hrtimer_restart deadline_timer(struct hrtimer *timer)
+{
+	struct sched_dl_entity *dl_se = container_of(timer,
+						     struct sched_dl_entity,
+						     dl_timer);
+	struct task_struct *p = deadline_task_of(dl_se);
+	struct dl_rq *dl_rq = deadline_rq_of_se(dl_se);
+	struct rq *rq = rq_of_deadline_rq(dl_rq);
+
+	spin_lock(&rq->lock);
+
+	/*
+	 * the task might have changed scheduling policy
+	 * through setscheduler_ex, in which case we just do nothing.
+	 */
+	if (!deadline_task(p))
+		goto unlock;
+
+	/*
+	 * the task is no longer enqueued on the SCHED_DEADLINE runqueue,
+	 * and needs to be enqueued back there --with its new deadline--
+	 * only if it is active.
+	 */
+	dl_se->flags &= ~DL_THROTTLED;
+	if (p->se.on_rq) {
+		replenish_deadline_entity(dl_se);
+		enqueue_deadline_entity(dl_se);
+		check_deadline_preempt_curr(p, rq);
+	}
+unlock:
+	spin_unlock(&rq->lock);
+
+	return HRTIMER_NORESTART;
+}
+
+static
+int deadline_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
+{
+	if (dl_se->runtime >= 0 || deadline_se_boosted(dl_se))
+		return 0;
+
+	dequeue_deadline_entity(dl_se);
+	if (!start_deadline_timer(dl_se, dl_se->deadline)) {
+		replenish_deadline_entity(dl_se);
+		enqueue_deadline_entity(dl_se);
+	} else
+		dl_se->flags |= DL_THROTTLED;
+
+	return 1;
+}
+
+static void update_curr_deadline(struct rq *rq)
+{
+	struct task_struct *curr = rq->curr;
+	struct sched_dl_entity *dl_se = &curr->dl;
+	u64 delta_exec;
+
+	if (!deadline_task(curr) || !on_deadline_rq(dl_se))
+		return;
+
+	delta_exec = rq->clock - curr->se.exec_start;
+	if (unlikely((s64)delta_exec < 0))
+		delta_exec = 0;
+
+	schedstat_set(curr->se.exec_max, max(curr->se.exec_max, delta_exec));
+
+	curr->se.sum_exec_runtime += delta_exec;
+	account_group_exec_runtime(curr, delta_exec);
+
+	curr->se.exec_start = rq->clock;
+	cpuacct_charge(curr, delta_exec);
+
+	dl_se->runtime -= delta_exec;
+	if (deadline_runtime_exceeded(rq, dl_se))
+		resched_task(curr);
+}
+
+static void enqueue_deadline_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = deadline_rq_of_se(dl_se);
+	struct rb_node **link = &dl_rq->rb_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct sched_dl_entity *entry;
+	int leftmost = 1;
+
+	BUG_ON(!RB_EMPTY_NODE(&dl_se->rb_node));
+
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct sched_dl_entity, rb_node);
+		if (!deadline_time_before(entry->deadline, dl_se->deadline))
+			link = &parent->rb_left;
+		else {
+			link = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	if (leftmost)
+		dl_rq->rb_leftmost = &dl_se->rb_node;
+
+	rb_link_node(&dl_se->rb_node, parent, link);
+	rb_insert_color(&dl_se->rb_node, &dl_rq->rb_root);
+
+	dl_rq->dl_nr_running++;
+}
+
+static void dequeue_deadline_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = deadline_rq_of_se(dl_se);
+
+	if (RB_EMPTY_NODE(&dl_se->rb_node))
+		return;
+
+	if (dl_rq->rb_leftmost == &dl_se->rb_node) {
+		struct rb_node *next_node;
+		struct sched_dl_entity *next;
+
+		next_node = rb_next(&dl_se->rb_node);
+		dl_rq->rb_leftmost = next_node;
+
+		if (next_node)
+			next = rb_entry(next_node, struct sched_dl_entity,
+					rb_node);
+	}
+
+	rb_erase(&dl_se->rb_node, &dl_rq->rb_root);
+	RB_CLEAR_NODE(&dl_se->rb_node);
+
+	dl_rq->dl_nr_running--;
+}
+
+static void check_preempt_curr_deadline(struct rq *rq, struct task_struct *p,
+				   int sync)
+{
+	if (deadline_task(p) &&
+	    deadline_time_before(p->dl.deadline, rq->curr->dl.deadline))
+		resched_task(rq->curr);
+}
+
+/*
+ * there are a few cases where it is important to check if a SCHED_DEADLINE
+ * task p should preempt the current task of a runqueue (e.g., inside the
+ * replenishment timer code).
+ */
+static void check_deadline_preempt_curr(struct task_struct *p, struct rq *rq)
+{
+	if (!deadline_task(rq->curr) ||
+	    deadline_time_before(p->dl.deadline, rq->curr->dl.deadline))
+		resched_task(rq->curr);
+}
+
+static void
+enqueue_task_deadline(struct rq *rq, struct task_struct *p, int wakeup)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+
+	BUG_ON(on_deadline_rq(dl_se));
+
+	/*
+	 * Only enqueue entities with some remaining runtime.
+	 */
+	if (dl_se->flags & DL_THROTTLED)
+		return;
+
+	update_deadline_entity(dl_se);
+	enqueue_deadline_entity(dl_se);
+}
+
+static void
+dequeue_task_deadline(struct rq *rq, struct task_struct *p, int sleep)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+
+	if (!on_deadline_rq(dl_se))
+		return;
+
+	update_curr_deadline(rq);
+	dequeue_deadline_entity(dl_se);
+}
+
+static void yield_task_deadline(struct rq *rq)
+{
+}
+
+#ifdef CONFIG_SCHED_HRTICK
+static void start_hrtick_deadline(struct rq *rq, struct task_struct *p)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+	s64 delta;
+
+	delta = dl_se->sched_runtime - dl_se->runtime;
+
+	if (delta > 10000)
+		hrtick_start(rq, delta);
+}
+#else
+static void start_hrtick_deadline(struct rq *rq, struct task_struct *p)
+{
+}
+#endif
+
+static struct sched_dl_entity *pick_next_deadline_entity(struct rq *rq,
+							 struct dl_rq *dl_rq)
+{
+	struct rb_node *left = dl_rq->rb_leftmost;
+
+	if (!left)
+		return NULL;
+
+	return rb_entry(left, struct sched_dl_entity, rb_node);
+}
+
+struct task_struct *pick_next_task_deadline(struct rq *rq)
+{
+	struct sched_dl_entity *dl_se;
+	struct task_struct *p;
+	struct dl_rq *dl_rq;
+
+	dl_rq = &rq->dl;
+
+	if (likely(!dl_rq->dl_nr_running))
+		return NULL;
+
+	dl_se = pick_next_deadline_entity(rq, dl_rq);
+	BUG_ON(!dl_se);
+
+	p = deadline_task_of(dl_se);
+	p->se.exec_start = rq->clock;
+#ifdef CONFIG_SCHED_HRTICK
+	if (hrtick_enabled(rq))
+		start_hrtick_deadline(rq, p);
+#endif
+	return p;
+}
+
+static void put_prev_task_deadline(struct rq *rq, struct task_struct *p)
+{
+	update_curr_deadline(rq);
+	p->se.exec_start = 0;
+}
+
+static void task_tick_deadline(struct rq *rq, struct task_struct *p, int queued)
+{
+	update_curr_deadline(rq);
+
+#ifdef CONFIG_SCHED_HRTICK
+	if (hrtick_enabled(rq) && queued && p->dl.runtime > 0)
+		start_hrtick_deadline(rq, p);
+#endif
+}
+
+static void set_curr_task_deadline(struct rq *rq)
+{
+	struct task_struct *p = rq->curr;
+
+	p->se.exec_start = rq->clock;
+}
+
+static void prio_changed_deadline(struct rq *rq, struct task_struct *p,
+			     int oldprio, int running)
+{
+	check_deadline_preempt_curr(p, rq);
+}
+
+static void switched_to_deadline(struct rq *rq, struct task_struct *p,
+			    int running)
+{
+	check_deadline_preempt_curr(p, rq);
+}
+
+#ifdef CONFIG_SMP
+static int select_task_rq_deadline(struct task_struct *p,
+				   int sd_flag, int flags)
+{
+	return task_cpu(p);
+}
+
+static unsigned long
+load_balance_deadline(struct rq *this_rq, int this_cpu, struct rq *busiest,
+		 unsigned long max_load_move,
+		 struct sched_domain *sd, enum cpu_idle_type idle,
+		 int *all_pinned, int *this_best_prio)
+{
+	/* for now, don't touch SCHED_DEADLINE tasks */
+	return 0;
+}
+
+static int
+move_one_task_deadline(struct rq *this_rq, int this_cpu, struct rq *busiest,
+		  struct sched_domain *sd, enum cpu_idle_type idle)
+{
+	return 0;
+}
+
+static void set_cpus_allowed_deadline(struct task_struct *p,
+				 const struct cpumask *new_mask)
+{
+	int weight = cpumask_weight(new_mask);
+
+	BUG_ON(!deadline_task(p));
+
+	cpumask_copy(&p->cpus_allowed, new_mask);
+	p->dl.nr_cpus_allowed = weight;
+}
+#endif
+
+static const struct sched_class deadline_sched_class = {
+	.next			= &rt_sched_class,
+	.enqueue_task		= enqueue_task_deadline,
+	.dequeue_task		= dequeue_task_deadline,
+	.yield_task		= yield_task_deadline,
+
+	.check_preempt_curr	= check_preempt_curr_deadline,
+
+	.pick_next_task		= pick_next_task_deadline,
+	.put_prev_task		= put_prev_task_deadline,
+
+#ifdef CONFIG_SMP
+	.select_task_rq		= select_task_rq_deadline,
+
+	.load_balance           = load_balance_deadline,
+	.move_one_task		= move_one_task_deadline,
+	.set_cpus_allowed       = set_cpus_allowed_deadline,
+#endif
+
+	.set_curr_task		= set_curr_task_deadline,
+	.task_tick		= task_tick_deadline,
+
+	.prio_changed           = prio_changed_deadline,
+	.switched_to		= switched_to_deadline,
+};
+
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 4e777b4..8144cb4 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1571,7 +1571,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 
 	update_curr(cfs_rq);
 
-	if (unlikely(rt_prio(p->prio))) {
+	if (unlikely(deadline_task(p) || rt_prio(p->prio))) {
 		resched_task(curr);
 		return;
 	}
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index a4d790c..65cef57 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1004,7 +1004,7 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
  */
 static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int flags)
 {
-	if (p->prio < rq->curr->prio) {
+	if (deadline_task(p) || p->prio < rq->curr->prio) {
 		resched_task(rq->curr);
 		return;
 	}
-- 
1.6.0.4

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [RFC 0/12][PATCH] SCHED_DEADLINE: fork and terminate task logic
  2009-10-16 15:35 [RFC 0/12][PATCH] SCHED_DEADLINE (new version of SCHED_EDF) Raistlin
  2009-10-16 15:38 ` [RFC 1/12][PATCH] Extended scheduling parameters structure added Raistlin
  2009-10-16 15:40 ` [RFC 0/12][PATCH] SCHED_DEADLINE: core of the scheduling class Raistlin
@ 2009-10-16 15:41 ` Raistlin
  2009-12-29 15:20   ` Peter Zijlstra
  2009-10-16 15:41 ` [RFC 0/12][PATCH] SCHED_DEADLINE: added sched_*_ex syscalls Raistlin
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 45+ messages in thread
From: Raistlin @ 2009-10-16 15:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 3183 bytes --]

This commit adds the code that makes it possible for a SCHED_DEADLINE task
to fork a child and to terminate correctly.

The child of a SCHED_DEADLINE task is (if !reset_on_fork) SCHED_DEADLINE as
well, but it has no bandwidth, and thus it cannot even start running.
To make it run, some other task (e.g., the parent) has to provide it with
valid SCHED_DEADLINE parameters (a sketch of this is below).

Actually, this is one of the simplest alternatives we have here, but it might
not be the best one. Therefore, the discussion on what the ``most natural''
behaviour would be is still open, and comments are welcome.
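
As a usage sketch (assuming the patched headers and the sched_setscheduler_ex
syscall added later in this series; do_child_work() is just a placeholder),
the intended flow for a SCHED_DEADLINE parent would be something like:

	pid_t child = fork();

	if (child == 0) {
		/*
		 * the child inherited SCHED_DEADLINE but has zero bandwidth,
		 * so it does not run until someone gives it valid parameters.
		 */
		do_child_work();
	} else if (child > 0) {
		struct sched_param_ex p = {
			.sched_runtime  = { .tv_nsec =  10000000 },	/*  10 ms */
			.sched_deadline = { .tv_nsec = 100000000 },	/* 100 ms */
			.sched_period   = { .tv_nsec = 100000000 },
		};

		/* the parent hands the child a reservation: now it can run */
		syscall(__NR_sched_setscheduler_ex, child, SCHED_DEADLINE, &p);
	}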

Signed-off-by: Raistlin <raistlin@linux.it>
---
 kernel/sched.c          |   18 +++++++++++++++++-
 kernel/sched_deadline.c |   20 ++++++++++++++++++++
 2 files changed, 37 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index adf1414..243066e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2561,8 +2561,20 @@ void sched_fork(struct task_struct *p, int clone_flags)
 	 * Make sure we do not leak PI boosting priority to the child.
 	 */
 	p->prio = current->normal_prio;
+	if (deadline_task(p)) {
+		p->sched_class = &deadline_sched_class;
 
-	if (!rt_prio(p->prio))
+		/*
+		 * the child will be SCHED_DEADLINE, but with zero bandwidth.
+		 * The parent (or some other task) must call setscheduler_ex
+		 * on it, or it won't ever start.
+		 */
+		init_deadline_task(p);
+		p->dl.flags &= ~DL_NEW;
+		p->dl.flags |= DL_THROTTLED;
+	} else if (rt_prio(p->prio))
+		p->sched_class = &rt_sched_class;
+	else
 		p->sched_class = &fair_sched_class;
 
 #ifdef CONFIG_SMP
@@ -2744,6 +2756,10 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	if (mm)
 		mmdrop(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {
+		/* a deadline task is dying: stop the bandwidth timer */
+		if (deadline_task(prev))
+			hrtimer_cancel(&prev->dl.dl_timer);
+
 		/*
 		 * Remove function-return probe instances associated with this
 		 * task and put them back on the free list.
diff --git a/kernel/sched_deadline.c b/kernel/sched_deadline.c
index 5430c48..b4be178 100644
--- a/kernel/sched_deadline.c
+++ b/kernel/sched_deadline.c
@@ -213,6 +213,26 @@ unlock:
 	return HRTIMER_NORESTART;
 }
 
+static void init_deadline_timer(struct hrtimer *timer)
+{
+	if (hrtimer_active(timer)) {
+		hrtimer_try_to_cancel(timer);
+		return;
+	}
+
+	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	timer->function = deadline_timer;
+}
+
+static void init_deadline_task(struct task_struct *p)
+{
+	RB_CLEAR_NODE(&p->dl.rb_node);
+	init_deadline_timer(&p->dl.dl_timer);
+	p->dl.sched_runtime = p->dl.runtime = 0;
+	p->dl.sched_deadline = p->dl.deadline = 0;
+	p->dl.flags = p->dl.bw = 0;
+}
+
 static
 int deadline_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
 {
-- 
1.6.0.4

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [RFC 0/12][PATCH] SCHED_DEADLINE: added sched_*_ex syscalls
  2009-10-16 15:35 [RFC 0/12][PATCH] SCHED_DEADLINE (new version of SCHED_EDF) Raistlin
                   ` (2 preceding siblings ...)
  2009-10-16 15:41 ` [RFC 0/12][PATCH] SCHED_DEADLINE: fork and terminate task logic Raistlin
@ 2009-10-16 15:41 ` Raistlin
  2009-10-16 15:42 ` [RFC 0/12][PATCH] SCHED_DEADLINE: added sched-debug support Raistlin
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 45+ messages in thread
From: Raistlin @ 2009-10-16 15:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 14094 bytes --]

This commit adds the new syscalls needed to set/get the parameters of the
SCHED_DEADLINE scheduling policy. As can be expected, they all deal with
sched_param_ex.

The new syscalls are:
 * sched_setscheduler_ex,
 * sched_setparam_ex,
 * sched_getparam_ex.

They have been added to x86, x86-64 and ARM only for now, since these are
the only architectures we are able to test... But adding the bits needed to
support other archs is more than straightforward...
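
Until the C library grows wrappers, userspace can reach the new calls through
syscall(2). A minimal sketch follows; the structure mirrors the one added to
include/linux/sched.h by the first patch of this series, and the fallback
__NR_* values are the x86-64 numbers assigned here (other archs must use their
own numbers from the patched unistd.h):

#include <unistd.h>
#include <time.h>
#include <sys/types.h>
#include <sys/syscall.h>

/* mirror of the structure added to include/linux/sched.h by this series */
struct sched_param_ex {
	int sched_priority;
	struct timespec sched_runtime;
	struct timespec sched_deadline;
	struct timespec sched_period;
	int sched_flags;
};

#ifndef __NR_sched_setscheduler_ex
#define __NR_sched_setscheduler_ex	299	/* x86-64 numbers from this patch */
#define __NR_sched_setparam_ex		300
#define __NR_sched_getparam_ex		301
#endif

static int sched_setscheduler_ex(pid_t pid, int policy,
				 struct sched_param_ex *param_ex)
{
	return syscall(__NR_sched_setscheduler_ex, pid, policy, param_ex);
}

static int sched_getparam_ex(pid_t pid, struct sched_param_ex *param_ex)
{
	return syscall(__NR_sched_getparam_ex, pid, param_ex);
}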

Signed-off-by: Raistlin <raistlin@linux.it>
---
 arch/arm/include/asm/unistd.h      |    3 +
 arch/arm/kernel/calls.S            |    3 +
 arch/x86/ia32/ia32entry.S          |    3 +
 arch/x86/include/asm/unistd_32.h   |    5 +-
 arch/x86/include/asm/unistd_64.h   |    6 ++
 arch/x86/kernel/syscall_table_32.S |    3 +
 include/linux/syscalls.h           |    7 ++
 kernel/sched.c                     |  168 +++++++++++++++++++++++++++++++++---
 8 files changed, 185 insertions(+), 13 deletions(-)

diff --git a/arch/arm/include/asm/unistd.h b/arch/arm/include/asm/unistd.h
index 7020217..09b927e 100644
--- a/arch/arm/include/asm/unistd.h
+++ b/arch/arm/include/asm/unistd.h
@@ -391,6 +391,9 @@
 #define __NR_pwritev			(__NR_SYSCALL_BASE+362)
 #define __NR_rt_tgsigqueueinfo		(__NR_SYSCALL_BASE+363)
 #define __NR_perf_event_open		(__NR_SYSCALL_BASE+364)
+#define __NR_sched_setscheduler_ex	(__NR_SYSCALL_BASE+365)
+#define __NR_sched_setparam_ex		(__NR_SYSCALL_BASE+366)
+#define __NR_sched_getparam_ex		(__NR_SYSCALL_BASE+367)
 
 /*
  * The following SWIs are ARM private.
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index fafce1b..42ad362 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -374,6 +374,9 @@
 		CALL(sys_pwritev)
 		CALL(sys_rt_tgsigqueueinfo)
 		CALL(sys_perf_event_open)
+/* 365 */	CALL(sys_sched_setscheduler_ex)
+		CALL(sys_sched_setparam_ex)
+		CALL(sys_sched_getparam_ex)
 #ifndef syscalls_counted
 .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
 #define syscalls_counted
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 1733f9f..3d04691 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -842,4 +842,7 @@ ia32_sys_call_table:
 	.quad compat_sys_pwritev
 	.quad compat_sys_rt_tgsigqueueinfo	/* 335 */
 	.quad sys_perf_event_open
+	.quad sys_sched_setscheduler_ex
+	.quad sys_sched_setparam_ex
+	.quad sys_sched_getparam_ex
 ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6fb3c20..3928c04 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -342,10 +342,13 @@
 #define __NR_pwritev		334
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_event_open	336
+#define __NR_sched_setscheduler_ex	337
+#define __NR_sched_setparam_ex		338
+#define __NR_sched_getparam_ex		339
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 337
+#define NR_syscalls 340
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 8d3ad0a..84b0743 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -661,6 +661,12 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
 __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
 #define __NR_perf_event_open			298
 __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
+#define __NR_sched_setscheduler_ex		299
+__SYSCALL(__NR_sched_setscheduler_ex, sys_sched_setscheduler_ex)
+#define __NR_sched_setparam_ex			300
+__SYSCALL(__NR_sched_setparam_ex, sys_sched_setparam_ex)
+#define __NR_sched_getparam_ex			301
+__SYSCALL(__NR_sched_getparam_ex, sys_sched_getparam_ex)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 0157cd2..38f056c 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -336,3 +336,6 @@ ENTRY(sys_call_table)
 	.long sys_pwritev
 	.long sys_rt_tgsigqueueinfo	/* 335 */
 	.long sys_perf_event_open
+	.long sys_sched_setscheduler_ex
+	.long sys_sched_setparam_ex
+	.long sys_sched_getparam_ex
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a990ace..dad0b33 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -33,6 +33,7 @@ struct pollfd;
 struct rlimit;
 struct rusage;
 struct sched_param;
+struct sched_param_ex;
 struct semaphore;
 struct sembuf;
 struct shmid_ds;
@@ -390,11 +391,17 @@ asmlinkage long sys_clock_nanosleep(clockid_t which_clock, int flags,
 asmlinkage long sys_nice(int increment);
 asmlinkage long sys_sched_setscheduler(pid_t pid, int policy,
 					struct sched_param __user *param);
+asmlinkage long sys_sched_setscheduler_ex(pid_t pid, int policy,
+					struct sched_param_ex __user *param);
 asmlinkage long sys_sched_setparam(pid_t pid,
 					struct sched_param __user *param);
+asmlinkage long sys_sched_setparam_ex(pid_t pid,
+					struct sched_param_ex __user *param);
 asmlinkage long sys_sched_getscheduler(pid_t pid);
 asmlinkage long sys_sched_getparam(pid_t pid,
 					struct sched_param __user *param);
+asmlinkage long sys_sched_getparam_ex(pid_t pid,
+					struct sched_param_ex __user *param);
 asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
 					unsigned long __user *user_mask_ptr);
 asmlinkage long sys_sched_getaffinity(pid_t pid, unsigned int len,
diff --git a/kernel/sched.c b/kernel/sched.c
index 243066e..2c974fd 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2598,6 +2598,14 @@ void sched_fork(struct task_struct *p, int clone_flags)
 	put_cpu();
 }
 
+static unsigned long to_ratio(u64 period, u64 runtime)
+{
+	if (runtime == RUNTIME_INF)
+		return 1ULL << 20;
+
+	return div64_u64(runtime << 20, period);
+}
+
 /*
  * wake_up_new_task - wake up a newly created task for the first time.
  *
@@ -6192,6 +6200,9 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
 	case SCHED_RR:
 		p->sched_class = &rt_sched_class;
 		break;
+	case SCHED_DEADLINE:
+		p->sched_class = &deadline_sched_class;
+		break;
 	}
 
 	p->rt_priority = prio;
@@ -6202,6 +6213,28 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
 }
 
 /*
+ * initialize all the fields of the deadline scheduling entity.
+ * The absolute deadline and the actual task runtime will be set at the
+ * activation.
+ */
+static void
+__setscheduler_ex(struct rq *rq, struct task_struct *p,
+		  struct sched_param_ex *param_ex)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+
+	init_deadline_task(p);
+	dl_se->flags |= DL_NEW;
+	dl_se->flags &= ~DL_THROTTLED;
+
+	dl_se->flags = param_ex->sched_flags;
+	dl_se->sched_runtime = timespec_to_ns(&param_ex->sched_runtime);
+	dl_se->sched_deadline = timespec_to_ns(&param_ex->sched_deadline);
+	dl_se->sched_period = timespec_to_ns(&param_ex->sched_period);
+	dl_se->bw = to_ratio(dl_se->sched_deadline, dl_se->sched_runtime);
+}
+
+/*
  * check the target process has a UID that matches the current process's
  */
 static bool check_same_owner(struct task_struct *p)
@@ -6218,7 +6251,9 @@ static bool check_same_owner(struct task_struct *p)
 }
 
 static int __sched_setscheduler(struct task_struct *p, int policy,
-				struct sched_param *param, bool user)
+				struct sched_param *param,
+				struct sched_param_ex *param_ex,
+				bool user)
 {
 	int retval, oldprio, oldpolicy = -1, on_rq, running;
 	unsigned long flags;
@@ -6237,7 +6272,8 @@ recheck:
 		reset_on_fork = !!(policy & SCHED_RESET_ON_FORK);
 		policy &= ~SCHED_RESET_ON_FORK;
 
-		if (policy != SCHED_FIFO && policy != SCHED_RR &&
+		if (policy != SCHED_DEADLINE &&
+				policy != SCHED_FIFO && policy != SCHED_RR &&
 				policy != SCHED_NORMAL && policy != SCHED_BATCH &&
 				policy != SCHED_IDLE)
 			return -EINVAL;
@@ -6254,6 +6290,17 @@ recheck:
 		return -EINVAL;
 	if (rt_policy(policy) != (param->sched_priority != 0))
 		return -EINVAL;
+	/*
+	 * Validate the parameters for a SCHED_DEADLINE task.
+	 * We need the relative deadline to be non-zero and greater
+	 * than or equal to the runtime.
+	 */
+	if (deadline_policy(policy) && (!param_ex ||
+	    param_ex->sched_priority != 0 ||
+	    timespec_to_ns(&param_ex->sched_deadline) == 0 ||
+	    timespec_to_ns(&param_ex->sched_deadline) <
+	    timespec_to_ns(&param_ex->sched_runtime)))
+		return -EINVAL;
 
 	/*
 	 * Allow unprivileged RT tasks to decrease priority:
@@ -6336,6 +6383,8 @@ recheck:
 	p->sched_reset_on_fork = reset_on_fork;
 
 	oldprio = p->prio;
+	if (deadline_policy(policy))
+		__setscheduler_ex(rq, p, param_ex);
 	__setscheduler(rq, p, policy, param->sched_priority);
 
 	if (running)
@@ -6364,10 +6413,17 @@ recheck:
 int sched_setscheduler(struct task_struct *p, int policy,
 		       struct sched_param *param)
 {
-	return __sched_setscheduler(p, policy, param, true);
+	return __sched_setscheduler(p, policy, param, NULL, true);
 }
 EXPORT_SYMBOL_GPL(sched_setscheduler);
 
+int sched_setscheduler_ex(struct task_struct *p, int policy,
+			  struct sched_param *param,
+			  struct sched_param_ex *param_ex)
+{
+	return __sched_setscheduler(p, policy, param, param_ex, true);
+}
+
 /**
  * sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace.
  * @p: the task in question.
@@ -6382,7 +6438,7 @@ EXPORT_SYMBOL_GPL(sched_setscheduler);
 int sched_setscheduler_nocheck(struct task_struct *p, int policy,
 			       struct sched_param *param)
 {
-	return __sched_setscheduler(p, policy, param, false);
+	return __sched_setscheduler(p, policy, param, NULL, false);
 }
 
 static int
@@ -6407,6 +6463,33 @@ do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)
 	return retval;
 }
 
+static int
+do_sched_setscheduler_ex(pid_t pid, int policy,
+			 struct sched_param_ex __user *param_ex)
+{
+	struct sched_param lparam;
+	struct sched_param_ex lparam_ex;
+	struct task_struct *p;
+	int retval;
+
+	if (!param_ex || pid < 0)
+		return -EINVAL;
+	if (copy_from_user(&lparam_ex, param_ex,
+	    sizeof(struct sched_param_ex)))
+		return -EFAULT;
+
+	rcu_read_lock();
+	retval = -ESRCH;
+	p = find_process_by_pid(pid);
+	if (p != NULL) {
+		lparam.sched_priority = lparam_ex.sched_priority;
+		retval = sched_setscheduler_ex(p, policy, &lparam, &lparam_ex);
+	}
+	rcu_read_unlock();
+
+	return retval;
+}
+
 /**
  * sys_sched_setscheduler - set/change the scheduler policy and RT priority
  * @pid: the pid in question.
@@ -6424,6 +6507,21 @@ SYSCALL_DEFINE3(sched_setscheduler, pid_t, pid, int, policy,
 }
 
 /**
+ * sys_sched_setscheduler_ex - set/change the scheduler policy to SCHED_DEADLINE
+ * @pid: the pid in question.
+ * @policy: new policy (should be SCHED_DEADLINE).
+ * @param_ex: structure containing the extended deadline parameters.
+ */
+SYSCALL_DEFINE3(sched_setscheduler_ex, pid_t, pid, int, policy,
+		struct sched_param_ex __user *, param_ex)
+{
+	if (policy < 0)
+		return -EINVAL;
+
+	return do_sched_setscheduler_ex(pid, policy, param_ex);
+}
+
+/**
  * sys_sched_setparam - set/change the RT priority of a thread
  * @pid: the pid in question.
  * @param: structure containing the new RT priority.
@@ -6434,6 +6532,17 @@ SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
 }
 
 /**
+ * sys_sched_setparam_ex - set/change the SCHED_DEADLINE parameters of a thread
+ * @pid: the pid in question.
+ * @param_ex: structure containing the new parameters (deadline, runtime, etc.).
+ */
+SYSCALL_DEFINE2(sched_setparam_ex, pid_t, pid,
+		struct sched_param_ex __user *, param_ex)
+{
+	return do_sched_setscheduler_ex(pid, -1, param_ex);
+}
+
+/**
  * sys_sched_getscheduler - get the policy (scheduling class) of a thread
  * @pid: the pid in question.
  */
@@ -6497,6 +6606,49 @@ out_unlock:
 	return retval;
 }
 
+/**
+ * sys_sched_getparam_ex - get the SCHED_DEADLINE parameters of a thread
+ * @pid: the pid in question.
+ * @param_ex: structure in which the current parameters (deadline, runtime, etc.) are returned.
+ */
+SYSCALL_DEFINE2(sched_getparam_ex, pid_t, pid,
+		struct sched_param_ex __user *, param_ex)
+{
+	struct sched_param_ex lp;
+	struct task_struct *p;
+	int retval;
+
+	if (!param_ex || pid < 0)
+		return -EINVAL;
+
+	read_lock(&tasklist_lock);
+	p = find_process_by_pid(pid);
+	retval = -ESRCH;
+	if (!p)
+		goto out_unlock;
+
+	retval = security_task_getscheduler(p);
+	if (retval)
+		goto out_unlock;
+
+	lp.sched_priority = p->rt_priority;
+	lp.sched_runtime = ns_to_timespec(p->dl.sched_runtime);
+	lp.sched_deadline = ns_to_timespec(p->dl.sched_deadline);
+	read_unlock(&tasklist_lock);
+
+	/*
+	 * This one might sleep, we cannot do it with a spinlock held ...
+	 */
+	retval = copy_to_user(param_ex, &lp, sizeof(*param_ex)) ? -EFAULT : 0;
+
+	return retval;
+
+out_unlock:
+	read_unlock(&tasklist_lock);
+	return retval;
+
+}
+
 long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 {
 	cpumask_var_t cpus_allowed, new_mask;
@@ -10112,14 +10264,6 @@ unsigned long sched_group_shares(struct task_group *tg)
  */
 static DEFINE_MUTEX(rt_constraints_mutex);
 
-static unsigned long to_ratio(u64 period, u64 runtime)
-{
-	if (runtime == RUNTIME_INF)
-		return 1ULL << 20;
-
-	return div64_u64(runtime << 20, period);
-}
-
 /* Must be called with tasklist_lock held */
 static inline int tg_has_rt_tasks(struct task_group *tg)
 {
-- 
1.6.0.4

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [RFC 0/12][PATCH] SCHED_DEADLINE: added sched-debug support.
  2009-10-16 15:35 [RFC 0/12][PATCH] SCHED_DEADLINE (new version of SCHED_EDF) Raistlin
                   ` (3 preceding siblings ...)
  2009-10-16 15:41 ` [RFC 0/12][PATCH] SCHED_DEADLINE: added sched_*_ex syscalls Raistlin
@ 2009-10-16 15:42 ` Raistlin
  2009-10-16 15:43 ` [RFC 6/12][PATCH] SCHED_DEADLINE: added scheduling latency tracer Raistlin
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 45+ messages in thread
From: Raistlin @ 2009-10-16 15:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 3945 bytes --]

This commit adds debugging output support for the SCHED_DEADLINE runqueues.

Signed-off-by: Raistlin <raistlin@linux.it>
---
 include/linux/sched.h   |    6 ++++++
 kernel/sched_deadline.c |   21 +++++++++++++++++++++
 kernel/sched_debug.c    |   33 +++++++++++++++++++++++++++++++++
 3 files changed, 60 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 20e1a6a..fac928a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -167,6 +167,8 @@ extern void proc_sched_show_task(struct task_struct *p, struct seq_file *m);
 extern void proc_sched_set_task(struct task_struct *p);
 extern void
 print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq);
+extern void
+print_deadline_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq);
 #else
 static inline void
 proc_sched_show_task(struct task_struct *p, struct seq_file *m)
@@ -179,6 +181,10 @@ static inline void
 print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 {
 }
+static inline void
+print_deadline_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq)
+{
+}
 #endif
 
 extern unsigned long long time_sync_thresh;
diff --git a/kernel/sched_deadline.c b/kernel/sched_deadline.c
index b4be178..f2c1b6e 100644
--- a/kernel/sched_deadline.c
+++ b/kernel/sched_deadline.c
@@ -400,6 +400,16 @@ static void start_hrtick_deadline(struct rq *rq, struct task_struct *p)
 }
 #endif
 
+static struct sched_dl_entity *__pick_deadline_last_entity(struct dl_rq *dl_rq)
+{
+	struct rb_node *last = rb_last(&dl_rq->rb_root);
+
+	if (!last)
+		return NULL;
+
+	return rb_entry(last, struct sched_dl_entity, rb_node);
+}
+
 static struct sched_dl_entity *pick_next_deadline_entity(struct rq *rq,
 							 struct dl_rq *dl_rq)
 {
@@ -531,3 +541,14 @@ static const struct sched_class deadline_sched_class = {
 	.switched_to		= switched_to_deadline,
 };
 
+#ifdef CONFIG_SCHED_DEBUG
+static void print_deadline_stats(struct seq_file *m, int cpu)
+{
+	struct dl_rq *dl_rq = &cpu_rq(cpu)->dl;
+
+	rcu_read_lock();
+	for_each_leaf_deadline_rq(dl_rq, cpu_rq(cpu))
+		print_deadline_rq(m, cpu, dl_rq);
+	rcu_read_unlock();
+}
+#endif /* CONFIG_SCHED_DEBUG */
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index efb8440..809ba55 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -246,6 +246,38 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
 #undef P
 }
 
+void print_deadline_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq)
+{
+	s64 min_deadline = -1, max_deadline = -1;
+	struct rq *rq = &per_cpu(runqueues, cpu);
+	struct sched_dl_entity *last;
+	unsigned long flags;
+
+	SEQ_printf(m, "\ndl_rq[%d]:\n", cpu);
+
+	spin_lock_irqsave(&rq->lock, flags);
+	if (dl_rq->rb_leftmost)
+		min_deadline = (rb_entry(dl_rq->rb_leftmost,
+					 struct sched_dl_entity,
+					 rb_node))->deadline;
+	last = __pick_deadline_last_entity(dl_rq);
+	if (last)
+		max_deadline = last->deadline;
+	spin_unlock_irqrestore(&rq->lock, flags);
+
+#define P(x) \
+	SEQ_printf(m, "  .%-30s: %Ld\n", #x, (long long)(x))
+#define PN(x) \
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", #x, SPLIT_NS(x))
+
+	P(dl_rq->dl_nr_running);
+	PN(min_deadline);
+	PN(max_deadline);
+
+#undef PN
+#undef P
+}
+
 static void print_cpu(struct seq_file *m, int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -301,6 +333,7 @@ static void print_cpu(struct seq_file *m, int cpu)
 #endif
 	print_cfs_stats(m, cpu);
 	print_rt_stats(m, cpu);
+	print_deadline_stats(m, cpu);
 
 	print_rq(m, rq, cpu);
 }
-- 
1.6.0.4

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC 6/12][PATCH] SCHED_DEADLINE: added scheduling latency tracer
  2009-10-16 15:35 [RFC 0/12][PATCH] SCHED_DEADLINE (new version of SCHED_EDF) Raistlin
                   ` (4 preceding siblings ...)
  2009-10-16 15:42 ` [RFC 0/12][PATCH] SCHED_DEADLINE: added sched-debug support Raistlin
@ 2009-10-16 15:43 ` Raistlin
  2009-10-16 15:44 ` [RFC 7/12][PATCH] SCHED_DEADLINE: signal delivery when overrunning Raistlin
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 45+ messages in thread
From: Raistlin @ 2009-10-16 15:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 2779 bytes --]

This commit adds a new ftrace tracer, 'wakeup_deadline', which traces the
maximum wakeup latency of SCHED_DEADLINE tasks.

Signed-off-by: Raistlin <raistlin@linux.it>
---
 kernel/trace/trace_sched_wakeup.c |   31 +++++++++++++++++++++++++++++++
 1 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
index 26185d7..180948c 100644
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -27,6 +27,7 @@ static int			wakeup_cpu;
 static int			wakeup_current_cpu;
 static unsigned			wakeup_prio = -1;
 static int			wakeup_rt;
+static int			wakeup_deadline;
 
 static raw_spinlock_t wakeup_lock =
 	(raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED;
@@ -214,6 +215,9 @@ probe_wakeup(struct rq *rq, struct task_struct *p, int success)
 	tracing_record_cmdline(p);
 	tracing_record_cmdline(current);
 
+	if (wakeup_deadline && !deadline_task(p))
+		return;
+
 	if ((wakeup_rt && !rt_task(p)) ||
 			p->prio >= wakeup_prio ||
 			p->prio >= current->prio)
@@ -340,16 +344,25 @@ static int __wakeup_tracer_init(struct trace_array *tr)
 
 static int wakeup_tracer_init(struct trace_array *tr)
 {
+	wakeup_deadline = 0;
 	wakeup_rt = 0;
 	return __wakeup_tracer_init(tr);
 }
 
 static int wakeup_rt_tracer_init(struct trace_array *tr)
 {
+	wakeup_deadline = 0;
 	wakeup_rt = 1;
 	return __wakeup_tracer_init(tr);
 }
 
+static int wakeup_deadline_tracer_init(struct trace_array *tr)
+{
+	wakeup_deadline = 1;
+	wakeup_rt = 0;
+	return __wakeup_tracer_init(tr);
+}
+
 static void wakeup_tracer_reset(struct trace_array *tr)
 {
 	stop_wakeup_tracer(tr);
@@ -398,6 +411,20 @@ static struct tracer wakeup_rt_tracer __read_mostly =
 #endif
 };
 
+static struct tracer wakeup_deadline_tracer __read_mostly =
+{
+	.name		= "wakeup_deadline",
+	.init		= wakeup_deadline_tracer_init,
+	.reset		= wakeup_tracer_reset,
+	.start		= wakeup_tracer_start,
+	.stop		= wakeup_tracer_stop,
+	.wait_pipe	= poll_wait_pipe,
+	.print_max	= 1,
+#ifdef CONFIG_FTRACE_SELFTEST
+	.selftest    = trace_selftest_startup_wakeup,
+#endif
+};
+
 __init static int init_wakeup_tracer(void)
 {
 	int ret;
@@ -410,6 +437,10 @@ __init static int init_wakeup_tracer(void)
 	if (ret)
 		return ret;
 
+	ret = register_tracer(&wakeup_deadline_tracer);
+	if (ret)
+		return ret;
+
 	return 0;
 }
 device_initcall(init_wakeup_tracer);
-- 
1.6.0.4


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC 7/12][PATCH] SCHED_DEADLINE: signal delivery when overrunning
  2009-10-16 15:35 [RFC 0/12][PATCH] SCHED_DEADLINE (new version of SCHED_EDF) Raistlin
                   ` (5 preceding siblings ...)
  2009-10-16 15:43 ` [RFC 6/12][PATCH] SCHED_DEADLINE: added scheduling latency tracer Raistlin
@ 2009-10-16 15:44 ` Raistlin
  2009-12-28 14:19   ` Peter Zijlstra
  2009-10-16 15:44 ` [RFC 8/12][PATCH] SCHED_DEADLINE: wait next instance syscall added Raistlin
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 45+ messages in thread
From: Raistlin @ 2009-10-16 15:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 6352 bytes --]

Starting from this commit, the user can ask to receive a SIGXCPU signal
every time the task runtime is overrun or a scheduling deadline is missed.
This is done by means of the sched_flags field already present in
sched_param_ex.

A runtime overrun will be quite common, e.g. due to coarse execution time
accounting, wrong parameter assignment, etc.
A deadline miss --since the deadlines the scheduler sees are ``scheduling
deadlines'', which are not necessarily equal to the task's own deadlines-- is
much less likely, and should only happen in an overloaded system.
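
For reference, a minimal and untested user-space sketch of how a task could
request these notifications follows. The sched_param_ex fields and the
SCHED_SIG_* flags come from this series; the direct syscall() invocation, the
argument order of sched_setscheduler_ex() (see also the size argument added
in patch 12) and the availability of the definitions via the patched headers
are assumptions:

#include <signal.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/sched.h>	/* sched_param_ex, SCHED_SIG_* (patched headers) */

static void xcpu_handler(int sig)
{
	(void)sig;
	/* either a runtime overrun or a scheduling deadline miss occurred */
	write(STDOUT_FILENO, "SIGXCPU\n", 8);
}

static int become_deadline_with_signals(void)
{
	struct sched_param_ex p;

	memset(&p, 0, sizeof(p));
	p.sched_runtime.tv_nsec  =  10 * 1000 * 1000;	/*  10 ms budget   */
	p.sched_deadline.tv_nsec = 100 * 1000 * 1000;	/* 100 ms deadline */
	p.sched_flags = SCHED_SIG_RORUN | SCHED_SIG_DMISS;

	signal(SIGXCPU, xcpu_handler);

	/* hypothetical argument order, pid 0 meaning the calling task */
	return syscall(__NR_sched_setscheduler_ex, 0, SCHED_DEADLINE,
		       sizeof(p), &p);
}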

Signed-off-by: Raistlin <raistlin@linux.it>
---
 include/linux/sched.h     |    5 ++++
 kernel/posix-cpu-timers.c |   52 +++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched_deadline.c   |   18 +++++++++++++++
 3 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index fac928a..16668f9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -95,6 +95,9 @@ struct sched_param {
 
 #include <asm/processor.h>
 
+#define SCHED_SIG_RORUN		0x80000000
+#define SCHED_SIG_DMISS		0x40000000
+
 struct sched_param_ex {
 	int sched_priority;
 	struct timespec sched_runtime;
@@ -1229,6 +1232,8 @@ struct sched_rt_entity {
 #define DL_NEW			0x00000001
 #define DL_THROTTLED		0x00000002
 #define DL_BOOSTED		0x00000004
+#define DL_RORUN		0x00000008
+#define DL_DMISS		0x00000010
 
 struct sched_dl_entity {
 	struct rb_node	rb_node;
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 5c9dc22..4caa5bf 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -1029,8 +1029,28 @@ static void check_thread_timers(struct task_struct *tsk,
 	}
 
 	/*
-	 * Check for the special case thread timers.
+	 * Check for the special case thread timers:
+	 *  - sched_deadline runtime/deadline overrun notification
+	 *  - sched_rt rlimit overrun notification
 	 */
+	if (deadline_task(tsk) && (tsk->dl.flags & SCHED_SIG_RORUN ||
+	    tsk->dl.flags & SCHED_SIG_DMISS)) {
+		if (tsk->dl.flags & SCHED_SIG_RORUN &&
+		    tsk->dl.flags & DL_RORUN) {
+			tsk->dl.flags &= ~DL_RORUN;
+			printk(KERN_INFO "runtime overrun: %s[%d]\n",
+			       tsk->comm, task_pid_nr(tsk));
+			__group_send_sig_info(SIGXCPU, SEND_SIG_PRIV, tsk);
+		}
+		if (tsk->dl.flags & SCHED_SIG_DMISS &&
+		    tsk->dl.flags & DL_DMISS) {
+			tsk->dl.flags &= ~DL_DMISS;
+			printk(KERN_INFO "scheduling deadline miss: %s[%d]\n",
+			       tsk->comm, task_pid_nr(tsk));
+			__group_send_sig_info(SIGXCPU, SEND_SIG_PRIV, tsk);
+		}
+	}
+
 	if (sig->rlim[RLIMIT_RTTIME].rlim_cur != RLIM_INFINITY) {
 		unsigned long hard = sig->rlim[RLIMIT_RTTIME].rlim_max;
 		unsigned long *soft = &sig->rlim[RLIMIT_RTTIME].rlim_cur;
@@ -1129,6 +1149,9 @@ static void check_process_timers(struct task_struct *tsk,
 	if (list_empty(&timers[CPUCLOCK_PROF]) &&
 	    cputime_eq(sig->it[CPUCLOCK_PROF].expires, cputime_zero) &&
 	    sig->rlim[RLIMIT_CPU].rlim_cur == RLIM_INFINITY &&
+	    !(deadline_task(tsk) && ((tsk->dl.flags & SCHED_SIG_RORUN &&
+	    tsk->dl.flags & DL_RORUN) || (tsk->dl.flags & SCHED_SIG_DMISS &&
+	    tsk->dl.flags & DL_DMISS))) &&
 	    list_empty(&timers[CPUCLOCK_VIRT]) &&
 	    cputime_eq(sig->it[CPUCLOCK_VIRT].expires, cputime_zero) &&
 	    list_empty(&timers[CPUCLOCK_SCHED])) {
@@ -1188,8 +1211,28 @@ static void check_process_timers(struct task_struct *tsk,
 	}
 
 	/*
-	 * Check for the special case process timers.
+	 * Check for the special case process timers:
+	 *  - sched_deadline runtime/deadline overrun notification
+	 *  - sched_rt rlimit overrun notification
 	 */
+	if (deadline_task(tsk) && (tsk->dl.flags & SCHED_SIG_RORUN ||
+	    tsk->dl.flags & SCHED_SIG_DMISS)) {
+		if (tsk->dl.flags & SCHED_SIG_RORUN &&
+		    tsk->dl.flags & DL_RORUN) {
+			tsk->dl.flags &= ~DL_RORUN;
+			printk(KERN_INFO "runtime overrun: %s[%d]\n",
+			       tsk->comm, task_pid_nr(tsk));
+			__group_send_sig_info(SIGXCPU, SEND_SIG_PRIV, tsk);
+		}
+		if (tsk->dl.flags & SCHED_SIG_DMISS &&
+		    tsk->dl.flags & DL_DMISS) {
+			tsk->dl.flags &= ~DL_DMISS;
+			printk(KERN_INFO "scheduling deadline miss: %s[%d]\n",
+			       tsk->comm, task_pid_nr(tsk));
+			__group_send_sig_info(SIGXCPU, SEND_SIG_PRIV, tsk);
+		}
+	}
+
 	check_cpu_itimer(tsk, &sig->it[CPUCLOCK_PROF], &prof_expires, ptime,
 			 SIGPROF);
 	check_cpu_itimer(tsk, &sig->it[CPUCLOCK_VIRT], &virt_expires, utime,
@@ -1383,6 +1426,11 @@ static inline int fastpath_timer_check(struct task_struct *tsk)
 			return 1;
 	}
 
+	if (deadline_task(tsk) &&
+	    ((tsk->dl.flags & SCHED_SIG_RORUN && tsk->dl.flags & DL_RORUN) ||
+	    (tsk->dl.flags & SCHED_SIG_DMISS && tsk->dl.flags & DL_DMISS)))
+		return 1;
+
 	return sig->rlim[RLIMIT_CPU].rlim_cur != RLIM_INFINITY;
 }
 
diff --git a/kernel/sched_deadline.c b/kernel/sched_deadline.c
index f2c1b6e..7b57bb0 100644
--- a/kernel/sched_deadline.c
+++ b/kernel/sched_deadline.c
@@ -236,9 +236,27 @@ static void init_deadline_task(struct task_struct *p)
 static
 int deadline_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
 {
+	/*
+	 * if the user asked for that, we have to inform him about
+	 * a (scheduling) deadline miss ...
+	 */
+	if (unlikely(dl_se->flags & SCHED_SIG_DMISS &&
+	    deadline_time_before(dl_se->deadline, rq->clock)))
+		dl_se->flags |= DL_DMISS;
+
 	if (dl_se->runtime >= 0 || deadline_se_boosted(dl_se))
 		return 0;
 
+	/*
+	 * ... and the same applies to runtime overruns.
+	 *
+	 * Note that (hopefully small) runtime overruns are very likely
+	 * to occur, mainly due to accounting resolution, while missing a
+	 * scheduling deadline should happen only on oversubscribed systems.
+	 */
+	if (dl_se->flags & SCHED_SIG_RORUN)
+		dl_se->flags |= DL_RORUN;
+
 	dequeue_deadline_entity(dl_se);
 	if (!start_deadline_timer(dl_se, dl_se->deadline)) {
 		replenish_deadline_entity(dl_se);
-- 
1.6.0.4


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC 8/12][PATCH] SCHED_DEADLINE: wait next instance syscall added.
  2009-10-16 15:35 [RFC 0/12][PATCH] SCHED_DEADLINE (new version of SCHED_EDF) Raistlin
                   ` (6 preceding siblings ...)
  2009-10-16 15:44 ` [RFC 7/12][PATCH] SCHED_DEADLINE: signal delivery when overrunning Raistlin
@ 2009-10-16 15:44 ` Raistlin
  2009-12-28 14:30   ` Peter Zijlstra
  2009-10-16 15:45 ` [RFC 9/12][PATCH] SCHED_DEADLINE: system wide bandwidth management Raistlin
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 45+ messages in thread
From: Raistlin @ 2009-10-16 15:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 8518 bytes --]

This commit introduces another new SCHED_DEADLINE related syscall. It is
called sched_wait_interval() and it has close-to-clock_nanosleep semantics.

However, for SCHED_DEADLINE tasks, it should be the call with which each
job closes its current instance. In fact, in this case, the task is put to
sleep and, when it wakes up, the scheduler is informed that a new job has
arrived, saving the overhead that usually comes with a task activation
to enforce the maximum task bandwidth.
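
For reference, a minimal and untested sketch of the job loop of a periodic
task using the new syscall follows. The syscall itself, its argument order
and __NR_sched_wait_interval come from this patch; the thin wrapper, the
100 ms period and the use of CLOCK_MONOTONIC absolute times are assumptions:

#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

static void do_work(void)
{
	/* one job of the (hypothetical) periodic application goes here */
}

static long sched_wait_interval(int flags, const struct timespec *rqtp,
				struct timespec *rmtp)
{
	return syscall(__NR_sched_wait_interval, flags, rqtp, rmtp);
}

static void periodic_loop(void)
{
	struct timespec next;

	clock_gettime(CLOCK_MONOTONIC, &next);
	for (;;) {
		do_work();
		/* hypothetical 100 ms period */
		next.tv_nsec += 100 * 1000 * 1000;
		if (next.tv_nsec >= 1000000000) {
			next.tv_sec++;
			next.tv_nsec -= 1000000000;
		}
		/* end of instance: sleep until 'next', new job on wakeup */
		sched_wait_interval(TIMER_ABSTIME, &next, NULL);
	}
}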

Signed-off-by: Raistlin <raistlin@linux.it>
---
 arch/arm/include/asm/unistd.h      |    1 +
 arch/arm/kernel/calls.S            |    1 +
 arch/x86/ia32/ia32entry.S          |    1 +
 arch/x86/include/asm/unistd_32.h   |    3 +-
 arch/x86/include/asm/unistd_64.h   |    2 +
 arch/x86/kernel/syscall_table_32.S |    1 +
 include/linux/sched.h              |    1 +
 include/linux/syscalls.h           |    3 ++
 kernel/sched.c                     |   71 ++++++++++++++++++++++++++++++++++++
 kernel/sched_deadline.c            |    9 +++++
 10 files changed, 92 insertions(+), 1 deletions(-)

diff --git a/arch/arm/include/asm/unistd.h b/arch/arm/include/asm/unistd.h
index 09b927e..769ced1 100644
--- a/arch/arm/include/asm/unistd.h
+++ b/arch/arm/include/asm/unistd.h
@@ -394,6 +394,7 @@
 #define __NR_sched_setscheduler_ex	(__NR_SYSCALL_BASE+365)
 #define __NR_sched_setparam_ex		(__NR_SYSCALL_BASE+366)
 #define __NR_sched_getparam_ex		(__NR_SYSCALL_BASE+367)
+#define __NR_sched_wait_interval	(__NR_SYSCALL_BASE+368)
 
 /*
  * The following SWIs are ARM private.
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index 42ad362..8292271 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -377,6 +377,7 @@
 /* 365 */	CALL(sys_sched_setscheduler_ex)
 		CALL(sys_sched_setparam_ex)
 		CALL(sys_sched_getparam_ex)
+		CALL(sys_sched_wait_interval)
 #ifndef syscalls_counted
 .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
 #define syscalls_counted
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 3d04691..9306b80 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -845,4 +845,5 @@ ia32_sys_call_table:
 	.quad sys_sched_setscheduler_ex
 	.quad sys_sched_setparam_ex
 	.quad sys_sched_getparam_ex
+	.quad sys_sched_wait_interval		/* 340 */
 ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 3928c04..63954cb 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -345,10 +345,11 @@
 #define __NR_sched_setscheduler_ex	337
 #define __NR_sched_setparam_ex		338
 #define __NR_sched_getparam_ex		339
+#define __NR_sched_wait_interval	340
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 340
+#define NR_syscalls 341
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 84b0743..63cccc7 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -667,6 +667,8 @@ __SYSCALL(__NR_sched_setscheduler_ex, sys_sched_setscheduler_ex)
 __SYSCALL(__NR_sched_setparam_ex, sys_sched_setparam_ex)
 #define __NR_sched_getparam_ex			301
 __SYSCALL(__NR_sched_getparam_ex, sys_sched_getparam_ex)
+#define __NR_sched_wait_interval		302
+__SYSCALL(__NR_sched_wait_interval, sys_sched_wait_interval)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 38f056c..bd2cc8e 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -339,3 +339,4 @@ ENTRY(sys_call_table)
 	.long sys_sched_setscheduler_ex
 	.long sys_sched_setparam_ex
 	.long sys_sched_getparam_ex
+	.long sys_sched_wait_interval	/* 340 */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 16668f9..478e07c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1088,6 +1088,7 @@ struct sched_class {
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
 	void (*yield_task) (struct rq *rq);
+	void (*wait_interval) (struct task_struct *p);
 
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
 
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index dad0b33..e01f59c 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -407,6 +407,9 @@ asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
 asmlinkage long sys_sched_getaffinity(pid_t pid, unsigned int len,
 					unsigned long __user *user_mask_ptr);
 asmlinkage long sys_sched_yield(void);
+asmlinkage long sys_sched_wait_interval(int flags,
+					const struct timespec __user *rqtp,
+					struct timespec __user *rmtp);
 asmlinkage long sys_sched_get_priority_max(int policy);
 asmlinkage long sys_sched_get_priority_min(int policy);
 asmlinkage long sys_sched_rr_get_interval(pid_t pid,
diff --git a/kernel/sched.c b/kernel/sched.c
index 2c974fd..3c3e834 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6832,6 +6832,77 @@ SYSCALL_DEFINE0(sched_yield)
 	return 0;
 }
 
+/**
+ * sys_sched_wait_interval - sleep according to the scheduling class rules.
+ *
+ * This function makes the task sleep for an absolute or relative interval
+ * (clock_nanosleep semantic). The only difference is that, before stopping
+ * the task, it asks its scheduling class if some class specific logic needs
+ * to be triggered right after the wakeup.
+ */
+SYSCALL_DEFINE3(sched_wait_interval, int, flags,
+		const struct timespec __user *, rqtp,
+		struct timespec __user *, rmtp)
+{
+	struct timespec lrqtp;
+	struct hrtimer_sleeper t;
+	enum hrtimer_mode mode = flags & TIMER_ABSTIME ?
+				 HRTIMER_MODE_ABS : HRTIMER_MODE_REL;
+	int ret = 0;
+
+	if (copy_from_user(&lrqtp, rqtp, sizeof(lrqtp)))
+		return -EFAULT;
+
+	if (!timespec_valid(&lrqtp))
+		return -EINVAL;
+
+	hrtimer_init_on_stack(&t.timer, CLOCK_MONOTONIC, mode);
+	hrtimer_set_expires(&t.timer, timespec_to_ktime(lrqtp));
+	hrtimer_init_sleeper(&t, current);
+	do {
+		set_current_state(TASK_INTERRUPTIBLE);
+		hrtimer_start_expires(&t.timer, mode);
+		if (!hrtimer_active(&t.timer))
+			t.task = NULL;
+
+		if (likely(t.task)) {
+			if (t.task->sched_class->wait_interval)
+				t.task->sched_class->wait_interval(t.task);
+			schedule();
+		}
+
+		hrtimer_cancel(&t.timer);
+		mode = HRTIMER_MODE_ABS;
+	} while (t.task && !signal_pending(current));
+	__set_current_state(TASK_RUNNING);
+
+	if (t.task == NULL)
+		goto out;
+
+	/* Absolute timers don't need this to be restarted. */
+	if (mode == HRTIMER_MODE_ABS) {
+		ret = -ERESTARTNOHAND;
+		goto out;
+	}
+
+	if (rmtp) {
+		ktime_t rmt;
+		struct timespec rmt_ts;
+
+		rmt = hrtimer_expires_remaining(&t.timer);
+		if (rmt.tv64 <= 0)
+			goto out;
+		rmt_ts = ktime_to_timespec(rmt);
+		if (!timespec_valid(&rmt_ts))
+			goto out;
+		if (copy_to_user(rmtp, &rmt_ts, sizeof(*rmtp)))
+			ret = -EFAULT;
+	}
+out:
+	destroy_hrtimer_on_stack(&t.timer);
+	return ret;
+}
+
 static inline int should_resched(void)
 {
 	return need_resched() && !(preempt_count() & PREEMPT_ACTIVE);
diff --git a/kernel/sched_deadline.c b/kernel/sched_deadline.c
index 7b57bb0..82c0192 100644
--- a/kernel/sched_deadline.c
+++ b/kernel/sched_deadline.c
@@ -401,6 +401,14 @@ static void yield_task_deadline(struct rq *rq)
 {
 }
 
+/*
+ * Informs the scheduler that an instance ended.
+ */
+static void wait_interval_deadline(struct task_struct *p)
+{
+	p->dl.flags |= DL_NEW;
+}
+
 #ifdef CONFIG_SCHED_HRTICK
 static void start_hrtick_deadline(struct rq *rq, struct task_struct *p)
 {
@@ -538,6 +546,7 @@ static const struct sched_class deadline_sched_class = {
 	.enqueue_task		= enqueue_task_deadline,
 	.dequeue_task		= dequeue_task_deadline,
 	.yield_task		= yield_task_deadline,
+	.wait_interval		= wait_interval_deadline,
 
 	.check_preempt_curr	= check_preempt_curr_deadline,
 
-- 
1.6.0.4


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC 9/12][PATCH] SCHED_DEADLINE: system wide bandwidth management
  2009-10-16 15:35 [RFC 0/12][PATCH] SCHED_DEADLINE (new version of SCHED_EDF) Raistlin
                   ` (7 preceding siblings ...)
  2009-10-16 15:44 ` [RFC 8/12][PATCH] SCHED_DEADLINE: wait next instance syscall added Raistlin
@ 2009-10-16 15:45 ` Raistlin
  2009-11-06 11:34   ` Dhaval Giani
  2009-12-28 14:44   ` Peter Zijlstra
  2009-10-16 15:46 ` [RFC 10/12][PATCH] SCHED_DEADLINE: group bandwidth management code Raistlin
                   ` (2 subsequent siblings)
  11 siblings, 2 replies; 45+ messages in thread
From: Raistlin @ 2009-10-16 15:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 8513 bytes --]

This commit adds the capability of controlling the maximum, system wide,
CPU bandwidth that is devoted to SCHED_DEADLINE tasks.

This is done by means of two files:
 - /proc/sys/kernel/sched_deadline_runtime_us,
 - /proc/sys/kernel/sched_deadline_period_us.
The ratio runtime/period is the total bandwidth all the SCHED_DEADLINE tasks
can use in the system as a whole.
Trying to create tasks in such a way that they exceed this limit will
fail as soon as the bandwidth cap would be exceeded.

Default value is _zero_ available bandwidth, thus write some numbers into those
files before trying to start any SCHED_DEADLINE task. Setting runtime > period
is allowed (i.e., more than 100% bandwidth available to -deadline tasks),
since it makes perfect sense on SMP systems.
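
For reference, a minimal and untested sketch of how the two files could be
written from a C program (equivalent to echoing the values from a shell);
the 1 s / 400 ms figures are just an example, everything else comes from
this patch:

#include <stdio.h>

static int set_global_deadline_bw(void)
{
	FILE *f;

	f = fopen("/proc/sys/kernel/sched_deadline_period_us", "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", 1000000);		/* 1 s period */
	fclose(f);

	f = fopen("/proc/sys/kernel/sched_deadline_runtime_us", "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", 400000);		/* 400 ms -> 40% -deadline bw */
	fclose(f);

	return 0;
}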

Signed-off-by: Raistlin <raistlin@linux.it>
---
 include/linux/sched.h |    7 ++
 kernel/sched.c        |  149 ++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sysctl.c       |   16 +++++
 3 files changed, 171 insertions(+), 1 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 478e07c..4de72eb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1984,6 +1984,13 @@ int sched_rt_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos);
 
+extern unsigned int sysctl_sched_deadline_period;
+extern int sysctl_sched_deadline_runtime;
+
+int sched_deadline_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos);
+
 extern unsigned int sysctl_sched_compat_yield;
 
 #ifdef CONFIG_RT_MUTEXES
diff --git a/kernel/sched.c b/kernel/sched.c
index 3c3e834..d8b6354 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -870,6 +870,34 @@ static inline u64 global_rt_runtime(void)
 	return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
 }
 
+/*
+ * deadline_runtime/deadline_period is the maximum bandwidth
+ * -deadline tasks can use. It is system wide, i.e., the sum
+ * of the bandwidths of all the tasks, inside every group and
+ * running on any CPU, has to stay below this value!
+ *
+ * default: 0s (= no bandwidth for -deadline tasks)
+ */
+unsigned int sysctl_sched_deadline_period = 0;
+int sysctl_sched_deadline_runtime = 0;
+
+static inline u64 global_deadline_period(void)
+{
+	return (u64)sysctl_sched_deadline_period * NSEC_PER_USEC;
+}
+
+static inline u64 global_deadline_runtime(void)
+{
+	return (u64)sysctl_sched_deadline_runtime * NSEC_PER_USEC;
+}
+
+/*
+ * locking for the system wide deadline bandwidth management.
+ */
+static DEFINE_MUTEX(deadline_constraints_mutex);
+static DEFINE_SPINLOCK(__sysctl_sched_deadline_lock);
+static u64 __sysctl_sched_deadline_total_bw;
+
 #ifndef prepare_arch_switch
 # define prepare_arch_switch(next)	do { } while (0)
 #endif
@@ -2606,6 +2634,66 @@ static unsigned long to_ratio(u64 period, u64 runtime)
 	return div64_u64(runtime << 20, period);
 }
 
+static inline
+void __deadline_clear_task_bw(struct task_struct *p, u64 tsk_bw)
+{
+	__sysctl_sched_deadline_total_bw -= tsk_bw;
+}
+
+static inline
+void __deadline_add_task_bw(struct task_struct *p, u64 tsk_bw)
+{
+	__sysctl_sched_deadline_total_bw += tsk_bw;
+}
+
+/*
+ * update the total allocated bandwidth, if a new -deadline task arrives,
+ * leaves or stays, but modifies its bandwidth.
+ */
+static int __deadline_check_task_bw(struct task_struct *p, int policy,
+				    struct sched_param_ex *param_ex)
+{
+	u64 bw, tsk_bw;
+	int ret = 0;
+
+	spin_lock(&__sysctl_sched_deadline_lock);
+
+	if (sysctl_sched_deadline_period <= 0)
+		goto unlock;
+
+	bw = to_ratio(sysctl_sched_deadline_period,
+		      sysctl_sched_deadline_runtime);
+	if (bw <= 0)
+		goto unlock;
+
+	if (deadline_policy(policy))
+		tsk_bw = to_ratio(timespec_to_ns(&param_ex->sched_deadline),
+				  timespec_to_ns(&param_ex->sched_runtime));
+
+	/*
+	 * Whether a task enters, leaves, or stays -deadline while changing
+	 * its parameters, we need to update the global allocated deadline
+	 * bandwidth accordingly.
+	 */
+	if (task_has_deadline_policy(p) && !deadline_policy(policy)) {
+		__deadline_clear_task_bw(p, p->dl.bw);
+		ret = 1;
+	} else if (task_has_deadline_policy(p) && deadline_policy(policy) &&
+		  bw >= __sysctl_sched_deadline_total_bw - p->dl.bw + tsk_bw) {
+		__deadline_clear_task_bw(p, p->dl.bw);
+		__deadline_add_task_bw(p, tsk_bw);
+		ret = 1;
+	} else if (deadline_policy(policy) && !task_has_deadline_policy(p) &&
+		   bw >= __sysctl_sched_deadline_total_bw + tsk_bw) {
+		__deadline_add_task_bw(p, tsk_bw);
+		ret = 1;
+	}
+unlock:
+	spin_unlock(&__sysctl_sched_deadline_lock);
+
+	return ret;
+}
+
 /*
  * wake_up_new_task - wake up a newly created task for the first time.
  *
@@ -2765,8 +2853,10 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 		mmdrop(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {
 		/* a deadline task is dying: stop the bandwidth timer */
-		if (deadline_task(prev))
+		if (deadline_task(prev)) {
+			__deadline_clear_task_bw(prev, prev->dl.bw);
 			hrtimer_cancel(&prev->dl.dl_timer);
+		}
 
 		/*
 		 * Remove function-return probe instances associated with this
@@ -6372,6 +6462,19 @@ recheck:
 		spin_unlock_irqrestore(&p->pi_lock, flags);
 		goto recheck;
 	}
+	/*
+	 * If changing to SCHED_DEADLINE (or changing the parameters of a
+	 * SCHED_DEADLINE task) we need to check if enough bandwidth is
+	 * available, which might be not true!
+	 */
+	if (deadline_policy(policy) || deadline_task(p)) {
+		if (!__deadline_check_task_bw(p, policy, param_ex)) {
+			__task_rq_unlock(rq);
+			spin_unlock_irqrestore(&p->pi_lock, flags);
+			return -EPERM;
+		}
+	}
+
 	update_rq_clock(rq);
 	on_rq = p->se.on_rq;
 	running = task_current(rq, p);
@@ -10569,6 +10672,25 @@ static int sched_rt_global_constraints(void)
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+static int sched_deadline_global_constraints(void)
+{
+	u64 bw;
+	int ret = 1;
+
+	spin_lock_irq(&__sysctl_sched_deadline_lock);
+	if (sysctl_sched_deadline_period <= 0)
+		bw = 0;
+	else
+		bw = to_ratio(global_deadline_period(),
+			      global_deadline_runtime());
+
+	if (bw < __sysctl_sched_deadline_total_bw)
+		ret = 0;
+	spin_unlock_irq(&__sysctl_sched_deadline_lock);
+
+	return ret;
+}
+
 int sched_rt_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos)
@@ -10599,6 +10721,31 @@ int sched_rt_handler(struct ctl_table *table, int write,
 	return ret;
 }
 
+int sched_deadline_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos)
+{
+	int ret;
+	int old_period, old_runtime;
+
+	mutex_lock(&deadline_constraints_mutex);
+	old_period = sysctl_sched_deadline_period;
+	old_runtime = sysctl_sched_deadline_runtime;
+
+	ret = proc_dointvec(table, write, buffer, lenp, ppos);
+
+	if (!ret && write) {
+		if (!sched_deadline_global_constraints()) {
+			sysctl_sched_deadline_period = old_period;
+			sysctl_sched_deadline_runtime = old_runtime;
+			ret = -EINVAL;
+		}
+	}
+	mutex_unlock(&deadline_constraints_mutex);
+
+	return ret;
+}
+
 #ifdef CONFIG_CGROUP_SCHED
 
 /* return corresponding task_group object of a cgroup */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 0d949c5..34117f9 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -373,6 +373,22 @@ static struct ctl_table kern_table[] = {
 	},
 	{
 		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "sched_deadline_period_us",
+		.data		= &sysctl_sched_deadline_period,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= &sched_deadline_handler,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "sched_deadline_runtime_us",
+		.data		= &sysctl_sched_deadline_runtime,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &sched_deadline_handler,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
 		.procname	= "sched_compat_yield",
 		.data		= &sysctl_sched_compat_yield,
 		.maxlen		= sizeof(unsigned int),
-- 
1.6.0.4


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC 10/12][PATCH] SCHED_DEADLINE: group bandwidth management code
  2009-10-16 15:35 [RFC 0/12][PATCH] SCHED_DEADLINE (new version of SCHED_EDF) Raistlin
                   ` (8 preceding siblings ...)
  2009-10-16 15:45 ` [RFC 9/12][PATCH] SCHED_DEADLINE: system wide bandwidth management Raistlin
@ 2009-10-16 15:46 ` Raistlin
  2009-12-28 14:51   ` Peter Zijlstra
  2009-10-16 15:47 ` [RFC 11/12][PATCH] SCHED_DEADLINE: documentation Raistlin
  2009-10-16 15:48 ` [RFC 12/12][PATCH] SCHED_DEADLINE: modified sched_*_ex API Raistlin
  11 siblings, 1 reply; 45+ messages in thread
From: Raistlin @ 2009-10-16 15:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 19958 bytes --]

CPU control group (cgroup) support for SCHED_DEADLINE is introduced by this commit.

CGroups, if configured, have a SCHED_DEADLINE bandwidth, and it is enforced
that the sum of the bandwidths of entities (tasks and groups) belonging to
a group stays below its own bandwidth.
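
For reference, a minimal and untested sketch of how a group could be given
20% of the CPU through the new cgroup attributes; the /cgroup mount point
follows the documentation patch, and it is assumed that the parent (root)
group has already been assigned enough bandwidth:

#include <stdio.h>

static int cap_group_deadline_bw(const char *grp)	/* e.g. "/cgroup/mygroup" */
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "%s/cpu.deadline_period_us", grp);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", 100000);		/* 100 ms period */
	fclose(f);

	snprintf(path, sizeof(path), "%s/cpu.deadline_runtime_us", grp);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", 20000);		/* 20 ms -> 20% bw for the group */
	fclose(f);

	return 0;
}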

Signed-off-by: Raistlin <raistlin@linux.it>
---
 init/Kconfig            |   14 ++
 kernel/sched.c          |  419 +++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_deadline.c |    4 +
 kernel/sched_debug.c    |    3 +-
 4 files changed, 439 insertions(+), 1 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 09c5c64..17318ca 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -454,6 +454,20 @@ config RT_GROUP_SCHED
 	  realtime bandwidth for them.
 	  See Documentation/scheduler/sched-rt-group.txt for more information.
 
+config DEADLINE_GROUP_SCHED
+	bool "Group scheduling for SCHED_DEADLINE"
+	depends on EXPERIMENTAL
+	depends on GROUP_SCHED
+	depends on CGROUPS
+	depends on !USER_SCHED
+	default n
+	help
+	  This feature lets you explicitly specify, in terms of runtime
+	  and period, the bandwidth of a task control group. This means
+	  tasks (and other groups) can be added to it only up to such
+	  ``bandwidth cap'', which might be useful for avoiding or
+	  controlling oversubscription.
+
 choice
 	depends on GROUP_SCHED
 	prompt "Basis for grouping tasks"
diff --git a/kernel/sched.c b/kernel/sched.c
index d8b6354..a8ebfa2 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -232,6 +232,18 @@ static void destroy_rt_bandwidth(struct rt_bandwidth *rt_b)
 }
 #endif
 
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+struct dl_bandwidth {
+	spinlock_t		lock;
+	/* runtime and period that determine the bandwidth of the group */
+	u64			runtime_max;
+	u64			period;
+	u64			bw;
+	/* accumulator of the total allocated bandwidth in a group */
+	u64			total_bw;
+};
+#endif
+
 /*
  * sched_domains_mutex serializes calls to arch_init_sched_domains,
  * detach_destroy_domains and partition_sched_domains.
@@ -271,6 +283,12 @@ struct task_group {
 	struct rt_bandwidth rt_bandwidth;
 #endif
 
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	struct dl_rq **dl_rq;
+
+	struct dl_bandwidth dl_bandwidth;
+#endif
+
 	struct rcu_head rcu;
 	struct list_head list;
 
@@ -305,6 +323,10 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct cfs_rq, init_tg_cfs_rq);
 static DEFINE_PER_CPU(struct sched_rt_entity, init_sched_rt_entity);
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct rt_rq, init_rt_rq);
 #endif /* CONFIG_RT_GROUP_SCHED */
+
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct dl_rq, init_dl_rq);
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
 #else /* !CONFIG_USER_SCHED */
 #define root_task_group init_task_group
 #endif /* CONFIG_USER_SCHED */
@@ -492,6 +514,10 @@ struct dl_rq {
 	/* runqueue is an rbtree, ordered by deadline */
 	struct rb_root rb_root;
 	struct rb_node *rb_leftmost;
+
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	struct rq *rq;
+#endif
 };
 
 #ifdef CONFIG_SMP
@@ -895,8 +921,10 @@ static inline u64 global_deadline_runtime(void)
  * locking for the system wide deadline bandwidth management.
  */
 static DEFINE_MUTEX(deadline_constraints_mutex);
+#ifndef CONFIG_DEADLINE_GROUP_SCHED
 static DEFINE_SPINLOCK(__sysctl_sched_deadline_lock);
 static u64 __sysctl_sched_deadline_total_bw;
+#endif
 
 #ifndef prepare_arch_switch
 # define prepare_arch_switch(next)	do { } while (0)
@@ -2634,6 +2662,72 @@ static unsigned long to_ratio(u64 period, u64 runtime)
 	return div64_u64(runtime << 20, period);
 }
 
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+static inline
+void __deadline_clear_task_bw(struct task_struct *p, u64 tsk_bw)
+{
+	struct task_group *tg = task_group(p);
+
+	tg->dl_bandwidth.total_bw -= tsk_bw;
+}
+
+static inline
+void __deadline_add_task_bw(struct task_struct *p, u64 tsk_bw)
+{
+	struct task_group *tg = task_group(p);
+
+	tg->dl_bandwidth.total_bw += tsk_bw;
+}
+
+/*
+ * update the total allocated bandwidth for a group, if a new -deadline
+ * task arrives, leaves, or stays but modifies its bandwidth.
+ */
+static int __deadline_check_task_bw(struct task_struct *p, int policy,
+				    struct sched_param_ex *param_ex)
+{
+	struct task_group *tg = task_group(p);
+	u64 bw, tsk_bw = 0;
+	int ret = 0;
+
+	spin_lock(&tg->dl_bandwidth.lock);
+
+	bw = tg->dl_bandwidth.bw;
+	if (bw <= 0)
+		goto unlock;
+
+	if (deadline_policy(policy))
+		tsk_bw = to_ratio(timespec_to_ns(&param_ex->sched_deadline),
+				  timespec_to_ns(&param_ex->sched_runtime));
+
+	/*
+	 * Whether a task enters, leaves, or stays -deadline while changing
+	 * its parameters, we need to update the total allocated bandwidth
+	 * of the control group it belongs to accordingly, provided the new
+	 * state is consistent!
+	 */
+	if (task_has_deadline_policy(p) && !deadline_policy(policy)) {
+		__deadline_clear_task_bw(p, p->dl.bw);
+		ret = 1;
+		goto unlock;
+	} else if (task_has_deadline_policy(p) && deadline_policy(policy) &&
+		   bw >= tg->dl_bandwidth.total_bw - p->dl.bw + tsk_bw) {
+		__deadline_clear_task_bw(p, p->dl.bw);
+		__deadline_add_task_bw(p, tsk_bw);
+		ret = 1;
+		goto unlock;
+	} else if (deadline_policy(policy) && !task_has_deadline_policy(p) &&
+		   bw >= tg->dl_bandwidth.total_bw + tsk_bw) {
+		__deadline_add_task_bw(p, tsk_bw);
+		ret = 1;
+		goto unlock;
+	}
+unlock:
+	spin_unlock(&tg->dl_bandwidth.lock);
+
+	return ret;
+}
+#else /* !CONFIG_DEADLINE_GROUP_SCHED */
 static inline
 void __deadline_clear_task_bw(struct task_struct *p, u64 tsk_bw)
 {
@@ -2693,6 +2787,7 @@ unlock:
 
 	return ret;
 }
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
 
 /*
  * wake_up_new_task - wake up a newly created task for the first time.
@@ -9624,6 +9719,10 @@ static void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
 static void init_deadline_rq(struct dl_rq *dl_rq, struct rq *rq)
 {
 	dl_rq->rb_root = RB_ROOT;
+
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	dl_rq->rq = rq;
+#endif
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -9685,6 +9784,22 @@ static void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
 }
 #endif
 
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+void init_tg_deadline_entry(struct task_group *tg, struct dl_rq *dl_rq,
+			    struct sched_dl_entity *dl_se, int cpu, int add,
+			    struct sched_dl_entity *parent)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	tg->dl_rq[cpu] = &rq->dl;
+
+	spin_lock_init(&tg->dl_bandwidth.lock);
+	tg->dl_bandwidth.runtime_max = 0;
+	tg->dl_bandwidth.period = 0;
+	tg->dl_bandwidth.bw = tg->dl_bandwidth.total_bw = 0;
+}
+#endif
+
 void __init sched_init(void)
 {
 	int i, j;
@@ -9696,6 +9811,9 @@ void __init sched_init(void)
 #ifdef CONFIG_RT_GROUP_SCHED
 	alloc_size += 2 * nr_cpu_ids * sizeof(void **);
 #endif
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	alloc_size += 2 * nr_cpu_ids * sizeof(void **);
+#endif
 #ifdef CONFIG_USER_SCHED
 	alloc_size *= 2;
 #endif
@@ -9739,6 +9857,10 @@ void __init sched_init(void)
 		ptr += nr_cpu_ids * sizeof(void **);
 #endif /* CONFIG_USER_SCHED */
 #endif /* CONFIG_RT_GROUP_SCHED */
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+		init_task_group.dl_rq = (struct dl_rq **)ptr;
+		ptr += nr_cpu_ids * sizeof(void **);
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
 #ifdef CONFIG_CPUMASK_OFFSTACK
 		for_each_possible_cpu(i) {
 			per_cpu(load_balance_tmpmask, i) = (void *)ptr;
@@ -9845,6 +9967,19 @@ void __init sched_init(void)
 #endif
 #endif
 
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+#ifdef CONFIG_CGROUP_SCHED
+		init_tg_deadline_entry(&init_task_group, &rq->dl,
+				       NULL, i, 1, NULL);
+#elif defined CONFIG_USER_SCHED
+		init_tg_deadline_entry(&root_task_group, &rq->dl,
+				       NULL, i, 0, NULL);
+		init_tg_deadline_entry(&init_task_group,
+				       &per_cpu(init_dl_rq, i),
+				       NULL, i, 1, NULL);
+#endif
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
+
 		for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
 			rq->cpu_load[j] = 0;
 #ifdef CONFIG_SMP
@@ -10229,11 +10364,76 @@ static inline void unregister_rt_sched_group(struct task_group *tg, int cpu)
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+static void free_deadline_sched_group(struct task_group *tg)
+{
+	kfree(tg->dl_rq);
+}
+
+int alloc_deadline_sched_group(struct task_group *tg, struct task_group *parent)
+{
+	struct rq *rq;
+	int i;
+
+	tg->dl_rq = kzalloc(sizeof(struct dl_rq *) * nr_cpu_ids, GFP_KERNEL);
+	if (!tg->dl_rq)
+		return 0;
+
+	for_each_possible_cpu(i) {
+		rq = cpu_rq(i);
+		init_tg_deadline_entry(tg, &rq->dl, NULL, i, 0, NULL);
+	}
+
+	return 1;
+}
+
+int sched_deadline_can_attach(struct cgroup *cgrp, struct task_struct *tsk)
+{
+	struct task_group *tg = container_of(cgroup_subsys_state(cgrp,
+					     cpu_cgroup_subsys_id),
+					     struct task_group, css);
+	u64 tg_bw = tg->dl_bandwidth.bw;
+	u64 tsk_bw = tsk->dl.bw;
+
+	if (!deadline_task(tsk))
+		return 1;
+
+	/*
+	 * Check for available free bandwidth for the task
+	 * in the group.
+	 */
+	if (tg_bw < tsk_bw + tg->dl_bandwidth.total_bw)
+		return 0;
+
+	return 1;
+}
+#else /* !CONFIG_DEADLINE_GROUP_SCHED */
+static inline void free_deadline_sched_group(struct task_group *tg)
+{
+}
+
+static inline
+int alloc_deadline_sched_group(struct task_group *tg, struct task_group *parent)
+{
+	return 1;
+}
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
+static inline
+void register_deadline_sched_group(struct task_group *tg, int cpu)
+{
+}
+
+static inline
+void unregister_deadline_sched_group(struct task_group *tg, int cpu)
+{
+}
+
 #ifdef CONFIG_GROUP_SCHED
 static void free_sched_group(struct task_group *tg)
 {
 	free_fair_sched_group(tg);
 	free_rt_sched_group(tg);
+	free_deadline_sched_group(tg);
 	kfree(tg);
 }
 
@@ -10254,10 +10454,14 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!alloc_rt_sched_group(tg, parent))
 		goto err;
 
+	if (!alloc_deadline_sched_group(tg, parent))
+		goto err;
+
 	spin_lock_irqsave(&task_group_lock, flags);
 	for_each_possible_cpu(i) {
 		register_fair_sched_group(tg, i);
 		register_rt_sched_group(tg, i);
+		register_deadline_sched_group(tg, i);
 	}
 	list_add_rcu(&tg->list, &task_groups);
 
@@ -10287,11 +10491,27 @@ void sched_destroy_group(struct task_group *tg)
 {
 	unsigned long flags;
 	int i;
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	struct task_group *parent = tg->parent;
 
 	spin_lock_irqsave(&task_group_lock, flags);
+
+	/*
+	 * If a deadline group goes away, its parent group
+	 * (if any) ends up with some free bandwidth that
+	 * it might use for other groups/tasks.
+	 */
+	if (parent) {
+		spin_lock(&parent->dl_bandwidth.lock);
+		parent->dl_bandwidth.total_bw -= tg->dl_bandwidth.bw;
+		spin_unlock(&parent->dl_bandwidth.lock);
+	}
+#else
+	spin_lock_irqsave(&task_group_lock, flags);
+#endif
 	for_each_possible_cpu(i) {
 		unregister_fair_sched_group(tg, i);
 		unregister_rt_sched_group(tg, i);
+		unregister_deadline_sched_group(tg, i);
 	}
 	list_del_rcu(&tg->list);
 	list_del_rcu(&tg->siblings);
@@ -10672,6 +10892,113 @@ static int sched_rt_global_constraints(void)
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+/* Must be called with tasklist_lock held */
+static inline int tg_has_deadline_tasks(struct task_group *tg)
+{
+	struct task_struct *g, *p;
+
+	do_each_thread(g, p) {
+		if (deadline_task(p) && task_group(p) == tg)
+			return 1;
+	} while_each_thread(g, p);
+
+	return 0;
+}
+
+static inline
+void tg_set_deadline_bandwidth(struct task_group *tg, u64 r, u64 p, u64 bw)
+{
+	assert_spin_locked(&tg->dl_bandwidth.lock);
+
+	tg->dl_bandwidth.runtime_max = r;
+	tg->dl_bandwidth.period = p;
+	tg->dl_bandwidth.bw = bw;
+}
+
+/*
+ * Here we check if the new group parameters are schedulable in the
+ * system. This depends on these new parameters and on the free bandwidth
+ * either in the parent group or in the whole system.
+ */
+static int __deadline_schedulable(struct task_group *tg,
+				  u64 runtime_max, u64 period)
+{
+	struct task_group *parent = tg->parent;
+	u64 bw, old_bw, parent_bw;
+	int ret = 0;
+
+	/*
+	 * Note that we allow runtime > period, since it makes sense to
+	 * assign more than 100% bandwidth to a group on SMP machine.
+	 */
+	mutex_lock(&deadline_constraints_mutex);
+	spin_lock_irq(&tg->dl_bandwidth.lock);
+
+	bw = period <= 0 ? 0 : to_ratio(period, runtime_max);
+	if (bw < tg->dl_bandwidth.total_bw) {
+		ret = -EINVAL;
+		goto unlock_tg;
+	}
+
+	/*
+	 * The root group has no parent, but its assigned bandwidth has
+	 * to stay below the global bandwidth value given by
+	 * sysctl_sched_deadline_runtime / sysctl_sched_deadline_period.
+	 */
+	if (!parent) {
+		/* root group */
+		if (sysctl_sched_deadline_period <= 0)
+			parent_bw = 0;
+		else
+			parent_bw = to_ratio(sysctl_sched_deadline_period,
+					     sysctl_sched_deadline_runtime);
+		if (parent_bw >= bw)
+			tg_set_deadline_bandwidth(tg, runtime_max, period, bw);
+		else
+			ret = -EINVAL;
+	} else {
+		/* non-root groups */
+		spin_lock(&parent->dl_bandwidth.lock);
+		parent_bw = parent->dl_bandwidth.bw;
+		old_bw = tg->dl_bandwidth.bw;
+
+		if (parent_bw >= parent->dl_bandwidth.total_bw -
+				 old_bw + bw) {
+			tg_set_deadline_bandwidth(tg, runtime_max, period, bw);
+			parent->dl_bandwidth.total_bw -= old_bw;
+			parent->dl_bandwidth.total_bw += bw;
+		} else
+			ret = -EINVAL;
+		spin_unlock(&parent->dl_bandwidth.lock);
+	}
+unlock_tg:
+	spin_unlock_irq(&tg->dl_bandwidth.lock);
+	mutex_unlock(&deadline_constraints_mutex);
+
+	return ret;
+}
+
+static int sched_deadline_global_constraints(void)
+{
+	struct task_group *tg = &init_task_group;
+	u64 bw;
+	int ret = 1;
+
+	spin_lock_irq(&tg->dl_bandwidth.lock);
+	if (sysctl_sched_deadline_period <= 0)
+		bw = 0;
+	else
+		bw = to_ratio(global_deadline_period(),
+			      global_deadline_runtime());
+
+	if (bw < tg->dl_bandwidth.bw)
+		ret = 0;
+	spin_unlock_irq(&tg->dl_bandwidth.lock);
+
+	return ret;
+}
+#else /* !CONFIG_DEADLINE_GROUP_SCHED */
 static int sched_deadline_global_constraints(void)
 {
 	u64 bw;
@@ -10690,6 +11017,7 @@ static int sched_deadline_global_constraints(void)
 
 	return ret;
 }
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
 
 int sched_rt_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
@@ -10784,9 +11112,15 @@ cpu_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
 static int
 cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
+#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_DEADLINE_GROUP_SCHED)
 #ifdef CONFIG_RT_GROUP_SCHED
 	if (!sched_rt_can_attach(cgroup_tg(cgrp), tsk))
 		return -EINVAL;
+#endif
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	if (!sched_deadline_can_attach(cgrp, tsk))
+		return -EINVAL;
+#endif
 #else
 	/* We don't support RT-tasks being in separate groups */
 	if (tsk->sched_class != &fair_sched_class)
@@ -10822,6 +11156,29 @@ cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 		  struct cgroup *old_cont, struct task_struct *tsk,
 		  bool threadgroup)
 {
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	struct task_group *tg = container_of(cgroup_subsys_state(cgrp,
+					     cpu_cgroup_subsys_id),
+					     struct task_group, css);
+	struct task_group *old_tg = container_of(cgroup_subsys_state(old_cont,
+						 cpu_cgroup_subsys_id),
+						 struct task_group, css);
+
+	/*
+	 * An amount of bandwidth equal to the bandwidth of tsk
+	 * is freed in the former group of tsk, and declared occupied
+	 * in the new one.
+	 */
+	spin_lock_irq(&tg->dl_bandwidth.lock);
+	tg->dl_bandwidth.total_bw += tsk->dl.bw;
+
+	if (old_tg) {
+		spin_lock(&old_tg->dl_bandwidth.lock);
+		old_tg->dl_bandwidth.total_bw -= tsk->dl.bw;
+		spin_unlock(&old_tg->dl_bandwidth.lock);
+	}
+	spin_unlock_irq(&tg->dl_bandwidth.lock);
+#endif
 	sched_move_task(tsk);
 	if (threadgroup) {
 		struct task_struct *c;
@@ -10872,6 +11229,56 @@ static u64 cpu_rt_period_read_uint(struct cgroup *cgrp, struct cftype *cft)
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+static int cpu_deadline_runtime_write_uint(struct cgroup *cgrp,
+					   struct cftype *cftype,
+					   u64 dl_runtime_us)
+{
+	struct task_group *tg = cgroup_tg(cgrp);
+
+	return __deadline_schedulable(tg, dl_runtime_us * NSEC_PER_USEC,
+				      tg->dl_bandwidth.period);
+}
+
+static u64 cpu_deadline_runtime_read_uint(struct cgroup *cgrp,
+					  struct cftype *cft)
+{
+	struct task_group *tg = cgroup_tg(cgrp);
+	u64 runtime;
+
+	spin_lock_irq(&tg->dl_bandwidth.lock);
+	runtime = tg->dl_bandwidth.runtime_max;
+	spin_unlock_irq(&tg->dl_bandwidth.lock);
+	do_div(runtime, NSEC_PER_USEC);
+
+	return runtime;
+}
+
+static int cpu_deadline_period_write_uint(struct cgroup *cgrp,
+					  struct cftype *cftype,
+					  u64 dl_period_us)
+{
+	struct task_group *tg = cgroup_tg(cgrp);
+
+	return __deadline_schedulable(tg, tg->dl_bandwidth.runtime_max,
+				      dl_period_us * NSEC_PER_USEC);
+}
+
+static u64 cpu_deadline_period_read_uint(struct cgroup *cgrp,
+					 struct cftype *cft)
+{
+	struct task_group *tg = cgroup_tg(cgrp);
+	u64 period;
+
+	spin_lock_irq(&tg->dl_bandwidth.lock);
+	period = tg->dl_bandwidth.period;
+	spin_unlock_irq(&tg->dl_bandwidth.lock);
+	do_div(period, NSEC_PER_USEC);
+
+	return period;
+}
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
+
 static struct cftype cpu_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -10892,6 +11299,18 @@ static struct cftype cpu_files[] = {
 		.write_u64 = cpu_rt_period_write_uint,
 	},
 #endif
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	{
+		.name = "deadline_runtime_us",
+		.read_u64 = cpu_deadline_runtime_read_uint,
+		.write_u64 = cpu_deadline_runtime_write_uint,
+	},
+	{
+		.name = "deadline_period_us",
+		.read_u64 = cpu_deadline_period_read_uint,
+		.write_u64 = cpu_deadline_period_write_uint,
+	},
+#endif
 };
 
 static int cpu_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cont)
diff --git a/kernel/sched_deadline.c b/kernel/sched_deadline.c
index 82c0192..a14b928 100644
--- a/kernel/sched_deadline.c
+++ b/kernel/sched_deadline.c
@@ -15,6 +15,10 @@
  * However, thanks to bandwidth isolation, overruns and deadline misses
  * remains local, and does not affect any other task in the system.
  *
+ * Groups, if configured, have bandwidth as well, and it is enforced that
+ * the sum of the bandwidths of entities (tasks and groups) belonging to
+ * a group stays below its own bandwidth.
+ *
  * Copyright (C) 2009 Dario Faggioli, Michael Trimarchi
  */
 
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 809ba55..27ab926 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -146,7 +146,8 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
 }
 
 #if defined(CONFIG_CGROUP_SCHED) && \
-	(defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED))
+	(defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED) || \
+	 defined(CONFIG_DEADLINE_GROUP_SCHED))
 static void task_group_path(struct task_group *tg, char *buf, int buflen)
 {
 	/* may be NULL if the underlying cgroup isn't fully-created yet */
-- 
1.6.0.4


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC 11/12][PATCH] SCHED_DEADLINE: documentation
  2009-10-16 15:35 [RFC 0/12][PATCH] SCHED_DEADLINE (new version of SCHED_EDF) Raistlin
                   ` (9 preceding siblings ...)
  2009-10-16 15:46 ` [RFC 10/12][PATCH] SCHED_DEADLINE: group bandwidth management code Raistlin
@ 2009-10-16 15:47 ` Raistlin
  2009-10-16 15:48 ` [RFC 12/12][PATCH] SCHED_DEADLINE: modified sched_*_ex API Raistlin
  11 siblings, 0 replies; 45+ messages in thread
From: Raistlin @ 2009-10-16 15:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 10647 bytes --]

This commit adds some more documentation and comments on how the new
scheduling policy works.

Signed-off-by: Raistlin <raistlin@linux.it>
---
 Documentation/scheduler/sched-deadline.txt |  174 ++++++++++++++++++++++++++++
 include/linux/sched.h                      |   45 +++++++
 init/Kconfig                               |    1 +
 3 files changed, 220 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/scheduler/sched-deadline.txt

diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt
new file mode 100644
index 0000000..cadfa9f
--- /dev/null
+++ b/Documentation/scheduler/sched-deadline.txt
@@ -0,0 +1,174 @@
+			Deadline Task and Group Scheduling
+			----------------------------------
+
+CONTENTS
+========
+
+0. WARNING
+1. Overview
+  1.1 Task scheduling
+  1.2 Group scheduling
+2. The interface
+  2.1 System-wide settings
+  2.2 Default behavior
+  2.3 Basis for grouping tasks
+3. Future plans
+
+
+0. WARNING
+==========
+
+ Fiddling with these settings can result in unpredictable or even unstable
+ system behavior. As for -rt (group) scheduling, it is assumed that root
+ knows what he is doing.
+
+
+1. Overview
+===========
+
+The SCHED_DEADLINE scheduling class implements the Earliest Deadline First
+(EDF) algorithm and uses the Constant Bandwidth Server (CBS) to provide
+bandwidth isolation among tasks.
+The implementation is aligned with the current mainstream kernel, and it
+relies on standard Linux mechanisms (e.g., control groups) to natively support
+multicore platforms and to provide hierarchical scheduling through a standard
+API.
+
+
+1.1 Task scheduling
+-------------------
+
+The SCHED_DEADLINE scheduling class does not make any restrictive assumption
+on the characteristics of the tasks, thus it can handle:
+ * periodic tasks, typical in real-time and control applications;
+ * sporadic tasks, typical in soft real-time and multimedia applications;
+ * aperiodic tasks.
+
+This is mainly because temporal isolation is ensured: the temporal behavior
+of each task (i.e., its ability to meet deadlines) is not affected by what
+happens in any other task in the system.
+In other words, even if a task misbehaves, it is not able to exploit larger
+execution time than the amount that has been devoted to it.
+
+In fact, each task is assigned a ``scheduling budget'' (sched_runtime) and a
+``scheduling deadline'' (sched_deadline, also called period in this branch
+of the real-time literature).
+This means the task is guaranteed to execute for an amount of time equal to
+sched_runtime every sched_deadline, i.e., to utilize at most a CPU bandwidth
+equal to sched_runtime/sched_deadline.
+If it tries to execute more than its sched_runtime it is slowed down by
+stopping it until the time instant of its next deadline.
+
+However, although this algorithm (i.e., the CBS) is effective for encapsulating
+aperiodic or sporadic --real-time or non real-time-- tasks in a real-time
+EDF scheduled system, it imposes some overhead on ``standard'' periodic tasks.
+Therefore, we make it possible for periodic tasks to specify that they are going
+to sleep, waiting for the next activation, because a periodic instance just
+ended. This avoids them (provided they behave well!) being disturbed by
+the CBS bandwidth management logic.
+
+
+1.2 Group scheduling
+--------------------
+
+The scheduling class is integrated with the control groups mechanism in order
+to allow the creation of groups of tasks with a cap on their total utilization.
+
+However, groups play no role in the on-line scheduling decisions. This is
+different from how group scheduling works for the -rt scheduling class, and
+the difference comes from the fact that -deadline tasks _already_ have their
+own bandwidth, which is not true for standard POSIX SCHED_FIFO or SCHED_RR
+processes and threads.
+
+Therefore, there is no need for a fully hierarchical runqueue implementation,
+hierarchical runtime accounting, etc., which results in simpler code and
+smaller overhead.
+All we do is perform bandwidth ``consistency checks'', which happen at the
+occurrence of the following events:
+ * a -deadline task is created or moved inside a group,
+ * the parameters of a -deadline task (if inside a group) are modified,
+ * the -deadline related parameters of a group are modified.
+
+The purpose of this is to ensure that the cumulative utilization of tasks and
+groups stays below that of the group containing them (see below).
+
+
+2. The Interface
+================
+
+
+2.1 System wide settings
+------------------------
+
+The system wide settings are configured under the /proc virtual file system:
+
+/proc/sys/kernel/sched_deadline_period_us:
+  The scheduling period that is equivalent to 100% CPU bandwidth
+
+/proc/sys/kernel/sched_deadline_runtime_us:
+  A global limit on how much CPU time -deadline scheduling may use. Even
+  without CONFIG_DEADLINE_GROUP_SCHED enabled, this limits the time reserved
+  to -deadline processes. With CONFIG_DEADLINE_GROUP_SCHED it signifies the
+  total bandwidth available to all -deadline groups.
+
+  * Time is specified in us because the interface is s32. This gives an
+    operating range from 1us to about 35 minutes;
+  * sched_deadline_period_us takes values from 1 to INT_MAX;
+  * sched_deadline_runtime_us takes values from 1 to INT_MAX;
+  * setting runtime = period specifies 100% bandwidth exploitable by
+    -deadline tasks;
+  * setting runtime > period allows for more than 100% bandwidth
+    exploitable by -deadline tasks, which still might make sense,
+    especially in SMP systems.
+
+
+2.2 Default behavior
+---------------------
+
+The default values for sched_deadline_period_us and
+sched_deadline_runtime_us are 0.  This means no -deadline tasks or
+groups can be created!
+
+Consistently, bandwidth assigned to the root group, and to each newly created
+group, is 0 as well.
+
+
+2.3 Basis for grouping tasks
+----------------------------
+
+There are two compile-time settings for allocating CPU bandwidth. These are
+configured using the "Basis for grouping tasks" multiple choice menu under
+General setup > Group CPU Scheduler:
+
+CONFIG_USER_SCHED (aka "Basis for grouping tasks" =  "user id")
+
+This, for now, is not supported for deadline group scheduling.
+
+CONFIG_CGROUP_SCHED (aka "Basis for grouping tasks" = "Control groups")
+
+This uses the /cgroup virtual file system, i.e.:
+ * /cgroup/<cgroup>/cpu.deadline_runtime_us and
+ * /cgroup/<cgroup>/cpu.deadline_period_us,
+to control the CPU time reserved for each control group.
+
+For more information on working with control groups, you should read
+Documentation/cgroups/cgroups.txt as well.
+
+Group settings are checked against the following limits:
+
+ * for the root group {r}
+     runtime_{r} / period_{r} <= global_runtime / global_period
+ * for each group {i}, subgroup of group {j}
+     \Sum_{i} runtime_{i} / period_{i} <= runtime_{j} / period_{j}
+
+
+3. Future plans
+===============
+
+Only two, but very important, pieces are missing:
+
+ * SMP/multicore global scheduling through push and pull logic (as in
+   -rt). This is not finished, but it is on its way, and will come very soon!
+ * Deadline/BandWidth Inheritance and/or Proxy Execution mechanisms for the
+   rt_mutexes. This probably needs some more discussion, and also some more
+   time to get it implemented!
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4de72eb..ec0324f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -95,6 +95,51 @@ struct sched_param {
 
 #include <asm/processor.h>
 
+/*
+ * Extended sched_param for SCHED_DEADLINE tasks.
+ *
+ * In fact, struct sched_param cannot be modified, due to binary compatibility
+ * issues.
+ *
+ * A SCHED_DEADLINE task has at least a scheduling deadline (sched_deadline)
+ * and a scheduling runtime (sched_runtime). Space for a scheduling
+ * period (sched_period) is reserved, but the field is not used right now.
+ *
+ * When a SCHED_DEADLINE task activates at time t, its absolute deadline is
+ * computed as:
+ *	deadline = t + sched_deadline.
+ * The SCHED_DEADLINE runqueue is ordered according to ascending tasks'
+ * deadline values, thus the task with the _earliest_ deadline is the one
+ * that will be scheduled.
+ *
+ * In order to prevent one task from interfering with the others, each
+ * task activation is allowed to run for its runtime, which is at most
+ * sched_runtime.
+ * After that, the task is stopped until its deadline, when it is reactivated
+ * with a new 'runtime quota' and a new deadline.
+ *
+ * Period (or minimum interarrival time) is not dealt with in the kernel, and
+ * it is up to the user to make the task suspend at the end of each instance.
+ * The sched_wait_interval() syscall --with clock_nanosleep-like semantics--
+ * can be used for this purpose. In this case, when the task resumes, the
+ * scheduler assumes a new instance is just starting, and provides the task
+ * with new runtime and deadline values.
+ *
+ * Scheduling flags, finally, let the user specify if runtime overruns (which
+ * may occur, e.g., due to timing resolution issues) and/or deadline misses
+ * (e.g., because the system is oversubscribed) have to be notified by means of
+ * SIGXCPU signals.
+ *
+ * @sched_priority:	not used right now
+ *
+ * @sched_deadline:	scheduling deadline of the task
+ * @sched_runtime:	scheduling runtime of the task
+ * @sched_period:	not used right now
+ *
+ * @sched_flags:	scheduling flags of the task (runtime overrun and/or
+ *			deadline miss only, for now)
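+ *
+ * As a purely illustrative example, a task submitted with
+ *	sched_runtime  = 10 ms
+ *	sched_deadline = 100 ms
+ * gets, at each activation time t, an absolute deadline of t + 100ms and a
+ * runtime budget of 10ms; once that budget is depleted it is throttled
+ * until t + 100ms, when it receives a new runtime quota and a new deadline.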
+ */
+
 #define SCHED_SIG_RORUN		0x80000000
 #define SCHED_SIG_DMISS		0x40000000
 
diff --git a/init/Kconfig b/init/Kconfig
index 17318ca..d4a52b7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -467,6 +467,7 @@ config DEADLINE_GROUP_SCHED
 	  tasks (and other groups) can be added to it only up to such
 	  ``bandwidth cap'', which might be useful for avoiding or
 	  controlling oversubscription.
+	  See Documentation/scheduler/sched-deadline.txt for more.
 
 choice
 	depends on GROUP_SCHED
-- 
1.6.0.4


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC 12/12][PATCH] SCHED_DEADLINE: modified sched_*_ex API
  2009-10-16 15:35 [RFC 0/12][PATCH] SCHED_DEADLINE (new version of SCHED_EDF) Raistlin
                   ` (10 preceding siblings ...)
  2009-10-16 15:47 ` [RFC 11/12][PATCH] SCHED_DEADLINE: documentation Raistlin
@ 2009-10-16 15:48 ` Raistlin
  2009-12-28 15:09   ` Peter Zijlstra
  2009-12-29 12:15   ` Peter Zijlstra
  11 siblings, 2 replies; 45+ messages in thread
From: Raistlin @ 2009-10-16 15:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 5359 bytes --]

This commit amends the new API introduced to deal with the new sched_param_ex
scheduling parameter data structure.

What we add is one more parameter to all these functions, containing the size
of sched_param_ex. It might turn out useful for possible future extensions of
sched_param_ex itself, to avoid issues with the ABI of legacy applications.
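
For instance, a purely illustrative userspace invocation of the new prototype
-- assuming the syscall number is wired up for the architecture, with no glibc
wrapper, and with headers and error checking omitted -- would look like:

	struct sched_param_ex param = {
		.sched_runtime  = { .tv_nsec =  10000000 },	/* 10ms  */
		.sched_deadline = { .tv_nsec = 100000000 },	/* 100ms */
	};
	int ret;

	ret = syscall(__NR_sched_setscheduler_ex, getpid(), SCHED_DEADLINE,
		      sizeof(param), &param);

An older binary, built against a smaller sched_param_ex, would simply keep
passing the sizeof() it was compiled with.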

Signed-off-by: Raistlin <raistlin@linux.it>
---
 include/linux/syscalls.h |    6 +++---
 kernel/sched.c           |   25 ++++++++++++++++---------
 2 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e01f59c..60a99a7 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -391,16 +391,16 @@ asmlinkage long sys_clock_nanosleep(clockid_t which_clock, int flags,
 asmlinkage long sys_nice(int increment);
 asmlinkage long sys_sched_setscheduler(pid_t pid, int policy,
 					struct sched_param __user *param);
-asmlinkage long sys_sched_setscheduler_ex(pid_t pid, int policy,
+asmlinkage long sys_sched_setscheduler_ex(pid_t pid, int policy, unsigned len,
 					struct sched_param_ex __user *param);
 asmlinkage long sys_sched_setparam(pid_t pid,
 					struct sched_param __user *param);
-asmlinkage long sys_sched_setparam_ex(pid_t pid,
+asmlinkage long sys_sched_setparam_ex(pid_t pid, unsigned len,
 					struct sched_param_ex __user *param);
 asmlinkage long sys_sched_getscheduler(pid_t pid);
 asmlinkage long sys_sched_getparam(pid_t pid,
 					struct sched_param __user *param);
-asmlinkage long sys_sched_getparam_ex(pid_t pid,
+asmlinkage long sys_sched_getparam_ex(pid_t pid, unsigned len,
 					struct sched_param_ex __user *param);
 asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
 					unsigned long __user *user_mask_ptr);
diff --git a/kernel/sched.c b/kernel/sched.c
index a8ebfa2..d3a61f5 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6662,7 +6662,7 @@ do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)
 }
 
 static int
-do_sched_setscheduler_ex(pid_t pid, int policy,
+do_sched_setscheduler_ex(pid_t pid, int policy, unsigned int len,
 			 struct sched_param_ex __user *param_ex)
 {
 	struct sched_param lparam;
@@ -6672,8 +6672,9 @@ do_sched_setscheduler_ex(pid_t pid, int policy,
 
 	if (!param_ex || pid < 0)
 		return -EINVAL;
-	if (copy_from_user(&lparam_ex, param_ex,
-	    sizeof(struct sched_param_ex)))
+	if (len > sizeof(struct sched_param_ex))
+		return -EINVAL;
+	if (copy_from_user(&lparam_ex, param_ex, len))
 		return -EFAULT;
 
 	rcu_read_lock();
@@ -6708,15 +6709,17 @@ SYSCALL_DEFINE3(sched_setscheduler, pid_t, pid, int, policy,
  * sys_sched_setscheduler_ex - set/change the scheduler policy to SCHED_DEADLINE
  * @pid: the pid in question.
  * @policy: new policy (should be SCHED_DEADLINE).
+ * @len: size of data pointed by param_ex.
  * @param: structure containg the extended deadline parameters.
  */
-SYSCALL_DEFINE3(sched_setscheduler_ex, pid_t, pid, int, policy,
+SYSCALL_DEFINE4(sched_setscheduler_ex, pid_t, pid,
+		int, policy, unsigned, len,
 		struct sched_param_ex __user *, param_ex)
 {
 	if (policy < 0)
 		return -EINVAL;
 
-	return do_sched_setscheduler_ex(pid, policy, param_ex);
+	return do_sched_setscheduler_ex(pid, policy, len, param_ex);
 }
 
 /**
@@ -6732,12 +6735,13 @@ SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
 /**
  * sys_sched_setparam - set/change the DEADLINE parameters of a thread
  * @pid: the pid in question.
+ * @len: size of data pointed by param_ex.
  * @param_ex: structure containing the new parameters (deadline, runtime, etc.).
  */
-SYSCALL_DEFINE2(sched_setparam_ex, pid_t, pid,
+SYSCALL_DEFINE3(sched_setparam_ex, pid_t, pid, unsigned, len,
 		struct sched_param_ex __user *, param_ex)
 {
-	return do_sched_setscheduler_ex(pid, -1, param_ex);
+	return do_sched_setscheduler_ex(pid, -1, len, param_ex);
 }
 
 /**
@@ -6807,9 +6811,10 @@ out_unlock:
 /**
  * sys_sched_getparam - get the DEADLINE task parameters of a thread
  * @pid: the pid in question.
+ * @len: size of data pointed by param_ex.
  * @param_ex: structure containing the new parameters (deadline, runtime, etc.).
  */
-SYSCALL_DEFINE2(sched_getparam_ex, pid_t, pid,
+SYSCALL_DEFINE3(sched_getparam_ex, pid_t, pid, unsigned, len,
 		struct sched_param_ex __user *, param_ex)
 {
 	struct sched_param_ex lp;
@@ -6818,6 +6823,8 @@ SYSCALL_DEFINE2(sched_getparam_ex, pid_t, pid,
 
 	if (!param_ex || pid < 0)
 		return -EINVAL;
+	if (len < sizeof(struct sched_param_ex))
+		return -EINVAL;
 
 	read_lock(&tasklist_lock);
 	p = find_process_by_pid(pid);
@@ -6837,7 +6844,7 @@ SYSCALL_DEFINE2(sched_getparam_ex, pid_t, pid,
 	/*
 	 * This one might sleep, we cannot do it with a spinlock held ...
 	 */
-	retval = copy_to_user(param_ex, &lp, sizeof(*param_ex)) ? -EFAULT : 0;
+	retval = copy_to_user(param_ex, &lp, len) ? -EFAULT : 0;
 
 	return retval;
 
-- 
1.6.0.4

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [RFC 9/12][PATCH] SCHED_DEADLINE: system wide bandwidth  management
  2009-10-16 15:45 ` [RFC 9/12][PATCH] SCHED_DEADLINE: system wide bandwidth management Raistlin
@ 2009-11-06 11:34   ` Dhaval Giani
  2009-12-28 14:44   ` Peter Zijlstra
  1 sibling, 0 replies; 45+ messages in thread
From: Dhaval Giani @ 2009-11-06 11:34 UTC (permalink / raw)
  To: Raistlin
  Cc: Peter Zijlstra, linux-kernel, michael trimarchi, Fabio Checconi,
	Ingo Molnar, Thomas Gleixner, Johan Eker, p.faure, Chris Friesen,
	Steven Rostedt, Henrik Austad, Frederic Weisbecker, Darren Hart,
	Sven-Thorsten Dietrich, Bjoern Brandenburg, Tommaso Cucinotta,
	giuseppe.lipari, Juri Lelli

On Fri, Oct 16, 2009 at 9:15 PM, Raistlin <raistlin@linux.it> wrote:
> This commit adds the capability of controlling the maximum, system wide,
> CPU bandwidth that is devoted to SCHED_DEADLINE tasks.
>
> This is done by means of two files:
>  - /proc/sys/kernel/sched_deadline_runtime_us,
>  - /proc/sys/kernel/sched_deadline_period_us.
> The ratio runtime/period is the total bandwidth all the SCHED_DEADLINE tasks
> can use in the system as a whole.
> Trying to create tasks in such a way that they exceed this limitation will
> fail, as soon as the bandwidth cap would be overcome.
>
> Default value is _zero_ bandwidth available, thus write some numbers in those
> files before trying to start some SCHED_DEADLINE task. Setting runtime > period
> is allowed (i.e., more than 100% bandwidth available for -deadline tasks),
> since it makes more than sense in SMP systems.
>

I don't like this interface. A couple of issues that come to mind are
1. There is no check to prevent over provisioning of the system (if I
have missed the check, please correct me)
2. It is not CPU hotplug safe (I can understand that it is not that
important an issue now, but we should keep in mind that linux is
hotplug capable, so we would need to hook into the hotplug mechanism)

I would very much prefer the current interface, where runtime <=
period is always enforced. For SMP, across the system we would then
allow a runtime equivalent to runtime * NR_CPU. I think it would make
more sense to push it as a percentage system wide. (I don't think it
should be an issue for processes, since a process can only run on one
CPU, so its runtime and period will mean just that and not a percentage).
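For example, on a 4-CPU system, runtime = 50000 over a period of 100000
would then allow up to 4 * 50ms = 200ms of -deadline execution every
100ms, i.e. 50% of the total capacity.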
comments?

thanks,
Dhaval
-- 

Ogden Nash  - "The trouble with a kitten is that when it grows up,
it's always a cat." -
http://www.brainyquote.com/quotes/authors/o/ogden_nash.html

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 7/12][PATCH] SCHED_DEADLINE: signal delivery when overrunning
  2009-10-16 15:44 ` [RFC 7/12][PATCH] SCHED_DEADLINE: signal delivery when overrunning Raistlin
@ 2009-12-28 14:19   ` Peter Zijlstra
  2010-01-13  9:30     ` Raistlin
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2009-12-28 14:19 UTC (permalink / raw)
  To: Raistlin
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

On Fri, 2009-10-16 at 17:44 +0200, Raistlin wrote:
> Starting from this commit, the user can ask to receive a SIGXCPU signal
> every time the task runtime is overrun or a scheduling deadline is missed.
> This is done by means of the sched_flags field already present in
> sched_param_ex.
> 
> A runtime overrun will be quite common, e.g. due to coarse execution time
> accounting, wrong parameter assignement, etc.
> A deadline miss --since the deadlines the scheduler sees are ``scheduling
> deadlines'' which have not necessarily to be equal to task's deadlines-- is
> much more unlikely, and should only happen in an overloaded system.

Right, I think it's much better to not do this in posix-cpu-timers.c,
that code is shite.

It's probably possible to set SIGXCPU pending and raise TIF_SIGPENDING
from within the scheduler code, and that will be triggered when we
return to userspace.

That also gets rid of that coarse execution time accounting muck, since
the scheduler has ns accurate accounting.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 8/12][PATCH] SCHED_DEADLINE: wait next instance syscall added.
  2009-10-16 15:44 ` [RFC 8/12][PATCH] SCHED_DEADLINE: wait next instance syscall added Raistlin
@ 2009-12-28 14:30   ` Peter Zijlstra
  2010-01-13  9:33     ` Raistlin
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2009-12-28 14:30 UTC (permalink / raw)
  To: Raistlin
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

On Fri, 2009-10-16 at 17:44 +0200, Raistlin wrote:
> This commit introduces another new SCHED_DEADLINE related syscall. It is
> called sched_wait_interval() and it has close-to-clock_nanosleep semantic.
> 
> However, for SCHED_DEADLINE tasks, it should be the call with which each
> job closes its current instance. In fact, in this case, the task is put to
> sleep and, when it wakes up, the scheduler is informed that a new job
> arrived, saving the overhead that usually comes with a task activation
> to enforce maximum task bandwidth.

The changelog suggests (and a very brief look seems to confirm) that
this code could be much smaller by using hrtimer_nanosleep().

The implementation as presented seems to only call ->wait_interval()
when the timer arms, which seems like a bug, we should always call it,
regardless of whether we're on a period boundary.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 9/12][PATCH] SCHED_DEADLINE: system wide bandwidth management
  2009-10-16 15:45 ` [RFC 9/12][PATCH] SCHED_DEADLINE: system wide bandwidth management Raistlin
  2009-11-06 11:34   ` Dhaval Giani
@ 2009-12-28 14:44   ` Peter Zijlstra
  2010-01-13  9:41     ` Raistlin
  1 sibling, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2009-12-28 14:44 UTC (permalink / raw)
  To: Raistlin
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

On Fri, 2009-10-16 at 17:45 +0200, Raistlin wrote:
> This commit adds the capability of controlling the maximum, system wide,
> CPU bandwidth that is devoted to SCHED_DEADLINE tasks.
> 
> This is done by means of two files:
>  - /proc/sys/kernel/sched_deadline_runtime_us,
>  - /proc/sys/kernel/sched_deadline_period_us.
> The ratio runtime/period is the total bandwidth all the SCHED_DEADLINE tasks
> can use in the system as a whole.
> Trying to create tasks in such a way that they exceed this limitation will
> fail, as soon as the bandwidth cap would be overcome.
> 
> Default value is _zero_ bandwidth available, thus write some numbers in those
> files before trying to start some SCHED_DEADLINE task. Setting runtime > period
> is allowed (i.e., more than 100% bandwidth available for -deadline tasks),
> since it makes more than sense in SMP systems.

Right, so the current rt bandwidth controls go up to 100%, where 100% is
root_domain wide. That is, the bandwidth usage scale is irrespective of
the number of cpus.
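(For instance, the default sched_rt_runtime_us = 950000 over
sched_rt_period_us = 1000000 caps -rt tasks at 95% of the CPU time,
whether the root_domain spans one cpu or sixteen.)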

Whether that was the best choice could of course be argued, but since we
have that, it would be strange to add another set of controls which do not
conform.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 10/12][PATCH] SCHED_DEADLINE: group bandwidth management code
  2009-10-16 15:46 ` [RFC 10/12][PATCH] SCHED_DEADLINE: group bandwidth management code Raistlin
@ 2009-12-28 14:51   ` Peter Zijlstra
  2010-01-13  9:46     ` Raistlin
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2009-12-28 14:51 UTC (permalink / raw)
  To: Raistlin
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

On Fri, 2009-10-16 at 17:46 +0200, Raistlin wrote:
> CPU Container Groups support for SCHED_DEADLINE is introduced by this commit.
> 
> CGroups, if configured, have a SCHED_DEADLINE bandwidth, and it is enforced
> that the sum of the bandwidths of entities (tasks and groups) belonging to
> a group stays below its own bandwidth.

No real comment on the code here, but since we have a deadline bandwidth
reservation, should we not also couple to the existing rt (fifo)
bandwidth reservation?

We can't after all guarantee the FIFO time if there's deadline
reservations around for more.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 12/12][PATCH] SCHED_DEADLINE: modified sched_*_ex API
  2009-10-16 15:48 ` [RFC 12/12][PATCH] SCHED_DEADLINE: modified sched_*_ex API Raistlin
@ 2009-12-28 15:09   ` Peter Zijlstra
  2010-01-13 10:27     ` Raistlin
  2009-12-29 12:15   ` Peter Zijlstra
  1 sibling, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2009-12-28 15:09 UTC (permalink / raw)
  To: Raistlin
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

On Fri, 2009-10-16 at 17:48 +0200, Raistlin wrote:
> @@ -6807,9 +6811,10 @@ out_unlock:
>  /**
>   * sys_sched_getparam - get the DEADLINE task parameters of a thread
>   * @pid: the pid in question.
> + * @len: size of data pointed by param_ex.
>   * @param_ex: structure containing the new parameters (deadline, runtime, etc.).
>   */
> -SYSCALL_DEFINE2(sched_getparam_ex, pid_t, pid,
> +SYSCALL_DEFINE3(sched_getparam_ex, pid_t, pid, unsigned, len,
>                 struct sched_param_ex __user *, param_ex)
>  {
>         struct sched_param_ex lp;
> @@ -6818,6 +6823,8 @@ SYSCALL_DEFINE2(sched_getparam_ex, pid_t, pid,
>  
>         if (!param_ex || pid < 0)
>                 return -EINVAL;
> +       if (len < sizeof(struct sched_param_ex))
> +               return -EINVAL;
>  
>         read_lock(&tasklist_lock);
>         p = find_process_by_pid(pid);

This allows len > sizeof().

> @@ -6837,7 +6844,7 @@ SYSCALL_DEFINE2(sched_getparam_ex, pid_t, pid,
>         /*
>          * This one might sleep, we cannot do it with a spinlock held ...
>          */
> -       retval = copy_to_user(param_ex, &lp, sizeof(*param_ex)) ? -EFAULT : 0;
> +       retval = copy_to_user(param_ex, &lp, len) ? -EFAULT : 0;
>  
>         return retval; 

Which would copy more than lp, resulting in a stack leak, right?




^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 12/12][PATCH] SCHED_DEADLINE: modified sched_*_ex API
  2009-10-16 15:48 ` [RFC 12/12][PATCH] SCHED_DEADLINE: modified sched_*_ex API Raistlin
  2009-12-28 15:09   ` Peter Zijlstra
@ 2009-12-29 12:15   ` Peter Zijlstra
  2010-01-13 10:33     ` Raistlin
  1 sibling, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2009-12-29 12:15 UTC (permalink / raw)
  To: Raistlin
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

On Fri, 2009-10-16 at 17:48 +0200, Raistlin wrote:
> This commit amends the new API introduced to deal with the new sched_param_ex
> scheduling parameter data structure.
> 
> What we add is one more parameter to all the functions, containing the size of
> sched_param_ex. It might turn out useful in possible future extensions of
> sched_param_ex itself, to avoid issue with ABI of legacy applications.
> 
> Signed-off-by: Raistlin <raistlin@linux.it>

> @@ -6807,9 +6811,10 @@ out_unlock:
>  /**
>   * sys_sched_getparam - get the DEADLINE task parameters of a thread
>   * @pid: the pid in question.
> + * @len: size of data pointed by param_ex.
>   * @param_ex: structure containing the new parameters (deadline, runtime, etc.).
>   */
> -SYSCALL_DEFINE2(sched_getparam_ex, pid_t, pid,
> +SYSCALL_DEFINE3(sched_getparam_ex, pid_t, pid, unsigned, len,
>  		struct sched_param_ex __user *, param_ex)
>  {
>  	struct sched_param_ex lp;
> @@ -6818,6 +6823,8 @@ SYSCALL_DEFINE2(sched_getparam_ex, pid_t, pid,
>  
>  	if (!param_ex || pid < 0)
>  		return -EINVAL;
> +	if (len < sizeof(struct sched_param_ex))
> +		return -EINVAL;
>  
>  	read_lock(&tasklist_lock);
>  	p = find_process_by_pid(pid);
> @@ -6837,7 +6844,7 @@ SYSCALL_DEFINE2(sched_getparam_ex, pid_t, pid,
>  	/*
>  	 * This one might sleep, we cannot do it with a spinlock held ...
>  	 */
> -	retval = copy_to_user(param_ex, &lp, sizeof(*param_ex)) ? -EFAULT : 0;
> +	retval = copy_to_user(param_ex, &lp, len) ? -EFAULT : 0;
>  
>  	return retval;
>  

I think this doesn't even do what it claims to do, namely provide a
flexible ABI, since you fail the operation when there is not enough room
provided. Hence, when we grow the struct an older program that was
compiled against the smaller one will become an insta-fail.

What this should do is deal with smaller structs by ensuring the tail is
0 and simply copying out the head.
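
Something along these lines (a rough, uncompiled sketch, reusing the
variable names from the patch) ought to do it:

	/* setscheduler/setparam side: accept an older, smaller structure */
	memset(&lparam_ex, 0, sizeof(lparam_ex));
	if (copy_from_user(&lparam_ex, param_ex,
			   min_t(unsigned int, len, sizeof(lparam_ex))))
		return -EFAULT;

	/* getparam side: only copy out as much as the caller has room for */
	retval = copy_to_user(param_ex, &lp,
			      min_t(unsigned int, len, sizeof(lp))) ?
			-EFAULT : 0;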

New bits in the flags field are also an interesting challenge.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 1/12][PATCH] Extended scheduling parameters structure added.
  2009-10-16 15:38 ` [RFC 1/12][PATCH] Extended scheduling parameters structure added Raistlin
@ 2009-12-29 12:15   ` Peter Zijlstra
  2010-01-13 10:36     ` Raistlin
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2009-12-29 12:15 UTC (permalink / raw)
  To: Raistlin
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

On Fri, 2009-10-16 at 17:38 +0200, Raistlin wrote:

>  include/linux/sched.h |    8 ++++++++
>  1 files changed, 8 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 75e6e60..ac9837c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -94,6 +94,14 @@ struct sched_param {
>  
>  #include <asm/processor.h>
>  
> +struct sched_param_ex {
> +	int sched_priority;
> +	struct timespec sched_runtime;
> +	struct timespec sched_deadline;
> +	struct timespec sched_period;
> +	int sched_flags;
> +};
> +
>  struct exec_domain;
>  struct futex_pi_state;
>  struct robust_list_head;

Why separate this change from the introduction of the new system calls?



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/12][PATCH] SCHED_DEADLINE: core of the scheduling class
  2009-10-16 15:40 ` [RFC 0/12][PATCH] SCHED_DEADLINE: core of the scheduling class Raistlin
@ 2009-12-29 12:25   ` Peter Zijlstra
  2010-01-13 10:40     ` Dario Faggioli
  2009-12-29 12:27   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2009-12-29 12:25 UTC (permalink / raw)
  To: Raistlin
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

On Fri, 2009-10-16 at 17:40 +0200, Raistlin wrote:
> @@ -5966,10 +5982,14 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
>         if (running)
>                 p->sched_class->put_prev_task(rq, p);
>  
> -       if (rt_prio(prio))
> -               p->sched_class = &rt_sched_class;
> -       else
> -               p->sched_class = &fair_sched_class;
> +       if (deadline_task(p))
> +               p->sched_class = &deadline_sched_class;
> +       else {
> +               if (rt_prio(prio))
> +                       p->sched_class = &rt_sched_class;
> +               else
> +                       p->sched_class = &fair_sched_class;
> +       } 

This looks wrong.

This is PI code, so the effective class should be determined based on
the 'priority' not on the 'policy'.

I understand we don't yet have deadline inheritance like things in
place, but this would be where we should make use of the simple ceiling
protocol to boost things for now.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/12][PATCH] SCHED_DEADLINE: core of the scheduling class
  2009-10-16 15:40 ` [RFC 0/12][PATCH] SCHED_DEADLINE: core of the scheduling class Raistlin
  2009-12-29 12:25   ` Peter Zijlstra
@ 2009-12-29 12:27   ` Peter Zijlstra
  2010-01-13 10:42     ` Raistlin
  2009-12-29 14:30   ` Peter Zijlstra
  2009-12-29 14:41   ` Peter Zijlstra
  3 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2009-12-29 12:27 UTC (permalink / raw)
  To: Raistlin
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

On Fri, 2009-10-16 at 17:40 +0200, Raistlin wrote:
> +       if (unlikely(task_has_deadline_policy(p) || task_has_rt_policy(p))) {

small nit, since both task_has_{deadline,rt}_policy() already have
unlikely()s in, this extra unlikely is not needed.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/12][PATCH] SCHED_DEADLINE: core of the scheduling class
  2009-10-16 15:40 ` [RFC 0/12][PATCH] SCHED_DEADLINE: core of the scheduling class Raistlin
  2009-12-29 12:25   ` Peter Zijlstra
  2009-12-29 12:27   ` Peter Zijlstra
@ 2009-12-29 14:30   ` Peter Zijlstra
  2009-12-29 14:37     ` Peter Zijlstra
  2010-01-13 16:32     ` Dario Faggioli
  2009-12-29 14:41   ` Peter Zijlstra
  3 siblings, 2 replies; 45+ messages in thread
From: Peter Zijlstra @ 2009-12-29 14:30 UTC (permalink / raw)
  To: Raistlin
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

On Fri, 2009-10-16 at 17:40 +0200, Raistlin wrote:
> +struct task_struct *pick_next_task_deadline(struct rq *rq)
> +{
> +       struct sched_dl_entity *dl_se;
> +       struct task_struct *p;
> +       struct dl_rq *dl_rq;
> +
> +       dl_rq = &rq->dl;
> +
> +       if (likely(!dl_rq->dl_nr_running))
> +               return NULL;
> +
> +       dl_se = pick_next_deadline_entity(rq, dl_rq);
> +       BUG_ON(!dl_se);
> +
> +       p = deadline_task_of(dl_se);
> +       p->se.exec_start = rq->clock;
> +#ifdef CONFIG_SCHED_HRTICK
> +       if (hrtick_enabled(rq))
> +               start_hrtick_deadline(rq, p);
> +#endif
> +       return p;
> +} 

I'm not sure about actually using hrtick like this, I'd expect
SCHED_DEADLINE to always use hrtimers when available.  The only reason
to use some of the hrtick infrastructure is to re-use the hrtick_start()
logic which uses IPIs to ensure we program the timer on the right cpu
(so we can schedule from it).

The whole IPI mess requires USE_GENERIC_SMP_HELPERS, which makes
CONFIG_HRTICK useful (ensures we have hrtimers enabled and have generic
IPI bits)

The problem is that things like hrtick_enabled() also check
sched_feat(HRTICK) which is disabled by default (because programming the
clock hw on each schedule was found too expensive) but that should not
stop SCHED_DEADLINE from using it.




^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/12][PATCH] SCHED_DEADLINE: core of the scheduling class
  2009-12-29 14:30   ` Peter Zijlstra
@ 2009-12-29 14:37     ` Peter Zijlstra
  2009-12-29 14:40       ` Peter Zijlstra
  2010-01-13 16:32     ` Dario Faggioli
  1 sibling, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2009-12-29 14:37 UTC (permalink / raw)
  To: Raistlin
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

On Tue, 2009-12-29 at 15:30 +0100, Peter Zijlstra wrote:
> On Fri, 2009-10-16 at 17:40 +0200, Raistlin wrote:
> > +struct task_struct *pick_next_task_deadline(struct rq *rq)
> > +{
> > +       struct sched_dl_entity *dl_se;
> > +       struct task_struct *p;
> > +       struct dl_rq *dl_rq;
> > +
> > +       dl_rq = &rq->dl;
> > +
> > +       if (likely(!dl_rq->dl_nr_running))
> > +               return NULL;
> > +
> > +       dl_se = pick_next_deadline_entity(rq, dl_rq);
> > +       BUG_ON(!dl_se);
> > +
> > +       p = deadline_task_of(dl_se);
> > +       p->se.exec_start = rq->clock;
> > +#ifdef CONFIG_SCHED_HRTICK
> > +       if (hrtick_enabled(rq))
> > +               start_hrtick_deadline(rq, p);
> > +#endif
> > +       return p;
> > +} 
> 
> I'm not sure about actually using hrtick like this, I'd expect
> SCHED_DEADLINE to always use hrtimers when available.  The only reason
> to use some of the hrtick infrastructure is to re-use the hrtick_start()
> logic which uses IPIs to ensure we program the timer on the right cpu
> (so we can schedule from it).

Hmm I suppose we could ignore all that CONFIG_SCHED_HRTICK stuff and
simply bounce the schedule event using the resched ipi when we find
we're on the wrong cpu. Not ideal though.. frigging mess all this :-)


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/12][PATCH] SCHED_DEADLINE: core of the scheduling class
  2009-12-29 14:37     ` Peter Zijlstra
@ 2009-12-29 14:40       ` Peter Zijlstra
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2009-12-29 14:40 UTC (permalink / raw)
  To: Raistlin
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

On Tue, 2009-12-29 at 15:37 +0100, Peter Zijlstra wrote:
> On Tue, 2009-12-29 at 15:30 +0100, Peter Zijlstra wrote:
> > On Fri, 2009-10-16 at 17:40 +0200, Raistlin wrote:
> > > +struct task_struct *pick_next_task_deadline(struct rq *rq)
> > > +{
> > > +       struct sched_dl_entity *dl_se;
> > > +       struct task_struct *p;
> > > +       struct dl_rq *dl_rq;
> > > +
> > > +       dl_rq = &rq->dl;
> > > +
> > > +       if (likely(!dl_rq->dl_nr_running))
> > > +               return NULL;
> > > +
> > > +       dl_se = pick_next_deadline_entity(rq, dl_rq);
> > > +       BUG_ON(!dl_se);
> > > +
> > > +       p = deadline_task_of(dl_se);
> > > +       p->se.exec_start = rq->clock;
> > > +#ifdef CONFIG_SCHED_HRTICK
> > > +       if (hrtick_enabled(rq))
> > > +               start_hrtick_deadline(rq, p);
> > > +#endif
> > > +       return p;
> > > +} 
> > 
> > I'm not sure about actually using hrtick like this, I'd expect
> > SCHED_DEADLINE to always use hrtimers when available.  The only reason
> > to use some of the hrtick infrastructure is to re-use the hrtick_start()
> > logic which uses IPIs to ensure we program the timer on the right cpu
> > (so we can schedule from it).
> 
> Hmm I suppose we could ignore all that CONFIG_SCHED_HRTICK stuff and
> simply bounce the schedule event using the resched ipi when we find
> we're on the wrong cpu. Not ideal though.. frigging mess all this :-)

Hmm bugger that, that's not going to work since the fallback timer stuff
doesn't run from hardirq context and is generally useless anyway :-)


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/12][PATCH] SCHED_DEADLINE: core of the scheduling class
  2009-10-16 15:40 ` [RFC 0/12][PATCH] SCHED_DEADLINE: core of the scheduling class Raistlin
                     ` (2 preceding siblings ...)
  2009-12-29 14:30   ` Peter Zijlstra
@ 2009-12-29 14:41   ` Peter Zijlstra
  2010-01-13 10:46     ` Raistlin
  3 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2009-12-29 14:41 UTC (permalink / raw)
  To: Raistlin
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

On Fri, 2009-10-16 at 17:40 +0200, Raistlin wrote:
> +static unsigned long
> +load_balance_deadline(struct rq *this_rq, int this_cpu, struct rq *busiest,
> +                unsigned long max_load_move,
> +                struct sched_domain *sd, enum cpu_idle_type idle,
> +                int *all_pinned, int *this_best_prio)
> +{
> +       /* for now, don't touch SCHED_DEADLINE tasks */
> +       return 0;
> +}
> +
> +static int
> +move_one_task_deadline(struct rq *this_rq, int this_cpu, struct rq *busiest,
> +                 struct sched_domain *sd, enum cpu_idle_type idle)
> +{
> +       return 0;
> +} 

The good news is that I've killed all that :-)


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/12][PATCH] SCHED_DEADLINE: fork and terminate task logic
  2009-10-16 15:41 ` [RFC 0/12][PATCH] SCHED_DEADLINE: fork and terminate task logic Raistlin
@ 2009-12-29 15:20   ` Peter Zijlstra
  2010-01-13 11:11     ` Raistlin
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2009-12-29 15:20 UTC (permalink / raw)
  To: Raistlin
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

On Fri, 2009-10-16 at 17:41 +0200, Raistlin wrote:
> +++ b/kernel/sched.c
> @@ -2561,8 +2561,20 @@ void sched_fork(struct task_struct *p, int clone_flags)
>          * Make sure we do not leak PI boosting priority to the child.
>          */
>         p->prio = current->normal_prio;
> +       if (deadline_task(p)) {
> +               p->sched_class = &deadline_sched_class;
>  
> -       if (!rt_prio(p->prio))
> +               /*
> +                * the child will be SCHED_DEADLINE, but with zero bandwidth.
> +                * The parent (or some other task) must call setscheduler_ex
> +                * on it, or it won't ever start.
> +                */
> +               init_deadline_task(p);
> +               p->dl.flags &= ~DL_NEW;
> +               p->dl.flags |= DL_THROTTLED;

I recently added ->task_fork(), which is called after the class
assignment.

> +       } else if (rt_prio(p->prio))
> +               p->sched_class = &rt_sched_class;
> +       else
>                 p->sched_class = &fair_sched_class;
>  
>  #ifdef CONFIG_SMP
> @@ -2744,6 +2756,10 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
>         if (mm)
>                 mmdrop(mm);
>         if (unlikely(prev_state == TASK_DEAD)) {
> +               /* a deadline task is dying: stop the bandwidth timer */
> +               if (deadline_task(prev))
> +                       hrtimer_cancel(&prev->dl.dl_timer);
> +
>                 /*
>                  * Remove function-return probe instances associated with this
>                  * task and put them back on the free list. 

Shouldn't this be done in the ->dequeue_task() callback?


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 7/12][PATCH] SCHED_DEADLINE: signal delivery when overrunning
  2009-12-28 14:19   ` Peter Zijlstra
@ 2010-01-13  9:30     ` Raistlin
  0 siblings, 0 replies; 45+ messages in thread
From: Raistlin @ 2010-01-13  9:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Claudio Scordino, Tommaso Cucinotta, giuseppe.lipari, Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 1424 bytes --]

On Mon, 2009-12-28 at 15:19 +0100, Peter Zijlstra wrote: 
> > A runtime overrun will be quite common, e.g. due to coarse execution time
> > accounting, wrong parameter assignement, etc.
> > A deadline miss --since the deadlines the scheduler sees are ``scheduling
> > deadlines'' which have not necessarily to be equal to task's deadlines-- is
> > much more unlikely, and should only happen in an overloaded system.
> 
> Right, I think it's much better to not do this in posix-cpu-timers.c,
> that code is shite.
> 
> It's probably possible to set SIGXCPU pending and raise TIF_SIGPENDING
> from within the scheduler code, and that will be triggered when we
> return to userspace.
> 
Ok, this sounds a lot better to me too... I'll go for this!

> That also gets rid of that coarse execution time accounting muck, since
> the scheduler has ns accurate accounting.
> 
Yes --at least when the hrtick is enabled-- I agree that this is another
very interesting benefit of this approach.

It'll be done like this in the next version of the patchset I'm
preparing.

Thanks for the answer and regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 8/12][PATCH] SCHED_DEADLINE: wait next instance syscall added.
  2009-12-28 14:30   ` Peter Zijlstra
@ 2010-01-13  9:33     ` Raistlin
  0 siblings, 0 replies; 45+ messages in thread
From: Raistlin @ 2010-01-13  9:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Claudio Scordino, Tommaso Cucinotta, giuseppe.lipari, Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 1277 bytes --]

On Mon, 2009-12-28 at 15:30 +0100, Peter Zijlstra wrote:
> > However, for SCHED_DEADLINE tasks, it should be the call with which each
> > job closes its current instance. In fact, in this case, the task is put to
> > sleep and, when it wakes up, the scheduler is informed that a new job
> > arrived, saving the overhead that usually comes with a task activation
> > to enforce maximum task bandwidth.
> 
> The changelog suggests (and a very brief look seems to confirm) that
> this code could be much smaller by using hrtimer_nanosleep().
> 
> The implementation as presented seems to only call ->wait_interval()
> when the timer arms, which seems like a bug, we should always call it,
> regardless of whether we're on a period boundary.
> 
Ok, thanks, I'll look carefully at that! The current code is an attempt
of mine to replicate the behaviour of clock_nanosleep, but you're
definitely right here, it can be done much better.

Thanks and regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 9/12][PATCH] SCHED_DEADLINE: system wide bandwidth management
  2009-12-28 14:44   ` Peter Zijlstra
@ 2010-01-13  9:41     ` Raistlin
  0 siblings, 0 replies; 45+ messages in thread
From: Raistlin @ 2010-01-13  9:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Claudio Scordino, Tommaso Cucinotta, giuseppe.lipari, Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 1335 bytes --]

On Mon, 2009-12-28 at 15:44 +0100, Peter Zijlstra wrote:
> > Default value is _zero_ bandwidth available, thus write some numbers in those
> > files before trying to start some SCHED_DEADLINE task. Setting runtime > period
> > is allowed (i.e., more than 100% bandwidth available for -deadline tasks),
> > since it makes more than sense in SMP systems.
> 
> Right, so the current rt bandwidth controls go up to 100%, where 100% is
> root_domain wide. That is, the bandwidth usage scale is irrespective of
> the number of cpus.
> 
Yep, I know. My controls were working a little bit differently because
I have bandwidth control at the task group --not rq-- level.
Anyway...

> Whether that was the best choice could of course be argued, but since we
> have that, it would be strange to add another set of controls which do not
> conform.
> 
... I again agree that consistency comes first, and I'll move the
controls to the right place and convert them to the conforming
behavior.

Thanks and regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 10/12][PATCH] SCHED_DEADLINE: group bandwidth management code
  2009-12-28 14:51   ` Peter Zijlstra
@ 2010-01-13  9:46     ` Raistlin
  0 siblings, 0 replies; 45+ messages in thread
From: Raistlin @ 2010-01-13  9:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Claudio Scordino, Tommaso Cucinotta, giuseppe.lipari, Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 1259 bytes --]

On Mon, 2009-12-28 at 15:51 +0100, Peter Zijlstra wrote:
> On Fri, 2009-10-16 at 17:46 +0200, Raistlin wrote:
> > CPU Container Groups support for SCHED_DEADLINE is introduced by this commit.
> > 
> > CGroups, if configured, have a SCHED_DEADLINE bandwidth, and it is enforced
> > that the sum of the bandwidths of entities (tasks and groups) belonging to
> > a group stays below its own bandwidth.
> 
> No real comment on the code here, but since we have a deadline bandwidth
> reservation, should we not also couple to the existing rt (fifo)
> bandwidth reservation?
> 
> We can't after all guarantee the FIFO time if there's deadline
> reservations around for more.
> 
Mmm... If I'm getting this right, you're suggesting to subtract the
bandwidth devoted to deadline scheduling from the available system
rt-bandwidth.

This makes more than sense to me... I'll think about how to make the code
reflect that.

Thanks and regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 12/12][PATCH] SCHED_DEADLINE: modified sched_*_ex API
  2009-12-28 15:09   ` Peter Zijlstra
@ 2010-01-13 10:27     ` Raistlin
  2010-01-13 16:23       ` Peter Zijlstra
  0 siblings, 1 reply; 45+ messages in thread
From: Raistlin @ 2010-01-13 10:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 2599 bytes --]

On Mon, 2009-12-28 at 16:09 +0100, Peter Zijlstra wrote:
> On Fri, 2009-10-16 at 17:48 +0200, Raistlin wrote:
> > @@ -6807,9 +6811,10 @@ out_unlock:
> >  /**
> >   * sys_sched_getparam - get the DEADLINE task parameters of a thread
> >   * @pid: the pid in question.
> > + * @len: size of data pointed by param_ex.
> >   * @param_ex: structure containing the new parameters (deadline, runtime, etc.).
> >   */
> > -SYSCALL_DEFINE2(sched_getparam_ex, pid_t, pid,
> > +SYSCALL_DEFINE3(sched_getparam_ex, pid_t, pid, unsigned, len,
> >                 struct sched_param_ex __user *, param_ex)
> >  {
> >         struct sched_param_ex lp;
> > @@ -6818,6 +6823,8 @@ SYSCALL_DEFINE2(sched_getparam_ex, pid_t, pid,
> >  
> >         if (!param_ex || pid < 0)
> >                 return -EINVAL;
> > +       if (len < sizeof(struct sched_param_ex))
> > +               return -EINVAL;
> >  
> >         read_lock(&tasklist_lock);
> >         p = find_process_by_pid(pid);
> 
> This allows len > sizeof().
> 
Yes...

> > @@ -6837,7 +6844,7 @@ SYSCALL_DEFINE2(sched_getparam_ex, pid_t, pid,
> >         /*
> >          * This one might sleep, we cannot do it with a spinlock held ...
> >          */
> > -       retval = copy_to_user(param_ex, &lp, sizeof(*param_ex)) ? -EFAULT : 0;
> > +       retval = copy_to_user(param_ex, &lp, len) ? -EFAULT : 0;
> >  
> >         return retval; 
> 
> Which would copy more than lp, resulting in a stack leak, right?
> 
... And yes again! :-)

This has been done bearing in mind that the _kernel_side_ sched_param_ex
--once stabilized-- will never shrink. I.e., it should only ever grow and,
if/when it does, it should retain the position of the existing fields, for
the sake of backward compatibility.

In that case, I think, the only case we have to face is the one where an
"old" userspace program/library uses a version of sched_param_ex which is
smaller than the one in the kernel, and what we want is the kernel to fill
only the fields that exist in the userspace code.

Does all this make sense?

If yes, I guess I just have to flip the inequality in the if(), turning
it into "if (len > sizeof())" (and then apologize for the glaring
bug! :-P), and then I'm done, am I not?

Thanks and regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 12/12][PATCH] SCHED_DEADLINE: modified sched_*_ex API
  2009-12-29 12:15   ` Peter Zijlstra
@ 2010-01-13 10:33     ` Raistlin
  0 siblings, 0 replies; 45+ messages in thread
From: Raistlin @ 2010-01-13 10:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 1665 bytes --]

On Tue, 2009-12-29 at 13:15 +0100, Peter Zijlstra wrote:
> >  	if (!param_ex || pid < 0)
> >  		return -EINVAL;
> > +	if (len < sizeof(struct sched_param_ex))
> > +		return -EINVAL;
> >  
> >  	read_lock(&tasklist_lock);
> >  	p = find_process_by_pid(pid);
> > @@ -6837,7 +6844,7 @@ SYSCALL_DEFINE2(sched_getparam_ex, pid_t, pid,
> >  	/*
> >  	 * This one might sleep, we cannot do it with a spinlock held ...
> >  	 */
> > -	retval = copy_to_user(param_ex, &lp, sizeof(*param_ex)) ? -EFAULT : 0;
> > +	retval = copy_to_user(param_ex, &lp, len) ? -EFAULT : 0;
> >  
> >  	return retval;
> >  
> 
> I think this doesn't even do what it claims to do, namely provide a
> flexible ABI, since you fail the operation when there is not enough room
> provided. Hence, when we grow the struct an older program that was
> compiled against the smaller one will become an insta-fail.
> 
> What this should do is deal with smaller structs by ensuring the tail is
> 0 and simply copying out the head.
> 
Yep... As said in the previous mail, I wanted to do exactly that, and I'll
do it now that I see how odd what I wrote was! :-P

> New bits in the flags field are also an interesting challenge.
> 
Right... I think that a (partial?) solution could be to properly choose
default values for newcomer flags; could that work?

Thanks and regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 1/12][PATCH] Extended scheduling parameters structure added.
  2009-12-29 12:15   ` Peter Zijlstra
@ 2010-01-13 10:36     ` Raistlin
  0 siblings, 0 replies; 45+ messages in thread
From: Raistlin @ 2010-01-13 10:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Claudio Scordino, Tommaso Cucinotta, giuseppe.lipari, Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 1369 bytes --]

On Tue, 2009-12-29 at 13:15 +0100, Peter Zijlstra wrote:
> On Fri, 2009-10-16 at 17:38 +0200, Raistlin wrote:
> 
> >  include/linux/sched.h |    8 ++++++++
> >  1 files changed, 8 insertions(+), 0 deletions(-)
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 75e6e60..ac9837c 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -94,6 +94,14 @@ struct sched_param {
> >  
> >  #include <asm/processor.h>
> >  
> > +struct sched_param_ex {
> > +	int sched_priority;
> > +	struct timespec sched_runtime;
> > +	struct timespec sched_deadline;
> > +	struct timespec sched_period;
> > +	int sched_flags;
> > +};
> > +
> >  struct exec_domain;
> >  struct futex_pi_state;
> >  struct robust_list_head;
> 
> Why separate this change from the introduction of the new system calls?
> 
Nothing in particular... I was just thinking that the extended data
structure might make sense even for other --more general-- purposes. But
I can merge the two if that sounds better.

Thanks and regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/12][PATCH] SCHED_DEADLINE: core of the scheduling class
  2009-12-29 12:25   ` Peter Zijlstra
@ 2010-01-13 10:40     ` Dario Faggioli
  0 siblings, 0 replies; 45+ messages in thread
From: Dario Faggioli @ 2010-01-13 10:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Claudio Scordino, Tommaso Cucinotta, giuseppe.lipari, Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 1695 bytes --]

On Tue, 2009-12-29 at 13:25 +0100, Peter Zijlstra wrote:
> On Fri, 2009-10-16 at 17:40 +0200, Raistlin wrote:
> > @@ -5966,10 +5982,14 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
> >         if (running)
> >                 p->sched_class->put_prev_task(rq, p);
> >  
> > -       if (rt_prio(prio))
> > -               p->sched_class = &rt_sched_class;
> > -       else
> > -               p->sched_class = &fair_sched_class;
> > +       if (deadline_task(p))
> > +               p->sched_class = &deadline_sched_class;
> > +       else {
> > +               if (rt_prio(prio))
> > +                       p->sched_class = &rt_sched_class;
> > +               else
> > +                       p->sched_class = &fair_sched_class;
> > +       } 
> 
> This looks wrong.
> 
Completely agree! :-P

> This is PI code, so the effective class should be determined based on
> the 'priority' not on the 'policy'.
> 
Agree, but...

> I understand we don't yet have deadline inheritance like things in
> place, but this would be where we should make use of the simple ceiling
> protocol to boost things for now.
> 
... You got it! This is me having no idea which solution would be
better, since no deadline/bandwidth/whatever inheritance is in place
right now.

Now that you have given a direction, I'll follow exactly that path! :-)

Thanks and regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/12][PATCH] SCHED_DEADLINE: core of the scheduling class
  2009-12-29 12:27   ` Peter Zijlstra
@ 2010-01-13 10:42     ` Raistlin
  0 siblings, 0 replies; 45+ messages in thread
From: Raistlin @ 2010-01-13 10:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Claudio Scordino, Tommaso Cucinotta, giuseppe.lipari, Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 669 bytes --]

On Tue, 2009-12-29 at 13:27 +0100, Peter Zijlstra wrote:
> On Fri, 2009-10-16 at 17:40 +0200, Raistlin wrote:
> > +       if (unlikely(task_has_deadline_policy(p) || task_has_rt_policy(p))) {
> 
> small nit, since both task_has_{deadline,rt}_policy() already have
> unlikely()s in, this extra unlikely is not needed.
> 
Coping. :-)

Thanks and regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/12][PATCH] SCHED_DEADLINE: core of the scheduling class
  2009-12-29 14:41   ` Peter Zijlstra
@ 2010-01-13 10:46     ` Raistlin
  0 siblings, 0 replies; 45+ messages in thread
From: Raistlin @ 2010-01-13 10:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Claudio Scordino, Tommaso Cucinotta, giuseppe.lipari, Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 1272 bytes --]

On Tue, 2009-12-29 at 15:41 +0100, Peter Zijlstra wrote:
> On Fri, 2009-10-16 at 17:40 +0200, Raistlin wrote:
> > +static unsigned long
> > +load_balance_deadline(struct rq *this_rq, int this_cpu, struct rq *busiest,
> > +                unsigned long max_load_move,
> > +                struct sched_domain *sd, enum cpu_idle_type idle,
> > +                int *all_pinned, int *this_best_prio)
> > +{
> > +       /* for now, don't touch SCHED_DEADLINE tasks */
> > +       return 0;
> > +}
> > +
> > +static int
> > +move_one_task_deadline(struct rq *this_rq, int this_cpu, struct rq *busiest,
> > +                 struct sched_domain *sd, enum cpu_idle_type idle)
> > +{
> > +       return 0;
> > +} 
> 
> The good news is that I've killed all that :-)
> 
Yeah, good news then! :-D

By the way... I hope you will all soon hear something from us regarding
global-SMP scheduling and migrations for this scheduler.

Thanks and regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/12][PATCH] SCHED_DEADLINE: fork and terminate task logic
  2009-12-29 15:20   ` Peter Zijlstra
@ 2010-01-13 11:11     ` Raistlin
  2010-01-13 16:15       ` Peter Zijlstra
  0 siblings, 1 reply; 45+ messages in thread
From: Raistlin @ 2010-01-13 11:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Claudio Scordino, Tommaso Cucinotta, giuseppe.lipari, Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 2076 bytes --]

On Tue, 2009-12-29 at 16:20 +0100, Peter Zijlstra wrote:
> > -       if (!rt_prio(p->prio))
> > +               /*
> > +                * the child will be SCHED_DEADLINE, but with zero bandwidth.
> > +                * The parent (or some other task) must call setscheduler_ex
> > +                * on it, or it won't ever start.
> > +                */
> > +               init_deadline_task(p);
> > +               p->dl.flags &= ~DL_NEW;
> > +               p->dl.flags |= DL_THROTTLED;
> 
> I recently added ->task_fork(), which is called after the class
> assignment.
> 
Saw that, and it is proving to be of great help! :-P
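
Just to show what I mean, the snippet quoted above could then move into
that hook, roughly like this (a sketch reusing the lines from the patch;
the final code may look different):

	static void task_fork_deadline(struct task_struct *p)
	{
		/*
		 * The child is SCHED_DEADLINE but with zero bandwidth:
		 * it will not run until someone calls setscheduler_ex
		 * on it.
		 */
		init_deadline_task(p);
		p->dl.flags &= ~DL_NEW;
		p->dl.flags |= DL_THROTTLED;
	}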

> > +       } else if (rt_prio(p->prio))
> > +               p->sched_class = &rt_sched_class;
> > +       else
> >                 p->sched_class = &fair_sched_class;
> >  
> >  #ifdef CONFIG_SMP
> > @@ -2744,6 +2756,10 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
> >         if (mm)
> >                 mmdrop(mm);
> >         if (unlikely(prev_state == TASK_DEAD)) {
> > +               /* a deadline task is dying: stop the bandwidth timer */
> > +               if (deadline_task(prev))
> > +                       hrtimer_cancel(&prev->dl.dl_timer);
> > +
> >                 /*
> >                  * Remove function-return probe instances associated with this
> >                  * task and put them back on the free list. 
> 
> Shouldn't this be done in the ->dequeue_task() callback?
>
Not sure about this snippet... Actually, it is one of the most disturbing
pieces of code in this whole scheduler. :-(

The reason it is here is that I think hrtimer_cancel() needs to be called
_without_ holding the rq->lock, is that correct?

It is 

Thanks and regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/12][PATCH] SCHED_DEADLINE: fork and terminate task logic
  2010-01-13 11:11     ` Raistlin
@ 2010-01-13 16:15       ` Peter Zijlstra
  2010-01-13 16:28         ` Dario Faggioli
  2010-01-13 21:30         ` Fabio Checconi
  0 siblings, 2 replies; 45+ messages in thread
From: Peter Zijlstra @ 2010-01-13 16:15 UTC (permalink / raw)
  To: Raistlin
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Claudio Scordino, Tommaso Cucinotta, giuseppe.lipari, Juri Lelli

On Wed, 2010-01-13 at 12:11 +0100, Raistlin wrote:

> > > +       } else if (rt_prio(p->prio))
> > > +               p->sched_class = &rt_sched_class;
> > > +       else
> > >                 p->sched_class = &fair_sched_class;
> > >  
> > >  #ifdef CONFIG_SMP
> > > @@ -2744,6 +2756,10 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
> > >         if (mm)
> > >                 mmdrop(mm);
> > >         if (unlikely(prev_state == TASK_DEAD)) {
> > > +               /* a deadline task is dying: stop the bandwidth timer */
> > > +               if (deadline_task(prev))
> > > +                       hrtimer_cancel(&prev->dl.dl_timer);
> > > +
> > >                 /*
> > >                  * Remove function-return probe instances associated with this
> > >                  * task and put them back on the free list. 
> > 
> > Shouldn't this be done in the ->dequeue_task() callback?
> >
> Not sure about this snippet... Actually, it is one of the most disturbing
> pieces of code in this whole scheduler. :-(
> 
> The reason it is here is that I think hrtimer_cancel() needs to be called
> _without_ holding the rq->lock, is that correct?

I think we can nest the hrtimer base lock inside the rq->lock these
days, so it should be safe to call while holding it, anyway, lockdep
will quickly tell you if you try ;-)

> It is 

Is that a stmt or an unfinished sentence?


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 12/12][PATCH] SCHED_DEADLINE: modified sched_*_ex API
  2010-01-13 10:27     ` Raistlin
@ 2010-01-13 16:23       ` Peter Zijlstra
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2010-01-13 16:23 UTC (permalink / raw)
  To: Raistlin
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Bjoern Brandenburg, Tommaso Cucinotta, giuseppe.lipari,
	Juri Lelli

On Wed, 2010-01-13 at 11:27 +0100, Raistlin wrote:
> On Mon, 2009-12-28 at 16:09 +0100, Peter Zijlstra wrote:
> > On Fri, 2009-10-16 at 17:48 +0200, Raistlin wrote:
> > > @@ -6807,9 +6811,10 @@ out_unlock:
> > >  /**
> > >   * sys_sched_getparam - get the DEADLINE task parameters of a thread
> > >   * @pid: the pid in question.
> > > + * @len: size of data pointed by param_ex.
> > >   * @param_ex: structure containing the new parameters (deadline, runtime, etc.).
> > >   */
> > > -SYSCALL_DEFINE2(sched_getparam_ex, pid_t, pid,
> > > +SYSCALL_DEFINE3(sched_getparam_ex, pid_t, pid, unsigned, len,
> > >                 struct sched_param_ex __user *, param_ex)
> > >  {
> > >         struct sched_param_ex lp;
> > > @@ -6818,6 +6823,8 @@ SYSCALL_DEFINE2(sched_getparam_ex, pid_t, pid,
> > >  
> > >         if (!param_ex || pid < 0)
> > >                 return -EINVAL;
> > > +       if (len < sizeof(struct sched_param_ex))
> > > +               return -EINVAL;
> > >  
> > >         read_lock(&tasklist_lock);
> > >         p = find_process_by_pid(pid);
> > 
> > This allows len > sizeof().
> > 
> Yes...
> 
> > > @@ -6837,7 +6844,7 @@ SYSCALL_DEFINE2(sched_getparam_ex, pid_t, pid,
> > >         /*
> > >          * This one might sleep, we cannot do it with a spinlock held ....
> > >          */
> > > -       retval = copy_to_user(param_ex, &lp, sizeof(*param_ex)) ? -EFAULT : 0;
> > > +       retval = copy_to_user(param_ex, &lp, len) ? -EFAULT : 0;
> > >  
> > >         return retval; 
> > 
> > Which would copy more than lp, resulting in a stack leak, right?
> > 
> .... And yes again! :-)
> 
> This has been done bearing in mind that the _kernel_side_ sched_param_ex
> --once stabilized-- will never shrink. I.e., it should only ever
> grow and, if/when it does, it should retain the position of existing
> fields, for the sake of backward compatibility.
> 
> In that case, I think the only situation we have to handle is the one
> where an "old" userspace program/library uses a version of
> sched_param_ex which is smaller than the one in the kernel, and what we
> want is for the kernel to fill only the fields that exist in the
> userspace code.
> 
> Does all this make sense?

> If yes, I guess I just have to flip the inequality in the if(), turning
> it into "if (len > sizeof())" (and then apologize for the glaring
> bug! :-P), and then I'm done, am I?

Right, I think so..
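
For completeness, that would make the syscall look roughly like this (a
sketch: only the flipped check and the len-sized copy change, the elided
parts stay as in the quoted patch):

	SYSCALL_DEFINE3(sched_getparam_ex, pid_t, pid, unsigned, len,
			struct sched_param_ex __user *, param_ex)
	{
		struct sched_param_ex lp;
		...
		if (!param_ex || pid < 0)
			return -EINVAL;
		/* only reject a userspace structure larger than ours */
		if (len > sizeof(struct sched_param_ex))
			return -EINVAL;
		...
		/* copy only what userspace asked for, never more than lp */
		retval = copy_to_user(param_ex, &lp, len) ? -EFAULT : 0;
		...
	}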


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/12][PATCH] SCHED_DEADLINE: fork and terminate task logic
  2010-01-13 16:15       ` Peter Zijlstra
@ 2010-01-13 16:28         ` Dario Faggioli
  2010-01-13 21:30         ` Fabio Checconi
  1 sibling, 0 replies; 45+ messages in thread
From: Dario Faggioli @ 2010-01-13 16:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Claudio Scordino, Tommaso Cucinotta, giuseppe.lipari, Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 1959 bytes --]

On Wed, 2010-01-13 at 17:15 +0100, Peter Zijlstra wrote:
> On Wed, 2010-01-13 at 12:11 +0100, Raistlin wrote:
> 
> > > > +       } else if (rt_prio(p->prio))
> > > > +               p->sched_class = &rt_sched_class;
> > > > +       else
> > > >                 p->sched_class = &fair_sched_class;
> > > >  
> > > >  #ifdef CONFIG_SMP
> > > > @@ -2744,6 +2756,10 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
> > > >         if (mm)
> > > >                 mmdrop(mm);
> > > >         if (unlikely(prev_state == TASK_DEAD)) {
> > > > +               /* a deadline task is dying: stop the bandwidth timer */
> > > > +               if (deadline_task(prev))
> > > > +                       hrtimer_cancel(&prev->dl.dl_timer);
> > > > +
> > > >                 /*
> > > >                  * Remove function-return probe instances associated with this
> > > >                  * task and put them back on the free list. 
> > > 
> > > Shouldn't this be done in the ->dequeue_task() callback?
> > >
> > Not sure about this snippet... Actually, it is one of the most disturbing
> > pieces of code in this whole scheduler. :-(
> > 
> > The reason it is here is that I think hrtimer_cancel() needs to be called
> > _without_ holding the rq->lock, is that correct?
> 
> I think we can nest the hrtimer base lock inside the rq->lock these
> days, so it should be safe to call while holding it, anyway, lockdep
> will quickly tell you if you try ;-)
> 
Nice, I'll try this soon, thanks.

> > It is 
> 
> Is that a stmt or an unfinished sentence?
> 
No, this is nothing, sorry! :-P

Regards,
Dario


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/12][PATCH] SCHED_DEADLINE: core of the scheduling class
  2009-12-29 14:30   ` Peter Zijlstra
  2009-12-29 14:37     ` Peter Zijlstra
@ 2010-01-13 16:32     ` Dario Faggioli
  2010-01-13 16:47       ` Peter Zijlstra
  1 sibling, 1 reply; 45+ messages in thread
From: Dario Faggioli @ 2010-01-13 16:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Claudio Scordino, Tommaso Cucinotta, giuseppe.lipari, Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 2615 bytes --]

On Tue, 2009-12-29 at 15:30 +0100, Peter Zijlstra wrote:
> On Fri, 2009-10-16 at 17:40 +0200, Raistlin wrote:
> > +struct task_struct *pick_next_task_deadline(struct rq *rq)
> > +{
> > +       struct sched_dl_entity *dl_se;
> > +       struct task_struct *p;
> > +       struct dl_rq *dl_rq;
> > +
> > +       dl_rq = &rq->dl;
> > +
> > +       if (likely(!dl_rq->dl_nr_running))
> > +               return NULL;
> > +
> > +       dl_se = pick_next_deadline_entity(rq, dl_rq);
> > +       BUG_ON(!dl_se);
> > +
> > +       p = deadline_task_of(dl_se);
> > +       p->se.exec_start = rq->clock;
> > +#ifdef CONFIG_SCHED_HRTICK
> > +       if (hrtick_enabled(rq))
> > +               start_hrtick_deadline(rq, p);
> > +#endif
> > +       return p;
> > +} 
> 
> I'm not sure about actually using hrtick like this, I'd expect
> SCHED_DEADLINE to always use hrtimers when available.  The only reason
> to use some of the hrtick infrastructure is to re-use the hrtick_start()
> logic which uses IPIs to ensure we program the timer on the right cpu
> (so we can schedule from it).
> 
Yeah, that and the fact that it seemed to me very easy and clean to:
- check for runtime enforcement inside the task_tick_deadline function,
  as other scheduling classes do, and then
- if possible, ask that task_tick_deadline function to be called right 
  at the time instant I expect my runtime to be depleted. If that does
  not happen --because there is no hrtick or no hi-res hrtimers-- the
  check will still be performed during the next tick (rough sketch
  below).
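
Concretely, the idea is that start_hrtick_deadline() does something
along these lines (a sketch: the dl.runtime field name is illustrative,
the actual code may differ):

	static void start_hrtick_deadline(struct rq *rq, struct task_struct *p)
	{
		/* remaining runtime of the current instance */
		s64 delta = p->dl.runtime;

		if (delta > 0)
			hrtick_start(rq, delta);
	}

so the hrtick fires right when the budget is expected to be depleted and
task_tick_deadline() can do the enforcement there, instead of waiting
for the next jiffy tick.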

> The whole IPI mess requires USE_GENERIC_SMP_HELPERS, which makes
> CONFIG_HRTICK useful (ensures we have hrtimers enabled and have generic
> IPI bits)
> 
> The problem is that things like hrtick_enabled() also check
> sched_feat(HRTICK) which is disabled by default (because programming the
> clock hw on each schedule was found too expensive) but that should not
> stop SCHED_DEADLINE from using it.
> 
Mmm... I might have lost you here... :-(

Do you think that keeping hrtick_start and the like, even if
sched_feat(HRTICK) is disabled, could be good enough? Or are you
suggesting something different?
IOW, should I simply bypass the sched_feat()/hrtick_enabled() check, or
do you think I need something more?

Thanks and regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/12][PATCH] SCHED_DEADLINE: core of the scheduling class
  2010-01-13 16:32     ` Dario Faggioli
@ 2010-01-13 16:47       ` Peter Zijlstra
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2010-01-13 16:47 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: linux-kernel, michael trimarchi, Fabio Checconi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Claudio Scordino, Tommaso Cucinotta, giuseppe.lipari, Juri Lelli

On Wed, 2010-01-13 at 17:32 +0100, Dario Faggioli wrote:
> On Tue, 2009-12-29 at 15:30 +0100, Peter Zijlstra wrote:
> > On Fri, 2009-10-16 at 17:40 +0200, Raistlin wrote:
> > > +struct task_struct *pick_next_task_deadline(struct rq *rq)
> > > +{
> > > +       struct sched_dl_entity *dl_se;
> > > +       struct task_struct *p;
> > > +       struct dl_rq *dl_rq;
> > > +
> > > +       dl_rq = &rq->dl;
> > > +
> > > +       if (likely(!dl_rq->dl_nr_running))
> > > +               return NULL;
> > > +
> > > +       dl_se = pick_next_deadline_entity(rq, dl_rq);
> > > +       BUG_ON(!dl_se);
> > > +
> > > +       p = deadline_task_of(dl_se);
> > > +       p->se.exec_start = rq->clock;
> > > +#ifdef CONFIG_SCHED_HRTICK
> > > +       if (hrtick_enabled(rq))
> > > +               start_hrtick_deadline(rq, p);
> > > +#endif
> > > +       return p;
> > > +} 
> > 
> > I'm not sure about actually using hrtick like this, I'd expect
> > SCHED_DEADLINE to always use hrtimers when available.  The only reason
> > to use some of the hrtick infrastructure is to re-use the hrtick_start()
> > logic which uses IPIs to ensure we program the timer on the right cpu
> > (so we can schedule from it).
> > 
> Yeah, that and the fact that it seemed to me very easy and clean to:
> - check for runtime enforcement inside the task_tick_deadline function,
>   as other scheduling classes do, and then
> - if possible, ask that task_tick_deadline function to be called right 
>   at the time instant I expect my runtime to be depleted. If that won't 
>   happen --because of no-hrtick or no-hires-hrtimers-- the check will
>   still be performed during the next tick.
> 
> > The whole IPI mess requires USE_GENERIC_SMP_HELPERS, which makes
> > CONFIG_HRTICK useful (ensures we have hrtimers enabled and have generic
> > IPI bits)
> > 
> > The problem is that things like hrtick_enabled() also check
> > sched_feat(HRTICK) which is disabled by default (because programming the
> > clock hw on each schedule was found too expensive) but that should not
> > stop SCHED_DEADLINE from using it.
> > 
> Mmm... I might have lost you here... :-(

I had a little chat with fabio around new-years and I think we ended up
agreeing that your current usage is ok, we can always fix it up later.

It's an unfortunately complicated dance of having hrtimers configured in,
having capable hardware and dealing with all the fallout cases.

The only thing which is unfortunate for your current usage is the
sched_feat(HRTICK) thing: we generally do not want HRTICK for
SCHED_OTHER, whereas we'd always (when configured and having capable
hardware) want it for SCHED_DEADLINE..
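
Something along these lines could express that (purely a sketch for
illustration: the helper name and the deadline_task() check on rq->curr
are made up here, they are not in the patches):

	static inline int hrtick_enabled_deadline(struct rq *rq)
	{
		/* ignore the feature bit when a deadline task is running */
		if (!sched_feat(HRTICK) && !deadline_task(rq->curr))
			return 0;
		if (!cpu_active(cpu_of(rq)))
			return 0;
		return hrtimer_is_hres_active(&rq->hrtick_timer);
	}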


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC 0/12][PATCH] SCHED_DEADLINE: fork and terminate task logic
  2010-01-13 16:15       ` Peter Zijlstra
  2010-01-13 16:28         ` Dario Faggioli
@ 2010-01-13 21:30         ` Fabio Checconi
  1 sibling, 0 replies; 45+ messages in thread
From: Fabio Checconi @ 2010-01-13 21:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Raistlin, linux-kernel, michael trimarchi, Ingo Molnar,
	Thomas Gleixner, Dhaval Giani, Johan Eker, p.faure,
	Chris Friesen, Steven Rostedt, Henrik Austad,
	Frederic Weisbecker, Darren Hart, Sven-Thorsten Dietrich,
	Claudio Scordino, Tommaso Cucinotta, giuseppe.lipari, Juri Lelli

> From: Peter Zijlstra <peterz@infradead.org>
> Date: Wed, Jan 13, 2010 05:15:11PM +0100
>
> On Wed, 2010-01-13 at 12:11 +0100, Raistlin wrote:
> 
> > > > +       } else if (rt_prio(p->prio))
> > > > +               p->sched_class = &rt_sched_class;
> > > > +       else
> > > >                 p->sched_class = &fair_sched_class;
> > > >  
> > > >  #ifdef CONFIG_SMP
> > > > @@ -2744,6 +2756,10 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
> > > >         if (mm)
> > > >                 mmdrop(mm);
> > > >         if (unlikely(prev_state == TASK_DEAD)) {
> > > > +               /* a deadline task is dying: stop the bandwidth timer */
> > > > +               if (deadline_task(prev))
> > > > +                       hrtimer_cancel(&prev->dl.dl_timer);
> > > > +
> > > >                 /*
> > > >                  * Remove function-return probe instances associated with this
> > > >                  * task and put them back on the free list. 
> > > 
> > > Shouldn't this be done in the ->dequeue_task() callback?
> > >
> > Not sure about this snippet... Actually, it is one of the most disturbing
> > pieces of code in this whole scheduler. :-(
> > 
> > The reason it is here is that I think hrtimer_cancel() needs to be called
> > _without_ holding the rq->lock, is that correct?
> 
> I think we can nest the hrtimer base lock inside the rq->lock these
> days, so it should be safe to call while holding it, anyway, lockdep
> will quickly tell you if you try ;-)
> 

I may be wrong, but the race here should be between hrtimer_cancel()
and the handler itself (which takes rq->lock): if the timer handler is
running on a different cpu and has not yet entered its critical section,
we may end up here waiting for it to terminate, which will never happen.

If we are able to enforce that both hrtimer_cancel() and the timer
handler always execute on the same cpu, then we should be safe, because
this code would never be executed with a running handler.
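
In other words, the problematic interleaving would be:

	CPU0                                  CPU1
	----                                  ----
	takes rq->lock                        dl_timer fires
	  ...                                 handler spins on rq->lock
	hrtimer_cancel(&prev->dl.dl_timer)
	  waits for the running handler to
	  finish, which it never does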

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2010-01-13 21:20 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-10-16 15:35 [RFC 0/12][PATCH] SCHED_DEADLINE (new version of SCHED_EDF) Raistlin
2009-10-16 15:38 ` [RFC 1/12][PATCH] Extended scheduling parameters structure added Raistlin
2009-12-29 12:15   ` Peter Zijlstra
2010-01-13 10:36     ` Raistlin
2009-10-16 15:40 ` [RFC 0/12][PATCH] SCHED_DEADLINE: core of the scheduling class Raistlin
2009-12-29 12:25   ` Peter Zijlstra
2010-01-13 10:40     ` Dario Faggioli
2009-12-29 12:27   ` Peter Zijlstra
2010-01-13 10:42     ` Raistlin
2009-12-29 14:30   ` Peter Zijlstra
2009-12-29 14:37     ` Peter Zijlstra
2009-12-29 14:40       ` Peter Zijlstra
2010-01-13 16:32     ` Dario Faggioli
2010-01-13 16:47       ` Peter Zijlstra
2009-12-29 14:41   ` Peter Zijlstra
2010-01-13 10:46     ` Raistlin
2009-10-16 15:41 ` [RFC 0/12][PATCH] SCHED_DEADLINE: fork and terminate task logic Raistlin
2009-12-29 15:20   ` Peter Zijlstra
2010-01-13 11:11     ` Raistlin
2010-01-13 16:15       ` Peter Zijlstra
2010-01-13 16:28         ` Dario Faggioli
2010-01-13 21:30         ` Fabio Checconi
2009-10-16 15:41 ` [RFC 0/12][PATCH] SCHED_DEADLINE: added sched_*_ex syscalls Raistlin
2009-10-16 15:42 ` [RFC 0/12][PATCH] SCHED_DEADLINE: added sched-debug support Raistlin
2009-10-16 15:43 ` [RFC 6/12][PATCH] SCHED_DEADLINE: added scheduling latency tracer Raistlin
2009-10-16 15:44 ` [RFC 7/12][PATCH] SCHED_DEADLINE: signal delivery when overrunning Raistlin
2009-12-28 14:19   ` Peter Zijlstra
2010-01-13  9:30     ` Raistlin
2009-10-16 15:44 ` [RFC 8/12][PATCH] SCHED_DEADLINE: wait next instance syscall added Raistlin
2009-12-28 14:30   ` Peter Zijlstra
2010-01-13  9:33     ` Raistlin
2009-10-16 15:45 ` [RFC 9/12][PATCH] SCHED_DEADLINE: system wide bandwidth management Raistlin
2009-11-06 11:34   ` Dhaval Giani
2009-12-28 14:44   ` Peter Zijlstra
2010-01-13  9:41     ` Raistlin
2009-10-16 15:46 ` [RFC 10/12][PATCH] SCHED_DEADLINE: group bandwidth management code Raistlin
2009-12-28 14:51   ` Peter Zijlstra
2010-01-13  9:46     ` Raistlin
2009-10-16 15:47 ` [RFC 11/12][PATCH] SCHED_DEADLINE: documentation Raistlin
2009-10-16 15:48 ` [RFC 12/12][PATCH] SCHED_DEADLINE: modified sched_*_ex API Raistlin
2009-12-28 15:09   ` Peter Zijlstra
2010-01-13 10:27     ` Raistlin
2010-01-13 16:23       ` Peter Zijlstra
2009-12-29 12:15   ` Peter Zijlstra
2010-01-13 10:33     ` Raistlin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).