linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] sched/deadline: Add sched_dl documentation
@ 2014-01-20 10:40 Juri Lelli
  2014-01-20 11:24 ` Henrik Austad
  2014-01-20 12:25 ` Luca Abeni
  0 siblings, 2 replies; 33+ messages in thread
From: Juri Lelli @ 2014-01-20 10:40 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield, rob

From: Dario Faggioli <raistlin@linux.it>

Add in Documentation/scheduler/ some hints about the design
choices, the usage and the future possible developments of the
sched_dl scheduling class and of the SCHED_DEADLINE policy.

Cc: bruce.ashfield@windriver.com
Cc: claudio@evidence.eu.com
Cc: darren@dvhart.com
Cc: dhaval.giani@gmail.com
Cc: fchecconi@gmail.com
Cc: fweisbec@gmail.com
Cc: harald.gustafsson@ericsson.com
Cc: hgu1972@gmail.com
Cc: insop.song@gmail.com
Cc: jkacur@redhat.com
Cc: johan.eker@ericsson.com
Cc: liming.wang@windriver.com
Cc: luca.abeni@unitn.it
Cc: michael@amarulasolutions.com
Cc: mingo@redhat.com
Cc: nicola.manica@disi.unitn.it
Cc: oleg@redhat.com
Cc: paulmck@linux.vnet.ibm.com
Cc: p.faure@akatech.ch
Cc: rob@landley.net
Cc: rostedt@goodmis.org
Cc: tglx@linutronix.de
Cc: tommaso.cucinotta@sssup.it
Cc: vincent.guittot@linaro.org
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 Documentation/scheduler/00-INDEX           |    2 +
 Documentation/scheduler/sched-deadline.txt |  189 ++++++++++++++++++++++++++++
 kernel/sched/deadline.c                    |    3 +-
 3 files changed, 193 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/scheduler/sched-deadline.txt

diff --git a/Documentation/scheduler/00-INDEX b/Documentation/scheduler/00-INDEX
index d2651c4..46702e4 100644
--- a/Documentation/scheduler/00-INDEX
+++ b/Documentation/scheduler/00-INDEX
@@ -10,5 +10,7 @@ sched-nice-design.txt
 	- How and why the scheduler's nice levels are implemented.
 sched-rt-group.txt
 	- real-time group scheduling.
+sched-deadline.txt
+	- deadline scheduling.
 sched-stats.txt
 	- information on schedstats (Linux Scheduler Statistics).
diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt
new file mode 100644
index 0000000..8980de1
--- /dev/null
+++ b/Documentation/scheduler/sched-deadline.txt
@@ -0,0 +1,189 @@
+			  Deadline Task Scheduling
+			  ------------------------
+
+CONTENTS
+========
+
+ 0. WARNING
+ 1. Overview
+ 2. Task scheduling
+ 2. The Interface
+ 3. Bandwidth management
+   3.1 System-wide settings
+   3.2 Task interface
+   3.4 Default behavior
+ 4. Tasks CPU affinity
+   4.1 SCHED_DEADLINE and cpusets HOWTO
+ 5. Future plans
+
+
+0. WARNING
+==========
+
+ Fiddling with these settings can result in an unpredictable or even unstable
+ system behavior. As for -rt (group) scheduling, it is assumed that root users
+ know what they're doing.
+
+
+1. Overview
+===========
+
+ The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
+ basically an implementation of the Earliest Deadline First (EDF) scheduling
+ algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
+ that makes it possible to isolate the behavior of tasks between each other.
+
+
+2. Task scheduling
+==================
+
+ The typical -deadline task is composed of a computation phase (instance)
+ which is activated on a periodic or sporadic fashion. The expected (maximum)
+ duration of such computation is called the task's runtime; the time interval
+ by which each instance needs to be completed is called the task's relative
+ deadline. The task's absolute deadline is dynamically calculated as the
+ time instant a task (or, more properly) activates plus the relative
+ deadline.
+
+ The EDF[1] algorithm selects the task with the smallest absolute deadline as
+ the one to be executed first, while the CBS[2,3] ensures that each task runs
+ for at most its runtime every period, avoiding any interference between
+ different tasks (bandwidth isolation).
+ Thanks to this feature, also tasks that do not strictly comply with the
+ computational model described above can effectively use the new policy.
+ IOW, there are no limitations on what kind of task can exploit this new
+ scheduling discipline, even if it must be said that it is particularly
+ suited for periodic or sporadic tasks that need guarantees on their
+ timing behavior, e.g., multimedia, streaming, control applications, etc.
+
+ References:
+  1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogram-
+      ming in a hard-real-time environment. Journal of the Association for
+      Computing Machinery, 20(1), 1973.
+  2 - L. Abeni , G. Buttazzo. Integrating Multimedia Applications in Hard
+      Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems
+      Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
+  3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
+      Technical Report. http://xoomer.virgilio.it/lucabe72/pubs/tr-98-01.ps
+
+3. Bandwidth management
+=======================
+
+ In order for the -deadline scheduling to be effective and useful, it is
+ important to have some method to keep the allocation of the available CPU
+ bandwidth to the tasks under control.
+ This is usually called "admission control" and if it is not performed at all,
+ no guarantee can be given on the actual scheduling of the -deadline tasks.
+
+ Since when RT-throttling has been introduced each task group has a bandwidth
+ associated, calculated as a certain amount of runtime over a period.
+ Moreover, to make it possible to manipulate such bandwidth, readable/writable
+ controls have been added to both procfs (for system wide settings) and cgroupfs
+ (for per-group settings).
+ Therefore, the same interface is being used for controlling the bandwidth
+ distrubution to -deadline tasks.
+
+ However, more discussion is needed in order to figure out how we want to manage
+ SCHED_DEADLINE bandwidth at the task group level. Therefore, SCHED_DEADLINE
+ uses (for now) a less sophisticated, but actually very sensible, mechanism to
+ ensure that a certain utilization cap is not overcome per each root_domain.
+
+ Another main difference between deadline bandwidth management and RT-throttling
+ is that -deadline tasks have bandwidth on their own (while -rt ones don't!),
+ and thus we don't need an higher level throttling mechanism to enforce the
+ desired bandwidth.
+
+3.1 System wide settings
+------------------------
+
+ The system wide settings are configured under the /proc virtual file system.
+
+ For now the -rt knobs are used for dl admission control and the -deadline
+ runtime is accounted against the -rt runtime. We realise that this isn't
+ entirely desirable; however, it is better to have a small interface for now,
+ and be able to change it easily later. The ideal situation (see 5.) is to run
+ -rt tasks from a -deadline server; in which case the -rt bandwidth is a direct
+ subset of dl_bw.
+
+ This means that, for a root_domain comprising M CPUs, -deadline tasks
+ can be created while the sum of their bandwidths stays below:
+
+   M * (sched_rt_runtime_us / sched_rt_period_us)
+
+ It is also possible to disable this bandwidth management logic, and
+ be thus free of oversubscribing the system up to any arbitrary level.
+ This is done by writing -1 in /proc/sys/kernel/sched_rt_runtime_us.
+
+
+3.2 Task interface
+------------------
+
+ Specifying a periodic/sporadic task that executes for a given amount of
+ runtime at each instance, and that is scheduled according to the urgency of
+ its own timing constraints needs, in general, a way of declaring:
+  - a (maximum/typical) instance execution time,
+  - a minimum interval between consecutive instances,
+  - a time constraint by which each instance must be completed.
+
+ Therefore:
+  * a new struct sched_attr, containing all the necessary fields is
+    provided;
+  * the new scheduling related syscalls that manipulate it, i.e.,
+    sched_setattr() and sched_getattr() are implemented.
+
+
+3.3 Default behavior
+---------------------
+
+ The default value for SCHED_DEADLINE bandwidth is to have rt_runtime equal to
+ 95000. With rt_period equal to 1000000, by default, it means that -deadline
+ tasks can use at most 95%, multiplied by the number of CPUs that compose the
+ root_domain, for each root_domain.
+
+ A -deadline task cannot fork.
+
+4. Tasks CPU affinity
+=====================
+
+ -deadline tasks cannot have an affinity mask smaller that the entire
+ root_domain they are created on. However, affinities can be specified
+ through the cpuset facility (Documentation/cgroups/cpusets.txt).
+
+4.1 SCHED_DEADLINE and cpusets HOWTO
+------------------------------------
+
+ An example of a simple configuration (pin a -deadline task to CPU0)
+ follows (rt-app is used to create a -deadline task).
+
+ mkdir /dev/cpuset
+ mount -t cgroup -o cpuset cpuset /dev/cpuset
+ cd /dev/cpuset
+ mkdir cpu0
+ echo 0 > cpu0/cpuset.cpus
+ echo 0 > cpu0/cpuset.mems
+ echo 1 > cpuset.cpu_exclusive
+ echo 0 > cpuset.sched_load_balance
+ echo 1 > cpu0/cpuset.cpu_exclusive
+ echo 1 > cpu0/cpuset.mem_exclusive
+ echo $$ > cpu0/tasks
+ rt-app -t 100000:10000:d:0 -D5 (it is now actually superfluous to specify
+ task affinity)
+
+5. Future plans
+===============
+
+ Still missing:
+
+  - refinements to deadline inheritance, especially regarding the possibility
+    of retaining bandwidth isolation among non-interacting tasks. This is
+    being studied from both theoretical and practical points of view, and
+    hopefully we should be able to produce some demonstrative code soon;
+  - (c)group based bandwidth management, and maybe scheduling;
+  - access control for non-root users (and related security concerns to
+    address), which is the best way to allow unprivileged use of the mechanisms
+    and how to prevent non-root users "cheat" the system?
+
+ As already discussed, we are planning also to merge this work with the EDF
+ throttling patches [https://lkml.org/lkml/2010/2/23/239] but we still are in
+ the preliminary phases of the merge and we really seek feedback that would
+ help us decide on the direction it should take.
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0de2482..0dd5e09 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -351,7 +351,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,
  * disrupting the schedulability of the system. Otherwise, we should
  * refill the runtime and set the deadline a period in the future,
  * because keeping the current (absolute) deadline of the task would
- * result in breaking guarantees promised to other tasks.
+ * result in breaking guarantees promised to other tasks (refer to
+ * Documentation/scheduler/sched-deadline.txt for more informations).
  *
  * This function returns true if:
  *
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-20 10:40 [PATCH] sched/deadline: Add sched_dl documentation Juri Lelli
@ 2014-01-20 11:24 ` Henrik Austad
  2014-01-20 11:46   ` Peter Zijlstra
                     ` (2 more replies)
  2014-01-20 12:25 ` Luca Abeni
  1 sibling, 3 replies; 33+ messages in thread
From: Henrik Austad @ 2014-01-20 11:24 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield, rob

[-- Attachment #1: Type: text/plain, Size: 13205 bytes --]

On Mon, Jan 20, 2014 at 11:40:40AM +0100, Juri Lelli wrote:
> From: Dario Faggioli <raistlin@linux.it>
> 
> Add in Documentation/scheduler/ some hints about the design
> choices, the usage and the future possible developments of the
> sched_dl scheduling class and of the SCHED_DEADLINE policy.
> 
> Cc: bruce.ashfield@windriver.com
> Cc: claudio@evidence.eu.com
> Cc: darren@dvhart.com
> Cc: dhaval.giani@gmail.com
> Cc: fchecconi@gmail.com
> Cc: fweisbec@gmail.com
> Cc: harald.gustafsson@ericsson.com
> Cc: hgu1972@gmail.com
> Cc: insop.song@gmail.com
> Cc: jkacur@redhat.com
> Cc: johan.eker@ericsson.com
> Cc: liming.wang@windriver.com
> Cc: luca.abeni@unitn.it
> Cc: michael@amarulasolutions.com
> Cc: mingo@redhat.com
> Cc: nicola.manica@disi.unitn.it
> Cc: oleg@redhat.com
> Cc: paulmck@linux.vnet.ibm.com
> Cc: p.faure@akatech.ch
> Cc: rob@landley.net
> Cc: rostedt@goodmis.org
> Cc: tglx@linutronix.de
> Cc: tommaso.cucinotta@sssup.it
> Cc: vincent.guittot@linaro.org
> Signed-off-by: Dario Faggioli <raistlin@linux.it>
> Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
>  Documentation/scheduler/00-INDEX           |    2 +
>  Documentation/scheduler/sched-deadline.txt |  189 ++++++++++++++++++++++++++++
>  kernel/sched/deadline.c                    |    3 +-
>  3 files changed, 193 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/scheduler/sched-deadline.txt
> 
> diff --git a/Documentation/scheduler/00-INDEX b/Documentation/scheduler/00-INDEX
> index d2651c4..46702e4 100644
> --- a/Documentation/scheduler/00-INDEX
> +++ b/Documentation/scheduler/00-INDEX
> @@ -10,5 +10,7 @@ sched-nice-design.txt
>  	- How and why the scheduler's nice levels are implemented.
>  sched-rt-group.txt
>  	- real-time group scheduling.
> +sched-deadline.txt
> +	- deadline scheduling.
>  sched-stats.txt
>  	- information on schedstats (Linux Scheduler Statistics).
> diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt
> new file mode 100644
> index 0000000..8980de1
> --- /dev/null
> +++ b/Documentation/scheduler/sched-deadline.txt
> @@ -0,0 +1,189 @@
> +			  Deadline Task Scheduling
> +			  ------------------------
> +
> +CONTENTS
> +========
> +
> + 0. WARNING
> + 1. Overview
> + 2. Task scheduling
> + 2. The Interface
> + 3. Bandwidth management
> +   3.1 System-wide settings
> +   3.2 Task interface
> +   3.4 Default behavior
> + 4. Tasks CPU affinity
> +   4.1 SCHED_DEADLINE and cpusets HOWTO
> + 5. Future plans
> +
> +
> +0. WARNING
> +==========
> +
> + Fiddling with these settings can result in an unpredictable or even unstable
> + system behavior. As for -rt (group) scheduling, it is assumed that root users
> + know what they're doing.
> +
> +
> +1. Overview
> +===========
> +
> + The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
> + basically an implementation of the Earliest Deadline First (EDF) scheduling
> + algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
> + that makes it possible to isolate the behavior of tasks between each other.


Why not something along the lines of giving a task a guaranteed slice of 
the CPU as well as making sure that a task takes no more than a given 
slice? I.e. making the point of a lower as well as an upper limit of CPU 
usage.

> +2. Task scheduling
> +==================
> +
> + The typical -deadline task is composed of a computation phase (instance)
> + which is activated on a periodic or sporadic fashion. The expected (maximum)
> + duration of such computation is called the task's runtime; the time interval
> + by which each instance needs to be completed is called the task's relative
> + deadline. The task's absolute deadline is dynamically calculated as the
> + time instant a task (or, more properly) activates plus the relative
> + deadline.

activates - released?

Since real-time papers from different rt-campus around the academia insist 
on using *slightly* different terminology, perhaps add a short dictionary 
for some of the more common terms?

D: relative deadline, typically N ms after release
d: absolute deadline, the physical time when a given instance of a job 
   needs to be completed
R: relative release time, for periodic tasks, this is typically 'every N 
   ms'
r: absolute release time
C: Worst-case execution time

   ...you get the idea.

Perhaps too academic?

> + The EDF[1] algorithm selects the task with the smallest absolute deadline as
> + the one to be executed first, while the CBS[2,3] ensures that each task runs
> + for at most its runtime every period, avoiding any interference between
> + different tasks (bandwidth isolation).
> + Thanks to this feature, also tasks that do not strictly comply with the
> + computational model described above can effectively use the new policy.
> + IOW, there are no limitations on what kind of task can exploit this new
> + scheduling discipline, even if it must be said that it is particularly
> + suited for periodic or sporadic tasks that need guarantees on their
> + timing behavior, e.g., multimedia, streaming, control applications, etc.

I assume that ties are broken arbitrarily and that a running task is not 
preempted for another task with equal deadline. Correct?

This would be a nice point to include in this doc methinks.

> + References:
> +  1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogram-
> +      ming in a hard-real-time environment. Journal of the Association for
> +      Computing Machinery, 20(1), 1973.
> +  2 - L. Abeni , G. Buttazzo. Integrating Multimedia Applications in Hard
> +      Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems
> +      Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
> +  3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
> +      Technical Report. http://xoomer.virgilio.it/lucabe72/pubs/tr-98-01.ps
> +
> +3. Bandwidth management
> +=======================
> +
> + In order for the -deadline scheduling to be effective and useful, it is
> + important to have some method to keep the allocation of the available CPU
> + bandwidth to the tasks under control.
> + This is usually called "admission control" and if it is not performed at all,
> + no guarantee can be given on the actual scheduling of the -deadline tasks.
> +
> + Since when RT-throttling has been introduced each task group has a bandwidth
> + associated, calculated as a certain amount of runtime over a period.
> + Moreover, to make it possible to manipulate such bandwidth, readable/writable
> + controls have been added to both procfs (for system wide settings) and cgroupfs
> + (for per-group settings).
> + Therefore, the same interface is being used for controlling the bandwidth
> + distrubution to -deadline tasks.
> +
> + However, more discussion is needed in order to figure out how we want to manage
> + SCHED_DEADLINE bandwidth at the task group level. Therefore, SCHED_DEADLINE
> + uses (for now) a less sophisticated, but actually very sensible, mechanism to
> + ensure that a certain utilization cap is not overcome per each root_domain.
> +
> + Another main difference between deadline bandwidth management and RT-throttling
> + is that -deadline tasks have bandwidth on their own (while -rt ones don't!),
> + and thus we don't need an higher level throttling mechanism to enforce the
> + desired bandwidth.
> +
> +3.1 System wide settings
> +------------------------
> +
> + The system wide settings are configured under the /proc virtual file system.
> +
> + For now the -rt knobs are used for dl admission control and the -deadline
> + runtime is accounted against the -rt runtime. We realise that this isn't
> + entirely desirable; however, it is better to have a small interface for now,
> + and be able to change it easily later. The ideal situation (see 5.) is to run
> + -rt tasks from a -deadline server; in which case the -rt bandwidth is a direct
> + subset of dl_bw.
> +
> + This means that, for a root_domain comprising M CPUs, -deadline tasks
> + can be created while the sum of their bandwidths stays below:
> +
> +   M * (sched_rt_runtime_us / sched_rt_period_us)
> +
> + It is also possible to disable this bandwidth management logic, and
> + be thus free of oversubscribing the system up to any arbitrary level.
> + This is done by writing -1 in /proc/sys/kernel/sched_rt_runtime_us.
> +
> +
> +3.2 Task interface
> +------------------
> +
> + Specifying a periodic/sporadic task that executes for a given amount of
> + runtime at each instance, and that is scheduled according to the urgency of
> + its own timing constraints needs, in general, a way of declaring:
> +  - a (maximum/typical) instance execution time,
> +  - a minimum interval between consecutive instances,
> +  - a time constraint by which each instance must be completed.
> +
> + Therefore:
> +  * a new struct sched_attr, containing all the necessary fields is
> +    provided;
> +  * the new scheduling related syscalls that manipulate it, i.e.,
> +    sched_setattr() and sched_getattr() are implemented.
> +
> +
> +3.3 Default behavior
> +---------------------
> +
> + The default value for SCHED_DEADLINE bandwidth is to have rt_runtime equal to
> + 95000. With rt_period equal to 1000000, by default, it means that -deadline
    ^^^^
    This seems to be 9.5% to me ;)

> + tasks can use at most 95%, multiplied by the number of CPUs that compose the
> + root_domain, for each root_domain.
> +
> + A -deadline task cannot fork.
> +
> +4. Tasks CPU affinity
> +=====================
> +
> + -deadline tasks cannot have an affinity mask smaller that the entire
> + root_domain they are created on. However, affinities can be specified
> + through the cpuset facility (Documentation/cgroups/cpusets.txt).

Does this mean that sched_deadline is a somewhat global implementation? Or 
rather, at what point in time will sched_deadline take all cpus in a set 
into consideration and when will it only look at the current CPU? Where is 
the line drawn between global and fully partitioned?

Also, how do you account the budget when a resource holder is boosted in 
order to release a resource? (IIRC, you use BWI, right?)

> +4.1 SCHED_DEADLINE and cpusets HOWTO
> +------------------------------------
> +
> + An example of a simple configuration (pin a -deadline task to CPU0)
> + follows (rt-app is used to create a -deadline task).
> +
> + mkdir /dev/cpuset
> + mount -t cgroup -o cpuset cpuset /dev/cpuset
> + cd /dev/cpuset
> + mkdir cpu0
> + echo 0 > cpu0/cpuset.cpus
> + echo 0 > cpu0/cpuset.mems
> + echo 1 > cpuset.cpu_exclusive
> + echo 0 > cpuset.sched_load_balance
> + echo 1 > cpu0/cpuset.cpu_exclusive
> + echo 1 > cpu0/cpuset.mem_exclusive
> + echo $$ > cpu0/tasks
> + rt-app -t 100000:10000:d:0 -D5 (it is now actually superfluous to specify
> + task affinity)
> +
> +5. Future plans
> +===============
> +
> + Still missing:
> +
> +  - refinements to deadline inheritance, especially regarding the possibility
> +    of retaining bandwidth isolation among non-interacting tasks. This is
> +    being studied from both theoretical and practical points of view, and
> +    hopefully we should be able to produce some demonstrative code soon;
> +  - (c)group based bandwidth management, and maybe scheduling;
> +  - access control for non-root users (and related security concerns to
> +    address), which is the best way to allow unprivileged use of the mechanisms
> +    and how to prevent non-root users "cheat" the system?
> +
> + As already discussed, we are planning also to merge this work with the EDF
> + throttling patches [https://lkml.org/lkml/2010/2/23/239] but we still are in
> + the preliminary phases of the merge and we really seek feedback that would
> + help us decide on the direction it should take.
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 0de2482..0dd5e09 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -351,7 +351,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,
>   * disrupting the schedulability of the system. Otherwise, we should
>   * refill the runtime and set the deadline a period in the future,
>   * because keeping the current (absolute) deadline of the task would
> - * result in breaking guarantees promised to other tasks.
> + * result in breaking guarantees promised to other tasks (refer to
> + * Documentation/scheduler/sched-deadline.txt for more informations).
>   *
>   * This function returns true if:
>   *
> -- 
> 1.7.9.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Henrik Austad

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-20 11:24 ` Henrik Austad
@ 2014-01-20 11:46   ` Peter Zijlstra
  2014-01-21 14:55     ` Steven Rostedt
  2014-01-20 12:15   ` Juri Lelli
  2014-01-21 10:21   ` Henrik Austad
  2 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2014-01-20 11:46 UTC (permalink / raw)
  To: Henrik Austad
  Cc: Juri Lelli, tglx, mingo, rostedt, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield, rob

On Mon, Jan 20, 2014 at 12:24:42PM +0100, Henrik Austad wrote:
> > +2. Task scheduling
> > +==================
> > +
> > + The typical -deadline task is composed of a computation phase (instance)
> > + which is activated on a periodic or sporadic fashion. The expected (maximum)
> > + duration of such computation is called the task's runtime; the time interval
> > + by which each instance needs to be completed is called the task's relative
> > + deadline. The task's absolute deadline is dynamically calculated as the
> > + time instant a task (or, more properly) activates plus the relative
> > + deadline.
> 
> activates - released?
> 
> Since real-time papers from different rt-campus around the academia insist 
> on using *slightly* different terminology, perhaps add a short dictionary 
> for some of the more common terms?

Oh gawd, they really don't conform with their definitions? I'd not
noticed that.

> D: relative deadline, typically N ms after release

You failed to define release :-) Its the 'wakeup' event, right? Where
the activation would be the moment we actually schedule the
job/instance?

> d: absolute deadline, the physical time when a given instance of a job 
>    needs to be completed
> R: relative release time, for periodic tasks, this is typically 'every N 
>    ms'
> r: absolute release time
> C: Worst-case execution time
> 
>    ...you get the idea.
> 
> Perhaps too academic?

I think not, one can never be too clear about these things.

> > +4. Tasks CPU affinity
> > +=====================
> > +
> > + -deadline tasks cannot have an affinity mask smaller that the entire
> > + root_domain they are created on. However, affinities can be specified
> > + through the cpuset facility (Documentation/cgroups/cpusets.txt).
> 
> Does this mean that sched_deadline is a somewhat global implementation? 

Yes, its a GEDF like thing.

> Or 
> rather, at what point in time will sched_deadline take all cpus in a set 
> into consideration and when will it only look at the current CPU? Where is 
> the line drawn between global and fully partitioned?

Its drawn >< that close to global.

So I think adding a SCHED_FLAG_DL_HARD option which would reduce to
strict per-cpu affinity and deliver 0 tardiness is a future work.

Its slightly complicated in that you cannot share the DL tree between
the GEDF and EDF jobs, because while a GEDF job might have an earlier
deadline an EDF job might have a lesser laxity. Not running the EDF job
in that case would result in a deadline miss (although, assuming we'd
still have function GEDF admission control, still have bounded
tardiness).

I'm not entirely sure we want to do anything in between the fully global
and per-cpu 'hard' mode -- random affinity masks seems like a terribly
hard problem.

NOTE: the 'global' nature is per root_domain, so cpusets can be used to
carve the thing into smaller balance sets.

> Also, how do you account the budget when a resource holder is boosted in 
> order to release a resource? (IIRC, you use BWI, right?)

Boosting is still work in progress, but yes, it does a BWI like thing.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-20 11:24 ` Henrik Austad
  2014-01-20 11:46   ` Peter Zijlstra
@ 2014-01-20 12:15   ` Juri Lelli
  2014-01-20 13:16     ` Henrik Austad
  2014-01-21 10:21   ` Henrik Austad
  2 siblings, 1 reply; 33+ messages in thread
From: Juri Lelli @ 2014-01-20 12:15 UTC (permalink / raw)
  To: Henrik Austad
  Cc: peterz, tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield, rob

On 01/20/2014 12:24 PM, Henrik Austad wrote:
> On Mon, Jan 20, 2014 at 11:40:40AM +0100, Juri Lelli wrote:
>> From: Dario Faggioli <raistlin@linux.it>
>>
>> Add in Documentation/scheduler/ some hints about the design
>> choices, the usage and the future possible developments of the
>> sched_dl scheduling class and of the SCHED_DEADLINE policy.
>>
>> Cc: bruce.ashfield@windriver.com
>> Cc: claudio@evidence.eu.com
>> Cc: darren@dvhart.com
>> Cc: dhaval.giani@gmail.com
>> Cc: fchecconi@gmail.com
>> Cc: fweisbec@gmail.com
>> Cc: harald.gustafsson@ericsson.com
>> Cc: hgu1972@gmail.com
>> Cc: insop.song@gmail.com
>> Cc: jkacur@redhat.com
>> Cc: johan.eker@ericsson.com
>> Cc: liming.wang@windriver.com
>> Cc: luca.abeni@unitn.it
>> Cc: michael@amarulasolutions.com
>> Cc: mingo@redhat.com
>> Cc: nicola.manica@disi.unitn.it
>> Cc: oleg@redhat.com
>> Cc: paulmck@linux.vnet.ibm.com
>> Cc: p.faure@akatech.ch
>> Cc: rob@landley.net
>> Cc: rostedt@goodmis.org
>> Cc: tglx@linutronix.de
>> Cc: tommaso.cucinotta@sssup.it
>> Cc: vincent.guittot@linaro.org
>> Signed-off-by: Dario Faggioli <raistlin@linux.it>
>> Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
>> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
>> ---
>>  Documentation/scheduler/00-INDEX           |    2 +
>>  Documentation/scheduler/sched-deadline.txt |  189 ++++++++++++++++++++++++++++
>>  kernel/sched/deadline.c                    |    3 +-
>>  3 files changed, 193 insertions(+), 1 deletion(-)
>>  create mode 100644 Documentation/scheduler/sched-deadline.txt
>>
>> diff --git a/Documentation/scheduler/00-INDEX b/Documentation/scheduler/00-INDEX
>> index d2651c4..46702e4 100644
>> --- a/Documentation/scheduler/00-INDEX
>> +++ b/Documentation/scheduler/00-INDEX
>> @@ -10,5 +10,7 @@ sched-nice-design.txt
>>  	- How and why the scheduler's nice levels are implemented.
>>  sched-rt-group.txt
>>  	- real-time group scheduling.
>> +sched-deadline.txt
>> +	- deadline scheduling.
>>  sched-stats.txt
>>  	- information on schedstats (Linux Scheduler Statistics).
>> diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt
>> new file mode 100644
>> index 0000000..8980de1
>> --- /dev/null
>> +++ b/Documentation/scheduler/sched-deadline.txt
>> @@ -0,0 +1,189 @@
>> +			  Deadline Task Scheduling
>> +			  ------------------------
>> +
>> +CONTENTS
>> +========
>> +
>> + 0. WARNING
>> + 1. Overview
>> + 2. Task scheduling
>> + 2. The Interface
>> + 3. Bandwidth management
>> +   3.1 System-wide settings
>> +   3.2 Task interface
>> +   3.4 Default behavior
>> + 4. Tasks CPU affinity
>> +   4.1 SCHED_DEADLINE and cpusets HOWTO
>> + 5. Future plans
>> +
>> +
>> +0. WARNING
>> +==========
>> +
>> + Fiddling with these settings can result in an unpredictable or even unstable
>> + system behavior. As for -rt (group) scheduling, it is assumed that root users
>> + know what they're doing.
>> +
>> +
>> +1. Overview
>> +===========
>> +
>> + The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
>> + basically an implementation of the Earliest Deadline First (EDF) scheduling
>> + algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
>> + that makes it possible to isolate the behavior of tasks between each other.
> 
> 
> Why not something along the lines of giving a task a guaranteed slice of 
> the CPU as well as making sure that a task takes no more than a given 
> slice? I.e. making the point of a lower as well as an upper limit of CPU 
> usage.
> 

I'd keep the term "isolate" in, as is one of the strong points on having all
this merged in. But, we could add something along the lines you suggested:

"that makes it possible to isolate the behavior of tasks between each other.
IOW, isolation means that we can reserve a task a guaranteed percentage of the
CPU and, at the same time, we ensure that the task takes no more than the
percentage reserved."

>> +2. Task scheduling
>> +==================
>> +
>> + The typical -deadline task is composed of a computation phase (instance)
>> + which is activated on a periodic or sporadic fashion. The expected (maximum)
>> + duration of such computation is called the task's runtime; the time interval
>> + by which each instance needs to be completed is called the task's relative
>> + deadline. The task's absolute deadline is dynamically calculated as the
>> + time instant a task (or, more properly) activates plus the relative
>> + deadline.
> 
> activates - released?
> 

I'd keep (modifying a bit):

"time instant a task activates plus the relative deadline."

This is probably the nearest thing to what is implemented that we can say
(without entering into the theory too much), a task that "activates" can mean
that it is first released, enqueued, woken-up, etc.

> Since real-time papers from different rt-campus around the academia insist 
> on using *slightly* different terminology, perhaps add a short dictionary 
> for some of the more common terms?
> 
> D: relative deadline, typically N ms after release
> d: absolute deadline, the physical time when a given instance of a job 
>    needs to be completed
> R: relative release time, for periodic tasks, this is typically 'every N 
>    ms'
> r: absolute release time
> C: Worst-case execution time
> 
>    ...you get the idea.
> 
> Perhaps too academic?
> 

Mmm.. we don't go too deep in theory (we just refer to papers below), could
adding a dictionary only complicate things? I mean, if you add a term you have
to explain its meaning related to the task-model you are using. And this means
you have to also define the task model and so on. Who wants more details
already finds them in the papers below.

>> + The EDF[1] algorithm selects the task with the smallest absolute deadline as
>> + the one to be executed first, while the CBS[2,3] ensures that each task runs
>> + for at most its runtime every period, avoiding any interference between
>> + different tasks (bandwidth isolation).
>> + Thanks to this feature, also tasks that do not strictly comply with the
>> + computational model described above can effectively use the new policy.
>> + IOW, there are no limitations on what kind of task can exploit this new
>> + scheduling discipline, even if it must be said that it is particularly
>> + suited for periodic or sporadic tasks that need guarantees on their
>> + timing behavior, e.g., multimedia, streaming, control applications, etc.
> 
> I assume that ties are broken arbitrarily and that a running task is not 
> preempted for another task with equal deadline. Correct?
> 

Yes.

> This would be a nice point to include in this doc methinks.
> 
>> + References:
>> +  1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogram-
>> +      ming in a hard-real-time environment. Journal of the Association for
>> +      Computing Machinery, 20(1), 1973.
>> +  2 - L. Abeni , G. Buttazzo. Integrating Multimedia Applications in Hard
>> +      Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems
>> +      Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
>> +  3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
>> +      Technical Report. http://xoomer.virgilio.it/lucabe72/pubs/tr-98-01.ps
>> +
>> +3. Bandwidth management
>> +=======================
>> +
>> + In order for the -deadline scheduling to be effective and useful, it is
>> + important to have some method to keep the allocation of the available CPU
>> + bandwidth to the tasks under control.
>> + This is usually called "admission control" and if it is not performed at all,
>> + no guarantee can be given on the actual scheduling of the -deadline tasks.
>> +
>> + Since when RT-throttling has been introduced each task group has a bandwidth
>> + associated, calculated as a certain amount of runtime over a period.
>> + Moreover, to make it possible to manipulate such bandwidth, readable/writable
>> + controls have been added to both procfs (for system wide settings) and cgroupfs
>> + (for per-group settings).
>> + Therefore, the same interface is being used for controlling the bandwidth
>> + distrubution to -deadline tasks.
>> +
>> + However, more discussion is needed in order to figure out how we want to manage
>> + SCHED_DEADLINE bandwidth at the task group level. Therefore, SCHED_DEADLINE
>> + uses (for now) a less sophisticated, but actually very sensible, mechanism to
>> + ensure that a certain utilization cap is not overcome per each root_domain.
>> +
>> + Another main difference between deadline bandwidth management and RT-throttling
>> + is that -deadline tasks have bandwidth on their own (while -rt ones don't!),
>> + and thus we don't need an higher level throttling mechanism to enforce the
>> + desired bandwidth.
>> +
>> +3.1 System wide settings
>> +------------------------
>> +
>> + The system wide settings are configured under the /proc virtual file system.
>> +
>> + For now the -rt knobs are used for dl admission control and the -deadline
>> + runtime is accounted against the -rt runtime. We realise that this isn't
>> + entirely desirable; however, it is better to have a small interface for now,
>> + and be able to change it easily later. The ideal situation (see 5.) is to run
>> + -rt tasks from a -deadline server; in which case the -rt bandwidth is a direct
>> + subset of dl_bw.
>> +
>> + This means that, for a root_domain comprising M CPUs, -deadline tasks
>> + can be created while the sum of their bandwidths stays below:
>> +
>> +   M * (sched_rt_runtime_us / sched_rt_period_us)
>> +
>> + It is also possible to disable this bandwidth management logic, and
>> + be thus free of oversubscribing the system up to any arbitrary level.
>> + This is done by writing -1 in /proc/sys/kernel/sched_rt_runtime_us.
>> +
>> +
>> +3.2 Task interface
>> +------------------
>> +
>> + Specifying a periodic/sporadic task that executes for a given amount of
>> + runtime at each instance, and that is scheduled according to the urgency of
>> + its own timing constraints needs, in general, a way of declaring:
>> +  - a (maximum/typical) instance execution time,
>> +  - a minimum interval between consecutive instances,
>> +  - a time constraint by which each instance must be completed.
>> +
>> + Therefore:
>> +  * a new struct sched_attr, containing all the necessary fields is
>> +    provided;
>> +  * the new scheduling related syscalls that manipulate it, i.e.,
>> +    sched_setattr() and sched_getattr() are implemented.
>> +
>> +
>> +3.3 Default behavior
>> +---------------------
>> +
>> + The default value for SCHED_DEADLINE bandwidth is to have rt_runtime equal to
>> + 95000. With rt_period equal to 1000000, by default, it means that -deadline
>     ^^^^
>     This seems to be 9.5% to me ;)
> 

Argh! s/95000/950000/

>> + tasks can use at most 95%, multiplied by the number of CPUs that compose the
>> + root_domain, for each root_domain.
>> +
>> + A -deadline task cannot fork.
>> +
>> +4. Tasks CPU affinity
>> +=====================
>> +
>> + -deadline tasks cannot have an affinity mask smaller that the entire
>> + root_domain they are created on. However, affinities can be specified
>> + through the cpuset facility (Documentation/cgroups/cpusets.txt).
> 
> Does this mean that sched_deadline is a somewhat global implementation? Or 
> rather, at what point in time will sched_deadline take all cpus in a set 
> into consideration and when will it only look at the current CPU? Where is 
> the line drawn between global and fully partitioned?
> 
> Also, how do you account the budget when a resource holder is boosted in 
> order to release a resource? (IIRC, you use BWI, right?)
> 

Peter already replied about this.

Thanks,

- Juri

>> +4.1 SCHED_DEADLINE and cpusets HOWTO
>> +------------------------------------
>> +
>> + An example of a simple configuration (pin a -deadline task to CPU0)
>> + follows (rt-app is used to create a -deadline task).
>> +
>> + mkdir /dev/cpuset
>> + mount -t cgroup -o cpuset cpuset /dev/cpuset
>> + cd /dev/cpuset
>> + mkdir cpu0
>> + echo 0 > cpu0/cpuset.cpus
>> + echo 0 > cpu0/cpuset.mems
>> + echo 1 > cpuset.cpu_exclusive
>> + echo 0 > cpuset.sched_load_balance
>> + echo 1 > cpu0/cpuset.cpu_exclusive
>> + echo 1 > cpu0/cpuset.mem_exclusive
>> + echo $$ > cpu0/tasks
>> + rt-app -t 100000:10000:d:0 -D5 (it is now actually superfluous to specify
>> + task affinity)
>> +
>> +5. Future plans
>> +===============
>> +
>> + Still missing:
>> +
>> +  - refinements to deadline inheritance, especially regarding the possibility
>> +    of retaining bandwidth isolation among non-interacting tasks. This is
>> +    being studied from both theoretical and practical points of view, and
>> +    hopefully we should be able to produce some demonstrative code soon;
>> +  - (c)group based bandwidth management, and maybe scheduling;
>> +  - access control for non-root users (and related security concerns to
>> +    address), which is the best way to allow unprivileged use of the mechanisms
>> +    and how to prevent non-root users "cheat" the system?
>> +
>> + As already discussed, we are planning also to merge this work with the EDF
>> + throttling patches [https://lkml.org/lkml/2010/2/23/239] but we still are in
>> + the preliminary phases of the merge and we really seek feedback that would
>> + help us decide on the direction it should take.
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index 0de2482..0dd5e09 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -351,7 +351,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,
>>   * disrupting the schedulability of the system. Otherwise, we should
>>   * refill the runtime and set the deadline a period in the future,
>>   * because keeping the current (absolute) deadline of the task would
>> - * result in breaking guarantees promised to other tasks.
>> + * result in breaking guarantees promised to other tasks (refer to
>> + * Documentation/scheduler/sched-deadline.txt for more informations).
>>   *
>>   * This function returns true if:
>>   *
>> -- 
>> 1.7.9.5
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-20 10:40 [PATCH] sched/deadline: Add sched_dl documentation Juri Lelli
  2014-01-20 11:24 ` Henrik Austad
@ 2014-01-20 12:25 ` Luca Abeni
  1 sibling, 0 replies; 33+ messages in thread
From: Luca Abeni @ 2014-01-20 12:25 UTC (permalink / raw)
  To: Juri Lelli, peterz, tglx
  Cc: mingo, rostedt, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, dhaval.giani, hgu1972, paulmck, raistlin,
	insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield, rob

Hi Juri,

On 01/20/2014 11:40 AM, Juri Lelli wrote:
> From: Dario Faggioli <raistlin@linux.it>
>
> Add in Documentation/scheduler/ some hints about the design
> choices, the usage and the future possible developments of the
> sched_dl scheduling class and of the SCHED_DEADLINE policy.
[...]
> + References:
> +  1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogram-
> +      ming in a hard-real-time environment. Journal of the Association for
> +      Computing Machinery, 20(1), 1973.
> +  2 - L. Abeni , G. Buttazzo. Integrating Multimedia Applications in Hard
> +      Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems
> +      Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
> +  3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
> +      Technical Report. http://xoomer.virgilio.it/lucabe72/pubs/tr-98-01.ps
I do not know how stable the "xoomer.virtgilio.it" URL is (and I lost the password
for this web site :).
I can host the technical report on unitn.it; I think it's more reliable. If you think
this is a good idea, I'll provide you with an updated URL (also, this is an old
document, written in "not so good" English... If there is interest, I can search for
the latex sources and update the document).



				Luca

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-20 12:15   ` Juri Lelli
@ 2014-01-20 13:16     ` Henrik Austad
  2014-01-20 13:39       ` Luca Abeni
  0 siblings, 1 reply; 33+ messages in thread
From: Henrik Austad @ 2014-01-20 13:16 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield, rob

[-- Attachment #1: Type: text/plain, Size: 13704 bytes --]

On Mon, Jan 20, 2014 at 01:15:51PM +0100, Juri Lelli wrote:
> On 01/20/2014 12:24 PM, Henrik Austad wrote:
> > On Mon, Jan 20, 2014 at 11:40:40AM +0100, Juri Lelli wrote:
> >> From: Dario Faggioli <raistlin@linux.it>
> >>
> >> Add in Documentation/scheduler/ some hints about the design
> >> choices, the usage and the future possible developments of the
> >> sched_dl scheduling class and of the SCHED_DEADLINE policy.
> >>
> >> Cc: bruce.ashfield@windriver.com
> >> Cc: claudio@evidence.eu.com
> >> Cc: darren@dvhart.com
> >> Cc: dhaval.giani@gmail.com
> >> Cc: fchecconi@gmail.com
> >> Cc: fweisbec@gmail.com
> >> Cc: harald.gustafsson@ericsson.com
> >> Cc: hgu1972@gmail.com
> >> Cc: insop.song@gmail.com
> >> Cc: jkacur@redhat.com
> >> Cc: johan.eker@ericsson.com
> >> Cc: liming.wang@windriver.com
> >> Cc: luca.abeni@unitn.it
> >> Cc: michael@amarulasolutions.com
> >> Cc: mingo@redhat.com
> >> Cc: nicola.manica@disi.unitn.it
> >> Cc: oleg@redhat.com
> >> Cc: paulmck@linux.vnet.ibm.com
> >> Cc: p.faure@akatech.ch
> >> Cc: rob@landley.net
> >> Cc: rostedt@goodmis.org
> >> Cc: tglx@linutronix.de
> >> Cc: tommaso.cucinotta@sssup.it
> >> Cc: vincent.guittot@linaro.org
> >> Signed-off-by: Dario Faggioli <raistlin@linux.it>
> >> Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
> >> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> >> ---
> >>  Documentation/scheduler/00-INDEX           |    2 +
> >>  Documentation/scheduler/sched-deadline.txt |  189 ++++++++++++++++++++++++++++
> >>  kernel/sched/deadline.c                    |    3 +-
> >>  3 files changed, 193 insertions(+), 1 deletion(-)
> >>  create mode 100644 Documentation/scheduler/sched-deadline.txt
> >>
> >> diff --git a/Documentation/scheduler/00-INDEX b/Documentation/scheduler/00-INDEX
> >> index d2651c4..46702e4 100644
> >> --- a/Documentation/scheduler/00-INDEX
> >> +++ b/Documentation/scheduler/00-INDEX
> >> @@ -10,5 +10,7 @@ sched-nice-design.txt
> >>  	- How and why the scheduler's nice levels are implemented.
> >>  sched-rt-group.txt
> >>  	- real-time group scheduling.
> >> +sched-deadline.txt
> >> +	- deadline scheduling.
> >>  sched-stats.txt
> >>  	- information on schedstats (Linux Scheduler Statistics).
> >> diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt
> >> new file mode 100644
> >> index 0000000..8980de1
> >> --- /dev/null
> >> +++ b/Documentation/scheduler/sched-deadline.txt
> >> @@ -0,0 +1,189 @@
> >> +			  Deadline Task Scheduling
> >> +			  ------------------------
> >> +
> >> +CONTENTS
> >> +========
> >> +
> >> + 0. WARNING
> >> + 1. Overview
> >> + 2. Task scheduling
> >> + 2. The Interface
> >> + 3. Bandwidth management
> >> +   3.1 System-wide settings
> >> +   3.2 Task interface
> >> +   3.4 Default behavior
> >> + 4. Tasks CPU affinity
> >> +   4.1 SCHED_DEADLINE and cpusets HOWTO
> >> + 5. Future plans
> >> +
> >> +
> >> +0. WARNING
> >> +==========
> >> +
> >> + Fiddling with these settings can result in an unpredictable or even unstable
> >> + system behavior. As for -rt (group) scheduling, it is assumed that root users
> >> + know what they're doing.
> >> +
> >> +
> >> +1. Overview
> >> +===========
> >> +
> >> + The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
> >> + basically an implementation of the Earliest Deadline First (EDF) scheduling
> >> + algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
> >> + that makes it possible to isolate the behavior of tasks between each other.
> > 
> > 
> > Why not something along the lines of giving a task a guaranteed slice of 
> > the CPU as well as making sure that a task takes no more than a given 
> > slice? I.e. making the point of a lower as well as an upper limit of CPU 
> > usage.
> > 
> 
> I'd keep the term "isolate" in, as is one of the strong points on having all
> this merged in. But, we could add something along the lines you suggested:
> 
> "that makes it possible to isolate the behavior of tasks between each other.
> IOW, isolation means that we can reserve a task a guaranteed percentage of the
> CPU and, at the same time, we ensure that the task takes no more than the
> percentage reserved."

+1 from me!

> >> +2. Task scheduling
> >> +==================
> >> +
> >> + The typical -deadline task is composed of a computation phase (instance)
> >> + which is activated on a periodic or sporadic fashion. The expected (maximum)
> >> + duration of such computation is called the task's runtime; the time interval
> >> + by which each instance needs to be completed is called the task's relative
> >> + deadline. The task's absolute deadline is dynamically calculated as the
> >> + time instant a task (or, more properly) activates plus the relative
> >> + deadline.
> > 
> > activates - released?
> > 
> 
> I'd keep (modifying a bit):
> 
> "time instant a task activates plus the relative deadline."
> 
> This is probably the nearest thing to what is implemented that we can say
> (without entering into the theory too much), a task that "activates" can mean
> that it is first released, enqueued, woken-up, etc.

So, if we look at release (yes, I'm avoiding activates for a little while) 
as the time at the *beginning* of a new period, then, and only then should 
the *absolute* deadline be computed.

If you den move on to use 'activate' as a term for when a task becomes 
eligble to run, then 'release' becomes a subset of 'activate', and you 
should only compute the absolute deadline at that time. Did that make 
sense?

> > Since real-time papers from different rt-campus around the academia insist 
> > on using *slightly* different terminology, perhaps add a short dictionary 
> > for some of the more common terms?
> > 
> > D: relative deadline, typically N ms after release
> > d: absolute deadline, the physical time when a given instance of a job 
> >    needs to be completed
> > R: relative release time, for periodic tasks, this is typically 'every N 
> >    ms'
> > r: absolute release time
> > C: Worst-case execution time
> > 
> >    ...you get the idea.
> > 
> > Perhaps too academic?
> > 
> 
> Mmm.. we don't go too deep in theory (we just refer to papers below), could
> adding a dictionary only complicate things? I mean, if you add a term you have
> to explain its meaning related to the task-model you are using. And this means
> you have to also define the task model and so on. Who wants more details
> already finds them in the papers below.

Honestly, how many readers of sched_deadline.txt are going to read those 
papers to have the terms defined?

Either way, having a *short* list of terms we use _in_ sched_deadline would 
help avoiding confusion. For instance, what 'activate' actually means.

> >> + The EDF[1] algorithm selects the task with the smallest absolute deadline as
> >> + the one to be executed first, while the CBS[2,3] ensures that each task runs
> >> + for at most its runtime every period, avoiding any interference between
> >> + different tasks (bandwidth isolation).
> >> + Thanks to this feature, also tasks that do not strictly comply with the
> >> + computational model described above can effectively use the new policy.
> >> + IOW, there are no limitations on what kind of task can exploit this new
> >> + scheduling discipline, even if it must be said that it is particularly
> >> + suited for periodic or sporadic tasks that need guarantees on their
> >> + timing behavior, e.g., multimedia, streaming, control applications, etc.
> > 
> > I assume that ties are broken arbitrarily and that a running task is not 
> > preempted for another task with equal deadline. Correct?
> > 
> 
> Yes.
> 
> > This would be a nice point to include in this doc methinks.

how about adding something like that?

"""
Whenever 2 (or more) tasks happen to have the exact same absolute deadine, 
ties are broken arbitrarily. One notable exception occur when one of the 
tasks are already running, in which case no preemption occurs.
"""

> >> + References:
> >> +  1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogram-
> >> +      ming in a hard-real-time environment. Journal of the Association for
> >> +      Computing Machinery, 20(1), 1973.
> >> +  2 - L. Abeni , G. Buttazzo. Integrating Multimedia Applications in Hard
> >> +      Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems
> >> +      Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
> >> +  3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
> >> +      Technical Report. http://xoomer.virgilio.it/lucabe72/pubs/tr-98-01.ps
> >> +
> >> +3. Bandwidth management
> >> +=======================
> >> +
> >> + In order for the -deadline scheduling to be effective and useful, it is
> >> + important to have some method to keep the allocation of the available CPU
> >> + bandwidth to the tasks under control.
> >> + This is usually called "admission control" and if it is not performed at all,
> >> + no guarantee can be given on the actual scheduling of the -deadline tasks.
> >> +
> >> + Since when RT-throttling has been introduced each task group has a bandwidth
> >> + associated, calculated as a certain amount of runtime over a period.
> >> + Moreover, to make it possible to manipulate such bandwidth, readable/writable
> >> + controls have been added to both procfs (for system wide settings) and cgroupfs
> >> + (for per-group settings).
> >> + Therefore, the same interface is being used for controlling the bandwidth
> >> + distrubution to -deadline tasks.
> >> +
> >> + However, more discussion is needed in order to figure out how we want to manage
> >> + SCHED_DEADLINE bandwidth at the task group level. Therefore, SCHED_DEADLINE
> >> + uses (for now) a less sophisticated, but actually very sensible, mechanism to
> >> + ensure that a certain utilization cap is not overcome per each root_domain.
> >> +
> >> + Another main difference between deadline bandwidth management and RT-throttling
> >> + is that -deadline tasks have bandwidth on their own (while -rt ones don't!),
> >> + and thus we don't need an higher level throttling mechanism to enforce the
> >> + desired bandwidth.
> >> +
> >> +3.1 System wide settings
> >> +------------------------
> >> +
> >> + The system wide settings are configured under the /proc virtual file system.
> >> +
> >> + For now the -rt knobs are used for dl admission control and the -deadline
> >> + runtime is accounted against the -rt runtime. We realise that this isn't
> >> + entirely desirable; however, it is better to have a small interface for now,
> >> + and be able to change it easily later. The ideal situation (see 5.) is to run
> >> + -rt tasks from a -deadline server; in which case the -rt bandwidth is a direct
> >> + subset of dl_bw.
> >> +
> >> + This means that, for a root_domain comprising M CPUs, -deadline tasks
> >> + can be created while the sum of their bandwidths stays below:
> >> +
> >> +   M * (sched_rt_runtime_us / sched_rt_period_us)
> >> +
> >> + It is also possible to disable this bandwidth management logic, and
> >> + be thus free of oversubscribing the system up to any arbitrary level.
> >> + This is done by writing -1 in /proc/sys/kernel/sched_rt_runtime_us.
> >> +
> >> +
> >> +3.2 Task interface
> >> +------------------
> >> +
> >> + Specifying a periodic/sporadic task that executes for a given amount of
> >> + runtime at each instance, and that is scheduled according to the urgency of
> >> + its own timing constraints needs, in general, a way of declaring:
> >> +  - a (maximum/typical) instance execution time,
> >> +  - a minimum interval between consecutive instances,
> >> +  - a time constraint by which each instance must be completed.
> >> +
> >> + Therefore:
> >> +  * a new struct sched_attr, containing all the necessary fields is
> >> +    provided;
> >> +  * the new scheduling related syscalls that manipulate it, i.e.,
> >> +    sched_setattr() and sched_getattr() are implemented.
> >> +
> >> +
> >> +3.3 Default behavior
> >> +---------------------
> >> +
> >> + The default value for SCHED_DEADLINE bandwidth is to have rt_runtime equal to
> >> + 95000. With rt_period equal to 1000000, by default, it means that -deadline
> >     ^^^^
> >     This seems to be 9.5% to me ;)
> > 
> 
> Argh! s/95000/950000/
> 
> >> + tasks can use at most 95%, multiplied by the number of CPUs that compose the
> >> + root_domain, for each root_domain.
> >> +
> >> + A -deadline task cannot fork.
> >> +
> >> +4. Tasks CPU affinity
> >> +=====================
> >> +
> >> + -deadline tasks cannot have an affinity mask smaller that the entire
> >> + root_domain they are created on. However, affinities can be specified
> >> + through the cpuset facility (Documentation/cgroups/cpusets.txt).
> > 
> > Does this mean that sched_deadline is a somewhat global implementation? Or 
> > rather, at what point in time will sched_deadline take all cpus in a set 
> > into consideration and when will it only look at the current CPU? Where is 
> > the line drawn between global and fully partitioned?
> > 
> > Also, how do you account the budget when a resource holder is boosted in 
> > order to release a resource? (IIRC, you use BWI, right?)
> > 
> 
> Peter already replied about this.
> 
> Thanks,
> 
> - Juri

Feel free to add a reviewed-by from me on this one :)

-- 
Henrik Austad

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-20 13:16     ` Henrik Austad
@ 2014-01-20 13:39       ` Luca Abeni
  2014-01-21 10:20         ` Henrik Austad
  0 siblings, 1 reply; 33+ messages in thread
From: Luca Abeni @ 2014-01-20 13:39 UTC (permalink / raw)
  To: Henrik Austad, Juri Lelli
  Cc: peterz, tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield, rob

Hi all,

On 01/20/2014 02:16 PM, Henrik Austad wrote:
[...]
>>>> + The typical -deadline task is composed of a computation phase (instance)
>>>> + which is activated on a periodic or sporadic fashion. The expected (maximum)
>>>> + duration of such computation is called the task's runtime; the time interval
>>>> + by which each instance needs to be completed is called the task's relative
>>>> + deadline. The task's absolute deadline is dynamically calculated as the
>>>> + time instant a task (or, more properly) activates plus the relative
>>>> + deadline.
>>>
>>> activates - released?
>>>
>>
>> I'd keep (modifying a bit):
>>
>> "time instant a task activates plus the relative deadline."
>>
>> This is probably the nearest thing to what is implemented that we can say
>> (without entering into the theory too much), a task that "activates" can mean
>> that it is first released, enqueued, woken-up, etc.
>
> So, if we look at release (yes, I'm avoiding activates for a little while)
> as the time at the *beginning* of a new period, then, and only then should
> the *absolute* deadline be computed.
>
> If you den move on to use 'activate' as a term for when a task becomes
> eligble to run, then 'release' becomes a subset of 'activate', and you
> should only compute the absolute deadline at that time. Did that make
> sense?
I think things are a little bit complex here, because there are 2 different
"deadlines" we can think about:
- the "jobs deadlines" (the absolute job deadline can be computed at job
   arrival, as the arrival time + the relative deadline). These are generally
   used for performance metrics, to see if a job is respecting its timing
   constraints or not
- the "scheduling deadlines", which are the ones used by the scheduler to
   schedule the tasks. These are computed at tasks' wake-up time - notice that
   for self-suspending jobs there can be wake-up times that are not job arrival
   times. And are assigned according to the rules described in the CBS paper.
I think this can easily become very confusing, so I currently have no concrete
proposals for improving the documentation... But I wanted to point out that
things here are more complex than in the "traditional" real-time task model.

Maybe a solution could be to simply describe scheduling deadlines (which are
what sched_deadline uses) without going into the details of jobs' deadlines.
So, the document could avoid talking about instances (or jobs), and can say
that a task is guaranteed to receive "runtime" time units every "period" time
units (and these "runtime" time units are available within "deadline" time
units from the beginning of the period). Every time the task wakes up, the
scheduler computes a scheduling deadline consistent with this constraint,
and tasks are scheduled using EDF on these scheduling deadlines.
Every time "runtime" time units are consumed in a period, the scheduling
deadline is postponed by a period.

This is of course an approximative description, but I think it can give an
intuitive idea of how the scheduler works, and about the meaning of the three
scheduling parameters.



				Luca

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-20 13:39       ` Luca Abeni
@ 2014-01-21 10:20         ` Henrik Austad
  2014-01-21 11:35           ` Luca Abeni
  0 siblings, 1 reply; 33+ messages in thread
From: Henrik Austad @ 2014-01-21 10:20 UTC (permalink / raw)
  To: Luca Abeni
  Cc: Juri Lelli, peterz, tglx, mingo, rostedt, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield, rob

On Mon, Jan 20, 2014 at 02:39:29PM +0100, Luca Abeni wrote:
> Hi all,
> 
> On 01/20/2014 02:16 PM, Henrik Austad wrote:
> [...]
> >>>>+ The typical -deadline task is composed of a computation phase (instance)
> >>>>+ which is activated on a periodic or sporadic fashion. The expected (maximum)
> >>>>+ duration of such computation is called the task's runtime; the time interval
> >>>>+ by which each instance needs to be completed is called the task's relative
> >>>>+ deadline. The task's absolute deadline is dynamically calculated as the
> >>>>+ time instant a task (or, more properly) activates plus the relative
> >>>>+ deadline.
> >>>
> >>>activates - released?
> >>>
> >>
> >>I'd keep (modifying a bit):
> >>
> >>"time instant a task activates plus the relative deadline."
> >>
> >>This is probably the nearest thing to what is implemented that we can say
> >>(without entering into the theory too much), a task that "activates" can mean
> >>that it is first released, enqueued, woken-up, etc.
> >
> >So, if we look at release (yes, I'm avoiding activates for a little while)
> >as the time at the *beginning* of a new period, then, and only then should
> >the *absolute* deadline be computed.
> >
> >If you den move on to use 'activate' as a term for when a task becomes
> >eligble to run, then 'release' becomes a subset of 'activate', and you
> >should only compute the absolute deadline at that time. Did that make
> >sense?

> I think things are a little bit complex here, because there are 2 different
> "deadlines" we can think about:
> - the "jobs deadlines" (the absolute job deadline can be computed at job
>   arrival, as the arrival time + the relative deadline). These are generally
>   used for performance metrics, to see if a job is respecting its timing
>   constraints or not
>
> - the "scheduling deadlines", which are the ones used by the scheduler to
>   schedule the tasks. These are computed at tasks' wake-up time - notice that
>   for self-suspending jobs there can be wake-up times that are not job arrival
>   times. And are assigned according to the rules described in the CBS paper.
>
> I think this can easily become very confusing, so I currently have no concrete
> proposals for improving the documentation... But I wanted to point out that
> things here are more complex than in the "traditional" real-time task model.

Traditional real-time as in the current real-time model used in Linux, or 
traditional as in EDF terminology used by Liu & Layland?

> Maybe a solution could be to simply describe scheduling deadlines (which are
> what sched_deadline uses) without going into the details of jobs' deadlines.

Huh?

We definately need a short dictionary. In fact, I'd like to have a 
paragraph describing what deadline driven scheduling is.

For instance, I'm getting *Really* confused wrt to arrival time - you seem 
to wrap several types of arrival into the same name,  yet treat it 
differently.

- arrival: when a job gets ready to run for the first time
- arrival: when a job unblocks on some resource

Or did I misunderstand?

So, the terminology I'm used to, an attempt to write up something to 
clear up the terminology and establish common grounds. Please 
edit/elaborate or shoot down as appropriate:

"""
N. Crashcourse in deadline-terminology:

In a system, we typically look at a set of tasks. In Linux-kernel 
terminology, a particular task is normally a thread. When a thread is 
ready to run, we say that a *job* of that task is running. It is 
perhaps easiest to grasp this if one think only of periodic tasks, i.e. a 
thread that need to run for 2ms every 10ms. Normally, we want one job to 
finish before a new (of the same task) start, which implies that the 
deadline for this task is also 10ms. Once this is clear, expanding one's 
mind to aperiodic and/or sporadic tasks is easier.

* Periodic task: a task that needs to run for a while every N us.
* Sporadic task: a tasks that needs tor un for a while at most every N us 
  (jobs start no closer than N us apart)
* Aperiodic task: a task that have no particular period, but once 
  released, needs to complete before a given deadline.

* Set of all deadline-tasks in the system: \tau
* One particluar task: \tau_i
* The j'th job of task i: \tau_{i,j}
* The (relative) deadline of task i: D_i
* The (periodic, relative) release time of task i: R_i
* Required execution time a tasks's job needs to complete. C_i
* Absolute release-time, the time when a new job is ready (when a thread is 
  woken up for a new period).
* The absolute deadline of a job, the actual point in time where a job 
  needs to be finished. This is what the scheduler looks at when it picks 
  the next thread to run.

We can now construct a 3-tuple describing a perioic and sporadic tasks: 

          (C_i, R_i, D_i).

These 3 items is what you can use to describe your task to the scheduler.
"""

> So, the document could avoid talking about instances (or jobs), and can say
> that a task is guaranteed to receive "runtime" time units every "period" time
> units (and these "runtime" time units are available within "deadline" time
> units from the beginning of the period). Every time the task wakes up, the
> scheduler computes a scheduling deadline consistent with this constraint,
> and tasks are scheduled using EDF on these scheduling deadlines.
> Every time "runtime" time units are consumed in a period, the scheduling
> deadline is postponed by a period.

What is wrong with using the CORRECT TERMINOLOGY? People looking at using 
sched_deadline _need_ to understand what a _deadline_ scheduler is.

If the only goal of sched_deadline is to act as a bandwidth-gauge, then 
fine, but *I* want to use sched_deadline for systems that HAVE DEADLINES. I 
do NOT want to mess around with mapping deadlines to priorities in order to 
meet my deadlines. I suspect others would like to use sched_deadline for 
the same.

> This is of course an approximative description, but I think it can give an
> intuitive idea of how the scheduler works, and about the meaning of the three
> scheduling parameters.

Look, I'm not in favour of turning sched_deadline.txt into an academic 
paper, but it is clear that we need to establish a baseline here, and then 
we need to include tasks, instances/jobs, deadline, arrival/release and so 
on.

-- 
Henrik Austad

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-20 11:24 ` Henrik Austad
  2014-01-20 11:46   ` Peter Zijlstra
  2014-01-20 12:15   ` Juri Lelli
@ 2014-01-21 10:21   ` Henrik Austad
  2 siblings, 0 replies; 33+ messages in thread
From: Henrik Austad @ 2014-01-21 10:21 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield, rob

On Mon, Jan 20, 2014 at 12:24:42PM +0100, Henrik Austad wrote:
> On Mon, Jan 20, 2014 at 11:40:40AM +0100, Juri Lelli wrote:
> > From: Dario Faggioli <raistlin@linux.it>
> > 
> > Add in Documentation/scheduler/ some hints about the design
> > choices, the usage and the future possible developments of the
> > sched_dl scheduling class and of the SCHED_DEADLINE policy.
> > [...]


> > +++ b/Documentation/scheduler/sched-deadline.txt
> > @@ -0,0 +1,189 @@
> > +			  Deadline Task Scheduling
> > +			  ------------------------
> > +
> > +CONTENTS
> > +========
> > +
> > + 0. WARNING
> > + 1. Overview
> > + 2. Task scheduling
> > + 2. The Interface
      ^^^^^
 I just noticed, where did this one go?

-- 
H

> > + 3. Bandwidth management
> > +   3.1 System-wide settings
> > +   3.2 Task interface
> > +   3.4 Default behavior
> > + 4. Tasks CPU affinity
> > +   4.1 SCHED_DEADLINE and cpusets HOWTO
> > + 5. Future plans
> > +
> > +
> > +0. WARNING
> > +==========
> > +
> > + Fiddling with these settings can result in an unpredictable or even unstable
> > + system behavior. As for -rt (group) scheduling, it is assumed that root users
> > + know what they're doing.
> > +
> > +
> > +1. Overview
> > +===========
> > +
> > + The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
> > + basically an implementation of the Earliest Deadline First (EDF) scheduling
> > + algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
> > + that makes it possible to isolate the behavior of tasks between each other.
> 
> 
> Why not something along the lines of giving a task a guaranteed slice of 
> the CPU as well as making sure that a task takes no more than a given 
> slice? I.e. making the point of a lower as well as an upper limit of CPU 
> usage.
> 
> > +2. Task scheduling
> > +==================
> > +
> > + The typical -deadline task is composed of a computation phase (instance)
> > + which is activated on a periodic or sporadic fashion. The expected (maximum)
> > + duration of such computation is called the task's runtime; the time interval
> > + by which each instance needs to be completed is called the task's relative
> > + deadline. The task's absolute deadline is dynamically calculated as the
> > + time instant a task (or, more properly) activates plus the relative
> > + deadline.
> 
> activates - released?
> 
> Since real-time papers from different rt-campus around the academia insist 
> on using *slightly* different terminology, perhaps add a short dictionary 
> for some of the more common terms?
> 
> D: relative deadline, typically N ms after release
> d: absolute deadline, the physical time when a given instance of a job 
>    needs to be completed
> R: relative release time, for periodic tasks, this is typically 'every N 
>    ms'
> r: absolute release time
> C: Worst-case execution time
> 
>    ...you get the idea.
> 
> Perhaps too academic?
> 
> > + The EDF[1] algorithm selects the task with the smallest absolute deadline as
> > + the one to be executed first, while the CBS[2,3] ensures that each task runs
> > + for at most its runtime every period, avoiding any interference between
> > + different tasks (bandwidth isolation).
> > + Thanks to this feature, also tasks that do not strictly comply with the
> > + computational model described above can effectively use the new policy.
> > + IOW, there are no limitations on what kind of task can exploit this new
> > + scheduling discipline, even if it must be said that it is particularly
> > + suited for periodic or sporadic tasks that need guarantees on their
> > + timing behavior, e.g., multimedia, streaming, control applications, etc.
> 
> I assume that ties are broken arbitrarily and that a running task is not 
> preempted for another task with equal deadline. Correct?
> 
> This would be a nice point to include in this doc methinks.
> 
> > + References:
> > +  1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogram-
> > +      ming in a hard-real-time environment. Journal of the Association for
> > +      Computing Machinery, 20(1), 1973.
> > +  2 - L. Abeni , G. Buttazzo. Integrating Multimedia Applications in Hard
> > +      Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems
> > +      Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
> > +  3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
> > +      Technical Report. http://xoomer.virgilio.it/lucabe72/pubs/tr-98-01.ps
> > +
> > +3. Bandwidth management
> > +=======================
> > +
> > + In order for the -deadline scheduling to be effective and useful, it is
> > + important to have some method to keep the allocation of the available CPU
> > + bandwidth to the tasks under control.
> > + This is usually called "admission control" and if it is not performed at all,
> > + no guarantee can be given on the actual scheduling of the -deadline tasks.
> > +
> > + Since when RT-throttling has been introduced each task group has a bandwidth
> > + associated, calculated as a certain amount of runtime over a period.
> > + Moreover, to make it possible to manipulate such bandwidth, readable/writable
> > + controls have been added to both procfs (for system wide settings) and cgroupfs
> > + (for per-group settings).
> > + Therefore, the same interface is being used for controlling the bandwidth
> > + distrubution to -deadline tasks.
> > +
> > + However, more discussion is needed in order to figure out how we want to manage
> > + SCHED_DEADLINE bandwidth at the task group level. Therefore, SCHED_DEADLINE
> > + uses (for now) a less sophisticated, but actually very sensible, mechanism to
> > + ensure that a certain utilization cap is not overcome per each root_domain.
> > +
> > + Another main difference between deadline bandwidth management and RT-throttling
> > + is that -deadline tasks have bandwidth on their own (while -rt ones don't!),
> > + and thus we don't need an higher level throttling mechanism to enforce the
> > + desired bandwidth.
> > +
> > +3.1 System wide settings
> > +------------------------
> > +
> > + The system wide settings are configured under the /proc virtual file system.
> > +
> > + For now the -rt knobs are used for dl admission control and the -deadline
> > + runtime is accounted against the -rt runtime. We realise that this isn't
> > + entirely desirable; however, it is better to have a small interface for now,
> > + and be able to change it easily later. The ideal situation (see 5.) is to run
> > + -rt tasks from a -deadline server; in which case the -rt bandwidth is a direct
> > + subset of dl_bw.
> > +
> > + This means that, for a root_domain comprising M CPUs, -deadline tasks
> > + can be created while the sum of their bandwidths stays below:
> > +
> > +   M * (sched_rt_runtime_us / sched_rt_period_us)
> > +
> > + It is also possible to disable this bandwidth management logic, and
> > + be thus free of oversubscribing the system up to any arbitrary level.
> > + This is done by writing -1 in /proc/sys/kernel/sched_rt_runtime_us.
> > +
> > +
> > +3.2 Task interface
> > +------------------
> > +
> > + Specifying a periodic/sporadic task that executes for a given amount of
> > + runtime at each instance, and that is scheduled according to the urgency of
> > + its own timing constraints needs, in general, a way of declaring:
> > +  - a (maximum/typical) instance execution time,
> > +  - a minimum interval between consecutive instances,
> > +  - a time constraint by which each instance must be completed.
> > +
> > + Therefore:
> > +  * a new struct sched_attr, containing all the necessary fields is
> > +    provided;
> > +  * the new scheduling related syscalls that manipulate it, i.e.,
> > +    sched_setattr() and sched_getattr() are implemented.
> > +
> > +
> > +3.3 Default behavior
> > +---------------------
> > +
> > + The default value for SCHED_DEADLINE bandwidth is to have rt_runtime equal to
> > + 95000. With rt_period equal to 1000000, by default, it means that -deadline
>     ^^^^
>     This seems to be 9.5% to me ;)
> 
> > + tasks can use at most 95%, multiplied by the number of CPUs that compose the
> > + root_domain, for each root_domain.
> > +
> > + A -deadline task cannot fork.
> > +
> > +4. Tasks CPU affinity
> > +=====================
> > +
> > + -deadline tasks cannot have an affinity mask smaller that the entire
> > + root_domain they are created on. However, affinities can be specified
> > + through the cpuset facility (Documentation/cgroups/cpusets.txt).
> 
> Does this mean that sched_deadline is a somewhat global implementation? Or 
> rather, at what point in time will sched_deadline take all cpus in a set 
> into consideration and when will it only look at the current CPU? Where is 
> the line drawn between global and fully partitioned?
> 
> Also, how do you account the budget when a resource holder is boosted in 
> order to release a resource? (IIRC, you use BWI, right?)
> 
> > +4.1 SCHED_DEADLINE and cpusets HOWTO
> > +------------------------------------
> > +
> > + An example of a simple configuration (pin a -deadline task to CPU0)
> > + follows (rt-app is used to create a -deadline task).
> > +
> > + mkdir /dev/cpuset
> > + mount -t cgroup -o cpuset cpuset /dev/cpuset
> > + cd /dev/cpuset
> > + mkdir cpu0
> > + echo 0 > cpu0/cpuset.cpus
> > + echo 0 > cpu0/cpuset.mems
> > + echo 1 > cpuset.cpu_exclusive
> > + echo 0 > cpuset.sched_load_balance
> > + echo 1 > cpu0/cpuset.cpu_exclusive
> > + echo 1 > cpu0/cpuset.mem_exclusive
> > + echo $$ > cpu0/tasks
> > + rt-app -t 100000:10000:d:0 -D5 (it is now actually superfluous to specify
> > + task affinity)
> > +
> > +5. Future plans
> > +===============
> > +
> > + Still missing:
> > +
> > +  - refinements to deadline inheritance, especially regarding the possibility
> > +    of retaining bandwidth isolation among non-interacting tasks. This is
> > +    being studied from both theoretical and practical points of view, and
> > +    hopefully we should be able to produce some demonstrative code soon;
> > +  - (c)group based bandwidth management, and maybe scheduling;
> > +  - access control for non-root users (and related security concerns to
> > +    address), which is the best way to allow unprivileged use of the mechanisms
> > +    and how to prevent non-root users "cheat" the system?
> > +
> > + As already discussed, we are planning also to merge this work with the EDF
> > + throttling patches [https://lkml.org/lkml/2010/2/23/239] but we still are in
> > + the preliminary phases of the merge and we really seek feedback that would
> > + help us decide on the direction it should take.
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index 0de2482..0dd5e09 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -351,7 +351,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,
> >   * disrupting the schedulability of the system. Otherwise, we should
> >   * refill the runtime and set the deadline a period in the future,
> >   * because keeping the current (absolute) deadline of the task would
> > - * result in breaking guarantees promised to other tasks.
> > + * result in breaking guarantees promised to other tasks (refer to
> > + * Documentation/scheduler/sched-deadline.txt for more informations).
> >   *
> >   * This function returns true if:
> >   *
> > -- 
> > 1.7.9.5
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> 
> -- 
> Henrik Austad



-- 
Henrik Austad

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-21 10:20         ` Henrik Austad
@ 2014-01-21 11:35           ` Luca Abeni
  2014-01-21 12:11             ` Juri Lelli
  2014-01-21 12:33             ` Peter Zijlstra
  0 siblings, 2 replies; 33+ messages in thread
From: Luca Abeni @ 2014-01-21 11:35 UTC (permalink / raw)
  To: Henrik Austad
  Cc: Juri Lelli, peterz, tglx, mingo, rostedt, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield, rob

On 01/21/2014 11:20 AM, Henrik Austad wrote:
> On Mon, Jan 20, 2014 at 02:39:29PM +0100, Luca Abeni wrote:
>> Hi all,
>>
>> On 01/20/2014 02:16 PM, Henrik Austad wrote:
>> [...]
>>>>>> + The typical -deadline task is composed of a computation phase (instance)
>>>>>> + which is activated on a periodic or sporadic fashion. The expected (maximum)
>>>>>> + duration of such computation is called the task's runtime; the time interval
>>>>>> + by which each instance needs to be completed is called the task's relative
>>>>>> + deadline. The task's absolute deadline is dynamically calculated as the
>>>>>> + time instant a task (or, more properly) activates plus the relative
>>>>>> + deadline.
>>>>>
>>>>> activates - released?
>>>>>
>>>>
>>>> I'd keep (modifying a bit):
>>>>
>>>> "time instant a task activates plus the relative deadline."
>>>>
>>>> This is probably the nearest thing to what is implemented that we can say
>>>> (without entering into the theory too much), a task that "activates" can mean
>>>> that it is first released, enqueued, woken-up, etc.
>>>
>>> So, if we look at release (yes, I'm avoiding activates for a little while)
>>> as the time at the *beginning* of a new period, then, and only then should
>>> the *absolute* deadline be computed.
>>>
>>> If you den move on to use 'activate' as a term for when a task becomes
>>> eligble to run, then 'release' becomes a subset of 'activate', and you
>>> should only compute the absolute deadline at that time. Did that make
>>> sense?
>
>> I think things are a little bit complex here, because there are 2 different
>> "deadlines" we can think about:
>> - the "jobs deadlines" (the absolute job deadline can be computed at job
>>    arrival, as the arrival time + the relative deadline). These are generally
>>    used for performance metrics, to see if a job is respecting its timing
>>    constraints or not
>>
>> - the "scheduling deadlines", which are the ones used by the scheduler to
>>    schedule the tasks. These are computed at tasks' wake-up time - notice that
>>    for self-suspending jobs there can be wake-up times that are not job arrival
>>    times. And are assigned according to the rules described in the CBS paper.
>>
>> I think this can easily become very confusing, so I currently have no concrete
>> proposals for improving the documentation... But I wanted to point out that
>> things here are more complex than in the "traditional" real-time task model.
>
> Traditional real-time as in the current real-time model used in Linux, or
> traditional as in EDF terminology used by Liu & Layland?
Sorry, by traditional real-time task model I meant the Liu & Layland one
(which is similar to what is describe above "... a task is composed of
a computation phase (instance) which is activated on a periodic or sporadic
fashion"...). I just wanted to say that things in reality are more complex
than this.


>> Maybe a solution could be to simply describe scheduling deadlines (which are
>> what sched_deadline uses) without going into the details of jobs' deadlines.
>
> Huh?
>
> We definately need a short dictionary. In fact, I'd like to have a
> paragraph describing what deadline driven scheduling is.
>
> For instance, I'm getting *Really* confused wrt to arrival time - you seem
> to wrap several types of arrival into the same name,  yet treat it
> differently.
>
> - arrival: when a job gets ready to run for the first time
> - arrival: when a job unblocks on some resource
>
> Or did I misunderstand?
Well, my point was (but I might be wrong) that describing all of these details
in this document might be confusing, so it might make sense to describe only
the relevant things (leaving all of the theoretical details to other documents,
like the papers and technical reports cited in this document).
You are right: there are different concepts of "arrival" and "arrival time": the
"job arrival" (when some activity - for example decoding a video frame - starts)
and the "task wake up". But I think in this document it is not important to mention
jobs (or task instances) and job arrivals: the only relevant things are task
wake-ups.

> So, the terminology I'm used to, an attempt to write up something to
> clear up the terminology and establish common grounds. Please
> edit/elaborate or shoot down as appropriate:
>
> """
> N. Crashcourse in deadline-terminology:
>
> In a system, we typically look at a set of tasks. In Linux-kernel
> terminology, a particular task is normally a thread. When a thread is
> ready to run, we say that a *job* of that task is running.
This would be true in the original Liu&Layland model (where a task blocks
only when a job finishes), but I do not think it is correct in a real system...
For example: (notice: this discussion might be slightly off-topic, and I do not
think this should go in the document... I am writing just to clarify my point
of view)
- Let's consider a (over simplified) video decoder as an example of task
- The task periodically read a video frame (from disk or network), decodes it,
   and displays it
- So, each job starts when the frame is read, and finishes when the frame is
   displayed. And jobs are (in this case) activated periodically
- During the execution of a job, the task might invoke a blocking system call,
   and block... When it wakes up, it is still in the same job (decoding the same
   video frame), and not in a different one.
This is (IMHO) where all the confusion comes from.

> It is
> perhaps easiest to grasp this if one think only of periodic tasks, i.e. a
> thread that need to run for 2ms every 10ms. Normally, we want one job to
> finish before a new (of the same task) start, which implies that the
> deadline for this task is also 10ms. Once this is clear, expanding one's
> mind to aperiodic and/or sporadic tasks is easier.
>
> * Periodic task: a task that needs to run for a while every N us.
> * Sporadic task: a tasks that needs tor un for a while at most every N us
>    (jobs start no closer than N us apart)
> * Aperiodic task: a task that have no particular period, but once
>    released, needs to complete before a given deadline.
This is all correct, but I think it is not needed to understand how sched_deadline
works


> * Set of all deadline-tasks in the system: \tau
> * One particluar task: \tau_i
> * The j'th job of task i: \tau_{i,j}
> * The (relative) deadline of task i: D_i
> * The (periodic, relative) release time of task i: R_i
> * Required execution time a tasks's job needs to complete. C_i
> * Absolute release-time, the time when a new job is ready (when a thread is
>    woken up for a new period).
> * The absolute deadline of a job, the actual point in time where a job
>    needs to be finished. This is what the scheduler looks at when it picks
>    the next thread to run.
This is not completely correct... The scheduler does not use this deadline, but
uses a "scheduling deadline". The difference is, for example, that the scheduling
deadline is postponed when the task tries to execute for more than the runtime...
And this is why I think (but I might be wrong) that the concepts above do not need
to be introduced in this document.
Notice that if runtime >= C_i and if the sched_deadline period is >= "R_i", then
the scheduling deadlines and the absolute deadlines you mention above coincide, so
the task is guaranteed to respect the jobs' absolute deadlines (this is what is
called "hard schedulability property" in the CBS paper).

> We can now construct a 3-tuple describing a perioic and sporadic tasks:
>
>            (C_i, R_i, D_i).
>
> These 3 items is what you can use to describe your task to the scheduler.
> """
>
>> So, the document could avoid talking about instances (or jobs), and can say
>> that a task is guaranteed to receive "runtime" time units every "period" time
>> units (and these "runtime" time units are available within "deadline" time
>> units from the beginning of the period). Every time the task wakes up, the
>> scheduler computes a scheduling deadline consistent with this constraint,
>> and tasks are scheduled using EDF on these scheduling deadlines.
>> Every time "runtime" time units are consumed in a period, the scheduling
>> deadline is postponed by a period.
>
> What is wrong with using the CORRECT TERMINOLOGY?
Well, what I mentioned above could be "correct terminology" :).
I just avoided talking about the two different kinds of deadlines
(jobs' deadlines and scheduling deadlines), because this could be confusing.
And about jobs, in order to avoid talking about self-suspending jobs, and
having to make a distinction about job arrivals and other task wake ups...

> People looking at using
> sched_deadline _need_ to understand what a _deadline_ scheduler is.
I think the sentence I wrote above gives an idea about what sched_deadline does,
and how to use the three parameters. I can be extended by clarifying that "using
EDF on these scheduling deadlines" means "scheduling the task with the smallest
scheduling deadline".


> If the only goal of sched_deadline is to act as a bandwidth-gauge, then
> fine, but *I* want to use sched_deadline for systems that HAVE DEADLINES.
Ok, so this is a more advanced usage. A section about the "hard schedulability
property" mentioned above can be added (later in the document, I think) to explain
how sched_deadline can be used to provide guarantees to real-time tasks (so,
the real-time task model you mentioned can be described later, and the condition
I mentioned above can be explained).
My point is: first explain what sched_deadline is and how it works (something about
my paragraph above), then, later explain what a real-time task is and how sched_deadline
can be used to properly schedule it.

> I do NOT want to mess around with mapping deadlines to priorities in order to
> meet my deadlines.
I agree, this is not needed.

> I suspect others would like to use sched_deadline for the same.
Again, I might be wrong... But once the "runtime time units every period, served
within a specified deadline" idea is clear, it is easier to understand how to use
sched_deadline.


>> This is of course an approximative description, but I think it can give an
>> intuitive idea of how the scheduler works, and about the meaning of the three
>> scheduling parameters.
>
> Look, I'm not in favour of turning sched_deadline.txt into an academic
> paper, but it is clear that we need to establish a baseline here, and then
> we need to include tasks, instances/jobs, deadline, arrival/release and so
> on.
See above... IMHO, this is not needed to understand how sched_deadline works.
But once the sched_deadline behaviour is defined, these concepts can be introduced
and it can be explained how to use sched_deadline to respect the jobs' deadlines...




				Luca

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-21 11:35           ` Luca Abeni
@ 2014-01-21 12:11             ` Juri Lelli
  2014-01-21 12:33             ` Peter Zijlstra
  1 sibling, 0 replies; 33+ messages in thread
From: Juri Lelli @ 2014-01-21 12:11 UTC (permalink / raw)
  To: Luca Abeni, Henrik Austad
  Cc: peterz, tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield, rob

Ok,

I'd say that we need to re-write parts of the document, as we need more
consensus about terms and better explain to users what SCHED_DEADLINE can be
used for, and the type of guarantees it can provide. I'll try to do this with
Luca, taking into account Henrik's points.

Thanks,

- Juri

On 01/21/2014 12:35 PM, Luca Abeni wrote:
> On 01/21/2014 11:20 AM, Henrik Austad wrote:
>> On Mon, Jan 20, 2014 at 02:39:29PM +0100, Luca Abeni wrote:
>>> Hi all,
>>>
>>> On 01/20/2014 02:16 PM, Henrik Austad wrote:
>>> [...]
>>>>>>> + The typical -deadline task is composed of a computation phase (instance)
>>>>>>> + which is activated on a periodic or sporadic fashion. The expected (maximum)
>>>>>>> + duration of such computation is called the task's runtime; the time interval
>>>>>>> + by which each instance needs to be completed is called the task's relative
>>>>>>> + deadline. The task's absolute deadline is dynamically calculated as the
>>>>>>> + time instant a task (or, more properly) activates plus the relative
>>>>>>> + deadline.
>>>>>>
>>>>>> activates - released?
>>>>>>
>>>>>
>>>>> I'd keep (modifying a bit):
>>>>>
>>>>> "time instant a task activates plus the relative deadline."
>>>>>
>>>>> This is probably the nearest thing to what is implemented that we can say
>>>>> (without entering into the theory too much), a task that "activates" can mean
>>>>> that it is first released, enqueued, woken-up, etc.
>>>>
>>>> So, if we look at release (yes, I'm avoiding activates for a little while)
>>>> as the time at the *beginning* of a new period, then, and only then should
>>>> the *absolute* deadline be computed.
>>>>
>>>> If you den move on to use 'activate' as a term for when a task becomes
>>>> eligble to run, then 'release' becomes a subset of 'activate', and you
>>>> should only compute the absolute deadline at that time. Did that make
>>>> sense?
>>
>>> I think things are a little bit complex here, because there are 2 different
>>> "deadlines" we can think about:
>>> - the "jobs deadlines" (the absolute job deadline can be computed at job
>>>    arrival, as the arrival time + the relative deadline). These are generally
>>>    used for performance metrics, to see if a job is respecting its timing
>>>    constraints or not
>>>
>>> - the "scheduling deadlines", which are the ones used by the scheduler to
>>>    schedule the tasks. These are computed at tasks' wake-up time - notice that
>>>    for self-suspending jobs there can be wake-up times that are not job arrival
>>>    times. And are assigned according to the rules described in the CBS paper.
>>>
>>> I think this can easily become very confusing, so I currently have no concrete
>>> proposals for improving the documentation... But I wanted to point out that
>>> things here are more complex than in the "traditional" real-time task model.
>>
>> Traditional real-time as in the current real-time model used in Linux, or
>> traditional as in EDF terminology used by Liu & Layland?
> Sorry, by traditional real-time task model I meant the Liu & Layland one
> (which is similar to what is describe above "... a task is composed of
> a computation phase (instance) which is activated on a periodic or sporadic
> fashion"...). I just wanted to say that things in reality are more complex
> than this.
> 
> 
>>> Maybe a solution could be to simply describe scheduling deadlines (which are
>>> what sched_deadline uses) without going into the details of jobs' deadlines.
>>
>> Huh?
>>
>> We definately need a short dictionary. In fact, I'd like to have a
>> paragraph describing what deadline driven scheduling is.
>>
>> For instance, I'm getting *Really* confused wrt to arrival time - you seem
>> to wrap several types of arrival into the same name,  yet treat it
>> differently.
>>
>> - arrival: when a job gets ready to run for the first time
>> - arrival: when a job unblocks on some resource
>>
>> Or did I misunderstand?
> Well, my point was (but I might be wrong) that describing all of these details
> in this document might be confusing, so it might make sense to describe only
> the relevant things (leaving all of the theoretical details to other documents,
> like the papers and technical reports cited in this document).
> You are right: there are different concepts of "arrival" and "arrival time": the
> "job arrival" (when some activity - for example decoding a video frame - starts)
> and the "task wake up". But I think in this document it is not important to mention
> jobs (or task instances) and job arrivals: the only relevant things are task
> wake-ups.
> 
>> So, the terminology I'm used to, an attempt to write up something to
>> clear up the terminology and establish common grounds. Please
>> edit/elaborate or shoot down as appropriate:
>>
>> """
>> N. Crashcourse in deadline-terminology:
>>
>> In a system, we typically look at a set of tasks. In Linux-kernel
>> terminology, a particular task is normally a thread. When a thread is
>> ready to run, we say that a *job* of that task is running.
> This would be true in the original Liu&Layland model (where a task blocks
> only when a job finishes), but I do not think it is correct in a real system...
> For example: (notice: this discussion might be slightly off-topic, and I do not
> think this should go in the document... I am writing just to clarify my point
> of view)
> - Let's consider a (over simplified) video decoder as an example of task
> - The task periodically read a video frame (from disk or network), decodes it,
>    and displays it
> - So, each job starts when the frame is read, and finishes when the frame is
>    displayed. And jobs are (in this case) activated periodically
> - During the execution of a job, the task might invoke a blocking system call,
>    and block... When it wakes up, it is still in the same job (decoding the same
>    video frame), and not in a different one.
> This is (IMHO) where all the confusion comes from.
> 
>> It is
>> perhaps easiest to grasp this if one think only of periodic tasks, i.e. a
>> thread that need to run for 2ms every 10ms. Normally, we want one job to
>> finish before a new (of the same task) start, which implies that the
>> deadline for this task is also 10ms. Once this is clear, expanding one's
>> mind to aperiodic and/or sporadic tasks is easier.
>>
>> * Periodic task: a task that needs to run for a while every N us.
>> * Sporadic task: a tasks that needs tor un for a while at most every N us
>>    (jobs start no closer than N us apart)
>> * Aperiodic task: a task that have no particular period, but once
>>    released, needs to complete before a given deadline.
> This is all correct, but I think it is not needed to understand how sched_deadline
> works
> 
> 
>> * Set of all deadline-tasks in the system: \tau
>> * One particluar task: \tau_i
>> * The j'th job of task i: \tau_{i,j}
>> * The (relative) deadline of task i: D_i
>> * The (periodic, relative) release time of task i: R_i
>> * Required execution time a tasks's job needs to complete. C_i
>> * Absolute release-time, the time when a new job is ready (when a thread is
>>    woken up for a new period).
>> * The absolute deadline of a job, the actual point in time where a job
>>    needs to be finished. This is what the scheduler looks at when it picks
>>    the next thread to run.
> This is not completely correct... The scheduler does not use this deadline, but
> uses a "scheduling deadline". The difference is, for example, that the scheduling
> deadline is postponed when the task tries to execute for more than the runtime...
> And this is why I think (but I might be wrong) that the concepts above do not need
> to be introduced in this document.
> Notice that if runtime >= C_i and if the sched_deadline period is >= "R_i", then
> the scheduling deadlines and the absolute deadlines you mention above coincide, so
> the task is guaranteed to respect the jobs' absolute deadlines (this is what is
> called "hard schedulability property" in the CBS paper).
> 
>> We can now construct a 3-tuple describing a perioic and sporadic tasks:
>>
>>            (C_i, R_i, D_i).
>>
>> These 3 items is what you can use to describe your task to the scheduler.
>> """
>>
>>> So, the document could avoid talking about instances (or jobs), and can say
>>> that a task is guaranteed to receive "runtime" time units every "period" time
>>> units (and these "runtime" time units are available within "deadline" time
>>> units from the beginning of the period). Every time the task wakes up, the
>>> scheduler computes a scheduling deadline consistent with this constraint,
>>> and tasks are scheduled using EDF on these scheduling deadlines.
>>> Every time "runtime" time units are consumed in a period, the scheduling
>>> deadline is postponed by a period.
>>
>> What is wrong with using the CORRECT TERMINOLOGY?
> Well, what I mentioned above could be "correct terminology" :).
> I just avoided talking about the two different kinds of deadlines
> (jobs' deadlines and scheduling deadlines), because this could be confusing.
> And about jobs, in order to avoid talking about self-suspending jobs, and
> having to make a distinction about job arrivals and other task wake ups...
> 
>> People looking at using
>> sched_deadline _need_ to understand what a _deadline_ scheduler is.
> I think the sentence I wrote above gives an idea about what sched_deadline does,
> and how to use the three parameters. I can be extended by clarifying that "using
> EDF on these scheduling deadlines" means "scheduling the task with the smallest
> scheduling deadline".
> 
> 
>> If the only goal of sched_deadline is to act as a bandwidth-gauge, then
>> fine, but *I* want to use sched_deadline for systems that HAVE DEADLINES.
> Ok, so this is a more advanced usage. A section about the "hard schedulability
> property" mentioned above can be added (later in the document, I think) to explain
> how sched_deadline can be used to provide guarantees to real-time tasks (so,
> the real-time task model you mentioned can be described later, and the condition
> I mentioned above can be explained).
> My point is: first explain what sched_deadline is and how it works (something about
> my paragraph above), then, later explain what a real-time task is and how sched_deadline
> can be used to properly schedule it.
> 
>> I do NOT want to mess around with mapping deadlines to priorities in order to
>> meet my deadlines.
> I agree, this is not needed.
> 
>> I suspect others would like to use sched_deadline for the same.
> Again, I might be wrong... But once the "runtime time units every period, served
> within a specified deadline" idea is clear, it is easier to understand how to use
> sched_deadline.
> 
> 
>>> This is of course an approximative description, but I think it can give an
>>> intuitive idea of how the scheduler works, and about the meaning of the three
>>> scheduling parameters.
>>
>> Look, I'm not in favour of turning sched_deadline.txt into an academic
>> paper, but it is clear that we need to establish a baseline here, and then
>> we need to include tasks, instances/jobs, deadline, arrival/release and so
>> on.
> See above... IMHO, this is not needed to understand how sched_deadline works.
> But once the sched_deadline behaviour is defined, these concepts can be introduced
> and it can be explained how to use sched_deadline to respect the jobs' deadlines...
> 
> 
> 
> 
> 				Luca
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-21 11:35           ` Luca Abeni
  2014-01-21 12:11             ` Juri Lelli
@ 2014-01-21 12:33             ` Peter Zijlstra
  2014-01-21 12:50               ` Luca Abeni
  1 sibling, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2014-01-21 12:33 UTC (permalink / raw)
  To: Luca Abeni
  Cc: Henrik Austad, Juri Lelli, tglx, mingo, rostedt, oleg, fweisbec,
	darren, johan.eker, p.faure, linux-kernel, claudio, michael,
	fchecconi, tommaso.cucinotta, nicola.manica, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield, rob

On Tue, Jan 21, 2014 at 12:35:27PM +0100, Luca Abeni wrote:
> >In a system, we typically look at a set of tasks. In Linux-kernel
> >terminology, a particular task is normally a thread. When a thread is
> >ready to run, we say that a *job* of that task is running.
> This would be true in the original Liu&Layland model (where a task blocks
> only when a job finishes), but I do not think it is correct in a real system...
> For example: (notice: this discussion might be slightly off-topic, and I do not
> think this should go in the document... I am writing just to clarify my point
> of view)
> - Let's consider a (over simplified) video decoder as an example of task
> - The task periodically read a video frame (from disk or network), decodes it,
>   and displays it
> - So, each job starts when the frame is read, and finishes when the frame is
>   displayed. And jobs are (in this case) activated periodically
> - During the execution of a job, the task might invoke a blocking system call,
>   and block... When it wakes up, it is still in the same job (decoding the same
>   video frame), and not in a different one.
> This is (IMHO) where all the confusion comes from.

I would strongly urge you not to use that as an example, because its
dead wrong design. An RT thread (be it RR,FIFO or DL) should _NEVER_ do
blocking IO.

Have !RT tasks read the stuff from disk into a buffer, then let the RT
task read data from the buffer and flip frames and such.

If you want to mention blocking, then please use the most common one:
blocking on a (hopefully PI) mutex.



On the other subject; I wouldn't actually mind if it grew into a proper
(academic or not) summary of deadline scheduling theory and how it
applies.

Sure, refer to actual papers for all the proofs and such, but it would
be very good to go over all the bits and pieces that make up the system.

So cover the periodic, sporadic and aperiodic model like henr_k
suggested, please do cover the job/instance idiom as it is used all over
the place.

Then also treat schedulability tests and their ramification, explain
what laxity is, what tardiness is, that GEDF doesn't have 0 tardiness
but does have bounded tardiness.

Maybe even mention the actual bounds -- but refer to papers for their
proofs.

Mention CBS and the ramification etc..


Yes this is all a bit much, but I feel it is important, after all how
can you properly use something you don't understand? (and yes I know its
a very popular thing to not want to understand how things work but still
use them :-/).

I mean, I'm the kind of idiot that actually goes out and read a bunch of
papers, but many people simply cannot read those things, or are not
given the time to, even if they wanted and could (arguably they have
bigger problems).

so think of this as the idiot guide to deadline scheduling :-)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-21 12:33             ` Peter Zijlstra
@ 2014-01-21 12:50               ` Luca Abeni
  2014-01-21 13:55                 ` Peter Zijlstra
  0 siblings, 1 reply; 33+ messages in thread
From: Luca Abeni @ 2014-01-21 12:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Henrik Austad, Juri Lelli, tglx, mingo, rostedt, oleg, fweisbec,
	darren, johan.eker, p.faure, linux-kernel, claudio, michael,
	fchecconi, tommaso.cucinotta, nicola.manica, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield, rob

On 01/21/2014 01:33 PM, Peter Zijlstra wrote:
> On Tue, Jan 21, 2014 at 12:35:27PM +0100, Luca Abeni wrote:
>>> In a system, we typically look at a set of tasks. In Linux-kernel
>>> terminology, a particular task is normally a thread. When a thread is
>>> ready to run, we say that a *job* of that task is running.
>> This would be true in the original Liu&Layland model (where a task blocks
>> only when a job finishes), but I do not think it is correct in a real system...
>> For example: (notice: this discussion might be slightly off-topic, and I do not
>> think this should go in the document... I am writing just to clarify my point
>> of view)
>> - Let's consider a (over simplified) video decoder as an example of task
>> - The task periodically read a video frame (from disk or network), decodes it,
>>    and displays it
>> - So, each job starts when the frame is read, and finishes when the frame is
>>    displayed. And jobs are (in this case) activated periodically
>> - During the execution of a job, the task might invoke a blocking system call,
>>    and block... When it wakes up, it is still in the same job (decoding the same
>>    video frame), and not in a different one.
>> This is (IMHO) where all the confusion comes from.
>
> I would strongly urge you not to use that as an example, because its
> dead wrong design. An RT thread (be it RR,FIFO or DL) should _NEVER_ do
> blocking IO.
Well, but it does happen in reality :)
I mean: people might want to use SCHED_DEADLINE to schedule mplayer (or similar).
There are even scientific papers showing the advantage of doing so...
And if you try to use ftrace/kernelshark to check the wake-up times and similar
you will notice that even a single-threaded player like mplayer blocks and wakes-up
many times inside a job.

On the other hand, I agree with you that a hard real-time task should be designed
not to do things like this. But SCHED_DEADLINE is flexible enough to be used on
many different kinds of tasks (hard real-time, soft real-time, etc...).

> Have !RT tasks read the stuff from disk into a buffer, then let the RT
> task read data from the buffer and flip frames and such.
>
> If you want to mention blocking, then please use the most common one:
> blocking on a (hopefully PI) mutex.
Ok.

> On the other subject; I wouldn't actually mind if it grew into a proper
> (academic or not) summary of deadline scheduling theory and how it
> applies.
>
> Sure, refer to actual papers for all the proofs and such, but it would
> be very good to go over all the bits and pieces that make up the system.
>
> So cover the periodic, sporadic and aperiodic model like henr_k
> suggested, please do cover the job/instance idiom as it is used all over
> the place.
Ok... My point was that it would be better (IMHO) to first explain how
sched_deadline works (and no notion of job/instance, etc is needed for this),
and then explain how this applies to the real-time task model (and here, of
course all the formal notation can be introduced).

Do you think this can be reasonable?

> Then also treat schedulability tests and their ramification, explain
> what laxity is, what tardiness is, that GEDF doesn't have 0 tardiness
> but does have bounded tardiness.
>
> Maybe even mention the actual bounds -- but refer to papers for their
> proofs.
>
> Mention CBS and the ramification etc..
Ok.
I guess some of these details can be added incrementally, with additional
patches?


> Yes this is all a bit much, but I feel it is important, after all how
> can you properly use something you don't understand? (and yes I know its
> a very popular thing to not want to understand how things work but still
> use them :-/).
>
> I mean, I'm the kind of idiot that actually goes out and read a bunch of
> papers, but many people simply cannot read those things, or are not
> given the time to, even if they wanted and could (arguably they have
> bigger problems).
Ok.



				Thanks,
					Luca

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-21 12:50               ` Luca Abeni
@ 2014-01-21 13:55                 ` Peter Zijlstra
  2014-01-21 14:38                   ` Juri Lelli
                                     ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Peter Zijlstra @ 2014-01-21 13:55 UTC (permalink / raw)
  To: Luca Abeni
  Cc: Henrik Austad, Juri Lelli, tglx, mingo, rostedt, oleg, fweisbec,
	darren, johan.eker, p.faure, linux-kernel, claudio, michael,
	fchecconi, tommaso.cucinotta, nicola.manica, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield, rob

On Tue, Jan 21, 2014 at 01:50:41PM +0100, Luca Abeni wrote:
> On 01/21/2014 01:33 PM, Peter Zijlstra wrote:

> >>- During the execution of a job, the task might invoke a blocking system call,
> >>   and block... When it wakes up, it is still in the same job (decoding the same
> >>   video frame), and not in a different one.
> >>This is (IMHO) where all the confusion comes from.
> >
> >I would strongly urge you not to use that as an example, because its
> >dead wrong design. An RT thread (be it RR,FIFO or DL) should _NEVER_ do
> >blocking IO.
> Well, but it does happen in reality :)

Yeah, I know, my point was more about not encouraging people to do this
by explicitly mentioning it.

> On the other hand, I agree with you that a hard real-time task should be designed
> not to do things like this. But SCHED_DEADLINE is flexible enough to be used on
> many different kinds of tasks (hard real-time, soft real-time, etc...).

At which point I feel obliged to mention the work Jim did on statistical
bounded tardiness and a potential future option:
SCHED_FLAG_DL_AVG_RUNTIME, where we would allow tasks to somewhat exceed
their runtime budget provided that they meet their budget on average.

A possible implementation could be to track the unused budget of
previous instances and keep a decaying sum (such that we're guaranteed
this unused budget < 2*runtime). And then allow runtime overruns upto
this limit.

Another possibly extension; one proposed by Ingo; is to demote tasks to
SCHED_OTHER once they exceed their budget instead of the full block they
get now -- we could possibly call this SCHED_FLAG_DL_CBS_SOFT or such.

And of course SCHED_FLAG_DL_CBS_SIGNAL, where the task gets a signal
delivered if it exceeded the runtime -- I think some of the earlier
patches had things like this, no?

> >On the other subject; I wouldn't actually mind if it grew into a proper
> >(academic or not) summary of deadline scheduling theory and how it
> >applies.
> >
> >Sure, refer to actual papers for all the proofs and such, but it would
> >be very good to go over all the bits and pieces that make up the system.
> >
> >So cover the periodic, sporadic and aperiodic model like henr_k
> >suggested, please do cover the job/instance idiom as it is used all over
> >the place.
> Ok... My point was that it would be better (IMHO) to first explain how
> sched_deadline works (and no notion of job/instance, etc is needed for this),
> and then explain how this applies to the real-time task model (and here, of
> course all the formal notation can be introduced).
> 
> Do you think this can be reasonable?

Sure, I think that's reasonable.

> >Then also treat schedulability tests and their ramification, explain
> >what laxity is, what tardiness is, that GEDF doesn't have 0 tardiness
> >but does have bounded tardiness.
> >
> >Maybe even mention the actual bounds -- but refer to papers for their
> >proofs.
> >
> >Mention CBS and the ramification etc..
> Ok.
> I guess some of these details can be added incrementally, with additional
> patches?

Oh sure, all of this will always be a work in progress anyway ;-)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-21 13:55                 ` Peter Zijlstra
@ 2014-01-21 14:38                   ` Juri Lelli
  2014-01-21 16:28                   ` Steven Rostedt
  2014-01-22 13:03                   ` Luca Abeni
  2 siblings, 0 replies; 33+ messages in thread
From: Juri Lelli @ 2014-01-21 14:38 UTC (permalink / raw)
  To: Peter Zijlstra, Luca Abeni
  Cc: Henrik Austad, tglx, mingo, rostedt, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield, rob

On 01/21/2014 02:55 PM, Peter Zijlstra wrote:
> On Tue, Jan 21, 2014 at 01:50:41PM +0100, Luca Abeni wrote:
>> On 01/21/2014 01:33 PM, Peter Zijlstra wrote:
> 
>>>> - During the execution of a job, the task might invoke a blocking system call,
>>>>   and block... When it wakes up, it is still in the same job (decoding the same
>>>>   video frame), and not in a different one.
>>>> This is (IMHO) where all the confusion comes from.
>>>
>>> I would strongly urge you not to use that as an example, because its
>>> dead wrong design. An RT thread (be it RR,FIFO or DL) should _NEVER_ do
>>> blocking IO.
>> Well, but it does happen in reality :)
> 
> Yeah, I know, my point was more about not encouraging people to do this
> by explicitly mentioning it.
> 
>> On the other hand, I agree with you that a hard real-time task should be designed
>> not to do things like this. But SCHED_DEADLINE is flexible enough to be used on
>> many different kinds of tasks (hard real-time, soft real-time, etc...).
> 
> At which point I feel obliged to mention the work Jim did on statistical
> bounded tardiness and a potential future option:
> SCHED_FLAG_DL_AVG_RUNTIME, where we would allow tasks to somewhat exceed
> their runtime budget provided that they meet their budget on average.
> 
> A possible implementation could be to track the unused budget of
> previous instances and keep a decaying sum (such that we're guaranteed
> this unused budget < 2*runtime). And then allow runtime overruns upto
> this limit.
> 
> Another possibly extension; one proposed by Ingo; is to demote tasks to
> SCHED_OTHER once they exceed their budget instead of the full block they
> get now -- we could possibly call this SCHED_FLAG_DL_CBS_SOFT or such.
> 
> And of course SCHED_FLAG_DL_CBS_SIGNAL, where the task gets a signal
> delivered if it exceeded the runtime -- I think some of the earlier
> patches had things like this, no?
>

Yes, they both got removed along the way. But we can pick them back if we
realize that they are needed for some scenario.

>>> On the other subject; I wouldn't actually mind if it grew into a proper
>>> (academic or not) summary of deadline scheduling theory and how it
>>> applies.
>>>
>>> Sure, refer to actual papers for all the proofs and such, but it would
>>> be very good to go over all the bits and pieces that make up the system.
>>>
>>> So cover the periodic, sporadic and aperiodic model like henr_k
>>> suggested, please do cover the job/instance idiom as it is used all over
>>> the place.
>> Ok... My point was that it would be better (IMHO) to first explain how
>> sched_deadline works (and no notion of job/instance, etc is needed for this),
>> and then explain how this applies to the real-time task model (and here, of
>> course all the formal notation can be introduced).
>>
>> Do you think this can be reasonable?
> 
> Sure, I think that's reasonable.
> 
>>> Then also treat schedulability tests and their ramification, explain
>>> what laxity is, what tardiness is, that GEDF doesn't have 0 tardiness
>>> but does have bounded tardiness.
>>>
>>> Maybe even mention the actual bounds -- but refer to papers for their
>>> proofs.
>>>
>>> Mention CBS and the ramification etc..
>> Ok.
>> I guess some of these details can be added incrementally, with additional
>> patches?
> 
> Oh sure, all of this will always be a work in progress anyway ;-)
> 

Ok, we are working on a first update.

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-20 11:46   ` Peter Zijlstra
@ 2014-01-21 14:55     ` Steven Rostedt
  0 siblings, 0 replies; 33+ messages in thread
From: Steven Rostedt @ 2014-01-21 14:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Henrik Austad, Juri Lelli, tglx, mingo, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield, rob

On Mon, 20 Jan 2014 12:46:06 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> > activates - released?
> > 
> > Since real-time papers from different rt-campus around the academia insist 
> > on using *slightly* different terminology, perhaps add a short dictionary 
> > for some of the more common terms?
> 
> Oh gawd, they really don't conform with their definitions? I'd not
> noticed that.
> 
> > D: relative deadline, typically N ms after release
> 
> You failed to define release :-) Its the 'wakeup' event, right? Where
> the activation would be the moment we actually schedule the
> job/instance?

It took me about 20 seconds to figure out what a "release" was too. ;-)

-- Steve

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-21 13:55                 ` Peter Zijlstra
  2014-01-21 14:38                   ` Juri Lelli
@ 2014-01-21 16:28                   ` Steven Rostedt
  2014-01-22 13:03                   ` Luca Abeni
  2 siblings, 0 replies; 33+ messages in thread
From: Steven Rostedt @ 2014-01-21 16:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Luca Abeni, Henrik Austad, Juri Lelli, tglx, mingo, oleg,
	fweisbec, darren, johan.eker, p.faure, linux-kernel, claudio,
	michael, fchecconi, tommaso.cucinotta, nicola.manica,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang, jkacur, harald.gustafsson, vincent.guittot,
	bruce.ashfield, rob

On Tue, 21 Jan 2014 14:55:59 +0100
Peter Zijlstra <peterz@infradead.org> wrote:


> > >I would strongly urge you not to use that as an example, because its
> > >dead wrong design. An RT thread (be it RR,FIFO or DL) should _NEVER_ do
> > >blocking IO.
> > Well, but it does happen in reality :)
> 
> Yeah, I know, my point was more about not encouraging people to do this
> by explicitly mentioning it.

Unless it is mentioned that this is fine for soft real-time uses, such
as what Luca described (mplayer and possibly Jack). Tools where missing
a deadline does not crash the system, but just reduces quality.

-- Steve

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-21 13:55                 ` Peter Zijlstra
  2014-01-21 14:38                   ` Juri Lelli
  2014-01-21 16:28                   ` Steven Rostedt
@ 2014-01-22 13:03                   ` Luca Abeni
  2014-01-22 13:51                     ` Peter Zijlstra
  2 siblings, 1 reply; 33+ messages in thread
From: Luca Abeni @ 2014-01-22 13:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Henrik Austad, Juri Lelli, tglx, mingo, rostedt, oleg, fweisbec,
	darren, johan.eker, p.faure, linux-kernel, claudio, michael,
	fchecconi, tommaso.cucinotta, nicola.manica, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield, rob

Hi,

On 01/21/2014 02:55 PM, Peter Zijlstra wrote:
> On Tue, Jan 21, 2014 at 01:50:41PM +0100, Luca Abeni wrote:
>> On 01/21/2014 01:33 PM, Peter Zijlstra wrote:
>
>>>> - During the execution of a job, the task might invoke a blocking system call,
>>>>    and block... When it wakes up, it is still in the same job (decoding the same
>>>>    video frame), and not in a different one.
>>>> This is (IMHO) where all the confusion comes from.
>>>
>>> I would strongly urge you not to use that as an example, because its
>>> dead wrong design. An RT thread (be it RR,FIFO or DL) should _NEVER_ do
>>> blocking IO.
>> Well, but it does happen in reality :)
>
> Yeah, I know, my point was more about not encouraging people to do this
> by explicitly mentioning it.
Ok, got the point. I was not planning to add this example to the documentation
anyway, so I think there is no problem :)


>> On the other hand, I agree with you that a hard real-time task should be designed
>> not to do things like this. But SCHED_DEADLINE is flexible enough to be used on
>> many different kinds of tasks (hard real-time, soft real-time, etc...).
>
> At which point I feel obliged to mention the work Jim did on statistical
> bounded tardiness and a potential future option:
> SCHED_FLAG_DL_AVG_RUNTIME, where we would allow tasks to somewhat exceed
> their runtime budget provided that they meet their budget on average.
I think I read the paper your refer to, and if I remember well it was about
an analysis technique (using some math from queuing theory to get estimations
of the average response time, and then using Tchebysheff to transform this result
in a probabilistic real-time guarantee)... If I understand well, it does not
require modifications to the scheduling algorithm (the paper presents a multiprocessor
reservation-based scheduling algorithm, but the presented analysis applies to every
reservation-based algorithm, including SCHED_DEADLINE without modifications).
Am I misunderstanding something?

Anyway, I'll propose a documentation patch adding this paper to the references
(and if you agree I can also add some other references to probabilistic guarantees).

The SCHED_FLAG_DL_AVG_RUNTIME idea also looks interesting.

> Another possibly extension; one proposed by Ingo; is to demote tasks to
> SCHED_OTHER once they exceed their budget instead of the full block they
> get now -- we could possibly call this SCHED_FLAG_DL_CBS_SOFT or such.
I think something similar to this was mentioned in the original "resource kernels"
paper by Rajkumar and others... It is in general very useful.

Another extension I implemented "locally" (but I never submitted patches because
it is "dangerous" and potentially controversial) is the original CBS behaviour:
when a task is depleted, do not make it unschedulable, but just postpone its
scheduling deadline (decreasing its priority) and immediately recharge the
runtime. This still preserves temporal isolation between SCHED_DEADLINE tasks,
but can cause starvation of non-SCHED_DEADLINE tasks (and this is why I say this
is dangerous and can be controversial), but can be useful in some situations.


> And of course SCHED_FLAG_DL_CBS_SIGNAL, where the task gets a signal
> delivered if it exceeded the runtime -- I think some of the earlier
> patches had things like this, no?
I've seen this in some patchset, but I do not remember when. I think some of
the "advanced features" have been removed from the first submission.


>>> On the other subject; I wouldn't actually mind if it grew into a proper
>>> (academic or not) summary of deadline scheduling theory and how it
>>> applies.
>>>
>>> Sure, refer to actual papers for all the proofs and such, but it would
>>> be very good to go over all the bits and pieces that make up the system.
>>>
>>> So cover the periodic, sporadic and aperiodic model like henr_k
>>> suggested, please do cover the job/instance idiom as it is used all over
>>> the place.
>> Ok... My point was that it would be better (IMHO) to first explain how
>> sched_deadline works (and no notion of job/instance, etc is needed for this),
>> and then explain how this applies to the real-time task model (and here, of
>> course all the formal notation can be introduced).
>>
>> Do you think this can be reasonable?
>
> Sure, I think that's reasonable.
Ok, good. I am already working on this together with Juri.



			Thanks,
				Luca

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-22 13:03                   ` Luca Abeni
@ 2014-01-22 13:51                     ` Peter Zijlstra
  2014-01-24 10:08                       ` Tommaso Cucinotta
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2014-01-22 13:51 UTC (permalink / raw)
  To: Luca Abeni
  Cc: Henrik Austad, Juri Lelli, tglx, mingo, rostedt, oleg, fweisbec,
	darren, johan.eker, p.faure, linux-kernel, claudio, michael,
	fchecconi, tommaso.cucinotta, nicola.manica, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield, rob

On Wed, Jan 22, 2014 at 02:03:18PM +0100, Luca Abeni wrote:
> >At which point I feel obliged to mention the work Jim did on statistical
> >bounded tardiness and a potential future option:
> >SCHED_FLAG_DL_AVG_RUNTIME, where we would allow tasks to somewhat exceed
> >their runtime budget provided that they meet their budget on average.

> I think I read the paper your refer to, and if I remember well it was about
> an analysis technique (using some math from queuing theory to get estimations
> of the average response time, and then using Tchebysheff to transform this result
> in a probabilistic real-time guarantee)... If I understand well, it does not
> require modifications to the scheduling algorithm (the paper presents a multiprocessor
> reservation-based scheduling algorithm, but the presented analysis applies to every
> reservation-based algorithm, including SCHED_DEADLINE without modifications).
> Am I misunderstanding something?

I must admit to not actually having read that paper; Jim talked me
through the result at some time, but yes that sounds about right.

The only thing we need to do is enforce some 'avg' on the input
(scheduling parameters) such that we are guaranteed an avg on the output
(tardiness).

And we must enforce the input because we cannot trust userspace to
actually do what it says; if it were to go overboard we'd loose all
bounds.

> Anyway, I'll propose a documentation patch adding this paper to the references
> (and if you agree I can also add some other references to probabilistic guarantees).

Right, something for the future though.

> The SCHED_FLAG_DL_AVG_RUNTIME idea also looks interesting.

Right, like said above, because we cannot trust userspace we must
enforce that the input is indeed a bounded avg, otherwise the guarantees
do not hold.

> >Another possibly extension; one proposed by Ingo; is to demote tasks to
> >SCHED_OTHER once they exceed their budget instead of the full block they
> >get now -- we could possibly call this SCHED_FLAG_DL_CBS_SOFT or such.

> I think something similar to this was mentioned in the original "resource kernels"
> paper by Rajkumar and others... It is in general very useful.

Right, esp. so when we allow unpriv. access to SCHED_DEADLINE, more on
that below.

> Another extension I implemented "locally" (but I never submitted patches because
> it is "dangerous" and potentially controversial) is the original CBS behaviour:
> when a task is depleted, do not make it unschedulable, but just postpone its
> scheduling deadline (decreasing its priority) and immediately recharge the
> runtime. This still preserves temporal isolation between SCHED_DEADLINE tasks,
> but can cause starvation of non-SCHED_DEADLINE tasks (and this is why I say this
> is dangerous and can be controversial), but can be useful in some situations.

Right, we could actually implement that but always require CAP_ADMIN if
set.

We should also again talk about what it would take to allow
unprivileged access to SCHED_DEADLINE. The things I can remember are the
obvious cap on utilization and a minimum period -- the latter so that we
don't DoS the system with a metric ton of tiny tasks.

But I seem to remember there were a few other issues.

> 
> >And of course SCHED_FLAG_DL_CBS_SIGNAL, where the task gets a signal
> >delivered if it exceeded the runtime -- I think some of the earlier
> >patches had things like this, no?
> I've seen this in some patchset, but I do not remember when. I think some of
> the "advanced features" have been removed from the first submission.

Exactly, so. Now that we have the simple stuff settled, we can again
look at the more advanced features if there's potential users.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-22 13:51                     ` Peter Zijlstra
@ 2014-01-24 10:08                       ` Tommaso Cucinotta
  2014-01-28  9:31                         ` Peter Zijlstra
  0 siblings, 1 reply; 33+ messages in thread
From: Tommaso Cucinotta @ 2014-01-24 10:08 UTC (permalink / raw)
  To: Peter Zijlstra, Luca Abeni
  Cc: Henrik Austad, Juri Lelli, tglx, mingo, rostedt, oleg, fweisbec,
	darren, johan.eker, p.faure, linux-kernel, claudio, michael,
	fchecconi, nicola.manica, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield, rob

On 22/01/14 13:51, Peter Zijlstra wrote:
>>> Another possibly extension; one proposed by Ingo; is to demote tasks to
>>> SCHED_OTHER once they exceed their budget instead of the full block they
>>> get now -- we could possibly call this SCHED_FLAG_DL_CBS_SOFT or such.

Soft reservations are also very useful, as in multimedia and adaptive RT  apps
we expect these budgets to be roughly estimated, rather than being carefully
computed WCETs.

For example, in the experimentation with adaptive budget tuning made with Jack,

  http://retis.sssup.it/~tommaso/papers/lac11.php

the best results were achieved just setting the QOS_F_SOFT flag of the old
AQuoSA (apologies for mentioning that old crappy code), that was implementing
exactly downgrade of the task to the regular CFS scheduler for the time in
which the budget was exhausted.
That Jack-related scenario is quite challenging as due to the change in the
required budget due to new Jack clients joining, or varying workloads of
existing clients (e.g., timidity). At least, as far as the intention of being
seamless to applications is kept (apps required no change at all -- only Jack
server and client library were changed). FYI, I wish we could replay those
modifications on top of SCHED_DEADLINE one day :-)...

  http://repo.or.cz/w/jack2.git/shortlog/refs/heads/aquosa

> We should also again talk about what it would take to allow
> unprivileged access to SCHED_DEADLINE. The things I can remember are the
> obvious cap on utilization and a minimum period -- the latter so that we
> don't DoS the system with a metric ton of tiny tasks.
> 
> But I seem to remember there were a few other issues.

Exactly. I can remember also a huge period might be harmful, because with it
even a tiny utilization may lead to a big runtime that starves the whole non
RT tasks for a significant time. Just in case, I had this paper

  http://retis.sssup.it/~tommaso/papers/rtas08.php

discussing the issues I had found, and how they were tackled in the AQuoSA
supervisor. See Section 3.1. Note that the AQuoSA model was probably more
complex due to the possibility of hosting multiple tasks in the same
deadline-driven reservation, and the possibility for parts of the tasks
in a user reservation being capable of being re-wrapped within another
app-specific reservation etc.... I'd expect the situation to be quite
simpler with SCHED_DEADLINE.

My2c,

	Tommaso


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-24 10:08                       ` Tommaso Cucinotta
@ 2014-01-28  9:31                         ` Peter Zijlstra
  2014-01-28 18:22                           ` Tommaso Cucinotta
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2014-01-28  9:31 UTC (permalink / raw)
  To: Tommaso Cucinotta
  Cc: Luca Abeni, Henrik Austad, Juri Lelli, tglx, mingo, rostedt,
	oleg, fweisbec, darren, johan.eker, p.faure, linux-kernel,
	claudio, michael, fchecconi, nicola.manica, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield, rob

On Fri, Jan 24, 2014 at 10:08:35AM +0000, Tommaso Cucinotta wrote:
> > We should also again talk about what it would take to allow
> > unprivileged access to SCHED_DEADLINE. The things I can remember are the
> > obvious cap on utilization and a minimum period -- the latter so that we
> > don't DoS the system with a metric ton of tiny tasks.
> > 
> > But I seem to remember there were a few other issues.
> 
> Exactly. I can remember also a huge period might be harmful, because with it
> even a tiny utilization may lead to a big runtime that starves the whole non
> RT tasks for a significant time. 

Another way to solve that is with explicit slack time scheduling, where
slack here means the time/utilization not allocated to dl tasks (a quick
google shows various different usages of slack, including laxity).

So suppose we have 50% utilization but with a huge period, we could
schedule the other 50% at a fixed but smaller period and have that run
the rt/fair/etc tasks.

So effectively we'll always fill up the utilization to 100% and use the
'slack' time as a server for the other classes.

> Just in case, I had this paper
> 
>   http://retis.sssup.it/~tommaso/papers/rtas08.php
> 
> discussing the issues I had found, and how they were tackled in the AQuoSA
> supervisor. See Section 3.1.

/me *click*

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-28  9:31                         ` Peter Zijlstra
@ 2014-01-28 18:22                           ` Tommaso Cucinotta
  0 siblings, 0 replies; 33+ messages in thread
From: Tommaso Cucinotta @ 2014-01-28 18:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Luca Abeni, Henrik Austad, Juri Lelli, tglx, mingo, rostedt,
	oleg, fweisbec, darren, johan.eker, p.faure, linux-kernel,
	claudio, michael, fchecconi, nicola.manica, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield, rob

On 28/01/14 09:31, Peter Zijlstra wrote:
>> Exactly. I can remember also a huge period might be harmful, because with it
>> even a tiny utilization may lead to a big runtime that starves the whole non
>> RT tasks for a significant time. 
> 
> Another way to solve that is with explicit slack time scheduling, where
> slack here means the time/utilization not allocated to dl tasks (a quick
> google shows various different usages of slack, including laxity).
> 
> So suppose we have 50% utilization but with a huge period, we could
> schedule the other 50% at a fixed but smaller period and have that run
> the rt/fair/etc tasks.
> 
> So effectively we'll always fill up the utilization to 100% and use the
> 'slack' time as a server for the other classes.

Yesss :-)! Indeed, AFAICR in the AQuoSA API there was the QOS_F_DEFAULT flag

  http://aquosa.sourceforge.net/aquosa-docs/aquosa-qosres/html/qos__types_8h_source.html

(that could only be set by a  privileged process) just there to allow
for creating a server serving "default" tasks (namely, every non-aquosa task).
This way, you could create for example at system boot time a default
reservation with a precise (runtime, period) for Linux, so that non-aquosa
tasks could have a chance to run in that reservation, even in presence of
a RT reservation with a significantly long runtime.

	T.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-27 22:29       ` Luca Abeni
@ 2014-01-28 10:03         ` Juri Lelli
  0 siblings, 0 replies; 33+ messages in thread
From: Juri Lelli @ 2014-01-28 10:03 UTC (permalink / raw)
  To: Luca Abeni, Steven Rostedt
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, dhaval.giani, hgu1972, paulmck, raistlin,
	insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield, linux-doc, rob

On 01/27/2014 11:29 PM, Luca Abeni wrote:
> Hi Steven,
> 
> On Mon, 27 Jan 2014 12:09:38 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
> [...]
>>>> Lets take a case where deadline == period. It seems that the above
>>>> would be true any time there was any delay to starting the task
>>>> or the task was interrupted by another SCHED_DEADLINE task.
>>> Not sure about this... Notice that above only applies when a task
>>> wakes up (moving from a "sleeping" state to a "ready to run"
>>> state). Not when an already ready task is scheduled.
>>> Or did I misunderstand your comment?
>>
>> No no, you understood, I missed that this only happens on wakeup. But
>> then I have to ask, what happens if the task blocks on a mutex? Would
>> that cause this check to happen then?
> Well, in theory, this check has to be performed every time the task
> wakes up from any kind of sleeping (mutex, or anything else).
> I think this is what happens in practice with the current
> implementation, but maybe I am missing something in the implementation
> details.
> 

Yes. The call graph looks like:

 enqueue_task_dl()
  -> enqueue_dl_entity()
   -> update_dl_entity() <-- this does the check through dl_entity_overflow()

and enqueue_task_dl() gets called every time the task wakes up and is enqueued
back in the dl_rq (e.g., it was waiting on a mutex).

Thanks,

- Juri

>> It may be nice to add some comments about exactly when this check is
>> performed.
> Well, maybe "wake-up" is not the right term according to the
> terminology used in the Linux kernel... The check should be performed
> every time the task changes its state from sleeping (TASK_INTERRUPTIBLE
> or TASK_UNINTERRUPTIBLE, I think) to "ready to execute" (TASK_RUNNING,
> I think) and enters the ready queue (well, RB tree :). If someone can
> suggest a more appropriate wording, I'll fix the document accordingly.
> 
> 
>>>> For example, lets say we have two SD tasks. One that has 50ms
>>>> runtime and a 100ms period. The other has a 1ms runtime and a
>>>> 10ms period.
>>>>
>>>> The above two should work perfectly fine together. The 10ms period
>>>> task will constantly schedule in on the 100ms task.
>>>>
>>>> When the 100ms task runs, it could easily be delayed by 1ms due
>>>> to the 10ms task. Then lets look at the above equation
>>> See above: the check for the 100ms task is only performed when the
>>> task unblocks (at the beginning of its period), not when it's
>>> scheduled (after the 10ms taks).
>>>
>>> This check is designed to take care of the following situation:
>>> - consider a task with runtime 10ms and period equal to deadline
>>> equal to 100ms
>>> - assume the task wakes up at time t, and is assigned "remaining
>>>   runtime" 10ms and "scheduling deadline" t+100ms
>>> - assume the task blocks after executing for 2ms. The "remaining
>>>   runtime" is now 8ms
>>> - can the task use "remaining runtime" 8ms and "scheduling deadline"
>>>   t+100ms when it wakes up again?
>>>   Answer: if it wakes up before t+20ms, it can. Otherwise, it
>>> cannot, because it would consume a fraction of CPU time larger than
>>> 10%, causing missed deadlines in other tasks.
>>
>> Ah, you answered my question here. The check happens every time the
>> task blocks as well. I still need to read the papers, and even more
>> importantly, start running more tests with tracing on to review what
>> exactly happens in the implementation.
> If you read the original CBS paper, you'll see that it does not talk
> about "wake-up" and "sleep", because it always considers a wake-up as a
> job arrival, and a task blocking as a job end. This is because the
> original paper only considers a simplified task model, without
> "self-suspending jobs".
> In SCHED_DEADLINE, things are not so simple, because it has to consider
> real situations with real tasks not using the simplified original
> real-time task model...
> 
>>>>> + SCHED_DEADLINE can be used to schedule real-time tasks
>>>>> guaranteeing that
>>>>> + the jobs' deadlines of a task are respected. In order to do
>>>>> this, a task
>>>>> + must be scheduled by setting:
>>>>> +
>>>>> +  - runtime >= WCET
>>>>> +  - deadline = D
>>>>> +  - period <= P
>>>>> +
>>>>> + IOW, if runtime >= WCET and if period is >= P, then the
>>>>> scheduling deadlines
>>>>> + and the absolute deadlines (d_j) coincide, so a proper
>>>>> admission control
>>>>> + allows to respect the jobs' absolute deadlines for this task
>>>>> (this is what is
>>>>> + called "hard schedulability property" and is an extension of
>>>>> Lemma 1 of [2]).
>>>>
>>>> I wonder if we should state the obvious (which is never obvious).
>>>> That is the user must also have the following.
>>>>
>>>>   runtime < deadline <= period
>>>>
>>>> Although it is fine for deadline = period, runtime should be less
>>>> than deadline, otherwise the task will take over the system.
>>> I think if "runtime < deadline <= period" is not respected, then the
>>> admission control will fail... But yes, repeating it here can be
>>> useful. If needed I'll add it to the document.
>>
>> Yeah, it's one of those things that you should know, but I can see
>> users screwing it up ;-)
> Ok...
> What about presenting this as an example of the admission test failing?
> Something like: 'notice that if "runtime" > "deadine" the admission
> test will surely fail.'
> (BTW, I am not sure if the "deadline <= period" condition is really
> needed... Maybe not respecting it does not make too much sense, but I
> think it is not strictly needed for schedulability purposes).
> 
> 
> 
> 			Thanks,
> 				Luca
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-27 17:09     ` Steven Rostedt
@ 2014-01-27 22:29       ` Luca Abeni
  2014-01-28 10:03         ` Juri Lelli
  0 siblings, 1 reply; 33+ messages in thread
From: Luca Abeni @ 2014-01-27 22:29 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Juri Lelli, peterz, tglx, mingo, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield, linux-doc, rob

Hi Steven,

On Mon, 27 Jan 2014 12:09:38 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
[...]
> > > Lets take a case where deadline == period. It seems that the above
> > > would be true any time there was any delay to starting the task
> > > or the task was interrupted by another SCHED_DEADLINE task.
> > Not sure about this... Notice that above only applies when a task
> > wakes up (moving from a "sleeping" state to a "ready to run"
> > state). Not when an already ready task is scheduled.
> > Or did I misunderstand your comment?
> 
> No no, you understood, I missed that this only happens on wakeup. But
> then I have to ask, what happens if the task blocks on a mutex? Would
> that cause this check to happen then?
Well, in theory, this check has to be performed every time the task
wakes up from any kind of sleeping (mutex, or anything else).
I think this is what happens in practice with the current
implementation, but maybe I am missing something in the implementation
details.

> It may be nice to add some comments about exactly when this check is
> performed.
Well, maybe "wake-up" is not the right term according to the
terminology used in the Linux kernel... The check should be performed
every time the task changes its state from sleeping (TASK_INTERRUPTIBLE
or TASK_UNINTERRUPTIBLE, I think) to "ready to execute" (TASK_RUNNING,
I think) and enters the ready queue (well, RB tree :). If someone can
suggest a more appropriate wording, I'll fix the document accordingly.


> > > For example, lets say we have two SD tasks. One that has 50ms
> > > runtime and a 100ms period. The other has a 1ms runtime and a
> > > 10ms period.
> > > 
> > > The above two should work perfectly fine together. The 10ms period
> > > task will constantly schedule in on the 100ms task.
> > > 
> > > When the 100ms task runs, it could easily be delayed by 1ms due
> > > to the 10ms task. Then lets look at the above equation
> > See above: the check for the 100ms task is only performed when the
> > task unblocks (at the beginning of its period), not when it's
> > scheduled (after the 10ms taks).
> > 
> > This check is designed to take care of the following situation:
> > - consider a task with runtime 10ms and period equal to deadline
> > equal to 100ms
> > - assume the task wakes up at time t, and is assigned "remaining
> >   runtime" 10ms and "scheduling deadline" t+100ms
> > - assume the task blocks after executing for 2ms. The "remaining
> >   runtime" is now 8ms
> > - can the task use "remaining runtime" 8ms and "scheduling deadline"
> >   t+100ms when it wakes up again?
> >   Answer: if it wakes up before t+20ms, it can. Otherwise, it
> > cannot, because it would consume a fraction of CPU time larger than
> > 10%, causing missed deadlines in other tasks.
> 
> Ah, you answered my question here. The check happens every time the
> task blocks as well. I still need to read the papers, and even more
> importantly, start running more tests with tracing on to review what
> exactly happens in the implementation.
If you read the original CBS paper, you'll see that it does not talk
about "wake-up" and "sleep", because it always considers a wake-up as a
job arrival, and a task blocking as a job end. This is because the
original paper only considers a simplified task model, without
"self-suspending jobs".
In SCHED_DEADLINE, things are not so simple, because it has to consider
real situations with real tasks not using the simplified original
real-time task model...

> > > > + SCHED_DEADLINE can be used to schedule real-time tasks
> > > > guaranteeing that
> > > > + the jobs' deadlines of a task are respected. In order to do
> > > > this, a task
> > > > + must be scheduled by setting:
> > > > +
> > > > +  - runtime >= WCET
> > > > +  - deadline = D
> > > > +  - period <= P
> > > > +
> > > > + IOW, if runtime >= WCET and if period is >= P, then the
> > > > scheduling deadlines
> > > > + and the absolute deadlines (d_j) coincide, so a proper
> > > > admission control
> > > > + allows to respect the jobs' absolute deadlines for this task
> > > > (this is what is
> > > > + called "hard schedulability property" and is an extension of
> > > > Lemma 1 of [2]).
> > > 
> > > I wonder if we should state the obvious (which is never obvious).
> > > That is the user must also have the following.
> > > 
> > >   runtime < deadline <= period
> > > 
> > > Although it is fine for deadline = period, runtime should be less
> > > than deadline, otherwise the task will take over the system.
> > I think if "runtime < deadline <= period" is not respected, then the
> > admission control will fail... But yes, repeating it here can be
> > useful. If needed I'll add it to the document.
> 
> Yeah, it's one of those things that you should know, but I can see
> users screwing it up ;-)
Ok...
What about presenting this as an example of the admission test failing?
Something like: 'notice that if "runtime" > "deadine" the admission
test will surely fail.'
(BTW, I am not sure if the "deadline <= period" condition is really
needed... Maybe not respecting it does not make too much sense, but I
think it is not strictly needed for schedulability purposes).



			Thanks,
				Luca

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-27 16:56   ` Luca Abeni
@ 2014-01-27 17:09     ` Steven Rostedt
  2014-01-27 22:29       ` Luca Abeni
  0 siblings, 1 reply; 33+ messages in thread
From: Steven Rostedt @ 2014-01-27 17:09 UTC (permalink / raw)
  To: Luca Abeni
  Cc: Juri Lelli, peterz, tglx, mingo, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield, linux-doc, rob

On Mon, 27 Jan 2014 17:56:07 +0100
Luca Abeni <luca.abeni@unitn.it> wrote:
 
> > > +
> > > +    then, if the scheduling deadline is smaller than the current
> > > time, or
> > > +    this condition is verified, the scheduling deadline and the
> > > +    current budget are re-initialised as
> > > +
> > > +         scheduling deadline = current time + deadline
> > > +         current runtime = runtime
> > > +
> > > +    otherwise, the scheduling deadline and the current runtime are
> > > +    left unchanged;
> > 
> > I've been trying to wrap my head around this a bit. I started writing
> > an email about this when I first examined the patches and never sent
> > it out :-p
> > 
> > Lets take a case where deadline == period. It seems that the above
> > would be true any time there was any delay to starting the task or the
> > task was interrupted by another SCHED_DEADLINE task.
> Not sure about this... Notice that above only applies when a task wakes
> up (moving from a "sleeping" state to a "ready to run" state). Not when
> an already ready task is scheduled.
> Or did I misunderstand your comment?

No no, you understood, I missed that this only happens on wakeup. But
then I have to ask, what happens if the task blocks on a mutex? Would
that cause this check to happen then? It may be nice to add some
comments about exactly when this check is performed.

> 
> 
> > For example, lets say we have two SD tasks. One that has 50ms runtime
> > and a 100ms period. The other has a 1ms runtime and a 10ms period.
> > 
> > The above two should work perfectly fine together. The 10ms period
> > task will constantly schedule in on the 100ms task.
> > 
> > When the 100ms task runs, it could easily be delayed by 1ms due to the
> > 10ms task. Then lets look at the above equation
> See above: the check for the 100ms task is only performed when the task
> unblocks (at the beginning of its period), not when it's scheduled
> (after the 10ms taks).
> 
> This check is designed to take care of the following situation:
> - consider a task with runtime 10ms and period equal to deadline equal
>   to 100ms
> - assume the task wakes up at time t, and is assigned "remaining
>   runtime" 10ms and "scheduling deadline" t+100ms
> - assume the task blocks after executing for 2ms. The "remaining
>   runtime" is now 8ms
> - can the task use "remaining runtime" 8ms and "scheduling deadline"
>   t+100ms when it wakes up again?
>   Answer: if it wakes up before t+20ms, it can. Otherwise, it cannot,
>   because it would consume a fraction of CPU time larger than 10%,
>   causing missed deadlines in other tasks.

Ah, you answered my question here. The check happens every time the
task blocks as well. I still need to read the papers, and even more
importantly, start running more tests with tracing on to review what
exactly happens in the implementation.

> 
> [...]
> > > + SCHED_DEADLINE can be used to schedule real-time tasks
> > > guaranteeing that
> > > + the jobs' deadlines of a task are respected. In order to do this,
> > > a task
> > > + must be scheduled by setting:
> > > +
> > > +  - runtime >= WCET
> > > +  - deadline = D
> > > +  - period <= P
> > > +
> > > + IOW, if runtime >= WCET and if period is >= P, then the
> > > scheduling deadlines
> > > + and the absolute deadlines (d_j) coincide, so a proper admission
> > > control
> > > + allows to respect the jobs' absolute deadlines for this task
> > > (this is what is
> > > + called "hard schedulability property" and is an extension of
> > > Lemma 1 of [2]).
> > 
> > I wonder if we should state the obvious (which is never obvious). That
> > is the user must also have the following.
> > 
> >   runtime < deadline <= period
> > 
> > Although it is fine for deadline = period, runtime should be less than
> > deadline, otherwise the task will take over the system.
> I think if "runtime < deadline <= period" is not respected, then the
> admission control will fail... But yes, repeating it here can be
> useful. If needed I'll add it to the document.

Yeah, it's one of those things that you should know, but I can see
users screwing it up ;-)

-- Steve


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-27 15:35 ` Steven Rostedt
@ 2014-01-27 16:56   ` Luca Abeni
  2014-01-27 17:09     ` Steven Rostedt
  0 siblings, 1 reply; 33+ messages in thread
From: Luca Abeni @ 2014-01-27 16:56 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Juri Lelli, peterz, tglx, mingo, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield, linux-doc, rob

Hi Steven,

On Mon, 27 Jan 2014 10:35:56 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
[...]
> > + to be executed first.  Thanks to this feature, also tasks that do
> > not
> > + strictly comply with the "traditional" real-time task model (see
> > Section 3)
> > + can effectively use the new policy.
> > +
> > + In more details, the CBS algorithm assigns scheduling deadlines to
> > + tasks in the following way:
> > +
> > +  - Each SCHED_DEADLINE task is characterised by the "runtime",
> > +    "deadline", and "period" parameters;
> > +
> > +  - The state of the task is described by a "scheduling deadline",
> > and
> > +    a "current runtime". These two parameters are initially set to
> > 0; +
> > +  - When a SCHED_DEADLINE task wakes up (becomes ready for
> > execution),
> > +    the scheduler checks if
> > +
> > +                    current runtime                runtime
> > +         ---------------------------------- > ----------------
> > +         scheduling deadline - current time         period
> 
> Just a nit on formatting above. I was confused at first because the
> above looks more like a movement in time (the > being an arrow), then
> after reading the below, I realized that the above is actually an
> equation. This can be fixed by adding more spaces between the
> fractions and the greater than sign:
> 
>                  current runtime                      runtime
>       ----------------------------------    >    ----------------
>       scheduling deadline - current time               period
Ok; I was not sure about how to add an equation in the document, and I
did the first thing I could think about. I'll fix it.

> > +
> > +    then, if the scheduling deadline is smaller than the current
> > time, or
> > +    this condition is verified, the scheduling deadline and the
> > +    current budget are re-initialised as
> > +
> > +         scheduling deadline = current time + deadline
> > +         current runtime = runtime
> > +
> > +    otherwise, the scheduling deadline and the current runtime are
> > +    left unchanged;
> 
> I've been trying to wrap my head around this a bit. I started writing
> an email about this when I first examined the patches and never sent
> it out :-p
> 
> Lets take a case where deadline == period. It seems that the above
> would be true any time there was any delay to starting the task or the
> task was interrupted by another SCHED_DEADLINE task.
Not sure about this... Notice that above only applies when a task wakes
up (moving from a "sleeping" state to a "ready to run" state). Not when
an already ready task is scheduled.
Or did I misunderstand your comment?


> For example, lets say we have two SD tasks. One that has 50ms runtime
> and a 100ms period. The other has a 1ms runtime and a 10ms period.
> 
> The above two should work perfectly fine together. The 10ms period
> task will constantly schedule in on the 100ms task.
> 
> When the 100ms task runs, it could easily be delayed by 1ms due to the
> 10ms task. Then lets look at the above equation
See above: the check for the 100ms task is only performed when the task
unblocks (at the beginning of its period), not when it's scheduled
(after the 10ms taks).

This check is designed to take care of the following situation:
- consider a task with runtime 10ms and period equal to deadline equal
  to 100ms
- assume the task wakes up at time t, and is assigned "remaining
  runtime" 10ms and "scheduling deadline" t+100ms
- assume the task blocks after executing for 2ms. The "remaining
  runtime" is now 8ms
- can the task use "remaining runtime" 8ms and "scheduling deadline"
  t+100ms when it wakes up again?
  Answer: if it wakes up before t+20ms, it can. Otherwise, it cannot,
  because it would consume a fraction of CPU time larger than 10%,
  causing missed deadlines in other tasks.

[...]
> > + SCHED_DEADLINE can be used to schedule real-time tasks
> > guaranteeing that
> > + the jobs' deadlines of a task are respected. In order to do this,
> > a task
> > + must be scheduled by setting:
> > +
> > +  - runtime >= WCET
> > +  - deadline = D
> > +  - period <= P
> > +
> > + IOW, if runtime >= WCET and if period is >= P, then the
> > scheduling deadlines
> > + and the absolute deadlines (d_j) coincide, so a proper admission
> > control
> > + allows to respect the jobs' absolute deadlines for this task
> > (this is what is
> > + called "hard schedulability property" and is an extension of
> > Lemma 1 of [2]).
> 
> I wonder if we should state the obvious (which is never obvious). That
> is the user must also have the following.
> 
>   runtime < deadline <= period
> 
> Although it is fine for deadline = period, runtime should be less than
> deadline, otherwise the task will take over the system.
I think if "runtime < deadline <= period" is not respected, then the
admission control will fail... But yes, repeating it here can be
useful. If needed I'll add it to the document.


			Thanks,
				Luca

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-27 11:20 Juri Lelli
  2014-01-27 11:23 ` Juri Lelli
  2014-01-27 11:53 ` Henrik Austad
@ 2014-01-27 15:35 ` Steven Rostedt
  2014-01-27 16:56   ` Luca Abeni
  2 siblings, 1 reply; 33+ messages in thread
From: Steven Rostedt @ 2014-01-27 15:35 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield, linux-doc, rob

On Mon, 27 Jan 2014 12:20:15 +0100
Juri Lelli <juri.lelli@gmail.com> wrote:


> +
> +2. Scheduling algorithm
> +==================
> +
> + SCHED_DEADLINE uses three parameters, named "runtime", "period", and
> + "deadline" to schedule tasks. A SCHED_DEADLINE task is guaranteed to receive
> + "runtime" microseconds of execution time every "period" microseconds, and
> + these "runtime" microseconds are available within "deadline" microseconds
> + from the beginning of the period.  In order to implement this behaviour,
> + every time the task wakes up, the scheduler computes a "scheduling deadline"
> + consistent with the guarantee (using the CBS[2,3] algorithm). Tasks are then
> + scheduled using EDF[1] on these scheduling deadlines (the task with the
> + smallest scheduling deadline is selected for execution). Notice that this

s/smallest/closest/

You can have a large deadline, but it could be the next one to trigger.

> + guaranteed is respected if a proper "admission control" strategy (see Section
> + "4. Bandwidth management") is used.
> +
> + Summing up, the CBS[2,3] algorithms assigns scheduling deadlines to tasks so
> + that each task runs for at most its runtime every period, avoiding any
> + interference between different tasks (bandwidth isolation), while the EDF[1]
> + algorithm selects the task with the smallest scheduling deadline as the one

Probably should say closest here too. Smallest doesn't mean that it is
the next deadline.

> + to be executed first.  Thanks to this feature, also tasks that do not
> + strictly comply with the "traditional" real-time task model (see Section 3)
> + can effectively use the new policy.
> +
> + In more details, the CBS algorithm assigns scheduling deadlines to
> + tasks in the following way:
> +
> +  - Each SCHED_DEADLINE task is characterised by the "runtime",
> +    "deadline", and "period" parameters;
> +
> +  - The state of the task is described by a "scheduling deadline", and
> +    a "current runtime". These two parameters are initially set to 0;
> +
> +  - When a SCHED_DEADLINE task wakes up (becomes ready for execution),
> +    the scheduler checks if
> +
> +                    current runtime                runtime
> +         ---------------------------------- > ----------------
> +         scheduling deadline - current time         period

Just a nit on formatting above. I was confused at first because the
above looks more like a movement in time (the > being an arrow), then
after reading the below, I realized that the above is actually an
equation. This can be fixed by adding more spaces between the fractions
and the greater than sign:

                 current runtime                      runtime
      ----------------------------------    >    ----------------
      scheduling deadline - current time               period


See, it no longer looks like a timeline:

     ---------------------------------- > ----------------


> +
> +    then, if the scheduling deadline is smaller than the current time, or
> +    this condition is verified, the scheduling deadline and the
> +    current budget are re-initialised as
> +
> +         scheduling deadline = current time + deadline
> +         current runtime = runtime
> +
> +    otherwise, the scheduling deadline and the current runtime are
> +    left unchanged;

I've been trying to wrap my head around this a bit. I started writing
an email about this when I first examined the patches and never sent it
out :-p

Lets take a case where deadline == period. It seems that the above
would be true any time there was any delay to starting the task or the
task was interrupted by another SCHED_DEADLINE task.

For example, lets say we have two SD tasks. One that has 50ms runtime
and a 100ms period. The other has a 1ms runtime and a 10ms period.

The above two should work perfectly fine together. The 10ms period task
will constantly schedule in on the 100ms task.

When the 100ms task runs, it could easily be delayed by 1ms due to the
10ms task. Then lets look at the above equation


50/99  >  50/100

That statement is true. Then I guess you make scheduling deadline =
current time + deadline, and leave the period alone? Then we run until
the next 10ms period. Get scheduled out by the 10ms task, and then
rescheduled back in. Looking at that we should have something like this:


(50 - 10) / (100 - 10 - 1)  = 40/89 which is less than 50/100

I haven't analyzed this down to see how this actually works, I have to
read the CBS papers. But is this what is expected?


> +
> +  - When a SCHED_DEADLINE task executes for an amount of time t, its
> +    current runtime is decreased as
> +
> +         current runtime = current runtime - t
> +
> +    (technically, the runtime is decreased at every tick, or when the
> +    task is descheduled / preempted);
> +
> +  - When the current runtime becomes less or equal than 0, the task is
> +    said to be "throttled" (also known as "depleted" in real-time literature)
> +    and cannot be scheduled until its scheduling deadline. The "replenishment
> +    time" for this task (see next item) is set to be equal to the current
> +    value of the scheduling deadline;
> +
> +  - When the current time is equal to the replenishment time of a
> +    throttled task, the scheduling deadline and the current runtime are
> +    updated as
> +
> +         scheduling deadline = scheduling deadline + period
> +         current runtime = current runtime + runtime
> +
> +
> +3. Scheduling Real-Time Tasks
> +=============================
> +
> + * BIG FAT WARNING ******************************************************
> + *
> + * This section contains a (not-thorough) summary on classical deadline
> + * scheduling theory, and how it applies to SCHED_DEADLINE.
> + * The reader can "safely" skip to Section 4 if only interested in seeing
> + * how the scheduling policy can be used. Anyway, we strongly recommend
> + * to come back here and continue reading (once the urge for testing is
> + * satisfied :P) to be sure of fully understanding all technical details.
> + ************************************************************************
> +
> + There are no limitations on what kind of task can exploit this new
> + scheduling discipline, even if it must be said that it is particularly
> + suited for periodic or sporadic real-time tasks that need guarantees on their
> + timing behavior, e.g., multimedia, streaming, control applications, etc.
> +
> + A typical real-time task is composed of a repetition of computation phases
> + (task instances, or jobs) which are activated on a periodic or sporadic
> + fashion.
> + Each job J_j (where J_j is the j^th job of the task) is characterised by an
> + arrival time r_j (the time when the job starts), an amount of computation
> + time c_j needed to finish the job, and a job absolute deadline d_j, which
> + is the time within which the job should be finished. The maximum execution
> + time max_j{c_j} is called "Worst Case Execution Time" (WCET) for the task.
> + A real-time task can be periodic with period P if r_{j+1} = r_j + P, or
> + sporadic with minimum inter-arrival time P is r_{j+1} >= r_j + P. Finally,
> + d_j = r_j + D, where D is the task's relative deadline.
> +
> + SCHED_DEADLINE can be used to schedule real-time tasks guaranteeing that
> + the jobs' deadlines of a task are respected. In order to do this, a task
> + must be scheduled by setting:
> +
> +  - runtime >= WCET
> +  - deadline = D
> +  - period <= P
> +
> + IOW, if runtime >= WCET and if period is >= P, then the scheduling deadlines
> + and the absolute deadlines (d_j) coincide, so a proper admission control
> + allows to respect the jobs' absolute deadlines for this task (this is what is
> + called "hard schedulability property" and is an extension of Lemma 1 of [2]).

I wonder if we should state the obvious (which is never obvious). That
is the user must also have the following.

  runtime < deadline <= period

Although it is fine for deadline = period, runtime should be less than
deadline, otherwise the task will take over the system.

> +
> + References:
> +  1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogram-
> +      ming in a hard-real-time environment. Journal of the Association for
> +      Computing Machinery, 20(1), 1973.
> +  2 - L. Abeni , G. Buttazzo. Integrating Multimedia Applications in Hard
> +      Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems
> +      Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
> +  3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
> +      Technical Report. http://xoomer.virgilio.it/lucabe72/pubs/tr-98-01.ps
> +
> +4. Bandwidth management
> +=======================
> +
> + In order for the -deadline scheduling to be effective and useful, it is
> + important to have some method to keep the allocation of the available CPU
> + bandwidth to the tasks under control.
> + This is usually called "admission control" and if it is not performed at all,
> + no guarantee can be given on the actual scheduling of the -deadline tasks.
> +
> + Since when RT-throttling has been introduced each task group has a bandwidth
> + associated, calculated as a certain amount of runtime over a period.

The above sentence doesn't make any sense. Most notably the "Since
when". As that is something my teenage daughter usually shots at me
when I tell her she needs to do one of her chores.


> + Moreover, to make it possible to manipulate such bandwidth, readable/writable
> + controls have been added to both procfs (for system wide settings) and cgroupfs
> + (for per-group settings).
> + Therefore, the same interface is being used for controlling the bandwidth
> + distrubution to -deadline tasks.

The last sentence above is also confusing. We added controls to proc
and cgroupfs, but then state the same interface is being used. Do you
mean that the proc and cgroupfs have the same type of interface for
each? Perhaps say something like "the same interface files are being
used in both locations".

> +
> + However, more discussion is needed in order to figure out how we want to manage
> + SCHED_DEADLINE bandwidth at the task group level. Therefore, SCHED_DEADLINE
> + uses (for now) a less sophisticated, but actually very sensible, mechanism to
> + ensure that a certain utilization cap is not overcome per each root_domain.

What mechanism is this?

> +
> + Another main difference between deadline bandwidth management and RT-throttling
> + is that -deadline tasks have bandwidth on their own (while -rt ones don't!),
> + and thus we don't need an higher level throttling mechanism to enforce the
> + desired bandwidth.
> +
> +4.1 System wide settings
> +------------------------


The rest looks good.

-- Steve

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-27 12:40     ` Henrik Austad
@ 2014-01-27 12:52       ` Luca Abeni
  0 siblings, 0 replies; 33+ messages in thread
From: Luca Abeni @ 2014-01-27 12:52 UTC (permalink / raw)
  To: Henrik Austad
  Cc: Juri Lelli, peterz, tglx, mingo, rostedt, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield, linux-doc, rob

On 01/27/2014 01:40 PM, Henrik Austad wrote:
[...]
>>> Current runtime: time spent running _this_ period? or is _remaining_
>>> runtime this period? I get the feeling it's the latter.
>>>
>>> So, roughly, it is the ration
>>>
>>>        remaining_runtime / relative_time_to_deadline
>>>
>>> which needs to be greater than the assigned CPU bandwidth, and if so, the
>>> budget should be replensihed?
>>>
>>> Shouldn't there be something about not refilling the budget before a new
>>> period has started?
>> Uh... Maybe the description above can be improved :)
>> Do you think that using "remaining runtime" instead of "current runtime"
>> would help? If yes, I'll make this change.
>
> Yes, in my vocabularly, "remaining" != "current" :), so changing to
> 'remaining runtime' would be nice.
Ok; I'll make this change.


>> Also, I see that some of your questions are answered by some items below...
>
> Yes, but I left the comments to indicate that the order was a bit
> confusing.
>
>> Do you think that changing the order of the items in the presentation would
>> improve the readability? If you suggest a better ordering, I'll try to
>> rewrite the algorithm's description according to it.
>
> Could be, at least the ratio-caclulation was a bit confusing until I'd read
> the entire section.
>
> My preferred order (I've just cut'n'pased from the original email here)
>
> +  - The state of the task is described by a "scheduling deadline", and
> +    a "current runtime". These two parameters are initially set to 0;
BTW, maybe some of the confusion can be avoided by explaining what the
"remaining runtime" is here... Something like:
- The state of the task is described by a "scheduling deadline", and
   a "current runtime" (representing the amount of execution time that
   the task can use before the scheduling deadline). These two parameters
   are initially set to 0;

Can something like this help? If yes, I'll update the document.


[...]
> +  - When a SCHED_DEADLINE task wakes up (becomes ready for execution),
> +    the scheduler checks if
> +
> +                    current runtime                runtime
> +         ---------------------------------- > ----------------
> +         scheduling deadline - current time         period
> +
> +    then, if the scheduling deadline is smaller than the current time, or
> +    this condition is verified, the scheduling deadline and the
> +    current budget are re-initialised as
> +
> +         scheduling deadline = current time + deadline
> +         current runtime = runtime
> +
> +    otherwise, the scheduling deadline and the current runtime are
> +    left unchanged;
> +
>
> Emphasis on -my- preferred order. :)
Ok; let's see what other people think, and then we decide the items'
order.


> Either way, I'm quite happy with this documentation-update, this is just
> nitpick :)
Ok, good.


			Thanks,
				Luca


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-27 12:30   ` Luca Abeni
@ 2014-01-27 12:40     ` Henrik Austad
  2014-01-27 12:52       ` Luca Abeni
  0 siblings, 1 reply; 33+ messages in thread
From: Henrik Austad @ 2014-01-27 12:40 UTC (permalink / raw)
  To: Luca Abeni
  Cc: Juri Lelli, peterz, tglx, mingo, rostedt, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield, linux-doc, rob

[-- Attachment #1: Type: text/plain, Size: 4369 bytes --]

On Mon, Jan 27, 2014 at 01:30:42PM +0100, Luca Abeni wrote:
> Hi Henrik,
> 
> On 01/27/2014 12:53 PM, Henrik Austad wrote:
> [...]
> >>+ In more details, the CBS algorithm assigns scheduling deadlines to
> >>+ tasks in the following way:
> >>+
> >>+  - Each SCHED_DEADLINE task is characterised by the "runtime",
> >>+    "deadline", and "period" parameters;

> >>+
> >>+  - The state of the task is described by a "scheduling deadline", and
> >>+    a "current runtime". These two parameters are initially set to 0;
> >>+
> >>+  - When a SCHED_DEADLINE task wakes up (becomes ready for execution),
> >>+    the scheduler checks if
> >>+
> >>+                    current runtime                runtime
> >>+         ---------------------------------- > ----------------
> >>+         scheduling deadline - current time         period
> >>+
> >>+    then, if the scheduling deadline is smaller than the current time, or
> >>+    this condition is verified, the scheduling deadline and the
> >>+    current budget are re-initialised as
> >
> >Current runtime: time spent running _this_ period? or is _remaining_
> >runtime this period? I get the feeling it's the latter.
> >
> >So, roughly, it is the ration
> >
> >       remaining_runtime / relative_time_to_deadline
> >
> >which needs to be greater than the assigned CPU bandwidth, and if so, the
> >budget should be replensihed?
> >
> >Shouldn't there be something about not refilling the budget before a new
> >period has started?
> Uh... Maybe the description above can be improved :)
> Do you think that using "remaining runtime" instead of "current runtime"
> would help? If yes, I'll make this change.

Yes, in my vocabularly, "remaining" != "current" :), so changing to 
'remaining runtime' would be nice.

> Also, I see that some of your questions are answered by some items below...

Yes, but I left the comments to indicate that the order was a bit 
confusing.

> Do you think that changing the order of the items in the presentation would
> improve the readability? If you suggest a better ordering, I'll try to
> rewrite the algorithm's description according to it.

Could be, at least the ratio-caclulation was a bit confusing until I'd read 
the entire section.

My preferred order (I've just cut'n'pased from the original email here)

+  - The state of the task is described by a "scheduling deadline", and
+    a "current runtime". These two parameters are initially set to 0;
+
+  - Each SCHED_DEADLINE task is characterised by the "runtime",
+    "deadline", and "period" parameters;
+



+  - When a SCHED_DEADLINE task executes for an amount of time t, its
+    current runtime is decreased as
+
+         current runtime = current runtime - t
+
+    (technically, the runtime is decreased at every tick, or when the
+    task is descheduled / preempted);
+
+  - When the current runtime becomes less or equal than 0, the task is
+    said to be "throttled" (also known as "depleted" in real-time literature)
+    and cannot be scheduled until its scheduling deadline. The "replenishment
+    time" for this task (see next item) is set to be equal to the current
+    value of the scheduling deadline;
+
+  - When the current time is equal to the replenishment time of a
+    throttled task, the scheduling deadline and the current runtime are
+    updated as
+
+         scheduling deadline = scheduling deadline + period
+         current runtime = current runtime + runtime




+  - When a SCHED_DEADLINE task wakes up (becomes ready for execution),
+    the scheduler checks if
+
+                    current runtime                runtime
+         ---------------------------------- > ----------------
+         scheduling deadline - current time         period
+
+    then, if the scheduling deadline is smaller than the current time, or
+    this condition is verified, the scheduling deadline and the
+    current budget are re-initialised as
+
+         scheduling deadline = current time + deadline
+         current runtime = runtime
+
+    otherwise, the scheduling deadline and the current runtime are
+    left unchanged;
+

Emphasis on -my- preferred order. :)

Either way, I'm quite happy with this documentation-update, this is just 
nitpick :)

-- 
Henrik Austad

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-27 11:53 ` Henrik Austad
@ 2014-01-27 12:30   ` Luca Abeni
  2014-01-27 12:40     ` Henrik Austad
  0 siblings, 1 reply; 33+ messages in thread
From: Luca Abeni @ 2014-01-27 12:30 UTC (permalink / raw)
  To: Henrik Austad, Juri Lelli
  Cc: peterz, tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield, linux-doc, rob

Hi Henrik,

On 01/27/2014 12:53 PM, Henrik Austad wrote:
[...]
>> + In more details, the CBS algorithm assigns scheduling deadlines to
>> + tasks in the following way:
>> +
>> +  - Each SCHED_DEADLINE task is characterised by the "runtime",
>> +    "deadline", and "period" parameters;
>> +
>> +  - The state of the task is described by a "scheduling deadline", and
>> +    a "current runtime". These two parameters are initially set to 0;
>> +
>> +  - When a SCHED_DEADLINE task wakes up (becomes ready for execution),
>> +    the scheduler checks if
>> +
>> +                    current runtime                runtime
>> +         ---------------------------------- > ----------------
>> +         scheduling deadline - current time         period
>> +
>> +    then, if the scheduling deadline is smaller than the current time, or
>> +    this condition is verified, the scheduling deadline and the
>> +    current budget are re-initialised as
>
> Current runtime: time spent running _this_ period? or is _remaining_
> runtime this period? I get the feeling it's the latter.
>
> So, roughly, it is the ration
>
>        remaining_runtime / relative_time_to_deadline
>
> which needs to be greater than the assigned CPU bandwidth, and if so, the
> budget should be replensihed?
>
> Shouldn't there be something about not refilling the budget before a new
> period has started?
Uh... Maybe the description above can be improved :)
Do you think that using "remaining runtime" instead of "current runtime"
would help? If yes, I'll make this change.

Also, I see that some of your questions are answered by some items below...
Do you think that changing the order of the items in the presentation would
improve the readability? If you suggest a better ordering, I'll try to
rewrite the algorithm's description according to it.


			Thanks,
				Luca

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-27 11:20 Juri Lelli
  2014-01-27 11:23 ` Juri Lelli
@ 2014-01-27 11:53 ` Henrik Austad
  2014-01-27 12:30   ` Luca Abeni
  2014-01-27 15:35 ` Steven Rostedt
  2 siblings, 1 reply; 33+ messages in thread
From: Henrik Austad @ 2014-01-27 11:53 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield, linux-doc,
	rob

[-- Attachment #1: Type: text/plain, Size: 17207 bytes --]

On Mon, Jan 27, 2014 at 12:20:15PM +0100, Juri Lelli wrote:
> From: Dario Faggioli <raistlin@linux.it>
> 
> Add in Documentation/scheduler/ some hints about the design
> choices, the usage and the future possible developments of the
> sched_dl scheduling class and of the SCHED_DEADLINE policy.
> 
> Cc: bruce.ashfield@windriver.com
> Cc: claudio@evidence.eu.com
> Cc: darren@dvhart.com
> Cc: dhaval.giani@gmail.com
> Cc: fchecconi@gmail.com
> Cc: fweisbec@gmail.com
> Cc: harald.gustafsson@ericsson.com
> Cc: hgu1972@gmail.com
> Cc: insop.song@gmail.com
> Cc: jkacur@redhat.com
> Cc: johan.eker@ericsson.com
> Cc: liming.wang@windriver.com
> Cc: michael@amarulasolutions.com
> Cc: mingo@redhat.com
> Cc: nicola.manica@disi.unitn.it
> Cc: oleg@redhat.com
> Cc: paulmck@linux.vnet.ibm.com
> Cc: p.faure@akatech.ch
> Cc: rob@landley.net
> Cc: rostedt@goodmis.org
> Cc: tglx@linutronix.de
> Cc: tommaso.cucinotta@sssup.it
> Cc: vincent.guittot@linaro.org
> Signed-off-by: Dario Faggioli <raistlin@linux.it>
> Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> [ Re-wrote sections 2 and 3. ]
> Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
> ---
>  Documentation/scheduler/00-INDEX           |    2 +
>  Documentation/scheduler/sched-deadline.txt |  281 ++++++++++++++++++++++++++++
>  kernel/sched/deadline.c                    |    3 +-
>  3 files changed, 285 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/scheduler/sched-deadline.txt
> 
> diff --git a/Documentation/scheduler/00-INDEX b/Documentation/scheduler/00-INDEX
> index d2651c4..46702e4 100644
> --- a/Documentation/scheduler/00-INDEX
> +++ b/Documentation/scheduler/00-INDEX
> @@ -10,5 +10,7 @@ sched-nice-design.txt
>  	- How and why the scheduler's nice levels are implemented.
>  sched-rt-group.txt
>  	- real-time group scheduling.
> +sched-deadline.txt
> +	- deadline scheduling.
>  sched-stats.txt
>  	- information on schedstats (Linux Scheduler Statistics).
> diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt
> new file mode 100644
> index 0000000..18adc92
> --- /dev/null
> +++ b/Documentation/scheduler/sched-deadline.txt
> @@ -0,0 +1,281 @@
> +			  Deadline Task Scheduling
> +			  ------------------------
> +
> +CONTENTS
> +========
> +
> + 0. WARNING
> + 1. Overview
> + 2. Scheduling algorithm
> + 3. Scheduling Real-Time Tasks
> + 4. Bandwidth management
> +   4.1 System-wide settings
> +   4.2 Task interface
> +   4.3 Default behavior
> + 5. Tasks CPU affinity
> +   5.1 SCHED_DEADLINE and cpusets HOWTO
> + 6. Future plans
> +
> +
> +0. WARNING
> +==========
> +
> + Fiddling with these settings can result in an unpredictable or even unstable
> + system behavior. As for -rt (group) scheduling, it is assumed that root users
> + know what they're doing.
> +
> +
> +1. Overview
> +===========
> +
> + The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
> + basically an implementation of the Earliest Deadline First (EDF) scheduling
> + algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
> + that makes it possible to isolate the behavior of tasks between each other.
> +
> +
> +2. Scheduling algorithm
> +==================
> +
> + SCHED_DEADLINE uses three parameters, named "runtime", "period", and
> + "deadline" to schedule tasks. A SCHED_DEADLINE task is guaranteed to receive
> + "runtime" microseconds of execution time every "period" microseconds, and
> + these "runtime" microseconds are available within "deadline" microseconds
> + from the beginning of the period.  In order to implement this behaviour,
> + every time the task wakes up, the scheduler computes a "scheduling deadline"
> + consistent with the guarantee (using the CBS[2,3] algorithm). Tasks are then
> + scheduled using EDF[1] on these scheduling deadlines (the task with the
> + smallest scheduling deadline is selected for execution). Notice that this
> + guaranteed is respected if a proper "admission control" strategy (see Section
> + "4. Bandwidth management") is used.
> +
> + Summing up, the CBS[2,3] algorithms assigns scheduling deadlines to tasks so
> + that each task runs for at most its runtime every period, avoiding any
> + interference between different tasks (bandwidth isolation), while the EDF[1]
> + algorithm selects the task with the smallest scheduling deadline as the one
> + to be executed first.  Thanks to this feature, also tasks that do not
> + strictly comply with the "traditional" real-time task model (see Section 3)
> + can effectively use the new policy.
> +
> + In more details, the CBS algorithm assigns scheduling deadlines to
> + tasks in the following way:
> +
> +  - Each SCHED_DEADLINE task is characterised by the "runtime",
> +    "deadline", and "period" parameters;
> +
> +  - The state of the task is described by a "scheduling deadline", and
> +    a "current runtime". These two parameters are initially set to 0;
> +
> +  - When a SCHED_DEADLINE task wakes up (becomes ready for execution),
> +    the scheduler checks if
> +
> +                    current runtime                runtime
> +         ---------------------------------- > ----------------
> +         scheduling deadline - current time         period
> +
> +    then, if the scheduling deadline is smaller than the current time, or
> +    this condition is verified, the scheduling deadline and the
> +    current budget are re-initialised as

Current runtime: time spent running _this_ period? or is _remaining_ 
runtime this period? I get the feeling it's the latter.

So, roughly, it is the ration

      remaining_runtime / relative_time_to_deadline 

which needs to be greater than the assigned CPU bandwidth, and if so, the 
budget should be replensihed?

Shouldn't there be something about not refilling the budget before a new 
period has started?

> +         scheduling deadline = current time + deadline
> +         current runtime = runtime
> +
> +    otherwise, the scheduling deadline and the current runtime are
> +    left unchanged;
> +
> +  - When a SCHED_DEADLINE task executes for an amount of time t, its
> +    current runtime is decreased as
> +
> +         current runtime = current runtime - t
> +
> +    (technically, the runtime is decreased at every tick, or when the
> +    task is descheduled / preempted);

Aha, there it is. Having it here makes sense, but it does wrapping ones 
head around this a bit harder than strictly necessary perhaps.

> +  - When the current runtime becomes less or equal than 0, the task is
> +    said to be "throttled" (also known as "depleted" in real-time literature)
> +    and cannot be scheduled until its scheduling deadline. The "replenishment
> +    time" for this task (see next item) is set to be equal to the current
> +    value of the scheduling deadline;
> +
> +  - When the current time is equal to the replenishment time of a
> +    throttled task, the scheduling deadline and the current runtime are
> +    updated as
> +
> +         scheduling deadline = scheduling deadline + period
> +         current runtime = current runtime + runtime

ok, this section makes sense now

> +3. Scheduling Real-Time Tasks
> +=============================
> +
> + * BIG FAT WARNING ******************************************************
> + *
> + * This section contains a (not-thorough) summary on classical deadline
> + * scheduling theory, and how it applies to SCHED_DEADLINE.

"This section should not be considered a complete summary of classical 
deadline scheduling theroy in any way AT ALL."

(not-thorough sounds a bit strange)

> + * The reader can "safely" skip to Section 4 if only interested in seeing
> + * how the scheduling policy can be used. Anyway, we strongly recommend
> + * to come back here and continue reading (once the urge for testing is
> + * satisfied :P) to be sure of fully understanding all technical details.
> + ************************************************************************
> +
> + There are no limitations on what kind of task can exploit this new
> + scheduling discipline, even if it must be said that it is particularly
> + suited for periodic or sporadic real-time tasks that need guarantees on their
> + timing behavior, e.g., multimedia, streaming, control applications, etc.
> +
> + A typical real-time task is composed of a repetition of computation phases
> + (task instances, or jobs) which are activated on a periodic or sporadic
> + fashion.
> + Each job J_j (where J_j is the j^th job of the task) is characterised by an
> + arrival time r_j (the time when the job starts), an amount of computation
> + time c_j needed to finish the job, and a job absolute deadline d_j, which
> + is the time within which the job should be finished. The maximum execution
> + time max_j{c_j} is called "Worst Case Execution Time" (WCET) for the task.
> + A real-time task can be periodic with period P if r_{j+1} = r_j + P, or
> + sporadic with minimum inter-arrival time P is r_{j+1} >= r_j + P. Finally,
> + d_j = r_j + D, where D is the task's relative deadline.

\o/

Great, thanks!

> + SCHED_DEADLINE can be used to schedule real-time tasks guaranteeing that
> + the jobs' deadlines of a task are respected. In order to do this, a task
> + must be scheduled by setting:
> +
> +  - runtime >= WCET
> +  - deadline = D
> +  - period <= P
> +
> + IOW, if runtime >= WCET and if period is >= P, then the scheduling deadlines
> + and the absolute deadlines (d_j) coincide, so a proper admission control
> + allows to respect the jobs' absolute deadlines for this task (this is what is
> + called "hard schedulability property" and is an extension of Lemma 1 of [2]).
> +
> + References:
> +  1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogram-
> +      ming in a hard-real-time environment. Journal of the Association for
> +      Computing Machinery, 20(1), 1973.
> +  2 - L. Abeni , G. Buttazzo. Integrating Multimedia Applications in Hard
> +      Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems
> +      Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
> +  3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
> +      Technical Report. http://xoomer.virgilio.it/lucabe72/pubs/tr-98-01.ps
> +
> +4. Bandwidth management
> +=======================
> +
> + In order for the -deadline scheduling to be effective and useful, it is
> + important to have some method to keep the allocation of the available CPU
> + bandwidth to the tasks under control.
> + This is usually called "admission control" and if it is not performed at all,
> + no guarantee can be given on the actual scheduling of the -deadline tasks.
> +
> + Since when RT-throttling has been introduced each task group has a bandwidth
> + associated, calculated as a certain amount of runtime over a period.
> + Moreover, to make it possible to manipulate such bandwidth, readable/writable
> + controls have been added to both procfs (for system wide settings) and cgroupfs
> + (for per-group settings).
> + Therefore, the same interface is being used for controlling the bandwidth
> + distrubution to -deadline tasks.
> +
> + However, more discussion is needed in order to figure out how we want to manage
> + SCHED_DEADLINE bandwidth at the task group level. Therefore, SCHED_DEADLINE
> + uses (for now) a less sophisticated, but actually very sensible, mechanism to
> + ensure that a certain utilization cap is not overcome per each root_domain.
> +
> + Another main difference between deadline bandwidth management and RT-throttling
> + is that -deadline tasks have bandwidth on their own (while -rt ones don't!),
> + and thus we don't need an higher level throttling mechanism to enforce the
> + desired bandwidth.
> +
> +4.1 System wide settings
> +------------------------
> +
> + The system wide settings are configured under the /proc virtual file system.
> +
> + For now the -rt knobs are used for dl admission control and the -deadline
> + runtime is accounted against the -rt runtime. We realise that this isn't
> + entirely desirable; however, it is better to have a small interface for now,
> + and be able to change it easily later. The ideal situation (see 5.) is to run
> + -rt tasks from a -deadline server; in which case the -rt bandwidth is a direct
> + subset of dl_bw.
> +
> + This means that, for a root_domain comprising M CPUs, -deadline tasks
> + can be created while the sum of their bandwidths stays below:
> +
> +   M * (sched_rt_runtime_us / sched_rt_period_us)
> +
> + It is also possible to disable this bandwidth management logic, and
> + be thus free of oversubscribing the system up to any arbitrary level.
> + This is done by writing -1 in /proc/sys/kernel/sched_rt_runtime_us.
> +
> +
> +4.2 Task interface
> +------------------
> +
> + Specifying a periodic/sporadic task that executes for a given amount of
> + runtime at each instance, and that is scheduled according to the urgency of
> + its own timing constraints needs, in general, a way of declaring:
> +  - a (maximum/typical) instance execution time,
> +  - a minimum interval between consecutive instances,
> +  - a time constraint by which each instance must be completed.
> +
> + Therefore:
> +  * a new struct sched_attr, containing all the necessary fields is
> +    provided;
> +  * the new scheduling related syscalls that manipulate it, i.e.,
> +    sched_setattr() and sched_getattr() are implemented.
> +
> +
> +4.3 Default behavior
> +---------------------
> +
> + The default value for SCHED_DEADLINE bandwidth is to have rt_runtime equal to
> + 950000. With rt_period equal to 1000000, by default, it means that -deadline
> + tasks can use at most 95%, multiplied by the number of CPUs that compose the
> + root_domain, for each root_domain.
> +
> + A -deadline task cannot fork.
> +
> +5. Tasks CPU affinity
> +=====================
> +
> + -deadline tasks cannot have an affinity mask smaller that the entire
> + root_domain they are created on. However, affinities can be specified
> + through the cpuset facility (Documentation/cgroups/cpusets.txt).
> +
> +5.1 SCHED_DEADLINE and cpusets HOWTO
> +------------------------------------
> +
> + An example of a simple configuration (pin a -deadline task to CPU0)
> + follows (rt-app is used to create a -deadline task).
> +
> + mkdir /dev/cpuset
> + mount -t cgroup -o cpuset cpuset /dev/cpuset
> + cd /dev/cpuset
> + mkdir cpu0
> + echo 0 > cpu0/cpuset.cpus
> + echo 0 > cpu0/cpuset.mems
> + echo 1 > cpuset.cpu_exclusive
> + echo 0 > cpuset.sched_load_balance
> + echo 1 > cpu0/cpuset.cpu_exclusive
> + echo 1 > cpu0/cpuset.mem_exclusive
> + echo $$ > cpu0/tasks
> + rt-app -t 100000:10000:d:0 -D5 (it is now actually superfluous to specify
> + task affinity)
> +
> +6. Future plans
> +===============
> +
> + Still missing:
> +
> +  - refinements to deadline inheritance, especially regarding the possibility
> +    of retaining bandwidth isolation among non-interacting tasks. This is
> +    being studied from both theoretical and practical points of view, and
> +    hopefully we should be able to produce some demonstrative code soon;
> +  - (c)group based bandwidth management, and maybe scheduling;
> +  - access control for non-root users (and related security concerns to
> +    address), which is the best way to allow unprivileged use of the mechanisms
> +    and how to prevent non-root users "cheat" the system?
> +
> + As already discussed, we are planning also to merge this work with the EDF
> + throttling patches [https://lkml.org/lkml/2010/2/23/239] but we still are in
> + the preliminary phases of the merge and we really seek feedback that would
> + help us decide on the direction it should take.
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 0de2482..0dd5e09 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -351,7 +351,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,
>   * disrupting the schedulability of the system. Otherwise, we should
>   * refill the runtime and set the deadline a period in the future,
>   * because keeping the current (absolute) deadline of the task would
> - * result in breaking guarantees promised to other tasks.
> + * result in breaking guarantees promised to other tasks (refer to
> + * Documentation/scheduler/sched-deadline.txt for more informations).
>   *
>   * This function returns true if:
>   *
> -- 
> 1.7.9.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-doc" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Nice! /me is very happy

Reviewed-by: Henrik Austad <henrik@austad.us>

-- 
Henrik Austad

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] sched/deadline: Add sched_dl documentation
  2014-01-27 11:20 Juri Lelli
@ 2014-01-27 11:23 ` Juri Lelli
  2014-01-27 11:53 ` Henrik Austad
  2014-01-27 15:35 ` Steven Rostedt
  2 siblings, 0 replies; 33+ messages in thread
From: Juri Lelli @ 2014-01-27 11:23 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield, linux-doc,
	rob

On 01/27/2014 12:20 PM, Juri Lelli wrote:
> From: Dario Faggioli <raistlin@linux.it>
> 
> Add in Documentation/scheduler/ some hints about the design
> choices, the usage and the future possible developments of the
> sched_dl scheduling class and of the SCHED_DEADLINE policy.
> 
> Cc: bruce.ashfield@windriver.com
> Cc: claudio@evidence.eu.com
> Cc: darren@dvhart.com
> Cc: dhaval.giani@gmail.com
> Cc: fchecconi@gmail.com
> Cc: fweisbec@gmail.com
> Cc: harald.gustafsson@ericsson.com
> Cc: hgu1972@gmail.com
> Cc: insop.song@gmail.com
> Cc: jkacur@redhat.com
> Cc: johan.eker@ericsson.com
> Cc: liming.wang@windriver.com
> Cc: michael@amarulasolutions.com
> Cc: mingo@redhat.com
> Cc: nicola.manica@disi.unitn.it
> Cc: oleg@redhat.com
> Cc: paulmck@linux.vnet.ibm.com
> Cc: p.faure@akatech.ch
> Cc: rob@landley.net
> Cc: rostedt@goodmis.org
> Cc: tglx@linutronix.de
> Cc: tommaso.cucinotta@sssup.it
> Cc: vincent.guittot@linaro.org
> Signed-off-by: Dario Faggioli <raistlin@linux.it>
> Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> [ Re-wrote sections 2 and 3. ]
> Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
> ---
>  Documentation/scheduler/00-INDEX           |    2 +
>  Documentation/scheduler/sched-deadline.txt |  281 ++++++++++++++++++++++++++++
>  kernel/sched/deadline.c                    |    3 +-
>  3 files changed, 285 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/scheduler/sched-deadline.txt
> 
> diff --git a/Documentation/scheduler/00-INDEX b/Documentation/scheduler/00-INDEX
> index d2651c4..46702e4 100644
> --- a/Documentation/scheduler/00-INDEX
> +++ b/Documentation/scheduler/00-INDEX
> @@ -10,5 +10,7 @@ sched-nice-design.txt
>  	- How and why the scheduler's nice levels are implemented.
>  sched-rt-group.txt
>  	- real-time group scheduling.
> +sched-deadline.txt
> +	- deadline scheduling.
>  sched-stats.txt
>  	- information on schedstats (Linux Scheduler Statistics).
> diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt
> new file mode 100644
> index 0000000..18adc92
> --- /dev/null
> +++ b/Documentation/scheduler/sched-deadline.txt
> @@ -0,0 +1,281 @@
> +			  Deadline Task Scheduling
> +			  ------------------------
> +
> +CONTENTS
> +========
> +
> + 0. WARNING
> + 1. Overview
> + 2. Scheduling algorithm
> + 3. Scheduling Real-Time Tasks

We also plan to add here something more about admission control in the next future.


Best,

- Juri

> + 4. Bandwidth management
> +   4.1 System-wide settings
> +   4.2 Task interface
> +   4.3 Default behavior
> + 5. Tasks CPU affinity
> +   5.1 SCHED_DEADLINE and cpusets HOWTO
> + 6. Future plans
> +
> +
> +0. WARNING
> +==========
> +
> + Fiddling with these settings can result in an unpredictable or even unstable
> + system behavior. As for -rt (group) scheduling, it is assumed that root users
> + know what they're doing.
> +
> +
> +1. Overview
> +===========
> +
> + The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
> + basically an implementation of the Earliest Deadline First (EDF) scheduling
> + algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
> + that makes it possible to isolate the behavior of tasks between each other.
> +
> +
> +2. Scheduling algorithm
> +==================
> +
> + SCHED_DEADLINE uses three parameters, named "runtime", "period", and
> + "deadline" to schedule tasks. A SCHED_DEADLINE task is guaranteed to receive
> + "runtime" microseconds of execution time every "period" microseconds, and
> + these "runtime" microseconds are available within "deadline" microseconds
> + from the beginning of the period.  In order to implement this behaviour,
> + every time the task wakes up, the scheduler computes a "scheduling deadline"
> + consistent with the guarantee (using the CBS[2,3] algorithm). Tasks are then
> + scheduled using EDF[1] on these scheduling deadlines (the task with the
> + smallest scheduling deadline is selected for execution). Notice that this
> + guaranteed is respected if a proper "admission control" strategy (see Section
> + "4. Bandwidth management") is used.
> +
> + Summing up, the CBS[2,3] algorithms assigns scheduling deadlines to tasks so
> + that each task runs for at most its runtime every period, avoiding any
> + interference between different tasks (bandwidth isolation), while the EDF[1]
> + algorithm selects the task with the smallest scheduling deadline as the one
> + to be executed first.  Thanks to this feature, also tasks that do not
> + strictly comply with the "traditional" real-time task model (see Section 3)
> + can effectively use the new policy.
> +
> + In more details, the CBS algorithm assigns scheduling deadlines to
> + tasks in the following way:
> +
> +  - Each SCHED_DEADLINE task is characterised by the "runtime",
> +    "deadline", and "period" parameters;
> +
> +  - The state of the task is described by a "scheduling deadline", and
> +    a "current runtime". These two parameters are initially set to 0;
> +
> +  - When a SCHED_DEADLINE task wakes up (becomes ready for execution),
> +    the scheduler checks if
> +
> +                    current runtime                runtime
> +         ---------------------------------- > ----------------
> +         scheduling deadline - current time         period
> +
> +    then, if the scheduling deadline is smaller than the current time, or
> +    this condition is verified, the scheduling deadline and the
> +    current budget are re-initialised as
> +
> +         scheduling deadline = current time + deadline
> +         current runtime = runtime
> +
> +    otherwise, the scheduling deadline and the current runtime are
> +    left unchanged;
> +
> +  - When a SCHED_DEADLINE task executes for an amount of time t, its
> +    current runtime is decreased as
> +
> +         current runtime = current runtime - t
> +
> +    (technically, the runtime is decreased at every tick, or when the
> +    task is descheduled / preempted);
> +
> +  - When the current runtime becomes less or equal than 0, the task is
> +    said to be "throttled" (also known as "depleted" in real-time literature)
> +    and cannot be scheduled until its scheduling deadline. The "replenishment
> +    time" for this task (see next item) is set to be equal to the current
> +    value of the scheduling deadline;
> +
> +  - When the current time is equal to the replenishment time of a
> +    throttled task, the scheduling deadline and the current runtime are
> +    updated as
> +
> +         scheduling deadline = scheduling deadline + period
> +         current runtime = current runtime + runtime
> +
> +
> +3. Scheduling Real-Time Tasks
> +=============================
> +
> + * BIG FAT WARNING ******************************************************
> + *
> + * This section contains a (not-thorough) summary on classical deadline
> + * scheduling theory, and how it applies to SCHED_DEADLINE.
> + * The reader can "safely" skip to Section 4 if only interested in seeing
> + * how the scheduling policy can be used. Anyway, we strongly recommend
> + * to come back here and continue reading (once the urge for testing is
> + * satisfied :P) to be sure of fully understanding all technical details.
> + ************************************************************************
> +
> + There are no limitations on what kind of task can exploit this new
> + scheduling discipline, even if it must be said that it is particularly
> + suited for periodic or sporadic real-time tasks that need guarantees on their
> + timing behavior, e.g., multimedia, streaming, control applications, etc.
> +
> + A typical real-time task is composed of a repetition of computation phases
> + (task instances, or jobs) which are activated on a periodic or sporadic
> + fashion.
> + Each job J_j (where J_j is the j^th job of the task) is characterised by an
> + arrival time r_j (the time when the job starts), an amount of computation
> + time c_j needed to finish the job, and a job absolute deadline d_j, which
> + is the time within which the job should be finished. The maximum execution
> + time max_j{c_j} is called "Worst Case Execution Time" (WCET) for the task.
> + A real-time task can be periodic with period P if r_{j+1} = r_j + P, or
> + sporadic with minimum inter-arrival time P is r_{j+1} >= r_j + P. Finally,
> + d_j = r_j + D, where D is the task's relative deadline.
> +
> + SCHED_DEADLINE can be used to schedule real-time tasks guaranteeing that
> + the jobs' deadlines of a task are respected. In order to do this, a task
> + must be scheduled by setting:
> +
> +  - runtime >= WCET
> +  - deadline = D
> +  - period <= P
> +
> + IOW, if runtime >= WCET and if period is >= P, then the scheduling deadlines
> + and the absolute deadlines (d_j) coincide, so a proper admission control
> + allows to respect the jobs' absolute deadlines for this task (this is what is
> + called "hard schedulability property" and is an extension of Lemma 1 of [2]).
> +
> + References:
> +  1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogram-
> +      ming in a hard-real-time environment. Journal of the Association for
> +      Computing Machinery, 20(1), 1973.
> +  2 - L. Abeni , G. Buttazzo. Integrating Multimedia Applications in Hard
> +      Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems
> +      Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
> +  3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
> +      Technical Report. http://xoomer.virgilio.it/lucabe72/pubs/tr-98-01.ps
> +
> +4. Bandwidth management
> +=======================
> +
> + In order for the -deadline scheduling to be effective and useful, it is
> + important to have some method to keep the allocation of the available CPU
> + bandwidth to the tasks under control.
> + This is usually called "admission control" and if it is not performed at all,
> + no guarantee can be given on the actual scheduling of the -deadline tasks.
> +
> + Since when RT-throttling has been introduced each task group has a bandwidth
> + associated, calculated as a certain amount of runtime over a period.
> + Moreover, to make it possible to manipulate such bandwidth, readable/writable
> + controls have been added to both procfs (for system wide settings) and cgroupfs
> + (for per-group settings).
> + Therefore, the same interface is being used for controlling the bandwidth
> + distrubution to -deadline tasks.
> +
> + However, more discussion is needed in order to figure out how we want to manage
> + SCHED_DEADLINE bandwidth at the task group level. Therefore, SCHED_DEADLINE
> + uses (for now) a less sophisticated, but actually very sensible, mechanism to
> + ensure that a certain utilization cap is not overcome per each root_domain.
> +
> + Another main difference between deadline bandwidth management and RT-throttling
> + is that -deadline tasks have bandwidth on their own (while -rt ones don't!),
> + and thus we don't need an higher level throttling mechanism to enforce the
> + desired bandwidth.
> +
> +4.1 System wide settings
> +------------------------
> +
> + The system wide settings are configured under the /proc virtual file system.
> +
> + For now the -rt knobs are used for dl admission control and the -deadline
> + runtime is accounted against the -rt runtime. We realise that this isn't
> + entirely desirable; however, it is better to have a small interface for now,
> + and be able to change it easily later. The ideal situation (see 5.) is to run
> + -rt tasks from a -deadline server; in which case the -rt bandwidth is a direct
> + subset of dl_bw.
> +
> + This means that, for a root_domain comprising M CPUs, -deadline tasks
> + can be created while the sum of their bandwidths stays below:
> +
> +   M * (sched_rt_runtime_us / sched_rt_period_us)
> +
> + It is also possible to disable this bandwidth management logic, and
> + be thus free of oversubscribing the system up to any arbitrary level.
> + This is done by writing -1 in /proc/sys/kernel/sched_rt_runtime_us.
> +
> +
> +4.2 Task interface
> +------------------
> +
> + Specifying a periodic/sporadic task that executes for a given amount of
> + runtime at each instance, and that is scheduled according to the urgency of
> + its own timing constraints needs, in general, a way of declaring:
> +  - a (maximum/typical) instance execution time,
> +  - a minimum interval between consecutive instances,
> +  - a time constraint by which each instance must be completed.
> +
> + Therefore:
> +  * a new struct sched_attr, containing all the necessary fields is
> +    provided;
> +  * the new scheduling related syscalls that manipulate it, i.e.,
> +    sched_setattr() and sched_getattr() are implemented.
> +
> +
> +4.3 Default behavior
> +---------------------
> +
> + The default value for SCHED_DEADLINE bandwidth is to have rt_runtime equal to
> + 950000. With rt_period equal to 1000000, by default, it means that -deadline
> + tasks can use at most 95%, multiplied by the number of CPUs that compose the
> + root_domain, for each root_domain.
> +
> + A -deadline task cannot fork.
> +
> +5. Tasks CPU affinity
> +=====================
> +
> + -deadline tasks cannot have an affinity mask smaller that the entire
> + root_domain they are created on. However, affinities can be specified
> + through the cpuset facility (Documentation/cgroups/cpusets.txt).
> +
> +5.1 SCHED_DEADLINE and cpusets HOWTO
> +------------------------------------
> +
> + An example of a simple configuration (pin a -deadline task to CPU0)
> + follows (rt-app is used to create a -deadline task).
> +
> + mkdir /dev/cpuset
> + mount -t cgroup -o cpuset cpuset /dev/cpuset
> + cd /dev/cpuset
> + mkdir cpu0
> + echo 0 > cpu0/cpuset.cpus
> + echo 0 > cpu0/cpuset.mems
> + echo 1 > cpuset.cpu_exclusive
> + echo 0 > cpuset.sched_load_balance
> + echo 1 > cpu0/cpuset.cpu_exclusive
> + echo 1 > cpu0/cpuset.mem_exclusive
> + echo $$ > cpu0/tasks
> + rt-app -t 100000:10000:d:0 -D5 (it is now actually superfluous to specify
> + task affinity)
> +
> +6. Future plans
> +===============
> +
> + Still missing:
> +
> +  - refinements to deadline inheritance, especially regarding the possibility
> +    of retaining bandwidth isolation among non-interacting tasks. This is
> +    being studied from both theoretical and practical points of view, and
> +    hopefully we should be able to produce some demonstrative code soon;
> +  - (c)group based bandwidth management, and maybe scheduling;
> +  - access control for non-root users (and related security concerns to
> +    address), which is the best way to allow unprivileged use of the mechanisms
> +    and how to prevent non-root users "cheat" the system?
> +
> + As already discussed, we are planning also to merge this work with the EDF
> + throttling patches [https://lkml.org/lkml/2010/2/23/239] but we still are in
> + the preliminary phases of the merge and we really seek feedback that would
> + help us decide on the direction it should take.
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 0de2482..0dd5e09 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -351,7 +351,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,
>   * disrupting the schedulability of the system. Otherwise, we should
>   * refill the runtime and set the deadline a period in the future,
>   * because keeping the current (absolute) deadline of the task would
> - * result in breaking guarantees promised to other tasks.
> + * result in breaking guarantees promised to other tasks (refer to
> + * Documentation/scheduler/sched-deadline.txt for more informations).
>   *
>   * This function returns true if:
>   *
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] sched/deadline: Add sched_dl documentation
@ 2014-01-27 11:20 Juri Lelli
  2014-01-27 11:23 ` Juri Lelli
                   ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Juri Lelli @ 2014-01-27 11:20 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield, linux-doc,
	rob

From: Dario Faggioli <raistlin@linux.it>

Add in Documentation/scheduler/ some hints about the design
choices, the usage and the future possible developments of the
sched_dl scheduling class and of the SCHED_DEADLINE policy.

Cc: bruce.ashfield@windriver.com
Cc: claudio@evidence.eu.com
Cc: darren@dvhart.com
Cc: dhaval.giani@gmail.com
Cc: fchecconi@gmail.com
Cc: fweisbec@gmail.com
Cc: harald.gustafsson@ericsson.com
Cc: hgu1972@gmail.com
Cc: insop.song@gmail.com
Cc: jkacur@redhat.com
Cc: johan.eker@ericsson.com
Cc: liming.wang@windriver.com
Cc: michael@amarulasolutions.com
Cc: mingo@redhat.com
Cc: nicola.manica@disi.unitn.it
Cc: oleg@redhat.com
Cc: paulmck@linux.vnet.ibm.com
Cc: p.faure@akatech.ch
Cc: rob@landley.net
Cc: rostedt@goodmis.org
Cc: tglx@linutronix.de
Cc: tommaso.cucinotta@sssup.it
Cc: vincent.guittot@linaro.org
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
[ Re-wrote sections 2 and 3. ]
Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
---
 Documentation/scheduler/00-INDEX           |    2 +
 Documentation/scheduler/sched-deadline.txt |  281 ++++++++++++++++++++++++++++
 kernel/sched/deadline.c                    |    3 +-
 3 files changed, 285 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/scheduler/sched-deadline.txt

diff --git a/Documentation/scheduler/00-INDEX b/Documentation/scheduler/00-INDEX
index d2651c4..46702e4 100644
--- a/Documentation/scheduler/00-INDEX
+++ b/Documentation/scheduler/00-INDEX
@@ -10,5 +10,7 @@ sched-nice-design.txt
 	- How and why the scheduler's nice levels are implemented.
 sched-rt-group.txt
 	- real-time group scheduling.
+sched-deadline.txt
+	- deadline scheduling.
 sched-stats.txt
 	- information on schedstats (Linux Scheduler Statistics).
diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt
new file mode 100644
index 0000000..18adc92
--- /dev/null
+++ b/Documentation/scheduler/sched-deadline.txt
@@ -0,0 +1,281 @@
+			  Deadline Task Scheduling
+			  ------------------------
+
+CONTENTS
+========
+
+ 0. WARNING
+ 1. Overview
+ 2. Scheduling algorithm
+ 3. Scheduling Real-Time Tasks
+ 4. Bandwidth management
+   4.1 System-wide settings
+   4.2 Task interface
+   4.3 Default behavior
+ 5. Tasks CPU affinity
+   5.1 SCHED_DEADLINE and cpusets HOWTO
+ 6. Future plans
+
+
+0. WARNING
+==========
+
+ Fiddling with these settings can result in an unpredictable or even unstable
+ system behavior. As for -rt (group) scheduling, it is assumed that root users
+ know what they're doing.
+
+
+1. Overview
+===========
+
+ The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
+ basically an implementation of the Earliest Deadline First (EDF) scheduling
+ algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
+ that makes it possible to isolate the behavior of tasks between each other.
+
+
+2. Scheduling algorithm
+==================
+
+ SCHED_DEADLINE uses three parameters, named "runtime", "period", and
+ "deadline" to schedule tasks. A SCHED_DEADLINE task is guaranteed to receive
+ "runtime" microseconds of execution time every "period" microseconds, and
+ these "runtime" microseconds are available within "deadline" microseconds
+ from the beginning of the period.  In order to implement this behaviour,
+ every time the task wakes up, the scheduler computes a "scheduling deadline"
+ consistent with the guarantee (using the CBS[2,3] algorithm). Tasks are then
+ scheduled using EDF[1] on these scheduling deadlines (the task with the
+ smallest scheduling deadline is selected for execution). Notice that this
+ guaranteed is respected if a proper "admission control" strategy (see Section
+ "4. Bandwidth management") is used.
+
+ Summing up, the CBS[2,3] algorithms assigns scheduling deadlines to tasks so
+ that each task runs for at most its runtime every period, avoiding any
+ interference between different tasks (bandwidth isolation), while the EDF[1]
+ algorithm selects the task with the smallest scheduling deadline as the one
+ to be executed first.  Thanks to this feature, also tasks that do not
+ strictly comply with the "traditional" real-time task model (see Section 3)
+ can effectively use the new policy.
+
+ In more details, the CBS algorithm assigns scheduling deadlines to
+ tasks in the following way:
+
+  - Each SCHED_DEADLINE task is characterised by the "runtime",
+    "deadline", and "period" parameters;
+
+  - The state of the task is described by a "scheduling deadline", and
+    a "current runtime". These two parameters are initially set to 0;
+
+  - When a SCHED_DEADLINE task wakes up (becomes ready for execution),
+    the scheduler checks if
+
+                    current runtime                runtime
+         ---------------------------------- > ----------------
+         scheduling deadline - current time         period
+
+    then, if the scheduling deadline is smaller than the current time, or
+    this condition is verified, the scheduling deadline and the
+    current budget are re-initialised as
+
+         scheduling deadline = current time + deadline
+         current runtime = runtime
+
+    otherwise, the scheduling deadline and the current runtime are
+    left unchanged;
+
+  - When a SCHED_DEADLINE task executes for an amount of time t, its
+    current runtime is decreased as
+
+         current runtime = current runtime - t
+
+    (technically, the runtime is decreased at every tick, or when the
+    task is descheduled / preempted);
+
+  - When the current runtime becomes less or equal than 0, the task is
+    said to be "throttled" (also known as "depleted" in real-time literature)
+    and cannot be scheduled until its scheduling deadline. The "replenishment
+    time" for this task (see next item) is set to be equal to the current
+    value of the scheduling deadline;
+
+  - When the current time is equal to the replenishment time of a
+    throttled task, the scheduling deadline and the current runtime are
+    updated as
+
+         scheduling deadline = scheduling deadline + period
+         current runtime = current runtime + runtime
+
+
+3. Scheduling Real-Time Tasks
+=============================
+
+ * BIG FAT WARNING ******************************************************
+ *
+ * This section contains a (not-thorough) summary on classical deadline
+ * scheduling theory, and how it applies to SCHED_DEADLINE.
+ * The reader can "safely" skip to Section 4 if only interested in seeing
+ * how the scheduling policy can be used. Anyway, we strongly recommend
+ * to come back here and continue reading (once the urge for testing is
+ * satisfied :P) to be sure of fully understanding all technical details.
+ ************************************************************************
+
+ There are no limitations on what kind of task can exploit this new
+ scheduling discipline, even if it must be said that it is particularly
+ suited for periodic or sporadic real-time tasks that need guarantees on their
+ timing behavior, e.g., multimedia, streaming, control applications, etc.
+
+ A typical real-time task is composed of a repetition of computation phases
+ (task instances, or jobs) which are activated on a periodic or sporadic
+ fashion.
+ Each job J_j (where J_j is the j^th job of the task) is characterised by an
+ arrival time r_j (the time when the job starts), an amount of computation
+ time c_j needed to finish the job, and a job absolute deadline d_j, which
+ is the time within which the job should be finished. The maximum execution
+ time max_j{c_j} is called "Worst Case Execution Time" (WCET) for the task.
+ A real-time task can be periodic with period P if r_{j+1} = r_j + P, or
+ sporadic with minimum inter-arrival time P is r_{j+1} >= r_j + P. Finally,
+ d_j = r_j + D, where D is the task's relative deadline.
+
+ SCHED_DEADLINE can be used to schedule real-time tasks guaranteeing that
+ the jobs' deadlines of a task are respected. In order to do this, a task
+ must be scheduled by setting:
+
+  - runtime >= WCET
+  - deadline = D
+  - period <= P
+
+ IOW, if runtime >= WCET and if period is >= P, then the scheduling deadlines
+ and the absolute deadlines (d_j) coincide, so a proper admission control
+ allows to respect the jobs' absolute deadlines for this task (this is what is
+ called "hard schedulability property" and is an extension of Lemma 1 of [2]).
+
+ References:
+  1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogram-
+      ming in a hard-real-time environment. Journal of the Association for
+      Computing Machinery, 20(1), 1973.
+  2 - L. Abeni , G. Buttazzo. Integrating Multimedia Applications in Hard
+      Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems
+      Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
+  3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
+      Technical Report. http://xoomer.virgilio.it/lucabe72/pubs/tr-98-01.ps
+
+4. Bandwidth management
+=======================
+
+ In order for the -deadline scheduling to be effective and useful, it is
+ important to have some method to keep the allocation of the available CPU
+ bandwidth to the tasks under control.
+ This is usually called "admission control" and if it is not performed at all,
+ no guarantee can be given on the actual scheduling of the -deadline tasks.
+
+ Since when RT-throttling has been introduced each task group has a bandwidth
+ associated, calculated as a certain amount of runtime over a period.
+ Moreover, to make it possible to manipulate such bandwidth, readable/writable
+ controls have been added to both procfs (for system wide settings) and cgroupfs
+ (for per-group settings).
+ Therefore, the same interface is being used for controlling the bandwidth
+ distrubution to -deadline tasks.
+
+ However, more discussion is needed in order to figure out how we want to manage
+ SCHED_DEADLINE bandwidth at the task group level. Therefore, SCHED_DEADLINE
+ uses (for now) a less sophisticated, but actually very sensible, mechanism to
+ ensure that a certain utilization cap is not overcome per each root_domain.
+
+ Another main difference between deadline bandwidth management and RT-throttling
+ is that -deadline tasks have bandwidth on their own (while -rt ones don't!),
+ and thus we don't need an higher level throttling mechanism to enforce the
+ desired bandwidth.
+
+4.1 System wide settings
+------------------------
+
+ The system wide settings are configured under the /proc virtual file system.
+
+ For now the -rt knobs are used for dl admission control and the -deadline
+ runtime is accounted against the -rt runtime. We realise that this isn't
+ entirely desirable; however, it is better to have a small interface for now,
+ and be able to change it easily later. The ideal situation (see 5.) is to run
+ -rt tasks from a -deadline server; in which case the -rt bandwidth is a direct
+ subset of dl_bw.
+
+ This means that, for a root_domain comprising M CPUs, -deadline tasks
+ can be created while the sum of their bandwidths stays below:
+
+   M * (sched_rt_runtime_us / sched_rt_period_us)
+
+ It is also possible to disable this bandwidth management logic, and
+ be thus free of oversubscribing the system up to any arbitrary level.
+ This is done by writing -1 in /proc/sys/kernel/sched_rt_runtime_us.
+
+
+4.2 Task interface
+------------------
+
+ Specifying a periodic/sporadic task that executes for a given amount of
+ runtime at each instance, and that is scheduled according to the urgency of
+ its own timing constraints needs, in general, a way of declaring:
+  - a (maximum/typical) instance execution time,
+  - a minimum interval between consecutive instances,
+  - a time constraint by which each instance must be completed.
+
+ Therefore:
+  * a new struct sched_attr, containing all the necessary fields is
+    provided;
+  * the new scheduling related syscalls that manipulate it, i.e.,
+    sched_setattr() and sched_getattr() are implemented.
+
+
+4.3 Default behavior
+---------------------
+
+ The default value for SCHED_DEADLINE bandwidth is to have rt_runtime equal to
+ 950000. With rt_period equal to 1000000, by default, it means that -deadline
+ tasks can use at most 95%, multiplied by the number of CPUs that compose the
+ root_domain, for each root_domain.
+
+ A -deadline task cannot fork.
+
+5. Tasks CPU affinity
+=====================
+
+ -deadline tasks cannot have an affinity mask smaller that the entire
+ root_domain they are created on. However, affinities can be specified
+ through the cpuset facility (Documentation/cgroups/cpusets.txt).
+
+5.1 SCHED_DEADLINE and cpusets HOWTO
+------------------------------------
+
+ An example of a simple configuration (pin a -deadline task to CPU0)
+ follows (rt-app is used to create a -deadline task).
+
+ mkdir /dev/cpuset
+ mount -t cgroup -o cpuset cpuset /dev/cpuset
+ cd /dev/cpuset
+ mkdir cpu0
+ echo 0 > cpu0/cpuset.cpus
+ echo 0 > cpu0/cpuset.mems
+ echo 1 > cpuset.cpu_exclusive
+ echo 0 > cpuset.sched_load_balance
+ echo 1 > cpu0/cpuset.cpu_exclusive
+ echo 1 > cpu0/cpuset.mem_exclusive
+ echo $$ > cpu0/tasks
+ rt-app -t 100000:10000:d:0 -D5 (it is now actually superfluous to specify
+ task affinity)
+
+6. Future plans
+===============
+
+ Still missing:
+
+  - refinements to deadline inheritance, especially regarding the possibility
+    of retaining bandwidth isolation among non-interacting tasks. This is
+    being studied from both theoretical and practical points of view, and
+    hopefully we should be able to produce some demonstrative code soon;
+  - (c)group based bandwidth management, and maybe scheduling;
+  - access control for non-root users (and related security concerns to
+    address), which is the best way to allow unprivileged use of the mechanisms
+    and how to prevent non-root users "cheat" the system?
+
+ As already discussed, we are planning also to merge this work with the EDF
+ throttling patches [https://lkml.org/lkml/2010/2/23/239] but we still are in
+ the preliminary phases of the merge and we really seek feedback that would
+ help us decide on the direction it should take.
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0de2482..0dd5e09 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -351,7 +351,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,
  * disrupting the schedulability of the system. Otherwise, we should
  * refill the runtime and set the deadline a period in the future,
  * because keeping the current (absolute) deadline of the task would
- * result in breaking guarantees promised to other tasks.
+ * result in breaking guarantees promised to other tasks (refer to
+ * Documentation/scheduler/sched-deadline.txt for more informations).
  *
  * This function returns true if:
  *
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2014-01-28 18:22 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-20 10:40 [PATCH] sched/deadline: Add sched_dl documentation Juri Lelli
2014-01-20 11:24 ` Henrik Austad
2014-01-20 11:46   ` Peter Zijlstra
2014-01-21 14:55     ` Steven Rostedt
2014-01-20 12:15   ` Juri Lelli
2014-01-20 13:16     ` Henrik Austad
2014-01-20 13:39       ` Luca Abeni
2014-01-21 10:20         ` Henrik Austad
2014-01-21 11:35           ` Luca Abeni
2014-01-21 12:11             ` Juri Lelli
2014-01-21 12:33             ` Peter Zijlstra
2014-01-21 12:50               ` Luca Abeni
2014-01-21 13:55                 ` Peter Zijlstra
2014-01-21 14:38                   ` Juri Lelli
2014-01-21 16:28                   ` Steven Rostedt
2014-01-22 13:03                   ` Luca Abeni
2014-01-22 13:51                     ` Peter Zijlstra
2014-01-24 10:08                       ` Tommaso Cucinotta
2014-01-28  9:31                         ` Peter Zijlstra
2014-01-28 18:22                           ` Tommaso Cucinotta
2014-01-21 10:21   ` Henrik Austad
2014-01-20 12:25 ` Luca Abeni
2014-01-27 11:20 Juri Lelli
2014-01-27 11:23 ` Juri Lelli
2014-01-27 11:53 ` Henrik Austad
2014-01-27 12:30   ` Luca Abeni
2014-01-27 12:40     ` Henrik Austad
2014-01-27 12:52       ` Luca Abeni
2014-01-27 15:35 ` Steven Rostedt
2014-01-27 16:56   ` Luca Abeni
2014-01-27 17:09     ` Steven Rostedt
2014-01-27 22:29       ` Luca Abeni
2014-01-28 10:03         ` Juri Lelli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).