linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC][PATCH 00/22] sched: SCHED_DEADLINE v3
@ 2010-10-29  6:18 Raistlin
  2010-10-29  6:25 ` [RFC][PATCH 01/22] sched: add sched_class->task_dead Raistlin
                   ` (21 more replies)
  0 siblings, 22 replies; 135+ messages in thread
From: Raistlin @ 2010-10-29  6:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck, Dario

[-- Attachment #1: Type: text/plain, Size: 6701 bytes --]

Hello everyone,

This is the take 3 for the SCHED_DEADLINE patchset. I've done my best to
have something that can be publicly shown by the Kernel Summit. To be
sincere, I didn't make in getting the code to the point I wanted to but,
hey, I hope this can be at least something from where to start the
discussion. :-)

BTW, the patchset introduces a new deadline based real-time task
scheduling policy --called SCHED_DEADLINE-- with bandwidth isolation
(aka "resource reservation") capabilities. It now supports
global/clustered multiprocessor scheduling through dynamic task
migrations.

The code is being jointly developed by ReTiS Lab (http://retis.sssup.it)
and Evidence S.r.l (http://www.evidence.eu.com) in the context of the
ACTORS EU-funded project (http://www.actors-project.eu).
It is also starting to get some users, both in academic and applied
research, considered I'm getting some feedback from Ericsson, MIT and
from Iowa State, Porto (ISEP), Carnegie Mellon, Barcelona and Trento
universities.

From the previous release[*]:
 - all the comments and the fixes coming from the reviews we got have 
   been considered and applied;
 - global and clustered (e.g., through cpusets) scheduling is now
   available. This means that tasks can migrate among (a subset of) CPUs
   when this is needed, by means of pushes & pulls, like in
   sched_rt.c;
 - (c)group based task admission logic and bandwidth management has 
   been removed, in favour of a per root_domain tasks bandwidth
   accounting mechanism;
 - finally, all the code underwent major restructuring (many parts 
   have been almost completely rewritten), to make it easier to read and
   understand, as well as more consistent with kernel mechanisms and
   conventions;
 
Still missing/incomplete:
 - (c)group based bandwidth management, and maybe scheduling. It seems 
   some more discussion on what precisely we want is *really* needed 
   for this point;
 - better handling of rq selection for dynamic task migration, by means
   of a cpupri equivalent for -deadline tasks. Not that hard to do, we
   already have some ideas and hope to have the code soon;
 - bandwidth inheritance (to replace deadline/priority inheritance).
   What's in the patchset is just very few more than a simple
   placeholder. I tried doing something that may fit in the current
   architecture of rt_mutexes (i.e., the pi-chain of waiters) but did
   not get to anything meaningful. We are now working on migrating to
   something similar to what it's probably known here as proxy
   execution... It's not easy at all, but we are on it. :-)

The official page of the project is:
  http://www.evidence.eu.com/sched_deadline.html

while the development is taking place at:
  http://gitorious.org/sched_deadline/pages/Home
  http://gitorious.org/sched_deadline

Check the repositories frequently if you're interested, and feel free to
e-mail me for any issue you run into.

Patchset is on top of tip/master (as of today). The git-tree and patches
for PREEMPT_RT will be available on the project website in the next
days.

Relationship of this patchset with the EDF-throttling one is tight,
although the two implementations add different (new) features and solve
different problems. In case they have to coexist, a lot of code could be
shared and duplications would be easily avoided.

As usual, any kind of feedback is welcome and appreciated.

Thanks in advice and regards,
Dario

[*] http://lwn.net/Articles/376502, http://lwn.net/Articles/353797

Dario Faggioi, SCHED_DEADLINE (22)

 sched: add sched_class->task_dead.
 sched: add extended scheduling interface.
 sched: SCHED_DEADLINE data structures.
 sched: SCHED_DEADLINE SMP-related data structures.
 sched: SCHED_DEADLINE policy implementation.
 sched: SCHED_DEADLINE handles spacial kthreads.
 sched: SCHED_DEADLINE push and pull logic.
 sched: SCHED_DEADLINE avg_update accounting.
 sched: add period support for -deadline tasks.
 sched: add a syscall to wait for the next instance.
 sched: add schedstats for -deadline tasks.
 sched: add runtime reporting for -deadline tasks.
 sched: add resource limits for -deadline tasks.
 sched: add latency tracing for -deadline tasks.
 sched: add traceporints for -deadline tasks.
 sched: add SMP traceporints for -deadline tasks.
 sched: add signaling overrunning -deadline tasks.
 sched: add reclaiming logic to -deadline tasks.
 rtmutex: turn the plist into an rb-tree.
 sched: drafted deadline inheritance logic.
 sched: add bandwidth management for sched_dl.
 sched: add sched_dl documentation.

 Documentation/scheduler/sched-deadline.txt |  147 +++
 arch/arm/include/asm/unistd.h              |    4 +
 arch/arm/kernel/calls.S                    |    4 +
 arch/x86/ia32/ia32entry.S                  |    4 +
 arch/x86/include/asm/unistd_32.h           |    6 +-
 arch/x86/include/asm/unistd_64.h           |    8 +
 arch/x86/kernel/syscall_table_32.S         |    4 +
 include/asm-generic/resource.h             |    7 +-
 include/linux/init_task.h                  |   10 +
 include/linux/rtmutex.h                    |   13 +-
 include/linux/sched.h                      |  208 ++++-
 include/linux/syscalls.h                   |    9 +
 include/trace/events/sched.h               |  312 +++++-
 kernel/fork.c                              |    4 +-
 kernel/hrtimer.c                           |    2 +-
 kernel/posix-cpu-timers.c                  |   55 +
 kernel/rtmutex-debug.c                     |    8 +-
 kernel/rtmutex.c                           |  146 ++-
 kernel/rtmutex_common.h                    |   22 +-
 kernel/sched.c                             | 1046 ++++++++++++++++--
 kernel/sched_debug.c                       |   46 +
 kernel/sched_dl.c                          | 1713 ++++++++++++++++++++++++++++
 kernel/sched_fair.c                        |    6 +-
 kernel/sched_rt.c                          |    7 +-
 kernel/sched_stoptask.c                    |    2 +-
 kernel/softirq.c                           |    6 +-
 kernel/sysctl.c                            |   14 +
 kernel/trace/trace_sched_wakeup.c          |   44 +-
 kernel/trace/trace_selftest.c              |   31 +-
 kernel/watchdog.c                          |    3 +-
 30 files changed, 3721 insertions(+), 170 deletions(-)

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* [RFC][PATCH 01/22] sched: add sched_class->task_dead.
  2010-10-29  6:18 [RFC][PATCH 00/22] sched: SCHED_DEADLINE v3 Raistlin
@ 2010-10-29  6:25 ` Raistlin
  2010-10-29  6:27 ` [RFC][PATCH 02/22] sched: add extended scheduling interface Raistlin
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 135+ messages in thread
From: Raistlin @ 2010-10-29  6:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 1771 bytes --]


Add a new function to the scheduling class interface. It is called
at the end of a context switch, if the prev task is in TASK_DEAD state.

It might be useful for the scheduling classes that want to be notified
when one of their task dies, e.g. to perform some cleanup actions.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 include/linux/sched.h |    1 +
 kernel/sched.c        |    3 +++
 2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b3d07df..6053b4b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1068,6 +1068,7 @@ struct sched_class {
 	void (*set_curr_task) (struct rq *rq);
 	void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
 	void (*task_fork) (struct task_struct *p);
+	void (*task_dead) (struct task_struct *p);
 
 	void (*switched_from) (struct rq *this_rq, struct task_struct *task,
 			       int running);
diff --git a/kernel/sched.c b/kernel/sched.c
index 41f1869..07f5a0c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2891,6 +2891,9 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	if (mm)
 		mmdrop(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {
+		if (prev->sched_class->task_dead)
+			prev->sched_class->task_dead(prev);
+
 		/*
 		 * Remove function-return probe instances associated with this
 		 * task and put them back on the free list.
-- 
1.7.2.3


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [RFC][PATCH 02/22] sched: add extended scheduling interface
  2010-10-29  6:18 [RFC][PATCH 00/22] sched: SCHED_DEADLINE v3 Raistlin
  2010-10-29  6:25 ` [RFC][PATCH 01/22] sched: add sched_class->task_dead Raistlin
@ 2010-10-29  6:27 ` Raistlin
  2010-11-10 16:00   ` Dhaval Giani
                     ` (3 more replies)
  2010-10-29  6:28 ` [RFC][PATCH 03/22] sched: SCHED_DEADLINE data structures Raistlin
                   ` (19 subsequent siblings)
  21 siblings, 4 replies; 135+ messages in thread
From: Raistlin @ 2010-10-29  6:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 16575 bytes --]


Add the interface bits needed for supporting scheduling algorithms
with extended parameters (e.g., SCHED_DEADLINE).

In general, it makes it possible to specify a periodic/sporadic task,
that executes for a given amount of runtime at each instance, and is
scheduled according to the urgency of their own timing constraints, i.e.:
 - a (maximum/typical) instance execution time,
 - a minimum interval between consecutive instances,
 - a time constraint by which each instance must be completed.

Thus, both the data structure that holds the scheduling parameters of
the tasks and the system calls dealing with it must be extended.
Unfortunately, modifying the existing struct sched_param would break
the ABI and result in potentially serious compatibility issues with
legacy binaries.

For these reasons, this patch:
 - defines the new struct sched_param_ex, containing all the fields
   that are necessary for specifying a task in the computational
   model described above;
 - defines and implements the new scheduling related syscalls that
   manipulate it, i.e., sched_setscheduler_ex(), sched_setparam_ex()
   and sched_getparam_ex().

Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
proof of concept and for developing and testing purposes. Making them
available on other architectures is straightforward.

Since no "user" for these new parameters is introduced in this patch,
the implementation of the new system calls is just identical to their
already existing counterpart. Future patches that implement scheduling
policies able to exploit the new data structure must also take care of
modifying the *_ex() calls accordingly with their own purposes.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 arch/arm/include/asm/unistd.h      |    3 +
 arch/arm/kernel/calls.S            |    3 +
 arch/x86/ia32/ia32entry.S          |    3 +
 arch/x86/include/asm/unistd_32.h   |    5 +-
 arch/x86/include/asm/unistd_64.h   |    6 ++
 arch/x86/kernel/syscall_table_32.S |    3 +
 include/linux/sched.h              |   58 +++++++++++++++
 include/linux/syscalls.h           |    7 ++
 kernel/sched.c                     |  135 +++++++++++++++++++++++++++++++++++-
 9 files changed, 219 insertions(+), 4 deletions(-)

diff --git a/arch/arm/include/asm/unistd.h b/arch/arm/include/asm/unistd.h
index c891eb7..6f18f72 100644
--- a/arch/arm/include/asm/unistd.h
+++ b/arch/arm/include/asm/unistd.h
@@ -396,6 +396,9 @@
 #define __NR_fanotify_init		(__NR_SYSCALL_BASE+367)
 #define __NR_fanotify_mark		(__NR_SYSCALL_BASE+368)
 #define __NR_prlimit64			(__NR_SYSCALL_BASE+369)
+#define __NR_sched_setscheduler_ex	(__NR_SYSCALL_BASE+370)
+#define __NR_sched_setparam_ex		(__NR_SYSCALL_BASE+371)
+#define __NR_sched_getparam_ex		(__NR_SYSCALL_BASE+372)
 
 /*
  * The following SWIs are ARM private.
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index 5c26ecc..c131615 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -379,6 +379,9 @@
 		CALL(sys_fanotify_init)
 		CALL(sys_fanotify_mark)
 		CALL(sys_prlimit64)
+/* 370 */	CALL(sys_sched_setscheduler_ex)
+		CALL(sys_sched_setparam_ex)
+		CALL(sys_sched_getparam_ex)
 #ifndef syscalls_counted
 .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
 #define syscalls_counted
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 518bb99..0c6f451 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -851,4 +851,7 @@ ia32_sys_call_table:
 	.quad sys_fanotify_init
 	.quad sys32_fanotify_mark
 	.quad sys_prlimit64		/* 340 */
+	.quad sys_sched_setscheduler_ex
+	.quad sys_sched_setparam_ex
+	.quad sys_sched_getparam_ex
 ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index b766a5e..437383b 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -346,10 +346,13 @@
 #define __NR_fanotify_init	338
 #define __NR_fanotify_mark	339
 #define __NR_prlimit64		340
+#define __NR_sched_setscheduler_ex	341
+#define __NR_sched_setparam_ex		342
+#define __NR_sched_getparam_ex		343
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 341
+#define NR_syscalls 344
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 363e9b8..fc4618b 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -669,6 +669,12 @@ __SYSCALL(__NR_fanotify_init, sys_fanotify_init)
 __SYSCALL(__NR_fanotify_mark, sys_fanotify_mark)
 #define __NR_prlimit64				302
 __SYSCALL(__NR_prlimit64, sys_prlimit64)
+#define __NR_sched_setscheduler_ex		303
+__SYSCALL(__NR_sched_setscheduler_ex, sys_sched_setscheduler_ex)
+#define __NR_sched_setparam_ex			304
+__SYSCALL(__NR_sched_setparam_ex, sys_sched_setparam_ex)
+#define __NR_sched_getparam_ex			305
+__SYSCALL(__NR_sched_getparam_ex, sys_sched_getparam_ex)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index b35786d..7d4ed62 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -340,3 +340,6 @@ ENTRY(sys_call_table)
 	.long sys_fanotify_init
 	.long sys_fanotify_mark
 	.long sys_prlimit64		/* 340 */
+	.long sys_sched_setscheduler_ex
+	.long sys_sched_setparam_ex
+	.long sys_sched_getparam_ex
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6053b4b..cf20084 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -94,6 +94,61 @@ struct sched_param {
 
 #include <asm/processor.h>
 
+/*
+ * Extended scheduling parameters data structure.
+ *
+ * This is needed because the original struct sched_param can not be
+ * altered without introducing ABI issues with legacy applications
+ * (e.g., in sched_getparam()).
+ *
+ * However, the possibility of specifying more than just a priority for
+ * the tasks may be useful for a wide variety of application fields, e.g.,
+ * multimedia, streaming, automation and control, and many others.
+ *
+ * This variant (sched_param_ex) is meant at describing a so-called
+ * sporadic time-constrained task. In such model a task is specified by:
+ *  - the activation period or minimum instance inter-arrival time;
+ *  - the maximum (or average, depending on the actual scheduling
+ *    discipline) computation time of all instances, a.k.a. runtime;
+ *  - the deadline (relative to the actual activation time) of each
+ *    instance.
+ * Very briefly, a periodic (sporadic) task asks for the execution of
+ * some specific computation --which is typically called an instance--
+ * (at most) every period. Moreover, each instance typically lasts no more
+ * than the runtime and must be completed by time instant t equal to
+ * the instance activation time + the deadline.
+ *
+ * This is reflected by the actual fields of the sched_param_ex structure:
+ *
+ *  @sched_priority     task's priority (might still be useful)
+ *  @sched_deadline     representative of the task's deadline
+ *  @sched_runtime      representative of the task's runtime
+ *  @sched_period       representative of the task's period
+ *  @sched_flags        for customizing the scheduler behaviour
+ *
+ * There are other fields, which may be useful for implementing (in
+ * user-space) advanced scheduling behaviours, e.g., feedback scheduling:
+ *
+ *  @curr_runtime       task's currently available runtime
+ *  @used_runtime       task's totally used runtime
+ *  @curr_deadline      task's current absolute deadline
+ *
+ * Given this task model, there are a multiplicity of scheduling algorithms
+ * and policies, that can be used to ensure all the tasks will make their
+ * timing constraints.
+ */
+struct sched_param_ex {
+	int sched_priority;
+	struct timespec sched_runtime;
+	struct timespec sched_deadline;
+	struct timespec sched_period;
+	unsigned int sched_flags;
+
+	struct timespec curr_runtime;
+	struct timespec used_runtime;
+	struct timespec curr_deadline;
+};
+
 struct exec_domain;
 struct futex_pi_state;
 struct robust_list_head;
@@ -1950,6 +2005,9 @@ extern int sched_setscheduler(struct task_struct *, int,
 			      const struct sched_param *);
 extern int sched_setscheduler_nocheck(struct task_struct *, int,
 				      const struct sched_param *);
+extern int sched_setscheduler_ex(struct task_struct *, int,
+				 const struct sched_param *,
+				 const struct sched_param_ex *);
 extern struct task_struct *idle_task(int cpu);
 extern struct task_struct *curr_task(int cpu);
 extern void set_curr_task(int cpu, struct task_struct *p);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index cacc27a..46b461e 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -38,6 +38,7 @@ struct rlimit;
 struct rlimit64;
 struct rusage;
 struct sched_param;
+struct sched_param_ex;
 struct sel_arg_struct;
 struct semaphore;
 struct sembuf;
@@ -322,11 +323,17 @@ asmlinkage long sys_clock_nanosleep(clockid_t which_clock, int flags,
 asmlinkage long sys_nice(int increment);
 asmlinkage long sys_sched_setscheduler(pid_t pid, int policy,
 					struct sched_param __user *param);
+asmlinkage long sys_sched_setscheduler_ex(pid_t pid, int policy, unsigned len,
+					struct sched_param_ex __user *param);
 asmlinkage long sys_sched_setparam(pid_t pid,
 					struct sched_param __user *param);
+asmlinkage long sys_sched_setparam_ex(pid_t pid, unsigned len,
+					struct sched_param_ex __user *param);
 asmlinkage long sys_sched_getscheduler(pid_t pid);
 asmlinkage long sys_sched_getparam(pid_t pid,
 					struct sched_param __user *param);
+asmlinkage long sys_sched_getparam_ex(pid_t pid, unsigned len,
+					struct sched_param_ex __user *param);
 asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
 					unsigned long __user *user_mask_ptr);
 asmlinkage long sys_sched_getaffinity(pid_t pid, unsigned int len,
diff --git a/kernel/sched.c b/kernel/sched.c
index 07f5a0c..76f1bc6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4704,7 +4704,9 @@ static bool check_same_owner(struct task_struct *p)
 }
 
 static int __sched_setscheduler(struct task_struct *p, int policy,
-				const struct sched_param *param, bool user)
+				const struct sched_param *param,
+				const struct sched_param_ex *param_ex,
+				bool user)
 {
 	int retval, oldprio, oldpolicy = -1, on_rq, running;
 	unsigned long flags;
@@ -4861,10 +4863,18 @@ recheck:
 int sched_setscheduler(struct task_struct *p, int policy,
 		       const struct sched_param *param)
 {
-	return __sched_setscheduler(p, policy, param, true);
+	return __sched_setscheduler(p, policy, param, NULL, true);
 }
 EXPORT_SYMBOL_GPL(sched_setscheduler);
 
+int sched_setscheduler_ex(struct task_struct *p, int policy,
+			  const struct sched_param *param,
+			  const struct sched_param_ex *param_ex)
+{
+	return __sched_setscheduler(p, policy, param, param_ex, true);
+}
+EXPORT_SYMBOL_GPL(sched_setscheduler_ex);
+
 /**
  * sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace.
  * @p: the task in question.
@@ -4879,7 +4889,7 @@ EXPORT_SYMBOL_GPL(sched_setscheduler);
 int sched_setscheduler_nocheck(struct task_struct *p, int policy,
 			       const struct sched_param *param)
 {
-	return __sched_setscheduler(p, policy, param, false);
+	return __sched_setscheduler(p, policy, param, NULL, false);
 }
 
 static int
@@ -4904,6 +4914,56 @@ do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)
 	return retval;
 }
 
+/*
+ * Notice that, to extend sched_param_ex in the future without causing ABI
+ * issues, the user-space is asked to pass to this (and the other *_ex())
+ * function(s) the actual size of the data structure it has been compiled
+ * against.
+ *
+ * What we do is the following:
+ *  - if the user data structure is bigger than our one we fail, since this
+ *    means we wouldn't be able of providing some of the features that
+ *    are expected;
+ *  - if the user data structure is smaller than our one we can continue,
+ *    we just initialize to default all the fielld of the kernel-side
+ *    sched_param_ex and copy from the user the available values. This
+ *    obviously assume that such data structure can only grow and that
+ *    positions and meaning of the existing fields will not be altered.
+ *
+ * The issue can be addressed also adding a "version" field to the data
+ * structure itself (which would also remove the fixed position & meaning
+ * requirement)... Comments about the best way to go are welcome!
+ */
+static int
+do_sched_setscheduler_ex(pid_t pid, int policy, unsigned len,
+			 struct sched_param_ex __user *param_ex)
+{
+	struct sched_param lparam;
+	struct sched_param_ex lparam_ex;
+	struct task_struct *p;
+	int retval;
+
+	if (!param_ex || pid < 0)
+		return -EINVAL;
+	if (len > sizeof(lparam_ex))
+		return -EINVAL;
+
+	memset(&lparam_ex, 0, sizeof(lparam_ex));
+	if (copy_from_user(&lparam_ex, param_ex, len))
+		return -EFAULT;
+
+	rcu_read_lock();
+	retval = -ESRCH;
+	p = find_process_by_pid(pid);
+	if (p != NULL) {
+		lparam.sched_priority = lparam_ex.sched_priority;
+		retval = sched_setscheduler_ex(p, policy, &lparam, &lparam_ex);
+	}
+	rcu_read_unlock();
+
+	return retval;
+}
+
 /**
  * sys_sched_setscheduler - set/change the scheduler policy and RT priority
  * @pid: the pid in question.
@@ -4921,6 +4981,22 @@ SYSCALL_DEFINE3(sched_setscheduler, pid_t, pid, int, policy,
 }
 
 /**
+ * sys_sched_setscheduler_ex - same as above, but with extended sched_param
+ * @pid: the pid in question.
+ * @policy: new policy (could use extended sched_param).
+ * @len: size of data pointed by param_ex.
+ * @param: structure containg the extended parameters.
+ */
+SYSCALL_DEFINE4(sched_setscheduler_ex, pid_t, pid, int, policy,
+		unsigned, len, struct sched_param_ex __user *, param_ex)
+{
+	if (policy < 0)
+		return -EINVAL;
+
+	return do_sched_setscheduler_ex(pid, policy, len, param_ex);
+}
+
+/**
  * sys_sched_setparam - set/change the RT priority of a thread
  * @pid: the pid in question.
  * @param: structure containing the new RT priority.
@@ -4931,6 +5007,18 @@ SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
 }
 
 /**
+ * sys_sched_setparam_ex - same as above, but with extended sched_param
+ * @pid: the pid in question.
+ * @len: size of data pointed by param_ex.
+ * @param_ex: structure containing the extended parameters.
+ */
+SYSCALL_DEFINE3(sched_setparam_ex, pid_t, pid, unsigned, len,
+		struct sched_param_ex __user *, param_ex)
+{
+	return do_sched_setscheduler_ex(pid, -1, len, param_ex);
+}
+
+/**
  * sys_sched_getscheduler - get the policy (scheduling class) of a thread
  * @pid: the pid in question.
  */
@@ -4994,6 +5082,47 @@ out_unlock:
 	return retval;
 }
 
+/**
+ * sys_sched_getparam_ex - same as above, but with extended sched_param
+ * @pid: the pid in question.
+ * @len: size of data pointed by param_ex.
+ * @param_ex: structure containing the extended parameters.
+ */
+SYSCALL_DEFINE3(sched_getparam_ex, pid_t, pid, unsigned, len,
+		struct sched_param_ex __user *, param_ex)
+{
+	struct sched_param_ex lp;
+	struct task_struct *p;
+	int retval;
+
+	if (!param_ex || pid < 0)
+		return -EINVAL;
+	if (len > sizeof(lp))
+		return -EINVAL;
+
+	rcu_read_lock();
+	p = find_process_by_pid(pid);
+	retval = -ESRCH;
+	if (!p)
+		goto out_unlock;
+
+	retval = security_task_getscheduler(p);
+	if (retval)
+		goto out_unlock;
+
+	lp.sched_priority = p->rt_priority;
+	rcu_read_unlock();
+
+	retval = copy_to_user(param_ex, &lp, len) ? -EFAULT : 0;
+
+	return retval;
+
+out_unlock:
+	rcu_read_unlock();
+	return retval;
+
+}
+
 long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 {
 	cpumask_var_t cpus_allowed, new_mask;
-- 
1.7.2.3


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [RFC][PATCH 03/22] sched: SCHED_DEADLINE data structures.
  2010-10-29  6:18 [RFC][PATCH 00/22] sched: SCHED_DEADLINE v3 Raistlin
  2010-10-29  6:25 ` [RFC][PATCH 01/22] sched: add sched_class->task_dead Raistlin
  2010-10-29  6:27 ` [RFC][PATCH 02/22] sched: add extended scheduling interface Raistlin
@ 2010-10-29  6:28 ` Raistlin
  2010-11-10 18:59   ` Peter Zijlstra
  2010-11-10 19:10   ` Peter Zijlstra
  2010-10-29  6:29 ` [RFC][PATCH 04/22] sched: SCHED_DEADLINE SMP-related " Raistlin
                   ` (18 subsequent siblings)
  21 siblings, 2 replies; 135+ messages in thread
From: Raistlin @ 2010-10-29  6:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 11517 bytes --]


Introduce the data structures, constants and symbols needed for
SCHED_DEADLINE implementation.

Core data structure of SCHED_DEADLINE are defined, along with their
initializers. Hooks for checking if a task belong to the new policy
are also added where they are needed.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 include/linux/sched.h |   68 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/hrtimer.c      |    2 +-
 kernel/sched.c        |   78 ++++++++++++++++++++++++++++++++++++++++++-------
 3 files changed, 135 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index cf20084..c72a132 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -38,6 +38,7 @@
 #define SCHED_BATCH		3
 /* SCHED_ISO: reserved but not implemented yet */
 #define SCHED_IDLE		5
+#define SCHED_DEADLINE		6
 /* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
 #define SCHED_RESET_ON_FORK     0x40000000
 
@@ -136,6 +137,10 @@ struct sched_param {
  * Given this task model, there are a multiplicity of scheduling algorithms
  * and policies, that can be used to ensure all the tasks will make their
  * timing constraints.
+ *
+ * As of now, the SCHED_DEADLINE policy (sched_dl scheduling class) is the
+ * only user of this new interface. More information about the algorithm
+ * available in the scheduling class file or in Documentation/.
  */
 struct sched_param_ex {
 	int sched_priority;
@@ -1089,6 +1094,7 @@ struct sched_domain;
 #define ENQUEUE_WAKEUP		1
 #define ENQUEUE_WAKING		2
 #define ENQUEUE_HEAD		4
+#define ENQUEUE_REPLENISH	8
 
 #define DEQUEUE_SLEEP		1
 
@@ -1222,6 +1228,47 @@ struct sched_rt_entity {
 #endif
 };
 
+struct sched_dl_entity {
+	struct rb_node	rb_node;
+	int nr_cpus_allowed;
+
+	/*
+	 * Original scheduling parameters. Copied here from sched_param_ex
+	 * during sched_setscheduler_ex(), they will remain the same until
+	 * the next sched_setscheduler_ex().
+	 */
+	u64 dl_runtime;		/* maximum runtime for each instance 	*/
+	u64 dl_deadline;	/* relative deadline of each instance	*/
+
+	/*
+	 * Actual scheduling parameters. Initialized with the values above,
+	 * they are continously updated during task execution. Note that
+	 * the remaining runtime could be < 0 in case we are in overrun.
+	 */
+	s64 runtime;		/* remaining runtime for this instance	*/
+	u64 deadline;		/* absolute deadline for this instance	*/
+	unsigned int flags;	/* specifying the scheduler behaviour   */
+
+	/*
+	 * Some bool flags:
+	 *
+	 * @dl_throttled tells if we exhausted the runtime. If so, the
+	 * task has to wait for a replenishment to be performed at the
+	 * next firing of dl_timer.
+	 *
+	 * @dl_new tells if a new instance arrived. If so we must
+	 * start executing it with full runtime and reset its absolute
+	 * deadline;
+	 */
+	int dl_throttled, dl_new;
+
+	/*
+	 * Bandwidth enforcement timer. Each -deadline task has its
+	 * own bandwidth to be enforced, thus we need one timer per task.
+	 */
+	struct hrtimer dl_timer;
+};
+
 struct rcu_node;
 
 enum perf_event_task_context {
@@ -1251,6 +1298,7 @@ struct task_struct {
 	const struct sched_class *sched_class;
 	struct sched_entity se;
 	struct sched_rt_entity rt;
+	struct sched_dl_entity dl;
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	/* list of struct preempt_notifier: */
@@ -1580,6 +1628,10 @@ struct task_struct {
  * user-space.  This allows kernel threads to set their
  * priority to a value higher than any user task. Note:
  * MAX_RT_PRIO must not be smaller than MAX_USER_RT_PRIO.
+ *
+ * SCHED_DEADLINE tasks has negative priorities, reflecting
+ * the fact that any of them has higher prio than RT and
+ * NORMAL/BATCH tasks.
  */
 
 #define MAX_USER_RT_PRIO	100
@@ -1588,9 +1640,23 @@ struct task_struct {
 #define MAX_PRIO		(MAX_RT_PRIO + 40)
 #define DEFAULT_PRIO		(MAX_RT_PRIO + 20)
 
+#define MAX_DL_PRIO		0
+
+static inline int dl_prio(int prio)
+{
+	if (unlikely(prio < MAX_DL_PRIO))
+		return 1;
+	return 0;
+}
+
+static inline int dl_task(struct task_struct *p)
+{
+	return dl_prio(p->prio);
+}
+
 static inline int rt_prio(int prio)
 {
-	if (unlikely(prio < MAX_RT_PRIO))
+	if (unlikely(prio >= MAX_DL_PRIO && prio < MAX_RT_PRIO))
 		return 1;
 	return 0;
 }
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 72206cf..9cd8564 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1574,7 +1574,7 @@ long hrtimer_nanosleep(struct timespec *rqtp, struct timespec __user *rmtp,
 	unsigned long slack;
 
 	slack = current->timer_slack_ns;
-	if (rt_task(current))
+	if (dl_task(current) || rt_task(current))
 		slack = 0;
 
 	hrtimer_init_on_stack(&t.timer, clockid, mode);
diff --git a/kernel/sched.c b/kernel/sched.c
index 76f1bc6..d157358 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -128,11 +128,23 @@ static inline int rt_policy(int policy)
 	return 0;
 }
 
+static inline int dl_policy(int policy)
+{
+	if (unlikely(policy == SCHED_DEADLINE))
+		return 1;
+	return 0;
+}
+
 static inline int task_has_rt_policy(struct task_struct *p)
 {
 	return rt_policy(p->policy);
 }
 
+static inline int task_has_dl_policy(struct task_struct *p)
+{
+	return dl_policy(p->policy);
+}
+
 /*
  * This is the priority-queue data structure of the RT scheduling class:
  */
@@ -405,6 +417,15 @@ struct rt_rq {
 #endif
 };
 
+/* Deadline class' related fields in a runqueue */
+struct dl_rq {
+	/* runqueue is an rbtree, ordered by deadline */
+	struct rb_root rb_root;
+	struct rb_node *rb_leftmost;
+
+	unsigned long dl_nr_running;
+};
+
 #ifdef CONFIG_SMP
 
 /*
@@ -469,6 +490,7 @@ struct rq {
 
 	struct cfs_rq cfs;
 	struct rt_rq rt;
+	struct dl_rq dl;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this cpu: */
@@ -1852,8 +1874,6 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 #endif
 }
 
-static const struct sched_class rt_sched_class;
-
 #define sched_class_highest (&stop_sched_class)
 #define for_each_class(class) \
    for (class = sched_class_highest; class; class = class->next)
@@ -2070,7 +2090,9 @@ static inline int normal_prio(struct task_struct *p)
 {
 	int prio;
 
-	if (task_has_rt_policy(p))
+	if (task_has_dl_policy(p))
+		prio = MAX_DL_PRIO-1;
+	else if (task_has_rt_policy(p))
 		prio = MAX_RT_PRIO-1 - p->rt_priority;
 	else
 		prio = __normal_prio(p);
@@ -2634,6 +2656,12 @@ static void __sched_fork(struct task_struct *p)
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
 #endif
 
+	RB_CLEAR_NODE(&p->dl.rb_node);
+	hrtimer_init(&p->dl.dl_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	p->dl.dl_runtime = p->dl.runtime = 0;
+	p->dl.dl_deadline = p->dl.deadline = 0;
+	p->dl.flags = 0;
+
 	INIT_LIST_HEAD(&p->rt.run_list);
 	p->se.on_rq = 0;
 	INIT_LIST_HEAD(&p->se.group_node);
@@ -2662,7 +2690,8 @@ void sched_fork(struct task_struct *p, int clone_flags)
 	 * Revert to default priority/policy on fork if requested.
 	 */
 	if (unlikely(p->sched_reset_on_fork)) {
-		if (p->policy == SCHED_FIFO || p->policy == SCHED_RR) {
+		if (p->policy == SCHED_DEADLINE ||
+		    p->policy == SCHED_FIFO || p->policy == SCHED_RR) {
 			p->policy = SCHED_NORMAL;
 			p->normal_prio = p->static_prio;
 		}
@@ -4464,6 +4493,8 @@ long __sched sleep_on_timeout(wait_queue_head_t *q, long timeout)
 }
 EXPORT_SYMBOL(sleep_on_timeout);
 
+static const struct sched_class dl_sched_class;
+
 #ifdef CONFIG_RT_MUTEXES
 
 /*
@@ -4497,7 +4528,9 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 	if (running)
 		p->sched_class->put_prev_task(rq, p);
 
-	if (rt_prio(prio))
+	if (dl_prio(prio))
+		p->sched_class = &dl_sched_class;
+	else if (rt_prio(prio))
 		p->sched_class = &rt_sched_class;
 	else
 		p->sched_class = &fair_sched_class;
@@ -4533,9 +4566,9 @@ void set_user_nice(struct task_struct *p, long nice)
 	 * The RT priorities are set via sched_setscheduler(), but we still
 	 * allow the 'normal' nice value to be set - but as expected
 	 * it wont have any effect on scheduling until the task is
-	 * SCHED_FIFO/SCHED_RR:
+	 * SCHED_DEADLINE, SCHED_FIFO or SCHED_RR:
 	 */
-	if (task_has_rt_policy(p)) {
+	if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
 		p->static_prio = NICE_TO_PRIO(nice);
 		goto out_unlock;
 	}
@@ -4680,7 +4713,9 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
 	p->normal_prio = normal_prio(p);
 	/* we are holding p->pi_lock already */
 	p->prio = rt_mutex_getprio(p);
-	if (rt_prio(p->prio))
+	if (dl_prio(p->prio))
+		p->sched_class = &dl_sched_class;
+	else if (rt_prio(p->prio))
 		p->sched_class = &rt_sched_class;
 	else
 		p->sched_class = &fair_sched_class;
@@ -4688,6 +4723,19 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
 }
 
 /*
+ * This function validates the new parameters of a -deadline task.
+ * We ask for the deadline not being zero, and greater or equal
+ * than the runtime.
+ */
+static bool
+__checkparam_dl(const struct sched_param_ex *prm)
+{
+	return prm && timespec_to_ns(&prm->sched_deadline) != 0 &&
+	       timespec_compare(&prm->sched_deadline,
+				&prm->sched_runtime) >= 0;
+}
+
+/*
  * check the target process has a UID that matches the current process's
  */
 static bool check_same_owner(struct task_struct *p)
@@ -4725,7 +4773,8 @@ recheck:
 		reset_on_fork = !!(policy & SCHED_RESET_ON_FORK);
 		policy &= ~SCHED_RESET_ON_FORK;
 
-		if (policy != SCHED_FIFO && policy != SCHED_RR &&
+		if (policy != SCHED_DEADLINE &&
+				policy != SCHED_FIFO && policy != SCHED_RR &&
 				policy != SCHED_NORMAL && policy != SCHED_BATCH &&
 				policy != SCHED_IDLE)
 			return -EINVAL;
@@ -4740,7 +4789,8 @@ recheck:
 	    (p->mm && param->sched_priority > MAX_USER_RT_PRIO-1) ||
 	    (!p->mm && param->sched_priority > MAX_RT_PRIO-1))
 		return -EINVAL;
-	if (rt_policy(policy) != (param->sched_priority != 0))
+	if ((dl_policy(policy) && !__checkparam_dl(param_ex)) ||
+	    (rt_policy(policy) != (param->sched_priority != 0)))
 		return -EINVAL;
 
 	/*
@@ -7980,6 +8030,11 @@ static void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
 #endif
 }
 
+static void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
+{
+	dl_rq->rb_root = RB_ROOT;
+}
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 				struct sched_entity *se, int cpu, int add,
@@ -8111,6 +8166,7 @@ void __init sched_init(void)
 		rq->calc_load_update = jiffies + LOAD_FREQ;
 		init_cfs_rq(&rq->cfs, rq);
 		init_rt_rq(&rq->rt, rq);
+		init_dl_rq(&rq->dl, rq);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 		init_task_group.shares = init_task_group_load;
 		INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
@@ -8301,7 +8357,7 @@ void normalize_rt_tasks(void)
 		p->se.statistics.block_start	= 0;
 #endif
 
-		if (!rt_task(p)) {
+		if (!dl_task(p) && !rt_task(p)) {
 			/*
 			 * Renice negative nice level userspace
 			 * tasks back to 0:
-- 
1.7.2.3


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [RFC][PATCH 04/22] sched: SCHED_DEADLINE SMP-related data structures
  2010-10-29  6:18 [RFC][PATCH 00/22] sched: SCHED_DEADLINE v3 Raistlin
                   ` (2 preceding siblings ...)
  2010-10-29  6:28 ` [RFC][PATCH 03/22] sched: SCHED_DEADLINE data structures Raistlin
@ 2010-10-29  6:29 ` Raistlin
  2010-11-10 19:17   ` Peter Zijlstra
  2010-10-29  6:30 ` [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation Raistlin
                   ` (17 subsequent siblings)
  21 siblings, 1 reply; 135+ messages in thread
From: Raistlin @ 2010-10-29  6:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 4600 bytes --]


Introduce data structures relevant for implementing dynamic
migration of -deadline tasks.

Mainly, this is the logic for checking if runqueues are
overloaded with -deadline tasks and for choosing where
a task should migrate, when it is the case.

Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 include/linux/sched.h |    1 +
 kernel/sched.c        |   58 ++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 58 insertions(+), 1 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c72a132..f94da51 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1336,6 +1336,7 @@ struct task_struct {
 
 	struct list_head tasks;
 	struct plist_node pushable_tasks;
+	struct rb_node pushable_dl_tasks;
 
 	struct mm_struct *mm, *active_mm;
 #if defined(SPLIT_RSS_COUNTING)
diff --git a/kernel/sched.c b/kernel/sched.c
index d157358..b11e888 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -424,6 +424,35 @@ struct dl_rq {
 	struct rb_node *rb_leftmost;
 
 	unsigned long dl_nr_running;
+
+#ifdef CONFIG_SMP
+	/*
+	 * Deadline values of the currently executing and the
+	 * earliest ready task on this rq. Caching these facilitates
+	 * the decision wether or not a ready but not running task
+	 * should migrate somewhere else.
+	 */
+	struct {
+		u64 curr;
+		u64 next;
+	} earliest_dl;
+
+	unsigned long dl_nr_migratory;
+	unsigned long dl_nr_total;
+	int overloaded;
+
+	/*
+	 * Tasks on this rq that can be pushed away. They are kept in
+	 * an rb-tree, ordered by tasks' deadlines, with caching
+	 * of the leftmost (earliest deadline) element.
+	 */
+	struct rb_root pushable_dl_tasks_root;
+	struct rb_node *pushable_dl_tasks_leftmost;
+#endif
+
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	struct rq *rq;
+#endif
 };
 
 #ifdef CONFIG_SMP
@@ -442,6 +471,13 @@ struct root_domain {
 	cpumask_var_t online;
 
 	/*
+	 * The bit corresponding to a CPU gets set here if such CPU has more
+	 * than one runnable -deadline task (as it is below for RT tasks).
+	 */
+	cpumask_var_t dlo_mask;
+	atomic_t dlo_count;
+
+	/*
 	 * The "RT overload" flag: it gets set if a CPU has more than
 	 * one runnable RT task.
 	 */
@@ -2742,6 +2778,7 @@ void sched_fork(struct task_struct *p, int clone_flags)
 	/* Want to start with kernel preemption disabled. */
 	task_thread_info(p)->preempt_count = 1;
 #endif
+	RB_CLEAR_NODE(&p->pushable_dl_tasks);
 	plist_node_init(&p->pushable_tasks, MAX_PRIO);
 
 	put_cpu();
@@ -5804,6 +5841,7 @@ again:
 		p->sched_class->set_cpus_allowed(p, new_mask);
 	else {
 		cpumask_copy(&p->cpus_allowed, new_mask);
+		p->dl.nr_cpus_allowed = cpumask_weight(new_mask);
 		p->rt.nr_cpus_allowed = cpumask_weight(new_mask);
 	}
 
@@ -6551,6 +6589,7 @@ static void free_rootdomain(struct root_domain *rd)
 
 	cpupri_cleanup(&rd->cpupri);
 
+	free_cpumask_var(rd->dlo_mask);
 	free_cpumask_var(rd->rto_mask);
 	free_cpumask_var(rd->online);
 	free_cpumask_var(rd->span);
@@ -6602,8 +6641,10 @@ static int init_rootdomain(struct root_domain *rd)
 		goto out;
 	if (!alloc_cpumask_var(&rd->online, GFP_KERNEL))
 		goto free_span;
-	if (!alloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
+	if (!alloc_cpumask_var(&rd->dlo_mask, GFP_KERNEL))
 		goto free_online;
+	if (!alloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
+		goto free_dlo_mask;
 
 	if (cpupri_init(&rd->cpupri) != 0)
 		goto free_rto_mask;
@@ -6611,6 +6652,8 @@ static int init_rootdomain(struct root_domain *rd)
 
 free_rto_mask:
 	free_cpumask_var(rd->rto_mask);
+free_dlo_mask:
+	free_cpumask_var(rd->dlo_mask);
 free_online:
 	free_cpumask_var(rd->online);
 free_span:
@@ -8033,6 +8076,19 @@ static void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
 static void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
 {
 	dl_rq->rb_root = RB_ROOT;
+
+#ifdef CONFIG_SMP
+	/* zero means no -deadline tasks */
+	dl_rq->earliest_dl.curr = dl_rq->earliest_dl.next = 0;
+
+	dl_rq->dl_nr_migratory = 0;
+	dl_rq->overloaded = 0;
+	dl_rq->pushable_dl_tasks_root = RB_ROOT;
+#endif
+
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	dl_rq->rq = rq;
+#endif
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-- 
1.7.2.3


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation
  2010-10-29  6:18 [RFC][PATCH 00/22] sched: SCHED_DEADLINE v3 Raistlin
                   ` (3 preceding siblings ...)
  2010-10-29  6:29 ` [RFC][PATCH 04/22] sched: SCHED_DEADLINE SMP-related " Raistlin
@ 2010-10-29  6:30 ` Raistlin
  2010-11-10 19:21   ` Peter Zijlstra
                     ` (7 more replies)
  2010-10-29  6:31 ` [RFC][PATCH 06/22] sched: SCHED_DEADLINE handles spacial kthreads Raistlin
                   ` (16 subsequent siblings)
  21 siblings, 8 replies; 135+ messages in thread
From: Raistlin @ 2010-10-29  6:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 27335 bytes --]


Add a scheduling class, in sched_dl.c and a new policy called
SCHED_DEADLINE. It is an implementation of the Earliest Deadline
First (EDF) scheduling algorithm, augmented with a mechanism (called
Constant Bandwidth Server, CBS) that makes it possible to isolate
the behaviour of tasks between each other.

The typical -deadline task will be made up of a computation phase
(instance) which is activated on a periodic or sporadic fashion. The
expected (maximum) duration of such computation is called the task's
runtime; the time interval by which each instance need to be completed
is called the task's relative deadline. The task's absolute deadline
is dynamically calculated as the time instant a task (better, an
instance) activates plus the relative deadline.

The EDF algorithms selects the task with the smallest absolute
deadline as the one to be executed first, while the CBS ensures each
task to run for at most the its runtime every (relative) deadline
length time interval, avoiding any interference between different
tasks (bandwidth isolation).
Thanks to this feature, also tasks that do not strictly comply with
the computational model sketched above can effectively use the new
policy.

This patch:
 - implements the core logic of the scheduling algorithm in the new
   scheduling class file;
 - provides all the glue code between the new scheduling class and
   the core scheduler and refines the interactions between sched_dl
   and the other existing scheduling classes.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Michael Trimarchi <trimarchimichael@yahoo.it>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
---
 kernel/sched.c          |   67 +++++-
 kernel/sched_dl.c       |  643 +++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_fair.c     |    2 +-
 kernel/sched_rt.c       |    5 +
 kernel/sched_stoptask.c |    2 +-
 5 files changed, 711 insertions(+), 8 deletions(-)
 create mode 100644 kernel/sched_dl.c

diff --git a/kernel/sched.c b/kernel/sched.c
index b11e888..208fa08 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1928,6 +1928,12 @@ static void dec_nr_running(struct rq *rq)
 
 static void set_load_weight(struct task_struct *p)
 {
+	if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
+		p->se.load.weight = 0;
+		p->se.load.inv_weight = WMULT_CONST;
+		return;
+	}
+
 	/*
 	 * SCHED_IDLE tasks get minimal weight:
 	 */
@@ -2072,6 +2078,7 @@ static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time) { }
 #include "sched_idletask.c"
 #include "sched_fair.c"
 #include "sched_rt.c"
+#include "sched_dl.c"
 #include "sched_stoptask.c"
 #ifdef CONFIG_SCHED_DEBUG
 # include "sched_debug.c"
@@ -2750,7 +2757,11 @@ void sched_fork(struct task_struct *p, int clone_flags)
 	 */
 	p->prio = current->normal_prio;
 
-	if (!rt_prio(p->prio))
+	if (dl_prio(p->prio))
+		p->sched_class = &dl_sched_class;
+	else if (rt_prio(p->prio))
+		p->sched_class = &rt_sched_class;
+	else
 		p->sched_class = &fair_sched_class;
 
 	if (p->sched_class->task_fork)
@@ -4530,8 +4541,6 @@ long __sched sleep_on_timeout(wait_queue_head_t *q, long timeout)
 }
 EXPORT_SYMBOL(sleep_on_timeout);
 
-static const struct sched_class dl_sched_class;
-
 #ifdef CONFIG_RT_MUTEXES
 
 /*
@@ -4760,6 +4769,40 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
 }
 
 /*
+ * This function initializes the sched_dl_entity of a newly becoming
+ * SCHED_DEADLINE task.
+ *
+ * Only the static values are considered here, the actual runtime and the
+ * absolute deadline will be properly calculated when the task is enqueued
+ * for the first time with its new policy.
+ */
+static void
+__setparam_dl(struct task_struct *p, const struct sched_param_ex *param_ex)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+
+	init_dl_task_timer(dl_se);
+	dl_se->dl_runtime = timespec_to_ns(&param_ex->sched_runtime);
+	dl_se->dl_deadline = timespec_to_ns(&param_ex->sched_deadline);
+	dl_se->flags = param_ex->sched_flags;
+	dl_se->dl_throttled = 0;
+	dl_se->dl_new = 1;
+}
+
+static void
+__getparam_dl(struct task_struct *p, struct sched_param_ex *param_ex)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+
+	param_ex->sched_priority = p->rt_priority;
+	param_ex->sched_runtime = ns_to_timespec(dl_se->dl_runtime);
+	param_ex->sched_deadline = ns_to_timespec(dl_se->dl_deadline);
+	param_ex->sched_flags = dl_se->flags;
+	param_ex->curr_runtime = ns_to_timespec(dl_se->runtime);
+	param_ex->curr_deadline = ns_to_timespec(dl_se->deadline);
+}
+
+/*
  * This function validates the new parameters of a -deadline task.
  * We ask for the deadline not being zero, and greater or equal
  * than the runtime.
@@ -4922,7 +4965,11 @@ recheck:
 
 	oldprio = p->prio;
 	prev_class = p->sched_class;
-	__setscheduler(rq, p, policy, param->sched_priority);
+	if (dl_policy(policy)) {
+		__setparam_dl(p, param_ex);
+		__setscheduler(rq, p, policy, param_ex->sched_priority);
+	} else
+		__setscheduler(rq, p, policy, param->sched_priority);
 
 	if (running)
 		p->sched_class->set_curr_task(rq);
@@ -5043,7 +5090,10 @@ do_sched_setscheduler_ex(pid_t pid, int policy, unsigned len,
 	retval = -ESRCH;
 	p = find_process_by_pid(pid);
 	if (p != NULL) {
-		lparam.sched_priority = lparam_ex.sched_priority;
+		if (dl_policy(policy))
+			lparam.sched_priority = 0;
+		else
+			lparam.sched_priority = lparam_ex.sched_priority;
 		retval = sched_setscheduler_ex(p, policy, &lparam, &lparam_ex);
 	}
 	rcu_read_unlock();
@@ -5197,7 +5247,10 @@ SYSCALL_DEFINE3(sched_getparam_ex, pid_t, pid, unsigned, len,
 	if (retval)
 		goto out_unlock;
 
-	lp.sched_priority = p->rt_priority;
+	if (task_has_dl_policy(p))
+		__getparam_dl(p, &lp);
+	else
+		lp.sched_priority = p->rt_priority;
 	rcu_read_unlock();
 
 	retval = copy_to_user(param_ex, &lp, len) ? -EFAULT : 0;
@@ -5523,6 +5576,7 @@ SYSCALL_DEFINE1(sched_get_priority_max, int, policy)
 	case SCHED_RR:
 		ret = MAX_USER_RT_PRIO-1;
 		break;
+	case SCHED_DEADLINE:
 	case SCHED_NORMAL:
 	case SCHED_BATCH:
 	case SCHED_IDLE:
@@ -5548,6 +5602,7 @@ SYSCALL_DEFINE1(sched_get_priority_min, int, policy)
 	case SCHED_RR:
 		ret = 1;
 		break;
+	case SCHED_DEADLINE:
 	case SCHED_NORMAL:
 	case SCHED_BATCH:
 	case SCHED_IDLE:
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
new file mode 100644
index 0000000..9d0443e
--- /dev/null
+++ b/kernel/sched_dl.c
@@ -0,0 +1,643 @@
+/*
+ * Deadline Scheduling Class (SCHED_DEADLINE)
+ *
+ * Earliest Deadline First (EDF) + Constant Bandwidth Server (CBS).
+ *
+ * Tasks that periodically executes their instances for less than their
+ * runtime won't miss any of their deadlines.
+ * Tasks that are not periodic or sporadic or that tries to execute more
+ * than their reserved bandwidth will be slowed down (and may potentially
+ * miss some of their deadlines), and won't affect any other task.
+ *
+ * Copyright (C) 2010 Dario Faggioli <raistlin@linux.it>,
+ *                    Michael Trimarchi <trimarchimichael@yahoo.it>,
+ *                    Fabio Checconi <fabio@gandalf.sssup.it>
+ */
+
+static inline int dl_time_before(u64 a, u64 b)
+{
+	return (s64)(a - b) < 0;
+}
+
+static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
+{
+	return container_of(dl_se, struct task_struct, dl);
+}
+
+static inline struct rq *rq_of_dl_rq(struct dl_rq *dl_rq)
+{
+	return container_of(dl_rq, struct rq, dl);
+}
+
+static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
+{
+	struct task_struct *p = dl_task_of(dl_se);
+	struct rq *rq = task_rq(p);
+
+	return &rq->dl;
+}
+
+static inline int on_dl_rq(struct sched_dl_entity *dl_se)
+{
+	return !RB_EMPTY_NODE(&dl_se->rb_node);
+}
+
+static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
+static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
+static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
+				  int flags);
+
+/*
+ * We are being explicitly informed that a new instance is starting,
+ * and this means that:
+ *  - the absolute deadline of the entity has to be placed at
+ *    current time + relative deadline;
+ *  - the runtime of the entity has to be set to the maximum value.
+ *
+ * The capability of specifying such event is useful whenever a -deadline
+ * entity wants to (try to!) synchronize its behaviour with the scheduler's
+ * one, and to (try to!) reconcile itself with its own scheduling
+ * parameters.
+ */
+static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+
+	WARN_ON(!dl_se->dl_new || dl_se->dl_throttled);
+
+	dl_se->deadline = rq->clock + dl_se->dl_deadline;
+	dl_se->runtime = dl_se->dl_runtime;
+	dl_se->dl_new = 0;
+}
+
+/*
+ * Pure Earliest Deadline First (EDF) scheduling does not deal with the
+ * possibility of a entity lasting more than what it declared, and thus
+ * exhausting its runtime.
+ *
+ * Here we are interested in making runtime overrun possible, but we do
+ * not want a entity which is misbehaving to affect the scheduling of all
+ * other entities.
+ * Therefore, a budgeting strategy called Constant Bandwidth Server (CBS)
+ * is used, in order to confine each entity within its own bandwidth.
+ *
+ * This function deals exactly with that, and ensures that when the runtime
+ * of a entity is replenished, its deadline is also postponed. That ensures
+ * the overrunning entity can't interfere with other entity in the system and
+ * can't make them miss their deadlines. Reasons why this kind of overruns
+ * could happen are, typically, a entity voluntarily trying to overcume its
+ * runtime, or it just underestimated it during sched_setscheduler_ex().
+ */
+static void replenish_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+
+	/*
+	 * We Keep moving the deadline away until we get some
+	 * available runtime for the entity. This ensures correct
+	 * handling of situations where the runtime overrun is
+	 * arbitrary large.
+	 */
+	while (dl_se->runtime <= 0) {
+		dl_se->deadline += dl_se->dl_deadline;
+		dl_se->runtime += dl_se->dl_runtime;
+	}
+
+	/*
+	 * At this point, the deadline really should be "in
+	 * the future" with respect to rq->clock. If it's
+	 * not, we are, for some reason, lagging too much!
+	 * Anyway, after having warn userspace abut that,
+	 * we still try to keep the things running by
+	 * resetting the deadline and the budget of the
+	 * entity.
+	 */
+	if (dl_time_before(dl_se->deadline, rq->clock)) {
+		WARN_ON_ONCE(1);
+		dl_se->deadline = rq->clock + dl_se->dl_deadline;
+		dl_se->runtime = dl_se->dl_runtime;
+	}
+}
+
+/*
+ * Here we check if --at time t-- an entity (which is probably being
+ * [re]activated or, in general, enqueued) can use its remaining runtime
+ * and its current deadline _without_ exceeding the bandwidth it is
+ * assigned (function returns true if it can).
+ *
+ * For this to hold, we must check if:
+ *   runtime / (deadline - t) < dl_runtime / dl_deadline .
+ */
+static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
+{
+	u64 left, right;
+
+	/*
+	 * left and right are the two sides of the equation above,
+	 * after a bit of shuffling to use multiplications instead
+	 * of divisions.
+	 *
+	 * Note that none of the time values involved in the two
+	 * multiplications are absolute: dl_deadline and dl_runtime
+	 * are the relative deadline and the maximum runtime of each
+	 * instance, runtime is the runtime left for the last instance
+	 * and (deadline - t), since t is rq->clock, is the time left
+	 * to the (absolute) deadline. Therefore, overflowing the u64
+	 * type is very unlikely to occur in both cases.
+	 */
+	left = dl_se->dl_deadline * dl_se->runtime;
+	right = (dl_se->deadline - t) * dl_se->dl_runtime;
+
+	return dl_time_before(right, left);
+}
+
+/*
+ * When a -deadline entity is queued back on the runqueue, its runtime and
+ * deadline might need updating.
+ *
+ * The policy here is that we update the deadline of the entity only if:
+ *  - the current deadline is in the past,
+ *  - using the remaining runtime with the current deadline would make
+ *    the entity exceed its bandwidth.
+ */
+static void update_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+
+	/*
+	 * The arrival of a new instance needs special treatment, i.e.,
+	 * the actual scheduling parameters have to be "renewed".
+	 */
+	if (dl_se->dl_new) {
+		setup_new_dl_entity(dl_se);
+		return;
+	}
+
+	if (dl_time_before(dl_se->deadline, rq->clock) ||
+	    dl_entity_overflow(dl_se, rq->clock)) {
+		dl_se->deadline = rq->clock + dl_se->dl_deadline;
+		dl_se->runtime = dl_se->dl_runtime;
+	}
+}
+
+/*
+ * If the entity depleted all its runtime, and if we want it to sleep
+ * while waiting for some new execution time to become available, we
+ * set the bandwidth enforcement timer to the replenishment instant
+ * and try to activate it.
+ *
+ * Notice that it is important for the caller to know if the timer
+ * actually started or not (i.e., the replenishment instant is in
+ * the future or in the past).
+ */
+static int start_dl_timer(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+	ktime_t now, act;
+	ktime_t soft, hard;
+	unsigned long range;
+	s64 delta;
+
+	/*
+	 * We want the timer to fire at the deadline, but considering
+	 * that it is actually coming from rq->clock and not from
+	 * hrtimer's time base reading.
+	 */
+	act = ns_to_ktime(dl_se->deadline);
+	now = hrtimer_cb_get_time(&dl_se->dl_timer);
+	delta = ktime_to_ns(now) - rq->clock;
+	act = ktime_add_ns(act, delta);
+
+	/*
+	 * If the expiry time already passed, e.g., because the value
+	 * chosen as the deadline is too small, don't even try to
+	 * start the timer in the past!
+	 */
+	if (ktime_us_delta(act, now) < 0)
+		return 0;
+
+	hrtimer_set_expires(&dl_se->dl_timer, act);
+
+	soft = hrtimer_get_softexpires(&dl_se->dl_timer);
+	hard = hrtimer_get_expires(&dl_se->dl_timer);
+	range = ktime_to_ns(ktime_sub(hard, soft));
+	__hrtimer_start_range_ns(&dl_se->dl_timer, soft,
+				 range, HRTIMER_MODE_ABS, 0);
+
+	return hrtimer_active(&dl_se->dl_timer);
+}
+
+/*
+ * This is the bandwidth enforcement timer callback. If here, we know
+ * a task is not on its dl_rq, since the fact that the timer was running
+ * means the task is throttled and needs a runtime replenishment.
+ *
+ * However, what we actually do depends on the fact the task is active,
+ * (it is on its rq) or has been removed from there by a call to
+ * dequeue_task_dl(). In the former case we must issue the runtime
+ * replenishment and add the task back to the dl_rq; in the latter, we just
+ * do nothing but clearing dl_throttled, so that runtime and deadline
+ * updating (and the queueing back to dl_rq) will be done by the
+ * next call to enqueue_task_dl().
+ */
+static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
+{
+	unsigned long flags;
+	struct sched_dl_entity *dl_se = container_of(timer,
+						     struct sched_dl_entity,
+						     dl_timer);
+	struct task_struct *p = dl_task_of(dl_se);
+	struct rq *rq = task_rq_lock(p, &flags);
+
+	/*
+	 * We need to take care of a possible races here. In fact, the
+	 * task might have changed its scheduling policy to something
+	 * different from SCHED_DEADLINE (through sched_setscheduler()).
+	 */
+	if (!dl_task(p))
+		goto unlock;
+
+	dl_se->dl_throttled = 0;
+	if (p->se.on_rq) {
+		enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
+		check_preempt_curr_dl(rq, p, 0);
+	}
+unlock:
+	task_rq_unlock(rq, &flags);
+
+	return HRTIMER_NORESTART;
+}
+
+static void init_dl_task_timer(struct sched_dl_entity *dl_se)
+{
+	struct hrtimer *timer = &dl_se->dl_timer;
+
+	if (hrtimer_active(timer)) {
+		hrtimer_try_to_cancel(timer);
+		return;
+	}
+
+	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	timer->function = dl_task_timer;
+}
+
+static
+int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
+{
+	int dmiss = dl_time_before(dl_se->deadline, rq->clock);
+	int rorun = dl_se->runtime <= 0;
+
+	if (!rorun && !dmiss)
+		return 0;
+
+	/*
+	 * If we are beyond our current deadline and we are still
+	 * executing, then we have already used some of the runtime of
+	 * the next instance. Thus, if we do not account that, we are
+	 * stealing bandwidth from the system at each deadline miss!
+	 */
+	if (dmiss) {
+		dl_se->runtime = rorun ? dl_se->runtime : 0;
+		dl_se->runtime -= rq->clock - dl_se->deadline;
+	}
+
+	return 1;
+}
+
+/*
+ * Update the current task's runtime statistics (provided it is still
+ * a -deadline task and has not been removed from the dl_rq).
+ */
+static void update_curr_dl(struct rq *rq)
+{
+	struct task_struct *curr = rq->curr;
+	struct sched_dl_entity *dl_se = &curr->dl;
+	u64 delta_exec;
+
+	if (!dl_task(curr) || !on_dl_rq(dl_se))
+		return;
+
+	delta_exec = rq->clock - curr->se.exec_start;
+	if (unlikely((s64)delta_exec < 0))
+		delta_exec = 0;
+
+	schedstat_set(curr->se.statistics.exec_max,
+		      max(curr->se.statistics.exec_max, delta_exec));
+
+	curr->se.sum_exec_runtime += delta_exec;
+	account_group_exec_runtime(curr, delta_exec);
+
+	curr->se.exec_start = rq->clock;
+	cpuacct_charge(curr, delta_exec);
+
+	dl_se->runtime -= delta_exec;
+	if (dl_runtime_exceeded(rq, dl_se)) {
+		__dequeue_task_dl(rq, curr, 0);
+		if (likely(start_dl_timer(dl_se)))
+			dl_se->dl_throttled = 1;
+		else
+			enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
+
+		resched_task(curr);
+	}
+}
+
+static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rb_node **link = &dl_rq->rb_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct sched_dl_entity *entry;
+	int leftmost = 1;
+
+	BUG_ON(!RB_EMPTY_NODE(&dl_se->rb_node));
+
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct sched_dl_entity, rb_node);
+		if (dl_time_before(dl_se->deadline, entry->deadline))
+			link = &parent->rb_left;
+		else {
+			link = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	if (leftmost)
+		dl_rq->rb_leftmost = &dl_se->rb_node;
+
+	rb_link_node(&dl_se->rb_node, parent, link);
+	rb_insert_color(&dl_se->rb_node, &dl_rq->rb_root);
+
+	dl_rq->dl_nr_running++;
+}
+
+static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+	if (RB_EMPTY_NODE(&dl_se->rb_node))
+		return;
+
+	if (dl_rq->rb_leftmost == &dl_se->rb_node) {
+		struct rb_node *next_node;
+
+		next_node = rb_next(&dl_se->rb_node);
+		dl_rq->rb_leftmost = next_node;
+	}
+
+	rb_erase(&dl_se->rb_node, &dl_rq->rb_root);
+	RB_CLEAR_NODE(&dl_se->rb_node);
+
+	dl_rq->dl_nr_running--;
+}
+
+static void
+enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
+{
+	BUG_ON(on_dl_rq(dl_se));
+
+	/*
+	 * If this is a wakeup or a new instance, the scheduling
+	 * parameters of the task might need updating. Otherwise,
+	 * we want a replenishment of its runtime.
+	 */
+	if (!dl_se->dl_new && flags & ENQUEUE_REPLENISH)
+		replenish_dl_entity(dl_se);
+	else
+		update_dl_entity(dl_se);
+
+	__enqueue_dl_entity(dl_se);
+}
+
+static void dequeue_dl_entity(struct sched_dl_entity *dl_se)
+{
+	__dequeue_dl_entity(dl_se);
+}
+
+static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
+{
+	/*
+	 * If p is throttled, we do nothing. In fact, if it exhausted
+	 * its budget it needs a replenishment and, since it now is on
+	 * its rq, the bandwidth timer callback (which clearly has not
+	 * run yet) will take care of this.
+	 */
+	if (p->dl.dl_throttled)
+		return;
+
+	enqueue_dl_entity(&p->dl, flags);
+}
+
+static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
+{
+	dequeue_dl_entity(&p->dl);
+}
+
+static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
+{
+	update_curr_dl(rq);
+	__dequeue_task_dl(rq, p, flags);
+}
+
+/*
+ * Yield task semantic for -deadline tasks is:
+ *
+ *   get off from the CPU until our next instance, with
+ *   a new runtime.
+ */
+static void yield_task_dl(struct rq *rq)
+{
+	struct task_struct *p = rq->curr;
+
+	/*
+	 * We make the task go to sleep until its current deadline by
+	 * forcing its runtime to zero. This way, update_curr_dl() stops
+	 * it and the bandwidth timer will wake it up and will give it
+	 * new scheduling parameters (thanks to dl_new=1).
+	 */
+	if (p->dl.runtime > 0) {
+		rq->curr->dl.dl_new = 1;
+		p->dl.runtime = 0;
+	}
+	update_curr_dl(rq);
+}
+
+static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
+				  int flags)
+{
+	if (!dl_task(rq->curr) || (dl_task(p) &&
+	    dl_time_before(p->dl.deadline, rq->curr->dl.deadline)))
+		resched_task(rq->curr);
+}
+
+#ifdef CONFIG_SCHED_HRTICK
+static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
+{
+	s64 delta = p->dl.dl_runtime - p->dl.runtime;
+
+	if (delta > 10000)
+		hrtick_start(rq, delta);
+}
+#else
+static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
+{
+}
+#endif
+
+static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
+						   struct dl_rq *dl_rq)
+{
+	struct rb_node *left = dl_rq->rb_leftmost;
+
+	if (!left)
+		return NULL;
+
+	return rb_entry(left, struct sched_dl_entity, rb_node);
+}
+
+struct task_struct *pick_next_task_dl(struct rq *rq)
+{
+	struct sched_dl_entity *dl_se;
+	struct task_struct *p;
+	struct dl_rq *dl_rq;
+
+	dl_rq = &rq->dl;
+
+	if (unlikely(!dl_rq->dl_nr_running))
+		return NULL;
+
+	dl_se = pick_next_dl_entity(rq, dl_rq);
+	BUG_ON(!dl_se);
+
+	p = dl_task_of(dl_se);
+	p->se.exec_start = rq->clock;
+#ifdef CONFIG_SCHED_HRTICK
+	if (hrtick_enabled(rq))
+		start_hrtick_dl(rq, p);
+#endif
+	return p;
+}
+
+static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
+{
+	update_curr_dl(rq);
+	p->se.exec_start = 0;
+}
+
+static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
+{
+	update_curr_dl(rq);
+
+#ifdef CONFIG_SCHED_HRTICK
+	if (hrtick_enabled(rq) && queued && p->dl.runtime > 0)
+		start_hrtick_dl(rq, p);
+#endif
+}
+
+static void task_fork_dl(struct task_struct *p)
+{
+	/*
+	 * The child of a -deadline task will be SCHED_DEADLINE, but
+	 * as a throttled task. This means the parent (or someone else)
+	 * must call sched_setscheduler_ex() on it, or it won't even
+	 * start.
+	 */
+	p->dl.dl_throttled = 1;
+	p->dl.dl_new = 0;
+}
+
+static void task_dead_dl(struct task_struct *p)
+{
+	/*
+	 * We are not holding any lock here, so it is safe to
+	 * wait for the bandwidth timer to be removed.
+	 */
+	hrtimer_cancel(&p->dl.dl_timer);
+}
+
+static void set_curr_task_dl(struct rq *rq)
+{
+	struct task_struct *p = rq->curr;
+
+	p->se.exec_start = rq->clock;
+}
+
+static void switched_from_dl(struct rq *rq, struct task_struct *p,
+			     int running)
+{
+	if (hrtimer_active(&p->dl.dl_timer))
+		hrtimer_try_to_cancel(&p->dl.dl_timer);
+}
+
+static void switched_to_dl(struct rq *rq, struct task_struct *p,
+			   int running)
+{
+	/*
+	 * If p is throttled, don't consider the possibility
+	 * of preempting rq->curr, the check will be done right
+	 * after its runtime will get replenished.
+	 */
+	if (unlikely(p->dl.dl_throttled))
+		return;
+
+	if (!running)
+		check_preempt_curr_dl(rq, p, 0);
+}
+
+static void prio_changed_dl(struct rq *rq, struct task_struct *p,
+			    int oldprio, int running)
+{
+	switched_to_dl(rq, p, running);
+}
+
+#ifdef CONFIG_SMP
+static int
+select_task_rq_dl(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
+{
+	return task_cpu(p);
+}
+
+static void set_cpus_allowed_dl(struct task_struct *p,
+				const struct cpumask *new_mask)
+{
+	int weight = cpumask_weight(new_mask);
+
+	BUG_ON(!dl_task(p));
+
+	cpumask_copy(&p->cpus_allowed, new_mask);
+	p->dl.nr_cpus_allowed = weight;
+}
+#endif
+
+static const struct sched_class dl_sched_class = {
+	.next			= &rt_sched_class,
+	.enqueue_task		= enqueue_task_dl,
+	.dequeue_task		= dequeue_task_dl,
+	.yield_task		= yield_task_dl,
+
+	.check_preempt_curr	= check_preempt_curr_dl,
+
+	.pick_next_task		= pick_next_task_dl,
+	.put_prev_task		= put_prev_task_dl,
+
+#ifdef CONFIG_SMP
+	.select_task_rq		= select_task_rq_dl,
+
+	.set_cpus_allowed       = set_cpus_allowed_dl,
+#endif
+
+	.set_curr_task		= set_curr_task_dl,
+	.task_tick		= task_tick_dl,
+	.task_fork              = task_fork_dl,
+	.task_dead		= task_dead_dl,
+
+	.prio_changed           = prio_changed_dl,
+	.switched_from		= switched_from_dl,
+	.switched_to		= switched_to_dl,
+};
+
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index f4f6a83..54c869c 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1654,7 +1654,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
 	int scale = cfs_rq->nr_running >= sched_nr_latency;
 
-	if (unlikely(rt_prio(p->prio)))
+	if (unlikely(dl_prio(p->prio) || rt_prio(p->prio)))
 		goto preempt;
 
 	if (unlikely(p->sched_class != &fair_sched_class))
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index bea7d79..56c00fa 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1014,6 +1014,11 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
  */
 static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int flags)
 {
+	/*
+	 * Since MAX_DL_PRIO is less than any possible RT
+	 * prio, this also guarantees that all -deadline tasks
+	 * preempt -rt ones.
+	 */
 	if (p->prio < rq->curr->prio) {
 		resched_task(rq->curr);
 		return;
diff --git a/kernel/sched_stoptask.c b/kernel/sched_stoptask.c
index 45bddc0..25624df 100644
--- a/kernel/sched_stoptask.c
+++ b/kernel/sched_stoptask.c
@@ -81,7 +81,7 @@ get_rr_interval_stop(struct rq *rq, struct task_struct *task)
  * Simple, special scheduling class for the per-CPU stop tasks:
  */
 static const struct sched_class stop_sched_class = {
-	.next			= &rt_sched_class,
+	.next			= &dl_sched_class,
 
 	.enqueue_task		= enqueue_task_stop,
 	.dequeue_task		= dequeue_task_stop,
-- 
1.7.2.3


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [RFC][PATCH 06/22] sched: SCHED_DEADLINE handles spacial kthreads
  2010-10-29  6:18 [RFC][PATCH 00/22] sched: SCHED_DEADLINE v3 Raistlin
                   ` (4 preceding siblings ...)
  2010-10-29  6:30 ` [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation Raistlin
@ 2010-10-29  6:31 ` Raistlin
  2010-11-11 14:31   ` Peter Zijlstra
                     ` (2 more replies)
  2010-10-29  6:32 ` [RFC][PATCH 07/22] sched: SCHED_DEADLINE push and pull logic Raistlin
                   ` (15 subsequent siblings)
  21 siblings, 3 replies; 135+ messages in thread
From: Raistlin @ 2010-10-29  6:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 9656 bytes --]


There sometimes is the need of executing a task as if it would
have the maximum possible priority in the entire system, i.e.,
whenever it gets ready it must run! This is for example the case
for some maintainance kernel thread like migration and (sometimes)
watchdog or ksoftirq.

Since SCHED_DEADLINE is now the highest priority scheduling class
these tasks have to be handled therein, but it is not obvious how
to choose a runtime and a deadline that guarantee what explained
above. Therefore, we need a mean of recognizing system tasks inside
the -deadline class and always run them as soon as possible, without
any kind of runtime and bandwidth limitation.

This patch:
 - adds the SF_HEAD flag, which identify a special task that need
   absolute prioritization among any other task;
 - ensures that special tasks always preempt everyone else (and,
   obviously, are not preempted by non special tasks);
 - disables runtime and bandwidth checking for such tasks, hoping
   that the interference they cause is small enough.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 include/linux/sched.h |   13 ++++++++++
 kernel/sched.c        |   59 ++++++++++++++++++++++++++++++++++++++----------
 kernel/sched_dl.c     |   27 ++++++++++++++++++++--
 kernel/softirq.c      |    6 +----
 kernel/watchdog.c     |    3 +-
 5 files changed, 85 insertions(+), 23 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f94da51..f25d3a6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -154,6 +154,18 @@ struct sched_param_ex {
 	struct timespec curr_deadline;
 };
 
+/*
+ * Scheduler flags.
+ *
+ *  @SF_HEAD    tells us that the task has to be considered one of the
+ *              maximum priority tasks in the system. This means it
+ *              always enqueued with maximum priority in the runqueue
+ *              of the highest priority scheduling class. In case it
+ *              it sched_deadline, the task also ignore runtime and
+ *              bandwidth limitations.
+ */
+#define SF_HEAD		1
+
 struct exec_domain;
 struct futex_pi_state;
 struct robust_list_head;
@@ -2072,6 +2084,7 @@ extern int sched_setscheduler(struct task_struct *, int,
 			      const struct sched_param *);
 extern int sched_setscheduler_nocheck(struct task_struct *, int,
 				      const struct sched_param *);
+extern void setscheduler_dl_special(struct task_struct *);
 extern int sched_setscheduler_ex(struct task_struct *, int,
 				 const struct sched_param *,
 				 const struct sched_param_ex *);
diff --git a/kernel/sched.c b/kernel/sched.c
index 208fa08..79e7c1c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2086,19 +2086,13 @@ static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time) { }
 
 void sched_set_stop_task(int cpu, struct task_struct *stop)
 {
-	struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };
 	struct task_struct *old_stop = cpu_rq(cpu)->stop;
 
 	if (stop) {
 		/*
-		 * Make it appear like a SCHED_FIFO task, its something
-		 * userspace knows about and won't get confused about.
-		 *
-		 * Also, it will make PI more or less work without too
-		 * much confusion -- but then, stop work should not
-		 * rely on PI working anyway.
+		 * Make it appear like a SCHED_DEADLINE task.
 		 */
-		sched_setscheduler_nocheck(stop, SCHED_FIFO, &param);
+		setscheduler_dl_special(stop);
 
 		stop->sched_class = &stop_sched_class;
 	}
@@ -2110,7 +2104,7 @@ void sched_set_stop_task(int cpu, struct task_struct *stop)
 		 * Reset it back to a normal scheduling class so that
 		 * it can die in pieces.
 		 */
-		old_stop->sched_class = &rt_sched_class;
+		old_stop->sched_class = &dl_sched_class;
 	}
 }
 
@@ -4808,9 +4802,15 @@ __getparam_dl(struct task_struct *p, struct sched_param_ex *param_ex)
  * than the runtime.
  */
 static bool
-__checkparam_dl(const struct sched_param_ex *prm)
+__checkparam_dl(const struct sched_param_ex *prm, bool kthread)
 {
-	return prm && timespec_to_ns(&prm->sched_deadline) != 0 &&
+	if (!prm)
+		return false;
+
+	if (prm->sched_flags & SF_HEAD)
+		return kthread;
+
+	return timespec_to_ns(&prm->sched_deadline) != 0 &&
 	       timespec_compare(&prm->sched_deadline,
 				&prm->sched_runtime) >= 0;
 }
@@ -4869,7 +4869,7 @@ recheck:
 	    (p->mm && param->sched_priority > MAX_USER_RT_PRIO-1) ||
 	    (!p->mm && param->sched_priority > MAX_RT_PRIO-1))
 		return -EINVAL;
-	if ((dl_policy(policy) && !__checkparam_dl(param_ex)) ||
+	if ((dl_policy(policy) && !__checkparam_dl(param_ex, !p->mm)) ||
 	    (rt_policy(policy) != (param->sched_priority != 0)))
 		return -EINVAL;
 
@@ -5133,6 +5133,39 @@ SYSCALL_DEFINE4(sched_setscheduler_ex, pid_t, pid, int, policy,
 	return do_sched_setscheduler_ex(pid, policy, len, param_ex);
 }
 
+/*
+ * These functions make the task one of the highest priority task in
+ * the system. This means it will always run as soon as it gets ready,
+ * and it won't be preempted by any other task, independently from their
+ * scheduling policy, deadline, priority, etc. (provided they're not
+ * 'special tasks' as well).
+ */
+static void __setscheduler_dl_special(struct rq *rq, struct task_struct *p)
+{
+	p->dl.dl_runtime = 0;
+	p->dl.dl_deadline = 0;
+	p->dl.flags = SF_HEAD;
+	p->dl.dl_new = 1;
+
+	__setscheduler(rq, p, SCHED_DEADLINE, MAX_RT_PRIO-1);
+}
+
+void setscheduler_dl_special(struct task_struct *p)
+{
+	struct sched_param param;
+	struct sched_param_ex param_ex;
+
+	param.sched_priority = 0;
+
+	param_ex.sched_priority = MAX_RT_PRIO-1;
+	param_ex.sched_runtime = ns_to_timespec(0);
+	param_ex.sched_deadline = ns_to_timespec(0);
+	param_ex.sched_flags = SF_HEAD;
+
+	__sched_setscheduler(p, SCHED_DEADLINE, &param, &param_ex, false);
+}
+EXPORT_SYMBOL(setscheduler_dl_special);
+
 /**
  * sys_sched_setparam - set/change the RT priority of a thread
  * @pid: the pid in question.
@@ -6071,7 +6104,7 @@ void sched_idle_next(void)
 	 */
 	raw_spin_lock_irqsave(&rq->lock, flags);
 
-	__setscheduler(rq, p, SCHED_FIFO, MAX_RT_PRIO-1);
+	__setscheduler_dl_special(rq, p);
 
 	activate_task(rq, p, 0);
 
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index 9d0443e..17973aa 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -19,6 +19,21 @@ static inline int dl_time_before(u64 a, u64 b)
 	return (s64)(a - b) < 0;
 }
 
+/*
+ * Tells if entity @a should preempt entity @b.
+ */
+static inline
+int dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
+{
+	/*
+	 * A system task marked with SF_HEAD flag will always
+	 * preempt a non 'special' one.
+	 */
+	return a->flags & SF_HEAD ||
+	       (!(b->flags & SF_HEAD) &&
+		dl_time_before(a->deadline, b->deadline));
+}
+
 static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
 {
 	return container_of(dl_se, struct task_struct, dl);
@@ -291,7 +306,13 @@ int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
 	int dmiss = dl_time_before(dl_se->deadline, rq->clock);
 	int rorun = dl_se->runtime <= 0;
 
-	if (!rorun && !dmiss)
+	/*
+	 * No need for checking if it's time to enforce the
+	 * bandwidth for the tasks that are:
+	 *  - maximum priority (SF_HEAD),
+	 *  - not overrunning nor missing a deadline.
+	 */
+	if (dl_se->flags & SF_HEAD || (!rorun && !dmiss))
 		return 0;
 
 	/*
@@ -359,7 +380,7 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
 	while (*link) {
 		parent = *link;
 		entry = rb_entry(parent, struct sched_dl_entity, rb_node);
-		if (dl_time_before(dl_se->deadline, entry->deadline))
+		if (dl_entity_preempt(dl_se, entry))
 			link = &parent->rb_left;
 		else {
 			link = &parent->rb_right;
@@ -471,7 +492,7 @@ static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
 				  int flags)
 {
 	if (!dl_task(rq->curr) || (dl_task(p) &&
-	    dl_time_before(p->dl.deadline, rq->curr->dl.deadline)))
+	    dl_entity_preempt(&p->dl, &rq->curr->dl)))
 		resched_task(rq->curr);
 }
 
diff --git a/kernel/softirq.c b/kernel/softirq.c
index d4d918a..9c4c967 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -853,13 +853,9 @@ static int __cpuinit cpu_callback(struct notifier_block *nfb,
 			     cpumask_any(cpu_online_mask));
 	case CPU_DEAD:
 	case CPU_DEAD_FROZEN: {
-		static struct sched_param param = {
-			.sched_priority = MAX_RT_PRIO-1
-		};
-
 		p = per_cpu(ksoftirqd, hotcpu);
 		per_cpu(ksoftirqd, hotcpu) = NULL;
-		sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
+		setscheduler_dl_special(p);
 		kthread_stop(p);
 		takeover_tasklets(hotcpu);
 		break;
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 94ca779..2b7f259 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -307,10 +307,9 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
  */
 static int watchdog(void *unused)
 {
-	static struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
 	struct hrtimer *hrtimer = &__raw_get_cpu_var(watchdog_hrtimer);
 
-	sched_setscheduler(current, SCHED_FIFO, &param);
+	setscheduler_dl_special(current);
 
 	/* initialize timestamp */
 	__touch_watchdog();
-- 
1.7.2.3


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [RFC][PATCH 07/22] sched: SCHED_DEADLINE push and pull logic
  2010-10-29  6:18 [RFC][PATCH 00/22] sched: SCHED_DEADLINE v3 Raistlin
                   ` (5 preceding siblings ...)
  2010-10-29  6:31 ` [RFC][PATCH 06/22] sched: SCHED_DEADLINE handles spacial kthreads Raistlin
@ 2010-10-29  6:32 ` Raistlin
  2010-11-12 16:17   ` Peter Zijlstra
  2010-10-29  6:33 ` [RFC][PATCH 08/22] sched: SCHED_DEADLINE avg_update accounting Raistlin
                   ` (14 subsequent siblings)
  21 siblings, 1 reply; 135+ messages in thread
From: Raistlin @ 2010-10-29  6:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 30268 bytes --]


Add dynamic migrations to SCHED_DEADLINE, so that tasks can
be moved among CPUs when necessary. It is also possible to bind a
task to a (set of) CPU(s), thus restricting its capability of
migrating, or forbidding migrations at all.

The very same approach used in sched_rt is utilised:
 - -deadline tasks are kept into CPU-specific runqueues,
 - -deadline tasks are migrated among runqueues to achieve the
   following:
    * on an M-CPU system the M earliest deadline ready tasks
      are always running;
    * affinity/cpusets settings of all the -deadline tasks is
      always respected.

Therefore, this very special form of "load balancing" is done with
an active method, i.e., the scheduler pushes or pulls tasks between
runqueues when they are woken up and/or (de)scheduled.
IOW, every time a preemption occurs, the descheduled task might be sent
to some other CPU (depending on its deadline) to continue executing
(push). On the other hand, every time a CPU becomes idle, it might pull
the second earliest deadline ready task from some other CPU.

To enforce this, a pull operation is always attempted before taking any
scheduling decision (pre_schedule()), as well as a push one after each
scheduling decision (post_schedule()). In addition, when a task arrives
or wakes up, the best CPU where to resume it is selected taking into
account its affinity mask, the system topology, but also its deadline.
E.g., from the scheduling point of view, the best CPU where to wake
up (and also where to push) a task is the one which is running the task
with the latest deadline among the M executing ones.

In order to facilitate these decisions, per-runqueue "caching" of the
deadlines of the currently running and of the first ready task is used.
Queued but not running tasks are also parked in another rb-tree to
speed-up pushes.

Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 kernel/sched_dl.c |  888 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched_rt.c |    2 +-
 2 files changed, 866 insertions(+), 24 deletions(-)

diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index 17973aa..26126a6 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -10,6 +10,7 @@
  * miss some of their deadlines), and won't affect any other task.
  *
  * Copyright (C) 2010 Dario Faggioli <raistlin@linux.it>,
+ *                    Juri Lelli <juri.lelli@gmail.com>,
  *                    Michael Trimarchi <trimarchimichael@yahoo.it>,
  *                    Fabio Checconi <fabio@gandalf.sssup.it>
  */
@@ -52,6 +53,151 @@ static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
 	return &rq->dl;
 }
 
+#ifdef CONFIG_SMP
+
+static inline int dl_overloaded(struct rq *rq)
+{
+	return atomic_read(&rq->rd->dlo_count);
+}
+
+static inline void dl_set_overload(struct rq *rq)
+{
+	if (!rq->online)
+		return;
+
+	cpumask_set_cpu(rq->cpu, rq->rd->dlo_mask);
+	/*
+	 * Must be visible before the overload count is
+	 * set (as in sched_rt.c).
+	 */
+	wmb();
+	atomic_inc(&rq->rd->dlo_count);
+}
+
+static inline void dl_clear_overload(struct rq *rq)
+{
+	if (!rq->online)
+		return;
+
+	atomic_dec(&rq->rd->dlo_count);
+	cpumask_clear_cpu(rq->cpu, rq->rd->dlo_mask);
+}
+
+static void update_dl_migration(struct dl_rq *dl_rq)
+{
+	if (dl_rq->dl_nr_migratory && dl_rq->dl_nr_total > 1) {
+		if (!dl_rq->overloaded) {
+			dl_set_overload(rq_of_dl_rq(dl_rq));
+			dl_rq->overloaded = 1;
+		}
+	} else if (dl_rq->overloaded) {
+		dl_clear_overload(rq_of_dl_rq(dl_rq));
+		dl_rq->overloaded = 0;
+	}
+}
+
+static void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	dl_rq = &rq_of_dl_rq(dl_rq)->dl;
+
+	dl_rq->dl_nr_total++;
+	if (dl_se->nr_cpus_allowed > 1)
+		dl_rq->dl_nr_migratory++;
+
+	update_dl_migration(dl_rq);
+}
+
+static void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	dl_rq = &rq_of_dl_rq(dl_rq)->dl;
+
+	dl_rq->dl_nr_total--;
+	if (dl_se->nr_cpus_allowed > 1)
+		dl_rq->dl_nr_migratory--;
+
+	update_dl_migration(dl_rq);
+}
+
+/*
+ * The list of pushable -deadline task is not a plist, like in
+ * sched_rt.c, it is an rb-tree with tasks ordered by deadline.
+ */
+static void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
+{
+	struct dl_rq *dl_rq = &rq->dl;
+	struct rb_node **link = &dl_rq->pushable_dl_tasks_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct task_struct *entry;
+	int leftmost = 1;
+
+	BUG_ON(!RB_EMPTY_NODE(&p->pushable_dl_tasks));
+
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct task_struct,
+				 pushable_dl_tasks);
+		if (!dl_entity_preempt(&entry->dl, &p->dl))
+			link = &parent->rb_left;
+		else {
+			link = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	if (leftmost)
+		dl_rq->pushable_dl_tasks_leftmost = &p->pushable_dl_tasks;
+
+	rb_link_node(&p->pushable_dl_tasks, parent, link);
+	rb_insert_color(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root);
+}
+
+static void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
+{
+	struct dl_rq *dl_rq = &rq->dl;
+
+	if (RB_EMPTY_NODE(&p->pushable_dl_tasks))
+		return;
+
+	if (dl_rq->pushable_dl_tasks_leftmost == &p->pushable_dl_tasks) {
+		struct rb_node *next_node;
+
+		next_node = rb_next(&p->pushable_dl_tasks);
+		dl_rq->pushable_dl_tasks_leftmost = next_node;
+	}
+
+	rb_erase(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root);
+	RB_CLEAR_NODE(&p->pushable_dl_tasks);
+}
+
+static inline int has_pushable_dl_tasks(struct rq *rq)
+{
+	return !RB_EMPTY_ROOT(&rq->dl.pushable_dl_tasks_root);
+}
+
+#else
+
+static inline
+void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
+{
+}
+
+static inline
+void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
+{
+}
+
+static inline
+void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+}
+
+static inline
+void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+}
+
+#endif /* CONFIG_SMP */
+
 static inline int on_dl_rq(struct sched_dl_entity *dl_se)
 {
 	return !RB_EMPTY_NODE(&dl_se->rb_node);
@@ -61,6 +207,7 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
 static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
 static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
 				  int flags);
+static int push_dl_task(struct rq *rq);
 
 /*
  * We are being explicitly informed that a new instance is starting,
@@ -280,6 +427,13 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 	if (p->se.on_rq) {
 		enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
 		check_preempt_curr_dl(rq, p, 0);
+
+		/*
+		 * Queueing this task back might have overloaded rq,
+		 * check if we need to kick someone away.
+		 */
+		if (rq->dl.overloaded)
+			push_dl_task(rq);
 	}
 unlock:
 	task_rq_unlock(rq, &flags);
@@ -367,6 +521,100 @@ static void update_curr_dl(struct rq *rq)
 	}
 }
 
+#ifdef CONFIG_SMP
+
+static struct task_struct *pick_next_earliest_dl_task(struct rq *rq, int cpu);
+
+static inline int next_deadline(struct rq *rq)
+{
+	struct task_struct *next = pick_next_earliest_dl_task(rq, rq->cpu);
+
+	if (next && dl_prio(next->prio))
+		return next->dl.deadline;
+	else
+		return 0;
+}
+
+static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
+{
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+
+	if (dl_rq->earliest_dl.curr == 0 ||
+	    dl_time_before(deadline, dl_rq->earliest_dl.curr)) {
+		/*
+		 * If the dl_rq had no -deadline tasks, or if the new task
+		 * has shorter deadline than the current one on dl_rq, we
+		 * know that the previous earliest becomes our next earliest,
+		 * as the new task becomes the earliest itself.
+		 */
+		dl_rq->earliest_dl.next = dl_rq->earliest_dl.curr;
+		dl_rq->earliest_dl.curr = deadline;
+	} else if (dl_rq->earliest_dl.next == 0 ||
+		   dl_time_before(deadline, dl_rq->earliest_dl.next)) {
+		/*
+		 * On the other hand, if the new -deadline task has a
+		 * a later deadline than the earliest one on dl_rq, but
+		 * it is earlier than the next (if any), we must
+		 * recompute the next-earliest.
+		 */
+		dl_rq->earliest_dl.next = next_deadline(rq);
+	}
+}
+
+static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
+{
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+
+	/*
+	 * Since we may have removed our earliest (and/or next earliest)
+	 * task we must recompute them.
+	 */
+	if (!dl_rq->dl_nr_running) {
+		dl_rq->earliest_dl.curr = 0;
+		dl_rq->earliest_dl.next = 0;
+	} else {
+		struct rb_node *leftmost = dl_rq->rb_leftmost;
+		struct sched_dl_entity *entry;
+
+		entry = rb_entry(leftmost, struct sched_dl_entity, rb_node);
+		dl_rq->earliest_dl.curr = entry->deadline;
+		dl_rq->earliest_dl.next = next_deadline(rq);
+	}
+}
+
+#else
+
+static inline void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
+static inline void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
+
+#endif /* CONFIG_SMP */
+
+static inline
+void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	int prio = dl_task_of(dl_se)->prio;
+	u64 deadline = dl_se->deadline;
+
+	WARN_ON(!dl_prio(prio));
+	dl_rq->dl_nr_running++;
+
+	inc_dl_deadline(dl_rq, deadline);
+	inc_dl_migration(dl_se, dl_rq);
+}
+
+static inline
+void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	int prio = dl_task_of(dl_se)->prio;
+
+	WARN_ON(!dl_prio(prio));
+	WARN_ON(!dl_rq->dl_nr_running);
+	dl_rq->dl_nr_running--;
+
+	dec_dl_deadline(dl_rq, dl_se->deadline);
+	dec_dl_migration(dl_se, dl_rq);
+}
+
 static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
@@ -394,7 +642,7 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
 	rb_link_node(&dl_se->rb_node, parent, link);
 	rb_insert_color(&dl_se->rb_node, &dl_rq->rb_root);
 
-	dl_rq->dl_nr_running++;
+	inc_dl_tasks(dl_se, dl_rq);
 }
 
 static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
@@ -414,7 +662,7 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
 	rb_erase(&dl_se->rb_node, &dl_rq->rb_root);
 	RB_CLEAR_NODE(&dl_se->rb_node);
 
-	dl_rq->dl_nr_running--;
+	dec_dl_tasks(dl_se, dl_rq);
 }
 
 static void
@@ -452,11 +700,15 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 		return;
 
 	enqueue_dl_entity(&p->dl, flags);
+
+	if (!task_current(rq, p) && p->dl.nr_cpus_allowed > 1)
+		enqueue_pushable_dl_task(rq, p);
 }
 
 static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	dequeue_dl_entity(&p->dl);
+	dequeue_pushable_dl_task(rq, p);
 }
 
 static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
@@ -488,12 +740,80 @@ static void yield_task_dl(struct rq *rq)
 	update_curr_dl(rq);
 }
 
+#ifdef CONFIG_SMP
+static int find_later_rq(struct task_struct *task);
+static int latest_cpu_find(struct cpumask *span,
+			   struct task_struct *task,
+			   struct cpumask *later_mask);
+
+static int
+select_task_rq_dl(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
+{
+	if (sd_flag != SD_BALANCE_WAKE)
+		return smp_processor_id();
+
+	/*
+	 * If we are dealing with a -deadline task, we must
+	 * decide where to wake it up.
+	 * If it has a later deadline and the current task
+	 * on this rq can't move (provided the waking task
+	 * can!) we prefer to send it somewhere else. On the
+	 * other hand, if it has a shorter deadline, we
+	 * try to make it stay here, it might be important.
+	 */
+	if (unlikely(dl_task(rq->curr)) &&
+	    (rq->curr->dl.nr_cpus_allowed < 2 ||
+	     dl_entity_preempt(&rq->curr->dl, &p->dl)) &&
+	    (p->dl.nr_cpus_allowed > 1)) {
+		int cpu = find_later_rq(p);
+
+		return (cpu == -1) ? task_cpu(p) : cpu;
+	}
+
+	return task_cpu(p);
+}
+
+static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
+{
+	/*
+	 * Current can't be migrated, useles to reschedule,
+	 * let's hope p can move out.
+	 */
+	if (rq->curr->dl.nr_cpus_allowed == 1 ||
+	    latest_cpu_find(rq->rd->span, rq->curr, NULL) == -1)
+		return;
+
+	/*
+	 * p is migratable, so let's not schedule it and
+	 * see if it is pushed or pulled somewhere else.
+	 */
+	if (p->dl.nr_cpus_allowed != 1 &&
+	    latest_cpu_find(rq->rd->span, p, NULL) != -1)
+		return;
+
+	resched_task(rq->curr);
+}
+
+#endif /* CONFIG_SMP */
+
 static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
 				  int flags)
 {
 	if (!dl_task(rq->curr) || (dl_task(p) &&
-	    dl_entity_preempt(&p->dl, &rq->curr->dl)))
+	    dl_entity_preempt(&p->dl, &rq->curr->dl))) {
 		resched_task(rq->curr);
+		return;
+	}
+
+#ifdef CONFIG_SMP
+	/*
+	 * In the unlikely case current and p have the same deadline
+	 * let us try to decide what's the best thing to do...
+	 */
+	if ((s64)(p->dl.deadline - rq->curr->dl.deadline) == 0 &&
+	    !need_resched())
+		check_preempt_equal_dl(rq, p);
+#endif /* CONFIG_SMP */
 }
 
 #ifdef CONFIG_SCHED_HRTICK
@@ -537,10 +857,20 @@ struct task_struct *pick_next_task_dl(struct rq *rq)
 
 	p = dl_task_of(dl_se);
 	p->se.exec_start = rq->clock;
+
+	/* Running task will never be pushed. */
+	if (p)
+		dequeue_pushable_dl_task(rq, p);
+
 #ifdef CONFIG_SCHED_HRTICK
 	if (hrtick_enabled(rq))
 		start_hrtick_dl(rq, p);
 #endif
+
+#ifdef CONFIG_SMP
+	rq->post_schedule = has_pushable_dl_tasks(rq);
+#endif /* CONFIG_SMP */
+
 	return p;
 }
 
@@ -548,6 +878,9 @@ static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
 {
 	update_curr_dl(rq);
 	p->se.exec_start = 0;
+
+	if (on_dl_rq(&p->dl) && p->dl.nr_cpus_allowed > 1)
+		enqueue_pushable_dl_task(rq, p);
 }
 
 static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
@@ -586,41 +919,409 @@ static void set_curr_task_dl(struct rq *rq)
 	struct task_struct *p = rq->curr;
 
 	p->se.exec_start = rq->clock;
+
+	/* You can't push away the running task */
+	dequeue_pushable_dl_task(rq, p);
 }
 
-static void switched_from_dl(struct rq *rq, struct task_struct *p,
-			     int running)
+#ifdef CONFIG_SMP
+
+/* Only try algorithms three times */
+#define DL_MAX_TRIES 3
+
+static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu)
 {
-	if (hrtimer_active(&p->dl.dl_timer))
-		hrtimer_try_to_cancel(&p->dl.dl_timer);
+	if (!task_running(rq, p) &&
+	    (cpu < 0 || cpumask_test_cpu(cpu, &p->cpus_allowed)) &&
+	    (p->dl.nr_cpus_allowed > 1))
+		return 1;
+
+	return 0;
 }
 
-static void switched_to_dl(struct rq *rq, struct task_struct *p,
-			   int running)
+/* Returns the second earliest -deadline task, NULL otherwise */
+static struct task_struct *pick_next_earliest_dl_task(struct rq *rq, int cpu)
+{
+	struct rb_node *next_node = rq->dl.rb_leftmost;
+	struct sched_dl_entity *dl_se;
+	struct task_struct *p = NULL;
+
+next_node:
+	next_node = rb_next(next_node);
+	if (next_node) {
+		dl_se = rb_entry(next_node, struct sched_dl_entity, rb_node);
+		p = dl_task_of(dl_se);
+
+		if (pick_dl_task(rq, p, cpu))
+			return p;
+
+		goto next_node;
+	}
+
+	return NULL;
+}
+
+static int latest_cpu_find(struct cpumask *span,
+			   struct task_struct *task,
+			   struct cpumask *later_mask)
+{
+	const struct sched_dl_entity *dl_se = &task->dl;
+	int cpu, found = -1, best = 0;
+	u64 max_dl = 0;
+
+	for_each_cpu(cpu, span) {
+		struct rq *rq = cpu_rq(cpu);
+		struct dl_rq *dl_rq = &rq->dl;
+
+		if (cpumask_test_cpu(cpu, &task->cpus_allowed) &&
+		    (!dl_rq->dl_nr_running || dl_time_before(dl_se->deadline,
+		     dl_rq->earliest_dl.curr))) {
+			if (later_mask)
+				cpumask_set_cpu(cpu, later_mask);
+			if (!best && !dl_rq->dl_nr_running) {
+				best = 1;
+				found = cpu;
+			} else if (!best &&
+				   dl_time_before(max_dl,
+						  dl_rq->earliest_dl.curr)) {
+				max_dl = dl_rq->earliest_dl.curr;
+				found = cpu;
+			}
+		} else if (later_mask)
+			cpumask_clear_cpu(cpu, later_mask);
+	}
+
+	return found;
+}
+
+static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
+
+static int find_later_rq(struct task_struct *task)
 {
+	struct sched_domain *sd;
+	struct cpumask *later_mask = __get_cpu_var(local_cpu_mask_dl);
+	int this_cpu = smp_processor_id();
+	int best_cpu, cpu = task_cpu(task);
+
+	if (task->dl.nr_cpus_allowed == 1)
+		return -1;
+
+	best_cpu = latest_cpu_find(task_rq(task)->rd->span, task, later_mask);
+	if (best_cpu == -1)
+		return -1;
+
 	/*
-	 * If p is throttled, don't consider the possibility
-	 * of preempting rq->curr, the check will be done right
-	 * after its runtime will get replenished.
+	 * If we are here, some target has been found,
+	 * the most suitable of which is cached in best_cpu.
+	 * This is, among the runqueues where the current tasks
+	 * have later deadlines than the task's one, the rq
+	 * with the latest possible one.
+	 *
+	 * Now we check how well this matches with task's
+	 * affinity and system topology.
+	 *
+	 * The last cpu where the task run is our first
+	 * guess, since it is most likely cache-hot there.
 	 */
-	if (unlikely(p->dl.dl_throttled))
-		return;
+	if (cpumask_test_cpu(cpu, later_mask))
+		return cpu;
 
-	if (!running)
-		check_preempt_curr_dl(rq, p, 0);
+	/*
+	 * Check if this_cpu is to be skipped (i.e., it is
+	 * not in the mask) or not.
+	 */
+	if (!cpumask_test_cpu(this_cpu, later_mask))
+		this_cpu = -1;
+
+	for_each_domain(cpu, sd) {
+		if (sd->flags & SD_WAKE_AFFINE) {
+
+			/*
+			 * If possible, preempting this_cpu is
+			 * cheaper than migrating.
+			 */
+			if (this_cpu != -1 &&
+			    cpumask_test_cpu(this_cpu, sched_domain_span(sd)))
+				return this_cpu;
+
+			/*
+			 * Last chance: if best_cpu is valid and is
+			 * in the mask, that becomes our choice.
+			 */
+			if (best_cpu < nr_cpu_ids &&
+			    cpumask_test_cpu(best_cpu, sched_domain_span(sd)))
+				return best_cpu;
+		}
+	}
+
+	/*
+	 * At this point, all our guesses failed, we just return
+	 * 'something', and let the caller sort the things out.
+	 */
+	if (this_cpu != -1)
+		return this_cpu;
+
+	cpu = cpumask_any(later_mask);
+	if (cpu < nr_cpu_ids)
+		return cpu;
+
+	return -1;
 }
 
-static void prio_changed_dl(struct rq *rq, struct task_struct *p,
-			    int oldprio, int running)
+/* Locks the rq it finds */
+static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
 {
-	switched_to_dl(rq, p, running);
+	struct rq *later_rq = NULL;
+	int tries;
+	int cpu;
+
+	for (tries = 0; tries < DL_MAX_TRIES; tries++) {
+		cpu = find_later_rq(task);
+
+		if ((cpu == -1) || (cpu == rq->cpu))
+			break;
+
+		later_rq = cpu_rq(cpu);
+
+		/* Retry if something changed. */
+		if (double_lock_balance(rq, later_rq)) {
+			if (unlikely(task_rq(task) != rq ||
+				     !cpumask_test_cpu(later_rq->cpu,
+						       &task->cpus_allowed) ||
+				     task_running(rq, task) ||
+				     !task->se.on_rq)) {
+				raw_spin_unlock(&later_rq->lock);
+				later_rq = NULL;
+				break;
+			}
+		}
+
+		/*
+		 * If the rq we found has no -deadline task, or
+		 * its earliest one has a later deadline than our
+		 * task, the rq is a good one.
+		 */
+		if (!later_rq->dl.dl_nr_running ||
+		    dl_time_before(task->dl.deadline,
+				   later_rq->dl.earliest_dl.curr))
+			break;
+
+		/* Otherwise we try again. */
+		double_unlock_balance(rq, later_rq);
+		later_rq = NULL;
+	}
+
+	return later_rq;
 }
 
-#ifdef CONFIG_SMP
-static int
-select_task_rq_dl(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
+static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
 {
-	return task_cpu(p);
+	struct task_struct *p;
+
+	if (!has_pushable_dl_tasks(rq))
+		return NULL;
+
+	p = rb_entry(rq->dl.pushable_dl_tasks_leftmost,
+		     struct task_struct, pushable_dl_tasks);
+
+	BUG_ON(rq->cpu != task_cpu(p));
+	BUG_ON(task_current(rq, p));
+	BUG_ON(p->dl.nr_cpus_allowed <= 1);
+
+	BUG_ON(!p->se.on_rq);
+	BUG_ON(!dl_task(p));
+
+	return p;
+}
+
+/*
+ * See if the non running -deadline tasks on this rq
+ * can be sent to some other CPU where they can preempt
+ * and start executing.
+ */
+static int push_dl_task(struct rq *rq)
+{
+	struct task_struct *next_task;
+	struct rq *later_rq;
+
+	if (!rq->dl.overloaded)
+		return 0;
+
+	next_task = pick_next_pushable_dl_task(rq);
+	if (!next_task)
+		return 0;
+
+retry:
+	if (unlikely(next_task == rq->curr)) {
+		WARN_ON(1);
+		return 0;
+	}
+
+	/*
+	 * If next_task preempts rq->curr, and rq->curr
+	 * can move away, it makes sense to just reschedule
+	 * without going further in pushing next_task.
+	 */
+	if (dl_task(rq->curr) &&
+	    dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) &&
+	    rq->curr->dl.nr_cpus_allowed > 1) {
+		resched_task(rq->curr);
+		return 0;
+	}
+
+	/* We might release rq lock */
+	get_task_struct(next_task);
+
+	/* Will lock the rq it'll find */
+	later_rq = find_lock_later_rq(next_task, rq);
+	if (!later_rq) {
+		struct task_struct *task;
+
+		/*
+		 * We must check all this again, since
+		 * find_lock_later_rq releases rq->lock and it is
+		 * then possible that next_task has migrated.
+		 */
+		task = pick_next_pushable_dl_task(rq);
+		if (task_cpu(next_task) == rq->cpu && task == next_task) {
+			/*
+			 * The task is still there. We don't try
+			 * again, some other cpu will pull it when ready.
+			 */
+			dequeue_pushable_dl_task(rq, next_task);
+			goto out;
+		}
+
+		if (!task)
+			/* No more tasks */
+			goto out;
+
+		put_task_struct(next_task);
+		next_task = task;
+		goto retry;
+	}
+
+	deactivate_task(rq, next_task, 0);
+	set_task_cpu(next_task, later_rq->cpu);
+	activate_task(later_rq, next_task, 0);
+
+	resched_task(later_rq->curr);
+
+	double_unlock_balance(rq, later_rq);
+
+out:
+	put_task_struct(next_task);
+
+	return 1;
+}
+
+static void push_dl_tasks(struct rq *rq)
+{
+	/* Terminates as it moves a -deadline task */
+	while (push_dl_task(rq))
+		;
+}
+
+static int pull_dl_task(struct rq *this_rq)
+{
+	int this_cpu = this_rq->cpu, ret = 0, cpu;
+	struct task_struct *p;
+	struct rq *src_rq;
+	u64 dmin = LONG_MAX;
+
+	if (likely(!dl_overloaded(this_rq)))
+		return 0;
+
+	for_each_cpu(cpu, this_rq->rd->dlo_mask) {
+		if (this_cpu == cpu)
+			continue;
+
+		src_rq = cpu_rq(cpu);
+
+		/*
+		 * It looks racy, abd it is! However, as in sched_rt.c,
+		 * we are fine with this.
+		 */
+		if (this_rq->dl.dl_nr_running &&
+		    dl_time_before(this_rq->dl.earliest_dl.curr,
+				   src_rq->dl.earliest_dl.next))
+			continue;
+
+		/* Might drop this_rq->lock */
+		double_lock_balance(this_rq, src_rq);
+
+		/*
+		 * If there are no more pullable tasks on the
+		 * rq, we're done with it.
+		 */
+		if (src_rq->dl.dl_nr_running <= 1)
+			goto skip;
+
+		p = pick_next_earliest_dl_task(src_rq, this_cpu);
+
+		/*
+		 * We found a task to be pulled if:
+		 *  - it preempts our current (if there's one),
+		 *  - it will preempt the last one we pulled (if any).
+		 */
+		if (p && dl_time_before(p->dl.deadline, dmin) &&
+		    (!this_rq->dl.dl_nr_running ||
+		     dl_time_before(p->dl.deadline,
+				    this_rq->dl.earliest_dl.curr))) {
+			WARN_ON(p == src_rq->curr);
+			WARN_ON(!p->se.on_rq);
+
+			/*
+			 * Then we pull iff p has actually an earlier
+			 * deadline than the current task of its runqueue.
+			 */
+			if (dl_time_before(p->dl.deadline,
+					   src_rq->curr->dl.deadline))
+				goto skip;
+
+			ret = 1;
+
+			deactivate_task(src_rq, p, 0);
+			set_task_cpu(p, this_cpu);
+			activate_task(this_rq, p, 0);
+			dmin = p->dl.deadline;
+
+			/* Is there any other task even earlier? */
+		}
+skip:
+		double_unlock_balance(this_rq, src_rq);
+	}
+
+	return ret;
+}
+
+static void pre_schedule_dl(struct rq *rq, struct task_struct *prev)
+{
+	/* Try to pull other tasks here */
+	if (dl_task(prev))
+		pull_dl_task(rq);
+}
+
+static void post_schedule_dl(struct rq *rq)
+{
+	push_dl_tasks(rq);
+}
+
+/*
+ * Since the task is not running and a reschedule is not going to happen
+ * anytime soon on its runqueue, we try pushing it away now.
+ */
+static void task_woken_dl(struct rq *rq, struct task_struct *p)
+{
+	if (!task_running(rq, p) &&
+	    !test_tsk_need_resched(rq->curr) &&
+	    has_pushable_dl_tasks(rq) &&
+	    p->dl.nr_cpus_allowed > 1 &&
+	    dl_task(rq->curr) &&
+	    (rq->curr->dl.nr_cpus_allowed < 2 ||
+	     dl_entity_preempt(&rq->curr->dl, &p->dl))) {
+		push_dl_tasks(rq);
+	}
 }
 
 static void set_cpus_allowed_dl(struct task_struct *p,
@@ -630,10 +1331,146 @@ static void set_cpus_allowed_dl(struct task_struct *p,
 
 	BUG_ON(!dl_task(p));
 
+	/*
+	 * Update only if the task is actually running (i.e.,
+	 * it is on the rq AND it is not throttled).
+	 */
+	if (on_dl_rq(&p->dl) && (weight != p->dl.nr_cpus_allowed)) {
+		struct rq *rq = task_rq(p);
+
+		if (!task_current(rq, p)) {
+			/*
+			 * If the task was on the pushable list,
+			 * make sure it stays there only if the new
+			 * mask allows that.
+			 */
+			if (p->dl.nr_cpus_allowed > 1)
+				dequeue_pushable_dl_task(rq, p);
+
+			if (weight > 1)
+				enqueue_pushable_dl_task(rq, p);
+		}
+
+		if ((p->dl.nr_cpus_allowed <= 1) && (weight > 1)) {
+			rq->dl.dl_nr_migratory++;
+		} else if ((p->dl.nr_cpus_allowed > 1) && (weight <= 1)) {
+			BUG_ON(!rq->dl.dl_nr_migratory);
+			rq->dl.dl_nr_migratory--;
+		}
+
+		update_dl_migration(&rq->dl);
+	}
+
 	cpumask_copy(&p->cpus_allowed, new_mask);
 	p->dl.nr_cpus_allowed = weight;
 }
+
+/* Assumes rq->lock is held */
+static void rq_online_dl(struct rq *rq)
+{
+	if (rq->dl.overloaded)
+		dl_set_overload(rq);
+}
+
+/* Assumes rq->lock is held */
+static void rq_offline_dl(struct rq *rq)
+{
+	if (rq->dl.overloaded)
+		dl_clear_overload(rq);
+}
+
+static inline void init_sched_dl_class(void)
+{
+	unsigned int i;
+
+	for_each_possible_cpu(i)
+		zalloc_cpumask_var_node(&per_cpu(local_cpu_mask_dl, i),
+					GFP_KERNEL, cpu_to_node(i));
+}
+#endif /* CONFIG_SMP */
+
+static void switched_from_dl(struct rq *rq, struct task_struct *p,
+			     int running)
+{
+	if (hrtimer_active(&p->dl.dl_timer) && !dl_policy(p->policy))
+		hrtimer_try_to_cancel(&p->dl.dl_timer);
+
+#ifdef CONFIG_SMP
+	/*
+	 * Since this might be the only -deadline task on the rq,
+	 * this is the right place to try to pull some other one
+	 * from an overloaded cpu, if any.
+	 */
+	if (!rq->dl.dl_nr_running)
+		pull_dl_task(rq);
 #endif
+}
+
+/*
+ * When switching to -deadline, we may overload the rq, then
+ * we try to push someone off, if possible.
+ */
+static void switched_to_dl(struct rq *rq, struct task_struct *p,
+			   int running)
+{
+	int check_resched = 1;
+
+	/*
+	 * If p is throttled, don't consider the possibility
+	 * of preempting rq->curr, the check will be done right
+	 * after its runtime will get replenished.
+	 */
+	if (unlikely(p->dl.dl_throttled))
+		return;
+
+	if (!running) {
+#ifdef CONFIG_SMP
+		if (rq->dl.overloaded && push_dl_task(rq) && rq != task_rq(p))
+			/* Only reschedule if pushing failed */
+			check_resched = 0;
+#endif /* CONFIG_SMP */
+		if (check_resched)
+			check_preempt_curr_dl(rq, p, 0);
+	}
+}
+
+/*
+ * If the scheduling parameters of a -deadline task changed,
+ * a push or pull operation might be needed.
+ */
+static void prio_changed_dl(struct rq *rq, struct task_struct *p,
+			    int oldprio, int running)
+{
+	if (running) {
+#ifdef CONFIG_SMP
+		/*
+		 * This might be too much, but unfortunately
+		 * we don't have the old deadline value, and
+		 * we can't argue if the task is increasing
+		 * or lowering its prio, so...
+		 */
+		if (!rq->dl.overloaded)
+			pull_dl_task(rq);
+
+		/*
+		 * If we now have a earlier deadline task than p,
+		 * then reschedule, provided p is still on this
+		 * runqueue.
+		 */
+		if (dl_time_before(rq->dl.earliest_dl.curr, p->dl.deadline) &&
+		    rq->curr == p)
+			resched_task(p);
+#else
+		/*
+		 * Again, we don't know if p has a earlier
+		 * or later deadline, so let's blindly set a
+		 * (maybe not needed) rescheduling point.
+		 */
+		resched_task(p);
+#endif /* CONFIG_SMP */
+	} else
+		switched_to_dl(rq, p, running);
+}
 
 static const struct sched_class dl_sched_class = {
 	.next			= &rt_sched_class,
@@ -650,6 +1487,11 @@ static const struct sched_class dl_sched_class = {
 	.select_task_rq		= select_task_rq_dl,
 
 	.set_cpus_allowed       = set_cpus_allowed_dl,
+	.rq_online              = rq_online_dl,
+	.rq_offline             = rq_offline_dl,
+	.pre_schedule		= pre_schedule_dl,
+	.post_schedule		= post_schedule_dl,
+	.task_woken		= task_woken_dl,
 #endif
 
 	.set_curr_task		= set_curr_task_dl,
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 56c00fa..9a8422d 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1498,7 +1498,7 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
 	    !test_tsk_need_resched(rq->curr) &&
 	    has_pushable_tasks(rq) &&
 	    p->rt.nr_cpus_allowed > 1 &&
-	    rt_task(rq->curr) &&
+	    (dl_task(rq->curr) || rt_task(rq->curr)) &&
 	    (rq->curr->rt.nr_cpus_allowed < 2 ||
 	     rq->curr->prio < p->prio))
 		push_rt_tasks(rq);
-- 
1.7.2.3


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [RFC][PATCH 08/22] sched: SCHED_DEADLINE avg_update accounting
  2010-10-29  6:18 [RFC][PATCH 00/22] sched: SCHED_DEADLINE v3 Raistlin
                   ` (6 preceding siblings ...)
  2010-10-29  6:32 ` [RFC][PATCH 07/22] sched: SCHED_DEADLINE push and pull logic Raistlin
@ 2010-10-29  6:33 ` Raistlin
  2010-11-11 19:16   ` Peter Zijlstra
  2010-10-29  6:34 ` [RFC][PATCH 09/22] sched: add period support for -deadline tasks Raistlin
                   ` (13 subsequent siblings)
  21 siblings, 1 reply; 135+ messages in thread
From: Raistlin @ 2010-10-29  6:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 2653 bytes --]


Make the core scheduler and load balancer aware of the load
produced by -deadline tasks, by updating the moving average
like for sched_rt.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 kernel/sched.c      |   13 ++++++++++++-
 kernel/sched_dl.c   |    2 ++
 kernel/sched_fair.c |    4 ++--
 3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 79e7c1c..7f0780c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -571,7 +571,7 @@ struct rq {
 
 	unsigned long avg_load_per_task;
 
-	u64 rt_avg;
+	u64 dl_avg, rt_avg;
 	u64 age_stamp;
 	u64 idle_stamp;
 	u64 avg_idle;
@@ -1346,10 +1346,17 @@ static void sched_avg_update(struct rq *rq)
 		 */
 		asm("" : "+rm" (rq->age_stamp));
 		rq->age_stamp += period;
+		rq->dl_avg /= 2;
 		rq->rt_avg /= 2;
 	}
 }
 
+static void sched_dl_avg_update(struct rq *rq, u64 dl_delta)
+{
+	rq->dl_avg += dl_delta;
+	sched_avg_update(rq);
+}
+
 static void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
 {
 	rq->rt_avg += rt_delta;
@@ -1363,6 +1370,10 @@ static void resched_task(struct task_struct *p)
 	set_tsk_need_resched(p);
 }
 
+static void sched_dl_avg_update(struct rq *rq, u64 dl_delta)
+{
+}
+
 static void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
 {
 }
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index 26126a6..1bb4308 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -509,6 +509,8 @@ static void update_curr_dl(struct rq *rq)
 	curr->se.exec_start = rq->clock;
 	cpuacct_charge(curr, delta_exec);
 
+	sched_dl_avg_update(rq, delta_exec);
+
 	dl_se->runtime -= delta_exec;
 	if (dl_runtime_exceeded(rq, dl_se)) {
 		__dequeue_task_dl(rq, curr, 0);
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 54c869c..2afe280 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2276,11 +2276,11 @@ unsigned long scale_rt_power(int cpu)
 
 	total = sched_avg_period() + (rq->clock - rq->age_stamp);
 
-	if (unlikely(total < rq->rt_avg)) {
+	if (unlikely(total < rq->dl_avg + rq->rt_avg)) {
 		/* Ensures that power won't end up being negative */
 		available = 0;
 	} else {
-		available = total - rq->rt_avg;
+		available = total - rq->dl_avg - rq->rt_avg;
 	}
 
 	if (unlikely((s64)total < SCHED_LOAD_SCALE))
-- 
1.7.2.3


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [RFC][PATCH 09/22] sched: add period support for -deadline tasks
  2010-10-29  6:18 [RFC][PATCH 00/22] sched: SCHED_DEADLINE v3 Raistlin
                   ` (7 preceding siblings ...)
  2010-10-29  6:33 ` [RFC][PATCH 08/22] sched: SCHED_DEADLINE avg_update accounting Raistlin
@ 2010-10-29  6:34 ` Raistlin
  2010-11-11 19:17   ` Peter Zijlstra
  2010-10-29  6:35 ` [RFC][PATCH 10/22] sched: add a syscall to wait for the next instance Raistlin
                   ` (12 subsequent siblings)
  21 siblings, 1 reply; 135+ messages in thread
From: Raistlin @ 2010-10-29  6:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 4434 bytes --]


Make it possible to specify a period (different or equal than
deadline) for -deadline tasks.

Signed-off-by: Harald Gustafsson <harald.gustafsson@ericsson.com>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 include/linux/sched.h |    1 +
 kernel/sched.c        |   12 +++++++++++-
 kernel/sched_dl.c     |    8 ++++++--
 3 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f25d3a6..83fa2b5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1251,6 +1251,7 @@ struct sched_dl_entity {
 	 */
 	u64 dl_runtime;		/* maximum runtime for each instance 	*/
 	u64 dl_deadline;	/* relative deadline of each instance	*/
+	u64 dl_period;		/* separation of two instances (period) */
 
 	/*
 	 * Actual scheduling parameters. Initialized with the values above,
diff --git a/kernel/sched.c b/kernel/sched.c
index 7f0780c..4491f7d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2708,6 +2708,7 @@ static void __sched_fork(struct task_struct *p)
 	hrtimer_init(&p->dl.dl_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	p->dl.dl_runtime = p->dl.runtime = 0;
 	p->dl.dl_deadline = p->dl.deadline = 0;
+	p->dl.dl_period = 0;
 	p->dl.flags = 0;
 
 	INIT_LIST_HEAD(&p->rt.run_list);
@@ -4789,6 +4790,10 @@ __setparam_dl(struct task_struct *p, const struct sched_param_ex *param_ex)
 	init_dl_task_timer(dl_se);
 	dl_se->dl_runtime = timespec_to_ns(&param_ex->sched_runtime);
 	dl_se->dl_deadline = timespec_to_ns(&param_ex->sched_deadline);
+	if (timespec_to_ns(&param_ex->sched_period) != 0)
+		dl_se->dl_period = timespec_to_ns(&param_ex->sched_period);
+	else
+		dl_se->dl_period = dl_se->dl_deadline;
 	dl_se->flags = param_ex->sched_flags;
 	dl_se->dl_throttled = 0;
 	dl_se->dl_new = 1;
@@ -4802,6 +4807,7 @@ __getparam_dl(struct task_struct *p, struct sched_param_ex *param_ex)
 	param_ex->sched_priority = p->rt_priority;
 	param_ex->sched_runtime = ns_to_timespec(dl_se->dl_runtime);
 	param_ex->sched_deadline = ns_to_timespec(dl_se->dl_deadline);
+	param_ex->sched_period = ns_to_timespec(dl_se->dl_period);
 	param_ex->sched_flags = dl_se->flags;
 	param_ex->curr_runtime = ns_to_timespec(dl_se->runtime);
 	param_ex->curr_deadline = ns_to_timespec(dl_se->deadline);
@@ -4810,7 +4816,8 @@ __getparam_dl(struct task_struct *p, struct sched_param_ex *param_ex)
 /*
  * This function validates the new parameters of a -deadline task.
  * We ask for the deadline not being zero, and greater or equal
- * than the runtime.
+ * than the runtime, as well as the period of being zero or
+ * not greater than deadline.
  */
 static bool
 __checkparam_dl(const struct sched_param_ex *prm, bool kthread)
@@ -4822,6 +4829,9 @@ __checkparam_dl(const struct sched_param_ex *prm, bool kthread)
 		return kthread;
 
 	return timespec_to_ns(&prm->sched_deadline) != 0 &&
+	       (timespec_to_ns(&prm->sched_period) == 0 ||
+		timespec_compare(&prm->sched_period,
+				 &prm->sched_deadline) >= 0) &&
 	       timespec_compare(&prm->sched_deadline,
 				&prm->sched_runtime) >= 0;
 }
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index 1bb4308..31fb771 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -263,7 +263,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 	 * arbitrary large.
 	 */
 	while (dl_se->runtime <= 0) {
-		dl_se->deadline += dl_se->dl_deadline;
+		dl_se->deadline += dl_se->dl_period;
 		dl_se->runtime += dl_se->dl_runtime;
 	}
 
@@ -290,7 +290,11 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
  * assigned (function returns true if it can).
  *
  * For this to hold, we must check if:
- *   runtime / (deadline - t) < dl_runtime / dl_deadline .
+ *   runtime / (deadline - t) < dl_runtime / dl_period .
+ *
+ * Notice that the bandwidth check is done against the period. For
+ * task with deadline equal to period this is the same of using
+ * dl_deadline instead of dl_period in the equation above.
  */
 static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
 {
-- 
1.7.2.3


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [RFC][PATCH 10/22] sched: add a syscall to wait for the next instance
  2010-10-29  6:18 [RFC][PATCH 00/22] sched: SCHED_DEADLINE v3 Raistlin
                   ` (8 preceding siblings ...)
  2010-10-29  6:34 ` [RFC][PATCH 09/22] sched: add period support for -deadline tasks Raistlin
@ 2010-10-29  6:35 ` Raistlin
  2010-11-11 19:21   ` Peter Zijlstra
  2010-10-29  6:35 ` [RFC][PATCH 11/22] sched: add schedstats for -deadline tasks Raistlin
                   ` (11 subsequent siblings)
  21 siblings, 1 reply; 135+ messages in thread
From: Raistlin @ 2010-10-29  6:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 10338 bytes --]


Introduce sched_wait_interval() syscall (and scheduling class
interface call). In general, this aims at providing each scheduling
class with a mean of making one of its own task sleep for some time
according to some specific rule of the scheduling class itself.

As of now, the sched_dl scheduling class is the only one that needs
this kind of service, and thus the only one that implements the
class-specific logic. For other classes, calling it will result in
the same effect than calling clock_nanosleep with CLOCK_MONOTONIC
clockid and the TIMER_ABSTIME flag on.

For -deadline task, the idea is to give them the possibility of
notifying the scheduler a periodic/sporadic instance just ended and
ask it to wake up them at the beginning of the next one, with:
 - fully replenished runtime and
 - the absolute deadline set just one relative deadline interval
   away from the wakeup time.
This is an effective mean of synchronizing the task's behaviour with
the scheduler one, which might be useful in some situations.

This patch:
 - adds the new syscall (x83-32, x86-64 and ARM, but extension to all
   archs is strightforward);
 - implements the class-specific logic for -deadline tasks, making it
   impossible for them to exploit this call to use more bandwidth than
   they are given.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 arch/arm/include/asm/unistd.h      |    1 +
 arch/arm/kernel/calls.S            |    1 +
 arch/x86/ia32/ia32entry.S          |    1 +
 arch/x86/include/asm/unistd_32.h   |    3 +-
 arch/x86/include/asm/unistd_64.h   |    2 +
 arch/x86/kernel/syscall_table_32.S |    1 +
 include/linux/sched.h              |    2 +
 include/linux/syscalls.h           |    2 +
 kernel/sched.c                     |   39 ++++++++++++++++++++++
 kernel/sched_dl.c                  |   63 ++++++++++++++++++++++++++++++++++++
 10 files changed, 114 insertions(+), 1 deletions(-)

diff --git a/arch/arm/include/asm/unistd.h b/arch/arm/include/asm/unistd.h
index 6f18f72..56513bb 100644
--- a/arch/arm/include/asm/unistd.h
+++ b/arch/arm/include/asm/unistd.h
@@ -399,6 +399,7 @@
 #define __NR_sched_setscheduler_ex	(__NR_SYSCALL_BASE+370)
 #define __NR_sched_setparam_ex		(__NR_SYSCALL_BASE+371)
 #define __NR_sched_getparam_ex		(__NR_SYSCALL_BASE+372)
+#define __NR_sched_wait_interval	(__NR_SYSCALL_BASE+373)
 
 /*
  * The following SWIs are ARM private.
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index c131615..c18e1e4 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -382,6 +382,7 @@
 /* 370 */	CALL(sys_sched_setscheduler_ex)
 		CALL(sys_sched_setparam_ex)
 		CALL(sys_sched_getparam_ex)
+		CALL(sys_sched_wait_interval)
 #ifndef syscalls_counted
 .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
 #define syscalls_counted
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 0c6f451..32821df 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -854,4 +854,5 @@ ia32_sys_call_table:
 	.quad sys_sched_setscheduler_ex
 	.quad sys_sched_setparam_ex
 	.quad sys_sched_getparam_ex
+	.quad sys_sched_wait_interval
 ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 437383b..684bf79 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -349,10 +349,11 @@
 #define __NR_sched_setscheduler_ex	341
 #define __NR_sched_setparam_ex		342
 #define __NR_sched_getparam_ex		343
+#define __NR_sched_wait_interval	344
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 344
+#define NR_syscalls 345
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index fc4618b..932b094 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -675,6 +675,8 @@ __SYSCALL(__NR_sched_setscheduler_ex, sys_sched_setscheduler_ex)
 __SYSCALL(__NR_sched_setparam_ex, sys_sched_setparam_ex)
 #define __NR_sched_getparam_ex			305
 __SYSCALL(__NR_sched_getparam_ex, sys_sched_getparam_ex)
+#define __NR_sched_wait_interval		306
+__SYSCALL(__NR_sched_wait_interval, sys_sched_wait_interval)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 7d4ed62..c77e82c 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -343,3 +343,4 @@ ENTRY(sys_call_table)
 	.long sys_sched_setscheduler_ex
 	.long sys_sched_setparam_ex
 	.long sys_sched_getparam_ex
+	.long sys_sched_wait_interval
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 83fa2b5..e301eea 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1116,6 +1116,8 @@ struct sched_class {
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*yield_task) (struct rq *rq);
+	long (*wait_interval) (struct task_struct *p, struct timespec *rqtp,
+			       struct timespec __user *rmtp);
 
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
 
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 46b461e..b6e04db 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -339,6 +339,8 @@ asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
 asmlinkage long sys_sched_getaffinity(pid_t pid, unsigned int len,
 					unsigned long __user *user_mask_ptr);
 asmlinkage long sys_sched_yield(void);
+asmlinkage long sys_sched_wait_interval(const struct timespec __user *rqtp,
+					struct timespec *rmtp);
 asmlinkage long sys_sched_get_priority_max(int policy);
 asmlinkage long sys_sched_get_priority_min(int policy);
 asmlinkage long sys_sched_rr_get_interval(pid_t pid,
diff --git a/kernel/sched.c b/kernel/sched.c
index 4491f7d..619d091 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5504,6 +5504,45 @@ SYSCALL_DEFINE0(sched_yield)
 	return 0;
 }
 
+/**
+ * sys_sched_wait_interval - sleep according to the scheduling class rules.
+ *
+ * This function is implemented inside each scheduling class, in case it
+ * wants to provide its tasks a mean of waiting a specific instant in
+ * time, while also honouring some specific rule of itself.
+ */
+SYSCALL_DEFINE2(sched_wait_interval,
+	const struct timespec __user *, rqtp,
+	struct timespec __user *, rmtp)
+{
+	struct timespec lrq, lrm;
+	int ret;
+
+	if (rqtp != NULL) {
+		if (copy_from_user(&lrq, rqtp, sizeof(struct timespec)))
+			return -EFAULT;
+		if (!timespec_valid(&lrq))
+			return -EINVAL;
+	}
+
+	if (current->sched_class->wait_interval)
+		ret = current->sched_class->wait_interval(current,
+							  rqtp ? &lrq : NULL,
+							  &lrm);
+	else {
+		if (!rqtp)
+			return -EINVAL;
+
+		ret = hrtimer_nanosleep(&lrq, &lrm, HRTIMER_MODE_ABS,
+					CLOCK_MONOTONIC);
+	}
+
+	if (rmtp && copy_to_user(rmtp, &lrm, sizeof(struct timespec)))
+		return -EFAULT;
+
+	return ret;
+}
+
 static inline int should_resched(void)
 {
 	return need_resched() && !(preempt_count() & PREEMPT_ACTIVE);
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index 31fb771..c8eb304 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -724,6 +724,68 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 }
 
 /*
+ * This function makes the task sleep until at least the absolute time
+ * instant specified in @rqtp.
+ * In fact, since we want to wake up the task with its full runtime,
+ * @rqtp might be too early (or the task might already have overrun
+ * its runtime when calling this), the sleeping time may  be longer
+ * than asked.
+ *
+ * This is intended to be used at the end of a periodic -deadline task
+ * instance, or any time a task want to be sure it'll wake up with
+ * its full runtime.
+ */
+static long wait_interval_dl(struct task_struct *p, struct timespec *rqtp,
+			     struct timespec *rmtp)
+{
+	unsigned long flags;
+	struct sched_dl_entity *dl_se = &p->dl;
+	struct rq *rq = task_rq_lock(p, &flags);
+	struct timespec lrqtp;
+	u64 wakeup;
+
+	/*
+	 * If no wakeup time is provided, sleep at least up to the
+	 * next activation period. This guarantee the budget will
+	 * be renewed.
+	 */
+	if (!rqtp) {
+		wakeup = dl_se->deadline +
+			 dl_se->dl_period - dl_se->dl_deadline;
+		goto unlock;
+	}
+
+	/*
+	 * If the tasks wants to wake up _before_ its absolute deadline
+	 * we must be sure that reusing its (actual) runtime and deadline
+	 * at that time _would_ overcome its bandwidth limitation, so
+	 * that we know it will be given new parameters.
+	 *
+	 * If this is not true, we postpone the wake-up time up to the right
+	 * instant. This involves a division (to calculate the reverse of the
+	 * task's bandwidth), but it is worth to notice that it is quite
+	 * unlikely that we get into here very often.
+	 */
+	wakeup = timespec_to_ns(rqtp);
+	if (dl_time_before(wakeup, dl_se->deadline) &&
+	    !dl_entity_overflow(dl_se, wakeup)) {
+		u64 ibw = (u64)dl_se->runtime * dl_se->dl_period;
+
+		ibw = div_u64(ibw, dl_se->dl_runtime);
+		wakeup = dl_se->deadline - ibw;
+	}
+
+unlock:
+	task_rq_unlock(rq, &flags);
+
+	lrqtp = ns_to_timespec(wakeup);
+	dl_se->dl_new = 1;
+
+	return hrtimer_nanosleep(&lrqtp, rmtp, HRTIMER_MODE_ABS,
+				 CLOCK_MONOTONIC);
+}
+
+/*
  * Yield task semantic for -deadline tasks is:
  *
  *   get off from the CPU until our next instance, with
@@ -1483,6 +1545,7 @@ static const struct sched_class dl_sched_class = {
 	.enqueue_task		= enqueue_task_dl,
 	.dequeue_task		= dequeue_task_dl,
 	.yield_task		= yield_task_dl,
+	.wait_interval		= wait_interval_dl,
 
 	.check_preempt_curr	= check_preempt_curr_dl,
 
-- 
1.7.2.3


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [RFC][PATCH 11/22] sched: add schedstats for -deadline tasks
  2010-10-29  6:18 [RFC][PATCH 00/22] sched: SCHED_DEADLINE v3 Raistlin
                   ` (9 preceding siblings ...)
  2010-10-29  6:35 ` [RFC][PATCH 10/22] sched: add a syscall to wait for the next instance Raistlin
@ 2010-10-29  6:35 ` Raistlin
  2010-10-29  6:36 ` [RFC][PATCH 12/22] sched: add runtime reporting " Raistlin
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 135+ messages in thread
From: Raistlin @ 2010-10-29  6:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 5753 bytes --]


Add some typical sched-debug output to dl_rq(s) and some
schedstats to -deadline tasks.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 include/linux/sched.h |   13 +++++++++++++
 kernel/sched.c        |    2 ++
 kernel/sched_debug.c  |   43 +++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_dl.c     |   42 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 100 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e301eea..8ae947b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1242,6 +1242,15 @@ struct sched_rt_entity {
 #endif
 };
 
+#ifdef CONFIG_SCHEDSTATS
+struct sched_stats_dl {
+	u64			last_dmiss;
+	u64			last_rorun;
+	u64			dmiss_max;
+	u64			rorun_max;
+};
+#endif
+
 struct sched_dl_entity {
 	struct rb_node	rb_node;
 	int nr_cpus_allowed;
@@ -1282,6 +1291,10 @@ struct sched_dl_entity {
 	 * own bandwidth to be enforced, thus we need one timer per task.
 	 */
 	struct hrtimer dl_timer;
+
+#ifdef CONFIG_SCHEDSTATS
+	struct sched_stats_dl stats;
+#endif
 };
 
 struct rcu_node;
diff --git a/kernel/sched.c b/kernel/sched.c
index 619d091..63a33f6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -425,6 +425,8 @@ struct dl_rq {
 
 	unsigned long dl_nr_running;
 
+	u64 exec_clock;
+
 #ifdef CONFIG_SMP
 	/*
 	 * Deadline values of the currently executing and the
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 2e1b0d1..f685f18 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -243,6 +243,42 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
 #undef P
 }
 
+void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq)
+{
+	s64 min_deadline = -1, max_deadline = -1;
+	struct rq *rq = cpu_rq(cpu);
+	struct sched_dl_entity *last;
+	unsigned long flags;
+
+	SEQ_printf(m, "\ndl_rq[%d]:\n", cpu);
+
+	raw_spin_lock_irqsave(&rq->lock, flags);
+	if (dl_rq->rb_leftmost)
+		min_deadline = (rb_entry(dl_rq->rb_leftmost,
+					 struct sched_dl_entity,
+					 rb_node))->deadline;
+	last = __pick_dl_last_entity(dl_rq);
+	if (last)
+		max_deadline = last->deadline;
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
+
+#define P(x) \
+	SEQ_printf(m, "  .%-30s: %Ld\n", #x, (long long)(dl_rq->x))
+#define __PN(x) \
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", #x, SPLIT_NS(x))
+#define PN(x) \
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", #x, SPLIT_NS(dl_rq->x))
+
+	P(dl_nr_running);
+	PN(exec_clock);
+	__PN(min_deadline);
+	__PN(max_deadline);
+
+#undef PN
+#undef __PN
+#undef P
+}
+
 static void print_cpu(struct seq_file *m, int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -302,6 +338,7 @@ static void print_cpu(struct seq_file *m, int cpu)
 #endif
 	print_cfs_stats(m, cpu);
 	print_rt_stats(m, cpu);
+	print_dl_stats(m, cpu);
 
 	print_rq(m, rq, cpu);
 }
@@ -430,6 +467,12 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 	P(se.statistics.nr_wakeups_affine_attempts);
 	P(se.statistics.nr_wakeups_passive);
 	P(se.statistics.nr_wakeups_idle);
+	if (dl_task(p)) {
+		PN(dl.stats.last_dmiss);
+		PN(dl.stats.dmiss_max);
+		PN(dl.stats.last_rorun);
+		PN(dl.stats.rorun_max);
+	}
 
 	{
 		u64 avg_atom, avg_per_cpu;
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index c8eb304..b01aa2a 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -465,6 +465,25 @@ int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
 	int rorun = dl_se->runtime <= 0;
 
 	/*
+	 * Record statistics about last and maximum deadline
+	 * misses and runtime overruns.
+	 */
+	if (dmiss) {
+		u64 damount = rq->clock - dl_se->deadline;
+
+		schedstat_set(dl_se->stats.last_dmiss, damount);
+		schedstat_set(dl_se->stats.dmiss_max,
+			      max(dl_se->stats.dmiss_max, damount));
+	}
+	if (rorun) {
+		u64 ramount = -dl_se->runtime;
+
+		schedstat_set(dl_se->stats.last_rorun, ramount);
+		schedstat_set(dl_se->stats.rorun_max,
+			      max(dl_se->stats.rorun_max, ramount));
+	}
+
+	/*
 	 * No need for checking if it's time to enforce the
 	 * bandwidth for the tasks that are:
 	 *  - maximum priority (SF_HEAD),
@@ -508,6 +527,7 @@ static void update_curr_dl(struct rq *rq)
 		      max(curr->se.statistics.exec_max, delta_exec));
 
 	curr->se.sum_exec_runtime += delta_exec;
+	schedstat_add(&rq->dl, exec_clock, delta_exec);
 	account_group_exec_runtime(curr, delta_exec);
 
 	curr->se.exec_start = rq->clock;
@@ -898,6 +918,16 @@ static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
 }
 #endif
 
+static struct sched_dl_entity *__pick_dl_last_entity(struct dl_rq *dl_rq)
+{
+	struct rb_node *last = rb_last(&dl_rq->rb_root);
+
+	if (!last)
+		return NULL;
+
+	return rb_entry(last, struct sched_dl_entity, rb_node);
+}
+
 static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
 						   struct dl_rq *dl_rq)
 {
@@ -1573,3 +1603,15 @@ static const struct sched_class dl_sched_class = {
 	.switched_to		= switched_to_dl,
 };
 
+#ifdef CONFIG_SCHED_DEBUG
+extern void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq);
+
+static void print_dl_stats(struct seq_file *m, int cpu)
+{
+	struct dl_rq *dl_rq = &cpu_rq(cpu)->dl;
+
+	rcu_read_lock();
+	print_dl_rq(m, cpu, dl_rq);
+	rcu_read_unlock();
+}
+#endif /* CONFIG_SCHED_DEBUG */
-- 
1.7.2.3


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [RFC][PATCH 12/22] sched: add runtime reporting for -deadline tasks
  2010-10-29  6:18 [RFC][PATCH 00/22] sched: SCHED_DEADLINE v3 Raistlin
                   ` (10 preceding siblings ...)
  2010-10-29  6:35 ` [RFC][PATCH 11/22] sched: add schedstats for -deadline tasks Raistlin
@ 2010-10-29  6:36 ` Raistlin
  2010-11-11 19:37   ` Peter Zijlstra
  2010-10-29  6:37 ` [RFC][PATCH 13/22] sched: add resource limits " Raistlin
                   ` (9 subsequent siblings)
  21 siblings, 1 reply; 135+ messages in thread
From: Raistlin @ 2010-10-29  6:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 3252 bytes --]


Make it available for the user-space the total amount of runtime
time it has used from since it became a -deadline task.

This is something that is typically useful for monitoring from
user-space the task CPU usage, and maybe implementing at that level
some more sophisticated scheduling behaviour.

One example is feedback scheduling, where you try to adapt the
scheduling parameters of a task by looking at its behaviour in
a certain interval of time, applying concepts coming from control
engineering.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 include/linux/sched.h |    7 +++----
 kernel/sched.c        |    3 +++
 kernel/sched_debug.c  |    1 +
 kernel/sched_dl.c     |    1 +
 4 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8ae947b..b6f0635 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1242,14 +1242,15 @@ struct sched_rt_entity {
 #endif
 };
 
-#ifdef CONFIG_SCHEDSTATS
 struct sched_stats_dl {
+#ifdef CONFIG_SCHEDSTATS
 	u64			last_dmiss;
 	u64			last_rorun;
 	u64			dmiss_max;
 	u64			rorun_max;
-};
 #endif
+	u64			tot_rtime;
+};
 
 struct sched_dl_entity {
 	struct rb_node	rb_node;
@@ -1292,9 +1293,7 @@ struct sched_dl_entity {
 	 */
 	struct hrtimer dl_timer;
 
-#ifdef CONFIG_SCHEDSTATS
 	struct sched_stats_dl stats;
-#endif
 };
 
 struct rcu_node;
diff --git a/kernel/sched.c b/kernel/sched.c
index 63a33f6..19c8c25 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4799,6 +4799,8 @@ __setparam_dl(struct task_struct *p, const struct sched_param_ex *param_ex)
 	dl_se->flags = param_ex->sched_flags;
 	dl_se->dl_throttled = 0;
 	dl_se->dl_new = 1;
+
+	dl_se->stats.tot_rtime = 0;
 }
 
 static void
@@ -4812,6 +4814,7 @@ __getparam_dl(struct task_struct *p, struct sched_param_ex *param_ex)
 	param_ex->sched_period = ns_to_timespec(dl_se->dl_period);
 	param_ex->sched_flags = dl_se->flags;
 	param_ex->curr_runtime = ns_to_timespec(dl_se->runtime);
+	param_ex->used_runtime = ns_to_timespec(dl_se->stats.tot_rtime);
 	param_ex->curr_deadline = ns_to_timespec(dl_se->deadline);
 }
 
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index f685f18..9bec524 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -472,6 +472,7 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 		PN(dl.stats.dmiss_max);
 		PN(dl.stats.last_rorun);
 		PN(dl.stats.rorun_max);
+		PN(dl.stats.tot_rtime);
 	}
 
 	{
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index b01aa2a..c4091c9 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -535,6 +535,7 @@ static void update_curr_dl(struct rq *rq)
 
 	sched_dl_avg_update(rq, delta_exec);
 
+	dl_se->stats.tot_rtime += delta_exec;
 	dl_se->runtime -= delta_exec;
 	if (dl_runtime_exceeded(rq, dl_se)) {
 		__dequeue_task_dl(rq, curr, 0);
-- 
1.7.2.3


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [RFC][PATCH 13/22] sched: add resource limits for -deadline tasks
  2010-10-29  6:18 [RFC][PATCH 00/22] sched: SCHED_DEADLINE v3 Raistlin
                   ` (11 preceding siblings ...)
  2010-10-29  6:36 ` [RFC][PATCH 12/22] sched: add runtime reporting " Raistlin
@ 2010-10-29  6:37 ` Raistlin
  2010-11-11 19:57   ` Peter Zijlstra
  2010-10-29  6:38 ` [RFC][PATCH 14/22] sched: add latency tracing " Raistlin
                   ` (8 subsequent siblings)
  21 siblings, 1 reply; 135+ messages in thread
From: Raistlin @ 2010-10-29  6:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 3177 bytes --]


Add resource limits for non-root tasks in using the SCHED_DEADLINE
policy, very similarly to what already exists for RT policies.

In fact, this patch:
 - adds the resource limit RLIMIT_DLDLINE, which is the minimum value
   a user task can use as its own deadline;
 - adds the resource limit RLIMIT_DLRTIME, which is the maximum value
   a user task can use as it own runtime.

Notice that to exploit these, a modified version of the ulimit
utility and a modified resource.h header file are needed. They
both will be available on the website of the project.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 include/asm-generic/resource.h |    7 ++++++-
 kernel/sched.c                 |   25 +++++++++++++++++++++++++
 2 files changed, 31 insertions(+), 1 deletions(-)

diff --git a/include/asm-generic/resource.h b/include/asm-generic/resource.h
index 587566f..4a1d0e2 100644
--- a/include/asm-generic/resource.h
+++ b/include/asm-generic/resource.h
@@ -45,7 +45,10 @@
 					   0-39 for nice level 19 .. -20 */
 #define RLIMIT_RTPRIO		14	/* maximum realtime priority */
 #define RLIMIT_RTTIME		15	/* timeout for RT tasks in us */
-#define RLIM_NLIMITS		16
+
+#define RLIMIT_DLDLINE		16	/* minimum deadline in us */
+#define RLIMIT_DLRTIME		17	/* maximum runtime in us */
+#define RLIM_NLIMITS		18
 
 /*
  * SuS says limits have to be unsigned.
@@ -87,6 +90,8 @@
 	[RLIMIT_NICE]		= { 0, 0 },				\
 	[RLIMIT_RTPRIO]		= { 0, 0 },				\
 	[RLIMIT_RTTIME]		= {  RLIM_INFINITY,  RLIM_INFINITY },	\
+	[RLIMIT_DLDLINE]	= { ULONG_MAX, ULONG_MAX },		\
+	[RLIMIT_DLRTIME]	= { 0, 0 },				\
 }
 
 #endif	/* __KERNEL__ */
diff --git a/kernel/sched.c b/kernel/sched.c
index 19c8c25..9165c5e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4903,6 +4903,31 @@ recheck:
 	 * Allow unprivileged RT tasks to decrease priority:
 	 */
 	if (user && !capable(CAP_SYS_NICE)) {
+		if (dl_policy(policy)) {
+			u64 rlim_dline, rlim_rtime;
+			u64 dline, rtime;
+
+			if (!lock_task_sighand(p, &flags))
+				return -ESRCH;
+			rlim_dline = p->signal->rlim[RLIMIT_DLDLINE].rlim_cur;
+			rlim_rtime = p->signal->rlim[RLIMIT_DLRTIME].rlim_cur;
+			unlock_task_sighand(p, &flags);
+
+			/* can't set/change -deadline policy */
+			if (policy != p->policy && !rlim_rtime)
+				return -EPERM;
+
+			/* can't decrease the deadline */
+			rlim_dline *= NSEC_PER_USEC;
+			dline = timespec_to_ns(&param_ex->sched_deadline);
+			if (dline < p->dl.dl_deadline && dline < rlim_dline)
+				return -EPERM;
+			/* can't increase the runtime */
+			rlim_rtime *= NSEC_PER_USEC;
+			rtime = timespec_to_ns(&param_ex->sched_runtime);
+			if (rtime > p->dl.dl_runtime && rtime > rlim_rtime)
+				return -EPERM;
+		}
 		if (rt_policy(policy)) {
 			unsigned long rlim_rtprio =
 					task_rlimit(p, RLIMIT_RTPRIO);
-- 
1.7.2.3


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [RFC][PATCH 14/22] sched: add latency tracing for -deadline tasks.
  2010-10-29  6:18 [RFC][PATCH 00/22] sched: SCHED_DEADLINE v3 Raistlin
                   ` (12 preceding siblings ...)
  2010-10-29  6:37 ` [RFC][PATCH 13/22] sched: add resource limits " Raistlin
@ 2010-10-29  6:38 ` Raistlin
  2010-10-29  6:38 ` [RFC][PATCH 15/22] sched: add traceporints " Raistlin
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 135+ messages in thread
From: Raistlin @ 2010-10-29  6:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 6566 bytes --]


It is very likely that systems that wants/needs to use the new
SCHED_DEADLINE policy also want to have the scheduling latency of
the -deadline tasks under control.

For this reason a new version of the scheduling wakeup latency,
called "wakeup_dl", is introduced.

As a consequence of applying this patch there will be three wakeup
latency tracer:
 * "wakeup", that deals with all tasks in the system;
 * "wakeup_rt", that deals with -rt and -deadline tasks only;
 * "wakeup_dl", that deals with -deadline tasks only.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 kernel/trace/trace_sched_wakeup.c |   44 +++++++++++++++++++++++++++++++++---
 kernel/trace/trace_selftest.c     |   31 ++++++++++++++++----------
 2 files changed, 59 insertions(+), 16 deletions(-)

diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
index 7319559..8cee7b0 100644
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -27,6 +27,7 @@ static int			wakeup_cpu;
 static int			wakeup_current_cpu;
 static unsigned			wakeup_prio = -1;
 static int			wakeup_rt;
+static int			wakeup_dl;
 
 static arch_spinlock_t wakeup_lock =
 	(arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
@@ -414,9 +415,17 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
 	tracing_record_cmdline(p);
 	tracing_record_cmdline(current);
 
-	if ((wakeup_rt && !rt_task(p)) ||
-			p->prio >= wakeup_prio ||
-			p->prio >= current->prio)
+	/*
+	 * Semantic is like this:
+	 *  - wakeup tracer handles all tasks in the system, independently
+	 *    from their scheduling class;
+	 *  - wakeup_rt tracer handles tasks belonging to sched_dl and
+	 *    sched_rt class;
+	 *  - wakeup_dl handles tasks belonging to sched_dl class only.
+	 */
+	if ((wakeup_dl && !dl_task(p)) ||
+	    (wakeup_rt && !dl_task(p) && !rt_task(p)) ||
+	    (p->prio >= wakeup_prio || p->prio >= current->prio))
 		return;
 
 	pc = preempt_count();
@@ -428,7 +437,7 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
 	arch_spin_lock(&wakeup_lock);
 
 	/* check for races. */
-	if (!tracer_enabled || p->prio >= wakeup_prio)
+	if (!tracer_enabled || (!dl_task(p) && p->prio >= wakeup_prio))
 		goto out_locked;
 
 	/* reset the trace */
@@ -536,16 +545,25 @@ static int __wakeup_tracer_init(struct trace_array *tr)
 
 static int wakeup_tracer_init(struct trace_array *tr)
 {
+	wakeup_dl = 0;
 	wakeup_rt = 0;
 	return __wakeup_tracer_init(tr);
 }
 
 static int wakeup_rt_tracer_init(struct trace_array *tr)
 {
+	wakeup_dl = 0;
 	wakeup_rt = 1;
 	return __wakeup_tracer_init(tr);
 }
 
+static int wakeup_dl_tracer_init(struct trace_array *tr)
+{
+	wakeup_dl = 1;
+	wakeup_rt = 0;
+	return __wakeup_tracer_init(tr);
+}
+
 static void wakeup_tracer_reset(struct trace_array *tr)
 {
 	stop_wakeup_tracer(tr);
@@ -608,6 +626,20 @@ static struct tracer wakeup_rt_tracer __read_mostly =
 	.use_max_tr	= 1,
 };
 
+static struct tracer wakeup_dl_tracer __read_mostly =
+{
+	.name		= "wakeup_dl",
+	.init		= wakeup_dl_tracer_init,
+	.reset		= wakeup_tracer_reset,
+	.start		= wakeup_tracer_start,
+	.stop		= wakeup_tracer_stop,
+	.wait_pipe	= poll_wait_pipe,
+	.print_max	= 1,
+#ifdef CONFIG_FTRACE_SELFTEST
+	.selftest    = trace_selftest_startup_wakeup,
+#endif
+};
+
 __init static int init_wakeup_tracer(void)
 {
 	int ret;
@@ -620,6 +652,10 @@ __init static int init_wakeup_tracer(void)
 	if (ret)
 		return ret;
 
+	ret = register_tracer(&wakeup_dl_tracer);
+	if (ret)
+		return ret;
+
 	return 0;
 }
 device_initcall(init_wakeup_tracer);
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index 562c56e..c4f3580 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -557,11 +557,18 @@ trace_selftest_startup_nop(struct tracer *trace, struct trace_array *tr)
 #ifdef CONFIG_SCHED_TRACER
 static int trace_wakeup_test_thread(void *data)
 {
-	/* Make this a RT thread, doesn't need to be too high */
-	static struct sched_param param = { .sched_priority = 5 };
+	/* Make this a -deadline thread */
+	struct sched_param param = { .sched_priority = 0 };
+	struct sched_param_ex paramx = {
+		.sched_priority = 0,
+		.sched_runtime = { .tv_sec = 0, .tv_nsec = 100000 },
+		.sched_deadline = { .tv_sec = 0, .tv_nsec = 10000000 },
+		.sched_period = { .tv_sec = 0, tv_nsec = 10000000 }
+		.sched_flags = 0
+	};
 	struct completion *x = data;
 
-	sched_setscheduler(current, SCHED_FIFO, &param);
+	sched_setscheduler_ex(current, SCHED_DEADLINE, &param, &paramx);
 
 	/* Make it know we have a new prio */
 	complete(x);
@@ -573,8 +580,8 @@ static int trace_wakeup_test_thread(void *data)
 	/* we are awake, now wait to disappear */
 	while (!kthread_should_stop()) {
 		/*
-		 * This is an RT task, do short sleeps to let
-		 * others run.
+		 * This will likely be the system top priority
+		 * task, do short sleeps to let others run.
 		 */
 		msleep(100);
 	}
@@ -587,21 +594,21 @@ trace_selftest_startup_wakeup(struct tracer *trace, struct trace_array *tr)
 {
 	unsigned long save_max = tracing_max_latency;
 	struct task_struct *p;
-	struct completion isrt;
+	struct completion is_ready;
 	unsigned long count;
 	int ret;
 
-	init_completion(&isrt);
+	init_completion(&is_ready);
 
-	/* create a high prio thread */
-	p = kthread_run(trace_wakeup_test_thread, &isrt, "ftrace-test");
+	/* create a -deadline thread */
+	p = kthread_run(trace_wakeup_test_thread, &is_ready, "ftrace-test");
 	if (IS_ERR(p)) {
 		printk(KERN_CONT "Failed to create ftrace wakeup test thread ");
 		return -1;
 	}
 
-	/* make sure the thread is running at an RT prio */
-	wait_for_completion(&isrt);
+	/* make sure the thread is running at -deadline policy */
+	wait_for_completion(&is_ready);
 
 	/* start the tracing */
 	ret = tracer_init(trace, tr);
@@ -613,7 +620,7 @@ trace_selftest_startup_wakeup(struct tracer *trace, struct trace_array *tr)
 	/* reset the max latency */
 	tracing_max_latency = 0;
 
-	/* sleep to let the RT thread sleep too */
+	/* sleep to let the thread sleep too */
 	msleep(100);
 
 	/*
-- 
1.7.2.3


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [RFC][PATCH 15/22] sched: add traceporints for -deadline tasks
  2010-10-29  6:18 [RFC][PATCH 00/22] sched: SCHED_DEADLINE v3 Raistlin
                   ` (13 preceding siblings ...)
  2010-10-29  6:38 ` [RFC][PATCH 14/22] sched: add latency tracing " Raistlin
@ 2010-10-29  6:38 ` Raistlin
  2010-11-11 19:54   ` Peter Zijlstra
  2010-10-29  6:39 ` [RFC][PATCH 16/22] sched: add SMP " Raistlin
                   ` (6 subsequent siblings)
  21 siblings, 1 reply; 135+ messages in thread
From: Raistlin @ 2010-10-29  6:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 10786 bytes --]


Add tracepoints for the most notable events related to -deadline
tasks scheduling (new task arrival, context switch, runtime accounting,
bandwidth enforcement timer, etc.).

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Harald Gustafsson <harald.gustafsson@ericsson.com>
---
 include/trace/events/sched.h |  203 +++++++++++++++++++++++++++++++++++++++++-
 kernel/sched.c               |    2 +
 kernel/sched_dl.c            |   21 +++++
 3 files changed, 225 insertions(+), 1 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index f633478..03baa17 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -304,7 +304,6 @@ DECLARE_EVENT_CLASS(sched_stat_template,
 			(unsigned long long)__entry->delay)
 );
 
-
 /*
  * Tracepoint for accounting wait time (time the task is runnable
  * but not actually running due to scheduler contention).
@@ -363,6 +362,208 @@ TRACE_EVENT(sched_stat_runtime,
 );
 
 /*
+ * Tracepoint for task switches involving -deadline tasks:
+ */
+TRACE_EVENT(sched_switch_dl,
+
+	TP_PROTO(u64 clock,
+		 struct task_struct *prev,
+		 struct task_struct *next),
+
+	TP_ARGS(clock, prev, next),
+
+	TP_STRUCT__entry(
+		__array(	char,	prev_comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	prev_pid			)
+		__field(	u64,	clock				)
+		__field(	s64,	prev_rt				)
+		__field(	u64,	prev_dl				)
+		__field(	long,	prev_state			)
+		__array(	char,	next_comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	next_pid			)
+		__field(	s64,	next_rt				)
+		__field(	u64,	next_dl				)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
+		__entry->prev_pid	= prev->pid;
+		__entry->clock		= clock;
+		__entry->prev_rt	= prev->dl.runtime;
+		__entry->prev_dl	= prev->dl.deadline;
+		__entry->prev_state	= __trace_sched_switch_state(prev);
+		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
+		__entry->next_pid	= next->pid;
+		__entry->next_rt	= next->dl.runtime;
+		__entry->next_dl	= next->dl.deadline;
+	),
+
+	TP_printk("prev_comm=%s prev_pid=%d prev_rt=%Ld [ns] prev_dl=%Lu [ns] prev_state=%s ==> "
+		  "next_comm=%s next_pid=%d next_rt=%Ld [ns] next_dl=%Lu [ns] clock=%Lu [ns]",
+		  __entry->prev_comm, __entry->prev_pid, (long long)__entry->prev_rt,
+		  (unsigned long long)__entry->prev_dl, __entry->prev_state ?
+		    __print_flags(__entry->prev_state, "|",
+				{ 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" },
+				{ 16, "Z" }, { 32, "X" }, { 64, "x" },
+				{ 128, "W" }) : "R",
+		  __entry->next_comm, __entry->next_pid, (long long)__entry->next_rt,
+		  (unsigned long long)__entry->next_dl, (unsigned long long)__entry->clock)
+);
+
+/*
+ * Tracepoint for starting of the throttling timer of a -deadline task:
+ */
+TRACE_EVENT(sched_start_timer_dl,
+
+	TP_PROTO(struct task_struct *p, u64 clock,
+		 s64 now, s64 act, unsigned long range),
+
+	TP_ARGS(p, clock, now, act, range),
+
+	TP_STRUCT__entry(
+		__array(	char,	comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	pid			)
+		__field(	u64,	clock			)
+		__field(	s64,	now			)
+		__field(	s64,	act			)
+		__field(	unsigned long,	range		)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		__entry->pid		= p->pid;
+		__entry->clock		= clock;
+		__entry->now		= now;
+		__entry->act		= act;
+		__entry->range		= range;
+	),
+
+	TP_printk("comm=%s pid=%d clock=%Lu [ns] now=%Ld [ns] soft=%Ld [ns] range=%lu",
+		  __entry->comm, __entry->pid, (unsigned long long)__entry->clock,
+		  (long long)__entry->now, (long long)__entry->act,
+		  (unsigned long)__entry->range)
+);
+
+/*
+ * Tracepoint for the throttling timer of a -deadline task:
+ */
+TRACE_EVENT(sched_timer_dl,
+
+	TP_PROTO(struct task_struct *p, u64 clock, int on_rq, int running),
+
+	TP_ARGS(p, clock, on_rq, running),
+
+	TP_STRUCT__entry(
+		__array(	char,	comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	pid			)
+		__field(	u64,	clock			)
+		__field(	int,	on_rq			)
+		__field(	int,	running			)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		__entry->pid		= p->pid;
+		__entry->clock		= clock;
+		__entry->on_rq		= on_rq;
+		__entry->running	= running;
+	),
+
+	TP_printk("comm=%s pid=%d clock=%Lu_rq=%d running=%d",
+		  __entry->comm, __entry->pid, (unsigned long long)__entry->clock,
+		  __entry->on_rq, __entry->running)
+);
+
+/*
+ * sched_stat tracepoints for -deadline tasks:
+ */
+DECLARE_EVENT_CLASS(sched_stat_template_dl,
+
+	TP_PROTO(struct task_struct *p, u64 clock, int flags),
+
+	TP_ARGS(p, clock, flags),
+
+	TP_STRUCT__entry(
+		__array(	char,	comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	pid			)
+		__field(	u64,	clock			)
+		__field(	s64,	rt			)
+		__field(	u64,	dl			)
+		__field(	int,	flags			)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		__entry->pid		= p->pid;
+		__entry->clock		= clock;
+		__entry->rt		= p->dl.runtime;
+		__entry->dl		= p->dl.deadline;
+		__entry->flags		= flags;
+	),
+
+	TP_printk("comm=%s pid=%d clock=%Lu [ns] rt=%Ld dl=%Lu [ns] flags=0x%x",
+		  __entry->comm, __entry->pid, (unsigned long long)__entry->clock,
+		  (long long)__entry->rt, (unsigned long long)__entry->dl,
+		  __entry->flags)
+);
+
+/*
+ * Tracepoint for a new instance of a -deadline task:
+ */
+DEFINE_EVENT(sched_stat_template_dl, sched_stat_new_dl,
+	     TP_PROTO(struct task_struct *tsk, u64 clock, int flags),
+	     TP_ARGS(tsk, clock, flags));
+
+/*
+ * Tracepoint for a replenishment of a -deadline task:
+ */
+DEFINE_EVENT(sched_stat_template_dl, sched_stat_repl_dl,
+	     TP_PROTO(struct task_struct *tsk, u64 clock, int flags),
+	     TP_ARGS(tsk, clock, flags));
+
+/*
+ * Tracepoint for parameters recalculation of -deadline tasks:.
+ */
+DEFINE_EVENT(sched_stat_template_dl, sched_stat_updt_dl,
+	     TP_PROTO(struct task_struct *tsk, u64 clock, int flags),
+	     TP_ARGS(tsk, clock, flags));
+
+/*
+ * Tracepoint for accounting stats of -deadline tasks:.
+ */
+TRACE_EVENT(sched_stat_runtime_dl,
+
+	TP_PROTO(struct task_struct *p, u64 clock, u64 last),
+
+	TP_ARGS(p, clock, last),
+
+	TP_STRUCT__entry(
+		__array(	char,	comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	pid			)
+		__field(	u64,	clock			)
+		__field(	u64,	last			)
+		__field(	s64,	rt			)
+		__field(	u64,	dl			)
+		__field(	u64,	start			)
+        ),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		__entry->pid		= p->pid;
+		__entry->clock		= clock;
+		__entry->last		= last;
+		__entry->rt		= p->dl.runtime - last;
+		__entry->dl		= p->dl.deadline;
+		__entry->start		= p->se.exec_start;
+	),
+
+	TP_printk("comm=%s pid=%d clock=%Lu [ns] delta_exec=%Lu [ns] rt=%Ld [ns] dl=%Lu [ns] exec_start=%Lu [ns]",
+		  __entry->comm, __entry->pid, (unsigned long long)__entry->clock,
+		  (unsigned long long)__entry->last, (long long)__entry->rt,
+		  (unsigned long long)__entry->dl, (unsigned long long)__entry->start)
+);
+
+/*
  * Tracepoint for showing priority inheritance modifying a tasks
  * priority.
  */
diff --git a/kernel/sched.c b/kernel/sched.c
index 9165c5e..060d0c9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3061,6 +3061,8 @@ context_switch(struct rq *rq, struct task_struct *prev,
 
 	prepare_task_switch(rq, prev, next);
 	trace_sched_switch(prev, next);
+	if (unlikely(dl_task(prev) || dl_task(next)))
+		trace_sched_switch_dl(rq->clock, prev, next);
 	mm = next->mm;
 	oldmm = prev->active_mm;
 	/*
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index c4091c9..229814a 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -231,6 +231,9 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
 	dl_se->deadline = rq->clock + dl_se->dl_deadline;
 	dl_se->runtime = dl_se->dl_runtime;
 	dl_se->dl_new = 0;
+#ifdef CONFIG_SCHEDSTATS
+	trace_sched_stat_new_dl(dl_task_of(dl_se), rq->clock, dl_se->flags);
+#endif
 }
 
 /*
@@ -255,6 +258,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
+	int reset = 0;
 
 	/*
 	 * We Keep moving the deadline away until we get some
@@ -280,7 +284,11 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 		WARN_ON_ONCE(1);
 		dl_se->deadline = rq->clock + dl_se->dl_deadline;
 		dl_se->runtime = dl_se->dl_runtime;
+		reset = 1;
 	}
+#ifdef CONFIG_SCHEDSTATS
+	trace_sched_stat_repl_dl(dl_task_of(dl_se), rq->clock, reset);
+#endif
 }
 
 /*
@@ -332,6 +340,7 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
+	int overflow = 0;
 
 	/*
 	 * The arrival of a new instance needs special treatment, i.e.,
@@ -346,7 +355,11 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
 	    dl_entity_overflow(dl_se, rq->clock)) {
 		dl_se->deadline = rq->clock + dl_se->dl_deadline;
 		dl_se->runtime = dl_se->dl_runtime;
+		overflow = 1;
 	}
+#ifdef CONFIG_SCHEDSTATS
+	trace_sched_stat_updt_dl(dl_task_of(dl_se), rq->clock, overflow);
+#endif
 }
 
 /*
@@ -394,6 +407,10 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
 	__hrtimer_start_range_ns(&dl_se->dl_timer, soft,
 				 range, HRTIMER_MODE_ABS, 0);
 
+	trace_sched_start_timer_dl(dl_task_of(dl_se), rq->clock,
+				   ktime_to_ns(now), ktime_to_ns(soft),
+				   range);
+
 	return hrtimer_active(&dl_se->dl_timer);
 }
 
@@ -427,6 +444,8 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 	if (!dl_task(p))
 		goto unlock;
 
+	trace_sched_timer_dl(p, rq->clock, p->se.on_rq, task_current(rq, p));
+
 	dl_se->dl_throttled = 0;
 	if (p->se.on_rq) {
 		enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
@@ -439,6 +458,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 		if (rq->dl.overloaded)
 			push_dl_task(rq);
 	}
+
 unlock:
 	task_rq_unlock(rq, &flags);
 
@@ -529,6 +549,7 @@ static void update_curr_dl(struct rq *rq)
 	curr->se.sum_exec_runtime += delta_exec;
 	schedstat_add(&rq->dl, exec_clock, delta_exec);
 	account_group_exec_runtime(curr, delta_exec);
+	trace_sched_stat_runtime_dl(curr, rq->clock, delta_exec);
 
 	curr->se.exec_start = rq->clock;
 	cpuacct_charge(curr, delta_exec);
-- 
1.7.2.3


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [RFC][PATCH 16/22] sched: add SMP traceporints for -deadline tasks
  2010-10-29  6:18 [RFC][PATCH 00/22] sched: SCHED_DEADLINE v3 Raistlin
                   ` (14 preceding siblings ...)
  2010-10-29  6:38 ` [RFC][PATCH 15/22] sched: add traceporints " Raistlin
@ 2010-10-29  6:39 ` Raistlin
  2010-10-29  6:40 ` [RFC][PATCH 17/22] sched: add signaling overrunning " Raistlin
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 135+ messages in thread
From: Raistlin @ 2010-10-29  6:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 5183 bytes --]


Add tracepoints for the events involved in -deadline task migration
(mainly push, pull and migrate-task).

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 include/trace/events/sched.h |  109 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched.c               |    3 +
 kernel/sched_dl.c            |    7 +++
 3 files changed, 119 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 03baa17..f1d805f 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -475,6 +475,115 @@ TRACE_EVENT(sched_timer_dl,
 );
 
 /*
+ *
+ */
+TRACE_EVENT(sched_push_task_dl,
+
+	TP_PROTO(struct task_struct *n, u64 clock, int later_cpu),
+
+	TP_ARGS(n, clock, later_cpu),
+
+	TP_STRUCT__entry(
+		__array(	char,	comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	pid			)
+		__field(	u64,	clock			)
+		__field(	s64,	rt			)
+		__field(	u64,	dl			)
+		__field(	int,	cpu			)
+		__field(	int,	later_cpu		)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, n->comm, TASK_COMM_LEN);
+		__entry->pid		= n->pid;
+		__entry->clock		= clock;
+		__entry->rt		= n->dl.runtime;
+		__entry->dl		= n->dl.deadline;
+		__entry->cpu		= task_cpu(n);
+		__entry->later_cpu	= later_cpu;
+	),
+
+	TP_printk("comm=%s pid=%d rt=%Ld [ns] dl=%Lu [ns] clock=%Lu [ns] cpu=%d later_cpu=%d",
+		  __entry->comm, __entry->pid, (long long)__entry->rt,
+		  (unsigned long long)__entry->dl, (unsigned long long)__entry->clock,
+		  __entry->cpu, __entry->later_cpu)
+);
+
+/*
+ *
+ */
+TRACE_EVENT(sched_pull_task_dl,
+
+	TP_PROTO(struct task_struct *p, u64 clock, int src_cpu),
+
+	TP_ARGS(p, clock, src_cpu),
+
+	TP_STRUCT__entry(
+		__array(	char,	comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	pid			)
+		__field(	u64,	clock			)
+		__field(	s64,	rt			)
+		__field(	u64,	dl			)
+		__field(	int,	cpu			)
+		__field(	int,	src_cpu			)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		__entry->pid		= p->pid;
+		__entry->clock		= clock;
+		__entry->rt		= p->dl.runtime;
+		__entry->dl		= p->dl.deadline;
+		__entry->cpu		= task_cpu(p);
+		__entry->src_cpu	= src_cpu;
+	),
+
+	TP_printk("comm=%s pid=%d rt=%Ld [ns] dl=%Lu [ns] clock=%Lu [ns] cpu=%d later_cpu=%d",
+		  __entry->comm, __entry->pid, (long long)__entry->rt,
+		  (unsigned long long)__entry->dl, (unsigned long long)__entry->clock,
+		  __entry->cpu, __entry->src_cpu)
+);
+
+/*
+ * Tracepoint for migrations involving -deadline tasks:
+ */
+TRACE_EVENT(sched_migrate_task_dl,
+
+	TP_PROTO(struct task_struct *p, u64 clock, int dest_cpu, u64 dclock),
+
+	TP_ARGS(p, clock, dest_cpu, dclock),
+
+	TP_STRUCT__entry(
+		__array(	char,	comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	pid			)
+		__field(	u64,	clock			)
+		__field(	s64,	rt			)
+		__field(	u64,	dl			)
+		__field(	int,	orig_cpu		)
+		__field(	int,	dest_cpu		)
+		__field(	u64,	dclock			)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		__entry->pid		= p->pid;
+		__entry->clock		= clock;
+		__entry->rt		= p->dl.runtime;
+		__entry->dl		= p->dl.deadline;
+		__entry->orig_cpu	= task_cpu(p);
+		__entry->dest_cpu	= dest_cpu;
+		__entry->dclock		= dclock;
+	),
+
+	TP_printk("comm=%s pid=%d rt=%Ld [ns] dl=%Lu [ns] orig_cpu=%d orig_clock=%Lu [ns] "
+		  "dest_cpu=%d dest_clock=%Lu [ns]",
+		  __entry->comm, __entry->pid, (long long)__entry->rt,
+		  (unsigned long long)__entry->dl, __entry->orig_cpu,
+		  (unsigned long long)__entry->clock, __entry->dest_cpu,
+		  (unsigned long long)__entry->dclock)
+);
+
+/*
  * sched_stat tracepoints for -deadline tasks:
  */
 DECLARE_EVENT_CLASS(sched_stat_template_dl,
diff --git a/kernel/sched.c b/kernel/sched.c
index 060d0c9..79cac6e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2235,6 +2235,9 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 #endif
 
 	trace_sched_migrate_task(p, new_cpu);
+	if (unlikely(dl_task(p)))
+		trace_sched_migrate_task_dl(p, task_rq(p)->clock,
+					    new_cpu, cpu_rq(new_cpu)->clock);
 
 	if (task_cpu(p) != new_cpu) {
 		p->se.nr_migrations++;
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index 229814a..cc87949 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -1294,6 +1294,10 @@ retry:
 
 	/* Will lock the rq it'll find */
 	later_rq = find_lock_later_rq(next_task, rq);
+
+	trace_sched_push_task_dl(next_task, rq->clock,
+				 later_rq ? later_rq->cpu : -1);
+
 	if (!later_rq) {
 		struct task_struct *task;
 
@@ -1378,6 +1382,9 @@ static int pull_dl_task(struct rq *this_rq)
 			goto skip;
 
 		p = pick_next_earliest_dl_task(src_rq, this_cpu);
+		if (p)
+			trace_sched_pull_task_dl(p, this_rq->clock,
+						 src_rq->cpu);
 
 		/*
 		 * We found a task to be pulled if:
-- 
1.7.2.3


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [RFC][PATCH 17/22] sched: add signaling overrunning -deadline tasks.
  2010-10-29  6:18 [RFC][PATCH 00/22] sched: SCHED_DEADLINE v3 Raistlin
                   ` (15 preceding siblings ...)
  2010-10-29  6:39 ` [RFC][PATCH 16/22] sched: add SMP " Raistlin
@ 2010-10-29  6:40 ` Raistlin
  2010-11-11 21:58   ` Peter Zijlstra
  2010-10-29  6:42 ` [RFC][PATCH 19/22] rtmutex: turn the plist into an rb-tree Raistlin
                   ` (4 subsequent siblings)
  21 siblings, 1 reply; 135+ messages in thread
From: Raistlin @ 2010-10-29  6:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 8101 bytes --]


Add to the scheduler the capability of notifying when -deadline tasks
overrun their maximum runtime and/or overcome their scheduling
deadline.

Runtime overruns might be quite common, e.g., due to coarse granularity
execution time accounting resolution or wrong assignment of tasks'
parameters (especially runtime). However, since the scheduler enforces
bandwidth isolation among tasks, this is not at all a threat to other
tasks' schedulability. For this reason, it is not common that a task
wants to be notified about that. Moreover, if we are using the
SCHED_DEADLINE policy with sporadic tasks, or to limit the bandwidth
of not periodic nor sporadic ones, runtime overruns are very likely
to occur at each and every instance, and again they should not be
considered a problem.

On the other hand, a deadline miss in any task means that, even if we
are trying at our best to keep each task isolated and to avoid
reciprocal interference among them, something went very, very bad,
and one task did not manage in consuming its runtime by its deadline.
This is something that should happen only on an oversubscribed
system, and thus being notified when it occurs could be very useful.

The user can specify the signal(s) he wants to be sent to his task
during sched_setscheduler_ex(), raising two specific flags in the
sched_flags field of struct sched_param_ex:
 * SF_SIG_RORUN (if he wants to be signaled on runtime overrun),
 * SF_SIG_DMISS (if he wants to be signaled on deadline misses).

This patch:
 - adds the logic needed to send SIGXCPU signal to a -deadline task
   in case its actual runtime becomes negative;
 - adds the logic needed to send SIGXCPU signal to a -deadline task
   in case it is still being scheduled while its absolute deadline
   passes.

This all happens in the POSIX cpu-timers code, we need to take
t->sighand->siglock, and it can't be done within the scheduler,
under task_rq(t)->lock.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 include/linux/sched.h     |   14 ++++++++++-
 kernel/posix-cpu-timers.c |   55 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_debug.c      |    2 +
 kernel/sched_dl.c         |    8 +++++-
 4 files changed, 76 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b6f0635..b729f83 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -163,8 +163,19 @@ struct sched_param_ex {
  *              of the highest priority scheduling class. In case it
  *              it sched_deadline, the task also ignore runtime and
  *              bandwidth limitations.
+ *
+ * These flags here below are meant to be used by userspace tasks to affect
+ * the scheduler behaviour and/or specifying that they want to be informed
+ * of the occurrence of some events.
+ *
+ *  @SF_SIG_RORUN       tells us the task wants to be notified whenever
+ *                      a runtime overrun occurs;
+ *  @SF_SIG_DMISS       tells us the task wants to be notified whenever
+ *                      a scheduling deadline is missed.
  */
 #define SF_HEAD		1
+#define SF_SIG_RORUN	2
+#define SF_SIG_DMISS	4
 
 struct exec_domain;
 struct futex_pi_state;
@@ -1243,9 +1254,10 @@ struct sched_rt_entity {
 };
 
 struct sched_stats_dl {
-#ifdef CONFIG_SCHEDSTATS
+	int			dmiss, rorun;
 	u64			last_dmiss;
 	u64			last_rorun;
+#ifdef CONFIG_SCHEDSTATS
 	u64			dmiss_max;
 	u64			rorun_max;
 #endif
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 6842eeb..610b8b1 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -901,6 +901,37 @@ void posix_cpu_timer_get(struct k_itimer *timer, struct itimerspec *itp)
 }
 
 /*
+ * Inform a -deadline task that it is overrunning its runtime or
+ * (much worse) missing a deadline. This is done by sending the task
+ * SIGXCPU, with some additional information to let it discover
+ * what actually happened.
+ *
+ * The nature of the violation is coded in si_errno, while an attempt
+ * to let the task know *how big* the violation is is done through
+ * si_value. Unfortunately, only an int field is available there,
+ * thus what reported might be inaccurate.
+ */
+static inline void __dl_signal(struct task_struct *tsk, int which)
+{
+	struct siginfo info;
+	long long amount = which == SF_SIG_DMISS ? tsk->dl.stats.last_dmiss :
+			   tsk->dl.stats.last_rorun;
+
+	info.si_signo = SIGXCPU;
+	info.si_errno = which;
+	info.si_code = SI_KERNEL;
+	info.si_pid = 0;
+	info.si_uid = 0;
+	info.si_value.sival_int = (int)amount;
+
+	/* Correctly take the locks on task's sighand */
+	__group_send_sig_info(SIGXCPU, &info, tsk);
+	/* Log what happened to dmesg */
+	printk(KERN_INFO "SCHED_DEADLINE: 0x%4x by %Ld [ns] in %d (%s)\n",
+	       which, amount, task_pid_nr(tsk), tsk->comm);
+}
+
+/*
  * Check for any per-thread CPU timers that have fired and move them off
  * the tsk->cpu_timers[N] list onto the firing list.  Here we update the
  * tsk->it_*_expires values to reflect the remaining thread CPU timers.
@@ -958,6 +989,25 @@ static void check_thread_timers(struct task_struct *tsk,
 	}
 
 	/*
+	 * if the userspace asked for that, we notify about (scheduling)
+	 * deadline misses and runtime overruns via sending SIGXCPU to
+	 * "faulting" task.
+	 *
+	 * Note that (hopefully small) runtime overruns are very likely
+	 * to occur, mainly due to accounting resolution, while missing a
+	 * scheduling deadline should be very rare, and only happen on
+	 * an oversubscribed systems.
+	 *
+	 */
+	if (unlikely(dl_task(tsk))) {
+		if ((tsk->dl.flags & SF_SIG_DMISS) && tsk->dl.stats.dmiss)
+			__dl_signal(tsk, SF_SIG_DMISS);
+		if ((tsk->dl.flags & SF_SIG_RORUN) && tsk->dl.stats.rorun)
+			__dl_signal(tsk, SF_SIG_RORUN);
+		tsk->dl.stats.dmiss = tsk->dl.stats.rorun = 0;
+	}
+
+	/*
 	 * Check for the special case thread timers.
 	 */
 	soft = ACCESS_ONCE(sig->rlim[RLIMIT_RTTIME].rlim_cur);
@@ -1272,6 +1322,11 @@ static inline int fastpath_timer_check(struct task_struct *tsk)
 {
 	struct signal_struct *sig;
 
+	if (unlikely(dl_task(tsk) &&
+	    (((tsk->dl.flags & SF_SIG_DMISS) && tsk->dl.stats.dmiss) ||
+	     ((tsk->dl.flags & SF_SIG_RORUN) && tsk->dl.stats.rorun))))
+		return 1;
+
 	if (!task_cputime_zero(&tsk->cputime_expires)) {
 		struct task_cputime task_sample = {
 			.utime = tsk->utime,
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 9bec524..4949a21 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -468,8 +468,10 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 	P(se.statistics.nr_wakeups_passive);
 	P(se.statistics.nr_wakeups_idle);
 	if (dl_task(p)) {
+		P(dl.stats.dmiss);
 		PN(dl.stats.last_dmiss);
 		PN(dl.stats.dmiss_max);
+		P(dl.stats.rorun);
 		PN(dl.stats.last_rorun);
 		PN(dl.stats.rorun_max);
 		PN(dl.stats.tot_rtime);
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index cc87949..eff183a 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -491,14 +491,18 @@ int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
 	if (dmiss) {
 		u64 damount = rq->clock - dl_se->deadline;
 
-		schedstat_set(dl_se->stats.last_dmiss, damount);
+		dl_se->stats.dmiss = 1;
+		dl_se->stats.last_dmiss = damount;
+
 		schedstat_set(dl_se->stats.dmiss_max,
 			      max(dl_se->stats.dmiss_max, damount));
 	}
 	if (rorun) {
 		u64 ramount = -dl_se->runtime;
 
-		schedstat_set(dl_se->stats.last_rorun, ramount);
+		dl_se->stats.rorun = 1;
+		dl_se->stats.last_rorun = ramount;
+
 		schedstat_set(dl_se->stats.rorun_max,
 			      max(dl_se->stats.rorun_max, ramount));
 	}
-- 
1.7.2.3


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [RFC][PATCH 19/22] rtmutex: turn the plist into an rb-tree
  2010-10-29  6:18 [RFC][PATCH 00/22] sched: SCHED_DEADLINE v3 Raistlin
                   ` (16 preceding siblings ...)
  2010-10-29  6:40 ` [RFC][PATCH 17/22] sched: add signaling overrunning " Raistlin
@ 2010-10-29  6:42 ` Raistlin
  2010-10-29  6:42 ` [RFC][PATCH 18/22] sched: add reclaiming logic to -deadline tasks Raistlin
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 135+ messages in thread
From: Raistlin @ 2010-10-29  6:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 17901 bytes --]


Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
and provide a proper comparison function for -deadline and
-priority tasks.

This is done mainly because:
 - classical prio field of the plist is just an int, which might
   not be enough for representing a deadline;
 - manipulating such a list would become O(nr_deadline_tasks),
   which might be to much, as the number of -deadline task increses.

Therefore, an rb-tree is used, and tasks are queued in it according
to the following logic:
 - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
   one with the higher (lower, actually!) prio wins;
 - among a -priority and a -deadline task, the latter always wins;
 - among two -deadline tasks, the one with the earliest deadline
   wins.

Queueing and dequeueing functions are chenged accordingly, for both
the list of a task's pi-waiters and the list of tasks blocked on
a pi-lock.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 include/linux/init_task.h |   10 +++
 include/linux/rtmutex.h   |   13 +---
 include/linux/sched.h     |    4 +-
 kernel/fork.c             |    3 +-
 kernel/rtmutex-debug.c    |    8 +--
 kernel/rtmutex.c          |  135 +++++++++++++++++++++++++++++++++++----------
 kernel/rtmutex_common.h   |   22 ++++----
 kernel/sched.c            |    4 -
 8 files changed, 137 insertions(+), 62 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 1f8c06c..f4f7567 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -10,6 +10,7 @@
 #include <linux/pid_namespace.h>
 #include <linux/user_namespace.h>
 #include <linux/securebits.h>
+#include <linux/rbtree.h>
 #include <net/net_namespace.h>
 
 extern struct files_struct init_files;
@@ -110,6 +111,14 @@ extern struct cred init_cred;
 # define INIT_PERF_EVENTS(tsk)
 #endif
 
+#ifdef CONFIG_RT_MUTEXES
+# define INIT_RT_MUTEXES						\
+	.pi_waiters = RB_ROOT,						\
+	.pi_waiters_leftmost = NULL,
+#else
+# define INIT_RT_MUTEXES
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -178,6 +187,7 @@ extern struct cred init_cred;
 	INIT_FTRACE_GRAPH						\
 	INIT_TRACE_RECURSION						\
 	INIT_TASK_RCU_PREEMPT(tsk)					\
+	INIT_RT_MUTEXES							\
 }
 
 
diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index 8d522ff..bd7cd02 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -13,7 +13,7 @@
 #define __LINUX_RT_MUTEX_H
 
 #include <linux/linkage.h>
-#include <linux/plist.h>
+#include <linux/rbtree.h>
 #include <linux/spinlock_types.h>
 
 extern int max_lock_depth; /* for sysctl */
@@ -27,7 +27,8 @@ extern int max_lock_depth; /* for sysctl */
  */
 struct rt_mutex {
 	raw_spinlock_t		wait_lock;
-	struct plist_head	wait_list;
+	struct rb_root          waiters;
+	struct rb_node          *waiters_leftmost;
 	struct task_struct	*owner;
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 	int			save_state;
@@ -98,12 +99,4 @@ extern int rt_mutex_trylock(struct rt_mutex *lock);
 
 extern void rt_mutex_unlock(struct rt_mutex *lock);
 
-#ifdef CONFIG_RT_MUTEXES
-# define INIT_RT_MUTEXES(tsk)						\
-	.pi_waiters	= PLIST_HEAD_INIT(tsk.pi_waiters, tsk.pi_lock),	\
-	INIT_RT_MUTEX_DEBUG(tsk)
-#else
-# define INIT_RT_MUTEXES(tsk)
-#endif
-
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8806c1f..c3d1f17b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -56,6 +56,7 @@ struct sched_param {
 #include <linux/types.h>
 #include <linux/timex.h>
 #include <linux/jiffies.h>
+#include <linux/plist.h>
 #include <linux/rbtree.h>
 #include <linux/thread_info.h>
 #include <linux/cpumask.h>
@@ -1530,7 +1531,8 @@ struct task_struct {
 
 #ifdef CONFIG_RT_MUTEXES
 	/* PI waiters blocked on a rt_mutex held by this task */
-	struct plist_head pi_waiters;
+	struct rb_root pi_waiters;
+	struct rb_node *pi_waiters_leftmost;
 	/* Deadlock detection and priority inheritance handling */
 	struct rt_mutex_waiter *pi_blocked_on;
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 3b159c5..aceb248 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -935,7 +935,8 @@ static void rt_mutex_init_task(struct task_struct *p)
 {
 	raw_spin_lock_init(&p->pi_lock);
 #ifdef CONFIG_RT_MUTEXES
-	plist_head_init_raw(&p->pi_waiters, &p->pi_lock);
+	p->pi_waiters = RB_ROOT;
+	p->pi_waiters_leftmost = NULL;
 	p->pi_blocked_on = NULL;
 #endif
 }
diff --git a/kernel/rtmutex-debug.c b/kernel/rtmutex-debug.c
index ddabb54..7cc8376 100644
--- a/kernel/rtmutex-debug.c
+++ b/kernel/rtmutex-debug.c
@@ -23,7 +23,7 @@
 #include <linux/kallsyms.h>
 #include <linux/syscalls.h>
 #include <linux/interrupt.h>
-#include <linux/plist.h>
+#include <linux/rbtree.h>
 #include <linux/fs.h>
 #include <linux/debug_locks.h>
 
@@ -111,7 +111,7 @@ static void printk_lock(struct rt_mutex *lock, int print_owner)
 
 void rt_mutex_debug_task_free(struct task_struct *task)
 {
-	WARN_ON(!plist_head_empty(&task->pi_waiters));
+	WARN_ON(!RB_EMPTY_ROOT(&task->pi_waiters));
 	WARN_ON(task->pi_blocked_on);
 }
 
@@ -205,16 +205,12 @@ void debug_rt_mutex_proxy_unlock(struct rt_mutex *lock)
 void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
 {
 	memset(waiter, 0x11, sizeof(*waiter));
-	plist_node_init(&waiter->list_entry, MAX_PRIO);
-	plist_node_init(&waiter->pi_list_entry, MAX_PRIO);
 	waiter->deadlock_task_pid = NULL;
 }
 
 void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter)
 {
 	put_pid(waiter->deadlock_task_pid);
-	TRACE_WARN_ON(!plist_node_empty(&waiter->list_entry));
-	TRACE_WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
 	TRACE_WARN_ON(waiter->task);
 	memset(waiter, 0x22, sizeof(*waiter));
 }
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index a960481..2e9c0dc 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -97,6 +97,90 @@ static inline void mark_rt_mutex_waiters(struct rt_mutex *lock)
 }
 #endif
 
+static inline int
+rt_mutex_waiter_less(struct rt_mutex_waiter *left,
+		     struct rt_mutex_waiter *right)
+{
+	if (left->task->prio < right->task->prio)
+		return 1;
+
+	/*
+	 * If both tasks are dl_task(), we check their deadlines.
+	 */
+	if (dl_prio(left->task->prio) && dl_prio(right->task->prio))
+		return left->task->dl.deadline < right->task->dl.deadline;
+}
+
+static void
+rt_mutex_enqueue(struct rt_mutex *lock, struct rt_mutex_waiter *waiter)
+{
+	struct rb_node **link = &lock->waiters.rb_node;
+	struct rb_node *parent = NULL;
+	struct rt_mutex_waiter *entry;
+	int leftmost = 1;
+
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct rt_mutex_waiter, tree_entry);
+		if (rt_mutex_waiter_less(waiter, entry)) {
+			link = &parent->rb_left;
+		} else {
+			link = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	if (leftmost)
+		lock->waiters_leftmost = &waiter->tree_entry;
+
+	rb_link_node(&waiter->tree_entry, parent, link);
+	rb_insert_color(&waiter->tree_entry, &lock->waiters);
+}
+
+static void
+rt_mutex_dequeue(struct rt_mutex *lock, struct rt_mutex_waiter *waiter)
+{
+	if (lock->waiters_leftmost == &waiter->tree_entry)
+		lock->waiters_leftmost = rb_next(&waiter->tree_entry);
+
+	rb_erase(&waiter->tree_entry, &lock->waiters);
+}
+
+static void
+rt_mutex_enqueue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
+{
+	struct rb_node **link = &task->pi_waiters.rb_node;
+	struct rb_node *parent = NULL;
+	struct rt_mutex_waiter *entry;
+	int leftmost = 1;
+
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct rt_mutex_waiter, pi_tree_entry);
+		if (rt_mutex_waiter_less(waiter, entry)) {
+			link = &parent->rb_left;
+		} else {
+			link = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	if (leftmost)
+		task->pi_waiters_leftmost = &waiter->pi_tree_entry;
+
+	rb_link_node(&waiter->pi_tree_entry, parent, link);
+	rb_insert_color(&waiter->pi_tree_entry, &task->pi_waiters);
+}
+
+static void
+rt_mutex_dequeue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
+{
+	if (task->pi_waiters_leftmost == &waiter->pi_tree_entry)
+		task->pi_waiters_leftmost = rb_next(&waiter->pi_tree_entry);
+
+	rb_erase(&waiter->pi_tree_entry, &task->pi_waiters);
+}
+
 /*
  * Calculate task priority from the waiter list priority
  *
@@ -108,7 +192,7 @@ int rt_mutex_getprio(struct task_struct *task)
 	if (likely(!task_has_pi_waiters(task)))
 		return task->normal_prio;
 
-	return min(task_top_pi_waiter(task)->pi_list_entry.prio,
+	return min(task_top_pi_waiter(task)->task->prio,
 		   task->normal_prio);
 }
 
@@ -227,7 +311,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 	 * When deadlock detection is off then we check, if further
 	 * priority adjustment is necessary.
 	 */
-	if (!detect_deadlock && waiter->list_entry.prio == task->prio)
+	if (!detect_deadlock && waiter->task->prio == task->prio)
 		goto out_unlock_pi;
 
 	lock = waiter->lock;
@@ -248,9 +332,8 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 	top_waiter = rt_mutex_top_waiter(lock);
 
 	/* Requeue the waiter */
-	plist_del(&waiter->list_entry, &lock->wait_list);
-	waiter->list_entry.prio = task->prio;
-	plist_add(&waiter->list_entry, &lock->wait_list);
+	rt_mutex_dequeue(lock, waiter);
+	rt_mutex_enqueue(lock, waiter);
 
 	/* Release the task */
 	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
@@ -263,17 +346,15 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 
 	if (waiter == rt_mutex_top_waiter(lock)) {
 		/* Boost the owner */
-		plist_del(&top_waiter->pi_list_entry, &task->pi_waiters);
-		waiter->pi_list_entry.prio = waiter->list_entry.prio;
-		plist_add(&waiter->pi_list_entry, &task->pi_waiters);
+		rt_mutex_dequeue_pi(task, top_waiter);
+		rt_mutex_enqueue_pi(task, waiter);
 		__rt_mutex_adjust_prio(task);
 
 	} else if (top_waiter == waiter) {
 		/* Deboost the owner */
-		plist_del(&waiter->pi_list_entry, &task->pi_waiters);
+		rt_mutex_dequeue_pi(task, waiter);
 		waiter = rt_mutex_top_waiter(lock);
-		waiter->pi_list_entry.prio = waiter->list_entry.prio;
-		plist_add(&waiter->pi_list_entry, &task->pi_waiters);
+		rt_mutex_enqueue_pi(task, waiter);
 		__rt_mutex_adjust_prio(task);
 	}
 
@@ -331,7 +412,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock,
 
 	/* No chain handling, pending owner is not blocked on anything: */
 	next = rt_mutex_top_waiter(lock);
-	plist_del(&next->pi_list_entry, &pendowner->pi_waiters);
+	rt_mutex_dequeue_pi(pendowner, next);
 	__rt_mutex_adjust_prio(pendowner);
 	raw_spin_unlock_irqrestore(&pendowner->pi_lock, flags);
 
@@ -351,7 +432,7 @@ static inline int try_to_steal_lock(struct rt_mutex *lock,
 	 */
 	if (likely(next->task != task)) {
 		raw_spin_lock_irqsave(&task->pi_lock, flags);
-		plist_add(&next->pi_list_entry, &task->pi_waiters);
+		rt_mutex_enqueue_pi(task, next);
 		__rt_mutex_adjust_prio(task);
 		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 	}
@@ -424,13 +505,11 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 	__rt_mutex_adjust_prio(task);
 	waiter->task = task;
 	waiter->lock = lock;
-	plist_node_init(&waiter->list_entry, task->prio);
-	plist_node_init(&waiter->pi_list_entry, task->prio);
 
 	/* Get the top priority waiter on the lock */
 	if (rt_mutex_has_waiters(lock))
 		top_waiter = rt_mutex_top_waiter(lock);
-	plist_add(&waiter->list_entry, &lock->wait_list);
+	rt_mutex_enqueue(lock, waiter);
 
 	task->pi_blocked_on = waiter;
 
@@ -438,9 +517,8 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 
 	if (waiter == rt_mutex_top_waiter(lock)) {
 		raw_spin_lock_irqsave(&owner->pi_lock, flags);
-		plist_del(&top_waiter->pi_list_entry, &owner->pi_waiters);
-		plist_add(&waiter->pi_list_entry, &owner->pi_waiters);
-
+		rt_mutex_dequeue_pi(owner, top_waiter);
+		rt_mutex_enqueue_pi(owner, waiter);
 		__rt_mutex_adjust_prio(owner);
 		if (owner->pi_blocked_on)
 			chain_walk = 1;
@@ -486,7 +564,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock)
 	raw_spin_lock_irqsave(&current->pi_lock, flags);
 
 	waiter = rt_mutex_top_waiter(lock);
-	plist_del(&waiter->list_entry, &lock->wait_list);
+	rt_mutex_dequeue(lock, waiter);
 
 	/*
 	 * Remove it from current->pi_waiters. We do not adjust a
@@ -494,7 +572,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock)
 	 * boosted mode and go back to normal after releasing
 	 * lock->wait_lock.
 	 */
-	plist_del(&waiter->pi_list_entry, &current->pi_waiters);
+	rt_mutex_dequeue_pi(current, waiter);
 	pendowner = waiter->task;
 	waiter->task = NULL;
 
@@ -521,7 +599,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock)
 		struct rt_mutex_waiter *next;
 
 		next = rt_mutex_top_waiter(lock);
-		plist_add(&next->pi_list_entry, &pendowner->pi_waiters);
+		rt_mutex_enqueue_pi(pendowner, next);
 	}
 	raw_spin_unlock_irqrestore(&pendowner->pi_lock, flags);
 
@@ -542,7 +620,7 @@ static void remove_waiter(struct rt_mutex *lock,
 	int chain_walk = 0;
 
 	raw_spin_lock_irqsave(&current->pi_lock, flags);
-	plist_del(&waiter->list_entry, &lock->wait_list);
+	rt_mutex_dequeue(lock, waiter);
 	waiter->task = NULL;
 	current->pi_blocked_on = NULL;
 	raw_spin_unlock_irqrestore(&current->pi_lock, flags);
@@ -551,13 +629,13 @@ static void remove_waiter(struct rt_mutex *lock,
 
 		raw_spin_lock_irqsave(&owner->pi_lock, flags);
 
-		plist_del(&waiter->pi_list_entry, &owner->pi_waiters);
+		rt_mutex_dequeue_pi(owner, waiter);
 
 		if (rt_mutex_has_waiters(lock)) {
 			struct rt_mutex_waiter *next;
 
 			next = rt_mutex_top_waiter(lock);
-			plist_add(&next->pi_list_entry, &owner->pi_waiters);
+			rt_mutex_enqueue_pi(owner, next);
 		}
 		__rt_mutex_adjust_prio(owner);
 
@@ -567,8 +645,6 @@ static void remove_waiter(struct rt_mutex *lock,
 		raw_spin_unlock_irqrestore(&owner->pi_lock, flags);
 	}
 
-	WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
-
 	if (!chain_walk)
 		return;
 
@@ -595,7 +671,7 @@ void rt_mutex_adjust_pi(struct task_struct *task)
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
 
 	waiter = task->pi_blocked_on;
-	if (!waiter || waiter->list_entry.prio == task->prio) {
+	if (!waiter || waiter->task->prio == task->prio) {
 		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 		return;
 	}
@@ -971,7 +1047,8 @@ void __rt_mutex_init(struct rt_mutex *lock, const char *name)
 {
 	lock->owner = NULL;
 	raw_spin_lock_init(&lock->wait_lock);
-	plist_head_init_raw(&lock->wait_list, &lock->wait_lock);
+	lock->waiters = RB_ROOT;
+	lock->waiters_leftmost = NULL;
 
 	debug_rt_mutex_init(lock, name);
 }
diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
index 97a2f81..84b9eea 100644
--- a/kernel/rtmutex_common.h
+++ b/kernel/rtmutex_common.h
@@ -40,13 +40,13 @@ extern void schedule_rt_mutex_test(struct rt_mutex *lock);
  * This is the control structure for tasks blocked on a rt_mutex,
  * which is allocated on the kernel stack on of the blocked task.
  *
- * @list_entry:		pi node to enqueue into the mutex waiters list
- * @pi_list_entry:	pi node to enqueue into the mutex owner waiters list
+ * @tree_entry:		pi node to enqueue into the mutex waiters tree
+ * @pi_tree_entry:	pi node to enqueue into the mutex owner waiters tree
  * @task:		task reference to the blocked task
  */
 struct rt_mutex_waiter {
-	struct plist_node	list_entry;
-	struct plist_node	pi_list_entry;
+	struct rb_node          tree_entry;
+	struct rb_node          pi_tree_entry;
 	struct task_struct	*task;
 	struct rt_mutex		*lock;
 #ifdef CONFIG_DEBUG_RT_MUTEXES
@@ -57,11 +57,11 @@ struct rt_mutex_waiter {
 };
 
 /*
- * Various helpers to access the waiters-plist:
+ * Various helpers to access the waiters-tree:
  */
 static inline int rt_mutex_has_waiters(struct rt_mutex *lock)
 {
-	return !plist_head_empty(&lock->wait_list);
+	return !RB_EMPTY_ROOT(&lock->waiters);
 }
 
 static inline struct rt_mutex_waiter *
@@ -69,8 +69,8 @@ rt_mutex_top_waiter(struct rt_mutex *lock)
 {
 	struct rt_mutex_waiter *w;
 
-	w = plist_first_entry(&lock->wait_list, struct rt_mutex_waiter,
-			       list_entry);
+	w = rb_entry(lock->waiters_leftmost, struct rt_mutex_waiter,
+		     tree_entry);
 	BUG_ON(w->lock != lock);
 
 	return w;
@@ -78,14 +78,14 @@ rt_mutex_top_waiter(struct rt_mutex *lock)
 
 static inline int task_has_pi_waiters(struct task_struct *p)
 {
-	return !plist_head_empty(&p->pi_waiters);
+	return !RB_EMPTY_ROOT(&p->pi_waiters);
 }
 
 static inline struct rt_mutex_waiter *
 task_top_pi_waiter(struct task_struct *p)
 {
-	return plist_first_entry(&p->pi_waiters, struct rt_mutex_waiter,
-				  pi_list_entry);
+	return rb_entry(p->pi_waiters_leftmost, struct rt_mutex_waiter,
+			pi_tree_entry);
 }
 
 /*
diff --git a/kernel/sched.c b/kernel/sched.c
index 4d291e3..853473a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8504,10 +8504,6 @@ void __init sched_init(void)
 	open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);
 #endif
 
-#ifdef CONFIG_RT_MUTEXES
-	plist_head_init_raw(&init_task.pi_waiters, &init_task.pi_lock);
-#endif
-
 	/*
 	 * The boot idle thread does lazy MMU switching as well:
 	 */
-- 
1.7.2.3


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [RFC][PATCH 18/22] sched: add reclaiming logic to -deadline tasks
  2010-10-29  6:18 [RFC][PATCH 00/22] sched: SCHED_DEADLINE v3 Raistlin
                   ` (17 preceding siblings ...)
  2010-10-29  6:42 ` [RFC][PATCH 19/22] rtmutex: turn the plist into an rb-tree Raistlin
@ 2010-10-29  6:42 ` Raistlin
  2010-11-11 22:12   ` Peter Zijlstra
  2010-10-29  6:43 ` [RFC][PATCH 20/22] sched: drafted deadline inheritance logic Raistlin
                   ` (2 subsequent siblings)
  21 siblings, 1 reply; 135+ messages in thread
From: Raistlin @ 2010-10-29  6:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 13505 bytes --]

The bandwidth enforcing mechanism implemented inside the
SCHED_DEADLINE policy ensures that overrunning tasks are slowed
down without interfering with well behaving ones.
This, however, comes at the price of limiting the capability of
a task to exploit more bandwidth than it is asigned.

The current implementation always stops a task that is trying
to use more than its runtime (every deadline). Something else that
could be done is to let it continue running, but with a "decreased
priority". This way, we can exploit full CPU bandwidth and still
avoid interferences.

In order of "decreasing the priority" of a deadline task, we can:
 - let it stay SCHED_DEADLINE and postpone its deadline. This way it
   will always be scheduled before -rt and -other tasks but it
   won't affect other -deadline tasks;
 - put it in SCHED_FIFO with some priority. This way it will always
   be scheduled before -other tasks but it won't affect -deadline
   tasks, nor other -rt tasks with higher priority;
 - put it in SCHED_OTHER.

Notice also that this can be done on a per-task basis, e.g., each
task can specify what kind of reclaiming mechanism it wants to use
by means of the sched_flags field of sched_param_ex.

Therefore, this patch:
 - adds the flags for specyfing DEADLINE, RT or OTHER reclaiming
   behaviour;
 - adds the logic that changes the scheduling class of a task when
   it overruns, according to the requested policy.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 include/linux/sched.h |   25 ++++++++++++++
 kernel/hrtimer.c      |    2 +-
 kernel/sched.c        |   86 ++++++++++++++++++++++++++++++++-----------------
 kernel/sched_debug.c  |    2 +-
 kernel/sched_dl.c     |   44 +++++++++++++++++++++++--
 5 files changed, 123 insertions(+), 36 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b729f83..8806c1f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -172,10 +172,26 @@ struct sched_param_ex {
  *                      a runtime overrun occurs;
  *  @SF_SIG_DMISS       tells us the task wants to be notified whenever
  *                      a scheduling deadline is missed.
+ *  @SF_BWRECL_DL       tells us that the task doesn't stop when exhausting
+ *                      its runtime, and it remains a -deadline task, even
+ *                      though its deadline is postponed. This means it
+ *                      won't affect the scheduling of the other -deadline
+ *                      tasks, but if it is a CPU-hog, lower scheduling
+ *                      classes will starve!
+ *  @SF_BWRECL_RT       tells us that the task doesn't stop when exhausting
+ *                      its runtime, and it becomes a -rt task, with the
+ *                      priority specified in the sched_priority field of
+ *                      struct shced_param_ex.
+ *  @SF_BWRECL_OTH      tells us that the task doesn't stop when exhausting
+ *                      its runtime, and it becomes a normal task, with
+ *                      default priority.
  */
 #define SF_HEAD		1
 #define SF_SIG_RORUN	2
 #define SF_SIG_DMISS	4
+#define SF_BWRECL_DL	8
+#define SF_BWRECL_RT	16
+#define SF_BWRECL_NR	32
 
 struct exec_domain;
 struct futex_pi_state;
@@ -1694,6 +1710,15 @@ static inline int dl_task(struct task_struct *p)
 	return dl_prio(p->prio);
 }
 
+/*
+ * We might have temporarily dropped -deadline policy,
+ * but still be a -deadline task!
+ */
+static inline int __dl_task(struct task_struct *p)
+{
+	return dl_task(p) || p->policy == SCHED_DEADLINE;
+}
+
 static inline int rt_prio(int prio)
 {
 	if (unlikely(prio >= MAX_DL_PRIO && prio < MAX_RT_PRIO))
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 9cd8564..54277be 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1574,7 +1574,7 @@ long hrtimer_nanosleep(struct timespec *rqtp, struct timespec __user *rmtp,
 	unsigned long slack;
 
 	slack = current->timer_slack_ns;
-	if (dl_task(current) || rt_task(current))
+	if (__dl_task(current) || rt_task(current))
 		slack = 0;
 
 	hrtimer_init_on_stack(&t.timer, clockid, mode);
diff --git a/kernel/sched.c b/kernel/sched.c
index 79cac6e..4d291e3 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2235,7 +2235,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 #endif
 
 	trace_sched_migrate_task(p, new_cpu);
-	if (unlikely(dl_task(p)))
+	if (unlikely(__dl_task(p)))
 		trace_sched_migrate_task_dl(p, task_rq(p)->clock,
 					    new_cpu, cpu_rq(new_cpu)->clock);
 
@@ -2983,6 +2983,16 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 			prev->sched_class->task_dead(prev);
 
 		/*
+		 * If we are a -deadline task, dieing while
+		 * hanging out in a different scheduling class
+		 * we need to manually call our own cleanup function,
+		 * at least to stop the bandwidth timer.
+		 */
+		if (unlikely(task_has_dl_policy(prev) &&
+		    prev->sched_class != &dl_sched_class))
+			dl_sched_class.task_dead(prev);
+
+		/*
 		 * Remove function-return probe instances associated with this
 		 * task and put them back on the free list.
 		 */
@@ -3064,7 +3074,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 
 	prepare_task_switch(rq, prev, next);
 	trace_sched_switch(prev, next);
-	if (unlikely(dl_task(prev) || dl_task(next)))
+	if (unlikely(__dl_task(prev) || __dl_task(next)))
 		trace_sched_switch_dl(rq->clock, prev, next);
 	mm = next->mm;
 	oldmm = prev->active_mm;
@@ -4554,34 +4564,13 @@ long __sched sleep_on_timeout(wait_queue_head_t *q, long timeout)
 }
 EXPORT_SYMBOL(sleep_on_timeout);
 
-#ifdef CONFIG_RT_MUTEXES
-
-/*
- * rt_mutex_setprio - set the current priority of a task
- * @p: task
- * @prio: prio value (kernel-internal form)
- *
- * This function changes the 'effective' priority of a task. It does
- * not touch ->normal_prio like __setscheduler().
- *
- * Used by the rt_mutex code to implement priority inheritance logic.
- */
-void rt_mutex_setprio(struct task_struct *p, int prio)
+static void __setprio(struct rq *rq, struct task_struct *p, int prio)
 {
-	unsigned long flags;
-	int oldprio, on_rq, running;
-	struct rq *rq;
-	const struct sched_class *prev_class;
-
-	BUG_ON(prio < 0 || prio > MAX_PRIO);
+	int oldprio = p->prio;
+	const struct sched_class *prev_class = p->sched_class;
+	int running = task_current(rq, p);
+	int on_rq = p->se.on_rq;
 
-	rq = task_rq_lock(p, &flags);
-
-	trace_sched_pi_setprio(p, prio);
-	oldprio = p->prio;
-	prev_class = p->sched_class;
-	on_rq = p->se.on_rq;
-	running = task_current(rq, p);
 	if (on_rq)
 		dequeue_task(rq, p, 0);
 	if (running)
@@ -4603,6 +4592,30 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 
 		check_class_changed(rq, p, prev_class, oldprio, running);
 	}
+}
+
+#ifdef CONFIG_RT_MUTEXES
+
+/*
+ * rt_mutex_setprio - set the current priority of a task
+ * @p: task
+ * @prio: prio value (kernel-internal form)
+ *
+ * This function changes the 'effective' priority of a task. It does
+ * not touch ->normal_prio like __setscheduler().
+ *
+ * Used by the rt_mutex code to implement priority inheritance logic.
+ */
+void rt_mutex_setprio(struct task_struct *p, int prio)
+{
+	unsigned long flags;
+	struct rq *rq;
+
+	BUG_ON(prio < 0 || prio > MAX_PRIO);
+
+	rq = task_rq_lock(p, &flags);
+	trace_sched_pi_setprio(p, prio);
+	__setprio(rq, p, prio);
 	task_rq_unlock(rq, &flags);
 }
 
@@ -4909,19 +4922,32 @@ recheck:
 	 */
 	if (user && !capable(CAP_SYS_NICE)) {
 		if (dl_policy(policy)) {
-			u64 rlim_dline, rlim_rtime;
+			u64 rlim_dline, rlim_rtime, rlim_rtprio;
 			u64 dline, rtime;
 
 			if (!lock_task_sighand(p, &flags))
 				return -ESRCH;
 			rlim_dline = p->signal->rlim[RLIMIT_DLDLINE].rlim_cur;
 			rlim_rtime = p->signal->rlim[RLIMIT_DLRTIME].rlim_cur;
+			rlim_rtprio = p->signal->rlim[RLIMIT_RTPRIO].rlim_cur;
 			unlock_task_sighand(p, &flags);
 
 			/* can't set/change -deadline policy */
 			if (policy != p->policy && !rlim_rtime)
 				return -EPERM;
 
+			/* can't set/change reclaiming policy to -deadline */
+			if ((param_ex->sched_flags & SF_BWRECL_DL) !=
+			    (p->dl.flags & SF_BWRECL_DL))
+				return -EPERM;
+
+			/* can't set/increase -rt reclaiming priority */
+			if (param_ex->sched_flags & SF_BWRECL_RT &&
+			    (param_ex->sched_priority <= 0 ||
+			     (param_ex->sched_priority > p->rt_priority &&
+			      param_ex->sched_priority > rlim_rtprio)))
+				return -EPERM;
+
 			/* can't decrease the deadline */
 			rlim_dline *= NSEC_PER_USEC;
 			dline = timespec_to_ns(&param_ex->sched_deadline);
@@ -8596,7 +8622,7 @@ void normalize_rt_tasks(void)
 		p->se.statistics.block_start	= 0;
 #endif
 
-		if (!dl_task(p) && !rt_task(p)) {
+		if (!__dl_task(p) && !rt_task(p)) {
 			/*
 			 * Renice negative nice level userspace
 			 * tasks back to 0:
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 4949a21..2bf4e72 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -467,7 +467,7 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 	P(se.statistics.nr_wakeups_affine_attempts);
 	P(se.statistics.nr_wakeups_passive);
 	P(se.statistics.nr_wakeups_idle);
-	if (dl_task(p)) {
+	if (__dl_task(p)) {
 		P(dl.stats.dmiss);
 		PN(dl.stats.last_dmiss);
 		PN(dl.stats.dmiss_max);
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index eff183a..4d24109 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -15,6 +15,8 @@
  *                    Fabio Checconi <fabio@gandalf.sssup.it>
  */
 
+static const struct sched_class dl_sched_class;
+
 static inline int dl_time_before(u64 a, u64 b)
 {
 	return (s64)(a - b) < 0;
@@ -382,6 +384,17 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
 	s64 delta;
 
 	/*
+	 * If the task wants to stay -deadline even if it exhausted
+	 * its runtime we allow that by not starting the timer.
+	 * update_curr_dl() will thus queue it back after replenishment
+	 * and deadline postponing.
+	 * This won't affect the other -deadline tasks, but if we are
+	 * a CPU-hog, lower scheduling classes will starve!
+	 */
+	if (dl_se->flags & SF_BWRECL_DL)
+		return 0;
+
+	/*
 	 * We want the timer to fire at the deadline, but considering
 	 * that it is actually coming from rq->clock and not from
 	 * hrtimer's time base reading.
@@ -414,6 +427,8 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
 	return hrtimer_active(&dl_se->dl_timer);
 }
 
+static void __setprio(struct rq *rq, struct task_struct *p, int prio);
+
 /*
  * This is the bandwidth enforcement timer callback. If here, we know
  * a task is not on its dl_rq, since the fact that the timer was running
@@ -440,12 +455,18 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 	 * We need to take care of a possible races here. In fact, the
 	 * task might have changed its scheduling policy to something
 	 * different from SCHED_DEADLINE (through sched_setscheduler()).
+	 * However, if we changed scheduling class for reclaiming, it
+	 * is correct to handle this replenishment, since this is what
+	 * will put us back into the -deadline scheduling class.
 	 */
-	if (!dl_task(p))
+	if (!__dl_task(p))
 		goto unlock;
 
 	trace_sched_timer_dl(p, rq->clock, p->se.on_rq, task_current(rq, p));
 
+	if (unlikely(p->sched_class != &dl_sched_class))
+		__setprio(rq, p, MAX_DL_PRIO-1);
+
 	dl_se->dl_throttled = 0;
 	if (p->se.on_rq) {
 		enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
@@ -530,6 +551,16 @@ int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
 	return 1;
 }
 
+static inline void throttle_curr_dl(struct rq *rq, struct task_struct *curr)
+{
+	curr->dl.dl_throttled = 1;
+
+	if (curr->dl.flags & SF_BWRECL_RT)
+		__setprio(rq, curr, MAX_RT_PRIO-1 - curr->rt_priority);
+	else if (curr->dl.flags & SF_BWRECL_NR)
+		__setprio(rq, curr, DEFAULT_PRIO);
+}
+
 /*
  * Update the current task's runtime statistics (provided it is still
  * a -deadline task and has not been removed from the dl_rq).
@@ -565,7 +596,7 @@ static void update_curr_dl(struct rq *rq)
 	if (dl_runtime_exceeded(rq, dl_se)) {
 		__dequeue_task_dl(rq, curr, 0);
 		if (likely(start_dl_timer(dl_se)))
-			dl_se->dl_throttled = 1;
+			throttle_curr_dl(rq, curr);
 		else
 			enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
 
@@ -765,8 +796,10 @@ static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 
 static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
-	update_curr_dl(rq);
-	__dequeue_task_dl(rq, p, flags);
+	if (likely(!p->dl.dl_throttled)) {
+		update_curr_dl(rq);
+		__dequeue_task_dl(rq, p, flags);
+	}
 }
 
 /*
@@ -1000,6 +1033,9 @@ struct task_struct *pick_next_task_dl(struct rq *rq)
 
 static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
 {
+	if (unlikely(p->dl.dl_throttled))
+		return;
+
 	update_curr_dl(rq);
 	p->se.exec_start = 0;
 
-- 
1.7.2.3



-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [RFC][PATCH 20/22] sched: drafted deadline inheritance logic
  2010-10-29  6:18 [RFC][PATCH 00/22] sched: SCHED_DEADLINE v3 Raistlin
                   ` (18 preceding siblings ...)
  2010-10-29  6:42 ` [RFC][PATCH 18/22] sched: add reclaiming logic to -deadline tasks Raistlin
@ 2010-10-29  6:43 ` Raistlin
  2010-11-11 22:15   ` Peter Zijlstra
  2010-10-29  6:44 ` [RFC][PATCH 21/22] sched: add bandwidth management for sched_dl Raistlin
  2010-10-29  6:45 ` [RFC][PATCH 22/22] sched: add sched_dl documentation Raistlin
  21 siblings, 1 reply; 135+ messages in thread
From: Raistlin @ 2010-10-29  6:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 11654 bytes --]


Some method to deal with rt-mutexes and make sched_dl interact with
the current PI-coded is needed, raising all but trivial issues, that
needs (according to us) to be solved with some restructuring of
the pi-code (i.e., going toward a proxy execution-ish implementation).

This is under development, in the meanwhile, as a temporary solution,
what this commits does is:
 - ensure a pi-lock owner with waiters is never throttled down. Instead,
   when it runs out of runtime, it immediately gets replenished and it's
   deadline is postponed (as in SF_BWRECL_DL reclaiming policy);
 - the scheduling parameters (relative deadline and default runtime)
   used for that replenishments --during the whole period it holds the
   pi-lock-- are the ones of the waiting task with earliest deadline.

Acting this way, we provide some kind of boosting to the lock-owner,
still by using the existing (actually, slightly modified by the previous
commit) pi-architecture.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 include/linux/sched.h |    3 ++
 kernel/fork.c         |    1 +
 kernel/rtmutex.c      |   13 ++++++++-
 kernel/sched.c        |    3 +-
 kernel/sched_dl.c     |   65 +++++++++++++++++++++++++++++++------------------
 5 files changed, 58 insertions(+), 27 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c3d1f17b..7cf78e2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1535,6 +1535,8 @@ struct task_struct {
 	struct rb_node *pi_waiters_leftmost;
 	/* Deadlock detection and priority inheritance handling */
 	struct rt_mutex_waiter *pi_blocked_on;
+	/* */
+	struct task_struct *pi_top_task;
 #endif
 
 #ifdef CONFIG_DEBUG_MUTEXES
@@ -2118,6 +2120,7 @@ extern unsigned int sysctl_sched_compat_yield;
 
 #ifdef CONFIG_RT_MUTEXES
 extern int rt_mutex_getprio(struct task_struct *p);
+extern struct task_struct *rt_mutex_get_top_task(struct task_struct *task);
 extern void rt_mutex_setprio(struct task_struct *p, int prio);
 extern void rt_mutex_adjust_pi(struct task_struct *p);
 #else
diff --git a/kernel/fork.c b/kernel/fork.c
index aceb248..c8f2555 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -938,6 +938,7 @@ static void rt_mutex_init_task(struct task_struct *p)
 	p->pi_waiters = RB_ROOT;
 	p->pi_waiters_leftmost = NULL;
 	p->pi_blocked_on = NULL;
+	p->pi_top_task = NULL;
 #endif
 }
 
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 2e9c0dc..84ea165 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -196,6 +196,14 @@ int rt_mutex_getprio(struct task_struct *task)
 		   task->normal_prio);
 }
 
+struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
+{
+	if (likely(!task_has_pi_waiters(task)))
+		return NULL;
+
+	return task_top_pi_waiter(task)->task;
+}
+
 /*
  * Adjust the priority of a task, after its pi_waiters got modified.
  *
@@ -205,7 +213,7 @@ static void __rt_mutex_adjust_prio(struct task_struct *task)
 {
 	int prio = rt_mutex_getprio(task);
 
-	if (task->prio != prio)
+	if (task->prio != prio || dl_prio(prio))
 		rt_mutex_setprio(task, prio);
 }
 
@@ -671,7 +679,8 @@ void rt_mutex_adjust_pi(struct task_struct *task)
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
 
 	waiter = task->pi_blocked_on;
-	if (!waiter || waiter->task->prio == task->prio) {
+	if (!waiter || (waiter->task->prio == task->prio &&
+	    !dl_prio(task->prio))) {
 		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 		return;
 	}
diff --git a/kernel/sched.c b/kernel/sched.c
index 853473a..97db370 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4611,10 +4611,11 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 	unsigned long flags;
 	struct rq *rq;
 
-	BUG_ON(prio < 0 || prio > MAX_PRIO);
+	BUG_ON(prio > MAX_PRIO);
 
 	rq = task_rq_lock(p, &flags);
 	trace_sched_pi_setprio(p, prio);
+	p->pi_top_task = rt_mutex_get_top_task(p);
 	__setprio(rq, p, prio);
 	task_rq_unlock(rq, &flags);
 }
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index 4d24109..991a4f2 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -223,15 +223,16 @@ static int push_dl_task(struct rq *rq);
  * one, and to (try to!) reconcile itself with its own scheduling
  * parameters.
  */
-static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
+static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se,
+				       struct sched_dl_entity *pi_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
 
 	WARN_ON(!dl_se->dl_new || dl_se->dl_throttled);
 
-	dl_se->deadline = rq->clock + dl_se->dl_deadline;
-	dl_se->runtime = dl_se->dl_runtime;
+	dl_se->deadline = rq->clock + pi_se->dl_deadline;
+	dl_se->runtime = pi_se->dl_runtime;
 	dl_se->dl_new = 0;
 #ifdef CONFIG_SCHEDSTATS
 	trace_sched_stat_new_dl(dl_task_of(dl_se), rq->clock, dl_se->flags);
@@ -256,7 +257,8 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
  * could happen are, typically, a entity voluntarily trying to overcume its
  * runtime, or it just underestimated it during sched_setscheduler_ex().
  */
-static void replenish_dl_entity(struct sched_dl_entity *dl_se)
+static void replenish_dl_entity(struct sched_dl_entity *dl_se,
+				struct sched_dl_entity *pi_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
@@ -269,8 +271,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 	 * arbitrary large.
 	 */
 	while (dl_se->runtime <= 0) {
-		dl_se->deadline += dl_se->dl_period;
-		dl_se->runtime += dl_se->dl_runtime;
+		dl_se->deadline += pi_se->dl_period;
+		dl_se->runtime += pi_se->dl_runtime;
 	}
 
 	/*
@@ -284,8 +286,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 	 */
 	if (dl_time_before(dl_se->deadline, rq->clock)) {
 		WARN_ON_ONCE(1);
-		dl_se->deadline = rq->clock + dl_se->dl_deadline;
-		dl_se->runtime = dl_se->dl_runtime;
+		dl_se->deadline = rq->clock + pi_se->dl_deadline;
+		dl_se->runtime = pi_se->dl_runtime;
 		reset = 1;
 	}
 #ifdef CONFIG_SCHEDSTATS
@@ -306,7 +308,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
  * task with deadline equal to period this is the same of using
  * dl_deadline instead of dl_period in the equation above.
  */
-static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
+static bool dl_entity_overflow(struct sched_dl_entity *dl_se,
+			       struct sched_dl_entity *pi_se, u64 t)
 {
 	u64 left, right;
 
@@ -323,8 +326,8 @@ static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
 	 * to the (absolute) deadline. Therefore, overflowing the u64
 	 * type is very unlikely to occur in both cases.
 	 */
-	left = dl_se->dl_deadline * dl_se->runtime;
-	right = (dl_se->deadline - t) * dl_se->dl_runtime;
+	left = pi_se->dl_deadline * dl_se->runtime;
+	right = (dl_se->deadline - t) * pi_se->dl_runtime;
 
 	return dl_time_before(right, left);
 }
@@ -338,7 +341,8 @@ static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
  *  - using the remaining runtime with the current deadline would make
  *    the entity exceed its bandwidth.
  */
-static void update_dl_entity(struct sched_dl_entity *dl_se)
+static void update_dl_entity(struct sched_dl_entity *dl_se,
+			     struct sched_dl_entity *pi_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
@@ -349,14 +353,14 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
 	 * the actual scheduling parameters have to be "renewed".
 	 */
 	if (dl_se->dl_new) {
-		setup_new_dl_entity(dl_se);
+		setup_new_dl_entity(dl_se, pi_se);
 		return;
 	}
 
 	if (dl_time_before(dl_se->deadline, rq->clock) ||
-	    dl_entity_overflow(dl_se, rq->clock)) {
-		dl_se->deadline = rq->clock + dl_se->dl_deadline;
-		dl_se->runtime = dl_se->dl_runtime;
+	    dl_entity_overflow(dl_se, pi_se, rq->clock)) {
+		dl_se->deadline = rq->clock + pi_se->dl_deadline;
+		dl_se->runtime = pi_se->dl_runtime;
 		overflow = 1;
 	}
 #ifdef CONFIG_SCHEDSTATS
@@ -374,7 +378,7 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
  * actually started or not (i.e., the replenishment instant is in
  * the future or in the past).
  */
-static int start_dl_timer(struct sched_dl_entity *dl_se)
+static int start_dl_timer(struct sched_dl_entity *dl_se, bool boosted)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
@@ -391,7 +395,7 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
 	 * This won't affect the other -deadline tasks, but if we are
 	 * a CPU-hog, lower scheduling classes will starve!
 	 */
-	if (dl_se->flags & SF_BWRECL_DL)
+	if (boosted || dl_se->flags & SF_BWRECL_DL)
 		return 0;
 
 	/*
@@ -595,7 +599,7 @@ static void update_curr_dl(struct rq *rq)
 	dl_se->runtime -= delta_exec;
 	if (dl_runtime_exceeded(rq, dl_se)) {
 		__dequeue_task_dl(rq, curr, 0);
-		if (likely(start_dl_timer(dl_se)))
+		if (likely(start_dl_timer(dl_se, !!curr->pi_top_task)))
 			throttle_curr_dl(rq, curr);
 		else
 			enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
@@ -749,7 +753,8 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
 }
 
 static void
-enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
+enqueue_dl_entity(struct sched_dl_entity *dl_se,
+		  struct sched_dl_entity *pi_se, int flags)
 {
 	BUG_ON(on_dl_rq(dl_se));
 
@@ -759,9 +764,9 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 	 * we want a replenishment of its runtime.
 	 */
 	if (!dl_se->dl_new && flags & ENQUEUE_REPLENISH)
-		replenish_dl_entity(dl_se);
+		replenish_dl_entity(dl_se, pi_se);
 	else
-		update_dl_entity(dl_se);
+		update_dl_entity(dl_se, pi_se);
 
 	__enqueue_dl_entity(dl_se);
 }
@@ -773,6 +778,18 @@ static void dequeue_dl_entity(struct sched_dl_entity *dl_se)
 
 static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
+	struct task_struct *pi_task = p->pi_top_task;
+	struct sched_dl_entity *pi_se = &p->dl;
+
+	/*
+	 * Use the scheduling parameters of the top pi-waiter
+	 * task if we have one and its (relative) deadline is
+	 * smaller than our one... OTW we keep our runtime and
+	 * deadline.
+	 */
+	if (pi_task && dl_entity_preempt(&pi_task->dl, &p->dl))
+		pi_se = &pi_task->dl;
+
 	/*
 	 * If p is throttled, we do nothing. In fact, if it exhausted
 	 * its budget it needs a replenishment and, since it now is on
@@ -782,7 +799,7 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	if (p->dl.dl_throttled)
 		return;
 
-	enqueue_dl_entity(&p->dl, flags);
+	enqueue_dl_entity(&p->dl, pi_se, flags);
 
 	if (!task_current(rq, p) && p->dl.nr_cpus_allowed > 1)
 		enqueue_pushable_dl_task(rq, p);
@@ -847,7 +864,7 @@ static long wait_interval_dl(struct task_struct *p, struct timespec *rqtp,
 	 */
 	wakeup = timespec_to_ns(rqtp);
 	if (dl_time_before(wakeup, dl_se->deadline) &&
-	    !dl_entity_overflow(dl_se, wakeup)) {
+	    !dl_entity_overflow(dl_se, dl_se, wakeup)) {
 		u64 ibw = (u64)dl_se->runtime * dl_se->dl_period;
 
 		ibw = div_u64(ibw, dl_se->dl_runtime);
-- 
1.7.2.3


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [RFC][PATCH 21/22] sched: add bandwidth management for sched_dl
  2010-10-29  6:18 [RFC][PATCH 00/22] sched: SCHED_DEADLINE v3 Raistlin
                   ` (19 preceding siblings ...)
  2010-10-29  6:43 ` [RFC][PATCH 20/22] sched: drafted deadline inheritance logic Raistlin
@ 2010-10-29  6:44 ` Raistlin
  2010-10-29  6:45 ` [RFC][PATCH 22/22] sched: add sched_dl documentation Raistlin
  21 siblings, 0 replies; 135+ messages in thread
From: Raistlin @ 2010-10-29  6:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 23119 bytes --]


In order of -deaadline scheduling to be effective and useful, it is
important that some method of having the allocation of the available
CPU bandwidth to tasks and task groups under control.
This is usually called "admission control" and if it is not performed
at all, no guarantee can be given on the actual scheduling of the
-deadline tasks.

Since when RT-throttling has been introduced each task group have a
bandwidth associated to itself, calculated as a certain amount of
runtime over a period. Moreover, to make it possible to manipulate
such bandwidth, readable/writable controls have been added to both
procfs (for system wide settings) and cgroupfs (for per-group
settings).
Therefore, the same interface is being used for controlling the
bandwidth distrubution to -deadline tasks and task groups, i.e.,
new controls but with similar names, equivalent meaning and with
the same usage paradigm are added.

However, more discussion is needed in order to figure out how
we want to manage SCHED_DEADLINE bandwidth at the task group level.
Therefore, this patch adds a less sophisticated, but actually
very sensible, machanism to ensure that a certain utilization
cap is not overcome per each root_domain.

Another main difference between deadline bandwidth management and
RT-throttling is that -deadline tasks have bandwidth on their own
(while -rt ones doesn't!), and thus we don't need an higher level
throttling mechanism to enforce the desired bandwidth.

This patch, therefore:
 - adds system wide deadline bandwidth management by means of:
    * /proc/sys/kernel/sched_dl_runtime_us,
    * /proc/sys/kernel/sched_dl_period_us,
   that determine (i.e., runtime / period) the total bandwidth
   available on each CPU of each root_domain for -deadline tasks;
 - couples the RT and deadline bandwidth management, i.e., enforces
   that the sum of how much bandwidth is being devoted to -rt
   -deadline tasks to stay below 100%.

This means that, for a root_domain comprising M CPUs, -deadline tasks
can be created until the sum of their bandwidths stay below:

    M * (sched_dl_runtime_us / sched_dl_period_us)

It is also possible to disable this bandwidth management logic, and
be thus free of oversubscribing the system up to any arbitrary level.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 include/linux/sched.h |    8 +
 kernel/sched.c        |  480 ++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched_dl.c     |   10 +
 kernel/sysctl.c       |   14 ++
 4 files changed, 492 insertions(+), 20 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7cf78e2..03f8a8a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1293,6 +1293,7 @@ struct sched_dl_entity {
 	u64 dl_runtime;		/* maximum runtime for each instance 	*/
 	u64 dl_deadline;	/* relative deadline of each instance	*/
 	u64 dl_period;		/* separation of two instances (period) */
+	u64 dl_bw;		/* dl_runtime / dl_deadline		*/
 
 	/*
 	 * Actual scheduling parameters. Initialized with the values above,
@@ -2116,6 +2117,13 @@ int sched_rt_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos);
 
+extern unsigned int sysctl_sched_dl_period;
+extern int sysctl_sched_dl_runtime;
+
+int sched_dl_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos);
+
 extern unsigned int sysctl_sched_compat_yield;
 
 #ifdef CONFIG_RT_MUTEXES
diff --git a/kernel/sched.c b/kernel/sched.c
index 97db370..5cc0e48 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -153,6 +153,22 @@ struct rt_prio_array {
 	struct list_head queue[MAX_RT_PRIO];
 };
 
+static unsigned long to_ratio(u64 period, u64 runtime)
+{
+	if (runtime == RUNTIME_INF)
+		return 1ULL << 20;
+
+	/*
+	 * Doing this here saves a lot of checks in all
+	 * the calling paths, and returning zero seems
+	 * safe for them anyway.
+	 */
+	if (period == 0)
+		return 0;
+
+	return div64_u64(runtime << 20, period);
+}
+
 struct rt_bandwidth {
 	/* nests inside the rq lock: */
 	raw_spinlock_t		rt_runtime_lock;
@@ -242,6 +258,74 @@ static void destroy_rt_bandwidth(struct rt_bandwidth *rt_b)
 #endif
 
 /*
+ * To keep the bandwidth of -deadline tasks and groups under control
+ * we need some place where:
+ *  - store the maximum -deadline bandwidth of the system (the group);
+ *  - cache the fraction of that bandwidth that is currently allocated.
+ *
+ * This is all done in the data structure below. It is similar to the
+ * one used for RT-throttling (rt_bandwidth), with the main difference
+ * that, since here we are only interested in admission control, we
+ * do not decrease any runtime while the group "executes", neither we
+ * need a timer to replenish it.
+ *
+ * With respect to SMP, the bandwidth is given on a per-CPU basis,
+ * meaning that:
+ *  - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;
+ *  - dl_total_bw array contains, in the i-eth element, the currently
+ *    allocated bandwidth on the i-eth CPU.
+ * Moreover, groups consume bandwidth on each CPU, while tasks only
+ * consume bandwidth on the CPU they're running on.
+ * Finally, dl_total_bw_cpu is used to cache the index of dl_total_bw
+ * that will be shown the next time the proc or cgroup controls will
+ * be red. It on its turn can be changed by writing on its own
+ * control.
+ */
+struct dl_bandwidth {
+	raw_spinlock_t dl_runtime_lock;
+	u64 dl_runtime;
+	u64 dl_period;
+};
+
+static struct dl_bandwidth def_dl_bandwidth;
+
+static
+void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime)
+{
+	raw_spin_lock_init(&dl_b->dl_runtime_lock);
+	dl_b->dl_period = period;
+	dl_b->dl_runtime = runtime;
+}
+
+static inline int dl_bandwidth_enabled(void)
+{
+	return sysctl_sched_dl_runtime >= 0;
+}
+
+/*
+ *
+ */
+struct dl_bw {
+	raw_spinlock_t lock;
+	u64 bw, total_bw;
+};
+
+static inline u64 global_dl_period(void);
+static inline u64 global_dl_runtime(void);
+
+static void init_dl_bw(struct dl_bw *dl_b)
+{
+	raw_spin_lock_init(&dl_b->lock);
+	raw_spin_lock(&def_dl_bandwidth.dl_runtime_lock);
+	if (global_dl_runtime() == RUNTIME_INF)
+		dl_b->bw = -1;
+	else
+		dl_b->bw = to_ratio(global_dl_period(), global_dl_runtime());
+	raw_spin_unlock(&def_dl_bandwidth.dl_runtime_lock);
+	dl_b->total_bw = 0;
+}
+
+/*
  * sched_domains_mutex serializes calls to arch_init_sched_domains,
  * detach_destroy_domains and partition_sched_domains.
  */
@@ -478,6 +562,7 @@ struct root_domain {
 	 */
 	cpumask_var_t dlo_mask;
 	atomic_t dlo_count;
+	struct dl_bw dl_bw;
 
 	/*
 	 * The "RT overload" flag: it gets set if a CPU has more than
@@ -915,6 +1000,28 @@ static inline u64 global_rt_runtime(void)
 	return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
 }
 
+/*
+ * Maximum bandwidth available for all -deadline tasks and groups
+ * (if group scheduling is configured) on each CPU.
+ *
+ * default: 5%
+ */
+unsigned int sysctl_sched_dl_period = 1000000;
+int sysctl_sched_dl_runtime = 50000;
+
+static inline u64 global_dl_period(void)
+{
+	return (u64)sysctl_sched_dl_period * NSEC_PER_USEC;
+}
+
+static inline u64 global_dl_runtime(void)
+{
+	if (sysctl_sched_dl_runtime < 0)
+		return RUNTIME_INF;
+
+	return (u64)sysctl_sched_dl_runtime * NSEC_PER_USEC;
+}
+
 #ifndef prepare_arch_switch
 # define prepare_arch_switch(next)	do { } while (0)
 #endif
@@ -2806,6 +2913,70 @@ void sched_fork(struct task_struct *p, int clone_flags)
 	put_cpu();
 }
 
+static inline
+void __dl_clear(struct dl_bw *dl_b, u64 tsk_bw)
+{
+	dl_b->total_bw -= tsk_bw;
+}
+
+static inline
+void __dl_add(struct dl_bw *dl_b, u64 tsk_bw)
+{
+	dl_b->total_bw += tsk_bw;
+}
+
+static inline
+bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64 new_bw)
+{
+	return dl_b->bw != -1 &&
+	       dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw;
+}
+
+/*
+ * We must be sure that accepting a new task (or allowing changing the
+ * parameters of an existing one) is consistent with the bandwidth
+ * contraints. If yes, this function also accordingly updates the currently
+ * allocated bandwidth to reflect the new situation.
+ *
+ * This function is called while holding p's rq->lock.
+ */
+static int dl_overflow(struct task_struct *p, int policy,
+		       const struct sched_param_ex *param_ex)
+{
+	struct dl_bw *dl_b = &task_rq(p)->rd->dl_bw;
+	u64 period = timespec_to_ns(&param_ex->sched_period);
+	u64 runtime = timespec_to_ns(&param_ex->sched_runtime);
+	u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
+	int cpus = cpumask_weight(task_rq(p)->rd->span);
+	int err = -1;
+
+	if (new_bw == p->dl.dl_bw)
+		return 0;
+
+	/*
+	 * Either if a task, enters, leave, or stays -deadline but changes
+	 * its parameters, we may need to update accordingly the total
+	 * allocated bandwidth of the container.
+	 */
+	raw_spin_lock(&dl_b->lock);
+	if (dl_policy(policy) && !task_has_dl_policy(p) &&
+	    !__dl_overflow(dl_b, cpus, 0, new_bw)) {
+		__dl_add(dl_b, new_bw);
+		err = 0;
+	} else if (dl_policy(policy) && task_has_dl_policy(p) &&
+		   !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
+		__dl_clear(dl_b, p->dl.dl_bw);
+		__dl_add(dl_b, new_bw);
+		err = 0;
+	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
+		__dl_clear(dl_b, p->dl.dl_bw);
+		err = 0;
+	}
+	raw_spin_unlock(&dl_b->lock);
+
+	return err;
+}
+
 /*
  * wake_up_new_task - wake up a newly created task for the first time.
  *
@@ -4815,6 +4986,7 @@ __setparam_dl(struct task_struct *p, const struct sched_param_ex *param_ex)
 		dl_se->dl_period = timespec_to_ns(&param_ex->sched_period);
 	else
 		dl_se->dl_period = dl_se->dl_deadline;
+	dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
 	dl_se->flags = param_ex->sched_flags;
 	dl_se->dl_throttled = 0;
 	dl_se->dl_new = 1;
@@ -5015,8 +5187,8 @@ recheck:
 		return -EINVAL;
 	}
 
-#ifdef CONFIG_RT_GROUP_SCHED
 	if (user) {
+#ifdef CONFIG_RT_GROUP_SCHED
 		/*
 		 * Do not allow realtime tasks into groups that have no runtime
 		 * assigned.
@@ -5027,9 +5199,25 @@ recheck:
 			raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 			return -EPERM;
 		}
-	}
 #endif
 
+		if (dl_bandwidth_enabled() && dl_policy(policy)) {
+			const struct cpumask *span = rq->rd->span;
+
+			/*
+			 * Don't allow tasks with an affinity mask smaller than
+			 * the entire root_domain to become SCHED_DEADLINE. We
+			 * will also fail if there's no bandwidth available.
+			 */
+			if (!cpumask_equal(&p->cpus_allowed, span) ||
+			    rq->rd->dl_bw.bw == 0) {
+				__task_rq_unlock(rq);
+				raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+				return -EPERM;
+			}
+		}
+	}
+
 	/* recheck policy now with rq lock held */
 	if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
 		policy = oldpolicy = -1;
@@ -5037,6 +5225,19 @@ recheck:
 		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 		goto recheck;
 	}
+
+	/*
+	 * If setscheduling to SCHED_DEADLINE (or changing the parameters
+	 * of a SCHED_DEADLINE task) we need to check if enough bandwidth
+	 * is available.
+	 */
+	if ((dl_policy(policy) || dl_task(p)) &&
+	    dl_overflow(p, policy, param_ex)) {
+		__task_rq_unlock(rq);
+		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+		return -EBUSY;
+	}
+
 	on_rq = p->se.on_rq;
 	running = task_current(rq, p);
 	if (on_rq)
@@ -5415,6 +5616,22 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 	if (retval)
 		goto out_unlock;
 
+	/*
+	 * Since bandwidth control happens on root_domain basis,
+	 * if admission test is enabled, we only admit -deadline
+	 * tasks allowed to run on all the CPUs in the task's
+	 * root_domain.
+	 */
+	if (task_has_dl_policy(p)) {
+		const struct cpumask *span = task_rq(p)->rd->span;
+
+		if (dl_bandwidth_enabled() &&
+		    !cpumask_equal(in_mask, span)) {
+			retval = -EBUSY;
+			goto out_unlock;
+		}
+	}
+
 	cpuset_cpus_allowed(p, cpus_allowed);
 	cpumask_and(new_mask, in_mask, cpus_allowed);
 again:
@@ -6076,6 +6293,42 @@ out:
 EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
 
 /*
+ * When dealing with a -deadline task, we have to check if moving it to
+ * a new CPU is possible or not. In fact, this is only true iff there
+ * is enough bandwidth available on such CPU, otherwise we want the
+ * whole migration progedure to fail over.
+ */
+static inline
+bool set_task_cpu_dl(struct task_struct *p, unsigned int cpu)
+{
+	struct dl_bw *dl_b = &task_rq(p)->rd->dl_bw;
+	struct dl_bw *cpu_b = &cpu_rq(cpu)->rd->dl_bw;
+	int ret = 1;
+	u64 bw;
+
+	if (dl_b == cpu_b)
+		return 1;
+
+	raw_spin_lock(&dl_b->lock);
+	raw_spin_lock(&cpu_b->lock);
+
+	bw = cpu_b->bw * cpumask_weight(cpu_rq(cpu)->rd->span);
+	if (dl_bandwidth_enabled() &&
+	    bw < cpu_b->total_bw + p->dl.dl_bw) {
+		ret = 0;
+		goto unlock;
+	}
+	dl_b->total_bw -= p->dl.dl_bw;
+	cpu_b->total_bw += p->dl.dl_bw;
+
+unlock:
+	raw_spin_unlock(&cpu_b->lock);
+	raw_spin_unlock(&dl_b->lock);
+
+	return ret;
+}
+
+/*
  * Move (not current) task off this cpu, onto dest cpu. We're doing
  * this because either it can't run here any more (set_cpus_allowed()
  * away from this CPU, or CPU going down), or because we're
@@ -6106,6 +6359,13 @@ static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
 		goto fail;
 
 	/*
+	 * If p is -deadline, proceed only if there is enough
+	 * bandwidth available on dest_cpu
+	 */
+	if (unlikely(dl_task(p)) && !set_task_cpu_dl(p, dest_cpu))
+		goto fail;
+
+	/*
 	 * If we're not on a rq, the next wake-up will ensure we're
 	 * placed properly.
 	 */
@@ -6856,6 +7116,8 @@ static int init_rootdomain(struct root_domain *rd)
 	if (!alloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
 		goto free_dlo_mask;
 
+	init_dl_bw(&rd->dl_bw);
+
 	if (cpupri_init(&rd->cpupri) != 0)
 		goto free_rto_mask;
 	return 0;
@@ -8406,6 +8668,8 @@ void __init sched_init(void)
 
 	init_rt_bandwidth(&def_rt_bandwidth,
 			global_rt_period(), global_rt_runtime());
+	init_dl_bandwidth(&def_dl_bandwidth,
+			global_dl_period(), global_dl_runtime());
 
 #ifdef CONFIG_RT_GROUP_SCHED
 	init_rt_bandwidth(&init_task_group.rt_bandwidth,
@@ -9074,14 +9338,6 @@ unsigned long sched_group_shares(struct task_group *tg)
  */
 static DEFINE_MUTEX(rt_constraints_mutex);
 
-static unsigned long to_ratio(u64 period, u64 runtime)
-{
-	if (runtime == RUNTIME_INF)
-		return 1ULL << 20;
-
-	return div64_u64(runtime << 20, period);
-}
-
 /* Must be called with tasklist_lock held */
 static inline int tg_has_rt_tasks(struct task_group *tg)
 {
@@ -9243,10 +9499,48 @@ long sched_group_rt_period(struct task_group *tg)
 	do_div(rt_period_us, NSEC_PER_USEC);
 	return rt_period_us;
 }
+#endif /* CONFIG_RT_GROUP_SCHED */
+
+/*
+ * Coupling of -rt and -deadline bandwidth.
+ *
+ * Here we check if the new -rt bandwidth value is consistent
+ * with the system settings for the bandwidth available
+ * to -deadline tasks.
+ *
+ * IOW, we want to enforce that
+ *
+ *   rt_bandwidth + dl_bandwidth <= 100%
+ *
+ * is always true.
+ */
+static bool __sched_rt_dl_global_constraints(u64 rt_bw)
+{
+	unsigned long flags;
+	u64 dl_bw;
+	bool ret;
+
+	raw_spin_lock_irqsave(&def_dl_bandwidth.dl_runtime_lock, flags);
+	if (global_rt_runtime() == RUNTIME_INF ||
+	    global_dl_runtime() == RUNTIME_INF) {
+		ret = true;
+		goto unlock;
+	}
+
+	dl_bw = to_ratio(def_dl_bandwidth.dl_period,
+			 def_dl_bandwidth.dl_runtime);
+
+	ret = rt_bw + dl_bw <= to_ratio(RUNTIME_INF, RUNTIME_INF);
+unlock:
+	raw_spin_unlock_irqrestore(&def_dl_bandwidth.dl_runtime_lock, flags);
+
+	return ret;
+}
 
+#ifdef CONFIG_RT_GROUP_SCHED
 static int sched_rt_global_constraints(void)
 {
-	u64 runtime, period;
+	u64 runtime, period, bw;
 	int ret = 0;
 
 	if (sysctl_sched_rt_period <= 0)
@@ -9261,6 +9555,10 @@ static int sched_rt_global_constraints(void)
 	if (runtime > period && runtime != RUNTIME_INF)
 		return -EINVAL;
 
+	bw = to_ratio(period, runtime)
+	if (!__sched_rt_dl_global_constraints(bw))
+		return -EINVAL;
+
 	mutex_lock(&rt_constraints_mutex);
 	read_lock(&tasklist_lock);
 	ret = __rt_schedulable(NULL, 0, 0);
@@ -9283,19 +9581,19 @@ int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
 static int sched_rt_global_constraints(void)
 {
 	unsigned long flags;
-	int i;
+	int i, ret = 0;
+	u64 bw;
 
 	if (sysctl_sched_rt_period <= 0)
 		return -EINVAL;
 
-	/*
-	 * There's always some RT tasks in the root group
-	 * -- migration, kstopmachine etc..
-	 */
-	if (sysctl_sched_rt_runtime == 0)
-		return -EBUSY;
-
 	raw_spin_lock_irqsave(&def_rt_bandwidth.rt_runtime_lock, flags);
+	bw = to_ratio(global_rt_period(), global_rt_runtime());
+	if (!__sched_rt_dl_global_constraints(bw)) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
 	for_each_possible_cpu(i) {
 		struct rt_rq *rt_rq = &cpu_rq(i)->rt;
 
@@ -9303,12 +9601,100 @@ static int sched_rt_global_constraints(void)
 		rt_rq->rt_runtime = global_rt_runtime();
 		raw_spin_unlock(&rt_rq->rt_runtime_lock);
 	}
+unlock:
 	raw_spin_unlock_irqrestore(&def_rt_bandwidth.rt_runtime_lock, flags);
 
-	return 0;
+	return ret;
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+/*
+ * Coupling of -dl and -rt bandwidth.
+ *
+ * Here we check, while setting the system wide bandwidth available
+ * for -dl tasks and groups, if the new values are consistent with
+ * the system settings for the bandwidth available to -rt entities.
+ *
+ * IOW, we want to enforce that
+ *
+ *   rt_bandwidth + dl_bandwidth <= 100%
+ *
+ * is always true.
+ */
+static bool __sched_dl_rt_global_constraints(u64 dl_bw)
+{
+	u64 rt_bw;
+	bool ret;
+
+	raw_spin_lock(&def_rt_bandwidth.rt_runtime_lock);
+	if (global_dl_runtime() == RUNTIME_INF ||
+	    global_rt_runtime() == RUNTIME_INF) {
+		ret = true;
+		goto unlock;
+	}
+
+	rt_bw = to_ratio(ktime_to_ns(def_rt_bandwidth.rt_period),
+			 def_rt_bandwidth.rt_runtime);
+
+	ret = rt_bw + dl_bw <= to_ratio(RUNTIME_INF, RUNTIME_INF);
+unlock:
+	raw_spin_unlock(&def_rt_bandwidth.rt_runtime_lock);
+
+	return ret;
+}
+
+static bool __sched_dl_global_constraints(u64 runtime, u64 period)
+{
+	/*
+	 * There's always some -deadline tasks in the root group
+	 * -- migration, kstopmachine etc..
+	 */
+	if (runtime == 0)
+		return -EBUSY;
+
+	if (!period || (runtime != RUNTIME_INF && runtime > period))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int sched_dl_global_constraints(void)
+{
+	u64 runtime = global_dl_runtime();
+	u64 period = global_dl_period();
+	u64 new_bw = to_ratio(period, runtime);
+	int ret, i;
+
+	ret = __sched_dl_global_constraints(runtime, period);
+	if (ret)
+		return ret;
+
+	if (!__sched_dl_rt_global_constraints(new_bw))
+		return -EINVAL;
+
+	/*
+	 * Here we want to check the bandwidth not being set to some
+	 * value smaller than the currently allocated bandwidth in
+	 * any of the root_domains.
+	 *
+	 * FIXME: Cycling on all the CPUs is overdoing, but simpler than
+	 * cycling on root_domains... Discussion on different/better
+	 * solutions is welcome!
+	 */
+	for_each_possible_cpu(i) {
+		struct dl_bw *dl_b = &cpu_rq(i)->rd->dl_bw;
+
+		raw_spin_lock(&dl_b->lock);
+		if (new_bw < dl_b->total_bw) {
+			raw_spin_unlock(&dl_b->lock);
+			return -EBUSY;
+		}
+		raw_spin_unlock(&dl_b->lock);
+	}
+
+	return 0;
+}
+
 int sched_rt_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos)
@@ -9339,6 +9725,60 @@ int sched_rt_handler(struct ctl_table *table, int write,
 	return ret;
 }
 
+int sched_dl_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos)
+{
+	int ret;
+	int old_period, old_runtime;
+	static DEFINE_MUTEX(mutex);
+	unsigned long flags;
+
+	mutex_lock(&mutex);
+	old_period = sysctl_sched_dl_period;
+	old_runtime = sysctl_sched_dl_runtime;
+
+	ret = proc_dointvec(table, write, buffer, lenp, ppos);
+
+	if (!ret && write) {
+		raw_spin_lock_irqsave(&def_dl_bandwidth.dl_runtime_lock,
+				      flags);
+
+		ret = sched_dl_global_constraints();
+		if (ret) {
+			sysctl_sched_dl_period = old_period;
+			sysctl_sched_dl_runtime = old_runtime;
+		} else {
+			u64 new_bw;
+			int i;
+
+			def_dl_bandwidth.dl_period = global_dl_period();
+			def_dl_bandwidth.dl_runtime = global_dl_runtime();
+			if (global_dl_runtime() == RUNTIME_INF)
+				new_bw = -1;
+			else
+				new_bw = to_ratio(global_dl_period(),
+						  global_dl_runtime());
+			/*
+			 * FIXME: As above...
+			 */
+			for_each_possible_cpu(i) {
+				struct dl_bw *dl_b = &cpu_rq(i)->rd->dl_bw;
+
+				raw_spin_lock(&dl_b->lock);
+				dl_b->bw = new_bw;
+				raw_spin_unlock(&dl_b->lock);
+			}
+		}
+
+		raw_spin_unlock_irqrestore(&def_dl_bandwidth.dl_runtime_lock,
+					   flags);
+	}
+	mutex_unlock(&mutex);
+
+	return ret;
+}
+
 #ifdef CONFIG_CGROUP_SCHED
 
 /* return corresponding task_group object of a cgroup */
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index 991a4f2..69324ae 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -1084,7 +1084,17 @@ static void task_fork_dl(struct task_struct *p)
 
 static void task_dead_dl(struct task_struct *p)
 {
+	struct dl_bw *dl_b = &task_rq(p)->rd->dl_bw;
+
+	/*
+	 * Since we are TASK_DEAD we won't slip out of the domain!
+	 */
+	raw_spin_lock_irq(&dl_b->lock);
+	dl_b->total_bw -= p->dl.dl_bw;
+	raw_spin_unlock_irq(&dl_b->lock);
+
 	/*
+	 * We are no longer holding any lock here, so it is safe to
 	 * We are not holding any lock here, so it is safe to
 	 * wait for the bandwidth timer to be removed.
 	 */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c33a1ed..eb975be 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -376,6 +376,20 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= sched_rt_handler,
 	},
 	{
+		.procname	= "sched_dl_period_us",
+		.data		= &sysctl_sched_dl_period,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sched_dl_handler,
+	},
+	{
+		.procname	= "sched_dl_runtime_us",
+		.data		= &sysctl_sched_dl_runtime,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= sched_dl_handler,
+	},
+	{
 		.procname	= "sched_compat_yield",
 		.data		= &sysctl_sched_compat_yield,
 		.maxlen		= sizeof(unsigned int),
-- 
1.7.2.3


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* [RFC][PATCH 22/22] sched: add sched_dl documentation
  2010-10-29  6:18 [RFC][PATCH 00/22] sched: SCHED_DEADLINE v3 Raistlin
                   ` (20 preceding siblings ...)
  2010-10-29  6:44 ` [RFC][PATCH 21/22] sched: add bandwidth management for sched_dl Raistlin
@ 2010-10-29  6:45 ` Raistlin
  21 siblings, 0 replies; 135+ messages in thread
From: Raistlin @ 2010-10-29  6:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 7162 bytes --]


Add in Documentation/scheduler/ some hints about the design
choices, the usage and the future possible developments of the
sched_dl scheduling class and of the SCHED_DEADLINE policy.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 Documentation/scheduler/sched-deadline.txt |  147 ++++++++++++++++++++++++++++
 1 files changed, 147 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/scheduler/sched-deadline.txt

diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt
new file mode 100644
index 0000000..e795968
--- /dev/null
+++ b/Documentation/scheduler/sched-deadline.txt
@@ -0,0 +1,147 @@
+			Deadline Task and Group Scheduling
+			----------------------------------
+
+CONTENTS
+========
+
+0. WARNING
+1. Overview
+2. Task scheduling
+2. The interface
+3. Bandwidth management
+  3.1 System wide settings
+  2.2 Task interface
+  2.4 Default behavior
+3. Future plans
+
+
+0. WARNING
+==========
+
+ Fiddling with these settings can result in an unpredictable or even unstable
+ system behavior. As for -rt (group) scheduling, it is assumed that root
+ knows what he is doing.
+
+
+1. Overview
+===========
+
+ The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
+ basically an implementation of the Earliest Deadline First (EDF) scheduling
+ algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
+ that make it possible to isolate the behaviour of tasks between each other.
+
+
+2. Task scheduling
+==================
+
+ The typical -deadline task will be made up of a computation phase (instance)
+ which is activated on a periodic or sporadic fashion. The expected (maximum)
+ duration of such computation is called the task's runtime; the time interval
+ by which each instance need to be completed is called the task's relative
+ deadline. The task's absolute deadline is dynamically calculated as the
+ time instant a task (better, an instance) activates plus the relative
+ deadline.
+
+ The EDF algorithms selects the task with the smallest absolute deadline as
+ the one to be executed first, while the CBS ensures each task to run for
+ at most the its runtime every (relative) deadline length time interval,
+ avoiding any interference between different tasks (bandwidth isolation).
+ Thanks to this feature, also tasks that do not strictly comply with the
+ computational model sketched above can effectively use the new policy.
+ IOW, there are no limitations on what kind of task can exploit this new
+ scheduling discipline, even if it must be said that it is particularly
+ suited for periodic or sporadic tasks that need guarantees on their
+ timing behaviour, e.g., multimedia, streaming, control applications, etc.
+
+
+3. Bandwidth management
+=======================
+
+ In order of -deaadline scheduling to be effective and useful, it is important
+ that some method of having the allocation of the available CPU bandwidth to
+ the tasks under control.
+ This is usually called "admission control" and if it is not performed at all,
+ no guarantee can be given on the actual scheduling of the -deadline tasks.
+
+ Since when RT-throttling has been introduced each task group have a bandwidth
+ associated to itself, calculated as a certain amount of runtime over a period.
+ Moreover, to make it possible to manipulate such bandwidth, readable/writable
+ controls have been added to both procfs (for system wide settings) and cgroupfs
+ (for per-group settings).
+ Therefore, the same interface is being used for controlling the bandwidth
+ distrubution to -deadline tasks and task groups, i.e., new controls but with
+ similar names, equivalent meaning and with the same usage paradigm are added.
+
+ However, more discussion is needed in order to figure out how we want to manage
+ SCHED_DEADLINE bandwidth at the task group level. Therefore, SCHED_DEADLINE uses
+ (for now) a less sophisticated, but actually very sensible, machanism to ensure
+ that a certain utilization cap is not overcome per each root_domain.
+
+ Another main difference between deadline bandwidth management and RT-throttling
+ is that -deadline tasks have bandwidth on their own (while -rt ones doesn't!),
+ and thus we don't need an higher level throttling mechanism to enforce the
+ desired bandwidth.
+
+3.1 System wide settings
+------------------------
+
+The system wide settings are configured under the /proc virtual file system:
+
+ The per-group controls that are added to the cgroupfs virtual file system are:
+  * /proc/sys/kernel/sched_dl_runtime_us,
+  * /proc/sys/kernel/sched_dl_period_us,
+
+ They accepts (if written) and provides (if read) the new runtime and period,
+ respectively, for each CPU in each root_domain.
+
+ This means that, for a root_domain comprising M CPUs, -deadline tasks
+ can be created until the sum of their bandwidths stay below:
+
+   M * (sched_dl_runtime_us / sched_dl_period_us)
+
+ It is also possible to disable this bandwidth management logic, and
+ be thus free of oversubscribing the system up to any arbitrary level.
+ This is done by writing -1 in /proc/sys/kernel/sched_dl_runtime_us.
+
+
+2.2 Task interface
+------------------
+
+ Specifying a periodic/sporadic task that executes for a given amount of
+ runtime at each instance, and that is scheduled according to the usrgency of
+ their own timing constraints needs, in general, a way of declaring:
+  - a (maximum/typical) instance execution time,
+  - a minimum interval between consecutive instances,
+  - a time constraint by which each instance must be completed.
+
+ Therefore:
+  * a new struct sched_param_ex, containing all the necessary fields is
+    provided;
+  * the new scheduling related syscalls that manipulate it, i.e.,
+    sched_setscheduler_ex(), sched_setparam_ex() and sched_getparam_ex()
+    are implemented.
+
+
+2.4 Default behavior
+---------------------
+
+The default values for SCHED_DEADLINE bandwidth is to have dl_runtime and
+dl_period equal to 500000 and 1000000, respectively. This means -deadline
+tasks can use at most 5%, multiplied by the number of CPUs that compose the
+root_domain, for each root_domain.
+
+When a -deadline task fork a child, its dl_runtime is set to 0, which means
+someone must call sched_setscheduler_ex() on it, or it won't even start.
+
+
+3. Future plans
+===============
+
+Still Missing parts:
+
+ - refinements in deadline inheritance, especially regarding the possibility
+   of retaining bandwidth isolation among non-interacting tasks. This is
+   being studied from both theoretical and practical point of views, and
+   hopefully we can have some demonstrative code soon.
+
-- 
1.7.2.3


-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 02/22] sched: add extended scheduling interface
  2010-10-29  6:27 ` [RFC][PATCH 02/22] sched: add extended scheduling interface Raistlin
@ 2010-11-10 16:00   ` Dhaval Giani
  2010-11-10 16:12     ` Dhaval Giani
  2010-11-10 16:17     ` Claudio Scordino
  2010-11-10 17:28   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 135+ messages in thread
From: Dhaval Giani @ 2010-11-10 16:00 UTC (permalink / raw)
  To: Raistlin
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, oleg, Frederic Weisbecker, Darren Hart,
	Johan Eker, p.faure, linux-kernel, Claudio Scordino,
	michael trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Nicola Manica, Luca Abeni, Harald Gustafsson, paulmck

> +/*
> + * Extended scheduling parameters data structure.
> + *
> + * This is needed because the original struct sched_param can not be
> + * altered without introducing ABI issues with legacy applications
> + * (e.g., in sched_getparam()).
> + *
> + * However, the possibility of specifying more than just a priority for
> + * the tasks may be useful for a wide variety of application fields, e.g.,
> + * multimedia, streaming, automation and control, and many others.
> + *
> + * This variant (sched_param_ex) is meant at describing a so-called
> + * sporadic time-constrained task. In such model a task is specified by:
> + *  - the activation period or minimum instance inter-arrival time;
> + *  - the maximum (or average, depending on the actual scheduling
> + *    discipline) computation time of all instances, a.k.a. runtime;
> + *  - the deadline (relative to the actual activation time) of each
> + *    instance.
> + * Very briefly, a periodic (sporadic) task asks for the execution of
> + * some specific computation --which is typically called an instance--
> + * (at most) every period. Moreover, each instance typically lasts no more
> + * than the runtime and must be completed by time instant t equal to
> + * the instance activation time + the deadline.
> + *
> + * This is reflected by the actual fields of the sched_param_ex structure:
> + *
> + *  @sched_priority     task's priority (might still be useful)
> + *  @sched_deadline     representative of the task's deadline
> + *  @sched_runtime      representative of the task's runtime
> + *  @sched_period       representative of the task's period
> + *  @sched_flags        for customizing the scheduler behaviour
> + *
> + * There are other fields, which may be useful for implementing (in
> + * user-space) advanced scheduling behaviours, e.g., feedback scheduling:
> + *
> + *  @curr_runtime       task's currently available runtime
> + *  @used_runtime       task's totally used runtime
> + *  @curr_deadline      task's current absolute deadline
> + *
> + * Given this task model, there are a multiplicity of scheduling algorithms
> + * and policies, that can be used to ensure all the tasks will make their
> + * timing constraints.
> + */
> +struct sched_param_ex {
> +	int sched_priority;
> +	struct timespec sched_runtime;
> +	struct timespec sched_deadline;
> +	struct timespec sched_period;
> +	unsigned int sched_flags;
> +
> +	struct timespec curr_runtime;
> +	struct timespec used_runtime;
> +	struct timespec curr_deadline;
> +};
> +

So, how extensible is this. What about when real time theory develops a
new algorithm which actually works practically, but needs additional
parameters? :-). (I am guessing that this interface should handle most
of the algorithms used these days). Any ideas?

Thanks,
Dhaval

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 02/22] sched: add extended scheduling interface
  2010-11-10 16:00   ` Dhaval Giani
@ 2010-11-10 16:12     ` Dhaval Giani
  2010-11-10 22:45       ` Raistlin
  2010-11-10 16:17     ` Claudio Scordino
  1 sibling, 1 reply; 135+ messages in thread
From: Dhaval Giani @ 2010-11-10 16:12 UTC (permalink / raw)
  To: Raistlin
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, oleg, Frederic Weisbecker, Darren Hart,
	Johan Eker, p.faure, linux-kernel, Claudio Scordino,
	michael trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Nicola Manica, Luca Abeni, Harald Gustafsson, paulmck

On Wed, Nov 10, 2010 at 5:00 PM, Dhaval Giani <dhaval@retis.sssup.it> wrote:
>> +/*
>> + * Extended scheduling parameters data structure.
>> + *
>> + * This is needed because the original struct sched_param can not be
>> + * altered without introducing ABI issues with legacy applications
>> + * (e.g., in sched_getparam()).
>> + *
>> + * However, the possibility of specifying more than just a priority for
>> + * the tasks may be useful for a wide variety of application fields, e.g.,
>> + * multimedia, streaming, automation and control, and many others.
>> + *
>> + * This variant (sched_param_ex) is meant at describing a so-called
>> + * sporadic time-constrained task. In such model a task is specified by:
>> + *  - the activation period or minimum instance inter-arrival time;
>> + *  - the maximum (or average, depending on the actual scheduling
>> + *    discipline) computation time of all instances, a.k.a. runtime;
>> + *  - the deadline (relative to the actual activation time) of each
>> + *    instance.
>> + * Very briefly, a periodic (sporadic) task asks for the execution of
>> + * some specific computation --which is typically called an instance--
>> + * (at most) every period. Moreover, each instance typically lasts no more
>> + * than the runtime and must be completed by time instant t equal to
>> + * the instance activation time + the deadline.
>> + *
>> + * This is reflected by the actual fields of the sched_param_ex structure:
>> + *
>> + *  @sched_priority     task's priority (might still be useful)
>> + *  @sched_deadline     representative of the task's deadline
>> + *  @sched_runtime      representative of the task's runtime
>> + *  @sched_period       representative of the task's period
>> + *  @sched_flags        for customizing the scheduler behaviour
>> + *
>> + * There are other fields, which may be useful for implementing (in
>> + * user-space) advanced scheduling behaviours, e.g., feedback scheduling:
>> + *
>> + *  @curr_runtime       task's currently available runtime
>> + *  @used_runtime       task's totally used runtime
>> + *  @curr_deadline      task's current absolute deadline
>> + *
>> + * Given this task model, there are a multiplicity of scheduling algorithms
>> + * and policies, that can be used to ensure all the tasks will make their
>> + * timing constraints.
>> + */
>> +struct sched_param_ex {
>> +     int sched_priority;
>> +     struct timespec sched_runtime;
>> +     struct timespec sched_deadline;
>> +     struct timespec sched_period;
>> +     unsigned int sched_flags;
>> +
>> +     struct timespec curr_runtime;
>> +     struct timespec used_runtime;
>> +     struct timespec curr_deadline;

Can we expose soem of these details via schedstats as opposed to a syscall?

Dhaval

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 02/22] sched: add extended scheduling interface
  2010-11-10 16:00   ` Dhaval Giani
  2010-11-10 16:12     ` Dhaval Giani
@ 2010-11-10 16:17     ` Claudio Scordino
  1 sibling, 0 replies; 135+ messages in thread
From: Claudio Scordino @ 2010-11-10 16:17 UTC (permalink / raw)
  To: Dhaval Giani
  Cc: Raistlin, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Steven Rostedt, Chris Friesen, oleg, Frederic Weisbecker,
	Darren Hart, Johan Eker, p.faure, linux-kernel,
	michael trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Nicola Manica, Luca Abeni, Harald Gustafsson, paulmck

Dhaval Giani ha scritto:
>> +/*
>> + * Extended scheduling parameters data structure.
>> + *
>> + * This is needed because the original struct sched_param can not be
>> + * altered without introducing ABI issues with legacy applications
>> + * (e.g., in sched_getparam()).
>> + *
>> + * However, the possibility of specifying more than just a priority for
>> + * the tasks may be useful for a wide variety of application fields, e.g.,
>> + * multimedia, streaming, automation and control, and many others.
>> + *
>> + * This variant (sched_param_ex) is meant at describing a so-called
>> + * sporadic time-constrained task. In such model a task is specified by:
>> + *  - the activation period or minimum instance inter-arrival time;
>> + *  - the maximum (or average, depending on the actual scheduling
>> + *    discipline) computation time of all instances, a.k.a. runtime;
>> + *  - the deadline (relative to the actual activation time) of each
>> + *    instance.
>> + * Very briefly, a periodic (sporadic) task asks for the execution of
>> + * some specific computation --which is typically called an instance--
>> + * (at most) every period. Moreover, each instance typically lasts no more
>> + * than the runtime and must be completed by time instant t equal to
>> + * the instance activation time + the deadline.
>> + *
>> + * This is reflected by the actual fields of the sched_param_ex structure:
>> + *
>> + *  @sched_priority     task's priority (might still be useful)
>> + *  @sched_deadline     representative of the task's deadline
>> + *  @sched_runtime      representative of the task's runtime
>> + *  @sched_period       representative of the task's period
>> + *  @sched_flags        for customizing the scheduler behaviour
>> + *
>> + * There are other fields, which may be useful for implementing (in
>> + * user-space) advanced scheduling behaviours, e.g., feedback scheduling:
>> + *
>> + *  @curr_runtime       task's currently available runtime
>> + *  @used_runtime       task's totally used runtime
>> + *  @curr_deadline      task's current absolute deadline
>> + *
>> + * Given this task model, there are a multiplicity of scheduling algorithms
>> + * and policies, that can be used to ensure all the tasks will make their
>> + * timing constraints.
>> + */
>> +struct sched_param_ex {
>> +	int sched_priority;
>> +	struct timespec sched_runtime;
>> +	struct timespec sched_deadline;
>> +	struct timespec sched_period;
>> +	unsigned int sched_flags;
>> +
>> +	struct timespec curr_runtime;
>> +	struct timespec used_runtime;
>> +	struct timespec curr_deadline;
>> +};
>> +
> 
> So, how extensible is this. What about when real time theory develops a
> new algorithm which actually works practically, but needs additional
> parameters? :-). (I am guessing that this interface should handle most
> of the algorithms used these days). Any ideas?

AFAIK, these parameters already address most real-time algorithms.

It could happen that in the future a new algorithm will need some
further parameter (like jitter). However, IMHO, this is too unlikely
to justify the addition of some padding in this data structure.

Best regards,

	Claudio

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 02/22] sched: add extended scheduling interface
  2010-10-29  6:27 ` [RFC][PATCH 02/22] sched: add extended scheduling interface Raistlin
  2010-11-10 16:00   ` Dhaval Giani
@ 2010-11-10 17:28   ` Peter Zijlstra
  2010-11-10 19:26     ` Peter Zijlstra
                       ` (2 more replies)
  2010-11-10 18:50   ` Peter Zijlstra
  2010-11-12 16:38   ` Steven Rostedt
  3 siblings, 3 replies; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-10 17:28 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-10-29 at 08:27 +0200, Raistlin wrote:
> +struct sched_param_ex {
> +       int sched_priority;
> +       struct timespec sched_runtime;
> +       struct timespec sched_deadline;
> +       struct timespec sched_period;
> +       unsigned int sched_flags;
> +
> +       struct timespec curr_runtime;
> +       struct timespec used_runtime;
> +       struct timespec curr_deadline;
> +}; 

It would be better for alignment reasons to move the sched_flags field
next to the sched_priority field.

I would suggest we add at least one more field so we can implement the
stochastic model from UNC, sched_runtime_dev or sched_runtime_var or
somesuch.



^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 02/22] sched: add extended scheduling interface
  2010-10-29  6:27 ` [RFC][PATCH 02/22] sched: add extended scheduling interface Raistlin
  2010-11-10 16:00   ` Dhaval Giani
  2010-11-10 17:28   ` Peter Zijlstra
@ 2010-11-10 18:50   ` Peter Zijlstra
  2010-11-10 22:05     ` Raistlin
  2010-11-12 16:38   ` Steven Rostedt
  3 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-10 18:50 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-10-29 at 08:27 +0200, Raistlin wrote:
>  static int __sched_setscheduler(struct task_struct *p, int policy,
> -                               const struct sched_param *param, bool user)
> +                               const struct sched_param *param,
> +                               const struct sched_param_ex *param_ex,
> +                               bool user)
>  {
>         int retval, oldprio, oldpolicy = -1, on_rq, running;
>         unsigned long flags;
> @@ -4861,10 +4863,18 @@ recheck:
>  int sched_setscheduler(struct task_struct *p, int policy,
>                        const struct sched_param *param)
>  {
> -       return __sched_setscheduler(p, policy, param, true);
> +       return __sched_setscheduler(p, policy, param, NULL, true);
>  }
>  EXPORT_SYMBOL_GPL(sched_setscheduler);
>  
> +int sched_setscheduler_ex(struct task_struct *p, int policy,
> +                         const struct sched_param *param,
> +                         const struct sched_param_ex *param_ex)
> +{
> +       return __sched_setscheduler(p, policy, param, param_ex, true);
> +}
> +EXPORT_SYMBOL_GPL(sched_setscheduler_ex); 

Do we really need to pass both params? Can't we simply create a struct
sched_param_ex new_param = { .sched_priority = param->sched_priority };
on stack and pass that?



^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 03/22] sched: SCHED_DEADLINE data structures.
  2010-10-29  6:28 ` [RFC][PATCH 03/22] sched: SCHED_DEADLINE data structures Raistlin
@ 2010-11-10 18:59   ` Peter Zijlstra
  2010-11-10 22:06     ` Raistlin
  2010-11-10 19:10   ` Peter Zijlstra
  1 sibling, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-10 18:59 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-10-29 at 08:28 +0200, Raistlin wrote:
> +       /*
> +        * Bandwidth enforcement timer. Each -deadline task has its
> +        * own bandwidth to be enforced, thus we need one timer per task.
> +        */
> +       struct hrtimer dl_timer; 

This is for the bandwidth replenishment right? Not the runtime throttle?

The throttle thing should only need a single timer per rq as only the
current task will be consuming runtime.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 03/22] sched: SCHED_DEADLINE data structures.
  2010-10-29  6:28 ` [RFC][PATCH 03/22] sched: SCHED_DEADLINE data structures Raistlin
  2010-11-10 18:59   ` Peter Zijlstra
@ 2010-11-10 19:10   ` Peter Zijlstra
  2010-11-12 17:11     ` Steven Rostedt
  1 sibling, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-10 19:10 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-10-29 at 08:28 +0200, Raistlin wrote:
> +       if (unlikely(prio >= MAX_DL_PRIO && prio < MAX_RT_PRIO))

Since MAX_DL_PRIO is 0, you can write that as: 
  ((unsigned)prio) < MAX_RT_PRIO



^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 04/22] sched: SCHED_DEADLINE SMP-related data structures
  2010-10-29  6:29 ` [RFC][PATCH 04/22] sched: SCHED_DEADLINE SMP-related " Raistlin
@ 2010-11-10 19:17   ` Peter Zijlstra
  0 siblings, 0 replies; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-10 19:17 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-10-29 at 08:29 +0200, Raistlin wrote:
> @@ -1336,6 +1336,7 @@ struct task_struct {
>  
>         struct list_head tasks;
>         struct plist_node pushable_tasks;
> +       struct rb_node pushable_dl_tasks; 

Shouldn't these be CONFIG_SMP too? Not much use in tracking pushable
tasks when there's nothing to push them to, eh?



^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation
  2010-10-29  6:30 ` [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation Raistlin
@ 2010-11-10 19:21   ` Peter Zijlstra
  2010-11-10 19:43   ` Peter Zijlstra
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-10 19:21 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-10-29 at 08:30 +0200, Raistlin wrote:
> +       if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
> +               p->se.load.weight = 0;
> +               p->se.load.inv_weight = WMULT_CONST;
> +               return;
> +       } 

---
commit 17bdcf949d03306b308c5fb694849cd35f119807
Author: Linus Walleij <linus.walleij@stericsson.com>
Date:   Mon Oct 11 16:36:51 2010 +0200

    sched: Drop all load weight manipulation for RT tasks
    
    Load weights are for the CFS, they do not belong in the RT task. This makes all
    RT scheduling classes leave the CFS weights alone.
    
    This fixes a real bug as well: I noticed the following phonomena: a process
    elevated to SCHED_RR forks with SCHED_RESET_ON_FORK set, and the child is
    indeed SCHED_OTHER, and the niceval is indeed reset to 0. However the weight
    inserted by set_load_weight() remains at 0, giving the task insignificat
    priority.
    
    With this fix, the weight is reset to what the task had before being elevated
    to SCHED_RR/SCHED_FIFO.
    
    Cc: Lennart Poettering <lennart@poettering.net>
    Cc: stable@kernel.org
    Signed-off-by: Linus Walleij <linus.walleij@stericsson.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    LKML-Reference: <1286807811-10568-1-git-send-email-linus.walleij@stericsson.com>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/kernel/sched.c b/kernel/sched.c
index 5f64fed..728081a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1855,12 +1855,6 @@ static void dec_nr_running(struct rq *rq)
 
 static void set_load_weight(struct task_struct *p)
 {
-	if (task_has_rt_policy(p)) {
-		p->se.load.weight = 0;
-		p->se.load.inv_weight = WMULT_CONST;
-		return;
-	}
-
 	/*
 	 * SCHED_IDLE tasks get minimal weight:
 	 */


^ permalink raw reply related	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 02/22] sched: add extended scheduling interface
  2010-11-10 17:28   ` Peter Zijlstra
@ 2010-11-10 19:26     ` Peter Zijlstra
  2010-11-10 23:33       ` Tommaso Cucinotta
  2010-11-10 22:17     ` Raistlin
  2010-11-10 22:24     ` Raistlin
  2 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-10 19:26 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Wed, 2010-11-10 at 18:28 +0100, Peter Zijlstra wrote:
> On Fri, 2010-10-29 at 08:27 +0200, Raistlin wrote:
> > +struct sched_param_ex {
> > +       int sched_priority;
> > +       struct timespec sched_runtime;
> > +       struct timespec sched_deadline;
> > +       struct timespec sched_period;
> > +       unsigned int sched_flags;
> > +
> > +       struct timespec curr_runtime;
> > +       struct timespec used_runtime;
> > +       struct timespec curr_deadline;
> > +}; 
> 
> It would be better for alignment reasons to move the sched_flags field
> next to the sched_priority field.
> 
> I would suggest we add at least one more field so we can implement the
> stochastic model from UNC, sched_runtime_dev or sched_runtime_var or
> somesuch.

Oh, and their model has something akin to: sched_runtime_max, these
Gaussian bell curves go to inf. which is kinda bad for trying to compute
bounds.

 

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation
  2010-10-29  6:30 ` [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation Raistlin
  2010-11-10 19:21   ` Peter Zijlstra
@ 2010-11-10 19:43   ` Peter Zijlstra
  2010-11-11  1:02     ` Raistlin
  2010-11-10 19:45   ` Peter Zijlstra
                     ` (5 subsequent siblings)
  7 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-10 19:43 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-10-29 at 08:30 +0200, Raistlin wrote:
> + * We are being explicitly informed that a new instance is starting,
> + * and this means that:
> + *  - the absolute deadline of the entity has to be placed at
> + *    current time + relative deadline;
> + *  - the runtime of the entity has to be set to the maximum value.

When exactly are we a new instance? From a quick look dl_new gets set
after a sched_setscheduler() call, is that the only way?

Could a task calling sched_setscheduler() on itself cheat the system?

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation
  2010-10-29  6:30 ` [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation Raistlin
  2010-11-10 19:21   ` Peter Zijlstra
  2010-11-10 19:43   ` Peter Zijlstra
@ 2010-11-10 19:45   ` Peter Zijlstra
  2010-11-10 22:26     ` Raistlin
  2010-11-10 20:21   ` Peter Zijlstra
                     ` (4 subsequent siblings)
  7 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-10 19:45 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-10-29 at 08:30 +0200, Raistlin wrote:
> @@ -4922,7 +4965,11 @@ recheck:

if you add:

QUILT_DIFF_OPTS="-F ^[[:alpha:]\$_].*[^:]\$"

to your /etc/quilt.quiltrc, it won't see labels as functions.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation
  2010-10-29  6:30 ` [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation Raistlin
                     ` (2 preceding siblings ...)
  2010-11-10 19:45   ` Peter Zijlstra
@ 2010-11-10 20:21   ` Peter Zijlstra
  2010-11-11  1:18     ` Raistlin
  2010-11-11 14:13   ` Peter Zijlstra
                     ` (3 subsequent siblings)
  7 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-10 20:21 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-10-29 at 08:30 +0200, Raistlin wrote:
> +static void update_dl_entity(struct sched_dl_entity *dl_se)
> +{
> +       struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> +       struct rq *rq = rq_of_dl_rq(dl_rq);
> +
> +       /*
> +        * The arrival of a new instance needs special treatment, i.e.,
> +        * the actual scheduling parameters have to be "renewed".
> +        */
> +       if (dl_se->dl_new) {
> +               setup_new_dl_entity(dl_se);
> +               return;
> +       }
> +
> +       if (dl_time_before(dl_se->deadline, rq->clock) ||
> +           dl_entity_overflow(dl_se, rq->clock)) {
> +               dl_se->deadline = rq->clock + dl_se->dl_deadline;
> +               dl_se->runtime = dl_se->dl_runtime;
> +       }
> +} 

Can't we loose runtime deficit this way?

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 02/22] sched: add extended scheduling interface
  2010-11-10 18:50   ` Peter Zijlstra
@ 2010-11-10 22:05     ` Raistlin
  0 siblings, 0 replies; 135+ messages in thread
From: Raistlin @ 2010-11-10 22:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 930 bytes --]

On Wed, 2010-11-10 at 19:50 +0100, Peter Zijlstra wrote:  
> > +int sched_setscheduler_ex(struct task_struct *p, int policy,
> > +                         const struct sched_param *param,
> > +                         const struct sched_param_ex *param_ex)
> > +{
> > +       return __sched_setscheduler(p, policy, param, param_ex, true);
> > +}
> > +EXPORT_SYMBOL_GPL(sched_setscheduler_ex); 
> 
> Do we really need to pass both params? Can't we simply create a struct
> sched_param_ex new_param = { .sched_priority = param->sched_priority };
> on stack and pass that?
> 
We can. I'll do that way.

Thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 03/22] sched: SCHED_DEADLINE data structures.
  2010-11-10 18:59   ` Peter Zijlstra
@ 2010-11-10 22:06     ` Raistlin
  0 siblings, 0 replies; 135+ messages in thread
From: Raistlin @ 2010-11-10 22:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 999 bytes --]

On Wed, 2010-11-10 at 19:59 +0100, Peter Zijlstra wrote:
> On Fri, 2010-10-29 at 08:28 +0200, Raistlin wrote:
> > +       /*
> > +        * Bandwidth enforcement timer. Each -deadline task has its
> > +        * own bandwidth to be enforced, thus we need one timer per task.
> > +        */
> > +       struct hrtimer dl_timer; 
> 
> This is for the bandwidth replenishment right? 
Yep.

> Not the runtime throttle?
> 
Not, that is done on tick basis, or by means of hrtick, if enabled.

> The throttle thing should only need a single timer per rq as only the
> current task will be consuming runtime.
>
Sure, and that's what the hrtick does (again, if enabled).

Thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 02/22] sched: add extended scheduling interface
  2010-11-10 17:28   ` Peter Zijlstra
  2010-11-10 19:26     ` Peter Zijlstra
@ 2010-11-10 22:17     ` Raistlin
  2010-11-10 22:57       ` Tommaso Cucinotta
  2010-11-11 13:32       ` Peter Zijlstra
  2010-11-10 22:24     ` Raistlin
  2 siblings, 2 replies; 135+ messages in thread
From: Raistlin @ 2010-11-10 22:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 1510 bytes --]

On Wed, 2010-11-10 at 18:28 +0100, Peter Zijlstra wrote:
> On Fri, 2010-10-29 at 08:27 +0200, Raistlin wrote:
> > +struct sched_param_ex {
> > +       int sched_priority;
> > +       struct timespec sched_runtime;
> > +       struct timespec sched_deadline;
> > +       struct timespec sched_period;
> > +       unsigned int sched_flags;
> > +
> > +       struct timespec curr_runtime;
> > +       struct timespec used_runtime;
> > +       struct timespec curr_deadline;
> > +}; 
> 
> It would be better for alignment reasons to move the sched_flags field
> next to the sched_priority field.
> 
Makes sense, thanks. :-)

> I would suggest we add at least one more field so we can implement the
> stochastic model from UNC, sched_runtime_dev or sched_runtime_var or
> somesuch.
> 
Ok, no problem with that too.

BTW, as Dhaval was suggesting, are (after those changes) fine with this
new sched_param? Do we need some further mechanism to grant its
extendability?
Padding?
Versioning?
void *data field?
Whatever?

:-O

I'd like very much to have some discussion here, if you think it is
needed, in hope of avoiding future ABI issues as much as possible! :-P

Thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 02/22] sched: add extended scheduling interface
  2010-11-10 17:28   ` Peter Zijlstra
  2010-11-10 19:26     ` Peter Zijlstra
  2010-11-10 22:17     ` Raistlin
@ 2010-11-10 22:24     ` Raistlin
  2 siblings, 0 replies; 135+ messages in thread
From: Raistlin @ 2010-11-10 22:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 1455 bytes --]

On Wed, 2010-11-10 at 18:28 +0100, Peter Zijlstra wrote:
> On Fri, 2010-10-29 at 08:27 +0200, Raistlin wrote:
> > +struct sched_param_ex {
> > +       int sched_priority;
> > +       struct timespec sched_runtime;
> > +       struct timespec sched_deadline;
> > +       struct timespec sched_period;
> > +       unsigned int sched_flags;
> > +
> > +       struct timespec curr_runtime;
> > +       struct timespec used_runtime;
> > +       struct timespec curr_deadline;
> > +}; 
>
> I would suggest we add at least one more field so we can implement the
> stochastic model from UNC, sched_runtime_dev or sched_runtime_var or
> somesuch.
> 
Moreover, I really think that the capability of reporting back current
and used runtime (and deadline) would be very useful for implementing
more complex (and effective) scheduling behaviour in userspace... And in
fact I added them here.

Something I was not so sure, and thus about what I wanted your opinion,
was if I should put these things here --so that they are retrievable by
a sched_getparam[_ex, 2], or either add yet another syscall specific for
that? Thoughts?

Thanks,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation
  2010-11-10 19:45   ` Peter Zijlstra
@ 2010-11-10 22:26     ` Raistlin
  0 siblings, 0 replies; 135+ messages in thread
From: Raistlin @ 2010-11-10 22:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 742 bytes --]

On Wed, 2010-11-10 at 20:45 +0100, Peter Zijlstra wrote:
> On Fri, 2010-10-29 at 08:30 +0200, Raistlin wrote:
> > @@ -4922,7 +4965,11 @@ recheck:
> 
> if you add:
> 
> QUILT_DIFF_OPTS="-F ^[[:alpha:]\$_].*[^:]\$"
> 
> to your /etc/quilt.quiltrc, it won't see labels as functions.
>
Mmm... Even if I don't use quilt at all and all this is the output of
git-format-patch? :-O

BTW, thanks, I'll figure it out! ;-P

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 02/22] sched: add extended scheduling interface
  2010-11-10 16:12     ` Dhaval Giani
@ 2010-11-10 22:45       ` Raistlin
  0 siblings, 0 replies; 135+ messages in thread
From: Raistlin @ 2010-11-10 22:45 UTC (permalink / raw)
  To: Dhaval Giani
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, oleg, Frederic Weisbecker, Darren Hart,
	Johan Eker, p.faure, linux-kernel, Claudio Scordino,
	michael trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Nicola Manica, Luca Abeni, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 1323 bytes --]

On Wed, 2010-11-10 at 17:12 +0100, Dhaval Giani wrote:
> >> + *  @curr_runtime       task's currently available runtime
> >> + *  @used_runtime       task's totally used runtime
> >> + *  @curr_deadline      task's current absolute deadline
> >> + *
> >> + * Given this task model, there are a multiplicity of scheduling algorithms
> >> + * and policies, that can be used to ensure all the tasks will make their
> >> + * timing constraints.
> >> + */
> >> +struct sched_param_ex {
> >> +     int sched_priority;
> >> +     struct timespec sched_runtime;
> >> +     struct timespec sched_deadline;
> >> +     struct timespec sched_period;
> >> +     unsigned int sched_flags;
> >> +
> >> +     struct timespec curr_runtime;
> >> +     struct timespec used_runtime;
> >> +     struct timespec curr_deadline;
> 
> Can we expose soem of these details via schedstats as opposed to a syscall?
> 
Actually, good point... schedstats seems very reasonable to me... What
do the others think?

Thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 02/22] sched: add extended scheduling interface
  2010-11-10 22:17     ` Raistlin
@ 2010-11-10 22:57       ` Tommaso Cucinotta
  2010-11-11 13:32       ` Peter Zijlstra
  1 sibling, 0 replies; 135+ messages in thread
From: Tommaso Cucinotta @ 2010-11-10 22:57 UTC (permalink / raw)
  To: Raistlin
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, oleg, Frederic Weisbecker, Darren Hart,
	Johan Eker, p.faure, linux-kernel, Claudio Scordino,
	michael trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Nicola Manica, Luca Abeni, Dhaval Giani, Harald Gustafsson,
	paulmck

Il 10/11/2010 23:17, Raistlin ha scritto:
>> I would suggest we add at least one more field so we can implement the
>> stochastic model from UNC, sched_runtime_dev or sched_runtime_var or
>> somesuch.
> Do we need some further mechanism to grant its
> extendability?
> Padding?
> Versioning?
> void *data field?
> Whatever?
This is a key point. Let me copy text from a slide of my LPC main-conf talk:

Warning: features & parameters may easily grow
- Addition of parameters, such as
     - deadline
     - desired vs guaranteed runtime (for adaptive reservations & 
controlled overcommitment)
- Set of flags for controlling variations on behavior
     - work conserving vs non-conserving reservations
     - what happens at fork() time
     - what happens on tasks death (automatic reclamation)
     - notifications from kernel (e.g., runtime exhaustion)
- Controlled access to RT scheduling by unprivileged
    applications (e.g., per-user “quotas”)
- Monitoring (e.g., residual runtime, available bandwidth)
- Integration/interaction with power management
    (e.g., spec of per-cpu-frequency budget)

How can we guarantee extensibility (or replacement) of parameters in the 
future ?

What about something like _attr_*() in POSIX-like interfaces ?

     T.

-- 
Tommaso Cucinotta, Computer Engineering PhD, Researcher
ReTiS Lab, Scuola Superiore Sant'Anna, Pisa, Italy
Tel +39 050 882 024, Fax +39 050 882 003
http://retis.sssup.it/people/tommaso


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 02/22] sched: add extended scheduling interface
  2010-11-10 19:26     ` Peter Zijlstra
@ 2010-11-10 23:33       ` Tommaso Cucinotta
  2010-11-11 12:19         ` Peter Zijlstra
  0 siblings, 1 reply; 135+ messages in thread
From: Tommaso Cucinotta @ 2010-11-10 23:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, oleg, Frederic Weisbecker, Darren Hart,
	Johan Eker, p.faure, linux-kernel, Claudio Scordino,
	michael trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Nicola Manica, Luca Abeni, Dhaval Giani, Harald Gustafsson,
	paulmck

Il 10/11/2010 20:26, Peter Zijlstra ha scritto:
>> I would suggest we add at least one more field so we can implement the
>> stochastic model from UNC, sched_runtime_dev or sched_runtime_var or
>> somesuch.
> Oh, and their model has something akin to: sched_runtime_max, these
> Gaussian bell curves go to inf. which is kinda bad for trying to compute
> bounds.
If I understand well the paper you're referring to, the actual admission 
test would require also to specify a maximum acceptable expected 
tardiness, and/or proper quantiles of the tardiness distribution, and 
also it would require to solve a linear programming optimization problem 
in order to check those bounds. You don't want this stuff to go into the 
kernel, do you ?

There are plenty of complex schedulability tests for complex (and also 
distributed) RT applications modeled in a more or less complex way, 
scheduled under a variety of scheduling policies, with models including 
maximum and stochastic blocking times, task dependencies, offsets, and I 
don't know whatever else. These tests may be part of a user-space 
component. I would recommend to keep at the kernel level only a bare 
minimal set of functionality.

Deadlines different from periods are already a first complexity that I'm 
not sure we want to have in the interface. The easiest thing you can do 
there is to consider simply the minimum among the relative deadline and 
the period, but that would be equivalent to having only one parameter. 
More importantly, realizing complex admission control tests raises the 
issue of how to represent the "availability of RT CPU power" (as needed 
by "higher-level" logic/middleware). As far as we keep simple 
utilization-based admission tests (which might optionally be disabled), 
we still have some chance of representing such quantity.

Apologies for my 2 poor cents. I hope to see here a discussion (and I'll 
try to shut-up as much as possible :-) ).

     T.

-- 
Tommaso Cucinotta, Computer Engineering PhD, Researcher
ReTiS Lab, Scuola Superiore Sant'Anna, Pisa, Italy
Tel +39 050 882 024, Fax +39 050 882 003
http://retis.sssup.it/people/tommaso


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation
  2010-11-10 19:43   ` Peter Zijlstra
@ 2010-11-11  1:02     ` Raistlin
  0 siblings, 0 replies; 135+ messages in thread
From: Raistlin @ 2010-11-11  1:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 3441 bytes --]

On Wed, 2010-11-10 at 20:43 +0100, Peter Zijlstra wrote:
> On Fri, 2010-10-29 at 08:30 +0200, Raistlin wrote:
> > + * We are being explicitly informed that a new instance is starting,
> > + * and this means that:
> > + *  - the absolute deadline of the entity has to be placed at
> > + *    current time + relative deadline;
> > + *  - the runtime of the entity has to be set to the maximum value.
> 
> When exactly are we a new instance? From a quick look dl_new gets set
> after a sched_setscheduler() call, is that the only way?
> 
One of the only two ways. Later in the queue, that flag is set by a new
system call, i.e., sched_wait_interval, that can be used to inform the
scheduler (for example at the end of a periodic/sporadic job) that an
instance just ended. Moreover, it can be exploited by a task which want
the scheduler to wake it up when it can be given its full runtime.
It as been added as a consequence of the discussion happened in Dresden,
at last year RTLWS, aside of my presentation...

Whether or not this could be useful, I don't know, and I accept comments
as usual. My opinion is that it might be something worthwhile to have,
especially from the point of view of hard real-time-ish scenarios, but
we can remove it appears unnecessary.

> Could a task calling sched_setscheduler() on itself cheat the system?
>
I obviously might be wrong (especially at this time), but I would say no
for the following reasons.

If you are an overrunning -deadline task calling sched_setscheduler()
the deactivate_task->dequeue_task->dequeue_task_dl() below will trigger
the bandwidth enforcement, i.e., will set dl_throttled=1 and start
dl_timer:
	...
        on_rq = p->se.on_rq;                                                                                          
        running = task_current(rq, p);                                                                                
        if (on_rq)
                deactivate_task(rq, p, 0);
        if (running)
                p->sched_class->put_prev_task(rq, p);
	...

Later, this enqueue:
	...
	if (running)                                                                                                  
                p->sched_class->set_curr_task(rq);                                                                    
        if (on_rq) {
                activate_task(rq, p, 0);                                                                              
                                                                                                                      
                check_class_changed(rq, p, prev_class, oldprio, running);
        }
	...

even if it will find dl_new=1, will not enqueue the task back in its
dl_rq (since dl_throttled=1). The actual enqueueing happens at the
firing of dl_timer, where an update instead than a replenishment will be
performed, right because of the fact that dl_new=1. This means the
runtime will be fully replenished and the deadline moved toward
rq->clock+dl_se->dl_deadline.

Did this answer your question?

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation
  2010-11-10 20:21   ` Peter Zijlstra
@ 2010-11-11  1:18     ` Raistlin
  2010-11-11 13:13       ` Peter Zijlstra
  0 siblings, 1 reply; 135+ messages in thread
From: Raistlin @ 2010-11-11  1:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 1583 bytes --]

On Wed, 2010-11-10 at 21:21 +0100, Peter Zijlstra wrote:
> > +       if (dl_time_before(dl_se->deadline, rq->clock) ||
> > +           dl_entity_overflow(dl_se, rq->clock)) {
> > +               dl_se->deadline = rq->clock + dl_se->dl_deadline;
> > +               dl_se->runtime = dl_se->dl_runtime;
> > +       }
> > +} 
> 
> Can't we loose runtime deficit this way?
>
No, this should not be the case (I hope!). The rationale is basically
the same of the other e-mail about new instances.

In fact, a task that goes to sleep with some available runtime will be
given new parameters or not, depending on the return value of
dl_entity_overflow, and that's fine, right?

On the other hand, a task blocking while in overrun will (at dequeue_*
and/or put_* time) trigger the bandwidth enforcement logic (which arms
dl_timer) so that:
 - if unblocking happens _before_ it becomes eligible again, the 
   enqueue will be later handled by the dl_timer itself, when it'll
   fire, and the task will be given a replenishment starting from its
   negative runtime;
 - if unblocking happens _later_ than the firing of dl_timer, resetting
   the scheduling parameters should be just fine, from the bandwidth
   point of view.

Does it make sense?

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 02/22] sched: add extended scheduling interface
  2010-11-10 23:33       ` Tommaso Cucinotta
@ 2010-11-11 12:19         ` Peter Zijlstra
  0 siblings, 0 replies; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-11 12:19 UTC (permalink / raw)
  To: Tommaso Cucinotta
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, oleg, Frederic Weisbecker, Darren Hart,
	Johan Eker, p.faure, linux-kernel, Claudio Scordino,
	michael trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Nicola Manica, Luca Abeni, Dhaval Giani, Harald Gustafsson,
	paulmck

On Thu, 2010-11-11 at 00:33 +0100, Tommaso Cucinotta wrote:
> Il 10/11/2010 20:26, Peter Zijlstra ha scritto:
> >> I would suggest we add at least one more field so we can implement the
> >> stochastic model from UNC, sched_runtime_dev or sched_runtime_var or
> >> somesuch.
> > Oh, and their model has something akin to: sched_runtime_max, these
> > Gaussian bell curves go to inf. which is kinda bad for trying to compute
> > bounds.
> If I understand well the paper you're referring to, the actual admission 
> test would require also to specify a maximum acceptable expected 
> tardiness, and/or proper quantiles of the tardiness distribution, and 
> also it would require to solve a linear programming optimization problem 
> in order to check those bounds. You don't want this stuff to go into the 
> kernel, do you ?

I'm not sure it does, the admission test only uses the average runtime
and uses the fact that it averages out to this (or less) to ensure
tardiness is bounded.

So the stochastic model allows for temporal overload situations but
because of the average nature next jobs must make up for the overrun of
a previous job, negating the overload.

So on average the system isn't overloaded.

The only reason we need the max runtime limit is because avg+stdev dont
actually place a bound on anything, like said the Gaussian bell curve
goes all the way out to infinity.



^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation
  2010-11-11  1:18     ` Raistlin
@ 2010-11-11 13:13       ` Peter Zijlstra
  0 siblings, 0 replies; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-11 13:13 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Thu, 2010-11-11 at 02:18 +0100, Raistlin wrote:
> On Wed, 2010-11-10 at 21:21 +0100, Peter Zijlstra wrote:
> > > +       if (dl_time_before(dl_se->deadline, rq->clock) ||
> > > +           dl_entity_overflow(dl_se, rq->clock)) {
> > > +               dl_se->deadline = rq->clock + dl_se->dl_deadline;
> > > +               dl_se->runtime = dl_se->dl_runtime;
> > > +       }
> > > +} 
> > 
> > Can't we loose runtime deficit this way?

>  - if unblocking happens _before_ it becomes eligible again, the 
>    enqueue will be later handled by the dl_timer itself, when it'll
>    fire, and the task will be given a replenishment starting from its
>    negative runtime;
>  - if unblocking happens _later_ than the firing of dl_timer, resetting
>    the scheduling parameters should be just fine, from the bandwidth
>    point of view.
> 
> Does it make sense?

Yes, I think so. Thanks!

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 02/22] sched: add extended scheduling interface
  2010-11-10 22:17     ` Raistlin
  2010-11-10 22:57       ` Tommaso Cucinotta
@ 2010-11-11 13:32       ` Peter Zijlstra
  2010-11-11 13:54         ` Raistlin
  2010-11-11 14:05         ` Dhaval Giani
  1 sibling, 2 replies; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-11 13:32 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Wed, 2010-11-10 at 23:17 +0100, Raistlin wrote:
> On Wed, 2010-11-10 at 18:28 +0100, Peter Zijlstra wrote:
> > On Fri, 2010-10-29 at 08:27 +0200, Raistlin wrote:
> > > +struct sched_param_ex {
> > > +       int sched_priority;
> > > +       struct timespec sched_runtime;
> > > +       struct timespec sched_deadline;
> > > +       struct timespec sched_period;
> > > +       unsigned int sched_flags;
> > > +
> > > +       struct timespec curr_runtime;
> > > +       struct timespec used_runtime;
> > > +       struct timespec curr_deadline;
> > > +}; 
> > 
> > It would be better for alignment reasons to move the sched_flags field
> > next to the sched_priority field.
> > 
> Makes sense, thanks. :-)
> 
> > I would suggest we add at least one more field so we can implement the
> > stochastic model from UNC, sched_runtime_dev or sched_runtime_var or
> > somesuch.
> > 
> Ok, no problem with that too.
> 
> BTW, as Dhaval was suggesting, are (after those changes) fine with this
> new sched_param? Do we need some further mechanism to grant its
> extendability?
> Padding?
> Versioning?
> void *data field?
> Whatever?
> 
> :-O
> 
> I'd like very much to have some discussion here, if you think it is
> needed, in hope of avoiding future ABI issues as much as possible! :-P

Right, so you mentioned doing s/_ex/2/ on all this stuff, which brings
it more in line with that other syscalls have done.

The last three parameters look to be output only as I've not yet found
code that reads it, and __getparam_dl() doesn't even appear to set
used_runtime.

One thing you can do is add some padding, versioning and void*
extentions are doable for the setparam() path, but getparam() is going
to be mighty interesting.



^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 02/22] sched: add extended scheduling interface
  2010-11-11 13:32       ` Peter Zijlstra
@ 2010-11-11 13:54         ` Raistlin
  2010-11-11 14:08           ` Peter Zijlstra
  2010-11-11 14:05         ` Dhaval Giani
  1 sibling, 1 reply; 135+ messages in thread
From: Raistlin @ 2010-11-11 13:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 1831 bytes --]

On Thu, 2010-11-11 at 14:32 +0100, Peter Zijlstra wrote:
> > BTW, as Dhaval was suggesting, are (after those changes) fine with this
> > new sched_param? Do we need some further mechanism to grant its
> > extendability?
> > Padding?
> > Versioning?
> > void *data field?
> > Whatever?
> > 
> > :-O
> > 
> > I'd like very much to have some discussion here, if you think it is
> > needed, in hope of avoiding future ABI issues as much as possible! :-P
> 
> Right, so you mentioned doing s/_ex/2/ on all this stuff, which brings
> it more in line with that other syscalls have done.
> 
Sure, this is necessary and easy to achieve. :-)

> The last three parameters look to be output only as I've not yet found
> code that reads it, and __getparam_dl() doesn't even appear to set
> used_runtime.
> 
Yeah, just kind of statistical reporting of the task's behaviour. That's
why I was in agreement with Dhaval about using schedstats for those
(bumping the version, obviously). What do you think?

> One thing you can do is add some padding, versioning and void*
> extentions are doable for the setparam() path, but getparam() is going
> to be mighty interesting.
> 
Mmm... So, tell me if I got it well: I remove the last three parameters
(e.g., moving them toward schedstats) and add (besides _var and _max)
some padding? It that correct?

what about the len <== sizeof(struct sched_param2) in
sched_{set,get}{param,scheduler}2()... Does this still make sense, or
are we removing it?

Thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 02/22] sched: add extended scheduling interface
  2010-11-11 13:32       ` Peter Zijlstra
  2010-11-11 13:54         ` Raistlin
@ 2010-11-11 14:05         ` Dhaval Giani
  1 sibling, 0 replies; 135+ messages in thread
From: Dhaval Giani @ 2010-11-11 14:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, oleg, Frederic Weisbecker, Darren Hart,
	Johan Eker, p.faure, linux-kernel, Claudio Scordino,
	michael trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Nicola Manica, Luca Abeni, Harald Gustafsson, paulmck

On Thu, Nov 11, 2010 at 02:32:13PM +0100, Peter Zijlstra wrote:
> On Wed, 2010-11-10 at 23:17 +0100, Raistlin wrote:
> > On Wed, 2010-11-10 at 18:28 +0100, Peter Zijlstra wrote:
> > > On Fri, 2010-10-29 at 08:27 +0200, Raistlin wrote:
> > > > +struct sched_param_ex {
> > > > +       int sched_priority;
> > > > +       struct timespec sched_runtime;
> > > > +       struct timespec sched_deadline;
> > > > +       struct timespec sched_period;
> > > > +       unsigned int sched_flags;
> > > > +
> > > > +       struct timespec curr_runtime;
> > > > +       struct timespec used_runtime;
> > > > +       struct timespec curr_deadline;
> > > > +}; 
> > > 
> > > It would be better for alignment reasons to move the sched_flags field
> > > next to the sched_priority field.
> > > 
> > Makes sense, thanks. :-)
> > 
> > > I would suggest we add at least one more field so we can implement the
> > > stochastic model from UNC, sched_runtime_dev or sched_runtime_var or
> > > somesuch.
> > > 
> > Ok, no problem with that too.
> > 
> > BTW, as Dhaval was suggesting, are (after those changes) fine with this
> > new sched_param? Do we need some further mechanism to grant its
> > extendability?
> > Padding?
> > Versioning?
> > void *data field?
> > Whatever?
> > 
> > :-O
> > 
> > I'd like very much to have some discussion here, if you think it is
> > needed, in hope of avoiding future ABI issues as much as possible! :-P
> 
> Right, so you mentioned doing s/_ex/2/ on all this stuff, which brings
> it more in line with that other syscalls have done.
> 
> The last three parameters look to be output only as I've not yet found
> code that reads it, and __getparam_dl() doesn't even appear to set
> used_runtime.
> 

So, do you think its a good idea moving this information to schedstats?
It seems more in line for monitoring, which schedstat seems a more
appropriate destination.

thanks,
Dhaval

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 02/22] sched: add extended scheduling interface
  2010-11-11 13:54         ` Raistlin
@ 2010-11-11 14:08           ` Peter Zijlstra
  2010-11-11 17:27             ` Raistlin
  0 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-11 14:08 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Thu, 2010-11-11 at 14:54 +0100, Raistlin wrote:
> > The last three parameters look to be output only as I've not yet found
> > code that reads it, and __getparam_dl() doesn't even appear to set
> > used_runtime.
> > 
> Yeah, just kind of statistical reporting of the task's behaviour. That's
> why I was in agreement with Dhaval about using schedstats for those
> (bumping the version, obviously). What do you think?

So its pure output?  In that case its not really a nice fit for
sched_param, however..

I'm not really a fan of schedstat, esp if you have to use it very
frequently, the overhead of open()+read()+close() + parsing text is
quite high.

Then again, if people are really going to use this (big if I guess) we
could add yet another syscall for this or whatever.

> > One thing you can do is add some padding, versioning and void*
> > extentions are doable for the setparam() path, but getparam() is going
> > to be mighty interesting.
> > 
> Mmm... So, tell me if I got it well: I remove the last three parameters
> (e.g., moving them toward schedstats) and add (besides _var and _max)
> some padding? It that correct?

grmbl, so I was going to say, just pad it to a nice 2^n size, but then I
saw that struct timespec is defined as two long's, which means we're
going to have to do compat crap.

Thomas is there a sane time format in existence? I thought the whole
purpose of timeval/timespec was to avoid having to use a u64, but then
using longs as opposed to int totally defeats the purpose. 

> what about the len <== sizeof(struct sched_param2) in
> sched_{set,get}{param,scheduler}2()... Does this still make sense, or
> are we removing it? 

Since we're going for a constant sized structure we might as well take
it out.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation
  2010-10-29  6:30 ` [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation Raistlin
                     ` (3 preceding siblings ...)
  2010-11-10 20:21   ` Peter Zijlstra
@ 2010-11-11 14:13   ` Peter Zijlstra
  2010-11-11 14:28     ` Raistlin
  2010-11-11 14:17   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  7 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-11 14:13 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-10-29 at 08:30 +0200, Raistlin wrote:
> +       delta_exec = rq->clock - curr->se.exec_start;

This changed to rq->clock_task, the difference between rq->clock and
rq->clock_task is that the latter only counts time actually spend on the
task context (ie. it doesn't count softirq and hardirq context).



^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation
  2010-10-29  6:30 ` [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation Raistlin
                     ` (4 preceding siblings ...)
  2010-11-11 14:13   ` Peter Zijlstra
@ 2010-11-11 14:17   ` Peter Zijlstra
  2010-11-11 18:33     ` Raistlin
  2010-11-11 14:25   ` Peter Zijlstra
  2010-11-14  8:54   ` Raistlin
  7 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-11 14:17 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-10-29 at 08:30 +0200, Raistlin wrote:
> +static void update_curr_dl(struct rq *rq)
> +{
> +       struct task_struct *curr = rq->curr;
> +       struct sched_dl_entity *dl_se = &curr->dl;
> +       u64 delta_exec;
> +
> +       if (!dl_task(curr) || !on_dl_rq(dl_se))
> +               return;
> +
> +       delta_exec = rq->clock - curr->se.exec_start;
> +       if (unlikely((s64)delta_exec < 0))
> +               delta_exec = 0;
> +
> +       schedstat_set(curr->se.statistics.exec_max,
> +                     max(curr->se.statistics.exec_max, delta_exec));
> +
> +       curr->se.sum_exec_runtime += delta_exec;
> +       account_group_exec_runtime(curr, delta_exec);
> +
> +       curr->se.exec_start = rq->clock;
> +       cpuacct_charge(curr, delta_exec);
> +
> +       dl_se->runtime -= delta_exec;
> +       if (dl_runtime_exceeded(rq, dl_se)) {
> +               __dequeue_task_dl(rq, curr, 0);
> +               if (likely(start_dl_timer(dl_se)))
> +                       dl_se->dl_throttled = 1;
> +               else
> +                       enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
> +
> +               resched_task(curr);
> +       }
> +} 

So you keep the current task in the rb-tree? If you remove the current
task from the tree you don't have to do the whole dequeue/enqueue thing.
Then again, I guess it only really matters once you push the deadline,
which shouldn't be that often.

Also, you might want to put a conditional around that resched, no point
rescheduling if you're still the leftmost task.



^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation
  2010-10-29  6:30 ` [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation Raistlin
                     ` (5 preceding siblings ...)
  2010-11-11 14:17   ` Peter Zijlstra
@ 2010-11-11 14:25   ` Peter Zijlstra
  2010-11-11 14:33     ` Raistlin
  2010-11-14  8:54   ` Raistlin
  7 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-11 14:25 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-10-29 at 08:30 +0200, Raistlin wrote:
> +static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
> +                                 int flags)
> +{
> +       if (!dl_task(rq->curr) || (dl_task(p) &&
> +           dl_time_before(p->dl.deadline, rq->curr->dl.deadline)))
> +               resched_task(rq->curr);
> +} 

every moment now a patch will hit -tip that ensures
->check_preempt_curr() is only called when both the current and waking
task belong to the same sched_class.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation
  2010-11-11 14:13   ` Peter Zijlstra
@ 2010-11-11 14:28     ` Raistlin
  0 siblings, 0 replies; 135+ messages in thread
From: Raistlin @ 2010-11-11 14:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 822 bytes --]

On Thu, 2010-11-11 at 15:13 +0100, Peter Zijlstra wrote:
> On Fri, 2010-10-29 at 08:30 +0200, Raistlin wrote:
> > +       delta_exec = rq->clock - curr->se.exec_start;
> 
> This changed to rq->clock_task, the difference between rq->clock and
> rq->clock_task is that the latter only counts time actually spend on the
> task context (ie. it doesn't count softirq and hardirq context).
> 
Yeah, I noticed later that Venki's changes in accounting could be
super-useful here. I'm already on it! :-P

Thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 06/22] sched: SCHED_DEADLINE handles spacial kthreads
  2010-10-29  6:31 ` [RFC][PATCH 06/22] sched: SCHED_DEADLINE handles spacial kthreads Raistlin
@ 2010-11-11 14:31   ` Peter Zijlstra
  2010-11-11 14:50     ` Dario Faggioli
  2010-11-11 14:34   ` Peter Zijlstra
  2010-11-11 14:46   ` [RFC][PATCH 06/22] sched: SCHED_DEADLINE handles spacial kthreads Peter Zijlstra
  2 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-11 14:31 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-10-29 at 08:31 +0200, Raistlin wrote:
> 
> There sometimes is the need of executing a task as if it would
> have the maximum possible priority in the entire system, i.e.,
> whenever it gets ready it must run! This is for example the case
> for some maintainance kernel thread like migration and (sometimes)
> watchdog or ksoftirq.
> 
> Since SCHED_DEADLINE is now the highest priority scheduling class
> these tasks have to be handled therein, but it is not obvious how
> to choose a runtime and a deadline that guarantee what explained
> above. Therefore, we need a mean of recognizing system tasks inside
> the -deadline class and always run them as soon as possible, without
> any kind of runtime and bandwidth limitation.
> 
> This patch:
>  - adds the SF_HEAD flag, which identify a special task that need
>    absolute prioritization among any other task;
>  - ensures that special tasks always preempt everyone else (and,
>    obviously, are not preempted by non special tasks);
>  - disables runtime and bandwidth checking for such tasks, hoping
>    that the interference they cause is small enough.
> 

Yet in the previous patch you had this hunk:

> +++ b/kernel/sched_stoptask.c
> @@ -81,7 +81,7 @@ get_rr_interval_stop(struct rq *rq, struct
> task_struct *task)
>   * Simple, special scheduling class for the per-CPU stop tasks:
>   */
>  static const struct sched_class stop_sched_class = {
> -       .next                   = &rt_sched_class,
> +       .next                   = &dl_sched_class,
>  
>         .enqueue_task           = enqueue_task_stop,
>         .dequeue_task           = dequeue_task_stop,


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation
  2010-11-11 14:25   ` Peter Zijlstra
@ 2010-11-11 14:33     ` Raistlin
  0 siblings, 0 replies; 135+ messages in thread
From: Raistlin @ 2010-11-11 14:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 1057 bytes --]

On Thu, 2010-11-11 at 15:25 +0100, Peter Zijlstra wrote:
> On Fri, 2010-10-29 at 08:30 +0200, Raistlin wrote:
> > +static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
> > +                                 int flags)
> > +{
> > +       if (!dl_task(rq->curr) || (dl_task(p) &&
> > +           dl_time_before(p->dl.deadline, rq->curr->dl.deadline)))
> > +               resched_task(rq->curr);
> > +} 
> 
> every moment now a patch will hit -tip that ensures
> ->check_preempt_curr() is only called when both the current and waking
> task belong to the same sched_class.
>
Again, I saw your fix to this, which is later than this patchset... :-)

I'll take care of this while rebasing right in these days.

Thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 06/22] sched: SCHED_DEADLINE handles spacial kthreads
  2010-10-29  6:31 ` [RFC][PATCH 06/22] sched: SCHED_DEADLINE handles spacial kthreads Raistlin
  2010-11-11 14:31   ` Peter Zijlstra
@ 2010-11-11 14:34   ` Peter Zijlstra
  2010-11-11 15:27     ` Oleg Nesterov
  2010-11-11 14:46   ` [RFC][PATCH 06/22] sched: SCHED_DEADLINE handles spacial kthreads Peter Zijlstra
  2 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-11 14:34 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-10-29 at 08:31 +0200, Raistlin wrote:
> @@ -6071,7 +6104,7 @@ void sched_idle_next(void)
>          */
>         raw_spin_lock_irqsave(&rq->lock, flags);
>  
> -       __setscheduler(rq, p, SCHED_FIFO, MAX_RT_PRIO-1);
> +       __setscheduler_dl_special(rq, p);
>  
>         activate_task(rq, p, 0);
>   

Ingo, happen to know if this is really needed these days? hotplug should
have migrated all other tasks away, leaving only the idle task to run.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 06/22] sched: SCHED_DEADLINE handles spacial kthreads
  2010-10-29  6:31 ` [RFC][PATCH 06/22] sched: SCHED_DEADLINE handles spacial kthreads Raistlin
  2010-11-11 14:31   ` Peter Zijlstra
  2010-11-11 14:34   ` Peter Zijlstra
@ 2010-11-11 14:46   ` Peter Zijlstra
  2 siblings, 0 replies; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-11 14:46 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-10-29 at 08:31 +0200, Raistlin wrote:
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index d4d918a..9c4c967 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -853,13 +853,9 @@ static int __cpuinit cpu_callback(struct notifier_block *nfb,
>                              cpumask_any(cpu_online_mask));
>         case CPU_DEAD:
>         case CPU_DEAD_FROZEN: {
> -               static struct sched_param param = {
> -                       .sched_priority = MAX_RT_PRIO-1
> -               };
> -
>                 p = per_cpu(ksoftirqd, hotcpu);
>                 per_cpu(ksoftirqd, hotcpu) = NULL;
> -               sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
> +               setscheduler_dl_special(p);
>                 kthread_stop(p);
>                 takeover_tasklets(hotcpu);
>                 break;

So this comes from 1c6b4aa94576, which is something I wouldn't have
bothered merging in the first place, if you pin a cpu with RT tasks like
that you get to keep the pieces, hotplug isn't the only thing that will
go wonky.

Anyway, if you leave the code as is you'll be fine, it'll me above any
FIFO task, but still below deadline tasks and the stop task, neither of
which should be hogging the system like that anyway.

> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 94ca779..2b7f259 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -307,10 +307,9 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
>   */
>  static int watchdog(void *unused)
>  {
> -       static struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
>         struct hrtimer *hrtimer = &__raw_get_cpu_var(watchdog_hrtimer);
>  
> -       sched_setscheduler(current, SCHED_FIFO, &param);
> +       setscheduler_dl_special(current);
>  
>         /* initialize timestamp */
>         __touch_watchdog(); 

I'd be inclined to drop this too, if people get watchdog timeouts it
means the system is really over-commited on deadline tasks and the
watchdog FIFO thread didn't get around to running, something which I
think we both agree shouldn't be happening.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 06/22] sched: SCHED_DEADLINE handles spacial kthreads
  2010-11-11 14:31   ` Peter Zijlstra
@ 2010-11-11 14:50     ` Dario Faggioli
  0 siblings, 0 replies; 135+ messages in thread
From: Dario Faggioli @ 2010-11-11 14:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 1476 bytes --]

On Thu, 2010-11-11 at 15:31 +0100, Peter Zijlstra wrote:
> > Since SCHED_DEADLINE is now the highest priority scheduling class
> > these tasks have to be handled therein, but it is not obvious how
> > to choose a runtime and a deadline that guarantee what explained
> > above. Therefore, we need a mean of recognizing system tasks inside
> > the -deadline class and always run them as soon as possible, without
> > any kind of runtime and bandwidth limitation.
> 
> Yet in the previous patch you had this hunk:
> 
> > +++ b/kernel/sched_stoptask.c
> > @@ -81,7 +81,7 @@ get_rr_interval_stop(struct rq *rq, struct
> > task_struct *task)
> >   * Simple, special scheduling class for the per-CPU stop tasks:
> >   */
> >  static const struct sched_class stop_sched_class = {
> > -       .next                   = &rt_sched_class,
> > +       .next                   = &dl_sched_class,
> >  
> >         .enqueue_task           = enqueue_task_stop,
> >         .dequeue_task           = dequeue_task_stop,
> 
Yep. And (as said on IRC) this needs serious cleanup, in favour of
stop_task!

I'll completely drop patch 06 for next releases.

Thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 06/22] sched: SCHED_DEADLINE handles spacial kthreads
  2010-11-11 14:34   ` Peter Zijlstra
@ 2010-11-11 15:27     ` Oleg Nesterov
  2010-11-11 15:43       ` Peter Zijlstra
  0 siblings, 1 reply; 135+ messages in thread
From: Oleg Nesterov @ 2010-11-11 15:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, Frederic Weisbecker, Darren Hart, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On 11/11, Peter Zijlstra wrote:
>
> On Fri, 2010-10-29 at 08:31 +0200, Raistlin wrote:
> > @@ -6071,7 +6104,7 @@ void sched_idle_next(void)
> >          */
> >         raw_spin_lock_irqsave(&rq->lock, flags);
> >
> > -       __setscheduler(rq, p, SCHED_FIFO, MAX_RT_PRIO-1);
> > +       __setscheduler_dl_special(rq, p);
> >
> >         activate_task(rq, p, 0);
> >
>
> Ingo, happen to know if this is really needed these days? hotplug should
> have migrated all other tasks away, leaving only the idle task to run.

This is called before CPU_DEAD stage which migrates all tasks away.


Sorry, can't resist, off-topic quiestion. Do we really need
migration_call()->migrate_live_tasks() ?

With the recent changes, try_to_wake_up() can never choose
the dead (!cpu_online) cpu if the task was deactivated.

Looks like we should only worry about the running tasks, and
migrate_dead_tasks()->pick_next_task() loop seems to all work
we need.

(Of course, we can't just remove migrate_live_tasks(), at leat
 migrate_dead() needs simple changes).

What do you think?

Oleg.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 06/22] sched: SCHED_DEADLINE handles spacial kthreads
  2010-11-11 15:27     ` Oleg Nesterov
@ 2010-11-11 15:43       ` Peter Zijlstra
  2010-11-11 16:32         ` Oleg Nesterov
  0 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-11 15:43 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, Frederic Weisbecker, Darren Hart, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Thu, 2010-11-11 at 16:27 +0100, Oleg Nesterov wrote:
> On 11/11, Peter Zijlstra wrote:
> >
> > On Fri, 2010-10-29 at 08:31 +0200, Raistlin wrote:
> > > @@ -6071,7 +6104,7 @@ void sched_idle_next(void)
> > >          */
> > >         raw_spin_lock_irqsave(&rq->lock, flags);
> > >
> > > -       __setscheduler(rq, p, SCHED_FIFO, MAX_RT_PRIO-1);
> > > +       __setscheduler_dl_special(rq, p);
> > >
> > >         activate_task(rq, p, 0);
> > >
> >
> > Ingo, happen to know if this is really needed these days? hotplug should
> > have migrated all other tasks away, leaving only the idle task to run.
> 
> This is called before CPU_DEAD stage which migrates all tasks away.
> 
> 
> Sorry, can't resist, off-topic quiestion. Do we really need
> migration_call()->migrate_live_tasks() ?
> 
> With the recent changes, try_to_wake_up() can never choose
> the dead (!cpu_online) cpu if the task was deactivated.
> 
> Looks like we should only worry about the running tasks, and
> migrate_dead_tasks()->pick_next_task() loop seems to all work
> we need.
> 
> (Of course, we can't just remove migrate_live_tasks(), at leat
>  migrate_dead() needs simple changes).
> 
> What do you think?

Yes, I think we can make that work, we could even move that
migrate_live_tasks() into CPU_DYING, which is before this point.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 06/22] sched: SCHED_DEADLINE handles spacial kthreads
  2010-11-11 15:43       ` Peter Zijlstra
@ 2010-11-11 16:32         ` Oleg Nesterov
  2010-11-13 18:35           ` Peter Zijlstra
  0 siblings, 1 reply; 135+ messages in thread
From: Oleg Nesterov @ 2010-11-11 16:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, Frederic Weisbecker, Darren Hart, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On 11/11, Peter Zijlstra wrote:
>
> On Thu, 2010-11-11 at 16:27 +0100, Oleg Nesterov wrote:
> >
> > Sorry, can't resist, off-topic quiestion. Do we really need
> > migration_call()->migrate_live_tasks() ?
> >
> > With the recent changes, try_to_wake_up() can never choose
> > the dead (!cpu_online) cpu if the task was deactivated.
> >
> > Looks like we should only worry about the running tasks, and
> > migrate_dead_tasks()->pick_next_task() loop seems to all work
> > we need.
> >
> > (Of course, we can't just remove migrate_live_tasks(), at leat
> >  migrate_dead() needs simple changes).
> >
> > What do you think?
>
> Yes, I think we can make that work, we could even move that
> migrate_live_tasks() into CPU_DYING, which is before this point.

Hmm, I think you are right. In this case we can also simplify
migrate-from-dead-cpu paths, we know that nobody can touch rq->lock,
every CPU runs stop_machine_cpu_stop() with irqs disabled. No need
to get/put task_struct, etc.

Oleg.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 02/22] sched: add extended scheduling interface
  2010-11-11 14:08           ` Peter Zijlstra
@ 2010-11-11 17:27             ` Raistlin
  0 siblings, 0 replies; 135+ messages in thread
From: Raistlin @ 2010-11-11 17:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 1414 bytes --]

On Thu, 2010-11-11 at 15:08 +0100, Peter Zijlstra wrote:
> I'm not really a fan of schedstat, esp if you have to use it very
> frequently, the overhead of open()+read()+close() + parsing text is
> quite high.
> 
> Then again, if people are really going to use this (big if I guess) we
> could add yet another syscall for this or whatever.
> 
Ok, we'll see at that time.

> grmbl, so I was going to say, just pad it to a nice 2^n size, but then I
> saw that struct timespec is defined as two long's, which means we're
> going to have to do compat crap.
> 
> Thomas is there a sane time format in existence? I thought the whole
> purpose of timeval/timespec was to avoid having to use a u64, but then
> using longs as opposed to int totally defeats the purpose. 
> 
Fine, and u64 will be. Going for that...

> > what about the len <== sizeof(struct sched_param2) in
> > sched_{set,get}{param,scheduler}2()... Does this still make sense, or
> > are we removing it? 
> 
> Since we're going for a constant sized structure we might as well take
> it out.
>
... and for that!

thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation
  2010-11-11 14:17   ` Peter Zijlstra
@ 2010-11-11 18:33     ` Raistlin
  0 siblings, 0 replies; 135+ messages in thread
From: Raistlin @ 2010-11-11 18:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 2574 bytes --]

On Thu, 2010-11-11 at 15:17 +0100, Peter Zijlstra wrote:
> > +       dl_se->runtime -= delta_exec;
> > +       if (dl_runtime_exceeded(rq, dl_se)) {
> > +               __dequeue_task_dl(rq, curr, 0);
> > +               if (likely(start_dl_timer(dl_se)))
> > +                       dl_se->dl_throttled = 1;
> > +               else
> > +                       enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
> > +
> > +               resched_task(curr);
> > +       }
> > +} 
> 
> So you keep the current task in the rb-tree? 
>
Yes, I do.

> If you remove the current
> task from the tree you don't have to do the whole dequeue/enqueue thing.
> Then again, I guess it only really matters once you push the deadline,
> which shouldn't be that often.
> 
I'm not sure. The likelihood of runtime overrun/deadline pushing depends
on many things, and might happen even if nothing is going bad... Might
be wanted actually!

Suppose you have some sporadic task of some sort with computation ~10ms
and (minimum) interarrival time of jobs of 100ms. Moreover, I want it to
be able to react with a latency in the order of 100us. If I give it
10ms/100ms (or maybe 12ms/100ms, or whatever overprovisioning is
considered enough to be safe), and an instance arrives as soon as the it
has been throttled I may have to wait for 90ms (88, ?).
Thus, I give it 10us/100us, which means it won't stay throttled for more
tha 90us, but also that it's runtime will be exhausted (and it's
deadline pushed away) 1000 times for a typical instance!

Forgive me for the stupid example... What I was trying to point out is
that, especially considering we don't (want to!) have rock solid WCET
analysis, too much bias toward the case were the scheduling parameters
perfectly matches the applications' ones would not be as much
preferable.

For this reasons, I structured the code this way, and it seems to me
that keeping current out the the tree would complicate the code quite a
bit, but I'm also sure it's doable if you really think it is needed.
Just let me know... :-)

> Also, you might want to put a conditional around that resched, no point
> rescheduling if you're still the leftmost task.
> 
Right. This should be done, indeed.

Thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 08/22] sched: SCHED_DEADLINE avg_update accounting
  2010-10-29  6:33 ` [RFC][PATCH 08/22] sched: SCHED_DEADLINE avg_update accounting Raistlin
@ 2010-11-11 19:16   ` Peter Zijlstra
  0 siblings, 0 replies; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-11 19:16 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-10-29 at 08:33 +0200, Raistlin wrote:
> Make the core scheduler and load balancer aware of the load
> produced by -deadline tasks, by updating the moving average
> like for sched_rt.

I think you can simply add your dl time to sched_rt_avg_update(), no
need to track a second avg, its about !fair anyway.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 09/22] sched: add period support for -deadline tasks
  2010-10-29  6:34 ` [RFC][PATCH 09/22] sched: add period support for -deadline tasks Raistlin
@ 2010-11-11 19:17   ` Peter Zijlstra
  2010-11-11 19:31     ` Raistlin
  0 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-11 19:17 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-10-29 at 08:34 +0200, Raistlin wrote:
> Make it possible to specify a period (different or equal than
> deadline) for -deadline tasks.
> 
I would expect it to be:

runtime <= deadline <= period



^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 10/22] sched: add a syscall to wait for the next instance
  2010-10-29  6:35 ` [RFC][PATCH 10/22] sched: add a syscall to wait for the next instance Raistlin
@ 2010-11-11 19:21   ` Peter Zijlstra
  2010-11-11 19:33     ` Raistlin
  0 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-11 19:21 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-10-29 at 08:35 +0200, Raistlin wrote:
> 
> Introduce sched_wait_interval() syscall (and scheduling class
> interface call). In general, this aims at providing each scheduling
> class with a mean of making one of its own task sleep for some time
> according to some specific rule of the scheduling class itself.
> 
Did we have an actual use case for this? I seem to remember that the
last time we seemed to thing job wakeups are due to external events, in
which case we don't need this.

I think we should try without this patch first and only consider this
once we merged the base functionality and have a solid use-case for
this.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 09/22] sched: add period support for -deadline tasks
  2010-11-11 19:17   ` Peter Zijlstra
@ 2010-11-11 19:31     ` Raistlin
  2010-11-11 19:43       ` Peter Zijlstra
  2010-11-12 13:46       ` Luca Abeni
  0 siblings, 2 replies; 135+ messages in thread
From: Raistlin @ 2010-11-11 19:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 1308 bytes --]

On Thu, 2010-11-11 at 20:17 +0100, Peter Zijlstra wrote:
> On Fri, 2010-10-29 at 08:34 +0200, Raistlin wrote:
> > Make it possible to specify a period (different or equal than
> > deadline) for -deadline tasks.
> > 
> I would expect it to be:
> 
> runtime <= deadline <= period
> 
Well, apart from that really unhappy comment/changelog, it should be
like that in the code, and if it's not, it is what I meant and I'll
change to that as soon as I can! :-)

Since you spotted it... The biggest issue here is admission control
test. Right now this is done against task's bandwidth, i.e.,
sum_i(runtime_i/period_i)<=threshold, but it is unfortunately wrong...
Or at least very, very loose, to the point of being almost useless! :-(

The more correct --in the sense that it at least yield a sufficient (not
necessary!) condition-- thing to do would be
sum_i(runtime_i/min{deadline_i,period_i})<=threshold.

So, what you think we should do? Can I go for this latter option?

Thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 10/22] sched: add a syscall to wait for the next instance
  2010-11-11 19:21   ` Peter Zijlstra
@ 2010-11-11 19:33     ` Raistlin
  0 siblings, 0 replies; 135+ messages in thread
From: Raistlin @ 2010-11-11 19:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 1174 bytes --]

On Thu, 2010-11-11 at 20:21 +0100, Peter Zijlstra wrote:
> On Fri, 2010-10-29 at 08:35 +0200, Raistlin wrote:
> > 
> > Introduce sched_wait_interval() syscall (and scheduling class
> > interface call). In general, this aims at providing each scheduling
> > class with a mean of making one of its own task sleep for some time
> > according to some specific rule of the scheduling class itself.
> > 
> Did we have an actual use case for this? I seem to remember that the
> last time we seemed to thing job wakeups are due to external events, in
> which case we don't need this.
> 
Might find out something, but I mostly agree with what you're saying.

> I think we should try without this patch first and only consider this
> once we merged the base functionality and have a solid use-case for
> this.
>
Consider it as already dropped! :-)

Thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 12/22] sched: add runtime reporting for -deadline tasks
  2010-10-29  6:36 ` [RFC][PATCH 12/22] sched: add runtime reporting " Raistlin
@ 2010-11-11 19:37   ` Peter Zijlstra
  2010-11-12 16:15     ` Raistlin
  0 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-11 19:37 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-10-29 at 08:36 +0200, Raistlin wrote:
> Make it available for the user-space the total amount of runtime
> time it has used from since it became a -deadline task.
> 
> This is something that is typically useful for monitoring from
> user-space the task CPU usage, and maybe implementing at that level
> some more sophisticated scheduling behaviour.
> 
> One example is feedback scheduling, where you try to adapt the
> scheduling parameters of a task by looking at its behaviour in
> a certain interval of time, applying concepts coming from control
> engineering.
> 
> Signed-off-by: Dario Faggioli <raistlin@linux.it>
> ---
>  include/linux/sched.h |    7 +++----
>  kernel/sched.c        |    3 +++
>  kernel/sched_debug.c  |    1 +
>  kernel/sched_dl.c     |    1 +
>  4 files changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8ae947b..b6f0635 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1242,14 +1242,15 @@ struct sched_rt_entity {
>  #endif
>  };
>  
> -#ifdef CONFIG_SCHEDSTATS
>  struct sched_stats_dl {
> +#ifdef CONFIG_SCHEDSTATS
>  	u64			last_dmiss;
>  	u64			last_rorun;
>  	u64			dmiss_max;
>  	u64			rorun_max;
> -};
>  #endif
> +	u64			tot_rtime;
> +};
>  

I know we agreed to pull this from the sched_param2 structure and delay
exposing this information for a while until the base patches got merged
and came up with a solid use-case, but reading this patch makes me
wonder why tsk->se.sum_exec_runtime isn't good enough?

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 09/22] sched: add period support for -deadline tasks
  2010-11-11 19:31     ` Raistlin
@ 2010-11-11 19:43       ` Peter Zijlstra
  2010-11-11 23:33         ` Tommaso Cucinotta
  2010-11-12 13:33         ` Raistlin
  2010-11-12 13:46       ` Luca Abeni
  1 sibling, 2 replies; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-11 19:43 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Thu, 2010-11-11 at 20:31 +0100, Raistlin wrote:
> On Thu, 2010-11-11 at 20:17 +0100, Peter Zijlstra wrote:
> > On Fri, 2010-10-29 at 08:34 +0200, Raistlin wrote:
> > > Make it possible to specify a period (different or equal than
> > > deadline) for -deadline tasks.
> > > 
> > I would expect it to be:
> > 
> > runtime <= deadline <= period
> > 
> Well, apart from that really unhappy comment/changelog, it should be
> like that in the code, and if it's not, it is what I meant and I'll
> change to that as soon as I can! :-)
> 
> Since you spotted it... The biggest issue here is admission control
> test. Right now this is done against task's bandwidth, i.e.,
> sum_i(runtime_i/period_i)<=threshold, but it is unfortunately wrong...
> Or at least very, very loose, to the point of being almost useless! :-(

Right, I have some recollection on that.

> The more correct --in the sense that it at least yield a sufficient (not
> necessary!) condition-- thing to do would be
> sum_i(runtime_i/min{deadline_i,period_i})<=threshold.
> 
> So, what you think we should do? Can I go for this latter option?

I remember we visited this subject last time, but I seem to have
forgotten most details.

So sufficient (but not necessary) means its still a pessimistic approach
but better than the one currently employed, or does it mean its
optimistic and allows for unschedulable sets to be allowed in?


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 15/22] sched: add traceporints for -deadline tasks
  2010-10-29  6:38 ` [RFC][PATCH 15/22] sched: add traceporints " Raistlin
@ 2010-11-11 19:54   ` Peter Zijlstra
  2010-11-12 16:13     ` Raistlin
  0 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-11 19:54 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-10-29 at 08:38 +0200, Raistlin wrote:
> Add tracepoints for the most notable events related to -deadline
> tasks scheduling (new task arrival, context switch, runtime accounting,
> bandwidth enforcement timer, etc.).
> 
> Signed-off-by: Dario Faggioli <raistlin@linux.it>
> Signed-off-by: Harald Gustafsson <harald.gustafsson@ericsson.com>
> ---
>  include/trace/events/sched.h |  203 +++++++++++++++++++++++++++++++++++++++++-
>  kernel/sched.c               |    2 +
>  kernel/sched_dl.c            |   21 +++++
>  3 files changed, 225 insertions(+), 1 deletions(-)
> 
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index f633478..03baa17 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -304,7 +304,6 @@ DECLARE_EVENT_CLASS(sched_stat_template,
>  			(unsigned long long)__entry->delay)
>  );
>  
> -
>  /*
>   * Tracepoint for accounting wait time (time the task is runnable
>   * but not actually running due to scheduler contention).
> @@ -363,6 +362,208 @@ TRACE_EVENT(sched_stat_runtime,
>  );
>  
>  /*
> + * Tracepoint for task switches involving -deadline tasks:
> + */
> +TRACE_EVENT(sched_switch_dl,


We've already got sched_switch(), better extend that. Same for the next
patch, we already have a migration tracepoint, extend that.

And I recently rejected a fifo push/pull tracepoint patch from Steve
because the migration tracepoint was able to provide the same
information.




^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 13/22] sched: add resource limits for -deadline tasks
  2010-10-29  6:37 ` [RFC][PATCH 13/22] sched: add resource limits " Raistlin
@ 2010-11-11 19:57   ` Peter Zijlstra
  2010-11-12 21:30     ` Raistlin
  0 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-11 19:57 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-10-29 at 08:37 +0200, Raistlin wrote:
> Add resource limits for non-root tasks in using the SCHED_DEADLINE
> policy, very similarly to what already exists for RT policies.
> 
> In fact, this patch:
>  - adds the resource limit RLIMIT_DLDLINE, which is the minimum value
>    a user task can use as its own deadline;
>  - adds the resource limit RLIMIT_DLRTIME, which is the maximum value
>    a user task can use as it own runtime.
> 
> Notice that to exploit these, a modified version of the ulimit
> utility and a modified resource.h header file are needed. They
> both will be available on the website of the project.

We might also want to add an additional !SYS_CAP_ADMIN global bandwidth
cap much like the existing sysctl bandwidth cap.



^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 17/22] sched: add signaling overrunning -deadline tasks.
  2010-10-29  6:40 ` [RFC][PATCH 17/22] sched: add signaling overrunning " Raistlin
@ 2010-11-11 21:58   ` Peter Zijlstra
  2010-11-12 15:39     ` Raistlin
  0 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-11 21:58 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-10-29 at 08:40 +0200, Raistlin wrote:
> +static inline void __dl_signal(struct task_struct *tsk, int which)
> +{
> +       struct siginfo info;
> +       long long amount = which == SF_SIG_DMISS ? tsk->dl.stats.last_dmiss :
> +                          tsk->dl.stats.last_rorun;
> +
> +       info.si_signo = SIGXCPU;
> +       info.si_errno = which;
> +       info.si_code = SI_KERNEL;
> +       info.si_pid = 0;
> +       info.si_uid = 0;
> +       info.si_value.sival_int = (int)amount;
> +
> +       /* Correctly take the locks on task's sighand */
> +       __group_send_sig_info(SIGXCPU, &info, tsk);
> +       /* Log what happened to dmesg */
> +       printk(KERN_INFO "SCHED_DEADLINE: 0x%4x by %Ld [ns] in %d (%s)\n",
> +              which, amount, task_pid_nr(tsk), tsk->comm);
> +} 

This being a G-EDF like scheduler with a u<=1 schedulability test, we're
firmly in soft-rt territory which means the above will be very easy to
trigger.. Maybe not spam dmesg?

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 18/22] sched: add reclaiming logic to -deadline tasks
  2010-10-29  6:42 ` [RFC][PATCH 18/22] sched: add reclaiming logic to -deadline tasks Raistlin
@ 2010-11-11 22:12   ` Peter Zijlstra
  2010-11-12 15:36     ` Raistlin
  0 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-11 22:12 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-10-29 at 08:42 +0200, Raistlin wrote:
> The bandwidth enforcing mechanism implemented inside the
> SCHED_DEADLINE policy ensures that overrunning tasks are slowed
> down without interfering with well behaving ones.
> This, however, comes at the price of limiting the capability of
> a task to exploit more bandwidth than it is asigned.
> 
> The current implementation always stops a task that is trying
> to use more than its runtime (every deadline). Something else that
> could be done is to let it continue running, but with a "decreased
> priority". This way, we can exploit full CPU bandwidth and still
> avoid interferences.
> 
> In order of "decreasing the priority" of a deadline task, we can:
>  - let it stay SCHED_DEADLINE and postpone its deadline. This way it
>    will always be scheduled before -rt and -other tasks but it
>    won't affect other -deadline tasks;
>  - put it in SCHED_FIFO with some priority. This way it will always
>    be scheduled before -other tasks but it won't affect -deadline
>    tasks, nor other -rt tasks with higher priority;
>  - put it in SCHED_OTHER.
> 
> Notice also that this can be done on a per-task basis, e.g., each
> task can specify what kind of reclaiming mechanism it wants to use
> by means of the sched_flags field of sched_param_ex.
> 
> Therefore, this patch:
>  - adds the flags for specyfing DEADLINE, RT or OTHER reclaiming
>    behaviour;
>  - adds the logic that changes the scheduling class of a task when
>    it overruns, according to the requested policy.

The first two definitely should require SYS_CAP_ADMIN because it allows
silly while(1) loops again.. but can we postpone this fancy feature as
well? 

I'd much rather have the stochastic thing implemented that allows
limited temporal overrun.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 20/22] sched: drafted deadline inheritance logic
  2010-10-29  6:43 ` [RFC][PATCH 20/22] sched: drafted deadline inheritance logic Raistlin
@ 2010-11-11 22:15   ` Peter Zijlstra
  2010-11-14 12:00     ` Raistlin
  0 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-11 22:15 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-10-29 at 08:43 +0200, Raistlin wrote:
> Some method to deal with rt-mutexes and make sched_dl interact with
> the current PI-coded is needed, raising all but trivial issues, that
> needs (according to us) to be solved with some restructuring of
> the pi-code (i.e., going toward a proxy execution-ish implementation).
> 
> This is under development, in the meanwhile, as a temporary solution,
> what this commits does is:
>  - ensure a pi-lock owner with waiters is never throttled down. Instead,
>    when it runs out of runtime, it immediately gets replenished and it's
>    deadline is postponed (as in SF_BWRECL_DL reclaiming policy);
>  - the scheduling parameters (relative deadline and default runtime)
>    used for that replenishments --during the whole period it holds the
>    pi-lock-- are the ones of the waiting task with earliest deadline.
> 
> Acting this way, we provide some kind of boosting to the lock-owner,
> still by using the existing (actually, slightly modified by the previous
> commit) pi-architecture.

Right, so this is the trivial priority ceiling protocol extended to
bandwidth inheritance and we basically let the owner overrun its runtime
to release the shared resource.

Didn't look at it too closely, but yeah, that is a sensible first
approximation band-aid to keep stuff working.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 09/22] sched: add period support for -deadline tasks
  2010-11-11 19:43       ` Peter Zijlstra
@ 2010-11-11 23:33         ` Tommaso Cucinotta
  2010-11-12 13:33         ` Raistlin
  1 sibling, 0 replies; 135+ messages in thread
From: Tommaso Cucinotta @ 2010-11-11 23:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, oleg, Frederic Weisbecker, Darren Hart,
	Johan Eker, p.faure, linux-kernel, Claudio Scordino,
	michael trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Nicola Manica, Luca Abeni, Dhaval Giani, Harald Gustafsson,
	paulmck

Il 11/11/2010 20:43, Peter Zijlstra ha scritto:
>> The more correct --in the sense that it at least yield a sufficient (not
>> necessary!) condition-- thing to do would be
>> sum_i(runtime_i/min{deadline_i,period_i})<=threshold.
>>
>> So, what you think we should do? Can I go for this latter option?
> So sufficient (but not necessary) means its still a pessimistic approach
> but better than the one currently employed, or does it mean its
> optimistic and allows for unschedulable sets to be allowed in?
It means that, if the new task passes the test, then it has its 
guaranteed runtime_i over each time horizon as long as min{deadline_i, 
period_i} (and all of the other tasks already admitted have their 
guarantees as well of course). From the perspective of analyzing 
capability of the attached task to meet its own deadlines, if the task 
has a WCET of runtime_i, a minimum inter-arrival period of period_i, and 
a relative deadline of deadline_i, then it is guaranteed to meet all of 
its deadlines.

Therefore, this kind of test is sufficient for ensuring schedulability 
of all of the tasks, but it is not actually necessary, because it is too 
pessimistic. In fact, consider a task with a period of 10ms, a runtime 
of 3ms and a relative deadline of 5ms. After the test passed, you have 
actually allocated a "share" of the CPU capable of handling 3ms of 
workload every 5ms. Instead, we actually know that (or, we may actually 
force it to), after the deadline at 5ms, this task will actually be idle 
for further 5ms, till its new period. There are more complex tests which 
account for this, in the analysis.

Generally speaking, with deadlines different from periods, a tighter 
test (for partitioned EDF) is one making use of the demand-bound 
function, which unfortunately is far more heavyweight than a mere 
utilization check (for example, you should perform a number of checks 
along a time horizon that can go as far as the hyper-period [LCM of the 
periods] of the considered task-set -- something that may require 
arbitrary precision arithmetics in the worst-case). However, you can 
check the *RT* conferences in the last 10 years in order to see all the 
possible trade-offs between accuracy of the test and the imposed 
computation requirement/overhead.

Summarizing, the test suggested by Dario is sufficient to ensure the 
correct behavior of the accepted tasks, under the assumption that they 
stick to the "sporadic RT task model", it is very simple to implement in 
the kernel, but it is somewhat pessimistic. Also, it actually uses only 
2 parameters, the runtime and the min{deadline_i, period_i}.
This clarifies also why I was raising the issue of whether to have at 
all the specification of a deadline \neq period, in my other e-mail. If 
the first implementation will just use the minimum of 2 of the supplied 
parameters, then let them be specified as 1 parameter only: it will be 
easier for developers to understand and use. If we identify later a 
proper test we want to use, then we can exploit the "extensibility" of 
the sched_params.

My 2 cents.

     T.

-- 
Tommaso Cucinotta, Computer Engineering PhD, Researcher
ReTiS Lab, Scuola Superiore Sant'Anna, Pisa, Italy
Tel +39 050 882 024, Fax +39 050 882 003
http://retis.sssup.it/people/tommaso


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 09/22] sched: add period support for -deadline tasks
  2010-11-11 19:43       ` Peter Zijlstra
  2010-11-11 23:33         ` Tommaso Cucinotta
@ 2010-11-12 13:33         ` Raistlin
  2010-11-12 13:45           ` Peter Zijlstra
  1 sibling, 1 reply; 135+ messages in thread
From: Raistlin @ 2010-11-12 13:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 2063 bytes --]

On Thu, 2010-11-11 at 20:43 +0100, Peter Zijlstra wrote:
> > Since you spotted it... The biggest issue here is admission control
> > test. Right now this is done against task's bandwidth, i.e.,
> > sum_i(runtime_i/period_i)<=threshold, but it is unfortunately wrong...
> > Or at least very, very loose, to the point of being almost useless! :-(
> 
> Right, I have some recollection on that.
> 
:-)

> So sufficient (but not necessary) means its still a pessimistic approach
> but better than the one currently employed, or does it mean its
> optimistic and allows for unschedulable sets to be allowed in?
> 
Tommaso already gave the best possible explanation of this! :-P

So, trying to recap:
 - using runtime/min(deadline,period) _does_ guarantee schedulability,
   but also rejects schedulable situations in UP/partitioning. Quite
   sure it _does_not_ guarantee schedulability in SMP/global, but
   *should* enable bounded tardiness;
 - using runtime/period _does_not_ guarantee schedulability nor in
   UP/partitioning neither in SMP/global, but *should* enable bounded
   tardiness for _both_.

The *should*-s come from the fact that I feel like I read it somewhere,
but right now I can't find the paper(s), not even following the
references indicated by Bjorn and Jim in previous e-mails and threads
(i.e., I can't find anything _explicitly_ considering deadline!=period,
but it might be my fault)... :-(

Thus, all this being said, what do you want me to do? :-D

Since we care about bounded tardiness more than 100%-guaranteed
schedulability (which, BTW, neither min{} could give us, at least for
SMPs), should we stay with runtime/period? Tommaso, Luca, do you think
it would be so bad?

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 09/22] sched: add period support for -deadline tasks
  2010-11-12 13:33         ` Raistlin
@ 2010-11-12 13:45           ` Peter Zijlstra
  0 siblings, 0 replies; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-12 13:45 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-11-12 at 14:33 +0100, Raistlin wrote:
> On Thu, 2010-11-11 at 20:43 +0100, Peter Zijlstra wrote:
> > > Since you spotted it... The biggest issue here is admission control
> > > test. Right now this is done against task's bandwidth, i.e.,
> > > sum_i(runtime_i/period_i)<=threshold, but it is unfortunately wrong...
> > > Or at least very, very loose, to the point of being almost useless! :-(
> > 
> > Right, I have some recollection on that.
> > 
> :-)
> 
> > So sufficient (but not necessary) means its still a pessimistic approach
> > but better than the one currently employed, or does it mean its
> > optimistic and allows for unschedulable sets to be allowed in?
> > 
> Tommaso already gave the best possible explanation of this! :-P
> 
> So, trying to recap:
>  - using runtime/min(deadline,period) _does_ guarantee schedulability,
>    but also rejects schedulable situations in UP/partitioning. Quite
>    sure it _does_not_ guarantee schedulability in SMP/global, but
>    *should* enable bounded tardiness;
>  - using runtime/period _does_not_ guarantee schedulability nor in
>    UP/partitioning neither in SMP/global, but *should* enable bounded
>    tardiness for _both_.

> Thus, all this being said, what do you want me to do? :-D

runtime/min(deadline,period) sounds fine, as its more useful than
runtime/period.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 09/22] sched: add period support for -deadline tasks
  2010-11-11 19:31     ` Raistlin
  2010-11-11 19:43       ` Peter Zijlstra
@ 2010-11-12 13:46       ` Luca Abeni
  2010-11-12 14:01         ` Raistlin
  1 sibling, 1 reply; 135+ messages in thread
From: Luca Abeni @ 2010-11-12 13:46 UTC (permalink / raw)
  To: Raistlin
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, oleg, Frederic Weisbecker, Darren Hart,
	Johan Eker, p.faure, linux-kernel, Claudio Scordino,
	michael trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Nicola Manica, Dhaval Giani, Harald Gustafsson, paulmck

On 11/11/2010 08:31 PM, Raistlin wrote:
> On Thu, 2010-11-11 at 20:17 +0100, Peter Zijlstra wrote:
>> On Fri, 2010-10-29 at 08:34 +0200, Raistlin wrote:
>>> Make it possible to specify a period (different or equal than
>>> deadline) for -deadline tasks.
>>>
>> I would expect it to be:
>>
>> runtime<= deadline<= period
>>
> Well, apart from that really unhappy comment/changelog, it should be
> like that in the code, and if it's not, it is what I meant and I'll
> change to that as soon as I can! :-)
>
> Since you spotted it... The biggest issue here is admission control
> test. Right now this is done against task's bandwidth, i.e.,
> sum_i(runtime_i/period_i)<=threshold, but it is unfortunately wrong...
> Or at least very, very loose, to the point of being almost useless! :-(
The point is that when the relative deadline is different from the period,
the concept of "task utilisation", or "bandwidth" becomes fuzzy at least
(I would say it becomes almost meaningless, but...).

The test with min{D,P} is technically more correct (meaning that it will
never accept unschedulable tasks), but it rejects some schedulable tasks.
As Tommaso pointed out, a more complex admission test would be needed.

> The more correct --in the sense that it at least yield a sufficient (not
> necessary!) condition-- thing to do would be
> sum_i(runtime_i/min{deadline_i,period_i})<=threshold.
>
> So, what you think we should do? Can I go for this latter option?
The one with min{} is at lest correct :)


				Luca

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 09/22] sched: add period support for -deadline tasks
  2010-11-12 13:46       ` Luca Abeni
@ 2010-11-12 14:01         ` Raistlin
  0 siblings, 0 replies; 135+ messages in thread
From: Raistlin @ 2010-11-12 14:01 UTC (permalink / raw)
  To: Luca Abeni
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, oleg, Frederic Weisbecker, Darren Hart,
	Johan Eker, p.faure, linux-kernel, Claudio Scordino,
	michael trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Nicola Manica, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 567 bytes --]

On Fri, 2010-11-12 at 14:46 +0100, Luca Abeni wrote:
> > So, what you think we should do? Can I go for this latter option?
> The one with min{} is at lest correct :)
> 
Seems that Peter agrees (and I'm Tommaso would agree too) so, copy
that! ;-P

Thanks,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 18/22] sched: add reclaiming logic to -deadline tasks
  2010-11-11 22:12   ` Peter Zijlstra
@ 2010-11-12 15:36     ` Raistlin
  2010-11-12 16:04       ` Peter Zijlstra
  0 siblings, 1 reply; 135+ messages in thread
From: Raistlin @ 2010-11-12 15:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 2086 bytes --]

On Thu, 2010-11-11 at 23:12 +0100, Peter Zijlstra wrote:
> > Therefore, this patch:
> >  - adds the flags for specyfing DEADLINE, RT or OTHER reclaiming
> >    behaviour;
> >  - adds the logic that changes the scheduling class of a task when
> >    it overruns, according to the requested policy.
> 
> The first two definitely should require SYS_CAP_ADMIN because it allows
> silly while(1) loops again.. 
>
Sure, good point!

> but can we postpone this fancy feature as
> well? 
> 
As you wish, I'll keep backing it in my private tree...

> I'd much rather have the stochastic thing implemented that allows
> limited temporal overrun.
>
... But at this point I can't avoid asking. That model aims at _pure_
hard real-time scheduling *without* resource reservation capabilities,
provided it deals with temporal overruns by means of a probabilistic
analysis, right?
In this scheduler, we do have resource reservations to deal with
overruns, provided it can guarantee bandwidth isolation among tasks even
in case of overrun (which is very typical soft real-time solution, but
can provide hard guarantees as well, if the analysis is careful enough).

Are we sure the two approaches matches, and/or can live together?

That's because I'm not sure what we would do, at that point, while
facing a runtime overrun... Enforce the bandwidth by stopping the task
until its next deadline (as we do now)? Or allows it to overrun basing
of the statistical information we have? Or do we want somehow try to do
both?

I know we can discuss and decide the details later, after merging all
this... But I still think it's worth trying to have at least a basic
idea of how to do that, just to avoid doing something now that we will
regret then. :-)

Thanks,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 17/22] sched: add signaling overrunning -deadline tasks.
  2010-11-11 21:58   ` Peter Zijlstra
@ 2010-11-12 15:39     ` Raistlin
  2010-11-12 16:04       ` Peter Zijlstra
  0 siblings, 1 reply; 135+ messages in thread
From: Raistlin @ 2010-11-12 15:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 1622 bytes --]

On Thu, 2010-11-11 at 22:58 +0100, Peter Zijlstra wrote:
> On Fri, 2010-10-29 at 08:40 +0200, Raistlin wrote:
> > +static inline void __dl_signal(struct task_struct *tsk, int which)
> > +{
> > +       struct siginfo info;
> > +       long long amount = which == SF_SIG_DMISS ? tsk->dl.stats.last_dmiss :
> > +                          tsk->dl.stats.last_rorun;
> > +
> > +       info.si_signo = SIGXCPU;
> > +       info.si_errno = which;
> > +       info.si_code = SI_KERNEL;
> > +       info.si_pid = 0;
> > +       info.si_uid = 0;
> > +       info.si_value.sival_int = (int)amount;
> > +
> > +       /* Correctly take the locks on task's sighand */
> > +       __group_send_sig_info(SIGXCPU, &info, tsk);
> > +       /* Log what happened to dmesg */
> > +       printk(KERN_INFO "SCHED_DEADLINE: 0x%4x by %Ld [ns] in %d (%s)\n",
> > +              which, amount, task_pid_nr(tsk), tsk->comm);
> > +} 
> 
> This being a G-EDF like scheduler with a u<=1 schedulability test, we're
> firmly in soft-rt territory which means the above will be very easy to
> trigger.. Maybe not spam dmesg?
>
Ok, right. Maybe, if I add the SF_HARD_RT flag (and force the hard tasks
to run on a single CPU they must specify) I can keep the notification
for those tasks only. What do you think?

Thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 18/22] sched: add reclaiming logic to -deadline tasks
  2010-11-12 15:36     ` Raistlin
@ 2010-11-12 16:04       ` Peter Zijlstra
  2010-11-12 17:41         ` Luca Abeni
  2010-11-12 18:56         ` Raistlin
  0 siblings, 2 replies; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-12 16:04 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-11-12 at 16:36 +0100, Raistlin wrote:
> But at this point I can't avoid asking. That model aims at _pure_
> hard real-time scheduling *without* resource reservation capabilities,
> provided it deals with temporal overruns by means of a probabilistic
> analysis, right? 

>From what I understood from it, its a soft real-time scheduling
algorithm with resource reservation. It explicitly allows for deadline
misses, but requires the tardiness of those misses to be bounded, ie.
the UNC soft real-time definition.

The problem the stochastic execution time model tries to address is the
WCET computation mess, WCET computation is hard and often overly
pessimistic, resulting in under-utilized systems.

By using the average CET (much more easily obtained) we get a much
higher system utilization, but since its an average we need to deal with
deadline overruns due to temporal overload scenarios.

Their reasoning goes that since its an average, an overrun must be
compensated by a short run in the near future. The variance parameter
provides a measure of 'near'. Once we've 'consumed' this short run and
are back to the average case our tardiness is back to 0 as well
(considering an otherwise tight scheduler, say P-EDF), since then we've
met the bandwidth requirements placed by this scheduler.

And since the pure statistics allow for an arbitrary large deviation
from the average it also requires a max runtime in order to be able to
place a bound on tardiness.

So for G-EDF with stochastic ET we still get a bounded tardiness, its a
simple sum of bounds, one due to the G in G-EDF and one due to the
stochastic ET.



^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 17/22] sched: add signaling overrunning -deadline tasks.
  2010-11-12 15:39     ` Raistlin
@ 2010-11-12 16:04       ` Peter Zijlstra
  0 siblings, 0 replies; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-12 16:04 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-11-12 at 16:39 +0100, Raistlin wrote:
> On Thu, 2010-11-11 at 22:58 +0100, Peter Zijlstra wrote:
> > On Fri, 2010-10-29 at 08:40 +0200, Raistlin wrote:
> > > +static inline void __dl_signal(struct task_struct *tsk, int which)
> > > +{
> > > +       struct siginfo info;
> > > +       long long amount = which == SF_SIG_DMISS ? tsk->dl.stats.last_dmiss :
> > > +                          tsk->dl.stats.last_rorun;
> > > +
> > > +       info.si_signo = SIGXCPU;
> > > +       info.si_errno = which;
> > > +       info.si_code = SI_KERNEL;
> > > +       info.si_pid = 0;
> > > +       info.si_uid = 0;
> > > +       info.si_value.sival_int = (int)amount;
> > > +
> > > +       /* Correctly take the locks on task's sighand */
> > > +       __group_send_sig_info(SIGXCPU, &info, tsk);
> > > +       /* Log what happened to dmesg */
> > > +       printk(KERN_INFO "SCHED_DEADLINE: 0x%4x by %Ld [ns] in %d (%s)\n",
> > > +              which, amount, task_pid_nr(tsk), tsk->comm);
> > > +} 
> > 
> > This being a G-EDF like scheduler with a u<=1 schedulability test, we're
> > firmly in soft-rt territory which means the above will be very easy to
> > trigger.. Maybe not spam dmesg?
> >
> Ok, right. Maybe, if I add the SF_HARD_RT flag (and force the hard tasks
> to run on a single CPU they must specify) I can keep the notification
> for those tasks only. What do you think?

Sure.. that makes sense.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 15/22] sched: add traceporints for -deadline tasks
  2010-11-11 19:54   ` Peter Zijlstra
@ 2010-11-12 16:13     ` Raistlin
  0 siblings, 0 replies; 135+ messages in thread
From: Raistlin @ 2010-11-12 16:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 1051 bytes --]

On Thu, 2010-11-11 at 20:54 +0100, Peter Zijlstra wrote:
> On Fri, 2010-10-29 at 08:38 +0200, Raistlin wrote:
> > Add tracepoints for the most notable events related to -deadline
> > tasks scheduling (new task arrival, context switch, runtime accounting,
> > bandwidth enforcement timer, etc.).
> >
> We've already got sched_switch(), better extend that. Same for the next
> patch, we already have a migration tracepoint, extend that.
> 
> And I recently rejected a fifo push/pull tracepoint patch from Steve
> because the migration tracepoint was able to provide the same
> information.
> 
Perfectly fine. I'll see how do that without spamming too much the trace
with data that are meaningless for non-deadline tasks.

Thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 12/22] sched: add runtime reporting for -deadline tasks
  2010-11-11 19:37   ` Peter Zijlstra
@ 2010-11-12 16:15     ` Raistlin
  2010-11-12 16:27       ` Peter Zijlstra
  0 siblings, 1 reply; 135+ messages in thread
From: Raistlin @ 2010-11-12 16:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 1141 bytes --]

On Thu, 2010-11-11 at 20:37 +0100, Peter Zijlstra wrote:
> > -#ifdef CONFIG_SCHEDSTATS
> >  struct sched_stats_dl {
> > +#ifdef CONFIG_SCHEDSTATS
> >  	u64			last_dmiss;
> >  	u64			last_rorun;
> >  	u64			dmiss_max;
> >  	u64			rorun_max;
> > -};
> >  #endif
> > +	u64			tot_rtime;
> > +};
> >  
> 
> I know we agreed to pull this from the sched_param2 structure and delay
> exposing this information for a while until the base patches got merged
> and came up with a solid use-case, 
>
Sure! :-)

> but reading this patch makes me
> wonder why tsk->se.sum_exec_runtime isn't good enough?
>
Might be, but what you usually want is something reporting you total
runtime from when you became a -deadline task, which may not be the case
of sum_exec_runtime, is it?

BTW, again, let's postpone this.

Thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 07/22] sched: SCHED_DEADLINE push and pull logic
  2010-10-29  6:32 ` [RFC][PATCH 07/22] sched: SCHED_DEADLINE push and pull logic Raistlin
@ 2010-11-12 16:17   ` Peter Zijlstra
  2010-11-12 21:11     ` Raistlin
  2010-11-14  9:14     ` Raistlin
  0 siblings, 2 replies; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-12 16:17 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-10-29 at 08:32 +0200, Raistlin wrote:
> Add dynamic migrations to SCHED_DEADLINE, so that tasks can
> be moved among CPUs when necessary. It is also possible to bind a
> task to a (set of) CPU(s), thus restricting its capability of
> migrating, or forbidding migrations at all.
> 
> The very same approach used in sched_rt is utilised:
>  - -deadline tasks are kept into CPU-specific runqueues,
>  - -deadline tasks are migrated among runqueues to achieve the
>    following:
>     * on an M-CPU system the M earliest deadline ready tasks
>       are always running;
>     * affinity/cpusets settings of all the -deadline tasks is
>       always respected. 

I haven't fully digested the patch, I keep getting side-tracked and its
a large patch.. however, I thought we would only allow 2 affinities,
strict per-cpu and full root-domain?

Since there are no existing applications using this, this won't break
anything except maybe some expectations :-)

The advantage of restricting the sched_setaffinity() calls like this is
that we can make the schedulability tests saner.

Keep 2 per-cpu utilization counts, a hard-rt and a soft-rt, and ensure
the sum stays <= 1. Use the hard-rt one for the planned SF_HARD_RT flag,
use the soft-rt one for !SF_HARD_RT with nr_cpus_allowed == 1, and use
\Sum (1-h-s) over the root domain for nr_cpus_allowed != 1.

Once you start allowing masks in between its nearly impossible to
guarantee anything.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 12/22] sched: add runtime reporting for -deadline tasks
  2010-11-12 16:15     ` Raistlin
@ 2010-11-12 16:27       ` Peter Zijlstra
  2010-11-12 21:12         ` Raistlin
  0 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-12 16:27 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-11-12 at 17:15 +0100, Raistlin wrote:
> > but reading this patch makes me
> > wonder why tsk->se.sum_exec_runtime isn't good enough?
> >
> Might be, but what you usually want is something reporting you total
> runtime from when you became a -deadline task, which may not be the case
> of sum_exec_runtime, is it? 

Correct, it wouldn't be. However since I expect a user to be mostly
interested in deltas between two readings, in which case it really
doesn't matter, does it?

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 02/22] sched: add extended scheduling interface
  2010-10-29  6:27 ` [RFC][PATCH 02/22] sched: add extended scheduling interface Raistlin
                     ` (2 preceding siblings ...)
  2010-11-10 18:50   ` Peter Zijlstra
@ 2010-11-12 16:38   ` Steven Rostedt
  2010-11-12 16:43     ` Peter Zijlstra
                       ` (2 more replies)
  3 siblings, 3 replies; 135+ messages in thread
From: Steven Rostedt @ 2010-11-12 16:38 UTC (permalink / raw)
  To: Raistlin
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-10-29 at 08:27 +0200, Raistlin wrote:

> +/*
> + * Extended scheduling parameters data structure.
> + *
> + * This is needed because the original struct sched_param can not be
> + * altered without introducing ABI issues with legacy applications
> + * (e.g., in sched_getparam()).
> + *
> + * However, the possibility of specifying more than just a priority for
> + * the tasks may be useful for a wide variety of application fields, e.g.,
> + * multimedia, streaming, automation and control, and many others.
> + *
> + * This variant (sched_param_ex) is meant at describing a so-called
> + * sporadic time-constrained task. In such model a task is specified by:
> + *  - the activation period or minimum instance inter-arrival time;
> + *  - the maximum (or average, depending on the actual scheduling
> + *    discipline) computation time of all instances, a.k.a. runtime;
> + *  - the deadline (relative to the actual activation time) of each
> + *    instance.
> + * Very briefly, a periodic (sporadic) task asks for the execution of
> + * some specific computation --which is typically called an instance--
> + * (at most) every period. Moreover, each instance typically lasts no more
> + * than the runtime and must be completed by time instant t equal to
> + * the instance activation time + the deadline.
> + *
> + * This is reflected by the actual fields of the sched_param_ex structure:
> + *
> + *  @sched_priority     task's priority (might still be useful)
> + *  @sched_deadline     representative of the task's deadline
> + *  @sched_runtime      representative of the task's runtime
> + *  @sched_period       representative of the task's period
> + *  @sched_flags        for customizing the scheduler behaviour
> + *
> + * There are other fields, which may be useful for implementing (in
> + * user-space) advanced scheduling behaviours, e.g., feedback scheduling:
> + *
> + *  @curr_runtime       task's currently available runtime
> + *  @used_runtime       task's totally used runtime
> + *  @curr_deadline      task's current absolute deadline
> + *
> + * Given this task model, there are a multiplicity of scheduling algorithms
> + * and policies, that can be used to ensure all the tasks will make their
> + * timing constraints.
> + */

A while ago I implemented an EDF scheduler for a client (before working
with Red Hat), and one thing they asked about was having a "soft group",
which was basically: This group is guaranteed X runtime in Y period, but
if the system is idle, let the group run, even if it has exhausted its X
runtime.

Is this supported?

-- Steve




^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 02/22] sched: add extended scheduling interface
  2010-11-12 16:38   ` Steven Rostedt
@ 2010-11-12 16:43     ` Peter Zijlstra
  2010-11-12 16:52       ` Steven Rostedt
  2010-11-12 17:42     ` Tommaso Cucinotta
  2010-11-12 19:24     ` Raistlin
  2 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-12 16:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Chris Friesen, oleg,
	Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-11-12 at 11:38 -0500, Steven Rostedt wrote:
> A while ago I implemented an EDF scheduler for a client (before working
> with Red Hat), and one thing they asked about was having a "soft group",
> which was basically: This group is guaranteed X runtime in Y period, but
> if the system is idle, let the group run, even if it has exhausted its X
> runtime.
> 
> Is this supported? 

No, some of the bits in 18/22 come near, but I'd prefer not to add such
things until we've got a convincing use-case.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 02/22] sched: add extended scheduling interface
  2010-11-12 16:43     ` Peter Zijlstra
@ 2010-11-12 16:52       ` Steven Rostedt
  2010-11-12 19:19         ` Raistlin
  0 siblings, 1 reply; 135+ messages in thread
From: Steven Rostedt @ 2010-11-12 16:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Chris Friesen, oleg,
	Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-11-12 at 17:43 +0100, Peter Zijlstra wrote:
> On Fri, 2010-11-12 at 11:38 -0500, Steven Rostedt wrote:
> > A while ago I implemented an EDF scheduler for a client (before working
> > with Red Hat), and one thing they asked about was having a "soft group",
> > which was basically: This group is guaranteed X runtime in Y period, but
> > if the system is idle, let the group run, even if it has exhausted its X
> > runtime.
> > 
> > Is this supported? 
> 
> No, some of the bits in 18/22 come near, but I'd prefer not to add such
> things until we've got a convincing use-case.

I'm fine with that, but I would like to know if the ABI would be able to
add such an extension in the future if we find a convincing use-case.

-- Steve



^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 03/22] sched: SCHED_DEADLINE data structures.
  2010-11-10 19:10   ` Peter Zijlstra
@ 2010-11-12 17:11     ` Steven Rostedt
  0 siblings, 0 replies; 135+ messages in thread
From: Steven Rostedt @ 2010-11-12 17:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Chris Friesen, oleg,
	Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Wed, 2010-11-10 at 20:10 +0100, Peter Zijlstra wrote:
> On Fri, 2010-10-29 at 08:28 +0200, Raistlin wrote:
> > +       if (unlikely(prio >= MAX_DL_PRIO && prio < MAX_RT_PRIO))
> 
> Since MAX_DL_PRIO is 0, you can write that as: 
>   ((unsigned)prio) < MAX_RT_PRIO

Does this make a difference? If not, I rather leave it out since it just
makes it less readable, and a bit confusing for reviewers.

-- Steve




^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 18/22] sched: add reclaiming logic to -deadline tasks
  2010-11-12 16:04       ` Peter Zijlstra
@ 2010-11-12 17:41         ` Luca Abeni
  2010-11-12 17:51           ` Peter Zijlstra
  2010-11-12 18:07           ` Tommaso Cucinotta
  2010-11-12 18:56         ` Raistlin
  1 sibling, 2 replies; 135+ messages in thread
From: Luca Abeni @ 2010-11-12 17:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, oleg, Frederic Weisbecker, Darren Hart,
	Johan Eker, p.faure, linux-kernel, Claudio Scordino,
	michael trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Nicola Manica, Dhaval Giani, Harald Gustafsson, paulmck

On 12/11/10 17:04, Peter Zijlstra wrote:
> On Fri, 2010-11-12 at 16:36 +0100, Raistlin wrote:
>> But at this point I can't avoid asking. That model aims at _pure_
>> hard real-time scheduling *without* resource reservation capabilities,
>> provided it deals with temporal overruns by means of a probabilistic
>> analysis, right?
>
>> From what I understood from it, its a soft real-time scheduling
> algorithm with resource reservation. It explicitly allows for deadline
> misses, but requires the tardiness of those misses to be bounded, ie.
> the UNC soft real-time definition.
>
> The problem the stochastic execution time model tries to address is the
> WCET computation mess, WCET computation is hard and often overly
> pessimistic, resulting in under-utilized systems.
[...]
BTW, sorry for the shameless plug, but even with the current 
SCHED_DEADLINE you are not forced to dimension the runtime using the 
WCET. You can use some stochastic analysis, providing probabilistic 
deadline guarantees. See (for example) "QoS Guarantee Using 
Probabilistic Deadlines"
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.7683&rep=rep1&type=pdf
and "Stochastic analysis of a reservation based system"
http://www.computer.org/portal/web/csdl/doi?doc=doi/10.1109/IPDPS.2001.925049
(sorry, this is not easy to download... But I can provide a copy if you 
are interested).


			Luca

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 02/22] sched: add extended scheduling interface
  2010-11-12 16:38   ` Steven Rostedt
  2010-11-12 16:43     ` Peter Zijlstra
@ 2010-11-12 17:42     ` Tommaso Cucinotta
  2010-11-12 19:21       ` Steven Rostedt
  2010-11-12 19:24     ` Raistlin
  2 siblings, 1 reply; 135+ messages in thread
From: Tommaso Cucinotta @ 2010-11-12 17:42 UTC (permalink / raw)
  To: Steven Rostedt, Peter Zijlstra
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Chris Friesen, oleg,
	Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

Il 12/11/2010 17:38, Steven Rostedt ha scritto:
> A while ago I implemented an EDF scheduler for a client (before working
> with Red Hat), and one thing they asked about was having a "soft group",
> which was basically: This group is guaranteed X runtime in Y period, but
> if the system is idle, let the group run, even if it has exhausted its X
> runtime.
>
> Is this supported?
Actually, I know that Dario has an implementation of exactly this 
feature (and it used to be implemented in our previous scheduler as 
well, i.e., the old AQuoSA stuff):
a per-task flag, when enabled, allows to put temporarily the task into 
SCHED_OTHER when its budget is exhausted. Therefore, it will be allowed 
to run for more than its budget but without breaking the temporal 
isolation with the other SCHED_DEADLINE tasks. Furthermore, it won't 
starve other SCHED_OTHER tasks, as it will compete with them for the CPU 
during the budget-exhausted time windows.

     T.

-- 
Tommaso Cucinotta, Computer Engineering PhD, Researcher
ReTiS Lab, Scuola Superiore Sant'Anna, Pisa, Italy
Tel +39 050 882 024, Fax +39 050 882 003
http://retis.sssup.it/people/tommaso


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 18/22] sched: add reclaiming logic to -deadline tasks
  2010-11-12 17:41         ` Luca Abeni
@ 2010-11-12 17:51           ` Peter Zijlstra
  2010-11-12 17:54             ` Luca Abeni
  2010-11-13 21:08             ` Raistlin
  2010-11-12 18:07           ` Tommaso Cucinotta
  1 sibling, 2 replies; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-12 17:51 UTC (permalink / raw)
  To: Luca Abeni
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, oleg, Frederic Weisbecker, Darren Hart,
	Johan Eker, p.faure, linux-kernel, Claudio Scordino,
	michael trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Nicola Manica, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-11-12 at 18:41 +0100, Luca Abeni wrote:
> > The problem the stochastic execution time model tries to address is the
> > WCET computation mess, WCET computation is hard and often overly
> > pessimistic, resulting in under-utilized systems.
> [...]
> BTW, sorry for the shameless plug, but even with the current 
> SCHED_DEADLINE you are not forced to dimension the runtime using the 
> WCET. 

Yes you are, it pushes the deadline back on overrun. The idea it to
maintain the deadline despite overrunning your budget (up to a point).

The paper we're all talking about is:

A. Mills and J. Anderson, " A Stochastic Framework for Multiprocessor
Soft Real-Time Scheduling", Proceedings of the 16th IEEE Real-Time and
Embedded Technology and Applications Symposium, pp. 311-320, April 2010.
http://www.cs.unc.edu/~anderson/papers/rtas10brevised.pdf

And I see they've got a new stochastic paper out:

A. Mills and J. Anderson, " Scheduling Stochastically-Executing Soft
Real-Time Tasks: A Multiprocessor Approach Without Worst-Case Execution
Times", in submission. 
http://www.cs.unc.edu/~anderson/papers/rtas11b.pdf



^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 18/22] sched: add reclaiming logic to -deadline tasks
  2010-11-12 17:51           ` Peter Zijlstra
@ 2010-11-12 17:54             ` Luca Abeni
  2010-11-13 21:08             ` Raistlin
  1 sibling, 0 replies; 135+ messages in thread
From: Luca Abeni @ 2010-11-12 17:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, oleg, Frederic Weisbecker, Darren Hart,
	Johan Eker, p.faure, linux-kernel, Claudio Scordino,
	michael trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Nicola Manica, Dhaval Giani, Harald Gustafsson, paulmck

On 12/11/10 18:51, Peter Zijlstra wrote:
> On Fri, 2010-11-12 at 18:41 +0100, Luca Abeni wrote:
>>> The problem the stochastic execution time model tries to address is the
>>> WCET computation mess, WCET computation is hard and often overly
>>> pessimistic, resulting in under-utilized systems.
>> [...]
>> BTW, sorry for the shameless plug, but even with the current
>> SCHED_DEADLINE you are not forced to dimension the runtime using the
>> WCET.
>
> Yes you are, it pushes the deadline back on overrun.
I think in case of overrun it postpones the deadline (by a period P), 
avoiding to execute the task until the end of the current period, right?

> The idea it to
> maintain the deadline despite overrunning your budget (up to a point).
>
> The paper we're all talking about is:
>
> A. Mills and J. Anderson, " A Stochastic Framework for Multiprocessor
> Soft Real-Time Scheduling", Proceedings of the 16th IEEE Real-Time and
> Embedded Technology and Applications Symposium, pp. 311-320, April 2010.
> http://www.cs.unc.edu/~anderson/papers/rtas10brevised.pdf
I see... This is a different approach to stochastic analysis, which 
requires modifications to the scheduler.

In the analysis I mentioned, you still enforce a maximum runtime C every 
period P, but C can be smaller than the WCET of the task. If C is larger 
than the average execution time, you can use queuing theory to find the 
probability to miss a deadline (or, the probability to finish a job in a 
time x * P).


				Luca

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 18/22] sched: add reclaiming logic to -deadline tasks
  2010-11-12 17:41         ` Luca Abeni
  2010-11-12 17:51           ` Peter Zijlstra
@ 2010-11-12 18:07           ` Tommaso Cucinotta
  2010-11-12 19:07             ` Raistlin
  2010-11-13  0:43             ` Peter Zijlstra
  1 sibling, 2 replies; 135+ messages in thread
From: Tommaso Cucinotta @ 2010-11-12 18:07 UTC (permalink / raw)
  To: Luca Abeni, Peter Zijlstra
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, oleg, Frederic Weisbecker, Darren Hart,
	Johan Eker, p.faure, linux-kernel, Claudio Scordino,
	michael trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Nicola Manica, Dhaval Giani, Harald Gustafsson, paulmck

Il 12/11/2010 18:41, Luca Abeni ha scritto:
>
>> algorithm with resource reservation. It explicitly allows for deadline
>> misses, but requires the tardiness of those misses to be bounded, ie.
>> the UNC soft real-time definition.
>>
>> The problem the stochastic execution time model tries to address is the
>> WCET computation mess, WCET computation is hard and often overly
>> pessimistic, resulting in under-utilized systems.
>
> [...]
> BTW, sorry for the shameless plug, but even with the current 
> SCHED_DEADLINE you are not forced to dimension the runtime using the 
> WCET. You can use some stochastic analysis, providing probabilistic 
> deadline guarantees. See (for example) "QoS Guarantee Using 
> Probabilistic Deadlines"
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.7683&rep=rep1&type=pdf 
>
> and "Stochastic analysis of a reservation based system"
> http://www.computer.org/portal/web/csdl/doi?doc=doi/10.1109/IPDPS.2001.925049 
>
> (sorry, this is not easy to download... But I can provide a copy if 
> you are interested).
Thanks, Luca, for supporting the viewpoint. I also repeated this 
multiple times, during the LPC as well.

Let me underline a few key points, also about the technique suggested by 
Zijlstra:

-) the specification of a budget every period may be exploited for 
providing deterministic guarantees to applications, if the budget = 
WCET, as well as probabilistic guarantees, if the budget < WCET. For 
example, what we do in many of our papers is to set budget = to some 
percentile/quantile of the observed computation time distribution, 
especially in those cases in which there are isolated peaks of 
computation times which would cause an excessive under-utilization of 
the system (these are ruled out by the percentile-based allocation); I 
think this is a way of reasoning that can be easily understood and used 
by developers;

-) setting a budget equal to (or too close to) the average computation 
time is *bad*, because the is almost in a meta-stable condition in which 
its response-time may easily grow uncontrolled;

-) same thing applies to admitting tasks in the system: if you only 
ensure that the sum of average/expected bandwidths < system capacity, 
then the whole system is at risk of having uncontrolled and arbitrarily 
high peak delays, but from a theoretical viewpoint it is still a 
"stable" system; this is not a condition you want to have in a sane 
real-time scenario;

-) if you want to apply the Mills & Anderson's rule for controlling the 
bound on the tardiness percentiles, as in that paper (A Stochastic 
Framework for Multiprocessor
Soft Real-Time Scheduling), then I can see 2 major drawbacks:
   a) you need to compute the "\psi" in order to use the "Corollary 10" 
of that paper, but that quantity needs to solve a LP optimization 
problem (see also the example in Section 6); the \psi can be used in Eq. 
(36) in order to compute the *expected tardiness*;
   b) unfortunately, the expected tardiness is hardly the quantity of 
interest for any concrete real-time application, where some percentile 
of the tardiness distribution may be much more important; therefore, you 
are actually interested in Eq. (37) for computing the q-th percentile of 
the tardiness. Unfortunately, such last bound is provided through the 
Chebychev (or Markov, or whatever it is called) inequality simply from 
the expected average, and this bound is well-known to be so conservative 
(a 99% percentile is basically hundred times the average). The 
consequence is that you would end-up to actually admit very few tasks as 
compared to what actually fits into the system.

Please, understand me: I don't want to say that that particular 
technique is not useful, but I'd like simply to stress that such 
policies might just belong to the user-space. If you really want, you 
can disable *any* type of admission control at the kernel-level, and you 
can disable *any* kind of budget enforcement, and just trust the 
user-space to have deployed the proper/correct number & type of tasks 
into your embedded RT platform.

     T.

-- 
Tommaso Cucinotta, Computer Engineering PhD, Researcher
ReTiS Lab, Scuola Superiore Sant'Anna, Pisa, Italy
Tel +39 050 882 024, Fax +39 050 882 003
http://retis.sssup.it/people/tommaso


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 18/22] sched: add reclaiming logic to -deadline tasks
  2010-11-12 16:04       ` Peter Zijlstra
  2010-11-12 17:41         ` Luca Abeni
@ 2010-11-12 18:56         ` Raistlin
       [not found]           ` <80992760-24F2-42AE-AF2D-15727F6A1C81@email.unc.edu>
  1 sibling, 1 reply; 135+ messages in thread
From: Raistlin @ 2010-11-12 18:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck,
	Bjoern Brandenburg, James H. Anderson

[-- Attachment #1: Type: text/plain, Size: 2289 bytes --]

On Fri, 2010-11-12 at 17:04 +0100, Peter Zijlstra wrote:
> On Fri, 2010-11-12 at 16:36 +0100, Raistlin wrote:
> > But at this point I can't avoid asking. That model aims at _pure_
> > hard real-time scheduling *without* resource reservation capabilities,
> > provided it deals with temporal overruns by means of a probabilistic
> > analysis, right? 
> 
> From what I understood from it, its a soft real-time scheduling
> algorithm with resource reservation. 
>
Mmm... I've gone through it (again!) quickly, and you're right, it
mentions soft real-time, and I agree that for those systems average CET
is better than worst CET. However, I'm not sure resource reservation is
there... Not in the paper I have at least, but I may be wrong.

> The problem the stochastic execution time model tries to address is the
> WCET computation mess, WCET computation is hard and often overly
> pessimistic, resulting in under-utilized systems.
> 
I know, and it's very reasonable. The point I'm trying to make is that
resource reservation tries to address the very same issue.
I am all but against this model, just want to be sure it's not too much
in conflict to the other features we have, especially with resource
reservation. Especially considering that --if I got the whole thing
about this scheduler right-- resource reservation is something we really
want, and I think UNC people would agree here, since I heard Bjorn
stating this very clear both in Dresden and in Dublin. :-)

BTW, I'm adding them to the Cc, seems fair, and more useful than all
this speculation! :-P

Bjorn, Jim, sorry for bothering. If you're interested, this is the very
beginning of the whole thread:
 http://lkml.org/lkml/2010/10/29/67

And these should be from where this specific discussion starts (I hope,
the mirror is not updated yet I guess :-( ):
 http://lkml.org/lkml/2010/10/29/49
 http://groups.google.com/group/linux.kernel/msg/1dadeca435631b60

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 18/22] sched: add reclaiming logic to -deadline tasks
  2010-11-12 18:07           ` Tommaso Cucinotta
@ 2010-11-12 19:07             ` Raistlin
  2010-11-13  0:43             ` Peter Zijlstra
  1 sibling, 0 replies; 135+ messages in thread
From: Raistlin @ 2010-11-12 19:07 UTC (permalink / raw)
  To: Tommaso Cucinotta
  Cc: Luca Abeni, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Steven Rostedt, Chris Friesen, oleg, Frederic Weisbecker,
	Darren Hart, Johan Eker, p.faure, linux-kernel, Claudio Scordino,
	michael trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Nicola Manica, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 891 bytes --]

On Fri, 2010-11-12 at 19:07 +0100, Tommaso Cucinotta wrote:
> Let me underline a few key points, also about the technique suggested by 
> Zijlstra:
>
> [...]
>
> I can see 2 major drawbacks:
>    a) you need to compute the "\psi" in order to use the "Corollary 10" 
> of that paper, but that quantity needs to solve a LP optimization 
> problem (see also the example in Section 6); the \psi can be used in Eq. 
> (36) in order to compute the *expected tardiness*;
>
> [...]
>
Wow man! You really know what "give us details" means, don't you? :-PP

Thanks and regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 02/22] sched: add extended scheduling interface
  2010-11-12 16:52       ` Steven Rostedt
@ 2010-11-12 19:19         ` Raistlin
  2010-11-12 19:23           ` Steven Rostedt
  0 siblings, 1 reply; 135+ messages in thread
From: Raistlin @ 2010-11-12 19:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 960 bytes --]

On Fri, 2010-11-12 at 11:52 -0500, Steven Rostedt wrote:
> > No, some of the bits in 18/22 come near, but I'd prefer not to add such
> > things until we've got a convincing use-case.
> 
> I'm fine with that, but I would like to know if the ABI would be able to
> add such an extension in the future if we find a convincing use-case.
> 
Scared about ABIs' stability eh? :-PP

BTW, in this case everything should be fine, since the very exact
behaviour you described was triggered by a (couple of) flag(s), which
can be added and dealt with during sched_setscheduler2() at any time.

As Peter was saying, see patch 18 for details.

Thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 02/22] sched: add extended scheduling interface
  2010-11-12 17:42     ` Tommaso Cucinotta
@ 2010-11-12 19:21       ` Steven Rostedt
  0 siblings, 0 replies; 135+ messages in thread
From: Steven Rostedt @ 2010-11-12 19:21 UTC (permalink / raw)
  To: Tommaso Cucinotta
  Cc: Peter Zijlstra, Raistlin, Ingo Molnar, Thomas Gleixner,
	Chris Friesen, oleg, Frederic Weisbecker, Darren Hart,
	Johan Eker, p.faure, linux-kernel, Claudio Scordino,
	michael trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Nicola Manica, Luca Abeni, Dhaval Giani, Harald Gustafsson,
	paulmck

On Fri, 2010-11-12 at 18:42 +0100, Tommaso Cucinotta wrote:
> Il 12/11/2010 17:38, Steven Rostedt ha scritto:
> > A while ago I implemented an EDF scheduler for a client (before working
> > with Red Hat), and one thing they asked about was having a "soft group",
> > which was basically: This group is guaranteed X runtime in Y period, but
> > if the system is idle, let the group run, even if it has exhausted its X
> > runtime.
> >
> > Is this supported?
> Actually, I know that Dario has an implementation of exactly this 
> feature (and it used to be implemented in our previous scheduler as 
> well, i.e., the old AQuoSA stuff):
> a per-task flag, when enabled, allows to put temporarily the task into 
> SCHED_OTHER when its budget is exhausted. Therefore, it will be allowed 
> to run for more than its budget but without breaking the temporal 
> isolation with the other SCHED_DEADLINE tasks. Furthermore, it won't 
> starve other SCHED_OTHER tasks, as it will compete with them for the CPU 
> during the budget-exhausted time windows.

Good to know, thanks!

-- Steve



^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 02/22] sched: add extended scheduling interface
  2010-11-12 19:19         ` Raistlin
@ 2010-11-12 19:23           ` Steven Rostedt
  0 siblings, 0 replies; 135+ messages in thread
From: Steven Rostedt @ 2010-11-12 19:23 UTC (permalink / raw)
  To: Raistlin
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-11-12 at 20:19 +0100, Raistlin wrote:
> On Fri, 2010-11-12 at 11:52 -0500, Steven Rostedt wrote:
> > > No, some of the bits in 18/22 come near, but I'd prefer not to add such
> > > things until we've got a convincing use-case.
> > 
> > I'm fine with that, but I would like to know if the ABI would be able to
> > add such an extension in the future if we find a convincing use-case.
> > 
> Scared about ABIs' stability eh? :-PP

Nah, why should I be? ;-)

> 
> BTW, in this case everything should be fine, since the very exact
> behaviour you described was triggered by a (couple of) flag(s), which
> can be added and dealt with during sched_setscheduler2() at any time.

Yep.

> 
> As Peter was saying, see patch 18 for details.

I'm slowly getting there ;-)

-- Steve



^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 02/22] sched: add extended scheduling interface
  2010-11-12 16:38   ` Steven Rostedt
  2010-11-12 16:43     ` Peter Zijlstra
  2010-11-12 17:42     ` Tommaso Cucinotta
@ 2010-11-12 19:24     ` Raistlin
  2 siblings, 0 replies; 135+ messages in thread
From: Raistlin @ 2010-11-12 19:24 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 1223 bytes --]

On Fri, 2010-11-12 at 11:38 -0500, Steven Rostedt wrote:
> A while ago I implemented an EDF scheduler for a client (before working
> with Red Hat), and one thing they asked about was having a "soft group",
> which was basically: This group is guaranteed X runtime in Y period, but
> if the system is idle, let the group run, even if it has exhausted its X
> runtime.
> 
Sounds reasonable...

> Is this supported?
> 
Yep, as both Tommaso and Peter were saying, patch 18 gives you exactly
that. In some more details, after X is over, you can chose (just by
setting a flag in sched_param2 and calling setscheduler2()) if you want
to keep running as a -deadline task (but without hurting other -deadline
tasks), or if you want to still be able to compete for CPU, but as a -rt
or -fair task.

However, we're delaying this for now. It won't be hard to add it back if
we change our mind. :-)

Thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 07/22] sched: SCHED_DEADLINE push and pull logic
  2010-11-12 16:17   ` Peter Zijlstra
@ 2010-11-12 21:11     ` Raistlin
  2010-11-14  9:14     ` Raistlin
  1 sibling, 0 replies; 135+ messages in thread
From: Raistlin @ 2010-11-12 21:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 1815 bytes --]

On Fri, 2010-11-12 at 17:17 +0100, Peter Zijlstra wrote:
> On Fri, 2010-10-29 at 08:32 +0200, Raistlin wrote:
> > Add dynamic migrations to SCHED_DEADLINE, so that tasks can
> > be moved among CPUs when necessary. It is also possible to bind a
> > task to a (set of) CPU(s), thus restricting its capability of
> > migrating, or forbidding migrations at all.
> > 
> > The very same approach used in sched_rt is utilised:
> >  - -deadline tasks are kept into CPU-specific runqueues,
> >  - -deadline tasks are migrated among runqueues to achieve the
> >    following:
> >     * on an M-CPU system the M earliest deadline ready tasks
> >       are always running;
> >     * affinity/cpusets settings of all the -deadline tasks is
> >       always respected. 
> 
> I haven't fully digested the patch, I keep getting side-tracked and its
> a large patch.. 
>
Yeah, I know, take your time. :-)

> however, I thought we would only allow 2 affinities,
> strict per-cpu and full root-domain?
> 
Yes, we do! Writing a better changelog for this already noted for next
version.

> Keep 2 per-cpu utilization counts, a hard-rt and a soft-rt, and ensure
> the sum stays <= 1. Use the hard-rt one for the planned SF_HARD_RT flag,
> use the soft-rt one for !SF_HARD_RT with nr_cpus_allowed == 1, and use
> \Sum (1-h-s) over the root domain for nr_cpus_allowed != 1.
> 
As agreed during LPC, that's exactly what I'll do. Let's hope to don't
screw up while trying to do the math! :-P

Thanks,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 12/22] sched: add runtime reporting for -deadline tasks
  2010-11-12 16:27       ` Peter Zijlstra
@ 2010-11-12 21:12         ` Raistlin
  0 siblings, 0 replies; 135+ messages in thread
From: Raistlin @ 2010-11-12 21:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 789 bytes --]

On Fri, 2010-11-12 at 17:27 +0100, Peter Zijlstra wrote:
> > Might be, but what you usually want is something reporting you total
> > runtime from when you became a -deadline task, which may not be the case
> > of sum_exec_runtime, is it? 
> 
> Correct, it wouldn't be. However since I expect a user to be mostly
> interested in deltas between two readings, in which case it really
> doesn't matter, does it?
>
Good point. So we won't need this new stat anyway.

Thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 13/22] sched: add resource limits for -deadline tasks
  2010-11-11 19:57   ` Peter Zijlstra
@ 2010-11-12 21:30     ` Raistlin
  2010-11-12 23:32       ` Peter Zijlstra
  0 siblings, 1 reply; 135+ messages in thread
From: Raistlin @ 2010-11-12 21:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 1444 bytes --]

On Thu, 2010-11-11 at 20:57 +0100, Peter Zijlstra wrote:
> > In fact, this patch:
> >  - adds the resource limit RLIMIT_DLDLINE, which is the minimum value
> >    a user task can use as its own deadline;
> >  - adds the resource limit RLIMIT_DLRTIME, which is the maximum value
> >    a user task can use as it own runtime.
> > 
>
> We might also want to add an additional !SYS_CAP_ADMIN global bandwidth
> cap much like the existing sysctl bandwidth cap.
> 
Mmm... I think we've never discussed much about that before, so here I
am. I'm currently asking one to be root to set SCHED_DEADLINE as his
policy. Normal users are allowed to do so, but just under the rlimits
restrictions provided by this patch.

So, first of all, are we cool with this? Or do we want normal users to
be able to give their tasks SCHED_DEADLINE policy by default? Maybe we
want that but up to a certain bandwidth? Is this that you mean here,
having two bandwidth limits, one of which !SYS_ADMINs could not cross?

Sorry for being annoying, but I've never got any feedback on this, while
I think it's something really important.

Thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 13/22] sched: add resource limits for -deadline tasks
  2010-11-12 21:30     ` Raistlin
@ 2010-11-12 23:32       ` Peter Zijlstra
  0 siblings, 0 replies; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-12 23:32 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-11-12 at 22:30 +0100, Raistlin wrote:
> On Thu, 2010-11-11 at 20:57 +0100, Peter Zijlstra wrote:
> > > In fact, this patch:
> > >  - adds the resource limit RLIMIT_DLDLINE, which is the minimum value
> > >    a user task can use as its own deadline;
> > >  - adds the resource limit RLIMIT_DLRTIME, which is the maximum value
> > >    a user task can use as it own runtime.
> > > 
> >
> > We might also want to add an additional !SYS_CAP_ADMIN global bandwidth
> > cap much like the existing sysctl bandwidth cap.
> > 
> Mmm... I think we've never discussed much about that before, so here I
> am. I'm currently asking one to be root to set SCHED_DEADLINE as his
> policy. Normal users are allowed to do so, but just under the rlimits
> restrictions provided by this patch.
> 
> So, first of all, are we cool with this? Or do we want normal users to
> be able to give their tasks SCHED_DEADLINE policy by default? 

I think so, it would make it much more useful to people.

> Maybe we want that but up to a certain bandwidth?

Exactly.

>  Is this that you mean here,
> having two bandwidth limits, one of which !SYS_ADMINs could not cross?

Yep. A bandwidth cap for !SYS_CAP_ADMIN and a these two constraints
already introduced by this patch, a min period to avoid very fast timer
programming and a max runtime to avoid incurring large latencies on the
rest of the system.




^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 18/22] sched: add reclaiming logic to -deadline tasks
  2010-11-12 18:07           ` Tommaso Cucinotta
  2010-11-12 19:07             ` Raistlin
@ 2010-11-13  0:43             ` Peter Zijlstra
  2010-11-13  1:49               ` Tommaso Cucinotta
  1 sibling, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-13  0:43 UTC (permalink / raw)
  To: Tommaso Cucinotta
  Cc: Luca Abeni, Raistlin, Ingo Molnar, Thomas Gleixner,
	Steven Rostedt, Chris Friesen, oleg, Frederic Weisbecker,
	Darren Hart, Johan Eker, p.faure, linux-kernel, Claudio Scordino,
	michael trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Nicola Manica, Dhaval Giani, Harald Gustafsson, paulmck

On Fri, 2010-11-12 at 19:07 +0100, Tommaso Cucinotta wrote:
> Il 12/11/2010 18:41, Luca Abeni ha scritto:
> >
> >> algorithm with resource reservation. It explicitly allows for deadline
> >> misses, but requires the tardiness of those misses to be bounded, ie.
> >> the UNC soft real-time definition.
> >>
> >> The problem the stochastic execution time model tries to address is the
> >> WCET computation mess, WCET computation is hard and often overly
> >> pessimistic, resulting in under-utilized systems.
> >
> > [...]
> > BTW, sorry for the shameless plug, but even with the current 
> > SCHED_DEADLINE you are not forced to dimension the runtime using the 
> > WCET. You can use some stochastic analysis, providing probabilistic 
> > deadline guarantees. See (for example) "QoS Guarantee Using 
> > Probabilistic Deadlines"
> > http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.7683&rep=rep1&type=pdf 
> >
> > and "Stochastic analysis of a reservation based system"
> > http://www.computer.org/portal/web/csdl/doi?doc=doi/10.1109/IPDPS.2001.925049 
> >
> > (sorry, this is not easy to download... But I can provide a copy if 
> > you are interested).
> Thanks, Luca, for supporting the viewpoint. I also repeated this 
> multiple times, during the LPC as well.
> 
> Let me underline a few key points, also about the technique suggested by 
> Zijlstra:
> 
> -) the specification of a budget every period may be exploited for 
> providing deterministic guarantees to applications, if the budget = 
> WCET, as well as probabilistic guarantees, if the budget < WCET. For 
> example, what we do in many of our papers is to set budget = to some 
> percentile/quantile of the observed computation time distribution, 
> especially in those cases in which there are isolated peaks of 
> computation times which would cause an excessive under-utilization of 
> the system (these are ruled out by the percentile-based allocation); I 
> think this is a way of reasoning that can be easily understood and used 
> by developers;

Maybe, but I'm clearly not one of them because I'm not getting it.

> -) setting a budget equal to (or too close to) the average computation 
> time is *bad*, because the is almost in a meta-stable condition in which 
> its response-time may easily grow uncontrolled;

How so? Didn't the paper referenced just prove that the response time
stays bounded? 

Setting it lower will of course wreak havoc, but that's what we have
bandwidth control for (implementing stochastic bandwidth control is a
whole separate fun topic though -- although I've been thinking we could
do something by lowering the max runtime every time a job overruns the
average, and limit it at 2*avg - max, if you take a simple parametrized
reduction function and compute the variability of th resulting series
you can invert that and find the reduction parameter to a given
variability).

> -) same thing applies to admitting tasks in the system: if you only 
> ensure that the sum of average/expected bandwidths < system capacity, 
> then the whole system is at risk of having uncontrolled and arbitrarily 
> high peak delays, but from a theoretical viewpoint it is still a 
> "stable" system; this is not a condition you want to have in a sane 
> real-time scenario;

I'm not seeing where the unbounded comes from. 

I am seeing that if you place your budget request slightly higher than
the actual average (say 1 stdev) your variability in the response time
will decrease, but at a cost of lower utilization.

> -) if you want to apply the Mills & Anderson's rule for controlling the 
> bound on the tardiness percentiles, as in that paper (A Stochastic 
> Framework for Multiprocessor
> Soft Real-Time Scheduling), then I can see 2 major drawbacks:
>    a) you need to compute the "\psi" in order to use the "Corollary 10" 
> of that paper, but that quantity needs to solve a LP optimization 
> problem (see also the example in Section 6); the \psi can be used in Eq. 
> (36) in order to compute the *expected tardiness*;

Right, but do we ever actually want to compute the bound? G-EDF also
incurs tardiness but we don't calculate it either. Both depend on the
full task set parameters which is not stable (nor accessible to
user-space in a meaningful manner).

> Please, understand me: I don't want to say that that particular 
> technique is not useful, but I'd like simply to stress that such 
> policies might just belong to the user-space. If you really want, you 
> can disable *any* type of admission control at the kernel-level, and you 
> can disable *any* kind of budget enforcement, and just trust the 
> user-space to have deployed the proper/correct number & type of tasks 
> into your embedded RT platform.

I'm very much against disabling everything and letting the user sort it,
that's basically what SCHED_FIFO does too and its a frigging nightmare.

The whole admission control and schedulability test is what makes this
thing usable. People who build closed systems can already hack their
kernel and do as they please, but for anything other than a closed
system this approach is useless.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 18/22] sched: add reclaiming logic to -deadline tasks
  2010-11-13  0:43             ` Peter Zijlstra
@ 2010-11-13  1:49               ` Tommaso Cucinotta
  0 siblings, 0 replies; 135+ messages in thread
From: Tommaso Cucinotta @ 2010-11-13  1:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Luca Abeni, Raistlin, Ingo Molnar, Thomas Gleixner,
	Steven Rostedt, Chris Friesen, oleg, Frederic Weisbecker,
	Darren Hart, Johan Eker, p.faure, linux-kernel, Claudio Scordino,
	michael trimarchi, Fabio Checconi, Juri Lelli, Nicola Manica,
	Dhaval Giani, Harald Gustafsson, paulmck

Il 13/11/2010 01:43, Peter Zijlstra ha scritto:
> On Fri, 2010-11-12 at 19:07 +0100, Tommaso Cucinotta wrote:
>> -) the specification of a budget every period may be exploited for
>> providing deterministic guarantees to applications, if the budget =
>> WCET, as well as probabilistic guarantees, if the budget<  WCET. For
>> example, what we do in many of our papers is to set budget = to some
>> percentile/quantile of the observed computation time distribution,
>> especially in those cases in which there are isolated peaks of
>> computation times which would cause an excessive under-utilization of
>> the system (these are ruled out by the percentile-based allocation); I
>> think this is a way of reasoning that can be easily understood and used
>> by developers;
> Maybe, but I'm clearly not one of them because I'm not getting it.
My fault for not having explained. Let me see if I can clarify. Let's 
just consider the simple case in which application instances do not 
enqueue (i.e., as soon as the application detects to have missed a 
deadline, it discards the current job, as opposed to keep computing the 
current job), and consider a reservation period == application period.

In such a case, if 'C' represents the (probabilistically modeled) 
computation time of a job, then:

   Prob{deadline hit} = Prob{enough runtime for a job instance} = Prob{C 
<= runtime}.

So, if runtime is set as the q-th quantile of the `C' probability 
distribution, then:

   Prob{deadline hit} = Prob{C <= runtime} = q

This is true independently of what else is admitted into the system, as 
far as I get my runtime guaranteed from the scheduler.

Does this now make sense ?

If, on the other hand, task instances enqueue (e.g., I keep decoding the 
current frame even if I know a new frame arrived), then the probability 
of deadline-hit will be lower than q, and generally speaking one can use 
stochastic analysis & queueing theory techniques in order to figure out 
what it actually is.
>> -) setting a budget equal to (or too close to) the average computation
>> time is *bad*, because the is almost in a meta-stable condition in which
>> its response-time may easily grow uncontrolled;
> How so? Didn't the paper referenced just prove that the response time
> stays bounded?
Here I was not referring to GEDF, but simply to the case in which we are 
reserved from the kernel a budget every period (whatever the scheduling 
algorithm): as the reserved budget moves from the WCET down towards the 
average computation time, the response time distribution moves from a 
shape entirely contained below the deadline, to a more and more flat 
shape, where the probability of missing the deadline for the task 
increases over and over. Roughly speaking, if the application instances 
do not enqueue, then with a budget = average computation time, I would 
expect a ~50% deadline miss, something which hardly is acceptable even 
for soft RT applications.
If instances instead enqueue, then the situation may go much worse, 
because the response-time distribution flattens with a long tail beyond 
the deadline. The maximum value of it approaches +\infty with the 
reserved budget approaching the average computation time.
> Setting it lower will of course wreak havoc, but that's what we have
> bandwidth control for (implementing stochastic bandwidth control is a
> whole separate fun topic though -- although I've been thinking we could
> do something by lowering the max runtime every time a job overruns the
> average, and limit it at 2*avg - max, if you take a simple parametrized
> reduction function and compute the variability of th resulting series
> you can invert that and find the reduction parameter to a given
> variability).
I'd need some more explanation, sorry, I couldn't understand what you're 
proposing.

>> -) if you want to apply the Mills&  Anderson's rule for controlling the
>> bound on the tardiness percentiles, as in that paper (A Stochastic
>> Framework for Multiprocessor
>> Soft Real-Time Scheduling), then I can see 2 major drawbacks:
>>     a) you need to compute the "\psi" in order to use the "Corollary 10"
>> of that paper, but that quantity needs to solve a LP optimization
>> problem (see also the example in Section 6); the \psi can be used in Eq.
>> (36) in order to compute the *expected tardiness*;
> Right, but do we ever actually want to compute the bound? G-EDF also
> incurs tardiness but we don't calculate it either.
I was assuming you were proposing to keep an admission test based on 
providing the parameters needed for checking whether or not a given 
tardiness bound were respected. I must have misunderstood. Would you 
please detail what is the test (and result in the paper) you are 
thinking of using ?
>> If you really want, you
>> can disable *any* type of admission control at the kernel-level, and you
>> can disable *any* kind of budget enforcement, and just trust the
>> user-space to have deployed the proper/correct number&  type of tasks
>> into your embedded RT platform.
> I'm very much against disabling everything and letting the user sort it,
> that's basically what SCHED_FIFO does too and its a frigging nightmare.
Sure, I agree. I was simply suggesting it as a last-resort option 
(possibly usable by exploiting a compile-time option of the scheduler 
compiling out the admission test), useful in those cases in which you do 
have a user-space complex admission test that you made (or even an 
off-line static analysis of your system) but the simple admission test 
into the kernel would actually reject the task set, being the test 
merely sufficient.

Bye,

     T.

-- 
Tommaso Cucinotta, Computer Engineering PhD, Researcher
ReTiS Lab, Scuola Superiore Sant'Anna, Pisa, Italy
Tel +39 050 882 024, Fax +39 050 882 003
http://retis.sssup.it/people/tommaso


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 06/22] sched: SCHED_DEADLINE handles spacial kthreads
  2010-11-11 16:32         ` Oleg Nesterov
@ 2010-11-13 18:35           ` Peter Zijlstra
  2010-11-13 19:58             ` Oleg Nesterov
  0 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-13 18:35 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, Frederic Weisbecker, Darren Hart, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Thu, 2010-11-11 at 17:32 +0100, Oleg Nesterov wrote:
> >
> > Yes, I think we can make that work, we could even move that
> > migrate_live_tasks() into CPU_DYING, which is before this point.
> 
> Hmm, I think you are right. In this case we can also simplify
> migrate-from-dead-cpu paths, we know that nobody can touch rq->lock,
> every CPU runs stop_machine_cpu_stop() with irqs disabled. No need
> to get/put task_struct, etc. 

Something like so?.. hasn't even seen a compiler yet but one's got to do
something to keep the worst bore of saturday night telly in check ;-)


---
Subject: sched: Simplify cpu-hot-unplug task migration
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Sat Nov 13 19:32:29 CET 2010


Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 include/linux/sched.h |    3 
 kernel/cpu.c          |    7 -
 kernel/sched.c        |  181 ++++++++++++--------------------------------------
 3 files changed, 46 insertions(+), 145 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1872,14 +1872,11 @@ extern void sched_clock_idle_sleep_event
 extern void sched_clock_idle_wakeup_event(u64 delta_ns);
 
 #ifdef CONFIG_HOTPLUG_CPU
-extern void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p);
 extern void idle_task_exit(void);
 #else
 static inline void idle_task_exit(void) {}
 #endif
 
-extern void sched_idle_next(void);
-
 #if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
 extern void wake_up_idle_cpu(int cpu);
 #else
Index: linux-2.6/kernel/cpu.c
===================================================================
--- linux-2.6.orig/kernel/cpu.c
+++ linux-2.6/kernel/cpu.c
@@ -189,7 +189,6 @@ static inline void check_for_tasks(int c
 }
 
 struct take_cpu_down_param {
-	struct task_struct *caller;
 	unsigned long mod;
 	void *hcpu;
 };
@@ -208,11 +207,6 @@ static int __ref take_cpu_down(void *_pa
 
 	cpu_notify(CPU_DYING | param->mod, param->hcpu);
 
-	if (task_cpu(param->caller) == cpu)
-		move_task_off_dead_cpu(cpu, param->caller);
-	/* Force idle task to run as soon as we yield: it should
-	   immediately notice cpu is offline and die quickly. */
-	sched_idle_next();
 	return 0;
 }
 
@@ -223,7 +217,6 @@ static int __ref _cpu_down(unsigned int
 	void *hcpu = (void *)(long)cpu;
 	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
 	struct take_cpu_down_param tcd_param = {
-		.caller = current,
 		.mod = mod,
 		.hcpu = hcpu,
 	};
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2381,18 +2381,15 @@ static int select_fallback_rq(int cpu, s
 		return dest_cpu;
 
 	/* No more Mr. Nice Guy. */
-	if (unlikely(dest_cpu >= nr_cpu_ids)) {
-		dest_cpu = cpuset_cpus_allowed_fallback(p);
-		/*
-		 * Don't tell them about moving exiting tasks or
-		 * kernel threads (both mm NULL), since they never
-		 * leave kernel.
-		 */
-		if (p->mm && printk_ratelimit()) {
-			printk(KERN_INFO "process %d (%s) no "
-			       "longer affine to cpu%d\n",
-			       task_pid_nr(p), p->comm, cpu);
-		}
+	dest_cpu = cpuset_cpus_allowed_fallback(p);
+	/*
+	 * Don't tell them about moving exiting tasks or
+	 * kernel threads (both mm NULL), since they never
+	 * leave kernel.
+	 */
+	if (p->mm && printk_ratelimit()) {
+		printk(KERN_INFO "process %d (%s) no longer affine to cpu%d\n",
+				task_pid_nr(p), p->comm, cpu);
 	}
 
 	return dest_cpu;
@@ -5727,29 +5724,20 @@ static int migration_cpu_stop(void *data
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
+
 /*
- * Figure out where task on dead CPU should go, use force if necessary.
+ * Ensures that the idle task is using init_mm right before its cpu goes
+ * offline.
  */
-void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
+void idle_task_exit(void)
 {
-	struct rq *rq = cpu_rq(dead_cpu);
-	int needs_cpu, uninitialized_var(dest_cpu);
-	unsigned long flags;
+	struct mm_struct *mm = current->active_mm;
 
-	local_irq_save(flags);
+	BUG_ON(cpu_online(smp_processor_id()));
 
-	raw_spin_lock(&rq->lock);
-	needs_cpu = (task_cpu(p) == dead_cpu) && (p->state != TASK_WAKING);
-	if (needs_cpu)
-		dest_cpu = select_fallback_rq(dead_cpu, p);
-	raw_spin_unlock(&rq->lock);
-	/*
-	 * It can only fail if we race with set_cpus_allowed(),
-	 * in the racer should migrate the task anyway.
-	 */
-	if (needs_cpu)
-		__migrate_task(p, dead_cpu, dest_cpu);
-	local_irq_restore(flags);
+	if (mm != &init_mm)
+		switch_mm(mm, &init_mm, current);
+	mmdrop(mm);
 }
 
 /*
@@ -5762,104 +5750,48 @@ void move_task_off_dead_cpu(int dead_cpu
 static void migrate_nr_uninterruptible(struct rq *rq_src)
 {
 	struct rq *rq_dest = cpu_rq(cpumask_any(cpu_active_mask));
-	unsigned long flags;
 
-	local_irq_save(flags);
-	double_rq_lock(rq_src, rq_dest);
 	rq_dest->nr_uninterruptible += rq_src->nr_uninterruptible;
 	rq_src->nr_uninterruptible = 0;
-	double_rq_unlock(rq_src, rq_dest);
-	local_irq_restore(flags);
-}
-
-/* Run through task list and migrate tasks from the dead cpu. */
-static void migrate_live_tasks(int src_cpu)
-{
-	struct task_struct *p, *t;
-
-	read_lock(&tasklist_lock);
-
-	do_each_thread(t, p) {
-		if (p == current)
-			continue;
-
-		if (task_cpu(p) == src_cpu)
-			move_task_off_dead_cpu(src_cpu, p);
-	} while_each_thread(t, p);
-
-	read_unlock(&tasklist_lock);
 }
 
 /*
- * Schedules idle task to be the next runnable task on current CPU.
- * It does so by boosting its priority to highest possible.
- * Used by CPU offline code.
+ * remove the tasks which were accounted by rq from calc_load_tasks.
  */
-void sched_idle_next(void)
+static void calc_global_load_remove(struct rq *rq)
 {
-	int this_cpu = smp_processor_id();
-	struct rq *rq = cpu_rq(this_cpu);
-	struct task_struct *p = rq->idle;
-	unsigned long flags;
-
-	/* cpu has to be offline */
-	BUG_ON(cpu_online(this_cpu));
-
-	/*
-	 * Strictly not necessary since rest of the CPUs are stopped by now
-	 * and interrupts disabled on the current cpu.
-	 */
-	raw_spin_lock_irqsave(&rq->lock, flags);
-
-	__setscheduler(rq, p, SCHED_FIFO, MAX_RT_PRIO-1);
-
-	activate_task(rq, p, 0);
-
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	atomic_long_sub(rq->calc_load_active, &calc_load_tasks);
+	rq->calc_load_active = 0;
 }
 
 /*
- * Ensures that the idle task is using init_mm right before its cpu goes
- * offline.
+ * Figure out where task on dead CPU should go, use force if necessary.
  */
-void idle_task_exit(void)
-{
-	struct mm_struct *mm = current->active_mm;
-
-	BUG_ON(cpu_online(smp_processor_id()));
-
-	if (mm != &init_mm)
-		switch_mm(mm, &init_mm, current);
-	mmdrop(mm);
-}
-
-/* called under rq->lock with disabled interrupts */
-static void migrate_dead(unsigned int dead_cpu, struct task_struct *p)
+static void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
 {
 	struct rq *rq = cpu_rq(dead_cpu);
+	int needs_cpu, uninitialized_var(dest_cpu);
 
-	/* Must be exiting, otherwise would be on tasklist. */
-	BUG_ON(!p->exit_state);
-
-	/* Cannot have done final schedule yet: would have vanished. */
-	BUG_ON(p->state == TASK_DEAD);
-
-	get_task_struct(p);
+	needs_cpu = (task_cpu(p) == dead_cpu) && (p->state != TASK_WAKING);
+	if (needs_cpu)
+		dest_cpu = select_fallback_rq(dead_cpu, p);
+	raw_spin_unlock(&rq->lock);
 
 	/*
-	 * Drop lock around migration; if someone else moves it,
-	 * that's OK. No task can be added to this CPU, so iteration is
-	 * fine.
+	 * It can only fail if we race with set_cpus_allowed(),
+	 * in the racer should migrate the task anyway.
 	 */
-	raw_spin_unlock_irq(&rq->lock);
-	move_task_off_dead_cpu(dead_cpu, p);
-	raw_spin_lock_irq(&rq->lock);
+	if (needs_cpu)
+		__migrate_task(p, dead_cpu, dest_cpu);
 
-	put_task_struct(p);
+	raw_spin_lock(&rq->lock);
 }
 
-/* release_task() removes task from tasklist, so we won't find dead tasks. */
-static void migrate_dead_tasks(unsigned int dead_cpu)
+/*
+ * Migrate all tasks from the rq, sleeping tasks will be migrated by
+ * try_to_wake_up()->select_task_rq().
+ */
+static void migrate_tasks(unsigned int dead_cpu)
 {
 	struct rq *rq = cpu_rq(dead_cpu);
 	struct task_struct *next;
@@ -5871,19 +5803,11 @@ static void migrate_dead_tasks(unsigned
 		if (!next)
 			break;
 		next->sched_class->put_prev_task(rq, next);
-		migrate_dead(dead_cpu, next);
 
+		move_task_off_dead_cpu(dead_cpu, next);
 	}
 }
 
-/*
- * remove the tasks which were accounted by rq from calc_load_tasks.
- */
-static void calc_global_load_remove(struct rq *rq)
-{
-	atomic_long_sub(rq->calc_load_active, &calc_load_tasks);
-	rq->calc_load_active = 0;
-}
 #endif /* CONFIG_HOTPLUG_CPU */
 
 #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL)
@@ -6093,15 +6017,13 @@ migration_call(struct notifier_block *nf
 	unsigned long flags;
 	struct rq *rq = cpu_rq(cpu);
 
-	switch (action) {
+	switch (action & ~CPU_TASKS_FROZEN) {
 
 	case CPU_UP_PREPARE:
-	case CPU_UP_PREPARE_FROZEN:
 		rq->calc_load_update = calc_load_update;
 		break;
 
 	case CPU_ONLINE:
-	case CPU_ONLINE_FROZEN:
 		/* Update our root-domain */
 		raw_spin_lock_irqsave(&rq->lock, flags);
 		if (rq->rd) {
@@ -6113,30 +6035,19 @@ migration_call(struct notifier_block *nf
 		break;
 
 #ifdef CONFIG_HOTPLUG_CPU
-	case CPU_DEAD:
-	case CPU_DEAD_FROZEN:
-		migrate_live_tasks(cpu);
-		/* Idle task back to normal (off runqueue, low prio) */
-		raw_spin_lock_irq(&rq->lock);
-		deactivate_task(rq, rq->idle, 0);
-		__setscheduler(rq, rq->idle, SCHED_NORMAL, 0);
-		rq->idle->sched_class = &idle_sched_class;
-		migrate_dead_tasks(cpu);
-		raw_spin_unlock_irq(&rq->lock);
-		migrate_nr_uninterruptible(rq);
-		BUG_ON(rq->nr_running != 0);
-		calc_global_load_remove(rq);
-		break;
-
 	case CPU_DYING:
-	case CPU_DYING_FROZEN:
 		/* Update our root-domain */
 		raw_spin_lock_irqsave(&rq->lock, flags);
 		if (rq->rd) {
 			BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));
 			set_rq_offline(rq);
 		}
+		migrate_tasks(cpu);
+		BUG_ON(rq->nr_running != 0);
 		raw_spin_unlock_irqrestore(&rq->lock, flags);
+
+		migrate_nr_uninterruptible(rq);
+		calc_global_load_remove(rq);
 		break;
 #endif
 	}


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 06/22] sched: SCHED_DEADLINE handles spacial kthreads
  2010-11-13 18:35           ` Peter Zijlstra
@ 2010-11-13 19:58             ` Oleg Nesterov
  2010-11-13 20:31               ` Peter Zijlstra
  0 siblings, 1 reply; 135+ messages in thread
From: Oleg Nesterov @ 2010-11-13 19:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, Frederic Weisbecker, Darren Hart, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On 11/13, Peter Zijlstra wrote:
>
> Something like so?.. hasn't even seen a compiler yet but one's got to do
> something to keep the worst bore of saturday night telly in check ;-)

Yes, I _think_ this all can work (and imho makes a lot of sense
if it works).

quick and dirty review below ;)

>  struct take_cpu_down_param {
> -	struct task_struct *caller;
>  	unsigned long mod;
>  	void *hcpu;
>  };
> @@ -208,11 +207,6 @@ static int __ref take_cpu_down(void *_pa
>
>  	cpu_notify(CPU_DYING | param->mod, param->hcpu);
>
> -	if (task_cpu(param->caller) == cpu)
> -		move_task_off_dead_cpu(cpu, param->caller);
> -	/* Force idle task to run as soon as we yield: it should
> -	   immediately notice cpu is offline and die quickly. */
> -	sched_idle_next();

Yes. but we should remove "while (!idle_cpu(cpu))" from _cpu_down().

> @@ -2381,18 +2381,15 @@ static int select_fallback_rq(int cpu, s
>  		return dest_cpu;
>
>  	/* No more Mr. Nice Guy. */
> -	if (unlikely(dest_cpu >= nr_cpu_ids)) {
> -		dest_cpu = cpuset_cpus_allowed_fallback(p);
> -		/*
> -		 * Don't tell them about moving exiting tasks or
> -		 * kernel threads (both mm NULL), since they never
> -		 * leave kernel.
> -		 */
> -		if (p->mm && printk_ratelimit()) {
> -			printk(KERN_INFO "process %d (%s) no "
> -			       "longer affine to cpu%d\n",
> -			       task_pid_nr(p), p->comm, cpu);
> -		}
> +	dest_cpu = cpuset_cpus_allowed_fallback(p);
> +	/*
> +	 * Don't tell them about moving exiting tasks or
> +	 * kernel threads (both mm NULL), since they never
> +	 * leave kernel.
> +	 */
> +	if (p->mm && printk_ratelimit()) {
> +		printk(KERN_INFO "process %d (%s) no longer affine to cpu%d\n",
> +				task_pid_nr(p), p->comm, cpu);
>  	}

Hmm. I was really puzzled until I realized this is just cleanup,
we can't reach this point if dest_cpu < nr_cpu_ids.

> +static void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
>  {
>  	struct rq *rq = cpu_rq(dead_cpu);
> +	int needs_cpu, uninitialized_var(dest_cpu);
>
> -	/* Must be exiting, otherwise would be on tasklist. */
> -	BUG_ON(!p->exit_state);
> -
> -	/* Cannot have done final schedule yet: would have vanished. */
> -	BUG_ON(p->state == TASK_DEAD);
> -
> -	get_task_struct(p);
> +	needs_cpu = (task_cpu(p) == dead_cpu) && (p->state != TASK_WAKING);
> +	if (needs_cpu)
> +		dest_cpu = select_fallback_rq(dead_cpu, p);
> +	raw_spin_unlock(&rq->lock);

Probably we do not need any checks. This task was picked by
->pick_next_task(), it should have task_cpu(p) == dead_cpu ?

But. I think there is a problem. We should not migrate current task,
stop thread, which does the migrating. At least, sched_stoptask.c
doesn't implement ->enqueue_task() and we can never wake it up later
for kthread_stop().

Perhaps migrate_tasks() should do for_each_class() by hand to
ignore stop_sched_class. But then _cpu_down() should somewhow
ensure the stop thread on the dead CPU is already parked in
schedule().

> -	case CPU_DYING_FROZEN:
>  		/* Update our root-domain */
>  		raw_spin_lock_irqsave(&rq->lock, flags);
>  		if (rq->rd) {
>  			BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));
>  			set_rq_offline(rq);
>  		}
> +		migrate_tasks(cpu);
> +		BUG_ON(rq->nr_running != 0);
>  		raw_spin_unlock_irqrestore(&rq->lock, flags);

Probably we don't really need rq->lock. All cpus run stop threads.

I am not sure about rq->idle, perhaps it should be deactivated.
I don't think we should migrate it.


What I never understood is the meaning of play_dead/etc. If we
remove sched_idle_next(), who will do that logic? And how the
idle thread can call idle_task_exit() ?

Oleg.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 06/22] sched: SCHED_DEADLINE handles spacial kthreads
  2010-11-13 19:58             ` Oleg Nesterov
@ 2010-11-13 20:31               ` Peter Zijlstra
  2010-11-13 20:51                 ` Peter Zijlstra
  0 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-13 20:31 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, Frederic Weisbecker, Darren Hart, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Sat, 2010-11-13 at 20:58 +0100, Oleg Nesterov wrote:
> On 11/13, Peter Zijlstra wrote:
> >
> > Something like so?.. hasn't even seen a compiler yet but one's got to do
> > something to keep the worst bore of saturday night telly in check ;-)
> 
> Yes, I _think_ this all can work (and imho makes a lot of sense
> if it works).
> 
> quick and dirty review below ;)
> 
> >  struct take_cpu_down_param {
> > -	struct task_struct *caller;
> >  	unsigned long mod;
> >  	void *hcpu;
> >  };
> > @@ -208,11 +207,6 @@ static int __ref take_cpu_down(void *_pa
> >
> >  	cpu_notify(CPU_DYING | param->mod, param->hcpu);
> >
> > -	if (task_cpu(param->caller) == cpu)
> > -		move_task_off_dead_cpu(cpu, param->caller);
> > -	/* Force idle task to run as soon as we yield: it should
> > -	   immediately notice cpu is offline and die quickly. */
> > -	sched_idle_next();
> 
> Yes. but we should remove "while (!idle_cpu(cpu))" from _cpu_down().

Right, I think we should replace that with something like BUG_ON(!
idle_cpu(cpu)); Since we migrated everything away during the stop
machine, the cpu should be idle after it.

> > @@ -2381,18 +2381,15 @@ static int select_fallback_rq(int cpu, s
> >  		return dest_cpu;
> >
> >  	/* No more Mr. Nice Guy. */
> > -	if (unlikely(dest_cpu >= nr_cpu_ids)) {
> > -		dest_cpu = cpuset_cpus_allowed_fallback(p);
> > -		/*
> > -		 * Don't tell them about moving exiting tasks or
> > -		 * kernel threads (both mm NULL), since they never
> > -		 * leave kernel.
> > -		 */
> > -		if (p->mm && printk_ratelimit()) {
> > -			printk(KERN_INFO "process %d (%s) no "
> > -			       "longer affine to cpu%d\n",
> > -			       task_pid_nr(p), p->comm, cpu);
> > -		}
> > +	dest_cpu = cpuset_cpus_allowed_fallback(p);
> > +	/*
> > +	 * Don't tell them about moving exiting tasks or
> > +	 * kernel threads (both mm NULL), since they never
> > +	 * leave kernel.
> > +	 */
> > +	if (p->mm && printk_ratelimit()) {
> > +		printk(KERN_INFO "process %d (%s) no longer affine to cpu%d\n",
> > +				task_pid_nr(p), p->comm, cpu);
> >  	}
> 
> Hmm. I was really puzzled until I realized this is just cleanup,
> we can't reach this point if dest_cpu < nr_cpu_ids.

Right.. Noticed that when I read that code, though I might as well fix
it up.

> > +static void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
> >  {
> >  	struct rq *rq = cpu_rq(dead_cpu);
> > +	int needs_cpu, uninitialized_var(dest_cpu);
> >
> > -	/* Must be exiting, otherwise would be on tasklist. */
> > -	BUG_ON(!p->exit_state);
> > -
> > -	/* Cannot have done final schedule yet: would have vanished. */
> > -	BUG_ON(p->state == TASK_DEAD);
> > -
> > -	get_task_struct(p);
> > +	needs_cpu = (task_cpu(p) == dead_cpu) && (p->state != TASK_WAKING);
> > +	if (needs_cpu)
> > +		dest_cpu = select_fallback_rq(dead_cpu, p);
> > +	raw_spin_unlock(&rq->lock);
> 
> Probably we do not need any checks. This task was picked by
> ->pick_next_task(), it should have task_cpu(p) == dead_cpu ?

Right, we can drop those checks, its unconditionally true.

> But. I think there is a problem. We should not migrate current task,
> stop thread, which does the migrating. At least, sched_stoptask.c
> doesn't implement ->enqueue_task() and we can never wake it up later
> for kthread_stop().

Hrm, right, so while the migration thread isn't actually on any rq
structure as such, pick_next_task() will return it.. need to come up
with a way to skip it.

As to current, take_cpu_down() is actually migrating current away before
this patch, so I simply included current in the
CPU_DYING->migrate_tasks() loop and removed the special case from
take_cpu_down().

> Perhaps migrate_tasks() should do for_each_class() by hand to
> ignore stop_sched_class. But then _cpu_down() should somewhow
> ensure the stop thread on the dead CPU is already parked in
> schedule().

Well, since we're in stop_machine all cpus but the cpu that is executing
is stuck in the stop_machine_cpu_stop() loop, in both cases we could
simply fudge the pick_next_task_stop() condition (eg. set rq->stop =
NULL) while doing that loop, and restore it afterwards, nothing will hit
schedule() while we're there.

> > -	case CPU_DYING_FROZEN:
> >  		/* Update our root-domain */
> >  		raw_spin_lock_irqsave(&rq->lock, flags);
> >  		if (rq->rd) {
> >  			BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));
> >  			set_rq_offline(rq);
> >  		}
> > +		migrate_tasks(cpu);
> > +		BUG_ON(rq->nr_running != 0);
> >  		raw_spin_unlock_irqrestore(&rq->lock, flags);
> 
> Probably we don't really need rq->lock. All cpus run stop threads.

Right, but I was worried about stuff that relied on lockdep state like
the rcu lockdep stuff.. and taking the lock doesn't hurt.

> I am not sure about rq->idle, perhaps it should be deactivated.
> I don't think we should migrate it.

Ah, I think the !nr_running check will bail before we end up selecting
the idle thread.

> What I never understood is the meaning of play_dead/etc. If we
> remove sched_idle_next(), who will do that logic? And how the
> idle thread can call idle_task_exit() ?

Well, since we'll have migrated every task on that runqueue (except the
migration thread), the only runnable task left (once the migration
thread stops running) is the idle thread, so it should be implicit.

As to play_dead():

  cpu_idle()
    if (cpu_is_offline(smp_processor_id()))
      play_dead()
        native_play_dead() /* did I already say I detest paravirt? */
          play_dead_common()
            idle_task_exit();
            local_irq_disable();
          tboot/mwait/hlt

It basically puts the cpu to sleep with IRQs disabled, needs special
magic to wake it back up.
          


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 06/22] sched: SCHED_DEADLINE handles spacial kthreads
  2010-11-13 20:31               ` Peter Zijlstra
@ 2010-11-13 20:51                 ` Peter Zijlstra
  2010-11-13 23:31                   ` Peter Zijlstra
  0 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-13 20:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, Frederic Weisbecker, Darren Hart, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck


Latest version, I actually compiled it too ! 

---
Subject: sched: Simplify cpu-hot-unplug task migration
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Sat Nov 13 19:32:29 CET 2010

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |    3 
 kernel/cpu.c          |   17 +---
 kernel/sched.c        |  201 ++++++++++++++------------------------------------
 3 files changed, 63 insertions(+), 158 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1872,14 +1872,11 @@ extern void sched_clock_idle_sleep_event
 extern void sched_clock_idle_wakeup_event(u64 delta_ns);
 
 #ifdef CONFIG_HOTPLUG_CPU
-extern void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p);
 extern void idle_task_exit(void);
 #else
 static inline void idle_task_exit(void) {}
 #endif
 
-extern void sched_idle_next(void);
-
 #if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
 extern void wake_up_idle_cpu(int cpu);
 #else
Index: linux-2.6/kernel/cpu.c
===================================================================
--- linux-2.6.orig/kernel/cpu.c
+++ linux-2.6/kernel/cpu.c
@@ -189,7 +189,6 @@ static inline void check_for_tasks(int c
 }
 
 struct take_cpu_down_param {
-	struct task_struct *caller;
 	unsigned long mod;
 	void *hcpu;
 };
@@ -198,7 +197,6 @@ struct take_cpu_down_param {
 static int __ref take_cpu_down(void *_param)
 {
 	struct take_cpu_down_param *param = _param;
-	unsigned int cpu = (unsigned long)param->hcpu;
 	int err;
 
 	/* Ensure this CPU doesn't handle any more interrupts. */
@@ -208,11 +206,6 @@ static int __ref take_cpu_down(void *_pa
 
 	cpu_notify(CPU_DYING | param->mod, param->hcpu);
 
-	if (task_cpu(param->caller) == cpu)
-		move_task_off_dead_cpu(cpu, param->caller);
-	/* Force idle task to run as soon as we yield: it should
-	   immediately notice cpu is offline and die quickly. */
-	sched_idle_next();
 	return 0;
 }
 
@@ -223,7 +216,6 @@ static int __ref _cpu_down(unsigned int
 	void *hcpu = (void *)(long)cpu;
 	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
 	struct take_cpu_down_param tcd_param = {
-		.caller = current,
 		.mod = mod,
 		.hcpu = hcpu,
 	};
@@ -253,9 +245,12 @@ static int __ref _cpu_down(unsigned int
 	}
 	BUG_ON(cpu_online(cpu));
 
-	/* Wait for it to sleep (leaving idle task). */
-	while (!idle_cpu(cpu))
-		yield();
+	/*
+	 * The migration_call() CPU_DYING callback will have removed all
+	 * runnable tasks from the cpu, there's only the idle task left now
+	 * that the migration thread is done doing the stop_machine thing.
+	 */
+	BUG_ON(!idle_cpu(cpu));
 
 	/* This actually kills the CPU. */
 	__cpu_die(cpu);
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2381,18 +2381,15 @@ static int select_fallback_rq(int cpu, s
 		return dest_cpu;
 
 	/* No more Mr. Nice Guy. */
-	if (unlikely(dest_cpu >= nr_cpu_ids)) {
-		dest_cpu = cpuset_cpus_allowed_fallback(p);
-		/*
-		 * Don't tell them about moving exiting tasks or
-		 * kernel threads (both mm NULL), since they never
-		 * leave kernel.
-		 */
-		if (p->mm && printk_ratelimit()) {
-			printk(KERN_INFO "process %d (%s) no "
-			       "longer affine to cpu%d\n",
-			       task_pid_nr(p), p->comm, cpu);
-		}
+	dest_cpu = cpuset_cpus_allowed_fallback(p);
+	/*
+	 * Don't tell them about moving exiting tasks or
+	 * kernel threads (both mm NULL), since they never
+	 * leave kernel.
+	 */
+	if (p->mm && printk_ratelimit()) {
+		printk(KERN_INFO "process %d (%s) no longer affine to cpu%d\n",
+				task_pid_nr(p), p->comm, cpu);
 	}
 
 	return dest_cpu;
@@ -5727,29 +5724,20 @@ static int migration_cpu_stop(void *data
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
+
 /*
- * Figure out where task on dead CPU should go, use force if necessary.
+ * Ensures that the idle task is using init_mm right before its cpu goes
+ * offline.
  */
-void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
+void idle_task_exit(void)
 {
-	struct rq *rq = cpu_rq(dead_cpu);
-	int needs_cpu, uninitialized_var(dest_cpu);
-	unsigned long flags;
+	struct mm_struct *mm = current->active_mm;
 
-	local_irq_save(flags);
+	BUG_ON(cpu_online(smp_processor_id()));
 
-	raw_spin_lock(&rq->lock);
-	needs_cpu = (task_cpu(p) == dead_cpu) && (p->state != TASK_WAKING);
-	if (needs_cpu)
-		dest_cpu = select_fallback_rq(dead_cpu, p);
-	raw_spin_unlock(&rq->lock);
-	/*
-	 * It can only fail if we race with set_cpus_allowed(),
-	 * in the racer should migrate the task anyway.
-	 */
-	if (needs_cpu)
-		__migrate_task(p, dead_cpu, dest_cpu);
-	local_irq_restore(flags);
+	if (mm != &init_mm)
+		switch_mm(mm, &init_mm, current);
+	mmdrop(mm);
 }
 
 /*
@@ -5762,128 +5750,66 @@ void move_task_off_dead_cpu(int dead_cpu
 static void migrate_nr_uninterruptible(struct rq *rq_src)
 {
 	struct rq *rq_dest = cpu_rq(cpumask_any(cpu_active_mask));
-	unsigned long flags;
 
-	local_irq_save(flags);
-	double_rq_lock(rq_src, rq_dest);
 	rq_dest->nr_uninterruptible += rq_src->nr_uninterruptible;
 	rq_src->nr_uninterruptible = 0;
-	double_rq_unlock(rq_src, rq_dest);
-	local_irq_restore(flags);
-}
-
-/* Run through task list and migrate tasks from the dead cpu. */
-static void migrate_live_tasks(int src_cpu)
-{
-	struct task_struct *p, *t;
-
-	read_lock(&tasklist_lock);
-
-	do_each_thread(t, p) {
-		if (p == current)
-			continue;
-
-		if (task_cpu(p) == src_cpu)
-			move_task_off_dead_cpu(src_cpu, p);
-	} while_each_thread(t, p);
-
-	read_unlock(&tasklist_lock);
 }
 
 /*
- * Schedules idle task to be the next runnable task on current CPU.
- * It does so by boosting its priority to highest possible.
- * Used by CPU offline code.
+ * remove the tasks which were accounted by rq from calc_load_tasks.
  */
-void sched_idle_next(void)
+static void calc_global_load_remove(struct rq *rq)
 {
-	int this_cpu = smp_processor_id();
-	struct rq *rq = cpu_rq(this_cpu);
-	struct task_struct *p = rq->idle;
-	unsigned long flags;
-
-	/* cpu has to be offline */
-	BUG_ON(cpu_online(this_cpu));
-
-	/*
-	 * Strictly not necessary since rest of the CPUs are stopped by now
-	 * and interrupts disabled on the current cpu.
-	 */
-	raw_spin_lock_irqsave(&rq->lock, flags);
-
-	__setscheduler(rq, p, SCHED_FIFO, MAX_RT_PRIO-1);
-
-	activate_task(rq, p, 0);
-
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	atomic_long_sub(rq->calc_load_active, &calc_load_tasks);
+	rq->calc_load_active = 0;
 }
 
 /*
- * Ensures that the idle task is using init_mm right before its cpu goes
- * offline.
+ * Migrate all tasks from the rq, sleeping tasks will be migrated by
+ * try_to_wake_up()->select_task_rq().
+ *
+ * Called with rq->lock held.
  */
-void idle_task_exit(void)
-{
-	struct mm_struct *mm = current->active_mm;
-
-	BUG_ON(cpu_online(smp_processor_id()));
-
-	if (mm != &init_mm)
-		switch_mm(mm, &init_mm, current);
-	mmdrop(mm);
-}
-
-/* called under rq->lock with disabled interrupts */
-static void migrate_dead(unsigned int dead_cpu, struct task_struct *p)
+static void migrate_tasks(unsigned int dead_cpu)
 {
 	struct rq *rq = cpu_rq(dead_cpu);
-
-	/* Must be exiting, otherwise would be on tasklist. */
-	BUG_ON(!p->exit_state);
-
-	/* Cannot have done final schedule yet: would have vanished. */
-	BUG_ON(p->state == TASK_DEAD);
-
-	get_task_struct(p);
+	struct task_struct *next, *stop = rq->stop;
+	int dest_cpu;
 
 	/*
-	 * Drop lock around migration; if someone else moves it,
-	 * that's OK. No task can be added to this CPU, so iteration is
-	 * fine.
+	 * Fudge the rq selection such that the below task selection loop
+	 * doesn't get stuck on the currently eligible stop task.
+	 *
+	 * We're currently inside stop_machine() and the rq is either stuck
+	 * in the stop_machine_cpu_stop() loop, or we're executing this code,
+	 * either way we should never end up calling schedule() until we're
+	 * done here.
 	 */
-	raw_spin_unlock_irq(&rq->lock);
-	move_task_off_dead_cpu(dead_cpu, p);
-	raw_spin_lock_irq(&rq->lock);
-
-	put_task_struct(p);
-}
-
-/* release_task() removes task from tasklist, so we won't find dead tasks. */
-static void migrate_dead_tasks(unsigned int dead_cpu)
-{
-	struct rq *rq = cpu_rq(dead_cpu);
-	struct task_struct *next;
+	rq->stop = NULL;
 
 	for ( ; ; ) {
+		/*
+		 * Will terminate the loop before we should get around
+		 * selecting the idle task.
+		 */
 		if (!rq->nr_running)
 			break;
 		next = pick_next_task(rq);
-		if (!next)
-			break;
+		BUG_ON(!next); /* there's always the idle task */
 		next->sched_class->put_prev_task(rq, next);
-		migrate_dead(dead_cpu, next);
 
+		/* Find suitable destination for @next, with force if needed. */
+		dest_cpu = select_fallback_rq(dead_cpu, next);
+		raw_spin_unlock(&rq->lock);
+
+		__migrate_task(next, dead_cpu, dest_cpu);
+
+		raw_spin_lock(&rq->lock);
 	}
-}
 
-/*
- * remove the tasks which were accounted by rq from calc_load_tasks.
- */
-static void calc_global_load_remove(struct rq *rq)
-{
-	atomic_long_sub(rq->calc_load_active, &calc_load_tasks);
-	rq->calc_load_active = 0;
+	rq->stop = stop;
 }
+
 #endif /* CONFIG_HOTPLUG_CPU */
 
 #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL)
@@ -6093,15 +6019,13 @@ migration_call(struct notifier_block *nf
 	unsigned long flags;
 	struct rq *rq = cpu_rq(cpu);
 
-	switch (action) {
+	switch (action & ~CPU_TASKS_FROZEN) {
 
 	case CPU_UP_PREPARE:
-	case CPU_UP_PREPARE_FROZEN:
 		rq->calc_load_update = calc_load_update;
 		break;
 
 	case CPU_ONLINE:
-	case CPU_ONLINE_FROZEN:
 		/* Update our root-domain */
 		raw_spin_lock_irqsave(&rq->lock, flags);
 		if (rq->rd) {
@@ -6113,30 +6037,19 @@ migration_call(struct notifier_block *nf
 		break;
 
 #ifdef CONFIG_HOTPLUG_CPU
-	case CPU_DEAD:
-	case CPU_DEAD_FROZEN:
-		migrate_live_tasks(cpu);
-		/* Idle task back to normal (off runqueue, low prio) */
-		raw_spin_lock_irq(&rq->lock);
-		deactivate_task(rq, rq->idle, 0);
-		__setscheduler(rq, rq->idle, SCHED_NORMAL, 0);
-		rq->idle->sched_class = &idle_sched_class;
-		migrate_dead_tasks(cpu);
-		raw_spin_unlock_irq(&rq->lock);
-		migrate_nr_uninterruptible(rq);
-		BUG_ON(rq->nr_running != 0);
-		calc_global_load_remove(rq);
-		break;
-
 	case CPU_DYING:
-	case CPU_DYING_FROZEN:
 		/* Update our root-domain */
 		raw_spin_lock_irqsave(&rq->lock, flags);
 		if (rq->rd) {
 			BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));
 			set_rq_offline(rq);
 		}
+		migrate_tasks(cpu);
+		BUG_ON(rq->nr_running != 0);
 		raw_spin_unlock_irqrestore(&rq->lock, flags);
+
+		migrate_nr_uninterruptible(rq);
+		calc_global_load_remove(rq);
 		break;
 #endif
 	}


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 18/22] sched: add reclaiming logic to -deadline tasks
  2010-11-12 17:51           ` Peter Zijlstra
  2010-11-12 17:54             ` Luca Abeni
@ 2010-11-13 21:08             ` Raistlin
  1 sibling, 0 replies; 135+ messages in thread
From: Raistlin @ 2010-11-13 21:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Luca Abeni, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, oleg, Frederic Weisbecker, Darren Hart,
	Johan Eker, p.faure, linux-kernel, Claudio Scordino,
	michael trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Nicola Manica, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 1131 bytes --]

On Fri, 2010-11-12 at 18:51 +0100, Peter Zijlstra wrote:
> > BTW, sorry for the shameless plug, but even with the current 
> > SCHED_DEADLINE you are not forced to dimension the runtime using the 
> > WCET. 
> 
> Yes you are, it pushes the deadline back on overrun. The idea it to
> maintain the deadline despite overrunning your budget (up to a point).
> 
BTW, although I share most of Luca's and Tommaso's viewpoints,
triggering bandwidth enforcement at runtime+something instead than just
at runtime doesn't look all that bad, and should fit nicely in what we
have now, without much twisting.

It's just another way to interpret the runtime, and it seems feasible
(or at least plan-able as future feature! :-P) to me... Actually this
was exactly what I was asking and what I wanted to know. :-)

Thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 06/22] sched: SCHED_DEADLINE handles spacial kthreads
  2010-11-13 20:51                 ` Peter Zijlstra
@ 2010-11-13 23:31                   ` Peter Zijlstra
  2010-11-15 20:06                     ` [PATCH] sched: Simplify cpu-hot-unplug task migration Peter Zijlstra
  0 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-13 23:31 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, Frederic Weisbecker, Darren Hart, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Sat, 2010-11-13 at 21:51 +0100, Peter Zijlstra wrote:
> Latest version, I actually compiled it too ! 

Locks up solid though,.. even the NMI watchdog doesn't seem to trigger
anymore (or its broken too)...

/me goes prod at it a bit more

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation
  2010-10-29  6:30 ` [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation Raistlin
                     ` (6 preceding siblings ...)
  2010-11-11 14:25   ` Peter Zijlstra
@ 2010-11-14  8:54   ` Raistlin
  2010-11-23 14:24     ` Peter Zijlstra
  7 siblings, 1 reply; 135+ messages in thread
From: Raistlin @ 2010-11-14  8:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 2040 bytes --]

On Fri, 2010-10-29 at 08:30 +0200, Raistlin wrote:
> +static void task_fork_dl(struct task_struct *p)
> +{
> +	/*
> +	 * The child of a -deadline task will be SCHED_DEADLINE, but
> +	 * as a throttled task. This means the parent (or someone else)
> +	 * must call sched_setscheduler_ex() on it, or it won't even
> +	 * start.
> +	 */
> +	p->dl.dl_throttled = 1;
> +	p->dl.dl_new = 0;
> +}
> +
So, this is also something we only discussed once without reaching a
conclusive statement... Are we ok with this behaviour?

I'm not, actually, but I'm not sure how to do better, considering:
 - resetting task to SCHED_OTHER on fork is useless and confusing, since
   RESET_ON_FORK already exists and allows for this.
 - cloning the parent's bandwidth and try to see if it fits might be
   nice, but what if the check fails? fork fails as well? If yes, what
   should the parent do, lower its own bandwidth (since it's being
   cloned) and try forking again? If yes, lowering it by how much? 
   Mmm... Not sure it would fly... :-(
 - splitting the bandwidth of the parent in two would make sense
   to me, but then it has to be returned at some point, or a poor
   -deadline shell will reach zero bandwidth after a few `ls'! So, when
   are we giving the bandwidth back to the parent? When the child dies?
   What if the child is sched_setscheduled_ex()-ed (either to -deadline 
   or to something else), we return the bw back (if the call succeeds)
   as well?

I was thinking wither to force RESET_ON_FORK for SCHED_DEADLINE or to
try to pursue the 3rd solution (provided I figure out what to do on
detaching, parent dying, and such things... :-P).

Comments very very welcome!

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 07/22] sched: SCHED_DEADLINE push and pull logic
  2010-11-12 16:17   ` Peter Zijlstra
  2010-11-12 21:11     ` Raistlin
@ 2010-11-14  9:14     ` Raistlin
  2010-11-23 14:27       ` Peter Zijlstra
  1 sibling, 1 reply; 135+ messages in thread
From: Raistlin @ 2010-11-14  9:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 2539 bytes --]

On Fri, 2010-11-12 at 17:17 +0100, Peter Zijlstra wrote:
> On Fri, 2010-10-29 at 08:32 +0200, Raistlin wrote:
> > Add dynamic migrations to SCHED_DEADLINE, so that tasks can
> > be moved among CPUs when necessary. It is also possible to bind a
> > task to a (set of) CPU(s), thus restricting its capability of
> > migrating, or forbidding migrations at all.
> > 
> > The very same approach used in sched_rt is utilised:
> >  - -deadline tasks are kept into CPU-specific runqueues,
> >  - -deadline tasks are migrated among runqueues to achieve the
> >    following:
> >     * on an M-CPU system the M earliest deadline ready tasks
> >       are always running;
> >     * affinity/cpusets settings of all the -deadline tasks is
> >       always respected. 
> 
> I haven't fully digested the patch, I keep getting side-tracked and its
> a large patch.. 
>
BTW, I was thinking about your suggestion of adding a *debugging* knob
for achieving a "lock everything while I'm migrating" behaviour... :-)

Something like locking the root_domain during pushes and pulls won't
probably work, since both of them do a double_lock_balance, taking two
rq, which might race with this new "global" lock.
Something like we (CPU#1) hold rq1->lock, we take rd->lock, and then we
try to take rq2->lock. CPU#2 holds rq2->lock and try to take rd->lock.
Stuck! :-(
This should be possible if both CPU#1 and CPU#2 are into a push or a
pull which, on each one, involves some task on the other. Do you agree,
or I'm missing/mistaking something? :-)

Something we can probably do is locking the root_domain for
_each_and_every_ scheduling decision, having all the rq->locks nesting
inside our new root_domain->lock. This would emulate some sort of unique
global rq implementation, since also local decisions on a CPU will
affect all the others, as if they were sharing a single rq... But it's
going to be very slow on large machines (but I guess we can afford
that... It's debugging!), and will probably affect other scheduling
class.
I'm not sure we want the latter... But maybe it could be useful for
debugging others too (at least for FIFO/RR, it should be!).

Let me know what you think...

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 20/22] sched: drafted deadline inheritance logic
  2010-11-11 22:15   ` Peter Zijlstra
@ 2010-11-14 12:00     ` Raistlin
  0 siblings, 0 replies; 135+ messages in thread
From: Raistlin @ 2010-11-14 12:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

[-- Attachment #1: Type: text/plain, Size: 1530 bytes --]

On Thu, 2010-11-11 at 23:15 +0100, Peter Zijlstra wrote:
> > Acting this way, we provide some kind of boosting to the lock-owner,
> > still by using the existing (actually, slightly modified by the previous
> > commit) pi-architecture.
> 
> Right, so this is the trivial priority ceiling protocol extended to
> bandwidth inheritance and we basically let the owner overrun its runtime
> to release the shared resource.
> 
We can call it that way. Basically, what we do is scheduling the
lock-owner with the parameters, i.e., runtime and deadline, of the
earliest deadline task blocked on it. As soon as such runtime depletes,
we (1) _do_ postpone that deadline (again, according to earliest
deadline lock-owner's relative deadline), but we also (2) immediately
replenish the runtime _immediately_.

Acting like this we ensure the lock-owner won't hurt the guarantees
provided to tasks with deadline earlier than all the tasks in its
blocking chains (by means of (1)), and we also enable a quicker release
of the lock (by means of (2)).

> Didn't look at it too closely, but yeah, that is a sensible first
> approximation band-aid to keep stuff working.
> 
I'll keep going this way then. :-)

Thanks,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 18/22] sched: add reclaiming logic to -deadline tasks
       [not found]           ` <80992760-24F2-42AE-AF2D-15727F6A1C81@email.unc.edu>
@ 2010-11-15 18:37             ` James H. Anderson
  2010-11-15 19:23               ` Luca Abeni
                                 ` (2 more replies)
  0 siblings, 3 replies; 135+ messages in thread
From: James H. Anderson @ 2010-11-15 18:37 UTC (permalink / raw)
  To: Raistlin
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, oleg, Frederic Weisbecker, Darren Hart,
	Johan Eker, p.faure, linux-kernel, Claudio Scordino,
	michael trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Nicola Manica, Luca Abeni, Dhaval Giani, Harald Gustafsson,
	paulmck, Bjoern Brandenburg, James H. Anderson

Sorry for the delayed response... I think I must have inadvertently 
deleted this
email last week and Bjoern just mentioned it to me...
> On Fri, 2010-11-12 at 17:04 +0100, Peter Zijlstra wrote:
>   
>> On Fri, 2010-11-12 at 16:36 +0100, Raistlin wrote:
>>     
>>> But at this point I can't avoid asking. That model aims at _pure_
>>> hard real-time scheduling *without* resource reservation capabilities,
>>> provided it deals with temporal overruns by means of a probabilistic
>>> analysis, right? 
>>>       
>> From what I understood from it, its a soft real-time scheduling
>> algorithm with resource reservation. 
>>
>>     
> Mmm... I've gone through it (again!) quickly, and you're right, it
> mentions soft real-time, and I agree that for those systems average CET
> is better than worst CET. However, I'm not sure resource reservation is
> there... Not in the paper I have at least, but I may be wrong.
>
>   
>> The problem the stochastic execution time model tries to address is the
>> WCET computation mess, WCET computation is hard and often overly
>> pessimistic, resulting in under-utilized systems.
>>
>>     
> I know, and it's very reasonable. The point I'm trying to make is that
> resource reservation tries to address the very same issue.
> I am all but against this model, just want to be sure it's not too much
> in conflict to the other features we have, especially with resource
> reservation. Especially considering that --if I got the whole thing
> about this scheduler right-- resource reservation is something we really
> want, and I think UNC people would agree here, since I heard Bjorn
> stating this very clear both in Dresden and in Dublin. :-)
>
> BTW, I'm adding them to the Cc, seems fair, and more useful than all
> this speculation! :-P
>
> Bjorn, Jim, sorry for bothering. If you're interested, this is the very
> beginning of the whole thread:
> http://lkml.org/lkml/2010/10/29/67
>
> And these should be from where this specific discussion starts (I hope,
> the mirror is not updated yet I guess :-( ):
> http://lkml.org/lkml/2010/10/29/49
> http://groups.google.com/group/linux.kernel/msg/1dadeca435631b60
>
> Thanks and Regards,
> Dario
>   
If you're talking about our most recent "stochastic" paper, it is about 
supporting
soft real-time task systems on a multiprocessor where resource 
reservations are
used.  The main result of the paper is that if you provision the 
reservation for a
task slightly higher than it's average-case execution time, and if you use a
scheduling algorithm (like global EDF) that ensures bounded tardiness 
(w.r.t.
these reservations), then the task's expected tardiness will be bounded 
and the
expected bound does not depend on worst-case execution times.  I'm not 
sure if
slack-reallocation methods have come up in this discussion (sorry, I'm 
really
pressed for time and didn't look), but we didn't get into that in our 
paper.  However,
such methods would be easy to incorporate.  I think one of the most 
beautiful
aspects of this paper (which we didn't say enough about) is that the 
analysis
completely separates all the stochastic stuff from the reasoning needed to
derive tardiness bounds under a given scheduler.  In other words, you 
can simply
ignore all stochastic issues when reasoning about tardiness.  I gathered 
there was
some confusion about whether we were using resource reservations.  Such
reservations are actually crucial for our analysis as they allow 
independence to
be assumed across tasks.  And oh yeah: this work has nothing to do with hard
real-time and worst-case execution times are not used at all.

Just to make sure I don't end up creating more confusion, the paper I'm 
talking
about is this one:
http://www.cs.unc.edu/~anderson/papers/rtas11b.pdf

I hope this helps.

-Jim Anderson


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 18/22] sched: add reclaiming logic to -deadline tasks
  2010-11-15 18:37             ` James H. Anderson
@ 2010-11-15 19:23               ` Luca Abeni
  2010-11-15 19:49                 ` James H. Anderson
  2010-11-15 19:39               ` Luca Abeni
  2010-11-15 21:34               ` Raistlin
  2 siblings, 1 reply; 135+ messages in thread
From: Luca Abeni @ 2010-11-15 19:23 UTC (permalink / raw)
  To: James H. Anderson
  Cc: Raistlin, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Steven Rostedt, Chris Friesen, oleg, Frederic Weisbecker,
	Darren Hart, Johan Eker, p.faure, linux-kernel, Claudio Scordino,
	michael trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Nicola Manica, Dhaval Giani, Harald Gustafsson, paulmck,
	Bjoern Brandenburg

Hi James,

On 15/11/10 19:37, James H. Anderson wrote:
[...]
>>> The problem the stochastic execution time model tries to address is the
>>> WCET computation mess, WCET computation is hard and often overly
>>> pessimistic, resulting in under-utilized systems.
>>>
>> I know, and it's very reasonable. The point I'm trying to make is that
>> resource reservation tries to address the very same issue.
>> I am all but against this model, just want to be sure it's not too much
>> in conflict to the other features we have, especially with resource
>> reservation. Especially considering that --if I got the whole thing
>> about this scheduler right-- resource reservation is something we really
>> want, and I think UNC people would agree here, since I heard Bjorn
>> stating this very clear both in Dresden and in Dublin. :-)
>>
>> BTW, I'm adding them to the Cc, seems fair, and more useful than all
>> this speculation! :-P
>>
>> Bjorn, Jim, sorry for bothering. If you're interested, this is the very
>> beginning of the whole thread:
>> http://lkml.org/lkml/2010/10/29/67
[...]
> If you're talking about our most recent "stochastic" paper, it is about
> supporting
> soft real-time task systems on a multiprocessor where resource
> reservations are
> used. The main result of the paper is that if you provision the
> reservation for a
> task slightly higher than it's average-case execution time, and if you
> use a
> scheduling algorithm (like global EDF) that ensures bounded tardiness
> (w.r.t.
> these reservations), then the task's expected tardiness will be bounded
> and the
> expected bound does not depend on worst-case execution times. I'm not
> sure if
> slack-reallocation methods have come up in this discussion (sorry, I'm
> really
> pressed for time and didn't look), but we didn't get into that in our
> paper.
So, if I understand well (sorry, I am just trying to make a short 
summary to check if we are aligned) your analysis is similar to the one 
presented in the papers I mentioned earlier in this thread (different 
stochastic modelling, but similar approach): you analyse a reservation 
in isolation and you provide some stochastic tardiness guarantees based 
on an (e_i, p_i) service model.... Right?

If my understanding is correct (please, correct me if I am wrong), your 
analysis can be applied even with the current version of Dario's patch 
(I mean: no modifications to the patch are needed for removing 
assumptions about WCET knowledge... Your paper uses a sporadic server 
for the reservation mechanism, but I think a CBS can work too...).


			Thanks,
				Luca

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 18/22] sched: add reclaiming logic to -deadline tasks
  2010-11-15 18:37             ` James H. Anderson
  2010-11-15 19:23               ` Luca Abeni
@ 2010-11-15 19:39               ` Luca Abeni
  2010-11-15 21:34               ` Raistlin
  2 siblings, 0 replies; 135+ messages in thread
From: Luca Abeni @ 2010-11-15 19:39 UTC (permalink / raw)
  To: James H. Anderson
  Cc: Raistlin, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Steven Rostedt, Chris Friesen, oleg, Frederic Weisbecker,
	Darren Hart, Johan Eker, p.faure, linux-kernel, Claudio Scordino,
	michael trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Nicola Manica, Dhaval Giani, Harald Gustafsson, paulmck,
	Bjoern Brandenburg

On 15/11/10 19:37, James H. Anderson wrote:
[...]
> If you're talking about our most recent "stochastic" paper, it is about
> supporting
> soft real-time task systems on a multiprocessor where resource
> reservations are
> used. The main result of the paper is that if you provision the
> reservation for a
> task slightly higher than it's average-case execution time
[...]
BTW, I think we are aligned on this.

I was a little bit surprised when Peter mentioned allocating a runtime 
equal to the average execution time (because of the meta-stability 
considerations that Tommaso also mentioned), but I fully agree that if 
the allocated runtime is higher than the average execution time then the 
queue is stable and it's possible to find a bound for the expected 
tardiness (or even its probability distribution... This is similar to my 
"probabilistic deadlines").


			Luca

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 18/22] sched: add reclaiming logic to -deadline tasks
  2010-11-15 19:23               ` Luca Abeni
@ 2010-11-15 19:49                 ` James H. Anderson
  0 siblings, 0 replies; 135+ messages in thread
From: James H. Anderson @ 2010-11-15 19:49 UTC (permalink / raw)
  To: Luca Abeni
  Cc: Raistlin, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Steven Rostedt, Chris Friesen, oleg, Frederic Weisbecker,
	Darren Hart, Johan Eker, p.faure, linux-kernel, Claudio Scordino,
	michael trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Nicola Manica, Dhaval Giani, Harald Gustafsson, paulmck,
	Bjoern Brandenburg



On 11/15/2010 2:23 PM, Luca Abeni wrote:
> Hi James,
>
> On 15/11/10 19:37, James H. Anderson wrote:
> [...]
>>>> The problem the stochastic execution time model tries to address is 
>>>> the
>>>> WCET computation mess, WCET computation is hard and often overly
>>>> pessimistic, resulting in under-utilized systems.
>>>>
>>> I know, and it's very reasonable. The point I'm trying to make is that
>>> resource reservation tries to address the very same issue.
>>> I am all but against this model, just want to be sure it's not too much
>>> in conflict to the other features we have, especially with resource
>>> reservation. Especially considering that --if I got the whole thing
>>> about this scheduler right-- resource reservation is something we 
>>> really
>>> want, and I think UNC people would agree here, since I heard Bjorn
>>> stating this very clear both in Dresden and in Dublin. :-)
>>>
>>> BTW, I'm adding them to the Cc, seems fair, and more useful than all
>>> this speculation! :-P
>>>
>>> Bjorn, Jim, sorry for bothering. If you're interested, this is the very
>>> beginning of the whole thread:
>>> http://lkml.org/lkml/2010/10/29/67
> [...]
>> If you're talking about our most recent "stochastic" paper, it is about
>> supporting
>> soft real-time task systems on a multiprocessor where resource
>> reservations are
>> used. The main result of the paper is that if you provision the
>> reservation for a
>> task slightly higher than it's average-case execution time, and if you
>> use a
>> scheduling algorithm (like global EDF) that ensures bounded tardiness
>> (w.r.t.
>> these reservations), then the task's expected tardiness will be bounded
>> and the
>> expected bound does not depend on worst-case execution times. I'm not
>> sure if
>> slack-reallocation methods have come up in this discussion (sorry, I'm
>> really
>> pressed for time and didn't look), but we didn't get into that in our
>> paper.
> So, if I understand well (sorry, I am just trying to make a short 
> summary to check if we are aligned) your analysis is similar to the 
> one presented in the papers I mentioned earlier in this thread 
> (different stochastic modelling, but similar approach): you analyse a 
> reservation in isolation and you provide some stochastic tardiness 
> guarantees based on an (e_i, p_i) service model.... Right?
Sorry, I don't have time right now to check these papers, but what you 
are saying sounds correct.

>
> If my understanding is correct (please, correct me if I am wrong), 
> your analysis can be applied even with the current version of Dario's 
> patch (I mean: no modifications to the patch are needed for removing 
> assumptions about WCET knowledge... Your paper uses a sporadic server 
> for the reservation mechanism, but I think a CBS can work too...).

This sounds correct a well.  We assume that if a job of a task overruns 
its current budget allocation (which will likely happen, provisioning 
reservations on the average case), then the remainder of that job will 
be executed using future allocations for the same task.  The analysis 
doesn't (I think) depend too much on the exact way reservations are 
supported.

-Jim



^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH] sched: Simplify cpu-hot-unplug task migration
  2010-11-13 23:31                   ` Peter Zijlstra
@ 2010-11-15 20:06                     ` Peter Zijlstra
  2010-11-17 19:27                       ` Oleg Nesterov
  2010-11-18 14:09                       ` [tip:sched/core] " tip-bot for Peter Zijlstra
  0 siblings, 2 replies; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-15 20:06 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, Frederic Weisbecker, Darren Hart, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

Subject: sched: Simplify cpu-hot-unplug task migration
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Sat Nov 13 19:32:29 CET 2010

While discussing the need for sched_idle_next(), Oleg remarked that
since try_to_wake_up() ensures sleeping tasks will end up running on a
sane cpu, we can do away with migrate_live_tasks().

If we then extend the existing hack of migrating current from
CPU_DYING to migrating the full rq worth of tasks from CPU_DYING, the
need for the sched_idle_next() abomination disappears as well, since
idle will be the only possible thread left after the migration thread
stops.

This greatly simplifies the hot-unplug task migration path, as can be
seen from the resulting code reduction (and about half the new lines
are comments).

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |    3 
 kernel/cpu.c          |   16 +--
 kernel/sched.c        |  210 +++++++++++++++-----------------------------------
 3 files changed, 69 insertions(+), 160 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1872,14 +1872,11 @@ extern void sched_clock_idle_sleep_event
 extern void sched_clock_idle_wakeup_event(u64 delta_ns);
 
 #ifdef CONFIG_HOTPLUG_CPU
-extern void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p);
 extern void idle_task_exit(void);
 #else
 static inline void idle_task_exit(void) {}
 #endif
 
-extern void sched_idle_next(void);
-
 #if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
 extern void wake_up_idle_cpu(int cpu);
 #else
Index: linux-2.6/kernel/cpu.c
===================================================================
--- linux-2.6.orig/kernel/cpu.c
+++ linux-2.6/kernel/cpu.c
@@ -189,7 +189,6 @@ static inline void check_for_tasks(int c
 }
 
 struct take_cpu_down_param {
-	struct task_struct *caller;
 	unsigned long mod;
 	void *hcpu;
 };
@@ -208,11 +207,6 @@ static int __ref take_cpu_down(void *_pa
 
 	cpu_notify(CPU_DYING | param->mod, param->hcpu);
 
-	if (task_cpu(param->caller) == cpu)
-		move_task_off_dead_cpu(cpu, param->caller);
-	/* Force idle task to run as soon as we yield: it should
-	   immediately notice cpu is offline and die quickly. */
-	sched_idle_next();
 	return 0;
 }
 
@@ -223,7 +217,6 @@ static int __ref _cpu_down(unsigned int 
 	void *hcpu = (void *)(long)cpu;
 	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
 	struct take_cpu_down_param tcd_param = {
-		.caller = current,
 		.mod = mod,
 		.hcpu = hcpu,
 	};
@@ -253,9 +246,12 @@ static int __ref _cpu_down(unsigned int 
 	}
 	BUG_ON(cpu_online(cpu));
 
-	/* Wait for it to sleep (leaving idle task). */
-	while (!idle_cpu(cpu))
-		yield();
+	/*
+	 * The migration_call() CPU_DYING callback will have removed all
+	 * runnable tasks from the cpu, there's only the idle task left now
+	 * that the migration thread is done doing the stop_machine thing.
+	 */
+	BUG_ON(!idle_cpu(cpu));
 
 	/* This actually kills the CPU. */
 	__cpu_die(cpu);
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2381,18 +2381,15 @@ static int select_fallback_rq(int cpu, s
 		return dest_cpu;
 
 	/* No more Mr. Nice Guy. */
-	if (unlikely(dest_cpu >= nr_cpu_ids)) {
-		dest_cpu = cpuset_cpus_allowed_fallback(p);
-		/*
-		 * Don't tell them about moving exiting tasks or
-		 * kernel threads (both mm NULL), since they never
-		 * leave kernel.
-		 */
-		if (p->mm && printk_ratelimit()) {
-			printk(KERN_INFO "process %d (%s) no "
-			       "longer affine to cpu%d\n",
-			       task_pid_nr(p), p->comm, cpu);
-		}
+	dest_cpu = cpuset_cpus_allowed_fallback(p);
+	/*
+	 * Don't tell them about moving exiting tasks or
+	 * kernel threads (both mm NULL), since they never
+	 * leave kernel.
+	 */
+	if (p->mm && printk_ratelimit()) {
+		printk(KERN_INFO "process %d (%s) no longer affine to cpu%d\n",
+				task_pid_nr(p), p->comm, cpu);
 	}
 
 	return dest_cpu;
@@ -5727,29 +5724,20 @@ static int migration_cpu_stop(void *data
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
+
 /*
- * Figure out where task on dead CPU should go, use force if necessary.
+ * Ensures that the idle task is using init_mm right before its cpu goes
+ * offline.
  */
-void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
+void idle_task_exit(void)
 {
-	struct rq *rq = cpu_rq(dead_cpu);
-	int needs_cpu, uninitialized_var(dest_cpu);
-	unsigned long flags;
+	struct mm_struct *mm = current->active_mm;
 
-	local_irq_save(flags);
+	BUG_ON(cpu_online(smp_processor_id()));
 
-	raw_spin_lock(&rq->lock);
-	needs_cpu = (task_cpu(p) == dead_cpu) && (p->state != TASK_WAKING);
-	if (needs_cpu)
-		dest_cpu = select_fallback_rq(dead_cpu, p);
-	raw_spin_unlock(&rq->lock);
-	/*
-	 * It can only fail if we race with set_cpus_allowed(),
-	 * in the racer should migrate the task anyway.
-	 */
-	if (needs_cpu)
-		__migrate_task(p, dead_cpu, dest_cpu);
-	local_irq_restore(flags);
+	if (mm != &init_mm)
+		switch_mm(mm, &init_mm, current);
+	mmdrop(mm);
 }
 
 /*
@@ -5762,128 +5750,69 @@ void move_task_off_dead_cpu(int dead_cpu
 static void migrate_nr_uninterruptible(struct rq *rq_src)
 {
 	struct rq *rq_dest = cpu_rq(cpumask_any(cpu_active_mask));
-	unsigned long flags;
 
-	local_irq_save(flags);
-	double_rq_lock(rq_src, rq_dest);
 	rq_dest->nr_uninterruptible += rq_src->nr_uninterruptible;
 	rq_src->nr_uninterruptible = 0;
-	double_rq_unlock(rq_src, rq_dest);
-	local_irq_restore(flags);
-}
-
-/* Run through task list and migrate tasks from the dead cpu. */
-static void migrate_live_tasks(int src_cpu)
-{
-	struct task_struct *p, *t;
-
-	read_lock(&tasklist_lock);
-
-	do_each_thread(t, p) {
-		if (p == current)
-			continue;
-
-		if (task_cpu(p) == src_cpu)
-			move_task_off_dead_cpu(src_cpu, p);
-	} while_each_thread(t, p);
-
-	read_unlock(&tasklist_lock);
 }
 
 /*
- * Schedules idle task to be the next runnable task on current CPU.
- * It does so by boosting its priority to highest possible.
- * Used by CPU offline code.
- */
-void sched_idle_next(void)
-{
-	int this_cpu = smp_processor_id();
-	struct rq *rq = cpu_rq(this_cpu);
-	struct task_struct *p = rq->idle;
-	unsigned long flags;
-
-	/* cpu has to be offline */
-	BUG_ON(cpu_online(this_cpu));
-
-	/*
-	 * Strictly not necessary since rest of the CPUs are stopped by now
-	 * and interrupts disabled on the current cpu.
-	 */
-	raw_spin_lock_irqsave(&rq->lock, flags);
-
-	__setscheduler(rq, p, SCHED_FIFO, MAX_RT_PRIO-1);
-
-	activate_task(rq, p, 0);
-
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
-}
-
-/*
- * Ensures that the idle task is using init_mm right before its cpu goes
- * offline.
+ * remove the tasks which were accounted by rq from calc_load_tasks.
  */
-void idle_task_exit(void)
+static void calc_global_load_remove(struct rq *rq)
 {
-	struct mm_struct *mm = current->active_mm;
-
-	BUG_ON(cpu_online(smp_processor_id()));
-
-	if (mm != &init_mm)
-		switch_mm(mm, &init_mm, current);
-	mmdrop(mm);
+	atomic_long_sub(rq->calc_load_active, &calc_load_tasks);
+	rq->calc_load_active = 0;
 }
 
-/* called under rq->lock with disabled interrupts */
-static void migrate_dead(unsigned int dead_cpu, struct task_struct *p)
+/*
+ * Migrate all tasks from the rq, sleeping tasks will be migrated by
+ * try_to_wake_up()->select_task_rq().
+ *
+ * Called with rq->lock held even though we'er in stop_machine() and
+ * there's no concurrency possible, we hold the required locks anyway
+ * because of lock validation efforts.
+ */
+static void migrate_tasks(unsigned int dead_cpu)
 {
 	struct rq *rq = cpu_rq(dead_cpu);
-
-	/* Must be exiting, otherwise would be on tasklist. */
-	BUG_ON(!p->exit_state);
-
-	/* Cannot have done final schedule yet: would have vanished. */
-	BUG_ON(p->state == TASK_DEAD);
-
-	get_task_struct(p);
+	struct task_struct *next, *stop = rq->stop;
+	int dest_cpu;
 
 	/*
-	 * Drop lock around migration; if someone else moves it,
-	 * that's OK. No task can be added to this CPU, so iteration is
-	 * fine.
+	 * Fudge the rq selection such that the below task selection loop
+	 * doesn't get stuck on the currently eligible stop task.
+	 *
+	 * We're currently inside stop_machine() and the rq is either stuck
+	 * in the stop_machine_cpu_stop() loop, or we're executing this code,
+	 * either way we should never end up calling schedule() until we're
+	 * done here.
 	 */
-	raw_spin_unlock_irq(&rq->lock);
-	move_task_off_dead_cpu(dead_cpu, p);
-	raw_spin_lock_irq(&rq->lock);
-
-	put_task_struct(p);
-}
-
-/* release_task() removes task from tasklist, so we won't find dead tasks. */
-static void migrate_dead_tasks(unsigned int dead_cpu)
-{
-	struct rq *rq = cpu_rq(dead_cpu);
-	struct task_struct *next;
+	rq->stop = NULL;
 
 	for ( ; ; ) {
-		if (!rq->nr_running)
+		/*
+		 * There's this thread running, bail when that's the only
+		 * remaining thread.
+		 */
+		if (rq->nr_running == 1)
 			break;
+
 		next = pick_next_task(rq);
-		if (!next)
-			break;
+		BUG_ON(!next);
 		next->sched_class->put_prev_task(rq, next);
-		migrate_dead(dead_cpu, next);
 
+		/* Find suitable destination for @next, with force if needed. */
+		dest_cpu = select_fallback_rq(dead_cpu, next);
+		raw_spin_unlock(&rq->lock);
+
+		__migrate_task(next, dead_cpu, dest_cpu);
+
+		raw_spin_lock(&rq->lock);
 	}
-}
 
-/*
- * remove the tasks which were accounted by rq from calc_load_tasks.
- */
-static void calc_global_load_remove(struct rq *rq)
-{
-	atomic_long_sub(rq->calc_load_active, &calc_load_tasks);
-	rq->calc_load_active = 0;
+	rq->stop = stop;
 }
+
 #endif /* CONFIG_HOTPLUG_CPU */
 
 #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL)
@@ -6093,15 +6022,13 @@ migration_call(struct notifier_block *nf
 	unsigned long flags;
 	struct rq *rq = cpu_rq(cpu);
 
-	switch (action) {
+	switch (action & ~CPU_TASKS_FROZEN) {
 
 	case CPU_UP_PREPARE:
-	case CPU_UP_PREPARE_FROZEN:
 		rq->calc_load_update = calc_load_update;
 		break;
 
 	case CPU_ONLINE:
-	case CPU_ONLINE_FROZEN:
 		/* Update our root-domain */
 		raw_spin_lock_irqsave(&rq->lock, flags);
 		if (rq->rd) {
@@ -6113,30 +6040,19 @@ migration_call(struct notifier_block *nf
 		break;
 
 #ifdef CONFIG_HOTPLUG_CPU
-	case CPU_DEAD:
-	case CPU_DEAD_FROZEN:
-		migrate_live_tasks(cpu);
-		/* Idle task back to normal (off runqueue, low prio) */
-		raw_spin_lock_irq(&rq->lock);
-		deactivate_task(rq, rq->idle, 0);
-		__setscheduler(rq, rq->idle, SCHED_NORMAL, 0);
-		rq->idle->sched_class = &idle_sched_class;
-		migrate_dead_tasks(cpu);
-		raw_spin_unlock_irq(&rq->lock);
-		migrate_nr_uninterruptible(rq);
-		BUG_ON(rq->nr_running != 0);
-		calc_global_load_remove(rq);
-		break;
-
 	case CPU_DYING:
-	case CPU_DYING_FROZEN:
 		/* Update our root-domain */
 		raw_spin_lock_irqsave(&rq->lock, flags);
 		if (rq->rd) {
 			BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));
 			set_rq_offline(rq);
 		}
+		migrate_tasks(cpu);
+		BUG_ON(rq->nr_running != 1); /* the migration thread */
 		raw_spin_unlock_irqrestore(&rq->lock, flags);
+
+		migrate_nr_uninterruptible(rq);
+		calc_global_load_remove(rq);
 		break;
 #endif
 	}


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 18/22] sched: add reclaiming logic to -deadline tasks
  2010-11-15 18:37             ` James H. Anderson
  2010-11-15 19:23               ` Luca Abeni
  2010-11-15 19:39               ` Luca Abeni
@ 2010-11-15 21:34               ` Raistlin
  2 siblings, 0 replies; 135+ messages in thread
From: Raistlin @ 2010-11-15 21:34 UTC (permalink / raw)
  To: James H. Anderson
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, oleg, Frederic Weisbecker, Darren Hart,
	Johan Eker, p.faure, linux-kernel, Claudio Scordino,
	michael trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Nicola Manica, Luca Abeni, Dhaval Giani, Harald Gustafsson,
	paulmck, Bjoern Brandenburg

[-- Attachment #1: Type: text/plain, Size: 1777 bytes --]

On Mon, 2010-11-15 at 13:37 -0500, James H. Anderson wrote:
> Sorry for the delayed response... I think I must have inadvertently 
> deleted this
>
NP. Hope it's not my fault, due to GPG and stuff... :-)

> If you're talking about our most recent "stochastic" paper, it is about 
> supporting
> soft real-time task systems on a multiprocessor where resource 
> reservations are
> used.  
>
Actually, we were talking about the previous one, which (if I got it
well) didn't include reservations, and that's why we were wondering how
to integrate it in the scheduler (which _is_ reservation based).

Now that this one is out, going for it simply solves all our issues,
also considering that it says something very similar to what we (Pisa
guys :-D) are familiar with, since we investigated this stuff too
sometime ago (actually, Luca did).

> However, such methods would be easy to incorporate.  
>
I think it really should be... Actually, I'm under the impression that
we don't even need variance/std-dev of the execution time to be part of
the interface, do we?

> I gathered there was
> some confusion about whether we were using resource reservations.  Such
> reservations are actually crucial for our analysis as they allow 
> independence to be assumed across tasks.
>
Yes, as I said, this was because we were looking at first paper. Now we
got it, thanks a lot for clarifying! :-)

> I hope this helps.
> 
A lot, thanks again.

Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://blog.linux.it/raistlin / raistlin@ekiga.net /
dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] sched: Simplify cpu-hot-unplug task migration
  2010-11-15 20:06                     ` [PATCH] sched: Simplify cpu-hot-unplug task migration Peter Zijlstra
@ 2010-11-17 19:27                       ` Oleg Nesterov
  2010-11-17 19:42                         ` Peter Zijlstra
  2010-11-18 14:09                       ` [tip:sched/core] " tip-bot for Peter Zijlstra
  1 sibling, 1 reply; 135+ messages in thread
From: Oleg Nesterov @ 2010-11-17 19:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, Frederic Weisbecker, Darren Hart, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

Peter, sorry for delay.

I was going to read this patch carefully today, but due to the holiday
in the Czech Republic I have to drink (too much) beer instead ;)

This means you should probably ignore my question, but can't resist...

> -static void migrate_dead_tasks(unsigned int dead_cpu)
> -{
> -	struct rq *rq = cpu_rq(dead_cpu);
> -	struct task_struct *next;
> +	rq->stop = NULL;

(or we could do current->state = TASK_INTERRUPTIPLE, afaics)

>  	for ( ; ; ) {
> -		if (!rq->nr_running)
> +		/*
> +		 * There's this thread running, bail when that's the only
> +		 * remaining thread.
> +		 */
> +		if (rq->nr_running == 1)
>  			break;

I was very much confused, and I was going to say this is wrong.
However, now I think this is correct, just the comment is not
right.

There is another running thread we should not migrate, rq->idle.
If nothing else, dequeue_task_idle() should be never called.

But, if I understand correctly, ->nr_running does not account
the idle thread, and this is what makes this correct.

Correct?

Oleg.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] sched: Simplify cpu-hot-unplug task migration
  2010-11-17 19:27                       ` Oleg Nesterov
@ 2010-11-17 19:42                         ` Peter Zijlstra
  2010-11-18 14:05                           ` Oleg Nesterov
  0 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-17 19:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, Frederic Weisbecker, Darren Hart, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Wed, 2010-11-17 at 20:27 +0100, Oleg Nesterov wrote:
> Peter, sorry for delay.
> 
> I was going to read this patch carefully today, but due to the holiday
> in the Czech Republic I have to drink (too much) beer instead ;)
> 
> This means you should probably ignore my question, but can't resist...
> 
> > -static void migrate_dead_tasks(unsigned int dead_cpu)
> > -{
> > -	struct rq *rq = cpu_rq(dead_cpu);
> > -	struct task_struct *next;
> > +	rq->stop = NULL;
> 
> (or we could do current->state = TASK_INTERRUPTIPLE, afaics)

Ah, you missed a patch that made pick_next_task_stop() look like:

static struct task_struct *pick_next_task_stop(struct rq *rq)
{
        struct task_struct *stop = rq->stop;

        if (stop && stop->se.on_rq)
                return stop;

        return NULL;
}

> >  	for ( ; ; ) {
> > -		if (!rq->nr_running)
> > +		/*
> > +		 * There's this thread running, bail when that's the only
> > +		 * remaining thread.
> > +		 */
> > +		if (rq->nr_running == 1)
> >  			break;
> 
> I was very much confused, and I was going to say this is wrong.
> However, now I think this is correct, just the comment is not
> right.
> 
> There is another running thread we should not migrate, rq->idle.
> If nothing else, dequeue_task_idle() should be never called.

In fact, dequeue_task_idle() will yell if you try that ;-)

> But, if I understand correctly, ->nr_running does not account
> the idle thread, and this is what makes this correct.
> 
> Correct?

Right, I can add: (the idle thread is not counted in nr_running), if
that makes things clearer for you; however its a quite fundamental
property, we don't consider the idle task a proper runnable entity, its
simply the thing we do when there's nothing else to do.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] sched: Simplify cpu-hot-unplug task migration
  2010-11-17 19:42                         ` Peter Zijlstra
@ 2010-11-18 14:05                           ` Oleg Nesterov
  2010-11-18 14:24                             ` Peter Zijlstra
  0 siblings, 1 reply; 135+ messages in thread
From: Oleg Nesterov @ 2010-11-18 14:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, Frederic Weisbecker, Darren Hart, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On 11/17, Peter Zijlstra wrote:
>
> On Wed, 2010-11-17 at 20:27 +0100, Oleg Nesterov wrote:
>
> > > -static void migrate_dead_tasks(unsigned int dead_cpu)
> > > -{
> > > -	struct rq *rq = cpu_rq(dead_cpu);
> > > -	struct task_struct *next;
> > > +	rq->stop = NULL;
> >
> > (or we could do current->state = TASK_INTERRUPTIPLE, afaics)
>
> Ah, you missed a patch that made pick_next_task_stop() look like:
>
> static struct task_struct *pick_next_task_stop(struct rq *rq)
> {
>         struct task_struct *stop = rq->stop;
>
>         if (stop && stop->se.on_rq)

Yes, thanks.

> > >  	for ( ; ; ) {
> > > -		if (!rq->nr_running)
> > > +		/*
> > > +		 * There's this thread running, bail when that's the only
> > > +		 * remaining thread.
> > > +		 */
> > > +		if (rq->nr_running == 1)
> > >  			break;
> >
> > I was very much confused, and I was going to say this is wrong.
> > However, now I think this is correct, just the comment is not
> > right.
> >
> > There is another running thread we should not migrate, rq->idle.
> > If nothing else, dequeue_task_idle() should be never called.
>
> In fact, dequeue_task_idle() will yell if you try that ;-)
>
> > But, if I understand correctly, ->nr_running does not account
> > the idle thread, and this is what makes this correct.
> >
> > Correct?
>
> Right, I can add: (the idle thread is not counted in nr_running), if
> that makes things clearer for you; however its a quite fundamental
> property,

Yes, I see now.

OK, this also explains my previous questions. I greatly misunderstood
this "small detail", starting from your initial patch. Every time I
thought you are trying to migrate rq->idle as well.

Thanks Peter. Only one question,

> @@ -253,9 +246,12 @@ static int __ref _cpu_down(unsigned int
>  	}
>  	BUG_ON(cpu_online(cpu));
>
> -	/* Wait for it to sleep (leaving idle task). */
> -	while (!idle_cpu(cpu))
> -		yield();
> +	/*
> +	 * The migration_call() CPU_DYING callback will have removed all
> +	 * runnable tasks from the cpu, there's only the idle task left now
> +	 * that the migration thread is done doing the stop_machine thing.
> +	 */
> +	BUG_ON(!idle_cpu(cpu));

I am not sure.

Yes, we know for sure rhat the only runnable task is rq->idle.
But only after migration thread calls schedule() and switches to the
idle thread.

However, I see nothing which can guarantee this. Migration thread
running on the dead cpu wakes up the caller of stop_cpus() before
it calls schedule(), _cpu_down() can check rq->curr before it was
changed.

No?



Hmm. In fact, I think it is possible that cpu_stopper_thread() can
have more cpu_stop_work's queued when __stop_machine() returns.
This has nothing to do with this patch, but I think it makes sense
to clear stopper->enabled at CPU_DYING stage as well (of course,
this needs a separate patch).

Oleg.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [tip:sched/core] sched: Simplify cpu-hot-unplug task migration
  2010-11-15 20:06                     ` [PATCH] sched: Simplify cpu-hot-unplug task migration Peter Zijlstra
  2010-11-17 19:27                       ` Oleg Nesterov
@ 2010-11-18 14:09                       ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 135+ messages in thread
From: tip-bot for Peter Zijlstra @ 2010-11-18 14:09 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, oleg, tglx, mingo

Commit-ID:  48c5ccae88dcd989d9de507e8510313c6cbd352b
Gitweb:     http://git.kernel.org/tip/48c5ccae88dcd989d9de507e8510313c6cbd352b
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Sat, 13 Nov 2010 19:32:29 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Thu, 18 Nov 2010 13:27:46 +0100

sched: Simplify cpu-hot-unplug task migration

While discussing the need for sched_idle_next(), Oleg remarked that
since try_to_wake_up() ensures sleeping tasks will end up running on a
sane cpu, we can do away with migrate_live_tasks().

If we then extend the existing hack of migrating current from
CPU_DYING to migrating the full rq worth of tasks from CPU_DYING, the
need for the sched_idle_next() abomination disappears as well, since
idle will be the only possible thread left after the migration thread
stops.

This greatly simplifies the hot-unplug task migration path, as can be
seen from the resulting code reduction (and about half the new lines
are comments).

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1289851597.2109.547.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h |    3 -
 kernel/cpu.c          |   16 ++---
 kernel/sched.c        |  206 +++++++++++++++----------------------------------
 3 files changed, 67 insertions(+), 158 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3cd70cf..29d953a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1871,14 +1871,11 @@ extern void sched_clock_idle_sleep_event(void);
 extern void sched_clock_idle_wakeup_event(u64 delta_ns);
 
 #ifdef CONFIG_HOTPLUG_CPU
-extern void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p);
 extern void idle_task_exit(void);
 #else
 static inline void idle_task_exit(void) {}
 #endif
 
-extern void sched_idle_next(void);
-
 #if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
 extern void wake_up_idle_cpu(int cpu);
 #else
diff --git a/kernel/cpu.c b/kernel/cpu.c
index f6e726f..8615aa6 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -189,7 +189,6 @@ static inline void check_for_tasks(int cpu)
 }
 
 struct take_cpu_down_param {
-	struct task_struct *caller;
 	unsigned long mod;
 	void *hcpu;
 };
@@ -208,11 +207,6 @@ static int __ref take_cpu_down(void *_param)
 
 	cpu_notify(CPU_DYING | param->mod, param->hcpu);
 
-	if (task_cpu(param->caller) == cpu)
-		move_task_off_dead_cpu(cpu, param->caller);
-	/* Force idle task to run as soon as we yield: it should
-	   immediately notice cpu is offline and die quickly. */
-	sched_idle_next();
 	return 0;
 }
 
@@ -223,7 +217,6 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
 	void *hcpu = (void *)(long)cpu;
 	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
 	struct take_cpu_down_param tcd_param = {
-		.caller = current,
 		.mod = mod,
 		.hcpu = hcpu,
 	};
@@ -253,9 +246,12 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
 	}
 	BUG_ON(cpu_online(cpu));
 
-	/* Wait for it to sleep (leaving idle task). */
-	while (!idle_cpu(cpu))
-		yield();
+	/*
+	 * The migration_call() CPU_DYING callback will have removed all
+	 * runnable tasks from the cpu, there's only the idle task left now
+	 * that the migration thread is done doing the stop_machine thing.
+	 */
+	BUG_ON(!idle_cpu(cpu));
 
 	/* This actually kills the CPU. */
 	__cpu_die(cpu);
diff --git a/kernel/sched.c b/kernel/sched.c
index 41f1869..b0d5f1b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2366,18 +2366,15 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
 		return dest_cpu;
 
 	/* No more Mr. Nice Guy. */
-	if (unlikely(dest_cpu >= nr_cpu_ids)) {
-		dest_cpu = cpuset_cpus_allowed_fallback(p);
-		/*
-		 * Don't tell them about moving exiting tasks or
-		 * kernel threads (both mm NULL), since they never
-		 * leave kernel.
-		 */
-		if (p->mm && printk_ratelimit()) {
-			printk(KERN_INFO "process %d (%s) no "
-			       "longer affine to cpu%d\n",
-			       task_pid_nr(p), p->comm, cpu);
-		}
+	dest_cpu = cpuset_cpus_allowed_fallback(p);
+	/*
+	 * Don't tell them about moving exiting tasks or
+	 * kernel threads (both mm NULL), since they never
+	 * leave kernel.
+	 */
+	if (p->mm && printk_ratelimit()) {
+		printk(KERN_INFO "process %d (%s) no longer affine to cpu%d\n",
+				task_pid_nr(p), p->comm, cpu);
 	}
 
 	return dest_cpu;
@@ -5712,29 +5709,20 @@ static int migration_cpu_stop(void *data)
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
+
 /*
- * Figure out where task on dead CPU should go, use force if necessary.
+ * Ensures that the idle task is using init_mm right before its cpu goes
+ * offline.
  */
-void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
+void idle_task_exit(void)
 {
-	struct rq *rq = cpu_rq(dead_cpu);
-	int needs_cpu, uninitialized_var(dest_cpu);
-	unsigned long flags;
+	struct mm_struct *mm = current->active_mm;
 
-	local_irq_save(flags);
+	BUG_ON(cpu_online(smp_processor_id()));
 
-	raw_spin_lock(&rq->lock);
-	needs_cpu = (task_cpu(p) == dead_cpu) && (p->state != TASK_WAKING);
-	if (needs_cpu)
-		dest_cpu = select_fallback_rq(dead_cpu, p);
-	raw_spin_unlock(&rq->lock);
-	/*
-	 * It can only fail if we race with set_cpus_allowed(),
-	 * in the racer should migrate the task anyway.
-	 */
-	if (needs_cpu)
-		__migrate_task(p, dead_cpu, dest_cpu);
-	local_irq_restore(flags);
+	if (mm != &init_mm)
+		switch_mm(mm, &init_mm, current);
+	mmdrop(mm);
 }
 
 /*
@@ -5747,128 +5735,69 @@ void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
 static void migrate_nr_uninterruptible(struct rq *rq_src)
 {
 	struct rq *rq_dest = cpu_rq(cpumask_any(cpu_active_mask));
-	unsigned long flags;
 
-	local_irq_save(flags);
-	double_rq_lock(rq_src, rq_dest);
 	rq_dest->nr_uninterruptible += rq_src->nr_uninterruptible;
 	rq_src->nr_uninterruptible = 0;
-	double_rq_unlock(rq_src, rq_dest);
-	local_irq_restore(flags);
-}
-
-/* Run through task list and migrate tasks from the dead cpu. */
-static void migrate_live_tasks(int src_cpu)
-{
-	struct task_struct *p, *t;
-
-	read_lock(&tasklist_lock);
-
-	do_each_thread(t, p) {
-		if (p == current)
-			continue;
-
-		if (task_cpu(p) == src_cpu)
-			move_task_off_dead_cpu(src_cpu, p);
-	} while_each_thread(t, p);
-
-	read_unlock(&tasklist_lock);
 }
 
 /*
- * Schedules idle task to be the next runnable task on current CPU.
- * It does so by boosting its priority to highest possible.
- * Used by CPU offline code.
+ * remove the tasks which were accounted by rq from calc_load_tasks.
  */
-void sched_idle_next(void)
+static void calc_global_load_remove(struct rq *rq)
 {
-	int this_cpu = smp_processor_id();
-	struct rq *rq = cpu_rq(this_cpu);
-	struct task_struct *p = rq->idle;
-	unsigned long flags;
-
-	/* cpu has to be offline */
-	BUG_ON(cpu_online(this_cpu));
-
-	/*
-	 * Strictly not necessary since rest of the CPUs are stopped by now
-	 * and interrupts disabled on the current cpu.
-	 */
-	raw_spin_lock_irqsave(&rq->lock, flags);
-
-	__setscheduler(rq, p, SCHED_FIFO, MAX_RT_PRIO-1);
-
-	activate_task(rq, p, 0);
-
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
+	atomic_long_sub(rq->calc_load_active, &calc_load_tasks);
+	rq->calc_load_active = 0;
 }
 
 /*
- * Ensures that the idle task is using init_mm right before its cpu goes
- * offline.
+ * Migrate all tasks from the rq, sleeping tasks will be migrated by
+ * try_to_wake_up()->select_task_rq().
+ *
+ * Called with rq->lock held even though we'er in stop_machine() and
+ * there's no concurrency possible, we hold the required locks anyway
+ * because of lock validation efforts.
  */
-void idle_task_exit(void)
-{
-	struct mm_struct *mm = current->active_mm;
-
-	BUG_ON(cpu_online(smp_processor_id()));
-
-	if (mm != &init_mm)
-		switch_mm(mm, &init_mm, current);
-	mmdrop(mm);
-}
-
-/* called under rq->lock with disabled interrupts */
-static void migrate_dead(unsigned int dead_cpu, struct task_struct *p)
+static void migrate_tasks(unsigned int dead_cpu)
 {
 	struct rq *rq = cpu_rq(dead_cpu);
-
-	/* Must be exiting, otherwise would be on tasklist. */
-	BUG_ON(!p->exit_state);
-
-	/* Cannot have done final schedule yet: would have vanished. */
-	BUG_ON(p->state == TASK_DEAD);
-
-	get_task_struct(p);
+	struct task_struct *next, *stop = rq->stop;
+	int dest_cpu;
 
 	/*
-	 * Drop lock around migration; if someone else moves it,
-	 * that's OK. No task can be added to this CPU, so iteration is
-	 * fine.
+	 * Fudge the rq selection such that the below task selection loop
+	 * doesn't get stuck on the currently eligible stop task.
+	 *
+	 * We're currently inside stop_machine() and the rq is either stuck
+	 * in the stop_machine_cpu_stop() loop, or we're executing this code,
+	 * either way we should never end up calling schedule() until we're
+	 * done here.
 	 */
-	raw_spin_unlock_irq(&rq->lock);
-	move_task_off_dead_cpu(dead_cpu, p);
-	raw_spin_lock_irq(&rq->lock);
-
-	put_task_struct(p);
-}
-
-/* release_task() removes task from tasklist, so we won't find dead tasks. */
-static void migrate_dead_tasks(unsigned int dead_cpu)
-{
-	struct rq *rq = cpu_rq(dead_cpu);
-	struct task_struct *next;
+	rq->stop = NULL;
 
 	for ( ; ; ) {
-		if (!rq->nr_running)
+		/*
+		 * There's this thread running, bail when that's the only
+		 * remaining thread.
+		 */
+		if (rq->nr_running == 1)
 			break;
+
 		next = pick_next_task(rq);
-		if (!next)
-			break;
+		BUG_ON(!next);
 		next->sched_class->put_prev_task(rq, next);
-		migrate_dead(dead_cpu, next);
 
+		/* Find suitable destination for @next, with force if needed. */
+		dest_cpu = select_fallback_rq(dead_cpu, next);
+		raw_spin_unlock(&rq->lock);
+
+		__migrate_task(next, dead_cpu, dest_cpu);
+
+		raw_spin_lock(&rq->lock);
 	}
-}
 
-/*
- * remove the tasks which were accounted by rq from calc_load_tasks.
- */
-static void calc_global_load_remove(struct rq *rq)
-{
-	atomic_long_sub(rq->calc_load_active, &calc_load_tasks);
-	rq->calc_load_active = 0;
+	rq->stop = stop;
 }
+
 #endif /* CONFIG_HOTPLUG_CPU */
 
 #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_SYSCTL)
@@ -6078,15 +6007,13 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
 	unsigned long flags;
 	struct rq *rq = cpu_rq(cpu);
 
-	switch (action) {
+	switch (action & ~CPU_TASKS_FROZEN) {
 
 	case CPU_UP_PREPARE:
-	case CPU_UP_PREPARE_FROZEN:
 		rq->calc_load_update = calc_load_update;
 		break;
 
 	case CPU_ONLINE:
-	case CPU_ONLINE_FROZEN:
 		/* Update our root-domain */
 		raw_spin_lock_irqsave(&rq->lock, flags);
 		if (rq->rd) {
@@ -6098,30 +6025,19 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
 		break;
 
 #ifdef CONFIG_HOTPLUG_CPU
-	case CPU_DEAD:
-	case CPU_DEAD_FROZEN:
-		migrate_live_tasks(cpu);
-		/* Idle task back to normal (off runqueue, low prio) */
-		raw_spin_lock_irq(&rq->lock);
-		deactivate_task(rq, rq->idle, 0);
-		__setscheduler(rq, rq->idle, SCHED_NORMAL, 0);
-		rq->idle->sched_class = &idle_sched_class;
-		migrate_dead_tasks(cpu);
-		raw_spin_unlock_irq(&rq->lock);
-		migrate_nr_uninterruptible(rq);
-		BUG_ON(rq->nr_running != 0);
-		calc_global_load_remove(rq);
-		break;
-
 	case CPU_DYING:
-	case CPU_DYING_FROZEN:
 		/* Update our root-domain */
 		raw_spin_lock_irqsave(&rq->lock, flags);
 		if (rq->rd) {
 			BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));
 			set_rq_offline(rq);
 		}
+		migrate_tasks(cpu);
+		BUG_ON(rq->nr_running != 1); /* the migration thread */
 		raw_spin_unlock_irqrestore(&rq->lock, flags);
+
+		migrate_nr_uninterruptible(rq);
+		calc_global_load_remove(rq);
 		break;
 #endif
 	}

^ permalink raw reply related	[flat|nested] 135+ messages in thread

* Re: [PATCH] sched: Simplify cpu-hot-unplug task migration
  2010-11-18 14:05                           ` Oleg Nesterov
@ 2010-11-18 14:24                             ` Peter Zijlstra
  2010-11-18 15:32                               ` Oleg Nesterov
  0 siblings, 1 reply; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-18 14:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, Frederic Weisbecker, Darren Hart, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Thu, 2010-11-18 at 15:05 +0100, Oleg Nesterov wrote:
> > -     /* Wait for it to sleep (leaving idle task). */
> > -     while (!idle_cpu(cpu))
> > -             yield();
> > +     /*
> > +      * The migration_call() CPU_DYING callback will have removed all
> > +      * runnable tasks from the cpu, there's only the idle task left now
> > +      * that the migration thread is done doing the stop_machine thing.
> > +      */
> > +     BUG_ON(!idle_cpu(cpu));
> 
> I am not sure.
> 
> Yes, we know for sure rhat the only runnable task is rq->idle.
> But only after migration thread calls schedule() and switches to the
> idle thread.
> 
> However, I see nothing which can guarantee this. Migration thread
> running on the dead cpu wakes up the caller of stop_cpus() before
> it calls schedule(), _cpu_down() can check rq->curr before it was
> changed.
> 
> No?
> 
> 
> 
> Hmm. In fact, I think it is possible that cpu_stopper_thread() can
> have more cpu_stop_work's queued when __stop_machine() returns.
> This has nothing to do with this patch, but I think it makes sense
> to clear stopper->enabled at CPU_DYING stage as well (of course,
> this needs a separate patch). 

Hmm, I think you're right, although I haven't hit that case during
testing.

There is no firm guarantee the dying cpu actually got to running the
idle thread (there's a guarantee it will at some point), so we ought to
maintain that wait-loop, possibly using cpu_relax(), I don't see the
point in calling yield() here.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH] sched: Simplify cpu-hot-unplug task migration
  2010-11-18 14:24                             ` Peter Zijlstra
@ 2010-11-18 15:32                               ` Oleg Nesterov
  0 siblings, 0 replies; 135+ messages in thread
From: Oleg Nesterov @ 2010-11-18 15:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Raistlin, Ingo Molnar, Thomas Gleixner, Steven Rostedt,
	Chris Friesen, Frederic Weisbecker, Darren Hart, Johan Eker,
	p.faure, linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On 11/18, Peter Zijlstra wrote:
>
> There is no firm guarantee the dying cpu actually got to running the
> idle thread (there's a guarantee it will at some point), so we ought to
> maintain that wait-loop, possibly using cpu_relax(), I don't see the
> point in calling yield() here.

Agreed. But do we need to wait at all?

With or without this change, even if we know that rq->idle is running
we can't know if it (say) already started play_dead_common() or not.

We are going to call __cpu_die(), afaics it should do the necessary
synchronization in any case.

For example, native_cpu_die() waits for cpu_state == CPU_DEAD in a
loop. Of course it should work in practice (it also does msleep),
but in theory there is no guarantee.

So. Can't we just remove this wait-loop? We know that rq->idle
will be scheduled "soon", I don't understand why it is necessary
to ensure that context_switch() has already happened.

Oleg.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation
  2010-11-14  8:54   ` Raistlin
@ 2010-11-23 14:24     ` Peter Zijlstra
  0 siblings, 0 replies; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-23 14:24 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Sun, 2010-11-14 at 09:54 +0100, Raistlin wrote:
> On Fri, 2010-10-29 at 08:30 +0200, Raistlin wrote:
> > +static void task_fork_dl(struct task_struct *p)
> > +{
> > +	/*
> > +	 * The child of a -deadline task will be SCHED_DEADLINE, but
> > +	 * as a throttled task. This means the parent (or someone else)
> > +	 * must call sched_setscheduler_ex() on it, or it won't even
> > +	 * start.
> > +	 */
> > +	p->dl.dl_throttled = 1;
> > +	p->dl.dl_new = 0;
> > +}
> > +
> So, this is also something we only discussed once without reaching a
> conclusive statement... Are we ok with this behaviour?
> 
> I'm not, actually, but I'm not sure how to do better, considering:
>  - resetting task to SCHED_OTHER on fork is useless and confusing, since
>    RESET_ON_FORK already exists and allows for this.
>  - cloning the parent's bandwidth and try to see if it fits might be
>    nice, but what if the check fails? fork fails as well? If yes, what
>    should the parent do, lower its own bandwidth (since it's being
>    cloned) and try forking again? If yes, lowering it by how much? 
>    Mmm... Not sure it would fly... :-(
>  - splitting the bandwidth of the parent in two would make sense
>    to me, but then it has to be returned at some point, or a poor
>    -deadline shell will reach zero bandwidth after a few `ls'! So, when
>    are we giving the bandwidth back to the parent? When the child dies?
>    What if the child is sched_setscheduled_ex()-ed (either to -deadline 
>    or to something else), we return the bw back (if the call succeeds)
>    as well?
> 
> I was thinking wither to force RESET_ON_FORK for SCHED_DEADLINE or to
> try to pursue the 3rd solution (provided I figure out what to do on
> detaching, parent dying, and such things... :-P).
> 
> Comments very very welcome!

Right, so either this, or we could make sched_deadline tasks fail
fork() ;-)

from a RT perspective such tasks shouldn't fork() anyway, fork()
includes a lot of implicit memory allocations which definitely are not
deterministic (page reclaim etc.).

So yeah, I'm fine with this, it causes some pain, but then, you've
earned this pain by doing this in the first place.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [RFC][PATCH 07/22] sched: SCHED_DEADLINE push and pull logic
  2010-11-14  9:14     ` Raistlin
@ 2010-11-23 14:27       ` Peter Zijlstra
  0 siblings, 0 replies; 135+ messages in thread
From: Peter Zijlstra @ 2010-11-23 14:27 UTC (permalink / raw)
  To: Raistlin
  Cc: Ingo Molnar, Thomas Gleixner, Steven Rostedt, Chris Friesen,
	oleg, Frederic Weisbecker, Darren Hart, Johan Eker, p.faure,
	linux-kernel, Claudio Scordino, michael trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Nicola Manica,
	Luca Abeni, Dhaval Giani, Harald Gustafsson, paulmck

On Sun, 2010-11-14 at 10:14 +0100, Raistlin wrote:
> On Fri, 2010-11-12 at 17:17 +0100, Peter Zijlstra wrote:
> > On Fri, 2010-10-29 at 08:32 +0200, Raistlin wrote:
> > > Add dynamic migrations to SCHED_DEADLINE, so that tasks can
> > > be moved among CPUs when necessary. It is also possible to bind a
> > > task to a (set of) CPU(s), thus restricting its capability of
> > > migrating, or forbidding migrations at all.
> > > 
> > > The very same approach used in sched_rt is utilised:
> > >  - -deadline tasks are kept into CPU-specific runqueues,
> > >  - -deadline tasks are migrated among runqueues to achieve the
> > >    following:
> > >     * on an M-CPU system the M earliest deadline ready tasks
> > >       are always running;
> > >     * affinity/cpusets settings of all the -deadline tasks is
> > >       always respected. 
> > 
> > I haven't fully digested the patch, I keep getting side-tracked and its
> > a large patch.. 
> >
> BTW, I was thinking about your suggestion of adding a *debugging* knob
> for achieving a "lock everything while I'm migrating" behaviour... :-)
> 
> Something like locking the root_domain during pushes and pulls won't
> probably work, since both of them do a double_lock_balance, taking two
> rq, which might race with this new "global" lock.
> Something like we (CPU#1) hold rq1->lock, we take rd->lock, and then we
> try to take rq2->lock. CPU#2 holds rq2->lock and try to take rd->lock.
> Stuck! :-(
> This should be possible if both CPU#1 and CPU#2 are into a push or a
> pull which, on each one, involves some task on the other. Do you agree,
> or I'm missing/mistaking something? :-)
> 
> Something we can probably do is locking the root_domain for
> _each_and_every_ scheduling decision, having all the rq->locks nesting
> inside our new root_domain->lock. This would emulate some sort of unique
> global rq implementation, since also local decisions on a CPU will
> affect all the others, as if they were sharing a single rq... But it's
> going to be very slow on large machines (but I guess we can afford
> that... It's debugging!), and will probably affect other scheduling
> class.
> I'm not sure we want the latter... But maybe it could be useful for
> debugging others too (at least for FIFO/RR, it should be!).
> 
> Let me know what you think...

Ugh!.. lock ordering sucks :-)

I think we can cheat since double_rq_lock() and double_lock_balance()
can already unlock both locks, so you can simply: unlock both, lock rd,
then lock both.



^ permalink raw reply	[flat|nested] 135+ messages in thread

end of thread, other threads:[~2010-11-23 14:28 UTC | newest]

Thread overview: 135+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-10-29  6:18 [RFC][PATCH 00/22] sched: SCHED_DEADLINE v3 Raistlin
2010-10-29  6:25 ` [RFC][PATCH 01/22] sched: add sched_class->task_dead Raistlin
2010-10-29  6:27 ` [RFC][PATCH 02/22] sched: add extended scheduling interface Raistlin
2010-11-10 16:00   ` Dhaval Giani
2010-11-10 16:12     ` Dhaval Giani
2010-11-10 22:45       ` Raistlin
2010-11-10 16:17     ` Claudio Scordino
2010-11-10 17:28   ` Peter Zijlstra
2010-11-10 19:26     ` Peter Zijlstra
2010-11-10 23:33       ` Tommaso Cucinotta
2010-11-11 12:19         ` Peter Zijlstra
2010-11-10 22:17     ` Raistlin
2010-11-10 22:57       ` Tommaso Cucinotta
2010-11-11 13:32       ` Peter Zijlstra
2010-11-11 13:54         ` Raistlin
2010-11-11 14:08           ` Peter Zijlstra
2010-11-11 17:27             ` Raistlin
2010-11-11 14:05         ` Dhaval Giani
2010-11-10 22:24     ` Raistlin
2010-11-10 18:50   ` Peter Zijlstra
2010-11-10 22:05     ` Raistlin
2010-11-12 16:38   ` Steven Rostedt
2010-11-12 16:43     ` Peter Zijlstra
2010-11-12 16:52       ` Steven Rostedt
2010-11-12 19:19         ` Raistlin
2010-11-12 19:23           ` Steven Rostedt
2010-11-12 17:42     ` Tommaso Cucinotta
2010-11-12 19:21       ` Steven Rostedt
2010-11-12 19:24     ` Raistlin
2010-10-29  6:28 ` [RFC][PATCH 03/22] sched: SCHED_DEADLINE data structures Raistlin
2010-11-10 18:59   ` Peter Zijlstra
2010-11-10 22:06     ` Raistlin
2010-11-10 19:10   ` Peter Zijlstra
2010-11-12 17:11     ` Steven Rostedt
2010-10-29  6:29 ` [RFC][PATCH 04/22] sched: SCHED_DEADLINE SMP-related " Raistlin
2010-11-10 19:17   ` Peter Zijlstra
2010-10-29  6:30 ` [RFC][PATCH 05/22] sched: SCHED_DEADLINE policy implementation Raistlin
2010-11-10 19:21   ` Peter Zijlstra
2010-11-10 19:43   ` Peter Zijlstra
2010-11-11  1:02     ` Raistlin
2010-11-10 19:45   ` Peter Zijlstra
2010-11-10 22:26     ` Raistlin
2010-11-10 20:21   ` Peter Zijlstra
2010-11-11  1:18     ` Raistlin
2010-11-11 13:13       ` Peter Zijlstra
2010-11-11 14:13   ` Peter Zijlstra
2010-11-11 14:28     ` Raistlin
2010-11-11 14:17   ` Peter Zijlstra
2010-11-11 18:33     ` Raistlin
2010-11-11 14:25   ` Peter Zijlstra
2010-11-11 14:33     ` Raistlin
2010-11-14  8:54   ` Raistlin
2010-11-23 14:24     ` Peter Zijlstra
2010-10-29  6:31 ` [RFC][PATCH 06/22] sched: SCHED_DEADLINE handles spacial kthreads Raistlin
2010-11-11 14:31   ` Peter Zijlstra
2010-11-11 14:50     ` Dario Faggioli
2010-11-11 14:34   ` Peter Zijlstra
2010-11-11 15:27     ` Oleg Nesterov
2010-11-11 15:43       ` Peter Zijlstra
2010-11-11 16:32         ` Oleg Nesterov
2010-11-13 18:35           ` Peter Zijlstra
2010-11-13 19:58             ` Oleg Nesterov
2010-11-13 20:31               ` Peter Zijlstra
2010-11-13 20:51                 ` Peter Zijlstra
2010-11-13 23:31                   ` Peter Zijlstra
2010-11-15 20:06                     ` [PATCH] sched: Simplify cpu-hot-unplug task migration Peter Zijlstra
2010-11-17 19:27                       ` Oleg Nesterov
2010-11-17 19:42                         ` Peter Zijlstra
2010-11-18 14:05                           ` Oleg Nesterov
2010-11-18 14:24                             ` Peter Zijlstra
2010-11-18 15:32                               ` Oleg Nesterov
2010-11-18 14:09                       ` [tip:sched/core] " tip-bot for Peter Zijlstra
2010-11-11 14:46   ` [RFC][PATCH 06/22] sched: SCHED_DEADLINE handles spacial kthreads Peter Zijlstra
2010-10-29  6:32 ` [RFC][PATCH 07/22] sched: SCHED_DEADLINE push and pull logic Raistlin
2010-11-12 16:17   ` Peter Zijlstra
2010-11-12 21:11     ` Raistlin
2010-11-14  9:14     ` Raistlin
2010-11-23 14:27       ` Peter Zijlstra
2010-10-29  6:33 ` [RFC][PATCH 08/22] sched: SCHED_DEADLINE avg_update accounting Raistlin
2010-11-11 19:16   ` Peter Zijlstra
2010-10-29  6:34 ` [RFC][PATCH 09/22] sched: add period support for -deadline tasks Raistlin
2010-11-11 19:17   ` Peter Zijlstra
2010-11-11 19:31     ` Raistlin
2010-11-11 19:43       ` Peter Zijlstra
2010-11-11 23:33         ` Tommaso Cucinotta
2010-11-12 13:33         ` Raistlin
2010-11-12 13:45           ` Peter Zijlstra
2010-11-12 13:46       ` Luca Abeni
2010-11-12 14:01         ` Raistlin
2010-10-29  6:35 ` [RFC][PATCH 10/22] sched: add a syscall to wait for the next instance Raistlin
2010-11-11 19:21   ` Peter Zijlstra
2010-11-11 19:33     ` Raistlin
2010-10-29  6:35 ` [RFC][PATCH 11/22] sched: add schedstats for -deadline tasks Raistlin
2010-10-29  6:36 ` [RFC][PATCH 12/22] sched: add runtime reporting " Raistlin
2010-11-11 19:37   ` Peter Zijlstra
2010-11-12 16:15     ` Raistlin
2010-11-12 16:27       ` Peter Zijlstra
2010-11-12 21:12         ` Raistlin
2010-10-29  6:37 ` [RFC][PATCH 13/22] sched: add resource limits " Raistlin
2010-11-11 19:57   ` Peter Zijlstra
2010-11-12 21:30     ` Raistlin
2010-11-12 23:32       ` Peter Zijlstra
2010-10-29  6:38 ` [RFC][PATCH 14/22] sched: add latency tracing " Raistlin
2010-10-29  6:38 ` [RFC][PATCH 15/22] sched: add traceporints " Raistlin
2010-11-11 19:54   ` Peter Zijlstra
2010-11-12 16:13     ` Raistlin
2010-10-29  6:39 ` [RFC][PATCH 16/22] sched: add SMP " Raistlin
2010-10-29  6:40 ` [RFC][PATCH 17/22] sched: add signaling overrunning " Raistlin
2010-11-11 21:58   ` Peter Zijlstra
2010-11-12 15:39     ` Raistlin
2010-11-12 16:04       ` Peter Zijlstra
2010-10-29  6:42 ` [RFC][PATCH 19/22] rtmutex: turn the plist into an rb-tree Raistlin
2010-10-29  6:42 ` [RFC][PATCH 18/22] sched: add reclaiming logic to -deadline tasks Raistlin
2010-11-11 22:12   ` Peter Zijlstra
2010-11-12 15:36     ` Raistlin
2010-11-12 16:04       ` Peter Zijlstra
2010-11-12 17:41         ` Luca Abeni
2010-11-12 17:51           ` Peter Zijlstra
2010-11-12 17:54             ` Luca Abeni
2010-11-13 21:08             ` Raistlin
2010-11-12 18:07           ` Tommaso Cucinotta
2010-11-12 19:07             ` Raistlin
2010-11-13  0:43             ` Peter Zijlstra
2010-11-13  1:49               ` Tommaso Cucinotta
2010-11-12 18:56         ` Raistlin
     [not found]           ` <80992760-24F2-42AE-AF2D-15727F6A1C81@email.unc.edu>
2010-11-15 18:37             ` James H. Anderson
2010-11-15 19:23               ` Luca Abeni
2010-11-15 19:49                 ` James H. Anderson
2010-11-15 19:39               ` Luca Abeni
2010-11-15 21:34               ` Raistlin
2010-10-29  6:43 ` [RFC][PATCH 20/22] sched: drafted deadline inheritance logic Raistlin
2010-11-11 22:15   ` Peter Zijlstra
2010-11-14 12:00     ` Raistlin
2010-10-29  6:44 ` [RFC][PATCH 21/22] sched: add bandwidth management for sched_dl Raistlin
2010-10-29  6:45 ` [RFC][PATCH 22/22] sched: add sched_dl documentation Raistlin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).