* [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4
@ 2012-04-06  7:14 Juri Lelli
  2012-04-06  7:14 ` [PATCH 01/16] sched: add sched_class->task_dead Juri Lelli
                   ` (19 more replies)
  0 siblings, 20 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-06  7:14 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, cfriesen, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang

Hello everyone,

This is take 4 of the SCHED_DEADLINE patchset.

Just to recap, the patchset introduces a new deadline-based real-time
task scheduling policy --called SCHED_DEADLINE-- with bandwidth
isolation (aka "resource reservation") capabilities. It now supports
global/clustered multiprocessor scheduling through dynamic task
migrations.


Changes from the previous releases[1]:
  - all the comments and the fixes coming from the reviews we got have
    been considered and applied;
  - better handling of rq selection for dynamic task migration, by means
    of a cpupri equivalent for -deadline tasks (cpudl). The mechanism
    is simple and straightforward, yet shows nice performance figures[2];
  - this time we sit on top of PREEMPT_RT (3.2.13-rt23); we continue to aim
    at mainline inclusion, but we also see -rt folks as immediate and
    interested users.

Still missing/incomplete:
  - (c)group based bandwidth management, and maybe scheduling. It seems
    some more discussion on what precisely we want is *really* needed
    for this point;
  - bandwidth inheritance (to replace deadline/priority inheritance).
    What's in the patchset is little more than a simple placeholder.
    More discussion on the right way to go is needed here.
    Some work has already been done, but it is still not ready for
    submission.

The development is taking place at:
   https://github.com/jlelli/sched-deadline

Check the repositories frequently if you're interested, and feel free to
e-mail me for any issue you run into.

Furthermore, we developed an application that you can use to test this
patchset:
  https://github.com/gbagnoli/rt-app 

We also set up a development mailing list: linux-dl[3].
You can subscribe from here:
http://feanor.sssup.it/mailman/listinfo/linux-dl
or via e-mail (send a message to linux-dl-request@retis.sssup.it with
just the word `help' as subject or in the body to receive info).

As already discussed, we are planning to merge this work with the EDF
throttling patches [https://lkml.org/lkml/2010/2/23/239], but we are still in
the preliminary phases of the merge and we intend to use the feedback on this
post to help us decide on the direction it should take.

As said, the patchset is on top of PREEMPT_RT (as of today). However, Insop
Song (from Ericsson) is maintaining a parallel branch for the current
tip/master (https://github.com/insop/sched-deadline2). Ericsson is in fact
evaluating the use of SCHED_DEADLINE for CPE (Customer Premise Equipment)
devices in order to reserve CPU bandwidth for processes.

The code has been jointly developed by ReTiS Lab (http://retis.sssup.it)
and Evidence S.r.l (http://www.evidence.eu.com) in the context of the ACTORS
EU-funded project (http://www.actors-project.eu). It is now also supported by
the S(o)OS EU-funded project (http://www.soos-project.eu/).
It also has some users, both in academic and applied research. Even though
our last release dates back almost a year, we have continued to get feedback
from Ericsson (see above), Wind River, Porto (ISEP), Trento and Malardalen
universities :-).

As usual, any kind of feedback is welcome and appreciated.

Thanks in advance and regards,

 - Juri

[1] http://lwn.net/Articles/376502, http://lwn.net/Articles/353797,
    http://lwn.net/Articles/412410
[2] http://retis.sssup.it/~jlelli/papers/Ospert11Lelli.pdf
[3] from the first linux-dl message:
    -linux-dl should become the place where discussions about real-time
     deadline scheduling on Linux take place (not only SCHED_DEADLINE).
     We felt the lack of a place where we can keep in touch with each
     other; we are all working on the same things, but probably from
     different viewpoints, and this is surely a point of strength.
     Anyway, our efforts need to be organized in some way, or at least it
     is important to know on what everyone is currently working as to not
     end up with "duplicate-efforts".-

Dario Faggioli (10):
      sched: add extended scheduling interface.
      sched: SCHED_DEADLINE data structures.
      sched: SCHED_DEADLINE policy implementation.
      sched: SCHED_DEADLINE avg_update accounting.
      sched: add schedstats for -deadline tasks.
      sched: add resource limits for -deadline tasks.
      sched: add latency tracing for -deadline tasks.
      sched: drafted deadline inheritance logic.
      sched: add bandwidth management for sched_dl.
      sched: add sched_dl documentation.

Juri Lelli (3):
      sched: SCHED_DEADLINE SMP-related data structures.
      sched: SCHED_DEADLINE push and pull logic
      sched: speed up -dl pushes with a push-heap.

Harald Gustafsson (1):
      sched: add period support for -deadline tasks.

Peter Zijlstra (1):
      rtmutex: turn the plist into an rb-tree.

 Documentation/scheduler/sched-deadline.txt |  147 +++
 arch/arm/include/asm/unistd.h              |    3 +
 arch/arm/kernel/calls.S                    |    3 +
 arch/x86/ia32/ia32entry.S                  |    3 +
 arch/x86/include/asm/unistd_32.h           |    5 +-
 arch/x86/include/asm/unistd_64.h           |    6 +
 arch/x86/kernel/syscall_table_32.S         |    3 +
 include/asm-generic/resource.h             |    7 +-
 include/linux/init_task.h                  |   10 +
 include/linux/rtmutex.h                    |   18 +-
 include/linux/sched.h                      |  154 +++-
 include/linux/syscalls.h                   |    7 +
 kernel/Makefile                            |    1 +
 kernel/fork.c                              |    8 +-
 kernel/hrtimer.c                           |    2 +-
 kernel/rtmutex-debug.c                     |   10 +-
 kernel/rtmutex.c                           |  162 +++-
 kernel/rtmutex_common.h                    |   24 +-
 kernel/sched.c                             |  866 ++++++++++++++-
 kernel/sched_cpudl.c                       |  208 ++++
 kernel/sched_cpudl.h                       |   34 +
 kernel/sched_debug.c                       |   43 +
 kernel/sched_dl.c                          | 1585 ++++++++++++++++++++++++++++
 kernel/sched_rt.c                          |    3 +-
 kernel/sched_stoptask.c                    |    2 +-
 kernel/sysctl.c                            |   14 +
 kernel/trace/trace_sched_wakeup.c          |   41 +-
 kernel/trace/trace_selftest.c              |   30 +-
 28 files changed, 3266 insertions(+), 133 deletions(-)


* [PATCH 01/16] sched: add sched_class->task_dead.
  2012-04-06  7:14 [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4 Juri Lelli
@ 2012-04-06  7:14 ` Juri Lelli
  2012-04-08 17:49   ` Oleg Nesterov
  2012-04-06  7:14 ` [PATCH 02/16] sched: add extended scheduling interface Juri Lelli
                   ` (18 subsequent siblings)
  19 siblings, 1 reply; 129+ messages in thread
From: Juri Lelli @ 2012-04-06  7:14 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, cfriesen, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang

From: Dario Faggioli <raistlin@linux.it>

Add a new function to the scheduling class interface. It is called
at the end of a context switch, if the prev task is in TASK_DEAD state.

It might be useful for scheduling classes that want to be notified
when one of their tasks dies, e.g. to perform some cleanup actions.
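
For illustration, a class would hook it roughly as in the sketch below
(hypothetical example_sched_class, not part of this patch); the -deadline
class introduced later in the series is the intended user, e.g. to tear
down the per-task bandwidth enforcement timer it adds:

/*
 * Hypothetical sketch, not from this patch: release per-task state once
 * the task has gone through its final context switch.
 */
static void task_dead_example(struct task_struct *p)
{
	/*
	 * p is in TASK_DEAD and will never be scheduled again, so it is
	 * safe to tear down per-task class state, e.g. an hrtimer kept
	 * in the task's scheduling entity (as sched_dl does later on).
	 */
	hrtimer_cancel(&p->dl.dl_timer);
}

static const struct sched_class example_sched_class = {
	/* ... the other sched_class methods ... */
	.task_dead	= task_dead_example,
};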

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
---
 include/linux/sched.h |    1 +
 kernel/sched.c        |    3 +++
 2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1f6b11a..ddfc4dc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1116,6 +1116,7 @@ struct sched_class {
 	void (*set_curr_task) (struct rq *rq);
 	void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
 	void (*task_fork) (struct task_struct *p);
+	void (*task_dead) (struct task_struct *p);
 
 	void (*switched_from) (struct rq *this_rq, struct task_struct *task);
 	void (*switched_to) (struct rq *this_rq, struct task_struct *task);
diff --git a/kernel/sched.c b/kernel/sched.c
index 1cc706d..4c8d1c3 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3219,6 +3219,9 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	if (mm)
 		mmdrop_delayed(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {
+		if (prev->sched_class->task_dead)
+			prev->sched_class->task_dead(prev);
+
 		/*
 		 * Remove function-return probe instances associated with this
 		 * task and put them back on the free list.
-- 
1.7.5.4



* [PATCH 02/16] sched: add extended scheduling interface.
  2012-04-06  7:14 [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4 Juri Lelli
  2012-04-06  7:14 ` [PATCH 01/16] sched: add sched_class->task_dead Juri Lelli
@ 2012-04-06  7:14 ` Juri Lelli
  2012-04-06  7:14 ` [PATCH 03/16] sched: SCHED_DEADLINE data structures Juri Lelli
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-06  7:14 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, cfriesen, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang

From: Dario Faggioli <raistlin@linux.it>

Add the interface bits needed for supporting scheduling algorithms
with extended parameters (e.g., SCHED_DEADLINE).

In general, it makes it possible to specify a periodic/sporadic task
that executes for a given amount of runtime at each instance and is
scheduled according to the urgency of its own timing constraints,
i.e.:
 - a (maximum/typical) instance execution time,
 - a minimum interval between consecutive instances,
 - a time constraint by which each instance must be completed.

Thus, both the data structure that holds the scheduling parameters of
the tasks and the system calls dealing with it must be extended.
Unfortunately, modifying the existing struct sched_param would break
the ABI and result in potentially serious compatibility issues with
legacy binaries.

For these reasons, this patch:
 - defines the new struct sched_param2, containing all the fields
   that are necessary for specifying a task in the computational
   model described above;
 - defines and implements the new scheduling related syscalls that
   manipulate it, i.e., sched_setscheduler2(), sched_setparam2()
   and sched_getparam2().

Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
proof of concept and for developing and testing purposes. Making them
available on other architectures is straightforward.

Since no "user" for these new parameters is introduced in this patch,
the implementation of the new system calls is identical to that of
their already existing counterparts. Future patches that implement
scheduling policies able to exploit the new data structure must also
take care of modifying the *2() calls in accordance with their own
purposes.
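
As a usage illustration only (not part of the patch), a userspace program
could invoke the new syscall directly as sketched below. The sketch assumes
the x86-64 syscall number added by this patch (312), the sched_param2 layout
above, the SCHED_DEADLINE policy value (6) introduced later in the series,
and nanosecond time units; it is a sketch, not a reference implementation.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#define SCHED_DEADLINE			6	/* from a later patch */
#define __NR_sched_setscheduler2	312	/* x86-64, this patch */

struct sched_param2 {
	int sched_priority;
	unsigned int sched_flags;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint64_t __unused[12];
};

int main(void)
{
	struct sched_param2 p;

	memset(&p, 0, sizeof(p));
	p.sched_runtime  = 10ULL * 1000 * 1000;		/* 10ms of budget...	*/
	p.sched_deadline = 100ULL * 1000 * 1000;	/* ...every 100ms	*/

	/* pid 0 means the calling process. */
	if (syscall(__NR_sched_setscheduler2, 0, SCHED_DEADLINE, &p) < 0)
		perror("sched_setscheduler2");

	return 0;
}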

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
---
 arch/arm/include/asm/unistd.h      |    3 +
 arch/arm/kernel/calls.S            |    3 +
 arch/x86/ia32/ia32entry.S          |    3 +
 arch/x86/include/asm/unistd_32.h   |    5 +-
 arch/x86/include/asm/unistd_64.h   |    6 ++
 arch/x86/kernel/syscall_table_32.S |    3 +
 include/linux/sched.h              |   50 ++++++++++++++++
 include/linux/syscalls.h           |    7 ++
 kernel/sched.c                     |  110 +++++++++++++++++++++++++++++++++++-
 9 files changed, 186 insertions(+), 4 deletions(-)

diff --git a/arch/arm/include/asm/unistd.h b/arch/arm/include/asm/unistd.h
index 4a11237..66fb25b 100644
--- a/arch/arm/include/asm/unistd.h
+++ b/arch/arm/include/asm/unistd.h
@@ -404,6 +404,9 @@
 #define __NR_setns			(__NR_SYSCALL_BASE+375)
 #define __NR_process_vm_readv		(__NR_SYSCALL_BASE+376)
 #define __NR_process_vm_writev		(__NR_SYSCALL_BASE+377)
+#define __NR_sched_setscheduler2	(__NR_SYSCALL_BASE+378)
+#define __NR_sched_setparam2		(__NR_SYSCALL_BASE+379)
+#define __NR_sched_getparam2		(__NR_SYSCALL_BASE+380)
 
 /*
  * The following SWIs are ARM private.
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index 463ff4a..be174eb 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -387,6 +387,9 @@
 /* 375 */	CALL(sys_setns)
 		CALL(sys_process_vm_readv)
 		CALL(sys_process_vm_writev)
+		CALL(sys_sched_setscheduler2)
+		CALL(sys_sched_setparam2)
+/* 380 */	CALL(sys_sched_getparam2)
 #ifndef syscalls_counted
 .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
 #define syscalls_counted
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index a6253ec..7a502fd 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -852,4 +852,7 @@ ia32_sys_call_table:
 	.quad sys_setns
 	.quad compat_sys_process_vm_readv
 	.quad compat_sys_process_vm_writev
+	.quad sys_sched_setscheduler2
+	.quad sys_sched_setparam2	/* 350 */
+	.quad sys_sched_getparam2
 ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 599c77d..5658de0 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -354,10 +354,13 @@
 #define __NR_setns		346
 #define __NR_process_vm_readv	347
 #define __NR_process_vm_writev	348
+#define __NR_sched_setscheduler2	349
+#define __NR_sched_setparam2		350
+#define __NR_sched_getparam2		351
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 349
+#define NR_syscalls 352
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 0431f19..b7a4347 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -686,6 +686,12 @@ __SYSCALL(__NR_getcpu, sys_getcpu)
 __SYSCALL(__NR_process_vm_readv, sys_process_vm_readv)
 #define __NR_process_vm_writev			311
 __SYSCALL(__NR_process_vm_writev, sys_process_vm_writev)
+#define __NR_sched_setscheduler2		312
+__SYSCALL(__NR_sched_setscheduler2, sys_sched_setscheduler2)
+#define __NR_sched_setparam2			313
+__SYSCALL(__NR_sched_setparam2, sys_sched_setparam2)
+#define __NR_sched_getparam2			314
+__SYSCALL(__NR_sched_getparam2, sys_sched_getparam2)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 9a0e312..b9b37b5 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -348,3 +348,6 @@ ENTRY(sys_call_table)
 	.long sys_setns
 	.long sys_process_vm_readv
 	.long sys_process_vm_writev
+	.long sys_sched_setscheduler2
+	.long sys_sched_setparam2	/* 350 */
+	.long sys_sched_getparam2
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ddfc4dc..3e67d30 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -96,6 +96,54 @@ struct sched_param {
 
 #include <asm/processor.h>
 
+/*
+ * Extended scheduling parameters data structure.
+ *
+ * This is needed because the original struct sched_param can not be
+ * altered without introducing ABI issues with legacy applications
+ * (e.g., in sched_getparam()).
+ *
+ * However, the possibility of specifying more than just a priority for
+ * the tasks may be useful for a wide variety of application fields, e.g.,
+ * multimedia, streaming, automation and control, and many others.
+ *
+ * This variant (sched_param2) is meant to describe a so-called
+ * sporadic time-constrained task. In such a model a task is specified by:
+ *  - the activation period or minimum instance inter-arrival time;
+ *  - the maximum (or average, depending on the actual scheduling
+ *    discipline) computation time of all instances, a.k.a. runtime;
+ *  - the deadline (relative to the actual activation time) of each
+ *    instance.
+ * Very briefly, a periodic (sporadic) task asks for the execution of
+ * some specific computation --which is typically called an instance--
+ * (at most) every period. Moreover, each instance typically lasts no more
+ * than the runtime and must be completed by time instant t equal to
+ * the instance activation time + the deadline.
+ *
+ * This is reflected by the actual fields of the sched_param2 structure:
+ *
+ *  @sched_priority     task's priority (might still be useful)
+ *  @sched_deadline     representative of the task's deadline
+ *  @sched_runtime      representative of the task's runtime
+ *  @sched_period       representative of the task's period
+ *  @sched_flags        for customizing the scheduler behaviour
+ *
+ * Given this task model, there is a multiplicity of scheduling algorithms
+ * and policies that can be used to ensure all the tasks will meet their
+ * timing constraints.
+ *
+ * @__unused		padding to allow future expansion without ABI issues
+ */
+struct sched_param2 {
+	int sched_priority;
+	unsigned int sched_flags;
+	u64 sched_runtime;
+	u64 sched_deadline;
+	u64 sched_period;
+
+	u64 __unused[12];
+};
+
 struct exec_domain;
 struct futex_pi_state;
 struct robust_list_head;
@@ -2133,6 +2181,8 @@ extern int sched_setscheduler(struct task_struct *, int,
 			      const struct sched_param *);
 extern int sched_setscheduler_nocheck(struct task_struct *, int,
 				      const struct sched_param *);
+extern int sched_setscheduler2(struct task_struct *, int,
+				 const struct sched_param2 *);
 extern struct task_struct *idle_task(int cpu);
 extern struct task_struct *curr_task(int cpu);
 extern void set_curr_task(int cpu, struct task_struct *p);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 86a24b1..11449ae 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -38,6 +38,7 @@ struct rlimit;
 struct rlimit64;
 struct rusage;
 struct sched_param;
+struct sched_param2;
 struct sel_arg_struct;
 struct semaphore;
 struct sembuf;
@@ -327,11 +328,17 @@ asmlinkage long sys_clock_nanosleep(clockid_t which_clock, int flags,
 asmlinkage long sys_nice(int increment);
 asmlinkage long sys_sched_setscheduler(pid_t pid, int policy,
 					struct sched_param __user *param);
+asmlinkage long sys_sched_setscheduler2(pid_t pid, int policy,
+					struct sched_param2 __user *param);
 asmlinkage long sys_sched_setparam(pid_t pid,
 					struct sched_param __user *param);
+asmlinkage long sys_sched_setparam2(pid_t pid,
+					struct sched_param2 __user *param);
 asmlinkage long sys_sched_getscheduler(pid_t pid);
 asmlinkage long sys_sched_getparam(pid_t pid,
 					struct sched_param __user *param);
+asmlinkage long sys_sched_getparam2(pid_t pid,
+					struct sched_param2 __user *param);
 asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
 					unsigned long __user *user_mask_ptr);
 asmlinkage long sys_sched_getaffinity(pid_t pid, unsigned int len,
diff --git a/kernel/sched.c b/kernel/sched.c
index 4c8d1c3..eed5133 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5421,7 +5421,8 @@ static bool check_same_owner(struct task_struct *p)
 }
 
 static int __sched_setscheduler(struct task_struct *p, int policy,
-				const struct sched_param *param, bool user)
+				const struct sched_param2 *param,
+				bool user)
 {
 	int retval, oldprio, oldpolicy = -1, on_rq, running;
 	unsigned long flags;
@@ -5586,10 +5587,20 @@ recheck:
 int sched_setscheduler(struct task_struct *p, int policy,
 		       const struct sched_param *param)
 {
-	return __sched_setscheduler(p, policy, param, true);
+	struct sched_param2 param2 = {
+		.sched_priority = param->sched_priority
+	};
+	return __sched_setscheduler(p, policy, &param2, true);
 }
 EXPORT_SYMBOL_GPL(sched_setscheduler);
 
+int sched_setscheduler2(struct task_struct *p, int policy,
+			  const struct sched_param2 *param2)
+{
+	return __sched_setscheduler(p, policy, param2, true);
+}
+EXPORT_SYMBOL_GPL(sched_setscheduler2);
+
 /**
  * sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace.
  * @p: the task in question.
@@ -5604,7 +5615,10 @@ EXPORT_SYMBOL_GPL(sched_setscheduler);
 int sched_setscheduler_nocheck(struct task_struct *p, int policy,
 			       const struct sched_param *param)
 {
-	return __sched_setscheduler(p, policy, param, false);
+	struct sched_param2 param2 = {
+		.sched_priority = param->sched_priority
+	};
+	return __sched_setscheduler(p, policy, &param2, false);
 }
 
 static int
@@ -5629,6 +5643,31 @@ do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)
 	return retval;
 }
 
+static int
+do_sched_setscheduler2(pid_t pid, int policy,
+			 struct sched_param2 __user *param2)
+{
+	struct sched_param2 lparam2;
+	struct task_struct *p;
+	int retval;
+
+	if (!param2 || pid < 0)
+		return -EINVAL;
+
+	memset(&lparam2, 0, sizeof(struct sched_param2));
+	if (copy_from_user(&lparam2, param2, sizeof(struct sched_param2)))
+		return -EFAULT;
+
+	rcu_read_lock();
+	retval = -ESRCH;
+	p = find_process_by_pid(pid);
+	if (p != NULL)
+		retval = sched_setscheduler2(p, policy, &lparam2);
+	rcu_read_unlock();
+
+	return retval;
+}
+
 /**
  * sys_sched_setscheduler - set/change the scheduler policy and RT priority
  * @pid: the pid in question.
@@ -5646,6 +5685,21 @@ SYSCALL_DEFINE3(sched_setscheduler, pid_t, pid, int, policy,
 }
 
 /**
+ * sys_sched_setscheduler2 - same as above, but with extended sched_param
+ * @pid: the pid in question.
+ * @policy: new policy (could use extended sched_param).
+ * @param2: structure containing the extended parameters.
+ */
+SYSCALL_DEFINE3(sched_setscheduler2, pid_t, pid, int, policy,
+		struct sched_param2 __user *, param2)
+{
+	if (policy < 0)
+		return -EINVAL;
+
+	return do_sched_setscheduler2(pid, policy, param2);
+}
+
+/**
  * sys_sched_setparam - set/change the RT priority of a thread
  * @pid: the pid in question.
  * @param: structure containing the new RT priority.
@@ -5656,6 +5710,17 @@ SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
 }
 
 /**
+ * sys_sched_setparam2 - same as above, but with extended sched_param
+ * @pid: the pid in question.
+ * @param2: structure containing the extended parameters.
+ */
+SYSCALL_DEFINE2(sched_setparam2, pid_t, pid,
+		struct sched_param2 __user *, param2)
+{
+	return do_sched_setscheduler2(pid, -1, param2);
+}
+
+/**
  * sys_sched_getscheduler - get the policy (scheduling class) of a thread
  * @pid: the pid in question.
  */
@@ -5719,6 +5784,45 @@ out_unlock:
 	return retval;
 }
 
+/**
+ * sys_sched_getparam2 - same as above, but with extended sched_param
+ * @pid: the pid in question.
+ * @param2: structure containing the extended parameters.
+ */
+SYSCALL_DEFINE2(sched_getparam2, pid_t, pid,
+		struct sched_param2 __user *, param2)
+{
+	struct sched_param2 lp;
+	struct task_struct *p;
+	int retval;
+
+	if (!param2 || pid < 0)
+		return -EINVAL;
+
+	rcu_read_lock();
+	p = find_process_by_pid(pid);
+	retval = -ESRCH;
+	if (!p)
+		goto out_unlock;
+
+	retval = security_task_getscheduler(p);
+	if (retval)
+		goto out_unlock;
+
+	lp.sched_priority = p->rt_priority;
+	rcu_read_unlock();
+
+	retval = copy_to_user(param2, &lp,
+			sizeof(struct sched_param2)) ? -EFAULT : 0;
+
+	return retval;
+
+out_unlock:
+	rcu_read_unlock();
+	return retval;
+
+}
+
 long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 {
 	cpumask_var_t cpus_allowed, new_mask;
-- 
1.7.5.4



* [PATCH 03/16] sched: SCHED_DEADLINE data structures.
  2012-04-06  7:14 [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4 Juri Lelli
  2012-04-06  7:14 ` [PATCH 01/16] sched: add sched_class->task_dead Juri Lelli
  2012-04-06  7:14 ` [PATCH 02/16] sched: add extended scheduling interface Juri Lelli
@ 2012-04-06  7:14 ` Juri Lelli
  2012-04-23  9:08   ` Peter Zijlstra
                     ` (3 more replies)
  2012-04-06  7:14 ` [PATCH 04/16] sched: SCHED_DEADLINE SMP-related " Juri Lelli
                   ` (16 subsequent siblings)
  19 siblings, 4 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-06  7:14 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, cfriesen, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang

From: Dario Faggioli <raistlin@linux.it>

Introduce the data structures, constants and symbols needed for the
SCHED_DEADLINE implementation.

Core data structures of SCHED_DEADLINE are defined, along with their
initializers. Hooks for checking if a task belongs to the new policy
are also added where they are needed.
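
For orientation, the resulting priority space looks as follows (illustrative
summary derived from the hunks below, assuming MAX_RT_PRIO == 100 as in
mainline):

	prio <  MAX_DL_PRIO (0)           -deadline tasks  (dl_prio() true)
	0   <= prio < MAX_RT_PRIO (100)   RT tasks         (rt_prio() true)
	100 <= prio < MAX_PRIO (140)      NORMAL/BATCH/IDLE tasks

A task with SCHED_DEADLINE policy gets normal_prio = MAX_DL_PRIO-1 = -1, so
it is picked ahead of every RT and fair task.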

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
---
 include/linux/sched.h |   68 +++++++++++++++++++++++++++++++++++++++++++++-
 kernel/hrtimer.c      |    2 +-
 kernel/sched.c        |   73 +++++++++++++++++++++++++++++++++++++++++++------
 3 files changed, 132 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3e67d30..a7a4276 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -39,6 +39,7 @@
 #define SCHED_BATCH		3
 /* SCHED_ISO: reserved but not implemented yet */
 #define SCHED_IDLE		5
+#define SCHED_DEADLINE		6
 /* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
 #define SCHED_RESET_ON_FORK     0x40000000
 
@@ -133,6 +134,10 @@ struct sched_param {
  * timing constraints.
  *
  * @__unused		padding to allow future expansion without ABI issues
+ *
+ * As of now, the SCHED_DEADLINE policy (sched_dl scheduling class) is the
+ * only user of this new interface. More information about the algorithm is
+ * available in the scheduling class file or in Documentation/.
  */
 struct sched_param2 {
 	int sched_priority;
@@ -1130,6 +1135,7 @@ struct sched_domain;
 #else
 #define ENQUEUE_WAKING		0
 #endif
+#define ENQUEUE_REPLENISH	8
 
 #define DEQUEUE_SLEEP		1
 
@@ -1261,6 +1267,47 @@ struct sched_rt_entity {
 #endif
 };
 
+struct sched_dl_entity {
+	struct rb_node	rb_node;
+	int nr_cpus_allowed;
+
+	/*
+	 * Original scheduling parameters. Copied here from sched_param2
+	 * during sched_setscheduler2(), they will remain the same until
+	 * the next sched_setscheduler2().
+	 */
+	u64 dl_runtime;		/* maximum runtime for each instance	*/
+	u64 dl_deadline;	/* relative deadline of each instance	*/
+
+	/*
+	 * Actual scheduling parameters. Initialized with the values above,
+	 * they are continuously updated during task execution. Note that
+	 * the remaining runtime could be < 0 in case we are in overrun.
+	 */
+	s64 runtime;		/* remaining runtime for this instance	*/
+	u64 deadline;		/* absolute deadline for this instance	*/
+	unsigned int flags;	/* specifying the scheduler behaviour	*/
+
+	/*
+	 * Some bool flags:
+	 *
+	 * @dl_throttled tells if we exhausted the runtime. If so, the
+	 * task has to wait for a replenishment to be performed at the
+	 * next firing of dl_timer.
+	 *
+	 * @dl_new tells if a new instance arrived. If so we must
+	 * start executing it with full runtime and reset its absolute
+	 * deadline;
+	 */
+	int dl_throttled, dl_new;
+
+	/*
+	 * Bandwidth enforcement timer. Each -deadline task has its
+	 * own bandwidth to be enforced, thus we need one timer per task.
+	 */
+	struct hrtimer dl_timer;
+};
+
 struct rcu_node;
 
 enum perf_event_task_context {
@@ -1289,6 +1336,7 @@ struct task_struct {
 	const struct sched_class *sched_class;
 	struct sched_entity se;
 	struct sched_rt_entity rt;
+	struct sched_dl_entity dl;
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	/* list of struct preempt_notifier: */
@@ -1681,6 +1729,10 @@ static inline bool pagefault_disabled(void)
  * user-space.  This allows kernel threads to set their
  * priority to a value higher than any user task. Note:
  * MAX_RT_PRIO must not be smaller than MAX_USER_RT_PRIO.
+ *
+ * SCHED_DEADLINE tasks have negative priorities, reflecting
+ * the fact that any of them has higher prio than RT and
+ * NORMAL/BATCH tasks.
  */
 
 #define MAX_USER_RT_PRIO	100
@@ -1689,9 +1741,23 @@ static inline bool pagefault_disabled(void)
 #define MAX_PRIO		(MAX_RT_PRIO + 40)
 #define DEFAULT_PRIO		(MAX_RT_PRIO + 20)
 
+#define MAX_DL_PRIO		0
+
+static inline int dl_prio(int prio)
+{
+	if (unlikely(prio < MAX_DL_PRIO))
+		return 1;
+	return 0;
+}
+
+static inline int dl_task(struct task_struct *p)
+{
+	return dl_prio(p->prio);
+}
+
 static inline int rt_prio(int prio)
 {
-	if (unlikely(prio < MAX_RT_PRIO))
+	if (unlikely(prio >= MAX_DL_PRIO && prio < MAX_RT_PRIO))
 		return 1;
 	return 0;
 }
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 3991464..246842e 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1764,7 +1764,7 @@ long hrtimer_nanosleep(struct timespec *rqtp, struct timespec __user *rmtp,
 	unsigned long slack;
 
 	slack = current->timer_slack_ns;
-	if (rt_task(current))
+	if (dl_task(current) || rt_task(current))
 		slack = 0;
 
 	hrtimer_init_on_stack(&t.timer, clockid, mode);
diff --git a/kernel/sched.c b/kernel/sched.c
index eed5133..ea67240 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -133,11 +133,23 @@ static inline int rt_policy(int policy)
 	return 0;
 }
 
+static inline int dl_policy(int policy)
+{
+	if (unlikely(policy == SCHED_DEADLINE))
+		return 1;
+	return 0;
+}
+
 static inline int task_has_rt_policy(struct task_struct *p)
 {
 	return rt_policy(p->policy);
 }
 
+static inline int task_has_dl_policy(struct task_struct *p)
+{
+	return dl_policy(p->policy);
+}
+
 /*
  * This is the priority-queue data structure of the RT scheduling class:
  */
@@ -549,6 +561,15 @@ struct rt_rq {
 #endif
 };
 
+/* Deadline class' related fields in a runqueue */
+struct dl_rq {
+	/* runqueue is an rbtree, ordered by deadline */
+	struct rb_root rb_root;
+	struct rb_node *rb_leftmost;
+
+	unsigned long dl_nr_running;
+};
+
 #ifdef CONFIG_SMP
 
 /*
@@ -614,6 +635,7 @@ struct rq {
 
 	struct cfs_rq cfs;
 	struct rt_rq rt;
+	struct dl_rq dl;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this cpu: */
@@ -1911,6 +1933,7 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 }
 
 static const struct sched_class rt_sched_class;
+static const struct sched_class dl_sched_class;
 
 #define sched_class_highest (&stop_sched_class)
 #define for_each_class(class) \
@@ -2257,7 +2280,9 @@ static inline int normal_prio(struct task_struct *p)
 {
 	int prio;
 
-	if (task_has_rt_policy(p))
+	if (task_has_dl_policy(p))
+		prio = MAX_DL_PRIO-1;
+	else if (task_has_rt_policy(p))
 		prio = MAX_RT_PRIO-1 - p->rt_priority;
 	else
 		prio = __normal_prio(p);
@@ -2965,6 +2990,12 @@ static void __sched_fork(struct task_struct *p)
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
 #endif
 
+	RB_CLEAR_NODE(&p->dl.rb_node);
+	hrtimer_init(&p->dl.dl_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	p->dl.dl_runtime = p->dl.runtime = 0;
+	p->dl.dl_deadline = p->dl.deadline = 0;
+	p->dl.flags = 0;
+
 	INIT_LIST_HEAD(&p->rt.run_list);
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
@@ -2997,7 +3028,7 @@ void sched_fork(struct task_struct *p)
 	 * Revert to default priority/policy on fork if requested.
 	 */
 	if (unlikely(p->sched_reset_on_fork)) {
-		if (task_has_rt_policy(p)) {
+		if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
 			p->policy = SCHED_NORMAL;
 			p->static_prio = NICE_TO_PRIO(0);
 			p->rt_priority = 0;
@@ -5202,7 +5233,9 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 	if (running)
 		p->sched_class->put_prev_task(rq, p);
 
-	if (rt_prio(prio))
+	if (dl_prio(prio))
+		p->sched_class = &dl_sched_class;
+	else if (rt_prio(prio))
 		p->sched_class = &rt_sched_class;
 	else
 		p->sched_class = &fair_sched_class;
@@ -5236,9 +5269,9 @@ void set_user_nice(struct task_struct *p, long nice)
 	 * The RT priorities are set via sched_setscheduler(), but we still
 	 * allow the 'normal' nice value to be set - but as expected
 	 * it wont have any effect on scheduling until the task is
-	 * SCHED_FIFO/SCHED_RR:
+	 * SCHED_DEADLINE, SCHED_FIFO or SCHED_RR:
 	 */
-	if (task_has_rt_policy(p)) {
+	if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
 		p->static_prio = NICE_TO_PRIO(nice);
 		goto out_unlock;
 	}
@@ -5394,7 +5427,9 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
 	p->normal_prio = normal_prio(p);
 	/* we are holding p->pi_lock already */
 	p->prio = rt_mutex_getprio(p);
-	if (rt_prio(p->prio))
+	if (dl_prio(p->prio))
+		p->sched_class = &dl_sched_class;
+	else if (rt_prio(p->prio))
 		p->sched_class = &rt_sched_class;
 	else
 		p->sched_class = &fair_sched_class;
@@ -5402,6 +5437,18 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
 }
 
 /*
+ * This function validates the new parameters of a -deadline task.
+ * We ask for the deadline not being zero, and greater or equal
+ * than the runtime.
+ */
+static bool
+__checkparam_dl(const struct sched_param2 *prm)
+{
+	return prm && prm->sched_deadline != 0 &&
+	       (s64)(prm->sched_deadline - prm->sched_runtime) >= 0;
+}
+
+/*
  * check the target process has a UID that matches the current process's
  */
 static bool check_same_owner(struct task_struct *p)
@@ -5441,7 +5488,8 @@ recheck:
 		reset_on_fork = !!(policy & SCHED_RESET_ON_FORK);
 		policy &= ~SCHED_RESET_ON_FORK;
 
-		if (policy != SCHED_FIFO && policy != SCHED_RR &&
+		if (policy != SCHED_DEADLINE &&
+				policy != SCHED_FIFO && policy != SCHED_RR &&
 				policy != SCHED_NORMAL && policy != SCHED_BATCH &&
 				policy != SCHED_IDLE)
 			return -EINVAL;
@@ -5456,7 +5504,8 @@ recheck:
 	    (p->mm && param->sched_priority > MAX_USER_RT_PRIO-1) ||
 	    (!p->mm && param->sched_priority > MAX_RT_PRIO-1))
 		return -EINVAL;
-	if (rt_policy(policy) != (param->sched_priority != 0))
+	if ((dl_policy(policy) && !__checkparam_dl(param)) ||
+	    (rt_policy(policy) != (param->sched_priority != 0)))
 		return -EINVAL;
 
 	/*
@@ -8437,6 +8486,11 @@ static void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
 	raw_spin_lock_init(&rt_rq->rt_runtime_lock);
 }
 
+static void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
+{
+	dl_rq->rb_root = RB_ROOT;
+}
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 				struct sched_entity *se, int cpu,
@@ -8568,6 +8622,7 @@ void __init sched_init(void)
 		rq->calc_load_update = jiffies + LOAD_FREQ;
 		init_cfs_rq(&rq->cfs);
 		init_rt_rq(&rq->rt, rq);
+		init_dl_rq(&rq->dl, rq);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 		root_task_group.shares = root_task_group_load;
 		INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
@@ -8755,7 +8810,7 @@ void normalize_rt_tasks(void)
 		p->se.statistics.block_start	= 0;
 #endif
 
-		if (!rt_task(p)) {
+		if (!dl_task(p) && !rt_task(p)) {
 			/*
 			 * Renice negative nice level userspace
 			 * tasks back to 0:
-- 
1.7.5.4



* [PATCH 04/16] sched: SCHED_DEADLINE SMP-related data structures.
  2012-04-06  7:14 [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4 Juri Lelli
                   ` (2 preceding siblings ...)
  2012-04-06  7:14 ` [PATCH 03/16] sched: SCHED_DEADLINE data structures Juri Lelli
@ 2012-04-06  7:14 ` Juri Lelli
  2012-04-06  7:14 ` [PATCH 05/16] sched: SCHED_DEADLINE policy implementation Juri Lelli
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-06  7:14 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, cfriesen, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang

Introduce data structures relevant for implementing dynamic
migration of -deadline tasks.

Mainly, this is the logic for checking if runqueues are
overloaded with -deadline tasks and for choosing where
a task should migrate when that is the case.

Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 include/linux/sched.h |    1 +
 kernel/sched.c        |   50 ++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 50 insertions(+), 1 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a7a4276..6eb72b6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1384,6 +1384,7 @@ struct task_struct {
 	struct list_head tasks;
 #ifdef CONFIG_SMP
 	struct plist_node pushable_tasks;
+	struct rb_node pushable_dl_tasks;
 #endif
 
 	struct mm_struct *mm, *active_mm;
diff --git a/kernel/sched.c b/kernel/sched.c
index ea67240..fd23c67 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -568,6 +568,31 @@ struct dl_rq {
 	struct rb_node *rb_leftmost;
 
 	unsigned long dl_nr_running;
+
+#ifdef CONFIG_SMP
+	/*
+	 * Deadline values of the currently executing and the
+	 * earliest ready task on this rq. Caching these facilitates
+	 * the decision whether or not a ready but not running task
+	 * should migrate somewhere else.
+	 */
+	struct {
+		u64 curr;
+		u64 next;
+	} earliest_dl;
+
+	unsigned long dl_nr_migratory;
+	unsigned long dl_nr_total;
+	int overloaded;
+
+	/*
+	 * Tasks on this rq that can be pushed away. They are kept in
+	 * an rb-tree, ordered by tasks' deadlines, with caching
+	 * of the leftmost (earliest deadline) element.
+	 */
+	struct rb_root pushable_dl_tasks_root;
+	struct rb_node *pushable_dl_tasks_leftmost;
+#endif
 };
 
 #ifdef CONFIG_SMP
@@ -588,6 +613,13 @@ struct root_domain {
 	cpumask_var_t online;
 
 	/*
+	 * The bit corresponding to a CPU gets set here if such CPU has more
+	 * than one runnable -deadline task (as it is below for RT tasks).
+	 */
+	cpumask_var_t dlo_mask;
+	atomic_t dlo_count;
+
+	/*
 	 * The "RT overload" flag: it gets set if a CPU has more than
 	 * one runnable RT task.
 	 */
@@ -3075,6 +3107,7 @@ void sched_fork(struct task_struct *p)
 #endif
 #ifdef CONFIG_SMP
 	plist_node_init(&p->pushable_tasks, MAX_PRIO);
+	RB_CLEAR_NODE(&p->pushable_dl_tasks);
 #endif
 
 	put_cpu();
@@ -6513,6 +6546,7 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 		if (p->sched_class && p->sched_class->set_cpus_allowed)
 			p->sched_class->set_cpus_allowed(p, new_mask);
 		p->rt.nr_cpus_allowed = cpumask_weight(new_mask);
+		p->dl.nr_cpus_allowed = cpumask_weight(new_mask);
 	}
 	cpumask_copy(&p->cpus_allowed, new_mask);
 }
@@ -7267,6 +7301,7 @@ static void free_rootdomain(struct rcu_head *rcu)
 	struct root_domain *rd = container_of(rcu, struct root_domain, rcu);
 
 	cpupri_cleanup(&rd->cpupri);
+	free_cpumask_var(rd->dlo_mask);
 	free_cpumask_var(rd->rto_mask);
 	free_cpumask_var(rd->online);
 	free_cpumask_var(rd->span);
@@ -7318,8 +7353,10 @@ static int init_rootdomain(struct root_domain *rd)
 		goto out;
 	if (!alloc_cpumask_var(&rd->online, GFP_KERNEL))
 		goto free_span;
-	if (!alloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
+	if (!alloc_cpumask_var(&rd->dlo_mask, GFP_KERNEL))
 		goto free_online;
+	if (!alloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
+		goto free_dlo_mask;
 
 	if (cpupri_init(&rd->cpupri) != 0)
 		goto free_rto_mask;
@@ -7327,6 +7364,8 @@ static int init_rootdomain(struct root_domain *rd)
 
 free_rto_mask:
 	free_cpumask_var(rd->rto_mask);
+free_dlo_mask:
+	free_cpumask_var(rd->dlo_mask);
 free_online:
 	free_cpumask_var(rd->online);
 free_span:
@@ -8489,6 +8528,15 @@ static void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
 static void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
 {
 	dl_rq->rb_root = RB_ROOT;
+
+#ifdef CONFIG_SMP
+	/* zero means no -deadline tasks */
+	dl_rq->earliest_dl.curr = dl_rq->earliest_dl.next = 0;
+
+	dl_rq->dl_nr_migratory = 0;
+	dl_rq->overloaded = 0;
+	dl_rq->pushable_dl_tasks_root = RB_ROOT;
+#endif
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-- 
1.7.5.4



* [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-06  7:14 [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4 Juri Lelli
                   ` (3 preceding siblings ...)
  2012-04-06  7:14 ` [PATCH 04/16] sched: SCHED_DEADLINE SMP-related " Juri Lelli
@ 2012-04-06  7:14 ` Juri Lelli
  2012-04-11  3:06   ` Steven Rostedt
                     ` (10 more replies)
  2012-04-06  7:14 ` [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic Juri Lelli
                   ` (14 subsequent siblings)
  19 siblings, 11 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-06  7:14 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, cfriesen, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang

From: Dario Faggioli <raistlin@linux.it>

Add a scheduling class, in sched_dl.c, and a new policy called
SCHED_DEADLINE. It is an implementation of the Earliest Deadline
First (EDF) scheduling algorithm, augmented with a mechanism (called
Constant Bandwidth Server, CBS) that makes it possible to isolate
the behaviour of tasks from each other.

The typical -deadline task is made up of a computation phase
(instance) which is activated in a periodic or sporadic fashion. The
expected (maximum) duration of such a computation is called the task's
runtime; the time interval within which each instance needs to be
completed is called the task's relative deadline. The task's absolute
deadline is dynamically calculated as the time instant at which a task
(better, an instance) activates plus the relative deadline.

The EDF algorithm selects the task with the smallest absolute
deadline as the one to be executed first, while the CBS ensures that
each task runs for at most its runtime every (relative) deadline
length time interval, avoiding any interference between different
tasks (bandwidth isolation).
Thanks to this feature, even tasks that do not strictly comply with
the computational model sketched above can effectively use the new
policy.

This patch:
 - implements the core logic of the scheduling algorithm in the new
   scheduling class file;
 - provides all the glue code between the new scheduling class and
   the core scheduler and refines the interactions between sched_dl
   and the other existing scheduling classes.
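
To make the CBS replenishment rule concrete, here is a small, illustrative
userspace sketch (the toy_* names are made up; it only mirrors the arithmetic
of replenish_dl_entity() below): a task with a 10ms runtime and a 30ms
relative deadline that is caught 5ms in overrun gets its deadline pushed 30ms
further away and 10ms of fresh budget, ending up with 5ms left.

#include <stdint.h>
#include <stdio.h>

struct toy_dl_se {
	int64_t  runtime;	/* remaining runtime, may go negative */
	uint64_t deadline;	/* absolute deadline */
};

/* Mirrors the replenishment loop of replenish_dl_entity(). */
static void toy_replenish(struct toy_dl_se *se,
			  uint64_t dl_runtime, uint64_t dl_deadline)
{
	while (se->runtime <= 0) {
		se->deadline += dl_deadline;
		se->runtime  += dl_runtime;
	}
}

int main(void)
{
	/* 10ms budget every 30ms; currently 5ms in overrun. */
	struct toy_dl_se se = { .runtime = -5000000LL, .deadline = 100000000ULL };

	toy_replenish(&se, 10000000ULL, 30000000ULL);
	printf("runtime=%lldns deadline=%lluns\n",
	       (long long)se.runtime, (unsigned long long)se.deadline);
	/* prints: runtime=5000000ns deadline=130000000ns */
	return 0;
}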

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
---
 include/linux/sched.h   |    2 +-
 kernel/fork.c           |    4 +-
 kernel/sched.c          |   67 +++++-
 kernel/sched_dl.c       |  655 +++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_rt.c       |    1 +
 kernel/sched_stoptask.c |    2 +-
 6 files changed, 719 insertions(+), 12 deletions(-)
 create mode 100644 kernel/sched_dl.c

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6eb72b6..416ce99 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2323,7 +2323,7 @@ extern void wake_up_new_task(struct task_struct *tsk);
 #else
  static inline void kick_process(struct task_struct *tsk) { }
 #endif
-extern void sched_fork(struct task_struct *p);
+extern int sched_fork(struct task_struct *p);
 extern void sched_dead(struct task_struct *p);
 
 extern void proc_caches_init(void);
diff --git a/kernel/fork.c b/kernel/fork.c
index e3db0cb..b263c69 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1241,7 +1241,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 #endif
 
 	/* Perform scheduler related setup. Assign this task to a CPU. */
-	sched_fork(p);
+	retval = sched_fork(p);
+	if (retval)
+		goto bad_fork_cleanup_policy;
 
 	retval = perf_event_init_task(p);
 	if (retval)
diff --git a/kernel/sched.c b/kernel/sched.c
index fd23c67..1a38ad1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1964,9 +1964,6 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 #endif
 }
 
-static const struct sched_class rt_sched_class;
-static const struct sched_class dl_sched_class;
-
 #define sched_class_highest (&stop_sched_class)
 #define for_each_class(class) \
    for (class = sched_class_highest; class; class = class->next)
@@ -2257,6 +2254,7 @@ static int irqtime_account_si_update(void)
 #include "sched_idletask.c"
 #include "sched_fair.c"
 #include "sched_rt.c"
+#include "sched_dl.c"
 #include "sched_autogroup.c"
 #include "sched_stoptask.c"
 #ifdef CONFIG_SCHED_DEBUG
@@ -3038,7 +3036,7 @@ static void __sched_fork(struct task_struct *p)
 /*
  * fork()/clone()-time setup:
  */
-void sched_fork(struct task_struct *p)
+int sched_fork(struct task_struct *p)
 {
 	unsigned long flags;
 	int cpu = get_cpu();
@@ -3077,8 +3075,14 @@ void sched_fork(struct task_struct *p)
 		p->sched_reset_on_fork = 0;
 	}
 
-	if (!rt_prio(p->prio))
+	if (dl_prio(p->prio)) {
+		put_cpu();
+		return -EAGAIN;
+	} else if (rt_prio(p->prio)) {
+		p->sched_class = &rt_sched_class;
+	} else {
 		p->sched_class = &fair_sched_class;
+	}
 
 	if (p->sched_class->task_fork)
 		p->sched_class->task_fork(p);
@@ -3111,6 +3115,7 @@ void sched_fork(struct task_struct *p)
 #endif
 
 	put_cpu();
+	return 0;
 }
 
 /*
@@ -5234,7 +5239,7 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 	struct rq *rq;
 	const struct sched_class *prev_class;
 
-	BUG_ON(prio < 0 || prio > MAX_PRIO);
+	BUG_ON(prio > MAX_PRIO);
 
 	rq = __task_rq_lock(p);
 
@@ -5470,6 +5475,38 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
 }
 
 /*
+ * This function initializes the sched_dl_entity of a newly becoming
+ * SCHED_DEADLINE task.
+ *
+ * Only the static values are considered here, the actual runtime and the
+ * absolute deadline will be properly calculated when the task is enqueued
+ * for the first time with its new policy.
+ */
+static void
+__setparam_dl(struct task_struct *p, const struct sched_param2 *param2)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+
+	init_dl_task_timer(dl_se);
+	dl_se->dl_runtime = param2->sched_runtime;
+	dl_se->dl_deadline = param2->sched_deadline;
+	dl_se->flags = param2->sched_flags;
+	dl_se->dl_throttled = 0;
+	dl_se->dl_new = 1;
+}
+
+static void
+__getparam_dl(struct task_struct *p, struct sched_param2 *param2)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+
+	param2->sched_priority = p->rt_priority;
+	param2->sched_runtime = dl_se->dl_runtime;
+	param2->sched_deadline = dl_se->dl_deadline;
+	param2->sched_flags = dl_se->flags;
+}
+
+/*
  * This function validates the new parameters of a -deadline task.
  * We ask for the deadline not being zero, and greater or equal
  * than the runtime.
@@ -5643,7 +5680,11 @@ recheck:
 
 	oldprio = p->prio;
 	prev_class = p->sched_class;
-	__setscheduler(rq, p, policy, param->sched_priority);
+	if (dl_policy(policy)) {
+		__setparam_dl(p, param);
+		__setscheduler(rq, p, policy, param->sched_priority);
+	} else
+		__setscheduler(rq, p, policy, param->sched_priority);
 
 	if (running)
 		p->sched_class->set_curr_task(rq);
@@ -5743,8 +5784,11 @@ do_sched_setscheduler2(pid_t pid, int policy,
 	rcu_read_lock();
 	retval = -ESRCH;
 	p = find_process_by_pid(pid);
-	if (p != NULL)
+	if (p != NULL) {
+		if (dl_policy(policy))
+			lparam2.sched_priority = 0;
 		retval = sched_setscheduler2(p, policy, &lparam2);
+	}
 	rcu_read_unlock();
 
 	return retval;
@@ -5891,7 +5935,10 @@ SYSCALL_DEFINE2(sched_getparam2, pid_t, pid,
 	if (retval)
 		goto out_unlock;
 
-	lp.sched_priority = p->rt_priority;
+	if (task_has_dl_policy(p))
+		__getparam_dl(p, &lp);
+	else
+		lp.sched_priority = p->rt_priority;
 	rcu_read_unlock();
 
 	retval = copy_to_user(param2, &lp,
@@ -6290,6 +6337,7 @@ SYSCALL_DEFINE1(sched_get_priority_max, int, policy)
 	case SCHED_RR:
 		ret = MAX_USER_RT_PRIO-1;
 		break;
+	case SCHED_DEADLINE:
 	case SCHED_NORMAL:
 	case SCHED_BATCH:
 	case SCHED_IDLE:
@@ -6315,6 +6363,7 @@ SYSCALL_DEFINE1(sched_get_priority_min, int, policy)
 	case SCHED_RR:
 		ret = 1;
 		break;
+	case SCHED_DEADLINE:
 	case SCHED_NORMAL:
 	case SCHED_BATCH:
 	case SCHED_IDLE:
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
new file mode 100644
index 0000000..604e2bc
--- /dev/null
+++ b/kernel/sched_dl.c
@@ -0,0 +1,655 @@
+/*
+ * Deadline Scheduling Class (SCHED_DEADLINE)
+ *
+ * Earliest Deadline First (EDF) + Constant Bandwidth Server (CBS).
+ *
+ * Tasks that periodically execute their instances for less than their
+ * runtime won't miss any of their deadlines.
+ * Tasks that are not periodic or sporadic or that try to execute more
+ * than their reserved bandwidth will be slowed down (and may potentially
+ * miss some of their deadlines), and won't affect any other task.
+ *
+ * Copyright (C) 2010 Dario Faggioli <raistlin@linux.it>,
+ *                    Michael Trimarchi <michael@amarulasolutions.com>,
+ *                    Fabio Checconi <fabio@gandalf.sssup.it>
+ */
+static const struct sched_class dl_sched_class;
+
+static inline int dl_time_before(u64 a, u64 b)
+{
+	return (s64)(a - b) < 0;
+}
+
+static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
+{
+	return container_of(dl_se, struct task_struct, dl);
+}
+
+static inline struct rq *rq_of_dl_rq(struct dl_rq *dl_rq)
+{
+	return container_of(dl_rq, struct rq, dl);
+}
+
+static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
+{
+	struct task_struct *p = dl_task_of(dl_se);
+	struct rq *rq = task_rq(p);
+
+	return &rq->dl;
+}
+
+static inline int on_dl_rq(struct sched_dl_entity *dl_se)
+{
+	return !RB_EMPTY_NODE(&dl_se->rb_node);
+}
+
+static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+
+	return dl_rq->rb_leftmost == &dl_se->rb_node;
+}
+
+static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
+static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
+static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
+				  int flags);
+
+/*
+ * We are being explicitly informed that a new instance is starting,
+ * and this means that:
+ *  - the absolute deadline of the entity has to be placed at
+ *    current time + relative deadline;
+ *  - the runtime of the entity has to be set to the maximum value.
+ *
+ * The capability of specifying such event is useful whenever a -deadline
+ * entity wants to (try to!) synchronize its behaviour with the scheduler's
+ * one, and to (try to!) reconcile itself with its own scheduling
+ * parameters.
+ */
+static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+
+	WARN_ON(!dl_se->dl_new || dl_se->dl_throttled);
+
+	dl_se->deadline = rq->clock + dl_se->dl_deadline;
+	dl_se->runtime = dl_se->dl_runtime;
+	dl_se->dl_new = 0;
+}
+
+/*
+ * Pure Earliest Deadline First (EDF) scheduling does not deal with the
+ * possibility of an entity lasting more than what it declared, and thus
+ * exhausting its runtime.
+ *
+ * Here we are interested in making runtime overrun possible, but we do
+ * not want an entity which is misbehaving to affect the scheduling of all
+ * other entities.
+ * Therefore, a budgeting strategy called Constant Bandwidth Server (CBS)
+ * is used, in order to confine each entity within its own bandwidth.
+ *
+ * This function deals exactly with that, and ensures that when the runtime
+ * of an entity is replenished, its deadline is also postponed. That ensures
+ * the overrunning entity can't interfere with other entities in the system and
+ * can't make them miss their deadlines. Reasons why this kind of overrun
+ * could happen are, typically, an entity voluntarily trying to overcome its
+ * runtime, or having just underestimated it during sched_setscheduler2().
+ */
+static void replenish_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+
+	/*
+	 * We keep moving the deadline away until we get some
+	 * available runtime for the entity. This ensures correct
+	 * handling of situations where the runtime overrun is
+	 * arbitrarily large.
+	 */
+	while (dl_se->runtime <= 0) {
+		dl_se->deadline += dl_se->dl_deadline;
+		dl_se->runtime += dl_se->dl_runtime;
+	}
+
+	/*
+	 * At this point, the deadline really should be "in
+	 * the future" with respect to rq->clock. If it's
+	 * not, we are, for some reason, lagging too much!
+	 * Anyway, after having warned userspace about that,
+	 * we still try to keep things running by
+	 * resetting the deadline and the budget of the
+	 * entity.
+	 */
+	if (dl_time_before(dl_se->deadline, rq->clock)) {
+		WARN_ON_ONCE(1);
+		dl_se->deadline = rq->clock + dl_se->dl_deadline;
+		dl_se->runtime = dl_se->dl_runtime;
+	}
+}
+
+/*
+ * Here we check if --at time t-- an entity (which is probably being
+ * [re]activated or, in general, enqueued) can use its remaining runtime
+ * and its current deadline _without_ exceeding the bandwidth it is
+ * assigned (function returns true if it can).
+ *
+ * For this to hold, we must check if:
+ *   runtime / (deadline - t) < dl_runtime / dl_deadline .
+ */
+static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
+{
+	u64 left, right;
+
+	/*
+	 * left and right are the two sides of the equation above,
+	 * after a bit of shuffling to use multiplications instead
+	 * of divisions.
+	 *
+	 * Note that none of the time values involved in the two
+	 * multiplications are absolute: dl_deadline and dl_runtime
+	 * are the relative deadline and the maximum runtime of each
+	 * instance, runtime is the runtime left for the last instance
+	 * and (deadline - t), since t is rq->clock, is the time left
+	 * to the (absolute) deadline. Therefore, overflowing the u64
+	 * type is very unlikely to occur in both cases.
+	 */
+	left = dl_se->dl_deadline * dl_se->runtime;
+	right = (dl_se->deadline - t) * dl_se->dl_runtime;
+
+	return dl_time_before(right, left);
+}
+
+/*
+ * When a -deadline entity is queued back on the runqueue, its runtime and
+ * deadline might need updating.
+ *
+ * The policy here is that we update the deadline of the entity only if:
+ *  - the current deadline is in the past,
+ *  - using the remaining runtime with the current deadline would make
+ *    the entity exceed its bandwidth.
+ */
+static void update_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+
+	/*
+	 * The arrival of a new instance needs special treatment, i.e.,
+	 * the actual scheduling parameters have to be "renewed".
+	 */
+	if (dl_se->dl_new) {
+		setup_new_dl_entity(dl_se);
+		return;
+	}
+
+	if (dl_time_before(dl_se->deadline, rq->clock) ||
+	    dl_entity_overflow(dl_se, rq->clock)) {
+		dl_se->deadline = rq->clock + dl_se->dl_deadline;
+		dl_se->runtime = dl_se->dl_runtime;
+	}
+}
+
+/*
+ * If the entity depleted all its runtime, and if we want it to sleep
+ * while waiting for some new execution time to become available, we
+ * set the bandwidth enforcement timer to the replenishment instant
+ * and try to activate it.
+ *
+ * Notice that it is important for the caller to know if the timer
+ * actually started or not (i.e., the replenishment instant is in
+ * the future or in the past).
+ */
+static int start_dl_timer(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+	ktime_t now, act;
+	ktime_t soft, hard;
+	unsigned long range;
+	s64 delta;
+
+	/*
+	 * We want the timer to fire at the deadline, but considering
+	 * that it is actually coming from rq->clock and not from
+	 * hrtimer's time base reading.
+	 */
+	act = ns_to_ktime(dl_se->deadline);
+	now = hrtimer_cb_get_time(&dl_se->dl_timer);
+	delta = ktime_to_ns(now) - rq->clock;
+	act = ktime_add_ns(act, delta);
+
+	/*
+	 * If the expiry time already passed, e.g., because the value
+	 * chosen as the deadline is too small, don't even try to
+	 * start the timer in the past!
+	 */
+	if (ktime_us_delta(act, now) < 0)
+		return 0;
+
+	hrtimer_set_expires(&dl_se->dl_timer, act);
+
+	soft = hrtimer_get_softexpires(&dl_se->dl_timer);
+	hard = hrtimer_get_expires(&dl_se->dl_timer);
+	range = ktime_to_ns(ktime_sub(hard, soft));
+	__hrtimer_start_range_ns(&dl_se->dl_timer, soft,
+				 range, HRTIMER_MODE_ABS, 0);
+
+	return hrtimer_active(&dl_se->dl_timer);
+}
+
+/*
+ * This is the bandwidth enforcement timer callback. If here, we know
+ * a task is not on its dl_rq, since the fact that the timer was running
+ * means the task is throttled and needs a runtime replenishment.
+ *
+ * However, what we actually do depends on whether the task is still
+ * active (i.e., it is on its rq) or has been removed from there by a
+ * call to dequeue_task_dl(). In the former case we must issue the runtime
+ * replenishment and add the task back to the dl_rq; in the latter, we just
+ * do nothing but clear dl_throttled, so that runtime and deadline
+ * updating (and the queueing back to dl_rq) will be done by the
+ * next call to enqueue_task_dl().
+ */
+static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
+{
+	unsigned long flags;
+	struct sched_dl_entity *dl_se = container_of(timer,
+						     struct sched_dl_entity,
+						     dl_timer);
+	struct task_struct *p = dl_task_of(dl_se);
+	struct rq *rq = task_rq_lock(p, &flags);
+
+	/*
+	 * We need to take care of possible races here. In fact, the
+	 * task might have changed its scheduling policy to something
+	 * different from SCHED_DEADLINE (through sched_setscheduler()).
+	 */
+	if (!dl_task(p))
+		goto unlock;
+
+	dl_se->dl_throttled = 0;
+	if (p->on_rq) {
+		enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
+		if (task_has_dl_policy(rq->curr))
+			check_preempt_curr_dl(rq, p, 0);
+		else
+			resched_task(rq->curr);
+	}
+unlock:
+	task_rq_unlock(rq, p, &flags);
+
+	return HRTIMER_NORESTART;
+}
+
+static void init_dl_task_timer(struct sched_dl_entity *dl_se)
+{
+	struct hrtimer *timer = &dl_se->dl_timer;
+
+	if (hrtimer_active(timer)) {
+		hrtimer_try_to_cancel(timer);
+		return;
+	}
+
+	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	timer->function = dl_task_timer;
+	timer->irqsafe = 1;
+}
+
+static
+int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
+{
+	int dmiss = dl_time_before(dl_se->deadline, rq->clock);
+	int rorun = dl_se->runtime <= 0;
+
+	if (!rorun && !dmiss)
+		return 0;
+
+	/*
+	 * If we are beyond our current deadline and we are still
+	 * executing, then we have already used some of the runtime of
+	 * the next instance. Thus, if we do not account for that, we are
+	 * stealing bandwidth from the system at each deadline miss!
+	 */
+	if (dmiss) {
+		dl_se->runtime = rorun ? dl_se->runtime : 0;
+		dl_se->runtime -= rq->clock - dl_se->deadline;
+	}
+
+	return 1;
+}
+
+/*
+ * Update the current task's runtime statistics (provided it is still
+ * a -deadline task and has not been removed from the dl_rq).
+ */
+static void update_curr_dl(struct rq *rq)
+{
+	struct task_struct *curr = rq->curr;
+	struct sched_dl_entity *dl_se = &curr->dl;
+	u64 delta_exec;
+
+	if (!dl_task(curr) || !on_dl_rq(dl_se))
+		return;
+
+	delta_exec = rq->clock_task - curr->se.exec_start;
+	if (unlikely((s64)delta_exec < 0))
+		delta_exec = 0;
+
+	schedstat_set(curr->se.statistics.exec_max,
+		      max(curr->se.statistics.exec_max, delta_exec));
+
+	curr->se.sum_exec_runtime += delta_exec;
+	account_group_exec_runtime(curr, delta_exec);
+
+	curr->se.exec_start = rq->clock;
+	cpuacct_charge(curr, delta_exec);
+
+	dl_se->runtime -= delta_exec;
+	if (dl_runtime_exceeded(rq, dl_se)) {
+		__dequeue_task_dl(rq, curr, 0);
+		if (likely(start_dl_timer(dl_se)))
+			dl_se->dl_throttled = 1;
+		else
+			enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
+
+		if (!is_leftmost(curr, &rq->dl))
+			resched_task(curr);
+	}
+}
+
+static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rb_node **link = &dl_rq->rb_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct sched_dl_entity *entry;
+	int leftmost = 1;
+
+	BUG_ON(!RB_EMPTY_NODE(&dl_se->rb_node));
+
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct sched_dl_entity, rb_node);
+		if (dl_time_before(dl_se->deadline, entry->deadline))
+			link = &parent->rb_left;
+		else {
+			link = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	if (leftmost)
+		dl_rq->rb_leftmost = &dl_se->rb_node;
+
+	rb_link_node(&dl_se->rb_node, parent, link);
+	rb_insert_color(&dl_se->rb_node, &dl_rq->rb_root);
+
+	dl_rq->dl_nr_running++;
+}
+
+static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+	if (RB_EMPTY_NODE(&dl_se->rb_node))
+		return;
+
+	if (dl_rq->rb_leftmost == &dl_se->rb_node) {
+		struct rb_node *next_node;
+
+		next_node = rb_next(&dl_se->rb_node);
+		dl_rq->rb_leftmost = next_node;
+	}
+
+	rb_erase(&dl_se->rb_node, &dl_rq->rb_root);
+	RB_CLEAR_NODE(&dl_se->rb_node);
+
+	dl_rq->dl_nr_running--;
+}
+
+static void
+enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
+{
+	BUG_ON(on_dl_rq(dl_se));
+
+	/*
+	 * If this is a wakeup or a new instance, the scheduling
+	 * parameters of the task might need updating. Otherwise,
+	 * we want a replenishment of its runtime.
+	 */
+	if (!dl_se->dl_new && flags & ENQUEUE_REPLENISH)
+		replenish_dl_entity(dl_se);
+	else
+		update_dl_entity(dl_se);
+
+	__enqueue_dl_entity(dl_se);
+}
+
+static void dequeue_dl_entity(struct sched_dl_entity *dl_se)
+{
+	__dequeue_dl_entity(dl_se);
+}
+
+static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
+{
+	/*
+	 * If p is throttled, we do nothing. In fact, if it exhausted
+	 * its budget it needs a replenishment and, since it now is on
+	 * its rq, the bandwidth timer callback (which clearly has not
+	 * run yet) will take care of this.
+	 */
+	if (p->dl.dl_throttled)
+		return;
+
+	enqueue_dl_entity(&p->dl, flags);
+}
+
+static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
+{
+	dequeue_dl_entity(&p->dl);
+}
+
+static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
+{
+	update_curr_dl(rq);
+	__dequeue_task_dl(rq, p, flags);
+}
+
+/*
+ * Yield task semantic for -deadline tasks is:
+ *
+ *   get off from the CPU until our next instance, with
+ *   a new runtime.
+ */
+static void yield_task_dl(struct rq *rq)
+{
+	struct task_struct *p = rq->curr;
+
+	/*
+	 * We make the task go to sleep until its current deadline by
+	 * forcing its runtime to zero. This way, update_curr_dl() stops
+	 * it and the bandwidth timer will wake it up and will give it
+	 * new scheduling parameters (thanks to dl_new=1).
+	 */
+	if (p->dl.runtime > 0) {
+		rq->curr->dl.dl_new = 1;
+		p->dl.runtime = 0;
+	}
+	update_curr_dl(rq);
+}
+
+/*
+ * Only called when both the current and waking task are -deadline
+ * tasks.
+ */
+static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
+				  int flags)
+{
+	if (dl_time_before(p->dl.deadline, rq->curr->dl.deadline))
+		resched_task(rq->curr);
+}
+
+#ifdef CONFIG_SCHED_HRTICK
+static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
+{
+	s64 delta = p->dl.dl_runtime - p->dl.runtime;
+
+	if (delta > 10000)
+		hrtick_start(rq, delta);
+}
+#else
+static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
+{
+}
+#endif
+
+static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
+						   struct dl_rq *dl_rq)
+{
+	struct rb_node *left = dl_rq->rb_leftmost;
+
+	if (!left)
+		return NULL;
+
+	return rb_entry(left, struct sched_dl_entity, rb_node);
+}
+
+struct task_struct *pick_next_task_dl(struct rq *rq)
+{
+	struct sched_dl_entity *dl_se;
+	struct task_struct *p;
+	struct dl_rq *dl_rq;
+
+	dl_rq = &rq->dl;
+
+	if (unlikely(!dl_rq->dl_nr_running))
+		return NULL;
+
+	dl_se = pick_next_dl_entity(rq, dl_rq);
+	BUG_ON(!dl_se);
+
+	p = dl_task_of(dl_se);
+	p->se.exec_start = rq->clock;
+#ifdef CONFIG_SCHED_HRTICK
+	if (hrtick_enabled(rq))
+		start_hrtick_dl(rq, p);
+#endif
+	return p;
+}
+
+static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
+{
+	update_curr_dl(rq);
+	p->se.exec_start = 0;
+}
+
+static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
+{
+	update_curr_dl(rq);
+
+#ifdef CONFIG_SCHED_HRTICK
+	if (hrtick_enabled(rq) && queued && p->dl.runtime > 0)
+		start_hrtick_dl(rq, p);
+#endif
+}
+
+static void task_fork_dl(struct task_struct *p)
+{
+	/*
+	 * SCHED_DEADLINE tasks cannot fork and this is achieved through
+	 * sched_fork()
+	 */
+}
+
+static void task_dead_dl(struct task_struct *p)
+{
+	struct hrtimer *timer = &p->dl.dl_timer;
+
+	if (hrtimer_active(timer))
+		hrtimer_try_to_cancel(timer);
+}
+
+static void set_curr_task_dl(struct rq *rq)
+{
+	struct task_struct *p = rq->curr;
+
+	p->se.exec_start = rq->clock;
+}
+
+static void switched_from_dl(struct rq *rq, struct task_struct *p)
+{
+	if (hrtimer_active(&p->dl.dl_timer))
+		hrtimer_try_to_cancel(&p->dl.dl_timer);
+}
+
+static void switched_to_dl(struct rq *rq, struct task_struct *p)
+{
+	/*
+	 * If p is throttled, don't consider the possibility
+	 * of preempting rq->curr, the check will be done right
+	 * after its runtime will get replenished.
+	 */
+	if (unlikely(p->dl.dl_throttled))
+		return;
+
+	if (!p->on_rq || rq->curr != p) {
+		if (task_has_dl_policy(rq->curr))
+			check_preempt_curr_dl(rq, p, 0);
+		else
+			resched_task(rq->curr);
+	}
+}
+
+static void prio_changed_dl(struct rq *rq, struct task_struct *p,
+			    int oldprio)
+{
+	switched_to_dl(rq, p);
+}
+
+#ifdef CONFIG_SMP
+static int
+select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
+{
+	return task_cpu(p);
+}
+
+static void set_cpus_allowed_dl(struct task_struct *p,
+				const struct cpumask *new_mask)
+{
+	int weight = cpumask_weight(new_mask);
+
+	BUG_ON(!dl_task(p));
+
+	cpumask_copy(&p->cpus_allowed, new_mask);
+	p->dl.nr_cpus_allowed = weight;
+}
+#endif
+
+static const struct sched_class dl_sched_class = {
+	.next			= &rt_sched_class,
+	.enqueue_task		= enqueue_task_dl,
+	.dequeue_task		= dequeue_task_dl,
+	.yield_task		= yield_task_dl,
+
+	.check_preempt_curr	= check_preempt_curr_dl,
+
+	.pick_next_task		= pick_next_task_dl,
+	.put_prev_task		= put_prev_task_dl,
+
+#ifdef CONFIG_SMP
+	.select_task_rq		= select_task_rq_dl,
+
+	.set_cpus_allowed       = set_cpus_allowed_dl,
+#endif
+
+	.set_curr_task		= set_curr_task_dl,
+	.task_tick		= task_tick_dl,
+	.task_fork              = task_fork_dl,
+	.task_dead		= task_dead_dl,
+
+	.prio_changed           = prio_changed_dl,
+	.switched_from		= switched_from_dl,
+	.switched_to		= switched_to_dl,
+};
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index c108b9c..4b09704 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -2,6 +2,7 @@
  * Real-Time Scheduling Class (mapped to the SCHED_FIFO and SCHED_RR
  * policies)
  */
+static const struct sched_class rt_sched_class;
 
 #ifdef CONFIG_RT_GROUP_SCHED
 
diff --git a/kernel/sched_stoptask.c b/kernel/sched_stoptask.c
index 8b44e7f..4270a36 100644
--- a/kernel/sched_stoptask.c
+++ b/kernel/sched_stoptask.c
@@ -81,7 +81,7 @@ get_rr_interval_stop(struct rq *rq, struct task_struct *task)
  * Simple, special scheduling class for the per-CPU stop tasks:
  */
 static const struct sched_class stop_sched_class = {
-	.next			= &rt_sched_class,
+	.next			= &dl_sched_class,
 
 	.enqueue_task		= enqueue_task_stop,
 	.dequeue_task		= dequeue_task_stop,
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-06  7:14 [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4 Juri Lelli
                   ` (4 preceding siblings ...)
  2012-04-06  7:14 ` [PATCH 05/16] sched: SCHED_DEADLINE policy implementation Juri Lelli
@ 2012-04-06  7:14 ` Juri Lelli
  2012-04-06 13:39   ` Hillf Danton
                     ` (5 more replies)
  2012-04-06  7:14 ` [PATCH 07/16] sched: SCHED_DEADLINE avg_update accounting Juri Lelli
                   ` (13 subsequent siblings)
  19 siblings, 6 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-06  7:14 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, cfriesen, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang

Add dynamic migrations to SCHED_DEADLINE, so that tasks can
be moved among CPUs when necessary. It is also possible to bind a
task to a (set of) CPU(s), thus restricting its capability of
migrating, or forbidding migrations altogether.

The very same approach used in sched_rt is utilised:
 - -deadline tasks are kept in CPU-specific runqueues,
 - -deadline tasks are migrated among runqueues to achieve the
   following:
    * on an M-CPU system the M earliest deadline ready tasks
      are always running;
    * affinity/cpusets settings of all the -deadline tasks are
      always respected.

Therefore, this very special form of "load balancing" is done with
an active method, i.e., the scheduler pushes or pulls tasks between
runqueues when they are woken up and/or (de)scheduled.
IOW, every time a preemption occurs, the descheduled task might be sent
to some other CPU (depending on its deadline) to continue executing
(push). On the other hand, every time a CPU becomes idle, it might pull
the second earliest deadline ready task from some other CPU.

To enforce this, a pull operation is always attempted before taking any
scheduling decision (pre_schedule()), as well as a push one after each
scheduling decision (post_schedule()). In addition, when a task arrives
or wakes up, the best CPU on which to resume it is selected, taking into
account its affinity mask, the system topology and its deadline.
E.g., from the scheduling point of view, the best CPU on which to wake
up (and also to which to push) a task is the one running the task
with the latest deadline among the M executing ones.

In order to facilitate these decisions, per-runqueue "caching" of the
deadlines of the currently running and of the first ready task is used.
Queued but not running tasks are also parked in another rb-tree to
speed up pushes.
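
To give a rough idea of how the cached deadlines drive these decisions,
the core of the push target selection can be sketched as follows (a
simplified, hypothetical helper, for illustration only; the real logic
is in latest_cpu_find()/find_later_rq() in the diff below, which also
builds a candidate cpumask and walks the sched_domain topology to
prefer cache-hot CPUs):

static int sketch_push_target(struct task_struct *p, struct cpumask *span)
{
	int cpu, best = -1;
	u64 max_dl = 0;

	for_each_cpu(cpu, span) {
		struct dl_rq *dl_rq = &cpu_rq(cpu)->dl;

		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
			continue;
		/* A CPU with no -deadline work is always good enough. */
		if (!dl_rq->dl_nr_running)
			return cpu;
		/* Skip CPUs where p would not have the earliest deadline. */
		if (!dl_time_before(p->dl.deadline, dl_rq->earliest_dl.curr))
			continue;
		/* Among the rest, prefer the latest running deadline. */
		if (dl_time_before(max_dl, dl_rq->earliest_dl.curr)) {
			max_dl = dl_rq->earliest_dl.curr;
			best = cpu;
		}
	}
	return best;
}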

Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 kernel/sched_dl.c |  912 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched_rt.c |    2 +-
 2 files changed, 889 insertions(+), 25 deletions(-)

diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index 604e2bc..38edefa 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -10,6 +10,7 @@
  * miss some of their deadlines), and won't affect any other task.
  *
  * Copyright (C) 2010 Dario Faggioli <raistlin@linux.it>,
+ *                    Juri Lelli <juri.lelli@gmail.com>,
  *                    Michael Trimarchi <michael@amarulasolutions.com>,
  *                    Fabio Checconi <fabio@gandalf.sssup.it>
  */
@@ -20,6 +21,15 @@ static inline int dl_time_before(u64 a, u64 b)
 	return (s64)(a - b) < 0;
 }
 
+/*
+ * Tells if entity @a should preempt entity @b.
+ */
+static inline
+int dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
+{
+	return dl_time_before(a->deadline, b->deadline);
+}
+
 static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
 {
 	return container_of(dl_se, struct task_struct, dl);
@@ -50,6 +60,153 @@ static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
 	return dl_rq->rb_leftmost == &dl_se->rb_node;
 }
 
+#ifdef CONFIG_SMP
+
+static inline int dl_overloaded(struct rq *rq)
+{
+	return atomic_read(&rq->rd->dlo_count);
+}
+
+static inline void dl_set_overload(struct rq *rq)
+{
+	if (!rq->online)
+		return;
+
+	cpumask_set_cpu(rq->cpu, rq->rd->dlo_mask);
+	/*
+	 * Must be visible before the overload count is
+	 * set (as in sched_rt.c).
+	 */
+	wmb();
+	atomic_inc(&rq->rd->dlo_count);
+}
+
+static inline void dl_clear_overload(struct rq *rq)
+{
+	if (!rq->online)
+		return;
+
+	atomic_dec(&rq->rd->dlo_count);
+	cpumask_clear_cpu(rq->cpu, rq->rd->dlo_mask);
+}
+
+static void update_dl_migration(struct dl_rq *dl_rq)
+{
+	if (dl_rq->dl_nr_migratory && dl_rq->dl_nr_total > 1) {
+		if (!dl_rq->overloaded) {
+			dl_set_overload(rq_of_dl_rq(dl_rq));
+			dl_rq->overloaded = 1;
+		}
+	} else if (dl_rq->overloaded) {
+		dl_clear_overload(rq_of_dl_rq(dl_rq));
+		dl_rq->overloaded = 0;
+	}
+}
+
+static void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	dl_rq = &rq_of_dl_rq(dl_rq)->dl;
+
+	dl_rq->dl_nr_total++;
+	if (dl_se->nr_cpus_allowed > 1)
+		dl_rq->dl_nr_migratory++;
+
+	update_dl_migration(dl_rq);
+}
+
+static void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	dl_rq = &rq_of_dl_rq(dl_rq)->dl;
+
+	dl_rq->dl_nr_total--;
+	if (dl_se->nr_cpus_allowed > 1)
+		dl_rq->dl_nr_migratory--;
+
+	update_dl_migration(dl_rq);
+}
+
+/*
+ * The list of pushable -deadline tasks is not a plist, like in
+ * sched_rt.c, it is an rb-tree with tasks ordered by deadline.
+ */
+static void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
+{
+	struct dl_rq *dl_rq = &rq->dl;
+	struct rb_node **link = &dl_rq->pushable_dl_tasks_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct task_struct *entry;
+	int leftmost = 1;
+
+	BUG_ON(!RB_EMPTY_NODE(&p->pushable_dl_tasks));
+
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct task_struct,
+				 pushable_dl_tasks);
+		if (!dl_entity_preempt(&entry->dl, &p->dl))
+			link = &parent->rb_left;
+		else {
+			link = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	if (leftmost)
+		dl_rq->pushable_dl_tasks_leftmost = &p->pushable_dl_tasks;
+
+	rb_link_node(&p->pushable_dl_tasks, parent, link);
+	rb_insert_color(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root);
+}
+
+static void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
+{
+	struct dl_rq *dl_rq = &rq->dl;
+
+	if (RB_EMPTY_NODE(&p->pushable_dl_tasks))
+		return;
+
+	if (dl_rq->pushable_dl_tasks_leftmost == &p->pushable_dl_tasks) {
+		struct rb_node *next_node;
+
+		next_node = rb_next(&p->pushable_dl_tasks);
+		dl_rq->pushable_dl_tasks_leftmost = next_node;
+	}
+
+	rb_erase(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root);
+	RB_CLEAR_NODE(&p->pushable_dl_tasks);
+}
+
+static inline int has_pushable_dl_tasks(struct rq *rq)
+{
+	return !RB_EMPTY_ROOT(&rq->dl.pushable_dl_tasks_root);
+}
+
+static int push_dl_task(struct rq *rq);
+
+#else
+
+static inline
+void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
+{
+}
+
+static inline
+void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
+{
+}
+
+static inline
+void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+}
+
+static inline
+void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+}
+
+#endif /* CONFIG_SMP */
+
 static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
 static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
 static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
@@ -276,6 +433,14 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 			check_preempt_curr_dl(rq, p, 0);
 		else
 			resched_task(rq->curr);
+#ifdef CONFIG_SMP
+		/*
+		 * Queueing this task back might have overloaded rq,
+		 * check if we need to kick someone away.
+		 */
+		if (rq->dl.overloaded)
+			push_dl_task(rq);
+#endif
 	}
 unlock:
 	task_rq_unlock(rq, p, &flags);
@@ -359,6 +524,100 @@ static void update_curr_dl(struct rq *rq)
 	}
 }
 
+#ifdef CONFIG_SMP
+
+static struct task_struct *pick_next_earliest_dl_task(struct rq *rq, int cpu);
+
+static inline int next_deadline(struct rq *rq)
+{
+	struct task_struct *next = pick_next_earliest_dl_task(rq, rq->cpu);
+
+	if (next && dl_prio(next->prio))
+		return next->dl.deadline;
+	else
+		return 0;
+}
+
+static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
+{
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+
+	if (dl_rq->earliest_dl.curr == 0 ||
+	    dl_time_before(deadline, dl_rq->earliest_dl.curr)) {
+		/*
+		 * If the dl_rq had no -deadline tasks, or if the new task
+		 * has a shorter deadline than the current one on dl_rq, we
+		 * know that the previous earliest becomes our next earliest,
+		 * as the new task becomes the earliest itself.
+		 */
+		dl_rq->earliest_dl.next = dl_rq->earliest_dl.curr;
+		dl_rq->earliest_dl.curr = deadline;
+	} else if (dl_rq->earliest_dl.next == 0 ||
+		   dl_time_before(deadline, dl_rq->earliest_dl.next)) {
+		/*
+		 * On the other hand, if the new -deadline task has a
+		 * later deadline than the earliest one on dl_rq, but
+		 * it is earlier than the next (if any), we must
+		 * recompute the next-earliest.
+		 */
+		dl_rq->earliest_dl.next = next_deadline(rq);
+	}
+}
+
+static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
+{
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+
+	/*
+	 * Since we may have removed our earliest (and/or next earliest)
+	 * task we must recompute them.
+	 */
+	if (!dl_rq->dl_nr_running) {
+		dl_rq->earliest_dl.curr = 0;
+		dl_rq->earliest_dl.next = 0;
+	} else {
+		struct rb_node *leftmost = dl_rq->rb_leftmost;
+		struct sched_dl_entity *entry;
+
+		entry = rb_entry(leftmost, struct sched_dl_entity, rb_node);
+		dl_rq->earliest_dl.curr = entry->deadline;
+		dl_rq->earliest_dl.next = next_deadline(rq);
+	}
+}
+
+#else
+
+static inline void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
+static inline void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
+
+#endif /* CONFIG_SMP */
+
+static inline
+void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	int prio = dl_task_of(dl_se)->prio;
+	u64 deadline = dl_se->deadline;
+
+	WARN_ON(!dl_prio(prio));
+	dl_rq->dl_nr_running++;
+
+	inc_dl_deadline(dl_rq, deadline);
+	inc_dl_migration(dl_se, dl_rq);
+}
+
+static inline
+void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	int prio = dl_task_of(dl_se)->prio;
+
+	WARN_ON(!dl_prio(prio));
+	WARN_ON(!dl_rq->dl_nr_running);
+	dl_rq->dl_nr_running--;
+
+	dec_dl_deadline(dl_rq, dl_se->deadline);
+	dec_dl_migration(dl_se, dl_rq);
+}
+
 static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
@@ -386,7 +645,7 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
 	rb_link_node(&dl_se->rb_node, parent, link);
 	rb_insert_color(&dl_se->rb_node, &dl_rq->rb_root);
 
-	dl_rq->dl_nr_running++;
+	inc_dl_tasks(dl_se, dl_rq);
 }
 
 static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
@@ -406,7 +665,7 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
 	rb_erase(&dl_se->rb_node, &dl_rq->rb_root);
 	RB_CLEAR_NODE(&dl_se->rb_node);
 
-	dl_rq->dl_nr_running--;
+	dec_dl_tasks(dl_se, dl_rq);
 }
 
 static void
@@ -444,11 +703,15 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 		return;
 
 	enqueue_dl_entity(&p->dl, flags);
+
+	if (!task_current(rq, p) && p->dl.nr_cpus_allowed > 1)
+		enqueue_pushable_dl_task(rq, p);
 }
 
 static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	dequeue_dl_entity(&p->dl);
+	dequeue_pushable_dl_task(rq, p);
 }
 
 static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
@@ -480,6 +743,75 @@ static void yield_task_dl(struct rq *rq)
 	update_curr_dl(rq);
 }
 
+#ifdef CONFIG_SMP
+
+static int find_later_rq(struct task_struct *task);
+static int latest_cpu_find(struct cpumask *span,
+			   struct task_struct *task,
+			   struct cpumask *later_mask);
+
+static int
+select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
+{
+	struct task_struct *curr;
+	struct rq *rq;
+	int cpu;
+
+	if (sd_flag != SD_BALANCE_WAKE)
+		return smp_processor_id();
+
+	cpu = task_cpu(p);
+	rq = cpu_rq(cpu);
+
+	rcu_read_lock();
+	curr = ACCESS_ONCE(rq->curr); /* unlocked access */
+
+	/*
+	 * If we are dealing with a -deadline task, we must
+	 * decide where to wake it up.
+	 * If it has a later deadline and the current task
+	 * on this rq can't move (provided the waking task
+	 * can!) we prefer to send it somewhere else. On the
+	 * other hand, if it has a shorter deadline, we
+	 * try to make it stay here, since it might be important.
+	 */
+	if (unlikely(dl_task(rq->curr)) &&
+	    (rq->curr->dl.nr_cpus_allowed < 2 ||
+	     dl_entity_preempt(&rq->curr->dl, &p->dl)) &&
+	    (p->dl.nr_cpus_allowed > 1)) {
+		int target = find_later_rq(p);
+
+		if (target != -1)
+			cpu = target;
+	}
+	rcu_read_unlock();
+
+	return cpu;
+}
+
+static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
+{
+	/*
+	 * Current can't be migrated, useless to reschedule,
+	 * let's hope p can move out.
+	 */
+	if (rq->curr->dl.nr_cpus_allowed == 1 ||
+	    latest_cpu_find(rq->rd->span, rq->curr, NULL) == -1)
+		return;
+
+	/*
+	 * p is migratable, so let's not schedule it and
+	 * see if it is pushed or pulled somewhere else.
+	 */
+	if (p->dl.nr_cpus_allowed != 1 &&
+	    latest_cpu_find(rq->rd->span, p, NULL) != -1)
+		return;
+
+	resched_task(rq->curr);
+}
+
+#endif /* CONFIG_SMP */
+
 /*
  * Only called when both the current and waking task are -deadline
  * tasks.
@@ -487,8 +819,20 @@ static void yield_task_dl(struct rq *rq)
 static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
 				  int flags)
 {
-	if (dl_time_before(p->dl.deadline, rq->curr->dl.deadline))
+	if (dl_time_before(p->dl.deadline, rq->curr->dl.deadline)) {
 		resched_task(rq->curr);
+		return;
+	}
+
+#ifdef CONFIG_SMP
+	/*
+	 * In the unlikely case current and p have the same deadline
+	 * let us try to decide what's the best thing to do...
+	 */
+	if ((s64)(p->dl.deadline - rq->curr->dl.deadline) == 0 &&
+	    !need_resched())
+		check_preempt_equal_dl(rq, p);
+#endif /* CONFIG_SMP */
 }
 
 #ifdef CONFIG_SCHED_HRTICK
@@ -532,10 +876,20 @@ struct task_struct *pick_next_task_dl(struct rq *rq)
 
 	p = dl_task_of(dl_se);
 	p->se.exec_start = rq->clock;
+
+	/* Running task will never be pushed. */
+	if (p)
+		dequeue_pushable_dl_task(rq, p);
+
 #ifdef CONFIG_SCHED_HRTICK
 	if (hrtick_enabled(rq))
 		start_hrtick_dl(rq, p);
 #endif
+
+#ifdef CONFIG_SMP
+	rq->post_schedule = has_pushable_dl_tasks(rq);
+#endif /* CONFIG_SMP */
+
 	return p;
 }
 
@@ -543,6 +897,9 @@ static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
 {
 	update_curr_dl(rq);
 	p->se.exec_start = 0;
+
+	if (on_dl_rq(&p->dl) && p->dl.nr_cpus_allowed > 1)
+		enqueue_pushable_dl_task(rq, p);
 }
 
 static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
@@ -576,43 +933,410 @@ static void set_curr_task_dl(struct rq *rq)
 	struct task_struct *p = rq->curr;
 
 	p->se.exec_start = rq->clock;
+
+	/* You can't push away the running task */
+	dequeue_pushable_dl_task(rq, p);
 }
 
-static void switched_from_dl(struct rq *rq, struct task_struct *p)
+#ifdef CONFIG_SMP
+
+/* Only try algorithms three times */
+#define DL_MAX_TRIES 3
+
+static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu)
 {
-	if (hrtimer_active(&p->dl.dl_timer))
-		hrtimer_try_to_cancel(&p->dl.dl_timer);
+	if (!task_running(rq, p) &&
+	    (cpu < 0 || cpumask_test_cpu(cpu, &p->cpus_allowed)) &&
+	    (p->dl.nr_cpus_allowed > 1))
+		return 1;
+
+	return 0;
 }
 
-static void switched_to_dl(struct rq *rq, struct task_struct *p)
+/* Returns the second earliest -deadline task, NULL otherwise */
+static struct task_struct *pick_next_earliest_dl_task(struct rq *rq, int cpu)
+{
+	struct rb_node *next_node = rq->dl.rb_leftmost;
+	struct sched_dl_entity *dl_se;
+	struct task_struct *p = NULL;
+
+next_node:
+	next_node = rb_next(next_node);
+	if (next_node) {
+		dl_se = rb_entry(next_node, struct sched_dl_entity, rb_node);
+		p = dl_task_of(dl_se);
+
+		if (pick_dl_task(rq, p, cpu))
+			return p;
+
+		goto next_node;
+	}
+
+	return NULL;
+}
+
+static int latest_cpu_find(struct cpumask *span,
+			   struct task_struct *task,
+			   struct cpumask *later_mask)
 {
+	const struct sched_dl_entity *dl_se = &task->dl;
+	int cpu, found = -1, best = 0;
+	u64 max_dl = 0;
+
+	for_each_cpu(cpu, span) {
+		struct rq *rq = cpu_rq(cpu);
+		struct dl_rq *dl_rq = &rq->dl;
+
+		if (cpumask_test_cpu(cpu, &task->cpus_allowed) &&
+		    (!dl_rq->dl_nr_running || dl_time_before(dl_se->deadline,
+		     dl_rq->earliest_dl.curr))) {
+			if (later_mask)
+				cpumask_set_cpu(cpu, later_mask);
+			if (!best && !dl_rq->dl_nr_running) {
+				best = 1;
+				found = cpu;
+			} else if (!best &&
+				   dl_time_before(max_dl,
+						  dl_rq->earliest_dl.curr)) {
+				max_dl = dl_rq->earliest_dl.curr;
+				found = cpu;
+			}
+		} else if (later_mask)
+			cpumask_clear_cpu(cpu, later_mask);
+	}
+
+	return found;
+}
+
+static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
+
+static int find_later_rq(struct task_struct *task)
+{
+	struct sched_domain *sd;
+	struct cpumask *later_mask = __get_cpu_var(local_cpu_mask_dl);
+	int this_cpu = smp_processor_id();
+	int best_cpu, cpu = task_cpu(task);
+
+	if (task->dl.nr_cpus_allowed == 1)
+		return -1;
+
+	best_cpu = latest_cpu_find(task_rq(task)->rd->span, task, later_mask);
+	if (best_cpu == -1)
+		return -1;
+
 	/*
-	 * If p is throttled, don't consider the possibility
-	 * of preempting rq->curr, the check will be done right
-	 * after its runtime will get replenished.
+	 * If we are here, some target has been found,
+	 * the most suitable of which is cached in best_cpu.
+	 * This is, among the runqueues where the current tasks
+	 * have later deadlines than the task's one, the rq
+	 * with the latest possible one.
+	 *
+	 * Now we check how well this matches with task's
+	 * affinity and system topology.
+	 *
+	 * The last cpu where the task ran is our first
+	 * guess, since it is most likely cache-hot there.
 	 */
-	if (unlikely(p->dl.dl_throttled))
-		return;
+	if (cpumask_test_cpu(cpu, later_mask))
+		return cpu;
+	/*
+	 * Check if this_cpu is to be skipped (i.e., it is
+	 * not in the mask) or not.
+	 */
+	if (!cpumask_test_cpu(this_cpu, later_mask))
+		this_cpu = -1;
+
+	rcu_read_lock();
+	for_each_domain(cpu, sd) {
+		if (sd->flags & SD_WAKE_AFFINE) {
+
+			/*
+			 * If possible, preempting this_cpu is
+			 * cheaper than migrating.
+			 */
+			if (this_cpu != -1 &&
+			    cpumask_test_cpu(this_cpu, sched_domain_span(sd)))
+				return this_cpu;
+
+			/*
+			 * Last chance: if best_cpu is valid and is
+			 * in the mask, that becomes our choice.
+			 */
+			if (best_cpu < nr_cpu_ids &&
+			    cpumask_test_cpu(best_cpu, sched_domain_span(sd)))
+				return best_cpu;
+		}
+	}
+	rcu_read_unlock();
 
-	if (!p->on_rq || rq->curr != p) {
-		if (task_has_dl_policy(rq->curr))
-			check_preempt_curr_dl(rq, p, 0);
-		else
-			resched_task(rq->curr);
+	/*
+	 * At this point, all our guesses failed, we just return
+	 * 'something', and let the caller sort things out.
+	 */
+	if (this_cpu != -1)
+		return this_cpu;
+
+	cpu = cpumask_any(later_mask);
+	if (cpu < nr_cpu_ids)
+		return cpu;
+
+	return -1;
+}
+
+/* Locks the rq it finds */
+static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
+{
+	struct rq *later_rq = NULL;
+	int tries;
+	int cpu;
+
+	for (tries = 0; tries < DL_MAX_TRIES; tries++) {
+		cpu = find_later_rq(task);
+
+		if ((cpu == -1) || (cpu == rq->cpu))
+			break;
+
+		later_rq = cpu_rq(cpu);
+
+		/* Retry if something changed. */
+		if (double_lock_balance(rq, later_rq)) {
+			if (unlikely(task_rq(task) != rq ||
+				     !cpumask_test_cpu(later_rq->cpu,
+						       &task->cpus_allowed) ||
+				     task_running(rq, task) ||
+				     !task->se.on_rq)) {
+				raw_spin_unlock(&later_rq->lock);
+				later_rq = NULL;
+				break;
+			}
+		}
+
+		/*
+		 * If the rq we found has no -deadline task, or
+		 * its earliest one has a later deadline than our
+		 * task, the rq is a good one.
+		 */
+		if (!later_rq->dl.dl_nr_running ||
+		    dl_time_before(task->dl.deadline,
+				   later_rq->dl.earliest_dl.curr))
+			break;
+
+		/* Otherwise we try again. */
+		double_unlock_balance(rq, later_rq);
+		later_rq = NULL;
 	}
+
+	return later_rq;
 }
 
-static void prio_changed_dl(struct rq *rq, struct task_struct *p,
-			    int oldprio)
+static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
 {
-	switched_to_dl(rq, p);
+	struct task_struct *p;
+
+	if (!has_pushable_dl_tasks(rq))
+		return NULL;
+
+	p = rb_entry(rq->dl.pushable_dl_tasks_leftmost,
+		     struct task_struct, pushable_dl_tasks);
+
+	BUG_ON(rq->cpu != task_cpu(p));
+	BUG_ON(task_current(rq, p));
+	BUG_ON(p->dl.nr_cpus_allowed <= 1);
+
+	BUG_ON(!p->se.on_rq);
+	BUG_ON(!dl_task(p));
+
+	return p;
 }
 
-#ifdef CONFIG_SMP
-static int
-select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
+/*
+ * See if the non-running -deadline tasks on this rq
+ * can be sent to some other CPU where they can preempt
+ * and start executing.
+ */
+static int push_dl_task(struct rq *rq)
 {
-	return task_cpu(p);
+	struct task_struct *next_task;
+	struct rq *later_rq;
+
+	if (!rq->dl.overloaded)
+		return 0;
+
+	next_task = pick_next_pushable_dl_task(rq);
+	if (!next_task)
+		return 0;
+
+retry:
+	if (unlikely(next_task == rq->curr)) {
+		WARN_ON(1);
+		return 0;
+	}
+
+	/*
+	 * If next_task preempts rq->curr, and rq->curr
+	 * can move away, it makes sense to just reschedule
+	 * without going further in pushing next_task.
+	 */
+	if (dl_task(rq->curr) &&
+	    dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) &&
+	    rq->curr->dl.nr_cpus_allowed > 1) {
+		resched_task(rq->curr);
+		return 0;
+	}
+
+	/* We might release rq lock */
+	get_task_struct(next_task);
+
+	/* Will lock the rq it'll find */
+	later_rq = find_lock_later_rq(next_task, rq);
+	if (!later_rq) {
+		struct task_struct *task;
+
+		/*
+		 * We must check all this again, since
+		 * find_lock_later_rq releases rq->lock and it is
+		 * then possible that next_task has migrated.
+		 */
+		task = pick_next_pushable_dl_task(rq);
+		if (task_cpu(next_task) == rq->cpu && task == next_task) {
+			/*
+			 * The task is still there. We don't try
+			 * again, some other cpu will pull it when ready.
+			 */
+			dequeue_pushable_dl_task(rq, next_task);
+			goto out;
+		}
+
+		if (!task)
+			/* No more tasks */
+			goto out;
+
+		put_task_struct(next_task);
+		next_task = task;
+		goto retry;
+	}
+
+	deactivate_task(rq, next_task, 0);
+	set_task_cpu(next_task, later_rq->cpu);
+	activate_task(later_rq, next_task, 0);
+
+	resched_task(later_rq->curr);
+
+	double_unlock_balance(rq, later_rq);
+
+out:
+	put_task_struct(next_task);
+
+	return 1;
+}
+
+static void push_dl_tasks(struct rq *rq)
+{
+	/* Terminates as it moves a -deadline task */
+	while (push_dl_task(rq))
+		;
+}
+
+static int pull_dl_task(struct rq *this_rq)
+{
+	int this_cpu = this_rq->cpu, ret = 0, cpu;
+	struct task_struct *p;
+	struct rq *src_rq;
+	u64 dmin = LONG_MAX;
+
+	if (likely(!dl_overloaded(this_rq)))
+		return 0;
+
+	for_each_cpu(cpu, this_rq->rd->dlo_mask) {
+		if (this_cpu == cpu)
+			continue;
+
+		src_rq = cpu_rq(cpu);
+
+		/*
+		 * It looks racy, and it is! However, as in sched_rt.c,
+		 * we are fine with this.
+		 */
+		if (this_rq->dl.dl_nr_running &&
+		    dl_time_before(this_rq->dl.earliest_dl.curr,
+				   src_rq->dl.earliest_dl.next))
+			continue;
+
+		/* Might drop this_rq->lock */
+		double_lock_balance(this_rq, src_rq);
+
+		/*
+		 * If there are no more pullable tasks on the
+		 * rq, we're done with it.
+		 */
+		if (src_rq->dl.dl_nr_running <= 1)
+			goto skip;
+
+		p = pick_next_earliest_dl_task(src_rq, this_cpu);
+
+		/*
+		 * We found a task to be pulled if:
+		 *  - it preempts our current (if there's one),
+		 *  - it will preempt the last one we pulled (if any).
+		 */
+		if (p && dl_time_before(p->dl.deadline, dmin) &&
+		    (!this_rq->dl.dl_nr_running ||
+		     dl_time_before(p->dl.deadline,
+				    this_rq->dl.earliest_dl.curr))) {
+			WARN_ON(p == src_rq->curr);
+			WARN_ON(!p->se.on_rq);
+
+			/*
+			 * Then we pull iff p has actually an earlier
+			 * deadline than the current task of its runqueue.
+			 */
+			if (dl_time_before(p->dl.deadline,
+					   src_rq->curr->dl.deadline))
+				goto skip;
+
+			ret = 1;
+
+			deactivate_task(src_rq, p, 0);
+			set_task_cpu(p, this_cpu);
+			activate_task(this_rq, p, 0);
+			dmin = p->dl.deadline;
+
+			/* Is there any other task even earlier? */
+		}
+skip:
+		double_unlock_balance(this_rq, src_rq);
+	}
+
+	return ret;
+}
+
+static void pre_schedule_dl(struct rq *rq, struct task_struct *prev)
+{
+	/* Try to pull other tasks here */
+	if (dl_task(prev))
+		pull_dl_task(rq);
+}
+
+static void post_schedule_dl(struct rq *rq)
+{
+	push_dl_tasks(rq);
+}
+
+/*
+ * Since the task is not running and a reschedule is not going to happen
+ * anytime soon on its runqueue, we try pushing it away now.
+ */
+static void task_woken_dl(struct rq *rq, struct task_struct *p)
+{
+	if (!task_running(rq, p) &&
+	    !test_tsk_need_resched(rq->curr) &&
+	    has_pushable_dl_tasks(rq) &&
+	    p->dl.nr_cpus_allowed > 1 &&
+	    dl_task(rq->curr) &&
+	    (rq->curr->dl.nr_cpus_allowed < 2 ||
+	     dl_entity_preempt(&rq->curr->dl, &p->dl))) {
+		push_dl_tasks(rq);
+	}
 }
 
 static void set_cpus_allowed_dl(struct task_struct *p,
@@ -622,10 +1346,145 @@ static void set_cpus_allowed_dl(struct task_struct *p,
 
 	BUG_ON(!dl_task(p));
 
+	/*
+	 * Update only if the task is actually running (i.e.,
+	 * it is on the rq AND it is not throttled).
+	 */
+	if (on_dl_rq(&p->dl) && (weight != p->dl.nr_cpus_allowed)) {
+		struct rq *rq = task_rq(p);
+
+		if (!task_current(rq, p)) {
+			/*
+			 * If the task was on the pushable list,
+			 * make sure it stays there only if the new
+			 * mask allows that.
+			 */
+			if (p->dl.nr_cpus_allowed > 1)
+				dequeue_pushable_dl_task(rq, p);
+
+			if (weight > 1)
+				enqueue_pushable_dl_task(rq, p);
+		}
+
+		if ((p->dl.nr_cpus_allowed <= 1) && (weight > 1)) {
+			rq->dl.dl_nr_migratory++;
+		} else if ((p->dl.nr_cpus_allowed > 1) && (weight <= 1)) {
+			BUG_ON(!rq->dl.dl_nr_migratory);
+			rq->dl.dl_nr_migratory--;
+		}
+
+		update_dl_migration(&rq->dl);
+	}
+
 	cpumask_copy(&p->cpus_allowed, new_mask);
 	p->dl.nr_cpus_allowed = weight;
 }
+
+/* Assumes rq->lock is held */
+static void rq_online_dl(struct rq *rq)
+{
+	if (rq->dl.overloaded)
+		dl_set_overload(rq);
+}
+
+/* Assumes rq->lock is held */
+static void rq_offline_dl(struct rq *rq)
+{
+	if (rq->dl.overloaded)
+		dl_clear_overload(rq);
+}
+
+static inline void init_sched_dl_class(void)
+{
+	unsigned int i;
+
+	for_each_possible_cpu(i)
+		zalloc_cpumask_var_node(&per_cpu(local_cpu_mask_dl, i),
+					GFP_KERNEL, cpu_to_node(i));
+}
+
+#endif /* CONFIG_SMP */
+
+static void switched_from_dl(struct rq *rq, struct task_struct *p)
+{
+	if (hrtimer_active(&p->dl.dl_timer) && !dl_policy(p->policy))
+		hrtimer_try_to_cancel(&p->dl.dl_timer);
+
+#ifdef CONFIG_SMP
+	/*
+	 * Since this might be the only -deadline task on the rq,
+	 * this is the right place to try to pull some other one
+	 * from an overloaded cpu, if any.
+	 */
+	if (!rq->dl.dl_nr_running)
+		pull_dl_task(rq);
 #endif
+}
+
+/*
+ * When switching to -deadline, we may overload the rq, then
+ * we try to push someone off, if possible.
+ */
+static void switched_to_dl(struct rq *rq, struct task_struct *p)
+{
+	int check_resched = 1;
+
+	/*
+	 * If p is throttled, don't consider the possibility
+	 * of preempting rq->curr, the check will be done right
+	 * after its runtime will get replenished.
+	 */
+	if (unlikely(p->dl.dl_throttled))
+		return;
+
+	if (!p->on_rq || rq->curr != p) {
+#ifdef CONFIG_SMP
+		if (rq->dl.overloaded && push_dl_task(rq) && rq != task_rq(p))
+			/* Only reschedule if pushing failed */
+			check_resched = 0;
+#endif /* CONFIG_SMP */
+		if (check_resched && task_has_dl_policy(rq->curr))
+			check_preempt_curr_dl(rq, p, 0);
+	}
+}
+
+/*
+ * If the scheduling parameters of a -deadline task changed,
+ * a push or pull operation might be needed.
+ */
+static void prio_changed_dl(struct rq *rq, struct task_struct *p,
+			    int oldprio)
+{
+	if (p->on_rq || rq->curr == p) {
+#ifdef CONFIG_SMP
+		/*
+		 * This might be too much, but unfortunately
+		 * we don't have the old deadline value, and
+		 * we can't argue if the task is increasing
+		 * or lowering its prio, so...
+		 */
+		if (!rq->dl.overloaded)
+			pull_dl_task(rq);
+
+		/*
+		 * If we now have an earlier deadline task than p,
+		 * then reschedule, provided p is still on this
+		 * runqueue.
+		 */
+		if (dl_time_before(rq->dl.earliest_dl.curr, p->dl.deadline) &&
+		    rq->curr == p)
+			resched_task(p);
+#else
+		/*
+		 * Again, we don't know if p has an earlier
+		 * or later deadline, so let's blindly set a
+		 * (maybe not needed) rescheduling point.
+		 */
+		resched_task(p);
+#endif /* CONFIG_SMP */
+	} else
+		switched_to_dl(rq, p);
+}
 
 static const struct sched_class dl_sched_class = {
 	.next			= &rt_sched_class,
@@ -642,6 +1501,11 @@ static const struct sched_class dl_sched_class = {
 	.select_task_rq		= select_task_rq_dl,
 
 	.set_cpus_allowed       = set_cpus_allowed_dl,
+	.rq_online              = rq_online_dl,
+	.rq_offline             = rq_offline_dl,
+	.pre_schedule		= pre_schedule_dl,
+	.post_schedule		= post_schedule_dl,
+	.task_woken		= task_woken_dl,
 #endif
 
 	.set_curr_task		= set_curr_task_dl,
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 4b09704..7b609bc 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1590,7 +1590,7 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
 	    !test_tsk_need_resched(rq->curr) &&
 	    has_pushable_tasks(rq) &&
 	    p->rt.nr_cpus_allowed > 1 &&
-	    rt_task(rq->curr) &&
+	    (dl_task(rq->curr) || rt_task(rq->curr)) &&
 	    (rq->curr->rt.nr_cpus_allowed < 2 ||
 	     rq->curr->prio <= p->prio))
 		push_rt_tasks(rq);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 07/16] sched: SCHED_DEADLINE avg_update accounting.
  2012-04-06  7:14 [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4 Juri Lelli
                   ` (5 preceding siblings ...)
  2012-04-06  7:14 ` [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic Juri Lelli
@ 2012-04-06  7:14 ` Juri Lelli
  2012-04-06  7:14 ` [PATCH 08/16] sched: add period support for -deadline tasks Juri Lelli
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-06  7:14 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, cfriesen, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang

From: Dario Faggioli <raistlin@linux.it>

Make the core scheduler and load balancer aware of the load
produced by -deadline tasks, by updating the moving average
like for sched_rt.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
---
 kernel/sched_dl.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index 38edefa..11c9b8d 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -511,6 +511,8 @@ static void update_curr_dl(struct rq *rq)
 	curr->se.exec_start = rq->clock;
 	cpuacct_charge(curr, delta_exec);
 
+	sched_rt_avg_update(rq, delta_exec);
+
 	dl_se->runtime -= delta_exec;
 	if (dl_runtime_exceeded(rq, dl_se)) {
 		__dequeue_task_dl(rq, curr, 0);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 08/16] sched: add period support for -deadline tasks.
  2012-04-06  7:14 [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4 Juri Lelli
                   ` (6 preceding siblings ...)
  2012-04-06  7:14 ` [PATCH 07/16] sched: SCHED_DEADLINE avg_update accounting Juri Lelli
@ 2012-04-06  7:14 ` Juri Lelli
  2012-04-11 20:32   ` Steven Rostedt
  2012-04-06  7:14 ` [PATCH 09/16] sched: add schedstats " Juri Lelli
                   ` (11 subsequent siblings)
  19 siblings, 1 reply; 129+ messages in thread
From: Juri Lelli @ 2012-04-06  7:14 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, cfriesen, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang

From: Harald Gustafsson <harald.gustafsson@ericsson.com>

Make it possible to specify a period (different from or equal to the
deadline) for -deadline tasks.
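
For instance (purely illustrative; the values are in nanoseconds and the
field names follow the sched_param2 interface used by __setparam_dl()
below), a task needing 10 ms of runtime every 100 ms, with each instance
due within 30 ms of its activation, would pass something like:

struct sched_param2 attr = {
	.sched_priority = 0,
	.sched_runtime  =  10 * 1000 * 1000,	/*  10 ms */
	.sched_deadline =  30 * 1000 * 1000,	/*  30 ms */
	.sched_period   = 100 * 1000 * 1000,	/* 100 ms */
};

to the extended sched_setscheduler()-style syscall introduced earlier in
the series. __checkparam_dl() accepts this since runtime <= deadline <=
period; leaving sched_period at zero keeps the previous behaviour of
using the relative deadline as the period.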

Signed-off-by: Harald Gustafsson <harald.gustafsson@ericsson.com>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
---
 include/linux/sched.h |    1 +
 kernel/sched.c        |   15 ++++++++++++---
 kernel/sched_dl.c     |   10 +++++++---
 3 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 416ce99..5961592 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1278,6 +1278,7 @@ struct sched_dl_entity {
 	 */
 	u64 dl_runtime;		/* maximum runtime for each instance	*/
 	u64 dl_deadline;	/* relative deadline of each instance	*/
+	u64 dl_period;		/* separation of two instances (period) */
 
 	/*
 	 * Actual scheduling parameters. Initialized with the values above,
diff --git a/kernel/sched.c b/kernel/sched.c
index 1a38ad1..9461958 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3024,6 +3024,7 @@ static void __sched_fork(struct task_struct *p)
 	hrtimer_init(&p->dl.dl_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	p->dl.dl_runtime = p->dl.runtime = 0;
 	p->dl.dl_deadline = p->dl.deadline = 0;
+	p->dl.dl_period = 0;
 	p->dl.flags = 0;
 
 	INIT_LIST_HEAD(&p->rt.run_list);
@@ -5490,6 +5491,10 @@ __setparam_dl(struct task_struct *p, const struct sched_param2 *param2)
 	init_dl_task_timer(dl_se);
 	dl_se->dl_runtime = param2->sched_runtime;
 	dl_se->dl_deadline = param2->sched_deadline;
+	if (param2->sched_period != 0)
+		dl_se->dl_period = param2->sched_period;
+	else
+		dl_se->dl_period = dl_se->dl_deadline;
 	dl_se->flags = param2->sched_flags;
 	dl_se->dl_throttled = 0;
 	dl_se->dl_new = 1;
@@ -5503,19 +5508,23 @@ __getparam_dl(struct task_struct *p, struct sched_param2 *param2)
 	param2->sched_priority = p->rt_priority;
 	param2->sched_runtime = dl_se->dl_runtime;
 	param2->sched_deadline = dl_se->dl_deadline;
+	param2->sched_period = dl_se->dl_period;
 	param2->sched_flags = dl_se->flags;
 }
 
 /*
  * This function validates the new parameters of a -deadline task.
  * We ask for the deadline not being zero, and greater or equal
- * than the runtime.
+ * than the runtime, as well as the period being either zero or
+ * greater than or equal to the deadline.
  */
 static bool
 __checkparam_dl(const struct sched_param2 *prm)
 {
-	return prm && (&prm->sched_deadline) != 0 &&
-	       (s64)(&prm->sched_deadline - &prm->sched_runtime) >= 0;
+	return prm && prm->sched_deadline != 0 &&
+	       (prm->sched_period == 0 ||
+		(s64)(prm->sched_period - prm->sched_deadline) >= 0) &&
+	       (s64)(prm->sched_deadline - prm->sched_runtime) >= 0;
 }
 
 /*
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index 11c9b8d..8682ee2 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -266,7 +266,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 	 * arbitrary large.
 	 */
 	while (dl_se->runtime <= 0) {
-		dl_se->deadline += dl_se->dl_deadline;
+		dl_se->deadline += dl_se->dl_period;
 		dl_se->runtime += dl_se->dl_runtime;
 	}
 
@@ -293,7 +293,11 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
  * assigned (function returns true if it cannot).
  *
  * For this to hold, we must check if:
- *   runtime / (deadline - t) < dl_runtime / dl_deadline .
+ *   runtime / (deadline - t) < dl_runtime / dl_period .
+ *
+ * Notice that the bandwidth check is done against the period. For
+ * tasks with deadline equal to period this is the same as using
+ * dl_deadline instead of dl_period in the equation above.
  */
 static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
 {
@@ -312,7 +316,7 @@ static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
 	 * to the (absolute) deadline. Therefore, overflowing the u64
 	 * type is very unlikely to occur in both cases.
 	 */
-	left = dl_se->dl_deadline * dl_se->runtime;
+	left = dl_se->dl_period * dl_se->runtime;
 	right = (dl_se->deadline - t) * dl_se->dl_runtime;
 
 	return dl_time_before(right, left);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 09/16] sched: add schedstats for -deadline tasks.
  2012-04-06  7:14 [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4 Juri Lelli
                   ` (7 preceding siblings ...)
  2012-04-06  7:14 ` [PATCH 08/16] sched: add period support for -deadline tasks Juri Lelli
@ 2012-04-06  7:14 ` Juri Lelli
  2012-04-06  7:14 ` [PATCH 10/16] sched: add resource limits " Juri Lelli
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-06  7:14 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, cfriesen, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang

From: Dario Faggioli <raistlin@linux.it>

Add some typical sched-debug output to dl_rq(s) and some
schedstats to -deadline tasks.
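
With this applied, the dl_rq section of the sched_debug output looks
roughly like the following (field names come from print_dl_rq() below;
the numbers are made up, for illustration only):

dl_rq[0]:
  .dl_nr_running                 : 1
  .exec_clock                    : 42.783211
  .min_deadline                  : 932017.261500
  .max_deadline                  : 932019.427800

and the per-task dl.stats.* values (last/maximum deadline miss and
runtime overrun) show up in /proc/<pid>/sched for -deadline tasks.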

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
---
 include/linux/sched.h |   13 +++++++++++++
 kernel/sched.c        |    2 ++
 kernel/sched_debug.c  |   43 +++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_dl.c     |   45 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 103 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5961592..23cca57 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1267,6 +1267,15 @@ struct sched_rt_entity {
 #endif
 };
 
+#ifdef CONFIG_SCHEDSTATS
+struct sched_stats_dl {
+	u64			last_dmiss;
+	u64			last_rorun;
+	u64			dmiss_max;
+	u64			rorun_max;
+};
+#endif
+
 struct sched_dl_entity {
 	struct rb_node	rb_node;
 	int nr_cpus_allowed;
@@ -1307,6 +1316,10 @@ struct sched_dl_entity {
 	 * own bandwidth to be enforced, thus we need one timer per task.
 	 */
 	struct hrtimer dl_timer;
+
+#ifdef CONFIG_SCHEDSTATS
+	struct sched_stats_dl stats;
+#endif
 };
 
 struct rcu_node;
diff --git a/kernel/sched.c b/kernel/sched.c
index 9461958..ea4787a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -569,6 +569,8 @@ struct dl_rq {
 
 	unsigned long dl_nr_running;
 
+	u64 exec_clock;
+
 #ifdef CONFIG_SMP
 	/*
 	 * Deadline values of the currently executing and the
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 528032b..8979fa3 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -243,6 +243,42 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
 #undef P
 }
 
+void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq)
+{
+	s64 min_deadline = -1, max_deadline = -1;
+	struct rq *rq = cpu_rq(cpu);
+	struct sched_dl_entity *last;
+	unsigned long flags;
+
+	SEQ_printf(m, "\ndl_rq[%d]:\n", cpu);
+
+	raw_spin_lock_irqsave(&rq->lock, flags);
+	if (dl_rq->rb_leftmost)
+		min_deadline = (rb_entry(dl_rq->rb_leftmost,
+					 struct sched_dl_entity,
+					 rb_node))->deadline;
+	last = __pick_dl_last_entity(dl_rq);
+	if (last)
+		max_deadline = last->deadline;
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
+
+#define P(x) \
+	SEQ_printf(m, "  .%-30s: %Ld\n", #x, (long long)(dl_rq->x))
+#define __PN(x) \
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", #x, SPLIT_NS(x))
+#define PN(x) \
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", #x, SPLIT_NS(dl_rq->x))
+
+	P(dl_nr_running);
+	PN(exec_clock);
+	__PN(min_deadline);
+	__PN(max_deadline);
+
+#undef PN
+#undef __PN
+#undef P
+}
+
 extern __read_mostly int sched_clock_running;
 
 static void print_cpu(struct seq_file *m, int cpu)
@@ -305,6 +341,7 @@ static void print_cpu(struct seq_file *m, int cpu)
 	spin_lock_irqsave(&sched_debug_lock, flags);
 	print_cfs_stats(m, cpu);
 	print_rt_stats(m, cpu);
+	print_dl_stats(m, cpu);
 
 	rcu_read_lock();
 	print_rq(m, rq, cpu);
@@ -456,6 +493,12 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 	P(se.statistics.nr_wakeups_affine_attempts);
 	P(se.statistics.nr_wakeups_passive);
 	P(se.statistics.nr_wakeups_idle);
+	if (dl_task(p)) {
+		PN(dl.stats.last_dmiss);
+		PN(dl.stats.dmiss_max);
+		PN(dl.stats.last_rorun);
+		PN(dl.stats.rorun_max);
+	}
 
 	{
 		u64 avg_atom, avg_per_cpu;
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index 8682ee2..05140bb 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -476,6 +476,25 @@ int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
 		return 0;
 
 	/*
+	 * Record statistics about last and maximum deadline
+	 * misses and runtime overruns.
+	 */
+	if (dmiss) {
+		u64 damount = rq->clock - dl_se->deadline;
+
+		schedstat_set(dl_se->stats.last_dmiss, damount);
+		schedstat_set(dl_se->stats.dmiss_max,
+			      max(dl_se->stats.dmiss_max, damount));
+	}
+	if (rorun) {
+		u64 ramount = -dl_se->runtime;
+
+		schedstat_set(dl_se->stats.last_rorun, ramount);
+		schedstat_set(dl_se->stats.rorun_max,
+			      max(dl_se->stats.rorun_max, ramount));
+	}
+
+	/*
 	 * If we are beyond our current deadline and we are still
 	 * executing, then we have already used some of the runtime of
 	 * the next instance. Thus, if we do not account that, we are
@@ -510,6 +529,7 @@ static void update_curr_dl(struct rq *rq)
 		      max(curr->se.statistics.exec_max, delta_exec));
 
 	curr->se.sum_exec_runtime += delta_exec;
+	schedstat_add(&rq->dl, exec_clock, delta_exec);
 	account_group_exec_runtime(curr, delta_exec);
 
 	curr->se.exec_start = rq->clock;
@@ -855,6 +875,18 @@ static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
 }
 #endif
 
+#ifdef CONFIG_SCHED_DEBUG
+static struct sched_dl_entity *__pick_dl_last_entity(struct dl_rq *dl_rq)
+{
+	struct rb_node *last = rb_last(&dl_rq->rb_root);
+
+	if (!last)
+		return NULL;
+
+	return rb_entry(last, struct sched_dl_entity, rb_node);
+}
+#endif /* CONFIG_SCHED_DEBUG */
+
 static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
 						   struct dl_rq *dl_rq)
 {
@@ -1523,3 +1555,16 @@ static const struct sched_class dl_sched_class = {
 	.switched_from		= switched_from_dl,
 	.switched_to		= switched_to_dl,
 };
+
+#ifdef CONFIG_SCHED_DEBUG
+extern void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq);
+
+static void print_dl_stats(struct seq_file *m, int cpu)
+{
+	struct dl_rq *dl_rq = &cpu_rq(cpu)->dl;
+
+	rcu_read_lock();
+	print_dl_rq(m, cpu, dl_rq);
+	rcu_read_unlock();
+}
+#endif /* CONFIG_SCHED_DEBUG */
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 10/16] sched: add resource limits for -deadline tasks.
  2012-04-06  7:14 [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4 Juri Lelli
                   ` (8 preceding siblings ...)
  2012-04-06  7:14 ` [PATCH 09/16] sched: add schedstats " Juri Lelli
@ 2012-04-06  7:14 ` Juri Lelli
  2012-04-24 15:07   ` Peter Zijlstra
  2012-04-06  7:14 ` [PATCH 11/16] sched: add latency tracing " Juri Lelli
                   ` (9 subsequent siblings)
  19 siblings, 1 reply; 129+ messages in thread
From: Juri Lelli @ 2012-04-06  7:14 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, cfriesen, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang

From: Dario Faggioli <raistlin@linux.it>

Add resource limits for non-root tasks using the SCHED_DEADLINE
policy, very similarly to what already exists for RT policies.

In fact, this patch:
 - adds the resource limit RLIMIT_DLDLINE, which is the minimum value
   a user task can use as its own deadline;
 - adds the resource limit RLIMIT_DLRTIME, which is the maximum value
   a user task can use as its own runtime.

Notice that to exploit these, a modified version of the ulimit
utility and a modified resource.h header file are needed. They
both will be available on the website of the project.
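
As a sketch of the intended usage (illustrative only, error handling
omitted; RLIMIT_DLDLINE and RLIMIT_DLRTIME come from the modified
resource.h mentioned above, and both limits are expressed in
microseconds), a privileged launcher could set the limits before
dropping privileges:

#include <sys/resource.h>

static void allow_limited_dl(void)
{
	/* Unprivileged -deadline requests: deadline >= 10 ms... */
	struct rlimit dline = { .rlim_cur = 10000, .rlim_max = 10000 };
	/* ...and runtime <= 2 ms. */
	struct rlimit rtime = { .rlim_cur = 2000, .rlim_max = 2000 };

	setrlimit(RLIMIT_DLDLINE, &dline);
	setrlimit(RLIMIT_DLRTIME, &rtime);
}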

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
---
 include/asm-generic/resource.h |    7 ++++++-
 kernel/sched.c                 |   25 +++++++++++++++++++++++++
 2 files changed, 31 insertions(+), 1 deletions(-)

diff --git a/include/asm-generic/resource.h b/include/asm-generic/resource.h
index 61fa862..a682848 100644
--- a/include/asm-generic/resource.h
+++ b/include/asm-generic/resource.h
@@ -45,7 +45,10 @@
 					   0-39 for nice level 19 .. -20 */
 #define RLIMIT_RTPRIO		14	/* maximum realtime priority */
 #define RLIMIT_RTTIME		15	/* timeout for RT tasks in us */
-#define RLIM_NLIMITS		16
+
+#define RLIMIT_DLDLINE		16	/* minimum deadline in us */
+#define RLIMIT_DLRTIME		17	/* maximum runtime in us */
+#define RLIM_NLIMITS		18
 
 /*
  * SuS says limits have to be unsigned.
@@ -87,6 +90,8 @@
 	[RLIMIT_NICE]		= { 0, 0 },				\
 	[RLIMIT_RTPRIO]		= { 0, 0 },				\
 	[RLIMIT_RTTIME]		= {  RLIM_INFINITY,  RLIM_INFINITY },	\
+	[RLIMIT_DLDLINE]	= { ULONG_MAX, ULONG_MAX },		\
+	[RLIMIT_DLRTIME]	= { 0, 0 },				\
 }
 
 #endif	/* __KERNEL__ */
diff --git a/kernel/sched.c b/kernel/sched.c
index ea4787a..92d5e26 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5593,6 +5593,31 @@ recheck:
 	 * Allow unprivileged RT tasks to decrease priority:
 	 */
 	if (user && !capable(CAP_SYS_NICE)) {
+		if (dl_policy(policy)) {
+			u64 rlim_dline, rlim_rtime;
+			u64 dline, rtime;
+
+			if (!lock_task_sighand(p, &flags))
+				return -ESRCH;
+			rlim_dline = p->signal->rlim[RLIMIT_DLDLINE].rlim_cur;
+			rlim_rtime = p->signal->rlim[RLIMIT_DLRTIME].rlim_cur;
+			unlock_task_sighand(p, &flags);
+
+			/* can't set/change -deadline policy */
+			if (policy != p->policy && !rlim_rtime)
+				return -EPERM;
+
+			/* can't decrease the deadline */
+			rlim_dline *= NSEC_PER_USEC;
+			dline = param->sched_deadline;
+			if (dline < p->dl.dl_deadline && dline < rlim_dline)
+				return -EPERM;
+			/* can't increase the runtime */
+			rlim_rtime *= NSEC_PER_USEC;
+			rtime = param->sched_runtime;
+			if (rtime > p->dl.dl_runtime && rtime > rlim_rtime)
+				return -EPERM;
+		}
 		if (rt_policy(policy)) {
 			unsigned long rlim_rtprio =
 					task_rlimit(p, RLIMIT_RTPRIO);
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 11/16] sched: add latency tracing for -deadline tasks.
  2012-04-06  7:14 [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4 Juri Lelli
                   ` (9 preceding siblings ...)
  2012-04-06  7:14 ` [PATCH 10/16] sched: add resource limits " Juri Lelli
@ 2012-04-06  7:14 ` Juri Lelli
  2012-04-11 21:03   ` Steven Rostedt
  2012-04-06  7:14 ` [PATCH 12/16] rtmutex: turn the plist into an rb-tree Juri Lelli
                   ` (8 subsequent siblings)
  19 siblings, 1 reply; 129+ messages in thread
From: Juri Lelli @ 2012-04-06  7:14 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, cfriesen, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang

From: Dario Faggioli <raistlin@linux.it>

It is very likely that systems that want/need to use the new
SCHED_DEADLINE policy also want to keep the scheduling latency of
the -deadline tasks under control.

For this reason a new version of the wakeup latency tracer,
called "wakeup_dl", is introduced.

As a consequence of applying this patch there will be three wakeup
latency tracers:
 * "wakeup", that deals with all tasks in the system;
 * "wakeup_rt", that deals with -rt and -deadline tasks only;
 * "wakeup_dl", that deals with -deadline tasks only.
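
As a usage sketch (not part of the patch), the new tracer is selected
like the existing ones through the tracing debugfs files; the snippet
below assumes debugfs mounted at /sys/kernel/debug and
CONFIG_SCHED_TRACER enabled:

    #include <stdio.h>

    #define TRACE_DIR "/sys/kernel/debug/tracing/"

    int main(void)
    {
            char buf[64];
            FILE *f = fopen(TRACE_DIR "current_tracer", "w");

            if (!f)
                    return 1;
            fputs("wakeup_dl\n", f);
            fclose(f);

            /* ...run the -deadline workload of interest, then read the result */
            f = fopen(TRACE_DIR "tracing_max_latency", "r");
            if (f && fgets(buf, sizeof(buf), f))
                    printf("max -deadline wakeup latency (us): %s", buf);
            if (f)
                    fclose(f);

            return 0;
    }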

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
---
 kernel/trace/trace_sched_wakeup.c |   41 ++++++++++++++++++++++++++++++++++++-
 kernel/trace/trace_selftest.c     |   30 ++++++++++++++++----------
 2 files changed, 58 insertions(+), 13 deletions(-)

diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
index e4a70c0..9c9b1be 100644
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -27,6 +27,7 @@ static int			wakeup_cpu;
 static int			wakeup_current_cpu;
 static unsigned			wakeup_prio = -1;
 static int			wakeup_rt;
+static int			wakeup_dl;
 
 static arch_spinlock_t wakeup_lock =
 	(arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
@@ -420,6 +421,17 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
 	if ((wakeup_rt && !rt_task(p)) ||
 			p->prio >= wakeup_prio ||
 			p->prio >= current->prio)
+	/*
+	 * The semantics are as follows:
+	 *  - wakeup tracer handles all tasks in the system, independently
+	 *    from their scheduling class;
+	 *  - wakeup_rt tracer handles tasks belonging to sched_dl and
+	 *    sched_rt class;
+	 *  - wakeup_dl handles tasks belonging to sched_dl class only.
+	 */
+	if ((wakeup_dl && !dl_task(p)) ||
+	    (wakeup_rt && !dl_task(p) && !rt_task(p)) ||
+	    (p->prio >= wakeup_prio || p->prio >= current->prio))
 		return;
 
 	pc = preempt_count();
@@ -431,7 +443,7 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
 	arch_spin_lock(&wakeup_lock);
 
 	/* check for races. */
-	if (!tracer_enabled || p->prio >= wakeup_prio)
+	if (!tracer_enabled || (!dl_task(p) && p->prio >= wakeup_prio))
 		goto out_locked;
 
 	/* reset the trace */
@@ -539,16 +551,25 @@ static int __wakeup_tracer_init(struct trace_array *tr)
 
 static int wakeup_tracer_init(struct trace_array *tr)
 {
+	wakeup_dl = 0;
 	wakeup_rt = 0;
 	return __wakeup_tracer_init(tr);
 }
 
 static int wakeup_rt_tracer_init(struct trace_array *tr)
 {
+	wakeup_dl = 0;
 	wakeup_rt = 1;
 	return __wakeup_tracer_init(tr);
 }
 
+static int wakeup_dl_tracer_init(struct trace_array *tr)
+{
+	wakeup_dl = 1;
+	wakeup_rt = 0;
+	return __wakeup_tracer_init(tr);
+}
+
 static void wakeup_tracer_reset(struct trace_array *tr)
 {
 	stop_wakeup_tracer(tr);
@@ -611,6 +632,20 @@ static struct tracer wakeup_rt_tracer __read_mostly =
 	.use_max_tr	= 1,
 };
 
+static struct tracer wakeup_dl_tracer __read_mostly =
+{
+	.name		= "wakeup_dl",
+	.init		= wakeup_dl_tracer_init,
+	.reset		= wakeup_tracer_reset,
+	.start		= wakeup_tracer_start,
+	.stop		= wakeup_tracer_stop,
+	.wait_pipe	= poll_wait_pipe,
+	.print_max	= 1,
+#ifdef CONFIG_FTRACE_SELFTEST
+	.selftest    = trace_selftest_startup_wakeup,
+#endif
+};
+
 __init static int init_wakeup_tracer(void)
 {
 	int ret;
@@ -623,6 +658,10 @@ __init static int init_wakeup_tracer(void)
 	if (ret)
 		return ret;
 
+	ret = register_tracer(&wakeup_dl_tracer);
+	if (ret)
+		return ret;
+
 	return 0;
 }
 device_initcall(init_wakeup_tracer);
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index 288541f..849063e 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -765,11 +765,17 @@ trace_selftest_startup_nop(struct tracer *trace, struct trace_array *tr)
 #ifdef CONFIG_SCHED_TRACER
 static int trace_wakeup_test_thread(void *data)
 {
-	/* Make this a RT thread, doesn't need to be too high */
-	static const struct sched_param param = { .sched_priority = 5 };
+	/* Make this a -deadline thread */
+	struct sched_param2 paramx = {
+		.sched_priority = 0,
+		.sched_runtime = 100000ULL,
+		.sched_deadline = 10000000ULL,
+		.sched_period = 10000000ULL,
+		.sched_flags = 0
+	};
 	struct completion *x = data;
 
-	sched_setscheduler(current, SCHED_FIFO, &param);
+	sched_setscheduler2(current, SCHED_DEADLINE, &paramx);
 
 	/* Make it know we have a new prio */
 	complete(x);
@@ -781,8 +787,8 @@ static int trace_wakeup_test_thread(void *data)
 	/* we are awake, now wait to disappear */
 	while (!kthread_should_stop()) {
 		/*
-		 * This is an RT task, do short sleeps to let
-		 * others run.
+		 * This will likely be the system top priority
+		 * task, do short sleeps to let others run.
 		 */
 		msleep(100);
 	}
@@ -795,21 +801,21 @@ trace_selftest_startup_wakeup(struct tracer *trace, struct trace_array *tr)
 {
 	unsigned long save_max = tracing_max_latency;
 	struct task_struct *p;
-	struct completion isrt;
+	struct completion is_ready;
 	unsigned long count;
 	int ret;
 
-	init_completion(&isrt);
+	init_completion(&is_ready);
 
-	/* create a high prio thread */
-	p = kthread_run(trace_wakeup_test_thread, &isrt, "ftrace-test");
+	/* create a -deadline thread */
+	p = kthread_run(trace_wakeup_test_thread, &is_ready, "ftrace-test");
 	if (IS_ERR(p)) {
 		printk(KERN_CONT "Failed to create ftrace wakeup test thread ");
 		return -1;
 	}
 
-	/* make sure the thread is running at an RT prio */
-	wait_for_completion(&isrt);
+	/* make sure the thread is running at -deadline policy */
+	wait_for_completion(&is_ready);
 
 	/* start the tracing */
 	ret = tracer_init(trace, tr);
@@ -821,7 +827,7 @@ trace_selftest_startup_wakeup(struct tracer *trace, struct trace_array *tr)
 	/* reset the max latency */
 	tracing_max_latency = 0;
 
-	/* sleep to let the RT thread sleep too */
+	/* sleep to let the thread sleep too */
 	msleep(100);
 
 	/*
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 12/16] rtmutex: turn the plist into an rb-tree.
  2012-04-06  7:14 [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4 Juri Lelli
                   ` (10 preceding siblings ...)
  2012-04-06  7:14 ` [PATCH 11/16] sched: add latency tracing " Juri Lelli
@ 2012-04-06  7:14 ` Juri Lelli
  2012-04-11 21:11   ` Steven Rostedt
  2012-04-06  7:14 ` [PATCH 13/16] sched: drafted deadline inheritance logic Juri Lelli
                   ` (7 subsequent siblings)
  19 siblings, 1 reply; 129+ messages in thread
From: Juri Lelli @ 2012-04-06  7:14 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, cfriesen, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang

From: Peter Zijlstra <peterz@infradead.org>

Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
and provide a proper comparison function for -deadline and
-priority tasks.

This is done mainly because:
 - the classical prio field of the plist is just an int, which might
   not be enough to represent a deadline;
 - manipulating such a list would become O(nr_deadline_tasks),
   which might be too much, as the number of -deadline tasks increases.

Therefore, an rb-tree is used, and tasks are queued in it according
to the following logic:
 - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
   one with the higher (lower, actually!) prio wins;
 - among a -priority and a -deadline task, the latter always wins;
 - among two -deadline tasks, the one with the earliest deadline
   wins.

Queueing and dequeueing functions are changed accordingly, for both
the list of a task's pi-waiters and the list of tasks blocked on
a pi-lock.
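
The ordering above boils down to a comparison like the stand-alone
sketch below (types and the prio encoding -- -deadline modeled as
prio -1 -- are simplified for illustration; in the patch this is
rt_mutex_waiter_less() working on task->prio and dl.deadline):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Simplified waiter: prio mirrors task->prio (-deadline tasks sit below
     * every -rt priority), deadline is the absolute deadline of a -dl task. */
    struct waiter {
            int prio;
            bool is_dl;
            uint64_t deadline;
    };

    /* returns true if 'left' must be queued before 'right' */
    static bool waiter_less(const struct waiter *left, const struct waiter *right)
    {
            if (left->prio < right->prio)
                    return true;

            /* both -deadline: the earlier absolute deadline wins */
            if (left->is_dl && right->is_dl)
                    return left->deadline < right->deadline;

            return false;
    }

    int main(void)
    {
            struct waiter rt = { .prio = 10, .is_dl = false };
            struct waiter d1 = { .prio = -1, .is_dl = true, .deadline = 100 };
            struct waiter d2 = { .prio = -1, .is_dl = true, .deadline = 50 };

            printf("d1 before rt: %d, d2 before d1: %d\n",
                   waiter_less(&d1, &rt), waiter_less(&d2, &d1));  /* 1, 1 */
            return 0;
    }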

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
---
 include/linux/init_task.h |   10 +++
 include/linux/rtmutex.h   |   18 ++----
 include/linux/sched.h     |    4 +-
 kernel/fork.c             |    3 +-
 kernel/rtmutex-debug.c    |   10 +--
 kernel/rtmutex.c          |  151 +++++++++++++++++++++++++++++++++++----------
 kernel/rtmutex_common.h   |   24 ++++---
 kernel/sched.c            |    4 -
 8 files changed, 157 insertions(+), 67 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index cfd9f8d..7a2da88 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -10,6 +10,7 @@
 #include <linux/pid_namespace.h>
 #include <linux/user_namespace.h>
 #include <linux/securebits.h>
+#include <linux/rbtree.h>
 #include <net/net_namespace.h>
 
 #ifdef CONFIG_SMP
@@ -134,6 +135,14 @@ extern struct cred init_cred;
 
 #define INIT_TASK_COMM "swapper"
 
+#ifdef CONFIG_RT_MUTEXES
+# define INIT_RT_MUTEXES						\
+	.pi_waiters = RB_ROOT,						\
+	.pi_waiters_leftmost = NULL,
+#else
+# define INIT_RT_MUTEXES
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -200,6 +209,7 @@ extern struct cred init_cred;
 	INIT_FTRACE_GRAPH						\
 	INIT_TRACE_RECURSION						\
 	INIT_TASK_RCU_PREEMPT(tsk)					\
+	INIT_RT_MUTEXES							\
 }
 
 
diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index 5ebd0bb..2344d71 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -13,7 +13,7 @@
 #define __LINUX_RT_MUTEX_H
 
 #include <linux/linkage.h>
-#include <linux/plist.h>
+#include <linux/rbtree.h>
 #include <linux/spinlock_types_raw.h>
 
 extern int max_lock_depth; /* for sysctl */
@@ -22,12 +22,14 @@ extern int max_lock_depth; /* for sysctl */
  * The rt_mutex structure
  *
  * @wait_lock:	spinlock to protect the structure
- * @wait_list:	pilist head to enqueue waiters in priority order
+ * @waiters:	rbtree root to enqueue waiters in priority order
+ * @waiters_leftmost: top waiter
  * @owner:	the mutex owner
  */
 struct rt_mutex {
 	raw_spinlock_t		wait_lock;
-	struct plist_head	wait_list;
+	struct rb_root          waiters;
+	struct rb_node          *waiters_leftmost;
 	struct task_struct	*owner;
 	int			save_state;
 #ifdef CONFIG_DEBUG_RT_MUTEXES
@@ -79,7 +81,7 @@ struct hrtimer_sleeper;
 
 #define __RT_MUTEX_INITIALIZER_PLAIN(mutexname) \
 	.wait_lock = __RAW_SPIN_LOCK_UNLOCKED(mutexname.wait_lock) \
-	, .wait_list = PLIST_HEAD_INIT(mutexname.wait_list) \
+	, .waiters = RB_ROOT \
 	, .owner = NULL \
 	__DEBUG_RT_MUTEX_INITIALIZER(mutexname)
 
@@ -120,12 +122,4 @@ extern int rt_mutex_trylock(struct rt_mutex *lock);
 
 extern void rt_mutex_unlock(struct rt_mutex *lock);
 
-#ifdef CONFIG_RT_MUTEXES
-# define INIT_RT_MUTEXES(tsk)						\
-	.pi_waiters	= PLIST_HEAD_INIT(tsk.pi_waiters),	\
-	INIT_RT_MUTEX_DEBUG(tsk)
-#else
-# define INIT_RT_MUTEXES(tsk)
-#endif
-
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 23cca57..5ef7bb6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -57,6 +57,7 @@ struct sched_param {
 #include <linux/types.h>
 #include <linux/timex.h>
 #include <linux/jiffies.h>
+#include <linux/plist.h>
 #include <linux/rbtree.h>
 #include <linux/thread_info.h>
 #include <linux/cpumask.h>
@@ -1551,7 +1552,8 @@ struct task_struct {
 
 #ifdef CONFIG_RT_MUTEXES
 	/* PI waiters blocked on a rt_mutex held by this task */
-	struct plist_head pi_waiters;
+	struct rb_root pi_waiters;
+	struct rb_node *pi_waiters_leftmost;
 	/* Deadlock detection and priority inheritance handling */
 	struct rt_mutex_waiter *pi_blocked_on;
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index b263c69..344e0b4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1033,7 +1033,8 @@ static void rt_mutex_init_task(struct task_struct *p)
 {
 	raw_spin_lock_init(&p->pi_lock);
 #ifdef CONFIG_RT_MUTEXES
-	plist_head_init(&p->pi_waiters);
+	p->pi_waiters = RB_ROOT;
+	p->pi_waiters_leftmost = NULL;
 	p->pi_blocked_on = NULL;
 #endif
 }
diff --git a/kernel/rtmutex-debug.c b/kernel/rtmutex-debug.c
index 8eafd1b..e80670d 100644
--- a/kernel/rtmutex-debug.c
+++ b/kernel/rtmutex-debug.c
@@ -23,7 +23,7 @@
 #include <linux/kallsyms.h>
 #include <linux/syscalls.h>
 #include <linux/interrupt.h>
-#include <linux/plist.h>
+#include <linux/rbtree.h>
 #include <linux/fs.h>
 #include <linux/debug_locks.h>
 
@@ -56,7 +56,7 @@ static void printk_lock(struct rt_mutex *lock, int print_owner)
 
 void rt_mutex_debug_task_free(struct task_struct *task)
 {
-	DEBUG_LOCKS_WARN_ON(!plist_head_empty(&task->pi_waiters));
+	DEBUG_LOCKS_WARN_ON(!RB_EMPTY_ROOT(&task->pi_waiters));
 	DEBUG_LOCKS_WARN_ON(task->pi_blocked_on);
 }
 
@@ -152,16 +152,14 @@ void debug_rt_mutex_proxy_unlock(struct rt_mutex *lock)
 void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
 {
 	memset(waiter, 0x11, sizeof(*waiter));
-	plist_node_init(&waiter->list_entry, MAX_PRIO);
-	plist_node_init(&waiter->pi_list_entry, MAX_PRIO);
+	RB_CLEAR_NODE(&waiter->pi_tree_entry);
+	RB_CLEAR_NODE(&waiter->tree_entry);
 	waiter->deadlock_task_pid = NULL;
 }
 
 void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter)
 {
 	put_pid(waiter->deadlock_task_pid);
-	DEBUG_LOCKS_WARN_ON(!plist_node_empty(&waiter->list_entry));
-	DEBUG_LOCKS_WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
 	memset(waiter, 0x22, sizeof(*waiter));
 }
 
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index b525158..6234c0e 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -102,16 +102,110 @@ static inline void mark_rt_mutex_waiters(struct rt_mutex *lock)
 }
 #endif
 
+static inline int
+rt_mutex_waiter_less(struct rt_mutex_waiter *left,
+		     struct rt_mutex_waiter *right)
+{
+	if (left->task->prio < right->task->prio)
+		return 1;
+
+	/*
+	 * If both tasks are dl_task(), we check their deadlines.
+	 */
+	if (dl_prio(left->task->prio) && dl_prio(right->task->prio))
+		return (left->task->dl.deadline < right->task->dl.deadline);
+
+	return 0;
+}
+
+static void
+rt_mutex_enqueue(struct rt_mutex *lock, struct rt_mutex_waiter *waiter)
+{
+	struct rb_node **link = &lock->waiters.rb_node;
+	struct rb_node *parent = NULL;
+	struct rt_mutex_waiter *entry;
+	int leftmost = 1;
+
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct rt_mutex_waiter, tree_entry);
+		if (rt_mutex_waiter_less(waiter, entry)) {
+			link = &parent->rb_left;
+		} else {
+			link = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	if (leftmost)
+		lock->waiters_leftmost = &waiter->tree_entry;
+
+	rb_link_node(&waiter->tree_entry, parent, link);
+	rb_insert_color(&waiter->tree_entry, &lock->waiters);
+}
+
+static void
+rt_mutex_dequeue(struct rt_mutex *lock, struct rt_mutex_waiter *waiter)
+{
+	if (RB_EMPTY_NODE(&waiter->tree_entry))
+		return;
+
+	if (lock->waiters_leftmost == &waiter->tree_entry)
+		lock->waiters_leftmost = rb_next(&waiter->tree_entry);
+
+	rb_erase(&waiter->tree_entry, &lock->waiters);
+	RB_CLEAR_NODE(&waiter->tree_entry);
+}
+
+static void
+rt_mutex_enqueue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
+{
+	struct rb_node **link = &task->pi_waiters.rb_node;
+	struct rb_node *parent = NULL;
+	struct rt_mutex_waiter *entry;
+	int leftmost = 1;
+
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct rt_mutex_waiter, pi_tree_entry);
+		if (rt_mutex_waiter_less(waiter, entry)) {
+			link = &parent->rb_left;
+		} else {
+			link = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	if (leftmost)
+		task->pi_waiters_leftmost = &waiter->pi_tree_entry;
+
+	rb_link_node(&waiter->pi_tree_entry, parent, link);
+	rb_insert_color(&waiter->pi_tree_entry, &task->pi_waiters);
+}
+
+static void
+rt_mutex_dequeue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
+{
+	if (RB_EMPTY_NODE(&waiter->pi_tree_entry))
+		return;
+
+	if (task->pi_waiters_leftmost == &waiter->pi_tree_entry)
+		task->pi_waiters_leftmost = rb_next(&waiter->pi_tree_entry);
+
+	rb_erase(&waiter->pi_tree_entry, &task->pi_waiters);
+	RB_CLEAR_NODE(&waiter->pi_tree_entry);
+}
+
 static inline void init_lists(struct rt_mutex *lock)
 {
-	if (unlikely(!lock->wait_list.node_list.prev))
-		plist_head_init(&lock->wait_list);
+/*	if (unlikely(!lock->wait_list.node_list.prev))
+		plist_head_init(&lock->wait_list);*/
 }
 
 /*
- * Calculate task priority from the waiter list priority
+ * Calculate task priority from the waiter tree priority
  *
- * Return task->normal_prio when the waiter list is empty or when
+ * Return task->normal_prio when the waiter tree is empty or when
  * the waiter is not allowed to do priority boosting
  */
 int rt_mutex_getprio(struct task_struct *task)
@@ -119,7 +213,7 @@ int rt_mutex_getprio(struct task_struct *task)
 	if (likely(!task_has_pi_waiters(task)))
 		return task->normal_prio;
 
-	return min(task_top_pi_waiter(task)->pi_list_entry.prio,
+	return min(task_top_pi_waiter(task)->task->prio,
 		   task->normal_prio);
 }
 
@@ -245,7 +339,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 	 * When deadlock detection is off then we check, if further
 	 * priority adjustment is necessary.
 	 */
-	if (!detect_deadlock && waiter->list_entry.prio == task->prio)
+	if (!detect_deadlock && waiter->task->prio == task->prio)
 		goto out_unlock_pi;
 
 	lock = waiter->lock;
@@ -266,9 +360,9 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 	top_waiter = rt_mutex_top_waiter(lock);
 
 	/* Requeue the waiter */
-	plist_del(&waiter->list_entry, &lock->wait_list);
-	waiter->list_entry.prio = task->prio;
-	plist_add(&waiter->list_entry, &lock->wait_list);
+	rt_mutex_dequeue(lock, waiter);
+	waiter->task->prio = task->prio;
+	rt_mutex_enqueue(lock, waiter);
 
 	/* Release the task */
 	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
@@ -294,17 +388,15 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 
 	if (waiter == rt_mutex_top_waiter(lock)) {
 		/* Boost the owner */
-		plist_del(&top_waiter->pi_list_entry, &task->pi_waiters);
-		waiter->pi_list_entry.prio = waiter->list_entry.prio;
-		plist_add(&waiter->pi_list_entry, &task->pi_waiters);
+		rt_mutex_dequeue_pi(task, top_waiter);
+		rt_mutex_enqueue_pi(task, waiter);
 		__rt_mutex_adjust_prio(task);
 
 	} else if (top_waiter == waiter) {
 		/* Deboost the owner */
-		plist_del(&waiter->pi_list_entry, &task->pi_waiters);
+		rt_mutex_dequeue_pi(task, waiter);
 		waiter = rt_mutex_top_waiter(lock);
-		waiter->pi_list_entry.prio = waiter->list_entry.prio;
-		plist_add(&waiter->pi_list_entry, &task->pi_waiters);
+		rt_mutex_enqueue_pi(task, waiter);
 		__rt_mutex_adjust_prio(task);
 	}
 
@@ -405,7 +497,7 @@ __try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 
 		/* remove the queued waiter. */
 		if (waiter) {
-			plist_del(&waiter->list_entry, &lock->wait_list);
+			rt_mutex_dequeue(lock, waiter);
 			task->pi_blocked_on = NULL;
 		}
 
@@ -415,8 +507,7 @@ __try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 		 */
 		if (rt_mutex_has_waiters(lock)) {
 			top = rt_mutex_top_waiter(lock);
-			top->pi_list_entry.prio = top->list_entry.prio;
-			plist_add(&top->pi_list_entry, &task->pi_waiters);
+			rt_mutex_enqueue_pi(task, top);
 		}
 		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 	}
@@ -475,13 +566,11 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 	__rt_mutex_adjust_prio(task);
 	waiter->task = task;
 	waiter->lock = lock;
-	plist_node_init(&waiter->list_entry, task->prio);
-	plist_node_init(&waiter->pi_list_entry, task->prio);
 
 	/* Get the top priority waiter on the lock */
 	if (rt_mutex_has_waiters(lock))
 		top_waiter = rt_mutex_top_waiter(lock);
-	plist_add(&waiter->list_entry, &lock->wait_list);
+	rt_mutex_enqueue(lock, waiter);
 
 	task->pi_blocked_on = waiter;
 
@@ -492,8 +581,8 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 
 	if (waiter == rt_mutex_top_waiter(lock)) {
 		raw_spin_lock_irqsave(&owner->pi_lock, flags);
-		plist_del(&top_waiter->pi_list_entry, &owner->pi_waiters);
-		plist_add(&waiter->pi_list_entry, &owner->pi_waiters);
+		rt_mutex_dequeue_pi(owner, top_waiter);
+		rt_mutex_enqueue_pi(owner, waiter);
 
 		__rt_mutex_adjust_prio(owner);
 		if (rt_mutex_real_waiter(owner->pi_blocked_on))
@@ -545,7 +634,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock)
 	 * boosted mode and go back to normal after releasing
 	 * lock->wait_lock.
 	 */
-	plist_del(&waiter->pi_list_entry, &current->pi_waiters);
+	rt_mutex_dequeue_pi(current, waiter);
 
 	rt_mutex_set_owner(lock, NULL);
 
@@ -569,7 +658,7 @@ static void remove_waiter(struct rt_mutex *lock,
 	int chain_walk = 0;
 
 	raw_spin_lock_irqsave(&current->pi_lock, flags);
-	plist_del(&waiter->list_entry, &lock->wait_list);
+	rt_mutex_dequeue(lock, waiter);
 	current->pi_blocked_on = NULL;
 	raw_spin_unlock_irqrestore(&current->pi_lock, flags);
 
@@ -580,13 +669,13 @@ static void remove_waiter(struct rt_mutex *lock,
 
 		raw_spin_lock_irqsave(&owner->pi_lock, flags);
 
-		plist_del(&waiter->pi_list_entry, &owner->pi_waiters);
+		rt_mutex_dequeue_pi(owner, waiter);
 
 		if (rt_mutex_has_waiters(lock)) {
 			struct rt_mutex_waiter *next;
 
 			next = rt_mutex_top_waiter(lock);
-			plist_add(&next->pi_list_entry, &owner->pi_waiters);
+			rt_mutex_enqueue_pi(owner, next);
 		}
 		__rt_mutex_adjust_prio(owner);
 
@@ -596,8 +685,6 @@ static void remove_waiter(struct rt_mutex *lock,
 		raw_spin_unlock_irqrestore(&owner->pi_lock, flags);
 	}
 
-	WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
-
 	if (!chain_walk)
 		return;
 
@@ -625,7 +712,7 @@ void rt_mutex_adjust_pi(struct task_struct *task)
 
 	waiter = task->pi_blocked_on;
 	if (!rt_mutex_real_waiter(waiter) ||
-	    waiter->list_entry.prio == task->prio) {
+	    waiter->task->prio == task->prio) {
 		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 		return;
 	}
@@ -779,7 +866,6 @@ static void  noinline __sched rt_spin_lock_slowlock(struct rt_mutex *lock)
 	fixup_rt_mutex_waiters(lock);
 
 	BUG_ON(rt_mutex_has_waiters(lock) && &waiter == rt_mutex_top_waiter(lock));
-	BUG_ON(!plist_node_empty(&waiter.list_entry));
 
 	raw_spin_unlock(&lock->wait_lock);
 
@@ -1286,7 +1372,8 @@ EXPORT_SYMBOL_GPL(rt_mutex_destroy);
 void __rt_mutex_init(struct rt_mutex *lock, const char *name)
 {
 	lock->owner = NULL;
-	plist_head_init(&lock->wait_list);
+	lock->waiters = RB_ROOT;
+	lock->waiters_leftmost = NULL;
 
 	debug_rt_mutex_init(lock, name);
 }
diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
index 6ec3dc1..aa2be42 100644
--- a/kernel/rtmutex_common.h
+++ b/kernel/rtmutex_common.h
@@ -40,13 +40,13 @@ extern void schedule_rt_mutex_test(struct rt_mutex *lock);
  * This is the control structure for tasks blocked on a rt_mutex,
  * which is allocated on the kernel stack on of the blocked task.
  *
- * @list_entry:		pi node to enqueue into the mutex waiters list
- * @pi_list_entry:	pi node to enqueue into the mutex owner waiters list
+ * @tree_entry:		pi node to enqueue into the mutex waiters tree
+ * @pi_tree_entry:	pi node to enqueue into the mutex owner waiters tree
  * @task:		task reference to the blocked task
  */
 struct rt_mutex_waiter {
-	struct plist_node	list_entry;
-	struct plist_node	pi_list_entry;
+	struct rb_node          tree_entry;
+	struct rb_node          pi_tree_entry;
 	struct task_struct	*task;
 	struct rt_mutex		*lock;
 	bool			savestate;
@@ -58,11 +58,11 @@ struct rt_mutex_waiter {
 };
 
 /*
- * Various helpers to access the waiters-plist:
+ * Various helpers to access the waiters-tree:
  */
 static inline int rt_mutex_has_waiters(struct rt_mutex *lock)
 {
-	return !plist_head_empty(&lock->wait_list);
+	return !RB_EMPTY_ROOT(&lock->waiters);
 }
 
 static inline struct rt_mutex_waiter *
@@ -70,8 +70,8 @@ rt_mutex_top_waiter(struct rt_mutex *lock)
 {
 	struct rt_mutex_waiter *w;
 
-	w = plist_first_entry(&lock->wait_list, struct rt_mutex_waiter,
-			       list_entry);
+	w = rb_entry(lock->waiters_leftmost, struct rt_mutex_waiter,
+		     tree_entry);
 	BUG_ON(w->lock != lock);
 
 	return w;
@@ -79,14 +79,14 @@ rt_mutex_top_waiter(struct rt_mutex *lock)
 
 static inline int task_has_pi_waiters(struct task_struct *p)
 {
-	return !plist_head_empty(&p->pi_waiters);
+	return !RB_EMPTY_ROOT(&p->pi_waiters);
 }
 
 static inline struct rt_mutex_waiter *
 task_top_pi_waiter(struct task_struct *p)
 {
-	return plist_first_entry(&p->pi_waiters, struct rt_mutex_waiter,
-				  pi_list_entry);
+	return rb_entry(p->pi_waiters_leftmost, struct rt_mutex_waiter,
+			pi_tree_entry);
 }
 
 /*
@@ -133,6 +133,8 @@ rt_mutex_init_waiter(struct rt_mutex_waiter *waiter, bool savestate)
 	debug_rt_mutex_init_waiter(waiter);
 	waiter->task = NULL;
 	waiter->savestate = savestate;
+	rb_init_node(&waiter->tree_entry);
+	rb_init_node(&waiter->pi_tree_entry);
 }
 
 #endif
diff --git a/kernel/sched.c b/kernel/sched.c
index 92d5e26..00e69a0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8824,10 +8824,6 @@ void __init sched_init(void)
 	open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);
 #endif
 
-#ifdef CONFIG_RT_MUTEXES
-	plist_head_init(&init_task.pi_waiters);
-#endif
-
 	/*
 	 * The boot idle thread does lazy MMU switching as well:
 	 */
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 13/16] sched: drafted deadline inheritance logic.
  2012-04-06  7:14 [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4 Juri Lelli
                   ` (11 preceding siblings ...)
  2012-04-06  7:14 ` [PATCH 12/16] rtmutex: turn the plist into an rb-tree Juri Lelli
@ 2012-04-06  7:14 ` Juri Lelli
  2012-04-12  2:42   ` Steven Rostedt
  2012-04-06  7:14 ` [PATCH 14/16] sched: add bandwidth management for sched_dl Juri Lelli
                   ` (6 subsequent siblings)
  19 siblings, 1 reply; 129+ messages in thread
From: Juri Lelli @ 2012-04-06  7:14 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, cfriesen, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang

From: Dario Faggioli <raistlin@linux.it>

Some method of dealing with rt-mutexes and making sched_dl interact
with the current PI code is needed. This raises anything but trivial
issues that need (according to us) to be solved with some restructuring
of the pi-code (i.e., going toward a proxy execution-ish implementation).

This is under development. In the meanwhile, as a temporary solution,
what this commit does is:
 - ensure a pi-lock owner with waiters is never throttled down. Instead,
   when it runs out of runtime, it immediately gets replenished and its
   deadline is postponed;
 - the scheduling parameters (relative deadline and default runtime)
   used for those replenishments --for the whole time it holds the
   pi-lock-- are those of the waiting task with the earliest deadline.

Acting this way, we provide some kind of boosting to the lock owner,
still using the existing (actually, slightly modified by the previous
commit) pi-architecture.
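
To make the replenishment rule above concrete, here is a small
user-space model with made-up numbers (not kernel code): a boosted
owner that exhausts its budget is not throttled but refilled with the
runtime/period of its earliest-deadline waiter, mirroring what
replenish_dl_entity() does with the inherited pi_se parameters in this
series:

    #include <stdint.h>
    #include <stdio.h>

    struct dl_params { uint64_t runtime, period; };      /* ns */

    int main(void)
    {
            struct dl_params own = { 5000000, 100000000 };  /* owner: 5ms every 100ms   */
            struct dl_params top = { 2000000, 10000000 };   /* earliest-deadline waiter */
            int64_t runtime = -1000;                        /* budget just exhausted    */
            uint64_t deadline = 100000000;                  /* current absolute deadline */

            /* boosted: never throttled, refill from the top waiter's parameters */
            while (runtime <= 0) {
                    runtime += top.runtime;
                    deadline += top.period;
            }

            printf("own runtime %llu ns; boosted: runtime %lld ns, deadline %llu ns\n",
                   (unsigned long long)own.runtime, (long long)runtime,
                   (unsigned long long)deadline);
            return 0;
    }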

We want to stress that this is only a stopgap: surely needed, but far
from a clean solution to the problem. In the end it's only a way to
re-start discussion within the community. So, as always, comments,
ideas, rants, etc. are welcome! :-)

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
---
 include/linux/sched.h |    9 +++++-
 kernel/fork.c         |    1 +
 kernel/rtmutex.c      |   13 +++++++-
 kernel/sched.c        |   34 ++++++++++++++++++---
 kernel/sched_dl.c     |   77 +++++++++++++++++++++++++++++++++---------------
 5 files changed, 102 insertions(+), 32 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5ef7bb6..ca45db4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1309,8 +1309,12 @@ struct sched_dl_entity {
 	 * @dl_new tells if a new instance arrived. If so we must
 	 * start executing it with full runtime and reset its absolute
 	 * deadline;
+	 *
+	 * @dl_boosted tells if we are boosted due to DI. If so we are
+	 * outside bandwidth enforcement mechanism (but only until we
+	 * exit the critical section).
 	 */
-	int dl_throttled, dl_new;
+	int dl_throttled, dl_new, dl_boosted;
 
 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
@@ -1556,6 +1560,8 @@ struct task_struct {
 	struct rb_node *pi_waiters_leftmost;
 	/* Deadlock detection and priority inheritance handling */
 	struct rt_mutex_waiter *pi_blocked_on;
+	/* Top pi_waiters task */
+	struct task_struct *pi_top_task;
 #endif
 
 #ifdef CONFIG_DEBUG_MUTEXES
@@ -2236,6 +2242,7 @@ extern unsigned int sysctl_sched_cfs_bandwidth_slice;
 #ifdef CONFIG_RT_MUTEXES
 extern int rt_mutex_getprio(struct task_struct *p);
 extern void rt_mutex_setprio(struct task_struct *p, int prio);
+extern struct task_struct *rt_mutex_get_top_task(struct task_struct *task);
 extern void rt_mutex_adjust_pi(struct task_struct *p);
 static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
 {
diff --git a/kernel/fork.c b/kernel/fork.c
index 344e0b4..b8b0dff 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1036,6 +1036,7 @@ static void rt_mutex_init_task(struct task_struct *p)
 	p->pi_waiters = RB_ROOT;
 	p->pi_waiters_leftmost = NULL;
 	p->pi_blocked_on = NULL;
+	p->pi_top_task = NULL;
 #endif
 }
 
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 6234c0e..cf155ba 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -217,6 +217,14 @@ int rt_mutex_getprio(struct task_struct *task)
 		   task->normal_prio);
 }
 
+struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
+{
+	if (likely(!task_has_pi_waiters(task)))
+		return NULL;
+
+	return task_top_pi_waiter(task)->task;
+}
+
 /*
  * Adjust the priority of a task, after its pi_waiters got modified.
  *
@@ -226,7 +234,7 @@ static void __rt_mutex_adjust_prio(struct task_struct *task)
 {
 	int prio = rt_mutex_getprio(task);
 
-	if (task->prio != prio)
+	if (task->prio != prio || dl_prio(prio))
 		rt_mutex_setprio(task, prio);
 }
 
@@ -712,7 +720,8 @@ void rt_mutex_adjust_pi(struct task_struct *task)
 
 	waiter = task->pi_blocked_on;
 	if (!rt_mutex_real_waiter(waiter) ||
-	    waiter->task->prio == task->prio) {
+	    (waiter->task->prio == task->prio &&
+	    !dl_prio(task->prio))) {
 		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 		return;
 	}
diff --git a/kernel/sched.c b/kernel/sched.c
index 00e69a0..3963e4e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5238,7 +5238,7 @@ EXPORT_SYMBOL(sleep_on_timeout);
  */
 void rt_mutex_setprio(struct task_struct *p, int prio)
 {
-	int oldprio, on_rq, running;
+	int oldprio, on_rq, running, enqueue_flag = 0;
 	struct rq *rq;
 	const struct sched_class *prev_class;
 
@@ -5265,6 +5265,7 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 	}
 
 	trace_sched_pi_setprio(p, prio);
+	p->pi_top_task = rt_mutex_get_top_task(p);
 	oldprio = p->prio;
 	prev_class = p->sched_class;
 	on_rq = p->on_rq;
@@ -5274,19 +5275,42 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 	if (running)
 		p->sched_class->put_prev_task(rq, p);
 
-	if (dl_prio(prio))
+	/*
+	 * Boosting conditions are:
+	 * 1. -rt task is running and holds mutex A
+	 *      --> -dl task blocks on mutex A
+	 *
+	 * 2. -dl task is running and holds mutex A
+	 *      --> -dl task blocks on mutex A and could preempt the
+	 *          running task
+	 */
+	if (dl_prio(prio)) {
+		if (!dl_prio(p->normal_prio) || (p->pi_top_task &&
+			dl_entity_preempt(&p->pi_top_task->dl, &p->dl))) {
+			p->dl.dl_boosted = 1;
+			p->dl.dl_throttled = 0;
+			enqueue_flag = ENQUEUE_REPLENISH;
+		} else
+			p->dl.dl_boosted = 0;
 		p->sched_class = &dl_sched_class;
-	else if (rt_prio(prio))
+	} else if (rt_prio(prio)) {
+		if (dl_prio(oldprio))
+			p->dl.dl_boosted = 0;
+		if (oldprio < prio)
+			enqueue_flag = ENQUEUE_HEAD;
 		p->sched_class = &rt_sched_class;
-	else
+	} else {
+		if (dl_prio(oldprio))
+			p->dl.dl_boosted = 0;
 		p->sched_class = &fair_sched_class;
+	}
 
 	p->prio = prio;
 
 	if (running)
 		p->sched_class->set_curr_task(rq);
 	if (on_rq)
-		enqueue_task(rq, p, oldprio < prio ? ENQUEUE_HEAD : 0);
+		enqueue_task(rq, p, enqueue_flag);
 
 	check_class_changed(rq, p, prev_class, oldprio);
 out_unlock:
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index 05140bb..e393292 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -224,15 +224,16 @@ static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
  * one, and to (try to!) reconcile itself with its own scheduling
  * parameters.
  */
-static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
+static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se,
+				       struct sched_dl_entity *pi_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
 
 	WARN_ON(!dl_se->dl_new || dl_se->dl_throttled);
 
-	dl_se->deadline = rq->clock + dl_se->dl_deadline;
-	dl_se->runtime = dl_se->dl_runtime;
+	dl_se->deadline = rq->clock + pi_se->dl_deadline;
+	dl_se->runtime = pi_se->dl_runtime;
 	dl_se->dl_new = 0;
 }
 
@@ -254,11 +255,23 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
  * could happen are, typically, a entity voluntarily trying to overcume its
  * runtime, or it just underestimated it during sched_setscheduler_ex().
  */
-static void replenish_dl_entity(struct sched_dl_entity *dl_se)
+static void replenish_dl_entity(struct sched_dl_entity *dl_se,
+				struct sched_dl_entity *pi_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
 
+	BUG_ON(pi_se->dl_runtime <= 0);
+
+	/*
+	 * This could be the case for a !-dl task that is boosted.
+	 * Just go with full inherited parameters.
+	 */
+	if (dl_se->dl_deadline == 0) {
+		dl_se->deadline = rq->clock + pi_se->dl_deadline;
+		dl_se->runtime = pi_se->dl_runtime;
+	}
+
 	/*
 	 * We Keep moving the deadline away until we get some
 	 * available runtime for the entity. This ensures correct
@@ -266,8 +279,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 	 * arbitrary large.
 	 */
 	while (dl_se->runtime <= 0) {
-		dl_se->deadline += dl_se->dl_period;
-		dl_se->runtime += dl_se->dl_runtime;
+		dl_se->deadline += pi_se->dl_period;
+		dl_se->runtime += pi_se->dl_runtime;
 	}
 
 	/*
@@ -281,8 +294,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 	 */
 	if (dl_time_before(dl_se->deadline, rq->clock)) {
 		WARN_ON_ONCE(1);
-		dl_se->deadline = rq->clock + dl_se->dl_deadline;
-		dl_se->runtime = dl_se->dl_runtime;
+		dl_se->deadline = rq->clock + pi_se->dl_deadline;
+		dl_se->runtime = pi_se->dl_runtime;
 	}
 }
 
@@ -299,7 +312,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
  * task with deadline equal to period this is the same of using
  * dl_deadline instead of dl_period in the equation above.
  */
-static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
+static bool dl_entity_overflow(struct sched_dl_entity *dl_se,
+			       struct sched_dl_entity *pi_se, u64 t)
 {
 	u64 left, right;
 
@@ -316,8 +330,8 @@ static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
 	 * to the (absolute) deadline. Therefore, overflowing the u64
 	 * type is very unlikely to occur in both cases.
 	 */
-	left = dl_se->dl_period * dl_se->runtime;
-	right = (dl_se->deadline - t) * dl_se->dl_runtime;
+	left = pi_se->dl_deadline * dl_se->runtime;
+	right = (dl_se->deadline - t) * pi_se->dl_runtime;
 
 	return dl_time_before(right, left);
 }
@@ -331,7 +345,8 @@ static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
  *  - using the remaining runtime with the current deadline would make
  *    the entity exceed its bandwidth.
  */
-static void update_dl_entity(struct sched_dl_entity *dl_se)
+static void update_dl_entity(struct sched_dl_entity *dl_se,
+			     struct sched_dl_entity *pi_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
@@ -341,14 +356,14 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
 	 * the actual scheduling parameters have to be "renewed".
 	 */
 	if (dl_se->dl_new) {
-		setup_new_dl_entity(dl_se);
+		setup_new_dl_entity(dl_se, pi_se);
 		return;
 	}
 
 	if (dl_time_before(dl_se->deadline, rq->clock) ||
-	    dl_entity_overflow(dl_se, rq->clock)) {
-		dl_se->deadline = rq->clock + dl_se->dl_deadline;
-		dl_se->runtime = dl_se->dl_runtime;
+	    dl_entity_overflow(dl_se, pi_se, rq->clock)) {
+		dl_se->deadline = rq->clock + pi_se->dl_deadline;
+		dl_se->runtime = pi_se->dl_runtime;
 	}
 }
 
@@ -362,7 +377,7 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
  * actually started or not (i.e., the replenishment instant is in
  * the future or in the past).
  */
-static int start_dl_timer(struct sched_dl_entity *dl_se)
+static int start_dl_timer(struct sched_dl_entity *dl_se, bool boosted)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
@@ -371,6 +386,8 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
 	unsigned long range;
 	s64 delta;
 
+	if (boosted)
+		return 0;
 	/*
 	 * We want the timer to fire at the deadline, but considering
 	 * that it is actually coming from rq->clock and not from
@@ -540,7 +557,7 @@ static void update_curr_dl(struct rq *rq)
 	dl_se->runtime -= delta_exec;
 	if (dl_runtime_exceeded(rq, dl_se)) {
 		__dequeue_task_dl(rq, curr, 0);
-		if (likely(start_dl_timer(dl_se)))
+		if (likely(start_dl_timer(dl_se, curr->dl.dl_boosted)))
 			dl_se->dl_throttled = 1;
 		else
 			enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
@@ -695,7 +712,8 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
 }
 
 static void
-enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
+enqueue_dl_entity(struct sched_dl_entity *dl_se,
+		  struct sched_dl_entity *pi_se, int flags)
 {
 	BUG_ON(on_dl_rq(dl_se));
 
@@ -705,9 +723,9 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 	 * we want a replenishment of its runtime.
 	 */
 	if (!dl_se->dl_new && flags & ENQUEUE_REPLENISH)
-		replenish_dl_entity(dl_se);
+		replenish_dl_entity(dl_se, pi_se);
 	else
-		update_dl_entity(dl_se);
+		update_dl_entity(dl_se, pi_se);
 
 	__enqueue_dl_entity(dl_se);
 }
@@ -719,6 +737,18 @@ static void dequeue_dl_entity(struct sched_dl_entity *dl_se)
 
 static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
+	struct task_struct *pi_task = p->pi_top_task;
+	struct sched_dl_entity *pi_se = &p->dl;
+
+	/*
+	 * Use the scheduling parameters of the top pi-waiter
+	 * task if we have one and its (relative) deadline is
+	 * smaller than our one... OTW we keep our runtime and
+	 * deadline.
+	 */
+	if (pi_task && p->dl.dl_boosted && dl_prio(pi_task->normal_prio))
+		pi_se = &pi_task->dl;
+
 	/*
 	 * If p is throttled, we do nothing. In fact, if it exhausted
 	 * its budget it needs a replenishment and, since it now is on
@@ -728,7 +758,7 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	if (p->dl.dl_throttled)
 		return;
 
-	enqueue_dl_entity(&p->dl, flags);
+	enqueue_dl_entity(&p->dl, pi_se, flags);
 
 	if (!task_current(rq, p) && p->dl.nr_cpus_allowed > 1)
 		enqueue_pushable_dl_task(rq, p);
@@ -962,8 +992,7 @@ static void task_dead_dl(struct task_struct *p)
 {
 	struct hrtimer *timer = &p->dl.dl_timer;
 
-	if (hrtimer_active(timer))
-		hrtimer_try_to_cancel(timer);
+	hrtimer_try_to_cancel(timer);
 }
 
 static void set_curr_task_dl(struct rq *rq)
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 14/16] sched: add bandwidth management for sched_dl.
  2012-04-06  7:14 [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4 Juri Lelli
                   ` (12 preceding siblings ...)
  2012-04-06  7:14 ` [PATCH 13/16] sched: drafted deadline inheritance logic Juri Lelli
@ 2012-04-06  7:14 ` Juri Lelli
  2012-04-06  7:14 ` [PATCH 15/16] sched: speed up -dl pushes with a push-heap Juri Lelli
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-06  7:14 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, cfriesen, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang

From: Dario Faggioli <raistlin@linux.it>

In order for -deadline scheduling to be effective and useful, it is
important to have some method of keeping the allocation of the
available CPU bandwidth to tasks and task groups under control.
This is usually called "admission control" and, if it is not performed
at all, no guarantee can be given on the actual scheduling of the
-deadline tasks.

Since RT-throttling was introduced, each task group has had a
bandwidth associated with it, calculated as a certain amount of
runtime over a period. Moreover, to make it possible to manipulate
such bandwidth, readable/writable controls have been added to both
procfs (for system wide settings) and cgroupfs (for per-group
settings).
Therefore, the same interface is used for controlling the bandwidth
distribution to -deadline tasks and task groups, i.e., new controls
with similar names, equivalent meaning and the same usage paradigm
are added.

However, more discussion is needed in order to figure out how
we want to manage SCHED_DEADLINE bandwidth at the task group level.
Therefore, this patch adds a less sophisticated, but actually
very sensible, mechanism to ensure that a certain utilization
cap is not exceeded on each root_domain (the single rq for !SMP
configurations).

Another main difference between deadline bandwidth management and
RT-throttling is that -deadline tasks have bandwidth of their own
(while -rt ones don't!), and thus we don't need a higher-level
throttling mechanism to enforce the desired bandwidth.

This patch, therefore:
 - adds system wide deadline bandwidth management by means of:
    * /proc/sys/kernel/sched_dl_runtime_us,
    * /proc/sys/kernel/sched_dl_period_us,
   that determine (i.e., runtime / period) the total bandwidth
   available on each CPU of each root_domain for -deadline tasks;
 - couples the RT and deadline bandwidth management, i.e., enforces
   that the sum of the bandwidth devoted to -rt and -deadline tasks
   stays below 100%.

This means that, for a root_domain comprising M CPUs, -deadline tasks
can be created until the sum of their bandwidths stays below:

    M * (sched_dl_runtime_us / sched_dl_period_us)

It is also possible to disable this bandwidth management logic, and
thus be free to oversubscribe the system up to any arbitrary level.
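
As a worked example of this cap (plain user-space arithmetic, reusing
the same 20-bit fixed-point ratio as to_ratio() in this patch): with
the default 5% per-CPU setting, a 4-CPU root_domain admits -deadline
tasks until their summed bandwidth reaches about 20%, e.g. twenty
tasks each needing 1ms every 100ms:

    #include <stdint.h>
    #include <stdio.h>

    static uint64_t to_ratio(uint64_t period, uint64_t runtime)
    {
            return (runtime << 20) / period;        /* same fixed point as the patch */
    }

    int main(void)
    {
            int cpus = 4, admitted = 0;
            uint64_t cap = to_ratio(1000000, 50000) * cpus; /* 4 * 5% (defaults, in us) */
            uint64_t tsk = to_ratio(100000, 1000);          /* 1ms every 100ms = 1%     */
            uint64_t total = 0;

            while (total + tsk <= cap) {            /* mirrors __dl_overflow()'s test */
                    total += tsk;
                    admitted++;
            }

            printf("admitted %d tasks (cap %llu, per task %llu)\n",
                   admitted, (unsigned long long)cap,
                   (unsigned long long)tsk);        /* -> 20 */
            return 0;
    }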

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
---
 include/linux/sched.h |    8 +
 kernel/sched.c        |  497 ++++++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched_dl.c     |   18 ++-
 kernel/sysctl.c       |   14 ++
 4 files changed, 512 insertions(+), 25 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ca45db4..534f099 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1289,6 +1289,7 @@ struct sched_dl_entity {
 	u64 dl_runtime;		/* maximum runtime for each instance	*/
 	u64 dl_deadline;	/* relative deadline of each instance	*/
 	u64 dl_period;		/* separation of two instances (period) */
+	u64 dl_bw;		/* dl_runtime / dl_deadline		*/
 
 	/*
 	 * Actual scheduling parameters. Initialized with the values above,
@@ -2217,6 +2218,13 @@ int sched_rt_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos);
 
+extern unsigned int sysctl_sched_dl_period;
+extern int sysctl_sched_dl_runtime;
+
+int sched_dl_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos);
+
 #ifdef CONFIG_SCHED_AUTOGROUP
 extern unsigned int sysctl_sched_autogroup_enabled;
 
diff --git a/kernel/sched.c b/kernel/sched.c
index 3963e4e..d5403c8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -158,6 +158,22 @@ struct rt_prio_array {
 	struct list_head queue[MAX_RT_PRIO];
 };
 
+static unsigned long to_ratio(u64 period, u64 runtime)
+{
+	if (runtime == RUNTIME_INF)
+		return 1ULL << 20;
+
+	/*
+	 * Doing this here saves a lot of checks in all
+	 * the calling paths, and returning zero seems
+	 * safe for them anyway.
+	 */
+	if (period == 0)
+		return 0;
+
+	return div64_u64(runtime << 20, period);
+}
+
 struct rt_bandwidth {
 	/* nests inside the rq lock: */
 	raw_spinlock_t		rt_runtime_lock;
@@ -251,6 +267,74 @@ static void destroy_rt_bandwidth(struct rt_bandwidth *rt_b)
 #endif
 
 /*
+ * To keep the bandwidth of -deadline tasks and groups under control
+ * we need some place where:
+ *  - store the maximum -deadline bandwidth of the system (the group);
+ *  - cache the fraction of that bandwidth that is currently allocated.
+ *
+ * This is all done in the data structure below. It is similar to the
+ * one used for RT-throttling (rt_bandwidth), with the main difference
+ * that, since here we are only interested in admission control, we
+ * do not decrease any runtime while the group "executes", neither we
+ * need a timer to replenish it.
+ *
+ * With respect to SMP, the bandwidth is given on a per-CPU basis,
+ * meaning that:
+ *  - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;
+ *  - dl_total_bw array contains, in the i-eth element, the currently
+ *    allocated bandwidth on the i-eth CPU.
+ * Moreover, groups consume bandwidth on each CPU, while tasks only
+ * consume bandwidth on the CPU they're running on.
+ * Finally, dl_total_bw_cpu is used to cache the index of dl_total_bw
+ * that will be shown the next time the proc or cgroup controls will
+ * be read. It in turn can be changed by writing on its own
+ * control.
+ */
+struct dl_bandwidth {
+	raw_spinlock_t dl_runtime_lock;
+	u64 dl_runtime;
+	u64 dl_period;
+};
+
+static struct dl_bandwidth def_dl_bandwidth;
+
+static
+void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime)
+{
+	raw_spin_lock_init(&dl_b->dl_runtime_lock);
+	dl_b->dl_period = period;
+	dl_b->dl_runtime = runtime;
+}
+
+static inline int dl_bandwidth_enabled(void)
+{
+	return sysctl_sched_dl_runtime >= 0;
+}
+
+/*
+ *
+ */
+struct dl_bw {
+	raw_spinlock_t lock;
+	u64 bw, total_bw;
+};
+
+static inline u64 global_dl_period(void);
+static inline u64 global_dl_runtime(void);
+
+static void init_dl_bw(struct dl_bw *dl_b)
+{
+	raw_spin_lock_init(&dl_b->lock);
+	raw_spin_lock(&def_dl_bandwidth.dl_runtime_lock);
+	if (global_dl_runtime() == RUNTIME_INF)
+		dl_b->bw = -1;
+	else
+		dl_b->bw = to_ratio(global_dl_period(), global_dl_runtime());
+	raw_spin_unlock(&def_dl_bandwidth.dl_runtime_lock);
+	dl_b->total_bw = 0;
+}
+
+/*
  * sched_domains_mutex serializes calls to init_sched_domains,
  * detach_destroy_domains and partition_sched_domains.
  */
@@ -594,6 +678,8 @@ struct dl_rq {
 	 */
 	struct rb_root pushable_dl_tasks_root;
 	struct rb_node *pushable_dl_tasks_leftmost;
+#else
+	struct dl_bw dl_bw;
 #endif
 };
 
@@ -620,6 +706,7 @@ struct root_domain {
 	 */
 	cpumask_var_t dlo_mask;
 	atomic_t dlo_count;
+	struct dl_bw dl_bw;
 
 	/*
 	 * The "RT overload" flag: it gets set if a CPU has more than
@@ -1039,6 +1126,28 @@ static inline u64 global_rt_runtime(void)
 	return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
 }
 
+/*
+ * Maximum bandwidth available for all -deadline tasks and groups
+ * (if group scheduling is configured) on each CPU.
+ *
+ * default: 5%
+ */
+unsigned int sysctl_sched_dl_period = 1000000;
+int sysctl_sched_dl_runtime = 50000;
+
+static inline u64 global_dl_period(void)
+{
+	return (u64)sysctl_sched_dl_period * NSEC_PER_USEC;
+}
+
+static inline u64 global_dl_runtime(void)
+{
+	if (sysctl_sched_dl_runtime < 0)
+		return RUNTIME_INF;
+
+	return (u64)sysctl_sched_dl_runtime * NSEC_PER_USEC;
+}
+
 #ifndef prepare_arch_switch
 # define prepare_arch_switch(next)	do { } while (0)
 #endif
@@ -3121,6 +3230,78 @@ int sched_fork(struct task_struct *p)
 	return 0;
 }
 
+static inline
+void __dl_clear(struct dl_bw *dl_b, u64 tsk_bw)
+{
+	dl_b->total_bw -= tsk_bw;
+}
+
+static inline
+void __dl_add(struct dl_bw *dl_b, u64 tsk_bw)
+{
+	dl_b->total_bw += tsk_bw;
+}
+
+static inline
+bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64 new_bw)
+{
+	return dl_b->bw != -1 &&
+	       dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw;
+}
+
+/*
+ * We must be sure that accepting a new task (or allowing changing the
+ * parameters of an existing one) is consistent with the bandwidth
+ * constraints. If yes, this function also accordingly updates the currently
+ * allocated bandwidth to reflect the new situation.
+ *
+ * This function is called while holding p's rq->lock.
+ */
+static int dl_overflow(struct task_struct *p, int policy,
+		       const struct sched_param2 *param2)
+{
+#ifdef CONFIG_SMP
+	struct dl_bw *dl_b = &task_rq(p)->rd->dl_bw;
+#else
+	struct dl_bw *dl_b = &task_rq(p)->dl.dl_bw;
+#endif
+	u64 period = param2->sched_period;
+	u64 runtime = param2->sched_runtime;
+	u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
+#ifdef CONFIG_SMP
+	int cpus = cpumask_weight(task_rq(p)->rd->span);
+#else
+	int cpus = 1;
+#endif
+	int err = -1;
+
+	if (new_bw == p->dl.dl_bw)
+		return 0;
+
+	/*
+	 * Whether a task enters, leaves, or stays -deadline but changes
+	 * its parameters, we may need to update accordingly the total
+	 * allocated bandwidth of the container.
+	 */
+	raw_spin_lock(&dl_b->lock);
+	if (dl_policy(policy) && !task_has_dl_policy(p) &&
+	    !__dl_overflow(dl_b, cpus, 0, new_bw)) {
+		__dl_add(dl_b, new_bw);
+		err = 0;
+	} else if (dl_policy(policy) && task_has_dl_policy(p) &&
+		   !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
+		__dl_clear(dl_b, p->dl.dl_bw);
+		__dl_add(dl_b, new_bw);
+		err = 0;
+	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
+		__dl_clear(dl_b, p->dl.dl_bw);
+		err = 0;
+	}
+	raw_spin_unlock(&dl_b->lock);
+
+	return err;
+}
+
 /*
  * wake_up_new_task - wake up a newly created task for the first time.
  *
@@ -5521,6 +5702,7 @@ __setparam_dl(struct task_struct *p, const struct sched_param2 *param2)
 		dl_se->dl_period = param2->sched_period;
 	else
 		dl_se->dl_period = dl_se->dl_deadline;
+	dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
 	dl_se->flags = param2->sched_flags;
 	dl_se->dl_throttled = 0;
 	dl_se->dl_new = 1;
@@ -5708,8 +5890,8 @@ recheck:
 		return 0;
 	}
 
-#ifdef CONFIG_RT_GROUP_SCHED
 	if (user) {
+#ifdef CONFIG_RT_GROUP_SCHED
 		/*
 		 * Do not allow realtime tasks into groups that have no runtime
 		 * assigned.
@@ -5720,8 +5902,25 @@ recheck:
 			task_rq_unlock(rq, p, &flags);
 			return -EPERM;
 		}
-	}
 #endif
+#ifdef CONFIG_SMP
+		if (dl_bandwidth_enabled() && dl_policy(policy)) {
+			const struct cpumask *span = rq->rd->span;
+
+			/*
+			 * Don't allow tasks with an affinity mask smaller than
+			 * the entire root_domain to become SCHED_DEADLINE. We
+			 * will also fail if there's no bandwidth available.
+			 */
+			if (!cpumask_equal(&p->cpus_allowed, span) ||
+			    rq->rd->dl_bw.bw == 0) {
+				__task_rq_unlock(rq);
+				raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+				return -EPERM;
+			}
+		}
+#endif
+	}
 
 	/* recheck policy now with rq lock held */
 	if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
@@ -5729,6 +5928,19 @@ recheck:
 		task_rq_unlock(rq, p, &flags);
 		goto recheck;
 	}
+
+	/*
+	 * If setscheduling to SCHED_DEADLINE (or changing the parameters
+	 * of a SCHED_DEADLINE task) we need to check if enough bandwidth
+	 * is available.
+	 */
+	if ((dl_policy(policy) || dl_task(p)) &&
+	    dl_overflow(p, policy, param)) {
+		__task_rq_unlock(rq);
+		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+		return -EBUSY;
+	}
+
 	on_rq = p->on_rq;
 	running = task_current(rq, p);
 	if (on_rq)
@@ -6048,6 +6260,24 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 	if (retval)
 		goto out_unlock;
 
+	/*
+	 * Since bandwidth control happens on root_domain basis,
+	 * if admission test is enabled, we only admit -deadline
+	 * tasks allowed to run on all the CPUs in the task's
+	 * root_domain.
+	 */
+#ifdef CONFIG_SMP
+	if (task_has_dl_policy(p)) {
+		const struct cpumask *span = task_rq(p)->rd->span;
+
+		if (dl_bandwidth_enabled() &&
+		    !cpumask_equal(in_mask, span)) {
+			retval = -EBUSY;
+			goto out_unlock;
+		}
+	}
+#endif
+
 	cpuset_cpus_allowed(p, cpus_allowed);
 	cpumask_and(new_mask, in_mask, cpus_allowed);
 again:
@@ -6728,6 +6958,42 @@ out:
 EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
 
 /*
+ * When dealing with a -deadline task, we have to check if moving it to
+ * a new CPU is possible or not. In fact, this is only true iff there
+ * is enough bandwidth available on such CPU, otherwise we want the
+ * whole migration procedure to fail over.
+ */
+static inline
+bool set_task_cpu_dl(struct task_struct *p, unsigned int cpu)
+{
+	struct dl_bw *dl_b = &task_rq(p)->rd->dl_bw;
+	struct dl_bw *cpu_b = &cpu_rq(cpu)->rd->dl_bw;
+	int ret = 1;
+	u64 bw;
+
+	if (dl_b == cpu_b)
+		return 1;
+
+	raw_spin_lock(&dl_b->lock);
+	raw_spin_lock(&cpu_b->lock);
+
+	bw = cpu_b->bw * cpumask_weight(cpu_rq(cpu)->rd->span);
+	if (dl_bandwidth_enabled() &&
+	    bw < cpu_b->total_bw + p->dl.dl_bw) {
+		ret = 0;
+		goto unlock;
+	}
+	dl_b->total_bw -= p->dl.dl_bw;
+	cpu_b->total_bw += p->dl.dl_bw;
+
+unlock:
+	raw_spin_unlock(&cpu_b->lock);
+	raw_spin_unlock(&dl_b->lock);
+
+	return ret;
+}
+
+/*
  * Move (not current) task off this cpu, onto dest cpu. We're doing
  * this because either it can't run here any more (set_cpus_allowed()
  * away from this CPU, or CPU going down), or because we're
@@ -6759,6 +7025,13 @@ static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
 		goto fail;
 
 	/*
+	 * If p is -deadline, proceed only if there is enough
+	 * bandwidth available on dest_cpu
+	 */
+	if (unlikely(dl_task(p)) && !set_task_cpu_dl(p, dest_cpu))
+		goto fail;
+
+	/*
 	 * If we're not on a rq, the next wake-up will ensure we're
 	 * placed properly.
 	 */
@@ -7467,6 +7740,8 @@ static int init_rootdomain(struct root_domain *rd)
 	if (!alloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
 		goto free_dlo_mask;
 
+	init_dl_bw(&rd->dl_bw);
+
 	if (cpupri_init(&rd->cpupri) != 0)
 		goto free_rto_mask;
 	return 0;
@@ -8645,6 +8920,8 @@ static void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
 	dl_rq->dl_nr_migratory = 0;
 	dl_rq->overloaded = 0;
 	dl_rq->pushable_dl_tasks_root = RB_ROOT;
+#else
+	init_dl_bw(&dl_rq->dl_bw);
 #endif
 }
 
@@ -8757,6 +9034,8 @@ void __init sched_init(void)
 
 	init_rt_bandwidth(&def_rt_bandwidth,
 			global_rt_period(), global_rt_runtime());
+	init_dl_bandwidth(&def_dl_bandwidth,
+			global_dl_period(), global_dl_runtime());
 
 #ifdef CONFIG_RT_GROUP_SCHED
 	init_rt_bandwidth(&root_task_group.rt_bandwidth,
@@ -9345,16 +9624,6 @@ unsigned long sched_group_shares(struct task_group *tg)
 }
 #endif
 
-#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_CFS_BANDWIDTH)
-static unsigned long to_ratio(u64 period, u64 runtime)
-{
-	if (runtime == RUNTIME_INF)
-		return 1ULL << 20;
-
-	return div64_u64(runtime << 20, period);
-}
-#endif
-
 #ifdef CONFIG_RT_GROUP_SCHED
 /*
  * Ensure that the real time constraints are schedulable.
@@ -9528,10 +9797,48 @@ long sched_group_rt_period(struct task_group *tg)
 	do_div(rt_period_us, NSEC_PER_USEC);
 	return rt_period_us;
 }
+#endif /* CONFIG_RT_GROUP_SCHED */
 
+/*
+ * Coupling of -rt and -deadline bandwidth.
+ *
+ * Here we check if the new -rt bandwidth value is consistent
+ * with the system settings for the bandwidth available
+ * to -deadline tasks.
+ *
+ * IOW, we want to enforce that
+ *
+ *   rt_bandwidth + dl_bandwidth <= 100%
+ *
+ * is always true.
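+ *
+ * Bandwidths are compared in the fixed-point representation used by
+ * to_ratio(), where 100% maps to (1 << 20) = 1048576. E.g., an -rt
+ * setting of runtime/period = 950000/1000000 yields rt_bw ~= 996147,
+ * which leaves roughly 5% (~52429) of the scale for -deadline bandwidth.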
+ */
+static bool __sched_rt_dl_global_constraints(u64 rt_bw)
+{
+	unsigned long flags;
+	u64 dl_bw;
+	bool ret;
+
+	raw_spin_lock_irqsave(&def_dl_bandwidth.dl_runtime_lock, flags);
+	if (global_rt_runtime() == RUNTIME_INF ||
+	    global_dl_runtime() == RUNTIME_INF) {
+		ret = true;
+		goto unlock;
+	}
+
+	dl_bw = to_ratio(def_dl_bandwidth.dl_period,
+			 def_dl_bandwidth.dl_runtime);
+
+	ret = rt_bw + dl_bw <= to_ratio(RUNTIME_INF, RUNTIME_INF);
+unlock:
+	raw_spin_unlock_irqrestore(&def_dl_bandwidth.dl_runtime_lock, flags);
+
+	return ret;
+}
+
+#ifdef CONFIG_RT_GROUP_SCHED
 static int sched_rt_global_constraints(void)
 {
-	u64 runtime, period;
+	u64 runtime, period, bw;
 	int ret = 0;
 
 	if (sysctl_sched_rt_period <= 0)
@@ -9546,6 +9853,10 @@ static int sched_rt_global_constraints(void)
 	if (runtime > period && runtime != RUNTIME_INF)
 		return -EINVAL;
 
+	bw = to_ratio(period, runtime);
+	if (!__sched_rt_dl_global_constraints(bw))
+		return -EINVAL;
+
 	mutex_lock(&rt_constraints_mutex);
 	read_lock(&tasklist_lock);
 	ret = __rt_schedulable(NULL, 0, 0);
@@ -9568,19 +9879,19 @@ int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
 static int sched_rt_global_constraints(void)
 {
 	unsigned long flags;
-	int i;
+	int i, ret = 0;
+	u64 bw;
 
 	if (sysctl_sched_rt_period <= 0)
 		return -EINVAL;
 
-	/*
-	 * There's always some RT tasks in the root group
-	 * -- migration, kstopmachine etc..
-	 */
-	if (sysctl_sched_rt_runtime == 0)
-		return -EBUSY;
-
 	raw_spin_lock_irqsave(&def_rt_bandwidth.rt_runtime_lock, flags);
+	bw = to_ratio(global_rt_period(), global_rt_runtime());
+	if (!__sched_rt_dl_global_constraints(bw)) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
 	for_each_possible_cpu(i) {
 		struct rt_rq *rt_rq = &cpu_rq(i)->rt;
 
@@ -9588,12 +9899,96 @@ static int sched_rt_global_constraints(void)
 		rt_rq->rt_runtime = global_rt_runtime();
 		raw_spin_unlock(&rt_rq->rt_runtime_lock);
 	}
+unlock:
 	raw_spin_unlock_irqrestore(&def_rt_bandwidth.rt_runtime_lock, flags);
 
-	return 0;
+	return ret;
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+/*
+ * Coupling of -dl and -rt bandwidth.
+ *
+ * Here we check, while setting the system wide bandwidth available
+ * for -dl tasks and groups, if the new values are consistent with
+ * the system settings for the bandwidth available to -rt entities.
+ *
+ * IOW, we want to enforce that
+ *
+ *   rt_bandwidth + dl_bandwidth <= 100%
+ *
+ * is always true.
+ */
+static bool __sched_dl_rt_global_constraints(u64 dl_bw)
+{
+	u64 rt_bw;
+	bool ret;
+
+	raw_spin_lock(&def_rt_bandwidth.rt_runtime_lock);
+	if (global_dl_runtime() == RUNTIME_INF ||
+	    global_rt_runtime() == RUNTIME_INF) {
+		ret = true;
+		goto unlock;
+	}
+
+	rt_bw = to_ratio(ktime_to_ns(def_rt_bandwidth.rt_period),
+			 def_rt_bandwidth.rt_runtime);
+
+	ret = rt_bw + dl_bw <= to_ratio(RUNTIME_INF, RUNTIME_INF);
+unlock:
+	raw_spin_unlock(&def_rt_bandwidth.rt_runtime_lock);
+
+	return ret;
+}
+
+static int __sched_dl_global_constraints(u64 runtime, u64 period)
+{
+	if (!period || (runtime != RUNTIME_INF && runtime > period))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int sched_dl_global_constraints(void)
+{
+	u64 runtime = global_dl_runtime();
+	u64 period = global_dl_period();
+	u64 new_bw = to_ratio(period, runtime);
+	int ret, i;
+
+	ret = __sched_dl_global_constraints(runtime, period);
+	if (ret)
+		return ret;
+
+	if (!__sched_dl_rt_global_constraints(new_bw))
+		return -EINVAL;
+
+	/*
+	 * Here we want to check that the bandwidth is not being set to a
+	 * value smaller than the currently allocated bandwidth in any of
+	 * the root_domains.
+	 *
+	 * FIXME: Cycling over all the CPUs is overkill, but simpler than
+	 * cycling over root_domains... Discussion on different/better
+	 * solutions is welcome!
+	 */
+	for_each_possible_cpu(i) {
+#ifdef CONFIG_SMP
+		struct dl_bw *dl_b = &cpu_rq(i)->rd->dl_bw;
+#else
+		struct dl_bw *dl_b = &cpu_rq(i)->dl.dl_bw;
+#endif
+		raw_spin_lock(&dl_b->lock);
+		if (new_bw < dl_b->total_bw) {
+			raw_spin_unlock(&dl_b->lock);
+			return -EBUSY;
+		}
+		raw_spin_unlock(&dl_b->lock);
+	}
+
+	return 0;
+}
+
 int sched_rt_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos)
@@ -9624,6 +10019,64 @@ int sched_rt_handler(struct ctl_table *table, int write,
 	return ret;
 }
 
+int sched_dl_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos)
+{
+	int ret;
+	int old_period, old_runtime;
+	static DEFINE_MUTEX(mutex);
+	unsigned long flags;
+
+	mutex_lock(&mutex);
+	old_period = sysctl_sched_dl_period;
+	old_runtime = sysctl_sched_dl_runtime;
+
+	ret = proc_dointvec(table, write, buffer, lenp, ppos);
+
+	if (!ret && write) {
+		raw_spin_lock_irqsave(&def_dl_bandwidth.dl_runtime_lock,
+				      flags);
+
+		ret = sched_dl_global_constraints();
+		if (ret) {
+			sysctl_sched_dl_period = old_period;
+			sysctl_sched_dl_runtime = old_runtime;
+		} else {
+			u64 new_bw;
+			int i;
+
+			def_dl_bandwidth.dl_period = global_dl_period();
+			def_dl_bandwidth.dl_runtime = global_dl_runtime();
+			if (global_dl_runtime() == RUNTIME_INF)
+				new_bw = -1;
+			else
+				new_bw = to_ratio(global_dl_period(),
+						  global_dl_runtime());
+			/*
+			 * FIXME: As above...
+			 */
+			for_each_possible_cpu(i) {
+#ifdef CONFIG_SMP
+				struct dl_bw *dl_b = &cpu_rq(i)->rd->dl_bw;
+#else
+				struct dl_bw *dl_b = &cpu_rq(i)->dl.dl_bw;
+#endif
+
+				raw_spin_lock(&dl_b->lock);
+				dl_b->bw = new_bw;
+				raw_spin_unlock(&dl_b->lock);
+			}
+		}
+
+		raw_spin_unlock_irqrestore(&def_dl_bandwidth.dl_runtime_lock,
+					   flags);
+	}
+	mutex_unlock(&mutex);
+
+	return ret;
+}
+
 #ifdef CONFIG_CGROUP_SCHED
 
 /* return corresponding task_group object of a cgroup */
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index e393292..fee9434 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -991,6 +991,18 @@ static void task_fork_dl(struct task_struct *p)
 static void task_dead_dl(struct task_struct *p)
 {
 	struct hrtimer *timer = &p->dl.dl_timer;
+#ifdef CONFIG_SMP
+	struct dl_bw *dl_b = &task_rq(p)->rd->dl_bw;
+#else
+	struct dl_bw *dl_b = &task_rq(p)->dl.dl_bw;
+#endif
+
+	/*
+	 * Since we are TASK_DEAD we won't slip out of the domain!
+	 */
+	raw_spin_lock_irq(&dl_b->lock);
+	dl_b->total_bw -= p->dl.dl_bw;
+	raw_spin_unlock_irq(&dl_b->lock);
 
 	hrtimer_try_to_cancel(timer);
 }
@@ -1171,7 +1183,7 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
 				     !cpumask_test_cpu(later_rq->cpu,
 						       &task->cpus_allowed) ||
 				     task_running(rq, task) ||
-				     !task->se.on_rq)) {
+				     !task->on_rq)) {
 				raw_spin_unlock(&later_rq->lock);
 				later_rq = NULL;
 				break;
@@ -1210,7 +1222,7 @@ static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
 	BUG_ON(task_current(rq, p));
 	BUG_ON(p->dl.nr_cpus_allowed <= 1);
 
-	BUG_ON(!p->se.on_rq);
+	BUG_ON(!p->on_rq);
 	BUG_ON(!dl_task(p));
 
 	return p;
@@ -1351,7 +1363,7 @@ static int pull_dl_task(struct rq *this_rq)
 		     dl_time_before(p->dl.deadline,
 				    this_rq->dl.earliest_dl.curr))) {
 			WARN_ON(p == src_rq->curr);
-			WARN_ON(!p->se.on_rq);
+			WARN_ON(!p->on_rq);
 
 			/*
 			 * Then we pull iff p has actually an earlier
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index ae27196..9ab6c9d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -362,6 +362,20 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= sched_rt_handler,
 	},
+	{
+		.procname	= "sched_dl_period_us",
+		.data		= &sysctl_sched_dl_period,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sched_dl_handler,
+	},
+	{
+		.procname	= "sched_dl_runtime_us",
+		.data		= &sysctl_sched_dl_runtime,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= sched_dl_handler,
+	},
 #ifdef CONFIG_SCHED_AUTOGROUP
 	{
 		.procname	= "sched_autogroup_enabled",
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 15/16] sched: speed up -dl pushes with a push-heap.
  2012-04-06  7:14 [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4 Juri Lelli
                   ` (13 preceding siblings ...)
  2012-04-06  7:14 ` [PATCH 14/16] sched: add bandwidth management for sched_dl Juri Lelli
@ 2012-04-06  7:14 ` Juri Lelli
  2012-04-06  7:14 ` [PATCH 16/16] sched: add sched_dl documentation Juri Lelli
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-06  7:14 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, cfriesen, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang

Data from tests confirmed that the original active load balancing
logic did not scale either in the number of CPUs or in the number of
tasks (whereas sched_rt does).

Here we provide a global data structure to keep track of the deadlines
of the running tasks in the system. The structure is composed of
a bitmask showing the free CPUs and a max-heap, needed when the system
is heavily loaded.

The implementation and concurrent access scheme are kept simple by
design. However, our measurements show that we can compete with sched_rt
on large multi-CPU machines [1].

Only the push path is addressed here; extending this structure to pull
decisions as well is straightforward. However, we are currently
evaluating different data structures (in order to decrease/avoid
contention) that could possibly solve both problems. We are also going
to re-run tests considering the recent changes inside cpupri [2].

[1] http://retis.sssup.it/~jlelli/papers/Ospert11Lelli.pdf
[2] http://www.spinics.net/lists/linux-rt-users/msg06778.html
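
To make the decision rule concrete, here is a self-contained user-space
sketch (hypothetical code, not taken from the patch): a task is pushed to
any free CPU if one exists, otherwise to the CPU whose earliest deadline
is the latest, and only if that deadline is later than the deadline of
the task being pushed. The cpudl code below caches exactly this
information in a free-CPU bitmask plus a max-heap, so the answer is found
without the linear scan used here for clarity (affinity masks are also
ignored in the sketch).

#include <stdio.h>
#include <stdint.h>

#define NCPU 4

/* Earliest -deadline queued on each CPU; 0 means the CPU is free. */
static uint64_t earliest_dl[NCPU];

static int dl_time_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

/* Return the best CPU to push a task with deadline @dl to, or -1. */
static int pick_push_cpu(uint64_t dl)
{
	int cpu, best = -1;

	/* A free CPU is always good enough. */
	for (cpu = 0; cpu < NCPU; cpu++)
		if (!earliest_dl[cpu])
			return cpu;

	/* Otherwise look for the CPU with the latest earliest-deadline... */
	for (cpu = 0; cpu < NCPU; cpu++)
		if (best == -1 ||
		    dl_time_before(earliest_dl[best], earliest_dl[cpu]))
			best = cpu;

	/* ...and use it only if it runs a later deadline than ours. */
	return dl_time_before(dl, earliest_dl[best]) ? best : -1;
}

int main(void)
{
	earliest_dl[0] = 100; earliest_dl[1] = 250;
	earliest_dl[2] = 180; earliest_dl[3] = 300;

	printf("push dl=200 -> CPU %d\n", pick_push_cpu(200)); /* CPU 3 */
	printf("push dl=400 -> CPU %d\n", pick_push_cpu(400)); /* -1    */

	return 0;
}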

Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
---
 kernel/Makefile      |    1 +
 kernel/sched.c       |    5 +
 kernel/sched_cpudl.c |  208 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_cpudl.h |   34 ++++++++
 kernel/sched_dl.c    |   52 +++---------
 5 files changed, 261 insertions(+), 39 deletions(-)
 create mode 100644 kernel/sched_cpudl.c
 create mode 100644 kernel/sched_cpudl.h

diff --git a/kernel/Makefile b/kernel/Makefile
index c961d3a..0aca455 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -105,6 +105,7 @@ obj-$(CONFIG_X86_DS) += trace/
 obj-$(CONFIG_RING_BUFFER) += trace/
 obj-$(CONFIG_TRACEPOINTS) += trace/
 obj-$(CONFIG_SMP) += sched_cpupri.o
+obj-$(CONFIG_SMP) += sched_cpudl.o
 obj-$(CONFIG_IRQ_WORK) += irq_work.o
 obj-$(CONFIG_CPU_PM) += cpu_pm.o
 
diff --git a/kernel/sched.c b/kernel/sched.c
index d5403c8..f8a857a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -81,6 +81,7 @@
 #endif
 
 #include "sched_cpupri.h"
+#include "sched_cpudl.h"
 #include "workqueue_sched.h"
 #include "sched_autogroup.h"
 
@@ -707,6 +708,7 @@ struct root_domain {
 	cpumask_var_t dlo_mask;
 	atomic_t dlo_count;
 	struct dl_bw dl_bw;
+	struct cpudl cpudl;
 
 	/*
 	 * The "RT overload" flag: it gets set if a CPU has more than
@@ -7683,6 +7685,7 @@ static void free_rootdomain(struct rcu_head *rcu)
 	struct root_domain *rd = container_of(rcu, struct root_domain, rcu);
 
 	cpupri_cleanup(&rd->cpupri);
+	cpudl_cleanup(&rd->cpudl);
 	free_cpumask_var(rd->dlo_mask);
 	free_cpumask_var(rd->rto_mask);
 	free_cpumask_var(rd->online);
@@ -7741,6 +7744,8 @@ static int init_rootdomain(struct root_domain *rd)
 		goto free_dlo_mask;
 
 	init_dl_bw(&rd->dl_bw);
+	if (cpudl_init(&rd->cpudl) != 0)
+		goto free_dlo_mask;
 
 	if (cpupri_init(&rd->cpupri) != 0)
 		goto free_rto_mask;
diff --git a/kernel/sched_cpudl.c b/kernel/sched_cpudl.c
new file mode 100644
index 0000000..b0908ac
--- /dev/null
+++ b/kernel/sched_cpudl.c
@@ -0,0 +1,208 @@
+/*
+ *  kernel/sched_cpudl.c
+ *
+ *  Global CPU deadline management
+ *
+ *  Author: Juri Lelli <j.lelli@sssup.it>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; version 2
+ *  of the License.
+ */
+
+#include <linux/gfp.h>
+#include <linux/kernel.h>
+#include "sched_cpudl.h"
+
+static inline int parent(int i)
+{
+	return (i - 1) >> 1;
+}
+
+static inline int left_child(int i)
+{
+	return (i << 1) + 1;
+}
+
+static inline int right_child(int i)
+{
+	return (i << 1) + 2;
+}
+
+static inline int dl_time_before(u64 a, u64 b)
+{
+	return (s64)(a - b) < 0;
+}
+
+void cpudl_exchange(struct cpudl *cp, int a, int b)
+{
+	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
+
+	swap(cp->elements[a], cp->elements[b]);
+	swap(cp->cpu_to_idx[cpu_a], cp->cpu_to_idx[cpu_b]);
+}
+
+void cpudl_heapify(struct cpudl *cp, int idx)
+{
+	int l, r, largest;
+
+	/* adapted from lib/prio_heap.c */
+	while (1) {
+		l = left_child(idx);
+		r = right_child(idx);
+		largest = idx;
+
+		if ((l < cp->size) && dl_time_before(cp->elements[idx].dl,
+							cp->elements[l].dl))
+			largest = l;
+		if ((r < cp->size) && dl_time_before(cp->elements[largest].dl,
+							cp->elements[r].dl))
+			largest = r;
+		if (largest == idx)
+			break;
+
+		/* Push idx down the heap one level and bump one up */
+		cpudl_exchange(cp, largest, idx);
+		idx = largest;
+	}
+}
+
+void cpudl_change_key(struct cpudl *cp, int idx, u64 new_dl)
+{
+	WARN_ON(idx > num_present_cpus() && idx != -1);
+
+	if (dl_time_before(new_dl, cp->elements[idx].dl)) {
+		cp->elements[idx].dl = new_dl;
+		cpudl_heapify(cp, idx);
+	} else {
+		cp->elements[idx].dl = new_dl;
+		while (idx > 0 && dl_time_before(cp->elements[parent(idx)].dl,
+					cp->elements[idx].dl)) {
+			cpudl_exchange(cp, idx, parent(idx));
+			idx = parent(idx);
+		}
+	}
+}
+
+static inline int cpudl_maximum(struct cpudl *cp)
+{
+	return cp->elements[0].cpu;
+}
+
+/*
+ * cpudl_find - find the best (later-dl) CPU in the system
+ * @cp: the cpudl max-heap context
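+ * @dlo_mask: the -deadline overload mask of the root_domain (not used by
+ *            the current implementation)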
+ * @p: the task
+ * @later_mask: a mask to fill in with the selected CPUs (or NULL)
+ *
+ * Returns: int - best CPU (heap maximum if suitable)
+ */
+int cpudl_find(struct cpudl *cp, struct cpumask *dlo_mask,
+		struct task_struct *p, struct cpumask *later_mask)
+{
+	int best_cpu = -1;
+	const struct sched_dl_entity *dl_se = &p->dl;
+
+	if (later_mask && cpumask_and(later_mask, cp->free_cpus,
+			&p->cpus_allowed) && cpumask_and(later_mask,
+			later_mask, cpu_active_mask)) {
+		best_cpu = cpumask_any(later_mask);
+		goto out;
+	} else if (cpumask_test_cpu(cpudl_maximum(cp), &p->cpus_allowed) &&
+			dl_time_before(dl_se->deadline, cp->elements[0].dl)) {
+		best_cpu = cpudl_maximum(cp);
+		if (later_mask)
+			cpumask_set_cpu(best_cpu, later_mask);
+	}
+
+out:
+	WARN_ON(best_cpu > num_present_cpus() && best_cpu != -1);
+
+	return best_cpu;
+}
+
+/*
+ * cpudl_set - update the cpudl max-heap
+ * @cp: the cpudl max-heap context
+ * @cpu: the target cpu
+ * @dl: the new earliest deadline for this cpu
+ *
+ * Notes: assumes cpu_rq(cpu)->lock is locked
+ *
+ * Returns: (void)
+ */
+void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid)
+{
+	int old_idx, new_cpu;
+	unsigned long flags;
+
+	WARN_ON(cpu > num_present_cpus());
+
+	raw_spin_lock_irqsave(&cp->lock, flags);
+	old_idx = cp->cpu_to_idx[cpu];
+	if (!is_valid) {
+		/* remove item */
+		new_cpu = cp->elements[cp->size - 1].cpu;
+		cp->elements[old_idx].dl = cp->elements[cp->size - 1].dl;
+		cp->elements[old_idx].cpu = new_cpu;
+		cp->size--;
+		cp->cpu_to_idx[new_cpu] = old_idx;
+		cp->cpu_to_idx[cpu] = IDX_INVALID;
+		while (old_idx > 0 && dl_time_before(
+				cp->elements[parent(old_idx)].dl,
+				cp->elements[old_idx].dl)) {
+			cpudl_exchange(cp, old_idx, parent(old_idx));
+			old_idx = parent(old_idx);
+		}
+		cpumask_set_cpu(cpu, cp->free_cpus);
+		cpudl_heapify(cp, old_idx);
+
+		goto out;
+	}
+
+	if (old_idx == IDX_INVALID) {
+		cp->size++;
+		cp->elements[cp->size - 1].dl = 0;
+		cp->elements[cp->size - 1].cpu = cpu;
+		cp->cpu_to_idx[cpu] = cp->size - 1;
+		cpudl_change_key(cp, cp->size - 1, dl);
+		cpumask_clear_cpu(cpu, cp->free_cpus);
+	} else {
+		cpudl_change_key(cp, old_idx, dl);
+	}
+
+out:
+	raw_spin_unlock_irqrestore(&cp->lock, flags);
+}
+
+/*
+ * cpudl_init - initialize the cpudl structure
+ * @cp: the cpudl max-heap context
+ */
+int cpudl_init(struct cpudl *cp)
+{
+	int i;
+
+	memset(cp, 0, sizeof(*cp));
+	raw_spin_lock_init(&cp->lock);
+	cp->size = 0;
+	for (i = 0; i < NR_CPUS; i++)
+		cp->cpu_to_idx[i] = IDX_INVALID;
+	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL))
+		return -ENOMEM;
+	cpumask_setall(cp->free_cpus);
+
+	return 0;
+}
+
+/*
+ * cpudl_cleanup - clean up the cpudl structure
+ * @cp: the cpudl max-heap context
+ */
+void cpudl_cleanup(struct cpudl *cp)
+{
+	/*
+	 * nothing to do for the moment
+	 */
+}
diff --git a/kernel/sched_cpudl.h b/kernel/sched_cpudl.h
new file mode 100644
index 0000000..9285dc4
--- /dev/null
+++ b/kernel/sched_cpudl.h
@@ -0,0 +1,34 @@
+#ifndef _LINUX_CPUDL_H
+#define _LINUX_CPUDL_H
+
+#include <linux/sched.h>
+
+#define IDX_INVALID     -1
+
+struct array_item {
+	u64 dl;
+	int cpu;
+};
+
+struct cpudl {
+	raw_spinlock_t lock;
+	int size;
+	int cpu_to_idx[NR_CPUS];
+	struct array_item elements[NR_CPUS];
+	cpumask_var_t free_cpus;
+};
+
+
+#ifdef CONFIG_SMP
+int cpudl_find(struct cpudl *cp, struct cpumask *dlo_mask,
+		struct task_struct *p, struct cpumask *later_mask);
+void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid);
+int cpudl_init(struct cpudl *cp);
+void cpudl_cleanup(struct cpudl *cp);
+#else
+#define cpudl_set(cp, cpu, dl, is_valid) do { } while (0)
+#define cpudl_init(cp) do { } while (0)
+#endif /* CONFIG_SMP */
+
+#endif /* _LINUX_CPUDL_H */
+
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index fee9434..2f280b3 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -595,6 +595,7 @@ static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
 		 */
 		dl_rq->earliest_dl.next = dl_rq->earliest_dl.curr;
 		dl_rq->earliest_dl.curr = deadline;
+		cpudl_set(&rq->rd->cpudl, rq->cpu, deadline, 1);
 	} else if (dl_rq->earliest_dl.next == 0 ||
 		   dl_time_before(deadline, dl_rq->earliest_dl.next)) {
 		/*
@@ -618,6 +619,7 @@ static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
 	if (!dl_rq->dl_nr_running) {
 		dl_rq->earliest_dl.curr = 0;
 		dl_rq->earliest_dl.next = 0;
+		cpudl_set(&rq->rd->cpudl, rq->cpu, 0, 0);
 	} else {
 		struct rb_node *leftmost = dl_rq->rb_leftmost;
 		struct sched_dl_entity *entry;
@@ -625,6 +627,7 @@ static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
 		entry = rb_entry(leftmost, struct sched_dl_entity, rb_node);
 		dl_rq->earliest_dl.curr = entry->deadline;
 		dl_rq->earliest_dl.next = next_deadline(rq);
+		cpudl_set(&rq->rd->cpudl, rq->cpu, entry->deadline, 1);
 	}
 }
 
@@ -802,9 +805,6 @@ static void yield_task_dl(struct rq *rq)
 #ifdef CONFIG_SMP
 
 static int find_later_rq(struct task_struct *task);
-static int latest_cpu_find(struct cpumask *span,
-			   struct task_struct *task,
-			   struct cpumask *later_mask);
 
 static int
 select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
@@ -852,7 +852,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
 	 * let's hope p can move out.
 	 */
 	if (rq->curr->dl.nr_cpus_allowed == 1 ||
-	    latest_cpu_find(rq->rd->span, rq->curr, NULL) == -1)
+	    cpudl_find(&rq->rd->cpudl, rq->rd->dlo_mask, rq->curr, NULL) == -1)
 		return;
 
 	/*
@@ -860,7 +860,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
 	 * see if it is pushed or pulled somewhere else.
 	 */
 	if (p->dl.nr_cpus_allowed != 1 &&
-	    latest_cpu_find(rq->rd->span, p, NULL) != -1)
+	    cpudl_find(&rq->rd->cpudl, rq->rd->dlo_mask, p, NULL) != -1)
 		return;
 
 	resched_task(rq->curr);
@@ -1054,39 +1054,6 @@ next_node:
 	return NULL;
 }
 
-static int latest_cpu_find(struct cpumask *span,
-			   struct task_struct *task,
-			   struct cpumask *later_mask)
-{
-	const struct sched_dl_entity *dl_se = &task->dl;
-	int cpu, found = -1, best = 0;
-	u64 max_dl = 0;
-
-	for_each_cpu(cpu, span) {
-		struct rq *rq = cpu_rq(cpu);
-		struct dl_rq *dl_rq = &rq->dl;
-
-		if (cpumask_test_cpu(cpu, &task->cpus_allowed) &&
-		    (!dl_rq->dl_nr_running || dl_time_before(dl_se->deadline,
-		     dl_rq->earliest_dl.curr))) {
-			if (later_mask)
-				cpumask_set_cpu(cpu, later_mask);
-			if (!best && !dl_rq->dl_nr_running) {
-				best = 1;
-				found = cpu;
-			} else if (!best &&
-				   dl_time_before(max_dl,
-						  dl_rq->earliest_dl.curr)) {
-				max_dl = dl_rq->earliest_dl.curr;
-				found = cpu;
-			}
-		} else if (later_mask)
-			cpumask_clear_cpu(cpu, later_mask);
-	}
-
-	return found;
-}
-
 static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
 
 static int find_later_rq(struct task_struct *task)
@@ -1099,7 +1066,9 @@ static int find_later_rq(struct task_struct *task)
 	if (task->dl.nr_cpus_allowed == 1)
 		return -1;
 
-	best_cpu = latest_cpu_find(task_rq(task)->rd->span, task, later_mask);
+	best_cpu = cpudl_find(&task_rq(task)->rd->cpudl,
+			task_rq(task)->rd->dlo_mask,
+			task, later_mask);
 	if (best_cpu == -1)
 		return -1;
 
@@ -1464,6 +1433,9 @@ static void rq_online_dl(struct rq *rq)
 {
 	if (rq->dl.overloaded)
 		dl_set_overload(rq);
+
+	if (rq->dl.dl_nr_running > 0)
+		cpudl_set(&rq->rd->cpudl, rq->cpu, rq->dl.earliest_dl.curr, 1);
 }
 
 /* Assumes rq->lock is held */
@@ -1471,6 +1443,8 @@ static void rq_offline_dl(struct rq *rq)
 {
 	if (rq->dl.overloaded)
 		dl_clear_overload(rq);
+
+	cpudl_set(&rq->rd->cpudl, rq->cpu, 0, 0);
 }
 
 static inline void init_sched_dl_class(void)
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH 16/16] sched: add sched_dl documentation.
  2012-04-06  7:14 [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4 Juri Lelli
                   ` (14 preceding siblings ...)
  2012-04-06  7:14 ` [PATCH 15/16] sched: speed up -dl pushes with a push-heap Juri Lelli
@ 2012-04-06  7:14 ` Juri Lelli
  2012-04-06  8:25 ` [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4 Luca Abeni
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-06  7:14 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, cfriesen, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, juri.lelli, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang

From: Dario Faggioli <raistlin@linux.it>

Add in Documentation/scheduler/ some hints about the design
choices, the usage and the future possible developments of the
sched_dl scheduling class and of the SCHED_DEADLINE policy.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
---
 Documentation/scheduler/sched-deadline.txt |  147 ++++++++++++++++++++++++++++
 1 files changed, 147 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/scheduler/sched-deadline.txt

diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt
new file mode 100644
index 0000000..6c94194
--- /dev/null
+++ b/Documentation/scheduler/sched-deadline.txt
@@ -0,0 +1,147 @@
+			Deadline Task and Group Scheduling
+			----------------------------------
+
+CONTENTS
+========
+
+0. WARNING
+1. Overview
+2. Task scheduling
+3. Bandwidth management
+  3.1 System-wide settings
+  3.2 Task interface
+  3.3 Default behavior
+4. Future plans
+
+
+0. WARNING
+==========
+
+ Fiddling with these settings can result in an unpredictable or even unstable
+ system behavior. As for -rt (group) scheduling, it is assumed that root users
+ know what they're doing.
+
+
+1. Overview
+===========
+
+ The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
+ basically an implementation of the Earliest Deadline First (EDF) scheduling
+ algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
+ that makes it possible to isolate the behavior of tasks between each other.
+
+
+2. Task scheduling
+==================
+
+ The typical -deadline task is made up of a computation phase (instance)
+ which is activated in a periodic or sporadic fashion. The expected (maximum)
+ duration of such a computation is called the task's runtime; the time
+ interval within which each instance needs to be completed is called the
+ task's relative deadline. The task's absolute deadline is dynamically
+ calculated as the time instant a task (better, an instance) activates plus
+ the relative deadline.
+
+ The EDF algorithm selects the task with the smallest absolute deadline as
+ the one to be executed first, while the CBS ensures that each task runs for
+ at most its runtime every period, avoiding any interference between different
+ tasks (bandwidth isolation).
+ Thanks to this feature, tasks that do not strictly comply with the
+ computational model described above can also effectively use the new policy.
+ IOW, there are no limitations on what kind of task can exploit this new
+ scheduling discipline, even if it must be said that it is particularly
+ suited for periodic or sporadic tasks that need guarantees on their
+ timing behavior, e.g., multimedia, streaming, control applications, etc.
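+
+ As a purely illustrative example, a video decoder that needs at most 25ms
+ of CPU time to process a frame, with frames arriving every 100ms, can be
+ described by a runtime of 25ms, a (relative) deadline of 100ms and a
+ period of 100ms; its bandwidth is therefore 25/100, i.e., 25% of one CPU.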
+
+
+3. Bandwidth management
+=======================
+
+ In order for the -deadline scheduling to be effective and useful, it is
+ important to have some method to keep the allocation of the available CPU
+ bandwidth to the tasks under control.
+ This is usually called "admission control" and if it is not performed at all,
+ no guarantee can be given on the actual scheduling of the -deadline tasks.
+
+ Since RT-throttling was introduced, each task group has had an associated
+ bandwidth, calculated as a certain amount of runtime over a period.
+ Moreover, to make it possible to manipulate such bandwidth, readable/writable
+ controls have been added to both procfs (for system-wide settings) and
+ cgroupfs (for per-group settings).
+ Therefore, the same interface is being used for controlling the bandwidth
+ distribution to -deadline tasks and task groups, i.e., new controls, with
+ similar names, equivalent meaning and the same usage paradigm, are added.
+
+ However, more discussion is needed in order to figure out how we want to manage
+ SCHED_DEADLINE bandwidth at the task group level. Therefore, SCHED_DEADLINE
+ uses (for now) a less sophisticated, but actually very sensible, mechanism to
+ ensure that a certain utilization cap is not exceeded in each root_domain.
+
+ Another main difference between deadline bandwidth management and RT-throttling
+ is that -deadline tasks have a bandwidth of their own (while -rt ones don't!),
+ and thus we don't need a higher-level throttling mechanism to enforce the
+ desired bandwidth.
+
+3.1 System-wide settings
+------------------------
+
+ The system-wide settings are configured under the /proc virtual file system:
+
+  * /proc/sys/kernel/sched_dl_runtime_us,
+  * /proc/sys/kernel/sched_dl_period_us,
+
+ which accept (if written) and provide (if read) the new runtime and period,
+ respectively, for each CPU in each root_domain.
+
+ This means that, for a root_domain comprising M CPUs, -deadline tasks
+ can be created as long as the sum of their bandwidths stays below:
+
+   M * (sched_dl_runtime_us / sched_dl_period_us)
+
+ It is also possible to disable this bandwidth management logic, and
+ thus be free to oversubscribe the system up to any arbitrary level.
+ This is done by writing -1 to /proc/sys/kernel/sched_dl_runtime_us.
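+
+ For instance (illustrative values only), with sched_dl_runtime_us set to
+ 400000 and sched_dl_period_us set to 1000000, a root_domain comprising 4
+ CPUs admits -deadline tasks as long as the sum of their bandwidths does
+ not exceed 4 * (400000 / 1000000) = 1.6, i.e., 160% of a single CPU.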
+
+
+3.2 Task interface
+------------------
+
+ Specifying a periodic/sporadic task that executes for a given amount of
+ runtime at each instance, and that is scheduled according to the urgency of
+ its own timing constraints, requires, in general, a way of declaring:
+  - a (maximum/typical) instance execution time,
+  - a minimum interval between consecutive instances,
+  - a time constraint by which each instance must be completed.
+
+ Therefore:
+  * a new struct sched_param2, containing all the necessary fields, is
+    provided;
+  * the new scheduling-related syscalls that manipulate it, i.e.,
+    sched_setscheduler2(), sched_setparam2() and sched_getparam2(),
+    are implemented (see the sketch below).
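+
+ Below is a minimal usage sketch. Note that the exact layout of struct
+ sched_param2, the numeric value of the SCHED_DEADLINE policy and the
+ availability of a sched_setscheduler2() wrapper depend on earlier patches
+ in this series and on the C library in use, so the snippet is only meant
+ to illustrate the intended shape of the interface (times in nanoseconds):
+
+   struct sched_param2 attr = { 0 };
+
+   attr.sched_runtime  =  10 * 1000 * 1000;   /*  10ms per instance       */
+   attr.sched_deadline = 100 * 1000 * 1000;   /* 100ms relative deadline  */
+   attr.sched_period   = 100 * 1000 * 1000;   /* 100ms minimum interval   */
+
+   if (sched_setscheduler2(0 /* self */, SCHED_DEADLINE, &attr))
+           perror("sched_setscheduler2");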
+
+
+3.3 Default behavior
+---------------------
+
+The default values for the SCHED_DEADLINE bandwidth are dl_runtime equal to
+500000 and dl_period equal to 1000000. This means that -deadline tasks can
+use at most 50% (i.e., 500000/1000000), multiplied by the number of CPUs
+that compose the root_domain, for each root_domain.
+
+When a -deadline task forks a child, the child's dl_runtime is set to 0, which
+means someone must call sched_setscheduler2() on it, or it won't even start.
+
+
+4. Future plans
+===============
+
+Still Missing:
+
+ - refinements to deadline inheritance, especially regarding the possibility
+   of retaining bandwidth isolation among non-interacting tasks. This is
+   being studied from both theoretical and practical point of views, and
+   hopefully we can have some demonstrative code soon.
+
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4
  2012-04-06  7:14 [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4 Juri Lelli
                   ` (15 preceding siblings ...)
  2012-04-06  7:14 ` [PATCH 16/16] sched: add sched_dl documentation Juri Lelli
@ 2012-04-06  8:25 ` Luca Abeni
  2012-04-07  9:25   ` Tadeus Prastowo
  2012-04-06 11:07 ` Dario Faggioli
                   ` (2 subsequent siblings)
  19 siblings, 1 reply; 129+ messages in thread
From: Luca Abeni @ 2012-04-06  8:25 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, Tadeus Prastowo

Hi all,

On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
> Hello everyone,
> 
> This is the take 4 for the SCHED_DEADLINE patchset.
[...]
> Still missing/incomplete:
>   - (c)group based bandwidth management, and maybe scheduling. It seems
>     some more discussion on what precisely we want is *really* needed
>     for this point;
>   - bandwidth inheritance (to replace deadline/priority inheritance).
>     What's in the patchset is just very few more than a simple
>     placeholder. More discussion on the right way to go is needed here.
>     Some work has already been done, but it is still not ready for
>     submission.
About BWI... A student of mine (added in cc) implemented a prototypal
bandwidth inheritance (based on an old version of SCHED_DEADLINE). It is
here:
https://github.com/eus/cbs_inheritance
(Tadeus, please correct me if I pointed to the wrong repository).

It is not for inclusion yet (it is based on an old version, it is UP
only, and it probably needs some cleanups), but it worked fine in our
tests. Note that in this patch the BWI mechanism is not bound to
rtmutexes, but inheritance is controlled through 2 syscalls (because we
used BWI for client/server interactions).
Anyway, I hope that the BWI code developed by Tadeus can be useful (or
can be directly re-used) for implementing BWI in SCHED_DEADLINE.

BTW, happy Easter to everyone!

				Luca


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4
  2012-04-06  7:14 [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4 Juri Lelli
                   ` (16 preceding siblings ...)
  2012-04-06  8:25 ` [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4 Luca Abeni
@ 2012-04-06 11:07 ` Dario Faggioli
  2012-04-07  7:52 ` Juri Lelli
  2012-04-11 14:17 ` [RFC][PATCH 00/16] sched: " Steven Rostedt
  19 siblings, 0 replies; 129+ messages in thread
From: Dario Faggioli @ 2012-04-06 11:07 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, insop.song, liming.wang

[-- Attachment #1: Type: text/plain, Size: 893 bytes --]

On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote: 
> Hello everyone,
> 
Hi from here too,

> This is the take 4 for the SCHED_DEADLINE patchset.
> 
Yeah, we did it, at last! :-P

Actually, Juri did --entirely, as I'm now busy with a completely
different kind of stuff! I therefore really want to thank him for taking
over the project, applying the review comments, updating it to the latest
branches and, finally, posting it to see if you guys are still interested
in it.

> As usual, any kind of feedback is welcome and appreciated.
> 
Yep, let us know what you think. :-)

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-06  7:14 ` [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic Juri Lelli
@ 2012-04-06 13:39   ` Hillf Danton
  2012-04-06 17:31     ` Juri Lelli
  2012-04-11 14:10     ` Steven Rostedt
  2012-04-11 16:07   ` Steven Rostedt
                     ` (4 subsequent siblings)
  5 siblings, 2 replies; 129+ messages in thread
From: Hillf Danton @ 2012-04-06 13:39 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fcheccon

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 43794 bytes --]

Hello

On Fri, Apr 6, 2012 at 3:14 PM, Juri Lelli <juri.lelli@gmail.com> wrote:
> Add dynamic migrations to SCHED_DEADLINE, so that tasks can
> be moved among CPUs when necessary. It is also possible to bind a
> task to a (set of) CPU(s), thus restricting its capability of
> migrating, or forbidding migrations at all.
>
> The very same approach used in sched_rt is utilised:
>  - -deadline tasks are kept into CPU-specific runqueues,
>  - -deadline tasks are migrated among runqueues to achieve the
>   following:
>    * on an M-CPU system the M earliest deadline ready tasks
>      are always running;
>    * affinity/cpusets settings of all the -deadline tasks is
>      always respected.
>
> Therefore, this very special form of "load balancing" is done with
> an active method, i.e., the scheduler pushes or pulls tasks between
> runqueues when they are woken up and/or (de)scheduled.
> IOW, every time a preemption occurs, the descheduled task might be sent
> to some other CPU (depending on its deadline) to continue executing
> (push). On the other hand, every time a CPU becomes idle, it might pull
> the second earliest deadline ready task from some other CPU.
>
> To enforce this, a pull operation is always attempted before taking any
> scheduling decision (pre_schedule()), as well as a push one after each
> scheduling decision (post_schedule()). In addition, when a task arrives
> or wakes up, the best CPU where to resume it is selected taking into
> account its affinity mask, the system topology, but also its deadline.
> E.g., from the scheduling point of view, the best CPU where to wake
> up (and also where to push) a task is the one which is running the task
> with the latest deadline among the M executing ones.
>
> In order to facilitate these decisions, per-runqueue "caching" of the
> deadlines of the currently running and of the first ready task is used.
> Queued but not running tasks are also parked in another rb-tree to
> speed-up pushes.
>
> Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
> Signed-off-by: Dario Faggioli <raistlin@linux.it>
> ---
>  kernel/sched_dl.c |  912 +++++++++++++++++++++++++++++++++++++++++++++++++++--
>  kernel/sched_rt.c |    2 +-
>  2 files changed, 889 insertions(+), 25 deletions(-)
>
> diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
> index 604e2bc..38edefa 100644
> --- a/kernel/sched_dl.c
> +++ b/kernel/sched_dl.c
> @@ -10,6 +10,7 @@
>  * miss some of their deadlines), and won't affect any other task.
>  *
>  * Copyright (C) 2010 Dario Faggioli <raistlin@linux.it>,
> + *                    Juri Lelli <juri.lelli@gmail.com>,
>  *                    Michael Trimarchi <michael@amarulasolutions.com>,
>  *                    Fabio Checconi <fabio@gandalf.sssup.it>
>  */
> @@ -20,6 +21,15 @@ static inline int dl_time_before(u64 a, u64 b)
>        return (s64)(a - b) < 0;
>  }
>
> +/*
> + * Tells if entity @a should preempt entity @b.
> + */
> +static inline
> +int dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
> +{
> +       return dl_time_before(a->deadline, b->deadline);
> +}
> +
>  static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
>  {
>        return container_of(dl_se, struct task_struct, dl);
> @@ -50,6 +60,153 @@ static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
>        return dl_rq->rb_leftmost == &dl_se->rb_node;
>  }
>
> +#ifdef CONFIG_SMP
> +
> +static inline int dl_overloaded(struct rq *rq)
> +{
> +       return atomic_read(&rq->rd->dlo_count);
> +}
> +
> +static inline void dl_set_overload(struct rq *rq)
> +{
> +       if (!rq->online)
> +               return;
> +
> +       cpumask_set_cpu(rq->cpu, rq->rd->dlo_mask);
> +       /*
> +        * Must be visible before the overload count is
> +        * set (as in sched_rt.c).
> +        */
> +       wmb();
> +       atomic_inc(&rq->rd->dlo_count);
> +}
> +
> +static inline void dl_clear_overload(struct rq *rq)
> +{
> +       if (!rq->online)
> +               return;
> +
> +       atomic_dec(&rq->rd->dlo_count);
> +       cpumask_clear_cpu(rq->cpu, rq->rd->dlo_mask);
> +}
> +
> +static void update_dl_migration(struct dl_rq *dl_rq)
> +{
> +       if (dl_rq->dl_nr_migratory && dl_rq->dl_nr_total > 1) {
> +               if (!dl_rq->overloaded) {
> +                       dl_set_overload(rq_of_dl_rq(dl_rq));
> +                       dl_rq->overloaded = 1;
> +               }
> +       } else if (dl_rq->overloaded) {
> +               dl_clear_overload(rq_of_dl_rq(dl_rq));
> +               dl_rq->overloaded = 0;
> +       }
> +}
> +
> +static void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> +{
> +       dl_rq = &rq_of_dl_rq(dl_rq)->dl;
> +
> +       dl_rq->dl_nr_total++;
> +       if (dl_se->nr_cpus_allowed > 1)
> +               dl_rq->dl_nr_migratory++;
> +
> +       update_dl_migration(dl_rq);

	if (dl_se->nr_cpus_allowed > 1) {
		dl_rq->dl_nr_migratory++;
		/* No change in migratory, no update of migration */
		update_dl_migration(dl_rq);
	}

> +}
> +
> +static void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> +{
> +       dl_rq = &rq_of_dl_rq(dl_rq)->dl;
> +
> +       dl_rq->dl_nr_total--;
> +       if (dl_se->nr_cpus_allowed > 1)
> +               dl_rq->dl_nr_migratory--;
> +
> +       update_dl_migration(dl_rq);

ditto

> +}
> +
> +/*
> + * The list of pushable -deadline task is not a plist, like in
> + * sched_rt.c, it is an rb-tree with tasks ordered by deadline.
> + */
> +static void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
> +{
> +       struct dl_rq *dl_rq = &rq->dl;
> +       struct rb_node **link = &dl_rq->pushable_dl_tasks_root.rb_node;
> +       struct rb_node *parent = NULL;
> +       struct task_struct *entry;
> +       int leftmost = 1;
> +
> +       BUG_ON(!RB_EMPTY_NODE(&p->pushable_dl_tasks));
> +
> +       while (*link) {
> +               parent = *link;
> +               entry = rb_entry(parent, struct task_struct,
> +                                pushable_dl_tasks);
> +               if (!dl_entity_preempt(&entry->dl, &p->dl))

		if (dl_entity_preempt(&p->dl, &entry->dl))

> +                       link = &parent->rb_left;
> +               else {
> +                       link = &parent->rb_right;
> +                       leftmost = 0;
> +               }
> +       }
> +
> +       if (leftmost)
> +               dl_rq->pushable_dl_tasks_leftmost = &p->pushable_dl_tasks;
> +
> +       rb_link_node(&p->pushable_dl_tasks, parent, link);
> +       rb_insert_color(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root);
> +}
> +
> +static void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
> +{
> +       struct dl_rq *dl_rq = &rq->dl;
> +
> +       if (RB_EMPTY_NODE(&p->pushable_dl_tasks))
> +               return;
> +
> +       if (dl_rq->pushable_dl_tasks_leftmost == &p->pushable_dl_tasks) {
> +               struct rb_node *next_node;
> +
> +               next_node = rb_next(&p->pushable_dl_tasks);
> +               dl_rq->pushable_dl_tasks_leftmost = next_node;

		dl_rq->pushable_dl_tasks_leftmost =
					rb_next(&p->pushable_dl_tasks);

> +       }
> +
> +       rb_erase(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root);
> +       RB_CLEAR_NODE(&p->pushable_dl_tasks);
> +}
> +
> +static inline int has_pushable_dl_tasks(struct rq *rq)
> +{
> +       return !RB_EMPTY_ROOT(&rq->dl.pushable_dl_tasks_root);

	return rq->dl.pushable_dl_tasks_leftmost != NULL;
> +}
> +
> +static int push_dl_task(struct rq *rq);
> +
> +#else
> +
> +static inline
> +void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
> +{
> +}
> +
> +static inline
> +void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
> +{
> +}
> +
> +static inline
> +void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> +{
> +}
> +
> +static inline
> +void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> +{
> +}
> +
> +#endif /* CONFIG_SMP */
> +
>  static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
>  static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
>  static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
> @@ -276,6 +433,14 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
>                        check_preempt_curr_dl(rq, p, 0);
>                else
>                        resched_task(rq->curr);
> +#ifdef CONFIG_SMP
> +               /*
> +                * Queueing this task back might have overloaded rq,
> +                * check if we need to kick someone away.
> +                */
> +               if (rq->dl.overloaded)
		if (has_pushable_dl_tasks(rq))

> +                       push_dl_task(rq);
> +#endif
>        }
>  unlock:
>        task_rq_unlock(rq, p, &flags);
> @@ -359,6 +524,100 @@ static void update_curr_dl(struct rq *rq)
>        }
>  }
>
> +#ifdef CONFIG_SMP
> +
> +static struct task_struct *pick_next_earliest_dl_task(struct rq *rq, int cpu);
> +
> +static inline int next_deadline(struct rq *rq)
static inline typeof(next->dl.deadline) next_deadline(struct rq *rq) ?

> +{
> +       struct task_struct *next = pick_next_earliest_dl_task(rq, rq->cpu);
> +
> +       if (next && dl_prio(next->prio))
> +               return next->dl.deadline;
> +       else
> +               return 0;
> +}
> +
> +static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
> +{
> +       struct rq *rq = rq_of_dl_rq(dl_rq);
> +
> +       if (dl_rq->earliest_dl.curr == 0 ||
> +           dl_time_before(deadline, dl_rq->earliest_dl.curr)) {
> +               /*
> +                * If the dl_rq had no -deadline tasks, or if the new task
> +                * has shorter deadline than the current one on dl_rq, we
> +                * know that the previous earliest becomes our next earliest,
> +                * as the new task becomes the earliest itself.
> +                */
> +               dl_rq->earliest_dl.next = dl_rq->earliest_dl.curr;
> +               dl_rq->earliest_dl.curr = deadline;
> +       } else if (dl_rq->earliest_dl.next == 0 ||
> +                  dl_time_before(deadline, dl_rq->earliest_dl.next)) {
> +               /*
> +                * On the other hand, if the new -deadline task has a
> +                * a later deadline than the earliest one on dl_rq, but
> +                * it is earlier than the next (if any), we must
> +                * recompute the next-earliest.
> +                */
> +               dl_rq->earliest_dl.next = next_deadline(rq);
> +       }
> +}
> +
> +static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
> +{
> +       struct rq *rq = rq_of_dl_rq(dl_rq);
> +
> +       /*
> +        * Since we may have removed our earliest (and/or next earliest)
> +        * task we must recompute them.
> +        */
> +       if (!dl_rq->dl_nr_running) {
> +               dl_rq->earliest_dl.curr = 0;
> +               dl_rq->earliest_dl.next = 0;
> +       } else {
> +               struct rb_node *leftmost = dl_rq->rb_leftmost;
> +               struct sched_dl_entity *entry;
> +
> +               entry = rb_entry(leftmost, struct sched_dl_entity, rb_node);
> +               dl_rq->earliest_dl.curr = entry->deadline;
> +               dl_rq->earliest_dl.next = next_deadline(rq);
> +       }
> +}
> +
> +#else
> +
> +static inline void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
> +static inline void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
> +
> +#endif /* CONFIG_SMP */
> +
> +static inline
> +void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
void inc_dl_task(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)

> +{
> +       int prio = dl_task_of(dl_se)->prio;
> +       u64 deadline = dl_se->deadline;
> +
> +       WARN_ON(!dl_prio(prio));
> +       dl_rq->dl_nr_running++;
> +
> +       inc_dl_deadline(dl_rq, deadline);
> +       inc_dl_migration(dl_se, dl_rq);
> +}
> +
> +static inline
> +void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
void dec_dl_task(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)

> +{
> +       int prio = dl_task_of(dl_se)->prio;
> +
> +       WARN_ON(!dl_prio(prio));
> +       WARN_ON(!dl_rq->dl_nr_running);
> +       dl_rq->dl_nr_running--;
> +
> +       dec_dl_deadline(dl_rq, dl_se->deadline);
> +       dec_dl_migration(dl_se, dl_rq);
> +}
> +
>  static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
>  {
>        struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> @@ -386,7 +645,7 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
>        rb_link_node(&dl_se->rb_node, parent, link);
>        rb_insert_color(&dl_se->rb_node, &dl_rq->rb_root);
>
> -       dl_rq->dl_nr_running++;
> +       inc_dl_tasks(dl_se, dl_rq);
>  }
>
>  static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
> @@ -406,7 +665,7 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
>        rb_erase(&dl_se->rb_node, &dl_rq->rb_root);
>        RB_CLEAR_NODE(&dl_se->rb_node);
>
> -       dl_rq->dl_nr_running--;
> +       dec_dl_tasks(dl_se, dl_rq);
>  }
>
>  static void
> @@ -444,11 +703,15 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
>                return;
>
>        enqueue_dl_entity(&p->dl, flags);
> +
> +       if (!task_current(rq, p) && p->dl.nr_cpus_allowed > 1)
> +               enqueue_pushable_dl_task(rq, p);
>  }
>
>  static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
>  {
>        dequeue_dl_entity(&p->dl);
> +       dequeue_pushable_dl_task(rq, p);
>  }
>
>  static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> @@ -480,6 +743,75 @@ static void yield_task_dl(struct rq *rq)
>        update_curr_dl(rq);
>  }
>
> +#ifdef CONFIG_SMP
> +
> +static int find_later_rq(struct task_struct *task);
> +static int latest_cpu_find(struct cpumask *span,
> +                          struct task_struct *task,
> +                          struct cpumask *later_mask);
> +
> +static int
> +select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
> +{
> +       struct task_struct *curr;
> +       struct rq *rq;
> +       int cpu;
> +
> +       if (sd_flag != SD_BALANCE_WAKE)
		why is task_cpu(p) not eligible?

> +               return smp_processor_id();
> +
> +       cpu = task_cpu(p);
> +       rq = cpu_rq(cpu);
> +
> +       rcu_read_lock();
> +       curr = ACCESS_ONCE(rq->curr); /* unlocked access */
> +
> +       /*
> +        * If we are dealing with a -deadline task, we must
> +        * decide where to wake it up.
> +        * If it has a later deadline and the current task
> +        * on this rq can't move (provided the waking task
> +        * can!) we prefer to send it somewhere else. On the
> +        * other hand, if it has a shorter deadline, we
> +        * try to make it stay here, it might be important.
> +        */
> +       if (unlikely(dl_task(rq->curr)) &&
		the above ACCESS_ONCE for what?
> +           (rq->curr->dl.nr_cpus_allowed < 2 ||
> +            dl_entity_preempt(&rq->curr->dl, &p->dl)) &&
		!dl_entity_preempt(&p->dl, &rq->curr->dl)) &&
> +           (p->dl.nr_cpus_allowed > 1)) {
> +               int target = find_later_rq(p);
> +
> +               if (target != -1)
> +                       cpu = target;
> +       }
> +       rcu_read_unlock();
> +
> +       return cpu;
> +}
> +
> +static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
> +{
> +       /*
> +        * Current can't be migrated, useless to reschedule,
> +        * let's hope p can move out.
> +        */
> +       if (rq->curr->dl.nr_cpus_allowed == 1 ||
> +           latest_cpu_find(rq->rd->span, rq->curr, NULL) == -1)
> +               return;
> +
> +       /*
> +        * p is migratable, so let's not schedule it and
> +        * see if it is pushed or pulled somewhere else.
> +        */
> +       if (p->dl.nr_cpus_allowed != 1 &&
> +           latest_cpu_find(rq->rd->span, p, NULL) != -1)
> +               return;
> +
> +       resched_task(rq->curr);
> +}
> +
> +#endif /* CONFIG_SMP */
> +
>  /*
>  * Only called when both the current and waking task are -deadline
>  * tasks.
> @@ -487,8 +819,20 @@ static void yield_task_dl(struct rq *rq)
>  static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
>                                  int flags)
>  {
> -       if (dl_time_before(p->dl.deadline, rq->curr->dl.deadline))
> +       if (dl_time_before(p->dl.deadline, rq->curr->dl.deadline)) {
	if (dl_entity_preempt(&p->dl, &rq->curr->dl))
>                resched_task(rq->curr);
> +               return;
> +       }
> +
> +#ifdef CONFIG_SMP
> +       /*
> +        * In the unlikely case current and p have the same deadline
> +        * let us try to decide what's the best thing to do...
> +        */
> +       if ((s64)(p->dl.deadline - rq->curr->dl.deadline) == 0 &&
> +           !need_resched())
please recheck !need_resched(); what if rq->curr already needs to reschedule?
> +               check_preempt_equal_dl(rq, p);
> +#endif /* CONFIG_SMP */
>  }
>
>  #ifdef CONFIG_SCHED_HRTICK
> @@ -532,10 +876,20 @@ struct task_struct *pick_next_task_dl(struct rq *rq)
>
>        p = dl_task_of(dl_se);
>        p->se.exec_start = rq->clock;
> +
> +       /* Running task will never be pushed. */
> +       if (p)
> +               dequeue_pushable_dl_task(rq, p);
> +
>  #ifdef CONFIG_SCHED_HRTICK
>        if (hrtick_enabled(rq))
>                start_hrtick_dl(rq, p);
		need to check p is valid?
>  #endif
> +
> +#ifdef CONFIG_SMP
> +       rq->post_schedule = has_pushable_dl_tasks(rq);
> +#endif /* CONFIG_SMP */
> +
>        return p;
>  }
>
> @@ -543,6 +897,9 @@ static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
>  {
>        update_curr_dl(rq);
>        p->se.exec_start = 0;
	why reset exec_start?
> +
> +       if (on_dl_rq(&p->dl) && p->dl.nr_cpus_allowed > 1)
> +               enqueue_pushable_dl_task(rq, p);
>  }
>
>  static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
> @@ -576,43 +933,410 @@ static void set_curr_task_dl(struct rq *rq)
>        struct task_struct *p = rq->curr;
>
>        p->se.exec_start = rq->clock;
> +
> +       /* You can't push away the running task */
> +       dequeue_pushable_dl_task(rq, p);
>  }
>
> -static void switched_from_dl(struct rq *rq, struct task_struct *p)
> +#ifdef CONFIG_SMP
> +
> +/* Only try algorithms three times */
> +#define DL_MAX_TRIES 3
> +
> +static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu)
>  {
> -       if (hrtimer_active(&p->dl.dl_timer))
> -               hrtimer_try_to_cancel(&p->dl.dl_timer);
> +       if (!task_running(rq, p) &&
> +           (cpu < 0 || cpumask_test_cpu(cpu, &p->cpus_allowed)) &&
> +           (p->dl.nr_cpus_allowed > 1))
> +               return 1;
> +
> +       return 0;

	if (task_running(rq, p))
		return 0;
	return cpumask_test_cpu(cpu, &p->cpus_allowed);
	that is all:)
>  }
>
> -static void switched_to_dl(struct rq *rq, struct task_struct *p)
> +/* Returns the second earliest -deadline task, NULL otherwise */
> +static struct task_struct *pick_next_earliest_dl_task(struct rq *rq, int cpu)
> +{
> +       struct rb_node *next_node = rq->dl.rb_leftmost;
> +       struct sched_dl_entity *dl_se;
> +       struct task_struct *p = NULL;
> +
> +next_node:
> +       next_node = rb_next(next_node);
> +       if (next_node) {
> +               dl_se = rb_entry(next_node, struct sched_dl_entity, rb_node);
> +               p = dl_task_of(dl_se);
> +
> +               if (pick_dl_task(rq, p, cpu))
> +                       return p;
> +
> +               goto next_node;
> +       }
> +
> +       return NULL;
> +}
> +
> +static int latest_cpu_find(struct cpumask *span,
> +                          struct task_struct *task,
> +                          struct cpumask *later_mask)
>  {
> +       const struct sched_dl_entity *dl_se = &task->dl;
> +       int cpu, found = -1, best = 0;
> +       u64 max_dl = 0;
> +
> +       for_each_cpu(cpu, span) {
	for_each_cpu_and(cpu, span, &task->cpus_allowed) {
> +               struct rq *rq = cpu_rq(cpu);
> +               struct dl_rq *dl_rq = &rq->dl;
> +
> +               if (cpumask_test_cpu(cpu, &task->cpus_allowed) &&
> +                   (!dl_rq->dl_nr_running || dl_time_before(dl_se->deadline,
> +                    dl_rq->earliest_dl.curr))) {
		please use dl_entity_preempt()
> +                       if (later_mask)
> +                               cpumask_set_cpu(cpu, later_mask);
> +                       if (!best && !dl_rq->dl_nr_running) {
> +                               best = 1;
> +                               found = cpu;
> +                       } else if (!best &&
> +                                  dl_time_before(max_dl,
> +                                                 dl_rq->earliest_dl.curr)) {
> +                               max_dl = dl_rq->earliest_dl.curr;
> +                               found = cpu;
> +                       }
> +               } else if (later_mask)
> +                       cpumask_clear_cpu(cpu, later_mask);
> +       }
> +
> +       return found;
> +}
> +
> +static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
> +
> +static int find_later_rq(struct task_struct *task)
> +{
> +       struct sched_domain *sd;
> +       struct cpumask *later_mask = __get_cpu_var(local_cpu_mask_dl);
	please check whether local_cpu_mask_dl is valid

> +       int this_cpu = smp_processor_id();
> +       int best_cpu, cpu = task_cpu(task);
> +
> +       if (task->dl.nr_cpus_allowed == 1)
> +               return -1;
> +
> +       best_cpu = latest_cpu_find(task_rq(task)->rd->span, task, later_mask);
> +       if (best_cpu == -1)
> +               return -1;
> +
>        /*
> -        * If p is throttled, don't consider the possibility
> -        * of preempting rq->curr, the check will be done right
> -        * after its runtime will get replenished.
> +        * If we are here, some target has been found,
> +        * the most suitable of which is cached in best_cpu.
> +        * This is, among the runqueues where the current tasks
> +        * have later deadlines than the task's one, the rq
> +        * with the latest possible one.
> +        *
> +        * Now we check how well this matches with task's
> +        * affinity and system topology.
> +        *
> +        * The last cpu where the task run is our first
> +        * guess, since it is most likely cache-hot there.
>         */
> -       if (unlikely(p->dl.dl_throttled))
> -               return;
> +       if (cpumask_test_cpu(cpu, later_mask))
> +               return cpu;
> +       /*
> +        * Check if this_cpu is to be skipped (i.e., it is
> +        * not in the mask) or not.
> +        */
> +       if (!cpumask_test_cpu(this_cpu, later_mask))
> +               this_cpu = -1;
> +
> +       rcu_read_lock();
> +       for_each_domain(cpu, sd) {
> +               if (sd->flags & SD_WAKE_AFFINE) {
> +
> +                       /*
> +                        * If possible, preempting this_cpu is
> +                        * cheaper than migrating.
> +                        */
> +                       if (this_cpu != -1 &&
> +                           cpumask_test_cpu(this_cpu, sched_domain_span(sd)))
> +                               return this_cpu;
> +
> +                       /*
> +                        * Last chance: if best_cpu is valid and is
> +                        * in the mask, that becomes our choice.
> +                        */
> +                       if (best_cpu < nr_cpu_ids &&
> +                           cpumask_test_cpu(best_cpu, sched_domain_span(sd)))
> +                               return best_cpu;
> +               }
> +       }
> +       rcu_read_unlock();
>
> -       if (!p->on_rq || rq->curr != p) {
> -               if (task_has_dl_policy(rq->curr))
> -                       check_preempt_curr_dl(rq, p, 0);
> -               else
> -                       resched_task(rq->curr);
> +       /*
> +        * At this point, all our guesses failed, we just return
> +        * 'something', and let the caller sort the things out.
> +        */
> +       if (this_cpu != -1)
> +               return this_cpu;
> +
> +       cpu = cpumask_any(later_mask);
> +       if (cpu < nr_cpu_ids)
> +               return cpu;
> +
> +       return -1;
> +}
> +
> +/* Locks the rq it finds */
> +static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
> +{
> +       struct rq *later_rq = NULL;
> +       int tries;
> +       int cpu;
> +
> +       for (tries = 0; tries < DL_MAX_TRIES; tries++) {
> +               cpu = find_later_rq(task);
> +
> +               if ((cpu == -1) || (cpu == rq->cpu))
> +                       break;
> +
> +               later_rq = cpu_rq(cpu);
> +
> +               /* Retry if something changed. */
> +               if (double_lock_balance(rq, later_rq)) {
> +                       if (unlikely(task_rq(task) != rq ||
> +                                    !cpumask_test_cpu(later_rq->cpu,
> +                                                      &task->cpus_allowed) ||
> +                                    task_running(rq, task) ||
> +                                    !task->se.on_rq)) {
> +                               raw_spin_unlock(&later_rq->lock);
> +                               later_rq = NULL;
> +                               break;
> +                       }
> +               }
> +
> +               /*
> +                * If the rq we found has no -deadline task, or
> +                * its earliest one has a later deadline than our
> +                * task, the rq is a good one.
> +                */
> +               if (!later_rq->dl.dl_nr_running ||
> +                   dl_time_before(task->dl.deadline,
> +                                  later_rq->dl.earliest_dl.curr))
> +                       break;
> +
> +               /* Otherwise we try again. */
> +               double_unlock_balance(rq, later_rq);
> +               later_rq = NULL;
>        }
> +
> +       return later_rq;
>  }
>
> -static void prio_changed_dl(struct rq *rq, struct task_struct *p,
> -                           int oldprio)
> +static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
>  {
> -       switched_to_dl(rq, p);
> +       struct task_struct *p;
> +
> +       if (!has_pushable_dl_tasks(rq))
> +               return NULL;
> +
> +       p = rb_entry(rq->dl.pushable_dl_tasks_leftmost,
> +                    struct task_struct, pushable_dl_tasks);
> +
> +       BUG_ON(rq->cpu != task_cpu(p));
> +       BUG_ON(task_current(rq, p));
> +       BUG_ON(p->dl.nr_cpus_allowed <= 1);
> +
> +       BUG_ON(!p->se.on_rq);
> +       BUG_ON(!dl_task(p));
> +
> +       return p;
>  }
>
> -#ifdef CONFIG_SMP
> -static int
> -select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
> +/*
> + * See if the non running -deadline tasks on this rq
> + * can be sent to some other CPU where they can preempt
> + * and start executing.
> + */
> +static int push_dl_task(struct rq *rq)
>  {
> -       return task_cpu(p);
> +       struct task_struct *next_task;
> +       struct rq *later_rq;
> +
> +       if (!rq->dl.overloaded)
> +               return 0;
> +
> +       next_task = pick_next_pushable_dl_task(rq);
> +       if (!next_task)
> +               return 0;
> +
> +retry:
> +       if (unlikely(next_task == rq->curr)) {
> +               WARN_ON(1);
> +               return 0;
> +       }
> +
> +       /*
> +        * If next_task preempts rq->curr, and rq->curr
> +        * can move away, it makes sense to just reschedule
> +        * without going further in pushing next_task.
> +        */
> +       if (dl_task(rq->curr) &&
> +           dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) &&
> +           rq->curr->dl.nr_cpus_allowed > 1) {
> +               resched_task(rq->curr);
> +               return 0;
> +       }
> +
> +       /* We might release rq lock */
> +       get_task_struct(next_task);
> +
> +       /* Will lock the rq it'll find */
> +       later_rq = find_lock_later_rq(next_task, rq);
> +       if (!later_rq) {
> +               struct task_struct *task;
> +
> +               /*
> +                * We must check all this again, since
> +                * find_lock_later_rq releases rq->lock and it is
> +                * then possible that next_task has migrated.
> +                */
> +               task = pick_next_pushable_dl_task(rq);
> +               if (task_cpu(next_task) == rq->cpu && task == next_task) {
> +                       /*
> +                        * The task is still there. We don't try
> +                        * again, some other cpu will pull it when ready.
> +                        */
> +                       dequeue_pushable_dl_task(rq, next_task);
> +                       goto out;
> +               }
> +
> +               if (!task)
> +                       /* No more tasks */
> +                       goto out;
> +
> +               put_task_struct(next_task);
> +               next_task = task;
> +               goto retry;
> +       }
> +
> +       deactivate_task(rq, next_task, 0);
> +       set_task_cpu(next_task, later_rq->cpu);
> +       activate_task(later_rq, next_task, 0);
> +
> +       resched_task(later_rq->curr);
> +
> +       double_unlock_balance(rq, later_rq);
> +
> +out:
> +       put_task_struct(next_task);
> +
> +       return 1;
> +}
> +
> +static void push_dl_tasks(struct rq *rq)
> +{
> +       /* Terminates as it moves a -deadline task */
> +       while (push_dl_task(rq))
> +               ;
> +}
> +
> +static int pull_dl_task(struct rq *this_rq)
> +{
> +       int this_cpu = this_rq->cpu, ret = 0, cpu;
> +       struct task_struct *p;
> +       struct rq *src_rq;
> +       u64 dmin = LONG_MAX;
> +
> +       if (likely(!dl_overloaded(this_rq)))
> +               return 0;
> +
> +       for_each_cpu(cpu, this_rq->rd->dlo_mask) {
> +               if (this_cpu == cpu)
> +                       continue;
> +
> +               src_rq = cpu_rq(cpu);
> +
> +               /*
> +                * It looks racy, and it is! However, as in sched_rt.c,
> +                * we are fine with this.
> +                */
> +               if (this_rq->dl.dl_nr_running &&
> +                   dl_time_before(this_rq->dl.earliest_dl.curr,
> +                                  src_rq->dl.earliest_dl.next))
> +                       continue;
> +
> +               /* Might drop this_rq->lock */
> +               double_lock_balance(this_rq, src_rq);
> +
> +               /*
> +                * If there are no more pullable tasks on the
> +                * rq, we're done with it.
> +                */
> +               if (src_rq->dl.dl_nr_running <= 1)
> +                       goto skip;
> +
> +               p = pick_next_earliest_dl_task(src_rq, this_cpu);
> +
> +               /*
> +                * We found a task to be pulled if:
> +                *  - it preempts our current (if there's one),
> +                *  - it will preempt the last one we pulled (if any).
> +                */
> +               if (p && dl_time_before(p->dl.deadline, dmin) &&
> +                   (!this_rq->dl.dl_nr_running ||
> +                    dl_time_before(p->dl.deadline,
> +                                   this_rq->dl.earliest_dl.curr))) {
> +                       WARN_ON(p == src_rq->curr);
> +                       WARN_ON(!p->se.on_rq);
> +
> +                       /*
> +                        * Then we pull iff p has actually an earlier
> +                        * deadline than the current task of its runqueue.
> +                        */
> +                       if (dl_time_before(p->dl.deadline,
> +                                          src_rq->curr->dl.deadline))
> +                               goto skip;
> +
> +                       ret = 1;
> +
> +                       deactivate_task(src_rq, p, 0);
> +                       set_task_cpu(p, this_cpu);
> +                       activate_task(this_rq, p, 0);
> +                       dmin = p->dl.deadline;
> +
> +                       /* Is there any other task even earlier? */
> +               }
> +skip:
> +               double_unlock_balance(this_rq, src_rq);
> +       }
> +
> +       return ret;
> +}
> +
> +static void pre_schedule_dl(struct rq *rq, struct task_struct *prev)
> +{
> +       /* Try to pull other tasks here */
> +       if (dl_task(prev))
> +               pull_dl_task(rq);
> +}
> +
> +static void post_schedule_dl(struct rq *rq)
> +{
> +       push_dl_tasks(rq);
> +}
> +
> +/*
> + * Since the task is not running and a reschedule is not going to happen
> + * anytime soon on its runqueue, we try pushing it away now.
> + */
> +static void task_woken_dl(struct rq *rq, struct task_struct *p)
> +{
> +       if (!task_running(rq, p) &&
> +           !test_tsk_need_resched(rq->curr) &&
> +           has_pushable_dl_tasks(rq) &&
> +           p->dl.nr_cpus_allowed > 1 &&
> +           dl_task(rq->curr) &&
> +           (rq->curr->dl.nr_cpus_allowed < 2 ||
> +            dl_entity_preempt(&rq->curr->dl, &p->dl))) {
> +               push_dl_tasks(rq);
> +       }
>  }
>
>  static void set_cpus_allowed_dl(struct task_struct *p,
> @@ -622,10 +1346,145 @@ static void set_cpus_allowed_dl(struct task_struct *p,
>
>        BUG_ON(!dl_task(p));
>
> +       /*
> +        * Update only if the task is actually running (i.e.,
> +        * it is on the rq AND it is not throttled).
> +        */
> +       if (on_dl_rq(&p->dl) && (weight != p->dl.nr_cpus_allowed)) {
> +               struct rq *rq = task_rq(p);
> +
> +               if (!task_current(rq, p)) {
> +                       /*
> +                        * If the task was on the pushable list,
> +                        * make sure it stays there only if the new
> +                        * mask allows that.
> +                        */
> +                       if (p->dl.nr_cpus_allowed > 1)
> +                               dequeue_pushable_dl_task(rq, p);
> +
> +                       if (weight > 1)
> +                               enqueue_pushable_dl_task(rq, p);
> +               }
> +
> +               if ((p->dl.nr_cpus_allowed <= 1) && (weight > 1)) {
> +                       rq->dl.dl_nr_migratory++;
> +               } else if ((p->dl.nr_cpus_allowed > 1) && (weight <= 1)) {
> +                       BUG_ON(!rq->dl.dl_nr_migratory);
> +                       rq->dl.dl_nr_migratory--;
> +               }
> +
> +               update_dl_migration(&rq->dl);
> +       }
> +
>        cpumask_copy(&p->cpus_allowed, new_mask);
>        p->dl.nr_cpus_allowed = weight;
>  }
> +
> +/* Assumes rq->lock is held */
> +static void rq_online_dl(struct rq *rq)
> +{
> +       if (rq->dl.overloaded)
> +               dl_set_overload(rq);
> +}
> +
> +/* Assumes rq->lock is held */
> +static void rq_offline_dl(struct rq *rq)
> +{
> +       if (rq->dl.overloaded)
> +               dl_clear_overload(rq);
> +}
> +
> +static inline void init_sched_dl_class(void)
> +{
> +       unsigned int i;
> +
> +       for_each_possible_cpu(i)
> +               zalloc_cpumask_var_node(&per_cpu(local_cpu_mask_dl, i),
> +                                       GFP_KERNEL, cpu_to_node(i));
> +}
> +
> +#endif /* CONFIG_SMP */
> +
> +static void switched_from_dl(struct rq *rq, struct task_struct *p)
> +{
> +       if (hrtimer_active(&p->dl.dl_timer) && !dl_policy(p->policy))
> +               hrtimer_try_to_cancel(&p->dl.dl_timer);
> +
> +#ifdef CONFIG_SMP
> +       /*
> +        * Since this might be the only -deadline task on the rq,
> +        * this is the right place to try to pull some other one
> +        * from an overloaded cpu, if any.
> +        */
> +       if (!rq->dl.dl_nr_running)
> +               pull_dl_task(rq);
>  #endif
> +}
> +
> +/*
> + * When switching to -deadline, we may overload the rq, then
> + * we try to push someone off, if possible.
> + */
> +static void switched_to_dl(struct rq *rq, struct task_struct *p)
> +{
> +       int check_resched = 1;
> +
> +       /*
> +        * If p is throttled, don't consider the possibility
> +        * of preempting rq->curr, the check will be done right
> +        * after its runtime will get replenished.
> +        */
> +       if (unlikely(p->dl.dl_throttled))
> +               return;
> +
> +       if (!p->on_rq || rq->curr != p) {
> +#ifdef CONFIG_SMP
> +               if (rq->dl.overloaded && push_dl_task(rq) && rq != task_rq(p))
> +                       /* Only reschedule if pushing failed */
> +                       check_resched = 0;
> +#endif /* CONFIG_SMP */
> +               if (check_resched && task_has_dl_policy(rq->curr))
> +                       check_preempt_curr_dl(rq, p, 0);
> +       }
> +}
> +
> +/*
> + * If the scheduling parameters of a -deadline task changed,
> + * a push or pull operation might be needed.
> + */
> +static void prio_changed_dl(struct rq *rq, struct task_struct *p,
> +                           int oldprio)
> +{
> +       if (p->on_rq || rq->curr == p) {
> +#ifdef CONFIG_SMP
> +               /*
> +                * This might be too much, but unfortunately
> +                * we don't have the old deadline value, and
> +                * we can't argue if the task is increasing
> +                * or lowering its prio, so...
> +                */
> +               if (!rq->dl.overloaded)
> +                       pull_dl_task(rq);
> +
> +               /*
> +                * If we now have an earlier deadline task than p,
> +                * then reschedule, provided p is still on this
> +                * runqueue.
> +                */
> +               if (dl_time_before(rq->dl.earliest_dl.curr, p->dl.deadline) &&
> +                   rq->curr == p)
> +                       resched_task(p);
> +#else
> +               /*
> +                * Again, we don't know if p has an earlier
> +                * or later deadline, so let's blindly set a
> +                * (maybe not needed) rescheduling point.
> +                */
> +               resched_task(p);
> +#endif /* CONFIG_SMP */
> +       } else
> +               switched_to_dl(rq, p);
> +}
>
>  static const struct sched_class dl_sched_class = {
>        .next                   = &rt_sched_class,
> @@ -642,6 +1501,11 @@ static const struct sched_class dl_sched_class = {
>        .select_task_rq         = select_task_rq_dl,
>
>        .set_cpus_allowed       = set_cpus_allowed_dl,
> +       .rq_online              = rq_online_dl,
> +       .rq_offline             = rq_offline_dl,
> +       .pre_schedule           = pre_schedule_dl,
> +       .post_schedule          = post_schedule_dl,
> +       .task_woken             = task_woken_dl,
>  #endif
>
>        .set_curr_task          = set_curr_task_dl,
> diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> index 4b09704..7b609bc 100644
> --- a/kernel/sched_rt.c
> +++ b/kernel/sched_rt.c
> @@ -1590,7 +1590,7 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
>            !test_tsk_need_resched(rq->curr) &&
>            has_pushable_tasks(rq) &&
>            p->rt.nr_cpus_allowed > 1 &&
> -           rt_task(rq->curr) &&
> +           (dl_task(rq->curr) || rt_task(rq->curr)) &&
>            (rq->curr->rt.nr_cpus_allowed < 2 ||
>             rq->curr->prio <= p->prio))
>                push_rt_tasks(rq);
> --
> 1.7.5.4
>

Would you please mail me, in attachment, a monolithic patch of this work?

Thanks
-hd

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-06 13:39   ` Hillf Danton
@ 2012-04-06 17:31     ` Juri Lelli
  2012-04-07  2:32       ` Hillf Danton
  2012-04-11 14:10     ` Steven Rostedt
  1 sibling, 1 reply; 129+ messages in thread
From: Juri Lelli @ 2012-04-06 17:31 UTC (permalink / raw)
  To: Hillf Danton
  Cc: peterz, tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fcheccon

Hi,

On 04/06/2012 03:39 PM, Hillf Danton wrote:
> Hello
>
> On Fri, Apr 6, 2012 at 3:14 PM, Juri Lelli<juri.lelli@gmail.com>  wrote:
>> Add dynamic migrations to SCHED_DEADLINE, so that tasks can
>> be moved among CPUs when necessary. It is also possible to bind a
>> task to a (set of) CPU(s), thus restricting its capability of
>> migrating, or forbidding migrations at all.
>>
>> The very same approach used in sched_rt is utilised:
>>   - -deadline tasks are kept into CPU-specific runqueues,
>>   - -deadline tasks are migrated among runqueues to achieve the
>>    following:
>>     * on an M-CPU system the M earliest deadline ready tasks
>>       are always running;
>>     * affinity/cpusets settings of all the -deadline tasks is
>>       always respected.
>>
>> Therefore, this very special form of "load balancing" is done with
>> an active method, i.e., the scheduler pushes or pulls tasks between
>> runqueues when they are woken up and/or (de)scheduled.
>> IOW, every time a preemption occurs, the descheduled task might be sent
>> to some other CPU (depending on its deadline) to continue executing
>> (push). On the other hand, every time a CPU becomes idle, it might pull
>> the second earliest deadline ready task from some other CPU.
>>
>> To enforce this, a pull operation is always attempted before taking any
>> scheduling decision (pre_schedule()), as well as a push one after each
>> scheduling decision (post_schedule()). In addition, when a task arrives
>> or wakes up, the best CPU where to resume it is selected taking into
>> account its affinity mask, the system topology, but also its deadline.
>> E.g., from the scheduling point of view, the best CPU where to wake
>> up (and also where to push) a task is the one which is running the task
>> with the latest deadline among the M executing ones.
>>
>> In order to facilitate these decisions, per-runqueue "caching" of the
>> deadlines of the currently running and of the first ready task is used.
>> Queued but not running tasks are also parked in another rb-tree to
>> speed-up pushes.
>>
>> Signed-off-by: Juri Lelli<juri.lelli@gmail.com>
>> Signed-off-by: Dario Faggioli<raistlin@linux.it>
>> ---
>>   kernel/sched_dl.c |  912 +++++++++++++++++++++++++++++++++++++++++++++++++++--
>>   kernel/sched_rt.c |    2 +-
>>   2 files changed, 889 insertions(+), 25 deletions(-)
>>
>> diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
>> index 604e2bc..38edefa 100644
>> --- a/kernel/sched_dl.c
>> +++ b/kernel/sched_dl.c
>> @@ -10,6 +10,7 @@
>>   * miss some of their deadlines), and won't affect any other task.
>>   *
>>   * Copyright (C) 2010 Dario Faggioli<raistlin@linux.it>,
>> + *                    Juri Lelli<juri.lelli@gmail.com>,
>>   *                    Michael Trimarchi<michael@amarulasolutions.com>,
>>   *                    Fabio Checconi<fabio@gandalf.sssup.it>
>>   */
>> @@ -20,6 +21,15 @@ static inline int dl_time_before(u64 a, u64 b)
>>         return (s64)(a - b)<  0;
>>   }
>>
>> +/*
>> + * Tells if entity @a should preempt entity @b.
>> + */
>> +static inline
>> +int dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
>> +{
>> +       return dl_time_before(a->deadline, b->deadline);
>> +}
>> +
>>   static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
>>   {
>>         return container_of(dl_se, struct task_struct, dl);
>> @@ -50,6 +60,153 @@ static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
>>         return dl_rq->rb_leftmost ==&dl_se->rb_node;
>>   }
>>
>> +#ifdef CONFIG_SMP
>> +
>> +static inline int dl_overloaded(struct rq *rq)
>> +{
>> +       return atomic_read(&rq->rd->dlo_count);
>> +}
>> +
>> +static inline void dl_set_overload(struct rq *rq)
>> +{
>> +       if (!rq->online)
>> +               return;
>> +
>> +       cpumask_set_cpu(rq->cpu, rq->rd->dlo_mask);
>> +       /*
>> +        * Must be visible before the overload count is
>> +        * set (as in sched_rt.c).
>> +        */
>> +       wmb();
>> +       atomic_inc(&rq->rd->dlo_count);
>> +}
>> +
>> +static inline void dl_clear_overload(struct rq *rq)
>> +{
>> +       if (!rq->online)
>> +               return;
>> +
>> +       atomic_dec(&rq->rd->dlo_count);
>> +       cpumask_clear_cpu(rq->cpu, rq->rd->dlo_mask);
>> +}
>> +
>> +static void update_dl_migration(struct dl_rq *dl_rq)
>> +{
>> +       if (dl_rq->dl_nr_migratory&&  dl_rq->dl_nr_total>  1) {
>> +               if (!dl_rq->overloaded) {
>> +                       dl_set_overload(rq_of_dl_rq(dl_rq));
>> +                       dl_rq->overloaded = 1;
>> +               }
>> +       } else if (dl_rq->overloaded) {
>> +               dl_clear_overload(rq_of_dl_rq(dl_rq));
>> +               dl_rq->overloaded = 0;
>> +       }
>> +}
>> +
>> +static void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>> +{
>> +       dl_rq =&rq_of_dl_rq(dl_rq)->dl;
>> +
>> +       dl_rq->dl_nr_total++;
>> +       if (dl_se->nr_cpus_allowed>  1)
>> +               dl_rq->dl_nr_migratory++;
>> +
>> +       update_dl_migration(dl_rq);
>
> 	if (dl_se->nr_cpus_allowed>  1) {
> 		dl_rq->dl_nr_migratory++;
> 		/* No change in migratory, no update of migration */
> 		update_dl_migration(dl_rq);
> 	}
>

I think the code is probably cleaner as it is written now and aligned with
sched_rt.
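
For reference, this is roughly what sched_rt.c does today (quoting from
memory, so take it as a sketch rather than verbatim), and I'd like to keep
the two paths as similar as possible:

static void inc_rt_migration(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
{
	struct task_struct *p;

	if (!rt_entity_is_task(rt_se))
		return;

	p = rt_task_of(rt_se);
	rt_rq = &rq_of_rt_rq(rt_rq)->rt;

	rt_rq->rt_nr_total++;
	if (p->rt.nr_cpus_allowed > 1)
		rt_rq->rt_nr_migratory++;

	update_rt_migration(rt_rq);
}

i.e., update_rt_migration() is called unconditionally there as well.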
   
>> +}
>> +
>> +static void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>> +{
>> +       dl_rq =&rq_of_dl_rq(dl_rq)->dl;
>> +
>> +       dl_rq->dl_nr_total--;
>> +       if (dl_se->nr_cpus_allowed>  1)
>> +               dl_rq->dl_nr_migratory--;
>> +
>> +       update_dl_migration(dl_rq);
>
> ditto
>

As above.

>> +}
>> +
>> +/*
>> + * The list of pushable -deadline task is not a plist, like in
>> + * sched_rt.c, it is an rb-tree with tasks ordered by deadline.
>> + */
>> +static void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
>> +{
>> +       struct dl_rq *dl_rq =&rq->dl;
>> +       struct rb_node **link =&dl_rq->pushable_dl_tasks_root.rb_node;
>> +       struct rb_node *parent = NULL;
>> +       struct task_struct *entry;
>> +       int leftmost = 1;
>> +
>> +       BUG_ON(!RB_EMPTY_NODE(&p->pushable_dl_tasks));
>> +
>> +       while (*link) {
>> +               parent = *link;
>> +               entry = rb_entry(parent, struct task_struct,
>> +                                pushable_dl_tasks);
>> +               if (!dl_entity_preempt(&entry->dl,&p->dl))
>
> 		if (dl_entity_preempt(&p->dl,&entry->dl))
>

Any specific reason to reverse the condition?

>> +                       link =&parent->rb_left;
>> +               else {
>> +                       link =&parent->rb_right;
>> +                       leftmost = 0;
>> +               }
>> +       }
>> +
>> +       if (leftmost)
>> +               dl_rq->pushable_dl_tasks_leftmost =&p->pushable_dl_tasks;
>> +
>> +       rb_link_node(&p->pushable_dl_tasks, parent, link);
>> +       rb_insert_color(&p->pushable_dl_tasks,&dl_rq->pushable_dl_tasks_root);
>> +}
>> +
>> +static void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
>> +{
>> +       struct dl_rq *dl_rq =&rq->dl;
>> +
>> +       if (RB_EMPTY_NODE(&p->pushable_dl_tasks))
>> +               return;
>> +
>> +       if (dl_rq->pushable_dl_tasks_leftmost ==&p->pushable_dl_tasks) {
>> +               struct rb_node *next_node;
>> +
>> +               next_node = rb_next(&p->pushable_dl_tasks);
>> +               dl_rq->pushable_dl_tasks_leftmost = next_node;
>
> 		dl_rq->pushable_dl_tasks_leftmost =
> 					rb_next(&p->pushable_dl_tasks);
>

Yes, but mine is probably cleaner :-).

>> +       }
>> +
>> +       rb_erase(&p->pushable_dl_tasks,&dl_rq->pushable_dl_tasks_root);
>> +       RB_CLEAR_NODE(&p->pushable_dl_tasks);
>> +}
>> +
>> +static inline int has_pushable_dl_tasks(struct rq *rq)
>> +{
>> +       return !RB_EMPTY_ROOT(&rq->dl.pushable_dl_tasks_root);
>
> 	return rq->dl.pushable_dl_tasks_leftmost != NULL;

Equivalent?

>> +}
>> +
>> +static int push_dl_task(struct rq *rq);
>> +
>> +#else
>> +
>> +static inline
>> +void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
>> +{
>> +}
>> +
>> +static inline
>> +void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
>> +{
>> +}
>> +
>> +static inline
>> +void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>> +{
>> +}
>> +
>> +static inline
>> +void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>> +{
>> +}
>> +
>> +#endif /* CONFIG_SMP */
>> +
>>   static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
>>   static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
>>   static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
>> @@ -276,6 +433,14 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
>>                         check_preempt_curr_dl(rq, p, 0);
>>                 else
>>                         resched_task(rq->curr);
>> +#ifdef CONFIG_SMP
>> +               /*
>> +                * Queueing this task back might have overloaded rq,
>> +                * check if we need to kick someone away.
>> +                */
>> +               if (rq->dl.overloaded)
> 		if (has_pushable_dl_tasks(rq))
>

Ok, better.
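
So the tail of dl_task_timer() would become (untested, but it is just the
one-line change you suggest):

#ifdef CONFIG_SMP
	/*
	 * Queueing this task back might have overloaded rq,
	 * check if we need to kick someone away.
	 */
	if (has_pushable_dl_tasks(rq))
		push_dl_task(rq);
#endif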

>> +                       push_dl_task(rq);
>> +#endif
>>         }
>>   unlock:
>>         task_rq_unlock(rq, p,&flags);
>> @@ -359,6 +524,100 @@ static void update_curr_dl(struct rq *rq)
>>         }
>>   }
>>
>> +#ifdef CONFIG_SMP
>> +
>> +static struct task_struct *pick_next_earliest_dl_task(struct rq *rq, int cpu);
>> +
>> +static inline int next_deadline(struct rq *rq)
> static inline typeof(next->dl.deadline) next_deadline(struct rq *rq) ?
>

Right!
static inline u64 next_deadline(struct rq *rq)
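
I.e., the helper simply becomes:

static inline u64 next_deadline(struct rq *rq)
{
	struct task_struct *next = pick_next_earliest_dl_task(rq, rq->cpu);

	if (next && dl_prio(next->prio))
		return next->dl.deadline;
	else
		return 0;
}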

>> +{
>> +       struct task_struct *next = pick_next_earliest_dl_task(rq, rq->cpu);
>> +
>> +       if (next&&  dl_prio(next->prio))
>> +               return next->dl.deadline;
>> +       else
>> +               return 0;
>> +}
>> +
>> +static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
>> +{
>> +       struct rq *rq = rq_of_dl_rq(dl_rq);
>> +
>> +       if (dl_rq->earliest_dl.curr == 0 ||
>> +           dl_time_before(deadline, dl_rq->earliest_dl.curr)) {
>> +               /*
>> +                * If the dl_rq had no -deadline tasks, or if the new task
>> +                * has shorter deadline than the current one on dl_rq, we
>> +                * know that the previous earliest becomes our next earliest,
>> +                * as the new task becomes the earliest itself.
>> +                */
>> +               dl_rq->earliest_dl.next = dl_rq->earliest_dl.curr;
>> +               dl_rq->earliest_dl.curr = deadline;
>> +       } else if (dl_rq->earliest_dl.next == 0 ||
>> +                  dl_time_before(deadline, dl_rq->earliest_dl.next)) {
>> +               /*
>> +                * On the other hand, if the new -deadline task has a
>> +                * a later deadline than the earliest one on dl_rq, but
>> +                * it is earlier than the next (if any), we must
>> +                * recompute the next-earliest.
>> +                */
>> +               dl_rq->earliest_dl.next = next_deadline(rq);
>> +       }
>> +}
>> +
>> +static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
>> +{
>> +       struct rq *rq = rq_of_dl_rq(dl_rq);
>> +
>> +       /*
>> +        * Since we may have removed our earliest (and/or next earliest)
>> +        * task we must recompute them.
>> +        */
>> +       if (!dl_rq->dl_nr_running) {
>> +               dl_rq->earliest_dl.curr = 0;
>> +               dl_rq->earliest_dl.next = 0;
>> +       } else {
>> +               struct rb_node *leftmost = dl_rq->rb_leftmost;
>> +               struct sched_dl_entity *entry;
>> +
>> +               entry = rb_entry(leftmost, struct sched_dl_entity, rb_node);
>> +               dl_rq->earliest_dl.curr = entry->deadline;
>> +               dl_rq->earliest_dl.next = next_deadline(rq);
>> +       }
>> +}
>> +
>> +#else
>> +
>> +static inline void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
>> +static inline void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
>> +
>> +#endif /* CONFIG_SMP */
>> +
>> +static inline
>> +void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> void inc_dl_task(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>

Well, I see this as "increment the number of -dl tasks"..

>> +{
>> +       int prio = dl_task_of(dl_se)->prio;
>> +       u64 deadline = dl_se->deadline;
>> +
>> +       WARN_ON(!dl_prio(prio));
>> +       dl_rq->dl_nr_running++;
>> +
>> +       inc_dl_deadline(dl_rq, deadline);
>> +       inc_dl_migration(dl_se, dl_rq);
>> +}
>> +
>> +static inline
>> +void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> void dec_dl_task(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>

Like above.

>> +{
>> +       int prio = dl_task_of(dl_se)->prio;
>> +
>> +       WARN_ON(!dl_prio(prio));
>> +       WARN_ON(!dl_rq->dl_nr_running);
>> +       dl_rq->dl_nr_running--;
>> +
>> +       dec_dl_deadline(dl_rq, dl_se->deadline);
>> +       dec_dl_migration(dl_se, dl_rq);
>> +}
>> +
>>   static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
>>   {
>>         struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>> @@ -386,7 +645,7 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
>>         rb_link_node(&dl_se->rb_node, parent, link);
>>         rb_insert_color(&dl_se->rb_node,&dl_rq->rb_root);
>>
>> -       dl_rq->dl_nr_running++;
>> +       inc_dl_tasks(dl_se, dl_rq);
>>   }
>>
>>   static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
>> @@ -406,7 +665,7 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
>>         rb_erase(&dl_se->rb_node,&dl_rq->rb_root);
>>         RB_CLEAR_NODE(&dl_se->rb_node);
>>
>> -       dl_rq->dl_nr_running--;
>> +       dec_dl_tasks(dl_se, dl_rq);
>>   }
>>
>>   static void
>> @@ -444,11 +703,15 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
>>                 return;
>>
>>         enqueue_dl_entity(&p->dl, flags);
>> +
>> +       if (!task_current(rq, p)&&  p->dl.nr_cpus_allowed>  1)
>> +               enqueue_pushable_dl_task(rq, p);
>>   }
>>
>>   static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
>>   {
>>         dequeue_dl_entity(&p->dl);
>> +       dequeue_pushable_dl_task(rq, p);
>>   }
>>
>>   static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
>> @@ -480,6 +743,75 @@ static void yield_task_dl(struct rq *rq)
>>         update_curr_dl(rq);
>>   }
>>
>> +#ifdef CONFIG_SMP
>> +
>> +static int find_later_rq(struct task_struct *task);
>> +static int latest_cpu_find(struct cpumask *span,
>> +                          struct task_struct *task,
>> +                          struct cpumask *later_mask);
>> +
>> +static int
>> +select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
>> +{
>> +       struct task_struct *curr;
>> +       struct rq *rq;
>> +       int cpu;
>> +
>> +       if (sd_flag != SD_BALANCE_WAKE)
> 		why is task_cpu(p) not eligible?
>

Right, I'll change this.

>> +               return smp_processor_id();
>> +
>> +       cpu = task_cpu(p);
>> +       rq = cpu_rq(cpu);
>> +
>> +       rcu_read_lock();
>> +       curr = ACCESS_ONCE(rq->curr); /* unlocked access */
>> +
>> +       /*
>> +        * If we are dealing with a -deadline task, we must
>> +        * decide where to wake it up.
>> +        * If it has a later deadline and the current task
>> +        * on this rq can't move (provided the waking task
>> +        * can!) we prefer to send it somewhere else. On the
>> +        * other hand, if it has a shorter deadline, we
>> +        * try to make it stay here, it might be important.
>> +        */
>> +       if (unlikely(dl_task(rq->curr))&&
> 		the above ACCESS_ONCE for what?

Yes, I'll change this.
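
Putting this and the task_cpu(p) point above together, select_task_rq_dl()
would look more or less like below (untested sketch; the preemption
condition is kept as in the current patch for now):

static int
select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
{
	struct task_struct *curr;
	struct rq *rq;
	int cpu;

	cpu = task_cpu(p);

	/* For anything but wake ups, just return the task_cpu */
	if (sd_flag != SD_BALANCE_WAKE)
		goto out;

	rq = cpu_rq(cpu);

	rcu_read_lock();
	curr = ACCESS_ONCE(rq->curr); /* unlocked access */

	/*
	 * Same policy as before, but consistently using the
	 * cached curr pointer.
	 */
	if (unlikely(dl_task(curr)) &&
	    (curr->dl.nr_cpus_allowed < 2 ||
	     dl_entity_preempt(&curr->dl, &p->dl)) &&
	    (p->dl.nr_cpus_allowed > 1)) {
		int target = find_later_rq(p);

		if (target != -1)
			cpu = target;
	}
	rcu_read_unlock();

out:
	return cpu;
}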

>> +           (rq->curr->dl.nr_cpus_allowed<  2 ||
>> +            dl_entity_preempt(&rq->curr->dl,&p->dl))&&
> 		!dl_entity_preempt(&p->dl,&rq->curr->dl))&&

As above?

>> +           (p->dl.nr_cpus_allowed>  1)) {
>> +               int target = find_later_rq(p);
>> +
>> +               if (target != -1)
>> +                       cpu = target;
>> +       }
>> +       rcu_read_unlock();
>> +
>> +       return cpu;
>> +}
>> +
>> +static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
>> +{
>> +       /*
>> +        * Current can't be migrated, useless to reschedule,
>> +        * let's hope p can move out.
>> +        */
>> +       if (rq->curr->dl.nr_cpus_allowed == 1 ||
>> +           latest_cpu_find(rq->rd->span, rq->curr, NULL) == -1)
>> +               return;
>> +
>> +       /*
>> +        * p is migratable, so let's not schedule it and
>> +        * see if it is pushed or pulled somewhere else.
>> +        */
>> +       if (p->dl.nr_cpus_allowed != 1&&
>> +           latest_cpu_find(rq->rd->span, p, NULL) != -1)
>> +               return;
>> +
>> +       resched_task(rq->curr);
>> +}
>> +
>> +#endif /* CONFIG_SMP */
>> +
>>   /*
>>   * Only called when both the current and waking task are -deadline
>>   * tasks.
>> @@ -487,8 +819,20 @@ static void yield_task_dl(struct rq *rq)
>>   static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
>>                                   int flags)
>>   {
>> -       if (dl_time_before(p->dl.deadline, rq->curr->dl.deadline))
>> +       if (dl_time_before(p->dl.deadline, rq->curr->dl.deadline)) {
> 	if (dl_entity_preempt(&p->dl,&rq->curr->dl))

Ok.

>>                 resched_task(rq->curr);
>> +               return;
>> +       }
>> +
>> +#ifdef CONFIG_SMP
>> +       /*
>> +        * In the unlikely case current and p have the same deadline
>> +        * let us try to decide what's the best thing to do...
>> +        */
>> +       if ((s64)(p->dl.deadline - rq->curr->dl.deadline) == 0&&
>> +           !need_resched())
> please recheck !need_resched(), say rq->curr need reschedule?

Sorry, I don't get this..

>> +               check_preempt_equal_dl(rq, p);
>> +#endif /* CONFIG_SMP */
>>   }
>>
>>   #ifdef CONFIG_SCHED_HRTICK
>> @@ -532,10 +876,20 @@ struct task_struct *pick_next_task_dl(struct rq *rq)
>>
>>         p = dl_task_of(dl_se);
>>         p->se.exec_start = rq->clock;
>> +
>> +       /* Running task will never be pushed. */
>> +       if (p)
>> +               dequeue_pushable_dl_task(rq, p);
>> +
>>   #ifdef CONFIG_SCHED_HRTICK
>>         if (hrtick_enabled(rq))
>>                 start_hrtick_dl(rq, p);
> 		need to check p is valid?

I'll restructure pick_next_task_dl() as in sched_rt.
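
Something along these lines (untested; pick_next_dl_entity() here stands
for whatever helper currently returns the leftmost entity, that part is
unchanged). The early return also makes the p check you mention moot:

static struct task_struct *pick_next_task_dl(struct rq *rq)
{
	struct sched_dl_entity *dl_se;
	struct task_struct *p;
	struct dl_rq *dl_rq = &rq->dl;

	if (unlikely(!dl_rq->dl_nr_running))
		return NULL;

	dl_se = pick_next_dl_entity(rq, dl_rq);
	BUG_ON(!dl_se);

	p = dl_task_of(dl_se);
	p->se.exec_start = rq->clock;

	/* Running task will never be pushed. */
	dequeue_pushable_dl_task(rq, p);

#ifdef CONFIG_SCHED_HRTICK
	if (hrtick_enabled(rq))
		start_hrtick_dl(rq, p);
#endif

#ifdef CONFIG_SMP
	rq->post_schedule = has_pushable_dl_tasks(rq);
#endif /* CONFIG_SMP */

	return p;
}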

>>   #endif
>> +
>> +#ifdef CONFIG_SMP
>> +       rq->post_schedule = has_pushable_dl_tasks(rq);
>> +#endif /* CONFIG_SMP */
>> +
>>         return p;
>>   }
>>
>> @@ -543,6 +897,9 @@ static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
>>   {
>>         update_curr_dl(rq);
>>         p->se.exec_start = 0;
> 	why reset exec_start?

Good point, I'll probably remove this.
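
If it goes away, the function simply shrinks to:

static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
{
	update_curr_dl(rq);

	if (on_dl_rq(&p->dl) && p->dl.nr_cpus_allowed > 1)
		enqueue_pushable_dl_task(rq, p);
}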

>> +
>> +       if (on_dl_rq(&p->dl)&&  p->dl.nr_cpus_allowed>  1)
>> +               enqueue_pushable_dl_task(rq, p);
>>   }
>>
>>   static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
>> @@ -576,43 +933,410 @@ static void set_curr_task_dl(struct rq *rq)
>>         struct task_struct *p = rq->curr;
>>
>>         p->se.exec_start = rq->clock;
>> +
>> +       /* You can't push away the running task */
>> +       dequeue_pushable_dl_task(rq, p);
>>   }
>>
>> -static void switched_from_dl(struct rq *rq, struct task_struct *p)
>> +#ifdef CONFIG_SMP
>> +
>> +/* Only try algorithms three times */
>> +#define DL_MAX_TRIES 3
>> +
>> +static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu)
>>   {
>> -       if (hrtimer_active(&p->dl.dl_timer))
>> -               hrtimer_try_to_cancel(&p->dl.dl_timer);
>> +       if (!task_running(rq, p)&&
>> +           (cpu<  0 || cpumask_test_cpu(cpu,&p->cpus_allowed))&&
>> +           (p->dl.nr_cpus_allowed>  1))
>> +               return 1;
>> +
>> +       return 0;
>
> 	if (task_running(rq, p))
> 		return 0;
> 	return cpumask_test_cpu(cpu,&p->cpus_allowed);
> 	that is all:)

We use this inside pull_dl_task. Since we are searching for a task to
pull, we must be sure that the found task can actually migrate, and that
is exactly what the nr_cpus_allowed > 1 check is for.
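
FWIW, keeping that check, your early-return style would look like this
(untested):

static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu)
{
	if (task_running(rq, p))
		return 0;
	if (p->dl.nr_cpus_allowed < 2)
		return 0;
	return cpu < 0 || cpumask_test_cpu(cpu, &p->cpus_allowed);
}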

>>   }
>>
>> -static void switched_to_dl(struct rq *rq, struct task_struct *p)
>> +/* Returns the second earliest -deadline task, NULL otherwise */
>> +static struct task_struct *pick_next_earliest_dl_task(struct rq *rq, int cpu)
>> +{
>> +       struct rb_node *next_node = rq->dl.rb_leftmost;
>> +       struct sched_dl_entity *dl_se;
>> +       struct task_struct *p = NULL;
>> +
>> +next_node:
>> +       next_node = rb_next(next_node);
>> +       if (next_node) {
>> +               dl_se = rb_entry(next_node, struct sched_dl_entity, rb_node);
>> +               p = dl_task_of(dl_se);
>> +
>> +               if (pick_dl_task(rq, p, cpu))
>> +                       return p;
>> +
>> +               goto next_node;
>> +       }
>> +
>> +       return NULL;
>> +}
>> +
>> +static int latest_cpu_find(struct cpumask *span,
>> +                          struct task_struct *task,
>> +                          struct cpumask *later_mask)
>>   {
>> +       const struct sched_dl_entity *dl_se =&task->dl;
>> +       int cpu, found = -1, best = 0;
>> +       u64 max_dl = 0;
>> +
>> +       for_each_cpu(cpu, span) {
> 	for_each_cpu_and(cpu, span,&task->cpus_allowed) {
>> +               struct rq *rq = cpu_rq(cpu);
>> +               struct dl_rq *dl_rq =&rq->dl;
>> +
>> +               if (cpumask_test_cpu(cpu,&task->cpus_allowed)&&
>> +                   (!dl_rq->dl_nr_running || dl_time_before(dl_se->deadline,
>> +                    dl_rq->earliest_dl.curr))) {
> 		please use dl_entity_preempt()

Well, ok with this and above. Anyway this code is completely removed in 15/16.
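
Just for the record, with for_each_cpu_and() the loop would read roughly as
below (untested; later_mask is cleared upfront instead of per-cpu, and I
kept dl_time_before() since earliest_dl.curr is a bare u64, not an entity):

	if (later_mask)
		cpumask_clear(later_mask);

	for_each_cpu_and(cpu, span, &task->cpus_allowed) {
		struct dl_rq *dl_rq = &cpu_rq(cpu)->dl;

		if (dl_rq->dl_nr_running &&
		    !dl_time_before(dl_se->deadline, dl_rq->earliest_dl.curr))
			continue;

		if (later_mask)
			cpumask_set_cpu(cpu, later_mask);

		if (!best && !dl_rq->dl_nr_running) {
			best = 1;
			found = cpu;
		} else if (!best &&
			   dl_time_before(max_dl, dl_rq->earliest_dl.curr)) {
			max_dl = dl_rq->earliest_dl.curr;
			found = cpu;
		}
	}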

>> +                       if (later_mask)
>> +                               cpumask_set_cpu(cpu, later_mask);
>> +                       if (!best&&  !dl_rq->dl_nr_running) {
>> +                               best = 1;
>> +                               found = cpu;
>> +                       } else if (!best&&
>> +                                  dl_time_before(max_dl,
>> +                                                 dl_rq->earliest_dl.curr)) {
>> +                               max_dl = dl_rq->earliest_dl.curr;
>> +                               found = cpu;
>> +                       }
>> +               } else if (later_mask)
>> +                       cpumask_clear_cpu(cpu, later_mask);
>> +       }
>> +
>> +       return found;
>> +}
>> +
>> +static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
>> +
>> +static int find_later_rq(struct task_struct *task)
>> +{
>> +       struct sched_domain *sd;
>> +       struct cpumask *later_mask = __get_cpu_var(local_cpu_mask_dl);
> 	please check whether local_cpu_mask_dl is valid
>

Could you explain a bit more why I should check it for validity?
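
If you mean guarding against a failed zalloc_cpumask_var_node() (the
CONFIG_CPUMASK_OFFSTACK case), I guess it would be something like what
sched_rt's find_lowest_rq() does:

	struct cpumask *later_mask = __get_cpu_var(local_cpu_mask_dl);

	/* Make sure the mask is initialized first */
	if (unlikely(!later_mask))
		return -1;

Is that what you have in mind?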

>> +       int this_cpu = smp_processor_id();
>> +       int best_cpu, cpu = task_cpu(task);
>> +
>> +       if (task->dl.nr_cpus_allowed == 1)
>> +               return -1;
>> +
>> +       best_cpu = latest_cpu_find(task_rq(task)->rd->span, task, later_mask);
>> +       if (best_cpu == -1)
>> +               return -1;
>> +
>>         /*
>> -        * If p is throttled, don't consider the possibility
>> -        * of preempting rq->curr, the check will be done right
>> -        * after its runtime will get replenished.
>> +        * If we are here, some target has been found,
>> +        * the most suitable of which is cached in best_cpu.
>> +        * This is, among the runqueues where the current tasks
>> +        * have later deadlines than the task's one, the rq
>> +        * with the latest possible one.
>> +        *
>> +        * Now we check how well this matches with task's
>> +        * affinity and system topology.
>> +        *
>> +        * The last cpu where the task run is our first
>> +        * guess, since it is most likely cache-hot there.
>>          */
>> -       if (unlikely(p->dl.dl_throttled))
>> -               return;
>> +       if (cpumask_test_cpu(cpu, later_mask))
>> +               return cpu;
>> +       /*
>> +        * Check if this_cpu is to be skipped (i.e., it is
>> +        * not in the mask) or not.
>> +        */
>> +       if (!cpumask_test_cpu(this_cpu, later_mask))
>> +               this_cpu = -1;
>> +
>> +       rcu_read_lock();
>> +       for_each_domain(cpu, sd) {
>> +               if (sd->flags&  SD_WAKE_AFFINE) {
>> +
>> +                       /*
>> +                        * If possible, preempting this_cpu is
>> +                        * cheaper than migrating.
>> +                        */
>> +                       if (this_cpu != -1&&
>> +                           cpumask_test_cpu(this_cpu, sched_domain_span(sd)))
>> +                               return this_cpu;
>> +
>> +                       /*
>> +                        * Last chance: if best_cpu is valid and is
>> +                        * in the mask, that becomes our choice.
>> +                        */
>> +                       if (best_cpu<  nr_cpu_ids&&
>> +                           cpumask_test_cpu(best_cpu, sched_domain_span(sd)))
>> +                               return best_cpu;
>> +               }
>> +       }
>> +       rcu_read_unlock();
>>
>> -       if (!p->on_rq || rq->curr != p) {
>> -               if (task_has_dl_policy(rq->curr))
>> -                       check_preempt_curr_dl(rq, p, 0);
>> -               else
>> -                       resched_task(rq->curr);
>> +       /*
>> +        * At this point, all our guesses failed, we just return
>> +        * 'something', and let the caller sort the things out.
>> +        */
>> +       if (this_cpu != -1)
>> +               return this_cpu;
>> +
>> +       cpu = cpumask_any(later_mask);
>> +       if (cpu<  nr_cpu_ids)
>> +               return cpu;
>> +
>> +       return -1;
>> +}
>> +
>> +/* Locks the rq it finds */
>> +static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
>> +{
>> +       struct rq *later_rq = NULL;
>> +       int tries;
>> +       int cpu;
>> +
>> +       for (tries = 0; tries<  DL_MAX_TRIES; tries++) {
>> +               cpu = find_later_rq(task);
>> +
>> +               if ((cpu == -1) || (cpu == rq->cpu))
>> +                       break;
>> +
>> +               later_rq = cpu_rq(cpu);
>> +
>> +               /* Retry if something changed. */
>> +               if (double_lock_balance(rq, later_rq)) {
>> +                       if (unlikely(task_rq(task) != rq ||
>> +                                    !cpumask_test_cpu(later_rq->cpu,
>> +                                                      &task->cpus_allowed) ||
>> +                                    task_running(rq, task) ||
>> +                                    !task->se.on_rq)) {
>> +                               raw_spin_unlock(&later_rq->lock);
>> +                               later_rq = NULL;
>> +                               break;
>> +                       }
>> +               }
>> +
>> +               /*
>> +                * If the rq we found has no -deadline task, or
>> +                * its earliest one has a later deadline than our
>> +                * task, the rq is a good one.
>> +                */
>> +               if (!later_rq->dl.dl_nr_running ||
>> +                   dl_time_before(task->dl.deadline,
>> +                                  later_rq->dl.earliest_dl.curr))
>> +                       break;
>> +
>> +               /* Otherwise we try again. */
>> +               double_unlock_balance(rq, later_rq);
>> +               later_rq = NULL;
>>         }
>> +
>> +       return later_rq;
>>   }
>>
>> -static void prio_changed_dl(struct rq *rq, struct task_struct *p,
>> -                           int oldprio)
>> +static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
>>   {
>> -       switched_to_dl(rq, p);
>> +       struct task_struct *p;
>> +
>> +       if (!has_pushable_dl_tasks(rq))
>> +               return NULL;
>> +
>> +       p = rb_entry(rq->dl.pushable_dl_tasks_leftmost,
>> +                    struct task_struct, pushable_dl_tasks);
>> +
>> +       BUG_ON(rq->cpu != task_cpu(p));
>> +       BUG_ON(task_current(rq, p));
>> +       BUG_ON(p->dl.nr_cpus_allowed<= 1);
>> +
>> +       BUG_ON(!p->se.on_rq);
>> +       BUG_ON(!dl_task(p));
>> +
>> +       return p;
>>   }
>>
>> -#ifdef CONFIG_SMP
>> -static int
>> -select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
>> +/*
>> + * See if the non running -deadline tasks on this rq
>> + * can be sent to some other CPU where they can preempt
>> + * and start executing.
>> + */
>> +static int push_dl_task(struct rq *rq)
>>   {
>> -       return task_cpu(p);
>> +       struct task_struct *next_task;
>> +       struct rq *later_rq;
>> +
>> +       if (!rq->dl.overloaded)
>> +               return 0;
>> +
>> +       next_task = pick_next_pushable_dl_task(rq);
>> +       if (!next_task)
>> +               return 0;
>> +
>> +retry:
>> +       if (unlikely(next_task == rq->curr)) {
>> +               WARN_ON(1);
>> +               return 0;
>> +       }
>> +
>> +       /*
>> +        * If next_task preempts rq->curr, and rq->curr
>> +        * can move away, it makes sense to just reschedule
>> +        * without going further in pushing next_task.
>> +        */
>> +       if (dl_task(rq->curr)&&
>> +           dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline)&&
>> +           rq->curr->dl.nr_cpus_allowed>  1) {
>> +               resched_task(rq->curr);
>> +               return 0;
>> +       }
>> +
>> +       /* We might release rq lock */
>> +       get_task_struct(next_task);
>> +
>> +       /* Will lock the rq it'll find */
>> +       later_rq = find_lock_later_rq(next_task, rq);
>> +       if (!later_rq) {
>> +               struct task_struct *task;
>> +
>> +               /*
>> +                * We must check all this again, since
>> +                * find_lock_later_rq releases rq->lock and it is
>> +                * then possible that next_task has migrated.
>> +                */
>> +               task = pick_next_pushable_dl_task(rq);
>> +               if (task_cpu(next_task) == rq->cpu&&  task == next_task) {
>> +                       /*
>> +                        * The task is still there. We don't try
>> +                        * again, some other cpu will pull it when ready.
>> +                        */
>> +                       dequeue_pushable_dl_task(rq, next_task);
>> +                       goto out;
>> +               }
>> +
>> +               if (!task)
>> +                       /* No more tasks */
>> +                       goto out;
>> +
>> +               put_task_struct(next_task);
>> +               next_task = task;
>> +               goto retry;
>> +       }
>> +
>> +       deactivate_task(rq, next_task, 0);
>> +       set_task_cpu(next_task, later_rq->cpu);
>> +       activate_task(later_rq, next_task, 0);
>> +
>> +       resched_task(later_rq->curr);
>> +
>> +       double_unlock_balance(rq, later_rq);
>> +
>> +out:
>> +       put_task_struct(next_task);
>> +
>> +       return 1;
>> +}
>> +
>> +static void push_dl_tasks(struct rq *rq)
>> +{
>> +       /* Terminates as it moves a -deadline task */
>> +       while (push_dl_task(rq))
>> +               ;
>> +}
>> +
>> +static int pull_dl_task(struct rq *this_rq)
>> +{
>> +       int this_cpu = this_rq->cpu, ret = 0, cpu;
>> +       struct task_struct *p;
>> +       struct rq *src_rq;
>> +       u64 dmin = LONG_MAX;
>> +
>> +       if (likely(!dl_overloaded(this_rq)))
>> +               return 0;
>> +
>> +       for_each_cpu(cpu, this_rq->rd->dlo_mask) {
>> +               if (this_cpu == cpu)
>> +                       continue;
>> +
>> +               src_rq = cpu_rq(cpu);
>> +
>> +               /*
>> +                * It looks racy, and it is! However, as in sched_rt.c,
>> +                * we are fine with this.
>> +                */
>> +               if (this_rq->dl.dl_nr_running&&
>> +                   dl_time_before(this_rq->dl.earliest_dl.curr,
>> +                                  src_rq->dl.earliest_dl.next))
>> +                       continue;
>> +
>> +               /* Might drop this_rq->lock */
>> +               double_lock_balance(this_rq, src_rq);
>> +
>> +               /*
>> +                * If there are no more pullable tasks on the
>> +                * rq, we're done with it.
>> +                */
>> +               if (src_rq->dl.dl_nr_running<= 1)
>> +                       goto skip;
>> +
>> +               p = pick_next_earliest_dl_task(src_rq, this_cpu);
>> +
>> +               /*
>> +                * We found a task to be pulled if:
>> +                *  - it preempts our current (if there's one),
>> +                *  - it will preempt the last one we pulled (if any).
>> +                */
>> +               if (p&&  dl_time_before(p->dl.deadline, dmin)&&
>> +                   (!this_rq->dl.dl_nr_running ||
>> +                    dl_time_before(p->dl.deadline,
>> +                                   this_rq->dl.earliest_dl.curr))) {
>> +                       WARN_ON(p == src_rq->curr);
>> +                       WARN_ON(!p->se.on_rq);
>> +
>> +                       /*
>> +                        * Then we pull iff p has actually an earlier
>> +                        * deadline than the current task of its runqueue.
>> +                        */
>> +                       if (dl_time_before(p->dl.deadline,
>> +                                          src_rq->curr->dl.deadline))
>> +                               goto skip;
>> +
>> +                       ret = 1;
>> +
>> +                       deactivate_task(src_rq, p, 0);
>> +                       set_task_cpu(p, this_cpu);
>> +                       activate_task(this_rq, p, 0);
>> +                       dmin = p->dl.deadline;
>> +
>> +                       /* Is there any other task even earlier? */
>> +               }
>> +skip:
>> +               double_unlock_balance(this_rq, src_rq);
>> +       }
>> +
>> +       return ret;
>> +}
>> +
>> +static void pre_schedule_dl(struct rq *rq, struct task_struct *prev)
>> +{
>> +       /* Try to pull other tasks here */
>> +       if (dl_task(prev))
>> +               pull_dl_task(rq);
>> +}
>> +
>> +static void post_schedule_dl(struct rq *rq)
>> +{
>> +       push_dl_tasks(rq);
>> +}
>> +
>> +/*
>> + * Since the task is not running and a reschedule is not going to happen
>> + * anytime soon on its runqueue, we try pushing it away now.
>> + */
>> +static void task_woken_dl(struct rq *rq, struct task_struct *p)
>> +{
>> +       if (!task_running(rq, p)&&
>> +           !test_tsk_need_resched(rq->curr)&&
>> +           has_pushable_dl_tasks(rq)&&
>> +           p->dl.nr_cpus_allowed>  1&&
>> +           dl_task(rq->curr)&&
>> +           (rq->curr->dl.nr_cpus_allowed<  2 ||
>> +            dl_entity_preempt(&rq->curr->dl,&p->dl))) {
>> +               push_dl_tasks(rq);
>> +       }
>>   }
>>
>>   static void set_cpus_allowed_dl(struct task_struct *p,
>> @@ -622,10 +1346,145 @@ static void set_cpus_allowed_dl(struct task_struct *p,
>>
>>         BUG_ON(!dl_task(p));
>>
>> +       /*
>> +        * Update only if the task is actually running (i.e.,
>> +        * it is on the rq AND it is not throttled).
>> +        */
>> +       if (on_dl_rq(&p->dl)&&  (weight != p->dl.nr_cpus_allowed)) {
>> +               struct rq *rq = task_rq(p);
>> +
>> +               if (!task_current(rq, p)) {
>> +                       /*
>> +                        * If the task was on the pushable list,
>> +                        * make sure it stays there only if the new
>> +                        * mask allows that.
>> +                        */
>> +                       if (p->dl.nr_cpus_allowed>  1)
>> +                               dequeue_pushable_dl_task(rq, p);
>> +
>> +                       if (weight>  1)
>> +                               enqueue_pushable_dl_task(rq, p);
>> +               }
>> +
>> +               if ((p->dl.nr_cpus_allowed<= 1)&&  (weight>  1)) {
>> +                       rq->dl.dl_nr_migratory++;
>> +               } else if ((p->dl.nr_cpus_allowed>  1)&&  (weight<= 1)) {
>> +                       BUG_ON(!rq->dl.dl_nr_migratory);
>> +                       rq->dl.dl_nr_migratory--;
>> +               }
>> +
>> +               update_dl_migration(&rq->dl);
>> +       }
>> +
>>         cpumask_copy(&p->cpus_allowed, new_mask);
>>         p->dl.nr_cpus_allowed = weight;
>>   }
>> +
>> +/* Assumes rq->lock is held */
>> +static void rq_online_dl(struct rq *rq)
>> +{
>> +       if (rq->dl.overloaded)
>> +               dl_set_overload(rq);
>> +}
>> +
>> +/* Assumes rq->lock is held */
>> +static void rq_offline_dl(struct rq *rq)
>> +{
>> +       if (rq->dl.overloaded)
>> +               dl_clear_overload(rq);
>> +}
>> +
>> +static inline void init_sched_dl_class(void)
>> +{
>> +       unsigned int i;
>> +
>> +       for_each_possible_cpu(i)
>> +               zalloc_cpumask_var_node(&per_cpu(local_cpu_mask_dl, i),
>> +                                       GFP_KERNEL, cpu_to_node(i));
>> +}
>> +
>> +#endif /* CONFIG_SMP */
>> +
>> +static void switched_from_dl(struct rq *rq, struct task_struct *p)
>> +{
>> +       if (hrtimer_active(&p->dl.dl_timer) && !dl_policy(p->policy))
>> +               hrtimer_try_to_cancel(&p->dl.dl_timer);
>> +
>> +#ifdef CONFIG_SMP
>> +       /*
>> +        * Since this might be the only -deadline task on the rq,
>> +        * this is the right place to try to pull some other one
>> +        * from an overloaded cpu, if any.
>> +        */
>> +       if (!rq->dl.dl_nr_running)
>> +               pull_dl_task(rq);
>>   #endif
>> +}
>> +
>> +/*
>> + * When switching to -deadline, we may overload the rq, then
>> + * we try to push someone off, if possible.
>> + */
>> +static void switched_to_dl(struct rq *rq, struct task_struct *p)
>> +{
>> +       int check_resched = 1;
>> +
>> +       /*
>> +        * If p is throttled, don't consider the possibility
>> +        * of preempting rq->curr, the check will be done right
>> +        * after its runtime will get replenished.
>> +        */
>> +       if (unlikely(p->dl.dl_throttled))
>> +               return;
>> +
>> +       if (!p->on_rq || rq->curr != p) {
>> +#ifdef CONFIG_SMP
>> +               if (rq->dl.overloaded && push_dl_task(rq) && rq != task_rq(p))
>> +                       /* Only reschedule if pushing failed */
>> +                       check_resched = 0;
>> +#endif /* CONFIG_SMP */
>> +               if (check_resched && task_has_dl_policy(rq->curr))
>> +                       check_preempt_curr_dl(rq, p, 0);
>> +       }
>> +}
>> +
>> +/*
>> + * If the scheduling parameters of a -deadline task changed,
>> + * a push or pull operation might be needed.
>> + */
>> +static void prio_changed_dl(struct rq *rq, struct task_struct *p,
>> +                           int oldprio)
>> +{
>> +       if (p->on_rq || rq->curr == p) {
>> +#ifdef CONFIG_SMP
>> +               /*
>> +                * This might be too much, but unfortunately
>> +                * we don't have the old deadline value, and
>> +                * we can't argue if the task is increasing
>> +                * or lowering its prio, so...
>> +                */
>> +               if (!rq->dl.overloaded)
>> +                       pull_dl_task(rq);
>> +
>> +               /*
>> +                * If we now have a earlier deadline task than p,
>> +                * then reschedule, provided p is still on this
>> +                * runqueue.
>> +                */
>> +               if (dl_time_before(rq->dl.earliest_dl.curr, p->dl.deadline) &&
>> +                   rq->curr == p)
>> +                       resched_task(p);
>> +#else
>> +               /*
>> +                * Again, we don't know if p has a earlier
>> +                * or later deadline, so let's blindly set a
>> +                * (maybe not needed) rescheduling point.
>> +                */
>> +               resched_task(p);
>> +#endif /* CONFIG_SMP */
>> +       } else
>> +               switched_to_dl(rq, p);
>> +}
>>
>>   static const struct sched_class dl_sched_class = {
>>         .next                   = &rt_sched_class,
>> @@ -642,6 +1501,11 @@ static const struct sched_class dl_sched_class = {
>>         .select_task_rq         = select_task_rq_dl,
>>
>>         .set_cpus_allowed       = set_cpus_allowed_dl,
>> +       .rq_online              = rq_online_dl,
>> +       .rq_offline             = rq_offline_dl,
>> +       .pre_schedule           = pre_schedule_dl,
>> +       .post_schedule          = post_schedule_dl,
>> +       .task_woken             = task_woken_dl,
>>   #endif
>>
>>         .set_curr_task          = set_curr_task_dl,
>> diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
>> index 4b09704..7b609bc 100644
>> --- a/kernel/sched_rt.c
>> +++ b/kernel/sched_rt.c
>> @@ -1590,7 +1590,7 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
>>             !test_tsk_need_resched(rq->curr) &&
>>             has_pushable_tasks(rq) &&
>>             p->rt.nr_cpus_allowed > 1 &&
>> -           rt_task(rq->curr) &&
>> +           (dl_task(rq->curr) || rt_task(rq->curr)) &&
>>             (rq->curr->rt.nr_cpus_allowed < 2 ||
>>              rq->curr->prio <= p->prio))
>>                 push_rt_tasks(rq);
>> --
>> 1.7.5.4
>>
>
> Would you please mail me, in attachment, a monolithic patch of this work?
>

Ok, I'll prepare the monolithic patch and probably store it somewhere so that it
can be downloaded also by others.

Thanks a lot for the immediate review :-)!

Best Regards,

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-06 17:31     ` Juri Lelli
@ 2012-04-07  2:32       ` Hillf Danton
  2012-04-07  7:46         ` Dario Faggioli
  2012-04-08 20:20         ` Juri Lelli
  0 siblings, 2 replies; 129+ messages in thread
From: Hillf Danton @ 2012-04-07  2:32 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fcheccon

On Sat, Apr 7, 2012 at 1:31 AM, Juri Lelli <juri.lelli@gmail.com> wrote:
>>>
>>>  kernel/sched_dl.c |  912
>>>  kernel/sched_rt.c |    2 +-

You are working on 2.6.3x, x <= 8 ?
If so, what is the reason (just curious)?
Already planned to add in 3.3 and above?

>>> +               if (!dl_entity_preempt(&entry->dl, &p->dl))
>>
>>                if (dl_entity_preempt(&p->dl, &entry->dl))
>>
>
> Any specific reason to reverse the condition?
>
Just for easing readers.

>>> +select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
>>> +{
>>> +       struct task_struct *curr;
>>> +       struct rq *rq;
>>> +       int cpu;
>>> +
>>> +       if (sd_flag != SD_BALANCE_WAKE)
>>
>>                why is task_cpu(p) not eligible?
>>
>
> Right, I'll change this.
>
No, you will first IMO sort out clear answer to the question.

>>> +           (rq->curr->dl.nr_cpus_allowed < 2 ||
>>> +            dl_entity_preempt(&rq->curr->dl, &p->dl)) &&
>>
>>                !dl_entity_preempt(&p->dl, &rq->curr->dl)) &&
>
> As above?
>
Just for easing readers.

>>> +#ifdef CONFIG_SMP
>>> +       /*
>>> +        * In the unlikely case current and p have the same deadline
>>> +        * let us try to decide what's the best thing to do...
>>> +        */
>>> +       if ((s64)(p->dl.deadline - rq->curr->dl.deadline) == 0 &&
>>> +           !need_resched())
>>
>> please recheck !need_resched(), say rq->curr need reschedule?
>
> Sorry, I don't get this..
>
Perhaps smp_processor_id() != rq->cpu

>>
>>        if (task_running(rq, p))
>>                return 0;
>>        return cpumask_test_cpu(cpu, &p->cpus_allowed);
>
> We use this inside pull_dl_task. Since we are searching for a task to
> pull, you must be sure that the found task can actually migrate checking
> nr_cpus_allowed > 1.
>
If the cpu is certainly allowed for the task to run on, but nr_cpus_allowed
is no more than one, then what is corrupted?

>
> Well, ok with this and above. Anyway this code is completely removed in
> 15/16.
>
Yup, another reason for monolith.

>>> +
>>> +static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
>>> +
>>> +static int find_later_rq(struct task_struct *task)
>>> +{
>>> +       struct sched_domain *sd;
>>> +       struct cpumask *later_mask = __get_cpu_var(local_cpu_mask_dl);
>>
>>        please check is local_cpu_mask_dl valid
>>
>
> Could you explain more why should I check for validity?
>
Only for the case that something comes in before it is initialized,
IIRC encountered by Steven.

>
> Ok, I'll prepare the monolithic patch and probably store it somewhere so
> that it can be downloaded also by others.
>
Inform Hillf once it is ready, thanks.

Good Weekend
-hd

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-07  2:32       ` Hillf Danton
@ 2012-04-07  7:46         ` Dario Faggioli
  2012-04-08 20:20         ` Juri Lelli
  1 sibling, 0 replies; 129+ messages in thread
From: Dario Faggioli @ 2012-04-07  7:46 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Juri Lelli, peterz, tglx, mingo, rostedt, cfriesen, oleg,
	fweisbec, darren, johan.eker, p.faure, linux-kernel, claudio,
	michael, fcheccon

[-- Attachment #1: Type: text/plain, Size: 1169 bytes --]

On Sat, 2012-04-07 at 10:32 +0800, Hillf Danton wrote: 
> On Sat, Apr 7, 2012 at 1:31 AM, Juri Lelli <juri.lelli@gmail.com> wrote:
> >>>
> >>>  kernel/sched_dl.c |  912
> >>>  kernel/sched_rt.c |    2 +-
> 
> You are working on 2.6.3x, x <= 8 ?
>
Not really. Why do you say so? :-O

> If so, what is the reason(just curious)?
> Already planned to add in 3.3 and above?
>
The n.0 mail of this series "[RFC][PATCH 00/16] sched: SCHED_DEADLINE
v4" says:
<<  - this time we sit on top of PREEMPT_RT (3.2.13-rt23); we continue
      to aim at mainline inclusion, but we also see -rt folks as
      immediate and interested users.

    ...

    As said, patchset is on top of PREEMPT_RT (as of today). However,
    Insop Song (from Ericsson) is maintaining a parallel branch for the
    current tip/master (https://github.com/insop/sched-deadline2).>>

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 129+ messages in thread

* SCHED_DEADLINE v4
  2012-04-06  7:14 [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4 Juri Lelli
                   ` (17 preceding siblings ...)
  2012-04-06 11:07 ` Dario Faggioli
@ 2012-04-07  7:52 ` Juri Lelli
  2012-04-11 14:17 ` [RFC][PATCH 00/16] sched: " Steven Rostedt
  19 siblings, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-07  7:52 UTC (permalink / raw)
  To: linux-dl, RT

Hello everyone,

This is to inform you that a new version of the SCHED_DEADLINE patchset
is out (v4). What follows is an excerpt of the overview (00/16) message
sent on LKML, you can find the thread here:
https://lkml.org/lkml/2012/4/6/39

[...]
Just to recap, the patchset introduces a new deadline based real-time
task scheduling policy --called SCHED_DEADLINE-- with bandwidth
isolation (aka "resource reservation") capabilities. It now supports
global/clustered multiprocessor scheduling through dynamic task
migrations.


  From the previous releases[1]:
   - all the comments and the fixes coming from the reviews we got have
     been considered and applied;
   - better handling of rq selection for dynamic task migration, by means
     of a cpupri equivalent for -deadline tasks (cpudl). The mechanism
     is simple and straightforward, but showed nice performance figures[2].
   - this time we sit on top of PREEMPT_RT (3.2.13-rt23); we continue to aim
     at mainline inclusion, but we also see -rt folks as immediate and
     interested users.

Still missing/incomplete:
   - (c)group based bandwidth management, and maybe scheduling. It seems
     some more discussion on what precisely we want is *really* needed
     for this point;
   - bandwidth inheritance (to replace deadline/priority inheritance).
     What's in the patchset is just very few more than a simple
     placeholder. More discussion on the right way to go is needed here.
     Some work has already been done, but it is still not ready for
     submission.

The development is taking place at:
    https://github.com/jlelli/sched-deadline

Check the repositories frequently if you're interested, and feel free to
e-mail me for any issue you run into.

Furthermore, we developed an application that you can use to test this
patchset:
   https://github.com/gbagnoli/rt-app

We also set up a development mailing list: linux-dl[3].
You can subscribe from here:
http://feanor.sssup.it/mailman/listinfo/linux-dl
or via e-mail (send a message to linux-dl-request@retis.sssup.it with
just the word `help' as subject or in the body to receive info).

As already discussed we are planning to merge this work with the EDF
throttling patches [https://lkml.org/lkml/2010/2/23/239] but we still are in
the preliminary phases of the merge and we intend to use the feedback to this
post to help us decide on the direction it should take.

As said, patchset is on top of PREEMPT_RT (as of today). However, Insop
Song (from Ericsson) is maintaining a parallel branch for the current
tip/master (https://github.com/insop/sched-deadline2). Ericsson is in fact
evaluating the use of SCHED_DEADLINE for CPE (Customer Premise Equipment)
devices in order to reserve CPU bandwidth to processes.

The code was being jointly developed by ReTiS Lab (http://retis.sssup.it)
and Evidence S.r.l (http://www.evidence.eu.com) in the context of the ACTORS
EU-funded project (http://www.actors-project.eu). It is now also supported by
the S(o)OS EU-funded project (http://www.soos-project.eu/).
It has also some users, both in academic and applied research. Even if our
last release dates back almost a year we continued to get feedback
from Ericsson (see above), Wind River, Porto (ISEP), Trento and Malardalen
universities :-).

As usual, any kind of feedback is welcome and appreciated.

Thanks in advance and regards,

  - Juri

[1] http://lwn.net/Articles/376502, http://lwn.net/Articles/353797,
     http://lwn.net/Articles/412410
[2] http://retis.sssup.it/~jlelli/papers/Ospert11Lelli.pdf
[3] from the first linux-dl message:
     -linux-dl should become the place where discussions about real-time
      deadline scheduling on Linux take place (not only SCHED_DEADLINE).
      We felt the lack of a place where we can keep in touch with each
      other; we are all working on the same things, but probably from
      different viewpoints, and this is surely a point of strength.
      Anyway, our efforts need to be organized in some way, or at least it
      is important to know on what everyone is currently working as to not
      end up with "duplicate-efforts".-

Dario Faggioli (10):
       sched: add extended scheduling interface.
       sched: SCHED_DEADLINE data structures.
       sched: SCHED_DEADLINE policy implementation.
       sched: SCHED_DEADLINE avg_update accounting.
       sched: add schedstats for -deadline tasks.
       sched: add resource limits for -deadline tasks.
       sched: add latency tracing for -deadline tasks.
       sched: drafted deadline inheritance logic.
       sched: add bandwidth management for sched_dl.
       sched: add sched_dl documentation.

Juri Lelli (3):
       sched: SCHED_DEADLINE SMP-related data structures.
       sched: SCHED_DEADLINE push and pull logic
       sched: speed up -dl pushes with a push-heap.

Harald Gustafsson (1):
       sched: add period support for -deadline tasks.

Peter Zijlstra (1):
       rtmutex: turn the plist into an rb-tree.
[...]

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4
  2012-04-06  8:25 ` [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4 Luca Abeni
@ 2012-04-07  9:25   ` Tadeus Prastowo
  0 siblings, 0 replies; 129+ messages in thread
From: Tadeus Prastowo @ 2012-04-07  9:25 UTC (permalink / raw)
  To: Luca Abeni
  Cc: Juri Lelli, peterz, tglx, mingo, rostedt, cfriesen, oleg,
	fweisbec, darren, johan.eker, p.faure, linux-kernel, claudio,
	michael, fchecconi, tommaso.cucinotta, nicola.manica,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang

Hi everyone!

On Fri, 2012-04-06 at 10:25 +0200, Luca Abeni wrote:

[...]

> About BWI... A student of mine (added in cc) implemented a prototypal
> bandwidth inheritance (based on an old version of SCHED_DEADLINE). It is
> here:
> https://github.com/eus/cbs_inheritance
> (Tadeus, please correct me if I pointed to the wrong repository).

The repository is correct.

> It is not for inclusion yet (it is based on an old version, it is UP
> only, and it probably needs some cleanups), but it worked fine in our
> tests. Note that in this patch the BWI mechanism is not bound to
> rtmutexes, but inheritance is controlled through 2 syscalls (because we
> used BWI for client/server interactions).
> Anyway, I hope that the BWI code developed by Tadeus can be useful (or
> can be directly re-used) for implementing BWI in SCHED_DEADLINE.

For inclusion, I think the interaction between the two system calls and
RT-mutex subsystem needs some thoughts especially when multi-cores are
involved. I haven't given further thought on this matter upon the
completion of my thesis work, especially in taking into account Dario's
M-BWI approach.

For UP, the BWI syscalls implementation in the aforementioned branch
does not have Dario's BWI implementation for mutexes. And interaction
between the BWI syscalls and RT-Mutex has not been taken into account.

Happy Easter to all of you who celebrate it!

> 				Luca

-- 
Sincerely yours,
Tadeus Prastowo


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 01/16] sched: add sched_class->task_dead.
  2012-04-06  7:14 ` [PATCH 01/16] sched: add sched_class->task_dead Juri Lelli
@ 2012-04-08 17:49   ` Oleg Nesterov
  2012-04-08 18:09     ` Juri Lelli
  0 siblings, 1 reply; 129+ messages in thread
From: Oleg Nesterov @ 2012-04-08 17:49 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, rostedt, cfriesen, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On 04/06, Juri Lelli wrote:
>
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -3219,6 +3219,9 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
>  	if (mm)
>  		mmdrop_delayed(mm);
>  	if (unlikely(prev_state == TASK_DEAD)) {
> +		if (prev->sched_class->task_dead)
> +			prev->sched_class->task_dead(prev);
> +

And 5/16 adds

	+static void task_dead_dl(struct task_struct *p)
	+{
	+       struct hrtimer *timer = &p->dl.dl_timer;
	+
	+	if (hrtimer_active(timer))
	+               hrtimer_try_to_cancel(timer);
	+}

This looks suspicious. finish_task_switch() does put_task_struct()
after that, it is quite possible this actually frees the memory.

What if hrtimer_try_to_cancel() fails because the timer is running?
In this case __run_hrtimer() can play with the freed timer. Say, to
clear HRTIMER_STATE_CALLBACK. Not to mention dl_task_timer() itself.

Oleg.


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 01/16] sched: add sched_class->task_dead.
  2012-04-08 17:49   ` Oleg Nesterov
@ 2012-04-08 18:09     ` Juri Lelli
  0 siblings, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-08 18:09 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: peterz, tglx, mingo, rostedt, cfriesen, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On 04/08/2012 07:49 PM, Oleg Nesterov wrote:
> On 04/06, Juri Lelli wrote:
>>
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -3219,6 +3219,9 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
>>   	if (mm)
>>   		mmdrop_delayed(mm);
>>   	if (unlikely(prev_state == TASK_DEAD)) {
>> +		if (prev->sched_class->task_dead)
>> +			prev->sched_class->task_dead(prev);
>> +
>
> And 5/16 adds
>
> 	+static void task_dead_dl(struct task_struct *p)
> 	+{
> 	+       struct hrtimer *timer = &p->dl.dl_timer;
> 	+
> 	+	if (hrtimer_active(timer))
> 	+               hrtimer_try_to_cancel(timer);
> 	+}
>
> This looks suspicious. finish_task_switch() does put_task_struct()
> after that, it is quite possible this actually frees the memory.
>
> What if hrtimer_try_to_cancel() fails because the timer is running?
> In this case __run_hrtimer() can play with the freed timer. Say, to
> clear HRTIMER_STATE_CALLBACK. Not to mention dl_task_timer() itself.
>
> Oleg.
>

Right, hrtimer_cancel(timer) looks way better.
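
A minimal sketch of how that would look (assuming the task_dead_dl()
hook from 5/16; not the final patch):

	static void task_dead_dl(struct task_struct *p)
	{
		struct hrtimer *timer = &p->dl.dl_timer;

		/*
		 * hrtimer_cancel() waits for a running callback to
		 * finish, so the timer can no longer touch the
		 * task_struct freed by the later put_task_struct().
		 */
		hrtimer_cancel(timer);
	}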

Thanks!

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-07  2:32       ` Hillf Danton
  2012-04-07  7:46         ` Dario Faggioli
@ 2012-04-08 20:20         ` Juri Lelli
  2012-04-09 12:28           ` Hillf Danton
  2012-04-11 16:00           ` Steven Rostedt
  1 sibling, 2 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-08 20:20 UTC (permalink / raw)
  To: Hillf Danton
  Cc: peterz, tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, Fabio Checconi,
	Tommaso Cucinotta, Juri Lelli, Nicola Manica, Luca Abeni,
	Dhaval Giani, hgu1972, paulmck, Dario Faggioli, Insop Song,
	liming.wang

On 04/07/2012 04:32 AM, Hillf Danton wrote:
> On Sat, Apr 7, 2012 at 1:31 AM, Juri Lelli<juri.lelli@gmail.com>  wrote:
>>>>
>>>>   kernel/sched_dl.c |  912
>>>>   kernel/sched_rt.c |    2 +-
>
> You are working on 2.6.3x, x<= 8 ?
> If so, what is the reason(just curious)?
> Already planned to add in 3.3 and above?
>

Dario answered on this :-).
   
>>>> +               if (!dl_entity_preempt(&entry->dl, &p->dl))
>>>
>>>                 if (dl_entity_preempt(&p->dl, &entry->dl))
>>>
>>
>> Any specific reason to reverse the condition?
>>
> Just for easing readers.
>

Ok, reasonable. Here and below.

>>>> +select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
>>>> +{
>>>> +       struct task_struct *curr;
>>>> +       struct rq *rq;
>>>> +       int cpu;
>>>> +
>>>> +       if (sd_flag != SD_BALANCE_WAKE)
>>>
>>>                 why is task_cpu(p) not eligible?
>>>
>>
>> Right, I'll change this.
>>
> No, you will first IMO sort out clear answer to the question.
>

task_cpu(p) is eligible and will be returned if sd_flag != SD_BALANCE_WAKE
&& sd_flag != SD_BALANCE_FORK as in sched_rt. I changed the code accordingly.
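
Roughly, a sketch of the reworked entry check (mirroring sched_rt; the
wake-time placement logic is elided and the exact structure is only
illustrative):

	static int
	select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
	{
		int cpu = task_cpu(p);

		/* The previous cpu is fine for every other balance flag. */
		if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
			goto out;

		/* ... try to pick a later (or idle) rq for the waking task ... */
	out:
		return cpu;
	}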

>>>> +           (rq->curr->dl.nr_cpus_allowed < 2 ||
>>>> +            dl_entity_preempt(&rq->curr->dl, &p->dl)) &&
>>>
>>>                 !dl_entity_preempt(&p->dl, &rq->curr->dl)) &&
>>
>> As above?
>>
> Just for easing reader.
>
>>>> +#ifdef CONFIG_SMP
>>>> +       /*
>>>> +        * In the unlikely case current and p have the same deadline
>>>> +        * let us try to decide what's the best thing to do...
>>>> +        */
>>>> +       if ((s64)(p->dl.deadline - rq->curr->dl.deadline) == 0 &&
>>>> +           !need_resched())
>>>
>>> please recheck !need_resched(), say rq->curr need reschedule?
>>
>> Sorry, I don't get this..
>>
> Perhaps smp_processor_id() != rq->cpu
>

need_resched is actually checked...

>>>
>>>         if (task_running(rq, p))
>>>                 return 0;
>>>         return cpumask_test_cpu(cpu, &p->cpus_allowed);
>>
>> We use this inside pull_dl_task. Since we are searching for a task to
>> pull, you must be sure that the found task can actually migrate checking
>> nr_cpus_allowed > 1.
>>
> If cpu is certainly allowed for task to run, but nr_cpus_allowed is no more
> than one, which is corrupted?
>
>>
>> Well, ok with this and above. Anyway this code is completely removed in
>> 15/16.
>>
> Yup, another reason for monolith.
>

The monolithic patch is below. Anyway, please check the github repo for bug
fixes/new features. ;-)

>>>> +
>>>> +static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
>>>> +
>>>> +static int find_later_rq(struct task_struct *task)
>>>> +{
>>>> +       struct sched_domain *sd;
>>>> +       struct cpumask *later_mask = __get_cpu_var(local_cpu_mask_dl);
>>>
>>>         please check is local_cpu_mask_dl valid
>>>
>>
>> Could you explain more why should I check for validity?
>>
> Only for the case that something comes in before it is initialized,
> IIRC encountered by Steven.
>

Do you mean at kernel_init time?
Could you be more precise about the problem Steven encountered?

>>
>> Ok, I'll prepare the monolithic patch and probably store it somewhere so
>> that it can be downloaded also by others.
>>
> Info Hillf once it is ready, thanks.
>

Here we go:
https://github.com/downloads/jlelli/sched-deadline/sched-dl-V4.patch

I noticed that the Cc list has changed... did something go wrong?
Anyway, I restored it to the original one. :-)

Thanks and Regards,

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-08 20:20         ` Juri Lelli
@ 2012-04-09 12:28           ` Hillf Danton
  2012-04-10  8:11             ` Juri Lelli
  2012-04-11 16:00           ` Steven Rostedt
  1 sibling, 1 reply; 129+ messages in thread
From: Hillf Danton @ 2012-04-09 12:28 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, Fabio Checconi,
	Tommaso Cucinotta, Nicola Manica, Luca Abeni, Dhaval Giani,
	hgu1972, paulmck, Dario Faggioli, Insop Song, liming.wang

On Mon, Apr 9, 2012 at 4:20 AM, Juri Lelli <juri.lelli@gmail.com> wrote:
>
> Do you mean at kernel_init time?
> Could you be more precise about the problem Steven encountered?
>
After the change from kernel/sched_rt.c to kernel/sched/rt.c, I could not
find the git history of that fix by Steven in mainline (related to RCU, IIRC).
Steven, please give a link.

>
> Here we go:
> https://github.com/downloads/jlelli/sched-deadline/sched-dl-V4.patch
>
Thanks:)

> I noticed that the Cc list is changed... something went wrong?
>
I forget what was changed, simply because the mail agent beeps that
one mail address is not reachable.

-hd

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-09 12:28           ` Hillf Danton
@ 2012-04-10  8:11             ` Juri Lelli
  2012-04-11 15:57               ` Steven Rostedt
  0 siblings, 1 reply; 129+ messages in thread
From: Juri Lelli @ 2012-04-10  8:11 UTC (permalink / raw)
  To: Hillf Danton
  Cc: peterz, tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, Fabio Checconi,
	Tommaso Cucinotta, Nicola Manica, Luca Abeni, Dhaval Giani,
	hgu1972, paulmck, Dario Faggioli, Insop Song, liming.wang

On 04/09/2012 02:28 PM, Hillf Danton wrote:
> On Mon, Apr 9, 2012 at 4:20 AM, Juri Lelli<juri.lelli@gmail.com>  wrote:
>>
>> Do you mean at kernel_init time?
>> Could you be more precise about the problem Steven encountered?
>>
> After change from kernel/sched_rt.c to kernel/sched/rt.c, I could not
> find the git history of that fix, related to RCU IIRC, by Steven in mainline.
> Steven, please give a link.
>

Ok, found!
https://lkml.org/lkml/2011/6/14/366
  
>>
>> Here we go:
>> https://github.com/downloads/jlelli/sched-deadline/sched-dl-V4.patch
>>
> Thanks:)
>
>> I noticed that the Cc list is changed... something went wrong?
>>
> I forget what was changed, simply because the mail agent beeps that
> one mail address is not reachable.
>
> -hd

Thanks and Regards,

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-06  7:14 ` [PATCH 05/16] sched: SCHED_DEADLINE policy implementation Juri Lelli
@ 2012-04-11  3:06   ` Steven Rostedt
  2012-04-11  6:54     ` Juri Lelli
  2012-04-11 13:41   ` Steven Rostedt
                     ` (9 subsequent siblings)
  10 siblings, 1 reply; 129+ messages in thread
From: Steven Rostedt @ 2012-04-11  3:06 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:

> +/*
> + * Pure Earliest Deadline First (EDF) scheduling does not deal with the
> + * possibility of a entity lasting more than what it declared, and thus
> + * exhausting its runtime.
> + *
> + * Here we are interested in making runtime overrun possible, but we do
> + * not want a entity which is misbehaving to affect the scheduling of all
> + * other entities.
> + * Therefore, a budgeting strategy called Constant Bandwidth Server (CBS)
> + * is used, in order to confine each entity within its own bandwidth.
> + *
> + * This function deals exactly with that, and ensures that when the runtime
> + * of a entity is replenished, its deadline is also postponed. That ensures
> + * the overrunning entity can't interfere with other entity in the system and
> + * can't make them miss their deadlines. Reasons why this kind of overruns
> + * could happen are, typically, a entity voluntarily trying to overcume its

s/overcume/overcome/

-- Steve

> + * runtime, or it just underestimated it during sched_setscheduler_ex().
> + */
> +static void replenish_dl_entity(struct sched_dl_entity *dl_se)
> +{
> +
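
As a worked instance of the rule described in the quoted comment (numbers
made up for illustration), take dl_runtime = 10 ms, dl_deadline = 30 ms,
and an entity that overran so its runtime went negative:

	runtime = -5 ms
	replenish: runtime  = -5 + 10 = 5 ms   (positive again, stop)
	           deadline = deadline + 30 ms

The overrunning entity keeps running, but only within its reserved
bandwidth (10/30), so the other entities' deadlines are unaffected.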


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-11  3:06   ` Steven Rostedt
@ 2012-04-11  6:54     ` Juri Lelli
  0 siblings, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-11  6:54 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, tglx, mingo, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On 04/11/2012 05:06 AM, Steven Rostedt wrote:
> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>
>> +/*
>> + * Pure Earliest Deadline First (EDF) scheduling does not deal with the
>> + * possibility of a entity lasting more than what it declared, and thus
>> + * exhausting its runtime.
>> + *
>> + * Here we are interested in making runtime overrun possible, but we do
>> + * not want a entity which is misbehaving to affect the scheduling of all
>> + * other entities.
>> + * Therefore, a budgeting strategy called Constant Bandwidth Server (CBS)
>> + * is used, in order to confine each entity within its own bandwidth.
>> + *
>> + * This function deals exactly with that, and ensures that when the runtime
>> + * of a entity is replenished, its deadline is also postponed. That ensures
>> + * the overrunning entity can't interfere with other entity in the system and
>> + * can't make them miss their deadlines. Reasons why this kind of overruns
>> + * could happen are, typically, a entity voluntarily trying to overcume its
>
> s/overcume/overcome/
>
> -- Steve
>
>> + * runtime, or it just underestimated it during sched_setscheduler_ex().
>> + */
>> +static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>> +{
>> +
>

Thanks!

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-06  7:14 ` [PATCH 05/16] sched: SCHED_DEADLINE policy implementation Juri Lelli
  2012-04-11  3:06   ` Steven Rostedt
@ 2012-04-11 13:41   ` Steven Rostedt
  2012-04-11 13:55     ` Juri Lelli
  2012-04-23 10:15   ` Peter Zijlstra
                     ` (8 subsequent siblings)
  10 siblings, 1 reply; 129+ messages in thread
From: Steven Rostedt @ 2012-04-11 13:41 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:

> +static void replenish_dl_entity(struct sched_dl_entity *dl_se)
> +{
> +	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> +	struct rq *rq = rq_of_dl_rq(dl_rq);
> +
> +	/*
> +	 * We Keep moving the deadline away until we get some

s/Keep/keep/

> +	 * available runtime for the entity. This ensures correct
> +	 * handling of situations where the runtime overrun is
> +	 * arbitrary large.
> +	 */
> +	while (dl_se->runtime <= 0) {
> +		dl_se->deadline += dl_se->dl_deadline;
> +		dl_se->runtime += dl_se->dl_runtime;
> +	}
> +
> +	/*
> +	 * At this point, the deadline really should be "in
> +	 * the future" with respect to rq->clock. If it's
> +	 * not, we are, for some reason, lagging too much!
> +	 * Anyway, after having warn userspace abut that,
> +	 * we still try to keep the things running by
> +	 * resetting the deadline and the budget of the
> +	 * entity.
> +	 */
> +	if (dl_time_before(dl_se->deadline, rq->clock)) {
> +		WARN_ON_ONCE(1);
> +		dl_se->deadline = rq->clock + dl_se->dl_deadline;
> +		dl_se->runtime = dl_se->dl_runtime;
> +	}
> +}
> +

I just finished reviewing patches 1-5, and have yet to find anything
wrong with them (except for these typos). I'll continue my review, and
then I'll start testing them.

Good work (so far ;-)

-- Steve



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-11 13:41   ` Steven Rostedt
@ 2012-04-11 13:55     ` Juri Lelli
  0 siblings, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-11 13:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, tglx, mingo, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On 04/11/2012 03:41 PM, Steven Rostedt wrote:
> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>
>> +static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>> +{
>> +	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>> +	struct rq *rq = rq_of_dl_rq(dl_rq);
>> +
>> +	/*
>> +	 * We Keep moving the deadline away until we get some
>
> s/Keep/keep/
>
>> +	 * available runtime for the entity. This ensures correct
>> +	 * handling of situations where the runtime overrun is
>> +	 * arbitrary large.
>> +	 */
>> +	while (dl_se->runtime <= 0) {
>> +		dl_se->deadline += dl_se->dl_deadline;
>> +		dl_se->runtime += dl_se->dl_runtime;
>> +	}
>> +
>> +	/*
>> +	 * At this point, the deadline really should be "in
>> +	 * the future" with respect to rq->clock. If it's
>> +	 * not, we are, for some reason, lagging too much!
>> +	 * Anyway, after having warn userspace abut that,
>> +	 * we still try to keep the things running by
>> +	 * resetting the deadline and the budget of the
>> +	 * entity.
>> +	 */
>> +	if (dl_time_before(dl_se->deadline, rq->clock)) {
>> +		WARN_ON_ONCE(1);
>> +		dl_se->deadline = rq->clock + dl_se->dl_deadline;
>> +		dl_se->runtime = dl_se->dl_runtime;
>> +	}
>> +}
>> +
>
> I just finished reviewing patches 1-5, and have yet to find anything
> wrong with them (except for these typos). I'll continue my review, and
> then I'll start testing them.
>
> Good work (so far ;-)
>
> -- Steve
>
>

Well, I tried my best not to spoil too much the work done by
Dario & Co. :-).

Anyway, thanks!

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-06 13:39   ` Hillf Danton
  2012-04-06 17:31     ` Juri Lelli
@ 2012-04-11 14:10     ` Steven Rostedt
  2012-04-12 12:28       ` Hillf Danton
  1 sibling, 1 reply; 129+ messages in thread
From: Steven Rostedt @ 2012-04-11 14:10 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Juri Lelli, peterz, tglx, mingo, cfriesen, oleg, fweisbec,
	darren, johan.eker, p.faure, linux-kernel, claudio, michael,
	fcheccon

On Fri, 2012-04-06 at 21:39 +0800, Hillf Danton wrote:

> > +static void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> > +{
> > +       dl_rq = &rq_of_dl_rq(dl_rq)->dl;
> > +
> > +       dl_rq->dl_nr_total++;
> > +       if (dl_se->nr_cpus_allowed > 1)
> > +               dl_rq->dl_nr_migratory++;
> > +
> > +       update_dl_migration(dl_rq);
> 
> 	if (dl_se->nr_cpus_allowed > 1) {
> 		dl_rq->dl_nr_migratory++;
> 		/* No change in migratory, no update of migration */

This is not true. As dl_nr_total changed. If there was only one dl task
queued that can migrate, and then another dl task is queued but this
task can not migrate, the update_dl_migration still needs to be called.
As dl_nr_migratory would be 1, but now dl_nr_total > 1. This means we
are now overloaded.
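
To make that condition concrete, here is a rough sketch of what
update_dl_migration() is expected to check (its body is not in the
quoted hunk, so this is only an illustration built from the fields
shown above):

	static void update_dl_migration(struct dl_rq *dl_rq)
	{
		/* More than one queued -dl task, and at least one can move. */
		if (dl_rq->dl_nr_migratory && dl_rq->dl_nr_total > 1) {
			if (!dl_rq->overloaded) {
				dl_set_overload(rq_of_dl_rq(dl_rq));
				dl_rq->overloaded = 1;
			}
		} else if (dl_rq->overloaded) {
			dl_clear_overload(rq_of_dl_rq(dl_rq));
			dl_rq->overloaded = 0;
		}
	}

Even when dl_nr_migratory does not change, a change in dl_nr_total can
flip the overloaded state, which is why the call cannot be moved inside
the "nr_cpus_allowed > 1" branch.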

> 		update_dl_migration(dl_rq);
> 	}
> 
> > +}
> > +
> > +static void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> > +{
> > +       dl_rq = &rq_of_dl_rq(dl_rq)->dl;
> > +
> > +       dl_rq->dl_nr_total--;
> > +       if (dl_se->nr_cpus_allowed > 1)
> > +               dl_rq->dl_nr_migratory--;
> > +
> > +       update_dl_migration(dl_rq);
> 
> ditto

ditto.

> 
> > +}
> > +
> > +/*
> > + * The list of pushable -deadline task is not a plist, like in
> > + * sched_rt.c, it is an rb-tree with tasks ordered by deadline.

-- Steve



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4
  2012-04-06  7:14 [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4 Juri Lelli
                   ` (18 preceding siblings ...)
  2012-04-07  7:52 ` Juri Lelli
@ 2012-04-11 14:17 ` Steven Rostedt
  2012-04-11 14:28   ` Juri Lelli
  19 siblings, 1 reply; 129+ messages in thread
From: Steven Rostedt @ 2012-04-11 14:17 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang

On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:

> The development is taking place at:
>    https://github.com/jlelli/sched-deadline
> 
> Check the repositories frequently if you're interested, and feel free to
> e-mail me for any issue you run into.
> 

BTW, which branch should I be looking at?

-- Steve



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4
  2012-04-11 14:17 ` [RFC][PATCH 00/16] sched: " Steven Rostedt
@ 2012-04-11 14:28   ` Juri Lelli
  0 siblings, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-11 14:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang

On 04/11/2012 04:17 PM, Steven Rostedt wrote:
> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>
>> The development is taking place at:
>>     https://github.com/jlelli/sched-deadline
>>
>> Check the repositories frequently if you're interested, and feel free to
>> e-mail me for any issue you run into.
>>
>
> BTW, which branch should I be looking at?
>

  * sched-dl-V4 contains the patchset as posted on LKML
  * 3.2.13-rt23-dl is the "development" one with commits not merged in bigger
    patches

Currently, they are equivalent for testing purposes.

I most probably have to simplify and clean up the whole repo.

Thanks and Regards,

- Juri
  

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-10  8:11             ` Juri Lelli
@ 2012-04-11 15:57               ` Steven Rostedt
  0 siblings, 0 replies; 129+ messages in thread
From: Steven Rostedt @ 2012-04-11 15:57 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Hillf Danton, peterz, tglx, mingo, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael,
	Fabio Checconi, Tommaso Cucinotta, Nicola Manica, Luca Abeni,
	Dhaval Giani, hgu1972, paulmck, Dario Faggioli, Insop Song,
	liming.wang

On Tue, 2012-04-10 at 10:11 +0200, Juri Lelli wrote:
> On 04/09/2012 02:28 PM, Hillf Danton wrote:
> > On Mon, Apr 9, 2012 at 4:20 AM, Juri Lelli<juri.lelli@gmail.com>  wrote:
> >>
> >> Do you mean at kernel_init time?
> >> Could you be more precise about the problem Steven encountered?
> >>
> > After change from kernel/sched_rt.c to kernel/sched/rt.c, I could not
> > find the git history of that fix, related to RCU IIRC, by Steven in mainline.
> > Steven, please give a link.
> >
> 
> Ok, found!
> https://lkml.org/lkml/2011/6/14/366
>   

I doubt you'll hit this same bug, but it doesn't hurt to add it. If a
boot time kernel thread starts using edf, then it would be required. And
you never know who may do that ;-)

-- Steve



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-08 20:20         ` Juri Lelli
  2012-04-09 12:28           ` Hillf Danton
@ 2012-04-11 16:00           ` Steven Rostedt
  2012-04-11 16:09             ` Juri Lelli
  1 sibling, 1 reply; 129+ messages in thread
From: Steven Rostedt @ 2012-04-11 16:00 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Hillf Danton, peterz, tglx, mingo, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael,
	Fabio Checconi, Tommaso Cucinotta, Nicola Manica, Luca Abeni,
	Dhaval Giani, hgu1972, paulmck, Dario Faggioli, Insop Song,
	liming.wang

On Sun, 2012-04-08 at 22:20 +0200, Juri Lelli wrote:
> >
> >>>> +#ifdef CONFIG_SMP
> >>>> +       /*
> >>>> +        * In the unlikely case current and p have the same deadline
> >>>> +        * let us try to decide what's the best thing to do...
> >>>> +        */
> >>>> +       if ((s64)(p->dl.deadline - rq->curr->dl.deadline) == 0 &&
> >>>> +           !need_resched())
> >>>
> >>> please recheck !need_resched(), say rq->curr need reschedule?
> >>
> >> Sorry, I don't get this..
> >>
> > Perhaps smp_processor_id() != rq->cpu
> >
> 
> need_resched is actually checked...
> 

I guess what Hillf is trying to say is,

s/!need_resched()/!test_tsk_need_resched(rq->curr)/

-- Steve



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-06  7:14 ` [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic Juri Lelli
  2012-04-06 13:39   ` Hillf Danton
@ 2012-04-11 16:07   ` Steven Rostedt
  2012-04-11 16:11     ` Juri Lelli
  2012-04-11 16:14   ` Steven Rostedt
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 129+ messages in thread
From: Steven Rostedt @ 2012-04-11 16:07 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang

On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:

> +static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
> +{
> +	/*
> +	 * Current can't be migrated, useles to reschedule,

 s/useles/useless/

I feel so useles by only adding typo fixes ;-)

-- Steve

> +	 * let's hope p can move out.
> +	 */
> +	if (rq->curr->dl.nr_cpus_allowed == 1 ||
> +	    latest_cpu_find(rq->rd->span, rq->curr, NULL) == -1)
> +		return;
> +
> +	/*
> +	 * p is migratable, so let's not schedule it and
> +	 * see if it is pushed or pulled somewhere else.
> +	 */
> +	if (p->dl.nr_cpus_allowed != 1 &&
> +	    latest_cpu_find(rq->rd->span, p, NULL) != -1)
> +		return;
> +
> +	resched_task(rq->curr);
> +}
> +
> +#endif /* CONFIG_SMP */
> +


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-11 16:00           ` Steven Rostedt
@ 2012-04-11 16:09             ` Juri Lelli
  0 siblings, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-11 16:09 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Hillf Danton, peterz, tglx, mingo, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael,
	Fabio Checconi, Tommaso Cucinotta, Nicola Manica, Luca Abeni,
	Dhaval Giani, hgu1972, paulmck, Dario Faggioli, Insop Song,
	liming.wang

On 04/11/2012 06:00 PM, Steven Rostedt wrote:
> On Sun, 2012-04-08 at 22:20 +0200, Juri Lelli wrote:
>>>
>>>>>> +#ifdef CONFIG_SMP
>>>>>> +       /*
>>>>>> +        * In the unlikely case current and p have the same deadline
>>>>>> +        * let us try to decide what's the best thing to do...
>>>>>> +        */
> >>>>>> +       if ((s64)(p->dl.deadline - rq->curr->dl.deadline) == 0 &&
>>>>>> +           !need_resched())
>>>>>
>>>>> please recheck !need_resched(), say rq->curr need reschedule?
>>>>
>>>> Sorry, I don't get this..
>>>>
>>> Perhaps smp_processor_id() != rq->cpu
>>>
>>
>> need_resched is actually checked...
>>
>
> I guess what Hillf is trying to say is,
>
> s/!need_resched()/!test_tsk_need_resched(rq->curr)/
>

Yep, I finally got (and changed) it ;-).

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-11 16:07   ` Steven Rostedt
@ 2012-04-11 16:11     ` Juri Lelli
  0 siblings, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-11 16:11 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang

On 04/11/2012 06:07 PM, Steven Rostedt wrote:
> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>
>> +static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
>> +{
>> +	/*
>> +	 * Current can't be migrated, useles to reschedule,
>
>   s/useles/useless/
>
> I feel so useles by only adding typo fixes ;-)
>

Don't know why.. but this instead sounds good to me :-P.

>> +	 * let's hope p can move out.
>> +	 */
>> +	if (rq->curr->dl.nr_cpus_allowed == 1 ||
>> +	    latest_cpu_find(rq->rd->span, rq->curr, NULL) == -1)
>> +		return;
>> +
>> +	/*
>> +	 * p is migratable, so let's not schedule it and
>> +	 * see if it is pushed or pulled somewhere else.
>> +	 */
>> +	if (p->dl.nr_cpus_allowed != 1 &&
>> +	    latest_cpu_find(rq->rd->span, p, NULL) != -1)
>> +		return;
>> +
>> +	resched_task(rq->curr);
>> +}
>> +
>> +#endif /* CONFIG_SMP */
>> +
>

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-06  7:14 ` [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic Juri Lelli
  2012-04-06 13:39   ` Hillf Danton
  2012-04-11 16:07   ` Steven Rostedt
@ 2012-04-11 16:14   ` Steven Rostedt
  2012-04-19 13:44     ` Juri Lelli
  2012-04-11 16:21   ` Steven Rostedt
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 129+ messages in thread
From: Steven Rostedt @ 2012-04-11 16:14 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang

On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:

> +static int latest_cpu_find(struct cpumask *span,
> +			   struct task_struct *task,
> +			   struct cpumask *later_mask)
>  {
> +	const struct sched_dl_entity *dl_se = &task->dl;
> +	int cpu, found = -1, best = 0;
> +	u64 max_dl = 0;
> +
> +	for_each_cpu(cpu, span) {
> +		struct rq *rq = cpu_rq(cpu);
> +		struct dl_rq *dl_rq = &rq->dl;
> +
> +		if (cpumask_test_cpu(cpu, &task->cpus_allowed) &&
> +		    (!dl_rq->dl_nr_running || dl_time_before(dl_se->deadline,
> +		     dl_rq->earliest_dl.curr))) {
> +			if (later_mask)
> +				cpumask_set_cpu(cpu, later_mask);
> +			if (!best && !dl_rq->dl_nr_running) {

I hate to say this (and I also have yet to look at the patches after
this) but we should really take into account the RT tasks. It would suck
to preempt a normal RT task when a non RT task is running on another
CPU.

-- Steve

> +				best = 1;
> +				found = cpu;
> +			} else if (!best &&
> +				   dl_time_before(max_dl,
> +						  dl_rq->earliest_dl.curr)) {
> +				max_dl = dl_rq->earliest_dl.curr;
> +				found = cpu;
> +			}
> +		} else if (later_mask)
> +			cpumask_clear_cpu(cpu, later_mask);
> +	}
> +
> +	return found;
> +}



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-06  7:14 ` [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic Juri Lelli
                     ` (2 preceding siblings ...)
  2012-04-11 16:14   ` Steven Rostedt
@ 2012-04-11 16:21   ` Steven Rostedt
  2012-04-11 16:24     ` Juri Lelli
  2012-04-11 16:33   ` Steven Rostedt
  2012-04-11 17:25   ` Steven Rostedt
  5 siblings, 1 reply; 129+ messages in thread
From: Steven Rostedt @ 2012-04-11 16:21 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang

On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:

>  /*
>   * Only called when both the current and waking task are -deadline
>   * tasks.
> @@ -487,8 +819,20 @@ static void yield_task_dl(struct rq *rq)
>  static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
>  				  int flags)
>  {
> -	if (dl_time_before(p->dl.deadline, rq->curr->dl.deadline))
> +	if (dl_time_before(p->dl.deadline, rq->curr->dl.deadline)) {
>  		resched_task(rq->curr);
> +		return;
> +	}
> +
> +#ifdef CONFIG_SMP
> +	/*
> +	 * In the unlikely case current and p have the same deadline
> +	 * let us try to decide what's the best thing to do...
> +	 */
> +	if ((s64)(p->dl.deadline - rq->curr->dl.deadline) == 0 &&
> +	    !need_resched())

OK, maybe I'm thick. But how is:

  (s64)(p->dl.deadline - rq->curr->dl.deadline) == 0

Better than:

  p->dl.deadline == rq->curr->dl.deadline

?

-- Steve

> +		check_preempt_equal_dl(rq, p);
> +#endif /* CONFIG_SMP */
>  }
>  



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-11 16:21   ` Steven Rostedt
@ 2012-04-11 16:24     ` Juri Lelli
  0 siblings, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-11 16:24 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang

On 04/11/2012 06:21 PM, Steven Rostedt wrote:
> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>
>>   /*
>>    * Only called when both the current and waking task are -deadline
>>    * tasks.
>> @@ -487,8 +819,20 @@ static void yield_task_dl(struct rq *rq)
>>   static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
>>   				  int flags)
>>   {
>> -	if (dl_time_before(p->dl.deadline, rq->curr->dl.deadline))
>> +	if (dl_time_before(p->dl.deadline, rq->curr->dl.deadline)) {
>>   		resched_task(rq->curr);
>> +		return;
>> +	}
>> +
>> +#ifdef CONFIG_SMP
>> +	/*
>> +	 * In the unlikely case current and p have the same deadline
>> +	 * let us try to decide what's the best thing to do...
>> +	 */
>> +	if ((s64)(p->dl.deadline - rq->curr->dl.deadline) == 0 &&
>> +	    !need_resched())
>
> OK, maybe I'm thick. But how is:
>
>    (s64)(p->dl.deadline - rq->curr->dl.deadline) == 0
>
> Better than:
>
>    p->dl.deadline == rq->curr->dl.deadline
>
> ?
>

I agree, will change.

>
>> +		check_preempt_equal_dl(rq, p);
>> +#endif /* CONFIG_SMP */
>>   }
>>
>
>


Thanks,

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-06  7:14 ` [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic Juri Lelli
                     ` (3 preceding siblings ...)
  2012-04-11 16:21   ` Steven Rostedt
@ 2012-04-11 16:33   ` Steven Rostedt
  2012-04-24 13:15     ` Peter Zijlstra
  2012-04-11 17:25   ` Steven Rostedt
  5 siblings, 1 reply; 129+ messages in thread
From: Steven Rostedt @ 2012-04-11 16:33 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>  
> @@ -543,6 +897,9 @@ static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
>  {
>  	update_curr_dl(rq);
>  	p->se.exec_start = 0;
> +
> +	if (on_dl_rq(&p->dl) && p->dl.nr_cpus_allowed > 1)
> +		enqueue_pushable_dl_task(rq, p);
>  }

Ouch! We need to fix this. This has nothing to do with your patch
series, but if you look at schedule():

	put_prev_task(rq, prev);
	next = pick_next_task(rq);


We put the prev task and then pick the next task. If we call schedule
for some reason when we don't need to really schedule, then we just
added and removed from the pushable rb tree the same task. That is, we
did the rb manipulation twice, for no good reason.

Not sure how to fix this. But it will require a generic change.

-- Steve



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-06  7:14 ` [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic Juri Lelli
                     ` (4 preceding siblings ...)
  2012-04-11 16:33   ` Steven Rostedt
@ 2012-04-11 17:25   ` Steven Rostedt
  2012-04-11 17:48     ` Juri Lelli
  5 siblings, 1 reply; 129+ messages in thread
From: Steven Rostedt @ 2012-04-11 17:25 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, Kirill Tkhai

On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>  
>  static void set_cpus_allowed_dl(struct task_struct *p,
> @@ -622,10 +1346,145 @@ static void set_cpus_allowed_dl(struct task_struct *p,
>  
>  	BUG_ON(!dl_task(p));
>  
> +	/*
> +	 * Update only if the task is actually running (i.e.,
> +	 * it is on the rq AND it is not throttled).
> +	 */
> +	if (on_dl_rq(&p->dl) && (weight != p->dl.nr_cpus_allowed)) {
> +		struct rq *rq = task_rq(p);
> +
> +		if (!task_current(rq, p)) {
> +			/*
> +			 * If the task was on the pushable list,
> +			 * make sure it stays there only if the new
> +			 * mask allows that.
> +			 */
> +			if (p->dl.nr_cpus_allowed > 1)
> +				dequeue_pushable_dl_task(rq, p);
> +
> +			if (weight > 1)
> +				enqueue_pushable_dl_task(rq, p);
> +		}
> +
> +		if ((p->dl.nr_cpus_allowed <= 1) && (weight > 1)) {
> +			rq->dl.dl_nr_migratory++;
> +		} else if ((p->dl.nr_cpus_allowed > 1) && (weight <= 1)) {
> +			BUG_ON(!rq->dl.dl_nr_migratory);
> +			rq->dl.dl_nr_migratory--;
> +		}
> +
> +		update_dl_migration(&rq->dl);

Note, I'm in the process of testing this patch:

 https://lkml.org/lkml/2012/4/11/7

Just giving you a heads up, as this looks like you can benefit from this
change as well.

-- Steve


> +	}
> +
>  	cpumask_copy(&p->cpus_allowed, new_mask);
>  	p->dl.nr_cpus_allowed = weight;
>  }
> +



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-11 17:25   ` Steven Rostedt
@ 2012-04-11 17:48     ` Juri Lelli
  0 siblings, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-11 17:48 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, Kirill Tkhai

On 04/11/2012 07:25 PM, Steven Rostedt wrote:
> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>>
>>   static void set_cpus_allowed_dl(struct task_struct *p,
>> @@ -622,10 +1346,145 @@ static void set_cpus_allowed_dl(struct task_struct *p,
>>
>>   	BUG_ON(!dl_task(p));
>>
>> +	/*
>> +	 * Update only if the task is actually running (i.e.,
>> +	 * it is on the rq AND it is not throttled).
>> +	 */
>> +	if (on_dl_rq(&p->dl) && (weight != p->dl.nr_cpus_allowed)) {
>> +		struct rq *rq = task_rq(p);
>> +
>> +		if (!task_current(rq, p)) {
>> +			/*
>> +			 * If the task was on the pushable list,
>> +			 * make sure it stays there only if the new
>> +			 * mask allows that.
>> +			 */
>> +			if (p->dl.nr_cpus_allowed > 1)
>> +				dequeue_pushable_dl_task(rq, p);
>> +
>> +			if (weight > 1)
>> +				enqueue_pushable_dl_task(rq, p);
>> +		}
>> +
>> +		if ((p->dl.nr_cpus_allowed <= 1) && (weight > 1)) {
>> +			rq->dl.dl_nr_migratory++;
>> +		} else if ((p->dl.nr_cpus_allowed > 1) && (weight <= 1)) {
>> +			BUG_ON(!rq->dl.dl_nr_migratory);
>> +			rq->dl.dl_nr_migratory--;
>> +		}
>> +
>> +		update_dl_migration(&rq->dl);
>
> Note, I'm in the process of testing this patch:
>
>   https://lkml.org/lkml/2012/4/11/7
>
> Just giving you a heads up, as this looks like you can benefit from this
> change as well.
>

Sure! I noticed it just today and started wondering how it would apply in
my case.
  
>
>> +	}
>> +
>>   	cpumask_copy(&p->cpus_allowed, new_mask);
>>   	p->dl.nr_cpus_allowed = weight;
>>   }
>> +
>
>

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 08/16] sched: add period support for -deadline tasks.
  2012-04-06  7:14 ` [PATCH 08/16] sched: add period support for -deadline tasks Juri Lelli
@ 2012-04-11 20:32   ` Steven Rostedt
  2012-04-11 21:56     ` Juri Lelli
                       ` (2 more replies)
  0 siblings, 3 replies; 129+ messages in thread
From: Steven Rostedt @ 2012-04-11 20:32 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:

> @@ -293,7 +293,11 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>   * assigned (function returns true if it can).
>   *
>   * For this to hold, we must check if:
> - *   runtime / (deadline - t) < dl_runtime / dl_deadline .
> + *   runtime / (deadline - t) < dl_runtime / dl_period .
> + *
> + * Notice that the bandwidth check is done against the period. For
> + * task with deadline equal to period this is the same of using
> + * dl_deadline instead of dl_period in the equation above.

First, it seems that the function returns true if:

	dl_runtime / dl_period < runtime / (deadline - t)


I'm a little confused by this. We are comparing the ratio of runtime
left and deadline left, to the ratio of total runtime to period.

I'm actually confused by this premise anyway. What's the purpose of
comparing the ratio? If runtime < (deadline - t) wouldn't it not be able
to complete anyway? Or are we thinking that the runtime will be
interrupted proportionally by other tasks?

-- Steve




>   */
>  static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
>  {
> @@ -312,7 +316,7 @@ static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
>  	 * to the (absolute) deadline. Therefore, overflowing the u64
>  	 * type is very unlikely to occur in both cases.
>  	 */
> -	left = dl_se->dl_deadline * dl_se->runtime;
> +	left = dl_se->dl_period * dl_se->runtime;
>  	right = (dl_se->deadline - t) * dl_se->dl_runtime;
>  
>  	return dl_time_before(right, left);



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 11/16] sched: add latency tracing for -deadline tasks.
  2012-04-06  7:14 ` [PATCH 11/16] sched: add latency tracing " Juri Lelli
@ 2012-04-11 21:03   ` Steven Rostedt
  2012-04-12  7:16     ` Juri Lelli
  2012-04-16 15:51     ` Daniel Vacek
  0 siblings, 2 replies; 129+ messages in thread
From: Steven Rostedt @ 2012-04-11 21:03 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
> From: Dario Faggioli <raistlin@linux.it>
> 
> It is very likely that systems that wants/needs to use the new
> SCHED_DEADLINE policy also want to have the scheduling latency of
> the -deadline tasks under control.
> 
> For this reason a new version of the scheduling wakeup latency,
> called "wakeup_dl", is introduced.
> 
> As a consequence of applying this patch there will be three wakeup
> latency tracer:
>  * "wakeup", that deals with all tasks in the system;
>  * "wakeup_rt", that deals with -rt and -deadline tasks only;
>  * "wakeup_dl", that deals with -deadline tasks only.
> 
> Signed-off-by: Dario Faggioli <raistlin@linux.it>
> Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
> ---
>  kernel/trace/trace_sched_wakeup.c |   41 ++++++++++++++++++++++++++++++++++++-
>  kernel/trace/trace_selftest.c     |   30 ++++++++++++++++----------
>  2 files changed, 58 insertions(+), 13 deletions(-)
> 
> diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
> index e4a70c0..9c9b1be 100644
> --- a/kernel/trace/trace_sched_wakeup.c
> +++ b/kernel/trace/trace_sched_wakeup.c
> @@ -27,6 +27,7 @@ static int			wakeup_cpu;
>  static int			wakeup_current_cpu;
>  static unsigned			wakeup_prio = -1;
>  static int			wakeup_rt;
> +static int			wakeup_dl;
>  
>  static arch_spinlock_t wakeup_lock =
>  	(arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
> @@ -420,6 +421,17 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
>  	if ((wakeup_rt && !rt_task(p)) ||
>  			p->prio >= wakeup_prio ||
>  			p->prio >= current->prio)

I don't think you meant to keep both if statements. Look above and
below ;-)

> +	/*
> +	 * Semantic is like this:
> +	 *  - wakeup tracer handles all tasks in the system, independently
> +	 *    from their scheduling class;
> +	 *  - wakeup_rt tracer handles tasks belonging to sched_dl and
> +	 *    sched_rt class;
> +	 *  - wakeup_dl handles tasks belonging to sched_dl class only.
> +	 */
> +	if ((wakeup_dl && !dl_task(p)) ||
> +	    (wakeup_rt && !dl_task(p) && !rt_task(p)) ||
> +	    (p->prio >= wakeup_prio || p->prio >= current->prio))
>  		return;

Anyway, perhaps this should be broken up, as we don't want the double
test, that is, wakeup_rt and wakeup_dl are both checked. Perhaps do:

	if (wakeup_dl && !dl_task(p))
		return;
	else if (wakeup_rt && !dl_task(p) && !rt_task(p))
		return;

	if (p->prio >= wakeup_prio || p->prio >= current->prio)
		return;


-- Steve

>  
>  	pc = preempt_count();
> @@ -431,7 +443,7 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
>  	arch_spin_lock(&wakeup_lock);
>  
>  	/* check for races. */
> -	if (!tracer_enabled || p->prio >= wakeup_prio)
> +	if (!tracer_enabled || (!dl_task(p) && p->prio >= wakeup_prio))
>  		goto out_locked;
>  
>  	/* reset the trace */
> @@ -539,16 +551,25 @@ static int __wakeup_tracer_init(struct trace_array *tr)
>  



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 12/16] rtmutex: turn the plist into an rb-tree.
  2012-04-06  7:14 ` [PATCH 12/16] rtmutex: turn the plist into an rb-tree Juri Lelli
@ 2012-04-11 21:11   ` Steven Rostedt
  2012-04-22 14:28     ` Juri Lelli
  2012-04-23  8:33     ` Peter Zijlstra
  0 siblings, 2 replies; 129+ messages in thread
From: Steven Rostedt @ 2012-04-11 21:11 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang

On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
> and provide a proper comparison function for -deadline and
> -priority tasks.

I have to ask. Why not just add a rbtree with a plist? That is, add all
deadline tasks to the rbtree and all others to the plist. As plist has a
O(1) operation, and rbtree does not. We are making all RT tasks suffer
the overhead of the rbtree.

As deadline tasks always win, the two may stay agnostic from each other.
Check first the rbtree, if it is empty, then check the plist.

This will become more predominant with the -rt tree as it converts most
the locks in the kernel to pi mutexes.
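
(In other words, roughly -- a minimal standalone sketch of the lookup
order, with both containers abstracted away; "waiter" here is purely
illustrative, not struct rt_mutex_waiter:)

#include <stddef.h>

/* Illustrative opaque waiter type. */
struct waiter;

/*
 * -deadline waiters live in their own EDF-ordered rb-tree, everyone
 * else stays on the O(1) plist.  The top waiter is the leftmost
 * rb-tree node if there is one, otherwise the head of the plist.
 */
static struct waiter *top_pi_waiter(struct waiter *dl_leftmost,
				    struct waiter *plist_head)
{
	return dl_leftmost ? dl_leftmost : plist_head;
}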

-- Steve

> 
> This is done mainly because:
>  - classical prio field of the plist is just an int, which might
>    not be enough for representing a deadline;
>  - manipulating such a list would become O(nr_deadline_tasks),
>    which might be to much, as the number of -deadline task increases.
> 
> Therefore, an rb-tree is used, and tasks are queued in it according
> to the following logic:
>  - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
>    one with the higher (lower, actually!) prio wins;
>  - among a -priority and a -deadline task, the latter always wins;
>  - among two -deadline tasks, the one with the earliest deadline
>    wins.
> 
> Queueing and dequeueing functions are changed accordingly, for both
> the list of a task's pi-waiters and the list of tasks blocked on
> a pi-lock.
> 
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Dario Faggioli <raistlin@linux.it>
> Signed-off-by: Juri Lelli <juri.lelli@gmail.com>



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 08/16] sched: add period support for -deadline tasks.
  2012-04-11 20:32   ` Steven Rostedt
@ 2012-04-11 21:56     ` Juri Lelli
  2012-04-11 22:13     ` Tommaso Cucinotta
  2012-04-12  6:39     ` Luca Abeni
  2 siblings, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-11 21:56 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, tglx, mingo, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On 04/11/2012 10:32 PM, Steven Rostedt wrote:
> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>
>> @@ -293,7 +293,11 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>>    * assigned (function returns true if it can).
>>    *
>>    * For this to hold, we must check if:
>> - *   runtime / (deadline - t)<  dl_runtime / dl_deadline .
>> + *   runtime / (deadline - t)<  dl_runtime / dl_period .
>> + *
>> + * Notice that the bandwidth check is done against the period. For
>> + * task with deadline equal to period this is the same of using
>> + * dl_deadline instead of dl_period in the equation above.
>
> First, it seems that the function returns true if:
>
> 	dl_runtime / dl_period<  runtime / (deadline - t)
>

Right, the comment is wrong! Just reverse the inequality as you did.
  
>
> I'm a little confused by this. We are comparing the ratio of runtime
> left and deadline left, to the ratio of total runtime to period.
>
> I'm actually confused by this premise anyway. What's the purpose of
> comparing the ratio? If runtime<  (deadline - t) wouldn't it not be able
> to complete anyway? Or are we thinking that the runtime will be
> interrupted proportionally by other tasks?
>

We are actually applying one of the CBS rules. We want to be able to
"slow down" a deadline task if it is going to exceed its reserved
bandwidth. We are in fact checking here that the bandwidth this task
will consume from t to its deadline is no more than its reserved one.
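
(For reference, a standalone sketch of that check, written the same
division-free way as the code below; the struct and names here are
illustrative, not the kernel's:)

#include <stdbool.h>
#include <stdint.h>

/* Illustrative parameters of a -deadline entity. */
struct dl_params {
	uint64_t dl_runtime;	/* reserved runtime per period */
	uint64_t dl_period;	/* reservation period */
	uint64_t runtime;	/* runtime left in the current instance */
	uint64_t deadline;	/* current absolute deadline */
};

/*
 * True if keeping the current (runtime, deadline) pair at time t would
 * exceed the reserved bandwidth, i.e. if
 *	runtime / (deadline - t) > dl_runtime / dl_period,
 * rewritten without divisions.
 */
static bool dl_overflow(const struct dl_params *p, uint64_t t)
{
	return p->dl_period * p->runtime > (p->deadline - t) * p->dl_runtime;
}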

>
>
>>    */
>>   static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
>>   {
>> @@ -312,7 +316,7 @@ static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
>>   	 * to the (absolute) deadline. Therefore, overflowing the u64
>>   	 * type is very unlikely to occur in both cases.
>>   	 */
>> -	left = dl_se->dl_deadline * dl_se->runtime;
>> +	left = dl_se->dl_period * dl_se->runtime;
>>   	right = (dl_se->deadline - t) * dl_se->dl_runtime;
>>
>>   	return dl_time_before(right, left);
>
>

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 08/16] sched: add period support for -deadline tasks.
  2012-04-11 20:32   ` Steven Rostedt
  2012-04-11 21:56     ` Juri Lelli
@ 2012-04-11 22:13     ` Tommaso Cucinotta
  2012-04-12  0:19       ` Steven Rostedt
  2012-04-12  6:39     ` Luca Abeni
  2 siblings, 1 reply; 129+ messages in thread
From: Tommaso Cucinotta @ 2012-04-11 22:13 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Juri Lelli, peterz, tglx, mingo, cfriesen, oleg, fweisbec,
	darren, johan.eker, p.faure, linux-kernel, claudio, michael,
	fchecconi, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang

On 11/04/2012 21:32, Steven Rostedt wrote:
> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>
>> @@ -293,7 +293,11 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>>    * assigned (function returns true if it can).
>>    *
>>    * For this to hold, we must check if:
>> - *   runtime / (deadline - t)<  dl_runtime / dl_deadline .
>> + *   runtime / (deadline - t)<  dl_runtime / dl_period .
>> + *
>> + * Notice that the bandwidth check is done against the period. For
>> + * task with deadline equal to period this is the same of using
>> + * dl_deadline instead of dl_period in the equation above.
> First, it seems that the function returns true if:
>
> 	dl_runtime / dl_period<  runtime / (deadline - t)
>
>
> I'm a little confused by this. We are comparing the ratio of runtime
> left and deadline left, to the ratio of total runtime to period.
>
> I'm actually confused by this premise anyway. What's the purpose of
> comparing the ratio? If runtime<  (deadline - t) wouldn't it not be able
> to complete anyway? Or are we thinking that the runtime will be
> interrupted proportionally by other tasks?

That's a well-known property of the CBS scheduling algorithm:
the "unblock rule" says that, when a task wakes up, if the residual
budget over residual deadline fits within the allocated "bandwidth",
then we can keep the current (abs) deadline and residual budget
without disrupting the schedulability of the system (i.e., ability of
other admitted tasks to meet their deadlines).
Otherwise, we should reset the status, i.e., refill budget and set
the deadline a period in the future, because keeping the current
abs deadline of the task would result in breaking guarantees
promised to other tasks.
Imagine a task going to sleep and waking up too close to its deadline:
its deadline would be the closest in the system, so it wouldn't allow
anyone else to run. If its residual budget is also big, it would then
keep everyone else from running for too long. The classical easy
admission test for single-CPU EDF (sum of budgets over periods <= 1)
doesn't account for such a scenario with tasks blocking (which can
actually be accounted for with far more complex tests that know for
how long each task will block, etc.). Using the CBS, however, we can
keep the easy test and add this simple cut-off rule at task wake-up,
which adds "temporal isolation", i.e., the capability not to disrupt
others' guarantees if a task sleeps for too long.
That said, I guess/hope the rule is implemented right :-)
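
(To make the rule concrete, a minimal standalone sketch of the wake-up
decision; names and types are illustrative, not the actual patch:)

#include <stdbool.h>
#include <stdint.h>

/* Illustrative CBS server state. */
struct cbs {
	uint64_t budget;	/* Q: reserved runtime per period */
	uint64_t period;	/* T: reservation period */
	uint64_t runtime;	/* residual budget */
	uint64_t deadline;	/* current absolute deadline */
};

/*
 * Unblock rule: keep (runtime, deadline) only if the residual bandwidth
 * still fits within Q/T; otherwise refill the budget and move the
 * deadline one period into the future.
 */
static void cbs_wakeup(struct cbs *s, uint64_t now)
{
	bool fits = now < s->deadline &&
		    s->period * s->runtime <= (s->deadline - now) * s->budget;

	if (!fits) {
		s->runtime  = s->budget;
		s->deadline = now + s->period;
	}
}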

For a formal proof, I think you can refer to the Abeni's paper(s):

-) Integrating Multimedia Applications in Hard Real-Time Systems, RTSS '98
     www.cis.upenn.edu/~lee/01cis642/papers/AB98.pdf

which redirects on the Technical Report:
-) Server Mechanisms for Multimedia Applications

Hope this helps,

     T.

-- 
Tommaso Cucinotta, Computer Engineering PhD, Researcher
ReTiS Lab, Scuola Superiore Sant'Anna, Pisa, Italy
Tel +39 050 882 024, Fax +39 050 882 003
http://retis.sssup.it/people/tommaso


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 08/16] sched: add period support for -deadline tasks.
  2012-04-11 22:13     ` Tommaso Cucinotta
@ 2012-04-12  0:19       ` Steven Rostedt
  0 siblings, 0 replies; 129+ messages in thread
From: Steven Rostedt @ 2012-04-12  0:19 UTC (permalink / raw)
  To: Tommaso Cucinotta
  Cc: Juri Lelli, peterz, tglx, mingo, cfriesen, oleg, fweisbec,
	darren, johan.eker, p.faure, linux-kernel, claudio, michael,
	fchecconi, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang

On Wed, 2012-04-11 at 23:13 +0100, Tommaso Cucinotta wrote:

> For a formal proof, I think you can refer to the Abeni's paper(s):
> 
> -) Integrating Multimedia Applications in Hard Real-Time Systems, RTSS '98
>      www.cis.upenn.edu/~lee/01cis642/papers/AB98.pdf
> 
> which redirects on the Technical Report:
> -) Server Mechanisms for Multimedia Applications
> 
> Hope this helps,

Thanks, I'll read up on it. Honestly, it's been 5 years since I worked
with rate monotonic algorithms. I'm happy to get back into it ;-)

-- Steve



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 13/16] sched: drafted deadline inheritance logic.
  2012-04-06  7:14 ` [PATCH 13/16] sched: drafted deadline inheritance logic Juri Lelli
@ 2012-04-12  2:42   ` Steven Rostedt
  2012-04-22 14:04     ` Juri Lelli
  2012-04-23  8:39     ` Peter Zijlstra
  0 siblings, 2 replies; 129+ messages in thread
From: Steven Rostedt @ 2012-04-12  2:42 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
> From: Dario Faggioli <raistlin@linux.it>
> 
> Some method to deal with rt-mutexes and make sched_dl interact with
> the current PI-coded is needed, raising all but trivial issues, that
> needs (according to us) to be solved with some restructuring of
> the pi-code (i.e., going toward a proxy execution-ish implementation).

I also would like to point out that this is probably only for -rt? Or is
this for PI futexes as well?

Note, the non-proxy version of this code may be an issue. Could a
process force more bandwidth than was allowed by rlimit if it used PI
futexes to force a behavior where PI waiters were waiting on another
process while it held the PI futex? The process that held the futex may
get more bandwidth if it never lets go of the futex, right?

> 
> This is under development, in the meanwhile, as a temporary solution,
> what this commits does is:
>  - ensure a pi-lock owner with waiters is never throttled down. Instead,
>    when it runs out of runtime, it immediately gets replenished and it's
>    deadline is postponed;
>  - the scheduling parameters (relative deadline and default runtime)
>    used for that replenishments --during the whole period it holds the
>    pi-lock-- are the ones of the waiting task with earliest deadline.

This sounds similar to what I implemented for a company back in 2005.

> 
> Acting this way, we provide some kind of boosting to the lock-owner,
> still by using the existing (actually, slightly modified by the previous
> commit) pi-architecture.
> 
> We would stress the fact that this is only a surely needed, all but
> clean solution to the problem. In the end it's only a way to re-start
> discussion within the community. So, as always, comments, ideas, rants,
> etc.. are welcome! :-)
> 
> Signed-off-by: Dario Faggioli <raistlin@linux.it>
> Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
> ---
>  include/linux/sched.h |    9 +++++-
>  kernel/fork.c         |    1 +
>  kernel/rtmutex.c      |   13 +++++++-
>  kernel/sched.c        |   34 ++++++++++++++++++---
>  kernel/sched_dl.c     |   77 +++++++++++++++++++++++++++++++++---------------
>  5 files changed, 102 insertions(+), 32 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 5ef7bb6..ca45db4 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1309,8 +1309,12 @@ struct sched_dl_entity {
>  	 * @dl_new tells if a new instance arrived. If so we must
>  	 * start executing it with full runtime and reset its absolute
>  	 * deadline;
> +	 *
> +	 * @dl_boosted tells if we are boosted due to DI. If so we are

 DI?

-- Steve


> +	 * outside bandwidth enforcement mechanism (but only until we
> +	 * exit the critical section).
>  	 */
> -	int dl_throttled, dl_new;
> +	int dl_throttled, dl_new, dl_boosted;
>  
>  	/*
>  	 * Bandwidth enforcement timer. Each -deadline task has its
> @@ -1556,6 +1560,8 @@ struct task_struct {
>  	struct rb_node *pi_waiters_leftmost;
>  	/* Deadlock detection and priority inheritance handling */
>  	struct rt_mutex_waiter *pi_blocked_on;
> +	/* Top pi_waiters task */
> +	struct task_struct *pi_top_task;
>  #endif
>  



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 08/16] sched: add period support for -deadline tasks.
  2012-04-11 20:32   ` Steven Rostedt
  2012-04-11 21:56     ` Juri Lelli
  2012-04-11 22:13     ` Tommaso Cucinotta
@ 2012-04-12  6:39     ` Luca Abeni
  2 siblings, 0 replies; 129+ messages in thread
From: Luca Abeni @ 2012-04-12  6:39 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Juri Lelli, peterz, tglx, mingo, cfriesen, oleg, fweisbec,
	darren, johan.eker, p.faure, linux-kernel, claudio, michael,
	fchecconi, tommaso.cucinotta, nicola.manica, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

Hi Steven,

On 04/11/2012 10:32 PM, Steven Rostedt wrote:
> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>
>> @@ -293,7 +293,11 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>>    * assigned (function returns true if it can).
>>    *
>>    * For this to hold, we must check if:
>> - *   runtime / (deadline - t)<  dl_runtime / dl_deadline .
>> + *   runtime / (deadline - t)<  dl_runtime / dl_period .
>> + *
>> + * Notice that the bandwidth check is done against the period. For
>> + * task with deadline equal to period this is the same of using
>> + * dl_deadline instead of dl_period in the equation above.
>
> First, it seems that the function returns true if:
>
> 	dl_runtime / dl_period<  runtime / (deadline - t)
>
>
> I'm a little confused by this. We are comparing the ratio of runtime
> left and deadline left, to the ratio of total runtime to period.
>
> I'm actually confused by this premise anyway. What's the purpose of
> comparing the ratio? If runtime<  (deadline - t) wouldn't it not be able
> to complete anyway? Or are we thinking that the runtime will be
> interrupted proportionally by other tasks?
I see that other people have been faster than me to reply (and Tommaso's
reply is pretty good).

I'll just add that the original CBS algorithm (which introduced this kind
of check) assumed "relative deadline = reservation period". So, if you
are wondering why
	dl_runtime / dl_deadline
is changed into
	dl_runtime / dl_period
I agree that this is not too clear... In my opinion, a more correct (but
probably too pessimistic) check should use
	dl_runtime / min(dl_deadline, dl_period)
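
(Purely as an illustration -- not what the patch does -- that variant
would look something like:)

#include <stdbool.h>
#include <stdint.h>

/* Check the residual bandwidth against dl_runtime / min(dl_deadline,
 * dl_period) rather than dl_runtime / dl_period (illustration only). */
static bool dl_overflow_pessimistic(uint64_t dl_runtime, uint64_t dl_deadline,
				    uint64_t dl_period, uint64_t runtime,
				    uint64_t deadline, uint64_t t)
{
	uint64_t ref = dl_deadline < dl_period ? dl_deadline : dl_period;

	/* runtime / (deadline - t) > dl_runtime / ref, division-free */
	return ref * runtime > (deadline - t) * dl_runtime;
}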


Also, note that this check is presented as
	c_s >= (d_{s,k} - r_{i,j})U_s
(where U_s = Q_s/T_s = dl_runtime / dl_period - with dl_period = dl_deadline)
in the original paper (see the "When a job J_{i,j} arrives and the server is
idle..." item at the end of the first column of the third page of the paper).

Also, regarding the "Server mechanisms for multimedia applications" technical
report mentioned by Tommaso, you can download it from here:
http://xoomer.virgilio.it/lucabe72/pubs/tr-98-01.ps



				Luca

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 11/16] sched: add latency tracing for -deadline tasks.
  2012-04-11 21:03   ` Steven Rostedt
@ 2012-04-12  7:16     ` Juri Lelli
  2012-04-16 15:51     ` Daniel Vacek
  1 sibling, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-12  7:16 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, tglx, mingo, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On 04/11/2012 11:03 PM, Steven Rostedt wrote:
> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>> From: Dario Faggioli<raistlin@linux.it>
>>
>> It is very likely that systems that wants/needs to use the new
>> SCHED_DEADLINE policy also want to have the scheduling latency of
>> the -deadline tasks under control.
>>
>> For this reason a new version of the scheduling wakeup latency,
>> called "wakeup_dl", is introduced.
>>
>> As a consequence of applying this patch there will be three wakeup
>> latency tracer:
>>   * "wakeup", that deals with all tasks in the system;
>>   * "wakeup_rt", that deals with -rt and -deadline tasks only;
>>   * "wakeup_dl", that deals with -deadline tasks only.
>>
>> Signed-off-by: Dario Faggioli<raistlin@linux.it>
>> Signed-off-by: Juri Lelli<juri.lelli@gmail.com>
>> ---
>>   kernel/trace/trace_sched_wakeup.c |   41 ++++++++++++++++++++++++++++++++++++-
>>   kernel/trace/trace_selftest.c     |   30 ++++++++++++++++----------
>>   2 files changed, 58 insertions(+), 13 deletions(-)
>>
>> diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
>> index e4a70c0..9c9b1be 100644
>> --- a/kernel/trace/trace_sched_wakeup.c
>> +++ b/kernel/trace/trace_sched_wakeup.c
>> @@ -27,6 +27,7 @@ static int			wakeup_cpu;
>>   static int			wakeup_current_cpu;
>>   static unsigned			wakeup_prio = -1;
>>   static int			wakeup_rt;
>> +static int			wakeup_dl;
>>
>>   static arch_spinlock_t wakeup_lock =
>>   	(arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
>> @@ -420,6 +421,17 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
>>   	if ((wakeup_rt&&  !rt_task(p)) ||
>>   			p->prio>= wakeup_prio ||
>>   			p->prio>= current->prio)
>
> I don't think you meant to keep both if statements. Look above and
> below ;-)
>

Ouch! Forgot to cut something! :-(
  
>> +	/*
>> +	 * Semantic is like this:
>> +	 *  - wakeup tracer handles all tasks in the system, independently
>> +	 *    from their scheduling class;
>> +	 *  - wakeup_rt tracer handles tasks belonging to sched_dl and
>> +	 *    sched_rt class;
>> +	 *  - wakeup_dl handles tasks belonging to sched_dl class only.
>> +	 */
>> +	if ((wakeup_dl&&  !dl_task(p)) ||
>> +	    (wakeup_rt&&  !dl_task(p)&&  !rt_task(p)) ||
>> +	    (p->prio>= wakeup_prio || p->prio>= current->prio))
>>   		return;
>
> Anyway, perhaps this should be broken up, as we don't want the double
> test, that is, wakeup_rt and wakeup_dl are both checked. Perhaps do:
>
> 	if (wakeup_dl&&  !dl_task(p))
> 		return;
> 	else if (wakeup_rt&&  !dl_task(p)&&  !rt_task(p))
> 		return;
>
> 	if (p->prio>= wakeup_prio || p->prio>= current->prio)
> 		return;

Yes, way better.

Thanks!

- Juri

>>
>>   	pc = preempt_count();
>> @@ -431,7 +443,7 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
>>   	arch_spin_lock(&wakeup_lock);
>>
>>   	/* check for races. */
>> -	if (!tracer_enabled || p->prio>= wakeup_prio)
>> +	if (!tracer_enabled || (!dl_task(p)&&  p->prio>= wakeup_prio))
>>   		goto out_locked;
>>
>>   	/* reset the trace */
>> @@ -539,16 +551,25 @@ static int __wakeup_tracer_init(struct trace_array *tr)
>>
>
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-11 14:10     ` Steven Rostedt
@ 2012-04-12 12:28       ` Hillf Danton
  2012-04-12 12:51         ` Steven Rostedt
  0 siblings, 1 reply; 129+ messages in thread
From: Hillf Danton @ 2012-04-12 12:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Juri Lelli, peterz, tglx, mingo, cfriesen, oleg, fweisbec,
	darren, johan.eker, p.faure, linux-kernel, claudio, michael,
	fcheccon

On Wed, Apr 11, 2012 at 10:10 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Fri, 2012-04-06 at 21:39 +0800, Hillf Danton wrote:
>
>> > +static void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>> > +{
>> > +       dl_rq = &rq_of_dl_rq(dl_rq)->dl;
>> > +
>> > +       dl_rq->dl_nr_total++;
>> > +       if (dl_se->nr_cpus_allowed > 1)
>> > +               dl_rq->dl_nr_migratory++;
>> > +
>> > +       update_dl_migration(dl_rq);
>>
>>       if (dl_se->nr_cpus_allowed > 1) {
>>               dl_rq->dl_nr_migratory++;
>>               /* No change in migratory, no update of migration */
>
> This is not true. As dl_nr_total changed. If there was only one dl task
> queued that can migrate, and then another dl task is queued but this
> task can not migrate, the update_dl_migration still needs to be called.
> As dl_nr_migratory would be 1, but now dl_nr_total > 1. This means we
> are now overloaded.
>
	if (2 == dl_nr_migratory + dl_nr_total)
		rq was not overloaded;

	after enqueuing a deadline task that is not migratory,

	if (current task is not preempted)
		rq remains not overloaded;

	else if (current task is not pushed out) {
		if (rq is not overloaded)
			maintenance of overloaded is __corrupted__;
	}

btw, same behavior in RTS?

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-12 12:28       ` Hillf Danton
@ 2012-04-12 12:51         ` Steven Rostedt
  2012-04-12 12:56           ` Hillf Danton
  0 siblings, 1 reply; 129+ messages in thread
From: Steven Rostedt @ 2012-04-12 12:51 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Juri Lelli, peterz, tglx, mingo, cfriesen, oleg, fweisbec,
	darren, johan.eker, p.faure, linux-kernel, claudio, michael,
	fcheccon

On Thu, 2012-04-12 at 20:28 +0800, Hillf Danton wrote:
> On Wed, Apr 11, 2012 at 10:10 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> > On Fri, 2012-04-06 at 21:39 +0800, Hillf Danton wrote:
> >
> >> > +static void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> >> > +{
> >> > +       dl_rq = &rq_of_dl_rq(dl_rq)->dl;
> >> > +
> >> > +       dl_rq->dl_nr_total++;
> >> > +       if (dl_se->nr_cpus_allowed > 1)
> >> > +               dl_rq->dl_nr_migratory++;
> >> > +
> >> > +       update_dl_migration(dl_rq);
> >>
> >>       if (dl_se->nr_cpus_allowed > 1) {
> >>               dl_rq->dl_nr_migratory++;
> >>               /* No change in migratory, no update of migration */
> >
> > This is not true. As dl_nr_total changed. If there was only one dl task
> > queued that can migrate, and then another dl task is queued but this
> > task can not migrate, the update_dl_migration still needs to be called.
> > As dl_nr_migratory would be 1, but now dl_nr_total > 1. This means we
> > are now overloaded.
> >

You're speaking in riddles.

> 	if (2 == dl_nr_migratory + dl_nr_total)
> 		rq was not overloaded;
> 
> 	after enqueuing a deadline task that is not migratory,

Now rq would be overloaded because:

	dl_nr_migratory + dl_nr_total == 3
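
(That is -- as a minimal standalone sketch, not the actual
update_dl_migration():)

#include <stdbool.h>

/* A runqueue counts as overloaded once it has more than one queued
 * -deadline task and at least one of them is allowed to migrate. */
static bool dl_overloaded(unsigned int dl_nr_total,
			  unsigned int dl_nr_migratory)
{
	return dl_nr_total > 1 && dl_nr_migratory > 0;
}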


> 
> 	if (current task is not preempted)
> 		rq remains not overloaded;

s/not//

> 
> 	else if (current task is not pushed out) {
> 		if (rq is not overloaded)
> 			maintenance of overloaded is __corrupted__;
> 	}
> 
> btw, same behavior in RTS?

I still don't understand what you are saying.

I can see your scenario happening with the change you are suggesting
though.

-- Steve



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-12 12:51         ` Steven Rostedt
@ 2012-04-12 12:56           ` Hillf Danton
  2012-04-12 13:35             ` Steven Rostedt
  0 siblings, 1 reply; 129+ messages in thread
From: Hillf Danton @ 2012-04-12 12:56 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Juri Lelli, peterz, tglx, mingo, cfriesen, oleg, fweisbec,
	darren, johan.eker, p.faure, linux-kernel, claudio, michael,
	fcheccon

On Thu, Apr 12, 2012 at 8:51 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Thu, 2012-04-12 at 20:28 +0800, Hillf Danton wrote:
>> On Wed, Apr 11, 2012 at 10:10 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
>> > On Fri, 2012-04-06 at 21:39 +0800, Hillf Danton wrote:
>> >
>> >> > +static void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>> >> > +{
>> >> > +       dl_rq = &rq_of_dl_rq(dl_rq)->dl;
>> >> > +
>> >> > +       dl_rq->dl_nr_total++;
>> >> > +       if (dl_se->nr_cpus_allowed > 1)
>> >> > +               dl_rq->dl_nr_migratory++;
>> >> > +
>> >> > +       update_dl_migration(dl_rq);
>> >>
>> >>       if (dl_se->nr_cpus_allowed > 1) {
>> >>               dl_rq->dl_nr_migratory++;
>> >>               /* No change in migratory, no update of migration */
>> >
>> > This is not true. As dl_nr_total changed. If there was only one dl task
>> > queued that can migrate, and then another dl task is queued but this
>> > task can not migrate, the update_dl_migration still needs to be called.
>> > As dl_nr_migratory would be 1, but now dl_nr_total > 1. This means we
>> > are now overloaded.
>> >
>
> You're speaking in riddles.
>
>>       if (2 == dl_nr_migratory + dl_nr_total)
>>               rq was not overloaded;
>>
>>       after enqueuing a deadline task that is not migratory,
>
> Now rq would be overloaded because:
>
>        dl_nr_migratory + dl_nr_total == 3
>

Which is the current task, and where?

>
>>
>>       if (current task is not preempted)
>>               rq remains not overloaded;
>
> s/not//
>
>>
>>       else if (current task is not pushed out) {
>>               if (rq is not overloaded)
>>                       maintenance of overloaded is __corrupted__;
>>       }
>>
>> btw, same behavior in RTS?
>
> I still don't understand what you are saying.
>
> I can see your scenario happening with the change you are suggesting
> though.
>
> -- Steve
>
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-12 12:56           ` Hillf Danton
@ 2012-04-12 13:35             ` Steven Rostedt
  2012-04-12 13:41               ` Hillf Danton
  0 siblings, 1 reply; 129+ messages in thread
From: Steven Rostedt @ 2012-04-12 13:35 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Juri Lelli, peterz, tglx, mingo, cfriesen, oleg, fweisbec,
	darren, johan.eker, p.faure, linux-kernel, claudio, michael,
	fcheccon

On Thu, 2012-04-12 at 20:56 +0800, Hillf Danton wrote:

> >>       if (2 == dl_nr_migratory + dl_nr_total)
> >>               rq was not overloaded;
> >>
> >>       after enqueuing a deadline task that is not migratory,
> >
> > Now rq would be overloaded because:
> >
> >        dl_nr_migratory + dl_nr_total == 3
> >
> 
> Which is the current task, and where?

More riddles?

The current task is the one executing on the CPU.

-- Steve

> 
> >
> >>
> >>       if (current task is not preempted)
> >>               rq remains not overloaded;
> >
> > s/not//
> >
> >>
> >>       else if (current task is not pushed out) {
> >>               if (rq is not overloaded)
> >>                       maintenance of overloaded is __corrupted__;
> >>       }
> >>
> >> btw, same behavior in RTS?
> >
> > I still don't understand what you are saying.
> >
> > I can see your scenario happening with the change you are suggesting
> > though.
> >
> > -- Steve
> >
> >



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-12 13:35             ` Steven Rostedt
@ 2012-04-12 13:41               ` Hillf Danton
  0 siblings, 0 replies; 129+ messages in thread
From: Hillf Danton @ 2012-04-12 13:41 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Juri Lelli, peterz, tglx, mingo, cfriesen, oleg, fweisbec,
	darren, johan.eker, p.faure, linux-kernel, claudio, michael,
	fcheccon

On Thu, Apr 12, 2012 at 9:35 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Thu, 2012-04-12 at 20:56 +0800, Hillf Danton wrote:
>
>> >>       if (2 == dl_nr_migratory + dl_nr_total)
>> >>               rq was not overloaded;
>> >>
>> >>       after enqueuing a deadline task that is not migratory,
>> >
>> > Now rq would be overloaded because:
>> >
>> >        dl_nr_migratory + dl_nr_total == 3
>> >
>>
>> Which is the current task, and where?
>
> More riddles?
>
As always:)

>
> The current task is the one executing on the CPU.
>
I will look at -rt tomorrow, and report if I find anything new.

Best Regards
Hillf

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 11/16] sched: add latency tracing for -deadline tasks.
  2012-04-11 21:03   ` Steven Rostedt
  2012-04-12  7:16     ` Juri Lelli
@ 2012-04-16 15:51     ` Daniel Vacek
  2012-04-16 19:56       ` Steven Rostedt
  1 sibling, 1 reply; 129+ messages in thread
From: Daniel Vacek @ 2012-04-16 15:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Juri Lelli, peterz, tglx, mingo, cfriesen, oleg, fweisbec,
	darren, johan.eker, p.faure, linux-kernel, claudio, raistlin

Hi,

On Wed, Apr 11, 2012 at 23:03, Steven Rostedt <rostedt@goodmis.org> wrote:
>> +     /*
>> +      * Semantic is like this:
>> +      *  - wakeup tracer handles all tasks in the system, independently
>> +      *    from their scheduling class;
>> +      *  - wakeup_rt tracer handles tasks belonging to sched_dl and
>> +      *    sched_rt class;
>> +      *  - wakeup_dl handles tasks belonging to sched_dl class only.
>> +      */
>> +     if ((wakeup_dl && !dl_task(p)) ||
>> +         (wakeup_rt && !dl_task(p) && !rt_task(p)) ||
>> +         (p->prio >= wakeup_prio || p->prio >= current->prio))
>>               return;
>
> Anyway, perhaps this should be broken up, as we don't want the double
> test, that is, wakeup_rt and wakeup_dl are both checked. Perhaps do:
>
>        if (wakeup_dl && !dl_task(p))
>                return;
>        else if (wakeup_rt && !dl_task(p) && !rt_task(p))
>                return;
>
>        if (p->prio >= wakeup_prio || p->prio >= current->prio)
>                return;
>
>
> -- Steve

sorry for the question, I'm obviously missing something here but what
is the logic behind this rewrite? In both cases my gcc generates the
same code for me.

nX

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 11/16] sched: add latency tracing for -deadline tasks.
  2012-04-16 15:51     ` Daniel Vacek
@ 2012-04-16 19:56       ` Steven Rostedt
  2012-04-16 21:31         ` Daniel Vacek
  0 siblings, 1 reply; 129+ messages in thread
From: Steven Rostedt @ 2012-04-16 19:56 UTC (permalink / raw)
  To: Daniel Vacek
  Cc: Juri Lelli, peterz, tglx, mingo, cfriesen, oleg, fweisbec,
	darren, johan.eker, p.faure, linux-kernel, claudio, raistlin

On Mon, 2012-04-16 at 17:51 +0200, Daniel Vacek wrote:

> sorry for the question, I'm obviously missing something here but what
> is the logic behind this rewrite? In both cases my gcc generates the
> same code for me.

Yeah, I noticed that later. I thought it was doing something slightly
different, but after a good night's rest and re-reading what I wrote in
the morning, it was obviously the same functionality.

But that said, the final result is much easier to read. And as you
stated, while it doesn't make a difference in the final outcome, it
ended up being a good fix (more readable code means fewer bugs).

-- Steve



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 11/16] sched: add latency tracing for -deadline tasks.
  2012-04-16 19:56       ` Steven Rostedt
@ 2012-04-16 21:31         ` Daniel Vacek
  0 siblings, 0 replies; 129+ messages in thread
From: Daniel Vacek @ 2012-04-16 21:31 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Juri Lelli, peterz, tglx, mingo, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, raistlin

On Mon, Apr 16, 2012 at 21:56, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 2012-04-16 at 17:51 +0200, Daniel Vacek wrote:
>
>> sorry for the question, I'm obviously missing something here but what
>> is the logic behind this rewrite? In both cases my gcc generates the
>> same code for me.
>
> Yeah, I noticed that later. I thought it was doing something slightly
> different, but after a good night's rest and re-reading what I wrote in
> the morning, it was obviously the same functionality.
>
> But that said, the final result is much easier to read. And as you
> stated, while it doesn't make a difference in the final outcome, it
> ended up being a good fix (more readable code means fewer bugs).
>
> -- Steve

That's exactly why I reacted in the first place. I would say the
original code was cleaner and more readable IMHO.

And I twisted my brain hard trying to see what difference I was missing
in this ugly change ;-) So in the end I compiled both versions and then
asked what I was doing wrong :-)

Glad to hear I was not wrong. So it is up to you (or Juri) which
version is the 'right' one.

-- nX

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-11 16:14   ` Steven Rostedt
@ 2012-04-19 13:44     ` Juri Lelli
  0 siblings, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-19 13:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang

On 04/11/2012 06:14 PM, Steven Rostedt wrote:
> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>
>> +static int latest_cpu_find(struct cpumask *span,
>> +			   struct task_struct *task,
>> +			   struct cpumask *later_mask)
>>   {
>> +	const struct sched_dl_entity *dl_se =&task->dl;
>> +	int cpu, found = -1, best = 0;
>> +	u64 max_dl = 0;
>> +
>> +	for_each_cpu(cpu, span) {
>> +		struct rq *rq = cpu_rq(cpu);
>> +		struct dl_rq *dl_rq =&rq->dl;
>> +
>> +		if (cpumask_test_cpu(cpu,&task->cpus_allowed)&&
>> +		    (!dl_rq->dl_nr_running || dl_time_before(dl_se->deadline,
>> +		     dl_rq->earliest_dl.curr))) {
>> +			if (later_mask)
>> +				cpumask_set_cpu(cpu, later_mask);
>> +			if (!best&&  !dl_rq->dl_nr_running) {
>
> I hate to say this (and I also have yet to look at the patches after
> this) but we should really take into account the RT tasks. It would suck
> to preempt a normal RT task when a non RT task is running on another
> CPU.

Well, this changes in 15/16, but your point remains, and I see it :-).
We are currently reworking the push/pull mechanism (15/16 will most probably
change as well) to speed it up. I agree that it would be nice to take into
account RT tasks as well, so we'll surely think about it.
   
>> +				best = 1;
>> +				found = cpu;
>> +			} else if (!best&&
>> +				   dl_time_before(max_dl,
>> +						  dl_rq->earliest_dl.curr)) {
>> +				max_dl = dl_rq->earliest_dl.curr;
>> +				found = cpu;
>> +			}
>> +		} else if (later_mask)
>> +			cpumask_clear_cpu(cpu, later_mask);
>> +	}
>> +
>> +	return found;
>> +}
>
>

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 13/16] sched: drafted deadline inheritance logic.
  2012-04-12  2:42   ` Steven Rostedt
@ 2012-04-22 14:04     ` Juri Lelli
  2012-04-23  8:39     ` Peter Zijlstra
  1 sibling, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-22 14:04 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, tglx, mingo, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On 04/12/2012 04:42 AM, Steven Rostedt wrote:
> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>> From: Dario Faggioli<raistlin@linux.it>
>>
>> Some method to deal with rt-mutexes and make sched_dl interact with
>> the current PI-coded is needed, raising all but trivial issues, that
>> needs (according to us) to be solved with some restructuring of
>> the pi-code (i.e., going toward a proxy execution-ish implementation).
>
> I also would like to point out that this is probably only for -rt? Or is
> this for PI futexes as well?
>

Well, this directly affects rtmutex.c only. But, as soon as a PI-enabled
user-space mutex fails the fastpath, everything falls back to the rtmutex
machinery, right? So, I would say that this applies to PI futexes as well :-|.

> Note, the non-proxy version of this code may be an issue. Could a
> process force more bandwidth than was allowed by rlimit if it used PI
> futexes to force a behavior where PI waiters were waiting on another
> process while it held the PI futex? The process that held the futex may
> get more bandwidth if it never lets go of the futex, right?
>

Yes, I would stress that this is a needed but _temporary_ solution (just
so as not to delay the release further). I tried to keep it simple so as
not to preclude any quick shift in the way we want to resolve this
problem. Anyway, it would be great if some discussion on this point
could re-start :-).

>>
>> This is under development, in the meanwhile, as a temporary solution,
>> what this commits does is:
>>   - ensure a pi-lock owner with waiters is never throttled down. Instead,
>>     when it runs out of runtime, it immediately gets replenished and it's
>>     deadline is postponed;
>>   - the scheduling parameters (relative deadline and default runtime)
>>     used for that replenishments --during the whole period it holds the
>>     pi-lock-- are the ones of the waiting task with earliest deadline.
>
> This sounds similar to what I implemented for a company back in 2005.
>
>>
>> Acting this way, we provide some kind of boosting to the lock-owner,
>> still by using the existing (actually, slightly modified by the previous
>> commit) pi-architecture.
>>
>> We would stress the fact that this is only a surely needed, all but
>> clean solution to the problem. In the end it's only a way to re-start
>> discussion within the community. So, as always, comments, ideas, rants,
>> etc.. are welcome! :-)
>>
>> Signed-off-by: Dario Faggioli<raistlin@linux.it>
>> Signed-off-by: Juri Lelli<juri.lelli@gmail.com>
>> ---
>>   include/linux/sched.h |    9 +++++-
>>   kernel/fork.c         |    1 +
>>   kernel/rtmutex.c      |   13 +++++++-
>>   kernel/sched.c        |   34 ++++++++++++++++++---
>>   kernel/sched_dl.c     |   77 +++++++++++++++++++++++++++++++++---------------
>>   5 files changed, 102 insertions(+), 32 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 5ef7bb6..ca45db4 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1309,8 +1309,12 @@ struct sched_dl_entity {
>>   	 * @dl_new tells if a new instance arrived. If so we must
>>   	 * start executing it with full runtime and reset its absolute
>>   	 * deadline;
>> +	 *
>> +	 * @dl_boosted tells if we are boosted due to DI. If so we are
>
>   DI?
>

DI = Deadline Inheritance. Even if the term is not strictly correct,
when you're boosted you inherit the highest-priority waiter's parameters,
among them its deadline. In fact, you inherit its budget as well.
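
(Roughly, as a standalone sketch of the temporary behaviour described
above -- the struct and names are illustrative, not the patch's:)

#include <stdbool.h>
#include <stdint.h>

/* Illustrative -deadline entity. */
struct dl_entity {
	int64_t runtime;	/* remaining budget */
	uint64_t deadline;	/* absolute deadline */
	uint64_t dl_runtime;	/* reserved runtime */
	uint64_t dl_deadline;	/* relative deadline */
	bool dl_boosted;	/* inheriting from a waiter? */
};

/*
 * When a boosted pi-lock owner runs out of runtime it is not throttled:
 * it is refilled with the earliest-deadline waiter's runtime and its
 * deadline is pushed forward by that waiter's relative deadline.
 */
static void on_runtime_exhausted(struct dl_entity *se, uint64_t now,
				 const struct dl_entity *top_waiter)
{
	if (se->dl_boosted && top_waiter) {
		se->runtime  = (int64_t)top_waiter->dl_runtime;
		se->deadline = now + top_waiter->dl_deadline;
		return;	/* keep running inside the critical section */
	}
	/* normal case: throttle until the replenishment timer fires */
}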

>> +	 * outside bandwidth enforcement mechanism (but only until we
>> +	 * exit the critical section).
>>   	 */
>> -	int dl_throttled, dl_new;
>> +	int dl_throttled, dl_new, dl_boosted;
>>
>>   	/*
>>   	 * Bandwidth enforcement timer. Each -deadline task has its
>> @@ -1556,6 +1560,8 @@ struct task_struct {
>>   	struct rb_node *pi_waiters_leftmost;
>>   	/* Deadlock detection and priority inheritance handling */
>>   	struct rt_mutex_waiter *pi_blocked_on;
>> +	/* Top pi_waiters task */
>> +	struct task_struct *pi_top_task;
>>   #endif
>>

Sorry for the delayed reply, I was really busy for the last two
weeks :-\.

Thanks and Regards,

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 12/16] rtmutex: turn the plist into an rb-tree.
  2012-04-11 21:11   ` Steven Rostedt
@ 2012-04-22 14:28     ` Juri Lelli
  2012-04-23  8:33     ` Peter Zijlstra
  1 sibling, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-22 14:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang

On 04/11/2012 11:11 PM, Steven Rostedt wrote:
> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>> From: Peter Zijlstra<peterz@infradead.org>
>>
>> Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
>> and provide a proper comparison function for -deadline and
>> -priority tasks.
>
> I have to ask. Why not just add a rbtree with a plist? That is, add all
> deadline tasks to the rbtree and all others to the plist. As plist has a
> O(1) operation, and rbtree does not. We are making all RT tasks suffer
> the overhead of the rbtree.
>

I basically got this patch from the v3 patchset and, since it applied
perfectly and came from Peter, I assumed it was the right way to go ;-).

> As deadline tasks always win, the two may stay agnostic from each other.
> Check first the rbtree, if it is empty, then check the plist.
>
> This will become more predominant with the -rt tree as it converts most
> the locks in the kernel to pi mutexes.
>

I see your point, but I'm not yet convinced that in the end the plist +
rbtree implementation would win. AFAIK, the only O(1) plist operation is
removal, addition being O(K) [K RT priorities]. With rbtrees we have
O(log n) [n elements] operations, and we speed up the search with the
leftmost pointer.
So, are we sure the added complexity (and related checks) is needed here?
I'm not against your point, I'm only asking :-).

Thanks a lot,

- Juri
  
>>
>> This is done mainly because:
>>   - classical prio field of the plist is just an int, which might
>>     not be enough for representing a deadline;
>>   - manipulating such a list would become O(nr_deadline_tasks),
>>     which might be to much, as the number of -deadline task increases.
>>
>> Therefore, an rb-tree is used, and tasks are queued in it according
>> to the following logic:
>>   - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
>>     one with the higher (lower, actually!) prio wins;
>>   - among a -priority and a -deadline task, the latter always wins;
>>   - among two -deadline tasks, the one with the earliest deadline
>>     wins.
>>
>> Queueing and dequeueing functions are changed accordingly, for both
>> the list of a task's pi-waiters and the list of tasks blocked on
>> a pi-lock.
>>
>> Signed-off-by: Peter Zijlstra<peterz@infradead.org>
>> Signed-off-by: Dario Faggioli<raistlin@linux.it>
>> Signed-off-by: Juri Lelli<juri.lelli@gmail.com>
>
>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 12/16] rtmutex: turn the plist into an rb-tree.
  2012-04-11 21:11   ` Steven Rostedt
  2012-04-22 14:28     ` Juri Lelli
@ 2012-04-23  8:33     ` Peter Zijlstra
  2012-04-23 11:37       ` Steven Rostedt
  1 sibling, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-23  8:33 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Juri Lelli, tglx, mingo, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Wed, 2012-04-11 at 17:11 -0400, Steven Rostedt wrote:
> 
> I have to ask. Why not just add a rbtree with a plist? That is, add all
> deadline tasks to the rbtree and all others to the plist. As plist has a
> O(1) operation, and rbtree does not. We are making all RT tasks suffer
> the overhead of the rbtree.

You always love to add complexity to stuff before making it work, don't
you ;-)

I'm not quite convinced the plist stuff is _that_ much faster, anyway,
it's all a stop-gap measure until we can do proper BWI which would wipe
all this code anyway.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 13/16] sched: drafted deadline inheritance logic.
  2012-04-12  2:42   ` Steven Rostedt
  2012-04-22 14:04     ` Juri Lelli
@ 2012-04-23  8:39     ` Peter Zijlstra
  1 sibling, 0 replies; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-23  8:39 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Juri Lelli, tglx, mingo, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Wed, 2012-04-11 at 22:42 -0400, Steven Rostedt wrote:
> 
> I also would like to point out that this is probably only for -rt?

Nope, upstream carries kernel/rtmutex.c as well.

>  Or is this for PI futexes as well? 

Quite so.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 03/16] sched: SCHED_DEADLINE data structures.
  2012-04-06  7:14 ` [PATCH 03/16] sched: SCHED_DEADLINE data structures Juri Lelli
@ 2012-04-23  9:08   ` Peter Zijlstra
  2012-04-23  9:47     ` Juri Lelli
  2012-04-23  9:13   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-23  9:08 UTC (permalink / raw)
  To: Juri Lelli
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
> +struct sched_dl_entity {
> +       struct rb_node  rb_node;
> +       int nr_cpus_allowed; 

I think it would be all-round best to move
sched_rt_entity::nr_cpus_allowed out next to cpus_allowed.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 03/16] sched: SCHED_DEADLINE data structures.
  2012-04-06  7:14 ` [PATCH 03/16] sched: SCHED_DEADLINE data structures Juri Lelli
  2012-04-23  9:08   ` Peter Zijlstra
@ 2012-04-23  9:13   ` Peter Zijlstra
  2012-04-23  9:28     ` Juri Lelli
  2012-04-23  9:30   ` Peter Zijlstra
  2012-04-23  9:34   ` Peter Zijlstra
  3 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-23  9:13 UTC (permalink / raw)
  To: Juri Lelli
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
> +       if (unlikely(prio >= MAX_DL_PRIO && prio < MAX_RT_PRIO))

You could write that as:

 if ((unsigned)prio < MAX_RT_PRIO)

Although GCC might be smart enough to do that already, also I'm not sure
people in general are willing to put up with such 'fun' stuff :-)
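
(For the record, a tiny standalone check of that equivalence, assuming
MAX_DL_PRIO == 0 and MAX_RT_PRIO == 100 as in the patches -- -deadline
tasks use a negative prio, so the unsigned cast folds both bounds into
a single compare:)

#include <assert.h>

#define MAX_DL_PRIO	0
#define MAX_RT_PRIO	100

int main(void)
{
	for (int prio = -200; prio < 300; prio++)
		assert(((prio >= MAX_DL_PRIO) && (prio < MAX_RT_PRIO)) ==
		       ((unsigned)prio < MAX_RT_PRIO));
	return 0;
}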

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 03/16] sched: SCHED_DEADLINE data structures.
  2012-04-23  9:13   ` Peter Zijlstra
@ 2012-04-23  9:28     ` Juri Lelli
  0 siblings, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-23  9:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On 04/23/2012 11:13 AM, Peter Zijlstra wrote:
> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>> +       if (unlikely(prio>= MAX_DL_PRIO&&  prio<  MAX_RT_PRIO))
>
> You could write that as:
>
>   if ((unsigned)prio<  MAX_RT_PRIO)
>

Right, will do :-).

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 03/16] sched: SCHED_DEADLINE data structures.
  2012-04-06  7:14 ` [PATCH 03/16] sched: SCHED_DEADLINE data structures Juri Lelli
  2012-04-23  9:08   ` Peter Zijlstra
  2012-04-23  9:13   ` Peter Zijlstra
@ 2012-04-23  9:30   ` Peter Zijlstra
  2012-04-23  9:36     ` Juri Lelli
  2012-04-23  9:34   ` Peter Zijlstra
  3 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-23  9:30 UTC (permalink / raw)
  To: Juri Lelli
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>  /*
> + * This function validates the new parameters of a -deadline task.
> + * We ask for the deadline not being zero, and greater or equal
> + * than the runtime.
> + */
> +static bool
> +__checkparam_dl(const struct sched_param2 *prm)
> +{
> +       return prm && (&prm->sched_deadline) != 0 &&
> +              (s64)(&prm->sched_deadline - &prm->sched_runtime) >= 0;
> +} 

Shouldn't this also include deadline == period for now?

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 03/16] sched: SCHED_DEADLINE data structures.
  2012-04-06  7:14 ` [PATCH 03/16] sched: SCHED_DEADLINE data structures Juri Lelli
                     ` (2 preceding siblings ...)
  2012-04-23  9:30   ` Peter Zijlstra
@ 2012-04-23  9:34   ` Peter Zijlstra
  2012-04-23 10:16     ` Juri Lelli
  3 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-23  9:34 UTC (permalink / raw)
  To: Juri Lelli
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
> +               p->sched_class = &dl_sched_class;

Does this patch actually compile? I've only seen a fwd declaration of
this variable but here you take an actual reference to it.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 03/16] sched: SCHED_DEADLINE data structures.
  2012-04-23  9:30   ` Peter Zijlstra
@ 2012-04-23  9:36     ` Juri Lelli
  2012-04-23  9:39       ` Peter Zijlstra
  0 siblings, 1 reply; 129+ messages in thread
From: Juri Lelli @ 2012-04-23  9:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On 04/23/2012 11:30 AM, Peter Zijlstra wrote:
> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>>   /*
>> + * This function validates the new parameters of a -deadline task.
>> + * We ask for the deadline not being zero, and greater or equal
>> + * than the runtime.
>> + */
>> +static bool
>> +__checkparam_dl(const struct sched_param2 *prm)
>> +{
>> +       return prm && (&prm->sched_deadline) != 0 &&
>> +              (s64)(&prm->sched_deadline - &prm->sched_runtime) >= 0;
>> +}
>
> Shouldn't this also include deadline == period for now?

No notion of period in 03/16. It is inserted in 08/16 and checkparam
changed accordingly.

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 03/16] sched: SCHED_DEADLINE data structures.
  2012-04-23  9:36     ` Juri Lelli
@ 2012-04-23  9:39       ` Peter Zijlstra
  0 siblings, 0 replies; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-23  9:39 UTC (permalink / raw)
  To: Juri Lelli
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Mon, 2012-04-23 at 11:36 +0200, Juri Lelli wrote:
> On 04/23/2012 11:30 AM, Peter Zijlstra wrote:
> > On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
> >>   /*
> >> + * This function validates the new parameters of a -deadline task.
> >> + * We ask for the deadline not being zero, and greater or equal
> >> + * than the runtime.
> >> + */
> >> +static bool
> >> +__checkparam_dl(const struct sched_param2 *prm)
> >> +{
> >> +       return prm && (&prm->sched_deadline) != 0 &&
> >> +              (s64)(&prm->sched_deadline - &prm->sched_runtime) >= 0;
> >> +}
> >
> > Shouldn't this also include deadline == period for now?
> 
> No notion of period in 03/16. It is inserted in 08/16 and checkparam
> changed accordingly.

Ah, ok. Got confused by 2/ having sched_period.
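
For archive readers, here is a rough standalone sketch of the kind of check
being discussed once the period enters the picture (this is an assumption
about what 08/16 does, not the posted code; the struct below is an
illustrative stand-in for sched_param2):

 #include <stdbool.h>
 #include <stdint.h>
 #include <stdio.h>

 struct dl_params {                     /* stand-in for sched_param2 */
         uint64_t sched_runtime;
         uint64_t sched_deadline;
         uint64_t sched_period;
 };

 /* require period >= deadline >= runtime and a non-zero deadline */
 static bool checkparam_dl(const struct dl_params *prm)
 {
         return prm && prm->sched_deadline != 0 &&
                (int64_t)(prm->sched_deadline - prm->sched_runtime) >= 0 &&
                (int64_t)(prm->sched_period - prm->sched_deadline) >= 0;
 }

 int main(void)
 {
         struct dl_params ok  = { 10000000, 30000000, 100000000 };
         struct dl_params bad = { 10000000, 200000000, 100000000 };

         printf("%d %d\n", checkparam_dl(&ok), checkparam_dl(&bad)); /* 1 0 */
         return 0;
 }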

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 03/16] sched: SCHED_DEADLINE data structures.
  2012-04-23  9:08   ` Peter Zijlstra
@ 2012-04-23  9:47     ` Juri Lelli
  2012-04-23  9:49       ` Peter Zijlstra
  0 siblings, 1 reply; 129+ messages in thread
From: Juri Lelli @ 2012-04-23  9:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On 04/23/2012 11:08 AM, Peter Zijlstra wrote:
> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>> +struct sched_dl_entity {
>> +       struct rb_node  rb_node;
>> +       int nr_cpus_allowed;
>
> I think it would be all-round best to move
> sched_rt_entity::nr_cpus_allowed out next to cpus_allowed.

You mean unify them: a single nr_cpus_allowed after
task_struct::cpus_allowed, right?

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 03/16] sched: SCHED_DEADLINE data structures.
  2012-04-23  9:47     ` Juri Lelli
@ 2012-04-23  9:49       ` Peter Zijlstra
  2012-04-23  9:55         ` Juri Lelli
  0 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-23  9:49 UTC (permalink / raw)
  To: Juri Lelli
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Mon, 2012-04-23 at 11:47 +0200, Juri Lelli wrote:
> On 04/23/2012 11:08 AM, Peter Zijlstra wrote:
> > On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
> >> +struct sched_dl_entity {
> >> +       struct rb_node  rb_node;
> >> +       int nr_cpus_allowed;
> >
> > I think it would be all-round best to move
> > sched_rt_entity::nr_cpus_allowed out next to cpus_allowed.
> 
> You mean unify them: a single nr_cpus_allowed after
> task_struct::cpus_allowed, right?

Yes, no point in keeping that one value twice, and in fact there's a
usage of p->rt.nr_cpus_allowed in sched/fair.c, so it's past time it's
moved out of that rt-specific thing.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 03/16] sched: SCHED_DEADLINE data structures.
  2012-04-23  9:49       ` Peter Zijlstra
@ 2012-04-23  9:55         ` Juri Lelli
  2012-04-23 10:12           ` Peter Zijlstra
  0 siblings, 1 reply; 129+ messages in thread
From: Juri Lelli @ 2012-04-23  9:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On 04/23/2012 11:49 AM, Peter Zijlstra wrote:
> On Mon, 2012-04-23 at 11:47 +0200, Juri Lelli wrote:
>> On 04/23/2012 11:08 AM, Peter Zijlstra wrote:
>>> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>>>> +struct sched_dl_entity {
>>>> +       struct rb_node  rb_node;
>>>> +       int nr_cpus_allowed;
>>>
>>> I think it would be all-round best to move
>>> sched_rt_entity::nr_cpus_allowed out next to cpus_allowed.
>>
>> You mean unify them: a single nr_cpus_allowed after
>> task_struct::cpus_allowed, right?
>
> Yes, no point in keeping that one value twice, and in fact there's a
> usage of p->rt.nr_cpus_allowed in sched/fair.c, so it's past time it's
> moved out of that rt-specific thing.

Sure. Since this is a small change, probably not strictly related to
this patchset, may I wait to see the change in mainline?

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 03/16] sched: SCHED_DEADLINE data structures.
  2012-04-23  9:55         ` Juri Lelli
@ 2012-04-23 10:12           ` Peter Zijlstra
  0 siblings, 0 replies; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-23 10:12 UTC (permalink / raw)
  To: Juri Lelli
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Mon, 2012-04-23 at 11:55 +0200, Juri Lelli wrote:
> On 04/23/2012 11:49 AM, Peter Zijlstra wrote:
> > On Mon, 2012-04-23 at 11:47 +0200, Juri Lelli wrote:
> >> On 04/23/2012 11:08 AM, Peter Zijlstra wrote:
> >>> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
> >>>> +struct sched_dl_entity {
> >>>> +       struct rb_node  rb_node;
> >>>> +       int nr_cpus_allowed;
> >>>
> >>> I think it would be all-round best to move
> >>> sched_rt_entity::nr_cpus_allowed out next to cpus_allowed.
> >>
> >> You mean unify them: a single nr_cpus_allowed after
> >> task_struct::cpus_allowed, right?
> >
> > Yes, no point in keeping that one value twice, and in fact there's a
> > usage of p->rt.nr_cpus_allowed in sched/fair.c, so it's past time it's
> > moved out of that rt-specific thing.
> 
> Sure. Since this is a small change, probably not strictly related to
> this patchset, may I wait to see the change in mainline?

Ok..


---
Subject: sched: Move nr_cpus_allowed out of sched_rt_entity
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Mon Apr 23 12:11:21 CEST 2012

Since nr_cpus_allowed is used outside of sched/rt.c and wants to be
used outside of there more, move it to a more natural site.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-gba0fya9qv86mo5m1cjf7f0v@git.kernel.org
---
 arch/blackfin/kernel/process.c |    2 +-
 include/linux/init_task.h      |    2 +-
 include/linux/sched.h          |    2 +-
 kernel/sched/core.c            |    2 +-
 kernel/sched/fair.c            |    2 +-
 kernel/sched/rt.c              |   36 +++++++++++++++++++++---------------
 6 files changed, 26 insertions(+), 20 deletions(-)

--- a/arch/blackfin/kernel/process.c
+++ b/arch/blackfin/kernel/process.c
@@ -171,7 +171,7 @@ asmlinkage int bfin_clone(struct pt_regs
 	unsigned long newsp;
 
 #ifdef __ARCH_SYNC_CORE_DCACHE
-	if (current->rt.nr_cpus_allowed == num_possible_cpus())
+	if (current->nr_cpus_allowed == num_possible_cpus())
 		set_cpus_allowed_ptr(current, cpumask_of(smp_processor_id()));
 #endif
 
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -149,6 +149,7 @@ extern struct cred init_cred;
 	.normal_prio	= MAX_PRIO-20,					\
 	.policy		= SCHED_NORMAL,					\
 	.cpus_allowed	= CPU_MASK_ALL,					\
+	.nr_cpus_allowed= NR_CPUS,					\
 	.mm		= NULL,						\
 	.active_mm	= &init_mm,					\
 	.se		= {						\
@@ -157,7 +158,6 @@ extern struct cred init_cred;
 	.rt		= {						\
 		.run_list	= LIST_HEAD_INIT(tsk.rt.run_list),	\
 		.time_slice	= RR_TIMESLICE,				\
-		.nr_cpus_allowed = NR_CPUS,				\
 	},								\
 	.tasks		= LIST_HEAD_INIT(tsk.tasks),			\
 	INIT_PUSHABLE_TASKS(tsk)					\
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1234,7 +1234,6 @@ struct sched_rt_entity {
 	struct list_head run_list;
 	unsigned long timeout;
 	unsigned int time_slice;
-	int nr_cpus_allowed;
 
 	struct sched_rt_entity *back;
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -1299,6 +1298,7 @@ struct task_struct {
 #endif
 
 	unsigned int policy;
+	int nr_cpus_allowed;
 	cpumask_t cpus_allowed;
 
 #ifdef CONFIG_PREEMPT_RCU
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4957,7 +4957,7 @@ void do_set_cpus_allowed(struct task_str
 		p->sched_class->set_cpus_allowed(p, new_mask);
 
 	cpumask_copy(&p->cpus_allowed, new_mask);
-	p->rt.nr_cpus_allowed = cpumask_weight(new_mask);
+	p->nr_cpus_allowed = cpumask_weight(new_mask);
 }
 
 /*
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2703,7 +2703,7 @@ select_task_rq_fair(struct task_struct *
 	int want_sd = 1;
 	int sync = wake_flags & WF_SYNC;
 
-	if (p->rt.nr_cpus_allowed == 1)
+	if (p->nr_cpus_allowed == 1)
 		return prev_cpu;
 
 	if (sd_flag & SD_BALANCE_WAKE) {
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -274,13 +274,16 @@ static void update_rt_migration(struct r
 
 static void inc_rt_migration(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 {
+	struct task_struct *p;
+
 	if (!rt_entity_is_task(rt_se))
 		return;
 
+	p = rt_task_of(rt_se);
 	rt_rq = &rq_of_rt_rq(rt_rq)->rt;
 
 	rt_rq->rt_nr_total++;
-	if (rt_se->nr_cpus_allowed > 1)
+	if (p->nr_cpus_allowed > 1)
 		rt_rq->rt_nr_migratory++;
 
 	update_rt_migration(rt_rq);
@@ -288,13 +291,16 @@ static void inc_rt_migration(struct sche
 
 static void dec_rt_migration(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 {
+	struct task_struct *p;
+
 	if (!rt_entity_is_task(rt_se))
 		return;
 
+	p = rt_task_of(rt_se);
 	rt_rq = &rq_of_rt_rq(rt_rq)->rt;
 
 	rt_rq->rt_nr_total--;
-	if (rt_se->nr_cpus_allowed > 1)
+	if (p->nr_cpus_allowed > 1)
 		rt_rq->rt_nr_migratory--;
 
 	update_rt_migration(rt_rq);
@@ -1161,7 +1167,7 @@ enqueue_task_rt(struct rq *rq, struct ta
 
 	enqueue_rt_entity(rt_se, flags & ENQUEUE_HEAD);
 
-	if (!task_current(rq, p) && p->rt.nr_cpus_allowed > 1)
+	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_task(rq, p);
 
 	inc_nr_running(rq);
@@ -1225,7 +1231,7 @@ select_task_rq_rt(struct task_struct *p,
 
 	cpu = task_cpu(p);
 
-	if (p->rt.nr_cpus_allowed == 1)
+	if (p->nr_cpus_allowed == 1)
 		goto out;
 
 	/* For anything but wake ups, just return the task_cpu */
@@ -1260,9 +1266,9 @@ select_task_rq_rt(struct task_struct *p,
 	 * will have to sort it out.
 	 */
 	if (curr && unlikely(rt_task(curr)) &&
-	    (curr->rt.nr_cpus_allowed < 2 ||
+	    (curr->nr_cpus_allowed < 2 ||
 	     curr->prio <= p->prio) &&
-	    (p->rt.nr_cpus_allowed > 1)) {
+	    (p->nr_cpus_allowed > 1)) {
 		int target = find_lowest_rq(p);
 
 		if (target != -1)
@@ -1276,10 +1282,10 @@ select_task_rq_rt(struct task_struct *p,
 
 static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
 {
-	if (rq->curr->rt.nr_cpus_allowed == 1)
+	if (rq->curr->nr_cpus_allowed == 1)
 		return;
 
-	if (p->rt.nr_cpus_allowed != 1
+	if (p->nr_cpus_allowed != 1
 	    && cpupri_find(&rq->rd->cpupri, p, NULL))
 		return;
 
@@ -1395,7 +1401,7 @@ static void put_prev_task_rt(struct rq *
 	 * The previous task needs to be made eligible for pushing
 	 * if it is still active
 	 */
-	if (on_rt_rq(&p->rt) && p->rt.nr_cpus_allowed > 1)
+	if (on_rt_rq(&p->rt) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_task(rq, p);
 }
 
@@ -1408,7 +1414,7 @@ static int pick_rt_task(struct rq *rq, s
 {
 	if (!task_running(rq, p) &&
 	    (cpu < 0 || cpumask_test_cpu(cpu, tsk_cpus_allowed(p))) &&
-	    (p->rt.nr_cpus_allowed > 1))
+	    (p->nr_cpus_allowed > 1))
 		return 1;
 	return 0;
 }
@@ -1464,7 +1470,7 @@ static int find_lowest_rq(struct task_st
 	if (unlikely(!lowest_mask))
 		return -1;
 
-	if (task->rt.nr_cpus_allowed == 1)
+	if (task->nr_cpus_allowed == 1)
 		return -1; /* No other targets possible */
 
 	if (!cpupri_find(&task_rq(task)->rd->cpupri, task, lowest_mask))
@@ -1586,7 +1592,7 @@ static struct task_struct *pick_next_pus
 
 	BUG_ON(rq->cpu != task_cpu(p));
 	BUG_ON(task_current(rq, p));
-	BUG_ON(p->rt.nr_cpus_allowed <= 1);
+	BUG_ON(p->nr_cpus_allowed <= 1);
 
 	BUG_ON(!p->on_rq);
 	BUG_ON(!rt_task(p));
@@ -1793,9 +1799,9 @@ static void task_woken_rt(struct rq *rq,
 	if (!task_running(rq, p) &&
 	    !test_tsk_need_resched(rq->curr) &&
 	    has_pushable_tasks(rq) &&
-	    p->rt.nr_cpus_allowed > 1 &&
+	    p->nr_cpus_allowed > 1 &&
 	    rt_task(rq->curr) &&
-	    (rq->curr->rt.nr_cpus_allowed < 2 ||
+	    (rq->curr->nr_cpus_allowed < 2 ||
 	     rq->curr->prio <= p->prio))
 		push_rt_tasks(rq);
 }
@@ -1817,7 +1823,7 @@ static void set_cpus_allowed_rt(struct t
 	 * Only update if the process changes its state from whether it
 	 * can migrate or not.
 	 */
-	if ((p->rt.nr_cpus_allowed > 1) == (weight > 1))
+	if ((p->nr_cpus_allowed > 1) == (weight > 1))
 		return;
 
 	rq = task_rq(p);


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-06  7:14 ` [PATCH 05/16] sched: SCHED_DEADLINE policy implementation Juri Lelli
  2012-04-11  3:06   ` Steven Rostedt
  2012-04-11 13:41   ` Steven Rostedt
@ 2012-04-23 10:15   ` Peter Zijlstra
  2012-04-23 10:18     ` Juri Lelli
  2012-04-23 10:31   ` Peter Zijlstra
                     ` (7 subsequent siblings)
  10 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-23 10:15 UTC (permalink / raw)
  To: Juri Lelli
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
> + * Copyright (C) 2010 Dario Faggioli <raistlin@linux.it>,
> + *                    Michael Trimarchi <michael@amarulasolutions.com>,
> + *                    Fabio Checconi <fabio@gandalf.sssup.it>

Its 2012 at the time of writing, you might want to update this.. ;-)

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 03/16] sched: SCHED_DEADLINE data structures.
  2012-04-23  9:34   ` Peter Zijlstra
@ 2012-04-23 10:16     ` Juri Lelli
  2012-04-23 10:28       ` Peter Zijlstra
  0 siblings, 1 reply; 129+ messages in thread
From: Juri Lelli @ 2012-04-23 10:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On 04/23/2012 11:34 AM, Peter Zijlstra wrote:
> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>> +               p->sched_class = &dl_sched_class;
>
> Does this patch actually compile? I've only seen a fwd declaration of
> this variable but here you take an actual reference to it.

The patch compiles (just tested), although this sounds strange to
me too..

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-23 10:15   ` Peter Zijlstra
@ 2012-04-23 10:18     ` Juri Lelli
  0 siblings, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-23 10:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On 04/23/2012 12:15 PM, Peter Zijlstra wrote:
> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>> + * Copyright (C) 2010 Dario Faggioli<raistlin@linux.it>,
>> + *                    Michael Trimarchi<michael@amarulasolutions.com>,
>> + *                    Fabio Checconi<fabio@gandalf.sssup.it>
>
> Its 2012 at the time of writing, you might want to update this.. ;-)

Yep, time passes.. :-P

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 03/16] sched: SCHED_DEADLINE data structures.
  2012-04-23 10:16     ` Juri Lelli
@ 2012-04-23 10:28       ` Peter Zijlstra
  2012-04-23 10:33         ` Juri Lelli
  0 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-23 10:28 UTC (permalink / raw)
  To: Juri Lelli
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Mon, 2012-04-23 at 12:16 +0200, Juri Lelli wrote:
> On 04/23/2012 11:34 AM, Peter Zijlstra wrote:
> > On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
> >> +               p->sched_class = &dl_sched_class;
> >
> > Does this patch actually compile? I've only seen a fwd declaration of
> > this variable but here you take an actual reference to it.
> 
> The patch compiles (just tested), although this sounds strange to
> me too..

I just realized that it's a proper definition without an initializer. So
no harm done, just a brain that still needs to wake up or so.. :-)
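
For what it's worth, a minimal standalone illustration of the C rule at play
(the names below are made up, not the patch's):

 struct sched_class_stub { int dummy; };

 /* forward declaration, like the one seen earlier in the series */
 extern const struct sched_class_stub dl_sched_class_stub;

 /* taking the address only needs a declaration; the linker resolves it */
 static const struct sched_class_stub *pick_class(void)
 {
         return &dl_sched_class_stub;
 }

 /* a definition without an initializer: tentative, ends up zero-filled */
 const struct sched_class_stub dl_sched_class_stub;

 int main(void)
 {
         return pick_class() == &dl_sched_class_stub ? 0 : 1;
 }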

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-06  7:14 ` [PATCH 05/16] sched: SCHED_DEADLINE policy implementation Juri Lelli
                     ` (2 preceding siblings ...)
  2012-04-23 10:15   ` Peter Zijlstra
@ 2012-04-23 10:31   ` Peter Zijlstra
  2012-04-23 10:37     ` Juri Lelli
  2012-04-23 11:32   ` Peter Zijlstra
                     ` (6 subsequent siblings)
  10 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-23 10:31 UTC (permalink / raw)
  To: Juri Lelli
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
> +       dl_se->deadline = rq->clock + dl_se->dl_deadline;

You might want to use rq->clock_task, this clock excludes times spend in
hardirq context and steal-time (when paravirt).

Then again, it might not want to use that.. but its something you might
want to consider and make explicit by means of a comment.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 03/16] sched: SCHED_DEADLINE data structures.
  2012-04-23 10:28       ` Peter Zijlstra
@ 2012-04-23 10:33         ` Juri Lelli
  0 siblings, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-23 10:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On 04/23/2012 12:28 PM, Peter Zijlstra wrote:
> On Mon, 2012-04-23 at 12:16 +0200, Juri Lelli wrote:
>> On 04/23/2012 11:34 AM, Peter Zijlstra wrote:
>>> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>>>> +               p->sched_class =&dl_sched_class;
>>>
>>> Does this patch actually compile? I've only seen a fwd declaration of
>>> this variable but here you take an actual reference to it.
>>
>> The patch compiles (just tested), although this sounds strange to
>> me too..
>
> I just realized that it's a proper definition without an initializer. So
> no harm done, just a brain that still needs to wake up or so.. :-)

Ouch.. right! A struct sched_class after all! Regarding brains, mine should
be awake enough by now :-|.

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-23 10:31   ` Peter Zijlstra
@ 2012-04-23 10:37     ` Juri Lelli
  2012-04-23 21:25       ` Tommaso Cucinotta
  0 siblings, 1 reply; 129+ messages in thread
From: Juri Lelli @ 2012-04-23 10:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On 04/23/2012 12:31 PM, Peter Zijlstra wrote:
> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>> +       dl_se->deadline = rq->clock + dl_se->dl_deadline;
>
> You might want to use rq->clock_task, this clock excludes times spend in
> hardirq context and steal-time (when paravirt).
>
> Then again, it might not want to use that.. but its something you might
> want to consider and make explicit by means of a comment.

Yes, I planned a consistency check for the use of clock/clock_task
throughout the code, but it seems I then forgot it.
Planned for the next iteration :-).

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-06  7:14 ` [PATCH 05/16] sched: SCHED_DEADLINE policy implementation Juri Lelli
                     ` (3 preceding siblings ...)
  2012-04-23 10:31   ` Peter Zijlstra
@ 2012-04-23 11:32   ` Peter Zijlstra
  2012-04-23 12:13     ` Juri Lelli
  2012-04-23 11:34   ` Peter Zijlstra
                     ` (5 subsequent siblings)
  10 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-23 11:32 UTC (permalink / raw)
  To: Juri Lelli
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
> +       /*
> +        * We keep moving the deadline away until we get some
> +        * available runtime for the entity. This ensures correct
> +        * handling of situations where the runtime overrun is
> +        * arbitrarily large.
> +        */
> +       while (dl_se->runtime <= 0) {
> +               dl_se->deadline += dl_se->dl_deadline;
> +               dl_se->runtime += dl_se->dl_runtime;
> +       } 

Does gcc 'optimize' that into a division? If so, it might need special
glue to make it not do that.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-06  7:14 ` [PATCH 05/16] sched: SCHED_DEADLINE policy implementation Juri Lelli
                     ` (4 preceding siblings ...)
  2012-04-23 11:32   ` Peter Zijlstra
@ 2012-04-23 11:34   ` Peter Zijlstra
  2012-04-23 11:57     ` Juri Lelli
  2012-04-23 11:55   ` Peter Zijlstra
                     ` (4 subsequent siblings)
  10 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-23 11:34 UTC (permalink / raw)
  To: Juri Lelli
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
> +       /*
> +        * At this point, the deadline really should be "in
> +        * the future" with respect to rq->clock. If it's
> +        * not, we are, for some reason, lagging too much!
> +        * Anyway, after having warned userspace about that,
> +        * we still try to keep things running by
> +        * resetting the deadline and the budget of the
> +        * entity.
> +        */
> +       if (dl_time_before(dl_se->deadline, rq->clock)) {
> +               WARN_ON_ONCE(1);

Doing printk() and friends from scheduler context isn't actually safe
and can lock up your machine.. there's a printk_sched() that
maybe-sorta-kinda can get your complaints out..

> +               dl_se->deadline = rq->clock + dl_se->dl_deadline;
> +               dl_se->runtime = dl_se->dl_runtime;
> +       } 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 12/16] rtmutex: turn the plist into an rb-tree.
  2012-04-23  8:33     ` Peter Zijlstra
@ 2012-04-23 11:37       ` Steven Rostedt
  0 siblings, 0 replies; 129+ messages in thread
From: Steven Rostedt @ 2012-04-23 11:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, tglx, mingo, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Mon, 2012-04-23 at 10:33 +0200, Peter Zijlstra wrote:
> On Wed, 2012-04-11 at 17:11 -0400, Steven Rostedt wrote:
> > 
> > I have to ask. Why not just add an rbtree alongside a plist? That is, add all
> > deadline tasks to the rbtree and all others to the plist. The plist has O(1)
> > operations and the rbtree does not, so we are making all RT tasks suffer
> > the overhead of the rbtree.
> 
> You always love to add complexity to stuff before making it work, don't
> you ;-)

Nah, I like the 'make it work first, then optimize' approach. But I wanted to
bring this up as a concern. After benchmarks, it may not be an issue anyway; I
just don't want to forget about doing the benchmarks.

-- Steve



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-06  7:14 ` [PATCH 05/16] sched: SCHED_DEADLINE policy implementation Juri Lelli
                     ` (5 preceding siblings ...)
  2012-04-23 11:34   ` Peter Zijlstra
@ 2012-04-23 11:55   ` Peter Zijlstra
  2012-04-23 14:43     ` Juri Lelli
  2012-04-23 21:55     ` Tommaso Cucinotta
  2012-04-23 14:11   ` Peter Zijlstra
                     ` (3 subsequent siblings)
  10 siblings, 2 replies; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-23 11:55 UTC (permalink / raw)
  To: Juri Lelli
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
> +/*
> + * Here we check if --at time t-- an entity (which is probably being
> + * [re]activated or, in general, enqueued) can use its remaining runtime
> + * and its current deadline _without_ exceeding the bandwidth it is
> + * assigned (function returns true if it can).
> + *
> + * For this to hold, we must check if:
> + *   runtime / (deadline - t) < dl_runtime / dl_deadline .

It might be good to put a few words in as to why that is.. I know I
always forget (but know where to find it by now); also it might be good to
refer to those papers Tommaso listed when Steven asked this a while back.
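
A short worked example with made-up numbers may help future readers (this is
essentially the CBS wake-up test from those papers):

  Reservation:  dl_runtime = 30ms every dl_deadline = 100ms  (bandwidth 0.3)
  At wake-up:   runtime = 20ms left, deadline - t = 40ms away

  Residual density:  20ms / 40ms = 0.5 > 0.3

  Keeping the old (runtime, deadline) pair would let the task consume 50%
  of a CPU until its deadline, more than the 30% it was admitted with, so
  a fresh deadline and budget are handed out instead. With only 10ms left
  (10/40 = 0.25 <= 0.3) the old pair could be kept safely.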

> + */
> +static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
> +{
> +       u64 left, right;
> +
> +       /*
> +        * left and right are the two sides of the equation above,
> +        * after a bit of shuffling to use multiplications instead
> +        * of divisions.
> +        *
> +        * Note that none of the time values involved in the two
> +        * multiplications are absolute: dl_deadline and dl_runtime
> +        * are the relative deadline and the maximum runtime of each
> +        * instance, runtime is the runtime left for the last instance
> +        * and (deadline - t), since t is rq->clock, is the time left
> +        * to the (absolute) deadline. Therefore, overflowing the u64
> +        * type is very unlikely to occur in both cases.
> +        */
> +       left = dl_se->dl_deadline * dl_se->runtime;
> +       right = (dl_se->deadline - t) * dl_se->dl_runtime;


>From what I can see there are no constraints on the values in
__setparam_dl() so the above left term can be constructed to be an
overflow.

Ideally we'd use u128 here, but I don't think people will let us :/

> +       return dl_time_before(right, left);
> +} 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-23 11:34   ` Peter Zijlstra
@ 2012-04-23 11:57     ` Juri Lelli
  0 siblings, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-23 11:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On 04/23/2012 01:34 PM, Peter Zijlstra wrote:
> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>> +       /*
>> +        * At this point, the deadline really should be "in
>> +        * the future" with respect to rq->clock. If it's
>> +        * not, we are, for some reason, lagging too much!
>> +        * Anyway, after having warned userspace about that,
>> +        * we still try to keep things running by
>> +        * resetting the deadline and the budget of the
>> +        * entity.
>> +        */
>> +       if (dl_time_before(dl_se->deadline, rq->clock)) {
>> +               WARN_ON_ONCE(1);
>
> Doing printk() and friends from scheduler context isn't actually safe
> and can lock up your machine.. there's a printk_sched() that
> maybe-sorta-kinda can get your complaints out..
>

Thanks! I'll look at it.
  
>> +               dl_se->deadline = rq->clock + dl_se->dl_deadline;
>> +               dl_se->runtime = dl_se->dl_runtime;
>> +       }

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-23 11:32   ` Peter Zijlstra
@ 2012-04-23 12:13     ` Juri Lelli
  2012-04-23 12:22       ` Peter Zijlstra
  0 siblings, 1 reply; 129+ messages in thread
From: Juri Lelli @ 2012-04-23 12:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On 04/23/2012 01:32 PM, Peter Zijlstra wrote:
> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>> +       /*
>> +        * We keep moving the deadline away until we get some
>> +        * available runtime for the entity. This ensures correct
>> +        * handling of situations where the runtime overrun is
>> +        * arbitrarily large.
>> +        */
>> +       while (dl_se->runtime <= 0) {
>> +               dl_se->deadline += dl_se->dl_deadline;
>> +               dl_se->runtime += dl_se->dl_runtime;
>> +       }
>
> Does gcc 'optimize' that into a division? If so, it might need special
> glue to make it not do that.

I got two adds and a jle, no div here..

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-23 12:13     ` Juri Lelli
@ 2012-04-23 12:22       ` Peter Zijlstra
  2012-04-23 13:37         ` Juri Lelli
  0 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-23 12:22 UTC (permalink / raw)
  To: Juri Lelli
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Mon, 2012-04-23 at 14:13 +0200, Juri Lelli wrote:
> On 04/23/2012 01:32 PM, Peter Zijlstra wrote:
> > On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
> >> +       /*
> >> +        * We keep moving the deadline away until we get some
> >> +        * available runtime for the entity. This ensures correct
> >> +        * handling of situations where the runtime overrun is
> >> +        * arbitrarily large.
> >> +        */
> >> +       while (dl_se->runtime <= 0) {
> >> +               dl_se->deadline += dl_se->dl_deadline;
> >> +               dl_se->runtime += dl_se->dl_runtime;
> >> +       }
> >
> > Does gcc 'optimize' that into a division? If so, it might need special
> > glue to make it not do that.
> 
> I got two adds and a jle, no div here..

Gcc is known to change such loops to something like:

 if (runtime <= 0) {
   tmp = 1 - runtime / dl_runtime;
   deadline += tmp * dl_deadline;
   runtime += tmp * dl_runtime;
 }
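
A standalone sanity check of that transformation (plain userspace C,
illustration only; the toy struct is not the kernel's):

 #include <assert.h>
 #include <stdint.h>

 struct dl_toy { int64_t runtime, deadline, dl_runtime, dl_deadline; };

 static void replenish_loop(struct dl_toy *d)
 {
         while (d->runtime <= 0) {
                 d->deadline += d->dl_deadline;
                 d->runtime += d->dl_runtime;
         }
 }

 static void replenish_closed_form(struct dl_toy *d)
 {
         if (d->runtime <= 0) {
                 int64_t tmp = 1 - d->runtime / d->dl_runtime;

                 d->deadline += tmp * d->dl_deadline;
                 d->runtime += tmp * d->dl_runtime;
         }
 }

 int main(void)
 {
         int64_t r;

         for (r = -1000; r <= 10; r++) {
                 struct dl_toy a = { r, 100, 30, 100 };
                 struct dl_toy b = a;

                 replenish_loop(&a);
                 replenish_closed_form(&b);
                 assert(a.runtime == b.runtime && a.deadline == b.deadline);
         }
         return 0;
 }

Both produce the same result; the concern is only that the division variant
needs a 64-bit libgcc division helper on 32-bit targets and can be slower
when the loop would have run only once or twice.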



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-23 12:22       ` Peter Zijlstra
@ 2012-04-23 13:37         ` Juri Lelli
  2012-04-23 14:01           ` Peter Zijlstra
  0 siblings, 1 reply; 129+ messages in thread
From: Juri Lelli @ 2012-04-23 13:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On 04/23/2012 02:22 PM, Peter Zijlstra wrote:
> On Mon, 2012-04-23 at 14:13 +0200, Juri Lelli wrote:
>> On 04/23/2012 01:32 PM, Peter Zijlstra wrote:
>>> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>>>> +       /*
>>>> +        * We keep moving the deadline away until we get some
>>>> +        * available runtime for the entity. This ensures correct
>>>> +        * handling of situations where the runtime overrun is
>>>> +        * arbitrarily large.
>>>> +        */
>>>> +       while (dl_se->runtime <= 0) {
>>>> +               dl_se->deadline += dl_se->dl_deadline;
>>>> +               dl_se->runtime += dl_se->dl_runtime;
>>>> +       }
>>>
>>> Does gcc 'optimize' that into a division? If so, it might need special
>>> glue to make it not do that.
>>
>> I got two adds and a jle, no div here..
>
> Gcc is known to change such loops to something like:
>
>   if (runtime <= 0) {
>     tmp = 1 - runtime / dl_runtime;
>     deadline += tmp * dl_deadline;
>     runtime += tmp * dl_runtime;
>   }
>
>

This is what I got for that snippet:

ffffffff81062826 <enqueue_task_dl>:
[...]
ffffffff81062885:       49 03 44 24 20          add    0x20(%r12),%rax
ffffffff8106288a:       49 8b 54 24 28          mov    0x28(%r12),%rdx
ffffffff8106288f:       49 01 54 24 38          add    %rdx,0x38(%r12)
ffffffff81062894:       49 89 44 24 30          mov    %rax,0x30(%r12)
ffffffff81062899:       49 8b 44 24 30          mov    0x30(%r12),%rax
ffffffff8106289e:       48 85 c0                test   %rax,%rax
ffffffff810628a1:       7e e2                   jle    ffffffff81062885 <enqueue_task_dl+0x5f>

So it seems we are fine in this case, right?
Is it better to enforce this GCC behaviour anyway, just to be
on the safe side?

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-23 13:37         ` Juri Lelli
@ 2012-04-23 14:01           ` Peter Zijlstra
  0 siblings, 0 replies; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-23 14:01 UTC (permalink / raw)
  To: Juri Lelli
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang,
	Andrew Morton, Linus Torvalds

On Mon, 2012-04-23 at 15:37 +0200, Juri Lelli wrote:
> 
> This is what I got for that snippet:
> 
> ffffffff81062826 <enqueue_task_dl>:
> [...]
> ffffffff81062885:       49 03 44 24 20          add    0x20(%r12),%rax
> ffffffff8106288a:       49 8b 54 24 28          mov    0x28(%r12),%rdx
> ffffffff8106288f:       49 01 54 24 38          add    %rdx,0x38(%r12)
> ffffffff81062894:       49 89 44 24 30          mov    %rax,0x30(%r12)
> ffffffff81062899:       49 8b 44 24 30          mov    0x30(%r12),%rax
> ffffffff8106289e:       48 85 c0                test   %rax,%rax
> ffffffff810628a1:       7e e2                   jle    ffffffff81062885 <enqueue_task_dl+0x5f>
> 
> So it seems we are fine in this case, right?

Yep.

> Is it better to enforce this GCC behaviour anyway, just to be
> on the safe side?

Dunno, the 'fix' is somewhat hideous (although we could make it suck
less); we've only ever bothered with it when it caused problems, so I guess
we'll just wait and see until it breaks.


---
Subject: kernel,sched,time: Clean up gcc work-arounds
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Mon Apr 23 15:55:48 CEST 2012

We've grown various copies of a particular gcc work-around, consolidate
them into one and add a larger comment.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/compiler.h |   12 ++++++++++++
 include/linux/math64.h   |    4 +---
 kernel/sched/core.c      |    8 ++------
 kernel/sched/fair.c      |    8 ++------
 kernel/time.c            |   11 ++++-------
 5 files changed, 21 insertions(+), 22 deletions(-)

--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -310,4 +310,16 @@ void ftrace_likely_update(struct ftrace_
  */
 #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
 
+/*
+ * Avoid gcc loop optimization by clobbering a variable, forcing a reload
+ * and invalidating the optimization.
+ *
+ * The optimization in question transforms various loops into divisions/modulo
+ * operations, this is a problem when either the resulting operation generates
+ * unimplemented libgcc functions (u64 divisions for example) or the loop is
+ * known not to contain a lot of iterations and the division is in fact more
+ * expensive.
+ */
+#define __gcc_dont_optimize_loop(var) asm("" : "+rm" (var))
+
 #endif /* __LINUX_COMPILER_H */
--- a/include/linux/math64.h
+++ b/include/linux/math64.h
@@ -105,9 +105,7 @@ __iter_div_u64_rem(u64 dividend, u32 div
 	u32 ret = 0;
 
 	while (dividend >= divisor) {
-		/* The following asm() prevents the compiler from
-		   optimising this loop into a modulo operation.  */
-		asm("" : "+rm"(dividend));
+		__gcc_dont_optimize_loop(dividend);
 
 		dividend -= divisor;
 		ret++;
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -628,12 +628,8 @@ void sched_avg_update(struct rq *rq)
 	s64 period = sched_avg_period();
 
 	while ((s64)(rq->clock - rq->age_stamp) > period) {
-		/*
-		 * Inline assembly required to prevent the compiler
-		 * optimising this loop into a divmod call.
-		 * See __iter_div_u64_rem() for another example of this.
-		 */
-		asm("" : "+rm" (rq->age_stamp));
+		__gcc_dont_optimize_loop(rq->age_stamp);
+
 		rq->age_stamp += period;
 		rq->rt_avg /= 2;
 	}
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -853,12 +853,8 @@ static void update_cfs_load(struct cfs_r
 		update_cfs_rq_load_contribution(cfs_rq, global_update);
 
 	while (cfs_rq->load_period > period) {
-		/*
-		 * Inline assembly required to prevent the compiler
-		 * optimising this loop into a divmod call.
-		 * See __iter_div_u64_rem() for another example of this.
-		 */
-		asm("" : "+rm" (cfs_rq->load_period));
+		__gcc_dont_optimize_loop(cfs_rq->load_period);
+
 		cfs_rq->load_period /= 2;
 		cfs_rq->load_avg /= 2;
 	}
--- a/kernel/time.c
+++ b/kernel/time.c
@@ -349,17 +349,14 @@ EXPORT_SYMBOL(mktime);
 void set_normalized_timespec(struct timespec *ts, time_t sec, s64 nsec)
 {
 	while (nsec >= NSEC_PER_SEC) {
-		/*
-		 * The following asm() prevents the compiler from
-		 * optimising this loop into a modulo operation. See
-		 * also __iter_div_u64_rem() in include/linux/time.h
-		 */
-		asm("" : "+rm"(nsec));
+		__gcc_dont_optimize_loop(nsec);
+
 		nsec -= NSEC_PER_SEC;
 		++sec;
 	}
 	while (nsec < 0) {
-		asm("" : "+rm"(nsec));
+		__gcc_dont_optimize_loop(nsec);
+
 		nsec += NSEC_PER_SEC;
 		--sec;
 	}


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-06  7:14 ` [PATCH 05/16] sched: SCHED_DEADLINE policy implementation Juri Lelli
                     ` (6 preceding siblings ...)
  2012-04-23 11:55   ` Peter Zijlstra
@ 2012-04-23 14:11   ` Peter Zijlstra
  2012-04-23 14:25   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  10 siblings, 0 replies; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-23 14:11 UTC (permalink / raw)
  To: Juri Lelli
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
> +static int start_dl_timer(struct sched_dl_entity *dl_se)
> +{
> +       struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> +       struct rq *rq = rq_of_dl_rq(dl_rq);
> +       ktime_t now, act;
> +       ktime_t soft, hard;
> +       unsigned long range;
> +       s64 delta;
> +
> +       /*
> +        * We want the timer to fire at the deadline, but considering
> +        * that it is actually coming from rq->clock and not from
> +        * hrtimer's time base reading.
> +        */
> +       act = ns_to_ktime(dl_se->deadline);
> +       now = hrtimer_cb_get_time(&dl_se->dl_timer);
> +       delta = ktime_to_ns(now) - rq->clock;
> +       act = ktime_add_ns(act, delta);


Right, this all is very sad.. but I guess we'll have to like live with
it. The only other option is adding another timer base that tries to
keep itself in sync with rq->clock but that all sounds very painful
indeed.

Keeping up with rq->clock_task would be even more painful since it slows
the clock down in random fashion making the timer fire early.
Compensating that is going to be both fun and expensive.

> +       /*
> +        * If the expiry time already passed, e.g., because the value
> +        * chosen as the deadline is too small, don't even try to
> +        * start the timer in the past!
> +        */
> +       if (ktime_us_delta(act, now) < 0)
> +               return 0;
> +
> +       hrtimer_set_expires(&dl_se->dl_timer, act);
> +
> +       soft = hrtimer_get_softexpires(&dl_se->dl_timer);
> +       hard = hrtimer_get_expires(&dl_se->dl_timer);
> +       range = ktime_to_ns(ktime_sub(hard, soft));
> +       __hrtimer_start_range_ns(&dl_se->dl_timer, soft,
> +                                range, HRTIMER_MODE_ABS, 0);
> +
> +       return hrtimer_active(&dl_se->dl_timer);
> +} 

/me reminds himself to make __hrtimer_start_range_ns() return -ETIME.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-06  7:14 ` [PATCH 05/16] sched: SCHED_DEADLINE policy implementation Juri Lelli
                     ` (7 preceding siblings ...)
  2012-04-23 14:11   ` Peter Zijlstra
@ 2012-04-23 14:25   ` Peter Zijlstra
  2012-04-23 15:34     ` Juri Lelli
  2012-04-23 14:35   ` Peter Zijlstra
  2012-04-23 15:15   ` Peter Zijlstra
  10 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-23 14:25 UTC (permalink / raw)
  To: Juri Lelli
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
> +/*
> + * This is the bandwidth enforcement timer callback. If here, we know
> + * a task is not on its dl_rq, since the fact that the timer was running
> + * means the task is throttled and needs a runtime replenishment.
> + *
> + * However, what we actually do depends on the fact the task is active,
> + * (it is on its rq) or has been removed from there by a call to
> + * dequeue_task_dl(). In the former case we must issue the runtime
> + * replenishment and add the task back to the dl_rq; in the latter, we just
> + * do nothing but clearing dl_throttled, so that runtime and deadline
> + * updating (and the queueing back to dl_rq) will be done by the
> + * next call to enqueue_task_dl().

OK, so that comment isn't entirely clear to me, how can that timer still
be active when the task isn't? You start the timer when you throttle it,
at that point it cannot in fact dequeue itself anymore.

The only possibility I see is the one mentioned with the dl_task() check
below, that someone else called sched_setscheduler() on it.

> + */
> +static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
> +{
> +       unsigned long flags;
> +       struct sched_dl_entity *dl_se = container_of(timer,
> +                                                    struct sched_dl_entity,
> +                                                    dl_timer);
> +       struct task_struct *p = dl_task_of(dl_se);
> +       struct rq *rq = task_rq_lock(p, &flags);
> +
> +       /*
> +        * We need to take care of possible races here. In fact, the
> +        * task might have changed its scheduling policy to something
> +        * different from SCHED_DEADLINE (through sched_setscheduler()).
> +        */
> +       if (!dl_task(p))
> +               goto unlock;
> +
> +       dl_se->dl_throttled = 0;
> +       if (p->on_rq) {
> +               enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
> +               if (task_has_dl_policy(rq->curr))
> +                       check_preempt_curr_dl(rq, p, 0);
> +               else
> +                       resched_task(rq->curr);
> +       }

So I can't see how that cannot be true.

> +unlock:
> +       task_rq_unlock(rq, p, &flags);
> +
> +       return HRTIMER_NORESTART;
> +} 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-06  7:14 ` [PATCH 05/16] sched: SCHED_DEADLINE policy implementation Juri Lelli
                     ` (8 preceding siblings ...)
  2012-04-23 14:25   ` Peter Zijlstra
@ 2012-04-23 14:35   ` Peter Zijlstra
  2012-04-23 15:39     ` Juri Lelli
  2012-04-23 15:15   ` Peter Zijlstra
  10 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-23 14:35 UTC (permalink / raw)
  To: Juri Lelli
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
> +static void init_dl_task_timer(struct sched_dl_entity *dl_se)
> +{
> +       struct hrtimer *timer = &dl_se->dl_timer;
> +
> +       if (hrtimer_active(timer)) {
> +               hrtimer_try_to_cancel(timer);
> +               return;
> +       }

Same question I guess, how can it be active here? Also, just letting it
run doesn't seem like the best way out.. 

> +
> +       hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +       timer->function = dl_task_timer;
> +       timer->irqsafe = 1;
> +} 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-23 11:55   ` Peter Zijlstra
@ 2012-04-23 14:43     ` Juri Lelli
  2012-04-23 15:11       ` Peter Zijlstra
  2012-04-23 21:55     ` Tommaso Cucinotta
  1 sibling, 1 reply; 129+ messages in thread
From: Juri Lelli @ 2012-04-23 14:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On 04/23/2012 01:55 PM, Peter Zijlstra wrote:
> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>> +/*
>> + * Here we check if --at time t-- an entity (which is probably being
>> + * [re]activated or, in general, enqueued) can use its remaining runtime
>> + * and its current deadline _without_ exceeding the bandwidth it is
>> + * assigned (function returns true if it can).
>> + *
>> + * For this to hold, we must check if:
>> + *   runtime / (deadline - t) < dl_runtime / dl_deadline .
>
> It might be good to put a few words in as to why that is.. I know I
> always forget (but know where to find it by now); also it might be good to
> refer to those papers Tommaso listed when Steven asked this a while back.
>

Ok, I'll fix the comment, extend it and add T.'s references in the
Documentation.

>> + */
>> +static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
>> +{
>> +       u64 left, right;
>> +
>> +       /*
>> +        * left and right are the two sides of the equation above,
>> +        * after a bit of shuffling to use multiplications instead
>> +        * of divisions.
>> +        *
>> +        * Note that none of the time values involved in the two
>> +        * multiplications are absolute: dl_deadline and dl_runtime
>> +        * are the relative deadline and the maximum runtime of each
>> +        * instance, runtime is the runtime left for the last instance
>> +        * and (deadline - t), since t is rq->clock, is the time left
>> +        * to the (absolute) deadline. Therefore, overflowing the u64
>> +        * type is very unlikely to occur in both cases.
>> +        */
>> +       left = dl_se->dl_deadline * dl_se->runtime;
>> +       right = (dl_se->deadline - t) * dl_se->dl_runtime;
>
>
>  From what I can see there are no constraints on the values in
> __setparam_dl() so the above left term can be constructed to be an
> overflow.
>

Yes, could happen :-\.

> Ideally we'd use u128 here, but I don't think people will let us :/
>

Do we need to do something about that? If we cannot go for bigger space
probably limit dl_deadline (or warn the user)..

>> +       return dl_time_before(right, left);
>> +}

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-23 14:43     ` Juri Lelli
@ 2012-04-23 15:11       ` Peter Zijlstra
  0 siblings, 0 replies; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-23 15:11 UTC (permalink / raw)
  To: Juri Lelli
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Mon, 2012-04-23 at 16:43 +0200, Juri Lelli wrote:
> 
> >  From what I can see there are no constraints on the values in
> > __setparam_dl() so the above left term can be constructed to be an
> > overflow.
> >
> 
> Yes, could happen :-\.
> 
> > Ideally we'd use u128 here, but I don't think people will let us :/
> >
> 
> Do we need to do something about that? If we cannot go for bigger space
> probably limit dl_deadline (or warn the user).. 

Depends on what happens, if only this task gets screwy, no real problem,
they supplied funny input, they get funny output. If OTOH it affects
other tasks we should do something.

Ideally we'd avoid the situation by some clever maths, second best would
be rejecting the parameters up front.
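
One 'clever maths' direction, sketched in standalone form (not from the posted
series; DL_SCALE below is an assumed constant): drop some low-order bits from
every factor before cross-multiplying, so the products stay well inside u64 at
the cost of a little precision near the boundary.

 #include <stdint.h>
 #include <stdio.h>

 #define DL_SCALE 10    /* assumed: compare at ~microsecond granularity */

 /*
  * True if keeping the current (runtime, deadline) pair would exceed the
  * reserved bandwidth, i.e. runtime / (deadline - t) > dl_runtime / dl_deadline,
  * checked via the (scaled-down) cross-multiplication.
  */
 static int dl_entity_overflow(uint64_t runtime, uint64_t deadline_left,
                               uint64_t dl_runtime, uint64_t dl_deadline)
 {
         uint64_t left  = (dl_deadline >> DL_SCALE) * (runtime >> DL_SCALE);
         uint64_t right = (deadline_left >> DL_SCALE) * (dl_runtime >> DL_SCALE);

         return right < left;
 }

 int main(void)
 {
         /* 20ms left, 40ms to deadline, 30ms/100ms reservation -> overflow */
         printf("%d\n", dl_entity_overflow(20000000ULL, 40000000ULL,
                                           30000000ULL, 100000000ULL));
         /* 10ms left, 40ms to deadline -> fits the reservation */
         printf("%d\n", dl_entity_overflow(10000000ULL, 40000000ULL,
                                           30000000ULL, 100000000ULL));
         return 0;
 }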

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-06  7:14 ` [PATCH 05/16] sched: SCHED_DEADLINE policy implementation Juri Lelli
                     ` (9 preceding siblings ...)
  2012-04-23 14:35   ` Peter Zijlstra
@ 2012-04-23 15:15   ` Peter Zijlstra
  2012-04-23 15:37     ` Juri Lelli
  10 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-23 15:15 UTC (permalink / raw)
  To: Juri Lelli
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
> +static
> +int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
> +{
> +       int dmiss = dl_time_before(dl_se->deadline, rq->clock);
> +       int rorun = dl_se->runtime <= 0;
>+
> +       if (!rorun && !dmiss)
> +               return 0;
> +
> +       /*
> +        * If we are beyond our current deadline and we are still
> +        * executing, then we have already used some of the runtime of
> +        * the next instance. Thus, if we do not account that, we are
> +        * stealing bandwidth from the system at each deadline miss!
> +        */
> +       if (dmiss) {
> +               dl_se->runtime = rorun ? dl_se->runtime : 0;
> +               dl_se->runtime -= rq->clock - dl_se->deadline;
> +       }

So ideally this can't happen, but since we already leak time from the
system through means of hardirq / kstop / context-switch-overhead /
clock-jitter etc.. we avoid the error accumulating?
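
A small worked example of that accounting, with made-up numbers:

  rorun only:  runtime = -1ms, deadline still 5ms away
               -> the dmiss branch is skipped and the 1ms overrun simply
                  carries over into the next replenishment.

  dmiss:       rq->clock is 2ms past the deadline, runtime = +3ms left
               -> runtime is zeroed (the leftover 3ms belonged to the missed
                  instance) and then charged the 2ms of lateness, so the next
                  instance starts 2ms in debt rather than the task quietly
                  gaining bandwidth.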

> +
> +       return 1;
> +} 



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-23 14:25   ` Peter Zijlstra
@ 2012-04-23 15:34     ` Juri Lelli
  0 siblings, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-23 15:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On 04/23/2012 04:25 PM, Peter Zijlstra wrote:
> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>> +/*
>> + * This is the bandwidth enforcement timer callback. If here, we know
>> + * a task is not on its dl_rq, since the fact that the timer was running
>> + * means the task is throttled and needs a runtime replenishment.
>> + *
>> + * However, what we actually do depends on the fact the task is active,
>> + * (it is on its rq) or has been removed from there by a call to
>> + * dequeue_task_dl(). In the former case we must issue the runtime
>> + * replenishment and add the task back to the dl_rq; in the latter, we just
>> + * do nothing but clearing dl_throttled, so that runtime and deadline
>> + * updating (and the queueing back to dl_rq) will be done by the
>> + * next call to enqueue_task_dl().
>
> OK, so that comment isn't entirely clear to me, how can that timer still
> be active when the task isn't? You start the timer when you throttle it,
> at that point it cannot in fact dequeue itself anymore.
>
> The only possibility I see is the one mentioned with the dl_task() check
> below, that someone else called sched_setscheduler() on it.
>

Ok, I was also stuck at this point when I first reviewed v3.
Then I convinced myself that, even if it is probably always true,
the p->on_rq check would prevent weird situations like, for
example: by the time I block on a mutex, go to sleep or whatever,
I am throttled; then the dl_timer fires and I'm still !on_rq.
But I never actually saw this happen...

>> + */
>> +static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
>> +{
>> +       unsigned long flags;
>> +       struct sched_dl_entity *dl_se = container_of(timer,
>> +                                                    struct sched_dl_entity,
>> +                                                    dl_timer);
>> +       struct task_struct *p = dl_task_of(dl_se);
>> +       struct rq *rq = task_rq_lock(p, &flags);
>> +
>> +       /*
>> +        * We need to take care of possible races here. In fact, the
>> +        * task might have changed its scheduling policy to something
>> +        * different from SCHED_DEADLINE (through sched_setscheduler()).
>> +        */
>> +       if (!dl_task(p))
>> +               goto unlock;
>> +
>> +       dl_se->dl_throttled = 0;
>> +       if (p->on_rq) {
>> +               enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
>> +               if (task_has_dl_policy(rq->curr))
>> +                       check_preempt_curr_dl(rq, p, 0);
>> +               else
>> +                       resched_task(rq->curr);
>> +       }
>
> So I can't see how that cannot be true.
>
>> +unlock:
>> +       task_rq_unlock(rq, p, &flags);
>> +
>> +       return HRTIMER_NORESTART;
>> +}

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-23 15:15   ` Peter Zijlstra
@ 2012-04-23 15:37     ` Juri Lelli
  0 siblings, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-23 15:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On 04/23/2012 05:15 PM, Peter Zijlstra wrote:
> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>> +static
>> +int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
>> +{
>> +       int dmiss = dl_time_before(dl_se->deadline, rq->clock);
>> +       int rorun = dl_se->runtime <= 0;
>> +
>> +       if (!rorun && !dmiss)
>> +               return 0;
>> +
>> +       /*
>> +        * If we are beyond our current deadline and we are still
>> +        * executing, then we have already used some of the runtime of
>> +        * the next instance. Thus, if we do not account that, we are
>> +        * stealing bandwidth from the system at each deadline miss!
>> +        */
>> +       if (dmiss) {
>> +               dl_se->runtime = rorun ? dl_se->runtime : 0;
>> +               dl_se->runtime -= rq->clock - dl_se->deadline;
>> +       }
>
> So ideally this can't happen, but since we already leak time from the
> system by means of hardirq / kstop / context-switch-overhead /
> clock-jitter etc., we avoid the error accumulating?
>

Yep, seems fair :-).
  
>> +
>> +       return 1;
>> +}
>
>

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-23 14:35   ` Peter Zijlstra
@ 2012-04-23 15:39     ` Juri Lelli
  2012-04-23 15:43       ` Peter Zijlstra
  0 siblings, 1 reply; 129+ messages in thread
From: Juri Lelli @ 2012-04-23 15:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On 04/23/2012 04:35 PM, Peter Zijlstra wrote:
> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>> +static void init_dl_task_timer(struct sched_dl_entity *dl_se)
>> +{
>> +       struct hrtimer *timer = &dl_se->dl_timer;
>> +
>> +       if (hrtimer_active(timer)) {
>> +               hrtimer_try_to_cancel(timer);
>> +               return;
>> +       }
>
> Same question I guess, how can it be active here? Also, just letting it
> run doesn't seem like the best way out..
>

Probably s/hrtimer_try_to_cancel/hrtimer_cancel is better.
  
>> +
>> +       hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>> +       timer->function = dl_task_timer;
>> +       timer->irqsafe = 1;
>> +}

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-23 15:39     ` Juri Lelli
@ 2012-04-23 15:43       ` Peter Zijlstra
  2012-04-23 16:41         ` Juri Lelli
  2012-05-15 10:10         ` Juri Lelli
  0 siblings, 2 replies; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-23 15:43 UTC (permalink / raw)
  To: Juri Lelli
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Mon, 2012-04-23 at 17:39 +0200, Juri Lelli wrote:
> On 04/23/2012 04:35 PM, Peter Zijlstra wrote:
> > On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
> >> +static void init_dl_task_timer(struct sched_dl_entity *dl_se)
> >> +{
> >> +       struct hrtimer *timer = &dl_se->dl_timer;
> >> +
> >> +       if (hrtimer_active(timer)) {
> >> +               hrtimer_try_to_cancel(timer);
> >> +               return;
> >> +       }
> >
> > Same question I guess, how can it be active here? Also, just letting it
> > run doesn't seem like the best way out..
> >
> 
> Probably s/hrtimer_try_to_cancel/hrtimer_cancel is better.

Yeah, not sure you can do hrtimer_cancel() there though, you're holding
->pi_lock and rq->lock and have IRQs disabled. That sounds like asking
for trouble.

Anyway, if it can't happen, we don't have to fix it.. so let's answer
that first ;-)

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-23 15:43       ` Peter Zijlstra
@ 2012-04-23 16:41         ` Juri Lelli
       [not found]           ` <4F95D41F.5060700@sssup.it>
  2012-05-15 10:10         ` Juri Lelli
  1 sibling, 1 reply; 129+ messages in thread
From: Juri Lelli @ 2012-04-23 16:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On 04/23/2012 05:43 PM, Peter Zijlstra wrote:
> On Mon, 2012-04-23 at 17:39 +0200, Juri Lelli wrote:
>> On 04/23/2012 04:35 PM, Peter Zijlstra wrote:
>>> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>>>> +static void init_dl_task_timer(struct sched_dl_entity *dl_se)
>>>> +{
>>>> +       struct hrtimer *timer = &dl_se->dl_timer;
>>>> +
>>>> +       if (hrtimer_active(timer)) {
>>>> +               hrtimer_try_to_cancel(timer);
>>>> +               return;
>>>> +       }
>>>
>>> Same question I guess, how can it be active here? Also, just letting it
>>> run doesn't seem like the best way out..
>>>
>>
>> Probably s/hrtimer_try_to_cancel/hrtimer_cancel is better.
>
> Yeah, not sure you can do hrtimer_cancel() there though, you're holding
> ->pi_lock and rq->lock and have IRQs disabled. That sounds like asking
> for trouble.
>
> Anyway, if it can't happen, we don't have to fix it.. so let's answer
> that first ;-)

The user could call __setparam_dl on a throttled task through
__sched_setscheduler.

BTW, I noticed that we should change this (inside __sched_setscheduler):

         /*
          * If not changing anything there's no need to proceed further
          */
         if (unlikely(policy == p->policy && (!rt_policy(policy) ||
                         param->sched_priority == p->rt_priority))) {

                 __task_rq_unlock(rq);
                 raw_spin_unlock_irqrestore(&p->pi_lock, flags);
                 return 0;
         }

to something like this:

	if (unlikely(policy == p->policy && (!rt_policy(policy) ||
                         param->sched_priority == p->rt_priority) &&
			!dl_policy(policy)))

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-23 10:37     ` Juri Lelli
@ 2012-04-23 21:25       ` Tommaso Cucinotta
  2012-04-23 21:45         ` Peter Zijlstra
  0 siblings, 1 reply; 129+ messages in thread
From: Tommaso Cucinotta @ 2012-04-23 21:25 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Peter Zijlstra, tglx, mingo, rostedt, cfriesen, oleg, fweisbec,
	darren, johan.eker, p.faure, linux-kernel, claudio, michael,
	fchecconi, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang

Il 23/04/2012 11:37, Juri Lelli ha scritto:
> On 04/23/2012 12:31 PM, Peter Zijlstra wrote:
>> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>>> +       dl_se->deadline = rq->clock + dl_se->dl_deadline;
>>
>> You might want to use rq->clock_task, this clock excludes times spend in
>> hardirq context and steal-time (when paravirt).
>>
>> Then again, it might not want to use that.. but its something you might
>> want to consider and make explicit by means of a comment.
>
> Yes, I planned a consistency check for the use of clock/clock_task
> throughout the code, but it seems I then forgot it.
> Planned for the next iteration :-).

unless I'm mistaken, there are 3 repetitions of this block in 05/16:

+		dl_se->deadline = rq->clock + dl_se->dl_deadline;
+		dl_se->runtime = dl_se->dl_runtime;


perhaps enclosing them in a helper function (e.g., reset_from_now() or
similar) may help to keep things consistent...
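
For instance, a minimal sketch (the helper name is purely hypothetical,
and whether rq->clock or rq->clock_task is the right clock to use is
exactly the open question below):

/*
 * Hypothetical helper: start a fresh instance "now", i.e. set the
 * absolute deadline one relative deadline ahead of the current time
 * and refill the budget.
 */
static inline void reset_from_now(struct rq *rq, struct sched_dl_entity *dl_se)
{
	dl_se->deadline = rq->clock + dl_se->dl_deadline;
	dl_se->runtime = dl_se->dl_runtime;
}

The three call sites would then just do reset_from_now(rq, dl_se).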

Another thing: I cannot really get the difference between rq->clock and
rq->clock_task.
If clock_task is a kind of CLOCK_MONOTONIC thing that increases only
when the task (or any task) is scheduled, then you don't want to use
that here.
Here you need to set the new ->deadline to an absolute time, so I guess
the regular rq->clock is what you need, isn't it?

Hope I didn't say too much nonsense.

     T.

-- 
Tommaso Cucinotta, Computer Engineering PhD, Researcher
ReTiS Lab, Scuola Superiore Sant'Anna, Pisa, Italy
Tel +39 050 882 024, Fax +39 050 882 003
http://retis.sssup.it/people/tommaso


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-23 21:25       ` Tommaso Cucinotta
@ 2012-04-23 21:45         ` Peter Zijlstra
  2012-04-23 23:25           ` Tommaso Cucinotta
  0 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-23 21:45 UTC (permalink / raw)
  To: Tommaso Cucinotta
  Cc: Juri Lelli, tglx, mingo, rostedt, cfriesen, oleg, fweisbec,
	darren, johan.eker, p.faure, linux-kernel, claudio, michael,
	fchecconi, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang

On Mon, 2012-04-23 at 22:25 +0100, Tommaso Cucinotta wrote:
> I cannot really get the difference between rq->clock and rq->clock_task.

One runs at wall-time (rq->clock) the other excludes time in irq-context
and steal-time (rq->clock_task).

The idea is that ->clock_task gives the time as observed by schedulable
tasks and excludes other muck.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-23 11:55   ` Peter Zijlstra
  2012-04-23 14:43     ` Juri Lelli
@ 2012-04-23 21:55     ` Tommaso Cucinotta
  2012-04-23 21:58       ` Peter Zijlstra
  1 sibling, 1 reply; 129+ messages in thread
From: Tommaso Cucinotta @ 2012-04-23 21:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, tglx, mingo, rostedt, cfriesen, oleg, fweisbec,
	darren, johan.eker, p.faure, linux-kernel, claudio, michael,
	fchecconi, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang

Il 23/04/2012 12:55, Peter Zijlstra ha scritto:
> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>> +/*
>> + * Here we check if --at time t-- an entity (which is probably being
>> + * [re]activated or, in general, enqueued) can use its remaining runtime
>> + * and its current deadline _without_ exceeding the bandwidth it is
>> + * assigned (function returns true if it can).
>> + *
>> + * For this to hold, we must check if:
>> + *   runtime / (deadline - t) < dl_runtime / dl_deadline .
> It might be good to put a few words in as to why that is.. I know I
> always forget (but know where to find it by now), also might be good to
> refer those papers Tommaso listed when Steven asked this a while back.
>
>> + */
>> +static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
>> +{
>> +       u64 left, right;
>> +
>> +       /*
>> +        * left and right are the two sides of the equation above,
>> +        * after a bit of shuffling to use multiplications instead
>> +        * of divisions.
>> +        *
>> +        * Note that none of the time values involved in the two
>> +        * multiplications are absolute: dl_deadline and dl_runtime
>> +        * are the relative deadline and the maximum runtime of each
>> +        * instance, runtime is the runtime left for the last instance
>> +        * and (deadline - t), since t is rq->clock, is the time left
>> +        * to the (absolute) deadline. Therefore, overflowing the u64
>> +        * type is very unlikely to occur in both cases.
>> +        */
>> +       left = dl_se->dl_deadline * dl_se->runtime;
>> +       right = (dl_se->deadline - t) * dl_se->dl_runtime;
>
>  From what I can see there are no constraints on the values in
> __setparam_dl() so the above left term can be constructed to be an
> overflow.
>
> Ideally we'd use u128 here, but I don't think people will let us :/

why not write this straight in asm, i.e., multiply 64*64 and then divide
by 64, keeping the intermediate result in 128 bits?
It is straightforward to write in asm, but not that easy to make gcc
understand that I don't want to multiply 128*128 :-)... a few years ago
I had a similar issue; perhaps it was a 32/64 version of this problem,
and gcc was not optimizing the C code properly with -O3, so I ended up
using asm snippets.
In this case, if avoiding the division is a major requirement, then we
could do the two 64*64 multiplications in asm and compare the two
results on 128 bits. Again, only a few assembly lines on architectures
supporting the 64*64 multiply and the 128-bit comparison.

     T.

-- 
Tommaso Cucinotta, Computer Engineering PhD, Researcher
ReTiS Lab, Scuola Superiore Sant'Anna, Pisa, Italy
Tel +39 050 882 024, Fax +39 050 882 003
http://retis.sssup.it/people/tommaso


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-23 21:55     ` Tommaso Cucinotta
@ 2012-04-23 21:58       ` Peter Zijlstra
  2012-04-23 23:21         ` Tommaso Cucinotta
  2012-04-24  1:03         ` Steven Rostedt
  0 siblings, 2 replies; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-23 21:58 UTC (permalink / raw)
  To: Tommaso Cucinotta
  Cc: Juri Lelli, tglx, mingo, rostedt, cfriesen, oleg, fweisbec,
	darren, johan.eker, p.faure, linux-kernel, claudio, michael,
	fchecconi, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang

On Mon, 2012-04-23 at 22:55 +0100, Tommaso Cucinotta wrote:
> why not write this straight in asm, i.e., multiply 64*64 then divide by 
> 64 keeping the intermediate result on 128 bits? 

If you know of a way to do this for all 30 odd architectures supported
by our beloved kernel, do let me know ;-)

Yes I can do it for x86_64, but people tend to get mighty upset if you
break the compile for all other arches...

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-23 21:58       ` Peter Zijlstra
@ 2012-04-23 23:21         ` Tommaso Cucinotta
  2012-04-24  9:50           ` Peter Zijlstra
  2012-04-24  1:03         ` Steven Rostedt
  1 sibling, 1 reply; 129+ messages in thread
From: Tommaso Cucinotta @ 2012-04-23 23:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, tglx, mingo, rostedt, cfriesen, oleg, fweisbec,
	darren, johan.eker, p.faure, linux-kernel, claudio, michael,
	fchecconi, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang

Il 23/04/2012 22:58, Peter Zijlstra ha scritto:
> On Mon, 2012-04-23 at 22:55 +0100, Tommaso Cucinotta wrote:
>> why not write this straight in asm, i.e., multiply 64*64 then divide by
>> 64 keeping the intermediate result on 128 bits?
> If you know of a way to do this for all 30 odd architectures supported
> by our beloved kernel, do let me know ;-)

:-)
> Yes I can do it for x86_64, but people tend to get mighty upset if you
> break the compile for all other arches...

rather than breaking the compile, I was thinking more of using the
optimization for a more accurate comparison on archs that have 64-bit
mul and 128-bit cmp, and leaving the overflow on the other archs. That
would, though, imply a difference in behavior in those borderline cases
(very big periods, I guess).

However, I'm also puzzled about what would happen when compiling the
current code on mostly-16-bit micros, which have very limited 32-bit
operations...

     T.

-- 
Tommaso Cucinotta, Computer Engineering PhD, Researcher
ReTiS Lab, Scuola Superiore Sant'Anna, Pisa, Italy
Tel +39 050 882 024, Fax +39 050 882 003
http://retis.sssup.it/people/tommaso


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-23 21:45         ` Peter Zijlstra
@ 2012-04-23 23:25           ` Tommaso Cucinotta
  2012-04-24  6:29             ` Dario Faggioli
  0 siblings, 1 reply; 129+ messages in thread
From: Tommaso Cucinotta @ 2012-04-23 23:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, tglx, mingo, rostedt, cfriesen, oleg, fweisbec,
	darren, johan.eker, p.faure, linux-kernel, claudio, michael,
	fchecconi, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang

Il 23/04/2012 22:45, Peter Zijlstra ha scritto:
> On Mon, 2012-04-23 at 22:25 +0100, Tommaso Cucinotta wrote:
>> I cannot really get the difference between rq->clock and rq->clock_task.
> One runs at wall-time (rq->clock) the other excludes time in irq-context
> and steal-time (rq->clock_task).
>
> The idea is that ->clock_task gives the time as observed by schedulable
> tasks and excludes other muck.

so clock_task might be better for computing the consumed budget at task
deschedule, but for setting deadlines one period ahead in the future I
guess the regular wall-time rq->clock is the one to be used?

Thx,

     T.

-- 
Tommaso Cucinotta, Computer Engineering PhD, Researcher
ReTiS Lab, Scuola Superiore Sant'Anna, Pisa, Italy
Tel +39 050 882 024, Fax +39 050 882 003
http://retis.sssup.it/people/tommaso


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-23 21:58       ` Peter Zijlstra
  2012-04-23 23:21         ` Tommaso Cucinotta
@ 2012-04-24  1:03         ` Steven Rostedt
  1 sibling, 0 replies; 129+ messages in thread
From: Steven Rostedt @ 2012-04-24  1:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tommaso Cucinotta, Juri Lelli, tglx, mingo, cfriesen, oleg,
	fweisbec, darren, johan.eker, p.faure, linux-kernel, claudio,
	michael, fchecconi, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Mon, 2012-04-23 at 23:58 +0200, Peter Zijlstra wrote:
> On Mon, 2012-04-23 at 22:55 +0100, Tommaso Cucinotta wrote:
> > why not write this straight in asm, i.e., multiply 64*64 then divide by 
> > 64 keeping the intermediate result on 128 bits? 
> 
> If you know of a way to do this for all 30 odd architectures supported
> by our beloved kernel, do let me know ;-)
> 
> Yes I can do it for x86_64, but people tend to get mighty upset if you
> break the compile for all other arches...

Use the draconian method. Make SCHED_DEADLINE dependent on
"ARCH_HAS_128_MULT" and any arch maintainer that wants SCHED_DEADLINE
for their arch will be responsible for implementing it ;-)

-- Steve



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-23 23:25           ` Tommaso Cucinotta
@ 2012-04-24  6:29             ` Dario Faggioli
  2012-04-24  6:52               ` Juri Lelli
  0 siblings, 1 reply; 129+ messages in thread
From: Dario Faggioli @ 2012-04-24  6:29 UTC (permalink / raw)
  To: Tommaso Cucinotta
  Cc: Peter Zijlstra, Juri Lelli, tglx, mingo, rostedt, cfriesen, oleg,
	fweisbec, darren, johan.eker, p.faure, linux-kernel, claudio,
	michael, fchecconi, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, insop.song, liming.wang


On Tue, 2012-04-24 at 00:25 +0100, Tommaso Cucinotta wrote: 
> > The idea is that ->clock_task gives the time as observed by schedulable
> > tasks and excludes other muck.
> 
> so clock_task might be better to compute the consumed budget at task 
> deschedule, but for setting deadlines one period ahead in the future 
> guess the regular wall-time rq->clock is the one to be used?
> 
Yep, that was the idea, unless my recollection has completely gone
flaky! :-P

Perhaps adding a comment saying exactly this, as Peter suggested?

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)




^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-24  6:29             ` Dario Faggioli
@ 2012-04-24  6:52               ` Juri Lelli
  0 siblings, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-24  6:52 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Tommaso Cucinotta, Peter Zijlstra, tglx, mingo, rostedt,
	cfriesen, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, nicola.manica,
	luca.abeni, dhaval.giani, hgu1972, paulmck, insop.song,
	liming.wang

On 04/24/2012 08:29 AM, Dario Faggioli wrote:
> On Tue, 2012-04-24 at 00:25 +0100, Tommaso Cucinotta wrote:
>>> The idea is that ->clock_task gives the time as observed by schedulable
>>> tasks and excludes other muck.
>>
>> so clock_task might be better to compute the consumed budget at task
>> deschedule, but for setting deadlines one period ahead in the future
>> guess the regular wall-time rq->clock is the one to be used?
>>
> Yep, that was the idea, unless my recollection has completely gone
> flaky! :-P
>
> Perhaps adding a comment saying exactly this, as Peter suggested?
>

Sure! TODO for the next release :-).
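
Something along these lines, perhaps (just a draft of the wording, to be
placed next to the replenishment code):

	/*
	 * Use rq->clock (wall time) here: the new absolute deadline has to
	 * be a point in real time, one relative deadline ahead of now.
	 * rq->clock_task (which filters out irq and steal time) is instead
	 * the better fit for accounting the runtime actually consumed by
	 * the task.
	 */
	dl_se->deadline = rq->clock + dl_se->dl_deadline;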

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
       [not found]           ` <4F95D41F.5060700@sssup.it>
@ 2012-04-24  7:21             ` Juri Lelli
  2012-04-24  9:00               ` Peter Zijlstra
  0 siblings, 1 reply; 129+ messages in thread
From: Juri Lelli @ 2012-04-24  7:21 UTC (permalink / raw)
  To: Tommaso Cucinotta
  Cc: Peter Zijlstra, tglx, mingo, rostedt, cfriesen, oleg, fweisbec,
	darren, johan.eker, p.faure, linux-kernel, claudio, michael,
	fchecconi, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang

On 04/24/2012 12:13 AM, Tommaso Cucinotta wrote:
> Il 23/04/2012 17:41, Juri Lelli ha scritto:
>> The user could call __setparam_dl on a throttled task through
>> __sched_setscheduler.
>
> in case it can be related: a scenario that used to break isolation
> (in the old aquosa crap):
>  1) create a deadline task
>  2) (actively) wait till it's just about to be throttled
>  3) remove the reservation (i.e., return the task to the normal system
>     policy and destroy the reservation info in the kernel)
>  4) reserve it again
>

Yes, this is very similar to what I thought just after I sent the
email (ouch! :-)).
  
> Assuming the borderline condition of a nearly fully saturated system,
> if 3)-4) manage to happen sufficiently close to each other and right
> after 2), the task budget gets refilled with a deadline placed where it
> should not be according to the admission-control rules. In other words,
> a suitably misbehaving task may break the guarantees of other tasks.
> Something relevant when considering misbehaviour and admission control
> from a security perspective [1].
>

Thanks for the ref., I'll read it!

> At that time, I was persuaded that the right way to avoid this would
> be not to free the system cpu bw immediately when a reservation is
> destroyed, but rather to wait till its current abs deadline and only
> then "free" the bandwidth. A new task trying to re-create the
> reservation too early, i.e., at step 4) above, would be rejected by
> the system as it would still see a fully occupied cpu bw. Never
> implemented of course :-)...
>

A kind of "two steps" approach. It would work, I just have to think how
to implement it (and let the system survive ;-)). Then create some
bench to test it.

> And also, from a security perspective, a misbehaving (SCHED_OTHER)
> task might thrash the system with useless nanosleeps, forcing the OS
> to continuously schedule/deschedule it. Equivalently, with a deadline
> scheduler, you could try to set a very small period/deadline. That's
> why in [1], among the configurable variables, there was a minimum
> allowed reservation period.
>

Yes, this should be easily controlled at admission time.
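
E.g., something as simple as this at admission time (the sysctl knob is
made up, and the exact field name depends on the extended sched_param
interface):

	/* Hypothetical check: reject too-small reservation periods. */
	if (period < sysctl_sched_dl_period_min)
		return -EINVAL;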

> Nothing really urgent, just something you might want to keep in mind
> for the future, I thought.
>

Well, it depends on how much effort this will turn out to require. I
personally would prefer to be able to come out with a new release ASAP,
just to continue the discussion with most of the comments addressed and
more up-to-date code (I also have a mainline version of the patchset
quite ready).

Thanks a lot,

- Juri


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-24  7:21             ` Juri Lelli
@ 2012-04-24  9:00               ` Peter Zijlstra
  0 siblings, 0 replies; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-24  9:00 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Tommaso Cucinotta, tglx, mingo, rostedt, cfriesen, oleg,
	fweisbec, darren, johan.eker, p.faure, linux-kernel, claudio,
	michael, fchecconi, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Tue, 2012-04-24 at 09:21 +0200, Juri Lelli wrote:
> Well, it depends on how much effort this will turn out to require. I
> personally would prefer to be able to come out with a new release ASAP,
> just to continue the discussion with most of the comments addressed and
> more up-to-date code (I also have a mainline version of the patchset
> quite ready).

Right, one thing we can initially do is require root for using
SCHED_DEADLINE and then when later work closes all the holes and we've
added user bandwidth controls we can allow everybody in.
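
Something as simple as this early in __sched_setscheduler() would
probably do as a first step (a sketch only; whether to key it on
CAP_SYS_NICE, CAP_SYS_ADMIN or a plain root check is open):

	/*
	 * Hypothetical first-step restriction: -deadline is for
	 * privileged users only, until per-user bandwidth controls
	 * are in place.
	 */
	if (dl_policy(policy) && !capable(CAP_SYS_NICE))
		return -EPERM;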



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-23 23:21         ` Tommaso Cucinotta
@ 2012-04-24  9:50           ` Peter Zijlstra
  0 siblings, 0 replies; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-24  9:50 UTC (permalink / raw)
  To: Tommaso Cucinotta
  Cc: Juri Lelli, tglx, mingo, rostedt, cfriesen, oleg, fweisbec,
	darren, johan.eker, p.faure, linux-kernel, claudio, michael,
	fchecconi, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang

On Tue, 2012-04-24 at 00:21 +0100, Tommaso Cucinotta wrote:
> > Yes I can do it for x86_64, but people tend to get mighty upset if you
> > break the compile for all other arches...
> 
> rather than breaking compile, I was thinking more of using the 
> optimization for a more accurate comparison on archs that have 64-bit 
> mul and 128-bit cmp, and leaving the overflow on other archs. Though, 
> that would imply a difference in behavior on those borderline cases 
> (very big periods I guess).
> 
> However, I'm also puzzled from what would happen by compiling the 
> current code on mostly 16-bit micros which have very limited 32-bit 
> operations... 

We don't support 16bit archs, 32bit is almost useless as it is :-)

Anyway, how about something like this, I guess archs can go wild and add
asm/math128.h if they want etc..

Completely untested, hasn't even seen a compiler..

---
Subject: math128: Add {add,mult,cmp}_u128
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Tue Apr 24 11:47:12 CEST 2012


Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/math128.h |   75 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)

--- /dev/null
+++ b/include/linux/math128.h
@@ -0,0 +1,75 @@
+#ifndef _LINUX_MATH128_H
+#define _LINUX_MATH128_H
+
+#include <linux/types.h>
+
+typedef struct {
+	u64 hi, lo;
+} u128;
+
+u128 add_u128(u128 a, u128 b)
+{
+	u128 res;
+
+	res.hi = a.hi + b.hi;
+	res.lo = a.lo + b.lo;
+
+	if (res.lo < a.lo || res.lo < b.lo)
+		res.hi++;
+
+	return res;
+}
+
+/*
+ * a * b = (ah * 2^32 + al) * (bh * 2^32 + bl) =
+ *   ah*bh * 2^64 + (ah*bl + bh*al) * 2^32 + al*bl
+ */
+u128 mult_u128(u64 a, u64 b)
+{
+	u128 res;
+	u64 ah, al;
+	u64 bh, bl;
+	u128 t1, t2, t3, t4;
+
+	ah = a >> 32;
+	al = a & ((1ULL << 32) - 1);
+
+	bh = b >> 32;
+	bl = b & ((1ULL << 32) - 1);
+
+	t1.lo = 0;
+	t1.hi = ah * bh;
+
+	t2.lo = ah * bl;
+	t2.hi = t2.lo >> 32;
+	t2.lo <<= 32;
+
+	t3.lo = al * bh;
+	t3.hi = t3.lo >> 32;
+	t3.lo <<= 32;
+
+	t4.lo = al * bl;
+	t4.hi = 0;
+
+	res = add_u128(t1, t2);
+	res = add_u128(res, t3);
+	res = add_u128(res, t4);
+
+	return res;
+}
+
+int cmp_u128(u128 a, u128 b)
+{
+	if (a.hi > b.hi)
+		return 1;
+	if (a.hi < b.hi)
+		return -1;
+	if (a.lo > b.lo)
+		return 1;
+	if (a.lo < b.lo)
+		return -1;
+
+	return 0;
+}
+
+#endif /* _LINUX_MATH128_H */
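
For what it's worth, a sketch of how dl_entity_overflow() could then sit
on top of these helpers (assuming they end up as static inlines in the
header, that the remaining runtime is non-negative at this point, and
that the '>' comparison matches the convention of the current helper):

static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
{
	u128 left, right;

	/*
	 * Same terms as today, but the products are kept on 128 bits,
	 * so large relative deadlines/runtimes cannot overflow u64.
	 */
	left = mult_u128(dl_se->dl_deadline, dl_se->runtime);
	right = mult_u128(dl_se->deadline - t, dl_se->dl_runtime);

	return cmp_u128(left, right) > 0;
}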


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-11 16:33   ` Steven Rostedt
@ 2012-04-24 13:15     ` Peter Zijlstra
  2012-04-24 18:50       ` Steven Rostedt
  0 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-24 13:15 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Juri Lelli, tglx, mingo, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Wed, 2012-04-11 at 12:33 -0400, Steven Rostedt wrote:
> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
> >  
> > @@ -543,6 +897,9 @@ static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
> >  {
> >  	update_curr_dl(rq);
> >  	p->se.exec_start = 0;
> > +
> > +	if (on_dl_rq(&p->dl) && p->dl.nr_cpus_allowed > 1)
> > +		enqueue_pushable_dl_task(rq, p);
> >  }
> 
> Ouch! We need to fix this. This has nothing to do with your patch
> series, but if you look at schedule():
> 
> 	put_prev_task(rq, prev);
> 	next = pick_next_task(rq);
> 
> 
> We put the prev task and then pick the next task. If we call schedule
> for some reason when we don't need to really schedule, then we just
> added and removed from the pushable rb tree the same task. That is, we
> did the rb manipulation twice, for no good reason.
> 
> Not sure how to fix this. But it will require a generic change.


Something like so: https://lkml.org/lkml/2012/2/16/487 ?

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 10/16] sched: add resource limits for -deadline tasks.
  2012-04-06  7:14 ` [PATCH 10/16] sched: add resource limits " Juri Lelli
@ 2012-04-24 15:07   ` Peter Zijlstra
  2012-04-24 15:22     ` Juri Lelli
  0 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-24 15:07 UTC (permalink / raw)
  To: Juri Lelli
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
> From: Dario Faggioli <raistlin@linux.it>
> 
> Add resource limits for non-root tasks in using the SCHED_DEADLINE
> policy, very similarly to what already exists for RT policies.
> 
> In fact, this patch:
>  - adds the resource limit RLIMIT_DLDLINE, which is the minimum value
>    a user task can use as its own deadline;
>  - adds the resource limit RLIMIT_DLRTIME, which is the maximum value
>    a user task can use as it own runtime.
> 
> Notice that to exploit these, a modified version of the ulimit
> utility and a modified resource.h header file are needed. They
> both will be available on the website of the project.
> 
> Signed-off-by: Dario Faggioli <raistlin@linux.it>
> Signed-off-by: Juri Lelli <juri.lelli@gmail.com>

I'm not sure this is the right way to go.. those existing things aren't
entirely as useful/sane as one might hope either.

The DLDLINE minimum is ok I guess, the DLRTIME one doesn't really do
anything, by spawning multiple tasks one can still saturate the cpu and
thus we have no effective control for unpriv users.

Ideally DLRTIME would be a utilization cap per user and tracked in
user_struct such that we can enforce a max utilization per user.

This also needs a global (and possibly per-cgroup) user limit too to cap
the total utilization of all users (excluding root) so that multiple
users cannot combine their efforts in order to bring down the machine.

In light of these latter controls the per-user control might be
considered optional; furthermore, I don't particularly like the rlimit
infrastructure, but I guess it's the best we have for per-user-like
things if indeed we want to go there.
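
Roughly, and purely as a sketch of the per-user cap idea above (all the
names below are hypothetical; the bandwidth values would use whatever
fixed-point representation the admission control already uses):

/* Hypothetical per-user accounting of admitted -deadline bandwidth. */
struct user_struct {
	/* ... existing fields ... */
	raw_spinlock_t	dl_bw_lock;
	u64		dl_bw_used;	/* bandwidth admitted so far */
	u64		dl_bw_max;	/* per-user utilization cap */
};

static int dl_user_admit(struct user_struct *up, u64 new_bw)
{
	int ret = 0;

	raw_spin_lock(&up->dl_bw_lock);
	if (up->dl_bw_used + new_bw > up->dl_bw_max)
		ret = -EPERM;
	else
		up->dl_bw_used += new_bw;
	raw_spin_unlock(&up->dl_bw_lock);

	return ret;
}

__sched_setscheduler() would then call dl_user_admit() before accepting
the new parameters, and give the bandwidth back on policy change/exit.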



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 10/16] sched: add resource limits for -deadline tasks.
  2012-04-24 15:07   ` Peter Zijlstra
@ 2012-04-24 15:22     ` Juri Lelli
  2012-04-24 16:27       ` Peter Zijlstra
  0 siblings, 1 reply; 129+ messages in thread
From: Juri Lelli @ 2012-04-24 15:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On 04/24/2012 05:07 PM, Peter Zijlstra wrote:
> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>> From: Dario Faggioli<raistlin@linux.it>
>>
>> Add resource limits for non-root tasks in using the SCHED_DEADLINE
>> policy, very similarly to what already exists for RT policies.
>>
>> In fact, this patch:
>>   - adds the resource limit RLIMIT_DLDLINE, which is the minimum value
>>     a user task can use as its own deadline;
>>   - adds the resource limit RLIMIT_DLRTIME, which is the maximum value
>>     a user task can use as it own runtime.
>>
>> Notice that to exploit these, a modified version of the ulimit
>> utility and a modified resource.h header file are needed. They
>> both will be available on the website of the project.
>>
>> Signed-off-by: Dario Faggioli<raistlin@linux.it>
>> Signed-off-by: Juri Lelli<juri.lelli@gmail.com>
>
> I'm not sure this is the right way to go.. those existing things aren't
> entirely as useful/sane as one might hope either.
>
> The DLDLINE minimum is ok I guess, the DLRTIME one doesn't really do
> anything, by spawning multiple tasks one can still saturate the cpu and
> thus we have no effective control for unpriv users.
>
> Ideally DLRTIME would be a utilization cap per user and tracked in
> user_struct such that we can enforce a max utilization per user.
>
> This also needs a global (and possibly per-cgroup) user limit too to cap
> the total utilization of all users (excluding root) so that multiple
> users cannot combine their efforts in order to bring down the machine.
>
> In light of these latter controls the per-user control might be
> considered optional; furthermore, I don't particularly like the rlimit
> infrastructure, but I guess it's the best we have for per-user-like
> things if indeed we want to go there.

Ok, but considering what you said regarding setscheduler security problems:

On 04/24/2012 11:00 AM, Peter Zijlstra wrote:
> On Tue, 2012-04-24 at 09:21 +0200, Juri Lelli wrote:
>> >  Well, it depends on how much effort this will turn out to require. I
>> >  personally would prefer to be able to come out with a new release ASAP,
>> >  just to continue the discussion with most of the comments addressed and
>> >  more up-to-date code (I also have a mainline version of the patchset
>> >  quite ready).
> Right, one thing we can initially do is require root for using
> SCHED_DEADLINE and then when later work closes all the holes and we've
> added user bandwidth controls we can allow everybody in.

Are you suggesting to drop/postpone this to some later time?

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 10/16] sched: add resource limits for -deadline tasks.
  2012-04-24 15:22     ` Juri Lelli
@ 2012-04-24 16:27       ` Peter Zijlstra
  2012-04-24 17:14         ` Juri Lelli
  0 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-24 16:27 UTC (permalink / raw)
  To: Juri Lelli
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Tue, 2012-04-24 at 17:22 +0200, Juri Lelli wrote:

> Ok, but considering what you said regarding setscheduler security problems:

> Are you suggesting to drop/postpone this to some later time?

Yeah, I think we can put this on the end of the queue and try and get
some of the basic infrastructure merged first.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 10/16] sched: add resource limits for -deadline tasks.
  2012-04-24 16:27       ` Peter Zijlstra
@ 2012-04-24 17:14         ` Juri Lelli
  0 siblings, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-04-24 17:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On 04/24/2012 06:27 PM, Peter Zijlstra wrote:
> On Tue, 2012-04-24 at 17:22 +0200, Juri Lelli wrote:
>
>> Ok, but considering what you said regarding setscheduler security problems:
>
>> Are you suggesting to drop/postpone this to some later time?
>
> Yeah, I think we can put this on the end of the queue and try and get
> some of the basic infrastructure merged first.

Perfect, no rants on my side ;-).

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-24 13:15     ` Peter Zijlstra
@ 2012-04-24 18:50       ` Steven Rostedt
  2012-04-24 18:53         ` Peter Zijlstra
  0 siblings, 1 reply; 129+ messages in thread
From: Steven Rostedt @ 2012-04-24 18:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, tglx, mingo, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Tue, 2012-04-24 at 15:15 +0200, Peter Zijlstra wrote:
> On Wed, 2012-04-11 at 12:33 -0400, Steven Rostedt wrote:
> > On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
> > >  
> > > @@ -543,6 +897,9 @@ static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
> > >  {
> > >  	update_curr_dl(rq);
> > >  	p->se.exec_start = 0;
> > > +
> > > +	if (on_dl_rq(&p->dl) && p->dl.nr_cpus_allowed > 1)
> > > +		enqueue_pushable_dl_task(rq, p);
> > >  }
> > 
> > Ouch! We need to fix this. This has nothing to do with your patch
> > series, but if you look at schedule():
> > 
> > 	put_prev_task(rq, prev);
> > 	next = pick_next_task(rq);
> > 
> > 
> > We put the prev task and then pick the next task. If we call schedule
> > for some reason when we don't need to really schedule, then we just
> > added and removed from the pushable rb tree the same task. That is, we
> > did the rb manipulation twice, for no good reason.
> > 
> > Not sure how to fix this. But it will require a generic change.
> 
> 
> Something like so: https://lkml.org/lkml/2012/2/16/487 ?

But it still does the same thing:

+static struct task_struct *
+pick_next_task_rt(struct rq *rq, struct task_struct *prev)
 {
-	struct task_struct *p = _pick_next_task_rt(rq);
+	struct task_struct *p;
+	struct rt_rq *rt_rq = &rq->rt;
+
+	if (!rt_rq->rt_nr_running)
+		return NULL;
+
+	if (rt_rq_throttled(rt_rq))
+		return NULL;
+
+	if (prev)
+		prev->sched_class->put_prev_task(rq, prev);
+
+	p = _pick_next_task_rt(rq);

Now if we can do the _pick_next_task_rt() before put_prev_task(), and
only do the put_prev_task() if p != prev, then that would be something.
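
I.e., something along these lines (just a sketch of the reordering on
top of that patch; everything after the pick stays as it is):

static struct task_struct *
pick_next_task_rt(struct rq *rq, struct task_struct *prev)
{
	struct task_struct *p;
	struct rt_rq *rt_rq = &rq->rt;

	if (!rt_rq->rt_nr_running)
		return NULL;

	if (rt_rq_throttled(rt_rq))
		return NULL;

	p = _pick_next_task_rt(rq);

	/* Skip the rb-tree double manipulation when nothing changes. */
	if (prev && p != prev)
		prev->sched_class->put_prev_task(rq, prev);

	return p;
}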
 
-- Steve



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-24 18:50       ` Steven Rostedt
@ 2012-04-24 18:53         ` Peter Zijlstra
  2012-04-24 19:01           ` Steven Rostedt
  0 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2012-04-24 18:53 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Juri Lelli, tglx, mingo, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Tue, 2012-04-24 at 14:50 -0400, Steven Rostedt wrote:
> > Something like so: https://lkml.org/lkml/2012/2/16/487 ?
> 
> But it still does the same thing:
> 
> +static struct task_struct *
> +pick_next_task_rt(struct rq *rq, struct task_struct *prev)
>  {
> -       struct task_struct *p = _pick_next_task_rt(rq);
> +       struct task_struct *p;
> +       struct rt_rq *rt_rq = &rq->rt;
> +
> +       if (!rt_rq->rt_nr_running)
> +               return NULL;
> +
> +       if (rt_rq_throttled(rt_rq))
> +               return NULL;
> +
> +       if (prev)
> +               prev->sched_class->put_prev_task(rq, prev);
> +
> +       p = _pick_next_task_rt(rq);
> 
> Now if we can do the _pick_next_task_rt() before put_prev_task(), and
> only do the put_prev_task() if p != prev, then that would be something.

Well, that's up to the implementation of pick_next_task_rt(); that
conversion is just a minimal make-it-work thing.

But the generic changes that patch carries should allow you to do what
you want, right?

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic
  2012-04-24 18:53         ` Peter Zijlstra
@ 2012-04-24 19:01           ` Steven Rostedt
  0 siblings, 0 replies; 129+ messages in thread
From: Steven Rostedt @ 2012-04-24 19:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, tglx, mingo, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On Tue, 2012-04-24 at 20:53 +0200, Peter Zijlstra wrote:

> Well that's up to the implementation of pick_next_task_rt() that
> conversion is just a minimal make it work thing.
> 
> But the generic changes that patch carries should allow you to do what
> you want, right?

Yep. It should.

-- Steve




^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 05/16] sched: SCHED_DEADLINE policy implementation.
  2012-04-23 15:43       ` Peter Zijlstra
  2012-04-23 16:41         ` Juri Lelli
@ 2012-05-15 10:10         ` Juri Lelli
  1 sibling, 0 replies; 129+ messages in thread
From: Juri Lelli @ 2012-05-15 10:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, rostedt, cfriesen, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang

On 04/23/2012 05:43 PM, Peter Zijlstra wrote:
> On Mon, 2012-04-23 at 17:39 +0200, Juri Lelli wrote:
>> On 04/23/2012 04:35 PM, Peter Zijlstra wrote:
>>> On Fri, 2012-04-06 at 09:14 +0200, Juri Lelli wrote:
>>>> +static void init_dl_task_timer(struct sched_dl_entity *dl_se)
>>>> +{
>>>> +       struct hrtimer *timer = &dl_se->dl_timer;
>>>> +
>>>> +       if (hrtimer_active(timer)) {
>>>> +               hrtimer_try_to_cancel(timer);
>>>> +               return;
>>>> +       }
>>>
>>> Same question I guess, how can it be active here? Also, just letting it
>>> run doesn't seem like the best way out..
>>>
>>
>> Probably s/hrtimer_try_to_cancel/hrtimer_cancel is better.
>
> Yeah, not sure you can do hrtimer_cancel() there though, you're holding
> ->pi_lock and rq->lock and have IRQs disabled. That sounds like asking
> for trouble.
>
> Anyway, if it can't happen, we don't have to fix it.. so let's answer
> that first ;-)

Even though I dropped the bits for allowing !root users, this critical
point still remains.
What if I leave this as it is and instead do the following?

@@ -488,9 +488,10 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
         /*
          * We need to take care of possible races here. In fact, the
          * task might have changed its scheduling policy to something
-        * different from SCHED_DEADLINE (through sched_setscheduler()).
+        * different from SCHED_DEADLINE or changed its reservation
+        * parameters (through sched_{setscheduler(),setscheduler2()}).
          */
-       if (!dl_task(p))
+       if (!dl_task(p) || dl_se->dl_new)
                 goto unlock;
  
         dl_se->dl_throttled = 0;

The idea is that hrtimer_try_to_cancel should fail only if the callback routine
is running. If, meanwhile, I set up new parameters, I can try to recognize this
situation through dl_new (set to 1 during __setparam_dl).

BTW, I'd have a new version ready (also rebased on the current tip/master). It
addresses all the comments excluding your gcc work-around, math128 and the
nr_cpus_allowed shift (patches are ready, but those changes are not yet
mainline, right?). Anyway, do you think it would be fine to post it?

Thanks and Regards,

- Juri

^ permalink raw reply	[flat|nested] 129+ messages in thread

end of thread

Thread overview: 129+ messages
2012-04-06  7:14 [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4 Juri Lelli
2012-04-06  7:14 ` [PATCH 01/16] sched: add sched_class->task_dead Juri Lelli
2012-04-08 17:49   ` Oleg Nesterov
2012-04-08 18:09     ` Juri Lelli
2012-04-06  7:14 ` [PATCH 02/16] sched: add extended scheduling interface Juri Lelli
2012-04-06  7:14 ` [PATCH 03/16] sched: SCHED_DEADLINE data structures Juri Lelli
2012-04-23  9:08   ` Peter Zijlstra
2012-04-23  9:47     ` Juri Lelli
2012-04-23  9:49       ` Peter Zijlstra
2012-04-23  9:55         ` Juri Lelli
2012-04-23 10:12           ` Peter Zijlstra
2012-04-23  9:13   ` Peter Zijlstra
2012-04-23  9:28     ` Juri Lelli
2012-04-23  9:30   ` Peter Zijlstra
2012-04-23  9:36     ` Juri Lelli
2012-04-23  9:39       ` Peter Zijlstra
2012-04-23  9:34   ` Peter Zijlstra
2012-04-23 10:16     ` Juri Lelli
2012-04-23 10:28       ` Peter Zijlstra
2012-04-23 10:33         ` Juri Lelli
2012-04-06  7:14 ` [PATCH 04/16] sched: SCHED_DEADLINE SMP-related " Juri Lelli
2012-04-06  7:14 ` [PATCH 05/16] sched: SCHED_DEADLINE policy implementation Juri Lelli
2012-04-11  3:06   ` Steven Rostedt
2012-04-11  6:54     ` Juri Lelli
2012-04-11 13:41   ` Steven Rostedt
2012-04-11 13:55     ` Juri Lelli
2012-04-23 10:15   ` Peter Zijlstra
2012-04-23 10:18     ` Juri Lelli
2012-04-23 10:31   ` Peter Zijlstra
2012-04-23 10:37     ` Juri Lelli
2012-04-23 21:25       ` Tommaso Cucinotta
2012-04-23 21:45         ` Peter Zijlstra
2012-04-23 23:25           ` Tommaso Cucinotta
2012-04-24  6:29             ` Dario Faggioli
2012-04-24  6:52               ` Juri Lelli
2012-04-23 11:32   ` Peter Zijlstra
2012-04-23 12:13     ` Juri Lelli
2012-04-23 12:22       ` Peter Zijlstra
2012-04-23 13:37         ` Juri Lelli
2012-04-23 14:01           ` Peter Zijlstra
2012-04-23 11:34   ` Peter Zijlstra
2012-04-23 11:57     ` Juri Lelli
2012-04-23 11:55   ` Peter Zijlstra
2012-04-23 14:43     ` Juri Lelli
2012-04-23 15:11       ` Peter Zijlstra
2012-04-23 21:55     ` Tommaso Cucinotta
2012-04-23 21:58       ` Peter Zijlstra
2012-04-23 23:21         ` Tommaso Cucinotta
2012-04-24  9:50           ` Peter Zijlstra
2012-04-24  1:03         ` Steven Rostedt
2012-04-23 14:11   ` Peter Zijlstra
2012-04-23 14:25   ` Peter Zijlstra
2012-04-23 15:34     ` Juri Lelli
2012-04-23 14:35   ` Peter Zijlstra
2012-04-23 15:39     ` Juri Lelli
2012-04-23 15:43       ` Peter Zijlstra
2012-04-23 16:41         ` Juri Lelli
     [not found]           ` <4F95D41F.5060700@sssup.it>
2012-04-24  7:21             ` Juri Lelli
2012-04-24  9:00               ` Peter Zijlstra
2012-05-15 10:10         ` Juri Lelli
2012-04-23 15:15   ` Peter Zijlstra
2012-04-23 15:37     ` Juri Lelli
2012-04-06  7:14 ` [PATCH 06/16] sched: SCHED_DEADLINE push and pull logic Juri Lelli
2012-04-06 13:39   ` Hillf Danton
2012-04-06 17:31     ` Juri Lelli
2012-04-07  2:32       ` Hillf Danton
2012-04-07  7:46         ` Dario Faggioli
2012-04-08 20:20         ` Juri Lelli
2012-04-09 12:28           ` Hillf Danton
2012-04-10  8:11             ` Juri Lelli
2012-04-11 15:57               ` Steven Rostedt
2012-04-11 16:00           ` Steven Rostedt
2012-04-11 16:09             ` Juri Lelli
2012-04-11 14:10     ` Steven Rostedt
2012-04-12 12:28       ` Hillf Danton
2012-04-12 12:51         ` Steven Rostedt
2012-04-12 12:56           ` Hillf Danton
2012-04-12 13:35             ` Steven Rostedt
2012-04-12 13:41               ` Hillf Danton
2012-04-11 16:07   ` Steven Rostedt
2012-04-11 16:11     ` Juri Lelli
2012-04-11 16:14   ` Steven Rostedt
2012-04-19 13:44     ` Juri Lelli
2012-04-11 16:21   ` Steven Rostedt
2012-04-11 16:24     ` Juri Lelli
2012-04-11 16:33   ` Steven Rostedt
2012-04-24 13:15     ` Peter Zijlstra
2012-04-24 18:50       ` Steven Rostedt
2012-04-24 18:53         ` Peter Zijlstra
2012-04-24 19:01           ` Steven Rostedt
2012-04-11 17:25   ` Steven Rostedt
2012-04-11 17:48     ` Juri Lelli
2012-04-06  7:14 ` [PATCH 07/16] sched: SCHED_DEADLINE avg_update accounting Juri Lelli
2012-04-06  7:14 ` [PATCH 08/16] sched: add period support for -deadline tasks Juri Lelli
2012-04-11 20:32   ` Steven Rostedt
2012-04-11 21:56     ` Juri Lelli
2012-04-11 22:13     ` Tommaso Cucinotta
2012-04-12  0:19       ` Steven Rostedt
2012-04-12  6:39     ` Luca Abeni
2012-04-06  7:14 ` [PATCH 09/16] sched: add schedstats " Juri Lelli
2012-04-06  7:14 ` [PATCH 10/16] sched: add resource limits " Juri Lelli
2012-04-24 15:07   ` Peter Zijlstra
2012-04-24 15:22     ` Juri Lelli
2012-04-24 16:27       ` Peter Zijlstra
2012-04-24 17:14         ` Juri Lelli
2012-04-06  7:14 ` [PATCH 11/16] sched: add latency tracing " Juri Lelli
2012-04-11 21:03   ` Steven Rostedt
2012-04-12  7:16     ` Juri Lelli
2012-04-16 15:51     ` Daniel Vacek
2012-04-16 19:56       ` Steven Rostedt
2012-04-16 21:31         ` Daniel Vacek
2012-04-06  7:14 ` [PATCH 12/16] rtmutex: turn the plist into an rb-tree Juri Lelli
2012-04-11 21:11   ` Steven Rostedt
2012-04-22 14:28     ` Juri Lelli
2012-04-23  8:33     ` Peter Zijlstra
2012-04-23 11:37       ` Steven Rostedt
2012-04-06  7:14 ` [PATCH 13/16] sched: drafted deadline inheritance logic Juri Lelli
2012-04-12  2:42   ` Steven Rostedt
2012-04-22 14:04     ` Juri Lelli
2012-04-23  8:39     ` Peter Zijlstra
2012-04-06  7:14 ` [PATCH 14/16] sched: add bandwidth management for sched_dl Juri Lelli
2012-04-06  7:14 ` [PATCH 15/16] sched: speed up -dl pushes with a push-heap Juri Lelli
2012-04-06  7:14 ` [PATCH 16/16] sched: add sched_dl documentation Juri Lelli
2012-04-06  8:25 ` [RFC][PATCH 00/16] sched: SCHED_DEADLINE v4 Luca Abeni
2012-04-07  9:25   ` Tadeus Prastowo
2012-04-06 11:07 ` Dario Faggioli
2012-04-07  7:52 ` Juri Lelli
2012-04-11 14:17 ` [RFC][PATCH 00/16] sched: " Steven Rostedt
2012-04-11 14:28   ` Juri Lelli
