linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/14] sched: SCHED_DEADLINE v9
@ 2013-11-07 13:43 Juri Lelli
  2013-11-07 13:43 ` [PATCH 01/14] sched: add sched_class->task_dead Juri Lelli
                   ` (12 more replies)
  0 siblings, 13 replies; 81+ messages in thread
From: Juri Lelli @ 2013-11-07 13:43 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield

Hello everyone,

SCHED_DEADLINE patchset v9, less than a month after last version [1]!

Changes w.r.t. v8:

 - rebase on top of 3.12;
 - all comments from Peter Zijlstra and Ingo Molnar applied (thanks!):
    + auxiliary functions for UP/SMP cases
    + add a smp_rmb to match dl_set_overload smp_wmb
    + different clocks for deadlines and runtime need further study
    + amend the comment about yield_task_dl semantic
    + use an accessor to read the rq clock
    + clarify 06/14 changelog (why we want to specify periods != deadlines)

The development is taking place at:
   https://github.com/jlelli/sched-deadline

branch: sched-dl-V9-rebase (this patchset on top of tip/master).

Check the repositories frequently if you're interested, and feel free to
e-mail me for any issue you run into.

Test applications:
  https://github.com/gbagnoli/rt-app 
  https://github.com/jlelli/schedtool-dl

Development mailing list: linux-dl; you can subscribe from here:
http://feanor.sssup.it/mailman/listinfo/linux-dl
or via e-mail (send a message to linux-dl-request@retis.sssup.it with
just the word `help' as subject or in the body to receive info).

The code was being jointly developed by ReTiS Lab (http://retis.sssup.it)
and Evidence S.r.l (http://www.evidence.eu.com) in the context of the ACTORS
EU-funded project (http://www.actors-project.eu) and the S(o)OS EU-funded
project (http://www.soos-project.eu/). Development is now supported by the
JUNIPER EU-funded project [2].

As usual, any kind of feedback is welcome and appreciated.

Thanks in advice and regards,

 - Juri

[1] http://lwn.net/Articles/376502, http://lwn.net/Articles/353797,
    http://lwn.net/Articles/412410, http://lwn.net/Articles/490944,
    http://lwn.net/Articles/498472, http://lwn.net/Articles/521091,
    http://lwn.net/Articles/537388, http://lwn.net/Articles/570293

[2] http://www.juniper-project.org/page/overview

Dario Faggioli (9):
  sched: add sched_class->task_dead.
  sched: add extended scheduling interface.
  sched: SCHED_DEADLINE structures & implementation.
  sched: SCHED_DEADLINE avg_update accounting.
  sched: add schedstats for -deadline tasks.
  sched: add latency tracing for -deadline tasks.
  sched: drafted deadline inheritance logic.
  sched: add bandwidth management for sched_dl.
  sched: add sched_dl documentation.

Harald Gustafsson (1):
  sched: add period support for -deadline tasks.

Juri Lelli (3):
  sched: SCHED_DEADLINE SMP-related data structures & logic.
  sched: make dl_bw a sub-quota of rt_bw
  sched: speed up -dl pushes with a push-heap.

Peter Zijlstra (1):
  rtmutex: turn the plist into an rb-tree.

 Documentation/scheduler/sched-deadline.txt |  196 ++++
 arch/arm/include/asm/unistd.h              |    2 +-
 arch/arm/include/uapi/asm/unistd.h         |    3 +
 arch/arm/kernel/calls.S                    |    3 +
 arch/x86/syscalls/syscall_32.tbl           |    3 +
 arch/x86/syscalls/syscall_64.tbl           |    3 +
 include/linux/init_task.h                  |   10 +
 include/linux/rtmutex.h                    |   18 +-
 include/linux/sched.h                      |  122 +-
 include/linux/sched/deadline.h             |   26 +
 include/linux/sched/rt.h                   |    3 +-
 include/linux/sched/sysctl.h               |   11 +
 include/linux/syscalls.h                   |    7 +
 include/uapi/linux/sched.h                 |    1 +
 kernel/fork.c                              |    8 +-
 kernel/futex.c                             |    2 +
 kernel/hrtimer.c                           |    3 +-
 kernel/rtmutex-debug.c                     |    8 +-
 kernel/rtmutex.c                           |  164 ++-
 kernel/rtmutex_common.h                    |   22 +-
 kernel/sched/Makefile                      |    4 +-
 kernel/sched/core.c                        |  674 ++++++++++-
 kernel/sched/cpudeadline.c                 |  216 ++++
 kernel/sched/cpudeadline.h                 |   33 +
 kernel/sched/deadline.c                    | 1666 ++++++++++++++++++++++++++++
 kernel/sched/debug.c                       |   46 +
 kernel/sched/rt.c                          |    2 +-
 kernel/sched/sched.h                       |  150 +++
 kernel/sched/stop_task.c                   |    2 +-
 kernel/sysctl.c                            |    7 +
 kernel/trace/trace_sched_wakeup.c          |   45 +-
 kernel/trace/trace_selftest.c              |   28 +-
 32 files changed, 3363 insertions(+), 125 deletions(-)
 create mode 100644 Documentation/scheduler/sched-deadline.txt
 create mode 100644 include/linux/sched/deadline.h
 create mode 100644 kernel/sched/cpudeadline.c
 create mode 100644 kernel/sched/cpudeadline.h
 create mode 100644 kernel/sched/deadline.c

-- 
1.7.9.5


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH 01/14] sched: add sched_class->task_dead.
  2013-11-07 13:43 [PATCH 00/14] sched: SCHED_DEADLINE v9 Juri Lelli
@ 2013-11-07 13:43 ` Juri Lelli
  2013-11-12  4:17   ` Paul Turner
                     ` (2 more replies)
  2013-11-07 13:43 ` [PATCH 02/14] sched: add extended scheduling interface Juri Lelli
                   ` (11 subsequent siblings)
  12 siblings, 3 replies; 81+ messages in thread
From: Juri Lelli @ 2013-11-07 13:43 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield

From: Dario Faggioli <raistlin@linux.it>

Add a new function to the scheduling class interface. It is called
at the end of a context switch, if the prev task is in TASK_DEAD state.

It might be useful for the scheduling classes that want to be notified
when one of their task dies, e.g. to perform some cleanup actions.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
---
 kernel/sched/core.c  |    3 +++
 kernel/sched/sched.h |    1 +
 2 files changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5ac63c9..850a02c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1890,6 +1890,9 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	if (mm)
 		mmdrop(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {
+		if (prev->sched_class->task_dead)
+			prev->sched_class->task_dead(prev);
+
 		/*
 		 * Remove function-return probe instances associated with this
 		 * task and put them back on the free list.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b3c5653..64eda5c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -992,6 +992,7 @@ struct sched_class {
 	void (*set_curr_task) (struct rq *rq);
 	void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
 	void (*task_fork) (struct task_struct *p);
+	void (*task_dead) (struct task_struct *p);
 
 	void (*switched_from) (struct rq *this_rq, struct task_struct *task);
 	void (*switched_to) (struct rq *this_rq, struct task_struct *task);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH 02/14] sched: add extended scheduling interface.
  2013-11-07 13:43 [PATCH 00/14] sched: SCHED_DEADLINE v9 Juri Lelli
  2013-11-07 13:43 ` [PATCH 01/14] sched: add sched_class->task_dead Juri Lelli
@ 2013-11-07 13:43 ` Juri Lelli
  2013-11-12 17:23   ` Steven Rostedt
                     ` (3 more replies)
  2013-11-07 13:43 ` [PATCH 03/14] sched: SCHED_DEADLINE structures & implementation Juri Lelli
                   ` (10 subsequent siblings)
  12 siblings, 4 replies; 81+ messages in thread
From: Juri Lelli @ 2013-11-07 13:43 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield

From: Dario Faggioli <raistlin@linux.it>

Add the interface bits needed for supporting scheduling algorithms
with extended parameters (e.g., SCHED_DEADLINE).

In general, it makes possible to specify a periodic/sporadic task,
that executes for a given amount of runtime at each instance, and is
scheduled according to the urgency of their own timing constraints,
i.e.:
 - a (maximum/typical) instance execution time,
 - a minimum interval between consecutive instances,
 - a time constraint by which each instance must be completed.

Thus, both the data structure that holds the scheduling parameters of
the tasks and the system calls dealing with it must be extended.
Unfortunately, modifying the existing struct sched_param would break
the ABI and result in potentially serious compatibility issues with
legacy binaries.

For these reasons, this patch:
 - defines the new struct sched_param2, containing all the fields
   that are necessary for specifying a task in the computational
   model described above;
 - defines and implements the new scheduling related syscalls that
   manipulate it, i.e., sched_setscheduler2(), sched_setparam2()
   and sched_getparam2().

Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
proof of concept and for developing and testing purposes. Making them
available on other architectures is straightforward.

Since no "user" for these new parameters is introduced in this patch,
the implementation of the new system calls is just identical to their
already existing counterpart. Future patches that implement scheduling
policies able to exploit the new data structure must also take care of
modifying the *2() calls accordingly with their own purposes.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
---
 arch/arm/include/asm/unistd.h      |    2 +-
 arch/arm/include/uapi/asm/unistd.h |    3 +
 arch/arm/kernel/calls.S            |    3 +
 arch/x86/syscalls/syscall_32.tbl   |    3 +
 arch/x86/syscalls/syscall_64.tbl   |    3 +
 include/linux/sched.h              |   50 ++++++++++++++++
 include/linux/syscalls.h           |    7 +++
 kernel/sched/core.c                |  110 +++++++++++++++++++++++++++++++++++-
 8 files changed, 177 insertions(+), 4 deletions(-)

diff --git a/arch/arm/include/asm/unistd.h b/arch/arm/include/asm/unistd.h
index 141baa3..5f260fd 100644
--- a/arch/arm/include/asm/unistd.h
+++ b/arch/arm/include/asm/unistd.h
@@ -15,7 +15,7 @@
 
 #include <uapi/asm/unistd.h>
 
-#define __NR_syscalls  (380)
+#define __NR_syscalls  (383)
 #define __ARM_NR_cmpxchg		(__ARM_NR_BASE+0x00fff0)
 
 #define __ARCH_WANT_STAT64
diff --git a/arch/arm/include/uapi/asm/unistd.h b/arch/arm/include/uapi/asm/unistd.h
index af33b44..6a4985e 100644
--- a/arch/arm/include/uapi/asm/unistd.h
+++ b/arch/arm/include/uapi/asm/unistd.h
@@ -406,6 +406,9 @@
 #define __NR_process_vm_writev		(__NR_SYSCALL_BASE+377)
 #define __NR_kcmp			(__NR_SYSCALL_BASE+378)
 #define __NR_finit_module		(__NR_SYSCALL_BASE+379)
+#define __NR_sched_setscheduler2	(__NR_SYSCALL_BASE+380)
+#define __NR_sched_setparam2		(__NR_SYSCALL_BASE+381)
+#define __NR_sched_getparam2		(__NR_SYSCALL_BASE+382)
 
 /*
  * This may need to be greater than __NR_last_syscall+1 in order to
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index c6ca7e3..0fb1ef7 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -389,6 +389,9 @@
 		CALL(sys_process_vm_writev)
 		CALL(sys_kcmp)
 		CALL(sys_finit_module)
+/* 380 */	CALL(sys_sched_setscheduler2)
+		CALL(sys_sched_setparam2)
+		CALL(sys_sched_getparam2)
 #ifndef syscalls_counted
 .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
 #define syscalls_counted
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index aabfb83..dfce815 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -357,3 +357,6 @@
 348	i386	process_vm_writev	sys_process_vm_writev		compat_sys_process_vm_writev
 349	i386	kcmp			sys_kcmp
 350	i386	finit_module		sys_finit_module
+351	i386	sched_setparam2		sys_sched_setparam2
+352	i386	sched_getparam2		sys_sched_getparam2
+353	i386	sched_setscheduler2	sys_sched_setscheduler2
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 38ae65d..1849a70 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -320,6 +320,9 @@
 311	64	process_vm_writev	sys_process_vm_writev
 312	common	kcmp			sys_kcmp
 313	common	finit_module		sys_finit_module
+314	common	sched_setparam2		sys_sched_setparam2
+315	common	sched_getparam2		sys_sched_getparam2
+316	common	sched_setscheduler2	sys_sched_setscheduler2
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e27baee..9f7d633 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -55,6 +55,54 @@ struct sched_param {
 
 #include <asm/processor.h>
 
+/*
+ * Extended scheduling parameters data structure.
+ *
+ * This is needed because the original struct sched_param can not be
+ * altered without introducing ABI issues with legacy applications
+ * (e.g., in sched_getparam()).
+ *
+ * However, the possibility of specifying more than just a priority for
+ * the tasks may be useful for a wide variety of application fields, e.g.,
+ * multimedia, streaming, automation and control, and many others.
+ *
+ * This variant (sched_param2) is meant at describing a so-called
+ * sporadic time-constrained task. In such model a task is specified by:
+ *  - the activation period or minimum instance inter-arrival time;
+ *  - the maximum (or average, depending on the actual scheduling
+ *    discipline) computation time of all instances, a.k.a. runtime;
+ *  - the deadline (relative to the actual activation time) of each
+ *    instance.
+ * Very briefly, a periodic (sporadic) task asks for the execution of
+ * some specific computation --which is typically called an instance--
+ * (at most) every period. Moreover, each instance typically lasts no more
+ * than the runtime and must be completed by time instant t equal to
+ * the instance activation time + the deadline.
+ *
+ * This is reflected by the actual fields of the sched_param2 structure:
+ *
+ *  @sched_priority     task's priority (might still be useful)
+ *  @sched_deadline     representative of the task's deadline
+ *  @sched_runtime      representative of the task's runtime
+ *  @sched_period       representative of the task's period
+ *  @sched_flags        for customizing the scheduler behaviour
+ *
+ * Given this task model, there are a multiplicity of scheduling algorithms
+ * and policies, that can be used to ensure all the tasks will make their
+ * timing constraints.
+ *
+ * @__unused		padding to allow future expansion without ABI issues
+ */
+struct sched_param2 {
+	int sched_priority;
+	unsigned int sched_flags;
+	u64 sched_runtime;
+	u64 sched_deadline;
+	u64 sched_period;
+
+	u64 __unused[12];
+};
+
 struct exec_domain;
 struct futex_pi_state;
 struct robust_list_head;
@@ -1895,6 +1943,8 @@ extern int sched_setscheduler(struct task_struct *, int,
 			      const struct sched_param *);
 extern int sched_setscheduler_nocheck(struct task_struct *, int,
 				      const struct sched_param *);
+extern int sched_setscheduler2(struct task_struct *, int,
+				 const struct sched_param2 *);
 extern struct task_struct *idle_task(int cpu);
 /**
  * is_idle_task - is the specified task an idle task?
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 7fac04e..170ac59 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -38,6 +38,7 @@ struct rlimit;
 struct rlimit64;
 struct rusage;
 struct sched_param;
+struct sched_param2;
 struct sel_arg_struct;
 struct semaphore;
 struct sembuf;
@@ -277,11 +278,17 @@ asmlinkage long sys_clock_nanosleep(clockid_t which_clock, int flags,
 asmlinkage long sys_nice(int increment);
 asmlinkage long sys_sched_setscheduler(pid_t pid, int policy,
 					struct sched_param __user *param);
+asmlinkage long sys_sched_setscheduler2(pid_t pid, int policy,
+					struct sched_param2 __user *param);
 asmlinkage long sys_sched_setparam(pid_t pid,
 					struct sched_param __user *param);
+asmlinkage long sys_sched_setparam2(pid_t pid,
+					struct sched_param2 __user *param);
 asmlinkage long sys_sched_getscheduler(pid_t pid);
 asmlinkage long sys_sched_getparam(pid_t pid,
 					struct sched_param __user *param);
+asmlinkage long sys_sched_getparam2(pid_t pid,
+					struct sched_param2 __user *param);
 asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
 					unsigned long __user *user_mask_ptr);
 asmlinkage long sys_sched_getaffinity(pid_t pid, unsigned int len,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 850a02c..4fcbf13 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3285,7 +3285,8 @@ static bool check_same_owner(struct task_struct *p)
 }
 
 static int __sched_setscheduler(struct task_struct *p, int policy,
-				const struct sched_param *param, bool user)
+				const struct sched_param2 *param,
+				bool user)
 {
 	int retval, oldprio, oldpolicy = -1, on_rq, running;
 	unsigned long flags;
@@ -3450,10 +3451,20 @@ recheck:
 int sched_setscheduler(struct task_struct *p, int policy,
 		       const struct sched_param *param)
 {
-	return __sched_setscheduler(p, policy, param, true);
+	struct sched_param2 param2 = {
+		.sched_priority = param->sched_priority
+	};
+	return __sched_setscheduler(p, policy, &param2, true);
 }
 EXPORT_SYMBOL_GPL(sched_setscheduler);
 
+int sched_setscheduler2(struct task_struct *p, int policy,
+			  const struct sched_param2 *param2)
+{
+	return __sched_setscheduler(p, policy, param2, true);
+}
+EXPORT_SYMBOL_GPL(sched_setscheduler2);
+
 /**
  * sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace.
  * @p: the task in question.
@@ -3470,7 +3481,10 @@ EXPORT_SYMBOL_GPL(sched_setscheduler);
 int sched_setscheduler_nocheck(struct task_struct *p, int policy,
 			       const struct sched_param *param)
 {
-	return __sched_setscheduler(p, policy, param, false);
+	struct sched_param2 param2 = {
+		.sched_priority = param->sched_priority
+	};
+	return __sched_setscheduler(p, policy, &param2, false);
 }
 
 static int
@@ -3495,6 +3509,31 @@ do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)
 	return retval;
 }
 
+static int
+do_sched_setscheduler2(pid_t pid, int policy,
+			 struct sched_param2 __user *param2)
+{
+	struct sched_param2 lparam2;
+	struct task_struct *p;
+	int retval;
+
+	if (!param2 || pid < 0)
+		return -EINVAL;
+
+	memset(&lparam2, 0, sizeof(struct sched_param2));
+	if (copy_from_user(&lparam2, param2, sizeof(struct sched_param2)))
+		return -EFAULT;
+
+	rcu_read_lock();
+	retval = -ESRCH;
+	p = find_process_by_pid(pid);
+	if (p != NULL)
+		retval = sched_setscheduler2(p, policy, &lparam2);
+	rcu_read_unlock();
+
+	return retval;
+}
+
 /**
  * sys_sched_setscheduler - set/change the scheduler policy and RT priority
  * @pid: the pid in question.
@@ -3514,6 +3553,21 @@ SYSCALL_DEFINE3(sched_setscheduler, pid_t, pid, int, policy,
 }
 
 /**
+ * sys_sched_setscheduler2 - same as above, but with extended sched_param
+ * @pid: the pid in question.
+ * @policy: new policy (could use extended sched_param).
+ * @param: structure containg the extended parameters.
+ */
+SYSCALL_DEFINE3(sched_setscheduler2, pid_t, pid, int, policy,
+		struct sched_param2 __user *, param2)
+{
+	if (policy < 0)
+		return -EINVAL;
+
+	return do_sched_setscheduler2(pid, policy, param2);
+}
+
+/**
  * sys_sched_setparam - set/change the RT priority of a thread
  * @pid: the pid in question.
  * @param: structure containing the new RT priority.
@@ -3526,6 +3580,17 @@ SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
 }
 
 /**
+ * sys_sched_setparam2 - same as above, but with extended sched_param
+ * @pid: the pid in question.
+ * @param2: structure containing the extended parameters.
+ */
+SYSCALL_DEFINE2(sched_setparam2, pid_t, pid,
+		struct sched_param2 __user *, param2)
+{
+	return do_sched_setscheduler2(pid, -1, param2);
+}
+
+/**
  * sys_sched_getscheduler - get the policy (scheduling class) of a thread
  * @pid: the pid in question.
  *
@@ -3595,6 +3660,45 @@ out_unlock:
 	return retval;
 }
 
+/**
+ * sys_sched_getparam2 - same as above, but with extended sched_param
+ * @pid: the pid in question.
+ * @param2: structure containing the extended parameters.
+ */
+SYSCALL_DEFINE2(sched_getparam2, pid_t, pid,
+		struct sched_param2 __user *, param2)
+{
+	struct sched_param2 lp;
+	struct task_struct *p;
+	int retval;
+
+	if (!param2 || pid < 0)
+		return -EINVAL;
+
+	rcu_read_lock();
+	p = find_process_by_pid(pid);
+	retval = -ESRCH;
+	if (!p)
+		goto out_unlock;
+
+	retval = security_task_getscheduler(p);
+	if (retval)
+		goto out_unlock;
+
+	lp.sched_priority = p->rt_priority;
+	rcu_read_unlock();
+
+	retval = copy_to_user(param2, &lp,
+			sizeof(struct sched_param2)) ? -EFAULT : 0;
+
+	return retval;
+
+out_unlock:
+	rcu_read_unlock();
+	return retval;
+
+}
+
 long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 {
 	cpumask_var_t cpus_allowed, new_mask;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH 03/14] sched: SCHED_DEADLINE structures & implementation.
  2013-11-07 13:43 [PATCH 00/14] sched: SCHED_DEADLINE v9 Juri Lelli
  2013-11-07 13:43 ` [PATCH 01/14] sched: add sched_class->task_dead Juri Lelli
  2013-11-07 13:43 ` [PATCH 02/14] sched: add extended scheduling interface Juri Lelli
@ 2013-11-07 13:43 ` Juri Lelli
  2013-11-13  2:31   ` Steven Rostedt
                     ` (2 more replies)
  2013-11-07 13:43 ` [PATCH 04/14] sched: SCHED_DEADLINE SMP-related data structures & logic Juri Lelli
                   ` (9 subsequent siblings)
  12 siblings, 3 replies; 81+ messages in thread
From: Juri Lelli @ 2013-11-07 13:43 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield

From: Dario Faggioli <raistlin@linux.it>

Introduces the data structures, constants and symbols needed for
SCHED_DEADLINE implementation.

Core data structure of SCHED_DEADLINE are defined, along with their
initializers. Hooks for checking if a task belong to the new policy
are also added where they are needed.

Adds a scheduling class, in sched/dl.c and a new policy called
SCHED_DEADLINE. It is an implementation of the Earliest Deadline
First (EDF) scheduling algorithm, augmented with a mechanism (called
Constant Bandwidth Server, CBS) that makes it possible to isolate
the behaviour of tasks between each other.

The typical -deadline task will be made up of a computation phase
(instance) which is activated on a periodic or sporadic fashion. The
expected (maximum) duration of such computation is called the task's
runtime; the time interval by which each instance need to be completed
is called the task's relative deadline. The task's absolute deadline
is dynamically calculated as the time instant a task (better, an
instance) activates plus the relative deadline.

The EDF algorithms selects the task with the smallest absolute
deadline as the one to be executed first, while the CBS ensures each
task to run for at most its runtime every (relative) deadline
length time interval, avoiding any interference between different
tasks (bandwidth isolation).
Thanks to this feature, also tasks that do not strictly comply with
the computational model sketched above can effectively use the new
policy.

To summarize, this patch:
 - introduces the data structures, constants and symbols needed;
 - implements the core logic of the scheduling algorithm in the new
   scheduling class file;
 - provides all the glue code between the new scheduling class and
   the core scheduler and refines the interactions between sched/dl
   and the other existing scheduling classes.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com>
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
---
 arch/arm/include/asm/unistd.h  |    2 +-
 include/linux/sched.h          |   46 ++-
 include/linux/sched/deadline.h |   24 ++
 include/linux/sched/rt.h       |    2 +-
 include/uapi/linux/sched.h     |    1 +
 kernel/fork.c                  |    4 +-
 kernel/hrtimer.c               |    3 +-
 kernel/sched/Makefile          |    2 +-
 kernel/sched/core.c            |  111 ++++++-
 kernel/sched/deadline.c        |  682 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h           |   28 ++
 kernel/sched/stop_task.c       |    2 +-
 12 files changed, 884 insertions(+), 23 deletions(-)
 create mode 100644 include/linux/sched/deadline.h
 create mode 100644 kernel/sched/deadline.c

diff --git a/arch/arm/include/asm/unistd.h b/arch/arm/include/asm/unistd.h
index 5f260fd..acabef1 100644
--- a/arch/arm/include/asm/unistd.h
+++ b/arch/arm/include/asm/unistd.h
@@ -15,7 +15,7 @@
 
 #include <uapi/asm/unistd.h>
 
-#define __NR_syscalls  (383)
+#define __NR_syscalls  (384)
 #define __ARM_NR_cmpxchg		(__ARM_NR_BASE+0x00fff0)
 
 #define __ARCH_WANT_STAT64
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9f7d633..fdf957c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -92,6 +92,10 @@ struct sched_param {
  * timing constraints.
  *
  * @__unused		padding to allow future expansion without ABI issues
+ *
+ * As of now, the SCHED_DEADLINE policy (sched_dl scheduling class) is the
+ * only user of this new interface. More information about the algorithm
+ * available in the scheduling class file or in Documentation/.
  */
 struct sched_param2 {
 	int sched_priority;
@@ -1054,6 +1058,45 @@ struct sched_rt_entity {
 #endif
 };
 
+struct sched_dl_entity {
+	struct rb_node	rb_node;
+
+	/*
+	 * Original scheduling parameters. Copied here from sched_param2
+	 * during sched_setscheduler2(), they will remain the same until
+	 * the next sched_setscheduler2().
+	 */
+	u64 dl_runtime;		/* maximum runtime for each instance	*/
+	u64 dl_deadline;	/* relative deadline of each instance	*/
+
+	/*
+	 * Actual scheduling parameters. Initialized with the values above,
+	 * they are continously updated during task execution. Note that
+	 * the remaining runtime could be < 0 in case we are in overrun.
+	 */
+	s64 runtime;		/* remaining runtime for this instance	*/
+	u64 deadline;		/* absolute deadline for this instance	*/
+	unsigned int flags;	/* specifying the scheduler behaviour	*/
+
+	/*
+	 * Some bool flags:
+	 *
+	 * @dl_throttled tells if we exhausted the runtime. If so, the
+	 * task has to wait for a replenishment to be performed at the
+	 * next firing of dl_timer.
+	 *
+	 * @dl_new tells if a new instance arrived. If so we must
+	 * start executing it with full runtime and reset its absolute
+	 * deadline;
+	 */
+	int dl_throttled, dl_new;
+
+	/*
+	 * Bandwidth enforcement timer. Each -deadline task has its
+	 * own bandwidth to be enforced, thus we need one timer per task.
+	 */
+	struct hrtimer dl_timer;
+};
 
 struct rcu_node;
 
@@ -1088,6 +1131,7 @@ struct task_struct {
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group *sched_task_group;
 #endif
+	struct sched_dl_entity dl;
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	/* list of struct preempt_notifier: */
@@ -2024,7 +2068,7 @@ extern void wake_up_new_task(struct task_struct *tsk);
 #else
  static inline void kick_process(struct task_struct *tsk) { }
 #endif
-extern void sched_fork(struct task_struct *p);
+extern int sched_fork(struct task_struct *p);
 extern void sched_dead(struct task_struct *p);
 
 extern void proc_caches_init(void);
diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
new file mode 100644
index 0000000..9d303b8
--- /dev/null
+++ b/include/linux/sched/deadline.h
@@ -0,0 +1,24 @@
+#ifndef _SCHED_DEADLINE_H
+#define _SCHED_DEADLINE_H
+
+/*
+ * SCHED_DEADLINE tasks has negative priorities, reflecting
+ * the fact that any of them has higher prio than RT and
+ * NORMAL/BATCH tasks.
+ */
+
+#define MAX_DL_PRIO		0
+
+static inline int dl_prio(int prio)
+{
+	if (unlikely(prio < MAX_DL_PRIO))
+		return 1;
+	return 0;
+}
+
+static inline int dl_task(struct task_struct *p)
+{
+	return dl_prio(p->prio);
+}
+
+#endif /* _SCHED_DEADLINE_H */
diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
index 440434d..a157797 100644
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -22,7 +22,7 @@
 
 static inline int rt_prio(int prio)
 {
-	if (unlikely(prio < MAX_RT_PRIO))
+	if ((unsigned)prio < MAX_RT_PRIO)
 		return 1;
 	return 0;
 }
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 5a0f945..2d5e49a 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -39,6 +39,7 @@
 #define SCHED_BATCH		3
 /* SCHED_ISO: reserved but not implemented yet */
 #define SCHED_IDLE		5
+#define SCHED_DEADLINE		6
 /* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
 #define SCHED_RESET_ON_FORK     0x40000000
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 086fe73..55fc95f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1313,7 +1313,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 #endif
 
 	/* Perform scheduler related setup. Assign this task to a CPU. */
-	sched_fork(p);
+	retval = sched_fork(p);
+	if (retval)
+		goto bad_fork_cleanup_policy;
 
 	retval = perf_event_init_task(p);
 	if (retval)
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 383319b..0909436 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -46,6 +46,7 @@
 #include <linux/sched.h>
 #include <linux/sched/sysctl.h>
 #include <linux/sched/rt.h>
+#include <linux/sched/deadline.h>
 #include <linux/timer.h>
 #include <linux/freezer.h>
 
@@ -1610,7 +1611,7 @@ long hrtimer_nanosleep(struct timespec *rqtp, struct timespec __user *rmtp,
 	unsigned long slack;
 
 	slack = current->timer_slack_ns;
-	if (rt_task(current))
+	if (dl_task(current) || rt_task(current))
 		slack = 0;
 
 	hrtimer_init_on_stack(&t.timer, clockid, mode);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 54adcf3..d77282f 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -11,7 +11,7 @@ ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
 CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer
 endif
 
-obj-y += core.o proc.o clock.o cputime.o idle_task.o fair.o rt.o stop_task.o
+obj-y += core.o proc.o clock.o cputime.o idle_task.o fair.o rt.o deadline.o stop_task.o
 obj-$(CONFIG_SMP) += cpupri.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4fcbf13..cfe15bfc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -903,7 +903,9 @@ static inline int normal_prio(struct task_struct *p)
 {
 	int prio;
 
-	if (task_has_rt_policy(p))
+	if (task_has_dl_policy(p))
+		prio = MAX_DL_PRIO-1;
+	else if (task_has_rt_policy(p))
 		prio = MAX_RT_PRIO-1 - p->rt_priority;
 	else
 		prio = __normal_prio(p);
@@ -1611,6 +1613,12 @@ static void __sched_fork(struct task_struct *p)
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
 #endif
 
+	RB_CLEAR_NODE(&p->dl.rb_node);
+	hrtimer_init(&p->dl.dl_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	p->dl.dl_runtime = p->dl.runtime = 0;
+	p->dl.dl_deadline = p->dl.deadline = 0;
+	p->dl.flags = 0;
+
 	INIT_LIST_HEAD(&p->rt.run_list);
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
@@ -1654,7 +1662,7 @@ void set_numabalancing_state(bool enabled)
 /*
  * fork()/clone()-time setup:
  */
-void sched_fork(struct task_struct *p)
+int sched_fork(struct task_struct *p)
 {
 	unsigned long flags;
 	int cpu = get_cpu();
@@ -1676,7 +1684,7 @@ void sched_fork(struct task_struct *p)
 	 * Revert to default priority/policy on fork if requested.
 	 */
 	if (unlikely(p->sched_reset_on_fork)) {
-		if (task_has_rt_policy(p)) {
+		if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
 			p->policy = SCHED_NORMAL;
 			p->static_prio = NICE_TO_PRIO(0);
 			p->rt_priority = 0;
@@ -1693,8 +1701,14 @@ void sched_fork(struct task_struct *p)
 		p->sched_reset_on_fork = 0;
 	}
 
-	if (!rt_prio(p->prio))
+	if (dl_prio(p->prio)) {
+		put_cpu();
+		return -EAGAIN;
+	} else if (rt_prio(p->prio)) {
+		p->sched_class = &rt_sched_class;
+	} else {
 		p->sched_class = &fair_sched_class;
+	}
 
 	if (p->sched_class->task_fork)
 		p->sched_class->task_fork(p);
@@ -1726,6 +1740,7 @@ void sched_fork(struct task_struct *p)
 #endif
 
 	put_cpu();
+	return 0;
 }
 
 /*
@@ -3029,7 +3044,7 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 	struct rq *rq;
 	const struct sched_class *prev_class;
 
-	BUG_ON(prio < 0 || prio > MAX_PRIO);
+	BUG_ON(prio > MAX_PRIO);
 
 	rq = __task_rq_lock(p);
 
@@ -3061,7 +3076,9 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 	if (running)
 		p->sched_class->put_prev_task(rq, p);
 
-	if (rt_prio(prio))
+	if (dl_prio(prio))
+		p->sched_class = &dl_sched_class;
+	else if (rt_prio(prio))
 		p->sched_class = &rt_sched_class;
 	else
 		p->sched_class = &fair_sched_class;
@@ -3095,9 +3112,9 @@ void set_user_nice(struct task_struct *p, long nice)
 	 * The RT priorities are set via sched_setscheduler(), but we still
 	 * allow the 'normal' nice value to be set - but as expected
 	 * it wont have any effect on scheduling until the task is
-	 * SCHED_FIFO/SCHED_RR:
+	 * SCHED_DEADLINE, SCHED_FIFO or SCHED_RR:
 	 */
-	if (task_has_rt_policy(p)) {
+	if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
 		p->static_prio = NICE_TO_PRIO(nice);
 		goto out_unlock;
 	}
@@ -3261,7 +3278,9 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
 	p->normal_prio = normal_prio(p);
 	/* we are holding p->pi_lock already */
 	p->prio = rt_mutex_getprio(p);
-	if (rt_prio(p->prio))
+	if (dl_prio(p->prio))
+		p->sched_class = &dl_sched_class;
+	else if (rt_prio(p->prio))
 		p->sched_class = &rt_sched_class;
 	else
 		p->sched_class = &fair_sched_class;
@@ -3269,6 +3288,50 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
 }
 
 /*
+ * This function initializes the sched_dl_entity of a newly becoming
+ * SCHED_DEADLINE task.
+ *
+ * Only the static values are considered here, the actual runtime and the
+ * absolute deadline will be properly calculated when the task is enqueued
+ * for the first time with its new policy.
+ */
+static void
+__setparam_dl(struct task_struct *p, const struct sched_param2 *param2)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+
+	init_dl_task_timer(dl_se);
+	dl_se->dl_runtime = param2->sched_runtime;
+	dl_se->dl_deadline = param2->sched_deadline;
+	dl_se->flags = param2->sched_flags;
+	dl_se->dl_throttled = 0;
+	dl_se->dl_new = 1;
+}
+
+static void
+__getparam_dl(struct task_struct *p, struct sched_param2 *param2)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+
+	param2->sched_priority = p->rt_priority;
+	param2->sched_runtime = dl_se->dl_runtime;
+	param2->sched_deadline = dl_se->dl_deadline;
+	param2->sched_flags = dl_se->flags;
+}
+
+/*
+ * This function validates the new parameters of a -deadline task.
+ * We ask for the deadline not being zero, and greater or equal
+ * than the runtime.
+ */
+static bool
+__checkparam_dl(const struct sched_param2 *prm)
+{
+	return prm && (&prm->sched_deadline) != 0 &&
+	       (s64)(&prm->sched_deadline - &prm->sched_runtime) >= 0;
+}
+
+/*
  * check the target process has a UID that matches the current process's
  */
 static bool check_same_owner(struct task_struct *p)
@@ -3305,7 +3368,8 @@ recheck:
 		reset_on_fork = !!(policy & SCHED_RESET_ON_FORK);
 		policy &= ~SCHED_RESET_ON_FORK;
 
-		if (policy != SCHED_FIFO && policy != SCHED_RR &&
+		if (policy != SCHED_DEADLINE &&
+				policy != SCHED_FIFO && policy != SCHED_RR &&
 				policy != SCHED_NORMAL && policy != SCHED_BATCH &&
 				policy != SCHED_IDLE)
 			return -EINVAL;
@@ -3320,7 +3384,8 @@ recheck:
 	    (p->mm && param->sched_priority > MAX_USER_RT_PRIO-1) ||
 	    (!p->mm && param->sched_priority > MAX_RT_PRIO-1))
 		return -EINVAL;
-	if (rt_policy(policy) != (param->sched_priority != 0))
+	if ((dl_policy(policy) && !__checkparam_dl(param)) ||
+	    (rt_policy(policy) != (param->sched_priority != 0)))
 		return -EINVAL;
 
 	/*
@@ -3386,7 +3451,8 @@ recheck:
 	 * If not changing anything there's no need to proceed further:
 	 */
 	if (unlikely(policy == p->policy && (!rt_policy(policy) ||
-			param->sched_priority == p->rt_priority))) {
+			param->sched_priority == p->rt_priority) &&
+			!dl_policy(policy))) {
 		task_rq_unlock(rq, p, &flags);
 		return 0;
 	}
@@ -3423,7 +3489,11 @@ recheck:
 
 	oldprio = p->prio;
 	prev_class = p->sched_class;
-	__setscheduler(rq, p, policy, param->sched_priority);
+	if (dl_policy(policy)) {
+		__setparam_dl(p, param);
+		__setscheduler(rq, p, policy, param->sched_priority);
+	} else
+		__setscheduler(rq, p, policy, param->sched_priority);
 
 	if (running)
 		p->sched_class->set_curr_task(rq);
@@ -3527,8 +3597,11 @@ do_sched_setscheduler2(pid_t pid, int policy,
 	rcu_read_lock();
 	retval = -ESRCH;
 	p = find_process_by_pid(pid);
-	if (p != NULL)
+	if (p != NULL) {
+		if (dl_policy(policy))
+			lparam2.sched_priority = 0;
 		retval = sched_setscheduler2(p, policy, &lparam2);
+	}
 	rcu_read_unlock();
 
 	return retval;
@@ -3685,7 +3758,10 @@ SYSCALL_DEFINE2(sched_getparam2, pid_t, pid,
 	if (retval)
 		goto out_unlock;
 
-	lp.sched_priority = p->rt_priority;
+	if (task_has_dl_policy(p))
+		__getparam_dl(p, &lp);
+	else
+		lp.sched_priority = p->rt_priority;
 	rcu_read_unlock();
 
 	retval = copy_to_user(param2, &lp,
@@ -4120,6 +4196,7 @@ SYSCALL_DEFINE1(sched_get_priority_max, int, policy)
 	case SCHED_RR:
 		ret = MAX_USER_RT_PRIO-1;
 		break;
+	case SCHED_DEADLINE:
 	case SCHED_NORMAL:
 	case SCHED_BATCH:
 	case SCHED_IDLE:
@@ -4146,6 +4223,7 @@ SYSCALL_DEFINE1(sched_get_priority_min, int, policy)
 	case SCHED_RR:
 		ret = 1;
 		break;
+	case SCHED_DEADLINE:
 	case SCHED_NORMAL:
 	case SCHED_BATCH:
 	case SCHED_IDLE:
@@ -6563,6 +6641,7 @@ void __init sched_init(void)
 		rq->calc_load_update = jiffies + LOAD_FREQ;
 		init_cfs_rq(&rq->cfs);
 		init_rt_rq(&rq->rt, rq);
+		init_dl_rq(&rq->dl, rq);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 		root_task_group.shares = ROOT_TASK_GROUP_LOAD;
 		INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
@@ -6746,7 +6825,7 @@ void normalize_rt_tasks(void)
 		p->se.statistics.block_start	= 0;
 #endif
 
-		if (!rt_task(p)) {
+		if (!dl_task(p) && !rt_task(p)) {
 			/*
 			 * Renice negative nice level userspace
 			 * tasks back to 0:
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
new file mode 100644
index 0000000..cb93f2e
--- /dev/null
+++ b/kernel/sched/deadline.c
@@ -0,0 +1,682 @@
+/*
+ * Deadline Scheduling Class (SCHED_DEADLINE)
+ *
+ * Earliest Deadline First (EDF) + Constant Bandwidth Server (CBS).
+ *
+ * Tasks that periodically executes their instances for less than their
+ * runtime won't miss any of their deadlines.
+ * Tasks that are not periodic or sporadic or that tries to execute more
+ * than their reserved bandwidth will be slowed down (and may potentially
+ * miss some of their deadlines), and won't affect any other task.
+ *
+ * Copyright (C) 2012 Dario Faggioli <raistlin@linux.it>,
+ *                    Michael Trimarchi <michael@amarulasolutions.com>,
+ *                    Fabio Checconi <fchecconi@gmail.com>
+ */
+#include "sched.h"
+
+static inline int dl_time_before(u64 a, u64 b)
+{
+	return (s64)(a - b) < 0;
+}
+
+static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
+{
+	return container_of(dl_se, struct task_struct, dl);
+}
+
+static inline struct rq *rq_of_dl_rq(struct dl_rq *dl_rq)
+{
+	return container_of(dl_rq, struct rq, dl);
+}
+
+static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
+{
+	struct task_struct *p = dl_task_of(dl_se);
+	struct rq *rq = task_rq(p);
+
+	return &rq->dl;
+}
+
+static inline int on_dl_rq(struct sched_dl_entity *dl_se)
+{
+	return !RB_EMPTY_NODE(&dl_se->rb_node);
+}
+
+static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+
+	return dl_rq->rb_leftmost == &dl_se->rb_node;
+}
+
+void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
+{
+	dl_rq->rb_root = RB_ROOT;
+}
+
+static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
+static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
+static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
+				  int flags);
+
+/*
+ * We are being explicitly informed that a new instance is starting,
+ * and this means that:
+ *  - the absolute deadline of the entity has to be placed at
+ *    current time + relative deadline;
+ *  - the runtime of the entity has to be set to the maximum value.
+ *
+ * The capability of specifying such event is useful whenever a -deadline
+ * entity wants to (try to!) synchronize its behaviour with the scheduler's
+ * one, and to (try to!) reconcile itself with its own scheduling
+ * parameters.
+ */
+static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+
+	WARN_ON(!dl_se->dl_new || dl_se->dl_throttled);
+
+	/*
+	 * We use the regular wall clock time to set deadlines in the
+	 * future; in fact, we must consider execution overheads (time
+	 * spent on hardirq context, etc.).
+	 */
+	dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
+	dl_se->runtime = dl_se->dl_runtime;
+	dl_se->dl_new = 0;
+}
+
+/*
+ * Pure Earliest Deadline First (EDF) scheduling does not deal with the
+ * possibility of a entity lasting more than what it declared, and thus
+ * exhausting its runtime.
+ *
+ * Here we are interested in making runtime overrun possible, but we do
+ * not want a entity which is misbehaving to affect the scheduling of all
+ * other entities.
+ * Therefore, a budgeting strategy called Constant Bandwidth Server (CBS)
+ * is used, in order to confine each entity within its own bandwidth.
+ *
+ * This function deals exactly with that, and ensures that when the runtime
+ * of a entity is replenished, its deadline is also postponed. That ensures
+ * the overrunning entity can't interfere with other entity in the system and
+ * can't make them miss their deadlines. Reasons why this kind of overruns
+ * could happen are, typically, a entity voluntarily trying to overcome its
+ * runtime, or it just underestimated it during sched_setscheduler_ex().
+ */
+static void replenish_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+
+	/*
+	 * We keep moving the deadline away until we get some
+	 * available runtime for the entity. This ensures correct
+	 * handling of situations where the runtime overrun is
+	 * arbitrary large.
+	 */
+	while (dl_se->runtime <= 0) {
+		dl_se->deadline += dl_se->dl_deadline;
+		dl_se->runtime += dl_se->dl_runtime;
+	}
+
+	/*
+	 * At this point, the deadline really should be "in
+	 * the future" with respect to rq->clock. If it's
+	 * not, we are, for some reason, lagging too much!
+	 * Anyway, after having warn userspace abut that,
+	 * we still try to keep the things running by
+	 * resetting the deadline and the budget of the
+	 * entity.
+	 */
+	if (dl_time_before(dl_se->deadline, rq_clock(rq))) {
+		static bool lag_once = false;
+
+		if (!lag_once) {
+			lag_once = true;
+			printk_sched("sched: DL replenish lagged to much\n");
+		}
+		dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
+		dl_se->runtime = dl_se->dl_runtime;
+	}
+}
+
+/*
+ * Here we check if --at time t-- an entity (which is probably being
+ * [re]activated or, in general, enqueued) can use its remaining runtime
+ * and its current deadline _without_ exceeding the bandwidth it is
+ * assigned (function returns true if it can't). We are in fact applying
+ * one of the CBS rules: when a task wakes up, if the residual runtime
+ * over residual deadline fits within the allocated bandwidth, then we
+ * can keep the current (absolute) deadline and residual budget without
+ * disrupting the schedulability of the system. Otherwise, we should
+ * refill the runtime and set the deadline a period in the future,
+ * because keeping the current (absolute) deadline of the task would
+ * result in breaking guarantees promised to other tasks.
+ *
+ * This function returns true if:
+ *
+ *   runtime / (deadline - t) > dl_runtime / dl_deadline ,
+ *
+ * IOW we can't recycle current parameters.
+ */
+static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
+{
+	u64 left, right;
+
+	/*
+	 * left and right are the two sides of the equation above,
+	 * after a bit of shuffling to use multiplications instead
+	 * of divisions.
+	 *
+	 * Note that none of the time values involved in the two
+	 * multiplications are absolute: dl_deadline and dl_runtime
+	 * are the relative deadline and the maximum runtime of each
+	 * instance, runtime is the runtime left for the last instance
+	 * and (deadline - t), since t is rq->clock, is the time left
+	 * to the (absolute) deadline. Even if overflowing the u64 type
+	 * is very unlikely to occur in both cases, here we scale down
+	 * as we want to avoid that risk at all. Scaling down by 10
+	 * means that we reduce granularity to 1us. We are fine with it,
+	 * since this is only a true/false check and, anyway, thinking
+	 * of anything below microseconds resolution is actually fiction
+	 * (but still we want to give the user that illusion >;).
+	 */
+	left = (dl_se->dl_deadline >> 10) * (dl_se->runtime >> 10);
+	right = ((dl_se->deadline - t) >> 10) * (dl_se->dl_runtime >> 10);
+
+	return dl_time_before(right, left);
+}
+
+/*
+ * When a -deadline entity is queued back on the runqueue, its runtime and
+ * deadline might need updating.
+ *
+ * The policy here is that we update the deadline of the entity only if:
+ *  - the current deadline is in the past,
+ *  - using the remaining runtime with the current deadline would make
+ *    the entity exceed its bandwidth.
+ */
+static void update_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+
+	/*
+	 * The arrival of a new instance needs special treatment, i.e.,
+	 * the actual scheduling parameters have to be "renewed".
+	 */
+	if (dl_se->dl_new) {
+		setup_new_dl_entity(dl_se);
+		return;
+	}
+
+	if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
+	    dl_entity_overflow(dl_se, rq_clock(rq))) {
+		dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
+		dl_se->runtime = dl_se->dl_runtime;
+	}
+}
+
+/*
+ * If the entity depleted all its runtime, and if we want it to sleep
+ * while waiting for some new execution time to become available, we
+ * set the bandwidth enforcement timer to the replenishment instant
+ * and try to activate it.
+ *
+ * Notice that it is important for the caller to know if the timer
+ * actually started or not (i.e., the replenishment instant is in
+ * the future or in the past).
+ */
+static int start_dl_timer(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+	ktime_t now, act;
+	ktime_t soft, hard;
+	unsigned long range;
+	s64 delta;
+
+	/*
+	 * We want the timer to fire at the deadline, but considering
+	 * that it is actually coming from rq->clock and not from
+	 * hrtimer's time base reading.
+	 */
+	act = ns_to_ktime(dl_se->deadline);
+	now = hrtimer_cb_get_time(&dl_se->dl_timer);
+	delta = ktime_to_ns(now) - rq_clock(rq);
+	act = ktime_add_ns(act, delta);
+
+	/*
+	 * If the expiry time already passed, e.g., because the value
+	 * chosen as the deadline is too small, don't even try to
+	 * start the timer in the past!
+	 */
+	if (ktime_us_delta(act, now) < 0)
+		return 0;
+
+	hrtimer_set_expires(&dl_se->dl_timer, act);
+
+	soft = hrtimer_get_softexpires(&dl_se->dl_timer);
+	hard = hrtimer_get_expires(&dl_se->dl_timer);
+	range = ktime_to_ns(ktime_sub(hard, soft));
+	__hrtimer_start_range_ns(&dl_se->dl_timer, soft,
+				 range, HRTIMER_MODE_ABS, 0);
+
+	return hrtimer_active(&dl_se->dl_timer);
+}
+
+/*
+ * This is the bandwidth enforcement timer callback. If here, we know
+ * a task is not on its dl_rq, since the fact that the timer was running
+ * means the task is throttled and needs a runtime replenishment.
+ *
+ * However, what we actually do depends on the fact the task is active,
+ * (it is on its rq) or has been removed from there by a call to
+ * dequeue_task_dl(). In the former case we must issue the runtime
+ * replenishment and add the task back to the dl_rq; in the latter, we just
+ * do nothing but clearing dl_throttled, so that runtime and deadline
+ * updating (and the queueing back to dl_rq) will be done by the
+ * next call to enqueue_task_dl().
+ */
+static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
+{
+	struct sched_dl_entity *dl_se = container_of(timer,
+						     struct sched_dl_entity,
+						     dl_timer);
+	struct task_struct *p = dl_task_of(dl_se);
+	struct rq *rq = task_rq(p);
+	raw_spin_lock(&rq->lock);
+
+	/*
+	 * We need to take care of a possible races here. In fact, the
+	 * task might have changed its scheduling policy to something
+	 * different from SCHED_DEADLINE or changed its reservation
+	 * parameters (through sched_setscheduler()).
+	 */
+	if (!dl_task(p) || dl_se->dl_new)
+		goto unlock;
+
+	dl_se->dl_throttled = 0;
+	if (p->on_rq) {
+		enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
+		if (task_has_dl_policy(rq->curr))
+			check_preempt_curr_dl(rq, p, 0);
+		else
+			resched_task(rq->curr);
+	}
+unlock:
+	raw_spin_unlock(&rq->lock);
+
+	return HRTIMER_NORESTART;
+}
+
+void init_dl_task_timer(struct sched_dl_entity *dl_se)
+{
+	struct hrtimer *timer = &dl_se->dl_timer;
+
+	if (hrtimer_active(timer)) {
+		hrtimer_try_to_cancel(timer);
+		return;
+	}
+
+	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	timer->function = dl_task_timer;
+}
+
+static
+int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
+{
+	int dmiss = dl_time_before(dl_se->deadline, rq_clock(rq));
+	int rorun = dl_se->runtime <= 0;
+
+	if (!rorun && !dmiss)
+		return 0;
+
+	/*
+	 * If we are beyond our current deadline and we are still
+	 * executing, then we have already used some of the runtime of
+	 * the next instance. Thus, if we do not account that, we are
+	 * stealing bandwidth from the system at each deadline miss!
+	 */
+	if (dmiss) {
+		dl_se->runtime = rorun ? dl_se->runtime : 0;
+		dl_se->runtime -= rq_clock(rq) - dl_se->deadline;
+	}
+
+	return 1;
+}
+
+/*
+ * Update the current task's runtime statistics (provided it is still
+ * a -deadline task and has not been removed from the dl_rq).
+ */
+static void update_curr_dl(struct rq *rq)
+{
+	struct task_struct *curr = rq->curr;
+	struct sched_dl_entity *dl_se = &curr->dl;
+	u64 delta_exec;
+
+	if (!dl_task(curr) || !on_dl_rq(dl_se))
+		return;
+
+	/*
+	 * Consumed budget is computed considering the time as
+	 * observed by schedulable tasks (excluding time spent
+	 * in hardirq context, etc.). Deadlines are instead
+	 * computed using hard walltime. This seems to be the more
+	 * natural solution, but the full ramifications of this
+	 * approach need further study.
+	 */
+	delta_exec = rq_clock_task(rq) - curr->se.exec_start;
+	if (unlikely((s64)delta_exec < 0))
+		delta_exec = 0;
+
+	schedstat_set(curr->se.statistics.exec_max,
+		      max(curr->se.statistics.exec_max, delta_exec));
+
+	curr->se.sum_exec_runtime += delta_exec;
+	account_group_exec_runtime(curr, delta_exec);
+
+	curr->se.exec_start = rq_clock_task(rq);
+	cpuacct_charge(curr, delta_exec);
+
+	dl_se->runtime -= delta_exec;
+	if (dl_runtime_exceeded(rq, dl_se)) {
+		__dequeue_task_dl(rq, curr, 0);
+		if (likely(start_dl_timer(dl_se)))
+			dl_se->dl_throttled = 1;
+		else
+			enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
+
+		if (!is_leftmost(curr, &rq->dl))
+			resched_task(curr);
+	}
+}
+
+static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rb_node **link = &dl_rq->rb_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct sched_dl_entity *entry;
+	int leftmost = 1;
+
+	BUG_ON(!RB_EMPTY_NODE(&dl_se->rb_node));
+
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct sched_dl_entity, rb_node);
+		if (dl_time_before(dl_se->deadline, entry->deadline))
+			link = &parent->rb_left;
+		else {
+			link = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	if (leftmost)
+		dl_rq->rb_leftmost = &dl_se->rb_node;
+
+	rb_link_node(&dl_se->rb_node, parent, link);
+	rb_insert_color(&dl_se->rb_node, &dl_rq->rb_root);
+
+	dl_rq->dl_nr_running++;
+}
+
+static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+	if (RB_EMPTY_NODE(&dl_se->rb_node))
+		return;
+
+	if (dl_rq->rb_leftmost == &dl_se->rb_node) {
+		struct rb_node *next_node;
+
+		next_node = rb_next(&dl_se->rb_node);
+		dl_rq->rb_leftmost = next_node;
+	}
+
+	rb_erase(&dl_se->rb_node, &dl_rq->rb_root);
+	RB_CLEAR_NODE(&dl_se->rb_node);
+
+	dl_rq->dl_nr_running--;
+}
+
+static void
+enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
+{
+	BUG_ON(on_dl_rq(dl_se));
+
+	/*
+	 * If this is a wakeup or a new instance, the scheduling
+	 * parameters of the task might need updating. Otherwise,
+	 * we want a replenishment of its runtime.
+	 */
+	if (!dl_se->dl_new && flags & ENQUEUE_REPLENISH)
+		replenish_dl_entity(dl_se);
+	else
+		update_dl_entity(dl_se);
+
+	__enqueue_dl_entity(dl_se);
+}
+
+static void dequeue_dl_entity(struct sched_dl_entity *dl_se)
+{
+	__dequeue_dl_entity(dl_se);
+}
+
+static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
+{
+	/*
+	 * If p is throttled, we do nothing. In fact, if it exhausted
+	 * its budget it needs a replenishment and, since it now is on
+	 * its rq, the bandwidth timer callback (which clearly has not
+	 * run yet) will take care of this.
+	 */
+	if (p->dl.dl_throttled)
+		return;
+
+	enqueue_dl_entity(&p->dl, flags);
+	inc_nr_running(rq);
+}
+
+static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
+{
+	dequeue_dl_entity(&p->dl);
+}
+
+static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
+{
+	update_curr_dl(rq);
+	__dequeue_task_dl(rq, p, flags);
+
+	dec_nr_running(rq);
+}
+
+/*
+ * Yield task semantic for -deadline tasks is:
+ *
+ *   get off from the CPU until our next instance, with
+ *   a new runtime. This is of little use now, since we
+ *   don't have a bandwidth reclaiming mechanism. Anyway,
+ *   bandwidth reclaiming is planned for the future, and
+ *   yield_task_dl will indicate that some spare budget
+ *   is available for other task instances to use it.
+ */
+static void yield_task_dl(struct rq *rq)
+{
+	struct task_struct *p = rq->curr;
+
+	/*
+	 * We make the task go to sleep until its current deadline by
+	 * forcing its runtime to zero. This way, update_curr_dl() stops
+	 * it and the bandwidth timer will wake it up and will give it
+	 * new scheduling parameters (thanks to dl_new=1).
+	 */
+	if (p->dl.runtime > 0) {
+		rq->curr->dl.dl_new = 1;
+		p->dl.runtime = 0;
+	}
+	update_curr_dl(rq);
+}
+
+/*
+ * Only called when both the current and waking task are -deadline
+ * tasks.
+ */
+static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
+				  int flags)
+{
+	if (dl_time_before(p->dl.deadline, rq->curr->dl.deadline))
+		resched_task(rq->curr);
+}
+
+#ifdef CONFIG_SCHED_HRTICK
+static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
+{
+	s64 delta = p->dl.dl_runtime - p->dl.runtime;
+
+	if (delta > 10000)
+		hrtick_start(rq, p->dl.runtime);
+}
+#endif
+
+static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
+						   struct dl_rq *dl_rq)
+{
+	struct rb_node *left = dl_rq->rb_leftmost;
+
+	if (!left)
+		return NULL;
+
+	return rb_entry(left, struct sched_dl_entity, rb_node);
+}
+
+struct task_struct *pick_next_task_dl(struct rq *rq)
+{
+	struct sched_dl_entity *dl_se;
+	struct task_struct *p;
+	struct dl_rq *dl_rq;
+
+	dl_rq = &rq->dl;
+
+	if (unlikely(!dl_rq->dl_nr_running))
+		return NULL;
+
+	dl_se = pick_next_dl_entity(rq, dl_rq);
+	BUG_ON(!dl_se);
+
+	p = dl_task_of(dl_se);
+	p->se.exec_start = rq_clock_task(rq);
+#ifdef CONFIG_SCHED_HRTICK
+	if (hrtick_enabled(rq))
+		start_hrtick_dl(rq, p);
+#endif
+	return p;
+}
+
+static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
+{
+	update_curr_dl(rq);
+}
+
+static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
+{
+	update_curr_dl(rq);
+
+#ifdef CONFIG_SCHED_HRTICK
+	if (hrtick_enabled(rq) && queued && p->dl.runtime > 0)
+		start_hrtick_dl(rq, p);
+#endif
+}
+
+static void task_fork_dl(struct task_struct *p)
+{
+	/*
+	 * SCHED_DEADLINE tasks cannot fork and this is achieved through
+	 * sched_fork()
+	 */
+}
+
+static void task_dead_dl(struct task_struct *p)
+{
+	struct hrtimer *timer = &p->dl.dl_timer;
+
+	if (hrtimer_active(timer))
+		hrtimer_try_to_cancel(timer);
+}
+
+static void set_curr_task_dl(struct rq *rq)
+{
+	struct task_struct *p = rq->curr;
+
+	p->se.exec_start = rq_clock_task(rq);
+}
+
+static void switched_from_dl(struct rq *rq, struct task_struct *p)
+{
+	if (hrtimer_active(&p->dl.dl_timer))
+		hrtimer_try_to_cancel(&p->dl.dl_timer);
+}
+
+static void switched_to_dl(struct rq *rq, struct task_struct *p)
+{
+	/*
+	 * If p is throttled, don't consider the possibility
+	 * of preempting rq->curr, the check will be done right
+	 * after its runtime will get replenished.
+	 */
+	if (unlikely(p->dl.dl_throttled))
+		return;
+
+	if (!p->on_rq || rq->curr != p) {
+		if (task_has_dl_policy(rq->curr))
+			check_preempt_curr_dl(rq, p, 0);
+		else
+			resched_task(rq->curr);
+	}
+}
+
+static void prio_changed_dl(struct rq *rq, struct task_struct *p,
+			    int oldprio)
+{
+	switched_to_dl(rq, p);
+}
+
+#ifdef CONFIG_SMP
+static int
+select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
+{
+	return task_cpu(p);
+}
+#endif
+
+const struct sched_class dl_sched_class = {
+	.next			= &rt_sched_class,
+	.enqueue_task		= enqueue_task_dl,
+	.dequeue_task		= dequeue_task_dl,
+	.yield_task		= yield_task_dl,
+
+	.check_preempt_curr	= check_preempt_curr_dl,
+
+	.pick_next_task		= pick_next_task_dl,
+	.put_prev_task		= put_prev_task_dl,
+
+#ifdef CONFIG_SMP
+	.select_task_rq		= select_task_rq_dl,
+#endif
+
+	.set_curr_task		= set_curr_task_dl,
+	.task_tick		= task_tick_dl,
+	.task_fork              = task_fork_dl,
+	.task_dead		= task_dead_dl,
+
+	.prio_changed           = prio_changed_dl,
+	.switched_from		= switched_from_dl,
+	.switched_to		= switched_to_dl,
+};
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 64eda5c..ba97476 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2,6 +2,7 @@
 #include <linux/sched.h>
 #include <linux/sched/sysctl.h>
 #include <linux/sched/rt.h>
+#include <linux/sched/deadline.h>
 #include <linux/mutex.h>
 #include <linux/spinlock.h>
 #include <linux/stop_machine.h>
@@ -87,11 +88,23 @@ static inline int rt_policy(int policy)
 	return 0;
 }
 
+static inline int dl_policy(int policy)
+{
+	if (unlikely(policy == SCHED_DEADLINE))
+		return 1;
+	return 0;
+}
+
 static inline int task_has_rt_policy(struct task_struct *p)
 {
 	return rt_policy(p->policy);
 }
 
+static inline int task_has_dl_policy(struct task_struct *p)
+{
+	return dl_policy(p->policy);
+}
+
 /*
  * This is the priority-queue data structure of the RT scheduling class:
  */
@@ -363,6 +376,15 @@ struct rt_rq {
 #endif
 };
 
+/* Deadline class' related fields in a runqueue */
+struct dl_rq {
+	/* runqueue is an rbtree, ordered by deadline */
+	struct rb_root rb_root;
+	struct rb_node *rb_leftmost;
+
+	unsigned long dl_nr_running;
+};
+
 #ifdef CONFIG_SMP
 
 /*
@@ -427,6 +449,7 @@ struct rq {
 
 	struct cfs_rq cfs;
 	struct rt_rq rt;
+	struct dl_rq dl;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this cpu: */
@@ -957,6 +980,7 @@ static const u32 prio_to_wmult[40] = {
 #else
 #define ENQUEUE_WAKING		0
 #endif
+#define ENQUEUE_REPLENISH	8
 
 #define DEQUEUE_SLEEP		1
 
@@ -1012,6 +1036,7 @@ struct sched_class {
    for (class = sched_class_highest; class; class = class->next)
 
 extern const struct sched_class stop_sched_class;
+extern const struct sched_class dl_sched_class;
 extern const struct sched_class rt_sched_class;
 extern const struct sched_class fair_sched_class;
 extern const struct sched_class idle_sched_class;
@@ -1047,6 +1072,8 @@ extern void resched_cpu(int cpu);
 extern struct rt_bandwidth def_rt_bandwidth;
 extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
 
+extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
+
 extern void update_idle_cpu_load(struct rq *this_rq);
 
 extern void init_task_runnable_average(struct task_struct *p);
@@ -1305,6 +1332,7 @@ extern void print_rt_stats(struct seq_file *m, int cpu);
 
 extern void init_cfs_rq(struct cfs_rq *cfs_rq);
 extern void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq);
+extern void init_dl_rq(struct dl_rq *rt_rq, struct rq *rq);
 
 extern void account_cfs_bandwidth_used(int enabled, int was_enabled);
 
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index e08fbee..a5cef17 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -103,7 +103,7 @@ get_rr_interval_stop(struct rq *rq, struct task_struct *task)
  * Simple, special scheduling class for the per-CPU stop tasks:
  */
 const struct sched_class stop_sched_class = {
-	.next			= &rt_sched_class,
+	.next			= &dl_sched_class,
 
 	.enqueue_task		= enqueue_task_stop,
 	.dequeue_task		= dequeue_task_stop,
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH 04/14] sched: SCHED_DEADLINE SMP-related data structures & logic.
  2013-11-07 13:43 [PATCH 00/14] sched: SCHED_DEADLINE v9 Juri Lelli
                   ` (2 preceding siblings ...)
  2013-11-07 13:43 ` [PATCH 03/14] sched: SCHED_DEADLINE structures & implementation Juri Lelli
@ 2013-11-07 13:43 ` Juri Lelli
  2013-11-20 18:51   ` Steven Rostedt
  2014-01-13 15:53   ` [tip:sched/core] sched/deadline: Add " tip-bot for Juri Lelli
  2013-11-07 13:43 ` [PATCH 05/14] sched: SCHED_DEADLINE avg_update accounting Juri Lelli
                   ` (8 subsequent siblings)
  12 siblings, 2 replies; 81+ messages in thread
From: Juri Lelli @ 2013-11-07 13:43 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield

Introduces data structures relevant for implementing dynamic
migration of -deadline tasks and the logic for checking if
runqueues are overloaded with -deadline tasks and for choosing
where a task should migrate, when it is the case.

Adds also dynamic migrations to SCHED_DEADLINE, so that tasks can
be moved among CPUs when necessary. It is also possible to bind a
task to a (set of) CPU(s), thus restricting its capability of
migrating, or forbidding migrations at all.

The very same approach used in sched_rt is utilised:
 - -deadline tasks are kept into CPU-specific runqueues,
 - -deadline tasks are migrated among runqueues to achieve the
   following:
    * on an M-CPU system the M earliest deadline ready tasks
      are always running;
    * affinity/cpusets settings of all the -deadline tasks is
      always respected.

Therefore, this very special form of "load balancing" is done with
an active method, i.e., the scheduler pushes or pulls tasks between
runqueues when they are woken up and/or (de)scheduled.
IOW, every time a preemption occurs, the descheduled task might be sent
to some other CPU (depending on its deadline) to continue executing
(push). On the other hand, every time a CPU becomes idle, it might pull
the second earliest deadline ready task from some other CPU.

To enforce this, a pull operation is always attempted before taking any
scheduling decision (pre_schedule()), as well as a push one after each
scheduling decision (post_schedule()). In addition, when a task arrives
or wakes up, the best CPU where to resume it is selected taking into
account its affinity mask, the system topology, but also its deadline.
E.g., from the scheduling point of view, the best CPU where to wake
up (and also where to push) a task is the one which is running the task
with the latest deadline among the M executing ones.

In order to facilitate these decisions, per-runqueue "caching" of the
deadlines of the currently running and of the first ready task is used.
Queued but not running tasks are also parked in another rb-tree to
speed-up pushes.

Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
---
 include/linux/sched.h   |    1 +
 kernel/sched/core.c     |    9 +-
 kernel/sched/deadline.c |  937 ++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/rt.c       |    2 +-
 kernel/sched/sched.h    |   34 ++
 5 files changed, 966 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index fdf957c..59ea0da 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1174,6 +1174,7 @@ struct task_struct {
 	struct list_head tasks;
 #ifdef CONFIG_SMP
 	struct plist_node pushable_tasks;
+	struct rb_node pushable_dl_tasks;
 #endif
 
 	struct mm_struct *mm, *active_mm;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cfe15bfc..2077aae 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1737,6 +1737,7 @@ int sched_fork(struct task_struct *p)
 #endif
 #ifdef CONFIG_SMP
 	plist_node_init(&p->pushable_tasks, MAX_PRIO);
+	RB_CLEAR_NODE(&p->pushable_dl_tasks);
 #endif
 
 	put_cpu();
@@ -5148,6 +5149,7 @@ static void free_rootdomain(struct rcu_head *rcu)
 	struct root_domain *rd = container_of(rcu, struct root_domain, rcu);
 
 	cpupri_cleanup(&rd->cpupri);
+	free_cpumask_var(rd->dlo_mask);
 	free_cpumask_var(rd->rto_mask);
 	free_cpumask_var(rd->online);
 	free_cpumask_var(rd->span);
@@ -5199,8 +5201,10 @@ static int init_rootdomain(struct root_domain *rd)
 		goto out;
 	if (!alloc_cpumask_var(&rd->online, GFP_KERNEL))
 		goto free_span;
-	if (!alloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
+	if (!alloc_cpumask_var(&rd->dlo_mask, GFP_KERNEL))
 		goto free_online;
+	if (!alloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
+		goto free_dlo_mask;
 
 	if (cpupri_init(&rd->cpupri) != 0)
 		goto free_rto_mask;
@@ -5208,6 +5212,8 @@ static int init_rootdomain(struct root_domain *rd)
 
 free_rto_mask:
 	free_cpumask_var(rd->rto_mask);
+free_dlo_mask:
+	free_cpumask_var(rd->dlo_mask);
 free_online:
 	free_cpumask_var(rd->online);
 free_span:
@@ -6542,6 +6548,7 @@ void __init sched_init_smp(void)
 	free_cpumask_var(non_isolated_cpus);
 
 	init_sched_rt_class();
+	init_sched_dl_class();
 }
 #else
 void __init sched_init_smp(void)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index cb93f2e..18a73b4 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -10,6 +10,7 @@
  * miss some of their deadlines), and won't affect any other task.
  *
  * Copyright (C) 2012 Dario Faggioli <raistlin@linux.it>,
+ *                    Juri Lelli <juri.lelli@gmail.com>,
  *                    Michael Trimarchi <michael@amarulasolutions.com>,
  *                    Fabio Checconi <fchecconi@gmail.com>
  */
@@ -20,6 +21,15 @@ static inline int dl_time_before(u64 a, u64 b)
 	return (s64)(a - b) < 0;
 }
 
+/*
+ * Tells if entity @a should preempt entity @b.
+ */
+static inline
+int dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
+{
+	return dl_time_before(a->deadline, b->deadline);
+}
+
 static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
 {
 	return container_of(dl_se, struct task_struct, dl);
@@ -53,8 +63,168 @@ static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
 void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
 {
 	dl_rq->rb_root = RB_ROOT;
+
+#ifdef CONFIG_SMP
+	/* zero means no -deadline tasks */
+	dl_rq->earliest_dl.curr = dl_rq->earliest_dl.next = 0;
+
+	dl_rq->dl_nr_migratory = 0;
+	dl_rq->overloaded = 0;
+	dl_rq->pushable_dl_tasks_root = RB_ROOT;
+#endif
+}
+
+#ifdef CONFIG_SMP
+
+static inline int dl_overloaded(struct rq *rq)
+{
+	return atomic_read(&rq->rd->dlo_count);
+}
+
+static inline void dl_set_overload(struct rq *rq)
+{
+	if (!rq->online)
+		return;
+
+	cpumask_set_cpu(rq->cpu, rq->rd->dlo_mask);
+	/*
+	 * Must be visible before the overload count is
+	 * set (as in sched_rt.c).
+	 *
+	 * Matched by the barrier in pull_dl_task().
+	 */
+	smp_wmb();
+	atomic_inc(&rq->rd->dlo_count);
+}
+
+static inline void dl_clear_overload(struct rq *rq)
+{
+	if (!rq->online)
+		return;
+
+	atomic_dec(&rq->rd->dlo_count);
+	cpumask_clear_cpu(rq->cpu, rq->rd->dlo_mask);
+}
+
+static void update_dl_migration(struct dl_rq *dl_rq)
+{
+	if (dl_rq->dl_nr_migratory && dl_rq->dl_nr_total > 1) {
+		if (!dl_rq->overloaded) {
+			dl_set_overload(rq_of_dl_rq(dl_rq));
+			dl_rq->overloaded = 1;
+		}
+	} else if (dl_rq->overloaded) {
+		dl_clear_overload(rq_of_dl_rq(dl_rq));
+		dl_rq->overloaded = 0;
+	}
+}
+
+static void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	struct task_struct *p = dl_task_of(dl_se);
+	dl_rq = &rq_of_dl_rq(dl_rq)->dl;
+
+	dl_rq->dl_nr_total++;
+	if (p->nr_cpus_allowed > 1)
+		dl_rq->dl_nr_migratory++;
+
+	update_dl_migration(dl_rq);
+}
+
+static void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	struct task_struct *p = dl_task_of(dl_se);
+	dl_rq = &rq_of_dl_rq(dl_rq)->dl;
+
+	dl_rq->dl_nr_total--;
+	if (p->nr_cpus_allowed > 1)
+		dl_rq->dl_nr_migratory--;
+
+	update_dl_migration(dl_rq);
+}
+
+/*
+ * The list of pushable -deadline task is not a plist, like in
+ * sched_rt.c, it is an rb-tree with tasks ordered by deadline.
+ */
+static void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
+{
+	struct dl_rq *dl_rq = &rq->dl;
+	struct rb_node **link = &dl_rq->pushable_dl_tasks_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct task_struct *entry;
+	int leftmost = 1;
+
+	BUG_ON(!RB_EMPTY_NODE(&p->pushable_dl_tasks));
+
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct task_struct,
+				 pushable_dl_tasks);
+		if (dl_entity_preempt(&p->dl, &entry->dl))
+			link = &parent->rb_left;
+		else {
+			link = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	if (leftmost)
+		dl_rq->pushable_dl_tasks_leftmost = &p->pushable_dl_tasks;
+
+	rb_link_node(&p->pushable_dl_tasks, parent, link);
+	rb_insert_color(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root);
+}
+
+static void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
+{
+	struct dl_rq *dl_rq = &rq->dl;
+
+	if (RB_EMPTY_NODE(&p->pushable_dl_tasks))
+		return;
+
+	if (dl_rq->pushable_dl_tasks_leftmost == &p->pushable_dl_tasks) {
+		struct rb_node *next_node;
+
+		next_node = rb_next(&p->pushable_dl_tasks);
+		dl_rq->pushable_dl_tasks_leftmost = next_node;
+	}
+
+	rb_erase(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root);
+	RB_CLEAR_NODE(&p->pushable_dl_tasks);
+}
+
+static inline int has_pushable_dl_tasks(struct rq *rq)
+{
+	return !RB_EMPTY_ROOT(&rq->dl.pushable_dl_tasks_root);
+}
+
+static int push_dl_task(struct rq *rq);
+
+#else
+
+static inline
+void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
+{
+}
+
+static inline
+void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
+{
+}
+
+static inline
+void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+}
+
+static inline
+void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
 }
 
+#endif /* CONFIG_SMP */
+
 static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
 static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
 static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
@@ -307,6 +477,14 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 			check_preempt_curr_dl(rq, p, 0);
 		else
 			resched_task(rq->curr);
+#ifdef CONFIG_SMP
+		/*
+		 * Queueing this task back might have overloaded rq,
+		 * check if we need to kick someone away.
+		 */
+		if (has_pushable_dl_tasks(rq))
+			push_dl_task(rq);
+#endif
 	}
 unlock:
 	raw_spin_unlock(&rq->lock);
@@ -397,6 +575,100 @@ static void update_curr_dl(struct rq *rq)
 	}
 }
 
+#ifdef CONFIG_SMP
+
+static struct task_struct *pick_next_earliest_dl_task(struct rq *rq, int cpu);
+
+static inline u64 next_deadline(struct rq *rq)
+{
+	struct task_struct *next = pick_next_earliest_dl_task(rq, rq->cpu);
+
+	if (next && dl_prio(next->prio))
+		return next->dl.deadline;
+	else
+		return 0;
+}
+
+static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
+{
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+
+	if (dl_rq->earliest_dl.curr == 0 ||
+	    dl_time_before(deadline, dl_rq->earliest_dl.curr)) {
+		/*
+		 * If the dl_rq had no -deadline tasks, or if the new task
+		 * has shorter deadline than the current one on dl_rq, we
+		 * know that the previous earliest becomes our next earliest,
+		 * as the new task becomes the earliest itself.
+		 */
+		dl_rq->earliest_dl.next = dl_rq->earliest_dl.curr;
+		dl_rq->earliest_dl.curr = deadline;
+	} else if (dl_rq->earliest_dl.next == 0 ||
+		   dl_time_before(deadline, dl_rq->earliest_dl.next)) {
+		/*
+		 * On the other hand, if the new -deadline task has a
+		 * a later deadline than the earliest one on dl_rq, but
+		 * it is earlier than the next (if any), we must
+		 * recompute the next-earliest.
+		 */
+		dl_rq->earliest_dl.next = next_deadline(rq);
+	}
+}
+
+static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
+{
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+
+	/*
+	 * Since we may have removed our earliest (and/or next earliest)
+	 * task we must recompute them.
+	 */
+	if (!dl_rq->dl_nr_running) {
+		dl_rq->earliest_dl.curr = 0;
+		dl_rq->earliest_dl.next = 0;
+	} else {
+		struct rb_node *leftmost = dl_rq->rb_leftmost;
+		struct sched_dl_entity *entry;
+
+		entry = rb_entry(leftmost, struct sched_dl_entity, rb_node);
+		dl_rq->earliest_dl.curr = entry->deadline;
+		dl_rq->earliest_dl.next = next_deadline(rq);
+	}
+}
+
+#else
+
+static inline void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
+static inline void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
+
+#endif /* CONFIG_SMP */
+
+static inline
+void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	int prio = dl_task_of(dl_se)->prio;
+	u64 deadline = dl_se->deadline;
+
+	WARN_ON(!dl_prio(prio));
+	dl_rq->dl_nr_running++;
+
+	inc_dl_deadline(dl_rq, deadline);
+	inc_dl_migration(dl_se, dl_rq);
+}
+
+static inline
+void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	int prio = dl_task_of(dl_se)->prio;
+
+	WARN_ON(!dl_prio(prio));
+	WARN_ON(!dl_rq->dl_nr_running);
+	dl_rq->dl_nr_running--;
+
+	dec_dl_deadline(dl_rq, dl_se->deadline);
+	dec_dl_migration(dl_se, dl_rq);
+}
+
 static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
@@ -424,7 +696,7 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
 	rb_link_node(&dl_se->rb_node, parent, link);
 	rb_insert_color(&dl_se->rb_node, &dl_rq->rb_root);
 
-	dl_rq->dl_nr_running++;
+	inc_dl_tasks(dl_se, dl_rq);
 }
 
 static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
@@ -444,7 +716,7 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
 	rb_erase(&dl_se->rb_node, &dl_rq->rb_root);
 	RB_CLEAR_NODE(&dl_se->rb_node);
 
-	dl_rq->dl_nr_running--;
+	dec_dl_tasks(dl_se, dl_rq);
 }
 
 static void
@@ -482,12 +754,17 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 		return;
 
 	enqueue_dl_entity(&p->dl, flags);
+
+	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
+		enqueue_pushable_dl_task(rq, p);
+
 	inc_nr_running(rq);
 }
 
 static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	dequeue_dl_entity(&p->dl);
+	dequeue_pushable_dl_task(rq, p);
 }
 
 static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
@@ -525,6 +802,77 @@ static void yield_task_dl(struct rq *rq)
 	update_curr_dl(rq);
 }
 
+#ifdef CONFIG_SMP
+
+static int find_later_rq(struct task_struct *task);
+static int latest_cpu_find(struct cpumask *span,
+			   struct task_struct *task,
+			   struct cpumask *later_mask);
+
+static int
+select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
+{
+	struct task_struct *curr;
+	struct rq *rq;
+	int cpu;
+
+	cpu = task_cpu(p);
+
+	if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
+		goto out;
+
+	rq = cpu_rq(cpu);
+
+	rcu_read_lock();
+	curr = ACCESS_ONCE(rq->curr); /* unlocked access */
+
+	/*
+	 * If we are dealing with a -deadline task, we must
+	 * decide where to wake it up.
+	 * If it has a later deadline and the current task
+	 * on this rq can't move (provided the waking task
+	 * can!) we prefer to send it somewhere else. On the
+	 * other hand, if it has a shorter deadline, we
+	 * try to make it stay here, it might be important.
+	 */
+	if (unlikely(dl_task(curr)) &&
+	    (curr->nr_cpus_allowed < 2 ||
+	     !dl_entity_preempt(&p->dl, &curr->dl)) &&
+	    (p->nr_cpus_allowed > 1)) {
+		int target = find_later_rq(p);
+
+		if (target != -1)
+			cpu = target;
+	}
+	rcu_read_unlock();
+
+out:
+	return cpu;
+}
+
+static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
+{
+	/*
+	 * Current can't be migrated, useless to reschedule,
+	 * let's hope p can move out.
+	 */
+	if (rq->curr->nr_cpus_allowed == 1 ||
+	    latest_cpu_find(rq->rd->span, rq->curr, NULL) == -1)
+		return;
+
+	/*
+	 * p is migratable, so let's not schedule it and
+	 * see if it is pushed or pulled somewhere else.
+	 */
+	if (p->nr_cpus_allowed != 1 &&
+	    latest_cpu_find(rq->rd->span, p, NULL) != -1)
+		return;
+
+	resched_task(rq->curr);
+}
+
+#endif /* CONFIG_SMP */
+
 /*
  * Only called when both the current and waking task are -deadline
  * tasks.
@@ -532,8 +880,20 @@ static void yield_task_dl(struct rq *rq)
 static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
 				  int flags)
 {
-	if (dl_time_before(p->dl.deadline, rq->curr->dl.deadline))
+	if (dl_entity_preempt(&p->dl, &rq->curr->dl)) {
 		resched_task(rq->curr);
+		return;
+	}
+
+#ifdef CONFIG_SMP
+	/*
+	 * In the unlikely case current and p have the same deadline
+	 * let us try to decide what's the best thing to do...
+	 */
+	if ((s64)(p->dl.deadline - rq->curr->dl.deadline) == 0 &&
+	    !need_resched())
+		check_preempt_equal_dl(rq, p);
+#endif /* CONFIG_SMP */
 }
 
 #ifdef CONFIG_SCHED_HRTICK
@@ -573,16 +933,29 @@ struct task_struct *pick_next_task_dl(struct rq *rq)
 
 	p = dl_task_of(dl_se);
 	p->se.exec_start = rq_clock_task(rq);
+
+	/* Running task will never be pushed. */
+	if (p)
+		dequeue_pushable_dl_task(rq, p);
+
 #ifdef CONFIG_SCHED_HRTICK
 	if (hrtick_enabled(rq))
 		start_hrtick_dl(rq, p);
 #endif
+
+#ifdef CONFIG_SMP
+	rq->post_schedule = has_pushable_dl_tasks(rq);
+#endif /* CONFIG_SMP */
+
 	return p;
 }
 
 static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
 {
 	update_curr_dl(rq);
+
+	if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
+		enqueue_pushable_dl_task(rq, p);
 }
 
 static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
@@ -616,16 +989,517 @@ static void set_curr_task_dl(struct rq *rq)
 	struct task_struct *p = rq->curr;
 
 	p->se.exec_start = rq_clock_task(rq);
+
+	/* You can't push away the running task */
+	dequeue_pushable_dl_task(rq, p);
+}
+
+#ifdef CONFIG_SMP
+
+/* Only try algorithms three times */
+#define DL_MAX_TRIES 3
+
+static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu)
+{
+	if (!task_running(rq, p) &&
+	    (cpu < 0 || cpumask_test_cpu(cpu, &p->cpus_allowed)) &&
+	    (p->nr_cpus_allowed > 1))
+		return 1;
+
+	return 0;
+}
+
+/* Returns the second earliest -deadline task, NULL otherwise */
+static struct task_struct *pick_next_earliest_dl_task(struct rq *rq, int cpu)
+{
+	struct rb_node *next_node = rq->dl.rb_leftmost;
+	struct sched_dl_entity *dl_se;
+	struct task_struct *p = NULL;
+
+next_node:
+	next_node = rb_next(next_node);
+	if (next_node) {
+		dl_se = rb_entry(next_node, struct sched_dl_entity, rb_node);
+		p = dl_task_of(dl_se);
+
+		if (pick_dl_task(rq, p, cpu))
+			return p;
+
+		goto next_node;
+	}
+
+	return NULL;
+}
+
+static int latest_cpu_find(struct cpumask *span,
+			   struct task_struct *task,
+			   struct cpumask *later_mask)
+{
+	const struct sched_dl_entity *dl_se = &task->dl;
+	int cpu, found = -1, best = 0;
+	u64 max_dl = 0;
+
+	for_each_cpu(cpu, span) {
+		struct rq *rq = cpu_rq(cpu);
+		struct dl_rq *dl_rq = &rq->dl;
+
+		if (cpumask_test_cpu(cpu, &task->cpus_allowed) &&
+		    (!dl_rq->dl_nr_running || dl_time_before(dl_se->deadline,
+		     dl_rq->earliest_dl.curr))) {
+			if (later_mask)
+				cpumask_set_cpu(cpu, later_mask);
+			if (!best && !dl_rq->dl_nr_running) {
+				best = 1;
+				found = cpu;
+			} else if (!best &&
+				   dl_time_before(max_dl,
+						  dl_rq->earliest_dl.curr)) {
+				max_dl = dl_rq->earliest_dl.curr;
+				found = cpu;
+			}
+		} else if (later_mask)
+			cpumask_clear_cpu(cpu, later_mask);
+	}
+
+	return found;
+}
+
+static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
+
+static int find_later_rq(struct task_struct *task)
+{
+	struct sched_domain *sd;
+	struct cpumask *later_mask = __get_cpu_var(local_cpu_mask_dl);
+	int this_cpu = smp_processor_id();
+	int best_cpu, cpu = task_cpu(task);
+
+	/* Make sure the mask is initialized first */
+	if (unlikely(!later_mask))
+		return -1;
+
+	if (task->nr_cpus_allowed == 1)
+		return -1;
+
+	best_cpu = latest_cpu_find(task_rq(task)->rd->span, task, later_mask);
+	if (best_cpu == -1)
+		return -1;
+
+	/*
+	 * If we are here, some target has been found,
+	 * the most suitable of which is cached in best_cpu.
+	 * This is, among the runqueues where the current tasks
+	 * have later deadlines than the task's one, the rq
+	 * with the latest possible one.
+	 *
+	 * Now we check how well this matches with task's
+	 * affinity and system topology.
+	 *
+	 * The last cpu where the task run is our first
+	 * guess, since it is most likely cache-hot there.
+	 */
+	if (cpumask_test_cpu(cpu, later_mask))
+		return cpu;
+	/*
+	 * Check if this_cpu is to be skipped (i.e., it is
+	 * not in the mask) or not.
+	 */
+	if (!cpumask_test_cpu(this_cpu, later_mask))
+		this_cpu = -1;
+
+	rcu_read_lock();
+	for_each_domain(cpu, sd) {
+		if (sd->flags & SD_WAKE_AFFINE) {
+
+			/*
+			 * If possible, preempting this_cpu is
+			 * cheaper than migrating.
+			 */
+			if (this_cpu != -1 &&
+			    cpumask_test_cpu(this_cpu, sched_domain_span(sd))) {
+				rcu_read_unlock();
+				return this_cpu;
+			}
+
+			/*
+			 * Last chance: if best_cpu is valid and is
+			 * in the mask, that becomes our choice.
+			 */
+			if (best_cpu < nr_cpu_ids &&
+			    cpumask_test_cpu(best_cpu, sched_domain_span(sd))) {
+				rcu_read_unlock();
+				return best_cpu;
+			}
+		}
+	}
+	rcu_read_unlock();
+
+	/*
+	 * At this point, all our guesses failed, we just return
+	 * 'something', and let the caller sort the things out.
+	 */
+	if (this_cpu != -1)
+		return this_cpu;
+
+	cpu = cpumask_any(later_mask);
+	if (cpu < nr_cpu_ids)
+		return cpu;
+
+	return -1;
+}
+
+/* Locks the rq it finds */
+static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
+{
+	struct rq *later_rq = NULL;
+	int tries;
+	int cpu;
+
+	for (tries = 0; tries < DL_MAX_TRIES; tries++) {
+		cpu = find_later_rq(task);
+
+		if ((cpu == -1) || (cpu == rq->cpu))
+			break;
+
+		later_rq = cpu_rq(cpu);
+
+		/* Retry if something changed. */
+		if (double_lock_balance(rq, later_rq)) {
+			if (unlikely(task_rq(task) != rq ||
+				     !cpumask_test_cpu(later_rq->cpu,
+				                       &task->cpus_allowed) ||
+				     task_running(rq, task) || !task->on_rq)) {
+				double_unlock_balance(rq, later_rq);
+				later_rq = NULL;
+				break;
+			}
+		}
+
+		/*
+		 * If the rq we found has no -deadline task, or
+		 * its earliest one has a later deadline than our
+		 * task, the rq is a good one.
+		 */
+		if (!later_rq->dl.dl_nr_running ||
+		    dl_time_before(task->dl.deadline,
+				   later_rq->dl.earliest_dl.curr))
+			break;
+
+		/* Otherwise we try again. */
+		double_unlock_balance(rq, later_rq);
+		later_rq = NULL;
+	}
+
+	return later_rq;
 }
 
+static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
+{
+	struct task_struct *p;
+
+	if (!has_pushable_dl_tasks(rq))
+		return NULL;
+
+	p = rb_entry(rq->dl.pushable_dl_tasks_leftmost,
+		     struct task_struct, pushable_dl_tasks);
+
+	BUG_ON(rq->cpu != task_cpu(p));
+	BUG_ON(task_current(rq, p));
+	BUG_ON(p->nr_cpus_allowed <= 1);
+
+	BUG_ON(!p->se.on_rq);
+	BUG_ON(!dl_task(p));
+
+	return p;
+}
+
+/*
+ * See if the non running -deadline tasks on this rq
+ * can be sent to some other CPU where they can preempt
+ * and start executing.
+ */
+static int push_dl_task(struct rq *rq)
+{
+	struct task_struct *next_task;
+	struct rq *later_rq;
+
+	if (!rq->dl.overloaded)
+		return 0;
+
+	next_task = pick_next_pushable_dl_task(rq);
+	if (!next_task)
+		return 0;
+
+retry:
+	if (unlikely(next_task == rq->curr)) {
+		WARN_ON(1);
+		return 0;
+	}
+
+	/*
+	 * If next_task preempts rq->curr, and rq->curr
+	 * can move away, it makes sense to just reschedule
+	 * without going further in pushing next_task.
+	 */
+	if (dl_task(rq->curr) &&
+	    dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) &&
+	    rq->curr->nr_cpus_allowed > 1) {
+		resched_task(rq->curr);
+		return 0;
+	}
+
+	/* We might release rq lock */
+	get_task_struct(next_task);
+
+	/* Will lock the rq it'll find */
+	later_rq = find_lock_later_rq(next_task, rq);
+	if (!later_rq) {
+		struct task_struct *task;
+
+		/*
+		 * We must check all this again, since
+		 * find_lock_later_rq releases rq->lock and it is
+		 * then possible that next_task has migrated.
+		 */
+		task = pick_next_pushable_dl_task(rq);
+		if (task_cpu(next_task) == rq->cpu && task == next_task) {
+			/*
+			 * The task is still there. We don't try
+			 * again, some other cpu will pull it when ready.
+			 */
+			dequeue_pushable_dl_task(rq, next_task);
+			goto out;
+		}
+
+		if (!task)
+			/* No more tasks */
+			goto out;
+
+		put_task_struct(next_task);
+		next_task = task;
+		goto retry;
+	}
+
+	deactivate_task(rq, next_task, 0);
+	set_task_cpu(next_task, later_rq->cpu);
+	activate_task(later_rq, next_task, 0);
+
+	resched_task(later_rq->curr);
+
+	double_unlock_balance(rq, later_rq);
+
+out:
+	put_task_struct(next_task);
+
+	return 1;
+}
+
+static void push_dl_tasks(struct rq *rq)
+{
+	/* Terminates as it moves a -deadline task */
+	while (push_dl_task(rq))
+		;
+}
+
+static int pull_dl_task(struct rq *this_rq)
+{
+	int this_cpu = this_rq->cpu, ret = 0, cpu;
+	struct task_struct *p;
+	struct rq *src_rq;
+	u64 dmin = LONG_MAX;
+
+	if (likely(!dl_overloaded(this_rq)))
+		return 0;
+
+	/*
+	 * Match the barrier from dl_set_overloaded; this guarantees that if we
+	 * see overloaded we must also see the dlo_mask bit.
+	 */
+	smp_rmb();
+
+	for_each_cpu(cpu, this_rq->rd->dlo_mask) {
+		if (this_cpu == cpu)
+			continue;
+
+		src_rq = cpu_rq(cpu);
+
+		/*
+		 * It looks racy, abd it is! However, as in sched_rt.c,
+		 * we are fine with this.
+		 */
+		if (this_rq->dl.dl_nr_running &&
+		    dl_time_before(this_rq->dl.earliest_dl.curr,
+				   src_rq->dl.earliest_dl.next))
+			continue;
+
+		/* Might drop this_rq->lock */
+		double_lock_balance(this_rq, src_rq);
+
+		/*
+		 * If there are no more pullable tasks on the
+		 * rq, we're done with it.
+		 */
+		if (src_rq->dl.dl_nr_running <= 1)
+			goto skip;
+
+		p = pick_next_earliest_dl_task(src_rq, this_cpu);
+
+		/*
+		 * We found a task to be pulled if:
+		 *  - it preempts our current (if there's one),
+		 *  - it will preempt the last one we pulled (if any).
+		 */
+		if (p && dl_time_before(p->dl.deadline, dmin) &&
+		    (!this_rq->dl.dl_nr_running ||
+		     dl_time_before(p->dl.deadline,
+				    this_rq->dl.earliest_dl.curr))) {
+			WARN_ON(p == src_rq->curr);
+			WARN_ON(!p->se.on_rq);
+
+			/*
+			 * Then we pull iff p has actually an earlier
+			 * deadline than the current task of its runqueue.
+			 */
+			if (dl_time_before(p->dl.deadline,
+					   src_rq->curr->dl.deadline))
+				goto skip;
+
+			ret = 1;
+
+			deactivate_task(src_rq, p, 0);
+			set_task_cpu(p, this_cpu);
+			activate_task(this_rq, p, 0);
+			dmin = p->dl.deadline;
+
+			/* Is there any other task even earlier? */
+		}
+skip:
+		double_unlock_balance(this_rq, src_rq);
+	}
+
+	return ret;
+}
+
+static void pre_schedule_dl(struct rq *rq, struct task_struct *prev)
+{
+	/* Try to pull other tasks here */
+	if (dl_task(prev))
+		pull_dl_task(rq);
+}
+
+static void post_schedule_dl(struct rq *rq)
+{
+	push_dl_tasks(rq);
+}
+
+/*
+ * Since the task is not running and a reschedule is not going to happen
+ * anytime soon on its runqueue, we try pushing it away now.
+ */
+static void task_woken_dl(struct rq *rq, struct task_struct *p)
+{
+	if (!task_running(rq, p) &&
+	    !test_tsk_need_resched(rq->curr) &&
+	    has_pushable_dl_tasks(rq) &&
+	    p->nr_cpus_allowed > 1 &&
+	    dl_task(rq->curr) &&
+	    (rq->curr->nr_cpus_allowed < 2 ||
+	     dl_entity_preempt(&rq->curr->dl, &p->dl))) {
+		push_dl_tasks(rq);
+	}
+}
+
+static void set_cpus_allowed_dl(struct task_struct *p,
+				const struct cpumask *new_mask)
+{
+	struct rq *rq;
+	int weight;
+
+	BUG_ON(!dl_task(p));
+
+	/*
+	 * Update only if the task is actually running (i.e.,
+	 * it is on the rq AND it is not throttled).
+	 */
+	if (!on_dl_rq(&p->dl))
+		return;
+
+	weight = cpumask_weight(new_mask);
+
+	/*
+	 * Only update if the process changes its state from whether it
+	 * can migrate or not.
+	 */
+	if ((p->nr_cpus_allowed > 1) == (weight > 1))
+		return;
+
+	rq = task_rq(p);
+
+	/*
+	 * The process used to be able to migrate OR it can now migrate
+	 */
+	if (weight <= 1) {
+		if (!task_current(rq, p))
+			dequeue_pushable_dl_task(rq, p);
+		BUG_ON(!rq->dl.dl_nr_migratory);
+		rq->dl.dl_nr_migratory--;
+	} else {
+		if (!task_current(rq, p))
+			enqueue_pushable_dl_task(rq, p);
+		rq->dl.dl_nr_migratory++;
+	}
+	
+	update_dl_migration(&rq->dl);
+}
+
+/* Assumes rq->lock is held */
+static void rq_online_dl(struct rq *rq)
+{
+	if (rq->dl.overloaded)
+		dl_set_overload(rq);
+}
+
+/* Assumes rq->lock is held */
+static void rq_offline_dl(struct rq *rq)
+{
+	if (rq->dl.overloaded)
+		dl_clear_overload(rq);
+}
+
+void init_sched_dl_class(void)
+{
+	unsigned int i;
+
+	for_each_possible_cpu(i)
+		zalloc_cpumask_var_node(&per_cpu(local_cpu_mask_dl, i),
+					GFP_KERNEL, cpu_to_node(i));
+}
+
+#endif /* CONFIG_SMP */
+
 static void switched_from_dl(struct rq *rq, struct task_struct *p)
 {
-	if (hrtimer_active(&p->dl.dl_timer))
+	if (hrtimer_active(&p->dl.dl_timer) && !dl_policy(p->policy))
 		hrtimer_try_to_cancel(&p->dl.dl_timer);
+
+#ifdef CONFIG_SMP
+	/*
+	 * Since this might be the only -deadline task on the rq,
+	 * this is the right place to try to pull some other one
+	 * from an overloaded cpu, if any.
+	 */
+	if (!rq->dl.dl_nr_running)
+		pull_dl_task(rq);
+#endif
 }
 
+/*
+ * When switching to -deadline, we may overload the rq, then
+ * we try to push someone off, if possible.
+ */
 static void switched_to_dl(struct rq *rq, struct task_struct *p)
 {
+	int check_resched = 1;
+
 	/*
 	 * If p is throttled, don't consider the possibility
 	 * of preempting rq->curr, the check will be done right
@@ -635,26 +1509,53 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
 		return;
 
 	if (!p->on_rq || rq->curr != p) {
-		if (task_has_dl_policy(rq->curr))
+#ifdef CONFIG_SMP
+		if (rq->dl.overloaded && push_dl_task(rq) && rq != task_rq(p))
+			/* Only reschedule if pushing failed */
+			check_resched = 0;
+#endif /* CONFIG_SMP */
+		if (check_resched && task_has_dl_policy(rq->curr))
 			check_preempt_curr_dl(rq, p, 0);
-		else
-			resched_task(rq->curr);
 	}
 }
 
+/*
+ * If the scheduling parameters of a -deadline task changed,
+ * a push or pull operation might be needed.
+ */
 static void prio_changed_dl(struct rq *rq, struct task_struct *p,
 			    int oldprio)
 {
-	switched_to_dl(rq, p);
-}
-
+	if (p->on_rq || rq->curr == p) {
 #ifdef CONFIG_SMP
-static int
-select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
-{
-	return task_cpu(p);
+		/*
+		 * This might be too much, but unfortunately
+		 * we don't have the old deadline value, and
+		 * we can't argue if the task is increasing
+		 * or lowering its prio, so...
+		 */
+		if (!rq->dl.overloaded)
+			pull_dl_task(rq);
+
+		/*
+		 * If we now have a earlier deadline task than p,
+		 * then reschedule, provided p is still on this
+		 * runqueue.
+		 */
+		if (dl_time_before(rq->dl.earliest_dl.curr, p->dl.deadline) &&
+		    rq->curr == p)
+			resched_task(p);
+#else
+		/*
+		 * Again, we don't know if p has a earlier
+		 * or later deadline, so let's blindly set a
+		 * (maybe not needed) rescheduling point.
+		 */
+		resched_task(p);
+#endif /* CONFIG_SMP */
+	} else
+		switched_to_dl(rq, p);
 }
-#endif
 
 const struct sched_class dl_sched_class = {
 	.next			= &rt_sched_class,
@@ -669,6 +1570,12 @@ const struct sched_class dl_sched_class = {
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_dl,
+	.set_cpus_allowed       = set_cpus_allowed_dl,
+	.rq_online              = rq_online_dl,
+	.rq_offline             = rq_offline_dl,
+	.pre_schedule		= pre_schedule_dl,
+	.post_schedule		= post_schedule_dl,
+	.task_woken		= task_woken_dl,
 #endif
 
 	.set_curr_task		= set_curr_task_dl,
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 01970c8..f7c4881 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1720,7 +1720,7 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
 	    !test_tsk_need_resched(rq->curr) &&
 	    has_pushable_tasks(rq) &&
 	    p->nr_cpus_allowed > 1 &&
-	    rt_task(rq->curr) &&
+	    (dl_task(rq->curr) || rt_task(rq->curr)) &&
 	    (rq->curr->nr_cpus_allowed < 2 ||
 	     rq->curr->prio <= p->prio))
 		push_rt_tasks(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ba97476..70d0030 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -383,6 +383,31 @@ struct dl_rq {
 	struct rb_node *rb_leftmost;
 
 	unsigned long dl_nr_running;
+
+#ifdef CONFIG_SMP
+	/*
+	 * Deadline values of the currently executing and the
+	 * earliest ready task on this rq. Caching these facilitates
+	 * the decision wether or not a ready but not running task
+	 * should migrate somewhere else.
+	 */
+	struct {
+		u64 curr;
+		u64 next;
+	} earliest_dl;
+
+	unsigned long dl_nr_migratory;
+	unsigned long dl_nr_total;
+	int overloaded;
+
+	/*
+	 * Tasks on this rq that can be pushed away. They are kept in
+	 * an rb-tree, ordered by tasks' deadlines, with caching
+	 * of the leftmost (earliest deadline) element.
+	 */
+	struct rb_root pushable_dl_tasks_root;
+	struct rb_node *pushable_dl_tasks_leftmost;
+#endif
 };
 
 #ifdef CONFIG_SMP
@@ -403,6 +428,13 @@ struct root_domain {
 	cpumask_var_t online;
 
 	/*
+	 * The bit corresponding to a CPU gets set here if such CPU has more
+	 * than one runnable -deadline task (as it is below for RT tasks).
+	 */
+	cpumask_var_t dlo_mask;
+	atomic_t dlo_count;
+
+	/*
 	 * The "RT overload" flag: it gets set if a CPU has more than
 	 * one runnable RT task.
 	 */
@@ -1063,6 +1095,8 @@ static inline void idle_balance(int cpu, struct rq *rq)
 extern void sysrq_sched_debug_show(void);
 extern void sched_init_granularity(void);
 extern void update_max_interval(void);
+
+extern void init_sched_dl_class(void);
 extern void init_sched_rt_class(void);
 extern void init_sched_fair_class(void);
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH 05/14] sched: SCHED_DEADLINE avg_update accounting.
  2013-11-07 13:43 [PATCH 00/14] sched: SCHED_DEADLINE v9 Juri Lelli
                   ` (3 preceding siblings ...)
  2013-11-07 13:43 ` [PATCH 04/14] sched: SCHED_DEADLINE SMP-related data structures & logic Juri Lelli
@ 2013-11-07 13:43 ` Juri Lelli
  2014-01-13 15:53   ` [tip:sched/core] sched/deadline: Add " tip-bot for Dario Faggioli
  2013-11-07 13:43 ` [PATCH 06/14] sched: add period support for -deadline tasks Juri Lelli
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 81+ messages in thread
From: Juri Lelli @ 2013-11-07 13:43 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield

From: Dario Faggioli <raistlin@linux.it>

Make the core scheduler and load balancer aware of the load
produced by -deadline tasks, by updating the moving average
like for sched_rt.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
---
 kernel/sched/deadline.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 18a73b4..2cd5a22 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -562,6 +562,8 @@ static void update_curr_dl(struct rq *rq)
 	curr->se.exec_start = rq_clock_task(rq);
 	cpuacct_charge(curr, delta_exec);
 
+	sched_rt_avg_update(rq, delta_exec);
+
 	dl_se->runtime -= delta_exec;
 	if (dl_runtime_exceeded(rq, dl_se)) {
 		__dequeue_task_dl(rq, curr, 0);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH 06/14] sched: add period support for -deadline tasks.
  2013-11-07 13:43 [PATCH 00/14] sched: SCHED_DEADLINE v9 Juri Lelli
                   ` (4 preceding siblings ...)
  2013-11-07 13:43 ` [PATCH 05/14] sched: SCHED_DEADLINE avg_update accounting Juri Lelli
@ 2013-11-07 13:43 ` Juri Lelli
  2014-01-13 15:53   ` [tip:sched/core] sched/deadline: Add period support for SCHED_DEADLINE tasks tip-bot for Harald Gustafsson
  2013-11-07 13:43 ` [PATCH 07/14] sched: add schedstats for -deadline tasks Juri Lelli
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 81+ messages in thread
From: Juri Lelli @ 2013-11-07 13:43 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield

From: Harald Gustafsson <harald.gustafsson@ericsson.com>

Make it possible to specify a period (different or equal than deadline) for
-deadline tasks. Relative deadlines (D_i) are used on task arrivals to generate
new scheduling (absolute) deadlines as "d = t + D_i", and periods (P_i) to
postpone the scheduling deadlines as "d = d + P_i" when the budget is zero.
This is in general useful to model (and schedule) tasks that have slow
activation rates (long periods), but have to be scheduled soon once activated
(short deadlines).

Signed-off-by: Harald Gustafsson <harald.gustafsson@ericsson.com>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
---
 include/linux/sched.h   |    1 +
 kernel/sched/core.c     |   15 ++++++++++++---
 kernel/sched/deadline.c |   10 +++++++---
 3 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 59ea0da..90517bc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1068,6 +1068,7 @@ struct sched_dl_entity {
 	 */
 	u64 dl_runtime;		/* maximum runtime for each instance	*/
 	u64 dl_deadline;	/* relative deadline of each instance	*/
+	u64 dl_period;		/* separation of two instances (period) */
 
 	/*
 	 * Actual scheduling parameters. Initialized with the values above,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2077aae..7838eca 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1617,6 +1617,7 @@ static void __sched_fork(struct task_struct *p)
 	hrtimer_init(&p->dl.dl_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	p->dl.dl_runtime = p->dl.runtime = 0;
 	p->dl.dl_deadline = p->dl.deadline = 0;
+	p->dl.dl_period = 0;
 	p->dl.flags = 0;
 
 	INIT_LIST_HEAD(&p->rt.run_list);
@@ -3304,6 +3305,10 @@ __setparam_dl(struct task_struct *p, const struct sched_param2 *param2)
 	init_dl_task_timer(dl_se);
 	dl_se->dl_runtime = param2->sched_runtime;
 	dl_se->dl_deadline = param2->sched_deadline;
+	if (param2->sched_period != 0)
+		dl_se->dl_period = param2->sched_period;
+	else
+		dl_se->dl_period = dl_se->dl_deadline;
 	dl_se->flags = param2->sched_flags;
 	dl_se->dl_throttled = 0;
 	dl_se->dl_new = 1;
@@ -3317,19 +3322,23 @@ __getparam_dl(struct task_struct *p, struct sched_param2 *param2)
 	param2->sched_priority = p->rt_priority;
 	param2->sched_runtime = dl_se->dl_runtime;
 	param2->sched_deadline = dl_se->dl_deadline;
+	param2->sched_period = dl_se->dl_period;
 	param2->sched_flags = dl_se->flags;
 }
 
 /*
  * This function validates the new parameters of a -deadline task.
  * We ask for the deadline not being zero, and greater or equal
- * than the runtime.
+ * than the runtime, as well as the period of being zero or
+ * greater than deadline.
  */
 static bool
 __checkparam_dl(const struct sched_param2 *prm)
 {
-	return prm && (&prm->sched_deadline) != 0 &&
-	       (s64)(&prm->sched_deadline - &prm->sched_runtime) >= 0;
+	return prm && prm->sched_deadline != 0 &&
+	       (prm->sched_period == 0 ||
+		(s64)(prm->sched_period - prm->sched_deadline) >= 0) &&
+	       (s64)(prm->sched_deadline - prm->sched_runtime) >= 0;
 }
 
 /*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 2cd5a22..3e0e6e3 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -289,7 +289,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 	 * arbitrary large.
 	 */
 	while (dl_se->runtime <= 0) {
-		dl_se->deadline += dl_se->dl_deadline;
+		dl_se->deadline += dl_se->dl_period;
 		dl_se->runtime += dl_se->dl_runtime;
 	}
 
@@ -329,9 +329,13 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
  *
  * This function returns true if:
  *
- *   runtime / (deadline - t) > dl_runtime / dl_deadline ,
+ *   runtime / (deadline - t) > dl_runtime / dl_period ,
  *
  * IOW we can't recycle current parameters.
+ *
+ * Notice that the bandwidth check is done against the period. For
+ * task with deadline equal to period this is the same of using
+ * dl_deadline instead of dl_period in the equation above.
  */
 static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
 {
@@ -355,7 +359,7 @@ static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
 	 * of anything below microseconds resolution is actually fiction
 	 * (but still we want to give the user that illusion >;).
 	 */
-	left = (dl_se->dl_deadline >> 10) * (dl_se->runtime >> 10);
+	left = (dl_se->dl_period >> 10) * (dl_se->runtime >> 10);
 	right = ((dl_se->deadline - t) >> 10) * (dl_se->dl_runtime >> 10);
 
 	return dl_time_before(right, left);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH 07/14] sched: add schedstats for -deadline tasks.
  2013-11-07 13:43 [PATCH 00/14] sched: SCHED_DEADLINE v9 Juri Lelli
                   ` (5 preceding siblings ...)
  2013-11-07 13:43 ` [PATCH 06/14] sched: add period support for -deadline tasks Juri Lelli
@ 2013-11-07 13:43 ` Juri Lelli
  2013-11-07 13:43 ` [PATCH 08/14] sched: add latency tracing " Juri Lelli
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 81+ messages in thread
From: Juri Lelli @ 2013-11-07 13:43 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield

From: Dario Faggioli <raistlin@linux.it>

Add some typical sched-debug output to dl_rq(s) and some
schedstats to -deadline tasks. This helps spotting problems
with queue and dequeue operations (incorrect ordering) and
gives hints about system status (one can have an idea if the
system is overloaded, if tasks are missing their deadlines
and the entity of such anomalies).

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
---
 include/linux/sched.h   |   13 +++++++++++++
 kernel/sched/deadline.c |   47 +++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/debug.c    |   46 ++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h    |    9 ++++++++-
 4 files changed, 114 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 90517bc..3064319 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1058,6 +1058,15 @@ struct sched_rt_entity {
 #endif
 };
 
+#ifdef CONFIG_SCHEDSTATS
+struct sched_stats_dl {
+	u64			last_dmiss;
+	u64			last_rorun;
+	u64			dmiss_max;
+	u64			rorun_max;
+};
+#endif
+
 struct sched_dl_entity {
 	struct rb_node	rb_node;
 
@@ -1097,6 +1106,10 @@ struct sched_dl_entity {
 	 * own bandwidth to be enforced, thus we need one timer per task.
 	 */
 	struct hrtimer dl_timer;
+
+#ifdef CONFIG_SCHEDSTATS
+	struct sched_stats_dl stats;
+#endif
 };
 
 struct rcu_node;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 3e0e6e3..a4b86b0 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -518,6 +518,27 @@ int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
 	if (!rorun && !dmiss)
 		return 0;
 
+#ifdef CONFIG_SCHEDSTATS
+	/*
+	 * Record statistics about last and maximum deadline
+	 * misses and runtime overruns.
+	 */
+	if (dmiss) {
+		u64 damount = rq_clock(rq) - dl_se->deadline;
+
+		schedstat_set(dl_se->stats.last_dmiss, damount);
+		schedstat_set(dl_se->stats.dmiss_max,
+			      max(dl_se->stats.dmiss_max, damount));
+	}
+	if (rorun) {
+		u64 ramount = -dl_se->runtime;
+
+		schedstat_set(dl_se->stats.last_rorun, ramount);
+		schedstat_set(dl_se->stats.rorun_max,
+			      max(dl_se->stats.rorun_max, ramount));
+	}
+#endif
+
 	/*
 	 * If we are beyond our current deadline and we are still
 	 * executing, then we have already used some of the runtime of
@@ -561,6 +582,7 @@ static void update_curr_dl(struct rq *rq)
 		      max(curr->se.statistics.exec_max, delta_exec));
 
 	curr->se.sum_exec_runtime += delta_exec;
+	schedstat_add(&rq->dl, exec_clock, delta_exec);
 	account_group_exec_runtime(curr, delta_exec);
 
 	curr->se.exec_start = rq_clock_task(rq);
@@ -912,6 +934,18 @@ static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
 }
 #endif
 
+#ifdef CONFIG_SCHED_DEBUG
+struct sched_dl_entity *__pick_dl_last_entity(struct dl_rq *dl_rq)
+{
+	struct rb_node *last = rb_last(&dl_rq->rb_root);
+
+	if (!last)
+		return NULL;
+
+	return rb_entry(last, struct sched_dl_entity, rb_node);
+}
+#endif /* CONFIG_SCHED_DEBUG */
+
 static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
 						   struct dl_rq *dl_rq)
 {
@@ -1593,3 +1627,16 @@ const struct sched_class dl_sched_class = {
 	.switched_from		= switched_from_dl,
 	.switched_to		= switched_to_dl,
 };
+
+#ifdef CONFIG_SCHED_DEBUG
+extern void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq);
+
+void print_dl_stats(struct seq_file *m, int cpu)
+{
+	struct dl_rq *dl_rq = &cpu_rq(cpu)->dl;
+
+	rcu_read_lock();
+	print_dl_rq(m, cpu, dl_rq);
+	rcu_read_unlock();
+}
+#endif /* CONFIG_SCHED_DEBUG */
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1965599..811b025 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -253,6 +253,45 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
 #undef P
 }
 
+extern struct sched_dl_entity *__pick_dl_last_entity(struct dl_rq *dl_rq);
+extern void print_dl_stats(struct seq_file *m, int cpu);
+
+void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq)
+{
+	s64 min_deadline = -1, max_deadline = -1;
+	struct rq *rq = cpu_rq(cpu);
+	struct sched_dl_entity *last;
+	unsigned long flags;
+
+	SEQ_printf(m, "\ndl_rq[%d]:\n", cpu);
+
+	raw_spin_lock_irqsave(&rq->lock, flags);
+	if (dl_rq->rb_leftmost)
+		min_deadline = (rb_entry(dl_rq->rb_leftmost,
+					 struct sched_dl_entity,
+					 rb_node))->deadline;
+	last = __pick_dl_last_entity(dl_rq);
+	if (last)
+		max_deadline = last->deadline;
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
+
+#define P(x) \
+	SEQ_printf(m, "  .%-30s: %Ld\n", #x, (long long)(dl_rq->x))
+#define __PN(x) \
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", #x, SPLIT_NS(x))
+#define PN(x) \
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", #x, SPLIT_NS(dl_rq->x))
+
+	P(dl_nr_running);
+	PN(exec_clock);
+	__PN(min_deadline);
+	__PN(max_deadline);
+
+#undef PN
+#undef __PN
+#undef P
+}
+
 extern __read_mostly int sched_clock_running;
 
 static void print_cpu(struct seq_file *m, int cpu)
@@ -320,6 +359,7 @@ do {									\
 	spin_lock_irqsave(&sched_debug_lock, flags);
 	print_cfs_stats(m, cpu);
 	print_rt_stats(m, cpu);
+	print_dl_stats(m, cpu);
 
 	rcu_read_lock();
 	print_rq(m, rq, cpu);
@@ -540,6 +580,12 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 	P(se.statistics.nr_wakeups_affine_attempts);
 	P(se.statistics.nr_wakeups_passive);
 	P(se.statistics.nr_wakeups_idle);
+	if (dl_task(p)) {
+		PN(dl.stats.last_dmiss);
+		PN(dl.stats.dmiss_max);
+		PN(dl.stats.last_rorun);
+		PN(dl.stats.rorun_max);
+	}
 
 	{
 		u64 avg_atom, avg_per_cpu;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 70d0030..b04aeed 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -384,6 +384,8 @@ struct dl_rq {
 
 	unsigned long dl_nr_running;
 
+	u64 exec_clock;
+
 #ifdef CONFIG_SMP
 	/*
 	 * Deadline values of the currently executing and the
@@ -410,6 +412,11 @@ struct dl_rq {
 #endif
 };
 
+#ifdef CONFIG_SCHED_DEBUG
+struct sched_dl_entity *__pick_dl_last_entity(struct dl_rq *dl_rq);
+void print_dl_stats(struct seq_file *m, int cpu);
+#endif
+
 #ifdef CONFIG_SMP
 
 /*
@@ -1366,7 +1373,7 @@ extern void print_rt_stats(struct seq_file *m, int cpu);
 
 extern void init_cfs_rq(struct cfs_rq *cfs_rq);
 extern void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq);
-extern void init_dl_rq(struct dl_rq *rt_rq, struct rq *rq);
+extern void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq);
 
 extern void account_cfs_bandwidth_used(int enabled, int was_enabled);
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH 08/14] sched: add latency tracing for -deadline tasks.
  2013-11-07 13:43 [PATCH 00/14] sched: SCHED_DEADLINE v9 Juri Lelli
                   ` (6 preceding siblings ...)
  2013-11-07 13:43 ` [PATCH 07/14] sched: add schedstats for -deadline tasks Juri Lelli
@ 2013-11-07 13:43 ` Juri Lelli
  2013-11-20 21:33   ` Steven Rostedt
  2014-01-13 15:54   ` [tip:sched/core] sched/deadline: Add latency tracing for SCHED_DEADLINE tasks tip-bot for Dario Faggioli
  2013-11-07 13:43 ` [PATCH 09/14] rtmutex: turn the plist into an rb-tree Juri Lelli
                   ` (4 subsequent siblings)
  12 siblings, 2 replies; 81+ messages in thread
From: Juri Lelli @ 2013-11-07 13:43 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield

From: Dario Faggioli <raistlin@linux.it>

It is very likely that systems that wants/needs to use the new
SCHED_DEADLINE policy also want to have the scheduling latency of
the -deadline tasks under control.

For this reason a new version of the scheduling wakeup latency,
called "wakeup_dl", is introduced.

As a consequence of applying this patch there will be three wakeup
latency tracer:
 * "wakeup", that deals with all tasks in the system;
 * "wakeup_rt", that deals with -rt and -deadline tasks only;
 * "wakeup_dl", that deals with -deadline tasks only.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
---
 kernel/trace/trace_sched_wakeup.c |   44 +++++++++++++++++++++++++++++++++----
 kernel/trace/trace_selftest.c     |   28 +++++++++++++----------
 2 files changed, 57 insertions(+), 15 deletions(-)

diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
index fee77e1..1457fb1 100644
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -27,6 +27,7 @@ static int			wakeup_cpu;
 static int			wakeup_current_cpu;
 static unsigned			wakeup_prio = -1;
 static int			wakeup_rt;
+static int			wakeup_dl;
 
 static arch_spinlock_t wakeup_lock =
 	(arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
@@ -472,9 +473,17 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
 	tracing_record_cmdline(p);
 	tracing_record_cmdline(current);
 
-	if ((wakeup_rt && !rt_task(p)) ||
-			p->prio >= wakeup_prio ||
-			p->prio >= current->prio)
+	/*
+	 * Semantic is like this:
+	 *  - wakeup tracer handles all tasks in the system, independently
+	 *    from their scheduling class;
+	 *  - wakeup_rt tracer handles tasks belonging to sched_dl and
+	 *    sched_rt class;
+	 *  - wakeup_dl handles tasks belonging to sched_dl class only.
+	 */
+	if ((wakeup_dl && !dl_task(p)) ||
+	    (wakeup_rt && !dl_task(p) && !rt_task(p)) ||
+	    (p->prio >= wakeup_prio || p->prio >= current->prio))
 		return;
 
 	pc = preempt_count();
@@ -486,7 +495,7 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
 	arch_spin_lock(&wakeup_lock);
 
 	/* check for races. */
-	if (!tracer_enabled || p->prio >= wakeup_prio)
+	if (!tracer_enabled || (!dl_task(p) && p->prio >= wakeup_prio))
 		goto out_locked;
 
 	/* reset the trace */
@@ -597,16 +606,25 @@ static int __wakeup_tracer_init(struct trace_array *tr)
 
 static int wakeup_tracer_init(struct trace_array *tr)
 {
+	wakeup_dl = 0;
 	wakeup_rt = 0;
 	return __wakeup_tracer_init(tr);
 }
 
 static int wakeup_rt_tracer_init(struct trace_array *tr)
 {
+	wakeup_dl = 0;
 	wakeup_rt = 1;
 	return __wakeup_tracer_init(tr);
 }
 
+static int wakeup_dl_tracer_init(struct trace_array *tr)
+{
+	wakeup_dl = 1;
+	wakeup_rt = 0;
+	return __wakeup_tracer_init(tr);
+}
+
 static void wakeup_tracer_reset(struct trace_array *tr)
 {
 	int lat_flag = save_flags & TRACE_ITER_LATENCY_FMT;
@@ -674,6 +692,20 @@ static struct tracer wakeup_rt_tracer __read_mostly =
 	.use_max_tr	= true,
 };
 
+static struct tracer wakeup_dl_tracer __read_mostly =
+{
+	.name		= "wakeup_dl",
+	.init		= wakeup_dl_tracer_init,
+	.reset		= wakeup_tracer_reset,
+	.start		= wakeup_tracer_start,
+	.stop		= wakeup_tracer_stop,
+	.wait_pipe	= poll_wait_pipe,
+	.print_max	= 1,
+#ifdef CONFIG_FTRACE_SELFTEST
+	.selftest    = trace_selftest_startup_wakeup,
+#endif
+};
+
 __init static int init_wakeup_tracer(void)
 {
 	int ret;
@@ -686,6 +718,10 @@ __init static int init_wakeup_tracer(void)
 	if (ret)
 		return ret;
 
+	ret = register_tracer(&wakeup_dl_tracer);
+	if (ret)
+		return ret;
+
 	return 0;
 }
 core_initcall(init_wakeup_tracer);
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index a7329b7..f76f8d6 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -1022,11 +1022,17 @@ trace_selftest_startup_nop(struct tracer *trace, struct trace_array *tr)
 #ifdef CONFIG_SCHED_TRACER
 static int trace_wakeup_test_thread(void *data)
 {
-	/* Make this a RT thread, doesn't need to be too high */
-	static const struct sched_param param = { .sched_priority = 5 };
+	/* Make this a -deadline thread */
+	struct sched_param2 paramx = {
+		.sched_priority = 0,
+		.sched_runtime = 100000ULL,
+		.sched_deadline = 10000000ULL,
+		.sched_period = 10000000ULL
+		.sched_flags = 0
+	};
 	struct completion *x = data;
 
-	sched_setscheduler(current, SCHED_FIFO, &param);
+	sched_setscheduler2(current, SCHED_DEADLINE, &paramx);
 
 	/* Make it know we have a new prio */
 	complete(x);
@@ -1040,8 +1046,8 @@ static int trace_wakeup_test_thread(void *data)
 	/* we are awake, now wait to disappear */
 	while (!kthread_should_stop()) {
 		/*
-		 * This is an RT task, do short sleeps to let
-		 * others run.
+		 * This will likely be the system top priority
+		 * task, do short sleeps to let others run.
 		 */
 		msleep(100);
 	}
@@ -1054,21 +1060,21 @@ trace_selftest_startup_wakeup(struct tracer *trace, struct trace_array *tr)
 {
 	unsigned long save_max = tracing_max_latency;
 	struct task_struct *p;
-	struct completion isrt;
+	struct completion is_ready;
 	unsigned long count;
 	int ret;
 
-	init_completion(&isrt);
+	init_completion(&is_ready);
 
-	/* create a high prio thread */
-	p = kthread_run(trace_wakeup_test_thread, &isrt, "ftrace-test");
+	/* create a -deadline thread */
+	p = kthread_run(trace_wakeup_test_thread, &is_ready, "ftrace-test");
 	if (IS_ERR(p)) {
 		printk(KERN_CONT "Failed to create ftrace wakeup test thread ");
 		return -1;
 	}
 
-	/* make sure the thread is running at an RT prio */
-	wait_for_completion(&isrt);
+	/* make sure the thread is running at -deadline policy */
+	wait_for_completion(&is_ready);
 
 	/* start the tracing */
 	ret = tracer_init(trace, tr);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH 09/14] rtmutex: turn the plist into an rb-tree.
  2013-11-07 13:43 [PATCH 00/14] sched: SCHED_DEADLINE v9 Juri Lelli
                   ` (7 preceding siblings ...)
  2013-11-07 13:43 ` [PATCH 08/14] sched: add latency tracing " Juri Lelli
@ 2013-11-07 13:43 ` Juri Lelli
  2013-11-21  3:07   ` Steven Rostedt
                     ` (2 more replies)
  2013-11-07 13:43 ` [PATCH 10/14] sched: drafted deadline inheritance logic Juri Lelli
                   ` (3 subsequent siblings)
  12 siblings, 3 replies; 81+ messages in thread
From: Juri Lelli @ 2013-11-07 13:43 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield

From: Peter Zijlstra <peterz@infradead.org>

Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
and provide a proper comparison function for -deadline and
-priority tasks.

This is done mainly because:
 - classical prio field of the plist is just an int, which might
   not be enough for representing a deadline;
 - manipulating such a list would become O(nr_deadline_tasks),
   which might be to much, as the number of -deadline task increases.

Therefore, an rb-tree is used, and tasks are queued in it according
to the following logic:
 - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
   one with the higher (lower, actually!) prio wins;
 - among a -priority and a -deadline task, the latter always wins;
 - among two -deadline tasks, the one with the earliest deadline
   wins.

Queueing and dequeueing functions are changed accordingly, for both
the list of a task's pi-waiters and the list of tasks blocked on
a pi-lock.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
---
 include/linux/init_task.h |   10 +++
 include/linux/rtmutex.h   |   18 ++----
 include/linux/sched.h     |    4 +-
 kernel/fork.c             |    3 +-
 kernel/futex.c            |    2 +
 kernel/rtmutex-debug.c    |    8 +--
 kernel/rtmutex.c          |  152 ++++++++++++++++++++++++++++++++++++---------
 kernel/rtmutex_common.h   |   22 +++----
 kernel/sched/core.c       |    4 --
 9 files changed, 157 insertions(+), 66 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 5cd0f09..69b97ea 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -11,6 +11,7 @@
 #include <linux/user_namespace.h>
 #include <linux/securebits.h>
 #include <linux/seqlock.h>
+#include <linux/rbtree.h>
 #include <net/net_namespace.h>
 #include <linux/sched/rt.h>
 
@@ -154,6 +155,14 @@ extern struct task_group root_task_group;
 
 #define INIT_TASK_COMM "swapper"
 
+#ifdef CONFIG_RT_MUTEXES
+# define INIT_RT_MUTEXES(tsk)						\
+	.pi_waiters = RB_ROOT,						\
+	.pi_waiters_leftmost = NULL,
+#else
+# define INIT_RT_MUTEXES(tsk)
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -221,6 +230,7 @@ extern struct task_group root_task_group;
 	INIT_TRACE_RECURSION						\
 	INIT_TASK_RCU_PREEMPT(tsk)					\
 	INIT_CPUSET_SEQ							\
+	INIT_RT_MUTEXES(tsk)						\
 	INIT_VTIME(tsk)							\
 }
 
diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index de17134..3aed8d7 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -13,7 +13,7 @@
 #define __LINUX_RT_MUTEX_H
 
 #include <linux/linkage.h>
-#include <linux/plist.h>
+#include <linux/rbtree.h>
 #include <linux/spinlock_types.h>
 
 extern int max_lock_depth; /* for sysctl */
@@ -22,12 +22,14 @@ extern int max_lock_depth; /* for sysctl */
  * The rt_mutex structure
  *
  * @wait_lock:	spinlock to protect the structure
- * @wait_list:	pilist head to enqueue waiters in priority order
+ * @waiters:	rbtree root to enqueue waiters in priority order
+ * @waiters_leftmost: top waiter
  * @owner:	the mutex owner
  */
 struct rt_mutex {
 	raw_spinlock_t		wait_lock;
-	struct plist_head	wait_list;
+	struct rb_root          waiters;
+	struct rb_node          *waiters_leftmost;
 	struct task_struct	*owner;
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 	int			save_state;
@@ -66,7 +68,7 @@ struct hrtimer_sleeper;
 
 #define __RT_MUTEX_INITIALIZER(mutexname) \
 	{ .wait_lock = __RAW_SPIN_LOCK_UNLOCKED(mutexname.wait_lock) \
-	, .wait_list = PLIST_HEAD_INIT(mutexname.wait_list) \
+	, .waiters = RB_ROOT \
 	, .owner = NULL \
 	__DEBUG_RT_MUTEX_INITIALIZER(mutexname)}
 
@@ -98,12 +100,4 @@ extern int rt_mutex_trylock(struct rt_mutex *lock);
 
 extern void rt_mutex_unlock(struct rt_mutex *lock);
 
-#ifdef CONFIG_RT_MUTEXES
-# define INIT_RT_MUTEXES(tsk)						\
-	.pi_waiters	= PLIST_HEAD_INIT(tsk.pi_waiters),	\
-	INIT_RT_MUTEX_DEBUG(tsk)
-#else
-# define INIT_RT_MUTEXES(tsk)
-#endif
-
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3064319..f95bafd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -16,6 +16,7 @@ struct sched_param {
 #include <linux/types.h>
 #include <linux/timex.h>
 #include <linux/jiffies.h>
+#include <linux/plist.h>
 #include <linux/rbtree.h>
 #include <linux/thread_info.h>
 #include <linux/cpumask.h>
@@ -1340,7 +1341,8 @@ struct task_struct {
 
 #ifdef CONFIG_RT_MUTEXES
 	/* PI waiters blocked on a rt_mutex held by this task */
-	struct plist_head pi_waiters;
+	struct rb_root pi_waiters;
+	struct rb_node *pi_waiters_leftmost;
 	/* Deadlock detection and priority inheritance handling */
 	struct rt_mutex_waiter *pi_blocked_on;
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 55fc95f..aa6e18d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1089,7 +1089,8 @@ static void rt_mutex_init_task(struct task_struct *p)
 {
 	raw_spin_lock_init(&p->pi_lock);
 #ifdef CONFIG_RT_MUTEXES
-	plist_head_init(&p->pi_waiters);
+	p->pi_waiters = RB_ROOT;
+	p->pi_waiters_leftmost = NULL;
 	p->pi_blocked_on = NULL;
 #endif
 }
diff --git a/kernel/futex.c b/kernel/futex.c
index c3a1a55..ff7141e 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2315,6 +2315,8 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 	 * code while we sleep on uaddr.
 	 */
 	debug_rt_mutex_init_waiter(&rt_waiter);
+	RB_CLEAR_NODE(&rt_waiter.pi_tree_entry);
+	RB_CLEAR_NODE(&rt_waiter.tree_entry);
 	rt_waiter.task = NULL;
 
 	ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE);
diff --git a/kernel/rtmutex-debug.c b/kernel/rtmutex-debug.c
index 13b243a..49b2ed3 100644
--- a/kernel/rtmutex-debug.c
+++ b/kernel/rtmutex-debug.c
@@ -24,7 +24,7 @@
 #include <linux/kallsyms.h>
 #include <linux/syscalls.h>
 #include <linux/interrupt.h>
-#include <linux/plist.h>
+#include <linux/rbtree.h>
 #include <linux/fs.h>
 #include <linux/debug_locks.h>
 
@@ -57,7 +57,7 @@ static void printk_lock(struct rt_mutex *lock, int print_owner)
 
 void rt_mutex_debug_task_free(struct task_struct *task)
 {
-	DEBUG_LOCKS_WARN_ON(!plist_head_empty(&task->pi_waiters));
+	DEBUG_LOCKS_WARN_ON(!RB_EMPTY_ROOT(&task->pi_waiters));
 	DEBUG_LOCKS_WARN_ON(task->pi_blocked_on);
 }
 
@@ -154,16 +154,12 @@ void debug_rt_mutex_proxy_unlock(struct rt_mutex *lock)
 void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
 {
 	memset(waiter, 0x11, sizeof(*waiter));
-	plist_node_init(&waiter->list_entry, MAX_PRIO);
-	plist_node_init(&waiter->pi_list_entry, MAX_PRIO);
 	waiter->deadlock_task_pid = NULL;
 }
 
 void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter)
 {
 	put_pid(waiter->deadlock_task_pid);
-	DEBUG_LOCKS_WARN_ON(!plist_node_empty(&waiter->list_entry));
-	DEBUG_LOCKS_WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
 	memset(waiter, 0x22, sizeof(*waiter));
 }
 
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 0dd6aec..4ea7eaa 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -91,10 +91,104 @@ static inline void mark_rt_mutex_waiters(struct rt_mutex *lock)
 }
 #endif
 
+static inline int
+rt_mutex_waiter_less(struct rt_mutex_waiter *left,
+		     struct rt_mutex_waiter *right)
+{
+	if (left->task->prio < right->task->prio)
+		return 1;
+
+	/*
+	 * If both tasks are dl_task(), we check their deadlines.
+	 */
+	if (dl_prio(left->task->prio) && dl_prio(right->task->prio))
+		return (left->task->dl.deadline < right->task->dl.deadline);
+
+	return 0;
+}
+
+static void
+rt_mutex_enqueue(struct rt_mutex *lock, struct rt_mutex_waiter *waiter)
+{
+	struct rb_node **link = &lock->waiters.rb_node;
+	struct rb_node *parent = NULL;
+	struct rt_mutex_waiter *entry;
+	int leftmost = 1;
+
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct rt_mutex_waiter, tree_entry);
+		if (rt_mutex_waiter_less(waiter, entry)) {
+			link = &parent->rb_left;
+		} else {
+			link = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	if (leftmost)
+		lock->waiters_leftmost = &waiter->tree_entry;
+
+	rb_link_node(&waiter->tree_entry, parent, link);
+	rb_insert_color(&waiter->tree_entry, &lock->waiters);
+}
+
+static void
+rt_mutex_dequeue(struct rt_mutex *lock, struct rt_mutex_waiter *waiter)
+{
+	if (RB_EMPTY_NODE(&waiter->tree_entry))
+		return;
+
+	if (lock->waiters_leftmost == &waiter->tree_entry)
+		lock->waiters_leftmost = rb_next(&waiter->tree_entry);
+
+	rb_erase(&waiter->tree_entry, &lock->waiters);
+	RB_CLEAR_NODE(&waiter->tree_entry);
+}
+
+static void
+rt_mutex_enqueue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
+{
+	struct rb_node **link = &task->pi_waiters.rb_node;
+	struct rb_node *parent = NULL;
+	struct rt_mutex_waiter *entry;
+	int leftmost = 1;
+
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct rt_mutex_waiter, pi_tree_entry);
+		if (rt_mutex_waiter_less(waiter, entry)) {
+			link = &parent->rb_left;
+		} else {
+			link = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	if (leftmost)
+		task->pi_waiters_leftmost = &waiter->pi_tree_entry;
+
+	rb_link_node(&waiter->pi_tree_entry, parent, link);
+	rb_insert_color(&waiter->pi_tree_entry, &task->pi_waiters);
+}
+
+static void
+rt_mutex_dequeue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
+{
+	if (RB_EMPTY_NODE(&waiter->pi_tree_entry))
+		return;
+
+	if (task->pi_waiters_leftmost == &waiter->pi_tree_entry)
+		task->pi_waiters_leftmost = rb_next(&waiter->pi_tree_entry);
+
+	rb_erase(&waiter->pi_tree_entry, &task->pi_waiters);
+	RB_CLEAR_NODE(&waiter->pi_tree_entry);
+}
+
 /*
- * Calculate task priority from the waiter list priority
+ * Calculate task priority from the waiter tree priority
  *
- * Return task->normal_prio when the waiter list is empty or when
+ * Return task->normal_prio when the waiter tree is empty or when
  * the waiter is not allowed to do priority boosting
  */
 int rt_mutex_getprio(struct task_struct *task)
@@ -102,7 +196,7 @@ int rt_mutex_getprio(struct task_struct *task)
 	if (likely(!task_has_pi_waiters(task)))
 		return task->normal_prio;
 
-	return min(task_top_pi_waiter(task)->pi_list_entry.prio,
+	return min(task_top_pi_waiter(task)->task->prio,
 		   task->normal_prio);
 }
 
@@ -233,7 +327,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 	 * When deadlock detection is off then we check, if further
 	 * priority adjustment is necessary.
 	 */
-	if (!detect_deadlock && waiter->list_entry.prio == task->prio)
+	if (!detect_deadlock && waiter->task->prio == task->prio)
 		goto out_unlock_pi;
 
 	lock = waiter->lock;
@@ -254,9 +348,9 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 	top_waiter = rt_mutex_top_waiter(lock);
 
 	/* Requeue the waiter */
-	plist_del(&waiter->list_entry, &lock->wait_list);
-	waiter->list_entry.prio = task->prio;
-	plist_add(&waiter->list_entry, &lock->wait_list);
+	rt_mutex_dequeue(lock, waiter);
+	waiter->task->prio = task->prio;
+	rt_mutex_enqueue(lock, waiter);
 
 	/* Release the task */
 	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
@@ -280,17 +374,15 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 
 	if (waiter == rt_mutex_top_waiter(lock)) {
 		/* Boost the owner */
-		plist_del(&top_waiter->pi_list_entry, &task->pi_waiters);
-		waiter->pi_list_entry.prio = waiter->list_entry.prio;
-		plist_add(&waiter->pi_list_entry, &task->pi_waiters);
+		rt_mutex_dequeue_pi(task, top_waiter);
+		rt_mutex_enqueue_pi(task, waiter);
 		__rt_mutex_adjust_prio(task);
 
 	} else if (top_waiter == waiter) {
 		/* Deboost the owner */
-		plist_del(&waiter->pi_list_entry, &task->pi_waiters);
+		rt_mutex_dequeue_pi(task, waiter);
 		waiter = rt_mutex_top_waiter(lock);
-		waiter->pi_list_entry.prio = waiter->list_entry.prio;
-		plist_add(&waiter->pi_list_entry, &task->pi_waiters);
+		rt_mutex_enqueue_pi(task, waiter);
 		__rt_mutex_adjust_prio(task);
 	}
 
@@ -355,7 +447,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 	 * 3) it is top waiter
 	 */
 	if (rt_mutex_has_waiters(lock)) {
-		if (task->prio >= rt_mutex_top_waiter(lock)->list_entry.prio) {
+		if (task->prio >= rt_mutex_top_waiter(lock)->task->prio) {
 			if (!waiter || waiter != rt_mutex_top_waiter(lock))
 				return 0;
 		}
@@ -369,7 +461,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 
 		/* remove the queued waiter. */
 		if (waiter) {
-			plist_del(&waiter->list_entry, &lock->wait_list);
+			rt_mutex_dequeue(lock, waiter);
 			task->pi_blocked_on = NULL;
 		}
 
@@ -379,8 +471,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 		 */
 		if (rt_mutex_has_waiters(lock)) {
 			top = rt_mutex_top_waiter(lock);
-			top->pi_list_entry.prio = top->list_entry.prio;
-			plist_add(&top->pi_list_entry, &task->pi_waiters);
+			rt_mutex_enqueue_pi(task, top);
 		}
 		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 	}
@@ -416,13 +507,11 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 	__rt_mutex_adjust_prio(task);
 	waiter->task = task;
 	waiter->lock = lock;
-	plist_node_init(&waiter->list_entry, task->prio);
-	plist_node_init(&waiter->pi_list_entry, task->prio);
-
+	
 	/* Get the top priority waiter on the lock */
 	if (rt_mutex_has_waiters(lock))
 		top_waiter = rt_mutex_top_waiter(lock);
-	plist_add(&waiter->list_entry, &lock->wait_list);
+	rt_mutex_enqueue(lock, waiter);
 
 	task->pi_blocked_on = waiter;
 
@@ -433,8 +522,8 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 
 	if (waiter == rt_mutex_top_waiter(lock)) {
 		raw_spin_lock_irqsave(&owner->pi_lock, flags);
-		plist_del(&top_waiter->pi_list_entry, &owner->pi_waiters);
-		plist_add(&waiter->pi_list_entry, &owner->pi_waiters);
+		rt_mutex_dequeue_pi(owner, top_waiter);
+		rt_mutex_enqueue_pi(owner, waiter);
 
 		__rt_mutex_adjust_prio(owner);
 		if (owner->pi_blocked_on)
@@ -486,7 +575,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock)
 	 * boosted mode and go back to normal after releasing
 	 * lock->wait_lock.
 	 */
-	plist_del(&waiter->pi_list_entry, &current->pi_waiters);
+	rt_mutex_dequeue_pi(current, waiter);
 
 	rt_mutex_set_owner(lock, NULL);
 
@@ -510,7 +599,7 @@ static void remove_waiter(struct rt_mutex *lock,
 	int chain_walk = 0;
 
 	raw_spin_lock_irqsave(&current->pi_lock, flags);
-	plist_del(&waiter->list_entry, &lock->wait_list);
+	rt_mutex_dequeue(lock, waiter);
 	current->pi_blocked_on = NULL;
 	raw_spin_unlock_irqrestore(&current->pi_lock, flags);
 
@@ -521,13 +610,13 @@ static void remove_waiter(struct rt_mutex *lock,
 
 		raw_spin_lock_irqsave(&owner->pi_lock, flags);
 
-		plist_del(&waiter->pi_list_entry, &owner->pi_waiters);
+		rt_mutex_dequeue_pi(owner, waiter);
 
 		if (rt_mutex_has_waiters(lock)) {
 			struct rt_mutex_waiter *next;
 
 			next = rt_mutex_top_waiter(lock);
-			plist_add(&next->pi_list_entry, &owner->pi_waiters);
+			rt_mutex_enqueue_pi(owner, next);
 		}
 		__rt_mutex_adjust_prio(owner);
 
@@ -537,8 +626,6 @@ static void remove_waiter(struct rt_mutex *lock,
 		raw_spin_unlock_irqrestore(&owner->pi_lock, flags);
 	}
 
-	WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
-
 	if (!chain_walk)
 		return;
 
@@ -565,7 +652,7 @@ void rt_mutex_adjust_pi(struct task_struct *task)
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
 
 	waiter = task->pi_blocked_on;
-	if (!waiter || waiter->list_entry.prio == task->prio) {
+	if (!waiter || waiter->task->prio == task->prio) {
 		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 		return;
 	}
@@ -638,6 +725,8 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
 	int ret = 0;
 
 	debug_rt_mutex_init_waiter(&waiter);
+	RB_CLEAR_NODE(&waiter.pi_tree_entry);
+	RB_CLEAR_NODE(&waiter.tree_entry);
 
 	raw_spin_lock(&lock->wait_lock);
 
@@ -904,7 +993,8 @@ void __rt_mutex_init(struct rt_mutex *lock, const char *name)
 {
 	lock->owner = NULL;
 	raw_spin_lock_init(&lock->wait_lock);
-	plist_head_init(&lock->wait_list);
+	lock->waiters = RB_ROOT;
+	lock->waiters_leftmost = NULL;
 
 	debug_rt_mutex_init(lock, name);
 }
diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
index 53a66c8..b65442f 100644
--- a/kernel/rtmutex_common.h
+++ b/kernel/rtmutex_common.h
@@ -40,13 +40,13 @@ extern void schedule_rt_mutex_test(struct rt_mutex *lock);
  * This is the control structure for tasks blocked on a rt_mutex,
  * which is allocated on the kernel stack on of the blocked task.
  *
- * @list_entry:		pi node to enqueue into the mutex waiters list
- * @pi_list_entry:	pi node to enqueue into the mutex owner waiters list
+ * @tree_entry:		pi node to enqueue into the mutex waiters tree
+ * @pi_tree_entry:	pi node to enqueue into the mutex owner waiters tree
  * @task:		task reference to the blocked task
  */
 struct rt_mutex_waiter {
-	struct plist_node	list_entry;
-	struct plist_node	pi_list_entry;
+	struct rb_node          tree_entry;
+	struct rb_node          pi_tree_entry;
 	struct task_struct	*task;
 	struct rt_mutex		*lock;
 #ifdef CONFIG_DEBUG_RT_MUTEXES
@@ -57,11 +57,11 @@ struct rt_mutex_waiter {
 };
 
 /*
- * Various helpers to access the waiters-plist:
+ * Various helpers to access the waiters-tree:
  */
 static inline int rt_mutex_has_waiters(struct rt_mutex *lock)
 {
-	return !plist_head_empty(&lock->wait_list);
+	return !RB_EMPTY_ROOT(&lock->waiters);
 }
 
 static inline struct rt_mutex_waiter *
@@ -69,8 +69,8 @@ rt_mutex_top_waiter(struct rt_mutex *lock)
 {
 	struct rt_mutex_waiter *w;
 
-	w = plist_first_entry(&lock->wait_list, struct rt_mutex_waiter,
-			       list_entry);
+	w = rb_entry(lock->waiters_leftmost, struct rt_mutex_waiter,
+		     tree_entry);
 	BUG_ON(w->lock != lock);
 
 	return w;
@@ -78,14 +78,14 @@ rt_mutex_top_waiter(struct rt_mutex *lock)
 
 static inline int task_has_pi_waiters(struct task_struct *p)
 {
-	return !plist_head_empty(&p->pi_waiters);
+	return !RB_EMPTY_ROOT(&p->pi_waiters);
 }
 
 static inline struct rt_mutex_waiter *
 task_top_pi_waiter(struct task_struct *p)
 {
-	return plist_first_entry(&p->pi_waiters, struct rt_mutex_waiter,
-				  pi_list_entry);
+	return rb_entry(p->pi_waiters_leftmost, struct rt_mutex_waiter,
+			pi_tree_entry);
 }
 
 /*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7838eca..6a3ab87 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6728,10 +6728,6 @@ void __init sched_init(void)
 	INIT_HLIST_HEAD(&init_task.preempt_notifiers);
 #endif
 
-#ifdef CONFIG_RT_MUTEXES
-	plist_head_init(&init_task.pi_waiters);
-#endif
-
 	/*
 	 * The boot idle thread does lazy MMU switching as well:
 	 */
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH 10/14] sched: drafted deadline inheritance logic.
  2013-11-07 13:43 [PATCH 00/14] sched: SCHED_DEADLINE v9 Juri Lelli
                   ` (8 preceding siblings ...)
  2013-11-07 13:43 ` [PATCH 09/14] rtmutex: turn the plist into an rb-tree Juri Lelli
@ 2013-11-07 13:43 ` Juri Lelli
  2014-01-13 15:54   ` [tip:sched/core] sched/deadline: Add SCHED_DEADLINE " tip-bot for Dario Faggioli
  2013-11-07 13:43 ` [PATCH 11/14] sched: add bandwidth management for sched_dl Juri Lelli
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 81+ messages in thread
From: Juri Lelli @ 2013-11-07 13:43 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield

From: Dario Faggioli <raistlin@linux.it>

Some method to deal with rt-mutexes and make sched_dl interact with
the current PI-coded is needed, raising all but trivial issues, that
needs (according to us) to be solved with some restructuring of
the pi-code (i.e., going toward a proxy execution-ish implementation).

This is under development, in the meanwhile, as a temporary solution,
what this commits does is:
 - ensure a pi-lock owner with waiters is never throttled down. Instead,
   when it runs out of runtime, it immediately gets replenished and it's
   deadline is postponed;
 - the scheduling parameters (relative deadline and default runtime)
   used for that replenishments --during the whole period it holds the
   pi-lock-- are the ones of the waiting task with earliest deadline.

Acting this way, we provide some kind of boosting to the lock-owner,
still by using the existing (actually, slightly modified by the previous
commit) pi-architecture.

We would stress the fact that this is only a surely needed, all but
clean solution to the problem. In the end it's only a way to re-start
discussion within the community. So, as always, comments, ideas, rants,
etc.. are welcome! :-)

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
---
 include/linux/sched.h             |    8 +++-
 include/linux/sched/rt.h          |    1 +
 kernel/fork.c                     |    1 +
 kernel/rtmutex.c                  |   14 +++++-
 kernel/sched/core.c               |   36 ++++++++++++---
 kernel/sched/deadline.c           |   91 +++++++++++++++++++++----------------
 kernel/sched/sched.h              |   14 ++++++
 kernel/trace/trace_sched_wakeup.c |    1 +
 8 files changed, 119 insertions(+), 47 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f95bafd..9dad4f3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1099,8 +1099,12 @@ struct sched_dl_entity {
 	 * @dl_new tells if a new instance arrived. If so we must
 	 * start executing it with full runtime and reset its absolute
 	 * deadline;
+	 *
+	 * @dl_boosted tells if we are boosted due to DI. If so we are
+	 * outside bandwidth enforcement mechanism (but only until we
+	 * exit the critical section).
 	 */
-	int dl_throttled, dl_new;
+	int dl_throttled, dl_new, dl_boosted;
 
 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
@@ -1345,6 +1349,8 @@ struct task_struct {
 	struct rb_node *pi_waiters_leftmost;
 	/* Deadlock detection and priority inheritance handling */
 	struct rt_mutex_waiter *pi_blocked_on;
+	/* Top pi_waiters task */
+	struct task_struct *pi_top_task;
 #endif
 
 #ifdef CONFIG_DEBUG_MUTEXES
diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
index a157797..1fa88a2c 100644
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -35,6 +35,7 @@ static inline int rt_task(struct task_struct *p)
 #ifdef CONFIG_RT_MUTEXES
 extern int rt_mutex_getprio(struct task_struct *p);
 extern void rt_mutex_setprio(struct task_struct *p, int prio);
+extern struct task_struct *rt_mutex_get_top_task(struct task_struct *task);
 extern void rt_mutex_adjust_pi(struct task_struct *p);
 static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
 {
diff --git a/kernel/fork.c b/kernel/fork.c
index aa6e18d..7247105 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1092,6 +1092,7 @@ static void rt_mutex_init_task(struct task_struct *p)
 	p->pi_waiters = RB_ROOT;
 	p->pi_waiters_leftmost = NULL;
 	p->pi_blocked_on = NULL;
+	p->pi_top_task = NULL;
 #endif
 }
 
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 4ea7eaa..f7f667e 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -14,6 +14,7 @@
 #include <linux/export.h>
 #include <linux/sched.h>
 #include <linux/sched/rt.h>
+#include <linux/sched/deadline.h>
 #include <linux/timer.h>
 
 #include "rtmutex_common.h"
@@ -200,6 +201,14 @@ int rt_mutex_getprio(struct task_struct *task)
 		   task->normal_prio);
 }
 
+struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
+{
+	if (likely(!task_has_pi_waiters(task)))
+		return NULL;
+
+	return task_top_pi_waiter(task)->task;
+}
+
 /*
  * Adjust the priority of a task, after its pi_waiters got modified.
  *
@@ -209,7 +218,7 @@ static void __rt_mutex_adjust_prio(struct task_struct *task)
 {
 	int prio = rt_mutex_getprio(task);
 
-	if (task->prio != prio)
+	if (task->prio != prio || dl_prio(prio))
 		rt_mutex_setprio(task, prio);
 }
 
@@ -652,7 +661,8 @@ void rt_mutex_adjust_pi(struct task_struct *task)
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
 
 	waiter = task->pi_blocked_on;
-	if (!waiter || waiter->task->prio == task->prio) {
+	if (!waiter || (waiter->task->prio == task->prio &&
+			!dl_prio(task->prio))) {
 		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 		return;
 	}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6a3ab87..d9dca90 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -951,7 +951,7 @@ static inline void check_class_changed(struct rq *rq, struct task_struct *p,
 		if (prev_class->switched_from)
 			prev_class->switched_from(rq, p);
 		p->sched_class->switched_to(rq, p);
-	} else if (oldprio != p->prio)
+	} else if (oldprio != p->prio || dl_task(p))
 		p->sched_class->prio_changed(rq, p, oldprio);
 }
 
@@ -3042,7 +3042,7 @@ EXPORT_SYMBOL(sleep_on_timeout);
  */
 void rt_mutex_setprio(struct task_struct *p, int prio)
 {
-	int oldprio, on_rq, running;
+	int oldprio, on_rq, running, enqueue_flag = 0;
 	struct rq *rq;
 	const struct sched_class *prev_class;
 
@@ -3069,6 +3069,7 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 	}
 
 	trace_sched_pi_setprio(p, prio);
+	p->pi_top_task = rt_mutex_get_top_task(p);
 	oldprio = p->prio;
 	prev_class = p->sched_class;
 	on_rq = p->on_rq;
@@ -3078,19 +3079,42 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 	if (running)
 		p->sched_class->put_prev_task(rq, p);
 
-	if (dl_prio(prio))
+	/*
+	 * Boosting condition are:
+	 * 1. -rt task is running and holds mutex A
+	 *      --> -dl task blocks on mutex A
+	 *
+	 * 2. -dl task is running and holds mutex A
+	 *      --> -dl task blocks on mutex A and could preempt the
+	 *          running task
+	 */
+	if (dl_prio(prio)) {
+		if (!dl_prio(p->normal_prio) || (p->pi_top_task &&
+			dl_entity_preempt(&p->pi_top_task->dl, &p->dl))) {
+			p->dl.dl_boosted = 1;
+			p->dl.dl_throttled = 0;
+			enqueue_flag = ENQUEUE_REPLENISH;
+		} else
+			p->dl.dl_boosted = 0;
 		p->sched_class = &dl_sched_class;
-	else if (rt_prio(prio))
+	} else if (rt_prio(prio)) {
+		if (dl_prio(oldprio))
+			p->dl.dl_boosted = 0;
+		if (oldprio < prio)
+			enqueue_flag = ENQUEUE_HEAD;
 		p->sched_class = &rt_sched_class;
-	else
+	} else {
+		if (dl_prio(oldprio))
+			p->dl.dl_boosted = 0;
 		p->sched_class = &fair_sched_class;
+	}
 
 	p->prio = prio;
 
 	if (running)
 		p->sched_class->set_curr_task(rq);
 	if (on_rq)
-		enqueue_task(rq, p, oldprio < prio ? ENQUEUE_HEAD : 0);
+		enqueue_task(rq, p, enqueue_flag);
 
 	check_class_changed(rq, p, prev_class, oldprio);
 out_unlock:
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index a4b86b0..3b251b0 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -16,20 +16,6 @@
  */
 #include "sched.h"
 
-static inline int dl_time_before(u64 a, u64 b)
-{
-	return (s64)(a - b) < 0;
-}
-
-/*
- * Tells if entity @a should preempt entity @b.
- */
-static inline
-int dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
-{
-	return dl_time_before(a->deadline, b->deadline);
-}
-
 static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
 {
 	return container_of(dl_se, struct task_struct, dl);
@@ -242,7 +228,8 @@ static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
  * one, and to (try to!) reconcile itself with its own scheduling
  * parameters.
  */
-static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
+static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se,
+				       struct sched_dl_entity *pi_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
@@ -254,8 +241,8 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
 	 * future; in fact, we must consider execution overheads (time
 	 * spent on hardirq context, etc.).
 	 */
-	dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
-	dl_se->runtime = dl_se->dl_runtime;
+	dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
+	dl_se->runtime = pi_se->dl_runtime;
 	dl_se->dl_new = 0;
 }
 
@@ -277,11 +264,23 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
  * could happen are, typically, a entity voluntarily trying to overcome its
  * runtime, or it just underestimated it during sched_setscheduler_ex().
  */
-static void replenish_dl_entity(struct sched_dl_entity *dl_se)
+static void replenish_dl_entity(struct sched_dl_entity *dl_se,
+				struct sched_dl_entity *pi_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
 
+	BUG_ON(pi_se->dl_runtime <= 0);
+
+	/*
+	 * This could be the case for a !-dl task that is boosted.
+	 * Just go with full inherited parameters.
+	 */
+	if (dl_se->dl_deadline == 0) {
+		dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
+		dl_se->runtime = pi_se->dl_runtime;
+	}
+
 	/*
 	 * We keep moving the deadline away until we get some
 	 * available runtime for the entity. This ensures correct
@@ -289,8 +288,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 	 * arbitrary large.
 	 */
 	while (dl_se->runtime <= 0) {
-		dl_se->deadline += dl_se->dl_period;
-		dl_se->runtime += dl_se->dl_runtime;
+		dl_se->deadline += pi_se->dl_period;
+		dl_se->runtime += pi_se->dl_runtime;
 	}
 
 	/*
@@ -309,8 +308,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 			lag_once = true;
 			printk_sched("sched: DL replenish lagged to much\n");
 		}
-		dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
-		dl_se->runtime = dl_se->dl_runtime;
+		dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
+		dl_se->runtime = pi_se->dl_runtime;
 	}
 }
 
@@ -337,7 +336,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
  * task with deadline equal to period this is the same of using
  * dl_deadline instead of dl_period in the equation above.
  */
-static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
+static bool dl_entity_overflow(struct sched_dl_entity *dl_se,
+			       struct sched_dl_entity *pi_se, u64 t)
 {
 	u64 left, right;
 
@@ -359,8 +359,8 @@ static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
 	 * of anything below microseconds resolution is actually fiction
 	 * (but still we want to give the user that illusion >;).
 	 */
-	left = (dl_se->dl_period >> 10) * (dl_se->runtime >> 10);
-	right = ((dl_se->deadline - t) >> 10) * (dl_se->dl_runtime >> 10);
+	left = (pi_se->dl_period >> 10) * (dl_se->runtime >> 10);
+	right = ((dl_se->deadline - t) >> 10) * (pi_se->dl_runtime >> 10);
 
 	return dl_time_before(right, left);
 }
@@ -374,7 +374,8 @@ static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
  *  - using the remaining runtime with the current deadline would make
  *    the entity exceed its bandwidth.
  */
-static void update_dl_entity(struct sched_dl_entity *dl_se)
+static void update_dl_entity(struct sched_dl_entity *dl_se,
+			     struct sched_dl_entity *pi_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
@@ -384,14 +385,14 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
 	 * the actual scheduling parameters have to be "renewed".
 	 */
 	if (dl_se->dl_new) {
-		setup_new_dl_entity(dl_se);
+		setup_new_dl_entity(dl_se, pi_se);
 		return;
 	}
 
 	if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
-	    dl_entity_overflow(dl_se, rq_clock(rq))) {
-		dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
-		dl_se->runtime = dl_se->dl_runtime;
+	    dl_entity_overflow(dl_se, pi_se, rq_clock(rq))) {
+		dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
+		dl_se->runtime = pi_se->dl_runtime;
 	}
 }
 
@@ -405,7 +406,7 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
  * actually started or not (i.e., the replenishment instant is in
  * the future or in the past).
  */
-static int start_dl_timer(struct sched_dl_entity *dl_se)
+static int start_dl_timer(struct sched_dl_entity *dl_se, bool boosted)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
@@ -414,6 +415,8 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
 	unsigned long range;
 	s64 delta;
 
+	if (boosted)
+		return 0;
 	/*
 	 * We want the timer to fire at the deadline, but considering
 	 * that it is actually coming from rq->clock and not from
@@ -593,7 +596,7 @@ static void update_curr_dl(struct rq *rq)
 	dl_se->runtime -= delta_exec;
 	if (dl_runtime_exceeded(rq, dl_se)) {
 		__dequeue_task_dl(rq, curr, 0);
-		if (likely(start_dl_timer(dl_se)))
+		if (likely(start_dl_timer(dl_se, curr->dl.dl_boosted)))
 			dl_se->dl_throttled = 1;
 		else
 			enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
@@ -748,7 +751,8 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
 }
 
 static void
-enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
+enqueue_dl_entity(struct sched_dl_entity *dl_se,
+		  struct sched_dl_entity *pi_se, int flags)
 {
 	BUG_ON(on_dl_rq(dl_se));
 
@@ -758,9 +762,9 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 	 * we want a replenishment of its runtime.
 	 */
 	if (!dl_se->dl_new && flags & ENQUEUE_REPLENISH)
-		replenish_dl_entity(dl_se);
+		replenish_dl_entity(dl_se, pi_se);
 	else
-		update_dl_entity(dl_se);
+		update_dl_entity(dl_se, pi_se);
 
 	__enqueue_dl_entity(dl_se);
 }
@@ -772,6 +776,18 @@ static void dequeue_dl_entity(struct sched_dl_entity *dl_se)
 
 static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
+	struct task_struct *pi_task = p->pi_top_task;
+	struct sched_dl_entity *pi_se = &p->dl;
+
+	/*
+	 * Use the scheduling parameters of the top pi-waiter
+	 * task if we have one and its (relative) deadline is
+	 * smaller than our one... OTW we keep our runtime and
+	 * deadline.
+	 */
+	if (pi_task && p->dl.dl_boosted && dl_prio(pi_task->normal_prio))
+		pi_se = &pi_task->dl;
+
 	/*
 	 * If p is throttled, we do nothing. In fact, if it exhausted
 	 * its budget it needs a replenishment and, since it now is on
@@ -781,7 +797,7 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	if (p->dl.dl_throttled)
 		return;
 
-	enqueue_dl_entity(&p->dl, flags);
+	enqueue_dl_entity(&p->dl, pi_se, flags);
 
 	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_dl_task(rq, p);
@@ -1020,8 +1036,7 @@ static void task_dead_dl(struct task_struct *p)
 {
 	struct hrtimer *timer = &p->dl.dl_timer;
 
-	if (hrtimer_active(timer))
-		hrtimer_try_to_cancel(timer);
+	hrtimer_cancel(timer);
 }
 
 static void set_curr_task_dl(struct rq *rq)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b04aeed..4da50db 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -105,6 +105,20 @@ static inline int task_has_dl_policy(struct task_struct *p)
 	return dl_policy(p->policy);
 }
 
+static inline int dl_time_before(u64 a, u64 b)
+{
+	return (s64)(a - b) < 0;
+}
+
+/*
+ * Tells if entity @a should preempt entity @b.
+ */
+static inline
+int dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
+{
+	return dl_time_before(a->deadline, b->deadline);
+}
+
 /*
  * This is the priority-queue data structure of the RT scheduling class:
  */
diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
index 1457fb1..b5a9f02 100644
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -16,6 +16,7 @@
 #include <linux/uaccess.h>
 #include <linux/ftrace.h>
 #include <linux/sched/rt.h>
+#include <linux/sched/deadline.h>
 #include <trace/events/sched.h>
 #include "trace.h"
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH 11/14] sched: add bandwidth management for sched_dl.
  2013-11-07 13:43 [PATCH 00/14] sched: SCHED_DEADLINE v9 Juri Lelli
                   ` (9 preceding siblings ...)
  2013-11-07 13:43 ` [PATCH 10/14] sched: drafted deadline inheritance logic Juri Lelli
@ 2013-11-07 13:43 ` Juri Lelli
  2014-01-13 15:54   ` [tip:sched/core] sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks tip-bot for Dario Faggioli
  2013-11-07 13:43 ` [PATCH 12/14] sched: make dl_bw a sub-quota of rt_bw Juri Lelli
  2013-11-07 13:43 ` [PATCH 13/14] sched: speed up -dl pushes with a push-heap Juri Lelli
  12 siblings, 1 reply; 81+ messages in thread
From: Juri Lelli @ 2013-11-07 13:43 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield

From: Dario Faggioli <raistlin@linux.it>

In order of -deadline scheduling to be effective and useful, it is
important that some method of having the allocation of the available
CPU bandwidth to tasks and task groups under control.
This is usually called "admission control" and if it is not performed
at all, no guarantee can be given on the actual scheduling of the
-deadline tasks.

Since when RT-throttling has been introduced each task group have a
bandwidth associated to itself, calculated as a certain amount of
runtime over a period. Moreover, to make it possible to manipulate
such bandwidth, readable/writable controls have been added to both
procfs (for system wide settings) and cgroupfs (for per-group
settings).
Therefore, the same interface is being used for controlling the
bandwidth distrubution to -deadline tasks and task groups, i.e.,
new controls but with similar names, equivalent meaning and with
the same usage paradigm are added.

However, more discussion is needed in order to figure out how
we want to manage SCHED_DEADLINE bandwidth at the task group level.
Therefore, this patch adds a less sophisticated, but actually
very sensible, mechanism to ensure that a certain utilization
cap is not overcome per each root_domain (the single rq for !SMP
configurations).

Another main difference between deadline bandwidth management and
RT-throttling is that -deadline tasks have bandwidth on their own
(while -rt ones doesn't!), and thus we don't need an higher level
throttling mechanism to enforce the desired bandwidth.

This patch, therefore:
 - adds system wide deadline bandwidth management by means of:
    * /proc/sys/kernel/sched_dl_runtime_us,
    * /proc/sys/kernel/sched_dl_period_us,
   that determine (i.e., runtime / period) the total bandwidth
   available on each CPU of each root_domain for -deadline tasks;
 - couples the RT and deadline bandwidth management, i.e., enforces
   that the sum of how much bandwidth is being devoted to -rt
   -deadline tasks to stay below 100%.

This means that, for a root_domain comprising M CPUs, -deadline tasks
can be created until the sum of their bandwidths stay below:

    M * (sched_dl_runtime_us / sched_dl_period_us)

It is also possible to disable this bandwidth management logic, and
be thus free of oversubscribing the system up to any arbitrary level.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
---
 include/linux/sched.h        |    1 +
 include/linux/sched/sysctl.h |   13 ++
 kernel/sched/core.c          |  431 +++++++++++++++++++++++++++++++++++++++---
 kernel/sched/deadline.c      |   50 ++++-
 kernel/sched/sched.h         |   74 +++++++-
 kernel/sysctl.c              |   14 ++
 6 files changed, 549 insertions(+), 34 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9dad4f3..067f736 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1079,6 +1079,7 @@ struct sched_dl_entity {
 	u64 dl_runtime;		/* maximum runtime for each instance	*/
 	u64 dl_deadline;	/* relative deadline of each instance	*/
 	u64 dl_period;		/* separation of two instances (period) */
+	u64 dl_bw;		/* dl_runtime / dl_deadline		*/
 
 	/*
 	 * Actual scheduling parameters. Initialized with the values above,
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index bf8086b..33fbf32 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -83,6 +83,15 @@ static inline unsigned int get_sysctl_timer_migration(void)
 extern unsigned int sysctl_sched_rt_period;
 extern int sysctl_sched_rt_runtime;
 
+/*
+ *  control SCHED_DEADLINE reservations:
+ *
+ *  /proc/sys/kernel/sched_dl_period_us
+ *  /proc/sys/kernel/sched_dl_runtime_us
+ */
+extern unsigned int sysctl_sched_dl_period;
+extern int sysctl_sched_dl_runtime;
+
 #ifdef CONFIG_CFS_BANDWIDTH
 extern unsigned int sysctl_sched_cfs_bandwidth_slice;
 #endif
@@ -101,4 +110,8 @@ extern int sched_rt_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos);
 
+int sched_dl_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos);
+
 #endif /* _SCHED_SYSCTL_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d9dca90..d8a8622 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -296,6 +296,15 @@ __read_mostly int scheduler_running;
  */
 int sysctl_sched_rt_runtime = 950000;
 
+/*
+ * Maximum bandwidth available for all -deadline tasks and groups
+ * (if group scheduling is configured) on each CPU.
+ *
+ * default: 5%
+ */
+unsigned int sysctl_sched_dl_period = 1000000;
+int sysctl_sched_dl_runtime = 50000;
+
 
 
 /*
@@ -1745,6 +1754,96 @@ int sched_fork(struct task_struct *p)
 	return 0;
 }
 
+unsigned long to_ratio(u64 period, u64 runtime)
+{
+	if (runtime == RUNTIME_INF)
+		return 1ULL << 20;
+
+	/*
+	 * Doing this here saves a lot of checks in all
+	 * the calling paths, and returning zero seems
+	 * safe for them anyway.
+	 */
+	if (period == 0)
+		return 0;
+
+	return div64_u64(runtime << 20, period);
+}
+
+static inline
+void __dl_clear(struct dl_bw *dl_b, u64 tsk_bw)
+{
+	dl_b->total_bw -= tsk_bw;
+}
+
+static inline
+void __dl_add(struct dl_bw *dl_b, u64 tsk_bw)
+{
+	dl_b->total_bw += tsk_bw;
+}
+
+static inline
+bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64 new_bw)
+{
+	return dl_b->bw != -1 &&
+	       dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw;
+}
+
+/*
+ * We must be sure that accepting a new task (or allowing changing the
+ * parameters of an existing one) is consistent with the bandwidth
+ * constraints. If yes, this function also accordingly updates the currently
+ * allocated bandwidth to reflect the new situation.
+ *
+ * This function is called while holding p's rq->lock.
+ */
+static int dl_overflow(struct task_struct *p, int policy,
+		       const struct sched_param2 *param2)
+{
+#ifdef CONFIG_SMP
+	struct dl_bw *dl_b = &task_rq(p)->rd->dl_bw;
+#else
+	struct dl_bw *dl_b = &task_rq(p)->dl.dl_bw;
+#endif
+	u64 period = param2->sched_period;
+	u64 runtime = param2->sched_runtime;
+	u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
+#ifdef CONFIG_SMP
+	int cpus = cpumask_weight(task_rq(p)->rd->span);
+#else
+	int cpus = 1;
+#endif
+	int err = -1;
+
+	if (new_bw == p->dl.dl_bw)
+		return 0;
+
+	/*
+	 * Either if a task, enters, leave, or stays -deadline but changes
+	 * its parameters, we may need to update accordingly the total
+	 * allocated bandwidth of the container.
+	 */
+	raw_spin_lock(&dl_b->lock);
+	if (dl_policy(policy) && !task_has_dl_policy(p) &&
+	    !__dl_overflow(dl_b, cpus, 0, new_bw)) {
+		__dl_add(dl_b, new_bw);
+		err = 0;
+	} else if (dl_policy(policy) && task_has_dl_policy(p) &&
+		   !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
+		__dl_clear(dl_b, p->dl.dl_bw);
+		__dl_add(dl_b, new_bw);
+		err = 0;
+	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
+		__dl_clear(dl_b, p->dl.dl_bw);
+		err = 0;
+	}
+	raw_spin_unlock(&dl_b->lock);
+
+	return err;
+}
+
+extern void init_dl_bw(struct dl_bw *dl_b);
+
 /*
  * wake_up_new_task - wake up a newly created task for the first time.
  *
@@ -3333,6 +3432,7 @@ __setparam_dl(struct task_struct *p, const struct sched_param2 *param2)
 		dl_se->dl_period = param2->sched_period;
 	else
 		dl_se->dl_period = dl_se->dl_deadline;
+	dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
 	dl_se->flags = param2->sched_flags;
 	dl_se->dl_throttled = 0;
 	dl_se->dl_new = 1;
@@ -3354,15 +3454,18 @@ __getparam_dl(struct task_struct *p, struct sched_param2 *param2)
  * This function validates the new parameters of a -deadline task.
  * We ask for the deadline not being zero, and greater or equal
  * than the runtime, as well as the period of being zero or
- * greater than deadline.
+ * greater than deadline. Furthermore, we have to be sure that
+ * user parameters are above the internal resolution (1us); we
+ * check sched_runtime only since it is always the smaller one.
  */
 static bool
 __checkparam_dl(const struct sched_param2 *prm)
 {
 	return prm && prm->sched_deadline != 0 &&
 	       (prm->sched_period == 0 ||
-		(s64)(prm->sched_period - prm->sched_deadline) >= 0) &&
-	       (s64)(prm->sched_deadline - prm->sched_runtime) >= 0;
+	       (s64)(prm->sched_period - prm->sched_deadline) >= 0) &&
+	       (s64)(prm->sched_deadline - prm->sched_runtime) >= 0 &&
+	       prm->sched_runtime >= (2 << (DL_SCALE - 1));
 }
 
 /*
@@ -3491,8 +3594,8 @@ recheck:
 		return 0;
 	}
 
-#ifdef CONFIG_RT_GROUP_SCHED
 	if (user) {
+#ifdef CONFIG_RT_GROUP_SCHED
 		/*
 		 * Do not allow realtime tasks into groups that have no runtime
 		 * assigned.
@@ -3503,8 +3606,34 @@ recheck:
 			task_rq_unlock(rq, p, &flags);
 			return -EPERM;
 		}
-	}
 #endif
+#ifdef CONFIG_SMP
+		if (dl_bandwidth_enabled() && dl_policy(policy)) {
+			cpumask_t *span = rq->rd->span;
+			cpumask_t act_affinity;
+
+			/*
+			 * cpus_allowed mask is statically initialized with
+			 * CPU_MASK_ALL, span is instead dynamic. Here we
+			 * compute the "dynamic" affinity of a task.
+			 */
+			cpumask_and(&act_affinity, &p->cpus_allowed,
+				    cpu_active_mask);
+
+			/*
+			 * Don't allow tasks with an affinity mask smaller than
+			 * the entire root_domain to become SCHED_DEADLINE. We
+			 * will also fail if there's no bandwidth available.
+			 */
+			if (!cpumask_equal(&act_affinity, span) ||
+			    		   rq->rd->dl_bw.bw == 0) {
+				__task_rq_unlock(rq);
+				raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+				return -EPERM;
+			}
+		}
+#endif
+	}
 
 	/* recheck policy now with rq lock held */
 	if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
@@ -3512,6 +3641,19 @@ recheck:
 		task_rq_unlock(rq, p, &flags);
 		goto recheck;
 	}
+
+	/*
+	 * If setscheduling to SCHED_DEADLINE (or changing the parameters
+	 * of a SCHED_DEADLINE task) we need to check if enough bandwidth
+	 * is available.
+	 */
+	if ((dl_policy(policy) || dl_task(p)) &&
+	    dl_overflow(p, policy, param)) {
+		__task_rq_unlock(rq);
+		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+		return -EBUSY;
+	}
+
 	on_rq = p->on_rq;
 	running = task_current(rq, p);
 	if (on_rq)
@@ -3855,6 +3997,24 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 	if (retval)
 		goto out_unlock;
 
+	/*
+	 * Since bandwidth control happens on root_domain basis,
+	 * if admission test is enabled, we only admit -deadline
+	 * tasks allowed to run on all the CPUs in the task's
+	 * root_domain.
+	 */
+#ifdef CONFIG_SMP
+	if (task_has_dl_policy(p)) {
+		const struct cpumask *span = task_rq(p)->rd->span;
+
+		if (dl_bandwidth_enabled() &&
+		    !cpumask_equal(in_mask, span)) {
+			retval = -EBUSY;
+			goto out_unlock;
+		}
+	}
+#endif
+
 	cpuset_cpus_allowed(p, cpus_allowed);
 	cpumask_and(new_mask, in_mask, cpus_allowed);
 again:
@@ -4517,6 +4677,42 @@ out:
 EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
 
 /*
+ * When dealing with a -deadline task, we have to check if moving it to
+ * a new CPU is possible or not. In fact, this is only true iff there
+ * is enough bandwidth available on such CPU, otherwise we want the
+ * whole migration progedure to fail over.
+ */
+static inline
+bool set_task_cpu_dl(struct task_struct *p, unsigned int cpu)
+{
+	struct dl_bw *dl_b = &task_rq(p)->rd->dl_bw;
+	struct dl_bw *cpu_b = &cpu_rq(cpu)->rd->dl_bw;
+	int ret = 1;
+	u64 bw;
+
+	if (dl_b == cpu_b)
+		return 1;
+
+	raw_spin_lock(&dl_b->lock);
+	raw_spin_lock(&cpu_b->lock);
+
+	bw = cpu_b->bw * cpumask_weight(cpu_rq(cpu)->rd->span);
+	if (dl_bandwidth_enabled() &&
+	    bw < cpu_b->total_bw + p->dl.dl_bw) {
+		ret = 0;
+		goto unlock;
+	}
+	dl_b->total_bw -= p->dl.dl_bw;
+	cpu_b->total_bw += p->dl.dl_bw;
+
+unlock:
+	raw_spin_unlock(&cpu_b->lock);
+	raw_spin_unlock(&dl_b->lock);
+
+	return ret;
+}
+
+/*
  * Move (not current) task off this cpu, onto dest cpu. We're doing
  * this because either it can't run here any more (set_cpus_allowed()
  * away from this CPU, or CPU going down), or because we're
@@ -4548,6 +4744,13 @@ static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
 		goto fail;
 
 	/*
+	 * If p is -deadline, proceed only if there is enough
+	 * bandwidth available on dest_cpu
+	 */
+	if (unlikely(dl_task(p)) && !set_task_cpu_dl(p, dest_cpu))
+		goto fail;
+
+	/*
 	 * If we're not on a rq, the next wake-up will ensure we're
 	 * placed properly.
 	 */
@@ -5239,6 +5442,8 @@ static int init_rootdomain(struct root_domain *rd)
 	if (!alloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
 		goto free_dlo_mask;
 
+	init_dl_bw(&rd->dl_bw);
+
 	if (cpupri_init(&rd->cpupri) != 0)
 		goto free_rto_mask;
 	return 0;
@@ -6657,6 +6862,8 @@ void __init sched_init(void)
 
 	init_rt_bandwidth(&def_rt_bandwidth,
 			global_rt_period(), global_rt_runtime());
+	init_dl_bandwidth(&def_dl_bandwidth,
+			global_dl_period(), global_dl_runtime());
 
 #ifdef CONFIG_RT_GROUP_SCHED
 	init_rt_bandwidth(&root_task_group.rt_bandwidth,
@@ -7056,16 +7263,6 @@ void sched_move_task(struct task_struct *tsk)
 }
 #endif /* CONFIG_CGROUP_SCHED */
 
-#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_CFS_BANDWIDTH)
-static unsigned long to_ratio(u64 period, u64 runtime)
-{
-	if (runtime == RUNTIME_INF)
-		return 1ULL << 20;
-
-	return div64_u64(runtime << 20, period);
-}
-#endif
-
 #ifdef CONFIG_RT_GROUP_SCHED
 /*
  * Ensure that the real time constraints are schedulable.
@@ -7239,10 +7436,48 @@ static long sched_group_rt_period(struct task_group *tg)
 	do_div(rt_period_us, NSEC_PER_USEC);
 	return rt_period_us;
 }
+#endif /* CONFIG_RT_GROUP_SCHED */
+
+/*
+ * Coupling of -rt and -deadline bandwidth.
+ *
+ * Here we check if the new -rt bandwidth value is consistent
+ * with the system settings for the bandwidth available
+ * to -deadline tasks.
+ *
+ * IOW, we want to enforce that
+ *
+ *   rt_bandwidth + dl_bandwidth <= 100%
+ *
+ * is always true.
+ */
+static bool __sched_rt_dl_global_constraints(u64 rt_bw)
+{
+	unsigned long flags;
+	u64 dl_bw;
+	bool ret;
+
+	raw_spin_lock_irqsave(&def_dl_bandwidth.dl_runtime_lock, flags);
+	if (global_rt_runtime() == RUNTIME_INF ||
+	    global_dl_runtime() == RUNTIME_INF) {
+		ret = true;
+		goto unlock;
+	}
+
+	dl_bw = to_ratio(def_dl_bandwidth.dl_period,
+			 def_dl_bandwidth.dl_runtime);
+
+	ret = rt_bw + dl_bw <= to_ratio(RUNTIME_INF, RUNTIME_INF);
+unlock:
+	raw_spin_unlock_irqrestore(&def_dl_bandwidth.dl_runtime_lock, flags);
+
+	return ret;
+}
 
+#ifdef CONFIG_RT_GROUP_SCHED
 static int sched_rt_global_constraints(void)
 {
-	u64 runtime, period;
+	u64 runtime, period, bw;
 	int ret = 0;
 
 	if (sysctl_sched_rt_period <= 0)
@@ -7257,6 +7492,10 @@ static int sched_rt_global_constraints(void)
 	if (runtime > period && runtime != RUNTIME_INF)
 		return -EINVAL;
 
+	bw = to_ratio(period, runtime);
+	if (!__sched_rt_dl_global_constraints(bw))
+		return -EINVAL;
+
 	mutex_lock(&rt_constraints_mutex);
 	read_lock(&tasklist_lock);
 	ret = __rt_schedulable(NULL, 0, 0);
@@ -7279,19 +7518,19 @@ static int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
 static int sched_rt_global_constraints(void)
 {
 	unsigned long flags;
-	int i;
+	int i, ret = 0;
+	u64 bw;
 
 	if (sysctl_sched_rt_period <= 0)
 		return -EINVAL;
 
-	/*
-	 * There's always some RT tasks in the root group
-	 * -- migration, kstopmachine etc..
-	 */
-	if (sysctl_sched_rt_runtime == 0)
-		return -EBUSY;
-
 	raw_spin_lock_irqsave(&def_rt_bandwidth.rt_runtime_lock, flags);
+	bw = to_ratio(global_rt_period(), global_rt_runtime());
+	if (!__sched_rt_dl_global_constraints(bw)) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
 	for_each_possible_cpu(i) {
 		struct rt_rq *rt_rq = &cpu_rq(i)->rt;
 
@@ -7299,12 +7538,96 @@ static int sched_rt_global_constraints(void)
 		rt_rq->rt_runtime = global_rt_runtime();
 		raw_spin_unlock(&rt_rq->rt_runtime_lock);
 	}
+unlock:
 	raw_spin_unlock_irqrestore(&def_rt_bandwidth.rt_runtime_lock, flags);
 
-	return 0;
+	return ret;
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+/*
+ * Coupling of -dl and -rt bandwidth.
+ *
+ * Here we check, while setting the system wide bandwidth available
+ * for -dl tasks and groups, if the new values are consistent with
+ * the system settings for the bandwidth available to -rt entities.
+ *
+ * IOW, we want to enforce that
+ *
+ *   rt_bandwidth + dl_bandwidth <= 100%
+ *
+ * is always true.
+ */
+static bool __sched_dl_rt_global_constraints(u64 dl_bw)
+{
+	u64 rt_bw;
+	bool ret;
+
+	raw_spin_lock(&def_rt_bandwidth.rt_runtime_lock);
+	if (global_dl_runtime() == RUNTIME_INF ||
+	    global_rt_runtime() == RUNTIME_INF) {
+		ret = true;
+		goto unlock;
+	}
+
+	rt_bw = to_ratio(ktime_to_ns(def_rt_bandwidth.rt_period),
+			 def_rt_bandwidth.rt_runtime);
+
+	ret = rt_bw + dl_bw <= to_ratio(RUNTIME_INF, RUNTIME_INF);
+unlock:
+	raw_spin_unlock(&def_rt_bandwidth.rt_runtime_lock);
+
+	return ret;
+}
+
+static bool __sched_dl_global_constraints(u64 runtime, u64 period)
+{
+	if (!period || (runtime != RUNTIME_INF && runtime > period))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int sched_dl_global_constraints(void)
+{
+	u64 runtime = global_dl_runtime();
+	u64 period = global_dl_period();
+	u64 new_bw = to_ratio(period, runtime);
+	int ret, i;
+
+	ret = __sched_dl_global_constraints(runtime, period);
+	if (ret)
+		return ret;
+
+	if (!__sched_dl_rt_global_constraints(new_bw))
+		return -EINVAL;
+
+	/*
+	 * Here we want to check the bandwidth not being set to some
+	 * value smaller than the currently allocated bandwidth in
+	 * any of the root_domains.
+	 *
+	 * FIXME: Cycling on all the CPUs is overdoing, but simpler than
+	 * cycling on root_domains... Discussion on different/better
+	 * solutions is welcome!
+	 */
+	for_each_possible_cpu(i) {
+#ifdef CONFIG_SMP
+		struct dl_bw *dl_b = &cpu_rq(i)->rd->dl_bw;
+#else
+		struct dl_bw *dl_b = &cpu_rq(i)->dl.dl_bw;
+#endif
+		raw_spin_lock(&dl_b->lock);
+		if (new_bw < dl_b->total_bw) {
+			raw_spin_unlock(&dl_b->lock);
+			return -EBUSY;
+		}
+		raw_spin_unlock(&dl_b->lock);
+	}
+
+	return 0;
+}
+
 int sched_rr_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos)
@@ -7354,6 +7677,64 @@ int sched_rt_handler(struct ctl_table *table, int write,
 	return ret;
 }
 
+int sched_dl_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos)
+{
+	int ret;
+	int old_period, old_runtime;
+	static DEFINE_MUTEX(mutex);
+	unsigned long flags;
+
+	mutex_lock(&mutex);
+	old_period = sysctl_sched_dl_period;
+	old_runtime = sysctl_sched_dl_runtime;
+
+	ret = proc_dointvec(table, write, buffer, lenp, ppos);
+
+	if (!ret && write) {
+		raw_spin_lock_irqsave(&def_dl_bandwidth.dl_runtime_lock,
+				      flags);
+
+		ret = sched_dl_global_constraints();
+		if (ret) {
+			sysctl_sched_dl_period = old_period;
+			sysctl_sched_dl_runtime = old_runtime;
+		} else {
+			u64 new_bw;
+			int i;
+
+			def_dl_bandwidth.dl_period = global_dl_period();
+			def_dl_bandwidth.dl_runtime = global_dl_runtime();
+			if (global_dl_runtime() == RUNTIME_INF)
+				new_bw = -1;
+			else
+				new_bw = to_ratio(global_dl_period(),
+						  global_dl_runtime());
+			/*
+			 * FIXME: As above...
+			 */
+			for_each_possible_cpu(i) {
+#ifdef CONFIG_SMP
+				struct dl_bw *dl_b = &cpu_rq(i)->rd->dl_bw;
+#else
+				struct dl_bw *dl_b = &cpu_rq(i)->dl.dl_bw;
+#endif
+
+				raw_spin_lock(&dl_b->lock);
+				dl_b->bw = new_bw;
+				raw_spin_unlock(&dl_b->lock);
+			}
+		}
+
+		raw_spin_unlock_irqrestore(&def_dl_bandwidth.dl_runtime_lock,
+					   flags);
+	}
+	mutex_unlock(&mutex);
+
+	return ret;
+}
+
 #ifdef CONFIG_CGROUP_SCHED
 
 static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 3b251b0..fe39d7e 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -16,6 +16,8 @@
  */
 #include "sched.h"
 
+struct dl_bandwidth def_dl_bandwidth;
+
 static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
 {
 	return container_of(dl_se, struct task_struct, dl);
@@ -46,6 +48,27 @@ static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
 	return dl_rq->rb_leftmost == &dl_se->rb_node;
 }
 
+void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime)
+{
+	raw_spin_lock_init(&dl_b->dl_runtime_lock);
+	dl_b->dl_period = period;
+	dl_b->dl_runtime = runtime;
+}
+
+extern unsigned long to_ratio(u64 period, u64 runtime);
+
+void init_dl_bw(struct dl_bw *dl_b)
+{
+	raw_spin_lock_init(&dl_b->lock);
+	raw_spin_lock(&def_dl_bandwidth.dl_runtime_lock);
+	if (global_dl_runtime() == RUNTIME_INF)
+		dl_b->bw = -1;
+	else
+		dl_b->bw = to_ratio(global_dl_period(), global_dl_runtime());
+	raw_spin_unlock(&def_dl_bandwidth.dl_runtime_lock);
+	dl_b->total_bw = 0;
+}
+
 void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
 {
 	dl_rq->rb_root = RB_ROOT;
@@ -57,6 +80,8 @@ void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
 	dl_rq->dl_nr_migratory = 0;
 	dl_rq->overloaded = 0;
 	dl_rq->pushable_dl_tasks_root = RB_ROOT;
+#else
+	init_dl_bw(&dl_rq->dl_bw);
 #endif
 }
 
@@ -359,8 +384,9 @@ static bool dl_entity_overflow(struct sched_dl_entity *dl_se,
 	 * of anything below microseconds resolution is actually fiction
 	 * (but still we want to give the user that illusion >;).
 	 */
-	left = (pi_se->dl_period >> 10) * (dl_se->runtime >> 10);
-	right = ((dl_se->deadline - t) >> 10) * (pi_se->dl_runtime >> 10);
+	left = (pi_se->dl_period >> DL_SCALE) * (dl_se->runtime >> DL_SCALE);
+	right = ((dl_se->deadline - t) >> DL_SCALE) *
+		(pi_se->dl_runtime >> DL_SCALE);
 
 	return dl_time_before(right, left);
 }
@@ -934,8 +960,8 @@ static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
 	 * In the unlikely case current and p have the same deadline
 	 * let us try to decide what's the best thing to do...
 	 */
-	if ((s64)(p->dl.deadline - rq->curr->dl.deadline) == 0 &&
-	    !need_resched())
+	if ((p->dl.deadline == rq->curr->dl.deadline) &&
+	    !test_tsk_need_resched(rq->curr))
 		check_preempt_equal_dl(rq, p);
 #endif /* CONFIG_SMP */
 }
@@ -1035,6 +1061,18 @@ static void task_fork_dl(struct task_struct *p)
 static void task_dead_dl(struct task_struct *p)
 {
 	struct hrtimer *timer = &p->dl.dl_timer;
+#ifdef CONFIG_SMP
+	struct dl_bw *dl_b = &task_rq(p)->rd->dl_bw;
+#else
+	struct dl_bw *dl_b = &task_rq(p)->dl.dl_bw;
+#endif
+
+	/*
+	 * Since we are TASK_DEAD we won't slip out of the domain!
+	 */
+	raw_spin_lock_irq(&dl_b->lock);
+	dl_b->total_bw -= p->dl.dl_bw;
+	raw_spin_unlock_irq(&dl_b->lock);
 
 	hrtimer_cancel(timer);
 }
@@ -1261,7 +1299,7 @@ static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
 	BUG_ON(task_current(rq, p));
 	BUG_ON(p->nr_cpus_allowed <= 1);
 
-	BUG_ON(!p->se.on_rq);
+	BUG_ON(!p->on_rq);
 	BUG_ON(!dl_task(p));
 
 	return p;
@@ -1408,7 +1446,7 @@ static int pull_dl_task(struct rq *this_rq)
 		     dl_time_before(p->dl.deadline,
 				    this_rq->dl.earliest_dl.curr))) {
 			WARN_ON(p == src_rq->curr);
-			WARN_ON(!p->se.on_rq);
+			WARN_ON(!p->on_rq);
 
 			/*
 			 * Then we pull iff p has actually an earlier
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4da50db..08931a6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -81,6 +81,13 @@ extern void update_cpu_load_active(struct rq *this_rq);
  */
 #define RUNTIME_INF	((u64)~0ULL)
 
+/*
+ * Single value that decides SCHED_DEADLINE internal math precision.
+ * 10 -> just above 1us
+ * 9  -> just above 0.5us
+ */
+#define DL_SCALE (10)
+
 static inline int rt_policy(int policy)
 {
 	if (policy == SCHED_FIFO || policy == SCHED_RR)
@@ -105,7 +112,7 @@ static inline int task_has_dl_policy(struct task_struct *p)
 	return dl_policy(p->policy);
 }
 
-static inline int dl_time_before(u64 a, u64 b)
+static inline bool dl_time_before(u64 a, u64 b)
 {
 	return (s64)(a - b) < 0;
 }
@@ -113,8 +120,8 @@ static inline int dl_time_before(u64 a, u64 b)
 /*
  * Tells if entity @a should preempt entity @b.
  */
-static inline
-int dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
+static inline bool
+dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
 {
 	return dl_time_before(a->deadline, b->deadline);
 }
@@ -134,6 +141,48 @@ struct rt_bandwidth {
 	u64			rt_runtime;
 	struct hrtimer		rt_period_timer;
 };
+/*
+ * To keep the bandwidth of -deadline tasks and groups under control
+ * we need some place where:
+ *  - store the maximum -deadline bandwidth of the system (the group);
+ *  - cache the fraction of that bandwidth that is currently allocated.
+ *
+ * This is all done in the data structure below. It is similar to the
+ * one used for RT-throttling (rt_bandwidth), with the main difference
+ * that, since here we are only interested in admission control, we
+ * do not decrease any runtime while the group "executes", neither we
+ * need a timer to replenish it.
+ *
+ * With respect to SMP, the bandwidth is given on a per-CPU basis,
+ * meaning that:
+ *  - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;
+ *  - dl_total_bw array contains, in the i-eth element, the currently
+ *    allocated bandwidth on the i-eth CPU.
+ * Moreover, groups consume bandwidth on each CPU, while tasks only
+ * consume bandwidth on the CPU they're running on.
+ * Finally, dl_total_bw_cpu is used to cache the index of dl_total_bw
+ * that will be shown the next time the proc or cgroup controls will
+ * be red. It on its turn can be changed by writing on its own
+ * control.
+ */
+struct dl_bandwidth {
+	raw_spinlock_t dl_runtime_lock;
+	u64 dl_runtime;
+	u64 dl_period;
+};
+
+static inline int dl_bandwidth_enabled(void)
+{
+	return sysctl_sched_dl_runtime >= 0;
+}
+
+struct dl_bw {
+	raw_spinlock_t lock;
+	u64 bw, total_bw;
+};
+
+static inline u64 global_dl_period(void);
+static inline u64 global_dl_runtime(void);
 
 extern struct mutex sched_domains_mutex;
 
@@ -423,6 +472,8 @@ struct dl_rq {
 	 */
 	struct rb_root pushable_dl_tasks_root;
 	struct rb_node *pushable_dl_tasks_leftmost;
+#else
+	struct dl_bw dl_bw;
 #endif
 };
 
@@ -454,6 +505,7 @@ struct root_domain {
 	 */
 	cpumask_var_t dlo_mask;
 	atomic_t dlo_count;
+	struct dl_bw dl_bw;
 
 	/*
 	 * The "RT overload" flag: it gets set if a CPU has more than
@@ -872,7 +924,18 @@ static inline u64 global_rt_runtime(void)
 	return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
 }
 
+static inline u64 global_dl_period(void)
+{
+	return (u64)sysctl_sched_dl_period * NSEC_PER_USEC;
+}
+
+static inline u64 global_dl_runtime(void)
+{
+	if (sysctl_sched_dl_runtime < 0)
+		return RUNTIME_INF;
 
+	return (u64)sysctl_sched_dl_runtime * NSEC_PER_USEC;
+}
 
 static inline int task_current(struct rq *rq, struct task_struct *p)
 {
@@ -1120,6 +1183,7 @@ extern void update_max_interval(void);
 extern void init_sched_dl_class(void);
 extern void init_sched_rt_class(void);
 extern void init_sched_fair_class(void);
+extern void init_sched_dl_class(void);
 
 extern void resched_task(struct task_struct *p);
 extern void resched_cpu(int cpu);
@@ -1127,8 +1191,12 @@ extern void resched_cpu(int cpu);
 extern struct rt_bandwidth def_rt_bandwidth;
 extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
 
+extern struct dl_bandwidth def_dl_bandwidth;
+extern void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime);
 extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
 
+unsigned long to_ratio(u64 period, u64 runtime);
+
 extern void update_idle_cpu_load(struct rq *this_rq);
 
 extern void init_task_runnable_average(struct task_struct *p);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8b80f1b..8492cbb 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -414,6 +414,20 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= sched_rr_handler,
 	},
+	{
+		.procname	= "sched_dl_period_us",
+		.data		= &sysctl_sched_dl_period,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sched_dl_handler,
+	},
+	{
+		.procname	= "sched_dl_runtime_us",
+		.data		= &sysctl_sched_dl_runtime,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= sched_dl_handler,
+	},
 #ifdef CONFIG_SCHED_AUTOGROUP
 	{
 		.procname	= "sched_autogroup_enabled",
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH 12/14] sched: make dl_bw a sub-quota of rt_bw
  2013-11-07 13:43 [PATCH 00/14] sched: SCHED_DEADLINE v9 Juri Lelli
                   ` (10 preceding siblings ...)
  2013-11-07 13:43 ` [PATCH 11/14] sched: add bandwidth management for sched_dl Juri Lelli
@ 2013-11-07 13:43 ` Juri Lelli
  2013-11-07 13:43 ` [PATCH 13/14] sched: speed up -dl pushes with a push-heap Juri Lelli
  12 siblings, 0 replies; 81+ messages in thread
From: Juri Lelli @ 2013-11-07 13:43 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield

Change real-time bandwidth management as to make dl_bw a sub-quota
of rt_bw. This patch leaves rt_bw at its default value and sets
dl_bw at 40% of rt_bw. It also remove sched_dl_period_us control
knob using sched_rt_period_us as common period for both rt_bw and
dl_bw.

Checks are made when the user tries to change dl_bw sub-quota as to
not fall below what currently used. Since dl_bw now depends upon
rt_bw, similar checks are performed when the users modifies rt_bw
and dl_bw is changed accordingly. Setting rt_bw sysctl variable to
-1 (actually disabling rt throttling) disables dl_bw checks as well.

Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
---
 include/linux/sched/deadline.h |    2 +
 include/linux/sched/sysctl.h   |    2 -
 kernel/sched/core.c            |  322 ++++++++++++++++++++--------------------
 kernel/sched/deadline.c        |   13 +-
 kernel/sched/sched.h           |   24 ++-
 kernel/sysctl.c                |    7 -
 6 files changed, 173 insertions(+), 197 deletions(-)

diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
index 9d303b8..2e44877 100644
--- a/include/linux/sched/deadline.h
+++ b/include/linux/sched/deadline.h
@@ -21,4 +21,6 @@ static inline int dl_task(struct task_struct *p)
 	return dl_prio(p->prio);
 }
 
+extern inline struct dl_bw *dl_bw_of(int i);
+
 #endif /* _SCHED_DEADLINE_H */
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 33fbf32..444a257 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -86,10 +86,8 @@ extern int sysctl_sched_rt_runtime;
 /*
  *  control SCHED_DEADLINE reservations:
  *
- *  /proc/sys/kernel/sched_dl_period_us
  *  /proc/sys/kernel/sched_dl_runtime_us
  */
-extern unsigned int sysctl_sched_dl_period;
 extern int sysctl_sched_dl_runtime;
 
 #ifdef CONFIG_CFS_BANDWIDTH
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d8a8622..08500e0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -297,13 +297,12 @@ __read_mostly int scheduler_running;
 int sysctl_sched_rt_runtime = 950000;
 
 /*
- * Maximum bandwidth available for all -deadline tasks and groups
- * (if group scheduling is configured) on each CPU.
+ * Sub-quota or rt bandwidth available for all -deadline tasks
+ * on each CPU.
  *
- * default: 5%
+ * default: 40%
  */
-unsigned int sysctl_sched_dl_period = 1000000;
-int sysctl_sched_dl_runtime = 50000;
+int sysctl_sched_dl_runtime = 400000;
 
 
 
@@ -1770,6 +1769,28 @@ unsigned long to_ratio(u64 period, u64 runtime)
 	return div64_u64(runtime << 20, period);
 }
 
+#ifdef CONFIG_SMP
+inline struct dl_bw *dl_bw_of(int i)
+{
+	return &cpu_rq(i)->rd->dl_bw;
+}
+
+static inline int __dl_span_weight(struct rq *rq)
+{
+	return cpumask_weight(rq->rd->span);
+}
+#else
+inline struct dl_bw *dl_bw_of(int i)
+{
+	return &cpu_rq(i)->dl.dl_bw;
+}
+
+static inline int __dl_span_weight(struct rq *rq)
+{
+	return 1;
+}
+#endif
+
 static inline
 void __dl_clear(struct dl_bw *dl_b, u64 tsk_bw)
 {
@@ -1800,19 +1821,11 @@ bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64 new_bw)
 static int dl_overflow(struct task_struct *p, int policy,
 		       const struct sched_param2 *param2)
 {
-#ifdef CONFIG_SMP
-	struct dl_bw *dl_b = &task_rq(p)->rd->dl_bw;
-#else
-	struct dl_bw *dl_b = &task_rq(p)->dl.dl_bw;
-#endif
+	struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
 	u64 period = param2->sched_period;
 	u64 runtime = param2->sched_runtime;
 	u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
-#ifdef CONFIG_SMP
-	int cpus = cpumask_weight(task_rq(p)->rd->span);
-#else
-	int cpus = 1;
-#endif
+	int cpus = __dl_span_weight(task_rq(p));
 	int err = -1;
 
 	if (new_bw == p->dl.dl_bw)
@@ -4685,8 +4698,8 @@ EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
 static inline
 bool set_task_cpu_dl(struct task_struct *p, unsigned int cpu)
 {
-	struct dl_bw *dl_b = &task_rq(p)->rd->dl_bw;
-	struct dl_bw *cpu_b = &cpu_rq(cpu)->rd->dl_bw;
+	struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+	struct dl_bw *cpu_b = dl_bw_of(cpu);
 	int ret = 1;
 	u64 bw;
 
@@ -6815,6 +6828,8 @@ LIST_HEAD(task_groups);
 
 DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
 
+static u64 actual_dl_runtime(void);
+
 void __init sched_init(void)
 {
 	int i, j;
@@ -6856,15 +6871,14 @@ void __init sched_init(void)
 #endif /* CONFIG_CPUMASK_OFFSTACK */
 	}
 
+	init_rt_bandwidth(&def_rt_bandwidth,
+			global_rt_period(), global_rt_runtime());
+	init_dl_bandwidth(&def_dl_bandwidth, actual_dl_runtime());
+
 #ifdef CONFIG_SMP
 	init_defrootdomain();
 #endif
 
-	init_rt_bandwidth(&def_rt_bandwidth,
-			global_rt_period(), global_rt_runtime());
-	init_dl_bandwidth(&def_dl_bandwidth,
-			global_dl_period(), global_dl_runtime());
-
 #ifdef CONFIG_RT_GROUP_SCHED
 	init_rt_bandwidth(&root_task_group.rt_bandwidth,
 			global_rt_period(), global_rt_runtime());
@@ -7263,6 +7277,86 @@ void sched_move_task(struct task_struct *tsk)
 }
 #endif /* CONFIG_CGROUP_SCHED */
 
+static u64 actual_dl_runtime(void)
+{
+	u64 dl_runtime = global_dl_runtime();
+	u64 rt_runtime = global_rt_runtime();
+	u64 period = global_rt_period();
+
+	/*
+	 * We want to calculate the sub-quota of rt_bw actually available
+	 * for -dl tasks. It is a percentage of percentage. By default 95%
+	 * of system bandwidth is allocate to -rt tasks; among this, a 40%
+	 * quota is reserved for -dl tasks. To have the actual quota a simple
+	 * multiplication is needed: .95 * .40 = .38 (38% of system bandwidth
+	 * for deadline tasks).
+	 * What follows is basically the same, but using unsigned integers.
+	 *
+	 *                   dl_runtime   rt_runtime
+	 * actual_runtime =  ---------- * ---------- * period
+	 *                     period       period
+	 */
+	if (dl_runtime == RUNTIME_INF)
+		return RUNTIME_INF;
+
+	return div64_u64 (dl_runtime * rt_runtime, period);
+}
+
+static int check_dl_bw(void)
+{
+	int i;
+	u64 period = global_rt_period();
+	u64 dl_actual_runtime = def_dl_bandwidth.dl_runtime;
+	u64 new_bw = to_ratio(period, dl_actual_runtime);
+
+	/*
+	 * Here we want to check the bandwidth not being set to some
+	 * value smaller than the currently allocated bandwidth in
+	 * any of the root_domains.
+	 *
+	 * FIXME: Cycling on all the CPUs is overdoing, but simpler than
+	 * cycling on root_domains... Discussion on different/better
+	 * solutions is welcome!
+	 */
+	for_each_possible_cpu(i) {
+		struct dl_bw *dl_b = dl_bw_of(i);
+
+		raw_spin_lock(&dl_b->lock);
+		if (new_bw < dl_b->total_bw) {
+			raw_spin_unlock(&dl_b->lock);
+			return -EBUSY;
+		}
+		raw_spin_unlock(&dl_b->lock);
+	}
+
+	return 0;
+}
+
+static void update_dl_bw(void)
+{
+	u64 new_bw;
+	int i;
+
+	def_dl_bandwidth.dl_runtime = actual_dl_runtime();
+	if (def_dl_bandwidth.dl_runtime == RUNTIME_INF ||
+	    global_rt_runtime() == RUNTIME_INF)
+		new_bw = ULLONG_MAX;
+	else {
+		new_bw = to_ratio(global_rt_period(),
+				  def_dl_bandwidth.dl_runtime);
+	}
+	/*
+	 * FIXME: As above...
+	 */
+	for_each_possible_cpu(i) {
+		struct dl_bw *dl_b = dl_bw_of(i);
+
+		raw_spin_lock(&dl_b->lock);
+		dl_b->bw = new_bw;
+		raw_spin_unlock(&dl_b->lock);
+	}
+}
+
 #ifdef CONFIG_RT_GROUP_SCHED
 /*
  * Ensure that the real time constraints are schedulable.
@@ -7436,48 +7530,10 @@ static long sched_group_rt_period(struct task_group *tg)
 	do_div(rt_period_us, NSEC_PER_USEC);
 	return rt_period_us;
 }
-#endif /* CONFIG_RT_GROUP_SCHED */
-
-/*
- * Coupling of -rt and -deadline bandwidth.
- *
- * Here we check if the new -rt bandwidth value is consistent
- * with the system settings for the bandwidth available
- * to -deadline tasks.
- *
- * IOW, we want to enforce that
- *
- *   rt_bandwidth + dl_bandwidth <= 100%
- *
- * is always true.
- */
-static bool __sched_rt_dl_global_constraints(u64 rt_bw)
-{
-	unsigned long flags;
-	u64 dl_bw;
-	bool ret;
-
-	raw_spin_lock_irqsave(&def_dl_bandwidth.dl_runtime_lock, flags);
-	if (global_rt_runtime() == RUNTIME_INF ||
-	    global_dl_runtime() == RUNTIME_INF) {
-		ret = true;
-		goto unlock;
-	}
-
-	dl_bw = to_ratio(def_dl_bandwidth.dl_period,
-			 def_dl_bandwidth.dl_runtime);
-
-	ret = rt_bw + dl_bw <= to_ratio(RUNTIME_INF, RUNTIME_INF);
-unlock:
-	raw_spin_unlock_irqrestore(&def_dl_bandwidth.dl_runtime_lock, flags);
-
-	return ret;
-}
 
-#ifdef CONFIG_RT_GROUP_SCHED
 static int sched_rt_global_constraints(void)
 {
-	u64 runtime, period, bw;
+	u64 runtime, period;
 	int ret = 0;
 
 	if (sysctl_sched_rt_period <= 0)
@@ -7492,9 +7548,13 @@ static int sched_rt_global_constraints(void)
 	if (runtime > period && runtime != RUNTIME_INF)
 		return -EINVAL;
 
-	bw = to_ratio(period, runtime);
-	if (!__sched_rt_dl_global_constraints(bw))
-		return -EINVAL;
+	/*
+	 * Check if changing rt_bw could have negative effects
+	 * on dl_bw
+	 */
+	ret = check_dl_bw();
+	if (ret)
+		return ret;
 
 	mutex_lock(&rt_constraints_mutex);
 	read_lock(&tasklist_lock);
@@ -7518,18 +7578,27 @@ static int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
 static int sched_rt_global_constraints(void)
 {
 	unsigned long flags;
-	int i, ret = 0;
-	u64 bw;
+	int i, ret;
 
 	if (sysctl_sched_rt_period <= 0)
 		return -EINVAL;
 
+	/*
+	 * There's always some RT tasks in the root group
+	 * -- migration, kstopmachine etc..
+	 */
+	if (sysctl_sched_rt_runtime == 0)
+		return -EBUSY;
+
+	/*
+	 * Check if changing rt_bw could have negative effects
+	 * on dl_bw
+	 */
+	ret = check_dl_bw();
+	if (ret)
+		return ret;
+
 	raw_spin_lock_irqsave(&def_rt_bandwidth.rt_runtime_lock, flags);
-	bw = to_ratio(global_rt_period(), global_rt_runtime());
-	if (!__sched_rt_dl_global_constraints(bw)) {
-		ret = -EINVAL;
-		goto unlock;
-	}
 
 	for_each_possible_cpu(i) {
 		struct rt_rq *rt_rq = &cpu_rq(i)->rt;
@@ -7538,48 +7607,12 @@ static int sched_rt_global_constraints(void)
 		rt_rq->rt_runtime = global_rt_runtime();
 		raw_spin_unlock(&rt_rq->rt_runtime_lock);
 	}
-unlock:
 	raw_spin_unlock_irqrestore(&def_rt_bandwidth.rt_runtime_lock, flags);
 
-	return ret;
+	return 0;
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
-/*
- * Coupling of -dl and -rt bandwidth.
- *
- * Here we check, while setting the system wide bandwidth available
- * for -dl tasks and groups, if the new values are consistent with
- * the system settings for the bandwidth available to -rt entities.
- *
- * IOW, we want to enforce that
- *
- *   rt_bandwidth + dl_bandwidth <= 100%
- *
- * is always true.
- */
-static bool __sched_dl_rt_global_constraints(u64 dl_bw)
-{
-	u64 rt_bw;
-	bool ret;
-
-	raw_spin_lock(&def_rt_bandwidth.rt_runtime_lock);
-	if (global_dl_runtime() == RUNTIME_INF ||
-	    global_rt_runtime() == RUNTIME_INF) {
-		ret = true;
-		goto unlock;
-	}
-
-	rt_bw = to_ratio(ktime_to_ns(def_rt_bandwidth.rt_period),
-			 def_rt_bandwidth.rt_runtime);
-
-	ret = rt_bw + dl_bw <= to_ratio(RUNTIME_INF, RUNTIME_INF);
-unlock:
-	raw_spin_unlock(&def_rt_bandwidth.rt_runtime_lock);
-
-	return ret;
-}
-
 static bool __sched_dl_global_constraints(u64 runtime, u64 period)
 {
 	if (!period || (runtime != RUNTIME_INF && runtime > period))
@@ -7590,40 +7623,17 @@ static bool __sched_dl_global_constraints(u64 runtime, u64 period)
 
 static int sched_dl_global_constraints(void)
 {
-	u64 runtime = global_dl_runtime();
-	u64 period = global_dl_period();
-	u64 new_bw = to_ratio(period, runtime);
-	int ret, i;
+	u64 period = global_rt_period();
+	u64 dl_actual_runtime = def_dl_bandwidth.dl_runtime;
+	int ret;
 
-	ret = __sched_dl_global_constraints(runtime, period);
+	ret = __sched_dl_global_constraints(dl_actual_runtime, period);
 	if (ret)
 		return ret;
 
-	if (!__sched_dl_rt_global_constraints(new_bw))
-		return -EINVAL;
-
-	/*
-	 * Here we want to check the bandwidth not being set to some
-	 * value smaller than the currently allocated bandwidth in
-	 * any of the root_domains.
-	 *
-	 * FIXME: Cycling on all the CPUs is overdoing, but simpler than
-	 * cycling on root_domains... Discussion on different/better
-	 * solutions is welcome!
-	 */
-	for_each_possible_cpu(i) {
-#ifdef CONFIG_SMP
-		struct dl_bw *dl_b = &cpu_rq(i)->rd->dl_bw;
-#else
-		struct dl_bw *dl_b = &cpu_rq(i)->dl.dl_bw;
-#endif
-		raw_spin_lock(&dl_b->lock);
-		if (new_bw < dl_b->total_bw) {
-			raw_spin_unlock(&dl_b->lock);
-			return -EBUSY;
-		}
-		raw_spin_unlock(&dl_b->lock);
-	}
+	ret = check_dl_bw();
+	if (ret)
+		return ret;
 
 	return 0;
 }
@@ -7654,6 +7664,7 @@ int sched_rt_handler(struct ctl_table *table, int write,
 	int ret;
 	int old_period, old_runtime;
 	static DEFINE_MUTEX(mutex);
+	unsigned long flags;
 
 	mutex_lock(&mutex);
 	old_period = sysctl_sched_rt_period;
@@ -7663,6 +7674,8 @@ int sched_rt_handler(struct ctl_table *table, int write,
 
 	if (!ret && write) {
 		ret = sched_rt_global_constraints();
+		raw_spin_lock_irqsave(&def_dl_bandwidth.dl_runtime_lock,
+				      flags);
 		if (ret) {
 			sysctl_sched_rt_period = old_period;
 			sysctl_sched_rt_runtime = old_runtime;
@@ -7670,7 +7683,11 @@ int sched_rt_handler(struct ctl_table *table, int write,
 			def_rt_bandwidth.rt_runtime = global_rt_runtime();
 			def_rt_bandwidth.rt_period =
 				ns_to_ktime(global_rt_period());
+
+			update_dl_bw();
 		}
+		raw_spin_unlock_irqrestore(&def_dl_bandwidth.dl_runtime_lock,
+					   flags);
 	}
 	mutex_unlock(&mutex);
 
@@ -7682,12 +7699,11 @@ int sched_dl_handler(struct ctl_table *table, int write,
 		loff_t *ppos)
 {
 	int ret;
-	int old_period, old_runtime;
+	int old_runtime;
 	static DEFINE_MUTEX(mutex);
 	unsigned long flags;
 
 	mutex_lock(&mutex);
-	old_period = sysctl_sched_dl_period;
 	old_runtime = sysctl_sched_dl_runtime;
 
 	ret = proc_dointvec(table, write, buffer, lenp, ppos);
@@ -7698,33 +7714,9 @@ int sched_dl_handler(struct ctl_table *table, int write,
 
 		ret = sched_dl_global_constraints();
 		if (ret) {
-			sysctl_sched_dl_period = old_period;
 			sysctl_sched_dl_runtime = old_runtime;
 		} else {
-			u64 new_bw;
-			int i;
-
-			def_dl_bandwidth.dl_period = global_dl_period();
-			def_dl_bandwidth.dl_runtime = global_dl_runtime();
-			if (global_dl_runtime() == RUNTIME_INF)
-				new_bw = -1;
-			else
-				new_bw = to_ratio(global_dl_period(),
-						  global_dl_runtime());
-			/*
-			 * FIXME: As above...
-			 */
-			for_each_possible_cpu(i) {
-#ifdef CONFIG_SMP
-				struct dl_bw *dl_b = &cpu_rq(i)->rd->dl_bw;
-#else
-				struct dl_bw *dl_b = &cpu_rq(i)->dl.dl_bw;
-#endif
-
-				raw_spin_lock(&dl_b->lock);
-				dl_b->bw = new_bw;
-				raw_spin_unlock(&dl_b->lock);
-			}
+			update_dl_bw();
 		}
 
 		raw_spin_unlock_irqrestore(&def_dl_bandwidth.dl_runtime_lock,
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index fe39d7e..be84cb3 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -48,10 +48,9 @@ static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
 	return dl_rq->rb_leftmost == &dl_se->rb_node;
 }
 
-void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime)
+void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 runtime)
 {
 	raw_spin_lock_init(&dl_b->dl_runtime_lock);
-	dl_b->dl_period = period;
 	dl_b->dl_runtime = runtime;
 }
 
@@ -62,9 +61,9 @@ void init_dl_bw(struct dl_bw *dl_b)
 	raw_spin_lock_init(&dl_b->lock);
 	raw_spin_lock(&def_dl_bandwidth.dl_runtime_lock);
 	if (global_dl_runtime() == RUNTIME_INF)
-		dl_b->bw = -1;
+		dl_b->bw = ULLONG_MAX;
 	else
-		dl_b->bw = to_ratio(global_dl_period(), global_dl_runtime());
+		dl_b->bw = to_ratio(global_rt_period(), global_dl_runtime());
 	raw_spin_unlock(&def_dl_bandwidth.dl_runtime_lock);
 	dl_b->total_bw = 0;
 }
@@ -1061,11 +1060,7 @@ static void task_fork_dl(struct task_struct *p)
 static void task_dead_dl(struct task_struct *p)
 {
 	struct hrtimer *timer = &p->dl.dl_timer;
-#ifdef CONFIG_SMP
-	struct dl_bw *dl_b = &task_rq(p)->rd->dl_bw;
-#else
-	struct dl_bw *dl_b = &task_rq(p)->dl.dl_bw;
-#endif
+	struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
 
 	/*
 	 * Since we are TASK_DEAD we won't slip out of the domain!
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 08931a6..bf4acf8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -142,20 +142,20 @@ struct rt_bandwidth {
 	struct hrtimer		rt_period_timer;
 };
 /*
- * To keep the bandwidth of -deadline tasks and groups under control
- * we need some place where:
- *  - store the maximum -deadline bandwidth of the system (the group);
+ * To keep the bandwidth of -deadline tasks under control we need some
+ * place where:
+ *  - store the maximum -deadline bandwidth of the system;
  *  - cache the fraction of that bandwidth that is currently allocated.
  *
  * This is all done in the data structure below. It is similar to the
  * one used for RT-throttling (rt_bandwidth), with the main difference
  * that, since here we are only interested in admission control, we
- * do not decrease any runtime while the group "executes", neither we
+ * do not decrease any runtime while the task "executes", neither we
  * need a timer to replenish it.
  *
  * With respect to SMP, the bandwidth is given on a per-CPU basis,
  * meaning that:
- *  - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;
+ *  - dl_bw (< 100%) is the bandwidth of the system on each CPU;
  *  - dl_total_bw array contains, in the i-eth element, the currently
  *    allocated bandwidth on the i-eth CPU.
  * Moreover, groups consume bandwidth on each CPU, while tasks only
@@ -168,7 +168,6 @@ struct rt_bandwidth {
 struct dl_bandwidth {
 	raw_spinlock_t dl_runtime_lock;
 	u64 dl_runtime;
-	u64 dl_period;
 };
 
 static inline int dl_bandwidth_enabled(void)
@@ -178,10 +177,12 @@ static inline int dl_bandwidth_enabled(void)
 
 struct dl_bw {
 	raw_spinlock_t lock;
-	u64 bw, total_bw;
+	/* default value */
+	u64 bw;
+	/* allocated */
+	u64 total_bw;
 };
 
-static inline u64 global_dl_period(void);
 static inline u64 global_dl_runtime(void);
 
 extern struct mutex sched_domains_mutex;
@@ -924,11 +925,6 @@ static inline u64 global_rt_runtime(void)
 	return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
 }
 
-static inline u64 global_dl_period(void)
-{
-	return (u64)sysctl_sched_dl_period * NSEC_PER_USEC;
-}
-
 static inline u64 global_dl_runtime(void)
 {
 	if (sysctl_sched_dl_runtime < 0)
@@ -1192,7 +1188,7 @@ extern struct rt_bandwidth def_rt_bandwidth;
 extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
 
 extern struct dl_bandwidth def_dl_bandwidth;
-extern void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime);
+extern void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 runtime);
 extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
 
 unsigned long to_ratio(u64 period, u64 runtime);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8492cbb..39167fc 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -415,13 +415,6 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= sched_rr_handler,
 	},
 	{
-		.procname	= "sched_dl_period_us",
-		.data		= &sysctl_sched_dl_period,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= sched_dl_handler,
-	},
-	{
 		.procname	= "sched_dl_runtime_us",
 		.data		= &sysctl_sched_dl_runtime,
 		.maxlen		= sizeof(int),
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH 13/14] sched: speed up -dl pushes with a push-heap.
  2013-11-07 13:43 [PATCH 00/14] sched: SCHED_DEADLINE v9 Juri Lelli
                   ` (11 preceding siblings ...)
  2013-11-07 13:43 ` [PATCH 12/14] sched: make dl_bw a sub-quota of rt_bw Juri Lelli
@ 2013-11-07 13:43 ` Juri Lelli
  2014-01-13 15:54   ` [tip:sched/core] sched/deadline: speed up SCHED_DEADLINE " tip-bot for Juri Lelli
  12 siblings, 1 reply; 81+ messages in thread
From: Juri Lelli @ 2013-11-07 13:43 UTC (permalink / raw)
  To: peterz, tglx
  Cc: mingo, rostedt, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	juri.lelli, nicola.manica, luca.abeni, dhaval.giani, hgu1972,
	paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield

Data from tests confirmed that the original active load balancing
logic didn't scale neither in the number of CPU nor in the number of
tasks (as sched_rt does).

Here we provide a global data structure to keep track of deadlines
of the running tasks in the system. The structure is composed by
a bitmask showing the free CPUs and a max-heap, needed when the system
is heavily loaded.

The implementation and concurrent access scheme are kept simple by
design. However, our measurements show that we can compete with sched_rt
on large multi-CPUs machines [1].

Only the push path is addressed, the extension to use this structure
also for pull decisions is straightforward. However, we are currently
evaluating different (in order to decrease/avoid contention) data
structures to solve possibly both problems. We are also going to re-run
tests considering recent changes inside cpupri [2].

[1] http://retis.sssup.it/~jlelli/papers/Ospert11Lelli.pdf
[2] http://www.spinics.net/lists/linux-rt-users/msg06778.html

Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
---
 kernel/sched/Makefile      |    2 +-
 kernel/sched/core.c        |    3 +
 kernel/sched/cpudeadline.c |  216 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/cpudeadline.h |   33 +++++++
 kernel/sched/deadline.c    |   53 +++--------
 kernel/sched/sched.h       |    2 +
 6 files changed, 269 insertions(+), 40 deletions(-)
 create mode 100644 kernel/sched/cpudeadline.c
 create mode 100644 kernel/sched/cpudeadline.h

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index d77282f..ba43447 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -12,7 +12,7 @@ CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer
 endif
 
 obj-y += core.o proc.o clock.o cputime.o idle_task.o fair.o rt.o deadline.o stop_task.o
-obj-$(CONFIG_SMP) += cpupri.o
+obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 08500e0..6531760 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5398,6 +5398,7 @@ static void free_rootdomain(struct rcu_head *rcu)
 	struct root_domain *rd = container_of(rcu, struct root_domain, rcu);
 
 	cpupri_cleanup(&rd->cpupri);
+	cpudl_cleanup(&rd->cpudl);
 	free_cpumask_var(rd->dlo_mask);
 	free_cpumask_var(rd->rto_mask);
 	free_cpumask_var(rd->online);
@@ -5456,6 +5457,8 @@ static int init_rootdomain(struct root_domain *rd)
 		goto free_dlo_mask;
 
 	init_dl_bw(&rd->dl_bw);
+	if (cpudl_init(&rd->cpudl) != 0)
+		goto free_dlo_mask;
 
 	if (cpupri_init(&rd->cpupri) != 0)
 		goto free_rto_mask;
diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
new file mode 100644
index 0000000..13a31a9
--- /dev/null
+++ b/kernel/sched/cpudeadline.c
@@ -0,0 +1,216 @@
+/*
+ *  kernel/sched/cpudl.c
+ *
+ *  Global CPU deadline management
+ *
+ *  Author: Juri Lelli <j.lelli@sssup.it>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; version 2
+ *  of the License.
+ */
+
+#include <linux/gfp.h>
+#include <linux/kernel.h>
+#include "cpudeadline.h"
+
+static inline int parent(int i)
+{
+	return (i - 1) >> 1;
+}
+
+static inline int left_child(int i)
+{
+	return (i << 1) + 1;
+}
+
+static inline int right_child(int i)
+{
+	return (i << 1) + 2;
+}
+
+static inline int dl_time_before(u64 a, u64 b)
+{
+	return (s64)(a - b) < 0;
+}
+
+void cpudl_exchange(struct cpudl *cp, int a, int b)
+{
+	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
+
+	swap(cp->elements[a], cp->elements[b]);
+	swap(cp->cpu_to_idx[cpu_a], cp->cpu_to_idx[cpu_b]);
+}
+
+void cpudl_heapify(struct cpudl *cp, int idx)
+{
+	int l, r, largest;
+
+	/* adapted from lib/prio_heap.c */
+	while(1) {
+		l = left_child(idx);
+		r = right_child(idx);
+		largest = idx;
+
+		if ((l < cp->size) && dl_time_before(cp->elements[idx].dl,
+							cp->elements[l].dl))
+			largest = l;
+		if ((r < cp->size) && dl_time_before(cp->elements[largest].dl,
+							cp->elements[r].dl))
+			largest = r;
+		if (largest == idx)
+			break;
+
+		/* Push idx down the heap one level and bump one up */
+		cpudl_exchange(cp, largest, idx);
+		idx = largest;
+	}
+}
+
+void cpudl_change_key(struct cpudl *cp, int idx, u64 new_dl)
+{
+	WARN_ON(idx > num_present_cpus() || idx == IDX_INVALID);
+
+	if (dl_time_before(new_dl, cp->elements[idx].dl)) {
+		cp->elements[idx].dl = new_dl;
+		cpudl_heapify(cp, idx);
+	} else {
+		cp->elements[idx].dl = new_dl;
+		while (idx > 0 && dl_time_before(cp->elements[parent(idx)].dl,
+					cp->elements[idx].dl)) {
+			cpudl_exchange(cp, idx, parent(idx));
+			idx = parent(idx);
+		}
+	}
+}
+
+static inline int cpudl_maximum(struct cpudl *cp)
+{
+	return cp->elements[0].cpu;
+}
+
+/*
+ * cpudl_find - find the best (later-dl) CPU in the system
+ * @cp: the cpudl max-heap context
+ * @p: the task
+ * @later_mask: a mask to fill in with the selected CPUs (or NULL)
+ *
+ * Returns: int - best CPU (heap maximum if suitable)
+ */
+int cpudl_find(struct cpudl *cp, struct task_struct *p,
+	       struct cpumask *later_mask)
+{
+	int best_cpu = -1;
+	const struct sched_dl_entity *dl_se = &p->dl;
+
+	if (later_mask && cpumask_and(later_mask, cp->free_cpus,
+			&p->cpus_allowed) && cpumask_and(later_mask,
+			later_mask, cpu_active_mask)) {
+		best_cpu = cpumask_any(later_mask);
+		goto out;
+	} else if (cpumask_test_cpu(cpudl_maximum(cp), &p->cpus_allowed) &&
+			dl_time_before(dl_se->deadline, cp->elements[0].dl)) {
+		best_cpu = cpudl_maximum(cp);
+		if (later_mask)
+			cpumask_set_cpu(best_cpu, later_mask);
+	}
+
+out:
+	WARN_ON(best_cpu > num_present_cpus() && best_cpu != -1);
+
+	return best_cpu;
+}
+
+/*
+ * cpudl_set - update the cpudl max-heap
+ * @cp: the cpudl max-heap context
+ * @cpu: the target cpu
+ * @dl: the new earliest deadline for this cpu
+ *
+ * Notes: assumes cpu_rq(cpu)->lock is locked
+ *
+ * Returns: (void)
+ */
+void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid)
+{
+	int old_idx, new_cpu;
+	unsigned long flags;
+
+	WARN_ON(cpu > num_present_cpus());
+
+	raw_spin_lock_irqsave(&cp->lock, flags);
+	old_idx = cp->cpu_to_idx[cpu];
+	if (!is_valid) {
+		/* remove item */
+		if (old_idx == IDX_INVALID) {
+			/* 
+			 * Nothing to remove if old_idx was invalid.
+			 * This could happen if a rq_offline_dl is
+			 * called for a CPU without -dl tasks running.
+			 */
+			goto out;
+		}
+		new_cpu = cp->elements[cp->size - 1].cpu;
+		cp->elements[old_idx].dl = cp->elements[cp->size - 1].dl;
+		cp->elements[old_idx].cpu = new_cpu;
+		cp->size--;
+		cp->cpu_to_idx[new_cpu] = old_idx;
+		cp->cpu_to_idx[cpu] = IDX_INVALID;
+		while (old_idx > 0 && dl_time_before(
+				cp->elements[parent(old_idx)].dl,
+				cp->elements[old_idx].dl)) {
+			cpudl_exchange(cp, old_idx, parent(old_idx));
+			old_idx = parent(old_idx);
+		}
+		cpumask_set_cpu(cpu, cp->free_cpus);
+                cpudl_heapify(cp, old_idx);
+
+		goto out;
+	}
+
+	if (old_idx == IDX_INVALID) {
+		cp->size++;
+		cp->elements[cp->size - 1].dl = 0;
+		cp->elements[cp->size - 1].cpu = cpu;
+		cp->cpu_to_idx[cpu] = cp->size - 1;
+		cpudl_change_key(cp, cp->size - 1, dl);
+		cpumask_clear_cpu(cpu, cp->free_cpus);
+	} else {
+		cpudl_change_key(cp, old_idx, dl);
+	}
+
+out:
+	raw_spin_unlock_irqrestore(&cp->lock, flags);
+}
+
+/*
+ * cpudl_init - initialize the cpudl structure
+ * @cp: the cpudl max-heap context
+ */
+int cpudl_init(struct cpudl *cp)
+{
+	int i;
+
+	memset(cp, 0, sizeof(*cp));
+	raw_spin_lock_init(&cp->lock);
+	cp->size = 0;
+	for (i = 0; i < NR_CPUS; i++)
+		cp->cpu_to_idx[i] = IDX_INVALID;
+	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL))
+		return -ENOMEM;
+	cpumask_setall(cp->free_cpus);
+
+	return 0;
+}
+
+/*
+ * cpudl_cleanup - clean up the cpudl structure
+ * @cp: the cpudl max-heap context
+ */
+void cpudl_cleanup(struct cpudl *cp)
+{
+	/*
+	 * nothing to do for the moment
+	 */
+}
diff --git a/kernel/sched/cpudeadline.h b/kernel/sched/cpudeadline.h
new file mode 100644
index 0000000..a202789
--- /dev/null
+++ b/kernel/sched/cpudeadline.h
@@ -0,0 +1,33 @@
+#ifndef _LINUX_CPUDL_H
+#define _LINUX_CPUDL_H
+
+#include <linux/sched.h>
+
+#define IDX_INVALID     -1
+
+struct array_item {
+	u64 dl;
+	int cpu;
+};
+
+struct cpudl {
+	raw_spinlock_t lock;
+	int size;
+	int cpu_to_idx[NR_CPUS];
+	struct array_item elements[NR_CPUS];
+	cpumask_var_t free_cpus;
+};
+
+
+#ifdef CONFIG_SMP
+int cpudl_find(struct cpudl *cp, struct task_struct *p,
+	       struct cpumask *later_mask);
+void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid);
+int cpudl_init(struct cpudl *cp);
+void cpudl_cleanup(struct cpudl *cp);
+#else
+#define cpudl_set(cp, cpu, dl) do { } while (0)
+#define cpudl_init() do { } while (0)
+#endif /* CONFIG_SMP */
+
+#endif /* _LINUX_CPUDL_H */
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index be84cb3..9c1fd55 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -16,6 +16,8 @@
  */
 #include "sched.h"
 
+#include <linux/slab.h>
+
 struct dl_bandwidth def_dl_bandwidth;
 
 static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
@@ -659,6 +661,7 @@ static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
 		 */
 		dl_rq->earliest_dl.next = dl_rq->earliest_dl.curr;
 		dl_rq->earliest_dl.curr = deadline;
+		cpudl_set(&rq->rd->cpudl, rq->cpu, deadline, 1);
 	} else if (dl_rq->earliest_dl.next == 0 ||
 		   dl_time_before(deadline, dl_rq->earliest_dl.next)) {
 		/*
@@ -682,6 +685,7 @@ static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
 	if (!dl_rq->dl_nr_running) {
 		dl_rq->earliest_dl.curr = 0;
 		dl_rq->earliest_dl.next = 0;
+		cpudl_set(&rq->rd->cpudl, rq->cpu, 0, 0);
 	} else {
 		struct rb_node *leftmost = dl_rq->rb_leftmost;
 		struct sched_dl_entity *entry;
@@ -689,6 +693,7 @@ static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
 		entry = rb_entry(leftmost, struct sched_dl_entity, rb_node);
 		dl_rq->earliest_dl.curr = entry->deadline;
 		dl_rq->earliest_dl.next = next_deadline(rq);
+		cpudl_set(&rq->rd->cpudl, rq->cpu, entry->deadline, 1);
 	}
 }
 
@@ -874,9 +879,6 @@ static void yield_task_dl(struct rq *rq)
 #ifdef CONFIG_SMP
 
 static int find_later_rq(struct task_struct *task);
-static int latest_cpu_find(struct cpumask *span,
-			   struct task_struct *task,
-			   struct cpumask *later_mask);
 
 static int
 select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
@@ -926,7 +928,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
 	 * let's hope p can move out.
 	 */
 	if (rq->curr->nr_cpus_allowed == 1 ||
-	    latest_cpu_find(rq->rd->span, rq->curr, NULL) == -1)
+	    cpudl_find(&rq->rd->cpudl, rq->curr, NULL) == -1)
 		return;
 
 	/*
@@ -934,7 +936,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
 	 * see if it is pushed or pulled somewhere else.
 	 */
 	if (p->nr_cpus_allowed != 1 &&
-	    latest_cpu_find(rq->rd->span, p, NULL) != -1)
+	    cpudl_find(&rq->rd->cpudl, p, NULL) != -1)
 		return;
 
 	resched_task(rq->curr);
@@ -1119,39 +1121,6 @@ next_node:
 	return NULL;
 }
 
-static int latest_cpu_find(struct cpumask *span,
-			   struct task_struct *task,
-			   struct cpumask *later_mask)
-{
-	const struct sched_dl_entity *dl_se = &task->dl;
-	int cpu, found = -1, best = 0;
-	u64 max_dl = 0;
-
-	for_each_cpu(cpu, span) {
-		struct rq *rq = cpu_rq(cpu);
-		struct dl_rq *dl_rq = &rq->dl;
-
-		if (cpumask_test_cpu(cpu, &task->cpus_allowed) &&
-		    (!dl_rq->dl_nr_running || dl_time_before(dl_se->deadline,
-		     dl_rq->earliest_dl.curr))) {
-			if (later_mask)
-				cpumask_set_cpu(cpu, later_mask);
-			if (!best && !dl_rq->dl_nr_running) {
-				best = 1;
-				found = cpu;
-			} else if (!best &&
-				   dl_time_before(max_dl,
-						  dl_rq->earliest_dl.curr)) {
-				max_dl = dl_rq->earliest_dl.curr;
-				found = cpu;
-			}
-		} else if (later_mask)
-			cpumask_clear_cpu(cpu, later_mask);
-	}
-
-	return found;
-}
-
 static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
 
 static int find_later_rq(struct task_struct *task)
@@ -1168,7 +1137,8 @@ static int find_later_rq(struct task_struct *task)
 	if (task->nr_cpus_allowed == 1)
 		return -1;
 
-	best_cpu = latest_cpu_find(task_rq(task)->rd->span, task, later_mask);
+	best_cpu = cpudl_find(&task_rq(task)->rd->cpudl,
+			task, later_mask);
 	if (best_cpu == -1)
 		return -1;
 
@@ -1544,6 +1514,9 @@ static void rq_online_dl(struct rq *rq)
 {
 	if (rq->dl.overloaded)
 		dl_set_overload(rq);
+
+	if (rq->dl.dl_nr_running > 0)
+		cpudl_set(&rq->rd->cpudl, rq->cpu, rq->dl.earliest_dl.curr, 1);
 }
 
 /* Assumes rq->lock is held */
@@ -1551,6 +1524,8 @@ static void rq_offline_dl(struct rq *rq)
 {
 	if (rq->dl.overloaded)
 		dl_clear_overload(rq);
+
+	cpudl_set(&rq->rd->cpudl, rq->cpu, 0, 0);
 }
 
 void init_sched_dl_class(void)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bf4acf8..596e290 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -9,6 +9,7 @@
 #include <linux/tick.h>
 
 #include "cpupri.h"
+#include "cpudeadline.h"
 #include "cpuacct.h"
 
 struct rq;
@@ -507,6 +508,7 @@ struct root_domain {
 	cpumask_var_t dlo_mask;
 	atomic_t dlo_count;
 	struct dl_bw dl_bw;
+	struct cpudl cpudl;
 
 	/*
 	 * The "RT overload" flag: it gets set if a CPU has more than
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH 01/14] sched: add sched_class->task_dead.
  2013-11-07 13:43 ` [PATCH 01/14] sched: add sched_class->task_dead Juri Lelli
@ 2013-11-12  4:17   ` Paul Turner
  2013-11-12 17:19   ` Steven Rostedt
  2013-11-27 14:10   ` [tip:sched/core] sched: Add sched_class->task_dead() method tip-bot for Dario Faggioli
  2 siblings, 0 replies; 81+ messages in thread
From: Paul Turner @ 2013-11-12  4:17 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, rostedt,
	Oleg Nesterov, Frédéric Weisbecker, darren, johan.eker,
	p.faure, LKML, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, Dhaval Giani, hgu1972, Paul McKenney,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	Vincent Guittot, bruce.ashfield

On Thu, Nov 7, 2013 at 5:43 AM, Juri Lelli <juri.lelli@gmail.com> wrote:
> From: Dario Faggioli <raistlin@linux.it>
>
> Add a new function to the scheduling class interface. It is called
> at the end of a context switch, if the prev task is in TASK_DEAD state.
>
> It might be useful for the scheduling classes that want to be notified
> when one of their task dies, e.g. to perform some cleanup actions.

We could also usefully use this within SCHED_FAIR to remove a task's
load contribution when it completes.

>
> Signed-off-by: Dario Faggioli <raistlin@linux.it>
> Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
> ---
>  kernel/sched/core.c  |    3 +++
>  kernel/sched/sched.h |    1 +
>  2 files changed, 4 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5ac63c9..850a02c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1890,6 +1890,9 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
>         if (mm)
>                 mmdrop(mm);
>         if (unlikely(prev_state == TASK_DEAD)) {
> +               if (prev->sched_class->task_dead)
> +                       prev->sched_class->task_dead(prev);
> +
>                 /*
>                  * Remove function-return probe instances associated with this
>                  * task and put them back on the free list.
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b3c5653..64eda5c 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -992,6 +992,7 @@ struct sched_class {
>         void (*set_curr_task) (struct rq *rq);
>         void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
>         void (*task_fork) (struct task_struct *p);
> +       void (*task_dead) (struct task_struct *p);
>
>         void (*switched_from) (struct rq *this_rq, struct task_struct *task);
>         void (*switched_to) (struct rq *this_rq, struct task_struct *task);

Reviewed-by: Paul Turner <pjt@google.com>

> --
> 1.7.9.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 01/14] sched: add sched_class->task_dead.
  2013-11-07 13:43 ` [PATCH 01/14] sched: add sched_class->task_dead Juri Lelli
  2013-11-12  4:17   ` Paul Turner
@ 2013-11-12 17:19   ` Steven Rostedt
  2013-11-12 17:53     ` Juri Lelli
  2013-11-27 14:10   ` [tip:sched/core] sched: Add sched_class->task_dead() method tip-bot for Dario Faggioli
  2 siblings, 1 reply; 81+ messages in thread
From: Steven Rostedt @ 2013-11-12 17:19 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield

On Thu,  7 Nov 2013 14:43:35 +0100
Juri Lelli <juri.lelli@gmail.com> wrote:

> From: Dario Faggioli <raistlin@linux.it>
> 
> Add a new function to the scheduling class interface. It is called
> at the end of a context switch, if the prev task is in TASK_DEAD state.
> 
> It might be useful for the scheduling classes that want to be notified
> when one of their task dies, e.g. to perform some cleanup actions.

Nit.  s/task/tasks/

-- Steve

> 
> Signed-off-by: Dario Faggioli <raistlin@linux.it>
> Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
> ---
>  kernel/sched/core.c  |    3 +++
>  kernel/sched/sched.h |    1 +
>  2 files changed, 4 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5ac63c9..850a02c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1890,6 +1890,9 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
>  	if (mm)
>  		mmdrop(mm);
>  	if (unlikely(prev_state == TASK_DEAD)) {
> +		if (prev->sched_class->task_dead)
> +			prev->sched_class->task_dead(prev);
> +
>  		/*
>  		 * Remove function-return probe instances associated with this
>  		 * task and put them back on the free list.
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b3c5653..64eda5c 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -992,6 +992,7 @@ struct sched_class {
>  	void (*set_curr_task) (struct rq *rq);
>  	void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
>  	void (*task_fork) (struct task_struct *p);
> +	void (*task_dead) (struct task_struct *p);
>  
>  	void (*switched_from) (struct rq *this_rq, struct task_struct *task);
>  	void (*switched_to) (struct rq *this_rq, struct task_struct *task);


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 02/14] sched: add extended scheduling interface.
  2013-11-07 13:43 ` [PATCH 02/14] sched: add extended scheduling interface Juri Lelli
@ 2013-11-12 17:23   ` Steven Rostedt
  2013-11-13  8:43     ` Juri Lelli
  2013-11-12 17:32   ` Steven Rostedt
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 81+ messages in thread
From: Steven Rostedt @ 2013-11-12 17:23 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield

On Thu,  7 Nov 2013 14:43:36 +0100
Juri Lelli <juri.lelli@gmail.com> wrote:


> + * This is reflected by the actual fields of the sched_param2 structure:
> + *
> + *  @sched_priority     task's priority (might still be useful)
> + *  @sched_deadline     representative of the task's deadline
> + *  @sched_runtime      representative of the task's runtime
> + *  @sched_period       representative of the task's period
> + *  @sched_flags        for customizing the scheduler behaviour
> + *
> + * Given this task model, there are a multiplicity of scheduling algorithms
> + * and policies, that can be used to ensure all the tasks will make their
> + * timing constraints.
> + *
> + * @__unused		padding to allow future expansion without ABI issues
> + */
> +struct sched_param2 {
> +	int sched_priority;
> +	unsigned int sched_flags;

I'm just thinking, if we are creating a new structure, and this
structure already contains u64 elements, why not make sched_flags u64
too? We are now just limiting the total number of possible flags to 32.
I'm not sure how many flags will be needed in the future, maybe 32 is
good enough, but just something to think about.

Of course you can argue that the int sched_flags matches the int
sched_priority leaving out any holes in the structure, which is a
legitimate argument.

> +	u64 sched_runtime;
> +	u64 sched_deadline;
> +	u64 sched_period;
> +
> +	u64 __unused[12];

And in the future, we could use one of these __unused[12] as a
sched_flags2;

I'm not saying we should make it u64, just wanted to make sure we are
fine with it as 32 for now.

-- Steve



> +};
> +

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 02/14] sched: add extended scheduling interface.
  2013-11-07 13:43 ` [PATCH 02/14] sched: add extended scheduling interface Juri Lelli
  2013-11-12 17:23   ` Steven Rostedt
@ 2013-11-12 17:32   ` Steven Rostedt
  2013-11-13  9:07     ` Juri Lelli
  2013-11-27 13:23   ` [PATCH 02/14] sched: add extended scheduling interface. (new ABI) Ingo Molnar
  2014-01-13 15:53   ` [tip:sched/core] sched: Add new scheduler syscalls to support an extended scheduling parameters ABI tip-bot for Dario Faggioli
  3 siblings, 1 reply; 81+ messages in thread
From: Steven Rostedt @ 2013-11-12 17:32 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield

On Thu,  7 Nov 2013 14:43:36 +0100
Juri Lelli <juri.lelli@gmail.com> wrote:

  
> +static int
> +do_sched_setscheduler2(pid_t pid, int policy,
> +			 struct sched_param2 __user *param2)
> +{
> +	struct sched_param2 lparam2;
> +	struct task_struct *p;
> +	int retval;
> +
> +	if (!param2 || pid < 0)
> +		return -EINVAL;
> +
> +	memset(&lparam2, 0, sizeof(struct sched_param2));
> +	if (copy_from_user(&lparam2, param2, sizeof(struct sched_param2)))
> +		return -EFAULT;

Why the memset() before the copy_from_user()? We are copying
sizeof(sched_param2) anyway, and should overwrite anything that was on
the stack. I'm not aware of any possible leak from copying from
userspace. I could understand it if we were copying to userspace.

do_sched_setscheduler() doesn't do that either.

> +
> +	rcu_read_lock();
> +	retval = -ESRCH;
> +	p = find_process_by_pid(pid);
> +	if (p != NULL)
> +		retval = sched_setscheduler2(p, policy, &lparam2);
> +	rcu_read_unlock();
> +
> +	return retval;
> +}
> +
>  /**
>   * sys_sched_setscheduler - set/change the scheduler policy and RT priority
>   * @pid: the pid in question.
> @@ -3514,6 +3553,21 @@ SYSCALL_DEFINE3(sched_setscheduler, pid_t, pid, int, policy,
>  }
>  
>  /**
> + * sys_sched_setscheduler2 - same as above, but with extended sched_param
> + * @pid: the pid in question.
> + * @policy: new policy (could use extended sched_param).
> + * @param: structure containg the extended parameters.
> + */
> +SYSCALL_DEFINE3(sched_setscheduler2, pid_t, pid, int, policy,
> +		struct sched_param2 __user *, param2)
> +{
> +	if (policy < 0)
> +		return -EINVAL;
> +
> +	return do_sched_setscheduler2(pid, policy, param2);
> +}
> +
> +/**
>   * sys_sched_setparam - set/change the RT priority of a thread
>   * @pid: the pid in question.
>   * @param: structure containing the new RT priority.
> @@ -3526,6 +3580,17 @@ SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
>  }
>  
>  /**
> + * sys_sched_setparam2 - same as above, but with extended sched_param
> + * @pid: the pid in question.
> + * @param2: structure containing the extended parameters.
> + */
> +SYSCALL_DEFINE2(sched_setparam2, pid_t, pid,
> +		struct sched_param2 __user *, param2)
> +{
> +	return do_sched_setscheduler2(pid, -1, param2);
> +}
> +
> +/**
>   * sys_sched_getscheduler - get the policy (scheduling class) of a thread
>   * @pid: the pid in question.
>   *
> @@ -3595,6 +3660,45 @@ out_unlock:
>  	return retval;
>  }
>  
> +/**
> + * sys_sched_getparam2 - same as above, but with extended sched_param
> + * @pid: the pid in question.
> + * @param2: structure containing the extended parameters.
> + */
> +SYSCALL_DEFINE2(sched_getparam2, pid_t, pid,
> +		struct sched_param2 __user *, param2)
> +{
> +	struct sched_param2 lp;
> +	struct task_struct *p;
> +	int retval;
> +
> +	if (!param2 || pid < 0)
> +		return -EINVAL;
> +
> +	rcu_read_lock();
> +	p = find_process_by_pid(pid);
> +	retval = -ESRCH;
> +	if (!p)
> +		goto out_unlock;
> +
> +	retval = security_task_getscheduler(p);
> +	if (retval)
> +		goto out_unlock;
> +
> +	lp.sched_priority = p->rt_priority;
> +	rcu_read_unlock();
> +

OK, now we are missing the memset(). This does leak info, as lp never
was set to zero, it just contains anything on the stack, and the only
value you updated was sched_priority. We just copied to user memory
from the kernel stack.

-- Steve



> +	retval = copy_to_user(param2, &lp,
> +			sizeof(struct sched_param2)) ? -EFAULT : 0;
> +
> +	return retval;
> +
> +out_unlock:
> +	rcu_read_unlock();
> +	return retval;
> +
> +}
> +
>  long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
>  {
>  	cpumask_var_t cpus_allowed, new_mask;


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 01/14] sched: add sched_class->task_dead.
  2013-11-12 17:19   ` Steven Rostedt
@ 2013-11-12 17:53     ` Juri Lelli
  0 siblings, 0 replies; 81+ messages in thread
From: Juri Lelli @ 2013-11-12 17:53 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield

On 11/12/2013 06:19 PM, Steven Rostedt wrote:
> On Thu,  7 Nov 2013 14:43:35 +0100
> Juri Lelli <juri.lelli@gmail.com> wrote:
> 
>> From: Dario Faggioli <raistlin@linux.it>
>>
>> Add a new function to the scheduling class interface. It is called
>> at the end of a context switch, if the prev task is in TASK_DEAD state.
>>
>> It might be useful for the scheduling classes that want to be notified
>> when one of their task dies, e.g. to perform some cleanup actions.
> 
> Nit.  s/task/tasks/
>

Amended, thanks!

Best,

- Juri

>>
>> Signed-off-by: Dario Faggioli <raistlin@linux.it>
>> Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
>> ---
>>  kernel/sched/core.c  |    3 +++
>>  kernel/sched/sched.h |    1 +
>>  2 files changed, 4 insertions(+)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 5ac63c9..850a02c 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1890,6 +1890,9 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
>>  	if (mm)
>>  		mmdrop(mm);
>>  	if (unlikely(prev_state == TASK_DEAD)) {
>> +		if (prev->sched_class->task_dead)
>> +			prev->sched_class->task_dead(prev);
>> +
>>  		/*
>>  		 * Remove function-return probe instances associated with this
>>  		 * task and put them back on the free list.
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index b3c5653..64eda5c 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -992,6 +992,7 @@ struct sched_class {
>>  	void (*set_curr_task) (struct rq *rq);
>>  	void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
>>  	void (*task_fork) (struct task_struct *p);
>> +	void (*task_dead) (struct task_struct *p);
>>  
>>  	void (*switched_from) (struct rq *this_rq, struct task_struct *task);
>>  	void (*switched_to) (struct rq *this_rq, struct task_struct *task);
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 03/14] sched: SCHED_DEADLINE structures & implementation.
  2013-11-07 13:43 ` [PATCH 03/14] sched: SCHED_DEADLINE structures & implementation Juri Lelli
@ 2013-11-13  2:31   ` Steven Rostedt
  2013-11-13  9:54     ` Juri Lelli
  2013-11-20 20:23   ` Steven Rostedt
  2014-01-13 15:53   ` [tip:sched/core] sched/deadline: Add " tip-bot for Dario Faggioli
  2 siblings, 1 reply; 81+ messages in thread
From: Steven Rostedt @ 2013-11-13  2:31 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield

On Thu,  7 Nov 2013 14:43:37 +0100
Juri Lelli <juri.lelli@gmail.com> wrote:

> From: Dario Faggioli <raistlin@linux.it>

> --- /dev/null
> +++ b/include/linux/sched/deadline.h
> @@ -0,0 +1,24 @@
> +#ifndef _SCHED_DEADLINE_H
> +#define _SCHED_DEADLINE_H
> +
> +/*
> + * SCHED_DEADLINE tasks has negative priorities, reflecting
> + * the fact that any of them has higher prio than RT and
> + * NORMAL/BATCH tasks.
> + */
> +
> +#define MAX_DL_PRIO		0
> +
> +static inline int dl_prio(int prio)
> +{
> +	if (unlikely(prio < MAX_DL_PRIO))
> +		return 1;
> +	return 0;
> +}
> +
> +static inline int dl_task(struct task_struct *p)
> +{
> +	return dl_prio(p->prio);
> +}
> +
> +#endif /* _SCHED_DEADLINE_H */
> diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
> index 440434d..a157797 100644
> --- a/include/linux/sched/rt.h
> +++ b/include/linux/sched/rt.h
> @@ -22,7 +22,7 @@
>  
>  static inline int rt_prio(int prio)
>  {
> -	if (unlikely(prio < MAX_RT_PRIO))
> +	if ((unsigned)prio < MAX_RT_PRIO)

Why remove the "unlikely" here?

>  		return 1;
>  	return 0;
>  }
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 5a0f945..2d5e49a 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -39,6 +39,7 @@
>  #define SCHED_BATCH		3
>  /* SCHED_ISO: reserved but not implemented yet */
>  #define SCHED_IDLE		5
> +#define SCHED_DEADLINE		6
>  /* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
>  #define SCHED_RESET_ON_FORK     0x40000000
>  
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 086fe73..55fc95f 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1313,7 +1313,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  #endif
>  
>  	/* Perform scheduler related setup. Assign this task to a CPU. */
> -	sched_fork(p);
> +	retval = sched_fork(p);
> +	if (retval)
> +		goto bad_fork_cleanup_policy;
>  
>  	retval = perf_event_init_task(p);
>  	if (retval)
> diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
> index 383319b..0909436 100644
> --- a/kernel/hrtimer.c
> +++ b/kernel/hrtimer.c
> @@ -46,6 +46,7 @@
>  #include <linux/sched.h>
>  #include <linux/sched/sysctl.h>
>  #include <linux/sched/rt.h>
> +#include <linux/sched/deadline.h>
>  #include <linux/timer.h>
>  #include <linux/freezer.h>
>  
> @@ -1610,7 +1611,7 @@ long hrtimer_nanosleep(struct timespec *rqtp, struct timespec __user *rmtp,
>  	unsigned long slack;
>  
>  	slack = current->timer_slack_ns;
> -	if (rt_task(current))
> +	if (dl_task(current) || rt_task(current))

Since dl_task() checks if prio is less than 0, and rt_task checks for
prio < MAX_RT_PRIO, I wonder if we can introduce a

	dl_or_rt_task(current)

that does a signed compare against MAX_RT_PRIO to eliminate the double
compare (in case gcc doesn't figure it out).

Not something that we need to change now, but something in the future
maybe.

>  		slack = 0;
>  
>  	hrtimer_init_on_stack(&t.timer, clockid, mode);
> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> index 54adcf3..d77282f 100644
> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile
> @@ -11,7 +11,7 @@ ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
>  CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer
>  endif
>  
> -obj-y += core.o proc.o clock.o cputime.o idle_task.o fair.o rt.o stop_task.o
> +obj-y += core.o proc.o clock.o cputime.o idle_task.o fair.o rt.o deadline.o stop_task.o
>  obj-$(CONFIG_SMP) += cpupri.o
>  obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
>  obj-$(CONFIG_SCHEDSTATS) += stats.o
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4fcbf13..cfe15bfc 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -903,7 +903,9 @@ static inline int normal_prio(struct task_struct *p)
>  {
>  	int prio;
>  
> -	if (task_has_rt_policy(p))
> +	if (task_has_dl_policy(p))
> +		prio = MAX_DL_PRIO-1;
> +	else if (task_has_rt_policy(p))
>  		prio = MAX_RT_PRIO-1 - p->rt_priority;
>  	else
>  		prio = __normal_prio(p);
> @@ -1611,6 +1613,12 @@ static void __sched_fork(struct task_struct *p)
>  	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
>  #endif
>  
> +	RB_CLEAR_NODE(&p->dl.rb_node);
> +	hrtimer_init(&p->dl.dl_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +	p->dl.dl_runtime = p->dl.runtime = 0;
> +	p->dl.dl_deadline = p->dl.deadline = 0;
> +	p->dl.flags = 0;
> +
>  	INIT_LIST_HEAD(&p->rt.run_list);
>  
>  #ifdef CONFIG_PREEMPT_NOTIFIERS
> @@ -1654,7 +1662,7 @@ void set_numabalancing_state(bool enabled)
>  /*
>   * fork()/clone()-time setup:
>   */
> -void sched_fork(struct task_struct *p)
> +int sched_fork(struct task_struct *p)
>  {
>  	unsigned long flags;
>  	int cpu = get_cpu();
> @@ -1676,7 +1684,7 @@ void sched_fork(struct task_struct *p)
>  	 * Revert to default priority/policy on fork if requested.
>  	 */
>  	if (unlikely(p->sched_reset_on_fork)) {
> -		if (task_has_rt_policy(p)) {
> +		if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
>  			p->policy = SCHED_NORMAL;
>  			p->static_prio = NICE_TO_PRIO(0);
>  			p->rt_priority = 0;
> @@ -1693,8 +1701,14 @@ void sched_fork(struct task_struct *p)
>  		p->sched_reset_on_fork = 0;
>  	}
>  
> -	if (!rt_prio(p->prio))
> +	if (dl_prio(p->prio)) {
> +		put_cpu();
> +		return -EAGAIN;

Deadline tasks are not allowed to fork? Well, without setting
reset_on_fork. Probably should add a comment here.

> +	} else if (rt_prio(p->prio)) {
> +		p->sched_class = &rt_sched_class;
> +	} else {
>  		p->sched_class = &fair_sched_class;
> +	}
>  
>  	if (p->sched_class->task_fork)
>  		p->sched_class->task_fork(p);
> @@ -1726,6 +1740,7 @@ void sched_fork(struct task_struct *p)
>  #endif
>  
>  	put_cpu();
> +	return 0;
>  }
>  
>  /*
> @@ -3029,7 +3044,7 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
>  	struct rq *rq;
>  	const struct sched_class *prev_class;
>  
> -	BUG_ON(prio < 0 || prio > MAX_PRIO);
> +	BUG_ON(prio > MAX_PRIO);

Should we have this be:

	BUG_ON(prio < -MAX_PRIO || prio > MAX_PRIO);
?

>  
>  	rq = __task_rq_lock(p);
>  
> @@ -3061,7 +3076,9 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
>  	if (running)
>  		p->sched_class->put_prev_task(rq, p);
>  
> -	if (rt_prio(prio))
> +	if (dl_prio(prio))
> +		p->sched_class = &dl_sched_class;
> +	else if (rt_prio(prio))
>  		p->sched_class = &rt_sched_class;
>  	else
>  		p->sched_class = &fair_sched_class;
> @@ -3095,9 +3112,9 @@ void set_user_nice(struct task_struct *p, long nice)
>  	 * The RT priorities are set via sched_setscheduler(), but we still
>  	 * allow the 'normal' nice value to be set - but as expected
>  	 * it wont have any effect on scheduling until the task is
> -	 * SCHED_FIFO/SCHED_RR:
> +	 * SCHED_DEADLINE, SCHED_FIFO or SCHED_RR:
>  	 */
> -	if (task_has_rt_policy(p)) {
> +	if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
>  		p->static_prio = NICE_TO_PRIO(nice);
>  		goto out_unlock;
>  	}
> @@ -3261,7 +3278,9 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
>  	p->normal_prio = normal_prio(p);
>  	/* we are holding p->pi_lock already */
>  	p->prio = rt_mutex_getprio(p);
> -	if (rt_prio(p->prio))
> +	if (dl_prio(p->prio))
> +		p->sched_class = &dl_sched_class;
> +	else if (rt_prio(p->prio))
>  		p->sched_class = &rt_sched_class;
>  	else
>  		p->sched_class = &fair_sched_class;
> @@ -3269,6 +3288,50 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
>  }
>  
>  /*
> + * This function initializes the sched_dl_entity of a newly becoming
> + * SCHED_DEADLINE task.
> + *
> + * Only the static values are considered here, the actual runtime and the
> + * absolute deadline will be properly calculated when the task is enqueued
> + * for the first time with its new policy.
> + */
> +static void
> +__setparam_dl(struct task_struct *p, const struct sched_param2 *param2)
> +{
> +	struct sched_dl_entity *dl_se = &p->dl;
> +
> +	init_dl_task_timer(dl_se);
> +	dl_se->dl_runtime = param2->sched_runtime;
> +	dl_se->dl_deadline = param2->sched_deadline;
> +	dl_se->flags = param2->sched_flags;
> +	dl_se->dl_throttled = 0;
> +	dl_se->dl_new = 1;
> +}
> +
> +static void
> +__getparam_dl(struct task_struct *p, struct sched_param2 *param2)
> +{
> +	struct sched_dl_entity *dl_se = &p->dl;
> +
> +	param2->sched_priority = p->rt_priority;
> +	param2->sched_runtime = dl_se->dl_runtime;
> +	param2->sched_deadline = dl_se->dl_deadline;
> +	param2->sched_flags = dl_se->flags;
> +}
> +
> +/*
> + * This function validates the new parameters of a -deadline task.
> + * We ask for the deadline not being zero, and greater or equal
> + * than the runtime.
> + */
> +static bool
> +__checkparam_dl(const struct sched_param2 *prm)
> +{
> +	return prm && (&prm->sched_deadline) != 0 &&
> +	       (s64)(&prm->sched_deadline - &prm->sched_runtime) >= 0;
> +}
> +
> +/*
>   * check the target process has a UID that matches the current process's
>   */
>  static bool check_same_owner(struct task_struct *p)
> @@ -3305,7 +3368,8 @@ recheck:
>  		reset_on_fork = !!(policy & SCHED_RESET_ON_FORK);
>  		policy &= ~SCHED_RESET_ON_FORK;
>  
> -		if (policy != SCHED_FIFO && policy != SCHED_RR &&
> +		if (policy != SCHED_DEADLINE &&
> +				policy != SCHED_FIFO && policy != SCHED_RR &&
>  				policy != SCHED_NORMAL && policy != SCHED_BATCH &&
>  				policy != SCHED_IDLE)
>  			return -EINVAL;
> @@ -3320,7 +3384,8 @@ recheck:
>  	    (p->mm && param->sched_priority > MAX_USER_RT_PRIO-1) ||
>  	    (!p->mm && param->sched_priority > MAX_RT_PRIO-1))
>  		return -EINVAL;
> -	if (rt_policy(policy) != (param->sched_priority != 0))
> +	if ((dl_policy(policy) && !__checkparam_dl(param)) ||
> +	    (rt_policy(policy) != (param->sched_priority != 0)))
>  		return -EINVAL;
>  
>  	/*
> @@ -3386,7 +3451,8 @@ recheck:
>  	 * If not changing anything there's no need to proceed further:
>  	 */
>  	if (unlikely(policy == p->policy && (!rt_policy(policy) ||
> -			param->sched_priority == p->rt_priority))) {
> +			param->sched_priority == p->rt_priority) &&
> +			!dl_policy(policy))) {
>  		task_rq_unlock(rq, p, &flags);
>  		return 0;
>  	}
> @@ -3423,7 +3489,11 @@ recheck:
>  
>  	oldprio = p->prio;
>  	prev_class = p->sched_class;
> -	__setscheduler(rq, p, policy, param->sched_priority);
> +	if (dl_policy(policy)) {
> +		__setparam_dl(p, param);
> +		__setscheduler(rq, p, policy, param->sched_priority);
> +	} else
> +		__setscheduler(rq, p, policy, param->sched_priority);

Why the double "__setscheduler()" call? Why not just:

	if (dl_policy(policy))
		__setparam_dl(p, param);
	__setscheduler(rq, p, policy, param->sched_priority);

?

>  
>  	if (running)
>  		p->sched_class->set_curr_task(rq);
> @@ -3527,8 +3597,11 @@ do_sched_setscheduler2(pid_t pid, int policy,
>  	rcu_read_lock();
>  	retval = -ESRCH;
>  	p = find_process_by_pid(pid);
> -	if (p != NULL)
> +	if (p != NULL) {
> +		if (dl_policy(policy))
> +			lparam2.sched_priority = 0;
>  		retval = sched_setscheduler2(p, policy, &lparam2);
> +	}
>  	rcu_read_unlock();
>  
>  	return retval;
> @@ -3685,7 +3758,10 @@ SYSCALL_DEFINE2(sched_getparam2, pid_t, pid,
>  	if (retval)
>  		goto out_unlock;
>  
> -	lp.sched_priority = p->rt_priority;
> +	if (task_has_dl_policy(p))
> +		__getparam_dl(p, &lp);
> +	else
> +		lp.sched_priority = p->rt_priority;
>  	rcu_read_unlock();
>  
>  	retval = copy_to_user(param2, &lp,
> @@ -4120,6 +4196,7 @@ SYSCALL_DEFINE1(sched_get_priority_max, int, policy)
>  	case SCHED_RR:
>  		ret = MAX_USER_RT_PRIO-1;
>  		break;
> +	case SCHED_DEADLINE:
>  	case SCHED_NORMAL:
>  	case SCHED_BATCH:
>  	case SCHED_IDLE:
> @@ -4146,6 +4223,7 @@ SYSCALL_DEFINE1(sched_get_priority_min, int, policy)
>  	case SCHED_RR:
>  		ret = 1;
>  		break;
> +	case SCHED_DEADLINE:
>  	case SCHED_NORMAL:
>  	case SCHED_BATCH:
>  	case SCHED_IDLE:
> @@ -6563,6 +6641,7 @@ void __init sched_init(void)
>  		rq->calc_load_update = jiffies + LOAD_FREQ;
>  		init_cfs_rq(&rq->cfs);
>  		init_rt_rq(&rq->rt, rq);
> +		init_dl_rq(&rq->dl, rq);
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  		root_task_group.shares = ROOT_TASK_GROUP_LOAD;
>  		INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
> @@ -6746,7 +6825,7 @@ void normalize_rt_tasks(void)
>  		p->se.statistics.block_start	= 0;
>  #endif
>  
> -		if (!rt_task(p)) {
> +		if (!dl_task(p) && !rt_task(p)) {

Again, a future enhancement is to combine these two.

>  			/*
>  			 * Renice negative nice level userspace
>  			 * tasks back to 0:
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> new file mode 100644
> index 0000000..cb93f2e
> --- /dev/null
> +++ b/kernel/sched/deadline.c
> @@ -0,0 +1,682 @@
> +/*
> + * Deadline Scheduling Class (SCHED_DEADLINE)
> + *
> + * Earliest Deadline First (EDF) + Constant Bandwidth Server (CBS).
> + *
> + * Tasks that periodically executes their instances for less than their
> + * runtime won't miss any of their deadlines.
> + * Tasks that are not periodic or sporadic or that tries to execute more
> + * than their reserved bandwidth will be slowed down (and may potentially
> + * miss some of their deadlines), and won't affect any other task.
> + *
> + * Copyright (C) 2012 Dario Faggioli <raistlin@linux.it>,
> + *                    Michael Trimarchi <michael@amarulasolutions.com>,
> + *                    Fabio Checconi <fchecconi@gmail.com>
> + */
> +#include "sched.h"
> +
> +static inline int dl_time_before(u64 a, u64 b)
> +{
> +	return (s64)(a - b) < 0;

I think I've seen this pattern enough that we probably should have a
helper function that does this. Off topic for this thread, but just
something for us to think about.

> +}
> +
> +static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
> +{
> +	return container_of(dl_se, struct task_struct, dl);
> +}
> +
> +static inline struct rq *rq_of_dl_rq(struct dl_rq *dl_rq)
> +{
> +	return container_of(dl_rq, struct rq, dl);
> +}
> +
> +static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
> +{
> +	struct task_struct *p = dl_task_of(dl_se);
> +	struct rq *rq = task_rq(p);
> +
> +	return &rq->dl;
> +}
> +
> +static inline int on_dl_rq(struct sched_dl_entity *dl_se)
> +{
> +	return !RB_EMPTY_NODE(&dl_se->rb_node);
> +}
> +
> +static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
> +{
> +	struct sched_dl_entity *dl_se = &p->dl;
> +
> +	return dl_rq->rb_leftmost == &dl_se->rb_node;
> +}
> +
> +void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
> +{
> +	dl_rq->rb_root = RB_ROOT;
> +}
> +
> +static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
> +static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
> +static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
> +				  int flags);
> +
> +/*
> + * We are being explicitly informed that a new instance is starting,
> + * and this means that:
> + *  - the absolute deadline of the entity has to be placed at
> + *    current time + relative deadline;
> + *  - the runtime of the entity has to be set to the maximum value.
> + *
> + * The capability of specifying such event is useful whenever a -deadline
> + * entity wants to (try to!) synchronize its behaviour with the scheduler's
> + * one, and to (try to!) reconcile itself with its own scheduling
> + * parameters.
> + */
> +static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
> +{
> +	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> +	struct rq *rq = rq_of_dl_rq(dl_rq);
> +
> +	WARN_ON(!dl_se->dl_new || dl_se->dl_throttled);
> +
> +	/*
> +	 * We use the regular wall clock time to set deadlines in the
> +	 * future; in fact, we must consider execution overheads (time
> +	 * spent on hardirq context, etc.).
> +	 */
> +	dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
> +	dl_se->runtime = dl_se->dl_runtime;
> +	dl_se->dl_new = 0;
> +}
> +
> +/*
> + * Pure Earliest Deadline First (EDF) scheduling does not deal with the
> + * possibility of a entity lasting more than what it declared, and thus
> + * exhausting its runtime.
> + *
> + * Here we are interested in making runtime overrun possible, but we do
> + * not want a entity which is misbehaving to affect the scheduling of all
> + * other entities.
> + * Therefore, a budgeting strategy called Constant Bandwidth Server (CBS)
> + * is used, in order to confine each entity within its own bandwidth.
> + *
> + * This function deals exactly with that, and ensures that when the runtime
> + * of a entity is replenished, its deadline is also postponed. That ensures
> + * the overrunning entity can't interfere with other entity in the system and
> + * can't make them miss their deadlines. Reasons why this kind of overruns
> + * could happen are, typically, a entity voluntarily trying to overcome its
> + * runtime, or it just underestimated it during sched_setscheduler_ex().
> + */
> +static void replenish_dl_entity(struct sched_dl_entity *dl_se)
> +{
> +	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> +	struct rq *rq = rq_of_dl_rq(dl_rq);
> +
> +	/*
> +	 * We keep moving the deadline away until we get some
> +	 * available runtime for the entity. This ensures correct
> +	 * handling of situations where the runtime overrun is
> +	 * arbitrary large.
> +	 */
> +	while (dl_se->runtime <= 0) {
> +		dl_se->deadline += dl_se->dl_deadline;
> +		dl_se->runtime += dl_se->dl_runtime;
> +	}
> +
> +	/*
> +	 * At this point, the deadline really should be "in
> +	 * the future" with respect to rq->clock. If it's
> +	 * not, we are, for some reason, lagging too much!
> +	 * Anyway, after having warn userspace abut that,
> +	 * we still try to keep the things running by
> +	 * resetting the deadline and the budget of the
> +	 * entity.
> +	 */
> +	if (dl_time_before(dl_se->deadline, rq_clock(rq))) {
> +		static bool lag_once = false;
> +
> +		if (!lag_once) {
> +			lag_once = true;
> +			printk_sched("sched: DL replenish lagged to much\n");
> +		}
> +		dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
> +		dl_se->runtime = dl_se->dl_runtime;
> +	}
> +}
> +
> +/*
> + * Here we check if --at time t-- an entity (which is probably being
> + * [re]activated or, in general, enqueued) can use its remaining runtime
> + * and its current deadline _without_ exceeding the bandwidth it is
> + * assigned (function returns true if it can't). We are in fact applying
> + * one of the CBS rules: when a task wakes up, if the residual runtime
> + * over residual deadline fits within the allocated bandwidth, then we
> + * can keep the current (absolute) deadline and residual budget without
> + * disrupting the schedulability of the system. Otherwise, we should
> + * refill the runtime and set the deadline a period in the future,
> + * because keeping the current (absolute) deadline of the task would
> + * result in breaking guarantees promised to other tasks.
> + *
> + * This function returns true if:
> + *
> + *   runtime / (deadline - t) > dl_runtime / dl_deadline ,


I'm a bit confused here about this algorithm. If I state I want to have
a run time of 20 ms out of 100 ms, so my dl_runtime / dl_deadline is
1/5 or %20.

Now if I don't don't get scheduled at the start of my new period
because other tasks are running, and I start say, at 50ms in. Thus, my
deadline - t == 50, but my runtime remaining is still 20ms. Thus I have
20/50 which is 2/5 or %40. Doing the calculation I have 40 > 20 being
true, and thus we label this as an overflow.

Is this what we really want? Or did I misunderstand something here?

Is this only for tasks that went to sleep voluntarily and have just
woken up?


> + *
> + * IOW we can't recycle current parameters.
> + */
> +static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
> +{
> +	u64 left, right;
> +
> +	/*
> +	 * left and right are the two sides of the equation above,
> +	 * after a bit of shuffling to use multiplications instead
> +	 * of divisions.
> +	 *
> +	 * Note that none of the time values involved in the two
> +	 * multiplications are absolute: dl_deadline and dl_runtime
> +	 * are the relative deadline and the maximum runtime of each
> +	 * instance, runtime is the runtime left for the last instance
> +	 * and (deadline - t), since t is rq->clock, is the time left
> +	 * to the (absolute) deadline. Even if overflowing the u64 type
> +	 * is very unlikely to occur in both cases, here we scale down
> +	 * as we want to avoid that risk at all. Scaling down by 10
> +	 * means that we reduce granularity to 1us. We are fine with it,
> +	 * since this is only a true/false check and, anyway, thinking
> +	 * of anything below microseconds resolution is actually fiction
> +	 * (but still we want to give the user that illusion >;).
> +	 */
> +	left = (dl_se->dl_deadline >> 10) * (dl_se->runtime >> 10);
> +	right = ((dl_se->deadline - t) >> 10) * (dl_se->dl_runtime >> 10);
> +
> +	return dl_time_before(right, left);
> +}
> +
> +/*
> + * When a -deadline entity is queued back on the runqueue, its runtime and
> + * deadline might need updating.
> + *
> + * The policy here is that we update the deadline of the entity only if:
> + *  - the current deadline is in the past,
> + *  - using the remaining runtime with the current deadline would make
> + *    the entity exceed its bandwidth.
> + */
> +static void update_dl_entity(struct sched_dl_entity *dl_se)
> +{
> +	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> +	struct rq *rq = rq_of_dl_rq(dl_rq);
> +
> +	/*
> +	 * The arrival of a new instance needs special treatment, i.e.,
> +	 * the actual scheduling parameters have to be "renewed".
> +	 */
> +	if (dl_se->dl_new) {
> +		setup_new_dl_entity(dl_se);
> +		return;
> +	}
> +
> +	if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
> +	    dl_entity_overflow(dl_se, rq_clock(rq))) {
> +		dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
> +		dl_se->runtime = dl_se->dl_runtime;

In the case that I stated, we just moved the deadline forward.

Or is this only because it was just enqueued, and we fear that it will
starve out other tasks? That is, this only happens if the task was
sleeping and just woke up.

-- Steve

> +	}
> +}
> +
> +/*
> + * If the entity depleted all its runtime, and if we want it to sleep
> + * while waiting for some new execution time to become available, we
> + * set the bandwidth enforcement timer to the replenishment instant
> + * and try to activate it.
> + *
> + * Notice that it is important for the caller to know if the timer
> + * actually started or not (i.e., the replenishment instant is in
> + * the future or in the past).
> + */
> +static int start_dl_timer(struct sched_dl_entity *dl_se)
> +{
> +	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> +	struct rq *rq = rq_of_dl_rq(dl_rq);
> +	ktime_t now, act;
> +	ktime_t soft, hard;
> +	unsigned long range;
> +	s64 delta;
> +
> +	/*
> +	 * We want the timer to fire at the deadline, but considering
> +	 * that it is actually coming from rq->clock and not from
> +	 * hrtimer's time base reading.
> +	 */
> +	act = ns_to_ktime(dl_se->deadline);
> +	now = hrtimer_cb_get_time(&dl_se->dl_timer);
> +	delta = ktime_to_ns(now) - rq_clock(rq);
> +	act = ktime_add_ns(act, delta);
> +
> +	/*
> +	 * If the expiry time already passed, e.g., because the value
> +	 * chosen as the deadline is too small, don't even try to
> +	 * start the timer in the past!
> +	 */
> +	if (ktime_us_delta(act, now) < 0)
> +		return 0;
> +
> +	hrtimer_set_expires(&dl_se->dl_timer, act);
> +
> +	soft = hrtimer_get_softexpires(&dl_se->dl_timer);
> +	hard = hrtimer_get_expires(&dl_se->dl_timer);
> +	range = ktime_to_ns(ktime_sub(hard, soft));
> +	__hrtimer_start_range_ns(&dl_se->dl_timer, soft,
> +				 range, HRTIMER_MODE_ABS, 0);
> +
> +	return hrtimer_active(&dl_se->dl_timer);
> +}
> +
> +/*
> + * This is the bandwidth enforcement timer callback. If here, we know
> + * a task is not on its dl_rq, since the fact that the timer was running
> + * means the task is throttled and needs a runtime replenishment.
> + *
> + * However, what we actually do depends on the fact the task is active,
> + * (it is on its rq) or has been removed from there by a call to
> + * dequeue_task_dl(). In the former case we must issue the runtime
> + * replenishment and add the task back to the dl_rq; in the latter, we just
> + * do nothing but clearing dl_throttled, so that runtime and deadline
> + * updating (and the queueing back to dl_rq) will be done by the
> + * next call to enqueue_task_dl().
> + */
> +static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
> +{
> +	struct sched_dl_entity *dl_se = container_of(timer,
> +						     struct sched_dl_entity,
> +						     dl_timer);
> +	struct task_struct *p = dl_task_of(dl_se);
> +	struct rq *rq = task_rq(p);
> +	raw_spin_lock(&rq->lock);
> +
> +	/*
> +	 * We need to take care of a possible races here. In fact, the
> +	 * task might have changed its scheduling policy to something
> +	 * different from SCHED_DEADLINE or changed its reservation
> +	 * parameters (through sched_setscheduler()).
> +	 */
> +	if (!dl_task(p) || dl_se->dl_new)
> +		goto unlock;
> +
> +	dl_se->dl_throttled = 0;
> +	if (p->on_rq) {
> +		enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
> +		if (task_has_dl_policy(rq->curr))
> +			check_preempt_curr_dl(rq, p, 0);
> +		else
> +			resched_task(rq->curr);
> +	}
> +unlock:
> +	raw_spin_unlock(&rq->lock);
> +
> +	return HRTIMER_NORESTART;
> +}
> +
> +void init_dl_task_timer(struct sched_dl_entity *dl_se)
> +{
> +	struct hrtimer *timer = &dl_se->dl_timer;
> +
> +	if (hrtimer_active(timer)) {
> +		hrtimer_try_to_cancel(timer);
> +		return;
> +	}
> +
> +	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +	timer->function = dl_task_timer;
> +}
> +
> +static
> +int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
> +{
> +	int dmiss = dl_time_before(dl_se->deadline, rq_clock(rq));
> +	int rorun = dl_se->runtime <= 0;
> +
> +	if (!rorun && !dmiss)
> +		return 0;
> +
> +	/*
> +	 * If we are beyond our current deadline and we are still
> +	 * executing, then we have already used some of the runtime of
> +	 * the next instance. Thus, if we do not account that, we are
> +	 * stealing bandwidth from the system at each deadline miss!
> +	 */
> +	if (dmiss) {
> +		dl_se->runtime = rorun ? dl_se->runtime : 0;
> +		dl_se->runtime -= rq_clock(rq) - dl_se->deadline;
> +	}
> +
> +	return 1;
> +}
> +
> +/*
> + * Update the current task's runtime statistics (provided it is still
> + * a -deadline task and has not been removed from the dl_rq).
> + */
> +static void update_curr_dl(struct rq *rq)
> +{
> +	struct task_struct *curr = rq->curr;
> +	struct sched_dl_entity *dl_se = &curr->dl;
> +	u64 delta_exec;
> +
> +	if (!dl_task(curr) || !on_dl_rq(dl_se))
> +		return;
> +
> +	/*
> +	 * Consumed budget is computed considering the time as
> +	 * observed by schedulable tasks (excluding time spent
> +	 * in hardirq context, etc.). Deadlines are instead
> +	 * computed using hard walltime. This seems to be the more
> +	 * natural solution, but the full ramifications of this
> +	 * approach need further study.
> +	 */
> +	delta_exec = rq_clock_task(rq) - curr->se.exec_start;
> +	if (unlikely((s64)delta_exec < 0))
> +		delta_exec = 0;
> +
> +	schedstat_set(curr->se.statistics.exec_max,
> +		      max(curr->se.statistics.exec_max, delta_exec));
> +
> +	curr->se.sum_exec_runtime += delta_exec;
> +	account_group_exec_runtime(curr, delta_exec);
> +
> +	curr->se.exec_start = rq_clock_task(rq);
> +	cpuacct_charge(curr, delta_exec);
> +
> +	dl_se->runtime -= delta_exec;
> +	if (dl_runtime_exceeded(rq, dl_se)) {
> +		__dequeue_task_dl(rq, curr, 0);
> +		if (likely(start_dl_timer(dl_se)))
> +			dl_se->dl_throttled = 1;
> +		else
> +			enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
> +
> +		if (!is_leftmost(curr, &rq->dl))
> +			resched_task(curr);
> +	}
> +}
> +
> +static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
> +{
> +	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> +	struct rb_node **link = &dl_rq->rb_root.rb_node;
> +	struct rb_node *parent = NULL;
> +	struct sched_dl_entity *entry;
> +	int leftmost = 1;
> +
> +	BUG_ON(!RB_EMPTY_NODE(&dl_se->rb_node));
> +
> +	while (*link) {
> +		parent = *link;
> +		entry = rb_entry(parent, struct sched_dl_entity, rb_node);
> +		if (dl_time_before(dl_se->deadline, entry->deadline))
> +			link = &parent->rb_left;
> +		else {
> +			link = &parent->rb_right;
> +			leftmost = 0;
> +		}
> +	}
> +
> +	if (leftmost)
> +		dl_rq->rb_leftmost = &dl_se->rb_node;
> +
> +	rb_link_node(&dl_se->rb_node, parent, link);
> +	rb_insert_color(&dl_se->rb_node, &dl_rq->rb_root);
> +
> +	dl_rq->dl_nr_running++;
> +}
> +
> +static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
> +{
> +	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> +
> +	if (RB_EMPTY_NODE(&dl_se->rb_node))
> +		return;
> +
> +	if (dl_rq->rb_leftmost == &dl_se->rb_node) {
> +		struct rb_node *next_node;
> +
> +		next_node = rb_next(&dl_se->rb_node);
> +		dl_rq->rb_leftmost = next_node;
> +	}
> +
> +	rb_erase(&dl_se->rb_node, &dl_rq->rb_root);
> +	RB_CLEAR_NODE(&dl_se->rb_node);
> +
> +	dl_rq->dl_nr_running--;
> +}
> +
> +static void
> +enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
> +{
> +	BUG_ON(on_dl_rq(dl_se));
> +
> +	/*
> +	 * If this is a wakeup or a new instance, the scheduling
> +	 * parameters of the task might need updating. Otherwise,
> +	 * we want a replenishment of its runtime.
> +	 */
> +	if (!dl_se->dl_new && flags & ENQUEUE_REPLENISH)
> +		replenish_dl_entity(dl_se);
> +	else
> +		update_dl_entity(dl_se);
> +
> +	__enqueue_dl_entity(dl_se);
> +}
> +
> +static void dequeue_dl_entity(struct sched_dl_entity *dl_se)
> +{
> +	__dequeue_dl_entity(dl_se);
> +}
> +
> +static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> +{
> +	/*
> +	 * If p is throttled, we do nothing. In fact, if it exhausted
> +	 * its budget it needs a replenishment and, since it now is on
> +	 * its rq, the bandwidth timer callback (which clearly has not
> +	 * run yet) will take care of this.
> +	 */
> +	if (p->dl.dl_throttled)
> +		return;
> +
> +	enqueue_dl_entity(&p->dl, flags);
> +	inc_nr_running(rq);
> +}
> +
> +static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> +{
> +	dequeue_dl_entity(&p->dl);
> +}
> +
> +static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> +{
> +	update_curr_dl(rq);
> +	__dequeue_task_dl(rq, p, flags);
> +
> +	dec_nr_running(rq);
> +}
> +
> +/*
> + * Yield task semantic for -deadline tasks is:
> + *
> + *   get off from the CPU until our next instance, with
> + *   a new runtime. This is of little use now, since we
> + *   don't have a bandwidth reclaiming mechanism. Anyway,
> + *   bandwidth reclaiming is planned for the future, and
> + *   yield_task_dl will indicate that some spare budget
> + *   is available for other task instances to use it.
> + */
> +static void yield_task_dl(struct rq *rq)
> +{
> +	struct task_struct *p = rq->curr;
> +
> +	/*
> +	 * We make the task go to sleep until its current deadline by
> +	 * forcing its runtime to zero. This way, update_curr_dl() stops
> +	 * it and the bandwidth timer will wake it up and will give it
> +	 * new scheduling parameters (thanks to dl_new=1).
> +	 */
> +	if (p->dl.runtime > 0) {
> +		rq->curr->dl.dl_new = 1;
> +		p->dl.runtime = 0;
> +	}
> +	update_curr_dl(rq);
> +}
> +
> +/*
> + * Only called when both the current and waking task are -deadline
> + * tasks.
> + */
> +static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
> +				  int flags)
> +{
> +	if (dl_time_before(p->dl.deadline, rq->curr->dl.deadline))
> +		resched_task(rq->curr);
> +}
> +
> +#ifdef CONFIG_SCHED_HRTICK
> +static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
> +{
> +	s64 delta = p->dl.dl_runtime - p->dl.runtime;
> +
> +	if (delta > 10000)
> +		hrtick_start(rq, p->dl.runtime);
> +}
> +#endif
> +
> +static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
> +						   struct dl_rq *dl_rq)
> +{
> +	struct rb_node *left = dl_rq->rb_leftmost;
> +
> +	if (!left)
> +		return NULL;
> +
> +	return rb_entry(left, struct sched_dl_entity, rb_node);
> +}
> +
> +struct task_struct *pick_next_task_dl(struct rq *rq)
> +{
> +	struct sched_dl_entity *dl_se;
> +	struct task_struct *p;
> +	struct dl_rq *dl_rq;
> +
> +	dl_rq = &rq->dl;
> +
> +	if (unlikely(!dl_rq->dl_nr_running))
> +		return NULL;
> +
> +	dl_se = pick_next_dl_entity(rq, dl_rq);
> +	BUG_ON(!dl_se);
> +
> +	p = dl_task_of(dl_se);
> +	p->se.exec_start = rq_clock_task(rq);
> +#ifdef CONFIG_SCHED_HRTICK
> +	if (hrtick_enabled(rq))
> +		start_hrtick_dl(rq, p);
> +#endif
> +	return p;
> +}
> +
> +static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
> +{
> +	update_curr_dl(rq);
> +}
> +
> +static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
> +{
> +	update_curr_dl(rq);
> +
> +#ifdef CONFIG_SCHED_HRTICK
> +	if (hrtick_enabled(rq) && queued && p->dl.runtime > 0)
> +		start_hrtick_dl(rq, p);
> +#endif
> +}
> +
> +static void task_fork_dl(struct task_struct *p)
> +{
> +	/*
> +	 * SCHED_DEADLINE tasks cannot fork and this is achieved through
> +	 * sched_fork()
> +	 */
> +}
> +
> +static void task_dead_dl(struct task_struct *p)
> +{
> +	struct hrtimer *timer = &p->dl.dl_timer;
> +
> +	if (hrtimer_active(timer))
> +		hrtimer_try_to_cancel(timer);
> +}
> +
> +static void set_curr_task_dl(struct rq *rq)
> +{
> +	struct task_struct *p = rq->curr;
> +
> +	p->se.exec_start = rq_clock_task(rq);
> +}
> +
> +static void switched_from_dl(struct rq *rq, struct task_struct *p)
> +{
> +	if (hrtimer_active(&p->dl.dl_timer))
> +		hrtimer_try_to_cancel(&p->dl.dl_timer);
> +}
> +
> +static void switched_to_dl(struct rq *rq, struct task_struct *p)
> +{
> +	/*
> +	 * If p is throttled, don't consider the possibility
> +	 * of preempting rq->curr, the check will be done right
> +	 * after its runtime will get replenished.
> +	 */
> +	if (unlikely(p->dl.dl_throttled))
> +		return;
> +
> +	if (!p->on_rq || rq->curr != p) {
> +		if (task_has_dl_policy(rq->curr))
> +			check_preempt_curr_dl(rq, p, 0);
> +		else
> +			resched_task(rq->curr);
> +	}
> +}
> +
> +static void prio_changed_dl(struct rq *rq, struct task_struct *p,
> +			    int oldprio)
> +{
> +	switched_to_dl(rq, p);
> +}
> +
> +#ifdef CONFIG_SMP
> +static int
> +select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
> +{
> +	return task_cpu(p);
> +}
> +#endif
> +
> +const struct sched_class dl_sched_class = {
> +	.next			= &rt_sched_class,
> +	.enqueue_task		= enqueue_task_dl,
> +	.dequeue_task		= dequeue_task_dl,
> +	.yield_task		= yield_task_dl,
> +
> +	.check_preempt_curr	= check_preempt_curr_dl,
> +
> +	.pick_next_task		= pick_next_task_dl,
> +	.put_prev_task		= put_prev_task_dl,
> +
> +#ifdef CONFIG_SMP
> +	.select_task_rq		= select_task_rq_dl,
> +#endif
> +
> +	.set_curr_task		= set_curr_task_dl,
> +	.task_tick		= task_tick_dl,
> +	.task_fork              = task_fork_dl,
> +	.task_dead		= task_dead_dl,
> +
> +	.prio_changed           = prio_changed_dl,
> +	.switched_from		= switched_from_dl,
> +	.switched_to		= switched_to_dl,
> +};
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 64eda5c..ba97476 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2,6 +2,7 @@
>  #include <linux/sched.h>
>  #include <linux/sched/sysctl.h>
>  #include <linux/sched/rt.h>
> +#include <linux/sched/deadline.h>
>  #include <linux/mutex.h>
>  #include <linux/spinlock.h>
>  #include <linux/stop_machine.h>
> @@ -87,11 +88,23 @@ static inline int rt_policy(int policy)
>  	return 0;
>  }
>  
> +static inline int dl_policy(int policy)
> +{
> +	if (unlikely(policy == SCHED_DEADLINE))
> +		return 1;
> +	return 0;
> +}
> +
>  static inline int task_has_rt_policy(struct task_struct *p)
>  {
>  	return rt_policy(p->policy);
>  }
>  
> +static inline int task_has_dl_policy(struct task_struct *p)
> +{
> +	return dl_policy(p->policy);
> +}
> +
>  /*
>   * This is the priority-queue data structure of the RT scheduling class:
>   */
> @@ -363,6 +376,15 @@ struct rt_rq {
>  #endif
>  };
>  
> +/* Deadline class' related fields in a runqueue */
> +struct dl_rq {
> +	/* runqueue is an rbtree, ordered by deadline */
> +	struct rb_root rb_root;
> +	struct rb_node *rb_leftmost;
> +
> +	unsigned long dl_nr_running;
> +};
> +
>  #ifdef CONFIG_SMP
>  
>  /*
> @@ -427,6 +449,7 @@ struct rq {
>  
>  	struct cfs_rq cfs;
>  	struct rt_rq rt;
> +	struct dl_rq dl;
>  
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  	/* list of leaf cfs_rq on this cpu: */
> @@ -957,6 +980,7 @@ static const u32 prio_to_wmult[40] = {
>  #else
>  #define ENQUEUE_WAKING		0
>  #endif
> +#define ENQUEUE_REPLENISH	8
>  
>  #define DEQUEUE_SLEEP		1
>  
> @@ -1012,6 +1036,7 @@ struct sched_class {
>     for (class = sched_class_highest; class; class = class->next)
>  
>  extern const struct sched_class stop_sched_class;
> +extern const struct sched_class dl_sched_class;
>  extern const struct sched_class rt_sched_class;
>  extern const struct sched_class fair_sched_class;
>  extern const struct sched_class idle_sched_class;
> @@ -1047,6 +1072,8 @@ extern void resched_cpu(int cpu);
>  extern struct rt_bandwidth def_rt_bandwidth;
>  extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
>  
> +extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
> +
>  extern void update_idle_cpu_load(struct rq *this_rq);
>  
>  extern void init_task_runnable_average(struct task_struct *p);
> @@ -1305,6 +1332,7 @@ extern void print_rt_stats(struct seq_file *m, int cpu);
>  
>  extern void init_cfs_rq(struct cfs_rq *cfs_rq);
>  extern void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq);
> +extern void init_dl_rq(struct dl_rq *rt_rq, struct rq *rq);
>  
>  extern void account_cfs_bandwidth_used(int enabled, int was_enabled);
>  
> diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
> index e08fbee..a5cef17 100644
> --- a/kernel/sched/stop_task.c
> +++ b/kernel/sched/stop_task.c
> @@ -103,7 +103,7 @@ get_rr_interval_stop(struct rq *rq, struct task_struct *task)
>   * Simple, special scheduling class for the per-CPU stop tasks:
>   */
>  const struct sched_class stop_sched_class = {
> -	.next			= &rt_sched_class,
> +	.next			= &dl_sched_class,
>  
>  	.enqueue_task		= enqueue_task_stop,
>  	.dequeue_task		= dequeue_task_stop,


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 02/14] sched: add extended scheduling interface.
  2013-11-12 17:23   ` Steven Rostedt
@ 2013-11-13  8:43     ` Juri Lelli
  0 siblings, 0 replies; 81+ messages in thread
From: Juri Lelli @ 2013-11-13  8:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield

On 11/12/2013 06:23 PM, Steven Rostedt wrote:
> On Thu,  7 Nov 2013 14:43:36 +0100
> Juri Lelli <juri.lelli@gmail.com> wrote:
> 
> 
>> + * This is reflected by the actual fields of the sched_param2 structure:
>> + *
>> + *  @sched_priority     task's priority (might still be useful)
>> + *  @sched_deadline     representative of the task's deadline
>> + *  @sched_runtime      representative of the task's runtime
>> + *  @sched_period       representative of the task's period
>> + *  @sched_flags        for customizing the scheduler behaviour
>> + *
>> + * Given this task model, there are a multiplicity of scheduling algorithms
>> + * and policies, that can be used to ensure all the tasks will make their
>> + * timing constraints.
>> + *
>> + * @__unused		padding to allow future expansion without ABI issues
>> + */
>> +struct sched_param2 {
>> +	int sched_priority;
>> +	unsigned int sched_flags;
> 
> I'm just thinking, if we are creating a new structure, and this
> structure already contains u64 elements, why not make sched_flags u64
> too? We are now just limiting the total number of possible flags to 32.
> I'm not sure how many flags will be needed in the future, maybe 32 is
> good enough, but just something to think about.
> 
> Of course you can argue that the int sched_flags matches the int
> sched_priority leaving out any holes in the structure, which is a
> legitimate argument.
> 
>> +	u64 sched_runtime;
>> +	u64 sched_deadline;
>> +	u64 sched_period;
>> +
>> +	u64 __unused[12];
> 
> And in the future, we could use one of these __unused[12] as a
> sched_flags2;
> 
> I'm not saying we should make it u64, just wanted to make sure we are
> fine with it as 32 for now.
> 

I'd stick with the current declaration for exactly the points you have made.

What others think?

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 02/14] sched: add extended scheduling interface.
  2013-11-12 17:32   ` Steven Rostedt
@ 2013-11-13  9:07     ` Juri Lelli
  0 siblings, 0 replies; 81+ messages in thread
From: Juri Lelli @ 2013-11-13  9:07 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield

On 11/12/2013 06:32 PM, Steven Rostedt wrote:
> On Thu,  7 Nov 2013 14:43:36 +0100
> Juri Lelli <juri.lelli@gmail.com> wrote:
> 
>   
>> +static int
>> +do_sched_setscheduler2(pid_t pid, int policy,
>> +			 struct sched_param2 __user *param2)
>> +{
>> +	struct sched_param2 lparam2;
>> +	struct task_struct *p;
>> +	int retval;
>> +
>> +	if (!param2 || pid < 0)
>> +		return -EINVAL;
>> +
>> +	memset(&lparam2, 0, sizeof(struct sched_param2));
>> +	if (copy_from_user(&lparam2, param2, sizeof(struct sched_param2)))
>> +		return -EFAULT;
> 
> Why the memset() before the copy_from_user()? We are copying
> sizeof(sched_param2) anyway, and should overwrite anything that was on
> the stack. I'm not aware of any possible leak from copying from
> userspace. I could understand it if we were copying to userspace.
> 
> do_sched_setscheduler() doesn't do that either.
> 
>> +
>> +	rcu_read_lock();
>> +	retval = -ESRCH;
>> +	p = find_process_by_pid(pid);
>> +	if (p != NULL)
>> +		retval = sched_setscheduler2(p, policy, &lparam2);
>> +	rcu_read_unlock();
>> +
>> +	return retval;
>> +}
>> +
>>  /**
>>   * sys_sched_setscheduler - set/change the scheduler policy and RT priority
>>   * @pid: the pid in question.
>> @@ -3514,6 +3553,21 @@ SYSCALL_DEFINE3(sched_setscheduler, pid_t, pid, int, policy,
>>  }
>>  
>>  /**
>> + * sys_sched_setscheduler2 - same as above, but with extended sched_param
>> + * @pid: the pid in question.
>> + * @policy: new policy (could use extended sched_param).
>> + * @param: structure containg the extended parameters.
>> + */
>> +SYSCALL_DEFINE3(sched_setscheduler2, pid_t, pid, int, policy,
>> +		struct sched_param2 __user *, param2)
>> +{
>> +	if (policy < 0)
>> +		return -EINVAL;
>> +
>> +	return do_sched_setscheduler2(pid, policy, param2);
>> +}
>> +
>> +/**
>>   * sys_sched_setparam - set/change the RT priority of a thread
>>   * @pid: the pid in question.
>>   * @param: structure containing the new RT priority.
>> @@ -3526,6 +3580,17 @@ SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
>>  }
>>  
>>  /**
>> + * sys_sched_setparam2 - same as above, but with extended sched_param
>> + * @pid: the pid in question.
>> + * @param2: structure containing the extended parameters.
>> + */
>> +SYSCALL_DEFINE2(sched_setparam2, pid_t, pid,
>> +		struct sched_param2 __user *, param2)
>> +{
>> +	return do_sched_setscheduler2(pid, -1, param2);
>> +}
>> +
>> +/**
>>   * sys_sched_getscheduler - get the policy (scheduling class) of a thread
>>   * @pid: the pid in question.
>>   *
>> @@ -3595,6 +3660,45 @@ out_unlock:
>>  	return retval;
>>  }
>>  
>> +/**
>> + * sys_sched_getparam2 - same as above, but with extended sched_param
>> + * @pid: the pid in question.
>> + * @param2: structure containing the extended parameters.
>> + */
>> +SYSCALL_DEFINE2(sched_getparam2, pid_t, pid,
>> +		struct sched_param2 __user *, param2)
>> +{
>> +	struct sched_param2 lp;
>> +	struct task_struct *p;
>> +	int retval;
>> +
>> +	if (!param2 || pid < 0)
>> +		return -EINVAL;
>> +
>> +	rcu_read_lock();
>> +	p = find_process_by_pid(pid);
>> +	retval = -ESRCH;
>> +	if (!p)
>> +		goto out_unlock;
>> +
>> +	retval = security_task_getscheduler(p);
>> +	if (retval)
>> +		goto out_unlock;
>> +
>> +	lp.sched_priority = p->rt_priority;
>> +	rcu_read_unlock();
>> +
> 
> OK, now we are missing the memset(). This does leak info, as lp never
> was set to zero, it just contains anything on the stack, and the only
> value you updated was sched_priority. We just copied to user memory
> from the kernel stack.

Right! memset() moved:

@@ -3779,7 +3779,6 @@ do_sched_setscheduler2(pid_t pid, int policy,
        if (!param2 || pid < 0)
                return -EINVAL;

-       memset(&lparam2, 0, sizeof(struct sched_param2));
        if (copy_from_user(&lparam2, param2, sizeof(struct sched_param2)))
                return -EFAULT;

@@ -3937,6 +3936,8 @@ SYSCALL_DEFINE2(sched_getparam2, pid_t, pid,
        if (!param2 || pid < 0)
                return -EINVAL;

+       memset(&lp, 0, sizeof(struct sched_param2));
+
        rcu_read_lock();
        p = find_process_by_pid(pid);
        retval = -ESRCH;

Thanks,

- Juri

> 
>> +	retval = copy_to_user(param2, &lp,
>> +			sizeof(struct sched_param2)) ? -EFAULT : 0;
>> +
>> +	return retval;
>> +
>> +out_unlock:
>> +	rcu_read_unlock();
>> +	return retval;
>> +
>> +}
>> +
>>  long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
>>  {
>>  	cpumask_var_t cpus_allowed, new_mask;
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 03/14] sched: SCHED_DEADLINE structures & implementation.
  2013-11-13  2:31   ` Steven Rostedt
@ 2013-11-13  9:54     ` Juri Lelli
  0 siblings, 0 replies; 81+ messages in thread
From: Juri Lelli @ 2013-11-13  9:54 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield

On 11/13/2013 03:31 AM, Steven Rostedt wrote:
> On Thu,  7 Nov 2013 14:43:37 +0100
> Juri Lelli <juri.lelli@gmail.com> wrote:
> 
>> From: Dario Faggioli <raistlin@linux.it>
> 
>> --- /dev/null
>> +++ b/include/linux/sched/deadline.h
>> @@ -0,0 +1,24 @@
>> +#ifndef _SCHED_DEADLINE_H
>> +#define _SCHED_DEADLINE_H
>> +
>> +/*
>> + * SCHED_DEADLINE tasks has negative priorities, reflecting
>> + * the fact that any of them has higher prio than RT and
>> + * NORMAL/BATCH tasks.
>> + */
>> +
>> +#define MAX_DL_PRIO		0
>> +
>> +static inline int dl_prio(int prio)
>> +{
>> +	if (unlikely(prio < MAX_DL_PRIO))
>> +		return 1;
>> +	return 0;
>> +}
>> +
>> +static inline int dl_task(struct task_struct *p)
>> +{
>> +	return dl_prio(p->prio);
>> +}
>> +
>> +#endif /* _SCHED_DEADLINE_H */
>> diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
>> index 440434d..a157797 100644
>> --- a/include/linux/sched/rt.h
>> +++ b/include/linux/sched/rt.h
>> @@ -22,7 +22,7 @@
>>  
>>  static inline int rt_prio(int prio)
>>  {
>> -	if (unlikely(prio < MAX_RT_PRIO))
>> +	if ((unsigned)prio < MAX_RT_PRIO)
> 
> Why remove the "unlikely" here?
>

No reason that I can recall, most probably something went wrong with successive
rebases. Grrrr! Fixed.

>>  		return 1;
>>  	return 0;
>>  }
>> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
>> index 5a0f945..2d5e49a 100644
>> --- a/include/uapi/linux/sched.h
>> +++ b/include/uapi/linux/sched.h
>> @@ -39,6 +39,7 @@
>>  #define SCHED_BATCH		3
>>  /* SCHED_ISO: reserved but not implemented yet */
>>  #define SCHED_IDLE		5
>> +#define SCHED_DEADLINE		6
>>  /* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
>>  #define SCHED_RESET_ON_FORK     0x40000000
>>  
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 086fe73..55fc95f 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1313,7 +1313,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>>  #endif
>>  
>>  	/* Perform scheduler related setup. Assign this task to a CPU. */
>> -	sched_fork(p);
>> +	retval = sched_fork(p);
>> +	if (retval)
>> +		goto bad_fork_cleanup_policy;
>>  
>>  	retval = perf_event_init_task(p);
>>  	if (retval)
>> diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
>> index 383319b..0909436 100644
>> --- a/kernel/hrtimer.c
>> +++ b/kernel/hrtimer.c
>> @@ -46,6 +46,7 @@
>>  #include <linux/sched.h>
>>  #include <linux/sched/sysctl.h>
>>  #include <linux/sched/rt.h>
>> +#include <linux/sched/deadline.h>
>>  #include <linux/timer.h>
>>  #include <linux/freezer.h>
>>  
>> @@ -1610,7 +1611,7 @@ long hrtimer_nanosleep(struct timespec *rqtp, struct timespec __user *rmtp,
>>  	unsigned long slack;
>>  
>>  	slack = current->timer_slack_ns;
>> -	if (rt_task(current))
>> +	if (dl_task(current) || rt_task(current))
> 
> Since dl_task() checks if prio is less than 0, and rt_task checks for
> prio < MAX_RT_PRIO, I wonder if we can introduce a
> 
> 	dl_or_rt_task(current)
> 
> that does a signed compare against MAX_RT_PRIO to eliminate the double
> compare (in case gcc doesn't figure it out).
> 
> Not something that we need to change now, but something in the future
> maybe.
> 

Ok.

>>  		slack = 0;
>>  
>>  	hrtimer_init_on_stack(&t.timer, clockid, mode);
>> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
>> index 54adcf3..d77282f 100644
>> --- a/kernel/sched/Makefile
>> +++ b/kernel/sched/Makefile
>> @@ -11,7 +11,7 @@ ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
>>  CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer
>>  endif
>>  
>> -obj-y += core.o proc.o clock.o cputime.o idle_task.o fair.o rt.o stop_task.o
>> +obj-y += core.o proc.o clock.o cputime.o idle_task.o fair.o rt.o deadline.o stop_task.o
>>  obj-$(CONFIG_SMP) += cpupri.o
>>  obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
>>  obj-$(CONFIG_SCHEDSTATS) += stats.o
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 4fcbf13..cfe15bfc 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -903,7 +903,9 @@ static inline int normal_prio(struct task_struct *p)
>>  {
>>  	int prio;
>>  
>> -	if (task_has_rt_policy(p))
>> +	if (task_has_dl_policy(p))
>> +		prio = MAX_DL_PRIO-1;
>> +	else if (task_has_rt_policy(p))
>>  		prio = MAX_RT_PRIO-1 - p->rt_priority;
>>  	else
>>  		prio = __normal_prio(p);
>> @@ -1611,6 +1613,12 @@ static void __sched_fork(struct task_struct *p)
>>  	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
>>  #endif
>>  
>> +	RB_CLEAR_NODE(&p->dl.rb_node);
>> +	hrtimer_init(&p->dl.dl_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>> +	p->dl.dl_runtime = p->dl.runtime = 0;
>> +	p->dl.dl_deadline = p->dl.deadline = 0;
>> +	p->dl.flags = 0;
>> +
>>  	INIT_LIST_HEAD(&p->rt.run_list);
>>  
>>  #ifdef CONFIG_PREEMPT_NOTIFIERS
>> @@ -1654,7 +1662,7 @@ void set_numabalancing_state(bool enabled)
>>  /*
>>   * fork()/clone()-time setup:
>>   */
>> -void sched_fork(struct task_struct *p)
>> +int sched_fork(struct task_struct *p)
>>  {
>>  	unsigned long flags;
>>  	int cpu = get_cpu();
>> @@ -1676,7 +1684,7 @@ void sched_fork(struct task_struct *p)
>>  	 * Revert to default priority/policy on fork if requested.
>>  	 */
>>  	if (unlikely(p->sched_reset_on_fork)) {
>> -		if (task_has_rt_policy(p)) {
>> +		if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
>>  			p->policy = SCHED_NORMAL;
>>  			p->static_prio = NICE_TO_PRIO(0);
>>  			p->rt_priority = 0;
>> @@ -1693,8 +1701,14 @@ void sched_fork(struct task_struct *p)
>>  		p->sched_reset_on_fork = 0;
>>  	}
>>  
>> -	if (!rt_prio(p->prio))
>> +	if (dl_prio(p->prio)) {
>> +		put_cpu();
>> +		return -EAGAIN;
> 
> Deadline tasks are not allowed to fork? Well, without setting
> reset_on_fork. Probably should add a comment here.
> 

Yes, because with didn't yet decide how to split a father's bandwidth among
children. This way we force the user to think about it. I'll add a comment here.

Peter was actually wondering if we should return a different error:

On 10/14/2013 01:18 PM, Peter Zijlstra wrote:> On Mon, Oct 14, 2013 at
12:43:35PM +0200, Juri Lelli wrote:
>> > @@ -1693,8 +1701,14 @@ void sched_fork(struct task_struct *p)
>> >  		p->sched_reset_on_fork = 0;
>> >  	}
>> >
>> > -	if (!rt_prio(p->prio))
>> > +	if (dl_prio(p->prio)) {
>> > +		put_cpu();
>> > +		return -EAGAIN;
> Is this really the error we want to return on fork()?
>
> EAGAIN to me indicates a spurious error and we should try again later;
> however as it obvious from the code above; we'll always fail, there's no
> point in trying again later.
>
> I would think something like EINVAL; even though there are no arguments
> to fork(); would me a better option.
>
> Then again; I really don't care too much; anybody any preferences?

>> +	} else if (rt_prio(p->prio)) {
>> +		p->sched_class = &rt_sched_class;
>> +	} else {
>>  		p->sched_class = &fair_sched_class;
>> +	}
>>  
>>  	if (p->sched_class->task_fork)
>>  		p->sched_class->task_fork(p);
>> @@ -1726,6 +1740,7 @@ void sched_fork(struct task_struct *p)
>>  #endif
>>  
>>  	put_cpu();
>> +	return 0;
>>  }
>>  
>>  /*
>> @@ -3029,7 +3044,7 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
>>  	struct rq *rq;
>>  	const struct sched_class *prev_class;
>>  
>> -	BUG_ON(prio < 0 || prio > MAX_PRIO);
>> +	BUG_ON(prio > MAX_PRIO);
> 
> Should we have this be:
> 
> 	BUG_ON(prio < -MAX_PRIO || prio > MAX_PRIO);
> ?
> 

BUG_ON(prio < (MAX_DL_PRIO-1) || prio > MAX_PRIO);
?

>>  
>>  	rq = __task_rq_lock(p);
>>  
>> @@ -3061,7 +3076,9 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
>>  	if (running)
>>  		p->sched_class->put_prev_task(rq, p);
>>  
>> -	if (rt_prio(prio))
>> +	if (dl_prio(prio))
>> +		p->sched_class = &dl_sched_class;
>> +	else if (rt_prio(prio))
>>  		p->sched_class = &rt_sched_class;
>>  	else
>>  		p->sched_class = &fair_sched_class;
>> @@ -3095,9 +3112,9 @@ void set_user_nice(struct task_struct *p, long nice)
>>  	 * The RT priorities are set via sched_setscheduler(), but we still
>>  	 * allow the 'normal' nice value to be set - but as expected
>>  	 * it wont have any effect on scheduling until the task is
>> -	 * SCHED_FIFO/SCHED_RR:
>> +	 * SCHED_DEADLINE, SCHED_FIFO or SCHED_RR:
>>  	 */
>> -	if (task_has_rt_policy(p)) {
>> +	if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
>>  		p->static_prio = NICE_TO_PRIO(nice);
>>  		goto out_unlock;
>>  	}
>> @@ -3261,7 +3278,9 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
>>  	p->normal_prio = normal_prio(p);
>>  	/* we are holding p->pi_lock already */
>>  	p->prio = rt_mutex_getprio(p);
>> -	if (rt_prio(p->prio))
>> +	if (dl_prio(p->prio))
>> +		p->sched_class = &dl_sched_class;
>> +	else if (rt_prio(p->prio))
>>  		p->sched_class = &rt_sched_class;
>>  	else
>>  		p->sched_class = &fair_sched_class;
>> @@ -3269,6 +3288,50 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
>>  }
>>  
>>  /*
>> + * This function initializes the sched_dl_entity of a newly becoming
>> + * SCHED_DEADLINE task.
>> + *
>> + * Only the static values are considered here, the actual runtime and the
>> + * absolute deadline will be properly calculated when the task is enqueued
>> + * for the first time with its new policy.
>> + */
>> +static void
>> +__setparam_dl(struct task_struct *p, const struct sched_param2 *param2)
>> +{
>> +	struct sched_dl_entity *dl_se = &p->dl;
>> +
>> +	init_dl_task_timer(dl_se);
>> +	dl_se->dl_runtime = param2->sched_runtime;
>> +	dl_se->dl_deadline = param2->sched_deadline;
>> +	dl_se->flags = param2->sched_flags;
>> +	dl_se->dl_throttled = 0;
>> +	dl_se->dl_new = 1;
>> +}
>> +
>> +static void
>> +__getparam_dl(struct task_struct *p, struct sched_param2 *param2)
>> +{
>> +	struct sched_dl_entity *dl_se = &p->dl;
>> +
>> +	param2->sched_priority = p->rt_priority;
>> +	param2->sched_runtime = dl_se->dl_runtime;
>> +	param2->sched_deadline = dl_se->dl_deadline;
>> +	param2->sched_flags = dl_se->flags;
>> +}
>> +
>> +/*
>> + * This function validates the new parameters of a -deadline task.
>> + * We ask for the deadline not being zero, and greater or equal
>> + * than the runtime.
>> + */
>> +static bool
>> +__checkparam_dl(const struct sched_param2 *prm)
>> +{
>> +	return prm && (&prm->sched_deadline) != 0 &&
>> +	       (s64)(&prm->sched_deadline - &prm->sched_runtime) >= 0;
>> +}
>> +
>> +/*
>>   * check the target process has a UID that matches the current process's
>>   */
>>  static bool check_same_owner(struct task_struct *p)
>> @@ -3305,7 +3368,8 @@ recheck:
>>  		reset_on_fork = !!(policy & SCHED_RESET_ON_FORK);
>>  		policy &= ~SCHED_RESET_ON_FORK;
>>  
>> -		if (policy != SCHED_FIFO && policy != SCHED_RR &&
>> +		if (policy != SCHED_DEADLINE &&
>> +				policy != SCHED_FIFO && policy != SCHED_RR &&
>>  				policy != SCHED_NORMAL && policy != SCHED_BATCH &&
>>  				policy != SCHED_IDLE)
>>  			return -EINVAL;
>> @@ -3320,7 +3384,8 @@ recheck:
>>  	    (p->mm && param->sched_priority > MAX_USER_RT_PRIO-1) ||
>>  	    (!p->mm && param->sched_priority > MAX_RT_PRIO-1))
>>  		return -EINVAL;
>> -	if (rt_policy(policy) != (param->sched_priority != 0))
>> +	if ((dl_policy(policy) && !__checkparam_dl(param)) ||
>> +	    (rt_policy(policy) != (param->sched_priority != 0)))
>>  		return -EINVAL;
>>  
>>  	/*
>> @@ -3386,7 +3451,8 @@ recheck:
>>  	 * If not changing anything there's no need to proceed further:
>>  	 */
>>  	if (unlikely(policy == p->policy && (!rt_policy(policy) ||
>> -			param->sched_priority == p->rt_priority))) {
>> +			param->sched_priority == p->rt_priority) &&
>> +			!dl_policy(policy))) {
>>  		task_rq_unlock(rq, p, &flags);
>>  		return 0;
>>  	}
>> @@ -3423,7 +3489,11 @@ recheck:
>>  
>>  	oldprio = p->prio;
>>  	prev_class = p->sched_class;
>> -	__setscheduler(rq, p, policy, param->sched_priority);
>> +	if (dl_policy(policy)) {
>> +		__setparam_dl(p, param);
>> +		__setscheduler(rq, p, policy, param->sched_priority);
>> +	} else
>> +		__setscheduler(rq, p, policy, param->sched_priority);
> 
> Why the double "__setscheduler()" call? Why not just:
> 
> 	if (dl_policy(policy))
> 		__setparam_dl(p, param);
> 	__setscheduler(rq, p, policy, param->sched_priority);
> 
> ?
> 

Right, changed.

>>  
>>  	if (running)
>>  		p->sched_class->set_curr_task(rq);
>> @@ -3527,8 +3597,11 @@ do_sched_setscheduler2(pid_t pid, int policy,
>>  	rcu_read_lock();
>>  	retval = -ESRCH;
>>  	p = find_process_by_pid(pid);
>> -	if (p != NULL)
>> +	if (p != NULL) {
>> +		if (dl_policy(policy))
>> +			lparam2.sched_priority = 0;
>>  		retval = sched_setscheduler2(p, policy, &lparam2);
>> +	}
>>  	rcu_read_unlock();
>>  
>>  	return retval;
>> @@ -3685,7 +3758,10 @@ SYSCALL_DEFINE2(sched_getparam2, pid_t, pid,
>>  	if (retval)
>>  		goto out_unlock;
>>  
>> -	lp.sched_priority = p->rt_priority;
>> +	if (task_has_dl_policy(p))
>> +		__getparam_dl(p, &lp);
>> +	else
>> +		lp.sched_priority = p->rt_priority;
>>  	rcu_read_unlock();
>>  
>>  	retval = copy_to_user(param2, &lp,
>> @@ -4120,6 +4196,7 @@ SYSCALL_DEFINE1(sched_get_priority_max, int, policy)
>>  	case SCHED_RR:
>>  		ret = MAX_USER_RT_PRIO-1;
>>  		break;
>> +	case SCHED_DEADLINE:
>>  	case SCHED_NORMAL:
>>  	case SCHED_BATCH:
>>  	case SCHED_IDLE:
>> @@ -4146,6 +4223,7 @@ SYSCALL_DEFINE1(sched_get_priority_min, int, policy)
>>  	case SCHED_RR:
>>  		ret = 1;
>>  		break;
>> +	case SCHED_DEADLINE:
>>  	case SCHED_NORMAL:
>>  	case SCHED_BATCH:
>>  	case SCHED_IDLE:
>> @@ -6563,6 +6641,7 @@ void __init sched_init(void)
>>  		rq->calc_load_update = jiffies + LOAD_FREQ;
>>  		init_cfs_rq(&rq->cfs);
>>  		init_rt_rq(&rq->rt, rq);
>> +		init_dl_rq(&rq->dl, rq);
>>  #ifdef CONFIG_FAIR_GROUP_SCHED
>>  		root_task_group.shares = ROOT_TASK_GROUP_LOAD;
>>  		INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
>> @@ -6746,7 +6825,7 @@ void normalize_rt_tasks(void)
>>  		p->se.statistics.block_start	= 0;
>>  #endif
>>  
>> -		if (!rt_task(p)) {
>> +		if (!dl_task(p) && !rt_task(p)) {
> 
> Again, a future enhancement is to combine these two.
> 

Ok.

>>  			/*
>>  			 * Renice negative nice level userspace
>>  			 * tasks back to 0:
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> new file mode 100644
>> index 0000000..cb93f2e
>> --- /dev/null
>> +++ b/kernel/sched/deadline.c
>> @@ -0,0 +1,682 @@
>> +/*
>> + * Deadline Scheduling Class (SCHED_DEADLINE)
>> + *
>> + * Earliest Deadline First (EDF) + Constant Bandwidth Server (CBS).
>> + *
>> + * Tasks that periodically executes their instances for less than their
>> + * runtime won't miss any of their deadlines.
>> + * Tasks that are not periodic or sporadic or that tries to execute more
>> + * than their reserved bandwidth will be slowed down (and may potentially
>> + * miss some of their deadlines), and won't affect any other task.
>> + *
>> + * Copyright (C) 2012 Dario Faggioli <raistlin@linux.it>,
>> + *                    Michael Trimarchi <michael@amarulasolutions.com>,
>> + *                    Fabio Checconi <fchecconi@gmail.com>
>> + */
>> +#include "sched.h"
>> +
>> +static inline int dl_time_before(u64 a, u64 b)
>> +{
>> +	return (s64)(a - b) < 0;
> 
> I think I've seen this pattern enough that we probably should have a
> helper function that does this. Off topic for this thread, but just
> something for us to think about.
> 
>> +}
>> +
>> +static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
>> +{
>> +	return container_of(dl_se, struct task_struct, dl);
>> +}
>> +
>> +static inline struct rq *rq_of_dl_rq(struct dl_rq *dl_rq)
>> +{
>> +	return container_of(dl_rq, struct rq, dl);
>> +}
>> +
>> +static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
>> +{
>> +	struct task_struct *p = dl_task_of(dl_se);
>> +	struct rq *rq = task_rq(p);
>> +
>> +	return &rq->dl;
>> +}
>> +
>> +static inline int on_dl_rq(struct sched_dl_entity *dl_se)
>> +{
>> +	return !RB_EMPTY_NODE(&dl_se->rb_node);
>> +}
>> +
>> +static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
>> +{
>> +	struct sched_dl_entity *dl_se = &p->dl;
>> +
>> +	return dl_rq->rb_leftmost == &dl_se->rb_node;
>> +}
>> +
>> +void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
>> +{
>> +	dl_rq->rb_root = RB_ROOT;
>> +}
>> +
>> +static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
>> +static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
>> +static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
>> +				  int flags);
>> +
>> +/*
>> + * We are being explicitly informed that a new instance is starting,
>> + * and this means that:
>> + *  - the absolute deadline of the entity has to be placed at
>> + *    current time + relative deadline;
>> + *  - the runtime of the entity has to be set to the maximum value.
>> + *
>> + * The capability of specifying such event is useful whenever a -deadline
>> + * entity wants to (try to!) synchronize its behaviour with the scheduler's
>> + * one, and to (try to!) reconcile itself with its own scheduling
>> + * parameters.
>> + */
>> +static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
>> +{
>> +	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>> +	struct rq *rq = rq_of_dl_rq(dl_rq);
>> +
>> +	WARN_ON(!dl_se->dl_new || dl_se->dl_throttled);
>> +
>> +	/*
>> +	 * We use the regular wall clock time to set deadlines in the
>> +	 * future; in fact, we must consider execution overheads (time
>> +	 * spent on hardirq context, etc.).
>> +	 */
>> +	dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
>> +	dl_se->runtime = dl_se->dl_runtime;
>> +	dl_se->dl_new = 0;
>> +}
>> +
>> +/*
>> + * Pure Earliest Deadline First (EDF) scheduling does not deal with the
>> + * possibility of a entity lasting more than what it declared, and thus
>> + * exhausting its runtime.
>> + *
>> + * Here we are interested in making runtime overrun possible, but we do
>> + * not want a entity which is misbehaving to affect the scheduling of all
>> + * other entities.
>> + * Therefore, a budgeting strategy called Constant Bandwidth Server (CBS)
>> + * is used, in order to confine each entity within its own bandwidth.
>> + *
>> + * This function deals exactly with that, and ensures that when the runtime
>> + * of a entity is replenished, its deadline is also postponed. That ensures
>> + * the overrunning entity can't interfere with other entity in the system and
>> + * can't make them miss their deadlines. Reasons why this kind of overruns
>> + * could happen are, typically, a entity voluntarily trying to overcome its
>> + * runtime, or it just underestimated it during sched_setscheduler_ex().
>> + */
>> +static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>> +{
>> +	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>> +	struct rq *rq = rq_of_dl_rq(dl_rq);
>> +
>> +	/*
>> +	 * We keep moving the deadline away until we get some
>> +	 * available runtime for the entity. This ensures correct
>> +	 * handling of situations where the runtime overrun is
>> +	 * arbitrary large.
>> +	 */
>> +	while (dl_se->runtime <= 0) {
>> +		dl_se->deadline += dl_se->dl_deadline;
>> +		dl_se->runtime += dl_se->dl_runtime;
>> +	}
>> +
>> +	/*
>> +	 * At this point, the deadline really should be "in
>> +	 * the future" with respect to rq->clock. If it's
>> +	 * not, we are, for some reason, lagging too much!
>> +	 * Anyway, after having warn userspace abut that,
>> +	 * we still try to keep the things running by
>> +	 * resetting the deadline and the budget of the
>> +	 * entity.
>> +	 */
>> +	if (dl_time_before(dl_se->deadline, rq_clock(rq))) {
>> +		static bool lag_once = false;
>> +
>> +		if (!lag_once) {
>> +			lag_once = true;
>> +			printk_sched("sched: DL replenish lagged to much\n");
>> +		}
>> +		dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
>> +		dl_se->runtime = dl_se->dl_runtime;
>> +	}
>> +}
>> +
>> +/*
>> + * Here we check if --at time t-- an entity (which is probably being
>> + * [re]activated or, in general, enqueued) can use its remaining runtime
>> + * and its current deadline _without_ exceeding the bandwidth it is
>> + * assigned (function returns true if it can't). We are in fact applying
>> + * one of the CBS rules: when a task wakes up, if the residual runtime
>> + * over residual deadline fits within the allocated bandwidth, then we
>> + * can keep the current (absolute) deadline and residual budget without
>> + * disrupting the schedulability of the system. Otherwise, we should
>> + * refill the runtime and set the deadline a period in the future,
>> + * because keeping the current (absolute) deadline of the task would
>> + * result in breaking guarantees promised to other tasks.
>> + *
>> + * This function returns true if:
>> + *
>> + *   runtime / (deadline - t) > dl_runtime / dl_deadline ,
> 
> 
> I'm a bit confused here about this algorithm. If I state I want to have
> a run time of 20 ms out of 100 ms, so my dl_runtime / dl_deadline is
> 1/5 or %20.
> 
> Now if I don't don't get scheduled at the start of my new period
> because other tasks are running, and I start say, at 50ms in. Thus, my
> deadline - t == 50, but my runtime remaining is still 20ms. Thus I have
> 20/50 which is 2/5 or %40. Doing the calculation I have 40 > 20 being
> true, and thus we label this as an overflow.
> 
> Is this what we really want? Or did I misunderstand something here?
> 
> Is this only for tasks that went to sleep voluntarily and have just
> woken up?
> 

The fact is that this rule is applied only for a task that have just woken up
(in general enqueued back). In the case you depicted the task didn't sleep
after he was activated. If someone was executing instead of him in its first
50ms it means that that task was having earliest deadline. Since your task set
passed the admission control, and since no check is made when a task is simply
elected for execution, your task will run with its current parameters having
the chance to execute for 20ms in its last 50ms.

Does this clarify?

> 
>> + *
>> + * IOW we can't recycle current parameters.
>> + */
>> +static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
>> +{
>> +	u64 left, right;
>> +
>> +	/*
>> +	 * left and right are the two sides of the equation above,
>> +	 * after a bit of shuffling to use multiplications instead
>> +	 * of divisions.
>> +	 *
>> +	 * Note that none of the time values involved in the two
>> +	 * multiplications are absolute: dl_deadline and dl_runtime
>> +	 * are the relative deadline and the maximum runtime of each
>> +	 * instance, runtime is the runtime left for the last instance
>> +	 * and (deadline - t), since t is rq->clock, is the time left
>> +	 * to the (absolute) deadline. Even if overflowing the u64 type
>> +	 * is very unlikely to occur in both cases, here we scale down
>> +	 * as we want to avoid that risk at all. Scaling down by 10
>> +	 * means that we reduce granularity to 1us. We are fine with it,
>> +	 * since this is only a true/false check and, anyway, thinking
>> +	 * of anything below microseconds resolution is actually fiction
>> +	 * (but still we want to give the user that illusion >;).
>> +	 */
>> +	left = (dl_se->dl_deadline >> 10) * (dl_se->runtime >> 10);
>> +	right = ((dl_se->deadline - t) >> 10) * (dl_se->dl_runtime >> 10);
>> +
>> +	return dl_time_before(right, left);
>> +}
>> +
>> +/*
>> + * When a -deadline entity is queued back on the runqueue, its runtime and
>> + * deadline might need updating.
>> + *
>> + * The policy here is that we update the deadline of the entity only if:
>> + *  - the current deadline is in the past,
>> + *  - using the remaining runtime with the current deadline would make
>> + *    the entity exceed its bandwidth.
>> + */
>> +static void update_dl_entity(struct sched_dl_entity *dl_se)
>> +{
>> +	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>> +	struct rq *rq = rq_of_dl_rq(dl_rq);
>> +
>> +	/*
>> +	 * The arrival of a new instance needs special treatment, i.e.,
>> +	 * the actual scheduling parameters have to be "renewed".
>> +	 */
>> +	if (dl_se->dl_new) {
>> +		setup_new_dl_entity(dl_se);
>> +		return;
>> +	}
>> +
>> +	if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
>> +	    dl_entity_overflow(dl_se, rq_clock(rq))) {
>> +		dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
>> +		dl_se->runtime = dl_se->dl_runtime;
> 
> In the case that I stated, we just moved the deadline forward.
> 
> Or is this only because it was just enqueued, and we fear that it will
> starve out other tasks? That is, this only happens if the task was
> sleeping and just woke up.
> 

Yes, only if the task was sleeping.

Thanks,

- Juri

>> +	}
>> +}
>> +
>> +/*
>> + * If the entity depleted all its runtime, and if we want it to sleep
>> + * while waiting for some new execution time to become available, we
>> + * set the bandwidth enforcement timer to the replenishment instant
>> + * and try to activate it.
>> + *
>> + * Notice that it is important for the caller to know if the timer
>> + * actually started or not (i.e., the replenishment instant is in
>> + * the future or in the past).
>> + */
>> +static int start_dl_timer(struct sched_dl_entity *dl_se)
>> +{
>> +	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>> +	struct rq *rq = rq_of_dl_rq(dl_rq);
>> +	ktime_t now, act;
>> +	ktime_t soft, hard;
>> +	unsigned long range;
>> +	s64 delta;
>> +
>> +	/*
>> +	 * We want the timer to fire at the deadline, but considering
>> +	 * that it is actually coming from rq->clock and not from
>> +	 * hrtimer's time base reading.
>> +	 */
>> +	act = ns_to_ktime(dl_se->deadline);
>> +	now = hrtimer_cb_get_time(&dl_se->dl_timer);
>> +	delta = ktime_to_ns(now) - rq_clock(rq);
>> +	act = ktime_add_ns(act, delta);
>> +
>> +	/*
>> +	 * If the expiry time already passed, e.g., because the value
>> +	 * chosen as the deadline is too small, don't even try to
>> +	 * start the timer in the past!
>> +	 */
>> +	if (ktime_us_delta(act, now) < 0)
>> +		return 0;
>> +
>> +	hrtimer_set_expires(&dl_se->dl_timer, act);
>> +
>> +	soft = hrtimer_get_softexpires(&dl_se->dl_timer);
>> +	hard = hrtimer_get_expires(&dl_se->dl_timer);
>> +	range = ktime_to_ns(ktime_sub(hard, soft));
>> +	__hrtimer_start_range_ns(&dl_se->dl_timer, soft,
>> +				 range, HRTIMER_MODE_ABS, 0);
>> +
>> +	return hrtimer_active(&dl_se->dl_timer);
>> +}
>> +
>> +/*
>> + * This is the bandwidth enforcement timer callback. If here, we know
>> + * a task is not on its dl_rq, since the fact that the timer was running
>> + * means the task is throttled and needs a runtime replenishment.
>> + *
>> + * However, what we actually do depends on the fact the task is active,
>> + * (it is on its rq) or has been removed from there by a call to
>> + * dequeue_task_dl(). In the former case we must issue the runtime
>> + * replenishment and add the task back to the dl_rq; in the latter, we just
>> + * do nothing but clearing dl_throttled, so that runtime and deadline
>> + * updating (and the queueing back to dl_rq) will be done by the
>> + * next call to enqueue_task_dl().
>> + */
>> +static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
>> +{
>> +	struct sched_dl_entity *dl_se = container_of(timer,
>> +						     struct sched_dl_entity,
>> +						     dl_timer);
>> +	struct task_struct *p = dl_task_of(dl_se);
>> +	struct rq *rq = task_rq(p);
>> +	raw_spin_lock(&rq->lock);
>> +
>> +	/*
>> +	 * We need to take care of a possible races here. In fact, the
>> +	 * task might have changed its scheduling policy to something
>> +	 * different from SCHED_DEADLINE or changed its reservation
>> +	 * parameters (through sched_setscheduler()).
>> +	 */
>> +	if (!dl_task(p) || dl_se->dl_new)
>> +		goto unlock;
>> +
>> +	dl_se->dl_throttled = 0;
>> +	if (p->on_rq) {
>> +		enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
>> +		if (task_has_dl_policy(rq->curr))
>> +			check_preempt_curr_dl(rq, p, 0);
>> +		else
>> +			resched_task(rq->curr);
>> +	}
>> +unlock:
>> +	raw_spin_unlock(&rq->lock);
>> +
>> +	return HRTIMER_NORESTART;
>> +}
>> +
>> +void init_dl_task_timer(struct sched_dl_entity *dl_se)
>> +{
>> +	struct hrtimer *timer = &dl_se->dl_timer;
>> +
>> +	if (hrtimer_active(timer)) {
>> +		hrtimer_try_to_cancel(timer);
>> +		return;
>> +	}
>> +
>> +	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>> +	timer->function = dl_task_timer;
>> +}
>> +
>> +static
>> +int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
>> +{
>> +	int dmiss = dl_time_before(dl_se->deadline, rq_clock(rq));
>> +	int rorun = dl_se->runtime <= 0;
>> +
>> +	if (!rorun && !dmiss)
>> +		return 0;
>> +
>> +	/*
>> +	 * If we are beyond our current deadline and we are still
>> +	 * executing, then we have already used some of the runtime of
>> +	 * the next instance. Thus, if we do not account that, we are
>> +	 * stealing bandwidth from the system at each deadline miss!
>> +	 */
>> +	if (dmiss) {
>> +		dl_se->runtime = rorun ? dl_se->runtime : 0;
>> +		dl_se->runtime -= rq_clock(rq) - dl_se->deadline;
>> +	}
>> +
>> +	return 1;
>> +}
>> +
>> +/*
>> + * Update the current task's runtime statistics (provided it is still
>> + * a -deadline task and has not been removed from the dl_rq).
>> + */
>> +static void update_curr_dl(struct rq *rq)
>> +{
>> +	struct task_struct *curr = rq->curr;
>> +	struct sched_dl_entity *dl_se = &curr->dl;
>> +	u64 delta_exec;
>> +
>> +	if (!dl_task(curr) || !on_dl_rq(dl_se))
>> +		return;
>> +
>> +	/*
>> +	 * Consumed budget is computed considering the time as
>> +	 * observed by schedulable tasks (excluding time spent
>> +	 * in hardirq context, etc.). Deadlines are instead
>> +	 * computed using hard walltime. This seems to be the more
>> +	 * natural solution, but the full ramifications of this
>> +	 * approach need further study.
>> +	 */
>> +	delta_exec = rq_clock_task(rq) - curr->se.exec_start;
>> +	if (unlikely((s64)delta_exec < 0))
>> +		delta_exec = 0;
>> +
>> +	schedstat_set(curr->se.statistics.exec_max,
>> +		      max(curr->se.statistics.exec_max, delta_exec));
>> +
>> +	curr->se.sum_exec_runtime += delta_exec;
>> +	account_group_exec_runtime(curr, delta_exec);
>> +
>> +	curr->se.exec_start = rq_clock_task(rq);
>> +	cpuacct_charge(curr, delta_exec);
>> +
>> +	dl_se->runtime -= delta_exec;
>> +	if (dl_runtime_exceeded(rq, dl_se)) {
>> +		__dequeue_task_dl(rq, curr, 0);
>> +		if (likely(start_dl_timer(dl_se)))
>> +			dl_se->dl_throttled = 1;
>> +		else
>> +			enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
>> +
>> +		if (!is_leftmost(curr, &rq->dl))
>> +			resched_task(curr);
>> +	}
>> +}
>> +
>> +static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
>> +{
>> +	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>> +	struct rb_node **link = &dl_rq->rb_root.rb_node;
>> +	struct rb_node *parent = NULL;
>> +	struct sched_dl_entity *entry;
>> +	int leftmost = 1;
>> +
>> +	BUG_ON(!RB_EMPTY_NODE(&dl_se->rb_node));
>> +
>> +	while (*link) {
>> +		parent = *link;
>> +		entry = rb_entry(parent, struct sched_dl_entity, rb_node);
>> +		if (dl_time_before(dl_se->deadline, entry->deadline))
>> +			link = &parent->rb_left;
>> +		else {
>> +			link = &parent->rb_right;
>> +			leftmost = 0;
>> +		}
>> +	}
>> +
>> +	if (leftmost)
>> +		dl_rq->rb_leftmost = &dl_se->rb_node;
>> +
>> +	rb_link_node(&dl_se->rb_node, parent, link);
>> +	rb_insert_color(&dl_se->rb_node, &dl_rq->rb_root);
>> +
>> +	dl_rq->dl_nr_running++;
>> +}
>> +
>> +static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
>> +{
>> +	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>> +
>> +	if (RB_EMPTY_NODE(&dl_se->rb_node))
>> +		return;
>> +
>> +	if (dl_rq->rb_leftmost == &dl_se->rb_node) {
>> +		struct rb_node *next_node;
>> +
>> +		next_node = rb_next(&dl_se->rb_node);
>> +		dl_rq->rb_leftmost = next_node;
>> +	}
>> +
>> +	rb_erase(&dl_se->rb_node, &dl_rq->rb_root);
>> +	RB_CLEAR_NODE(&dl_se->rb_node);
>> +
>> +	dl_rq->dl_nr_running--;
>> +}
>> +
>> +static void
>> +enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
>> +{
>> +	BUG_ON(on_dl_rq(dl_se));
>> +
>> +	/*
>> +	 * If this is a wakeup or a new instance, the scheduling
>> +	 * parameters of the task might need updating. Otherwise,
>> +	 * we want a replenishment of its runtime.
>> +	 */
>> +	if (!dl_se->dl_new && flags & ENQUEUE_REPLENISH)
>> +		replenish_dl_entity(dl_se);
>> +	else
>> +		update_dl_entity(dl_se);
>> +
>> +	__enqueue_dl_entity(dl_se);
>> +}
>> +
>> +static void dequeue_dl_entity(struct sched_dl_entity *dl_se)
>> +{
>> +	__dequeue_dl_entity(dl_se);
>> +}
>> +
>> +static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
>> +{
>> +	/*
>> +	 * If p is throttled, we do nothing. In fact, if it exhausted
>> +	 * its budget it needs a replenishment and, since it now is on
>> +	 * its rq, the bandwidth timer callback (which clearly has not
>> +	 * run yet) will take care of this.
>> +	 */
>> +	if (p->dl.dl_throttled)
>> +		return;
>> +
>> +	enqueue_dl_entity(&p->dl, flags);
>> +	inc_nr_running(rq);
>> +}
>> +
>> +static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
>> +{
>> +	dequeue_dl_entity(&p->dl);
>> +}
>> +
>> +static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
>> +{
>> +	update_curr_dl(rq);
>> +	__dequeue_task_dl(rq, p, flags);
>> +
>> +	dec_nr_running(rq);
>> +}
>> +
>> +/*
>> + * Yield task semantic for -deadline tasks is:
>> + *
>> + *   get off from the CPU until our next instance, with
>> + *   a new runtime. This is of little use now, since we
>> + *   don't have a bandwidth reclaiming mechanism. Anyway,
>> + *   bandwidth reclaiming is planned for the future, and
>> + *   yield_task_dl will indicate that some spare budget
>> + *   is available for other task instances to use it.
>> + */
>> +static void yield_task_dl(struct rq *rq)
>> +{
>> +	struct task_struct *p = rq->curr;
>> +
>> +	/*
>> +	 * We make the task go to sleep until its current deadline by
>> +	 * forcing its runtime to zero. This way, update_curr_dl() stops
>> +	 * it and the bandwidth timer will wake it up and will give it
>> +	 * new scheduling parameters (thanks to dl_new=1).
>> +	 */
>> +	if (p->dl.runtime > 0) {
>> +		rq->curr->dl.dl_new = 1;
>> +		p->dl.runtime = 0;
>> +	}
>> +	update_curr_dl(rq);
>> +}
>> +
>> +/*
>> + * Only called when both the current and waking task are -deadline
>> + * tasks.
>> + */
>> +static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
>> +				  int flags)
>> +{
>> +	if (dl_time_before(p->dl.deadline, rq->curr->dl.deadline))
>> +		resched_task(rq->curr);
>> +}
>> +
>> +#ifdef CONFIG_SCHED_HRTICK
>> +static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
>> +{
>> +	s64 delta = p->dl.dl_runtime - p->dl.runtime;
>> +
>> +	if (delta > 10000)
>> +		hrtick_start(rq, p->dl.runtime);
>> +}
>> +#endif
>> +
>> +static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
>> +						   struct dl_rq *dl_rq)
>> +{
>> +	struct rb_node *left = dl_rq->rb_leftmost;
>> +
>> +	if (!left)
>> +		return NULL;
>> +
>> +	return rb_entry(left, struct sched_dl_entity, rb_node);
>> +}
>> +
>> +struct task_struct *pick_next_task_dl(struct rq *rq)
>> +{
>> +	struct sched_dl_entity *dl_se;
>> +	struct task_struct *p;
>> +	struct dl_rq *dl_rq;
>> +
>> +	dl_rq = &rq->dl;
>> +
>> +	if (unlikely(!dl_rq->dl_nr_running))
>> +		return NULL;
>> +
>> +	dl_se = pick_next_dl_entity(rq, dl_rq);
>> +	BUG_ON(!dl_se);
>> +
>> +	p = dl_task_of(dl_se);
>> +	p->se.exec_start = rq_clock_task(rq);
>> +#ifdef CONFIG_SCHED_HRTICK
>> +	if (hrtick_enabled(rq))
>> +		start_hrtick_dl(rq, p);
>> +#endif
>> +	return p;
>> +}
>> +
>> +static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
>> +{
>> +	update_curr_dl(rq);
>> +}
>> +
>> +static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
>> +{
>> +	update_curr_dl(rq);
>> +
>> +#ifdef CONFIG_SCHED_HRTICK
>> +	if (hrtick_enabled(rq) && queued && p->dl.runtime > 0)
>> +		start_hrtick_dl(rq, p);
>> +#endif
>> +}
>> +
>> +static void task_fork_dl(struct task_struct *p)
>> +{
>> +	/*
>> +	 * SCHED_DEADLINE tasks cannot fork and this is achieved through
>> +	 * sched_fork()
>> +	 */
>> +}
>> +
>> +static void task_dead_dl(struct task_struct *p)
>> +{
>> +	struct hrtimer *timer = &p->dl.dl_timer;
>> +
>> +	if (hrtimer_active(timer))
>> +		hrtimer_try_to_cancel(timer);
>> +}
>> +
>> +static void set_curr_task_dl(struct rq *rq)
>> +{
>> +	struct task_struct *p = rq->curr;
>> +
>> +	p->se.exec_start = rq_clock_task(rq);
>> +}
>> +
>> +static void switched_from_dl(struct rq *rq, struct task_struct *p)
>> +{
>> +	if (hrtimer_active(&p->dl.dl_timer))
>> +		hrtimer_try_to_cancel(&p->dl.dl_timer);
>> +}
>> +
>> +static void switched_to_dl(struct rq *rq, struct task_struct *p)
>> +{
>> +	/*
>> +	 * If p is throttled, don't consider the possibility
>> +	 * of preempting rq->curr, the check will be done right
>> +	 * after its runtime will get replenished.
>> +	 */
>> +	if (unlikely(p->dl.dl_throttled))
>> +		return;
>> +
>> +	if (!p->on_rq || rq->curr != p) {
>> +		if (task_has_dl_policy(rq->curr))
>> +			check_preempt_curr_dl(rq, p, 0);
>> +		else
>> +			resched_task(rq->curr);
>> +	}
>> +}
>> +
>> +static void prio_changed_dl(struct rq *rq, struct task_struct *p,
>> +			    int oldprio)
>> +{
>> +	switched_to_dl(rq, p);
>> +}
>> +
>> +#ifdef CONFIG_SMP
>> +static int
>> +select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
>> +{
>> +	return task_cpu(p);
>> +}
>> +#endif
>> +
>> +const struct sched_class dl_sched_class = {
>> +	.next			= &rt_sched_class,
>> +	.enqueue_task		= enqueue_task_dl,
>> +	.dequeue_task		= dequeue_task_dl,
>> +	.yield_task		= yield_task_dl,
>> +
>> +	.check_preempt_curr	= check_preempt_curr_dl,
>> +
>> +	.pick_next_task		= pick_next_task_dl,
>> +	.put_prev_task		= put_prev_task_dl,
>> +
>> +#ifdef CONFIG_SMP
>> +	.select_task_rq		= select_task_rq_dl,
>> +#endif
>> +
>> +	.set_curr_task		= set_curr_task_dl,
>> +	.task_tick		= task_tick_dl,
>> +	.task_fork              = task_fork_dl,
>> +	.task_dead		= task_dead_dl,
>> +
>> +	.prio_changed           = prio_changed_dl,
>> +	.switched_from		= switched_from_dl,
>> +	.switched_to		= switched_to_dl,
>> +};
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 64eda5c..ba97476 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -2,6 +2,7 @@
>>  #include <linux/sched.h>
>>  #include <linux/sched/sysctl.h>
>>  #include <linux/sched/rt.h>
>> +#include <linux/sched/deadline.h>
>>  #include <linux/mutex.h>
>>  #include <linux/spinlock.h>
>>  #include <linux/stop_machine.h>
>> @@ -87,11 +88,23 @@ static inline int rt_policy(int policy)
>>  	return 0;
>>  }
>>  
>> +static inline int dl_policy(int policy)
>> +{
>> +	if (unlikely(policy == SCHED_DEADLINE))
>> +		return 1;
>> +	return 0;
>> +}
>> +
>>  static inline int task_has_rt_policy(struct task_struct *p)
>>  {
>>  	return rt_policy(p->policy);
>>  }
>>  
>> +static inline int task_has_dl_policy(struct task_struct *p)
>> +{
>> +	return dl_policy(p->policy);
>> +}
>> +
>>  /*
>>   * This is the priority-queue data structure of the RT scheduling class:
>>   */
>> @@ -363,6 +376,15 @@ struct rt_rq {
>>  #endif
>>  };
>>  
>> +/* Deadline class' related fields in a runqueue */
>> +struct dl_rq {
>> +	/* runqueue is an rbtree, ordered by deadline */
>> +	struct rb_root rb_root;
>> +	struct rb_node *rb_leftmost;
>> +
>> +	unsigned long dl_nr_running;
>> +};
>> +
>>  #ifdef CONFIG_SMP
>>  
>>  /*
>> @@ -427,6 +449,7 @@ struct rq {
>>  
>>  	struct cfs_rq cfs;
>>  	struct rt_rq rt;
>> +	struct dl_rq dl;
>>  
>>  #ifdef CONFIG_FAIR_GROUP_SCHED
>>  	/* list of leaf cfs_rq on this cpu: */
>> @@ -957,6 +980,7 @@ static const u32 prio_to_wmult[40] = {
>>  #else
>>  #define ENQUEUE_WAKING		0
>>  #endif
>> +#define ENQUEUE_REPLENISH	8
>>  
>>  #define DEQUEUE_SLEEP		1
>>  
>> @@ -1012,6 +1036,7 @@ struct sched_class {
>>     for (class = sched_class_highest; class; class = class->next)
>>  
>>  extern const struct sched_class stop_sched_class;
>> +extern const struct sched_class dl_sched_class;
>>  extern const struct sched_class rt_sched_class;
>>  extern const struct sched_class fair_sched_class;
>>  extern const struct sched_class idle_sched_class;
>> @@ -1047,6 +1072,8 @@ extern void resched_cpu(int cpu);
>>  extern struct rt_bandwidth def_rt_bandwidth;
>>  extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
>>  
>> +extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
>> +
>>  extern void update_idle_cpu_load(struct rq *this_rq);
>>  
>>  extern void init_task_runnable_average(struct task_struct *p);
>> @@ -1305,6 +1332,7 @@ extern void print_rt_stats(struct seq_file *m, int cpu);
>>  
>>  extern void init_cfs_rq(struct cfs_rq *cfs_rq);
>>  extern void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq);
>> +extern void init_dl_rq(struct dl_rq *rt_rq, struct rq *rq);
>>  
>>  extern void account_cfs_bandwidth_used(int enabled, int was_enabled);
>>  
>> diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
>> index e08fbee..a5cef17 100644
>> --- a/kernel/sched/stop_task.c
>> +++ b/kernel/sched/stop_task.c
>> @@ -103,7 +103,7 @@ get_rr_interval_stop(struct rq *rq, struct task_struct *task)
>>   * Simple, special scheduling class for the per-CPU stop tasks:
>>   */
>>  const struct sched_class stop_sched_class = {
>> -	.next			= &rt_sched_class,
>> +	.next			= &dl_sched_class,
>>  
>>  	.enqueue_task		= enqueue_task_stop,
>>  	.dequeue_task		= dequeue_task_stop,
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 04/14] sched: SCHED_DEADLINE SMP-related data structures & logic.
  2013-11-07 13:43 ` [PATCH 04/14] sched: SCHED_DEADLINE SMP-related data structures & logic Juri Lelli
@ 2013-11-20 18:51   ` Steven Rostedt
  2013-11-21 14:13     ` Juri Lelli
  2014-01-13 15:53   ` [tip:sched/core] sched/deadline: Add " tip-bot for Juri Lelli
  1 sibling, 1 reply; 81+ messages in thread
From: Steven Rostedt @ 2013-11-20 18:51 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield

On Thu,  7 Nov 2013 14:43:38 +0100
Juri Lelli <juri.lelli@gmail.com> wrote:


> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index cb93f2e..18a73b4 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -10,6 +10,7 @@
>   * miss some of their deadlines), and won't affect any other task.
>   *
>   * Copyright (C) 2012 Dario Faggioli <raistlin@linux.it>,
> + *                    Juri Lelli <juri.lelli@gmail.com>,
>   *                    Michael Trimarchi <michael@amarulasolutions.com>,
>   *                    Fabio Checconi <fchecconi@gmail.com>
>   */
> @@ -20,6 +21,15 @@ static inline int dl_time_before(u64 a, u64 b)
>  	return (s64)(a - b) < 0;
>  }
>  
> +/*
> + * Tells if entity @a should preempt entity @b.
> + */
> +static inline
> +int dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
> +{
> +	return dl_time_before(a->deadline, b->deadline);
> +}
> +
>  static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
>  {
>  	return container_of(dl_se, struct task_struct, dl);
> @@ -53,8 +63,168 @@ static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
>  void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
>  {
>  	dl_rq->rb_root = RB_ROOT;
> +
> +#ifdef CONFIG_SMP
> +	/* zero means no -deadline tasks */

I'm curious to why you add the '-' to -deadline.


> +	dl_rq->earliest_dl.curr = dl_rq->earliest_dl.next = 0;
> +
> +	dl_rq->dl_nr_migratory = 0;
> +	dl_rq->overloaded = 0;
> +	dl_rq->pushable_dl_tasks_root = RB_ROOT;
> +#endif
> +}
> +
> +#ifdef CONFIG_SMP
> +
> +static inline int dl_overloaded(struct rq *rq)
> +{
> +	return atomic_read(&rq->rd->dlo_count);
> +}
> +
> +static inline void dl_set_overload(struct rq *rq)
> +{
> +	if (!rq->online)
> +		return;
> +
> +	cpumask_set_cpu(rq->cpu, rq->rd->dlo_mask);
> +	/*
> +	 * Must be visible before the overload count is
> +	 * set (as in sched_rt.c).
> +	 *
> +	 * Matched by the barrier in pull_dl_task().
> +	 */
> +	smp_wmb();
> +	atomic_inc(&rq->rd->dlo_count);
> +}
> +
> +static inline void dl_clear_overload(struct rq *rq)
> +{
> +	if (!rq->online)
> +		return;
> +
> +	atomic_dec(&rq->rd->dlo_count);
> +	cpumask_clear_cpu(rq->cpu, rq->rd->dlo_mask);
> +}
> +
> +static void update_dl_migration(struct dl_rq *dl_rq)
> +{
> +	if (dl_rq->dl_nr_migratory && dl_rq->dl_nr_total > 1) {
> +		if (!dl_rq->overloaded) {
> +			dl_set_overload(rq_of_dl_rq(dl_rq));
> +			dl_rq->overloaded = 1;
> +		}
> +	} else if (dl_rq->overloaded) {
> +		dl_clear_overload(rq_of_dl_rq(dl_rq));
> +		dl_rq->overloaded = 0;
> +	}
> +}
> +
> +static void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> +{
> +	struct task_struct *p = dl_task_of(dl_se);
> +	dl_rq = &rq_of_dl_rq(dl_rq)->dl;
> +
> +	dl_rq->dl_nr_total++;
> +	if (p->nr_cpus_allowed > 1)
> +		dl_rq->dl_nr_migratory++;
> +
> +	update_dl_migration(dl_rq);
> +}
> +
> +static void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> +{
> +	struct task_struct *p = dl_task_of(dl_se);
> +	dl_rq = &rq_of_dl_rq(dl_rq)->dl;
> +
> +	dl_rq->dl_nr_total--;
> +	if (p->nr_cpus_allowed > 1)
> +		dl_rq->dl_nr_migratory--;
> +
> +	update_dl_migration(dl_rq);
> +}
> +
> +/*
> + * The list of pushable -deadline task is not a plist, like in
> + * sched_rt.c, it is an rb-tree with tasks ordered by deadline.
> + */
> +static void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
> +{
> +	struct dl_rq *dl_rq = &rq->dl;
> +	struct rb_node **link = &dl_rq->pushable_dl_tasks_root.rb_node;
> +	struct rb_node *parent = NULL;
> +	struct task_struct *entry;
> +	int leftmost = 1;
> +
> +	BUG_ON(!RB_EMPTY_NODE(&p->pushable_dl_tasks));
> +
> +	while (*link) {
> +		parent = *link;
> +		entry = rb_entry(parent, struct task_struct,
> +				 pushable_dl_tasks);
> +		if (dl_entity_preempt(&p->dl, &entry->dl))
> +			link = &parent->rb_left;
> +		else {
> +			link = &parent->rb_right;
> +			leftmost = 0;
> +		}
> +	}
> +
> +	if (leftmost)
> +		dl_rq->pushable_dl_tasks_leftmost = &p->pushable_dl_tasks;
> +
> +	rb_link_node(&p->pushable_dl_tasks, parent, link);
> +	rb_insert_color(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root);
> +}
> +
> +static void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
> +{
> +	struct dl_rq *dl_rq = &rq->dl;
> +
> +	if (RB_EMPTY_NODE(&p->pushable_dl_tasks))
> +		return;
> +
> +	if (dl_rq->pushable_dl_tasks_leftmost == &p->pushable_dl_tasks) {
> +		struct rb_node *next_node;
> +
> +		next_node = rb_next(&p->pushable_dl_tasks);
> +		dl_rq->pushable_dl_tasks_leftmost = next_node;
> +	}
> +
> +	rb_erase(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root);
> +	RB_CLEAR_NODE(&p->pushable_dl_tasks);
> +}
> +
> +static inline int has_pushable_dl_tasks(struct rq *rq)
> +{
> +	return !RB_EMPTY_ROOT(&rq->dl.pushable_dl_tasks_root);
> +}
> +
> +static int push_dl_task(struct rq *rq);
> +
> +#else
> +
> +static inline
> +void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
> +{
> +}
> +
> +static inline
> +void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
> +{
> +}
> +
> +static inline
> +void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> +{
> +}
> +
> +static inline
> +void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> +{
>  }
>  
> +#endif /* CONFIG_SMP */
> +
>  static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
>  static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
>  static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
> @@ -307,6 +477,14 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
>  			check_preempt_curr_dl(rq, p, 0);
>  		else
>  			resched_task(rq->curr);
> +#ifdef CONFIG_SMP
> +		/*
> +		 * Queueing this task back might have overloaded rq,
> +		 * check if we need to kick someone away.
> +		 */
> +		if (has_pushable_dl_tasks(rq))
> +			push_dl_task(rq);
> +#endif
>  	}
>  unlock:
>  	raw_spin_unlock(&rq->lock);
> @@ -397,6 +575,100 @@ static void update_curr_dl(struct rq *rq)
>  	}
>  }
>  
> +#ifdef CONFIG_SMP
> +
> +static struct task_struct *pick_next_earliest_dl_task(struct rq *rq, int cpu);
> +
> +static inline u64 next_deadline(struct rq *rq)
> +{
> +	struct task_struct *next = pick_next_earliest_dl_task(rq, rq->cpu);
> +
> +	if (next && dl_prio(next->prio))
> +		return next->dl.deadline;
> +	else
> +		return 0;
> +}
> +
> +static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
> +{
> +	struct rq *rq = rq_of_dl_rq(dl_rq);
> +
> +	if (dl_rq->earliest_dl.curr == 0 ||
> +	    dl_time_before(deadline, dl_rq->earliest_dl.curr)) {
> +		/*
> +		 * If the dl_rq had no -deadline tasks, or if the new task
> +		 * has shorter deadline than the current one on dl_rq, we
> +		 * know that the previous earliest becomes our next earliest,
> +		 * as the new task becomes the earliest itself.
> +		 */
> +		dl_rq->earliest_dl.next = dl_rq->earliest_dl.curr;
> +		dl_rq->earliest_dl.curr = deadline;
> +	} else if (dl_rq->earliest_dl.next == 0 ||
> +		   dl_time_before(deadline, dl_rq->earliest_dl.next)) {
> +		/*
> +		 * On the other hand, if the new -deadline task has a
> +		 * a later deadline than the earliest one on dl_rq, but
> +		 * it is earlier than the next (if any), we must
> +		 * recompute the next-earliest.
> +		 */
> +		dl_rq->earliest_dl.next = next_deadline(rq);
> +	}
> +}
> +
> +static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
> +{
> +	struct rq *rq = rq_of_dl_rq(dl_rq);
> +
> +	/*
> +	 * Since we may have removed our earliest (and/or next earliest)
> +	 * task we must recompute them.
> +	 */
> +	if (!dl_rq->dl_nr_running) {
> +		dl_rq->earliest_dl.curr = 0;
> +		dl_rq->earliest_dl.next = 0;
> +	} else {
> +		struct rb_node *leftmost = dl_rq->rb_leftmost;
> +		struct sched_dl_entity *entry;
> +
> +		entry = rb_entry(leftmost, struct sched_dl_entity, rb_node);
> +		dl_rq->earliest_dl.curr = entry->deadline;
> +		dl_rq->earliest_dl.next = next_deadline(rq);
> +	}
> +}
> +
> +#else
> +
> +static inline void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
> +static inline void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
> +
> +#endif /* CONFIG_SMP */
> +
> +static inline
> +void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> +{
> +	int prio = dl_task_of(dl_se)->prio;
> +	u64 deadline = dl_se->deadline;
> +
> +	WARN_ON(!dl_prio(prio));
> +	dl_rq->dl_nr_running++;
> +
> +	inc_dl_deadline(dl_rq, deadline);
> +	inc_dl_migration(dl_se, dl_rq);
> +}
> +
> +static inline
> +void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> +{
> +	int prio = dl_task_of(dl_se)->prio;
> +
> +	WARN_ON(!dl_prio(prio));
> +	WARN_ON(!dl_rq->dl_nr_running);
> +	dl_rq->dl_nr_running--;
> +
> +	dec_dl_deadline(dl_rq, dl_se->deadline);
> +	dec_dl_migration(dl_se, dl_rq);
> +}
> +
>  static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
>  {
>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> @@ -424,7 +696,7 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
>  	rb_link_node(&dl_se->rb_node, parent, link);
>  	rb_insert_color(&dl_se->rb_node, &dl_rq->rb_root);
>  
> -	dl_rq->dl_nr_running++;
> +	inc_dl_tasks(dl_se, dl_rq);
>  }
>  
>  static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
> @@ -444,7 +716,7 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
>  	rb_erase(&dl_se->rb_node, &dl_rq->rb_root);
>  	RB_CLEAR_NODE(&dl_se->rb_node);
>  
> -	dl_rq->dl_nr_running--;
> +	dec_dl_tasks(dl_se, dl_rq);
>  }
>  
>  static void
> @@ -482,12 +754,17 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
>  		return;
>  
>  	enqueue_dl_entity(&p->dl, flags);
> +
> +	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
> +		enqueue_pushable_dl_task(rq, p);
> +
>  	inc_nr_running(rq);
>  }
>  
>  static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
>  {
>  	dequeue_dl_entity(&p->dl);
> +	dequeue_pushable_dl_task(rq, p);
>  }
>  
>  static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> @@ -525,6 +802,77 @@ static void yield_task_dl(struct rq *rq)
>  	update_curr_dl(rq);
>  }
>  
> +#ifdef CONFIG_SMP
> +
> +static int find_later_rq(struct task_struct *task);
> +static int latest_cpu_find(struct cpumask *span,
> +			   struct task_struct *task,
> +			   struct cpumask *later_mask);
> +
> +static int
> +select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
> +{
> +	struct task_struct *curr;
> +	struct rq *rq;
> +	int cpu;
> +
> +	cpu = task_cpu(p);
> +
> +	if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
> +		goto out;
> +
> +	rq = cpu_rq(cpu);
> +
> +	rcu_read_lock();
> +	curr = ACCESS_ONCE(rq->curr); /* unlocked access */
> +
> +	/*
> +	 * If we are dealing with a -deadline task, we must
> +	 * decide where to wake it up.
> +	 * If it has a later deadline and the current task
> +	 * on this rq can't move (provided the waking task
> +	 * can!) we prefer to send it somewhere else. On the
> +	 * other hand, if it has a shorter deadline, we
> +	 * try to make it stay here, it might be important.
> +	 */
> +	if (unlikely(dl_task(curr)) &&
> +	    (curr->nr_cpus_allowed < 2 ||
> +	     !dl_entity_preempt(&p->dl, &curr->dl)) &&
> +	    (p->nr_cpus_allowed > 1)) {
> +		int target = find_later_rq(p);
> +
> +		if (target != -1)
> +			cpu = target;
> +	}
> +	rcu_read_unlock();
> +
> +out:
> +	return cpu;
> +}
> +
> +static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
> +{
> +	/*
> +	 * Current can't be migrated, useless to reschedule,
> +	 * let's hope p can move out.
> +	 */
> +	if (rq->curr->nr_cpus_allowed == 1 ||
> +	    latest_cpu_find(rq->rd->span, rq->curr, NULL) == -1)
> +		return;
> +
> +	/*
> +	 * p is migratable, so let's not schedule it and
> +	 * see if it is pushed or pulled somewhere else.
> +	 */
> +	if (p->nr_cpus_allowed != 1 &&
> +	    latest_cpu_find(rq->rd->span, p, NULL) != -1)
> +		return;
> +
> +	resched_task(rq->curr);
> +}
> +
> +#endif /* CONFIG_SMP */
> +
>  /*
>   * Only called when both the current and waking task are -deadline
>   * tasks.
> @@ -532,8 +880,20 @@ static void yield_task_dl(struct rq *rq)
>  static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
>  				  int flags)
>  {
> -	if (dl_time_before(p->dl.deadline, rq->curr->dl.deadline))
> +	if (dl_entity_preempt(&p->dl, &rq->curr->dl)) {
>  		resched_task(rq->curr);
> +		return;
> +	}
> +
> +#ifdef CONFIG_SMP
> +	/*
> +	 * In the unlikely case current and p have the same deadline
> +	 * let us try to decide what's the best thing to do...
> +	 */
> +	if ((s64)(p->dl.deadline - rq->curr->dl.deadline) == 0 &&
> +	    !need_resched())
> +		check_preempt_equal_dl(rq, p);
> +#endif /* CONFIG_SMP */
>  }
>  
>  #ifdef CONFIG_SCHED_HRTICK
> @@ -573,16 +933,29 @@ struct task_struct *pick_next_task_dl(struct rq *rq)
>  
>  	p = dl_task_of(dl_se);
>  	p->se.exec_start = rq_clock_task(rq);
> +
> +	/* Running task will never be pushed. */
> +	if (p)
> +		dequeue_pushable_dl_task(rq, p);
> +
>  #ifdef CONFIG_SCHED_HRTICK
>  	if (hrtick_enabled(rq))
>  		start_hrtick_dl(rq, p);
>  #endif
> +
> +#ifdef CONFIG_SMP
> +	rq->post_schedule = has_pushable_dl_tasks(rq);
> +#endif /* CONFIG_SMP */
> +
>  	return p;
>  }
>  
>  static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
>  {
>  	update_curr_dl(rq);
> +
> +	if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
> +		enqueue_pushable_dl_task(rq, p);
>  }
>  
>  static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
> @@ -616,16 +989,517 @@ static void set_curr_task_dl(struct rq *rq)
>  	struct task_struct *p = rq->curr;
>  
>  	p->se.exec_start = rq_clock_task(rq);
> +
> +	/* You can't push away the running task */
> +	dequeue_pushable_dl_task(rq, p);
> +}
> +
> +#ifdef CONFIG_SMP
> +
> +/* Only try algorithms three times */
> +#define DL_MAX_TRIES 3
> +
> +static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu)
> +{
> +	if (!task_running(rq, p) &&
> +	    (cpu < 0 || cpumask_test_cpu(cpu, &p->cpus_allowed)) &&
> +	    (p->nr_cpus_allowed > 1))
> +		return 1;
> +
> +	return 0;
> +}
> +
> +/* Returns the second earliest -deadline task, NULL otherwise */
> +static struct task_struct *pick_next_earliest_dl_task(struct rq *rq, int cpu)
> +{
> +	struct rb_node *next_node = rq->dl.rb_leftmost;
> +	struct sched_dl_entity *dl_se;
> +	struct task_struct *p = NULL;
> +
> +next_node:
> +	next_node = rb_next(next_node);
> +	if (next_node) {
> +		dl_se = rb_entry(next_node, struct sched_dl_entity, rb_node);
> +		p = dl_task_of(dl_se);
> +
> +		if (pick_dl_task(rq, p, cpu))
> +			return p;
> +
> +		goto next_node;
> +	}
> +
> +	return NULL;
> +}
> +
> +static int latest_cpu_find(struct cpumask *span,
> +			   struct task_struct *task,
> +			   struct cpumask *later_mask)
> +{
> +	const struct sched_dl_entity *dl_se = &task->dl;
> +	int cpu, found = -1, best = 0;
> +	u64 max_dl = 0;
> +
> +	for_each_cpu(cpu, span) {
> +		struct rq *rq = cpu_rq(cpu);
> +		struct dl_rq *dl_rq = &rq->dl;
> +
> +		if (cpumask_test_cpu(cpu, &task->cpus_allowed) &&
> +		    (!dl_rq->dl_nr_running || dl_time_before(dl_se->deadline,
> +		     dl_rq->earliest_dl.curr))) {
> +			if (later_mask)
> +				cpumask_set_cpu(cpu, later_mask);
> +			if (!best && !dl_rq->dl_nr_running) {
> +				best = 1;
> +				found = cpu;
> +			} else if (!best &&
> +				   dl_time_before(max_dl,
> +						  dl_rq->earliest_dl.curr)) {

Ug, the above is hard to read. What about:

	if (!best) {
		if (!dl_rq->dl_nr_running) {
			best = 1;
			found = cpu;
		} elsif (dl_time_before(...)) {
			...
		}
	}

Also, I would think dl should be nice to rt as well. There may be a
idle CPU or a non rt task, and this could pick a CPU running an RT
task. Worse yet, that RT task may be pinned to that CPU.

We should be able to incorporate cpuprio_find() to be dl aware too.
That is, work for both -rt and -dl.



> +				max_dl = dl_rq->earliest_dl.curr;
> +				found = cpu;
> +			}
> +		} else if (later_mask)
> +			cpumask_clear_cpu(cpu, later_mask);
> +	}
> +
> +	return found;
> +}
> +
> +static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
> +
> +static int find_later_rq(struct task_struct *task)
> +{
> +	struct sched_domain *sd;
> +	struct cpumask *later_mask = __get_cpu_var(local_cpu_mask_dl);
> +	int this_cpu = smp_processor_id();
> +	int best_cpu, cpu = task_cpu(task);
> +
> +	/* Make sure the mask is initialized first */
> +	if (unlikely(!later_mask))
> +		return -1;
> +
> +	if (task->nr_cpus_allowed == 1)
> +		return -1;
> +
> +	best_cpu = latest_cpu_find(task_rq(task)->rd->span, task, later_mask);
> +	if (best_cpu == -1)
> +		return -1;
> +
> +	/*
> +	 * If we are here, some target has been found,
> +	 * the most suitable of which is cached in best_cpu.
> +	 * This is, among the runqueues where the current tasks
> +	 * have later deadlines than the task's one, the rq
> +	 * with the latest possible one.
> +	 *
> +	 * Now we check how well this matches with task's
> +	 * affinity and system topology.
> +	 *
> +	 * The last cpu where the task run is our first
> +	 * guess, since it is most likely cache-hot there.
> +	 */
> +	if (cpumask_test_cpu(cpu, later_mask))
> +		return cpu;
> +	/*
> +	 * Check if this_cpu is to be skipped (i.e., it is
> +	 * not in the mask) or not.
> +	 */
> +	if (!cpumask_test_cpu(this_cpu, later_mask))
> +		this_cpu = -1;
> +
> +	rcu_read_lock();
> +	for_each_domain(cpu, sd) {
> +		if (sd->flags & SD_WAKE_AFFINE) {
> +
> +			/*
> +			 * If possible, preempting this_cpu is
> +			 * cheaper than migrating.
> +			 */
> +			if (this_cpu != -1 &&
> +			    cpumask_test_cpu(this_cpu, sched_domain_span(sd))) {
> +				rcu_read_unlock();
> +				return this_cpu;
> +			}
> +
> +			/*
> +			 * Last chance: if best_cpu is valid and is
> +			 * in the mask, that becomes our choice.
> +			 */
> +			if (best_cpu < nr_cpu_ids &&
> +			    cpumask_test_cpu(best_cpu, sched_domain_span(sd))) {
> +				rcu_read_unlock();
> +				return best_cpu;
> +			}
> +		}
> +	}
> +	rcu_read_unlock();
> +
> +	/*
> +	 * At this point, all our guesses failed, we just return
> +	 * 'something', and let the caller sort the things out.
> +	 */
> +	if (this_cpu != -1)
> +		return this_cpu;
> +
> +	cpu = cpumask_any(later_mask);
> +	if (cpu < nr_cpu_ids)
> +		return cpu;
> +
> +	return -1;
> +}
> +
> +/* Locks the rq it finds */
> +static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
> +{
> +	struct rq *later_rq = NULL;
> +	int tries;
> +	int cpu;
> +
> +	for (tries = 0; tries < DL_MAX_TRIES; tries++) {
> +		cpu = find_later_rq(task);
> +
> +		if ((cpu == -1) || (cpu == rq->cpu))
> +			break;
> +
> +		later_rq = cpu_rq(cpu);
> +
> +		/* Retry if something changed. */
> +		if (double_lock_balance(rq, later_rq)) {
> +			if (unlikely(task_rq(task) != rq ||
> +				     !cpumask_test_cpu(later_rq->cpu,
> +				                       &task->cpus_allowed) ||
> +				     task_running(rq, task) || !task->on_rq)) {
> +				double_unlock_balance(rq, later_rq);
> +				later_rq = NULL;
> +				break;
> +			}
> +		}
> +
> +		/*
> +		 * If the rq we found has no -deadline task, or
> +		 * its earliest one has a later deadline than our
> +		 * task, the rq is a good one.
> +		 */
> +		if (!later_rq->dl.dl_nr_running ||
> +		    dl_time_before(task->dl.deadline,
> +				   later_rq->dl.earliest_dl.curr))
> +			break;
> +
> +		/* Otherwise we try again. */
> +		double_unlock_balance(rq, later_rq);
> +		later_rq = NULL;
> +	}
> +
> +	return later_rq;
>  }
>  
> +static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
> +{
> +	struct task_struct *p;
> +
> +	if (!has_pushable_dl_tasks(rq))
> +		return NULL;
> +
> +	p = rb_entry(rq->dl.pushable_dl_tasks_leftmost,
> +		     struct task_struct, pushable_dl_tasks);
> +
> +	BUG_ON(rq->cpu != task_cpu(p));
> +	BUG_ON(task_current(rq, p));
> +	BUG_ON(p->nr_cpus_allowed <= 1);
> +
> +	BUG_ON(!p->se.on_rq);
> +	BUG_ON(!dl_task(p));
> +
> +	return p;
> +}
> +
> +/*
> + * See if the non running -deadline tasks on this rq
> + * can be sent to some other CPU where they can preempt
> + * and start executing.
> + */
> +static int push_dl_task(struct rq *rq)
> +{
> +	struct task_struct *next_task;
> +	struct rq *later_rq;
> +
> +	if (!rq->dl.overloaded)
> +		return 0;
> +
> +	next_task = pick_next_pushable_dl_task(rq);
> +	if (!next_task)
> +		return 0;
> +
> +retry:
> +	if (unlikely(next_task == rq->curr)) {
> +		WARN_ON(1);
> +		return 0;
> +	}
> +
> +	/*
> +	 * If next_task preempts rq->curr, and rq->curr
> +	 * can move away, it makes sense to just reschedule
> +	 * without going further in pushing next_task.
> +	 */
> +	if (dl_task(rq->curr) &&
> +	    dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) &&
> +	    rq->curr->nr_cpus_allowed > 1) {
> +		resched_task(rq->curr);
> +		return 0;
> +	}
> +
> +	/* We might release rq lock */
> +	get_task_struct(next_task);
> +
> +	/* Will lock the rq it'll find */
> +	later_rq = find_lock_later_rq(next_task, rq);
> +	if (!later_rq) {
> +		struct task_struct *task;
> +
> +		/*
> +		 * We must check all this again, since
> +		 * find_lock_later_rq releases rq->lock and it is
> +		 * then possible that next_task has migrated.
> +		 */
> +		task = pick_next_pushable_dl_task(rq);
> +		if (task_cpu(next_task) == rq->cpu && task == next_task) {
> +			/*
> +			 * The task is still there. We don't try
> +			 * again, some other cpu will pull it when ready.
> +			 */
> +			dequeue_pushable_dl_task(rq, next_task);
> +			goto out;
> +		}
> +
> +		if (!task)
> +			/* No more tasks */
> +			goto out;
> +
> +		put_task_struct(next_task);
> +		next_task = task;
> +		goto retry;
> +	}
> +
> +	deactivate_task(rq, next_task, 0);
> +	set_task_cpu(next_task, later_rq->cpu);
> +	activate_task(later_rq, next_task, 0);
> +
> +	resched_task(later_rq->curr);
> +
> +	double_unlock_balance(rq, later_rq);
> +
> +out:
> +	put_task_struct(next_task);
> +
> +	return 1;
> +}
> +
> +static void push_dl_tasks(struct rq *rq)
> +{
> +	/* Terminates as it moves a -deadline task */
> +	while (push_dl_task(rq))
> +		;
> +}
> +
> +static int pull_dl_task(struct rq *this_rq)
> +{
> +	int this_cpu = this_rq->cpu, ret = 0, cpu;
> +	struct task_struct *p;
> +	struct rq *src_rq;
> +	u64 dmin = LONG_MAX;
> +
> +	if (likely(!dl_overloaded(this_rq)))
> +		return 0;
> +
> +	/*
> +	 * Match the barrier from dl_set_overloaded; this guarantees that if we
> +	 * see overloaded we must also see the dlo_mask bit.
> +	 */
> +	smp_rmb();
> +
> +	for_each_cpu(cpu, this_rq->rd->dlo_mask) {
> +		if (this_cpu == cpu)
> +			continue;
> +
> +		src_rq = cpu_rq(cpu);
> +
> +		/*
> +		 * It looks racy, abd it is! However, as in sched_rt.c,

abd it is?

:-)

-- Steve

> +		 * we are fine with this.
> +		 */
> +		if (this_rq->dl.dl_nr_running &&
> +		    dl_time_before(this_rq->dl.earliest_dl.curr,
> +				   src_rq->dl.earliest_dl.next))
> +			continue;
> +
> +		/* Might drop this_rq->lock */
> +		double_lock_balance(this_rq, src_rq);
> +
> +		/*
> +		 * If there are no more pullable tasks on the
> +		 * rq, we're done with it.
> +		 */
> +		if (src_rq->dl.dl_nr_running <= 1)
> +			goto skip;
> +
> +		p = pick_next_earliest_dl_task(src_rq, this_cpu);
> +
> +		/*
> +		 * We found a task to be pulled if:
> +		 *  - it preempts our current (if there's one),
> +		 *  - it will preempt the last one we pulled (if any).
> +		 */
> +		if (p && dl_time_before(p->dl.deadline, dmin) &&
> +		    (!this_rq->dl.dl_nr_running ||
> +		     dl_time_before(p->dl.deadline,
> +				    this_rq->dl.earliest_dl.curr))) {
> +			WARN_ON(p == src_rq->curr);
> +			WARN_ON(!p->se.on_rq);
> +
> +			/*
> +			 * Then we pull iff p has actually an earlier
> +			 * deadline than the current task of its runqueue.
> +			 */
> +			if (dl_time_before(p->dl.deadline,
> +					   src_rq->curr->dl.deadline))
> +				goto skip;
> +
> +			ret = 1;
> +
> +			deactivate_task(src_rq, p, 0);
> +			set_task_cpu(p, this_cpu);
> +			activate_task(this_rq, p, 0);
> +			dmin = p->dl.deadline;
> +
> +			/* Is there any other task even earlier? */
> +		}
> +skip:
> +		double_unlock_balance(this_rq, src_rq);
> +	}
> +
> +	return ret;
> +}
> +
> +static void pre_schedule_dl(struct rq *rq, struct task_struct *prev)
> +{
> +	/* Try to pull other tasks here */
> +	if (dl_task(prev))
> +		pull_dl_task(rq);
> +}
> +
> +static void post_schedule_dl(struct rq *rq)
> +{
> +	push_dl_tasks(rq);
> +}
> +
> +/*
> + * Since the task is not running and a reschedule is not going to happen
> + * anytime soon on its runqueue, we try pushing it away now.
> + */
> +static void task_woken_dl(struct rq *rq, struct task_struct *p)
> +{
> +	if (!task_running(rq, p) &&
> +	    !test_tsk_need_resched(rq->curr) &&
> +	    has_pushable_dl_tasks(rq) &&
> +	    p->nr_cpus_allowed > 1 &&
> +	    dl_task(rq->curr) &&
> +	    (rq->curr->nr_cpus_allowed < 2 ||
> +	     dl_entity_preempt(&rq->curr->dl, &p->dl))) {
> +		push_dl_tasks(rq);
> +	}
> +}
> +
> +static void set_cpus_allowed_dl(struct task_struct *p,
> +				const struct cpumask *new_mask)
> +{
> +	struct rq *rq;
> +	int weight;
> +
> +	BUG_ON(!dl_task(p));
> +
> +	/*
> +	 * Update only if the task is actually running (i.e.,
> +	 * it is on the rq AND it is not throttled).
> +	 */
> +	if (!on_dl_rq(&p->dl))
> +		return;
> +
> +	weight = cpumask_weight(new_mask);
> +
> +	/*
> +	 * Only update if the process changes its state from whether it
> +	 * can migrate or not.
> +	 */
> +	if ((p->nr_cpus_allowed > 1) == (weight > 1))
> +		return;
> +
> +	rq = task_rq(p);
> +
> +	/*
> +	 * The process used to be able to migrate OR it can now migrate
> +	 */
> +	if (weight <= 1) {
> +		if (!task_current(rq, p))
> +			dequeue_pushable_dl_task(rq, p);
> +		BUG_ON(!rq->dl.dl_nr_migratory);
> +		rq->dl.dl_nr_migratory--;
> +	} else {
> +		if (!task_current(rq, p))
> +			enqueue_pushable_dl_task(rq, p);
> +		rq->dl.dl_nr_migratory++;
> +	}
> +	
> +	update_dl_migration(&rq->dl);
> +}
> +
> +/* Assumes rq->lock is held */
> +static void rq_online_dl(struct rq *rq)
> +{
> +	if (rq->dl.overloaded)
> +		dl_set_overload(rq);
> +}
> +
> +/* Assumes rq->lock is held */
> +static void rq_offline_dl(struct rq *rq)
> +{
> +	if (rq->dl.overloaded)
> +		dl_clear_overload(rq);
> +}
> +
> +void init_sched_dl_class(void)
> +{
> +	unsigned int i;
> +
> +	for_each_possible_cpu(i)
> +		zalloc_cpumask_var_node(&per_cpu(local_cpu_mask_dl, i),
> +					GFP_KERNEL, cpu_to_node(i));
> +}
> +
> +#endif /* CONFIG_SMP */
> +
>  static void switched_from_dl(struct rq *rq, struct task_struct *p)
>  {
> -	if (hrtimer_active(&p->dl.dl_timer))
> +	if (hrtimer_active(&p->dl.dl_timer) && !dl_policy(p->policy))
>  		hrtimer_try_to_cancel(&p->dl.dl_timer);
> +
> +#ifdef CONFIG_SMP
> +	/*
> +	 * Since this might be the only -deadline task on the rq,
> +	 * this is the right place to try to pull some other one
> +	 * from an overloaded cpu, if any.
> +	 */
> +	if (!rq->dl.dl_nr_running)
> +		pull_dl_task(rq);
> +#endif
>  }
>  
> +/*
> + * When switching to -deadline, we may overload the rq, then
> + * we try to push someone off, if possible.
> + */
>  static void switched_to_dl(struct rq *rq, struct task_struct *p)
>  {
> +	int check_resched = 1;
> +
>  	/*
>  	 * If p is throttled, don't consider the possibility
>  	 * of preempting rq->curr, the check will be done right
> @@ -635,26 +1509,53 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
>  		return;
>  
>  	if (!p->on_rq || rq->curr != p) {
> -		if (task_has_dl_policy(rq->curr))
> +#ifdef CONFIG_SMP
> +		if (rq->dl.overloaded && push_dl_task(rq) && rq != task_rq(p))
> +			/* Only reschedule if pushing failed */
> +			check_resched = 0;
> +#endif /* CONFIG_SMP */
> +		if (check_resched && task_has_dl_policy(rq->curr))
>  			check_preempt_curr_dl(rq, p, 0);
> -		else
> -			resched_task(rq->curr);
>  	}
>  }
>  
> +/*
> + * If the scheduling parameters of a -deadline task changed,
> + * a push or pull operation might be needed.
> + */
>  static void prio_changed_dl(struct rq *rq, struct task_struct *p,
>  			    int oldprio)
>  {
> -	switched_to_dl(rq, p);
> -}
> -
> +	if (p->on_rq || rq->curr == p) {
>  #ifdef CONFIG_SMP
> -static int
> -select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
> -{
> -	return task_cpu(p);
> +		/*
> +		 * This might be too much, but unfortunately
> +		 * we don't have the old deadline value, and
> +		 * we can't argue if the task is increasing
> +		 * or lowering its prio, so...
> +		 */
> +		if (!rq->dl.overloaded)
> +			pull_dl_task(rq);
> +
> +		/*
> +		 * If we now have a earlier deadline task than p,
> +		 * then reschedule, provided p is still on this
> +		 * runqueue.
> +		 */
> +		if (dl_time_before(rq->dl.earliest_dl.curr, p->dl.deadline) &&
> +		    rq->curr == p)
> +			resched_task(p);
> +#else
> +		/*
> +		 * Again, we don't know if p has a earlier
> +		 * or later deadline, so let's blindly set a
> +		 * (maybe not needed) rescheduling point.
> +		 */
> +		resched_task(p);
> +#endif /* CONFIG_SMP */
> +	} else
> +		switched_to_dl(rq, p);
>  }
> -#endif
>  
>  const struct sched_class dl_sched_class = {
>  	.next			= &rt_sched_class,
> @@ -669,6 +1570,12 @@ const struct sched_class dl_sched_class = {
>  
>  #ifdef CONFIG_SMP
>  	.select_task_rq		= select_task_rq_dl,
> +	.set_cpus_allowed       = set_cpus_allowed_dl,
> +	.rq_online              = rq_online_dl,
> +	.rq_offline             = rq_offline_dl,
> +	.pre_schedule		= pre_schedule_dl,
> +	.post_schedule		= post_schedule_dl,
> +	.task_woken		= task_woken_dl,
>  #endif
>  
>  	.set_curr_task		= set_curr_task_dl,
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 01970c8..f7c4881 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1720,7 +1720,7 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
>  	    !test_tsk_need_resched(rq->curr) &&
>  	    has_pushable_tasks(rq) &&
>  	    p->nr_cpus_allowed > 1 &&
> -	    rt_task(rq->curr) &&
> +	    (dl_task(rq->curr) || rt_task(rq->curr)) &&
>  	    (rq->curr->nr_cpus_allowed < 2 ||
>  	     rq->curr->prio <= p->prio))
>  		push_rt_tasks(rq);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ba97476..70d0030 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -383,6 +383,31 @@ struct dl_rq {
>  	struct rb_node *rb_leftmost;
>  
>  	unsigned long dl_nr_running;
> +
> +#ifdef CONFIG_SMP
> +	/*
> +	 * Deadline values of the currently executing and the
> +	 * earliest ready task on this rq. Caching these facilitates
> +	 * the decision wether or not a ready but not running task
> +	 * should migrate somewhere else.
> +	 */
> +	struct {
> +		u64 curr;
> +		u64 next;
> +	} earliest_dl;
> +
> +	unsigned long dl_nr_migratory;
> +	unsigned long dl_nr_total;
> +	int overloaded;
> +
> +	/*
> +	 * Tasks on this rq that can be pushed away. They are kept in
> +	 * an rb-tree, ordered by tasks' deadlines, with caching
> +	 * of the leftmost (earliest deadline) element.
> +	 */
> +	struct rb_root pushable_dl_tasks_root;
> +	struct rb_node *pushable_dl_tasks_leftmost;
> +#endif
>  };
>  
>  #ifdef CONFIG_SMP
> @@ -403,6 +428,13 @@ struct root_domain {
>  	cpumask_var_t online;
>  
>  	/*
> +	 * The bit corresponding to a CPU gets set here if such CPU has more
> +	 * than one runnable -deadline task (as it is below for RT tasks).
> +	 */
> +	cpumask_var_t dlo_mask;
> +	atomic_t dlo_count;
> +
> +	/*
>  	 * The "RT overload" flag: it gets set if a CPU has more than
>  	 * one runnable RT task.
>  	 */
> @@ -1063,6 +1095,8 @@ static inline void idle_balance(int cpu, struct rq *rq)
>  extern void sysrq_sched_debug_show(void);
>  extern void sched_init_granularity(void);
>  extern void update_max_interval(void);
> +
> +extern void init_sched_dl_class(void);
>  extern void init_sched_rt_class(void);
>  extern void init_sched_fair_class(void);
>  


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 03/14] sched: SCHED_DEADLINE structures & implementation.
  2013-11-07 13:43 ` [PATCH 03/14] sched: SCHED_DEADLINE structures & implementation Juri Lelli
  2013-11-13  2:31   ` Steven Rostedt
@ 2013-11-20 20:23   ` Steven Rostedt
  2013-11-21 14:15     ` Juri Lelli
  2014-01-13 15:53   ` [tip:sched/core] sched/deadline: Add " tip-bot for Dario Faggioli
  2 siblings, 1 reply; 81+ messages in thread
From: Steven Rostedt @ 2013-11-20 20:23 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield

On Thu,  7 Nov 2013 14:43:37 +0100
Juri Lelli <juri.lelli@gmail.com> wrote:


> +/*
> + * This function validates the new parameters of a -deadline task.
> + * We ask for the deadline not being zero, and greater or equal
> + * than the runtime.
> + */
> +static bool
> +__checkparam_dl(const struct sched_param2 *prm)
> +{
> +	return prm && (&prm->sched_deadline) != 0 &&
> +	       (s64)(&prm->sched_deadline - &prm->sched_runtime) >= 0;

Patch 6 brought this to my attention. Looks like using the address of
the fields is wrong. I know patch 6 fixes this, but lets make it
correct in this patch first.

Thanks,

-- Steve


> +}
> +
> +/*
>   * check the target process has a UID that matches the current process's
>   */
>  static bool check_same_owner(struct task_struct *p)

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 08/14] sched: add latency tracing for -deadline tasks.
  2013-11-07 13:43 ` [PATCH 08/14] sched: add latency tracing " Juri Lelli
@ 2013-11-20 21:33   ` Steven Rostedt
  2013-11-27 13:43     ` Juri Lelli
  2014-01-13 15:54   ` [tip:sched/core] sched/deadline: Add latency tracing for SCHED_DEADLINE tasks tip-bot for Dario Faggioli
  1 sibling, 1 reply; 81+ messages in thread
From: Steven Rostedt @ 2013-11-20 21:33 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield

On Thu,  7 Nov 2013 14:43:42 +0100
Juri Lelli <juri.lelli@gmail.com> wrote:


> +	/*
> +	 * Semantic is like this:
> +	 *  - wakeup tracer handles all tasks in the system, independently
> +	 *    from their scheduling class;
> +	 *  - wakeup_rt tracer handles tasks belonging to sched_dl and
> +	 *    sched_rt class;
> +	 *  - wakeup_dl handles tasks belonging to sched_dl class only.
> +	 */
> +	if ((wakeup_dl && !dl_task(p)) ||
> +	    (wakeup_rt && !dl_task(p) && !rt_task(p)) ||
> +	    (p->prio >= wakeup_prio || p->prio >= current->prio))
>  		return;
>  
>  	pc = preempt_count();
> @@ -486,7 +495,7 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
>  	arch_spin_lock(&wakeup_lock);
>  
>  	/* check for races. */
> -	if (!tracer_enabled || p->prio >= wakeup_prio)
> +	if (!tracer_enabled || (!dl_task(p) && p->prio >= wakeup_prio))
>  		goto out_locked;

We probably want to add a "tracing_dl" variable, and do the test like
this:

	if (!tracer_enabled || tracing_dl ||
	    (!dl_task(p) && p->prio >= wakeup_prio))

and for the first if statement too. Otherwise if two dl tasks are
running on two different CPUs, the second will override the first. Once
you start tracing a dl_task, you shouldn't bother tracing another task
until that one wakes up.

	if (dl_task(p))
		tracing_dl = 1;
	else
		tracing_dl = 0;

-- Steve

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 09/14] rtmutex: turn the plist into an rb-tree.
  2013-11-07 13:43 ` [PATCH 09/14] rtmutex: turn the plist into an rb-tree Juri Lelli
@ 2013-11-21  3:07   ` Steven Rostedt
  2013-11-21 17:52   ` [PATCH] rtmutex: Fix compare of waiter prio and task prio Steven Rostedt
  2014-01-13 15:54   ` [tip:sched/core] rtmutex: Turn the plist into an rb-tree tip-bot for Peter Zijlstra
  2 siblings, 0 replies; 81+ messages in thread
From: Steven Rostedt @ 2013-11-21  3:07 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield

On Thu,  7 Nov 2013 14:43:43 +0100
Juri Lelli <juri.lelli@gmail.com> wrote:

> From: Peter Zijlstra <peterz@infradead.org>
> 
> Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
> and provide a proper comparison function for -deadline and
> -priority tasks.
> 
> This is done mainly because:
>  - classical prio field of the plist is just an int, which might
>    not be enough for representing a deadline;
>  - manipulating such a list would become O(nr_deadline_tasks),
>    which might be to much, as the number of -deadline task increases.
> 
> Therefore, an rb-tree is used, and tasks are queued in it according
> to the following logic:
>  - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
>    one with the higher (lower, actually!) prio wins;
>  - among a -priority and a -deadline task, the latter always wins;
>  - among two -deadline tasks, the one with the earliest deadline
>    wins.
> 
> Queueing and dequeueing functions are changed accordingly, for both
> the list of a task's pi-waiters and the list of tasks blocked on
> a pi-lock.

It will be interesting to see if this affects performance of the -rt
patch, as the pi lists are stressed much more.

Although this looks like it will remove that nasty hack in the -rt
patch where the locks have to call "init_lists()" because plists are
something not initialized easily on static variables.




> diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
> index 0dd6aec..4ea7eaa 100644
> --- a/kernel/rtmutex.c
> +++ b/kernel/rtmutex.c
> @@ -91,10 +91,104 @@ static inline void mark_rt_mutex_waiters(struct rt_mutex *lock)
>  }
>  #endif
>  
> +static inline int
> +rt_mutex_waiter_less(struct rt_mutex_waiter *left,
> +		     struct rt_mutex_waiter *right)
> +{
> +	if (left->task->prio < right->task->prio)
> +		return 1;
> +
> +	/*
> +	 * If both tasks are dl_task(), we check their deadlines.
> +	 */
> +	if (dl_prio(left->task->prio) && dl_prio(right->task->prio))
> +		return (left->task->dl.deadline < right->task->dl.deadline);

Hmm, actually you only need to check the left task if it has a
dl_prio() or not. If it has a dl_prio, then the only way it could have
not returned with a 1 from the first compare is if the right task also
has a dl_prio().


> +
> +	return 0;
> +}
> +
> +static void
> +rt_mutex_enqueue(struct rt_mutex *lock, struct rt_mutex_waiter *waiter)
> +{
> +	struct rb_node **link = &lock->waiters.rb_node;
> +	struct rb_node *parent = NULL;
> +	struct rt_mutex_waiter *entry;
> +	int leftmost = 1;
> +
> +	while (*link) {
> +		parent = *link;
> +		entry = rb_entry(parent, struct rt_mutex_waiter, tree_entry);
> +		if (rt_mutex_waiter_less(waiter, entry)) {
> +			link = &parent->rb_left;
> +		} else {
> +			link = &parent->rb_right;
> +			leftmost = 0;
> +		}
> +	}
> +
> +	if (leftmost)
> +		lock->waiters_leftmost = &waiter->tree_entry;
> +
> +	rb_link_node(&waiter->tree_entry, parent, link);
> +	rb_insert_color(&waiter->tree_entry, &lock->waiters);
> +}
> +
> +static void
> +rt_mutex_dequeue(struct rt_mutex *lock, struct rt_mutex_waiter *waiter)
> +{
> +	if (RB_EMPTY_NODE(&waiter->tree_entry))
> +		return;
> +
> +	if (lock->waiters_leftmost == &waiter->tree_entry)
> +		lock->waiters_leftmost = rb_next(&waiter->tree_entry);
> +
> +	rb_erase(&waiter->tree_entry, &lock->waiters);
> +	RB_CLEAR_NODE(&waiter->tree_entry);
> +}
> +
> +static void
> +rt_mutex_enqueue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
> +{
> +	struct rb_node **link = &task->pi_waiters.rb_node;
> +	struct rb_node *parent = NULL;
> +	struct rt_mutex_waiter *entry;
> +	int leftmost = 1;
> +
> +	while (*link) {
> +		parent = *link;
> +		entry = rb_entry(parent, struct rt_mutex_waiter, pi_tree_entry);
> +		if (rt_mutex_waiter_less(waiter, entry)) {
> +			link = &parent->rb_left;
> +		} else {
> +			link = &parent->rb_right;
> +			leftmost = 0;
> +		}
> +	}
> +
> +	if (leftmost)
> +		task->pi_waiters_leftmost = &waiter->pi_tree_entry;
> +
> +	rb_link_node(&waiter->pi_tree_entry, parent, link);
> +	rb_insert_color(&waiter->pi_tree_entry, &task->pi_waiters);
> +}
> +
> +static void
> +rt_mutex_dequeue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
> +{
> +	if (RB_EMPTY_NODE(&waiter->pi_tree_entry))
> +		return;
> +
> +	if (task->pi_waiters_leftmost == &waiter->pi_tree_entry)
> +		task->pi_waiters_leftmost = rb_next(&waiter->pi_tree_entry);
> +
> +	rb_erase(&waiter->pi_tree_entry, &task->pi_waiters);
> +	RB_CLEAR_NODE(&waiter->pi_tree_entry);
> +}
> +
>  /*
> - * Calculate task priority from the waiter list priority
> + * Calculate task priority from the waiter tree priority
>   *
> - * Return task->normal_prio when the waiter list is empty or when
> + * Return task->normal_prio when the waiter tree is empty or when
>   * the waiter is not allowed to do priority boosting
>   */
>  int rt_mutex_getprio(struct task_struct *task)
> @@ -102,7 +196,7 @@ int rt_mutex_getprio(struct task_struct *task)
>  	if (likely(!task_has_pi_waiters(task)))
>  		return task->normal_prio;
>  
> -	return min(task_top_pi_waiter(task)->pi_list_entry.prio,
> +	return min(task_top_pi_waiter(task)->task->prio,
>  		   task->normal_prio);
>  }
>  
> @@ -233,7 +327,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
>  	 * When deadlock detection is off then we check, if further
>  	 * priority adjustment is necessary.
>  	 */
> -	if (!detect_deadlock && waiter->list_entry.prio == task->prio)
> +	if (!detect_deadlock && waiter->task->prio == task->prio)

This will always be true, as waiter->task == task.

>  		goto out_unlock_pi;
>  
>  	lock = waiter->lock;
> @@ -254,9 +348,9 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
>  	top_waiter = rt_mutex_top_waiter(lock);
>  
>  	/* Requeue the waiter */
> -	plist_del(&waiter->list_entry, &lock->wait_list);
> -	waiter->list_entry.prio = task->prio;
> -	plist_add(&waiter->list_entry, &lock->wait_list);
> +	rt_mutex_dequeue(lock, waiter);
> +	waiter->task->prio = task->prio;

This is rather pointless, as waiter->task == task.

We need to add a prio to the rt_mutex_waiter structure, because we need
a way to know if the prio changed or not. There's a reason we used the
list_entry.prio and not the task prio.

Then you could substitute all the waiter->task->prio with just
waiter->prio and that should also work.

-- Steve

> +	rt_mutex_enqueue(lock, waiter);
>  
>  	/* Release the task */
>  	raw_spin_unlock_irqrestore(&task->pi_lock, flags);


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 04/14] sched: SCHED_DEADLINE SMP-related data structures & logic.
  2013-11-20 18:51   ` Steven Rostedt
@ 2013-11-21 14:13     ` Juri Lelli
  2013-11-21 14:41       ` Steven Rostedt
  2013-11-21 16:08       ` Paul E. McKenney
  0 siblings, 2 replies; 81+ messages in thread
From: Juri Lelli @ 2013-11-21 14:13 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield

On 11/20/2013 07:51 PM, Steven Rostedt wrote:
> On Thu,  7 Nov 2013 14:43:38 +0100
> Juri Lelli <juri.lelli@gmail.com> wrote:
> 
> 
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index cb93f2e..18a73b4 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -10,6 +10,7 @@
>>   * miss some of their deadlines), and won't affect any other task.
>>   *
>>   * Copyright (C) 2012 Dario Faggioli <raistlin@linux.it>,
>> + *                    Juri Lelli <juri.lelli@gmail.com>,
>>   *                    Michael Trimarchi <michael@amarulasolutions.com>,
>>   *                    Fabio Checconi <fchecconi@gmail.com>
>>   */
>> @@ -20,6 +21,15 @@ static inline int dl_time_before(u64 a, u64 b)
>>  	return (s64)(a - b) < 0;
>>  }
>>  
>> +/*
>> + * Tells if entity @a should preempt entity @b.
>> + */
>> +static inline
>> +int dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
>> +{
>> +	return dl_time_before(a->deadline, b->deadline);
>> +}
>> +
>>  static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
>>  {
>>  	return container_of(dl_se, struct task_struct, dl);
>> @@ -53,8 +63,168 @@ static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
>>  void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
>>  {
>>  	dl_rq->rb_root = RB_ROOT;
>> +
>> +#ifdef CONFIG_SMP
>> +	/* zero means no -deadline tasks */
> 
> I'm curious to why you add the '-' to -deadline.
> 

I guess "deadline tasks" is too much generic; "SCHED_DEADLINE tasks" too long;
"DL" or "-dl" can be associated to "download". Nothing special in the end, just
tought it was a reasonable abbreviation.

> 
>> +	dl_rq->earliest_dl.curr = dl_rq->earliest_dl.next = 0;
>> +
>> +	dl_rq->dl_nr_migratory = 0;
>> +	dl_rq->overloaded = 0;
>> +	dl_rq->pushable_dl_tasks_root = RB_ROOT;
>> +#endif
>> +}
>> +
>> +#ifdef CONFIG_SMP
>> +
>> +static inline int dl_overloaded(struct rq *rq)
>> +{
>> +	return atomic_read(&rq->rd->dlo_count);
>> +}
>> +
>> +static inline void dl_set_overload(struct rq *rq)
>> +{
>> +	if (!rq->online)
>> +		return;
>> +
>> +	cpumask_set_cpu(rq->cpu, rq->rd->dlo_mask);
>> +	/*
>> +	 * Must be visible before the overload count is
>> +	 * set (as in sched_rt.c).
>> +	 *
>> +	 * Matched by the barrier in pull_dl_task().
>> +	 */
>> +	smp_wmb();
>> +	atomic_inc(&rq->rd->dlo_count);
>> +}
>> +
>> +static inline void dl_clear_overload(struct rq *rq)
>> +{
>> +	if (!rq->online)
>> +		return;
>> +
>> +	atomic_dec(&rq->rd->dlo_count);
>> +	cpumask_clear_cpu(rq->cpu, rq->rd->dlo_mask);
>> +}
>> +
>> +static void update_dl_migration(struct dl_rq *dl_rq)
>> +{
>> +	if (dl_rq->dl_nr_migratory && dl_rq->dl_nr_total > 1) {
>> +		if (!dl_rq->overloaded) {
>> +			dl_set_overload(rq_of_dl_rq(dl_rq));
>> +			dl_rq->overloaded = 1;
>> +		}
>> +	} else if (dl_rq->overloaded) {
>> +		dl_clear_overload(rq_of_dl_rq(dl_rq));
>> +		dl_rq->overloaded = 0;
>> +	}
>> +}
>> +
>> +static void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>> +{
>> +	struct task_struct *p = dl_task_of(dl_se);
>> +	dl_rq = &rq_of_dl_rq(dl_rq)->dl;
>> +
>> +	dl_rq->dl_nr_total++;
>> +	if (p->nr_cpus_allowed > 1)
>> +		dl_rq->dl_nr_migratory++;
>> +
>> +	update_dl_migration(dl_rq);
>> +}
>> +
>> +static void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>> +{
>> +	struct task_struct *p = dl_task_of(dl_se);
>> +	dl_rq = &rq_of_dl_rq(dl_rq)->dl;
>> +
>> +	dl_rq->dl_nr_total--;
>> +	if (p->nr_cpus_allowed > 1)
>> +		dl_rq->dl_nr_migratory--;
>> +
>> +	update_dl_migration(dl_rq);
>> +}
>> +
>> +/*
>> + * The list of pushable -deadline task is not a plist, like in
>> + * sched_rt.c, it is an rb-tree with tasks ordered by deadline.
>> + */
>> +static void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
>> +{
>> +	struct dl_rq *dl_rq = &rq->dl;
>> +	struct rb_node **link = &dl_rq->pushable_dl_tasks_root.rb_node;
>> +	struct rb_node *parent = NULL;
>> +	struct task_struct *entry;
>> +	int leftmost = 1;
>> +
>> +	BUG_ON(!RB_EMPTY_NODE(&p->pushable_dl_tasks));
>> +
>> +	while (*link) {
>> +		parent = *link;
>> +		entry = rb_entry(parent, struct task_struct,
>> +				 pushable_dl_tasks);
>> +		if (dl_entity_preempt(&p->dl, &entry->dl))
>> +			link = &parent->rb_left;
>> +		else {
>> +			link = &parent->rb_right;
>> +			leftmost = 0;
>> +		}
>> +	}
>> +
>> +	if (leftmost)
>> +		dl_rq->pushable_dl_tasks_leftmost = &p->pushable_dl_tasks;
>> +
>> +	rb_link_node(&p->pushable_dl_tasks, parent, link);
>> +	rb_insert_color(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root);
>> +}
>> +
>> +static void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
>> +{
>> +	struct dl_rq *dl_rq = &rq->dl;
>> +
>> +	if (RB_EMPTY_NODE(&p->pushable_dl_tasks))
>> +		return;
>> +
>> +	if (dl_rq->pushable_dl_tasks_leftmost == &p->pushable_dl_tasks) {
>> +		struct rb_node *next_node;
>> +
>> +		next_node = rb_next(&p->pushable_dl_tasks);
>> +		dl_rq->pushable_dl_tasks_leftmost = next_node;
>> +	}
>> +
>> +	rb_erase(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root);
>> +	RB_CLEAR_NODE(&p->pushable_dl_tasks);
>> +}
>> +
>> +static inline int has_pushable_dl_tasks(struct rq *rq)
>> +{
>> +	return !RB_EMPTY_ROOT(&rq->dl.pushable_dl_tasks_root);
>> +}
>> +
>> +static int push_dl_task(struct rq *rq);
>> +
>> +#else
>> +
>> +static inline
>> +void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
>> +{
>> +}
>> +
>> +static inline
>> +void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
>> +{
>> +}
>> +
>> +static inline
>> +void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>> +{
>> +}
>> +
>> +static inline
>> +void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>> +{
>>  }
>>  
>> +#endif /* CONFIG_SMP */
>> +
>>  static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
>>  static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
>>  static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
>> @@ -307,6 +477,14 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
>>  			check_preempt_curr_dl(rq, p, 0);
>>  		else
>>  			resched_task(rq->curr);
>> +#ifdef CONFIG_SMP
>> +		/*
>> +		 * Queueing this task back might have overloaded rq,
>> +		 * check if we need to kick someone away.
>> +		 */
>> +		if (has_pushable_dl_tasks(rq))
>> +			push_dl_task(rq);
>> +#endif
>>  	}
>>  unlock:
>>  	raw_spin_unlock(&rq->lock);
>> @@ -397,6 +575,100 @@ static void update_curr_dl(struct rq *rq)
>>  	}
>>  }
>>  
>> +#ifdef CONFIG_SMP
>> +
>> +static struct task_struct *pick_next_earliest_dl_task(struct rq *rq, int cpu);
>> +
>> +static inline u64 next_deadline(struct rq *rq)
>> +{
>> +	struct task_struct *next = pick_next_earliest_dl_task(rq, rq->cpu);
>> +
>> +	if (next && dl_prio(next->prio))
>> +		return next->dl.deadline;
>> +	else
>> +		return 0;
>> +}
>> +
>> +static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
>> +{
>> +	struct rq *rq = rq_of_dl_rq(dl_rq);
>> +
>> +	if (dl_rq->earliest_dl.curr == 0 ||
>> +	    dl_time_before(deadline, dl_rq->earliest_dl.curr)) {
>> +		/*
>> +		 * If the dl_rq had no -deadline tasks, or if the new task
>> +		 * has shorter deadline than the current one on dl_rq, we
>> +		 * know that the previous earliest becomes our next earliest,
>> +		 * as the new task becomes the earliest itself.
>> +		 */
>> +		dl_rq->earliest_dl.next = dl_rq->earliest_dl.curr;
>> +		dl_rq->earliest_dl.curr = deadline;
>> +	} else if (dl_rq->earliest_dl.next == 0 ||
>> +		   dl_time_before(deadline, dl_rq->earliest_dl.next)) {
>> +		/*
>> +		 * On the other hand, if the new -deadline task has a
>> +		 * a later deadline than the earliest one on dl_rq, but
>> +		 * it is earlier than the next (if any), we must
>> +		 * recompute the next-earliest.
>> +		 */
>> +		dl_rq->earliest_dl.next = next_deadline(rq);
>> +	}
>> +}
>> +
>> +static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
>> +{
>> +	struct rq *rq = rq_of_dl_rq(dl_rq);
>> +
>> +	/*
>> +	 * Since we may have removed our earliest (and/or next earliest)
>> +	 * task we must recompute them.
>> +	 */
>> +	if (!dl_rq->dl_nr_running) {
>> +		dl_rq->earliest_dl.curr = 0;
>> +		dl_rq->earliest_dl.next = 0;
>> +	} else {
>> +		struct rb_node *leftmost = dl_rq->rb_leftmost;
>> +		struct sched_dl_entity *entry;
>> +
>> +		entry = rb_entry(leftmost, struct sched_dl_entity, rb_node);
>> +		dl_rq->earliest_dl.curr = entry->deadline;
>> +		dl_rq->earliest_dl.next = next_deadline(rq);
>> +	}
>> +}
>> +
>> +#else
>> +
>> +static inline void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
>> +static inline void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
>> +
>> +#endif /* CONFIG_SMP */
>> +
>> +static inline
>> +void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>> +{
>> +	int prio = dl_task_of(dl_se)->prio;
>> +	u64 deadline = dl_se->deadline;
>> +
>> +	WARN_ON(!dl_prio(prio));
>> +	dl_rq->dl_nr_running++;
>> +
>> +	inc_dl_deadline(dl_rq, deadline);
>> +	inc_dl_migration(dl_se, dl_rq);
>> +}
>> +
>> +static inline
>> +void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>> +{
>> +	int prio = dl_task_of(dl_se)->prio;
>> +
>> +	WARN_ON(!dl_prio(prio));
>> +	WARN_ON(!dl_rq->dl_nr_running);
>> +	dl_rq->dl_nr_running--;
>> +
>> +	dec_dl_deadline(dl_rq, dl_se->deadline);
>> +	dec_dl_migration(dl_se, dl_rq);
>> +}
>> +
>>  static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
>>  {
>>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>> @@ -424,7 +696,7 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
>>  	rb_link_node(&dl_se->rb_node, parent, link);
>>  	rb_insert_color(&dl_se->rb_node, &dl_rq->rb_root);
>>  
>> -	dl_rq->dl_nr_running++;
>> +	inc_dl_tasks(dl_se, dl_rq);
>>  }
>>  
>>  static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
>> @@ -444,7 +716,7 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
>>  	rb_erase(&dl_se->rb_node, &dl_rq->rb_root);
>>  	RB_CLEAR_NODE(&dl_se->rb_node);
>>  
>> -	dl_rq->dl_nr_running--;
>> +	dec_dl_tasks(dl_se, dl_rq);
>>  }
>>  
>>  static void
>> @@ -482,12 +754,17 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
>>  		return;
>>  
>>  	enqueue_dl_entity(&p->dl, flags);
>> +
>> +	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
>> +		enqueue_pushable_dl_task(rq, p);
>> +
>>  	inc_nr_running(rq);
>>  }
>>  
>>  static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
>>  {
>>  	dequeue_dl_entity(&p->dl);
>> +	dequeue_pushable_dl_task(rq, p);
>>  }
>>  
>>  static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
>> @@ -525,6 +802,77 @@ static void yield_task_dl(struct rq *rq)
>>  	update_curr_dl(rq);
>>  }
>>  
>> +#ifdef CONFIG_SMP
>> +
>> +static int find_later_rq(struct task_struct *task);
>> +static int latest_cpu_find(struct cpumask *span,
>> +			   struct task_struct *task,
>> +			   struct cpumask *later_mask);
>> +
>> +static int
>> +select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
>> +{
>> +	struct task_struct *curr;
>> +	struct rq *rq;
>> +	int cpu;
>> +
>> +	cpu = task_cpu(p);
>> +
>> +	if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
>> +		goto out;
>> +
>> +	rq = cpu_rq(cpu);
>> +
>> +	rcu_read_lock();
>> +	curr = ACCESS_ONCE(rq->curr); /* unlocked access */
>> +
>> +	/*
>> +	 * If we are dealing with a -deadline task, we must
>> +	 * decide where to wake it up.
>> +	 * If it has a later deadline and the current task
>> +	 * on this rq can't move (provided the waking task
>> +	 * can!) we prefer to send it somewhere else. On the
>> +	 * other hand, if it has a shorter deadline, we
>> +	 * try to make it stay here, it might be important.
>> +	 */
>> +	if (unlikely(dl_task(curr)) &&
>> +	    (curr->nr_cpus_allowed < 2 ||
>> +	     !dl_entity_preempt(&p->dl, &curr->dl)) &&
>> +	    (p->nr_cpus_allowed > 1)) {
>> +		int target = find_later_rq(p);
>> +
>> +		if (target != -1)
>> +			cpu = target;
>> +	}
>> +	rcu_read_unlock();
>> +
>> +out:
>> +	return cpu;
>> +}
>> +
>> +static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
>> +{
>> +	/*
>> +	 * Current can't be migrated, useless to reschedule,
>> +	 * let's hope p can move out.
>> +	 */
>> +	if (rq->curr->nr_cpus_allowed == 1 ||
>> +	    latest_cpu_find(rq->rd->span, rq->curr, NULL) == -1)
>> +		return;
>> +
>> +	/*
>> +	 * p is migratable, so let's not schedule it and
>> +	 * see if it is pushed or pulled somewhere else.
>> +	 */
>> +	if (p->nr_cpus_allowed != 1 &&
>> +	    latest_cpu_find(rq->rd->span, p, NULL) != -1)
>> +		return;
>> +
>> +	resched_task(rq->curr);
>> +}
>> +
>> +#endif /* CONFIG_SMP */
>> +
>>  /*
>>   * Only called when both the current and waking task are -deadline
>>   * tasks.
>> @@ -532,8 +880,20 @@ static void yield_task_dl(struct rq *rq)
>>  static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
>>  				  int flags)
>>  {
>> -	if (dl_time_before(p->dl.deadline, rq->curr->dl.deadline))
>> +	if (dl_entity_preempt(&p->dl, &rq->curr->dl)) {
>>  		resched_task(rq->curr);
>> +		return;
>> +	}
>> +
>> +#ifdef CONFIG_SMP
>> +	/*
>> +	 * In the unlikely case current and p have the same deadline
>> +	 * let us try to decide what's the best thing to do...
>> +	 */
>> +	if ((s64)(p->dl.deadline - rq->curr->dl.deadline) == 0 &&
>> +	    !need_resched())
>> +		check_preempt_equal_dl(rq, p);
>> +#endif /* CONFIG_SMP */
>>  }
>>  
>>  #ifdef CONFIG_SCHED_HRTICK
>> @@ -573,16 +933,29 @@ struct task_struct *pick_next_task_dl(struct rq *rq)
>>  
>>  	p = dl_task_of(dl_se);
>>  	p->se.exec_start = rq_clock_task(rq);
>> +
>> +	/* Running task will never be pushed. */
>> +	if (p)
>> +		dequeue_pushable_dl_task(rq, p);
>> +
>>  #ifdef CONFIG_SCHED_HRTICK
>>  	if (hrtick_enabled(rq))
>>  		start_hrtick_dl(rq, p);
>>  #endif
>> +
>> +#ifdef CONFIG_SMP
>> +	rq->post_schedule = has_pushable_dl_tasks(rq);
>> +#endif /* CONFIG_SMP */
>> +
>>  	return p;
>>  }
>>  
>>  static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
>>  {
>>  	update_curr_dl(rq);
>> +
>> +	if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
>> +		enqueue_pushable_dl_task(rq, p);
>>  }
>>  
>>  static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
>> @@ -616,16 +989,517 @@ static void set_curr_task_dl(struct rq *rq)
>>  	struct task_struct *p = rq->curr;
>>  
>>  	p->se.exec_start = rq_clock_task(rq);
>> +
>> +	/* You can't push away the running task */
>> +	dequeue_pushable_dl_task(rq, p);
>> +}
>> +
>> +#ifdef CONFIG_SMP
>> +
>> +/* Only try algorithms three times */
>> +#define DL_MAX_TRIES 3
>> +
>> +static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu)
>> +{
>> +	if (!task_running(rq, p) &&
>> +	    (cpu < 0 || cpumask_test_cpu(cpu, &p->cpus_allowed)) &&
>> +	    (p->nr_cpus_allowed > 1))
>> +		return 1;
>> +
>> +	return 0;
>> +}
>> +
>> +/* Returns the second earliest -deadline task, NULL otherwise */
>> +static struct task_struct *pick_next_earliest_dl_task(struct rq *rq, int cpu)
>> +{
>> +	struct rb_node *next_node = rq->dl.rb_leftmost;
>> +	struct sched_dl_entity *dl_se;
>> +	struct task_struct *p = NULL;
>> +
>> +next_node:
>> +	next_node = rb_next(next_node);
>> +	if (next_node) {
>> +		dl_se = rb_entry(next_node, struct sched_dl_entity, rb_node);
>> +		p = dl_task_of(dl_se);
>> +
>> +		if (pick_dl_task(rq, p, cpu))
>> +			return p;
>> +
>> +		goto next_node;
>> +	}
>> +
>> +	return NULL;
>> +}
>> +
>> +static int latest_cpu_find(struct cpumask *span,
>> +			   struct task_struct *task,
>> +			   struct cpumask *later_mask)
>> +{
>> +	const struct sched_dl_entity *dl_se = &task->dl;
>> +	int cpu, found = -1, best = 0;
>> +	u64 max_dl = 0;
>> +
>> +	for_each_cpu(cpu, span) {
>> +		struct rq *rq = cpu_rq(cpu);
>> +		struct dl_rq *dl_rq = &rq->dl;
>> +
>> +		if (cpumask_test_cpu(cpu, &task->cpus_allowed) &&
>> +		    (!dl_rq->dl_nr_running || dl_time_before(dl_se->deadline,
>> +		     dl_rq->earliest_dl.curr))) {
>> +			if (later_mask)
>> +				cpumask_set_cpu(cpu, later_mask);
>> +			if (!best && !dl_rq->dl_nr_running) {
>> +				best = 1;
>> +				found = cpu;
>> +			} else if (!best &&
>> +				   dl_time_before(max_dl,
>> +						  dl_rq->earliest_dl.curr)) {
> 
> Ug, the above is hard to read. What about:
> 
> 	if (!best) {
> 		if (!dl_rq->dl_nr_running) {
> 			best = 1;
> 			found = cpu;
> 		} elsif (dl_time_before(...)) {
> 			...
> 		}
> 	}
> 

This is completely removed in 13/14. I don't like it either, but since we end
up removing this mess, do you think we still have to fix this here?

> Also, I would think dl should be nice to rt as well. There may be a
> idle CPU or a non rt task, and this could pick a CPU running an RT
> task. Worse yet, that RT task may be pinned to that CPU.
> 

Well, in 13/14 we introduce a free_cpus mask. A CPU is considered free if it
doesn't have any -deadline task running. We can modify that excluding also CPUs
running RT tasks, but I have to think a bit if we can do this also from -rt code.

> We should be able to incorporate cpuprio_find() to be dl aware too.
> That is, work for both -rt and -dl.
>

Like checking if a -dl task is running on the cpu chosen for pushing an -rt
task, and continue searching in that case.

> 
> 
>> +				max_dl = dl_rq->earliest_dl.curr;
>> +				found = cpu;
>> +			}
>> +		} else if (later_mask)
>> +			cpumask_clear_cpu(cpu, later_mask);
>> +	}
>> +
>> +	return found;
>> +}
>> +
>> +static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
>> +
>> +static int find_later_rq(struct task_struct *task)
>> +{
>> +	struct sched_domain *sd;
>> +	struct cpumask *later_mask = __get_cpu_var(local_cpu_mask_dl);
>> +	int this_cpu = smp_processor_id();
>> +	int best_cpu, cpu = task_cpu(task);
>> +
>> +	/* Make sure the mask is initialized first */
>> +	if (unlikely(!later_mask))
>> +		return -1;
>> +
>> +	if (task->nr_cpus_allowed == 1)
>> +		return -1;
>> +
>> +	best_cpu = latest_cpu_find(task_rq(task)->rd->span, task, later_mask);
>> +	if (best_cpu == -1)
>> +		return -1;
>> +
>> +	/*
>> +	 * If we are here, some target has been found,
>> +	 * the most suitable of which is cached in best_cpu.
>> +	 * This is, among the runqueues where the current tasks
>> +	 * have later deadlines than the task's one, the rq
>> +	 * with the latest possible one.
>> +	 *
>> +	 * Now we check how well this matches with task's
>> +	 * affinity and system topology.
>> +	 *
>> +	 * The last cpu where the task run is our first
>> +	 * guess, since it is most likely cache-hot there.
>> +	 */
>> +	if (cpumask_test_cpu(cpu, later_mask))
>> +		return cpu;
>> +	/*
>> +	 * Check if this_cpu is to be skipped (i.e., it is
>> +	 * not in the mask) or not.
>> +	 */
>> +	if (!cpumask_test_cpu(this_cpu, later_mask))
>> +		this_cpu = -1;
>> +
>> +	rcu_read_lock();
>> +	for_each_domain(cpu, sd) {
>> +		if (sd->flags & SD_WAKE_AFFINE) {
>> +
>> +			/*
>> +			 * If possible, preempting this_cpu is
>> +			 * cheaper than migrating.
>> +			 */
>> +			if (this_cpu != -1 &&
>> +			    cpumask_test_cpu(this_cpu, sched_domain_span(sd))) {
>> +				rcu_read_unlock();
>> +				return this_cpu;
>> +			}
>> +
>> +			/*
>> +			 * Last chance: if best_cpu is valid and is
>> +			 * in the mask, that becomes our choice.
>> +			 */
>> +			if (best_cpu < nr_cpu_ids &&
>> +			    cpumask_test_cpu(best_cpu, sched_domain_span(sd))) {
>> +				rcu_read_unlock();
>> +				return best_cpu;
>> +			}
>> +		}
>> +	}
>> +	rcu_read_unlock();
>> +
>> +	/*
>> +	 * At this point, all our guesses failed, we just return
>> +	 * 'something', and let the caller sort the things out.
>> +	 */
>> +	if (this_cpu != -1)
>> +		return this_cpu;
>> +
>> +	cpu = cpumask_any(later_mask);
>> +	if (cpu < nr_cpu_ids)
>> +		return cpu;
>> +
>> +	return -1;
>> +}
>> +
>> +/* Locks the rq it finds */
>> +static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
>> +{
>> +	struct rq *later_rq = NULL;
>> +	int tries;
>> +	int cpu;
>> +
>> +	for (tries = 0; tries < DL_MAX_TRIES; tries++) {
>> +		cpu = find_later_rq(task);
>> +
>> +		if ((cpu == -1) || (cpu == rq->cpu))
>> +			break;
>> +
>> +		later_rq = cpu_rq(cpu);
>> +
>> +		/* Retry if something changed. */
>> +		if (double_lock_balance(rq, later_rq)) {
>> +			if (unlikely(task_rq(task) != rq ||
>> +				     !cpumask_test_cpu(later_rq->cpu,
>> +				                       &task->cpus_allowed) ||
>> +				     task_running(rq, task) || !task->on_rq)) {
>> +				double_unlock_balance(rq, later_rq);
>> +				later_rq = NULL;
>> +				break;
>> +			}
>> +		}
>> +
>> +		/*
>> +		 * If the rq we found has no -deadline task, or
>> +		 * its earliest one has a later deadline than our
>> +		 * task, the rq is a good one.
>> +		 */
>> +		if (!later_rq->dl.dl_nr_running ||
>> +		    dl_time_before(task->dl.deadline,
>> +				   later_rq->dl.earliest_dl.curr))
>> +			break;
>> +
>> +		/* Otherwise we try again. */
>> +		double_unlock_balance(rq, later_rq);
>> +		later_rq = NULL;
>> +	}
>> +
>> +	return later_rq;
>>  }
>>  
>> +static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
>> +{
>> +	struct task_struct *p;
>> +
>> +	if (!has_pushable_dl_tasks(rq))
>> +		return NULL;
>> +
>> +	p = rb_entry(rq->dl.pushable_dl_tasks_leftmost,
>> +		     struct task_struct, pushable_dl_tasks);
>> +
>> +	BUG_ON(rq->cpu != task_cpu(p));
>> +	BUG_ON(task_current(rq, p));
>> +	BUG_ON(p->nr_cpus_allowed <= 1);
>> +
>> +	BUG_ON(!p->se.on_rq);
>> +	BUG_ON(!dl_task(p));
>> +
>> +	return p;
>> +}
>> +
>> +/*
>> + * See if the non running -deadline tasks on this rq
>> + * can be sent to some other CPU where they can preempt
>> + * and start executing.
>> + */
>> +static int push_dl_task(struct rq *rq)
>> +{
>> +	struct task_struct *next_task;
>> +	struct rq *later_rq;
>> +
>> +	if (!rq->dl.overloaded)
>> +		return 0;
>> +
>> +	next_task = pick_next_pushable_dl_task(rq);
>> +	if (!next_task)
>> +		return 0;
>> +
>> +retry:
>> +	if (unlikely(next_task == rq->curr)) {
>> +		WARN_ON(1);
>> +		return 0;
>> +	}
>> +
>> +	/*
>> +	 * If next_task preempts rq->curr, and rq->curr
>> +	 * can move away, it makes sense to just reschedule
>> +	 * without going further in pushing next_task.
>> +	 */
>> +	if (dl_task(rq->curr) &&
>> +	    dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) &&
>> +	    rq->curr->nr_cpus_allowed > 1) {
>> +		resched_task(rq->curr);
>> +		return 0;
>> +	}
>> +
>> +	/* We might release rq lock */
>> +	get_task_struct(next_task);
>> +
>> +	/* Will lock the rq it'll find */
>> +	later_rq = find_lock_later_rq(next_task, rq);
>> +	if (!later_rq) {
>> +		struct task_struct *task;
>> +
>> +		/*
>> +		 * We must check all this again, since
>> +		 * find_lock_later_rq releases rq->lock and it is
>> +		 * then possible that next_task has migrated.
>> +		 */
>> +		task = pick_next_pushable_dl_task(rq);
>> +		if (task_cpu(next_task) == rq->cpu && task == next_task) {
>> +			/*
>> +			 * The task is still there. We don't try
>> +			 * again, some other cpu will pull it when ready.
>> +			 */
>> +			dequeue_pushable_dl_task(rq, next_task);
>> +			goto out;
>> +		}
>> +
>> +		if (!task)
>> +			/* No more tasks */
>> +			goto out;
>> +
>> +		put_task_struct(next_task);
>> +		next_task = task;
>> +		goto retry;
>> +	}
>> +
>> +	deactivate_task(rq, next_task, 0);
>> +	set_task_cpu(next_task, later_rq->cpu);
>> +	activate_task(later_rq, next_task, 0);
>> +
>> +	resched_task(later_rq->curr);
>> +
>> +	double_unlock_balance(rq, later_rq);
>> +
>> +out:
>> +	put_task_struct(next_task);
>> +
>> +	return 1;
>> +}
>> +
>> +static void push_dl_tasks(struct rq *rq)
>> +{
>> +	/* Terminates as it moves a -deadline task */
>> +	while (push_dl_task(rq))
>> +		;
>> +}
>> +
>> +static int pull_dl_task(struct rq *this_rq)
>> +{
>> +	int this_cpu = this_rq->cpu, ret = 0, cpu;
>> +	struct task_struct *p;
>> +	struct rq *src_rq;
>> +	u64 dmin = LONG_MAX;
>> +
>> +	if (likely(!dl_overloaded(this_rq)))
>> +		return 0;
>> +
>> +	/*
>> +	 * Match the barrier from dl_set_overloaded; this guarantees that if we
>> +	 * see overloaded we must also see the dlo_mask bit.
>> +	 */
>> +	smp_rmb();
>> +
>> +	for_each_cpu(cpu, this_rq->rd->dlo_mask) {
>> +		if (this_cpu == cpu)
>> +			continue;
>> +
>> +		src_rq = cpu_rq(cpu);
>> +
>> +		/*
>> +		 * It looks racy, abd it is! However, as in sched_rt.c,
> 
> abd it is?
> 

Oops!

Thanks,

- Juri

> 
>> +		 * we are fine with this.
>> +		 */
>> +		if (this_rq->dl.dl_nr_running &&
>> +		    dl_time_before(this_rq->dl.earliest_dl.curr,
>> +				   src_rq->dl.earliest_dl.next))
>> +			continue;
>> +
>> +		/* Might drop this_rq->lock */
>> +		double_lock_balance(this_rq, src_rq);
>> +
>> +		/*
>> +		 * If there are no more pullable tasks on the
>> +		 * rq, we're done with it.
>> +		 */
>> +		if (src_rq->dl.dl_nr_running <= 1)
>> +			goto skip;
>> +
>> +		p = pick_next_earliest_dl_task(src_rq, this_cpu);
>> +
>> +		/*
>> +		 * We found a task to be pulled if:
>> +		 *  - it preempts our current (if there's one),
>> +		 *  - it will preempt the last one we pulled (if any).
>> +		 */
>> +		if (p && dl_time_before(p->dl.deadline, dmin) &&
>> +		    (!this_rq->dl.dl_nr_running ||
>> +		     dl_time_before(p->dl.deadline,
>> +				    this_rq->dl.earliest_dl.curr))) {
>> +			WARN_ON(p == src_rq->curr);
>> +			WARN_ON(!p->se.on_rq);
>> +
>> +			/*
>> +			 * Then we pull iff p has actually an earlier
>> +			 * deadline than the current task of its runqueue.
>> +			 */
>> +			if (dl_time_before(p->dl.deadline,
>> +					   src_rq->curr->dl.deadline))
>> +				goto skip;
>> +
>> +			ret = 1;
>> +
>> +			deactivate_task(src_rq, p, 0);
>> +			set_task_cpu(p, this_cpu);
>> +			activate_task(this_rq, p, 0);
>> +			dmin = p->dl.deadline;
>> +
>> +			/* Is there any other task even earlier? */
>> +		}
>> +skip:
>> +		double_unlock_balance(this_rq, src_rq);
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +static void pre_schedule_dl(struct rq *rq, struct task_struct *prev)
>> +{
>> +	/* Try to pull other tasks here */
>> +	if (dl_task(prev))
>> +		pull_dl_task(rq);
>> +}
>> +
>> +static void post_schedule_dl(struct rq *rq)
>> +{
>> +	push_dl_tasks(rq);
>> +}
>> +
>> +/*
>> + * Since the task is not running and a reschedule is not going to happen
>> + * anytime soon on its runqueue, we try pushing it away now.
>> + */
>> +static void task_woken_dl(struct rq *rq, struct task_struct *p)
>> +{
>> +	if (!task_running(rq, p) &&
>> +	    !test_tsk_need_resched(rq->curr) &&
>> +	    has_pushable_dl_tasks(rq) &&
>> +	    p->nr_cpus_allowed > 1 &&
>> +	    dl_task(rq->curr) &&
>> +	    (rq->curr->nr_cpus_allowed < 2 ||
>> +	     dl_entity_preempt(&rq->curr->dl, &p->dl))) {
>> +		push_dl_tasks(rq);
>> +	}
>> +}
>> +
>> +static void set_cpus_allowed_dl(struct task_struct *p,
>> +				const struct cpumask *new_mask)
>> +{
>> +	struct rq *rq;
>> +	int weight;
>> +
>> +	BUG_ON(!dl_task(p));
>> +
>> +	/*
>> +	 * Update only if the task is actually running (i.e.,
>> +	 * it is on the rq AND it is not throttled).
>> +	 */
>> +	if (!on_dl_rq(&p->dl))
>> +		return;
>> +
>> +	weight = cpumask_weight(new_mask);
>> +
>> +	/*
>> +	 * Only update if the process changes its state from whether it
>> +	 * can migrate or not.
>> +	 */
>> +	if ((p->nr_cpus_allowed > 1) == (weight > 1))
>> +		return;
>> +
>> +	rq = task_rq(p);
>> +
>> +	/*
>> +	 * The process used to be able to migrate OR it can now migrate
>> +	 */
>> +	if (weight <= 1) {
>> +		if (!task_current(rq, p))
>> +			dequeue_pushable_dl_task(rq, p);
>> +		BUG_ON(!rq->dl.dl_nr_migratory);
>> +		rq->dl.dl_nr_migratory--;
>> +	} else {
>> +		if (!task_current(rq, p))
>> +			enqueue_pushable_dl_task(rq, p);
>> +		rq->dl.dl_nr_migratory++;
>> +	}
>> +	
>> +	update_dl_migration(&rq->dl);
>> +}
>> +
>> +/* Assumes rq->lock is held */
>> +static void rq_online_dl(struct rq *rq)
>> +{
>> +	if (rq->dl.overloaded)
>> +		dl_set_overload(rq);
>> +}
>> +
>> +/* Assumes rq->lock is held */
>> +static void rq_offline_dl(struct rq *rq)
>> +{
>> +	if (rq->dl.overloaded)
>> +		dl_clear_overload(rq);
>> +}
>> +
>> +void init_sched_dl_class(void)
>> +{
>> +	unsigned int i;
>> +
>> +	for_each_possible_cpu(i)
>> +		zalloc_cpumask_var_node(&per_cpu(local_cpu_mask_dl, i),
>> +					GFP_KERNEL, cpu_to_node(i));
>> +}
>> +
>> +#endif /* CONFIG_SMP */
>> +
>>  static void switched_from_dl(struct rq *rq, struct task_struct *p)
>>  {
>> -	if (hrtimer_active(&p->dl.dl_timer))
>> +	if (hrtimer_active(&p->dl.dl_timer) && !dl_policy(p->policy))
>>  		hrtimer_try_to_cancel(&p->dl.dl_timer);
>> +
>> +#ifdef CONFIG_SMP
>> +	/*
>> +	 * Since this might be the only -deadline task on the rq,
>> +	 * this is the right place to try to pull some other one
>> +	 * from an overloaded cpu, if any.
>> +	 */
>> +	if (!rq->dl.dl_nr_running)
>> +		pull_dl_task(rq);
>> +#endif
>>  }
>>  
>> +/*
>> + * When switching to -deadline, we may overload the rq, then
>> + * we try to push someone off, if possible.
>> + */
>>  static void switched_to_dl(struct rq *rq, struct task_struct *p)
>>  {
>> +	int check_resched = 1;
>> +
>>  	/*
>>  	 * If p is throttled, don't consider the possibility
>>  	 * of preempting rq->curr, the check will be done right
>> @@ -635,26 +1509,53 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
>>  		return;
>>  
>>  	if (!p->on_rq || rq->curr != p) {
>> -		if (task_has_dl_policy(rq->curr))
>> +#ifdef CONFIG_SMP
>> +		if (rq->dl.overloaded && push_dl_task(rq) && rq != task_rq(p))
>> +			/* Only reschedule if pushing failed */
>> +			check_resched = 0;
>> +#endif /* CONFIG_SMP */
>> +		if (check_resched && task_has_dl_policy(rq->curr))
>>  			check_preempt_curr_dl(rq, p, 0);
>> -		else
>> -			resched_task(rq->curr);
>>  	}
>>  }
>>  
>> +/*
>> + * If the scheduling parameters of a -deadline task changed,
>> + * a push or pull operation might be needed.
>> + */
>>  static void prio_changed_dl(struct rq *rq, struct task_struct *p,
>>  			    int oldprio)
>>  {
>> -	switched_to_dl(rq, p);
>> -}
>> -
>> +	if (p->on_rq || rq->curr == p) {
>>  #ifdef CONFIG_SMP
>> -static int
>> -select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
>> -{
>> -	return task_cpu(p);
>> +		/*
>> +		 * This might be too much, but unfortunately
>> +		 * we don't have the old deadline value, and
>> +		 * we can't argue if the task is increasing
>> +		 * or lowering its prio, so...
>> +		 */
>> +		if (!rq->dl.overloaded)
>> +			pull_dl_task(rq);
>> +
>> +		/*
>> +		 * If we now have a earlier deadline task than p,
>> +		 * then reschedule, provided p is still on this
>> +		 * runqueue.
>> +		 */
>> +		if (dl_time_before(rq->dl.earliest_dl.curr, p->dl.deadline) &&
>> +		    rq->curr == p)
>> +			resched_task(p);
>> +#else
>> +		/*
>> +		 * Again, we don't know if p has a earlier
>> +		 * or later deadline, so let's blindly set a
>> +		 * (maybe not needed) rescheduling point.
>> +		 */
>> +		resched_task(p);
>> +#endif /* CONFIG_SMP */
>> +	} else
>> +		switched_to_dl(rq, p);
>>  }
>> -#endif
>>  
>>  const struct sched_class dl_sched_class = {
>>  	.next			= &rt_sched_class,
>> @@ -669,6 +1570,12 @@ const struct sched_class dl_sched_class = {
>>  
>>  #ifdef CONFIG_SMP
>>  	.select_task_rq		= select_task_rq_dl,
>> +	.set_cpus_allowed       = set_cpus_allowed_dl,
>> +	.rq_online              = rq_online_dl,
>> +	.rq_offline             = rq_offline_dl,
>> +	.pre_schedule		= pre_schedule_dl,
>> +	.post_schedule		= post_schedule_dl,
>> +	.task_woken		= task_woken_dl,
>>  #endif
>>  
>>  	.set_curr_task		= set_curr_task_dl,
>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> index 01970c8..f7c4881 100644
>> --- a/kernel/sched/rt.c
>> +++ b/kernel/sched/rt.c
>> @@ -1720,7 +1720,7 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
>>  	    !test_tsk_need_resched(rq->curr) &&
>>  	    has_pushable_tasks(rq) &&
>>  	    p->nr_cpus_allowed > 1 &&
>> -	    rt_task(rq->curr) &&
>> +	    (dl_task(rq->curr) || rt_task(rq->curr)) &&
>>  	    (rq->curr->nr_cpus_allowed < 2 ||
>>  	     rq->curr->prio <= p->prio))
>>  		push_rt_tasks(rq);
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index ba97476..70d0030 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -383,6 +383,31 @@ struct dl_rq {
>>  	struct rb_node *rb_leftmost;
>>  
>>  	unsigned long dl_nr_running;
>> +
>> +#ifdef CONFIG_SMP
>> +	/*
>> +	 * Deadline values of the currently executing and the
>> +	 * earliest ready task on this rq. Caching these facilitates
>> +	 * the decision wether or not a ready but not running task
>> +	 * should migrate somewhere else.
>> +	 */
>> +	struct {
>> +		u64 curr;
>> +		u64 next;
>> +	} earliest_dl;
>> +
>> +	unsigned long dl_nr_migratory;
>> +	unsigned long dl_nr_total;
>> +	int overloaded;
>> +
>> +	/*
>> +	 * Tasks on this rq that can be pushed away. They are kept in
>> +	 * an rb-tree, ordered by tasks' deadlines, with caching
>> +	 * of the leftmost (earliest deadline) element.
>> +	 */
>> +	struct rb_root pushable_dl_tasks_root;
>> +	struct rb_node *pushable_dl_tasks_leftmost;
>> +#endif
>>  };
>>  
>>  #ifdef CONFIG_SMP
>> @@ -403,6 +428,13 @@ struct root_domain {
>>  	cpumask_var_t online;
>>  
>>  	/*
>> +	 * The bit corresponding to a CPU gets set here if such CPU has more
>> +	 * than one runnable -deadline task (as it is below for RT tasks).
>> +	 */
>> +	cpumask_var_t dlo_mask;
>> +	atomic_t dlo_count;
>> +
>> +	/*
>>  	 * The "RT overload" flag: it gets set if a CPU has more than
>>  	 * one runnable RT task.
>>  	 */
>> @@ -1063,6 +1095,8 @@ static inline void idle_balance(int cpu, struct rq *rq)
>>  extern void sysrq_sched_debug_show(void);
>>  extern void sched_init_granularity(void);
>>  extern void update_max_interval(void);
>> +
>> +extern void init_sched_dl_class(void);
>>  extern void init_sched_rt_class(void);
>>  extern void init_sched_fair_class(void);
>>  
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 03/14] sched: SCHED_DEADLINE structures & implementation.
  2013-11-20 20:23   ` Steven Rostedt
@ 2013-11-21 14:15     ` Juri Lelli
  0 siblings, 0 replies; 81+ messages in thread
From: Juri Lelli @ 2013-11-21 14:15 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield

On 11/20/2013 09:23 PM, Steven Rostedt wrote:
> On Thu,  7 Nov 2013 14:43:37 +0100
> Juri Lelli <juri.lelli@gmail.com> wrote:
> 
> 
>> +/*
>> + * This function validates the new parameters of a -deadline task.
>> + * We ask for the deadline not being zero, and greater or equal
>> + * than the runtime.
>> + */
>> +static bool
>> +__checkparam_dl(const struct sched_param2 *prm)
>> +{
>> +	return prm && (&prm->sched_deadline) != 0 &&
>> +	       (s64)(&prm->sched_deadline - &prm->sched_runtime) >= 0;
> 
> Patch 6 brought this to my attention. Looks like using the address of
> the fields is wrong. I know patch 6 fixes this, but lets make it
> correct in this patch first.
> 

Fixed.

Thanks,

- Juri

> 
>> +}
>> +
>> +/*
>>   * check the target process has a UID that matches the current process's
>>   */
>>  static bool check_same_owner(struct task_struct *p)

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 04/14] sched: SCHED_DEADLINE SMP-related data structures & logic.
  2013-11-21 14:13     ` Juri Lelli
@ 2013-11-21 14:41       ` Steven Rostedt
  2013-11-21 16:08       ` Paul E. McKenney
  1 sibling, 0 replies; 81+ messages in thread
From: Steven Rostedt @ 2013-11-21 14:41 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield

On Thu, 21 Nov 2013 15:13:28 +0100
Juri Lelli <juri.lelli@gmail.com> wrote:
;
> >> +
> >> +#ifdef CONFIG_SMP
> >> +	/* zero means no -deadline tasks */
> > 
> > I'm curious to why you add the '-' to -deadline.
> > 
> 
> I guess "deadline tasks" is too much generic; "SCHED_DEADLINE tasks" too long;
> "DL" or "-dl" can be associated to "download". Nothing special in the end, just
> tought it was a reasonable abbreviation.

Yeah, keep the -deadline, as that makes it more searchable.

> >> +static int latest_cpu_find(struct cpumask *span,
> >> +			   struct task_struct *task,
> >> +			   struct cpumask *later_mask)
> >> +{
> >> +	const struct sched_dl_entity *dl_se = &task->dl;
> >> +	int cpu, found = -1, best = 0;
> >> +	u64 max_dl = 0;
> >> +
> >> +	for_each_cpu(cpu, span) {
> >> +		struct rq *rq = cpu_rq(cpu);
> >> +		struct dl_rq *dl_rq = &rq->dl;
> >> +
> >> +		if (cpumask_test_cpu(cpu, &task->cpus_allowed) &&
> >> +		    (!dl_rq->dl_nr_running || dl_time_before(dl_se->deadline,
> >> +		     dl_rq->earliest_dl.curr))) {
> >> +			if (later_mask)
> >> +				cpumask_set_cpu(cpu, later_mask);
> >> +			if (!best && !dl_rq->dl_nr_running) {
> >> +				best = 1;
> >> +				found = cpu;
> >> +			} else if (!best &&
> >> +				   dl_time_before(max_dl,
> >> +						  dl_rq->earliest_dl.curr)) {
> > 
> > Ug, the above is hard to read. What about:
> > 
> > 	if (!best) {
> > 		if (!dl_rq->dl_nr_running) {
> > 			best = 1;
> > 			found = cpu;
> > 		} elsif (dl_time_before(...)) {
> > 			...
> > 		}
> > 	}
> > 
> 
> This is completely removed in 13/14. I don't like it either, but since we end
> up removing this mess, do you think we still have to fix this here?

OK, if it gets removed later (I haven't hit patch 13 yet), then it
should be fine.

> 
> > Also, I would think dl should be nice to rt as well. There may be a
> > idle CPU or a non rt task, and this could pick a CPU running an RT
> > task. Worse yet, that RT task may be pinned to that CPU.
> > 
> 
> Well, in 13/14 we introduce a free_cpus mask. A CPU is considered free if it
> doesn't have any -deadline task running. We can modify that excluding also CPUs
> running RT tasks, but I have to think a bit if we can do this also from -rt code.
> 
> > We should be able to incorporate cpuprio_find() to be dl aware too.
> > That is, work for both -rt and -dl.
> >
> 
> Like checking if a -dl task is running on the cpu chosen for pushing an -rt
> task, and continue searching in that case.

Yep, and also, it could perhaps be used for knowing where to push a
-dl task too.

-- Steve

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 04/14] sched: SCHED_DEADLINE SMP-related data structures & logic.
  2013-11-21 14:13     ` Juri Lelli
  2013-11-21 14:41       ` Steven Rostedt
@ 2013-11-21 16:08       ` Paul E. McKenney
  2013-11-21 16:16         ` Juri Lelli
  1 sibling, 1 reply; 81+ messages in thread
From: Paul E. McKenney @ 2013-11-21 16:08 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Steven Rostedt, peterz, tglx, mingo, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield

On Thu, Nov 21, 2013 at 03:13:28PM +0100, Juri Lelli wrote:
> On 11/20/2013 07:51 PM, Steven Rostedt wrote:
> > On Thu,  7 Nov 2013 14:43:38 +0100
> > Juri Lelli <juri.lelli@gmail.com> wrote:
> > 
> > 
> >> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> >> index cb93f2e..18a73b4 100644
> >> --- a/kernel/sched/deadline.c
> >> +++ b/kernel/sched/deadline.c
> >> @@ -10,6 +10,7 @@
> >>   * miss some of their deadlines), and won't affect any other task.
> >>   *
> >>   * Copyright (C) 2012 Dario Faggioli <raistlin@linux.it>,
> >> + *                    Juri Lelli <juri.lelli@gmail.com>,
> >>   *                    Michael Trimarchi <michael@amarulasolutions.com>,
> >>   *                    Fabio Checconi <fchecconi@gmail.com>
> >>   */
> >> @@ -20,6 +21,15 @@ static inline int dl_time_before(u64 a, u64 b)
> >>  	return (s64)(a - b) < 0;
> >>  }
> >>  
> >> +/*
> >> + * Tells if entity @a should preempt entity @b.
> >> + */
> >> +static inline
> >> +int dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
> >> +{
> >> +	return dl_time_before(a->deadline, b->deadline);
> >> +}
> >> +
> >>  static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
> >>  {
> >>  	return container_of(dl_se, struct task_struct, dl);
> >> @@ -53,8 +63,168 @@ static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
> >>  void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
> >>  {
> >>  	dl_rq->rb_root = RB_ROOT;
> >> +
> >> +#ifdef CONFIG_SMP
> >> +	/* zero means no -deadline tasks */
> > 
> > I'm curious to why you add the '-' to -deadline.
> > 
> 
> I guess "deadline tasks" is too much generic; "SCHED_DEADLINE tasks" too long;
> "DL" or "-dl" can be associated to "download". Nothing special in the end, just
> tought it was a reasonable abbreviation.

My guess was that "-deadline" was an abbreviation for tasks that had
missed their deadline, just so you know.  ;-)

								Thanx, Paul

> >> +	dl_rq->earliest_dl.curr = dl_rq->earliest_dl.next = 0;
> >> +
> >> +	dl_rq->dl_nr_migratory = 0;
> >> +	dl_rq->overloaded = 0;
> >> +	dl_rq->pushable_dl_tasks_root = RB_ROOT;
> >> +#endif
> >> +}
> >> +
> >> +#ifdef CONFIG_SMP
> >> +
> >> +static inline int dl_overloaded(struct rq *rq)
> >> +{
> >> +	return atomic_read(&rq->rd->dlo_count);
> >> +}
> >> +
> >> +static inline void dl_set_overload(struct rq *rq)
> >> +{
> >> +	if (!rq->online)
> >> +		return;
> >> +
> >> +	cpumask_set_cpu(rq->cpu, rq->rd->dlo_mask);
> >> +	/*
> >> +	 * Must be visible before the overload count is
> >> +	 * set (as in sched_rt.c).
> >> +	 *
> >> +	 * Matched by the barrier in pull_dl_task().
> >> +	 */
> >> +	smp_wmb();
> >> +	atomic_inc(&rq->rd->dlo_count);
> >> +}
> >> +
> >> +static inline void dl_clear_overload(struct rq *rq)
> >> +{
> >> +	if (!rq->online)
> >> +		return;
> >> +
> >> +	atomic_dec(&rq->rd->dlo_count);
> >> +	cpumask_clear_cpu(rq->cpu, rq->rd->dlo_mask);
> >> +}
> >> +
> >> +static void update_dl_migration(struct dl_rq *dl_rq)
> >> +{
> >> +	if (dl_rq->dl_nr_migratory && dl_rq->dl_nr_total > 1) {
> >> +		if (!dl_rq->overloaded) {
> >> +			dl_set_overload(rq_of_dl_rq(dl_rq));
> >> +			dl_rq->overloaded = 1;
> >> +		}
> >> +	} else if (dl_rq->overloaded) {
> >> +		dl_clear_overload(rq_of_dl_rq(dl_rq));
> >> +		dl_rq->overloaded = 0;
> >> +	}
> >> +}
> >> +
> >> +static void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> >> +{
> >> +	struct task_struct *p = dl_task_of(dl_se);
> >> +	dl_rq = &rq_of_dl_rq(dl_rq)->dl;
> >> +
> >> +	dl_rq->dl_nr_total++;
> >> +	if (p->nr_cpus_allowed > 1)
> >> +		dl_rq->dl_nr_migratory++;
> >> +
> >> +	update_dl_migration(dl_rq);
> >> +}
> >> +
> >> +static void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> >> +{
> >> +	struct task_struct *p = dl_task_of(dl_se);
> >> +	dl_rq = &rq_of_dl_rq(dl_rq)->dl;
> >> +
> >> +	dl_rq->dl_nr_total--;
> >> +	if (p->nr_cpus_allowed > 1)
> >> +		dl_rq->dl_nr_migratory--;
> >> +
> >> +	update_dl_migration(dl_rq);
> >> +}
> >> +
> >> +/*
> >> + * The list of pushable -deadline task is not a plist, like in
> >> + * sched_rt.c, it is an rb-tree with tasks ordered by deadline.
> >> + */
> >> +static void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
> >> +{
> >> +	struct dl_rq *dl_rq = &rq->dl;
> >> +	struct rb_node **link = &dl_rq->pushable_dl_tasks_root.rb_node;
> >> +	struct rb_node *parent = NULL;
> >> +	struct task_struct *entry;
> >> +	int leftmost = 1;
> >> +
> >> +	BUG_ON(!RB_EMPTY_NODE(&p->pushable_dl_tasks));
> >> +
> >> +	while (*link) {
> >> +		parent = *link;
> >> +		entry = rb_entry(parent, struct task_struct,
> >> +				 pushable_dl_tasks);
> >> +		if (dl_entity_preempt(&p->dl, &entry->dl))
> >> +			link = &parent->rb_left;
> >> +		else {
> >> +			link = &parent->rb_right;
> >> +			leftmost = 0;
> >> +		}
> >> +	}
> >> +
> >> +	if (leftmost)
> >> +		dl_rq->pushable_dl_tasks_leftmost = &p->pushable_dl_tasks;
> >> +
> >> +	rb_link_node(&p->pushable_dl_tasks, parent, link);
> >> +	rb_insert_color(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root);
> >> +}
> >> +
> >> +static void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
> >> +{
> >> +	struct dl_rq *dl_rq = &rq->dl;
> >> +
> >> +	if (RB_EMPTY_NODE(&p->pushable_dl_tasks))
> >> +		return;
> >> +
> >> +	if (dl_rq->pushable_dl_tasks_leftmost == &p->pushable_dl_tasks) {
> >> +		struct rb_node *next_node;
> >> +
> >> +		next_node = rb_next(&p->pushable_dl_tasks);
> >> +		dl_rq->pushable_dl_tasks_leftmost = next_node;
> >> +	}
> >> +
> >> +	rb_erase(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root);
> >> +	RB_CLEAR_NODE(&p->pushable_dl_tasks);
> >> +}
> >> +
> >> +static inline int has_pushable_dl_tasks(struct rq *rq)
> >> +{
> >> +	return !RB_EMPTY_ROOT(&rq->dl.pushable_dl_tasks_root);
> >> +}
> >> +
> >> +static int push_dl_task(struct rq *rq);
> >> +
> >> +#else
> >> +
> >> +static inline
> >> +void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
> >> +{
> >> +}
> >> +
> >> +static inline
> >> +void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
> >> +{
> >> +}
> >> +
> >> +static inline
> >> +void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> >> +{
> >> +}
> >> +
> >> +static inline
> >> +void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> >> +{
> >>  }
> >>  
> >> +#endif /* CONFIG_SMP */
> >> +
> >>  static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
> >>  static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
> >>  static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
> >> @@ -307,6 +477,14 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
> >>  			check_preempt_curr_dl(rq, p, 0);
> >>  		else
> >>  			resched_task(rq->curr);
> >> +#ifdef CONFIG_SMP
> >> +		/*
> >> +		 * Queueing this task back might have overloaded rq,
> >> +		 * check if we need to kick someone away.
> >> +		 */
> >> +		if (has_pushable_dl_tasks(rq))
> >> +			push_dl_task(rq);
> >> +#endif
> >>  	}
> >>  unlock:
> >>  	raw_spin_unlock(&rq->lock);
> >> @@ -397,6 +575,100 @@ static void update_curr_dl(struct rq *rq)
> >>  	}
> >>  }
> >>  
> >> +#ifdef CONFIG_SMP
> >> +
> >> +static struct task_struct *pick_next_earliest_dl_task(struct rq *rq, int cpu);
> >> +
> >> +static inline u64 next_deadline(struct rq *rq)
> >> +{
> >> +	struct task_struct *next = pick_next_earliest_dl_task(rq, rq->cpu);
> >> +
> >> +	if (next && dl_prio(next->prio))
> >> +		return next->dl.deadline;
> >> +	else
> >> +		return 0;
> >> +}
> >> +
> >> +static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
> >> +{
> >> +	struct rq *rq = rq_of_dl_rq(dl_rq);
> >> +
> >> +	if (dl_rq->earliest_dl.curr == 0 ||
> >> +	    dl_time_before(deadline, dl_rq->earliest_dl.curr)) {
> >> +		/*
> >> +		 * If the dl_rq had no -deadline tasks, or if the new task
> >> +		 * has shorter deadline than the current one on dl_rq, we
> >> +		 * know that the previous earliest becomes our next earliest,
> >> +		 * as the new task becomes the earliest itself.
> >> +		 */
> >> +		dl_rq->earliest_dl.next = dl_rq->earliest_dl.curr;
> >> +		dl_rq->earliest_dl.curr = deadline;
> >> +	} else if (dl_rq->earliest_dl.next == 0 ||
> >> +		   dl_time_before(deadline, dl_rq->earliest_dl.next)) {
> >> +		/*
> >> +		 * On the other hand, if the new -deadline task has a
> >> +		 * a later deadline than the earliest one on dl_rq, but
> >> +		 * it is earlier than the next (if any), we must
> >> +		 * recompute the next-earliest.
> >> +		 */
> >> +		dl_rq->earliest_dl.next = next_deadline(rq);
> >> +	}
> >> +}
> >> +
> >> +static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
> >> +{
> >> +	struct rq *rq = rq_of_dl_rq(dl_rq);
> >> +
> >> +	/*
> >> +	 * Since we may have removed our earliest (and/or next earliest)
> >> +	 * task we must recompute them.
> >> +	 */
> >> +	if (!dl_rq->dl_nr_running) {
> >> +		dl_rq->earliest_dl.curr = 0;
> >> +		dl_rq->earliest_dl.next = 0;
> >> +	} else {
> >> +		struct rb_node *leftmost = dl_rq->rb_leftmost;
> >> +		struct sched_dl_entity *entry;
> >> +
> >> +		entry = rb_entry(leftmost, struct sched_dl_entity, rb_node);
> >> +		dl_rq->earliest_dl.curr = entry->deadline;
> >> +		dl_rq->earliest_dl.next = next_deadline(rq);
> >> +	}
> >> +}
> >> +
> >> +#else
> >> +
> >> +static inline void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
> >> +static inline void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
> >> +
> >> +#endif /* CONFIG_SMP */
> >> +
> >> +static inline
> >> +void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> >> +{
> >> +	int prio = dl_task_of(dl_se)->prio;
> >> +	u64 deadline = dl_se->deadline;
> >> +
> >> +	WARN_ON(!dl_prio(prio));
> >> +	dl_rq->dl_nr_running++;
> >> +
> >> +	inc_dl_deadline(dl_rq, deadline);
> >> +	inc_dl_migration(dl_se, dl_rq);
> >> +}
> >> +
> >> +static inline
> >> +void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> >> +{
> >> +	int prio = dl_task_of(dl_se)->prio;
> >> +
> >> +	WARN_ON(!dl_prio(prio));
> >> +	WARN_ON(!dl_rq->dl_nr_running);
> >> +	dl_rq->dl_nr_running--;
> >> +
> >> +	dec_dl_deadline(dl_rq, dl_se->deadline);
> >> +	dec_dl_migration(dl_se, dl_rq);
> >> +}
> >> +
> >>  static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
> >>  {
> >>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> >> @@ -424,7 +696,7 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
> >>  	rb_link_node(&dl_se->rb_node, parent, link);
> >>  	rb_insert_color(&dl_se->rb_node, &dl_rq->rb_root);
> >>  
> >> -	dl_rq->dl_nr_running++;
> >> +	inc_dl_tasks(dl_se, dl_rq);
> >>  }
> >>  
> >>  static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
> >> @@ -444,7 +716,7 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
> >>  	rb_erase(&dl_se->rb_node, &dl_rq->rb_root);
> >>  	RB_CLEAR_NODE(&dl_se->rb_node);
> >>  
> >> -	dl_rq->dl_nr_running--;
> >> +	dec_dl_tasks(dl_se, dl_rq);
> >>  }
> >>  
> >>  static void
> >> @@ -482,12 +754,17 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> >>  		return;
> >>  
> >>  	enqueue_dl_entity(&p->dl, flags);
> >> +
> >> +	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
> >> +		enqueue_pushable_dl_task(rq, p);
> >> +
> >>  	inc_nr_running(rq);
> >>  }
> >>  
> >>  static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> >>  {
> >>  	dequeue_dl_entity(&p->dl);
> >> +	dequeue_pushable_dl_task(rq, p);
> >>  }
> >>  
> >>  static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> >> @@ -525,6 +802,77 @@ static void yield_task_dl(struct rq *rq)
> >>  	update_curr_dl(rq);
> >>  }
> >>  
> >> +#ifdef CONFIG_SMP
> >> +
> >> +static int find_later_rq(struct task_struct *task);
> >> +static int latest_cpu_find(struct cpumask *span,
> >> +			   struct task_struct *task,
> >> +			   struct cpumask *later_mask);
> >> +
> >> +static int
> >> +select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
> >> +{
> >> +	struct task_struct *curr;
> >> +	struct rq *rq;
> >> +	int cpu;
> >> +
> >> +	cpu = task_cpu(p);
> >> +
> >> +	if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
> >> +		goto out;
> >> +
> >> +	rq = cpu_rq(cpu);
> >> +
> >> +	rcu_read_lock();
> >> +	curr = ACCESS_ONCE(rq->curr); /* unlocked access */
> >> +
> >> +	/*
> >> +	 * If we are dealing with a -deadline task, we must
> >> +	 * decide where to wake it up.
> >> +	 * If it has a later deadline and the current task
> >> +	 * on this rq can't move (provided the waking task
> >> +	 * can!) we prefer to send it somewhere else. On the
> >> +	 * other hand, if it has a shorter deadline, we
> >> +	 * try to make it stay here, it might be important.
> >> +	 */
> >> +	if (unlikely(dl_task(curr)) &&
> >> +	    (curr->nr_cpus_allowed < 2 ||
> >> +	     !dl_entity_preempt(&p->dl, &curr->dl)) &&
> >> +	    (p->nr_cpus_allowed > 1)) {
> >> +		int target = find_later_rq(p);
> >> +
> >> +		if (target != -1)
> >> +			cpu = target;
> >> +	}
> >> +	rcu_read_unlock();
> >> +
> >> +out:
> >> +	return cpu;
> >> +}
> >> +
> >> +static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
> >> +{
> >> +	/*
> >> +	 * Current can't be migrated, useless to reschedule,
> >> +	 * let's hope p can move out.
> >> +	 */
> >> +	if (rq->curr->nr_cpus_allowed == 1 ||
> >> +	    latest_cpu_find(rq->rd->span, rq->curr, NULL) == -1)
> >> +		return;
> >> +
> >> +	/*
> >> +	 * p is migratable, so let's not schedule it and
> >> +	 * see if it is pushed or pulled somewhere else.
> >> +	 */
> >> +	if (p->nr_cpus_allowed != 1 &&
> >> +	    latest_cpu_find(rq->rd->span, p, NULL) != -1)
> >> +		return;
> >> +
> >> +	resched_task(rq->curr);
> >> +}
> >> +
> >> +#endif /* CONFIG_SMP */
> >> +
> >>  /*
> >>   * Only called when both the current and waking task are -deadline
> >>   * tasks.
> >> @@ -532,8 +880,20 @@ static void yield_task_dl(struct rq *rq)
> >>  static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
> >>  				  int flags)
> >>  {
> >> -	if (dl_time_before(p->dl.deadline, rq->curr->dl.deadline))
> >> +	if (dl_entity_preempt(&p->dl, &rq->curr->dl)) {
> >>  		resched_task(rq->curr);
> >> +		return;
> >> +	}
> >> +
> >> +#ifdef CONFIG_SMP
> >> +	/*
> >> +	 * In the unlikely case current and p have the same deadline
> >> +	 * let us try to decide what's the best thing to do...
> >> +	 */
> >> +	if ((s64)(p->dl.deadline - rq->curr->dl.deadline) == 0 &&
> >> +	    !need_resched())
> >> +		check_preempt_equal_dl(rq, p);
> >> +#endif /* CONFIG_SMP */
> >>  }
> >>  
> >>  #ifdef CONFIG_SCHED_HRTICK
> >> @@ -573,16 +933,29 @@ struct task_struct *pick_next_task_dl(struct rq *rq)
> >>  
> >>  	p = dl_task_of(dl_se);
> >>  	p->se.exec_start = rq_clock_task(rq);
> >> +
> >> +	/* Running task will never be pushed. */
> >> +	if (p)
> >> +		dequeue_pushable_dl_task(rq, p);
> >> +
> >>  #ifdef CONFIG_SCHED_HRTICK
> >>  	if (hrtick_enabled(rq))
> >>  		start_hrtick_dl(rq, p);
> >>  #endif
> >> +
> >> +#ifdef CONFIG_SMP
> >> +	rq->post_schedule = has_pushable_dl_tasks(rq);
> >> +#endif /* CONFIG_SMP */
> >> +
> >>  	return p;
> >>  }
> >>  
> >>  static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
> >>  {
> >>  	update_curr_dl(rq);
> >> +
> >> +	if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
> >> +		enqueue_pushable_dl_task(rq, p);
> >>  }
> >>  
> >>  static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
> >> @@ -616,16 +989,517 @@ static void set_curr_task_dl(struct rq *rq)
> >>  	struct task_struct *p = rq->curr;
> >>  
> >>  	p->se.exec_start = rq_clock_task(rq);
> >> +
> >> +	/* You can't push away the running task */
> >> +	dequeue_pushable_dl_task(rq, p);
> >> +}
> >> +
> >> +#ifdef CONFIG_SMP
> >> +
> >> +/* Only try algorithms three times */
> >> +#define DL_MAX_TRIES 3
> >> +
> >> +static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu)
> >> +{
> >> +	if (!task_running(rq, p) &&
> >> +	    (cpu < 0 || cpumask_test_cpu(cpu, &p->cpus_allowed)) &&
> >> +	    (p->nr_cpus_allowed > 1))
> >> +		return 1;
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +/* Returns the second earliest -deadline task, NULL otherwise */
> >> +static struct task_struct *pick_next_earliest_dl_task(struct rq *rq, int cpu)
> >> +{
> >> +	struct rb_node *next_node = rq->dl.rb_leftmost;
> >> +	struct sched_dl_entity *dl_se;
> >> +	struct task_struct *p = NULL;
> >> +
> >> +next_node:
> >> +	next_node = rb_next(next_node);
> >> +	if (next_node) {
> >> +		dl_se = rb_entry(next_node, struct sched_dl_entity, rb_node);
> >> +		p = dl_task_of(dl_se);
> >> +
> >> +		if (pick_dl_task(rq, p, cpu))
> >> +			return p;
> >> +
> >> +		goto next_node;
> >> +	}
> >> +
> >> +	return NULL;
> >> +}
> >> +
> >> +static int latest_cpu_find(struct cpumask *span,
> >> +			   struct task_struct *task,
> >> +			   struct cpumask *later_mask)
> >> +{
> >> +	const struct sched_dl_entity *dl_se = &task->dl;
> >> +	int cpu, found = -1, best = 0;
> >> +	u64 max_dl = 0;
> >> +
> >> +	for_each_cpu(cpu, span) {
> >> +		struct rq *rq = cpu_rq(cpu);
> >> +		struct dl_rq *dl_rq = &rq->dl;
> >> +
> >> +		if (cpumask_test_cpu(cpu, &task->cpus_allowed) &&
> >> +		    (!dl_rq->dl_nr_running || dl_time_before(dl_se->deadline,
> >> +		     dl_rq->earliest_dl.curr))) {
> >> +			if (later_mask)
> >> +				cpumask_set_cpu(cpu, later_mask);
> >> +			if (!best && !dl_rq->dl_nr_running) {
> >> +				best = 1;
> >> +				found = cpu;
> >> +			} else if (!best &&
> >> +				   dl_time_before(max_dl,
> >> +						  dl_rq->earliest_dl.curr)) {
> > 
> > Ug, the above is hard to read. What about:
> > 
> > 	if (!best) {
> > 		if (!dl_rq->dl_nr_running) {
> > 			best = 1;
> > 			found = cpu;
> > 		} elsif (dl_time_before(...)) {
> > 			...
> > 		}
> > 	}
> > 
> 
> This is completely removed in 13/14. I don't like it either, but since we end
> up removing this mess, do you think we still have to fix this here?
> 
> > Also, I would think dl should be nice to rt as well. There may be a
> > idle CPU or a non rt task, and this could pick a CPU running an RT
> > task. Worse yet, that RT task may be pinned to that CPU.
> > 
> 
> Well, in 13/14 we introduce a free_cpus mask. A CPU is considered free if it
> doesn't have any -deadline task running. We can modify that excluding also CPUs
> running RT tasks, but I have to think a bit if we can do this also from -rt code.
> 
> > We should be able to incorporate cpuprio_find() to be dl aware too.
> > That is, work for both -rt and -dl.
> >
> 
> Like checking if a -dl task is running on the cpu chosen for pushing an -rt
> task, and continue searching in that case.
> 
> > 
> > 
> >> +				max_dl = dl_rq->earliest_dl.curr;
> >> +				found = cpu;
> >> +			}
> >> +		} else if (later_mask)
> >> +			cpumask_clear_cpu(cpu, later_mask);
> >> +	}
> >> +
> >> +	return found;
> >> +}
> >> +
> >> +static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
> >> +
> >> +static int find_later_rq(struct task_struct *task)
> >> +{
> >> +	struct sched_domain *sd;
> >> +	struct cpumask *later_mask = __get_cpu_var(local_cpu_mask_dl);
> >> +	int this_cpu = smp_processor_id();
> >> +	int best_cpu, cpu = task_cpu(task);
> >> +
> >> +	/* Make sure the mask is initialized first */
> >> +	if (unlikely(!later_mask))
> >> +		return -1;
> >> +
> >> +	if (task->nr_cpus_allowed == 1)
> >> +		return -1;
> >> +
> >> +	best_cpu = latest_cpu_find(task_rq(task)->rd->span, task, later_mask);
> >> +	if (best_cpu == -1)
> >> +		return -1;
> >> +
> >> +	/*
> >> +	 * If we are here, some target has been found,
> >> +	 * the most suitable of which is cached in best_cpu.
> >> +	 * This is, among the runqueues where the current tasks
> >> +	 * have later deadlines than the task's one, the rq
> >> +	 * with the latest possible one.
> >> +	 *
> >> +	 * Now we check how well this matches with task's
> >> +	 * affinity and system topology.
> >> +	 *
> >> +	 * The last cpu where the task run is our first
> >> +	 * guess, since it is most likely cache-hot there.
> >> +	 */
> >> +	if (cpumask_test_cpu(cpu, later_mask))
> >> +		return cpu;
> >> +	/*
> >> +	 * Check if this_cpu is to be skipped (i.e., it is
> >> +	 * not in the mask) or not.
> >> +	 */
> >> +	if (!cpumask_test_cpu(this_cpu, later_mask))
> >> +		this_cpu = -1;
> >> +
> >> +	rcu_read_lock();
> >> +	for_each_domain(cpu, sd) {
> >> +		if (sd->flags & SD_WAKE_AFFINE) {
> >> +
> >> +			/*
> >> +			 * If possible, preempting this_cpu is
> >> +			 * cheaper than migrating.
> >> +			 */
> >> +			if (this_cpu != -1 &&
> >> +			    cpumask_test_cpu(this_cpu, sched_domain_span(sd))) {
> >> +				rcu_read_unlock();
> >> +				return this_cpu;
> >> +			}
> >> +
> >> +			/*
> >> +			 * Last chance: if best_cpu is valid and is
> >> +			 * in the mask, that becomes our choice.
> >> +			 */
> >> +			if (best_cpu < nr_cpu_ids &&
> >> +			    cpumask_test_cpu(best_cpu, sched_domain_span(sd))) {
> >> +				rcu_read_unlock();
> >> +				return best_cpu;
> >> +			}
> >> +		}
> >> +	}
> >> +	rcu_read_unlock();
> >> +
> >> +	/*
> >> +	 * At this point, all our guesses failed, we just return
> >> +	 * 'something', and let the caller sort the things out.
> >> +	 */
> >> +	if (this_cpu != -1)
> >> +		return this_cpu;
> >> +
> >> +	cpu = cpumask_any(later_mask);
> >> +	if (cpu < nr_cpu_ids)
> >> +		return cpu;
> >> +
> >> +	return -1;
> >> +}
> >> +
> >> +/* Locks the rq it finds */
> >> +static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
> >> +{
> >> +	struct rq *later_rq = NULL;
> >> +	int tries;
> >> +	int cpu;
> >> +
> >> +	for (tries = 0; tries < DL_MAX_TRIES; tries++) {
> >> +		cpu = find_later_rq(task);
> >> +
> >> +		if ((cpu == -1) || (cpu == rq->cpu))
> >> +			break;
> >> +
> >> +		later_rq = cpu_rq(cpu);
> >> +
> >> +		/* Retry if something changed. */
> >> +		if (double_lock_balance(rq, later_rq)) {
> >> +			if (unlikely(task_rq(task) != rq ||
> >> +				     !cpumask_test_cpu(later_rq->cpu,
> >> +				                       &task->cpus_allowed) ||
> >> +				     task_running(rq, task) || !task->on_rq)) {
> >> +				double_unlock_balance(rq, later_rq);
> >> +				later_rq = NULL;
> >> +				break;
> >> +			}
> >> +		}
> >> +
> >> +		/*
> >> +		 * If the rq we found has no -deadline task, or
> >> +		 * its earliest one has a later deadline than our
> >> +		 * task, the rq is a good one.
> >> +		 */
> >> +		if (!later_rq->dl.dl_nr_running ||
> >> +		    dl_time_before(task->dl.deadline,
> >> +				   later_rq->dl.earliest_dl.curr))
> >> +			break;
> >> +
> >> +		/* Otherwise we try again. */
> >> +		double_unlock_balance(rq, later_rq);
> >> +		later_rq = NULL;
> >> +	}
> >> +
> >> +	return later_rq;
> >>  }
> >>  
> >> +static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
> >> +{
> >> +	struct task_struct *p;
> >> +
> >> +	if (!has_pushable_dl_tasks(rq))
> >> +		return NULL;
> >> +
> >> +	p = rb_entry(rq->dl.pushable_dl_tasks_leftmost,
> >> +		     struct task_struct, pushable_dl_tasks);
> >> +
> >> +	BUG_ON(rq->cpu != task_cpu(p));
> >> +	BUG_ON(task_current(rq, p));
> >> +	BUG_ON(p->nr_cpus_allowed <= 1);
> >> +
> >> +	BUG_ON(!p->se.on_rq);
> >> +	BUG_ON(!dl_task(p));
> >> +
> >> +	return p;
> >> +}
> >> +
> >> +/*
> >> + * See if the non running -deadline tasks on this rq
> >> + * can be sent to some other CPU where they can preempt
> >> + * and start executing.
> >> + */
> >> +static int push_dl_task(struct rq *rq)
> >> +{
> >> +	struct task_struct *next_task;
> >> +	struct rq *later_rq;
> >> +
> >> +	if (!rq->dl.overloaded)
> >> +		return 0;
> >> +
> >> +	next_task = pick_next_pushable_dl_task(rq);
> >> +	if (!next_task)
> >> +		return 0;
> >> +
> >> +retry:
> >> +	if (unlikely(next_task == rq->curr)) {
> >> +		WARN_ON(1);
> >> +		return 0;
> >> +	}
> >> +
> >> +	/*
> >> +	 * If next_task preempts rq->curr, and rq->curr
> >> +	 * can move away, it makes sense to just reschedule
> >> +	 * without going further in pushing next_task.
> >> +	 */
> >> +	if (dl_task(rq->curr) &&
> >> +	    dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) &&
> >> +	    rq->curr->nr_cpus_allowed > 1) {
> >> +		resched_task(rq->curr);
> >> +		return 0;
> >> +	}
> >> +
> >> +	/* We might release rq lock */
> >> +	get_task_struct(next_task);
> >> +
> >> +	/* Will lock the rq it'll find */
> >> +	later_rq = find_lock_later_rq(next_task, rq);
> >> +	if (!later_rq) {
> >> +		struct task_struct *task;
> >> +
> >> +		/*
> >> +		 * We must check all this again, since
> >> +		 * find_lock_later_rq releases rq->lock and it is
> >> +		 * then possible that next_task has migrated.
> >> +		 */
> >> +		task = pick_next_pushable_dl_task(rq);
> >> +		if (task_cpu(next_task) == rq->cpu && task == next_task) {
> >> +			/*
> >> +			 * The task is still there. We don't try
> >> +			 * again, some other cpu will pull it when ready.
> >> +			 */
> >> +			dequeue_pushable_dl_task(rq, next_task);
> >> +			goto out;
> >> +		}
> >> +
> >> +		if (!task)
> >> +			/* No more tasks */
> >> +			goto out;
> >> +
> >> +		put_task_struct(next_task);
> >> +		next_task = task;
> >> +		goto retry;
> >> +	}
> >> +
> >> +	deactivate_task(rq, next_task, 0);
> >> +	set_task_cpu(next_task, later_rq->cpu);
> >> +	activate_task(later_rq, next_task, 0);
> >> +
> >> +	resched_task(later_rq->curr);
> >> +
> >> +	double_unlock_balance(rq, later_rq);
> >> +
> >> +out:
> >> +	put_task_struct(next_task);
> >> +
> >> +	return 1;
> >> +}
> >> +
> >> +static void push_dl_tasks(struct rq *rq)
> >> +{
> >> +	/* Terminates as it moves a -deadline task */
> >> +	while (push_dl_task(rq))
> >> +		;
> >> +}
> >> +
> >> +static int pull_dl_task(struct rq *this_rq)
> >> +{
> >> +	int this_cpu = this_rq->cpu, ret = 0, cpu;
> >> +	struct task_struct *p;
> >> +	struct rq *src_rq;
> >> +	u64 dmin = LONG_MAX;
> >> +
> >> +	if (likely(!dl_overloaded(this_rq)))
> >> +		return 0;
> >> +
> >> +	/*
> >> +	 * Match the barrier from dl_set_overloaded; this guarantees that if we
> >> +	 * see overloaded we must also see the dlo_mask bit.
> >> +	 */
> >> +	smp_rmb();
> >> +
> >> +	for_each_cpu(cpu, this_rq->rd->dlo_mask) {
> >> +		if (this_cpu == cpu)
> >> +			continue;
> >> +
> >> +		src_rq = cpu_rq(cpu);
> >> +
> >> +		/*
> >> +		 * It looks racy, abd it is! However, as in sched_rt.c,
> > 
> > abd it is?
> > 
> 
> Oops!
> 
> Thanks,
> 
> - Juri
> 
> > 
> >> +		 * we are fine with this.
> >> +		 */
> >> +		if (this_rq->dl.dl_nr_running &&
> >> +		    dl_time_before(this_rq->dl.earliest_dl.curr,
> >> +				   src_rq->dl.earliest_dl.next))
> >> +			continue;
> >> +
> >> +		/* Might drop this_rq->lock */
> >> +		double_lock_balance(this_rq, src_rq);
> >> +
> >> +		/*
> >> +		 * If there are no more pullable tasks on the
> >> +		 * rq, we're done with it.
> >> +		 */
> >> +		if (src_rq->dl.dl_nr_running <= 1)
> >> +			goto skip;
> >> +
> >> +		p = pick_next_earliest_dl_task(src_rq, this_cpu);
> >> +
> >> +		/*
> >> +		 * We found a task to be pulled if:
> >> +		 *  - it preempts our current (if there's one),
> >> +		 *  - it will preempt the last one we pulled (if any).
> >> +		 */
> >> +		if (p && dl_time_before(p->dl.deadline, dmin) &&
> >> +		    (!this_rq->dl.dl_nr_running ||
> >> +		     dl_time_before(p->dl.deadline,
> >> +				    this_rq->dl.earliest_dl.curr))) {
> >> +			WARN_ON(p == src_rq->curr);
> >> +			WARN_ON(!p->se.on_rq);
> >> +
> >> +			/*
> >> +			 * Then we pull iff p has actually an earlier
> >> +			 * deadline than the current task of its runqueue.
> >> +			 */
> >> +			if (dl_time_before(p->dl.deadline,
> >> +					   src_rq->curr->dl.deadline))
> >> +				goto skip;
> >> +
> >> +			ret = 1;
> >> +
> >> +			deactivate_task(src_rq, p, 0);
> >> +			set_task_cpu(p, this_cpu);
> >> +			activate_task(this_rq, p, 0);
> >> +			dmin = p->dl.deadline;
> >> +
> >> +			/* Is there any other task even earlier? */
> >> +		}
> >> +skip:
> >> +		double_unlock_balance(this_rq, src_rq);
> >> +	}
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static void pre_schedule_dl(struct rq *rq, struct task_struct *prev)
> >> +{
> >> +	/* Try to pull other tasks here */
> >> +	if (dl_task(prev))
> >> +		pull_dl_task(rq);
> >> +}
> >> +
> >> +static void post_schedule_dl(struct rq *rq)
> >> +{
> >> +	push_dl_tasks(rq);
> >> +}
> >> +
> >> +/*
> >> + * Since the task is not running and a reschedule is not going to happen
> >> + * anytime soon on its runqueue, we try pushing it away now.
> >> + */
> >> +static void task_woken_dl(struct rq *rq, struct task_struct *p)
> >> +{
> >> +	if (!task_running(rq, p) &&
> >> +	    !test_tsk_need_resched(rq->curr) &&
> >> +	    has_pushable_dl_tasks(rq) &&
> >> +	    p->nr_cpus_allowed > 1 &&
> >> +	    dl_task(rq->curr) &&
> >> +	    (rq->curr->nr_cpus_allowed < 2 ||
> >> +	     dl_entity_preempt(&rq->curr->dl, &p->dl))) {
> >> +		push_dl_tasks(rq);
> >> +	}
> >> +}
> >> +
> >> +static void set_cpus_allowed_dl(struct task_struct *p,
> >> +				const struct cpumask *new_mask)
> >> +{
> >> +	struct rq *rq;
> >> +	int weight;
> >> +
> >> +	BUG_ON(!dl_task(p));
> >> +
> >> +	/*
> >> +	 * Update only if the task is actually running (i.e.,
> >> +	 * it is on the rq AND it is not throttled).
> >> +	 */
> >> +	if (!on_dl_rq(&p->dl))
> >> +		return;
> >> +
> >> +	weight = cpumask_weight(new_mask);
> >> +
> >> +	/*
> >> +	 * Only update if the process changes its state from whether it
> >> +	 * can migrate or not.
> >> +	 */
> >> +	if ((p->nr_cpus_allowed > 1) == (weight > 1))
> >> +		return;
> >> +
> >> +	rq = task_rq(p);
> >> +
> >> +	/*
> >> +	 * The process used to be able to migrate OR it can now migrate
> >> +	 */
> >> +	if (weight <= 1) {
> >> +		if (!task_current(rq, p))
> >> +			dequeue_pushable_dl_task(rq, p);
> >> +		BUG_ON(!rq->dl.dl_nr_migratory);
> >> +		rq->dl.dl_nr_migratory--;
> >> +	} else {
> >> +		if (!task_current(rq, p))
> >> +			enqueue_pushable_dl_task(rq, p);
> >> +		rq->dl.dl_nr_migratory++;
> >> +	}
> >> +	
> >> +	update_dl_migration(&rq->dl);
> >> +}
> >> +
> >> +/* Assumes rq->lock is held */
> >> +static void rq_online_dl(struct rq *rq)
> >> +{
> >> +	if (rq->dl.overloaded)
> >> +		dl_set_overload(rq);
> >> +}
> >> +
> >> +/* Assumes rq->lock is held */
> >> +static void rq_offline_dl(struct rq *rq)
> >> +{
> >> +	if (rq->dl.overloaded)
> >> +		dl_clear_overload(rq);
> >> +}
> >> +
> >> +void init_sched_dl_class(void)
> >> +{
> >> +	unsigned int i;
> >> +
> >> +	for_each_possible_cpu(i)
> >> +		zalloc_cpumask_var_node(&per_cpu(local_cpu_mask_dl, i),
> >> +					GFP_KERNEL, cpu_to_node(i));
> >> +}
> >> +
> >> +#endif /* CONFIG_SMP */
> >> +
> >>  static void switched_from_dl(struct rq *rq, struct task_struct *p)
> >>  {
> >> -	if (hrtimer_active(&p->dl.dl_timer))
> >> +	if (hrtimer_active(&p->dl.dl_timer) && !dl_policy(p->policy))
> >>  		hrtimer_try_to_cancel(&p->dl.dl_timer);
> >> +
> >> +#ifdef CONFIG_SMP
> >> +	/*
> >> +	 * Since this might be the only -deadline task on the rq,
> >> +	 * this is the right place to try to pull some other one
> >> +	 * from an overloaded cpu, if any.
> >> +	 */
> >> +	if (!rq->dl.dl_nr_running)
> >> +		pull_dl_task(rq);
> >> +#endif
> >>  }
> >>  
> >> +/*
> >> + * When switching to -deadline, we may overload the rq, then
> >> + * we try to push someone off, if possible.
> >> + */
> >>  static void switched_to_dl(struct rq *rq, struct task_struct *p)
> >>  {
> >> +	int check_resched = 1;
> >> +
> >>  	/*
> >>  	 * If p is throttled, don't consider the possibility
> >>  	 * of preempting rq->curr, the check will be done right
> >> @@ -635,26 +1509,53 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
> >>  		return;
> >>  
> >>  	if (!p->on_rq || rq->curr != p) {
> >> -		if (task_has_dl_policy(rq->curr))
> >> +#ifdef CONFIG_SMP
> >> +		if (rq->dl.overloaded && push_dl_task(rq) && rq != task_rq(p))
> >> +			/* Only reschedule if pushing failed */
> >> +			check_resched = 0;
> >> +#endif /* CONFIG_SMP */
> >> +		if (check_resched && task_has_dl_policy(rq->curr))
> >>  			check_preempt_curr_dl(rq, p, 0);
> >> -		else
> >> -			resched_task(rq->curr);
> >>  	}
> >>  }
> >>  
> >> +/*
> >> + * If the scheduling parameters of a -deadline task changed,
> >> + * a push or pull operation might be needed.
> >> + */
> >>  static void prio_changed_dl(struct rq *rq, struct task_struct *p,
> >>  			    int oldprio)
> >>  {
> >> -	switched_to_dl(rq, p);
> >> -}
> >> -
> >> +	if (p->on_rq || rq->curr == p) {
> >>  #ifdef CONFIG_SMP
> >> -static int
> >> -select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
> >> -{
> >> -	return task_cpu(p);
> >> +		/*
> >> +		 * This might be too much, but unfortunately
> >> +		 * we don't have the old deadline value, and
> >> +		 * we can't argue if the task is increasing
> >> +		 * or lowering its prio, so...
> >> +		 */
> >> +		if (!rq->dl.overloaded)
> >> +			pull_dl_task(rq);
> >> +
> >> +		/*
> >> +		 * If we now have a earlier deadline task than p,
> >> +		 * then reschedule, provided p is still on this
> >> +		 * runqueue.
> >> +		 */
> >> +		if (dl_time_before(rq->dl.earliest_dl.curr, p->dl.deadline) &&
> >> +		    rq->curr == p)
> >> +			resched_task(p);
> >> +#else
> >> +		/*
> >> +		 * Again, we don't know if p has a earlier
> >> +		 * or later deadline, so let's blindly set a
> >> +		 * (maybe not needed) rescheduling point.
> >> +		 */
> >> +		resched_task(p);
> >> +#endif /* CONFIG_SMP */
> >> +	} else
> >> +		switched_to_dl(rq, p);
> >>  }
> >> -#endif
> >>  
> >>  const struct sched_class dl_sched_class = {
> >>  	.next			= &rt_sched_class,
> >> @@ -669,6 +1570,12 @@ const struct sched_class dl_sched_class = {
> >>  
> >>  #ifdef CONFIG_SMP
> >>  	.select_task_rq		= select_task_rq_dl,
> >> +	.set_cpus_allowed       = set_cpus_allowed_dl,
> >> +	.rq_online              = rq_online_dl,
> >> +	.rq_offline             = rq_offline_dl,
> >> +	.pre_schedule		= pre_schedule_dl,
> >> +	.post_schedule		= post_schedule_dl,
> >> +	.task_woken		= task_woken_dl,
> >>  #endif
> >>  
> >>  	.set_curr_task		= set_curr_task_dl,
> >> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> >> index 01970c8..f7c4881 100644
> >> --- a/kernel/sched/rt.c
> >> +++ b/kernel/sched/rt.c
> >> @@ -1720,7 +1720,7 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
> >>  	    !test_tsk_need_resched(rq->curr) &&
> >>  	    has_pushable_tasks(rq) &&
> >>  	    p->nr_cpus_allowed > 1 &&
> >> -	    rt_task(rq->curr) &&
> >> +	    (dl_task(rq->curr) || rt_task(rq->curr)) &&
> >>  	    (rq->curr->nr_cpus_allowed < 2 ||
> >>  	     rq->curr->prio <= p->prio))
> >>  		push_rt_tasks(rq);
> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >> index ba97476..70d0030 100644
> >> --- a/kernel/sched/sched.h
> >> +++ b/kernel/sched/sched.h
> >> @@ -383,6 +383,31 @@ struct dl_rq {
> >>  	struct rb_node *rb_leftmost;
> >>  
> >>  	unsigned long dl_nr_running;
> >> +
> >> +#ifdef CONFIG_SMP
> >> +	/*
> >> +	 * Deadline values of the currently executing and the
> >> +	 * earliest ready task on this rq. Caching these facilitates
> >> +	 * the decision wether or not a ready but not running task
> >> +	 * should migrate somewhere else.
> >> +	 */
> >> +	struct {
> >> +		u64 curr;
> >> +		u64 next;
> >> +	} earliest_dl;
> >> +
> >> +	unsigned long dl_nr_migratory;
> >> +	unsigned long dl_nr_total;
> >> +	int overloaded;
> >> +
> >> +	/*
> >> +	 * Tasks on this rq that can be pushed away. They are kept in
> >> +	 * an rb-tree, ordered by tasks' deadlines, with caching
> >> +	 * of the leftmost (earliest deadline) element.
> >> +	 */
> >> +	struct rb_root pushable_dl_tasks_root;
> >> +	struct rb_node *pushable_dl_tasks_leftmost;
> >> +#endif
> >>  };
> >>  
> >>  #ifdef CONFIG_SMP
> >> @@ -403,6 +428,13 @@ struct root_domain {
> >>  	cpumask_var_t online;
> >>  
> >>  	/*
> >> +	 * The bit corresponding to a CPU gets set here if such CPU has more
> >> +	 * than one runnable -deadline task (as it is below for RT tasks).
> >> +	 */
> >> +	cpumask_var_t dlo_mask;
> >> +	atomic_t dlo_count;
> >> +
> >> +	/*
> >>  	 * The "RT overload" flag: it gets set if a CPU has more than
> >>  	 * one runnable RT task.
> >>  	 */
> >> @@ -1063,6 +1095,8 @@ static inline void idle_balance(int cpu, struct rq *rq)
> >>  extern void sysrq_sched_debug_show(void);
> >>  extern void sched_init_granularity(void);
> >>  extern void update_max_interval(void);
> >> +
> >> +extern void init_sched_dl_class(void);
> >>  extern void init_sched_rt_class(void);
> >>  extern void init_sched_fair_class(void);
> >>  
> > 
> 
> 


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 04/14] sched: SCHED_DEADLINE SMP-related data structures & logic.
  2013-11-21 16:08       ` Paul E. McKenney
@ 2013-11-21 16:16         ` Juri Lelli
  2013-11-21 16:26           ` Paul E. McKenney
  0 siblings, 1 reply; 81+ messages in thread
From: Juri Lelli @ 2013-11-21 16:16 UTC (permalink / raw)
  To: paulmck
  Cc: Steven Rostedt, peterz, tglx, mingo, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield

On 11/21/2013 05:08 PM, Paul E. McKenney wrote:
> On Thu, Nov 21, 2013 at 03:13:28PM +0100, Juri Lelli wrote:
>> On 11/20/2013 07:51 PM, Steven Rostedt wrote:
>>> On Thu,  7 Nov 2013 14:43:38 +0100
>>> Juri Lelli <juri.lelli@gmail.com> wrote:
>>>
>>>
>>>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>>>> index cb93f2e..18a73b4 100644
>>>> --- a/kernel/sched/deadline.c
>>>> +++ b/kernel/sched/deadline.c
>>>> @@ -10,6 +10,7 @@
>>>>   * miss some of their deadlines), and won't affect any other task.
>>>>   *
>>>>   * Copyright (C) 2012 Dario Faggioli <raistlin@linux.it>,
>>>> + *                    Juri Lelli <juri.lelli@gmail.com>,
>>>>   *                    Michael Trimarchi <michael@amarulasolutions.com>,
>>>>   *                    Fabio Checconi <fchecconi@gmail.com>
>>>>   */
>>>> @@ -20,6 +21,15 @@ static inline int dl_time_before(u64 a, u64 b)
>>>>  	return (s64)(a - b) < 0;
>>>>  }
>>>>  
>>>> +/*
>>>> + * Tells if entity @a should preempt entity @b.
>>>> + */
>>>> +static inline
>>>> +int dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
>>>> +{
>>>> +	return dl_time_before(a->deadline, b->deadline);
>>>> +}
>>>> +
>>>>  static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
>>>>  {
>>>>  	return container_of(dl_se, struct task_struct, dl);
>>>> @@ -53,8 +63,168 @@ static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
>>>>  void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
>>>>  {
>>>>  	dl_rq->rb_root = RB_ROOT;
>>>> +
>>>> +#ifdef CONFIG_SMP
>>>> +	/* zero means no -deadline tasks */
>>>
>>> I'm curious to why you add the '-' to -deadline.
>>>
>>
>> I guess "deadline tasks" is too much generic; "SCHED_DEADLINE tasks" too long;
>> "DL" or "-dl" can be associated to "download". Nothing special in the end, just
>> tought it was a reasonable abbreviation.
> 
> My guess was that "-deadline" was an abbreviation for tasks that had
> missed their deadline, just so you know.  ;-)
>

Argh! ;)

Actually, thinking about it twice: if we call "-rt tasks" things that are
governed by code that resides in sched/rt.c, we could agree on "-deadline
tasks" for things that pass through sched/deadline.c.

Thanks,

- Juri

>>>> +	dl_rq->earliest_dl.curr = dl_rq->earliest_dl.next = 0;
>>>> +
>>>> +	dl_rq->dl_nr_migratory = 0;
>>>> +	dl_rq->overloaded = 0;
>>>> +	dl_rq->pushable_dl_tasks_root = RB_ROOT;
>>>> +#endif
>>>> +}
>>>> +
>>>> +#ifdef CONFIG_SMP
>>>> +
>>>> +static inline int dl_overloaded(struct rq *rq)
>>>> +{
>>>> +	return atomic_read(&rq->rd->dlo_count);
>>>> +}
>>>> +
>>>> +static inline void dl_set_overload(struct rq *rq)
>>>> +{
>>>> +	if (!rq->online)
>>>> +		return;
>>>> +
>>>> +	cpumask_set_cpu(rq->cpu, rq->rd->dlo_mask);
>>>> +	/*
>>>> +	 * Must be visible before the overload count is
>>>> +	 * set (as in sched_rt.c).
>>>> +	 *
>>>> +	 * Matched by the barrier in pull_dl_task().
>>>> +	 */
>>>> +	smp_wmb();
>>>> +	atomic_inc(&rq->rd->dlo_count);
>>>> +}
>>>> +
>>>> +static inline void dl_clear_overload(struct rq *rq)
>>>> +{
>>>> +	if (!rq->online)
>>>> +		return;
>>>> +
>>>> +	atomic_dec(&rq->rd->dlo_count);
>>>> +	cpumask_clear_cpu(rq->cpu, rq->rd->dlo_mask);
>>>> +}
>>>> +
>>>> +static void update_dl_migration(struct dl_rq *dl_rq)
>>>> +{
>>>> +	if (dl_rq->dl_nr_migratory && dl_rq->dl_nr_total > 1) {
>>>> +		if (!dl_rq->overloaded) {
>>>> +			dl_set_overload(rq_of_dl_rq(dl_rq));
>>>> +			dl_rq->overloaded = 1;
>>>> +		}
>>>> +	} else if (dl_rq->overloaded) {
>>>> +		dl_clear_overload(rq_of_dl_rq(dl_rq));
>>>> +		dl_rq->overloaded = 0;
>>>> +	}
>>>> +}
>>>> +
>>>> +static void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>>>> +{
>>>> +	struct task_struct *p = dl_task_of(dl_se);
>>>> +	dl_rq = &rq_of_dl_rq(dl_rq)->dl;
>>>> +
>>>> +	dl_rq->dl_nr_total++;
>>>> +	if (p->nr_cpus_allowed > 1)
>>>> +		dl_rq->dl_nr_migratory++;
>>>> +
>>>> +	update_dl_migration(dl_rq);
>>>> +}
>>>> +
>>>> +static void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>>>> +{
>>>> +	struct task_struct *p = dl_task_of(dl_se);
>>>> +	dl_rq = &rq_of_dl_rq(dl_rq)->dl;
>>>> +
>>>> +	dl_rq->dl_nr_total--;
>>>> +	if (p->nr_cpus_allowed > 1)
>>>> +		dl_rq->dl_nr_migratory--;
>>>> +
>>>> +	update_dl_migration(dl_rq);
>>>> +}
>>>> +
>>>> +/*
>>>> + * The list of pushable -deadline task is not a plist, like in
>>>> + * sched_rt.c, it is an rb-tree with tasks ordered by deadline.
>>>> + */
>>>> +static void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
>>>> +{
>>>> +	struct dl_rq *dl_rq = &rq->dl;
>>>> +	struct rb_node **link = &dl_rq->pushable_dl_tasks_root.rb_node;
>>>> +	struct rb_node *parent = NULL;
>>>> +	struct task_struct *entry;
>>>> +	int leftmost = 1;
>>>> +
>>>> +	BUG_ON(!RB_EMPTY_NODE(&p->pushable_dl_tasks));
>>>> +
>>>> +	while (*link) {
>>>> +		parent = *link;
>>>> +		entry = rb_entry(parent, struct task_struct,
>>>> +				 pushable_dl_tasks);
>>>> +		if (dl_entity_preempt(&p->dl, &entry->dl))
>>>> +			link = &parent->rb_left;
>>>> +		else {
>>>> +			link = &parent->rb_right;
>>>> +			leftmost = 0;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	if (leftmost)
>>>> +		dl_rq->pushable_dl_tasks_leftmost = &p->pushable_dl_tasks;
>>>> +
>>>> +	rb_link_node(&p->pushable_dl_tasks, parent, link);
>>>> +	rb_insert_color(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root);
>>>> +}
>>>> +
>>>> +static void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
>>>> +{
>>>> +	struct dl_rq *dl_rq = &rq->dl;
>>>> +
>>>> +	if (RB_EMPTY_NODE(&p->pushable_dl_tasks))
>>>> +		return;
>>>> +
>>>> +	if (dl_rq->pushable_dl_tasks_leftmost == &p->pushable_dl_tasks) {
>>>> +		struct rb_node *next_node;
>>>> +
>>>> +		next_node = rb_next(&p->pushable_dl_tasks);
>>>> +		dl_rq->pushable_dl_tasks_leftmost = next_node;
>>>> +	}
>>>> +
>>>> +	rb_erase(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root);
>>>> +	RB_CLEAR_NODE(&p->pushable_dl_tasks);
>>>> +}
>>>> +
>>>> +static inline int has_pushable_dl_tasks(struct rq *rq)
>>>> +{
>>>> +	return !RB_EMPTY_ROOT(&rq->dl.pushable_dl_tasks_root);
>>>> +}
>>>> +
>>>> +static int push_dl_task(struct rq *rq);
>>>> +
>>>> +#else
>>>> +
>>>> +static inline
>>>> +void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
>>>> +{
>>>> +}
>>>> +
>>>> +static inline
>>>> +void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
>>>> +{
>>>> +}
>>>> +
>>>> +static inline
>>>> +void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>>>> +{
>>>> +}
>>>> +
>>>> +static inline
>>>> +void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>>>> +{
>>>>  }
>>>>  
>>>> +#endif /* CONFIG_SMP */
>>>> +
>>>>  static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
>>>>  static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
>>>>  static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
>>>> @@ -307,6 +477,14 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
>>>>  			check_preempt_curr_dl(rq, p, 0);
>>>>  		else
>>>>  			resched_task(rq->curr);
>>>> +#ifdef CONFIG_SMP
>>>> +		/*
>>>> +		 * Queueing this task back might have overloaded rq,
>>>> +		 * check if we need to kick someone away.
>>>> +		 */
>>>> +		if (has_pushable_dl_tasks(rq))
>>>> +			push_dl_task(rq);
>>>> +#endif
>>>>  	}
>>>>  unlock:
>>>>  	raw_spin_unlock(&rq->lock);
>>>> @@ -397,6 +575,100 @@ static void update_curr_dl(struct rq *rq)
>>>>  	}
>>>>  }
>>>>  
>>>> +#ifdef CONFIG_SMP
>>>> +
>>>> +static struct task_struct *pick_next_earliest_dl_task(struct rq *rq, int cpu);
>>>> +
>>>> +static inline u64 next_deadline(struct rq *rq)
>>>> +{
>>>> +	struct task_struct *next = pick_next_earliest_dl_task(rq, rq->cpu);
>>>> +
>>>> +	if (next && dl_prio(next->prio))
>>>> +		return next->dl.deadline;
>>>> +	else
>>>> +		return 0;
>>>> +}
>>>> +
>>>> +static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
>>>> +{
>>>> +	struct rq *rq = rq_of_dl_rq(dl_rq);
>>>> +
>>>> +	if (dl_rq->earliest_dl.curr == 0 ||
>>>> +	    dl_time_before(deadline, dl_rq->earliest_dl.curr)) {
>>>> +		/*
>>>> +		 * If the dl_rq had no -deadline tasks, or if the new task
>>>> +		 * has shorter deadline than the current one on dl_rq, we
>>>> +		 * know that the previous earliest becomes our next earliest,
>>>> +		 * as the new task becomes the earliest itself.
>>>> +		 */
>>>> +		dl_rq->earliest_dl.next = dl_rq->earliest_dl.curr;
>>>> +		dl_rq->earliest_dl.curr = deadline;
>>>> +	} else if (dl_rq->earliest_dl.next == 0 ||
>>>> +		   dl_time_before(deadline, dl_rq->earliest_dl.next)) {
>>>> +		/*
>>>> +		 * On the other hand, if the new -deadline task has a
>>>> +		 * a later deadline than the earliest one on dl_rq, but
>>>> +		 * it is earlier than the next (if any), we must
>>>> +		 * recompute the next-earliest.
>>>> +		 */
>>>> +		dl_rq->earliest_dl.next = next_deadline(rq);
>>>> +	}
>>>> +}
>>>> +
>>>> +static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
>>>> +{
>>>> +	struct rq *rq = rq_of_dl_rq(dl_rq);
>>>> +
>>>> +	/*
>>>> +	 * Since we may have removed our earliest (and/or next earliest)
>>>> +	 * task we must recompute them.
>>>> +	 */
>>>> +	if (!dl_rq->dl_nr_running) {
>>>> +		dl_rq->earliest_dl.curr = 0;
>>>> +		dl_rq->earliest_dl.next = 0;
>>>> +	} else {
>>>> +		struct rb_node *leftmost = dl_rq->rb_leftmost;
>>>> +		struct sched_dl_entity *entry;
>>>> +
>>>> +		entry = rb_entry(leftmost, struct sched_dl_entity, rb_node);
>>>> +		dl_rq->earliest_dl.curr = entry->deadline;
>>>> +		dl_rq->earliest_dl.next = next_deadline(rq);
>>>> +	}
>>>> +}
>>>> +
>>>> +#else
>>>> +
>>>> +static inline void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
>>>> +static inline void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
>>>> +
>>>> +#endif /* CONFIG_SMP */
>>>> +
>>>> +static inline
>>>> +void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>>>> +{
>>>> +	int prio = dl_task_of(dl_se)->prio;
>>>> +	u64 deadline = dl_se->deadline;
>>>> +
>>>> +	WARN_ON(!dl_prio(prio));
>>>> +	dl_rq->dl_nr_running++;
>>>> +
>>>> +	inc_dl_deadline(dl_rq, deadline);
>>>> +	inc_dl_migration(dl_se, dl_rq);
>>>> +}
>>>> +
>>>> +static inline
>>>> +void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>>>> +{
>>>> +	int prio = dl_task_of(dl_se)->prio;
>>>> +
>>>> +	WARN_ON(!dl_prio(prio));
>>>> +	WARN_ON(!dl_rq->dl_nr_running);
>>>> +	dl_rq->dl_nr_running--;
>>>> +
>>>> +	dec_dl_deadline(dl_rq, dl_se->deadline);
>>>> +	dec_dl_migration(dl_se, dl_rq);
>>>> +}
>>>> +
>>>>  static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
>>>>  {
>>>>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
>>>> @@ -424,7 +696,7 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
>>>>  	rb_link_node(&dl_se->rb_node, parent, link);
>>>>  	rb_insert_color(&dl_se->rb_node, &dl_rq->rb_root);
>>>>  
>>>> -	dl_rq->dl_nr_running++;
>>>> +	inc_dl_tasks(dl_se, dl_rq);
>>>>  }
>>>>  
>>>>  static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
>>>> @@ -444,7 +716,7 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
>>>>  	rb_erase(&dl_se->rb_node, &dl_rq->rb_root);
>>>>  	RB_CLEAR_NODE(&dl_se->rb_node);
>>>>  
>>>> -	dl_rq->dl_nr_running--;
>>>> +	dec_dl_tasks(dl_se, dl_rq);
>>>>  }
>>>>  
>>>>  static void
>>>> @@ -482,12 +754,17 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
>>>>  		return;
>>>>  
>>>>  	enqueue_dl_entity(&p->dl, flags);
>>>> +
>>>> +	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
>>>> +		enqueue_pushable_dl_task(rq, p);
>>>> +
>>>>  	inc_nr_running(rq);
>>>>  }
>>>>  
>>>>  static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
>>>>  {
>>>>  	dequeue_dl_entity(&p->dl);
>>>> +	dequeue_pushable_dl_task(rq, p);
>>>>  }
>>>>  
>>>>  static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
>>>> @@ -525,6 +802,77 @@ static void yield_task_dl(struct rq *rq)
>>>>  	update_curr_dl(rq);
>>>>  }
>>>>  
>>>> +#ifdef CONFIG_SMP
>>>> +
>>>> +static int find_later_rq(struct task_struct *task);
>>>> +static int latest_cpu_find(struct cpumask *span,
>>>> +			   struct task_struct *task,
>>>> +			   struct cpumask *later_mask);
>>>> +
>>>> +static int
>>>> +select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
>>>> +{
>>>> +	struct task_struct *curr;
>>>> +	struct rq *rq;
>>>> +	int cpu;
>>>> +
>>>> +	cpu = task_cpu(p);
>>>> +
>>>> +	if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
>>>> +		goto out;
>>>> +
>>>> +	rq = cpu_rq(cpu);
>>>> +
>>>> +	rcu_read_lock();
>>>> +	curr = ACCESS_ONCE(rq->curr); /* unlocked access */
>>>> +
>>>> +	/*
>>>> +	 * If we are dealing with a -deadline task, we must
>>>> +	 * decide where to wake it up.
>>>> +	 * If it has a later deadline and the current task
>>>> +	 * on this rq can't move (provided the waking task
>>>> +	 * can!) we prefer to send it somewhere else. On the
>>>> +	 * other hand, if it has a shorter deadline, we
>>>> +	 * try to make it stay here, it might be important.
>>>> +	 */
>>>> +	if (unlikely(dl_task(curr)) &&
>>>> +	    (curr->nr_cpus_allowed < 2 ||
>>>> +	     !dl_entity_preempt(&p->dl, &curr->dl)) &&
>>>> +	    (p->nr_cpus_allowed > 1)) {
>>>> +		int target = find_later_rq(p);
>>>> +
>>>> +		if (target != -1)
>>>> +			cpu = target;
>>>> +	}
>>>> +	rcu_read_unlock();
>>>> +
>>>> +out:
>>>> +	return cpu;
>>>> +}
>>>> +
>>>> +static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
>>>> +{
>>>> +	/*
>>>> +	 * Current can't be migrated, useless to reschedule,
>>>> +	 * let's hope p can move out.
>>>> +	 */
>>>> +	if (rq->curr->nr_cpus_allowed == 1 ||
>>>> +	    latest_cpu_find(rq->rd->span, rq->curr, NULL) == -1)
>>>> +		return;
>>>> +
>>>> +	/*
>>>> +	 * p is migratable, so let's not schedule it and
>>>> +	 * see if it is pushed or pulled somewhere else.
>>>> +	 */
>>>> +	if (p->nr_cpus_allowed != 1 &&
>>>> +	    latest_cpu_find(rq->rd->span, p, NULL) != -1)
>>>> +		return;
>>>> +
>>>> +	resched_task(rq->curr);
>>>> +}
>>>> +
>>>> +#endif /* CONFIG_SMP */
>>>> +
>>>>  /*
>>>>   * Only called when both the current and waking task are -deadline
>>>>   * tasks.
>>>> @@ -532,8 +880,20 @@ static void yield_task_dl(struct rq *rq)
>>>>  static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
>>>>  				  int flags)
>>>>  {
>>>> -	if (dl_time_before(p->dl.deadline, rq->curr->dl.deadline))
>>>> +	if (dl_entity_preempt(&p->dl, &rq->curr->dl)) {
>>>>  		resched_task(rq->curr);
>>>> +		return;
>>>> +	}
>>>> +
>>>> +#ifdef CONFIG_SMP
>>>> +	/*
>>>> +	 * In the unlikely case current and p have the same deadline
>>>> +	 * let us try to decide what's the best thing to do...
>>>> +	 */
>>>> +	if ((s64)(p->dl.deadline - rq->curr->dl.deadline) == 0 &&
>>>> +	    !need_resched())
>>>> +		check_preempt_equal_dl(rq, p);
>>>> +#endif /* CONFIG_SMP */
>>>>  }
>>>>  
>>>>  #ifdef CONFIG_SCHED_HRTICK
>>>> @@ -573,16 +933,29 @@ struct task_struct *pick_next_task_dl(struct rq *rq)
>>>>  
>>>>  	p = dl_task_of(dl_se);
>>>>  	p->se.exec_start = rq_clock_task(rq);
>>>> +
>>>> +	/* Running task will never be pushed. */
>>>> +	if (p)
>>>> +		dequeue_pushable_dl_task(rq, p);
>>>> +
>>>>  #ifdef CONFIG_SCHED_HRTICK
>>>>  	if (hrtick_enabled(rq))
>>>>  		start_hrtick_dl(rq, p);
>>>>  #endif
>>>> +
>>>> +#ifdef CONFIG_SMP
>>>> +	rq->post_schedule = has_pushable_dl_tasks(rq);
>>>> +#endif /* CONFIG_SMP */
>>>> +
>>>>  	return p;
>>>>  }
>>>>  
>>>>  static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
>>>>  {
>>>>  	update_curr_dl(rq);
>>>> +
>>>> +	if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
>>>> +		enqueue_pushable_dl_task(rq, p);
>>>>  }
>>>>  
>>>>  static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
>>>> @@ -616,16 +989,517 @@ static void set_curr_task_dl(struct rq *rq)
>>>>  	struct task_struct *p = rq->curr;
>>>>  
>>>>  	p->se.exec_start = rq_clock_task(rq);
>>>> +
>>>> +	/* You can't push away the running task */
>>>> +	dequeue_pushable_dl_task(rq, p);
>>>> +}
>>>> +
>>>> +#ifdef CONFIG_SMP
>>>> +
>>>> +/* Only try algorithms three times */
>>>> +#define DL_MAX_TRIES 3
>>>> +
>>>> +static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu)
>>>> +{
>>>> +	if (!task_running(rq, p) &&
>>>> +	    (cpu < 0 || cpumask_test_cpu(cpu, &p->cpus_allowed)) &&
>>>> +	    (p->nr_cpus_allowed > 1))
>>>> +		return 1;
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +/* Returns the second earliest -deadline task, NULL otherwise */
>>>> +static struct task_struct *pick_next_earliest_dl_task(struct rq *rq, int cpu)
>>>> +{
>>>> +	struct rb_node *next_node = rq->dl.rb_leftmost;
>>>> +	struct sched_dl_entity *dl_se;
>>>> +	struct task_struct *p = NULL;
>>>> +
>>>> +next_node:
>>>> +	next_node = rb_next(next_node);
>>>> +	if (next_node) {
>>>> +		dl_se = rb_entry(next_node, struct sched_dl_entity, rb_node);
>>>> +		p = dl_task_of(dl_se);
>>>> +
>>>> +		if (pick_dl_task(rq, p, cpu))
>>>> +			return p;
>>>> +
>>>> +		goto next_node;
>>>> +	}
>>>> +
>>>> +	return NULL;
>>>> +}
>>>> +
>>>> +static int latest_cpu_find(struct cpumask *span,
>>>> +			   struct task_struct *task,
>>>> +			   struct cpumask *later_mask)
>>>> +{
>>>> +	const struct sched_dl_entity *dl_se = &task->dl;
>>>> +	int cpu, found = -1, best = 0;
>>>> +	u64 max_dl = 0;
>>>> +
>>>> +	for_each_cpu(cpu, span) {
>>>> +		struct rq *rq = cpu_rq(cpu);
>>>> +		struct dl_rq *dl_rq = &rq->dl;
>>>> +
>>>> +		if (cpumask_test_cpu(cpu, &task->cpus_allowed) &&
>>>> +		    (!dl_rq->dl_nr_running || dl_time_before(dl_se->deadline,
>>>> +		     dl_rq->earliest_dl.curr))) {
>>>> +			if (later_mask)
>>>> +				cpumask_set_cpu(cpu, later_mask);
>>>> +			if (!best && !dl_rq->dl_nr_running) {
>>>> +				best = 1;
>>>> +				found = cpu;
>>>> +			} else if (!best &&
>>>> +				   dl_time_before(max_dl,
>>>> +						  dl_rq->earliest_dl.curr)) {
>>>
>>> Ug, the above is hard to read. What about:
>>>
>>> 	if (!best) {
>>> 		if (!dl_rq->dl_nr_running) {
>>> 			best = 1;
>>> 			found = cpu;
>>> 		} elsif (dl_time_before(...)) {
>>> 			...
>>> 		}
>>> 	}
>>>
>>
>> This is completely removed in 13/14. I don't like it either, but since we end
>> up removing this mess, do you think we still have to fix this here?
>>
>>> Also, I would think dl should be nice to rt as well. There may be a
>>> idle CPU or a non rt task, and this could pick a CPU running an RT
>>> task. Worse yet, that RT task may be pinned to that CPU.
>>>
>>
>> Well, in 13/14 we introduce a free_cpus mask. A CPU is considered free if it
>> doesn't have any -deadline task running. We can modify that excluding also CPUs
>> running RT tasks, but I have to think a bit if we can do this also from -rt code.
>>
>>> We should be able to incorporate cpuprio_find() to be dl aware too.
>>> That is, work for both -rt and -dl.
>>>
>>
>> Like checking if a -dl task is running on the cpu chosen for pushing an -rt
>> task, and continue searching in that case.
>>
>>>
>>>
>>>> +				max_dl = dl_rq->earliest_dl.curr;
>>>> +				found = cpu;
>>>> +			}
>>>> +		} else if (later_mask)
>>>> +			cpumask_clear_cpu(cpu, later_mask);
>>>> +	}
>>>> +
>>>> +	return found;
>>>> +}
>>>> +
>>>> +static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
>>>> +
>>>> +static int find_later_rq(struct task_struct *task)
>>>> +{
>>>> +	struct sched_domain *sd;
>>>> +	struct cpumask *later_mask = __get_cpu_var(local_cpu_mask_dl);
>>>> +	int this_cpu = smp_processor_id();
>>>> +	int best_cpu, cpu = task_cpu(task);
>>>> +
>>>> +	/* Make sure the mask is initialized first */
>>>> +	if (unlikely(!later_mask))
>>>> +		return -1;
>>>> +
>>>> +	if (task->nr_cpus_allowed == 1)
>>>> +		return -1;
>>>> +
>>>> +	best_cpu = latest_cpu_find(task_rq(task)->rd->span, task, later_mask);
>>>> +	if (best_cpu == -1)
>>>> +		return -1;
>>>> +
>>>> +	/*
>>>> +	 * If we are here, some target has been found,
>>>> +	 * the most suitable of which is cached in best_cpu.
>>>> +	 * This is, among the runqueues where the current tasks
>>>> +	 * have later deadlines than the task's one, the rq
>>>> +	 * with the latest possible one.
>>>> +	 *
>>>> +	 * Now we check how well this matches with task's
>>>> +	 * affinity and system topology.
>>>> +	 *
>>>> +	 * The last cpu where the task run is our first
>>>> +	 * guess, since it is most likely cache-hot there.
>>>> +	 */
>>>> +	if (cpumask_test_cpu(cpu, later_mask))
>>>> +		return cpu;
>>>> +	/*
>>>> +	 * Check if this_cpu is to be skipped (i.e., it is
>>>> +	 * not in the mask) or not.
>>>> +	 */
>>>> +	if (!cpumask_test_cpu(this_cpu, later_mask))
>>>> +		this_cpu = -1;
>>>> +
>>>> +	rcu_read_lock();
>>>> +	for_each_domain(cpu, sd) {
>>>> +		if (sd->flags & SD_WAKE_AFFINE) {
>>>> +
>>>> +			/*
>>>> +			 * If possible, preempting this_cpu is
>>>> +			 * cheaper than migrating.
>>>> +			 */
>>>> +			if (this_cpu != -1 &&
>>>> +			    cpumask_test_cpu(this_cpu, sched_domain_span(sd))) {
>>>> +				rcu_read_unlock();
>>>> +				return this_cpu;
>>>> +			}
>>>> +
>>>> +			/*
>>>> +			 * Last chance: if best_cpu is valid and is
>>>> +			 * in the mask, that becomes our choice.
>>>> +			 */
>>>> +			if (best_cpu < nr_cpu_ids &&
>>>> +			    cpumask_test_cpu(best_cpu, sched_domain_span(sd))) {
>>>> +				rcu_read_unlock();
>>>> +				return best_cpu;
>>>> +			}
>>>> +		}
>>>> +	}
>>>> +	rcu_read_unlock();
>>>> +
>>>> +	/*
>>>> +	 * At this point, all our guesses failed, we just return
>>>> +	 * 'something', and let the caller sort the things out.
>>>> +	 */
>>>> +	if (this_cpu != -1)
>>>> +		return this_cpu;
>>>> +
>>>> +	cpu = cpumask_any(later_mask);
>>>> +	if (cpu < nr_cpu_ids)
>>>> +		return cpu;
>>>> +
>>>> +	return -1;
>>>> +}
>>>> +
>>>> +/* Locks the rq it finds */
>>>> +static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
>>>> +{
>>>> +	struct rq *later_rq = NULL;
>>>> +	int tries;
>>>> +	int cpu;
>>>> +
>>>> +	for (tries = 0; tries < DL_MAX_TRIES; tries++) {
>>>> +		cpu = find_later_rq(task);
>>>> +
>>>> +		if ((cpu == -1) || (cpu == rq->cpu))
>>>> +			break;
>>>> +
>>>> +		later_rq = cpu_rq(cpu);
>>>> +
>>>> +		/* Retry if something changed. */
>>>> +		if (double_lock_balance(rq, later_rq)) {
>>>> +			if (unlikely(task_rq(task) != rq ||
>>>> +				     !cpumask_test_cpu(later_rq->cpu,
>>>> +				                       &task->cpus_allowed) ||
>>>> +				     task_running(rq, task) || !task->on_rq)) {
>>>> +				double_unlock_balance(rq, later_rq);
>>>> +				later_rq = NULL;
>>>> +				break;
>>>> +			}
>>>> +		}
>>>> +
>>>> +		/*
>>>> +		 * If the rq we found has no -deadline task, or
>>>> +		 * its earliest one has a later deadline than our
>>>> +		 * task, the rq is a good one.
>>>> +		 */
>>>> +		if (!later_rq->dl.dl_nr_running ||
>>>> +		    dl_time_before(task->dl.deadline,
>>>> +				   later_rq->dl.earliest_dl.curr))
>>>> +			break;
>>>> +
>>>> +		/* Otherwise we try again. */
>>>> +		double_unlock_balance(rq, later_rq);
>>>> +		later_rq = NULL;
>>>> +	}
>>>> +
>>>> +	return later_rq;
>>>>  }
>>>>  
>>>> +static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
>>>> +{
>>>> +	struct task_struct *p;
>>>> +
>>>> +	if (!has_pushable_dl_tasks(rq))
>>>> +		return NULL;
>>>> +
>>>> +	p = rb_entry(rq->dl.pushable_dl_tasks_leftmost,
>>>> +		     struct task_struct, pushable_dl_tasks);
>>>> +
>>>> +	BUG_ON(rq->cpu != task_cpu(p));
>>>> +	BUG_ON(task_current(rq, p));
>>>> +	BUG_ON(p->nr_cpus_allowed <= 1);
>>>> +
>>>> +	BUG_ON(!p->se.on_rq);
>>>> +	BUG_ON(!dl_task(p));
>>>> +
>>>> +	return p;
>>>> +}
>>>> +
>>>> +/*
>>>> + * See if the non running -deadline tasks on this rq
>>>> + * can be sent to some other CPU where they can preempt
>>>> + * and start executing.
>>>> + */
>>>> +static int push_dl_task(struct rq *rq)
>>>> +{
>>>> +	struct task_struct *next_task;
>>>> +	struct rq *later_rq;
>>>> +
>>>> +	if (!rq->dl.overloaded)
>>>> +		return 0;
>>>> +
>>>> +	next_task = pick_next_pushable_dl_task(rq);
>>>> +	if (!next_task)
>>>> +		return 0;
>>>> +
>>>> +retry:
>>>> +	if (unlikely(next_task == rq->curr)) {
>>>> +		WARN_ON(1);
>>>> +		return 0;
>>>> +	}
>>>> +
>>>> +	/*
>>>> +	 * If next_task preempts rq->curr, and rq->curr
>>>> +	 * can move away, it makes sense to just reschedule
>>>> +	 * without going further in pushing next_task.
>>>> +	 */
>>>> +	if (dl_task(rq->curr) &&
>>>> +	    dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) &&
>>>> +	    rq->curr->nr_cpus_allowed > 1) {
>>>> +		resched_task(rq->curr);
>>>> +		return 0;
>>>> +	}
>>>> +
>>>> +	/* We might release rq lock */
>>>> +	get_task_struct(next_task);
>>>> +
>>>> +	/* Will lock the rq it'll find */
>>>> +	later_rq = find_lock_later_rq(next_task, rq);
>>>> +	if (!later_rq) {
>>>> +		struct task_struct *task;
>>>> +
>>>> +		/*
>>>> +		 * We must check all this again, since
>>>> +		 * find_lock_later_rq releases rq->lock and it is
>>>> +		 * then possible that next_task has migrated.
>>>> +		 */
>>>> +		task = pick_next_pushable_dl_task(rq);
>>>> +		if (task_cpu(next_task) == rq->cpu && task == next_task) {
>>>> +			/*
>>>> +			 * The task is still there. We don't try
>>>> +			 * again, some other cpu will pull it when ready.
>>>> +			 */
>>>> +			dequeue_pushable_dl_task(rq, next_task);
>>>> +			goto out;
>>>> +		}
>>>> +
>>>> +		if (!task)
>>>> +			/* No more tasks */
>>>> +			goto out;
>>>> +
>>>> +		put_task_struct(next_task);
>>>> +		next_task = task;
>>>> +		goto retry;
>>>> +	}
>>>> +
>>>> +	deactivate_task(rq, next_task, 0);
>>>> +	set_task_cpu(next_task, later_rq->cpu);
>>>> +	activate_task(later_rq, next_task, 0);
>>>> +
>>>> +	resched_task(later_rq->curr);
>>>> +
>>>> +	double_unlock_balance(rq, later_rq);
>>>> +
>>>> +out:
>>>> +	put_task_struct(next_task);
>>>> +
>>>> +	return 1;
>>>> +}
>>>> +
>>>> +static void push_dl_tasks(struct rq *rq)
>>>> +{
>>>> +	/* Terminates as it moves a -deadline task */
>>>> +	while (push_dl_task(rq))
>>>> +		;
>>>> +}
>>>> +
>>>> +static int pull_dl_task(struct rq *this_rq)
>>>> +{
>>>> +	int this_cpu = this_rq->cpu, ret = 0, cpu;
>>>> +	struct task_struct *p;
>>>> +	struct rq *src_rq;
>>>> +	u64 dmin = LONG_MAX;
>>>> +
>>>> +	if (likely(!dl_overloaded(this_rq)))
>>>> +		return 0;
>>>> +
>>>> +	/*
>>>> +	 * Match the barrier from dl_set_overloaded; this guarantees that if we
>>>> +	 * see overloaded we must also see the dlo_mask bit.
>>>> +	 */
>>>> +	smp_rmb();
>>>> +
>>>> +	for_each_cpu(cpu, this_rq->rd->dlo_mask) {
>>>> +		if (this_cpu == cpu)
>>>> +			continue;
>>>> +
>>>> +		src_rq = cpu_rq(cpu);
>>>> +
>>>> +		/*
>>>> +		 * It looks racy, abd it is! However, as in sched_rt.c,
>>>
>>> abd it is?
>>>
>>
>> Oops!
>>
>> Thanks,
>>
>> - Juri
>>
>>>
>>>> +		 * we are fine with this.
>>>> +		 */
>>>> +		if (this_rq->dl.dl_nr_running &&
>>>> +		    dl_time_before(this_rq->dl.earliest_dl.curr,
>>>> +				   src_rq->dl.earliest_dl.next))
>>>> +			continue;
>>>> +
>>>> +		/* Might drop this_rq->lock */
>>>> +		double_lock_balance(this_rq, src_rq);
>>>> +
>>>> +		/*
>>>> +		 * If there are no more pullable tasks on the
>>>> +		 * rq, we're done with it.
>>>> +		 */
>>>> +		if (src_rq->dl.dl_nr_running <= 1)
>>>> +			goto skip;
>>>> +
>>>> +		p = pick_next_earliest_dl_task(src_rq, this_cpu);
>>>> +
>>>> +		/*
>>>> +		 * We found a task to be pulled if:
>>>> +		 *  - it preempts our current (if there's one),
>>>> +		 *  - it will preempt the last one we pulled (if any).
>>>> +		 */
>>>> +		if (p && dl_time_before(p->dl.deadline, dmin) &&
>>>> +		    (!this_rq->dl.dl_nr_running ||
>>>> +		     dl_time_before(p->dl.deadline,
>>>> +				    this_rq->dl.earliest_dl.curr))) {
>>>> +			WARN_ON(p == src_rq->curr);
>>>> +			WARN_ON(!p->se.on_rq);
>>>> +
>>>> +			/*
>>>> +			 * Then we pull iff p has actually an earlier
>>>> +			 * deadline than the current task of its runqueue.
>>>> +			 */
>>>> +			if (dl_time_before(p->dl.deadline,
>>>> +					   src_rq->curr->dl.deadline))
>>>> +				goto skip;
>>>> +
>>>> +			ret = 1;
>>>> +
>>>> +			deactivate_task(src_rq, p, 0);
>>>> +			set_task_cpu(p, this_cpu);
>>>> +			activate_task(this_rq, p, 0);
>>>> +			dmin = p->dl.deadline;
>>>> +
>>>> +			/* Is there any other task even earlier? */
>>>> +		}
>>>> +skip:
>>>> +		double_unlock_balance(this_rq, src_rq);
>>>> +	}
>>>> +
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +static void pre_schedule_dl(struct rq *rq, struct task_struct *prev)
>>>> +{
>>>> +	/* Try to pull other tasks here */
>>>> +	if (dl_task(prev))
>>>> +		pull_dl_task(rq);
>>>> +}
>>>> +
>>>> +static void post_schedule_dl(struct rq *rq)
>>>> +{
>>>> +	push_dl_tasks(rq);
>>>> +}
>>>> +
>>>> +/*
>>>> + * Since the task is not running and a reschedule is not going to happen
>>>> + * anytime soon on its runqueue, we try pushing it away now.
>>>> + */
>>>> +static void task_woken_dl(struct rq *rq, struct task_struct *p)
>>>> +{
>>>> +	if (!task_running(rq, p) &&
>>>> +	    !test_tsk_need_resched(rq->curr) &&
>>>> +	    has_pushable_dl_tasks(rq) &&
>>>> +	    p->nr_cpus_allowed > 1 &&
>>>> +	    dl_task(rq->curr) &&
>>>> +	    (rq->curr->nr_cpus_allowed < 2 ||
>>>> +	     dl_entity_preempt(&rq->curr->dl, &p->dl))) {
>>>> +		push_dl_tasks(rq);
>>>> +	}
>>>> +}
>>>> +
>>>> +static void set_cpus_allowed_dl(struct task_struct *p,
>>>> +				const struct cpumask *new_mask)
>>>> +{
>>>> +	struct rq *rq;
>>>> +	int weight;
>>>> +
>>>> +	BUG_ON(!dl_task(p));
>>>> +
>>>> +	/*
>>>> +	 * Update only if the task is actually running (i.e.,
>>>> +	 * it is on the rq AND it is not throttled).
>>>> +	 */
>>>> +	if (!on_dl_rq(&p->dl))
>>>> +		return;
>>>> +
>>>> +	weight = cpumask_weight(new_mask);
>>>> +
>>>> +	/*
>>>> +	 * Only update if the process changes its state from whether it
>>>> +	 * can migrate or not.
>>>> +	 */
>>>> +	if ((p->nr_cpus_allowed > 1) == (weight > 1))
>>>> +		return;
>>>> +
>>>> +	rq = task_rq(p);
>>>> +
>>>> +	/*
>>>> +	 * The process used to be able to migrate OR it can now migrate
>>>> +	 */
>>>> +	if (weight <= 1) {
>>>> +		if (!task_current(rq, p))
>>>> +			dequeue_pushable_dl_task(rq, p);
>>>> +		BUG_ON(!rq->dl.dl_nr_migratory);
>>>> +		rq->dl.dl_nr_migratory--;
>>>> +	} else {
>>>> +		if (!task_current(rq, p))
>>>> +			enqueue_pushable_dl_task(rq, p);
>>>> +		rq->dl.dl_nr_migratory++;
>>>> +	}
>>>> +	
>>>> +	update_dl_migration(&rq->dl);
>>>> +}
>>>> +
>>>> +/* Assumes rq->lock is held */
>>>> +static void rq_online_dl(struct rq *rq)
>>>> +{
>>>> +	if (rq->dl.overloaded)
>>>> +		dl_set_overload(rq);
>>>> +}
>>>> +
>>>> +/* Assumes rq->lock is held */
>>>> +static void rq_offline_dl(struct rq *rq)
>>>> +{
>>>> +	if (rq->dl.overloaded)
>>>> +		dl_clear_overload(rq);
>>>> +}
>>>> +
>>>> +void init_sched_dl_class(void)
>>>> +{
>>>> +	unsigned int i;
>>>> +
>>>> +	for_each_possible_cpu(i)
>>>> +		zalloc_cpumask_var_node(&per_cpu(local_cpu_mask_dl, i),
>>>> +					GFP_KERNEL, cpu_to_node(i));
>>>> +}
>>>> +
>>>> +#endif /* CONFIG_SMP */
>>>> +
>>>>  static void switched_from_dl(struct rq *rq, struct task_struct *p)
>>>>  {
>>>> -	if (hrtimer_active(&p->dl.dl_timer))
>>>> +	if (hrtimer_active(&p->dl.dl_timer) && !dl_policy(p->policy))
>>>>  		hrtimer_try_to_cancel(&p->dl.dl_timer);
>>>> +
>>>> +#ifdef CONFIG_SMP
>>>> +	/*
>>>> +	 * Since this might be the only -deadline task on the rq,
>>>> +	 * this is the right place to try to pull some other one
>>>> +	 * from an overloaded cpu, if any.
>>>> +	 */
>>>> +	if (!rq->dl.dl_nr_running)
>>>> +		pull_dl_task(rq);
>>>> +#endif
>>>>  }
>>>>  
>>>> +/*
>>>> + * When switching to -deadline, we may overload the rq, then
>>>> + * we try to push someone off, if possible.
>>>> + */
>>>>  static void switched_to_dl(struct rq *rq, struct task_struct *p)
>>>>  {
>>>> +	int check_resched = 1;
>>>> +
>>>>  	/*
>>>>  	 * If p is throttled, don't consider the possibility
>>>>  	 * of preempting rq->curr, the check will be done right
>>>> @@ -635,26 +1509,53 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
>>>>  		return;
>>>>  
>>>>  	if (!p->on_rq || rq->curr != p) {
>>>> -		if (task_has_dl_policy(rq->curr))
>>>> +#ifdef CONFIG_SMP
>>>> +		if (rq->dl.overloaded && push_dl_task(rq) && rq != task_rq(p))
>>>> +			/* Only reschedule if pushing failed */
>>>> +			check_resched = 0;
>>>> +#endif /* CONFIG_SMP */
>>>> +		if (check_resched && task_has_dl_policy(rq->curr))
>>>>  			check_preempt_curr_dl(rq, p, 0);
>>>> -		else
>>>> -			resched_task(rq->curr);
>>>>  	}
>>>>  }
>>>>  
>>>> +/*
>>>> + * If the scheduling parameters of a -deadline task changed,
>>>> + * a push or pull operation might be needed.
>>>> + */
>>>>  static void prio_changed_dl(struct rq *rq, struct task_struct *p,
>>>>  			    int oldprio)
>>>>  {
>>>> -	switched_to_dl(rq, p);
>>>> -}
>>>> -
>>>> +	if (p->on_rq || rq->curr == p) {
>>>>  #ifdef CONFIG_SMP
>>>> -static int
>>>> -select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
>>>> -{
>>>> -	return task_cpu(p);
>>>> +		/*
>>>> +		 * This might be too much, but unfortunately
>>>> +		 * we don't have the old deadline value, and
>>>> +		 * we can't argue if the task is increasing
>>>> +		 * or lowering its prio, so...
>>>> +		 */
>>>> +		if (!rq->dl.overloaded)
>>>> +			pull_dl_task(rq);
>>>> +
>>>> +		/*
>>>> +		 * If we now have a earlier deadline task than p,
>>>> +		 * then reschedule, provided p is still on this
>>>> +		 * runqueue.
>>>> +		 */
>>>> +		if (dl_time_before(rq->dl.earliest_dl.curr, p->dl.deadline) &&
>>>> +		    rq->curr == p)
>>>> +			resched_task(p);
>>>> +#else
>>>> +		/*
>>>> +		 * Again, we don't know if p has a earlier
>>>> +		 * or later deadline, so let's blindly set a
>>>> +		 * (maybe not needed) rescheduling point.
>>>> +		 */
>>>> +		resched_task(p);
>>>> +#endif /* CONFIG_SMP */
>>>> +	} else
>>>> +		switched_to_dl(rq, p);
>>>>  }
>>>> -#endif
>>>>  
>>>>  const struct sched_class dl_sched_class = {
>>>>  	.next			= &rt_sched_class,
>>>> @@ -669,6 +1570,12 @@ const struct sched_class dl_sched_class = {
>>>>  
>>>>  #ifdef CONFIG_SMP
>>>>  	.select_task_rq		= select_task_rq_dl,
>>>> +	.set_cpus_allowed       = set_cpus_allowed_dl,
>>>> +	.rq_online              = rq_online_dl,
>>>> +	.rq_offline             = rq_offline_dl,
>>>> +	.pre_schedule		= pre_schedule_dl,
>>>> +	.post_schedule		= post_schedule_dl,
>>>> +	.task_woken		= task_woken_dl,
>>>>  #endif
>>>>  
>>>>  	.set_curr_task		= set_curr_task_dl,
>>>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>>>> index 01970c8..f7c4881 100644
>>>> --- a/kernel/sched/rt.c
>>>> +++ b/kernel/sched/rt.c
>>>> @@ -1720,7 +1720,7 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
>>>>  	    !test_tsk_need_resched(rq->curr) &&
>>>>  	    has_pushable_tasks(rq) &&
>>>>  	    p->nr_cpus_allowed > 1 &&
>>>> -	    rt_task(rq->curr) &&
>>>> +	    (dl_task(rq->curr) || rt_task(rq->curr)) &&
>>>>  	    (rq->curr->nr_cpus_allowed < 2 ||
>>>>  	     rq->curr->prio <= p->prio))
>>>>  		push_rt_tasks(rq);
>>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>>> index ba97476..70d0030 100644
>>>> --- a/kernel/sched/sched.h
>>>> +++ b/kernel/sched/sched.h
>>>> @@ -383,6 +383,31 @@ struct dl_rq {
>>>>  	struct rb_node *rb_leftmost;
>>>>  
>>>>  	unsigned long dl_nr_running;
>>>> +
>>>> +#ifdef CONFIG_SMP
>>>> +	/*
>>>> +	 * Deadline values of the currently executing and the
>>>> +	 * earliest ready task on this rq. Caching these facilitates
>>>> +	 * the decision wether or not a ready but not running task
>>>> +	 * should migrate somewhere else.
>>>> +	 */
>>>> +	struct {
>>>> +		u64 curr;
>>>> +		u64 next;
>>>> +	} earliest_dl;
>>>> +
>>>> +	unsigned long dl_nr_migratory;
>>>> +	unsigned long dl_nr_total;
>>>> +	int overloaded;
>>>> +
>>>> +	/*
>>>> +	 * Tasks on this rq that can be pushed away. They are kept in
>>>> +	 * an rb-tree, ordered by tasks' deadlines, with caching
>>>> +	 * of the leftmost (earliest deadline) element.
>>>> +	 */
>>>> +	struct rb_root pushable_dl_tasks_root;
>>>> +	struct rb_node *pushable_dl_tasks_leftmost;
>>>> +#endif
>>>>  };
>>>>  
>>>>  #ifdef CONFIG_SMP
>>>> @@ -403,6 +428,13 @@ struct root_domain {
>>>>  	cpumask_var_t online;
>>>>  
>>>>  	/*
>>>> +	 * The bit corresponding to a CPU gets set here if such CPU has more
>>>> +	 * than one runnable -deadline task (as it is below for RT tasks).
>>>> +	 */
>>>> +	cpumask_var_t dlo_mask;
>>>> +	atomic_t dlo_count;
>>>> +
>>>> +	/*
>>>>  	 * The "RT overload" flag: it gets set if a CPU has more than
>>>>  	 * one runnable RT task.
>>>>  	 */
>>>> @@ -1063,6 +1095,8 @@ static inline void idle_balance(int cpu, struct rq *rq)
>>>>  extern void sysrq_sched_debug_show(void);
>>>>  extern void sched_init_granularity(void);
>>>>  extern void update_max_interval(void);
>>>> +
>>>> +extern void init_sched_dl_class(void);
>>>>  extern void init_sched_rt_class(void);
>>>>  extern void init_sched_fair_class(void);
>>>>  
>>>
>>
>>
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 04/14] sched: SCHED_DEADLINE SMP-related data structures & logic.
  2013-11-21 16:16         ` Juri Lelli
@ 2013-11-21 16:26           ` Paul E. McKenney
  2013-11-21 16:47             ` Steven Rostedt
  0 siblings, 1 reply; 81+ messages in thread
From: Paul E. McKenney @ 2013-11-21 16:26 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Steven Rostedt, peterz, tglx, mingo, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield

On Thu, Nov 21, 2013 at 05:16:50PM +0100, Juri Lelli wrote:
> On 11/21/2013 05:08 PM, Paul E. McKenney wrote:
> > On Thu, Nov 21, 2013 at 03:13:28PM +0100, Juri Lelli wrote:
> >> On 11/20/2013 07:51 PM, Steven Rostedt wrote:
> >>> On Thu,  7 Nov 2013 14:43:38 +0100
> >>> Juri Lelli <juri.lelli@gmail.com> wrote:
> >>>
> >>>
> >>>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> >>>> index cb93f2e..18a73b4 100644
> >>>> --- a/kernel/sched/deadline.c
> >>>> +++ b/kernel/sched/deadline.c
> >>>> @@ -10,6 +10,7 @@
> >>>>   * miss some of their deadlines), and won't affect any other task.
> >>>>   *
> >>>>   * Copyright (C) 2012 Dario Faggioli <raistlin@linux.it>,
> >>>> + *                    Juri Lelli <juri.lelli@gmail.com>,
> >>>>   *                    Michael Trimarchi <michael@amarulasolutions.com>,
> >>>>   *                    Fabio Checconi <fchecconi@gmail.com>
> >>>>   */
> >>>> @@ -20,6 +21,15 @@ static inline int dl_time_before(u64 a, u64 b)
> >>>>  	return (s64)(a - b) < 0;
> >>>>  }
> >>>>  
> >>>> +/*
> >>>> + * Tells if entity @a should preempt entity @b.
> >>>> + */
> >>>> +static inline
> >>>> +int dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
> >>>> +{
> >>>> +	return dl_time_before(a->deadline, b->deadline);
> >>>> +}
> >>>> +
> >>>>  static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
> >>>>  {
> >>>>  	return container_of(dl_se, struct task_struct, dl);
> >>>> @@ -53,8 +63,168 @@ static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
> >>>>  void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
> >>>>  {
> >>>>  	dl_rq->rb_root = RB_ROOT;
> >>>> +
> >>>> +#ifdef CONFIG_SMP
> >>>> +	/* zero means no -deadline tasks */
> >>>
> >>> I'm curious to why you add the '-' to -deadline.
> >>>
> >>
> >> I guess "deadline tasks" is too much generic; "SCHED_DEADLINE tasks" too long;
> >> "DL" or "-dl" can be associated to "download". Nothing special in the end, just
> >> tought it was a reasonable abbreviation.
> > 
> > My guess was that "-deadline" was an abbreviation for tasks that had
> > missed their deadline, just so you know.  ;-)
> >
> 
> Argh! ;)
> 
> Actually, thinking about it twice: if we call "-rt tasks" things that are
> governed by code that resides in sched/rt.c, we could agree on "-deadline
> tasks" for things that pass through sched/deadline.c.

Actually, "SCHED_DEADLINE" isn't -that- much longer than "-deadline"
and it does have the virtue of being unambiguous.

							Thanx, Paul

> Thanks,
> 
> - Juri
> 
> >>>> +	dl_rq->earliest_dl.curr = dl_rq->earliest_dl.next = 0;
> >>>> +
> >>>> +	dl_rq->dl_nr_migratory = 0;
> >>>> +	dl_rq->overloaded = 0;
> >>>> +	dl_rq->pushable_dl_tasks_root = RB_ROOT;
> >>>> +#endif
> >>>> +}
> >>>> +
> >>>> +#ifdef CONFIG_SMP
> >>>> +
> >>>> +static inline int dl_overloaded(struct rq *rq)
> >>>> +{
> >>>> +	return atomic_read(&rq->rd->dlo_count);
> >>>> +}
> >>>> +
> >>>> +static inline void dl_set_overload(struct rq *rq)
> >>>> +{
> >>>> +	if (!rq->online)
> >>>> +		return;
> >>>> +
> >>>> +	cpumask_set_cpu(rq->cpu, rq->rd->dlo_mask);
> >>>> +	/*
> >>>> +	 * Must be visible before the overload count is
> >>>> +	 * set (as in sched_rt.c).
> >>>> +	 *
> >>>> +	 * Matched by the barrier in pull_dl_task().
> >>>> +	 */
> >>>> +	smp_wmb();
> >>>> +	atomic_inc(&rq->rd->dlo_count);
> >>>> +}
> >>>> +
> >>>> +static inline void dl_clear_overload(struct rq *rq)
> >>>> +{
> >>>> +	if (!rq->online)
> >>>> +		return;
> >>>> +
> >>>> +	atomic_dec(&rq->rd->dlo_count);
> >>>> +	cpumask_clear_cpu(rq->cpu, rq->rd->dlo_mask);
> >>>> +}
> >>>> +
> >>>> +static void update_dl_migration(struct dl_rq *dl_rq)
> >>>> +{
> >>>> +	if (dl_rq->dl_nr_migratory && dl_rq->dl_nr_total > 1) {
> >>>> +		if (!dl_rq->overloaded) {
> >>>> +			dl_set_overload(rq_of_dl_rq(dl_rq));
> >>>> +			dl_rq->overloaded = 1;
> >>>> +		}
> >>>> +	} else if (dl_rq->overloaded) {
> >>>> +		dl_clear_overload(rq_of_dl_rq(dl_rq));
> >>>> +		dl_rq->overloaded = 0;
> >>>> +	}
> >>>> +}
> >>>> +
> >>>> +static void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> >>>> +{
> >>>> +	struct task_struct *p = dl_task_of(dl_se);
> >>>> +	dl_rq = &rq_of_dl_rq(dl_rq)->dl;
> >>>> +
> >>>> +	dl_rq->dl_nr_total++;
> >>>> +	if (p->nr_cpus_allowed > 1)
> >>>> +		dl_rq->dl_nr_migratory++;
> >>>> +
> >>>> +	update_dl_migration(dl_rq);
> >>>> +}
> >>>> +
> >>>> +static void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> >>>> +{
> >>>> +	struct task_struct *p = dl_task_of(dl_se);
> >>>> +	dl_rq = &rq_of_dl_rq(dl_rq)->dl;
> >>>> +
> >>>> +	dl_rq->dl_nr_total--;
> >>>> +	if (p->nr_cpus_allowed > 1)
> >>>> +		dl_rq->dl_nr_migratory--;
> >>>> +
> >>>> +	update_dl_migration(dl_rq);
> >>>> +}
> >>>> +
> >>>> +/*
> >>>> + * The list of pushable -deadline task is not a plist, like in
> >>>> + * sched_rt.c, it is an rb-tree with tasks ordered by deadline.
> >>>> + */
> >>>> +static void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
> >>>> +{
> >>>> +	struct dl_rq *dl_rq = &rq->dl;
> >>>> +	struct rb_node **link = &dl_rq->pushable_dl_tasks_root.rb_node;
> >>>> +	struct rb_node *parent = NULL;
> >>>> +	struct task_struct *entry;
> >>>> +	int leftmost = 1;
> >>>> +
> >>>> +	BUG_ON(!RB_EMPTY_NODE(&p->pushable_dl_tasks));
> >>>> +
> >>>> +	while (*link) {
> >>>> +		parent = *link;
> >>>> +		entry = rb_entry(parent, struct task_struct,
> >>>> +				 pushable_dl_tasks);
> >>>> +		if (dl_entity_preempt(&p->dl, &entry->dl))
> >>>> +			link = &parent->rb_left;
> >>>> +		else {
> >>>> +			link = &parent->rb_right;
> >>>> +			leftmost = 0;
> >>>> +		}
> >>>> +	}
> >>>> +
> >>>> +	if (leftmost)
> >>>> +		dl_rq->pushable_dl_tasks_leftmost = &p->pushable_dl_tasks;
> >>>> +
> >>>> +	rb_link_node(&p->pushable_dl_tasks, parent, link);
> >>>> +	rb_insert_color(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root);
> >>>> +}
> >>>> +
> >>>> +static void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
> >>>> +{
> >>>> +	struct dl_rq *dl_rq = &rq->dl;
> >>>> +
> >>>> +	if (RB_EMPTY_NODE(&p->pushable_dl_tasks))
> >>>> +		return;
> >>>> +
> >>>> +	if (dl_rq->pushable_dl_tasks_leftmost == &p->pushable_dl_tasks) {
> >>>> +		struct rb_node *next_node;
> >>>> +
> >>>> +		next_node = rb_next(&p->pushable_dl_tasks);
> >>>> +		dl_rq->pushable_dl_tasks_leftmost = next_node;
> >>>> +	}
> >>>> +
> >>>> +	rb_erase(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root);
> >>>> +	RB_CLEAR_NODE(&p->pushable_dl_tasks);
> >>>> +}
> >>>> +
> >>>> +static inline int has_pushable_dl_tasks(struct rq *rq)
> >>>> +{
> >>>> +	return !RB_EMPTY_ROOT(&rq->dl.pushable_dl_tasks_root);
> >>>> +}
> >>>> +
> >>>> +static int push_dl_task(struct rq *rq);
> >>>> +
> >>>> +#else
> >>>> +
> >>>> +static inline
> >>>> +void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
> >>>> +{
> >>>> +}
> >>>> +
> >>>> +static inline
> >>>> +void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
> >>>> +{
> >>>> +}
> >>>> +
> >>>> +static inline
> >>>> +void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> >>>> +{
> >>>> +}
> >>>> +
> >>>> +static inline
> >>>> +void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> >>>> +{
> >>>>  }
> >>>>  
> >>>> +#endif /* CONFIG_SMP */
> >>>> +
> >>>>  static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
> >>>>  static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
> >>>>  static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
> >>>> @@ -307,6 +477,14 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
> >>>>  			check_preempt_curr_dl(rq, p, 0);
> >>>>  		else
> >>>>  			resched_task(rq->curr);
> >>>> +#ifdef CONFIG_SMP
> >>>> +		/*
> >>>> +		 * Queueing this task back might have overloaded rq,
> >>>> +		 * check if we need to kick someone away.
> >>>> +		 */
> >>>> +		if (has_pushable_dl_tasks(rq))
> >>>> +			push_dl_task(rq);
> >>>> +#endif
> >>>>  	}
> >>>>  unlock:
> >>>>  	raw_spin_unlock(&rq->lock);
> >>>> @@ -397,6 +575,100 @@ static void update_curr_dl(struct rq *rq)
> >>>>  	}
> >>>>  }
> >>>>  
> >>>> +#ifdef CONFIG_SMP
> >>>> +
> >>>> +static struct task_struct *pick_next_earliest_dl_task(struct rq *rq, int cpu);
> >>>> +
> >>>> +static inline u64 next_deadline(struct rq *rq)
> >>>> +{
> >>>> +	struct task_struct *next = pick_next_earliest_dl_task(rq, rq->cpu);
> >>>> +
> >>>> +	if (next && dl_prio(next->prio))
> >>>> +		return next->dl.deadline;
> >>>> +	else
> >>>> +		return 0;
> >>>> +}
> >>>> +
> >>>> +static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
> >>>> +{
> >>>> +	struct rq *rq = rq_of_dl_rq(dl_rq);
> >>>> +
> >>>> +	if (dl_rq->earliest_dl.curr == 0 ||
> >>>> +	    dl_time_before(deadline, dl_rq->earliest_dl.curr)) {
> >>>> +		/*
> >>>> +		 * If the dl_rq had no -deadline tasks, or if the new task
> >>>> +		 * has shorter deadline than the current one on dl_rq, we
> >>>> +		 * know that the previous earliest becomes our next earliest,
> >>>> +		 * as the new task becomes the earliest itself.
> >>>> +		 */
> >>>> +		dl_rq->earliest_dl.next = dl_rq->earliest_dl.curr;
> >>>> +		dl_rq->earliest_dl.curr = deadline;
> >>>> +	} else if (dl_rq->earliest_dl.next == 0 ||
> >>>> +		   dl_time_before(deadline, dl_rq->earliest_dl.next)) {
> >>>> +		/*
> >>>> +		 * On the other hand, if the new -deadline task has a
> >>>> +		 * a later deadline than the earliest one on dl_rq, but
> >>>> +		 * it is earlier than the next (if any), we must
> >>>> +		 * recompute the next-earliest.
> >>>> +		 */
> >>>> +		dl_rq->earliest_dl.next = next_deadline(rq);
> >>>> +	}
> >>>> +}
> >>>> +
> >>>> +static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
> >>>> +{
> >>>> +	struct rq *rq = rq_of_dl_rq(dl_rq);
> >>>> +
> >>>> +	/*
> >>>> +	 * Since we may have removed our earliest (and/or next earliest)
> >>>> +	 * task we must recompute them.
> >>>> +	 */
> >>>> +	if (!dl_rq->dl_nr_running) {
> >>>> +		dl_rq->earliest_dl.curr = 0;
> >>>> +		dl_rq->earliest_dl.next = 0;
> >>>> +	} else {
> >>>> +		struct rb_node *leftmost = dl_rq->rb_leftmost;
> >>>> +		struct sched_dl_entity *entry;
> >>>> +
> >>>> +		entry = rb_entry(leftmost, struct sched_dl_entity, rb_node);
> >>>> +		dl_rq->earliest_dl.curr = entry->deadline;
> >>>> +		dl_rq->earliest_dl.next = next_deadline(rq);
> >>>> +	}
> >>>> +}
> >>>> +
> >>>> +#else
> >>>> +
> >>>> +static inline void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
> >>>> +static inline void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
> >>>> +
> >>>> +#endif /* CONFIG_SMP */
> >>>> +
> >>>> +static inline
> >>>> +void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> >>>> +{
> >>>> +	int prio = dl_task_of(dl_se)->prio;
> >>>> +	u64 deadline = dl_se->deadline;
> >>>> +
> >>>> +	WARN_ON(!dl_prio(prio));
> >>>> +	dl_rq->dl_nr_running++;
> >>>> +
> >>>> +	inc_dl_deadline(dl_rq, deadline);
> >>>> +	inc_dl_migration(dl_se, dl_rq);
> >>>> +}
> >>>> +
> >>>> +static inline
> >>>> +void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> >>>> +{
> >>>> +	int prio = dl_task_of(dl_se)->prio;
> >>>> +
> >>>> +	WARN_ON(!dl_prio(prio));
> >>>> +	WARN_ON(!dl_rq->dl_nr_running);
> >>>> +	dl_rq->dl_nr_running--;
> >>>> +
> >>>> +	dec_dl_deadline(dl_rq, dl_se->deadline);
> >>>> +	dec_dl_migration(dl_se, dl_rq);
> >>>> +}
> >>>> +
> >>>>  static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
> >>>>  {
> >>>>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> >>>> @@ -424,7 +696,7 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
> >>>>  	rb_link_node(&dl_se->rb_node, parent, link);
> >>>>  	rb_insert_color(&dl_se->rb_node, &dl_rq->rb_root);
> >>>>  
> >>>> -	dl_rq->dl_nr_running++;
> >>>> +	inc_dl_tasks(dl_se, dl_rq);
> >>>>  }
> >>>>  
> >>>>  static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
> >>>> @@ -444,7 +716,7 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
> >>>>  	rb_erase(&dl_se->rb_node, &dl_rq->rb_root);
> >>>>  	RB_CLEAR_NODE(&dl_se->rb_node);
> >>>>  
> >>>> -	dl_rq->dl_nr_running--;
> >>>> +	dec_dl_tasks(dl_se, dl_rq);
> >>>>  }
> >>>>  
> >>>>  static void
> >>>> @@ -482,12 +754,17 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> >>>>  		return;
> >>>>  
> >>>>  	enqueue_dl_entity(&p->dl, flags);
> >>>> +
> >>>> +	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
> >>>> +		enqueue_pushable_dl_task(rq, p);
> >>>> +
> >>>>  	inc_nr_running(rq);
> >>>>  }
> >>>>  
> >>>>  static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> >>>>  {
> >>>>  	dequeue_dl_entity(&p->dl);
> >>>> +	dequeue_pushable_dl_task(rq, p);
> >>>>  }
> >>>>  
> >>>>  static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> >>>> @@ -525,6 +802,77 @@ static void yield_task_dl(struct rq *rq)
> >>>>  	update_curr_dl(rq);
> >>>>  }
> >>>>  
> >>>> +#ifdef CONFIG_SMP
> >>>> +
> >>>> +static int find_later_rq(struct task_struct *task);
> >>>> +static int latest_cpu_find(struct cpumask *span,
> >>>> +			   struct task_struct *task,
> >>>> +			   struct cpumask *later_mask);
> >>>> +
> >>>> +static int
> >>>> +select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
> >>>> +{
> >>>> +	struct task_struct *curr;
> >>>> +	struct rq *rq;
> >>>> +	int cpu;
> >>>> +
> >>>> +	cpu = task_cpu(p);
> >>>> +
> >>>> +	if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
> >>>> +		goto out;
> >>>> +
> >>>> +	rq = cpu_rq(cpu);
> >>>> +
> >>>> +	rcu_read_lock();
> >>>> +	curr = ACCESS_ONCE(rq->curr); /* unlocked access */
> >>>> +
> >>>> +	/*
> >>>> +	 * If we are dealing with a -deadline task, we must
> >>>> +	 * decide where to wake it up.
> >>>> +	 * If it has a later deadline and the current task
> >>>> +	 * on this rq can't move (provided the waking task
> >>>> +	 * can!) we prefer to send it somewhere else. On the
> >>>> +	 * other hand, if it has a shorter deadline, we
> >>>> +	 * try to make it stay here, it might be important.
> >>>> +	 */
> >>>> +	if (unlikely(dl_task(curr)) &&
> >>>> +	    (curr->nr_cpus_allowed < 2 ||
> >>>> +	     !dl_entity_preempt(&p->dl, &curr->dl)) &&
> >>>> +	    (p->nr_cpus_allowed > 1)) {
> >>>> +		int target = find_later_rq(p);
> >>>> +
> >>>> +		if (target != -1)
> >>>> +			cpu = target;
> >>>> +	}
> >>>> +	rcu_read_unlock();
> >>>> +
> >>>> +out:
> >>>> +	return cpu;
> >>>> +}
> >>>> +
> >>>> +static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
> >>>> +{
> >>>> +	/*
> >>>> +	 * Current can't be migrated, useless to reschedule,
> >>>> +	 * let's hope p can move out.
> >>>> +	 */
> >>>> +	if (rq->curr->nr_cpus_allowed == 1 ||
> >>>> +	    latest_cpu_find(rq->rd->span, rq->curr, NULL) == -1)
> >>>> +		return;
> >>>> +
> >>>> +	/*
> >>>> +	 * p is migratable, so let's not schedule it and
> >>>> +	 * see if it is pushed or pulled somewhere else.
> >>>> +	 */
> >>>> +	if (p->nr_cpus_allowed != 1 &&
> >>>> +	    latest_cpu_find(rq->rd->span, p, NULL) != -1)
> >>>> +		return;
> >>>> +
> >>>> +	resched_task(rq->curr);
> >>>> +}
> >>>> +
> >>>> +#endif /* CONFIG_SMP */
> >>>> +
> >>>>  /*
> >>>>   * Only called when both the current and waking task are -deadline
> >>>>   * tasks.
> >>>> @@ -532,8 +880,20 @@ static void yield_task_dl(struct rq *rq)
> >>>>  static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
> >>>>  				  int flags)
> >>>>  {
> >>>> -	if (dl_time_before(p->dl.deadline, rq->curr->dl.deadline))
> >>>> +	if (dl_entity_preempt(&p->dl, &rq->curr->dl)) {
> >>>>  		resched_task(rq->curr);
> >>>> +		return;
> >>>> +	}
> >>>> +
> >>>> +#ifdef CONFIG_SMP
> >>>> +	/*
> >>>> +	 * In the unlikely case current and p have the same deadline
> >>>> +	 * let us try to decide what's the best thing to do...
> >>>> +	 */
> >>>> +	if ((s64)(p->dl.deadline - rq->curr->dl.deadline) == 0 &&
> >>>> +	    !need_resched())
> >>>> +		check_preempt_equal_dl(rq, p);
> >>>> +#endif /* CONFIG_SMP */
> >>>>  }
> >>>>  
> >>>>  #ifdef CONFIG_SCHED_HRTICK
> >>>> @@ -573,16 +933,29 @@ struct task_struct *pick_next_task_dl(struct rq *rq)
> >>>>  
> >>>>  	p = dl_task_of(dl_se);
> >>>>  	p->se.exec_start = rq_clock_task(rq);
> >>>> +
> >>>> +	/* Running task will never be pushed. */
> >>>> +	if (p)
> >>>> +		dequeue_pushable_dl_task(rq, p);
> >>>> +
> >>>>  #ifdef CONFIG_SCHED_HRTICK
> >>>>  	if (hrtick_enabled(rq))
> >>>>  		start_hrtick_dl(rq, p);
> >>>>  #endif
> >>>> +
> >>>> +#ifdef CONFIG_SMP
> >>>> +	rq->post_schedule = has_pushable_dl_tasks(rq);
> >>>> +#endif /* CONFIG_SMP */
> >>>> +
> >>>>  	return p;
> >>>>  }
> >>>>  
> >>>>  static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
> >>>>  {
> >>>>  	update_curr_dl(rq);
> >>>> +
> >>>> +	if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
> >>>> +		enqueue_pushable_dl_task(rq, p);
> >>>>  }
> >>>>  
> >>>>  static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
> >>>> @@ -616,16 +989,517 @@ static void set_curr_task_dl(struct rq *rq)
> >>>>  	struct task_struct *p = rq->curr;
> >>>>  
> >>>>  	p->se.exec_start = rq_clock_task(rq);
> >>>> +
> >>>> +	/* You can't push away the running task */
> >>>> +	dequeue_pushable_dl_task(rq, p);
> >>>> +}
> >>>> +
> >>>> +#ifdef CONFIG_SMP
> >>>> +
> >>>> +/* Only try algorithms three times */
> >>>> +#define DL_MAX_TRIES 3
> >>>> +
> >>>> +static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu)
> >>>> +{
> >>>> +	if (!task_running(rq, p) &&
> >>>> +	    (cpu < 0 || cpumask_test_cpu(cpu, &p->cpus_allowed)) &&
> >>>> +	    (p->nr_cpus_allowed > 1))
> >>>> +		return 1;
> >>>> +
> >>>> +	return 0;
> >>>> +}
> >>>> +
> >>>> +/* Returns the second earliest -deadline task, NULL otherwise */
> >>>> +static struct task_struct *pick_next_earliest_dl_task(struct rq *rq, int cpu)
> >>>> +{
> >>>> +	struct rb_node *next_node = rq->dl.rb_leftmost;
> >>>> +	struct sched_dl_entity *dl_se;
> >>>> +	struct task_struct *p = NULL;
> >>>> +
> >>>> +next_node:
> >>>> +	next_node = rb_next(next_node);
> >>>> +	if (next_node) {
> >>>> +		dl_se = rb_entry(next_node, struct sched_dl_entity, rb_node);
> >>>> +		p = dl_task_of(dl_se);
> >>>> +
> >>>> +		if (pick_dl_task(rq, p, cpu))
> >>>> +			return p;
> >>>> +
> >>>> +		goto next_node;
> >>>> +	}
> >>>> +
> >>>> +	return NULL;
> >>>> +}
> >>>> +
> >>>> +static int latest_cpu_find(struct cpumask *span,
> >>>> +			   struct task_struct *task,
> >>>> +			   struct cpumask *later_mask)
> >>>> +{
> >>>> +	const struct sched_dl_entity *dl_se = &task->dl;
> >>>> +	int cpu, found = -1, best = 0;
> >>>> +	u64 max_dl = 0;
> >>>> +
> >>>> +	for_each_cpu(cpu, span) {
> >>>> +		struct rq *rq = cpu_rq(cpu);
> >>>> +		struct dl_rq *dl_rq = &rq->dl;
> >>>> +
> >>>> +		if (cpumask_test_cpu(cpu, &task->cpus_allowed) &&
> >>>> +		    (!dl_rq->dl_nr_running || dl_time_before(dl_se->deadline,
> >>>> +		     dl_rq->earliest_dl.curr))) {
> >>>> +			if (later_mask)
> >>>> +				cpumask_set_cpu(cpu, later_mask);
> >>>> +			if (!best && !dl_rq->dl_nr_running) {
> >>>> +				best = 1;
> >>>> +				found = cpu;
> >>>> +			} else if (!best &&
> >>>> +				   dl_time_before(max_dl,
> >>>> +						  dl_rq->earliest_dl.curr)) {
> >>>
> >>> Ug, the above is hard to read. What about:
> >>>
> >>> 	if (!best) {
> >>> 		if (!dl_rq->dl_nr_running) {
> >>> 			best = 1;
> >>> 			found = cpu;
> >>> 		} elsif (dl_time_before(...)) {
> >>> 			...
> >>> 		}
> >>> 	}
> >>>
> >>
> >> This is completely removed in 13/14. I don't like it either, but since we end
> >> up removing this mess, do you think we still have to fix this here?
> >>
> >>> Also, I would think dl should be nice to rt as well. There may be a
> >>> idle CPU or a non rt task, and this could pick a CPU running an RT
> >>> task. Worse yet, that RT task may be pinned to that CPU.
> >>>
> >>
> >> Well, in 13/14 we introduce a free_cpus mask. A CPU is considered free if it
> >> doesn't have any -deadline task running. We can modify that excluding also CPUs
> >> running RT tasks, but I have to think a bit if we can do this also from -rt code.
> >>
> >>> We should be able to incorporate cpuprio_find() to be dl aware too.
> >>> That is, work for both -rt and -dl.
> >>>
> >>
> >> Like checking if a -dl task is running on the cpu chosen for pushing an -rt
> >> task, and continue searching in that case.
> >>
> >>>
> >>>
> >>>> +				max_dl = dl_rq->earliest_dl.curr;
> >>>> +				found = cpu;
> >>>> +			}
> >>>> +		} else if (later_mask)
> >>>> +			cpumask_clear_cpu(cpu, later_mask);
> >>>> +	}
> >>>> +
> >>>> +	return found;
> >>>> +}
> >>>> +
> >>>> +static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
> >>>> +
> >>>> +static int find_later_rq(struct task_struct *task)
> >>>> +{
> >>>> +	struct sched_domain *sd;
> >>>> +	struct cpumask *later_mask = __get_cpu_var(local_cpu_mask_dl);
> >>>> +	int this_cpu = smp_processor_id();
> >>>> +	int best_cpu, cpu = task_cpu(task);
> >>>> +
> >>>> +	/* Make sure the mask is initialized first */
> >>>> +	if (unlikely(!later_mask))
> >>>> +		return -1;
> >>>> +
> >>>> +	if (task->nr_cpus_allowed == 1)
> >>>> +		return -1;
> >>>> +
> >>>> +	best_cpu = latest_cpu_find(task_rq(task)->rd->span, task, later_mask);
> >>>> +	if (best_cpu == -1)
> >>>> +		return -1;
> >>>> +
> >>>> +	/*
> >>>> +	 * If we are here, some target has been found,
> >>>> +	 * the most suitable of which is cached in best_cpu.
> >>>> +	 * This is, among the runqueues where the current tasks
> >>>> +	 * have later deadlines than the task's one, the rq
> >>>> +	 * with the latest possible one.
> >>>> +	 *
> >>>> +	 * Now we check how well this matches with task's
> >>>> +	 * affinity and system topology.
> >>>> +	 *
> >>>> +	 * The last cpu where the task run is our first
> >>>> +	 * guess, since it is most likely cache-hot there.
> >>>> +	 */
> >>>> +	if (cpumask_test_cpu(cpu, later_mask))
> >>>> +		return cpu;
> >>>> +	/*
> >>>> +	 * Check if this_cpu is to be skipped (i.e., it is
> >>>> +	 * not in the mask) or not.
> >>>> +	 */
> >>>> +	if (!cpumask_test_cpu(this_cpu, later_mask))
> >>>> +		this_cpu = -1;
> >>>> +
> >>>> +	rcu_read_lock();
> >>>> +	for_each_domain(cpu, sd) {
> >>>> +		if (sd->flags & SD_WAKE_AFFINE) {
> >>>> +
> >>>> +			/*
> >>>> +			 * If possible, preempting this_cpu is
> >>>> +			 * cheaper than migrating.
> >>>> +			 */
> >>>> +			if (this_cpu != -1 &&
> >>>> +			    cpumask_test_cpu(this_cpu, sched_domain_span(sd))) {
> >>>> +				rcu_read_unlock();
> >>>> +				return this_cpu;
> >>>> +			}
> >>>> +
> >>>> +			/*
> >>>> +			 * Last chance: if best_cpu is valid and is
> >>>> +			 * in the mask, that becomes our choice.
> >>>> +			 */
> >>>> +			if (best_cpu < nr_cpu_ids &&
> >>>> +			    cpumask_test_cpu(best_cpu, sched_domain_span(sd))) {
> >>>> +				rcu_read_unlock();
> >>>> +				return best_cpu;
> >>>> +			}
> >>>> +		}
> >>>> +	}
> >>>> +	rcu_read_unlock();
> >>>> +
> >>>> +	/*
> >>>> +	 * At this point, all our guesses failed, we just return
> >>>> +	 * 'something', and let the caller sort the things out.
> >>>> +	 */
> >>>> +	if (this_cpu != -1)
> >>>> +		return this_cpu;
> >>>> +
> >>>> +	cpu = cpumask_any(later_mask);
> >>>> +	if (cpu < nr_cpu_ids)
> >>>> +		return cpu;
> >>>> +
> >>>> +	return -1;
> >>>> +}
> >>>> +
> >>>> +/* Locks the rq it finds */
> >>>> +static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
> >>>> +{
> >>>> +	struct rq *later_rq = NULL;
> >>>> +	int tries;
> >>>> +	int cpu;
> >>>> +
> >>>> +	for (tries = 0; tries < DL_MAX_TRIES; tries++) {
> >>>> +		cpu = find_later_rq(task);
> >>>> +
> >>>> +		if ((cpu == -1) || (cpu == rq->cpu))
> >>>> +			break;
> >>>> +
> >>>> +		later_rq = cpu_rq(cpu);
> >>>> +
> >>>> +		/* Retry if something changed. */
> >>>> +		if (double_lock_balance(rq, later_rq)) {
> >>>> +			if (unlikely(task_rq(task) != rq ||
> >>>> +				     !cpumask_test_cpu(later_rq->cpu,
> >>>> +				                       &task->cpus_allowed) ||
> >>>> +				     task_running(rq, task) || !task->on_rq)) {
> >>>> +				double_unlock_balance(rq, later_rq);
> >>>> +				later_rq = NULL;
> >>>> +				break;
> >>>> +			}
> >>>> +		}
> >>>> +
> >>>> +		/*
> >>>> +		 * If the rq we found has no -deadline task, or
> >>>> +		 * its earliest one has a later deadline than our
> >>>> +		 * task, the rq is a good one.
> >>>> +		 */
> >>>> +		if (!later_rq->dl.dl_nr_running ||
> >>>> +		    dl_time_before(task->dl.deadline,
> >>>> +				   later_rq->dl.earliest_dl.curr))
> >>>> +			break;
> >>>> +
> >>>> +		/* Otherwise we try again. */
> >>>> +		double_unlock_balance(rq, later_rq);
> >>>> +		later_rq = NULL;
> >>>> +	}
> >>>> +
> >>>> +	return later_rq;
> >>>>  }
> >>>>  
> >>>> +static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
> >>>> +{
> >>>> +	struct task_struct *p;
> >>>> +
> >>>> +	if (!has_pushable_dl_tasks(rq))
> >>>> +		return NULL;
> >>>> +
> >>>> +	p = rb_entry(rq->dl.pushable_dl_tasks_leftmost,
> >>>> +		     struct task_struct, pushable_dl_tasks);
> >>>> +
> >>>> +	BUG_ON(rq->cpu != task_cpu(p));
> >>>> +	BUG_ON(task_current(rq, p));
> >>>> +	BUG_ON(p->nr_cpus_allowed <= 1);
> >>>> +
> >>>> +	BUG_ON(!p->se.on_rq);
> >>>> +	BUG_ON(!dl_task(p));
> >>>> +
> >>>> +	return p;
> >>>> +}
> >>>> +
> >>>> +/*
> >>>> + * See if the non running -deadline tasks on this rq
> >>>> + * can be sent to some other CPU where they can preempt
> >>>> + * and start executing.
> >>>> + */
> >>>> +static int push_dl_task(struct rq *rq)
> >>>> +{
> >>>> +	struct task_struct *next_task;
> >>>> +	struct rq *later_rq;
> >>>> +
> >>>> +	if (!rq->dl.overloaded)
> >>>> +		return 0;
> >>>> +
> >>>> +	next_task = pick_next_pushable_dl_task(rq);
> >>>> +	if (!next_task)
> >>>> +		return 0;
> >>>> +
> >>>> +retry:
> >>>> +	if (unlikely(next_task == rq->curr)) {
> >>>> +		WARN_ON(1);
> >>>> +		return 0;
> >>>> +	}
> >>>> +
> >>>> +	/*
> >>>> +	 * If next_task preempts rq->curr, and rq->curr
> >>>> +	 * can move away, it makes sense to just reschedule
> >>>> +	 * without going further in pushing next_task.
> >>>> +	 */
> >>>> +	if (dl_task(rq->curr) &&
> >>>> +	    dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) &&
> >>>> +	    rq->curr->nr_cpus_allowed > 1) {
> >>>> +		resched_task(rq->curr);
> >>>> +		return 0;
> >>>> +	}
> >>>> +
> >>>> +	/* We might release rq lock */
> >>>> +	get_task_struct(next_task);
> >>>> +
> >>>> +	/* Will lock the rq it'll find */
> >>>> +	later_rq = find_lock_later_rq(next_task, rq);
> >>>> +	if (!later_rq) {
> >>>> +		struct task_struct *task;
> >>>> +
> >>>> +		/*
> >>>> +		 * We must check all this again, since
> >>>> +		 * find_lock_later_rq releases rq->lock and it is
> >>>> +		 * then possible that next_task has migrated.
> >>>> +		 */
> >>>> +		task = pick_next_pushable_dl_task(rq);
> >>>> +		if (task_cpu(next_task) == rq->cpu && task == next_task) {
> >>>> +			/*
> >>>> +			 * The task is still there. We don't try
> >>>> +			 * again, some other cpu will pull it when ready.
> >>>> +			 */
> >>>> +			dequeue_pushable_dl_task(rq, next_task);
> >>>> +			goto out;
> >>>> +		}
> >>>> +
> >>>> +		if (!task)
> >>>> +			/* No more tasks */
> >>>> +			goto out;
> >>>> +
> >>>> +		put_task_struct(next_task);
> >>>> +		next_task = task;
> >>>> +		goto retry;
> >>>> +	}
> >>>> +
> >>>> +	deactivate_task(rq, next_task, 0);
> >>>> +	set_task_cpu(next_task, later_rq->cpu);
> >>>> +	activate_task(later_rq, next_task, 0);
> >>>> +
> >>>> +	resched_task(later_rq->curr);
> >>>> +
> >>>> +	double_unlock_balance(rq, later_rq);
> >>>> +
> >>>> +out:
> >>>> +	put_task_struct(next_task);
> >>>> +
> >>>> +	return 1;
> >>>> +}
> >>>> +
> >>>> +static void push_dl_tasks(struct rq *rq)
> >>>> +{
> >>>> +	/* Terminates as it moves a -deadline task */
> >>>> +	while (push_dl_task(rq))
> >>>> +		;
> >>>> +}
> >>>> +
> >>>> +static int pull_dl_task(struct rq *this_rq)
> >>>> +{
> >>>> +	int this_cpu = this_rq->cpu, ret = 0, cpu;
> >>>> +	struct task_struct *p;
> >>>> +	struct rq *src_rq;
> >>>> +	u64 dmin = LONG_MAX;
> >>>> +
> >>>> +	if (likely(!dl_overloaded(this_rq)))
> >>>> +		return 0;
> >>>> +
> >>>> +	/*
> >>>> +	 * Match the barrier from dl_set_overloaded; this guarantees that if we
> >>>> +	 * see overloaded we must also see the dlo_mask bit.
> >>>> +	 */
> >>>> +	smp_rmb();
> >>>> +
> >>>> +	for_each_cpu(cpu, this_rq->rd->dlo_mask) {
> >>>> +		if (this_cpu == cpu)
> >>>> +			continue;
> >>>> +
> >>>> +		src_rq = cpu_rq(cpu);
> >>>> +
> >>>> +		/*
> >>>> +		 * It looks racy, abd it is! However, as in sched_rt.c,
> >>>
> >>> abd it is?
> >>>
> >>
> >> Oops!
> >>
> >> Thanks,
> >>
> >> - Juri
> >>
> >>>
> >>>> +		 * we are fine with this.
> >>>> +		 */
> >>>> +		if (this_rq->dl.dl_nr_running &&
> >>>> +		    dl_time_before(this_rq->dl.earliest_dl.curr,
> >>>> +				   src_rq->dl.earliest_dl.next))
> >>>> +			continue;
> >>>> +
> >>>> +		/* Might drop this_rq->lock */
> >>>> +		double_lock_balance(this_rq, src_rq);
> >>>> +
> >>>> +		/*
> >>>> +		 * If there are no more pullable tasks on the
> >>>> +		 * rq, we're done with it.
> >>>> +		 */
> >>>> +		if (src_rq->dl.dl_nr_running <= 1)
> >>>> +			goto skip;
> >>>> +
> >>>> +		p = pick_next_earliest_dl_task(src_rq, this_cpu);
> >>>> +
> >>>> +		/*
> >>>> +		 * We found a task to be pulled if:
> >>>> +		 *  - it preempts our current (if there's one),
> >>>> +		 *  - it will preempt the last one we pulled (if any).
> >>>> +		 */
> >>>> +		if (p && dl_time_before(p->dl.deadline, dmin) &&
> >>>> +		    (!this_rq->dl.dl_nr_running ||
> >>>> +		     dl_time_before(p->dl.deadline,
> >>>> +				    this_rq->dl.earliest_dl.curr))) {
> >>>> +			WARN_ON(p == src_rq->curr);
> >>>> +			WARN_ON(!p->se.on_rq);
> >>>> +
> >>>> +			/*
> >>>> +			 * Then we pull iff p has actually an earlier
> >>>> +			 * deadline than the current task of its runqueue.
> >>>> +			 */
> >>>> +			if (dl_time_before(p->dl.deadline,
> >>>> +					   src_rq->curr->dl.deadline))
> >>>> +				goto skip;
> >>>> +
> >>>> +			ret = 1;
> >>>> +
> >>>> +			deactivate_task(src_rq, p, 0);
> >>>> +			set_task_cpu(p, this_cpu);
> >>>> +			activate_task(this_rq, p, 0);
> >>>> +			dmin = p->dl.deadline;
> >>>> +
> >>>> +			/* Is there any other task even earlier? */
> >>>> +		}
> >>>> +skip:
> >>>> +		double_unlock_balance(this_rq, src_rq);
> >>>> +	}
> >>>> +
> >>>> +	return ret;
> >>>> +}
> >>>> +
> >>>> +static void pre_schedule_dl(struct rq *rq, struct task_struct *prev)
> >>>> +{
> >>>> +	/* Try to pull other tasks here */
> >>>> +	if (dl_task(prev))
> >>>> +		pull_dl_task(rq);
> >>>> +}
> >>>> +
> >>>> +static void post_schedule_dl(struct rq *rq)
> >>>> +{
> >>>> +	push_dl_tasks(rq);
> >>>> +}
> >>>> +
> >>>> +/*
> >>>> + * Since the task is not running and a reschedule is not going to happen
> >>>> + * anytime soon on its runqueue, we try pushing it away now.
> >>>> + */
> >>>> +static void task_woken_dl(struct rq *rq, struct task_struct *p)
> >>>> +{
> >>>> +	if (!task_running(rq, p) &&
> >>>> +	    !test_tsk_need_resched(rq->curr) &&
> >>>> +	    has_pushable_dl_tasks(rq) &&
> >>>> +	    p->nr_cpus_allowed > 1 &&
> >>>> +	    dl_task(rq->curr) &&
> >>>> +	    (rq->curr->nr_cpus_allowed < 2 ||
> >>>> +	     dl_entity_preempt(&rq->curr->dl, &p->dl))) {
> >>>> +		push_dl_tasks(rq);
> >>>> +	}
> >>>> +}
> >>>> +
> >>>> +static void set_cpus_allowed_dl(struct task_struct *p,
> >>>> +				const struct cpumask *new_mask)
> >>>> +{
> >>>> +	struct rq *rq;
> >>>> +	int weight;
> >>>> +
> >>>> +	BUG_ON(!dl_task(p));
> >>>> +
> >>>> +	/*
> >>>> +	 * Update only if the task is actually running (i.e.,
> >>>> +	 * it is on the rq AND it is not throttled).
> >>>> +	 */
> >>>> +	if (!on_dl_rq(&p->dl))
> >>>> +		return;
> >>>> +
> >>>> +	weight = cpumask_weight(new_mask);
> >>>> +
> >>>> +	/*
> >>>> +	 * Only update if the process changes its state from whether it
> >>>> +	 * can migrate or not.
> >>>> +	 */
> >>>> +	if ((p->nr_cpus_allowed > 1) == (weight > 1))
> >>>> +		return;
> >>>> +
> >>>> +	rq = task_rq(p);
> >>>> +
> >>>> +	/*
> >>>> +	 * The process used to be able to migrate OR it can now migrate
> >>>> +	 */
> >>>> +	if (weight <= 1) {
> >>>> +		if (!task_current(rq, p))
> >>>> +			dequeue_pushable_dl_task(rq, p);
> >>>> +		BUG_ON(!rq->dl.dl_nr_migratory);
> >>>> +		rq->dl.dl_nr_migratory--;
> >>>> +	} else {
> >>>> +		if (!task_current(rq, p))
> >>>> +			enqueue_pushable_dl_task(rq, p);
> >>>> +		rq->dl.dl_nr_migratory++;
> >>>> +	}
> >>>> +	
> >>>> +	update_dl_migration(&rq->dl);
> >>>> +}
> >>>> +
> >>>> +/* Assumes rq->lock is held */
> >>>> +static void rq_online_dl(struct rq *rq)
> >>>> +{
> >>>> +	if (rq->dl.overloaded)
> >>>> +		dl_set_overload(rq);
> >>>> +}
> >>>> +
> >>>> +/* Assumes rq->lock is held */
> >>>> +static void rq_offline_dl(struct rq *rq)
> >>>> +{
> >>>> +	if (rq->dl.overloaded)
> >>>> +		dl_clear_overload(rq);
> >>>> +}
> >>>> +
> >>>> +void init_sched_dl_class(void)
> >>>> +{
> >>>> +	unsigned int i;
> >>>> +
> >>>> +	for_each_possible_cpu(i)
> >>>> +		zalloc_cpumask_var_node(&per_cpu(local_cpu_mask_dl, i),
> >>>> +					GFP_KERNEL, cpu_to_node(i));
> >>>> +}
> >>>> +
> >>>> +#endif /* CONFIG_SMP */
> >>>> +
> >>>>  static void switched_from_dl(struct rq *rq, struct task_struct *p)
> >>>>  {
> >>>> -	if (hrtimer_active(&p->dl.dl_timer))
> >>>> +	if (hrtimer_active(&p->dl.dl_timer) && !dl_policy(p->policy))
> >>>>  		hrtimer_try_to_cancel(&p->dl.dl_timer);
> >>>> +
> >>>> +#ifdef CONFIG_SMP
> >>>> +	/*
> >>>> +	 * Since this might be the only -deadline task on the rq,
> >>>> +	 * this is the right place to try to pull some other one
> >>>> +	 * from an overloaded cpu, if any.
> >>>> +	 */
> >>>> +	if (!rq->dl.dl_nr_running)
> >>>> +		pull_dl_task(rq);
> >>>> +#endif
> >>>>  }
> >>>>  
> >>>> +/*
> >>>> + * When switching to -deadline, we may overload the rq, then
> >>>> + * we try to push someone off, if possible.
> >>>> + */
> >>>>  static void switched_to_dl(struct rq *rq, struct task_struct *p)
> >>>>  {
> >>>> +	int check_resched = 1;
> >>>> +
> >>>>  	/*
> >>>>  	 * If p is throttled, don't consider the possibility
> >>>>  	 * of preempting rq->curr, the check will be done right
> >>>> @@ -635,26 +1509,53 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
> >>>>  		return;
> >>>>  
> >>>>  	if (!p->on_rq || rq->curr != p) {
> >>>> -		if (task_has_dl_policy(rq->curr))
> >>>> +#ifdef CONFIG_SMP
> >>>> +		if (rq->dl.overloaded && push_dl_task(rq) && rq != task_rq(p))
> >>>> +			/* Only reschedule if pushing failed */
> >>>> +			check_resched = 0;
> >>>> +#endif /* CONFIG_SMP */
> >>>> +		if (check_resched && task_has_dl_policy(rq->curr))
> >>>>  			check_preempt_curr_dl(rq, p, 0);
> >>>> -		else
> >>>> -			resched_task(rq->curr);
> >>>>  	}
> >>>>  }
> >>>>  
> >>>> +/*
> >>>> + * If the scheduling parameters of a -deadline task changed,
> >>>> + * a push or pull operation might be needed.
> >>>> + */
> >>>>  static void prio_changed_dl(struct rq *rq, struct task_struct *p,
> >>>>  			    int oldprio)
> >>>>  {
> >>>> -	switched_to_dl(rq, p);
> >>>> -}
> >>>> -
> >>>> +	if (p->on_rq || rq->curr == p) {
> >>>>  #ifdef CONFIG_SMP
> >>>> -static int
> >>>> -select_task_rq_dl(struct task_struct *p, int sd_flag, int flags)
> >>>> -{
> >>>> -	return task_cpu(p);
> >>>> +		/*
> >>>> +		 * This might be too much, but unfortunately
> >>>> +		 * we don't have the old deadline value, and
> >>>> +		 * we can't argue if the task is increasing
> >>>> +		 * or lowering its prio, so...
> >>>> +		 */
> >>>> +		if (!rq->dl.overloaded)
> >>>> +			pull_dl_task(rq);
> >>>> +
> >>>> +		/*
> >>>> +		 * If we now have a earlier deadline task than p,
> >>>> +		 * then reschedule, provided p is still on this
> >>>> +		 * runqueue.
> >>>> +		 */
> >>>> +		if (dl_time_before(rq->dl.earliest_dl.curr, p->dl.deadline) &&
> >>>> +		    rq->curr == p)
> >>>> +			resched_task(p);
> >>>> +#else
> >>>> +		/*
> >>>> +		 * Again, we don't know if p has a earlier
> >>>> +		 * or later deadline, so let's blindly set a
> >>>> +		 * (maybe not needed) rescheduling point.
> >>>> +		 */
> >>>> +		resched_task(p);
> >>>> +#endif /* CONFIG_SMP */
> >>>> +	} else
> >>>> +		switched_to_dl(rq, p);
> >>>>  }
> >>>> -#endif
> >>>>  
> >>>>  const struct sched_class dl_sched_class = {
> >>>>  	.next			= &rt_sched_class,
> >>>> @@ -669,6 +1570,12 @@ const struct sched_class dl_sched_class = {
> >>>>  
> >>>>  #ifdef CONFIG_SMP
> >>>>  	.select_task_rq		= select_task_rq_dl,
> >>>> +	.set_cpus_allowed       = set_cpus_allowed_dl,
> >>>> +	.rq_online              = rq_online_dl,
> >>>> +	.rq_offline             = rq_offline_dl,
> >>>> +	.pre_schedule		= pre_schedule_dl,
> >>>> +	.post_schedule		= post_schedule_dl,
> >>>> +	.task_woken		= task_woken_dl,
> >>>>  #endif
> >>>>  
> >>>>  	.set_curr_task		= set_curr_task_dl,
> >>>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> >>>> index 01970c8..f7c4881 100644
> >>>> --- a/kernel/sched/rt.c
> >>>> +++ b/kernel/sched/rt.c
> >>>> @@ -1720,7 +1720,7 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
> >>>>  	    !test_tsk_need_resched(rq->curr) &&
> >>>>  	    has_pushable_tasks(rq) &&
> >>>>  	    p->nr_cpus_allowed > 1 &&
> >>>> -	    rt_task(rq->curr) &&
> >>>> +	    (dl_task(rq->curr) || rt_task(rq->curr)) &&
> >>>>  	    (rq->curr->nr_cpus_allowed < 2 ||
> >>>>  	     rq->curr->prio <= p->prio))
> >>>>  		push_rt_tasks(rq);
> >>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >>>> index ba97476..70d0030 100644
> >>>> --- a/kernel/sched/sched.h
> >>>> +++ b/kernel/sched/sched.h
> >>>> @@ -383,6 +383,31 @@ struct dl_rq {
> >>>>  	struct rb_node *rb_leftmost;
> >>>>  
> >>>>  	unsigned long dl_nr_running;
> >>>> +
> >>>> +#ifdef CONFIG_SMP
> >>>> +	/*
> >>>> +	 * Deadline values of the currently executing and the
> >>>> +	 * earliest ready task on this rq. Caching these facilitates
> >>>> +	 * the decision wether or not a ready but not running task
> >>>> +	 * should migrate somewhere else.
> >>>> +	 */
> >>>> +	struct {
> >>>> +		u64 curr;
> >>>> +		u64 next;
> >>>> +	} earliest_dl;
> >>>> +
> >>>> +	unsigned long dl_nr_migratory;
> >>>> +	unsigned long dl_nr_total;
> >>>> +	int overloaded;
> >>>> +
> >>>> +	/*
> >>>> +	 * Tasks on this rq that can be pushed away. They are kept in
> >>>> +	 * an rb-tree, ordered by tasks' deadlines, with caching
> >>>> +	 * of the leftmost (earliest deadline) element.
> >>>> +	 */
> >>>> +	struct rb_root pushable_dl_tasks_root;
> >>>> +	struct rb_node *pushable_dl_tasks_leftmost;
> >>>> +#endif
> >>>>  };
> >>>>  
> >>>>  #ifdef CONFIG_SMP
> >>>> @@ -403,6 +428,13 @@ struct root_domain {
> >>>>  	cpumask_var_t online;
> >>>>  
> >>>>  	/*
> >>>> +	 * The bit corresponding to a CPU gets set here if such CPU has more
> >>>> +	 * than one runnable -deadline task (as it is below for RT tasks).
> >>>> +	 */
> >>>> +	cpumask_var_t dlo_mask;
> >>>> +	atomic_t dlo_count;
> >>>> +
> >>>> +	/*
> >>>>  	 * The "RT overload" flag: it gets set if a CPU has more than
> >>>>  	 * one runnable RT task.
> >>>>  	 */
> >>>> @@ -1063,6 +1095,8 @@ static inline void idle_balance(int cpu, struct rq *rq)
> >>>>  extern void sysrq_sched_debug_show(void);
> >>>>  extern void sched_init_granularity(void);
> >>>>  extern void update_max_interval(void);
> >>>> +
> >>>> +extern void init_sched_dl_class(void);
> >>>>  extern void init_sched_rt_class(void);
> >>>>  extern void init_sched_fair_class(void);
> >>>>  
> >>>
> >>
> >>
> > 
> 
> 


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 04/14] sched: SCHED_DEADLINE SMP-related data structures & logic.
  2013-11-21 16:26           ` Paul E. McKenney
@ 2013-11-21 16:47             ` Steven Rostedt
  2013-11-21 19:38               ` Paul E. McKenney
  0 siblings, 1 reply; 81+ messages in thread
From: Steven Rostedt @ 2013-11-21 16:47 UTC (permalink / raw)
  To: paulmck
  Cc: Juri Lelli, peterz, tglx, mingo, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield

On Thu, 21 Nov 2013 08:26:31 -0800
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> > >> I guess "deadline tasks" is too much generic; "SCHED_DEADLINE tasks" too long;
> > >> "DL" or "-dl" can be associated to "download". Nothing special in the end, just
> > >> tought it was a reasonable abbreviation.
> > > 
> > > My guess was that "-deadline" was an abbreviation for tasks that had
> > > missed their deadline, just so you know.  ;-)
> > >
> > 
> > Argh! ;)
> > 
> > Actually, thinking about it twice: if we call "-rt tasks" things that are
> > governed by code that resides in sched/rt.c, we could agree on "-deadline
> > tasks" for things that pass through sched/deadline.c.
> 
> Actually, "SCHED_DEADLINE" isn't -that- much longer than "-deadline"
> and it does have the virtue of being unambiguous.
> 

 -sdl ?

Unless you are a fan of "Saturday Day Live!" (for the older generation
that can't stay awake for SNL)

-- Steve

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH] rtmutex: Fix compare of waiter prio and task prio
  2013-11-07 13:43 ` [PATCH 09/14] rtmutex: turn the plist into an rb-tree Juri Lelli
  2013-11-21  3:07   ` Steven Rostedt
@ 2013-11-21 17:52   ` Steven Rostedt
  2013-11-22 10:37     ` Juri Lelli
  2014-01-13 15:54   ` [tip:sched/core] rtmutex: Turn the plist into an rb-tree tip-bot for Peter Zijlstra
  2 siblings, 1 reply; 81+ messages in thread
From: Steven Rostedt @ 2013-11-21 17:52 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield

The conversion of the rt_mutex from using plist to rbtree eliminated
the use of the waiter->list_entry.prio, and instead used directly the
waiter->task->prio.

The problem with this is that the priority inheritance code relies on
the prio of the waiter being stored is different from the task's prio.
The change didn't take into account waiter->task == task, which makes
the compares of:

	if (waiter->task->prio == task->prio)

rather pointless, since they will always be the same:

	task->pi_blocked_on = waiter;
	waiter->task = task;

When deadlock detection is not being used (for internal users of
rt_mutex_lock(); things other than futex), the code relies on
the prio associated to the waiter being different than the prio
associated to the task.

Another use case where this is critical, is when a task that is
blocked on an rt_mutex has its priority increased by a separate task.
Then the compare in rt_mutex_adjust_pi() (called from
sched_setscheduler()), returns without doing anything. This is because
it checks if the priority of the task is different than the priority of
its waiter.

The simple solution is to add a prio member to the rt_mutex_waiter
structure that associates the priority to the waiter that is separate
from the task.

I created a test program that tests this case:

  http://rostedt.homelinux.com/code/pi_mutex_test.c

(too big to include in a change log) I'll work on getting this test
into other projects like LTP and the kernel (perf test?)

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Index: linux-rt.git/kernel/rtmutex.c
===================================================================
--- linux-rt.git.orig/kernel/rtmutex.c
+++ linux-rt.git/kernel/rtmutex.c
@@ -197,7 +197,7 @@ int rt_mutex_getprio(struct task_struct
 	if (likely(!task_has_pi_waiters(task)))
 		return task->normal_prio;
 
-	return min(task_top_pi_waiter(task)->task->prio,
+	return min(task_top_pi_waiter(task)->prio,
 		   task->normal_prio);
 }
 
@@ -336,7 +336,7 @@ static int rt_mutex_adjust_prio_chain(st
 	 * When deadlock detection is off then we check, if further
 	 * priority adjustment is necessary.
 	 */
-	if (!detect_deadlock && waiter->task->prio == task->prio)
+	if (!detect_deadlock && waiter->prio == task->prio)
 		goto out_unlock_pi;
 
 	lock = waiter->lock;
@@ -358,7 +358,7 @@ static int rt_mutex_adjust_prio_chain(st
 
 	/* Requeue the waiter */
 	rt_mutex_dequeue(lock, waiter);
-	waiter->task->prio = task->prio;
+	waiter->prio = task->prio;
 	rt_mutex_enqueue(lock, waiter);
 
 	/* Release the task */
@@ -456,7 +456,7 @@ static int try_to_take_rt_mutex(struct r
 	 * 3) it is top waiter
 	 */
 	if (rt_mutex_has_waiters(lock)) {
-		if (task->prio >= rt_mutex_top_waiter(lock)->task->prio) {
+		if (task->prio >= rt_mutex_top_waiter(lock)->prio) {
 			if (!waiter || waiter != rt_mutex_top_waiter(lock))
 				return 0;
 		}
@@ -516,7 +516,8 @@ static int task_blocks_on_rt_mutex(struc
 	__rt_mutex_adjust_prio(task);
 	waiter->task = task;
 	waiter->lock = lock;
-	
+	waiter->prio = task->prio;
+
 	/* Get the top priority waiter on the lock */
 	if (rt_mutex_has_waiters(lock))
 		top_waiter = rt_mutex_top_waiter(lock);
@@ -661,7 +662,7 @@ void rt_mutex_adjust_pi(struct task_stru
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
 
 	waiter = task->pi_blocked_on;
-	if (!waiter || (waiter->task->prio == task->prio &&
+	if (!waiter || (waiter->prio == task->prio &&
 			!dl_prio(task->prio))) {
 		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 		return;
Index: linux-rt.git/kernel/rtmutex_common.h
===================================================================
--- linux-rt.git.orig/kernel/rtmutex_common.h
+++ linux-rt.git/kernel/rtmutex_common.h
@@ -54,6 +54,7 @@ struct rt_mutex_waiter {
 	struct pid		*deadlock_task_pid;
 	struct rt_mutex		*deadlock_lock;
 #endif
+	int			prio;
 };
 
 /*

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 04/14] sched: SCHED_DEADLINE SMP-related data structures & logic.
  2013-11-21 16:47             ` Steven Rostedt
@ 2013-11-21 19:38               ` Paul E. McKenney
  0 siblings, 0 replies; 81+ messages in thread
From: Paul E. McKenney @ 2013-11-21 19:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Juri Lelli, peterz, tglx, mingo, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield

On Thu, Nov 21, 2013 at 11:47:18AM -0500, Steven Rostedt wrote:
> On Thu, 21 Nov 2013 08:26:31 -0800
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > > >> I guess "deadline tasks" is too much generic; "SCHED_DEADLINE tasks" too long;
> > > >> "DL" or "-dl" can be associated to "download". Nothing special in the end, just
> > > >> tought it was a reasonable abbreviation.
> > > > 
> > > > My guess was that "-deadline" was an abbreviation for tasks that had
> > > > missed their deadline, just so you know.  ;-)
> > > >
> > > 
> > > Argh! ;)
> > > 
> > > Actually, thinking about it twice: if we call "-rt tasks" things that are
> > > governed by code that resides in sched/rt.c, we could agree on "-deadline
> > > tasks" for things that pass through sched/deadline.c.
> > 
> > Actually, "SCHED_DEADLINE" isn't -that- much longer than "-deadline"
> > and it does have the virtue of being unambiguous.
> 
>  -sdl ?

I will let you guys hash this one out.

> Unless you are a fan of "Saturday Day Live!" (for the older generation
> that can't stay awake for SNL)

Isn't that what DVRs are for?  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH] rtmutex: Fix compare of waiter prio and task prio
  2013-11-21 17:52   ` [PATCH] rtmutex: Fix compare of waiter prio and task prio Steven Rostedt
@ 2013-11-22 10:37     ` Juri Lelli
  0 siblings, 0 replies; 81+ messages in thread
From: Juri Lelli @ 2013-11-22 10:37 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield

On 11/21/2013 06:52 PM, Steven Rostedt wrote:
> The conversion of the rt_mutex from using plist to rbtree eliminated
> the use of the waiter->list_entry.prio, and instead used directly the
> waiter->task->prio.
> 
> The problem with this is that the priority inheritance code relies on
> the prio of the waiter being stored is different from the task's prio.
> The change didn't take into account waiter->task == task, which makes
> the compares of:
> 
> 	if (waiter->task->prio == task->prio)
> 
> rather pointless, since they will always be the same:
> 
> 	task->pi_blocked_on = waiter;
> 	waiter->task = task;
> 
> When deadlock detection is not being used (for internal users of
> rt_mutex_lock(); things other than futex), the code relies on
> the prio associated to the waiter being different than the prio
> associated to the task.
> 
> Another use case where this is critical, is when a task that is
> blocked on an rt_mutex has its priority increased by a separate task.
> Then the compare in rt_mutex_adjust_pi() (called from
> sched_setscheduler()), returns without doing anything. This is because
> it checks if the priority of the task is different than the priority of
> its waiter.
> 
> The simple solution is to add a prio member to the rt_mutex_waiter
> structure that associates the priority to the waiter that is separate
> from the task.
> 
> I created a test program that tests this case:
> 
>   http://rostedt.homelinux.com/code/pi_mutex_test.c
> 
> (too big to include in a change log) I'll work on getting this test
> into other projects like LTP and the kernel (perf test?)
> 
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> 
> Index: linux-rt.git/kernel/rtmutex.c
> ===================================================================
> --- linux-rt.git.orig/kernel/rtmutex.c
> +++ linux-rt.git/kernel/rtmutex.c
> @@ -197,7 +197,7 @@ int rt_mutex_getprio(struct task_struct
>  	if (likely(!task_has_pi_waiters(task)))
>  		return task->normal_prio;
>  
> -	return min(task_top_pi_waiter(task)->task->prio,
> +	return min(task_top_pi_waiter(task)->prio,
>  		   task->normal_prio);
>  }
>  
> @@ -336,7 +336,7 @@ static int rt_mutex_adjust_prio_chain(st
>  	 * When deadlock detection is off then we check, if further
>  	 * priority adjustment is necessary.
>  	 */
> -	if (!detect_deadlock && waiter->task->prio == task->prio)
> +	if (!detect_deadlock && waiter->prio == task->prio)
>  		goto out_unlock_pi;
>  
>  	lock = waiter->lock;
> @@ -358,7 +358,7 @@ static int rt_mutex_adjust_prio_chain(st
>  
>  	/* Requeue the waiter */
>  	rt_mutex_dequeue(lock, waiter);
> -	waiter->task->prio = task->prio;
> +	waiter->prio = task->prio;
>  	rt_mutex_enqueue(lock, waiter);
>  
>  	/* Release the task */
> @@ -456,7 +456,7 @@ static int try_to_take_rt_mutex(struct r
>  	 * 3) it is top waiter
>  	 */
>  	if (rt_mutex_has_waiters(lock)) {
> -		if (task->prio >= rt_mutex_top_waiter(lock)->task->prio) {
> +		if (task->prio >= rt_mutex_top_waiter(lock)->prio) {
>  			if (!waiter || waiter != rt_mutex_top_waiter(lock))
>  				return 0;
>  		}
> @@ -516,7 +516,8 @@ static int task_blocks_on_rt_mutex(struc
>  	__rt_mutex_adjust_prio(task);
>  	waiter->task = task;
>  	waiter->lock = lock;
> -	
> +	waiter->prio = task->prio;
> +
>  	/* Get the top priority waiter on the lock */
>  	if (rt_mutex_has_waiters(lock))
>  		top_waiter = rt_mutex_top_waiter(lock);
> @@ -661,7 +662,7 @@ void rt_mutex_adjust_pi(struct task_stru
>  	raw_spin_lock_irqsave(&task->pi_lock, flags);
>  
>  	waiter = task->pi_blocked_on;
> -	if (!waiter || (waiter->task->prio == task->prio &&
> +	if (!waiter || (waiter->prio == task->prio &&
>  			!dl_prio(task->prio))) {
>  		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
>  		return;
> Index: linux-rt.git/kernel/rtmutex_common.h
> ===================================================================
> --- linux-rt.git.orig/kernel/rtmutex_common.h
> +++ linux-rt.git/kernel/rtmutex_common.h
> @@ -54,6 +54,7 @@ struct rt_mutex_waiter {
>  	struct pid		*deadlock_task_pid;
>  	struct rt_mutex		*deadlock_lock;
>  #endif
> +	int			prio;
>  };
>  
>  /*
> 

Thanks! But, now that waiters have their own prio, don't we need to
enqueue them using that?

Something like:

    rtmutex: enqueue waiters by their prio

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index a2c8ee8..2e960a2 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -96,13 +96,16 @@ static inline int
 rt_mutex_waiter_less(struct rt_mutex_waiter *left,
                     struct rt_mutex_waiter *right)
 {
-       if (left->task->prio < right->task->prio)
+       if (left->prio < right->prio)
                return 1;
 
        /*
-        * If both tasks are dl_task(), we check their deadlines.
+        * If both waiters have dl_prio(), we check the deadlines of the
+        * associated tasks.
+        * If left waiter has a dl_prio(), and we didn't return 1 above,
+        * then right waiter has a dl_prio() too.
         */
-       if (dl_prio(left->task->prio) && dl_prio(right->task->prio))
+       if (dl_prio(left->prio))
                return (left->task->dl.deadline < right->task->dl.deadline);
 
        return 0;

Thanks,

- Juri

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH 02/14] sched: add extended scheduling interface. (new ABI)
  2013-11-07 13:43 ` [PATCH 02/14] sched: add extended scheduling interface Juri Lelli
  2013-11-12 17:23   ` Steven Rostedt
  2013-11-12 17:32   ` Steven Rostedt
@ 2013-11-27 13:23   ` Ingo Molnar
  2013-11-27 13:30     ` Peter Zijlstra
  2014-01-13 15:53   ` [tip:sched/core] sched: Add new scheduler syscalls to support an extended scheduling parameters ABI tip-bot for Dario Faggioli
  3 siblings, 1 reply; 81+ messages in thread
From: Ingo Molnar @ 2013-11-27 13:23 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield,
	Andrew Morton, Linus Torvalds


* Juri Lelli <juri.lelli@gmail.com> wrote:

> + * @__unused		padding to allow future expansion without ABI issues
> + */
> +struct sched_param2 {
> +	int sched_priority;
> +	unsigned int sched_flags;
> +	u64 sched_runtime;
> +	u64 sched_deadline;
> +	u64 sched_period;
> +
> +	u64 __unused[12];
> +};

So this really needs to use s32/u32.

But the bigger problem is that this is a rather dumb ABI which copies 
128 bytes unconditionally. That will be enough up to the point we run 
out of it.

Instead I think what we want is a simple yet extensible ABI where the 
size of the parameters is part of the structure itself - which acts as 
a natural 'version'.

We already have such an extensible syscall implementation, see 
sys_perf_event_open in kernel/events/core.c: bits of which could be 
factored out to make all this easier and more robust.

To make this auto-versioning property more apparent I'd suggest a 
rename of the syscalls as well: sys_sched_setattr(), 
sys_sched_getattr(), or so.

The compatibility principle is: there's a 'struct sched_attr' with a 
sched_attr::size field (plus the fields above, and no padding). The 
sched_attr::size field us the structure size user-space expects.

There are 3 main compatibility cases:

 - the kernel's 'sizeof sched_attr' is equal to sched_attr:size: the 
   kernel version and user-space version matches, it's a straight ABI 
   in this case with full functionality.

 - the kernel's 'sizeof sched_attr' is larger than sched_attr::size 
   [the kernel is newer than what user-space was built for], in this 
   case the kernel assumes that all remaining values are zero and acts
   accordingly.

 - the kernel's 'sizeof sched_attr' is smaller than sched_attr::size 
   [the kernel is older than what user-space was built for]. In this 
   case the kernel should return -ENOSYS if any of the additional 
   fields are nonzero. If those are all zero then it will work as if a 
   smaller structure was passed in.

This ensures maximal upwards and downwards compatibility and keeps the 
syscall ABI compat yet extensible. The ABI is the quickest when tool 
version matches kernel version - but that's the typical case for 
distros. Yet even the mismatching versions work fine and the ABI is 
kept.

( See kernel/events/core.c for more details. Some of the helpers there
  should be factored out to allow easier support for such syscalls. )

Note that I did a few other small fixes to the changelog and to the 
code as well - see the patch attached below - please work based on 
this version.

Thanks,

	Ingo

======================>
Subject: sched: Add 3 new scheduler syscalls to support an extended scheduling parameters ABI
From: Dario Faggioli <raistlin@linux.it>
Date: Thu, 7 Nov 2013 14:43:36 +0100

Add the syscalls needed for supporting scheduling algorithms
with extended scheduling parameters (e.g., SCHED_DEADLINE).

In general, it makes possible to specify a periodic/sporadic task,
that executes for a given amount of runtime at each instance, and is
scheduled according to the urgency of their own timing constraints,
i.e.:

 - a (maximum/typical) instance execution time,
 - a minimum interval between consecutive instances,
 - a time constraint by which each instance must be completed.

Thus, both the data structure that holds the scheduling parameters of
the tasks and the system calls dealing with it must be extended.
Unfortunately, modifying the existing struct sched_param would break
the ABI and result in potentially serious compatibility issues with
legacy binaries.

For these reasons, this patch:

 - defines the new struct sched_param2, containing all the fields
   that are necessary for specifying a task in the computational
   model described above;
 - defines and implements the new scheduling related syscalls that
   manipulate it, i.e., sched_setscheduler2(), sched_setparam2()
   and sched_getparam2().

Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
proof of concept and for developing and testing purposes. Making them
available on other architectures is straightforward.

Since no "user" for these new parameters is introduced in this patch,
the implementation of the new system calls is just identical to their
already existing counterpart. Future patches that implement scheduling
policies able to exploit the new data structure must also take care of
modifying the *2() calls accordingly with their own purposes.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: bruce.ashfield@windriver.com
Cc: claudio@evidence.eu.com
Cc: darren@dvhart.com
Cc: dhaval.giani@gmail.com
Cc: fchecconi@gmail.com
Cc: fweisbec@gmail.com
Cc: harald.gustafsson@ericsson.com
Cc: hgu1972@gmail.com
Cc: insop.song@gmail.com
Cc: jkacur@redhat.com
Cc: johan.eker@ericsson.com
Cc: liming.wang@windriver.com
Cc: luca.abeni@unitn.it
Cc: michael@amarulasolutions.com
Cc: nicola.manica@disi.unitn.it
Cc: oleg@redhat.com
Cc: paulmck@linux.vnet.ibm.com
Cc: p.faure@akatech.ch
Cc: rostedt@goodmis.org
Cc: tommaso.cucinotta@sssup.it
Cc: vincent.guittot@linaro.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1383831828-15501-3-git-send-email-juri.lelli@gmail.com
[ Twiddled the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/arm/include/asm/unistd.h      |    2 
 arch/arm/include/uapi/asm/unistd.h |    3 +
 arch/arm/kernel/calls.S            |    3 +
 arch/x86/syscalls/syscall_32.tbl   |    3 +
 arch/x86/syscalls/syscall_64.tbl   |    3 +
 include/linux/sched.h              |   50 +++++++++++++++++
 include/linux/syscalls.h           |    7 ++
 kernel/sched/core.c                |  106 +++++++++++++++++++++++++++++++++++--
 8 files changed, 173 insertions(+), 4 deletions(-)

Index: tip/arch/arm/include/asm/unistd.h
===================================================================
--- tip.orig/arch/arm/include/asm/unistd.h
+++ tip/arch/arm/include/asm/unistd.h
@@ -15,7 +15,7 @@
 
 #include <uapi/asm/unistd.h>
 
-#define __NR_syscalls  (380)
+#define __NR_syscalls  (383)
 #define __ARM_NR_cmpxchg		(__ARM_NR_BASE+0x00fff0)
 
 #define __ARCH_WANT_STAT64
Index: tip/arch/arm/include/uapi/asm/unistd.h
===================================================================
--- tip.orig/arch/arm/include/uapi/asm/unistd.h
+++ tip/arch/arm/include/uapi/asm/unistd.h
@@ -406,6 +406,9 @@
 #define __NR_process_vm_writev		(__NR_SYSCALL_BASE+377)
 #define __NR_kcmp			(__NR_SYSCALL_BASE+378)
 #define __NR_finit_module		(__NR_SYSCALL_BASE+379)
+#define __NR_sched_setscheduler2	(__NR_SYSCALL_BASE+380)
+#define __NR_sched_setparam2		(__NR_SYSCALL_BASE+381)
+#define __NR_sched_getparam2		(__NR_SYSCALL_BASE+382)
 
 /*
  * This may need to be greater than __NR_last_syscall+1 in order to
Index: tip/arch/arm/kernel/calls.S
===================================================================
--- tip.orig/arch/arm/kernel/calls.S
+++ tip/arch/arm/kernel/calls.S
@@ -389,6 +389,9 @@
 		CALL(sys_process_vm_writev)
 		CALL(sys_kcmp)
 		CALL(sys_finit_module)
+/* 380 */	CALL(sys_sched_setscheduler2)
+		CALL(sys_sched_setparam2)
+		CALL(sys_sched_getparam2)
 #ifndef syscalls_counted
 .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
 #define syscalls_counted
Index: tip/arch/x86/syscalls/syscall_32.tbl
===================================================================
--- tip.orig/arch/x86/syscalls/syscall_32.tbl
+++ tip/arch/x86/syscalls/syscall_32.tbl
@@ -357,3 +357,6 @@
 348	i386	process_vm_writev	sys_process_vm_writev		compat_sys_process_vm_writev
 349	i386	kcmp			sys_kcmp
 350	i386	finit_module		sys_finit_module
+351	i386	sched_setparam2		sys_sched_setparam2
+352	i386	sched_getparam2		sys_sched_getparam2
+353	i386	sched_setscheduler2	sys_sched_setscheduler2
Index: tip/arch/x86/syscalls/syscall_64.tbl
===================================================================
--- tip.orig/arch/x86/syscalls/syscall_64.tbl
+++ tip/arch/x86/syscalls/syscall_64.tbl
@@ -320,6 +320,9 @@
 311	64	process_vm_writev	sys_process_vm_writev
 312	common	kcmp			sys_kcmp
 313	common	finit_module		sys_finit_module
+314	common	sched_setparam2		sys_sched_setparam2
+315	common	sched_getparam2		sys_sched_getparam2
+316	common	sched_setscheduler2	sys_sched_setscheduler2
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
Index: tip/include/linux/sched.h
===================================================================
--- tip.orig/include/linux/sched.h
+++ tip/include/linux/sched.h
@@ -56,6 +56,54 @@ struct sched_param {
 
 #include <asm/processor.h>
 
+/*
+ * Extended scheduling parameters data structure.
+ *
+ * This is needed because the original struct sched_param can not be
+ * altered without introducing ABI issues with legacy applications
+ * (e.g., in sched_getparam()).
+ *
+ * However, the possibility of specifying more than just a priority for
+ * the tasks may be useful for a wide variety of application fields, e.g.,
+ * multimedia, streaming, automation and control, and many others.
+ *
+ * This variant (sched_param2) is meant at describing a so-called
+ * sporadic time-constrained task. In such model a task is specified by:
+ *  - the activation period or minimum instance inter-arrival time;
+ *  - the maximum (or average, depending on the actual scheduling
+ *    discipline) computation time of all instances, a.k.a. runtime;
+ *  - the deadline (relative to the actual activation time) of each
+ *    instance.
+ * Very briefly, a periodic (sporadic) task asks for the execution of
+ * some specific computation --which is typically called an instance--
+ * (at most) every period. Moreover, each instance typically lasts no more
+ * than the runtime and must be completed by time instant t equal to
+ * the instance activation time + the deadline.
+ *
+ * This is reflected by the actual fields of the sched_param2 structure:
+ *
+ *  @sched_priority     task's priority (might still be useful)
+ *  @sched_deadline     representative of the task's deadline
+ *  @sched_runtime      representative of the task's runtime
+ *  @sched_period       representative of the task's period
+ *  @sched_flags        for customizing the scheduler behaviour
+ *
+ * Given this task model, there are a multiplicity of scheduling algorithms
+ * and policies, that can be used to ensure all the tasks will make their
+ * timing constraints.
+ *
+ * @__unused		padding to allow future expansion without ABI issues
+ */
+struct sched_param2 {
+	int sched_priority;
+	unsigned int sched_flags;
+	u64 sched_runtime;
+	u64 sched_deadline;
+	u64 sched_period;
+
+	u64 __unused[12];
+};
+
 struct exec_domain;
 struct futex_pi_state;
 struct robust_list_head;
@@ -1961,6 +2009,8 @@ extern int sched_setscheduler(struct tas
 			      const struct sched_param *);
 extern int sched_setscheduler_nocheck(struct task_struct *, int,
 				      const struct sched_param *);
+extern int sched_setscheduler2(struct task_struct *, int,
+				 const struct sched_param2 *);
 extern struct task_struct *idle_task(int cpu);
 /**
  * is_idle_task - is the specified task an idle task?
Index: tip/include/linux/syscalls.h
===================================================================
--- tip.orig/include/linux/syscalls.h
+++ tip/include/linux/syscalls.h
@@ -38,6 +38,7 @@ struct rlimit;
 struct rlimit64;
 struct rusage;
 struct sched_param;
+struct sched_param2;
 struct sel_arg_struct;
 struct semaphore;
 struct sembuf;
@@ -277,11 +278,17 @@ asmlinkage long sys_clock_nanosleep(cloc
 asmlinkage long sys_nice(int increment);
 asmlinkage long sys_sched_setscheduler(pid_t pid, int policy,
 					struct sched_param __user *param);
+asmlinkage long sys_sched_setscheduler2(pid_t pid, int policy,
+					struct sched_param2 __user *param);
 asmlinkage long sys_sched_setparam(pid_t pid,
 					struct sched_param __user *param);
+asmlinkage long sys_sched_setparam2(pid_t pid,
+					struct sched_param2 __user *param);
 asmlinkage long sys_sched_getscheduler(pid_t pid);
 asmlinkage long sys_sched_getparam(pid_t pid,
 					struct sched_param __user *param);
+asmlinkage long sys_sched_getparam2(pid_t pid,
+					struct sched_param2 __user *param);
 asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
 					unsigned long __user *user_mask_ptr);
 asmlinkage long sys_sched_getaffinity(pid_t pid, unsigned int len,
Index: tip/kernel/sched/core.c
===================================================================
--- tip.orig/kernel/sched/core.c
+++ tip/kernel/sched/core.c
@@ -3025,7 +3025,8 @@ static bool check_same_owner(struct task
 }
 
 static int __sched_setscheduler(struct task_struct *p, int policy,
-				const struct sched_param *param, bool user)
+				const struct sched_param2 *param,
+				bool user)
 {
 	int retval, oldprio, oldpolicy = -1, on_rq, running;
 	unsigned long flags;
@@ -3190,10 +3191,20 @@ recheck:
 int sched_setscheduler(struct task_struct *p, int policy,
 		       const struct sched_param *param)
 {
-	return __sched_setscheduler(p, policy, param, true);
+	struct sched_param2 param2 = {
+		.sched_priority = param->sched_priority
+	};
+	return __sched_setscheduler(p, policy, &param2, true);
 }
 EXPORT_SYMBOL_GPL(sched_setscheduler);
 
+int sched_setscheduler2(struct task_struct *p, int policy,
+			  const struct sched_param2 *param2)
+{
+	return __sched_setscheduler(p, policy, param2, true);
+}
+EXPORT_SYMBOL_GPL(sched_setscheduler2);
+
 /**
  * sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace.
  * @p: the task in question.
@@ -3210,7 +3221,10 @@ EXPORT_SYMBOL_GPL(sched_setscheduler);
 int sched_setscheduler_nocheck(struct task_struct *p, int policy,
 			       const struct sched_param *param)
 {
-	return __sched_setscheduler(p, policy, param, false);
+	struct sched_param2 param2 = {
+		.sched_priority = param->sched_priority
+	};
+	return __sched_setscheduler(p, policy, &param2, false);
 }
 
 static int
@@ -3235,6 +3249,31 @@ do_sched_setscheduler(pid_t pid, int pol
 	return retval;
 }
 
+static int
+do_sched_setscheduler2(pid_t pid, int policy,
+			 struct sched_param2 __user *param2)
+{
+	struct sched_param2 lparam2;
+	struct task_struct *p;
+	int retval;
+
+	if (!param2 || pid < 0)
+		return -EINVAL;
+
+	memset(&lparam2, 0, sizeof(struct sched_param2));
+	if (copy_from_user(&lparam2, param2, sizeof(struct sched_param2)))
+		return -EFAULT;
+
+	rcu_read_lock();
+	retval = -ESRCH;
+	p = find_process_by_pid(pid);
+	if (p != NULL)
+		retval = sched_setscheduler2(p, policy, &lparam2);
+	rcu_read_unlock();
+
+	return retval;
+}
+
 /**
  * sys_sched_setscheduler - set/change the scheduler policy and RT priority
  * @pid: the pid in question.
@@ -3254,6 +3293,21 @@ SYSCALL_DEFINE3(sched_setscheduler, pid_
 }
 
 /**
+ * sys_sched_setscheduler2 - same as above, but with extended sched_param
+ * @pid: the pid in question.
+ * @policy: new policy (could use extended sched_param).
+ * @param: structure containg the extended parameters.
+ */
+SYSCALL_DEFINE3(sched_setscheduler2, pid_t, pid, int, policy,
+		struct sched_param2 __user *, param2)
+{
+	if (policy < 0)
+		return -EINVAL;
+
+	return do_sched_setscheduler2(pid, policy, param2);
+}
+
+/**
  * sys_sched_setparam - set/change the RT priority of a thread
  * @pid: the pid in question.
  * @param: structure containing the new RT priority.
@@ -3266,6 +3320,17 @@ SYSCALL_DEFINE2(sched_setparam, pid_t, p
 }
 
 /**
+ * sys_sched_setparam2 - same as above, but with extended sched_param
+ * @pid: the pid in question.
+ * @param2: structure containing the extended parameters.
+ */
+SYSCALL_DEFINE2(sched_setparam2, pid_t, pid,
+		struct sched_param2 __user *, param2)
+{
+	return do_sched_setscheduler2(pid, -1, param2);
+}
+
+/**
  * sys_sched_getscheduler - get the policy (scheduling class) of a thread
  * @pid: the pid in question.
  *
@@ -3331,6 +3396,41 @@ SYSCALL_DEFINE2(sched_getparam, pid_t, p
 	return retval;
 
 out_unlock:
+	rcu_read_unlock();
+	return retval;
+}
+
+/**
+ * sys_sched_getparam2 - same as above, but with extended sched_param
+ * @pid: the pid in question.
+ * @param2: structure containing the extended parameters.
+ */
+SYSCALL_DEFINE2(sched_getparam2, pid_t, pid, struct sched_param2 __user *, param2)
+{
+	struct sched_param2 lp;
+	struct task_struct *p;
+	int retval;
+
+	if (!param2 || pid < 0)
+		return -EINVAL;
+
+	rcu_read_lock();
+	p = find_process_by_pid(pid);
+	retval = -ESRCH;
+	if (!p)
+		goto out_unlock;
+
+	retval = security_task_getscheduler(p);
+	if (retval)
+		goto out_unlock;
+
+	lp.sched_priority = p->rt_priority;
+	rcu_read_unlock();
+
+	retval = copy_to_user(param2, &lp, sizeof(lp)) ? -EFAULT : 0;
+	return retval;
+
+out_unlock:
 	rcu_read_unlock();
 	return retval;
 }

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 02/14] sched: add extended scheduling interface. (new ABI)
  2013-11-27 13:23   ` [PATCH 02/14] sched: add extended scheduling interface. (new ABI) Ingo Molnar
@ 2013-11-27 13:30     ` Peter Zijlstra
  2013-11-27 14:01       ` Ingo Molnar
  0 siblings, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2013-11-27 13:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Juri Lelli, tglx, mingo, rostedt, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield,
	Andrew Morton, Linus Torvalds

On Wed, Nov 27, 2013 at 02:23:54PM +0100, Ingo Molnar wrote:
> There are 3 main compatibility cases:
> 
>  - the kernel's 'sizeof sched_attr' is equal to sched_attr:size: the 
>    kernel version and user-space version matches, it's a straight ABI 
>    in this case with full functionality.
> 
>  - the kernel's 'sizeof sched_attr' is larger than sched_attr::size 
>    [the kernel is newer than what user-space was built for], in this 
>    case the kernel assumes that all remaining values are zero and acts
>    accordingly.

It also needs to fail sched_getparam() when any of the fields that do
not fit in the smaller struct provided are !0.

>  - the kernel's 'sizeof sched_attr' is smaller than sched_attr::size 
>    [the kernel is older than what user-space was built for]. In this 
>    case the kernel should return -ENOSYS if any of the additional 
>    fields are nonzero. If those are all zero then it will work as if a 
>    smaller structure was passed in.

So the problem I see with this one is that because you're allowed to
call sched_setparam() or whatever it will be called next on another
task; a task can very easily fail its sched_getparam() call.

Suppose the application is 'old' and only supports a subset of the
fields; but its wants to get, modify and set its params. This will work
as long nothing will set anything it doesn't know about.

As soon as some external entity -- say a sysad using schedtool -- sets a
param field it doesn't support the get, modify, set routing completely
fails.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 08/14] sched: add latency tracing for -deadline tasks.
  2013-11-20 21:33   ` Steven Rostedt
@ 2013-11-27 13:43     ` Juri Lelli
  2013-11-27 14:16       ` Steven Rostedt
  0 siblings, 1 reply; 81+ messages in thread
From: Juri Lelli @ 2013-11-27 13:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield

On 11/20/2013 10:33 PM, Steven Rostedt wrote:
> On Thu,  7 Nov 2013 14:43:42 +0100
> Juri Lelli <juri.lelli@gmail.com> wrote:
> 
> 
>> +	/*
>> +	 * Semantic is like this:
>> +	 *  - wakeup tracer handles all tasks in the system, independently
>> +	 *    from their scheduling class;
>> +	 *  - wakeup_rt tracer handles tasks belonging to sched_dl and
>> +	 *    sched_rt class;
>> +	 *  - wakeup_dl handles tasks belonging to sched_dl class only.
>> +	 */
>> +	if ((wakeup_dl && !dl_task(p)) ||
>> +	    (wakeup_rt && !dl_task(p) && !rt_task(p)) ||
>> +	    (p->prio >= wakeup_prio || p->prio >= current->prio))
>>  		return;
>>  
>>  	pc = preempt_count();
>> @@ -486,7 +495,7 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
>>  	arch_spin_lock(&wakeup_lock);
>>  
>>  	/* check for races. */
>> -	if (!tracer_enabled || p->prio >= wakeup_prio)
>> +	if (!tracer_enabled || (!dl_task(p) && p->prio >= wakeup_prio))
>>  		goto out_locked;
> 
> We probably want to add a "tracing_dl" variable, and do the test like
> this:
> 
> 	if (!tracer_enabled || tracing_dl ||
> 	    (!dl_task(p) && p->prio >= wakeup_prio))
> 
> and for the first if statement too. Otherwise if two dl tasks are
> running on two different CPUs, the second will override the first. Once
> you start tracing a dl_task, you shouldn't bother tracing another task
> until that one wakes up.
> 
> 	if (dl_task(p))
> 		tracing_dl = 1;
> 	else
> 		tracing_dl = 0;
> 

Ok, this fixes the build error:

--------------------------------------
 kernel/trace/trace_selftest.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index f76f8d6..ad94604 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -1023,16 +1023,16 @@ trace_selftest_startup_nop(struct tracer *trace, struct trace_array *tr)
 static int trace_wakeup_test_thread(void *data)
 {
 	/* Make this a -deadline thread */
-	struct sched_param2 paramx = {
+	static const struct sched_param2 param = {
 		.sched_priority = 0,
+		.sched_flags = 0,
 		.sched_runtime = 100000ULL,
 		.sched_deadline = 10000000ULL,
 		.sched_period = 10000000ULL
-		.sched_flags = 0
 	};
 	struct completion *x = data;
 
-	sched_setscheduler2(current, SCHED_DEADLINE, &paramx);
+	sched_setscheduler2(current, SCHED_DEADLINE, &param);
 
 	/* Make it know we have a new prio */
 	complete(x);
@@ -1088,19 +1088,19 @@ trace_selftest_startup_wakeup(struct tracer *trace, struct trace_array *tr)
 
 	while (p->on_rq) {
 		/*
-		 * Sleep to make sure the RT thread is asleep too.
+		 * Sleep to make sure the -deadline thread is asleep too.
 		 * On virtual machines we can't rely on timings,
 		 * but we want to make sure this test still works.
 		 */
 		msleep(100);
 	}
 
-	init_completion(&isrt);
+	init_completion(&is_ready);
 
 	wake_up_process(p);
 
 	/* Wait for the task to wake up */
-	wait_for_completion(&isrt);
+	wait_for_completion(&is_ready);
 
 	/* stop the tracing. */
 	tracing_stop();
--------------------------------

And this should implement what you were asking for:

--------------------------------
 kernel/trace/trace_sched_wakeup.c |   28 ++++++++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
index 1457fb1..090c4d9 100644
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -28,6 +28,7 @@ static int			wakeup_current_cpu;
 static unsigned			wakeup_prio = -1;
 static int			wakeup_rt;
 static int			wakeup_dl;
+static int			tracing_dl = 0;
 
 static arch_spinlock_t wakeup_lock =
 	(arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
@@ -438,6 +439,7 @@ static void __wakeup_reset(struct trace_array *tr)
 {
 	wakeup_cpu = -1;
 	wakeup_prio = -1;
+	tracing_dl = 0;
 
 	if (wakeup_task)
 		put_task_struct(wakeup_task);
@@ -481,9 +483,9 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
 	 *    sched_rt class;
 	 *  - wakeup_dl handles tasks belonging to sched_dl class only.
 	 */
-	if ((wakeup_dl && !dl_task(p)) ||
+	if (tracing_dl || (wakeup_dl && !dl_task(p)) ||
 	    (wakeup_rt && !dl_task(p) && !rt_task(p)) ||
-	    (p->prio >= wakeup_prio || p->prio >= current->prio))
+	    (!dl_task(p) && (p->prio >= wakeup_prio || p->prio >= current->prio)))
 		return;
 
 	pc = preempt_count();
@@ -495,7 +497,8 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
 	arch_spin_lock(&wakeup_lock);
 
 	/* check for races. */
-	if (!tracer_enabled || (!dl_task(p) && p->prio >= wakeup_prio))
+	if (!tracer_enabled || tracing_dl ||
+	    (!dl_task(p) && p->prio >= wakeup_prio))
 		goto out_locked;
 
 	/* reset the trace */
@@ -505,6 +508,15 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
 	wakeup_current_cpu = wakeup_cpu;
 	wakeup_prio = p->prio;
 
+	/*
+	 * Once you start tracing a -deadline task, don't bother tracing
+	 * another task until the first one wakes up.
+	 */
+	if (dl_task(p))
+		tracing_dl = 1;
+	else
+		tracing_dl = 0;
+
 	wakeup_task = p;
 	get_task_struct(wakeup_task);
 
@@ -700,10 +712,18 @@ static struct tracer wakeup_dl_tracer __read_mostly =
 	.start		= wakeup_tracer_start,
 	.stop		= wakeup_tracer_stop,
 	.wait_pipe	= poll_wait_pipe,
-	.print_max	= 1,
+	.print_max	= true,
+	.print_header	= wakeup_print_header,
+	.print_line	= wakeup_print_line,
+	.flags		= &tracer_flags,
+	.set_flag	= wakeup_set_flag,
+	.flag_changed	= wakeup_flag_changed,
 #ifdef CONFIG_FTRACE_SELFTEST
 	.selftest    = trace_selftest_startup_wakeup,
 #endif
+	.open		= wakeup_trace_open,
+	.close		= wakeup_trace_close,
+	.use_max_tr	= true,
 };
 
 __init static int init_wakeup_tracer(void)
-----------------------------------

Makes sense? :)

Thanks,

- Juri

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH 02/14] sched: add extended scheduling interface. (new ABI)
  2013-11-27 13:30     ` Peter Zijlstra
@ 2013-11-27 14:01       ` Ingo Molnar
  2013-11-27 14:13         ` Peter Zijlstra
  0 siblings, 1 reply; 81+ messages in thread
From: Ingo Molnar @ 2013-11-27 14:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, tglx, mingo, rostedt, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield,
	Andrew Morton, Linus Torvalds


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Nov 27, 2013 at 02:23:54PM +0100, Ingo Molnar wrote:
> > There are 3 main compatibility cases:
> > 
> >  - the kernel's 'sizeof sched_attr' is equal to sched_attr:size: the 
> >    kernel version and user-space version matches, it's a straight ABI 
> >    in this case with full functionality.
> > 
> >  - the kernel's 'sizeof sched_attr' is larger than sched_attr::size 
> >    [the kernel is newer than what user-space was built for], in this 
> >    case the kernel assumes that all remaining values are zero and acts
> >    accordingly.
> 
> It also needs to fail sched_getparam() when any of the fields that do
> not fit in the smaller struct provided are !0.
>
> >  - the kernel's 'sizeof sched_attr' is smaller than sched_attr::size 
> >    [the kernel is older than what user-space was built for]. In 
> >    this case the kernel should return -ENOSYS if any of the 
> >    additional fields are nonzero. If those are all zero then it 
> >    will work as if a smaller structure was passed in.
> 
> So the problem I see with this one is that because you're allowed to 
> call sched_setparam() or whatever it will be called next on another 
> task; a task can very easily fail its sched_getparam() call.
> 
> Suppose the application is 'old' and only supports a subset of the 
> fields; but its wants to get, modify and set its params. This will 
> work as long nothing will set anything it doesn't know about.
> 
> As soon as some external entity -- say a sysad using schedtool -- 
> sets a param field it doesn't support the get, modify, set routing 
> completely fails.

There are two approaches to this that I can see:

1)

allow partial information to be returned to user-space, for existing 
input parameters. The new fields won't be displayed, but the tool 
doesn't know about them anyway so it's OK. The tool can still display 
all the other existing parameters.

2)

Return -ENOSYS if the 'extra' fields are nonzero. In this case the 
usual case of old tooling + new kernel will still work just fine, 
because old tooling won't set the new fields to any non-default 
(nonzero) values. In the 'mixed' case old tooling will not be able to 
change/display those fields.

I tend to lean towards #1. What do you think?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [tip:sched/core] sched: Add sched_class->task_dead() method
  2013-11-07 13:43 ` [PATCH 01/14] sched: add sched_class->task_dead Juri Lelli
  2013-11-12  4:17   ` Paul Turner
  2013-11-12 17:19   ` Steven Rostedt
@ 2013-11-27 14:10   ` tip-bot for Dario Faggioli
  2 siblings, 0 replies; 81+ messages in thread
From: tip-bot for Dario Faggioli @ 2013-11-27 14:10 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, pjt, raistlin, tglx, juri.lelli

Commit-ID:  e6c390f2dfd04c165ce45b0032f73fba85b1f282
Gitweb:     http://git.kernel.org/tip/e6c390f2dfd04c165ce45b0032f73fba85b1f282
Author:     Dario Faggioli <raistlin@linux.it>
AuthorDate: Thu, 7 Nov 2013 14:43:35 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 27 Nov 2013 14:08:50 +0100

sched: Add sched_class->task_dead() method

Add a new function to the scheduling class interface. It is called
at the end of a context switch, if the prev task is in TASK_DEAD state.

It will be useful for the scheduling classes that want to be notified
when one of their tasks dies, e.g. to perform some cleanup actions,
such as SCHED_DEADLINE.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Reviewed-by: Paul Turner <pjt@google.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Cc: bruce.ashfield@windriver.com
Cc: claudio@evidence.eu.com
Cc: darren@dvhart.com
Cc: dhaval.giani@gmail.com
Cc: fchecconi@gmail.com
Cc: fweisbec@gmail.com
Cc: harald.gustafsson@ericsson.com
Cc: hgu1972@gmail.com
Cc: insop.song@gmail.com
Cc: jkacur@redhat.com
Cc: johan.eker@ericsson.com
Cc: liming.wang@windriver.com
Cc: luca.abeni@unitn.it
Cc: michael@amarulasolutions.com
Cc: nicola.manica@disi.unitn.it
Cc: oleg@redhat.com
Cc: paulmck@linux.vnet.ibm.com
Cc: p.faure@akatech.ch
Cc: rostedt@goodmis.org
Cc: tommaso.cucinotta@sssup.it
Cc: vincent.guittot@linaro.org
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-2-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c  | 3 +++
 kernel/sched/sched.h | 1 +
 2 files changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 19db8f3..25b3779 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2003,6 +2003,9 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	if (unlikely(prev_state == TASK_DEAD)) {
 		task_numa_free(prev);
 
+		if (prev->sched_class->task_dead)
+			prev->sched_class->task_dead(prev);
+
 		/*
 		 * Remove function-return probe instances associated with this
 		 * task and put them back on the free list.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 88c85b2..b3b4a49 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1023,6 +1023,7 @@ struct sched_class {
 	void (*set_curr_task) (struct rq *rq);
 	void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
 	void (*task_fork) (struct task_struct *p);
+	void (*task_dead) (struct task_struct *p);
 
 	void (*switched_from) (struct rq *this_rq, struct task_struct *task);
 	void (*switched_to) (struct rq *this_rq, struct task_struct *task);

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH 02/14] sched: add extended scheduling interface. (new ABI)
  2013-11-27 14:01       ` Ingo Molnar
@ 2013-11-27 14:13         ` Peter Zijlstra
  2013-11-27 14:17           ` Ingo Molnar
  0 siblings, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2013-11-27 14:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Juri Lelli, tglx, mingo, rostedt, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield,
	Andrew Morton, Linus Torvalds

On Wed, Nov 27, 2013 at 03:01:43PM +0100, Ingo Molnar wrote:
> > So the problem I see with this one is that because you're allowed to 
> > call sched_setparam() or whatever it will be called next on another 
> > task; a task can very easily fail its sched_getparam() call.
> > 
> > Suppose the application is 'old' and only supports a subset of the 
> > fields; but its wants to get, modify and set its params. This will 
> > work as long nothing will set anything it doesn't know about.
> > 
> > As soon as some external entity -- say a sysad using schedtool -- 
> > sets a param field it doesn't support the get, modify, set routing 
> > completely fails.
> 
> There are two approaches to this that I can see:
> 
> 1)
> 
> allow partial information to be returned to user-space, for existing 
> input parameters. The new fields won't be displayed, but the tool 
> doesn't know about them anyway so it's OK. The tool can still display 
> all the other existing parameters.

But suppose a task simply wants to lower/raise its static (FIFO)
priority and does:

sched_getparam(&params);
params.prio += 1;
sched_setparam(&params);

If anything outside of the known param fields was set, we just silently
lost it, for the setparam() call will fill out 0s for the unprovided
fields.

> 2)
> 
> Return -ENOSYS if the 'extra' fields are nonzero. In this case the 
> usual case of old tooling + new kernel will still work just fine, 
> because old tooling won't set the new fields to any non-default 
> (nonzero) values. In the 'mixed' case old tooling will not be able to 
> change/display those fields.
> 
> I tend to lean towards #1. What do you think?

As per the above that can result in silent unexpected behavioural
changes.

I'd much rather be explicit and break hard; so 2).

So mixing new tools (schedtool, chrt etc) and old apps will give pain,
but at least not silent surprises.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 08/14] sched: add latency tracing for -deadline tasks.
  2013-11-27 13:43     ` Juri Lelli
@ 2013-11-27 14:16       ` Steven Rostedt
  2013-11-27 14:19         ` Juri Lelli
  2013-11-27 14:26         ` Peter Zijlstra
  0 siblings, 2 replies; 81+ messages in thread
From: Steven Rostedt @ 2013-11-27 14:16 UTC (permalink / raw)
  To: Juri Lelli
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield

On Wed, 27 Nov 2013 14:43:45 +0100
Juri Lelli <juri.lelli@gmail.com> wrote:
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
> index f76f8d6..ad94604 100644
> --- a/kernel/trace/trace_selftest.c
> +++ b/kernel/trace/trace_selftest.c
> @@ -1023,16 +1023,16 @@ trace_selftest_startup_nop(struct tracer *trace, struct trace_array *tr)
>  static int trace_wakeup_test_thread(void *data)
>  {
>  	/* Make this a -deadline thread */
> -	struct sched_param2 paramx = {
> +	static const struct sched_param2 param = {
>  		.sched_priority = 0,
> +		.sched_flags = 0,
>  		.sched_runtime = 100000ULL,
>  		.sched_deadline = 10000000ULL,
>  		.sched_period = 10000000ULL
> -		.sched_flags = 0

Assigning structures like this, you don't need to set the zero fields.
all fields not explicitly stated, are set to zero.

>  	};
>  	struct completion *x = data;
>  
> 
> --------------------------------
>  kernel/trace/trace_sched_wakeup.c |   28 ++++++++++++++++++++++++----
>  1 file changed, 24 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
> index 1457fb1..090c4d9 100644
> --- a/kernel/trace/trace_sched_wakeup.c
> +++ b/kernel/trace/trace_sched_wakeup.c
> @@ -28,6 +28,7 @@ static int			wakeup_current_cpu;
>  static unsigned			wakeup_prio = -1;
>  static int			wakeup_rt;
>  static int			wakeup_dl;
> +static int			tracing_dl = 0;

Get rid of the ' = 0', its implicit to all static and global variables
that are not given any value.

>  
>  static arch_spinlock_t wakeup_lock =
>  	(arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
> @@ -438,6 +439,7 @@ static void __wakeup_reset(struct trace_array *tr)
>  {
>  	wakeup_cpu = -1;
>  	wakeup_prio = -1;
> +	tracing_dl = 0;
>  
>  	if (wakeup_task)
>  		put_task_struct(wakeup_task);
> @@ -481,9 +483,9 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
>  	 *    sched_rt class;
>  	 *  - wakeup_dl handles tasks belonging to sched_dl class only.
>  	 */
> -	if ((wakeup_dl && !dl_task(p)) ||
> +	if (tracing_dl || (wakeup_dl && !dl_task(p)) ||
>  	    (wakeup_rt && !dl_task(p) && !rt_task(p)) ||
> -	    (p->prio >= wakeup_prio || p->prio >= current->prio))
> +	    (!dl_task(p) && (p->prio >= wakeup_prio || p->prio >= current->prio)))
>  		return;
>  
>  	pc = preempt_count();
> @@ -495,7 +497,8 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
>  	arch_spin_lock(&wakeup_lock);
>  
>  	/* check for races. */
> -	if (!tracer_enabled || (!dl_task(p) && p->prio >= wakeup_prio))
> +	if (!tracer_enabled || tracing_dl ||
> +	    (!dl_task(p) && p->prio >= wakeup_prio))
>  		goto out_locked;
>  
>  	/* reset the trace */
> @@ -505,6 +508,15 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
>  	wakeup_current_cpu = wakeup_cpu;
>  	wakeup_prio = p->prio;
>  
> +	/*
> +	 * Once you start tracing a -deadline task, don't bother tracing
> +	 * another task until the first one wakes up.
> +	 */
> +	if (dl_task(p))
> +		tracing_dl = 1;
> +	else
> +		tracing_dl = 0;

Do we need the else statement? I would think the only way to get here
is if tracing_dl is already set to zero.

-- Steve

> +
>  	wakeup_task = p;
>  	get_task_struct(wakeup_task);
>  
> @@ -700,10 +712,18 @@ static struct tracer wakeup_dl_tracer __read_mostly =
>  	.start		= wakeup_tracer_start,
>  	.stop		= wakeup_tracer_stop,
>  	.wait_pipe	= poll_wait_pipe,
> -	.print_max	= 1,
> +	.print_max	= true,
> +	.print_header	= wakeup_print_header,
> +	.print_line	= wakeup_print_line,
> +	.flags		= &tracer_flags,
> +	.set_flag	= wakeup_set_flag,
> +	.flag_changed	= wakeup_flag_changed,
>  #ifdef CONFIG_FTRACE_SELFTEST
>  	.selftest    = trace_selftest_startup_wakeup,
>  #endif
> +	.open		= wakeup_trace_open,
> +	.close		= wakeup_trace_close,
> +	.use_max_tr	= true,
>  };
>  
>  __init static int init_wakeup_tracer(void)
> -----------------------------------
> 
> Makes sense? :)
> 
> Thanks,
> 
> - Juri


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 02/14] sched: add extended scheduling interface. (new ABI)
  2013-11-27 14:13         ` Peter Zijlstra
@ 2013-11-27 14:17           ` Ingo Molnar
  2013-11-28 11:14             ` Juri Lelli
  0 siblings, 1 reply; 81+ messages in thread
From: Ingo Molnar @ 2013-11-27 14:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, tglx, mingo, rostedt, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield,
	Andrew Morton, Linus Torvalds


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Nov 27, 2013 at 03:01:43PM +0100, Ingo Molnar wrote:
> > > So the problem I see with this one is that because you're allowed to 
> > > call sched_setparam() or whatever it will be called next on another 
> > > task; a task can very easily fail its sched_getparam() call.
> > > 
> > > Suppose the application is 'old' and only supports a subset of the 
> > > fields; but its wants to get, modify and set its params. This will 
> > > work as long nothing will set anything it doesn't know about.
> > > 
> > > As soon as some external entity -- say a sysad using schedtool -- 
> > > sets a param field it doesn't support the get, modify, set routing 
> > > completely fails.
> > 
> > There are two approaches to this that I can see:
> > 
> > 1)
> > 
> > allow partial information to be returned to user-space, for existing 
> > input parameters. The new fields won't be displayed, but the tool 
> > doesn't know about them anyway so it's OK. The tool can still display 
> > all the other existing parameters.
> 
> But suppose a task simply wants to lower/raise its static (FIFO)
> priority and does:
> 
> sched_getparam(&params);
> params.prio += 1;
> sched_setparam(&params);
> 
> If anything outside of the known param fields was set, we just silently
> lost it, for the setparam() call will fill out 0s for the unprovided
> fields.
> 
> > 2)
> > 
> > Return -ENOSYS if the 'extra' fields are nonzero. In this case the 
> > usual case of old tooling + new kernel will still work just fine, 
> > because old tooling won't set the new fields to any non-default 
> > (nonzero) values. In the 'mixed' case old tooling will not be able to 
> > change/display those fields.
> > 
> > I tend to lean towards #1. What do you think?
> 
> As per the above that can result in silent unexpected behavioural
> changes.
> 
> I'd much rather be explicit and break hard; so 2).
> 
> So mixing new tools (schedtool, chrt etc) and old apps will give pain,
> but at least not silent surprises.

You are right, I concur.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 08/14] sched: add latency tracing for -deadline tasks.
  2013-11-27 14:16       ` Steven Rostedt
@ 2013-11-27 14:19         ` Juri Lelli
  2013-11-27 14:26         ` Peter Zijlstra
  1 sibling, 0 replies; 81+ messages in thread
From: Juri Lelli @ 2013-11-27 14:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: peterz, tglx, mingo, oleg, fweisbec, darren, johan.eker, p.faure,
	linux-kernel, claudio, michael, fchecconi, tommaso.cucinotta,
	nicola.manica, luca.abeni, dhaval.giani, hgu1972, paulmck,
	raistlin, insop.song, liming.wang, jkacur, harald.gustafsson,
	vincent.guittot, bruce.ashfield

On 11/27/2013 03:16 PM, Steven Rostedt wrote:
> On Wed, 27 Nov 2013 14:43:45 +0100
> Juri Lelli <juri.lelli@gmail.com> wrote:
> diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
>> index f76f8d6..ad94604 100644
>> --- a/kernel/trace/trace_selftest.c
>> +++ b/kernel/trace/trace_selftest.c
>> @@ -1023,16 +1023,16 @@ trace_selftest_startup_nop(struct tracer *trace, struct trace_array *tr)
>>  static int trace_wakeup_test_thread(void *data)
>>  {
>>  	/* Make this a -deadline thread */
>> -	struct sched_param2 paramx = {
>> +	static const struct sched_param2 param = {
>>  		.sched_priority = 0,
>> +		.sched_flags = 0,
>>  		.sched_runtime = 100000ULL,
>>  		.sched_deadline = 10000000ULL,
>>  		.sched_period = 10000000ULL
>> -		.sched_flags = 0
> 
> Assigning structures like this, you don't need to set the zero fields.
> all fields not explicitly stated, are set to zero.
>

Right.

>>  	};
>>  	struct completion *x = data;
>>  
>>
>> --------------------------------
>>  kernel/trace/trace_sched_wakeup.c |   28 ++++++++++++++++++++++++----
>>  1 file changed, 24 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
>> index 1457fb1..090c4d9 100644
>> --- a/kernel/trace/trace_sched_wakeup.c
>> +++ b/kernel/trace/trace_sched_wakeup.c
>> @@ -28,6 +28,7 @@ static int			wakeup_current_cpu;
>>  static unsigned			wakeup_prio = -1;
>>  static int			wakeup_rt;
>>  static int			wakeup_dl;
>> +static int			tracing_dl = 0;
> 
> Get rid of the ' = 0', its implicit to all static and global variables
> that are not given any value.
> 

And right.

>>  
>>  static arch_spinlock_t wakeup_lock =
>>  	(arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
>> @@ -438,6 +439,7 @@ static void __wakeup_reset(struct trace_array *tr)
>>  {
>>  	wakeup_cpu = -1;
>>  	wakeup_prio = -1;
>> +	tracing_dl = 0;
>>  
>>  	if (wakeup_task)
>>  		put_task_struct(wakeup_task);
>> @@ -481,9 +483,9 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
>>  	 *    sched_rt class;
>>  	 *  - wakeup_dl handles tasks belonging to sched_dl class only.
>>  	 */
>> -	if ((wakeup_dl && !dl_task(p)) ||
>> +	if (tracing_dl || (wakeup_dl && !dl_task(p)) ||
>>  	    (wakeup_rt && !dl_task(p) && !rt_task(p)) ||
>> -	    (p->prio >= wakeup_prio || p->prio >= current->prio))
>> +	    (!dl_task(p) && (p->prio >= wakeup_prio || p->prio >= current->prio)))
>>  		return;
>>  
>>  	pc = preempt_count();
>> @@ -495,7 +497,8 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
>>  	arch_spin_lock(&wakeup_lock);
>>  
>>  	/* check for races. */
>> -	if (!tracer_enabled || (!dl_task(p) && p->prio >= wakeup_prio))
>> +	if (!tracer_enabled || tracing_dl ||
>> +	    (!dl_task(p) && p->prio >= wakeup_prio))
>>  		goto out_locked;
>>  
>>  	/* reset the trace */
>> @@ -505,6 +508,15 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
>>  	wakeup_current_cpu = wakeup_cpu;
>>  	wakeup_prio = p->prio;
>>  
>> +	/*
>> +	 * Once you start tracing a -deadline task, don't bother tracing
>> +	 * another task until the first one wakes up.
>> +	 */
>> +	if (dl_task(p))
>> +		tracing_dl = 1;
>> +	else
>> +		tracing_dl = 0;
> 
> Do we need the else statement? I would think the only way to get here
> is if tracing_dl is already set to zero.
> 

No, indeed.

Thanks,

- Juri

>> +
>>  	wakeup_task = p;
>>  	get_task_struct(wakeup_task);
>>  
>> @@ -700,10 +712,18 @@ static struct tracer wakeup_dl_tracer __read_mostly =
>>  	.start		= wakeup_tracer_start,
>>  	.stop		= wakeup_tracer_stop,
>>  	.wait_pipe	= poll_wait_pipe,
>> -	.print_max	= 1,
>> +	.print_max	= true,
>> +	.print_header	= wakeup_print_header,
>> +	.print_line	= wakeup_print_line,
>> +	.flags		= &tracer_flags,
>> +	.set_flag	= wakeup_set_flag,
>> +	.flag_changed	= wakeup_flag_changed,
>>  #ifdef CONFIG_FTRACE_SELFTEST
>>  	.selftest    = trace_selftest_startup_wakeup,
>>  #endif
>> +	.open		= wakeup_trace_open,
>> +	.close		= wakeup_trace_close,
>> +	.use_max_tr	= true,
>>  };
>>  
>>  __init static int init_wakeup_tracer(void)
>> -----------------------------------
>>
>> Makes sense? :)
>>
>> Thanks,
>>
>> - Juri
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 08/14] sched: add latency tracing for -deadline tasks.
  2013-11-27 14:16       ` Steven Rostedt
  2013-11-27 14:19         ` Juri Lelli
@ 2013-11-27 14:26         ` Peter Zijlstra
  2013-11-27 14:34           ` Ingo Molnar
  2013-11-27 15:00           ` Steven Rostedt
  1 sibling, 2 replies; 81+ messages in thread
From: Peter Zijlstra @ 2013-11-27 14:26 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Juri Lelli, tglx, mingo, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield

On Wed, Nov 27, 2013 at 09:16:47AM -0500, Steven Rostedt wrote:
> On Wed, 27 Nov 2013 14:43:45 +0100
> Juri Lelli <juri.lelli@gmail.com> wrote:
> diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
> > index f76f8d6..ad94604 100644
> > --- a/kernel/trace/trace_selftest.c
> > +++ b/kernel/trace/trace_selftest.c
> > @@ -1023,16 +1023,16 @@ trace_selftest_startup_nop(struct tracer *trace, struct trace_array *tr)
> >  static int trace_wakeup_test_thread(void *data)
> >  {
> >  	/* Make this a -deadline thread */
> > -	struct sched_param2 paramx = {
> > +	static const struct sched_param2 param = {
> >  		.sched_priority = 0,
> > +		.sched_flags = 0,
> >  		.sched_runtime = 100000ULL,
> >  		.sched_deadline = 10000000ULL,
> >  		.sched_period = 10000000ULL
> > -		.sched_flags = 0
> 
> Assigning structures like this, you don't need to set the zero fields.
> all fields not explicitly stated, are set to zero.

Only because its static. Otherwise unnamed members have indeterminate
value after initialization.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 08/14] sched: add latency tracing for -deadline tasks.
  2013-11-27 14:26         ` Peter Zijlstra
@ 2013-11-27 14:34           ` Ingo Molnar
  2013-11-27 14:58             ` Peter Zijlstra
  2013-11-27 15:00           ` Steven Rostedt
  1 sibling, 1 reply; 81+ messages in thread
From: Ingo Molnar @ 2013-11-27 14:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Juri Lelli, tglx, mingo, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Nov 27, 2013 at 09:16:47AM -0500, Steven Rostedt wrote:
> > On Wed, 27 Nov 2013 14:43:45 +0100
> > Juri Lelli <juri.lelli@gmail.com> wrote:
> > diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
> > > index f76f8d6..ad94604 100644
> > > --- a/kernel/trace/trace_selftest.c
> > > +++ b/kernel/trace/trace_selftest.c
> > > @@ -1023,16 +1023,16 @@ trace_selftest_startup_nop(struct tracer *trace, struct trace_array *tr)
> > >  static int trace_wakeup_test_thread(void *data)
> > >  {
> > >  	/* Make this a -deadline thread */
> > > -	struct sched_param2 paramx = {
> > > +	static const struct sched_param2 param = {
> > >  		.sched_priority = 0,
> > > +		.sched_flags = 0,
> > >  		.sched_runtime = 100000ULL,
> > >  		.sched_deadline = 10000000ULL,
> > >  		.sched_period = 10000000ULL
> > > -		.sched_flags = 0
> > 
> > Assigning structures like this, you don't need to set the zero fields.
> > all fields not explicitly stated, are set to zero.
> 
> Only because its static. Otherwise unnamed members have indeterminate
> value after initialization.

I think for 'struct' C will initialize them to zero, even if they are 
not mentioned and even if they are on the stack.

It will only be indeterminate when it's not initialized at all.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 08/14] sched: add latency tracing for -deadline tasks.
  2013-11-27 14:34           ` Ingo Molnar
@ 2013-11-27 14:58             ` Peter Zijlstra
  2013-11-27 15:35               ` Ingo Molnar
  2013-11-27 15:42               ` Ingo Molnar
  0 siblings, 2 replies; 81+ messages in thread
From: Peter Zijlstra @ 2013-11-27 14:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Juri Lelli, tglx, mingo, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield

On Wed, Nov 27, 2013 at 03:34:35PM +0100, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Wed, Nov 27, 2013 at 09:16:47AM -0500, Steven Rostedt wrote:
> > > On Wed, 27 Nov 2013 14:43:45 +0100
> > > Juri Lelli <juri.lelli@gmail.com> wrote:
> > > diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
> > > > index f76f8d6..ad94604 100644
> > > > --- a/kernel/trace/trace_selftest.c
> > > > +++ b/kernel/trace/trace_selftest.c
> > > > @@ -1023,16 +1023,16 @@ trace_selftest_startup_nop(struct tracer *trace, struct trace_array *tr)
> > > >  static int trace_wakeup_test_thread(void *data)
> > > >  {
> > > >  	/* Make this a -deadline thread */
> > > > -	struct sched_param2 paramx = {
> > > > +	static const struct sched_param2 param = {
> > > >  		.sched_priority = 0,
> > > > +		.sched_flags = 0,
> > > >  		.sched_runtime = 100000ULL,
> > > >  		.sched_deadline = 10000000ULL,
> > > >  		.sched_period = 10000000ULL
> > > > -		.sched_flags = 0
> > > 
> > > Assigning structures like this, you don't need to set the zero fields.
> > > all fields not explicitly stated, are set to zero.
> > 
> > Only because its static. Otherwise unnamed members have indeterminate
> > value after initialization.
> 
> I think for 'struct' C will initialize them to zero, even if they are 
> not mentioned and even if they are on the stack.
> 
> It will only be indeterminate when it's not initialized at all.

Language spec: ISO/IEC 9899:1999 (aka C99) section 6.7.8 point 9
says:

9. Except where explicitly stated otherwise, for the purpose of this
subclause unnamed members of objects of structure and union type do no
participate in initialization. Unnamed members of structure objects have
indeterminate value even after initialization.

Later points (notably 21) make such an exception for aggregate objects
of static storage.

Of course, its entirely possible I read the thing wrong; its 31 points
detailing the initialization of objects.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 08/14] sched: add latency tracing for -deadline tasks.
  2013-11-27 14:26         ` Peter Zijlstra
  2013-11-27 14:34           ` Ingo Molnar
@ 2013-11-27 15:00           ` Steven Rostedt
  1 sibling, 0 replies; 81+ messages in thread
From: Steven Rostedt @ 2013-11-27 15:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, tglx, mingo, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield

On Wed, 27 Nov 2013 15:26:49 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Nov 27, 2013 at 09:16:47AM -0500, Steven Rostedt wrote:
> > On Wed, 27 Nov 2013 14:43:45 +0100
> > Juri Lelli <juri.lelli@gmail.com> wrote:
> > diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
> > > index f76f8d6..ad94604 100644
> > > --- a/kernel/trace/trace_selftest.c
> > > +++ b/kernel/trace/trace_selftest.c
> > > @@ -1023,16 +1023,16 @@ trace_selftest_startup_nop(struct tracer *trace, struct trace_array *tr)
> > >  static int trace_wakeup_test_thread(void *data)
> > >  {
> > >  	/* Make this a -deadline thread */
> > > -	struct sched_param2 paramx = {
> > > +	static const struct sched_param2 param = {
> > >  		.sched_priority = 0,
> > > +		.sched_flags = 0,
> > >  		.sched_runtime = 100000ULL,
> > >  		.sched_deadline = 10000000ULL,
> > >  		.sched_period = 10000000ULL
> > > -		.sched_flags = 0
> > 
> > Assigning structures like this, you don't need to set the zero fields.
> > all fields not explicitly stated, are set to zero.
> 
> Only because its static. Otherwise unnamed members have indeterminate
> value after initialization.

Hmm, I wonder if there's a gcc extension that guarantees it. Because I
could have sworn I've seen initialization of structures that expected
zero'd behavior elsewhere in the kernel.

-- Steve

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 08/14] sched: add latency tracing for -deadline tasks.
  2013-11-27 14:58             ` Peter Zijlstra
@ 2013-11-27 15:35               ` Ingo Molnar
  2013-11-27 15:40                 ` Peter Zijlstra
  2013-11-27 15:42               ` Ingo Molnar
  1 sibling, 1 reply; 81+ messages in thread
From: Ingo Molnar @ 2013-11-27 15:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Juri Lelli, tglx, mingo, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Nov 27, 2013 at 03:34:35PM +0100, Ingo Molnar wrote:
> > 
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > On Wed, Nov 27, 2013 at 09:16:47AM -0500, Steven Rostedt wrote:
> > > > On Wed, 27 Nov 2013 14:43:45 +0100
> > > > Juri Lelli <juri.lelli@gmail.com> wrote:
> > > > diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
> > > > > index f76f8d6..ad94604 100644
> > > > > --- a/kernel/trace/trace_selftest.c
> > > > > +++ b/kernel/trace/trace_selftest.c
> > > > > @@ -1023,16 +1023,16 @@ trace_selftest_startup_nop(struct tracer *trace, struct trace_array *tr)
> > > > >  static int trace_wakeup_test_thread(void *data)
> > > > >  {
> > > > >  	/* Make this a -deadline thread */
> > > > > -	struct sched_param2 paramx = {
> > > > > +	static const struct sched_param2 param = {
> > > > >  		.sched_priority = 0,
> > > > > +		.sched_flags = 0,
> > > > >  		.sched_runtime = 100000ULL,
> > > > >  		.sched_deadline = 10000000ULL,
> > > > >  		.sched_period = 10000000ULL
> > > > > -		.sched_flags = 0
> > > > 
> > > > Assigning structures like this, you don't need to set the zero fields.
> > > > all fields not explicitly stated, are set to zero.
> > > 
> > > Only because its static. Otherwise unnamed members have indeterminate
> > > value after initialization.
> > 
> > I think for 'struct' C will initialize them to zero, even if they are 
> > not mentioned and even if they are on the stack.
> > 
> > It will only be indeterminate when it's not initialized at all.
> 
> Language spec: ISO/IEC 9899:1999 (aka C99) section 6.7.8 point 9
> says:
> 
> 9. Except where explicitly stated otherwise, for the purpose of this
> subclause unnamed members of objects of structure and union type do no
> participate in initialization. Unnamed members of structure objects have
> indeterminate value even after initialization.
> 
> Later points (notably 21) make such an exception for aggregate objects
> of static storage.
> 
> Of course, its entirely possible I read the thing wrong; its 31 points
> detailing the initialization of objects.

So why does GCC then behave like this:

triton:~> cat test.c

struct foo {
        int a;
        int b;
};

int main(void)
{
        struct foo x = { .a = 1 };

        return x.b;
}

triton:~> gcc -Wall -Wextra -O2 -o test test.c; ./test; echo $?
0

I'd expect -Wall -Wextra to warn about as trivial as the uninitialized 
variable use that you argue happens.

I'd also expect it to not return 0 but some random value on the stack 
(which is most likely not 0).

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 08/14] sched: add latency tracing for -deadline tasks.
  2013-11-27 15:35               ` Ingo Molnar
@ 2013-11-27 15:40                 ` Peter Zijlstra
  2013-11-27 15:46                   ` Ingo Molnar
  2013-11-27 16:24                   ` Oleg Nesterov
  0 siblings, 2 replies; 81+ messages in thread
From: Peter Zijlstra @ 2013-11-27 15:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Juri Lelli, tglx, mingo, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield

On Wed, Nov 27, 2013 at 04:35:19PM +0100, Ingo Molnar wrote:
> So why does GCC then behave like this:

I think because its a much saner behaviour; also it might still be the
spec actually says this, its a somewhat opaque text.

Anyway, yes GCC seems to behave as we 'expect' it to; I just can't find
the language spec actually guaranteeing this.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 08/14] sched: add latency tracing for -deadline tasks.
  2013-11-27 14:58             ` Peter Zijlstra
  2013-11-27 15:35               ` Ingo Molnar
@ 2013-11-27 15:42               ` Ingo Molnar
  1 sibling, 0 replies; 81+ messages in thread
From: Ingo Molnar @ 2013-11-27 15:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Juri Lelli, tglx, mingo, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield


and even this works:

triton:~> cat test.c

struct foo {
        int a;
        int b;
};

int litter_our_stack(void)
{
        volatile struct foo x = { .a = 1, .b = 2 };

        return x.b;
}

int test_code(void)
{
        volatile struct foo x = { .a = 1, /* .b not initialized explicitly */ };

        return x.b;
}

int main(void)
{
        return litter_our_stack() + test_code();
}

triton:~> gcc -Wall -Wextra -O0 -o test test.c; ./test; echo $?
2
triton:~> 


The result is 2, so x.b in test_code() got explicitly set to 0.

If it was uninitialized, not only would we expect a compiler warning, 
but we'd also get a result of '4'. (the two functions have the same 
stack depth, so 'litter_our_stack()' initializes .b to 2.)

-O0 guarantees that GCC just dumbly implements these functions without 
any optimizations.

Thanks,

	ngo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 08/14] sched: add latency tracing for -deadline tasks.
  2013-11-27 15:40                 ` Peter Zijlstra
@ 2013-11-27 15:46                   ` Ingo Molnar
  2013-11-27 15:54                     ` Peter Zijlstra
  2013-11-27 15:56                     ` Steven Rostedt
  2013-11-27 16:24                   ` Oleg Nesterov
  1 sibling, 2 replies; 81+ messages in thread
From: Ingo Molnar @ 2013-11-27 15:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Juri Lelli, tglx, mingo, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Nov 27, 2013 at 04:35:19PM +0100, Ingo Molnar wrote:
> > So why does GCC then behave like this:
> 
> I think because its a much saner behaviour; also it might still be the
> spec actually says this, its a somewhat opaque text.
> 
> Anyway, yes GCC seems to behave as we 'expect' it to; I just can't find
> the language spec actually guaranteeing this.

So from C99 standard §6.7.8 (Initialization)/21:

    "If there are fewer initializers in a brace-enclosed list than 
  there are elements or members of an aggregate, or fewer characters 
  in a string literal used to initialize an array of known size than 
  there are elements in the array, the remainder of the aggregate 
  shall be initialized implicitly the same as objects that have static 
  storage duration."

static initialization == zeroing in this case.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 08/14] sched: add latency tracing for -deadline tasks.
  2013-11-27 15:46                   ` Ingo Molnar
@ 2013-11-27 15:54                     ` Peter Zijlstra
  2013-11-27 15:56                     ` Steven Rostedt
  1 sibling, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2013-11-27 15:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Juri Lelli, tglx, mingo, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield

On Wed, Nov 27, 2013 at 04:46:00PM +0100, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Wed, Nov 27, 2013 at 04:35:19PM +0100, Ingo Molnar wrote:
> > > So why does GCC then behave like this:
> > 
> > I think because its a much saner behaviour; also it might still be the
> > spec actually says this, its a somewhat opaque text.
> > 
> > Anyway, yes GCC seems to behave as we 'expect' it to; I just can't find
> > the language spec actually guaranteeing this.
> 
> So from C99 standard §6.7.8 (Initialization)/21:
> 
>     "If there are fewer initializers in a brace-enclosed list than 
>   there are elements or members of an aggregate, or fewer characters 
>   in a string literal used to initialize an array of known size than 
>   there are elements in the array, the remainder of the aggregate 
>   shall be initialized implicitly the same as objects that have static 
>   storage duration."
> 
> static initialization == zeroing in this case.

Hurm for some reason I thought that was for static objects only.



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 08/14] sched: add latency tracing for -deadline tasks.
  2013-11-27 15:46                   ` Ingo Molnar
  2013-11-27 15:54                     ` Peter Zijlstra
@ 2013-11-27 15:56                     ` Steven Rostedt
  2013-11-27 16:01                       ` Peter Zijlstra
                                         ` (2 more replies)
  1 sibling, 3 replies; 81+ messages in thread
From: Steven Rostedt @ 2013-11-27 15:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Juri Lelli, tglx, mingo, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield

On Wed, 27 Nov 2013 16:46:00 +0100
Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Wed, Nov 27, 2013 at 04:35:19PM +0100, Ingo Molnar wrote:
> > > So why does GCC then behave like this:
> > 
> > I think because its a much saner behaviour; also it might still be the
> > spec actually says this, its a somewhat opaque text.
> > 
> > Anyway, yes GCC seems to behave as we 'expect' it to; I just can't find
> > the language spec actually guaranteeing this.
> 
> So from C99 standard §6.7.8 (Initialization)/21:
> 
>     "If there are fewer initializers in a brace-enclosed list than 
>   there are elements or members of an aggregate, or fewer characters 
>   in a string literal used to initialize an array of known size than 
>   there are elements in the array, the remainder of the aggregate 
>   shall be initialized implicitly the same as objects that have static 
>   storage duration."
> 
> static initialization == zeroing in this case.
> 

The confusion here is that the above looks to be talking about arrays.
But it really doesn't specify structures.

But searching the internet, it looks as though most people believe it
applies to structures, and any compiler that does otherwise will most
likely break applications.

That is, this looks to be one of the gray areas that the compiler
writers just happen to do what's most sane. And they probably assume
it's talking about structures as well, hence the lack of warnings.

It gets confusing, as the doc also shows:

struct { int a[3], b; } w[] = { { 1 }, 2 };

Then points out that w.a[0] is 1 and w.b[0] is 2, and all other
elements are zero.

-- Steve

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 08/14] sched: add latency tracing for -deadline tasks.
  2013-11-27 15:56                     ` Steven Rostedt
@ 2013-11-27 16:01                       ` Peter Zijlstra
  2013-11-27 16:02                       ` Steven Rostedt
  2013-11-27 16:13                       ` Ingo Molnar
  2 siblings, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2013-11-27 16:01 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Juri Lelli, tglx, mingo, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield

On Wed, Nov 27, 2013 at 10:56:16AM -0500, Steven Rostedt wrote:
> > So from C99 standard §6.7.8 (Initialization)/21:
> > 
> >     "If there are fewer initializers in a brace-enclosed list than 
> >   there are elements or members of an aggregate, or fewer characters 
> >   in a string literal used to initialize an array of known size than 
> >   there are elements in the array, the remainder of the aggregate 
> >   shall be initialized implicitly the same as objects that have static 
> >   storage duration."
> > 
> > static initialization == zeroing in this case.
> > 
> 
> The confusion here is that the above looks to be talking about arrays.
> But it really doesn't specify structures.
> 
> But searching the internet, it looks as though most people believe it
> applies to structures, and any compiler that does otherwise will most
> likely break applications.
> 
> That is, this looks to be one of the gray areas that the compiler
> writers just happen to do what's most sane. And they probably assume
> it's talking about structures as well, hence the lack of warnings.

16 says initializers for aggregate or union types are brace-enclosed
lists. A struct is an aggregate type.



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 08/14] sched: add latency tracing for -deadline tasks.
  2013-11-27 15:56                     ` Steven Rostedt
  2013-11-27 16:01                       ` Peter Zijlstra
@ 2013-11-27 16:02                       ` Steven Rostedt
  2013-11-27 16:13                       ` Ingo Molnar
  2 siblings, 0 replies; 81+ messages in thread
From: Steven Rostedt @ 2013-11-27 16:02 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, tglx, mingo, oleg,
	fweisbec, darren, johan.eker, p.faure, linux-kernel, claudio,
	michael, fchecconi, tommaso.cucinotta, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang, jkacur, harald.gustafsson, vincent.guittot,
	bruce.ashfield

On Wed, 27 Nov 2013 10:56:16 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Wed, 27 Nov 2013 16:46:00 +0100
> Ingo Molnar <mingo@kernel.org> wrote:

> > So from C99 standard §6.7.8 (Initialization)/21:
> > 
> >     "If there are fewer initializers in a brace-enclosed list than 
> >   there are elements or members of an aggregate, or fewer characters 
> >   in a string literal used to initialize an array of known size than 
> >   there are elements in the array, the remainder of the aggregate 
> >   shall be initialized implicitly the same as objects that have static 
> >   storage duration."
> > 
> > static initialization == zeroing in this case.
> > 
> 
> The confusion here is that the above looks to be talking about arrays.
> But it really doesn't specify structures.

Bah, re-reading it, it does seem to talk about structures.

I hate reading specs, they learn how to write documentation from
lawyers.

-- Steve

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 08/14] sched: add latency tracing for -deadline tasks.
  2013-11-27 15:56                     ` Steven Rostedt
  2013-11-27 16:01                       ` Peter Zijlstra
  2013-11-27 16:02                       ` Steven Rostedt
@ 2013-11-27 16:13                       ` Ingo Molnar
  2013-11-27 16:33                         ` Steven Rostedt
  2 siblings, 1 reply; 81+ messages in thread
From: Ingo Molnar @ 2013-11-27 16:13 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Juri Lelli, tglx, mingo, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield


* Steven Rostedt <rostedt@goodmis.org> wrote:

> On Wed, 27 Nov 2013 16:46:00 +0100
> Ingo Molnar <mingo@kernel.org> wrote:
> 
> > 
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > On Wed, Nov 27, 2013 at 04:35:19PM +0100, Ingo Molnar wrote:
> > > > So why does GCC then behave like this:
> > > 
> > > I think because its a much saner behaviour; also it might still be the
> > > spec actually says this, its a somewhat opaque text.
> > > 
> > > Anyway, yes GCC seems to behave as we 'expect' it to; I just can't find
> > > the language spec actually guaranteeing this.
> > 
> > So from C99 standard §6.7.8 (Initialization)/21:
> > 
> >     "If there are fewer initializers in a brace-enclosed list than 
> >   there are elements or members of an aggregate, or fewer characters 
> >   in a string literal used to initialize an array of known size than 
> >   there are elements in the array, the remainder of the aggregate 
> >   shall be initialized implicitly the same as objects that have static 
> >   storage duration."
> > 
> > static initialization == zeroing in this case.
> > 
> 
> The confusion here is that the above looks to be talking about arrays.
> But it really doesn't specify structures.

It talks about neither 'arrays' nor 'structures', it talks about 
'aggregates' - which is defined as _both_: 'structures and arrays'.

That's what compiler legalese brings you ;-)

> But searching the internet, it looks as though most people believe 
> it applies to structures, and any compiler that does otherwise will 
> most likely break applications.
> 
> That is, this looks to be one of the gray areas that the compiler 
> writers just happen to do what's most sane. And they probably assume 
> it's talking about structures as well, hence the lack of warnings.

I don't think it's grey, I think it's pretty well specified.

> It gets confusing, as the doc also shows:
> 
> struct { int a[3], b; } w[] = { { 1 }, 2 };

I don't think this is valid syntax, I think this needs one more set of 
braces:

 struct { int a[3], b; } w[] = { { { 1 }, 2 } };

> Then points out that w.a[0] is 1 and w.b[0] is 2, and all other 
> elements are zero.

If by 'w.a[0]' you mean 'w[0].a[0]', and if by 'w.b[0]' you mean 
'w[0].b' then yes, this comes from the definition and it's what I'd 
call 'obvious' initialization behavior.

What makes it confusing to you?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 08/14] sched: add latency tracing for -deadline tasks.
  2013-11-27 15:40                 ` Peter Zijlstra
  2013-11-27 15:46                   ` Ingo Molnar
@ 2013-11-27 16:24                   ` Oleg Nesterov
  1 sibling, 0 replies; 81+ messages in thread
From: Oleg Nesterov @ 2013-11-27 16:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Steven Rostedt, Juri Lelli, tglx, mingo, fweisbec,
	darren, johan.eker, p.faure, linux-kernel, claudio, michael,
	fchecconi, tommaso.cucinotta, nicola.manica, luca.abeni,
	dhaval.giani, hgu1972, paulmck, raistlin, insop.song,
	liming.wang, jkacur, harald.gustafsson, vincent.guittot,
	bruce.ashfield

On 11/27, Peter Zijlstra wrote:
>
> Anyway, yes GCC seems to behave as we 'expect' it to; I just can't find
> the language spec actually guaranteeing this.

And the kernel already assumes that gcc should do this, for example

	struct siginfo info = {};

in do_tkill().

(this initialization should prevent the leak of uninitialized members
 to userspace).

Oleg.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 08/14] sched: add latency tracing for -deadline tasks.
  2013-11-27 16:13                       ` Ingo Molnar
@ 2013-11-27 16:33                         ` Steven Rostedt
  0 siblings, 0 replies; 81+ messages in thread
From: Steven Rostedt @ 2013-11-27 16:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Juri Lelli, tglx, mingo, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield

On Wed, 27 Nov 2013 17:13:08 +0100
Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Wed, 27 Nov 2013 16:46:00 +0100
> > Ingo Molnar <mingo@kernel.org> wrote:
> > 
> > > 
> > > * Peter Zijlstra <peterz@infradead.org> wrote:
> > > 
> > > > On Wed, Nov 27, 2013 at 04:35:19PM +0100, Ingo Molnar wrote:
> > > > > So why does GCC then behave like this:
> > > > 
> > > > I think because its a much saner behaviour; also it might still be the
> > > > spec actually says this, its a somewhat opaque text.
> > > > 
> > > > Anyway, yes GCC seems to behave as we 'expect' it to; I just can't find
> > > > the language spec actually guaranteeing this.
> > > 
> > > So from C99 standard §6.7.8 (Initialization)/21:
> > > 
> > >     "If there are fewer initializers in a brace-enclosed list than 
> > >   there are elements or members of an aggregate, or fewer characters 
> > >   in a string literal used to initialize an array of known size than 
> > >   there are elements in the array, the remainder of the aggregate 
> > >   shall be initialized implicitly the same as objects that have static 
> > >   storage duration."
> > > 
> > > static initialization == zeroing in this case.
> > > 
> > 
> > The confusion here is that the above looks to be talking about arrays.
> > But it really doesn't specify structures.
> 
> It talks about neither 'arrays' nor 'structures', it talks about 
> 'aggregates' - which is defined as _both_: 'structures and arrays'.

Yeah, I misread it. I was reading the array section for awhile, and got
confused.


> 
> That's what compiler legalese brings you ;-)

Yep.

> 
> > But searching the internet, it looks as though most people believe 
> > it applies to structures, and any compiler that does otherwise will 
> > most likely break applications.
> > 
> > That is, this looks to be one of the gray areas that the compiler 
> > writers just happen to do what's most sane. And they probably assume 
> > it's talking about structures as well, hence the lack of warnings.
> 
> I don't think it's grey, I think it's pretty well specified.
> 
> > It gets confusing, as the doc also shows:
> > 
> > struct { int a[3], b; } w[] = { { 1 }, 2 };
> 
> I don't think this is valid syntax, I think this needs one more set of 
> braces:
> 
>  struct { int a[3], b; } w[] = { { { 1 }, 2 } };
> 
> > Then points out that w.a[0] is 1 and w.b[0] is 2, and all other 
> > elements are zero.
> 
> If by 'w.a[0]' you mean 'w[0].a[0]', and if by 'w.b[0]' you mean 
> 'w[0].b' then yes, this comes from the definition and it's what I'd 
> call 'obvious' initialization behavior.
> 
> What makes it confusing to you?

Well, because it's mixing arrays and structures, and I was on the
misconception that the paragraph was talking about just arrays.

Can I just have my turkey now?

-- Steve

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 02/14] sched: add extended scheduling interface. (new ABI)
  2013-11-27 14:17           ` Ingo Molnar
@ 2013-11-28 11:14             ` Juri Lelli
  2013-11-28 11:28               ` Peter Zijlstra
  0 siblings, 1 reply; 81+ messages in thread
From: Juri Lelli @ 2013-11-28 11:14 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield,
	Andrew Morton, Linus Torvalds

On 11/27/2013 03:17 PM, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Wed, Nov 27, 2013 at 03:01:43PM +0100, Ingo Molnar wrote:
>>>> So the problem I see with this one is that because you're allowed to 
>>>> call sched_setparam() or whatever it will be called next on another 
>>>> task; a task can very easily fail its sched_getparam() call.
>>>>
>>>> Suppose the application is 'old' and only supports a subset of the 
>>>> fields; but its wants to get, modify and set its params. This will 
>>>> work as long nothing will set anything it doesn't know about.
>>>>
>>>> As soon as some external entity -- say a sysad using schedtool -- 
>>>> sets a param field it doesn't support the get, modify, set routing 
>>>> completely fails.
>>>
>>> There are two approaches to this that I can see:
>>>
>>> 1)
>>>
>>> allow partial information to be returned to user-space, for existing 
>>> input parameters. The new fields won't be displayed, but the tool 
>>> doesn't know about them anyway so it's OK. The tool can still display 
>>> all the other existing parameters.
>>
>> But suppose a task simply wants to lower/raise its static (FIFO)
>> priority and does:
>>
>> sched_getparam(&params);
>> params.prio += 1;
>> sched_setparam(&params);
>>
>> If anything outside of the known param fields was set, we just silently
>> lost it, for the setparam() call will fill out 0s for the unprovided
>> fields.
>>
>>> 2)
>>>
>>> Return -ENOSYS if the 'extra' fields are nonzero. In this case the 
>>> usual case of old tooling + new kernel will still work just fine, 
>>> because old tooling won't set the new fields to any non-default 
>>> (nonzero) values. In the 'mixed' case old tooling will not be able to 
>>> change/display those fields.
>>>
>>> I tend to lean towards #1. What do you think?
>>
>> As per the above that can result in silent unexpected behavioural
>> changes.
>>
>> I'd much rather be explicit and break hard; so 2).
>>
>> So mixing new tools (schedtool, chrt etc) and old apps will give pain,
>> but at least not silent surprises.
> 
> You are right, I concur.
> 

Ok, I came up with what follows. I slightly tested it, but this is intended
to be my first try to understand what you asking for :).

Question:

I do (copying from perf_copy_attr())

	if (!size)		/* abi compat */
		size = SCHED_ATTR_SIZE_VER0;

Shouldn't I return some type of error in this case? We know that there is
nothing before VER0 to which we have to be bwd compatible to. And with
this my "old" schedtool still works since he doesn't know about size...

Comments?

Thanks,

- Juri

Author: Juri Lelli <juri.lelli@gmail.com>
Date:   Thu Nov 28 11:07:47 2013 +0100

    fixup: ABI changed, builds and runs

diff --git a/arch/arm/include/uapi/asm/unistd.h b/arch/arm/include/uapi/asm/unistd.h
index 6a4985e..3e5fafa 100644
--- a/arch/arm/include/uapi/asm/unistd.h
+++ b/arch/arm/include/uapi/asm/unistd.h
@@ -407,8 +407,8 @@
 #define __NR_kcmp			(__NR_SYSCALL_BASE+378)
 #define __NR_finit_module		(__NR_SYSCALL_BASE+379)
 #define __NR_sched_setscheduler2	(__NR_SYSCALL_BASE+380)
-#define __NR_sched_setparam2		(__NR_SYSCALL_BASE+381)
-#define __NR_sched_getparam2		(__NR_SYSCALL_BASE+382)
+#define __NR_sched_setattr		(__NR_SYSCALL_BASE+381)
+#define __NR_sched_getattr		(__NR_SYSCALL_BASE+382)
 
 /*
  * This may need to be greater than __NR_last_syscall+1 in order to
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index 0fb1ef7..0008788 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -390,8 +390,8 @@
 		CALL(sys_kcmp)
 		CALL(sys_finit_module)
 /* 380 */	CALL(sys_sched_setscheduler2)
-		CALL(sys_sched_setparam2)
-		CALL(sys_sched_getparam2)
+		CALL(sys_sched_setattr)
+		CALL(sys_sched_getattr)
 #ifndef syscalls_counted
 .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
 #define syscalls_counted
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index dfce815..0f13118 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -357,6 +357,6 @@
 348	i386	process_vm_writev	sys_process_vm_writev		compat_sys_process_vm_writev
 349	i386	kcmp			sys_kcmp
 350	i386	finit_module		sys_finit_module
-351	i386	sched_setparam2		sys_sched_setparam2
-352	i386	sched_getparam2		sys_sched_getparam2
+351	i386	sched_setattr		sys_sched_setattr
+352	i386	sched_getattr		sys_sched_getattr
 353	i386	sched_setscheduler2	sys_sched_setscheduler2
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 1849a70..0de09ef 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -320,8 +320,8 @@
 311	64	process_vm_writev	sys_process_vm_writev
 312	common	kcmp			sys_kcmp
 313	common	finit_module		sys_finit_module
-314	common	sched_setparam2		sys_sched_setparam2
-315	common	sched_getparam2		sys_sched_getparam2
+314	common	sched_setattr		sys_sched_setattr
+315	common	sched_getattr		sys_sched_getattr
 316	common	sched_setscheduler2	sys_sched_setscheduler2
 
 #
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0cbef29e..76c9bd8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -56,6 +56,8 @@ struct sched_param {
 
 #include <asm/processor.h>
 
+#define SCHED_ATTR_SIZE_VER0	40	/* sizeof first published struct */
+
 /*
  * Extended scheduling parameters data structure.
  *
@@ -67,7 +69,7 @@ struct sched_param {
  * the tasks may be useful for a wide variety of application fields, e.g.,
  * multimedia, streaming, automation and control, and many others.
  *
- * This variant (sched_param2) is meant at describing a so-called
+ * This variant (sched_attr) is meant at describing a so-called
  * sporadic time-constrained task. In such model a task is specified by:
  *  - the activation period or minimum instance inter-arrival time;
  *  - the maximum (or average, depending on the actual scheduling
@@ -80,28 +82,30 @@ struct sched_param {
  * than the runtime and must be completed by time instant t equal to
  * the instance activation time + the deadline.
  *
- * This is reflected by the actual fields of the sched_param2 structure:
+ * This is reflected by the actual fields of the sched_attr structure:
  *
  *  @sched_priority     task's priority (might still be useful)
+ *  @sched_flags        for customizing the scheduler behaviour
  *  @sched_deadline     representative of the task's deadline
  *  @sched_runtime      representative of the task's runtime
  *  @sched_period       representative of the task's period
- *  @sched_flags        for customizing the scheduler behaviour
  *
  * Given this task model, there are a multiplicity of scheduling algorithms
  * and policies, that can be used to ensure all the tasks will make their
  * timing constraints.
  *
- * @__unused		padding to allow future expansion without ABI issues
+ *  @size		size of the structure, for fwd/bwd compat.
  */
-struct sched_param2 {
+struct sched_attr {
 	int sched_priority;
 	unsigned int sched_flags;
 	u64 sched_runtime;
 	u64 sched_deadline;
 	u64 sched_period;
+	u32 size;
 
-	u64 __unused[12];
+	/* Align to u64. */
+	u32 __reserved;
 };
 
 struct exec_domain;
@@ -2011,7 +2015,7 @@ extern int sched_setscheduler(struct task_struct *, int,
 extern int sched_setscheduler_nocheck(struct task_struct *, int,
 				      const struct sched_param *);
 extern int sched_setscheduler2(struct task_struct *, int,
-				 const struct sched_param2 *);
+				 const struct sched_attr *);
 extern struct task_struct *idle_task(int cpu);
 /**
  * is_idle_task - is the specified task an idle task?
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 11f9d5b..fbdf44a 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -279,16 +279,16 @@ asmlinkage long sys_nice(int increment);
 asmlinkage long sys_sched_setscheduler(pid_t pid, int policy,
 					struct sched_param __user *param);
 asmlinkage long sys_sched_setscheduler2(pid_t pid, int policy,
-					struct sched_param2 __user *param);
+					struct sched_attr __user *attr);
 asmlinkage long sys_sched_setparam(pid_t pid,
 					struct sched_param __user *param);
-asmlinkage long sys_sched_setparam2(pid_t pid,
-					struct sched_param2 __user *param);
+asmlinkage long sys_sched_setattr(pid_t pid,
+					struct sched_attr __user *attr);
 asmlinkage long sys_sched_getscheduler(pid_t pid);
 asmlinkage long sys_sched_getparam(pid_t pid,
 					struct sched_param __user *param);
-asmlinkage long sys_sched_getparam2(pid_t pid,
-					struct sched_param2 __user *param);
+asmlinkage long sys_sched_getattr(pid_t pid,
+					struct sched_attr __user *attr);
 asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
 					unsigned long __user *user_mask_ptr);
 asmlinkage long sys_sched_getaffinity(pid_t pid, unsigned int len,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8257aa8..5904651 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3025,7 +3025,7 @@ static bool check_same_owner(struct task_struct *p)
 }
 
 static int __sched_setscheduler(struct task_struct *p, int policy,
-				const struct sched_param2 *param,
+				const struct sched_attr *param,
 				bool user)
 {
 	int retval, oldprio, oldpolicy = -1, on_rq, running;
@@ -3191,17 +3191,17 @@ recheck:
 int sched_setscheduler(struct task_struct *p, int policy,
 		       const struct sched_param *param)
 {
-	struct sched_param2 param2 = {
+	struct sched_attr attr = {
 		.sched_priority = param->sched_priority
 	};
-	return __sched_setscheduler(p, policy, &param2, true);
+	return __sched_setscheduler(p, policy, &attr, true);
 }
 EXPORT_SYMBOL_GPL(sched_setscheduler);
 
 int sched_setscheduler2(struct task_struct *p, int policy,
-			  const struct sched_param2 *param2)
+			  const struct sched_attr *attr)
 {
-	return __sched_setscheduler(p, policy, param2, true);
+	return __sched_setscheduler(p, policy, attr, true);
 }
 EXPORT_SYMBOL_GPL(sched_setscheduler2);
 
@@ -3221,10 +3221,10 @@ EXPORT_SYMBOL_GPL(sched_setscheduler2);
 int sched_setscheduler_nocheck(struct task_struct *p, int policy,
 			       const struct sched_param *param)
 {
-	struct sched_param2 param2 = {
+	struct sched_attr attr = {
 		.sched_priority = param->sched_priority
 	};
-	return __sched_setscheduler(p, policy, &param2, false);
+	return __sched_setscheduler(p, policy, &attr, false);
 }
 
 static int
@@ -3249,26 +3249,92 @@ do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)
 	return retval;
 }
 
+/*
+ * Mimics kerner/events/core.c perf_copy_attr().
+ */
+static int sched_copy_attr(struct sched_attr __user *uattr,
+			   struct sched_attr *attr)
+{
+	u32 size;
+	int ret;
+
+	if (!access_ok(VERIFY_WRITE, uattr, SCHED_ATTR_SIZE_VER0))
+		return -EFAULT;
+
+	/*
+	 * zero the full structure, so that a short copy will be nice.
+	 */
+	memset(attr, 0, sizeof(*attr));
+
+	ret = get_user(size, &uattr->size);
+	if (ret)
+		return ret;
+
+	if (size > PAGE_SIZE)	/* silly large */
+		goto err_size;
+
+	if (!size)		/* abi compat */
+		size = SCHED_ATTR_SIZE_VER0;
+
+	if (size < SCHED_ATTR_SIZE_VER0)
+		goto err_size;
+
+	/*
+	 * If we're handed a bigger struct than we know of,
+	 * ensure all the unknown bits are 0 - i.e. new
+	 * user-space does not rely on any kernel feature
+	 * extensions we dont know about yet.
+	 */
+	if (size > sizeof(*attr)) {
+		unsigned char __user *addr;
+		unsigned char __user *end;
+		unsigned char val;
+
+		addr = (void __user *)uattr + sizeof(*attr);
+		end  = (void __user *)uattr + size;
+
+		for (; addr < end; addr++) {
+			ret = get_user(val, addr);
+			if (ret)
+				return ret;
+			if (val)
+				goto err_size;
+		}
+		size = sizeof(*attr);
+	}
+
+	ret = copy_from_user(attr, uattr, size);
+	if (ret)
+		return -EFAULT;
+
+out:
+	return ret;
+
+err_size:
+	put_user(sizeof(*attr), &uattr->size);
+	ret = -E2BIG;
+	goto out;
+}
+
 static int
 do_sched_setscheduler2(pid_t pid, int policy,
-			 struct sched_param2 __user *param2)
+		       struct sched_attr __user *attr_uptr)
 {
-	struct sched_param2 lparam2;
+	struct sched_attr attr;
 	struct task_struct *p;
 	int retval;
 
-	if (!param2 || pid < 0)
+	if (!attr_uptr || pid < 0)
 		return -EINVAL;
 
-	memset(&lparam2, 0, sizeof(struct sched_param2));
-	if (copy_from_user(&lparam2, param2, sizeof(struct sched_param2)))
+	if (sched_copy_attr(attr_uptr, &attr))
 		return -EFAULT;
 
 	rcu_read_lock();
 	retval = -ESRCH;
 	p = find_process_by_pid(pid);
 	if (p != NULL)
-		retval = sched_setscheduler2(p, policy, &lparam2);
+		retval = sched_setscheduler2(p, policy, &attr);
 	rcu_read_unlock();
 
 	return retval;
@@ -3296,15 +3362,15 @@ SYSCALL_DEFINE3(sched_setscheduler, pid_t, pid, int, policy,
  * sys_sched_setscheduler2 - same as above, but with extended sched_param
  * @pid: the pid in question.
  * @policy: new policy (could use extended sched_param).
- * @param: structure containg the extended parameters.
+ * @attr: structure containg the extended parameters.
  */
 SYSCALL_DEFINE3(sched_setscheduler2, pid_t, pid, int, policy,
-		struct sched_param2 __user *, param2)
+		struct sched_attr __user *, attr)
 {
 	if (policy < 0)
 		return -EINVAL;
 
-	return do_sched_setscheduler2(pid, policy, param2);
+	return do_sched_setscheduler2(pid, policy, attr);
 }
 
 /**
@@ -3320,14 +3386,14 @@ SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
 }
 
 /**
- * sys_sched_setparam2 - same as above, but with extended sched_param
+ * sys_sched_setattr - same as above, but with extended sched_attr
  * @pid: the pid in question.
- * @param2: structure containing the extended parameters.
+ * @attr: structure containing the extended parameters.
  */
-SYSCALL_DEFINE2(sched_setparam2, pid_t, pid,
-		struct sched_param2 __user *, param2)
+SYSCALL_DEFINE2(sched_setattr, pid_t, pid,
+		struct sched_attr __user *, attr)
 {
-	return do_sched_setscheduler2(pid, -1, param2);
+	return do_sched_setscheduler2(pid, -1, attr);
 }
 
 /**
@@ -3401,19 +3467,21 @@ out_unlock:
 }
 
 /**
- * sys_sched_getparam2 - same as above, but with extended sched_param
+ * sys_sched_getattr - same as above, but with extended "sched_param"
  * @pid: the pid in question.
- * @param2: structure containing the extended parameters.
+ * @attr: structure containing the extended parameters.
  */
-SYSCALL_DEFINE2(sched_getparam2, pid_t, pid, struct sched_param2 __user *, param2)
+SYSCALL_DEFINE2(sched_getattr, pid_t, pid, struct sched_attr __user *, attr)
 {
-	struct sched_param2 lp;
+	struct sched_attr lp;
 	struct task_struct *p;
 	int retval;
 
-	if (!param2 || pid < 0)
+	if (!attr || pid < 0)
 		return -EINVAL;
 
+	memset(&lp, 0, sizeof(struct sched_attr));
+
 	rcu_read_lock();
 	p = find_process_by_pid(pid);
 	retval = -ESRCH;
@@ -3427,7 +3495,7 @@ SYSCALL_DEFINE2(sched_getparam2, pid_t, pid, struct sched_param2 __user *, param
 	lp.sched_priority = p->rt_priority;
 	rcu_read_unlock();
 
-	retval = copy_to_user(param2, &lp, sizeof(lp)) ? -EFAULT : 0;
+	retval = copy_to_user(attr, &lp, sizeof(lp)) ? -EFAULT : 0;
 	return retval;
 
 out_unlock:


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH 02/14] sched: add extended scheduling interface. (new ABI)
  2013-11-28 11:14             ` Juri Lelli
@ 2013-11-28 11:28               ` Peter Zijlstra
  2013-11-30 14:06                 ` Ingo Molnar
  0 siblings, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2013-11-28 11:28 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Ingo Molnar, tglx, mingo, rostedt, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield,
	Andrew Morton, Linus Torvalds

On Thu, Nov 28, 2013 at 12:14:03PM +0100, Juri Lelli wrote:
> +SYSCALL_DEFINE2(sched_getattr, pid_t, pid, struct sched_attr __user *, attr)
>  {
> -	struct sched_param2 lp;
> +	struct sched_attr lp;
>  	struct task_struct *p;
>  	int retval;
>  
> -	if (!param2 || pid < 0)
> +	if (!attr || pid < 0)
>  		return -EINVAL;
>  
> +	memset(&lp, 0, sizeof(struct sched_attr));
> +
>  	rcu_read_lock();
>  	p = find_process_by_pid(pid);
>  	retval = -ESRCH;
> @@ -3427,7 +3495,7 @@ SYSCALL_DEFINE2(sched_getparam2, pid_t, pid, struct sched_param2 __user *, param
>  	lp.sched_priority = p->rt_priority;
>  	rcu_read_unlock();
>  
> -	retval = copy_to_user(param2, &lp, sizeof(lp)) ? -EFAULT : 0;
> +	retval = copy_to_user(attr, &lp, sizeof(lp)) ? -EFAULT : 0;
>  	return retval;
>  
>  out_unlock:


So this side needs a bit more care; suppose the kernel has a larger attr
than userspace knows about.

What would make more sense; add another syscall argument with the
userspace sizeof(struct sched_attr), or expect userspace to initialize
attr->size to the right value before calling sched_getattr() ?

To me the extra argument makes more sense; that is:

  struct sched_attr attr;

  ret = sched_getattr(0, &attr, sizeof(attr));

seems like a saner thing than:

  struct sched_attr attr = { .size = sizeof(attr), };

  ret = sched_getattr(0, &attr);

Mostly because the former has a clear separation between input and
output arguments, whereas for the second form the attr argument is both
input and output.

Ingo?

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 02/14] sched: add extended scheduling interface. (new ABI)
  2013-11-28 11:28               ` Peter Zijlstra
@ 2013-11-30 14:06                 ` Ingo Molnar
  2013-12-03 16:13                   ` Juri Lelli
  0 siblings, 1 reply; 81+ messages in thread
From: Ingo Molnar @ 2013-11-30 14:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, tglx, mingo, rostedt, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield,
	Andrew Morton, Linus Torvalds


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Nov 28, 2013 at 12:14:03PM +0100, Juri Lelli wrote:
> > +SYSCALL_DEFINE2(sched_getattr, pid_t, pid, struct sched_attr __user *, attr)
> >  {
> > -	struct sched_param2 lp;
> > +	struct sched_attr lp;
> >  	struct task_struct *p;
> >  	int retval;
> >  
> > -	if (!param2 || pid < 0)
> > +	if (!attr || pid < 0)
> >  		return -EINVAL;
> >  
> > +	memset(&lp, 0, sizeof(struct sched_attr));
> > +
> >  	rcu_read_lock();
> >  	p = find_process_by_pid(pid);
> >  	retval = -ESRCH;
> > @@ -3427,7 +3495,7 @@ SYSCALL_DEFINE2(sched_getparam2, pid_t, pid, struct sched_param2 __user *, param
> >  	lp.sched_priority = p->rt_priority;
> >  	rcu_read_unlock();
> >  
> > -	retval = copy_to_user(param2, &lp, sizeof(lp)) ? -EFAULT : 0;
> > +	retval = copy_to_user(attr, &lp, sizeof(lp)) ? -EFAULT : 0;
> >  	return retval;
> >  
> >  out_unlock:
> 
> 
> So this side needs a bit more care; suppose the kernel has a larger attr
> than userspace knows about.
> 
> What would make more sense; add another syscall argument with the
> userspace sizeof(struct sched_attr), or expect userspace to initialize
> attr->size to the right value before calling sched_getattr() ?
> 
> To me the extra argument makes more sense; that is:
> 
>   struct sched_attr attr;
> 
>   ret = sched_getattr(0, &attr, sizeof(attr));
> 
> seems like a saner thing than:
> 
>   struct sched_attr attr = { .size = sizeof(attr), };
> 
>   ret = sched_getattr(0, &attr);
> 
> Mostly because the former has a clear separation between input and 
> output arguments, whereas for the second form the attr argument is 
> both input and output.
> 
> Ingo?

I suppose so - in the sys_perf_event_open() case we ran out of 
arguments, so attr::size was the only sane way to do it.

[ Btw., perf events side note: for completeness and for debugging we 
  probably want to add a sys_perf_event_get() method as well, to 
  recover the attributes of an existing event. Unidirectional APIs are 
  not very nice. ]

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 02/14] sched: add extended scheduling interface. (new ABI)
  2013-11-30 14:06                 ` Ingo Molnar
@ 2013-12-03 16:13                   ` Juri Lelli
  2013-12-03 16:41                     ` Steven Rostedt
  0 siblings, 1 reply; 81+ messages in thread
From: Juri Lelli @ 2013-12-03 16:13 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: tglx, mingo, rostedt, oleg, fweisbec, darren, johan.eker,
	p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield,
	Andrew Morton, Linus Torvalds

On 11/30/2013 03:06 PM, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Thu, Nov 28, 2013 at 12:14:03PM +0100, Juri Lelli wrote:
>>> +SYSCALL_DEFINE2(sched_getattr, pid_t, pid, struct sched_attr __user *, attr)
>>>  {
>>> -	struct sched_param2 lp;
>>> +	struct sched_attr lp;
>>>  	struct task_struct *p;
>>>  	int retval;
>>>  
>>> -	if (!param2 || pid < 0)
>>> +	if (!attr || pid < 0)
>>>  		return -EINVAL;
>>>  
>>> +	memset(&lp, 0, sizeof(struct sched_attr));
>>> +
>>>  	rcu_read_lock();
>>>  	p = find_process_by_pid(pid);
>>>  	retval = -ESRCH;
>>> @@ -3427,7 +3495,7 @@ SYSCALL_DEFINE2(sched_getparam2, pid_t, pid, struct sched_param2 __user *, param
>>>  	lp.sched_priority = p->rt_priority;
>>>  	rcu_read_unlock();
>>>  
>>> -	retval = copy_to_user(param2, &lp, sizeof(lp)) ? -EFAULT : 0;
>>> +	retval = copy_to_user(attr, &lp, sizeof(lp)) ? -EFAULT : 0;
>>>  	return retval;
>>>  
>>>  out_unlock:
>>
>>
>> So this side needs a bit more care; suppose the kernel has a larger attr
>> than userspace knows about.
>>
>> What would make more sense; add another syscall argument with the
>> userspace sizeof(struct sched_attr), or expect userspace to initialize
>> attr->size to the right value before calling sched_getattr() ?
>>
>> To me the extra argument makes more sense; that is:
>>
>>   struct sched_attr attr;
>>
>>   ret = sched_getattr(0, &attr, sizeof(attr));
>>
>> seems like a saner thing than:
>>
>>   struct sched_attr attr = { .size = sizeof(attr), };
>>
>>   ret = sched_getattr(0, &attr);
>>
>> Mostly because the former has a clear separation between input and 
>> output arguments, whereas for the second form the attr argument is 
>> both input and output.
>>
>> Ingo?
> 
> I suppose so - in the sys_perf_event_open() case we ran out of 
> arguments, so attr::size was the only sane way to do it.
> 

Ok, I modified it like this:

------------------------------------------------------------
Subject: [PATCH] fixup: add checks for sys_sched_getattr

Add an extra argument to the syscall with the userspace
sizeof(struct sched_attr) to be able to handle situations
when the kernel has a larger attr than userspace knows about.
---
 include/linux/syscalls.h |    3 ++-
 kernel/sched/core.c      |   55 ++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 53 insertions(+), 5 deletions(-)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index fbdf44a..45ce599 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -288,7 +288,8 @@ asmlinkage long sys_sched_getscheduler(pid_t pid);
 asmlinkage long sys_sched_getparam(pid_t pid,
 					struct sched_param __user *param);
 asmlinkage long sys_sched_getattr(pid_t pid,
-					struct sched_attr __user *attr);
+					struct sched_attr __user *attr,
+					unsigned int size);
 asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
 					unsigned long __user *user_mask_ptr);
 asmlinkage long sys_sched_getaffinity(pid_t pid, unsigned int len,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe755f7..b7d91c6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3507,7 +3507,7 @@ do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)
 }
 
 /*
- * Mimics kerner/events/core.c perf_copy_attr().
+ * Mimics kernel/events/core.c perf_copy_attr().
  */
 static int sched_copy_attr(struct sched_attr __user *uattr,
 			   struct sched_attr *attr)
@@ -3726,18 +3726,65 @@ out_unlock:
 	return retval;
 }
 
+static int sched_read_attr(struct sched_attr __user *uattr,
+			   struct sched_attr *attr,
+			   unsigned int size,
+			   unsigned int usize)
+{
+	int ret;
+
+	if (!access_ok(VERIFY_WRITE, uattr, SCHED_ATTR_SIZE_VER0))
+		return -EFAULT;
+
+	/*
+	 * zero the full structure, so that a short copy will be nice.
+	 */
+	memset(uattr, 0, sizeof(*uattr));
+
+	/*
+	 * If we're handed a smaller struct than we know of,
+	 * ensure all the unknown bits are 0 - i.e. old
+	 * user-space does not get uncomplete information.
+	 */
+	if (usize < sizeof(*attr)) {
+		unsigned char *addr;
+		unsigned char *end;
+
+		addr = (void *)attr + usize;
+		end  = (void *)attr + size;
+
+		for (; addr < end; addr++)
+			if (*addr)
+				goto err_size;
+	}
+
+	ret = copy_to_user(uattr, attr, usize);
+	if (ret)
+		return -EFAULT;
+
+out:
+	return ret;
+
+err_size:
+	ret = -E2BIG;
+	goto out;
+}
+
 /**
  * sys_sched_getattr - same as above, but with extended "sched_param"
  * @pid: the pid in question.
  * @attr: structure containing the extended parameters.
+ * @size: sizeof(attr) for fwd/bwd comp.
  */
-SYSCALL_DEFINE2(sched_getattr, pid_t, pid, struct sched_attr __user *, attr)
+SYSCALL_DEFINE3(sched_getattr, pid_t, pid, struct sched_attr __user *, attr,
+		unsigned int, size)
 {
 	struct sched_attr lp;
 	struct task_struct *p;
 	int retval;
 
-	if (!attr || pid < 0)
+	if (!attr || pid < 0 || size > PAGE_SIZE ||
+	    size < SCHED_ATTR_SIZE_VER0)
 		return -EINVAL;
 
 	memset(&lp, 0, sizeof(struct sched_attr));
@@ -3758,7 +3805,7 @@ SYSCALL_DEFINE2(sched_getattr, pid_t, pid, struct sched_attr __user *, attr)
 		lp.sched_priority = p->rt_priority;
 	rcu_read_unlock();
 
-	retval = copy_to_user(attr, &lp, sizeof(lp)) ? -EFAULT : 0;
+	retval = sched_read_attr(attr, &lp, sizeof(lp), size);
 	return retval;
 
 out_unlock:
-----------------------------------------------------------

Do we need to make sched_setattr symmetrical, or, since the user has
to fill the fields anyway, we leave it as is?

Thanks,

- Juri

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH 02/14] sched: add extended scheduling interface. (new ABI)
  2013-12-03 16:13                   ` Juri Lelli
@ 2013-12-03 16:41                     ` Steven Rostedt
  2013-12-03 17:04                       ` Juri Lelli
  0 siblings, 1 reply; 81+ messages in thread
From: Steven Rostedt @ 2013-12-03 16:41 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Ingo Molnar, Peter Zijlstra, tglx, mingo, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield,
	Andrew Morton, Linus Torvalds

On Tue, 03 Dec 2013 17:13:44 +0100
Juri Lelli <juri.lelli@gmail.com> wrote:

> On 11/30/2013 03:06 PM, Ingo Molnar wrote:
> > 
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> >> On Thu, Nov 28, 2013 at 12:14:03PM +0100, Juri Lelli wrote:
> >>> +SYSCALL_DEFINE2(sched_getattr, pid_t, pid, struct sched_attr __user *, attr)
> >>>  {
> >>> -	struct sched_param2 lp;
> >>> +	struct sched_attr lp;
> >>>  	struct task_struct *p;
> >>>  	int retval;
> >>>  
> >>> -	if (!param2 || pid < 0)
> >>> +	if (!attr || pid < 0)
> >>>  		return -EINVAL;
> >>>  
> >>> +	memset(&lp, 0, sizeof(struct sched_attr));
> >>> +
> >>>  	rcu_read_lock();
> >>>  	p = find_process_by_pid(pid);
> >>>  	retval = -ESRCH;
> >>> @@ -3427,7 +3495,7 @@ SYSCALL_DEFINE2(sched_getparam2, pid_t, pid, struct sched_param2 __user *, param
> >>>  	lp.sched_priority = p->rt_priority;
> >>>  	rcu_read_unlock();
> >>>  
> >>> -	retval = copy_to_user(param2, &lp, sizeof(lp)) ? -EFAULT : 0;
> >>> +	retval = copy_to_user(attr, &lp, sizeof(lp)) ? -EFAULT : 0;
> >>>  	return retval;
> >>>  
> >>>  out_unlock:
> >>
> >>
> >> So this side needs a bit more care; suppose the kernel has a larger attr
> >> than userspace knows about.
> >>
> >> What would make more sense; add another syscall argument with the
> >> userspace sizeof(struct sched_attr), or expect userspace to initialize
> >> attr->size to the right value before calling sched_getattr() ?
> >>
> >> To me the extra argument makes more sense; that is:
> >>
> >>   struct sched_attr attr;
> >>
> >>   ret = sched_getattr(0, &attr, sizeof(attr));
> >>
> >> seems like a saner thing than:
> >>
> >>   struct sched_attr attr = { .size = sizeof(attr), };
> >>
> >>   ret = sched_getattr(0, &attr);
> >>
> >> Mostly because the former has a clear separation between input and 
> >> output arguments, whereas for the second form the attr argument is 
> >> both input and output.
> >>
> >> Ingo?
> > 
> > I suppose so - in the sys_perf_event_open() case we ran out of 
> > arguments, so attr::size was the only sane way to do it.
> > 
> 
> Ok, I modified it like this:
> 
> ------------------------------------------------------------
> Subject: [PATCH] fixup: add checks for sys_sched_getattr
> 
> Add an extra argument to the syscall with the userspace
> sizeof(struct sched_attr) to be able to handle situations
> when the kernel has a larger attr than userspace knows about.
> ---
>  include/linux/syscalls.h |    3 ++-
>  kernel/sched/core.c      |   55 ++++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 53 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index fbdf44a..45ce599 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -288,7 +288,8 @@ asmlinkage long sys_sched_getscheduler(pid_t pid);
>  asmlinkage long sys_sched_getparam(pid_t pid,
>  					struct sched_param __user *param);
>  asmlinkage long sys_sched_getattr(pid_t pid,
> -					struct sched_attr __user *attr);
> +					struct sched_attr __user *attr,
> +					unsigned int size);
>  asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
>  					unsigned long __user *user_mask_ptr);
>  asmlinkage long sys_sched_getaffinity(pid_t pid, unsigned int len,
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fe755f7..b7d91c6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3507,7 +3507,7 @@ do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)
>  }
>  
>  /*
> - * Mimics kerner/events/core.c perf_copy_attr().
> + * Mimics kernel/events/core.c perf_copy_attr().
>   */
>  static int sched_copy_attr(struct sched_attr __user *uattr,
>  			   struct sched_attr *attr)
> @@ -3726,18 +3726,65 @@ out_unlock:
>  	return retval;
>  }
>  
> +static int sched_read_attr(struct sched_attr __user *uattr,
> +			   struct sched_attr *attr,
> +			   unsigned int size,
> +			   unsigned int usize)
> +{
> +	int ret;
> +
> +	if (!access_ok(VERIFY_WRITE, uattr, SCHED_ATTR_SIZE_VER0))

We want to verify from uattr to usize, right? As that is what we are
writing to.

> +		return -EFAULT;
> +
> +	/*
> +	 * zero the full structure, so that a short copy will be nice.
> +	 */
> +	memset(uattr, 0, sizeof(*uattr));

Wait! We can't write to user space like this, not to mention that usize
may even be less than sizeof(struct sched_attr).

-- Steve


> +
> +	/*
> +	 * If we're handed a smaller struct than we know of,
> +	 * ensure all the unknown bits are 0 - i.e. old
> +	 * user-space does not get uncomplete information.
> +	 */
> +	if (usize < sizeof(*attr)) {
> +		unsigned char *addr;
> +		unsigned char *end;
> +
> +		addr = (void *)attr + usize;
> +		end  = (void *)attr + size;
> +
> +		for (; addr < end; addr++)
> +			if (*addr)
> +				goto err_size;
> +	}
> +
> +	ret = copy_to_user(uattr, attr, usize);
> +	if (ret)
> +		return -EFAULT;
> +
> +out:
> +	return ret;
> +
> +err_size:
> +	ret = -E2BIG;
> +	goto out;
> +}
> +
>  /**
>   * sys_sched_getattr - same as above, but with extended "sched_param"
>   * @pid: the pid in question.
>   * @attr: structure containing the extended parameters.
> + * @size: sizeof(attr) for fwd/bwd comp.
>   */
> -SYSCALL_DEFINE2(sched_getattr, pid_t, pid, struct sched_attr __user *, attr)
> +SYSCALL_DEFINE3(sched_getattr, pid_t, pid, struct sched_attr __user *, attr,
> +		unsigned int, size)
>  {
>  	struct sched_attr lp;
>  	struct task_struct *p;
>  	int retval;
>  
> -	if (!attr || pid < 0)
> +	if (!attr || pid < 0 || size > PAGE_SIZE ||
> +	    size < SCHED_ATTR_SIZE_VER0)
>  		return -EINVAL;
>  
>  	memset(&lp, 0, sizeof(struct sched_attr));
> @@ -3758,7 +3805,7 @@ SYSCALL_DEFINE2(sched_getattr, pid_t, pid, struct sched_attr __user *, attr)
>  		lp.sched_priority = p->rt_priority;
>  	rcu_read_unlock();
>  
> -	retval = copy_to_user(attr, &lp, sizeof(lp)) ? -EFAULT : 0;
> +	retval = sched_read_attr(attr, &lp, sizeof(lp), size);
>  	return retval;
>  
>  out_unlock:
> -----------------------------------------------------------
> 
> Do we need to make sched_setattr symmetrical, or, since the user has
> to fill the fields anyway, we leave it as is?
> 
> Thanks,
> 
> - Juri


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH 02/14] sched: add extended scheduling interface. (new ABI)
  2013-12-03 16:41                     ` Steven Rostedt
@ 2013-12-03 17:04                       ` Juri Lelli
  0 siblings, 0 replies; 81+ messages in thread
From: Juri Lelli @ 2013-12-03 17:04 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Peter Zijlstra, tglx, mingo, oleg, fweisbec, darren,
	johan.eker, p.faure, linux-kernel, claudio, michael, fchecconi,
	tommaso.cucinotta, nicola.manica, luca.abeni, dhaval.giani,
	hgu1972, paulmck, raistlin, insop.song, liming.wang, jkacur,
	harald.gustafsson, vincent.guittot, bruce.ashfield,
	Andrew Morton, Linus Torvalds

On 12/03/2013 05:41 PM, Steven Rostedt wrote:
> On Tue, 03 Dec 2013 17:13:44 +0100
> Juri Lelli <juri.lelli@gmail.com> wrote:
> 
>> On 11/30/2013 03:06 PM, Ingo Molnar wrote:
>>>
>>> * Peter Zijlstra <peterz@infradead.org> wrote:
>>>
>>>> On Thu, Nov 28, 2013 at 12:14:03PM +0100, Juri Lelli wrote:
>>>>> +SYSCALL_DEFINE2(sched_getattr, pid_t, pid, struct sched_attr __user *, attr)
>>>>>  {
>>>>> -	struct sched_param2 lp;
>>>>> +	struct sched_attr lp;
>>>>>  	struct task_struct *p;
>>>>>  	int retval;
>>>>>  
>>>>> -	if (!param2 || pid < 0)
>>>>> +	if (!attr || pid < 0)
>>>>>  		return -EINVAL;
>>>>>  
>>>>> +	memset(&lp, 0, sizeof(struct sched_attr));
>>>>> +
>>>>>  	rcu_read_lock();
>>>>>  	p = find_process_by_pid(pid);
>>>>>  	retval = -ESRCH;
>>>>> @@ -3427,7 +3495,7 @@ SYSCALL_DEFINE2(sched_getparam2, pid_t, pid, struct sched_param2 __user *, param
>>>>>  	lp.sched_priority = p->rt_priority;
>>>>>  	rcu_read_unlock();
>>>>>  
>>>>> -	retval = copy_to_user(param2, &lp, sizeof(lp)) ? -EFAULT : 0;
>>>>> +	retval = copy_to_user(attr, &lp, sizeof(lp)) ? -EFAULT : 0;
>>>>>  	return retval;
>>>>>  
>>>>>  out_unlock:
>>>>
>>>>
>>>> So this side needs a bit more care; suppose the kernel has a larger attr
>>>> than userspace knows about.
>>>>
>>>> What would make more sense; add another syscall argument with the
>>>> userspace sizeof(struct sched_attr), or expect userspace to initialize
>>>> attr->size to the right value before calling sched_getattr() ?
>>>>
>>>> To me the extra argument makes more sense; that is:
>>>>
>>>>   struct sched_attr attr;
>>>>
>>>>   ret = sched_getattr(0, &attr, sizeof(attr));
>>>>
>>>> seems like a saner thing than:
>>>>
>>>>   struct sched_attr attr = { .size = sizeof(attr), };
>>>>
>>>>   ret = sched_getattr(0, &attr);
>>>>
>>>> Mostly because the former has a clear separation between input and 
>>>> output arguments, whereas for the second form the attr argument is 
>>>> both input and output.
>>>>
>>>> Ingo?
>>>
>>> I suppose so - in the sys_perf_event_open() case we ran out of 
>>> arguments, so attr::size was the only sane way to do it.
>>>
>>
>> Ok, I modified it like this:
>>
>> ------------------------------------------------------------
>> Subject: [PATCH] fixup: add checks for sys_sched_getattr
>>
>> Add an extra argument to the syscall with the userspace
>> sizeof(struct sched_attr) to be able to handle situations
>> when the kernel has a larger attr than userspace knows about.
>> ---
>>  include/linux/syscalls.h |    3 ++-
>>  kernel/sched/core.c      |   55 ++++++++++++++++++++++++++++++++++++++++++----
>>  2 files changed, 53 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>> index fbdf44a..45ce599 100644
>> --- a/include/linux/syscalls.h
>> +++ b/include/linux/syscalls.h
>> @@ -288,7 +288,8 @@ asmlinkage long sys_sched_getscheduler(pid_t pid);
>>  asmlinkage long sys_sched_getparam(pid_t pid,
>>  					struct sched_param __user *param);
>>  asmlinkage long sys_sched_getattr(pid_t pid,
>> -					struct sched_attr __user *attr);
>> +					struct sched_attr __user *attr,
>> +					unsigned int size);
>>  asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
>>  					unsigned long __user *user_mask_ptr);
>>  asmlinkage long sys_sched_getaffinity(pid_t pid, unsigned int len,
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index fe755f7..b7d91c6 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3507,7 +3507,7 @@ do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)
>>  }
>>  
>>  /*
>> - * Mimics kerner/events/core.c perf_copy_attr().
>> + * Mimics kernel/events/core.c perf_copy_attr().
>>   */
>>  static int sched_copy_attr(struct sched_attr __user *uattr,
>>  			   struct sched_attr *attr)
>> @@ -3726,18 +3726,65 @@ out_unlock:
>>  	return retval;
>>  }
>>  
>> +static int sched_read_attr(struct sched_attr __user *uattr,
>> +			   struct sched_attr *attr,
>> +			   unsigned int size,
>> +			   unsigned int usize)
>> +{
>> +	int ret;
>> +
>> +	if (!access_ok(VERIFY_WRITE, uattr, SCHED_ATTR_SIZE_VER0))
> 
> We want to verify from uattr to usize, right? As that is what we are
> writing to.
> 

Right. s/SCHED_ATTR_SIZE_VER0/usize/

>> +		return -EFAULT;
>> +
>> +	/*
>> +	 * zero the full structure, so that a short copy will be nice.
>> +	 */
>> +	memset(uattr, 0, sizeof(*uattr));
> 
> Wait! We can't write to user space like this, not to mention that usize
> may even be less than sizeof(struct sched_attr).
> 

Ouch! I should never copy and paste.

> 
>> +
>> +	/*
>> +	 * If we're handed a smaller struct than we know of,
>> +	 * ensure all the unknown bits are 0 - i.e. old
>> +	 * user-space does not get uncomplete information.
>> +	 */
>> +	if (usize < sizeof(*attr)) {
>> +		unsigned char *addr;
>> +		unsigned char *end;
>> +
>> +		addr = (void *)attr + usize;
>> +		end  = (void *)attr + size;
>> +
>> +		for (; addr < end; addr++)
>> +			if (*addr)
>> +				goto err_size;
>> +	}
>> +

But, if we got here, we know that all is zero after usize and that
usize >= SCHED_ATTR_SIZE_VER0 (see below). So it should be safe to
copy_to_user() without the memset (since attr has been already zeroed),
do I get it right?

Thanks,

- Juri

>> +	ret = copy_to_user(uattr, attr, usize);
>> +	if (ret)
>> +		return -EFAULT;
>> +
>> +out:
>> +	return ret;
>> +
>> +err_size:
>> +	ret = -E2BIG;
>> +	goto out;
>> +}
>> +
>>  /**
>>   * sys_sched_getattr - same as above, but with extended "sched_param"
>>   * @pid: the pid in question.
>>   * @attr: structure containing the extended parameters.
>> + * @size: sizeof(attr) for fwd/bwd comp.
>>   */
>> -SYSCALL_DEFINE2(sched_getattr, pid_t, pid, struct sched_attr __user *, attr)
>> +SYSCALL_DEFINE3(sched_getattr, pid_t, pid, struct sched_attr __user *, attr,
>> +		unsigned int, size)
>>  {
>>  	struct sched_attr lp;
>>  	struct task_struct *p;
>>  	int retval;
>>  
>> -	if (!attr || pid < 0)
>> +	if (!attr || pid < 0 || size > PAGE_SIZE ||
>> +	    size < SCHED_ATTR_SIZE_VER0)
>>  		return -EINVAL;
>>  
>>  	memset(&lp, 0, sizeof(struct sched_attr));
>> @@ -3758,7 +3805,7 @@ SYSCALL_DEFINE2(sched_getattr, pid_t, pid, struct sched_attr __user *, attr)
>>  		lp.sched_priority = p->rt_priority;
>>  	rcu_read_unlock();
>>  
>> -	retval = copy_to_user(attr, &lp, sizeof(lp)) ? -EFAULT : 0;
>> +	retval = sched_read_attr(attr, &lp, sizeof(lp), size);
>>  	return retval;
>>  
>>  out_unlock:
>> -----------------------------------------------------------
>>
>> Do we need to make sched_setattr symmetrical, or, since the user has
>> to fill the fields anyway, we leave it as is?
>>
>> Thanks,
>>
>> - Juri
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [tip:sched/core] sched: Add new scheduler syscalls to support an extended scheduling parameters ABI
  2013-11-07 13:43 ` [PATCH 02/14] sched: add extended scheduling interface Juri Lelli
                     ` (2 preceding siblings ...)
  2013-11-27 13:23   ` [PATCH 02/14] sched: add extended scheduling interface. (new ABI) Ingo Molnar
@ 2014-01-13 15:53   ` tip-bot for Dario Faggioli
  2014-01-15 16:22     ` [RFC][PATCH] sched: Move SCHED_RESET_ON_FORK into attr::sched_flags Peter Zijlstra
  2014-01-17 17:29     ` [tip:sched/core] sched: Add new scheduler syscalls to support an extended scheduling parameters ABI Stephen Warren
  3 siblings, 2 replies; 81+ messages in thread
From: tip-bot for Dario Faggioli @ 2014-01-13 15:53 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, raistlin, tglx, juri.lelli

Commit-ID:  d50dde5a10f305253cbc3855307f608f8a3c5f73
Gitweb:     http://git.kernel.org/tip/d50dde5a10f305253cbc3855307f608f8a3c5f73
Author:     Dario Faggioli <raistlin@linux.it>
AuthorDate: Thu, 7 Nov 2013 14:43:36 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 13 Jan 2014 13:41:04 +0100

sched: Add new scheduler syscalls to support an extended scheduling parameters ABI

Add the syscalls needed for supporting scheduling algorithms
with extended scheduling parameters (e.g., SCHED_DEADLINE).

In general, it makes possible to specify a periodic/sporadic task,
that executes for a given amount of runtime at each instance, and is
scheduled according to the urgency of their own timing constraints,
i.e.:

 - a (maximum/typical) instance execution time,
 - a minimum interval between consecutive instances,
 - a time constraint by which each instance must be completed.

Thus, both the data structure that holds the scheduling parameters of
the tasks and the system calls dealing with it must be extended.
Unfortunately, modifying the existing struct sched_param would break
the ABI and result in potentially serious compatibility issues with
legacy binaries.

For these reasons, this patch:

 - defines the new struct sched_attr, containing all the fields
   that are necessary for specifying a task in the computational
   model described above;

 - defines and implements the new scheduling related syscalls that
   manipulate it, i.e., sched_setattr() and sched_getattr().

Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
proof of concept and for developing and testing purposes. Making them
available on other architectures is straightforward.

Since no "user" for these new parameters is introduced in this patch,
the implementation of the new system calls is just identical to their
already existing counterpart. Future patches that implement scheduling
policies able to exploit the new data structure must also take care of
modifying the sched_*attr() calls accordingly with their own purposes.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
[ Rewrote to use sched_attr. ]
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
[ Removed sched_setscheduler2() for now. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-3-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/arm/include/asm/unistd.h      |   2 +-
 arch/arm/include/uapi/asm/unistd.h |   2 +
 arch/arm/kernel/calls.S            |   2 +
 arch/x86/syscalls/syscall_32.tbl   |   2 +
 arch/x86/syscalls/syscall_64.tbl   |   2 +
 include/linux/sched.h              |  62 +++++++++
 include/linux/syscalls.h           |   6 +
 kernel/sched/core.c                | 263 ++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h               |   9 +-
 9 files changed, 326 insertions(+), 24 deletions(-)

diff --git a/arch/arm/include/asm/unistd.h b/arch/arm/include/asm/unistd.h
index 141baa3..acabef1 100644
--- a/arch/arm/include/asm/unistd.h
+++ b/arch/arm/include/asm/unistd.h
@@ -15,7 +15,7 @@
 
 #include <uapi/asm/unistd.h>
 
-#define __NR_syscalls  (380)
+#define __NR_syscalls  (384)
 #define __ARM_NR_cmpxchg		(__ARM_NR_BASE+0x00fff0)
 
 #define __ARCH_WANT_STAT64
diff --git a/arch/arm/include/uapi/asm/unistd.h b/arch/arm/include/uapi/asm/unistd.h
index af33b44..fb5584d 100644
--- a/arch/arm/include/uapi/asm/unistd.h
+++ b/arch/arm/include/uapi/asm/unistd.h
@@ -406,6 +406,8 @@
 #define __NR_process_vm_writev		(__NR_SYSCALL_BASE+377)
 #define __NR_kcmp			(__NR_SYSCALL_BASE+378)
 #define __NR_finit_module		(__NR_SYSCALL_BASE+379)
+#define __NR_sched_setattr		(__NR_SYSCALL_BASE+380)
+#define __NR_sched_getattr		(__NR_SYSCALL_BASE+381)
 
 /*
  * This may need to be greater than __NR_last_syscall+1 in order to
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index c6ca7e3..166e945 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -389,6 +389,8 @@
 		CALL(sys_process_vm_writev)
 		CALL(sys_kcmp)
 		CALL(sys_finit_module)
+/* 380 */	CALL(sys_sched_setattr)
+		CALL(sys_sched_getattr)
 #ifndef syscalls_counted
 .equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
 #define syscalls_counted
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index aabfb83..96bc506 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -357,3 +357,5 @@
 348	i386	process_vm_writev	sys_process_vm_writev		compat_sys_process_vm_writev
 349	i386	kcmp			sys_kcmp
 350	i386	finit_module		sys_finit_module
+351	i386	sched_setattr		sys_sched_setattr
+352	i386	sched_getattr		sys_sched_getattr
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 38ae65d..a12bddc 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -320,6 +320,8 @@
 311	64	process_vm_writev	sys_process_vm_writev
 312	common	kcmp			sys_kcmp
 313	common	finit_module		sys_finit_module
+314	common	sched_setattr		sys_sched_setattr
+315	common	sched_getattr		sys_sched_getattr
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3a1e985..86025b6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -56,6 +56,66 @@ struct sched_param {
 
 #include <asm/processor.h>
 
+#define SCHED_ATTR_SIZE_VER0	48	/* sizeof first published struct */
+
+/*
+ * Extended scheduling parameters data structure.
+ *
+ * This is needed because the original struct sched_param can not be
+ * altered without introducing ABI issues with legacy applications
+ * (e.g., in sched_getparam()).
+ *
+ * However, the possibility of specifying more than just a priority for
+ * the tasks may be useful for a wide variety of application fields, e.g.,
+ * multimedia, streaming, automation and control, and many others.
+ *
+ * This variant (sched_attr) is meant at describing a so-called
+ * sporadic time-constrained task. In such model a task is specified by:
+ *  - the activation period or minimum instance inter-arrival time;
+ *  - the maximum (or average, depending on the actual scheduling
+ *    discipline) computation time of all instances, a.k.a. runtime;
+ *  - the deadline (relative to the actual activation time) of each
+ *    instance.
+ * Very briefly, a periodic (sporadic) task asks for the execution of
+ * some specific computation --which is typically called an instance--
+ * (at most) every period. Moreover, each instance typically lasts no more
+ * than the runtime and must be completed by time instant t equal to
+ * the instance activation time + the deadline.
+ *
+ * This is reflected by the actual fields of the sched_attr structure:
+ *
+ *  @size		size of the structure, for fwd/bwd compat.
+ *
+ *  @sched_policy	task's scheduling policy
+ *  @sched_flags	for customizing the scheduler behaviour
+ *  @sched_nice		task's nice value      (SCHED_NORMAL/BATCH)
+ *  @sched_priority	task's static priority (SCHED_FIFO/RR)
+ *  @sched_deadline	representative of the task's deadline
+ *  @sched_runtime	representative of the task's runtime
+ *  @sched_period	representative of the task's period
+ *
+ * Given this task model, there are a multiplicity of scheduling algorithms
+ * and policies, that can be used to ensure all the tasks will make their
+ * timing constraints.
+ */
+struct sched_attr {
+	u32 size;
+
+	u32 sched_policy;
+	u64 sched_flags;
+
+	/* SCHED_NORMAL, SCHED_BATCH */
+	s32 sched_nice;
+
+	/* SCHED_FIFO, SCHED_RR */
+	u32 sched_priority;
+
+	/* SCHED_DEADLINE */
+	u64 sched_runtime;
+	u64 sched_deadline;
+	u64 sched_period;
+};
+
 struct exec_domain;
 struct futex_pi_state;
 struct robust_list_head;
@@ -1958,6 +2018,8 @@ extern int sched_setscheduler(struct task_struct *, int,
 			      const struct sched_param *);
 extern int sched_setscheduler_nocheck(struct task_struct *, int,
 				      const struct sched_param *);
+extern int sched_setattr(struct task_struct *,
+			 const struct sched_attr *);
 extern struct task_struct *idle_task(int cpu);
 /**
  * is_idle_task - is the specified task an idle task?
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 94273bb..40ed9e9 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -38,6 +38,7 @@ struct rlimit;
 struct rlimit64;
 struct rusage;
 struct sched_param;
+struct sched_attr;
 struct sel_arg_struct;
 struct semaphore;
 struct sembuf;
@@ -279,9 +280,14 @@ asmlinkage long sys_sched_setscheduler(pid_t pid, int policy,
 					struct sched_param __user *param);
 asmlinkage long sys_sched_setparam(pid_t pid,
 					struct sched_param __user *param);
+asmlinkage long sys_sched_setattr(pid_t pid,
+					struct sched_attr __user *attr);
 asmlinkage long sys_sched_getscheduler(pid_t pid);
 asmlinkage long sys_sched_getparam(pid_t pid,
 					struct sched_param __user *param);
+asmlinkage long sys_sched_getattr(pid_t pid,
+					struct sched_attr __user *attr,
+					unsigned int size);
 asmlinkage long sys_sched_setaffinity(pid_t pid, unsigned int len,
 					unsigned long __user *user_mask_ptr);
 asmlinkage long sys_sched_getaffinity(pid_t pid, unsigned int len,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b21a63e..8174f88 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2817,6 +2817,7 @@ out_unlock:
 	__task_rq_unlock(rq);
 }
 #endif
+
 void set_user_nice(struct task_struct *p, long nice)
 {
 	int old_prio, delta, on_rq;
@@ -2991,22 +2992,29 @@ static struct task_struct *find_process_by_pid(pid_t pid)
 	return pid ? find_task_by_vpid(pid) : current;
 }
 
-/* Actually do priority change: must hold rq lock. */
-static void
-__setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
+/* Actually do priority change: must hold pi & rq lock. */
+static void __setscheduler(struct rq *rq, struct task_struct *p,
+			   const struct sched_attr *attr)
 {
+	int policy = attr->sched_policy;
+
 	p->policy = policy;
-	p->rt_priority = prio;
+
+	if (rt_policy(policy))
+		p->rt_priority = attr->sched_priority;
+	else
+		p->static_prio = NICE_TO_PRIO(attr->sched_nice);
+
 	p->normal_prio = normal_prio(p);
-	/* we are holding p->pi_lock already */
 	p->prio = rt_mutex_getprio(p);
+
 	if (rt_prio(p->prio))
 		p->sched_class = &rt_sched_class;
 	else
 		p->sched_class = &fair_sched_class;
+
 	set_load_weight(p);
 }
-
 /*
  * check the target process has a UID that matches the current process's
  */
@@ -3023,10 +3031,12 @@ static bool check_same_owner(struct task_struct *p)
 	return match;
 }
 
-static int __sched_setscheduler(struct task_struct *p, int policy,
-				const struct sched_param *param, bool user)
+static int __sched_setscheduler(struct task_struct *p,
+				const struct sched_attr *attr,
+				bool user)
 {
 	int retval, oldprio, oldpolicy = -1, on_rq, running;
+	int policy = attr->sched_policy;
 	unsigned long flags;
 	const struct sched_class *prev_class;
 	struct rq *rq;
@@ -3054,17 +3064,22 @@ recheck:
 	 * 1..MAX_USER_RT_PRIO-1, valid priority for SCHED_NORMAL,
 	 * SCHED_BATCH and SCHED_IDLE is 0.
 	 */
-	if (param->sched_priority < 0 ||
-	    (p->mm && param->sched_priority > MAX_USER_RT_PRIO-1) ||
-	    (!p->mm && param->sched_priority > MAX_RT_PRIO-1))
+	if (attr->sched_priority < 0 ||
+	    (p->mm && attr->sched_priority > MAX_USER_RT_PRIO-1) ||
+	    (!p->mm && attr->sched_priority > MAX_RT_PRIO-1))
 		return -EINVAL;
-	if (rt_policy(policy) != (param->sched_priority != 0))
+	if (rt_policy(policy) != (attr->sched_priority != 0))
 		return -EINVAL;
 
 	/*
 	 * Allow unprivileged RT tasks to decrease priority:
 	 */
 	if (user && !capable(CAP_SYS_NICE)) {
+		if (fair_policy(policy)) {
+			if (!can_nice(p, attr->sched_nice))
+				return -EPERM;
+		}
+
 		if (rt_policy(policy)) {
 			unsigned long rlim_rtprio =
 					task_rlimit(p, RLIMIT_RTPRIO);
@@ -3074,8 +3089,8 @@ recheck:
 				return -EPERM;
 
 			/* can't increase priority */
-			if (param->sched_priority > p->rt_priority &&
-			    param->sched_priority > rlim_rtprio)
+			if (attr->sched_priority > p->rt_priority &&
+			    attr->sched_priority > rlim_rtprio)
 				return -EPERM;
 		}
 
@@ -3123,11 +3138,16 @@ recheck:
 	/*
 	 * If not changing anything there's no need to proceed further:
 	 */
-	if (unlikely(policy == p->policy && (!rt_policy(policy) ||
-			param->sched_priority == p->rt_priority))) {
+	if (unlikely(policy == p->policy)) {
+		if (fair_policy(policy) && attr->sched_nice != TASK_NICE(p))
+			goto change;
+		if (rt_policy(policy) && attr->sched_priority != p->rt_priority)
+			goto change;
+
 		task_rq_unlock(rq, p, &flags);
 		return 0;
 	}
+change:
 
 #ifdef CONFIG_RT_GROUP_SCHED
 	if (user) {
@@ -3161,7 +3181,7 @@ recheck:
 
 	oldprio = p->prio;
 	prev_class = p->sched_class;
-	__setscheduler(rq, p, policy, param->sched_priority);
+	__setscheduler(rq, p, attr);
 
 	if (running)
 		p->sched_class->set_curr_task(rq);
@@ -3189,10 +3209,20 @@ recheck:
 int sched_setscheduler(struct task_struct *p, int policy,
 		       const struct sched_param *param)
 {
-	return __sched_setscheduler(p, policy, param, true);
+	struct sched_attr attr = {
+		.sched_policy   = policy,
+		.sched_priority = param->sched_priority
+	};
+	return __sched_setscheduler(p, &attr, true);
 }
 EXPORT_SYMBOL_GPL(sched_setscheduler);
 
+int sched_setattr(struct task_struct *p, const struct sched_attr *attr)
+{
+	return __sched_setscheduler(p, attr, true);
+}
+EXPORT_SYMBOL_GPL(sched_setattr);
+
 /**
  * sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace.
  * @p: the task in question.
@@ -3209,7 +3239,11 @@ EXPORT_SYMBOL_GPL(sched_setscheduler);
 int sched_setscheduler_nocheck(struct task_struct *p, int policy,
 			       const struct sched_param *param)
 {
-	return __sched_setscheduler(p, policy, param, false);
+	struct sched_attr attr = {
+		.sched_policy   = policy,
+		.sched_priority = param->sched_priority
+	};
+	return __sched_setscheduler(p, &attr, false);
 }
 
 static int
@@ -3234,6 +3268,79 @@ do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)
 	return retval;
 }
 
+/*
+ * Mimics kernel/events/core.c perf_copy_attr().
+ */
+static int sched_copy_attr(struct sched_attr __user *uattr,
+			   struct sched_attr *attr)
+{
+	u32 size;
+	int ret;
+
+	if (!access_ok(VERIFY_WRITE, uattr, SCHED_ATTR_SIZE_VER0))
+		return -EFAULT;
+
+	/*
+	 * zero the full structure, so that a short copy will be nice.
+	 */
+	memset(attr, 0, sizeof(*attr));
+
+	ret = get_user(size, &uattr->size);
+	if (ret)
+		return ret;
+
+	if (size > PAGE_SIZE)	/* silly large */
+		goto err_size;
+
+	if (!size)		/* abi compat */
+		size = SCHED_ATTR_SIZE_VER0;
+
+	if (size < SCHED_ATTR_SIZE_VER0)
+		goto err_size;
+
+	/*
+	 * If we're handed a bigger struct than we know of,
+	 * ensure all the unknown bits are 0 - i.e. new
+	 * user-space does not rely on any kernel feature
+	 * extensions we dont know about yet.
+	 */
+	if (size > sizeof(*attr)) {
+		unsigned char __user *addr;
+		unsigned char __user *end;
+		unsigned char val;
+
+		addr = (void __user *)uattr + sizeof(*attr);
+		end  = (void __user *)uattr + size;
+
+		for (; addr < end; addr++) {
+			ret = get_user(val, addr);
+			if (ret)
+				return ret;
+			if (val)
+				goto err_size;
+		}
+		size = sizeof(*attr);
+	}
+
+	ret = copy_from_user(attr, uattr, size);
+	if (ret)
+		return -EFAULT;
+
+	/*
+	 * XXX: do we want to be lenient like existing syscalls; or do we want
+	 * to be strict and return an error on out-of-bounds values?
+	 */
+	attr->sched_nice = clamp(attr->sched_nice, -20, 19);
+
+out:
+	return ret;
+
+err_size:
+	put_user(sizeof(*attr), &uattr->size);
+	ret = -E2BIG;
+	goto out;
+}
+
 /**
  * sys_sched_setscheduler - set/change the scheduler policy and RT priority
  * @pid: the pid in question.
@@ -3265,6 +3372,33 @@ SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
 }
 
 /**
+ * sys_sched_setattr - same as above, but with extended sched_attr
+ * @pid: the pid in question.
+ * @attr: structure containing the extended parameters.
+ */
+SYSCALL_DEFINE2(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr)
+{
+	struct sched_attr attr;
+	struct task_struct *p;
+	int retval;
+
+	if (!uattr || pid < 0)
+		return -EINVAL;
+
+	if (sched_copy_attr(uattr, &attr))
+		return -EFAULT;
+
+	rcu_read_lock();
+	retval = -ESRCH;
+	p = find_process_by_pid(pid);
+	if (p != NULL)
+		retval = sched_setattr(p, &attr);
+	rcu_read_unlock();
+
+	return retval;
+}
+
+/**
  * sys_sched_getscheduler - get the policy (scheduling class) of a thread
  * @pid: the pid in question.
  *
@@ -3334,6 +3468,92 @@ out_unlock:
 	return retval;
 }
 
+static int sched_read_attr(struct sched_attr __user *uattr,
+			   struct sched_attr *attr,
+			   unsigned int usize)
+{
+	int ret;
+
+	if (!access_ok(VERIFY_WRITE, uattr, usize))
+		return -EFAULT;
+
+	/*
+	 * If we're handed a smaller struct than we know of,
+	 * ensure all the unknown bits are 0 - i.e. old
+	 * user-space does not get uncomplete information.
+	 */
+	if (usize < sizeof(*attr)) {
+		unsigned char *addr;
+		unsigned char *end;
+
+		addr = (void *)attr + usize;
+		end  = (void *)attr + sizeof(*attr);
+
+		for (; addr < end; addr++) {
+			if (*addr)
+				goto err_size;
+		}
+
+		attr->size = usize;
+	}
+
+	ret = copy_to_user(uattr, attr, usize);
+	if (ret)
+		return -EFAULT;
+
+out:
+	return ret;
+
+err_size:
+	ret = -E2BIG;
+	goto out;
+}
+
+/**
+ * sys_sched_getattr - same as above, but with extended "sched_param"
+ * @pid: the pid in question.
+ * @attr: structure containing the extended parameters.
+ * @size: sizeof(attr) for fwd/bwd comp.
+ */
+SYSCALL_DEFINE3(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
+		unsigned int, size)
+{
+	struct sched_attr attr = {
+		.size = sizeof(struct sched_attr),
+	};
+	struct task_struct *p;
+	int retval;
+
+	if (!uattr || pid < 0 || size > PAGE_SIZE ||
+	    size < SCHED_ATTR_SIZE_VER0)
+		return -EINVAL;
+
+	rcu_read_lock();
+	p = find_process_by_pid(pid);
+	retval = -ESRCH;
+	if (!p)
+		goto out_unlock;
+
+	retval = security_task_getscheduler(p);
+	if (retval)
+		goto out_unlock;
+
+	attr.sched_policy = p->policy;
+	if (task_has_rt_policy(p))
+		attr.sched_priority = p->rt_priority;
+	else
+		attr.sched_nice = TASK_NICE(p);
+
+	rcu_read_unlock();
+
+	retval = sched_read_attr(uattr, &attr, size);
+	return retval;
+
+out_unlock:
+	rcu_read_unlock();
+	return retval;
+}
+
 long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 {
 	cpumask_var_t cpus_allowed, new_mask;
@@ -6400,13 +6620,16 @@ EXPORT_SYMBOL(__might_sleep);
 static void normalize_task(struct rq *rq, struct task_struct *p)
 {
 	const struct sched_class *prev_class = p->sched_class;
+	struct sched_attr attr = {
+		.sched_policy = SCHED_NORMAL,
+	};
 	int old_prio = p->prio;
 	int on_rq;
 
 	on_rq = p->on_rq;
 	if (on_rq)
 		dequeue_task(rq, p, 0);
-	__setscheduler(rq, p, SCHED_NORMAL, 0);
+	__setscheduler(rq, p, &attr);
 	if (on_rq) {
 		enqueue_task(rq, p, 0);
 		resched_task(rq->curr);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b3b4a49..df023db 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -81,11 +81,14 @@ extern void update_cpu_load_active(struct rq *this_rq);
  */
 #define RUNTIME_INF	((u64)~0ULL)
 
+static inline int fair_policy(int policy)
+{
+	return policy == SCHED_NORMAL || policy == SCHED_BATCH;
+}
+
 static inline int rt_policy(int policy)
 {
-	if (policy == SCHED_FIFO || policy == SCHED_RR)
-		return 1;
-	return 0;
+	return policy == SCHED_FIFO || policy == SCHED_RR;
 }
 
 static inline int task_has_rt_policy(struct task_struct *p)

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [tip:sched/core] sched/deadline: Add SCHED_DEADLINE structures & implementation
  2013-11-07 13:43 ` [PATCH 03/14] sched: SCHED_DEADLINE structures & implementation Juri Lelli
  2013-11-13  2:31   ` Steven Rostedt
  2013-11-20 20:23   ` Steven Rostedt
@ 2014-01-13 15:53   ` tip-bot for Dario Faggioli
  2 siblings, 0 replies; 81+ messages in thread
From: tip-bot for Dario Faggioli @ 2014-01-13 15:53 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, raistlin, fchecconi, tglx,
	michael, juri.lelli

Commit-ID:  aab03e05e8f7e26f51dee792beddcb5cca9215a5
Gitweb:     http://git.kernel.org/tip/aab03e05e8f7e26f51dee792beddcb5cca9215a5
Author:     Dario Faggioli <raistlin@linux.it>
AuthorDate: Thu, 28 Nov 2013 11:14:43 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 13 Jan 2014 13:41:06 +0100

sched/deadline: Add SCHED_DEADLINE structures & implementation

Introduces the data structures, constants and symbols needed for
SCHED_DEADLINE implementation.

Core data structure of SCHED_DEADLINE are defined, along with their
initializers. Hooks for checking if a task belong to the new policy
are also added where they are needed.

Adds a scheduling class, in sched/dl.c and a new policy called
SCHED_DEADLINE. It is an implementation of the Earliest Deadline
First (EDF) scheduling algorithm, augmented with a mechanism (called
Constant Bandwidth Server, CBS) that makes it possible to isolate
the behaviour of tasks between each other.

The typical -deadline task will be made up of a computation phase
(instance) which is activated on a periodic or sporadic fashion. The
expected (maximum) duration of such computation is called the task's
runtime; the time interval by which each instance need to be completed
is called the task's relative deadline. The task's absolute deadline
is dynamically calculated as the time instant a task (better, an
instance) activates plus the relative deadline.

The EDF algorithms selects the task with the smallest absolute
deadline as the one to be executed first, while the CBS ensures each
task to run for at most its runtime every (relative) deadline
length time interval, avoiding any interference between different
tasks (bandwidth isolation).
Thanks to this feature, also tasks that do not strictly comply with
the computational model sketched above can effectively use the new
policy.

To summarize, this patch:
 - introduces the data structures, constants and symbols needed;
 - implements the core logic of the scheduling algorithm in the new
   scheduling class file;
 - provides all the glue code between the new scheduling class and
   the core scheduler and refines the interactions between sched/dl
   and the other existing scheduling classes.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com>
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h          |  46 ++-
 include/linux/sched/deadline.h |  24 ++
 include/uapi/linux/sched.h     |   1 +
 kernel/fork.c                  |   4 +-
 kernel/hrtimer.c               |   3 +-
 kernel/sched/Makefile          |   3 +-
 kernel/sched/core.c            | 109 ++++++-
 kernel/sched/deadline.c        | 684 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h           |  26 ++
 kernel/sched/stop_task.c       |   2 +-
 10 files changed, 882 insertions(+), 20 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 86025b6..6c19679 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -97,6 +97,10 @@ struct sched_param {
  * Given this task model, there are a multiplicity of scheduling algorithms
  * and policies, that can be used to ensure all the tasks will make their
  * timing constraints.
+ *
+ * As of now, the SCHED_DEADLINE policy (sched_dl scheduling class) is the
+ * only user of this new interface. More information about the algorithm
+ * available in the scheduling class file or in Documentation/.
  */
 struct sched_attr {
 	u32 size;
@@ -1088,6 +1092,45 @@ struct sched_rt_entity {
 #endif
 };
 
+struct sched_dl_entity {
+	struct rb_node	rb_node;
+
+	/*
+	 * Original scheduling parameters. Copied here from sched_attr
+	 * during sched_setscheduler2(), they will remain the same until
+	 * the next sched_setscheduler2().
+	 */
+	u64 dl_runtime;		/* maximum runtime for each instance	*/
+	u64 dl_deadline;	/* relative deadline of each instance	*/
+
+	/*
+	 * Actual scheduling parameters. Initialized with the values above,
+	 * they are continously updated during task execution. Note that
+	 * the remaining runtime could be < 0 in case we are in overrun.
+	 */
+	s64 runtime;		/* remaining runtime for this instance	*/
+	u64 deadline;		/* absolute deadline for this instance	*/
+	unsigned int flags;	/* specifying the scheduler behaviour	*/
+
+	/*
+	 * Some bool flags:
+	 *
+	 * @dl_throttled tells if we exhausted the runtime. If so, the
+	 * task has to wait for a replenishment to be performed at the
+	 * next firing of dl_timer.
+	 *
+	 * @dl_new tells if a new instance arrived. If so we must
+	 * start executing it with full runtime and reset its absolute
+	 * deadline;
+	 */
+	int dl_throttled, dl_new;
+
+	/*
+	 * Bandwidth enforcement timer. Each -deadline task has its
+	 * own bandwidth to be enforced, thus we need one timer per task.
+	 */
+	struct hrtimer dl_timer;
+};
 
 struct rcu_node;
 
@@ -1124,6 +1167,7 @@ struct task_struct {
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group *sched_task_group;
 #endif
+	struct sched_dl_entity dl;
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	/* list of struct preempt_notifier: */
@@ -2099,7 +2143,7 @@ extern void wake_up_new_task(struct task_struct *tsk);
 #else
  static inline void kick_process(struct task_struct *tsk) { }
 #endif
-extern void sched_fork(unsigned long clone_flags, struct task_struct *p);
+extern int sched_fork(unsigned long clone_flags, struct task_struct *p);
 extern void sched_dead(struct task_struct *p);
 
 extern void proc_caches_init(void);
diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
new file mode 100644
index 0000000..9d303b8
--- /dev/null
+++ b/include/linux/sched/deadline.h
@@ -0,0 +1,24 @@
+#ifndef _SCHED_DEADLINE_H
+#define _SCHED_DEADLINE_H
+
+/*
+ * SCHED_DEADLINE tasks has negative priorities, reflecting
+ * the fact that any of them has higher prio than RT and
+ * NORMAL/BATCH tasks.
+ */
+
+#define MAX_DL_PRIO		0
+
+static inline int dl_prio(int prio)
+{
+	if (unlikely(prio < MAX_DL_PRIO))
+		return 1;
+	return 0;
+}
+
+static inline int dl_task(struct task_struct *p)
+{
+	return dl_prio(p->prio);
+}
+
+#endif /* _SCHED_DEADLINE_H */
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 5a0f945..2d5e49a 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -39,6 +39,7 @@
 #define SCHED_BATCH		3
 /* SCHED_ISO: reserved but not implemented yet */
 #define SCHED_IDLE		5
+#define SCHED_DEADLINE		6
 /* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
 #define SCHED_RESET_ON_FORK     0x40000000
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 6023d15..e6c0f1a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1311,7 +1311,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 #endif
 
 	/* Perform scheduler related setup. Assign this task to a CPU. */
-	sched_fork(clone_flags, p);
+	retval = sched_fork(clone_flags, p);
+	if (retval)
+		goto bad_fork_cleanup_policy;
 
 	retval = perf_event_init_task(p);
 	if (retval)
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 383319b..0909436 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -46,6 +46,7 @@
 #include <linux/sched.h>
 #include <linux/sched/sysctl.h>
 #include <linux/sched/rt.h>
+#include <linux/sched/deadline.h>
 #include <linux/timer.h>
 #include <linux/freezer.h>
 
@@ -1610,7 +1611,7 @@ long hrtimer_nanosleep(struct timespec *rqtp, struct timespec __user *rmtp,
 	unsigned long slack;
 
 	slack = current->timer_slack_ns;
-	if (rt_task(current))
+	if (dl_task(current) || rt_task(current))
 		slack = 0;
 
 	hrtimer_init_on_stack(&t.timer, clockid, mode);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 7b62140..b039035 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -11,7 +11,8 @@ ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
 CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer
 endif
 
-obj-y += core.o proc.o clock.o cputime.o idle_task.o fair.o rt.o stop_task.o
+obj-y += core.o proc.o clock.o cputime.o
+obj-y += idle_task.o fair.o rt.o deadline.o stop_task.o
 obj-y += wait.o completion.o
 obj-$(CONFIG_SMP) += cpupri.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8174f88..203aecd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -899,7 +899,9 @@ static inline int normal_prio(struct task_struct *p)
 {
 	int prio;
 
-	if (task_has_rt_policy(p))
+	if (task_has_dl_policy(p))
+		prio = MAX_DL_PRIO-1;
+	else if (task_has_rt_policy(p))
 		prio = MAX_RT_PRIO-1 - p->rt_priority;
 	else
 		prio = __normal_prio(p);
@@ -1717,6 +1719,12 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
 #endif
 
+	RB_CLEAR_NODE(&p->dl.rb_node);
+	hrtimer_init(&p->dl.dl_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	p->dl.dl_runtime = p->dl.runtime = 0;
+	p->dl.dl_deadline = p->dl.deadline = 0;
+	p->dl.flags = 0;
+
 	INIT_LIST_HEAD(&p->rt.run_list);
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
@@ -1768,7 +1776,7 @@ void set_numabalancing_state(bool enabled)
 /*
  * fork()/clone()-time setup:
  */
-void sched_fork(unsigned long clone_flags, struct task_struct *p)
+int sched_fork(unsigned long clone_flags, struct task_struct *p)
 {
 	unsigned long flags;
 	int cpu = get_cpu();
@@ -1790,7 +1798,7 @@ void sched_fork(unsigned long clone_flags, struct task_struct *p)
 	 * Revert to default priority/policy on fork if requested.
 	 */
 	if (unlikely(p->sched_reset_on_fork)) {
-		if (task_has_rt_policy(p)) {
+		if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
 			p->policy = SCHED_NORMAL;
 			p->static_prio = NICE_TO_PRIO(0);
 			p->rt_priority = 0;
@@ -1807,8 +1815,14 @@ void sched_fork(unsigned long clone_flags, struct task_struct *p)
 		p->sched_reset_on_fork = 0;
 	}
 
-	if (!rt_prio(p->prio))
+	if (dl_prio(p->prio)) {
+		put_cpu();
+		return -EAGAIN;
+	} else if (rt_prio(p->prio)) {
+		p->sched_class = &rt_sched_class;
+	} else {
 		p->sched_class = &fair_sched_class;
+	}
 
 	if (p->sched_class->task_fork)
 		p->sched_class->task_fork(p);
@@ -1837,6 +1851,7 @@ void sched_fork(unsigned long clone_flags, struct task_struct *p)
 #endif
 
 	put_cpu();
+	return 0;
 }
 
 /*
@@ -2768,7 +2783,7 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 	struct rq *rq;
 	const struct sched_class *prev_class;
 
-	BUG_ON(prio < 0 || prio > MAX_PRIO);
+	BUG_ON(prio > MAX_PRIO);
 
 	rq = __task_rq_lock(p);
 
@@ -2800,7 +2815,9 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 	if (running)
 		p->sched_class->put_prev_task(rq, p);
 
-	if (rt_prio(prio))
+	if (dl_prio(prio))
+		p->sched_class = &dl_sched_class;
+	else if (rt_prio(prio))
 		p->sched_class = &rt_sched_class;
 	else
 		p->sched_class = &fair_sched_class;
@@ -2835,9 +2852,9 @@ void set_user_nice(struct task_struct *p, long nice)
 	 * The RT priorities are set via sched_setscheduler(), but we still
 	 * allow the 'normal' nice value to be set - but as expected
 	 * it wont have any effect on scheduling until the task is
-	 * SCHED_FIFO/SCHED_RR:
+	 * SCHED_DEADLINE, SCHED_FIFO or SCHED_RR:
 	 */
-	if (task_has_rt_policy(p)) {
+	if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
 		p->static_prio = NICE_TO_PRIO(nice);
 		goto out_unlock;
 	}
@@ -2992,6 +3009,27 @@ static struct task_struct *find_process_by_pid(pid_t pid)
 	return pid ? find_task_by_vpid(pid) : current;
 }
 
+/*
+ * This function initializes the sched_dl_entity of a newly becoming
+ * SCHED_DEADLINE task.
+ *
+ * Only the static values are considered here, the actual runtime and the
+ * absolute deadline will be properly calculated when the task is enqueued
+ * for the first time with its new policy.
+ */
+static void
+__setparam_dl(struct task_struct *p, const struct sched_attr *attr)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+
+	init_dl_task_timer(dl_se);
+	dl_se->dl_runtime = attr->sched_runtime;
+	dl_se->dl_deadline = attr->sched_deadline;
+	dl_se->flags = attr->sched_flags;
+	dl_se->dl_throttled = 0;
+	dl_se->dl_new = 1;
+}
+
 /* Actually do priority change: must hold pi & rq lock. */
 static void __setscheduler(struct rq *rq, struct task_struct *p,
 			   const struct sched_attr *attr)
@@ -3000,7 +3038,9 @@ static void __setscheduler(struct rq *rq, struct task_struct *p,
 
 	p->policy = policy;
 
-	if (rt_policy(policy))
+	if (dl_policy(policy))
+		__setparam_dl(p, attr);
+	else if (rt_policy(policy))
 		p->rt_priority = attr->sched_priority;
 	else
 		p->static_prio = NICE_TO_PRIO(attr->sched_nice);
@@ -3008,13 +3048,39 @@ static void __setscheduler(struct rq *rq, struct task_struct *p,
 	p->normal_prio = normal_prio(p);
 	p->prio = rt_mutex_getprio(p);
 
-	if (rt_prio(p->prio))
+	if (dl_prio(p->prio))
+		p->sched_class = &dl_sched_class;
+	else if (rt_prio(p->prio))
 		p->sched_class = &rt_sched_class;
 	else
 		p->sched_class = &fair_sched_class;
 
 	set_load_weight(p);
 }
+
+static void
+__getparam_dl(struct task_struct *p, struct sched_attr *attr)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+
+	attr->sched_priority = p->rt_priority;
+	attr->sched_runtime = dl_se->dl_runtime;
+	attr->sched_deadline = dl_se->dl_deadline;
+	attr->sched_flags = dl_se->flags;
+}
+
+/*
+ * This function validates the new parameters of a -deadline task.
+ * We ask for the deadline not being zero, and greater or equal
+ * than the runtime.
+ */
+static bool
+__checkparam_dl(const struct sched_attr *attr)
+{
+	return attr && attr->sched_deadline != 0 &&
+	       (s64)(attr->sched_deadline - attr->sched_runtime) >= 0;
+}
+
 /*
  * check the target process has a UID that matches the current process's
  */
@@ -3053,7 +3119,8 @@ recheck:
 		reset_on_fork = !!(policy & SCHED_RESET_ON_FORK);
 		policy &= ~SCHED_RESET_ON_FORK;
 
-		if (policy != SCHED_FIFO && policy != SCHED_RR &&
+		if (policy != SCHED_DEADLINE &&
+				policy != SCHED_FIFO && policy != SCHED_RR &&
 				policy != SCHED_NORMAL && policy != SCHED_BATCH &&
 				policy != SCHED_IDLE)
 			return -EINVAL;
@@ -3068,7 +3135,8 @@ recheck:
 	    (p->mm && attr->sched_priority > MAX_USER_RT_PRIO-1) ||
 	    (!p->mm && attr->sched_priority > MAX_RT_PRIO-1))
 		return -EINVAL;
-	if (rt_policy(policy) != (attr->sched_priority != 0))
+	if ((dl_policy(policy) && !__checkparam_dl(attr)) ||
+	    (rt_policy(policy) != (attr->sched_priority != 0)))
 		return -EINVAL;
 
 	/*
@@ -3143,6 +3211,8 @@ recheck:
 			goto change;
 		if (rt_policy(policy) && attr->sched_priority != p->rt_priority)
 			goto change;
+		if (dl_policy(policy))
+			goto change;
 
 		task_rq_unlock(rq, p, &flags);
 		return 0;
@@ -3453,6 +3523,10 @@ SYSCALL_DEFINE2(sched_getparam, pid_t, pid, struct sched_param __user *, param)
 	if (retval)
 		goto out_unlock;
 
+	if (task_has_dl_policy(p)) {
+		retval = -EINVAL;
+		goto out_unlock;
+	}
 	lp.sched_priority = p->rt_priority;
 	rcu_read_unlock();
 
@@ -3510,7 +3584,7 @@ err_size:
 }
 
 /**
- * sys_sched_getattr - same as above, but with extended "sched_param"
+ * sys_sched_getattr - similar to sched_getparam, but with sched_attr
  * @pid: the pid in question.
  * @attr: structure containing the extended parameters.
  * @size: sizeof(attr) for fwd/bwd comp.
@@ -3539,7 +3613,9 @@ SYSCALL_DEFINE3(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
 		goto out_unlock;
 
 	attr.sched_policy = p->policy;
-	if (task_has_rt_policy(p))
+	if (task_has_dl_policy(p))
+		__getparam_dl(p, &attr);
+	else if (task_has_rt_policy(p))
 		attr.sched_priority = p->rt_priority;
 	else
 		attr.sched_nice = TASK_NICE(p);
@@ -3965,6 +4041,7 @@ SYSCALL_DEFINE1(sched_get_priority_max, int, policy)
 	case SCHED_RR:
 		ret = MAX_USER_RT_PRIO-1;
 		break;
+	case SCHED_DEADLINE:
 	case SCHED_NORMAL:
 	case SCHED_BATCH:
 	case SCHED_IDLE:
@@ -3991,6 +4068,7 @@ SYSCALL_DEFINE1(sched_get_priority_min, int, policy)
 	case SCHED_RR:
 		ret = 1;
 		break;
+	case SCHED_DEADLINE:
 	case SCHED_NORMAL:
 	case SCHED_BATCH:
 	case SCHED_IDLE:
@@ -6472,6 +6550,7 @@ void __init sched_init(void)
 		rq->calc_load_update = jiffies + LOAD_FREQ;
 		init_cfs_rq(&rq->cfs);
 		init_rt_rq(&rq->rt, rq);
+		init_dl_rq(&rq->dl, rq);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 		root_task_group.shares = ROOT_TASK_GROUP_LOAD;
 		INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
@@ -6659,7 +6738,7 @@ void normalize_rt_tasks(void)
 		p->se.statistics.block_start	= 0;
 #endif
 
-		if (!rt_task(p)) {
+		if (!dl_task(p) && !rt_task(p)) {
 			/*
 			 * Renice negative nice level userspace
 			 * tasks back to 0:
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
new file mode 100644
index 0000000..93d82b2
--- /dev/null
+++ b/kernel/sched/deadline.c
@@ -0,0 +1,684 @@
+/*
+ * Deadline Scheduling Class (SCHED_DEADLINE)
+ *
+ * Earliest Deadline First (EDF) + Constant Bandwidth Server (CBS).
+ *
+ * Tasks that periodically executes their instances for less than their
+ * runtime won't miss any of their deadlines.
+ * Tasks that are not periodic or sporadic or that tries to execute more
+ * than their reserved bandwidth will be slowed down (and may potentially
+ * miss some of their deadlines), and won't affect any other task.
+ *
+ * Copyright (C) 2012 Dario Faggioli <raistlin@linux.it>,
+ *                    Michael Trimarchi <michael@amarulasolutions.com>,
+ *                    Fabio Checconi <fchecconi@gmail.com>
+ */
+#include "sched.h"
+
+static inline int dl_time_before(u64 a, u64 b)
+{
+	return (s64)(a - b) < 0;
+}
+
+static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
+{
+	return container_of(dl_se, struct task_struct, dl);
+}
+
+static inline struct rq *rq_of_dl_rq(struct dl_rq *dl_rq)
+{
+	return container_of(dl_rq, struct rq, dl);
+}
+
+static inline struct dl_rq *dl_rq_of_se(struct sched_dl_entity *dl_se)
+{
+	struct task_struct *p = dl_task_of(dl_se);
+	struct rq *rq = task_rq(p);
+
+	return &rq->dl;
+}
+
+static inline int on_dl_rq(struct sched_dl_entity *dl_se)
+{
+	return !RB_EMPTY_NODE(&dl_se->rb_node);
+}
+
+static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+
+	return dl_rq->rb_leftmost == &dl_se->rb_node;
+}
+
+void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
+{
+	dl_rq->rb_root = RB_ROOT;
+}
+
+static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
+static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
+static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
+				  int flags);
+
+/*
+ * We are being explicitly informed that a new instance is starting,
+ * and this means that:
+ *  - the absolute deadline of the entity has to be placed at
+ *    current time + relative deadline;
+ *  - the runtime of the entity has to be set to the maximum value.
+ *
+ * The capability of specifying such event is useful whenever a -deadline
+ * entity wants to (try to!) synchronize its behaviour with the scheduler's
+ * one, and to (try to!) reconcile itself with its own scheduling
+ * parameters.
+ */
+static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+
+	WARN_ON(!dl_se->dl_new || dl_se->dl_throttled);
+
+	/*
+	 * We use the regular wall clock time to set deadlines in the
+	 * future; in fact, we must consider execution overheads (time
+	 * spent on hardirq context, etc.).
+	 */
+	dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
+	dl_se->runtime = dl_se->dl_runtime;
+	dl_se->dl_new = 0;
+}
+
+/*
+ * Pure Earliest Deadline First (EDF) scheduling does not deal with the
+ * possibility of a entity lasting more than what it declared, and thus
+ * exhausting its runtime.
+ *
+ * Here we are interested in making runtime overrun possible, but we do
+ * not want a entity which is misbehaving to affect the scheduling of all
+ * other entities.
+ * Therefore, a budgeting strategy called Constant Bandwidth Server (CBS)
+ * is used, in order to confine each entity within its own bandwidth.
+ *
+ * This function deals exactly with that, and ensures that when the runtime
+ * of a entity is replenished, its deadline is also postponed. That ensures
+ * the overrunning entity can't interfere with other entity in the system and
+ * can't make them miss their deadlines. Reasons why this kind of overruns
+ * could happen are, typically, a entity voluntarily trying to overcome its
+ * runtime, or it just underestimated it during sched_setscheduler_ex().
+ */
+static void replenish_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+
+	/*
+	 * We keep moving the deadline away until we get some
+	 * available runtime for the entity. This ensures correct
+	 * handling of situations where the runtime overrun is
+	 * arbitrary large.
+	 */
+	while (dl_se->runtime <= 0) {
+		dl_se->deadline += dl_se->dl_deadline;
+		dl_se->runtime += dl_se->dl_runtime;
+	}
+
+	/*
+	 * At this point, the deadline really should be "in
+	 * the future" with respect to rq->clock. If it's
+	 * not, we are, for some reason, lagging too much!
+	 * Anyway, after having warn userspace abut that,
+	 * we still try to keep the things running by
+	 * resetting the deadline and the budget of the
+	 * entity.
+	 */
+	if (dl_time_before(dl_se->deadline, rq_clock(rq))) {
+		static bool lag_once = false;
+
+		if (!lag_once) {
+			lag_once = true;
+			printk_sched("sched: DL replenish lagged to much\n");
+		}
+		dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
+		dl_se->runtime = dl_se->dl_runtime;
+	}
+}
+
+/*
+ * Here we check if --at time t-- an entity (which is probably being
+ * [re]activated or, in general, enqueued) can use its remaining runtime
+ * and its current deadline _without_ exceeding the bandwidth it is
+ * assigned (function returns true if it can't). We are in fact applying
+ * one of the CBS rules: when a task wakes up, if the residual runtime
+ * over residual deadline fits within the allocated bandwidth, then we
+ * can keep the current (absolute) deadline and residual budget without
+ * disrupting the schedulability of the system. Otherwise, we should
+ * refill the runtime and set the deadline a period in the future,
+ * because keeping the current (absolute) deadline of the task would
+ * result in breaking guarantees promised to other tasks.
+ *
+ * This function returns true if:
+ *
+ *   runtime / (deadline - t) > dl_runtime / dl_deadline ,
+ *
+ * IOW we can't recycle current parameters.
+ */
+static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
+{
+	u64 left, right;
+
+	/*
+	 * left and right are the two sides of the equation above,
+	 * after a bit of shuffling to use multiplications instead
+	 * of divisions.
+	 *
+	 * Note that none of the time values involved in the two
+	 * multiplications are absolute: dl_deadline and dl_runtime
+	 * are the relative deadline and the maximum runtime of each
+	 * instance, runtime is the runtime left for the last instance
+	 * and (deadline - t), since t is rq->clock, is the time left
+	 * to the (absolute) deadline. Even if overflowing the u64 type
+	 * is very unlikely to occur in both cases, here we scale down
+	 * as we want to avoid that risk at all. Scaling down by 10
+	 * means that we reduce granularity to 1us. We are fine with it,
+	 * since this is only a true/false check and, anyway, thinking
+	 * of anything below microseconds resolution is actually fiction
+	 * (but still we want to give the user that illusion >;).
+	 */
+	left = (dl_se->dl_deadline >> 10) * (dl_se->runtime >> 10);
+	right = ((dl_se->deadline - t) >> 10) * (dl_se->dl_runtime >> 10);
+
+	return dl_time_before(right, left);
+}
+
+/*
+ * When a -deadline entity is queued back on the runqueue, its runtime and
+ * deadline might need updating.
+ *
+ * The policy here is that we update the deadline of the entity only if:
+ *  - the current deadline is in the past,
+ *  - using the remaining runtime with the current deadline would make
+ *    the entity exceed its bandwidth.
+ */
+static void update_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+
+	/*
+	 * The arrival of a new instance needs special treatment, i.e.,
+	 * the actual scheduling parameters have to be "renewed".
+	 */
+	if (dl_se->dl_new) {
+		setup_new_dl_entity(dl_se);
+		return;
+	}
+
+	if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
+	    dl_entity_overflow(dl_se, rq_clock(rq))) {
+		dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
+		dl_se->runtime = dl_se->dl_runtime;
+	}
+}
+
+/*
+ * If the entity depleted all its runtime, and if we want it to sleep
+ * while waiting for some new execution time to become available, we
+ * set the bandwidth enforcement timer to the replenishment instant
+ * and try to activate it.
+ *
+ * Notice that it is important for the caller to know if the timer
+ * actually started or not (i.e., the replenishment instant is in
+ * the future or in the past).
+ */
+static int start_dl_timer(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+	ktime_t now, act;
+	ktime_t soft, hard;
+	unsigned long range;
+	s64 delta;
+
+	/*
+	 * We want the timer to fire at the deadline, but considering
+	 * that it is actually coming from rq->clock and not from
+	 * hrtimer's time base reading.
+	 */
+	act = ns_to_ktime(dl_se->deadline);
+	now = hrtimer_cb_get_time(&dl_se->dl_timer);
+	delta = ktime_to_ns(now) - rq_clock(rq);
+	act = ktime_add_ns(act, delta);
+
+	/*
+	 * If the expiry time already passed, e.g., because the value
+	 * chosen as the deadline is too small, don't even try to
+	 * start the timer in the past!
+	 */
+	if (ktime_us_delta(act, now) < 0)
+		return 0;
+
+	hrtimer_set_expires(&dl_se->dl_timer, act);
+
+	soft = hrtimer_get_softexpires(&dl_se->dl_timer);
+	hard = hrtimer_get_expires(&dl_se->dl_timer);
+	range = ktime_to_ns(ktime_sub(hard, soft));
+	__hrtimer_start_range_ns(&dl_se->dl_timer, soft,
+				 range, HRTIMER_MODE_ABS, 0);
+
+	return hrtimer_active(&dl_se->dl_timer);
+}
+
+/*
+ * This is the bandwidth enforcement timer callback. If here, we know
+ * a task is not on its dl_rq, since the fact that the timer was running
+ * means the task is throttled and needs a runtime replenishment.
+ *
+ * However, what we actually do depends on the fact the task is active,
+ * (it is on its rq) or has been removed from there by a call to
+ * dequeue_task_dl(). In the former case we must issue the runtime
+ * replenishment and add the task back to the dl_rq; in the latter, we just
+ * do nothing but clearing dl_throttled, so that runtime and deadline
+ * updating (and the queueing back to dl_rq) will be done by the
+ * next call to enqueue_task_dl().
+ */
+static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
+{
+	struct sched_dl_entity *dl_se = container_of(timer,
+						     struct sched_dl_entity,
+						     dl_timer);
+	struct task_struct *p = dl_task_of(dl_se);
+	struct rq *rq = task_rq(p);
+	raw_spin_lock(&rq->lock);
+
+	/*
+	 * We need to take care of a possible races here. In fact, the
+	 * task might have changed its scheduling policy to something
+	 * different from SCHED_DEADLINE or changed its reservation
+	 * parameters (through sched_setscheduler()).
+	 */
+	if (!dl_task(p) || dl_se->dl_new)
+		goto unlock;
+
+	sched_clock_tick();
+	update_rq_clock(rq);
+	dl_se->dl_throttled = 0;
+	if (p->on_rq) {
+		enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
+		if (task_has_dl_policy(rq->curr))
+			check_preempt_curr_dl(rq, p, 0);
+		else
+			resched_task(rq->curr);
+	}
+unlock:
+	raw_spin_unlock(&rq->lock);
+
+	return HRTIMER_NORESTART;
+}
+
+void init_dl_task_timer(struct sched_dl_entity *dl_se)
+{
+	struct hrtimer *timer = &dl_se->dl_timer;
+
+	if (hrtimer_active(timer)) {
+		hrtimer_try_to_cancel(timer);
+		return;
+	}
+
+	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	timer->function = dl_task_timer;
+}
+
+static
+int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
+{
+	int dmiss = dl_time_before(dl_se->deadline, rq_clock(rq));
+	int rorun = dl_se->runtime <= 0;
+
+	if (!rorun && !dmiss)
+		return 0;
+
+	/*
+	 * If we are beyond our current deadline and we are still
+	 * executing, then we have already used some of the runtime of
+	 * the next instance. Thus, if we do not account that, we are
+	 * stealing bandwidth from the system at each deadline miss!
+	 */
+	if (dmiss) {
+		dl_se->runtime = rorun ? dl_se->runtime : 0;
+		dl_se->runtime -= rq_clock(rq) - dl_se->deadline;
+	}
+
+	return 1;
+}
+
+/*
+ * Update the current task's runtime statistics (provided it is still
+ * a -deadline task and has not been removed from the dl_rq).
+ */
+static void update_curr_dl(struct rq *rq)
+{
+	struct task_struct *curr = rq->curr;
+	struct sched_dl_entity *dl_se = &curr->dl;
+	u64 delta_exec;
+
+	if (!dl_task(curr) || !on_dl_rq(dl_se))
+		return;
+
+	/*
+	 * Consumed budget is computed considering the time as
+	 * observed by schedulable tasks (excluding time spent
+	 * in hardirq context, etc.). Deadlines are instead
+	 * computed using hard walltime. This seems to be the more
+	 * natural solution, but the full ramifications of this
+	 * approach need further study.
+	 */
+	delta_exec = rq_clock_task(rq) - curr->se.exec_start;
+	if (unlikely((s64)delta_exec < 0))
+		delta_exec = 0;
+
+	schedstat_set(curr->se.statistics.exec_max,
+		      max(curr->se.statistics.exec_max, delta_exec));
+
+	curr->se.sum_exec_runtime += delta_exec;
+	account_group_exec_runtime(curr, delta_exec);
+
+	curr->se.exec_start = rq_clock_task(rq);
+	cpuacct_charge(curr, delta_exec);
+
+	dl_se->runtime -= delta_exec;
+	if (dl_runtime_exceeded(rq, dl_se)) {
+		__dequeue_task_dl(rq, curr, 0);
+		if (likely(start_dl_timer(dl_se)))
+			dl_se->dl_throttled = 1;
+		else
+			enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
+
+		if (!is_leftmost(curr, &rq->dl))
+			resched_task(curr);
+	}
+}
+
+static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+	struct rb_node **link = &dl_rq->rb_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct sched_dl_entity *entry;
+	int leftmost = 1;
+
+	BUG_ON(!RB_EMPTY_NODE(&dl_se->rb_node));
+
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct sched_dl_entity, rb_node);
+		if (dl_time_before(dl_se->deadline, entry->deadline))
+			link = &parent->rb_left;
+		else {
+			link = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	if (leftmost)
+		dl_rq->rb_leftmost = &dl_se->rb_node;
+
+	rb_link_node(&dl_se->rb_node, parent, link);
+	rb_insert_color(&dl_se->rb_node, &dl_rq->rb_root);
+
+	dl_rq->dl_nr_running++;
+}
+
+static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
+{
+	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+
+	if (RB_EMPTY_NODE(&dl_se->rb_node))
+		return;
+
+	if (dl_rq->rb_leftmost == &dl_se->rb_node) {
+		struct rb_node *next_node;
+
+		next_node = rb_next(&dl_se->rb_node);
+		dl_rq->rb_leftmost = next_node;
+	}
+
+	rb_erase(&dl_se->rb_node, &dl_rq->rb_root);
+	RB_CLEAR_NODE(&dl_se->rb_node);
+
+	dl_rq->dl_nr_running--;
+}
+
+static void
+enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
+{
+	BUG_ON(on_dl_rq(dl_se));
+
+	/*
+	 * If this is a wakeup or a new instance, the scheduling
+	 * parameters of the task might need updating. Otherwise,
+	 * we want a replenishment of its runtime.
+	 */
+	if (!dl_se->dl_new && flags & ENQUEUE_REPLENISH)
+		replenish_dl_entity(dl_se);
+	else
+		update_dl_entity(dl_se);
+
+	__enqueue_dl_entity(dl_se);
+}
+
+static void dequeue_dl_entity(struct sched_dl_entity *dl_se)
+{
+	__dequeue_dl_entity(dl_se);
+}
+
+static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
+{
+	/*
+	 * If p is throttled, we do nothing. In fact, if it exhausted
+	 * its budget it needs a replenishment and, since it now is on
+	 * its rq, the bandwidth timer callback (which clearly has not
+	 * run yet) will take care of this.
+	 */
+	if (p->dl.dl_throttled)
+		return;
+
+	enqueue_dl_entity(&p->dl, flags);
+	inc_nr_running(rq);
+}
+
+static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
+{
+	dequeue_dl_entity(&p->dl);
+}
+
+static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
+{
+	update_curr_dl(rq);
+	__dequeue_task_dl(rq, p, flags);
+
+	dec_nr_running(rq);
+}
+
+/*
+ * Yield task semantic for -deadline tasks is:
+ *
+ *   get off from the CPU until our next instance, with
+ *   a new runtime. This is of little use now, since we
+ *   don't have a bandwidth reclaiming mechanism. Anyway,
+ *   bandwidth reclaiming is planned for the future, and
+ *   yield_task_dl will indicate that some spare budget
+ *   is available for other task instances to use it.
+ */
+static void yield_task_dl(struct rq *rq)
+{
+	struct task_struct *p = rq->curr;
+
+	/*
+	 * We make the task go to sleep until its current deadline by
+	 * forcing its runtime to zero. This way, update_curr_dl() stops
+	 * it and the bandwidth timer will wake it up and will give it
+	 * new scheduling parameters (thanks to dl_new=1).
+	 */
+	if (p->dl.runtime > 0) {
+		rq->curr->dl.dl_new = 1;
+		p->dl.runtime = 0;
+	}
+	update_curr_dl(rq);
+}
+
+/*
+ * Only called when both the current and waking task are -deadline
+ * tasks.
+ */
+static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
+				  int flags)
+{
+	if (dl_time_before(p->dl.deadline, rq->curr->dl.deadline))
+		resched_task(rq->curr);
+}
+
+#ifdef CONFIG_SCHED_HRTICK
+static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
+{
+	s64 delta = p->dl.dl_runtime - p->dl.runtime;
+
+	if (delta > 10000)
+		hrtick_start(rq, p->dl.runtime);
+}
+#endif
+
+static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
+						   struct dl_rq *dl_rq)
+{
+	struct rb_node *left = dl_rq->rb_leftmost;
+
+	if (!left)
+		return NULL;
+
+	return rb_entry(left, struct sched_dl_entity, rb_node);
+}
+
+struct task_struct *pick_next_task_dl(struct rq *rq)
+{
+	struct sched_dl_entity *dl_se;
+	struct task_struct *p;
+	struct dl_rq *dl_rq;
+
+	dl_rq = &rq->dl;
+
+	if (unlikely(!dl_rq->dl_nr_running))
+		return NULL;
+
+	dl_se = pick_next_dl_entity(rq, dl_rq);
+	BUG_ON(!dl_se);
+
+	p = dl_task_of(dl_se);
+	p->se.exec_start = rq_clock_task(rq);
+#ifdef CONFIG_SCHED_HRTICK
+	if (hrtick_enabled(rq))
+		start_hrtick_dl(rq, p);
+#endif
+	return p;
+}
+
+static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
+{
+	update_curr_dl(rq);
+}
+
+static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
+{
+	update_curr_dl(rq);
+
+#ifdef CONFIG_SCHED_HRTICK
+	if (hrtick_enabled(rq) && queued && p->dl.runtime > 0)
+		start_hrtick_dl(rq, p);
+#endif
+}
+
+static void task_fork_dl(struct task_struct *p)
+{
+	/*
+	 * SCHED_DEADLINE tasks cannot fork and this is achieved through
+	 * sched_fork()
+	 */
+}
+
+static void task_dead_dl(struct task_struct *p)
+{
+	struct hrtimer *timer = &p->dl.dl_timer;
+
+	if (hrtimer_active(timer))
+		hrtimer_try_to_cancel(timer);
+}
+
+static void set_curr_task_dl(struct rq *rq)
+{
+	struct task_struct *p = rq->curr;
+
+	p->se.exec_start = rq_clock_task(rq);
+}
+
+static void switched_from_dl(struct rq *rq, struct task_struct *p)
+{
+	if (hrtimer_active(&p->dl.dl_timer))
+		hrtimer_try_to_cancel(&p->dl.dl_timer);
+}
+
+static void switched_to_dl(struct rq *rq, struct task_struct *p)
+{
+	/*
+	 * If p is throttled, don't consider the possibility
+	 * of preempting rq->curr, the check will be done right
+	 * after its runtime will get replenished.
+	 */
+	if (unlikely(p->dl.dl_throttled))
+		return;
+
+	if (p->on_rq || rq->curr != p) {
+		if (task_has_dl_policy(rq->curr))
+			check_preempt_curr_dl(rq, p, 0);
+		else
+			resched_task(rq->curr);
+	}
+}
+
+static void prio_changed_dl(struct rq *rq, struct task_struct *p,
+			    int oldprio)
+{
+	switched_to_dl(rq, p);
+}
+
+#ifdef CONFIG_SMP
+static int
+select_task_rq_dl(struct task_struct *p, int prev_cpu, int sd_flag, int flags)
+{
+	return task_cpu(p);
+}
+#endif
+
+const struct sched_class dl_sched_class = {
+	.next			= &rt_sched_class,
+	.enqueue_task		= enqueue_task_dl,
+	.dequeue_task		= dequeue_task_dl,
+	.yield_task		= yield_task_dl,
+
+	.check_preempt_curr	= check_preempt_curr_dl,
+
+	.pick_next_task		= pick_next_task_dl,
+	.put_prev_task		= put_prev_task_dl,
+
+#ifdef CONFIG_SMP
+	.select_task_rq		= select_task_rq_dl,
+#endif
+
+	.set_curr_task		= set_curr_task_dl,
+	.task_tick		= task_tick_dl,
+	.task_fork              = task_fork_dl,
+	.task_dead		= task_dead_dl,
+
+	.prio_changed           = prio_changed_dl,
+	.switched_from		= switched_from_dl,
+	.switched_to		= switched_to_dl,
+};
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index df023db..83eb539 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2,6 +2,7 @@
 #include <linux/sched.h>
 #include <linux/sched/sysctl.h>
 #include <linux/sched/rt.h>
+#include <linux/sched/deadline.h>
 #include <linux/mutex.h>
 #include <linux/spinlock.h>
 #include <linux/stop_machine.h>
@@ -91,11 +92,21 @@ static inline int rt_policy(int policy)
 	return policy == SCHED_FIFO || policy == SCHED_RR;
 }
 
+static inline int dl_policy(int policy)
+{
+	return policy == SCHED_DEADLINE;
+}
+
 static inline int task_has_rt_policy(struct task_struct *p)
 {
 	return rt_policy(p->policy);
 }
 
+static inline int task_has_dl_policy(struct task_struct *p)
+{
+	return dl_policy(p->policy);
+}
+
 /*
  * This is the priority-queue data structure of the RT scheduling class:
  */
@@ -367,6 +378,15 @@ struct rt_rq {
 #endif
 };
 
+/* Deadline class' related fields in a runqueue */
+struct dl_rq {
+	/* runqueue is an rbtree, ordered by deadline */
+	struct rb_root rb_root;
+	struct rb_node *rb_leftmost;
+
+	unsigned long dl_nr_running;
+};
+
 #ifdef CONFIG_SMP
 
 /*
@@ -435,6 +455,7 @@ struct rq {
 
 	struct cfs_rq cfs;
 	struct rt_rq rt;
+	struct dl_rq dl;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this cpu: */
@@ -991,6 +1012,7 @@ static const u32 prio_to_wmult[40] = {
 #else
 #define ENQUEUE_WAKING		0
 #endif
+#define ENQUEUE_REPLENISH	8
 
 #define DEQUEUE_SLEEP		1
 
@@ -1046,6 +1068,7 @@ struct sched_class {
    for (class = sched_class_highest; class; class = class->next)
 
 extern const struct sched_class stop_sched_class;
+extern const struct sched_class dl_sched_class;
 extern const struct sched_class rt_sched_class;
 extern const struct sched_class fair_sched_class;
 extern const struct sched_class idle_sched_class;
@@ -1081,6 +1104,8 @@ extern void resched_cpu(int cpu);
 extern struct rt_bandwidth def_rt_bandwidth;
 extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
 
+extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
+
 extern void update_idle_cpu_load(struct rq *this_rq);
 
 extern void init_task_runnable_average(struct task_struct *p);
@@ -1357,6 +1382,7 @@ extern void print_rt_stats(struct seq_file *m, int cpu);
 
 extern void init_cfs_rq(struct cfs_rq *cfs_rq);
 extern void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq);
+extern void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq);
 
 extern void cfs_bandwidth_usage_inc(void);
 extern void cfs_bandwidth_usage_dec(void);
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 47197de..fdb6bb0 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -103,7 +103,7 @@ get_rr_interval_stop(struct rq *rq, struct task_struct *task)
  * Simple, special scheduling class for the per-CPU stop tasks:
  */
 const struct sched_class stop_sched_class = {
-	.next			= &rt_sched_class,
+	.next			= &dl_sched_class,
 
 	.enqueue_task		= enqueue_task_stop,
 	.dequeue_task		= dequeue_task_stop,

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [tip:sched/core] sched/deadline: Add SCHED_DEADLINE SMP-related data structures & logic
  2013-11-07 13:43 ` [PATCH 04/14] sched: SCHED_DEADLINE SMP-related data structures & logic Juri Lelli
  2013-11-20 18:51   ` Steven Rostedt
@ 2014-01-13 15:53   ` tip-bot for Juri Lelli
  1 sibling, 0 replies; 81+ messages in thread
From: tip-bot for Juri Lelli @ 2014-01-13 15:53 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, raistlin, tglx, juri.lelli

Commit-ID:  1baca4ce16b8cc7d4f50be1f7914799af30a2861
Gitweb:     http://git.kernel.org/tip/1baca4ce16b8cc7d4f50be1f7914799af30a2861
Author:     Juri Lelli <juri.lelli@gmail.com>
AuthorDate: Thu, 7 Nov 2013 14:43:38 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 13 Jan 2014 13:41:07 +0100

sched/deadline: Add SCHED_DEADLINE SMP-related data structures & logic

Introduces data structures relevant for implementing dynamic
migration of -deadline tasks and the logic for checking if
runqueues are overloaded with -deadline tasks and for choosing
where a task should migrate, when it is the case.

Adds also dynamic migrations to SCHED_DEADLINE, so that tasks can
be moved among CPUs when necessary. It is also possible to bind a
task to a (set of) CPU(s), thus restricting its capability of
migrating, or forbidding migrations at all.

The very same approach used in sched_rt is utilised:
 - -deadline tasks are kept into CPU-specific runqueues,
 - -deadline tasks are migrated among runqueues to achieve the
   following:
    * on an M-CPU system the M earliest deadline ready tasks
      are always running;
    * affinity/cpusets settings of all the -deadline tasks is
      always respected.

Therefore, this very special form of "load balancing" is done with
an active method, i.e., the scheduler pushes or pulls tasks between
runqueues when they are woken up and/or (de)scheduled.
IOW, every time a preemption occurs, the descheduled task might be sent
to some other CPU (depending on its deadline) to continue executing
(push). On the other hand, every time a CPU becomes idle, it might pull
the second earliest deadline ready task from some other CPU.

To enforce this, a pull operation is always attempted before taking any
scheduling decision (pre_schedule()), as well as a push one after each
scheduling decision (post_schedule()). In addition, when a task arrives
or wakes up, the best CPU where to resume it is selected taking into
account its affinity mask, the system topology, but also its deadline.
E.g., from the scheduling point of view, the best CPU where to wake
up (and also where to push) a task is the one which is running the task
with the latest deadline among the M executing ones.

In order to facilitate these decisions, per-runqueue "caching" of the
deadlines of the currently running and of the first ready task is used.
Queued but not running tasks are also parked in another rb-tree to
speed-up pushes.

Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-5-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h   |   1 +
 kernel/sched/core.c     |   9 +-
 kernel/sched/deadline.c | 934 +++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/rt.c       |   2 +-
 kernel/sched/sched.h    |  34 ++
 5 files changed, 963 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6c19679..cc66f26 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1201,6 +1201,7 @@ struct task_struct {
 	struct list_head tasks;
 #ifdef CONFIG_SMP
 	struct plist_node pushable_tasks;
+	struct rb_node pushable_dl_tasks;
 #endif
 
 	struct mm_struct *mm, *active_mm;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 203aecd..548cc04 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1848,6 +1848,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 	init_task_preempt_count(p);
 #ifdef CONFIG_SMP
 	plist_node_init(&p->pushable_tasks, MAX_PRIO);
+	RB_CLEAR_NODE(&p->pushable_dl_tasks);
 #endif
 
 	put_cpu();
@@ -5040,6 +5041,7 @@ static void free_rootdomain(struct rcu_head *rcu)
 	struct root_domain *rd = container_of(rcu, struct root_domain, rcu);
 
 	cpupri_cleanup(&rd->cpupri);
+	free_cpumask_var(rd->dlo_mask);
 	free_cpumask_var(rd->rto_mask);
 	free_cpumask_var(rd->online);
 	free_cpumask_var(rd->span);
@@ -5091,8 +5093,10 @@ static int init_rootdomain(struct root_domain *rd)
 		goto out;
 	if (!alloc_cpumask_var(&rd->online, GFP_KERNEL))
 		goto free_span;
-	if (!alloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
+	if (!alloc_cpumask_var(&rd->dlo_mask, GFP_KERNEL))
 		goto free_online;
+	if (!alloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
+		goto free_dlo_mask;
 
 	if (cpupri_init(&rd->cpupri) != 0)
 		goto free_rto_mask;
@@ -5100,6 +5104,8 @@ static int init_rootdomain(struct root_domain *rd)
 
 free_rto_mask:
 	free_cpumask_var(rd->rto_mask);
+free_dlo_mask:
+	free_cpumask_var(rd->dlo_mask);
 free_online:
 	free_cpumask_var(rd->online);
 free_span:
@@ -6451,6 +6457,7 @@ void __init sched_init_smp(void)
 	free_cpumask_var(non_isolated_cpus);
 
 	init_sched_rt_class();
+	init_sched_dl_class();
 }
 #else
 void __init sched_init_smp(void)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 93d82b2..fcc02c9 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -10,6 +10,7 @@
  * miss some of their deadlines), and won't affect any other task.
  *
  * Copyright (C) 2012 Dario Faggioli <raistlin@linux.it>,
+ *                    Juri Lelli <juri.lelli@gmail.com>,
  *                    Michael Trimarchi <michael@amarulasolutions.com>,
  *                    Fabio Checconi <fchecconi@gmail.com>
  */
@@ -20,6 +21,15 @@ static inline int dl_time_before(u64 a, u64 b)
 	return (s64)(a - b) < 0;
 }
 
+/*
+ * Tells if entity @a should preempt entity @b.
+ */
+static inline
+int dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
+{
+	return dl_time_before(a->deadline, b->deadline);
+}
+
 static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
 {
 	return container_of(dl_se, struct task_struct, dl);
@@ -53,8 +63,168 @@ static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
 void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
 {
 	dl_rq->rb_root = RB_ROOT;
+
+#ifdef CONFIG_SMP
+	/* zero means no -deadline tasks */
+	dl_rq->earliest_dl.curr = dl_rq->earliest_dl.next = 0;
+
+	dl_rq->dl_nr_migratory = 0;
+	dl_rq->overloaded = 0;
+	dl_rq->pushable_dl_tasks_root = RB_ROOT;
+#endif
+}
+
+#ifdef CONFIG_SMP
+
+static inline int dl_overloaded(struct rq *rq)
+{
+	return atomic_read(&rq->rd->dlo_count);
+}
+
+static inline void dl_set_overload(struct rq *rq)
+{
+	if (!rq->online)
+		return;
+
+	cpumask_set_cpu(rq->cpu, rq->rd->dlo_mask);
+	/*
+	 * Must be visible before the overload count is
+	 * set (as in sched_rt.c).
+	 *
+	 * Matched by the barrier in pull_dl_task().
+	 */
+	smp_wmb();
+	atomic_inc(&rq->rd->dlo_count);
+}
+
+static inline void dl_clear_overload(struct rq *rq)
+{
+	if (!rq->online)
+		return;
+
+	atomic_dec(&rq->rd->dlo_count);
+	cpumask_clear_cpu(rq->cpu, rq->rd->dlo_mask);
+}
+
+static void update_dl_migration(struct dl_rq *dl_rq)
+{
+	if (dl_rq->dl_nr_migratory && dl_rq->dl_nr_total > 1) {
+		if (!dl_rq->overloaded) {
+			dl_set_overload(rq_of_dl_rq(dl_rq));
+			dl_rq->overloaded = 1;
+		}
+	} else if (dl_rq->overloaded) {
+		dl_clear_overload(rq_of_dl_rq(dl_rq));
+		dl_rq->overloaded = 0;
+	}
+}
+
+static void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	struct task_struct *p = dl_task_of(dl_se);
+	dl_rq = &rq_of_dl_rq(dl_rq)->dl;
+
+	dl_rq->dl_nr_total++;
+	if (p->nr_cpus_allowed > 1)
+		dl_rq->dl_nr_migratory++;
+
+	update_dl_migration(dl_rq);
+}
+
+static void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	struct task_struct *p = dl_task_of(dl_se);
+	dl_rq = &rq_of_dl_rq(dl_rq)->dl;
+
+	dl_rq->dl_nr_total--;
+	if (p->nr_cpus_allowed > 1)
+		dl_rq->dl_nr_migratory--;
+
+	update_dl_migration(dl_rq);
+}
+
+/*
+ * The list of pushable -deadline task is not a plist, like in
+ * sched_rt.c, it is an rb-tree with tasks ordered by deadline.
+ */
+static void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
+{
+	struct dl_rq *dl_rq = &rq->dl;
+	struct rb_node **link = &dl_rq->pushable_dl_tasks_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct task_struct *entry;
+	int leftmost = 1;
+
+	BUG_ON(!RB_EMPTY_NODE(&p->pushable_dl_tasks));
+
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct task_struct,
+				 pushable_dl_tasks);
+		if (dl_entity_preempt(&p->dl, &entry->dl))
+			link = &parent->rb_left;
+		else {
+			link = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	if (leftmost)
+		dl_rq->pushable_dl_tasks_leftmost = &p->pushable_dl_tasks;
+
+	rb_link_node(&p->pushable_dl_tasks, parent, link);
+	rb_insert_color(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root);
 }
 
+static void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
+{
+	struct dl_rq *dl_rq = &rq->dl;
+
+	if (RB_EMPTY_NODE(&p->pushable_dl_tasks))
+		return;
+
+	if (dl_rq->pushable_dl_tasks_leftmost == &p->pushable_dl_tasks) {
+		struct rb_node *next_node;
+
+		next_node = rb_next(&p->pushable_dl_tasks);
+		dl_rq->pushable_dl_tasks_leftmost = next_node;
+	}
+
+	rb_erase(&p->pushable_dl_tasks, &dl_rq->pushable_dl_tasks_root);
+	RB_CLEAR_NODE(&p->pushable_dl_tasks);
+}
+
+static inline int has_pushable_dl_tasks(struct rq *rq)
+{
+	return !RB_EMPTY_ROOT(&rq->dl.pushable_dl_tasks_root);
+}
+
+static int push_dl_task(struct rq *rq);
+
+#else
+
+static inline
+void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
+{
+}
+
+static inline
+void dequeue_pushable_dl_task(struct rq *rq, struct task_struct *p)
+{
+}
+
+static inline
+void inc_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+}
+
+static inline
+void dec_dl_migration(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+}
+
+#endif /* CONFIG_SMP */
+
 static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags);
 static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags);
 static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
@@ -309,6 +479,14 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 			check_preempt_curr_dl(rq, p, 0);
 		else
 			resched_task(rq->curr);
+#ifdef CONFIG_SMP
+		/*
+		 * Queueing this task back might have overloaded rq,
+		 * check if we need to kick someone away.
+		 */
+		if (has_pushable_dl_tasks(rq))
+			push_dl_task(rq);
+#endif
 	}
 unlock:
 	raw_spin_unlock(&rq->lock);
@@ -399,6 +577,100 @@ static void update_curr_dl(struct rq *rq)
 	}
 }
 
+#ifdef CONFIG_SMP
+
+static struct task_struct *pick_next_earliest_dl_task(struct rq *rq, int cpu);
+
+static inline u64 next_deadline(struct rq *rq)
+{
+	struct task_struct *next = pick_next_earliest_dl_task(rq, rq->cpu);
+
+	if (next && dl_prio(next->prio))
+		return next->dl.deadline;
+	else
+		return 0;
+}
+
+static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
+{
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+
+	if (dl_rq->earliest_dl.curr == 0 ||
+	    dl_time_before(deadline, dl_rq->earliest_dl.curr)) {
+		/*
+		 * If the dl_rq had no -deadline tasks, or if the new task
+		 * has shorter deadline than the current one on dl_rq, we
+		 * know that the previous earliest becomes our next earliest,
+		 * as the new task becomes the earliest itself.
+		 */
+		dl_rq->earliest_dl.next = dl_rq->earliest_dl.curr;
+		dl_rq->earliest_dl.curr = deadline;
+	} else if (dl_rq->earliest_dl.next == 0 ||
+		   dl_time_before(deadline, dl_rq->earliest_dl.next)) {
+		/*
+		 * On the other hand, if the new -deadline task has a
+		 * a later deadline than the earliest one on dl_rq, but
+		 * it is earlier than the next (if any), we must
+		 * recompute the next-earliest.
+		 */
+		dl_rq->earliest_dl.next = next_deadline(rq);
+	}
+}
+
+static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
+{
+	struct rq *rq = rq_of_dl_rq(dl_rq);
+
+	/*
+	 * Since we may have removed our earliest (and/or next earliest)
+	 * task we must recompute them.
+	 */
+	if (!dl_rq->dl_nr_running) {
+		dl_rq->earliest_dl.curr = 0;
+		dl_rq->earliest_dl.next = 0;
+	} else {
+		struct rb_node *leftmost = dl_rq->rb_leftmost;
+		struct sched_dl_entity *entry;
+
+		entry = rb_entry(leftmost, struct sched_dl_entity, rb_node);
+		dl_rq->earliest_dl.curr = entry->deadline;
+		dl_rq->earliest_dl.next = next_deadline(rq);
+	}
+}
+
+#else
+
+static inline void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
+static inline void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline) {}
+
+#endif /* CONFIG_SMP */
+
+static inline
+void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	int prio = dl_task_of(dl_se)->prio;
+	u64 deadline = dl_se->deadline;
+
+	WARN_ON(!dl_prio(prio));
+	dl_rq->dl_nr_running++;
+
+	inc_dl_deadline(dl_rq, deadline);
+	inc_dl_migration(dl_se, dl_rq);
+}
+
+static inline
+void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+	int prio = dl_task_of(dl_se)->prio;
+
+	WARN_ON(!dl_prio(prio));
+	WARN_ON(!dl_rq->dl_nr_running);
+	dl_rq->dl_nr_running--;
+
+	dec_dl_deadline(dl_rq, dl_se->deadline);
+	dec_dl_migration(dl_se, dl_rq);
+}
+
 static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
@@ -426,7 +698,7 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
 	rb_link_node(&dl_se->rb_node, parent, link);
 	rb_insert_color(&dl_se->rb_node, &dl_rq->rb_root);
 
-	dl_rq->dl_nr_running++;
+	inc_dl_tasks(dl_se, dl_rq);
 }
 
 static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
@@ -446,7 +718,7 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
 	rb_erase(&dl_se->rb_node, &dl_rq->rb_root);
 	RB_CLEAR_NODE(&dl_se->rb_node);
 
-	dl_rq->dl_nr_running--;
+	dec_dl_tasks(dl_se, dl_rq);
 }
 
 static void
@@ -484,12 +756,17 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 		return;
 
 	enqueue_dl_entity(&p->dl, flags);
+
+	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
+		enqueue_pushable_dl_task(rq, p);
+
 	inc_nr_running(rq);
 }
 
 static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	dequeue_dl_entity(&p->dl);
+	dequeue_pushable_dl_task(rq, p);
 }
 
 static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
@@ -527,6 +804,74 @@ static void yield_task_dl(struct rq *rq)
 	update_curr_dl(rq);
 }
 
+#ifdef CONFIG_SMP
+
+static int find_later_rq(struct task_struct *task);
+static int latest_cpu_find(struct cpumask *span,
+			   struct task_struct *task,
+			   struct cpumask *later_mask);
+
+static int
+select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
+{
+	struct task_struct *curr;
+	struct rq *rq;
+
+	if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
+		goto out;
+
+	rq = cpu_rq(cpu);
+
+	rcu_read_lock();
+	curr = ACCESS_ONCE(rq->curr); /* unlocked access */
+
+	/*
+	 * If we are dealing with a -deadline task, we must
+	 * decide where to wake it up.
+	 * If it has a later deadline and the current task
+	 * on this rq can't move (provided the waking task
+	 * can!) we prefer to send it somewhere else. On the
+	 * other hand, if it has a shorter deadline, we
+	 * try to make it stay here, it might be important.
+	 */
+	if (unlikely(dl_task(curr)) &&
+	    (curr->nr_cpus_allowed < 2 ||
+	     !dl_entity_preempt(&p->dl, &curr->dl)) &&
+	    (p->nr_cpus_allowed > 1)) {
+		int target = find_later_rq(p);
+
+		if (target != -1)
+			cpu = target;
+	}
+	rcu_read_unlock();
+
+out:
+	return cpu;
+}
+
+static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
+{
+	/*
+	 * Current can't be migrated, useless to reschedule,
+	 * let's hope p can move out.
+	 */
+	if (rq->curr->nr_cpus_allowed == 1 ||
+	    latest_cpu_find(rq->rd->span, rq->curr, NULL) == -1)
+		return;
+
+	/*
+	 * p is migratable, so let's not schedule it and
+	 * see if it is pushed or pulled somewhere else.
+	 */
+	if (p->nr_cpus_allowed != 1 &&
+	    latest_cpu_find(rq->rd->span, p, NULL) != -1)
+		return;
+
+	resched_task(rq->curr);
+}
+
+#endif /* CONFIG_SMP */
+
 /*
  * Only called when both the current and waking task are -deadline
  * tasks.
@@ -534,8 +879,20 @@ static void yield_task_dl(struct rq *rq)
 static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
 				  int flags)
 {
-	if (dl_time_before(p->dl.deadline, rq->curr->dl.deadline))
+	if (dl_entity_preempt(&p->dl, &rq->curr->dl)) {
 		resched_task(rq->curr);
+		return;
+	}
+
+#ifdef CONFIG_SMP
+	/*
+	 * In the unlikely case current and p have the same deadline
+	 * let us try to decide what's the best thing to do...
+	 */
+	if ((s64)(p->dl.deadline - rq->curr->dl.deadline) == 0 &&
+	    !need_resched())
+		check_preempt_equal_dl(rq, p);
+#endif /* CONFIG_SMP */
 }
 
 #ifdef CONFIG_SCHED_HRTICK
@@ -575,16 +932,29 @@ struct task_struct *pick_next_task_dl(struct rq *rq)
 
 	p = dl_task_of(dl_se);
 	p->se.exec_start = rq_clock_task(rq);
+
+	/* Running task will never be pushed. */
+	if (p)
+		dequeue_pushable_dl_task(rq, p);
+
 #ifdef CONFIG_SCHED_HRTICK
 	if (hrtick_enabled(rq))
 		start_hrtick_dl(rq, p);
 #endif
+
+#ifdef CONFIG_SMP
+	rq->post_schedule = has_pushable_dl_tasks(rq);
+#endif /* CONFIG_SMP */
+
 	return p;
 }
 
 static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
 {
 	update_curr_dl(rq);
+
+	if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
+		enqueue_pushable_dl_task(rq, p);
 }
 
 static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
@@ -618,16 +988,517 @@ static void set_curr_task_dl(struct rq *rq)
 	struct task_struct *p = rq->curr;
 
 	p->se.exec_start = rq_clock_task(rq);
+
+	/* You can't push away the running task */
+	dequeue_pushable_dl_task(rq, p);
+}
+
+#ifdef CONFIG_SMP
+
+/* Only try algorithms three times */
+#define DL_MAX_TRIES 3
+
+static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu)
+{
+	if (!task_running(rq, p) &&
+	    (cpu < 0 || cpumask_test_cpu(cpu, &p->cpus_allowed)) &&
+	    (p->nr_cpus_allowed > 1))
+		return 1;
+
+	return 0;
+}
+
+/* Returns the second earliest -deadline task, NULL otherwise */
+static struct task_struct *pick_next_earliest_dl_task(struct rq *rq, int cpu)
+{
+	struct rb_node *next_node = rq->dl.rb_leftmost;
+	struct sched_dl_entity *dl_se;
+	struct task_struct *p = NULL;
+
+next_node:
+	next_node = rb_next(next_node);
+	if (next_node) {
+		dl_se = rb_entry(next_node, struct sched_dl_entity, rb_node);
+		p = dl_task_of(dl_se);
+
+		if (pick_dl_task(rq, p, cpu))
+			return p;
+
+		goto next_node;
+	}
+
+	return NULL;
+}
+
+static int latest_cpu_find(struct cpumask *span,
+			   struct task_struct *task,
+			   struct cpumask *later_mask)
+{
+	const struct sched_dl_entity *dl_se = &task->dl;
+	int cpu, found = -1, best = 0;
+	u64 max_dl = 0;
+
+	for_each_cpu(cpu, span) {
+		struct rq *rq = cpu_rq(cpu);
+		struct dl_rq *dl_rq = &rq->dl;
+
+		if (cpumask_test_cpu(cpu, &task->cpus_allowed) &&
+		    (!dl_rq->dl_nr_running || dl_time_before(dl_se->deadline,
+		     dl_rq->earliest_dl.curr))) {
+			if (later_mask)
+				cpumask_set_cpu(cpu, later_mask);
+			if (!best && !dl_rq->dl_nr_running) {
+				best = 1;
+				found = cpu;
+			} else if (!best &&
+				   dl_time_before(max_dl,
+						  dl_rq->earliest_dl.curr)) {
+				max_dl = dl_rq->earliest_dl.curr;
+				found = cpu;
+			}
+		} else if (later_mask)
+			cpumask_clear_cpu(cpu, later_mask);
+	}
+
+	return found;
+}
+
+static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
+
+static int find_later_rq(struct task_struct *task)
+{
+	struct sched_domain *sd;
+	struct cpumask *later_mask = __get_cpu_var(local_cpu_mask_dl);
+	int this_cpu = smp_processor_id();
+	int best_cpu, cpu = task_cpu(task);
+
+	/* Make sure the mask is initialized first */
+	if (unlikely(!later_mask))
+		return -1;
+
+	if (task->nr_cpus_allowed == 1)
+		return -1;
+
+	best_cpu = latest_cpu_find(task_rq(task)->rd->span, task, later_mask);
+	if (best_cpu == -1)
+		return -1;
+
+	/*
+	 * If we are here, some target has been found,
+	 * the most suitable of which is cached in best_cpu.
+	 * This is, among the runqueues where the current tasks
+	 * have later deadlines than the task's one, the rq
+	 * with the latest possible one.
+	 *
+	 * Now we check how well this matches with task's
+	 * affinity and system topology.
+	 *
+	 * The last cpu where the task run is our first
+	 * guess, since it is most likely cache-hot there.
+	 */
+	if (cpumask_test_cpu(cpu, later_mask))
+		return cpu;
+	/*
+	 * Check if this_cpu is to be skipped (i.e., it is
+	 * not in the mask) or not.
+	 */
+	if (!cpumask_test_cpu(this_cpu, later_mask))
+		this_cpu = -1;
+
+	rcu_read_lock();
+	for_each_domain(cpu, sd) {
+		if (sd->flags & SD_WAKE_AFFINE) {
+
+			/*
+			 * If possible, preempting this_cpu is
+			 * cheaper than migrating.
+			 */
+			if (this_cpu != -1 &&
+			    cpumask_test_cpu(this_cpu, sched_domain_span(sd))) {
+				rcu_read_unlock();
+				return this_cpu;
+			}
+
+			/*
+			 * Last chance: if best_cpu is valid and is
+			 * in the mask, that becomes our choice.
+			 */
+			if (best_cpu < nr_cpu_ids &&
+			    cpumask_test_cpu(best_cpu, sched_domain_span(sd))) {
+				rcu_read_unlock();
+				return best_cpu;
+			}
+		}
+	}
+	rcu_read_unlock();
+
+	/*
+	 * At this point, all our guesses failed, we just return
+	 * 'something', and let the caller sort the things out.
+	 */
+	if (this_cpu != -1)
+		return this_cpu;
+
+	cpu = cpumask_any(later_mask);
+	if (cpu < nr_cpu_ids)
+		return cpu;
+
+	return -1;
+}
+
+/* Locks the rq it finds */
+static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
+{
+	struct rq *later_rq = NULL;
+	int tries;
+	int cpu;
+
+	for (tries = 0; tries < DL_MAX_TRIES; tries++) {
+		cpu = find_later_rq(task);
+
+		if ((cpu == -1) || (cpu == rq->cpu))
+			break;
+
+		later_rq = cpu_rq(cpu);
+
+		/* Retry if something changed. */
+		if (double_lock_balance(rq, later_rq)) {
+			if (unlikely(task_rq(task) != rq ||
+				     !cpumask_test_cpu(later_rq->cpu,
+				                       &task->cpus_allowed) ||
+				     task_running(rq, task) || !task->on_rq)) {
+				double_unlock_balance(rq, later_rq);
+				later_rq = NULL;
+				break;
+			}
+		}
+
+		/*
+		 * If the rq we found has no -deadline task, or
+		 * its earliest one has a later deadline than our
+		 * task, the rq is a good one.
+		 */
+		if (!later_rq->dl.dl_nr_running ||
+		    dl_time_before(task->dl.deadline,
+				   later_rq->dl.earliest_dl.curr))
+			break;
+
+		/* Otherwise we try again. */
+		double_unlock_balance(rq, later_rq);
+		later_rq = NULL;
+	}
+
+	return later_rq;
+}
+
+static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
+{
+	struct task_struct *p;
+
+	if (!has_pushable_dl_tasks(rq))
+		return NULL;
+
+	p = rb_entry(rq->dl.pushable_dl_tasks_leftmost,
+		     struct task_struct, pushable_dl_tasks);
+
+	BUG_ON(rq->cpu != task_cpu(p));
+	BUG_ON(task_current(rq, p));
+	BUG_ON(p->nr_cpus_allowed <= 1);
+
+	BUG_ON(!p->se.on_rq);
+	BUG_ON(!dl_task(p));
+
+	return p;
+}
+
+/*
+ * See if the non running -deadline tasks on this rq
+ * can be sent to some other CPU where they can preempt
+ * and start executing.
+ */
+static int push_dl_task(struct rq *rq)
+{
+	struct task_struct *next_task;
+	struct rq *later_rq;
+
+	if (!rq->dl.overloaded)
+		return 0;
+
+	next_task = pick_next_pushable_dl_task(rq);
+	if (!next_task)
+		return 0;
+
+retry:
+	if (unlikely(next_task == rq->curr)) {
+		WARN_ON(1);
+		return 0;
+	}
+
+	/*
+	 * If next_task preempts rq->curr, and rq->curr
+	 * can move away, it makes sense to just reschedule
+	 * without going further in pushing next_task.
+	 */
+	if (dl_task(rq->curr) &&
+	    dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) &&
+	    rq->curr->nr_cpus_allowed > 1) {
+		resched_task(rq->curr);
+		return 0;
+	}
+
+	/* We might release rq lock */
+	get_task_struct(next_task);
+
+	/* Will lock the rq it'll find */
+	later_rq = find_lock_later_rq(next_task, rq);
+	if (!later_rq) {
+		struct task_struct *task;
+
+		/*
+		 * We must check all this again, since
+		 * find_lock_later_rq releases rq->lock and it is
+		 * then possible that next_task has migrated.
+		 */
+		task = pick_next_pushable_dl_task(rq);
+		if (task_cpu(next_task) == rq->cpu && task == next_task) {
+			/*
+			 * The task is still there. We don't try
+			 * again, some other cpu will pull it when ready.
+			 */
+			dequeue_pushable_dl_task(rq, next_task);
+			goto out;
+		}
+
+		if (!task)
+			/* No more tasks */
+			goto out;
+
+		put_task_struct(next_task);
+		next_task = task;
+		goto retry;
+	}
+
+	deactivate_task(rq, next_task, 0);
+	set_task_cpu(next_task, later_rq->cpu);
+	activate_task(later_rq, next_task, 0);
+
+	resched_task(later_rq->curr);
+
+	double_unlock_balance(rq, later_rq);
+
+out:
+	put_task_struct(next_task);
+
+	return 1;
+}
+
+static void push_dl_tasks(struct rq *rq)
+{
+	/* Terminates as it moves a -deadline task */
+	while (push_dl_task(rq))
+		;
 }
 
+static int pull_dl_task(struct rq *this_rq)
+{
+	int this_cpu = this_rq->cpu, ret = 0, cpu;
+	struct task_struct *p;
+	struct rq *src_rq;
+	u64 dmin = LONG_MAX;
+
+	if (likely(!dl_overloaded(this_rq)))
+		return 0;
+
+	/*
+	 * Match the barrier from dl_set_overloaded; this guarantees that if we
+	 * see overloaded we must also see the dlo_mask bit.
+	 */
+	smp_rmb();
+
+	for_each_cpu(cpu, this_rq->rd->dlo_mask) {
+		if (this_cpu == cpu)
+			continue;
+
+		src_rq = cpu_rq(cpu);
+
+		/*
+		 * It looks racy, abd it is! However, as in sched_rt.c,
+		 * we are fine with this.
+		 */
+		if (this_rq->dl.dl_nr_running &&
+		    dl_time_before(this_rq->dl.earliest_dl.curr,
+				   src_rq->dl.earliest_dl.next))
+			continue;
+
+		/* Might drop this_rq->lock */
+		double_lock_balance(this_rq, src_rq);
+
+		/*
+		 * If there are no more pullable tasks on the
+		 * rq, we're done with it.
+		 */
+		if (src_rq->dl.dl_nr_running <= 1)
+			goto skip;
+
+		p = pick_next_earliest_dl_task(src_rq, this_cpu);
+
+		/*
+		 * We found a task to be pulled if:
+		 *  - it preempts our current (if there's one),
+		 *  - it will preempt the last one we pulled (if any).
+		 */
+		if (p && dl_time_before(p->dl.deadline, dmin) &&
+		    (!this_rq->dl.dl_nr_running ||
+		     dl_time_before(p->dl.deadline,
+				    this_rq->dl.earliest_dl.curr))) {
+			WARN_ON(p == src_rq->curr);
+			WARN_ON(!p->se.on_rq);
+
+			/*
+			 * Then we pull iff p has actually an earlier
+			 * deadline than the current task of its runqueue.
+			 */
+			if (dl_time_before(p->dl.deadline,
+					   src_rq->curr->dl.deadline))
+				goto skip;
+
+			ret = 1;
+
+			deactivate_task(src_rq, p, 0);
+			set_task_cpu(p, this_cpu);
+			activate_task(this_rq, p, 0);
+			dmin = p->dl.deadline;
+
+			/* Is there any other task even earlier? */
+		}
+skip:
+		double_unlock_balance(this_rq, src_rq);
+	}
+
+	return ret;
+}
+
+static void pre_schedule_dl(struct rq *rq, struct task_struct *prev)
+{
+	/* Try to pull other tasks here */
+	if (dl_task(prev))
+		pull_dl_task(rq);
+}
+
+static void post_schedule_dl(struct rq *rq)
+{
+	push_dl_tasks(rq);
+}
+
+/*
+ * Since the task is not running and a reschedule is not going to happen
+ * anytime soon on its runqueue, we try pushing it away now.
+ */
+static void task_woken_dl(struct rq *rq, struct task_struct *p)
+{
+	if (!task_running(rq, p) &&
+	    !test_tsk_need_resched(rq->curr) &&
+	    has_pushable_dl_tasks(rq) &&
+	    p->nr_cpus_allowed > 1 &&
+	    dl_task(rq->curr) &&
+	    (rq->curr->nr_cpus_allowed < 2 ||
+	     dl_entity_preempt(&rq->curr->dl, &p->dl))) {
+		push_dl_tasks(rq);
+	}
+}
+
+static void set_cpus_allowed_dl(struct task_struct *p,
+				const struct cpumask *new_mask)
+{
+	struct rq *rq;
+	int weight;
+
+	BUG_ON(!dl_task(p));
+
+	/*
+	 * Update only if the task is actually running (i.e.,
+	 * it is on the rq AND it is not throttled).
+	 */
+	if (!on_dl_rq(&p->dl))
+		return;
+
+	weight = cpumask_weight(new_mask);
+
+	/*
+	 * Only update if the process changes its state from whether it
+	 * can migrate or not.
+	 */
+	if ((p->nr_cpus_allowed > 1) == (weight > 1))
+		return;
+
+	rq = task_rq(p);
+
+	/*
+	 * The process used to be able to migrate OR it can now migrate
+	 */
+	if (weight <= 1) {
+		if (!task_current(rq, p))
+			dequeue_pushable_dl_task(rq, p);
+		BUG_ON(!rq->dl.dl_nr_migratory);
+		rq->dl.dl_nr_migratory--;
+	} else {
+		if (!task_current(rq, p))
+			enqueue_pushable_dl_task(rq, p);
+		rq->dl.dl_nr_migratory++;
+	}
+
+	update_dl_migration(&rq->dl);
+}
+
+/* Assumes rq->lock is held */
+static void rq_online_dl(struct rq *rq)
+{
+	if (rq->dl.overloaded)
+		dl_set_overload(rq);
+}
+
+/* Assumes rq->lock is held */
+static void rq_offline_dl(struct rq *rq)
+{
+	if (rq->dl.overloaded)
+		dl_clear_overload(rq);
+}
+
+void init_sched_dl_class(void)
+{
+	unsigned int i;
+
+	for_each_possible_cpu(i)
+		zalloc_cpumask_var_node(&per_cpu(local_cpu_mask_dl, i),
+					GFP_KERNEL, cpu_to_node(i));
+}
+
+#endif /* CONFIG_SMP */
+
 static void switched_from_dl(struct rq *rq, struct task_struct *p)
 {
-	if (hrtimer_active(&p->dl.dl_timer))
+	if (hrtimer_active(&p->dl.dl_timer) && !dl_policy(p->policy))
 		hrtimer_try_to_cancel(&p->dl.dl_timer);
+
+#ifdef CONFIG_SMP
+	/*
+	 * Since this might be the only -deadline task on the rq,
+	 * this is the right place to try to pull some other one
+	 * from an overloaded cpu, if any.
+	 */
+	if (!rq->dl.dl_nr_running)
+		pull_dl_task(rq);
+#endif
 }
 
+/*
+ * When switching to -deadline, we may overload the rq, then
+ * we try to push someone off, if possible.
+ */
 static void switched_to_dl(struct rq *rq, struct task_struct *p)
 {
+	int check_resched = 1;
+
 	/*
 	 * If p is throttled, don't consider the possibility
 	 * of preempting rq->curr, the check will be done right
@@ -637,26 +1508,53 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
 		return;
 
 	if (p->on_rq || rq->curr != p) {
-		if (task_has_dl_policy(rq->curr))
+#ifdef CONFIG_SMP
+		if (rq->dl.overloaded && push_dl_task(rq) && rq != task_rq(p))
+			/* Only reschedule if pushing failed */
+			check_resched = 0;
+#endif /* CONFIG_SMP */
+		if (check_resched && task_has_dl_policy(rq->curr))
 			check_preempt_curr_dl(rq, p, 0);
-		else
-			resched_task(rq->curr);
 	}
 }
 
+/*
+ * If the scheduling parameters of a -deadline task changed,
+ * a push or pull operation might be needed.
+ */
 static void prio_changed_dl(struct rq *rq, struct task_struct *p,
 			    int oldprio)
 {
-	switched_to_dl(rq, p);
-}
-
+	if (p->on_rq || rq->curr == p) {
 #ifdef CONFIG_SMP
-static int
-select_task_rq_dl(struct task_struct *p, int prev_cpu, int sd_flag, int flags)
-{
-	return task_cpu(p);
+		/*
+		 * This might be too much, but unfortunately
+		 * we don't have the old deadline value, and
+		 * we can't argue if the task is increasing
+		 * or lowering its prio, so...
+		 */
+		if (!rq->dl.overloaded)
+			pull_dl_task(rq);
+
+		/*
+		 * If we now have a earlier deadline task than p,
+		 * then reschedule, provided p is still on this
+		 * runqueue.
+		 */
+		if (dl_time_before(rq->dl.earliest_dl.curr, p->dl.deadline) &&
+		    rq->curr == p)
+			resched_task(p);
+#else
+		/*
+		 * Again, we don't know if p has a earlier
+		 * or later deadline, so let's blindly set a
+		 * (maybe not needed) rescheduling point.
+		 */
+		resched_task(p);
+#endif /* CONFIG_SMP */
+	} else
+		switched_to_dl(rq, p);
 }
-#endif
 
 const struct sched_class dl_sched_class = {
 	.next			= &rt_sched_class,
@@ -671,6 +1569,12 @@ const struct sched_class dl_sched_class = {
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_dl,
+	.set_cpus_allowed       = set_cpus_allowed_dl,
+	.rq_online              = rq_online_dl,
+	.rq_offline             = rq_offline_dl,
+	.pre_schedule		= pre_schedule_dl,
+	.post_schedule		= post_schedule_dl,
+	.task_woken		= task_woken_dl,
 #endif
 
 	.set_curr_task		= set_curr_task_dl,
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 1c40655..a2740b7 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1738,7 +1738,7 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
 	    !test_tsk_need_resched(rq->curr) &&
 	    has_pushable_tasks(rq) &&
 	    p->nr_cpus_allowed > 1 &&
-	    rt_task(rq->curr) &&
+	    (dl_task(rq->curr) || rt_task(rq->curr)) &&
 	    (rq->curr->nr_cpus_allowed < 2 ||
 	     rq->curr->prio <= p->prio))
 		push_rt_tasks(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 83eb539..93ea627 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -385,6 +385,31 @@ struct dl_rq {
 	struct rb_node *rb_leftmost;
 
 	unsigned long dl_nr_running;
+
+#ifdef CONFIG_SMP
+	/*
+	 * Deadline values of the currently executing and the
+	 * earliest ready task on this rq. Caching these facilitates
+	 * the decision wether or not a ready but not running task
+	 * should migrate somewhere else.
+	 */
+	struct {
+		u64 curr;
+		u64 next;
+	} earliest_dl;
+
+	unsigned long dl_nr_migratory;
+	unsigned long dl_nr_total;
+	int overloaded;
+
+	/*
+	 * Tasks on this rq that can be pushed away. They are kept in
+	 * an rb-tree, ordered by tasks' deadlines, with caching
+	 * of the leftmost (earliest deadline) element.
+	 */
+	struct rb_root pushable_dl_tasks_root;
+	struct rb_node *pushable_dl_tasks_leftmost;
+#endif
 };
 
 #ifdef CONFIG_SMP
@@ -405,6 +430,13 @@ struct root_domain {
 	cpumask_var_t online;
 
 	/*
+	 * The bit corresponding to a CPU gets set here if such CPU has more
+	 * than one runnable -deadline task (as it is below for RT tasks).
+	 */
+	cpumask_var_t dlo_mask;
+	atomic_t dlo_count;
+
+	/*
 	 * The "RT overload" flag: it gets set if a CPU has more than
 	 * one runnable RT task.
 	 */
@@ -1095,6 +1127,8 @@ static inline void idle_balance(int cpu, struct rq *rq)
 extern void sysrq_sched_debug_show(void);
 extern void sched_init_granularity(void);
 extern void update_max_interval(void);
+
+extern void init_sched_dl_class(void);
 extern void init_sched_rt_class(void);
 extern void init_sched_fair_class(void);
 

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [tip:sched/core] sched/deadline: Add SCHED_DEADLINE avg_update accounting
  2013-11-07 13:43 ` [PATCH 05/14] sched: SCHED_DEADLINE avg_update accounting Juri Lelli
@ 2014-01-13 15:53   ` tip-bot for Dario Faggioli
  0 siblings, 0 replies; 81+ messages in thread
From: tip-bot for Dario Faggioli @ 2014-01-13 15:53 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, raistlin, tglx, juri.lelli

Commit-ID:  239be4a982154ea0c979fca5846349bb68973aed
Gitweb:     http://git.kernel.org/tip/239be4a982154ea0c979fca5846349bb68973aed
Author:     Dario Faggioli <raistlin@linux.it>
AuthorDate: Thu, 7 Nov 2013 14:43:39 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 13 Jan 2014 13:41:08 +0100

sched/deadline: Add SCHED_DEADLINE avg_update accounting

Make the core scheduler and load balancer aware of the load
produced by -deadline tasks, by updating the moving average
like for sched_rt.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-6-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/deadline.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index fcc02c9..21f58d2 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -564,6 +564,8 @@ static void update_curr_dl(struct rq *rq)
 	curr->se.exec_start = rq_clock_task(rq);
 	cpuacct_charge(curr, delta_exec);
 
+	sched_rt_avg_update(rq, delta_exec);
+
 	dl_se->runtime -= delta_exec;
 	if (dl_runtime_exceeded(rq, dl_se)) {
 		__dequeue_task_dl(rq, curr, 0);

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [tip:sched/core] sched/deadline: Add period support for SCHED_DEADLINE tasks
  2013-11-07 13:43 ` [PATCH 06/14] sched: add period support for -deadline tasks Juri Lelli
@ 2014-01-13 15:53   ` tip-bot for Harald Gustafsson
  0 siblings, 0 replies; 81+ messages in thread
From: tip-bot for Harald Gustafsson @ 2014-01-13 15:53 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, raistlin, harald.gustafsson,
	tglx, juri.lelli

Commit-ID:  755378a47192a3d1f7c3a8ca6c15c1cf76de0af2
Gitweb:     http://git.kernel.org/tip/755378a47192a3d1f7c3a8ca6c15c1cf76de0af2
Author:     Harald Gustafsson <harald.gustafsson@ericsson.com>
AuthorDate: Thu, 7 Nov 2013 14:43:40 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 13 Jan 2014 13:41:09 +0100

sched/deadline: Add period support for SCHED_DEADLINE tasks

Make it possible to specify a period (different or equal than
deadline) for -deadline tasks. Relative deadlines (D_i) are used on
task arrivals to generate new scheduling (absolute) deadlines as "d =
t + D_i", and periods (P_i) to postpone the scheduling deadlines as "d
= d + P_i" when the budget is zero.

This is in general useful to model (and schedule) tasks that have slow
activation rates (long periods), but have to be scheduled soon once
activated (short deadlines).

Signed-off-by: Harald Gustafsson <harald.gustafsson@ericsson.com>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-7-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h   |  1 +
 kernel/sched/core.c     | 10 ++++++++--
 kernel/sched/deadline.c | 10 +++++++---
 3 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index cc66f26..158f4c2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1102,6 +1102,7 @@ struct sched_dl_entity {
 	 */
 	u64 dl_runtime;		/* maximum runtime for each instance	*/
 	u64 dl_deadline;	/* relative deadline of each instance	*/
+	u64 dl_period;		/* separation of two instances (period) */
 
 	/*
 	 * Actual scheduling parameters. Initialized with the values above,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 548cc04..069230b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1723,6 +1723,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	hrtimer_init(&p->dl.dl_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	p->dl.dl_runtime = p->dl.runtime = 0;
 	p->dl.dl_deadline = p->dl.deadline = 0;
+	p->dl.dl_period = 0;
 	p->dl.flags = 0;
 
 	INIT_LIST_HEAD(&p->rt.run_list);
@@ -3026,6 +3027,7 @@ __setparam_dl(struct task_struct *p, const struct sched_attr *attr)
 	init_dl_task_timer(dl_se);
 	dl_se->dl_runtime = attr->sched_runtime;
 	dl_se->dl_deadline = attr->sched_deadline;
+	dl_se->dl_period = attr->sched_period ?: dl_se->dl_deadline;
 	dl_se->flags = attr->sched_flags;
 	dl_se->dl_throttled = 0;
 	dl_se->dl_new = 1;
@@ -3067,19 +3069,23 @@ __getparam_dl(struct task_struct *p, struct sched_attr *attr)
 	attr->sched_priority = p->rt_priority;
 	attr->sched_runtime = dl_se->dl_runtime;
 	attr->sched_deadline = dl_se->dl_deadline;
+	attr->sched_period = dl_se->dl_period;
 	attr->sched_flags = dl_se->flags;
 }
 
 /*
  * This function validates the new parameters of a -deadline task.
  * We ask for the deadline not being zero, and greater or equal
- * than the runtime.
+ * than the runtime, as well as the period of being zero or
+ * greater than deadline.
  */
 static bool
 __checkparam_dl(const struct sched_attr *attr)
 {
 	return attr && attr->sched_deadline != 0 &&
-	       (s64)(attr->sched_deadline - attr->sched_runtime) >= 0;
+		(attr->sched_period == 0 ||
+		(s64)(attr->sched_period   - attr->sched_deadline) >= 0) &&
+		(s64)(attr->sched_deadline - attr->sched_runtime ) >= 0;
 }
 
 /*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 21f58d2..3958bc5 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -289,7 +289,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 	 * arbitrary large.
 	 */
 	while (dl_se->runtime <= 0) {
-		dl_se->deadline += dl_se->dl_deadline;
+		dl_se->deadline += dl_se->dl_period;
 		dl_se->runtime += dl_se->dl_runtime;
 	}
 
@@ -329,9 +329,13 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
  *
  * This function returns true if:
  *
- *   runtime / (deadline - t) > dl_runtime / dl_deadline ,
+ *   runtime / (deadline - t) > dl_runtime / dl_period ,
  *
  * IOW we can't recycle current parameters.
+ *
+ * Notice that the bandwidth check is done against the period. For
+ * task with deadline equal to period this is the same of using
+ * dl_deadline instead of dl_period in the equation above.
  */
 static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
 {
@@ -355,7 +359,7 @@ static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
 	 * of anything below microseconds resolution is actually fiction
 	 * (but still we want to give the user that illusion >;).
 	 */
-	left = (dl_se->dl_deadline >> 10) * (dl_se->runtime >> 10);
+	left = (dl_se->dl_period >> 10) * (dl_se->runtime >> 10);
 	right = ((dl_se->deadline - t) >> 10) * (dl_se->dl_runtime >> 10);
 
 	return dl_time_before(right, left);

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [tip:sched/core] sched/deadline: Add latency tracing for SCHED_DEADLINE tasks
  2013-11-07 13:43 ` [PATCH 08/14] sched: add latency tracing " Juri Lelli
  2013-11-20 21:33   ` Steven Rostedt
@ 2014-01-13 15:54   ` tip-bot for Dario Faggioli
  1 sibling, 0 replies; 81+ messages in thread
From: tip-bot for Dario Faggioli @ 2014-01-13 15:54 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, raistlin, tglx, juri.lelli

Commit-ID:  af6ace764d03900524e9b1ac621a1c520ee49fc6
Gitweb:     http://git.kernel.org/tip/af6ace764d03900524e9b1ac621a1c520ee49fc6
Author:     Dario Faggioli <raistlin@linux.it>
AuthorDate: Thu, 7 Nov 2013 14:43:42 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 13 Jan 2014 13:41:11 +0100

sched/deadline: Add latency tracing for SCHED_DEADLINE tasks

It is very likely that systems that wants/needs to use the new
SCHED_DEADLINE policy also want to have the scheduling latency of
the -deadline tasks under control.

For this reason a new version of the scheduling wakeup latency,
called "wakeup_dl", is introduced.

As a consequence of applying this patch there will be three wakeup
latency tracer:

 * "wakeup", that deals with all tasks in the system;
 * "wakeup_rt", that deals with -rt and -deadline tasks only;
 * "wakeup_dl", that deals with -deadline tasks only.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-9-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/trace/trace_sched_wakeup.c | 64 ++++++++++++++++++++++++++++++++++++---
 kernel/trace/trace_selftest.c     | 33 +++++++++++---------
 2 files changed, 79 insertions(+), 18 deletions(-)

diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
index fee77e1..090c4d9 100644
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -27,6 +27,8 @@ static int			wakeup_cpu;
 static int			wakeup_current_cpu;
 static unsigned			wakeup_prio = -1;
 static int			wakeup_rt;
+static int			wakeup_dl;
+static int			tracing_dl = 0;
 
 static arch_spinlock_t wakeup_lock =
 	(arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
@@ -437,6 +439,7 @@ static void __wakeup_reset(struct trace_array *tr)
 {
 	wakeup_cpu = -1;
 	wakeup_prio = -1;
+	tracing_dl = 0;
 
 	if (wakeup_task)
 		put_task_struct(wakeup_task);
@@ -472,9 +475,17 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
 	tracing_record_cmdline(p);
 	tracing_record_cmdline(current);
 
-	if ((wakeup_rt && !rt_task(p)) ||
-			p->prio >= wakeup_prio ||
-			p->prio >= current->prio)
+	/*
+	 * Semantic is like this:
+	 *  - wakeup tracer handles all tasks in the system, independently
+	 *    from their scheduling class;
+	 *  - wakeup_rt tracer handles tasks belonging to sched_dl and
+	 *    sched_rt class;
+	 *  - wakeup_dl handles tasks belonging to sched_dl class only.
+	 */
+	if (tracing_dl || (wakeup_dl && !dl_task(p)) ||
+	    (wakeup_rt && !dl_task(p) && !rt_task(p)) ||
+	    (!dl_task(p) && (p->prio >= wakeup_prio || p->prio >= current->prio)))
 		return;
 
 	pc = preempt_count();
@@ -486,7 +497,8 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
 	arch_spin_lock(&wakeup_lock);
 
 	/* check for races. */
-	if (!tracer_enabled || p->prio >= wakeup_prio)
+	if (!tracer_enabled || tracing_dl ||
+	    (!dl_task(p) && p->prio >= wakeup_prio))
 		goto out_locked;
 
 	/* reset the trace */
@@ -496,6 +508,15 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
 	wakeup_current_cpu = wakeup_cpu;
 	wakeup_prio = p->prio;
 
+	/*
+	 * Once you start tracing a -deadline task, don't bother tracing
+	 * another task until the first one wakes up.
+	 */
+	if (dl_task(p))
+		tracing_dl = 1;
+	else
+		tracing_dl = 0;
+
 	wakeup_task = p;
 	get_task_struct(wakeup_task);
 
@@ -597,16 +618,25 @@ static int __wakeup_tracer_init(struct trace_array *tr)
 
 static int wakeup_tracer_init(struct trace_array *tr)
 {
+	wakeup_dl = 0;
 	wakeup_rt = 0;
 	return __wakeup_tracer_init(tr);
 }
 
 static int wakeup_rt_tracer_init(struct trace_array *tr)
 {
+	wakeup_dl = 0;
 	wakeup_rt = 1;
 	return __wakeup_tracer_init(tr);
 }
 
+static int wakeup_dl_tracer_init(struct trace_array *tr)
+{
+	wakeup_dl = 1;
+	wakeup_rt = 0;
+	return __wakeup_tracer_init(tr);
+}
+
 static void wakeup_tracer_reset(struct trace_array *tr)
 {
 	int lat_flag = save_flags & TRACE_ITER_LATENCY_FMT;
@@ -674,6 +704,28 @@ static struct tracer wakeup_rt_tracer __read_mostly =
 	.use_max_tr	= true,
 };
 
+static struct tracer wakeup_dl_tracer __read_mostly =
+{
+	.name		= "wakeup_dl",
+	.init		= wakeup_dl_tracer_init,
+	.reset		= wakeup_tracer_reset,
+	.start		= wakeup_tracer_start,
+	.stop		= wakeup_tracer_stop,
+	.wait_pipe	= poll_wait_pipe,
+	.print_max	= true,
+	.print_header	= wakeup_print_header,
+	.print_line	= wakeup_print_line,
+	.flags		= &tracer_flags,
+	.set_flag	= wakeup_set_flag,
+	.flag_changed	= wakeup_flag_changed,
+#ifdef CONFIG_FTRACE_SELFTEST
+	.selftest    = trace_selftest_startup_wakeup,
+#endif
+	.open		= wakeup_trace_open,
+	.close		= wakeup_trace_close,
+	.use_max_tr	= true,
+};
+
 __init static int init_wakeup_tracer(void)
 {
 	int ret;
@@ -686,6 +738,10 @@ __init static int init_wakeup_tracer(void)
 	if (ret)
 		return ret;
 
+	ret = register_tracer(&wakeup_dl_tracer);
+	if (ret)
+		return ret;
+
 	return 0;
 }
 core_initcall(init_wakeup_tracer);
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index a7329b7..e98fca6 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -1022,11 +1022,16 @@ trace_selftest_startup_nop(struct tracer *trace, struct trace_array *tr)
 #ifdef CONFIG_SCHED_TRACER
 static int trace_wakeup_test_thread(void *data)
 {
-	/* Make this a RT thread, doesn't need to be too high */
-	static const struct sched_param param = { .sched_priority = 5 };
+	/* Make this a -deadline thread */
+	static const struct sched_attr attr = {
+		.sched_policy = SCHED_DEADLINE,
+		.sched_runtime = 100000ULL,
+		.sched_deadline = 10000000ULL,
+		.sched_period = 10000000ULL
+	};
 	struct completion *x = data;
 
-	sched_setscheduler(current, SCHED_FIFO, &param);
+	sched_setattr(current, &attr);
 
 	/* Make it know we have a new prio */
 	complete(x);
@@ -1040,8 +1045,8 @@ static int trace_wakeup_test_thread(void *data)
 	/* we are awake, now wait to disappear */
 	while (!kthread_should_stop()) {
 		/*
-		 * This is an RT task, do short sleeps to let
-		 * others run.
+		 * This will likely be the system top priority
+		 * task, do short sleeps to let others run.
 		 */
 		msleep(100);
 	}
@@ -1054,21 +1059,21 @@ trace_selftest_startup_wakeup(struct tracer *trace, struct trace_array *tr)
 {
 	unsigned long save_max = tracing_max_latency;
 	struct task_struct *p;
-	struct completion isrt;
+	struct completion is_ready;
 	unsigned long count;
 	int ret;
 
-	init_completion(&isrt);
+	init_completion(&is_ready);
 
-	/* create a high prio thread */
-	p = kthread_run(trace_wakeup_test_thread, &isrt, "ftrace-test");
+	/* create a -deadline thread */
+	p = kthread_run(trace_wakeup_test_thread, &is_ready, "ftrace-test");
 	if (IS_ERR(p)) {
 		printk(KERN_CONT "Failed to create ftrace wakeup test thread ");
 		return -1;
 	}
 
-	/* make sure the thread is running at an RT prio */
-	wait_for_completion(&isrt);
+	/* make sure the thread is running at -deadline policy */
+	wait_for_completion(&is_ready);
 
 	/* start the tracing */
 	ret = tracer_init(trace, tr);
@@ -1082,19 +1087,19 @@ trace_selftest_startup_wakeup(struct tracer *trace, struct trace_array *tr)
 
 	while (p->on_rq) {
 		/*
-		 * Sleep to make sure the RT thread is asleep too.
+		 * Sleep to make sure the -deadline thread is asleep too.
 		 * On virtual machines we can't rely on timings,
 		 * but we want to make sure this test still works.
 		 */
 		msleep(100);
 	}
 
-	init_completion(&isrt);
+	init_completion(&is_ready);
 
 	wake_up_process(p);
 
 	/* Wait for the task to wake up */
-	wait_for_completion(&isrt);
+	wait_for_completion(&is_ready);
 
 	/* stop the tracing. */
 	tracing_stop();

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [tip:sched/core] rtmutex: Turn the plist into an rb-tree
  2013-11-07 13:43 ` [PATCH 09/14] rtmutex: turn the plist into an rb-tree Juri Lelli
  2013-11-21  3:07   ` Steven Rostedt
  2013-11-21 17:52   ` [PATCH] rtmutex: Fix compare of waiter prio and task prio Steven Rostedt
@ 2014-01-13 15:54   ` tip-bot for Peter Zijlstra
  2 siblings, 0 replies; 81+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-01-13 15:54 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, raistlin, tglx, juri.lelli

Commit-ID:  fb00aca474405f4fa8a8519c3179fed722eabd83
Gitweb:     http://git.kernel.org/tip/fb00aca474405f4fa8a8519c3179fed722eabd83
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 7 Nov 2013 14:43:43 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 13 Jan 2014 13:41:50 +0100

rtmutex: Turn the plist into an rb-tree

Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
and provide a proper comparison function for -deadline and
-priority tasks.

This is done mainly because:
 - classical prio field of the plist is just an int, which might
   not be enough for representing a deadline;
 - manipulating such a list would become O(nr_deadline_tasks),
   which might be to much, as the number of -deadline task increases.

Therefore, an rb-tree is used, and tasks are queued in it according
to the following logic:
 - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
   one with the higher (lower, actually!) prio wins;
 - among a -priority and a -deadline task, the latter always wins;
 - among two -deadline tasks, the one with the earliest deadline
   wins.

Queueing and dequeueing functions are changed accordingly, for both
the list of a task's pi-waiters and the list of tasks blocked on
a pi-lock.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-again-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-10-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/init_task.h       |  10 +++
 include/linux/rtmutex.h         |  18 ++---
 include/linux/sched.h           |   4 +-
 kernel/fork.c                   |   3 +-
 kernel/futex.c                  |   2 +
 kernel/locking/rtmutex-debug.c  |   8 +--
 kernel/locking/rtmutex.c        | 151 ++++++++++++++++++++++++++++++++--------
 kernel/locking/rtmutex_common.h |  22 +++---
 kernel/sched/core.c             |   4 --
 9 files changed, 157 insertions(+), 65 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index b0ed422..f0e5238 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -11,6 +11,7 @@
 #include <linux/user_namespace.h>
 #include <linux/securebits.h>
 #include <linux/seqlock.h>
+#include <linux/rbtree.h>
 #include <net/net_namespace.h>
 #include <linux/sched/rt.h>
 
@@ -154,6 +155,14 @@ extern struct task_group root_task_group;
 
 #define INIT_TASK_COMM "swapper"
 
+#ifdef CONFIG_RT_MUTEXES
+# define INIT_RT_MUTEXES(tsk)						\
+	.pi_waiters = RB_ROOT,						\
+	.pi_waiters_leftmost = NULL,
+#else
+# define INIT_RT_MUTEXES(tsk)
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -221,6 +230,7 @@ extern struct task_group root_task_group;
 	INIT_TRACE_RECURSION						\
 	INIT_TASK_RCU_PREEMPT(tsk)					\
 	INIT_CPUSET_SEQ(tsk)						\
+	INIT_RT_MUTEXES(tsk)						\
 	INIT_VTIME(tsk)							\
 }
 
diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index de17134..3aed8d7 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -13,7 +13,7 @@
 #define __LINUX_RT_MUTEX_H
 
 #include <linux/linkage.h>
-#include <linux/plist.h>
+#include <linux/rbtree.h>
 #include <linux/spinlock_types.h>
 
 extern int max_lock_depth; /* for sysctl */
@@ -22,12 +22,14 @@ extern int max_lock_depth; /* for sysctl */
  * The rt_mutex structure
  *
  * @wait_lock:	spinlock to protect the structure
- * @wait_list:	pilist head to enqueue waiters in priority order
+ * @waiters:	rbtree root to enqueue waiters in priority order
+ * @waiters_leftmost: top waiter
  * @owner:	the mutex owner
  */
 struct rt_mutex {
 	raw_spinlock_t		wait_lock;
-	struct plist_head	wait_list;
+	struct rb_root          waiters;
+	struct rb_node          *waiters_leftmost;
 	struct task_struct	*owner;
 #ifdef CONFIG_DEBUG_RT_MUTEXES
 	int			save_state;
@@ -66,7 +68,7 @@ struct hrtimer_sleeper;
 
 #define __RT_MUTEX_INITIALIZER(mutexname) \
 	{ .wait_lock = __RAW_SPIN_LOCK_UNLOCKED(mutexname.wait_lock) \
-	, .wait_list = PLIST_HEAD_INIT(mutexname.wait_list) \
+	, .waiters = RB_ROOT \
 	, .owner = NULL \
 	__DEBUG_RT_MUTEX_INITIALIZER(mutexname)}
 
@@ -98,12 +100,4 @@ extern int rt_mutex_trylock(struct rt_mutex *lock);
 
 extern void rt_mutex_unlock(struct rt_mutex *lock);
 
-#ifdef CONFIG_RT_MUTEXES
-# define INIT_RT_MUTEXES(tsk)						\
-	.pi_waiters	= PLIST_HEAD_INIT(tsk.pi_waiters),	\
-	INIT_RT_MUTEX_DEBUG(tsk)
-#else
-# define INIT_RT_MUTEXES(tsk)
-#endif
-
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 158f4c2..9ea1501 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -16,6 +16,7 @@ struct sched_param {
 #include <linux/types.h>
 #include <linux/timex.h>
 #include <linux/jiffies.h>
+#include <linux/plist.h>
 #include <linux/rbtree.h>
 #include <linux/thread_info.h>
 #include <linux/cpumask.h>
@@ -1354,7 +1355,8 @@ struct task_struct {
 
 #ifdef CONFIG_RT_MUTEXES
 	/* PI waiters blocked on a rt_mutex held by this task */
-	struct plist_head pi_waiters;
+	struct rb_root pi_waiters;
+	struct rb_node *pi_waiters_leftmost;
 	/* Deadlock detection and priority inheritance handling */
 	struct rt_mutex_waiter *pi_blocked_on;
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index e6c0f1a..7049ae5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1087,7 +1087,8 @@ static void rt_mutex_init_task(struct task_struct *p)
 {
 	raw_spin_lock_init(&p->pi_lock);
 #ifdef CONFIG_RT_MUTEXES
-	plist_head_init(&p->pi_waiters);
+	p->pi_waiters = RB_ROOT;
+	p->pi_waiters_leftmost = NULL;
 	p->pi_blocked_on = NULL;
 #endif
 }
diff --git a/kernel/futex.c b/kernel/futex.c
index f6ff019..679531c 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2316,6 +2316,8 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 	 * code while we sleep on uaddr.
 	 */
 	debug_rt_mutex_init_waiter(&rt_waiter);
+	RB_CLEAR_NODE(&rt_waiter.pi_tree_entry);
+	RB_CLEAR_NODE(&rt_waiter.tree_entry);
 	rt_waiter.task = NULL;
 
 	ret = get_futex_key(uaddr2, flags & FLAGS_SHARED, &key2, VERIFY_WRITE);
diff --git a/kernel/locking/rtmutex-debug.c b/kernel/locking/rtmutex-debug.c
index 13b243a..49b2ed3 100644
--- a/kernel/locking/rtmutex-debug.c
+++ b/kernel/locking/rtmutex-debug.c
@@ -24,7 +24,7 @@
 #include <linux/kallsyms.h>
 #include <linux/syscalls.h>
 #include <linux/interrupt.h>
-#include <linux/plist.h>
+#include <linux/rbtree.h>
 #include <linux/fs.h>
 #include <linux/debug_locks.h>
 
@@ -57,7 +57,7 @@ static void printk_lock(struct rt_mutex *lock, int print_owner)
 
 void rt_mutex_debug_task_free(struct task_struct *task)
 {
-	DEBUG_LOCKS_WARN_ON(!plist_head_empty(&task->pi_waiters));
+	DEBUG_LOCKS_WARN_ON(!RB_EMPTY_ROOT(&task->pi_waiters));
 	DEBUG_LOCKS_WARN_ON(task->pi_blocked_on);
 }
 
@@ -154,16 +154,12 @@ void debug_rt_mutex_proxy_unlock(struct rt_mutex *lock)
 void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter)
 {
 	memset(waiter, 0x11, sizeof(*waiter));
-	plist_node_init(&waiter->list_entry, MAX_PRIO);
-	plist_node_init(&waiter->pi_list_entry, MAX_PRIO);
 	waiter->deadlock_task_pid = NULL;
 }
 
 void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter)
 {
 	put_pid(waiter->deadlock_task_pid);
-	DEBUG_LOCKS_WARN_ON(!plist_node_empty(&waiter->list_entry));
-	DEBUG_LOCKS_WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
 	memset(waiter, 0x22, sizeof(*waiter));
 }
 
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 0dd6aec..3bf0aa6 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -14,6 +14,7 @@
 #include <linux/export.h>
 #include <linux/sched.h>
 #include <linux/sched/rt.h>
+#include <linux/sched/deadline.h>
 #include <linux/timer.h>
 
 #include "rtmutex_common.h"
@@ -91,10 +92,104 @@ static inline void mark_rt_mutex_waiters(struct rt_mutex *lock)
 }
 #endif
 
+static inline int
+rt_mutex_waiter_less(struct rt_mutex_waiter *left,
+		     struct rt_mutex_waiter *right)
+{
+	if (left->task->prio < right->task->prio)
+		return 1;
+
+	/*
+	 * If both tasks are dl_task(), we check their deadlines.
+	 */
+	if (dl_prio(left->task->prio) && dl_prio(right->task->prio))
+		return (left->task->dl.deadline < right->task->dl.deadline);
+
+	return 0;
+}
+
+static void
+rt_mutex_enqueue(struct rt_mutex *lock, struct rt_mutex_waiter *waiter)
+{
+	struct rb_node **link = &lock->waiters.rb_node;
+	struct rb_node *parent = NULL;
+	struct rt_mutex_waiter *entry;
+	int leftmost = 1;
+
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct rt_mutex_waiter, tree_entry);
+		if (rt_mutex_waiter_less(waiter, entry)) {
+			link = &parent->rb_left;
+		} else {
+			link = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	if (leftmost)
+		lock->waiters_leftmost = &waiter->tree_entry;
+
+	rb_link_node(&waiter->tree_entry, parent, link);
+	rb_insert_color(&waiter->tree_entry, &lock->waiters);
+}
+
+static void
+rt_mutex_dequeue(struct rt_mutex *lock, struct rt_mutex_waiter *waiter)
+{
+	if (RB_EMPTY_NODE(&waiter->tree_entry))
+		return;
+
+	if (lock->waiters_leftmost == &waiter->tree_entry)
+		lock->waiters_leftmost = rb_next(&waiter->tree_entry);
+
+	rb_erase(&waiter->tree_entry, &lock->waiters);
+	RB_CLEAR_NODE(&waiter->tree_entry);
+}
+
+static void
+rt_mutex_enqueue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
+{
+	struct rb_node **link = &task->pi_waiters.rb_node;
+	struct rb_node *parent = NULL;
+	struct rt_mutex_waiter *entry;
+	int leftmost = 1;
+
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct rt_mutex_waiter, pi_tree_entry);
+		if (rt_mutex_waiter_less(waiter, entry)) {
+			link = &parent->rb_left;
+		} else {
+			link = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	if (leftmost)
+		task->pi_waiters_leftmost = &waiter->pi_tree_entry;
+
+	rb_link_node(&waiter->pi_tree_entry, parent, link);
+	rb_insert_color(&waiter->pi_tree_entry, &task->pi_waiters);
+}
+
+static void
+rt_mutex_dequeue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
+{
+	if (RB_EMPTY_NODE(&waiter->pi_tree_entry))
+		return;
+
+	if (task->pi_waiters_leftmost == &waiter->pi_tree_entry)
+		task->pi_waiters_leftmost = rb_next(&waiter->pi_tree_entry);
+
+	rb_erase(&waiter->pi_tree_entry, &task->pi_waiters);
+	RB_CLEAR_NODE(&waiter->pi_tree_entry);
+}
+
 /*
- * Calculate task priority from the waiter list priority
+ * Calculate task priority from the waiter tree priority
  *
- * Return task->normal_prio when the waiter list is empty or when
+ * Return task->normal_prio when the waiter tree is empty or when
  * the waiter is not allowed to do priority boosting
  */
 int rt_mutex_getprio(struct task_struct *task)
@@ -102,7 +197,7 @@ int rt_mutex_getprio(struct task_struct *task)
 	if (likely(!task_has_pi_waiters(task)))
 		return task->normal_prio;
 
-	return min(task_top_pi_waiter(task)->pi_list_entry.prio,
+	return min(task_top_pi_waiter(task)->task->prio,
 		   task->normal_prio);
 }
 
@@ -233,7 +328,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 	 * When deadlock detection is off then we check, if further
 	 * priority adjustment is necessary.
 	 */
-	if (!detect_deadlock && waiter->list_entry.prio == task->prio)
+	if (!detect_deadlock && waiter->task->prio == task->prio)
 		goto out_unlock_pi;
 
 	lock = waiter->lock;
@@ -254,9 +349,9 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 	top_waiter = rt_mutex_top_waiter(lock);
 
 	/* Requeue the waiter */
-	plist_del(&waiter->list_entry, &lock->wait_list);
-	waiter->list_entry.prio = task->prio;
-	plist_add(&waiter->list_entry, &lock->wait_list);
+	rt_mutex_dequeue(lock, waiter);
+	waiter->task->prio = task->prio;
+	rt_mutex_enqueue(lock, waiter);
 
 	/* Release the task */
 	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
@@ -280,17 +375,15 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 
 	if (waiter == rt_mutex_top_waiter(lock)) {
 		/* Boost the owner */
-		plist_del(&top_waiter->pi_list_entry, &task->pi_waiters);
-		waiter->pi_list_entry.prio = waiter->list_entry.prio;
-		plist_add(&waiter->pi_list_entry, &task->pi_waiters);
+		rt_mutex_dequeue_pi(task, top_waiter);
+		rt_mutex_enqueue_pi(task, waiter);
 		__rt_mutex_adjust_prio(task);
 
 	} else if (top_waiter == waiter) {
 		/* Deboost the owner */
-		plist_del(&waiter->pi_list_entry, &task->pi_waiters);
+		rt_mutex_dequeue_pi(task, waiter);
 		waiter = rt_mutex_top_waiter(lock);
-		waiter->pi_list_entry.prio = waiter->list_entry.prio;
-		plist_add(&waiter->pi_list_entry, &task->pi_waiters);
+		rt_mutex_enqueue_pi(task, waiter);
 		__rt_mutex_adjust_prio(task);
 	}
 
@@ -355,7 +448,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 	 * 3) it is top waiter
 	 */
 	if (rt_mutex_has_waiters(lock)) {
-		if (task->prio >= rt_mutex_top_waiter(lock)->list_entry.prio) {
+		if (task->prio >= rt_mutex_top_waiter(lock)->task->prio) {
 			if (!waiter || waiter != rt_mutex_top_waiter(lock))
 				return 0;
 		}
@@ -369,7 +462,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 
 		/* remove the queued waiter. */
 		if (waiter) {
-			plist_del(&waiter->list_entry, &lock->wait_list);
+			rt_mutex_dequeue(lock, waiter);
 			task->pi_blocked_on = NULL;
 		}
 
@@ -379,8 +472,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 		 */
 		if (rt_mutex_has_waiters(lock)) {
 			top = rt_mutex_top_waiter(lock);
-			top->pi_list_entry.prio = top->list_entry.prio;
-			plist_add(&top->pi_list_entry, &task->pi_waiters);
+			rt_mutex_enqueue_pi(task, top);
 		}
 		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 	}
@@ -416,13 +508,11 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 	__rt_mutex_adjust_prio(task);
 	waiter->task = task;
 	waiter->lock = lock;
-	plist_node_init(&waiter->list_entry, task->prio);
-	plist_node_init(&waiter->pi_list_entry, task->prio);
 
 	/* Get the top priority waiter on the lock */
 	if (rt_mutex_has_waiters(lock))
 		top_waiter = rt_mutex_top_waiter(lock);
-	plist_add(&waiter->list_entry, &lock->wait_list);
+	rt_mutex_enqueue(lock, waiter);
 
 	task->pi_blocked_on = waiter;
 
@@ -433,8 +523,8 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 
 	if (waiter == rt_mutex_top_waiter(lock)) {
 		raw_spin_lock_irqsave(&owner->pi_lock, flags);
-		plist_del(&top_waiter->pi_list_entry, &owner->pi_waiters);
-		plist_add(&waiter->pi_list_entry, &owner->pi_waiters);
+		rt_mutex_dequeue_pi(owner, top_waiter);
+		rt_mutex_enqueue_pi(owner, waiter);
 
 		__rt_mutex_adjust_prio(owner);
 		if (owner->pi_blocked_on)
@@ -486,7 +576,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock)
 	 * boosted mode and go back to normal after releasing
 	 * lock->wait_lock.
 	 */
-	plist_del(&waiter->pi_list_entry, &current->pi_waiters);
+	rt_mutex_dequeue_pi(current, waiter);
 
 	rt_mutex_set_owner(lock, NULL);
 
@@ -510,7 +600,7 @@ static void remove_waiter(struct rt_mutex *lock,
 	int chain_walk = 0;
 
 	raw_spin_lock_irqsave(&current->pi_lock, flags);
-	plist_del(&waiter->list_entry, &lock->wait_list);
+	rt_mutex_dequeue(lock, waiter);
 	current->pi_blocked_on = NULL;
 	raw_spin_unlock_irqrestore(&current->pi_lock, flags);
 
@@ -521,13 +611,13 @@ static void remove_waiter(struct rt_mutex *lock,
 
 		raw_spin_lock_irqsave(&owner->pi_lock, flags);
 
-		plist_del(&waiter->pi_list_entry, &owner->pi_waiters);
+		rt_mutex_dequeue_pi(owner, waiter);
 
 		if (rt_mutex_has_waiters(lock)) {
 			struct rt_mutex_waiter *next;
 
 			next = rt_mutex_top_waiter(lock);
-			plist_add(&next->pi_list_entry, &owner->pi_waiters);
+			rt_mutex_enqueue_pi(owner, next);
 		}
 		__rt_mutex_adjust_prio(owner);
 
@@ -537,8 +627,6 @@ static void remove_waiter(struct rt_mutex *lock,
 		raw_spin_unlock_irqrestore(&owner->pi_lock, flags);
 	}
 
-	WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
-
 	if (!chain_walk)
 		return;
 
@@ -565,7 +653,7 @@ void rt_mutex_adjust_pi(struct task_struct *task)
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
 
 	waiter = task->pi_blocked_on;
-	if (!waiter || waiter->list_entry.prio == task->prio) {
+	if (!waiter || waiter->task->prio == task->prio) {
 		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 		return;
 	}
@@ -638,6 +726,8 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
 	int ret = 0;
 
 	debug_rt_mutex_init_waiter(&waiter);
+	RB_CLEAR_NODE(&waiter.pi_tree_entry);
+	RB_CLEAR_NODE(&waiter.tree_entry);
 
 	raw_spin_lock(&lock->wait_lock);
 
@@ -904,7 +994,8 @@ void __rt_mutex_init(struct rt_mutex *lock, const char *name)
 {
 	lock->owner = NULL;
 	raw_spin_lock_init(&lock->wait_lock);
-	plist_head_init(&lock->wait_list);
+	lock->waiters = RB_ROOT;
+	lock->waiters_leftmost = NULL;
 
 	debug_rt_mutex_init(lock, name);
 }
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 53a66c8..b65442f 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -40,13 +40,13 @@ extern void schedule_rt_mutex_test(struct rt_mutex *lock);
  * This is the control structure for tasks blocked on a rt_mutex,
  * which is allocated on the kernel stack on of the blocked task.
  *
- * @list_entry:		pi node to enqueue into the mutex waiters list
- * @pi_list_entry:	pi node to enqueue into the mutex owner waiters list
+ * @tree_entry:		pi node to enqueue into the mutex waiters tree
+ * @pi_tree_entry:	pi node to enqueue into the mutex owner waiters tree
  * @task:		task reference to the blocked task
  */
 struct rt_mutex_waiter {
-	struct plist_node	list_entry;
-	struct plist_node	pi_list_entry;
+	struct rb_node          tree_entry;
+	struct rb_node          pi_tree_entry;
 	struct task_struct	*task;
 	struct rt_mutex		*lock;
 #ifdef CONFIG_DEBUG_RT_MUTEXES
@@ -57,11 +57,11 @@ struct rt_mutex_waiter {
 };
 
 /*
- * Various helpers to access the waiters-plist:
+ * Various helpers to access the waiters-tree:
  */
 static inline int rt_mutex_has_waiters(struct rt_mutex *lock)
 {
-	return !plist_head_empty(&lock->wait_list);
+	return !RB_EMPTY_ROOT(&lock->waiters);
 }
 
 static inline struct rt_mutex_waiter *
@@ -69,8 +69,8 @@ rt_mutex_top_waiter(struct rt_mutex *lock)
 {
 	struct rt_mutex_waiter *w;
 
-	w = plist_first_entry(&lock->wait_list, struct rt_mutex_waiter,
-			       list_entry);
+	w = rb_entry(lock->waiters_leftmost, struct rt_mutex_waiter,
+		     tree_entry);
 	BUG_ON(w->lock != lock);
 
 	return w;
@@ -78,14 +78,14 @@ rt_mutex_top_waiter(struct rt_mutex *lock)
 
 static inline int task_has_pi_waiters(struct task_struct *p)
 {
-	return !plist_head_empty(&p->pi_waiters);
+	return !RB_EMPTY_ROOT(&p->pi_waiters);
 }
 
 static inline struct rt_mutex_waiter *
 task_top_pi_waiter(struct task_struct *p)
 {
-	return plist_first_entry(&p->pi_waiters, struct rt_mutex_waiter,
-				  pi_list_entry);
+	return rb_entry(p->pi_waiters_leftmost, struct rt_mutex_waiter,
+			pi_tree_entry);
 }
 
 /*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 069230b..aebcc70 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6635,10 +6635,6 @@ void __init sched_init(void)
 	INIT_HLIST_HEAD(&init_task.preempt_notifiers);
 #endif
 
-#ifdef CONFIG_RT_MUTEXES
-	plist_head_init(&init_task.pi_waiters);
-#endif
-
 	/*
 	 * The boot idle thread does lazy MMU switching as well:
 	 */

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [tip:sched/core] sched/deadline: Add SCHED_DEADLINE inheritance logic
  2013-11-07 13:43 ` [PATCH 10/14] sched: drafted deadline inheritance logic Juri Lelli
@ 2014-01-13 15:54   ` tip-bot for Dario Faggioli
  0 siblings, 0 replies; 81+ messages in thread
From: tip-bot for Dario Faggioli @ 2014-01-13 15:54 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, raistlin, tglx, juri.lelli

Commit-ID:  2d3d891d3344159d5b452a645e355bbe29591e8b
Gitweb:     http://git.kernel.org/tip/2d3d891d3344159d5b452a645e355bbe29591e8b
Author:     Dario Faggioli <raistlin@linux.it>
AuthorDate: Thu, 7 Nov 2013 14:43:44 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 13 Jan 2014 13:42:56 +0100

sched/deadline: Add SCHED_DEADLINE inheritance logic

Some method to deal with rt-mutexes and make sched_dl interact with
the current PI-coded is needed, raising all but trivial issues, that
needs (according to us) to be solved with some restructuring of
the pi-code (i.e., going toward a proxy execution-ish implementation).

This is under development, in the meanwhile, as a temporary solution,
what this commits does is:

 - ensure a pi-lock owner with waiters is never throttled down. Instead,
   when it runs out of runtime, it immediately gets replenished and it's
   deadline is postponed;

 - the scheduling parameters (relative deadline and default runtime)
   used for that replenishments --during the whole period it holds the
   pi-lock-- are the ones of the waiting task with earliest deadline.

Acting this way, we provide some kind of boosting to the lock-owner,
still by using the existing (actually, slightly modified by the previous
commit) pi-architecture.

We would stress the fact that this is only a surely needed, all but
clean solution to the problem. In the end it's only a way to re-start
discussion within the community. So, as always, comments, ideas, rants,
etc.. are welcome! :-)

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
[ Added !RT_MUTEXES build fix. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-11-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h             |  8 +++-
 include/linux/sched/rt.h          |  5 +++
 kernel/fork.c                     |  1 +
 kernel/locking/rtmutex.c          | 31 +++++++++----
 kernel/locking/rtmutex_common.h   |  1 +
 kernel/sched/core.c               | 36 +++++++++++++---
 kernel/sched/deadline.c           | 91 +++++++++++++++++++++++----------------
 kernel/sched/sched.h              | 14 ++++++
 kernel/trace/trace_sched_wakeup.c |  1 +
 9 files changed, 134 insertions(+), 54 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9ea1501..13c53a9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1124,8 +1124,12 @@ struct sched_dl_entity {
 	 * @dl_new tells if a new instance arrived. If so we must
 	 * start executing it with full runtime and reset its absolute
 	 * deadline;
+	 *
+	 * @dl_boosted tells if we are boosted due to DI. If so we are
+	 * outside bandwidth enforcement mechanism (but only until we
+	 * exit the critical section).
 	 */
-	int dl_throttled, dl_new;
+	int dl_throttled, dl_new, dl_boosted;
 
 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
@@ -1359,6 +1363,8 @@ struct task_struct {
 	struct rb_node *pi_waiters_leftmost;
 	/* Deadlock detection and priority inheritance handling */
 	struct rt_mutex_waiter *pi_blocked_on;
+	/* Top pi_waiters task */
+	struct task_struct *pi_top_task;
 #endif
 
 #ifdef CONFIG_DEBUG_MUTEXES
diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
index 440434d..34e4ebe 100644
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -35,6 +35,7 @@ static inline int rt_task(struct task_struct *p)
 #ifdef CONFIG_RT_MUTEXES
 extern int rt_mutex_getprio(struct task_struct *p);
 extern void rt_mutex_setprio(struct task_struct *p, int prio);
+extern struct task_struct *rt_mutex_get_top_task(struct task_struct *task);
 extern void rt_mutex_adjust_pi(struct task_struct *p);
 static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
 {
@@ -45,6 +46,10 @@ static inline int rt_mutex_getprio(struct task_struct *p)
 {
 	return p->normal_prio;
 }
+static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
+{
+	return NULL;
+}
 # define rt_mutex_adjust_pi(p)		do { } while (0)
 static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
 {
diff --git a/kernel/fork.c b/kernel/fork.c
index 7049ae5..01b450a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1090,6 +1090,7 @@ static void rt_mutex_init_task(struct task_struct *p)
 	p->pi_waiters = RB_ROOT;
 	p->pi_waiters_leftmost = NULL;
 	p->pi_blocked_on = NULL;
+	p->pi_top_task = NULL;
 #endif
 }
 
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 3bf0aa6..2e960a2 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -96,13 +96,16 @@ static inline int
 rt_mutex_waiter_less(struct rt_mutex_waiter *left,
 		     struct rt_mutex_waiter *right)
 {
-	if (left->task->prio < right->task->prio)
+	if (left->prio < right->prio)
 		return 1;
 
 	/*
-	 * If both tasks are dl_task(), we check their deadlines.
+	 * If both waiters have dl_prio(), we check the deadlines of the
+	 * associated tasks.
+	 * If left waiter has a dl_prio(), and we didn't return 1 above,
+	 * then right waiter has a dl_prio() too.
 	 */
-	if (dl_prio(left->task->prio) && dl_prio(right->task->prio))
+	if (dl_prio(left->prio))
 		return (left->task->dl.deadline < right->task->dl.deadline);
 
 	return 0;
@@ -197,10 +200,18 @@ int rt_mutex_getprio(struct task_struct *task)
 	if (likely(!task_has_pi_waiters(task)))
 		return task->normal_prio;
 
-	return min(task_top_pi_waiter(task)->task->prio,
+	return min(task_top_pi_waiter(task)->prio,
 		   task->normal_prio);
 }
 
+struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
+{
+	if (likely(!task_has_pi_waiters(task)))
+		return NULL;
+
+	return task_top_pi_waiter(task)->task;
+}
+
 /*
  * Adjust the priority of a task, after its pi_waiters got modified.
  *
@@ -210,7 +221,7 @@ static void __rt_mutex_adjust_prio(struct task_struct *task)
 {
 	int prio = rt_mutex_getprio(task);
 
-	if (task->prio != prio)
+	if (task->prio != prio || dl_prio(prio))
 		rt_mutex_setprio(task, prio);
 }
 
@@ -328,7 +339,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 	 * When deadlock detection is off then we check, if further
 	 * priority adjustment is necessary.
 	 */
-	if (!detect_deadlock && waiter->task->prio == task->prio)
+	if (!detect_deadlock && waiter->prio == task->prio)
 		goto out_unlock_pi;
 
 	lock = waiter->lock;
@@ -350,7 +361,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 
 	/* Requeue the waiter */
 	rt_mutex_dequeue(lock, waiter);
-	waiter->task->prio = task->prio;
+	waiter->prio = task->prio;
 	rt_mutex_enqueue(lock, waiter);
 
 	/* Release the task */
@@ -448,7 +459,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 	 * 3) it is top waiter
 	 */
 	if (rt_mutex_has_waiters(lock)) {
-		if (task->prio >= rt_mutex_top_waiter(lock)->task->prio) {
+		if (task->prio >= rt_mutex_top_waiter(lock)->prio) {
 			if (!waiter || waiter != rt_mutex_top_waiter(lock))
 				return 0;
 		}
@@ -508,6 +519,7 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 	__rt_mutex_adjust_prio(task);
 	waiter->task = task;
 	waiter->lock = lock;
+	waiter->prio = task->prio;
 
 	/* Get the top priority waiter on the lock */
 	if (rt_mutex_has_waiters(lock))
@@ -653,7 +665,8 @@ void rt_mutex_adjust_pi(struct task_struct *task)
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
 
 	waiter = task->pi_blocked_on;
-	if (!waiter || waiter->task->prio == task->prio) {
+	if (!waiter || (waiter->prio == task->prio &&
+			!dl_prio(task->prio))) {
 		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
 		return;
 	}
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index b65442f..7431a9c 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -54,6 +54,7 @@ struct rt_mutex_waiter {
 	struct pid		*deadlock_task_pid;
 	struct rt_mutex		*deadlock_lock;
 #endif
+	int prio;
 };
 
 /*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index aebcc70..599ee3b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -947,7 +947,7 @@ static inline void check_class_changed(struct rq *rq, struct task_struct *p,
 		if (prev_class->switched_from)
 			prev_class->switched_from(rq, p);
 		p->sched_class->switched_to(rq, p);
-	} else if (oldprio != p->prio)
+	} else if (oldprio != p->prio || dl_task(p))
 		p->sched_class->prio_changed(rq, p, oldprio);
 }
 
@@ -2781,7 +2781,7 @@ EXPORT_SYMBOL(sleep_on_timeout);
  */
 void rt_mutex_setprio(struct task_struct *p, int prio)
 {
-	int oldprio, on_rq, running;
+	int oldprio, on_rq, running, enqueue_flag = 0;
 	struct rq *rq;
 	const struct sched_class *prev_class;
 
@@ -2808,6 +2808,7 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 	}
 
 	trace_sched_pi_setprio(p, prio);
+	p->pi_top_task = rt_mutex_get_top_task(p);
 	oldprio = p->prio;
 	prev_class = p->sched_class;
 	on_rq = p->on_rq;
@@ -2817,19 +2818,42 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 	if (running)
 		p->sched_class->put_prev_task(rq, p);
 
-	if (dl_prio(prio))
+	/*
+	 * Boosting condition are:
+	 * 1. -rt task is running and holds mutex A
+	 *      --> -dl task blocks on mutex A
+	 *
+	 * 2. -dl task is running and holds mutex A
+	 *      --> -dl task blocks on mutex A and could preempt the
+	 *          running task
+	 */
+	if (dl_prio(prio)) {
+		if (!dl_prio(p->normal_prio) || (p->pi_top_task &&
+			dl_entity_preempt(&p->pi_top_task->dl, &p->dl))) {
+			p->dl.dl_boosted = 1;
+			p->dl.dl_throttled = 0;
+			enqueue_flag = ENQUEUE_REPLENISH;
+		} else
+			p->dl.dl_boosted = 0;
 		p->sched_class = &dl_sched_class;
-	else if (rt_prio(prio))
+	} else if (rt_prio(prio)) {
+		if (dl_prio(oldprio))
+			p->dl.dl_boosted = 0;
+		if (oldprio < prio)
+			enqueue_flag = ENQUEUE_HEAD;
 		p->sched_class = &rt_sched_class;
-	else
+	} else {
+		if (dl_prio(oldprio))
+			p->dl.dl_boosted = 0;
 		p->sched_class = &fair_sched_class;
+	}
 
 	p->prio = prio;
 
 	if (running)
 		p->sched_class->set_curr_task(rq);
 	if (on_rq)
-		enqueue_task(rq, p, oldprio < prio ? ENQUEUE_HEAD : 0);
+		enqueue_task(rq, p, enqueue_flag);
 
 	check_class_changed(rq, p, prev_class, oldprio);
 out_unlock:
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 3958bc5..7f6de43 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -16,20 +16,6 @@
  */
 #include "sched.h"
 
-static inline int dl_time_before(u64 a, u64 b)
-{
-	return (s64)(a - b) < 0;
-}
-
-/*
- * Tells if entity @a should preempt entity @b.
- */
-static inline
-int dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
-{
-	return dl_time_before(a->deadline, b->deadline);
-}
-
 static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
 {
 	return container_of(dl_se, struct task_struct, dl);
@@ -242,7 +228,8 @@ static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
  * one, and to (try to!) reconcile itself with its own scheduling
  * parameters.
  */
-static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
+static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se,
+				       struct sched_dl_entity *pi_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
@@ -254,8 +241,8 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
 	 * future; in fact, we must consider execution overheads (time
 	 * spent on hardirq context, etc.).
 	 */
-	dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
-	dl_se->runtime = dl_se->dl_runtime;
+	dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
+	dl_se->runtime = pi_se->dl_runtime;
 	dl_se->dl_new = 0;
 }
 
@@ -277,11 +264,23 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
  * could happen are, typically, a entity voluntarily trying to overcome its
  * runtime, or it just underestimated it during sched_setscheduler_ex().
  */
-static void replenish_dl_entity(struct sched_dl_entity *dl_se)
+static void replenish_dl_entity(struct sched_dl_entity *dl_se,
+				struct sched_dl_entity *pi_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
 
+	BUG_ON(pi_se->dl_runtime <= 0);
+
+	/*
+	 * This could be the case for a !-dl task that is boosted.
+	 * Just go with full inherited parameters.
+	 */
+	if (dl_se->dl_deadline == 0) {
+		dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
+		dl_se->runtime = pi_se->dl_runtime;
+	}
+
 	/*
 	 * We keep moving the deadline away until we get some
 	 * available runtime for the entity. This ensures correct
@@ -289,8 +288,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 	 * arbitrary large.
 	 */
 	while (dl_se->runtime <= 0) {
-		dl_se->deadline += dl_se->dl_period;
-		dl_se->runtime += dl_se->dl_runtime;
+		dl_se->deadline += pi_se->dl_period;
+		dl_se->runtime += pi_se->dl_runtime;
 	}
 
 	/*
@@ -309,8 +308,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 			lag_once = true;
 			printk_sched("sched: DL replenish lagged to much\n");
 		}
-		dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
-		dl_se->runtime = dl_se->dl_runtime;
+		dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
+		dl_se->runtime = pi_se->dl_runtime;
 	}
 }
 
@@ -337,7 +336,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
  * task with deadline equal to period this is the same of using
  * dl_deadline instead of dl_period in the equation above.
  */
-static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
+static bool dl_entity_overflow(struct sched_dl_entity *dl_se,
+			       struct sched_dl_entity *pi_se, u64 t)
 {
 	u64 left, right;
 
@@ -359,8 +359,8 @@ static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
 	 * of anything below microseconds resolution is actually fiction
 	 * (but still we want to give the user that illusion >;).
 	 */
-	left = (dl_se->dl_period >> 10) * (dl_se->runtime >> 10);
-	right = ((dl_se->deadline - t) >> 10) * (dl_se->dl_runtime >> 10);
+	left = (pi_se->dl_period >> 10) * (dl_se->runtime >> 10);
+	right = ((dl_se->deadline - t) >> 10) * (pi_se->dl_runtime >> 10);
 
 	return dl_time_before(right, left);
 }
@@ -374,7 +374,8 @@ static bool dl_entity_overflow(struct sched_dl_entity *dl_se, u64 t)
  *  - using the remaining runtime with the current deadline would make
  *    the entity exceed its bandwidth.
  */
-static void update_dl_entity(struct sched_dl_entity *dl_se)
+static void update_dl_entity(struct sched_dl_entity *dl_se,
+			     struct sched_dl_entity *pi_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
@@ -384,14 +385,14 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
 	 * the actual scheduling parameters have to be "renewed".
 	 */
 	if (dl_se->dl_new) {
-		setup_new_dl_entity(dl_se);
+		setup_new_dl_entity(dl_se, pi_se);
 		return;
 	}
 
 	if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
-	    dl_entity_overflow(dl_se, rq_clock(rq))) {
-		dl_se->deadline = rq_clock(rq) + dl_se->dl_deadline;
-		dl_se->runtime = dl_se->dl_runtime;
+	    dl_entity_overflow(dl_se, pi_se, rq_clock(rq))) {
+		dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline;
+		dl_se->runtime = pi_se->dl_runtime;
 	}
 }
 
@@ -405,7 +406,7 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
  * actually started or not (i.e., the replenishment instant is in
  * the future or in the past).
  */
-static int start_dl_timer(struct sched_dl_entity *dl_se)
+static int start_dl_timer(struct sched_dl_entity *dl_se, bool boosted)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
@@ -414,6 +415,8 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
 	unsigned long range;
 	s64 delta;
 
+	if (boosted)
+		return 0;
 	/*
 	 * We want the timer to fire at the deadline, but considering
 	 * that it is actually coming from rq->clock and not from
@@ -573,7 +576,7 @@ static void update_curr_dl(struct rq *rq)
 	dl_se->runtime -= delta_exec;
 	if (dl_runtime_exceeded(rq, dl_se)) {
 		__dequeue_task_dl(rq, curr, 0);
-		if (likely(start_dl_timer(dl_se)))
+		if (likely(start_dl_timer(dl_se, curr->dl.dl_boosted)))
 			dl_se->dl_throttled = 1;
 		else
 			enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
@@ -728,7 +731,8 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
 }
 
 static void
-enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
+enqueue_dl_entity(struct sched_dl_entity *dl_se,
+		  struct sched_dl_entity *pi_se, int flags)
 {
 	BUG_ON(on_dl_rq(dl_se));
 
@@ -738,9 +742,9 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 	 * we want a replenishment of its runtime.
 	 */
 	if (!dl_se->dl_new && flags & ENQUEUE_REPLENISH)
-		replenish_dl_entity(dl_se);
+		replenish_dl_entity(dl_se, pi_se);
 	else
-		update_dl_entity(dl_se);
+		update_dl_entity(dl_se, pi_se);
 
 	__enqueue_dl_entity(dl_se);
 }
@@ -752,6 +756,18 @@ static void dequeue_dl_entity(struct sched_dl_entity *dl_se)
 
 static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
+	struct task_struct *pi_task = rt_mutex_get_top_task(p);
+	struct sched_dl_entity *pi_se = &p->dl;
+
+	/*
+	 * Use the scheduling parameters of the top pi-waiter
+	 * task if we have one and its (relative) deadline is
+	 * smaller than our one... OTW we keep our runtime and
+	 * deadline.
+	 */
+	if (pi_task && p->dl.dl_boosted && dl_prio(pi_task->normal_prio))
+		pi_se = &pi_task->dl;
+
 	/*
 	 * If p is throttled, we do nothing. In fact, if it exhausted
 	 * its budget it needs a replenishment and, since it now is on
@@ -761,7 +777,7 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 	if (p->dl.dl_throttled)
 		return;
 
-	enqueue_dl_entity(&p->dl, flags);
+	enqueue_dl_entity(&p->dl, pi_se, flags);
 
 	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_dl_task(rq, p);
@@ -985,8 +1001,7 @@ static void task_dead_dl(struct task_struct *p)
 {
 	struct hrtimer *timer = &p->dl.dl_timer;
 
-	if (hrtimer_active(timer))
-		hrtimer_try_to_cancel(timer);
+	hrtimer_cancel(timer);
 }
 
 static void set_curr_task_dl(struct rq *rq)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 93ea627..52453a2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -107,6 +107,20 @@ static inline int task_has_dl_policy(struct task_struct *p)
 	return dl_policy(p->policy);
 }
 
+static inline int dl_time_before(u64 a, u64 b)
+{
+	return (s64)(a - b) < 0;
+}
+
+/*
+ * Tells if entity @a should preempt entity @b.
+ */
+static inline
+int dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
+{
+	return dl_time_before(a->deadline, b->deadline);
+}
+
 /*
  * This is the priority-queue data structure of the RT scheduling class:
  */
diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
index 090c4d9..6e32635 100644
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -16,6 +16,7 @@
 #include <linux/uaccess.h>
 #include <linux/ftrace.h>
 #include <linux/sched/rt.h>
+#include <linux/sched/deadline.h>
 #include <trace/events/sched.h>
 #include "trace.h"
 

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [tip:sched/core] sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks
  2013-11-07 13:43 ` [PATCH 11/14] sched: add bandwidth management for sched_dl Juri Lelli
@ 2014-01-13 15:54   ` tip-bot for Dario Faggioli
  0 siblings, 0 replies; 81+ messages in thread
From: tip-bot for Dario Faggioli @ 2014-01-13 15:54 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, raistlin, tglx, juri.lelli

Commit-ID:  332ac17ef5bfcff4766dfdfd3b4cdf10b8f8f155
Gitweb:     http://git.kernel.org/tip/332ac17ef5bfcff4766dfdfd3b4cdf10b8f8f155
Author:     Dario Faggioli <raistlin@linux.it>
AuthorDate: Thu, 7 Nov 2013 14:43:45 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 13 Jan 2014 13:46:42 +0100

sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks

In order of deadline scheduling to be effective and useful, it is
important that some method of having the allocation of the available
CPU bandwidth to tasks and task groups under control.
This is usually called "admission control" and if it is not performed
at all, no guarantee can be given on the actual scheduling of the
-deadline tasks.

Since when RT-throttling has been introduced each task group have a
bandwidth associated to itself, calculated as a certain amount of
runtime over a period. Moreover, to make it possible to manipulate
such bandwidth, readable/writable controls have been added to both
procfs (for system wide settings) and cgroupfs (for per-group
settings).

Therefore, the same interface is being used for controlling the
bandwidth distrubution to -deadline tasks and task groups, i.e.,
new controls but with similar names, equivalent meaning and with
the same usage paradigm are added.

However, more discussion is needed in order to figure out how
we want to manage SCHED_DEADLINE bandwidth at the task group level.
Therefore, this patch adds a less sophisticated, but actually
very sensible, mechanism to ensure that a certain utilization
cap is not overcome per each root_domain (the single rq for !SMP
configurations).

Another main difference between deadline bandwidth management and
RT-throttling is that -deadline tasks have bandwidth on their own
(while -rt ones doesn't!), and thus we don't need an higher level
throttling mechanism to enforce the desired bandwidth.

This patch, therefore:

 - adds system wide deadline bandwidth management by means of:
    * /proc/sys/kernel/sched_dl_runtime_us,
    * /proc/sys/kernel/sched_dl_period_us,
   that determine (i.e., runtime / period) the total bandwidth
   available on each CPU of each root_domain for -deadline tasks;

 - couples the RT and deadline bandwidth management, i.e., enforces
   that the sum of how much bandwidth is being devoted to -rt
   -deadline tasks to stay below 100%.

This means that, for a root_domain comprising M CPUs, -deadline tasks
can be created until the sum of their bandwidths stay below:

    M * (sched_dl_runtime_us / sched_dl_period_us)

It is also possible to disable this bandwidth management logic, and
be thus free of oversubscribing the system up to any arbitrary level.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-12-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h        |   1 +
 include/linux/sched/sysctl.h |  13 ++
 kernel/sched/core.c          | 441 ++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/deadline.c      |  46 ++++-
 kernel/sched/sched.h         |  76 +++++++-
 kernel/sysctl.c              |  14 ++
 6 files changed, 555 insertions(+), 36 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 13c53a9..a196cb7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1104,6 +1104,7 @@ struct sched_dl_entity {
 	u64 dl_runtime;		/* maximum runtime for each instance	*/
 	u64 dl_deadline;	/* relative deadline of each instance	*/
 	u64 dl_period;		/* separation of two instances (period) */
+	u64 dl_bw;		/* dl_runtime / dl_deadline		*/
 
 	/*
 	 * Actual scheduling parameters. Initialized with the values above,
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 31e0193..8070a83 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -81,6 +81,15 @@ static inline unsigned int get_sysctl_timer_migration(void)
 extern unsigned int sysctl_sched_rt_period;
 extern int sysctl_sched_rt_runtime;
 
+/*
+ *  control SCHED_DEADLINE reservations:
+ *
+ *  /proc/sys/kernel/sched_dl_period_us
+ *  /proc/sys/kernel/sched_dl_runtime_us
+ */
+extern unsigned int sysctl_sched_dl_period;
+extern int sysctl_sched_dl_runtime;
+
 #ifdef CONFIG_CFS_BANDWIDTH
 extern unsigned int sysctl_sched_cfs_bandwidth_slice;
 #endif
@@ -99,4 +108,8 @@ extern int sched_rt_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos);
 
+int sched_dl_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos);
+
 #endif /* _SCHED_SYSCTL_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 599ee3b..c7c68e6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -296,6 +296,15 @@ __read_mostly int scheduler_running;
  */
 int sysctl_sched_rt_runtime = 950000;
 
+/*
+ * Maximum bandwidth available for all -deadline tasks and groups
+ * (if group scheduling is configured) on each CPU.
+ *
+ * default: 5%
+ */
+unsigned int sysctl_sched_dl_period = 1000000;
+int sysctl_sched_dl_runtime = 50000;
+
 
 
 /*
@@ -1856,6 +1865,111 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 	return 0;
 }
 
+unsigned long to_ratio(u64 period, u64 runtime)
+{
+	if (runtime == RUNTIME_INF)
+		return 1ULL << 20;
+
+	/*
+	 * Doing this here saves a lot of checks in all
+	 * the calling paths, and returning zero seems
+	 * safe for them anyway.
+	 */
+	if (period == 0)
+		return 0;
+
+	return div64_u64(runtime << 20, period);
+}
+
+#ifdef CONFIG_SMP
+inline struct dl_bw *dl_bw_of(int i)
+{
+	return &cpu_rq(i)->rd->dl_bw;
+}
+
+static inline int __dl_span_weight(struct rq *rq)
+{
+	return cpumask_weight(rq->rd->span);
+}
+#else
+inline struct dl_bw *dl_bw_of(int i)
+{
+	return &cpu_rq(i)->dl.dl_bw;
+}
+
+static inline int __dl_span_weight(struct rq *rq)
+{
+	return 1;
+}
+#endif
+
+static inline
+void __dl_clear(struct dl_bw *dl_b, u64 tsk_bw)
+{
+	dl_b->total_bw -= tsk_bw;
+}
+
+static inline
+void __dl_add(struct dl_bw *dl_b, u64 tsk_bw)
+{
+	dl_b->total_bw += tsk_bw;
+}
+
+static inline
+bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64 new_bw)
+{
+	return dl_b->bw != -1 &&
+	       dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw;
+}
+
+/*
+ * We must be sure that accepting a new task (or allowing changing the
+ * parameters of an existing one) is consistent with the bandwidth
+ * constraints. If yes, this function also accordingly updates the currently
+ * allocated bandwidth to reflect the new situation.
+ *
+ * This function is called while holding p's rq->lock.
+ */
+static int dl_overflow(struct task_struct *p, int policy,
+		       const struct sched_attr *attr)
+{
+
+	struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+	u64 period = attr->sched_period;
+	u64 runtime = attr->sched_runtime;
+	u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
+	int cpus = __dl_span_weight(task_rq(p));
+	int err = -1;
+
+	if (new_bw == p->dl.dl_bw)
+		return 0;
+
+	/*
+	 * Either if a task, enters, leave, or stays -deadline but changes
+	 * its parameters, we may need to update accordingly the total
+	 * allocated bandwidth of the container.
+	 */
+	raw_spin_lock(&dl_b->lock);
+	if (dl_policy(policy) && !task_has_dl_policy(p) &&
+	    !__dl_overflow(dl_b, cpus, 0, new_bw)) {
+		__dl_add(dl_b, new_bw);
+		err = 0;
+	} else if (dl_policy(policy) && task_has_dl_policy(p) &&
+		   !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
+		__dl_clear(dl_b, p->dl.dl_bw);
+		__dl_add(dl_b, new_bw);
+		err = 0;
+	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
+		__dl_clear(dl_b, p->dl.dl_bw);
+		err = 0;
+	}
+	raw_spin_unlock(&dl_b->lock);
+
+	return err;
+}
+
+extern void init_dl_bw(struct dl_bw *dl_b);
+
 /*
  * wake_up_new_task - wake up a newly created task for the first time.
  *
@@ -3053,6 +3167,7 @@ __setparam_dl(struct task_struct *p, const struct sched_attr *attr)
 	dl_se->dl_deadline = attr->sched_deadline;
 	dl_se->dl_period = attr->sched_period ?: dl_se->dl_deadline;
 	dl_se->flags = attr->sched_flags;
+	dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime);
 	dl_se->dl_throttled = 0;
 	dl_se->dl_new = 1;
 }
@@ -3101,7 +3216,9 @@ __getparam_dl(struct task_struct *p, struct sched_attr *attr)
  * This function validates the new parameters of a -deadline task.
  * We ask for the deadline not being zero, and greater or equal
  * than the runtime, as well as the period of being zero or
- * greater than deadline.
+ * greater than deadline. Furthermore, we have to be sure that
+ * user parameters are above the internal resolution (1us); we
+ * check sched_runtime only since it is always the smaller one.
  */
 static bool
 __checkparam_dl(const struct sched_attr *attr)
@@ -3109,7 +3226,8 @@ __checkparam_dl(const struct sched_attr *attr)
 	return attr && attr->sched_deadline != 0 &&
 		(attr->sched_period == 0 ||
 		(s64)(attr->sched_period   - attr->sched_deadline) >= 0) &&
-		(s64)(attr->sched_deadline - attr->sched_runtime ) >= 0;
+		(s64)(attr->sched_deadline - attr->sched_runtime ) >= 0  &&
+		attr->sched_runtime >= (2 << (DL_SCALE - 1));
 }
 
 /*
@@ -3250,8 +3368,8 @@ recheck:
 	}
 change:
 
-#ifdef CONFIG_RT_GROUP_SCHED
 	if (user) {
+#ifdef CONFIG_RT_GROUP_SCHED
 		/*
 		 * Do not allow realtime tasks into groups that have no runtime
 		 * assigned.
@@ -3262,8 +3380,33 @@ change:
 			task_rq_unlock(rq, p, &flags);
 			return -EPERM;
 		}
-	}
 #endif
+#ifdef CONFIG_SMP
+		if (dl_bandwidth_enabled() && dl_policy(policy)) {
+			cpumask_t *span = rq->rd->span;
+			cpumask_t act_affinity;
+
+			/*
+			 * cpus_allowed mask is statically initialized with
+			 * CPU_MASK_ALL, span is instead dynamic. Here we
+			 * compute the "dynamic" affinity of a task.
+			 */
+			cpumask_and(&act_affinity, &p->cpus_allowed,
+				    cpu_active_mask);
+
+			/*
+			 * Don't allow tasks with an affinity mask smaller than
+			 * the entire root_domain to become SCHED_DEADLINE. We
+			 * will also fail if there's no bandwidth available.
+			 */
+			if (!cpumask_equal(&act_affinity, span) ||
+					   rq->rd->dl_bw.bw == 0) {
+				task_rq_unlock(rq, p, &flags);
+				return -EPERM;
+			}
+		}
+#endif
+	}
 
 	/* recheck policy now with rq lock held */
 	if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
@@ -3271,6 +3414,18 @@ change:
 		task_rq_unlock(rq, p, &flags);
 		goto recheck;
 	}
+
+	/*
+	 * If setscheduling to SCHED_DEADLINE (or changing the parameters
+	 * of a SCHED_DEADLINE task) we need to check if enough bandwidth
+	 * is available.
+	 */
+	if ((dl_policy(policy) || dl_task(p)) &&
+	    dl_overflow(p, policy, attr)) {
+		task_rq_unlock(rq, p, &flags);
+		return -EBUSY;
+	}
+
 	on_rq = p->on_rq;
 	running = task_current(rq, p);
 	if (on_rq)
@@ -3705,6 +3860,24 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
 	if (retval)
 		goto out_unlock;
 
+	/*
+	 * Since bandwidth control happens on root_domain basis,
+	 * if admission test is enabled, we only admit -deadline
+	 * tasks allowed to run on all the CPUs in the task's
+	 * root_domain.
+	 */
+#ifdef CONFIG_SMP
+	if (task_has_dl_policy(p)) {
+		const struct cpumask *span = task_rq(p)->rd->span;
+
+		if (dl_bandwidth_enabled() &&
+		    !cpumask_equal(in_mask, span)) {
+			retval = -EBUSY;
+			goto out_unlock;
+		}
+	}
+#endif
+
 	cpuset_cpus_allowed(p, cpus_allowed);
 	cpumask_and(new_mask, in_mask, cpus_allowed);
 again:
@@ -4359,6 +4532,42 @@ out:
 EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
 
 /*
+ * When dealing with a -deadline task, we have to check if moving it to
+ * a new CPU is possible or not. In fact, this is only true iff there
+ * is enough bandwidth available on such CPU, otherwise we want the
+ * whole migration progedure to fail over.
+ */
+static inline
+bool set_task_cpu_dl(struct task_struct *p, unsigned int cpu)
+{
+	struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+	struct dl_bw *cpu_b = dl_bw_of(cpu);
+	int ret = 1;
+	u64 bw;
+
+	if (dl_b == cpu_b)
+		return 1;
+
+	raw_spin_lock(&dl_b->lock);
+	raw_spin_lock(&cpu_b->lock);
+
+	bw = cpu_b->bw * cpumask_weight(cpu_rq(cpu)->rd->span);
+	if (dl_bandwidth_enabled() &&
+	    bw < cpu_b->total_bw + p->dl.dl_bw) {
+		ret = 0;
+		goto unlock;
+	}
+	dl_b->total_bw -= p->dl.dl_bw;
+	cpu_b->total_bw += p->dl.dl_bw;
+
+unlock:
+	raw_spin_unlock(&cpu_b->lock);
+	raw_spin_unlock(&dl_b->lock);
+
+	return ret;
+}
+
+/*
  * Move (not current) task off this cpu, onto dest cpu. We're doing
  * this because either it can't run here any more (set_cpus_allowed()
  * away from this CPU, or CPU going down), or because we're
@@ -4390,6 +4599,13 @@ static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
 		goto fail;
 
 	/*
+	 * If p is -deadline, proceed only if there is enough
+	 * bandwidth available on dest_cpu
+	 */
+	if (unlikely(dl_task(p)) && !set_task_cpu_dl(p, dest_cpu))
+		goto fail;
+
+	/*
 	 * If we're not on a rq, the next wake-up will ensure we're
 	 * placed properly.
 	 */
@@ -5128,6 +5344,8 @@ static int init_rootdomain(struct root_domain *rd)
 	if (!alloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
 		goto free_dlo_mask;
 
+	init_dl_bw(&rd->dl_bw);
+
 	if (cpupri_init(&rd->cpupri) != 0)
 		goto free_rto_mask;
 	return 0;
@@ -6557,13 +6775,15 @@ void __init sched_init(void)
 #endif /* CONFIG_CPUMASK_OFFSTACK */
 	}
 
+	init_rt_bandwidth(&def_rt_bandwidth,
+			global_rt_period(), global_rt_runtime());
+	init_dl_bandwidth(&def_dl_bandwidth,
+			global_dl_period(), global_dl_runtime());
+
 #ifdef CONFIG_SMP
 	init_defrootdomain();
 #endif
 
-	init_rt_bandwidth(&def_rt_bandwidth,
-			global_rt_period(), global_rt_runtime());
-
 #ifdef CONFIG_RT_GROUP_SCHED
 	init_rt_bandwidth(&root_task_group.rt_bandwidth,
 			global_rt_period(), global_rt_runtime());
@@ -6966,16 +7186,6 @@ void sched_move_task(struct task_struct *tsk)
 }
 #endif /* CONFIG_CGROUP_SCHED */
 
-#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_CFS_BANDWIDTH)
-static unsigned long to_ratio(u64 period, u64 runtime)
-{
-	if (runtime == RUNTIME_INF)
-		return 1ULL << 20;
-
-	return div64_u64(runtime << 20, period);
-}
-#endif
-
 #ifdef CONFIG_RT_GROUP_SCHED
 /*
  * Ensure that the real time constraints are schedulable.
@@ -7149,10 +7359,48 @@ static long sched_group_rt_period(struct task_group *tg)
 	do_div(rt_period_us, NSEC_PER_USEC);
 	return rt_period_us;
 }
+#endif /* CONFIG_RT_GROUP_SCHED */
 
+/*
+ * Coupling of -rt and -deadline bandwidth.
+ *
+ * Here we check if the new -rt bandwidth value is consistent
+ * with the system settings for the bandwidth available
+ * to -deadline tasks.
+ *
+ * IOW, we want to enforce that
+ *
+ *   rt_bandwidth + dl_bandwidth <= 100%
+ *
+ * is always true.
+ */
+static bool __sched_rt_dl_global_constraints(u64 rt_bw)
+{
+	unsigned long flags;
+	u64 dl_bw;
+	bool ret;
+
+	raw_spin_lock_irqsave(&def_dl_bandwidth.dl_runtime_lock, flags);
+	if (global_rt_runtime() == RUNTIME_INF ||
+	    global_dl_runtime() == RUNTIME_INF) {
+		ret = true;
+		goto unlock;
+	}
+
+	dl_bw = to_ratio(def_dl_bandwidth.dl_period,
+			 def_dl_bandwidth.dl_runtime);
+
+	ret = rt_bw + dl_bw <= to_ratio(RUNTIME_INF, RUNTIME_INF);
+unlock:
+	raw_spin_unlock_irqrestore(&def_dl_bandwidth.dl_runtime_lock, flags);
+
+	return ret;
+}
+
+#ifdef CONFIG_RT_GROUP_SCHED
 static int sched_rt_global_constraints(void)
 {
-	u64 runtime, period;
+	u64 runtime, period, bw;
 	int ret = 0;
 
 	if (sysctl_sched_rt_period <= 0)
@@ -7167,6 +7415,10 @@ static int sched_rt_global_constraints(void)
 	if (runtime > period && runtime != RUNTIME_INF)
 		return -EINVAL;
 
+	bw = to_ratio(period, runtime);
+	if (!__sched_rt_dl_global_constraints(bw))
+		return -EINVAL;
+
 	mutex_lock(&rt_constraints_mutex);
 	read_lock(&tasklist_lock);
 	ret = __rt_schedulable(NULL, 0, 0);
@@ -7189,19 +7441,19 @@ static int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
 static int sched_rt_global_constraints(void)
 {
 	unsigned long flags;
-	int i;
+	int i, ret = 0;
+	u64 bw;
 
 	if (sysctl_sched_rt_period <= 0)
 		return -EINVAL;
 
-	/*
-	 * There's always some RT tasks in the root group
-	 * -- migration, kstopmachine etc..
-	 */
-	if (sysctl_sched_rt_runtime == 0)
-		return -EBUSY;
-
 	raw_spin_lock_irqsave(&def_rt_bandwidth.rt_runtime_lock, flags);
+	bw = to_ratio(global_rt_period(), global_rt_runtime());
+	if (!__sched_rt_dl_global_constraints(bw)) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
 	for_each_possible_cpu(i) {
 		struct rt_rq *rt_rq = &cpu_rq(i)->rt;
 
@@ -7209,12 +7461,93 @@ static int sched_rt_global_constraints(void)
 		rt_rq->rt_runtime = global_rt_runtime();
 		raw_spin_unlock(&rt_rq->rt_runtime_lock);
 	}
+unlock:
 	raw_spin_unlock_irqrestore(&def_rt_bandwidth.rt_runtime_lock, flags);
 
-	return 0;
+	return ret;
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+/*
+ * Coupling of -dl and -rt bandwidth.
+ *
+ * Here we check, while setting the system wide bandwidth available
+ * for -dl tasks and groups, if the new values are consistent with
+ * the system settings for the bandwidth available to -rt entities.
+ *
+ * IOW, we want to enforce that
+ *
+ *   rt_bandwidth + dl_bandwidth <= 100%
+ *
+ * is always true.
+ */
+static bool __sched_dl_rt_global_constraints(u64 dl_bw)
+{
+	u64 rt_bw;
+	bool ret;
+
+	raw_spin_lock(&def_rt_bandwidth.rt_runtime_lock);
+	if (global_dl_runtime() == RUNTIME_INF ||
+	    global_rt_runtime() == RUNTIME_INF) {
+		ret = true;
+		goto unlock;
+	}
+
+	rt_bw = to_ratio(ktime_to_ns(def_rt_bandwidth.rt_period),
+			 def_rt_bandwidth.rt_runtime);
+
+	ret = rt_bw + dl_bw <= to_ratio(RUNTIME_INF, RUNTIME_INF);
+unlock:
+	raw_spin_unlock(&def_rt_bandwidth.rt_runtime_lock);
+
+	return ret;
+}
+
+static bool __sched_dl_global_constraints(u64 runtime, u64 period)
+{
+	if (!period || (runtime != RUNTIME_INF && runtime > period))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int sched_dl_global_constraints(void)
+{
+	u64 runtime = global_dl_runtime();
+	u64 period = global_dl_period();
+	u64 new_bw = to_ratio(period, runtime);
+	int ret, i;
+
+	ret = __sched_dl_global_constraints(runtime, period);
+	if (ret)
+		return ret;
+
+	if (!__sched_dl_rt_global_constraints(new_bw))
+		return -EINVAL;
+
+	/*
+	 * Here we want to check the bandwidth not being set to some
+	 * value smaller than the currently allocated bandwidth in
+	 * any of the root_domains.
+	 *
+	 * FIXME: Cycling on all the CPUs is overdoing, but simpler than
+	 * cycling on root_domains... Discussion on different/better
+	 * solutions is welcome!
+	 */
+	for_each_possible_cpu(i) {
+		struct dl_bw *dl_b = dl_bw_of(i);
+
+		raw_spin_lock(&dl_b->lock);
+		if (new_bw < dl_b->total_bw) {
+			raw_spin_unlock(&dl_b->lock);
+			return -EBUSY;
+		}
+		raw_spin_unlock(&dl_b->lock);
+	}
+
+	return 0;
+}
+
 int sched_rr_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos)
@@ -7264,6 +7597,60 @@ int sched_rt_handler(struct ctl_table *table, int write,
 	return ret;
 }
 
+int sched_dl_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp,
+		loff_t *ppos)
+{
+	int ret;
+	int old_period, old_runtime;
+	static DEFINE_MUTEX(mutex);
+	unsigned long flags;
+
+	mutex_lock(&mutex);
+	old_period = sysctl_sched_dl_period;
+	old_runtime = sysctl_sched_dl_runtime;
+
+	ret = proc_dointvec(table, write, buffer, lenp, ppos);
+
+	if (!ret && write) {
+		raw_spin_lock_irqsave(&def_dl_bandwidth.dl_runtime_lock,
+				      flags);
+
+		ret = sched_dl_global_constraints();
+		if (ret) {
+			sysctl_sched_dl_period = old_period;
+			sysctl_sched_dl_runtime = old_runtime;
+		} else {
+			u64 new_bw;
+			int i;
+
+			def_dl_bandwidth.dl_period = global_dl_period();
+			def_dl_bandwidth.dl_runtime = global_dl_runtime();
+			if (global_dl_runtime() == RUNTIME_INF)
+				new_bw = -1;
+			else
+				new_bw = to_ratio(global_dl_period(),
+						  global_dl_runtime());
+			/*
+			 * FIXME: As above...
+			 */
+			for_each_possible_cpu(i) {
+				struct dl_bw *dl_b = dl_bw_of(i);
+
+				raw_spin_lock(&dl_b->lock);
+				dl_b->bw = new_bw;
+				raw_spin_unlock(&dl_b->lock);
+			}
+		}
+
+		raw_spin_unlock_irqrestore(&def_dl_bandwidth.dl_runtime_lock,
+					   flags);
+	}
+	mutex_unlock(&mutex);
+
+	return ret;
+}
+
 #ifdef CONFIG_CGROUP_SCHED
 
 static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 7f6de43..802188f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -16,6 +16,8 @@
  */
 #include "sched.h"
 
+struct dl_bandwidth def_dl_bandwidth;
+
 static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
 {
 	return container_of(dl_se, struct task_struct, dl);
@@ -46,6 +48,27 @@ static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
 	return dl_rq->rb_leftmost == &dl_se->rb_node;
 }
 
+void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime)
+{
+	raw_spin_lock_init(&dl_b->dl_runtime_lock);
+	dl_b->dl_period = period;
+	dl_b->dl_runtime = runtime;
+}
+
+extern unsigned long to_ratio(u64 period, u64 runtime);
+
+void init_dl_bw(struct dl_bw *dl_b)
+{
+	raw_spin_lock_init(&dl_b->lock);
+	raw_spin_lock(&def_dl_bandwidth.dl_runtime_lock);
+	if (global_dl_runtime() == RUNTIME_INF)
+		dl_b->bw = -1;
+	else
+		dl_b->bw = to_ratio(global_dl_period(), global_dl_runtime());
+	raw_spin_unlock(&def_dl_bandwidth.dl_runtime_lock);
+	dl_b->total_bw = 0;
+}
+
 void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
 {
 	dl_rq->rb_root = RB_ROOT;
@@ -57,6 +80,8 @@ void init_dl_rq(struct dl_rq *dl_rq, struct rq *rq)
 	dl_rq->dl_nr_migratory = 0;
 	dl_rq->overloaded = 0;
 	dl_rq->pushable_dl_tasks_root = RB_ROOT;
+#else
+	init_dl_bw(&dl_rq->dl_bw);
 #endif
 }
 
@@ -359,8 +384,9 @@ static bool dl_entity_overflow(struct sched_dl_entity *dl_se,
 	 * of anything below microseconds resolution is actually fiction
 	 * (but still we want to give the user that illusion >;).
 	 */
-	left = (pi_se->dl_period >> 10) * (dl_se->runtime >> 10);
-	right = ((dl_se->deadline - t) >> 10) * (pi_se->dl_runtime >> 10);
+	left = (pi_se->dl_period >> DL_SCALE) * (dl_se->runtime >> DL_SCALE);
+	right = ((dl_se->deadline - t) >> DL_SCALE) *
+		(pi_se->dl_runtime >> DL_SCALE);
 
 	return dl_time_before(right, left);
 }
@@ -911,8 +937,8 @@ static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
 	 * In the unlikely case current and p have the same deadline
 	 * let us try to decide what's the best thing to do...
 	 */
-	if ((s64)(p->dl.deadline - rq->curr->dl.deadline) == 0 &&
-	    !need_resched())
+	if ((p->dl.deadline == rq->curr->dl.deadline) &&
+	    !test_tsk_need_resched(rq->curr))
 		check_preempt_equal_dl(rq, p);
 #endif /* CONFIG_SMP */
 }
@@ -1000,6 +1026,14 @@ static void task_fork_dl(struct task_struct *p)
 static void task_dead_dl(struct task_struct *p)
 {
 	struct hrtimer *timer = &p->dl.dl_timer;
+	struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+
+	/*
+	 * Since we are TASK_DEAD we won't slip out of the domain!
+	 */
+	raw_spin_lock_irq(&dl_b->lock);
+	dl_b->total_bw -= p->dl.dl_bw;
+	raw_spin_unlock_irq(&dl_b->lock);
 
 	hrtimer_cancel(timer);
 }
@@ -1226,7 +1260,7 @@ static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
 	BUG_ON(task_current(rq, p));
 	BUG_ON(p->nr_cpus_allowed <= 1);
 
-	BUG_ON(!p->se.on_rq);
+	BUG_ON(!p->on_rq);
 	BUG_ON(!dl_task(p));
 
 	return p;
@@ -1373,7 +1407,7 @@ static int pull_dl_task(struct rq *this_rq)
 		     dl_time_before(p->dl.deadline,
 				    this_rq->dl.earliest_dl.curr))) {
 			WARN_ON(p == src_rq->curr);
-			WARN_ON(!p->se.on_rq);
+			WARN_ON(!p->on_rq);
 
 			/*
 			 * Then we pull iff p has actually an earlier
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 52453a2..ad4f4fb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -74,6 +74,13 @@ extern void update_cpu_load_active(struct rq *this_rq);
 #define NICE_0_SHIFT		SCHED_LOAD_SHIFT
 
 /*
+ * Single value that decides SCHED_DEADLINE internal math precision.
+ * 10 -> just above 1us
+ * 9  -> just above 0.5us
+ */
+#define DL_SCALE (10)
+
+/*
  * These are the 'tuning knobs' of the scheduler:
  */
 
@@ -107,7 +114,7 @@ static inline int task_has_dl_policy(struct task_struct *p)
 	return dl_policy(p->policy);
 }
 
-static inline int dl_time_before(u64 a, u64 b)
+static inline bool dl_time_before(u64 a, u64 b)
 {
 	return (s64)(a - b) < 0;
 }
@@ -115,8 +122,8 @@ static inline int dl_time_before(u64 a, u64 b)
 /*
  * Tells if entity @a should preempt entity @b.
  */
-static inline
-int dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
+static inline bool
+dl_entity_preempt(struct sched_dl_entity *a, struct sched_dl_entity *b)
 {
 	return dl_time_before(a->deadline, b->deadline);
 }
@@ -136,6 +143,50 @@ struct rt_bandwidth {
 	u64			rt_runtime;
 	struct hrtimer		rt_period_timer;
 };
+/*
+ * To keep the bandwidth of -deadline tasks and groups under control
+ * we need some place where:
+ *  - store the maximum -deadline bandwidth of the system (the group);
+ *  - cache the fraction of that bandwidth that is currently allocated.
+ *
+ * This is all done in the data structure below. It is similar to the
+ * one used for RT-throttling (rt_bandwidth), with the main difference
+ * that, since here we are only interested in admission control, we
+ * do not decrease any runtime while the group "executes", neither we
+ * need a timer to replenish it.
+ *
+ * With respect to SMP, the bandwidth is given on a per-CPU basis,
+ * meaning that:
+ *  - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;
+ *  - dl_total_bw array contains, in the i-eth element, the currently
+ *    allocated bandwidth on the i-eth CPU.
+ * Moreover, groups consume bandwidth on each CPU, while tasks only
+ * consume bandwidth on the CPU they're running on.
+ * Finally, dl_total_bw_cpu is used to cache the index of dl_total_bw
+ * that will be shown the next time the proc or cgroup controls will
+ * be red. It on its turn can be changed by writing on its own
+ * control.
+ */
+struct dl_bandwidth {
+	raw_spinlock_t dl_runtime_lock;
+	u64 dl_runtime;
+	u64 dl_period;
+};
+
+static inline int dl_bandwidth_enabled(void)
+{
+	return sysctl_sched_dl_runtime >= 0;
+}
+
+extern struct dl_bw *dl_bw_of(int i);
+
+struct dl_bw {
+	raw_spinlock_t lock;
+	u64 bw, total_bw;
+};
+
+static inline u64 global_dl_period(void);
+static inline u64 global_dl_runtime(void);
 
 extern struct mutex sched_domains_mutex;
 
@@ -423,6 +474,8 @@ struct dl_rq {
 	 */
 	struct rb_root pushable_dl_tasks_root;
 	struct rb_node *pushable_dl_tasks_leftmost;
+#else
+	struct dl_bw dl_bw;
 #endif
 };
 
@@ -449,6 +502,7 @@ struct root_domain {
 	 */
 	cpumask_var_t dlo_mask;
 	atomic_t dlo_count;
+	struct dl_bw dl_bw;
 
 	/*
 	 * The "RT overload" flag: it gets set if a CPU has more than
@@ -897,7 +951,18 @@ static inline u64 global_rt_runtime(void)
 	return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
 }
 
+static inline u64 global_dl_period(void)
+{
+	return (u64)sysctl_sched_dl_period * NSEC_PER_USEC;
+}
+
+static inline u64 global_dl_runtime(void)
+{
+	if (sysctl_sched_dl_runtime < 0)
+		return RUNTIME_INF;
 
+	return (u64)sysctl_sched_dl_runtime * NSEC_PER_USEC;
+}
 
 static inline int task_current(struct rq *rq, struct task_struct *p)
 {
@@ -1145,6 +1210,7 @@ extern void update_max_interval(void);
 extern void init_sched_dl_class(void);
 extern void init_sched_rt_class(void);
 extern void init_sched_fair_class(void);
+extern void init_sched_dl_class(void);
 
 extern void resched_task(struct task_struct *p);
 extern void resched_cpu(int cpu);
@@ -1152,8 +1218,12 @@ extern void resched_cpu(int cpu);
 extern struct rt_bandwidth def_rt_bandwidth;
 extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
 
+extern struct dl_bandwidth def_dl_bandwidth;
+extern void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime);
 extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
 
+unsigned long to_ratio(u64 period, u64 runtime);
+
 extern void update_idle_cpu_load(struct rq *this_rq);
 
 extern void init_task_runnable_average(struct task_struct *p);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c8da99f..c7fb079 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -414,6 +414,20 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= sched_rr_handler,
 	},
+	{
+		.procname	= "sched_dl_period_us",
+		.data		= &sysctl_sched_dl_period,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sched_dl_handler,
+	},
+	{
+		.procname	= "sched_dl_runtime_us",
+		.data		= &sysctl_sched_dl_runtime,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= sched_dl_handler,
+	},
 #ifdef CONFIG_SCHED_AUTOGROUP
 	{
 		.procname	= "sched_autogroup_enabled",

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [tip:sched/core] sched/deadline: speed up SCHED_DEADLINE pushes with a push-heap
  2013-11-07 13:43 ` [PATCH 13/14] sched: speed up -dl pushes with a push-heap Juri Lelli
@ 2014-01-13 15:54   ` tip-bot for Juri Lelli
  0 siblings, 0 replies; 81+ messages in thread
From: tip-bot for Juri Lelli @ 2014-01-13 15:54 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, peterz, tglx, juri.lelli

Commit-ID:  6bfd6d72f51c51177676f2b1ba113fe0a85fdae4
Gitweb:     http://git.kernel.org/tip/6bfd6d72f51c51177676f2b1ba113fe0a85fdae4
Author:     Juri Lelli <juri.lelli@gmail.com>
AuthorDate: Thu, 7 Nov 2013 14:43:47 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 13 Jan 2014 13:46:46 +0100

sched/deadline: speed up SCHED_DEADLINE pushes with a push-heap

Data from tests confirmed that the original active load balancing
logic didn't scale neither in the number of CPU nor in the number of
tasks (as sched_rt does).

Here we provide a global data structure to keep track of deadlines
of the running tasks in the system. The structure is composed by
a bitmask showing the free CPUs and a max-heap, needed when the system
is heavily loaded.

The implementation and concurrent access scheme are kept simple by
design. However, our measurements show that we can compete with sched_rt
on large multi-CPUs machines [1].

Only the push path is addressed, the extension to use this structure
also for pull decisions is straightforward. However, we are currently
evaluating different (in order to decrease/avoid contention) data
structures to solve possibly both problems. We are also going to re-run
tests considering recent changes inside cpupri [2].

 [1] http://retis.sssup.it/~jlelli/papers/Ospert11Lelli.pdf
 [2] http://www.spinics.net/lists/linux-rt-users/msg06778.html

Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-14-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/Makefile      |   2 +-
 kernel/sched/core.c        |   3 +
 kernel/sched/cpudeadline.c | 216 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/cpudeadline.h |  33 +++++++
 kernel/sched/deadline.c    |  53 +++--------
 kernel/sched/sched.h       |   2 +
 6 files changed, 269 insertions(+), 40 deletions(-)

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index b039035..9a95c8c 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -14,7 +14,7 @@ endif
 obj-y += core.o proc.o clock.o cputime.o
 obj-y += idle_task.o fair.o rt.o deadline.o stop_task.o
 obj-y += wait.o completion.o
-obj-$(CONFIG_SMP) += cpupri.o
+obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c7c68e6..e30356d6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5287,6 +5287,7 @@ static void free_rootdomain(struct rcu_head *rcu)
 	struct root_domain *rd = container_of(rcu, struct root_domain, rcu);
 
 	cpupri_cleanup(&rd->cpupri);
+	cpudl_cleanup(&rd->cpudl);
 	free_cpumask_var(rd->dlo_mask);
 	free_cpumask_var(rd->rto_mask);
 	free_cpumask_var(rd->online);
@@ -5345,6 +5346,8 @@ static int init_rootdomain(struct root_domain *rd)
 		goto free_dlo_mask;
 
 	init_dl_bw(&rd->dl_bw);
+	if (cpudl_init(&rd->cpudl) != 0)
+		goto free_dlo_mask;
 
 	if (cpupri_init(&rd->cpupri) != 0)
 		goto free_rto_mask;
diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
new file mode 100644
index 0000000..3bcade5
--- /dev/null
+++ b/kernel/sched/cpudeadline.c
@@ -0,0 +1,216 @@
+/*
+ *  kernel/sched/cpudl.c
+ *
+ *  Global CPU deadline management
+ *
+ *  Author: Juri Lelli <j.lelli@sssup.it>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; version 2
+ *  of the License.
+ */
+
+#include <linux/gfp.h>
+#include <linux/kernel.h>
+#include "cpudeadline.h"
+
+static inline int parent(int i)
+{
+	return (i - 1) >> 1;
+}
+
+static inline int left_child(int i)
+{
+	return (i << 1) + 1;
+}
+
+static inline int right_child(int i)
+{
+	return (i << 1) + 2;
+}
+
+static inline int dl_time_before(u64 a, u64 b)
+{
+	return (s64)(a - b) < 0;
+}
+
+void cpudl_exchange(struct cpudl *cp, int a, int b)
+{
+	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
+
+	swap(cp->elements[a], cp->elements[b]);
+	swap(cp->cpu_to_idx[cpu_a], cp->cpu_to_idx[cpu_b]);
+}
+
+void cpudl_heapify(struct cpudl *cp, int idx)
+{
+	int l, r, largest;
+
+	/* adapted from lib/prio_heap.c */
+	while(1) {
+		l = left_child(idx);
+		r = right_child(idx);
+		largest = idx;
+
+		if ((l < cp->size) && dl_time_before(cp->elements[idx].dl,
+							cp->elements[l].dl))
+			largest = l;
+		if ((r < cp->size) && dl_time_before(cp->elements[largest].dl,
+							cp->elements[r].dl))
+			largest = r;
+		if (largest == idx)
+			break;
+
+		/* Push idx down the heap one level and bump one up */
+		cpudl_exchange(cp, largest, idx);
+		idx = largest;
+	}
+}
+
+void cpudl_change_key(struct cpudl *cp, int idx, u64 new_dl)
+{
+	WARN_ON(idx > num_present_cpus() || idx == IDX_INVALID);
+
+	if (dl_time_before(new_dl, cp->elements[idx].dl)) {
+		cp->elements[idx].dl = new_dl;
+		cpudl_heapify(cp, idx);
+	} else {
+		cp->elements[idx].dl = new_dl;
+		while (idx > 0 && dl_time_before(cp->elements[parent(idx)].dl,
+					cp->elements[idx].dl)) {
+			cpudl_exchange(cp, idx, parent(idx));
+			idx = parent(idx);
+		}
+	}
+}
+
+static inline int cpudl_maximum(struct cpudl *cp)
+{
+	return cp->elements[0].cpu;
+}
+
+/*
+ * cpudl_find - find the best (later-dl) CPU in the system
+ * @cp: the cpudl max-heap context
+ * @p: the task
+ * @later_mask: a mask to fill in with the selected CPUs (or NULL)
+ *
+ * Returns: int - best CPU (heap maximum if suitable)
+ */
+int cpudl_find(struct cpudl *cp, struct task_struct *p,
+	       struct cpumask *later_mask)
+{
+	int best_cpu = -1;
+	const struct sched_dl_entity *dl_se = &p->dl;
+
+	if (later_mask && cpumask_and(later_mask, cp->free_cpus,
+			&p->cpus_allowed) && cpumask_and(later_mask,
+			later_mask, cpu_active_mask)) {
+		best_cpu = cpumask_any(later_mask);
+		goto out;
+	} else if (cpumask_test_cpu(cpudl_maximum(cp), &p->cpus_allowed) &&
+			dl_time_before(dl_se->deadline, cp->elements[0].dl)) {
+		best_cpu = cpudl_maximum(cp);
+		if (later_mask)
+			cpumask_set_cpu(best_cpu, later_mask);
+	}
+
+out:
+	WARN_ON(best_cpu > num_present_cpus() && best_cpu != -1);
+
+	return best_cpu;
+}
+
+/*
+ * cpudl_set - update the cpudl max-heap
+ * @cp: the cpudl max-heap context
+ * @cpu: the target cpu
+ * @dl: the new earliest deadline for this cpu
+ *
+ * Notes: assumes cpu_rq(cpu)->lock is locked
+ *
+ * Returns: (void)
+ */
+void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid)
+{
+	int old_idx, new_cpu;
+	unsigned long flags;
+
+	WARN_ON(cpu > num_present_cpus());
+
+	raw_spin_lock_irqsave(&cp->lock, flags);
+	old_idx = cp->cpu_to_idx[cpu];
+	if (!is_valid) {
+		/* remove item */
+		if (old_idx == IDX_INVALID) {
+			/*
+			 * Nothing to remove if old_idx was invalid.
+			 * This could happen if a rq_offline_dl is
+			 * called for a CPU without -dl tasks running.
+			 */
+			goto out;
+		}
+		new_cpu = cp->elements[cp->size - 1].cpu;
+		cp->elements[old_idx].dl = cp->elements[cp->size - 1].dl;
+		cp->elements[old_idx].cpu = new_cpu;
+		cp->size--;
+		cp->cpu_to_idx[new_cpu] = old_idx;
+		cp->cpu_to_idx[cpu] = IDX_INVALID;
+		while (old_idx > 0 && dl_time_before(
+				cp->elements[parent(old_idx)].dl,
+				cp->elements[old_idx].dl)) {
+			cpudl_exchange(cp, old_idx, parent(old_idx));
+			old_idx = parent(old_idx);
+		}
+		cpumask_set_cpu(cpu, cp->free_cpus);
+                cpudl_heapify(cp, old_idx);
+
+		goto out;
+	}
+
+	if (old_idx == IDX_INVALID) {
+		cp->size++;
+		cp->elements[cp->size - 1].dl = 0;
+		cp->elements[cp->size - 1].cpu = cpu;
+		cp->cpu_to_idx[cpu] = cp->size - 1;
+		cpudl_change_key(cp, cp->size - 1, dl);
+		cpumask_clear_cpu(cpu, cp->free_cpus);
+	} else {
+		cpudl_change_key(cp, old_idx, dl);
+	}
+
+out:
+	raw_spin_unlock_irqrestore(&cp->lock, flags);
+}
+
+/*
+ * cpudl_init - initialize the cpudl structure
+ * @cp: the cpudl max-heap context
+ */
+int cpudl_init(struct cpudl *cp)
+{
+	int i;
+
+	memset(cp, 0, sizeof(*cp));
+	raw_spin_lock_init(&cp->lock);
+	cp->size = 0;
+	for (i = 0; i < NR_CPUS; i++)
+		cp->cpu_to_idx[i] = IDX_INVALID;
+	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL))
+		return -ENOMEM;
+	cpumask_setall(cp->free_cpus);
+
+	return 0;
+}
+
+/*
+ * cpudl_cleanup - clean up the cpudl structure
+ * @cp: the cpudl max-heap context
+ */
+void cpudl_cleanup(struct cpudl *cp)
+{
+	/*
+	 * nothing to do for the moment
+	 */
+}
diff --git a/kernel/sched/cpudeadline.h b/kernel/sched/cpudeadline.h
new file mode 100644
index 0000000..a202789
--- /dev/null
+++ b/kernel/sched/cpudeadline.h
@@ -0,0 +1,33 @@
+#ifndef _LINUX_CPUDL_H
+#define _LINUX_CPUDL_H
+
+#include <linux/sched.h>
+
+#define IDX_INVALID     -1
+
+struct array_item {
+	u64 dl;
+	int cpu;
+};
+
+struct cpudl {
+	raw_spinlock_t lock;
+	int size;
+	int cpu_to_idx[NR_CPUS];
+	struct array_item elements[NR_CPUS];
+	cpumask_var_t free_cpus;
+};
+
+
+#ifdef CONFIG_SMP
+int cpudl_find(struct cpudl *cp, struct task_struct *p,
+	       struct cpumask *later_mask);
+void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid);
+int cpudl_init(struct cpudl *cp);
+void cpudl_cleanup(struct cpudl *cp);
+#else
+#define cpudl_set(cp, cpu, dl) do { } while (0)
+#define cpudl_init() do { } while (0)
+#endif /* CONFIG_SMP */
+
+#endif /* _LINUX_CPUDL_H */
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 802188f..0c6b1d0 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -16,6 +16,8 @@
  */
 #include "sched.h"
 
+#include <linux/slab.h>
+
 struct dl_bandwidth def_dl_bandwidth;
 
 static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
@@ -640,6 +642,7 @@ static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
 		 */
 		dl_rq->earliest_dl.next = dl_rq->earliest_dl.curr;
 		dl_rq->earliest_dl.curr = deadline;
+		cpudl_set(&rq->rd->cpudl, rq->cpu, deadline, 1);
 	} else if (dl_rq->earliest_dl.next == 0 ||
 		   dl_time_before(deadline, dl_rq->earliest_dl.next)) {
 		/*
@@ -663,6 +666,7 @@ static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
 	if (!dl_rq->dl_nr_running) {
 		dl_rq->earliest_dl.curr = 0;
 		dl_rq->earliest_dl.next = 0;
+		cpudl_set(&rq->rd->cpudl, rq->cpu, 0, 0);
 	} else {
 		struct rb_node *leftmost = dl_rq->rb_leftmost;
 		struct sched_dl_entity *entry;
@@ -670,6 +674,7 @@ static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
 		entry = rb_entry(leftmost, struct sched_dl_entity, rb_node);
 		dl_rq->earliest_dl.curr = entry->deadline;
 		dl_rq->earliest_dl.next = next_deadline(rq);
+		cpudl_set(&rq->rd->cpudl, rq->cpu, entry->deadline, 1);
 	}
 }
 
@@ -855,9 +860,6 @@ static void yield_task_dl(struct rq *rq)
 #ifdef CONFIG_SMP
 
 static int find_later_rq(struct task_struct *task);
-static int latest_cpu_find(struct cpumask *span,
-			   struct task_struct *task,
-			   struct cpumask *later_mask);
 
 static int
 select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
@@ -904,7 +906,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
 	 * let's hope p can move out.
 	 */
 	if (rq->curr->nr_cpus_allowed == 1 ||
-	    latest_cpu_find(rq->rd->span, rq->curr, NULL) == -1)
+	    cpudl_find(&rq->rd->cpudl, rq->curr, NULL) == -1)
 		return;
 
 	/*
@@ -912,7 +914,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
 	 * see if it is pushed or pulled somewhere else.
 	 */
 	if (p->nr_cpus_allowed != 1 &&
-	    latest_cpu_find(rq->rd->span, p, NULL) != -1)
+	    cpudl_find(&rq->rd->cpudl, p, NULL) != -1)
 		return;
 
 	resched_task(rq->curr);
@@ -1085,39 +1087,6 @@ next_node:
 	return NULL;
 }
 
-static int latest_cpu_find(struct cpumask *span,
-			   struct task_struct *task,
-			   struct cpumask *later_mask)
-{
-	const struct sched_dl_entity *dl_se = &task->dl;
-	int cpu, found = -1, best = 0;
-	u64 max_dl = 0;
-
-	for_each_cpu(cpu, span) {
-		struct rq *rq = cpu_rq(cpu);
-		struct dl_rq *dl_rq = &rq->dl;
-
-		if (cpumask_test_cpu(cpu, &task->cpus_allowed) &&
-		    (!dl_rq->dl_nr_running || dl_time_before(dl_se->deadline,
-		     dl_rq->earliest_dl.curr))) {
-			if (later_mask)
-				cpumask_set_cpu(cpu, later_mask);
-			if (!best && !dl_rq->dl_nr_running) {
-				best = 1;
-				found = cpu;
-			} else if (!best &&
-				   dl_time_before(max_dl,
-						  dl_rq->earliest_dl.curr)) {
-				max_dl = dl_rq->earliest_dl.curr;
-				found = cpu;
-			}
-		} else if (later_mask)
-			cpumask_clear_cpu(cpu, later_mask);
-	}
-
-	return found;
-}
-
 static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
 
 static int find_later_rq(struct task_struct *task)
@@ -1134,7 +1103,8 @@ static int find_later_rq(struct task_struct *task)
 	if (task->nr_cpus_allowed == 1)
 		return -1;
 
-	best_cpu = latest_cpu_find(task_rq(task)->rd->span, task, later_mask);
+	best_cpu = cpudl_find(&task_rq(task)->rd->cpudl,
+			task, later_mask);
 	if (best_cpu == -1)
 		return -1;
 
@@ -1510,6 +1480,9 @@ static void rq_online_dl(struct rq *rq)
 {
 	if (rq->dl.overloaded)
 		dl_set_overload(rq);
+
+	if (rq->dl.dl_nr_running > 0)
+		cpudl_set(&rq->rd->cpudl, rq->cpu, rq->dl.earliest_dl.curr, 1);
 }
 
 /* Assumes rq->lock is held */
@@ -1517,6 +1490,8 @@ static void rq_offline_dl(struct rq *rq)
 {
 	if (rq->dl.overloaded)
 		dl_clear_overload(rq);
+
+	cpudl_set(&rq->rd->cpudl, rq->cpu, 0, 0);
 }
 
 void init_sched_dl_class(void)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ad4f4fb..2b7421d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -10,6 +10,7 @@
 #include <linux/slab.h>
 
 #include "cpupri.h"
+#include "cpudeadline.h"
 #include "cpuacct.h"
 
 struct rq;
@@ -503,6 +504,7 @@ struct root_domain {
 	cpumask_var_t dlo_mask;
 	atomic_t dlo_count;
 	struct dl_bw dl_bw;
+	struct cpudl cpudl;
 
 	/*
 	 * The "RT overload" flag: it gets set if a CPU has more than

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [RFC][PATCH] sched: Move SCHED_RESET_ON_FORK into attr::sched_flags
  2014-01-13 15:53   ` [tip:sched/core] sched: Add new scheduler syscalls to support an extended scheduling parameters ABI tip-bot for Dario Faggioli
@ 2014-01-15 16:22     ` Peter Zijlstra
  2014-01-16 13:40       ` [tip:sched/core] sched: Move SCHED_RESET_ON_FORK into attr:: sched_flags tip-bot for Peter Zijlstra
  2014-01-17 17:29     ` [tip:sched/core] sched: Add new scheduler syscalls to support an extended scheduling parameters ABI Stephen Warren
  1 sibling, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2014-01-15 16:22 UTC (permalink / raw)
  To: mingo, hpa, linux-kernel, tglx, raistlin, juri.lelli; +Cc: linux-tip-commits


Does something like the below make sense to you guys?

---
Subject: sched: Move SCHED_RESET_ON_FORK into attr::sched_flags
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed Jan 15 17:05:04 CET 2014

I noticed the new sched_{set,get}attr() calls didn't properly deal
with the SCHED_RESET_ON_FORK hack.

Instead of propagating the flags in high bits nonsense use the brand
spanking new attr::sched_flags field.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-h96ck1ivcs35jbpcnuwyu5fr@git.kernel.org
---
 include/uapi/linux/sched.h |    5 +++++
 kernel/sched/core.c        |   42 ++++++++++++++++++++++++++++--------------
 2 files changed, 33 insertions(+), 14 deletions(-)

--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -40,8 +40,13 @@
 /* SCHED_ISO: reserved but not implemented yet */
 #define SCHED_IDLE		5
 #define SCHED_DEADLINE		6
+
 /* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
 #define SCHED_RESET_ON_FORK     0x40000000
 
+/*
+ * For the sched_{set,get}attr() calls
+ */
+#define SCHED_FLAG_RESET_ON_FORK	0x01
 
 #endif /* _UAPI_LINUX_SCHED_H */
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3267,8 +3267,7 @@ static int __sched_setscheduler(struct t
 		reset_on_fork = p->sched_reset_on_fork;
 		policy = oldpolicy = p->policy;
 	} else {
-		reset_on_fork = !!(policy & SCHED_RESET_ON_FORK);
-		policy &= ~SCHED_RESET_ON_FORK;
+		reset_on_fork = !!(attr->sched_flags & SCHED_FLAG_RESET_ON_FORK);
 
 		if (policy != SCHED_DEADLINE &&
 				policy != SCHED_FIFO && policy != SCHED_RR &&
@@ -3277,6 +3276,9 @@ static int __sched_setscheduler(struct t
 			return -EINVAL;
 	}
 
+	if (attr->sched_flags & ~(SCHED_FLAG_RESET_ON_FORK))
+		return -EINVAL;
+
 	/*
 	 * Valid priorities for SCHED_FIFO and SCHED_RR are
 	 * 1..MAX_USER_RT_PRIO-1, valid priority for SCHED_NORMAL,
@@ -3444,6 +3446,26 @@ static int __sched_setscheduler(struct t
 	return 0;
 }
 
+static int _sched_setscheduler(struct task_struct *p, int policy,
+			       const struct sched_param *param, bool check)
+{
+	struct sched_attr attr = {
+		.sched_policy   = policy,
+		.sched_priority = param->sched_priority,
+		.sched_nice	= PRIO_TO_NICE(p->static_prio),
+	};
+
+	/*
+	 * Fixup the legacy SCHED_RESET_ON_FORK hack
+	 */
+	if (policy & SCHED_RESET_ON_FORK) {
+		attr.sched_flags |= SCHED_FLAG_RESET_ON_FORK;
+		policy &= ~SCHED_RESET_ON_FORK;
+		attr.sched_policy = policy;
+	}
+
+	return __sched_setscheduler(p, &attr, check);
+}
 /**
  * sched_setscheduler - change the scheduling policy and/or RT priority of a thread.
  * @p: the task in question.
@@ -3457,12 +3479,7 @@ static int __sched_setscheduler(struct t
 int sched_setscheduler(struct task_struct *p, int policy,
 		       const struct sched_param *param)
 {
-	struct sched_attr attr = {
-		.sched_policy   = policy,
-		.sched_priority = param->sched_priority,
-		.sched_nice	= PRIO_TO_NICE(p->static_prio),
-	};
-	return __sched_setscheduler(p, &attr, true);
+	return _sched_setscheduler(p, policy, param, true);
 }
 EXPORT_SYMBOL_GPL(sched_setscheduler);
 
@@ -3488,12 +3505,7 @@ EXPORT_SYMBOL_GPL(sched_setattr);
 int sched_setscheduler_nocheck(struct task_struct *p, int policy,
 			       const struct sched_param *param)
 {
-	struct sched_attr attr = {
-		.sched_policy   = policy,
-		.sched_priority = param->sched_priority,
-		.sched_nice	= PRIO_TO_NICE(p->static_prio),
-	};
-	return __sched_setscheduler(p, &attr, false);
+	return _sched_setscheduler(p, policy, param, false);
 }
 
 static int
@@ -3793,6 +3805,8 @@ SYSCALL_DEFINE3(sched_getattr, pid_t, pi
 		goto out_unlock;
 
 	attr.sched_policy = p->policy;
+	if (p->sched_reset_on_fork)
+		attr.sched_flags |= SCHED_FLAG_RESET_ON_FORK;
 	if (task_has_dl_policy(p))
 		__getparam_dl(p, &attr);
 	else if (task_has_rt_policy(p))

^ permalink raw reply	[flat|nested] 81+ messages in thread

* [tip:sched/core] sched: Move SCHED_RESET_ON_FORK into attr:: sched_flags
  2014-01-15 16:22     ` [RFC][PATCH] sched: Move SCHED_RESET_ON_FORK into attr::sched_flags Peter Zijlstra
@ 2014-01-16 13:40       ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 81+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-01-16 13:40 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, peterz, raistlin, tglx, juri.lelli

Commit-ID:  7479f3c9cf67edf5e8a76b21ea3726757f35cf53
Gitweb:     http://git.kernel.org/tip/7479f3c9cf67edf5e8a76b21ea3726757f35cf53
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 15 Jan 2014 17:05:04 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 16 Jan 2014 09:27:17 +0100

sched: Move SCHED_RESET_ON_FORK into attr::sched_flags

I noticed the new sched_{set,get}attr() calls didn't properly deal
with the SCHED_RESET_ON_FORK hack.

Instead of propagating the flags in high bits nonsense use the brand
spanking new attr::sched_flags field.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Dario Faggioli <raistlin@linux.it>
Link: http://lkml.kernel.org/r/20140115162242.GJ31570@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/uapi/linux/sched.h |  5 +++++
 kernel/sched/core.c        | 42 ++++++++++++++++++++++++++++--------------
 2 files changed, 33 insertions(+), 14 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 2d5e49a..34f9d73 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -40,8 +40,13 @@
 /* SCHED_ISO: reserved but not implemented yet */
 #define SCHED_IDLE		5
 #define SCHED_DEADLINE		6
+
 /* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
 #define SCHED_RESET_ON_FORK     0x40000000
 
+/*
+ * For the sched_{set,get}attr() calls
+ */
+#define SCHED_FLAG_RESET_ON_FORK	0x01
 
 #endif /* _UAPI_LINUX_SCHED_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5a6ccdf..93a2836 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3267,8 +3267,7 @@ recheck:
 		reset_on_fork = p->sched_reset_on_fork;
 		policy = oldpolicy = p->policy;
 	} else {
-		reset_on_fork = !!(policy & SCHED_RESET_ON_FORK);
-		policy &= ~SCHED_RESET_ON_FORK;
+		reset_on_fork = !!(attr->sched_flags & SCHED_FLAG_RESET_ON_FORK);
 
 		if (policy != SCHED_DEADLINE &&
 				policy != SCHED_FIFO && policy != SCHED_RR &&
@@ -3277,6 +3276,9 @@ recheck:
 			return -EINVAL;
 	}
 
+	if (attr->sched_flags & ~(SCHED_FLAG_RESET_ON_FORK))
+		return -EINVAL;
+
 	/*
 	 * Valid priorities for SCHED_FIFO and SCHED_RR are
 	 * 1..MAX_USER_RT_PRIO-1, valid priority for SCHED_NORMAL,
@@ -3443,6 +3445,26 @@ change:
 	return 0;
 }
 
+static int _sched_setscheduler(struct task_struct *p, int policy,
+			       const struct sched_param *param, bool check)
+{
+	struct sched_attr attr = {
+		.sched_policy   = policy,
+		.sched_priority = param->sched_priority,
+		.sched_nice	= PRIO_TO_NICE(p->static_prio),
+	};
+
+	/*
+	 * Fixup the legacy SCHED_RESET_ON_FORK hack
+	 */
+	if (policy & SCHED_RESET_ON_FORK) {
+		attr.sched_flags |= SCHED_FLAG_RESET_ON_FORK;
+		policy &= ~SCHED_RESET_ON_FORK;
+		attr.sched_policy = policy;
+	}
+
+	return __sched_setscheduler(p, &attr, check);
+}
 /**
  * sched_setscheduler - change the scheduling policy and/or RT priority of a thread.
  * @p: the task in question.
@@ -3456,12 +3478,7 @@ change:
 int sched_setscheduler(struct task_struct *p, int policy,
 		       const struct sched_param *param)
 {
-	struct sched_attr attr = {
-		.sched_policy   = policy,
-		.sched_priority = param->sched_priority,
-		.sched_nice	= PRIO_TO_NICE(p->static_prio),
-	};
-	return __sched_setscheduler(p, &attr, true);
+	return _sched_setscheduler(p, policy, param, true);
 }
 EXPORT_SYMBOL_GPL(sched_setscheduler);
 
@@ -3487,12 +3504,7 @@ EXPORT_SYMBOL_GPL(sched_setattr);
 int sched_setscheduler_nocheck(struct task_struct *p, int policy,
 			       const struct sched_param *param)
 {
-	struct sched_attr attr = {
-		.sched_policy   = policy,
-		.sched_priority = param->sched_priority,
-		.sched_nice	= PRIO_TO_NICE(p->static_prio),
-	};
-	return __sched_setscheduler(p, &attr, false);
+	return _sched_setscheduler(p, policy, param, false);
 }
 
 static int
@@ -3792,6 +3804,8 @@ SYSCALL_DEFINE3(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
 		goto out_unlock;
 
 	attr.sched_policy = p->policy;
+	if (p->sched_reset_on_fork)
+		attr.sched_flags |= SCHED_FLAG_RESET_ON_FORK;
 	if (task_has_dl_policy(p))
 		__getparam_dl(p, &attr);
 	else if (task_has_rt_policy(p))

^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [tip:sched/core] sched: Add new scheduler syscalls to support an extended scheduling parameters ABI
  2014-01-13 15:53   ` [tip:sched/core] sched: Add new scheduler syscalls to support an extended scheduling parameters ABI tip-bot for Dario Faggioli
  2014-01-15 16:22     ` [RFC][PATCH] sched: Move SCHED_RESET_ON_FORK into attr::sched_flags Peter Zijlstra
@ 2014-01-17 17:29     ` Stephen Warren
  2014-01-17 18:04       ` Stephen Warren
  1 sibling, 1 reply; 81+ messages in thread
From: Stephen Warren @ 2014-01-17 17:29 UTC (permalink / raw)
  To: mingo, hpa, linux-kernel, peterz, tglx, raistlin, juri.lelli,
	linux-tip-commits

On 01/13/2014 08:53 AM, tip-bot for Dario Faggioli wrote:
> Commit-ID:  d50dde5a10f305253cbc3855307f608f8a3c5f73
> Gitweb:     http://git.kernel.org/tip/d50dde5a10f305253cbc3855307f608f8a3c5f73
> Author:     Dario Faggioli <raistlin@linux.it>
> AuthorDate: Thu, 7 Nov 2013 14:43:36 +0100
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Mon, 13 Jan 2014 13:41:04 +0100
> 
> sched: Add new scheduler syscalls to support an extended scheduling parameters ABI
> 
> Add the syscalls needed for supporting scheduling algorithms
> with extended scheduling parameters (e.g., SCHED_DEADLINE).

I'm seeing a regression in next-20140116 on my system, and "git bisect"
points at this patch. I noticed the fix patch "sched: Fix up scheduler
syscall LTP fails", but applying that doesn't solve the problem.

The symptoms are that when the system boots, I see the login prompt over
the serial port OK, but after I've entered my user/password, the login
process simply hangs. CTRL-C drops me back to the login prompt.

Do you have any hints how I should debug this?

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [tip:sched/core] sched: Add new scheduler syscalls to support an extended scheduling parameters ABI
  2014-01-17 17:29     ` [tip:sched/core] sched: Add new scheduler syscalls to support an extended scheduling parameters ABI Stephen Warren
@ 2014-01-17 18:04       ` Stephen Warren
  0 siblings, 0 replies; 81+ messages in thread
From: Stephen Warren @ 2014-01-17 18:04 UTC (permalink / raw)
  To: mingo, hpa, linux-kernel, peterz, tglx, raistlin, juri.lelli,
	linux-tip-commits

On 01/17/2014 10:29 AM, Stephen Warren wrote:
> On 01/13/2014 08:53 AM, tip-bot for Dario Faggioli wrote:
>> Commit-ID:  d50dde5a10f305253cbc3855307f608f8a3c5f73
>> Gitweb:     http://git.kernel.org/tip/d50dde5a10f305253cbc3855307f608f8a3c5f73
>> Author:     Dario Faggioli <raistlin@linux.it>
>> AuthorDate: Thu, 7 Nov 2013 14:43:36 +0100
>> Committer:  Ingo Molnar <mingo@kernel.org>
>> CommitDate: Mon, 13 Jan 2014 13:41:04 +0100
>>
>> sched: Add new scheduler syscalls to support an extended scheduling parameters ABI
>>
>> Add the syscalls needed for supporting scheduling algorithms
>> with extended scheduling parameters (e.g., SCHED_DEADLINE).
> 
> I'm seeing a regression in next-20140116 on my system, and "git bisect"
> points at this patch. I noticed the fix patch "sched: Fix up scheduler
> syscall LTP fails", but applying that doesn't solve the problem.

Never mind; additional searching revealed "sched: Fix
__sched_setscheduler() nice test", which does fix the issue for me.

^ permalink raw reply	[flat|nested] 81+ messages in thread

end of thread, other threads:[~2014-01-17 18:04 UTC | newest]

Thread overview: 81+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-11-07 13:43 [PATCH 00/14] sched: SCHED_DEADLINE v9 Juri Lelli
2013-11-07 13:43 ` [PATCH 01/14] sched: add sched_class->task_dead Juri Lelli
2013-11-12  4:17   ` Paul Turner
2013-11-12 17:19   ` Steven Rostedt
2013-11-12 17:53     ` Juri Lelli
2013-11-27 14:10   ` [tip:sched/core] sched: Add sched_class->task_dead() method tip-bot for Dario Faggioli
2013-11-07 13:43 ` [PATCH 02/14] sched: add extended scheduling interface Juri Lelli
2013-11-12 17:23   ` Steven Rostedt
2013-11-13  8:43     ` Juri Lelli
2013-11-12 17:32   ` Steven Rostedt
2013-11-13  9:07     ` Juri Lelli
2013-11-27 13:23   ` [PATCH 02/14] sched: add extended scheduling interface. (new ABI) Ingo Molnar
2013-11-27 13:30     ` Peter Zijlstra
2013-11-27 14:01       ` Ingo Molnar
2013-11-27 14:13         ` Peter Zijlstra
2013-11-27 14:17           ` Ingo Molnar
2013-11-28 11:14             ` Juri Lelli
2013-11-28 11:28               ` Peter Zijlstra
2013-11-30 14:06                 ` Ingo Molnar
2013-12-03 16:13                   ` Juri Lelli
2013-12-03 16:41                     ` Steven Rostedt
2013-12-03 17:04                       ` Juri Lelli
2014-01-13 15:53   ` [tip:sched/core] sched: Add new scheduler syscalls to support an extended scheduling parameters ABI tip-bot for Dario Faggioli
2014-01-15 16:22     ` [RFC][PATCH] sched: Move SCHED_RESET_ON_FORK into attr::sched_flags Peter Zijlstra
2014-01-16 13:40       ` [tip:sched/core] sched: Move SCHED_RESET_ON_FORK into attr:: sched_flags tip-bot for Peter Zijlstra
2014-01-17 17:29     ` [tip:sched/core] sched: Add new scheduler syscalls to support an extended scheduling parameters ABI Stephen Warren
2014-01-17 18:04       ` Stephen Warren
2013-11-07 13:43 ` [PATCH 03/14] sched: SCHED_DEADLINE structures & implementation Juri Lelli
2013-11-13  2:31   ` Steven Rostedt
2013-11-13  9:54     ` Juri Lelli
2013-11-20 20:23   ` Steven Rostedt
2013-11-21 14:15     ` Juri Lelli
2014-01-13 15:53   ` [tip:sched/core] sched/deadline: Add " tip-bot for Dario Faggioli
2013-11-07 13:43 ` [PATCH 04/14] sched: SCHED_DEADLINE SMP-related data structures & logic Juri Lelli
2013-11-20 18:51   ` Steven Rostedt
2013-11-21 14:13     ` Juri Lelli
2013-11-21 14:41       ` Steven Rostedt
2013-11-21 16:08       ` Paul E. McKenney
2013-11-21 16:16         ` Juri Lelli
2013-11-21 16:26           ` Paul E. McKenney
2013-11-21 16:47             ` Steven Rostedt
2013-11-21 19:38               ` Paul E. McKenney
2014-01-13 15:53   ` [tip:sched/core] sched/deadline: Add " tip-bot for Juri Lelli
2013-11-07 13:43 ` [PATCH 05/14] sched: SCHED_DEADLINE avg_update accounting Juri Lelli
2014-01-13 15:53   ` [tip:sched/core] sched/deadline: Add " tip-bot for Dario Faggioli
2013-11-07 13:43 ` [PATCH 06/14] sched: add period support for -deadline tasks Juri Lelli
2014-01-13 15:53   ` [tip:sched/core] sched/deadline: Add period support for SCHED_DEADLINE tasks tip-bot for Harald Gustafsson
2013-11-07 13:43 ` [PATCH 07/14] sched: add schedstats for -deadline tasks Juri Lelli
2013-11-07 13:43 ` [PATCH 08/14] sched: add latency tracing " Juri Lelli
2013-11-20 21:33   ` Steven Rostedt
2013-11-27 13:43     ` Juri Lelli
2013-11-27 14:16       ` Steven Rostedt
2013-11-27 14:19         ` Juri Lelli
2013-11-27 14:26         ` Peter Zijlstra
2013-11-27 14:34           ` Ingo Molnar
2013-11-27 14:58             ` Peter Zijlstra
2013-11-27 15:35               ` Ingo Molnar
2013-11-27 15:40                 ` Peter Zijlstra
2013-11-27 15:46                   ` Ingo Molnar
2013-11-27 15:54                     ` Peter Zijlstra
2013-11-27 15:56                     ` Steven Rostedt
2013-11-27 16:01                       ` Peter Zijlstra
2013-11-27 16:02                       ` Steven Rostedt
2013-11-27 16:13                       ` Ingo Molnar
2013-11-27 16:33                         ` Steven Rostedt
2013-11-27 16:24                   ` Oleg Nesterov
2013-11-27 15:42               ` Ingo Molnar
2013-11-27 15:00           ` Steven Rostedt
2014-01-13 15:54   ` [tip:sched/core] sched/deadline: Add latency tracing for SCHED_DEADLINE tasks tip-bot for Dario Faggioli
2013-11-07 13:43 ` [PATCH 09/14] rtmutex: turn the plist into an rb-tree Juri Lelli
2013-11-21  3:07   ` Steven Rostedt
2013-11-21 17:52   ` [PATCH] rtmutex: Fix compare of waiter prio and task prio Steven Rostedt
2013-11-22 10:37     ` Juri Lelli
2014-01-13 15:54   ` [tip:sched/core] rtmutex: Turn the plist into an rb-tree tip-bot for Peter Zijlstra
2013-11-07 13:43 ` [PATCH 10/14] sched: drafted deadline inheritance logic Juri Lelli
2014-01-13 15:54   ` [tip:sched/core] sched/deadline: Add SCHED_DEADLINE " tip-bot for Dario Faggioli
2013-11-07 13:43 ` [PATCH 11/14] sched: add bandwidth management for sched_dl Juri Lelli
2014-01-13 15:54   ` [tip:sched/core] sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks tip-bot for Dario Faggioli
2013-11-07 13:43 ` [PATCH 12/14] sched: make dl_bw a sub-quota of rt_bw Juri Lelli
2013-11-07 13:43 ` [PATCH 13/14] sched: speed up -dl pushes with a push-heap Juri Lelli
2014-01-13 15:54   ` [tip:sched/core] sched/deadline: speed up SCHED_DEADLINE " tip-bot for Juri Lelli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).